Fun with manpages

When reading about a Unix command or C library function it's relatively common to see it suffixed with a number in brackets. This is to make it clear what exactly you're talking about, so if someone is discussing mknod they might write mknod(1) they're talking about the shell command or mknod(2) if they mean the syscall. The number used refers to the manpage section, so to see the manpage for the shell function:

$ man 1 mknod

And to see what the syscall is all about:

$ man 2 mknod

According the man manpage on my system there are eight standard manpage sections in total, with one non-standard section:

  1. Executable programs or shell commands
  2. System calls (functions provided by the kernel)
  3. Library calls (functions within program libraries)
  4. Special files (usually found in /dev)
  5. File formats and conventions eg /etc/passwd
  6. Games
  7. Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
  8. System administration commands (usually only for root)
  9. Kernel routines [Non standard]
Since something can be present in more than one section, I wondered which symbol had the most manpages so I wrote a script to look through each of the directories in /usr/share/man/man[1-8], list and parse the gzipped filenames (they're usually named symbol.sectionnumber.gz) and then find out the sections they're all present in:
import os
import re
from collections import defaultdict

manpage_gz_pattern = "(.*)\.\w+.gz"
manpage_dir_base = "/usr/share/man"
manpage_sections = range(1, 9)
manpage_entries = defaultdict(list)

for manpage_section in manpage_sections:
    manpage_section_dir = os.path.join(manpage_dir_base, f"man{str(manpage_section)}")
    manpage_section_contents = os.listdir(manpage_section_dir)

    for manpage_entry_filename in manpage_section_contents:
        gz_entry = re.match(manpage_gz_pattern, manpage_entry_filename)
        manpage_entry = gz_entry.groups()[0] 
        manpage_entries[manpage_entry] += [(manpage_section, manpage_entry_filename)]

for section_count in manpage_sections:
    number_of_manpages = len([ m for m in manpage_entries if len(manpage_entries[m]) == section_count])
    print(f"number of manpages in {section_count} sections: {number_of_manpages}")

The results are:
$ python -i
number of manpages in 1 sections: 10763
number of manpages in 2 sections: 107
number of manpages in 3 sections: 7
number of manpages in 4 sections: 0
number of manpages in 5 sections: 0
number of manpages in 6 sections: 0
number of manpages in 7 sections: 0
number of manpages in 8 sections: 1
There's seemingly a clear winner, you can find a single symbol in all eight standard manpage sections. However this is a little misleading because after a bit of inspection this symbol is "intro" - it is not a shell command, syscall, stdlib function, game or anything like that - it's a manpage that describes a bit about each section.

So ignoring intro the most common symbols and their manpage entries are
  • mdoc (mdoc.1.gz, mdoc.5.gz, mdoc.7.gz)
  • locale (locale.1.gz, locale.5.gz, locale.7.gz)
  • hostname (hostname.1.gz, hostname.5.gz, hostname.7.gz)
  • passwd (passwd.1ssl.gz, passwd.1.gz, passwd.5.gz)
  • time (time.2.gz, time.3am.gz, time.7.gz)
  • readdir (readdir.2.gz, readdir.3am.gz, readdir.3.gz)
  • random (random.3.gz, random.4.gz, random.7.gz)
This reveals something else interesting - the section needn't be a number. The two commands are both valid and access completely separate manpages:

$ man 1ssl passwd
$ man 1 passwd

I used to think that each time I opened a manpage I learn something completely new and unexpected - but I never thought I'd find something interesting just by looking at the manpages' gzipped filenames!