One of the defining features of UNIX and UNIX-like operating systems is that "everything is a file" [1].* Documents, directories, links, input and output devices, and so on are all just sinks for, or sources of, streams of bytes from the point of view of the OS [2]. You can verify that a directory is just a special kind of file with the command:
$ view /etc
...which will show the contents of the /etc file (directory). Note that view is shorthand for running vi in read-only mode, vi -R [7], so you can exit it by typing :q!. Be extremely careful when opening directories as files -- you can cause serious damage to your system if you accidentally edit something you shouldn't have. On a Linux file system, the /dev directory contains files that represent devices, /proc contains special files that represent process and system information, and so on:
$ ls -l /dev/std*
lrwxrwxrwx 1 root 15 Nov 3 2017 /dev/stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root 15 Nov 3 2017 /dev/stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root 15 Nov 3 2017 /dev/stdout -> /proc/self/fd/1
$ cat /proc/uptime # shows current system uptime
36710671.21 1127406622.14
$ cat /dev/random # returns randomly-generated bytes
{IGnh▒I侨▒▒Ұ>j▒L▒▒%▒=▒▒U@▒Q▒2▒;▒l▒q▒$▒r▒1▒U...
Other common /dev files include /dev/zero, which produces a constant stream of zero bytes, and /dev/null, which accepts all input and does nothing with it (think of it like a rubbish bin) [8]. Note that these special "files" may show the same information each time you open them, or they may be constantly changing, reflecting the current state of the system or the output of some continuously running process.
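You can see both of these in action with a couple of standard commands (od is used here only to make the zero bytes visible):
$ head -c 8 /dev/zero | od -c   # read eight bytes from /dev/zero and print them
0000000  \0  \0  \0  \0  \0  \0  \0  \0
0000010
$ echo "discard me" > /dev/null # anything written to /dev/null simply vanishes
$ cat /dev/null                 # and reading it back gives nothing -- it's always empty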
But why is everything a file in UNIX?
As Dennis Ritchie and Ken Thompson outlined in their 1974 Communications of the ACM paper, "The UNIX Time-Sharing System", there are three main advantages to the "everything is a file" approach:
- file and device I/O are as similar as possible
- file and device names have the same syntax and meaning, so that a program expecting a file name as a parameter can be passed a device name (illustrated below)
- special files [/dev/random, etc.] are subject to the same protection mechanism as regular files
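To illustrate the second point: grep doesn't care whether the file name it is given refers to a regular file or to a device, so both of these count the lines containing "root":
$ grep -c root /etc/passwd               # a regular file
$ ls -l /dev | grep -c root /dev/stdin   # the same program, passed a device name instead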
The advantages of this approach are illustrated in this 1982 Bell Labs video, in which Brian Kernighan, creator of the AWK programming language and co-creator of UNIX, describes how programs can be pipelined to reduce unnecessary code repetition (through modularisation) and increase flexibility (jump to 05:30):
$ makewords sentence | lowercase | sort | unique | mismatch
The code Kernighan types is reproduced above. In it, makewords splits the text file sentence into words (separated by whitespace characters) and returns one word per line. makewords was passed a file as input and its output would normally be sent to the terminal, but we've piped it (using |) as input to the next program, lowercase. lowercase processes each line in turn, converting all uppercase characters to lowercase, and then pipes its output to sort, which sorts the list of words alphabetically. sort pipes its output to unique, which removes duplicate words from the list, and sends its output to mismatch, which checks the list of unique, all-lowercase words against a dictionary file. Any misspelled words (words not appearing in the dictionary) are then printed to the terminal by default. By connecting these five separate programs, we've easily created a brand new tool which spell-checks a file.
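Kernighan's makewords, lowercase, unique and mismatch were demonstration programs rather than standard utilities, but a rough modern equivalent of the same pipeline can be sketched with common tools (assuming a word list exists at /usr/share/dict/words, and ignoring punctuation and proper-noun edge cases):
$ tr -cs '[:alpha:]' '\n' < sentence | tr '[:upper:]' '[:lower:]' | sort -u | grep -vxFf /usr/share/dict/words
Here the first tr stands in for makewords (everything that isn't a letter becomes a line break), the second tr plays the role of lowercase, sort -u does the work of sort and unique together, and grep -vxF prints only the lines that don't exactly match any word in the dictionary file, as mismatch did.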
Note that input can come from regular files, but also special files like disks and other input devices like the terminal itself. Output can be sent to the terminal, or to other programs, or written to the disk as files. This ability to pipeline together functions, treating files and disks and special I/O devices identically, greatly increases the power and flexibility of UNIX relative to other systems which treat these things differently from one another.
* This is more correctly written as "everything is a file descriptor or a process" [3]. (File descriptors are also sometimes called "handles".) File descriptors are simply non-negative indices which refer to files, directories, input and output devices, etc. [5] The file descriptors of stdin, stdout, and stderr are 0, 1, and 2, respectively. This is why, when we want to suppress error output, we redirect it (with the redirection operator > [4]) to /dev/null with:
$ command 2>/dev/null
We can send all output (stdout and stderr) of a particular command as input to another command with 2>&1 |, or, more simply, |& [6]:
$ command1 2>&1 | command2
$ command1 |& command2
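To see the two streams being handled separately, try a command that produces both normal output and an error (assuming /etc/hostname exists on your system and /no/such/file does not):
$ ls /etc/hostname /no/such/file              # one line on stdout, one error message on stderr
$ ls /etc/hostname /no/such/file 2>/dev/null  # errors (fd 2) discarded; normal output remains
$ ls /etc/hostname /no/such/file >/dev/null   # normal output (fd 1) discarded; only the error remains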
The OS maps file descriptors to the actual data on disk via the inode (index node), the structure that holds a file's metadata and the locations of its data blocks [10].
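You can inspect this mapping yourself: ls -i prints a file's inode number next to its name, and GNU stat can report the inode along with the number of directory entries (hard links) that point at it (the flags differ on BSD/macOS):
$ ls -i /etc/hostname                                # inode number, then the file name
$ stat -c 'inode: %i  hard links: %h' /etc/hostname  # same inode, plus the hard-link count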
Processes [9] are separate from file descriptors. A process is an instance of a program which is currently being executed. A process contains an image (read-only copy) of the code to be executed, some memory and heap space to use during execution, relevant file descriptors, and so on. Processes also have their own separate indexing system (process ids, or pids).
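On Linux, the two ideas meet under /proc: each process gets a directory named after its pid, and the fd subdirectory inside it lists the file descriptors that process currently holds. For the current shell ($$ expands to its own pid):
$ echo $$            # the shell's process id
$ ls -l /proc/$$/fd  # its open file descriptors; 0, 1 and 2 are typically symlinks to the terminal device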
Related:
Introduction to Linux (2008), Machtelt Garrels
An Introduction to UNIX Processes (2014), Brian Storti
Ghosts of UNIX Past: A Historical Search for Design Patterns (2010), Neil Brown
unix-history-repo (continuous UNIX commit history from 1970 until today)
Top comments (4)
Great article. Kernighan's example reminds me of Doug McIlroy's code review of Donald Knuth where he replaces a large Pascal program with a shell one-liner.
The related section is very interesting.
Within the unix-history-repo, there's a link to this visualisation, which I think is pretty cool. For anyone interested, the visualization is created using Gource.
Beautiful article beautifully written 👍