grep: An Hour at Bell Labs in 1973

#grep #unixtools #freebsd #regex

Technical Beauty — Episode 36

A development server holds a mystery. Someone deployed something, the logs hold the truth. One types grep -i timeout /var/log/messages and three lines admit what happened. The command was unremarkable. The thing that made it possible has been answering that question since November 1973, and the architecture inside has held that whole time.

The Hour

The story is well-known in Unix circles but bears retelling because it remains the cleanest example of what a productive afternoon at Bell Labs in 1973 looked like.

Doug McIlroy, who ran the research department at Bell Labs and is the man behind the Unix pipeline as a concept, came to Ken Thompson with a request. Lee E. McMahon, also at Bell Labs, wanted to analyse the text of the Federalist Papers by pattern. The need was specific and mundane: search a body of text for occurrences of certain words and word-classes. The tools available were ed's g/re/p command — global, regular-expression, print — which printed every line of a file matching a regex. It worked, but only inside ed.

McIlroy asked Thompson to extract the command into a standalone tool. Thompson, by Brian Kernighan's later telling (in "The Unix Programming Environment" and elsewhere), disappeared into his office for about an hour and came back with grep. The name was the ed command turned into a single word. The interface was the same regex language that ed users already typed daily.

That hour produced a tool that, fifty-three years later, ships in the base system of every Unix-derived operating system in production. It is also, on any reasonable telling, the tool that taught a generation of working programmers what regular expressions were for.

The Surface

The surface of grep is famously austere.

grep pattern files

One pattern, n files, print the matching lines. The pattern is a regular expression in the grep dialect (POSIX BRE by default, POSIX ERE with -E, fixed strings with -F, Perl-compatible with -P where the implementation includes it). Three flags carry the bulk of daily use:

grep ERROR /var/log/messages              # case-sensitive, basic regex
grep -i timeout /var/log/messages         # case-insensitive
grep -v INFO /var/log/messages            # invert: print non-matching
grep -r "TODO" src/                       # recurse into directories
ps aux | grep -v grep | grep nginx        # composed through a pipe

The output is text on stdout: matching lines, one per line, optionally preceded by filename and line number with -n -H. The compositional fit with pipes is what made grep central to Unix from its first day. grep | awk | sort | uniq is not a script someone wrote; it is a sentence in the language Unix happens to be.

Over the years the variant tools merged. egrep (extended regex with +, ?, alternation) became grep -E. fgrep (fixed strings, no regex parsing) became grep -F. pcregrep (Perl-compatible regex with lookahead and named groups) became grep -P on implementations that include it. One tool, four dialects, single calling convention.

The Layers Underneath

The implementation is the part that does not look modest once you look at it.

Naïve regex matching is slow. The straightforward approach (backtracking through the pattern tree at every position in the input) has worst-case runtime exponential in the pattern length for certain patterns. Production grep cannot afford that.

The interior of GNU grep, the most widely deployed implementation, was carried for many years by Mike Haertel (afaik the main author from the late 1980s through the 1990s and into the 2000s). His "why GNU grep is fast" note, posted to the FreeBSD mailing list in August 2010, is one of the cleanest pieces of performance-engineering writing one will encounter. He states the principle: "The key to making programs fast is to make them do practically nothing."

The mechanics he applied:

Boyer-Moore for fixed strings. When the pattern contains no regex metacharacters, GNU grep uses the Boyer-Moore algorithm (Robert Boyer and J Strother Moore, 1977). Boyer-Moore is sub-linear on average: it skips characters in the input it has already determined cannot start a match. On a typical search, this is the single largest performance win.
Two-way string matching for harder cases. When fixed-string matching does not apply but the pattern has a fixed prefix or suffix, a two-way string matcher (Crochemore and Perrin, 1991) handles the harder cases without falling back to general regex.
Thompson's NFA construction. For full regex, GNU grep falls back to the NFA-construction algorithm Thompson himself published in 1968 (CACM, "Regular Expression Search Algorithm"). The NFA is built once and run as a deterministic state machine over the input, avoiding the exponential blow-up that naïve backtracking allows.
mmap on the input. GNU grep mmaps the input file into memory rather than reading it through stdio. The match runs directly on the mapped pages; the kernel pages-in on demand. No userland copy, no stdio buffering.
Avoid the input where possible. GNU grep does not look at every byte. The Boyer-Moore skip table lets it advance by the pattern length on a mismatch; on a typical English-text input with a four- or five-character pattern, the inner loop touches a small fraction of the file.

The result is a tool that, on modern commodity hardware, scans several gigabytes of plain text per second. The 1973 surface and the late-1980s interior happen to fit together such that the workflow Thompson designed in an hour is still the workflow one wants in 2026.

On FreeBSD

FreeBSD has shipped bsdgrep in the base system since around 2010 (afaik), a BSD-licensed reimplementation that replaced the GNU-licensed version in the base tree. The reason was licence: the FreeBSD project preferred to keep its base userland under BSD licences for the same reasons of coexistence that have applied to FreeBSD's licence philosophy since the project's earliest days.

bsdgrep is a leaner implementation than GNU grep. It does not aim to match GNU grep's last-mile performance optimisations on every input class; it aims for correctness, POSIX compliance, and small footprint. For the typical workloads on a FreeBSD system (log scanning, build-system uses, ports infrastructure), the difference is invisible. For the workload of scanning a multi-gigabyte corpus repeatedly, GNU grep is one pkg install gnugrep away.

The FreeBSD path to grep is therefore: /usr/bin/grep is bsdgrep, available without installing anything, on every FreeBSD system from a minimal install upward. OpenBSD and NetBSD follow the same pattern. On Linux, GNU grep is the default; /bin/grep and /usr/bin/grep point to it on the major distributions.

The Lineage

The shape — pattern, file, matching lines on stdout — has stayed identical for fifty-three years. The modern descendants reproduce it exactly.

ack arrived in 2005, written by Andy Lester (afaik) in Perl. Its primary differences from grep were ergonomic: by default it skipped version-control directories, build artefacts and binary files, and it recurse-searched by default. The interface was grep's.

ag, the silver searcher, arrived in 2011 from Geoff Greer (afaik), written in C. ag's primary differences from ack were performance: ack was Perl and slow; ag was C and fast. The interface was still grep's.

ripgrep arrived in 2016 from Andrew Gallant (BurntSushi on GitHub), written in Rust. ripgrep honours .gitignore by default, uses SIMD instructions for fast searching, and is competitive with or faster than GNU grep on most workloads. The interface was, once again, still grep's.

The pattern across these descendants is worth naming. Each was written by a developer who looked at grep and concluded that it was the right shape, that the only thing worth changing was the implementation. Thompson's original choice (the regex grammar from ed, the line-oriented output, the pipe-as-default-output) survived three implementation rewrites across nearly two decades without anyone proposing a different interface. That is what an interface looks like when its designer was paying attention.

A Note on Regular Expressions

Stephen Cole Kleene formalised regular sets in 1956 as a mathematical object: a finite-state recogniser for regular languages. The construction was not intended for programmers; it was a piece of theoretical computer science.

Thompson, in 1968, wrote the first practical implementation in his CACM paper. He showed that an NFA could be built directly from a regex syntax, that the NFA could be simulated efficiently, and that this gave a fast pattern-matching procedure suitable for an editor. The QED editor he was working on at the time was the first user of this construction; ed inherited it; grep extracted it.

The pedagogical consequence has been quietly enormous. A working programmer in 2026 who knows regex learnt it because grep was the first place they typed one. The pattern language travelled from ed to grep to sed to awk to Perl to JavaScript to every modern programming environment. Kleene's mathematical object became a working tool because Thompson saw, in 1968, what one could do with it, and because grep, in 1973, made it the default search interface for the working programmer.

The Aphorism

A man wrote, in about an hour, a tool that taught a generation of programmers what regular expressions were for. Then he wrote Unix, B, C, and UTF-8 in the years on either side. One rather suspects he was paying attention to which tools the world would still need fifty years later.

The Federalist Papers got their pattern analysis. The world got grep. Lee McMahon, on the available record, has been somewhat under-credited; one rather hopes he was pleased.

Read the full article on vivianvoss.net →

By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.