The classic Aho–Corasick multi-pattern matcher as a Go CLI. It finds every occurrence of a whole dictionary of patterns in a single pass costing
O(text + patterns + matches)— independent of how many patterns you search for. Three layers: (1) a trie of all patterns; (2) failure links — when the next char doesn't extend the match, jump to the longest proper suffix that is also a prefix of some pattern, which is exactly KMP's failure function generalized from one pattern to a set; (3) dictionary-suffix links, so overlapping matches (hersalso endshe) are all reported. A turn from the probabilistic-data-structure series (#274–278) to classic string algorithms.
📦 GitHub: https://github.com/sen-ltd/aho-corasick-cli
The problem
You have many needles, not one. Running a single-pattern search (even a good one
like KMP) once per pattern costs O(P · text) for P patterns. Aho–Corasick
does it in one pass: build an automaton from the whole dictionary, then sweep
the text once, emitting every match — overlapping ones included.
Here's the CLI, real output:
$ ac search -p he -p she -p his -p hers ushers
1 she
2 he
2 hers # overlapping matches, all found in one pass
— 3 match(es) of 4 pattern(s) in 6 bytes, automaton has 10 states
$ ac count --patterns signatures.txt --from app.log
ERROR 6034
WARN 4107
timeout 2025
connection refused 1960
...
— 22323 total match(es) in one pass over 542889 bytes
Sweep one log for a whole set of error signatures at once — a natural fit.
The build: a trie + two kinds of links
1. A trie (goto). Every pattern is inserted; walking it consumes matching
prefixes of the text.
2. Failure links. When the next character doesn't extend the current match,
where do you go without re-reading text? To the longest proper suffix of what
you've matched that is also a prefix of some pattern — exactly KMP's failure
function, generalized from one pattern to a set. Built with a BFS, each node's
link derived from its parent's:
f := m.fail[u]
for {
if c, ok := m.children[f][b]; ok { m.fail[v] = c; break }
if f == 0 { m.fail[v] = 0; break }
f = m.fail[f]
}
Because failure links only point to strictly shallower nodes, the scan never
moves backward in the text — that's what keeps it linear.
3. Dictionary-suffix links. One text position can finish several patterns at
once — matching hers also finishes he and her. Each node links to the
nearest shorter pattern that also ends there, so a report walks that chain and
emits them all:
if m.out[m.fail[u]] != -1 { m.dict[u] = m.fail[u] } else { m.dict[u] = m.dict[m.fail[u]] }
TestDictSuffixReportsNested pins that matching "hers" reports hers, he,
and her; TestClassicUshers pins the textbook {he, she, his, hers} over
"ushers".
The scan: one linear pass
node := 0
for i := 0; i < len(text); i++ {
node = m.step(node, text[i]) // goto, falling back via failure links
for t := node; t != -1; t = m.dict[t] // walk the dictionary-suffix chain
if pid := m.out[t]; pid != -1 { emit(pid, endsAt: i) }
}
step follows failure links until a goto edge (or the root); amortized, the whole
scan is linear in the text length. TestAgainstBruteForce checks the counts
against a naive O(P·n) matcher on mixed overlapping patterns.
UTF-8
Matching is over bytes, so multibyte patterns just work and offsets are byte
offsets. TestUnicodeByteMatching searches 東京都 for the overlapping patterns
東京, 京都, and 東京都 and finds all three.
aho/aho.go — automaton: trie, failure & dict links, FindAll / Count (10 tests)
main.go — CLI: search / count, patterns from flags or file, text from arg/file/stdin
Takeaway
Aho–Corasick is the go-to for "many needles in one pass." The keys are
failure links as generalized KMP, dictionary-suffix links so overlapping
matches aren't missed, and a scan that's linear in the text. It's a turn
from the probabilistic structures of #274–278 to classic string algorithms, but
the instinct is the same: pay once to build an automaton, then run the hot path
cheap.

Top comments (0)