DEV Community

SEN LLC
SEN LLC

Posted on

Building an Aho–Corasick CLI in Go — Match a Whole Dictionary in One Pass, Failure Links as Generalized KMP, and Every Overlapping Match

The classic Aho–Corasick multi-pattern matcher as a Go CLI. It finds every occurrence of a whole dictionary of patterns in a single pass costing O(text + patterns + matches) — independent of how many patterns you search for. Three layers: (1) a trie of all patterns; (2) failure links — when the next char doesn't extend the match, jump to the longest proper suffix that is also a prefix of some pattern, which is exactly KMP's failure function generalized from one pattern to a set; (3) dictionary-suffix links, so overlapping matches (hers also ends he) are all reported. A turn from the probabilistic-data-structure series (#274–278) to classic string algorithms.

📦 GitHub: https://github.com/sen-ltd/aho-corasick-cli

Screenshot

The problem

You have many needles, not one. Running a single-pattern search (even a good one
like KMP) once per pattern costs O(P · text) for P patterns. Aho–Corasick
does it in one pass: build an automaton from the whole dictionary, then sweep
the text once, emitting every match — overlapping ones included.

Here's the CLI, real output:

$ ac search -p he -p she -p his -p hers ushers
       1  she
       2  he
       2  hers          # overlapping matches, all found in one pass
— 3 match(es) of 4 pattern(s) in 6 bytes, automaton has 10 states

$ ac count --patterns signatures.txt --from app.log
ERROR               6034
WARN                4107
timeout             2025
connection refused  1960
...
— 22323 total match(es) in one pass over 542889 bytes
Enter fullscreen mode Exit fullscreen mode

Sweep one log for a whole set of error signatures at once — a natural fit.

The build: a trie + two kinds of links

1. A trie (goto). Every pattern is inserted; walking it consumes matching
prefixes of the text.

2. Failure links. When the next character doesn't extend the current match,
where do you go without re-reading text? To the longest proper suffix of what
you've matched that is also a prefix of some pattern — exactly KMP's failure
function, generalized from one pattern to a set. Built with a BFS, each node's
link derived from its parent's:

f := m.fail[u]
for {
    if c, ok := m.children[f][b]; ok { m.fail[v] = c; break }
    if f == 0 { m.fail[v] = 0; break }
    f = m.fail[f]
}
Enter fullscreen mode Exit fullscreen mode

Because failure links only point to strictly shallower nodes, the scan never
moves backward in the text — that's what keeps it linear.

3. Dictionary-suffix links. One text position can finish several patterns at
once — matching hers also finishes he and her. Each node links to the
nearest shorter pattern that also ends there, so a report walks that chain and
emits them all:

if m.out[m.fail[u]] != -1 { m.dict[u] = m.fail[u] } else { m.dict[u] = m.dict[m.fail[u]] }
Enter fullscreen mode Exit fullscreen mode

TestDictSuffixReportsNested pins that matching "hers" reports hers, he,
and her; TestClassicUshers pins the textbook {he, she, his, hers} over
"ushers".

The scan: one linear pass

node := 0
for i := 0; i < len(text); i++ {
    node = m.step(node, text[i])          // goto, falling back via failure links
    for t := node; t != -1; t = m.dict[t] // walk the dictionary-suffix chain
        if pid := m.out[t]; pid != -1 { emit(pid, endsAt: i) }
}
Enter fullscreen mode Exit fullscreen mode

step follows failure links until a goto edge (or the root); amortized, the whole
scan is linear in the text length. TestAgainstBruteForce checks the counts
against a naive O(P·n) matcher on mixed overlapping patterns.

UTF-8

Matching is over bytes, so multibyte patterns just work and offsets are byte
offsets. TestUnicodeByteMatching searches 東京都 for the overlapping patterns
東京, 京都, and 東京都 and finds all three.

aho/aho.go — automaton: trie, failure & dict links, FindAll / Count (10 tests)
main.go    — CLI: search / count, patterns from flags or file, text from arg/file/stdin
Enter fullscreen mode Exit fullscreen mode

Takeaway

Aho–Corasick is the go-to for "many needles in one pass." The keys are
failure links as generalized KMP, dictionary-suffix links so overlapping
matches aren't missed
, and a scan that's linear in the text. It's a turn
from the probabilistic structures of #274–278 to classic string algorithms, but
the instinct is the same: pay once to build an automaton, then run the hot path
cheap.

📦 GitHub: https://github.com/sen-ltd/aho-corasick-cli

Top comments (0)