
Andrey Kolkov

Posted on • Originally published at github.com

Aho-Corasick in Go: Multi-Pattern String Matching at 6 GB/s with Zero Allocations

When your application needs to find thousands of keywords in a stream of text — think log analysis, content moderation, network intrusion detection, or DNA sequencing — you need an algorithm that doesn't slow down as the number of patterns grows. That algorithm is Aho-Corasick.

We built coregx/ahocorasick, a pure Go implementation that achieves over 6 GB/s on a single core with zero heap allocations on the hot path. No cgo, no assembly, no unsafe — just carefully structured Go code and a deep understanding of how modern CPUs access memory.

This article explains what we built, why, and the techniques that made it fast.


The Problem: O(n × k) Doesn't Scale

Suppose you have 100 keywords to search for in a 1 MB document. The naive approach — call strings.Contains for each keyword — performs 100 scans of the document. That's O(n × k) where n is the document size and k is the number of patterns.
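To make the cost concrete, here is a minimal sketch of that naive approach — every keyword triggers its own full scan of the document (names here are illustrative, not part of the library):

```go
package main

import (
	"fmt"
	"strings"
)

// naiveContainsAny is the O(n * k) baseline: each keyword forces
// strings.Contains to rescan the entire document from the start.
func naiveContainsAny(doc string, keywords []string) bool {
	for _, kw := range keywords {
		if strings.Contains(doc, kw) { // one full O(n) scan per keyword
			return true
		}
	}
	return false
}

func main() {
	doc := strings.Repeat("nothing to see here ", 1000)
	// With k keywords, the document is scanned up to k times.
	fmt.Println(naiveContainsAny(doc, []string{"error", "fatal"})) // false
}
```

With 100 keywords this does up to 100 passes; Aho-Corasick replaces all of them with one.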

Aho-Corasick solves this in O(n + z), where z is the number of matches reported — one pass through the document, regardless of how many patterns you have. Whether you're searching for 10 patterns or 10,000, the scan time is the same.

The algorithm builds a finite automaton from all patterns at once: a trie with failure links that allow the search to continue without backtracking. Every byte of input advances the automaton by exactly one state.

What We Built

coregx/ahocorasick is part of the coregx organization — a collection of high-performance libraries for Go. It serves as the multi-pattern search engine inside coregex, our regex engine, where it accelerates literal alternations like error|warning|fatal|critical.

Key Design Decisions

Pure Go, zero dependencies. The library compiles on every platform Go supports — Linux, Windows, macOS, ARM, WASM — without any build complexity. There's no cgo bridge, no platform-specific assembly files to maintain.

DFA compilation. At build time, the library constructs a noncontiguous NFA (trie + failure links), then compiles it into a fully resolved DFA — a single flat []uint32 array where every state transition is pre-computed. At search time, there are no failure links to follow. One table lookup per byte. Always.

Builder pattern API. The automaton is immutable after construction. You configure it once, build it once, and then search concurrently from any number of goroutines without synchronization.

ac, err := ahocorasick.NewBuilder().
    AddStrings([]string{"error", "warning", "fatal", "critical"}).
    Build()
if err != nil {
    log.Fatal(err)
}

// Safe for concurrent use
if ac.IsMatch(logLine) {
    // handle match
}

Performance: The Numbers

Benchmarks on Intel i7-1255U, single core:

Operation              Input   Throughput   Allocations
IsMatch (match found)  64 KB   6.3 GB/s     0
IsMatch (no match)     64 KB   6.0 GB/s     0
Find (first match)     64 KB   3.5 GB/s     1 (24 B)
FindAll (all matches)  77 B    100 MB/s     4 (720 B)

These are median values across multiple benchmark runs. Individual runs may be higher or lower depending on CPU frequency scaling. For context, reading from an NVMe SSD tops out at ~7 GB/s — this library processes data nearly as fast as most storage can deliver it.

How It Works: Three Layers of Optimization

Layer 1: The DFA Transition Table

The classical Aho-Corasick NFA has a problem: when a byte doesn't match any transition from the current state, it follows a chain of failure links back toward the root. In the worst case, this is O(depth) operations per byte.

Our DFA eliminates this entirely. At build time, for every state and every possible byte class, we pre-compute the final destination by following all failure links. The result is stored in a flat array:

next_state = trans[current_state + byte_class]

One addition. One array load. No branches, no link following, no indirection. The state IDs are pre-multiplied by the stride (the row width of the transition table), so the lookup doesn't even need a multiplication — just an addition.

Layer 2: Match Flag in the Transition Table

Each entry in the transition table is a uint32. The lower 31 bits hold the (pre-multiplied) destination state ID. The high bit — bit 31 — is a match flag: if set, the destination state has at least one matching pattern.

This means the hot loop checks for matches with a single AND instruction:

raw := trans[sid + class]
if raw & matchFlag != 0 {
    return true  // match found
}
sid = raw  // no flag → raw IS the clean state ID

The critical insight: when there's no match (the common case for most bytes), raw has bit 31 clear, so it's already a valid state ID. No masking needed. The mask is only applied on the rare match path.
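To make Layers 1 and 2 concrete, here is a self-contained toy sketch: a hand-built three-state DFA for the single pattern "ab", with pre-multiplied state IDs and the match flag in bit 31. The table layout and byte-class mapping are illustrative — the real library generates these at build time:

```go
package main

import "fmt"

const matchFlag uint32 = 1 << 31

// Byte classes for this toy: 'a' -> 0, 'b' -> 1, everything else -> 2.
func class(b byte) uint32 {
	switch b {
	case 'a':
		return 0
	case 'b':
		return 1
	}
	return 2
}

// Toy DFA recognizing the single pattern "ab". State IDs are
// pre-multiplied by the stride (3 classes per row), so
// state 0 = 0, state 1 = 3, state 2 = 6.
var trans = []uint32{
	// state 0 (start):    'a'->1   'b'->0          other->0
	3, 0, 0,
	// state 1 (saw "a"):  'a'->1   'b'->2 (match)  other->0
	3, 6 | matchFlag, 0,
	// state 2 (matched):  unused here; isMatch returns early
	3, 0, 0,
}

func isMatch(haystack []byte) bool {
	var sid uint32 // pre-multiplied start state
	for _, b := range haystack {
		raw := trans[sid+class(b)] // one add, one load
		if raw&matchFlag != 0 {
			return true // rare path: mask would be applied here
		}
		sid = raw // bit 31 clear, so raw is already a valid state ID
	}
	return false
}

func main() {
	fmt.Println(isMatch([]byte("xxabyy"))) // true
	fmt.Println(isMatch([]byte("aaaxxx"))) // false
}
```

Every byte costs one addition, one load, and one AND — no failure links, no branches on the common path beyond the match check.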

Layer 3: SIMD Prefilter with Skip-Ahead

The DFA processes one byte at a time. But modern CPUs have SIMD instructions that can scan 16–32 bytes in a single operation. Go's bytes.IndexByte uses these instructions internally (SSE2/AVX2 on x86, NEON on ARM).

Before entering the DFA loop, we scan the haystack for any byte that could start a pattern match. If none of the pattern start bytes exist in the haystack, we return immediately — no DFA traversal at all.

More importantly, we re-engage this prefilter during the DFA loop. Whenever the automaton returns to its start state (meaning no pattern prefix is currently being tracked), we call bytes.IndexByte to skip ahead to the next position where a match could begin:

if sid == startState && len(startBytes) > 0 {
    skip := findEarliestStartByte(haystack[i+1:], startBytes)
    if skip < 0 {
        return false  // no more start bytes → no match possible
    }
    i += skip
}

This is the same technique used by BurntSushi's Rust aho-corasick crate, adapted for Go's runtime. It turns the no-match worst case from a full O(n) DFA scan into a handful of SIMD scans.

Real-World Usage

Log Analysis Pipeline

// Build once at startup
patterns := []string{
    "ERROR", "FATAL", "PANIC", "OOM",
    "connection refused", "timeout exceeded",
    "permission denied", "disk full",
}
ac, _ := ahocorasick.NewBuilder().
    AddStrings(patterns).
    Build()

// Process log lines (concurrent-safe)
func processLine(line []byte) {
    if !ac.IsMatch(line) {
        return // fast path: no keywords found
    }
    for _, m := range ac.FindAll(line, -1) {
        alert(patterns[m.PatternID], line)
    }
}

Content Moderation

// Load blocklist from database
blocklist := loadBlocklist() // []string, potentially thousands of terms
ac, _ := ahocorasick.NewBuilder().
    AddStrings(blocklist).
    Build()

func moderate(content []byte) bool {
    return ac.IsMatch(content) // O(n), regardless of blocklist size
}

Regex Engine Prefilter

Inside coregex, when the regex engine encounters a pattern like:

(?i)(error|warning|fatal|critical|info|debug|trace)

It extracts the literal alternation, builds an Aho-Corasick automaton, and uses it as a prefilter. The automaton quickly identifies candidate positions in the text, and the full regex engine only runs at those positions. For large inputs with rare matches, this can be 10–100x faster than running the regex directly.
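The idea can be sketched with the standard library alone. This is a hedged illustration of the prefilter pattern, not the coregex integration itself: find a literal candidate first, then run the full regex only from the candidate position onward.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// The literal alternation extracted from the regex.
	literals = []string{"error", "warning", "fatal", "critical"}
	re       = regexp.MustCompile(`(?i)(error|warning|fatal|critical)`)
)

// prefilteredMatch scans for a cheap literal candidate first; the
// full regex engine runs only when (and where) a candidate exists.
func prefilteredMatch(text string) bool {
	lower := strings.ToLower(text)
	for _, lit := range literals {
		if i := strings.Index(lower, lit); i >= 0 {
			// Verify with the full engine at the candidate position.
			return re.MatchString(text[i:])
		}
	}
	return false // no literal candidate, so the regex cannot match
}

func main() {
	fmt.Println(prefilteredMatch("2024-01-01 FATAL: disk full")) // true
	fmt.Println(prefilteredMatch("2024-01-01 ok: all clear"))    // false
}
```

In the common case — large input, no match — the regex engine never runs at all, which is where the 10–100x speedups come from.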

Part of the coregx Ecosystem

ahocorasick is one component in the coregx organization's text processing toolkit:

  • ahocorasick — Multi-pattern string matching (this library)
  • coregex — High-performance regex engine that uses ahocorasick as its literal search backend

The libraries are designed to work together but are independently useful. ahocorasick has zero dependencies and can be used standalone in any Go project.

If you're interested in the regex engine side of things, check out Go's Regexp is Slow. So I Built My Own — up to 3000x Faster, where we describe coregex and how it uses this Aho-Corasick library under the hood.

Getting Started

go get github.com/coregx/ahocorasick

package main

import (
    "fmt"
    "github.com/coregx/ahocorasick"
)

func main() {
    ac, _ := ahocorasick.NewBuilder().
        AddStrings([]string{"quick", "brown", "fox"}).
        Build()

    text := []byte("the quick brown fox jumps over the lazy dog")

    // Zero-allocation existence check
    fmt.Println("Has match:", ac.IsMatch(text)) // true

    // Find all matches
    for _, m := range ac.FindAll(text, -1) {
        fmt.Printf("  %q at [%d:%d]\n", text[m.Start:m.End], m.Start, m.End)
    }
}

Output:

Has match: true
  "quick" at [4:9]
  "brown" at [10:15]
  "fox" at [16:19]

What's Next

  • Teddy SIMD prefilter — A packed SIMD algorithm (inspired by Hyperscan) for even faster candidate scanning with ≤64 patterns
  • Contiguous NFA — A memory-efficient alternative to the DFA for very large pattern sets (10,000+ patterns) where the full DFA transition table doesn't fit in L2 cache
  • Stream search — an io.Reader interface for searching streaming data without loading the entire input into memory


Licensed under MIT. Contributions welcome.
