Andrey Kolkov

Posted on • Edited on • Originally published at github.com

Go's Regexp is Slow. So I Built My Own - up to 3000x Faster

🔥 Updated January 5, 2026: v0.9.1 released! New features: DigitPrefilter (154x on IP patterns, 2.7x faster than Rust!), Aho-Corasick (113x on multi-literal patterns), adaptive prefilter switching. 13 strategies, ~38K LOC.

I've been writing Go for years. Love the language. But there's one thing that always bothered me: regex performance.

Recently I was processing large text files (logs, traces, datasets) and kept hitting the same wall - regexp.Find() consuming 70-80% of CPU time. Not on complex patterns. On simple stuff like .*error.*connection.*.

So I did what any reasonable developer would do: spent months building coregex, a drop-in replacement for Go's regexp that's 3-3000x faster while keeping the same O(n) guarantees.

Here's how and why.


The Problem

Let's be direct: Go's regexp is not optimized for performance.

It uses Thompson's NFA exclusively - great for correctness (no ReDoS, guaranteed O(n)), terrible for speed:

// Searching for "error" in a 1MB file
re := regexp.MustCompile(`.*error.*`)
start := time.Now()
re.Find(data)
fmt.Println(time.Since(start)) // ~27ms

Why so slow?

  1. No SIMD - processes bytes sequentially
  2. No prefilters - scans entire input even when pattern clearly won't match
  3. Single engine - uses same algorithm for all patterns
  4. Allocation overhead in hot paths

Compare with Rust's regex crate on same pattern: 21µs. That's 1,285x faster.

The gap is real. And fixable.


The Rust Rabbit Hole

I started digging into how Rust achieves this. Turns out, they use multi-engine architecture:

  1. Literal extraction: Pull out fixed strings from pattern (.*error.* → literal "error")
  2. SIMD prefilter: Use AVX2 to find candidates (12x faster byte search)
  3. Strategy selection: Pick optimal engine per pattern
    • DFA for simple patterns
    • NFA for complex patterns
    • Reverse search for suffix patterns
    • Specialized engines for common cases
  4. Zero allocations: Object pools, preallocated buffers

Go has none of this. The regexp package is roughly a decade old and hasn't changed much.

Could I bring this to Go? Challenge accepted.


Building coregex - The Journey

Phase 1: SIMD Primitives

Started with the foundation. Go's bytes.IndexByte is okay. AVX2 is better.

Wrote assembly:

// Find byte in slice using AVX2 (32 bytes parallel)
TEXT ·memchrAVX2(SB), NOSPLIT, $0-40
    MOVQ    haystack+0(FP), DI
    MOVQ    len+8(FP), CX
    MOVBQZX needle+24(FP), AX

    // Broadcast needle to YMM0
    VMOVD   AX, X0
    VPBROADCASTB X0, Y0

loop:
    VMOVDQU (DI), Y1       // Load 32 bytes
    VPCMPEQB Y0, Y1, Y2    // Compare (parallel)
    VPMOVMSKB Y2, AX       // Extract mask
    TESTL   AX, AX
    JNZ     found
    // ... continue

Benchmark (finding 'e' in 64KB):

  • stdlib: 8,367 ns
  • coregex: 679 ns
  • 12.3x faster

Then memmem (substring search with paired-byte SIMD), then teddy (multi-pattern SIMD).

Phase 2: Strategy Selection

Not all patterns are equal. Suffix patterns need reverse search. Inner literal patterns need bidirectional search.

Built meta-engine with 13 strategies:

func selectStrategy(pattern *syntax.Regexp, info *PatternInfo) Strategy {
    // Each pattern gets the optimal engine
    switch {
    case info.HasSuffix && len(info.Suffix) >= 3:
        return UseReverseSuffix     // 1000x for .*\.txt
    case info.HasInner && len(info.Inner) >= 3:
        return UseReverseInner      // 3000x for .*error.*
    case info.IsDigitLead:
        return UseDigitPrefilter    // 154x for IP patterns
    case info.IsLiteralAlt && len(info.Literals) > 8:
        return UseAhoCorasick       // 113x for many literals
    case info.IsLiteralAlt && len(info.Literals) <= 8:
        return UseTeddy             // 242x for few literals
    // ... 8 more strategies
    default:
        return UsePikeVM // general-purpose fallback
    }
}

Phase 3: ReverseInner - The Killer Optimization

This is where it got interesting.

Problem: Pattern .*error.*connection.* on large file.

stdlib approach:

for each position:
    run the NFA for .*error.*connection.*
    (O(n) and no backtracking, but every byte goes through the NFA simulation - slow)

coregex approach:

1. Find "connection" with SIMD → 200ns (not 27ms!)
2. From "connection", scan backward for "error" → fast DFA
3. From "connection", scan forward for .* → fast DFA
4. First match = done (leftmost-first semantics)

Benchmark (250KB file):

stdlib:  12.6ms
coregex: 4µs
speedup: 3,154x

Not 3x. Not 30x. Three thousand times faster.

Phase 4: DigitPrefilter - Beating Rust

IP address patterns like \d+\.\d+\.\d+\.\d+ were slow everywhere. Even in Rust.

Insight: Digits are rare in most text. SIMD scan for [0-9] first, then verify.

// SIMD scan for any digit (0-9)
func scanDigits(haystack []byte) int {
    // AVX2: check 32 bytes in parallel
    // Use range comparison: byte >= '0' && byte <= '9'
}

Result on IP validation (6MB input):

| Engine | Time | vs stdlib |
| --- | --- | --- |
| Go stdlib | 493 ms | baseline |
| coregex | 3.2 ms | 154x faster |
| Rust regex | 12 ms | 41x faster |

coregex is 2.7x faster than Rust on this pattern. First time beating Rust!

But there was a catch: on dense digit data (many digits, few matches), prefilter hurt performance. Solution? Adaptive switching:

const adaptiveThreshold = 64

// After 64 consecutive false positives, switch to DFA
if consecutiveFPs >= adaptiveThreshold {
    return e.findWithDFA(haystack, pos)  // Abandon prefilter
}

This follows an insight from Rust's regex crate: a prefilter with a high false-positive rate makes the search slower, not faster.

Phase 5: Aho-Corasick - Multi-Pattern at Scale

Patterns like error|warning|fatal|critical|panic|timeout|... with many alternatives were slow.

Integrated an Aho-Corasick automaton.

Result (6MB input, 12 literal patterns):

| Engine | Time | vs stdlib |
| --- | --- | --- |
| Go stdlib | 473 ms | baseline |
| coregex | 4.2 ms | 113x faster |
| Rust regex | 0.7 ms | |

Phase 6: Zero Allocations

Hot paths must not allocate. Profiled with pprof, eliminated every allocation:

  • Object pooling for DFA states
  • Preallocated buffers
  • Careful escape analysis
  • Stack-only data structures

Result:

BenchmarkIsMatch-8  1000000  1.2 µs/op  0 B/op  0 allocs/op

Zero.


The Architecture

Pattern → Parse → NFA → Literal Extract → Strategy Select
                                               ↓
                    ┌──────────────────────────────────────┐
                    │  13 Strategies:                      │
                    │  • ReverseInner (3000x)              │
                    │  • DigitPrefilter (154x, beats Rust) │
                    │  • ReverseSuffix (1000x)             │
                    │  • AhoCorasick (113x)                │
                    │  • Teddy SIMD (242x)                 │
                    │  • CharClassSearcher (23x)           │
                    │  • LazyDFA, PikeVM, OnePass...       │
                    └──────────────────────────────────────┘
                                               ↓
Input → Prefilter (SIMD) → Engine → Match Result

SIMD Primitives (AMD64):

  • memchr — single byte search (AVX2, 32 bytes per iteration)
  • memmem — substring search with paired-byte heuristic (SSSE3)
  • teddy — multi-pattern search (SSSE3)
  • digit scanner — SIMD range check for [0-9]

Pure Go fallback on other architectures.


Cross-Language Benchmarks

Real benchmarks on 6MB input (source):

| Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
| --- | --- | --- | --- | --- | --- |
| IP validation | 493 ms | 3.2 ms | 12 ms | 154x | 2.7x faster! |
| Inner `.*keyword.*` | 231 ms | 1.9 ms | 0.6 ms | 122x | Rust 3x |
| Suffix `.*\.txt` | 233 ms | 1.8 ms | 1.4 ms | 127x | ~tie |
| Literal alternation | 473 ms | 4.2 ms | 0.7 ms | 113x | Rust 6x |
| Email validation | 259 ms | 1.7 ms | 1.3 ms | 155x | ~tie |
| URL extraction | 266 ms | 2.8 ms | 0.9 ms | 96x | Rust 3x |

Key insight: coregex now beats Rust on IP/digit patterns. Competitive on most others.


API - Drop-In Replacement

// Change one line
// import "regexp"
import "github.com/coregx/coregex"

re := coregex.MustCompile(`.*error.*`)
re.Find(data)    // 3000x faster, same API

Full compatibility:

  • ✅ Find, FindAll, Match, MatchString
  • ✅ Capture groups with FindSubmatch
  • ✅ Named captures with SubexpNames()
  • ✅ Replace, ReplaceAll, Split
  • ✅ Unicode support via regexp/syntax

Zero-allocation APIs:

// Zero allocations - returns bool
matched := re.IsMatch(text)

// Zero allocations - returns (start, end, found)
start, end, found := re.FindIndices(text)

When to Use It

Use coregex if:

  • Regex is a bottleneck (profile first!)
  • You have patterns with wildcards (.*error.*)
  • You parse IP addresses, phone numbers, versions
  • You do case-insensitive matching at scale
  • You parse logs, traces, or large text files

Stick with stdlib if:

  • Regex is not your bottleneck
  • You need maximum API stability (coregex is v0.9, pre-1.0)
  • You want minimal dependencies

Performance by pattern type:

  • IP/phone patterns: 40-154x (beats Rust!)
  • Suffix patterns (.*\.txt): 100-1500x
  • Inner patterns (.*error.*): 100-3000x
  • Literal alternations: 15-242x
  • Case-insensitive: 100-300x
  • Simple patterns: 2-10x

Battle-Tested

coregex was tested in GoAWK. This real-world testing uncovered 15+ edge cases that synthetic benchmarks missed:

  • Unicode boundary handling
  • Longest-match mode compatibility
  • Empty match semantics
  • Capture group edge cases

All fixed. All tested.

Real-World: uawk - Ultra AWK

We built uawk — a modern AWK interpreter powered by coregex. Results speak for themselves:

Benchmarks vs GoAWK (10MB dataset):

| Pattern | GoAWK | uawk | Speedup |
| --- | --- | --- | --- |
| Regex alternation | 1.85s | 97ms | 19x |
| IP matching | 290ms | 99ms | 2.9x |
| General regex | 320ms | 100ms | 3.2x |

uawk wins all 10 benchmarks. This is coregex in production — not synthetic tests.

Features:

  • POSIX AWK compatible
  • Stack-based VM with 80+ bytecode ops
  • Zero CGO, cross-platform
  • Embeddable Go API

# Install
go install github.com/kolkov/uawk/cmd/uawk@latest

# Use (same as awk)
uawk '/error/ { print $0 }' server.log

We need more testers! If you have a project using regexp, try coregex and report issues.


Trying It Out

go get github.com/coregx/coregex@latest

Run benchmarks on your patterns. If it's faster, great. If not, file an issue - I want to know why.

Project stats:

  • ~87% test coverage (main package)
  • Zero known bugs
  • ~38K lines of code
  • 13 specialized strategies
  • MIT licensed

Current version: v0.9.1

Roadmap:

  • v1.0.0: API stability guarantee
  • v1.1.0+: ARM NEON SIMD (waiting for Go native SIMD proposal)

Technical Deep Dive: SIMD Assembly

If you're into this stuff, here's how AVX2 memchr works:

// Find byte 'e' in haystack (processes 32 bytes in parallel)
VPBROADCASTB needle, YMM0    // Broadcast to 256-bit register

loop:
    VMOVDQU (haystack), YMM1  // Load 32 bytes (unaligned)
    VPCMPEQB YMM0, YMM1, YMM2 // Compare all 32 at once
    VPMOVMSKB YMM2, RAX       // Extract bitmask
    TEST RAX, RAX             // Any matches?
    JNZ found                 // Jump if found
    ADD haystack, 32          // Next 32 bytes
    JMP loop

found:
    TZCNT RAX, RAX            // Find first set bit
    ADD result, haystack
    VZEROUPPER                // Critical! AVX-SSE transition
    RET

This is why it's 12x faster than Go's byte-by-byte approach.


Lessons Learned

  1. Profile before optimizing - but don't stop at profiling
  2. SIMD is accessible - Go's asm syntax is surprisingly clean
  3. Algorithm matters more than micro-opts - ReverseInner is 3000x not because of clever asm, but because of smart strategy
  4. Adaptive algorithms win - static heuristics fail on edge cases, runtime adaptation works
  5. Zero allocs is achievable - requires discipline but pays off
  6. Compatibility is hard - matching stdlib behavior has edge cases (GoAWK found 15!)
  7. You CAN beat Rust - on specific patterns, Go + SIMD + smart algorithms wins

Open Source & Contributions

Source: github.com/coregx/coregex

PRs welcome for:

  • ARM NEON implementation
  • Additional SIMD primitives
  • Test coverage improvements
  • Documentation

Not welcome:

  • "Rewrite in Rust" (this is a Go library)
  • Breaking API changes pre-1.0
  • Features without benchmarks

Why I Built This

Honest answer: I wanted to learn.

How do modern regex engines work? Can I replicate Rust's performance in Go? What are the limits of Go's runtime?

Turned out the answer was: Yes, you can, and on some patterns you can even beat Rust if you use the right algorithms.

Also, I needed this for my own projects. I maintain racedetector (pure-Go race detection) and other tools that parse a lot of text. This helps those projects too.


The Bottom Line

Go's regexp is fine for most use cases. But if regex is your bottleneck, you have options now.

coregex isn't magic - it's just better algorithms + SIMD + adaptive strategies. The techniques are well-known (Rust uses them, RE2 uses parts of them). I just brought them to Go - and on some patterns, made them faster than Rust.

If you're spending significant CPU time in regex, benchmark coregex. If it helps, use it. If you find bugs or slowness, report them.

Star if useful. File issues if broken. PRs if you want to contribute.


P.S. The most fun part? Writing AVX2 assembly at 2am and having it actually work. Second most fun? Seeing "coregex 2.7x faster than Rust" in benchmark output and thinking "wait, that can't be right" (it was).

Built by @kolkov as part of CoreGX - Production Go libraries.

Top comments (2)

Moksh (mx_tech):

Really curious to use this package.

Andrey Kolkov (kolkov):

Hi! Yes, this library needs real users. Working with Ben Hoyt, we identified numerous bugs through the GoAWK integration and significantly improved performance on certain patterns and small datasets.