
Andrey Kolkov

Posted on • Originally published at github.com

Go's Regexp is Slow. So I Built My Own - up to 3000x Faster

I've been writing Go for years. Love the language. But there's one thing that always bothered me: regex performance.

Recently I was processing large text files (logs, traces, datasets) and kept hitting the same wall: regexp's Find consuming 70-80% of CPU time. Not on complex patterns. On simple stuff like .*error.*connection.*.

So I did what any reasonable developer would do: spent 6 months building coregex, a drop-in replacement for Go's regexp that's 3-3000x faster while keeping the same O(n) guarantees.

Here's how and why.


The Problem

Let's be direct: Go's regexp is not optimized for performance.

It uses Thompson's NFA exclusively - great for correctness (no ReDoS, guaranteed O(n)), terrible for speed:

// Searching for "error" in a 1MB file
re := regexp.MustCompile(`.*error.*`)
start := time.Now()
re.Find(data)
fmt.Println(time.Since(start)) // ~27ms

Why so slow?

  1. No SIMD - processes bytes sequentially
  2. No prefilters - scans entire input even when pattern clearly won't match
  3. Single engine - uses same algorithm for all patterns
  4. Allocation overhead in hot paths

Compare with Rust's regex crate on the same pattern: 21µs. That's 1,285x faster.

The gap is real. And fixable.


The Rust Rabbit Hole

I started digging into how Rust achieves this. Turns out, they use multi-engine architecture:

  1. Literal extraction: Pull out fixed strings from pattern (.*error.* → literal "error")
  2. SIMD prefilter: Use AVX2 to find candidates (12x faster byte search)
  3. Strategy selection: Pick optimal engine per pattern
    • DFA for simple patterns
    • NFA for complex patterns
    • Reverse search for suffix patterns
    • Specialized engines for common cases
  4. Zero allocations: Object pools, preallocated buffers

Go has none of this. The regexp package is over a decade old and hasn't changed much.

Could I bring this to Go? Challenge accepted.


Building coregex - Month by Month

Month 1-2: SIMD Primitives

Started with the foundation. Go's bytes.IndexByte is okay. AVX2 is better.

Wrote assembly:

// Find byte in slice using AVX2 (32 bytes parallel)
TEXT ·memchr(SB), NOSPLIT, $0-40
    MOVQ    haystack+0(FP), DI
    MOVQ    len+8(FP), CX
    MOVBQZX needle+24(FP), AX

    // Broadcast needle to YMM0
    VMOVD   AX, X0
    VPBROADCASTB X0, Y0

loop:
    VMOVDQU (DI), Y1       // Load 32 bytes
    VPCMPEQB Y0, Y1, Y2    // Compare (parallel)
    VPMOVMSKB Y2, AX       // Extract mask
    TESTL   AX, AX
    JNZ     found
    // ... continue

Benchmark (finding 'e' in 64KB):

  • stdlib: 8,367 ns
  • coregex: 679 ns
  • 12.3x faster

Then memmem (substring search), then teddy (multi-pattern SIMD).

Month 3-4: Strategy Selection

Not all patterns are equal. Suffix patterns need reverse search. Inner literal patterns need bidirectional search.

Built meta-engine:

func selectStrategy(pattern *syntax.Regexp) Strategy {
    // Extract literals
    prefix := extractPrefix(pattern)
    suffix := extractSuffix(pattern)
    inner := extractInner(pattern)

    // Choose optimal strategy
    if len(suffix) >= 3 {
        return UseReverseSuffix  // 1000x for .*\.txt
    }
    if len(inner) >= 3 {
        return UseReverseInner   // 3000x for .*error.*
    }
    if len(prefix) >= 3 {
        return UseDFA            // 5-10x for prefix.*
    }
    return UseAdaptive           // Try DFA, fallback NFA
}

Each pattern gets the right tool for the job.

Month 5: ReverseInner - The Killer Optimization

This is where it got interesting.

Problem: Pattern .*error.*connection.* on large file.

stdlib approach:

for each position:
    try to match .*error.*connection.*
    (re-scans from every position - no backtracking, but slow)

coregex approach:

1. Find "connection" with SIMD → 200ns (not 27ms!)
2. From "connection", scan backward for "error" → fast DFA
3. From "connection", scan forward for .* → fast DFA
4. First match = done (leftmost-first semantics)

Benchmark (250KB file):

stdlib:  12.6ms
coregex: 4µs
speedup: 3,154x

Not 3x. Not 30x. Three thousand times faster.

Month 6: Zero Allocations

Hot paths must not allocate. Profiled with pprof, eliminated every allocation:

  • Object pooling for DFA states
  • Preallocated buffers
  • Careful escape analysis
  • Stack-only data structures

Result:

BenchmarkIsMatch-8  1000000  1.2 µs/op  0 B/op  0 allocs/op

Zero.


How It Actually Works

Let's trace .*error.*connection.* step by step.

Input: 250KB log file with "connection" appearing 5 times.

stdlib (NFA):

Position 0:    Try .*error.*connection.* → fail
Position 1:    Try .*error.*connection.* → fail
...
Position 250000: Try .*error.*connection.* → fail
Time: 12.6ms

coregex (ReverseInner):

SIMD scan: Find "connection" → positions [1245, 5678, ...]
At position 1245:
  Reverse DFA: Scan backward, find "error" at 1230 ✓
  Forward DFA: Scan forward, match .* ✓
  Return [1230, 1260]
Time: 4µs

We don't scan 250,000 positions. We check 5 candidates (SIMD-found literals). That's why it's 3000x faster.


The Architecture

┌─────────────────────────┐
│     Meta-Engine         │
│  (Strategy Selection)   │
└────────┬────────────────┘
         │
    ┌────┴────┐
    │Prefilter│──► memchr (AVX2)
    └────┬────┘──► memmem (SIMD)
         │     ──► teddy (multi-pattern)
         │
┌────────┴──────────────┐
│  ┌────┐  ┌────┐  ┌───┐│
│  │DFA │  │NFA │  │Rev││
│  └────┘  └────┘  └───┘│
└───────────────────────┘

Simple idea: Use specialized algorithms for specialized patterns.


Real Benchmarks

Here's what matters - patterns you actually use:

Pattern         Size    stdlib   coregex  Speedup
.*\.txt$        1MB     27ms     21µs     1,314x
.*error.*       250KB   12.6ms   4µs      3,154x
(?i)error       32KB    1.23ms   4.7µs    263x
\w+@\w+\.\w+    1KB     688ns    196ns    3.5x

Even simple patterns are faster.


API - Drop-In Replacement

// Change one line
// import "regexp"
import "github.com/coregx/coregex"

re := coregex.MustCompile(`.*error.*`)
re.Find(data)    // 3000x faster, same API

Full compatibility:

  • Find, FindAll, Match, MatchString
  • Capture groups with FindSubmatch
  • Named captures with SubexpNames()
  • Replace, ReplaceAll, Split
  • Unicode support via regexp/syntax

Only difference: no backreferences (they require backtracking → exponential worst-case).


When to Use It

Use coregex if:

  • Regex is a bottleneck (profile first!)
  • You have patterns with wildcards (.*error.*)
  • You do case-insensitive matching at scale
  • You parse logs, traces, or large text files
  • Performance matters

Stick with stdlib if:

  • Regex is not your bottleneck
  • You need maximum API stability (coregex is v0.8, pre-1.0)
  • You want zero dependencies

Performance by pattern type:

  • Suffix patterns (.*\.txt): 1000-1500x
  • Inner patterns (.*error.*): 2000-3000x
  • Case-insensitive: 100-300x
  • Simple patterns: 2-10x

Part of CoreGX

coregex is the first release from CoreGX - an org I created for high-quality Go libraries.

Mission: Build production-grade tools that should exist but don't.

Why? Go stdlib prioritizes simplicity and correctness. Great for most cases. But sometimes you need performance without sacrificing correctness. That's CoreGX.

Future projects:

  • JSON parser with SIMD
  • Advanced concurrency primitives
  • Zero-copy data structures

All MIT. All production-ready. All built for real use.


Trying It Out

go get github.com/coregx/coregex@v0.8.0

Run benchmarks on your patterns. If it's faster, great. If not, file an issue - I want to know why.

Project stats:

  • 88.3% test coverage
  • Zero known bugs
  • 18.5K lines of code
  • 6 months development
  • MIT licensed

Roadmap:

  • v0.9.0 (Q1 2026): Look-around assertions
  • v1.0.0 (Q2 2026): API stability guarantee
  • v1.1.0+: ARM NEON SIMD

Technical Deep Dive: SIMD Assembly

If you're into this stuff, here's how AVX2 memchr works:

// Find byte 'e' in haystack (processes 32 bytes in parallel)
VPBROADCASTB needle, YMM0    // Broadcast to 256-bit register

loop:
    VMOVDQU (haystack), YMM1  // Load 32 bytes (unaligned)
    VPCMPEQB YMM0, YMM1, YMM2 // Compare all 32 at once
    VPMOVMSKB YMM2, RAX       // Extract bitmask
    TEST RAX, RAX             // Any matches?
    JNZ found                 // Jump if found
    ADD haystack, 32          // Next 32 bytes
    JMP loop

found:
    TZCNT RAX, RAX            // Index of first set bit = offset within block
    ADD RAX, haystack         // result = block address + offset

This is why it's 12x faster than Go's byte-by-byte approach.


Lessons Learned

  1. Profile before optimizing - but don't stop at profiling
  2. SIMD is accessible - Go's asm syntax is surprisingly clean
  3. Algorithm matters more than micro-opts - ReverseInner is 3000x not because of clever asm, but because of smart strategy
  4. Zero allocs is achievable - requires discipline but pays off
  5. Compatibility is hard - matching stdlib behavior has edge cases

Open Source & Contributions

Source: github.com/coregx/coregex

PRs welcome for:

  • ARM NEON implementation
  • Additional SIMD primitives
  • Test coverage improvements
  • Documentation

Not welcome:

  • "Rewrite in Rust" (this is a Go library)
  • Breaking API changes pre-1.0
  • Features without benchmarks

Why I Built This

Honest answer: I wanted to learn.

How do modern regex engines work? Can I replicate Rust's performance in Go? What are the limits of Go's runtime?

Turned out the answer was: Yes, you can, and it's actually not that hard if you use the right algorithms.

Also, I needed this for my own projects. I maintain racedetector (pure-Go race detection) and other tools that parse a lot of text. This will help those projects too once v1.0 lands.


The Bottom Line

Go's regexp is fine for most use cases. But if regex is your bottleneck, you have options now.

coregex isn't magic - it's just better algorithms + SIMD. The techniques are well-known (Rust uses them, RE2 uses parts of them). I just brought them to Go.

If you're spending significant CPU time in regex, benchmark coregex. If it helps, use it. If you find bugs or slowness, report them.


Star if useful. File issues if broken. PRs if you want to contribute.


P.S. The most fun part? Writing AVX2 assembly at 2am and having it actually work. Second most fun? Seeing 3000x in benchmark output and thinking "that can't be right" (it was).

Built by @kolkov as part of CoreGX - Production Go libraries.
