I've been writing Go for years. Love the language. But there's one thing that always bothered me: regex performance.
Recently I was processing large text files (logs, traces, datasets) and kept hitting the same wall: `Regexp.Find` consuming 70-80% of CPU time. Not on complex patterns - on simple stuff like `.*error.*connection.*`.
So I did what any reasonable developer would do: spent 6 months building coregex, a drop-in replacement for Go's regexp that's 3-3000x faster while keeping the same O(n) guarantees.
Here's how and why.
The Problem
Let's be direct: Go's regexp is not optimized for performance.
It uses a Thompson NFA simulation exclusively - great for correctness (no ReDoS, guaranteed O(n)), terrible for speed:
// Searching for "error" in 1MB file
re := regexp.MustCompile(`.*error.*`)
start := time.Now()
re.Find(data) // 27ms
Why so slow?
- No SIMD - processes bytes sequentially
- No prefilters - scans entire input even when pattern clearly won't match
- Single engine - uses same algorithm for all patterns
- Allocation overhead in hot paths
Compare with Rust's regex crate on the same pattern: 21µs. That's roughly 1,285x faster.
The gap is real. And fixable.
The Rust Rabbit Hole
I started digging into how Rust achieves this. Turns out, they use multi-engine architecture:
- Literal extraction: pull fixed strings out of the pattern (`.*error.*` → literal "error")
- SIMD prefilter: use AVX2 to find candidates (12x faster byte search)
- Strategy selection: pick the optimal engine per pattern
  - DFA for simple patterns
  - NFA for complex patterns
  - Reverse search for suffix patterns
  - Specialized engines for common cases
- Zero allocations: object pools, preallocated buffers
Go has none of this. The regexp package is over a decade old and hasn't changed much.
Could I bring this to Go? Challenge accepted.
Building coregex - Month by Month
Month 1-2: SIMD Primitives
Started with the foundation. Go's bytes.IndexByte is okay. AVX2 is better.
Wrote assembly:
// Find byte in slice using AVX2 (32 bytes parallel)
TEXT ·memchr(SB), NOSPLIT, $0-40
MOVQ haystack+0(FP), DI
MOVQ len+8(FP), CX
MOVBQZX needle+24(FP), AX
// Broadcast needle to YMM0
VMOVD AX, X0
VPBROADCASTB X0, Y0
loop:
VMOVDQU (DI), Y1 // Load 32 bytes
VPCMPEQB Y0, Y1, Y2 // Compare (parallel)
VPMOVMSKB Y2, AX // Extract mask
TESTL AX, AX
JNZ found
// ... continue
Benchmark (finding 'e' in 64KB):
- stdlib: 8,367 ns
- coregex: 679 ns
- 12.3x faster
Then memmem (substring search), then teddy (multi-pattern SIMD).
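For contrast, the scalar loop that a SIMD kernel replaces is the naive byte-at-a-time search - one comparison per iteration instead of 32 in parallel. A sketch (stdlib `bytes.IndexByte` is itself assembly-backed, so this is purely for illustration):

```go
package main

import "fmt"

// memchrScalar is the one-byte-at-a-time search the AVX2 kernel replaces:
// it inspects a single byte per iteration instead of 32 in parallel.
func memchrScalar(haystack []byte, needle byte) int {
	for i, b := range haystack {
		if b == needle {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(memchrScalar([]byte("coregex"), 'e')) // → 3
}
```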
Month 3-4: Strategy Selection
Not all patterns are equal. Suffix patterns need reverse search. Inner literal patterns need bidirectional search.
Built meta-engine:
func selectStrategy(pattern *syntax.Regexp) Strategy {
// Extract literals
prefix := extractPrefix(pattern)
suffix := extractSuffix(pattern)
inner := extractInner(pattern)
// Choose optimal strategy
if len(suffix) >= 3 {
return UseReverseSuffix // 1000x for .*\.txt
}
if len(inner) >= 3 {
return UseReverseInner // 3000x for .*error.*
}
if len(prefix) >= 3 {
return UseDFA // 5-10x for prefix.*
}
return UseAdaptive // Try DFA, fallback NFA
}
Each pattern gets the right tool for the job.
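The literal-extraction step above can be sketched with the stdlib's own parser. `extractInner` here is a simplified stand-in I wrote for this post, not coregex's actual extractor - it only handles a top-level concatenation:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// extractInner pulls the first required literal out of a top-level
// concatenation - a simplified stand-in for a real literal extractor.
func extractInner(pattern string) string {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return ""
	}
	re = re.Simplify()
	if re.Op != syntax.OpConcat {
		return ""
	}
	for _, sub := range re.Sub {
		if sub.Op == syntax.OpLiteral {
			return string(sub.Rune)
		}
	}
	return ""
}

func main() {
	fmt.Println(extractInner(`.*error.*`)) // → error
}
```

The parsed tree for `.*error.*` is `Concat(Star(AnyCharNotNL), Literal("error"), Star(AnyCharNotNL))`, so the required literal falls out directly.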
Month 5: ReverseInner - The Killer Optimization
This is where it got interesting.
Problem: Pattern .*error.*connection.* on large file.
stdlib approach:
for each position:
try to match .*error.*connection.*
(scan entire input, backtracking, slow)
coregex approach:
1. Find "connection" with SIMD → 200ns (not 27ms!)
2. From "connection", scan backward for "error" → fast DFA
3. From "connection", scan forward for .* → fast DFA
4. First match = done (leftmost-first semantics)
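The steps above can be sketched with stdlib primitives standing in for the fast parts: `bytes.Index` plays the role of the SIMD prefilter and `bytes.LastIndex` the reverse DFA scan. This is a simplification (real DFAs verify the full pattern, not just literals):

```go
package main

import (
	"bytes"
	"fmt"
)

// reverseInnerFind locates ".*A.*B.*"-style matches by anchoring on the
// inner literal B first, then verifying A backwards from the candidate.
func reverseInnerFind(data []byte, first, second []byte) (int, bool) {
	pos := 0
	for {
		// 1. Prefilter: find a candidate occurrence of the second literal.
		i := bytes.Index(data[pos:], second)
		if i < 0 {
			return 0, false
		}
		i += pos
		// 2. Reverse scan: does the first literal occur before it?
		if j := bytes.LastIndex(data[:i], first); j >= 0 {
			return j, true
		}
		pos = i + 1 // no luck - try the next candidate
	}
}

func main() {
	log := []byte("ok ok error: lost connection to db")
	start, ok := reverseInnerFind(log, []byte("error"), []byte("connection"))
	fmt.Println(start, ok) // → 6 true
}
```

The work done is proportional to the number of candidates, not the input length - which is the whole trick.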
Benchmark (250KB file):
stdlib: 12.6ms
coregex: 4µs
speedup: 3,154x
Not 3x. Not 30x. Three thousand times faster.
Month 6: Zero Allocations
Hot paths must not allocate. Profiled with pprof, eliminated every allocation:
- Object pooling for DFA states
- Preallocated buffers
- Careful escape analysis
- Stack-only data structures
Result:
BenchmarkIsMatch-8 1000000 1.2 µs/op 0 B/op 0 allocs/op
Zero.
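The usual way to get there in Go is sync.Pool; this is a sketch of the pattern, not coregex's actual pooling code. Note the `*[]int` indirection - storing a bare slice in a Pool boxes the slice header and allocates on every Put:

```go
package main

import (
	"fmt"
	"sync"
)

// statePool reuses scratch buffers so the hot path stops allocating
// after warm-up. It stores *[]int rather than []int to avoid boxing
// the slice header on every Put.
var statePool = sync.Pool{
	New: func() any {
		s := make([]int, 0, 256)
		return &s
	},
}

// countByte stands in for a hot matching routine that needs scratch space.
func countByte(data []byte, b byte) int {
	sp := statePool.Get().(*[]int)
	states := (*sp)[:0]
	for i := range data {
		if data[i] == b {
			states = append(states, i)
		}
	}
	n := len(states)
	*sp = states[:0] // keep the (possibly grown) backing array
	statePool.Put(sp)
	return n
}

func main() {
	fmt.Println(countByte([]byte("some error text"), 'e')) // → 3
}
```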
How It Actually Works
Let's trace .*error.*connection.* step by step.
Input: 250KB log file with "connection" appearing 5 times.
stdlib (NFA):
Position 0: Try .*error.*connection.* → fail
Position 1: Try .*error.*connection.* → fail
...
Position 250000: Try .*error.*connection.* → fail
Time: 12.6ms
coregex (ReverseInner):
SIMD scan: Find "connection" → positions [1245, 5678, ...]
At position 1245:
Reverse DFA: Scan backward, find "error" at 1230 ✓
Forward DFA: Scan forward, match .* ✓
Return [1230, 1260]
Time: 4µs
We don't scan 250,000 positions. We check 5 candidates (SIMD-found literals). That's why it's 3000x faster.
The Architecture
┌─────────────────────────┐
│ Meta-Engine │
│ (Strategy Selection) │
└────────┬────────────────┘
│
┌────┴────┐
│Prefilter│──► memchr (AVX2)
└────┬────┘──► memmem (SIMD)
│ ──► teddy (multi-pattern)
│
┌────────┴──────────────┐
│ ┌────┐ ┌────┐ ┌───┐│
│ │DFA │ │NFA │ │Rev││
│ └────┘ └────┘ └───┘│
└───────────────────────┘
Simple idea: Use specialized algorithms for specialized patterns.
Real Benchmarks
Here's what matters - patterns you actually use:
| Pattern | Size | stdlib | coregex | Speedup |
|---|---|---|---|---|
| `.*\.txt$` | 1MB | 27ms | 21µs | 1,314x |
| `.*error.*` | 250KB | 12.6ms | 4µs | 3,154x |
| `(?i)error` | 32KB | 1.23ms | 4.7µs | 263x |
| `\w+@\w+\.\w+` | 1KB | 688ns | 196ns | 3.5x |
Even simple patterns are faster.
API - Drop-In Replacement
// Change one line
// import "regexp"
import "github.com/coregx/coregex"
re := coregex.MustCompile(`.*error.*`)
re.Find(data) // same API, up to 3,000x faster on favorable patterns
Full compatibility:
- ✅ `Find`, `FindAll`, `Match`, `MatchString`
- ✅ Capture groups with `FindSubmatch`
- ✅ Named captures with `SubexpNames()`
- ✅ `Replace`, `ReplaceAll`, `Split`
- ✅ Unicode support via `regexp/syntax`
One deliberate omission: no backreferences (they require backtracking → exponential worst case). Stdlib regexp doesn't support them either, so nothing is lost in a migration.
When to Use It
Use coregex if:
- Regex is a bottleneck (profile first!)
- You have patterns with wildcards (`.*error.*`)
- You do case-insensitive matching at scale
- You parse logs, traces, or large text files
- Performance matters
Stick with stdlib if:
- Regex is not your bottleneck
- You need maximum API stability (coregex is v0.8, pre-1.0)
- You want zero dependencies
Performance by pattern type:
- Suffix patterns (`.*\.txt`): 1000-1500x
- Inner patterns (`.*error.*`): 2000-3000x
- Case-insensitive: 100-300x
- Simple patterns: 2-10x
Part of CoreGX
coregex is the first release from CoreGX - an org I created for high-quality Go libraries.
Mission: Build production-grade tools that should exist but don't.
Why? Go stdlib prioritizes simplicity and correctness. Great for most cases. But sometimes you need performance without sacrificing correctness. That's CoreGX.
Future projects:
- JSON parser with SIMD
- Advanced concurrency primitives
- Zero-copy data structures
All MIT. All production-ready. All built for real use.
Trying It Out
go get github.com/coregx/coregex@v0.8.0
Run benchmarks on your patterns. If it's faster, great. If not, file an issue - I want to know why.
Project stats:
- 88.3% test coverage
- Zero known bugs
- 18.5K lines of code
- 6 months development
- MIT licensed
Roadmap:
- v0.9.0 (Q1 2026): Look-around assertions
- v1.0.0 (Q2 2026): API stability guarantee
- v1.1.0+: ARM NEON SIMD
Technical Deep Dive: SIMD Assembly
If you're into this stuff, here's how AVX2 memchr works:
// Find byte 'e' in haystack (processes 32 bytes in parallel)
VPBROADCASTB needle, YMM0 // Broadcast to 256-bit register
loop:
VMOVDQU (haystack), YMM1 // Load 32 bytes (unaligned)
VPCMPEQB YMM0, YMM1, YMM2 // Compare all 32 at once
VPMOVMSKB YMM2, RAX // Extract bitmask
TEST RAX, RAX // Any matches?
JNZ found // Jump if found
ADD haystack, 32 // Next 32 bytes
JMP loop
found:
TZCNT RAX, RAX // Index of first set bit = match offset within the 32 bytes
ADD haystack, RAX // Absolute position = current chunk start + offset
This is why it's 12x faster than Go's byte-by-byte approach.
Lessons Learned
- Profile before optimizing - but don't stop at profiling
- SIMD is accessible - Go's asm syntax is surprisingly clean
- Algorithm matters more than micro-opts - ReverseInner is 3000x not because of clever asm, but because of smart strategy
- Zero allocs is achievable - requires discipline but pays off
- Compatibility is hard - matching stdlib behavior has edge cases
Open Source & Contributions
Source: github.com/coregx/coregex
PRs welcome for:
- ARM NEON implementation
- Additional SIMD primitives
- Test coverage improvements
- Documentation
Not welcome:
- "Rewrite in Rust" (this is a Go library)
- Breaking API changes pre-1.0
- Features without benchmarks
Why I Built This
Honest answer: I wanted to learn.
How do modern regex engines work? Can I replicate Rust's performance in Go? What are the limits of Go's runtime?
Turned out the answer was: Yes, you can, and it's actually not that hard if you use the right algorithms.
Also, I needed this for my own projects. I maintain racedetector (pure-Go race detection) and other tools that parse a lot of text. This will help those projects too once v1.0 lands.
The Bottom Line
Go's regexp is fine for most use cases. But if regex is your bottleneck, you have options now.
coregex isn't magic - it's just better algorithms + SIMD. The techniques are well-known (Rust uses them, RE2 uses parts of them). I just brought them to Go.
If you're spending significant CPU time in regex, benchmark coregex. If it helps, use it. If you find bugs or slowness, report them.
Star if useful. File issues if broken. PRs if you want to contribute.
P.S. The most fun part? Writing AVX2 assembly at 2am and having it actually work. Second most fun? Seeing 3000x in benchmark output and thinking "that can't be right" (it was).
Built by @kolkov as part of CoreGX - Production Go libraries.