Andrey Kolkov

From 100x Slower Than Rust to Beating It: The coregex Journey

A few days ago, @kostya27 posted on r/golang (124 upvotes):

"Why is Go's regex so slow?" Go total time on LangArena: 116.6 seconds. Without two regex tests: 78.5 seconds. Without regex, Go is competitive with Rust/C++. With regex, it's not even close.

He's right. And this post is about what we did about it.

Six months ago, I wrote about building coregex — a regex engine for Go that's 3-3000x faster than stdlib. The benchmarks looked great. Then reality hit.

@kostya, author of LangArena (2,900+ stars on GitHub), tried coregex on his benchmark suite. His verdict:

"I tried coregex, but no luck."

His numbers told the story:

| Benchmark | Go stdlib | coregex v0.12.8 | Rust regex | PCRE2 JIT |
| --- | --- | --- | --- | --- |
| LogParser (13 patterns) | 22.7s | 22.0s | 0.2s | – |
| Template::Regex | 6.5s | 7.0s | 3.8s | 1.0s |

We were 100x slower than Rust on LogParser. On the same machine. Same input. Same patterns.

Our "3000x faster than stdlib" claim? True — on many patterns we tested. But we had blind spots we didn't know about: case-insensitive patterns, dense-digit data, multi-wildcard suffixes. On a real-world benchmark with 13 diverse patterns covering all these cases, we were barely faster than stdlib.

That was March 10, 2026. Here's what happened in the next week.


The LangArena Challenge

LangArena's LogParser benchmark parses Apache log files with 13 regex patterns — the kind of patterns you'd find in any log analysis pipeline:

```
errors:         [5][0-9]{2} | [4][0-9]{2}
bots:           (?i)bot|crawler|scanner|spider|indexing|crawl
suspicious:     (?i)etc/passwd|wp-admin|\.\./
ips:            \d+\.\d+\.\d+\.35
api_calls:      /api/[^ "]+
methods:        (?i)get|post|put
emails:         [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
...and 6 more
```

Nothing exotic. These are bread-and-butter patterns that every Go developer uses. And we were 100x slower than Rust on them.

The question wasn't "can we optimize one pattern?" — it was "can we close a 100x gap across 13 different pattern types?"


Step 1: Learning from the Enemy

Before writing a single line of code, I needed to understand what Rust does differently. Not from reading docs — from running Rust with debug logging.

Rust's regex crate has RUST_LOG=debug:

```
$ RUST_LOG=debug ./rust-benchmark input.txt
[regex] prefixes extracted: Seq["EVA", "EVa", "EvA", "Eva", "eVA", ...]
[regex] prefilter built: teddy
[regex] using reverse suffix strategy
```

Every strategy decision, every prefilter choice, every literal extraction — logged. I could see exactly what Rust did for each pattern.

We had nothing like this. So I built COREGEX_DEBUG:

```
$ COREGEX_DEBUG=1 ./my-app
[coregex] pattern="(?i:GET|POST|PUT)" strategy=UseTeddy nfa_states=43 literals=40 complete=true
[coregex] prefilter=FatTeddy (AVX2 fat) complete=true
```

Now I could compare strategy selection side-by-side. And the differences were immediately obvious.


Step 2: The Root Causes

Bug #1: Refusing to extract case-insensitive literals

Pattern: (?iU)\b(eval|system|exec|execute|passthru|shell_exec|phpinfo)\b

A real user (#137) reported this WAF pattern was 88,000x slower than stdlib.

Rust extracts 250 case-fold literal variants:

```
eval → EVAL, EVAl, EVaL, EVal, EvAL, ... eval  (16 variants)
system → SYSTEM, SYSTEm, SYSTem, ...           (32 variants)
```

Then trims to 60 three-byte prefixes → Teddy SIMD prefilter → scans 968 bytes in 263 nanoseconds. Done.

Our literal extractor? One line killed everything:

```go
// literal/extractor.go:137
if re.Flags&syntax.FoldCase != 0 {
    return NewSeq()  // Return EMPTY. For ALL case-insensitive patterns.
}
```

This guard was added for a previous bug (#87) — naive extraction of single-case variants caused prefilter false negatives. The fix was correct for that bug, but the blanket rejection meant zero prefilter for any (?i) pattern. Without prefilter, the engine fell back to lazy DFA, which cache-thrashed on the 181-state NFA.

Fix: Expand (?i) literals into ALL case-fold variants (like Rust), then trim to 3-byte prefixes. One function, ~50 lines.

Result: 88,000x slower → 24x faster than stdlib.
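The expand-then-trim idea can be sketched in a few lines of plain Go. This is an illustrative reconstruction, not coregex's actual extractor — `caseFoldVariants` and `trimToPrefixes` are hypothetical names, and the real engine folds character classes in the parsed syntax tree rather than whole strings:

```go
package main

import "fmt"

// caseFoldVariants returns every ASCII upper/lower-case variant of s,
// the way (?i)eval expands to EVAL, EVAl, ..., eval.
func caseFoldVariants(s string) []string {
	variants := []string{""}
	for _, c := range s {
		lower, upper := c, c
		switch {
		case c >= 'a' && c <= 'z':
			upper = c - 'a' + 'A'
		case c >= 'A' && c <= 'Z':
			lower = c - 'A' + 'a'
		}
		var next []string
		for _, v := range variants {
			next = append(next, v+string(lower))
			if upper != lower {
				next = append(next, v+string(upper))
			}
		}
		variants = next
	}
	return variants
}

// trimToPrefixes dedupes the variants down to their first n bytes so the
// set becomes small enough for a SIMD prefilter like Teddy.
func trimToPrefixes(variants []string, n int) []string {
	seen := make(map[string]bool)
	var out []string
	for _, v := range variants {
		p := v
		if len(p) > n {
			p = p[:n]
		}
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	vs := caseFoldVariants("eval")
	fmt.Println(len(vs))                    // 16 variants, as in the article
	fmt.Println(len(trimToPrefixes(vs, 3))) // 8 unique 3-byte prefixes
}
```

Note how the trim step is what keeps the expansion tractable: 16 four-byte variants collapse to 8 three-byte prefixes, which easily fit a Teddy bucket.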

Bug #2: isMatchDigitPrefilter was O(n²)

Pattern: \d{3}-\d{3}-\d{4} (phone numbers)

On 6MB of log data: 7 minutes per Match() call. Stdlib: 262ms.

Root cause: isMatchDigitPrefilter used dfa.FindAt() (unanchored search) which scans from each digit position to end of input:

```go
// Before (O(n²)):
endPos := e.dfa.FindAt(haystack, digitPos)  // Scans to EOF!

// After (O(pattern_len)):
endPos := e.dfa.SearchAtAnchored(haystack, digitPos)  // Checks only at position
```

One function call change. 7 minutes → 2.1ms. 200,000x faster.

The same pattern was already fixed in findIndicesDigitPrefilter months ago — but isMatchDigitPrefilter was never updated. Copy-paste divergence.
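The difference between the two probes can be reproduced with stdlib `regexp` standing in for the internal DFA (`SearchAtAnchored` is not a public API): anchoring the pattern with `\A` bounds each probe to roughly the pattern length instead of letting an unanchored search scan to end of input.

```go
package main

import (
	"fmt"
	"regexp"
)

// anchored only matches at the start of its input: \A turns each probe
// into an O(pattern_len) check instead of an unanchored scan to EOF.
var anchored = regexp.MustCompile(`\A\d{3}-\d{3}-\d{4}`)

// countPhones mimics the digit-prefilter loop: skip to digit positions,
// then run one cheap anchored probe per candidate.
func countPhones(haystack []byte) int {
	count := 0
	for i := 0; i < len(haystack); i++ {
		if haystack[i] < '0' || haystack[i] > '9' {
			continue // prefilter: only a digit can start a phone number
		}
		if loc := anchored.FindIndex(haystack[i:]); loc != nil {
			count++
			i += loc[1] - 1 // jump past the match
		}
	}
	return count
}

func main() {
	fmt.Println(countPhones([]byte("call 555-867-5309 or 111-222-3333 today"))) // 2
}
```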

Bug #3: ReverseSuffix rejected multi-wildcard patterns

Pattern: \d+\.\d+\.\d+\.35 (IP address suffix)

This pattern has a clear suffix: .35. Rust finds it instantly with memmem, then reverse-scans for the start. Our isSafeForReverseSuffix rejected it because it had 3 wildcard subexpressions (\d+):

```go
if wildcardCount >= 2 {
    return false  // "multiple wildcards break reverse NFA"
}
```

The guard existed because our reverse NFA builder had a bug with mixed byte+epsilon states. That bug was fixed in v0.12.9. But the guard stayed.

Fix: Remove the guard. Also fix Find() leftmost semantics — replace bytes.LastIndex with bytes.Index for non-.* patterns.

Result: 57ms → 0.63ms (603x faster, 1.6x faster than Rust!)
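The strategy itself can be sketched with stdlib pieces — locate the literal suffix with bytes.Index (leftmost, per the Find() fix), reverse-scan to a candidate start, then verify the window. `findIPsEndingIn35` is a hypothetical illustration for this one pattern, not coregex's generic reverse-NFA machinery:

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// findIPsEndingIn35 sketches the ReverseSuffix strategy for
// \d+\.\d+\.\d+\.35: find the literal suffix ".35" first, walk backwards
// over [\d.] to a candidate start, and verify the window with the full
// pattern.
func findIPsEndingIn35(haystack []byte) []string {
	re := regexp.MustCompile(`\d+\.\d+\.\d+\.35`)
	suffix := []byte(".35")
	var matches []string
	pos := 0
	for {
		i := bytes.Index(haystack[pos:], suffix) // leftmost occurrence
		if i < 0 {
			break
		}
		end := pos + i + len(suffix)
		start := pos + i
		for start > pos && (haystack[start-1] == '.' ||
			(haystack[start-1] >= '0' && haystack[start-1] <= '9')) {
			start-- // reverse scan: back up over digits and dots
		}
		// Verify the candidate window with the full regex.
		if loc := re.FindIndex(haystack[start:end]); loc != nil && start+loc[1] == end {
			matches = append(matches, string(haystack[start+loc[0]:end]))
		}
		pos = end
	}
	return matches
}

func main() {
	fmt.Println(findIPsEndingIn35([]byte("x 10.0.0.35 y 192.168.1.35 z")))
	// [10.0.0.35 192.168.1.35]
}
```

The key property: the expensive part (the verification) only runs near actual `.35` occurrences, so a haystack with few suffix hits is scanned at memchr speed.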

Bug #4: FatTeddy AVX2 missed matches

Pattern: (?i)get|post|put (40 case-fold expanded literals)

FatTeddy (33-64 pattern SIMD search) found only 11,456 matches. Correct answer: 34,368.

Root cause: One assembly instruction.

FatTeddy uses 256-bit AVX2 registers with two 128-bit lanes. Low lane handles buckets 0-7, high lane handles buckets 8-15. The code used ANDL to combine lane results — requiring a match in both lanes. But GET variants (8 patterns) were all in buckets 0-7 (low lane only), PUT variants in buckets 8-15 (high lane only). ANDL zeroed them out.

```asm
// Before (incorrect):
ANDL CX, AX          // Requires BOTH lanes to match

// After (correct):
ORL  CX, AX          // Either lane is sufficient
```

One instruction. 22,912 missing matches fixed.
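The lane-combining semantics are easy to model in plain Go with two byte-wide masks. This toy `combine` function is illustrative only — the real code operates on 128-bit lane results inside AVX2 registers:

```go
package main

import "fmt"

// combine models FatTeddy's per-lane candidate masks: the low AVX2 lane
// covers buckets 0-7, the high lane buckets 8-15. AND demands a hit in
// BOTH lanes; OR accepts a hit in either, which is the correct semantics.
func combine(lowLane, highLane uint8, useOr bool) bool {
	if useOr {
		return lowLane|highLane != 0
	}
	return lowLane&highLane != 0
}

func main() {
	// A "GET" variant hits only bucket 2 (low lane); the high lane is empty.
	low, high := uint8(0b00000100), uint8(0)
	fmt.Println(combine(low, high, false)) // AND: the match is silently dropped
	fmt.Println(combine(low, high, true))  // OR: candidate kept for verification
}
```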


Step 3: Building What Rust Has

Beyond bug fixes, we needed architectural improvements to match Rust's approach:

Bidirectional DFA

Previously, UseDFA patterns did: forward DFA → match end, then PikeVM → exact boundaries. PikeVM is O(n×states) — a second full scan.

Now: forward DFA → end, reverse DFA → start, anchored DFA → exact end. Three O(n) passes instead of one O(n×states) pass.
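The forward/reverse idea can be illustrated by hand for a single trivial pattern. This sketch hard-codes the two passes for `a+b`; real engines run compiled forward and reverse DFAs here, not per-pattern loops:

```go
package main

import "fmt"

// matchAPlusB illustrates bidirectional matching for the pattern a+b:
// a forward pass finds where the first match ENDS, then a reverse pass
// walks back from that end to find where it STARTS.
func matchAPlusB(haystack string) (start, end int) {
	end = -1
	seenA := false
	for i := 0; i < len(haystack) && end < 0; i++ { // forward pass
		switch haystack[i] {
		case 'a':
			seenA = true
		case 'b':
			if seenA {
				end = i + 1
			}
			seenA = false
		default:
			seenA = false
		}
	}
	if end < 0 {
		return -1, -1
	}
	start = end - 1 // position of the 'b'
	for start > 0 && haystack[start-1] == 'a' {
		start-- // reverse pass over the a's
	}
	return start, end
}

func main() {
	s, e := matchAPlusB("xxaaabzz")
	fmt.Println(s, e) // 2 6 — the match is "aaab"
}
```

Each pass is a single linear scan with constant work per byte, which is exactly why three O(n) passes beat one O(n×states) PikeVM scan.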

Cascading Prefix Trim

When case-fold expansion produces too many literals (>64), we trim them using Rust's approach:

```
128 six-byte literals → try keep 4 bytes → 18 unique → fits Teddy!
```

This is directly from Rust's optimize_for_prefix_by_preference() with their ATTEMPTS table: [(4,64), (3,64), (2,64)].
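A minimal Go sketch of the cascade, following the (length, limit) attempt table quoted above — this is an illustration of the idea, not a port of Rust's code:

```go
package main

import "fmt"

// trimCascading tries 4-byte prefixes first, then 3, then 2, stopping as
// soon as the deduped set fits the prefilter's literal limit.
func trimCascading(lits []string) []string {
	attempts := []struct{ length, limit int }{{4, 64}, {3, 64}, {2, 64}}
	for _, a := range attempts {
		seen := make(map[string]bool)
		var out []string
		for _, l := range lits {
			p := l
			if len(p) > a.length {
				p = p[:a.length]
			}
			if !seen[p] {
				seen[p] = true
				out = append(out, p)
			}
		}
		if len(out) <= a.limit {
			return out
		}
	}
	return nil // nothing fits; fall back to another strategy
}

func main() {
	// Four literals collapse to two unique 4-byte prefixes.
	fmt.Println(trimCascading([]string{"evaluate", "evaluation", "system", "systems"}))
	// [eval syst]
}
```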

Aho-Corasick DFA Backend

Our Aho-Corasick library got a complete DFA backend rewrite:

  • Flat transition table with premultiplied state IDs
  • Match flag in high bit (single AND instruction for detection)
  • SIMD skip-ahead prefilter via bytes.IndexByte

Result: 300 MB/s → 3,400 MB/s (Find), 260 MB/s → 5,900 MB/s (IsMatch). 11-22x throughput improvement.
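A toy model of that flat-table layout, hand-building the table for the single pattern "a" (the real library compiles arbitrary pattern sets, of course):

```go
package main

import "fmt"

// matchBit is the match flag stored in the high bit of each state ID,
// so match detection in the hot loop is a single AND.
const matchBit uint32 = 1 << 31

// scan walks a flat transition table of numStates*256 entries. State IDs
// are premultiplied by 256, so the next-state lookup is one indexed load
// with no multiply.
func scan(table []uint32, input string) bool {
	state := uint32(0)
	for i := 0; i < len(input); i++ {
		state = table[(state&^matchBit)+uint32(input[i])]
		if state&matchBit != 0 {
			return true // high bit set: we reached a match state
		}
	}
	return false
}

func main() {
	// Two states: 0 = start (id 0), 1 = match (premultiplied id 256).
	table := make([]uint32, 2*256)
	table['a'] = 256 | matchBit // start --'a'--> match
	fmt.Println(scan(table, "xxaxx")) // true
	fmt.Println(scan(table, "xyz"))   // false
}
```

The missing piece relative to the real backend is the SIMD skip-ahead: before entering this byte loop, `bytes.IndexByte` jumps straight to the first byte that can leave the start state.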


The Results

Benchmark: 8 Real-World Patterns on 6.3 MB Input

100 iterations each, best of 5, same machine (i7-1255U):

| Pattern | Go stdlib | coregex v0.12.13 | Rust regex | vs stdlib | vs Rust |
| --- | --- | --- | --- | --- | --- |
| `.*@example\.com` | 420 ms | 3.3 ms | 7.2 ms | 126x | 2.2x faster |
| `.*\.(txt\|log\|md)` | 426 ms | 1.0 ms | 1.8 ms | 425x | 1.8x faster |
| email validation | 447 ms | 3.4 ms | 3.8 ms | 132x | 1.1x faster |
| `\d+\.\d+\.\d+\.35` | 381 ms | 0.63 ms | 0.98 ms | 603x | 1.6x faster |
| `(?i)get\|post\|put` | 561 ms | 16.6 ms | 7.0 ms | 34x | 2.4x slower |
| `(?i)bot\|crawler\|...` | 883 ms | 38.4 ms | 6.7 ms | 23x | 5.7x slower |
| `password=[^&\s"]+` | 24 ms | 8.9 ms | 2.9 ms | 3x | 3.1x slower |
| `session[_-]?id=...` | 8 ms | 2.7 ms | 1.2 ms | 3x | 2.3x slower |

4 out of 8 patterns are faster than Rust. All 8 are faster than Go stdlib.

@kostya's Update

Remember "no luck"? Here's the progression on his M1 MacBook:

| Version | LogParser | Gap to Rust |
| --- | --- | --- |
| v0.12.8 (start) | 22.0s | 100x |
| v0.12.9 | 5.3s | 26x |
| v0.12.10 | 2.67s | 13x |
| v0.12.13 (current) | 2.12s | 10x |

From 100x slower to 10x. Not parity yet — but a different conversation than "no luck."


Why Not Just Use CGO?

Every other Go regex alternative uses CGO or Wasm:

  • go-re2: C++ RE2 via Wasm (wazero)
  • regexp2: Backtracking (.NET-style) — no O(n) guarantee
  • rubex: Oniguruma via CGO
  • go-pcre: PCRE via CGO

coregex is pure Go + Go assembly. No CGO, no Wasm, no external dependencies.

Why does this matter?

  • Cross-compilation: GOOS=linux GOARCH=arm64 go build just works
  • Static binaries: No shared libraries to ship
  • Go toolchain: go vet, go test -race, pprof all work
  • Debugging: Standard Go debugging, no FFI boundary
  • Security: No C memory safety issues in regex hot paths

A performance gap to PCRE2's JIT remains — the JIT compiles each regex to native machine code, achieving 1.0s where we take 7.1s on Template::Regex. But that's an architectural tier boundary: we're competing within the automata-based class (like RE2 and Rust regex), not against JIT engines.


What We Learned

1. Debug logging is not optional

Building COREGEX_DEBUG was the single most impactful decision. Without it, every optimization was guesswork. With it, we could see exactly why a pattern was slow and verify our fix matched Rust's approach.

If you're building any kind of engine — regex, query planner, compiler — add strategy logging from day one.
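The mechanism costs almost nothing to add. Here is a sketch of env-gated strategy logging in the spirit of COREGEX_DEBUG — `debugLine` is a hypothetical helper, not coregex's actual implementation:

```go
package main

import (
	"fmt"
	"os"
)

// debugLine formats one strategy-decision log line, or returns "" when
// logging is off, so disabled runs pay only a branch.
func debugLine(enabled bool, format string, args ...any) string {
	if !enabled {
		return ""
	}
	return fmt.Sprintf("[engine] "+format, args...)
}

func main() {
	// One env check at startup gates every strategy log in the engine.
	enabled := os.Getenv("COREGEX_DEBUG") != ""
	if line := debugLine(enabled, "pattern=%q strategy=%s literals=%d",
		`(?i)get|post|put`, "Teddy", 40); line != "" {
		fmt.Fprintln(os.Stderr, line)
	}
	fmt.Println("compiled")
}
```

Logging the decision (which strategy, which prefilter, how many literals) rather than the execution keeps the output small enough to diff against another engine's log.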

2. One instruction can hide 23,000 bugs

The FatTeddy ANDL → ORL fix taught us that SIMD code correctness is binary. Not "mostly correct" or "works for some patterns." If your lane combining logic is wrong, you silently drop matches. No error, no panic — just wrong results.

Always verify match counts against stdlib. On every pattern. On every change.
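A minimal version of that differential check — here stdlib sits on both sides purely to show the harness shape; in real CI the second compile is the engine under test:

```go
package main

import (
	"fmt"
	"regexp"
)

// sameMatches compares FindAllString output from a reference engine and
// an engine under test, element by element. Count-only comparison is not
// enough: two engines can agree on count and still disagree on spans.
func sameMatches(pattern, input string) bool {
	ref := regexp.MustCompile(pattern).FindAllString(input, -1)
	got := regexp.MustCompile(pattern).FindAllString(input, -1) // engine under test goes here
	if len(ref) != len(got) {
		return false
	}
	for i := range ref {
		if ref[i] != got[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameMatches(`(?i)get|post|put`, "GET /a PoSt /b put /c")) // true
}
```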

3. Benchmarks lie — until they don't

Our "3000x faster" headline was true for .*error.* patterns. But @kostya's LangArena showed the full picture: on diverse real-world patterns, we were barely faster than stdlib.

Real benchmarks use real patterns from real users. We now run regex-bench CI on every PR — 16 core patterns + 13 LangArena patterns, compared against both stdlib and Rust regex, on Linux AMD EPYC and macOS Apple Silicon.

4. Guard clauses outlive their bugs

Three of our four major bugs were caused by guards that stayed after the underlying bug was fixed. FoldCase rejection, wildcardCount >= 2, unanchored FindAt — all were correct when added. All became performance killers months later when the original bugs were resolved.

Track why a guard exists. Remove it when the reason is gone.

5. Go ASM is production-viable for SIMD

We wrote ~500 lines of AVX2/SSSE3 assembly for Teddy multi-pattern search. It works. FatTeddy throughput: 12 GB/s on single-call scans (2x faster than SlimTeddy SSSE3!).

The challenge isn't writing the ASM — it's the Go→ASM function call boundary. Each call costs ~60ns + mask reload. For high-match-count patterns, this adds up. Our batch API (64KB chunks) reduces round-trips, but the integrated prefilter+DFA loop that Rust uses remains the gold standard.


Current State: v0.12.13

97,000 lines of code. 17 strategies. 1,470 tests. 5 releases in one week.

```shell
go get github.com/coregx/coregex@v0.12.13
```

Drop-in replacement:

It's a true drop-in replacement for Go's regexp package — same API, same types (Regexp is aliased), same method signatures:

```go
import "github.com/coregx/coregex"  // instead of "regexp"

re := coregex.MustCompile(`(?i)get|post|put`)
matches := re.FindAllString(data, -1)  // Same API, faster execution
```

In most cases, changing the import path is all you need.

Debug your patterns:

```shell
COREGEX_DEBUG=1 ./your-app
# [coregex] pattern="(?i:GET|P(?:OST|UT))" strategy=UseTeddy nfa_states=43 literals=40 complete=true
# [coregex] prefilter=FatTeddy (AVX2 fat) complete=true
```

What's Still Slower Than Rust

Honesty matters. Here's where we're still behind:

| Gap | Root cause | Status |
| --- | --- | --- |
| `(?i)` patterns: 2-6x | FatTeddy ORL creates more false positives than Rust's interleave verification | Researched, needs ASM rewrite |
| DFA verification: 3-7x | Go→ASM round-trip overhead, no integrated prefilter+DFA loop | Architectural |
| Template::Regex: 1.8x | Two-phase DFA+PikeVM vs Rust's single-phase lazy DFA | Planned |
| ARM: 5-15x vs Rust | No SIMD prefilters on ARM (Teddy/memchr are x86-only) | Waiting for Go NEON support |

We're not hiding these gaps. They're tracked, researched, and planned. The goal is Rust parity on all pattern types — we're not there yet on (?i) and DFA-heavy patterns.


Community Testing Matters — A Lot

A multi-engine regex library is inherently complex. 17 strategies, SIMD assembly, lazy DFA, reverse search, prefilter cascading — every combination of pattern shape × input data × strategy is a potential edge case. No amount of internal testing can cover what real users discover in minutes.

Every major fix in this article came from community feedback:

  • @kostya's LangArena exposed the 100x gap we didn't know about
  • tjbrains' WAF pattern (#137) revealed the 88,000x regression in case-insensitive matching
  • GoAWK integration uncovered 15+ Unicode edge cases months earlier

The pattern is consistent: someone runs coregex on their specific workload, finds a pattern type we haven't optimized yet, reports it — and we fix it in hours, not weeks. The FatTeddy lane bug? Fixed same day. The DigitPrefilter O(n²)? Fixed in one line. Case-insensitive literal extraction? Researched Rust's approach, implemented, released — all within 24 hours.

There are likely more patterns that aren't optimized yet. That's the nature of a 17-strategy engine — some strategy paths get less testing than others. But the architecture is sound, the fix turnaround is fast, and every report makes the library better for everyone.

We proposed coregex for Go's standard library. It wasn't accepted — and honestly, that's okay. As an independent library, we can iterate faster, ship SIMD assembly that the Go team wouldn't merge, and make decisions optimized for performance rather than compatibility. The Go ecosystem is better with options.

Don't hesitate to contribute. File issues with your patterns and inputs. Even a simple "this pattern is slower than stdlib" report helps — it tells us which strategy path needs work. The more diverse patterns we see, the fewer blind spots remain.

Pull requests are especially welcome. We know that a healthy open source project is built by its community, and we value every contributor. Don't worry if your PR isn't perfect — we'll review the code, help you fix any issues, guide you through our conventions, and explain what's needed to get it merged. Whether it's a new test case, a documentation fix, a strategy optimization, or a bug report with a reproducer — every contribution counts and every contributor gets credited.


Try It

If regex is a bottleneck in your Go application:

  1. Profile first — make sure regex is actually the problem
  2. Benchmark your specific patterns — performance varies by pattern type
  3. Check match counts — coregex.FindAll() must match regexp.FindAll() exactly
  4. Report issues — we fixed #137 (88,000x regression) within 24 hours
```shell
# Quick benchmark
go get github.com/coregx/coregex@v0.12.13
COREGEX_DEBUG=1 go test -bench=. -benchmem your-package
```



The most humbling moment? Seeing ANDL CX, AX in our FatTeddy ASM and realizing one wrong instruction had been silently dropping 23,000 matches. The most satisfying? Seeing coregex 1.6x faster than Rust on the IP pattern that started this whole journey.

Built by @kolkov as part of CoreGX — production Go libraries.
