I've been writing Go for years. Love the language. But there's one thing that always bothered me: regex performance.
Recently I was processing large text files (logs, traces, datasets) and kept hitting the same wall: `Regexp.Find` consuming 70-80% of CPU time. Not on complex patterns - on simple stuff like `.*error.*connection.*`.
So I did what any reasonable developer would do: spent 6 months building coregex, a drop-in replacement for Go's regexp that's 3-3000x faster while keeping the same O(n) guarantees.
Here's how and why.
The Problem
Let's be direct: Go's regexp is not optimized for performance.
It uses a Thompson NFA simulation exclusively - great for correctness (no ReDoS, guaranteed O(n)), terrible for speed:
// Searching for "error" in 1MB file
re := regexp.MustCompile(`.*error.*`)
start := time.Now()
re.Find(data) // 27ms
Why so slow?
- No SIMD - processes bytes sequentially
- No prefilters - scans entire input even when pattern clearly won't match
- Single engine - uses same algorithm for all patterns
- Allocation overhead in hot paths
Compare with Rust's regex crate on the same pattern: 21µs. That's roughly 1,285x faster.
The gap is real. And fixable.
The Rust Rabbit Hole
I started digging into how Rust achieves this. Turns out, they use multi-engine architecture:
- Literal extraction: pull fixed strings out of the pattern (`.*error.*` → literal "error")
- SIMD prefilter: use AVX2 to find candidates (12x faster byte search)
- Strategy selection: pick the optimal engine per pattern
  - DFA for simple patterns
  - NFA for complex patterns
  - Reverse search for suffix patterns
  - Specialized engines for common cases
- Zero allocations: object pools, preallocated buffers
Go has none of this. The regexp package is over a decade old and hasn't changed much.
Could I bring this to Go? Challenge accepted.
Building coregex - Month by Month
Month 1-2: SIMD Primitives
Started with the foundation. Go's bytes.IndexByte is okay. AVX2 is better.
Wrote assembly:
// Find byte in slice using AVX2 (32 bytes parallel)
TEXT ·memchr(SB), NOSPLIT, $0-40
MOVQ haystack+0(FP), DI
MOVQ len+8(FP), CX
MOVBQZX needle+24(FP), AX
// Broadcast needle to YMM0
VMOVD AX, X0
VPBROADCASTB X0, Y0
loop:
VMOVDQU (DI), Y1 // Load 32 bytes
VPCMPEQB Y0, Y1, Y2 // Compare (parallel)
VPMOVMSKB Y2, AX // Extract mask
TESTL AX, AX
JNZ found
// ... continue
Benchmark (finding 'e' in 64KB):
- stdlib: 8,367 ns
- coregex: 679 ns
- 12.3x faster
Then memmem (substring search), then teddy (multi-pattern SIMD).
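For contrast, the scalar loop that a SIMD kernel replaces is the naive byte-at-a-time search - one comparison per iteration instead of 32 in parallel. A sketch (stdlib `bytes.IndexByte` is itself assembly-backed, so this is purely for illustration):

```go
package main

import "fmt"

// memchrScalar is the one-byte-at-a-time search the AVX2 kernel replaces:
// it inspects a single byte per iteration instead of 32 in parallel.
func memchrScalar(haystack []byte, needle byte) int {
	for i, b := range haystack {
		if b == needle {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(memchrScalar([]byte("coregex"), 'e')) // → 3
}
```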
Month 3-4: Strategy Selection
Not all patterns are equal. Suffix patterns need reverse search. Inner literal patterns need bidirectional search.
Built meta-engine:
func selectStrategy(pattern *syntax.Regexp) Strategy {
// Extract literals
prefix := extractPrefix(pattern)
suffix := extractSuffix(pattern)
inner := extractInner(pattern)
// Choose optimal strategy
if len(suffix) >= 3 {
return UseReverseSuffix // 1000x for .*\.txt
}
if len(inner) >= 3 {
return UseReverseInner // 3000x for .*error.*
}
if len(prefix) >= 3 {
return UseDFA // 5-10x for prefix.*
}
return UseAdaptive // Try DFA, fallback NFA
}
Each pattern gets the right tool for the job.
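The literal-extraction step above can be sketched with the stdlib's own parser. `extractInner` here is a simplified stand-in I wrote for this post, not coregex's actual extractor - it only handles a top-level concatenation:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// extractInner pulls the first required literal out of a top-level
// concatenation - a simplified stand-in for a real literal extractor.
func extractInner(pattern string) string {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return ""
	}
	re = re.Simplify()
	if re.Op != syntax.OpConcat {
		return ""
	}
	for _, sub := range re.Sub {
		if sub.Op == syntax.OpLiteral {
			return string(sub.Rune)
		}
	}
	return ""
}

func main() {
	fmt.Println(extractInner(`.*error.*`)) // → error
}
```

The parsed tree for `.*error.*` is `Concat(Star(AnyCharNotNL), Literal("error"), Star(AnyCharNotNL))`, so the required literal falls out directly.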
Month 5: ReverseInner - The Killer Optimization
This is where it got interesting.
Problem: Pattern .*error.*connection.* on large file.
stdlib approach:
for each position:
try to match .*error.*connection.*
(scan entire input, backtracking, slow)
coregex approach:
1. Find "connection" with SIMD → 200ns (not 27ms!)
2. From "connection", scan backward for "error" → fast DFA
3. From "connection", scan forward for .* → fast DFA
4. First match = done (leftmost-first semantics)
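The steps above can be sketched with stdlib primitives standing in for the fast parts: `bytes.Index` plays the role of the SIMD prefilter and `bytes.LastIndex` the reverse DFA scan. This is a simplification (real DFAs verify the full pattern, not just literals):

```go
package main

import (
	"bytes"
	"fmt"
)

// reverseInnerFind locates ".*A.*B.*"-style matches by anchoring on the
// inner literal B first, then verifying A backwards from the candidate.
func reverseInnerFind(data []byte, first, second []byte) (int, bool) {
	pos := 0
	for {
		// 1. Prefilter: find a candidate occurrence of the second literal.
		i := bytes.Index(data[pos:], second)
		if i < 0 {
			return 0, false
		}
		i += pos
		// 2. Reverse scan: does the first literal occur before it?
		if j := bytes.LastIndex(data[:i], first); j >= 0 {
			return j, true
		}
		pos = i + 1 // no luck - try the next candidate
	}
}

func main() {
	log := []byte("ok ok error: lost connection to db")
	start, ok := reverseInnerFind(log, []byte("error"), []byte("connection"))
	fmt.Println(start, ok) // → 6 true
}
```

The work done is proportional to the number of candidates, not the input length - which is the whole trick.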
Benchmark (250KB file):
stdlib: 12.6ms
coregex: 4µs
speedup: 3,154x
Not 3x. Not 30x. Three thousand times faster.
Month 6: Zero Allocations
Hot paths must not allocate. Profiled with pprof, eliminated every allocation:
- Object pooling for DFA states
- Preallocated buffers
- Careful escape analysis
- Stack-only data structures
Result:
BenchmarkIsMatch-8 1000000 1.2 µs/op 0 B/op 0 allocs/op
Zero.
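The usual way to get there in Go is sync.Pool; this is a sketch of the pattern, not coregex's actual pooling code. Note the `*[]int` indirection - storing a bare slice in a Pool boxes the slice header and allocates on every Put:

```go
package main

import (
	"fmt"
	"sync"
)

// statePool reuses scratch buffers so the hot path stops allocating
// after warm-up. It stores *[]int rather than []int to avoid boxing
// the slice header on every Put.
var statePool = sync.Pool{
	New: func() any {
		s := make([]int, 0, 256)
		return &s
	},
}

// countByte stands in for a hot matching routine that needs scratch space.
func countByte(data []byte, b byte) int {
	sp := statePool.Get().(*[]int)
	states := (*sp)[:0]
	for i := range data {
		if data[i] == b {
			states = append(states, i)
		}
	}
	n := len(states)
	*sp = states[:0] // keep the (possibly grown) backing array
	statePool.Put(sp)
	return n
}

func main() {
	fmt.Println(countByte([]byte("some error text"), 'e')) // → 3
}
```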
How It Actually Works
Let's trace .*error.*connection.* step by step.
Input: 250KB log file with "connection" appearing 5 times.
stdlib (NFA):
Position 0: Try .*error.*connection.* → fail
Position 1: Try .*error.*connection.* → fail
...
Position 250000: Try .*error.*connection.* → fail
Time: 12.6ms
coregex (ReverseInner):
SIMD scan: Find "connection" → positions [1245, 5678, ...]
At position 1245:
Reverse DFA: Scan backward, find "error" at 1230 ✓
Forward DFA: Scan forward, match .* ✓
Return [1230, 1260]
Time: 4µs
We don't scan 250,000 positions. We check 5 candidates (SIMD-found literals). That's why it's 3000x faster.
The Architecture
┌─────────────────────────┐
│ Meta-Engine │
│ (Strategy Selection) │
└────────┬────────────────┘
│
┌────┴────┐
│Prefilter│──► memchr (AVX2)
└────┬────┘──► memmem (SIMD)
│ ──► teddy (multi-pattern)
│
┌────────┴──────────────┐
│ ┌────┐ ┌────┐ ┌───┐│
│ │DFA │ │NFA │ │Rev││
│ └────┘ └────┘ └───┘│
└───────────────────────┘
Simple idea: Use specialized algorithms for specialized patterns.
Real Benchmarks
Here's what matters - patterns you actually use:
| Pattern | Size | stdlib | coregex | Speedup |
|---|---|---|---|---|
| `.*\.txt$` | 1MB | 27ms | 21µs | 1,314x |
| `.*error.*` | 250KB | 12.6ms | 4µs | 3,154x |
| `(?i)error` | 32KB | 1.23ms | 4.7µs | 263x |
| `\w+@\w+\.\w+` | 1KB | 688ns | 196ns | 3.5x |
Even simple patterns are faster.
API - Drop-In Replacement
// Change one line
// import "regexp"
import "github.com/coregx/coregex"
re := coregex.MustCompile(`.*error.*`)
re.Find(data) // same API, up to 3,000x faster on favorable patterns
Full compatibility:
- ✅ `Find`, `FindAll`, `Match`, `MatchString`
- ✅ Capture groups with `FindSubmatch`
- ✅ Named captures with `SubexpNames()`
- ✅ `Replace`, `ReplaceAll`, `Split`
- ✅ Unicode support via `regexp/syntax`
One deliberate omission: no backreferences (they require backtracking → exponential worst case). Stdlib regexp doesn't support them either, so nothing is lost in a migration.
When to Use It
Use coregex if:
- Regex is a bottleneck (profile first!)
- You have patterns with wildcards (`.*error.*`)
- You do case-insensitive matching at scale
- You parse logs, traces, or large text files
- Performance matters
Stick with stdlib if:
- Regex is not your bottleneck
- You need maximum API stability (coregex is v0.8, pre-1.0)
- You want zero dependencies
Performance by pattern type:
- Suffix patterns (`.*\.txt`): 1000-1500x
- Inner patterns (`.*error.*`): 2000-3000x
- Case-insensitive: 100-300x
- Simple patterns: 2-10x
Part of CoreGX
coregex is the first release from CoreGX - an org I created for high-quality Go libraries.
Mission: Build production-grade tools that should exist but don't.
Why? Go stdlib prioritizes simplicity and correctness. Great for most cases. But sometimes you need performance without sacrificing correctness. That's CoreGX.
Future projects:
- JSON parser with SIMD
- Advanced concurrency primitives
- Zero-copy data structures
All MIT. All production-ready. All built for real use.
Trying It Out
go get github.com/coregx/coregex@v0.8.0
Run benchmarks on your patterns. If it's faster, great. If not, file an issue - I want to know why.
Project stats:
- 88.3% test coverage
- Zero known bugs
- 18.5K lines of code
- 6 months development
- MIT licensed
Roadmap:
- v0.9.0 (Q1 2026): Look-around assertions
- v1.0.0 (Q2 2026): API stability guarantee
- v1.1.0+: ARM NEON SIMD
Technical Deep Dive: SIMD Assembly
If you're into this stuff, here's how AVX2 memchr works:
// Find byte 'e' in haystack (processes 32 bytes in parallel)
VPBROADCASTB needle, YMM0 // Broadcast to 256-bit register
loop:
VMOVDQU (haystack), YMM1 // Load 32 bytes (unaligned)
VPCMPEQB YMM0, YMM1, YMM2 // Compare all 32 at once
VPMOVMSKB YMM2, RAX // Extract bitmask
TEST RAX, RAX // Any matches?
JNZ found // Jump if found
ADD haystack, 32 // Next 32 bytes
JMP loop
found:
TZCNT RAX, RAX // Index of first set bit = match offset within the 32 bytes
ADD haystack, RAX // Absolute position = current chunk start + offset
This is why it's 12x faster than Go's byte-by-byte approach.
Lessons Learned
- Profile before optimizing - but don't stop at profiling
- SIMD is accessible - Go's asm syntax is surprisingly clean
- Algorithm matters more than micro-opts - ReverseInner is 3000x not because of clever asm, but because of smart strategy
- Zero allocs is achievable - requires discipline but pays off
- Compatibility is hard - matching stdlib behavior has edge cases
Open Source & Contributions
Source: github.com/coregx/coregex
PRs welcome for:
- ARM NEON implementation
- Additional SIMD primitives
- Test coverage improvements
- Documentation
Not welcome:
- "Rewrite in Rust" (this is a Go library)
- Breaking API changes pre-1.0
- Features without benchmarks
Why I Built This
Honest answer: I wanted to learn.
How do modern regex engines work? Can I replicate Rust's performance in Go? What are the limits of Go's runtime?
Turned out the answer was: Yes, you can, and it's actually not that hard if you use the right algorithms.
Also, I needed this for my own projects. I maintain racedetector (pure-Go race detection) and other tools that parse a lot of text. This will help those projects too once v1.0 lands.
The Bottom Line
Go's regexp is fine for most use cases. But if regex is your bottleneck, you have options now.
coregex isn't magic - it's just better algorithms + SIMD. The techniques are well-known (Rust uses them, RE2 uses parts of them). I just brought them to Go.
If you're spending significant CPU time in regex, benchmark coregex. If it helps, use it. If you find bugs or slowness, report them.
Star if useful. File issues if broken. PRs if you want to contribute.
P.S. The most fun part? Writing AVX2 assembly at 2am and having it actually work. Second most fun? Seeing 3000x in benchmark output and thinking "that can't be right" (it was).
Built by @kolkov as part of CoreGX - Production Go libraries.