🔥 Updated January 5, 2026: v0.9.1 released! New features: DigitPrefilter (154x on IP patterns, 2.7x faster than Rust!), Aho-Corasick (113x on multi-literal patterns), adaptive prefilter switching. 13 strategies, ~38K LOC.
I've been writing Go for years. Love the language. But there's one thing that always bothered me: regex performance.
Recently I was processing large text files (logs, traces, datasets) and kept hitting the same wall - regexp.Find() consuming 70-80% of CPU time. Not on complex patterns. On simple stuff like .*error.*connection.*.
So I did what any reasonable developer would do: spent months building coregex, a drop-in replacement for Go's regexp that's 3-3000x faster while keeping the same O(n) guarantees.
Here's how and why.
The Problem
Let's be direct: Go's regexp is not optimized for performance.
It runs a Thompson-style NFA simulation for everything - great for correctness (no ReDoS, guaranteed O(n) matching), terrible for speed:
// Searching for "error" in a 1MB file
re := regexp.MustCompile(`.*error.*`)
start := time.Now()
re.Find(data)
fmt.Println(time.Since(start)) // ~27ms
Why so slow?
- No SIMD - processes bytes sequentially
- No prefilters - scans entire input even when pattern clearly won't match
- Single engine - uses same algorithm for all patterns
- Allocation overhead in hot paths
Compare with Rust's regex crate on the same pattern: 21µs. That's roughly 1,285x faster.
The gap is real. And fixable.
The Rust Rabbit Hole
I started digging into how Rust achieves this. Turns out, they use multi-engine architecture:
- Literal extraction: pull fixed strings out of the pattern (.*error.* → literal "error")
- SIMD prefilter: use AVX2 to find candidates (12x faster byte search)
- Strategy selection: pick the optimal engine per pattern
  - DFA for simple patterns
  - NFA for complex patterns
  - Reverse search for suffix patterns
  - Specialized engines for common cases
- Zero allocations: object pools, preallocated buffers
Go has none of this. The regexp package is over a decade old and hasn't changed much.
Could I bring this to Go? Challenge accepted.
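To give a flavor of the first step, here's how literal extraction can be sketched with Go's standard regexp/syntax package. This is my own minimal illustration (requiredLiteral is a made-up name, not coregex API) and it only handles a flat concatenation like `.*error.*`:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// requiredLiteral walks a parsed pattern looking for a literal that every
// match must contain. Minimal sketch: it only inspects a top-level concat
// for an OpLiteral child, which covers patterns like `.*error.*`.
func requiredLiteral(pattern string) (string, bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return "", false
	}
	if re.Op == syntax.OpConcat {
		for _, sub := range re.Sub {
			if sub.Op == syntax.OpLiteral {
				return string(sub.Rune), true
			}
		}
	}
	return "", false
}

func main() {
	lit, ok := requiredLiteral(`.*error.*`)
	fmt.Println(lit, ok) // error true
}
```

Once you have a required literal, a cheap substring search can rule out most of the input before any regex engine runs.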
Building coregex - The Journey
Phase 1: SIMD Primitives
Started with the foundation. Go's bytes.IndexByte is okay. AVX2 is better.
Wrote assembly:
// Find byte in slice using AVX2 (32 bytes parallel)
TEXT ·memchrAVX2(SB), NOSPLIT, $0-40
MOVQ haystack+0(FP), DI
MOVQ len+8(FP), CX
MOVBQZX needle+24(FP), AX
// Broadcast needle to YMM0
VMOVD AX, X0
VPBROADCASTB X0, Y0
loop:
VMOVDQU (DI), Y1 // Load 32 bytes
VPCMPEQB Y0, Y1, Y2 // Compare (parallel)
VPMOVMSKB Y2, AX // Extract mask
TESTL AX, AX
JNZ found
// ... continue
Benchmark (finding 'e' in 64KB):
- stdlib: 8,367 ns
- coregex: 679 ns
- 12.3x faster
Then memmem (substring search with paired-byte SIMD), then teddy (multi-pattern SIMD).
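To give a flavor of the paired-byte idea, here's a pure-Go sketch (pairedByteIndex is my own illustrative name, not the coregex API): scan for one byte of the needle, confirm a second byte at a fixed offset, and only then pay for a full comparison. The SIMD version runs both probes across 16-32 lanes at once.

```go
package main

import (
	"bytes"
	"fmt"
)

// pairedByteIndex finds needle in haystack using a two-probe heuristic:
// locate the needle's first byte, check that its last byte also lines up,
// and only then do the full comparison.
func pairedByteIndex(haystack, needle []byte) int {
	if len(needle) == 0 {
		return 0
	}
	first, last := needle[0], needle[len(needle)-1]
	for i := 0; i+len(needle) <= len(haystack); {
		// Only positions where a full match could still fit.
		j := bytes.IndexByte(haystack[i:len(haystack)-len(needle)+1], first)
		if j < 0 {
			return -1
		}
		i += j
		// Second probe: the last byte must match before the full compare.
		if haystack[i+len(needle)-1] == last && bytes.Equal(haystack[i:i+len(needle)], needle) {
			return i
		}
		i++
	}
	return -1
}

func main() {
	fmt.Println(pairedByteIndex([]byte("connection error in log"), []byte("error"))) // 11
}
```

The second probe is what makes this fast in practice: a single-byte scan for 'e' hits constantly in English text, but 'e' and 'r' five bytes apart is much rarer.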
Phase 2: Strategy Selection
Not all patterns are equal. Suffix patterns need reverse search. Inner literal patterns need bidirectional search.
Built meta-engine with 13 strategies:
func selectStrategy(pattern *syntax.Regexp, info *PatternInfo) Strategy {
	// Each pattern gets the optimal engine
	switch {
	case info.HasSuffix && len(info.Suffix) >= 3:
		return UseReverseSuffix // up to 1000x for .*\.txt
	case info.HasInner && len(info.Inner) >= 3:
		return UseReverseInner // up to 3000x for .*error.*
	case info.IsDigitLead:
		return UseDigitPrefilter // 154x for IP patterns
	case info.IsLiteralAlt && len(info.Literals) > 8:
		return UseAhoCorasick // 113x for many literals
	case info.IsLiteralAlt && len(info.Literals) <= 8:
		return UseTeddy // 242x for few literals
	// ... 8 more strategies
	default:
		return UsePikeVM // safe fallback for everything else
	}
}
Phase 3: ReverseInner - The Killer Optimization
This is where it got interesting.
Problem: Pattern .*error.*connection.* on large file.
stdlib approach:
for each byte of input:
	advance every live NFA state through .*error.*connection.*
(linear time, but a large constant per byte - and the entire input is scanned)
coregex approach:
1. Find "connection" with SIMD → 200ns (not 27ms!)
2. From "connection", scan backward for "error" → fast DFA
3. From "connection", scan forward for .* → fast DFA
4. First match = done (leftmost-first semantics)
Benchmark (250KB file):
stdlib: 12.6ms
coregex: 4µs
speedup: 3,154x
Not 3x. Not 30x. Three thousand times faster.
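The steps above can be sketched in pure Go. This is a deliberately simplified illustration (hard-coded literals, single-line input assumed, reverseInnerFind is my own name), not coregex's implementation:

```go
package main

import (
	"bytes"
	"fmt"
)

// reverseInnerFind sketches the ReverseInner idea for `.*error.*connection.*`:
// locate the rare inner literal "connection" first, then verify backward
// that "error" precedes it. Assumes single-line input, so a successful
// match spans the whole haystack under greedy .* semantics.
func reverseInnerFind(haystack []byte) (start, end int, ok bool) {
	pos := 0
	for {
		i := bytes.Index(haystack[pos:], []byte("connection"))
		if i < 0 {
			return 0, 0, false
		}
		i += pos
		// Backward scan: "error" must end at or before "connection" starts.
		if bytes.LastIndex(haystack[:i], []byte("error")) >= 0 {
			return 0, len(haystack), true
		}
		pos = i + 1 // candidate failed; look for the next "connection"
	}
}

func main() {
	data := []byte("2026-01-05 error: lost connection to db")
	s, e, ok := reverseInnerFind(data)
	fmt.Println(s, e, ok) // 0 39 true
}
```

The payoff is that the expensive verification only runs near candidate positions, and on inputs with no "connection" at all, the search is a single SIMD substring scan.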
Phase 4: DigitPrefilter - Beating Rust
IP address patterns like \d+\.\d+\.\d+\.\d+ were slow everywhere. Even in Rust.
Insight: Digits are rare in most text. SIMD scan for [0-9] first, then verify.
// scanDigits returns the index of the first ASCII digit, or -1.
// The AVX2 version does this 32 bytes at a time with a range comparison
// (byte >= '0' && byte <= '9'); this is the scalar equivalent.
func scanDigits(haystack []byte) int {
	for i, b := range haystack {
		if b >= '0' && b <= '9' {
			return i
		}
	}
	return -1
}
Result on IP validation (6MB input):
| Engine | Time | vs stdlib |
|---|---|---|
| Go stdlib | 493 ms | baseline |
| coregex | 3.2 ms | 154x faster |
| Rust regex | 12 ms | 41x faster |
coregex is 2.7x faster than Rust on this pattern. First time beating Rust!
But there was a catch: on dense digit data (many digits, few matches), prefilter hurt performance. Solution? Adaptive switching:
const adaptiveThreshold = 64
// After 64 consecutive false positives, switch to DFA
if consecutiveFPs >= adaptiveThreshold {
return e.findWithDFA(haystack, pos) // Abandon prefilter
}
This mirrors an insight from Rust's regex crate: a prefilter with a high false-positive rate makes the search slower, not faster.
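Putting the prefilter and the adaptive fallback together, a simplified sketch might look like this. Everything here (verify, findAdaptive, findScalar) is a stand-in I made up for illustration; in the real engine the fallback is a DFA, not a scalar loop:

```go
package main

import (
	"bytes"
	"fmt"
)

const adaptiveThreshold = 64

// verify stands in for the real verification step (the pattern engine):
// here a candidate digit only "matches" if it is followed by a dot.
func verify(haystack []byte, pos int) bool {
	return pos+1 < len(haystack) && haystack[pos+1] == '.'
}

// findAdaptive uses a cheap digit prefilter until it has produced
// adaptiveThreshold consecutive false positives, then abandons it
// and falls back to a plain scan for the rest of the input.
func findAdaptive(haystack []byte) int {
	consecutiveFPs := 0
	pos := 0
	for pos < len(haystack) {
		i := bytes.IndexAny(haystack[pos:], "0123456789")
		if i < 0 {
			return -1
		}
		pos += i
		if verify(haystack, pos) {
			return pos
		}
		consecutiveFPs++
		if consecutiveFPs >= adaptiveThreshold {
			return findScalar(haystack, pos) // dense digits: prefilter is wasted work
		}
		pos++
	}
	return -1
}

// findScalar is the non-prefiltered fallback path.
func findScalar(haystack []byte, pos int) int {
	for ; pos < len(haystack); pos++ {
		if haystack[pos] >= '0' && haystack[pos] <= '9' && verify(haystack, pos) {
			return pos
		}
	}
	return -1
}

func main() {
	fmt.Println(findAdaptive([]byte("retry after 3.5 seconds"))) // 12
}
```

On sparse text the prefilter skips almost everything; on digit-dense data the counter trips quickly and the search degrades gracefully instead of thrashing.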
Phase 5: Aho-Corasick - Multi-Pattern at Scale
Patterns like error|warning|fatal|critical|panic|timeout|... with many alternatives were slow.
Integrated an Aho-Corasick automaton.
Result (6MB input, 12 literal patterns):
| Engine | Time | vs stdlib |
|---|---|---|
| Go stdlib | 473 ms | baseline |
| coregex | 4.2 ms | 113x faster |
| Rust regex | 0.7 ms | — |
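For readers unfamiliar with the algorithm, here's a compact, illustrative Aho-Corasick automaton in pure Go - a teaching sketch, not coregex's tuned implementation. It builds a byte trie with failure links and reports the first position where any pattern ends:

```go
package main

import "fmt"

// acNode is one state of the automaton: trie edges, a failure link,
// and a flag marking that some pattern ends at this state.
type acNode struct {
	next map[byte]*acNode
	fail *acNode
	out  bool
}

// buildAC inserts all patterns into a trie, then sets failure links
// with a breadth-first traversal.
func buildAC(patterns []string) *acNode {
	root := &acNode{next: map[byte]*acNode{}}
	for _, p := range patterns {
		n := root
		for i := 0; i < len(p); i++ {
			c := p[i]
			if n.next[c] == nil {
				n.next[c] = &acNode{next: map[byte]*acNode{}}
			}
			n = n.next[c]
		}
		n.out = true
	}
	var queue []*acNode
	for _, child := range root.next {
		child.fail = root
		queue = append(queue, child)
	}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		for c, child := range n.next {
			f := n.fail
			for f != nil && f.next[c] == nil {
				f = f.fail
			}
			if f == nil {
				child.fail = root
			} else {
				child.fail = f.next[c]
			}
			child.out = child.out || child.fail.out // inherit matches via fail links
			queue = append(queue, child)
		}
	}
	return root
}

// findAny scans text once and returns the index just past the first
// match of any pattern, or -1.
func findAny(root *acNode, text []byte) int {
	n := root
	for i, c := range text {
		for n != root && n.next[c] == nil {
			n = n.fail // follow failure links on mismatch
		}
		if n.next[c] != nil {
			n = n.next[c]
		}
		if n.out {
			return i + 1
		}
	}
	return -1
}

func main() {
	ac := buildAC([]string{"error", "warning", "fatal"})
	fmt.Println(findAny(ac, []byte("2026: fatal disk failure"))) // 11
}
```

The key property: the input is scanned exactly once regardless of how many patterns there are, which is why it scales where an alternation of literals does not.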
Phase 6: Zero Allocations
Hot paths must not allocate. Profiled with pprof, eliminated every allocation:
- Object pooling for DFA states
- Preallocated buffers
- Careful escape analysis
- Stack-only data structures
Result:
BenchmarkIsMatch-8 1000000 1.2 µs/op 0 B/op 0 allocs/op
Zero.
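The object-pooling part can be sketched with sync.Pool: borrow scratch memory on entry, return it on exit, and reuse slice capacity so the hot path never allocates. scratch and isMatch here are illustrative stand-ins I invented, not coregex internals:

```go
package main

import (
	"fmt"
	"sync"
)

// scratch is hypothetical per-search working memory (e.g. DFA state sets).
type scratch struct {
	states []int
}

var scratchPool = sync.Pool{
	New: func() any { return &scratch{states: make([]int, 0, 64)} },
}

// isMatch sketches an allocation-free hot path: borrow scratch from the
// pool, reuse its capacity, and return it when done. The matching logic
// itself is stubbed out.
func isMatch(text []byte) bool {
	s := scratchPool.Get().(*scratch)
	defer scratchPool.Put(s)
	s.states = s.states[:0] // reset length, keep capacity: no allocation
	// ... run the engine using s.states ...
	return len(text) > 0
}

func main() {
	fmt.Println(isMatch([]byte("hello"))) // true
}
```

Verifying this is mechanical: run the benchmark with -benchmem and check that the allocs/op column stays at zero.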
The Architecture
Pattern → Parse → NFA → Literal Extract → Strategy Select
↓
┌──────────────────────────────────────┐
│ 13 Strategies: │
│ • ReverseInner (3000x) │
│ • DigitPrefilter (154x, beats Rust) │
│ • ReverseSuffix (1000x) │
│ • AhoCorasick (113x) │
│ • Teddy SIMD (242x) │
│ • CharClassSearcher (23x) │
│ • LazyDFA, PikeVM, OnePass... │
└──────────────────────────────────────┘
↓
Input → Prefilter (SIMD) → Engine → Match Result
SIMD Primitives (AMD64):
- memchr - single-byte search (AVX2, 32 bytes per comparison)
- memmem - substring search with a paired-byte heuristic (SSSE3)
- teddy - multi-pattern search (SSSE3)
- digit scanner - SIMD range check for [0-9]
Pure Go fallback on other architectures.
Cross-Language Benchmarks
Real benchmarks on 6MB input (source):
| Pattern | Go stdlib | coregex | Rust regex | vs stdlib | vs Rust |
|---|---|---|---|---|---|
| IP validation | 493 ms | 3.2 ms | 12 ms | 154x | 2.7x faster! |
| Inner .*keyword.* | 231 ms | 1.9 ms | 0.6 ms | 122x | Rust 3x |
| Suffix .*\.txt | 233 ms | 1.8 ms | 1.4 ms | 127x | ~tie |
| Literal alternation | 473 ms | 4.2 ms | 0.7 ms | 113x | Rust 6x |
| Email validation | 259 ms | 1.7 ms | 1.3 ms | 155x | ~tie |
| URL extraction | 266 ms | 2.8 ms | 0.9 ms | 96x | Rust 3x |
Key insight: coregex now beats Rust on IP/digit patterns. Competitive on most others.
API - Drop-In Replacement
// Change one line
// import "regexp"
import "github.com/coregx/coregex"
re := coregex.MustCompile(`.*error.*`)
re.Find(data) // 3000x faster, same API
Full compatibility:
- ✅ Find, FindAll, Match, MatchString
- ✅ Capture groups with FindSubmatch
- ✅ Named captures with SubexpNames()
- ✅ Replace, ReplaceAll, Split
- ✅ Unicode support via regexp/syntax
Zero-allocation APIs:
// Zero allocations - returns bool
matched := re.IsMatch(text)
// Zero allocations - returns (start, end, found)
start, end, found := re.FindIndices(text)
When to Use It
Use coregex if:
- Regex is a bottleneck (profile first!)
- You have patterns with wildcards (.*error.*)
- You parse IP addresses, phone numbers, versions
- You do case-insensitive matching at scale
- You parse logs, traces, or large text files
Stick with stdlib if:
- Regex is not your bottleneck
- You need maximum API stability (coregex is v0.9, pre-1.0)
- You want minimal dependencies
Performance by pattern type:
- IP/phone patterns: 40-154x (beats Rust!)
- Suffix patterns (.*\.txt): 100-1500x
- Inner patterns (.*error.*): 100-3000x
- Literal alternations: 15-242x
- Case-insensitive: 100-300x
- Simple patterns: 2-10x
Battle-Tested
coregex was tested in GoAWK. This real-world testing uncovered 15+ edge cases that synthetic benchmarks missed:
- Unicode boundary handling
- Longest-match mode compatibility
- Empty match semantics
- Capture group edge cases
All fixed. All tested.
Real-World: uawk - Ultra AWK
We built uawk — a modern AWK interpreter powered by coregex. Results speak for themselves:
Benchmarks vs GoAWK (10MB dataset):
| Pattern | GoAWK | uawk | Speedup |
|---|---|---|---|
| Regex alternation | 1.85s | 97ms | 19x |
| IP matching | 290ms | 99ms | 2.9x |
| General regex | 320ms | 100ms | 3.2x |
uawk wins all 10 benchmarks. This is coregex in production — not synthetic tests.
Features:
- POSIX AWK compatible
- Stack-based VM with 80+ bytecode ops
- Zero CGO, cross-platform
- Embeddable Go API
# Install
go install github.com/kolkov/uawk/cmd/uawk@latest
# Use (same as awk)
uawk '/error/ { print $0 }' server.log
We need more testers! If you have a project using regexp, try coregex and report issues.
Trying It Out
go get github.com/coregx/coregex@latest
Run benchmarks on your patterns. If it's faster, great. If not, file an issue - I want to know why.
Project stats:
- ~87% test coverage (main package)
- Zero known bugs
- ~38K lines of code
- 13 specialized strategies
- MIT licensed
Current version: v0.9.1
Roadmap:
- v1.0.0: API stability guarantee
- v1.1.0+: ARM NEON SIMD (waiting for Go native SIMD proposal)
Technical Deep Dive: SIMD Assembly
If you're into this stuff, here's how AVX2 memchr works:
// Find byte 'e' in haystack (processes 32 bytes in parallel)
VPBROADCASTB needle, YMM0 // Broadcast to 256-bit register
loop:
VMOVDQU (haystack), YMM1 // Load 32 bytes (unaligned)
VPCMPEQB YMM0, YMM1, YMM2 // Compare all 32 at once
VPMOVMSKB YMM2, RAX // Extract bitmask
TEST RAX, RAX // Any matches?
JNZ found // Jump if found
ADD haystack, 32 // Next 32 bytes (length check omitted here)
JMP loop
found:
TZCNT RAX, RAX // First set bit = offset within the 32-byte chunk
ADD result, RAX // result = bytes already scanned + offset
VZEROUPPER // Critical! Avoids the AVX-SSE transition penalty
RET
This is why it's 12x faster than Go's byte-by-byte approach.
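The same "compare many bytes at once" idea also works without assembly. Here's a portable SWAR (SIMD-within-a-register) sketch that checks 8 bytes per step using the classic zero-byte bit trick - my own fallback-style illustration, not coregex's actual pure-Go path:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// hasZeroByte reports whether any byte of x is zero
// (the classic "word has a zero byte" bit trick).
func hasZeroByte(x uint64) bool {
	return (x-0x0101010101010101)&^x&0x8080808080808080 != 0
}

// memchrSWAR finds needle in haystack 8 bytes at a time: XOR turns
// matching bytes into zero bytes, which hasZeroByte detects; a final
// scalar loop pins down the exact index.
func memchrSWAR(haystack []byte, needle byte) int {
	pattern := uint64(needle) * 0x0101010101010101 // needle in every byte lane
	i := 0
	for ; i+8 <= len(haystack); i += 8 {
		word := binary.LittleEndian.Uint64(haystack[i:])
		if hasZeroByte(word ^ pattern) {
			break // a match is somewhere in these 8 bytes
		}
	}
	for ; i < len(haystack); i++ {
		if haystack[i] == needle {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(memchrSWAR([]byte("hello world"), 'w')) // 6
}
```

SWAR won't touch AVX2's 32-byte lanes, but it's a reasonable shape for the pure-Go fallback on architectures without hand-written assembly.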
Lessons Learned
- Profile before optimizing - but don't stop at profiling
- SIMD is accessible - Go's asm syntax is surprisingly clean
- Algorithm matters more than micro-opts - ReverseInner is 3000x not because of clever asm, but because of smart strategy
- Adaptive algorithms win - static heuristics fail on edge cases, runtime adaptation works
- Zero allocs is achievable - requires discipline but pays off
- Compatibility is hard - matching stdlib behavior has edge cases (GoAWK found 15!)
- You CAN beat Rust - on specific patterns, Go + SIMD + smart algorithms wins
Open Source & Contributions
Source: github.com/coregx/coregex
PRs welcome for:
- ARM NEON implementation
- Additional SIMD primitives
- Test coverage improvements
- Documentation
Not welcome:
- "Rewrite in Rust" (this is a Go library)
- Breaking API changes pre-1.0
- Features without benchmarks
Why I Built This
Honest answer: I wanted to learn.
How do modern regex engines work? Can I replicate Rust's performance in Go? What are the limits of Go's runtime?
Turned out the answer was: Yes, you can, and on some patterns you can even beat Rust if you use the right algorithms.
Also, I needed this for my own projects. I maintain racedetector (pure-Go race detection) and other tools that parse a lot of text. This helps those projects too.
The Bottom Line
Go's regexp is fine for most use cases. But if regex is your bottleneck, you have options now.
coregex isn't magic - it's just better algorithms + SIMD + adaptive strategies. The techniques are well-known (Rust uses them, RE2 uses parts of them). I just brought them to Go - and on some patterns, made them faster than Rust.
If you're spending significant CPU time in regex, benchmark coregex. If it helps, use it. If you find bugs or slowness, report them.
Links:
- GitHub
- uawk — Ultra AWK interpreter powered by coregex
- Benchmarks
- Discussions
Related Go Issues:
- golang/go#26623 — Go regexp performance discussion (2018, still open)
- golang/go#76818 — Upstream path proposal
Star if useful. File issues if broken. PRs if you want to contribute.
P.S. The most fun part? Writing AVX2 assembly at 2am and having it actually work. Second most fun? Seeing "coregex 2.7x faster than Rust" in benchmark output and thinking "wait, that can't be right" (it was).
Built by @kolkov as part of CoreGX - Production Go libraries.
Top comments (2)
Really curious to use this package.
Hi! Yes, this library needs real users. Working with Ben Hoyt, we identified numerous bugs through the GoAWK integration and significantly improved performance on certain patterns and small datasets.