Andrey Kolkov
From 0 to 11 Bugs Fixed: How GoAWK Battle-Tested My 3000x Faster Regex Engine

The Best Kind of Feedback

A week ago, I published "Go's Regexp is Slow. So I Built My Own". The response was incredible - but the most valuable feedback came from Ben Hoyt, creator of GoAWK.

He didn't just read the article. He tried to actually use coregex.

"I've started integrating coregex into GoAWK... I'm finding a few issues."

That message led to one of the most productive weeks of debugging I've ever had.

11 Bugs in 7 Days

Ben's GoAWK test suite is ruthless - 1000+ regex patterns covering edge cases I never imagined. Here's what he found:

| Day | Bug | Pattern | Symptom |
|-----|-----|---------|---------|
| 1 | [^,]* | Negated char class | Crash |
| 1 | [oO]+d | Case-insensitive | Wrong match |
| 2 | ^foo | Start anchor | Matched everywhere |
| 2 | \bword\b | Word boundary | Find returned empty |
| 3 | ^ in FindAll | Anchor in loop | Matched at every position |
| 3 | Error format | - | Different from stdlib |
| 4 | \w+@... | Capture groups | DFA returned false |
| 4 | (?s:.) | Inline flags | Ignored |
| 5 | a$ | End anchor | First call wrong |
| 6 | `(#\ #!)` | Longest() | |

Each bug taught me something. Some were embarrassing oversights. Others revealed fundamental gaps in my understanding.

The Worst Bug: ^ Anchor

The start anchor (^) was my nemesis. It seemed simple - match only at position 0. But in a multi-engine architecture, "simple" gets complicated fast.

Version 1: Naively checked pos == 0. Worked for IsMatch, broke for FindAllIndex.

Version 2: Added FindAt(haystack, offset) methods. Now FindAllIndex could tell the engine "this is position 5 in the original string."

Version 3: Discovered DFA's epsilonClosure didn't respect anchors. Implemented proper LookSet following Rust's regex-automata.

Three attempts over two days. Ben kept testing. I kept fixing.
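
To make the FindAll failure concrete, here is a minimal sketch that uses Go's standard regexp package, not coregex's internals. The helpers naiveFindAll and correctFindAll are hypothetical names invented for this illustration. The naive loop re-slices the haystack on each iteration, so the engine always believes it is at position 0 and ^ matches again; the second helper passes the whole haystack, which is what an offset-aware FindAt makes possible.

```go
package main

import (
	"fmt"
	"regexp"
)

// naiveFindAll is a hypothetical, deliberately broken FindAll loop.
// It searches the re-sliced remainder, so every iteration looks like
// "position 0" to the engine and ^ can match again.
func naiveFindAll(pattern, haystack string) [][]int {
	re := regexp.MustCompile(pattern)
	var out [][]int
	pos := 0
	for pos <= len(haystack) {
		loc := re.FindStringIndex(haystack[pos:])
		if loc == nil {
			break
		}
		// Translate slice-relative offsets back to haystack offsets.
		out = append(out, []int{pos + loc[0], pos + loc[1]})
		if loc[1] == 0 {
			pos++ // avoid looping forever on an empty match at the start
		} else {
			pos += loc[1]
		}
	}
	return out
}

// correctFindAll lets the engine see the whole haystack, so ^ only
// matches at the true start of the text.
func correctFindAll(pattern, haystack string) [][]int {
	return regexp.MustCompile(pattern).FindAllStringIndex(haystack, -1)
}

func main() {
	fmt.Println(naiveFindAll("^foo", "foofoo"))   // [[0 3] [3 6]]
	fmt.Println(correctFindAll("^foo", "foofoo")) // [[0 3]]
}
```

The second match reported by the naive loop is exactly the "matched at every position" symptom from the table.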

The Sneakiest Bug: Longest()

This one was humbling. The Longest() method had existed since v0.8.2. The documentation claimed it worked. Tests passed.

It was a no-op stub.

// What I wrote (v0.8.2)
func (r *Regex) Longest() {
    // TODO: implement leftmost-longest semantics
}

// What Ben expected
re := coregex.MustCompile(`(a|ab)`)
re.Longest()
// On input "ab", the match should be "ab" (longest), not "a" (first)

AWK uses POSIX semantics (leftmost-longest). Go's stdlib uses Perl semantics (leftmost-first) by default, but Longest() switches modes. My engine only supported Perl semantics.

The fix required understanding a fundamental distinction:

Leftmost-First (Perl):   (a|ab) on "ab" → "a" (first alternative wins)
Leftmost-Longest (POSIX): (a|ab) on "ab" → "ab" (longer match wins)

Implementing this in PikeVM took 100 lines. No performance regression in default mode.
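
The two semantics can be observed directly in Go's standard library, which is the behavior a drop-in replacement has to reproduce. This is a small sketch against the stdlib regexp package only; firstVsLongest is a hypothetical helper name, and nothing here shows how coregex implements the switch internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstVsLongest returns the match for pattern on input under
// leftmost-first semantics, then under leftmost-longest semantics.
func firstVsLongest(pattern, input string) (first, longest string) {
	re := regexp.MustCompile(pattern)
	first = re.FindString(input)

	re.Longest() // switch this Regexp to leftmost-longest matching
	longest = re.FindString(input)
	return first, longest
}

func main() {
	first, longest := firstVsLongest(`(a|ab)`, "ab")
	fmt.Printf("leftmost-first:   %q\n", first)   // "a"
	fmt.Printf("leftmost-longest: %q\n", longest) // "ab"
}
```

A test like this would have failed against a no-op Longest(), because both calls would return "a".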

The Fix Velocity

| Version | Date | Fixes |
|---------|------|-------|
| v0.8.3 | Dec 4 | Negated classes, case-insensitive |
| v0.8.4 | Dec 4 | ^ anchor (professional fix) |
| v0.8.5 | Dec 5 | Word boundaries \b \B |
| v0.8.6 | Dec 7 | ^ in FindAll/ReplaceAll |
| v0.8.7 | Dec 7 | Error message format |
| v0.8.8 | Dec 7 | DFA + capture groups |
| v0.8.9 | Dec 7 | Linter compatibility |
| v0.8.10 | Dec 7 | Inline flags (?s:...) |
| v0.8.11 | Dec 8 | End anchor first-call bug |
| v0.8.12 | Dec 8 | Longest() implementation |

10 releases in 5 days. Each one made coregex more stdlib-compatible.

Performance: Still Fast

The real question: did all these fixes kill performance?

Pattern: .*connection.*
Input: 250KB log file

stdlib:   12.6 ms
coregex:  4 µs

Speedup: 3,154x (unchanged from v0.8.0)

The architectural decisions paid off. SIMD prefiltering. Lazy DFA. Strategy selection. They handle the fast path. The bug fixes lived in edge case handling - code that rarely runs.
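
As a rough illustration of why prefiltering keeps the fast path fast, here is a minimal sketch of the idea, assuming a pattern with a required literal such as "connection". It is not coregex's code: the prefiltered type and newPrefiltered are hypothetical names, strings.Contains stands in for a SIMD substring search, and the stdlib regexp package stands in for the full engine.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// prefiltered pairs a compiled regex with a literal that is assumed
// to appear in every possible match of that regex.
type prefiltered struct {
	re      *regexp.Regexp
	literal string
}

// newPrefiltered is a hypothetical constructor; the caller supplies
// the required literal instead of it being extracted automatically.
func newPrefiltered(pattern, literal string) *prefiltered {
	return &prefiltered{re: regexp.MustCompile(pattern), literal: literal}
}

// Match rejects inputs that lack the literal before running the regex.
func (p *prefiltered) Match(input string) bool {
	if !strings.Contains(input, p.literal) {
		return false // fast path: no literal, so no match is possible
	}
	return p.re.MatchString(input) // slow path: confirm with the engine
}

func main() {
	m := newPrefiltered(`.*connection.*`, "connection")
	fmt.Println(m.Match("2024-12-08 lost connection to db")) // true
	fmt.Println(m.Match("2024-12-08 all systems nominal"))   // false
}
```

Anchor and boundary handling only matters on the slow path, which is consistent with the fixes leaving the benchmark unchanged.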

Full Stdlib Compatibility

After v0.8.12, GoAWK's test suite passes completely:

$ cd goawk
$ go test ./...
ok      github.com/benhoyt/goawk    4.832s

Drop-in replacement confirmed.

// Before
import "regexp"

// After
import "github.com/coregx/coregex"

// That's it. Same API. 5-3000x faster.

What I Learned

1. Real-world testing > Unit tests

My test coverage was 88%. GoAWK found 11 bugs. Unit tests catch what you imagine. Users catch what you don't.

2. Multi-engine architecture = Multi-engine bugs

Each strategy (DFA, NFA, ReverseAnchored, OnePass) had its own edge cases. A fix in one could break another. Integration tests between engines became critical.

3. "Works on my machine" is worthless

Ben tested on different inputs, different patterns, different use cases. His AWK interpreter exercises regex in ways my benchmarks never did.

4. Fast feedback loops matter

GitHub Issues → Fix → Release → Test. Sometimes twice a day. Ben's patience and detailed bug reports made this possible.
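
Lessons 1 and 2 point toward differential testing: run every pattern against a trusted reference and against the engine under test, then report disagreements. The sketch below is one hedged way to do that, not the GoAWK or coregex test suite. The matcher type, diff, and buggyMatcher are hypothetical names; buggyMatcher simulates an engine that ignores a leading ^ by stripping it and delegating to the stdlib.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matcher returns the [start, end) of the first match, or nil.
type matcher func(pattern, input string) []int

// stdlibMatcher is the reference implementation (the oracle).
func stdlibMatcher(pattern, input string) []int {
	return regexp.MustCompile(pattern).FindStringIndex(input)
}

// buggyMatcher simulates an engine that ignores a leading ^.
func buggyMatcher(pattern, input string) []int {
	return regexp.MustCompile(strings.TrimPrefix(pattern, "^")).FindStringIndex(input)
}

// sameSpan reports whether two match results are identical.
func sameSpan(x, y []int) bool {
	if len(x) != len(y) {
		return false
	}
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// diff runs every pattern on every input through both matchers and
// describes each disagreement.
func diff(ref, cand matcher, patterns, inputs []string) []string {
	var mismatches []string
	for _, p := range patterns {
		for _, in := range inputs {
			want, got := ref(p, in), cand(p, in)
			if !sameSpan(want, got) {
				mismatches = append(mismatches,
					fmt.Sprintf("%q on %q: reference=%v candidate=%v", p, in, want, got))
			}
		}
	}
	return mismatches
}

func main() {
	patterns := []string{"^foo", "foo", "o+"}
	inputs := []string{"foo", "xfoo", "bar"}
	for _, m := range diff(stdlibMatcher, buggyMatcher, patterns, inputs) {
		fmt.Println(m) // one mismatch: ^foo on "xfoo"
	}
}
```

The same harness works between two internal strategies (for example DFA versus NFA) by passing both as matcher values.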

The Collaboration

I want to publicly thank Ben Hoyt. He could have said "this library has bugs, I'll use stdlib." Instead, he filed detailed issues, provided test cases, and kept testing each release.

This is open source at its best.

Try It Yourself

go get github.com/coregx/coregex@v0.8.12

package main

import (
    "fmt"
    "github.com/coregx/coregex"
)

func main() {
    re := coregex.MustCompile(`\w+@[\w.]+`)
    fmt.Println(re.FindString("email: test@example.com"))
    // Output: test@example.com
}

Found a bug? Open an issue. I'll fix it.

What's Next

  • v0.9.0: ARM NEON SIMD (waiting for Go 1.26)
  • v1.0.0: API stability guarantee, security audit
  • Your feedback: The fastest path to production-ready


From 0 to 11 bugs fixed. From "interesting project" to "production-ready." Thanks to one developer who actually tried to use it.
