https://www.youtube.com/watch?v=4xhxORnBeo4
title: "Regex: Tiny Pattern, Hidden Machine"
published: true
tags: regex, computerscience, programming, beginners
You type a regex and it matches text. But under the hood, your regex engine is building and executing a full state machine — every single time.
From Pattern to Machine
A regex like a(b|c)*d looks simple. Five characters. But when the engine compiles it, here's what actually happens:
- Each character becomes two states connected by a transition
- Concatenation chains state machines together
-
Alternation (
|) creates branching paths with epsilon transitions — free jumps that consume no input - The star operator adds a loop from the accept state back to the start
This is Thompson's construction, invented in 1968. Your tiny pattern becomes a nondeterministic finite automaton (NFA) — a graph with multiple possible paths at every step.
NFA → DFA: Removing the Guesswork
An NFA can be in multiple states at once. That's powerful but expensive to simulate. So most engines convert it to a deterministic finite automaton (DFA) — one state at a time, one transition per input character.
The conversion uses the subset construction algorithm: each DFA state represents a set of NFA states. The trade-off? The DFA can have exponentially more states, but each step is O(1).
When Regex Goes Wrong
Not all engines use DFAs. Many (including Python, Java, JavaScript, Ruby, and PHP) use backtracking — they try one path, and if it fails, they back up and try another.
For most patterns, this is fine. But for patterns with nested quantifiers over overlapping characters, the number of paths explodes exponentially.
The pattern (a+)+b matched against aaaaaaaaaaX forces the engine to explore every possible way to divide those as into groups. Ten as? About a thousand paths. Twenty? Over a million. Thirty? Over a billion.
This is ReDoS — Regular Expression Denial of Service. One carefully crafted string can pin a CPU at 100% for hours.
Real incidents:
- 2016: A regex crashed Cloudflare's servers for 27 minutes
- 2019: A single regex took down npm for 6 hours
The Fix
- Avoid nested quantifiers over overlapping patterns
- Use atomic groups or possessive quantifiers
- Switch to an engine like RE2 that uses Thompson's NFA simulation — linear time, guaranteed
Your regex is a program. And like any program, it can have performance bugs hiding in plain sight.
Watch the Full Video
The video walks through NFA construction, DFA conversion, backtracking simulation, and the ReDoS attack — all animated step by step.
References
- Thompson's 1968 paper — the original algorithm
- Russ Cox: Regular Expression Matching Can Be Simple And Fast — the best deep-dive on why backtracking engines are broken
- RE2 — Google's linear-time regex engine
Top comments (0)