DEV Community

Neural Download
Neural Download

Posted on

Regex: Tiny Pattern, Hidden Machine

https://www.youtube.com/watch?v=4xhxORnBeo4


title: "Regex: Tiny Pattern, Hidden Machine"
published: true

tags: regex, computerscience, programming, beginners

You type a regex and it matches text. But under the hood, your regex engine is building and executing a full state machine — every single time.

From Pattern to Machine

A regex like a(b|c)*d looks simple. Five characters. But when the engine compiles it, here's what actually happens:

  1. Each character becomes two states connected by a transition
  2. Concatenation chains state machines together
  3. Alternation (|) creates branching paths with epsilon transitions — free jumps that consume no input
  4. The star operator adds a loop from the accept state back to the start

This is Thompson's construction, invented in 1968. Your tiny pattern becomes a nondeterministic finite automaton (NFA) — a graph with multiple possible paths at every step.

NFA → DFA: Removing the Guesswork

An NFA can be in multiple states at once. That's powerful but expensive to simulate. So most engines convert it to a deterministic finite automaton (DFA) — one state at a time, one transition per input character.

The conversion uses the subset construction algorithm: each DFA state represents a set of NFA states. The trade-off? The DFA can have exponentially more states, but each step is O(1).

When Regex Goes Wrong

Not all engines use DFAs. Many (including Python, Java, JavaScript, Ruby, and PHP) use backtracking — they try one path, and if it fails, they back up and try another.

For most patterns, this is fine. But for patterns with nested quantifiers over overlapping characters, the number of paths explodes exponentially.

The pattern (a+)+b matched against aaaaaaaaaaX forces the engine to explore every possible way to divide those as into groups. Ten as? About a thousand paths. Twenty? Over a million. Thirty? Over a billion.

This is ReDoS — Regular Expression Denial of Service. One carefully crafted string can pin a CPU at 100% for hours.

Real incidents:

  • 2016: A regex crashed Cloudflare's servers for 27 minutes
  • 2019: A single regex took down npm for 6 hours

The Fix

  • Avoid nested quantifiers over overlapping patterns
  • Use atomic groups or possessive quantifiers
  • Switch to an engine like RE2 that uses Thompson's NFA simulation — linear time, guaranteed

Your regex is a program. And like any program, it can have performance bugs hiding in plain sight.

Watch the Full Video

The video walks through NFA construction, DFA conversion, backtracking simulation, and the ReDoS attack — all animated step by step.

References

Top comments (0)