DEV Community

Deva Krishna
Deva Krishna

Posted on

Type-Safe Regex Matching from First Principles

I came across a post from ArkType recently:

Regex patterns being parsed at compile time! TypeScript inferring exact string literals like "ac" | "abc" from the pattern "^ab?c$". Not just string, but the actual possible matches.

That got me curious. Regex has always been a runtime thing, you write a pattern, match strings against it, and the type system just sees string everywhere. But this was doing it all at compile time. I wanted to understand how, so I decided to build a minimal version myself.

What is ArkType Regex?

ArkType Regex is a compile-time regex parser written entirely in TypeScript's type system. Instead of just validating that a string might match a regex at runtime, it infers the exact string literal types that a regex can produce.

For example:

// Traditional regex: gives you `string`
const match = "abc".match(/^ab?c$/) // string | null

// ArkType regex: gives you the exact type
type Pattern = Regex<"^ab?c$"> // "ac" | "abc"
Enter fullscreen mode Exit fullscreen mode

How does this work?

The implementation uses Template literal types to manipulate strings at the type level.

The parser reads the regex pattern character by character, builds an internal representation (AST), and then "finalizes" it into the actual TypeScript union type representing all possible matches.


Building Our Own From First Principles

Let's forget about ArkType's full implementation and build a minimal type-level regex parser ourselves. This will teach us the fundamental concepts of type-level computation in TypeScript.

The Goal

We want to create a Regex<Pattern> type that:

  • Takes a regex pattern string
  • Returns a union of all possible string literals that match
type R1 = Regex<"abc">     // "abc"
type R2 = Regex<"ab?c">    // "ac" | "abc"  
type R3 = Regex<"a|b">     // "a" | "b"
type R4 = Regex<"^ab?c$">  // "ac" | "abc"
Enter fullscreen mode Exit fullscreen mode

Step 1: Understanding Template Literal Type Manipulation

Before we parse anything, we need to understand how to work with strings at the type level.

TypeScript's template literal types let us decompose strings:

// Extract the first character and the rest
type Shift<S extends string> = 
  S extends `${infer Head}${infer Tail}` 
    ? [Head, Tail] 
    : null

type Test1 = Shift<"abc">  // ["a", "bc"]
type Test2 = Shift<"a">    // ["a", ""]
type Test3 = Shift<"">     // null
Enter fullscreen mode Exit fullscreen mode

The infer keyword lets us pattern-match on template literals and capture parts of the string into type variables.

We can also build strings:

type Concat<A extends string, B extends string> = `${A}${B}`

type Test4 = Concat<"hello", "world">  // "helloworld"
Enter fullscreen mode Exit fullscreen mode

Step 2: Designing the State Machine

What Does "Parsing" Even Mean?

Parsing means reading through a string character by character and building up some result. In JavaScript, you might write:

function parse(pattern) {
  let result = ""
  for (let i = 0; i < pattern.length; i++) {
    const char = pattern[i]
    // do something with char, build up result
  }
  return result
}
Enter fullscreen mode Exit fullscreen mode

Notice what we naturally need:

  1. Where we are in the input - the index i
  2. What we've built so far - the result we're accumulating

This is state - information that persists and changes as we move through the loop.

Tracking Position: Index vs. "What's Left"

In the loop above, we use i to know our position. But there's another way to think about it.

Instead of "I'm at position 2 of abc", we can say "I've already read ab, and c is left to read."

// Using index:
pattern = "abc", i = 2  next char is pattern[2] = "c"

// Using "what's left":
unscanned = "c"  next char is "c"
Enter fullscreen mode Exit fullscreen mode

Both track the same thing - where we are. But "what's left" is easier to work with in TypeScript's type system as we'll see in the following section.

The Problem: Types Can't Have Variables

In runtime JavaScript, we have mutable variables. We can write result += char and i++.

But TypeScript's type system has no variables. Types are immutable. So how do we "loop" and "accumulate"?

The answer: recursion with parameters.

Instead of mutating variables, we pass updated values to the next recursive call:

// Runtime approach (mutation):
let result = ""
for (char of input) {
  result = result + char  // mutate
}

// Type-level approach (recursion):
type Parse<Input, Result> = 
  Input extends `${infer Char}${infer Rest}`
    ? Parse<Rest, `${Result}${Char}`>  // "pass" new values to next iteration
    : Result  // base case: return accumulated result
Enter fullscreen mode Exit fullscreen mode

Each "iteration" is a new type instantiation with updated parameters. If you don't understand this it's fine, read further you will get an understanding.

What State Do We Need?

At minimum, we need to track:

type State = {
  unscanned: string   // What's LEFT to read (like the remaining input)
  result: string      // What we've BUILT so far
}
Enter fullscreen mode Exit fullscreen mode

Let's trace parsing "abc" with the simplest possible logic (just accumulate characters):

Step unscanned result Action
0 abc "" Start
1 bc a Read a, add to result
2 c ab Read b, add to result
3 "" abc Read c, add to result
Done "" abc Nothing left, return result

This is the essence of parsing: consume input, build output.

Why Can't result Just Be a Simple String?

For basic concatenation, result: string works fine. But regex has quantifiers like ?.

Consider parsing ab?c. When we hit ?, we need to make the previous thing optional. In this case we will need to match ac and abc, as b is considered optional.

Step unscanned result Action
0 ab?c "" Start
1 b?c a Read a
2 ?c ab Read b
3 c ??? Read ? - now what?

Problem: We see ? and need to make b optional. But result is ab - just a flat string. How do we know that ? applies to b and not to ab or just a?

We've lost track of what the last thing we added was.

The Fix: Track "Current Element" Separately

We need to split our result into two parts:

type State = {
  unscanned: string   // What's left to read
  sequence: string    // Everything BEFORE the current element
  root: string        // The CURRENT element (what quantifiers target)
}
Enter fullscreen mode Exit fullscreen mode

Now let's trace ab?c:

Step unscanned sequence root Action
0 ab?c "" "" Start
1 b?c "" a a becomes current element
2 ?c a b a moves to sequence, b is current
3 c a "" | b ? makes root optional!
4 "" a | ab c c is current, root joins sequence
Done Combine: ac | abc

By keeping the "current element" in root separate from "everything before" in sequence, we always know exactly what a quantifier should modify.

One More Thing: Alternation Needs Memory

The | operator creates alternatives: a|b should produce a | b.

When we see |, we need to:

  1. Save the current pattern as a completed branch
  2. Start fresh for the next alternative

We need somewhere to store those saved branches:

type State = {
  unscanned: string    // What's left to read
  sequence: string     // Pattern before current element  
  root: string         // Current element (quantifier target)
  branches: string[]   // Completed alternatives (for |)
}
Enter fullscreen mode Exit fullscreen mode

Trace a|b:

Step unscanned sequence root branches Action
0 a&#124;b "" "" [] Start
1 &#124;b "" a [] a is current
2 b "" "" ["a"] See &#124;: save a, reset
3 "" "" b ["a"] b is current
Done Union: a | b

Summary: Why Each Field Exists

Field Why We Need It
unscanned Track what's left to read (our "loop counter")
sequence Accumulate "finished" parts of pattern
root Keep "current element" separate so quantifiers know what to modify
branches Store completed alternatives when we see &#124;
type State = {
  unscanned: string
  sequence: string
  root: string
  branches: string[]
}

type Init<S extends string> = {
  unscanned: S
  sequence: ""
  root: ""
  branches: []
}
Enter fullscreen mode Exit fullscreen mode

Now we have a state structure that can handle the core regex operations. Next, we'll build the logic that transforms this state character by character.

Step 3: The Core Parse Loop

Recursive types let us implement loops:

type Parse<S extends State> = 
  S["unscanned"] extends ""      // Is input exhausted?
    ? Finalize<S>                // Yes: finalize and return result
    : Parse<Next<S>>             // No: process next char, recurse
Enter fullscreen mode Exit fullscreen mode

This is the heart of the parser. It repeatedly calls Next to process one character until the input is empty, then calls Finalize to produce the result.

Step 4: Processing Literal Characters

Let's start with the simplest case: literal characters.

What we expect:

type Result = Regex<"abc"> // "abc"
Enter fullscreen mode Exit fullscreen mode

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? {
        unscanned: T                           // Consume the character
        sequence: `${S["sequence"]}${S["root"]}`  // Push old root to sequence
        root: H                                // New char becomes root
        branches: S["branches"]                // Branches unchanged
      }
    : S
Enter fullscreen mode Exit fullscreen mode

What's happening here:

  1. Extract the first character H and rest T
  2. The current root joins the sequence (it's no longer the "current token")
  3. The new character becomes the root
  4. Branches stay the same

Let's add a simple finalizer:

type CurrentPattern<S extends State> = `${S["sequence"]}${S["root"]}`

type Finalize<S extends State> = 
  S["branches"] extends [] 
    ? CurrentPattern<S>                              // No branches: just return pattern
    : [...S["branches"], CurrentPattern<S>][number]  // Branches: union them all
Enter fullscreen mode Exit fullscreen mode

The [number] trick converts an array type to a union of its elements.

Test it:

type R = Parse<Init<"abc">>  // "abc" 
Enter fullscreen mode Exit fullscreen mode

Step 5: The Optional Quantifier ?

Now the interesting part. When we see ?, we need to make the root optional.

What we expect:

type Result = Regex<"ab?c"> // "ac" | "abc"
Enter fullscreen mode Exit fullscreen mode

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "?"
      ? {
          unscanned: T
          sequence: S["sequence"]       // Sequence unchanged
          root: "" | S["root"]          // Root becomes optional!
          branches: S["branches"]
        }
      : {
          unscanned: T
          sequence: `${S["sequence"]}${S["root"]}`
          root: H
          branches: S["branches"]
        }
    : S
Enter fullscreen mode Exit fullscreen mode

The magic is "" | S["root"]. If root was b, it becomes "" | b.

When we finalize with sequence = "a" and root = "" | "b":

`${S["sequence"]}${S["root"]}`
// = `${"a"}${"" | "b"}`
// = "a" | "ab"
Enter fullscreen mode Exit fullscreen mode

TypeScript distributes the union across the template literal! This is key to the whole approach.

Wait, we also need to handle c after ?. Let's trace through ab?c:

  1. Initial: {unscanned: "ab?c", sequence: "", root: "", branches: []}
  2. After a: {unscanned: "b?c", sequence: "", root: "a", branches: []}
  3. After b: {unscanned: "?c", sequence: "a", root: "b", branches: []}
  4. After ?: {unscanned: "c", sequence: "a", root: "" | "b", branches: []}
  5. After c: {unscanned: "", sequence: "a" | "ab", root: "c", branches: []}
  6. Finalize: "ac" | "abc"

The distribution happens at step 5 when we concatenate the union in root into sequence.

Step 6: Handling Anchors

Anchors ^ and $ assert position but don't consume characters. For simplicity, let's strip them:

type StripAnchors<S extends string> = 
  S extends `^${infer R}` 
    ? R extends `${infer M}$` 
      ? M 
      : R
    : S extends `${infer M}$` 
      ? M 
      : S

type Regex<S extends string> = Parse<Init<StripAnchors<S>>>
Enter fullscreen mode Exit fullscreen mode

Now Regex<"^ab?c$"> works like Regex<"ab?c">.

Step 7: Adding More Quantifiers

Let's add + (one or more) and * (zero or more).

What we expect:

type Result1 = Regex<"a+"> // `a${string}`
type Result2 = Regex<"a*"> // "" | `a${string}`
Enter fullscreen mode Exit fullscreen mode

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "?"
      ? { 
          unscanned: T
          sequence: S["sequence"]
          root: "" | S["root"]           // 0 or 1
          branches: S["branches"] 
        }
      : H extends "+"
      ? { 
          unscanned: T
          sequence: S["sequence"]
          root: `${S["root"]}${string}`  // 1 or more
          branches: S["branches"] 
        }
      : H extends "*"
      ? { 
          unscanned: T
          sequence: S["sequence"]
          root: "" | `${S["root"]}${string}`  // 0 or more
          branches: S["branches"] 
        }
      : { 
          unscanned: T
          sequence: `${S["sequence"]}${S["root"]}`
          root: H
          branches: S["branches"] 
        }
    : S
Enter fullscreen mode Exit fullscreen mode

For + and *, we can't enumerate all possibilities (infinite), so we use ${string} to represent "any continuation":

  • a+`a${string}` (starts with a, followed by anything)
  • a*"" | `a${string}` (empty or starts with a)

This widens the type but keeps it useful for validation.

Step 8: The Alternation Operator |

The | creates alternatives. When we see it, we need to:

  1. Finalize the current branch
  2. Push it to branches
  3. Start fresh for the next alternative

What we expect:

type Result = Regex<"a|b"> // "a" | "b"
Enter fullscreen mode Exit fullscreen mode

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "|"
      ? {
          unscanned: T
          sequence: ""                         // Reset sequence
          root: ""                               // Reset root
          branches: [...S["branches"], CurrentPattern<S>]  // Save current branch
        }
      : H extends "?"
        // ... rest of cases
Enter fullscreen mode Exit fullscreen mode

In Finalize, we union all branches together:

type Finalize<S extends State> = 
  S["branches"] extends [] 
    ? CurrentPattern<S>
    : [...S["branches"], CurrentPattern<S>][number]
Enter fullscreen mode Exit fullscreen mode

The [...array, lastElement][number] pattern creates a union from the array plus the last element.

The Complete Implementation

// === STATE TYPES ===
type State = {
  unscanned: string
  sequence: string  
  root: string
  branches: string[]
}

type Init<S extends string> = {
  unscanned: S
  sequence: ""
  root: ""
  branches: []
}

// === HELPERS ===
type CurrentPattern<S extends State> = `${S["sequence"]}${S["root"]}`

type StripAnchors<S extends string> = 
  S extends `^${infer R}` 
    ? R extends `${infer M}$` ? M : R
    : S extends `${infer M}$` ? M : S

// === FINALIZE ===
type Finalize<S extends State> = 
  S["branches"] extends [] 
    ? CurrentPattern<S>
    : [...S["branches"], CurrentPattern<S>][number]

// === NEXT - CHARACTER DISPATCH ===
type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "|"
      ? { unscanned: T; sequence: ""; root: ""; 
          branches: [...S["branches"], CurrentPattern<S>] }
      : H extends "?"
      ? { unscanned: T; sequence: S["sequence"]; root: "" | S["root"]; 
          branches: S["branches"] }
      : H extends "+"
      ? { unscanned: T; sequence: S["sequence"]; root: `${S["root"]}${string}`; 
          branches: S["branches"] }
      : H extends "*"
      ? { unscanned: T; sequence: S["sequence"]; root: "" | `${S["root"]}${string}`; 
          branches: S["branches"] }
      : { unscanned: T; sequence: `${S["sequence"]}${S["root"]}`; root: H; 
          branches: S["branches"] }
    : S

// === PARSE LOOP ===
type Parse<S extends State> = 
  S["unscanned"] extends "" ? Finalize<S> : Parse<Next<S>>

// === PUBLIC API ===
type Regex<S extends string> = Parse<Init<StripAnchors<S>>>

// === TESTS ===
type T1 = Regex<"abc">       // "abc"
type T2 = Regex<"^abc$">     // "abc"  
type T3 = Regex<"ab?c">      // "ac" | "abc"
type T4 = Regex<"a|b">       // "a" | "b"
type T5 = Regex<"^a?b?$">    // "" | "a" | "b" | "ab"
type T6 = Regex<"a+">        // `a${string}`
type T7 = Regex<"a*">        // "" | `a${string}`
Enter fullscreen mode Exit fullscreen mode

What We've Built

Our ~50 lines of type-level code can:

  • Parse literal characters
  • Handle optional ?, one-or-more +, and zero-or-more * quantifiers
  • Process alternation with |
  • Strip anchors ^ and $

What's Still Missing

A production implementation (like ArkType's) would add:

  1. Groups (...) - Requires a stack to track nested groups
  2. Character classes [abc] - Need a sub-parser for charsets
  3. Escape sequences \d, \w - Map to appropriate types like ${bigint}
  4. Range quantifiers {n,m} - Need recursive loops to enumerate
  5. Backreferences \1 - Track captures and resolve references
  6. Named captures (?<name>...) - Track in separate context

Each feature adds complexity, but the fundamental approach remains: recursive type computation with state threading.


This approach demonstrates that TypeScript's type system can implement surprisingly complex algorithms purely at compile time, catching errors before your code ever runs.


Found this useful? I write about TypeScript, type-level programming, and the weird corners of web development. Subscribe to catch the next one.

Follow me on X for more TypeScript tips and dev thoughts.

Top comments (0)