Deva Krishna

Posted on Nov 30, 2025

Type-Safe Regex Matching from First Principles

#typescript #arktype #regex

I came across a post from ArkType recently:

Regex patterns being parsed at compile time! TypeScript inferring exact string literals like "ac" | "abc" from the pattern "^ab?c$". Not just string, but the actual possible matches.

That got me curious. Regex has always been a runtime thing, you write a pattern, match strings against it, and the type system just sees string everywhere. But this was doing it all at compile time. I wanted to understand how, so I decided to build a minimal version myself.

What is ArkType Regex?

ArkType Regex is a compile-time regex parser written entirely in TypeScript's type system. Instead of just validating that a string might match a regex at runtime, it infers the exact string literal types that a regex can produce.

For example:

// Traditional regex: gives you `string`
const match = "abc".match(/^ab?c$/) // string | null

// ArkType regex: gives you the exact type
type Pattern = Regex<"^ab?c$"> // "ac" | "abc"

How does this work?

The implementation uses Template literal types to manipulate strings at the type level.

The parser reads the regex pattern character by character, builds an internal representation (AST), and then "finalizes" it into the actual TypeScript union type representing all possible matches.

Building Our Own From First Principles

Let's forget about ArkType's full implementation and build a minimal type-level regex parser ourselves. This will teach us the fundamental concepts of type-level computation in TypeScript.

The Goal

We want to create a Regex<Pattern> type that:

Takes a regex pattern string
Returns a union of all possible string literals that match

type R1 = Regex<"abc">     // "abc"
type R2 = Regex<"ab?c">    // "ac" | "abc"  
type R3 = Regex<"a|b">     // "a" | "b"
type R4 = Regex<"^ab?c$">  // "ac" | "abc"

Step 1: Understanding Template Literal Type Manipulation

Before we parse anything, we need to understand how to work with strings at the type level.

TypeScript's template literal types let us decompose strings:

// Extract the first character and the rest
type Shift<S extends string> = 
  S extends `${infer Head}${infer Tail}` 
    ? [Head, Tail] 
    : null

type Test1 = Shift<"abc">  // ["a", "bc"]
type Test2 = Shift<"a">    // ["a", ""]
type Test3 = Shift<"">     // null

The infer keyword lets us pattern-match on template literals and capture parts of the string into type variables.

We can also build strings:

type Concat<A extends string, B extends string> = `${A}${B}`

type Test4 = Concat<"hello", "world">  // "helloworld"

Step 2: Designing the State Machine

What Does "Parsing" Even Mean?

Parsing means reading through a string character by character and building up some result. In JavaScript, you might write:

function parse(pattern) {
  let result = ""
  for (let i = 0; i < pattern.length; i++) {
    const char = pattern[i]
    // do something with char, build up result
  }
  return result
}

Notice what we naturally need:

Where we are in the input - the index i
What we've built so far - the result we're accumulating

This is state - information that persists and changes as we move through the loop.

Tracking Position: Index vs. "What's Left"

In the loop above, we use i to know our position. But there's another way to think about it.

Instead of "I'm at position 2 of abc", we can say "I've already read ab, and c is left to read."

// Using index:
pattern = "abc", i = 2 → next char is pattern[2] = "c"

// Using "what's left":
unscanned = "c" → next char is "c"

Both track the same thing - where we are. But "what's left" is easier to work with in TypeScript's type system as we'll see in the following section.

The Problem: Types Can't Have Variables

In runtime JavaScript, we have mutable variables. We can write result += char and i++.

But TypeScript's type system has no variables. Types are immutable. So how do we "loop" and "accumulate"?

The answer: recursion with parameters.

Instead of mutating variables, we pass updated values to the next recursive call:

// Runtime approach (mutation):
let result = ""
for (char of input) {
  result = result + char  // mutate
}

// Type-level approach (recursion):
type Parse<Input, Result> = 
  Input extends `${infer Char}${infer Rest}`
    ? Parse<Rest, `${Result}${Char}`>  // "pass" new values to next iteration
    : Result  // base case: return accumulated result

Each "iteration" is a new type instantiation with updated parameters. If you don't understand this it's fine, read further you will get an understanding.

What State Do We Need?

At minimum, we need to track:

type State = {
  unscanned: string   // What's LEFT to read (like the remaining input)
  result: string      // What we've BUILT so far
}

Let's trace parsing "abc" with the simplest possible logic (just accumulate characters):

Step	unscanned	result	Action
0	`abc`	`""`	Start
1	`bc`	`a`	Read `a`, add to result
2	`c`	`ab`	Read `b`, add to result
3	`""`	`abc`	Read `c`, add to result
Done	`""`	`abc`	Nothing left, return result

This is the essence of parsing: consume input, build output.

Why Can't `result` Just Be a Simple String?

For basic concatenation, result: string works fine. But regex has quantifiers like ?.

Consider parsing ab?c. When we hit ?, we need to make the previous thing optional. In this case we will need to match ac and abc, as b is considered optional.

Step	unscanned	result	Action
0	`ab?c`	`""`	Start
1	`b?c`	`a`	Read `a`
2	`?c`	`ab`	Read `b`
3	`c`	`???`	Read `?` - now what?

Problem: We see ? and need to make b optional. But result is ab - just a flat string. How do we know that ? applies to b and not to ab or just a?

We've lost track of what the last thing we added was.

The Fix: Track "Current Element" Separately

We need to split our result into two parts:

type State = {
  unscanned: string   // What's left to read
  sequence: string    // Everything BEFORE the current element
  root: string        // The CURRENT element (what quantifiers target)
}

Now let's trace ab?c:

Step	unscanned	sequence	root	Action
0	`ab?c`	`""`	`""`	Start
1	`b?c`	`""`	`a`	`a` becomes current element
2	`?c`	`a`	`b`	`a` moves to sequence, `b` is current
3	`c`	`a`	`""` \| `b`	`?` makes root optional!
4	`""`	`a` \| `ab`	`c`	`c` is current, root joins sequence
Done				Combine: `ac` \| `abc`

By keeping the "current element" in root separate from "everything before" in sequence, we always know exactly what a quantifier should modify.

One More Thing: Alternation Needs Memory

The | operator creates alternatives: a|b should produce a | b.

When we see |, we need to:

Save the current pattern as a completed branch
Start fresh for the next alternative

We need somewhere to store those saved branches:

type State = {
  unscanned: string    // What's left to read
  sequence: string     // Pattern before current element  
  root: string         // Current element (quantifier target)
  branches: string[]   // Completed alternatives (for |)
}

Trace a|b:

Step	unscanned	sequence	root	branches	Action
0	`a\|b`	`""`	`""`	`[]`	Start
1	`\|b`	`""`	`a`	`[]`	`a` is current
2	`b`	`""`	`""`	`["a"]`	See `\|`: save `a`, reset
3	`""`	`""`	`b`	`["a"]`	`b` is current
Done					Union: `a` \| `b`

Summary: Why Each Field Exists

Field	Why We Need It
`unscanned`	Track what's left to read (our "loop counter")
`sequence`	Accumulate "finished" parts of pattern
`root`	Keep "current element" separate so quantifiers know what to modify
`branches`	Store completed alternatives when we see `\|`

type State = {
  unscanned: string
  sequence: string
  root: string
  branches: string[]
}

type Init<S extends string> = {
  unscanned: S
  sequence: ""
  root: ""
  branches: []
}

Now we have a state structure that can handle the core regex operations. Next, we'll build the logic that transforms this state character by character.

Step 3: The Core Parse Loop

Recursive types let us implement loops:

type Parse<S extends State> = 
  S["unscanned"] extends ""      // Is input exhausted?
    ? Finalize<S>                // Yes: finalize and return result
    : Parse<Next<S>>             // No: process next char, recurse

This is the heart of the parser. It repeatedly calls Next to process one character until the input is empty, then calls Finalize to produce the result.

Step 4: Processing Literal Characters

Let's start with the simplest case: literal characters.

What we expect:

type Result = Regex<"abc"> // "abc"

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? {
        unscanned: T                           // Consume the character
        sequence: `${S["sequence"]}${S["root"]}`  // Push old root to sequence
        root: H                                // New char becomes root
        branches: S["branches"]                // Branches unchanged
      }
    : S

What's happening here:

Extract the first character H and rest T
The current root joins the sequence (it's no longer the "current token")
The new character becomes the root
Branches stay the same

Let's add a simple finalizer:

type CurrentPattern<S extends State> = `${S["sequence"]}${S["root"]}`

type Finalize<S extends State> = 
  S["branches"] extends [] 
    ? CurrentPattern<S>                              // No branches: just return pattern
    : [...S["branches"], CurrentPattern<S>][number]  // Branches: union them all

The [number] trick converts an array type to a union of its elements.

Test it:

type R = Parse<Init<"abc">>  // "abc"

Step 5: The Optional Quantifier `?`

Now the interesting part. When we see ?, we need to make the root optional.

What we expect:

type Result = Regex<"ab?c"> // "ac" | "abc"

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "?"
      ? {
          unscanned: T
          sequence: S["sequence"]       // Sequence unchanged
          root: "" | S["root"]          // Root becomes optional!
          branches: S["branches"]
        }
      : {
          unscanned: T
          sequence: `${S["sequence"]}${S["root"]}`
          root: H
          branches: S["branches"]
        }
    : S

The magic is "" | S["root"]. If root was b, it becomes "" | b.

When we finalize with sequence = "a" and root = "" | "b":

`${S["sequence"]}${S["root"]}`
// = `${"a"}${"" | "b"}`
// = "a" | "ab"

TypeScript distributes the union across the template literal! This is key to the whole approach.

Wait, we also need to handle c after ?. Let's trace through ab?c:

Initial: {unscanned: "ab?c", sequence: "", root: "", branches: []}
After a: {unscanned: "b?c", sequence: "", root: "a", branches: []}
After b: {unscanned: "?c", sequence: "a", root: "b", branches: []}
After ?: {unscanned: "c", sequence: "a", root: "" | "b", branches: []}
After c: {unscanned: "", sequence: "a" | "ab", root: "c", branches: []}
Finalize: "ac" | "abc"

The distribution happens at step 5 when we concatenate the union in root into sequence.

Step 6: Handling Anchors

Anchors ^ and $ assert position but don't consume characters. For simplicity, let's strip them:

type StripAnchors<S extends string> = 
  S extends `^${infer R}` 
    ? R extends `${infer M}$` 
      ? M 
      : R
    : S extends `${infer M}$` 
      ? M 
      : S

type Regex<S extends string> = Parse<Init<StripAnchors<S>>>

Now Regex<"^ab?c$"> works like Regex<"ab?c">.

Step 7: Adding More Quantifiers

Let's add + (one or more) and * (zero or more).

What we expect:

type Result1 = Regex<"a+"> // `a${string}`
type Result2 = Regex<"a*"> // "" | `a${string}`

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "?"
      ? { 
          unscanned: T
          sequence: S["sequence"]
          root: "" | S["root"]           // 0 or 1
          branches: S["branches"] 
        }
      : H extends "+"
      ? { 
          unscanned: T
          sequence: S["sequence"]
          root: `${S["root"]}${string}`  // 1 or more
          branches: S["branches"] 
        }
      : H extends "*"
      ? { 
          unscanned: T
          sequence: S["sequence"]
          root: "" | `${S["root"]}${string}`  // 0 or more
          branches: S["branches"] 
        }
      : { 
          unscanned: T
          sequence: `${S["sequence"]}${S["root"]}`
          root: H
          branches: S["branches"] 
        }
    : S

For + and *, we can't enumerate all possibilities (infinite), so we use ${string} to represent "any continuation":

a+ → `a${string}` (starts with a, followed by anything)
a* → "" | `a${string}` (empty or starts with a)

This widens the type but keeps it useful for validation.

Step 8: The Alternation Operator `|`

The | creates alternatives. When we see it, we need to:

Finalize the current branch
Push it to branches
Start fresh for the next alternative

What we expect:

type Result = Regex<"a|b"> // "a" | "b"

Implementation:

type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "|"
      ? {
          unscanned: T
          sequence: ""                         // Reset sequence
          root: ""                               // Reset root
          branches: [...S["branches"], CurrentPattern<S>]  // Save current branch
        }
      : H extends "?"
        // ... rest of cases

In Finalize, we union all branches together:

type Finalize<S extends State> = 
  S["branches"] extends [] 
    ? CurrentPattern<S>
    : [...S["branches"], CurrentPattern<S>][number]

The [...array, lastElement][number] pattern creates a union from the array plus the last element.

The Complete Implementation

// === STATE TYPES ===
type State = {
  unscanned: string
  sequence: string  
  root: string
  branches: string[]
}

type Init<S extends string> = {
  unscanned: S
  sequence: ""
  root: ""
  branches: []
}

// === HELPERS ===
type CurrentPattern<S extends State> = `${S["sequence"]}${S["root"]}`

type StripAnchors<S extends string> = 
  S extends `^${infer R}` 
    ? R extends `${infer M}$` ? M : R
    : S extends `${infer M}$` ? M : S

// === FINALIZE ===
type Finalize<S extends State> = 
  S["branches"] extends [] 
    ? CurrentPattern<S>
    : [...S["branches"], CurrentPattern<S>][number]

// === NEXT - CHARACTER DISPATCH ===
type Next<S extends State> = 
  S["unscanned"] extends `${infer H}${infer T}` 
    ? H extends "|"
      ? { unscanned: T; sequence: ""; root: ""; 
          branches: [...S["branches"], CurrentPattern<S>] }
      : H extends "?"
      ? { unscanned: T; sequence: S["sequence"]; root: "" | S["root"]; 
          branches: S["branches"] }
      : H extends "+"
      ? { unscanned: T; sequence: S["sequence"]; root: `${S["root"]}${string}`; 
          branches: S["branches"] }
      : H extends "*"
      ? { unscanned: T; sequence: S["sequence"]; root: "" | `${S["root"]}${string}`; 
          branches: S["branches"] }
      : { unscanned: T; sequence: `${S["sequence"]}${S["root"]}`; root: H; 
          branches: S["branches"] }
    : S

// === PARSE LOOP ===
type Parse<S extends State> = 
  S["unscanned"] extends "" ? Finalize<S> : Parse<Next<S>>

// === PUBLIC API ===
type Regex<S extends string> = Parse<Init<StripAnchors<S>>>

// === TESTS ===
type T1 = Regex<"abc">       // "abc"
type T2 = Regex<"^abc$">     // "abc"  
type T3 = Regex<"ab?c">      // "ac" | "abc"
type T4 = Regex<"a|b">       // "a" | "b"
type T5 = Regex<"^a?b?$">    // "" | "a" | "b" | "ab"
type T6 = Regex<"a+">        // `a${string}`
type T7 = Regex<"a*">        // "" | `a${string}`

What We've Built

Our ~50 lines of type-level code can:

Parse literal characters
Handle optional ?, one-or-more +, and zero-or-more * quantifiers
Process alternation with |
Strip anchors ^ and $

What's Still Missing

A production implementation (like ArkType's) would add:

Groups (...) - Requires a stack to track nested groups
Character classes [abc] - Need a sub-parser for charsets
Escape sequences \d, \w - Map to appropriate types like ${bigint}
Range quantifiers {n,m} - Need recursive loops to enumerate
Backreferences \1 - Track captures and resolve references
Named captures (?<name>...) - Track in separate context

Each feature adds complexity, but the fundamental approach remains: recursive type computation with state threading.

This approach demonstrates that TypeScript's type system can implement surprisingly complex algorithms purely at compile time, catching errors before your code ever runs.

Found this useful? I write about TypeScript, type-level programming, and the weird corners of web development. Subscribe to catch the next one.

Follow me on X for more TypeScript tips and dev thoughts.

DEV Community

Type-Safe Regex Matching from First Principles

What is ArkType Regex?

Building Our Own From First Principles

The Goal

Step 1: Understanding Template Literal Type Manipulation

Step 2: Designing the State Machine

What Does "Parsing" Even Mean?

Tracking Position: Index vs. "What's Left"

The Problem: Types Can't Have Variables

What State Do We Need?

Why Can't `result` Just Be a Simple String?

The Fix: Track "Current Element" Separately

One More Thing: Alternation Needs Memory

Summary: Why Each Field Exists

Step 3: The Core Parse Loop

Step 4: Processing Literal Characters

Step 5: The Optional Quantifier `?`

Step 6: Handling Anchors

Step 7: Adding More Quantifiers

Step 8: The Alternation Operator `|`

The Complete Implementation

What We've Built

What's Still Missing

Top comments (0)

What is ArkType Regex?

Building Our Own From First Principles

The Goal

Step 1: Understanding Template Literal Type Manipulation

Step 2: Designing the State Machine

What Does "Parsing" Even Mean?

Tracking Position: Index vs. "What's Left"

The Problem: Types Can't Have Variables

What State Do We Need?

Why Can't result Just Be a Simple String?

The Fix: Track "Current Element" Separately

One More Thing: Alternation Needs Memory

Summary: Why Each Field Exists

Step 3: The Core Parse Loop

Step 4: Processing Literal Characters

Step 5: The Optional Quantifier ?

Step 6: Handling Anchors

Step 7: Adding More Quantifiers

Step 8: The Alternation Operator |

The Complete Implementation

What We've Built

What's Still Missing

Why Can't `result` Just Be a Simple String?

Step 5: The Optional Quantifier `?`

Step 8: The Alternation Operator `|`