I came across a post from ArkType recently:
Regex patterns being parsed at compile time! TypeScript inferring exact string literals like "ac" | "abc" from the pattern "^ab?c$". Not just string, but the actual possible matches.
That got me curious. Regex has always been a runtime thing, you write a pattern, match strings against it, and the type system just sees string everywhere. But this was doing it all at compile time. I wanted to understand how, so I decided to build a minimal version myself.
What is ArkType Regex?
ArkType Regex is a compile-time regex parser written entirely in TypeScript's type system. Instead of just validating that a string might match a regex at runtime, it infers the exact string literal types that a regex can produce.
For example:
// Traditional regex: gives you `string`
const match = "abc".match(/^ab?c$/) // string | null
// ArkType regex: gives you the exact type
type Pattern = Regex<"^ab?c$"> // "ac" | "abc"
How does this work?
The implementation uses Template literal types to manipulate strings at the type level.
The parser reads the regex pattern character by character, builds an internal representation (AST), and then "finalizes" it into the actual TypeScript union type representing all possible matches.
Building Our Own From First Principles
Let's forget about ArkType's full implementation and build a minimal type-level regex parser ourselves. This will teach us the fundamental concepts of type-level computation in TypeScript.
The Goal
We want to create a Regex<Pattern> type that:
- Takes a regex pattern string
- Returns a union of all possible string literals that match
type R1 = Regex<"abc"> // "abc"
type R2 = Regex<"ab?c"> // "ac" | "abc"
type R3 = Regex<"a|b"> // "a" | "b"
type R4 = Regex<"^ab?c$"> // "ac" | "abc"
Step 1: Understanding Template Literal Type Manipulation
Before we parse anything, we need to understand how to work with strings at the type level.
TypeScript's template literal types let us decompose strings:
// Extract the first character and the rest
type Shift<S extends string> =
S extends `${infer Head}${infer Tail}`
? [Head, Tail]
: null
type Test1 = Shift<"abc"> // ["a", "bc"]
type Test2 = Shift<"a"> // ["a", ""]
type Test3 = Shift<""> // null
The infer keyword lets us pattern-match on template literals and capture parts of the string into type variables.
We can also build strings:
type Concat<A extends string, B extends string> = `${A}${B}`
type Test4 = Concat<"hello", "world"> // "helloworld"
Step 2: Designing the State Machine
What Does "Parsing" Even Mean?
Parsing means reading through a string character by character and building up some result. In JavaScript, you might write:
function parse(pattern) {
let result = ""
for (let i = 0; i < pattern.length; i++) {
const char = pattern[i]
// do something with char, build up result
}
return result
}
Notice what we naturally need:
-
Where we are in the input - the index
i -
What we've built so far - the
resultwe're accumulating
This is state - information that persists and changes as we move through the loop.
Tracking Position: Index vs. "What's Left"
In the loop above, we use i to know our position. But there's another way to think about it.
Instead of "I'm at position 2 of abc", we can say "I've already read ab, and c is left to read."
// Using index:
pattern = "abc", i = 2 → next char is pattern[2] = "c"
// Using "what's left":
unscanned = "c" → next char is "c"
Both track the same thing - where we are. But "what's left" is easier to work with in TypeScript's type system as we'll see in the following section.
The Problem: Types Can't Have Variables
In runtime JavaScript, we have mutable variables. We can write result += char and i++.
But TypeScript's type system has no variables. Types are immutable. So how do we "loop" and "accumulate"?
The answer: recursion with parameters.
Instead of mutating variables, we pass updated values to the next recursive call:
// Runtime approach (mutation):
let result = ""
for (char of input) {
result = result + char // mutate
}
// Type-level approach (recursion):
type Parse<Input, Result> =
Input extends `${infer Char}${infer Rest}`
? Parse<Rest, `${Result}${Char}`> // "pass" new values to next iteration
: Result // base case: return accumulated result
Each "iteration" is a new type instantiation with updated parameters. If you don't understand this it's fine, read further you will get an understanding.
What State Do We Need?
At minimum, we need to track:
type State = {
unscanned: string // What's LEFT to read (like the remaining input)
result: string // What we've BUILT so far
}
Let's trace parsing "abc" with the simplest possible logic (just accumulate characters):
| Step | unscanned | result | Action |
|---|---|---|---|
| 0 | abc |
"" |
Start |
| 1 | bc |
a |
Read a, add to result |
| 2 | c |
ab |
Read b, add to result |
| 3 | "" |
abc |
Read c, add to result |
| Done | "" |
abc |
Nothing left, return result |
This is the essence of parsing: consume input, build output.
Why Can't result Just Be a Simple String?
For basic concatenation, result: string works fine. But regex has quantifiers like ?.
Consider parsing ab?c. When we hit ?, we need to make the previous thing optional. In this case we will need to match ac and abc, as b is considered optional.
| Step | unscanned | result | Action |
|---|---|---|---|
| 0 | ab?c |
"" |
Start |
| 1 | b?c |
a |
Read a
|
| 2 | ?c |
ab |
Read b
|
| 3 | c |
??? |
Read ? - now what? |
Problem: We see ? and need to make b optional. But result is ab - just a flat string. How do we know that ? applies to b and not to ab or just a?
We've lost track of what the last thing we added was.
The Fix: Track "Current Element" Separately
We need to split our result into two parts:
type State = {
unscanned: string // What's left to read
sequence: string // Everything BEFORE the current element
root: string // The CURRENT element (what quantifiers target)
}
Now let's trace ab?c:
| Step | unscanned | sequence | root | Action |
|---|---|---|---|---|
| 0 | ab?c |
"" |
"" |
Start |
| 1 | b?c |
"" |
a |
a becomes current element |
| 2 | ?c |
a |
b |
a moves to sequence, b is current |
| 3 | c |
a |
"" | b
|
? makes root optional! |
| 4 | "" |
a | ab
|
c |
c is current, root joins sequence |
| Done | Combine: ac | abc
|
By keeping the "current element" in root separate from "everything before" in sequence, we always know exactly what a quantifier should modify.
One More Thing: Alternation Needs Memory
The | operator creates alternatives: a|b should produce a | b.
When we see |, we need to:
- Save the current pattern as a completed branch
- Start fresh for the next alternative
We need somewhere to store those saved branches:
type State = {
unscanned: string // What's left to read
sequence: string // Pattern before current element
root: string // Current element (quantifier target)
branches: string[] // Completed alternatives (for |)
}
Trace a|b:
| Step | unscanned | sequence | root | branches | Action |
|---|---|---|---|---|---|
| 0 | a|b |
"" |
"" |
[] |
Start |
| 1 | |b |
"" |
a |
[] |
a is current |
| 2 | b |
"" |
"" |
["a"] |
See |: save a, reset |
| 3 | "" |
"" |
b |
["a"] |
b is current |
| Done | Union: a | b
|
Summary: Why Each Field Exists
| Field | Why We Need It |
|---|---|
unscanned |
Track what's left to read (our "loop counter") |
sequence |
Accumulate "finished" parts of pattern |
root |
Keep "current element" separate so quantifiers know what to modify |
branches |
Store completed alternatives when we see |
|
type State = {
unscanned: string
sequence: string
root: string
branches: string[]
}
type Init<S extends string> = {
unscanned: S
sequence: ""
root: ""
branches: []
}
Now we have a state structure that can handle the core regex operations. Next, we'll build the logic that transforms this state character by character.
Step 3: The Core Parse Loop
Recursive types let us implement loops:
type Parse<S extends State> =
S["unscanned"] extends "" // Is input exhausted?
? Finalize<S> // Yes: finalize and return result
: Parse<Next<S>> // No: process next char, recurse
This is the heart of the parser. It repeatedly calls Next to process one character until the input is empty, then calls Finalize to produce the result.
Step 4: Processing Literal Characters
Let's start with the simplest case: literal characters.
What we expect:
type Result = Regex<"abc"> // "abc"
Implementation:
type Next<S extends State> =
S["unscanned"] extends `${infer H}${infer T}`
? {
unscanned: T // Consume the character
sequence: `${S["sequence"]}${S["root"]}` // Push old root to sequence
root: H // New char becomes root
branches: S["branches"] // Branches unchanged
}
: S
What's happening here:
- Extract the first character
Hand restT - The current
rootjoins thesequence(it's no longer the "current token") - The new character becomes the
root - Branches stay the same
Let's add a simple finalizer:
type CurrentPattern<S extends State> = `${S["sequence"]}${S["root"]}`
type Finalize<S extends State> =
S["branches"] extends []
? CurrentPattern<S> // No branches: just return pattern
: [...S["branches"], CurrentPattern<S>][number] // Branches: union them all
The [number] trick converts an array type to a union of its elements.
Test it:
type R = Parse<Init<"abc">> // "abc"
Step 5: The Optional Quantifier ?
Now the interesting part. When we see ?, we need to make the root optional.
What we expect:
type Result = Regex<"ab?c"> // "ac" | "abc"
Implementation:
type Next<S extends State> =
S["unscanned"] extends `${infer H}${infer T}`
? H extends "?"
? {
unscanned: T
sequence: S["sequence"] // Sequence unchanged
root: "" | S["root"] // Root becomes optional!
branches: S["branches"]
}
: {
unscanned: T
sequence: `${S["sequence"]}${S["root"]}`
root: H
branches: S["branches"]
}
: S
The magic is "" | S["root"]. If root was b, it becomes "" | b.
When we finalize with sequence = "a" and root = "" | "b":
`${S["sequence"]}${S["root"]}`
// = `${"a"}${"" | "b"}`
// = "a" | "ab"
TypeScript distributes the union across the template literal! This is key to the whole approach.
Wait, we also need to handle c after ?. Let's trace through ab?c:
- Initial:
{unscanned: "ab?c", sequence: "", root: "", branches: []} - After
a:{unscanned: "b?c", sequence: "", root: "a", branches: []} - After
b:{unscanned: "?c", sequence: "a", root: "b", branches: []} - After
?:{unscanned: "c", sequence: "a", root: "" | "b", branches: []} - After
c:{unscanned: "", sequence: "a" | "ab", root: "c", branches: []} - Finalize:
"ac" | "abc"
The distribution happens at step 5 when we concatenate the union in root into sequence.
Step 6: Handling Anchors
Anchors ^ and $ assert position but don't consume characters. For simplicity, let's strip them:
type StripAnchors<S extends string> =
S extends `^${infer R}`
? R extends `${infer M}$`
? M
: R
: S extends `${infer M}$`
? M
: S
type Regex<S extends string> = Parse<Init<StripAnchors<S>>>
Now Regex<"^ab?c$"> works like Regex<"ab?c">.
Step 7: Adding More Quantifiers
Let's add + (one or more) and * (zero or more).
What we expect:
type Result1 = Regex<"a+"> // `a${string}`
type Result2 = Regex<"a*"> // "" | `a${string}`
Implementation:
type Next<S extends State> =
S["unscanned"] extends `${infer H}${infer T}`
? H extends "?"
? {
unscanned: T
sequence: S["sequence"]
root: "" | S["root"] // 0 or 1
branches: S["branches"]
}
: H extends "+"
? {
unscanned: T
sequence: S["sequence"]
root: `${S["root"]}${string}` // 1 or more
branches: S["branches"]
}
: H extends "*"
? {
unscanned: T
sequence: S["sequence"]
root: "" | `${S["root"]}${string}` // 0 or more
branches: S["branches"]
}
: {
unscanned: T
sequence: `${S["sequence"]}${S["root"]}`
root: H
branches: S["branches"]
}
: S
For + and *, we can't enumerate all possibilities (infinite), so we use ${string} to represent "any continuation":
-
a+→`a${string}`(starts witha, followed by anything) -
a*→"" | `a${string}`(empty or starts witha)
This widens the type but keeps it useful for validation.
Step 8: The Alternation Operator |
The | creates alternatives. When we see it, we need to:
- Finalize the current branch
- Push it to
branches - Start fresh for the next alternative
What we expect:
type Result = Regex<"a|b"> // "a" | "b"
Implementation:
type Next<S extends State> =
S["unscanned"] extends `${infer H}${infer T}`
? H extends "|"
? {
unscanned: T
sequence: "" // Reset sequence
root: "" // Reset root
branches: [...S["branches"], CurrentPattern<S>] // Save current branch
}
: H extends "?"
// ... rest of cases
In Finalize, we union all branches together:
type Finalize<S extends State> =
S["branches"] extends []
? CurrentPattern<S>
: [...S["branches"], CurrentPattern<S>][number]
The [...array, lastElement][number] pattern creates a union from the array plus the last element.
The Complete Implementation
// === STATE TYPES ===
type State = {
unscanned: string
sequence: string
root: string
branches: string[]
}
type Init<S extends string> = {
unscanned: S
sequence: ""
root: ""
branches: []
}
// === HELPERS ===
type CurrentPattern<S extends State> = `${S["sequence"]}${S["root"]}`
type StripAnchors<S extends string> =
S extends `^${infer R}`
? R extends `${infer M}$` ? M : R
: S extends `${infer M}$` ? M : S
// === FINALIZE ===
type Finalize<S extends State> =
S["branches"] extends []
? CurrentPattern<S>
: [...S["branches"], CurrentPattern<S>][number]
// === NEXT - CHARACTER DISPATCH ===
type Next<S extends State> =
S["unscanned"] extends `${infer H}${infer T}`
? H extends "|"
? { unscanned: T; sequence: ""; root: "";
branches: [...S["branches"], CurrentPattern<S>] }
: H extends "?"
? { unscanned: T; sequence: S["sequence"]; root: "" | S["root"];
branches: S["branches"] }
: H extends "+"
? { unscanned: T; sequence: S["sequence"]; root: `${S["root"]}${string}`;
branches: S["branches"] }
: H extends "*"
? { unscanned: T; sequence: S["sequence"]; root: "" | `${S["root"]}${string}`;
branches: S["branches"] }
: { unscanned: T; sequence: `${S["sequence"]}${S["root"]}`; root: H;
branches: S["branches"] }
: S
// === PARSE LOOP ===
type Parse<S extends State> =
S["unscanned"] extends "" ? Finalize<S> : Parse<Next<S>>
// === PUBLIC API ===
type Regex<S extends string> = Parse<Init<StripAnchors<S>>>
// === TESTS ===
type T1 = Regex<"abc"> // "abc"
type T2 = Regex<"^abc$"> // "abc"
type T3 = Regex<"ab?c"> // "ac" | "abc"
type T4 = Regex<"a|b"> // "a" | "b"
type T5 = Regex<"^a?b?$"> // "" | "a" | "b" | "ab"
type T6 = Regex<"a+"> // `a${string}`
type T7 = Regex<"a*"> // "" | `a${string}`
What We've Built
Our ~50 lines of type-level code can:
- Parse literal characters
- Handle optional
?, one-or-more+, and zero-or-more*quantifiers - Process alternation with
| - Strip anchors
^and$
What's Still Missing
A production implementation (like ArkType's) would add:
-
Groups
(...)- Requires a stack to track nested groups -
Character classes
[abc]- Need a sub-parser for charsets -
Escape sequences
\d,\w- Map to appropriate types like${bigint} -
Range quantifiers
{n,m}- Need recursive loops to enumerate -
Backreferences
\1- Track captures and resolve references -
Named captures
(?<name>...)- Track in separate context
Each feature adds complexity, but the fundamental approach remains: recursive type computation with state threading.
This approach demonstrates that TypeScript's type system can implement surprisingly complex algorithms purely at compile time, catching errors before your code ever runs.
Found this useful? I write about TypeScript, type-level programming, and the weird corners of web development. Subscribe to catch the next one.
Follow me on X for more TypeScript tips and dev thoughts.

Top comments (0)