FSM Parser Theory: A Friendly Conversation with Claude Opus
These are real questions I asked Claude Opus in order to learn the theory behind parsers. The starter point is Combinational Logic, and the end point is a FINITE STATE MACHINE. The answers build on each other incrementally.
Let's walk through the theory behind finite state machines by building one from scratch. And from scratch I mean No libraries. No classes. Just logic and a minimal set of 10 operators.
At the end, you will find a working FSM that can recognize a complete <div>text</div> structure. This can be expanded to recognize more patterns.
The automata hierarchy
Q: Automata theory has a class hierarchy, starting from combinational logic. Right?
Yes, that's a clean way to frame it. The hierarchy goes from "no memory" to "more memory":
Combinational logic — output depends only on current input. No memory, no state. Just pure boolean functions (
AND,OR, etc.).Finite State Machine (FSM) — adds memory in the form of state. Output depends on current input + current state. Memory is finite and fixed.
Pushdown Automaton (PDA) — FSM + a stack. The stack gives it unlimited but structured memory (LIFO). This is what recognizes nested structures like matching parentheses.
Turing Machine — FSM + an infinite tape (read/write). The most powerful model; can simulate any computation.
Each level can recognize a strictly larger class of languages (the Chomsky hierarchy):
| Machine | Language class | Example |
|---|---|---|
| Combinational | — | Boolean expression |
| FSM | Regular | /ab*c/ |
| PDA | Context-free | Balanced {}[]()
|
| Turing | Recursively enumerable | Anything computable |
For parsing HTML, you need at minimum a PDA (because of nested tags), but in practice browsers use a stateful tokenizer (FSM) + a tree builder (stack-based) — which together behave like a PDA.
The boolean foundation
Q: Which combinational logic operators do we need to implement a finite state machine? Conceptually.
Conceptually, you only need three:
- AND — "state is X and input is Y"
- OR — "condition A or condition B leads to the same result"
- NOT — negate a condition
These three form a functionally complete set — you can express any boolean function with just AND, OR, NOT.
The core of an FSM is its transition function:
next_state = f(current_state, input)
That function is purely combinational — it's just a truth table. Every row is a conjunction (AND) of conditions, and you OR the rows together to select the next state. Something like:
if state == A AND input == 'x' → go to B
if state == A AND input == 'y' → go to C
if state == B AND input == 'x' → go to A
...
Each line is AND. Multiple lines that share an outcome are OR'd together.
NOT comes in when you need to express "any input that is not this character."
That's the combinational part. The state register (what actually holds the current state between steps) is the sequential part — it's where memory enters the picture.
The seven primitive operators
Q: What else do we need from a computational point of view? Something like inc, if/else, cmp, etc.
Q: Is this list complete? Provide a better list and a simple conceptual example.
The complete set of primitive operations for an FSM:
| Operation | What it does | Why an FSM needs it |
|---|---|---|
| READ | Get the current input symbol | The machine must consume input |
| CMP | Test equality of two values | Check state and input against known constants |
| BRANCH | Conditional jump | Select which transition fires |
| JMP | Unconditional jump | Loop back to process next character |
| ASSIGN | Write a value to a register | Update the state register |
| ADVANCE | Move input pointer forward | Step through the input string |
| HALT | Stop execution | Accept or reject; terminate the machine |
Seven operations total. No arithmetic, no stack, no heap.
Example: an FSM that accepts strings matching ab*c (an a, then zero or more bs, then c).
States: S0 (start), S1 (got a), ACCEPT, DEAD
LOOP:
READ char ; get current input symbol
; ── S0: expecting 'a' ──
CMP state, S0
BRANCH not_s0
CMP char, 'a'
BRANCH s0_fail
ASSIGN state, S1
JMP next
s0_fail:
ASSIGN state, DEAD
JMP next
not_s0:
; ── S1: expecting 'b' or 'c' ──
CMP state, S1
BRANCH not_s1
CMP char, 'b'
BRANCH s1_not_b
ASSIGN state, S1 ; stay in S1
JMP next
s1_not_b:
CMP char, 'c'
BRANCH s1_fail
ASSIGN state, ACCEPT
JMP next
s1_fail:
ASSIGN state, DEAD
JMP next
not_s1:
; ── DEAD or ACCEPT + more input → DEAD ──
ASSIGN state, DEAD
next:
ADVANCE ; move to next character
CMP input_end, true
BRANCH LOOP ; not at end → loop
; ── input exhausted ──
CMP state, ACCEPT
BRANCH reject
HALT accept
reject:
HALT reject
Trace with input "abbc":
| Step | char | state before | transition | state after |
|---|---|---|---|---|
| 1 | a |
S0 | S0 + a → S1 |
S1 |
| 2 | b |
S1 | S1 + b → S1 |
S1 |
| 3 | b |
S1 | S1 + b → S1 |
S1 |
| 4 | c |
S1 | S1 + c → ACCEPT |
ACCEPT |
End of input, state is ACCEPT → accepted.
Every single line maps to one of the seven primitives — nothing else is needed.
Visualization of the FSM for ab*c
Four states, all transitions accounted for — including the error paths to DEAD.
Are seven operators enough for any FSM?
Q: Can we build any FSM with these seven operators?
Yes, the seven operations are sufficient. In fact, they're slightly redundant.
JMP is redundant — it's just BRANCH(always_true, target). And CMP can be absorbed into BRANCH (as "branch-on-symbol-match"). And READ + ADVANCE can be fused into one operation.
So the theoretical minimum is 4 operations:
| Minimal op | Absorbs |
|---|---|
| READ/ADVANCE | READ + ADVANCE |
| BRANCH-on-match | CMP + BRANCH + JMP |
| ASSIGN | — |
| HALT | — |
This matches Wang's B-machine (1957) — a minimal abstract machine with a read-only input tape and finite control that is exactly a DFA. Its instruction set is: {read, branch-on-symbol, halt-accept, halt-reject}.
Key references:
-
Sipser (Introduction to the Theory of Computation, Ch. 1) — treats the transition function
δ(state, symbol) → stateas a single atomic primitive. The 7 ops are a decomposition of that. - Hopcroft/Ullman/Motwani (Intro to Automata Theory, Ch. 2) — same formal treatment.
- Minsky (Computation: Finite and Infinite Machines, 1967, Ch. 3) — decomposes FSMs into read input, compute next-state (combinational logic), latch new state (memory). The hardware view of exactly these ops.
- Wang, 1957 — "A Variant to Turing's Theory of Computing Machines", Journal of the ACM 4(1). The closest classical reference to a minimal FSM instruction set. ACM link
Bottom line: The 7 ops are sufficient for any FSM. We keep them separate (rather than using the minimal 4) because they map more clearly to code — each one becomes a distinct, readable operation in the implementation.
The assembly connection
Q: Those 7 operators are very much like assembly operators.
Exactly. They are assembly-level operations — that's not a coincidence.
An FSM is the computational model that sits right above combinational logic. It's the simplest machine that has state. And assembly is the thinnest software abstraction over hardware. So they naturally meet at the same level.
The mapping is direct:
| FSM op | x86 equivalent |
|---|---|
| READ |
mov al, [rsi] (load byte from memory) |
| CMP | cmp al, 'a' |
| BRANCH | jne label |
| JMP | jmp label |
| ASSIGN | mov state, S1 |
| ADVANCE | inc rsi |
| HALT |
ret / syscall exit
|
Hand-written FSMs in assembly (like in high-performance lexers) look almost identical to our pseudocode. The abstraction gap is essentially zero.
This is also why FSMs are so fast in practice — every operation maps 1:1 to a single CPU instruction. No function calls, no memory allocation, no indirection. Just compare, branch, assign, repeat.
When we move to a higher-level language like Python, we'll express these same 7 operations using if/elif, ==, variable assignment, and a for loop — but conceptually, it's the same assembly underneath.
Expanding the operators for real HTML
Q: Can we implement an FSM that recognizes <div>text</div> where text is alphanumeric? We need range comparisons (<=, >=) and different branch types.
Yes. To handle character ranges like [a-z], [A-Z], [0-9], we need to expand CMP and split BRANCH into variants.
Updated operator set (10 operations):
| Operation | What it does | Why an FSM needs it | x86 equivalent |
|---|---|---|---|
| READ | Get current input symbol | Consume input | mov al, [rsi] |
| CMP | Compare two values, set flags | Every transition needs comparison | cmp al, 'a' |
| BRANCH_EQ | Jump if equal | Exact character match | je label |
| BRANCH_NE | Jump if not equal | Skip when no match | jne label |
| BRANCH_LT | Jump if less than | Lower bound of range check | jb label |
| BRANCH_GT | Jump if greater than | Upper bound of range check | ja label |
| JMP | Unconditional jump | Loop, skip blocks | jmp label |
| ASSIGN | Write value to register | Update state | mov reg, val |
| ADVANCE | Move input pointer forward | Step through input | inc rsi |
| HALT | Stop execution | Accept or reject |
ret / syscall
|
We split the old BRANCH into four variants (EQ, NE, LT, GT). CMP now sets flags rather than returning a boolean — exactly like x86 cmp. Range checks like [a-z] become:
CMP char, 'a'
BRANCH_LT fail ; char < 'a' → not lowercase
CMP char, 'z'
BRANCH_GT fail ; char > 'z' → not lowercase
; if we reach here, 'a' <= char <= 'z'
FSM for <div>[a-zA-Z0-9]+</div>
States and what has been consumed:
| State | Consumed so far | Expects next |
|---|---|---|
| S0 | (start) | < |
| S1 | < |
d |
| S2 | <d |
i |
| S3 | <di |
v |
| S4 | <div |
> |
| S5 | <div> |
first alphanumeric |
| S6 | <div>text |
more alphanumeric or <
|
| S7 | <div>text< |
/ |
| S8 | <div>text</ |
d |
| S9 | <div>text</d |
i |
| S10 | <div>text</di |
v |
| S11 | <div>text</div |
> |
| ACCEPT | <div>text</div> |
(end of input) |
| DEAD | (error) | (absorb remaining) |
Pseudocode using the 10 operators:
ASSIGN state, S0
LOOP:
; ── DEAD: absorb remaining input ──
CMP state, DEAD
BRANCH_EQ next
; ── ACCEPT + more input → error ──
CMP state, ACCEPT
BRANCH_EQ to_dead
READ char
; ── S0: expecting '<' ──
CMP state, S0
BRANCH_NE not_s0
CMP char, '<'
BRANCH_NE to_dead
ASSIGN state, S1
JMP next
not_s0:
; ── S1: expecting 'd' ──
CMP state, S1
BRANCH_NE not_s1
CMP char, 'd'
BRANCH_NE to_dead
ASSIGN state, S2
JMP next
not_s1:
; ── S2: expecting 'i' ──
CMP state, S2
BRANCH_NE not_s2
CMP char, 'i'
BRANCH_NE to_dead
ASSIGN state, S3
JMP next
not_s2:
; ── S3: expecting 'v' ──
CMP state, S3
BRANCH_NE not_s3
CMP char, 'v'
BRANCH_NE to_dead
ASSIGN state, S4
JMP next
not_s3:
; ── S4: expecting '>' ──
CMP state, S4
BRANCH_NE not_s4
CMP char, '>'
BRANCH_NE to_dead
ASSIGN state, S5
JMP next
not_s4:
; ── S5: expecting first alphanumeric ──
CMP state, S5
BRANCH_NE not_s5
JMP check_alnum_s6 ; reuse alnum check, go to S6
not_s5:
; ── S6: more alphanumeric or '<' ──
CMP state, S6
BRANCH_NE not_s6
CMP char, '<'
BRANCH_NE s6_alnum
ASSIGN state, S7
JMP next
s6_alnum:
JMP check_alnum_stay ; stay in S6
not_s6:
; ── S7: expecting '/' ──
CMP state, S7
BRANCH_NE not_s7
CMP char, '/'
BRANCH_NE to_dead
ASSIGN state, S8
JMP next
not_s7:
; ── S8: expecting 'd' ──
CMP state, S8
BRANCH_NE not_s8
CMP char, 'd'
BRANCH_NE to_dead
ASSIGN state, S9
JMP next
not_s8:
; ── S9: expecting 'i' ──
CMP state, S9
BRANCH_NE not_s9
CMP char, 'i'
BRANCH_NE to_dead
ASSIGN state, S10
JMP next
not_s9:
; ── S10: expecting 'v' ──
CMP state, S10
BRANCH_NE not_s10
CMP char, 'v'
BRANCH_NE to_dead
ASSIGN state, S11
JMP next
not_s10:
; ── S11: expecting '>' ──
CMP state, S11
BRANCH_NE to_dead
CMP char, '>'
BRANCH_NE to_dead
ASSIGN state, ACCEPT
JMP next
; ── alphanumeric check → transition to S6 ──
check_alnum_s6:
CMP char, 'a'
BRANCH_LT upper_s6
CMP char, 'z'
BRANCH_GT upper_s6
ASSIGN state, S6
JMP next
upper_s6:
CMP char, 'A'
BRANCH_LT digit_s6
CMP char, 'Z'
BRANCH_GT digit_s6
ASSIGN state, S6
JMP next
digit_s6:
CMP char, '0'
BRANCH_LT to_dead
CMP char, '9'
BRANCH_GT to_dead
ASSIGN state, S6
JMP next
; ── alphanumeric check → stay in S6 ──
check_alnum_stay:
CMP char, 'a'
BRANCH_LT upper_stay
CMP char, 'z'
BRANCH_GT upper_stay
JMP next
upper_stay:
CMP char, 'A'
BRANCH_LT digit_stay
CMP char, 'Z'
BRANCH_GT digit_stay
JMP next
digit_stay:
CMP char, '0'
BRANCH_LT to_dead
CMP char, '9'
BRANCH_GT to_dead
JMP next
to_dead:
ASSIGN state, DEAD
next:
ADVANCE
CMP input_end, true
BRANCH_NE LOOP
; ── input exhausted ──
CMP state, ACCEPT
BRANCH_NE reject
HALT accept
reject:
HALT reject
Trace with input <div>hello</div>:
| Step | char | state | transition |
|---|---|---|---|
| 1 | < |
S0 → S1 | exact match |
| 2 | d |
S1 → S2 | exact match |
| 3 | i |
S2 → S3 | exact match |
| 4 | v |
S3 → S4 | exact match |
| 5 | > |
S4 → S5 | exact match |
| 6 | h |
S5 → S6 | range: 'a' ≤ 'h' ≤ 'z'
|
| 7 | e |
S6 → S6 | range: 'a' ≤ 'e' ≤ 'z'
|
| 8 | l |
S6 → S6 | range: 'a' ≤ 'l' ≤ 'z'
|
| 9 | l |
S6 → S6 | range: 'a' ≤ 'l' ≤ 'z'
|
| 10 | o |
S6 → S6 | range: 'a' ≤ 'o' ≤ 'z'
|
| 11 | < |
S6 → S7 | exact match |
| 12 | / |
S7 → S8 | exact match |
| 13 | d |
S8 → S9 | exact match |
| 14 | i |
S9 → S10 | exact match |
| 15 | v |
S10 → S11 | exact match |
| 16 | > |
S11 → ACCEPT | exact match |
End of input, state is ACCEPT → accepted.
What's next
At this point our FSM can recognize whether a string like <div>hello</div> is valid — it answers yes or no. But a real tokenizer needs to do more than accept or reject. It needs to emit tokens: structured data that tells the next stage of the parser what was found and where.
That means we need a data structure to hold the tokenizer's output — something that captures the tag names, the text content, and the boundaries between them. In the next post, we'll design that output format and turn our recognizer into a proper tokenizer that produces a stream of tokens.
Stay tuned.


Top comments (0)