Parser Theory: Q/A with Claude Opus
In part 1, we built a working FSM that recognizes <div>text</div> using just 7 primitives mapped 1:1 to assembly opcodes. But FSMs have a hard limit: they can't handle nested structures like <div><div>hello</div></div>. In this post, we climb the Chomsky hierarchy from finite state machines to pushdown automata, build a PDA that recognizes nested <div> tags, and then turn it into a transducer that emits tokens. In other words we are building the core of a lexer.
Q: Why can't FSMs handle nested structures?
Because an FSM has a fixed number of states, and that's all the memory it has.
Consider nested divs:
<div><div><div>hello</div></div></div>
To correctly match closing tags, you need to count how many <div>s you've opened so you know how many </div>s to expect. An FSM with, say, 12 states can handle nesting up to some fixed depth — but someone can always write HTML nested one level deeper than your states can track.
Put simply:
- 1 level deep → you need states to remember "I'm 1 div in"
- 2 levels deep → different states for "I'm 2 divs in"
- n levels deep → you'd need n distinct state configurations
Since n is unbounded but your state count is finite, you run out of states. There's no way to encode "how deep am I?" in a fixed-size state register when the depth is arbitrary.
This is exactly what the stack in a pushdown automaton solves. Every time you see <div>, you push onto the stack; every </div>, you pop. The stack grows as needed, giving you unbounded memory in one direction (depth), which is precisely what nesting requires.
Q: So, FSMs needs memory to manage remember where they are in the nested structure, correct?
It depends on what you mean by "memory."
An FSM does have memory, but it's extremely limited. The current state itself is the memory. If your FSM is in state S7, that encodes something about what it has seen so far. That's how it knows it's partway through reading </div> vs still in content.
But that memory is:
- Fixed-size, bounded by the number of states, decided at design time
- Lossy, it can't remember the full history of what it read, only a compressed summary encoded in which state it's in
- Non-growable, it can't allocate more memory as input gets longer
So when people say "FSMs have no memory," they really mean no auxiliary memory, that means no stack, no tape, no buffer that grows with input. The only thing an FSM "remembers" is which state it's currently in.
A PDA adds exactly one auxiliary memory structure: a stack. That single addition is enough to go from "can't handle nesting" to "can handle nesting."
Q: What is the stack, formally?
A stack is a last-in, first-out (LIFO) data structure with two operations:
- PUSH(symbol): place a symbol on top
- POP: remove and return the top symbol
So if your stack is ABB, a PUSH(C) gives ABBC, and a POP returns C leaving ABB again.
The key properties:
- You can only access the top: no reading the middle or bottom
- Unbounded size: it can grow as large as needed (unlike the FSM's fixed states)
-
It starts empty: at the beginning of computation, the stack contains nothing (or a special bottom marker like
$)
That's it. It's the simplest possible auxiliar memory that's still useful. You get only sequential access from one end. But that restriction is exactly what matches the structure of nesting: the most recently opened thing is the next thing that needs to close.
Q: Which operations and structures do we need to add to the seven primitives to create a PDA that handles nested structures?
We need two new operations and one new structure:
New structure
- STACK: a LIFO storage addressed by a stack pointer (SP)
Updated primitives table
| # | Operation | Description | Type | Assembly equivalent |
|---|---|---|---|---|
| 1 | READ | Load current input symbol | FSM | MOV reg, [src] |
| 2 | CMP | Test value equality | FSM | CMP reg, imm |
| 3 | BRANCH | Conditional jump | FSM | JE / JNE label |
| 4 | JMP | Unconditional jump | FSM | JMP label |
| 5 | ASSIGN | Update state register | FSM | MOV reg, imm |
| 6 | ADVANCE | Move input pointer forward | FSM | INC reg |
| 7 | HALT | Terminate execution | FSM | HLT |
| 8 | PUSH | Place symbol on top of stack | PDA | PUSH reg |
| 9 | POP | Remove and return top of stack | PDA | POP reg |
Two operations and one structure turn an FSM into a PDA.
x86 has native PUSH/POP instructions with a hardware stack pointer (SP/ESP/RSP), so the 1:1 mapping to CPU instructions still holds. PDAs are just as mechanically efficient as FSMs. The only cost is that memory usage now grows with nesting depth instead of being fixed.
You could also add a PEEK operation (read top without removing), but it's not strictly necessary — you can always POP then PUSH back. It's a convenience, not a primitive.
Q: What's the simplest PDA architecture?
A PDA architecture needs exactly four components:
- Input tape — the string being read (read-only, left-to-right)
- Input pointer — current position on the tape
- State register — holds the current state from a finite set Q
- Stack — LIFO storage over alphabet Γ
And the nine operations from our table are the instruction set that operates on these four components.
Compared to the FSM architecture from part 1, we added exactly one component (the stack) and two operations (PUSH, POP). The stack pointer (SP) lives inside the finite control alongside the state register — it's a register, not part of the stack itself.
The PDA architecture is a von Neumann architecture:
| PDA component | Von Neumann equivalent |
|---|---|
| Input tape | Memory (data segment) |
| Input pointer | Program counter / index register |
| State register | CPU register (accumulator) |
| Stack pointer | SP register |
| Stack (in RAM) | Stack segment of memory |
| 9 operations | Subset of the instruction set |
What's missing for full Von Neumann: writable memory, random access, arithmetic, and stored program. The automata hierarchy maps directly onto hardware capability:
- FSM → combinational logic + a register
- PDA → CPU + stack segment
- Turing machine → CPU + full random-access memory = von Neumann
Building a PDA for nested <div> recognition
Pseudocode
STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CLOSE_SLASH,
CLOSE_NAME, ACCEPT, DEAD
STACK ALPHABET: Γ = { DIV, $ }
START:
PUSH($) -- bottom marker
ASSIGN(S0)
S0:
READ(ch)
CMP(ch, '<')
BRANCH(eq → TAG_OPEN)
JMP(DEAD)
TAG_OPEN:
ADVANCE
READ(ch)
CMP(ch, '/')
BRANCH(eq → CLOSE_SLASH) -- it's a closing tag
CMP(ch, 'd')
BRANCH(eq → TAG_NAME) -- it's an opening tag
JMP(DEAD)
TAG_NAME: -- read "div>"
ADVANCE
READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
PUSH(DIV) -- ← matched <div>, push it
ADVANCE
ASSIGN(CONTENT)
CONTENT:
READ(ch)
CMP(ch, '<')
BRANCH(eq → TAG_OPEN) -- ← could be nested <div> or </div>
CMP(ch, EOF)
BRANCH(eq → DEAD) -- input ended with open tags
ADVANCE -- consume text character
JMP(CONTENT)
CLOSE_SLASH: -- inside </
ADVANCE
READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
POP(top) -- ← pop the stack
CMP(top, DIV)
BRANCH(ne → DEAD) -- mismatch
ADVANCE
READ(ch)
CMP(ch, EOF)
BRANCH(eq → CHECK_EMPTY)
ASSIGN(CONTENT) -- more input, keep going
CHECK_EMPTY:
POP(top)
CMP(top, $)
BRANCH(eq → ACCEPT) -- stack empty = all tags matched
JMP(DEAD) -- leftover tags = unbalanced
ACCEPT:
HALT(accept)
DEAD:
HALT(reject)
The entire FSM logic is reused. The only additions are three PUSH/POP calls. One PUSH per <div>, one POP per </div>, one final POP to verify the stack is empty.
Trace: <div><div>hi</div></div>
Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
0 1 2 3 4 5 6 7 8 9 ...
Stack starts: [ $ ]
1. First <div>
State | Read | Action | Stack
----------|------|----------------------|--------
S0 | < | match, → TAG_OPEN | [ $ ]
TAG_OPEN | d | not '/', → TAG_NAME | [ $ ]
TAG_NAME | i | match | [ $ ]
| v | match | [ $ ]
| > | match, PUSH(DIV) | [ $ DIV ]
| | → CONTENT |
2. Second <div>
State | Read | Action | Stack
----------|------|----------------------|-----------
CONTENT | < | → TAG_OPEN | [ $ DIV ]
TAG_OPEN | d | not '/', → TAG_NAME | [ $ DIV ]
TAG_NAME | i | match | [ $ DIV ]
| v | match | [ $ DIV ]
| > | match, PUSH(DIV) | [ $ DIV DIV ]
| | → CONTENT |
3. Text "hi"
State | Read | Action | Stack
----------|------|----------------------|-------------
CONTENT | h | not '<', ADVANCE | [ $ DIV DIV ]
CONTENT | i | not '<', ADVANCE | [ $ DIV DIV ]
4. First </div>
State | Read | Action | Stack
------------|------|--------------------|--------------
CONTENT | < | → TAG_OPEN | [ $ DIV DIV ]
TAG_OPEN | / | → CLOSE_SLASH | [ $ DIV DIV ]
CLOSE_SLASH | d | match | [ $ DIV DIV ]
| i | match | [ $ DIV DIV ]
| v | match | [ $ DIV DIV ]
| > | match, POP → DIV | [ $ DIV ]
| | DIV == DIV ✓ |
| | read next |
5. Second </div>
State | Read | Action | Stack
------------|------|--------------------|---------
CONTENT | < | → TAG_OPEN | [ $ DIV ]
TAG_OPEN | / | → CLOSE_SLASH | [ $ DIV ]
CLOSE_SLASH | d | match | [ $ DIV ]
| i | match | [ $ DIV ]
| v | match | [ $ DIV ]
| > | match, POP → DIV | [ $ ]
| | DIV == DIV ✓ |
| | read next |
6. End of input
State | Read | Action | Stack
------------|------|--------------------|---------
| EOF | → CHECK_EMPTY | [ $ ]
CHECK_EMPTY | | POP → $ | [ ]
| | $ == $ ✓ |
| | → ACCEPT |
Result: ACCEPT
The stack grew to depth 2 (two DIVs), then shrank back to just $, confirming every open tag had a matching close. An FSM would have needed dedicated states for each depth level. This PDA handles any depth with the same 8 states.
Q: Our PDA can parse nested structures and output accept or reject. But a real tokenizer needs to emit tokens, right?
Right. The PDA is still just a recognizer. To become a tokenizer, we need to understand what changes formally.
A recognizer takes an input string and produces a single binary answer: accept or reject.
- Input: a string over alphabet
- Output: yes or no
A transducer takes an input string and produces an output string (or sequence of output symbols). It transforms input into output.
- Input: a string over alphabet
- Output: a string y over an output alphabet Δ
A tokenizer is a transducer: it reads characters (Σ = ASCII) and emits tokens (Δ = the set of possible tokens).
Q: What is ?
An output buffer is a tape, simpler than the stack:
- Append-only: you can only write to the end, never read back or modify
- Write-only: the machine never consults its own output to make decisions
- Unbounded: it grows as tokens are emitted
- Starts empty
Output structure
OUTPUT ALPHABET: Δ = TokenType × Σ*
TOKEN TYPES: TokenType = { OPEN_TAG, CLOSE_TAG, TEXT }
OUTPUT BUFFER: append-only list of (type, value) pairs, starts empty
For <div><div>hi</div></div>, the output buffer should end up as:
[
(OPEN_TAG, "div"),
(OPEN_TAG, "div"),
(TEXT, "hi"),
(CLOSE_TAG, "div"),
(CLOSE_TAG, "div")
]
Updated operations table (11 operations)
| # | Operation | Description | Type | Assembly |
|---|---|---|---|---|
| 1 | READ | Load current input symbol | FSM | MOV reg, [src] |
| 2 | CMP | Test value equality | FSM | CMP reg, reg/imm |
| 3 | BRANCH | Conditional jump | FSM | JE / JNE label |
| 4 | JMP | Unconditional jump | FSM | JMP label |
| 5 | ASSIGN | Update state register | FSM | MOV reg, imm |
| 6 | ADVANCE | Move input pointer forward | FSM | INC reg |
| 7 | HALT | Terminate execution | FSM | HLT |
| 8 | PUSH | Place symbol on top of stack | PDA | PUSH reg |
| 9 | POP | Remove and return top of stack | PDA | POP reg |
| 10 | MARK | Save current input position | Transducer | MOV reg, reg |
| 11 | EMIT | Append token to output buffer | Transducer | MOV [dst], reg |
Transducer Architecture: 6 components
- Input tape (read-only)
- Input pointer
- State register
- Stack pointer + Stack
- Mark register — saves start position of current token
- Output buffer — append-only, unbounded
Pseudocode (transducer)
STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CONTENT_LOOP,
CONTENT_DONE, CLOSE_SLASH, CHECK_EMPTY, ACCEPT, DEAD
INPUT ALPHABET: Σ = ASCII
STACK ALPHABET: Γ = { DIV, $ }
OUTPUT ALPHABET: Δ = { OPEN_TAG, CLOSE_TAG, TEXT } × Σ*
START:
PUSH($)
ASSIGN(S0)
S0:
READ(ch)
CMP(ch, '<')
BRANCH(eq → TAG_OPEN)
JMP(DEAD)
TAG_OPEN:
ADVANCE
READ(ch)
CMP(ch, '/')
BRANCH(eq → CLOSE_SLASH)
CMP(ch, 'd')
BRANCH(eq → TAG_NAME)
JMP(DEAD)
TAG_NAME:
ADVANCE
READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
PUSH(DIV)
EMIT(OPEN_TAG, "div") -- ← emit opening tag token
ADVANCE
ASSIGN(CONTENT)
CONTENT:
MARK -- ← save start of text region
CONTENT_LOOP:
READ(ch)
CMP(ch, '<')
BRANCH(eq → CONTENT_DONE)
CMP(ch, EOF)
BRANCH(eq → DEAD)
ADVANCE
JMP(CONTENT_LOOP)
CONTENT_DONE:
CMP(mark, pos) -- any text between tags?
BRANCH(eq → TAG_OPEN) -- no text, skip emit
EMIT(TEXT, input[mark..pos]) -- ← emit text token
JMP(TAG_OPEN)
CLOSE_SLASH:
ADVANCE
READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
ADVANCE
READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
POP(top)
CMP(top, DIV)
BRANCH(ne → DEAD)
EMIT(CLOSE_TAG, "div") -- ← emit closing tag token
ADVANCE
READ(ch)
CMP(ch, EOF)
BRANCH(eq → CHECK_EMPTY)
ASSIGN(CONTENT)
CHECK_EMPTY:
POP(top)
CMP(top, $)
BRANCH(eq → ACCEPT)
JMP(DEAD)
ACCEPT:
HALT(accept) -- output buffer contains all tokens
DEAD:
HALT(reject)
Trace: <div><div>hi</div></div> (transducer)
Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
Pos: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Stack starts: [ $ ]
Output starts: [ ]
1. First <div>
State | Pos | Read | Action | Stack | Output
------------|-----|------|----------------------|-----------|-------
S0 | 0 | < | match, → TAG_OPEN | [ $ ] |
TAG_OPEN | 1 | d | not '/', → TAG_NAME | [ $ ] |
TAG_NAME | 2 | i | match | [ $ ] |
| 3 | v | match | [ $ ] |
| 4 | > | match | [ $ ] |
| | | PUSH(DIV) | [ $ DIV ] |
| | | EMIT(OPEN_TAG,"div")| | ← (OPEN_TAG, "div")
| 5 | | ADVANCE, → CONTENT | |
2. Enter CONTENT, see < immediately
State | Pos | Read | Action | Stack | Output
-------------|-----|------|--------------------|-----------|-------
CONTENT | 5 | | MARK → mark=5 | [ $ DIV ] |
CONTENT_LOOP | 5 | < | CMP(mark,pos): 5=5 | |
| | | equal, skip emit | |
| | | → TAG_OPEN | |
3. Second <div>
State | Pos | Read | Action | Stack | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN | 6 | d | not '/', → TAG_NAME | [ $ DIV ] |
TAG_NAME | 7 | i | match | [ $ DIV ] |
| 8 | v | match | [ $ DIV ] |
| 9 | > | match | [ $ DIV ] |
| | | PUSH(DIV) | [ $ DIV DIV ] |
| | | EMIT(OPEN_TAG,"div")| | ← (OPEN_TAG, "div")
| 10 | | ADVANCE, → CONTENT | |
4. Text "hi"
State | Pos | Read | Action | Stack | Output
-------------|-----|------|--------------------|---------------|-------
CONTENT | 10 | | MARK → mark=10 | [ $ DIV DIV ] |
CONTENT_LOOP | 10 | h | not '<', ADVANCE | |
CONTENT_LOOP | 11 | i | not '<', ADVANCE | |
CONTENT_LOOP | 12 | < | CMP(mark,pos):10≠12| |
| | | EMIT(TEXT, "hi") | | ← (TEXT, "hi")
| | | → TAG_OPEN | |
5. First </div>
State | Pos | Read | Action | Stack | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN | 13 | / | → CLOSE_SLASH | [ $ DIV DIV ] |
CLOSE_SLASH | 14 | d | match | |
| 15 | i | match | |
| 16 | v | match | |
| 17 | > | match | |
| | | POP → DIV | [ $ DIV ] |
| | | DIV == DIV ✓ | |
| | | EMIT(CLOSE_TAG,"div")| | ← (CLOSE_TAG, "div")
| 18 | < | ADVANCE, not EOF | |
| | | → CONTENT | |
6. Enter CONTENT, see < immediately
State | Pos | Read | Action | Stack | Output
-------------|-----|------|----------------------|-----------|-------
CONTENT | 18 | | MARK → mark=18 | [ $ DIV ] |
CONTENT_LOOP | 18 | < | CMP(mark,pos): 18=18| |
| | | equal, skip emit | |
| | | → TAG_OPEN | |
7. Second </div>
State | Pos | Read | Action | Stack | Output
------------|-----|------|------------------------|-----------|-------
TAG_OPEN | 19 | / | → CLOSE_SLASH | [ $ DIV ] |
CLOSE_SLASH | 20 | d | match | |
| 21 | i | match | |
| 22 | v | match | |
| 23 | > | match | |
| | | POP → DIV | [ $ ] |
| | | DIV == DIV ✓ | |
| | | EMIT(CLOSE_TAG, "div") | | ← (CLOSE_TAG, "div")
| 24 | EOF | ADVANCE, EOF | |
| | | → CHECK_EMPTY | |
8. End of input
State | Pos | Read | Action | Stack | Output
------------|-----|------|----------------------|-------|-------
CHECK_EMPTY | | | POP → $ | [ ] |
| | | $ == $ ✓ | |
| | | → ACCEPT | |
ACCEPT | | | HALT(accept) | |
Final output buffer
[
(OPEN_TAG, "div"),
(OPEN_TAG, "div"),
(TEXT, "hi"),
(CLOSE_TAG, "div"),
(CLOSE_TAG, "div")
]
Same 10 states, same stack operations. The three EMITs fired at exactly the right boundaries, and the two CMP(mark, pos) guards correctly skipped empty text regions between consecutive tags.
Q: Is the tokenizer a lexer?
Yes. Tokenizer, lexer, lexical analyzer, scanner, all the same thing. Characters in, tokens out.
The distinction that matters is lexer vs parser:
- Lexer (what we built): characters in, tokens out. Flat stream. Answers: "what are the words?"
- Parser: tokens in, tree out. Hierarchical structure. Answers: "how do the words relate to each other?"
Our transducer is a lexer. It outputs a flat list:
(OPEN_TAG, "div"), (OPEN_TAG, "div"), (TEXT, "hi"), (CLOSE_TAG, "div"), (CLOSE_TAG, "div")
A parser would produce a tree:
div
└─ div
└─ "hi"
Our PDA already knows the tree structure in some way, the stack depth at any point tells us exactly where we are in the hierarchy. We just don't capture it. We throw that information away and emit a flat stream. Building the tree means using the stack to capture that structure as we go.
What's next
We started this post with an FSM that couldn't count, and ended with a transducer that emits tokens from arbitrarily nested HTML. Along the way, a pattern kept surfacing: climbing the Chomsky hierarchy (FSM → PDA → Turing machine) mirrors the historical evolution of CPU architecture (register → stack segment → full random-access memory). That parallel is not a coincidence, and we will keep pulling on that thread.
Next up: a lexer becomes a parser, a Turing Complete machine.
If you have questions or want to dig deeper into any of these topics, drop a comment below. Follow to catch Part 3 when it lands. Stay tuned...





Top comments (0)