From Tokens to Trees: Four Paths to a Full Parser
In part 2, we built a pushdown automaton transducer: 11 operations, 6 components, one stack. Our PDA transducer turns nested <div> tags into a flat stream of tokens. In this post, we will explore what we need to add to get a simple but practical parser that outputs a DOM tree.
Q: Is the transducer enough to parse HTML?
No. The transducer takes <div><div>hi</div></div> and emits:
[(OPEN, "div"), (OPEN, "div"), (TEXT, "hi"),
(CLOSE, "div"), (CLOSE, "div")]
That's a flat list. It tells you what was in the input, but not how things relate to each other. A parser needs to produce a tree, a structure where the outer <div> is the parent and the inner <div> is its child:
div
└── div
└── "hi"
Our transducer can't do this. Its only writable memory is the stack, and the stack is consumed during validation. Every PUSH is matched by a POP to check nesting. By the time the transducer is done, the stack is empty. There's nowhere to store the tree.
To become a parser, the PDA needs a way to build and keep a tree structure in memory. Not an easy task.
Q: Are there multiple ways to give the PDA tree-building ability?
Yes. Four classical computational models, each adding something different to the same PDA architecture we built in part 2.
| Path | What you add | Computation model | Language family |
|---|---|---|---|
| 2-Stack machine | A second stack | Stack machine | Forth, PostScript, Factor |
| Tree rewriting | A pool of cons cells + substitution rules | Lambda calculus / rewriting | Lisp, Scheme |
| Stack combinators | Combinator primitives (dup, swap, quote, apply) | Combinatory logic | Joy, Cat |
| Register machine | Writable random-access memory + ALU | Von Neumann | Assembly, C |
All four give the PDA persistent, writable memory to store a tree. But each structures that memory differently. Each created a different programming tradition.
The 2-Stack Machine (Forth, Factor)
We add a second LIFO stack. That's it. Two stacks cooperating through the finite control. This is the stack machine, the model behind Forth (Charles Moore, 1970), PostScript, and Factor.
Stack A keeps its original job: validating nesting (PUSH on open, POP on close, CMP to check the match). Stack B is new. It accumulates completed subtrees. When a closing tag matches, the parser pops the validated tag from Stack A and pushes the finished node (with its children) onto Stack B.
Parsing <div><div>hi</div></div>: The transducer reads tokens as before. When it encounters (CLOSE, "div") for the inner div, it pops DIV from Stack A (validation) and pushes a completed node div("hi") onto Stack B. When the outer </div> closes, it pops from Stack A again and assembles the final tree by popping the inner node from Stack B as its child. The result on Stack B: div(div("hi")).
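The walk-through above can be sketched in a few lines of Python. Everything here is illustrative: the token format comes from part 2, but the Node class, the marker trick that delimits each node's children on Stack B, and the function names are assumptions of this sketch, not the series' actual implementation.

```python
class Node:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []

    def __repr__(self):
        if not self.children:
            return self.tag
        return f"{self.tag}({', '.join(map(repr, self.children))})"

def parse_two_stack(tokens):
    stack_a = []  # validation: open tags awaiting their close
    stack_b = []  # accumulation: completed subtrees
    for kind, value in tokens:
        if kind == "OPEN":
            stack_a.append(value)            # PUSH on Stack A
            stack_b.append(("MARK", value))  # marker delimits this node's children
        elif kind == "TEXT":
            stack_b.append(value)
        elif kind == "CLOSE":
            if not stack_a or stack_a.pop() != value:  # POP + CMP on Stack A
                raise ValueError(f"mismatched </{value}>")
            children = []
            while stack_b[-1] != ("MARK", value):
                children.append(stack_b.pop())
            stack_b.pop()       # drop the marker
            children.reverse()  # they came off the stack backwards
            stack_b.append(Node(value, children))  # finished subtree onto Stack B
    if stack_a:
        raise ValueError(f"unclosed <{stack_a[-1]}>")
    return stack_b

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
print(parse_two_stack(tokens)[0])  # prints: div(div('hi'))
```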
Two stacks can also simulate a Turing machine tape: one stack holds everything left of the head, the other holds everything right. Move left = pop from left, push to right. Move right = the reverse. This is a well-known equivalence: a 2-stack PDA equals a Turing machine.
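The tape trick is a few lines on its own. A minimal sketch, with illustrative function names; the right stack's top plays the role of the head cell:

```python
def move_left(left, right):
    right.append(left.pop())   # the cell to the left becomes the new head

def move_right(left, right):
    left.append(right.pop())   # the head cell moves onto the left stack

# Tape "a b c" with the head on 'a': the head cell is the top of `right`.
left, right = [], ["c", "b", "a"]
move_right(left, right)   # head moves to 'b'
move_right(left, right)   # head moves to 'c'
move_left(left, right)    # back to 'b'
```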
| Pros | Cons |
|---|---|
| Minimal addition, only 2 new operations | Trees must be linearized into stack form |
| Closest to the existing PDA architecture | Deep trees require deep stacks, no random access |
| Simple to implement and reason about | Awkward when you need to revisit earlier nodes |
Tree Rewriting (Lisp, Scheme)
We replace the stack with a pool of dynamically allocated cons cells. Each cell holds a pair of pointers: left (CAR) and right (CDR), forming a binary tree. Computation is pattern matching on tree structure and substitution: find a subtree that matches a rule, replace it with the result. This is the lambda calculus model: the foundation of Lisp (Alonzo Church, 1936; John McCarthy, 1958), Scheme, and Racket.
The finite control drives parsing as before, but instead of pushing symbols onto a stack, it allocates cons cells in memory. Each opening tag creates a new node. Each closing tag triggers a CONS that links children to their parent. The tree is built directly in memory, with no linearization needed.
What you add to the PDA:
| New operations | New data structures |
|---|---|
| CONS, CAR, CDR | Pool of dynamically allocated cons cells |
Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it allocates a leaf cell holding "hi". When (CLOSE, "div") arrives for the inner div, a CONS links "hi" as the child of a new div node. When the outer </div> closes, another CONS links the inner div node as the child of the outer div node. The result is a tree in memory: (div . (div . "hi")).
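A hedged sketch of this path, modeling cons cells as plain Python tuples. It is simplified to the single-child chain of this example (each closing tag links exactly one child under its parent), and the function names are assumptions of the sketch:

```python
def cons(a, d): return (a, d)   # allocate a fresh cell
def car(cell):  return cell[0]
def cdr(cell):  return cell[1]

def parse_cons(tokens):
    open_tags = []  # the finite control still validates nesting
    result = None
    for kind, value in tokens:
        if kind == "OPEN":
            open_tags.append(value)
        elif kind == "TEXT":
            result = value  # leaf cell
        elif kind == "CLOSE":
            if not open_tags or open_tags.pop() != value:
                raise ValueError(f"mismatched </{value}>")
            result = cons(value, result)  # link the child under its parent tag
    return result

tree = parse_cons([("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
                   ("CLOSE", "div"), ("CLOSE", "div")])
print(tree)  # prints: ('div', ('div', 'hi')), i.e. (div . (div . "hi"))
```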
| Pros | Cons |
|---|---|
| Trees are the native data structure. No encoding needed | Requires garbage collection or manual memory management |
| Natural representation for S-expressions: code and data share the same form | More complex memory model than a stack |
| Pattern matching + substitution is powerful for tree transformations | Pointer chasing is slow on modern hardware (poor cache locality) |
Stack Combinators (Joy, Cat)
We keep one stack. Don't add memory. Instead, change what the stack is allowed to hold. In our PDA, the stack holds passive symbols (like DIV). In a combinator machine, the stack holds executable programs. quote takes an operation and pushes it as data. apply pops it and executes it. The stack becomes simultaneously data storage and code. This is the foundation of concatenative languages: Joy (Manfred von Thun, 1990s), Cat, and Factor. The theoretical roots go back to Moses Schönfinkel (1920s) and Haskell Curry (1930s).
The parser quotes partial tree fragments and pushes them onto the stack as data. When a closing tag arrives, combinators like swap and dup rearrange the fragments, and apply assembles them into a larger tree. No heap, no RAM, no second stack. The stack itself is the only memory, and its entries can be either data or programs.
What you add to the PDA:
| New operations | New data structures |
|---|---|
| DUP, SWAP, QUOTE, APPLY | None, actually: stack entries just become richer (data + code) |
Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it quotes the text as a data fragment and pushes it. When (CLOSE, "div") arrives, it quotes a make-div-node operation, swaps it under the "hi" fragment, and applies, producing a quoted div("hi") node on the stack. The outer </div> repeats the pattern, composing the inner node into the outer one. The result: a single quoted tree [div(div("hi"))] on top of the stack.
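Here's an illustrative sketch of that machine: one stack whose entries can be data or programs. One simplification versus the trace above: instead of swapping the builder under the fragment, the quoted make_div (a hypothetical helper) pops its argument directly when applied, so dup and swap are defined as primitives but this linear example happens not to need them.

```python
stack = []

def push(x):   stack.append(x)
def dup():     stack.append(stack[-1])           # duplicate the top entry
def swap():    stack[-1], stack[-2] = stack[-2], stack[-1]
def quote(op): stack.append(op)                  # push a program as data
def apply_():  stack.pop()()                     # pop a program and run it

def make_div():
    stack.append(("div", stack.pop()))           # wrap the top fragment in a div node

# Parsing <div><div>hi</div></div>:
push("hi")        # TEXT: the leaf fragment, pushed as data
quote(make_div)   # inner CLOSE: quote the node builder...
apply_()          # ...and apply it; stack is now [("div", "hi")]
quote(make_div)   # outer CLOSE repeats the pattern
apply_()          # stack: [("div", ("div", "hi"))]
```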
| Pros | Cons |
|---|---|
| No new memory means zero hardware cost | Hardest model to reason about for most developers |
| Elegant point-free composition. No variables, no names | Stack shuffling (swap, dup, rot) gets complex fast |
| Code and data unification is powerful for metaprogramming | Small community, fewer learning resources |
The Register Machine (Assembly, C)
We replace the stack with addressable RAM. We add an ALU with arithmetic operations, and the finite control gains index registers for addressing. This is the Von Neumann architecture, the model John von Neumann described in 1945, building on Alan Turing's theoretical work (1936). Every mainstream CPU (x86, ARM, RISC-V) is essentially this model.
Random access changes everything. The parser allocates tree nodes at arbitrary memory addresses and links them by pointer. Need to revisit a node three levels up? Just follow the address — no popping through a stack. The tree lives in RAM as a set of linked records, each holding a tag name, a pointer to its first child, and a pointer to its next sibling.
What you add to the PDA:
| New operations | New data structures |
|---|---|
| LOAD, STORE, ADD, SUB, MUL | RAM + ALU + index registers |
Parsing <div><div>hi</div></div>: The parser allocates a record at address 0x00 for the outer div. When (OPEN, "div") arrives for the inner div, it allocates at 0x01 and writes 0x00 as its parent pointer. When (TEXT, "hi") arrives, it allocates at 0x02 and writes 0x01 as its parent. The closing tags update child pointers. The result is a linked structure in RAM:
| Address | Tag | First child | Next sibling |
|---|---|---|---|
| 0x00 | div | → 0x01 | null |
| 0x01 | div | → 0x02 | null |
| 0x02 | "hi" | null | null |
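The same walk can be sketched with RAM as a flat Python list of [tag, first_child, next_sibling] records, mirroring the table above. The record layout, the alloc helper, and the small list of open addresses (which a real register machine would itself keep in RAM) are assumptions of this sketch; it is also simplified to one child per node, which is enough for this example.

```python
RAM = []

def alloc(tag):
    RAM.append([tag, None, None])  # STORE a fresh [tag, first_child, next_sibling]
    return len(RAM) - 1            # its address is simply the index

def parse_ram(tokens):
    open_addrs = []     # addresses of currently open nodes
    last_closed = None
    for kind, value in tokens:
        if kind == "OPEN":
            addr = alloc(value)
            if open_addrs:
                RAM[open_addrs[-1]][1] = addr   # parent's first-child pointer
            open_addrs.append(addr)
        elif kind == "TEXT":
            addr = alloc(value)
            RAM[open_addrs[-1]][1] = addr       # text leaf becomes the child
        elif kind == "CLOSE":
            last_closed = open_addrs.pop()
    return last_closed  # address of the root node

root = parse_ram([("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
                  ("CLOSE", "div"), ("CLOSE", "div")])
# RAM now holds the linked structure from the table: addresses 0, 1, 2
```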
| Pros | Cons |
|---|---|
| Random access: read/write any node by address, O(1) | Most complex addition: RAM, ALU, index registers |
| Maps directly to real hardware (silicon is random-access) | Requires manual memory management or allocator |
| Every programming language compiles to this | Pointer arithmetic is a source of bugs |
This is the path that won. Not because it's the most elegant: the 2-stack machine is simpler, the tree rewriter is more natural for trees, and the combinator machine needs no extra memory at all. It won because transistors are random-access by design. RAM is cheap to build in silicon. The other three models require the hardware to simulate structured access patterns on top of flat memory, an extra layer the register machine skips entirely.
A side effect: Turing completeness
Each of these four additions was designed to solve a specific problem: giving the PDA enough writable memory to build a parse tree. But each one also has a deeper consequence: it makes the machine Turing complete, capable of computing anything that any computer can compute.
This isn't a coincidence. To build an arbitrary tree, you need to store and retrieve an arbitrary amount of structured data. That's exactly what separates a context-free machine (our PDA) from a Turing machine. The four paths don't just add tree-building, they each provide enough general-purpose memory to simulate a Turing machine tape.
Here's how they compare side by side:
| | 2-Stack | Tree Rewriting | Stack Combinators | Register Machine |
|---|---|---|---|---|
| New ops | PUSH_B, POP_B | CONS, CAR, CDR | DUP, SWAP, QUOTE, APPLY | LOAD, STORE, ADD, SUB, MUL |
| New op count | 2 | 3 | 4 | 5 |
| Total ops (11 + N) | 13 | 14 | 15 | 16 |
| New data structure | Second LIFO stack | Pool of cons cells | None (enriched stack) | RAM + ALU + index registers |
| Primary data structure | Two LIFO stacks | Binary tree (cons cells) | One stack (data + code) | Flat array of addressed cells |
| How composition works | Stack effects (before/after) | S-expressions (nested lists) | Point-free sequencing | Instruction sequences |
| Why it's Turing complete | Two stacks simulate a tape | Tree substitution simulates a tape | quote + apply = self-modifying programs | RAM = direct tape simulation |
We started at the bottom of this hierarchy in part 1 with a finite state machine. In part 2 we added a stack and crossed into context-free territory. Now we've seen the four doors at the top: four ways to cross into Turing completeness. All equivalent in power, all different in structure.
Q: If the PDA is so rigid, can it work with something flexible like an LLM?
Yes, and this is where things get practical.
From PDA to transformer, step by step:
- PDA: hand-coded rules, explicit stack, one language (HTML)
- Learned PDA: same architecture, learn transition rules from data (Grefenstette et al., 2015)
- RNN: replace explicit stack with hidden state vector; approximates a stack but struggles with deep nesting
- LSTM: add gates to the hidden state; better stack approximation; the pre-transformer breakthrough
- Transformer: replace the sequential hidden state with attention over the entire input; random access to every previous token, not just the stack top
Each step trades transparency for generality. The PDA handles one grammar perfectly and you can see why. The transformer handles every grammar approximately and you can't see how.
Critical limit: the PDA has an unbounded stack, so it handles arbitrarily deep nesting, always. The transformer has finite context and finite precision; in effect, it's a bounded-depth PDA. It can approximate stack behavior within its context window, but it can't guarantee correct nesting beyond that.
Research backing: Attention heads in transformers learn to simulate stack operations. Specific heads match opening tags to closing tags by attending back to the most recent unmatched opener — the attention pattern literally looks like LIFO behavior (Murty et al., Merrill et al.).
The practical combination: grammar-constrained decoding. Run the PDA alongside the LLM. The LLM generates tokens based on meaning. The PDA vetoes structurally invalid ones.
Step 1: LLM emits "<ul>" stack: [UL] ✅
Step 2: LLM emits "<li>" stack: [UL, LI] ✅
Step 3: LLM emits "Apple" stack: [UL, LI] ✅
Step 4: LLM wants "</ul>" stack: [UL, LI] ❌ BLOCKED
top is LI, not UL
next best: "</li>" stack: [UL] ✅
Step 5: LLM emits "<li>" stack: [UL, LI] ✅
Step 6: LLM emits "Banana" ✅
Step 7: LLM emits "</li>" stack: [UL] ✅
Step 8: LLM emits "</ul>" stack: [] ✅
Step 4 is the key: the LLM tried to close </ul> while <li> was still open. The PDA masked it out, and the LLM's second-best token </li> took over.
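That veto loop is easy to sketch. The toy PDA below knows only <ul> and <li>, and the ranked candidate list stands in for a real model's token distribution; all names are illustrative, not the API of any of the tools listed below.

```python
OPENERS = {"<ul>": "UL", "<li>": "LI"}
CLOSERS = {"</ul>": "UL", "</li>": "LI"}

def pda_allows(stack, token):
    if token in OPENERS:
        return True
    if token in CLOSERS:
        return bool(stack) and stack[-1] == CLOSERS[token]  # LIFO match
    return bool(stack)  # plain text is only valid inside an open element

def constrained_step(stack, ranked_candidates):
    for token in ranked_candidates:    # walk the LLM's ranking, best first
        if not pda_allows(stack, token):
            continue                   # PDA veto: mask this token out
        if token in OPENERS:
            stack.append(OPENERS[token])
        elif token in CLOSERS:
            stack.pop()
        return token
    raise ValueError("grammar allows no candidate")

stack = ["UL", "LI"]                                # the state after step 3
chosen = constrained_step(stack, ["</ul>", "</li>"])  # LLM prefers </ul>
print(chosen, stack)  # prints: </li> ['UL']  (the PDA blocked </ul>)
```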
| | LLM | PDA |
|---|---|---|
| Decides | "Apple" and "Banana" are fruits | </ul> can't close before </li> |
| Good at | Meaning, creativity, world knowledge | Structure, nesting, grammar rules |
| Fails at | Guaranteeing valid nesting | Knowing what a fruit is |
This combination is already in production:
- Outlines (Python): attaches a grammar to any HuggingFace model
- llama.cpp GBNF: grammar-constrained decoding in C++
- Guidance (Microsoft): interleaves free generation with grammar constraints
- OpenAI structured outputs: grammar enforcement server-side via JSON schema
The PDA we built in parts 1 and 2 is the exact mechanism these tools use to guarantee structure: the simplest kind of program guarding the output of the most complex kind.
What's next
In parts 1 and 2 we built a tokenizer. In this post we saw four ways to extend it into a full parser, and how the simplest of those machines can guard the output of the most powerful ones.
Next up: we pick the register machine path and build a parser that writes a tree to memory.
Stay tuned...