From Tokens to Trees: Four Paths to a Full Parser
In part 2, we built a pushdown automaton transducer: 11 operations, 6 components, one stack. Our PDA transducer turns nested <div> tags into a flat stream of tokens. In this post, we will explore what we need to add to get a simple but practical parser that outputs a DOM tree.
Q: Is the transducer enough to parse HTML?
No. The transducer takes <div><div>hi</div></div> and emits:
[(OPEN, "div"), (OPEN, "div"), (TEXT, "hi"),
(CLOSE, "div"), (CLOSE, "div")]
That's a flat list. It tells you what was in the input, but not how things relate to each other. A parser needs to produce a tree, a structure where the outer <div> is the parent and the inner <div> is its child:
div
└── div
└── "hi"
Our transducer can't do this. Its only writable memory is the stack, and the stack is consumed during validation. Every PUSH is matched by a POP to check nesting. By the time the transducer is done, the stack is empty. There's nowhere to store the tree.
To become a parser, the PDA needs a way to build and keep a tree structure in memory. Not an easy task.
Q: Are there multiple ways to give the PDA tree-building ability?
Yes. Four classical computational models, each adding something different to the same PDA architecture we built in part 2.
| Path | What you add | Computation model | Language family |
|---|---|---|---|
| 2-Stack machine | A second stack | Stack machine | Forth, PostScript, Factor |
| Tree rewriting | A pool of cons cells + substitution rules | Lambda calculus / rewriting | Lisp, Scheme |
| Stack combinators | Combinator primitives (dup, swap, quote, apply) | Combinatory logic | Joy, Cat |
| Register machine | Writable random-access memory + ALU | Von Neumann | Assembly, C |
All four give the PDA persistent, writable memory to store a tree. But each structures that memory differently. Each created a different programming tradition.
The 2-Stack Machine (Forth, Factor)
We add a second LIFO stack. That's it. Two stacks cooperating through the finite control. This is the stack machine, the model behind Forth (Charles Moore, 1970), PostScript, and Factor.
Stack A keeps its original job: validating nesting (PUSH on open, POP on close, CMP to check the match). Stack B is new. It accumulates completed subtrees. When a closing tag matches, the parser pops the validated tag from Stack A and pushes the finished node (with its children) onto Stack B.
Parsing <div><div>hi</div></div>: The transducer reads tokens as before. When it encounters (CLOSE, "div") for the inner div, it pops DIV from Stack A (validation) and pushes a completed node div("hi") onto Stack B. When the outer </div> closes, it pops from Stack A again and assembles the final tree by popping the inner node from Stack B as its child. The result on Stack B: div(div("hi")).
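The walk-through above can be sketched in a few lines of Python. Everything here is illustrative: the token format comes from part 2, but the Node class, the marker trick that delimits each node's children on Stack B, and the function names are assumptions of this sketch, not the series' actual implementation.

```python
class Node:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []

    def __repr__(self):
        if not self.children:
            return self.tag
        return f"{self.tag}({', '.join(map(repr, self.children))})"

def parse_two_stack(tokens):
    stack_a = []  # validation: open tags awaiting their close
    stack_b = []  # accumulation: completed subtrees
    for kind, value in tokens:
        if kind == "OPEN":
            stack_a.append(value)            # PUSH on Stack A
            stack_b.append(("MARK", value))  # marker delimits this node's children
        elif kind == "TEXT":
            stack_b.append(value)
        elif kind == "CLOSE":
            if not stack_a or stack_a.pop() != value:  # POP + CMP on Stack A
                raise ValueError(f"mismatched </{value}>")
            children = []
            while stack_b[-1] != ("MARK", value):
                children.append(stack_b.pop())
            stack_b.pop()       # drop the marker
            children.reverse()  # they came off the stack backwards
            stack_b.append(Node(value, children))  # finished subtree onto Stack B
    if stack_a:
        raise ValueError(f"unclosed <{stack_a[-1]}>")
    return stack_b

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
print(parse_two_stack(tokens)[0])  # prints: div(div('hi'))
```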
Two stacks can also simulate a Turing machine tape: one stack holds everything left of the head, the other holds everything right. Move left = pop from left, push to right. Move right = the reverse. This is a well-known equivalence: a 2-stack PDA equals a Turing machine.
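The tape trick is a few lines on its own. A minimal sketch, with illustrative function names; the right stack's top plays the role of the head cell:

```python
def move_left(left, right):
    right.append(left.pop())   # the cell to the left becomes the new head

def move_right(left, right):
    left.append(right.pop())   # the head cell moves onto the left stack

# Tape "a b c" with the head on 'a': the head cell is the top of `right`.
left, right = [], ["c", "b", "a"]
move_right(left, right)   # head moves to 'b'
move_right(left, right)   # head moves to 'c'
move_left(left, right)    # back to 'b'
```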
| Pros | Cons |
|---|---|
| Minimal addition, only 2 new operations | Trees must be linearized into stack form |
| Closest to the existing PDA architecture | Deep trees require deep stacks, no random access |
| Simple to implement and reason about | Awkward when you need to revisit earlier nodes |
Tree Rewriting (Lisp, Scheme)
We replace the stack with a pool of dynamically allocated cons cells. Each cell holds a pair of pointers: left (CAR) and right (CDR), forming a binary tree. Computation is pattern matching on tree structure and substitution: find a subtree that matches a rule, replace it with the result. This is the lambda calculus model: the foundation of Lisp (Alonzo Church, 1936; John McCarthy, 1958), Scheme, and Racket.
The finite control drives parsing as before, but instead of pushing symbols onto a stack, it allocates cons cells in memory. Each opening tag creates a new node. Each closing tag triggers a CONS that links children to their parent. The tree is built directly in memory, with no linearization needed.
What you add to the PDA:
| New operations | New data structures |
|---|---|
| CONS, CAR, CDR | Pool of dynamically allocated cons cells |
Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it allocates a leaf cell holding "hi". When (CLOSE, "div") arrives for the inner div, a CONS links "hi" as the child of a new div node. When the outer </div> closes, another CONS links the inner div node as the child of the outer div node. The result is a tree in memory: (div . (div . "hi")).
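A hedged sketch of this path, modeling cons cells as plain Python tuples. It is simplified to the single-child chain of this example (each closing tag links exactly one child under its parent), and the function names are assumptions of the sketch:

```python
def cons(a, d): return (a, d)   # allocate a fresh cell
def car(cell):  return cell[0]
def cdr(cell):  return cell[1]

def parse_cons(tokens):
    open_tags = []  # the finite control still validates nesting
    result = None
    for kind, value in tokens:
        if kind == "OPEN":
            open_tags.append(value)
        elif kind == "TEXT":
            result = value  # leaf cell
        elif kind == "CLOSE":
            if not open_tags or open_tags.pop() != value:
                raise ValueError(f"mismatched </{value}>")
            result = cons(value, result)  # link the child under its parent tag
    return result

tree = parse_cons([("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
                   ("CLOSE", "div"), ("CLOSE", "div")])
print(tree)  # prints: ('div', ('div', 'hi')), i.e. (div . (div . "hi"))
```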
| Pros | Cons |
|---|---|
| Trees are the native data structure. No encoding needed | Requires garbage collection or manual memory management |
| Natural representation for S-expressions: code and data share the same form | More complex memory model than a stack |
| Pattern matching + substitution is powerful for tree transformations | Pointer chasing is slow on modern hardware (poor cache locality) |
Stack Combinators (Joy, Cat)
We keep one stack. Don't add memory. Instead, change what the stack is allowed to hold. In our PDA, the stack holds passive symbols (like DIV). In a combinator machine, the stack holds executable programs. quote takes an operation and pushes it as data. apply pops it and executes it. The stack becomes simultaneously data storage and code. This is the foundation of concatenative languages: Joy (Manfred von Thun, 1990s), Cat, and Factor. The theoretical roots go back to Moses Schönfinkel (1920s) and Haskell Curry (1930s).
The parser quotes partial tree fragments and pushes them onto the stack as data. When a closing tag arrives, combinators like swap and dup rearrange the fragments, and apply assembles them into a larger tree. No heap, no RAM, no second stack. The stack itself is the only memory, and its entries can be either data or programs.
What you add to the PDA:
| New operations | New data structures |
|---|---|
| DUP, SWAP, QUOTE, APPLY | None, actually: stack entries just become richer (data + code) |
Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it quotes the text as a data fragment and pushes it. When (CLOSE, "div") arrives, it quotes a make-div-node operation, swaps it under the "hi" fragment, and applies, producing a quoted div("hi") node on the stack. The outer </div> repeats the pattern, composing the inner node into the outer one. The result: a single quoted tree [div(div("hi"))] on top of the stack.
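Here's an illustrative sketch of that machine: one stack whose entries can be data or programs. One simplification versus the trace above: instead of swapping the builder under the fragment, the quoted make_div (a hypothetical helper) pops its argument directly when applied, so dup and swap are defined as primitives but this linear example happens not to need them.

```python
stack = []

def push(x):   stack.append(x)
def dup():     stack.append(stack[-1])           # duplicate the top entry
def swap():    stack[-1], stack[-2] = stack[-2], stack[-1]
def quote(op): stack.append(op)                  # push a program as data
def apply_():  stack.pop()()                     # pop a program and run it

def make_div():
    stack.append(("div", stack.pop()))           # wrap the top fragment in a div node

# Parsing <div><div>hi</div></div>:
push("hi")        # TEXT: the leaf fragment, pushed as data
quote(make_div)   # inner CLOSE: quote the node builder...
apply_()          # ...and apply it; stack is now [("div", "hi")]
quote(make_div)   # outer CLOSE repeats the pattern
apply_()          # stack: [("div", ("div", "hi"))]
```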
| Pros | Cons |
|---|---|
| No new memory means zero hardware cost | Hardest model to reason about for most developers |
| Elegant point-free composition. No variables, no names | Stack shuffling (swap, dup, rot) gets complex fast |
| Code and data unification is powerful for metaprogramming | Small community, fewer learning resources |
The Register Machine (Assembly, C)
We replace the stack with addressable RAM. We add an ALU with arithmetic operations, and the finite control gains index registers for addressing. This is the Von Neumann architecture, the model John von Neumann described in 1945, building on Alan Turing's theoretical work (1936). Every mainstream CPU (x86, ARM, RISC-V) is essentially this model.
Random access changes everything. The parser allocates tree nodes at arbitrary memory addresses and links them by pointer. Need to revisit a node three levels up? Just follow the address — no popping through a stack. The tree lives in RAM as a set of linked records, each holding a tag name, a pointer to its first child, and a pointer to its next sibling.
What you add to the PDA:
| New operations | New data structures |
|---|---|
| LOAD, STORE, ADD, SUB, MUL | RAM + ALU + index registers |
Parsing <div><div>hi</div></div>: The parser allocates a record at address 0x00 for the outer div. When (OPEN, "div") arrives for the inner div, it allocates at 0x01 and writes 0x00 as its parent pointer. When (TEXT, "hi") arrives, it allocates at 0x02 and writes 0x01 as its parent. The closing tags update child pointers. The result is a linked structure in RAM:
| Address | Tag | First child | Next sibling |
|---|---|---|---|
| 0x00 | div | → 0x01 | null |
| 0x01 | div | → 0x02 | null |
| 0x02 | "hi" | null | null |
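The same walk can be sketched with RAM as a flat Python list of [tag, first_child, next_sibling] records, mirroring the table above. The record layout, the alloc helper, and the small list of open addresses (which a real register machine would itself keep in RAM) are assumptions of this sketch; it is also simplified to one child per node, which is enough for this example.

```python
RAM = []

def alloc(tag):
    RAM.append([tag, None, None])  # STORE a fresh [tag, first_child, next_sibling]
    return len(RAM) - 1            # its address is simply the index

def parse_ram(tokens):
    open_addrs = []     # addresses of currently open nodes
    last_closed = None
    for kind, value in tokens:
        if kind == "OPEN":
            addr = alloc(value)
            if open_addrs:
                RAM[open_addrs[-1]][1] = addr   # parent's first-child pointer
            open_addrs.append(addr)
        elif kind == "TEXT":
            addr = alloc(value)
            RAM[open_addrs[-1]][1] = addr       # text leaf becomes the child
        elif kind == "CLOSE":
            last_closed = open_addrs.pop()
    return last_closed  # address of the root node

root = parse_ram([("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
                  ("CLOSE", "div"), ("CLOSE", "div")])
# RAM now holds the linked structure from the table: addresses 0, 1, 2
```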
| Pros | Cons |
|---|---|
| Random access: read/write any node by address, O(1) | Most complex addition: RAM, ALU, index registers |
| Maps directly to real hardware (silicon is random-access) | Requires manual memory management or allocator |
| Every programming language compiles to this | Pointer arithmetic is a source of bugs |
This is the path that won. Not because it's the most elegant: the 2-stack machine is simpler, the tree rewriter is more natural for trees, and the combinator machine needs no extra memory at all. It won because transistors are random-access by design. RAM is cheap to build in silicon. The other three models require the hardware to simulate structured access patterns on top of flat memory, an extra layer the register machine skips entirely.
A side effect: Turing completeness
Each of these four additions was designed to solve a specific problem: giving the PDA enough writable memory to build a parse tree. But each one also has a deeper consequence: it makes the machine Turing complete, capable of computing anything that any computer can compute.
This isn't a coincidence. To build an arbitrary tree, you need to store and retrieve an arbitrary amount of structured data. That's exactly what separates a context-free machine (our PDA) from a Turing machine. The four paths don't just add tree-building, they each provide enough general-purpose memory to simulate a Turing machine tape.
Here's how they compare side by side:
| | 2-Stack | Tree Rewriting | Stack Combinators | Register Machine |
|---|---|---|---|---|
| New ops | PUSH_B, POP_B | CONS, CAR, CDR | DUP, SWAP, QUOTE, APPLY | LOAD, STORE, ADD, SUB, MUL |
| New op count | 2 | 3 | 4 | 5 |
| Total ops (11 + N) | 13 | 14 | 15 | 16 |
| New data structure | Second LIFO stack | Pool of cons cells | None (enriched stack) | RAM + ALU + index registers |
| Primary data structure | Two LIFO stacks | Binary tree (cons cells) | One stack (data + code) | Flat array of addressed cells |
| How composition works | Stack effects (before/after) | S-expressions (nested lists) | Point-free sequencing | Instruction sequences |
| Why it's Turing complete | Two stacks simulate a tape | Tree substitution simulates a tape | quote + apply = self-modifying programs | RAM = direct tape simulation |
We started at the bottom of this hierarchy in part 1 with a finite state machine. In part 2 we added a stack and crossed into context-free territory. Now we've seen the four doors at the top: four ways to cross into Turing completeness. All equivalent in power, all different in structure.
Q: If the PDA is so rigid, can it work with something flexible like an LLM?
Yes, and this is where things get practical.
From PDA to transformer, step by step:
- PDA: hand-coded rules, explicit stack, one language (HTML)
- Learned PDA: same architecture, learn transition rules from data (Grefenstette et al., 2015)
- RNN: replace explicit stack with hidden state vector; approximates a stack but struggles with deep nesting
- LSTM: add gates to the hidden state; better stack approximation; the pre-transformer breakthrough
- Transformer: replace the sequential hidden state with attention over the entire input; random access to every previous token, not just the stack top
Each step trades transparency for generality. The PDA handles one grammar perfectly and you can see why. The transformer handles every grammar approximately and you can't see how.
Critical limit: the PDA has an unbounded stack, so it handles arbitrarily deep nesting, always. The transformer has finite context and finite precision; in effect, it's a bounded-depth PDA. It can approximate stack behavior within its context window, but it can't guarantee correct nesting beyond that.
Research backing: Attention heads in transformers learn to simulate stack operations. Specific heads match opening tags to closing tags by attending back to the most recent unmatched opener — the attention pattern literally looks like LIFO behavior (Murty et al., Merrill et al.).
The practical combination: grammar-constrained decoding. Run the PDA alongside the LLM. The LLM generates tokens based on meaning. The PDA vetoes structurally invalid ones.
Step 1: LLM emits "<ul>" stack: [UL] ✅
Step 2: LLM emits "<li>" stack: [UL, LI] ✅
Step 3: LLM emits "Apple" stack: [UL, LI] ✅
Step 4: LLM wants "</ul>" stack: [UL, LI] ❌ BLOCKED
top is LI, not UL
next best: "</li>" stack: [UL] ✅
Step 5: LLM emits "<li>" stack: [UL, LI] ✅
Step 6: LLM emits "Banana" ✅
Step 7: LLM emits "</li>" stack: [UL] ✅
Step 8: LLM emits "</ul>" stack: [] ✅
Step 4 is the key: the LLM tried to close </ul> while <li> was still open. The PDA masked it out, and the LLM's second-best token </li> took over.
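That veto loop is easy to sketch. The toy PDA below knows only <ul> and <li>, and the ranked candidate list stands in for a real model's token distribution; all names are illustrative, not the API of any of the tools listed below.

```python
OPENERS = {"<ul>": "UL", "<li>": "LI"}
CLOSERS = {"</ul>": "UL", "</li>": "LI"}

def pda_allows(stack, token):
    if token in OPENERS:
        return True
    if token in CLOSERS:
        return bool(stack) and stack[-1] == CLOSERS[token]  # LIFO match
    return bool(stack)  # plain text is only valid inside an open element

def constrained_step(stack, ranked_candidates):
    for token in ranked_candidates:    # walk the LLM's ranking, best first
        if not pda_allows(stack, token):
            continue                   # PDA veto: mask this token out
        if token in OPENERS:
            stack.append(OPENERS[token])
        elif token in CLOSERS:
            stack.pop()
        return token
    raise ValueError("grammar allows no candidate")

stack = ["UL", "LI"]                                # the state after step 3
chosen = constrained_step(stack, ["</ul>", "</li>"])  # LLM prefers </ul>
print(chosen, stack)  # prints: </li> ['UL']  (the PDA blocked </ul>)
```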
| | LLM | PDA |
|---|---|---|
| Decides | "Apple" and "Banana" are fruits | </ul> can't close before </li> |
| Good at | Meaning, creativity, world knowledge | Structure, nesting, grammar rules |
| Fails at | Guaranteeing valid nesting | Knowing what a fruit is |
This combination is already in production:
- Outlines (Python): attaches a grammar to any HuggingFace model
- llama.cpp GBNF: grammar-constrained decoding in C++
- Guidance (Microsoft): interleaves free generation with grammar constraints
- OpenAI structured outputs: grammar enforcement server-side via JSON schema
The PDA we built in parts 1 and 2 is the exact mechanism these tools use to guarantee structure: the simplest kind of program guarding the output of the most complex kind.
What's next
In parts 1 and 2 we built a tokenizer. In this post we saw four ways to extend it into a full parser, and how the simplest of those machines can guard the output of the most powerful ones.
Next up: we pick the register machine path and build a parser that writes a tree to memory.
Stay tuned...