Jocer Franquiz
How a Pushdown Automaton becomes a Parser [part 3]

From Tokens to Trees: Four Paths to a Full Parser

In part 2, we built a pushdown automaton transducer: 11 operations, 6 components, one stack. Our PDA transducer turns nested <div> tags into a flat stream of tokens. In this post, we explore what we need to add to get a simple but practical parser that outputs a DOM tree.


Q: Is the transducer enough to parse HTML?

No. The transducer takes <div><div>hi</div></div> and emits:

[(OPEN, "div"), (OPEN, "div"), (TEXT, "hi"),
 (CLOSE, "div"), (CLOSE, "div")]

That's a flat list. It tells you what was in the input, but not how things relate to each other. A parser needs to produce a tree, a structure where the outer <div> is the parent and the inner <div> is its child:

div
 └── div
      └── "hi"

Our transducer can't do this. Its only writable memory is the stack, and the stack is consumed during validation. Every PUSH is matched by a POP to check nesting. By the time the transducer is done, the stack is empty. There's nowhere to store the tree.
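To make this concrete, here's a minimal Python sketch (function and token names are illustrative, not the part-2 implementation) of the validation loop. Note the final line: acceptance requires the stack to be empty, which is exactly why nothing is left over to hold a tree.

```python
# Minimal sketch of the transducer's validation loop.
# Every PUSH is matched by a POP, so nothing survives to the end.
def validate(tokens):
    stack = []
    for kind, tag in tokens:
        if kind == "OPEN":
            stack.append(tag)                    # PUSH
        elif kind == "CLOSE":
            if not stack or stack.pop() != tag:  # POP + CMP
                return False
        # TEXT tokens leave the stack untouched
    return not stack  # accept only if the stack is fully consumed

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
assert validate(tokens)  # accepted, and the stack is now empty
```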

To become a parser, the PDA needs a way to build and keep a tree structure in memory. Not an easy task.


Q: Are there multiple ways to give the PDA tree-building ability?

Yes. Four classical computational models, each adding something different to the same PDA architecture we built in part 2.

| Path | What you add | Computation model | Language family |
| --- | --- | --- | --- |
| 2-Stack machine | A second stack | Stack machine | Forth, PostScript, Factor |
| Tree rewriting | A pool of cons cells + substitution rules | Lambda calculus / rewriting | Lisp, Scheme |
| Stack combinators | Combinator primitives (dup, swap, quote, apply) | Combinatory logic | Joy, Cat |
| Register machine | Writable random-access memory + ALU | Von Neumann | Assembly, C |

All four give the PDA persistent, writable memory to store a tree. But each structures that memory differently. Each created a different programming tradition.

Four paths from PDA to full parser


The 2-Stack Machine (Forth, Factor)

We add a second LIFO stack. That's it. Two stacks cooperating through the finite control. This is the stack machine, the model behind Forth (Charles Moore, 1970), PostScript, and Factor.

Stack A keeps its original job: validating nesting (PUSH on open, POP on close, CMP to check the match). Stack B is new. It accumulates completed subtrees. When a closing tag matches, the parser pops the validated tag from Stack A and pushes the finished node (with its children) onto Stack B.

Parsing <div><div>hi</div></div>: The transducer reads tokens as before. When it encounters (CLOSE, "div") for the inner div, it pops DIV from Stack A (validation) and pushes a completed node div("hi") onto Stack B. When the outer </div> closes, it pops from Stack A again and assembles the final tree by popping the inner node from Stack B as its child. The result on Stack B: div(div("hi")).
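Here's a rough Python sketch of that flow. It's an illustration under assumptions: node tuples like `("NODE", tag, children)` and a `("MARK", tag)` sentinel on Stack B to delimit children are my choices, not part of the article's machine.

```python
# Sketch of the 2-stack idea: Stack A validates nesting,
# Stack B accumulates finished subtrees.
def parse_two_stack(tokens):
    stack_a, stack_b = [], []
    for kind, value in tokens:
        if kind == "OPEN":
            stack_a.append(value)            # PUSH_A: open tag for validation
            stack_b.append(("MARK", value))  # PUSH_B: children start here
        elif kind == "TEXT":
            stack_b.append(("LEAF", value))
        elif kind == "CLOSE":
            assert stack_a.pop() == value    # POP_A + CMP: validate nesting
            children = []
            while stack_b[-1] != ("MARK", value):  # POP_B completed subtrees
                children.append(stack_b.pop())
            stack_b.pop()                    # drop the marker
            children.reverse()
            stack_b.append(("NODE", value, children))  # PUSH_B finished node
    return stack_b.pop()

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
tree = parse_two_stack(tokens)
# tree is ("NODE", "div", [("NODE", "div", [("LEAF", "hi")])])
```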

2-Stack PDA architecture

Two stacks can also simulate a Turing machine tape: one stack holds everything left of the head, the other holds everything right. Move left = pop from left, push to right. Move right = the reverse. This is a well-known equivalence: a 2-stack PDA equals a Turing machine.
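A sketch of that simulation, with the convention (my assumption) that the right stack also holds the cell under the head:

```python
# A Turing-machine tape as two stacks. `left` holds cells to the left of
# the head (top = nearest), `right` holds the head cell and everything
# to its right.
class TwoStackTape:
    def __init__(self, cells, blank="_"):
        self.left = []
        self.right = list(reversed(cells)) or [blank]
        self.blank = blank

    def read(self):
        return self.right[-1]          # the head cell

    def write(self, symbol):
        self.right[-1] = symbol

    def move_right(self):              # pop from right, push to left
        self.left.append(self.right.pop())
        if not self.right:
            self.right.append(self.blank)

    def move_left(self):               # pop from left, push to right
        self.right.append(self.left.pop() if self.left else self.blank)
```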

Two stacks simulating a Turing machine tape

| Pros | Cons |
| --- | --- |
| Minimal addition, only 2 new operations | Trees must be linearized into stack form |
| Closest to the existing PDA architecture | Deep trees require deep stacks, no random access |
| Simple to implement and reason about | Awkward when you need to revisit earlier nodes |

Tree Rewriting (Lisp, Scheme)

We replace the stack with a pool of dynamically allocated cons cells. Each cell holds a pair of pointers: left (CAR) and right (CDR), forming a binary tree. Computation is pattern matching on tree structure and substitution: find a subtree that matches a rule, replace it with the result. This is the lambda calculus model: the foundation of Lisp (Alonzo Church, 1936; John McCarthy, 1958), Scheme, and Racket.

The finite control drives parsing as before, but instead of pushing symbols onto a stack, it allocates cons cells in memory. Each opening tag creates a new node. Each closing tag triggers a CONS that links children to their parent. The tree is built directly in memory, with no linearization needed.

What you add to the PDA:

| New operations | New data structures |
| --- | --- |
| CONS, CAR, CDR | Pool of dynamically allocated cons cells |

Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it allocates a leaf cell holding "hi". When (CLOSE, "div") arrives for the inner div, a CONS links "hi" as the child of a new div node. When the outer </div> closes, another CONS links the inner div node as the child of the outer div node. The result is a tree in memory: (div . (div . "hi")).
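A minimal sketch in Python, with cons cells as plain tuples. It only handles single-child chains like the example; the token-to-CONS mapping is an assumption for illustration:

```python
# Cons cells as Python tuples: (CAR, CDR).
def cons(car_val, cdr_val):
    return (car_val, cdr_val)

def parse_cons(tokens):
    stack = []      # the finite control still tracks open tags
    node = None     # the subtree built so far
    for kind, value in tokens:
        if kind == "OPEN":
            stack.append(value)
        elif kind == "TEXT":
            node = value                 # allocate a leaf cell
        elif kind == "CLOSE":
            tag = stack.pop()
            assert tag == value          # validate nesting
            node = cons(tag, node)       # CONS links child to parent
    return node

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
tree = parse_cons(tokens)
# tree is ("div", ("div", "hi")), i.e. (div . (div . "hi"))
```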

Tree rewriting architecture

| Pros | Cons |
| --- | --- |
| Trees are the native data structure. No encoding needed | Requires garbage collection or manual memory management |
| Natural representation for S-expressions: code and data share the same form | More complex memory model than a stack |
| Pattern matching + substitution is powerful for tree transformations | Pointer chasing is slow on modern hardware (poor cache locality) |

Stack Combinators (Joy, Cat)

We keep one stack. Don't add memory. Instead, change what the stack is allowed to hold. In our PDA, the stack holds passive symbols (like DIV). In a combinator machine, the stack holds executable programs. quote takes an operation and pushes it as data. apply pops it and executes it. The stack becomes simultaneously data storage and code. This is the foundation of concatenative languages: Joy (Manfred von Thun, 1990s), Cat, and Factor. The theoretical roots go back to Moses Schönfinkel (1920s) and Haskell Curry (1930s).

The parser quotes partial tree fragments and pushes them onto the stack as data. When a closing tag arrives, combinators like swap and dup rearrange the fragments, and apply assembles them into a larger tree. No heap, no RAM, no second stack. The stack itself is the only memory, and its entries can be either data or programs.

What you add to the PDA:

| New operations | New data structures |
| --- | --- |
| DUP, SWAP, QUOTE, APPLY | None actually. Stack entries just become richer (data + code) |

Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it quotes the text as a data fragment and pushes it. When (CLOSE, "div") arrives, it quotes a make-div-node operation, swaps it under the "hi" fragment, and applies, producing a quoted div("hi") node on the stack. The outer </div> repeats the pattern, composing the inner node into the outer one. The result: a single quoted tree [div(div("hi"))] on top of the stack.
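A loose Python sketch of the idea: stack entries can be data or callable programs, and quote/apply do the assembly. I elide the dup/swap shuffling described above, so the exact combinator discipline here is an assumption:

```python
# Combinator-machine sketch: one stack whose entries are either data
# (tree fragments) or programs (Python callables that act on the stack).
def parse_combinators(tokens):
    stack = []

    def quote(program_or_data):
        stack.append(program_or_data)    # push anything as inert data

    def apply():
        stack.pop()(stack)               # pop a program and execute it

    def make_node(tag):
        def program(s):
            child = s.pop()              # consume the child fragment
            s.append((tag, child))       # push the assembled node
        return program

    for kind, value in tokens:
        if kind == "TEXT":
            quote(value)                 # text fragment as data
        elif kind == "CLOSE":
            quote(make_node(value))      # quote the node constructor...
            apply()                      # ...then apply it
        # OPEN needs no stack action here; nesting falls out of apply order
    return stack.pop()

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
tree = parse_combinators(tokens)
# tree is ("div", ("div", "hi"))
```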

Stack combinators architecture

| Pros | Cons |
| --- | --- |
| No new memory means zero hardware cost | Hardest model to reason about for most developers |
| Elegant point-free composition. No variables, no names | Stack shuffling (swap, dup, rot) gets complex fast |
| Code and data unification is powerful for metaprogramming | Small community, fewer learning resources |

The Register Machine (Assembly, C)

We replace the stack with addressable RAM. Add an ALU with arithmetic operations. The finite control gains index registers for addressing. This is the Von Neumann architecture, the model John von Neumann described in 1945, building on Alan Turing's theoretical work (1936). Every modern CPU, from x86 to ARM to RISC-V, is essentially this model.

Random access changes everything. The parser allocates tree nodes at arbitrary memory addresses and links them by pointer. Need to revisit a node three levels up? Just follow the address — no popping through a stack. The tree lives in RAM as a set of linked records, each holding a tag name, a pointer to its first child, and a pointer to its next sibling.

What you add to the PDA:

| New operations | New data structures |
| --- | --- |
| LOAD, STORE, ADD, SUB | RAM + ALU + index registers |

Parsing <div><div>hi</div></div>: The parser allocates a record at address 0x00 for the outer div. When (OPEN, "div") arrives for the inner div, it allocates at 0x01 and writes 0x00 as its parent pointer. When (TEXT, "hi") arrives, it allocates at 0x02 and writes 0x01 as its parent. The closing tags update child pointers. The result is a linked structure in RAM:

| Address | Tag | First child | Next sibling |
| --- | --- | --- | --- |
| 0x00 | div | → 0x01 | null |
| 0x01 | div | → 0x02 | null |
| 0x02 | "hi" | null | null |
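A sketch of the same run in Python, modeling RAM as a flat list and addresses as indices. The record layout and allocation scheme are illustrative assumptions:

```python
# Register-machine sketch: RAM as a flat array, addresses as indices,
# records holding (tag, first_child, next_sibling).
def parse_registers(tokens):
    ram = []          # flat, randomly addressable memory
    parents = []      # finite control's bookkeeping of open nodes

    def alloc(tag):   # bump allocator: next free address
        ram.append({"tag": tag, "first_child": None, "next_sibling": None})
        return len(ram) - 1

    for kind, value in tokens:
        if kind in ("OPEN", "TEXT"):
            addr = alloc(value)
            if parents:                          # STORE: link into the parent
                parent = ram[parents[-1]]
                if parent["first_child"] is None:
                    parent["first_child"] = addr
                else:                            # walk the sibling chain
                    cur = parent["first_child"]
                    while ram[cur]["next_sibling"] is not None:
                        cur = ram[cur]["next_sibling"]
                    ram[cur]["next_sibling"] = addr
            if kind == "OPEN":
                parents.append(addr)
        elif kind == "CLOSE":
            parents.pop()
    return ram

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
ram = parse_registers(tokens)
# ram[0] is the outer div (first_child = 1),
# ram[1] the inner div (first_child = 2), ram[2] the "hi" leaf
```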

Register machine architecture

| Pros | Cons |
| --- | --- |
| Random access: read/write any node by address, O(1) | Most complex addition: RAM, ALU, index registers |
| Maps directly to real hardware (silicon is random-access) | Requires manual memory management or allocator |
| Every programming language compiles to this | Pointer arithmetic is a source of bugs |

This is the path that won. Not because it's the most elegant: the 2-stack machine is simpler, the tree rewriter is more natural for trees, and the combinator machine needs no extra memory at all. It won because transistors are random-access by design. RAM is cheap to build in silicon. The other three models require the hardware to simulate structured access patterns on top of flat memory, an extra layer the register machine skips entirely.


A side effect: Turing completeness

Each of these four additions was designed to solve a specific problem: giving the PDA enough writable memory to build a parse tree. But each one also has a deeper consequence: it makes the machine Turing complete, capable of computing anything that any computer can compute.

This isn't a coincidence. To build an arbitrary tree, you need to store and retrieve an arbitrary amount of structured data. That's exactly what separates a context-free machine (our PDA) from a Turing machine. The four paths don't just add tree-building, they each provide enough general-purpose memory to simulate a Turing machine tape.

Here's how they compare side by side:

| | 2-Stack | Tree Rewriting | Stack Combinators | Register Machine |
| --- | --- | --- | --- | --- |
| New ops | PUSH_B, POP_B | CONS, CAR, CDR | DUP, SWAP, QUOTE, APPLY | LOAD, STORE, ADD, SUB, MUL |
| New op count | 2 | 3 | 4 | 5 |
| Total ops (11 + N) | 13 | 14 | 15 | 16 |
| New data structure | Second LIFO stack | Pool of cons cells | None (enriched stack) | RAM + ALU + index registers |
| Primary data structure | Two LIFO stacks | Binary tree (cons cells) | One stack (data + code) | Flat array of addressed cells |
| How composition works | Stack effects (before/after) | S-expressions, nested lists | Point-free sequencing | Instruction sequences |
| Why it's Turing complete | Two stacks simulate a tape | Tree substitution simulates a tape | quote + apply = self-modifying programs | RAM = direct tape simulation |

Hierarchy from combinational logic to Turing completeness

We started at the bottom of this hierarchy in part 1 with a finite state machine. In part 2 we added a stack and crossed into context-free territory. Now we've seen the four doors at the top: four ways to cross into Turing completeness. All equivalent in power, all different in structure.


Q: If the PDA is so rigid, can it work with something flexible like an LLM?

Yes, and this is where things get practical.

From PDA to transformer, step by step:

  1. PDA: hand-coded rules, explicit stack, one language (HTML)
  2. Learned PDA: same architecture, transition rules learned from data (Grefenstette et al., 2015)
  3. RNN: replace the explicit stack with a hidden state vector; approximates a stack but struggles with deep nesting
  4. LSTM: add gates to the hidden state; better stack approximation; the pre-transformer breakthrough
  5. Transformer: replace the sequential hidden state with attention over the entire input; random access to every previous token, not just the stack top

Each step trades transparency for generality. The PDA handles one grammar perfectly and you can see why. The transformer handles every grammar approximately and you can't see how.

Critical limit: The PDA has unbounded stack, arbitrarily deep nesting, always. The transformer has finite context and finite precision — it's really a bounded-depth PDA. It can approximate stack behavior within its context window, but it can't guarantee correct nesting beyond that.

Research backing: Attention heads in transformers learn to simulate stack operations. Specific heads match opening tags to closing tags by attending back to the most recent unmatched opener — the attention pattern literally looks like LIFO behavior (Murty et al., Merrill et al.).

The practical combination: grammar-constrained decoding. Run the PDA alongside the LLM. The LLM generates tokens based on meaning. The PDA vetoes structurally invalid ones.

Step 1: LLM emits "<ul>"       stack: [UL]         ✅
Step 2: LLM emits "<li>"       stack: [UL, LI]     ✅
Step 3: LLM emits "Apple"      stack: [UL, LI]     ✅
Step 4: LLM wants "</ul>"      stack: [UL, LI]     ❌ BLOCKED
        top is LI, not UL
        next best: "</li>"     stack: [UL]         ✅
Step 5: LLM emits "<li>"       stack: [UL, LI]     ✅
Step 6: LLM emits "Banana"                         ✅
Step 7: LLM emits "</li>"      stack: [UL]         ✅
Step 8: LLM emits "</ul>"      stack: []           ✅

Step 4 is the key: the LLM tried to close </ul> while <li> was still open. The PDA masked it out, and the LLM's second-best token </li> took over.
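A toy Python sketch of the veto loop. The "LLM" is faked as a ranked candidate list per step; only the PDA side is real logic:

```python
# Grammar-constrained decoding sketch: the PDA masks structurally
# invalid tokens, and the best remaining LLM candidate is emitted.
def pda_allows(stack, token):
    if token.startswith("</"):
        tag = token[2:-1]
        return bool(stack) and stack[-1] == tag  # can only close the top tag
    return True                                  # opens and text are always OK

def step(stack, candidates):
    for token in candidates:                     # ranked by LLM score
        if pda_allows(stack, token):
            if token.startswith("</"):
                stack.pop()
            elif token.startswith("<"):
                stack.append(token[1:-1])
            return token
    raise ValueError("no structurally valid token")

stack = []
out = [step(stack, c) for c in [
    ["<ul>"], ["<li>"], ["Apple"],
    ["</ul>", "</li>"],   # step 4: </ul> is blocked, </li> takes over
    ["<li>"], ["Banana"], ["</li>"], ["</ul>"],
]]
# out is a well-nested sequence ending with an empty stack
```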

| | LLM | PDA |
| --- | --- | --- |
| Decides | "Apple" and "Banana" are fruits | </ul> can't close before </li> |
| Good at | Meaning, creativity, world knowledge | Structure, nesting, grammar rules |
| Fails at | Guaranteeing valid nesting | Knowing what a fruit is |

Grammar-constrained decoding — LLM + PDA collaboration

This combination is already in production in grammar-constrained decoding tools.

The PDA we built in parts 1 and 2 is the exact mechanism these tools use to guarantee structure: a hand-coded automaton, the simplest kind of program, guarding the output of the most complex kind.


What's next

In parts 1 and 2 we built a tokenizer. In this post we saw four ways to extend it into a full parser, and how the simplest of those machines can guard the output of the most powerful ones.

Next up: we pick the register machine path and build a parser that writes a tree to memory.

Stay tuned...
