Jocer Franquiz

Posted on Mar 26

Building a Tokenizer from Scratch [part 2]

#parsing #claude #computerscience #ai

Parser Theory: Q/A with Claude Opus

In part 1, we built a working FSM that recognizes <div>text</div> using just 7 primitives mapped 1:1 to assembly opcodes. But FSMs have a hard limit: they can't handle nested structures like <div><div>hello</div></div>. In this post, we climb the Chomsky hierarchy from finite state machines to pushdown automata, build a PDA that recognizes nested <div> tags, and then turn it into a transducer that emits tokens. In other words we are building the core of a lexer.

Q: Why can't FSMs handle nested structures?

Because an FSM has a fixed number of states, and that's all the memory it has.

Consider nested divs:

<div><div><div>hello</div></div></div>

To correctly match closing tags, you need to count how many <div>s you've opened so you know how many </div>s to expect. An FSM with, say, 12 states can handle nesting up to some fixed depth — but someone can always write HTML nested one level deeper than your states can track.

Put simply:

1 level deep → you need states to remember "I'm 1 div in"
2 levels deep → different states for "I'm 2 divs in"
n levels deep → you'd need n distinct state configurations

Since n is unbounded but your state count is finite, you run out of states. There's no way to encode "how deep am I?" in a fixed-size state register when the depth is arbitrary.

This is exactly what the stack in a pushdown automaton solves. Every time you see <div>, you push onto the stack; every </div>, you pop. The stack grows as needed, giving you unbounded memory in one direction (depth), which is precisely what nesting requires.

Q: So, FSMs needs memory to manage remember where they are in the nested structure, correct?

It depends on what you mean by "memory."

An FSM does have memory, but it's extremely limited. The current state itself is the memory. If your FSM is in state S7, that encodes something about what it has seen so far. That's how it knows it's partway through reading </div> vs still in content.

But that memory is:

Fixed-size, bounded by the number of states, decided at design time
Lossy, it can't remember the full history of what it read, only a compressed summary encoded in which state it's in
Non-growable, it can't allocate more memory as input gets longer

So when people say "FSMs have no memory," they really mean no auxiliary memory, that means no stack, no tape, no buffer that grows with input. The only thing an FSM "remembers" is which state it's currently in.

A PDA adds exactly one auxiliary memory structure: a stack. That single addition is enough to go from "can't handle nesting" to "can handle nesting."

Q: What is the stack, formally?

A stack is a last-in, first-out (LIFO) data structure with two operations:

PUSH(symbol): place a symbol on top
POP: remove and return the top symbol

So if your stack is ABB, a PUSH(C) gives ABBC, and a POP returns C leaving ABB again.

The key properties:

You can only access the top: no reading the middle or bottom
Unbounded size: it can grow as large as needed (unlike the FSM's fixed states)
It starts empty: at the beginning of computation, the stack contains nothing (or a special bottom marker like $)

That's it. It's the simplest possible auxiliar memory that's still useful. You get only sequential access from one end. But that restriction is exactly what matches the structure of nesting: the most recently opened thing is the next thing that needs to close.

Q: Which operations and structures do we need to add to the seven primitives to create a PDA that handles nested structures?

We need two new operations and one new structure:

New structure

STACK: a LIFO storage addressed by a stack pointer (SP)

Updated primitives table

#	Operation	Description	Type	Assembly equivalent
1	READ	Load current input symbol	FSM	`MOV reg, [src]`
2	CMP	Test value equality	FSM	`CMP reg, imm`
3	BRANCH	Conditional jump	FSM	`JE / JNE label`
4	JMP	Unconditional jump	FSM	`JMP label`
5	ASSIGN	Update state register	FSM	`MOV reg, imm`
6	ADVANCE	Move input pointer forward	FSM	`INC reg`
7	HALT	Terminate execution	FSM	`HLT`
8	PUSH	Place symbol on top of stack	PDA	`PUSH reg`
9	POP	Remove and return top of stack	PDA	`POP reg`

Two operations and one structure turn an FSM into a PDA.

x86 has native PUSH/POP instructions with a hardware stack pointer (SP/ESP/RSP), so the 1:1 mapping to CPU instructions still holds. PDAs are just as mechanically efficient as FSMs. The only cost is that memory usage now grows with nesting depth instead of being fixed.

You could also add a PEEK operation (read top without removing), but it's not strictly necessary — you can always POP then PUSH back. It's a convenience, not a primitive.

Q: What's the simplest PDA architecture?

A PDA architecture needs exactly four components:

Input tape — the string being read (read-only, left-to-right)
Input pointer — current position on the tape
State register — holds the current state from a finite set Q
Stack — LIFO storage over alphabet Γ

And the nine operations from our table are the instruction set that operates on these four components.

Compared to the FSM architecture from part 1, we added exactly one component (the stack) and two operations (PUSH, POP). The stack pointer (SP) lives inside the finite control alongside the state register — it's a register, not part of the stack itself.

The PDA architecture is a von Neumann architecture:

PDA component	Von Neumann equivalent
Input tape	Memory (data segment)
Input pointer	Program counter / index register
State register	CPU register (accumulator)
Stack pointer	SP register
Stack (in RAM)	Stack segment of memory
9 operations	Subset of the instruction set

What's missing for full Von Neumann: writable memory, random access, arithmetic, and stored program. The automata hierarchy maps directly onto hardware capability:

FSM → combinational logic + a register
PDA → CPU + stack segment
Turing machine → CPU + full random-access memory = von Neumann

Building a PDA for nested `<div>` recognition

Pseudocode

STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CLOSE_SLASH,
        CLOSE_NAME, ACCEPT, DEAD

STACK ALPHABET: Γ = { DIV, $ }

START:
  PUSH($)                    -- bottom marker
  ASSIGN(S0)

S0:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → TAG_OPEN)
  JMP(DEAD)

TAG_OPEN:
  ADVANCE
  READ(ch)
  CMP(ch, '/')
  BRANCH(eq → CLOSE_SLASH)   -- it's a closing tag
  CMP(ch, 'd')
  BRANCH(eq → TAG_NAME)      -- it's an opening tag
  JMP(DEAD)

TAG_NAME:                     -- read "div>"
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  PUSH(DIV)                   -- ← matched <div>, push it
  ADVANCE
  ASSIGN(CONTENT)

CONTENT:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → TAG_OPEN)      -- ← could be nested <div> or </div>
  CMP(ch, EOF)
  BRANCH(eq → DEAD)           -- input ended with open tags
  ADVANCE                     -- consume text character
  JMP(CONTENT)

CLOSE_SLASH:                  -- inside </
  ADVANCE
  READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  POP(top)                    -- ← pop the stack
  CMP(top, DIV)
  BRANCH(ne → DEAD)           -- mismatch
  ADVANCE
  READ(ch)
  CMP(ch, EOF)
  BRANCH(eq → CHECK_EMPTY)
  ASSIGN(CONTENT)             -- more input, keep going

CHECK_EMPTY:
  POP(top)
  CMP(top, $)
  BRANCH(eq → ACCEPT)         -- stack empty = all tags matched
  JMP(DEAD)                   -- leftover tags = unbalanced

ACCEPT:
  HALT(accept)

DEAD:
  HALT(reject)

The entire FSM logic is reused. The only additions are three PUSH/POP calls. One PUSH per <div>, one POP per </div>, one final POP to verify the stack is empty.

Trace: `<div><div>hi</div></div>`

Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
       0 1 2 3 4 5 6 7 8 9 ...

Stack starts: [ $ ]

1. First <div>

State     | Read | Action              | Stack
----------|------|----------------------|--------
S0        | <    | match, → TAG_OPEN   | [ $ ]
TAG_OPEN  | d    | not '/', → TAG_NAME | [ $ ]
TAG_NAME  | i    | match               | [ $ ]
          | v    | match               | [ $ ]
          | >    | match, PUSH(DIV)    | [ $ DIV ]
          |      | → CONTENT           |

2. Second <div>

State     | Read | Action              | Stack
----------|------|----------------------|-----------
CONTENT   | <    | → TAG_OPEN          | [ $ DIV ]
TAG_OPEN  | d    | not '/', → TAG_NAME | [ $ DIV ]
TAG_NAME  | i    | match               | [ $ DIV ]
          | v    | match               | [ $ DIV ]
          | >    | match, PUSH(DIV)    | [ $ DIV DIV ]
          |      | → CONTENT           |

3. Text "hi"

State     | Read | Action              | Stack
----------|------|----------------------|-------------
CONTENT   | h    | not '<', ADVANCE    | [ $ DIV DIV ]
CONTENT   | i    | not '<', ADVANCE    | [ $ DIV DIV ]

4. First </div>

State       | Read | Action            | Stack
------------|------|--------------------|--------------
CONTENT     | <    | → TAG_OPEN        | [ $ DIV DIV ]
TAG_OPEN    | /    | → CLOSE_SLASH     | [ $ DIV DIV ]
CLOSE_SLASH | d    | match             | [ $ DIV DIV ]
            | i    | match             | [ $ DIV DIV ]
            | v    | match             | [ $ DIV DIV ]
            | >    | match, POP → DIV  | [ $ DIV ]
            |      | DIV == DIV ✓      |
            |      | read next         |

5. Second </div>

State       | Read | Action            | Stack
------------|------|--------------------|---------
CONTENT     | <    | → TAG_OPEN        | [ $ DIV ]
TAG_OPEN    | /    | → CLOSE_SLASH     | [ $ DIV ]
CLOSE_SLASH | d    | match             | [ $ DIV ]
            | i    | match             | [ $ DIV ]
            | v    | match             | [ $ DIV ]
            | >    | match, POP → DIV  | [ $ ]
            |      | DIV == DIV ✓      |
            |      | read next         |

6. End of input

State       | Read | Action            | Stack
------------|------|--------------------|---------
            | EOF  | → CHECK_EMPTY     | [ $ ]
CHECK_EMPTY |      | POP → $           | [ ]
            |      | $ == $ ✓          |
            |      | → ACCEPT          |

Result: ACCEPT

The stack grew to depth 2 (two DIVs), then shrank back to just $, confirming every open tag had a matching close. An FSM would have needed dedicated states for each depth level. This PDA handles any depth with the same 8 states.

Q: Our PDA can parse nested structures and output accept or reject. But a real tokenizer needs to emit tokens, right?

Right. The PDA is still just a recognizer. To become a tokenizer, we need to understand what changes formally.

A recognizer takes an input string and produces a single binary answer: accept or reject.

Input: a string over alphabet
Output: yes or no

A transducer takes an input string and produces an output string (or sequence of output symbols). It transforms input into output.

Input: a string over alphabet
Output: a string y over an output alphabet Δ

A tokenizer is a transducer: it reads characters (Σ = ASCII) and emits tokens (Δ = the set of possible tokens).

Q: What is ?

An output buffer is a tape, simpler than the stack:

Append-only: you can only write to the end, never read back or modify
Write-only: the machine never consults its own output to make decisions
Unbounded: it grows as tokens are emitted
Starts empty

Output structure

OUTPUT ALPHABET:  Δ = TokenType × Σ*
TOKEN TYPES:      TokenType = { OPEN_TAG, CLOSE_TAG, TEXT }
OUTPUT BUFFER:    append-only list of (type, value) pairs, starts empty

For <div><div>hi</div></div>, the output buffer should end up as:

[
  (OPEN_TAG,  "div"),
  (OPEN_TAG,  "div"),
  (TEXT,      "hi"),
  (CLOSE_TAG, "div"),
  (CLOSE_TAG, "div")
]

Updated operations table (11 operations)

#	Operation	Description	Type	Assembly
1	READ	Load current input symbol	FSM	`MOV reg, [src]`
2	CMP	Test value equality	FSM	`CMP reg, reg/imm`
3	BRANCH	Conditional jump	FSM	`JE / JNE label`
4	JMP	Unconditional jump	FSM	`JMP label`
5	ASSIGN	Update state register	FSM	`MOV reg, imm`
6	ADVANCE	Move input pointer forward	FSM	`INC reg`
7	HALT	Terminate execution	FSM	`HLT`
8	PUSH	Place symbol on top of stack	PDA	`PUSH reg`
9	POP	Remove and return top of stack	PDA	`POP reg`
10	MARK	Save current input position	Transducer	`MOV reg, reg`
11	EMIT	Append token to output buffer	Transducer	`MOV [dst], reg`

Transducer Architecture: 6 components

Input tape (read-only)
Input pointer
State register
Stack pointer + Stack
Mark register — saves start position of current token
Output buffer — append-only, unbounded

Pseudocode (transducer)

STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CONTENT_LOOP,
        CONTENT_DONE, CLOSE_SLASH, CHECK_EMPTY, ACCEPT, DEAD

INPUT ALPHABET:  Σ = ASCII
STACK ALPHABET:  Γ = { DIV, $ }
OUTPUT ALPHABET: Δ = { OPEN_TAG, CLOSE_TAG, TEXT } × Σ*

START:
  PUSH($)
  ASSIGN(S0)

S0:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → TAG_OPEN)
  JMP(DEAD)

TAG_OPEN:
  ADVANCE
  READ(ch)
  CMP(ch, '/')
  BRANCH(eq → CLOSE_SLASH)
  CMP(ch, 'd')
  BRANCH(eq → TAG_NAME)
  JMP(DEAD)

TAG_NAME:
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  PUSH(DIV)
  EMIT(OPEN_TAG, "div")           -- ← emit opening tag token
  ADVANCE
  ASSIGN(CONTENT)

CONTENT:
  MARK                             -- ← save start of text region
CONTENT_LOOP:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → CONTENT_DONE)
  CMP(ch, EOF)
  BRANCH(eq → DEAD)
  ADVANCE
  JMP(CONTENT_LOOP)

CONTENT_DONE:
  CMP(mark, pos)                   -- any text between tags?
  BRANCH(eq → TAG_OPEN)            -- no text, skip emit
  EMIT(TEXT, input[mark..pos])     -- ← emit text token
  JMP(TAG_OPEN)

CLOSE_SLASH:
  ADVANCE
  READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  POP(top)
  CMP(top, DIV)
  BRANCH(ne → DEAD)
  EMIT(CLOSE_TAG, "div")          -- ← emit closing tag token
  ADVANCE
  READ(ch)
  CMP(ch, EOF)
  BRANCH(eq → CHECK_EMPTY)
  ASSIGN(CONTENT)

CHECK_EMPTY:
  POP(top)
  CMP(top, $)
  BRANCH(eq → ACCEPT)
  JMP(DEAD)

ACCEPT:
  HALT(accept)                     -- output buffer contains all tokens

DEAD:
  HALT(reject)

Trace: `<div><div>hi</div></div>` (transducer)

Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
Pos:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Stack starts:  [ $ ]
Output starts: [ ]

1. First <div>

State       | Pos | Read | Action              | Stack     | Output
------------|-----|------|----------------------|-----------|-------
S0          |  0  | <    | match, → TAG_OPEN   | [ $ ]     |
TAG_OPEN    |  1  | d    | not '/', → TAG_NAME | [ $ ]     |
TAG_NAME    |  2  | i    | match               | [ $ ]     |
            |  3  | v    | match               | [ $ ]     |
            |  4  | >    | match               | [ $ ]     |
            |     |      | PUSH(DIV)           | [ $ DIV ] |
            |     |      | EMIT(OPEN_TAG,"div")|           | ← (OPEN_TAG, "div")
            |  5  |      | ADVANCE, → CONTENT  |           |

2. Enter CONTENT, see < immediately

State        | Pos | Read | Action             | Stack     | Output
-------------|-----|------|--------------------|-----------|-------
CONTENT      |  5  |      | MARK → mark=5      | [ $ DIV ] |
CONTENT_LOOP |  5  | <    | CMP(mark,pos): 5=5 |           |
             |     |      | equal, skip emit   |           |
             |     |      | → TAG_OPEN         |           |

3. Second <div>

State       | Pos | Read | Action              | Stack         | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN    |  6  | d    | not '/', → TAG_NAME | [ $ DIV ]     |
TAG_NAME    |  7  | i    | match               | [ $ DIV ]     |
            |  8  | v    | match               | [ $ DIV ]     |
            |  9  | >    | match               | [ $ DIV ]     |
            |     |      | PUSH(DIV)           | [ $ DIV DIV ] |
            |     |      | EMIT(OPEN_TAG,"div")|               | ← (OPEN_TAG, "div")
            | 10  |      | ADVANCE, → CONTENT  |               |

4. Text "hi"

State        | Pos | Read | Action             | Stack         | Output
-------------|-----|------|--------------------|---------------|-------
CONTENT      | 10  |      | MARK → mark=10     | [ $ DIV DIV ] |
CONTENT_LOOP | 10  | h    | not '<', ADVANCE   |               |
CONTENT_LOOP | 11  | i    | not '<', ADVANCE   |               |
CONTENT_LOOP | 12  | <    | CMP(mark,pos):10≠12|              |
             |     |      | EMIT(TEXT, "hi")   |               | ← (TEXT, "hi")
             |     |      | → TAG_OPEN         |               |

5. First </div>

State       | Pos | Read | Action              | Stack         | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN    | 13  | /    | → CLOSE_SLASH       | [ $ DIV DIV ] |
CLOSE_SLASH | 14  | d    | match               |               |
            | 15  | i    | match               |               |
            | 16  | v    | match               |               |
            | 17  | >    | match               |               |
            |     |      | POP → DIV           | [ $ DIV ]     |
            |     |      | DIV == DIV ✓        |               |
            |     |      | EMIT(CLOSE_TAG,"div")|              | ← (CLOSE_TAG, "div")
            | 18  | <    | ADVANCE, not EOF    |               |
            |     |      | → CONTENT           |               |

6. Enter CONTENT, see < immediately

State        | Pos | Read | Action              | Stack     | Output
-------------|-----|------|----------------------|-----------|-------
CONTENT      | 18  |      | MARK → mark=18      | [ $ DIV ] |
CONTENT_LOOP | 18  | <    | CMP(mark,pos): 18=18|           |
             |     |      | equal, skip emit    |           |
             |     |      | → TAG_OPEN          |           |

7. Second </div>

State       | Pos | Read | Action               | Stack     | Output
------------|-----|------|------------------------|-----------|-------
TAG_OPEN    | 19  | /    | → CLOSE_SLASH          | [ $ DIV ] |
CLOSE_SLASH | 20  | d    | match                  |           |
            | 21  | i    | match                  |           |
            | 22  | v    | match                  |           |
            | 23  | >    | match                  |           |
            |     |      | POP → DIV              | [ $ ]     |
            |     |      | DIV == DIV ✓           |           |
            |     |      | EMIT(CLOSE_TAG, "div") |           | ← (CLOSE_TAG, "div")
            | 24  | EOF  | ADVANCE, EOF           |           |
            |     |      | → CHECK_EMPTY          |           |

8. End of input

State       | Pos | Read | Action              | Stack | Output
------------|-----|------|----------------------|-------|-------
CHECK_EMPTY |     |      | POP → $              | [ ]   |
            |     |      | $ == $ ✓             |       |
            |     |      | → ACCEPT             |       |
ACCEPT      |     |      | HALT(accept)         |       |

Final output buffer

[
  (OPEN_TAG,  "div"),
  (OPEN_TAG,  "div"),
  (TEXT,      "hi"),
  (CLOSE_TAG, "div"),
  (CLOSE_TAG, "div")
]

Same 10 states, same stack operations. The three EMITs fired at exactly the right boundaries, and the two CMP(mark, pos) guards correctly skipped empty text regions between consecutive tags.

Q: Is the tokenizer a lexer?

Yes. Tokenizer, lexer, lexical analyzer, scanner, all the same thing. Characters in, tokens out.

The distinction that matters is lexer vs parser:

Lexer (what we built): characters in, tokens out. Flat stream. Answers: "what are the words?"
Parser: tokens in, tree out. Hierarchical structure. Answers: "how do the words relate to each other?"

Our transducer is a lexer. It outputs a flat list:

(OPEN_TAG, "div"), (OPEN_TAG, "div"), (TEXT, "hi"), (CLOSE_TAG, "div"), (CLOSE_TAG, "div")

A parser would produce a tree:

div
 └─ div
     └─ "hi"

Our PDA already knows the tree structure in some way, the stack depth at any point tells us exactly where we are in the hierarchy. We just don't capture it. We throw that information away and emit a flat stream. Building the tree means using the stack to capture that structure as we go.

What's next

We started this post with an FSM that couldn't count, and ended with a transducer that emits tokens from arbitrarily nested HTML. Along the way, a pattern kept surfacing: climbing the Chomsky hierarchy (FSM → PDA → Turing machine) mirrors the historical evolution of CPU architecture (register → stack segment → full random-access memory). That parallel is not a coincidence, and we will keep pulling on that thread.

Next up: a lexer becomes a parser, a Turing Complete machine.

If you have questions or want to dig deeper into any of these topics, drop a comment below. Follow to catch Part 3 when it lands. Stay tuned...

DEV Community

Building a Tokenizer from Scratch [part 2]

Parser Theory: Q/A with Claude Opus

Q: Why can't FSMs handle nested structures?

Q: So, FSMs needs memory to manage remember where they are in the nested structure, correct?

Q: What is the stack, formally?

Q: Which operations and structures do we need to add to the seven primitives to create a PDA that handles nested structures?

New structure

Updated primitives table

Q: What's the simplest PDA architecture?

Building a PDA for nested `<div>` recognition

Pseudocode

Trace: `<div><div>hi</div></div>`

Q: Our PDA can parse nested structures and output accept or reject. But a real tokenizer needs to emit tokens, right?

Q: What is ?

Output structure

Updated operations table (11 operations)

Transducer Architecture: 6 components

Pseudocode (transducer)

Trace: `<div><div>hi</div></div>` (transducer)

Final output buffer

Q: Is the tokenizer a lexer?

What's next

Top comments (0)

Parser Theory: Q/A with Claude Opus

Q: Why can't FSMs handle nested structures?

Q: So, FSMs needs memory to manage remember where they are in the nested structure, correct?

Q: What is the stack, formally?

Q: Which operations and structures do we need to add to the seven primitives to create a PDA that handles nested structures?

New structure

Updated primitives table

Q: What's the simplest PDA architecture?

Building a PDA for nested <div> recognition

Pseudocode

Trace: <div><div>hi</div></div>

Q: Our PDA can parse nested structures and output accept or reject. But a real tokenizer needs to emit tokens, right?

Q: What is ?

Output structure

Updated operations table (11 operations)

Transducer Architecture: 6 components

Pseudocode (transducer)

Trace: <div><div>hi</div></div> (transducer)

Final output buffer

Q: Is the tokenizer a lexer?

What's next

Building a PDA for nested `<div>` recognition

Trace: `<div><div>hi</div></div>`

Trace: `<div><div>hi</div></div>` (transducer)