DEV Community

Cover image for Building a Tokenizer from Scratch [part 2]
Jocer Franquiz
Jocer Franquiz

Posted on

Building a Tokenizer from Scratch [part 2]

Parser Theory: Q/A with Claude Opus

In part 1, we built a working FSM that recognizes <div>text</div> using just 7 primitives mapped 1:1 to assembly opcodes. But FSMs have a hard limit: they can't handle nested structures like <div><div>hello</div></div>. In this post, we climb the Chomsky hierarchy from finite state machines to pushdown automata, build a PDA that recognizes nested <div> tags, and then turn it into a transducer that emits tokens. In other words we are building the core of a lexer.


Q: Why can't FSMs handle nested structures?

Because an FSM has a fixed number of states, and that's all the memory it has.

Consider nested divs:

<div><div><div>hello</div></div></div>
Enter fullscreen mode Exit fullscreen mode

To correctly match closing tags, you need to count how many <div>s you've opened so you know how many </div>s to expect. An FSM with, say, 12 states can handle nesting up to some fixed depth — but someone can always write HTML nested one level deeper than your states can track.

Put simply:

  • 1 level deep → you need states to remember "I'm 1 div in"
  • 2 levels deep → different states for "I'm 2 divs in"
  • n levels deep → you'd need n distinct state configurations

Since n is unbounded but your state count is finite, you run out of states. There's no way to encode "how deep am I?" in a fixed-size state register when the depth is arbitrary.

This is exactly what the stack in a pushdown automaton solves. Every time you see <div>, you push onto the stack; every </div>, you pop. The stack grows as needed, giving you unbounded memory in one direction (depth), which is precisely what nesting requires.


Q: So, FSMs needs memory to manage remember where they are in the nested structure, correct?

It depends on what you mean by "memory."

An FSM does have memory, but it's extremely limited. The current state itself is the memory. If your FSM is in state S7, that encodes something about what it has seen so far. That's how it knows it's partway through reading </div> vs still in content.

But that memory is:

  • Fixed-size, bounded by the number of states, decided at design time
  • Lossy, it can't remember the full history of what it read, only a compressed summary encoded in which state it's in
  • Non-growable, it can't allocate more memory as input gets longer

FSM Architecture

So when people say "FSMs have no memory," they really mean no auxiliary memory, that means no stack, no tape, no buffer that grows with input. The only thing an FSM "remembers" is which state it's currently in.

A PDA adds exactly one auxiliary memory structure: a stack. That single addition is enough to go from "can't handle nesting" to "can handle nesting."


Q: What is the stack, formally?

A stack is a last-in, first-out (LIFO) data structure with two operations:

  • PUSH(symbol): place a symbol on top
  • POP: remove and return the top symbol

So if your stack is ABB, a PUSH(C) gives ABBC, and a POP returns C leaving ABB again.

The key properties:

  1. You can only access the top: no reading the middle or bottom
  2. Unbounded size: it can grow as large as needed (unlike the FSM's fixed states)
  3. It starts empty: at the beginning of computation, the stack contains nothing (or a special bottom marker like $)

That's it. It's the simplest possible auxiliar memory that's still useful. You get only sequential access from one end. But that restriction is exactly what matches the structure of nesting: the most recently opened thing is the next thing that needs to close.


Q: Which operations and structures do we need to add to the seven primitives to create a PDA that handles nested structures?

We need two new operations and one new structure:

New structure

  • STACK: a LIFO storage addressed by a stack pointer (SP)

Updated primitives table

# Operation Description Type Assembly equivalent
1 READ Load current input symbol FSM MOV reg, [src]
2 CMP Test value equality FSM CMP reg, imm
3 BRANCH Conditional jump FSM JE / JNE label
4 JMP Unconditional jump FSM JMP label
5 ASSIGN Update state register FSM MOV reg, imm
6 ADVANCE Move input pointer forward FSM INC reg
7 HALT Terminate execution FSM HLT
8 PUSH Place symbol on top of stack PDA PUSH reg
9 POP Remove and return top of stack PDA POP reg

Two operations and one structure turn an FSM into a PDA.

x86 has native PUSH/POP instructions with a hardware stack pointer (SP/ESP/RSP), so the 1:1 mapping to CPU instructions still holds. PDAs are just as mechanically efficient as FSMs. The only cost is that memory usage now grows with nesting depth instead of being fixed.

You could also add a PEEK operation (read top without removing), but it's not strictly necessary — you can always POP then PUSH back. It's a convenience, not a primitive.


Q: What's the simplest PDA architecture?

A PDA architecture needs exactly four components:

  1. Input tape — the string being read (read-only, left-to-right)
  2. Input pointer — current position on the tape
  3. State register — holds the current state from a finite set Q
  4. Stack — LIFO storage over alphabet Γ

And the nine operations from our table are the instruction set that operates on these four components.

Compared to the FSM architecture from part 1, we added exactly one component (the stack) and two operations (PUSH, POP). The stack pointer (SP) lives inside the finite control alongside the state register — it's a register, not part of the stack itself.

PDA Architecture

The PDA architecture is a von Neumann architecture:

PDA component Von Neumann equivalent
Input tape Memory (data segment)
Input pointer Program counter / index register
State register CPU register (accumulator)
Stack pointer SP register
Stack (in RAM) Stack segment of memory
9 operations Subset of the instruction set

What's missing for full Von Neumann: writable memory, random access, arithmetic, and stored program. The automata hierarchy maps directly onto hardware capability:

  • FSM → combinational logic + a register
  • PDA → CPU + stack segment
  • Turing machine → CPU + full random-access memory = von Neumann

Building a PDA for nested <div> recognition

Pseudocode

STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CLOSE_SLASH,
        CLOSE_NAME, ACCEPT, DEAD

STACK ALPHABET: Γ = { DIV, $ }

START:
  PUSH($)                    -- bottom marker
  ASSIGN(S0)

S0:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq  TAG_OPEN)
  JMP(DEAD)

TAG_OPEN:
  ADVANCE
  READ(ch)
  CMP(ch, '/')
  BRANCH(eq  CLOSE_SLASH)   -- it's a closing tag
  CMP(ch, 'd')
  BRANCH(eq  TAG_NAME)      -- it's an opening tag
  JMP(DEAD)

TAG_NAME:                     -- read "div>"
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  PUSH(DIV)                   --  matched <div>, push it
  ADVANCE
  ASSIGN(CONTENT)

CONTENT:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq  TAG_OPEN)      --  could be nested <div> or </div>
  CMP(ch, EOF)
  BRANCH(eq  DEAD)           -- input ended with open tags
  ADVANCE                     -- consume text character
  JMP(CONTENT)

CLOSE_SLASH:                  -- inside </
  ADVANCE
  READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  POP(top)                    --  pop the stack
  CMP(top, DIV)
  BRANCH(ne  DEAD)           -- mismatch
  ADVANCE
  READ(ch)
  CMP(ch, EOF)
  BRANCH(eq  CHECK_EMPTY)
  ASSIGN(CONTENT)             -- more input, keep going

CHECK_EMPTY:
  POP(top)
  CMP(top, $)
  BRANCH(eq  ACCEPT)         -- stack empty = all tags matched
  JMP(DEAD)                   -- leftover tags = unbalanced

ACCEPT:
  HALT(accept)

DEAD:
  HALT(reject)
Enter fullscreen mode Exit fullscreen mode

The entire FSM logic is reused. The only additions are three PUSH/POP calls. One PUSH per <div>, one POP per </div>, one final POP to verify the stack is empty.

Trace: <div><div>hi</div></div>

Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
       0 1 2 3 4 5 6 7 8 9 ...

Stack starts: [ $ ]
Enter fullscreen mode Exit fullscreen mode

1. First <div>

State     | Read | Action              | Stack
----------|------|----------------------|--------
S0        | <    | match, → TAG_OPEN   | [ $ ]
TAG_OPEN  | d    | not '/', → TAG_NAME | [ $ ]
TAG_NAME  | i    | match               | [ $ ]
          | v    | match               | [ $ ]
          | >    | match, PUSH(DIV)    | [ $ DIV ]
          |      | → CONTENT           |
Enter fullscreen mode Exit fullscreen mode

2. Second <div>

State     | Read | Action              | Stack
----------|------|----------------------|-----------
CONTENT   | <    | → TAG_OPEN          | [ $ DIV ]
TAG_OPEN  | d    | not '/', → TAG_NAME | [ $ DIV ]
TAG_NAME  | i    | match               | [ $ DIV ]
          | v    | match               | [ $ DIV ]
          | >    | match, PUSH(DIV)    | [ $ DIV DIV ]
          |      | → CONTENT           |
Enter fullscreen mode Exit fullscreen mode

3. Text "hi"

State     | Read | Action              | Stack
----------|------|----------------------|-------------
CONTENT   | h    | not '<', ADVANCE    | [ $ DIV DIV ]
CONTENT   | i    | not '<', ADVANCE    | [ $ DIV DIV ]
Enter fullscreen mode Exit fullscreen mode

4. First </div>

State       | Read | Action            | Stack
------------|------|--------------------|--------------
CONTENT     | <    | → TAG_OPEN        | [ $ DIV DIV ]
TAG_OPEN    | /    | → CLOSE_SLASH     | [ $ DIV DIV ]
CLOSE_SLASH | d    | match             | [ $ DIV DIV ]
            | i    | match             | [ $ DIV DIV ]
            | v    | match             | [ $ DIV DIV ]
            | >    | match, POP → DIV  | [ $ DIV ]
            |      | DIV == DIV ✓      |
            |      | read next         |
Enter fullscreen mode Exit fullscreen mode

5. Second </div>

State       | Read | Action            | Stack
------------|------|--------------------|---------
CONTENT     | <    | → TAG_OPEN        | [ $ DIV ]
TAG_OPEN    | /    | → CLOSE_SLASH     | [ $ DIV ]
CLOSE_SLASH | d    | match             | [ $ DIV ]
            | i    | match             | [ $ DIV ]
            | v    | match             | [ $ DIV ]
            | >    | match, POP → DIV  | [ $ ]
            |      | DIV == DIV ✓      |
            |      | read next         |
Enter fullscreen mode Exit fullscreen mode

6. End of input

State       | Read | Action            | Stack
------------|------|--------------------|---------
            | EOF  | → CHECK_EMPTY     | [ $ ]
CHECK_EMPTY |      | POP → $           | [ ]
            |      | $ == $ ✓          |
            |      | → ACCEPT          |
Enter fullscreen mode Exit fullscreen mode

Result: ACCEPT

The stack grew to depth 2 (two DIVs), then shrank back to just $, confirming every open tag had a matching close. An FSM would have needed dedicated states for each depth level. This PDA handles any depth with the same 8 states.

PDA State diagram


Q: Our PDA can parse nested structures and output accept or reject. But a real tokenizer needs to emit tokens, right?

Right. The PDA is still just a recognizer. To become a tokenizer, we need to understand what changes formally.

A recognizer takes an input string and produces a single binary answer: accept or reject.

  • Input: a string over alphabet
  • Output: yes or no

A transducer takes an input string and produces an output string (or sequence of output symbols). It transforms input into output.

  • Input: a string over alphabet
  • Output: a string y over an output alphabet Δ

A tokenizer is a transducer: it reads characters (Σ = ASCII) and emits tokens (Δ = the set of possible tokens).

Q: What is ?

An output buffer is a tape, simpler than the stack:

  • Append-only: you can only write to the end, never read back or modify
  • Write-only: the machine never consults its own output to make decisions
  • Unbounded: it grows as tokens are emitted
  • Starts empty

Output structure

OUTPUT ALPHABET:  Δ = TokenType × Σ*
TOKEN TYPES:      TokenType = { OPEN_TAG, CLOSE_TAG, TEXT }
OUTPUT BUFFER:    append-only list of (type, value) pairs, starts empty
Enter fullscreen mode Exit fullscreen mode

For <div><div>hi</div></div>, the output buffer should end up as:

[
  (OPEN_TAG,  "div"),
  (OPEN_TAG,  "div"),
  (TEXT,      "hi"),
  (CLOSE_TAG, "div"),
  (CLOSE_TAG, "div")
]
Enter fullscreen mode Exit fullscreen mode

Updated operations table (11 operations)

# Operation Description Type Assembly
1 READ Load current input symbol FSM MOV reg, [src]
2 CMP Test value equality FSM CMP reg, reg/imm
3 BRANCH Conditional jump FSM JE / JNE label
4 JMP Unconditional jump FSM JMP label
5 ASSIGN Update state register FSM MOV reg, imm
6 ADVANCE Move input pointer forward FSM INC reg
7 HALT Terminate execution FSM HLT
8 PUSH Place symbol on top of stack PDA PUSH reg
9 POP Remove and return top of stack PDA POP reg
10 MARK Save current input position Transducer MOV reg, reg
11 EMIT Append token to output buffer Transducer MOV [dst], reg

Transducer Architecture: 6 components

  1. Input tape (read-only)
  2. Input pointer
  3. State register
  4. Stack pointer + Stack
  5. Mark register — saves start position of current token
  6. Output buffer — append-only, unbounded

Transducer Architecture


Pseudocode (transducer)

STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CONTENT_LOOP,
        CONTENT_DONE, CLOSE_SLASH, CHECK_EMPTY, ACCEPT, DEAD

INPUT ALPHABET:  Σ = ASCII
STACK ALPHABET:  Γ = { DIV, $ }
OUTPUT ALPHABET: Δ = { OPEN_TAG, CLOSE_TAG, TEXT } × Σ*

START:
  PUSH($)
  ASSIGN(S0)

S0:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq  TAG_OPEN)
  JMP(DEAD)

TAG_OPEN:
  ADVANCE
  READ(ch)
  CMP(ch, '/')
  BRANCH(eq  CLOSE_SLASH)
  CMP(ch, 'd')
  BRANCH(eq  TAG_NAME)
  JMP(DEAD)

TAG_NAME:
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  PUSH(DIV)
  EMIT(OPEN_TAG, "div")           --  emit opening tag token
  ADVANCE
  ASSIGN(CONTENT)

CONTENT:
  MARK                             --  save start of text region
CONTENT_LOOP:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq  CONTENT_DONE)
  CMP(ch, EOF)
  BRANCH(eq  DEAD)
  ADVANCE
  JMP(CONTENT_LOOP)

CONTENT_DONE:
  CMP(mark, pos)                   -- any text between tags?
  BRANCH(eq  TAG_OPEN)            -- no text, skip emit
  EMIT(TEXT, input[mark..pos])     --  emit text token
  JMP(TAG_OPEN)

CLOSE_SLASH:
  ADVANCE
  READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  POP(top)
  CMP(top, DIV)
  BRANCH(ne  DEAD)
  EMIT(CLOSE_TAG, "div")          --  emit closing tag token
  ADVANCE
  READ(ch)
  CMP(ch, EOF)
  BRANCH(eq  CHECK_EMPTY)
  ASSIGN(CONTENT)

CHECK_EMPTY:
  POP(top)
  CMP(top, $)
  BRANCH(eq  ACCEPT)
  JMP(DEAD)

ACCEPT:
  HALT(accept)                     -- output buffer contains all tokens

DEAD:
  HALT(reject)
Enter fullscreen mode Exit fullscreen mode

Trace: <div><div>hi</div></div> (transducer)

Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
Pos:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Stack starts:  [ $ ]
Output starts: [ ]
Enter fullscreen mode Exit fullscreen mode

1. First <div>

State       | Pos | Read | Action              | Stack     | Output
------------|-----|------|----------------------|-----------|-------
S0          |  0  | <    | match, → TAG_OPEN   | [ $ ]     |
TAG_OPEN    |  1  | d    | not '/', → TAG_NAME | [ $ ]     |
TAG_NAME    |  2  | i    | match               | [ $ ]     |
            |  3  | v    | match               | [ $ ]     |
            |  4  | >    | match               | [ $ ]     |
            |     |      | PUSH(DIV)           | [ $ DIV ] |
            |     |      | EMIT(OPEN_TAG,"div")|           | ← (OPEN_TAG, "div")
            |  5  |      | ADVANCE, → CONTENT  |           |
Enter fullscreen mode Exit fullscreen mode

2. Enter CONTENT, see < immediately

State        | Pos | Read | Action             | Stack     | Output
-------------|-----|------|--------------------|-----------|-------
CONTENT      |  5  |      | MARK → mark=5      | [ $ DIV ] |
CONTENT_LOOP |  5  | <    | CMP(mark,pos): 5=5 |           |
             |     |      | equal, skip emit   |           |
             |     |      | → TAG_OPEN         |           |
Enter fullscreen mode Exit fullscreen mode

3. Second <div>

State       | Pos | Read | Action              | Stack         | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN    |  6  | d    | not '/', → TAG_NAME | [ $ DIV ]     |
TAG_NAME    |  7  | i    | match               | [ $ DIV ]     |
            |  8  | v    | match               | [ $ DIV ]     |
            |  9  | >    | match               | [ $ DIV ]     |
            |     |      | PUSH(DIV)           | [ $ DIV DIV ] |
            |     |      | EMIT(OPEN_TAG,"div")|               | ← (OPEN_TAG, "div")
            | 10  |      | ADVANCE, → CONTENT  |               |
Enter fullscreen mode Exit fullscreen mode

4. Text "hi"

State        | Pos | Read | Action             | Stack         | Output
-------------|-----|------|--------------------|---------------|-------
CONTENT      | 10  |      | MARK → mark=10     | [ $ DIV DIV ] |
CONTENT_LOOP | 10  | h    | not '<', ADVANCE   |               |
CONTENT_LOOP | 11  | i    | not '<', ADVANCE   |               |
CONTENT_LOOP | 12  | <    | CMP(mark,pos):10≠12|              |
             |     |      | EMIT(TEXT, "hi")   |               | ← (TEXT, "hi")
             |     |      | → TAG_OPEN         |               |
Enter fullscreen mode Exit fullscreen mode

5. First </div>

State       | Pos | Read | Action              | Stack         | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN    | 13  | /    | → CLOSE_SLASH       | [ $ DIV DIV ] |
CLOSE_SLASH | 14  | d    | match               |               |
            | 15  | i    | match               |               |
            | 16  | v    | match               |               |
            | 17  | >    | match               |               |
            |     |      | POP → DIV           | [ $ DIV ]     |
            |     |      | DIV == DIV ✓        |               |
            |     |      | EMIT(CLOSE_TAG,"div")|              | ← (CLOSE_TAG, "div")
            | 18  | <    | ADVANCE, not EOF    |               |
            |     |      | → CONTENT           |               |
Enter fullscreen mode Exit fullscreen mode

6. Enter CONTENT, see < immediately

State        | Pos | Read | Action              | Stack     | Output
-------------|-----|------|----------------------|-----------|-------
CONTENT      | 18  |      | MARK → mark=18      | [ $ DIV ] |
CONTENT_LOOP | 18  | <    | CMP(mark,pos): 18=18|           |
             |     |      | equal, skip emit    |           |
             |     |      | → TAG_OPEN          |           |
Enter fullscreen mode Exit fullscreen mode

7. Second </div>

State       | Pos | Read | Action               | Stack     | Output
------------|-----|------|------------------------|-----------|-------
TAG_OPEN    | 19  | /    | → CLOSE_SLASH          | [ $ DIV ] |
CLOSE_SLASH | 20  | d    | match                  |           |
            | 21  | i    | match                  |           |
            | 22  | v    | match                  |           |
            | 23  | >    | match                  |           |
            |     |      | POP → DIV              | [ $ ]     |
            |     |      | DIV == DIV ✓           |           |
            |     |      | EMIT(CLOSE_TAG, "div") |           | ← (CLOSE_TAG, "div")
            | 24  | EOF  | ADVANCE, EOF           |           |
            |     |      | → CHECK_EMPTY          |           |
Enter fullscreen mode Exit fullscreen mode

8. End of input

State       | Pos | Read | Action              | Stack | Output
------------|-----|------|----------------------|-------|-------
CHECK_EMPTY |     |      | POP → $              | [ ]   |
            |     |      | $ == $ ✓             |       |
            |     |      | → ACCEPT             |       |
ACCEPT      |     |      | HALT(accept)         |       |
Enter fullscreen mode Exit fullscreen mode

Final output buffer

[
  (OPEN_TAG,  "div"),
  (OPEN_TAG,  "div"),
  (TEXT,      "hi"),
  (CLOSE_TAG, "div"),
  (CLOSE_TAG, "div")
]
Enter fullscreen mode Exit fullscreen mode

Same 10 states, same stack operations. The three EMITs fired at exactly the right boundaries, and the two CMP(mark, pos) guards correctly skipped empty text regions between consecutive tags.

State diagram for PDA transducer tokenizing nested div tags


Q: Is the tokenizer a lexer?

Yes. Tokenizer, lexer, lexical analyzer, scanner, all the same thing. Characters in, tokens out.

The distinction that matters is lexer vs parser:

  • Lexer (what we built): characters in, tokens out. Flat stream. Answers: "what are the words?"
  • Parser: tokens in, tree out. Hierarchical structure. Answers: "how do the words relate to each other?"

Our transducer is a lexer. It outputs a flat list:

(OPEN_TAG, "div"), (OPEN_TAG, "div"), (TEXT, "hi"), (CLOSE_TAG, "div"), (CLOSE_TAG, "div")
Enter fullscreen mode Exit fullscreen mode

A parser would produce a tree:

div
 └─ div
     └─ "hi"
Enter fullscreen mode Exit fullscreen mode

Our PDA already knows the tree structure in some way, the stack depth at any point tells us exactly where we are in the hierarchy. We just don't capture it. We throw that information away and emit a flat stream. Building the tree means using the stack to capture that structure as we go.


What's next

We started this post with an FSM that couldn't count, and ended with a transducer that emits tokens from arbitrarily nested HTML. Along the way, a pattern kept surfacing: climbing the Chomsky hierarchy (FSM → PDA → Turing machine) mirrors the historical evolution of CPU architecture (register → stack segment → full random-access memory). That parallel is not a coincidence, and we will keep pulling on that thread.

Next up: a lexer becomes a parser, a Turing Complete machine.

If you have questions or want to dig deeper into any of these topics, drop a comment below. Follow to catch Part 3 when it lands. Stay tuned...

Top comments (0)