
Jocer Franquiz

Building a Tokenizer from Scratch


FSM Parser Theory: A Friendly Conversation with Claude Opus


These are real questions I asked Claude Opus to learn the theory behind parsers. The starting point is combinational logic, and the end point is a finite state machine. The answers build on each other incrementally.

Let's walk through the theory behind finite state machines by building one from scratch. And by from scratch I mean no libraries and no classes. Just logic and a minimal set of 10 operators.

At the end, you'll find a working FSM that recognizes a complete <div>text</div> structure and can be extended to handle more patterns.


The automata hierarchy

Q: Automata theory has a class hierarchy, starting from combinational logic. Right?

Yes, that's a clean way to frame it. The hierarchy goes from "no memory" to "more memory":

  1. Combinational logic — output depends only on current input. No memory, no state. Just pure boolean functions (AND, OR, etc.).

  2. Finite State Machine (FSM) — adds memory in the form of state. Output depends on current input + current state. Memory is finite and fixed.

  3. Pushdown Automaton (PDA) — FSM + a stack. The stack gives it unlimited but structured memory (LIFO). This is what recognizes nested structures like matching parentheses.

  4. Turing Machine — FSM + an infinite tape (read/write). The most powerful model; can simulate any computation.

Each level can recognize a strictly larger class of languages (the Chomsky hierarchy):

| Machine | Language class | Example |
| --- | --- | --- |
| Combinational | (none) | Boolean expression |
| FSM | Regular | `/ab*c/` |
| PDA | Context-free | Balanced `{}[]()` |
| Turing | Recursively enumerable | Anything computable |

For parsing HTML, you need at minimum a PDA (because of nested tags), but in practice browsers use a stateful tokenizer (FSM) + a tree builder (stack-based) — which together behave like a PDA.
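To make the FSM/PDA boundary concrete: checking balanced brackets needs a stack, because no fixed set of states can count unboundedly many opens. A minimal sketch in Python (my own illustration, not from the conversation):

```python
def balanced(s):
    """PDA-style check: a stack remembers arbitrarily many open brackets."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)          # push an open bracket
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False          # mismatched or unexpected close
    return not stack                  # accept only if everything was closed
```

An FSM can only track nesting up to some fixed depth; the unbounded stack is exactly what lifts the machine into the context-free class.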


The boolean foundation

Q: Which combinational logic operators do we need to implement a finite state machine? Conceptually.

Conceptually, you only need three:

  • AND — "state is X and input is Y"
  • OR — "condition A or condition B leads to the same result"
  • NOT — negate a condition

These three form a functionally complete set — you can express any boolean function with just AND, OR, NOT.

The core of an FSM is its transition function:

next_state = f(current_state, input)

That function is purely combinational — it's just a truth table. Every row is a conjunction (AND) of conditions, and you OR the rows together to select the next state. Something like:

if state == A AND input == 'x' → go to B
if state == A AND input == 'y' → go to C
if state == B AND input == 'x' → go to A
...

Each line is AND. Multiple lines that share an outcome are OR'd together.

NOT comes in when you need to express "any input that is not this character."

That's the combinational part. The state register (what actually holds the current state between steps) is the sequential part — it's where memory enters the picture.
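That split can be sketched directly in Python, using the hypothetical three-state machine from the pseudocode above (state names A/B/C are illustrative):

```python
def f(state, char):
    """Pure combinational logic: AND/OR/NOT over (state, input). No memory."""
    if state == 'A' and char == 'x':   # AND: state is A and input is x
        return 'B'
    if state == 'A' and char == 'y':
        return 'C'
    if state == 'B' and char == 'x':
        return 'A'
    return 'DEAD'                      # NOT: any combination not listed above

def run(inp):
    state = 'A'                        # the state register: the sequential part
    for ch in inp:
        state = f(state, ch)           # latch f's output on every step
    return state
```

`f` by itself is just a truth table; only `run`'s variable carries anything between steps.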


The seven primitive operators

Q: What else do we need from a computational point of view? Something like inc, if/else, cmp, etc.

Q: Is this list complete? Provide a better list and a simple conceptual example.

The complete set of primitive operations for an FSM:

| Operation | What it does | Why an FSM needs it |
| --- | --- | --- |
| `READ` | Get the current input symbol | The machine must consume input |
| `CMP` | Test equality of two values | Check state and input against known constants |
| `BRANCH` | Conditional jump | Select which transition fires |
| `JMP` | Unconditional jump | Loop back to process the next character |
| `ASSIGN` | Write a value to a register | Update the state register |
| `ADVANCE` | Move the input pointer forward | Step through the input string |
| `HALT` | Stop execution | Accept or reject; terminate the machine |

Seven operations total. No arithmetic, no stack, no heap.

Example: an FSM that accepts strings matching ab*c (an a, then zero or more bs, then c).

States: S0 (start), S1 (got a), ACCEPT, DEAD

    ; convention: BRANCH jumps only when the preceding CMP found a mismatch
LOOP:
    READ   char                ; get current input symbol

    ; ── S0: expecting 'a' ──
    CMP    state, S0
    BRANCH not_s0
    CMP    char, 'a'
    BRANCH s0_fail
    ASSIGN state, S1
    JMP    next

  s0_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s0:
    ; ── S1: expecting 'b' or 'c' ──
    CMP    state, S1
    BRANCH not_s1
    CMP    char, 'b'
    BRANCH s1_not_b
    ASSIGN state, S1           ; stay in S1
    JMP    next

  s1_not_b:
    CMP    char, 'c'
    BRANCH s1_fail
    ASSIGN state, ACCEPT
    JMP    next

  s1_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s1:
    ; ── DEAD or ACCEPT + more input → DEAD ──
    ASSIGN state, DEAD

  next:
    ADVANCE                    ; move to next character
    CMP    input_end, true
    BRANCH LOOP                ; not at end → loop

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH reject
    HALT   accept

  reject:
    HALT   reject

Trace with input "abbc":

| Step | char | state before | transition | state after |
| --- | --- | --- | --- | --- |
| 1 | `a` | S0 | S0 + `a` → S1 | S1 |
| 2 | `b` | S1 | S1 + `b` → S1 | S1 |
| 3 | `b` | S1 | S1 + `b` → S1 | S1 |
| 4 | `c` | S1 | S1 + `c` → ACCEPT | ACCEPT |

End of input, state is ACCEPT → accepted.

Every single line maps to one of the seven primitives — nothing else is needed.


Visualization of the FSM for ab*c

Finite State Machine that recognizes ab*c

Four states, all transitions accounted for — including the error paths to DEAD.


Are seven operators enough for any FSM?

Q: Can we build any FSM with these seven operators?

Yes, the seven operations are sufficient. In fact, they're slightly redundant.

JMP is redundant — it's just BRANCH(always_true, target). And CMP can be absorbed into BRANCH (as "branch-on-symbol-match"). And READ + ADVANCE can be fused into one operation.

So the theoretical minimum is 4 operations:

| Minimal op | Absorbs |
| --- | --- |
| `READ/ADVANCE` | `READ` + `ADVANCE` |
| `BRANCH-on-match` | `CMP` + `BRANCH` + `JMP` |
| `ASSIGN` | (unchanged) |
| `HALT` | (unchanged) |

This matches Wang's B-machine (1957) — a minimal abstract machine with a read-only input tape and finite control that is exactly a DFA. Its instruction set is: {read, branch-on-symbol, halt-accept, halt-reject}.

Key references:

  • Sipser (Introduction to the Theory of Computation, Ch. 1) — treats the transition function δ(state, symbol) → state as a single atomic primitive. The 7 ops are a decomposition of that.
  • Hopcroft/Ullman/Motwani (Intro to Automata Theory, Ch. 2) — same formal treatment.
  • Minsky (Computation: Finite and Infinite Machines, 1967, Ch. 3) — decomposes FSMs into read input, compute next-state (combinational logic), latch new state (memory). The hardware view of exactly these ops.
  • Wang, 1957 — "A Variant to Turing's Theory of Computing Machines", Journal of the ACM 4(1). The closest classical reference to a minimal FSM instruction set. ACM link

Bottom line: The 7 ops are sufficient for any FSM. We keep them separate (rather than using the minimal 4) because they map more clearly to code — each one becomes a distinct, readable operation in the implementation.
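As a sanity check that 4 operations suffice, here is the ab*c machine squeezed into that minimal shape in Python, with branch-on-match collapsed into a single table lookup (a sketch of the idea, not Wang's actual formulation):

```python
# Transition table for a b* c: (state, symbol) -> next state.
TABLE = {
    ('S0', 'a'): 'S1',
    ('S1', 'b'): 'S1',
    ('S1', 'c'): 'ACCEPT',
}

def run_minimal(inp):
    state = 'S0'
    for ch in inp:                              # READ/ADVANCE fused
        state = TABLE.get((state, ch), 'DEAD')  # BRANCH-on-match + ASSIGN
    return state == 'ACCEPT'                    # HALT accept / reject
```

A missing table entry plays the role of the failing branch, so DEAD needs no explicit rows.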


The assembly connection

Q: Those 7 operators are very much like assembly operators.

Exactly. They are assembly-level operations — that's not a coincidence.

An FSM is the computational model that sits right above combinational logic. It's the simplest machine that has state. And assembly is the thinnest software abstraction over hardware. So they naturally meet at the same level.

The mapping is direct:

| FSM op | x86 equivalent |
| --- | --- |
| `READ` | `mov al, [rsi]` (load byte from memory) |
| `CMP` | `cmp al, 'a'` |
| `BRANCH` | `jne label` |
| `JMP` | `jmp label` |
| `ASSIGN` | `mov state, S1` |
| `ADVANCE` | `inc rsi` |
| `HALT` | `ret` / `syscall` exit |

Hand-written FSMs in assembly (like in high-performance lexers) look almost identical to our pseudocode. The abstraction gap is essentially zero.

This is also why FSMs are so fast in practice — every operation maps 1:1 to a single CPU instruction. No function calls, no memory allocation, no indirection. Just compare, branch, assign, repeat.

When we move to a higher-level language like Python, we'll express these same 7 operations using if/elif, ==, variable assignment, and a for loop — but conceptually, it's the same assembly underneath.
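Here is that Python translation for the ab*c machine, a sketch in which each construct maps back to one of the 7 ops:

```python
def accepts_abc(inp):
    """FSM for a b* c, written with only the 7 primitive operations."""
    state = 'S0'                      # ASSIGN
    for char in inp:                  # READ + ADVANCE + the loop-back JMP
        if state == 'S0':             # CMP + BRANCH
            state = 'S1' if char == 'a' else 'DEAD'
        elif state == 'S1':
            if char == 'b':
                state = 'S1'          # stay in S1
            elif char == 'c':
                state = 'ACCEPT'
            else:
                state = 'DEAD'
        else:                         # DEAD, or ACCEPT with input left over
            state = 'DEAD'
    return state == 'ACCEPT'          # HALT accept / reject
```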


Expanding the operators for real HTML

Q: Can we implement an FSM that recognizes <div>text</div> where text is alphanumeric? We need range comparisons (<=, >=) and different branch types.

Yes. To handle character ranges like [a-z], [A-Z], [0-9], we need to expand CMP and split BRANCH into variants.

Updated operator set (10 operations):

| Operation | What it does | Why an FSM needs it | x86 equivalent |
| --- | --- | --- | --- |
| `READ` | Get current input symbol | Consume input | `mov al, [rsi]` |
| `CMP` | Compare two values, set flags | Every transition needs a comparison | `cmp al, 'a'` |
| `BRANCH_EQ` | Jump if equal | Exact character match | `je label` |
| `BRANCH_NE` | Jump if not equal | Skip when no match | `jne label` |
| `BRANCH_LT` | Jump if less than | Lower bound of range check | `jb label` |
| `BRANCH_GT` | Jump if greater than | Upper bound of range check | `ja label` |
| `JMP` | Unconditional jump | Loop, skip blocks | `jmp label` |
| `ASSIGN` | Write value to register | Update state | `mov reg, val` |
| `ADVANCE` | Move input pointer forward | Step through input | `inc rsi` |
| `HALT` | Stop execution | Accept or reject | `ret` / `syscall` |

We split the old BRANCH into four variants (EQ, NE, LT, GT). CMP now sets flags rather than returning a boolean — exactly like x86 cmp. Range checks like [a-z] become:

CMP       char, 'a'
BRANCH_LT fail        ; char < 'a' → not lowercase
CMP       char, 'z'
BRANCH_GT fail        ; char > 'z' → not lowercase
; if we reach here, 'a' <= char <= 'z'
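In Python, each two-CMP range check collapses into a chained comparison. A sketch of the alphanumeric test the FSM performs (the helper name is mine):

```python
def is_alnum_char(ch):
    """Range checks exactly as the FSM does them: two comparisons per range."""
    return ('a' <= ch <= 'z'        # lowercase: the BRANCH_LT / BRANCH_GT pair
            or 'A' <= ch <= 'Z'     # uppercase
            or '0' <= ch <= '9')    # digits
```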

FSM for <div>[a-zA-Z0-9]+</div>

States and what has been consumed:

| State | Consumed so far | Expects next |
| --- | --- | --- |
| S0 (start) | (nothing) | `<` |
| S1 | `<` | `d` |
| S2 | `<d` | `i` |
| S3 | `<di` | `v` |
| S4 | `<div` | `>` |
| S5 | `<div>` | first alphanumeric |
| S6 | `<div>text` | more alphanumeric or `<` |
| S7 | `<div>text<` | `/` |
| S8 | `<div>text</` | `d` |
| S9 | `<div>text</d` | `i` |
| S10 | `<div>text</di` | `v` |
| S11 | `<div>text</div` | `>` |
| ACCEPT | `<div>text</div>` | (end of input) |
| DEAD (error) | | (absorbs remaining input) |

Finite State Machine that recognizes <div>

Pseudocode using the 10 operators:

    ASSIGN state, S0

LOOP:
    ; ── DEAD: absorb remaining input ──
    CMP    state, DEAD
    BRANCH_EQ next

    ; ── ACCEPT + more input → error ──
    CMP    state, ACCEPT
    BRANCH_EQ to_dead

    READ   char

    ; ── S0: expecting '<' ──
    CMP    state, S0
    BRANCH_NE not_s0
    CMP    char, '<'
    BRANCH_NE to_dead
    ASSIGN state, S1
    JMP    next

  not_s0:
    ; ── S1: expecting 'd' ──
    CMP    state, S1
    BRANCH_NE not_s1
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S2
    JMP    next

  not_s1:
    ; ── S2: expecting 'i' ──
    CMP    state, S2
    BRANCH_NE not_s2
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S3
    JMP    next

  not_s2:
    ; ── S3: expecting 'v' ──
    CMP    state, S3
    BRANCH_NE not_s3
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S4
    JMP    next

  not_s3:
    ; ── S4: expecting '>' ──
    CMP    state, S4
    BRANCH_NE not_s4
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, S5
    JMP    next

  not_s4:
    ; ── S5: expecting first alphanumeric ──
    CMP    state, S5
    BRANCH_NE not_s5
    JMP    check_alnum_s6     ; reuse alnum check, go to S6

  not_s5:
    ; ── S6: more alphanumeric or '<' ──
    CMP    state, S6
    BRANCH_NE not_s6
    CMP    char, '<'
    BRANCH_NE s6_alnum
    ASSIGN state, S7
    JMP    next
  s6_alnum:
    JMP    check_alnum_stay   ; stay in S6

  not_s6:
    ; ── S7: expecting '/' ──
    CMP    state, S7
    BRANCH_NE not_s7
    CMP    char, '/'
    BRANCH_NE to_dead
    ASSIGN state, S8
    JMP    next

  not_s7:
    ; ── S8: expecting 'd' ──
    CMP    state, S8
    BRANCH_NE not_s8
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S9
    JMP    next

  not_s8:
    ; ── S9: expecting 'i' ──
    CMP    state, S9
    BRANCH_NE not_s9
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S10
    JMP    next

  not_s9:
    ; ── S10: expecting 'v' ──
    CMP    state, S10
    BRANCH_NE not_s10
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S11
    JMP    next

  not_s10:
    ; ── S11: expecting '>' ──
    CMP    state, S11
    BRANCH_NE to_dead
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, ACCEPT
    JMP    next

    ; ── alphanumeric check → transition to S6 ──
  check_alnum_s6:
    CMP    char, 'a'
    BRANCH_LT upper_s6
    CMP    char, 'z'
    BRANCH_GT upper_s6
    ASSIGN state, S6
    JMP    next
  upper_s6:
    CMP    char, 'A'
    BRANCH_LT digit_s6
    CMP    char, 'Z'
    BRANCH_GT digit_s6
    ASSIGN state, S6
    JMP    next
  digit_s6:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    ASSIGN state, S6
    JMP    next

    ; ── alphanumeric check → stay in S6 ──
  check_alnum_stay:
    CMP    char, 'a'
    BRANCH_LT upper_stay
    CMP    char, 'z'
    BRANCH_GT upper_stay
    JMP    next
  upper_stay:
    CMP    char, 'A'
    BRANCH_LT digit_stay
    CMP    char, 'Z'
    BRANCH_GT digit_stay
    JMP    next
  digit_stay:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    JMP    next

  to_dead:
    ASSIGN state, DEAD

  next:
    ADVANCE
    CMP    input_end, true
    BRANCH_NE LOOP

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH_NE reject
    HALT   accept

  reject:
    HALT   reject

Trace with input <div>hello</div>:

| Step | char | state transition | match |
| --- | --- | --- | --- |
| 1 | `<` | S0 → S1 | exact match |
| 2 | `d` | S1 → S2 | exact match |
| 3 | `i` | S2 → S3 | exact match |
| 4 | `v` | S3 → S4 | exact match |
| 5 | `>` | S4 → S5 | exact match |
| 6 | `h` | S5 → S6 | range: `'a' <= 'h' <= 'z'` |
| 7 | `e` | S6 → S6 | range: `'a' <= 'e' <= 'z'` |
| 8 | `l` | S6 → S6 | range: `'a' <= 'l' <= 'z'` |
| 9 | `l` | S6 → S6 | range: `'a' <= 'l' <= 'z'` |
| 10 | `o` | S6 → S6 | range: `'a' <= 'o' <= 'z'` |
| 11 | `<` | S6 → S7 | exact match |
| 12 | `/` | S7 → S8 | exact match |
| 13 | `d` | S8 → S9 | exact match |
| 14 | `i` | S9 → S10 | exact match |
| 15 | `v` | S10 → S11 | exact match |
| 16 | `>` | S11 → ACCEPT | exact match |

End of input, state is ACCEPT → accepted.
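The whole machine translates to Python along the same lines as before. A sketch (function and helper names are mine; the single-character states S0–S4 and S7–S11 collapse into one exact-match table):

```python
def recognize_div(inp):
    """FSM for <div>[a-zA-Z0-9]+</div>, states S0..S11 as in the table above."""
    def is_alnum(ch):
        return 'a' <= ch <= 'z' or 'A' <= ch <= 'Z' or '0' <= ch <= '9'

    # Exact-match transitions: state -> (expected char, next state)
    exact = {
        'S0': ('<', 'S1'),   'S1': ('d', 'S2'),  'S2': ('i', 'S3'),
        'S3': ('v', 'S4'),   'S4': ('>', 'S5'),
        'S7': ('/', 'S8'),   'S8': ('d', 'S9'),  'S9': ('i', 'S10'),
        'S10': ('v', 'S11'), 'S11': ('>', 'ACCEPT'),
    }
    state = 'S0'
    for ch in inp:
        if state in exact:
            expected, nxt = exact[state]
            state = nxt if ch == expected else 'DEAD'
        elif state == 'S5':                    # first alphanumeric
            state = 'S6' if is_alnum(ch) else 'DEAD'
        elif state == 'S6':                    # more alphanumerics or '<'
            if ch == '<':
                state = 'S7'
            elif not is_alnum(ch):
                state = 'DEAD'
        else:                                  # DEAD, or ACCEPT + more input
            state = 'DEAD'
    return state == 'ACCEPT'
```

Run it on the trace input and the accepting path follows the table step for step.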


What's next

At this point our FSM can recognize whether a string like <div>hello</div> is valid — it answers yes or no. But a real tokenizer needs to do more than accept or reject. It needs to emit tokens: structured data that tells the next stage of the parser what was found and where.

That means we need a data structure to hold the tokenizer's output — something that captures the tag names, the text content, and the boundaries between them. In the next post, we'll design that output format and turn our recognizer into a proper tokenizer that produces a stream of tokens.

Stay tuned.
