
Jocer Franquiz

Building a Tokenizer from Scratch


FSM Parser Theory: A Friendly Conversation with Claude Opus


These are real questions I asked Claude Opus to learn the theory behind parsers. The starting point is combinational logic, and the end point is a finite state machine. The answers build on each other incrementally.

Let's walk through the theory behind finite state machines by building one from scratch. And by from scratch I mean no libraries and no classes. Just logic and a minimal set of 10 operators.

At the end, you'll find a working FSM that recognizes a complete <div>text</div> structure and can be extended to handle more patterns.


The automata hierarchy

Q: Automata theory has a class hierarchy, starting from combinational logic. Right?

Yes, that's a clean way to frame it. The hierarchy goes from "no memory" to "more memory":

  1. Combinational logic — output depends only on current input. No memory, no state. Just pure boolean functions (AND, OR, etc.).

  2. Finite State Machine (FSM) — adds memory in the form of state. Output depends on current input + current state. Memory is finite and fixed.

  3. Pushdown Automaton (PDA) — FSM + a stack. The stack gives it unlimited but structured memory (LIFO). This is what recognizes nested structures like matching parentheses.

  4. Turing Machine — FSM + an infinite tape (read/write). The most powerful model; can simulate any computation.

Each level can recognize a strictly larger class of languages (the Chomsky hierarchy):

| Machine | Language class | Example |
| --- | --- | --- |
| Combinational | (none) | Boolean expression |
| FSM | Regular | `/ab*c/` |
| PDA | Context-free | Balanced `{}[]()` |
| Turing | Recursively enumerable | Anything computable |

For parsing HTML, you need at minimum a PDA (because of nested tags), but in practice browsers use a stateful tokenizer (FSM) + a tree builder (stack-based) — which together behave like a PDA.
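To make the FSM/PDA boundary concrete: checking balanced brackets needs a stack, because no fixed set of states can count unboundedly many opens. A minimal sketch in Python (my own illustration, not from the conversation):

```python
def balanced(s):
    """PDA-style check: a stack remembers arbitrarily many open brackets."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)          # push an open bracket
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False          # mismatched or unexpected close
    return not stack                  # accept only if everything was closed
```

An FSM can only track nesting up to some fixed depth; the unbounded stack is exactly what lifts the machine into the context-free class.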


The boolean foundation

Q: Which combinational logic operators do we need to implement a finite state machine? Conceptually.

Conceptually, you only need three:

  • AND — "state is X and input is Y"
  • OR — "condition A or condition B leads to the same result"
  • NOT — negate a condition

These three form a functionally complete set — you can express any boolean function with just AND, OR, NOT.

The core of an FSM is its transition function:

next_state = f(current_state, input)

That function is purely combinational — it's just a truth table. Every row is a conjunction (AND) of conditions, and you OR the rows together to select the next state. Something like:

if state == A AND input == 'x' → go to B
if state == A AND input == 'y' → go to C
if state == B AND input == 'x' → go to A
...

Each line is AND. Multiple lines that share an outcome are OR'd together.

NOT comes in when you need to express "any input that is not this character."

That's the combinational part. The state register (what actually holds the current state between steps) is the sequential part — it's where memory enters the picture.
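That split can be sketched directly in Python, using the hypothetical three-state machine from the pseudocode above (state names A/B/C are illustrative):

```python
def f(state, char):
    """Pure combinational logic: AND/OR/NOT over (state, input). No memory."""
    if state == 'A' and char == 'x':   # AND: state is A and input is x
        return 'B'
    if state == 'A' and char == 'y':
        return 'C'
    if state == 'B' and char == 'x':
        return 'A'
    return 'DEAD'                      # NOT: any combination not listed above

def run(inp):
    state = 'A'                        # the state register: the sequential part
    for ch in inp:
        state = f(state, ch)           # latch f's output on every step
    return state
```

`f` by itself is just a truth table; only `run`'s variable carries anything between steps.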


The seven primitive operators

Q: What else do we need from a computational point of view? Something like inc, if/else, cmp, etc.

Q: Is this list complete? Provide a better list and a simple conceptual example.

The complete set of primitive operations for an FSM:

| Operation | What it does | Why an FSM needs it |
| --- | --- | --- |
| `READ` | Get the current input symbol | The machine must consume input |
| `CMP` | Test equality of two values | Check state and input against known constants |
| `BRANCH` | Conditional jump | Select which transition fires |
| `JMP` | Unconditional jump | Loop back to process the next character |
| `ASSIGN` | Write a value to a register | Update the state register |
| `ADVANCE` | Move the input pointer forward | Step through the input string |
| `HALT` | Stop execution | Accept or reject; terminate the machine |

Seven operations total. No arithmetic, no stack, no heap.

Example: an FSM that accepts strings matching ab*c (an a, then zero or more bs, then c).

States: S0 (start), S1 (got a), ACCEPT, DEAD

    ; convention: BRANCH jumps only when the preceding CMP found a mismatch
LOOP:
    READ   char                ; get current input symbol

    ; ── S0: expecting 'a' ──
    CMP    state, S0
    BRANCH not_s0
    CMP    char, 'a'
    BRANCH s0_fail
    ASSIGN state, S1
    JMP    next

  s0_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s0:
    ; ── S1: expecting 'b' or 'c' ──
    CMP    state, S1
    BRANCH not_s1
    CMP    char, 'b'
    BRANCH s1_not_b
    ASSIGN state, S1           ; stay in S1
    JMP    next

  s1_not_b:
    CMP    char, 'c'
    BRANCH s1_fail
    ASSIGN state, ACCEPT
    JMP    next

  s1_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s1:
    ; ── DEAD or ACCEPT + more input → DEAD ──
    ASSIGN state, DEAD

  next:
    ADVANCE                    ; move to next character
    CMP    input_end, true
    BRANCH LOOP                ; not at end → loop

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH reject
    HALT   accept

  reject:
    HALT   reject

Trace with input "abbc":

| Step | char | state before | transition | state after |
| --- | --- | --- | --- | --- |
| 1 | `a` | S0 | S0 + `a` → S1 | S1 |
| 2 | `b` | S1 | S1 + `b` → S1 | S1 |
| 3 | `b` | S1 | S1 + `b` → S1 | S1 |
| 4 | `c` | S1 | S1 + `c` → ACCEPT | ACCEPT |

End of input, state is ACCEPT → accepted.

Every single line maps to one of the seven primitives — nothing else is needed.


Visualization of the FSM for ab*c

Finite State Machine that recognizes ab*c

Four states, all transitions accounted for — including the error paths to DEAD.


Are seven operators enough for any FSM?

Q: Can we build any FSM with these seven operators?

Yes, the seven operations are sufficient. In fact, they're slightly redundant.

JMP is redundant — it's just BRANCH(always_true, target). And CMP can be absorbed into BRANCH (as "branch-on-symbol-match"). And READ + ADVANCE can be fused into one operation.

So the theoretical minimum is 4 operations:

| Minimal op | Absorbs |
| --- | --- |
| `READ/ADVANCE` | `READ` + `ADVANCE` |
| `BRANCH-on-match` | `CMP` + `BRANCH` + `JMP` |
| `ASSIGN` | (unchanged) |
| `HALT` | (unchanged) |

This matches Wang's B-machine (1957) — a minimal abstract machine with a read-only input tape and finite control that is exactly a DFA. Its instruction set is: {read, branch-on-symbol, halt-accept, halt-reject}.

Key references:

  • Sipser (Introduction to the Theory of Computation, Ch. 1) — treats the transition function δ(state, symbol) → state as a single atomic primitive. The 7 ops are a decomposition of that.
  • Hopcroft/Ullman/Motwani (Intro to Automata Theory, Ch. 2) — same formal treatment.
  • Minsky (Computation: Finite and Infinite Machines, 1967, Ch. 3) — decomposes FSMs into read input, compute next-state (combinational logic), latch new state (memory). The hardware view of exactly these ops.
  • Wang, 1957 — "A Variant to Turing's Theory of Computing Machines", Journal of the ACM 4(1). The closest classical reference to a minimal FSM instruction set. ACM link

Bottom line: The 7 ops are sufficient for any FSM. We keep them separate (rather than using the minimal 4) because they map more clearly to code — each one becomes a distinct, readable operation in the implementation.
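As a sanity check that 4 operations suffice, here is the ab*c machine squeezed into that minimal shape in Python, with branch-on-match collapsed into a single table lookup (a sketch of the idea, not Wang's actual formulation):

```python
# Transition table for a b* c: (state, symbol) -> next state.
TABLE = {
    ('S0', 'a'): 'S1',
    ('S1', 'b'): 'S1',
    ('S1', 'c'): 'ACCEPT',
}

def run_minimal(inp):
    state = 'S0'
    for ch in inp:                              # READ/ADVANCE fused
        state = TABLE.get((state, ch), 'DEAD')  # BRANCH-on-match + ASSIGN
    return state == 'ACCEPT'                    # HALT accept / reject
```

A missing table entry plays the role of the failing branch, so DEAD needs no explicit rows.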


The assembly connection

Q: Those 7 operators are very much like assembly operators.

Exactly. They are assembly-level operations — that's not a coincidence.

An FSM is the computational model that sits right above combinational logic. It's the simplest machine that has state. And assembly is the thinnest software abstraction over hardware. So they naturally meet at the same level.

The mapping is direct:

| FSM op | x86 equivalent |
| --- | --- |
| `READ` | `mov al, [rsi]` (load byte from memory) |
| `CMP` | `cmp al, 'a'` |
| `BRANCH` | `jne label` |
| `JMP` | `jmp label` |
| `ASSIGN` | `mov state, S1` |
| `ADVANCE` | `inc rsi` |
| `HALT` | `ret` / `syscall` exit |

Hand-written FSMs in assembly (like in high-performance lexers) look almost identical to our pseudocode. The abstraction gap is essentially zero.

This is also why FSMs are so fast in practice — every operation maps 1:1 to a single CPU instruction. No function calls, no memory allocation, no indirection. Just compare, branch, assign, repeat.

When we move to a higher-level language like Python, we'll express these same 7 operations using if/elif, ==, variable assignment, and a for loop — but conceptually, it's the same assembly underneath.
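Here is that Python translation for the ab*c machine, a sketch in which each construct maps back to one of the 7 ops:

```python
def accepts_abc(inp):
    """FSM for a b* c, written with only the 7 primitive operations."""
    state = 'S0'                      # ASSIGN
    for char in inp:                  # READ + ADVANCE + the loop-back JMP
        if state == 'S0':             # CMP + BRANCH
            state = 'S1' if char == 'a' else 'DEAD'
        elif state == 'S1':
            if char == 'b':
                state = 'S1'          # stay in S1
            elif char == 'c':
                state = 'ACCEPT'
            else:
                state = 'DEAD'
        else:                         # DEAD, or ACCEPT with input left over
            state = 'DEAD'
    return state == 'ACCEPT'          # HALT accept / reject
```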


Expanding the operators for real HTML

Q: Can we implement an FSM that recognizes <div>text</div> where text is alphanumeric? We need range comparisons (<=, >=) and different branch types.

Yes. To handle character ranges like [a-z], [A-Z], [0-9], we need to expand CMP and split BRANCH into variants.

Updated operator set (10 operations):

| Operation | What it does | Why an FSM needs it | x86 equivalent |
| --- | --- | --- | --- |
| `READ` | Get current input symbol | Consume input | `mov al, [rsi]` |
| `CMP` | Compare two values, set flags | Every transition needs a comparison | `cmp al, 'a'` |
| `BRANCH_EQ` | Jump if equal | Exact character match | `je label` |
| `BRANCH_NE` | Jump if not equal | Skip when no match | `jne label` |
| `BRANCH_LT` | Jump if less than | Lower bound of range check | `jb label` |
| `BRANCH_GT` | Jump if greater than | Upper bound of range check | `ja label` |
| `JMP` | Unconditional jump | Loop, skip blocks | `jmp label` |
| `ASSIGN` | Write value to register | Update state | `mov reg, val` |
| `ADVANCE` | Move input pointer forward | Step through input | `inc rsi` |
| `HALT` | Stop execution | Accept or reject | `ret` / `syscall` |

We split the old BRANCH into four variants (EQ, NE, LT, GT). CMP now sets flags rather than returning a boolean — exactly like x86 cmp. Range checks like [a-z] become:

CMP       char, 'a'
BRANCH_LT fail        ; char < 'a' → not lowercase
CMP       char, 'z'
BRANCH_GT fail        ; char > 'z' → not lowercase
; if we reach here, 'a' <= char <= 'z'
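In Python, each two-CMP range check collapses into a chained comparison. A sketch of the alphanumeric test the FSM performs (the helper name is mine):

```python
def is_alnum_char(ch):
    """Range checks exactly as the FSM does them: two comparisons per range."""
    return ('a' <= ch <= 'z'        # lowercase: the BRANCH_LT / BRANCH_GT pair
            or 'A' <= ch <= 'Z'     # uppercase
            or '0' <= ch <= '9')    # digits
```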

FSM for <div>[a-zA-Z0-9]+</div>

States and what has been consumed:

| State | Consumed so far | Expects next |
| --- | --- | --- |
| S0 (start) | (nothing) | `<` |
| S1 | `<` | `d` |
| S2 | `<d` | `i` |
| S3 | `<di` | `v` |
| S4 | `<div` | `>` |
| S5 | `<div>` | first alphanumeric |
| S6 | `<div>text` | more alphanumeric or `<` |
| S7 | `<div>text<` | `/` |
| S8 | `<div>text</` | `d` |
| S9 | `<div>text</d` | `i` |
| S10 | `<div>text</di` | `v` |
| S11 | `<div>text</div` | `>` |
| ACCEPT | `<div>text</div>` | (end of input) |
| DEAD (error) | | (absorbs remaining input) |

Finite State Machine that recognizes <div>

Pseudocode using the 10 operators:

    ASSIGN state, S0

LOOP:
    ; ── DEAD: absorb remaining input ──
    CMP    state, DEAD
    BRANCH_EQ next

    ; ── ACCEPT + more input → error ──
    CMP    state, ACCEPT
    BRANCH_EQ to_dead

    READ   char

    ; ── S0: expecting '<' ──
    CMP    state, S0
    BRANCH_NE not_s0
    CMP    char, '<'
    BRANCH_NE to_dead
    ASSIGN state, S1
    JMP    next

  not_s0:
    ; ── S1: expecting 'd' ──
    CMP    state, S1
    BRANCH_NE not_s1
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S2
    JMP    next

  not_s1:
    ; ── S2: expecting 'i' ──
    CMP    state, S2
    BRANCH_NE not_s2
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S3
    JMP    next

  not_s2:
    ; ── S3: expecting 'v' ──
    CMP    state, S3
    BRANCH_NE not_s3
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S4
    JMP    next

  not_s3:
    ; ── S4: expecting '>' ──
    CMP    state, S4
    BRANCH_NE not_s4
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, S5
    JMP    next

  not_s4:
    ; ── S5: expecting first alphanumeric ──
    CMP    state, S5
    BRANCH_NE not_s5
    JMP    check_alnum_s6     ; reuse alnum check, go to S6

  not_s5:
    ; ── S6: more alphanumeric or '<' ──
    CMP    state, S6
    BRANCH_NE not_s6
    CMP    char, '<'
    BRANCH_NE s6_alnum
    ASSIGN state, S7
    JMP    next
  s6_alnum:
    JMP    check_alnum_stay   ; stay in S6

  not_s6:
    ; ── S7: expecting '/' ──
    CMP    state, S7
    BRANCH_NE not_s7
    CMP    char, '/'
    BRANCH_NE to_dead
    ASSIGN state, S8
    JMP    next

  not_s7:
    ; ── S8: expecting 'd' ──
    CMP    state, S8
    BRANCH_NE not_s8
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S9
    JMP    next

  not_s8:
    ; ── S9: expecting 'i' ──
    CMP    state, S9
    BRANCH_NE not_s9
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S10
    JMP    next

  not_s9:
    ; ── S10: expecting 'v' ──
    CMP    state, S10
    BRANCH_NE not_s10
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S11
    JMP    next

  not_s10:
    ; ── S11: expecting '>' ──
    CMP    state, S11
    BRANCH_NE to_dead
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, ACCEPT
    JMP    next

    ; ── alphanumeric check → transition to S6 ──
  check_alnum_s6:
    CMP    char, 'a'
    BRANCH_LT upper_s6
    CMP    char, 'z'
    BRANCH_GT upper_s6
    ASSIGN state, S6
    JMP    next
  upper_s6:
    CMP    char, 'A'
    BRANCH_LT digit_s6
    CMP    char, 'Z'
    BRANCH_GT digit_s6
    ASSIGN state, S6
    JMP    next
  digit_s6:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    ASSIGN state, S6
    JMP    next

    ; ── alphanumeric check → stay in S6 ──
  check_alnum_stay:
    CMP    char, 'a'
    BRANCH_LT upper_stay
    CMP    char, 'z'
    BRANCH_GT upper_stay
    JMP    next
  upper_stay:
    CMP    char, 'A'
    BRANCH_LT digit_stay
    CMP    char, 'Z'
    BRANCH_GT digit_stay
    JMP    next
  digit_stay:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    JMP    next

  to_dead:
    ASSIGN state, DEAD

  next:
    ADVANCE
    CMP    input_end, true
    BRANCH_NE LOOP

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH_NE reject
    HALT   accept

  reject:
    HALT   reject

Trace with input <div>hello</div>:

| Step | char | state transition | match |
| --- | --- | --- | --- |
| 1 | `<` | S0 → S1 | exact match |
| 2 | `d` | S1 → S2 | exact match |
| 3 | `i` | S2 → S3 | exact match |
| 4 | `v` | S3 → S4 | exact match |
| 5 | `>` | S4 → S5 | exact match |
| 6 | `h` | S5 → S6 | range: `'a' <= 'h' <= 'z'` |
| 7 | `e` | S6 → S6 | range: `'a' <= 'e' <= 'z'` |
| 8 | `l` | S6 → S6 | range: `'a' <= 'l' <= 'z'` |
| 9 | `l` | S6 → S6 | range: `'a' <= 'l' <= 'z'` |
| 10 | `o` | S6 → S6 | range: `'a' <= 'o' <= 'z'` |
| 11 | `<` | S6 → S7 | exact match |
| 12 | `/` | S7 → S8 | exact match |
| 13 | `d` | S8 → S9 | exact match |
| 14 | `i` | S9 → S10 | exact match |
| 15 | `v` | S10 → S11 | exact match |
| 16 | `>` | S11 → ACCEPT | exact match |

End of input, state is ACCEPT → accepted.
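The whole machine translates to Python along the same lines as before. A sketch (function and helper names are mine; the single-character states S0–S4 and S7–S11 collapse into one exact-match table):

```python
def recognize_div(inp):
    """FSM for <div>[a-zA-Z0-9]+</div>, states S0..S11 as in the table above."""
    def is_alnum(ch):
        return 'a' <= ch <= 'z' or 'A' <= ch <= 'Z' or '0' <= ch <= '9'

    # Exact-match transitions: state -> (expected char, next state)
    exact = {
        'S0': ('<', 'S1'),   'S1': ('d', 'S2'),  'S2': ('i', 'S3'),
        'S3': ('v', 'S4'),   'S4': ('>', 'S5'),
        'S7': ('/', 'S8'),   'S8': ('d', 'S9'),  'S9': ('i', 'S10'),
        'S10': ('v', 'S11'), 'S11': ('>', 'ACCEPT'),
    }
    state = 'S0'
    for ch in inp:
        if state in exact:
            expected, nxt = exact[state]
            state = nxt if ch == expected else 'DEAD'
        elif state == 'S5':                    # first alphanumeric
            state = 'S6' if is_alnum(ch) else 'DEAD'
        elif state == 'S6':                    # more alphanumerics or '<'
            if ch == '<':
                state = 'S7'
            elif not is_alnum(ch):
                state = 'DEAD'
        else:                                  # DEAD, or ACCEPT + more input
            state = 'DEAD'
    return state == 'ACCEPT'
```

Run it on the trace input and the accepting path follows the table step for step.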


What's next

At this point our FSM can recognize whether a string like <div>hello</div> is valid — it answers yes or no. But a real tokenizer needs to do more than accept or reject. It needs to emit tokens: structured data that tells the next stage of the parser what was found and where.

That means we need a data structure to hold the tokenizer's output — something that captures the tag names, the text content, and the boundaries between them. In the next post, we'll design that output format and turn our recognizer into a proper tokenizer that produces a stream of tokens.

Stay tuned.
