DEV Community

Cover image for Finite State Machine from first principles [part 1]
Jocer Franquiz
Jocer Franquiz

Posted on • Edited on

Finite State Machine from first principles [part 1]

From Combinational Logic to FSM: Q/A

The starter point is Combinational Logic. The idea is to understand what do we really need to add to CL to create a Finite State Machine.

At the end, we ended up building a working FSM that can recognize a complete <div>text</div> structure, using just 7 primitives. Funny enough, those primitives are equivalent to 7 Assembly opcodes (not a coincidence, this was all planned by the Elders).


Q: Automata theory has a class hierarchy, starting from combinational logic. Right?

Yes, that's a clean way to frame it. The hierarchy goes from "no memory" to "more memory":

  1. Combinational logic — output depends only on current input. No memory, no state. Just pure boolean functions (AND, OR, etc.).

  2. Finite State Machine (FSM) — adds memory in the form of state. Output depends on current input + current state. Memory is finite and fixed.

  3. Pushdown Automaton (PDA) — FSM + a stack. The stack gives it unlimited but structured memory (LIFO). This is what recognizes nested structures like matching parentheses.

  4. Turing Machine — FSM + an infinite tape (read/write). The most powerful model; can simulate any computation.

Each level can recognize a strictly larger class of languages (the Chomsky hierarchy):

Machine Language class Example
Combinational Boolean expression
FSM Regular /ab*c/
PDA Context-free Balanced {}[]()
Turing Recursively enumerable Anything computable

For parsing HTML, you need at minimum a PDA (because of nested tags), but in practice browsers use a stateful tokenizer (FSM) + a tree builder (stack-based) — which together behave like a PDA.


Q: Which combinational logic operators do we need to implement a finite state machine? Conceptually.

Conceptually, you only need three:

  • AND — "state is X and input is Y"
  • OR — "condition A or condition B leads to the same result"
  • NOT — negate a condition

These three form a functionally complete set — you can express any boolean function with just AND, OR, NOT.

The core of an FSM is its transition function:

next_state = f(current_state, input)
Enter fullscreen mode Exit fullscreen mode

That function is purely combinational — it's just a truth table. Every row is a conjunction (AND) of conditions, and you OR the rows together to select the next state. Something like:

if state == A AND input == 'x' → go to B
if state == A AND input == 'y' → go to C
if state == B AND input == 'x' → go to A
...
Enter fullscreen mode Exit fullscreen mode

Each line is AND. Multiple lines that share an outcome are OR'd together.

NOT comes in when you need to express "any input that is not this character."

That's the combinational part. The state register (what actually holds the current state between steps) is the sequential part — it's where memory enters the picture.


Q: What else do we need from a computational point of view? Something like inc, if/else, cmp, etc.

Here is the complete set of primitive operations for an FSM:

Operation What it does Why an FSM needs it
READ Get the current input symbol The machine must consume input
CMP Test equality of two values Check state and input against known constants
BRANCH Conditional jump Select which transition fires
JMP Unconditional jump Loop back to process next character
ASSIGN Write a value to a register Update the state register
ADVANCE Move input pointer forward Step through the input string
HALT Stop execution Accept or reject; terminate the machine

Seven operations total. No arithmetic, no stack, no heap.

Example: an FSM that accepts strings matching ab*c (an a, then zero or more bs, then c).

States: S0 (start), S1 (got a), ACCEPT, DEAD

LOOP:
    READ   char                ; get current input symbol

    ; ── S0: expecting 'a' ──
    CMP    state, S0
    BRANCH not_s0
    CMP    char, 'a'
    BRANCH s0_fail
    ASSIGN state, S1
    JMP    next

  s0_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s0:
    ; ── S1: expecting 'b' or 'c' ──
    CMP    state, S1
    BRANCH not_s1
    CMP    char, 'b'
    BRANCH s1_not_b
    ASSIGN state, S1           ; stay in S1
    JMP    next

  s1_not_b:
    CMP    char, 'c'
    BRANCH s1_fail
    ASSIGN state, ACCEPT
    JMP    next

  s1_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s1:
    ; ── DEAD or ACCEPT + more input → DEAD ──
    ASSIGN state, DEAD

  next:
    ADVANCE                    ; move to next character
    CMP    input_end, true
    BRANCH LOOP                ; not at end → loop

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH reject
    HALT   accept

  reject:
    HALT   reject
Enter fullscreen mode Exit fullscreen mode

Trace with input "abbc":

Step char state before transition state after
1 a S0 S0 + a → S1 S1
2 b S1 S1 + b → S1 S1
3 b S1 S1 + b → S1 S1
4 c S1 S1 + c → ACCEPT ACCEPT

Here is a graph representation

Finite State Machine that recognizes ab*c

Four states, all transitions accounted for, including the error paths to DEAD.

End of input, state is ACCEPT → accepted.

Every single line maps to one of the seven primitives. Nothing else is needed.


Q: How do we know these seven operators are enough for any FSM? Fact check this!

Yes, the seven operations are sufficient. In fact, they're slightly redundant.

JMP is redundant — it's just BRANCH(always_true, target). And CMP can be absorbed into BRANCH (as "branch-on-symbol-match"). And READ + ADVANCE can be fused into one operation.

So the theoretical minimum is 4 operations:

Minimal op Absorbs
READ/ADVANCE READ + ADVANCE
BRANCH-on-match CMP + BRANCH + JMP
ASSIGN
HALT

This matches Wang's B-machine (1957) — a minimal abstract machine with a read-only input tape and finite control that is exactly a DFA. Its instruction set is: {read, branch-on-symbol, halt-accept, halt-reject}.

Key references:

  • Sipser (Introduction to the Theory of Computation, Ch. 1) — treats the transition function δ(state, symbol) → state as a single atomic primitive. The 7 ops are a decomposition of that.
  • Hopcroft/Ullman/Motwani (Intro to Automata Theory, Ch. 2) — same formal treatment.
  • Minsky (Computation: Finite and Infinite Machines, 1967, Ch. 3) — decomposes FSMs into read input, compute next-state (combinational logic), latch new state (memory). The hardware view of exactly these ops.
  • Wang, 1957 — "A Variant to Turing's Theory of Computing Machines", Journal of the ACM 4(1). The closest classical reference to a minimal FSM instruction set. ACM link

Bottom line: The 7 ops are sufficient for any FSM. We keep them separate (rather than using the minimal 4) because they map more clearly to code — each one becomes a distinct, readable operation in the implementation.


Q: Why those 7 operators are very much like Assembly operators?

They are assembly-level operations. That's not a coincidence.

An FSM is the computational model that sits right above combinational logic. It's the simplest machine that has state. And assembly is the thinnest software abstraction over hardware. So they naturally meet at the same level.

The mapping is direct:

FSM op x86 equivalent
READ mov al, [rsi] (load byte from memory)
CMP cmp al, 'a'
BRANCH jne label
JMP jmp label
ASSIGN mov state, S1
ADVANCE inc rsi
HALT ret / syscall exit

Hand-written FSMs in assembly (like in high-performance lexers) look almost identical to our pseudocode. The abstraction gap is essentially zero.

This is also why FSMs are so fast in practice — every operation maps 1:1 to a single CPU instruction. No function calls, no memory allocation, no indirection. Just compare, branch, assign, repeat.

When we move to a higher-level language like Python, we'll express these same 7 operations using if/elif, ==, variable assignment, and a for loop — but conceptually, it's the same assembly underneath.


Q: Can we implement an FSM that recognizes <div>text</div> where text is alphanumeric?

Yes. To handle character ranges like [a-z], [A-Z], [0-9], we need to expand CMP and split BRANCH into variants.

Updated operator set (10 operations):

Operation What it does Why an FSM needs it x86 equivalent
READ Get current input symbol Consume input mov al, [rsi]
CMP Compare two values, set flags Every transition needs comparison cmp al, 'a'
BRANCH_EQ Jump if equal Exact character match je label
BRANCH_NE Jump if not equal Skip when no match jne label
BRANCH_LT Jump if less than Lower bound of range check jb label
BRANCH_GT Jump if greater than Upper bound of range check ja label
JMP Unconditional jump Loop, skip blocks jmp label
ASSIGN Write value to register Update state mov reg, val
ADVANCE Move input pointer forward Step through input inc rsi
HALT Stop execution Accept or reject ret / syscall

We split the old BRANCH into four variants (EQ, NE, LT, GT). CMP now sets flags rather than returning a boolean — exactly like x86 cmp. Range checks like [a-z] become:

CMP       char, 'a'
BRANCH_LT fail        ; char < 'a' → not lowercase
CMP       char, 'z'
BRANCH_GT fail        ; char > 'z' → not lowercase
; if we reach here, 'a' <= char <= 'z'
Enter fullscreen mode Exit fullscreen mode

FSM for <div>[a-zA-Z0-9]+</div>

States and what has been consumed:

State Consumed so far Expects next
S0 (start) <
S1 < d
S2 <d i
S3 <di v
S4 <div >
S5 <div> first alphanumeric
S6 <div>text more alphanumeric or <
S7 <div>text< /
S8 <div>text</ d
S9 <div>text</d i
S10 <div>text</di v
S11 <div>text</div >
ACCEPT <div>text</div> (end of input)
DEAD (error) (absorb remaining)

Finite State Machine that recognizes <div>

Pseudocode using the 10 operators:

    ASSIGN state, S0

LOOP:
    ; ── DEAD: absorb remaining input ──
    CMP    state, DEAD
    BRANCH_EQ next

    ; ── ACCEPT + more input → error ──
    CMP    state, ACCEPT
    BRANCH_EQ to_dead

    READ   char

    ; ── S0: expecting '<' ──
    CMP    state, S0
    BRANCH_NE not_s0
    CMP    char, '<'
    BRANCH_NE to_dead
    ASSIGN state, S1
    JMP    next

  not_s0:
    ; ── S1: expecting 'd' ──
    CMP    state, S1
    BRANCH_NE not_s1
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S2
    JMP    next

  not_s1:
    ; ── S2: expecting 'i' ──
    CMP    state, S2
    BRANCH_NE not_s2
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S3
    JMP    next

  not_s2:
    ; ── S3: expecting 'v' ──
    CMP    state, S3
    BRANCH_NE not_s3
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S4
    JMP    next

  not_s3:
    ; ── S4: expecting '>' ──
    CMP    state, S4
    BRANCH_NE not_s4
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, S5
    JMP    next

  not_s4:
    ; ── S5: expecting first alphanumeric ──
    CMP    state, S5
    BRANCH_NE not_s5
    JMP    check_alnum_s6     ; reuse alnum check, go to S6

  not_s5:
    ; ── S6: more alphanumeric or '<' ──
    CMP    state, S6
    BRANCH_NE not_s6
    CMP    char, '<'
    BRANCH_NE s6_alnum
    ASSIGN state, S7
    JMP    next
  s6_alnum:
    JMP    check_alnum_stay   ; stay in S6

  not_s6:
    ; ── S7: expecting '/' ──
    CMP    state, S7
    BRANCH_NE not_s7
    CMP    char, '/'
    BRANCH_NE to_dead
    ASSIGN state, S8
    JMP    next

  not_s7:
    ; ── S8: expecting 'd' ──
    CMP    state, S8
    BRANCH_NE not_s8
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S9
    JMP    next

  not_s8:
    ; ── S9: expecting 'i' ──
    CMP    state, S9
    BRANCH_NE not_s9
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S10
    JMP    next

  not_s9:
    ; ── S10: expecting 'v' ──
    CMP    state, S10
    BRANCH_NE not_s10
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S11
    JMP    next

  not_s10:
    ; ── S11: expecting '>' ──
    CMP    state, S11
    BRANCH_NE to_dead
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, ACCEPT
    JMP    next

    ; ── alphanumeric check → transition to S6 ──
  check_alnum_s6:
    CMP    char, 'a'
    BRANCH_LT upper_s6
    CMP    char, 'z'
    BRANCH_GT upper_s6
    ASSIGN state, S6
    JMP    next
  upper_s6:
    CMP    char, 'A'
    BRANCH_LT digit_s6
    CMP    char, 'Z'
    BRANCH_GT digit_s6
    ASSIGN state, S6
    JMP    next
  digit_s6:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    ASSIGN state, S6
    JMP    next

    ; ── alphanumeric check → stay in S6 ──
  check_alnum_stay:
    CMP    char, 'a'
    BRANCH_LT upper_stay
    CMP    char, 'z'
    BRANCH_GT upper_stay
    JMP    next
  upper_stay:
    CMP    char, 'A'
    BRANCH_LT digit_stay
    CMP    char, 'Z'
    BRANCH_GT digit_stay
    JMP    next
  digit_stay:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    JMP    next

  to_dead:
    ASSIGN state, DEAD

  next:
    ADVANCE
    CMP    input_end, true
    BRANCH_NE LOOP

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH_NE reject
    HALT   accept

  reject:
    HALT   reject
Enter fullscreen mode Exit fullscreen mode

Trace with input <div>hello</div>:

Step char state transition
1 < S0 → S1 exact match
2 d S1 → S2 exact match
3 i S2 → S3 exact match
4 v S3 → S4 exact match
5 > S4 → S5 exact match
6 h S5 → S6 range: 'a''h''z'
7 e S6 → S6 range: 'a''e''z'
8 l S6 → S6 range: 'a''l''z'
9 l S6 → S6 range: 'a''l''z'
10 o S6 → S6 range: 'a''o''z'
11 < S6 → S7 exact match
12 / S7 → S8 exact match
13 d S8 → S9 exact match
14 i S9 → S10 exact match
15 v S10 → S11 exact match
16 > S11 → ACCEPT exact match

End of input, state is ACCEPT → accepted.


What's next

At this point our FSM can recognize whether a string like <div>hello</div> is valid, but it can only answer yes or no. A real tokenizer needs to do more than accept or reject. It needs to emit tokens: structured data that tells the next stage of the parser what was found and where.

That means we need a data structure to hold the tokenizer's output — something that captures the tag names, the text content, and the boundaries between them. In the next post, we'll design that output format and turn our recognizer into a proper tokenizer that produces a stream of tokens.

Stay tuned...

Top comments (0)