DEV Community: Jocer Franquiz

Knowing When Your LLM Is Wrong: A Field Guide for Agentic Systems

Jocer Franquiz — Mon, 11 May 2026 17:12:24 +0000

I see people increasingly delegating operational decisions to LLM agents. But these agents are not deterministic: every decision they make is drawn from a probability distribution, and some of those decisions will be wrong. Think about it.

The mechanism that decides whether your agent picks the right route is probabilistic (formally, a stochastic process). Right or wrong depends on the odds. Whether an LLM agent answers the question correctly, calls the right tool, escalates the right ticket, refunds the right order, or asks for clarification is decided by hidden probabilities. If you can't tell, automatically and at scale, when your agent is wrong, you can't improve it, you can't trust it, and you can't ship it.

So, how can we be sure that even a very simple agent made the right decision? Well, we can't! The answer is not binary. It's affected by many factors, internal and external: the prompt may not carry the right context, the model may be too small, the output format (structured vs unstructured) may constrain it, or the vendor may silently change the model underneath you.

Now, there is some good news. There's a clean conceptual framework underneath all of this, borrowed from Decision Theory and Reinforcement Learning, that turns "is the LLM right or wrong?" from a vague worry into a set of measurable, improvable engineering problems.

This post walks through the framework end-to-end, using the routing agent as a running example. A companion script llm_correctness_demo.py implements every numbered step below as runnable Python, so you can poke at it as you read.

1. What "correct" actually means

Before measuring anything, we have to be precise about what we're measuring. Let's focus on a simple use case: an agent receives the message "How to restart a failed batch job in our internal data pipeline?" and must do one simple thing: decide whether to execute task_1 (answer from knowledge/memory) or task_2 (search the web). For each input there's a correct route, and the agent can miss it in two ways:

It picks task_1 and skips the web search (the cheap, expected behavior), but the answer is wrong.
It picks task_2 when no search was needed, and you lose tokens (and money) calling the browser tool.

When we ask "is the answer correct?", we're really asking whether the answer matches a ground truth. Ground truth has two ingredients:

The intended meaning of the input: what did the user actually ask?
The fact of the matter given that meaning: what is true in the world?

For the routing agent, ingredient #2 is the part that looks like classification: given a disambiguated input, is task_1 or task_2 the right route? But ingredient #1 lurks underneath every real-world application.

It's worth separating correctness categories up front, because the techniques for catching each are different:

Factual error: the model states something false about a fixed-meaning input.
Hallucination: the model invents a citation, a fact, or a referent.
Reasoning error: the steps don't compose into the conclusion.
Instruction-following error: the model ignored constraints in the prompt.
Ambiguity error: the model committed to one interpretation when it should have asked, or chose the less likely interpretation.

Routing agents are most often hit by the last category, but a production system has to cope with all five.

2. Measuring the error rate

Once "correct" is defined, the rest is bookkeeping. Done well.

The practical recipe

We collect a set of representative inputs. For each, a human (or a trusted process) writes down the correct route. Call this the gold set. We run the agent on every input and compare its choice to the gold label.

error_rate = (# disagreements) / (total examples)

A few details that matter more than they sound:

The gold set must reflect production traffic. If 70% of real user messages are the kind that should go to task_1, your gold set should look the same. Sample from real logs whenever you can. A gold set built from intuitions about what users might ask will systematically miss the cases that hurt you.

Size determines what you can detect. With 100 examples and an observed 8% error rate, the 95% confidence interval is roughly ±5 percentage points, meaning we literally can't tell "8% error" apart from "13% error" at that sample size. With 1,000 examples it tightens to about ±1.7 points. Always report a confidence interval (Wilson or bootstrap), never a bare point estimate.
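As a minimal sketch of the interval arithmetic above, here is a Wilson score interval in plain Python. The function name and the example counts are illustrative choices, not taken from the companion script.

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = errors / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical gold-set results: 8 disagreements out of 100, then 80 out of 1,000.
for errors, n in [(8, 100), (80, 1000)]:
    low, high = wilson_interval(errors, n)
    print(f"n={n}: error rate {errors / n:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

Unlike the symmetric ± approximation, the Wilson interval is not centered on the observed rate, which matters most for small samples and rates near zero.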

Error rate hides structure. For binary routing, decompose it into a 2×2 confusion matrix: the gold route (task_1 or task_2) on one axis, the route the agent chose on the other.

We usually care about which kind of error is happening. Routing a query that needed web search to the knowledge-only task might cost you a wrong answer. Routing a knowledge query to web search might cost you latency and money. Which mistake is more expensive depends on the product, and that asymmetry will drive every later decision about thresholds and calibration.
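One way to get those four cells, sketched under the assumption that gold labels and agent choices are stored as parallel lists of route names (the data below is invented for illustration):

```python
from collections import Counter

ROUTES = ("task_1", "task_2")

def confusion_matrix(gold: list[str], predicted: list[str]) -> dict[tuple[str, str], int]:
    """Count (gold route, chosen route) pairs for a binary router."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = Counter(zip(gold, predicted))
    return {(g, p): counts.get((g, p), 0) for g in ROUTES for p in ROUTES}

# Hypothetical labels, for illustration only.
gold      = ["task_1", "task_1", "task_2", "task_2", "task_1", "task_2"]
predicted = ["task_1", "task_2", "task_2", "task_1", "task_1", "task_2"]

cm = confusion_matrix(gold, predicted)
needless_search = cm[("task_1", "task_2")]   # paid for a web search that was not needed
missed_search   = cm[("task_2", "task_1")]   # answered from memory when search was needed
error_rate = (needless_search + missed_search) / len(gold)
print(cm, f"error_rate={error_rate:.2f}")
```

Keeping the two off-diagonal cells as separate named quantities makes it easy to attach a different cost to each later.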

The theoretical floor

Some inputs are genuinely ambiguous; even a perfect oracle would split. This is the Bayes error rate: the irreducible error present in the input distribution itself. We estimate it by having multiple humans label the same gold set; their disagreement rate approximates the lowest error rate any agent can be expected to reach. If three humans disagree on 4% of examples, expecting our agent to get below 4% is fantasy. Knowing this number keeps optimization grounded.

3. How to calibrate a black box

Suppose we've measured 8% error and want to do better. The path forward usually isn't "make the LLM smarter." It's "make the agent know when it's unsure, and act on that."

That's calibration. And most teams are calibrating a black box: a commercial LLM from a frontier lab, accessed through an API with no access to weights, training, or sometimes even logits. The classical calibration toolkit doesn't fully apply. Here's what does.

Step 1: extract a confidence signal

You have four options, in roughly increasing reliability and cost:

A. Ask the model. Prompt it for both a decision and a self-reported probability:

Reply in JSON: {"choice": 1 or 2, "confidence": 0.0 to 1.0}

LLMs are typically overconfident when asked this way, often reporting probabilities well above their empirical accuracy. Useful as a raw signal, but not to be trusted without correction.

B. Use token log-probabilities, if exposed. Constrain the model to answer with a single token (1 or 2) and read off P(token="1") and P(token="2") directly. Much more reliable than asking the model to introspect, because it reflects the model's actual next-token distribution rather than its meta-cognition about that distribution. The problem is, many labs don't give us access to the probabilities.

C. Sample multiple times and count. Run the same prompt N times with temperature > 0. If 8 out of 10 samples say 1, your empirical confidence is 0.8. Often called self-consistency. Expensive (N× cost), works on any API, produces a more stable signal than a single call.

D. Ensemble across prompts or models. Ask the same question with three rephrasings, or across Claude + Gemini + DeepSeek, and aggregate. Disagreement equals uncertainty. The most expensive option, often the best signal. Disagreement across models is a strong predictor of "this case is hard."

For most production routing agents, B if available, otherwise C is the sweet spot.
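Options B and C can be sketched in a few lines. This assumes the API returns a log-probability for each of the two answer tokens (option B) and that `sample_choice` is any zero-argument callable wrapping one stochastic call to the agent (option C); both names are hypothetical, and the canned answers stand in for real API calls.

```python
import math
from collections import Counter
from typing import Callable

def confidence_from_logprobs(logprob_1: float, logprob_2: float) -> float:
    """Option B: renormalize the two answer-token log-probs into P(choice = 1)."""
    m = max(logprob_1, logprob_2)            # subtract the max for numerical stability
    e1, e2 = math.exp(logprob_1 - m), math.exp(logprob_2 - m)
    return e1 / (e1 + e2)

def self_consistency(sample_choice: Callable[[], int], n: int = 10) -> tuple[int, float]:
    """Option C: call the agent n times; return the majority choice and its vote fraction."""
    votes = Counter(sample_choice() for _ in range(n))
    choice, count = votes.most_common(1)[0]
    return choice, count / n

# Stand-in for a real API call at temperature > 0: a canned sequence of answers.
canned = iter([1, 1, 2, 1, 1, 1, 2, 1, 1, 1])
choice, confidence = self_consistency(lambda: next(canned), n=10)
print(choice, confidence)                                # 1 0.8
print(round(confidence_from_logprobs(-0.22, -1.61), 2))  # 0.8
```

Either function produces a raw score in [0, 1]; neither is calibrated yet.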

Step 2: correct the signal

Once you have a raw confidence score s ∈ [0,1], you fit a function f(s) → p that maps it to an actual probability. You learn f on a held-out calibration set where you know the true labels.

Temperature scaling: if we have token logits, divide them by a learned scalar T before softmax. One parameter, works with 100-200 calibration examples. T > 1 softens overconfident predictions; T < 1 sharpens underconfident ones.
Platt scaling: fit a logistic regression mapping raw score to true probability. Two parameters. Works on any confidence signal, including self-reported probabilities and self-consistency vote fractions.
Isotonic regression: non-parametric, learns an arbitrary monotonic mapping. More flexible than Platt, needs 1,000+ examples to avoid overfitting.
Histogram binning: bin predictions by raw confidence, record empirical accuracy per bin. Crude but interpretable, and a useful diagnostic even if you don't deploy it.

Start with Platt scaling on top of self-consistency votes. Cheap, simple, and usually moves the needle a lot. Reach for isotonic only when you have the data and Platt isn't expressive enough.
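A dependency-free sketch of Platt scaling follows. It fits p = sigmoid(a·s + b) by plain gradient descent rather than a library solver, and it assumes the raw score s is the vote fraction for task_2 and the label is 1 when the gold route is task_2. The calibration data is invented; in practice you would more likely use a library logistic regression.

```python
import math

def fit_platt(scores: list[float], labels: list[int],
              lr: float = 0.5, steps: int = 5000) -> tuple[float, float]:
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability using the fitted parameters."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Hypothetical held-out set: (raw score for task_2, # examples, # where gold is task_2).
buckets = [(0.9, 20, 15), (0.6, 20, 11), (0.3, 20, 8)]
scores, labels = [], []
for s, total, positives in buckets:
    scores += [s] * total
    labels += [1] * positives + [0] * (total - positives)

a, b = fit_platt(scores, labels)
for s, _, _ in buckets:
    print(f"raw {s:.1f} -> calibrated {calibrate(s, a, b):.2f}")
```

With this toy data the raw score of 0.9 is pulled down toward the observed 75% rate, which is the overconfidence correction the text describes.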

Step 3: pick a threshold (and an abstention zone)

Calibrated probabilities make threshold tuning meaningful. If errors are symmetric, cut at 0.5. If routing wrongly to task_2 costs 5× more than the reverse, shift the threshold accordingly.

Better still: introduce an abstention zone. If 0.4 < p < 0.6, escalate to a human, ask a clarifying question, or run both tasks and reconcile. Calibrated probabilities are what make abstention zones trustworthy. Without calibration, "0.4 < p < 0.6" doesn't carve out the genuinely uncertain cases, just the cases where the model happens to output middling numbers.
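To make "shift the threshold accordingly" concrete: if p is the calibrated probability that task_2 is the right route, choosing task_2 has lower expected cost only when p exceeds c₂ / (c₂ + c₁), where c₂ is the cost of wrongly choosing task_2 and c₁ the cost of wrongly choosing task_1. The sketch below assumes that definition of p and a symmetric abstention margin; the function names are illustrative.

```python
def cost_threshold(cost_wrong_task_2: float, cost_wrong_task_1: float) -> float:
    """Probability of 'task_2 is correct' above which choosing task_2 is cheaper.

    Choosing task_2 costs (1 - p) * cost_wrong_task_2 in expectation,
    choosing task_1 costs p * cost_wrong_task_1; the two are equal at the threshold.
    """
    return cost_wrong_task_2 / (cost_wrong_task_2 + cost_wrong_task_1)

def route(p_task_2: float, threshold: float = 0.5, margin: float = 0.1) -> str:
    """Route on a calibrated probability, abstaining within `margin` of the threshold."""
    if abs(p_task_2 - threshold) < margin:
        return "abstain"          # escalate, ask a clarifying question, or run both
    return "task_2" if p_task_2 > threshold else "task_1"

print(cost_threshold(1, 1))                          # 0.5
print(round(cost_threshold(5, 1), 3))                # 0.833
print(route(0.55))                                   # abstain
print(route(0.70))                                   # task_2
print(route(0.70, threshold=cost_threshold(5, 1)))   # task_1
```

The last line shows the effect of the 5× asymmetry: a probability of 0.70 is enough to pick task_2 under symmetric costs, but not when a wrong task_2 is five times as expensive.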

Step 4: verify calibration improved

Two standard metrics:

Expected Calibration Error (ECE): bin predictions by confidence, compute |accuracy − confidence| per bin, take the average weighted by bin size. Lower is better; as a rule of thumb, an ECE below 0.05 indicates a well-calibrated agent.
Brier score: mean squared error between predicted probability and true outcome. Penalizes miscalibration and inaccuracy in one number.
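Both metrics fit in a few lines of plain Python. This sketch assumes equal-width bins and that each prediction comes with a confidence in the chosen route plus a 0/1 flag for whether that choice was correct; the four data points are invented.

```python
def expected_calibration_error(confidences: list[float], correct: list[int],
                               n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece

def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)

# Hypothetical overconfident agent: says 0.9, is right 3 times out of 4.
conf = [0.9, 0.9, 0.9, 0.9]
hit = [1, 1, 1, 0]
print(round(expected_calibration_error(conf, hit), 3))  # 0.15
print(round(brier_score(conf, hit), 3))                 # 0.21
```

The per-bin mean confidence and accuracy computed inside the loop are the same pairs of numbers you would plot in a reliability diagram.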

Plot a reliability diagram before and after: predicted confidence on the x-axis, actual accuracy on the y-axis. A perfectly calibrated model lies on the diagonal. This visualization will sell the work to anyone reading the report.

The diagram above shows the typical pattern: the raw confidence signal sits below the diagonal in the high-confidence range (the model says "90%" but is right 75% of the time) while the calibrated curve hugs the diagonal much more closely. (Self-consistency vote fractions sometimes show the opposite pattern, sitting above the diagonal: 10 samples can only express confidence up to 1.0 even when the underlying belief is more concentrated. Either way, calibration corrects it.)

A black-box caveat that bites everyone eventually

Calibration drifts when the model changes underneath us. A lab pushes a silent update, or we switch from one snapshot to another, and our fitted calibration map no longer holds. Two defenses:

Pin model versions when the API supports it.
Re-measure ECE periodically on fresh labeled samples, and re-fit when it drifts past a threshold.

4. Optional: the formal frame

Everything above can be derived informally. But there's a precise formal frame underneath, and once you see it, a lot of design choices stop feeling like ad hoc tricks and start feeling like instances of a single pattern.

A stochastic state machine, sort of

The routing agent is a probabilistic transition: from a start state, conditioned on an input, into one of K terminal states (task_1, task_2, possibly abstain). At temperature > 0, the same input can yield different transitions across runs. That's a stochastic state machine: a Markov chain with input-conditioned transition probabilities.

The frame is useful because it puts the earlier vocabulary on a single footing: error rate, calibration, self-consistency, and abstention all become instances of a single object, namely a transition distribution and its discrepancy from truth.

The frame strains in three places. The transitions aren't conditioned on a discrete symbol but on a high-dimensional natural-language input. They're not stationary; the underlying LLM drifts. And in multi-step agents the decision depends on full conversation history, not just the previous state. The cleaner formalization is below.

POMDP: the precise version

A Partially Observable Markov Decision Process has six ingredients:

States S: the true, hidden state of the world.
Actions A: what the agent can do.
Observations O: what the agent sees.
Transition function T(s' | s, a): how the world evolves.
Observation function Z(o | s): how observations are generated from states; the noise.
Reward R(s, a): value or cost.

The agent never sees s directly. It maintains a belief state b(s) (a probability distribution over possible true states) and updates it via Bayes' rule whenever a new observation arrives:

b'(s') ∝ Z(o | s') · Σ_s T(s' | s, a) · b(s)

The optimal policy is a function from belief states to actions: π(b) → a.
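The belief update is easy to run by hand on a tiny discrete model. The sketch below uses dictionaries for T and Z and a two-state toy world (is the order lost or merely delayed?); every state name, action name, and probability is invented for illustration.

```python
def update_belief(belief: dict[str, float], action: str, observation: str,
                  T: dict[tuple[str, str], dict[str, float]],
                  Z: dict[str, dict[str, float]]) -> dict[str, float]:
    """b'(s') is proportional to Z(o | s') * sum_s T(s' | s, a) * b(s); then normalize."""
    unnormalized = {}
    for s_next in belief:
        predicted = sum(T[(s, action)].get(s_next, 0.0) * b for s, b in belief.items())
        unnormalized[s_next] = Z[s_next].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}

# Toy model (all numbers invented): is the order lost or merely delayed?
states = ["lost", "delayed"]
T = {(s, "ask_tracking"): {s: 1.0} for s in states}      # asking does not change the world
Z = {"lost":    {"no_scan_for_a_week": 0.8, "scanned_yesterday": 0.2},
     "delayed": {"no_scan_for_a_week": 0.2, "scanned_yesterday": 0.8}}

prior = {"lost": 0.5, "delayed": 0.5}
posterior = update_belief(prior, "ask_tracking", "no_scan_for_a_week", T, Z)
print(posterior)
```

Starting from an even prior, one informative observation moves most of the belief onto "lost". An LLM agent performs no such explicit update; the point of the sketch is only to show what the formula computes.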

Mapping this onto an LLM agent:

Hidden state = the user's actual intent and the actual situation.
Observation = the prompt, conversation, tool outputs the LLM sees.
Belief state = the LLM's internal representation of what's probably going on. When you ask for a confidence score or read token log-probs, you're trying to extract a projection of this belief state.
Observation function = the noise model. Users phrase the same intent in many ways; tools return ambiguous results. This is irreducible.
Policy = the prompt + model + decoding strategy that maps the conversation to an action.
Reward = your task-specific success metric.

Two kinds of uncertainty (this is the payoff)

The POMDP frame separates two sources of uncertainty that "LLM confidence" otherwise muddles together:

Aleatoric uncertainty: the world is genuinely ambiguous. Even a perfect agent couldn't tell from "my order isn't here yet" whether the order is lost or merely delayed. The observation simply doesn't carry enough information. This is the Z(o | s) noise, and it's irreducible (it is the Bayes floor from section 2, now named).

Epistemic uncertainty: the agent itself is uncertain because of its limitations: training data gaps, prompt ambiguity, model capability. A bigger or better-prompted model would be more confident on the same input. This is reducible, in principle.

Most off-the-shelf "confidence" signals mix the two. That's why calibration is hard: you're calibrating a quantity that conflates two things, against an outcome distribution that depends on both.

The practical implication is what makes the distinction worth carrying around. Low confidence calls for different responses depending on its source. Aleatoric uncertainty calls for gathering more observations: ask a clarifying question, fetch more context, call a tool. Epistemic uncertainty calls for changing the policy: use a stronger model, rewrite the prompt, escalate to a human. A well-designed agent distinguishes them. You can get a partial empirical separation: aleatoric uncertainty tends to be stable across model sizes and prompt rephrasings, while epistemic uncertainty shrinks as you scale up. Disagreement between models on the same input is a decent epistemic signal; disagreement within a single model across rephrasings is a decent aleatoric signal.

If you want a single sentence to take away from the formal section: an LLM agent is a POMDP policy that operates on an implicit, uncalibrated belief state; building reliable agents is largely the work of making that belief state explicit, calibrated, and observable.

5. The policy is the unit you're shipping

"Policy" gets thrown around loosely. Pinning it down clarifies a surprising amount of operational practice.

A policy is the rule the agent uses to pick an action given what it has observed. The dumbest possible policy: always pick task_1. A slightly less dumb policy: a keyword rule. The policy you actually have: send the message to Claude with a specific prompt, read the answer, route accordingly. All three have the same shape (a function from what-the-agent-has-seen to what-the-agent-does) and differ only in complexity.

The policy of an LLM agent is a composite object. It includes:

The model: claude-opus-4-7 and claude-haiku-4-5 are different policies even with identical prompts.
The prompt: system prompt, instructions, output format constraints.
The few-shot examples: each one shifts the action distribution.
The decoding parameters: temperature, top-p, max tokens.
The tool set: what tools are available and how they're described.
The control flow: retries, self-consistency voting, "if confidence < threshold, ask for clarification," fallbacks.

Anything we can change that would change the action distribution is part of the policy.
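One way to make that composite object tangible is to put every component in a single immutable record and hash it, so each measurement can be tied to the exact policy that produced it. This is a sketch: the field names, defaults, and the 12-character fingerprint are illustrative choices.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class Policy:
    model: str
    system_prompt: str
    few_shot_examples: tuple[str, ...] = ()
    temperature: float = 0.0
    top_p: float = 1.0
    max_tokens: int = 256
    tools: tuple[str, ...] = ()
    control_flow: str = "single_call"   # e.g. "self_consistency_n10_with_abstention"

    def fingerprint(self) -> str:
        """Stable short hash of every field; store it next to each eval result."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

baseline = Policy(model="claude-haiku-4-5", system_prompt="Route to task_1 or task_2.")
variant = replace(baseline, temperature=0.7)

print(baseline.fingerprint(), variant.fingerprint())
print(baseline.fingerprint() == variant.fingerprint())   # False
```

Changing any field, even just the temperature, yields a different fingerprint, which is a cheap guard against comparing numbers measured on different policies.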

This framing has three concrete payoffs:

The policy is the unit of evaluation. When you measure error rate, you measure it for a specific policy. Change any of the six items and your old measurement is, strictly speaking, no longer valid.

The policy is the unit of comparison. "Claude is better than Gemini for routing" actually means "policy A (Claude + prompt P + temp T) is better than policy B (Gemini + prompt P' + temp T')." A fair comparison fixes everything except the model. Most informal LLM comparisons fail this test.

The policy is the unit of improvement. Every calibration technique we discussed maps onto a specific policy change. Temperature scaling adds a post-processing step. Self-consistency wraps the LLM call in an aggregator. An abstention zone augments the action space and the decision rule. Switching models swaps the core component. Changing the prompt modifies the conditioning.

Think of your agent as a stack: control flow on top, then prompts and config, then the model and decoding parameters. The whole stack is your policy. When you "improve the agent," you modify one or more layers. When you measure error rate, you measure the whole stack. When the vendor pushes a model update, the bottom layer shifts and your measurements drift even though you changed nothing.

A lot of operational discipline follows from this: version pinning, A/B testing prompts, regression evaluation on every prompt change, treating the prompt as code with the same review rigor. These all become obvious once you accept that the policy (not "the LLM") is the artifact you're shipping.

6. A/B testing policies properly

You have policy A in production. You think policy B is better. How do you actually find out?

The naive approach: run both on yesterday's traffic, count errors, compare. This works for offline regression checks but isn't really an A/B test: you have no traffic randomization, no measurement of downstream effects, no statistical claim. A real A/B test is a controlled experiment that splits live traffic and measures the difference under conditions that allow causal attribution.

The structure

A well-formed A/B test has six pieces:

A clear hypothesis with a primary metric and direction. "Policy B reduces routing error rate by at least 2 percentage points compared to policy A." Vague hypotheses ("B is better") produce ambiguous results.
A unit of randomization. Per-request, per-user, or per-session. Must match what you're measuring. If you're measuring per-request errors, randomize per request; if you're measuring user satisfaction, randomize per user.
A primary metric. One number that decides the test. Pick one. Five primary metrics is zero primary metrics.
Guardrail metrics. Things that must not get worse: latency, cost, abstention rate, downstream tool error rate, safety violations. A policy that improves accuracy but doubles latency is a regression in disguise.
A pre-registered sample size and stopping rule. Compute, before starting, how many requests you need to detect your minimum effect with adequate power. Then commit to running until that sample is reached. Peeking and stopping when it looks good is the single most common way teams fool themselves.
An analysis plan. What test, what threshold, how subgroups are handled. Written down before the data arrives.
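For the unit of randomization, a common approach is to hash the unit's id together with the experiment name, so that assignment is deterministic and needs no stored state. A minimal sketch, with a hypothetical id format and experiment name:

```python
import hashlib

def assign_arm(unit_id: str, experiment: str, fraction_b: float = 0.5) -> str:
    """Deterministically map a randomization unit (request, user, or session id) to an arm."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return "B" if bucket < fraction_b else "A"

# The same unit always lands in the same arm for a given experiment name.
print(assign_arm("user-42", "router-prompt-v2"))
print(assign_arm("user-42", "router-prompt-v2"))
```

Passing a user id randomizes per user; passing a request id randomizes per request. Including the experiment name in the hash keeps assignments independent across experiments.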

Sample size: the number that matters most

For comparing two error rates with a two-proportion z-test, the rough sample size per arm is:

n ≈ 16 · p̄(1−p̄) / Δ²

where p̄ is the average error rate and Δ is the minimum detectable effect (constants correspond to 80% power at 5% significance).

Sobering numbers: if your current error rate is 8% and you want to detect a 2-point improvement (8% → 6%), the formula gives roughly 2,600 requests per arm (p̄ = 0.07, Δ = 0.02). To detect a 0.5-point improvement (8% → 7.5%), it gives roughly 46,000 per arm. Most "the new prompt seems better" claims are made with sample sizes that couldn't possibly detect the effect being claimed.

This matters specifically for LLM policy testing because differences between two reasonable prompts are often genuinely small in absolute terms, and detecting them reliably takes more traffic than people expect.
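The rule of thumb is simple enough to keep as a helper and run before every test. The sketch below applies the formula exactly as written above; the two example scenarios (a 10% baseline) are hypothetical.

```python
import math

def sample_size_per_arm(p_a: float, p_b: float) -> int:
    """Rule-of-thumb n per arm for ~80% power at 5% two-sided significance."""
    p_bar = (p_a + p_b) / 2          # average error rate across the two arms
    delta = abs(p_a - p_b)           # minimum detectable effect
    if delta == 0:
        raise ValueError("the two rates must differ")
    return math.ceil(16 * p_bar * (1 - p_bar) / delta**2)

# Hypothetical scenarios: a 10% baseline improved by 2 points, then by 1 point.
for p_a, p_b in [(0.10, 0.08), (0.10, 0.09)]:
    n = sample_size_per_arm(p_a, p_b)
    print(f"{p_a:.0%} -> {p_b:.0%}: about {n:,} requests per arm")
```

Because Δ is squared in the denominator, halving the effect you want to detect roughly quadruples the traffic you need.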

Analysis

For binary outcomes, two-proportion test (z-test or Fisher's exact) with confidence intervals on the difference. Report point estimate and CI: "B beat A by 1.3 points, 95% CI [0.4, 2.2]" is informative. "B was better, p=0.03" is not.

For continuous metrics (latency, cost, score), t-test or Mann-Whitney depending on distribution shape. For skewed distributions like latency tails, bootstrap rather than trusting parametric tests.

For multiple comparisons (testing four prompts against a baseline), correct accordingly: Bonferroni for conservative family-wise error, Benjamini-Hochberg for false discovery rate.
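For the binary case, the test and the interval on the difference can be computed with the standard library alone. This sketch uses a pooled standard error for the z-test and an unpooled (Wald) standard error for the interval, which is one common convention; the counts are invented.

```python
import math
from statistics import NormalDist

def compare_error_rates(errors_a: int, n_a: int, errors_b: int, n_b: int,
                        confidence: float = 0.95) -> dict[str, float]:
    """Two-proportion z-test plus a Wald confidence interval on (rate_A - rate_B)."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    diff = p_a - p_b                               # positive means B makes fewer errors

    # Pooled standard error for the test of "no difference".
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval on the difference.
    se_ci = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {"diff": diff, "ci_low": diff - z_crit * se_ci,
            "ci_high": diff + z_crit * se_ci, "p_value": p_value}

# Hypothetical counts: A made 80 errors in 1,000 requests, B made 60 in 1,000.
result = compare_error_rates(80, 1000, 60, 1000)
print({k: round(v, 4) for k, v in result.items()})
```

In this example B looks 2 points better, yet the interval still includes zero: with 1,000 requests per arm the test is underpowered for an effect of that size.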

What's special about A/B testing LLM policies

Five things make this harder than testing button colors:

Stochasticity within a single policy. Even at temperature 0, LLM outputs aren't perfectly deterministic in production (batching effects, vendor-side variation). Some of the variance you see between A and B is within-policy noise. Running each request through both policies and comparing (paired sampling) can dramatically reduce variance and shrink required sample sizes.

Drift in the underlying model. If A's baseline was measured three weeks ago, that number may no longer hold when B starts running. Always measure A and B concurrently on the same traffic, not B against a stale historical estimate of A.

Distribution shift in inputs. Yesterday's traffic isn't tomorrow's traffic. Run long enough to span typical variation; analyze by time bucket to check stability.

Heterogeneous effects across subgroups. B might be better on average but worse on a critical subset (non-English queries, edge cases involving tools). Slice results by meaningful subgroups before declaring victory. Pre-specify the slices to avoid post-hoc fishing.

Cost asymmetry. Unlike button colors, A and B may have very different costs per request. Bake cost into the metric (error rate per dollar) or report a Pareto frontier rather than a single winner.
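When every request goes through both policies, the analysis can use the pairing directly. One standard choice, named here as an assumption since the text above does not prescribe a test, is McNemar's exact test, which looks only at requests where exactly one policy was wrong. The outcome lists are invented.

```python
from math import comb

def mcnemar_exact(only_a_wrong: int, only_b_wrong: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs.

    Requests where both policies are right, or both wrong, carry no
    information about which policy is better and are ignored.
    """
    n = only_a_wrong + only_b_wrong
    if n == 0:
        return 1.0
    k = min(only_a_wrong, only_b_wrong)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical paired run: each request was sent through both A and B.
a_correct = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
b_correct = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
only_a_wrong = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
only_b_wrong = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
print(only_a_wrong, only_b_wrong, round(mcnemar_exact(only_a_wrong, only_b_wrong), 4))
```

The variance reduction comes from discarding the concordant pairs: noise shared by both policies on the same request cancels out instead of inflating the comparison.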

The right ladder: offline → shadow → canary → full A/B

Don't go straight to live traffic. The standard escalation:

Offline. Run both policies on your gold set. Cheap, fast, low risk, limited by gold-set coverage. Necessary, not sufficient.

Shadow. All traffic still goes through A (A's response is what the user sees), but B runs in parallel on the same requests and its decisions are logged. Real traffic, zero user risk. The catch: you can't measure outcomes that depend on B's response actually being delivered.

Canary. Send a small fraction (1-5%) of live traffic to B. Monitor guardrails. If nothing breaks, ramp up.

Full A/B. 50/50 (or whatever your power calculation requires) on live traffic, run to pre-specified sample size, decide.

Most policy changes can be killed at the offline or shadow stage. Reserve full A/B for changes that have already passed the cheaper checks.
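The shadow stage is mostly plumbing. A minimal sketch, assuming each policy is a callable from message to route name and that an in-memory list stands in for a real log sink (in production the shadow call would typically run asynchronously so it adds no latency):

```python
from typing import Callable

Router = Callable[[str], str]   # message -> "task_1" | "task_2"

def handle_in_shadow_mode(message: str, policy_a: Router, policy_b: Router,
                          shadow_log: list[dict]) -> str:
    """Serve policy A's decision; run policy B on the same input and only log it."""
    decision_a = policy_a(message)
    try:
        decision_b = policy_b(message)
    except Exception as exc:          # a failing shadow policy must never affect the user
        decision_b = f"error: {exc!r}"
    shadow_log.append({"message": message, "a": decision_a, "b": decision_b,
                       "agree": decision_a == decision_b})
    return decision_a

# Toy stand-ins for the two policies (a real one would call the LLM).
policy_a = lambda msg: "task_1"
policy_b = lambda msg: "task_2" if "latest" in msg.lower() else "task_1"

log: list[dict] = []
for msg in ["How do I restart a failed batch job?", "What is the latest release?"]:
    handle_in_shadow_mode(msg, policy_a, policy_b, log)
print(sum(not row["agree"] for row in log), "disagreement(s) out of", len(log))
```

The logged disagreements are a natural labeling queue: they are the requests where switching to B would change what users see.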

Pitfalls that catch people repeatedly

Peeking and early stopping. Checking the test daily and stopping the moment p < 0.05 makes your true false-positive rate much higher than 5%. Either commit to a fixed sample size, or use sequential methods (mSPRT, always-valid p-values) designed for continuous monitoring.
Confounding by version drift. Don't change the prompt and upgrade the model in the same test. One change at a time, or use a factorial design.
Survivorship bias. If B includes "abstain when uncertain" and A doesn't, B's accuracy on requests it did answer will look better even if it's worse overall. Always measure on the full denominator, including abstentions.
Cost of the test itself. Running B on 50% of production traffic for a week costs real money. Factor that in when deciding whether the expected benefit justifies the test.
Treating offline gold-set wins as production wins. Gold sets are static; production isn't. A policy that wins offline by 5 points often wins online by 1, or by zero. Always confirm online before declaring success.
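The survivorship pitfall is easy to demonstrate with invented numbers. In this sketch each request ends in one of three outcomes, and policy B abstains on 20% of traffic:

```python
def accuracy_on_answered(outcomes: list[str]) -> float:
    """Accuracy over requests the policy chose to answer (the flattering number)."""
    answered = [o for o in outcomes if o != "abstain"]
    return sum(o == "correct" for o in answered) / len(answered)

def accuracy_on_all(outcomes: list[str]) -> float:
    """Accuracy over every request; an abstention is not a success."""
    return sum(o == "correct" for o in outcomes) / len(outcomes)

# Hypothetical results on the same 100 requests.
policy_a = ["correct"] * 90 + ["wrong"] * 10
policy_b = ["correct"] * 76 + ["wrong"] * 4 + ["abstain"] * 20

print(accuracy_on_answered(policy_a), accuracy_on_all(policy_a))  # 0.9 0.9
print(accuracy_on_answered(policy_b), accuracy_on_all(policy_b))  # 0.95 0.76
```

On answered requests B looks better (95% vs 90%), but on the full denominator it resolves fewer requests correctly. Whether that trade is worth it depends on what an abstention costs compared with a wrong answer, which is why both numbers belong in the report.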

7. Putting it together

The five-line summary of everything above:

Define correctness precisely: disambiguated input plus ground truth, by error category.
Measure error rate: on a gold set drawn from production traffic, with confidence intervals, broken down by confusion-matrix cell, with the Bayes floor estimated from inter-annotator agreement.
Calibrate: extract a confidence signal (token log-probs or self-consistency), fit a calibration map (Platt or temperature scaling) on a held-out set, set thresholds and an abstention zone, verify with ECE and reliability diagrams.
Treat the policy as the unit: model, prompt, parameters, control flow, all together. Pin versions. Measure, compare, and ship policies, not "the LLM."
A/B test changes properly: clear hypothesis, pre-registered sample size, paired sampling where possible, the offline → shadow → canary → full ladder, no peeking.

None of this requires access to the model's weights. None of it requires fine-tuning. All of it works on commercial APIs. What it does require is taking the question "is the LLM right or wrong?" seriously enough to answer it the way you'd answer any other measurement problem in software: with definitions, instruments, statistics, and discipline.

The teams that do this consistently ship agents that work. The teams that don't ship demos.

Cloud Architecture Distilled

Jocer Franquiz — Tue, 05 May 2026 16:02:43 +0000

A Provider-Agnostic Cloud Architecture Reference

If you've ever switched cloud providers (or worked across more than one) you know the pain: every vendor invents its own name for the same fundamental building block. S3 is Blob Storage is Cloud Storage. EKS is AKS is GKE. The architecture is largely the same; the marketing is different.

I got tired of mentally translating, so I built a single-page reference that maps 14 cloud components and 65 distinct types across AWS, Azure, GCP, and the open-source ecosystem.

What's in it

Compute: VMs, containers, serverless, batch
Storage: object, block, file, archive
Networking: VPCs, load balancers, CDNs, DNS, VPN
Databases: relational, NoSQL, graph, time-series, warehouse, lake
IAM, Security, Messaging, API Management, Observability, Orchestration, CI/CD, Caching, DR, Cost & Governance

Each type has a one-line description and the canonical equivalent in each provider, with notes on services in transition (e.g. App Mesh retiring Sep 2026, GCP Cloud Source Repositories closed to new customers).

Tech

Single self-contained HTML file
No build step, no dependencies (except Google Fonts)
Live search + provider filtering
Nord theme, responsive, print-friendly
Hosted on GitHub Pages
Licensed CC BY-SA 4.0

Link

👉 https://jocerfranquiz.github.io/cloud-architecture-distilled/

Feedback wanted

If you spot a wrong mapping, a missing component, or a service that's been deprecated, drop a comment or open an issue. I'd like to keep this current.

What would you add?

A Serious (and hype-less) Study Guide on Agents and LLMs

Jocer Franquiz — Thu, 09 Apr 2026 18:48:19 +0000

A curated set of resources for understanding LLM agent architecture, the control plane, and how to build effective agents, with direct links to every resource.

1. Recommended path

If you only have a few hours, do these in order:

Anthropic: Building effective agents (~1 hour) The single best practical overview from people who ship them.
Lilian Weng: LLM Powered Autonomous Agents (~1 hour) The canonical academic-flavored overview: planning, memory, tool use.
Model Context Protocol intro + Claude Code documentation (1–2 hours) The control-plane mental model clicks fast once you've read both.
Skim one framework's "concepts" page; the LangGraph overview is the densest (30 min).
Dip into papers (ReAct, Reflexion, …) only when a specific pattern catches your interest.

2. Foundational essays: read these first

Building effective agents

Erik Schluntz & Barry Zhang, Anthropic, December 2024. The best practical overview. Covers workflows vs agents, common patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer), and (crucially) when not to use an agent. The companion code lives in the Claude Cookbooks agent patterns folder.

LLM Powered Autonomous Agents

Lilian Weng (OpenAI), June 2023. The canonical academic-flavored overview: planning, memory, tool use. Still the most-cited single piece in the field. Lives on her blog Lil'Log.

AI Engineering (chapter on Agents)

Chip Huyen, O'Reilly, 2024. Excellent on the engineering side: evaluation, failure modes, planning loops. The whole book is worth owning. See also Chip Huyen's books page and the supporting GitHub repository.

3. Patterns & techniques: the original papers

Paper	Authors, year	Key idea
ReAct	Yao et al., 2022	Interleave Thought → Action → Observation
Reflexion	Shinn et al., 2023	Self-critique to improve over iterations
Toolformer	Schick et al., 2023	Tool use as a learned skill
Tree of Thoughts	Yao et al., 2023	Explicit search over reasoning branches
Plan-and-Solve	Wang et al., 2023	Decompose first, then execute step by step
Voyager	Wang et al., 2023	Skill libraries / procedural memory in the wild (project site)
Self-Refine	Madaan et al., 2023	Iterative improvement via self-feedback (project site)
Chain-of-Thought	Wei et al., 2022	Step-by-step reasoning prompts
Generative Agents	Park et al., 2023	The famous Smallville simulation

4. Protocols & specs (the control-plane stuff)

Model Context Protocol (MCP)

Anthropic's open spec for plugging tool servers into any agent. The de-facto standard for tool interoperability. Start with the introduction and the main GitHub org.

AGENTS.md

Cross-vendor spec for "instructions to coding agents" files. Originated by OpenAI Codex, Amp, Jules (Google), Cursor, and Factory; now stewarded by the Agentic AI Foundation under the Linux Foundation. Implemented across most coding agents. Source on GitHub.

Agent Skills

Anthropic's open SKILL.md standard for lazy-loaded capability bundles: folders of instructions, scripts, and resources that an agent discovers via metadata and loads on demand. Originally a Claude Code feature, now adopted by Cursor, GitHub Copilot, VS Code, Gemini CLI, OpenAI Codex, OpenHands, Goose, Letta, JetBrains Junie, Factory, Amp, and ~20 other tools. Start with the overview, then the specification. Source on GitHub; Anthropic's example skills at anthropics/skills.

OpenAPI → tool schemas

Tool schemas can be auto-generated from OpenAPI specs. Most frameworks support this directly.

5. Claude Code & Anthropic ecosystem

Claude Code documentation

The official source of truth, updates frequently. Sections on hooks, skills, subagents, MCP, settings, slash commands, plugins, output styles, status lines. The mirror at docs.anthropic.com/en/docs/claude-code also serves the same content. Source on GitHub.

Claude Agent SDK

Same docs site. The SDK exposes the same primitives (tools, hooks, permissions) that Claude Code uses, so reading the SDK docs is one of the fastest ways to understand the harness model.

Claude Cookbooks

Practical agent recipes on GitHub (formerly Anthropic Cookbook). The patterns/agents/ folder contains the reference implementations for Building Effective Agents (orchestrator-workers, evaluator-optimizer, etc.).

Anthropic Engineering blog

Periodic deep dives on agent design, tool use, and prompt engineering. Published under anthropic.com/engineering and anthropic.com/research.

6. Frameworks (good for "show me code")

Each framework's docs are essentially an opinionated essay on agent architecture. Read the concepts pages, not the API reference.

Framework	Strength	Links
LangGraph (LangChain)	Stateful loops, multi-agent, human-in-the-loop	docs · product · GitHub
LlamaIndex Workflows / Agents	Retrieval and memory	agents docs · Workflows 1.0 announcement
Pydantic AI	Typed tool calls, clean mental model	docs · GitHub
smolagents (Hugging Face)	Minimal, code-as-action	docs · GitHub · intro blog
CrewAI	Multi-agent role-based	docs · GitHub
AutoGen (Microsoft)	Conversational multi-agent (now in maintenance, see Microsoft Agent Framework below)	docs · GitHub
Microsoft Agent Framework	The successor to AutoGen, enterprise-ready	docs
OpenAI Agents SDK	Lightweight handoff-based (production successor to Swarm)	docs · GitHub · original Swarm
DSPy (Stanford)	Programmatic prompts, optimization	site · GitHub

7. Memory & retrieval

GraphRAG (Microsoft Research, 2024): graph-augmented retrieval over a corpus. GitHub · project page · paper.
MemGPT / Letta: tiered memory inspired by OS virtual memory. The original MemGPT paper (Packer et al., 2023) is the canonical reference; the modern Letta framework is the production successor.
Vector DB docs: Qdrant, Weaviate, pgvector, Chroma: each has good intro material.
For RAG patterns, the LlamaIndex agents docs are the canonical reference.

8. Observability & evaluation

Tracing platforms

Each has docs that double as a tutorial on what to instrument:

Langfuse: open source, self-hostable. Docs · GitHub
LangSmith: hosted, by LangChain. Docs
Arize Phoenix: open source, very conceptual docs. Docs · GitHub
Helicone: proxy-based. Docs · GitHub
Braintrust: eval-focused. Docs

Standards

OpenTelemetry GenAI semantic conventions: the emerging standard for tracing LLM/agent calls. See also the agent-spans page.

Evaluation frameworks & benchmarks

lm-evaluation-harness (EleutherAI): base-model benchmarks; backend for the HuggingFace Open LLM Leaderboard.
HELM (Stanford CRFM): holistic evaluation framework. GitHub · paper
AgentBench (Tsinghua): multi-environment LLM-as-agent benchmark. Paper
SWE-bench: solving real GitHub issues. GitHub
τ-bench (Sierra): tool-agent-user interaction in real-world domains. Blog post · τ²-bench

9. Safety, security, and guardrails

OWASP Top 10 for LLM Applications: the standard threat list. Project page · 2025 PDF
Prompt injection: Simon Willison's prompt injection series is the most comprehensive ongoing coverage. He coined the term and continues to write about new variants on his main blog and his substack.
NIST AI Risk Management Framework: for governance angles. AI RMF 1.0 PDF · Resource Center
Anthropic's Responsible Scaling Policy: model-level safety thinking, published on anthropic.com.

10. Multi-agent & emerging directions

AutoGen paper (Wu et al., 2023): multi-agent conversation framework
MetaGPT: assembly-line multi-agent. Paper
ChatDev: software-company-as-multi-agent. Paper
Generative Agents (Park et al., 2023): the famous Smallville simulation. Paper

11. Going deeper: books

Chip Huyen: AI Engineering (O'Reilly, 2024): production AI systems. Author page · GitHub
Jay Alammar & Maarten Grootendorst: Hands-On Large Language Models (O'Reilly, 2024): visual, accessible. O'Reilly · GitHub
Sebastian Raschka: Build a Large Language Model (From Scratch) (Manning, 2024): for understanding what's inside the LLM. GitHub · author's books page

12. Communities & ongoing reading

Anthropic, OpenAI, DeepMind engineering blogs: best practical writing
- Anthropic Engineering · Anthropic Research
Simon Willison's blog: daily LLM news and analysis (the best single feed in the field)
Latent Space podcast: interviews with builders, hosted by swyx and Alessio. Newsletter
Hacker News ai tag: high-signal discussions
LangChain blog, LlamaIndex blog: framework-level pattern writeups
arXiv cs.CL and cs.AI: primary research

13. By topic: quick reference

If you want to understand…	Start with
What an agent is	Anthropic Building effective agents
Planning patterns	ReAct, Plan-and-Solve papers
Memory architectures	Lilian Weng's post, MemGPT/Letta
Tool integration	MCP docs
Configuration / control plane	Claude Code docs (hooks, skills, subagents)
Multi-agent systems	LangGraph, AutoGen, MetaGPT
Production tracing	Arize Phoenix or Langfuse
Agent evaluation	SWE-bench, τ-bench, AgentBench
Prompt injection / safety	Simon Willison's series, OWASP LLM Top 10
RAG	LlamaIndex agents, GraphRAG
LLMs from the inside	Sebastian Raschka's book

14. A note on freshness

This field moves fast. Patterns from 2023 may be obsolete; protocols from 2024 may be standard by next quarter. Treat any specific tool or framework recommendation as a snapshot, not gospel. The concepts (loop, memory, tools, control plane, three knobs) are stable. The implementations churn.

When in doubt: read the official docs of whatever tool you're actually using, then triangulate with one or two of the foundational essays above.

Why Formal Systems Can't Read Their Own Output?

Jocer Franquiz — Tue, 07 Apr 2026 18:29:28 +0000

What do concatenative combinator calculus, mathematicians, and language models have in common? They all hit the same wall: Evaluation is lossy. A formal system can produce expressions but cannot unambiguously interpret them using only its own rules, because reduction destroys the information needed to recover intent.

This post traces a thread that surprised me, starting from a very concrete implementation question about a stack-based evaluator and ending somewhere between information theory and machine learning.

Concatenative Combinator evaluation

Imagine a simple concatenative (stack-based) combinator language. We have primitive operations like swap, drop, dip, call. We can define new words in a dictionary:

true  := [ swap drop ]
false := [ drop ]
not   := [ [ swap drop ] [ drop ] ] dip call

The evaluator reduces expressions by expanding words into their definitions and applying reduction rules. It works fine. But the thing is, once an expression is fully reduced, we're left staring at a pile of primitives. The word not has been dissolved into [ [ swap drop ] [ drop ] ] dip call, and nothing in the output tells us it was ever not.

This is normal. Evaluators reduce. They don't un-reduce.

The Reflection Problem

Now suppose we want the system to have a reflection property (programs can inspect their own structure at runtime and check whether some subexpression matches a word in the dictionary). Not as an external prettifier applied after the fact, but as a capability within the calculus itself. A program can look at the top of the stack and ask: "do I recognize this?"

Structural equality checking is straightforward. We walk two expression trees and compare. The real problem emerges when we try it on a concrete example.

Take this expression sitting on the stack:

[ [ [ swap drop ] [ drop ] ] dip call ]

Consulting the dictionary, we can read this as:

Level 0: just primitives: [ [ [ swap drop ] [ drop ] ] dip call ]
Level 1: partial recognition: [ [ true false ] dip call ]
Level 2: full recognition: [ not ]

All three are correct. The expression is all of these simultaneously.

The matching is easy. The problem is that we get multiple valid matchings at different levels of abstraction, and nothing inside the formal system tells us which one is "right."
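
To make the multiple readings concrete, here is a minimal sketch that folds dictionary definitions back into an expression and collects every form it can reach. The tuple encoding (a quotation is a tuple, a word is a string) and the helper names fold_once, readings, and show are assumptions made for illustration, not part of any real evaluator.

```python
# Each definition body is a sequence of items; a nested tuple is a quotation.
DICTIONARY = {
    "true": (("swap", "drop"),),
    "false": (("drop",),),
    "not": ((("swap", "drop"), ("drop",)), "dip", "call"),
}

def fold_once(seq):
    """Yield every sequence obtained by folding exactly one dictionary match."""
    for i in range(len(seq)):
        # Replace a slice that equals a definition body with the word's name.
        for name, body in DICTIONARY.items():
            if seq[i:i + len(body)] == body:
                yield seq[:i] + (name,) + seq[i + len(body):]
        # Look for matches inside nested quotations as well.
        if isinstance(seq[i], tuple):
            for inner in fold_once(seq[i]):
                yield seq[:i] + (inner,) + seq[i + 1:]

def readings(seq):
    """Collect every form reachable by repeated folding, including seq itself."""
    seen, frontier = {seq}, [seq]
    while frontier:
        for nxt in fold_once(frontier.pop()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def show(seq):
    """Render a sequence in the bracket notation used in this post."""
    return " ".join(f"[ {show(x)} ]" if isinstance(x, tuple) else x for x in seq)

# The expression from the text: a quotation holding the body of `not`.
expression = (DICTIONARY["not"],)
for reading in sorted(readings(expression), key=lambda r: len(show(r))):
    print(show(reading))
```

Besides the three levels listed above, the sketch also finds the two in-between forms where only true or only false is recognized. The matcher can enumerate all of them, but nothing in it says which one to prefer.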

But this ambiguity is familiar

This is exactly what happens in natural language. "She is cold" means:

She is emotionally less expressive than others.
Her body temperature is below average.

Both readings are syntactically and semantically valid. The sentence is genuinely, irreducibly ambiguous. Linguists have studied this for decades (Grice's cooperative principle, Sperber and Wilson's relevance theory, for example). And the consensus is the same: we resolve it through context, the surrounding conversation, shared knowledge, the situation, pragmatic inference about the speaker's intent. The resolution comes from outside the sentence itself.

Our combinator expression has the same property. The syntax supports multiple interpretations. The formal system can enumerate them but cannot choose between them. The intent of the original author: "did they mean not, or were they deliberately composing true and false with a branching pattern?" was lost during evaluation.

Evaluation is lossy. It destroys authorial intent. Reflection tries to recover the signal that reduction discarded.

If the system can't resolve the ambiguity from within, what can?

The obvious engineering answer is: don't lose the information in the first place. Instrument the evaluator to retain provenance, record which dictionary words were expanded at each step, and we can trace back to the original expression deterministically. This works in a closed system where we control the evaluator.

But it breaks down quickly in practice: expressions arrive from external sources without history, interop boundaries strip annotations, provenance metadata grows combinatorially with expression size, and legacy systems weren't built to carry it. In the general case (reading an expression we didn't produce) the information is gone.

So what resolves the ambiguity when provenance isn't available? The same thing that resolves "she is cold": learned preferences shaped by experience.

The approach would be:

The system presents all valid interpretations to the user, ranked by some initial heuristic (longest match, most abstract, most frequently chosen).
The user reorders them: "no, in this context I meant not, not [ true false ] branch."
A model learns from these supervised signals.
Over time, the system predicts the preferred interpretation without asking.

This is supervised machine learning. More specifically, it's the same setup that underlies language models: given an ambiguous sequence, predict the most probable interpretation based on patterns learned from human-labeled examples.
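
The feedback loop can be sketched with something far simpler than a language model: a per-context counter of which reading the user picked. The class PreferenceModel, its methods, and the context labels are hypothetical names chosen for this example, and the shortest-reading tie-break is just one possible initial heuristic.

```python
from collections import Counter, defaultdict

class PreferenceModel:
    """Counts which interpretation users chose in each context."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def record(self, context, chosen):
        # One supervised signal: in this context, the user meant `chosen`.
        self.counts[context][chosen] += 1

    def rank(self, context, candidates):
        seen = self.counts[context]
        # Most frequently chosen first; ties fall back to the shortest reading.
        return sorted(candidates, key=lambda c: (-seen[c], len(c)))

candidates = ["[ not ]", "[ [ true false ] dip call ]"]
model = PreferenceModel()

# Before any feedback, the heuristic prefers the most abstract (shortest) form.
before = model.rank("branching", candidates)

# The user corrects the system twice in a branching context.
model.record("branching", "[ [ true false ] dip call ]")
model.record("branching", "[ [ true false ] dip call ]")
after = model.rank("branching", candidates)
```

A real system would generalize across contexts instead of memorizing them, which is where a learned model earns its place. The shape of the loop stays the same.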

We started with combinatory logic and arrived at a setting where a language model is a natural fit.

Proof tells us what's equivalent. Prediction tells us what was meant.

The Information-Theoretic View

Why is this ambiguity fundamental rather than incidental?

Evaluation is a many-to-one function. Multiple distinct source expressions reduce to the same normal form. not, [ true false ] branch, and [ [ swap drop ] [ drop ] ] dip call all collapse to identical primitive sequences. The mapping from source to reduced form is non-injective: it destroys distinctions.

Recovering intent from a reduced expression is therefore an inverse problem: given one output, determine which of many possible inputs produced it. This is the same structure we find in lossy compression, decompilation, and signal reconstruction. The ambiguity isn't a bug in our evaluator. It's a structural property of any system where evaluation collapses distinctions that interpretation needs.
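
A small sketch shows the collapse directly. It only performs dictionary expansion (it does not execute the primitives), and it uses the [ true false ] dip call spelling from the reflection example. The tuple encoding and the function name expand are assumptions for illustration.

```python
DICTIONARY = {
    "true": (("swap", "drop"),),
    "false": (("drop",),),
    "not": ((("swap", "drop"), ("drop",)), "dip", "call"),
}

def expand(seq):
    """Replace every dictionary word with its body, recursively."""
    out = []
    for item in seq:
        if isinstance(item, tuple):
            out.append(expand(item))               # quotation: expand inside it
        elif item in DICTIONARY:
            out.extend(expand(DICTIONARY[item]))   # word: splice in its body
        else:
            out.append(item)                       # primitive: keep as-is
    return tuple(out)

# Three different things an author could have written.
sources = [
    ("not",),
    (("true", "false"), "dip", "call"),
    ((("swap", "drop"), ("drop",)), "dip", "call"),
]

# A set keeps only distinct results, so its size counts the normal forms.
normal_forms = {expand(source) for source in sources}
```

Three inputs, one output: going backwards from that single output means choosing among the inputs, and the output itself carries no hint.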

This has a precise formal counterpart. In equality saturation, an e-graph data structure maintains all equivalent representations of an expression simultaneously, exactly the way our dictionary check produces multiple valid readings at different abstraction levels. The egg library (Willsey et al., POPL 2021) formalized this approach, and a key result is that the extraction problem (i.e. choosing the "best" equivalent term from an e-graph according to a cost function) is NP-hard in general.

That hardness result matters here. It tells us that even when we have a perfect representation of all valid interpretations (the e-graph), selecting the right one is fundamentally expensive. Simple cost functions (fewest nodes, shortest expression) give us a tractable approximation, but they don't capture intent. A learned cost model, trained on which interpretations users actually prefer, is a language model by another name. This is what it looks like when a formal system acquires a statistical component not as a convenience, but because the selection problem demands it.

Mathematicians Have the Same Problem

This pattern shows up everywhere, including in how humans do mathematics.

A mathematician works through a long calculation and arrives at some algebraic structure. Formally it's symbols on a page. But after years of training, they look at it and think: "of course! this is a Lie algebra." Or: "this has a Dirac delta function hiding in it."

That recognition is not deduction. They didn't derive it axiomatically in the moment. They pattern-matched against a vast internal library of structures, built up through years of supervised exposure: working problems, reading textbooks, being corrected by professors.

And just like the combinator case, the recognition is ambiguous. The same structure might be a Lie algebra, a tangent space, or an instance of a fiber bundle. The mathematician picks the interpretation most useful in context, and that judgment comes from experience, not from the formalism.

This is why mathematics takes years to learn. Part of that time goes to absorbing the sheer volume of formal content — modern mathematics is vast, and simply learning the definitions, theorems, and proof techniques is itself a major undertaking. But beyond that, what takes years is training the pattern recognition model in the mathematician's head. Learning to look at a page of symbols and feel that something "smells like" a cohomology group or is "obviously" a convolution. Mathematicians use aesthetic and intuitive language because the recognition process operates partly below the level of conscious formal reasoning.

The pedagogy even mirrors supervised learning: a student works through a problem; the professor says "do we see this is a Lie algebra?"; the student didn't see it; now they have a label; next time, they're more likely to recognize it. Over hundreds of examples with feedback, the model trains.

One Common Thread

Combinator calculus, mathematical practice, and natural language are three seemingly unrelated domains, but all exhibit the same structure:

	Combinator Reflection	Mathematics	Natural Language
System	Combinator calculus	Symbolic algebra	Grammar + vocabulary
Expression	Reduced stack	Calculation result	Sentence
Dictionary	Word definitions	Known structures	Word meanings
Ambiguity	Multiple valid readings	Multiple recognizable patterns	Multiple interpretations
Resolution	Heuristics or learned preference model	Formal training + intuition	Pragmatic inference + contextual knowledge

In every case:

The formal content is produced mechanically.
Interpreting it requires recognition that is not mechanical.
Recognition involves ambiguity that the formalism cannot resolve internally.
Resolution requires judgment shaped by experience, whether heuristic, learned, or pragmatic.

The formalism produces. Recognition consumes. And they require fundamentally different mechanisms.

Final remarks

Any sufficiently expressive formal system that tries to understand its own outputs the way its users understand them will eventually benefit from something like a language model inside it. Not as an add-on or a convenience, but as a natural component, because the gap between producing expressions and interpreting them is an information gap. Evaluation destroys distinctions that interpretation needs, and no amount of internal reasoning can recover what was never preserved.

The gap is not a bug in the implementation. It's a structural property of non-injective evaluation. And the way we cope with it in programming, in mathematics, in everyday speech, is always the same: we learn to predict what was meant, because the formal system alone can't recover it.

How a Pushdown Automaton becomes a Parser [part 3]

Jocer Franquiz — Sat, 28 Mar 2026 16:33:32 +0000

From Tokens to Trees: Four Paths to a Full Parser

In part 2, we built a pushdown automaton transducer, with 11 operations, 6 components, one stack. Our PDA Transducer turns nested <div> tags into a flat stream of tokens. In this post, we will explore what we need to add to get a simple but practical parser that outputs a DOM tree.

Q: Is the transducer enough to parse HTML?

No. The transducer takes <div><div>hi</div></div> and emits:

[(OPEN, "div"), (OPEN, "div"), (TEXT, "hi"),
 (CLOSE, "div"), (CLOSE, "div")]

That's a flat list. It tells you what was in the input, but not how things relate to each other. A parser needs to produce a tree, a structure where the outer <div> is the parent and the inner <div> is its child:

div
 └── div
      └── "hi"

Our transducer can't do this. Its only writable memory is the stack, and the stack is consumed during validation. Every PUSH is matched by a POP to check nesting. By the time the transducer is done, the stack is empty. There's nowhere to store the tree.

To become a parser, the PDA needs a way to build and keep a tree structure in memory. Not an easy task.

Q: Are there multiple ways to give the PDA tree-building ability?

Yes. Four classical computational models, each adding something different to the same PDA architecture we built in part 2.

Path	What you add	Computation model	Language family
2-Stack machine	A second stack	Stack machine	Forth, PostScript, Factor
Tree rewriting	A pool of cons cells + substitution rules	Lambda calculus / rewriting	Lisp, Scheme
Stack combinators	Combinator primitives (dup, swap, quote, apply)	Combinatory logic	Joy, Cat
Register machine	Writable random-access memory + ALU	Von Neumann	Assembly, C

All four give the PDA persistent, writable memory to store a tree. But each structures that memory differently. Each created a different programming tradition.

The 2-Stack Machine (Forth, Factor)

We add a second LIFO stack. That's it. Two stacks cooperating through the finite control. This is the stack machine, the model behind Forth (Charles Moore, 1970), PostScript, and Factor.

Stack A keeps its original job: validating nesting (PUSH on open, POP on close, CMP to check the match). Stack B is new. It accumulates completed subtrees. When a closing tag matches, the parser pops the validated tag from Stack A and pushes the finished node (with its children) onto Stack B.

Parsing <div><div>hi</div></div>: The transducer reads tokens as before. When it encounters (CLOSE, "div") for the inner div, it pops DIV from Stack A (validation) and pushes a completed node div("hi") onto Stack B. When the outer </div> closes, it pops from Stack A again and assembles the final tree by popping the inner node from Stack B as its child. The result on Stack B: div(div("hi")).
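
The walkthrough above can be sketched in a few lines of Python. One detail is an assumption of this sketch, not something stated in the text: Stack B gets a sentinel (MARK) pushed on every opening tag, so that the closing tag knows where its children start. Nodes are represented as (tag, children) tuples.

```python
MARK = object()  # hypothetical sentinel: marks where an open tag's children begin

def parse(tokens):
    stack_a = []  # Stack A: open tags, used only to validate nesting
    stack_b = []  # Stack B: markers, text, and completed subtrees
    for kind, value in tokens:
        if kind == "OPEN":
            stack_a.append(value)
            stack_b.append(MARK)
        elif kind == "TEXT":
            stack_b.append(value)
        elif kind == "CLOSE":
            # Validation: the closing tag must match the most recent open tag.
            if not stack_a or stack_a.pop() != value:
                raise ValueError(f"mismatched </{value}>")
            # Assembly: pop children off Stack B until we reach the marker.
            children = []
            while (top := stack_b.pop()) is not MARK:
                children.append(top)
            stack_b.append((value, children[::-1]))
    if stack_a:
        raise ValueError("unclosed tag")
    return stack_b

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
tree = parse(tokens)
```

Both stacks are only ever touched at the top, so the machine stays within the two-stack model.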

Two stacks can also simulate a Turing machine tape: one stack holds everything left of the head, the other holds everything right. Move left = pop from left, push to right. Move right = the reverse. This is a well-known equivalence: a 2-stack PDA equals a Turing machine.
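
Here is a minimal sketch of that tape simulation. The class name TwoStackTape and the blank symbol "_" are assumptions for illustration; the convention used is that the cell under the head sits on top of the right stack.

```python
class TwoStackTape:
    """A Turing machine tape built from two LIFO stacks."""

    def __init__(self, cells, blank="_"):
        self.left = []                       # cells left of the head, nearest on top
        self.right = list(reversed(cells))   # head cell on top, then cells to its right
        self.blank = blank

    def read(self):
        return self.right[-1] if self.right else self.blank

    def write(self, symbol):
        if self.right:
            self.right[-1] = symbol
        else:
            self.right.append(symbol)

    def move_right(self):
        # The head cell moves onto the left stack; the next cell becomes the head.
        self.left.append(self.right.pop() if self.right else self.blank)

    def move_left(self):
        # Pop from left, push to right: the popped cell becomes the head.
        self.right.append(self.left.pop() if self.left else self.blank)

tape = TwoStackTape("abc")
tape.move_right()
tape.write("X")   # the tape now reads a X c, with the head on X
```

Running off either end produces blank cells, which is what gives the tape its unbounded length.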

Pros	Cons
Minimal addition, only 2 new operations	Trees must be linearized into stack form
Closest to the existing PDA architecture	Deep trees require deep stacks, no random access
Simple to implement and reason about	Awkward when you need to revisit earlier nodes

Tree Rewriting (Lisp, Scheme)

We replace the stack with a pool of dynamically allocated cons cells. Each cell holds a pair of pointers: left (CAR) and right (CDR), forming a binary tree. Computation is pattern matching on tree structure and substitution: find a subtree that matches a rule, replace it with the result. This is the lambda calculus model: the foundation of Lisp (Alonzo Church, 1936; John McCarthy, 1958), Scheme, and Racket.

The finite control drives parsing as before, but instead of pushing symbols onto a stack, it allocates cons cells in memory. Each opening tag creates a new node. Each closing tag triggers a CONS that links children to their parent. The tree is built directly in memory, with no linearization needed.

What you add to the PDA:

New operations	New data structures
`CONS`, `CAR`, `CDR`	Pool of dynamically allocated `cons` cells

Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it allocates a leaf cell holding "hi". When (CLOSE, "div") arrives for the inner div, a CONS links "hi" as the child of a new div node. When the outer </div> closes, another CONS links the inner div node as the child of the outer div node. The result is a tree in memory: (div . (div . "hi")).
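
A rough sketch of the cons-cell version follows. Two simplifications are assumptions of this sketch: the bookkeeping for open tags uses a plain Python list for brevity, and children are stored as a nil-terminated cons list (None plays the role of nil), so a node with several children works too. The example therefore comes out as the list form (div (div "hi")) and not the dotted pair shown above.

```python
from typing import Any, NamedTuple

class Cons(NamedTuple):
    car: Any  # left pointer
    cdr: Any  # right pointer

def cons_list(items):
    """Build a nil-terminated cons list from a Python list."""
    result = None
    for item in reversed(items):
        result = Cons(item, result)
    return result

def parse(tokens):
    frames = [("#root", [])]  # one (tag, children so far) frame per open tag
    for kind, value in tokens:
        if kind == "OPEN":
            frames.append((value, []))
        elif kind == "TEXT":
            frames[-1][1].append(value)
        elif kind == "CLOSE":
            tag, children = frames.pop()
            if tag != value or not frames:
                raise ValueError(f"mismatched </{value}>")
            # CONS links the finished children to their parent node.
            frames[-1][1].append(Cons(tag, cons_list(children)))
    if len(frames) != 1:
        raise ValueError("unclosed tag")
    return frames[0][1]

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
tree = parse(tokens)[0]
```

Walking the result is plain CAR and CDR: tree.car is the tag, tree.cdr is the list of children.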

Pros	Cons
Trees are the native data structure. No encoding needed	Requires garbage collection or manual memory management
Natural representation for S-expressions: code and data share the same form	More complex memory model than a stack
Pattern matching + substitution is powerful for tree transformations	Pointer chasing is slow on modern hardware (poor cache locality)

Stack Combinators (Joy, Cat)

We keep one stack. Don't add memory. Instead, change what the stack is allowed to hold. In our PDA, the stack holds passive symbols (like DIV). In a combinator machine, the stack holds executable programs. quote takes an operation and pushes it as data. apply pops it and executes it. The stack becomes simultaneously data storage and code. This is the foundation of concatenative languages: Joy (Manfred von Thun, 1990s), Cat, and Factor. The theoretical roots go back to Moses Schönfinkel (1920s) and Haskell Curry (1930s).

The parser quotes partial tree fragments and pushes them onto the stack as data. When a closing tag arrives, combinators like swap and dup rearrange the fragments, and apply assembles them into a larger tree. No heap, no RAM, no second stack. The stack itself is the only memory, and its entries can be either data or programs.

What you add to the PDA:

New operations	New data structures
`DUP`, `SWAP`, `QUOTE`, `APPLY`	None actually. Stack entries just become richer (data + code)

Parsing <div><div>hi</div></div>: When the parser sees (TEXT, "hi"), it quotes the text as a data fragment and pushes it. When (CLOSE, "div") arrives, it quotes a make-div-node operation, swaps it under the "hi" fragment, and applies, producing a quoted div("hi") node on the stack. The outer </div> repeats the pattern, composing the inner node into the outer one. The result: a single quoted tree [div(div("hi"))] on top of the stack.
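
Below is a loose sketch of the idea in Python, where a quoted program is simply a function sitting on the stack. The ordering is an assumption of this sketch and differs slightly from the description above: the node-builder is quoted when the tag opens, so that on close a swap brings it to the top and apply runs it. The helper make_node is hypothetical, and the sketch only handles tags with exactly one child.

```python
def quote(stack, program):
    stack.append(program)      # push code onto the stack as data

def apply(stack):
    program = stack.pop()      # pop the quoted program...
    program(stack)             # ...and run it against the same stack

def swap(stack):
    stack[-1], stack[-2] = stack[-2], stack[-1]

def make_node(tag):
    """Hypothetical word: wrap the fragment on top of the stack in a tag node."""
    def program(stack):
        child = stack.pop()
        stack.append((tag, [child]))
    return program

def run(tokens):
    stack = []
    for kind, value in tokens:
        if kind == "OPEN":
            quote(stack, make_node(value))   # code waits on the stack as data
        elif kind == "TEXT":
            stack.append(value)
        elif kind == "CLOSE":
            swap(stack)    # bring the quoted builder above its child
            apply(stack)   # execute it: child in, node out
    return stack

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
result = run(tokens)
```

The only memory is the one stack; at different moments the same slot holds text, a finished node, or a program waiting to run.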

Pros	Cons
No new memory means zero hardware cost	Hardest model to reason about for most developers
Elegant point-free composition. No variables, no names	Stack shuffling (`swap`, `dup`, `rot`) gets complex fast
Code and data unification is powerful for metaprogramming	Small community, fewer learning resources

The Register Machine (Assembly, C)

We replace the stack with addressable RAM. Add an ALU with arithmetic operations. The finite control gains index registers for addressing. This is the Von Neumann architecture, the model John von Neumann described in 1945, building on Alan Turing's theoretical work (1936). Essentially every mainstream CPU since (x86, ARM, RISC-V) follows this model.

Random access changes everything. The parser allocates tree nodes at arbitrary memory addresses and links them by pointer. Need to revisit a node three levels up? Just follow the address — no popping through a stack. The tree lives in RAM as a set of linked records, each holding a tag name, a pointer to its first child, and a pointer to its next sibling.

What you add to the PDA:

New operations	New data structures
`LOAD`, `STORE`, `ADD`, `SUB`, `MUL`	RAM + ALU + index registers

Parsing <div><div>hi</div></div>: The parser allocates a record at address 0x00 for the outer div. When (OPEN, "div") arrives for the inner div, it allocates at 0x01 and writes 0x00 as its parent pointer. When (TEXT, "hi") arrives, it allocates at 0x02 and writes 0x01 as its parent. The closing tags update child pointers. The result is a linked structure in RAM:

Address	Tag	First child	Next sibling
0x00	div	→ 0x01	null
0x01	div	→ 0x02	null
0x02	"hi"	null	null
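
The same layout can be sketched with a Python list standing in for RAM, where the list index is the address. The record fields mirror the table above plus the parent pointer mentioned in the walkthrough. One assumption of this sketch: child pointers are written as soon as a record is allocated, not when the closing tag arrives.

```python
def parse(tokens):
    memory = []      # flat RAM: address = list index
    current = None   # register: address of the innermost open tag
    for kind, value in tokens:
        if kind in ("OPEN", "TEXT"):
            addr = len(memory)  # allocate the next free address
            memory.append({"tag": value, "parent": current,
                           "first_child": None, "next_sibling": None})
            if current is not None:
                # Link the new record into its parent's child chain.
                if memory[current]["first_child"] is None:
                    memory[current]["first_child"] = addr
                else:
                    last = memory[current]["first_child"]
                    while memory[last]["next_sibling"] is not None:
                        last = memory[last]["next_sibling"]
                    memory[last]["next_sibling"] = addr
            if kind == "OPEN":
                current = addr
        elif kind == "CLOSE":
            if current is None or memory[current]["tag"] != value:
                raise ValueError(f"mismatched </{value}>")
            # Follow the parent pointer: no popping through a stack.
            current = memory[current]["parent"]
    if current is not None:
        raise ValueError("unclosed tag")
    return memory

tokens = [("OPEN", "div"), ("OPEN", "div"), ("TEXT", "hi"),
          ("CLOSE", "div"), ("CLOSE", "div")]
memory = parse(tokens)
```

Any node can be reached by index at any time, which is exactly the random access the stack-based models lack.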

Pros	Cons
Random access: read/write any node by address, O(1)	Most complex addition: RAM, ALU, index registers
Maps directly to real hardware (silicon is random-access)	Requires manual memory management or allocator
Every programming language compiles to this	Pointer arithmetic is a source of bugs

This is the path that won. Not because it's the most elegant: the 2-stack machine is simpler, the tree rewriter is more natural for trees, and the combinator machine needs no extra memory at all. It won because transistors are random-access by design. RAM is cheap to build in silicon. The other three models require the hardware to simulate structured access patterns on top of flat memory, an extra layer the register machine skips entirely.

A side effect: Turing completeness

Each of these four additions was designed to solve a specific problem: giving the PDA enough writable memory to build a parse tree. But each one also has a deeper consequence: it makes the machine Turing complete, a machine capable of computing anything that any computer can compute.

This isn't a coincidence. To build an arbitrary tree, you need to store and retrieve an arbitrary amount of structured data. That's exactly what separates a context-free machine (our PDA) from a Turing machine. The four paths don't just add tree-building, they each provide enough general-purpose memory to simulate a Turing machine tape.

Here's how they compare side by side:

	2-Stack	Tree Rewriting	Stack Combinators	Register Machine
New ops	PUSH_B, POP_B	CONS, CAR, CDR	DUP, SWAP, QUOTE, APPLY	LOAD, STORE, ADD, SUB, MUL
New op count	2	3	4	5
Total ops (11 + N)	13	14	15	16
New data structure	Second LIFO stack	Pool of cons cells	None (enriched stack)	RAM + ALU + index registers
Primary data structure	Two LIFO stacks	Binary tree (cons cells)	One stack (data + code)	Flat array of addressed cells
How composition works	Stack effects (before/after)	S-expressions (nested lists)	Point-free sequencing	Instruction sequences
Why it's Turing complete	Two stacks simulate a tape	Tree substitution simulates a tape	quote + apply = self-modifying programs	RAM = direct tape simulation

We started at the bottom of this hierarchy in part 1 with a finite state machine. In part 2 we added a stack and crossed into context-free territory. Now we've seen the four doors at the top: four ways to cross into Turing completeness. All equivalent in power, all different in structure.

Q: If the PDA is so rigid, can it work with something flexible like an LLM?

Yes, and this is where things get practical.

From PDA to transformer, step by step:

PDA: hand-coded rules, explicit stack, one language (HTML)
Learned PDA: same architecture, learn transition rules from data (Grefenstette et al., 2015)
RNN: replace explicit stack with hidden state vector; approximates a stack but struggles with deep nesting
LSTM: add gates to hidden state; better stack approximation; pre-transformer breakthrough
Transformer: replace sequential hidden state with attention over entire input; random access to every previous token, not just stack top

Each step trades transparency for generality. The PDA handles one grammar perfectly and you can see why. The transformer handles every grammar approximately and you can't see how.

Critical limit: The PDA has an unbounded stack, so it handles arbitrarily deep nesting, always. The transformer has finite context and finite precision, so in practice it behaves like a bounded-depth PDA. It can approximate stack behavior within its context window, but it can't guarantee correct nesting beyond that.

Research backing: Attention heads in transformers learn to simulate stack operations. Specific heads match opening tags to closing tags by attending back to the most recent unmatched opener — the attention pattern literally looks like LIFO behavior (Murty et al., Merrill et al.).

The practical combination: grammar-constrained decoding. Run the PDA alongside the LLM. The LLM generates tokens based on meaning. The PDA vetoes structurally invalid ones.

Step 1: LLM emits "<ul>"       stack: [UL]         ✅
Step 2: LLM emits "<li>"       stack: [UL, LI]     ✅
Step 3: LLM emits "Apple"      stack: [UL, LI]     ✅
Step 4: LLM wants "</ul>"      stack: [UL, LI]     ❌ BLOCKED
        top is LI, not UL
        next best: "</li>"     stack: [UL]         ✅
Step 5: LLM emits "<li>"       stack: [UL, LI]     ✅
Step 6: LLM emits "Banana"                         ✅
Step 7: LLM emits "</li>"      stack: [UL]         ✅
Step 8: LLM emits "</ul>"      stack: []           ✅

Step 4 is the key: the LLM tried to close </ul> while <li> was still open. The PDA masked it out, and the LLM's second-best token </li> took over.
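
The trace can be reproduced with a toy sketch. Real tools mask logits over the model's whole vocabulary at every step; here the "LLM" is faked as a ranked list of candidate tokens per step, and the function names allowed and constrained_decode are assumptions for illustration.

```python
import re

def allowed(stack, token):
    """Return the new stack if the token is structurally valid, else None."""
    close = re.fullmatch(r"</(\w+)>", token)
    if close:
        # A closing tag is valid only if it matches the top of the stack.
        if stack and stack[-1] == close.group(1).upper():
            return stack[:-1]
        return None
    opened = re.fullmatch(r"<(\w+)>", token)
    if opened:
        return stack + [opened.group(1).upper()]
    return stack  # plain text never changes the stack

def constrained_decode(ranked_steps):
    stack, output = [], []
    for candidates in ranked_steps:
        # Take the highest-ranked candidate that the PDA does not veto.
        for token in candidates:
            new_stack = allowed(stack, token)
            if new_stack is not None:
                stack = new_stack
                output.append(token)
                break
        else:
            raise ValueError("every candidate was structurally invalid")
    return output, stack

# Fake model output: at step 4 it prefers "</ul>" and falls back to "</li>".
ranked_steps = [["<ul>"], ["<li>"], ["Apple"], ["</ul>", "</li>"],
                ["<li>"], ["Banana"], ["</li>"], ["</ul>"]]
output, final_stack = constrained_decode(ranked_steps)
```

The model never learns that it was overruled. The veto happens at sampling time, outside the network.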

	LLM	PDA
Decides	"Apple" and "Banana" are fruits	`</ul>` can't close before `</li>`
Good at	Meaning, creativity, world knowledge	Structure, nesting, grammar rules
Fails at	Guaranteeing valid nesting	Knowing what a fruit is

This combination is already in production:

Outlines (Python): attaches a grammar to any HuggingFace model
llama.cpp GBNF: grammar-constrained decoding in C++
Guidance (Microsoft): interleaves free generation with grammar constraints
OpenAI structured outputs: grammar enforcement server-side via JSON schema

The PDA we built in parts 1 and 2 is the exact mechanism these tools use to guarantee structure. A hand-coded automaton, the simplest kind of program, guards the output of the most complex kind.

What's next

In parts 1 and 2 we built a tokenizer. In this post we saw four ways to extend it into a full parser, and how the simplest of those machines can guard the output of the most powerful ones.

Next up: we pick the register machine path and build a parser that writes a tree to memory.

Stay tuned...

Building a Tokenizer from Scratch [part 2]

Jocer Franquiz — Sat, 28 Mar 2026 16:30:18 +0000

From FSM to PDA

In part 1, we built a working FSM that recognizes <div>text</div> using just 7 primitives mapped 1:1 to assembly opcodes. But FSMs have a hard limit: they can't handle nested structures like <div><div>hello</div></div>. In this post, we climb the Chomsky hierarchy from finite state machines to pushdown automata, build a PDA that recognizes nested <div> tags, and then turn it into a transducer that emits tokens. In other words we are building the core of a lexer.

Q: Why can't FSMs handle nested structures?

Because an FSM has a fixed number of states, and that's all the memory it has.

Consider nested divs:

<div><div><div>hello</div></div></div>

To correctly match closing tags, you need to count how many <div>s you've opened so you know how many </div>s to expect. An FSM with, say, 12 states can handle nesting up to some fixed depth — but someone can always write HTML nested one level deeper than your states can track.

Put simply:

1 level deep → you need states to remember "I'm 1 div in"
2 levels deep → different states for "I'm 2 divs in"
n levels deep → you'd need n distinct state configurations

Since n is unbounded but your state count is finite, you run out of states. There's no way to encode "how deep am I?" in a fixed-size state register when the depth is arbitrary.

This is exactly what the stack in a pushdown automaton solves. Every time you see <div>, you push onto the stack; every </div>, you pop. The stack grows as needed, giving you unbounded memory in one direction (depth), which is precisely what nesting requires.
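
As a quick illustration, here is a sketch of the push/pop check in Python. It finds tags with a regular expression as a shortcut (the machine in this series works symbol by symbol), and the function name is_well_nested is an assumption.

```python
import re

def is_well_nested(html):
    """Check tag nesting with a stack: push on open, pop and compare on close."""
    stack = []
    for closing, tag in re.findall(r"<(/?)(\w+)>", html):
        if not closing:
            stack.append(tag)           # opening tag: remember it
        elif not stack or stack.pop() != tag:
            return False                # closing tag with no matching opener
    return not stack                    # every opened tag must have been closed

# The stack just grows with depth; no state count has to be chosen in advance.
deep = "<div>" * 1000 + "hello" + "</div>" * 1000
```

The same function handles depth 3 and depth 1000 without any change, which is the thing a fixed set of states cannot do.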

Q: So, FSMs need memory to "remember" where they are in the nested structure, correct?

It depends on what you mean by "memory."

An FSM does have memory, but it's extremely limited. The current state itself is the memory. If your FSM is in state S7, that encodes something about what it has seen so far. That's how it knows it's partway through reading </div> vs still in content.

But that memory is:

Fixed-size, bounded by the number of states, decided at design time
Lossy, it can't remember the full history of what it read, only a compressed summary encoded in which state it's in
Non-growable, it can't allocate more memory as input gets longer

So when people say "FSMs have no memory," they really mean no auxiliary memory: no stack, no tape, no buffer that grows with input. The only thing an FSM "remembers" is which state it's currently in.

A PDA adds exactly one auxiliary memory structure: a stack. That single addition is enough to go from "can't handle nesting" to "can handle nesting."

Q: What is the stack, formally?

A stack is a last-in, first-out (LIFO) data structure with two operations:

PUSH(symbol): place a symbol on top
POP: remove and return the top symbol

So if your stack is ABB, a PUSH(C) gives ABBC, and a POP returns C leaving ABB again.

The key properties:

You can only access the top: no reading the middle or bottom
Unbounded size: it can grow as large as needed (unlike the FSM's fixed states)
It starts empty: at the beginning of computation, the stack contains nothing (or a special bottom marker like $)

That's it. It's the simplest possible auxiliary memory that's still useful. You get only sequential access from one end. But that restriction is exactly what matches the structure of nesting: the most recently opened thing is the next thing that needs to close.
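The ABB example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a Python list as the backing storage; the `Stack` class name and its methods are illustrative, not part of the formal definition.

```python
class Stack:
    """LIFO stack: only the top is accessible."""

    def __init__(self):
        self._items = []              # starts empty

    def push(self, symbol):
        self._items.append(symbol)    # PUSH: place symbol on top

    def pop(self):
        return self._items.pop()      # POP: remove and return the top symbol

    def __str__(self):
        return "".join(self._items)   # written bottom-to-top, e.g. "ABB"


stack = Stack()
for symbol in "ABB":
    stack.push(symbol)

stack.push("C")
after_push = str(stack)   # contents after PUSH(C)
popped = stack.pop()      # symbol returned by POP
after_pop = str(stack)    # contents after POP
```

The class deliberately exposes no way to read the middle or the bottom, which mirrors the "top only" restriction.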

Q: Which operations and structures do we need to add to the seven primitives to create a PDA that handles nested structures?

We need two new operations and one new structure:

New structure

STACK: a LIFO storage addressed by a stack pointer (SP)

Updated primitives table

#	Operation	Description	Type	Assembly equivalent
1	READ	Load current input symbol	FSM	`MOV reg, [src]`
2	CMP	Test value equality	FSM	`CMP reg, imm`
3	BRANCH	Conditional jump	FSM	`JE / JNE label`
4	JMP	Unconditional jump	FSM	`JMP label`
5	ASSIGN	Update state register	FSM	`MOV reg, imm`
6	ADVANCE	Move input pointer forward	FSM	`INC reg`
7	HALT	Terminate execution	FSM	`HLT`
8	PUSH	Place symbol on top of stack	PDA	`PUSH reg`
9	POP	Remove and return top of stack	PDA	`POP reg`

Two operations and one structure turn an FSM into a PDA.

x86 has native PUSH/POP instructions with a hardware stack pointer (SP/ESP/RSP), so the 1:1 mapping to CPU instructions still holds. PDAs are just as mechanically efficient as FSMs. The only cost is that memory usage now grows with nesting depth instead of being fixed.

You could also add a PEEK operation (read top without removing), but it's not strictly necessary — you can always POP then PUSH back. It's a convenience, not a primitive.
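A minimal sketch of that POP-then-PUSH trick, assuming the stack is a plain Python list whose last element is the top (the function name `peek` is illustrative):

```python
def peek(stack):
    """Read the top symbol using only POP and PUSH."""
    top = stack.pop()      # POP: remove the top symbol
    stack.append(top)      # PUSH: put it straight back
    return top


stack = ["$", "DIV"]
top = peek(stack)
```

After the call, the stack holds the same symbols as before, so PEEK adds convenience but no new power.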

Q: What's the simplest PDA architecture?

A PDA architecture needs exactly four components:

Input tape — the string being read (read-only, left-to-right)
Input pointer — current position on the tape
State register — holds the current state from a finite set Q
Stack — LIFO storage over alphabet Γ

And the nine operations from our table are the instruction set that operates on these four components.

Compared to the FSM architecture from part 1, we added exactly one component (the stack) and two operations (PUSH, POP). The stack pointer (SP) lives inside the finite control alongside the state register — it's a register, not part of the stack itself.

The PDA architecture maps onto a subset of a von Neumann architecture:

PDA component	Von Neumann equivalent
Input tape	Memory (data segment)
Input pointer	Program counter / index register
State register	CPU register (accumulator)
Stack pointer	SP register
Stack (in RAM)	Stack segment of memory
9 operations	Subset of the instruction set

What's missing for full Von Neumann: writable memory, random access, arithmetic, and stored program. The automata hierarchy maps directly onto hardware capability:

FSM → combinational logic + a register
PDA → CPU + stack segment
Turing machine → CPU + full random-access memory = von Neumann

Building a PDA for nested `<div>` recognition

Pseudocode

STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CLOSE_SLASH,
        CHECK_EMPTY, ACCEPT, DEAD

STACK ALPHABET: Γ = { DIV, $ }

START:
  PUSH($)                    -- bottom marker
  ASSIGN(S0)

S0:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → TAG_OPEN)
  JMP(DEAD)

TAG_OPEN:
  ADVANCE
  READ(ch)
  CMP(ch, '/')
  BRANCH(eq → CLOSE_SLASH)   -- it's a closing tag
  CMP(ch, 'd')
  BRANCH(eq → TAG_NAME)      -- it's an opening tag
  JMP(DEAD)

TAG_NAME:                     -- read "div>"
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  PUSH(DIV)                   -- ← matched <div>, push it
  ADVANCE
  ASSIGN(CONTENT)

CONTENT:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → TAG_OPEN)      -- ← could be nested <div> or </div>
  CMP(ch, EOF)
  BRANCH(eq → DEAD)           -- input ended with open tags
  ADVANCE                     -- consume text character
  JMP(CONTENT)

CLOSE_SLASH:                  -- inside </
  ADVANCE
  READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  POP(top)                    -- ← pop the stack
  CMP(top, DIV)
  BRANCH(ne → DEAD)           -- mismatch
  ADVANCE
  READ(ch)
  CMP(ch, EOF)
  BRANCH(eq → CHECK_EMPTY)
  ASSIGN(CONTENT)             -- more input, keep going

CHECK_EMPTY:
  POP(top)
  CMP(top, $)
  BRANCH(eq → ACCEPT)         -- stack empty = all tags matched
  JMP(DEAD)                   -- leftover tags = unbalanced

ACCEPT:
  HALT(accept)

DEAD:
  HALT(reject)

The entire FSM logic is reused. The only additions are four stack operations: one PUSH($) at the start to plant the bottom marker, one PUSH per <div>, one POP per </div>, and one final POP to verify the stack is empty.
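For readers who prefer runnable code, here is a Python sketch of the same recognizer. It follows the decisions of the pseudocode, but as a simplification it matches whole tags with `str.startswith` instead of reading one character per state. The function name and the `ended_on_close` flag are assumptions made for this sketch.

```python
def accepts_nested_divs(text):
    """Recognize balanced <div>...</div> nesting with an explicit stack."""
    stack = ["$"]                    # PUSH($): bottom marker
    pos = 0
    ended_on_close = False           # EOF is only legal right after a closing tag
    if not text.startswith("<"):
        return False                 # S0: input must start with a tag
    while pos < len(text):
        if text.startswith("<div>", pos):
            stack.append("DIV")      # PUSH(DIV) on every opening tag
            pos += len("<div>")
            ended_on_close = False
        elif text.startswith("</div>", pos):
            if stack.pop() != "DIV":
                return False         # closing tag with nothing open
            pos += len("</div>")
            ended_on_close = True
        elif text[pos] == "<":
            return False             # any other tag → DEAD
        else:
            pos += 1                 # CONTENT: consume a text character
            ended_on_close = False
    # CHECK_EMPTY: the only symbol left must be the bottom marker
    return ended_on_close and stack.pop() == "$"
```

The stack is the only part of this function whose size depends on the input, so the same code handles depth 2 and depth 200.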

Trace: `<div><div>hi</div></div>`

Input: < d i v > < d i v > h i < / d i v > < / d i v > EOF
       0 1 2 3 4 5 6 7 8 9 ...

Stack starts: [ $ ]

1. First <div>

State     | Read | Action              | Stack
----------|------|----------------------|--------
S0        | <    | match, → TAG_OPEN   | [ $ ]
TAG_OPEN  | d    | not '/', → TAG_NAME | [ $ ]
TAG_NAME  | i    | match               | [ $ ]
          | v    | match               | [ $ ]
          | >    | match, PUSH(DIV)    | [ $ DIV ]
          |      | → CONTENT           |

2. Second <div>

State     | Read | Action              | Stack
----------|------|----------------------|-----------
CONTENT   | <    | → TAG_OPEN          | [ $ DIV ]
TAG_OPEN  | d    | not '/', → TAG_NAME | [ $ DIV ]
TAG_NAME  | i    | match               | [ $ DIV ]
          | v    | match               | [ $ DIV ]
          | >    | match, PUSH(DIV)    | [ $ DIV DIV ]
          |      | → CONTENT           |

3. Text "hi"

State     | Read | Action              | Stack
----------|------|----------------------|-------------
CONTENT   | h    | not '<', ADVANCE    | [ $ DIV DIV ]
CONTENT   | i    | not '<', ADVANCE    | [ $ DIV DIV ]

4. First </div>

State       | Read | Action            | Stack
------------|------|--------------------|--------------
CONTENT     | <    | → TAG_OPEN        | [ $ DIV DIV ]
TAG_OPEN    | /    | → CLOSE_SLASH     | [ $ DIV DIV ]
CLOSE_SLASH | d    | match             | [ $ DIV DIV ]
            | i    | match             | [ $ DIV DIV ]
            | v    | match             | [ $ DIV DIV ]
            | >    | match, POP → DIV  | [ $ DIV ]
            |      | DIV == DIV ✓      |
            |      | read next         |

5. Second </div>

State       | Read | Action            | Stack
------------|------|--------------------|---------
CONTENT     | <    | → TAG_OPEN        | [ $ DIV ]
TAG_OPEN    | /    | → CLOSE_SLASH     | [ $ DIV ]
CLOSE_SLASH | d    | match             | [ $ DIV ]
            | i    | match             | [ $ DIV ]
            | v    | match             | [ $ DIV ]
            | >    | match, POP → DIV  | [ $ ]
            |      | DIV == DIV ✓      |
            |      | read next         |

6. End of input

State       | Read | Action            | Stack
------------|------|--------------------|---------
            | EOF  | → CHECK_EMPTY     | [ $ ]
CHECK_EMPTY |      | POP → $           | [ ]
            |      | $ == $ ✓          |
            |      | → ACCEPT          |

Result: ACCEPT

The stack grew to depth 2 (two DIVs), then shrank back to just $, confirming every open tag had a matching close. An FSM would have needed dedicated states for each depth level. This PDA handles any depth with the same 8 states.

Q: Our PDA can parse nested structures and output accept or reject. But a real tokenizer needs to emit tokens, right?

Right. The PDA is still just a recognizer. To become a tokenizer, we need to understand what changes formally.

A recognizer takes an input string and produces a single binary answer: accept or reject.

Input: a string over an input alphabet Σ
Output: yes or no

A transducer takes an input string and produces an output string (or sequence of output symbols). It transforms input into output.

Input: a string over an input alphabet Σ
Output: a string y over an output alphabet Δ

A tokenizer is a transducer: it reads characters (Σ = ASCII) and emits tokens (Δ = the set of possible tokens).

Q: What is an output buffer?

An output buffer is a tape, simpler than the stack:

Append-only: you can only write to the end, never read back or modify
Write-only: the machine never consults its own output to make decisions
Unbounded: it grows as tokens are emitted
Starts empty

Output structure

OUTPUT ALPHABET:  Δ = TokenType × Σ*
TOKEN TYPES:      TokenType = { OPEN_TAG, CLOSE_TAG, TEXT }
OUTPUT BUFFER:    append-only list of (type, value) pairs, starts empty

For <div><div>hi</div></div>, the output buffer should end up as:

[
  (OPEN_TAG,  "div"),
  (OPEN_TAG,  "div"),
  (TEXT,      "hi"),
  (CLOSE_TAG, "div"),
  (CLOSE_TAG, "div")
]

Updated operations table (11 operations)

#	Operation	Description	Type	Assembly
1	READ	Load current input symbol	FSM	`MOV reg, [src]`
2	CMP	Test value equality	FSM	`CMP reg, reg/imm`
3	BRANCH	Conditional jump	FSM	`JE / JNE label`
4	JMP	Unconditional jump	FSM	`JMP label`
5	ASSIGN	Update state register	FSM	`MOV reg, imm`
6	ADVANCE	Move input pointer forward	FSM	`INC reg`
7	HALT	Terminate execution	FSM	`HLT`
8	PUSH	Place symbol on top of stack	PDA	`PUSH reg`
9	POP	Remove and return top of stack	PDA	`POP reg`
10	MARK	Save current input position	Transducer	`MOV reg, reg`
11	EMIT	Append token to output buffer	Transducer	`MOV [dst], reg`

Transducer Architecture: 6 components

Input tape (read-only)
Input pointer
State register
Stack pointer + Stack
Mark register — saves start position of current token
Output buffer — append-only, unbounded

Pseudocode (transducer)

STATES: S0, TAG_OPEN, TAG_NAME, CONTENT, CONTENT_LOOP,
        CONTENT_DONE, CLOSE_SLASH, CHECK_EMPTY, ACCEPT, DEAD

INPUT ALPHABET:  Σ = ASCII
STACK ALPHABET:  Γ = { DIV, $ }
OUTPUT ALPHABET: Δ = { OPEN_TAG, CLOSE_TAG, TEXT } × Σ*

START:
  PUSH($)
  ASSIGN(S0)

S0:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → TAG_OPEN)
  JMP(DEAD)

TAG_OPEN:
  ADVANCE
  READ(ch)
  CMP(ch, '/')
  BRANCH(eq → CLOSE_SLASH)
  CMP(ch, 'd')
  BRANCH(eq → TAG_NAME)
  JMP(DEAD)

TAG_NAME:
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  PUSH(DIV)
  EMIT(OPEN_TAG, "div")           -- ← emit opening tag token
  ADVANCE
  ASSIGN(CONTENT)

CONTENT:
  MARK                             -- ← save start of text region
CONTENT_LOOP:
  READ(ch)
  CMP(ch, '<')
  BRANCH(eq → CONTENT_DONE)
  CMP(ch, EOF)
  BRANCH(eq → DEAD)
  ADVANCE
  JMP(CONTENT_LOOP)

CONTENT_DONE:
  CMP(mark, pos)                   -- any text between tags?
  BRANCH(eq → TAG_OPEN)            -- no text, skip emit
  EMIT(TEXT, input[mark..pos])     -- ← emit text token
  JMP(TAG_OPEN)

CLOSE_SLASH:
  ADVANCE
  READ(ch); CMP(ch, 'd'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'i'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, 'v'); BRANCH(ne → DEAD)
  ADVANCE
  READ(ch); CMP(ch, '>'); BRANCH(ne → DEAD)
  POP(top)
  CMP(top, DIV)
  BRANCH(ne → DEAD)
  EMIT(CLOSE_TAG, "div")          -- ← emit closing tag token
  ADVANCE
  READ(ch)
  CMP(ch, EOF)
  BRANCH(eq → CHECK_EMPTY)
  ASSIGN(CONTENT)

CHECK_EMPTY:
  POP(top)
  CMP(top, $)
  BRANCH(eq → ACCEPT)
  JMP(DEAD)

ACCEPT:
  HALT(accept)                     -- output buffer contains all tokens

DEAD:
  HALT(reject)

Trace: `<div><div>hi</div></div>` (transducer)

Input: <  d  i  v  >  <  d  i  v  >  h  i  <  /  d  i  v  >  <  /  d  i  v  >  EOF
Pos:   0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Stack starts:  [ $ ]
Output starts: [ ]

1. First <div>

State       | Pos | Read | Action              | Stack     | Output
------------|-----|------|----------------------|-----------|-------
S0          |  0  | <    | match, → TAG_OPEN   | [ $ ]     |
TAG_OPEN    |  1  | d    | not '/', → TAG_NAME | [ $ ]     |
TAG_NAME    |  2  | i    | match               | [ $ ]     |
            |  3  | v    | match               | [ $ ]     |
            |  4  | >    | match               | [ $ ]     |
            |     |      | PUSH(DIV)           | [ $ DIV ] |
            |     |      | EMIT(OPEN_TAG,"div")|           | ← (OPEN_TAG, "div")
            |  5  |      | ADVANCE, → CONTENT  |           |

2. Enter CONTENT, see < immediately

State        | Pos | Read | Action             | Stack     | Output
-------------|-----|------|--------------------|-----------|-------
CONTENT      |  5  |      | MARK → mark=5      | [ $ DIV ] |
CONTENT_LOOP |  5  | <    | CMP(mark,pos): 5=5 |           |
             |     |      | equal, skip emit   |           |
             |     |      | → TAG_OPEN         |           |

3. Second <div>

State       | Pos | Read | Action              | Stack         | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN    |  6  | d    | not '/', → TAG_NAME | [ $ DIV ]     |
TAG_NAME    |  7  | i    | match               | [ $ DIV ]     |
            |  8  | v    | match               | [ $ DIV ]     |
            |  9  | >    | match               | [ $ DIV ]     |
            |     |      | PUSH(DIV)           | [ $ DIV DIV ] |
            |     |      | EMIT(OPEN_TAG,"div")|               | ← (OPEN_TAG, "div")
            | 10  |      | ADVANCE, → CONTENT  |               |

4. Text "hi"

State        | Pos | Read | Action             | Stack         | Output
-------------|-----|------|--------------------|---------------|-------
CONTENT      | 10  |      | MARK → mark=10     | [ $ DIV DIV ] |
CONTENT_LOOP | 10  | h    | not '<', ADVANCE   |               |
CONTENT_LOOP | 11  | i    | not '<', ADVANCE   |               |
CONTENT_LOOP | 12  | <    | CMP(mark,pos):10≠12|              |
             |     |      | EMIT(TEXT, "hi")   |               | ← (TEXT, "hi")
             |     |      | → TAG_OPEN         |               |

5. First </div>

State       | Pos | Read | Action              | Stack         | Output
------------|-----|------|----------------------|---------------|-------
TAG_OPEN    | 13  | /    | → CLOSE_SLASH       | [ $ DIV DIV ] |
CLOSE_SLASH | 14  | d    | match               |               |
            | 15  | i    | match               |               |
            | 16  | v    | match               |               |
            | 17  | >    | match               |               |
            |     |      | POP → DIV           | [ $ DIV ]     |
            |     |      | DIV == DIV ✓        |               |
            |     |      | EMIT(CLOSE_TAG,"div")|              | ← (CLOSE_TAG, "div")
            | 18  | <    | ADVANCE, not EOF    |               |
            |     |      | → CONTENT           |               |

6. Enter CONTENT, see < immediately

State        | Pos | Read | Action              | Stack     | Output
-------------|-----|------|----------------------|-----------|-------
CONTENT      | 18  |      | MARK → mark=18      | [ $ DIV ] |
CONTENT_LOOP | 18  | <    | CMP(mark,pos): 18=18|           |
             |     |      | equal, skip emit    |           |
             |     |      | → TAG_OPEN          |           |

7. Second </div>

State       | Pos | Read | Action               | Stack     | Output
------------|-----|------|------------------------|-----------|-------
TAG_OPEN    | 19  | /    | → CLOSE_SLASH          | [ $ DIV ] |
CLOSE_SLASH | 20  | d    | match                  |           |
            | 21  | i    | match                  |           |
            | 22  | v    | match                  |           |
            | 23  | >    | match                  |           |
            |     |      | POP → DIV              | [ $ ]     |
            |     |      | DIV == DIV ✓           |           |
            |     |      | EMIT(CLOSE_TAG, "div") |           | ← (CLOSE_TAG, "div")
            | 24  | EOF  | ADVANCE, EOF           |           |
            |     |      | → CHECK_EMPTY          |           |

8. End of input

State       | Pos | Read | Action              | Stack | Output
------------|-----|------|----------------------|-------|-------
CHECK_EMPTY |     |      | POP → $              | [ ]   |
            |     |      | $ == $ ✓             |       |
            |     |      | → ACCEPT             |       |
ACCEPT      |     |      | HALT(accept)         |       |

Final output buffer

[
  (OPEN_TAG,  "div"),
  (OPEN_TAG,  "div"),
  (TEXT,      "hi"),
  (CLOSE_TAG, "div"),
  (CLOSE_TAG, "div")
]

Ten states, and the same stack operations as the recognizer. The three EMIT sites fired five times in total, each at exactly the right boundary, and the CMP(mark, pos) guard correctly skipped the two empty text regions between consecutive tags.
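The transducer can be sketched in Python as follows. As a simplification it matches whole tags with `str.startswith` rather than one character per state, and it returns `None` on rejection; the function name and the `ended_on_close` flag are assumptions made for this sketch, not part of the formal model.

```python
def tokenize_divs(text):
    """Return a list of (type, value) tokens, or None if the input is rejected."""
    stack = ["$"]                                    # PUSH($)
    output = []                                      # append-only output buffer
    pos = 0
    ended_on_close = False
    if not text.startswith("<"):
        return None                                  # S0: must start with a tag
    while pos < len(text):
        if text.startswith("<div>", pos):
            stack.append("DIV")                      # PUSH(DIV)
            output.append(("OPEN_TAG", "div"))       # EMIT
            pos += len("<div>")
            ended_on_close = False
        elif text.startswith("</div>", pos):
            if stack.pop() != "DIV":                 # POP and compare
                return None
            output.append(("CLOSE_TAG", "div"))      # EMIT
            pos += len("</div>")
            ended_on_close = True
        elif text[pos] == "<":
            return None                              # unknown tag → DEAD
        else:
            mark = pos                               # MARK: start of text region
            while pos < len(text) and text[pos] != "<":
                pos += 1                             # ADVANCE through the text
            output.append(("TEXT", text[mark:pos]))  # EMIT the marked slice
            ended_on_close = False
    if ended_on_close and stack.pop() == "$":        # CHECK_EMPTY
        return output
    return None
```

Because the text branch is only entered on a non-`<` character, the marked slice is never empty, which plays the role of the CMP(mark, pos) guard.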

Q: Is the tokenizer a lexer?

Yes. Tokenizer, lexer, lexical analyzer, scanner, all the same thing. Characters in, tokens out.

The distinction that matters is lexer vs parser:

Lexer (what we built): characters in, tokens out. Flat stream. Answers: "what are the words?"
Parser: tokens in, tree out. Hierarchical structure. Answers: "how do the words relate to each other?"

Our transducer is a lexer. It outputs a flat list:

(OPEN_TAG, "div"), (OPEN_TAG, "div"), (TEXT, "hi"), (CLOSE_TAG, "div"), (CLOSE_TAG, "div")

A parser would produce a tree:

div
 └─ div
     └─ "hi"

Our PDA already knows the tree structure in some way: the stack depth at any point tells us exactly where we are in the hierarchy. We just don't capture it. We throw that information away and emit a flat stream. Building the tree means using the stack to capture that structure as we go.
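As a rough preview of that idea, here is a Python sketch that rebuilds the hierarchy from the flat token list by keeping a stack of open nodes. The dictionary layout, the synthetic `root` node, and the function name `build_tree` are all assumptions for illustration, and the tokens are assumed to be already balanced.

```python
def build_tree(tokens):
    """Turn a flat, balanced token list into nested dictionaries."""
    root = {"tag": "root", "children": []}   # hypothetical synthetic root
    stack = [root]                           # stack of currently open nodes
    for kind, value in tokens:
        if kind == "OPEN_TAG":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)   # attach to the innermost open node
            stack.append(node)                   # it becomes the new innermost node
        elif kind == "CLOSE_TAG":
            stack.pop()                          # the innermost node is finished
        else:
            stack[-1]["children"].append(value)  # text belongs to the innermost node
    return root["children"]


tokens = [
    ("OPEN_TAG", "div"),
    ("OPEN_TAG", "div"),
    ("TEXT", "hi"),
    ("CLOSE_TAG", "div"),
    ("CLOSE_TAG", "div"),
]
tree = build_tree(tokens)
```

Here the stack stores nodes rather than bare DIV symbols, which is what lets the depth information survive in the output.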

What's next

We started this post with an FSM that couldn't count, and ended with a transducer that emits tokens from arbitrarily nested HTML. Along the way, a pattern kept surfacing: climbing the Chomsky hierarchy (FSM → PDA → Turing machine) mirrors the historical evolution of CPU architecture (register → stack segment → full random-access memory). That parallel is not a coincidence, and we will keep pulling on that thread.

Next up: a lexer becomes a parser, a Turing Complete machine.

If you have questions or want to dig deeper into any of these topics, drop a comment below. Follow to catch Part 3 when it lands. Stay tuned...

Finite State Machine from first principles [part 1]

Jocer Franquiz — Tue, 24 Mar 2026 22:12:17 +0000

From Combinational Logic to FSM: Q/A

The starting point is combinational logic. The idea is to understand what we really need to add to it to create a Finite State Machine.

By the end, we will have built a working FSM that can recognize a complete <div>text</div> structure, using just 7 primitives. Funnily enough, those primitives are equivalent to 7 Assembly opcodes (not a coincidence, this was all planned by the Elders).

Q: Automata theory has a class hierarchy, starting from combinational logic. Right?

Yes, that's a clean way to frame it. The hierarchy goes from "no memory" to "more memory":

Combinational logic — output depends only on current input. No memory, no state. Just pure boolean functions (AND, OR, etc.).
Finite State Machine (FSM) — adds memory in the form of state. Output depends on current input + current state. Memory is finite and fixed.
Pushdown Automaton (PDA) — FSM + a stack. The stack gives it unlimited but structured memory (LIFO). This is what recognizes nested structures like matching parentheses.
Turing Machine — FSM + an infinite tape (read/write). The most powerful model; can simulate any computation.

Each level can recognize a strictly larger class of languages (the Chomsky hierarchy):

Machine	Language class	Example
Combinational	—	Boolean expression
FSM	Regular	`/ab*c/`
PDA	Context-free	Balanced `{}[]()`
Turing	Recursively enumerable	Anything computable

For parsing HTML, you need at minimum a PDA (because of nested tags), but in practice browsers use a stateful tokenizer (FSM) + a tree builder (stack-based) — which together behave like a PDA.

Q: Which combinational logic operators do we need to implement a finite state machine? Conceptually.

Conceptually, you only need three:

AND — "state is X and input is Y"
OR — "condition A or condition B leads to the same result"
NOT — negate a condition

These three form a functionally complete set — you can express any boolean function with just AND, OR, NOT.

The core of an FSM is its transition function:

next_state = f(current_state, input)

That function is purely combinational — it's just a truth table. Every row is a conjunction (AND) of conditions, and you OR the rows together to select the next state. Something like:

if state == A AND input == 'x' → go to B
if state == A AND input == 'y' → go to C
if state == B AND input == 'x' → go to A
...

Each line is AND. Multiple lines that share an outcome are OR'd together.

NOT comes in when you need to express "any input that is not this character."

That's the combinational part. The state register (what actually holds the current state between steps) is the sequential part — it's where memory enters the picture.
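The split between the combinational part and the state register can be sketched in Python with a lookup table. The three rows come from the example above; sending every unlisted pair to a hypothetical `DEAD` state is an assumption of this sketch, since the remaining rows were left open.

```python
# Truth table for next_state = f(current_state, input): one row per entry.
TRANSITIONS = {
    ("A", "x"): "B",
    ("A", "y"): "C",
    ("B", "x"): "A",
}


def step(state, symbol):
    """Combinational part: a pure function of (state, input)."""
    # Assumption: pairs not listed in the table go to a hypothetical DEAD state.
    return TRANSITIONS.get((state, symbol), "DEAD")


def run(start, symbols):
    """Sequential part: the `state` variable is the state register."""
    state = start
    for symbol in symbols:
        state = step(state, symbol)
    return state
```

For example, `run("A", "xx")` walks A → B → A, while `run("A", "xy")` falls into `DEAD` because no row covers state B with input y.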

Q: What else do we need from a computational point of view? Something like `inc`, `if/else`, `cmp`, etc.

Here is the complete set of primitive operations for an FSM:

Operation	What it does	Why an FSM needs it
READ	Get the current input symbol	The machine must consume input
CMP	Test equality of two values	Check state and input against known constants
BRANCH	Conditional jump	Select which transition fires
JMP	Unconditional jump	Loop back to process next character
ASSIGN	Write a value to a register	Update the state register
ADVANCE	Move input pointer forward	Step through the input string
HALT	Stop execution	Accept or reject; terminate the machine

Seven operations total. No arithmetic, no stack, no heap.

Example: an FSM that accepts strings matching ab*c (an a, then zero or more bs, then c).

States: S0 (start), S1 (got a), ACCEPT, DEAD

; Convention: BRANCH jumps when the preceding CMP found the two values NOT equal.
LOOP:
    READ   char                ; get current input symbol

    ; ── S0: expecting 'a' ──
    CMP    state, S0
    BRANCH not_s0
    CMP    char, 'a'
    BRANCH s0_fail
    ASSIGN state, S1
    JMP    next

  s0_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s0:
    ; ── S1: expecting 'b' or 'c' ──
    CMP    state, S1
    BRANCH not_s1
    CMP    char, 'b'
    BRANCH s1_not_b
    ASSIGN state, S1           ; stay in S1
    JMP    next

  s1_not_b:
    CMP    char, 'c'
    BRANCH s1_fail
    ASSIGN state, ACCEPT
    JMP    next

  s1_fail:
    ASSIGN state, DEAD
    JMP    next

  not_s1:
    ; ── DEAD or ACCEPT + more input → DEAD ──
    ASSIGN state, DEAD

  next:
    ADVANCE                    ; move to next character
    CMP    input_end, true
    BRANCH LOOP                ; not at end → loop

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH reject
    HALT   accept

  reject:
    HALT   reject

Trace with input "abbc":

Step	char	state before	transition	state after
1	`a`	S0	S0 + `a` → S1	S1
2	`b`	S1	S1 + `b` → S1	S1
3	`b`	S1	S1 + `b` → S1	S1
4	`c`	S1	S1 + `c` → ACCEPT	ACCEPT

Graph representation (state diagram): four states, all transitions accounted for, including the error paths to DEAD.

End of input, state is ACCEPT → accepted.

Every single line maps to one of the seven primitives. Nothing else is needed.
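The same machine can be written in Python with if/elif, ==, assignment, and a for loop. This is a minimal sketch; the function name is illustrative, and the for loop fuses READ and ADVANCE into one step.

```python
def accepts_ab_star_c(text):
    """FSM for ab*c: an 'a', then zero or more 'b's, then a 'c'."""
    state = "S0"                      # ASSIGN: initial state
    for char in text:                 # READ + ADVANCE
        if state == "S0":             # expecting 'a'
            state = "S1" if char == "a" else "DEAD"
        elif state == "S1":           # expecting 'b' or 'c'
            if char == "b":
                state = "S1"          # stay in S1
            elif char == "c":
                state = "ACCEPT"
            else:
                state = "DEAD"
        else:
            state = "DEAD"            # DEAD, or ACCEPT followed by more input
    return state == "ACCEPT"          # HALT: accept or reject
```

Calling `accepts_ab_star_c("abbc")` follows the same four steps as the trace above.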

Q: How do we know these seven operators are enough for any FSM? Fact check this!

Yes, the seven operations are sufficient. In fact, they're slightly redundant.

JMP is redundant — it's just BRANCH(always_true, target). And CMP can be absorbed into BRANCH (as "branch-on-symbol-match"). And READ + ADVANCE can be fused into one operation.

So the theoretical minimum is 4 operations:

Minimal op	Absorbs
READ/ADVANCE	READ + ADVANCE
BRANCH-on-match	CMP + BRANCH + JMP
ASSIGN	—
HALT	—

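In spirit, this resembles Wang's B-machine (1957), a minimal abstract machine with only four instruction types: move right, move left, mark the current square, and conditional jump. The B-machine can write to its tape and is Turing-equivalent, so it is a precedent for tiny instruction sets rather than a DFA model; a DFA needs even less, because it never writes and never moves left.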

Key references:

Sipser (Introduction to the Theory of Computation, Ch. 1) — treats the transition function δ(state, symbol) → state as a single atomic primitive. The 7 ops are a decomposition of that.
Hopcroft/Ullman/Motwani (Intro to Automata Theory, Ch. 2) — same formal treatment.
Minsky (Computation: Finite and Infinite Machines, 1967, Ch. 3) — decomposes FSMs into read input, compute next-state (combinational logic), latch new state (memory). The hardware view of exactly these ops.
Wang (1957), "A Variant to Turing's Theory of Computing Machines", Journal of the ACM 4(1). A classical reference for minimal instruction sets, although for a Turing-equivalent machine rather than an FSM.

Bottom line: The 7 ops are sufficient for any FSM. We keep them separate (rather than using the minimal 4) because they map more clearly to code — each one becomes a distinct, readable operation in the implementation.

Q: Why are those 7 operators so much like Assembly operators?

They are assembly-level operations. That's not a coincidence.

An FSM is the computational model that sits right above combinational logic. It's the simplest machine that has state. And assembly is the thinnest software abstraction over hardware. So they naturally meet at the same level.

The mapping is direct:

FSM op	x86 equivalent
READ	`mov al, [rsi]` (load byte from memory)
CMP	`cmp al, 'a'`
BRANCH	`jne label`
JMP	`jmp label`
ASSIGN	`mov state, S1`
ADVANCE	`inc rsi`
HALT	`ret` / `syscall exit`

Hand-written FSMs in assembly (like in high-performance lexers) look almost identical to our pseudocode. The abstraction gap is essentially zero.

This is also why FSMs are so fast in practice — every operation maps 1:1 to a single CPU instruction. No function calls, no memory allocation, no indirection. Just compare, branch, assign, repeat.

When we move to a higher-level language like Python, we'll express these same 7 operations using if/elif, ==, variable assignment, and a for loop — but conceptually, it's the same assembly underneath.

Q: Can we implement an FSM that recognizes `<div>text</div>` where text is alphanumeric?

Yes. To handle character ranges like [a-z], [A-Z], [0-9], we need to expand CMP and split BRANCH into variants.

Updated operator set (10 operations):

Operation	What it does	Why an FSM needs it	x86 equivalent
READ	Get current input symbol	Consume input	`mov al, [rsi]`
CMP	Compare two values, set flags	Every transition needs comparison	`cmp al, 'a'`
BRANCH_EQ	Jump if equal	Exact character match	`je label`
BRANCH_NE	Jump if not equal	Skip when no match	`jne label`
BRANCH_LT	Jump if less than	Lower bound of range check	`jb label`
BRANCH_GT	Jump if greater than	Upper bound of range check	`ja label`
JMP	Unconditional jump	Loop, skip blocks	`jmp label`
ASSIGN	Write value to register	Update state	`mov reg, val`
ADVANCE	Move input pointer forward	Step through input	`inc rsi`
HALT	Stop execution	Accept or reject	`ret` / `syscall`

We split the old BRANCH into four variants (EQ, NE, LT, GT). CMP now sets flags rather than returning a boolean — exactly like x86 cmp. Range checks like [a-z] become:

CMP       char, 'a'
BRANCH_LT fail        ; char < 'a' → not lowercase
CMP       char, 'z'
BRANCH_GT fail        ; char > 'z' → not lowercase
; if we reach here, 'a' <= char <= 'z'
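The same two-sided range check, extended to all three ranges, might look like this in Python. It is a sketch that assumes single-character strings and ASCII ordering; the function name is illustrative.

```python
def is_alnum_ascii(char):
    """Range checks for [a-z], [A-Z], [0-9] using only comparisons."""
    code = ord(char)                      # compare numeric character codes
    if ord("a") <= code <= ord("z"):      # not < 'a' and not > 'z'
        return True
    if ord("A") <= code <= ord("Z"):
        return True
    if ord("0") <= code <= ord("9"):
        return True
    return False                          # every range check failed
```

Each `lo <= code <= hi` test corresponds to one CMP/BRANCH_LT plus one CMP/BRANCH_GT pair.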

FSM for `<div>[a-zA-Z0-9]+</div>`

States and what has been consumed:

State	Consumed so far	Expects next
S0	(start)	`<`
S1	`<`	`d`
S2	`<d`	`i`
S3	`<di`	`v`
S4	`<div`	`>`
S5	`<div>`	first alphanumeric
S6	`<div>text`	more alphanumeric or `<`
S7	`<div>text<`	`/`
S8	`<div>text</`	`d`
S9	`<div>text</d`	`i`
S10	`<div>text</di`	`v`
S11	`<div>text</div`	`>`
ACCEPT	`<div>text</div>`	(end of input)
DEAD	(error)	(absorb remaining)

Pseudocode using the 10 operators:

    ASSIGN state, S0

LOOP:
    ; ── DEAD: absorb remaining input ──
    CMP    state, DEAD
    BRANCH_EQ next

    ; ── ACCEPT + more input → error ──
    CMP    state, ACCEPT
    BRANCH_EQ to_dead

    READ   char

    ; ── S0: expecting '<' ──
    CMP    state, S0
    BRANCH_NE not_s0
    CMP    char, '<'
    BRANCH_NE to_dead
    ASSIGN state, S1
    JMP    next

  not_s0:
    ; ── S1: expecting 'd' ──
    CMP    state, S1
    BRANCH_NE not_s1
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S2
    JMP    next

  not_s1:
    ; ── S2: expecting 'i' ──
    CMP    state, S2
    BRANCH_NE not_s2
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S3
    JMP    next

  not_s2:
    ; ── S3: expecting 'v' ──
    CMP    state, S3
    BRANCH_NE not_s3
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S4
    JMP    next

  not_s3:
    ; ── S4: expecting '>' ──
    CMP    state, S4
    BRANCH_NE not_s4
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, S5
    JMP    next

  not_s4:
    ; ── S5: expecting first alphanumeric ──
    CMP    state, S5
    BRANCH_NE not_s5
    JMP    check_alnum_s6     ; reuse alnum check, go to S6

  not_s5:
    ; ── S6: more alphanumeric or '<' ──
    CMP    state, S6
    BRANCH_NE not_s6
    CMP    char, '<'
    BRANCH_NE s6_alnum
    ASSIGN state, S7
    JMP    next
  s6_alnum:
    JMP    check_alnum_stay   ; stay in S6

  not_s6:
    ; ── S7: expecting '/' ──
    CMP    state, S7
    BRANCH_NE not_s7
    CMP    char, '/'
    BRANCH_NE to_dead
    ASSIGN state, S8
    JMP    next

  not_s7:
    ; ── S8: expecting 'd' ──
    CMP    state, S8
    BRANCH_NE not_s8
    CMP    char, 'd'
    BRANCH_NE to_dead
    ASSIGN state, S9
    JMP    next

  not_s8:
    ; ── S9: expecting 'i' ──
    CMP    state, S9
    BRANCH_NE not_s9
    CMP    char, 'i'
    BRANCH_NE to_dead
    ASSIGN state, S10
    JMP    next

  not_s9:
    ; ── S10: expecting 'v' ──
    CMP    state, S10
    BRANCH_NE not_s10
    CMP    char, 'v'
    BRANCH_NE to_dead
    ASSIGN state, S11
    JMP    next

  not_s10:
    ; ── S11: expecting '>' ──
    CMP    state, S11
    BRANCH_NE to_dead
    CMP    char, '>'
    BRANCH_NE to_dead
    ASSIGN state, ACCEPT
    JMP    next

    ; ── alphanumeric check → transition to S6 ──
  check_alnum_s6:
    CMP    char, 'a'
    BRANCH_LT upper_s6
    CMP    char, 'z'
    BRANCH_GT upper_s6
    ASSIGN state, S6
    JMP    next
  upper_s6:
    CMP    char, 'A'
    BRANCH_LT digit_s6
    CMP    char, 'Z'
    BRANCH_GT digit_s6
    ASSIGN state, S6
    JMP    next
  digit_s6:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    ASSIGN state, S6
    JMP    next

    ; ── alphanumeric check → stay in S6 ──
  check_alnum_stay:
    CMP    char, 'a'
    BRANCH_LT upper_stay
    CMP    char, 'z'
    BRANCH_GT upper_stay
    JMP    next
  upper_stay:
    CMP    char, 'A'
    BRANCH_LT digit_stay
    CMP    char, 'Z'
    BRANCH_GT digit_stay
    JMP    next
  digit_stay:
    CMP    char, '0'
    BRANCH_LT to_dead
    CMP    char, '9'
    BRANCH_GT to_dead
    JMP    next

  to_dead:
    ASSIGN state, DEAD

  next:
    ADVANCE
    CMP    input_end, true
    BRANCH_NE LOOP

    ; ── input exhausted ──
    CMP    state, ACCEPT
    BRANCH_NE reject
    HALT   accept

  reject:
    HALT   reject

Trace with input <div>hello</div>:

Step	char	state	transition
1	`<`	S0 → S1	exact match
2	`d`	S1 → S2	exact match
3	`i`	S2 → S3	exact match
4	`v`	S3 → S4	exact match
5	`>`	S4 → S5	exact match
6	`h`	S5 → S6	range: `'a'` ≤ `'h'` ≤ `'z'`
7	`e`	S6 → S6	range: `'a'` ≤ `'e'` ≤ `'z'`
8	`l`	S6 → S6	range: `'a'` ≤ `'l'` ≤ `'z'`
9	`l`	S6 → S6	range: `'a'` ≤ `'l'` ≤ `'z'`
10	`o`	S6 → S6	range: `'a'` ≤ `'o'` ≤ `'z'`
11	`<`	S6 → S7	exact match
12	`/`	S7 → S8	exact match
13	`d`	S8 → S9	exact match
14	`i`	S9 → S10	exact match
15	`v`	S10 → S11	exact match
16	`>`	S11 → ACCEPT	exact match

End of input, state is ACCEPT → accepted.
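
For readers who would rather poke at the machine in a high-level language, here is a minimal Python sketch of the same recognizer. The state names follow the trace above. The `EXACT` table and the `step`, `recognize`, and `is_ascii_alnum` helpers are illustrative names of our own, not part of the pseudo-assembly. It uses explicit ASCII range checks rather than `str.isalnum()`, because the latter also accepts non-ASCII letters and digits.

```python
# Sketch of the recognizer for <div>[a-zA-Z0-9]+</div>.
# State names mirror the trace; the table layout is an illustrative assumption.
DEAD, ACCEPT = "DEAD", "ACCEPT"

# Exact-match transitions: (current state, char) -> next state
EXACT = {
    ("S0", "<"): "S1",
    ("S1", "d"): "S2",
    ("S2", "i"): "S3",
    ("S3", "v"): "S4",
    ("S4", ">"): "S5",
    ("S6", "<"): "S7",
    ("S7", "/"): "S8",
    ("S8", "d"): "S9",
    ("S9", "i"): "S10",
    ("S10", "v"): "S11",
    ("S11", ">"): ACCEPT,
}


def is_ascii_alnum(ch):
    """Same three range checks as check_alnum_s6 / check_alnum_stay."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z" or "0" <= ch <= "9"


def step(state, ch):
    """Return the next state for one input character."""
    nxt = EXACT.get((state, ch))
    if nxt is not None:
        return nxt
    # S5 needs at least one alphanumeric char; S6 loops on further ones.
    if state in ("S5", "S6") and is_ascii_alnum(ch):
        return "S6"
    # Everything else, including any input after ACCEPT, is a dead end.
    return DEAD


def recognize(text):
    """True only if the whole input drives the machine from S0 to ACCEPT."""
    state = "S0"
    for ch in text:
        state = step(state, ch)
    return state == ACCEPT
```

Calling `recognize("<div>hello</div>")` walks the same sixteen steps as the trace, while inputs such as `<div></div>` (empty text) or `<div>hi</div>x` (trailing character) end in `DEAD`.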

What's next

At this point our FSM can recognize whether a string like <div>hello</div> is valid, but it can only answer yes or no. A real tokenizer needs to do more than accept or reject. It needs to emit tokens: structured data that tells the next stage of the parser what was found and where.

That means we need a data structure to hold the tokenizer's output — something that captures the tag names, the text content, and the boundaries between them. In the next post, we'll design that output format and turn our recognizer into a proper tokenizer that produces a stream of tokens.
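
To make the idea concrete before the next post, here is one possible shape for such a token, sketched in Python. Everything in it is an assumption for illustration: the `Token` name, its fields, and the `kind` labels are hypothetical, and the format we end up designing may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str   # hypothetical labels: "open_tag", "text", "close_tag"
    value: str  # the tag name or the text content
    start: int  # index of the token's first character in the input
    end: int    # index one past the token's last character


source = "<div>hello</div>"

# What a tokenizer might emit for the input above (written by hand here).
tokens = [
    Token("open_tag", "div", 0, 5),
    Token("text", "hello", 5, 10),
    Token("close_tag", "div", 10, 16),
]
```

Storing `start` and `end` as a half-open range means `source[t.start:t.end]` gives back the exact slice each token came from, and adjacent tokens share a boundary index.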

Stay tuned...
