Adam Munawar Rahman

Posted on Jun 9

meet elliot: a robot sim an llm drives, kept honest by a state machine

#ai #opensource #agents #mcp

Elliot drives. The state machine says no.

I gave a language model the wheel. Then I gave a state machine the power to
refuse any move it tried to make.

30 seconds to running. No API key, no model, fully offline (a deterministic
navigator drives):

git clone https://github.com/msradam/elliot && cd elliot
uv venv && uv pip install -e .
source .venv/bin/activate   # macOS/Linux; makes `python` below the venv's interpreter
python run.py --offline

That is Elliot, a robot in a 2D simulation. An LLM is driving him. It narrates
every step in its own words and reaches for the next phase the moment it thinks
it is ready. The red lines in the console are the state machine telling it no.

TL;DR: the model proposes the next move; a finite state machine validates it
against real state and refuses the rest. The refusing is done by Theodosia, a
small open-source library I built, and it is one line of glue. The robot is just
the most fun way I found to explain it.

the problem: LLMs are confident, not correct

If you have built an agent that does more than one step, you know the failure
mode. The model decides it has finished a step it has not finished. It skips
ahead. It claims the file is written, the order is placed, the deploy is done.

Mine once marked an order "placed" because the payment call timed out instead of
returning an error. The model read the non-answer as a yes, moved on, and fired
the confirmation email for an order that did not exist.

It is not lying on purpose. It has no ground truth, only its own belief, and its
belief sits upstream of reality.

The usual fix is to prompt harder. Add "do not proceed until X." That works
until it doesn't.

the idea: make the workflow a state machine, and let the model only propose

What if the workflow itself were a finite state machine, with explicit states
and explicit transitions, and the model could only propose the next
transition? Something else validates the proposal against real state and either
runs it or refuses it and hands back the moves that are actually legal.

You get determinism where you need it (the graph and the gates) and model
judgment where you want it (which legal move, and why). The model drives. It
does not get to redraw the road.
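To make the split concrete, here is a minimal sketch in plain Python, with no libraries. The graph, the state names, and the `Machine` class are all made up for illustration; this is not how Burr or Theodosia implement it, only the shape of the idea: the caller proposes a target state, and the machine either runs the transition or refuses and returns the legal moves.

```python
from dataclasses import dataclass, field

# Hypothetical graph for illustration: state -> states reachable from it.
GRAPH = {
    "draft": {"review"},
    "review": {"draft", "approved"},
    "approved": set(),
}


@dataclass
class Machine:
    state: str = "draft"
    history: list = field(default_factory=list)

    def legal_moves(self) -> list[str]:
        """Moves the graph allows from the current state, in a stable order."""
        return sorted(GRAPH[self.state])

    def propose(self, target: str) -> dict:
        """Run the transition if the graph allows it; otherwise refuse."""
        if target not in GRAPH[self.state]:
            # The proposal changes nothing; the caller learns what is legal.
            return {"ok": False, "requested": target, "allowed": self.legal_moves()}
        self.history.append((self.state, target))
        self.state = target
        return {"ok": True, "state": self.state}


machine = Machine()
skipped = machine.propose("approved")  # the model tries to skip review
accepted = machine.propose("review")   # a legal move runs
```

Here `skipped` is a refusal that lists `review` as the only allowed move, and the machine's state is untouched until the legal proposal arrives.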

meet Elliot

Elliot is that idea staged in a sim, and a little theatrical.

His rulebook is a finite state machine, built with Apache Burr (a state-machine library): five phases (boot, recon, exploit, exfil, ghost) and the legal moves between them. It decides nothing. It only says which moves exist. (Yes, named after that Elliot. If you know, you know.)
The mind is an LLM. It reads the state, narrates, and picks which legal move to make, through litellm, so the model is swappable.
His senses are the state: a 2D lidar, his position, a collision flag, an arrival flag, all from ir-sim (a 2D robot simulator), running headless.
A plain controller does the actual steering, because that is motor work and the model is bad at motor work.

Watch the console in that screenshot. The model is in RECON, closing on the
target. It is eager, so it keeps reaching for EXPLOIT. And the machine keeps
answering:

REFUSED reached for exploit from recon; not earned. allowed: recon

It cannot move to EXPLOIT until the target is actually within sensing range. It
cannot move to EXFIL until the simulator's own arrival flag fires. It cannot
GHOST until it is genuinely back home. Every gate is a fact the world has to
supply, not a claim the model can make. When the model reaches early, it gets
the refusal plus the list of moves it is allowed, and it works the phase it is
in instead.
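The gates described above can be sketched as predicates over facts the simulator reports. This is a rough illustration, not Elliot's actual code: the `World` field names, the `SENSING_RANGE` value, and the rule that staying in the current phase is always legal are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """Facts reported by the simulator, not by the model (field names assumed)."""
    distance_to_target: float
    arrived: bool
    at_home: bool


SENSING_RANGE = 2.0  # assumed threshold, in sim units

# Each phase change is guarded by a predicate over the world.
GATES = {
    ("recon", "exploit"): lambda w: w.distance_to_target <= SENSING_RANGE,
    ("exploit", "exfil"): lambda w: w.arrived,
    ("exfil", "ghost"): lambda w: w.at_home,
}


def allowed_moves(phase: str, world: World) -> list[str]:
    """Assume staying put is always legal; advancing needs an open gate."""
    moves = [phase]
    for (src, dst), gate in GATES.items():
        if src == phase and gate(world):
            moves.append(dst)
    return moves


def request(phase: str, wanted: str, world: World) -> str:
    """Answer a proposed move with either an OK or a refusal line."""
    allowed = allowed_moves(phase, world)
    if wanted in allowed:
        return f"OK {phase} -> {wanted}"
    return (
        f"REFUSED reached for {wanted} from {phase}; not earned. "
        f"allowed: {', '.join(allowed)}"
    )


far = World(distance_to_target=7.5, arrived=False, at_home=False)
near = World(distance_to_target=1.2, arrived=False, at_home=False)
print(request("recon", "exploit", far))
print(request("recon", "exploit", near))
```

The first call prints the same refusal line as the console above; the second prints `OK recon -> exploit`, because the distance reading, not the model, opened the gate.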

That refusal is not an error. It is the point. It is what makes it safe to hand
the model the wheel.

the machine refuses anything the world has not earned.

the thing doing the refusing: Theodosia

Here is the part I actually want to show you.

Theodosia takes an Apache Burr state machine and mounts it as an
MCP server. One call:

import theodosia

server = theodosia.mount(burr_app, name="elliot")

Now any MCP client (Claude, your own agent, a script) sees a tiny, constant tool
surface: mostly a single step tool. The client calls step(action). Theodosia
checks that action against the transitions actually reachable from the current
state. If it is legal, it runs. If it is not, the client gets told no, and told
what it can do:

{
  "error": "invalid_transition",
  "requested": "exfil",
  "valid_next_actions": ["exploit"],
  "message": "action 'exfil' is not reachable from current state. Valid actions now: ['exploit']."
}

The graph is the contract.
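On the client side, a refusal like the one above is something to act on, not an exception to swallow. A minimal sketch, assuming only the payload fields shown in that example; the `choose_next` helper and the `preferred` ranking are hypothetical and not part of Theodosia:

```python
import json
from typing import Optional


def choose_next(response_text: str, preferred: list[str]) -> Optional[str]:
    """Pick the next action to try after a step() response.

    `preferred` is the client's own ranking of actions (hypothetical).
    Only the payload fields from the refusal example are assumed to exist.
    """
    payload = json.loads(response_text)
    if payload.get("error") != "invalid_transition":
        return None  # not a refusal, so there is nothing to repair
    legal = payload.get("valid_next_actions", [])
    for action in preferred:
        if action in legal:
            return action
    # None of the preferred actions is legal; fall back to the first legal one.
    return legal[0] if legal else None


refusal = json.dumps({
    "error": "invalid_transition",
    "requested": "exfil",
    "valid_next_actions": ["exploit"],
    "message": "action 'exfil' is not reachable from current state. "
               "Valid actions now: ['exploit'].",
})
next_action = choose_next(refusal, preferred=["ghost", "exfil", "exploit"])
```

With this refusal, `next_action` is `exploit`: the client wanted to jump ahead, and the only thing it can do is work the phase it is in.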

The gates are conditions on real state, and you write them in plain Burr:

from burr.core import when

builder.with_transitions(
    ("exploit", "exfil", when(target_reached=True)),  # only once the world says so
    ("exploit", "exploit"),                            # otherwise, keep driving in
)

Two more things Theodosia gives you that I leaned on:

State lives on the server. The model never holds the state and cannot drift it. It proposes; the server is the source of truth.
Every step is recorded. Theodosia keeps a hash-chained ledger of every action and every refusal: each entry carries a hash of the one before it, so a single edited or dropped step breaks the chain. Replay a session and you can verify nothing was quietly changed. For anything auditable (payments, deploys, support actions) that is the part that matters.
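The hash-chain idea fits in a few lines of standard-library Python. This is a generic sketch of the technique, not Theodosia's ledger format: the record fields, the `GENESIS` placeholder, and the choice of SHA-256 over canonical JSON are my assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record's canonical JSON."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list[dict], record: dict) -> None:
    """Add a record, linking it to the hash of the entry before it."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"record": record, "prev": prev_hash, "hash": entry_hash(prev_hash, record)}
    )


def verify(ledger: list[dict]) -> bool:
    """Walk the chain from the start and recompute every link."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False  # an entry was dropped or reordered
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False  # a record was edited after the fact
        prev_hash = entry["hash"]
    return True


ledger: list[dict] = []
append(ledger, {"action": "recon", "outcome": "ok"})
append(ledger, {"action": "exploit", "outcome": "refused"})
append(ledger, {"action": "recon", "outcome": "ok"})

intact = verify(ledger)
ledger[1]["record"]["outcome"] = "ok"  # quietly rewrite a refusal
tampered = verify(ledger)
```

`intact` is `True` and `tampered` is `False`. One limit of this sketch on its own: cutting entries off the end leaves a shorter chain that still verifies, so catching that requires keeping the latest hash somewhere the editor cannot reach.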

The Burr graph is the only thing you write. Theodosia is the one line that turns
it into something an agent can drive but cannot break.

this is not really about robots

Elliot is a robot because a robot is fun to watch and easy to understand. The
pattern is for any LLM-driven workflow where the model should drive and should
not be trusted to report its own progress:

a checkout flow where "payment captured" has to be true, not claimed
a deploy pipeline where you cannot run the next stage until the last one passed
a multi-step form, a support runbook, an agent task graph

Those are not hypothetical. The production-grade version is
Leavitt, an on-call incident-triage agent
built on Theodosia: it reads Grafana metrics and logs, k6 load, and deployment
context, correlates them, and writes a triage report whose disposition is
constrained by the evidence, not the model's confidence.

It only ever reads. You can point it at production and walk away. On Microsoft
Research's AIOpsLab benchmark the enforcement layer costs nothing in accuracy; it
just turns a confident wrong report into a "degraded" or "inconclusive" one.

Draw it as a state machine. Put the conditions on real state. Mount it with
Theodosia. The model gets to be smart inside the rails, and the rails do not
move.
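As a sketch of what "conditions on real state" means for the checkout case, here is the timed-out payment from earlier as a gate. Everything here is hypothetical (the state names, the `PaymentResult` enum, the `step` helper) and uses plain Python, not Burr or Theodosia. The point is only that a timeout is a distinct fact and does not open the gate to confirmation.

```python
from enum import Enum
from typing import Optional


class PaymentResult(Enum):
    CAPTURED = "captured"
    DECLINED = "declined"
    TIMEOUT = "timeout"  # no answer is not a yes


def legal_next(state: str, payment: Optional[PaymentResult]) -> list[str]:
    """Legal moves depend on what the payment provider reported, not on a claim."""
    if state == "cart":
        return ["charge"]
    if state == "charge":
        if payment is PaymentResult.CAPTURED:
            return ["confirm"]
        return ["charge", "cancel"]  # retry or give up, but never confirm
    return []


def step(state: str, wanted: str, payment: Optional[PaymentResult]):
    """Return (new_state, response); a refused move leaves the state alone."""
    legal = legal_next(state, payment)
    if wanted not in legal:
        return state, {
            "error": "invalid_transition",
            "requested": wanted,
            "valid_next_actions": legal,
        }
    return wanted, {"ok": True}


# The model reads a timeout as success and reaches for "confirm".
state, response = step("charge", "confirm", PaymentResult.TIMEOUT)
# Once the provider actually reports a capture, the same move is legal.
state_after, response_after = step("charge", "confirm", PaymentResult.CAPTURED)
```

In the timeout case the state stays at `charge` and the response lists `charge` and `cancel` as the only legal moves, so the confirmation email never fires.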

why not LangGraph or LangChain?

LangGraph and LangChain are in-process orchestration layers. You compose nodes,
hold the state, and run the loop inside your own program, and they are good at
that: building graphs, wiring tools, threading memory and retrieval through a
chain. If you want a flexible framework for assembling an agent's logic, that is
exactly what they are for.

Theodosia solves a different problem, and here is where I will plant a flag: a
guardrail that runs in the same process as the agent is a guardrail the agent can
route around. Theodosia is a contract enforced at the server level, not a
framework you orchestrate from. It mounts a plain state machine as an MCP server,
and the server decides which transition is even allowed and refuses the rest. The
state lives server-side, out of the model's reach, and every step lands in a
tamper-evident ledger. You can put Theodosia behind an agent written in
LangGraph, LangChain, or nothing at all, because the enforcement does not live in
the client.

try it

Two steps, and the second one is a single line:

pip install theodosia

import theodosia

server = theodosia.mount(your_burr_app)  # now it is an MCP server that refuses illegal moves

The Theodosia repo has the full guide.
And if you want to watch it drive something before you wire up your own graph,
Elliot is the runnable demo: run the offline command at the top of this post.

msradam / theodosia

Put an AI agent on rails: mount a Burr state machine as an MCP server so the agent can only take the next allowed step, with every step recorded and replayable.

Theodosia

Theodosia mounts a Burr Application as an MCP server. Every Burr action is reachable through a single step(action, inputs) tool; the server checks reachability against the graph before each action runs, refuses out-of-order calls with the legal next moves, and records every attempt.

Install

Python 3.11, 3.12, or 3.13 (Burr does not yet support 3.14). On a fresh Python 3.14 install you will see "no version that satisfies the requirement theodosia"; create a 3.11–3.13 venv first.

uv venv --python 3.13       # or: python3.13 -m venv .venv
uv pip install theodosia    # or: pip install theodosia

Optional extras: theodosia[observability], theodosia[ui], theodosia[claude], theodosia[mellea], theodosia[all].

On a slim Docker image (python:3.13-slim, Alpine) the install pulls a psutil build that needs gcc and python3-dev. Either use the full python:3.13 image or, on Debian-based images, apt-get install -y gcc python3-dev before pip install.

Try it without an API

…


msradam / elliot

a robot whose mind is a finite state machine. an llm drives the transitions, the machine refuses any move the world hasn't earned. hello, friend.

  ════════════════════════════════════════════════════════════════

     E L L I O T                          (yes, named after him.)

  ════════════════════════════════════════════════════════════════

   a robot that is a finite state machine. its sensors are the
   state. an llm drives the transitions. the machine refuses any
   move the world has not earned.

   hello, friend. you are about to run something that thinks, and i
   would rather you knew how little of it actually gets to decide.
   read this before you run me.

─[ what i am ]──────────────────────────────────────────────────
  one robot in a small, unmapped 2d world. somewhere out there is
  a target, and an obstacle or two between me and it. my job is a
  four-step break-in:

      ◉ boot  ▸  ◈ recon  ▸  ◆ exploit  ▸  ◇ exfil  ▸  ✕ ghost

    boot     wake up. read my own senses twice, make them agree,
             then start. i trust nothing yet, least of all me.
    recon    close on the target through open ground, around
             whatever is

…