When you write a library, sooner or later you run into the engine. Not the pretty external interface, not the wrappers, but the part inside that drives the process through its states: generate something, check it, decide what to do next, repeat. A couple of flags, a while loop, a big if in the middle, and a month later you can't even remember which transitions are possible in there and why one of the branches is unreachable.
Recently I was building exactly this kind of engine and stumbled onto a library that does the job noticeably more cleanly. It's called pydantic-graph. Hardly anyone writes about it, even though the whole of pydantic-ai (the agent framework from the authors of Pydantic) is built on top of it. Below I'll walk through it with a concrete example: a reliability harness for weak language models.
Let me clarify the term up front, since it's everywhere right now. A harness isn't just MCP, skills, and memory. It's also robustness, including for very small models. It's that second part I'm using as the example. But the article isn't so much about models as about the approach itself. The main idea is simple: this is a convenient way to assemble an engine for anything that has states and transitions, without drowning in your own loop.
The problem: the model is capable, but trips over small things
The example I worked everything out on is this:
Take a small model (in my case it's Llama 3.1 8B Instruct) and ask it to solve problems from HumanEval+ (a code benchmark for LLMs). Low-resource models don't exactly fail on such tasks: they succeed a bit more than half the time. Off the cuff, with no scaffolding at all, this model scores a pass@1 of around 0.61. Roughly speaking, about two attempts out of five end up in the trash, even though the solution is usually almost correct and stumbles on something trivial.
I ran 20 samples per problem across all 164 problems and looked through the error log. Here are the characteristic examples:
- NameError: name 'encode_cyclic' is not defined. The model called a helper function that it forgot to write out itself.
- max() arg is an empty sequence. The empty input isn't handled, and the solution crashes on it.
- The most common case: the code runs without a single complaint, but returns the wrong thing on a hidden test. The logic diverged from the requirement by one step.
All these failures share a common trait, and it matters for everything that follows. Each of them is visible automatically, and once an error is visible without a human in the loop, you can react to it without knowing anything about the model's internals and without touching its weights. That observation is where the next idea came from.
The same 164 problems, 20 samples per problem, temperature 0.2. The table shows pass@k, where k is the number of attempts, on the hidden tests (plus) and on the public tests (base):
| k | plus | base |
|---|---|---|
| pass@1 | 0.607 | 0.665 |
| pass@2 | 0.654 | 0.712 |
| pass@5 | 0.698 | 0.758 |
| pass@10 | 0.722 | 0.782 |
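The article doesn't say which estimator produced these numbers. A common choice, assumed here, is the unbiased pass@k estimator from the original HumanEval paper: with n samples per problem, of which c pass, the per-problem value is 1 - C(n-c, k) / C(n, k), and the benchmark score is the mean over problems. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples (c of them correct), passes."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over all problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With k=1 the formula reduces to c/n, the plain fraction of passing samples, which is why pass@1 reads as "how often a single attempt works".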
The idea: the engine drives the agent
The idea is actually trivial. We treat the agent as a black box and run it through a series of special nodes that work with that box. Generated an answer, checked it; if it's bad, we append to the prompt what the problem was and generate again. Once the attempt budget runs out, we return the best of the variants and honestly mark it as imperfect.
The crucial condition this was all built around: in this loop, only the model should think, and only in one place. Generation is so far the single step where there's any nondeterminism. Checking, the "fix or give up" decision, and assembling the next prompt are ordinary, predictable code.
A construction like this is natural to describe as a finite state machine. Nodes are states, edges are transitions, with strict types at the seams. This is where the library enters the frame.
pydantic-graph: a state graph built on types
It installs separately; you don't need to drag all of pydantic-ai along with it:
uv add pydantic-graph==1.105.0 # version at the time of writing
What's worth knowing about this library: it's a state-graph engine built entirely on types. A node is a class, and a transition is expressed by the return type of a method. It sounds unusual, but in practice it's convenient, because the type checker and the IDE can see the topology of the graph: you can't silently draw an impossible edge.
The library itself doesn't have to know anything about language models. Yes, pydantic-ai is built on top of it, but at its core it's a general-purpose state machine. It suits any process with states: an order-processing pipeline, a state machine in a game, retries when calling an external service. What won me over is that it reads like ordinary Python without hidden magic, while the graph stays verifiable.
The agent contract: a prompt in, a string out
For the engine to work with any agent, it needs a minimal shared language. As a first approximation I decided to limit it to a single function: a prompt goes in, a string comes out.
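A minimal sketch of such a contract. The alias name `AgentFn` comes from the text; the synchronous signature and the helper names `single_agent` and `make_team` are assumptions for illustration (a real implementation could just as well be async):

```python
from typing import Callable

# The whole contract: a prompt goes in, a string comes out.
AgentFn = Callable[[str], str]


def single_agent(prompt: str) -> str:
    """Stand-in for one model call."""
    return f"answer to: {prompt}"


def make_team(planner: AgentFn, executor: AgentFn) -> AgentFn:
    """Hide a planner and an executor behind the same one-function contract."""

    def team(prompt: str) -> str:
        plan = planner(prompt)
        return executor(plan)

    return team


# From the engine's side both of these are just AgentFn.
team_agent = make_team(lambda prompt: f"plan for {prompt}", single_agent)
```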
This is the narrow waist through which anything can be threaded into the engine. In this design, AgentFn doesn't depend on scale, so you can wrap it around any chunk of the system. You can put the harness on a single agent inside a multi-agent system, in which case AgentFn is exactly that agent, while the planner and the other executors stay outside.
Or you can wrap it around the whole team. The planner, the executors, all the internal orchestration get hidden inside, and from the outside the harness still sees a single agent: a prompt in, a string out. The engine doesn't care what's going on inside.
How the graph is built
The working nodes return into a central decision node (Decide).
Here's what one node looks like in code. Pay attention to the signature of the run method.
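A library-free sketch of the shape such a node takes. In pydantic-graph itself a node subclasses `BaseNode` and `run` receives a `GraphRunContext`; here a small stand-in class `Ctx` plays that role so the example has no dependencies, and the field names are assumptions:

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

AgentFn = Callable[[str], str]  # assumed contract: prompt in, string out


@dataclass
class Ctx:
    """Stand-in for the library's run context; it carries the agent."""
    agent: AgentFn


@dataclass
class Validate:
    """Next state: holds the candidate that still has to be checked."""
    candidate: str


@dataclass
class Generate:
    """The only node that talks to the model."""
    prompt: str

    async def run(self, ctx: Ctx) -> Validate:
        # The return annotation is the edge: Generate can only lead to Validate.
        return Validate(candidate=ctx.agent(self.prompt))


# Drive a single step by hand with a fake agent.
next_node = asyncio.run(Generate("2 + 2?").run(Ctx(agent=lambda prompt: "4")))
```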
That -> Validate is the detail that kept me on the library. It's not a comment or a line in a config; it's a real type annotation. The type checker knows that from the generation node you can only get to the validation node. If I mistakenly return a node that shouldn't be reachable at that point, it gets flagged in the IDE before I even run anything.
Everything is assembled with a builder.
A quick word on the roles of the nodes. The generation node is the only place with the model. The validation node is deterministic: it reduces the messy semantics of an answer to a simple verdict: passed or not, what kind of error, a short reason, the fraction of checks passed. The decision node (Decide) is pure logic over that verdict, the history of attempts, and the remaining budget, with no model inside. The repair node assembles the next prompt from the last error and calls generation again. The fallback node fires when the attempts run out: it returns the best-scoring variant and marks the result as degraded.
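The roles above can be sketched as a tiny hand-rolled state machine. This is not the author's code and not the pydantic-graph API: the class layout, the `State` fields, and the `run_graph` driver are assumptions, kept synchronous and dependency-free to show only the flow between the five nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class Verdict:
    passed: bool
    reason: str = ""
    score: float = 0.0  # fraction of checks passed


@dataclass
class State:
    task: str
    agent: Callable[[str], str]
    validate: Callable[[str], Verdict]
    budget: int = 3
    history: list = field(default_factory=list)  # (candidate, verdict) pairs


@dataclass
class Done:
    answer: str
    degraded: bool


@dataclass
class Generate:
    prompt: str

    def run(self, s: State) -> "Validate":
        s.budget -= 1  # every model call spends one attempt
        return Validate(s.agent(self.prompt))


@dataclass
class Validate:
    candidate: str

    def run(self, s: State) -> "Decide":
        s.history.append((self.candidate, s.validate(self.candidate)))
        return Decide()


@dataclass
class Decide:
    def run(self, s: State) -> "Union[Done, Repair, Fallback]":
        candidate, verdict = s.history[-1]
        if verdict.passed:
            return Done(candidate, degraded=False)
        return Repair() if s.budget > 0 else Fallback()


@dataclass
class Repair:
    def run(self, s: State) -> Generate:
        _, verdict = s.history[-1]
        return Generate(f"{s.task}\n\nPrevious attempt failed: {verdict.reason}")


@dataclass
class Fallback:
    def run(self, s: State) -> Done:
        best, _ = max(s.history, key=lambda item: item[1].score)
        return Done(best, degraded=True)


def run_graph(s: State) -> Done:
    node = Generate(s.task)
    while not isinstance(node, Done):
        node = node.run(s)
    return node
```

Only `Generate.run` touches the agent; every other node is plain deterministic code over the state, which is the property the design is built around.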
Validators: swappable checks in a single slot
The validation node (Validate) doesn't know in advance what counts as a correct answer. It accepts any callback of the form "string in, verdict out," and that turns it into a slot you can plug anything into.
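A sketch of what that slot could look like. The `Verdict` fields mirror the description above (passed or not, kind of error, short reason, fraction of checks passed), but the field names, the `Validator` alias, and the example check are assumptions:

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    passed: bool
    error_kind: str = ""
    reason: str = ""
    score: float = 0.0  # fraction of checks passed


# The slot: any callable of this shape can be plugged in.
Validator = Callable[[str], Verdict]


def valid_json(candidate: str) -> Verdict:
    """A cheap mechanical check: is the answer parseable JSON at all?"""
    try:
        json.loads(candidate)
    except json.JSONDecodeError as exc:
        return Verdict(False, "format", f"invalid JSON: {exc.msg}")
    return Verdict(True, score=1.0)


def validate_step(candidate: str, validator: Validator) -> Verdict:
    """What the node does: hand the string to whatever sits in the slot."""
    return validator(candidate)
```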
There's room to get creative here. The cheapest checks are mechanical: parsing, a linter, type checking, validating JSON against a schema, the plain "does it even run?" They're universal and work on any string, but the signal they give is weak: they only catch format issues and crashes. My sense is that the strong signal is always domain-specific (for code, that's running the tests). On top of that you can hang a model in the judge role, either one judge or a whole ensemble.
What's more, a combination of validators is itself a validator, from the engine's point of view. So no matter how many there are, to the engine it's always a single slot. The complexity hides in the combinator, and the graph stays the same.
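A minimal sketch of such a combinator, under the same assumed `Verdict` shape. `all_of` stops at the first failing check and reports how far it got; `parses` and `passes_tests` are illustrative validators, and the `exec` calls are not sandboxed, so this is for trusted toy inputs only:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    passed: bool
    error_kind: str = ""
    reason: str = ""
    score: float = 0.0


Validator = Callable[[str], Verdict]


def parses(candidate: str) -> Verdict:
    """Weak, universal signal: is this syntactically valid Python?"""
    try:
        compile(candidate, "<candidate>", "exec")
    except SyntaxError as exc:
        return Verdict(False, "SyntaxError", str(exc))
    return Verdict(True, score=1.0)


def passes_tests(test_code: str) -> Validator:
    """Strong, domain-specific signal: run the candidate against tests."""

    def check(candidate: str) -> Verdict:
        namespace: dict = {}
        try:
            exec(candidate, namespace)  # NOTE: no sandboxing in this sketch
            exec(test_code, namespace)
        except Exception as exc:
            return Verdict(False, type(exc).__name__, str(exc))
        return Verdict(True, score=1.0)

    return check


def all_of(*validators: Validator) -> Validator:
    """Combine validators; the result is again just a Validator."""

    def combined(candidate: str) -> Verdict:
        for done, validator in enumerate(validators):
            verdict = validator(candidate)
            if not verdict.passed:
                # Score is the fraction of checks passed before the failure.
                return Verdict(False, verdict.error_kind, verdict.reason,
                               done / len(validators))
        return Verdict(True, score=1.0)

    return combined


check = all_of(parses, passes_tests("assert biggest([]) is None"))
verdict = check("def biggest(xs):\n    return max(xs)")
```

Here `verdict` reports a `ValueError` from the unhandled empty input, the same class of failure as in the error log earlier, and the engine only ever saw one callable.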
The repair node and where it can grow
Right now my repair node is fairly simple: here's the error, regenerate taking the feedback into account. A kind of prompt engineering. But you can build in different repair strategies. For example, route by error type: if it crashed at runtime, try a bigger model. Or run several models supplied by the user and take the best candidate.
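Routing by error type could be sketched as a lookup table of strategies. Everything here is hypothetical: the `Failure` shape, the error kinds, and the model names "small-model" and "bigger-model" are placeholders, not part of the author's harness:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Failure:
    error_kind: str  # e.g. "runtime" or "wrong_answer"
    reason: str


# A strategy returns (model name, next prompt).
Strategy = Callable[[str, Failure], Tuple[str, str]]


def retry_with_feedback(task: str, failure: Failure) -> Tuple[str, str]:
    """Default: same model, error appended to the prompt."""
    prompt = f"{task}\n\nYour previous attempt failed: {failure.reason}\nFix it."
    return "small-model", prompt


def escalate(task: str, failure: Failure) -> Tuple[str, str]:
    """Runtime crash: hand the task to a bigger model."""
    prompt = f"{task}\n\nA previous attempt crashed: {failure.reason}"
    return "bigger-model", prompt


ROUTES: Dict[str, Strategy] = {"runtime": escalate}


def plan_repair(task: str, failure: Failure) -> Tuple[str, str]:
    """Pick a strategy by error kind, falling back to plain feedback."""
    strategy = ROUTES.get(failure.error_kind, retry_with_feedback)
    return strategy(task, failure)
```

Adding a strategy then means adding one entry to `ROUTES`; the graph around the repair node doesn't change.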
What came out in the end?
| Mode | pass@1 on HumanEval+ |
|---|---|
| Baseline | 0.607 ± 0.034 |
| With harness | 0.623 ± 0.035 |
The harness raises pass@1 from 0.607 to 0.623. The deviation came out to ±0.034, and at first glance the gain dissolves into it. But both modes were run on the same 164 problems, so what you should look at is the paired per-problem difference, and that comes to +0.016 ± 0.007, i.e. consistently positive, if small.
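Why pairing helps can be shown in a few lines. The article doesn't say how its ± was computed; the sketch below assumes the standard error of the mean per-problem difference, and the numbers in the usage lines are toy values, not the benchmark data:

```python
from math import sqrt
from statistics import mean, stdev


def paired_gain(baseline: list[float], harness: list[float]) -> tuple[float, float]:
    """Mean per-problem difference and its standard error.

    Both lists hold one pass rate per problem, in the same problem order.
    """
    diffs = [h - b for b, h in zip(baseline, harness)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


# Toy per-problem pass rates: problems differ a lot from each other,
# but the harness shifts each one only slightly.
toy_baseline = [0.5, 0.6, 0.7, 0.8]
toy_harness = [0.6, 0.6, 0.8, 0.9]
gain, stderr = paired_gain(toy_baseline, toy_harness)
```

The spread between problems (easy versus hard) dominates each mode's own error bar, but it cancels out in the per-problem differences, which is why the paired interval is so much narrower than ±0.034.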
What I liked along the way, beyond the numbers themselves: all of it turned out to be extensible and strictly typed. The graph reads directly in the IDE, and the transitions can't be mixed up unnoticed.
Conclusion
The idea of applying finite state machines in development isn't new in itself. But the library that implements it struck me, personally, as interesting and underexposed. I won't claim everyone needs it. Nor will I risk saying that finite-state logic is the one right solution for the idea in this article.
If the model is already fine-tuned to answer reliably in the required format, the corresponding harness validator simply stops doing anything, and the absence of a gain here is a perfectly logical outcome. If you have a completely linear pipeline with no branching, pydantic-graph will be overkill, since a plain function is enough.
But if the process branches, has states, and it matters to you that the IDE holds its topology for you, the library fits well. And, to repeat, it doesn't have to be about language models at all.
All the code is packaged into a library and available on GitHub.


