Nikhil Verma

Posted on May 29 • Originally published at nikhil-verma.com on May 20

Building a harness that makes a small LLM reliable

#llm #aiagents #aiengineering #agentreliability

When I started building a multi-turn agent that could actually do things — search records, update a status, link a piece of evidence, the usual kit — my first instinct was the one I think everyone has: reach for the biggest, smartest model and hope it behaves.

I ended up doing nearly the opposite. The agent I shipped runs on Haiku 3 — small, cheap, fast, and on its own absolutely not what you'd call a careful reasoner. The reliability didn't come from the model. It came from the scaffolding I built around it. And that shift — from "make the model smarter" to "make the harness stronger" — is honestly the most useful thing I've picked up about getting agents to work.

Let me walk you through two pieces of that harness, because they did most of the heavy lifting. First, though, the little incident that kicked it all off.

I was scrolling back through a long session, looking at tool-call errors. (You should be watching those, by the way — a model retrying its own failed tool calls is normal, but a pattern of failures usually means a tool is confusing or a contract is too loose.) And I spotted a few calls hitting IDs that didn't exist. The IDs looked perfect: flawless, textbook UUIDs. Right shape, right everything. They just... weren't real.

The model made them up. An LLM loves to match patterns, and Haiku had watched hundreds of UUIDs stream past in its context, absorbed the shape, and somewhere between "search for items" and "mark this one done" it confidently produced a plausible fake.

Now, the lazy fix is "use a better model" — a bigger model hallucinates less. But that's the wrong lesson, and I want to spend a paragraph on why, because it shapes everything below.

The thing I got wrong for too long: the harness is the lever, not the model

It's so tempting to treat reliability as something you buy from the model. Agent flaky? Upgrade. And it sort of works — partly, and expensively. But a frontier model with no guardrails still invents IDs and still skips steps; it just does it rarely enough that you stop noticing, which is arguably worse.

What actually moved the needle for me was flipping the assumption: assume the model will get things wrong — early, often, in daft ways — and build a harness that makes those mistakes either flat-out impossible or automatically recoverable. Do that, and a lovely thing happens: the bar for how clever the model has to be drops through the floor. A small model on a strong harness beats a big model on a weak one, and it costs a fraction as much per turn.

So the two patterns below aren't "prompt it better." They're rails the model can't leave even if it wants to.

Raw IDs are basically a loaded gun

If you put raw UUIDs into the model's context, sooner or later it'll hand raw UUIDs back. Some real, some invented, and from the outside — just staring at the text — there's no reliable way to tell which is which. The smaller the model, the sooner this happens.

You can ask nicely, of course — "only use IDs I've actually given you" — but a prompt is a polite request, not a contract. Give a model a long context, a bit of instruction-following pressure, a small distraction, and it'll cheerfully forget. And there's no feedback loop: it has no idea the ID was wrong until your database has already either rejected it or, on a bad day, quietly swallowed it.

I dug into the why of this a while back — how models fabricate IDs and a couple of ways to stop it — in a separate post on UUID hallucination. The short version: enum constraints work nicely for tightly-controlled flows, and token aliasing (swapping UUIDs for simple ITEM-1/ITEM-2 handles) works for multi-turn agents. That post is the background, if it's useful. What I want to get into here is what token aliasing has to grow into once a small model is allowed to actually write to your database — and how it becomes one leg of the harness.

Either way, the fix isn't to ask harder. It's to make raw IDs invisible to the model in the first place. If it never sees a UUID, it can't parrot one back.

Pattern 1: the ref proxy

The idea is dead simple. Every UUID that leaves your system on its way to the model gets swapped for a short, session-scoped nickname first. The model only ever sees things like ref_A3k9Xp2Q. Your database only ever sees real UUIDs. A little registry sits in the middle and does the translation.

Here's the shape of that registry:

type EntityRefRegistry = {
  version: 1;
  // ref token → { entityType, entityId, createdAt, lastSeenAt }
  refs: Record<string, EntityRefRecord>;
  // "type:uuid" → ref token (reverse lookup)
  entityToRef: Record<string, string>;
};

I park this as a durable JSONB column on the chat thread row. It survives across turns, so the same entity always gets the same nickname for the whole session — no confusion for a model that's already easily confused.

The proxy works both ways. On the way out (system → model): any time your tools hand back data, a proxyForModel pass walks the output, finds every UUID-shaped string, and swaps it for a ref token — minting a new registry entry if it hasn't seen that entity before.

function proxyForModel<T>(
  value: T,
  registry: EntityRefRegistry
): { value: T; registry: EntityRefRegistry } {
  // walk value, replace UUIDs with ref_* tokens
  // registry is append-only — same UUID always gets same token
}
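A minimal sketch of how that outbound pass could be filled in. It makes several assumptions: tool output is plain JSON-like data, a UUID only counts when it is the whole string value, and the `EntityRefRecord` shape, the `mintRefToken` helper, and the optional `entityTypeOf` resolver are all hypothetical names introduced here for illustration.

```typescript
// Hypothetical record shape; field names follow the comment on the registry type.
type EntityRefRecord = {
  entityType: string;
  entityId: string;
  createdAt: string;
  lastSeenAt: string;
};

type EntityRefRegistry = {
  version: 1;
  refs: Record<string, EntityRefRecord>;
  entityToRef: Record<string, string>;
};

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const REF_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

// Assumed token format: "ref_" plus 8 random alphanumeric characters.
function mintRefToken(): string {
  let token = "ref_";
  for (let i = 0; i < 8; i++) {
    token += REF_ALPHABET.charAt(Math.floor(Math.random() * REF_ALPHABET.length));
  }
  return token;
}

function proxyForModel<T>(
  value: T,
  registry: EntityRefRegistry,
  // Assumption: the caller can map a UUID to an entity type.
  entityTypeOf: (uuid: string) => string = () => "entity"
): { value: T; registry: EntityRefRegistry } {
  // Work on a copy so the caller's registry is never mutated.
  const next: EntityRefRegistry = {
    version: 1,
    refs: { ...registry.refs },
    entityToRef: { ...registry.entityToRef },
  };
  const now = new Date().toISOString();

  const toRef = (uuid: string): string => {
    const entityId = uuid.toLowerCase();
    const entityType = entityTypeOf(entityId);
    const key = `${entityType}:${entityId}`;
    const known = next.entityToRef[key];
    const record = known === undefined ? undefined : next.refs[known];
    if (known !== undefined && record !== undefined) {
      next.refs[known] = { ...record, lastSeenAt: now }; // same UUID, same token
      return known;
    }
    let token = mintRefToken();
    while (next.refs[token] !== undefined) token = mintRefToken(); // avoid collisions
    next.refs[token] = { entityType, entityId, createdAt: now, lastSeenAt: now };
    next.entityToRef[key] = token;
    return token;
  };

  const walk = (node: unknown): unknown => {
    if (typeof node === "string") return UUID_RE.test(node) ? toRef(node) : node;
    if (Array.isArray(node)) return node.map((item) => walk(item));
    if (node !== null && typeof node === "object") {
      const source = node as Record<string, unknown>;
      const out: Record<string, unknown> = {};
      for (const key of Object.keys(source)) out[key] = walk(source[key]);
      return out;
    }
    return node;
  };

  return { value: walk(value) as T, registry: next };
}
```

Returning a fresh registry instead of mutating the input keeps the pass easy to test, and the caller decides when to persist it. UUIDs embedded inside longer strings or used as object keys would slip through this version, so a real pass would need to decide how to handle those.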

On the way in (model → system): when the model calls a tool with a ref_* value, a proxyForSystem pass turns it back into the real UUID before your handler ever lays eyes on it.

function proxyForSystem<T>(
  value: T,
  registry: EntityRefRegistry
): { value: T; unresolvedRefs: string[]; rawUuids: string[] }

Notice it hands back two lists rather than throwing a tantrum. unresolvedRefs catches refs the model invented that aren't in the registry. rawUuids catches the times it bypassed the whole scheme and pasted a real UUID straight in — which happens more than you'd hope, especially early on when some old prompt or context snippet leaked raw IDs in before the proxy was watching the door.
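To make the two lists concrete, here is a rough sketch of an inbound pass under the same assumptions as before: plain JSON-like input, and a hypothetical `EntityRefRecord` shape. The `REF_RE` pattern is deliberately looser than the schema pattern so that anything ref-shaped gets checked against the registry.

```typescript
// Hypothetical record shape; field names follow the comment on the registry type.
type EntityRefRecord = {
  entityType: string;
  entityId: string;
  createdAt: string;
  lastSeenAt: string;
};

type EntityRefRegistry = {
  version: 1;
  refs: Record<string, EntityRefRecord>;
  entityToRef: Record<string, string>;
};

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
// Anything that claims to be a ref, whatever its length.
const REF_RE = /^ref_[A-Za-z0-9]+$/;

function proxyForSystem<T>(
  value: T,
  registry: EntityRefRegistry
): { value: T; unresolvedRefs: string[]; rawUuids: string[] } {
  const unresolvedRefs: string[] = [];
  const rawUuids: string[] = [];

  const walk = (node: unknown): unknown => {
    if (typeof node === "string") {
      if (UUID_RE.test(node)) {
        rawUuids.push(node); // the model bypassed the ref scheme
        return node;
      }
      if (REF_RE.test(node)) {
        const record = registry.refs[node];
        if (record === undefined) {
          unresolvedRefs.push(node); // ref-shaped, but never issued in this session
          return node;
        }
        return record.entityId; // a genuine ref: hand the handler the real UUID
      }
      return node;
    }
    if (Array.isArray(node)) return node.map((item) => walk(item));
    if (node !== null && typeof node === "object") {
      const source = node as Record<string, unknown>;
      const out: Record<string, unknown> = {};
      for (const key of Object.keys(source)) out[key] = walk(source[key]);
      return out;
    }
    return node;
  };

  return { value: walk(value) as T, unresolvedRefs, rawUuids };
}
```

Collecting every problem in one walk means the model can be told about all of its bad references in a single error, rather than fixing them one retry at a time.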

The tool wrapper checks both lists before it runs anything:

// `tool` is the original, unwrapped tool (the name is illustrative).
execute: async (input, ...rest) => {
  // Inbound: resolve ref_* tokens back to real UUIDs.
  const systemInput = proxyForSystem(input, registryRef.current);

  if (systemInput.unresolvedRefs.length > 0) {
    return toolInputError(
      `Unknown model reference(s): ${systemInput.unresolvedRefs.join(", ")}.` +
      ` Use a ref returned earlier in this chat, or search again for a fresh ref.`
    );
  }

  if (systemInput.rawUuids.length > 0) {
    return toolInputError(
      `Raw UUIDs are not accepted: ${systemInput.rawUuids.join(", ")}.` +
      ` Use the corresponding ref from earlier in this chat.`
    );
  }

  // Call the original handler, not this wrapper.
  const result = await tool.execute(systemInput.value, ...rest);
  // Outbound: swap UUIDs in result for refs
  const modelResult = proxyForModel(result, registryRef.current);
  registryRef.current = modelResult.registry;
  return modelResult.value;
}

Here's the bit that matters most: these are recoverable errors. The model gets to read them and try again, in the same conversation, without anything blowing up. This is how the harness teaches a weak model on the fly — not by being cleverer, but by catching the mistake and handing back a specific, actionable correction.

The simplest way is to return a structured error as the tool's result instead of the normal output. toolInputError here is just a small helper of mine that returns a plain object with a message in it:

function toolInputError(message: string) {
  return { ok: false, error: message };
}

The model reads that like any other tool output, goes "ah, my mistake," and tries again. It works purely because the message text spells out what to do next — no magic, the model is just reading English. Which means, for a small model, the wording of that error is part of your prompt engineering. "Invalid input" gets you nowhere; "Unknown ref ref_Q7zXm1 — search again to get a fresh one" gets you a correct retry. Be specific.

(You can also achieve the same thing by throwing and letting your agent loop catch the error and feed it back as the next tool result — whichever your framework makes easiest. The one myth worth killing: a thrown tool error doesn't have to abort the turn. Whether it does is up to how your loop is wired, not some law of nature.)
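If your loop is hand-rolled, the throwing variant can be as small as the sketch below. `runToolCall` and the `ToolResult` shape are hypothetical names, not part of any framework; the only point is that the catch block turns the thrown message into an ordinary tool result the model can read.

```typescript
type ToolResult =
  | { ok: true; output: unknown }
  | { ok: false; error: string };

// Hypothetical helper: run one tool call and turn a throw into a readable result.
async function runToolCall(
  execute: (input: unknown) => Promise<unknown>,
  input: unknown
): Promise<ToolResult> {
  try {
    return { ok: true, output: await execute(input) };
  } catch (err) {
    // The message is what the model will read, so it should say what to do next.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}
```

The loop then appends the returned object as the tool result and carries on with the next step, so the turn continues instead of aborting.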

And the schemas get rewritten before the model ever sees them, too. Any { type: "string", format: "uuid" } field becomes:

{
  "type": "string",
  "pattern": "^ref_[A-Za-z0-9]{6,12}$",
  "description": "Use the model-visible ref_* value from chat/page context. Do not provide a raw UUID."
}

Worth being honest about what this does, though: that schema is a hint, not a fence. In JSON Schema, format is an annotation by default, so a vanilla validator won't necessarily enforce it. pattern is a real validation keyword, but it only bites if something actually validates the tool input against the schema; to the model itself, both are just guidance. The actual enforcement is my proxy layer rejecting anything that isn't a genuine ref. So the schema tells the model what good looks like; the harness is the thing that holds the line. Belt and braces, which is the whole game when the model is small.
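The schema rewrite itself can be a small recursive pass. The sketch below assumes tool inputs are described as plain JSON Schema objects; `rewriteUuidFields` and `REF_FIELD` are hypothetical names, and this version replaces the whole field, dropping any description the original had.

```typescript
type JsonSchema = Record<string, unknown>;

// The replacement field shown above, reused for every UUID-typed string.
const REF_FIELD: JsonSchema = {
  type: "string",
  pattern: "^ref_[A-Za-z0-9]{6,12}$",
  description:
    "Use the model-visible ref_* value from chat/page context. Do not provide a raw UUID.",
};

function rewriteUuidFields(schema: unknown): unknown {
  if (Array.isArray(schema)) return schema.map((item) => rewriteUuidFields(item));
  if (schema === null || typeof schema !== "object") return schema;

  const node = schema as JsonSchema;
  if (node["type"] === "string" && node["format"] === "uuid") {
    return { ...REF_FIELD }; // a copy, so callers can't mutate the shared constant
  }

  // Recurse into properties, items, anyOf, and any other nested schema.
  const out: JsonSchema = {};
  for (const key of Object.keys(node)) out[key] = rewriteUuidFields(node[key]);
  return out;
}
```

Because it recurses through every key, it also reaches UUID fields nested under `items` or `anyOf`. The flip side is that a `default` or `enum` value that happens to look like a UUID field definition would be rewritten too, which is an edge case worth guarding against in a real implementation.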

The entire tool set gets wrapped in a single call:

const modelVisibleTools = wrapToolSetWithRefProxy(yourTools, registryRef);

Your existing tools don't change at all. The proxy is transparent to them — which is the point. The harness should be something you bolt around your tools, not something you have to rewrite them for.

Bonus prize: the registry doubles as a complete audit trail. Every entity the model touched, when it first appeared, when it was last referenced — all sitting in the thread row. For a system that has to answer "what did the agent actually do," that fell out for free.

Pattern 2: mandatory tool contracts

The second failure mode was sneakier, and it's classic small-model behaviour. The model would breeze through a whole workflow and then skip the one step that actually commits the result — it'd narrate what it found, lay out a lovely conclusion, and then end the turn with prose instead of calling the tool that records the verdict.

Prompting did not reliably fix it. "You MUST call submit_verdict before ending" helps a bit, but it's a vibe, not a guarantee, and the smaller the model the more of a vibe it is. Long multi-step runs drift from earlier instructions. Context compaction eats your constraints. So instead of nagging, I made the endTurn tool aware of what has to have happened before it's allowed to succeed.

Workflows declare their required tools right in the frontmatter:

---
id: control-assessment
title: Control Assessment
mandatory_tools:
  - name: search_evidence
    minCalls: 3
    onTooFew: "You must search for evidence at least 3 times before concluding."
  - name: submit_assessment
    onMissing: "You must call submit_assessment before ending this turn."
---

The runner normalizes these into rules and attaches them to the conversation. Before the stream even starts, it also replays the prior message history to count any mandatory calls that already happened in earlier turns — so a tool called back in turn 2 still counts when you're in turn 5.
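As a rough illustration of those two steps, here is one way the normalization and the history replay could look. The `FrontmatterRule`, `MandatoryToolRule`, and `PriorToolResult` shapes are assumptions, as is treating a rule with no `minCalls` as "at least once".

```typescript
// Hypothetical shapes: a frontmatter entry as written, and the normalized rule.
type FrontmatterRule = {
  name: string;
  minCalls?: number;
  onTooFew?: string;
  onMissing?: string;
};
type MandatoryToolRule = { name: string; minCalls: number; message: string };

function normalizeRules(raw: FrontmatterRule[]): MandatoryToolRule[] {
  return raw.map((entry) => {
    const minCalls = entry.minCalls ?? 1; // assumption: no minCalls means "at least once"
    const fallback =
      `You must call ${entry.name} at least ${minCalls} time(s) before ending this turn.`;
    return {
      name: entry.name,
      minCalls,
      message: entry.onTooFew ?? entry.onMissing ?? fallback,
    };
  });
}

// Hypothetical history entry: one per tool result already stored on the thread.
type PriorToolResult = { toolName: string; isError: boolean };

function countPriorCalls(
  rules: MandatoryToolRule[],
  history: PriorToolResult[]
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const rule of rules) counts[rule.name] = 0;

  for (const result of history) {
    // Only tools under contract are tracked, and failed calls never count.
    const tracked = Object.prototype.hasOwnProperty.call(counts, result.toolName);
    if (!tracked || result.isError) continue;
    counts[result.toolName] = (counts[result.toolName] ?? 0) + 1;
  }
  return counts;
}
```

Seeding the counts from history before the stream starts is what lets a call from an earlier turn satisfy the contract in a later one.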

Enforcement comes in two layers. First, every mandatory tool's execute gets wrapped with a counter: a successful call ticks it up, an errored call (including the ref-proxy rejections above) does not. A failed attempt doesn't get to check the box.
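The counting wrapper might look something like this sketch. It assumes tools signal failure either by returning an object with `ok: false` or `isError: true` (the two error shapes used in this post) or by throwing; `withCallCounter` and `isErrorResult` are hypothetical names.

```typescript
type ToolExecute = (input: unknown) => Promise<unknown>;
type CallCounts = Record<string, number>;

// Treats both returned error shapes from this post as failures.
function isErrorResult(result: unknown): boolean {
  if (result === null || typeof result !== "object") return false;
  const candidate = result as { ok?: unknown; isError?: unknown };
  return candidate.ok === false || candidate.isError === true;
}

function withCallCounter(
  name: string,
  execute: ToolExecute,
  counts: CallCounts
): ToolExecute {
  return async (input) => {
    // If execute throws, the error propagates and the counter is never touched.
    const result = await execute(input);
    if (!isErrorResult(result)) {
      counts[name] = (counts[name] ?? 0) + 1;
    }
    return result;
  };
}
```

The counter wraps the ref-proxied tool, not the raw one, so a ref-proxy rejection comes back as an error result and leaves the count where it was.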

Second, the endTurn tool gets a bouncer on the door:

if (name === "endTurn") {
  const violations = validateMandatoryToolCounts(
    state.rules,
    state.counts
  ).filter(violation => violation.rule.name !== "endTurn");

  if (violations.length > 0) {
    return mandatoryToolErrorResult(violations);
  }
}

mandatoryToolErrorResult returns a structured (again, returned, not thrown) error:

{
  isError: true,
  message: "You cannot end this turn yet. Required tool \"submit_assessment\" has not been called...",
  missingMandatoryTools: ["submit_assessment"],
  violations: [{ toolName, count, minCalls, message }]
}

The model reads it, realises it hasn't held up its end, and calls the missing tool. With Haiku the self-correction rate here was genuinely high — because the message is specific enough that there's no thinking required, just following.
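For completeness, here is a guess at what the two helpers could look like. Only the names `validateMandatoryToolCounts` and `mandatoryToolErrorResult` and the result's field names come from the snippets above; the `MandatoryToolRule` and `Violation` shapes and the exact message wording are assumptions.

```typescript
// Hypothetical normalized rule and violation shapes.
type MandatoryToolRule = { name: string; minCalls: number; message: string };
type Violation = { rule: MandatoryToolRule; count: number };

function validateMandatoryToolCounts(
  rules: MandatoryToolRule[],
  counts: Record<string, number>
): Violation[] {
  return rules
    .map((rule) => ({ rule, count: counts[rule.name] ?? 0 }))
    .filter((violation) => violation.count < violation.rule.minCalls);
}

function mandatoryToolErrorResult(violations: Violation[]) {
  return {
    isError: true,
    // Lead with the refusal, then spell out each unmet requirement.
    message:
      "You cannot end this turn yet. " +
      violations.map((violation) => violation.rule.message).join(" "),
    missingMandatoryTools: violations.map((violation) => violation.rule.name),
    violations: violations.map((violation) => ({
      toolName: violation.rule.name,
      count: violation.count,
      minCalls: violation.rule.minCalls,
      message: violation.rule.message,
    })),
  };
}
```

Reusing each rule's own frontmatter message in the error keeps the wording under the workflow author's control, which matters when the message is doing the teaching.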

And there's a second lock on the same door. The stop condition itself is gated, so even a stray endTurn won't actually halt the loop. It's a small enough check to just write by hand — whatever your agent loop is, it almost certainly calls some "should I stop now?" predicate after each step. Mine only lets the loop stop once the model has called endTurn and every mandatory tool has hit its count:

const MAX_STEPS = 200;

// Called by the agent loop after each step to decide whether to stop.
function shouldStop(steps: Step[]): boolean {
  if (steps.length >= MAX_STEPS) return true; // hard safety cap, no matter what

  const lastStep = steps[steps.length - 1];
  const calledEndTurn = lastStep.toolCalls.some((c) => c.toolName === "endTurn");
  if (!calledEndTurn) return false; // not trying to end — keep going

  // The model wants to end. Only allow it if the contract is satisfied.
  const counts = countToolCalls(steps);
  return validateMandatoryToolCounts(rules, counts).length === 0;
}

If the model somehow sneaks endTurn past the wrapper, this just refuses to fire. The stream keeps going and it gets another crack at the contract. It isn't getting out until the work is done.

And for the genuinely grim cases — a run that dies on the step limit or a network blip without ever satisfying the contract — the runner checks on finish and can replay the turn, this time forcing the very next step to call the missing tool directly (most agent loops let you pin which tool runs next). That's the last line of defence: the harness physically walking the model to the till.
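The decision the runner makes on finish can be sketched as a pure function. Everything here is hypothetical (`planRecovery`, the `Recovery` shape, and the idea of forcing one tool per replay); how the forced tool is actually pinned depends on your agent loop's tool-choice setting.

```typescript
type MandatoryToolRule = { name: string; minCalls: number };

// Either the contract is met, or the turn is replayed with one tool pinned.
type Recovery =
  | { action: "done" }
  | { action: "replay"; forceTool: string };

function planRecovery(
  rules: MandatoryToolRule[],
  counts: Record<string, number>
): Recovery {
  for (const rule of rules) {
    const count = counts[rule.name] ?? 0;
    if (count < rule.minCalls) {
      // Force the first unmet tool; later ones get picked up on the next check.
      return { action: "replay", forceTool: rule.name };
    }
  }
  return { action: "done" };
}
```

Forcing one tool at a time keeps the replay simple: each forced step either satisfies a rule or fails loudly, and the same check runs again afterwards.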

The thread running through both: recover, don't demand perfection

If you squint, both patterns are the same move. Neither one assumes the model is good. Both assume it'll fumble and build the catch around it.

The ref proxy says: the model cannot get a raw or invented ID past the proxy, and if it tries, it gets a precise nudge and another go. The mandatory contract says: the model cannot end before the work is committed, and if it tries, it gets a precise nudge and another go. The capability that used to live in the model ("remember to use real IDs," "remember to commit") now lives in the harness as something structural. The model just has to follow rails, and following rails is something even a small model is pretty good at.

This is also why the error wording isn't an afterthought. For a small model the recoverable error is the teaching signal. You're not arguing with the model, you're handing it the next instruction at exactly the moment it's listening hardest.

What changed — and the bit that surprised me

Before: occasional hallucinated IDs slipping into the database, semi-regular turns that fizzled out in prose instead of a committed result, behaviour that wobbled with context length. Death by a thousand "huh, that's odd"s.

After: hallucinated IDs stopped reaching the database. The model never sees a UUID to imitate, and an invented ref is rejected before any handler runs. Required-tool completion became something I could verify instead of hope for: not "the model said it was done" but "the counter hit the minimum." And the audit trail came free.

The genuinely surprising part was the cost side. I'd assumed reliable agent behaviour meant paying for a big model on every turn. A small, cheap model — Haiku 3 — carried the whole thing once the rails were in place, at a fraction of the per-turn cost and noticeably faster. The harness was a one-time build; the savings are every single request, forever.

The same harness makes a bigger model better too. Guardrails help across the board. The win isn't "make the frontier model slightly more reliable." It's that you stop needing the frontier model for a big chunk of the work.

The takeaway

I spent a long time treating the model as the variable I could turn up when things got unreliable. The thing I'd tell my earlier self is: turn up the harness instead.

There are two failure modes people lump together when they say "my agent won't follow instructions." One is behavioural — the model drifts, gets muddled, picks the wrong priority. That one genuinely is a prompting problem, and a bigger model genuinely helps. The other is structural — your "enforcement" is just text in a prompt that the model can and eventually will ignore. No amount of bold MUST saves you there, and no model is big enough to fully trust.

For the structural stuff you need structural fixes, and the nice surprise is that once you've built them, the model underneath can be small and cheap. Make hallucinated IDs invalid by construction. Make skipping a step mechanically impossible. Make every mistake recoverable with a clear nudge. Do that, and a model like Haiku 3 will quietly do work you assumed needed something four times the price.

The mental model I use now: if the correctness of a workflow depends on the model choosing to do the right thing, stop and ask whether the harness can make that choice for it. It usually can. And it's almost always cheaper than reaching for a smarter model to paper over the gap.

Top comments (2)

Harjot Singh • May 31

This is the proof of the whole thesis: you took a model that on its own is not a careful reasoner (Haiku 3) and made it reliable through scaffolding, which is the strongest possible demonstration that reliability is a property of the harness, not the model. Most people do the opposite (reach for the biggest model and hope), and they end up paying frontier prices for behavior they could have engineered around a cheap model. The double win here is the part I'd shout about: you didn't just make it reliable, you made it reliable AND cheap, because once the harness owns correctness (validation, bounded retries, verify-before-act, constrained tool calls), the model's job shrinks to the narrow thing small models are actually fine at. The frontier model was never buying you reliability, it was buying you a slightly higher chance of getting away without a harness, and that bet fails in production. Build the scaffold, run the cheap model. This is exactly the conviction behind Moonshift, harness over model, cost tracks the work. What did the most heavy lifting in your harness, the output validation, or constraining the tool calls so Haiku couldn't wander?

Tae Kim • Aug 1

The ref proxy solves the exact problem I ran into with a document-processing agent where the model hallucinated entity IDs from earlier in the conversation. One thing I would add from experience: the unresolved-ref error message is more effective when it tells the model what action to take next rather than just what went wrong — in my case switching from "unknown ref: X" to "unknown ref: X — call search_entity() to get a fresh ref" cut the retry failure rate in half because it removed the model's need to reason about the recovery path.