DEV Community

azena.ai

The reliability gap: what it actually takes to put an AI agent in production

A demo agent is easy. It calls a model, the model calls a tool, the tool returns something plausible, and everyone in the room nods. Then you put the same agent in front of real users, real data, and real money — and it quietly does the wrong thing 4% of the time. Nobody notices until a customer does.

That 4% is the reliability gap. It is the entire distance between a convincing demo and a system you can actually depend on, and almost nothing in the typical LLM tutorial prepares you for it.

Here is what closing that gap actually involves.

The three things that make agents hard

1. They are non-deterministic by construction. The same input can produce a different tool call tomorrow. Your regression intuition — "I didn't touch that code, so it still works" — is simply false. A prompt tweak three steps upstream can change a decision three steps downstream.

2. They fail silently. A traditional service throws. An agent confidently returns a wrong answer in the same shape as a right one. There is no stack trace for "the model misread the invoice total."

3. There is rarely a ground truth at runtime. When the agent decides, you usually cannot check the decision against an oracle in the moment. You only find out later, in aggregate, if you measured.

If you internalise nothing else: an agent is not a function you debug, it is a population you have to measure.
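To make "a population you have to measure" concrete, here is a minimal sketch that calls an agent many times on one prompt and reports how often each tool was chosen. The `fake_agent` function, its tool names, and the 4% error rate are hypothetical stand-ins for illustration; a real agent would replace it.

```python
import random
from collections import Counter
from typing import Callable


def tool_choice_distribution(
    agent: Callable[[str], str], prompt: str, runs: int = 50
) -> dict[str, float]:
    """Call the agent `runs` times on one prompt and return the share of each tool choice."""
    counts = Counter(agent(prompt) for _ in range(runs))
    return {tool: n / runs for tool, n in counts.items()}


# Hypothetical stand-in for a real agent: it picks the wrong tool
# with probability 0.04, mirroring the 4% figure in the text.
_rng = random.Random(0)


def fake_agent(prompt: str) -> str:
    return "issue_refund" if _rng.random() < 0.04 else "deny_refund"
```

Calling `tool_choice_distribution(fake_agent, "Refund request after 90 days", runs=500)` returns a dict mapping each observed tool name to its share of the runs. The point of the sketch is that the unit of inspection is the distribution, not any single call.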

Evals are the test suite you're missing

The single highest-leverage thing a team can build is an eval set — a collection of realistic inputs with known-good outcomes that you run on every change. Not "does it sound good," but "did it pick the right tool / extract the right field / refuse the out-of-scope request."

A useful eval set has three properties:

  • It is drawn from real traffic, not from your imagination. Log production interactions, sample the weird ones, and turn the failures into permanent test cases.
  • It scores behaviour, not vibes. "Selected the refund tool when the policy said deny" is checkable. "Was helpful" is not.
  • It runs in CI. A prompt change that lifts one metric and quietly drops another should fail the build before it ships, exactly like a unit test.
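As a rough illustration of those three properties, the sketch below scores tool selection on a tiny eval set and reports regressions against the last accepted run. The case names, tool names, and `toy_agent` are invented for the example; in practice the cases would come from logged traffic and the agent would be the real one.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    name: str
    user_input: str
    expected_tool: str  # the known-good behaviour for this input


def run_evals(agent: Callable[[str], str], cases: Iterable[EvalCase]) -> dict[str, bool]:
    """Score behaviour: did the agent pick the expected tool for each case?"""
    return {case.name: agent(case.user_input) == case.expected_tool for case in cases}


def regressions(results: dict[str, bool], previously_passing: set[str]) -> list[str]:
    """Names of cases that passed on the last accepted run and fail now."""
    return sorted(name for name in previously_passing if not results.get(name, False))


def ci_exit_code(agent, cases, previously_passing: set[str]) -> int:
    """Return 1 (fail the build) if any previously passing case now fails, else 0."""
    return 1 if regressions(run_evals(agent, cases), previously_passing) else 0


# Hypothetical eval set; real cases would be sampled from production logs.
CASES = [
    EvalCase("refund_outside_policy", "I want a refund, I bought this 90 days ago", "deny_refund"),
    EvalCase("refund_inside_policy", "I want a refund, I bought this yesterday", "issue_refund"),
    EvalCase("out_of_scope", "Write me a poem about my cat", "refuse"),
]


def toy_agent(user_input: str) -> str:
    """Deterministic stand-in for a model-backed agent."""
    text = user_input.lower()
    if "refund" not in text:
        return "refuse"
    return "issue_refund" if "yesterday" in text else "deny_refund"
```

A CI step could then call `sys.exit(ci_exit_code(agent, CASES, previously_passing))`, so a change that breaks a case that used to pass stops the build even if the overall pass rate went up.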

This is the part most teams skip, and it is the part that separates an agent you can iterate on from one you are afraid to touch. I wrote up the failure modes in more detail here: why AI agents fail in production and what evals have to do with it.

Guardrails constrain the action space, not the prose

A common mistake is to treat reliability as a prompting problem — add another paragraph of "you must never…" and hope. Prompts are persuasion, not enforcement.

Real guardrails live in code, around the model:

  • Allow-list the tools available in each state. An agent in a "read-only support" state should not have a delete_account tool in scope at all. Don't ask it nicely — don't hand it the gun.
  • Validate every tool call against a schema and against business rules before execution. The model proposes; deterministic code disposes.
  • Bound the loop. Max steps, max spend, max retries. An agent with an unbounded loop and a credit card is an incident waiting for a date.
  • Make refusal a first-class outcome. "I don't have enough information, escalating to a human" is a success, not a failure, and your evals should reward it.
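The first two bullets can be sketched as a check that sits between the model's proposed tool call and its execution. The state names, tool names, and the refund limit below are assumptions made up for the example, not a prescribed schema.

```python
from dataclasses import dataclass

# Hypothetical per-state allow-lists: a tool absent from the set is simply not callable.
ALLOWED_TOOLS = {
    "read_only_support": {"lookup_order", "escalate_to_human"},
    "account_admin": {"lookup_order", "issue_refund", "delete_account", "escalate_to_human"},
}
MAX_REFUND_EUR = 200.0  # hypothetical business rule


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


class GuardrailViolation(Exception):
    """Raised when a proposed tool call must not be executed."""


def check_tool_call(state: str, call: ToolCall) -> ToolCall:
    """Return the call unchanged if it is allowed; raise GuardrailViolation otherwise."""
    # 1. Allow-list: is this tool in scope for the current state at all?
    if call.name not in ALLOWED_TOOLS.get(state, set()):
        raise GuardrailViolation(f"{call.name!r} is not available in state {state!r}")
    # 2. Schema and business rules for the tools that need them.
    if call.name == "issue_refund":
        amount = call.args.get("amount_eur")
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            raise GuardrailViolation("issue_refund needs a numeric amount_eur")
        if not 0 < amount <= MAX_REFUND_EUR:
            raise GuardrailViolation(f"refund of {amount} is outside the allowed range")
    return call
```

With this in place, `check_tool_call("read_only_support", ToolCall("delete_account", {}))` raises regardless of what the prompt said, which is the difference between enforcement and persuasion.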

The mental model: the LLM is the planner, but the runtime is the adult in the room.
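One way the runtime can play that role is a loop that owns the step and spend limits and treats escalation as a normal return value. This is a sketch under assumptions: `plan_next` stands in for one model-plus-tool step, and the budget numbers are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Budget:
    max_steps: int = 8
    max_spend_usd: float = 0.50


@dataclass
class Step:
    cost_usd: float
    final_answer: Optional[str] = None  # None means the agent wants to keep going


def run_bounded(plan_next: Callable[[int], Step], budget: Budget) -> tuple[str, str]:
    """Run the agent until it finishes or a limit is hit.

    Returns (status, detail), where status is "done" or "escalated".
    """
    spent = 0.0
    for step_index in range(budget.max_steps):
        step = plan_next(step_index)
        spent += step.cost_usd
        if spent > budget.max_spend_usd:
            return "escalated", f"spend limit hit after {step_index + 1} steps"
        if step.final_answer is not None:
            return "done", step.final_answer
    return "escalated", f"step limit of {budget.max_steps} reached"
```

Because "escalated" is an ordinary result rather than an exception, callers and evals can count it as a valid outcome instead of a crash.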

Human-in-the-loop is an architecture, not an apology

There is a persistent fantasy that "fully autonomous" is the goal and a human checkpoint is a temporary crutch. For anything with legal, financial, or safety weight, that is backwards. The human checkpoint is the design.

The interesting engineering question is not whether a human reviews, but where. You want the agent to do the 90% that is mechanical (gather, draft, structure, pre-fill) and route the 10% that carries liability to a person, with the full context assembled so the review takes seconds, not minutes. That's the difference between automation that scales and automation that creates a new bottleneck.
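A minimal sketch of that routing decision, assuming the agent's output is tagged with a kind and that the set of liability-carrying kinds is maintained by people. The kind names and the two in-memory queues are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    kind: str   # e.g. "faq_reply", "refund", "contract_change"
    body: str
    context: dict = field(default_factory=dict)  # what a reviewer needs to decide quickly


# Hypothetical set of kinds that carry liability and always need a person.
NEEDS_HUMAN = {"refund", "contract_change"}


def route(draft: Draft, review_queue: list, outbox: list) -> str:
    """Send mechanical work straight out; queue liability-carrying work for review."""
    if draft.kind in NEEDS_HUMAN:
        review_queue.append(draft)  # context travels with the draft
        return "queued_for_review"
    outbox.append(draft)
    return "sent"
```

The design choice worth noting is that the checkpoint is decided by deterministic code on the draft's kind, not by asking the model whether it feels confident.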

We unpack where to draw that line — chatbot vs. agent, and which workflows should never be fully autonomous — here: agentic AI without the autonomy theatre.

Where agents should not go

Honesty is a feature. Some boundaries are not optimisation problems:

  • Anything where a hallucinated fact becomes a liability (a legal citation, a medical dosage, a contractual figure) needs a deterministic source of record and a human signature — not a more confident model.
  • Anything irreversible should be gated behind an explicit confirmation that a person, not the agent, owns.
  • Anything touching regulated or personal data should be designed for data control from day one — which European model and infrastructure you run on is a real architectural choice, not an afterthought.

Saying "an agent is the wrong tool here" out loud is one of the most senior things an engineer building these systems can do.

The unglamorous summary

Reliable agents are less about a clever prompt and more about boring infrastructure: a real eval set wired into CI, guardrails enforced in code, bounded loops, and a deliberate human checkpoint exactly where the stakes are. None of it is exciting. All of it is the difference between a demo and a system.

If you're a small or mid-sized team that wants agents in production but doesn't have an in-house ML platform team to build that scaffolding, that gap is exactly the thing a focused engineering partner exists to close — that's the work we do at azena, an EU AI boutique: bespoke systems, evaluated, with the guardrails and the data-control decisions made on purpose.

Build the eval set first. Everything else gets easier once you can measure.

Top comments (1)

Ahmet Özel

The 4% framing is the right one. One thing I would add from shipping these: the 4% is rarely uniform. It clusters around a few input shapes: ambiguous queries, very long contexts, or tool errors that get retried into a worse state. Once you log enough real traces you can usually name the clusters, and then the fix is not a stronger model but a guard or a fallback for that specific shape. The non-determinism point also means evals have to be distributional: run the same case N times and look at the pass rate, because a single green run tells you almost nothing.