What actually breaks when you put AI agents in production

#ai #llm #agents #softwareengineering

Demos lie. Getting an AI agent to book a meeting, query an API, and summarize the result in a slick demo is maybe 20% of the work. The other 80% is everything that happens when the same agent meets a real user, real data, and a Tuesday afternoon when an upstream API is having a bad day.

We build multi-agent systems for companies for a living, and the gap between "works in the notebook" and "works in production" is where most AI projects quietly die. Here are the failure modes we see most often — and what we actually do about them.

1. The agent is confidently wrong, and nothing catches it

A single LLM call has no idea when it's hallucinating. Chain three of them together and the errors compound: agent A invents a customer ID, agent B dutifully looks it up, agent C writes a confident summary about a customer who doesn't exist.

The fix isn't a better prompt. It's treating every agent output as untrusted input — the same discipline you'd apply to a form field from the public internet. Validate structured outputs against a schema. Make tools return typed results, not prose. And put a deterministic check between "the model decided X" and "X happened in your database."
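As a rough illustration of that discipline, here is a minimal sketch using only the Python standard library. The refund scenario, the `RefundRequest` shape, and the `KNOWN_CUSTOMER_IDS` set are all hypothetical stand-ins for illustration; a real system would check against its own schema and data store.

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for a real customer store; in production this
# would be a database lookup, not an in-memory set.
KNOWN_CUSTOMER_IDS = {"cust_001", "cust_002"}


class AgentOutputError(ValueError):
    """Raised when an agent's output fails validation."""


@dataclass(frozen=True)
class RefundRequest:
    customer_id: str
    amount_cents: int


def parse_refund_request(raw: str) -> RefundRequest:
    """Treat the model's text as untrusted input: parse, type-check, verify."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise AgentOutputError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise AgentOutputError("expected a JSON object")

    customer_id = data.get("customer_id")
    amount = data.get("amount_cents")

    if not isinstance(customer_id, str):
        raise AgentOutputError("customer_id must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        raise AgentOutputError("amount_cents must be an integer")
    if amount <= 0:
        raise AgentOutputError("amount_cents must be positive")

    # Deterministic check: the ID has to exist in our records.
    # The model saying so is not evidence.
    if customer_id not in KNOWN_CUSTOMER_IDS:
        raise AgentOutputError(f"unknown customer_id: {customer_id!r}")

    return RefundRequest(customer_id=customer_id, amount_cents=amount)
```

Downstream agents and tools only ever receive a `RefundRequest`, never the raw model text, so an invented customer ID fails at the boundary instead of three steps later.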

[REAL EXAMPLE → a time a validation layer caught a bad agent action before it hit prod, and what the check was]

2. There's no boundary between "thinking" and "doing"

The scariest agent bugs aren't wrong answers — they're wrong actions. An agent with write access to your systems and a fuzzy objective will eventually do something irreversible.

We draw a hard line between read and write. Agents can plan, retrieve, and propose freely. Anything that mutates state, whether it sends an email, charges a card, or updates a record, goes through a narrow, audited gate with explicit guardrails, and high-stakes actions often require a human approval step as well. It's the difference between an assistant and a loose cannon.
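One way such a gate could look, sketched with the standard library only. The names here (`ProposedAction`, `WriteGate`, `approved_by`) are illustrative assumptions, not a reference to any particular framework, and a real gate would persist its audit log rather than keep it in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ProposedAction:
    """What an agent is allowed to produce: a proposal, not a side effect."""
    name: str
    args: dict[str, Any]


@dataclass
class WriteGate:
    """The only path by which a proposal can mutate state."""
    handlers: dict[str, Callable[..., Any]]   # allowlist of write actions
    needs_approval: set[str]                  # high-stakes action names
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def _audit(self, action: ProposedAction,
               approved_by: Optional[str], outcome: str) -> None:
        self.audit_log.append({
            "action": action.name,
            "args": action.args,
            "approved_by": approved_by,
            "outcome": outcome,
        })

    def execute(self, action: ProposedAction,
                approved_by: Optional[str] = None) -> Any:
        # Anything not explicitly allowlisted is refused.
        if action.name not in self.handlers:
            self._audit(action, approved_by, "rejected")
            raise PermissionError(f"action not allowed: {action.name}")
        # High-stakes actions wait for a named human approver.
        if action.name in self.needs_approval and approved_by is None:
            self._audit(action, approved_by, "blocked")
            raise PermissionError(f"approval required: {action.name}")
        result = self.handlers[action.name](**action.args)
        self._audit(action, approved_by, "executed")
        return result
```

The agent never holds a reference to the handlers themselves. It can only emit a `ProposedAction`, and every attempt, whether allowed or refused, leaves an audit entry.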

3. You can't see what the agent did

When a multi-agent run fails, "the AI messed up" is not a debuggable statement. If you can't replay exactly which agent called which tool with which arguments and got which result, you're flying blind — and you will be flying blind at 2am when a customer complains.

Tracing is not optional. Every agent step, every tool call, and every input and output gets logged so the run can be replayed. Building observability before you scale the agents prevents more production incidents than any model upgrade.
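A minimal sketch of what per-step tracing can look like, assuming tools are plain Python callables invoked with keyword arguments. The `Tracer` class and its field names are hypothetical; in practice you would likely send these events to whatever tracing backend you already run.

```python
import json
import time
import uuid
from typing import Any, Callable, Optional


class Tracer:
    """Records which agent called which tool, with what, and what came back."""

    def __init__(self, run_id: Optional[str] = None) -> None:
        self.run_id = run_id or uuid.uuid4().hex
        self.events: list[dict[str, Any]] = []
        self._next_step = 0

    def call_tool(self, agent: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
        event: dict[str, Any] = {
            "run_id": self.run_id,
            "step": self._next_step,
            "agent": agent,
            "tool": tool.__name__,
            "args": kwargs,
        }
        self._next_step += 1
        started = time.perf_counter()
        try:
            result = tool(**kwargs)
            event["result"] = result
            return result
        except Exception as exc:
            # Failures are recorded too, then re-raised unchanged.
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_s"] = time.perf_counter() - started
            self.events.append(event)

    def dump_jsonl(self) -> str:
        """One JSON object per line; non-serializable values fall back to repr."""
        return "\n".join(json.dumps(e, default=repr) for e in self.events)
```

Because each event stores the exact arguments and the result or error, a failed run can be walked through step by step afterwards, or fed back into the same tools to reproduce it.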

4. Cost and latency are features, not afterthoughts

A multi-agent system that calls the biggest model for every step is both slow and expensive, and users feel both. The engineering work is matching the model to the job: a small fast model for routing and classification, the big one only where reasoning genuinely pays for itself. Cache aggressively. Set hard timeouts. Budget tokens like you'd budget any other resource.
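The pieces above can be sketched in a few small helpers. Everything named here is an assumption for illustration: the model names are placeholders, the task labels are made up, and `call_model` stands in for whatever client function you actually use.

```python
import concurrent.futures
from functools import lru_cache
from typing import Any, Callable

# Placeholder model names and task labels, not real products.
MODEL_FOR_TASK = {
    "route": "small-fast-model",
    "classify": "small-fast-model",
    "reason": "large-model",
}


def pick_model(task: str) -> str:
    """Match the model to the job; unknown tasks default to the cheap model."""
    return MODEL_FOR_TASK.get(task, "small-fast-model")


class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    """A hard per-run token ceiling, checked before spending."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def spend(self, tokens: int) -> None:
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"{self.used + tokens} > {self.limit} tokens")
        self.used += tokens


def make_cached(call_model: Callable[[str, str], str]) -> Callable[[str, str], str]:
    """Wrap a (model, prompt) -> text function so identical calls hit a cache."""
    @lru_cache(maxsize=1024)
    def cached(model: str, prompt: str) -> str:
        return call_model(model, prompt)
    return cached


def call_with_timeout(fn: Callable[..., Any], timeout_s: float, *args: Any) -> Any:
    """Stop waiting after timeout_s seconds.

    Raises concurrent.futures.TimeoutError on expiry. The worker thread is
    not killed; this only bounds how long the caller waits.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    finally:
        executor.shutdown(wait=False)
```

Caching on the exact `(model, prompt)` pair only pays off when prompts repeat verbatim and the answer is allowed to be stale, so it fits routing and classification better than open-ended generation.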

[REAL EXAMPLE → a before/after on latency or cost when you right-sized models for a real client]

5. "Done" is undefined, so it's never done

The most common reason an AI project stalls isn't technical — it's that nobody agreed what success looks like before building. "Add AI agents" is not a spec. "Cut average handle time on tier-1 tickets by 30% without raising escalations" is.

We define the success metric before writing a line of code. It tells you which failures matter, when to ship, and whether the thing is actually working — instead of shipping an impressive demo that moves no real number.

The pattern underneath all of these

None of this is exotic AI knowledge. It's ordinary production engineering — input validation, least-privilege access, observability, resource budgets, clear requirements — applied to a new and unusually unpredictable component. The teams that ship reliable agents aren't the ones with the cleverest prompts. They're the ones who treat the LLM as one more untrusted, expensive, non-deterministic dependency and engineer around it accordingly.

That's the lens we bring from years of building systems at scale before this wave: the AI is new, but the discipline that makes it production-grade is not. If you're moving an agent system from demo to production and want a second set of eyes from a team that does this daily, that's the kind of multi-agent AI work we do at Krazimo.

What failure modes have bitten you in production? We'll answer questions in the comments.