Viansh
AI Agents in Production: The Hardest Part Isn't the Model

Everyone's talking about AI Agents. But most teams I speak to are still figuring out where to even begin — and the ones that have started are hitting walls they didn't anticipate.

I've spent the last several months building and debugging agentic workflows in production. Here's the honest truth I wish someone had told me earlier.

The hardest part isn't the model

When engineers first hear "AI Agents," the mental model is usually: pick a powerful LLM, give it some tools, and let it run. The model does the heavy lifting, right?

Wrong.

The hardest part is the glue — the orchestration layer, the state management, the retry logic, the error handling, and the observability. The LLM is the easy part. Everything around it is where production systems live or die.

Where agents actually fail

Here's what I've seen break in real systems:

Infinite loops — An agent hits a wall, retries the same tool call, gets the same error, and loops indefinitely. Without explicit loop detection or max-step limits, you're burning tokens and time with zero output.

Silent failures — The agent "succeeds" but the result is wrong. No exception was raised, no alert was fired. Without structured output validation and logging, you won't even know.

Context blowout — Long-running agents accumulate context fast. Past a certain point, the model starts ignoring early instructions or losing track of the task entirely. Managing what goes in and out of context is an active engineering problem, not an afterthought.
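The first failure mode is the cheapest to guard against. Here is a minimal sketch of a runner with a step budget and repeated-call detection; `step_fn` and the `(tool_name, args)` action tuple are hypothetical stand-ins for whatever your framework actually returns:

```python
# Loop guard for an agent runner: caps total steps and detects an agent
# re-issuing an identical tool call (the classic retry-forever failure).
MAX_STEPS = 20

def run_agent(step_fn, max_steps=MAX_STEPS):
    """Run an agent loop with a step budget and repeated-call detection.

    step_fn(step) returns a hashable action, e.g. ("search", "query"),
    or None when the agent decides it is finished.
    """
    seen_calls = set()
    for step in range(max_steps):
        action = step_fn(step)
        if action is None:                       # agent signalled completion
            return "done", step
        if action in seen_calls:                 # identical call seen before
            return "loop_detected", step
        seen_calls.add(action)
    return "step_budget_exhausted", max_steps
```

Exact-match detection is deliberately naive; in practice you may also want to fuzzy-match near-identical calls, but even this blunt version turns an infinite loop into a logged, bounded failure.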

Three things that actually worked

After many painful debugging sessions, here's what moved the needle for us:

1. Explicit state machines
Stop treating agents as black boxes. Model your agent as a state machine — define valid states, transitions, and terminal conditions. When something breaks, you know exactly where it broke.
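A minimal sketch of what that looks like; the state names and transition table here are illustrative, not taken from any particular framework:

```python
# Explicit state machine for an agent: valid states, legal transitions,
# and terminal conditions are declared up front, so an unexpected move
# fails loudly instead of silently drifting.
VALID_TRANSITIONS = {
    "planning":     {"calling_tool", "responding", "failed"},
    "calling_tool": {"planning", "failed"},
    "responding":   {"done"},
    "failed":       {"done"},
}
TERMINAL = {"done"}

class AgentStateMachine:
    def __init__(self):
        self.state = "planning"
        self.history = [self.state]          # full trace for debugging

    def transition(self, new_state):
        if self.state in TERMINAL:
            raise RuntimeError(f"agent already terminal in {self.state!r}")
        if new_state not in VALID_TRANSITIONS.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state!r} -> {new_state!r}")
        self.state = new_state
        self.history.append(new_state)
```

The `history` list is the payoff: when a run goes wrong, you get the exact sequence of states it passed through instead of a pile of raw LLM transcripts.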

2. Human-in-the-loop checkpoints
Not every action needs to be autonomous. For irreversible or high-stakes actions (sending emails, writing to databases, calling external APIs), add a confirmation step. The 2-second pause is worth it.
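One way to implement that gate, sketched here with a hypothetical tool registry and a pluggable `confirm` callback (defaulting to `input` for a CLI, but it could just as well post to Slack or a review queue):

```python
# Confirmation gate for high-stakes tool calls: irreversible actions
# pause for a human yes/no before executing.
HIGH_STAKES = {"send_email", "write_db", "call_external_api"}

def execute_tool(name, args, tools, confirm=input):
    """Run a tool, pausing for human confirmation on irreversible actions.

    tools maps tool names to callables; confirm takes a prompt string
    and returns the human's answer.
    """
    if name in HIGH_STAKES:
        answer = confirm(f"Agent wants to run {name}({args}). Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return {"status": "rejected", "tool": name}
    return {"status": "ok", "result": tools[name](**args)}
```

Rejections come back as structured results rather than exceptions, so the agent can report "a human declined this action" instead of crashing.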

3. Observability from day one
Log every tool call, every LLM response, every state transition. Use something like LangSmith, Weights & Biases, or even just structured JSON logs. If you can't replay what your agent did and why, you can't debug it — and you can't improve it.
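If you go the structured-JSON-logs route, the stdlib is enough to start. This is a minimal sketch; the event kinds and field names are conventions I'm making up for illustration, not a standard:

```python
# One JSON object per line, one line per agent event. Greppable,
# machine-parseable, and enough to replay what an agent did and when.
import json
import sys
import time

def log_event(kind, stream=sys.stderr, **fields):
    """Emit a timestamped JSON record for one agent event."""
    record = {"ts": time.time(), "kind": kind, **fields}
    stream.write(json.dumps(record) + "\n")
    return record

# Usage: wrap every tool call, LLM response, and state transition.
# log_event("tool_call", tool="search", args={"q": "pricing page"})
# log_event("state_transition", frm="planning", to="calling_tool")
```

Once every event is a line of JSON, "replay what the agent did" is just reading the log back in order, and upgrading to LangSmith or W&B later is a matter of swapping the sink.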

Where we actually are

Let's be real: most "agentic" systems in production today are sophisticated prompt chains with some tool-use bolted on. That's not a criticism — it's genuinely useful. But it's not the autonomous reasoning loop the demos suggest.

And that's fine. Start simple. Build reliability before you build autonomy. A deterministic, debuggable agent that does one thing well is infinitely more valuable than a flashy agent that occasionally does everything and frequently does nothing.

What's next

The field is moving fast — multi-agent coordination, memory architectures, and better tool-use APIs are all maturing. But the foundational engineering discipline of building reliable systems? That never changes.

Master the boring stuff first. The exciting stuff will follow.


What's the biggest challenge you've hit building agents in production? I'd love to hear what's broken for your team — drop it in the comments.
