Francisco Humarang

Posted on Mar 13

The 9 production realities for AI agents: A guide to building reliable agents from a Flakestorm perspective.

#ai #testing #cicd

After months of building AI agents and even leading teams on them, I’ve seen firsthand how quickly the dream of autonomous intelligence can turn into a production nightmare.

There’s a lot of excitement about AI agents right now, and for good reason. The idea of systems that can reason, plan, and execute multi-step tasks feels like a leap forward. But beneath the surface, the reality of getting these agents to work reliably in production is far messier than most people acknowledge. From my experience, the common advice to "just start with a team" often skips over the fundamental engineering challenges.

Building agents isn’t just about chaining LLM calls; it's about dealing with non-deterministic systems in unpredictable environments. If you want to build agents that actually hold up, you need to face these nine production realities head-on. This isn't about doomsaying; it's about understanding the challenges that tools built for agent reliability, like Flakestorm, aim to address.

1. Hallucinated Responses are a Feature, Not a Bug

Let's be clear: LLMs will hallucinate. An agent, by its nature, amplifies this. If an LLM misinterprets a prompt or invents a fact, the agent might then base subsequent actions on that fabrication. I’ve seen agents confidently generate entire plans based on nonexistent data, leading to wasted API calls and incorrect outputs. You can't train this out entirely. You have to build systems that anticipate and either detect or recover from these moments of unreality.

2. Tool Timeouts Lead to Cascading Failures

Agents rely heavily on external tools—APIs, databases, web scrapers. What happens when one of those tools is slow, or worse, times out? The agent doesn't just stop. It might retry endlessly, consume excessive tokens, or get stuck in a loop, leading to a cascading failure across the entire workflow. A single flaky API call can derail a complex agentic task, making the whole system unreliable. Designing for robust tool interaction, including graceful degradation and smart retries, is non-negotiable.

3. Prompt Injection Attacks are Relentless

This isn't just a security vulnerability; it's a reliability issue. A malicious prompt injection can hijack an agent's intent, causing it to perform unintended actions, leak sensitive data, or simply break its operational flow. Indirect injection—where the malicious prompt comes from data retrieved by the agent itself—makes detection even harder. It's an ongoing battle to secure agent prompts against manipulation, and every new vector needs consideration.

4. Flaky Evals Make Progress Hard to Measure

How do you know if your agent is actually getting better? Traditional unit tests often fall short for complex, non-deterministic agent behaviors. "Flaky evals" are a common problem: an agent passes a test one minute and fails it the next, without any code changes. This makes it incredibly difficult to iterate and improve. You need evaluation strategies that account for variability and truly capture agent robustness, rather than just simple pass/fail metrics.

5. Autonomous Agents Go Off-Script (Unsupervised Behavior)

Granting an agent autonomy is like handing the keys to a teenager. You hope they make good choices, but you know there’s a chance they'll drive somewhere unexpected. Agents operating without constant human oversight can exhibit unsupervised behavior, burning through tokens, hitting rate limits, or getting stuck in expensive loops. This isn't malice; it's the natural outcome of a system exploring its environment in ways you didn't explicitly predict. Observability is key here, to understand why they went off-script.

6. Multi-Fault Scenarios Are Inevitable

It’s rare that just one thing goes wrong. In production, you'll encounter multi-fault scenarios—a tool times out while an LLM hallucinates and an external API returns unexpected data. Frameworks like LangChain, while powerful, can break down quickly under these combined stresses if not designed with extreme resilience in mind. Expecting perfect conditions is a fantasy; preparing for multiple concurrent failures is smart engineering.

7. Token Burn is a Real Operational Cost

Every LLM call costs money. An agent that gets stuck in a loop, retries too aggressively, or generates verbose, unnecessary output can quickly lead to "token burn"—excessive and often hidden operational costs. I’ve seen agent designs that looked brilliant on paper but became prohibitively expensive in practice due to inefficient token usage. This isn't just about efficiency; it's about making your agent economically viable to run.

8. Testing Agents in CI/CD is a Whole New Challenge

Traditional CI/CD pipelines aren't built for the non-deterministic nature of AI agents. Running comprehensive tests for agents in a continuous integration/delivery environment is complex. How do you consistently test multi-step reasoning, tool interactions, and error recovery? Agent stress testing and adversarial LLM testing become crucial to find breakpoints before they hit users. Building the right testing harness is a significant engineering effort.

9. AI Agent Observability is Your Lifeline

When an agent fails, you need to know why. Production LLM failures are often opaque. Was it the prompt? The tool output? An internal reasoning error? Without deep AI agent observability—logging every thought, every tool call, every output—debugging becomes a nightmare. You can’t fix what you can’t see. This means instrumenting your agents from the ground up to provide clear, actionable insights into their execution flow.

Building for Resilience

The promise of AI agents is huge, but their production reality is challenging. My experience has taught me that overlooking these complexities leads to frustration and unreliable systems. If you're building agents, don't just focus on the happy path. Design for failure, instrument for observability, and test for robustness across every one of these realities. Approaching agent development with a clear understanding of these hurdles is how you build systems that truly deliver value, rather than just breaking in production.

DEV Community