You ship an AI agent to production. It works perfectly in development. Three days later, at 2am, it silently starts returning garbage data. Your users are affected before you even know there's a problem.
This is not a model problem. It's a capability problem — and almost no one is talking about it.
The part of agent development nobody warns you about
When you build an AI agent, you focus on the model — the prompts, the reasoning chain, the output format. This makes sense. The model is where the magic is.
But in production, agents don't just reason. They act. They call tools, fetch data, validate information, and make decisions based on what those tools return. And those tools are connected to external services — APIs, registries, databases — that are entirely outside your control.
Here's the failure taxonomy that will eventually hit every agent in production:
Silent upstream failures. The company registry API your KYB agent depends on starts returning malformed responses. The model doesn't know this. It reasons confidently on bad data.
Schema drift. An external API you depend on changes its response format. Your agent keeps calling it. The data comes back, just different. Depending on how you handle this, you either get silent errors or an agent that produces outputs based on a field that no longer exists.
Latency spikes. The API your agent calls has a bad hour. Calls time out. Your agent either fails hard (if you handle it well) or hangs (if you don't).
Conditional availability. Some data sources work fine in Western Europe but time out consistently from US infrastructure. Your agent worked in testing. It breaks in production for a specific user segment.
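Several of these failure modes can be caught at the tool boundary with a defensive response check before the model ever sees the data. A minimal sketch, assuming a company-registry response — the field names and the `check_response` helper are illustrative, not from any real registry API:

```python
# Minimal shape-and-value validation at the tool boundary.
# All field names below are hypothetical examples.
EXPECTED_FIELDS = {"company_id", "status", "registered_name"}

def check_response(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    missing = EXPECTED_FIELDS - payload.keys()
    if missing:
        problems.append(f"schema drift: missing fields {sorted(missing)}")
    if payload.get("status") not in {"active", "dissolved", "unknown"}:
        problems.append(f"unexpected status value: {payload.get('status')!r}")
    return problems
```

Running a check like this on every tool response turns silent drift into a loggable event the agent can react to.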
The uncomfortable math
If your agent pipeline calls 5 external capabilities, and each has 99% uptime, your composite reliability is 0.99^5 = 95.1%. That's one failure every 20 calls — before you've written a single bug.
At 10 capabilities: 90.4%. One failure every 10 calls.
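The arithmetic is easy to reproduce: composite reliability is just the product of the individual uptimes.

```python
def composite_reliability(uptimes):
    """Probability that every capability in the chain succeeds."""
    p = 1.0
    for u in uptimes:
        p *= u
    return p

# Five capabilities at 99% each, then ten:
print(round(composite_reliability([0.99] * 5), 3))   # 0.951
print(round(composite_reliability([0.99] * 10), 3))  # 0.904
```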
This is just uptime. It doesn't account for schema drift, latency degradation, or partial failures (the API responds, but with stale or incorrect data).
The multi-step agentic pipelines people are building today compound this problem significantly. Every hop adds a new failure surface.
What actually needs to happen
The model quality problem in AI agents is mostly solved — or at least actively being worked on. The capability quality problem is not.
What "solved" looks like:
Continuous testing against known-answer fixtures. Not just "does the API respond" but "does it return the right answer." A sanctions check that returns a clean result for a known-flagged entity has a bug, even though it responds 200 OK. This requires ground truth data and a test suite that runs on a schedule, not just at deploy time.
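One way to express such a suite, sketched here with a hypothetical `sanctions_check` callable and made-up fixture entities — substitute your own capability client and ground truth:

```python
# Known-answer fixture suite, meant to run on a schedule (cron, CI job),
# not only at deploy time. The fixture entities are placeholders.
FIXTURES = [
    # (input entity, expected ground-truth answer: flagged or not)
    ("KNOWN FLAGGED ENTITY LTD", True),
    ("KNOWN CLEAN ENTITY GMBH", False),
]

def run_known_answer_suite(sanctions_check) -> list[str]:
    """Run every fixture through the capability and collect mismatches."""
    failures = []
    for entity, expected_flagged in FIXTURES:
        result = sanctions_check(entity)
        if result != expected_flagged:
            failures.append(
                f"{entity}: expected flagged={expected_flagged}, got {result}"
            )
    return failures
```

A non-empty return value is your alert, independent of whether the API returned 200.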
Separate quality profiles for code and data. A capability can have excellent code — good error handling, correct schema, fast execution — but depend on a data source that's stale, incomplete, or geographically inconsistent. These are different failure modes and need different remediation strategies.
Upstream awareness. When a capability fails, the failure classification matters. "Our code threw an exception" is different from "the upstream service returned 503" is different from "the upstream service returned 200 with an error buried in the response body." An agent that knows the difference can make better recovery decisions.
Execution guidance. Rather than a binary "works / doesn't work" quality signal, what agents actually need is: "here's the confidence level, here's the current reliability profile, here's whether you should proceed or fall back."
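As a sketch of what that richer signal could look like — the fields and thresholds here are illustrative assumptions, not an existing standard:

```python
from dataclasses import dataclass

@dataclass
class ExecutionGuidance:
    """A richer quality signal than a binary works/doesn't-work.
    Thresholds below are illustrative, not normative."""
    confidence: float           # 0.0-1.0 confidence in the capability's output
    recent_success_rate: float  # reliability over a recent window

    @property
    def recommendation(self) -> str:
        if self.confidence >= 0.9 and self.recent_success_rate >= 0.95:
            return "proceed"
        if self.recent_success_rate >= 0.8:
            return "proceed_with_fallback"
        return "fall_back"
```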
The MCP angle
MCP has made capability discovery dramatically easier. Registries like Smithery and mcp.so are indexing thousands of servers. This is genuinely useful.
But discovery and quality are different problems. A registry tells you a server exists and has a description. It doesn't tell you whether it's reliable enough for a production pipeline, what its upstream dependencies are, or how it behaves under the error conditions that matter.
This gap will close — it has to. As agents move from demos to production, the question "can I trust this capability enough to act on its output" becomes the blocking question. Quality signals will become a first-class part of capability metadata, not an afterthought.
What you can do now
If you're shipping agents to production today:
Test your critical capabilities against ground truth. Pick the 5 capabilities your agent depends on most. Find a known-good input/output pair for each. Run that test on a schedule — daily at minimum, hourly if the capability is critical.
Build in upstream failure detection. When a capability returns an unexpected response, classify the failure before retrying. Blind retries on an upstream outage make things worse.
Track capability-specific latency separately from model latency. When something is slow, knowing whether it's the model or the tool call is the difference between a model config change and an API support ticket.
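Separating the two measurements can be as simple as a timing wrapper around every tool call. A minimal sketch, assuming each capability is invoked through a callable:

```python
import time
from collections import defaultdict

# Per-capability latency samples, kept separate from model latency.
LATENCIES: dict[str, list[float]] = defaultdict(list)

def timed_call(name: str, fn, *args, **kwargs):
    """Invoke a capability and record its wall-clock latency under `name`."""
    start = time.monotonic()
    try:
        return fn(*args, **kwargs)
    finally:
        LATENCIES[name].append(time.monotonic() - start)
```

Usage would look like `timed_call("company_registry", fetch_company, "12345")`, where `fetch_company` is a hypothetical client function. When something is slow, the per-name samples tell you which hop to blame.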
Use capability quality scores where they exist. Some capability platforms now publish quality metadata — test coverage, reliability profiles, historical uptime. Use this signal when choosing which capabilities to route through.
Plan for graceful degradation. For every critical capability in your pipeline: what does the agent do if this call fails? "Return an error" is a valid answer. "Silently continue with missing data" is not.
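The degradation policy above can be enforced with a small wrapper: either a fallback runs, or the failure is surfaced loudly — never swallowed. A sketch, assuming capabilities are zero-argument callables:

```python
def call_with_degradation(capability, fallback=None):
    """Run a capability; on failure, use the fallback or fail loudly.
    Silently continuing with missing data is never an option."""
    try:
        return capability()
    except Exception as exc:
        if fallback is not None:
            return fallback()
        # Returning an explicit error is a valid answer.
        raise RuntimeError("capability failed and no fallback defined") from exc
```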
The longer arc
The agent economy is real and it's moving fast. But the infrastructure assumptions behind it — that capabilities are reliable, that data is accurate, that tools behave consistently — are not yet guaranteed.
The teams that figure out capability quality now will have a significant advantage as pipelines get longer and more autonomous. The failure modes don't get simpler as agents become more capable. They get more consequential.
Strale is a capability marketplace for AI agents — 250+ independently tested capabilities across 27 countries, with Quality and Reliability profiles on every capability. Built to give agents a trust layer, not just a tool layer. strale.dev