Last month, a developer posted on Hacker News: their GPT-4o agent got stuck in a retry loop and ran up a bill before anyone noticed.
I've had my own version of this. A LangChain agent I built went into a recursive loop in production. No alert. No warning.
So I started digging.
## The pattern
I pulled GitHub issues, HN threads, Reddit posts. Dozens of them. The same story kept showing up:
- Agent loops and burns through API credits
- Agent hallucinates and nobody catches it for hours
- Agent works fine for weeks, then silently degrades
- Developer finds out from users, not from monitoring
The tools we have -- LangSmith, Langfuse, Arize, Helicone -- show traces: latency, token counts, spans. They answer "what happened?" but not:
- Is my agent actually reliable right now?
- Is it producing business value or just burning tokens?
- Will I know when it breaks before my users do?
## What I found in public repositories
This isn't just anecdotal. The evidence is sitting in open GitHub issues.
### No alerting for over two years
Langfuse's alerting feature request has been open since December 2023. You can see every trace in beautiful detail, but if your agent starts failing at 3am, you'll find out at 9am.
langfuse/langfuse#714 -- opened Dec 18, 2023
### The monitoring tool crashed production
LangSmith's tracing decorator crashed production apps during an outage. The tool that's supposed to tell you something went wrong... was the thing that went wrong.
langchain-ai/langsmith-sdk#1306 -- opened Dec 9, 2024
### Cost tracking doesn't add up
Cost calculations are wrong for cached tokens and vision models. Your dashboard says you're spending one amount. Your actual invoice says another.
langchain-ai/langsmith-sdk#1375 -- opened Jan 4, 2025
## The gap
Current observability tools are designed for debugging after the fact, not for catching failures as they happen. They're flight recorders, not collision avoidance systems.
What's missing:
**Reliability scoring.** Not "here's a trace" but "your agent's reliability dropped from 94% to 71% in the last hour." A single number that tells you whether to worry.
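As a rough illustration of what such a score could be, here's a minimal sketch: a rolling success rate over a time window, where "success" is whatever acceptance check you run on each agent response. The `ReliabilityScore` class and its window size are my own assumptions for this post, not any existing tool's API.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

class ReliabilityScore:
    """Rolling success rate over a sliding time window (illustrative sketch)."""

    def __init__(self, window_minutes=60):
        self.window = timedelta(minutes=window_minutes)
        self.events = deque()  # (timestamp, succeeded: bool), oldest first

    def record(self, succeeded, ts=None):
        """Record one agent run as success or failure."""
        self.events.append((ts or datetime.now(timezone.utc), succeeded))

    def score(self, now=None):
        """Success rate (0-100) over the window, or None with no data."""
        now = now or datetime.now(timezone.utc)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        if not self.events:
            return None
        ok = sum(1 for _, succeeded in self.events if succeeded)
        return 100.0 * ok / len(self.events)
```

Record a success/failure per run and read `score()` once a minute, and a drop from 94 to 71 becomes something you can see (and alert on) as it happens.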
**Business outcome connection.** Your monitoring tool says the agent completed successfully. Your analytics says conversion rate dropped. Nobody connects these two.
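One way to make that connection concrete, assuming you propagate a shared session ID between the agent and your analytics: compute what fraction of "successful" agent sessions actually converted. `outcome_gap` is a hypothetical helper I'm sketching here, not part of any of the tools above.

```python
def outcome_gap(successful_sessions, converted_sessions):
    """Fraction of 'successful' agent sessions that led to a conversion.

    successful_sessions: session IDs your monitoring marked as completed OK.
    converted_sessions: session IDs your analytics marked as converted.
    Joined on a shared session ID (assumption: you propagate one end to end).
    Returns None when there is no agent data to judge.
    """
    if not successful_sessions:
        return None
    hits = len(set(successful_sessions) & set(converted_sessions))
    return hits / len(successful_sessions)
```

If this number drifts down while your trace dashboard stays green, the agent is "working" by the monitoring tool's definition and failing by the business's.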
**Simple alerting.** Not enterprise-grade configuration with 47 steps. Just: "error rate crossed 5%, here's the Slack message."
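For a sense of scale, that whole story can be a few lines. A hedged sketch using only the standard library, assuming you have a Slack incoming-webhook URL configured (the URL and 5% threshold below are placeholders, and `post` is injectable so this is testable without the network):

```python
import json
import urllib.request

ERROR_RATE_THRESHOLD = 0.05  # alert when the error rate crosses 5%

def check_and_alert(errors, total, webhook_url, post=urllib.request.urlopen):
    """Post a Slack message if the error rate crosses the threshold.

    Returns the message text when an alert fires, else None.
    """
    if total == 0:
        return None
    rate = errors / total
    if rate <= ERROR_RATE_THRESHOLD:
        return None
    text = (f"Agent error rate at {rate:.1%} ({errors}/{total} calls), "
            f"crossed the {ERROR_RATE_THRESHOLD:.0%} threshold")
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    post(req)  # fire the webhook
    return text
```

Run it on a timer over the same counters your tracing already collects and you've covered the 3am case, no 47-step configuration required.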
## The scale of the problem
57% of teams now run AI agents in production. Gartner predicts 40% of agentic AI projects will be canceled by 2027 due to trust issues.
The trust problem isn't about AI capability. It's about not knowing when it's broken.
## I'm researching this
This isn't a product pitch. I'm trying to work out whether this is a widespread problem or just my own frustration.
If you run AI agents in production (or plan to), I'd appreciate 2 minutes:
5-question survey -- no signup, no email required.
I'll share the results publicly here on dev.to.
Whether this becomes a tool, an open-source library, or just a post with interesting data -- the findings will be useful either way.