Last month, a developer posted on Hacker News: their GPT-4o agent got stuck in a retry loop and ran up a bill before anyone noticed.
I've had my own version of this. A LangChain agent I built went into a recursive loop in production. No alert. No warning.
So I started digging.
## The pattern
I pulled GitHub issues, HN threads, Reddit posts. Dozens of them. The same story kept showing up:
- Agent loops and burns through API credits
- Agent hallucinates and nobody catches it for hours
- Agent works fine for weeks, then silently degrades
- Developer finds out from users, not from monitoring
The tools we have -- LangSmith, Langfuse, Arize, Helicone -- show traces: latency, token counts, spans. They answer "what happened?" but not:
- Is my agent actually reliable right now?
- Is it producing business value or just burning tokens?
- Will I know when it breaks before my users do?
## What I found in public repositories
This isn't just anecdotal. The evidence is sitting in open GitHub issues.
### No alerting for over two years
Langfuse's alerting feature request has been open since December 2023. You can see every trace in beautiful detail, but if your agent starts failing at 3am, you'll find out at 9am.
langfuse/langfuse#714 -- opened Dec 18, 2023
### The monitoring tool crashed production
LangSmith's tracing decorator crashed production apps during an outage. The tool that's supposed to tell you something went wrong... was the thing that went wrong.
langchain-ai/langsmith-sdk#1306 -- opened Dec 9, 2024
### Cost tracking doesn't add up
Cost calculations are wrong for cached tokens and vision models. Your dashboard says you're spending one amount. Your actual invoice says another.
langchain-ai/langsmith-sdk#1375 -- opened Jan 4, 2025
## The gap
Current observability tools are designed for debugging after the fact, not for catching failures as they happen. They're flight recorders, not collision avoidance systems.
What's missing:
**Reliability scoring.** Not "here's a trace" but "your agent's reliability dropped from 94% to 71% in the last hour." A single number that tells you whether to worry.
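As a rough illustration of what such a score could be, here's a minimal sketch: a rolling success rate over a time window, where "success" is whatever acceptance check you run on each agent response. The `ReliabilityScore` class and its window size are my own assumptions for this post, not any existing tool's API.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

class ReliabilityScore:
    """Rolling success rate over a sliding time window (illustrative sketch)."""

    def __init__(self, window_minutes=60):
        self.window = timedelta(minutes=window_minutes)
        self.events = deque()  # (timestamp, succeeded: bool), oldest first

    def record(self, succeeded, ts=None):
        """Record one agent run as success or failure."""
        self.events.append((ts or datetime.now(timezone.utc), succeeded))

    def score(self, now=None):
        """Success rate (0-100) over the window, or None with no data."""
        now = now or datetime.now(timezone.utc)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        if not self.events:
            return None
        ok = sum(1 for _, succeeded in self.events if succeeded)
        return 100.0 * ok / len(self.events)
```

Record a success/failure per run and read `score()` once a minute, and a drop from 94 to 71 becomes something you can see (and alert on) as it happens.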
**Business outcome connection.** Your monitoring tool says the agent completed successfully. Your analytics says conversion rate dropped. Nobody connects these two.
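One way to make that connection concrete, assuming you propagate a shared session ID between the agent and your analytics: compute what fraction of "successful" agent sessions actually converted. `outcome_gap` is a hypothetical helper I'm sketching here, not part of any of the tools above.

```python
def outcome_gap(successful_sessions, converted_sessions):
    """Fraction of 'successful' agent sessions that led to a conversion.

    successful_sessions: session IDs your monitoring marked as completed OK.
    converted_sessions: session IDs your analytics marked as converted.
    Joined on a shared session ID (assumption: you propagate one end to end).
    Returns None when there is no agent data to judge.
    """
    if not successful_sessions:
        return None
    hits = len(set(successful_sessions) & set(converted_sessions))
    return hits / len(successful_sessions)
```

If this number drifts down while your trace dashboard stays green, the agent is "working" by the monitoring tool's definition and failing by the business's.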
**Simple alerting.** Not enterprise-grade configuration with 47 steps. Just: "error rate crossed 5%, here's the Slack message."
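For a sense of scale, that whole story can be a few lines. A hedged sketch using only the standard library, assuming you have a Slack incoming-webhook URL configured (the URL and 5% threshold below are placeholders, and `post` is injectable so this is testable without the network):

```python
import json
import urllib.request

ERROR_RATE_THRESHOLD = 0.05  # alert when the error rate crosses 5%

def check_and_alert(errors, total, webhook_url, post=urllib.request.urlopen):
    """Post a Slack message if the error rate crosses the threshold.

    Returns the message text when an alert fires, else None.
    """
    if total == 0:
        return None
    rate = errors / total
    if rate <= ERROR_RATE_THRESHOLD:
        return None
    text = (f"Agent error rate at {rate:.1%} ({errors}/{total} calls), "
            f"crossed the {ERROR_RATE_THRESHOLD:.0%} threshold")
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    post(req)  # fire the webhook
    return text
```

Run it on a timer over the same counters your tracing already collects and you've covered the 3am case, no 47-step configuration required.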
## The scale of the problem
57% of teams now run AI agents in production. Gartner predicts 40% of agentic AI projects will be canceled by 2027 due to trust issues.
The trust problem isn't about AI capability. It's about not knowing when it's broken.
## I'm researching this
This isn't a product pitch. I'm trying to work out whether this is a widespread problem or just my own frustration.
If you run AI agents in production (or plan to), I'd appreciate 2 minutes:
5-question survey -- no signup, no email required.
I'll share the results publicly here on dev.to.
Whether this becomes a tool, an open-source library, or just a post with interesting data -- the findings will be useful either way.