Ansh Saxena

Posted on • Originally published at dev.to

My agent wasn't flaky. I just couldn't see it looping.

A lot of what I do is stare at other people's agent traces: the ones whose print logs say everything is fine, but whose users say it's slow. This time the ping came in on a Wednesday afternoon: the agent feels "slow." Not broken. Just slow. Which is the worst kind of bug report, because there's nothing to grep for.

I open the logs and see this:

[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null

Same call. Same input. Same null. Four times in a row.

I'd love to say I caught it from the logs, but honestly I just restarted the process and it went away. Blamed the upstream API in my head. Closed the tab. Moved on with my day.

The tool wasn't broken. The agent was stuck in a loop because it didn't know what to do when a search came back empty, so it just... tried again. And the logs had no opinion about that. They just dutifully printed each attempt as if it were the first.


Here's the part I wish I'd seen at the time:

Alerts fired:
  ALERT retry_loop
  ALERT failure_rate

retry_loop trips when the same tool shows up 4+ times in the last 6 spans. failure_rate trips when more than 20% of recent spans are errors. Both are on by default — I didn't have to pick a threshold or configure anything. The loop would have been visible from span #4, which is roughly 30 seconds before any of my users noticed.
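Those two rules are simple enough to sketch. Here's a minimal, hypothetical version of what a detector like this could look like: the function and field names (`check_alerts`, `tool`, `error`) are mine, not OCW's internals, and it assumes each span is a small dict describing one tool call.

```python
from collections import Counter

def check_alerts(spans, window=6, repeat_threshold=4, error_rate=0.20):
    """Evaluate retry_loop- and failure_rate-style rules over recent spans.

    Each span is assumed to be a dict like {"tool": str, "error": bool}.
    """
    recent = list(spans)[-window:]
    alerts = []

    # retry_loop: the same tool shows up 4+ times in the last 6 spans
    counts = Counter(s["tool"] for s in recent)
    if counts and max(counts.values()) >= repeat_threshold:
        alerts.append("retry_loop")

    # failure_rate: more than 20% of recent spans are errors
    if recent:
        errors = sum(1 for s in recent if s["error"])
        if errors / len(recent) > error_rate:
            alerts.append("failure_rate")

    return alerts
```

The point of the sliding window is that it fires on span #4 of an identical-call run, not after the whole conversation has already gone sideways.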


I don't think this is a "bad code" problem. Tools return null sometimes. APIs go down. An agent that retries when it gets nothing back is, in some sense, doing the right thing. It just doesn't have a good intuition for when to stop. Mine certainly didn't.
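Giving the agent that intuition doesn't take much. One way is a hard cap plus an explicit "nothing found" outcome the agent can branch on. This is a sketch under my own assumptions, not anything from OCW; `search_with_cap` and its return shape are hypothetical names:

```python
def search_with_cap(search_fn, query, max_attempts=2):
    """Call a search tool, but stop retrying after max_attempts.

    Returns an explicit "empty" outcome instead of looping, so the
    caller can rephrase the query, escalate, or give up.
    """
    for _ in range(max_attempts):
        result = search_fn(query)
        if result:  # non-empty hit: we're done
            return {"status": "ok", "result": result}
    # The same input will keep producing the same null; surface that
    # explicitly rather than letting the agent try call number five.
    return {"status": "empty", "attempts": max_attempts}
```

The key design choice is that "empty" is a first-class result, not an error to retry: the agent gets something it can reason about instead of silence.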

The actual problem is that I had no way to see the loop happening until someone messaged me about it. By then it had been going for a while.


If you want to see what I mean, this takes about 30 seconds and doesn't need any API keys:

pip install openclawwatch
ocw demo retry-loop

It shows the same scenario two ways: the print() version (technically accurate, completely unhelpful) and the OCW version, where the alerts fire on their own a few spans in.


Wiring it into a real agent is three lines:

from ocw.sdk import patch_anthropic, watch

patch_anthropic()

@watch(agent_id="my-agent")
def run():
    ...  # your existing code, unchanged

Run ocw serve somewhere in the background. Then ocw alerts tells you what's fired and ocw traces gives you the full waterfall - every tool call, every latency, in order.

It's local. No cloud, no signup, no account.


The thing I keep coming back to is this: there's a real difference between an agent that retried four times because a tool returned null, and one that retried four times for no reason anyone can explain. From the outside they look identical. One is an infrastructure problem you can fix. The other is just... vibes. And vibes don't ship.

I think people say "you can't trust agents in production" when what they actually mean is "I can't see what mine is doing." Those aren't the same problem. The first one is unsolvable. The second one is a Wednesday afternoon and a missing alert.

ocw demo retry-loop - go see for yourself.


Part of the Agent Incident Library - reproducible scenarios for the failures that don't show up in your logs.
