DEV Community: Ansh Saxena

Three of my agent's API calls were Opus. My logs said "200 OK" eight times.

Ansh Saxena — Fri, 08 May 2026 14:12:04 +0000

If you run a multi-agent workflow — LangChain with fallbacks, CrewAI with different models per agent, AutoGen, or anything where someone (maybe past-you) configured model routing — this post is for you.

Here's what the logs showed:

[agent] Starting document analysis...
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[agent] Task complete.

Eight successes. Nothing to investigate.

Here's what actually happened:

Model               Calls    Cost (USD)
─────────────────────────────────────
claude-opus-4-6       3       $3.2325
claude-sonnet-4-6     3       $0.2775
claude-haiku-4-5      2       $0.0092
─────────────────────────────────────
Total                          $3.5192

Three calls to Opus. 92% of the bill. The model= config said Haiku. A fallback router in the chain was escalating harder subtasks — exactly as configured, two weeks ago, by someone who then forgot.
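The 92% figure can be checked directly from the table. A minimal sketch, using only the per-model totals shown above (the variable and function names are illustrative, not part of any tool):

```python
# Per-model session totals in USD, copied from the table above.
cost_by_model = {
    "claude-opus-4-6": 3.2325,
    "claude-sonnet-4-6": 0.2775,
    "claude-haiku-4-5": 0.0092,
}

total = sum(cost_by_model.values())


def cost_share(model: str) -> float:
    """Fraction of the session bill attributable to one model."""
    return cost_by_model[model] / total


# Print the most expensive model first, with its share of the bill.
for model, cost in sorted(cost_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{model:<20} ${cost:>7.4f}  {cost_share(model):6.1%}")
print(f"{'Total':<20} ${total:>7.4f}")
```

Three of the eight calls account for roughly nine-tenths of the spend, which is why a per-call status code tells you nothing about the bill.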

print() has no way to tell you which model handled which call. HTTP responses don't include "by the way, this one cost $1.20." OCW does.

This happens whenever:

A LangChain fallback escalates to a stronger model on error or complexity
A CrewAI crew has different models per agent and you've lost track
A config override that past-you set somewhere in your stack quietly takes effect

The per-session cost looks fine until it compounds. $3.52 per session × 3 sessions/day × 20 working days = $211/month on a workflow you thought cost $20.
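The compounding is plain multiplication. As a quick sketch (the function name is hypothetical; the numbers are the ones from the paragraph above):

```python
def monthly_cost(per_session_usd: float, sessions_per_day: int, working_days: int) -> float:
    """Project a per-session cost over a month of working days."""
    return per_session_usd * sessions_per_day * working_days


projected = monthly_cost(3.52, sessions_per_day=3, working_days=20)
print(f"${projected:.2f}/month")
```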

See it in 30 seconds, no API keys:

pip install tokenjam
tj demo surprise-cost

Eight synthetic LLM spans with real pricing math: the same model mix and the same token counts as the real scenario. Side-by-side: what print() shows vs. what OCW reveals.

Wire up your real agent:

from tokenjam.sdk import patch_anthropic, watch

patch_anthropic()

@watch(agent_id="my-agent")
def run():
    ...  # your existing code unchanged

Set a budget cap:

# tj.toml
[agents.my-agent.budget]
session_usd = 5.00

OCW fires an alert when you cross it. Not on the bill. When the call happens.
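To make "when the call happens" concrete, here is a rough sketch of a per-session budget check that runs on every call. This is an assumption about the general technique, not TokenJam's actual implementation; `SessionBudget`, `record_call`, and the per-call costs are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SessionBudget:
    """Hypothetical helper: tracks one session's spend against a USD cap."""
    cap_usd: float
    spent_usd: float = 0.0
    alerts: list = field(default_factory=list)

    def record_call(self, model: str, cost_usd: float) -> bool:
        """Add one call's cost; return True only on the call that crosses the cap."""
        was_under = self.spent_usd <= self.cap_usd
        self.spent_usd += cost_usd
        crossed = was_under and self.spent_usd > self.cap_usd
        if crossed:
            self.alerts.append(
                f"budget exceeded on {model}: ${self.spent_usd:.2f} > ${self.cap_usd:.2f}"
            )
        return crossed


# Illustrative session: five calls at an assumed $1.20 each against a $5.00 cap.
budget = SessionBudget(cap_usd=5.00)
results = [budget.record_call("claude-opus-4-6", 1.20) for _ in range(5)]
print(results)        # the cap is crossed on the fifth call
print(budget.alerts)
```

The point of checking inside `record_call` is that the alert is tied to a specific call, so you know which model pushed the session over.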

The cost isn't the problem. Invisibility is the problem. Once you can see which model ran which call, the budget conversation becomes a technical decision instead of a 2am surprise.

tj demo surprise-cost — run it, see what was hiding.

Part of the Agent Incident Library

My agent worked yesterday. Today it's possessed.

Ansh Saxena — Fri, 08 May 2026 14:10:24 +0000

Two weeks of clean runs. Same prompts, same repo, same results.

Then Tuesday happened.

The outputs were longer. Different variable names. Tool calls you'd never seen before. You asked the agent about it. It explained confidently. The explanation sounded plausible.

No stack trace. No error. No crash. Just behavior that used to be one thing and is now quietly something else.

This is the hardest failure to diagnose because you have nothing to point at. You have a feeling. A feeling is not a measurement.

Here's what five baseline sessions looked like:

Session 1: ~1,000 tokens | tools: [search, summarize]
Session 2: ~1,000 tokens | tools: [search, summarize]
Session 3: ~1,100 tokens | tools: [search, summarize]
Session 4:   ~950 tokens | tools: [search, summarize]
Session 5: ~1,050 tokens | tools: [search, summarize]

Here's session 6:

Session 6: 50,000 tokens | tools: [fetch_url, parse_html, extract_entities, classify, store_results]

Five new tools. 50x the tokens. Every metric off the chart.

Your print() logs said: output looks reasonable. Moving on.

OCW fired drift_detected the moment the session closed.

The DriftDetector builds a rolling baseline from prior sessions. It fires when a new session's token count exceeds a Z-score of 2.0 against that baseline, or when its tool sequence diverges from the baseline by a Jaccard distance of more than 0.4. No manual baseline to set up. No dashboard to configure. It learns from your agent's own history.
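For readers who want to see the two checks spelled out, here is a minimal sketch using the session numbers from above. It is not the DriftDetector's source. It assumes a sample standard deviation, an absolute Z-score, and a Jaccard distance over tool sets rather than ordered sequences; the function names are hypothetical.

```python
from statistics import mean, stdev


def token_z_score(baseline_tokens: list[int], new_tokens: int) -> float:
    """Distance of the new session from the baseline mean, in sample standard deviations."""
    mu = mean(baseline_tokens)
    sigma = stdev(baseline_tokens)
    if sigma == 0:
        return 0.0 if new_tokens == mu else float("inf")
    return abs(new_tokens - mu) / sigma


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0.0 for identical tool sets, 1.0 for no overlap."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def drift_detected(baseline_tokens, baseline_tools, new_tokens, new_tools,
                   token_threshold=2.0, tool_sequence_diff=0.4) -> bool:
    """Fire if either the token Z-score or the tool-set distance exceeds its threshold."""
    z = token_z_score(baseline_tokens, new_tokens)
    d = jaccard_distance(baseline_tools, new_tools)
    return z > token_threshold or d > tool_sequence_diff


# Baseline sessions 1-5 and anomalous session 6 from the listings above.
baseline_tokens = [1000, 1000, 1100, 950, 1050]
baseline_tools = {"search", "summarize"}
session6_tools = {"fetch_url", "parse_html", "extract_entities", "classify", "store_results"}

z6 = token_z_score(baseline_tokens, 50_000)
d6 = jaccard_distance(baseline_tools, session6_tools)
print(f"z={z6:.1f}, jaccard={d6:.1f}, drift={drift_detected(baseline_tokens, baseline_tools, 50_000, session6_tools)}")
```

Session 6 shares no tools with the baseline, so its Jaccard distance is the maximum of 1.0, and its token count is hundreds of standard deviations out. Either signal alone would trip the alert.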

You find out in seconds. Not after a week of "huh, that seemed weird."

pip install tokenjam
tj demo hallucination-drift

No API keys. Runs entirely in-process. Watch 5 normal sessions, then 1 anomalous one, then the alert.

Enable it for your real agent:

# tj.toml
[agents.my-agent.drift]
enabled            = true
baseline_sessions  = 10
token_threshold    = 2.0
tool_sequence_diff = 0.4

Then tj drift shows Z-scores per session. tj alerts shows when the threshold was crossed.

The take that makes people mad: "LLMs are non-deterministic — you can't test them."

You're right. You can't test them the way you test functions. But you can measure them. You can build a baseline and alert when behavior leaves it.

Testing asks "is this correct?" Drift detection asks "is this different from how it's always behaved?" The second question is answerable. It just requires keeping score.

tj demo hallucination-drift — run it, see what keeping score looks like.

Part of the Agent Incident Library

My agent wasn't flaky. I just couldn't see it looping.

Ansh Saxena — Sun, 26 Apr 2026 07:36:16 +0000

I work on TokenJam, an open-source observability tool for AI agents. A lot of what I do is stare at other people's agent traces — the ones their print logs say are fine and their users say are slow.

The single most common pattern I see is the silent retry loop. It looks like this:

[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null

Same call, same input, same null. Four times in a row.

Nothing here is technically an error. The HTTP status is 200. The tool ran. The model decided to call it again. From the log's perspective, this is four successful operations. From the user's perspective, the agent is hung.

This is why people say agents are "flaky" — there's no error to grep for, just behavior that doesn't terminate. And print() will never tell you, because each line in isolation is correct. The pathology is in the sequence, and a flat log file has no concept of sequence beyond timestamps.

When I designed the retry_loop detector for OCW, the rule I landed on was deliberately boring: fire when the same tool name shows up 4+ times in the last 6 spans. No ML, no per-agent tuning. Most real loops are tighter than that — they're 6+ identical calls in a row — so 4-of-6 catches them early without false positives on legitimate retries.
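The 4-of-6 rule fits in a few lines. The sketch below shows the idea only; the class and method names are hypothetical and this is not the OCW detector's code.

```python
from collections import Counter, deque


class RetryLoopDetector:
    """Sketch: fire when any tool name appears `threshold`+ times in the last `window` spans."""

    def __init__(self, window: int = 6, threshold: int = 4):
        self.recent = deque(maxlen=window)  # old spans fall off automatically
        self.threshold = threshold

    def observe(self, tool_name: str) -> bool:
        """Record one span and report whether the window now looks like a loop."""
        self.recent.append(tool_name)
        _, count = Counter(self.recent).most_common(1)[0]
        return count >= self.threshold


# The four identical calls from the log above.
detector = RetryLoopDetector()
fired = [detector.observe("search_knowledge_base") for _ in range(4)]
print(fired)

# A legitimate alternating pattern never reaches 4 of 6.
alternating = RetryLoopDetector()
alt_fired = [alternating.observe(t) for t in ["search", "summarize"] * 3]
print(alt_fired)
```

Because the window is bounded, a tool that is called often but interleaved with other work stays below the threshold, while a tight loop trips it on the fourth call.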

It runs alongside failure_rate, which trips when more than 20% of recent spans error out. Both default-on. Together they cover the two flavors of "stuck": looping on success and looping on failure.
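A failure-rate check over recent spans can be sketched the same way. The 20% threshold comes from the text; the window size of 10 and the `min_spans` warm-up are assumptions added for illustration, since the post does not say how "recent" is defined, and the names are hypothetical.

```python
from collections import deque


class FailureRateDetector:
    """Sketch: fire when more than `max_rate` of the recent spans are errors."""

    def __init__(self, window: int = 10, max_rate: float = 0.20, min_spans: int = 5):
        self.recent = deque(maxlen=window)  # assumed window size, not from the post
        self.max_rate = max_rate
        self.min_spans = min_spans          # assumed warm-up so one early error does not fire

    def observe(self, is_error: bool) -> bool:
        """Record one span's outcome and report whether the error rate is over the limit."""
        self.recent.append(is_error)
        if len(self.recent) < self.min_spans:
            return False
        return sum(self.recent) / len(self.recent) > self.max_rate


# Eight clean spans followed by three errors.
rate_detector = FailureRateDetector()
outcomes = [False] * 8 + [True] * 3
rate_fired = [rate_detector.observe(o) for o in outcomes]
print(rate_fired)
```

At exactly 20% (two errors in ten spans) the check stays quiet, matching "more than 20%"; the third error pushes the window to 30% and it fires.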

Alerts fired:
  ALERT retry_loop
  ALERT failure_rate

Visible from span 4. No threshold tuning. No dashboard.

I'm not arguing the agent is doing something wrong here. Tools return null. APIs go down. An agent retrying when it gets nothing back is reasonable behavior in isolation — the bug is that it has no termination condition for silence, only for errors. Fixing that is a prompt-engineering problem.

But you can't fix what you can't see, and the reason I built this detector is that the typical observability path for agents is: ship with print(), get a vague "it's slow" report, restart the process, blame the upstream, ship again. The loop never gets diagnosed because nothing in the workflow surfaces it.

The demo reproduces the failure end-to-end with no API keys and no setup:

pip install tokenjam
tj demo retry-loop

It synthesizes the span sequence above, runs both detectors against it, and shows you the print() view next to the OCW view. About 30 seconds.

To wire it into a real agent, the SDK adds three lines to your code (an import, a patch call, and a decorator):

from tokenjam.sdk import patch_anthropic, watch

patch_anthropic()

@watch(agent_id="my-agent")
def run():
    ...  # your existing code, unchanged

Run tj serve in the background. tj alerts shows what fired. tj traces shows the full span waterfall. Local DuckDB, no cloud, no signup.

The framing I keep pushing back on is "you can't trust agents in production." That's two different statements collapsed into one. There's a real difference between an agent that retried four times because a tool returned null, and an agent that retried four times for no reason anyone can reconstruct. The first is a fixable infrastructure problem. The second is a monitoring gap masquerading as a reliability problem.

Most of the agents I see have the second problem. Once you can replay the span sequence, it turns into the first kind: a normal engineering ticket.

tj demo retry-loop — give it 30 seconds, see the alert fire.

Part of the Agent Incident Library — reproducible scenarios for the failures that don't show up in your logs.