<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ansh Saxena</title>
    <description>The latest articles on DEV Community by Ansh Saxena (@anshss).</description>
    <link>https://dev.to/anshss</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F924503%2F269dc4de-290a-44fc-a73b-9ad86d4a66f2.jpeg</url>
      <title>DEV Community: Ansh Saxena</title>
      <link>https://dev.to/anshss</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anshss"/>
    <language>en</language>
    <item>
      <title>Three of my agent's API calls were Opus. My logs said "200 OK" eight times.</title>
      <dc:creator>Ansh Saxena</dc:creator>
      <pubDate>Fri, 08 May 2026 14:12:04 +0000</pubDate>
      <link>https://dev.to/anshss/three-of-my-agents-api-calls-were-opus-my-logs-said-200-ok-eight-times-43e6</link>
      <guid>https://dev.to/anshss/three-of-my-agents-api-calls-were-opus-my-logs-said-200-ok-eight-times-43e6</guid>
      <description>&lt;p&gt;If you run a multi-agent workflow — LangChain with fallbacks, CrewAI with different models per agent, AutoGen, or anything where someone (maybe past-you) configured model routing — this post is for you.&lt;/p&gt;

&lt;p&gt;Here's what the logs showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[agent] Starting document analysis...
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[llm] Response received (200 OK)
[agent] Task complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight successes. Nothing to investigate.&lt;/p&gt;

&lt;p&gt;Here's what actually happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model               Calls    Cost (USD)
─────────────────────────────────────
claude-opus-4-6       3       $3.2325
claude-sonnet-4-6     3       $0.2775
claude-haiku-4-5      2       $0.0092
─────────────────────────────────────
Total                          $3.5192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three calls to Opus. 92% of the bill. The &lt;code&gt;model=&lt;/code&gt; config said Haiku. A fallback router in the chain was escalating harder subtasks — exactly as configured, two weeks ago, by someone who then forgot.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;print()&lt;/code&gt; of the status code can't tell you which model handled which call. HTTP responses don't include "by the way, this one cost $1.20." OCW does.&lt;/p&gt;




&lt;p&gt;This happens whenever:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A LangChain fallback escalates to a stronger model on error or complexity&lt;/li&gt;
&lt;li&gt;A CrewAI crew has different models per agent and you've lost track&lt;/li&gt;
&lt;li&gt;A config override that past-you set is still active somewhere in your stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The per-session cost looks fine until it compounds. $3.52 per session × 3 sessions/day × 20 working days = &lt;strong&gt;$211/month&lt;/strong&gt; on a workflow you thought cost $20.&lt;/p&gt;
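&lt;p&gt;The arithmetic behind those two numbers fits in a few lines. This is a minimal sketch that only reuses the per-model totals from the table above; the function names &lt;code&gt;cost_shares&lt;/code&gt; and &lt;code&gt;monthly_cost&lt;/code&gt; are made up for illustration and are not part of TokenJam.&lt;/p&gt;

```python
# Per-model session costs in USD, copied from the table in this post.
session_costs = {
    "claude-opus-4-6": 3.2325,
    "claude-sonnet-4-6": 0.2775,
    "claude-haiku-4-5": 0.0092,
}


def cost_shares(costs):
    """Return each model's share of the total session cost, as a percentage."""
    total = sum(costs.values())
    return {model: 100 * cost / total for model, cost in costs.items()}


def monthly_cost(session_usd, sessions_per_day, working_days):
    """Project a per-session cost onto a month of working days."""
    return session_usd * sessions_per_day * working_days


shares = cost_shares(session_costs)
total = sum(session_costs.values())
print(f"Opus share of the bill: {shares['claude-opus-4-6']:.0f}%")
print(f"Projected monthly cost: ${monthly_cost(total, 3, 20):.0f}")
```

&lt;p&gt;Swap in your own session count and working days to see how quickly one escalating route dominates the month.&lt;/p&gt;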




&lt;p&gt;See it in 30 seconds, no API keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tokenjam
tj demo surprise-cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;8 synthetic LLM spans with real pricing math — same model mix, same token counts as the real scenario. Side-by-side: what &lt;code&gt;print()&lt;/code&gt; shows vs. what OCW reveals.&lt;/p&gt;




&lt;p&gt;Wire up your real agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tokenjam.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch_anthropic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;watch&lt;/span&gt;

&lt;span class="nf"&gt;patch_anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# your existing code unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a budget cap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# tj.toml&lt;/span&gt;
&lt;span class="nn"&gt;[agents.my-agent.budget]&lt;/span&gt;
&lt;span class="py"&gt;session_usd&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OCW fires an alert when you cross it. Not on the bill. When the call happens.&lt;/p&gt;
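&lt;p&gt;Conceptually, a per-session cap is a running total checked after every call. The sketch below illustrates that idea only; it is not OCW's implementation, &lt;code&gt;first_breach&lt;/code&gt; is a hypothetical helper, and the per-call costs are invented for the example.&lt;/p&gt;

```python
def first_breach(call_costs, cap_usd):
    """Return the 1-based index of the call that pushes the running
    total above cap_usd, or None if the session stays within budget."""
    running = 0.0
    for index, cost in enumerate(call_costs, start=1):
        running += cost
        if running > cap_usd:
            return index
    return None


# Hypothetical per-call costs, chosen only to illustrate the check.
over_budget = first_breach([2.0, 2.0, 2.0], cap_usd=5.00)
within_budget = first_breach([1.0, 1.5], cap_usd=5.00)
print(over_budget, within_budget)
```

&lt;p&gt;The point of checking per call is that the breach is tied to a specific call, not to an invoice that arrives weeks later.&lt;/p&gt;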




&lt;p&gt;The cost isn't the problem. Invisibility is the problem. Once you can see which model ran which call, the budget conversation becomes a technical decision instead of a 2am surprise.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tj demo surprise-cost&lt;/code&gt; — run it, see what was hiding.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://github.com/Metabuilder-Labs/TokenJam" rel="noopener noreferrer"&gt;Agent Incident Library&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>api</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>My agent worked yesterday. Today it's possessed.</title>
      <dc:creator>Ansh Saxena</dc:creator>
      <pubDate>Fri, 08 May 2026 14:10:24 +0000</pubDate>
      <link>https://dev.to/anshss/my-agent-worked-yesterday-today-its-possessed-5bkb</link>
      <guid>https://dev.to/anshss/my-agent-worked-yesterday-today-its-possessed-5bkb</guid>
      <description>&lt;p&gt;Two weeks of clean runs. Same prompts, same repo, same results.&lt;/p&gt;

&lt;p&gt;Then Tuesday happened.&lt;/p&gt;

&lt;p&gt;The outputs were longer. Different variable names. Tool calls you'd never seen before. You asked the agent about it. It explained confidently. The explanation sounded plausible.&lt;/p&gt;

&lt;p&gt;No stack trace. No error. No crash. Just behavior that used to be one thing and is now quietly something else.&lt;/p&gt;

&lt;p&gt;This is the hardest failure to diagnose because you have nothing to point at. You have a feeling. A feeling is not a measurement.&lt;/p&gt;




&lt;p&gt;Here's what five baseline sessions looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session 1: ~1,000 tokens | tools: [search, summarize]
Session 2: ~1,000 tokens | tools: [search, summarize]
Session 3: ~1,100 tokens | tools: [search, summarize]
Session 4:   ~950 tokens | tools: [search, summarize]
Session 5: ~1,050 tokens | tools: [search, summarize]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's session 6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session 6: 50,000 tokens | tools: [fetch_url, parse_html, extract_entities, classify, store_results]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five new tools. 50x the tokens. Every metric off the chart.&lt;/p&gt;

&lt;p&gt;Your &lt;code&gt;print()&lt;/code&gt; logs said: &lt;em&gt;output looks reasonable. Moving on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OCW fired &lt;code&gt;drift_detected&lt;/code&gt; the moment the session closed.&lt;/p&gt;




&lt;p&gt;The &lt;code&gt;DriftDetector&lt;/code&gt; builds a rolling baseline from prior sessions. It fires when a new session's token count lands more than 2.0 standard deviations from that baseline (a Z-score above 2.0), or when its tool sequence diverges from the baseline's by a Jaccard distance above 0.4. No manual baseline to set up. No dashboard to configure. It learns from your agent's own history.&lt;/p&gt;

&lt;p&gt;You find out in seconds. Not after a week of "huh, that seemed weird."&lt;/p&gt;
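&lt;p&gt;To make the two thresholds concrete, here is a rough sketch of the checks applied to the six sessions above. It is an illustration under assumptions, not the &lt;code&gt;DriftDetector&lt;/code&gt; source: it treats the approximate token counts as exact, uses the sample standard deviation, and compares tools as sets. The helper names are hypothetical.&lt;/p&gt;

```python
from statistics import mean, stdev


def z_score(value, baseline):
    """How many sample standard deviations value sits from the baseline mean."""
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any change at all is infinitely unusual.
        return 0.0 if value == mean(baseline) else float("inf")
    return (value - mean(baseline)) / spread


def jaccard_distance(a, b):
    """One minus (size of intersection / size of union) of two tool sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a.intersection(b)) / len(union)


# Sessions 1-5 from the post (token counts taken as exact for the sketch).
baseline_tokens = [1000, 1000, 1100, 950, 1050]
baseline_tools = ["search", "summarize"]

# Session 6 from the post.
session_tokens = 50000
session_tools = ["fetch_url", "parse_html", "extract_entities",
                 "classify", "store_results"]

token_z = z_score(session_tokens, baseline_tokens)
tool_distance = jaccard_distance(baseline_tools, session_tools)
drifted = abs(token_z) > 2.0 or tool_distance > 0.4
print(f"z={token_z:.1f} jaccard={tool_distance:.2f} drifted={drifted}")
```

&lt;p&gt;Session 6 shares no tools with the baseline, so the Jaccard distance is at its maximum of 1.0, and the token count sits hundreds of standard deviations out. Either signal alone would trip the thresholds.&lt;/p&gt;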






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tokenjam
tj demo hallucination-drift
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys. Runs entirely in-process. Watch 5 normal sessions, then 1 anomalous one, then the alert.&lt;/p&gt;

&lt;p&gt;Enable it for your real agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# tj.toml&lt;/span&gt;
&lt;span class="nn"&gt;[agents.my-agent.drift]&lt;/span&gt;
&lt;span class="py"&gt;enabled&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;baseline_sessions&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;token_threshold&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="py"&gt;tool_sequence_diff&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;tj drift&lt;/code&gt; shows Z-scores per session. &lt;code&gt;tj alerts&lt;/code&gt; shows when the threshold was crossed.&lt;/p&gt;




&lt;p&gt;The take that makes people mad: &lt;em&gt;"LLMs are non-deterministic — you can't test them."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You're right. You can't test them the way you test functions. But you can measure them. You can build a baseline and alert when behavior leaves it.&lt;/p&gt;

&lt;p&gt;Testing asks "is this correct?" Drift detection asks "is this different from how it's always behaved?" The second question is answerable. It just requires keeping score.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tj demo hallucination-drift&lt;/code&gt; — run it, see what keeping score looks like.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://github.com/Metabuilder-Labs/TokenJam" rel="noopener noreferrer"&gt;Agent Incident Library&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>My agent wasn't flaky. I just couldn't see it looping.</title>
      <dc:creator>Ansh Saxena</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:36:16 +0000</pubDate>
      <link>https://dev.to/anshss/your-agent-isnt-flaky-youre-blind-4pk3</link>
      <guid>https://dev.to/anshss/your-agent-isnt-flaky-youre-blind-4pk3</guid>
      <description>&lt;p&gt;I work on &lt;a href="https://github.com/Metabuilder-Labs/TokenJam" rel="noopener noreferrer"&gt;TokenJam&lt;/a&gt;, an open-source observability tool for AI agents. A lot of what I do is stare at other people's agent traces — the ones their print logs say are fine and their users say are slow.&lt;/p&gt;

&lt;p&gt;The single most common pattern I see is the silent retry loop. It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
[tool] search_knowledge_base called
[tool] search_knowledge_base returned: null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same call, same input, same null. Four times in a row.&lt;/p&gt;

&lt;p&gt;Nothing here is technically an error. The HTTP status is 200. The tool ran. The model decided to call it again. From the log's perspective, this is four successful operations. From the user's perspective, the agent is hung.&lt;/p&gt;

&lt;p&gt;This is why people say agents are "flaky" — there's no error to grep for, just behavior that doesn't terminate. And &lt;code&gt;print()&lt;/code&gt; will never tell you, because each line in isolation is correct. The pathology is in the &lt;em&gt;sequence&lt;/em&gt;, and a flat log file has no concept of sequence beyond timestamps.&lt;/p&gt;




&lt;p&gt;When I designed the &lt;code&gt;retry_loop&lt;/code&gt; detector for OCW, the rule I landed on was deliberately boring: fire when the same tool name shows up 4+ times in the last 6 spans. No ML, no per-agent tuning. Most real loops are tighter than that, 6+ identical calls in a row, so 4-of-6 catches them early while still leaving room for a few legitimate retries.&lt;/p&gt;

&lt;p&gt;It runs alongside &lt;code&gt;failure_rate&lt;/code&gt;, which trips when more than 20% of recent spans error out. Both are on by default. Together they cover the two flavors of "stuck": looping on success and looping on failure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alerts fired:
  ALERT retry_loop
  ALERT failure_rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visible from span 4. No threshold tuning. No dashboard.&lt;/p&gt;
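&lt;p&gt;The 4-of-6 rule is small enough to sketch in full. This is a simplified illustration, not the shipped detector: the class name &lt;code&gt;RetryLoopDetector&lt;/code&gt; and its &lt;code&gt;observe&lt;/code&gt; method are assumptions, and it treats each tool call as one span.&lt;/p&gt;

```python
from collections import deque


class RetryLoopDetector:
    """Flag a loop when one tool name fills most of a short sliding window."""

    def __init__(self, window=6, threshold=4):
        # deque(maxlen=...) silently drops the oldest span once the window is full.
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, tool_name):
        """Record one span; return True if this tool now trips the rule."""
        self.recent.append(tool_name)
        return self.recent.count(tool_name) >= self.threshold


# The sequence from the log above: the same tool, four times in a row.
detector = RetryLoopDetector()
looping = [detector.observe("search_knowledge_base") for _ in range(4)]

# A healthy mix of tools where no single name dominates the window.
detector = RetryLoopDetector()
healthy = [detector.observe(name) for name in
           ["search", "fetch", "search", "parse", "search", "store"]]

print(looping)
print(any(healthy))
```

&lt;p&gt;With the log above, the fourth identical call is the first one that returns &lt;code&gt;True&lt;/code&gt;, while three uses of the same tool spread across six spans stay quiet.&lt;/p&gt;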




&lt;p&gt;I'm not arguing the agent is doing something wrong here. Tools return &lt;code&gt;null&lt;/code&gt;. APIs go down. An agent retrying when it gets nothing back is reasonable behavior in isolation — the bug is that it has no termination condition for &lt;em&gt;silence&lt;/em&gt;, only for errors. Fixing that is a prompt-engineering problem.&lt;/p&gt;

&lt;p&gt;But you can't fix what you can't see, and the reason I built this detector is that the typical observability path for agents is: ship with &lt;code&gt;print()&lt;/code&gt;, get a vague "it's slow" report, restart the process, blame the upstream, ship again. The loop never gets diagnosed because nothing in the workflow surfaces it.&lt;/p&gt;




&lt;p&gt;The demo reproduces the failure end-to-end with no API keys and no setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tokenjam
tj demo retry-loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It synthesizes the span sequence above, runs both detectors against it, and shows you the &lt;code&gt;print()&lt;/code&gt; view next to the OCW view. About 30 seconds.&lt;/p&gt;




&lt;p&gt;To wire it into a real agent, the SDK is three lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tokenjam.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;patch_anthropic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;watch&lt;/span&gt;

&lt;span class="nf"&gt;patch_anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@watch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# your existing code, unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;tj serve&lt;/code&gt; in the background. &lt;code&gt;tj alerts&lt;/code&gt; shows what fired. &lt;code&gt;tj traces&lt;/code&gt; shows the full span waterfall. Local DuckDB, no cloud, no signup.&lt;/p&gt;




&lt;p&gt;The framing I keep pushing back on is "you can't trust agents in production." That's two different statements collapsed into one. There's a real difference between an agent that retried four times because a tool returned null, and an agent that retried four times for no reason anyone can reconstruct. The first is a fixable infrastructure problem. The second is a monitoring gap masquerading as a reliability problem.&lt;/p&gt;

&lt;p&gt;Most of the agents I see have the second problem. Once you can replay the span sequence, it turns into the first, and that one is a normal engineering ticket.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tj demo retry-loop&lt;/code&gt; — give it 30 seconds, see the alert fire.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://github.com/Metabuilder-Labs/TokenJam" rel="noopener noreferrer"&gt;Agent Incident Library&lt;/a&gt; — reproducible scenarios for the failures that don't show up in your logs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
