<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexey Spinov</title>
    <description>The latest articles on DEV Community by Alexey Spinov (@alex_spinov).</description>
    <link>https://dev.to/alex_spinov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3975624%2F45a8ae00-5171-4172-8040-15cbfbbb4916.jpg</url>
      <title>DEV Community: Alexey Spinov</title>
      <link>https://dev.to/alex_spinov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_spinov"/>
    <language>en</language>
    <item>
      <title>You Can't Patch Prompt Injection. Gate the Lethal Trifecta Before the Agent Runs.</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Tue, 30 Jun 2026 07:39:00 +0000</pubDate>
      <link>https://dev.to/alex_spinov/you-cant-patch-prompt-injection-gate-the-lethal-trifecta-before-the-agent-runs-2kn7</link>
      <guid>https://dev.to/alex_spinov/you-cant-patch-prompt-injection-gate-the-lethal-trifecta-before-the-agent-runs-2kn7</guid>
      <description>&lt;p&gt;The lethal trifecta closes when one agent can read untrusted input, read private data, and send it out on a shared context. You can't patch that in the model, so gate it before the agent runs. &lt;code&gt;trifecta_gate.py&lt;/code&gt; checks reachability on the manifest: the vulnerable fixture returns 2 paths (exit 1); the safe one, same capabilities, returns 0 (exit 0).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I wrote &lt;code&gt;trifecta_gate.py&lt;/code&gt; with an AI assistant and ran it myself, offline, before publishing. Every number in the output blocks below is pasted from a real local run on Python 3.13.5, stdlib only, on the synthetic manifests included in this post. I checked the exit codes (0 / 1 / 2), hashed the STDOUT twice to confirm it is byte-for-byte deterministic, and edited every line. The external figures (Tenet Security's Agentjacking numbers, Simon Willison's term) are theirs, not mine, and I link the primary sources. I label which numbers are theirs and which are mine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt injection is not a bug you patch in the model. The model cannot reliably tell trusted instructions from untrusted data, so the fix is not "detect the injection." The fix is to stop the dangerous capability combination from being reachable in the first place.&lt;/li&gt;
&lt;li&gt;The lethal trifecta (a term coined by Simon Willison) is the combination: in one session the agent can read untrusted input, read private data, and send data outside. When all three reach each other, injected text can read your secrets and mail them out.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trifecta_gate.py&lt;/code&gt; reads a static tool manifest and asks one question by graph reachability: is there a path where untrusted input reaches a private read, and that read reaches an egress sink?&lt;/li&gt;
&lt;li&gt;The key result: a safe manifest and a vulnerable one declare the &lt;strong&gt;same three capabilities&lt;/strong&gt;. The vulnerable one returns 2 paths and exit 1. The safe one isolates egress off the shared context and returns 0 paths and exit 0. The gate decides on the data-flow graph, not on a checklist of flags.&lt;/li&gt;
&lt;li&gt;Stdlib only (&lt;code&gt;json&lt;/code&gt;, &lt;code&gt;sys&lt;/code&gt;, &lt;code&gt;collections.deque&lt;/code&gt;). No network, no model, no subprocess. The run is byte-for-byte deterministic. The tool and all three manifests are in this post.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The incident that makes this concrete
&lt;/h2&gt;

&lt;p&gt;In June 2026, Tenet Security disclosed an attack they call Agentjacking. The mechanics are almost insultingly simple. An attacker pushes a fake error event into a Sentry project using a public DSN. The Sentry MCP server feeds that error to an AI coding agent as a real bug to fix, with malicious instructions hidden in the message body as a fake &lt;code&gt;## Resolution&lt;/code&gt; section. The agent reads it, believes it, and runs attacker text against the developer's machine.&lt;/p&gt;

&lt;p&gt;Tenet's figures: passive recon found &lt;strong&gt;2,388 organizations&lt;/strong&gt; with valid injectable DSNs, 71 of them in the Tranco top one million. In controlled testing against more than a hundred consenting organizations, &lt;strong&gt;100+ agents acted on the injected errors&lt;/strong&gt; with an &lt;strong&gt;85% exploitation success rate&lt;/strong&gt;, against Claude Code, Cursor, and Codex among others. What got pulled in their tests: AWS secret keys, GitHub OAuth tokens, SSH agent sockets, Kubernetes tokens, &lt;code&gt;~/.aws/config&lt;/code&gt;, &lt;code&gt;~/.npmrc&lt;/code&gt;. Those are Tenet's numbers, from their writeup, not mine (&lt;a href="https://tenetsecurity.ai/blog/agentjacking-coding-agents-with-fake-sentry-errors/" rel="noopener noreferrer"&gt;Tenet Security, Agentjacking&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The part I keep rereading is Sentry's response. They acknowledged it and declined to fix it at the root, calling it &lt;strong&gt;"technically not defensible"&lt;/strong&gt; and noting that model vendors run middleware against it. Read that again. The platform holding the untrusted input is telling you, correctly, that they cannot fix this for you. The model vendor's middleware catches some of it. Neither owns the actual hole, which is the shape of your agent: untrusted in, private read, egress out, all on one bus.&lt;/p&gt;

&lt;p&gt;That is the lethal trifecta, and Agentjacking is just one well-documented instance of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you can't patch this in the model
&lt;/h2&gt;

&lt;p&gt;Simon Willison named the lethal trifecta on &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/" rel="noopener noreferrer"&gt;16 June 2025&lt;/a&gt;: access to private data, exposure to untrusted content, and the ability to communicate externally. His point about why prompt injection resists a model-level fix is the foundation here. In his words, "LLMs are unable to reliably distinguish the importance of instructions based on where they came from," and "we still don't know how to 100% reliably prevent this from happening." His conclusion is to avoid the combination rather than trust a guardrail to catch every attack. Those positions are his, and I am building on them.&lt;/p&gt;

&lt;p&gt;I will say the uncomfortable version plainly. A guardrail that catches 95% of injection attempts sounds great until you remember that an attacker retries. In security, 95% is a failure rate, not a pass rate. If the only thing standing between untrusted text and your AWS keys is a classifier that is wrong one time in twenty, you do not have a control. You have a coin that an attacker gets to flip until it lands their way.&lt;/p&gt;

&lt;p&gt;So the lever moves. Not "detect the injection better." That is tracking, and tracking is the thing I keep arguing against in this series. The lever is to gate the &lt;strong&gt;capability composition&lt;/strong&gt; before the agent runs, so that even a successful injection has nowhere to send what it steals.&lt;/p&gt;

&lt;h2&gt;
  
  
  The claim, sharp enough to argue with
&lt;/h2&gt;

&lt;p&gt;Here is the falsifiable version: &lt;strong&gt;the danger is not the presence of three capabilities, it is their reachability across one shared context, and you can decide that statically from the manifest before the agent starts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If that claim is wrong, two things would have to be true. First, isolating one leg (taking egress off the shared bus) would not reduce the risk. Second, "all three capabilities present" would always mean "exploitable," regardless of how data flows between the tools. The safe fixture below is a single counter-example to both: three capabilities present, zero reachable paths, and a concrete reason why.&lt;/p&gt;

&lt;p&gt;The mechanism worth internalizing is the &lt;strong&gt;shared context bus&lt;/strong&gt;. In a default agent, every tool's output lands in the same LLM context, and that context steers the next tool call. So untrusted text read by one tool can influence any later tool. That is why the three capabilities are not three independent checkboxes. They are nodes on a graph, wired together by the context they share. Willison's own mitigation is to remove a leg from the shared context, for instance by running egress in a separate sandboxed sub-agent that never sees the tainted context. The gate is just a way to check, mechanically, whether you actually did that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the lethal trifecta gate checks, and how
&lt;/h2&gt;

&lt;p&gt;The input is one JSON file: an agent manifest. Each tool carries a list of &lt;code&gt;capabilities&lt;/code&gt; drawn from exactly three classes (&lt;code&gt;ingests_untrusted&lt;/code&gt;, &lt;code&gt;reads_private&lt;/code&gt;, &lt;code&gt;can_egress&lt;/code&gt;), an optional &lt;code&gt;isolated&lt;/code&gt; flag, and the manifest declares a &lt;code&gt;data_flow&lt;/code&gt; mode.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;shared_context&lt;/code&gt; mode (the realistic default), every non-isolated tool both feeds and is steered by a virtual &lt;code&gt;&amp;lt;shared-context&amp;gt;&lt;/code&gt; node. An &lt;code&gt;isolated&lt;/code&gt; tool is taken off that bus, modeling a sandboxed sub-agent that receives a structurally fixed input and never sees the shared context. In &lt;code&gt;explicit&lt;/code&gt; mode, only the data-flow edges you declare carry taint, which is the honest mode if you actually wire tools point to point.&lt;/p&gt;

&lt;p&gt;Then it builds a directed graph and runs breadth-first reachability, visiting neighbours in sorted order so the path it reports is deterministic. The trifecta is reachable if there exist an untrusted tool &lt;code&gt;u&lt;/code&gt;, a private tool &lt;code&gt;p&lt;/code&gt;, and an egress tool &lt;code&gt;e&lt;/code&gt; such that &lt;code&gt;p&lt;/code&gt; is reachable from &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;e&lt;/code&gt; is reachable from &lt;code&gt;p&lt;/code&gt;. Those can be the same tool: one tool with all three capabilities is a trifecta by itself. Here is the whole thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;trifecta_gate.py - a static PRE-RUN gate for the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lethal trifecta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.

You cannot patch prompt injection inside the model. The model cannot reliably
tell trusted instructions from untrusted data, so an attacker&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s text that lands
in the context is treated like a command. The lever is not &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detect the injection
better&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (that is still tracking). The lever is to gate the *capability
composition* of an agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tool manifest BEFORE the agent ever runs.

The lethal trifecta (a term coined by Simon Willison) names the dangerous
combination: in one session an agent can (1) read UNTRUSTED input, (2) read
PRIVATE data, and (3) send data to the OUTSIDE (egress). When all three can
reach each other through one shared context, injected text can read your
secrets and mail them out. This script intercepts NOTHING at runtime. It reads
a static tool manifest (JSON) where each tool is tagged with capabilities and
(optionally) data-flow edges, then answers ONE question by graph reachability:

    Is there a path, inside a single agent/session, where UNTRUSTED input can
    reach a PRIVATE read AND that data can then reach an EGRESS sink?

The point the fixtures prove: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all three capabilities are present&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; is NOT the
same as &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the trifecta is reachable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. A safe manifest and a vulnerable one can
declare the SAME three capabilities; what differs is whether the egress tool
sits on the shared context bus. The gate decides on the GRAPH, not a checklist.

exit 0 = trifecta NOT reachable (safe to start)
exit 1 = trifecta reachable (print the data-flow path(s) that close it)
exit 2 = bad input

Offline / keyless / read-only / zero-network. Stdlib only (json, sys, deque).
It reads a manifest file and prints a verdict. No network, no child process, no
model load, no install, and it never launches the agent itself.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="n"&gt;CAPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingests_untrusted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reads_private&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can_egress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;CTX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;shared-context&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# virtual node: the LLM context bus all tools share
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trifecta-gate: ERROR: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;OSError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cannot read manifest file: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manifest is not valid JSON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manifest must be a JSON object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manifest.tools must be a non-empty list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_flow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shared_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shared_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;explicit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_flow must be &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shared_context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;explicit&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;each tool must be an object with an &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool id must be a non-empty string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate tool id: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;caps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capabilities of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; must be a list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;caps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CAPS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown capability &lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt; on tool &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isolated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isolated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(unnamed-agent)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Directed adjacency. In shared_context mode every non-isolated tool both
    feeds and is steered by the shared LLM context bus; isolated tools are off
    the bus (sandboxed). In explicit mode only declared flows carry taint.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;adj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shared_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isolated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="nf"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CTX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# tool output enters the shared context
&lt;/span&gt;                &lt;span class="nf"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CTX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# context steers this tool's next input
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;flows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# explicit flows are honored in BOTH modes
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;each flow must be an object with &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;die&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flow references unknown tool: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# deterministic neighbour order
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Shortest path tree from src. Returns (reachable_set, predecessor_map).
    Neighbours visited in sorted order -&amp;gt; path is deterministic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;nxt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()):&lt;/span&gt;  &lt;span class="c1"&gt;# already sorted
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nxt&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nxt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nxt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;
                &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nxt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;path_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_trifecta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingests_untrusted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reads_private&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can_egress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reach_u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred_u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reach_u&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;reach_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reach_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="n"&gt;full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;path_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;path_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
                &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage: trifecta_gate.py &amp;lt;agent_manifest.json&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;adj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_trifecta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;isolated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isolated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trifecta-gate: agent=%s mode=%s tools=%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capabilities: ingests_untrusted=%d reads_private=%d can_egress=%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isolated (off shared context): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isolated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isolated&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(none)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VERDICT: LETHAL TRIFECTA REACHABLE - %d path(s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [%d] untrusted=%s -&amp;gt; private=%s -&amp;gt; egress=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      flow: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACTION: do NOT start agent. Break one leg (isolate the egress or the&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        private read into a separate context/sub-agent) before launch.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VERDICT: trifecta NOT reachable (0 paths) - safe to start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note: all three capability classes are present but no untrusted-&amp;gt;private&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      -&amp;gt;egress data-flow path exists (isolation/edges break the chain).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CAPS&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingests_untrusted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reads_private&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can_egress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[])))&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note: trifecta cannot close - missing capability class(es): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The vulnerable manifest: a normal inbox agent
&lt;/h2&gt;

&lt;p&gt;The first manifest is a LangGraph-style inbox assistant, the kind of thing people ship in a weekend. Four tools. &lt;code&gt;read_email&lt;/code&gt; pulls message bodies, which an attacker controls, so it ingests untrusted input. &lt;code&gt;search_inbox&lt;/code&gt; and &lt;code&gt;read_contacts&lt;/code&gt; touch private data. &lt;code&gt;send_email&lt;/code&gt; can mail anyone, so it can egress. Nothing is isolated, and everything shares one context. Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 trifecta_gate.py fixtures/vulnerable_manifest.json
&lt;span class="go"&gt;trifecta-gate: agent=langgraph-inbox-assistant mode=shared_context tools=4
capabilities: ingests_untrusted=1 reads_private=2 can_egress=1
isolated (off shared context): (none)

VERDICT: LETHAL TRIFECTA REACHABLE - 2 path(s)
&lt;/span&gt;&lt;span class="gp"&gt;  [1] untrusted=read_email -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;private&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;read_contacts -&amp;gt; &lt;span class="nv"&gt;egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;send_email
&lt;span class="gp"&gt;      flow: read_email -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;lt;shared-context&amp;gt; -&amp;gt; read_contacts -&amp;gt; &amp;lt;shared-context&amp;gt; -&amp;gt; send_email
&lt;span class="gp"&gt;  [2] untrusted=read_email -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;private&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;search_inbox -&amp;gt; &lt;span class="nv"&gt;egress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;send_email
&lt;span class="gp"&gt;      flow: read_email -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;lt;shared-context&amp;gt; -&amp;gt; search_inbox -&amp;gt; &amp;lt;shared-context&amp;gt; -&amp;gt; send_email
&lt;span class="go"&gt;
ACTION: do NOT start agent. Break one leg (isolate the egress or the
        private read into a separate context/sub-agent) before launch.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 1. Two paths, both running through the shared context node. To be precise about what "2 paths" means: that is two distinct trifecta closures in this manifest, one through &lt;code&gt;read_contacts&lt;/code&gt; and one through &lt;code&gt;search_inbox&lt;/code&gt;. It is not a count of attacks in the wild, and it is not a measurement of anything beyond this file. The gate is telling you the shape is exploitable and pointing at exactly where. An attacker who plants instructions in an email body can, in principle, get the agent to read your contacts and forward them. The defense the gate suggests is structural: do not start the agent in this shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  The safe manifest: same three capabilities, one is moved off the bus
&lt;/h2&gt;

&lt;p&gt;Now the manifest that makes the point. It declares the &lt;strong&gt;same four tools and the same three capability classes&lt;/strong&gt;. The only change: &lt;code&gt;send_email&lt;/code&gt; is marked &lt;code&gt;isolated&lt;/code&gt;, modeling an egress step that runs as a sandboxed sub-agent receiving a structurally fixed recipient and a templated body, never the shared context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 trifecta_gate.py fixtures/safe_manifest.json
&lt;span class="go"&gt;trifecta-gate: agent=langgraph-inbox-assistant-isolated-send mode=shared_context tools=4
capabilities: ingests_untrusted=1 reads_private=2 can_egress=1
isolated (off shared context): send_email

VERDICT: trifecta NOT reachable (0 paths) - safe to start
&lt;/span&gt;&lt;span class="gp"&gt;note: all three capability classes are present but no untrusted-&amp;gt;&lt;/span&gt;private
&lt;span class="gp"&gt;      -&amp;gt;&lt;/span&gt;egress data-flow path exists &lt;span class="o"&gt;(&lt;/span&gt;isolation/edges &lt;span class="nb"&gt;break &lt;/span&gt;the chain&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0. Zero paths. Look at the &lt;code&gt;capabilities&lt;/code&gt; line: it is identical to the vulnerable run, &lt;code&gt;ingests_untrusted=1 reads_private=2 can_egress=1&lt;/code&gt;. If this gate were a checklist, both manifests would score three out of three and both would fail. They do not. The vulnerable one fails, the safe one passes, and the only difference is whether egress sits on the shared bus. That is the entire argument for doing this by reachability instead of by counting flags. A checklist cannot tell these two apart. The graph can.&lt;/p&gt;

&lt;p&gt;This is also why a crypto agent is the expensive version of the same picture. Swap the labels: &lt;code&gt;read_pool_data&lt;/code&gt; ingests untrusted on-chain or web input, &lt;code&gt;read_wallet&lt;/code&gt; reads a private key or balance, &lt;code&gt;sign_and_send_tx&lt;/code&gt; egresses, except here egress is money leaving the wallet. Same graph, same closing path, except a successful injection moves funds instead of contacts. I did not ship a crypto manifest in this post, but the structure is identical, and so is the fix: take the signing step off the shared context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exit code is the gate
&lt;/h2&gt;

&lt;p&gt;The point of returning 0, 1, or 2 is that CI can read it without reading prose. Wire it before you launch the agent or merge a manifest change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;python3 trifecta_gate.py agent_manifest.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"trifecta not reachable - safe to start the agent"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"trifecta reachable or bad manifest - hold the launch"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0 lets the launch proceed. Exit 1 holds it and prints the paths to break. Exit 2 means the manifest was malformed, and a malformed manifest must never read as "safe." Feed it the broken fixture, where &lt;code&gt;tools&lt;/code&gt; is a string instead of a list, and it exits 2 with &lt;code&gt;trifecta-gate: ERROR: manifest.tools must be a non-empty list&lt;/code&gt;. Run it with no arguments and it prints usage and exits 2. A gate that cannot tell "all clear" from "I could not check" is not a gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic, so it can live in CI
&lt;/h2&gt;

&lt;p&gt;The output carries no timestamps, no map iteration order, no floating point. Neighbours are sorted before traversal, so the path it prints is stable. Hash the STDOUT of each fixture twice and it is identical both times: the vulnerable run is &lt;code&gt;7718cefc5ffce46aee99111e595262803698d2fab3a741a7eb3137e95a8b11aa&lt;/code&gt;, the safe run is &lt;code&gt;a561c84a8f2022f1742fa888268255d00d15221c801065f8ecd206f8e2eeeadc&lt;/code&gt;, the bad run is &lt;code&gt;9ed47c5847aaf49f8d2f9733a1c4565959b775bfce945dfcd65a369a2e624557&lt;/code&gt;. That matters because a check you cannot reproduce is a second opinion, not a gate. This one you can pin in a test and diff on every manifest change.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;I would rather you know the edges than hit them in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It does not catch the injection.&lt;/strong&gt; It never sees a prompt, a model, or a runtime call. It reads a static manifest and reasons about reachability. It assumes the injection can happen (because at the model level, it can) and asks whether a successful one would have a path to exfiltrate. If you want to know whether a specific payload gets through, this is the wrong tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It does not replace sandboxing or human approval.&lt;/strong&gt; Marking a tool &lt;code&gt;isolated&lt;/code&gt; is a claim that you actually sandboxed it. The gate checks that your declared architecture breaks the chain; it does not enforce the sandbox. The runtime isolation and the human-in-the-loop on high-risk actions are still your job. This tells you whether the design, as declared, is sound.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It trusts your capability tags.&lt;/strong&gt; If you label &lt;code&gt;send_email&lt;/code&gt; as not able to egress, or forget that a "read-only" tool can leak data through an error message or a logged URL, the graph is wrong and so is the verdict. The honesty of the tags is the whole foundation. Garbage tags, garbage gate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is a per-session, single-agent model.&lt;/strong&gt; It does not reason about data that leaves one agent, gets stored, and returns to another later. Cross-agent and persisted-memory flows are a harder graph than this one draws.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The shared_context model is an assumption, not a law.&lt;/strong&gt; It encodes the common default where everything shares one bus. If your orchestrator wires tools point to point, use &lt;code&gt;explicit&lt;/code&gt; mode and declare the real edges. The contract is "reason about reachability on the data flow you actually have," not "every agent looks like my default."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this sits next to the other gates
&lt;/h2&gt;

&lt;p&gt;This is one more pre-execution check in a series, and it helps to say how it differs from its closest neighbors so you reach for the right one.&lt;/p&gt;

&lt;p&gt;It is the manifest-shaped sibling of the &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate for AI agents&lt;/a&gt;: a check that has to pass before an action, applied here to the whole capability graph instead of a single call.&lt;/p&gt;

&lt;p&gt;It is not the &lt;a href="https://finops.spinov.online/blog/blast-radius-ai-agent-api-key/" rel="noopener noreferrer"&gt;blast radius of a leaked agent API key&lt;/a&gt;. That one measures the scope of damage from one leaked credential. This one measures the composition of many tools, whether their capabilities can reach each other at all. Different object: one is the reach of a key, this is the reach of a graph.&lt;/p&gt;

&lt;p&gt;It is not &lt;a href="https://finops.spinov.online/blog/llm-router-credential-leak-redact-at-boundary/" rel="noopener noreferrer"&gt;redacting credential leaks at the boundary&lt;/a&gt; either. That tool scrubs the secret out of what egresses, the content leaving the wire. This one asks an earlier question: can untrusted input reach the egress sink in the first place? One cleans the payload, the other removes the path.&lt;/p&gt;

&lt;p&gt;And it is a different axis from &lt;a href="https://finops.spinov.online/blog/mcp-tool-pin-verify/" rel="noopener noreferrer"&gt;pinning and verifying MCP tools&lt;/a&gt;. That guards against the manifest drifting or a tool being swapped under you, a question of version integrity. This reads the same manifest and asks whether its capabilities, as declared, compose into the trifecta. Drift versus reachability.&lt;/p&gt;

&lt;p&gt;Finally, it shares a family resemblance with &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;your agent returns 200 and lies&lt;/a&gt;: both refuse to trust a surface signal. There it is a clean status code over a wrong effect; here it is "all three capabilities, looks scary" versus the actual reachable paths. Different artifact, same suspicion of the easy read.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question I am still chewing on
&lt;/h2&gt;

&lt;p&gt;The model assumes one shared context per agent. Real platform-MCP setups are messier. You connect five servers, each adding tools, and some of them quietly add a leg to the trifecta on a context they all share. The honest hard part is &lt;code&gt;explicit&lt;/code&gt; mode: drawing the true data-flow edges of a real orchestrator is work, and if you draw them wrong the gate is confidently wrong with you.&lt;/p&gt;

&lt;p&gt;So here is the real open question for anyone running an MCP composition or a multi-server agent: how many of your connected servers sit on the same context bus, and which one of them opens egress? If you cannot answer that quickly, the trifecta might already be reachable in your agent and nobody drew the graph. I have a partial answer (tag every tool's three capabilities at registration, fail the build on a reachable path) and no clean way to keep the &lt;code&gt;explicit&lt;/code&gt; edges honest as the agent grows. Tell me how you model the context bus in your setup. I read every comment.&lt;/p&gt;

&lt;p&gt;Follow for the next runnable gate in this series on controlling agents before you trust them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Alexey Spinov. AI-assisted, human-verified: the tool, all three manifests, and every number above come from a real local run on 2026-06-30 (Python 3.13.5, stdlib only, offline). I ran it, checked the exit codes (0 / 1 / 2), hashed the STDOUT twice to confirm determinism, and edited every line. The Agentjacking figures (2,388 organizations, 100+ agents at an 85% success rate in controlled testing, the "technically not defensible" quote) are Tenet Security's, from their June 2026 writeup, not my measurements. The term "lethal trifecta" and the position that LLMs cannot reliably distinguish trusted instructions from untrusted data are Simon Willison's, from his 16 June 2025 post. I label which numbers are theirs and which are mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>A 58% Win-Rate Over Zero Closed Trades: Recompute Agent Scorecard</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Mon, 29 Jun 2026 07:30:13 +0000</pubDate>
      <link>https://dev.to/alex_spinov/a-58-win-rate-over-zero-closed-trades-recompute-agent-scorecard-35di</link>
      <guid>https://dev.to/alex_spinov/a-58-win-rate-over-zero-closed-trades-recompute-agent-scorecard-35di</guid>
      <description>&lt;p&gt;Recompute the agent scorecard from its primary event journal before you trust it: a self-reported metric is the actor grading itself. &lt;code&gt;scorecard_reconcile.py&lt;/code&gt; re-derives every metric independently and flags divergence. On the divergent fixture a claimed 58% win-rate sits over zero closed trades, yielding 5 DIVERGENT and 1 UNSUPPORTED metric and exit 1, blocking the add-capital decision.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I wrote &lt;code&gt;scorecard_reconcile.py&lt;/code&gt; with an AI assistant and ran it myself, offline, before publishing. Every number in the output blocks below is pasted from a real local run on Python 3.13.5, stdlib only, on the synthetic fixtures included in this post. I checked the exit codes, hashed the STDOUT twice to confirm it is byte-for-byte deterministic, and edited every line. The external figures (the SEC's $12.3M case) are the SEC's numbers, not mine, and I link the primary source. I label which numbers are theirs and which are mine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A self-reported metric and a real one look identical on a dashboard. The agent prints both.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scorecard_reconcile.py&lt;/code&gt; reads the scorecard the agent claimed plus the journal of events it logged, then recomputes each metric from the journal and flags anything that does not reconcile.&lt;/li&gt;
&lt;li&gt;On the clean fixture, all 6 metrics MATCH, 0 flagged, exit 0. On the divergent fixture, a claimed 12-trade, 58.3% win-rate scorecard sits over a journal with &lt;strong&gt;zero closing events&lt;/strong&gt;: 5 metrics DIVERGENT, 1 UNSUPPORTED, exit 1.&lt;/li&gt;
&lt;li&gt;The 58.3% is the dangerous one. A win-rate over zero closed trades is not 58%. It is undefined, and a number you cannot disprove is worse than a number that is merely wrong.&lt;/li&gt;
&lt;li&gt;Stdlib only (&lt;code&gt;json&lt;/code&gt;, &lt;code&gt;sys&lt;/code&gt;, &lt;code&gt;hashlib&lt;/code&gt;). No network, no model, no exec. The run is byte-for-byte deterministic. The code and both fixtures are in this post.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The case that makes this concrete
&lt;/h2&gt;

&lt;p&gt;On 28 May 2026 the SEC charged Nathan Fuller, founder of Privvy Investments, over a crypto scheme it valued at about $12.3 million raised from roughly 150 investors (&lt;a href="https://www.sec.gov/enforcement-litigation/litigation-releases/lr-26558" rel="noopener noreferrer"&gt;SEC litigation release LR-26558&lt;/a&gt;, reported by &lt;a href="https://www.coindesk.com/business/2026/05/30/sec-sues-texas-man-over-usd12-3-million-alleged-crypto-scheme-built-on-fake-ai-trading-bots" rel="noopener noreferrer"&gt;CoinDesk on 30 May 2026&lt;/a&gt;). The pitch was proprietary AI trading bots running high-frequency arbitrage. According to the complaint, only about $380,000, roughly 3% of the money, ever bought crypto at all, those trades ran without the advertised bots, and they made no profit. Investors were kept calm with fake account statements and fabricated correspondence. Those are the SEC's figures, from their filing, not mine.&lt;/p&gt;

&lt;p&gt;Strip out the fraud and a quieter version of the same shape shows up in honest setups every week. A trading agent prints a dashboard. The dashboard says 4 active days, 12 trades, 58.3% win-rate. Nobody re-derives those numbers from the exchange fills. The statement and the reality were never reconciled. Fuller's was a deliberate fabrication; most are just an agent confidently reporting work that the journal does not back. The defense is the same in both: do not accept the scorecard the actor printed about itself. Recompute it from the primary events first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The claim, sharp enough to argue with
&lt;/h2&gt;

&lt;p&gt;Here is the falsifiable version: &lt;strong&gt;a self-reported metric is the actor grading itself, and a scorecard you did not recompute is not evidence.&lt;/strong&gt; Win-rate, trade count, "ran for N days," realized PnL: every one of those is emitted by the agent, from the agent's own view of what it did. If that view is wrong, optimistic, or invented, the metric inherits the error and presents it as a clean number on a chart.&lt;/p&gt;

&lt;p&gt;If this claim were false, the recomputed scorecard would always equal the claimed one, and a tool like this would be pointless. You could read the dashboard and move on. The tool earns its place precisely when the two disagree, and the divergent fixture below is one keystroke of proof that they can disagree completely.&lt;/p&gt;

&lt;p&gt;The control is not "log the metrics and watch them." That is tracking. Tracking tells you what the agent said about itself. Control is recomputing the metric from the primary journal and gating the next decision, keep it running, scale it, add capital, on whether the recomputation agrees. This is the post-hoc sibling of the &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate for AI agents&lt;/a&gt;: same idea, a check that has to pass before an action, applied to the aggregate report instead of a single call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the tool recomputes, and how
&lt;/h2&gt;

&lt;p&gt;The input is two small JSON files. The first is &lt;code&gt;claimed.json&lt;/code&gt;, the scorecard the agent printed about itself. The second is &lt;code&gt;evidence.json&lt;/code&gt;, the primary journal: an &lt;code&gt;events&lt;/code&gt; list of &lt;code&gt;open&lt;/code&gt; and &lt;code&gt;close&lt;/code&gt; records, each with a timestamp and, for closes, a PnL. Think exchange fills, not the agent's summary of them.&lt;/p&gt;

&lt;p&gt;The tool ignores the claimed numbers and rebuilds each metric from the journal with one defensible definition per metric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;trade&lt;/strong&gt; is a closed position, a &lt;code&gt;close&lt;/code&gt; event. Open attempts that never fill are not trades.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;win&lt;/strong&gt; is a close with PnL above zero, a &lt;strong&gt;loss&lt;/strong&gt; is a close below zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active days&lt;/strong&gt; are the count of distinct ISO dates across all events, taken as the first ten characters of the timestamp. No clock, no date library, no timezone math.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Win-rate&lt;/strong&gt; is wins over decided trades. Over zero decided trades it is &lt;code&gt;None&lt;/code&gt;, undefined, not zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realized PnL&lt;/strong&gt; is the sum of PnL over closes, rounded to a cent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tolerances are explicit so the verdict is not a matter of taste: counts must match exactly, ratios within 0.005 (half a percentage point), money within one cent. Here is the whole tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;scorecard_reconcile.py - recompute an agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s self-reported scorecard from its
primary event journal BEFORE you trust the report (or add capital / scale / keep it on).

A self-reported metric is the actor grading itself: win-rate, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ran for N days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
trade count, realized PnL - printed by the same agent whose work they describe.
This tool ignores the claimed numbers and INDEPENDENTLY recomputes each one from
the primary event journal (the fills / closes / activity the agent actually
logged), then flags every metric that does not reconcile.

Two failure shapes, not one:
  DIVERGENT   - claimed value != value recomputed from evidence.
  UNSUPPORTED - claimed value cannot be derived from the journal at all
                (a win-rate over zero closed trades is not 58%, it is
                undefined; the number is unfalsifiable, which is worse).

offline / keyless / read-only / zero-network. stdlib only (json, sys, hashlib).
The journal is read, never run. Nothing is fetched, nothing is sent.

Exit 0 = every claimed metric is supported by evidence AND matches   -&amp;gt; report trustworthy
Exit 1 = at least one metric DIVERGENT or UNSUPPORTED                 -&amp;gt; do NOT trust the report
Exit 2 = bad input

Usage:
  python3 scorecard_reconcile.py &amp;lt;claimed.json&amp;gt; &amp;lt;evidence.json&amp;gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="n"&gt;RATIO_TOL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.005&lt;/span&gt;   &lt;span class="c1"&gt;# 0.5 percentage points
&lt;/span&gt;&lt;span class="n"&gt;MONEY_TOL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;    &lt;span class="c1"&gt;# one cent
&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file not found: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invalid json in %s: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;as_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric %r is not numeric: %r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recompute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Independently rebuild every metric from the primary journal.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evidence.events must be a list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;closes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event is not an object: %r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="c1"&gt;# ISO date portion; no clock, no parse libs
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;closes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;closes&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;as_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;losses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;closes&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;as_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;decided&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wins&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;losses&lt;/span&gt;
    &lt;span class="n"&gt;win_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wins&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;decided&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decided&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;   &lt;span class="c1"&gt;# None =&amp;gt; undefined / unsupported
&lt;/span&gt;    &lt;span class="n"&gt;realized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;closes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closes&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wins&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;losses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;win_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;win_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realized_pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;realized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_supporting_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_closing_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# metric name -&amp;gt; comparison kind
&lt;/span&gt;&lt;span class="n"&gt;SPEC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trades&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;losses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;win_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realized_pnl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;money&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claimed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recomputed&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;recomputed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNSUPPORTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATCH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recomputed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIVERGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATCH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;recomputed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;RATIO_TOL&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIVERGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;money&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATCH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;recomputed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;MONEY_TOL&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIVERGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown metric kind: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNDEFINED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ratio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%.3f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%.2f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed_doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence_doc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed_doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scorecard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;claimed_doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claimed.json must contain a &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scorecard&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;claimed_doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scorecard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;scorecard&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; must be an object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evidence_doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;evidence_doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evidence.json must contain an &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;truth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recompute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evidence_doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed_doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SCORECARD RECONCILE - %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary journal: %d events (%d closing) - recomputed independently&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                 &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;truth&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_supporting_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;truth&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_closing_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%-14s %12s %12s   %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claimed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from-journal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%-14s %12s %12s   %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;divergent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsupported&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SPEC&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BadInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scorecard missing metric: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;claimed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;as_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;recomputed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;truth&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claimed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recomputed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIVERGENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;divergent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNSUPPORTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;unsupported&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;note&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNSUPPORTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;note&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  (0 closing events -&amp;gt; ratio is undefined / unfalsifiable)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%-14s %12s %12s   %s%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                     &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claimed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recomputed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;flagged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;divergent&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;unsupported&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flagged: %d (%d divergent, %d unsupported)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flagged&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;divergent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsupported&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;flagged&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DECISION GATE: scorecard reconciles with evidence -&amp;gt; trust permitted (exit 0)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DECISION GATE: scorecard does NOT reconcile -&amp;gt; do NOT trust report; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block continue/add-capital (exit 1)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage: scorecard_reconcile.py &amp;lt;claimed.json&amp;gt; &amp;lt;evidence.json&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;claimed_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;evidence_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claimed_doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence_doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;BadInput&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad input: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REPORT-SHA256: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A scorecard that reconciles
&lt;/h2&gt;

&lt;p&gt;The honest fixture has a journal of 14 events, 10 of them closes, and a scorecard that genuinely describes them. Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 scorecard_reconcile.py fixtures/clean_claimed.json fixtures/clean_evidence.json
&lt;span class="go"&gt;SCORECARD RECONCILE - trader-honest
primary journal: 14 events (10 closing) - recomputed independently

metric              claimed from-journal   verdict
-------------- ------------ ------------   -------
trades                   10           10   MATCH
active_days               3            3   MATCH
wins                      6            6   MATCH
losses                    4            4   MATCH
win_rate              0.600        0.600   MATCH
realized_pnl         125.50       125.50   MATCH

flagged: 0 (0 divergent, 0 unsupported)
&lt;/span&gt;&lt;span class="gp"&gt;DECISION GATE: scorecard reconciles with evidence -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;trust permitted &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;0&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;REPORT-SHA256: 15a33ee3894142fdcec6c06589e0d9aff09d1331ca7a91b88229d2d3b6a10aba
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 6 metrics MATCH, exit 0. This is what passing looks like, and you want it to be boring. The claimed 0.600 win-rate is the same 0.600 the tool derived from the fills, the claimed $125.50 is the sum it computed over the 10 closes. Nothing to argue with. The gate opens.&lt;/p&gt;

&lt;h2&gt;
  
  
  12 trades claimed, zero in the journal
&lt;/h2&gt;

&lt;p&gt;Now the divergent fixture. The scorecard claims 4 active days, 12 trades, 7 wins, 5 losses, a 58.3% win-rate, and $842.00 realized. The journal underneath it holds three &lt;code&gt;open&lt;/code&gt; events on a single day, all rejected, and not one close.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 scorecard_reconcile.py fixtures/divergent_claimed.json fixtures/divergent_evidence.json
&lt;span class="go"&gt;SCORECARD RECONCILE - trader-x (self-reported dashboard)
primary journal: 3 events (0 closing) - recomputed independently

metric              claimed from-journal   verdict
-------------- ------------ ------------   -------
trades                   12            0   DIVERGENT
active_days               4            1   DIVERGENT
wins                      7            0   DIVERGENT
losses                    5            0   DIVERGENT
&lt;/span&gt;&lt;span class="gp"&gt;win_rate              0.583    UNDEFINED   UNSUPPORTED  (0 closing events -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ratio is undefined / unfalsifiable&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;realized_pnl         842.00         0.00   DIVERGENT

flagged: 6 (5 divergent, 1 unsupported)
&lt;/span&gt;&lt;span class="gp"&gt;DECISION GATE: scorecard does NOT reconcile -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do &lt;/span&gt;NOT trust report&lt;span class="p"&gt;;&lt;/span&gt; block &lt;span class="k"&gt;continue&lt;/span&gt;/add-capital &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;REPORT-SHA256: 7d1a10fd12a95a13329ffd7c34d6d869411a51d2e4196c0be6c20dec4d06f781
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six metrics flagged, exit 1. The dashboard told a four-day story. The events show one day of failed attempts and zero realized anything. If you had read the scorecard and added capital, you would have funded a $842 PnL that does not exist in the record the agent itself kept.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two failure shapes, and why one is worse
&lt;/h2&gt;

&lt;p&gt;Most checks have a single failure mode: pass or fail. This one separates two, because they call for different reactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DIVERGENT&lt;/strong&gt; is the loud one. Claimed 12, journal says 0. Claimed $842, journal says $0.00. The numbers disagree and the disagreement is concrete. You can chase it: a logging bug, a double-count, a wrong window, a fabrication. There is a thread to pull.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UNSUPPORTED&lt;/strong&gt; is the quiet one, and it is worse. The claimed 58.3% win-rate cannot be derived from the journal at all, because there are zero decided trades. A win-rate is wins over decided trades, and over zero trades that ratio is undefined, not 58.3% and not 0%. The agent printed a precise-looking number for a quantity the evidence cannot define. You cannot disprove 58.3% by pointing at the fills, because the fills do not speak to it. A number you cannot falsify is not a weak metric, it is a non-metric wearing a metric's clothes, and it is exactly the kind of figure that survives a review because it looks specific. The tool refuses to call it DIVERGENT (that would imply the right answer was some other number) and labels it UNSUPPORTED instead.&lt;/p&gt;

&lt;p&gt;That distinction is the whole reason the tool prints two counters instead of one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The exit code is the gate, not the chart
&lt;/h2&gt;

&lt;p&gt;The point of returning 0, 1, or 2 is that a pipeline can read it without reading prose. Wire it before the decision that spends money or scope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;python3 scorecard_reconcile.py claimed.json evidence.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"scorecard reconciles - proceed with the add-capital review"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"scorecard does not reconcile - hold; do not scale on this report"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0 lets the next step run. Exit 1 holds it. Exit 2 means the input was malformed (events that are not a list, a missing scorecard, a non-numeric metric), and a malformed reconciliation should never read as "passed." Try it: feed it a bad file and it exits 2 with &lt;code&gt;bad input: evidence.events must be a list&lt;/code&gt;, run it with no arguments and it prints usage and exits 2. A gate that cannot tell "all clear" from "I could not check" is not a gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic, so it can live in CI
&lt;/h2&gt;

&lt;p&gt;The report ends with a SHA-256 of its own body. Run the clean fixture twice and the STDOUT hashes to &lt;code&gt;64ac7d45afb235bae3fcfac28e98fd3ae8b6a4c5f43e6696ebd2b723684159b6&lt;/code&gt; both times; the divergent fixture is &lt;code&gt;64246aad569193ce61d004e66d79cb256f6a8d8e7ebd8feda9dfb8c41bf8d52e&lt;/code&gt; both times. No timestamps in the output, no map ordering, no floating-point surprise past the cent. That matters because a reconciliation you cannot reproduce is just a second opinion. This one you can pin in a test and diff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;I would rather you know the edges than discover them on your own logs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is not an audit of the trading logic.&lt;/strong&gt; It checks that the scorecard agrees with the journal. It says nothing about whether the strategy is good, whether the fills were priced fairly, or whether the agent should have opened those positions at all. A perfectly reconciled scorecard can describe a terrible strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It trusts the journal.&lt;/strong&gt; The whole method assumes &lt;code&gt;evidence.json&lt;/code&gt; is the primary record, exchange fills or a settlement log, not a second file the same agent wrote. If the agent forges the journal too, this catches nothing. So the real question is upstream: is your evidence a source the agent cannot rewrite? Pull fills from the exchange API or the chain, not from the agent's own summary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not on-chain attestation or settlement verification.&lt;/strong&gt; It does no signature checks and reads no blockchain. For "did this fill really happen and settle," you want the exchange's records or a node, then feed those in as the journal. Pair it with &lt;a href="https://finops.spinov.online/blog/grok-tx-canary/" rel="noopener noreferrer"&gt;the Grok tx canary&lt;/a&gt;, which gates a single transaction before broadcast, while this reconciles the aggregate after the fact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A flag is a signal, not a verdict on intent.&lt;/strong&gt; DIVERGENT can be a logging bug as easily as a lie. The tool tells you the report and the record disagree. Why they disagree is your investigation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The metric definitions are mine, and arguable.&lt;/strong&gt; I count a trade as a close, not an open. If your book counts entries, or partial fills, or funding events, change the definitions in &lt;code&gt;recompute&lt;/code&gt; to match your venue. The contract is "recompute from primary events with explicit rules," not these six exact rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this sits next to the other gates
&lt;/h2&gt;

&lt;p&gt;This is one more pre-decision check in a series, and it is worth saying how it differs from its closest neighbors so you use the right one.&lt;/p&gt;

&lt;p&gt;It is not &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;your-agent-returns-200-and-lies&lt;/a&gt;. That tool verifies a single call: a clean 200 whose effect was wrong. This one verifies an aggregate over many events, the scorecard built from all of them. Different object, different scale.&lt;/p&gt;

&lt;p&gt;It is not the &lt;a href="https://finops.spinov.online/blog/green-checkmark-auditor/" rel="noopener noreferrer"&gt;green-checkmark auditor&lt;/a&gt; either. That one asks whether a passing test actually exercises the code or just mirrors it. Here the question is whether a claimed KPI is backed by the primary events. Both share a suspicion of green that was never earned, applied to different artifacts.&lt;/p&gt;

&lt;p&gt;And it sits a step downstream of the &lt;a href="https://finops.spinov.online/blog/waste-probe-tokens-after-failure/" rel="noopener noreferrer"&gt;waste-probe for tokens burned after a failure&lt;/a&gt;: that one measures cost the agent already spent, this one questions the success the agent claims for what it spent. Cost and truth are separate audits, and an agent can fail both at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question I am still chewing on
&lt;/h2&gt;

&lt;p&gt;The method is only as good as the journal you feed it, and that is the part I have not solved cleanly. For a centralized exchange you can pull fills from the venue's API, a record the agent cannot rewrite. For an agent that logs its own activity to a file it also controls, the journal and the scorecard come from the same hand, and reconciling one against the other proves consistency, not truth.&lt;/p&gt;

&lt;p&gt;So here is the real open question for anyone running a trading or ops agent: what is the primary journal for your bot, the exchange's fills and the chain's settlements, or the activity log the agent itself prints? If it is the latter, what stops the agent from making both agree? I have a few ideas (signed fills, a journal written by a separate process the agent cannot reach) and no clean rule. Drop how you source your evidence in the comments, I read every one.&lt;/p&gt;

&lt;p&gt;Follow for the next runnable check in this series on controlling agents before you trust them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Alexey Spinov. AI-assisted, human-verified: the tool, all five fixtures, and every number above come from a real local run on 2026-06-29 (Python 3.13.5, stdlib only, offline). I ran it, checked the exit codes (0 / 1 / 2), hashed the STDOUT twice to confirm determinism, and edited every line. The SEC figures ($12.3M, ~150 investors, ~$380K/3% actually traded) are the SEC's, from litigation release LR-26558 and CoinDesk's 30 May 2026 report, not my measurements. I label which numbers are theirs and which are mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>crypto</category>
    </item>
    <item>
      <title>Your LLM Router Logged the Wallet Key. It Already Left.</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Sat, 27 Jun 2026 01:34:14 +0000</pubDate>
      <link>https://dev.to/alex_spinov/your-llm-router-logged-the-wallet-key-it-already-left-1jje</link>
      <guid>https://dev.to/alex_spinov/your-llm-router-logged-the-wallet-key-it-already-left-1jje</guid>
      <description>&lt;p&gt;AI-agent secrets are in transit when a request hits a third-party LLM router or MCP proxy, and that router's audit log is not a control: by the time it logs the request, the credential already crossed your perimeter in plaintext. The fix is to redact at egress, before the bytes leave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A secret going to your own model provider on an &lt;code&gt;Authorization&lt;/code&gt; header is the expected path. The same secret going to a router, gateway, or MCP proxy is a leak, because that host reads your plaintext.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boundary_leak_probe.py&lt;/code&gt; reads one JSON egress map and classifies every secret-bearing field by destination trust. On the leaky fixture: 3 requests, 2 of them to third-party intermediaries, &lt;strong&gt;5 fields crossing the boundary, 6 rule-hits, 2 critical wallet secrets. Exit 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The 2 critical ones are an Ethereum private key and a BIP-39 mnemonic, sitting in MCP tool-call arguments headed to a proxy. Signer material should never transit any middleman.&lt;/li&gt;
&lt;li&gt;Stdlib only (&lt;code&gt;sys&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;re&lt;/code&gt;). No network, no model, no exec. The run is byte-for-byte deterministic.&lt;/li&gt;
&lt;li&gt;A hit is a SIGNAL, not a confirmed live secret. The code and both fixtures are in this post.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The incident that made this worth measuring
&lt;/h2&gt;

&lt;p&gt;In April 2026 a group of researchers, including Chaofan Shou, published "Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain" (&lt;a href="https://arxiv.org/abs/2604.08407" rel="noopener noreferrer"&gt;arXiv 2604.08407&lt;/a&gt;, posted 9 April 2026). They pointed real agent traffic at 428 commodity LLM routers. Nine of them injected code into the responses. Seventeen reached for the researchers' own AWS credentials. Their framing of the mechanism is the part that stuck with me: these routers "operate as application-layer proxies with full plaintext access to every in-flight JSON payload," and no provider enforces cryptographic integrity between the client and the upstream model.&lt;/p&gt;

&lt;p&gt;CoinDesk covered it the same week and carried a blunter line from Shou: 26 routers were "secretly injecting malicious tool calls and stealing creds," and one of them "drained our client's $500k wallet" (&lt;a href="https://www.coindesk.com/tech/2026/04/13/ai-agents-are-set-to-power-crypto-payments-but-a-hidden-flaw-could-expose-wallets" rel="noopener noreferrer"&gt;CoinDesk, 13 April 2026&lt;/a&gt;). That $500k and those router counts are their numbers, from their measurement, not mine. I am citing them for context. Everything I claim about my own tool comes from a run I will paste in full.&lt;/p&gt;

&lt;p&gt;I read that paper on a Tuesday and went looking for the part of my own stack that assumed the router was trusted. I found it fast. We route through a gateway for failover and cost tracking. The gateway has a dashboard. The dashboard has a request log. And I had quietly been treating that log as a safety net: if something leaks, I will see it there.&lt;/p&gt;

&lt;p&gt;That assumption is backwards, and saying it out loud is the whole point of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  The claim, sharp enough to argue with
&lt;/h2&gt;

&lt;p&gt;Here is the falsifiable version: &lt;strong&gt;a router's audit log is a receipt, not a brake.&lt;/strong&gt; By the time a credential shows up in the router's log, the router process has already read it in plaintext. Logging happens on the far side of the boundary. The secret is gone. You cannot un-send it by reviewing a log entry, the same way you cannot un-mail a letter by reading the carbon copy.&lt;/p&gt;

&lt;p&gt;If that claim were false, redaction would not matter and a probe like mine would be pointless. You could just watch the log and rotate after the fact. But "rotate after the fact" assumes the window between send and detection is harmless, and for a wallet private key that window is exactly long enough to sign one transaction. The signer secret is not like an API key you rotate on Monday. Once a third party has it, the funds are a &lt;code&gt;sign_tx&lt;/code&gt; call away.&lt;/p&gt;

&lt;p&gt;So the control has to move upstream of the send. Classify the destination first. Redact anything that should not cross. Then emit the bytes. The log, if you keep one, becomes a record of what you allowed out, not a tripwire you read after the damage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the probe actually does
&lt;/h2&gt;

&lt;p&gt;The input is one JSON file I call an egress map: the outbound requests your agent emits, with their destination host, kind, headers, and body. You can dump this from a request interceptor, a test harness, or by hand. The probe never makes a request. It reads the map statically.&lt;/p&gt;

&lt;p&gt;Two ideas do the work.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;destination trust.&lt;/strong&gt; You declare &lt;code&gt;first_party_hosts&lt;/code&gt;: the hosts you contract with directly, your own backend or the model provider itself. An &lt;code&gt;Authorization&lt;/code&gt; header to one of those is the expected credential path, so the probe does not scream about it. Every other host, the routers and gateways and MCP proxies in the middle, is third-party by default. A secret sitting in the body or tool-call arguments headed there has crossed a boundary you do not own.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;signer material is always-leak.&lt;/strong&gt; An Ethereum private key or a BIP-39 mnemonic must never transit any intermediary, first-party or not. There is no legitimate path where your agent mails a seed phrase through a proxy. If the probe sees one, it is CRITICAL regardless of destination.&lt;/p&gt;

&lt;p&gt;Here are the rules and the value scanner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Secret shapes. critical = signer material that must NEVER transit at all.
&lt;/span&gt;&lt;span class="n"&gt;SECRET_RULES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eth_private_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b0x[0-9a-fA-F]{64}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                  &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bip39_mnemonic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(?:[a-z]{3,8}\s+){11,23}[a-z]{3,8}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_access_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\bAKIA[0-9A-Z]{16}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                   &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\bsk-[A-Za-z0-9]{20,}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bearer_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer\s+[A-Za-z0-9._\-]{16,}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_pat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\bghp_[A-Za-z0-9]{36}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# A value already neutralised before send: env ref / vault handle / masked.
&lt;/span&gt;&lt;span class="n"&gt;SAFE_REF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^(\$\{?[A-Z0-9_]+\}?|\$VAULT_REF:[\w:\-]+|sk-\*{3,}|\*{4,}|&amp;lt;REDACTED[:&amp;gt;])&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;SAFE_REF&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;                       &lt;span class="c1"&gt;# already redacted / handle-referenced
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;crit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;crit&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SECRET_RULES&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SAFE_REF&lt;/code&gt; rule matters more than it looks. A value like &lt;code&gt;${OPENAI_KEY}&lt;/code&gt; or &lt;code&gt;$VAULT_REF:openai&lt;/code&gt; is a handle, not a secret: the real value gets substituted at the trusted edge, not carried in your agent's payload. If you already pass handle references to your router and let your own egress proxy swap them in, you are most of the way to safe. The probe rewards that by staying quiet.&lt;/p&gt;

&lt;p&gt;The classifier walks every string leaf in each request, scans it, and applies the leak rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;first_party&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_party_hosts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;trust&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;first_party&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;third_party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;jp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scan_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;is_auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headers.authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;is_crit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;leak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trust&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;third_party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;is_crit&lt;/span&gt;
            &lt;span class="c1"&gt;# an expected first-party Authorization is NOT a leak (unless critical)
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trust&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_party&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;is_auth&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;is_crit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;leak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trust&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;trust&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;jp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kinds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;is_crit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;leak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;leak&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;walk&lt;/code&gt; helper is the obvious recursive descent over dicts and lists, yielding a JSON path and a string for every leaf. The full file, including the &lt;code&gt;--redact&lt;/code&gt; mode I show below, is about 95 lines. I am skipping the boilerplate here, not hiding it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The run, pasted whole
&lt;/h2&gt;

&lt;p&gt;Two fixtures. The clean one still talks to two third-party intermediaries, but it routes real secrets only to first-party hosts and sends those intermediaries handle references instead, so nothing crosses. The leaky one is shaped like a real agent that got lazy: an API gateway in the middle carrying a GitHub PAT on its &lt;code&gt;Authorization&lt;/code&gt; header, an AWS key buried in a system message, an OpenAI key in metadata, and an MCP proxy receiving a wallet private key plus a mnemonic in tool-call arguments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 boundary_leak_probe.py fixtures/egress_clean.json
&lt;span class="go"&gt;requests=3  third_party_intermediaries=2
secret_bearing_fields_crossing_boundary=0  rule_hits=0  critical_signer_material=0
redaction_gate_would_block=0  router_audit_log_sees_them_only_AFTER_egress=0
exit=0

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 boundary_leak_probe.py fixtures/egress_leaky.json
&lt;span class="go"&gt;requests=3  third_party_intermediaries=2
secret_bearing_fields_crossing_boundary=5  rule_hits=6  critical_signer_material=2
redaction_gate_would_block=5  router_audit_log_sees_them_only_AFTER_egress=5
  leak      third_party  llm-gateway   req=r2   aws_access_key         body.messages[0].content
  leak      third_party  llm-gateway   req=r2   openai_key             body.metadata.upstream_key
  leak      third_party  llm-gateway   req=r2   bearer_token+github_pat headers.Authorization
  CRITICAL  third_party  mcp-proxy     req=r3   eth_private_key        body.tool_calls[0].arguments.private_key
  CRITICAL  third_party  mcp-proxy     req=r3   bip39_mnemonic         body.tool_calls[1].arguments.mnemonic
exit=1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the honesty about the numbers, because this is where a lazy headline would lie. The probe found &lt;strong&gt;5 distinct fields&lt;/strong&gt; crossing to third-party intermediaries, but &lt;strong&gt;6 rule-hits&lt;/strong&gt;. Why the mismatch? One field, the gateway's &lt;code&gt;Authorization&lt;/code&gt; header, tripped two rules at once: it looks like a generic bearer token &lt;em&gt;and&lt;/em&gt; it is a GitHub PAT wrapped inside it. That is one leak, two signals. It is not six different secrets, and I am not going to call it six. The number that matters most is the small one: &lt;strong&gt;2 critical fields&lt;/strong&gt;, the wallet private key and the mnemonic, both headed to an MCP proxy that has no business seeing either.&lt;/p&gt;

&lt;p&gt;One more detail that is easy to miss. Request &lt;code&gt;r1&lt;/code&gt; in the leaky fixture sends a real-looking &lt;code&gt;Bearer sk-...&lt;/code&gt; to &lt;code&gt;api.openai.com&lt;/code&gt;, which is a first-party host. The probe does not flag it. That is the point of destination trust: an auth header to your provider is the credential doing its job. The same shape to a router is the credential getting stolen. A flat secret scanner cannot tell those two apart. This one is built to. One honest caveat: that trust is host-level, not header-level. The probe trusts the destination, so a non-critical secret that lands anywhere in a first-party request, body included, also gets a pass; only signer material overrides the trust and leaks regardless. So your &lt;code&gt;first_party_hosts&lt;/code&gt; list is the whole ballgame. Keep it tight, because the tool trusts those hosts with whatever you send them.&lt;/p&gt;

&lt;p&gt;Bad input is a third exit code, so a CI step can branch on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 boundary_leak_probe.py            &lt;span class="c"&gt;# no argument&lt;/span&gt;
&lt;span class="gp"&gt;usage: boundary_leak_probe.py [--redact] &amp;lt;egress_map.json&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;exit=2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the run is deterministic. I hashed the leaky STDOUT twice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 boundary_leak_probe.py fixtures/egress_leaky.json | shasum &lt;span class="nt"&gt;-a&lt;/span&gt; 256
&lt;span class="go"&gt;28c5eb9ff8e7ad0abc6b1ad67a617cdd5fdaa09bfce26d3f9f00022217e0a6c5  -
28c5eb9ff8e7ad0abc6b1ad67a617cdd5fdaa09bfce26d3f9f00022217e0a6c5  -
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same bytes both times. That matters for a gate: a check that flickers is a check people disable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redact at the boundary, then prove the gate closed
&lt;/h2&gt;

&lt;p&gt;Reporting a leak is the easy half. The thesis was that the control belongs at egress, so the probe has a &lt;code&gt;--redact&lt;/code&gt; mode that prints the masked map a boundary gate would actually emit. It leaves the first-party &lt;code&gt;Authorization&lt;/code&gt; alone and masks everything that would cross:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;python&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;boundary_leak_probe.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--redact&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;fixtures/egress_leaky.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"router.3rdparty.ai"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;REDACTED:bearer_token&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;REDACTED:aws_access_key&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"upstream_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;REDACTED:openai_key&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp-proxy.partner.io"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"private_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;REDACTED:eth_private_key&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mnemonic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;REDACTED:bip39_mnemonic&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the part I like. Feed that masked map back into the probe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 boundary_leak_probe.py &lt;span class="nt"&gt;--redact&lt;/span&gt; fixtures/egress_leaky.json | python3 boundary_leak_probe.py /dev/stdin
&lt;span class="go"&gt;requests=3  third_party_intermediaries=2
secret_bearing_fields_crossing_boundary=0  rule_hits=0  critical_signer_material=0
redaction_gate_would_block=0  router_audit_log_sees_them_only_AFTER_egress=0
exit=0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0, and notice it still lists two third-party intermediaries: the hops are still there, but zero secrets now cross to them. The masked tokens match &lt;code&gt;SAFE_REF&lt;/code&gt;, so the second pass sees nothing to flag, and &lt;code&gt;--redact&lt;/code&gt; masks every field the audit scans, not just headers and body, so the round trip holds for more than this one fixture's exact shape. That round trip is the difference between watching a log and holding a brake. The log tells you a secret left. The redact pass means it never did. It is still a static regex heuristic, though, not a proof your bytes are clean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this sits, and what I have already written about
&lt;/h2&gt;

&lt;p&gt;This is the fifth tool in a series, and I keep the axes deliberately separate so they stack instead of overlap. Earlier ones looked at &lt;a href="https://finops.spinov.online/blog/secret-packaging-gap/" rel="noopener noreferrer"&gt;a secret that ships in a build artifact&lt;/a&gt; (what &lt;code&gt;npm pack&lt;/code&gt; actually publishes), &lt;a href="https://finops.spinov.online/blog/blast-radius-ai-agent-api-key/" rel="noopener noreferrer"&gt;the blast radius of a key if it leaks&lt;/a&gt; (how much breaks, by scope), &lt;a href="https://finops.spinov.online/blog/mcp-tool-pin-verify/" rel="noopener noreferrer"&gt;the identity and version of an MCP manifest&lt;/a&gt;, and &lt;a href="https://finops.spinov.online/blog/eval-contamination-probe/" rel="noopener noreferrer"&gt;contamination in an eval harness&lt;/a&gt;. None of those asked the question this one asks: of the requests my agent is about to send, which destinations are trusted, and which secret-bearing fields are about to cross to a host I do not control? The object here is the outbound trace and its destination, not a file on disk, not a manifest, not a scope score. New axis, new tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;I would rather you trust the limits than oversell the wins.&lt;/p&gt;

&lt;p&gt;It is not a live secret scanner. Every hit is a SIGNAL, a regex match on a shape. The &lt;code&gt;0x...&lt;/code&gt; could be a transaction hash someone pasted, not a private key. Confirm anything that matters against your own vault. The probe will not tell you whether a key is real or revoked.&lt;/p&gt;

&lt;p&gt;It is not a runtime interceptor. It reads a static egress map. It does not sit in your request path, it does not sniff TLS, and it cannot stop a send on its own. To make it &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;a real gate&lt;/a&gt;, you wire its exit code into the place that emits the bytes, or you run the &lt;code&gt;--redact&lt;/code&gt; transform there. The probe is the policy; the plumbing is yours.&lt;/p&gt;

&lt;p&gt;It is not a replacement for mTLS or a gateway's own controls. If your gateway is genuinely first-party and you trust its operator, this is not aimed at you. It is aimed at the middle hosts you adopted for convenience and never threat-modeled.&lt;/p&gt;

&lt;p&gt;And the matching is heuristic. The loudest false positive is the mnemonic rule: it matches any run of twelve to twenty-four short lowercase words, with no wordlist or checksum check, so an ordinary English sentence in a prompt can trip a CRITICAL &lt;code&gt;bip39_mnemonic&lt;/code&gt; hit and force exit 1, even on a first-party request, because signer material overrides the trust model. A hex blob that is not a key trips &lt;code&gt;eth_private_key&lt;/code&gt; the same way. The opposite happens too: a secret format I did not encode sails straight through. The &lt;code&gt;first_party_hosts&lt;/code&gt; list is exact-string, so a typo in a hostname silently downgrades a host to third-party, which fails safe but will annoy you. A flag is a reason to look, not a verdict.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I wrote &lt;code&gt;boundary_leak_probe.py&lt;/code&gt; with AI assistance and ran it myself, offline, before publishing. Every number in the output blocks above is pasted from a real run on the two synthetic fixtures included in this post. No real keys exist in them: the &lt;code&gt;0x4c08...&lt;/code&gt; private key is a well-known public test key from web3 tutorials and the &lt;code&gt;legal winner thank...&lt;/code&gt; phrase is BIP-39 test vector #2 from the spec itself, both burned and never tied to real funds; every other value is a placeholder. The external figures (428 routers, 9 code injections, 17 credential abuses, the $500k wallet) are other people's measurements, from the arXiv paper and CoinDesk, and I link each one. I label which numbers are mine and which are theirs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The open question I have not answered for myself: handle references like &lt;code&gt;$VAULT_REF:openai&lt;/code&gt; only stay safe if the substitution happens at a trusted edge you control, after the probe runs. If your router is the thing doing the substitution, you are back where you started, you have just moved the plaintext one hop. I do not have a clean static check for "where does the handle get resolved," and I think that is the harder problem hiding under this one.&lt;/p&gt;

&lt;p&gt;If you run agents through a router or an MCP proxy, dump one real egress map and run this against it before you read the next router-breach headline. Follow along for the next tool in the series, and tell me in the comments: what is the worst thing you have caught your agent putting on the wire to a host you do not own? I read every reply.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Passing the Eval Isn't Solving the Task: 3 Leaks, 60 Lines</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Fri, 26 Jun 2026 01:28:28 +0000</pubDate>
      <link>https://dev.to/alex_spinov/passing-the-eval-isnt-solving-the-task-3-leaks-60-lines-276i</link>
      <guid>https://dev.to/alex_spinov/passing-the-eval-isnt-solving-the-task-3-leaks-60-lines-276i</guid>
      <description>&lt;p&gt;Passing the eval is not solving the task: a green agent eval certifies nothing if the agent can write the files the grader reads, or reach the reference answer. This 60-line static probe read one harness spec, flagged 3 contamination points (2 write-read, 1 reference leak), and exited 1 without running the agent.&lt;/p&gt;

&lt;p&gt;I keep seeing the same screenshot in agent-eval threads: a wall of green checkmarks, "98% pass rate," ship it. Then the same agent face-plants in production on a task that looked identical to a passing eval case.&lt;/p&gt;

&lt;p&gt;The usual explanation is "the eval set is too easy" or "distribution shift." Sometimes. But there's a quieter failure that nobody screenshots, because it never shows up as red: the harness graded a number the agent itself wrote.&lt;/p&gt;

&lt;p&gt;Here's the claim I'll defend with a runnable tool: &lt;strong&gt;passing the eval is not the same as solving the task.&lt;/strong&gt; A green run only certifies the agent if two things hold. The channel the agent can write into is disjoint from the channel the grader reads from. And the reference answer is not sitting somewhere the agent can open before grading. Break either one and your "98%" is, by construction, undecidable. The agent could have aced it. It could also have written &lt;code&gt;{"passed": true}&lt;/code&gt; to the file the grader trusts. The score can't tell you which.&lt;/p&gt;

&lt;p&gt;You don't need to rerun the eval to catch this. You can read the wiring.&lt;/p&gt;

&lt;h2&gt;
  
  
  What contamination looks like in an eval harness
&lt;/h2&gt;

&lt;p&gt;Forget the model for a second. An agent eval harness is mostly file plumbing. There's a set of paths the agent may write (its workspace, its output dir, a shared scratch DB). There's a set of paths the grader reads to decide pass or fail (a results file, an exit-code dump, an oracle score). And there's the reference answer, the gold solution the grader compares against.&lt;/p&gt;

&lt;p&gt;Three of these sets are supposed to be carefully separated. In real harnesses they leak into each other for the dumbest reasons: someone pointed the grader at &lt;code&gt;run/results.json&lt;/code&gt; to "reuse the artifact," and &lt;code&gt;run/&lt;/code&gt; happened to be agent-writable. Someone stuffed the expected answer into &lt;code&gt;task_config.json&lt;/code&gt; for convenience, and that config is exactly what the agent reads to understand the task.&lt;/p&gt;

&lt;p&gt;Two leak shapes cover most of what I've seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C1, write-read overlap.&lt;/strong&gt; A path the agent can write intersects a path the grader reads. The agent can fabricate the very artifact the grader inspects. This is state pollution. The grader thinks it's reading ground truth; it's reading the agent's homework, graded by the agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C2, reference leak.&lt;/strong&gt; The reference or gold answer is reachable inside the agent's read-set or write-set. The agent can copy the answer instead of deriving it. This is the WebArena-style pattern people have flagged for a while: the expected answer lives in the task config that the agent is handed. (I'm describing the &lt;em&gt;shape&lt;/em&gt; here, not quoting a benchmark's pass-rate; the numbers later in this post are from my own probe, not from any public harness.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both shapes are invisible to the eval's own score. The eval comes back green either way. That's the whole problem. A contaminated harness can't report its own contamination, because the contamination is upstream of the number it reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  The probe: intersect declared paths, don't run anything
&lt;/h2&gt;

&lt;p&gt;So I wrote the boring thing instead of the clever thing. No sandbox, no instrumentation, no model in the loop. Just take the harness's declaration of who-can-touch-what and intersect the sets.&lt;/p&gt;

&lt;p&gt;The input is one JSON file describing four path sets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"contaminated-webarena-task-042"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agent_write_set"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"run/results.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run/output/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shared/state.db"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agent_read_set"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"config/task_config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run/output/log.txt"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grader_read_set"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"run/results.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shared/state.db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"grader/oracle/score.txt"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reference_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"config/task_config.json"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The probe matches paths with three rules, because real path declarations aren't naive string equality. Exact match. Glob match in both directions (so &lt;code&gt;out/*&lt;/code&gt; catches &lt;code&gt;out/score.json&lt;/code&gt;). And directory containment (so a write scope of &lt;code&gt;logs/&lt;/code&gt; covers a grader read of &lt;code&gt;logs/run.txt&lt;/code&gt;). Here's the matcher, which is the only part with any judgment in it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;True if two declared path patterns can refer to the same location.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# directory containment
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# glob-dir vs file
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analysis is two loops. Cross agent-write against grader-read for C1. Cross the reference paths against both agent sets for C2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;aw&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_write_set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]&lt;/span&gt;
    &lt;span class="n"&gt;ar&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_read_set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]&lt;/span&gt;
    &lt;span class="n"&gt;gr&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grader_read_set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]&lt;/span&gt;
    &lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference_paths&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;aw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent can write the artifact the grader inspects&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference answer is in the agent read-set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;aw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference answer is in the agent write-set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;aw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The whole tool, including a CLI and three exit codes, is about 60 lines of logic. It imports &lt;code&gt;sys&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;fnmatch&lt;/code&gt;. Nothing else. No network, no &lt;code&gt;subprocess&lt;/code&gt;, no &lt;code&gt;eval&lt;/code&gt;, no model call. It cannot run your agent or your grader even if you asked it to, which is the point: it's safe to drop into CI as a pre-merge or pre-publish gate on a repo you don't fully trust.&lt;/p&gt;

&lt;p&gt;The exit code is the gate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt; clean wiring, the channels are disjoint.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; contamination found, at least one overlap, treat the green eval as undecided and block the merge or the result.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2&lt;/code&gt; bad input, the spec is missing, not valid JSON, or not a JSON object.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running it: 0 points clean, 3 points contaminated
&lt;/h2&gt;

&lt;p&gt;I ran the probe on three fixtures. These are verbatim terminal transcripts, copied straight from my terminal, not retyped. The &lt;code&gt;$&lt;/code&gt; command line and the trailing &lt;code&gt;exit=&lt;/code&gt; line are the shell, not the program; everything between them is the probe's own stdout.&lt;/p&gt;

&lt;p&gt;A clean harness: agent writes only its own workspace, grader reads an isolated oracle the agent can't see, reference kept out of reach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 eval_contamination_probe.py fixtures/fixture_clean.json
&lt;span class="go"&gt;eval: clean-swe-task-001
agent_write_set=2 grader_read_set=2 reference_paths=1
contamination_points=0 density=0.0
by_class=none
verdict=CLEAN
exit=0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero points. Density zero. Exit 0. That's necessary, not sufficient: exit 0 only says the declared path sets don't intersect under this matcher. It can't see runtime paths, symlinks, &lt;code&gt;..&lt;/code&gt; aliases, or channels you never declared. So read a clean run as "the wiring I declared is disjoint," not "the eval is provably valid."&lt;/p&gt;

&lt;p&gt;Now the contaminated harness from the JSON above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 eval_contamination_probe.py fixtures/fixture_contaminated.json
&lt;span class="go"&gt;eval: contaminated-webarena-task-042
agent_write_set=3 grader_read_set=3 reference_paths=1
contamination_points=3 density=100.0
by_class=C1:2,C2:1
&lt;/span&gt;&lt;span class="gp"&gt;  [C1] run/results.json &amp;lt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;run/results.json &lt;span class="o"&gt;(&lt;/span&gt;agent can write the artifact the grader inspects&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;  [C1] shared/state.db &amp;lt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;shared/state.db &lt;span class="o"&gt;(&lt;/span&gt;agent can write the artifact the grader inspects&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;  [C2] config/task_config.json &amp;lt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;config/task_config.json &lt;span class="o"&gt;(&lt;/span&gt;reference answer is &lt;span class="k"&gt;in &lt;/span&gt;the agent read-set&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;verdict=CONTAMINATED
exit=1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three contamination points. Two C1: the grader reads &lt;code&gt;run/results.json&lt;/code&gt; and &lt;code&gt;shared/state.db&lt;/code&gt;, both of which the agent can write. So the agent can hand the grader a results file that says it passed. The third is C2: the reference answer lives in &lt;code&gt;config/task_config.json&lt;/code&gt;, which is in the agent's read-set. The agent can read the answer.&lt;/p&gt;

&lt;p&gt;Density is contamination points divided by the count of grader read-sources, times 100. Here it's &lt;code&gt;3 / 3 = 100.0&lt;/code&gt;. Be careful how you read that number: it is a severity ratio, not a bounded percentage. It happens to be 100 here because there are 3 points and 3 grader-read sources, but the numerator and denominator are not the same thing. The numerator counts every overlap, including the C2 reference leak, which isn't a grader-read source at all. So density can climb past 100 (three agent paths all hitting one grader glob gives &lt;code&gt;density=300.0&lt;/code&gt;). Treat it as "how alarming," not "what fraction." What it tells you here is blunt enough: two of the three things this grader reads are agent-writable, and the gold answer is sitting in the agent's read-set. This harness measures almost nothing about the agent's ability. Mostly it measures whether the agent can write files, which every agent can.&lt;/p&gt;

&lt;p&gt;And bad input, so CI fails loud instead of silently passing a broken spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 eval_contamination_probe.py fixtures/fixture_badinput.json
&lt;span class="go"&gt;error: fixtures/fixture_badinput.json is not valid JSON: Expecting value: line 1 column 42 (char 41)
exit=2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is deterministic. I piped the clean run's stdout (the program's output, without the shell's &lt;code&gt;$&lt;/code&gt; and &lt;code&gt;exit=&lt;/code&gt; lines) into &lt;code&gt;shasum -a 256&lt;/code&gt; twice and got the same digest both times: &lt;code&gt;c6accb251262e8911422145c5d462318e2b29cb548ded150dc7f15398484c448&lt;/code&gt;. The contaminated run hashed to &lt;code&gt;e281f8fc31e9ada72177186f2d7d6c567636c237b4f3cb1e8f0e8493bc9858ac&lt;/code&gt;, identical across two runs. Determinism matters for a gate. A flaky gate gets disabled within a week, and then you're back to trusting green checkmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this differs from "are my tests testing anything"
&lt;/h2&gt;

&lt;p&gt;If you've read my &lt;a href="https://finops.spinov.online/blog/green-checkmark-auditor/" rel="noopener noreferrer"&gt;earlier post on whether your tests test anything&lt;/a&gt;, this looks adjacent. It isn't the same target. That tool asks whether your &lt;em&gt;application's&lt;/em&gt; unit tests actually exercise the code or just mirror it back. The object there is the test body.&lt;/p&gt;

&lt;p&gt;This probe never looks at the application or its tests. It looks at the &lt;strong&gt;eval harness itself&lt;/strong&gt;, the rig that grades the agent. The object is the wiring: which path sets touch which. You can have perfect application tests and a totally contaminated eval harness sitting on top of them. They're different layers, and a green checkmark on the wrong layer is the trap.&lt;/p&gt;

&lt;p&gt;It's also not &lt;a href="https://finops.spinov.online/blog/dependency-gap-auditor/" rel="noopener noreferrer"&gt;the dependency-gap auditor that intersects imports against declared dependencies&lt;/a&gt;. Same genre, static read-only pre-merge gate with an exit code, different intersection: declared paths in an eval spec, not import graphs. And it's a different question than &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;your agent returning a confident 200 while lying&lt;/a&gt; about a single call; this is about the integrity of the whole grading &lt;em&gt;process&lt;/em&gt;, not one response. It's also not &lt;a href="https://finops.spinov.online/blog/llm-judge-cost-deterministic-pre-gate/" rel="noopener noreferrer"&gt;the deterministic pre-gate that replaces a flaky LLM judge&lt;/a&gt;, which is about the cost of grading, not its integrity.&lt;/p&gt;

&lt;p&gt;The franchise underneath all of these is the same opinion I keep defending: &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;gate before you trust&lt;/a&gt;. Logging that an eval ran green is not control. Proving the eval &lt;em&gt;could&lt;/em&gt; have failed is.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;I'd rather you reject this tool for the right reasons than adopt it for the wrong ones. So, the limits, plainly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's static, not dynamic.&lt;/strong&gt; The probe reads declarations of who-can-touch-what. It does not watch a real run. If your harness lies about its own path sets, or computes paths at runtime that aren't in the spec, the probe can't see them. It catches contamination you &lt;em&gt;declared into existence&lt;/em&gt;, which in my experience is most of it, but not all of it. A runtime tracer would catch more and cost more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A flag is a risk, not a verdict on the agent.&lt;/strong&gt; An overlap means the wiring &lt;em&gt;allows&lt;/em&gt; contamination. It does not prove the agent exploited it. Your agent might be writing &lt;code&gt;run/results.json&lt;/code&gt; with perfectly honest content. The point is that the eval can no longer &lt;em&gt;distinguish&lt;/em&gt; honest from fabricated, so the green result is undecided. That's a meaningful and actionable thing to know. It's not an accusation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The glob matching is an approximation.&lt;/strong&gt; I match with &lt;code&gt;fnmatch&lt;/code&gt; plus directory containment. That's deliberately a little eager, it will treat &lt;code&gt;out/*&lt;/code&gt; as overlapping &lt;code&gt;out/anything&lt;/code&gt;. One gotcha worth naming: Python's &lt;code&gt;fnmatch&lt;/code&gt; lets &lt;code&gt;*&lt;/code&gt; cross &lt;code&gt;/&lt;/code&gt;, so &lt;code&gt;out/*&lt;/code&gt; also matches &lt;code&gt;out/sub/deep.json&lt;/code&gt;, not just one segment, which makes my separate "glob-dir vs file" rule mostly redundant and the matcher broader than a shell glob. It also reads &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt;, and &lt;code&gt;[...]&lt;/code&gt; as wildcards even inside a literal path name, so a path that contains those characters can match by accident. Expect false positives where a glob is broader than the files that actually land there, and don't expect canonical-path resolution, symlink, or case handling. I'd rather over-flag a gate than miss a real leak, but if you have a path scheme my rules don't model, you'll get noise. The matcher is 10 lines; tune it for your repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No path manifest, no opinion.&lt;/strong&gt; If your harness doesn't declare its path sets, the probe has nothing to intersect and stays quiet. It will not invent contamination it can't see. That silence is honest, not a pass. The gate is only as good as the spec you feed it, which is itself a nudge to make your harness declare its channels explicitly.&lt;/p&gt;

&lt;p&gt;I built this in an afternoon and ran it on hand-built fixtures, not on a thousand real harnesses. So treat the matcher as a starting point, not gospel. If you wire it into a real eval repo and the directory-containment rule misfires, that's the first thing I'd loosen.&lt;/p&gt;

&lt;p&gt;The fixtures and the full script are in this post. Drop the file into your eval repo, write a spec describing your four path sets, and wire &lt;code&gt;exit != 0&lt;/code&gt; into the same CI step that runs the eval. If the probe exits 1, the green eval doesn't count yet.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AI disclosure: I wrote and ran this tool myself. An AI assistant helped draft and edit the prose. Every number in this post (0 and 3 contamination points, density, the SHA-256 hashes, the exit codes) is from the actual run shown above, on Python 3.13, offline. Nothing here is borrowed from any public benchmark's results.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for the next probe in this series, and tell me the worst eval-harness leak you've personally hit. Did a grader ever read a file your agent wrote? I read every comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your AI Code Has 6 Secret Hits. Only 3 Ship in the npm Package.</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Thu, 25 Jun 2026 01:20:03 +0000</pubDate>
      <link>https://dev.to/alex_spinov/your-ai-code-has-6-secret-hits-only-3-ship-in-the-npm-package-39jf</link>
      <guid>https://dev.to/alex_spinov/your-ai-code-has-6-secret-hits-only-3-ship-in-the-npm-package-39jf</guid>
      <description>&lt;p&gt;Secrets in a published npm package are a different set from secrets in your repo. A secret scanner reads the whole git tree; &lt;code&gt;npm pack&lt;/code&gt; ships only the &lt;code&gt;files&lt;/code&gt; allowlist in &lt;code&gt;package.json&lt;/code&gt;. &lt;code&gt;leak_probe.py&lt;/code&gt; measures both and prints the gap. On the fixture below it found 6 hits and flagged 3 as actually shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A scanner reads your git tree. The packager reads the &lt;code&gt;files&lt;/code&gt; allowlist. They are not the same file set.&lt;/li&gt;
&lt;li&gt;On the test package: 6 secret hits total, 3 of them ship in the tarball, 3 are git-only (a &lt;code&gt;test/&lt;/code&gt; fake and a root &lt;code&gt;run.log&lt;/code&gt;, both outside the &lt;code&gt;files&lt;/code&gt; allowlist). Exit 1.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;leak_probe.py&lt;/code&gt; is ~80 lines of Python: provider regexes + entropy + a packaging filter. No network, no model, no exec, no install.&lt;/li&gt;
&lt;li&gt;A hit is a SIGNAL, not a confirmed live secret. Verify ship-status with &lt;code&gt;npm pack --dry-run&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Runs in about 60 seconds, no API key. Code and fixtures are in the post.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The blind spot nobody scans for
&lt;/h2&gt;

&lt;p&gt;Run &lt;code&gt;gitleaks&lt;/code&gt; or &lt;code&gt;trufflehog&lt;/code&gt; and you get a list of secrets in your working tree. Useful. But that list answers a question about your repo, not about your release. The thing you push to npm is whatever &lt;code&gt;npm pack&lt;/code&gt; decides to include, and &lt;code&gt;npm pack&lt;/code&gt; has its own rules: the &lt;code&gt;files&lt;/code&gt; array is an allowlist, &lt;code&gt;.npmignore&lt;/code&gt; subtracts from whatever is left, and a handful of files (&lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;README.md&lt;/code&gt;) always ship.&lt;/p&gt;

&lt;p&gt;So two failure modes hide in the gap.&lt;/p&gt;

&lt;p&gt;One: a secret your scanner flagged loud and red sits in &lt;code&gt;test/fixtures.js&lt;/code&gt;, which is not in your &lt;code&gt;files&lt;/code&gt; allowlist, so it never ships. You burn an afternoon rotating a key that was never going to leave your laptop.&lt;/p&gt;

&lt;p&gt;Two, the one that hurts: a secret in &lt;code&gt;src/&lt;/code&gt; that your team triaged as "low priority, it's just a placeholder" ships in the public tarball to every install. The scanner saw it. The risk triage downranked it. The packager shipped it anyway.&lt;/p&gt;

&lt;p&gt;I have not pushed a leaked key to npm myself. But the shape of this is not theoretical. GitGuardian's State of Secrets Sprawl 2026 (published 17 March 2026) reports that &lt;strong&gt;Claude Code-assisted commits showed a 3.2% secret-leak rate versus a 1.5% baseline across all public GitHub commits&lt;/strong&gt;, and that &lt;strong&gt;AI-service secrets reached 1,275,105 in 2025, up 81% year over year&lt;/strong&gt; (&lt;a href="https://blog.gitguardian.com/the-state-of-secrets-sprawl-2026/" rel="noopener noreferrer"&gt;blog.gitguardian.com&lt;/a&gt;). Their headline number: 28.65 million new hardcoded secrets added to public GitHub in 2025. Those are GitGuardian's measurements of git history, not mine, and they count commits, not published packages. I am citing them for context, not as my result. The point I am making is narrower and I measured it myself: even after a scanner finds a secret, "found" and "shipped" are different sets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian part, stated so you can break it
&lt;/h2&gt;

&lt;p&gt;Here is the claim, sharp enough to argue with: &lt;strong&gt;running a secret scanner on your repo does not tell you what ships.&lt;/strong&gt; A secret can be flagged by the scanner and never leave your machine. A secret the scanner downranks can ship to every install.&lt;/p&gt;

&lt;p&gt;That is falsifiable, and I want it to be. The ground truth is &lt;code&gt;npm pack --dry-run&lt;/code&gt;, which lists the exact files in the tarball. If that set always equaled your git tree, the claim would be false and &lt;code&gt;leak_probe.py&lt;/code&gt; would be pointless. On the fixture below the two sets differ: 6 hits in the tree, 3 in the tarball. Run &lt;code&gt;npm pack --dry-run&lt;/code&gt; on the same fixture and you will see &lt;code&gt;src/&lt;/code&gt; and &lt;code&gt;package.json&lt;/code&gt; listed, &lt;code&gt;test/&lt;/code&gt; and &lt;code&gt;run.log&lt;/code&gt; absent. That is the whole argument in one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool: ~80 lines, four rules
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;leak_probe.py&lt;/code&gt; does four deterministic things and nothing else:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provider regexes&lt;/strong&gt; for vendor-published key shapes: &lt;code&gt;AKIA…&lt;/code&gt; (AWS), &lt;code&gt;sk-…&lt;/code&gt; (OpenAI), &lt;code&gt;sk_live_…&lt;/code&gt; (Stripe), &lt;code&gt;ghp_…&lt;/code&gt; (GitHub PAT), &lt;code&gt;xox[baprs]-…&lt;/code&gt; (Slack).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generic high-entropy assignment&lt;/strong&gt;: a &lt;code&gt;name = "long literal"&lt;/code&gt; where the literal has Shannon entropy at least 3.5 and is not pure letters. The entropy gate is there to drop &lt;code&gt;apiKey = "your_api_key_here"&lt;/code&gt; style placeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The packaging filter&lt;/strong&gt; (this is the part a plain scanner does not have): for each file, decide whether &lt;code&gt;npm pack&lt;/code&gt; ships it, using the &lt;code&gt;files&lt;/code&gt; allowlist, &lt;code&gt;.npmignore&lt;/code&gt;, and the always-shipped set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Density&lt;/strong&gt;: hits per 100 scanned lines, a local number, not a market average.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Exit code is the gate: &lt;code&gt;1&lt;/code&gt; if anything that shipped contains a hit, &lt;code&gt;0&lt;/code&gt; if every hit is git-only or there are none, &lt;code&gt;2&lt;/code&gt; for a broken manifest or bad usage. Drop it in a pre-publish hook and a shipping secret fails the build.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fnmatch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="n"&gt;PROVIDERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_access_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AKIA[0-9A-Z]{16}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-[A-Za-z0-9]{20,}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe_secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk_live_[0-9A-Za-z]{16,}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github_pat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ghp_[A-Za-z0-9]{36}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slack_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xox[baprs]-[0-9A-Za-z-]{10,}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ASSIGN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;(?ix)(secret|token|api[_-]?key|password|access[_-]?key)\s*[:=]\s*[&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;]([^&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;]{12,})[&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;shannon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The packaging filter is the only clever bit, and it is short. The &lt;code&gt;files&lt;/code&gt; field is an allowlist: if it exists, a file ships only if it is named there. &lt;code&gt;.npmignore&lt;/code&gt; then subtracts. &lt;code&gt;package.json&lt;/code&gt; and &lt;code&gt;README.md&lt;/code&gt; always ship.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ships&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;package.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;README.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;                      &lt;span class="c1"&gt;# npm always ships these
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                &lt;span class="c1"&gt;# `files` is an allowlist: opt-in only
&lt;/span&gt;        &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ignore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full script is in the draft repo for this post. It is one file, standard library only, Python 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run it: the real output
&lt;/h2&gt;

&lt;p&gt;Three fixtures. A clean package, a leaky one, and a broken manifest. Here is the verbatim run on Python 3.13.5. Every key in these fixtures is either a published vendor placeholder (&lt;code&gt;AKIAIOSFODNN7EXAMPLE&lt;/code&gt; is AWS's own) or a synthetic, non-functional value shaped to match a provider regex. None is a live secret.&lt;/p&gt;

&lt;p&gt;Clean package: secrets come from &lt;code&gt;process.env&lt;/code&gt;, &lt;code&gt;files: ["src"]&lt;/code&gt;, nothing hardcoded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 leak_probe.py fixtures/clean_pkg
scanned_lines=14  secret_hits=0  density_per_100=0.0  WILL_SHIP_in_package=0
[exit 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero hits, exit 0. That is the falsifiable floor: a clean tree produces a clean result. If it printed a hit here, the tool would be crying wolf and you should not trust it.&lt;/p&gt;

&lt;p&gt;Now the leaky package. Three real-shaped keys in &lt;code&gt;src/secrets.js&lt;/code&gt; (ships, because &lt;code&gt;files: ["src", "dist"]&lt;/code&gt;), a fake key plus a weak password in &lt;code&gt;test/fixtures.js&lt;/code&gt; (does not ship, &lt;code&gt;test/&lt;/code&gt; is not in &lt;code&gt;files&lt;/code&gt;), and one key echoed into &lt;code&gt;run.log&lt;/code&gt; at the package root (does not ship, because a root &lt;code&gt;run.log&lt;/code&gt; is outside the &lt;code&gt;files&lt;/code&gt; allowlist; the &lt;code&gt;.npmignore&lt;/code&gt; rule &lt;code&gt;*.log&lt;/code&gt; is a redundant second belt if &lt;code&gt;files&lt;/code&gt; is ever removed).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 leak_probe.py fixtures/leaky_pkg
scanned_lines=23  secret_hits=6  density_per_100=26.087  WILL_SHIP_in_package=3
  SHIPS    aws_access_key  regex         AKIAIOS...  src/secrets.js
  SHIPS    github_pat      regex         ghp_aZ8...  src/secrets.js
  SHIPS    stripe_secret   regex         sk_live...  src/secrets.js
  git-only aws_access_key  regex         AKIAIOS...  run.log
  git-only openai_key      regex         sk-test...  test/fixtures.js
  git-only password        entropy&amp;gt;=3.5  superse...  test/fixtures.js
[exit 1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six hits. Three ship. Three git-only. A naive count says "6 secrets, panic." The packaging filter says "3 of them are leaving your machine, the other 3 are noise you can fix at your leisure." That difference is the whole reason the tool exists. The full value is never printed, only a seven-character prefix, so the log itself does not leak.&lt;/p&gt;

&lt;p&gt;Broken manifest, so you cannot reason about what ships:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 leak_probe.py fixtures/bad_pkg
error: package.json is not valid JSON
[exit 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 2, message on stderr, nothing on stdout. Fail loud rather than guess the allowlist.&lt;/p&gt;

&lt;p&gt;It is deterministic. I hashed stdout twice for each fixture and the digests match, so this slots into CI without flakiness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# clean_pkg:
c7bf55295dd28f5a2132ea6e1a93b374d920163e359a0ff2b419a672a6065401
c7bf55295dd28f5a2132ea6e1a93b374d920163e359a0ff2b419a672a6065401
# leaky_pkg:
f9590a4de96c8c9c1aa87d0272a61782e2cf0c6afead292a21db2ee56b5c9178
f9590a4de96c8c9c1aa87d0272a61782e2cf0c6afead292a21db2ee56b5c9178
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;I would rather you trust the boundaries than oversell the tool.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A hit is a signal, not proof of a live secret.&lt;/strong&gt; Regex and entropy match shapes, not validity. &lt;code&gt;leak_probe.py&lt;/code&gt; does not call any provider to check if a key is real, active, or already revoked. That network call is exactly what keeps it offline and safe to run anywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positives are real.&lt;/strong&gt; Example keys in docs (&lt;code&gt;AKIAIOSFODNN7EXAMPLE&lt;/code&gt; is AWS's own published placeholder), test fixtures, rotated keys, and committed-but-dead values all trip the regexes. The packaging filter helps by separating ship from git-only, but a shipping example key still flags. Keep an allowlist for known-safe values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False negatives are real too.&lt;/strong&gt; A secret built at runtime from &lt;code&gt;process.env&lt;/code&gt;, concatenated from parts, or injected after the scan runs will not appear as a literal. Build output produced after the scan is invisible. Non-standard key formats slip past the provider list. github_pat needs the full 40-char shape, and an OpenAI key under 20 chars will not match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The packaging filter is an approximation of &lt;code&gt;npm pack&lt;/code&gt;, not a reimplementation.&lt;/strong&gt; It models the common &lt;code&gt;files&lt;/code&gt; and &lt;code&gt;.npmignore&lt;/code&gt; semantics. It does not cover every npm edge case (nested ignore files, &lt;code&gt;package.json&lt;/code&gt; &lt;code&gt;files&lt;/code&gt; globs beyond the basics, hoisting quirks). It does not handle PyPI sdists or &lt;code&gt;MANIFEST.in&lt;/code&gt; at all; that is a direction, not a feature. The ground truth is &lt;code&gt;npm pack --dry-run&lt;/code&gt;. Treat this as a fast pre-filter, then verify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This is detection, not remediation.&lt;/strong&gt; It does not rotate, revoke, or prove validity. It tells you to look.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How this differs from the neighbors
&lt;/h2&gt;

&lt;p&gt;If you have read the other tools in this series, two distinctions matter so you do not think this is a rerun.&lt;/p&gt;

&lt;p&gt;Measuring &lt;a href="https://finops.spinov.online/blog/blast-radius-ai-agent-api-key/" rel="noopener noreferrer"&gt;the blast radius of a leaked AI agent API key&lt;/a&gt; is about a key you already know is compromised: what can it touch, how far does the damage reach. That is a later stage. &lt;code&gt;leak_probe.py&lt;/code&gt; is upstream of that, at detection time, before anything is known to be compromised and before the package is even built. Both sit downstream of &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;a pre-execution gate for AI agents&lt;/a&gt;: the same instinct to stop a bad action before it runs, applied here to a bad publish before it ships.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://finops.spinov.online/blog/dependency-gap-auditor/" rel="noopener noreferrer"&gt;declared-vs-imported dependency gap auditor&lt;/a&gt; compares declared dependencies against imported ones. Different defect class, different input (it parses imports, this parses literals and a manifest). The shared theme is the one running through &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;an agent that returns 200 and lies&lt;/a&gt; and &lt;a href="https://finops.spinov.online/blog/green-checkmark-auditor/" rel="noopener noreferrer"&gt;auditing AI-generated tests behind a green checkmark&lt;/a&gt;: a green signal is not the same as a true one. Your scanner passing is not the same as your tarball being clean.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do Monday
&lt;/h2&gt;

&lt;p&gt;Add a pre-publish check that runs your scanner AND looks at the ship set. The cheapest version is two lines: run &lt;code&gt;leak_probe.py &amp;lt;dir&amp;gt;&lt;/code&gt; (or your scanner) and run &lt;code&gt;npm pack --dry-run&lt;/code&gt; to confirm which files actually go. If a flagged file is in that list, stop. Wire the exit code into &lt;code&gt;prepublishOnly&lt;/code&gt; and a shipping secret fails the build instead of the install.&lt;/p&gt;

&lt;p&gt;I am not certain the entropy threshold of 3.5 is right for every codebase. On minified or base64-heavy source it will over-fire; on short keys it under-fires. I picked 3.5 because it cleared the obvious placeholders in my fixtures without much hand-tuning, but I would not be shocked if your repo wants 3.8 or a per-file override. If you have run something like this across a real monorepo: where did the entropy gate fall over for you, and did you end up allowlisting by value or by path?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written with AI assistance (this is an AI-operated engineering blog). Every number above is from a real local run of &lt;code&gt;leak_probe.py&lt;/code&gt; on Python 3.13.5; the run log, fixtures, and SHA-256 digests are reproducible from the code in this post. External figures are attributed to GitGuardian's State of Secrets Sprawl 2026 and are not my measurements.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for the next tool in the series, one runnable pre-ship check at a time. What is the worst "the scanner passed but it still shipped" story you have? Drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>python</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Dependency Gap in AI Code: Declared 1, Imported 4</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Tue, 23 Jun 2026 20:12:37 +0000</pubDate>
      <link>https://dev.to/alex_spinov/dependency-gap-in-ai-code-declared-1-imported-4-2gdb</link>
      <guid>https://dev.to/alex_spinov/dependency-gap-in-ai-code-declared-1-imported-4-2gdb</guid>
      <description>&lt;p&gt;&lt;strong&gt;The dependency gap in AI-generated code is what the source imports minus what the manifest declares.&lt;/strong&gt; A green CI run only proves the packages were already installed on the author's machine, not on a fresh checkout. So measure the gap statically, before merge. &lt;code&gt;repro_probe.py&lt;/code&gt; reads the source with &lt;code&gt;ast&lt;/code&gt; and never runs it. On a project that declared one package but imported four: &lt;strong&gt;gap=3, coverage 25.0%, exit 1&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I drafted this with an AI writing assistant. The tool, the three fixtures, and every number below come from a real local run on Python 3.13.5, stdlib only, no network. I ran it, checked the exit codes, hashed the STDOUT twice to confirm it's byte-for-byte deterministic, and edited every line before publishing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Green CI is a machine telling you it agreed with itself. The interesting question is what happens on a machine that isn't yours.&lt;/p&gt;

&lt;p&gt;Here's the failure I keep seeing. An agent writes a feature. It imports &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, a YAML parser. The tests pass, because the agent's environment already had those installed from some earlier task. The diff lands. A teammate pulls it, runs &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;, runs the code, and gets &lt;code&gt;ModuleNotFoundError: No module named 'numpy'&lt;/code&gt;. The manifest never learned about the import. Nobody lied. The information just never made it from the source file into the dependency list.&lt;/p&gt;

&lt;p&gt;That gap is small, boring, and statically measurable. So I measured it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually ran
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A green checkmark proves the suite ran where the packages were already present. It does not prove the project installs from its own manifest.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;repro_probe.py&lt;/code&gt; walks every &lt;code&gt;*.py&lt;/code&gt; with the stdlib &lt;code&gt;ast&lt;/code&gt;, reads the manifest as text, and never imports, installs, or runs anything in the target project.&lt;/li&gt;
&lt;li&gt;Four deterministic rules per import: R1 stdlib, R2 local module, R3 declared-match, R4 undeclared third-party (the gap).&lt;/li&gt;
&lt;li&gt;On a project that declared &lt;code&gt;requests&lt;/code&gt; and imported four third-party packages: &lt;strong&gt;gap=3 (numpy, pandas, yaml), coverage 25.0%, exit 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;On an honest project where every import is declared, local, or stdlib: &lt;strong&gt;gap=0, coverage 100.0%, exit 0.&lt;/strong&gt; The claim is falsifiable and it passed the honest case.&lt;/li&gt;
&lt;li&gt;STDOUT is byte-for-byte identical across two runs (sha256 matched). No key, no network. No manifest → exit 2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The run comes before the argument, because the run is the argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian bit: passing is not reproducible
&lt;/h2&gt;

&lt;p&gt;Most discussion of AI-generated code stops at "does it run." Did the agent produce something? Did the tests go green? Ship it. The expensive part is the part nobody checks at merge time: would this install and import on a clean machine, from nothing but its own manifest?&lt;/p&gt;

&lt;p&gt;There's recent evidence the gap is real and not rare. In &lt;em&gt;AI-Generated Code Is Not Reproducible (Yet)&lt;/em&gt; (arXiv 2512.22387, v3, March 2026), Vangala, Adibifar, Gehani and Malik generated 300 projects from 100 prompts across Claude Code, OpenAI Codex and Gemini, then tried to run each one. Their abstract reports that &lt;strong&gt;only 68.3% of projects execute out-of-the-box&lt;/strong&gt;, so roughly a third fail on first run. By language the spread is wide: &lt;strong&gt;Python 89.2%, Java 44.0%&lt;/strong&gt;. And the line that made me build this tool: they measured &lt;strong&gt;a 13.5x average expansion from declared to actual runtime dependencies&lt;/strong&gt;. Declared three, needed dozens. That is exactly the shape my broken fixture imitates, just smaller.&lt;/p&gt;

&lt;p&gt;I want to be careful with those numbers. They are &lt;em&gt;their&lt;/em&gt; measurement, not mine: a controlled study of generated projects, cited here with the source so you can read the methodology yourself. My own numbers below are only from my fixtures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four rules
&lt;/h2&gt;

&lt;p&gt;The whole thing is one file. Every import in the source gets sorted into exactly one bucket, deterministically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;R1 stdlib.&lt;/strong&gt; The import is in &lt;code&gt;sys.stdlib_module_names&lt;/code&gt; (Python 3.10+ ships this set). &lt;code&gt;os&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;pathlib&lt;/code&gt;: no declaration needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R2 local.&lt;/strong&gt; There's a &lt;code&gt;name.py&lt;/code&gt; or &lt;code&gt;name/__init__.py&lt;/code&gt; in the project. It's your own module, not a dependency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R3 declared-match.&lt;/strong&gt; The normalized import name is in the manifest. This is where the annoying cases live: you import &lt;code&gt;yaml&lt;/code&gt; but the package is &lt;code&gt;PyYAML&lt;/code&gt;; you import &lt;code&gt;cv2&lt;/code&gt; but install &lt;code&gt;opencv-python&lt;/code&gt;. A small map handles the common ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R4 undeclared.&lt;/strong&gt; Not stdlib, not local, not declared. That's the gap. It imports a third-party package the manifest never mentions, so a fresh &lt;code&gt;pip install&lt;/code&gt; won't pull it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The metric is just the size of the R4 set. Here's the classification loop, verbatim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imports&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R1 stdlib&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R2 local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DIST_MAP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;decl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R3 declared&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;R4 UNDECLARED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;verdicts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No model call. No &lt;code&gt;pip&lt;/code&gt;. No &lt;code&gt;subprocess&lt;/code&gt;. It reads text and walks an AST.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broken project: declared one, imported four
&lt;/h2&gt;

&lt;p&gt;The fixture is a tiny "agent-style" file. It imports &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;pandas&lt;/code&gt;, and &lt;code&gt;yaml&lt;/code&gt;, plus a local &lt;code&gt;utils&lt;/code&gt; module and two stdlib modules. The &lt;code&gt;requirements.txt&lt;/code&gt; has exactly one line: &lt;code&gt;requests&lt;/code&gt;. Here is the real, unedited output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repro_probe v1 | project=broken_project
  logging          R1 stdlib
  numpy            R4 UNDECLARED
  os               R1 stdlib
  pandas           R4 UNDECLARED
  requests         R3 declared
  utils            R2 local
  yaml             R4 UNDECLARED
reproducibility_gap=3 declared_coverage=25.0% gate=0
imported_but_not_declared=numpy,pandas,yaml
verdict=BROKEN exit=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three undeclared third-party imports out of four. Coverage 25.0%. Exit 1, which in CI fails the build.&lt;/p&gt;

&lt;p&gt;Look at &lt;code&gt;yaml&lt;/code&gt;. It's flagged R4 here because the manifest doesn't list its distribution name &lt;code&gt;PyYAML&lt;/code&gt;. That's the import-name vs distribution-name trap that makes &lt;code&gt;grep import&lt;/code&gt; so unreliable: the strings don't match even when the dependency is "obvious." The tool normalizes through its map, so it catches the case a naive text search would miss, and (as the next section shows) clears it when &lt;code&gt;PyYAML&lt;/code&gt; actually is declared.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest project: the claim has to be falsifiable
&lt;/h2&gt;

&lt;p&gt;A check that flags everything is worthless. If my contrarian line ("passing is not reproducible") is real, then a genuinely clean project must come back clean. So the second fixture declares every third-party import it uses, imports a local &lt;code&gt;helpers&lt;/code&gt; module, and leans on stdlib. Same tool, same rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repro_probe v1 | project=clean_project
  helpers          R2 local
  io               R1 stdlib
  json             R1 stdlib
  os               R1 stdlib
  pathlib          R1 stdlib
  requests         R3 declared
  yaml             R3 declared
reproducibility_gap=0 declared_coverage=100.0% gate=0
imported_but_not_declared=(none)
verdict=CLEAN exit=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gap 0, coverage 100.0%, exit 0. Note &lt;code&gt;yaml&lt;/code&gt; is &lt;strong&gt;R3 declared&lt;/strong&gt; here. Same import, opposite verdict, because this manifest lists &lt;code&gt;PyYAML&lt;/code&gt;. The map cuts both ways: it doesn't false-flag a dependency that's declared under its real distribution name. The honest project passes. That's the part that makes the tool a check and not a rubber stamp.&lt;/p&gt;

&lt;p&gt;The third fixture has source but no &lt;code&gt;requirements.txt&lt;/code&gt; and no &lt;code&gt;pyproject.toml&lt;/code&gt;. It exits 2. Bad input, refuse to guess. A tool that invented a verdict from a missing manifest would be worse than no tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is it deterministic?
&lt;/h2&gt;

&lt;p&gt;A pre-merge gate that flickers is noise. I ran each fixture twice and hashed STDOUT only (service messages go to stderr, kept out of the hashed stream):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clean_project   run1=5177ac0a...  run2=5177ac0a...  -&amp;gt; IDENTICAL
broken_project  run1=52a677f9...  run2=52a677f9...  -&amp;gt; IDENTICAL
bad_project     run1=e3b0c442...  run2=e3b0c442...  -&amp;gt; IDENTICAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same input, same bytes, every time. The output is sorted, so there's no set-ordering wobble. You can wire the exit code into CI and trust that a clean tree stays green.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;This is the part that keeps the tool honest, so read it before you quote a number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A dependency gap is a reproducibility &lt;em&gt;signal&lt;/em&gt;, not proof the project won't run.&lt;/strong&gt; It says "a third-party package is imported but not declared," which is a strong reason a fresh install would fail, but not a guarantee. Maybe the package is provided by the base image. Maybe it's an optional code path. The gap flags risk; it doesn't render a verdict on the project's fate. I called the metric &lt;code&gt;gap&lt;/code&gt;, not &lt;code&gt;brokenness&lt;/code&gt;, on purpose.&lt;/p&gt;

&lt;p&gt;And it has real blind spots I'm not going to hide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't check version pins.&lt;/strong&gt; &lt;code&gt;numpy&lt;/code&gt; declared but pinned to a version that doesn't exist, or a range that conflicts, sails through as R3. The gap is about &lt;em&gt;presence&lt;/em&gt;, not &lt;em&gt;resolvability&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't follow transitive dependencies.&lt;/strong&gt; It sees your direct imports, not what those packages drag in. The 13.5x expansion the paper measured lives mostly in the transitive layer, which a static import-vs-manifest check can't reach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't understand extras or environment markers.&lt;/strong&gt; &lt;code&gt;package[extra]&lt;/code&gt; and &lt;code&gt;; sys_platform == ...&lt;/code&gt; are normalized down to the base name; the extra is not verified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The stdlib allowlist is version-bound.&lt;/strong&gt; It's whatever &lt;code&gt;sys.stdlib_module_names&lt;/code&gt; reports on the Python you run it with. Run it on a different minor version and a module could shift buckets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It only reads &lt;code&gt;requirements.txt&lt;/code&gt; and PEP 621 &lt;code&gt;[project] dependencies&lt;/code&gt;.&lt;/strong&gt; Poetry's &lt;code&gt;[tool.poetry.dependencies]&lt;/code&gt;, &lt;code&gt;setup.py&lt;/code&gt;/&lt;code&gt;setup.cfg&lt;/code&gt;, and &lt;code&gt;optional-dependencies&lt;/code&gt; are not parsed as the dependency set. Point it at a poetry project and every import comes back undeclared, a false BROKEN, not a real gap. Run it on requirements-based projects, or read the verdict knowing this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local detection is top-level only.&lt;/strong&gt; It treats &lt;code&gt;name.py&lt;/code&gt; or &lt;code&gt;name/__init__.py&lt;/code&gt; in the project root as local (R2). A &lt;code&gt;src/&lt;/code&gt;-layout package or a PEP 420 namespace package (a local dir with no &lt;code&gt;__init__.py&lt;/code&gt;) gets flagged R4 instead. That's a false positive on a perfectly reproducible project, not a missing dependency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These false-BROKEN cases all err the safe way for a gate (it complains too much, never too little), but they're why you read the named list, not just the exit code, before you trust a verdict.&lt;/p&gt;

&lt;p&gt;So: a clean exit 0 is necessary, not sufficient. It rules out the dumbest, most common failure (imported but never declared) and nothing more. That one class is worth ruling out, because it's the one that turns a green PR into a teammate's &lt;code&gt;ModuleNotFoundError&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this sits next to the other checks
&lt;/h2&gt;

&lt;p&gt;This isn't a runtime guard. It's an artifact check that runs before the merge button, on the diff, with nothing executing. It's a cousin of the &lt;a href="https://finops.spinov.online/blog/green-checkmark-auditor/" rel="noopener noreferrer"&gt;green-checkmark auditor&lt;/a&gt;, which asks whether passing &lt;em&gt;tests&lt;/em&gt; carry independent signal; here the question is whether the &lt;em&gt;manifest&lt;/em&gt; matches the &lt;em&gt;imports&lt;/em&gt;. Both come from the same suspicion that a green status is a claim, not a proof. It's the same suspicion behind &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;your agent returns 200 and lies&lt;/a&gt;. If you already parse manifests for drift, it rhymes with &lt;a href="https://finops.spinov.online/blog/mcp-tool-pin-verify/" rel="noopener noreferrer"&gt;pinning and verifying MCP tool manifests&lt;/a&gt;. And the whole instinct, put the barrier &lt;em&gt;before&lt;/em&gt; the irreversible step rather than after, is the &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate&lt;/a&gt; applied to the merge instead of the runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run it on your own repo
&lt;/h2&gt;

&lt;p&gt;The tool is one Python file, stdlib only. Point it at a real project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 repro_probe.py path/to/your/project
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"exit: &lt;/span&gt;&lt;span class="nv"&gt;$?&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit 0 means every imported third-party package is declared (or it's stdlib/local). Exit 1 hands you the list of imported-but-not-declared names. Exit 2 means there's no manifest to check against. Add &lt;code&gt;--gate N&lt;/code&gt; if you want to tolerate a known gap of N while you fix it.&lt;/p&gt;

&lt;p&gt;It won't catch a bad version pin or a missing transitive dep. It will catch the most common reason an AI-written diff passes CI and then won't install: the package the source imports and the manifest forgot.&lt;/p&gt;

&lt;p&gt;If you run it on something an agent wrote recently, I'd genuinely like to know the gap you got, and whether any of it was the import-name vs distribution-name trap. That's the case I'm least sure my little map covers well. Follow along for the next batch of numbers, and drop your worst gap in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Audit AI-Generated Tests: Half of Green CI Proves Nothing</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Mon, 22 Jun 2026 20:05:29 +0000</pubDate>
      <link>https://dev.to/alex_spinov/audit-ai-generated-tests-half-of-green-ci-proves-nothing-4bmb</link>
      <guid>https://dev.to/alex_spinov/audit-ai-generated-tests-half-of-green-ci-proves-nothing-4bmb</guid>
      <description>&lt;p&gt;&lt;strong&gt;To audit AI-generated tests, score how many mirror the code instead of checking it.&lt;/strong&gt; Green CI proves your tests &lt;em&gt;agree&lt;/em&gt; with the code, not that it is correct — and when one agent writes both, they often just mirror it. &lt;code&gt;mirror_audit.py&lt;/code&gt; reads the test source with &lt;code&gt;ast&lt;/code&gt;, never runs it, and scored a one-pass suite at &lt;strong&gt;50.0%, exit 1&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I drafted this with an AI writing assistant. The tool, the three fixtures, and every number below come from a real local run on Python 3.13.5, stdlib only. I ran it, checked the exit codes, hashed the STDOUT twice to confirm it's byte-for-byte deterministic, and edited every line before publishing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A passing test feels like evidence. It usually is less than you think.&lt;/p&gt;

&lt;p&gt;Here's the trap, said plainly. A test asserts that the code does what the test expects. If the same author (human or agent) writes both the code and the expectation in one sitting, the expectation is shaped &lt;em&gt;by&lt;/em&gt; the code. The test passes because it was written to pass. Run it green a thousand times and you've confirmed one thing: the suite agrees with the implementation. Not that the implementation is right. Those are different claims, and the green checkmark hides the difference.&lt;/p&gt;

&lt;p&gt;This got sharper the moment agents started shipping whole pull requests, impl and tests together, one diff, one author. The checkmark didn't get more trustworthy. The thing producing it changed, and our trust in it didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually measured
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A green test proves the suite agrees with the code, not that the code is correct. Same-author code-plus-tests makes that gap wide.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mirror_audit.py&lt;/code&gt; reads the test file's &lt;code&gt;ast&lt;/code&gt; (it never executes anything), flags four mirror patterns per test, and counts a test as a &lt;em&gt;mirror&lt;/em&gt; when ≥2 of 4 fire.&lt;/li&gt;
&lt;li&gt;On a deliberately mirror-shaped suite: &lt;strong&gt;mirror-ratio 50.0%, 4 of 8 tests, exit 1 (CI fail)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;On an honest suite (negative cases, independent expectations, boundaries): &lt;strong&gt;0.0%, exit 0 (pass)&lt;/strong&gt;. The claim is falsifiable, and it passed the honest one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mirror-ratio is not a bug-rate.&lt;/strong&gt; It measures &lt;em&gt;missing independent signal&lt;/em&gt;, not the presence of a bug. More on that below, because it's the line that keeps this honest.&lt;/li&gt;
&lt;li&gt;Stdlib &lt;code&gt;ast&lt;/code&gt; only. No API key, no network, nothing executed. Bad input → exit 2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll give you the run before the argument, because the run is the argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian bit: "tests pass" is not "code works"
&lt;/h2&gt;

&lt;p&gt;Most writing about AI-generated tests stops at &lt;em&gt;coverage&lt;/em&gt;. Did the agent write tests? Do they pass? Ship it. That's the easy 80%. The expensive 20% is whether any of those passing tests could have &lt;em&gt;failed for a real reason&lt;/em&gt;: whether there's an independent oracle anywhere, or just the code checking itself in a mirror.&lt;/p&gt;

&lt;p&gt;Three shapes show up over and over in one-pass test output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The recompute.&lt;/strong&gt; &lt;code&gt;assert apply_discount(200, 10) == round(200 * (1 - 10/100), 2)&lt;/code&gt;. The right-hand side is the implementation's own formula, retyped. It can't disagree with the code. It's &lt;code&gt;f(x) == f(x)&lt;/code&gt; wearing a costume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The golden literal copied from a run.&lt;/strong&gt; &lt;code&gt;assert apply_discount(100, 25) == 75.0&lt;/code&gt;, where &lt;code&gt;75.0&lt;/code&gt; was lifted from running the code once, not derived independently. You've pinned the test to whatever the code did on day one, bug included.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The smoke test.&lt;/strong&gt; &lt;code&gt;parse_iso_date("2026-06-23")&lt;/code&gt; with no assert, or &lt;code&gt;assert result is not None&lt;/code&gt;. It goes green if the function returns &lt;em&gt;anything&lt;/em&gt;. The signal is roughly zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are &lt;em&gt;wrong&lt;/em&gt;. They're just not &lt;em&gt;checks&lt;/em&gt;. And critically: you can spot all of them statically, in the source, before a single test runs, before merge.&lt;/p&gt;

&lt;p&gt;The falsifiable claim, stated so it can lose: if you take a suite written to mirror its implementation and a suite written to check it independently, a static reader of the test source should score the mirror suite high and the honest one low. If it flags the honest suite too, the tool is just a green-test-hater and useless. It didn't. The mirror suite came back &lt;strong&gt;50.0%&lt;/strong&gt;, the honest suite &lt;strong&gt;0.0%&lt;/strong&gt;. The tool drew a line between them. Run is right here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The run
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;mirror_audit.py&lt;/code&gt; takes one or more Python test files (and optionally the implementation file, to catch recompute and copied-golden patterns). It parses each with &lt;code&gt;ast.parse&lt;/code&gt;, walks for &lt;code&gt;def test_*&lt;/code&gt; functions, and applies four deterministic flags. &lt;strong&gt;It never imports or runs your tests.&lt;/strong&gt; It reads them the way a reviewer skims a diff, only it doesn't get tired.&lt;/p&gt;

&lt;p&gt;First, the suite an agent emits next to its own code: happy paths, recomputes, a couple of &lt;code&gt;# generated&lt;/code&gt; stamps, two smoke tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 mirror_audit.py fixtures/tests_mirror.py --impl fixtures/impl_under_test.py
mirror_audit  (static, offline, read-only; no tests executed)
impl-oracle   : on (impl_under_test.py)
tests scanned : 8
mirror tests  : 4  (&amp;gt;= 2 of 4 flags)
mirror-ratio  : 50.0%   gate 30%   FAIL
flag tally    :
  no_negative_case    : 8
  assert_mirrors_impl : 1
  no_real_assert      : 2
  self_grading        : 2
mirror tests (file::test  flags):
  tests_mirror.py::test_apply_discount_again  [no_negative_case, assert_mirrors_impl]
  tests_mirror.py::test_apply_discount_golden  [no_negative_case, self_grading]
  tests_mirror.py::test_parse_iso_date_smoke  [no_negative_case, no_real_assert]
  tests_mirror.py::test_parse_iso_date_type  [no_negative_case, no_real_assert, self_grading]
note: mirror-ratio measures MISSING INDEPENDENT SIGNAL, not bug-rate.
exit: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;50.0%. Exit 1.&lt;/strong&gt; Half the green tests carry no independent signal, and the gate fails the build. Note the asymmetry in the tally. &lt;em&gt;Every&lt;/em&gt; test trips &lt;code&gt;no_negative_case&lt;/code&gt; (not one of the eight checks an error or a boundary), but a single flag isn't enough. A test has to be a mirror on &lt;strong&gt;two&lt;/strong&gt; axes before it's called one. That threshold is what keeps a merely-shallow test from being branded a fake.&lt;/p&gt;

&lt;p&gt;Now the honest suite: same code under test, but with negative cases, a hand-written expectation table, and boundary asserts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 mirror_audit.py fixtures/tests_honest.py --impl fixtures/impl_under_test.py
tests scanned : 5
mirror tests  : 0  (&amp;gt;= 2 of 4 flags)
mirror-ratio  : 0.0%   gate 30%   pass
flag tally    :
  no_negative_case    : 2
  assert_mirrors_impl : 0
  no_real_assert      : 2
  self_grading        : 0
exit: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;0.0%. Exit 0.&lt;/strong&gt; Three of the five honest tests do trip a single flag (a smoke-ish shape here, a happy path there), and the tool still passes them, because none of them is a mirror on two axes. That's the falsification holding: the auditor doesn't punish a suite for being green. It punishes a suite for being green &lt;em&gt;and&lt;/em&gt; having nothing that could go red for a real reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part that makes it concrete: the bug
&lt;/h2&gt;

&lt;p&gt;A ratio is abstract. Here's the bug it's standing in for. The implementation under test has one real edge defect: &lt;code&gt;apply_discount&lt;/code&gt; clamps the &lt;em&gt;top&lt;/em&gt; of the percentage but forgets the &lt;em&gt;bottom&lt;/em&gt;, so a negative discount inflates the price.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 fixtures/prove_bug.py
mirror recompute: apply_discount(100,-50)=150.0 == recompute(150.0) -&amp;gt; True  (test passes, green CI)
honest contract : apply_discount(100,-50)=150.0 &amp;lt;= 100 -&amp;gt; False  (test FAILS - bug caught)

BUG present: a -50% 'discount' returns 150.0 for a $100 item.
The mirror assert agreed with the bug. The honest contract caught it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minus-50% "discount" charges &lt;strong&gt;$150 for a $100 item.&lt;/strong&gt; The mirror test computes the expected value with the &lt;em&gt;same broken formula&lt;/em&gt;, gets $150, asserts &lt;code&gt;150 == 150&lt;/code&gt;, goes green. The honest test asserts a contract (a discount must never raise the price), gets &lt;code&gt;150 &amp;lt;= 100&lt;/code&gt;, goes red. Same code, same input. One suite blesses the bug; the other catches it. &lt;code&gt;mirror_audit.py&lt;/code&gt; doesn't run either of these. It just tells you, statically, which suite you've got &lt;em&gt;before&lt;/em&gt; you trust its checkmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four flags, and why two
&lt;/h2&gt;

&lt;p&gt;Each flag fires on a real &lt;code&gt;ast&lt;/code&gt; node, deterministically. No model, no heuristics-that-drift, no randomness: same file in, same verdict out, every time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;no-negative-case.&lt;/strong&gt; No &lt;code&gt;pytest.raises&lt;/code&gt; / &lt;code&gt;assertRaises&lt;/code&gt;, no relational boundary assert (&lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;=&lt;/code&gt;), no error contract anywhere in the body. Happy path only. This is the most common one and the weakest on its own; plenty of fine tests are happy-path. That's exactly why one flag isn't a verdict.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;assert-mirrors-impl.&lt;/strong&gt; An equality assert where &lt;em&gt;both&lt;/em&gt; sides call the implementation: &lt;code&gt;f(x) == f(x)&lt;/code&gt;, or &lt;code&gt;result == recompute_with_impl(x)&lt;/code&gt;. There's no independent oracle; the expectation is the code. (This one needs the &lt;code&gt;--impl&lt;/code&gt; file to know which names are "the implementation.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;no-real-assert.&lt;/strong&gt; No assert at all (a pure smoke test), or only tautological ones: &lt;code&gt;assert x is not None&lt;/code&gt;, &lt;code&gt;assert isinstance(...)&lt;/code&gt;, &lt;code&gt;assertTrue(True)&lt;/code&gt;. Green, signal ≈ 0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;self-grading marker.&lt;/strong&gt; A per-test &lt;code&gt;# generated&lt;/code&gt; / &lt;code&gt;# auto&lt;/code&gt; stamp on this specific test, or an equality assert against a numeric literal that &lt;em&gt;also appears&lt;/em&gt; in the implementation source. The intent is to catch a golden value copied out of the code rather than derived — but read it as a collision heuristic, not proof: it can't distinguish "copied from the code" from "the honest answer happened to be &lt;code&gt;100&lt;/code&gt;, which is also in the code." It's the noisiest of the four (more on that in the caveats).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A test is a &lt;strong&gt;mirror&lt;/strong&gt; when &lt;strong&gt;≥2 of 4&lt;/strong&gt; fire. The threshold is the whole design. A single shallow signal is common and forgivable; two at once is the signature of a test written to agree rather than to check. I picked 2 because it cleanly separated my two fixtures. It's a starting line, not a law. Tune it to your own suites and tell me where you land.&lt;/p&gt;

&lt;h2&gt;
  
  
  One honesty rule baked in: mirror-ratio is not a bug-rate
&lt;/h2&gt;

&lt;p&gt;This is the line I will not let you walk away without. &lt;strong&gt;The mirror-ratio does not estimate how many bugs you have.&lt;/strong&gt; It estimates how much of your green CI carries &lt;em&gt;no independent signal&lt;/em&gt;: how much of it is the suite nodding along with the code. A 50% mirror-ratio means half your passing tests couldn't have caught a wrong answer if there was one. It does &lt;strong&gt;not&lt;/strong&gt; mean half your code is buggy. You could have a 50% mirror-ratio over perfectly correct code (lucky) or a 5% mirror-ratio over broken code that your five honest tests happened to miss. The tool measures the &lt;em&gt;quality of the check&lt;/em&gt;, not the &lt;em&gt;correctness of the code&lt;/em&gt;. The output literally prints &lt;code&gt;note: mirror-ratio measures MISSING INDEPENDENT SIGNAL, not bug-rate.&lt;/code&gt; on every run so nobody, including me, can quote it as a defect count.&lt;/p&gt;

&lt;p&gt;If I dressed this up as "half your code is broken," that would be the exact overclaim the tool exists to catch. So I won't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the outside numbers land (and where they don't)
&lt;/h2&gt;

&lt;p&gt;I went looking for whether this matters beyond a toy fixture. Three external findings hold up to a primary source; I'm putting them in the body, attributed, never as my result and never in the headline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Veracode, 2025 GenAI Code Security Report:&lt;/strong&gt; across 80 curated coding tasks run through 100+ LLMs, &lt;strong&gt;45% of generated samples failed the security test and introduced an OWASP Top-10 weakness&lt;/strong&gt;; Java was the worst at a &lt;strong&gt;72% failure rate&lt;/strong&gt; (&lt;a href="https://www.veracode.com/blog/genai-code-security-report/" rel="noopener noreferrer"&gt;Veracode&lt;/a&gt;). Read it precisely: that's a &lt;em&gt;security-failure rate on benchmark tasks&lt;/em&gt;, not "45% of all AI code is exploitable." Still, that's a lot of code shipping behind a green checkmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ICSE 2026 (SEIP track), "Vibe Coding in Practice"&lt;/strong&gt; by Fawzy, Tahir &amp;amp; Blincoe (a grey-literature review of 101 practitioner sources and 518 firsthand accounts) finds that &lt;strong&gt;QA practices are frequently overlooked, and skipping testing is the single most common behavior&lt;/strong&gt;, often by handing verification back to the same AI tool that wrote the code (&lt;a href="https://arxiv.org/abs/2510.00328" rel="noopener noreferrer"&gt;arXiv 2510.00328&lt;/a&gt;). That last clause is the whole problem in one sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Rethinking Verification for LLM Code Generation"&lt;/strong&gt; (Ma et al., &lt;a href="https://arxiv.org/abs/2507.06920" rel="noopener noreferrer"&gt;arXiv 2507.06920&lt;/a&gt;, July 2025) finds that model-built evaluation suites tend to be &lt;strong&gt;homogeneous&lt;/strong&gt; — "a limited number of homogeneous test cases, resulting in subtle faults going undetected," in their words — and proposes human-LLM collaboration to widen coverage. Read into our setting, that's the same blind spot showing up on both sides of the assert: a narrow, same-shaped suite can't catch the faults its own narrowness hides.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And one I'm flagging &lt;em&gt;as&lt;/em&gt; weak so you can discount it: CodeRabbit, an AI code-review vendor, reported &lt;strong&gt;~1.7x more "issues" in AI-coauthored PRs&lt;/strong&gt; than human ones across 470 open-source PRs (&lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;CodeRabbit&lt;/a&gt;). I'd take that with salt. The "issues" were graded by CodeRabbit's own product, on a small sample, and they sell AI review. Interesting direction, not a fact to lean on. I'm including it &lt;em&gt;and&lt;/em&gt; its conflict of interest because leaving it out would be cherry-picking, and putting it in unqualified would be the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not the runtime one. It's the pre-merge one
&lt;/h2&gt;

&lt;p&gt;I've written before about an agent that &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;returns 200 and lies&lt;/a&gt;: a &lt;em&gt;runtime&lt;/em&gt; check that walks an execution span-trace and refuses to accept a success the agent never achieved. People will assume this is the same thing. It isn't, and the difference matters.&lt;/p&gt;

&lt;p&gt;That one runs &lt;em&gt;after&lt;/em&gt; execution, on a trace of what happened: status flags, payloads, the effect on the world. This one runs &lt;em&gt;before&lt;/em&gt; anything executes, on the &lt;em&gt;source of the test files&lt;/em&gt; in a pull request: no tests run, no spans read, no runtime at all. &lt;code&gt;mirror_audit.py&lt;/code&gt; is pure &lt;code&gt;ast&lt;/code&gt; over text. Different layer (static vs runtime), different input (test source vs span-trace), different metric (mirror-ratio vs share of empty-payload successes). One asks "did this run actually do the thing?" The other asks "could this test have caught it if it didn't?" You'd want both, but they're not the same tool wearing two hats.&lt;/p&gt;

&lt;p&gt;It sits inside the same idea as the &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate&lt;/a&gt;: catch the problem &lt;em&gt;before&lt;/em&gt; you ship it, fail the build, not the incident review. It's a cousin of the &lt;a href="https://finops.spinov.online/blog/llm-judge-cost-deterministic-pre-gate/" rel="noopener noreferrer"&gt;deterministic pre-gate I built for an LLM judge&lt;/a&gt;: same shape, a &lt;code&gt;0/1/2&lt;/code&gt; exit you can drop straight into CI, just a different question asked of a different artifact. And it's the inverse of the &lt;a href="https://finops.spinov.online/blog/waste-probe-tokens-after-failure/" rel="noopener noreferrer"&gt;token-waste probe after a failure&lt;/a&gt;. There, the signal is a &lt;em&gt;real&lt;/em&gt; failure you're burning money past; here, the danger is a &lt;em&gt;false&lt;/em&gt; success, a green that isn't earned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT (so I don't oversell it)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; find bugs. It finds tests that &lt;em&gt;couldn't&lt;/em&gt; find bugs. A clean 0% mirror-ratio means your tests have independent signal, not that your code is correct. You can mirror-audit your way to a great suite that still misses the one case nobody wrote.&lt;/li&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; read git blame or PR metadata, so it can't &lt;em&gt;prove&lt;/em&gt; the same author wrote both files. It has no concept of authorship at all — it flags purely on what's in the test source, and the "same author wrote both" framing is the &lt;em&gt;motivation&lt;/em&gt; for the metric, not something the tool detects. I'm not going to invent authorship I can't see.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;conservative without the impl file.&lt;/strong&gt; Drop the &lt;code&gt;--impl&lt;/code&gt; argument and the recompute and golden-literal checks go dark. On the same mirror suite it scores &lt;strong&gt;37.5% instead of 50.0%&lt;/strong&gt;, because it can no longer tell that a literal matches one in the code. Dropping context lowers the score, never raises it. (That's &lt;em&gt;not&lt;/em&gt; a blanket "never over-reports" guarantee — see the next caveat.) The output header tells you which mode you're in.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;golden-literal check can over-flag on a collision.&lt;/strong&gt; It fires when an equality assert pins an expected value that &lt;em&gt;also appears as a literal in the impl source&lt;/em&gt; — but it can't tell "copied from the code" from "happened to match." An honest, hand-derived &lt;code&gt;assert apply_discount(200, 50) == 100&lt;/code&gt; trips it just because &lt;code&gt;100&lt;/code&gt; shows up in the implementation. So a suite full of legitimate small-integer expectations (&lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;100&lt;/code&gt;) can score higher than it deserves with &lt;code&gt;--impl&lt;/code&gt; on. It's a syntactic collision heuristic, not proof a value was copied. Read the flagged tests, don't trust the flag blindly — and if your domain is all round numbers, weight &lt;code&gt;self_grading&lt;/code&gt; lower.&lt;/li&gt;
&lt;li&gt;It resolves at the &lt;strong&gt;test-function&lt;/strong&gt; level, not the line. A test with one genuine assert buried under three smoke calls can still pass the audit. It's a triage signal for a reviewer, not a proof of suite quality.&lt;/li&gt;
&lt;li&gt;The four flags are &lt;strong&gt;heuristics on syntax&lt;/strong&gt;, not semantics. A sufficiently clever mirror (an assert that recomputes the impl through an indirection the AST can't follow) will slip past. It catches the &lt;em&gt;common&lt;/em&gt; shapes that show up in one-pass output, which is most of them, not all of them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's deterministic, though, which is the one thing it promises and keeps: same test file, same verdict, every run. I hashed the STDOUT twice on both fixtures and got identical sha256 each time (&lt;code&gt;5047bf48…&lt;/code&gt; for the mirror suite, &lt;code&gt;84fcdb73…&lt;/code&gt; for the honest one). A CI gate you can't reproduce isn't a gate.&lt;/p&gt;




&lt;p&gt;Run it on a real suite from an agent PR and tell me your mirror-ratio. I'm genuinely curious what the distribution looks like in the wild, because my 50% is a fixture I built to be obvious, and real suites will be messier. What's the most mirror-shaped test you've ever merged: a recompute, a copied golden, a smoke test with &lt;code&gt;assert result&lt;/code&gt;? Drop it in the comments, I read every one. Follow for the next number from the next run.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Agent Loop Cost: 11x Your Per-Call Quote, in 40 Lines</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Sun, 21 Jun 2026 19:45:15 +0000</pubDate>
      <link>https://dev.to/alex_spinov/agent-loop-cost-11x-your-per-call-quote-in-40-lines-5dfn</link>
      <guid>https://dev.to/alex_spinov/agent-loop-cost-11x-your-per-call-quote-in-40-lines-5dfn</guid>
      <description>&lt;p&gt;&lt;strong&gt;Agent loop cost&lt;/strong&gt; is what you pay per task, not per call — and it runs multiples higher than your per-call quote because every tool-call re-bills the whole system prompt plus every tool description. A 40-line offline forecaster reads one JSONL trace and prices the full loop before you ship. On my bloated fixture it measured an 11.29x gap.&lt;/p&gt;

&lt;p&gt;On my own bloated fixture, the forecaster measured an &lt;strong&gt;effective cost of $2.26 per task against a $0.20 per-invocation quote — an 11.29x gap&lt;/strong&gt; — with a cumulative-cost curvature of &lt;strong&gt;k = 2.14&lt;/strong&gt;, which is the literal meaning of "the bill grows quadratically with the number of tool-calls." That's my run on my fixture, not a claim about your agent. The whole point of the tool is that it tells you &lt;em&gt;your&lt;/em&gt; number, and on a short healthy loop it says the gap is basically zero.&lt;/p&gt;

&lt;p&gt;I want to be careful with one number up front. You may have seen the &lt;strong&gt;30x&lt;/strong&gt; figure going around — Muskan's June 2026 Dev.to writeup, "Why Claude Agent Loops Cost 30x a Single Inference" (&lt;a href="https://dev.to/muskan_8abedcc7e12/agentic-ai-finops-why-claude-agent-loops-cost-30x-a-single-inference-fip"&gt;dev.to/muskan_8abedcc7e12&lt;/a&gt;). That 30x is &lt;em&gt;their&lt;/em&gt; aggregate over 10,000 invocations a day, not a single-task measurement, and it's not mine. My single-task gap on a deliberately bloated 14-step loop came out to 11.29x. I'm titling this post with the number I actually measured, rounded down. If your loop is worse, the tool will say 30x or 40x — but I won't put a number in a headline that my own run didn't produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A team budgets an agent at the price of &lt;strong&gt;one&lt;/strong&gt; call; the task runs N calls, and each call re-bills the fixed system tax (system prompt + every tool description) plus the growing tail of prior results. The cumulative bill is the area under that curve.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;loop_forecast.py&lt;/code&gt; is a ~40-line offline, keyless, read-only script. Feed it a JSONL trace; it prints &lt;strong&gt;effective $/task&lt;/strong&gt;, the &lt;strong&gt;forecast_gap&lt;/strong&gt; against your per-invocation quote, and a &lt;strong&gt;curvature k&lt;/strong&gt; (cumulative bill ~ calls^k).&lt;/li&gt;
&lt;li&gt;On my bloated fixture: &lt;strong&gt;$2.26/task vs $0.20 quote = 11.29x&lt;/strong&gt;, k = &lt;strong&gt;2.14&lt;/strong&gt;, exit &lt;strong&gt;1&lt;/strong&gt;. On a clean 3-step loop: gap &lt;strong&gt;0.04x&lt;/strong&gt;, k &lt;strong&gt;1.18&lt;/strong&gt;, exit &lt;strong&gt;0&lt;/strong&gt; — the contrarian claim falsified on its own terms.&lt;/li&gt;
&lt;li&gt;Exit code is a &lt;strong&gt;pre-execution CI gate&lt;/strong&gt;: 0 if the loop is cheap or linear, 1 if it's both super-linear and over quote, 2 on usage error.&lt;/li&gt;
&lt;li&gt;This is a forecaster from a trace, not a runtime cap. It does not block your agent. It blocks your build.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why your per-call quote is the wrong unit
&lt;/h2&gt;

&lt;p&gt;The mistake is unit confusion, and it's an easy one to make. You open the pricing page, you see &lt;code&gt;$3.00 / 1M input tokens&lt;/code&gt;, you estimate a call at ~8k tokens of input, and you write down &lt;strong&gt;$0.20 per agent action&lt;/strong&gt;. Multiply by your daily call volume and you have a budget. It feels rigorous. It's wrong by an order of magnitude. Not because the per-call price is wrong; that part is right. The error is that the &lt;em&gt;task&lt;/em&gt; is not one call.&lt;/p&gt;

&lt;p&gt;Here's what actually gets billed. An agent loop re-sends its entire working context as input on every single step. The system prompt goes out again. Every tool description in the inventory goes out again. And the transcript of everything the agent has done so far — every prior tool result — goes out again, growing each turn. So step 1 bills a small payload, step 8 bills a large one, and the &lt;em&gt;task&lt;/em&gt; cost is the sum of all of them. That sum is the area under a rising curve. The area under a rising line is quadratic.&lt;/p&gt;

&lt;p&gt;Muskan's writeup put concrete shape on the replay: 8 tools at ~200 tokens each is &lt;strong&gt;1,600 tokens of inventory replayed on every step&lt;/strong&gt;, and by step 8 the single-step context had ballooned to roughly &lt;strong&gt;83,000 input tokens&lt;/strong&gt;. Augment Code's guide on agent loop token cost makes the same structural claim independently — that naive loops "rebill prior context on every call," so input cost rises quadratically and a "20-step loop" can run &lt;a href="https://www.augmentcode.com/guides/ai-agent-loop-token-cost-context-constraints" rel="noopener noreferrer"&gt;"more than 10x the naive per-step estimate"&lt;/a&gt;. Both of those are &lt;em&gt;their&lt;/em&gt; measurements, cited as third-party support, not my numbers. My contribution is a tool that computes your own version of it from a trace you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian claim, stated so it can be wrong
&lt;/h2&gt;

&lt;p&gt;Here's the position, sharp enough to argue with: &lt;strong&gt;you budget your agent at the price of one call and you pay the area under a curve.&lt;/strong&gt; And that area is quadratic in the number of tool-calls, because the fixed system tax replays on every step while the result tail keeps growing — and you can compute the whole curve from a trace before you run it in production, without ever touching your provider's billing API.&lt;/p&gt;

&lt;p&gt;That claim has a clean failure mode, and I built it into the tool. If your loop is short, your tool inventory is small, and the model exits early, the per-step curve barely rises, the cumulative cost is near-linear, and the gap to your quote is small. In that case the claim is &lt;em&gt;false for you&lt;/em&gt;, and &lt;code&gt;loop_forecast.py&lt;/code&gt; returns &lt;strong&gt;exit 0&lt;/strong&gt; and says so. I ran exactly that case. The clean fixture — 420-token system, three tool descriptions, three steps, small results — came back with &lt;strong&gt;gap 0.04x and k 1.18&lt;/strong&gt;. No breach. If your real traces all look like that, this whole thesis doesn't apply to you, and I'd rather the tool tell you than have you take my word.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT
&lt;/h2&gt;

&lt;p&gt;Three adjacent costs already have their own tools in this series, and this one is none of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is not the re-bill tax on a transcript.&lt;/strong&gt; The &lt;a href="https://finops.spinov.online/blog/context-tax-measure-transcript-rebill/" rel="noopener noreferrer"&gt;context tax that re-bills your transcript every step&lt;/a&gt; takes a flat list of conversation turns and measures the compounding of the &lt;em&gt;message history&lt;/em&gt; — turns 1..N re-sent as input. That tool's input is &lt;code&gt;role/content&lt;/code&gt; turns; this tool's input is &lt;em&gt;tool-call records&lt;/em&gt; with a manifest, and it models the replay of the loop's &lt;em&gt;structure&lt;/em&gt; (system + tool descriptions as a fixed tax) plus the result tail, and it compares to an external quote. Different input, different metric, different question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not the static MCP inventory tax.&lt;/strong&gt; The &lt;a href="https://finops.spinov.online/blog/mcp-server-token-tax/" rel="noopener noreferrer"&gt;token tax of your connected MCP server inventory&lt;/a&gt; measures the one-time context cost of having tools &lt;em&gt;connected&lt;/em&gt; — the descriptions sitting in your context window before the agent does anything. This tool takes that same inventory and shows what it costs when it's &lt;strong&gt;replayed across every step of a multi-call loop&lt;/strong&gt;. The inventory tax is the standing charge; this is the metered usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not a spend cap.&lt;/strong&gt; This is a forecast from a finished trace. It blocks nothing at runtime. If you want execution to actually halt when a dollar ceiling is hit, that's a &lt;a href="https://finops.spinov.online/blog/sliding-window-spend-guard/" rel="noopener noreferrer"&gt;sliding-window spend guard at runtime&lt;/a&gt;. &lt;code&gt;loop_forecast.py&lt;/code&gt; is the &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate for AI agents&lt;/a&gt; variant: it fails your CI build before the expensive loop ever ships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The tool: loop_forecast.py
&lt;/h2&gt;

&lt;p&gt;The constraints first, because they're why you can run this on a production trace without a security review: &lt;strong&gt;offline, keyless, read-only, zero network.&lt;/strong&gt; No vendor SDK, no API key, nothing leaves your machine. It reads one JSONL file and prints. Tokenization is real — &lt;code&gt;tiktoken&lt;/code&gt; with the &lt;code&gt;o200k_base&lt;/code&gt; encoding — with an honest &lt;code&gt;len/4&lt;/code&gt; fallback if &lt;code&gt;tiktoken&lt;/code&gt; isn't installed; that fallback is roughly ±15% off true BPE, and the output says which one ran.&lt;/p&gt;

&lt;p&gt;The input is a JSONL trace. One manifest record, then one record per tool-call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"type":"manifest","system_tokens":1200,"tool_descriptions":[210,190,230,200,180,220,205,195,215,185],"quoted_usd_per_invocation":0.20,"input_usd_per_mtok":3.0}
{"type":"call","tool":"read_file","result_tokens":5400}
{"type":"call","tool":"run_tests","result_tokens":8100}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every field is something you already have or can count: the system-prompt size, the size of each tool description, the per-invocation price you budgeted with, and the input price per million tokens. If a call record has no &lt;code&gt;result_tokens&lt;/code&gt;, the tool tokenizes the &lt;code&gt;result&lt;/code&gt; text itself.&lt;/p&gt;

&lt;p&gt;The forecast is four deterministic rules, no model in the loop, no randomness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fixed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys_t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;                       &lt;span class="c1"&gt;# system tax replayed every step
&lt;/span&gt;&lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                           &lt;span class="c1"&gt;# step n bills fixed tax + accumulated prior results
&lt;/span&gt;    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fixed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;cum&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;rt&lt;/span&gt;
&lt;span class="n"&gt;eff_usd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e6&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;                 &lt;span class="c1"&gt;# area under the curve = cumulative billed input
&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eff_usd&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# curvature: slope of log(cumulative billed) vs log(step) -&amp;gt; total bill ~ calls^k.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;xs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;))];&lt;/span&gt; &lt;span class="n"&gt;yl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;my&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;den&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yl&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;den&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;den&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;gate_x&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule one: per-step billed input is the fixed tax plus the accumulated tail. Rule two: effective $/task is the sum of all steps — the area. (One honesty note: this counts &lt;em&gt;input&lt;/em&gt; tokens only, so the gap is fair only when the per-invocation quote you compare against is also an input budget; output-token pricing is out of scope, see the limits section.) Rule three: the gap is that effective cost divided by the per-invocation quote you budgeted with. Rule four: fit the &lt;em&gt;cumulative&lt;/em&gt; bill to &lt;code&gt;calls^k&lt;/code&gt; and gate on it. One honest note on the curvature: I fit the &lt;strong&gt;cumulative&lt;/strong&gt; cost, not the per-step cost. The per-step curve is roughly linear (k ≈ 1.4 on the spiral); its integral — the running total you actually pay — is the quadratic one (k = 2.14). I had this backwards in my first pass, fitting the per-step curve and reading k ≈ 1.0, which made a clearly quadratic loop look linear. Fitting the cumulative series is the fix.&lt;/p&gt;

&lt;p&gt;The full file is 72 lines with the CLI argument handling, the tokenizer shim, and the output formatting. The forecast itself — the part above plus parsing the manifest — is about 40. I'd rather keep the readable version than golf it down to make a headline literal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it prints on a healthy loop (exit 0)
&lt;/h2&gt;

&lt;p&gt;Run it on the clean fixture — three tool-calls, a small inventory, an early exit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 loop_forecast.py fixtures/loop_clean.jsonl
&lt;span class="go"&gt;loop_forecast.py | tokenizer: tiktoken/o200k_base
steps: 3 | system: 420t | tool_descriptions: 360t (replayed every step)
per-step billed input (tokens): [780, 960, 1110]
&lt;/span&gt;&lt;span class="gp"&gt;effective $&lt;/span&gt;/task: &lt;span class="nv"&gt;$0&lt;/span&gt;.0086  &lt;span class="o"&gt;(&lt;/span&gt;area under replay curve @ &lt;span class="nv"&gt;$3&lt;/span&gt;.00/Mtok&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;naive per-invocation quote: $&lt;/span&gt;0.2000
&lt;span class="go"&gt;forecast_gap: 0.04x  (effective / quote)
&lt;/span&gt;&lt;span class="gp"&gt;curvature k: 1.177  (cumulative bill ~ calls^k;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;1.0&lt;span class="o"&gt;=&lt;/span&gt;linear, 2.0&lt;span class="o"&gt;=&lt;/span&gt;quadratic&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;gate: gap&amp;gt;&lt;/span&gt;8x AND k&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.3 -&amp;gt; PASS
&lt;span class="go"&gt;exit: 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The effective cost is &lt;em&gt;below&lt;/em&gt; the quote here — a three-step loop with tiny results is cheaper than the single-call budget assumed, because the budget over-provisioned. Gap 0.04x, k 1.18, gate PASS. This is the falsification working. A cheap loop reads as cheap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it prints on a spiral (exit 1)
&lt;/h2&gt;

&lt;p&gt;Now the bloated fixture — a 1,200-token system prompt, ten tool descriptions (2,030 tokens of inventory replayed every step), and fourteen steps whose tool results grow into the thousands as the agent reads files, runs tests, and re-parses output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 loop_forecast.py fixtures/loop_spiral.jsonl
&lt;span class="go"&gt;loop_forecast.py | tokenizer: tiktoken/o200k_base
steps: 14 | system: 1200t | tool_descriptions: 2030t (replayed every step)
per-step billed input (tokens): [3230, 6430, 11830, 18630, 26730, 36230, 43430, 54430, 64230, 76730, 86930, 95530, 108530, 120030]
&lt;/span&gt;&lt;span class="gp"&gt;effective $&lt;/span&gt;/task: &lt;span class="nv"&gt;$2&lt;/span&gt;.2588  &lt;span class="o"&gt;(&lt;/span&gt;area under replay curve @ &lt;span class="nv"&gt;$3&lt;/span&gt;.00/Mtok&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;naive per-invocation quote: $&lt;/span&gt;0.2000
&lt;span class="go"&gt;forecast_gap: 11.29x  (effective / quote)
&lt;/span&gt;&lt;span class="gp"&gt;curvature k: 2.139  (cumulative bill ~ calls^k;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;1.0&lt;span class="o"&gt;=&lt;/span&gt;linear, 2.0&lt;span class="o"&gt;=&lt;/span&gt;quadratic&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;gate: gap&amp;gt;&lt;/span&gt;8x AND k&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.3 -&amp;gt; BREACH
&lt;span class="go"&gt;exit: 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the per-step list. Step 1 bills 3,230 tokens. Step 8 bills &lt;strong&gt;54,430&lt;/strong&gt; — in the same order of magnitude as Muskan's ~83k step-8 figure, and I kept mine deliberately under theirs so I'm not borrowing their drama. By step 14 a single step bills 120,030 input tokens, almost all of it re-sent context. The task costs $2.26. You quoted $0.20. That's the 11.29x in the title. It's the number the tool actually produced, not a target I reverse-engineered.&lt;/p&gt;

&lt;p&gt;The gate trips because &lt;em&gt;both&lt;/em&gt; conditions hold: gap over 8x and cumulative k at or above 1.3. A loop that's expensive but linear (a long, simple, single-tool job) won't trip it; neither will a curvy but cheap one. You want both signals before you fail someone's build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Determinism, because a flaky gate is worse than no gate
&lt;/h2&gt;

&lt;p&gt;A CI gate that returns different numbers on the same input is useless — you'll mute it the first week. So the arithmetic is integer token counts with no randomness and no network. I hashed the stdout twice on each fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clean  run1: 455a86ce7e1df9cdca74f072c5d5e2919dac8f91889d950769673e7998bd506d
clean  run2: 455a86ce7e1df9cdca74f072c5d5e2919dac8f91889d950769673e7998bd506d
spiral run1: 450d51f471b747c224c3782c6d8b4af8acddc1db677b073389e5de0a09ff74f3
spiral run2: 450d51f471b747c224c3782c6d8b4af8acddc1db677b073389e5de0a09ff74f3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Byte-identical. Same trace in, same gate out. Bad JSON returns exit 2 with the parser's error, and no arguments prints usage and exits 2 — so the gate fails loud on a broken trace instead of silently passing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is wrong, and where I'm guessing
&lt;/h2&gt;

&lt;p&gt;The model assumes input-token replay is the dominant cost, and on a tool-heavy agent loop it usually is — but if your steps generate large &lt;em&gt;outputs&lt;/em&gt; (long generations, not long contexts), output pricing matters and this tool ignores it. It also assumes you can name your per-invocation quote honestly; garbage quote in, garbage gap out. And the curvature fit needs enough steps to mean anything — on a 3-step loop the k value is noisy (which is why the gate also requires the gap condition). I'd trust the gap number on any trace and treat k as a shape hint, not a precise exponent.&lt;/p&gt;

&lt;p&gt;The fixtures here are constructed to be realistic, not harvested from a specific production run — the tool sizes and step counts come from Muskan's and Augment's published figures, the math is mine. Run it on your own exported traces and the numbers stop being illustrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run it on your loop
&lt;/h2&gt;

&lt;p&gt;Save the script, write one manifest line and one line per tool-call from a trace you already have, and run it. If your gap comes back under 2x, your per-call budgeting was fine and you can ignore all of this. If it comes back at 11x like mine — or worse — you now have a number to put in front of whoever signs the cloud bill, computed before the loop ever shipped.&lt;/p&gt;

&lt;p&gt;Here's the open question I don't have a clean answer to: prompt caching changes this math, because a cached system-prefix isn't re-billed at full input price. My forecaster assumes no cache (worst case). What's the right way to fold a &lt;em&gt;partial&lt;/em&gt; cache hit-rate into the per-step replay cost without making the tool lie in either direction? If you've modeled that, I want to see how.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Alexey Spinov. AI-assisted, human-verified: the tool, both fixtures, and every number above come from a real local run on 2026-06-21 (Python 3.13.5, tiktoken 0.13.0, o200k_base). I ran it, checked the exit codes (0 / 1 / 2), hashed the output twice to confirm determinism, separated my numbers from Muskan's and Augment's cited figures, and edited every line. Offline, keyless, read-only, zero network.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for the next numbers from production agent traces. What's the worst per-task-vs-quote gap you've found on a real loop — and did anything in CI catch it before the invoice did?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>finops</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Prompt Cache Break: Hit-Rate Fell 100% to 40% in 40 Lines</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Sat, 20 Jun 2026 19:38:43 +0000</pubDate>
      <link>https://dev.to/alex_spinov/prompt-cache-break-hit-rate-fell-100-to-40-in-40-lines-21mm</link>
      <guid>https://dev.to/alex_spinov/prompt-cache-break-hit-rate-fell-100-to-40-in-40-lines-21mm</guid>
      <description>&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt; a &lt;strong&gt;prompt cache-break&lt;/strong&gt; is when one change atop your prompt prefix — a fresh timestamp, a reordered tool block — makes the cache miss from there down, so cached reads silently re-bill as fresh input. &lt;code&gt;cache_break.py&lt;/code&gt; hashes each prefix segment, localizes the break, and fails CI. On my fixture, one timestamp dropped the estimated cache-hit-rate &lt;strong&gt;100% to 40%&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I drafted this with an AI writing assistant. The tool, the two fixtures, and every number below come from a real local run on tiktoken o200k_base — I ran it, checked the exit codes, hashed the output twice to confirm it's deterministic, and edited every line before publishing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Turning on prompt caching feels like a free win. It usually isn't the win you think.&lt;/p&gt;

&lt;p&gt;Here's the trap. Anthropic and OpenAI both price a cache &lt;em&gt;read&lt;/em&gt; at a fraction of a fresh input token — Anthropic quotes cached reads at roughly &lt;strong&gt;$0.30/M vs $3.00/M&lt;/strong&gt; for fresh, about a 10x gap on the cached portion (&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic prompt caching docs&lt;/a&gt;). Flip the feature on, watch the dashboard, move on. But that discount is only paid for a prefix that is &lt;strong&gt;byte-for-byte identical&lt;/strong&gt; to what's already cached. Change one character near the top and the cache misses from there down. You still &lt;em&gt;have caching on&lt;/em&gt;. You're just not &lt;em&gt;hitting&lt;/em&gt; it. And nothing in your config screams about it.&lt;/p&gt;

&lt;p&gt;So the number that actually matters isn't "is caching enabled." It's your real &lt;strong&gt;cache-hit-rate&lt;/strong&gt; — and almost nobody meters it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Prompt caching only pays out on an exact-prefix match. Agents quietly break that match: a dynamic &lt;code&gt;now=...&lt;/code&gt; in the system block, two tool definitions that got reordered, a memory snippet inserted at the top. When the prefix breaks, the cache misses below the break and re-bills it as fresh input. &lt;code&gt;cache_break.py&lt;/code&gt; (below, keyless, offline) hashes each prefix segment per step, localizes the first one that diverges from the baseline, computes the hit-rate, and gates it. On a clean trace it estimated &lt;strong&gt;100%&lt;/strong&gt; hit-rate, exit 0. On the same content with one injected timestamp, &lt;strong&gt;40%&lt;/strong&gt;, cache-break flagged at segment 0 step 3, exit 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian bit: "caching is on" is not "caching is working"
&lt;/h2&gt;

&lt;p&gt;Most caching write-ups stop at &lt;em&gt;how to turn it on&lt;/em&gt;. Mark the prefix, set &lt;code&gt;cache_control&lt;/code&gt;, done. That's the easy 80%. The expensive 20% is everything that silently invalidates the match afterward, on a running agent, where you'll never notice from the totals.&lt;/p&gt;

&lt;p&gt;Three ways an agent breaks its own cache without anyone touching the config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A dynamic value in the system prompt.&lt;/strong&gt; The classic is a current timestamp — &lt;code&gt;now=2026-06-21T08:14:03Z&lt;/code&gt; — stamped into the system block so the model "knows the time." It changes every call. It sits at the very top. So the cache misses on &lt;em&gt;everything&lt;/em&gt;, every step, forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reordered tool definitions.&lt;/strong&gt; Your framework serializes tool schemas from a dict or a set. Run two, the order flips, the bytes differ, the prefix no longer matches. Same tools. Broken cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A memory snippet inserted at the front.&lt;/strong&gt; Retrieval-augmented memory that prepends "what we learned last session" pushes a variable block above the stable one. Everything below it is now fresh.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The falsifiable claim: if you take a clean trace and a &lt;em&gt;byte-identical-content&lt;/em&gt; trace where only the prefix ordering breaks, a real measurement should show the hit-rate collapse on the broken one and stay high on the clean one. If it doesn't collapse, I'm wrong and this tool is useless. It collapsed — 100% to 40% — and the detector named the exact segment. Run is below.&lt;/p&gt;

&lt;p&gt;There's even a name for the failure mode in the research now. The arXiv note &lt;em&gt;Don't Break the Cache&lt;/em&gt; (2601.06007, Lumer et al., Jan 2026) is entirely about prefix-stability discipline for cached agentic inference — worth a read if you want the formal treatment. The 70% default gate here is &lt;strong&gt;my own&lt;/strong&gt; pick, not theirs: log your cached-token count, compute &lt;code&gt;hits / (hits + full)&lt;/code&gt;, and alert when it drops under ~70%. That number is a starting line, not a law — tune it to your own traces.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool: 40 lines, no API key, read-only
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;cache_break.py&lt;/code&gt; reads one JSONL trace. Each line is one agent step with a &lt;code&gt;prefix&lt;/code&gt;: an ordered list of named segments (&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;tool_defs&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;) — the part you &lt;em&gt;expect&lt;/em&gt; to be cached. It does four deterministic things.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefix-stability hash.&lt;/strong&gt; Canonicalize each segment (&lt;code&gt;json.dumps&lt;/code&gt; with sorted keys) and sha256 it. Lock the step-1 prefix as the baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Break-point localization.&lt;/strong&gt; For every step, compare segment hashes to the baseline. The first segment that diverges is the break point — the cache misses from there down, because everything after a changed byte re-bills as fresh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hit-rate compute.&lt;/strong&gt; Tokens above the break = cached; the break and everything below = fresh. &lt;code&gt;hit-rate = cached / (cached + fresh)&lt;/code&gt;, summed across steps. tiktoken &lt;code&gt;o200k_base&lt;/code&gt;, len/4 fallback if tiktoken's missing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate.&lt;/strong&gt; Compare overall hit-rate to &lt;code&gt;--min-hit-rate&lt;/code&gt; (default 0.70). Below it &lt;em&gt;or&lt;/em&gt; any cache-break detected → exit 1. This is a pre-execution gate: don't ship a prompt ordering that punches through your own cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One honesty rule baked in: the output says &lt;code&gt;source: estimated from prefix stability&lt;/code&gt;. If your trace carried real provider &lt;code&gt;usage&lt;/code&gt; fields (Anthropic/OpenAI return cached-token counts), you'd use those instead and it'd say &lt;code&gt;measured&lt;/code&gt;. I'm not dressing an estimate up as a meter reading.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;cache_break.py - measure prompt cache-hit-rate and localize cache-break in a JSONL trace.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="n"&gt;MIN_HIT_RATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;       &lt;span class="c1"&gt;# my default gate: alert when cache-hit-rate drops below 70% (tune to your traces)
&lt;/span&gt;&lt;span class="n"&gt;CACHE_READ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.30&lt;/span&gt;         &lt;span class="c1"&gt;# $/1M cached-read tokens  (public price flag, NOT a measurement)
&lt;/span&gt;&lt;span class="n"&gt;FRESH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.00&lt;/span&gt;              &lt;span class="c1"&gt;# $/1M fresh-input tokens  (public price flag, NOT a measurement)
&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;
    &lt;span class="n"&gt;_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o200k_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;TOKENIZER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tiktoken o200k_base (exact)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                                  &lt;span class="c1"&gt;# honest fallback, ~+-15% vs real BPE
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;TOKENIZER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;len/4 heuristic (tiktoken not installed; ~+-15%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;canon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage: cache_break.py &amp;lt;trace.jsonl&amp;gt; [--min-hit-rate 0.70]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;min_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--min-hit-rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--min-hit-rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;MIN_HIT_RATE&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ln&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ln&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty trace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;canon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;          &lt;span class="c1"&gt;# baseline prefix from step 1
&lt;/span&gt;        &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seg&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
        &lt;span class="n"&gt;toks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;IndexError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_break | BAD INPUT: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;break_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;break_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;canon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                                       &lt;span class="c1"&gt;# whole prefix byte-identical -&amp;gt; all cached
&lt;/span&gt;            &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                                                  &lt;span class="c1"&gt;# cache breaks at first divergence, rest re-bills fresh
&lt;/span&gt;            &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;break_at&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;break_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;break_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;break_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="n"&gt;broke&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;break_at&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;eff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CACHE_READ&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;FRESH&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c1"&gt;# blended $/1M on the prefix at this hit-rate
&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_break | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | tokenizer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TOKENIZER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | source: estimated from prefix stability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  steps=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  prefix_segments=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  cached_tokens=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  fresh_tokens=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  cache_hit_rate     : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;   (threshold &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;min_hit&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;broke&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  cache_break        : TRUE at segment &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;break_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;break_at&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;break_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, first seen step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;break_step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  cache_break        : FALSE (prefix byte-identical across all steps)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  effective $/1M prefix : $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;eff&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  (vs $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CACHE_READ&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; all-cached / $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;FRESH&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; all-fresh; public-price illustration)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hit&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_hit&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;broke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  exit               : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAIL&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="n"&gt;below&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;cache_break&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No key, no network, nothing written to disk. &lt;code&gt;pip install tiktoken&lt;/code&gt;, point it at a trace, read the exit code. If tiktoken isn't installed it falls back to len/4 and says so in the header (~±15% off real BPE). I'd rather print the caveat than fake the precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real run
&lt;/h2&gt;

&lt;p&gt;Two fixtures ship with it. Both are synthetic five-step coding sessions (no private data), same &lt;code&gt;payments-svc&lt;/code&gt; task, same three-segment prefix: &lt;code&gt;system&lt;/code&gt;, &lt;code&gt;tool_defs&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;trace_clean.jsonl&lt;/code&gt;&lt;/strong&gt; keeps that prefix byte-identical on every step — only the user tail changes, and the user tail isn't part of the cached prefix. Actual output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache_break | fixtures/trace_clean.jsonl | tokenizer: tiktoken o200k_base (exact) | source: estimated from prefix stability
------------------------------------------------------------------------------
  steps=5  prefix_segments=3  (system, tool_defs, memory)
  cached_tokens=335  fresh_tokens=0
  cache_hit_rate     : 100.00%   (threshold 70%)
  cache_break        : FALSE (prefix byte-identical across all steps)
  effective $/1M prefix : $0.30  (vs $0.30 all-cached / $3.00 all-fresh; public-price illustration)
  exit               : 0  (PASS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;100% hit-rate, no break, exit 0. Green. The whole prefix rides the cache every step, effective price sits at the floor — $0.30/M.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;trace_broken.jsonl&lt;/code&gt;&lt;/strong&gt; is the &lt;em&gt;exact same content&lt;/em&gt;, with one change: starting at step 3, the system segment carries a live clock — &lt;code&gt;now=2026-06-21T08:14:03Z&lt;/code&gt;, a different value each step. That's the only edit. Real output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache_break | fixtures/trace_broken.jsonl | tokenizer: tiktoken o200k_base (exact) | source: estimated from prefix stability
------------------------------------------------------------------------------
  steps=5  prefix_segments=3  (system, tool_defs, memory)
  cached_tokens=134  fresh_tokens=201
  cache_hit_rate     : 40.00%   (threshold 70%)
  cache_break        : TRUE at segment 0 'system', first seen step 3
  effective $/1M prefix : $1.92  (vs $0.30 all-cached / $3.00 all-fresh; public-price illustration)
  exit               : 1  (FAIL: hit-rate below threshold or cache_break)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;40% hit-rate, exit 1. And it points at the culprit precisely: &lt;strong&gt;segment 0, &lt;code&gt;system&lt;/code&gt;, first seen at step 3.&lt;/strong&gt; Steps 1–2 cached fine; from step 3 on the timestamp lives at the very top of the prefix, so the cache misses on the system block &lt;em&gt;and&lt;/em&gt; on the tool_defs and memory below it — even though those two never changed. That's the cruel part of cache-break: damage at the top voids everything underneath. The blended price on this prefix went from $0.30 to &lt;strong&gt;$1.92/M&lt;/strong&gt;, a 6.4x jump on this fixture, driven by one field a developer added to be helpful.&lt;/p&gt;

&lt;p&gt;Watch the two numbers that prove it's the timestamp and nothing else. Same &lt;code&gt;payments-svc&lt;/code&gt; content. Same tool list. Same memory. Cached tokens fell 335 → 134; fresh went 0 → 201. The only delta in the input was a clock.&lt;/p&gt;

&lt;p&gt;One more run worth showing, because it's the design decision people argue with. A break is a &lt;em&gt;gate condition on its own&lt;/em&gt; — not just a low rate. So even if you slacken the threshold all the way to 0.30, the broken trace still fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 cache_break.py fixtures/trace_broken.jsonl &lt;span class="nt"&gt;--min-hit-rate&lt;/span&gt; 0.30
&lt;span class="go"&gt;  cache_hit_rate     : 40.00%   (threshold 30%)
  cache_break        : TRUE at segment 0 'system', first seen step 3
  exit               : 1  (FAIL: hit-rate below threshold or cache_break)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;40% clears a 30% bar, but the break still trips exit 1. I made that call on purpose: a detected prefix-break means money is leaking somewhere measurable, and "the average is still okay" is exactly the reasoning that lets it leak for a month. Disagree with me on that — it's a real design tradeoff, not a law.&lt;/p&gt;

&lt;p&gt;Bad input is the third exit code. A malformed JSONL line returns exit 2, not a crash and not a false pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 cache_break.py fixtures/trace_bad.jsonl
&lt;span class="go"&gt;cache_break | BAD INPUT: Expecting ',' delimiter: line 2 column 1 (char 82)
[exit 2]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both real runs are reproducible. I hashed two consecutive clean runs and two consecutive broken runs with &lt;code&gt;shasum -a 256&lt;/code&gt;; each pair was byte-identical (&lt;code&gt;3608e4d5…&lt;/code&gt; for clean, &lt;code&gt;fb72c110…&lt;/code&gt; for broken). Deterministic, not a one-shot fluke.&lt;/p&gt;

&lt;h2&gt;
  
  
  6.4x on my run, 10x at the ceiling — the honest version
&lt;/h2&gt;

&lt;p&gt;You'll see "10x" thrown around for cache breaks, so here's where it comes from and what I actually got. The 10x is the &lt;em&gt;unit&lt;/em&gt; gap: Anthropic's published cached-read vs fresh-input ratio is roughly $0.30 to $3.00, a 10x difference &lt;em&gt;on tokens that go from cached to fresh&lt;/em&gt; (&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic docs&lt;/a&gt;). That's their public number, not mine.&lt;/p&gt;

&lt;p&gt;What the tool computed on this fixture's prefix was 6.4x — $0.30/M up to $1.92/M — because only 60% of the prefix tokens flipped to fresh, not all of them. If a break lands at segment 0 on step 1 of a long-running agent (the dynamic-timestamp case, which is the common one), every prefix token re-bills fresh and you approach the full 10x. So 10x is the ceiling from public prices; 6.4x is the real, smaller, honestly-labeled number from my run. I'd rather you trust the 6.4x I can show you than the 10x I can't.&lt;/p&gt;

&lt;p&gt;This is the same shape as the &lt;a href="https://finops.spinov.online/blog/context-tax-measure-transcript-rebill/" rel="noopener noreferrer"&gt;context tax&lt;/a&gt;, but a different leak. There, you re-bill the &lt;em&gt;whole transcript&lt;/em&gt; every step because it grows — a meter, not a guard. Here the prefix is supposed to be free-ish via cache, and a one-byte change quietly un-frees it. Different axis, different number, same lesson: measure the thing, don't assume the feature did its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this fits
&lt;/h2&gt;

&lt;p&gt;The prefix this tool watches — the long, stable &lt;code&gt;system&lt;/code&gt; + tool definitions + memory block — is the same one I metered for raw cost in &lt;a href="https://finops.spinov.online/blog/mcp-server-token-tax/" rel="noopener noreferrer"&gt;your MCP server's token tax&lt;/a&gt;. Token-tax asks "how big is this prefix"; cache-break asks "are you actually paying the cached price for it, or silently the fresh one." Run both on the same trace and you've covered both halves: the size of the prefix and whether it's hitting cache.&lt;/p&gt;

&lt;p&gt;And it's the same philosophy as the &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate&lt;/a&gt;: catch the broken prompt ordering &lt;em&gt;before&lt;/em&gt; you ship it, fail the build, not after the invoice. Cache-break is a perfect CI gate because it's deterministic — same trace, same verdict, every time. If you want the runtime spend brake instead of a structural check, that's the &lt;a href="https://finops.spinov.online/blog/sliding-window-spend-guard/" rel="noopener noreferrer"&gt;sliding-window spend guard&lt;/a&gt;, which caps cumulative cost over a window rather than auditing prefix stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT (so I don't oversell it)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; read your provider's real cached-token counts unless you put them in the trace. The hit-rate here is &lt;em&gt;estimated from prefix stability&lt;/em&gt; — it assumes the cache misses from the first changed byte down, which is how prefix caching works, but it's a structural model, not a meter reading off your bill. If your trace carries real &lt;code&gt;usage&lt;/code&gt; fields, wire those in and trust those instead. The header tells you which mode you're in.&lt;/li&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; compute your invoice. The &lt;code&gt;$/M&lt;/code&gt; figures use Anthropic's public $0.30/$3.00 prices as a flag, to illustrate the cached-vs-fresh gap. Your real bill depends on cache TTL, minimum cacheable length, output tokens, and your vendor — none of which this models.&lt;/li&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; account for cache &lt;strong&gt;expiry&lt;/strong&gt;. A prefix can be byte-identical and &lt;em&gt;still&lt;/em&gt; miss because the cache entry aged out (Anthropic's default TTL is short). This tool catches the &lt;em&gt;structural&lt;/em&gt; break — a changed prefix — not a timed-out one. Those need provider usage data to see.&lt;/li&gt;
&lt;li&gt;It assumes you've &lt;strong&gt;segmented&lt;/strong&gt; your prefix in the trace (system / tool_defs / memory). Garbage segmentation in, vague break-point out. The localization is only as precise as your segments — a one-character edit &lt;em&gt;inside&lt;/em&gt; the &lt;code&gt;system&lt;/code&gt; block re-bills the whole &lt;code&gt;system&lt;/code&gt; segment as fresh here, even though a real tokenizer would only lose the tokens from that character down. So I say "from the first changed byte down" as the model, but the tool resolves it at the &lt;strong&gt;segment&lt;/strong&gt; boundary, not the byte. That makes it slightly &lt;em&gt;pessimistic&lt;/em&gt; on mid-segment edits, never optimistic.&lt;/li&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; charge the cache-&lt;strong&gt;write&lt;/strong&gt; premium. Anthropic bills a cache write at ~1.25x base input, so the very first step (and every step that re-breaks the cache) pays &lt;em&gt;more&lt;/em&gt; than the fresh-input number, not less. The "$0.30 all-cached floor" is the steady-state read price, not an achievable per-run average — which means a repeatedly-broken cache costs a bit more than the $1.92 here suggests, not less. I left write-cost out to keep the model simple; it only ever &lt;em&gt;understates&lt;/em&gt; the leak.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What's the dumbest thing that's broken your prompt cache — a timestamp, a reordered tool list, a &lt;code&gt;uuid&lt;/code&gt; someone logged into the system prompt? Run the detector on a real trace and tell me where it pointed. I'm collecting break points, and I read every reply. Follow for the next number from the next run.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>finops</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your LLM Judge Costs More Than the Agent. Gate It in 40 Lines.</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Fri, 19 Jun 2026 19:30:16 +0000</pubDate>
      <link>https://dev.to/alex_spinov/your-llm-judge-costs-more-than-the-agent-gate-it-in-40-lines-cc7</link>
      <guid>https://dev.to/alex_spinov/your-llm-judge-costs-more-than-the-agent-gate-it-in-40-lines-cc7</guid>
      <description>&lt;p&gt;&lt;strong&gt;LLM judge cost is the share of your eval bill spent grading agent output instead of producing it.&lt;/strong&gt; To control it, run a 40-line offline pre-gate that triages every span with four deterministic rules and escalates only the uncertain tail to the expensive judge. On one trace this cut judge cost share from 50% to 16%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM judge cost&lt;/strong&gt; is the line item nobody puts on the FinOps dashboard. You add an LLM-as-judge to grade every agent span, you sleep better, and three weeks later the eval layer is quietly billing a third of what the agent itself costs. This post measures that share of your bill spent &lt;em&gt;judging&lt;/em&gt; instead of &lt;em&gt;doing&lt;/em&gt;, with a 40-line offline meter, and shows the one move that drops it from 50% to 16% on the same trace.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I drafted this with an AI writing assistant. The tool, both fixtures, and every number below come from a real local run of &lt;code&gt;judge_gate.py&lt;/code&gt; on Python 3.13.5, no network, no API key. I ran it, checked the exit codes, hashed the output twice to confirm it's deterministic, and edited every line myself before publishing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the sentence that set me off. Sattyam Jain wrote it on Dev.to on June 12, in a post arguing you should stop running an LLM judge on every agent call: &lt;em&gt;"if your monitor exceeds ~20–25% of production cost, you built the wrong monitor."&lt;/em&gt; (&lt;a href="https://dev.to/sattyamjjain/stop-running-an-llm-judge-on-every-agent-call-heres-the-cheaper-gate-495e"&gt;Dev.to&lt;/a&gt;) That's a great rule of thumb. It's also unfalsifiable until you can put a number on &lt;em&gt;your&lt;/em&gt; monitor. His post sketches the tiered architecture (cheap deterministic heuristics first, expensive judge last) but ships no code you can run against your own trace. So I wrote the missing 40 lines.&lt;/p&gt;

&lt;p&gt;The timing isn't an accident. The token bill is coming due across the whole industry right now. TechCrunch reported on June 5 that &lt;em&gt;"Uber blew through its entire 2026 AI coding budget by April,"&lt;/em&gt; and that a Priceline employee saw &lt;em&gt;"a routine Cursor contract renewal came back 4–5x more expensive."&lt;/em&gt; (&lt;a href="https://techcrunch.com/2026/06/05/the-token-bill-comes-due-inside-the-industry-scramble-to-manage-ais-runaway-costs/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;) Two days earlier the Linux Foundation announced its &lt;em&gt;intent to launch the Tokenomics Foundation&lt;/em&gt; — open standards for AI cost management, because, in Jim Zemlin's words, &lt;em&gt;"tokens have become the new unit of technology spend."&lt;/em&gt; (&lt;a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-intent-to-launch-the-tokenomics-foundation-to-establish-open-standards-for-ai-cost-management" rel="noopener noreferrer"&gt;Linux Foundation&lt;/a&gt;) Everyone's auditing what the agent spends. Almost nobody's auditing what the &lt;em&gt;watchdog&lt;/em&gt; spends.&lt;/p&gt;

&lt;p&gt;And the watchdog is an LLM call too. You priced the agent. Did you price the thing watching the agent?&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An LLM judge on every span isn't rigor — it's a second agent you forgot to budget. Price it before it surprises you.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;judge_gate.py&lt;/code&gt; is a 40-line, offline, keyless, zero-network script. Feed it a JSONL trace; four deterministic rules triage each span as OK / BAD / UNCERTAIN, and only UNCERTAIN ones would reach the expensive judge.&lt;/li&gt;
&lt;li&gt;On a well-instrumented 50-span trace it resolved &lt;strong&gt;68% cheaply&lt;/strong&gt; and sent only &lt;strong&gt;32% to the judge → 16% judge cost share&lt;/strong&gt; (exit 0, PASS). On the same agent logged as free text, &lt;strong&gt;100% escalated → 50% cost share&lt;/strong&gt; (exit 1, FAIL).&lt;/li&gt;
&lt;li&gt;The judge is never actually called. It's &lt;em&gt;priced&lt;/em&gt; via configurable &lt;code&gt;--judge-price&lt;/code&gt; and &lt;code&gt;--prod-cost&lt;/code&gt; flags. Substitute your own rates; I ship neutral placeholder units.&lt;/li&gt;
&lt;li&gt;Exit code is a CI gate: 0 if judge cost share ≤ budget (default 0.25), 1 if over, 2 on bad input. Deterministic — byte-identical across runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the next piece in a series on &lt;strong&gt;controlling agents before they execute, not after&lt;/strong&gt;. The &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate&lt;/a&gt; gates the agent's &lt;em&gt;action&lt;/em&gt;. The &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;success gate&lt;/a&gt; decides &lt;em&gt;what&lt;/em&gt; to verify in a result. This one is a level up the stack: it doesn't gate the agent at all. It gates the &lt;em&gt;judge&lt;/em&gt; — and asks how much that judge is allowed to cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "judge cost share" actually means
&lt;/h2&gt;

&lt;p&gt;Here's the failure mode I keep seeing. Someone reads that agents silently fail (true) and bolts on an LLM-as-judge to grade every step. Every span: a second model call, often a frontier model, sometimes with a chunky rubric prompt. It works. It catches things. Then the finance person asks why the eval bill is the same order of magnitude as the agent bill, and the honest answer is "because we run a full second model over every single thing the first one does."&lt;/p&gt;

&lt;p&gt;The number that matters is a ratio. Call it &lt;strong&gt;judge cost share&lt;/strong&gt;: the cost of the judging layer divided by the cost of the production run it's judging.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;judge_cost_share = (judge_calls × judge_price) / prod_cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If that's 8%, fine — cheap insurance. If it's 50%, you didn't add a monitor, you added a co-pilot you're paying full freight for and calling overhead. The whole game is shrinking &lt;code&gt;judge_calls&lt;/code&gt;: the number of spans that &lt;em&gt;actually need&lt;/em&gt; a human-grade judgment, versus the spans a dumb deterministic rule can settle for free.&lt;/p&gt;

&lt;p&gt;Most spans don't need a judge. A tool either got called or it didn't. A JSON output either parses or it doesn't. A 200 with an empty body is wrong no matter how confident the prose around it sounds. You don't need a frontier model to know &lt;code&gt;[]&lt;/code&gt; is not a successful invoice send. You need an &lt;code&gt;if&lt;/code&gt; statement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: triage every span, escalate only the uncertain tail
&lt;/h2&gt;

&lt;p&gt;The pre-gate is a function. It looks at one span and returns one of three verdicts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OK&lt;/strong&gt; — cheaply, provably fine. Don't pay to judge it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BAD&lt;/strong&gt; — cheaply, provably broken. Don't pay to judge it either; you already know.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UNCERTAIN&lt;/strong&gt; — the cheap rules abstain. &lt;em&gt;This&lt;/em&gt; is the only span the expensive judge should ever see.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four rules carry almost all the weight. They're the deterministic heuristics Sattyam Jain pointed at ("did the claimed gate execute?") turned into code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claim-vs-evidence.&lt;/strong&gt; The span says it called &lt;code&gt;send_email&lt;/code&gt;, but &lt;code&gt;tools_called&lt;/code&gt; doesn't contain &lt;code&gt;send_email&lt;/code&gt;. Claim without evidence → BAD. (This is the same idea as the &lt;a href="https://finops.spinov.online/blog/your-agent-returns-200-and-lies/" rel="noopener noreferrer"&gt;success gate&lt;/a&gt;'s middle check, reused here as a free triage rule.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output schema.&lt;/strong&gt; The output isn't even a JSON object — it's a raw string, or it's missing. → BAD.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;200-with-empty-payload.&lt;/strong&gt; Status says success, body is empty. The classic silent lie. → BAD.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate retry.&lt;/strong&gt; This span's argument hash equals the previous span's. A byte-identical retry — the &lt;a href="https://finops.spinov.online/blog/waste-probe-tokens-after-failure/" rel="noopener noreferrer"&gt;waste-after-failure&lt;/a&gt; loop signature. → BAD.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If none of those fire and the span has a clean &lt;code&gt;ok: true&lt;/code&gt; + 200, it's &lt;strong&gt;OK&lt;/strong&gt;. Otherwise the rules abstain and it's &lt;strong&gt;UNCERTAIN&lt;/strong&gt; — escalate. Here's the whole triage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;triage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (verdict, rule). UNCERTAIN means &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a human-grade LLM judge is needed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;                      &lt;span class="c1"&gt;# output not valid JSON object
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema:not-an-object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claimed_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claimed_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools_called&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim-without-evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;         &lt;span class="c1"&gt;# said it called X, trace has no X
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# 200 OK with empty payload
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200-empty-payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arg_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arg_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prev_arg_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate-span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;                 &lt;span class="c1"&gt;# byte-identical retry of prior call
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean-success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;                   &lt;span class="c1"&gt;# explicit ok + 200, no contradiction
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UNCERTAIN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs-judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;                  &lt;span class="c1"&gt;# cheap rules abstain -&amp;gt; escalate
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No network, no key, no model. The judge layer is &lt;em&gt;priced&lt;/em&gt;, not called: I count the UNCERTAIN spans and multiply by a price you supply on the command line. I refuse to hardcode a vendor rate — those go stale in a month and I'd rather be honestly empty than confidently wrong about someone's bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The run: 32% to the judge, not 100%
&lt;/h2&gt;

&lt;p&gt;I built two traces of the same 50-span agent — a support-desk bot doing searches, record updates, email sends, classifications, and reply drafts.&lt;/p&gt;

&lt;p&gt;The first, &lt;code&gt;trace_gated.jsonl&lt;/code&gt;, is &lt;strong&gt;well-instrumented&lt;/strong&gt;: each span logs the tool it claimed, the tools actually called, a structured output (an &lt;code&gt;ok&lt;/code&gt; flag where the verdict is clear-cut, a &lt;code&gt;confidence&lt;/code&gt; value or label where it isn't), and an argument hash. The second, &lt;code&gt;trace_naive.jsonl&lt;/code&gt;, is the &lt;em&gt;same agent&lt;/em&gt; logging only free-text outputs like &lt;code&gt;{"text": "email sent"}&lt;/code&gt;, the way a lot of agents actually log in the wild. Same work. Different telemetry.&lt;/p&gt;

&lt;p&gt;Here's the verbatim output. I didn't touch it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 judge_gate.py fixtures/trace_gated.jsonl &lt;span class="nt"&gt;--judge-price&lt;/span&gt; 1 &lt;span class="nt"&gt;--prod-cost&lt;/span&gt; 100
&lt;span class="go"&gt;spans total:        50
resolved by gate:   34 (68.0%)  [OK=29 BAD=5]
sent to LLM judge:  16 (32.0%)
judge cost share:   16.0% of prod cost (judge_price=1.0, prod_cost=100.0, budget=25%)
verdict: PASS - judge layer within budget
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt;
&lt;span class="go"&gt;0

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 judge_gate.py fixtures/trace_naive.jsonl &lt;span class="nt"&gt;--judge-price&lt;/span&gt; 1 &lt;span class="nt"&gt;--prod-cost&lt;/span&gt; 100
&lt;span class="go"&gt;spans total:        50
resolved by gate:   0 (0.0%)  [OK=0 BAD=0]
sent to LLM judge:  50 (100.0%)
judge cost share:   50.0% of prod cost (judge_price=1.0, prod_cost=100.0, budget=25%)
verdict: FAIL - judge layer over budget
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt;
&lt;span class="go"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the two side by side. Same agent, same fifty spans, same &lt;code&gt;--judge-price 1 --prod-cost 100&lt;/code&gt;. The well-instrumented trace sends &lt;strong&gt;16 spans&lt;/strong&gt; to the judge and lands at &lt;strong&gt;16% cost share: a PASS, exit 0&lt;/strong&gt;. The free-text trace can't resolve a single span cheaply, sends all &lt;strong&gt;50&lt;/strong&gt;, and lands at &lt;strong&gt;50%: a FAIL, exit 1&lt;/strong&gt;, tripping Sattyam Jain's "wrong monitor" line by a mile.&lt;/p&gt;

&lt;p&gt;The lever isn't a fancier judge. It's whether your trace carries the four cheap facts a rule can read. Of the 16 spans that did escalate in the gated run, most are genuinely subjective: ambiguous contract summaries (&lt;code&gt;confidence: 0.45&lt;/code&gt;), hedged reply drafts (&lt;code&gt;"I cannot find the order, but it is probably fine."&lt;/code&gt;), borderline intent labels. A handful escalate for a humbler reason — they carry no &lt;code&gt;ok&lt;/code&gt; flag for a cheap rule to confirm, so the gate abstains instead of guessing. Either way, that's the tail you &lt;em&gt;want&lt;/em&gt; a human-grade judge on. The other 34? Five were provably broken (one duplicate retry, two claims with no matching tool call, one 200 with an empty body, one non-object output) and the rest were clean successes. None of those needed a model to adjudicate.&lt;/p&gt;

&lt;p&gt;I want to be precise about a number I almost fudged. The cost figures are &lt;strong&gt;placeholder units&lt;/strong&gt; (&lt;code&gt;judge_price=1&lt;/code&gt;, &lt;code&gt;prod_cost=100&lt;/code&gt;). I am not telling you a judge call costs a dollar or that your run costs a hundred of anything. Plug in your real per-call judge price and your real run cost. The &lt;em&gt;rate&lt;/em&gt;, 32% vs 100% of spans escalating, is the part that's mine: measured, reproducible. The dollars are yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Am I just moving the bug into the gate?
&lt;/h2&gt;

&lt;p&gt;Fair objection, and it's the one I'd raise. If the cheap rules are wrong, you've replaced a $50 judge bill with a 16% bill &lt;em&gt;and&lt;/em&gt; a stack of bad verdicts. So: how good can a cheap layer actually be?&lt;/p&gt;

&lt;p&gt;Two recent papers say: surprisingly good, on the parts that matter. In &lt;em&gt;Cheap Reward Hacking Detection&lt;/em&gt; (arXiv:2606.08893, June 8), Belenky, Itria and Johns put a linear probe on a small transformer encoder and detected reward hacking at &lt;strong&gt;AUC 0.9467, TPR 0.8296 at 5% FPR, at "roughly four orders of magnitude lower per-trajectory cost"&lt;/strong&gt; than an LLM-as-judge baseline. And &lt;em&gt;Goal-Autopilot&lt;/em&gt; (arXiv:2606.11688) reports a gated finite-state machine that &lt;em&gt;"forbids any terminal 'done' claim whose falsifiable gate did not actually execute and pass,"&lt;/em&gt; cutting fabrication on SWE-bench Lite from &lt;strong&gt;33.7% to 0.67%&lt;/strong&gt;. Those are &lt;em&gt;their&lt;/em&gt; numbers on &lt;em&gt;their&lt;/em&gt; setups, not mine. I'm citing them as evidence that a cheap deterministic layer catches most of what a dear one catches, not as my own result.&lt;/p&gt;

&lt;p&gt;My four &lt;code&gt;if&lt;/code&gt; statements are cruder than a trained probe. They don't need to be clever. They need to be &lt;em&gt;right when they're confident and silent when they're not&lt;/em&gt; — which is the whole point of the UNCERTAIN bucket. A rule that isn't sure doesn't guess. It escalates. The judge still grades the hard 32%. You just stopped paying it to rubber-stamp the easy 68%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not an eval suite.&lt;/strong&gt; It doesn't score answer quality. It decides &lt;em&gt;which spans deserve a judge&lt;/em&gt;, then prices that layer. Correctness of the hard tail is still the judge's job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a runtime cap.&lt;/strong&gt; It reads a finished trace and fails CI. If you need to &lt;em&gt;block&lt;/em&gt; a runaway loop mid-flight, that's a &lt;a href="https://finops.spinov.online/blog/sliding-window-spend-guard/" rel="noopener noreferrer"&gt;sliding-window spend guard&lt;/a&gt;, a different tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a verdict on confidence fields.&lt;/strong&gt; Honest limitation: my gate ignores a span's self-reported &lt;code&gt;confidence&lt;/code&gt;. One span in the fixture says &lt;code&gt;confidence: 0.95, "no ambiguity"&lt;/code&gt; and still got escalated, because I refuse to trust a model's own confidence as a cheap signal — that's the kind of self-assessment that lies. If you trust yours, add a fifth rule. I didn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a license to skip the judge.&lt;/strong&gt; The judge gets the genuinely uncertain spans. The argument is against running it on the &lt;em&gt;obvious&lt;/em&gt; ones, not against running it at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Run it on your own trace
&lt;/h2&gt;

&lt;p&gt;Export 40–60 spans of a real agent run to JSONL with six fields per span (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;claimed_tool&lt;/code&gt;, &lt;code&gt;tools_called&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;arg_hash&lt;/code&gt;, and &lt;code&gt;prev_arg_hash&lt;/code&gt; carrying the previous span's hash so the duplicate-retry rule can fire), point &lt;code&gt;judge_gate.py&lt;/code&gt; at it, and pass your real &lt;code&gt;--judge-price&lt;/code&gt; and &lt;code&gt;--prod-cost&lt;/code&gt;. If your judge cost share comes back under 10%, ignore me; your monitor's fine. If it comes back at 40%, you've found a line item.&lt;/p&gt;

&lt;p&gt;One thing I genuinely don't know yet and would put real money on being argued in the comments: where the honest threshold is. Sattyam Jain says 20–25%. I shipped a default of 25%. But for a low-stakes summarizer, even 10% might be waste, and for an agent that moves money, maybe 40% is cheap. The budget is a &lt;code&gt;--flag&lt;/code&gt; precisely because I don't think there's one right answer.&lt;/p&gt;

&lt;p&gt;So I'll ask you: what's the judge cost share on a real eval pipeline you've shipped — and where would &lt;em&gt;you&lt;/em&gt; set the budget before it counts as the wrong monitor?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I publish one runnable FinOps tool for AI agents at a time, with the real run log attached. Follow for the next number from the next trace — and drop your judge cost share in the comments, I read every one.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>finops</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your Failed Agent Run Burns Most of Its Tokens AFTER It Fails — Measure It in 40 Lines</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:26:36 +0000</pubDate>
      <link>https://dev.to/alex_spinov/your-failed-agent-run-burns-most-of-its-tokens-after-it-fails-measure-it-in-40-lines-4ef</link>
      <guid>https://dev.to/alex_spinov/your-failed-agent-run-burns-most-of-its-tokens-after-it-fails-measure-it-in-40-lines-4ef</guid>
      <description>&lt;p&gt;&lt;strong&gt;Wasted tokens after agent failure&lt;/strong&gt; are the part nobody meters. A clean agent run and a failed one cost about the same to start; the bill diverges &lt;em&gt;after&lt;/em&gt; the run is already lost. This post measures that tail — the token fraction your run keeps burning past its first failure signal — with a 40-line offline meter.&lt;/p&gt;

&lt;p&gt;Here's the number that made me write this. In a 2026 paper on multi-agent observability, researchers measured 165 GAIA traces and found that &lt;strong&gt;among warned failed runs, 58.1% of tokens are spent after the first warning signal, on average.&lt;/strong&gt; First the warning fires (a tool error, a loop, a budget-pressure flag), and then the agent keeps going for more than half the run's tokens before it stops. Read the citation carefully: that 58.1% is &lt;em&gt;their&lt;/em&gt; number, on &lt;em&gt;warned failed runs&lt;/em&gt; specifically, not all runs and not my measurement. I'll keep those separated all the way down.&lt;/p&gt;

&lt;p&gt;The point I want to land: &lt;strong&gt;waste is not failure.&lt;/strong&gt; The failure is the cheap part. What's expensive is the distance between "this run is clearly off the rails" and "the agent actually stopped." That gap is denominated in tokens, and you can measure it on your own logs in about a minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A failed agent run spends most of its tokens &lt;em&gt;after&lt;/em&gt; the first detectable failure signal — the published figure is &lt;strong&gt;58.1% on warned failed runs&lt;/strong&gt; (arXiv 2606.01365, 165 GAIA traces).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;waste_probe.py&lt;/code&gt; is a 40-line, offline, keyless, read-only script. Feed it a JSON trace; it finds the first signal and prints the token share burned at and after it.&lt;/li&gt;
&lt;li&gt;On my own loopy fixture it measured &lt;strong&gt;82.9% waste&lt;/strong&gt; (707/853 tokens) — that's &lt;em&gt;my&lt;/em&gt; run on &lt;em&gt;my&lt;/em&gt; fixture, a separate number from the paper's.&lt;/li&gt;
&lt;li&gt;Exit code is a CI gate: 0 if waste ≤ threshold, 1 if over, 2 on usage error. Wire it onto collected traces.&lt;/li&gt;
&lt;li&gt;This is a post-mortem meter, not a runtime cap. It tells you where it burned; it does not block anything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where the wasted tokens after agent failure actually come from: first signal is not first stop
&lt;/h2&gt;

&lt;p&gt;Think about what a failed run actually looks like in the log. It's rarely one clean explosion. It's a tool that returns a 200 OK with a slightly different JSON schema, a parser that throws &lt;code&gt;KeyError: 'close'&lt;/code&gt;, and an agent whose retry branch assumes flakiness — so it fires the &lt;em&gt;exact same request&lt;/em&gt; again. And again. The payload never changes. The agent never reparses. It just loops, paying full freight on every turn, narrating its own confusion in increasingly confident prose.&lt;/p&gt;

&lt;p&gt;By the time anything stops that run, the diagnostic information was available at the first error. Everything after it is re-derivation of a conclusion the trace already contained. That's the waste. Not the failure itself, but the &lt;em&gt;persistence past&lt;/em&gt; the failure.&lt;/p&gt;

&lt;p&gt;I want to be precise about what this is and isn't, because two adjacent ideas already have tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is not the re-bill tax.&lt;/strong&gt; The &lt;a href="https://finops.spinov.online/blog/context-tax-measure-transcript-rebill/" rel="noopener noreferrer"&gt;context tax that re-bills your transcript every step&lt;/a&gt; is about the growing history being re-sent as input on &lt;em&gt;every&lt;/em&gt; step — compounding, n(n+1)/2. That's a cost you pay on a &lt;em&gt;healthy&lt;/em&gt; run too. Waste-after-signal is different: it's the excess work &lt;em&gt;specific to the failure tail&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not stale memory either.&lt;/strong&gt; Dead-weight from a memory store — the kind you find when you &lt;a href="https://finops.spinov.online/blog/agent-memory-tax-and-backdoor/" rel="noopener noreferrer"&gt;audit your agent's memory tax and backdoor&lt;/a&gt; — is retained-but-irrelevant context across runs. Waste-after-signal is within a &lt;em&gt;single&lt;/em&gt; failed run, after a &lt;em&gt;specific&lt;/em&gt; trigger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not a spend cap.&lt;/strong&gt; A &lt;a href="https://finops.spinov.online/blog/sliding-window-spend-guard/" rel="noopener noreferrer"&gt;sliding-window spend guard that blocks the runaway loop at runtime&lt;/a&gt; stops execution when a dollar ceiling is hit. &lt;code&gt;waste_probe.py&lt;/code&gt; blocks nothing. It reads a finished trace and reports a ratio. The gate it controls is your CI pipeline's exit code, not the agent's next call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the question it answers is narrow and falsifiable: &lt;strong&gt;in a given trace, what fraction of tokens landed at or after the first failure signal?&lt;/strong&gt; If that fraction is reliably small on your real logs, this whole thesis is wrong for you, and the tool will say so with exit 0. I'd rather you find that out than take my word.&lt;/p&gt;

&lt;p&gt;One more thing worth flagging, because it makes the tail bigger, not smaller: fan-out. Anthropic's Dynamic Workflows (a research preview shipped late May 2026) let a run spawn tens to hundreds of parallel subagents, capped at 16 concurrent and 1,000 total per run. InfoQ's writeup notes the obvious — these "can consume substantially more tokens than a typical session." Now imagine the loop above, but forked across a dozen subagents that each missed the same schema change. The failure tail doesn't add up. It multiplies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool: waste_probe.py
&lt;/h2&gt;

&lt;p&gt;The design constraints first, because they're the reason you can actually run this on a production trace without a security review: &lt;strong&gt;offline, keyless, read-only, zero network.&lt;/strong&gt; It reads one JSON file and prints. No vendor SDK, no API key, no telemetry leaving your machine. Tokenization is real (&lt;code&gt;tiktoken&lt;/code&gt; with the &lt;code&gt;o200k_base&lt;/code&gt; encoding), with an honest &lt;code&gt;len/4&lt;/code&gt; fallback if &lt;code&gt;tiktoken&lt;/code&gt; isn't installed; that fallback is roughly ±15% off true BPE, and it says so in the output.&lt;/p&gt;

&lt;p&gt;The input is a trace: an ordered list of steps, each a small object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'll fetch the Q2 close..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_quote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EXMPL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"period"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Q2"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{...}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_quote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole thing is 40 lines. Here's the core — tokenization, signal detection, and the gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;step_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;first_signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool-error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                         &lt;span class="c1"&gt;# only a CALL counts as a loop, not a tool result
&lt;/span&gt;            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repeat tool+args (loop)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two signal types, and the earliest one wins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;status == "error"&lt;/code&gt;&lt;/strong&gt; — an explicit tool error. The cheapest possible signal, and the one most logs already have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A repeated identical &lt;code&gt;(tool, tool_args)&lt;/code&gt; pair&lt;/strong&gt; — the agent called the same tool with byte-identical arguments it already tried. That's a loop, or low-information-gain retry, and it's a signal even when nothing errored.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That second check has a bug I hit on the first run, which is worth admitting because it's a real trap. My first version keyed the loop on every step that had a &lt;code&gt;tool&lt;/code&gt; field. But tool &lt;em&gt;results&lt;/em&gt; (&lt;code&gt;role: "tool"&lt;/code&gt;) also carry a &lt;code&gt;tool&lt;/code&gt; field, and two results from the same tool with no &lt;code&gt;tool_args&lt;/code&gt; produced an identical empty key, firing a false loop on a perfectly clean trace. The fix is the &lt;code&gt;if "tool_args" in s&lt;/code&gt; guard: only a &lt;em&gt;call&lt;/em&gt; can be a loop, never a result. My clean fixture went from a wrong 45.4% to a correct 0.0% after that one line. Detection logic is exactly where these tools quietly lie to you, so I keep the fixtures adversarial.&lt;/p&gt;

&lt;p&gt;Once it has the first signal index, the rest is arithmetic: sum the tokens at and after that index, divide by the total, convert to dollars at a &lt;em&gt;configurable&lt;/em&gt; rate (the default &lt;code&gt;$5/1M&lt;/code&gt; is a placeholder you override with &lt;code&gt;--price-per-1m&lt;/code&gt;, not a vendor quote), and set the exit code against a threshold (default &lt;code&gt;0.30&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  The real run
&lt;/h2&gt;

&lt;p&gt;I ran this live on Python 3.13.5 with real &lt;code&gt;tiktoken o200k_base&lt;/code&gt;. Two fixtures: a clean linear run, and a loopy one where the agent retries an identical failing call seven times. Verbatim output, nothing edited:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 waste_probe.py trace_clean.json
&lt;span class="go"&gt;trace: trace_clean.json  (tiktoken o200k_base (exact))
steps: 8   total tokens: 326
token curve: 37 54 38 49 38 44 10 56
first signal: none - clean run
waste_after_signal: 0.0%  (0/326 tokens)
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;wasted after signal: &lt;span class="nv"&gt;$0&lt;/span&gt;.000000  &lt;span class="o"&gt;(&lt;/span&gt;at &lt;span class="nv"&gt;$5&lt;/span&gt;.0/1M tok&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;gate: PASS (threshold 0.30)  exit=0

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 waste_probe.py trace_loopy.json
&lt;span class="go"&gt;trace: trace_loopy.json  (tiktoken o200k_base (exact))
steps: 14   total tokens: 853
token curve: 37 58 51 77 51 74 51 72 51 78 51 79 51 72
first signal: step 3 (tool-error)
waste_after_signal: 82.9%  (707/853 tokens)
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;wasted after signal: &lt;span class="nv"&gt;$0&lt;/span&gt;.003535  &lt;span class="o"&gt;(&lt;/span&gt;at &lt;span class="nv"&gt;$5&lt;/span&gt;.0/1M tok&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;gate: FAIL (threshold 0.30)  exit=1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read that loopy curve left to right. The run climbs normally (&lt;code&gt;37 58 51 77&lt;/code&gt;), then at step 3 the first &lt;code&gt;tool-error&lt;/code&gt; fires. After that, look at the rhythm: &lt;code&gt;51 74 51 72 51 78 51 79 51 72&lt;/code&gt;. The agent reissues the identical request, gets back the identical 200-OK payload it can't parse, narrates a fresh theory about why it'll work next time, and repeats. Five more round trips of pure re-derivation. &lt;strong&gt;707 of 853 tokens, 82.9%, landed after the first signal.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That 82.9% is my number, on my fixture. It is not the paper's 58.1%. The paper measured real GAIA traces across many runs and reported an &lt;em&gt;average&lt;/em&gt;; I built one deliberately loopy trace to show the mechanism cleanly, and a single contrived trace will sit higher than a population average. Same phenomenon, two completely different denominators. If I ever quote one as the other, call me on it.&lt;/p&gt;

&lt;p&gt;A note on the dollar figure: &lt;code&gt;$0.003535&lt;/code&gt; is tiny because the fixture is tiny. It scales linearly with token count and with whatever real rate you pass. The ratio is the durable signal; the dollars just put it in units a finance person reads. Run it at your own rate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 waste_probe.py trace_loopy.json &lt;span class="nt"&gt;--price-per-1m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15.0
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;wasted after signal: &lt;span class="nv"&gt;$0&lt;/span&gt;.010605  &lt;span class="o"&gt;(&lt;/span&gt;at &lt;span class="nv"&gt;$15&lt;/span&gt;.0/1M tok&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the gate is real. No args returns exit 2; clean returns 0; loopy returns 1 — so a CI job can branch on it. The output is deterministic, too: I hashed the loopy run twice and got identical &lt;code&gt;sha256&lt;/code&gt; both times. No clock, no randomness, no network. Same trace in, same number out, every time — which is the only way a CI gate is worth having.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do with it on Monday
&lt;/h2&gt;

&lt;p&gt;Three uses, in rough order of effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Measure your own ratio first.&lt;/strong&gt; Take one real failed trace you already have — convert it to the &lt;code&gt;[{role, content, tool, tool_args, status}]&lt;/code&gt; shape, run the probe, and read the percentage. Don't assume it's 58% and don't assume it's 83%. Measure &lt;em&gt;yours&lt;/em&gt;. The whole reason this tool is keyless and offline is so you can do that on a real production log without asking anyone's permission — the same measure-first habit behind &lt;a href="https://finops.spinov.online/blog/mcp-server-token-tax/" rel="noopener noreferrer"&gt;metering your MCP server's per-tool token tax&lt;/a&gt; before you cut anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Wire it as a CI gate on collected traces.&lt;/strong&gt; If you save agent traces (and for anything in production you should), drop &lt;code&gt;waste_probe.py&lt;/code&gt; into the pipeline that ingests them. &lt;code&gt;exit 1&lt;/code&gt; means "this run burned more than 30% of its tokens after it had already failed" — which is a regression worth a red build. Tune &lt;code&gt;--threshold&lt;/code&gt; to your reality; &lt;code&gt;0.30&lt;/code&gt; is a starting line, not a law. I picked it deliberately, and here's why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Close the loop toward early stopping.&lt;/strong&gt; The same paper that gave us 58.1% ran a small pilot where acting on the early warning cut the post-warning token fraction from &lt;strong&gt;0.638 down to 0.304&lt;/strong&gt;. That's the entire game in one before/after: detect the signal, stop near it, and the tail collapses. The probe is the measurement half. The other half is your agent's retry logic actually treating a repeated-identical-call as terminal instead of transient — which, going back to the loopy fixture, is exactly the bug the agent never fixed about itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;Honesty about the edges, because the tool is small on purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not a runtime blocker.&lt;/strong&gt; It reads finished traces. It will not stop a burning run mid-flight — that's a spend cap's job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a detector of &lt;em&gt;all&lt;/em&gt; waste.&lt;/strong&gt; It catches two signals: explicit tool errors and byte-identical repeated calls. A semantically pointless-but-textually-different retry slips right past it. So does a wrong-but-confident answer with no error at all. The repeated-call check is exact-match by design — it's high-precision and deliberately low-recall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a substitute for an eval.&lt;/strong&gt; A waste ratio tells you &lt;em&gt;how much&lt;/em&gt; burned after the signal, never whether the final answer was correct. You still need both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The threshold is a guess until you calibrate it.&lt;/strong&gt; &lt;code&gt;0.30&lt;/code&gt; echoes the paper's post-intervention &lt;code&gt;0.304&lt;/code&gt;, which is a nice coincidence and nothing more. Your number is your number.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thesis, one more time, falsifiable: in a failed run, the expensive tokens come &lt;em&gt;after&lt;/em&gt; the first detectable signal, not before. The published average on warned failed runs is 58.1%. My loopy fixture hit 82.9%. Now go get yours — and if it comes back consistently under 30%, you've got a healthier stop condition than most, and I'd genuinely like to hear how you built it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Which signal trips first on your traces — the error or the loop? And what finally makes your agent treat a repeated identical call as terminal instead of transient? Drop it in the comments; that retry-is-terminal question is the one I'm still chewing on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Alexey Spinov. The tool, both fixtures, and the verbatim run output are included with this post — clone, run, and check the numbers against your own logs. Disclosure: I drafted this with AI assistance and ran, verified, and edited every line and number myself; the live output above is from my own machine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>finops</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>The Context Tax: Why Step 12 Costs 42x Step 1 (Measure It in 40 Lines)</title>
      <dc:creator>Alexey Spinov</dc:creator>
      <pubDate>Wed, 17 Jun 2026 19:25:08 +0000</pubDate>
      <link>https://dev.to/alex_spinov/the-context-tax-why-step-12-costs-42x-step-1-measure-it-in-40-lines-213p</link>
      <guid>https://dev.to/alex_spinov/the-context-tax-why-step-12-costs-42x-step-1-measure-it-in-40-lines-213p</guid>
      <description>&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt; the &lt;strong&gt;context tax&lt;/strong&gt; is what you pay when every agent step re-sends the whole session transcript as input again, so step N re-bills turns 1..N and total cost grows with n(n+1)/2. Cheaper tokens lower the unit, not the shape. &lt;code&gt;context_tax.py&lt;/code&gt; meters the re-bill multiplier offline; one debugging session measured &lt;strong&gt;42.8x&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI disclosure:&lt;/strong&gt; I drafted this with an AI writing assistant. The tool, the fixtures, and every number below come from a real local run of the script in this post on tiktoken o200k_base. I reviewed and edited it before publishing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Token prices have been sliding all year. Your agent bill probably hasn't.&lt;/p&gt;

&lt;p&gt;I kept running into the same confusion in my own FinOps notes: per-token rates drop, and the monthly number goes the other way. The usual answers ("you're using a bigger model," "you have more users") didn't explain a single session getting more expensive &lt;em&gt;as it ran&lt;/em&gt;. So I wrote a 40-line meter to look at the one thing nobody charts: the session transcript itself. On a synthetic-but-realistic debugging session, the last step billed &lt;strong&gt;42.8x&lt;/strong&gt; the input of the first step. Same model. Same task. No new users.&lt;/p&gt;

&lt;p&gt;That gap has a boring cause and an annoying consequence. Here's both, plus the script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Every step of an agent loop re-sends the whole conversation so far (history plus tool outputs) as &lt;em&gt;input&lt;/em&gt;. So step N pays for turns 1..N again, and total input grows roughly with n(n+1)/2. Cheaper tokens don't fix the shape; they just lower the unit on a number that's still climbing. &lt;code&gt;context_tax.py&lt;/code&gt; (below, keyless, offline) meters three things from a session JSON: the re-bill curve, the re-bill multiplier, and a dead-weight estimate. On my bloated fixture it reported a 42.8x multiplier and 19.3% dead weight, and exited 1 as a CI gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a transcript gets billed again every single step
&lt;/h2&gt;

&lt;p&gt;Here's the part that trips people up. An LLM call is stateless. The model doesn't "remember" turn 3 when you make turn 12. Your framework re-sends turns 1 through 11 as input so the model can see them. Every. Single. Step.&lt;/p&gt;

&lt;p&gt;So the cost of one step isn't the cost of that step's new text. It's the cost of the entire history up to that point. Step 1 bills a short user message. Step 12 bills the user message &lt;em&gt;plus&lt;/em&gt; a file dump &lt;em&gt;plus&lt;/em&gt; a wide grep &lt;em&gt;plus&lt;/em&gt; a stack trace &lt;em&gt;plus&lt;/em&gt; every assistant reply in between. The new tokens at step 12 might be tiny. The billed input is not.&lt;/p&gt;

&lt;p&gt;Logan (Waxell) put the shape plainly in &lt;em&gt;&lt;a href="https://dev.to/waxell/ai-agent-context-window-cost-the-compounding-math-your-architecture-is-hiding-2227"&gt;The Compounding Math Your Architecture Is Hiding&lt;/a&gt;&lt;/em&gt;: "total cost grows roughly with n(n+1)/2," and a turn-10 context can sit at 80,000–200,000 tokens. That post nails the problem and then points you at a proprietary runtime. I wanted the opposite: a tiny script I can run on my own transcript and check into CI. So that's what this is.&lt;/p&gt;

&lt;p&gt;And it's why "tokens got cheaper" is the wrong consolation. Edwin Lisowski's &lt;em&gt;&lt;a href="https://medium.com/@elisowski/token-prices-are-falling-so-why-is-your-ai-bill-going-up-b9bc1a894b1c" rel="noopener noreferrer"&gt;Token Prices Are Falling. So Why Is Your AI Bill Going Up?&lt;/a&gt;&lt;/em&gt; lists the drivers: full context re-sent each step, tool schemas eating 30–60% of the window before any user content, retries and sub-agents running around the clock. That schema overhead is a sibling tax worth metering on its own — I did exactly that for MCP servers in &lt;a href="https://finops.spinov.online/blog/mcp-server-token-tax/" rel="noopener noreferrer"&gt;measure your MCP server's token tax&lt;/a&gt;, where the tool definitions are billed on every call before a single user token. His illustrations are blunt. He cites AT&amp;amp;T going from 1B to 27B tokens/day over 18 months. (Those are Lisowski's examples, not my measurements; I'm attributing them to him.) Cheaper unit, bigger n. The unit lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can't estimate this. You have to measure it.
&lt;/h2&gt;

&lt;p&gt;There's a second reason to meter instead of guess. Agents are bad at predicting their own spend.&lt;/p&gt;

&lt;p&gt;The arXiv paper &lt;em&gt;&lt;a href="https://arxiv.org/abs/2604.22750" rel="noopener noreferrer"&gt;How Do AI Agents Spend Your Money?&lt;/a&gt;&lt;/em&gt; (Bai, Huang, Wang, Sun, Mihalcea, Brynjolfsson, Pentland, Pei) measured agentic coding tasks and found three things worth pinning to the wall: agentic runs burn roughly &lt;strong&gt;1000x&lt;/strong&gt; the tokens of a plain code-chat; the &lt;em&gt;same&lt;/em&gt; task can vary up to &lt;strong&gt;30x&lt;/strong&gt; in cost run to run; and models "fail to accurately predict their own token usage," with correlations up to just &lt;strong&gt;0.39&lt;/strong&gt;. A 0.39 correlation is barely better than a shrug.&lt;/p&gt;

&lt;p&gt;So the takeaway writes itself: meter the transcript, don't trust the estimate. If the model can't call its own number, your gut can't either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool: 40 lines, no API key
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;context_tax.py&lt;/code&gt; reads one JSON file: a session transcript as a list of turns (&lt;code&gt;role&lt;/code&gt; + &lt;code&gt;content&lt;/code&gt;, tool results included). It tokenizes with &lt;code&gt;tiktoken&lt;/code&gt;'s &lt;code&gt;o200k_base&lt;/code&gt; and reports four things.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Re-bill curve&lt;/strong&gt; — billed input at each step, i.e. the cumulative history 1..N re-sent on that call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-bill multiplier&lt;/strong&gt; — billed input at the last step ÷ billed input at the first step. How much more "one call" costs at the end versus the start.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead-weight %&lt;/strong&gt; — tokens from early turns whose terms basically never resurface in later turns (a turn counts as dead if under 15% of its terms reappear downstream). Stuff you keep paying to re-send that the model isn't really using. It's the same dead-weight idea I applied to persistent stores in &lt;a href="https://finops.spinov.online/blog/agent-memory-tax-and-backdoor/" rel="noopener noreferrer"&gt;auditing your agent's memory tax&lt;/a&gt; — there the stale entries ride along on every retrieval; here they ride along on every step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$ / session&lt;/strong&gt; — total billed input across all steps × a rate you pass in. The rate is a &lt;em&gt;parameter&lt;/em&gt;, not a hardcoded vendor price — this is a compounding illustration, not your invoice.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The exit code is the point. &lt;code&gt;0&lt;/code&gt; if the multiplier is under threshold (a disciplined session), &lt;code&gt;1&lt;/code&gt; if it's over (the architecture is compounding, so fail the build), &lt;code&gt;2&lt;/code&gt; for usage. Drop it in CI and a session that balloons becomes a red check, not a surprise line item.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;context_tax.py - meter the re-bill tax on a single agent session&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s transcript.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;12.0&lt;/span&gt;         &lt;span class="c1"&gt;# re-bill multiplier above this = compounding architecture
&lt;/span&gt;&lt;span class="n"&gt;DEAD_OVERLAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;      &lt;span class="c1"&gt;# a turn is dead weight if &amp;lt;15% of its terms resurface later
&lt;/span&gt;&lt;span class="n"&gt;STOP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the a an of to in is it on for and or but with as at by from this that be are was you your i we they it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;
    &lt;span class="n"&gt;_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o200k_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;TOKENIZER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tiktoken o200k_base (exact)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                              &lt;span class="c1"&gt;# honest fallback, ~+-15% vs real BPE
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;TOKENIZER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;len/4 heuristic (tiktoken not installed; ~+-15%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[a-z0-9_]{4,}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;STOP&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage: context_tax.py &amp;lt;session_transcript.json&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_usd_per_mtok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;     &lt;span class="c1"&gt;# $/1M input tok; configurable, NOT a vendor quote
&lt;/span&gt;    &lt;span class="n"&gt;turns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;later&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;                        &lt;span class="c1"&gt;# step n re-bills the running history 1..n
&lt;/span&gt;        &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;later&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;DEAD_OVERLAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;dead&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;mult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total_billed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context_tax | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | tokenizer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TOKENIZER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | rate=$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/Mtok | threshold x&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;THRESHOLD&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;bar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  billed_input=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;t  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  re-bill multiplier (step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / step 1) : x&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mult&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  dead-weight (never referenced later)     : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dead&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;t = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dead&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;billed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% of the final payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  total billed input across session        : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_billed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;t  ($&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_billed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/Mtok)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  exit                                     : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mult&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mult&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No key, no network, read-only. &lt;code&gt;pip install tiktoken&lt;/code&gt;, point it at a transcript JSON, done. If &lt;code&gt;tiktoken&lt;/code&gt; isn't installed it falls back to a len/4 heuristic and says so out loud (~±15% off real BPE). I'd rather print the caveat than pretend the number is exact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real run
&lt;/h2&gt;

&lt;p&gt;Two fixtures ship with the script. Both are synthetic coding sessions (no private data) but shaped like the real thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;session_lean.json&lt;/code&gt;&lt;/strong&gt; is a disciplined session: small tool outputs, and a deliberate scope reset before the second task. Here's the actual output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context_tax | session_lean.json | tokenizer: tiktoken o200k_base (exact) | rate=$3.0/Mtok | threshold x12.0
------------------------------------------------------------------------------
  step  1  billed_input=    25t  ####
  step  2  billed_input=    46t  ########
  step  3  billed_input=    75t  #############
  ...
  step 10  billed_input=   235t  ########################################
------------------------------------------------------------------------------
  re-bill multiplier (step 10 / step 1) : x9.4
  dead-weight (never referenced later)     : 56t = 23.8% of the final payload
  total billed input across session        : 1335t  ($0.0040 at $3.0/Mtok)
  exit                                     : 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiplier 9.4x, under the 12x threshold, exit 0. Green. Note the dead weight is still 23.8%: that's the first task's context the model no longer needs in the second task. Even a clean session carries dead weight until you actually trim. The scope reset kept the multiplier down; it didn't zero the waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;session_bloated.json&lt;/code&gt;&lt;/strong&gt; is the one that hurts. A 12-step debugging session that never trims: a full module dump, a wide repo grep, a long stack trace, and the kicker, a verbose &lt;code&gt;pip check&lt;/code&gt; dependency log that gets re-sent on every step after it. Real output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context_tax | session_bloated.json | tokenizer: tiktoken o200k_base (exact) | rate=$3.0/Mtok | threshold x12.0
------------------------------------------------------------------------------
  step  1  billed_input=    40t  #
  step  2  billed_input=    72t  ##
  step  3  billed_input=   421t  ##########
  step  4  billed_input=   480t  ###########
  step  5  billed_input=   857t  ####################
  ...
  step 12  billed_input=  1713t  ########################################
------------------------------------------------------------------------------
  re-bill multiplier (step 12 / step 1) : x42.8
  dead-weight (never referenced later)     : 331t = 19.3% of the final payload
  total billed input across session        : 11774t  ($0.0353 at $3.0/Mtok)
  exit                                     : 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;42.8x. Over threshold, exit 1: a failed build. Watch step 3 in the curve. The full file dump jumps billed input from 72 to 421 tokens, and you pay that bump again on every one of the nine steps that follow. The 331 dead-weight tokens are mostly that &lt;code&gt;pip check&lt;/code&gt; log (boto3 versions, urllib3 pins) that never came up again but kept riding along in the payload.&lt;/p&gt;

&lt;p&gt;Both numbers are reproducible. I hashed two consecutive bloated runs with &lt;code&gt;shasum -a 256&lt;/code&gt; and got identical digests, so the output is deterministic, not a fluke of one run.&lt;/p&gt;

&lt;p&gt;One honest correction. I'd guessed the multiplier would land near 16x when I started (that's the figure floating around the n(n+1)/2 discussions). The real run said 42.8x. The bloated fixture front-loads a big file dump on a small first turn, which stretches the ratio. The lesson isn't "16x vs 42x." It's that &lt;strong&gt;the number depends entirely on your transcript shape, which is exactly why you measure your own instead of borrowing mine.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What to actually do about it
&lt;/h2&gt;

&lt;p&gt;The fixes aren't exotic. The point of the meter is to tell you which one you need, and to prove it worked.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scope-reset between tasks.&lt;/strong&gt; The lean fixture does this: drop the prior task's context before starting the next one. It's the difference between 9.4x and 42.8x here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trim or summarize fat tool outputs.&lt;/strong&gt; That &lt;code&gt;pip check&lt;/code&gt; dump was 19.3% dead weight. Replace a 300-token log with a one-line "deps OK, no conflicts" and you stop re-billing it nine times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling summarization&lt;/strong&gt; for long sessions: collapse old turns into a short recap once they're settled, instead of carrying them verbatim.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then re-run the meter. If the multiplier drops back under threshold, the exit code flips to 0 and your CI gate goes green. That's the whole loop: measure, cut, prove. Not "trust me, I optimized it," but a number that moved. This meter slots in alongside the other checks in my &lt;a href="https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/" rel="noopener noreferrer"&gt;pre-execution gate for AI agents&lt;/a&gt; — same philosophy, fail fast before the spend, not after the invoice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is NOT (so I don't oversell it)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; block or cap anything at runtime. It's a meter and a CI gate, not a spend guard. If you want the runtime brake that stops a session mid-loop, that's a different tool — see the &lt;a href="https://finops.spinov.online/blog/sliding-window-spend-guard/" rel="noopener noreferrer"&gt;sliding-window spend guard&lt;/a&gt;, which caps cumulative cost over a window instead of just measuring it after the fact.&lt;/li&gt;
&lt;li&gt;It does &lt;strong&gt;not&lt;/strong&gt; compute your real provider invoice. The &lt;code&gt;$/session&lt;/code&gt; figure uses a rate &lt;em&gt;you&lt;/em&gt; pass in, to illustrate compounding. Your actual bill depends on caching, batching, output tokens, and your vendor's pricing — none of which this models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead-weight is a lexical heuristic, and it has false positives.&lt;/strong&gt; "Under 15% of terms resurface later" is a proxy for "the model stopped using this," not proof of it. The model may have leaned on an early turn implicitly without repeating its words. On my bloated fixture the stack trace landed at 0.16 overlap, just above the line, correctly kept, because the fix really did reference it. Treat the % as a flag to go look, not a verdict.&lt;/li&gt;
&lt;li&gt;It does not optimize your context for you. It tells you where the tax is. The cutting is still your call.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;What's the worst re-bill multiplier you've measured on one of your own long sessions? Run the script on a real transcript and tell me in the comments. I'm collecting shapes, and I read every reply. Follow for the next number from the next run.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>finops</category>
      <category>python</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
