DEV Community: Alexey Spinov

Your A/B eval is paired. Your stat test probably isn't.

Alexey Spinov — Tue, 21 Jul 2026 03:03:28 +0000

Paired eval, wrong test: two prompts scored on one 100-item set give paired outcomes, so ranking them needs McNemar, not the two-proportion Wald SE. On the same data, Wald read 1.01 SE and said 'collect more'; McNemar read 2.65 SE and allowed the ranking. Same 100 items, opposite decision.

I shipped a little eval helper that refuses to rank two configurations when their gap is smaller than two standard errors. The idea is good. The refusal is honest. Last week it looked at two prompts scored on the same 100-item set and printed this:

RANK: INDISTINGUISHABLE - gap 7.00 pp against 6.95 pooled SE = 1.01 SE < 2.0. Ranking "prompt B" above "prompt A" is NOT allowed.

Seven points of apparent improvement, called noise. The honest reading of that line is "collect more items." And it was wrong. Not a rounding error, not a close call. The right test on that exact data says the ranking is already decided, and I could have stopped.

The bug was not in the arithmetic. It was in which test the arithmetic ran. My helper was treating two paired runs as if they were two independent samples.

TL;DR

Two prompts on one task set are paired: each task gives a pair (pass_A, pass_B), and both-pass / both-fail tasks carry zero information about the difference.
My harness used the Wald SE of a difference of two proportions, pooled = sqrt(se_A**2 + se_B**2). That is the independent-samples formula; it ignores the pairing entirely. In my showcase that cost real power.
The paired test is McNemar: SE = 100*sqrt(b+c)/n, using only the discordant counts b and c. On the same 100 items it returned 2.65 SE, not 1.01, and ranking was allowed. With just 7 discordant pairs the tool also prints the exact binomial, p=0.0156 (z-equivalent 2.42), so the decision does not ride on the normal approximation.
On 535 real paired observations from my own sweep, one pair reads 2.65 SE under Wald and 6.24 under McNemar with c=0: a strictly nested result that Wald reports as a hair over the line and has no way to flag as deterministic.
The fix is about fifteen lines. The hard part is not the formula, it is noticing your runs are paired.

What this is and is not. The two-prompt table below is a constructed example: I picked the counts to sit exactly on the line where the two tests disagree, so you can reproduce it with four numbers. The 535-observation table comes from a synthetic marker fixture I wrote, not from anyone's production system. Where a number is forced by my constants rather than measured, I say so in the same paragraph. That habit is half the point.

Why two prompts on one set are paired

Run prompt A and prompt B on the same 100 eval tasks. Task 7 either trips both of them, or neither, or exactly one. That last group is the only one that tells you which prompt is better. The tasks where both pass and the tasks where both fail are shared difficulty: they move both pass rates together and say nothing about the gap.

Here is the case that caught me, laid out as a 2x2:

                     B pass   B fail
        A pass          55        0     <- b = 0 (A pass, B fail)
        A fail           7       38     <- c = 7 (A fail, B pass)
        A marginal = 55/100   B marginal = 62/100   n = 100 pairs

Prompt A passes 55, prompt B passes 62. Same items. Ninety-three of the hundred tasks are concordant: 55 both pass, 38 both fail. Seven tasks are discordant, and all seven go the same way, B passes where A failed. Zero go the other way. B's pass set contains A's pass set completely. That is a strong statement, and it is invisible to a test that only looks at the two marginal rates.

What my harness printed, and why it was the wrong number

The refusal came out of rank(). Under the hood it builds the pooled standard error the way you were taught for two independent proportions:

    pooled = sqrt(p1.se ** 2 + p2.se ** 2)

Each se is the binomial standard error of one marginal rate. For 55/100 and 62/100 that pools to 6.95 points, the 7-point gap divides to 1.01 SE, and the guard files it under "indistinguishable." Fair, if the two samples were drawn independently. They were not. The 93 concordant tasks are the same 93 tasks in both columns, and the formula charged me full variance for them anyway.

The cost of believing that line is real budget. At 1000 items each, the same seven-point gap finally clears the bar:

RANK: "prompt B" > "prompt A" - gap 7.00 pp = 3.18 SE >= 2.0. Ranking is allowed.

So the harness asked for ten times the eval spend to reach a verdict the correct test already had at 100 items. For anyone paying per token to run a judge model over a set, that is a straight line from a stats mistake to a bill.
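Both RANK lines can be reproduced from the pass counts alone. A minimal sketch, assuming the guard divides the gap by the pooled independent-samples SE exactly as in the pooled line above; `wald_gap_in_se` is a name made up for this illustration, not a function from measure.py.

```python
from math import sqrt

def wald_gap_in_se(k_a, n_a, k_b, n_b):
    """Return (gap in SE units, pooled SE in percentage points).

    Treats the two pass rates as independent samples, which is the
    assumption the article argues is wrong for paired runs.
    """
    p_a, p_b = k_a / n_a, k_b / n_b
    se_a = sqrt(p_a * (1 - p_a) / n_a)   # binomial SE of marginal A
    se_b = sqrt(p_b * (1 - p_b) / n_b)   # binomial SE of marginal B
    pooled = sqrt(se_a ** 2 + se_b ** 2)
    return abs(p_b - p_a) / pooled, 100 * pooled

n_se, pooled_pp = wald_gap_in_se(55, 100, 62, 100)
print(f"{pooled_pp:.2f} pp pooled SE, gap = {n_se:.2f} SE")   # 6.95 pp, 1.01 SE

n_se_1000, _ = wald_gap_in_se(550, 1000, 620, 1000)
print(f"gap = {n_se_1000:.2f} SE at 1000 items each")          # 3.18 SE
```

The pooled SE shrinks with the square root of n, which is why a tenfold larger set only lifts 1.01 SE to 3.18 SE.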

What McNemar does instead

McNemar's test (Quinn McNemar, Psychometrika, 1947, the standard test for paired nominal data) throws away the concordant pairs and looks only at b and c. The standard error of the difference becomes 100*sqrt(b+c)/n, and the test statistic is |c-b|/sqrt(b+c). I added it to the same library as mcnemar(name_a, b, name_b, c, n). On the exact same 100 items:

McNEMAR: discordant pairs prompt A=0, prompt B=7 (concordant 93 of n=100).
  SE = 2.65 pp (100*sqrt(b+c)/n); marginal gap 7.00 pp = 2.65 SE >= 2.0 -> ranking "prompt B" over "prompt A" is allowed.
  small discordant count (b+c=7 < 25): the normal approximation is anti-conservative here. Exact two-sided binomial p=0.0156 (z-equiv 2.42), continuity-corrected z=2.27. all three clear the 2.0 bar; the decision is unchanged.
  NESTED (b=0 or c=0): every discordant item favours the same side; the difference is deterministic (strict nesting on this sample), not a coin-flip margin.

2.65 SE against the same 2.0 threshold. Allowed. Same data, opposite decision. And note the third line. With only 7 discordant pairs, 2.65 is a normal approximation, and the tool will not let me lean on it: beside it sit the exact two-sided binomial, p=0.0156 (z-equivalent 2.42), and the continuity-corrected z at 2.27. All three clear the 2.0 bar. Same decision, reached without trusting the largest number in the row. The NESTED line fires because one discordant count is zero, which means the two prompts never disagreed in both directions: on this sample B dominates A item by item. That is a qualitatively different thing from a noisy 7-point margin, and a test that pools marginals cannot see it.

One honesty note, because it matters. NESTED is not a license to rank anything with a zero in it. A pair with b=0, c=1 is also nested, and the tool prints it at 1.00 SE, nowhere near the bar. The ranking decision still rides on the z value. In this case the z is 2.65 and the nesting is strict, so both agree. I flag them separately on purpose.
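The four numbers in that output block follow from b, c and n. A stdlib-only sketch under stated assumptions: `mcnemar_summary` and its dictionary keys are hypothetical names, not the mcnemar() in measure.py, and the z-equivalent is taken here as the normal quantile of half the two-sided exact p-value.

```python
from math import comb, sqrt
from statistics import NormalDist

def mcnemar_summary(b, c, n):
    """Normal-approximation, continuity-corrected and exact McNemar readings."""
    disc = b + c
    if disc == 0:
        raise ValueError("no discordant pairs: nothing to test")
    se_pp = 100.0 * sqrt(disc) / n                 # SE of the gap, in pp
    z = abs(c - b) / sqrt(disc)                    # equals gap / se, n cancels
    z_cc = max(abs(c - b) - 1, 0) / sqrt(disc)     # continuity-corrected z
    # Exact two-sided binomial: under H0 each discordant pair is a fair coin.
    tail = sum(comb(disc, k) for k in range(min(b, c) + 1)) / 2 ** disc
    p_exact = min(1.0, 2 * tail)
    z_equiv = -NormalDist().inv_cdf(p_exact / 2) if 0 < p_exact < 1 else 0.0
    return {
        "se_pp": se_pp, "z": z, "z_cc": z_cc,
        "p_exact": p_exact, "z_equiv": z_equiv,
        "nested": b == 0 or c == 0,                # one direction never occurs
    }

s = mcnemar_summary(b=0, c=7, n=100)
print({k: round(v, 4) if isinstance(v, float) else v for k, v in s.items()})
```

For b=0, c=7 this gives SE 2.65 pp, z 2.65, continuity-corrected z 2.27, exact p 0.0156 and z-equivalent 2.42. With b=0, c=1 the nested flag is still set while z is only 1.00, which is the reason the flag and the ranking decision are kept separate.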

The same divergence on 535 real observations

The constructed example is clean but you should distrust anything I can tune to land on the line. So here is the same phenomenon on data I did not hand-pick: a sweep of a monotone-but-not-total marker fixture, P_PATHS=12 W_PER_TICK=3 T_TICKS=400 SEEDS=20, where a false reject is a landed write that a single-integer witness wrongly rejected.

The cells of that sweep share one underlying draw. Path and outcome come off the LCG independently of the axis knobs, so the raw (tick, path, outcome) at each sampled position is identical across cells. The run checks that before it does anything else, across all 33 sampled points per seed:

(tick,path,outcome) skew 6:6 vs 7:5            identical: True  (33 sampled points)
(tick,path,outcome) streams 8 vs 4             identical: True  (33 sampled points)
ALL PAIRED (raw observations identical across cells): True

Same observations, only the verdict moves. That is the definition of paired, and it is why the two-proportion SE is the wrong tool here too. Cross-tabbing the false-reject indicators gives real b/c counts. Both tests, side by side:

pair                     | b / c / n        | Wald (rank)    | McNemar (paired)
------------------------------------------------------------------------------
6:6 vs 7:5               | 142 / 128 / 535  | 0.87 SE        | 0.85 SE
7:5 vs 8:4               | 39 / 3 / 535     | 2.31 SE        | 5.55 SE
8:4 vs 9:3               | 39 / 0 / 535     | 2.65 SE        | 6.24 SE NESTED
streams=8 vs streams=4   | 91 / 32 / 535    | 3.98 SE        | 5.32 SE
streams=2 vs streams=1   | 229 / 0 / 535    | 20.01 SE       | 15.13 SE NESTED

Read the third row. 8:4 versus 9:3 is 39 discordant pairs, all in one direction, c=0. Strict nesting again, on real counts. Both tests clear the 2.0 bar here, so both allow the ranking. But Wald reports 2.65, a hair over the line, while McNemar reports 6.24 and flags NESTED. One of those tells you the result is deterministic on this sample; the other cannot tell "barely over the threshold by luck" apart from "decided." Here is that pair with both guards printing in full:

PAIR 8:4 vs 9:3   marginal FR: 8:4=172/535  9:3=133/535
    8:4: 32.1% (k=172 n=535 SE=2.02)
    9:3: 24.9% (k=133 n=535 SE=1.87)
  RANK: "8:4" > "9:3" - gap 7.29 pp = 2.65 SE >= 2.0. Ranking is allowed.
  McNEMAR: discordant pairs 8:4=39, 9:3=0 (concordant 496 of n=535).
    SE = 1.17 pp (100*sqrt(b+c)/n); marginal gap 7.29 pp = 6.24 SE >= 2.0 -> ranking "8:4" over "9:3" is allowed.

Two honest observations about that table. First, the two tests never disagree on direction: whichever version each one ranks higher, they agree, on all five pairs. Second, at this 2.0 threshold none of the five flip the decision. Row one, 6:6 vs 7:5, is 0.87 under Wald and 0.85 under McNemar, both below the bar, both refused. Rows two through five sit above the bar for both. What moves is the reported SE, sometimes a lot, 2.31 against 5.55, and whether the c=0 nesting gets named at all.

And the gap runs both ways, which is why "McNemar is always more powerful" would be the wrong lesson. On 6:6 vs 7:5 Wald reads 0.87 against McNemar's 0.85, and on streams=2 vs streams=1 it reads 20.01 against 15.13: when the discordant pairs are many and lopsided, the independent-samples SE is not the conservative one, it understates the variance instead. The rule is not that one test wins, it is that you owe your data the test its design calls for.
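The McNemar column depends only on b and c, so it can be recomputed from the table as printed. The Wald column cannot, because it needs the two marginal counts, which the table gives only for 8:4 vs 9:3. A short check, with the counts copied from the five rows; the variable names are mine.

```python
from math import sqrt

# (pair, b, c) copied from the table; n = 535 cancels out of the statistic.
rows = [
    ("6:6 vs 7:5", 142, 128),
    ("7:5 vs 8:4", 39, 3),
    ("8:4 vs 9:3", 39, 0),
    ("streams=8 vs streams=4", 91, 32),
    ("streams=2 vs streams=1", 229, 0),
]

def mcnemar_z(b, c):
    """|c - b| / sqrt(b + c): the gap in units of the McNemar SE."""
    return abs(c - b) / sqrt(b + c)

table = {}
for pair, b, c in rows:
    z = mcnemar_z(b, c)
    nested = b == 0 or c == 0
    table[pair] = (round(z, 2), nested)
    print(f"{pair:<24} {z:5.2f} SE{' NESTED' if nested else ''}")
```

The five values come out as 0.85, 5.55, 6.24, 5.32 and 15.13, matching the right-hand column, with NESTED on the two rows where c is zero.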

So where is the decision flip? Near the bar. This particular sweep happens to spread its pairs away from the 2.0 line, so the two tests agree on every call even while their SEs diverge. The constructed 100-item example sits right on the line, which is where the difference between the tests stops being cosmetic and turns into a yes or a no. I did not engineer that to cheat; it is where most config bake-offs I have watched actually get decided, on a handful of items either way.

One thing I am deliberately not claiming

You may have noticed those false-reject levels, 75.9%, 64.9%, and want me to say something about them. I will not, and the same library is why. Run the construction-independence probe on those cells and it comes back:

FORCED BY CONSTRUCTION [streams=8]: conditional 75.9% (k=406 n=535 SE=1.85) is indistinguishable from unconditional 74.7% (k=493 n=660 SE=1.69) (0.48 SE < 2.0)

The level equals the unconditional pass-the-gate rate, 74.7% at n=660, because the fixture draws outcome and stamp from separate LCG steps. So the level is an artifact of my constants, not a measurement, and I quote none of them as findings. What survives that probe is the discordance structure, b and c, which is a genuine between-cell comparison. The McNemar inputs are real; the levels they sit next to are not. Running the probe and reporting the result is the only reason I trust the distinction.

The fix, in about fifteen lines

There is nothing clever in it. It is the discordant count and a square root:

    disc = b + c
    conc = n - disc
    ...
    se = 100.0 * sqrt(disc) / n
    gap = 100.0 * abs(c - b) / n
    n_se = abs(c - b) / sqrt(disc)          # = gap / se, n cancels
    ...
    if b == 0 or c == 0:
        # NESTED: strict set-nesting, deterministic on this sample

(Truncated by me: the wrapper handles b==c, the no-discordant case, and localized output; full function in measure.py.)

The formula is the easy part. The part that actually protects you is upstream and it is not in this function at all: the library cannot know your two runs are paired. You have to establish that yourself, by confirming the observations are the same items in the same order and only the outcome moved. That is the check the sweep runs before it calls mcnemar() at all, and if the raw observations had differed it would have refused to treat them as paired. A paired test on unpaired data is its own mistake.
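A sketch of that upstream check, assuming each run is available as a list of (item_id, passed) tuples in evaluation order. The function name and the data shape are illustrative assumptions; the sweep's own check compares raw (tick, path, outcome) observations instead.

```python
def discordant_counts(run_a, run_b):
    """Verify two runs are paired, then return (b, c, n).

    b = A pass and B fail, c = A fail and B pass, as in the 2x2 above.
    Raises ValueError instead of silently pairing mismatched runs.
    """
    ids_a = [item_id for item_id, _ in run_a]
    ids_b = [item_id for item_id, _ in run_b]
    if ids_a != ids_b:
        raise ValueError("runs are not paired: item ids differ or are reordered")
    b = sum(1 for (_, pa), (_, pb) in zip(run_a, run_b) if pa and not pb)
    c = sum(1 for (_, pa), (_, pb) in zip(run_a, run_b) if pb and not pa)
    return b, c, len(run_a)

# Rebuild the constructed example: 55 both pass, 7 only B passes, 38 both fail.
outcomes = [(True, True)] * 55 + [(False, True)] * 7 + [(False, False)] * 38
run_a = [(i, a) for i, (a, _) in enumerate(outcomes)]
run_b = [(i, b) for i, (_, b) in enumerate(outcomes)]
print(discordant_counts(run_a, run_b))   # (0, 7, 100)
```

Refusing on a mismatch is deliberate: a paired test on unpaired data would produce a confident number for a comparison that was never made.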

The limits, plainly. SE = 100*sqrt(b+c)/n is the normal approximation to McNemar; for a handful of discordant pairs you want the exact binomial instead, which is why the function prints it automatically once b+c drops below 25, and I would not read the plain SE without it. The z threshold of 2.0 is a convention I carried over from the rest of the harness, not a law. And none of this touches the failure mode that actually costs you money, a silent false accept, because that is a question about your ground truth, not your arithmetic.

What does yours do

Everything here ran locally on a synthetic fixture: Python 3.13.5, stdlib only, offline, no keys, no funds, three runs byte-identical, exit 0, empty stderr. Two earlier pieces work the same eval harness from other angles: reading eval results by severity class instead of a flat pass rate and a static probe that found contamination points without running the agent. Both belong to the same family of pre-execution gates for AI agents.

I publish the runs that corrected my own reading, not only the ones that confirmed it. Follow along if that is your kind of thing. And a real question I do not have a clean answer to: when your last A/B put two agent versions at 76 and 74 out of 100 on the same set, did your harness run a paired test, or did it pool two marginals and quietly ask you for more data? I would like to know what yours does.

AI disclosure. I drafted this with an AI writing assistant and edited every line; the test choice, the sweep, and the reading are mine. Every output block is pasted from one real local run on 2026-07-21. measure.py sha256 b1b3702bccb6ab46…, paired_test.py sha256 fa687e05e8e69512…, run output sha256 d8e4e521b1f35ac0…. The library default stays Russian so the sha256 of two already-published runs keeps verifying; English is opted into explicitly.

A Spend Cap That Stops Counting Is Already Fail-Open

Alexey Spinov — Sun, 19 Jul 2026 01:53:12 +0000

Two of the five ways a spend cap can handle a missing price produce the exact same decision stream — same sha256, byte for byte. One of them is the thing everybody calls fail-open. The other is the thing everybody recommends instead of it: fall over to a free local model.

Let me say what that hash is and isn't before it does any work. It covers (seq, label, admitted, charge) and deliberately drops the human-readable reason strings, so two policies that print different words hash the same when they decide the same. Once you see that, the collision is a theorem rather than a discovery: a free fallback charges zero, fail-open charges zero, and a ledger built out of charges cannot tell them apart because there is nothing there to tell apart. The sha256 proves only that my implementation doesn't quietly cheat.

The reason it's still worth a post is that nobody ships them as the same policy. One is the bug you apologise for; the other is the fix you recommend in the thread. On the axis that matters they are one policy, and one of them has better branding.

AI disclosure. I wrote blind_spend_cap.py with AI assistance and ran it myself. Every number and hash below is pasted from a real run: offline, stdlib only, no network, no keys, no funds. The oracle is injected, so runs are deterministic. I ran it three times; the output was byte-identical each time. Code sha256 ddc42590…, output sha256 9ebe1b4a…. External figures are linked and labeled, and I say clearly which ones I did not reproduce.

TL;DR

A spend cap needs a cost-oracle to price the next action. The oracle has its own outage.
The usual framing (fail-open vs fail-closed) is the wrong axis. The real split: does the ledger keep moving while the oracle is quiet?
Strategies that charge a price the remaining budget can still absorb keep the ledger alive and re-trip the cap. Charge zero and spent freezes forever. Charge more than fits and you've written refuse with extra steps — the harness proves that one on itself.
A free local fallback charges zero. In my harness it produces a decision stream identical to plain fail-open: same sha256.
But a moving ledger is a floor, not a certificate. Any positive fiction satisfies it — price a $0.05 call at $0.01 and your cap is quietly five times the one you configured. The real axis is the bias of your estimator; zero is just where that bias hits −100%.
The headline number you'd expect me to use here (34 extra actions) is arithmetic, not evidence. I take it apart below rather than sell it.

The branch nobody writes down

Every spend cap I've shipped has the same shape. Price the action, compare against a budget, allow or block. The pricing step assumes the oracle answers.

It doesn't always. CoinGecko 429s. A usage endpoint times out. A token meter sits behind a gateway returning 526. In that moment your cap takes a decision that probably isn't in your code review notes, because it isn't in your code: what to do with an action it cannot price.

I know it's unwritten because I shipped it that way. On June 8, 2026 I published SpendGuard, a 40-line pre-execution cap. It works, and it never declared this branch. The oracle call sits inside cost_fn on line 121, and eth_price_usd() calls raise_for_status() before it returns anything. So on a 429 the exception blows straight past the gate and out of the wrapper. Accidentally fail-closed, by way of an uncaught exception that takes the caller down instead of returning a verdict you can count.

Copy the demo loop from that same post and you get the opposite. It prices once before the loop and reuses that number for every round. A mid-loop outage is invisible. Accidental fail-open.

Same file, two wirings, two opposite behaviors, and I declared neither.

Five strategies, one fork

So I built the smallest thing that isolates the branch. blind_spend_cap.py runs one Analyzer/Verifier ping-pong through one budget gate, under five strategies that are identical everywhere except the quote is None block:

    else:
        if strategy == "refuse":
            return _v(seq, "BLOCK", "no quote: refuse", False, 0, priced)
        if strategy == "admit-unpriced":
            return _v(seq, "ADMIT", "no quote: admit, charge 0 (ledger frozen)",
                      True, 0, priced)
        if strategy == "stale":
            if last_known is None:
                return _v(seq, "BLOCK", "no quote and no last-known price: refuse",
                          False, 0, priced)
            est, tag = last_known, "stale "
        elif strategy == "pessimistic":
            est, tag = per_action_cap, "pessimistic "
        elif strategy == "fallback":
            est, tag = fallback_cents, "fallback "

Two design choices worth stating, because both cut against the result I might have wanted.

A BLOCK does not stop the loop. A real runaway retries. My earlier harness gave the refusing policy a free break, which quietly handed it the win: it "stopped the runaway" because I wrote the loop that way. Here the gate stops the spend, not the work, and the loop keeps hammering. There's an --on-block halt flag for the single-shot shape, and I sweep both.

The budget has no clock, so I call it a per-run budget rather than a daily one. Calling it daily would be a lie in a file with no time in it.

The split isn't admit-vs-block. It's counting-vs-not.

Here's the outage run, oracle down from step 6, straight from output.txt:

SCENARIO B — oracle down from step 6, loop retries after a BLOCK
  refuse           admitted=6   spent=$0.30  unpriced=0   unaccounted=0   ledger-moved=False exit(conv)=1 exit(strict)=1
  admit-unpriced   admitted=40  spent=$0.30  unpriced=34  unaccounted=34  ledger-moved=False exit(conv)=0 exit(strict)=2
  stale            admitted=10  spent=$0.50  unpriced=4   unaccounted=0   ledger-moved=True  exit(conv)=1 exit(strict)=2
  pessimistic      admitted=6   spent=$0.30  unpriced=0   unaccounted=0   ledger-moved=False exit(conv)=1 exit(strict)=1
  fallback:0c      admitted=40  spent=$0.30  unpriced=34  unaccounted=34  ledger-moved=False exit(conv)=0 exit(strict)=2
  fallback:1c      admitted=26  spent=$0.50  unpriced=20  unaccounted=0   ledger-moved=True  exit(conv)=1 exit(strict)=2

Look at the spent column, not the admitted one.

Exactly two strategies end at $0.50: stale and fallback:1c. That's the budget, tripped, doing its job. admit-unpriced and fallback:0c end at $0.30 and stay there — not because the run was cheap, but because after step 6 nothing was ever added to the ledger again.

Now the row that breaks the tidy version of this claim, which I had written as "every strategy that charges something ends at $0.50" until the table two lines above told me otherwise. pessimistic charges the most of anybody and still ends at $0.30. Charging something isn't sufficient. The something has to fit. pessimistic prices every un-priced call at the $0.25 per-action cap, only $0.20 of budget remains after step 6, so every un-priced call is blocked on arrival: admitted=6, unpriced=0.

Which changes how you read the unaccounted column. It counts admissions where the oracle gave no quote and nothing was charged — defined by the fact of a missing quote, not by the name of the policy, so it can accuse any strategy including the ones I like. stale and fallback:1c sit at zero because they kept counting. pessimistic and refuse sit at zero because they never admitted a blind call in the first place. Same number, two different stories, and I'd been reading the flattering one into both.
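The four counted columns of that table can be replayed in a few lines. This is a re-creation from the parameters stated in the post ($0.50 budget, $0.05 healthy call, $0.25 per-action cap, 40 wanted actions), not blind_spend_cap.py itself; the function name, the return shape and the exact comparison operators are assumptions, and the exit codes and reason strings are left out.

```python
BUDGET, UNIT, PER_ACTION_CAP, WANTS = 50, 5, 25, 40   # cents, cents, cents, actions

def run(strategy, fails_from, fallback_cents=0, on_block="retry"):
    """Replay one loop; return (admitted, spent_cents, unpriced, unaccounted)."""
    spent = admitted = unpriced = unaccounted = 0
    last_known = None
    for seq in range(WANTS):
        quote = UNIT if seq < fails_from else None    # injected oracle
        if quote is not None:
            est = last_known = quote
        elif strategy == "refuse":
            est = None
        elif strategy == "admit-unpriced":
            est = 0
        elif strategy == "stale":
            est = last_known                          # None if never quoted
        elif strategy == "pessimistic":
            est = PER_ACTION_CAP
        elif strategy == "fallback":
            est = fallback_cents
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        fits = est is not None and est <= PER_ACTION_CAP and spent + est <= BUDGET
        if not fits:
            if on_block == "halt":
                break                                 # single-shot shape
            continue                                  # the runaway retries
        spent += est
        admitted += 1
        if quote is None:
            unpriced += 1
            unaccounted += int(est == 0)              # blind admit, nothing charged
    return admitted, spent, unpriced, unaccounted

for name, kwargs in [("refuse", {}), ("admit-unpriced", {}), ("stale", {}),
                     ("pessimistic", {}), ("fallback", {"fallback_cents": 1})]:
    print(name, run(name, fails_from=6, **kwargs))
```

With the outage at step 6 this gives (6, 30, 0, 0) for refuse and pessimistic, (40, 30, 34, 34) for admit-unpriced, (10, 50, 4, 0) for stale and (26, 50, 20, 0) for a one-cent fallback, the same counts as the rows above.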

Push it to the limit and the failure gets loud. Oracle dead from step 0:

$ python3 blind_spend_cap.py --strategy admit-unpriced --oracle-fails-from 0
admit-unpriced   admitted=40  spent=$0.00  unpriced=40  unaccounted=40  ledger-moved=False exit(conv)=0 exit(strict)=2

Forty actions admitted. Ledger says zero dollars. Exit code zero, under the mapping most gates actually ship.

That's the thing worth internalizing. A blind cap doesn't report danger. It reports innocence.

The number I'm not putting in the headline

You'd expect the pitch to be "fail-open admitted 34 more actions." The tool does print it. I'm going to argue against it anyway, because I got burned by exactly this number last time.

M is the gap in admitted actions between admit-unpriced and refuse. In the run above it's 34. Sweep the outage step across the whole parameter space and you get this:

  on-block=retry
        K:    0    5    6    9   10   11   12   20   39   40
        M:   40   35   34   31   30   29   28   20    1    0

(That's the retry half of the sweep with the unacc row dropped for now: under retry it's identical to M anyway. The full block, both loop shapes, is in the next section.)

M = WANTS − K, exactly, everywhere. I picked WANTS = 40. If I'd picked 1000, the headline would read 994. It isn't a property of any policy, it's a property of how long I let the loop want things. A number I chose, dressed up as a number I found.

Which is why the strategy table above leads with spent and unaccounted, and why M is buried in a scenario that tells you to go read the sweep before quoting it.

Pressing my own kill switch

The honest question isn't whether my metric works in the run I picked. It's where it stops working. So here's the same sweep under both loop shapes:

  on-block=retry
        K:    0    5    6    9   10   11   12   20   39   40
        M:   40   35   34   31   30   29   28   20    1    0
    unacc:   40   35   34   31   30   29   28   20    1    0
    -> M = 0 in 1/41 of K (2%); unaccounted = 0 in 1/41 (2%)
  on-block=halt
        K:    0    5    6    9   10   11   12   20   39   40
        M:   40   35   34   31   30    0    0    0    0    0
    unacc:   40   35   34   31   30    0    0    0    0    0
    -> M = 0 in 30/41 of K (73%); unaccounted = 0 in 30/41 (73%)

Under halt semantics, everything I've argued dies in 73% of the parameter space. Not weakens. Dies, to identically zero, both metrics at once.

The reason is dull and important. The budget is $0.50, a healthy call is $0.05, so a healthy run hits the cap at step 10. If the oracle only falls over at step 11 or later, the loop is already stopped by money. The un-priced branch is never reached. Nothing to measure, nothing to argue about.

So the applicability condition, which my previous draft never stated and which I'm stating plainly now: this post is about outages that start before your budget would have stopped the loop anyway — K <= budget // unit_cost, which is K <= 10 here — or about loops that retry after being refused. Outside those two cases the whole thing is a non-event.

That boundary was off by one in the tool until this morning: the summary line said the metrics collapse once K >= 10, when K = 10 is the last K where they're still alive and K = 11 is the first dead one. The sweep table underneath it was right the whole time. A decent argument for printing the table and not just the conclusion drawn from it.
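The sweep and its boundary reduce to a closed form. The sketch below restates the two claims from the text (M = WANTS − K under retry, and the halt shape dying once K exceeds budget // unit_cost) and recounts the summary lines; `m_gap` is a hypothetical helper, not part of the tool.

```python
WANTS, BUDGET, UNIT = 40, 50, 5   # actions wanted, cents, cents per healthy call

def m_gap(k, on_block):
    """Admitted-action gap between admit-unpriced and refuse for outage step k."""
    if on_block == "retry":
        return WANTS - k
    if on_block == "halt":
        # After BUDGET // UNIT healthy calls the next priced call is blocked
        # and the loop halts, so an outage that starts later is never reached.
        return WANTS - k if k <= BUDGET // UNIT else 0
    raise ValueError(f"unknown loop shape: {on_block}")

for mode in ("retry", "halt"):
    dead = sum(1 for k in range(WANTS + 1) if m_gap(k, mode) == 0)
    share = 100 * dead / (WANTS + 1)
    print(f"{mode}: M = 0 in {dead}/{WANTS + 1} of K ({share:.0f}%)")
```

This prints 1/41 (2%) for retry and 30/41 (73%) for halt, and K = 10 is the last step where the halt shape still returns a nonzero M.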

I think retrying is the common case, because runaway agent loops are usually retry loops. That's a judgement about the world, not a measurement, and I'm flagging it as one.

A free fallback is fail-open with better branding

Now the equivalence. Same outage, and I hash the decision stream — the (seq, label, admitted, charge) tuples, deliberately excluding the human-readable reason strings, so two strategies that make the same decisions hash the same even when they print different words:

SCENARIO C — is a free fallback a distinct strategy, or a renamed fail-open?
  admit-unpriced   sha=53b01d22a232f1fee833a76c7cd1ed810d1945da2e7620c8a1a17a9302b4df79
  fallback:0c      sha=53b01d22a232f1fee833a76c7cd1ed810d1945da2e7620c8a1a17a9302b4df79
  stale            sha=0dcbd560f59adcf2b1eec3ca111dc7f89c3e7ac08569fe165d88ec7af77b4311
  fallback:5c      sha=0dcbd560f59adcf2b1eec3ca111dc7f89c3e7ac08569fe165d88ec7af77b4311
  refuse           sha=7df1f34491edfde21648d36e9a8eda1db7306d12efda4c29364fa8f54ad3b04a
  pessimistic      sha=7df1f34491edfde21648d36e9a8eda1db7306d12efda4c29364fa8f54ad3b04a
  -> fallback:0c  == admit-unpriced : True
  -> fallback:5c  == stale          : True  (needs >=1 real quote before the outage, and a constant oracle price)
  -> pessimistic  == refuse         : True  (at THESE caps: $0.25 never fits what is left of $0.50)

Three collisions in that block, and the third one costs me a third of my own recommendation.

fallback:0c == admit-unpriced is the headline, and — as I said up top — a theorem. Both charge zero, the hash covers the charge. I couldn't break it by moving the budget, the unit cost, the per-action cap, the loop shape or the outage step; it isn't a coincidence of the parameters I picked.
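Why the collision is a theorem is easy to see in code. A minimal sketch of a decision-stream hash, assuming verdicts are dicts and the tuples are serialized as compact JSON; the real tool's serialization is not shown in the post, so digests from this sketch will not match the ones printed above.

```python
import hashlib
import json

def stream_hash(verdicts):
    """sha256 over (seq, label, admitted, charge); reason strings are dropped."""
    decisions = [(v["seq"], v["label"], v["admitted"], v["charge"]) for v in verdicts]
    payload = json.dumps(decisions, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Two policies that print different words but decide and charge the same.
fail_open = [{"seq": 6, "label": "ADMIT", "admitted": True, "charge": 0,
              "reason": "no quote: admit, charge 0 (ledger frozen)"}]
free_fallback = [{"seq": 6, "label": "ADMIT", "admitted": True, "charge": 0,
                  "reason": "fallback estimate 0c"}]
one_cent = [{"seq": 6, "label": "ADMIT", "admitted": True, "charge": 1,
             "reason": "fallback estimate 1c"}]

print(stream_hash(fail_open) == stream_hash(free_fallback))   # True
print(stream_hash(fail_open) == stream_hash(one_cent))        # False
```

Any two strategies that admit the same actions at the same charge land on the same digest, whatever they call themselves; the digest separates them only once a charge differs.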

fallback:5c == stale is real but conditional, and I stated it as a general truth in an earlier pass. It needs two things I'd left unsaid. There has to be at least one successful quote before the outage — with --oracle-fails-from 0 there's no last-known price at all, stale refuses, and the equality collapses. And the oracle's price has to actually hold still; point the harness at a varying price and the two decision streams separate immediately. The honest version is narrower: a fallback pinned to the real rate is stale pricing, as long as the real rate isn't moving.

pessimistic == refuse was sitting in my own main table and I walked past it twice. Same 7df1f344…. A $0.25 estimate never fits the $0.20 left after step 6, so pessimistic blocks every un-priced call — which is refuse, decision for decision. Run --strategy fallback --fallback 25 --oracle-fails-from 6 and you get 7df1f344… as well. Three names, one behavior.

That forces an admission about my own advice. Further down I tell you to charge a stale price, or a conservatively biased estimate, or the per-action cap. At the parameters in my own demo that last one is bit-identical to the refusal I declined to recommend — which is why it now ships with that caveat attached instead of posing as a third independent door. It doesn't make the advice wrong. Refusing is defensible, and I spend a whole section on its bill.

The price sweep is where this gets concrete:

SCENARIO D — sweep the fallback price (the whole argument hangs on it)
  fallback:0c      admitted=40  spent=$0.30  unpriced=34  unaccounted=34  ledger-moved=False exit(conv)=0 exit(strict)=2
  fallback:1c      admitted=26  spent=$0.50  unpriced=20  unaccounted=0   ledger-moved=True  exit(conv)=1 exit(strict)=2
  fallback:2c      admitted=16  spent=$0.50  unpriced=10  unaccounted=0   ledger-moved=True  exit(conv)=1 exit(strict)=2
  fallback:5c      admitted=10  spent=$0.50  unpriced=4   unaccounted=0   ledger-moved=True  exit(conv)=1 exit(strict)=2
  fallback:25c     admitted=6   spent=$0.30  unpriced=0   unaccounted=0   ledger-moved=False exit(conv)=1 exit(strict)=1

So the clean line I wanted to write — "the runaway terminates if and only if the price is above zero" — is not true, and the bottom row is the counterexample. fallback:25c prices well above zero and spent still sticks at $0.30 with the ledger frozen. What a price above zero actually buys is a ledger that keeps moving, and only while that price still fits what's left of the budget. Under that band you get fail-open. Over it you get refusal wearing a price tag. The usable range is narrower than "not zero", and where it sits depends on caps I picked.

Worth being exact about the edge, because I rounded it off in an earlier pass. To collide with refuse, the price has to exceed min(per-action cap, budget − spent when the outage begins). Here that is min($0.25, $0.20), so the collision with refuse actually starts at $0.21, and $0.25 is just the row I happened to print. Which of the two terms binds depends on when the oracle dies: at step 6 it's the leftover budget, at step 4 it's the per-action cap. So "too expensive to fit" isn't a property of the price alone: it's the price measured against however much budget the outage left you.
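That edge is a two-line calculation. A sketch in whole cents, assuming healthy calls cost $0.05 each until the outage and that the budget stops them at ten; `binding_limit` is an illustrative name.

```python
def binding_limit(fails_from, budget=50, unit=5, per_action_cap=25):
    """Largest blind price (cents) that still fits, and which term sets it."""
    healthy_calls = min(fails_from, budget // unit)   # the budget stops the rest
    leftover = budget - healthy_calls * unit
    limit = min(per_action_cap, leftover)
    term = "per-action cap" if per_action_cap < leftover else "leftover budget"
    return limit, term

for step in (4, 6):
    limit, term = binding_limit(step)
    print(f"outage at step {step}: prices above {limit}c are refused ({term})")
```

At step 6 the limit is 20 cents, so 21 cents is the first price that behaves like refuse; at step 4 the leftover budget is 30 cents and the 25-cent per-action cap binds instead.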

This matters because the free version is the one people ship. On July 15, 2026, a developer publishing as @ddhh released a small LLM circuit breaker with a clean statement of the instinct: "When I'm about to overspend, don't fail and don't keep paying — fall through to a free local model and keep working." The config in the post marks the tier # local, free, always-on fallback.

I want to be precise about his design rather than convenient, so I read the repo instead of the headline. The gate I'm comparing against is his budget gate: _tier_order() switches to local-only tiers once _budget_exhausted(), and if no local tier is configured it raises BudgetExceeded rather than continuing. That gate keys on accumulated spend against a limit — not on a missing quote. That's a declared branch with a hard stop, which is more than my June code had.

Provider failure is handled too, just not by that gate. His post opens on exactly that problem — "Paid API quotas dying mid-run. One provider 429s, and the whole run falls over" — and complete() wraps each tier in an except … continue (breaker.py:121-131, commented "a failed tier should never crash the caller"), so a 429 on a paid tier falls through to the next one, local included. Two triggers, one ordered failover. The reason I'm separating them carefully is that the thing I care about is which signal moves the ledger, and on both paths the local tier reports the same number: ollama_tier._call ends with return text, 0.0 (providers.py:167), which _record writes into the JSONL ledger as cost_usd: 0.0.

For his stated goal that is correct, and I'll say so flatly: a free local call really does cost zero dollars, so his ledger is accurate and his dollar spend genuinely cannot grow past the limit. He solved the problem he set out to solve.

The pattern risk lives one step to the side. Once every call costs zero, the dollar ledger can never stop anything again — and if the thing that pushed you over budget was a non-converging loop rather than an expensive model, dollars were never the binding constraint. Wall-clock, local GPU, rate-limited downstream endpoints and side effects all keep accruing, and the ledger reports $0.00 while they do. Exactly the fallback:0c row above.

None of that is a defect in his breaker. It's what the graceful-degradation instinct does when you port it from "budget exhausted" to "cost unknown" without noticing the axis changed.

For scale on why any of this is worth an afternoon: on July 16, 2026 @royanannya published a postmortem of a multi-agent loop that billed $1,847 in one weekend, fixed by pushing decisions off the LLM layer and cutting per-game cost from $1.95 to $0.35. Their number, their run; I didn't reproduce it. Their fix was architectural, not a spend cap, and I'm not going to pretend otherwise.

Where my own metric stops meaning anything

I've been leaning on unaccounted as though it measures how accurately you're counting. It doesn't. It's a binary test — did an admitted call with no quote get charged zero? — and the counterexample is sitting in my own output.

The harness has no notion of what an action actually costs. Every admit really executes, and a real call here is $0.05. Multiply the admitted column by that and set it beside the ledger:

strategy	ledger says	actually spent	vs the `$0.50` budget	`unaccounted`
`stale`	$0.50	$0.50	1.0x	0
`fallback:5c`	$0.50	$0.50	1.0x	0
`fallback:1c`	$0.50	$1.30	2.6x	0
`fallback:0c`	$0.30	$2.00	4.0x	34

fallback:1c passes my metric cleanly. unaccounted=0, ledger-moved=True, budget tripped on schedule — and I wrote "that's the budget, tripped, doing its job" about that exact row. It also pushed 26 real calls through a ten-call budget. Pricing a $0.05 action at $0.01 under-counts by 5x, and a cap that under-counts by 5x is a cap five times larger than the one you configured.
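The table can be re-derived with a few lines of arithmetic. This is a sketch under the post's stated constants (40 rounds, a $0.50 budget, $0.05 per real call, oracle down from step 6), with amounts in integer cents. It is not the harness itself, and simulate is a hypothetical name.

```python
def simulate(fallback_cents, rounds=40, budget=50, true_cost=5, oracle_fails_from=6):
    """Charge the real quote while the oracle is up and the fallback price after it dies."""
    spent = admitted = unaccounted = 0
    for step in range(rounds):
        blind = step >= oracle_fails_from
        price = fallback_cents if blind else true_cost
        if spent + price > budget:
            continue                 # BLOCK: a runaway loop just tries again next round
        spent += price               # what the ledger records
        admitted += 1                # every admit really executes
        if blind and price == 0:
            unaccounted += 1         # admitted with no quote, charged zero
    actual = admitted * true_cost    # what the calls really cost
    return admitted, spent, actual, unaccounted

for label, cents in (("fallback:5c", 5), ("fallback:1c", 1), ("fallback:0c", 0)):
    admitted, ledger, actual, unacc = simulate(cents)
    print(f"{label}: ledger ${ledger / 100:.2f}  actual ${actual / 100:.2f}  "
          f"{actual / 50:.1f}x budget  unaccounted={unacc}")
```

The 1c row admits 26 calls and the 0c row admits all 40, which is where the 2.6x and 4.0x figures come from.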

So "keep the ledger moving" is satisfied by any positive fiction. The real axis isn't zero versus non-zero — it's the bias of your estimator. Zero is simply the point where the bias hits −100% and the cap stops existing at all. It's the worst case and it's the common case, which is why it earns a post, but unaccounted=0 is a floor, not a certificate. If your fallback price is a comfortable number rather than a conservative one, you haven't fixed the runaway. You've slowed it down and moved it out of view.

There's a second place my instrument lies, and I found it checking this piece rather than writing it. The ledger-moved column samples spent after each blind decision, so what it actually asks is "did the ledger move more than once?" A price that fits exactly once — try --fallback 15 — pushes spent from $0.30 to $0.45 and still prints ledger-moved=False, the same value fail-open gets. None of the rows in this post are affected; they sit at 0c, 1c, 2c, 5c and 25c. But if you sweep the price yourself you'll walk into it, and it's the same conflation I've spent the whole post complaining about, sitting in my own column. Trust spent. The boolean is a convenience and I got it wrong.
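One way a column like that can miss a single move, reconstructed from the description above rather than copied from the harness source. The function names and the choice of five blind decisions are assumptions for illustration.

```python
def spent_samples(fallback_cents, spent_at_outage=30, budget=50, blind_decisions=5):
    """Ledger value (cents) sampled after each blind, un-quoted decision."""
    spent, samples = spent_at_outage, []
    for _ in range(blind_decisions):
        if spent + fallback_cents <= budget:
            spent += fallback_cents
        samples.append(spent)
    return samples

def moved_between_samples(samples):
    """The conflated check: it only sees movement between two consecutive samples."""
    return any(a != b for a, b in zip(samples, samples[1:]))

def moved_since_outage(samples, spent_at_outage=30):
    """The intended question: did the ledger move at all once the outage began?"""
    return any(s != spent_at_outage for s in samples)

for cents in (0, 5, 15):
    s = spent_samples(cents)
    print(f"{cents}c samples={s} between={moved_between_samples(s)} "
          f"since_outage={moved_since_outage(s)}")
```

At 15c the first blind decision moves the ledger from 30 to 45 and every later sample stays at 45, so the sample-to-sample comparison reports no movement while the comparison against the outage baseline reports it correctly.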

One caveat on that table, because it cuts against my own framing: it assumes an un-priced call costs what a healthy one costs. If your fallback genuinely is a free local model, the dollar figure really is zero and the fallback:0c row overstates the dollars. That's precisely @ddhh's case — and precisely why what escapes there is wall-clock, GPU and downstream rate limits rather than dollars.

What refusing actually costs you

I've been describing the failure mode of not counting. Refusing has its own bill, and my last draft skipped it, which was a fair complaint against that draft.

First, the recovery case. Outages end. Oracle down from step 6, back at step 15:

SCENARIO F — the outage ENDS: oracle down 6..14, back at 15
  loop retries after a BLOCK (a real runaway does):
  refuse           admitted=10  spent=$0.50  unpriced=0   unaccounted=0   ledger-moved=False exit(conv)=1 exit(strict)=1
  same run, but the loop HALTS on the first BLOCK (single-shot shape):
  refuse           admitted=6   spent=$0.30  unpriced=0   unaccounted=0   ledger-moved=n/a   exit(conv)=1 exit(strict)=1

(The scenario prints admit-unpriced and stale rows too; I've kept only refuse here, because it's the policy on trial.)

Under retry, refusing costs nothing: the run resumes and completes the same ten actions it would have anyway. Under halt, it ends the run at step 6 and never sees the oracle come back. Same policy, same outage, and whether refusing is free or expensive depends entirely on a property of your caller that isn't in the cap at all.
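The two Scenario F rows fall out of a loop this small. A sketch under the same constants as the rest of the post (40 rounds, 50c budget, 5c per call, 0-indexed steps); run_refuse is a hypothetical name and the real harness may be organized differently.

```python
def run_refuse(halt_on_block, rounds=40, budget=50, cost=5, down=range(6, 15)):
    """Refuse policy: never admit without a quote. Returns (admitted, spent_cents)."""
    spent = admitted = 0
    for step in range(rounds):
        oracle_up = step not in down
        if oracle_up and spent + cost <= budget:
            spent += cost
            admitted += 1
        elif halt_on_block:
            break          # single-shot caller: the first BLOCK ends the run
        # a retrying caller falls through and simply tries again next round
    return admitted, spent

print("retry:", run_refuse(halt_on_block=False))
print("halt: ", run_refuse(halt_on_block=True))
```

The policy is identical in both calls. Only the caller's reaction to a BLOCK differs, and that alone separates ten admitted actions from six.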

Three more costs, none of which my harness measures:

You hand your stop button to a third party. If your gate refuses whenever CoinGecko is unreachable, then CoinGecko's rate limiter is now your kill switch, and anyone who can degrade it can halt your agent remotely.

Refusing is not automatically safe. The fail-closed doctrine arrives from authorization, where denial is the safe default. Spending isn't authorization. Stopping halfway through a non-idempotent sequence — and SpendGuard was written for ETH and gas — can be worse than admitting one un-priced call. If your actions aren't safely interruptible, "refuse" is not the free option it looks like.

The escape hatch never gets closed. Any --allow-unpriced flag will be added to a systemd unit at 3am during an incident and will still be there next year. If you build one, give it an expiry.
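A minimal sketch of an escape hatch that closes itself, assuming the override is passed as an ISO 8601 timestamp with an explicit UTC offset. The flag name --allow-unpriced-until and the function unpriced_allowed are hypothetical and exist in neither SpendGuard nor the harness.

```python
from datetime import datetime, timezone

def unpriced_allowed(expires_at_iso, now=None):
    """True only while the override is unexpired; a missing expiry means not allowed."""
    if not expires_at_iso:
        return False
    expires_at = datetime.fromisoformat(expires_at_iso)
    if expires_at.tzinfo is None:
        # Refuse ambiguous local times: the 3am operator and the server may disagree.
        raise ValueError("expiry needs a UTC offset, e.g. 2026-07-20T06:00:00+00:00")
    now = now or datetime.now(timezone.utc)
    return now < expires_at

incident = datetime(2026, 7, 20, 3, 0, tzinfo=timezone.utc)
print(unpriced_allowed("2026-07-20T06:00:00+00:00", now=incident))   # inside the window
print(unpriced_allowed("2026-07-20T02:00:00+00:00", now=incident))   # already expired
```

The design choice is that there is no way to say "forever": a flag left in a unit file simply stops working once its timestamp passes.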

Which is why the recommendation of this post isn't "fail closed." It's narrower: declare the branch, and keep the ledger moving with a price you'd defend out loud. Charge a stale price. Charge an estimate biased high rather than convenient. Charge the per-action cap if you accept what my own harness showed above — that on a tight budget this is refusal under a different name. Just never charge zero and call it accounting.
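That recommendation can be sketched as one function that always returns a non-zero charge and names the branch it took. The function name, the branch labels and the ordering (quote, then stale price, then per-action cap) are assumptions for illustration, with amounts in integer cents.

```python
def charge_for(quote, last_known, per_action_cap):
    """Pick the amount to charge and declare which branch produced it."""
    if quote is not None:
        return quote, "quoted"
    if last_known is not None:
        return last_known, "stale"       # possibly wrong, but the ledger keeps moving
    return per_action_cap, "cap"         # most pessimistic price the gate admits

print(charge_for(5, 5, 25))        # healthy oracle
print(charge_for(None, 5, 25))     # outage with a cached price
print(charge_for(None, None, 25))  # outage with nothing cached
```

Returning the branch label alongside the price is what makes the branch declared: a log line can record that a charge was stale or capped instead of presenting it as a quote.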

Two exit codes, and which one is an opinion

The tool prints exits under two mappings, because the difference is the trap:

EXIT_CONVENTIONAL = {"PASS": 0, "ADMIT": 0, "BLOCK": 1, "ERROR": 2}
EXIT_STRICT = {"PASS": 0, "BLOCK": 1, "ADMIT": 2, "ERROR": 3}

conventional is what most gates ship: an admitted action is a success. Under it, the blind run exits 0 while 34 actions went through un-priced. Your orchestrator sees green and moves on.

strict treats "admitted without a quote" as its own outcome. Under it no strategy exits 0 during an outage — refusing gets a 1, admitting blind gets a 2. Nobody gets a clean run when the oracle is down, which seems right to me.

That second mapping is my opinion, and I'm labeling it as one. My last attempt at this put ADMIT: 0 in the table, then acted amazed that fail-open exited green — I'd assumed the conclusion and called it a finding. The counts are the evidence. The exit code is a choice, and you should make your own.

The tool also exits worst-wins in every mode including the demo, so the default run, which contains a deliberate oracle fault, exits 3. A file about failures laundered into green zeros doesn't get to launder its own.
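A short sketch of worst-wins under both mappings. The two dictionaries are the ones printed above. Treating "worst" as the numerically highest code is an assumption, and worst_exit is a hypothetical name.

```python
EXIT_CONVENTIONAL = {"PASS": 0, "ADMIT": 0, "BLOCK": 1, "ERROR": 2}
EXIT_STRICT = {"PASS": 0, "BLOCK": 1, "ADMIT": 2, "ERROR": 3}

def worst_exit(outcomes, mapping):
    """Worst-wins: the process exits with the highest code any outcome maps to."""
    return max((mapping[o] for o in outcomes), default=0)

blind_run = ["PASS"] * 6 + ["ADMIT"] * 34     # six quoted calls, then 34 un-priced admits
refuse_run = ["PASS"] * 6 + ["BLOCK"] * 34
faulty_run = ["PASS", "PASS", "ERROR"]

for name, run in (("blind", blind_run), ("refuse", refuse_run), ("fault", faulty_run)):
    print(name, worst_exit(run, EXIT_CONVENTIONAL), worst_exit(run, EXIT_STRICT))
```

The blind run is the only one whose exit code changes between the two mappings, from 0 to 2, which is the whole difference the section is about.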

The oracle-fault path catches magnitude, not just sign, since a cents-versus-dollars mixup is the classic cost-oracle bug:

      seq 3: oracle-untrusted: quote 500 is 100x last known 5 (unit error, not a price)

A 100x jump is caught as a broken oracle instead of an expensive action. The threshold is 20x and it's a judgement call, and it can't fire on the first quote of a run because there's nothing to compare against yet.

It's also one-sided, which I only noticed while writing this up. It catches a quote that's too big and sails straight past the mirror-image bug: dollars arriving as cents, every action priced at a hundredth of what it costs. For a spend cap that's the more dangerous direction — it under-counts instead of blocking, which is the same failure as everything else in this post — and my check doesn't cover it. It's marked as a known gap in the source rather than quietly left there.
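A sketch of the magnitude check as described, with the 20x threshold from the text. The two_sided option is a hypothetical extension added here to illustrate the gap and is not something the tool has. The strict greater-than comparison is also an assumption.

```python
def quote_trust(quote, last_known, ratio=20, two_sided=False):
    """Return 'ok' or 'oracle-untrusted' by comparing a quote to the last known one."""
    if last_known is None:
        return "ok"                    # first quote of a run: nothing to compare against
    if quote > ratio * last_known:
        return "oracle-untrusted"      # e.g. 500 against 5: a unit error, not a price
    if two_sided and quote * ratio < last_known:
        return "oracle-untrusted"      # hypothetical mirror check: dollars arriving as cents
    return "ok"

print(quote_trust(500, 5))                     # the seq 3 case from the output above
print(quote_trust(500, None))                  # same quote, first of the run
print(quote_trust(5, 500))                     # shrinks 100x, one-sided check misses it
print(quote_trust(5, 500, two_sided=True))     # the mirror check would catch it
```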

What this is not

A benchmark of your bill. The constants ($0.05 a call, a $0.50 budget, 40 rounds) exist to make the branch legible. The transferable part is the shape.
Proof that outages and runaways co-occur. I believe they do, because a runaway hammering a rate-limited endpoint is often the thing knocking its own price feed over. A synthetic harness can't show that, and I'm not claiming it does.
A verdict on anyone's library. The equivalence is about the pattern of pricing a fallback at zero. It reproduces with --strategy fallback --fallback 0 --oracle-fails-from 6 in twenty lines of my own code. (Leave the outage flag off and you get a healthy-oracle run where the branch never fires — which proves nothing, as Scenario A says out loud.)
A correctness proof. The sha256s show determinism, nothing more. A wrong program reproduces byte-for-byte just as well as a right one.
A model of how commercial budget systems fail. AWS Budgets, GCP billing and most usage endpoints don't go quiet — they lag, then reconcile, and what you couldn't see gets billed to you later. My gate has no true-up: an estimate charged during the outage is never corrected when the oracle comes back. And for LLM calls there's no honest pre-call price at all, since output tokens aren't known until the call is finished — so a real LLM cap lives permanently in pessimistic/fallback and never earns the PASS row my harness prints. What this models cleanly is a price feed, which is the case I actually shipped.

This sits on a different axis from the sliding-window guard, which is about cheap calls that sum to a runaway, and from the pre-execution gate, which is about gating before you execute rather than after. Today's axis is oracle availability, and neither of those touched it.

The one I'm still stuck on

Stale pricing keeps the ledger alive, which is the whole recommendation, and it's also quietly a lie: you're charging against a number you know might be wrong. So how long is a cost estimate allowed to live before "cached" becomes "guessing"? For gas that moves in seconds it might be five seconds. For a token price it might be an hour. I don't have a principled way to set that TTL, and I suspect it's per-oracle rather than a general rule.

If you've drawn that line in production — the point where a cached cost stops being a fact — I'd like to hear where you put it and what made you move it.

Run it yourself: stdlib, offline, and it prints hashes you can diff against mine. Then go look at your own cap and answer one question. When it can't price the next call, does the number in your ledger keep moving?

Follow for the next teardown in this series, and tell me the worst thing your agent ever did while your dashboard showed $0.00. I read every comment.

Written with AI assistance and reviewed/edited by a human. The Python in this post was run offline (stdlib only, no network, no keys, no funds) on 2026-07-19; every number and hash in the output blocks is from a real deterministic run, repeated three times byte-for-byte. Code sha256 ddc425908d070c07a6765810e6115f649c17312b8e754d034227d37c982357e8, output sha256 9ebe1b4ab3459eb78c1d3aeea7eaa16ee0c2286e1d8d57b09e0eeb6451a189f1. The $1,847 figure belongs to @royanannya and was not reproduced here; the circuit-breaker design described is @ddhh's, quoted from their post and their MIT-licensed repository.

One compaction, four actions, one block: compaction safety is a property of the pair

Alexey Spinov — Sat, 18 Jul 2026 01:06:26 +0000

Context compaction is safe or unsafe only against a specific proposed action. Not in general. The compactor decides what your agent forgets before it knows what your agent will do, so at compaction time "did this lose anything important?" has no answer yet. The lie, if there is one, comes into existence later, when the agent proposes to act.

That sounds like philosophy. It has an exit code.

AI disclosure. I wrote compaction_omission_gate.py with an AI assistant and ran it myself: Python 3.13.5, offline, standard library only, no network, no keys, no funds. Every verdict, exit code and hash below is pasted from a real local run. I ran the whole demo twice and the two output.txt files are byte-for-byte identical (sha256 2f0ae2c6d54b1d35caefb3da12af3a4189aff05e5d8c59321b63150f27eef376). The fixtures are synthetic and mine. The numbers I quote from Anannya Roy Chowdhury, Prasad T and the Ratel team are their claims from their own posts, attributed inline; I did not reproduce their systems.

In short:

Freeze the compaction. Vary only the action. Same 52 records, same 16 kept (kept-sha256=3bfe38c85fcf on every row), same 36 dropped, and the same rule missing from all four: 1 BLOCK, 3 PASS. Safety moved without the compaction moving at all.
So "a good compactor" is not a thing you can have. There is only "this compaction is safe for that action."
The check is arithmetic, not vibes: amount 5000 passes, amount 5001 blocks. Same words, one unit apart.
This is a partial no to a question I asked publicly two days ago. The action-blind version of the check blocks 4 of 4, where the gate blocks 1 of 4. Three of those four blocks are wrong, and nothing recovers them.
Two of my own designs died on this post: one to a number in someone else's post, one to my reviewer. Details below.

The gap three people found in the same week

Anannya Roy Chowdhury published My Multi-Agent AI Cost $1,847 in One Weekend on July 16. Her numbers: $1,847 burned, 82% cut, context compressed "from 12,000 tokens to 340." That is 97% of the context gone. I asked her in the comments whether she'd found a way to catch a bad compression before the turn runs on it, or whether you find out from the win rate.

Her reply named the hole better than my question did:

"A bad compressor can absolutely 'lie' by omission, which is arguably more dangerous than naive replay because the failure is silent."

She called it the core challenge of her Part 2, which was not out when I wrote this. So this post is not a rebuttal of anything. It is an attempt to answer the question she agreed was open.

Two other people hit the same wall in the same week. Prasad T's Teaching a Qwen agent to forget put a superseded_by link on memory records so a contradicted fact is dropped from recall but kept in an audit trail. The line I keep coming back to, from the thread under his post: forgetting you can't inspect is just data loss. And the Ratel Show HN (19 points, 18 comments when I checked) does progressive tool disclosure with BM25/embeddings/hybrid and claims up to 81% fewer tokens. Different clothes, same problem: something decides what the model doesn't see, and nothing checks that decision against what the model is about to do.

The number that killed my first design

My first design was the obvious one: run the decision on the full context, run it on the compacted context, and if the decision changes, BLOCK. Compare against the truth.

Prasad's post killed it. His FAMA measurement (his numbers, not mine) reports recency-only = 0.64 and never-forgets = 0.64. Identical. Never-forgets is the full context. So the "truth" I wanted to compare against is itself wrong 36% of the time. An oracle that lies in a third of cases is not an oracle, and any reader who has read his post gets to close my tab in one line.

Those same two numbers kill a second thing, which is the framing I would have reached for by reflex: compaction hurts because you lose volume. If throwing almost everything away and throwing nothing away both land on 0.64, volume is not the variable. Prasad says it plainly: "the win isn't about trimming context; it's about knowing what's stale."

So the gate has no oracle. It never asks what the model would have decided. It resolves your declared predicates against your stored records. That is a structural fact, not a prediction.

What does the compaction omission gate actually ask?

One closed question: among the records the compactor dropped, is there one that binds this action?

Input is the full pre-compaction context, the ids the compactor kept, and a concrete action. The dropped set is computed, never supplied. I don't trust anyone's "here's what I removed" list. Records are typed: policy (scope, predicate, effect), fact (field, value, optional supersedes), chatter. Ordering is a declared seq, never a wall clock, so the verdict can't drift with the time of day.

The whole resolution step is this small:

def resolve(field, params, facts):
    """action.params wins; otherwise the newest stored fact for that field."""
    if field in params:
        return True, params[field]
    if field in facts:
        return True, facts[field]["value"]
    return False, None

If a field resolves nowhere, the gate fails closed rather than assuming harmless. There is no scoring anywhere in the verdict path. PASS, BLOCK, ERROR. No percentages. That's deliberate: on July 16 a Show HN called ReasonGate shipped a "gate" built on regexes, and Simon Willison took it apart in the thread within the hour. Recall of "76-96%... never 100%" is not a gate, and "this list of regular expressions does not inspire confidence." A gate that guesses is a filter with good PR.
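To show how resolve feeds a verdict, here is a sketch that evaluates one dropped rule against an action. The (field, op, value) predicate shape and the helper names are assumptions. The op names gt, lt and eq are the ones that appear in the tool's output. Returning BLOCK for an unresolved field is one reading of "fails closed"; the real tool may report that case differently.

```python
import operator

OPS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}

def resolve(field, params, facts):
    """action.params wins; otherwise the newest stored fact for that field."""
    if field in params:
        return True, params[field]
    if field in facts:
        return True, facts[field]["value"]
    return False, None

def verdict_for_dropped_rule(predicate, params, facts):
    """BLOCK if the dropped rule's predicate holds for this action, PASS if it does not."""
    field, op, threshold = predicate
    found, value = resolve(field, params, facts)
    if not found:
        return "BLOCK"     # assumption: unresolved means fail closed, never harmless
    return "BLOCK" if OPS[op](value, threshold) else "PASS"

rule = ("amount", "gt", 5000)
for amount in (4999, 5000, 5001, 8400):
    print(amount, verdict_for_dropped_rule(rule, {"amount": amount}, {}))
print("no amount anywhere:", verdict_for_dropped_rule(rule, {}, {}))
```

The comparison is plain arithmetic on resolved values, so 5000 and 5001 land on opposite sides with no scoring involved.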

Is context compaction safe? Freeze it and vary the action

Here is the claim as an experiment, and it needs nothing to compare against. One compaction. Four actions.

The fixture is a 52-record support thread. The refund request opens it, the resolution closes it, and the middle carries the boring stuff: internal notes, a tier update, and one dry line at seq 11: Refunds over 5,000 need a human approver. The compaction is head+tail, keep the first 8 and the last 8, which is the shape most windowing schemes converge on. That rule sits at position 11 of 52. It is dropped. In all four rows.

Proposed action	Verdict	Exit
`refunds.create {amount: 8400}`	BLOCK `dropped-binding-constraint`	1
`refunds.create {amount: 120}`	PASS	0
`refunds.create {amount: 5000}`	PASS	0
`notes.append {ticket: 7741}`	PASS	0

"Same compaction" is a claim, and claims about counts are cheap, so the tool prints a fingerprint of the kept set:

    compaction: 52 records, 16 kept, 36 dropped (retention=given, kept-sha256=3bfe38c85fcf)

That digest is identical on all four rows. Same 52 records, same 16 kept, record for record, same 36 dropped. One blocks. Nothing about the compaction moved, and the answer to "is this compaction safe?" moved anyway. That is the whole thesis, and there is no baseline in it, because the claim isn't that some other tool does worse. The claim is that the question is malformed until you name the action.
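A sketch of what "frozen" means mechanically: a head+tail selection, a dropped set computed from it, and a digest of the kept ids. The id format f_001 to f_052 and the digest recipe (sha256 over the sorted ids as JSON, truncated to 12 hex characters) are assumptions, so this will not reproduce the tool's 3bfe38c85fcf value.

```python
import hashlib
import json

def head_tail(ids, k):
    """Keep the first k and the last k ids, preserving order."""
    return list(ids) if len(ids) <= 2 * k else ids[:k] + ids[-k:]

def kept_digest(kept_ids):
    """Short fingerprint of the kept set (hypothetical recipe, not the tool's)."""
    body = json.dumps(sorted(kept_ids)).encode("utf-8")
    return hashlib.sha256(body).hexdigest()[:12]

all_ids = [f"f_{seq:03d}" for seq in range(1, 53)]     # 52 records
kept = head_tail(all_ids, 8)
kept_set = set(kept)
dropped = [i for i in all_ids if i not in kept_set]    # computed, never supplied

print(len(all_ids), len(kept), len(dropped), "f_011" in dropped)
print("digest stable across runs:", kept_digest(kept) == kept_digest(head_tail(all_ids, 8)))
```

Printing the digest on every row is what turns "same compaction" from a claim about counts into something a reader can diff.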

The 120 row prints a line I like:

note      : permissive: dropped policy 'f_020' (allow) resolves TRUE here (amount = 120 lt 500)
            but losing a permission cannot license an action; not a block reason

A dropped allow rule is not danger. It costs you an auto-approval, nothing more. The gate says so instead of blocking on principle.

"A good compactor" is not a thing you can have. There is only "this compaction is safe for that action."

The boundary, because "it resolves the predicate" is also a claim

The 5000 row is the one worth poking at. 5000 gt 5000 is FALSE, so the rule wouldn't have fired, so its absence changes nothing. But how would you know the gate computed that rather than pattern-matched its way there? Walk the boundary:

`amount`	Verdict
4999	PASS
5000	PASS
5001	BLOCK `dropped-binding-constraint`
8400	BLOCK `dropped-binding-constraint`

5000 and 5001 differ by one character, which is about as close as two strings get for any lexical matcher. The verdicts differ because the gate resolved amount out of the action's params and did arithmetic against a stored predicate. Lexical similarity has no way to know that 8400 > 5000, because that is not a fact about words.

The part where I answer my own question with "no"

Two days ago I asked publicly whether you can catch a bad compression before the turn runs. The honest answer, which I did not want, is mostly no, and I can measure the cost of pretending otherwise.

--precheck takes the action away from the verdict and asks the general question instead: could any dropped record bind any reachable call of these tools?

$ python3 compaction_omission_gate.py --precheck --retention given fixtures/fx1_refund_thread.json
    action    : (the verdict below never consults it -- that is the experiment)
    compaction: 52 records, 16 kept, 36 dropped (retention=given, kept-sha256=3bfe38c85fcf)
    tools     : refunds.create
    verdict   : BLOCK (action-blind) exit 1
    reason    : may-bind-some-action: dropped policy 'f_011' scopes to refunds.create; some
                reachable call of that tool satisfies (amount gt 5000). Which one? Unknown
                here: no action was supplied

Run it against each of the four actions and you get the same line four times, because it never reads them. The gate blocks 1 of 4. Precheck blocks 4 of 4. Three of those four blocks are false, and they are not fixable by being cleverer: nothing recovers 120 > 5000 = FALSE without the 120.
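The 1-of-4 against 4-of-4 gap can be reproduced in miniature. This is a sketch with a hypothetical rule shape and function names, not the tool's code. It assumes amount is unbounded, so some reachable refunds.create call always satisfies the predicate.

```python
RULE = {"scope": "refunds.create", "field": "amount", "gt": 5000}   # the dropped rule

ACTIONS = [
    ("refunds.create", {"amount": 8400}),
    ("refunds.create", {"amount": 120}),
    ("refunds.create", {"amount": 5000}),
    ("notes.append", {"ticket": 7741}),
]

def gate(rule, tool, params):
    """Action-aware: BLOCK only if the dropped rule binds this concrete call."""
    if tool != rule["scope"]:
        return "PASS"
    if rule["field"] not in params:
        return "BLOCK"                     # unresolved field: fail closed (assumption)
    return "BLOCK" if params[rule["field"]] > rule["gt"] else "PASS"

def precheck(rule, tools):
    """Action-blind: BLOCK if any in-scope tool could produce a binding call."""
    return "BLOCK" if rule["scope"] in tools else "PASS"

gate_blocks = sum(gate(RULE, tool, params) == "BLOCK" for tool, params in ACTIONS)
pre_blocks = sum(precheck(RULE, {"refunds.create"}) == "BLOCK" for _ in ACTIONS)
print(f"gate blocks {gate_blocks} of 4, precheck blocks {pre_blocks} of 4")
```

precheck has no parameter through which the amount could reach it, which is why no amount of cleverness inside it recovers the three false blocks.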

So the control point sits one step later than I hoped, and one step earlier than it costs money. The turn is already spent: the model has read the mutilated context and produced a proposal. Nothing has executed. That's the line my pre-execution gate has always drawn: before the action, not before the thinking. Reading the dropped records to check them costs zero tokens, since compaction governs what gets sent to the model, not what's on your disk. (Money is a different post: the re-billing curve is here.)

The hazard is not symmetric, and my gate got that wrong

Here's the bug my reviewer found in my own tool, and it is the mirror of the one I was proud of.

I had a rule: losing a permission cannot license an action. True, and the gate leans on it. A dropped allow never blocks. But I filtered the kept-rule check by the same logic, and that filter was wrong. A kept allow rule is a permission you still have. If the compactor drops the value that permission reads, the permission fires on a value you already took back:

$ python3 compaction_omission_gate.py --retention given fixtures/fx6_stale_permit.json
    verdict   : BLOCK (dropped-superseder-stale-permit) exit 1
    reason    : dropped-superseder-stale-permit: kept policy 'f_003' (allow) reads
                'customer.tier'. The compactor kept 'f_002' (customer.tier = gold, seq 2) and
                dropped 'f_011' (customer.tier = standard, seq 11) and declares supersedes
                'f_002'. The rule's own predicate (customer.tier eq gold) is TRUE on the kept
                value and FALSE on the newest one: the agent will license this call with a
                value you already retracted. Losing a permission cannot license an action, but
                KEEPING one while its input is retracted can

Before the fix, that fixture returned PASS exit 0. Silent. A 4,000 refund auto-approved on a gold tier that was revoked twenty records ago, and my gate, whose entire pitch is "the failure is silent, so catch it," said nothing. I had written the general principle and then applied it to the wrong noun.

Fixing it forced a distinction I had been sloppy about. A stale value has two directions and they are not the same event:

Kept rule permits on the stale value, wouldn't on the newest: the agent acts where your records say stop. Hazard. BLOCK.
Kept rule denies on the stale value, wouldn't on the newest: the agent stops where your records say go. Cost. PASS, with a note.

The second one is a real fixture too (fx7), and the gate declines to block it:

    note      : stale value, safe direction: kept policy 'f_003' (deny) reads 'customer.tier'.
                [...] The rule's own predicate (customer.tier eq restricted) is TRUE on the kept
                value and FALSE on the newest one. This compaction makes the agent more cautious
                than your records warrant: a cost, not a block reason

A gate that blocks whenever anything changed is just an alarm wired to the whole house. The direction is the point.
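The direction test reduces to a small truth table. A sketch, with stale_direction as a hypothetical name: only the two cases spelled out above are classified, the remaining combinations are returned as UNSPECIFIED rather than guessed, and treating "same outcome on both values" as PASS is an assumption.

```python
def stale_direction(effect, holds_on_kept, holds_on_newest):
    """Classify a kept rule whose input was superseded by a record the compactor dropped."""
    if holds_on_kept == holds_on_newest:
        return "PASS", "stale value does not change the rule's outcome"   # assumption
    if holds_on_kept and not holds_on_newest:
        if effect == "allow":
            return "BLOCK", "dropped-superseder-stale-permit"
        if effect == "deny":
            return "PASS", "stale value, safe direction: a cost, not a block reason"
    return "UNSPECIFIED", "combination not described in the post"

# fx6 shape: kept allow rule reads customer.tier eq gold; kept value gold, newest standard.
print(stale_direction("allow", "gold" == "gold", "standard" == "gold"))
# fx7 shape: kept deny rule reads customer.tier eq restricted; the newest value here is made up.
print(stale_direction("deny", "restricted" == "restricted", "standard" == "restricted"))
```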

Is there a "correct" retention scheme? I measured. No.

Same fixtures, same k, two action-independent schemes:

fixture	head_tail	recency
refund thread (rule in the middle)	BLOCK binding	BLOCK binding
superseded fact	BLOCK stale-kept	BLOCK binding
genuinely safe	PASS	PASS
retracted tier, kept `allow` rule	BLOCK stale-permit	PASS
retracted tier, kept `deny` rule	PASS	PASS

I predicted recency would survive the supersession fixture. It doesn't. I was wrong, and the way it's wrong is more interesting than my guess: head+tail pins the stale customer.tier = standard at seq 2 and drops the correction at seq 20, so the agent applies a kept rule to a retracted value. Recency keeps the correction and drops the rule itself. Two popular schemes, opposite failures, same input. "Switch to recency" doesn't fix the failure, it relocates it.

I won't tell you no correct scheme exists; five synthetic fixtures can't carry that. The structural argument is stronger than the table anyway. Safety depends on the action, which the freeze-and-vary run above shows without appealing to any scheme at all. Neither of these schemes can see the action. So neither can be safe in general, and the row where recency wins is luck about where the record sat, not virtue.

"But a relevance compactor would keep the rule"

Probably, sometimes. And this is where I have to report a section that isn't here.

An earlier draft of this post had a better headline and a fifth act: my gate against a relevance-based compactor, IDF-weighted cosine, keep-top-k, the state of the art, confidently certifying a compaction that dropped a binding rule. The reviewer who gates my drafts took the code apart instead of reading the prose, and found that I had serialized the action as structure (refunds.create amount 8400 currency USD customer c_889) while indexing every record as prose (Refunds over 5,000 need a human approver.). Same JSON on both sides. My gate read six fields off that record. My baseline read one.

Whatever that comparison showed, it was not a fact about relevance ranking. It was a fact about what I handed each side. So the ranker is gone, all ninety-odd lines of tokenizer and cosine that existed to lose a fight I had arranged. The thesis never needed it, which I'd have noticed sooner if the number hadn't been flattering.

The honest version of the question is harder. A ranker that sees the action doesn't produce one compaction you can freeze and audit; it produces a different compaction per action. That may well be the right engineering. It also means there is no fixed artifact to check, and "is this compaction safe" stops being a question you can ask once and cache. I don't have a clean answer for that shape. If you do, I'd like to hear it.

Prose cannot be certified

One fixture is rolling summarization output: the middle of the thread replaced by eight prose blobs typed as summary. The gate doesn't recognize that kind, can prove nothing about it, and blocks with dropped-unevaluable-record. Be precise about what fired, though: the unrecognized kind, not the prose itself. Retype those same eight blobs as chatter — a known kind the gate reads as your own declaration that a record is throwaway — and it believes you, drops them silently, and returns PASS. I checked; the verdict flips. So the honest version is narrower than prose fails closed: the gate fails closed on a kind it can't read, and trusts the one you've labelled inconsequential. Same knife as the schema limit further down — it holds you to your typing. There's an --allow-unevaluable flag that turns the unrecognized-kind block into PASS exit 0 with [--allow-unevaluable: you chose to fly blind] in the report. Not a fix. A signed decision.
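The kind check behaves roughly like the sketch below. The record fields id and kind and the function name are assumptions. The three known kinds, the dropped-unevaluable-record reason and the --allow-unevaluable report text come from the post.

```python
KNOWN_KINDS = {"policy", "fact", "chatter"}

def unevaluable_check(dropped_records, allow_unevaluable=False):
    """Fail closed on any dropped record whose kind the gate cannot read."""
    unknown = [r["id"] for r in dropped_records if r.get("kind") not in KNOWN_KINDS]
    if not unknown:
        return "PASS", []            # this sub-check only; other checks still run
    if allow_unevaluable:
        return "PASS", ["[--allow-unevaluable: you chose to fly blind]"]
    return "BLOCK", ["dropped-unevaluable-record: " + ", ".join(unknown)]

blobs = [{"id": f"s_{n}", "kind": "summary", "text": "..."} for n in range(1, 9)]
retyped = [dict(b, kind="chatter") for b in blobs]

print(unevaluable_check(blobs)[0])                            # unrecognized kind
print(unevaluable_check(retyped)[0])                          # same prose, labelled chatter
print(unevaluable_check(blobs, allow_unevaluable=True))       # signed decision
```

The text of the eight records is never inspected. Only the kind label changes between the first two calls, and the verdict flips with it.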

This is the uncomfortable one, because rolling summarization is exactly what my own context-tax post prescribes as the cure. I'm auditing my own prescription. If your memory is a pile of summarized text, this gate has nothing to work with, and neither does anything else. That's not a gap in the gate. It's a gap in the store. Prasad's superseded_by link is the cheapest version of the fix.

What this is NOT

Not an authorizer. A PASS says the compaction hid nothing that binds this action. Whether the action is allowed is your policy engine's job.
Not a benchmark. Seven synthetic fixtures I wrote, shaped to have the property I'm describing. No claim about anyone's production system, and specifically no claim about Anannya's. Her numbers are hers, from her post.
Not a compactor. It doesn't fix your retention. It tells you when this retention can't carry this action.
Not proof your agent is safe. It closes one hole: a dropped record that binds the proposed call.
Not usable on unstructured memory, as above.

How is this different from my other gates?

mandate-freshness-gate expires authority: the permission was revoked, and nothing is missing from the context. Here a fact is retracted and the record is gone.
checkpoint-skip-gate replays a recorded trajectory to find a step the agent skipped; the culprit is the agent, and the check is post-hoc. Here nothing is skipped: the infrastructure quietly ate a fact, and the check runs before execution.
agent-memory-tax-and-backdoor argued retention is not relevance, about what your store keeps. This one is about what your window throws away in the same breath, and the verdict never scores anything.
sliding-window-spend-guard treats the window as a spend problem, where eviction is bookkeeping. Here eviction is a correctness problem and the verdict has no cost term in it at all.

Run it

Drop the tool and the fixtures in a directory and run bash run_demo.sh. Three stdlib imports (json, hashlib, sys), no network, no keys, no model. The full sweep ends 2 PASS, 4 BLOCK, 1 ERROR -> overall exit 1, malformed input exits 2 and never 0, and every report ends with a sha256 of its own body. Run it twice and diff. If it isn't byte-identical, I'd want to know.

Every command in the demo passes --retention given explicitly even though it's the default, because an earlier version of that script quietly recomputed the compaction per action while the prose above it claimed the compaction was frozen. Two of the numbers I'd written were false and I hadn't run the command my own paragraph described. Spelling the flag out makes the sentence and the command the same object.

The schema is the price of admission. policy needs a scope, a predicate and an effect; fact needs a field, a value and ideally a supersedes. If your ingestion layer labels a binding rule as chatter, the gate believes it, and nothing saves you. It holds you to your own typing. That's the honest limit.

So: if compaction safety is a property of the pair rather than of the compactor, where does this check belong in your harness: inside the compactor, in the store, or at the execution gate? I lean toward the execution gate, because it's the only place that knows the action, but that's the place least likely to have the dropped records in hand. And what do you do when your context is prose with no schema? Pay to type it, or fly blind and admit it?

Follow for the numbers from the next gate I build and break. If you've watched an agent lose an instruction to auto-compact and only find out from the outcome, tell me what the dropped record was. I read every comment.

Codex encrypted its sub-agent prompts. Gate the spawn plan.

Alexey Spinov — Thu, 16 Jul 2026 00:48:38 +0000

Pre-dispatch authorization for AI sub-agents means checking a child spawn's grant envelope (its role, tools, path scope and token budget) against the parent's policy in the last plaintext moment before handoff, not by reading a trace afterwards. When the orchestrator encrypts the handoff, the after-view goes blind. The before-view does not, because it sits earlier on the timeline.

On July 14, "Codex starts encrypting sub-agent prompts" hit 408 points and 240 comments on Hacker News in a day. The tracking bug behind it is filed as openai/codex#28058, titled "Regression: encrypted MultiAgentV2 messages remove readable task audit trail." A change encrypted the orchestrator-to-sub-agent payload, and the plaintext task record humans used to read after the fact turned into ciphertext. People who had been inspecting what their sub-agents were told, after dispatch, could no longer read it.

That is a good thing to notice, and a worse thing to fix by asking for the plaintext back.

AI disclosure. I wrote subagent_dispatch_gate.py with an AI assistant and ran it myself: Python 3.13.5, offline, standard library only, no network, no keys, no funds. Every number, exit code and sha256 below is pasted from a real local run. I ran the whole demo twice and the two output.txt files are byte-for-byte identical (sha256 5af48191642d66f7c364c429c50d2ad1a021f09004f5566ba878c7be87fcaaf1). The one synthetic part is clearly marked: encrypt_artifact() models the observable consequence of encryption (opaque bytes you cannot parse back into fields), not Codex's real crypto. And every fact about Codex here comes from that HN thread and that issue. I did not reproduce their system.

In short:

The issue people filed asks to restore the readable audit copy. That restores tracking. It does not restore control, and I think the two got conflated in the thread.
Post-hoc inspection genuinely worked the day before the change. In my contrast run it recovered a plan with 4 fields when the artifact was plaintext, then 0 fields once the artifact was the opaque blob. The only variable is encryption. No strawman.
A pre-dispatch gate reads the plan the parent still holds in-process, before the handoff boundary. Same spawn, both worlds, children=1, invariant to whether the artifact is later encrypted.
The gate is not a tautology. One plan granting fs.write, a fs:* wildcard, shell.run, a path of ~ and a 5,000,000-token budget produces PASS: 1 BLOCK: 1, exit 1. The same two roles scoped read-only inside the workspace under cap produce PASS: 2 BLOCK: 0, exit 0.
An absent or empty plan is exit 2, never a silent pass. A gate that treats "no plan found" as "nothing to block" fails open, and that is the failure mode I most wanted to kill.

What is pre-dispatch authorization for AI sub-agents?

It is a check that runs on the parent, before the parent hands a task to a child agent. The parent holds a spawn plan: for each child, a role, a set of tool grants, a set of paths it may touch, and a token budget. Pre-dispatch authorization compares each of those against a policy the parent already owns, and refuses the spawn if any child asks for more than the policy allows. It happens while the plan is still a plain object in memory. Nothing has been sent, nothing has been encrypted, nothing has run.

The distinction that matters: this is authorization of a spawn, not authorization of an action, and not reconciliation of a trace. It is the same family as a pre-execution gate for AI agents, moved up one level, from "may this action run" to "may this child exist with these powers."

The incident: the audit trail went dark, and the thread asked for it back

Read issue #28058 in its own words. A PR added encryption to the MultiAgentV2 message payloads. The intent was reasonable: the model-facing channel between orchestrator and sub-agent should not sit around in plaintext on disk. The side effect was that the local rollout history, the thing a human opened to review what task got delegated to which sub-agent, became unreadable. The issue's ask is to keep a human-readable audit copy alongside the encrypted delivery.

I understand the ask. I also think it quietly gives up the game. If your control over what a sub-agent may do lives in your ability to read the transcript afterwards, then your control was always downstream of the work. The sub-agent already ran. The tokens already burned. The shell.run already happened. Restoring the readable copy lets you write a better postmortem. It does not stop the next spawn.

This is the line I keep coming back to across this whole cluster: tracking is not control. A trace tells you what happened. It does not decide what is allowed. When the trace is plaintext you can pretend the two are the same, because you can always go read it and intervene. Encryption breaks the pretense. It takes away the after-view and forces the question: if you could not read the trace at all, where would your control actually live?

Why is post-hoc dead while pre-dispatch survives?

Here is the honest version of that question, run as code. Same spawn, two worlds, one variable.

import hashlib
import json

def encrypt_artifact(plan_obj):
    """Model the at-rest / handoff artifact AFTER the orchestrator encrypts it.
    Not real crypto: models the observable consequence -- bytes a reader without
    the key cannot parse back into fields. sha256 stands in for 'ciphertext'."""
    raw = json.dumps(plan_obj, sort_keys=True).encode()
    return "ENC:" + hashlib.sha256(raw).hexdigest()

def posthoc_inspect(artifact):
    """A session inspector reading the artifact AFTER dispatch.
    Plaintext JSON -> recovers the plan (worked yesterday).
    Opaque ENC blob -> recovers nothing (blind today)."""
    if isinstance(artifact, str) and artifact.startswith("ENC:"):
        return None, 0            # blind: zero fields recoverable
    obj = json.loads(artifact) if isinstance(artifact, str) else artifact
    if isinstance(obj, dict) and isinstance(obj.get("plan"), list):
        fields = sum(len(c) for c in obj["plan"] if isinstance(c, dict))
        return obj["plan"], fields
    return None, 0

The point of posthoc_inspect is that it is not rigged. Hand it plaintext and it does its job. Run python3 subagent_dispatch_gate.py --contrast:

CONTRAST: post-hoc inspection vs pre-dispatch gate (same spawn)
  world=PLAINTEXT  post-hoc: recovered plan=yes fields=4
  world=ENCRYPTED  post-hoc: recovered plan=no fields=0  <- BLIND
  pre-dispatch reads parent's in-process plan: children=1 (invariant to artifact encryption)
  => post-hoc control SURVIVES encryption: no
  => pre-dispatch control survives encryption: yes
sha256(contrast)=0abca3747138eaf4260ed298ed001c1df6fd7ac67e89ce56f60c7b5fe8b9ef11

Yesterday's inspector recovered 4 fields. Today's recovers 0. It did not get worse at its job. The world moved out from under it. The pre-dispatch view reads the plan the parent constructs in memory, before encrypt_artifact ever touches it, so it reports the same children=1 in both worlds. That is the whole argument in five lines of output: encryption is fatal to the after-view and irrelevant to the before-view, and I am not shaving points off post-hoc to get there. It honestly recovers the plan when the plan is readable.

The reason the before-view can exist at all is boring and load-bearing: the parent has to build the plan in plaintext to send it. Even Codex's own encryption sits around delivery and at-rest storage. The payload is assembled in the clear first. There is a moment, every single spawn, when the full plan is a plain object the parent is holding. That moment is where the gate goes.
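A minimal sketch of that timeline, for anyone who wants to poke at it without the full tool. It is not the script's own --contrast code: count_recoverable_fields is a simplified stand-in for posthoc_inspect, and the one-child spawn below is a hypothetical envelope with illustrative values.

```python
import hashlib
import json

def encrypt_artifact(plan_obj):
    # Stand-in for encryption, as in the article: an opaque digest, not real crypto.
    raw = json.dumps(plan_obj, sort_keys=True).encode()
    return "ENC:" + hashlib.sha256(raw).hexdigest()

def count_recoverable_fields(artifact):
    # After-view: fields can only be counted if the artifact still parses as JSON.
    if artifact.startswith("ENC:"):
        return 0
    obj = json.loads(artifact)
    return sum(len(child) for child in obj["plan"])

# Hypothetical one-child spawn envelope (illustrative values).
spawn = {"plan": [{"role": "doc-reader", "tools": ["fs.read"],
                   "paths": ["/repo/workspace/docs"], "budget_tokens": 40000}]}

# Before-view: the parent counts children while the plan is still a plain object.
children_seen_pre_dispatch = len(spawn["plan"])

# After-view: what an inspector recovers from each world's stored artifact.
plaintext_fields = count_recoverable_fields(json.dumps(spawn))
encrypted_fields = count_recoverable_fields(encrypt_artifact(spawn))

print(children_seen_pre_dispatch, plaintext_fields, encrypted_fields)
```

The before-view value is computed before either artifact exists, which is why it cannot depend on which world the artifact ends up in.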

The gate: authorize the spawn plan, not the trace

Four checks, one policy, one child at a time. Nothing clever.

WILDCARD_PATH_MARKERS = ("~", "$HOME", "*")
WRITE_CAPS = {"fs.write", "fs.delete", "shell.run", "wallet.transfer"}

def _escapes_workspace(path, root):
    """True if `path` is not provably inside `root` (fail-closed on ambiguity)."""
    if not isinstance(path, str):        # a non-string path spec is not provably safe
        return True
    p = path.strip()
    root_norm = root.rstrip("/")
    if p == "":
        return True
    for m in WILDCARD_PATH_MARKERS:      # home / wildcard touches everything
        if m in p:
            return True
    if ".." in p.split("/"):             # parent-traversal (absolute OR relative)
        return True
    if p == "/":                         # filesystem root
        return True
    if p.startswith("/"):                # absolute: must be inside root
        return not (p == root_norm or p.startswith(root_norm + "/"))
    return False

def check_child(child, policy):
    reasons = []
    allowed_tools = set(policy["allowed_tools"])
    cap  = policy["budget_cap_tokens"]
    root = policy["workspace_root"]
    allow_write = policy.get("allow_write", False) is True   # only a real True grants write

    tools = child.get("tools")
    if not isinstance(tools, list) or not tools:
        reasons.append("child declares no tools (fail-closed: unauthorizable)")
    else:
        for t in tools:
            if t not in allowed_tools:
                reasons.append("tool '%s' not in parent allowlist (deny-by-default)" % t)
            if t in WRITE_CAPS and not allow_write:
                reasons.append("tool '%s' is a write/destructive capability but "
                               "policy allow_write=false" % t)

    paths = child.get("paths")
    if not isinstance(paths, list):
        reasons.append("child.paths must be a list")
        paths = []
    for p in paths:
        if _escapes_workspace(p, root):
            reasons.append("path '%s' escapes workspace root '%s'" % (str(p).strip(), root))

    budget = child.get("budget_tokens")
    if not isinstance(budget, int) or isinstance(budget, bool):
        reasons.append("budget_tokens missing or non-integer (unbounded spawn)")
    elif budget > cap:
        reasons.append("budget_tokens %d over parent cap %d" % (budget, cap))
    elif budget <= 0:
        reasons.append("budget_tokens %d must be positive" % budget)

    return child.get("role", "<no-role>"), ("PASS" if not reasons else "BLOCK"), reasons

Two design choices are doing the real work. First, fs.write is in the allowlist and still gets blocked, because it is a write capability and the policy says allow_write=false. Allowlisting a tool and granting write are two different decisions, and collapsing them is how "read-only agent" quietly becomes "agent that can write." Second, _escapes_workspace is deny-by-default on ambiguity. I do not try to be clever about what ~ might resolve to. If a path is not provably inside the root, it escapes. I got this wrong in the first draft: my wildcard list literally contained "/", so the substring check flagged every absolute path, and the legitimate PASS fixture came back as exit 1. The live run is what caught it. That is the entire reason I run these before I write about them.
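The first-draft bug is easy to reproduce in isolation. The sketch below is a reconstruction under assumptions, not the original draft: BUGGY_MARKERS is a guess at what the list looked like with "/" in it, and has_marker is a hypothetical helper that mirrors the substring loop in _escapes_workspace.

```python
# Hypothetical reconstruction of the first-draft marker list (with "/" included).
BUGGY_MARKERS = ("~", "$HOME", "*", "/")
# The list as published in the article.
FIXED_MARKERS = ("~", "$HOME", "*")

def has_marker(path, markers):
    # Substring test, same shape as the marker loop in _escapes_workspace.
    return any(m in path for m in markers)

legit = "/repo/workspace/docs"

buggy_flag = has_marker(legit, BUGGY_MARKERS)   # "/" is a substring of every absolute path
fixed_flag = has_marker(legit, FIXED_MARKERS)   # no marker present in a scoped path
home_flag = has_marker("~", FIXED_MARKERS)      # a home reference is still caught

print(buggy_flag, fixed_flag, home_flag)
```

A substring check is deliberately blunt, which is what makes it fail closed, but it also means every marker has to be something that never appears in a legitimate path.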

The scope side of this is a sibling problem to scoring a single API key's blast radius: same "how wide is this grant" axis, different object and different output. This gate is a binary refusal on a whole spawn envelope, not a 0-100 score on one credential.

Run it in sixty seconds

Here is the child that should never be allowed to spawn. Save it as plan_block.json:

{
  "policy": {
    "allowed_tools": ["fs.read", "fs.write", "http.get"],
    "workspace_root": "/repo/workspace",
    "budget_cap_tokens": 200000,
    "allow_write": false
  },
  "plan": [
    { "role": "home-cleaner", "tools": ["fs.write", "fs:*", "shell.run"],
      "paths": ["~", "/repo/workspace/task-1"], "budget_tokens": 5000000 },
    { "role": "doc-reader",   "tools": ["fs.read", "http.get"],
      "paths": ["/repo/workspace/docs"], "budget_tokens": 40000 }
  ]
}

home-cleaner is the kind of spawn you do not want dispatched sight-unseen: broad file powers, a shell, a home-directory reach, an absurd budget. Run it:

$ python3 subagent_dispatch_gate.py plan_block.json
SUBAGENT-DISPATCH-GATE REPORT
policy: 3 tool(s) allowed, root=/repo/workspace, budget_cap=200000, allow_write=False
children in spawn plan: 2
  - home-cleaner -> BLOCK
      x tool 'fs.write' is a write/destructive capability but policy allow_write=false
      x tool 'fs:*' not in parent allowlist (deny-by-default)
      x tool 'shell.run' not in parent allowlist (deny-by-default)
      x tool 'shell.run' is a write/destructive capability but policy allow_write=false
      x path '~' escapes workspace root '/repo/workspace'
      x budget_tokens 5000000 over parent cap 200000
  - doc-reader -> PASS
PASS: 1   BLOCK: 1
VERDICT: 1 subagent spawn(s) refused BEFORE dispatch
sha256(report)=afed5edb835afdafeac5496dd299770581701fb4489232b7e2baf469fa311812
exit=1

Note doc-reader passes in the same plan. The gate is not a wall that blocks everything. It refused one child on six specific grounds and let the scoped one through. Now flip only the scope, keep the two roles. In plan_pass.json, home-cleaner is fs.read on /repo/workspace/task-1 with a 40,000-token budget:

$ python3 subagent_dispatch_gate.py plan_pass.json
SUBAGENT-DISPATCH-GATE REPORT
policy: 3 tool(s) allowed, root=/repo/workspace, budget_cap=200000, allow_write=False
children in spawn plan: 2
  - home-cleaner -> PASS
  - doc-reader -> PASS
PASS: 2   BLOCK: 0
VERDICT: all spawns within policy; dispatch authorized
sha256(report)=de81c4d8f75cd4683796b9715c9501f81b77bd618286f0ebbf172aad8f21a5d9
exit=0

Exit 1 to exit 0, decided on the real difference between the two plans. If a gate cannot produce both outcomes it is a decoration, not a control. This one produces afed5edb… for the refusal and de81c4d8… for the authorization, and both hashes are stable across runs.

Why does bad input fail closed?

The failure I care about most is the quiet one. A gate that returns "all clear" when it was handed nothing is worse than no gate, because it launders "I did not check" into "I approved." So an absent or empty plan is exit 2, and it says why:

$ python3 subagent_dispatch_gate.py plan_empty.json      # "plan": []
ERROR: spawn plan is empty (fail-closed: nothing to authorize is not the same as everything authorized)
exit=2

$ python3 subagent_dispatch_gate.py plan_missing.json    # no "plan" key at all
ERROR: no spawn plan present (fail-closed: absence of a plan is NOT authorization -- a gate that passes an empty plan fails open)
exit=2

The same reflex runs inside each child. A spawn that declares no tools is blocked, not waved through as harmless. A budget_tokens that is missing or non-integer is blocked, because an unbounded spawn is the one that quietly forks children in a loop and bills you for all of them. Empty is not safe. Empty is unknown, and unknown fails closed. A previous tool in this series died in review for getting this backwards, treating a zero-byte input as a pass, and I would rather over-correct.
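One way to wire that reflex into the top-level exit code is sketched below. It is an assumption about structure, not the script's actual entry point: plan_exit_code and the check_child callback signature (returning True for PASS) are hypothetical names chosen for illustration.

```python
import json

EXIT_OK, EXIT_BLOCK, EXIT_ERROR = 0, 1, 2

def plan_exit_code(doc_text, check_child):
    """Map a plan document to an exit code.

    check_child(child, policy) -> bool is a hypothetical callback; True means PASS.
    """
    try:
        doc = json.loads(doc_text)
    except ValueError:                       # malformed JSON is an error, never a pass
        return EXIT_ERROR
    if not isinstance(doc, dict) or "plan" not in doc:
        return EXIT_ERROR                    # no plan present
    plan = doc["plan"]
    if not isinstance(plan, list) or not plan:
        return EXIT_ERROR                    # empty plan: all([]) would be True, so guard first
    policy = doc.get("policy", {})
    verdicts = [check_child(child, policy) for child in plan]
    return EXIT_OK if all(verdicts) else EXIT_BLOCK

always_pass = lambda child, policy: True
print(plan_exit_code('{"plan": []}', always_pass))
print(plan_exit_code('{"policy": {}}', always_pass))
print(plan_exit_code('{"plan": [{"role": "doc-reader"}]}', always_pass))
```

The empty-list guard matters because Python's all() returns True on an empty sequence, which is exactly the fail-open shape described above.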

Where this sits next to the rest

This is one spoke of a cluster, and the edges matter more than usual here.

The near neighbor is an authz gate that reconciles the trace against what was allowed. That gate assumes the trace exists. It reads a span log after the fact and checks each recorded action against policy. Issue #28058 is the moment that assumption dies: encrypt the payload and there is no trace to reconcile. So the control does not disappear, it relocates earlier on the timeline, from the span log that is now ciphertext to the spawn plan that is still plaintext. If you have been running post-hoc reconciliation, that article is where you were, and this one is where you go when the log goes dark.

It is also a pre-run manifest check like the lethal-trifecta gate, which asks a different question of the same manifest: can untrusted input reach private data and then reach an exfiltration channel? Same "decide from the declared plan before anything runs," different property. And it is a new spoke on the pre-execution gate pillar: authorize the child, not the action.

What this is NOT

I would rather you find the holes than a commenter find them for me.

I did not reproduce Codex. Every claim about their system comes from HN 48905028 and issue #28058. The encryption there covers delivery and the at-rest copy; the sub-agent's output is a separate matter, and MultiAgentV2 is an experimental, off-by-default path. I am reasoning about the shape of the problem, not their internals.
This is not a decryptor or a bypass. The gate never touches ciphertext. It stands on the parent's own boundary and reads the plaintext plan the parent is already holding, before handoff. If your orchestrator does not expose that pre-dispatch moment, this tool has nothing to read.
encrypt_artifact() is a model, not AES. It produces opaque bytes to stand in for "ciphertext you cannot parse back into fields." It demonstrates the consequence of encryption for a reader without the key. It is not a reproduction of any real cipher, and I am not claiming it is.
It is not a runtime enforcer. It statically authorizes a manifest and returns an exit code. To actually stop a spawn you need a pre-dispatch hook inside your orchestrator that consults this verdict and refuses the handoff. This shows what the hook should refuse and why. It does not install the hook.
The budget check is a declared integer, not a tokenizer. It compares budget_tokens to a cap. It trusts the plan's own declared number and does not estimate real token cost. If a child lies about its budget, the gate believes the lie. Treat it as a ceiling on the ask, not a measurement of the spend.
The policy language is a toy. An allowlist, a workspace root, a budget cap, an allow_write flag. That is not OPA, not a real PDP/PEP, not a capability system. It is the smallest thing that makes the argument concrete and runnable. Real deployments will want a real policy engine.
A passing plan is not a safe plan. The gate checks the four things it checks. A child that stays inside all four can still do something dumb with a legitimate grant. Passing this gate means "within the envelope you declared," nothing stronger.

What I would do on Monday

Find the point in your orchestrator where the parent has assembled the child's task and grants but has not sent them yet. That point exists, because the parent has to build the thing in the clear to hand it over. Emit the plan as a plain object there, one dict per child with its tools, paths and budget. Run those four checks against the parent's policy. Refuse the handoff on any BLOCK, and log the reasons. Then, separately, keep whatever after-the-fact audit you have. The audit is still useful for the postmortem. It just cannot be the place your control lives, because one encryption change already proved it can be taken away.
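A rough sketch of what such a hook could look like, under stated assumptions: dispatch_with_gate, SpawnRefused and the send callback are hypothetical names, not part of subagent_dispatch_gate.py or any real orchestrator, and budget_only_check is a one-rule stand-in that only mimics the (role, verdict, reasons) return shape of check_child.

```python
class SpawnRefused(Exception):
    """Raised when the gate refuses a handoff; carries the reasons."""

def dispatch_with_gate(plan, policy, check_child, send):
    # Fail closed on a missing or empty plan.
    if not isinstance(plan, list) or not plan:
        raise SpawnRefused("no spawn plan to authorize")
    results = [check_child(child, policy) for child in plan]
    blocked = {role: reasons for role, verdict, reasons in results if verdict != "PASS"}
    if blocked:
        # Nothing is sent if any child is blocked.
        raise SpawnRefused(blocked)
    return [send(child) for child in plan]

def budget_only_check(child, policy):
    # Stand-in check: budget only, same return shape as the article's check_child.
    budget = child.get("budget_tokens")
    ok = (isinstance(budget, int) and not isinstance(budget, bool)
          and 0 < budget <= policy["budget_cap_tokens"])
    reasons = [] if ok else ["budget_tokens missing, non-positive or over cap"]
    return child.get("role", "<no-role>"), ("PASS" if ok else "BLOCK"), reasons

policy = {"budget_cap_tokens": 200000}
sent = []                                   # stands in for the real handoff channel
good_plan = [{"role": "doc-reader", "budget_tokens": 40000}]
dispatch_with_gate(good_plan, policy, budget_only_check, sent.append)

bad_plan = good_plan + [{"role": "home-cleaner", "budget_tokens": 5000000}]
try:
    dispatch_with_gate(bad_plan, policy, budget_only_check, sent.append)
    refused = False
except SpawnRefused as exc:
    refused = True
    print("refused before dispatch:", exc)
```

This version refuses the whole plan when any child is blocked. Dispatching the passing children and refusing only the blocked ones is an equally plausible design; which one fits depends on whether the children are independent.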

Here is the question I actually want answered, and I do not have a clean answer myself. For anyone running a multi-agent orchestrator in anger: where does authorization of the spawn live in your stack right now? Is it a real check before the parent hands off, or is it a trace you read afterwards and hope you get to in time? Because if it is the trace, issue #28058 is a preview of the day it goes dark.

I write about pre-execution control for AI agents: gates that decide before the work runs, not dashboards that explain it after. Every post ships a tool you can run offline, with no keys. Follow along if that is your kind of thing, and if your orchestrator has a pre-dispatch story, put it in the comments. I am especially interested in anyone who has bolted an authorization check onto a spawn and had it survive contact with a real agent loop.

AI Agent Cost Drift: 0.35%/day Is Invisible to Your Dashboard

Alexey Spinov — Wed, 15 Jul 2026 00:47:52 +0000

AI agent cost drift is the slow growth of your input floor — system prompt, tool schemas, CLAUDE.md, MCP servers — that a rolling baseline never catches, because the baseline climbs with it. drift_anchor_gate.py pins a frozen canary on day 0 and compares: a 0.35%/day creep raised zero alarms in 60 days; the anchor blocked it on day 9.

Your fleet dashboard fires when today's average run jumps 20 percent above last week. Your context floor grows 0.35 percent a day. Those two numbers never meet. I built six 60-day worlds and ran four rolling detectors over them, each at the tightest threshold that stays quiet on a flat fleet, and the slow creep raised zero alarms in all four, across 60 days, while the floor went up 22.6 percent. A frozen anchor caught the same creep on day 9.

AI disclosure: I wrote drift_anchor_gate.py and make_worlds.py with an AI assistant and ran them myself: Python 3.13.5, offline, standard library only, no network, no keys. Every number, exit code and sha256 below is pasted from a real local run. I ran the whole demo twice from a clean rm -rf worlds, and the two output.txt files are byte-for-byte identical (sha256 1772e695cb75f79d9e3f162ed4c49477a329703610cdf5ddff54cec2cc4da62a). The series are synthetic, and I say exactly how they are generated below. The one thing that is not synthetic is the arithmetic, and it is the part that does the work.

In short:

A baseline computed from your own recent history (rolling median, rolling mean, EWMA over your fleet) is a detector of speed. The invoice is charged on level. Those are different quantities, and slow cost drift lives in the gap between them.
The blind band is arithmetic, not a finding: a window of length w firing at ratio T cannot see any uniform daily growth below g* = T^(2/w) - 1. For w=7, T=1.20 that is 5.35 percent a day. Real context creep runs at a tenth of that.
Measured on my fixtures: a 0.35 percent/day creep produced 0 alarms from all four rolling detectors in 60 days. The frozen anchor blocked it on day 9, at a 2 percent tolerance, with 0 false alarms on a flat fleet.
The demo that matters: a world where every local file is byte-for-byte identical to day 0 (diff -rq prints nothing, git diff would print nothing), the vendor quietly bumped the harness scaffold, the fleet dashboard's short-window detectors stayed silent (the long window fired once, on day 58 — 46 days late), and the anchor blocked on day 12 with the growth attributed to vendor: +2600 B.
The dashboard wins a world too, and I show it: when the work per run doubles and the floor never moves, the anchor passes all 60 days and the rolling detector fires on day 30. Keep both.
Where this article is wrong: when the floor is about 90 percent or more of a typical request, the rolling detector does see the creep. Sweep included.

The comment I owed a measurement to

On July 13, Dipankar Sarkar replied to my post on usage logs that report zero tokens for delivered text with a sharp correction. His argument: a provider tokenizer bump and real suppression look identical if you only watch a threshold, so the disambiguator should be population, not threshold. "So the block condition isn't 'delta widened past X,' it's 'this stream's delta diverged from the fleet's,' per-stream anomaly against a rolling baseline of everyone else."

He is right, and I said so. Then I argued the thing that became this article: a rolling baseline adapts, so anything that grows slower than the adaptation window never diverges from the fleet. It walks the fleet along with it. Boiling frog. What you need is a second anchor that is frozen at a known-good date, and for a solo developer with no fleet, a canary request with deterministic input is a population of one.

And then I ended my own comment with this:

"To be straight about status, this is a design argument and not something I have measured. The re-tokenized floor I have run. The fleet-versus-canary discrimination I have not."

That was two days ago. This post is me paying that off. The domain moved (I am measuring input cost drift, not usage suppression) but the structure of the question is the same one Dipankar and I were arguing about: can a baseline built from your own recent history see a thing that moves your own recent history?

Start with the arithmetic, before any data exists

This part is not a discovery. It is a derivation, and it is worth doing on paper before you touch a fixture, because it tells you what the measurement is allowed to find.

A rolling window compares today against a baseline built from the previous w days. That baseline is centred roughly w/2 days back. Under uniform daily growth g, today's level over the baseline's level is about (1+g)^(w/2). The alarm fires when that ratio exceeds the threshold T. Solve for g:

g* = T^(2/w) - 1

Any growth slower than g* is invisible forever. Not "for a while". The ratio the detector computes stops rising, because the baseline is climbing at the same rate as the signal. Here is what the tool prints before it has read a single byte of data:

  window   threshold   invisible below   compounds to, over 30 days
  w=7      T=1.15         4.07%/day        x3.31
  w=7      T=1.20         5.35%/day        x4.77
  w=14     T=1.20         2.64%/day        x2.18
  w=28     T=1.20         1.31%/day        x1.48
  w=28     T=1.10         0.68%/day        x1.23

Read the right-hand column. A seven-day window at a 20 percent threshold is blind to any drift that multiplies your input cost by up to 4.77x over a month. That is the prediction. The measurement is not the formula; it is the question of where real context creep lands relative to that band, and whether you can pick a T low enough to close it without the alarm screaming every Tuesday.
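The table can be reproduced from the formula alone. This is a short sketch, not the tool's own printing code; blind_band is a hypothetical name for g* = T^(2/w) - 1, and the 30-day column is simply (1 + g*)^30.

```python
def blind_band(w, T):
    """Largest uniform daily growth a window of length w at ratio T cannot see."""
    return T ** (2.0 / w) - 1.0

rows = [(7, 1.15), (7, 1.20), (14, 1.20), (28, 1.20), (28, 1.10)]
for w, T in rows:
    g = blind_band(w, T)
    month = (1.0 + g) ** 30              # what that growth compounds to over 30 days
    print("w=%-3d T=%.2f  invisible below %.2f%%/day  x%.2f over 30 days"
          % (w, T, 100 * g, month))
```

Because the exponent is 2/w, doubling the window roughly halves the blind band, while lowering T helps only logarithmically.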

The fixtures, declared before I ran them

Sixty days. Forty runs per day, fixed. I fixed the run count on purpose: variable traffic adds noise to a daily average, and noise is the thing that pushes the dashboard's threshold up. Removing it is a handicap in the dashboard's favour, and I would rather hand the dashboard its best case than be accused of rigging it.

The floor at pin time is 61,060 bytes:

field	bytes	origin
`tools_json`	26,000	local file
`system_prompt`	12,000	local file
`instructions` (CLAUDE.md)	9,000	local file
`harness_scaffold`	6,000	vendor, not a local file
`mcp_a`, `mcp_b`	4,000 each	local files (the MCP token tax)
`session_id`, `timestamp`	36 + 24	runtime, declared volatile

Work per run is a lognormal draw, median 18,000 bytes, sigma 0.8, multiplied by a daily job-mix factor (lognormal, sigma 0.15). So a typical run is 86,033 bytes and the floor is 71.0 percent of it. Every world shares one noise realization and one anchor: only the floor trajectory differs. The dashboard is never handed a harder draw than the anchor gets.
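For readers who want to rebuild something similar, here is a sketch of the floor sum and one day of runs. It is not make_worlds.py: the seed, the draw_work helper and the choice of a job-mix factor with median 1 are assumptions, so the draws will not match the article's noise realization. The 86,033-byte typical run is the figure quoted above, not something this sketch derives.

```python
import math
import random

# Floor fields and byte counts, copied from the table above.
FLOOR_BYTES = {
    "tools_json": 26000, "system_prompt": 12000, "instructions": 9000,
    "harness_scaffold": 6000, "mcp_a": 4000, "mcp_b": 4000,
    "session_id": 36, "timestamp": 24,
}
floor = sum(FLOOR_BYTES.values())

def draw_work(rng, day_factor, median=18000, sigma=0.8):
    # Lognormal with the stated median: mu is the log of the median.
    return rng.lognormvariate(math.log(median), sigma) * day_factor

rng = random.Random(0)                       # arbitrary seed, an assumption
day_factor = rng.lognormvariate(0.0, 0.15)   # job-mix factor, median 1 assumed
runs = [floor + draw_work(rng, day_factor) for _ in range(40)]

floor_share = round(100 * floor / 86033, 1)  # share of the quoted typical run
print(floor, floor_share)
```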

The six worlds:

world	what happens
`control`	nothing changes
`creep`	one policy section appended to CLAUDE.md every day, a 3rd MCP server on day 14, a 4th on day 33, 12 tool schemas on day 40
`slowcreep`	one policy section every third day, one MCP server on day 33
`spike`	someone pastes a 40 KB architecture doc into CLAUDE.md on day 30
`harness`	no local file changes at all. The vendor bumps its scaffold on days 12, 31 and 48
`workload`	the floor never moves. Work per run doubles on day 30

Each policy section is uniquely generated, 380 to 620 bytes, with its own heading and body. I mention that because my last cost article was dropped in review for a fixture that repeated one paragraph thirty times, which was a fair hit.

Growth rates that come out of that: creep runs at +1.044 percent/day and reaches 112,700 bytes. slowcreep runs at +0.347 percent/day and reaches 74,885 bytes, up 22.6 percent. Nobody would notice either one by looking.

Six worlds, one anchor, one command

Every rolling detector gets its own best deal: the tool walks a threshold ladder from 1.05 upward and picks the lowest threshold that raises zero false alarms on the flat world. That is the most generous calibration you can honestly give an alerting rule, because the first thing that happens to a noisy rule in real life is that somebody raises the threshold until it shuts up.
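A minimal sketch of that calibration for the rolling-median case, assuming the alarm rule is "today exceeds T times the median of the previous w days". The exact rule in drift_anchor_gate.py may differ; rolling_alarm_days and calibrate are hypothetical names, and the three series are toy data rather than the article's fixtures.

```python
from statistics import median

def rolling_alarm_days(series, w, T):
    """Days d where series[d] exceeds T times the median of the previous w days."""
    return [d for d in range(w, len(series))
            if series[d] > T * median(series[d - w:d])]

def calibrate(flat_series, w, ladder):
    """Lowest threshold on the ladder that raises zero alarms on the flat series."""
    for T in sorted(ladder):
        if not rolling_alarm_days(flat_series, w, T):
            return T
    return None

# Toy flat world: deterministic 8 percent wobble, no trend.
flat = [100 if d % 2 == 0 else 108 for d in range(60)]
T = calibrate(flat, 7, [1.05, 1.10, 1.15, 1.20])

slow = [100 * 1.0035 ** d for d in range(60)]   # uniform 0.35%/day growth
step = [100] * 30 + [200] * 30                  # level doubles on day 30

slow_alarms = rolling_alarm_days(slow, 7, T)
step_alarms = rolling_alarm_days(step, 7, T)
print(T, slow_alarms, step_alarms)
```

On this toy data the calibrated detector stays silent on the slow series and fires for four consecutive days on the step, until the median itself crosses to the new level.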

Here is what came back, condensed from the real output:

world	rolling median w=7	rolling mean w=7	EWMA a=0.3	rolling median w=28	frozen anchor, tol 2%
`control`	0 alarms (T=1.20)	0 (T=1.20)	0 (T=1.20)	0 (T=1.15)	PASS, all 60 days
`creep` (1.04%/d)	1, on day 40	1, day 40	1, day 40	15, from day 30	BLOCK day 3
`slowcreep` (0.35%/d)	0	0	0	0	BLOCK day 9
`harness`	0	0	0	1, day 58	BLOCK day 12
`spike`	4, day 30	2, day 30	2, day 30	15, day 30	BLOCK day 30
`workload`	4, day 30	2, day 30	2, day 30	14, day 30	PASS, all 60 days

Look at the creep row for the short windows. One alarm each, and it lands on day 40, the day twelve tool schemas were installed. Not on the creep. On the step. The detector is doing exactly what it is built to do, which is notice a discontinuity, and the 51 KB the floor gained by walking there is invisible to it.

Now the long window, w=28 at T=1.15. The arithmetic said it is blind below 1.00 percent/day. creep runs at 1.044 percent/day, just above the line, and the detector fires 15 times starting on day 30. slowcreep runs at 0.347 percent/day, below the line, and it fires zero times in 60 days. The formula predicted both the hit and the miss before the data existed. That is the whole argument in two rows, and it is why I put the arithmetic first.

The long window is not free, by the way. It paid for that sensitivity with 15 alarms on spike and 14 on workload.

The demo that made me build this: git diff is empty and you are paying more

The harness world is the one I would put in front of a skeptic.

Nothing in the repository changes. Not one byte. The demo runs diff -rq between a day-0 snapshot and a day-59 snapshot of every local file that feeds the request:

diff -rq repo_day000 repo_day059: no differences. git diff would print nothing.

sha256 of every local file, day 0 vs day 59:
  d0  063d5a8a0fe8267fc4093041701996f728662aab22b0763dd0225de767bff895  ./CLAUDE.md
  d0  e3b98852646c31827f303b3762410a3ca42f3a51c2a2201614e21b76fae51e70  ./mcp/a.json
  d0  07acba3d89f17bc9467bcbac3f2bcc72cfff4c07ad65694d86691a66a080f80c  ./mcp/b.json
  d0  25cb751f2526dafd5e0f7273fe29a7abac797424f442cfff28ea82ca013bf8cf  ./system_prompt.txt
  d0  3dc0f40d1489590ef5a98c464c11378f97e4133aa1a6fcc984fa5d386525440c  ./tools.json
  d59 063d5a8a0fe8267fc4093041701996f728662aab22b0763dd0225de767bff895  ./CLAUDE.md
  d59 e3b98852646c31827f303b3762410a3ca42f3a51c2a2201614e21b76fae51e70  ./mcp/a.json
  d59 07acba3d89f17bc9467bcbac3f2bcc72cfff4c07ad65694d86691a66a080f80c  ./mcp/b.json
  d59 25cb751f2526dafd5e0f7273fe29a7abac797424f442cfff28ea82ca013bf8cf  ./system_prompt.txt
  d59 3dc0f40d1489590ef5a98c464c11378f97e4133aa1a6fcc984fa5d386525440c  ./tools.json

Meanwhile the vendor added 2,600 bytes of tool-call protocol preamble to every request on day 12, another 3,100 on day 31, another 3,400 on day 48. By day 59 the rendered request carries 14.9 percent more floor than it did at pin time, on every single call, forever.

Three instruments looked at that world. git diff reported nothing. The rolling detectors on the usage log reported nothing (the long window eventually woke up on day 58, which is 46 days late and only after the third bump). The anchor reported this, on day 12, from the same command:

$ drift_anchor_gate.py check worlds/anchor.json worlds/harness/canary/day012.json
anchor     61060 B   pinned day 0   tolerance 2.0%
today      63660 B   day 12           drift +4.26%

  harness_scaffold      +2600 B   vendor
  unchanged (same sha256): instructions, mcp_a, mcp_b, session_id, system_prompt, timestamp, tools_json

  from local files: +0 B      from the vendor: +2600 B

verdict: BLOCK  ->  exit 1   (drift +4.26% is above the 2.0% tolerance)
exit=1

The same command, the same anchor, the same day, in the control world:

$ drift_anchor_gate.py check worlds/anchor.json worlds/control/canary/day012.json
today      61060 B   day 12           drift +0.00%
verdict: PASS   ->  exit 0   (drift +0.00% is within the 2.0% tolerance)
exit=0

That line, "from local files: +0 B, from the vendor: +2600 B", is the reason the anchor measures the rendered request and not the files you remembered to list. Your repo is not the thing you pay for. The rendered request is the thing you pay for.
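The comparison behind that report can be sketched in a few lines. This is a simplified stand-in, not drift_anchor_gate.py: anchor_check is a hypothetical name, it compares byte counts per field rather than sha256 digests, and the origin labels are taken from the fixture table above.

```python
def anchor_check(anchor, today, origin, tolerance=0.02):
    """Compare today's rendered request to the frozen anchor, field by field.

    anchor / today map field -> bytes; origin maps field -> source label.
    Only growth above the tolerance blocks; shrinkage passes in this sketch.
    """
    base = sum(anchor.values())
    now = sum(today.values())
    drift = (now - base) / base
    growth_by_origin = {}
    for field in sorted(set(anchor) | set(today)):
        delta = today.get(field, 0) - anchor.get(field, 0)
        if delta:
            src = origin.get(field, "unknown")
            growth_by_origin[src] = growth_by_origin.get(src, 0) + delta
    verdict = "BLOCK" if drift > tolerance else "PASS"
    return verdict, drift, growth_by_origin

anchor = {"tools_json": 26000, "system_prompt": 12000, "instructions": 9000,
          "harness_scaffold": 6000, "mcp_a": 4000, "mcp_b": 4000,
          "session_id": 36, "timestamp": 24}
origin = {"tools_json": "local", "system_prompt": "local", "instructions": "local",
          "harness_scaffold": "vendor", "mcp_a": "local", "mcp_b": "local",
          "session_id": "runtime", "timestamp": "runtime"}

# Day 12 of the harness world: only the vendor scaffold grew, by 2,600 bytes.
day12 = dict(anchor, harness_scaffold=8600)

verdict, drift, growth = anchor_check(anchor, day12, origin)
print(verdict, "%+.2f%%" % (100 * drift), growth)
```

A byte-count comparison misses a same-size content change, which a per-field digest would catch; for cost drift the size is what gets billed, so the shortcut is tolerable in a sketch.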

Where the dashboard wins and my anchor is blind

I would not trust this article if it did not have this section, so here it is.

The workload world: the floor never changes, and on day 30 the work per run doubles (retrieval starts returning twice the chunks, users paste bigger inputs, whatever). The rolling detector fires on day 30 and it is completely correct to do so. The anchor passes all 60 days, because the frozen canary task is not doing more work. Its input really did not drift. The anchor is blind to the entire class of cost growth that lives in the workload (the context tax, where every step re-bills the whole transcript), and that class is real and often bigger than the floor.

Now put spike and workload side by side in that results table. Look at the rolling columns. Same detector, same day 30, near-identical alarm counts (4, 2, 2 on the short windows, 15 vs 14 on the long one). From inside the usage log those two worlds are indistinguishable. One of them is a 40 KB doc pasted into an instruction file that you will now pay for on every call until someone deletes it. The other is your product doing more work, which is what you wanted. The dashboard raises the identical alarm for both.

The anchor tells them apart: BLOCK on the paste, PASS on the workload. That is not the anchor being more sensitive. It is the anchor answering a different, narrower question, and answering it cleanly.
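Why the usage log cannot separate the two worlds is easiest to see with round numbers. A sketch, assuming a hypothetical average of 40,000 B of work per run (chosen so that doubling the work adds exactly as many bytes as the 40 KB paste); the anchor size and tolerance are the ones from this article:

```python
# Two worlds that look identical in a fleet average and different to a frozen anchor.
ANCHOR_BYTES = 61_060      # floor at pin time, from the article
TOLERANCE = 0.02           # 2.0% anchor tolerance, from the article
WORK_BYTES = 40_000        # ASSUMED average work per run before day 30
PASTE_BYTES = 40_000       # the 40 KB doc pasted into the instruction file


def fleet_ratio(floor_before, work_before, floor_after, work_after):
    """What a usage-log dashboard sees: total input after / total input before."""
    return (floor_after + work_after) / (floor_before + work_before)


def anchor_verdict(floor_after):
    """What the frozen canary sees: only the floor, never the work."""
    drift = floor_after / ANCHOR_BYTES - 1
    return "BLOCK" if drift > TOLERANCE else "PASS"


spike = fleet_ratio(ANCHOR_BYTES, WORK_BYTES, ANCHOR_BYTES + PASTE_BYTES, WORK_BYTES)
workload = fleet_ratio(ANCHOR_BYTES, WORK_BYTES, ANCHOR_BYTES, 2 * WORK_BYTES)

print("dashboard ratio  spike %.3f   workload %.3f" % (spike, workload))
print("anchor verdict   spike %s   workload %s"
      % (anchor_verdict(ANCHOR_BYTES + PASTE_BYTES), anchor_verdict(ANCHOR_BYTES)))
```

The dashboard sums floor and work before it compares, so the two changes collapse into one number. The anchor never sees the work term at all.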

So the honest summary is a division of labour, not a replacement. Keep the rolling detector. It catches the paste, the step, and the workload change. Add the anchor. It catches the year.

Where this article is wrong

There is one parameter that can flip the whole result, so I swept it and I am publishing the sweep: how much of a typical request is floor?

The noise in a fleet average comes from the work, not from the floor. The floor is the same on every call. So the less work your runs carry, the quieter your fleet average is, the lower a threshold you can run without false alarms, and the closer your dashboard gets to being a canary all by itself. Here is the sweep on the slowcreep world, with the w=7 detector recalibrated to its own zero-false-alarm threshold at each work level:

  work_median  floor share   w=7 clean T   blind below   alarms on slowcreep   first
  wm002000      95.8%       T=1.05          1.40%/day    4 <- sees the creep  33
  wm005000      89.8%       T=1.05          1.40%/day    6 <- sees the creep  12
  wm010000      81.8%       T=1.15          4.07%/day    0                    -
  wm018000      71.7%       T=1.15          4.07%/day    0                    -
  wm030000      59.2%       T=2.00         21.90%/day    0                    -
  wm060000      41.7%       T=2.00         21.90%/day    0                    -

At a floor share around 90 percent and above, the dashboard sees the creep and this article's headline does not apply to you. The flip in my run happens between 81.8 percent (blind) and 89.8 percent (sees). If your agent's requests are almost entirely system prompt and tool schemas with a tiny user turn on top, you already own a canary. You did not have to freeze anything.
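The claim that the noise lives in the work and not in the floor can be checked with a toy fleet. A sketch under stated assumptions: the lognormal spreads 0.8 and 0.15 match WORK_SIGMA and SIGMA_DAY in make_worlds.py, but the seed is arbitrary, so the numbers it prints will not reproduce the table above. Only the direction is the point.

```python
import random
import statistics

FLOOR_BYTES = 61_060   # floor at pin time, from the article
RUNS_PER_DAY = 40
DAYS = 60


def daily_means(work_median, seed=1):
    """Mean input bytes per day: a constant floor plus lognormal per-run work."""
    rng = random.Random(seed)          # ASSUMED seed, not the one the fixtures use
    out = []
    for _ in range(DAYS):
        day_mix = rng.lognormvariate(0.0, 0.15)
        runs = [work_median * rng.lognormvariate(0.0, 0.8) * day_mix
                for _ in range(RUNS_PER_DAY)]
        out.append(FLOOR_BYTES + statistics.mean(runs))
    return out


def relative_noise(series):
    """Day-to-day spread of the fleet mean, as a fraction of its level."""
    return statistics.stdev(series) / statistics.mean(series)


for wm in (2_000, 18_000, 60_000):
    s = daily_means(wm)
    share = FLOOR_BYTES / statistics.mean(s)
    print("work median %6d   floor share %5.1f%%   relative noise %.4f"
          % (wm, 100 * share, relative_noise(s)))
```

Because every work level reuses the same random draws, the only thing that changes between rows is how much of the daily mean is the constant floor. The relative noise rises as the floor share falls.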

For everyone else, notice what happens at the bottom of that table. At a 42 percent floor share, the ladder had to climb to T=2.00 to stay quiet on a flat fleet. A threshold of 2.00 means "wake me when input cost doubles day over day", and its blind band is 21.9 percent a day. Nobody ships that alert. That is the mechanism in one line: the noisier your work, the higher your threshold must go, and the wider the door you leave open for drift.
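The "blind below" column is the formula g* = T^(2/w) - 1, the same arithmetic as blind_band in drift_anchor_gate.py. A standalone sketch that reproduces the three distinct values in the table for w=7:

```python
def blind_band(threshold: float, window: int) -> float:
    """Largest uniform daily growth g with (1 + g) ** (window / 2) <= threshold."""
    return threshold ** (2.0 / window) - 1.0


for t in (1.05, 1.15, 2.00):
    g = blind_band(t, 7)
    print("w=7  T=%.2f   blind below %5.2f%%/day   x%.2f over 30 days"
          % (t, 100 * g, (1 + g) ** 30))
```

Raising the threshold to silence noise widens the band directly, with no data involved.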

I swept the day-to-day noise too, sigma from 0.05 to 0.25. At sigma 0.05, a very quiet fleet, the w=7 detector does raise one alarm on slowcreep, on day 36 (27 days after the anchor's BLOCK on day 9). At every other noise level: zero.

Bytes, not tokens, not dollars

The tool counts bytes. That is deliberate, and it is also the answer to the first objection I would raise if I were reading this.

Drift is a ratio, and a ratio survives any linear rescale: (a*x2) / (a*x1) = x2 / x1. Whatever your bytes-per-token constant is, whatever your price per token is, it cancels. So I do not need a tokenizer to say the floor grew 22.6 percent, and I refuse to multiply by a made-up constant to make the number look like money.

Caching is the interesting version of this objection. A stable prefix is cacheable, and a cache read can cost about a tenth of a fresh write, so caching changes the level of what you pay by a large factor. It does not change the ratio, so it does not change the verdict. The floor still grew 22.6 percent, you are still paying the drift, and you are paying it on every call. There is a nastier detail underneath: editing an instruction file invalidates the prefix, so on the day someone appends a policy section you pay a full cache write on top (the cache-break detector is about exactly that failure). Slow creep is not one cheap event. It is a cheap event that also throws away your cache, repeatedly.

If you want dollars, multiply. Just do not let the multiplication happen before the comparison.
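A sketch of why the unit does not matter, assuming the conversion from bytes to tokens to dollars is linear. The three constants are hypothetical placeholders, not measured values; the byte counts are the harness world on day 0 and day 59:

```python
import math

ANCHOR_BYTES = 61_060
TODAY_BYTES = 70_160          # harness world, day 59

# HYPOTHETICAL constants, for illustration only. None of them is measured.
BYTES_PER_TOKEN = 4.0
USD_PER_TOKEN = 3e-6
CACHE_READ_FACTOR = 0.1       # "about a tenth of a fresh write"


def drift(before: float, after: float) -> float:
    """Relative growth of any quantity, in any unit."""
    return after / before - 1


def usd(n_bytes: float, cached: bool = False) -> float:
    """Linear rescale from bytes to dollars, optionally at the cached rate."""
    cost = n_bytes / BYTES_PER_TOKEN * USD_PER_TOKEN
    return cost * CACHE_READ_FACTOR if cached else cost


in_bytes = drift(ANCHOR_BYTES, TODAY_BYTES)
in_usd = drift(usd(ANCHOR_BYTES), usd(TODAY_BYTES))
in_cached_usd = drift(usd(ANCHOR_BYTES, cached=True), usd(TODAY_BYTES, cached=True))

print("drift in bytes      %+.2f%%" % (100 * in_bytes))
print("drift in dollars    %+.2f%%" % (100 * in_usd))
print("drift, cached rate  %+.2f%%" % (100 * in_cached_usd))
```

Change any of the three constants and the level moves while all three drift figures stay the same, which is the reason to compare first and multiply afterwards.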

How this differs from the two gates I already shipped

Two of my own posts are close enough to this one that I owe you a direct answer.

Sliding-Window Spend Guard also uses a window. The difference is where the threshold comes from: there, the cap is absolute ("$X per window, refuse the next call"), and the window is only a way to aggregate. Here, the whole subject is a threshold derived from your own history, which is the thing that fails. An absolute cap is immune to this failure mode by construction. That immunity is also the reason the window guard is a good design and a fleet-relative alert is a trap: only the absolute cap has a baseline that cannot be moved by the drift it is supposed to catch.
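The difference between a baseline that moves and one that does not fits in a few lines. A sketch on a hypothetical input, a floor that grows a uniform 1 percent per day; the two helper functions are illustrative stand-ins, not the code of either gate:

```python
import statistics


def rolling_alarms(series, window, threshold):
    """Days where today exceeds the median of the previous `window` days by `threshold`."""
    return [i for i in range(window, len(series))
            if series[i] / statistics.median(series[i - window:i]) > threshold]


def fixed_alarms(series, baseline, tolerance):
    """Days where today exceeds a baseline that never moves by more than `tolerance`."""
    return [i for i, x in enumerate(series) if x / baseline - 1 > tolerance]


# HYPOTHETICAL input: a floor of 61,060 B growing a uniform 1% per day for 60 days.
series = [61_060 * 1.01 ** day for day in range(60)]

moving = rolling_alarms(series, window=7, threshold=1.15)
fixed = fixed_alarms(series, baseline=series[0], tolerance=0.02)

print("rolling median w=7 T=1.15: %d alarms" % len(moving))
print("fixed baseline tol=2%%: first alarm on day %d, final level x%.2f"
      % (fixed[0], series[-1] / series[0]))
```

The rolling median is always about four days behind, so it only ever sees four days of growth. The fixed baseline sees all of it.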

MCP Tool Pin Verify also pins a hash. The difference is what the pin is for: there it protects the semantic integrity of a tool description against a malicious rug-pull. Here the pin protects a level, and the adversary is not an attacker, it is your own team, being helpful, one policy section at a time. Same mechanism, different failure. Pinning tool manifests does not tell you your floor grew, and this gate does not tell you a tool description turned hostile. Run both.

Both belong to the same family: a pre-execution gate that decides before the money is spent instead of a chart that explains the money after.

The exit contract, and the bug I refuse to ship again

A cost gate has exactly one bug it must never have: reporting "all clear" because the artifact it was supposed to measure went missing. My previous gate had that bug (a zero-byte baseline made a division produce 0.0, which was below the threshold, which was a pass, on 2.25 million tokens). It was caught in review and the article was dropped. Fair.

So this one has three exits and a rule:

0  PASS        drift within tolerance, canary frozen, every declared field present and non-empty.
1  BLOCK       drift above tolerance.
2  STRUCTURAL  the input is unusable.
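The fail-open shape is easy to write by accident. A minimal sketch of one way it can happen; naive_drift is a hypothetical reconstruction for illustration, not the code of the dropped gate, and strict_drift is one possible fail-closed alternative under the three-exit contract:

```python
def naive_drift(baseline_bytes: int, today_bytes: int) -> float:
    """HYPOTHETICAL fail-open shape: guard the division, return 0.0 when it cannot run."""
    if baseline_bytes == 0:
        return 0.0                      # "no drift" is now indistinguishable from "no data"
    return today_bytes / baseline_bytes - 1


class Unusable(Exception):
    """The measurement cannot be made; callers must not score this as a pass."""


def strict_drift(baseline_bytes: int, today_bytes: int) -> float:
    """Fail-closed shape: a missing or empty artifact is an error, never a number."""
    if baseline_bytes <= 0 or today_bytes <= 0:
        raise Unusable("baseline %d B, today %d B" % (baseline_bytes, today_bytes))
    return today_bytes / baseline_bytes - 1


def gate(drift_fn, baseline_bytes, today_bytes, tolerance=0.02) -> int:
    """Exit code under the contract: 0 PASS, 1 BLOCK, 2 STRUCTURAL."""
    try:
        drift = drift_fn(baseline_bytes, today_bytes)
    except Unusable:
        return 2
    return 1 if drift > tolerance else 0


print("naive,  zero-byte baseline ->", gate(naive_drift, 0, 9_000_000))
print("strict, zero-byte baseline ->", gate(strict_drift, 0, 9_000_000))
```

The design choice is that "cannot measure" gets its own exit code, so no default value can ever stand in for a reading.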

There is no path where a missing, emptied or truncated artifact returns 0. (A record that faithfully reports a genuinely smaller request is a PASS: that is the whole point of measuring level, not a fail-open. What cannot happen is the measurement vanishing, or a field dropping out, and being scored as "no drift.") Nine degenerate inputs, all from the real run; five of them are shown here:

--- dropped_field ---
STRUCTURAL: declared field missing from today's record: tools_json
STRUCTURAL: a bytes-only check would have called this -42.6% drift and passed it
verdict: STRUCTURAL  ->  exit 2   (fail-closed: unusable input is never a pass)
exit=2

--- zero_byte_field ---
STRUCTURAL: declared field rendered zero bytes: mcp_b
exit=2

--- unknown_field ---
STRUCTURAL: undeclared field in today's record: mcp_c
STRUCTURAL: you do not know everything your harness is sending; re-pin deliberately
exit=2

--- task_changed ---
STRUCTURAL: canary task changed: task_sha256 c81db92e6a9f... != pinned 7439b8b0addf...
STRUCTURAL: a canary that drifts is not a canary
exit=2

--- volatile_grew ---
STRUCTURAL: volatile field session_id changed length: 36 B pinned, 96 B today
STRUCTURAL: a volatile field with a moving length is not volatile, it is drift in disguise
exit=2

Plus missing, empty, truncated, and inconsistent_total (the parts do not sum to the stated total). Nine cases, nine exit 2s. The dropped_field line is the one to stare at: a naive bytes-only comparison reads that record as 42.6 percent cheaper than the anchor and passes it. That is the fail-open shape, and it is why the manifest is a manifest and not a number.
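The dropped_field case can be reproduced from the per-field byte counts in the anchor. A sketch, with two illustrative functions (neither is the gate's own code): one compares totals only, the other compares the set of fields before it compares any number.

```python
ANCHOR = {"tools_json": 26_000, "system_prompt": 12_000, "instructions": 9_000,
          "harness_scaffold": 6_000, "mcp_a": 4_000, "mcp_b": 4_000,
          "session_id": 36, "timestamp": 24}          # bytes per field, from the anchor
ANCHOR_TOTAL = sum(ANCHOR.values())                   # 61,060 B
TOLERANCE = 0.02


def bytes_only(today: dict) -> str:
    """Naive check: compare totals and nothing else."""
    drift = sum(today.values()) / ANCHOR_TOTAL - 1
    return "BLOCK" if drift > TOLERANCE else "PASS (%+.1f%%)" % (drift * 100)


def manifest_check(today: dict) -> str:
    """Compare the set of fields before comparing any number."""
    missing = sorted(set(ANCHOR) - set(today))
    unknown = sorted(set(today) - set(ANCHOR))
    empty = sorted(k for k, n in today.items() if n <= 0)
    if missing or unknown or empty:
        return "STRUCTURAL"
    drift = sum(today.values()) / ANCHOR_TOTAL - 1
    return "BLOCK" if drift > TOLERANCE else "PASS"


dropped = {k: n for k, n in ANCHOR.items() if k != "tools_json"}
print("bytes-only:", bytes_only(dropped))
print("manifest:  ", manifest_check(dropped))
```

A one-sided tolerance is what makes the naive version dangerous: any loss of data looks like a saving.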

And on the typical scenario, the one this tool exists for, it fails closed in the direction that costs someone a conversation: slowcreep day 8 passes at +1.80 percent, day 9 blocks at +2.79 percent, exit 1, attribution instructions +1706 B local:CLAUDE.md. Your CI goes red because somebody added three policy sections over nine days. That is the intended behaviour, and if it annoys you, re-pin the anchor. Deliberately. In a commit. With your name on it.
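The gate in this article has no re-pin command, so what a deliberate re-pin looks like is left to you. One possible shape, as a sketch: repin is a hypothetical helper, it returns a new anchor instead of writing one, and it assumes the anchor and record layouts shown in this article.

```python
import copy


def repin(anchor: dict, today: dict, pinned_at: str) -> dict:
    """Return a new anchor whose levels come from today's record.

    HYPOTHETICAL helper, not part of drift_anchor_gate.py. It refuses to re-pin
    when the canary or the set of fields changed, because that needs a human
    decision about origin and volatility rather than a copy.
    """
    if today["task_sha256"] != anchor["task_sha256"]:
        raise ValueError("canary task changed; pin a new canary instead")
    if set(today["fields"]) != set(anchor["manifest"]):
        raise ValueError("field set changed; declare the new manifest by hand")
    new = copy.deepcopy(anchor)          # never mutate the anchor that is in the repo
    new["pinned_at"] = pinned_at
    new["rendered_bytes"] = today["rendered_bytes"]
    for name, rec in today["fields"].items():
        new["manifest"][name]["bytes"] = rec["bytes"]
        new["manifest"][name]["sha256"] = rec["sha256"]
    return new
```

The result still has to be written to anchor.json and committed by a person. The sketch does not repeat the gate's structural checks, so it is only safe on a record the gate has already read as BLOCK or PASS.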

The anchor

It is 1,726 bytes and there is nothing clever in it:

{
  "canary_id": "floor-canary-v1",
  "pinned_at": "day 0",
  "tolerance_pct": 2.0,
  "task_sha256": "7439b8b0addf5009c942bb2f1c42c0c4e3489b13990957f8a47253aaa429a8c8",
  "rendered_bytes": 61060,
  "manifest": {
    "tools_json":       {"bytes": 26000, "sha256": "3dc0f40d...", "origin": "local:tools.json",       "volatile": false},
    "system_prompt":    {"bytes": 12000, "sha256": "25cb751f...", "origin": "local:system_prompt.txt","volatile": false},
    "instructions":     {"bytes":  9000, "sha256": "063d5a8a...", "origin": "local:CLAUDE.md",        "volatile": false},
    "harness_scaffold": {"bytes":  6000, "sha256": "39b235eb...", "origin": "vendor",                 "volatile": false},
    "mcp_a":            {"bytes":  4000, "sha256": "e3b98852...", "origin": "local:mcp/a.json",       "volatile": false},
    "mcp_b":            {"bytes":  4000, "sha256": "07acba3d...", "origin": "local:mcp/b.json",       "volatile": false},
    "session_id":       {"bytes":    36, "sha256": "96311dd8...", "origin": "runtime",                "volatile": true},
    "timestamp":        {"bytes":    24, "sha256": "2956a7fc...", "origin": "runtime",                "volatile": true}
  }
}

(Hashes truncated here for width. They are full-length in the file.) One detail worth pausing on: instructions is pinned at 063d5a8a..., and that is the same digest shasum printed for CLAUDE.md on day 59 of the harness world. The file is provably untouched, and the request still grew.

Today's record is the same shape: per-field byte counts and digests for the rendered request. Byte counts and hashes, not payloads, which is the shape a privacy-conscious request log already has. If your harness can dump the request it is about to send, you can produce this. If it cannot, that is a finding.
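A sketch of how a harness might produce such a record, assuming it can hand over the rendered parts as text keyed by field name. The function name and its arguments are hypothetical; the output keys are the ones the gate reads.

```python
import hashlib


def record_of(canary_id: str, day: int, task: str, rendered: dict) -> dict:
    """Build a per-field record (byte count + sha256) from rendered request parts.

    `rendered` maps a field name to the exact text the harness is about to send.
    Only sizes and digests are kept, never the payload itself.
    """
    fields = {}
    for name, text in rendered.items():
        raw = text.encode("utf-8")                 # count bytes, not characters
        fields[name] = {"bytes": len(raw), "sha256": hashlib.sha256(raw).hexdigest()}
    return {
        "canary_id": canary_id,
        "day": day,
        "task_sha256": hashlib.sha256(task.encode("utf-8")).hexdigest(),
        "rendered_bytes": sum(f["bytes"] for f in fields.values()),
        "fields": fields,
    }
```

Encoding before measuring matters as soon as a prompt contains non-ASCII text: the character count and the byte count diverge, and the byte count is the one that tracks what is sent.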

The code

Three files. All of it, because a reproducibility claim you cannot check is just a claim.

drift_anchor_gate.py:

#!/usr/bin/env python3
"""drift_anchor_gate.py - offline, keyless, zero-network, read-only, stdlib-only (Python 3.13).

One question: has the INPUT COST of a known-good run gone up since a known-good day?

A rolling baseline (median / mean / EWMA over your fleet) answers that question by
comparing today against your own recent history. It is a detector of SPEED. The
invoice is charged on LEVEL. This tool answers the same question against a baseline
that does not move: an anchor pinned on day 0 to a FROZEN canary task, so the only
thing that can change between then and now is what your harness wraps around it.

  check      one day against the anchor. Exit 0 PASS / 1 BLOCK / 2 STRUCTURAL.
  replay     N days, anchor and rolling detectors side by side on the same log.
  sweep      the two parameters that could flip the result: floor share, day noise.
  blindband  pure arithmetic, no data: the growth rate a rolling window cannot see.

Exit contract (fail-closed):
  0 PASS        drift within tolerance, canary frozen, every declared field present and non-empty
  1 BLOCK       drift above tolerance
  2 STRUCTURAL  input unusable: unreadable, empty, truncated, a declared field missing or
                rendered zero bytes, an undeclared field present, the canary task changed,
                a volatile field that changed length, a record whose parts do not sum to its total.

There is no path where a missing, emptied or truncated artifact returns 0, and a
dropped or zeroed field fails closed too. A gate that answers "no drift detected"
because the artifact vanished is worse than no gate.

This tool never writes, never opens a socket, never imports anything outside the stdlib.
"""

import hashlib
import json
import os
import statistics
import sys

LADDER = [1.05, 1.10, 1.15, 1.20, 1.25, 1.30, 1.40, 1.50, 2.00]
FAMILIES = [("rolling median w=7", "median", 7),
            ("rolling mean   w=7", "mean", 7),
            ("EWMA alpha=0.3", "ewma", 7),
            ("rolling median w=28", "median", 28)]
ALPHA = 0.3


# ---------------------------------------------------------------- the gate

def load_json(path, what):
    if not os.path.exists(path):
        die("%s not found: %s" % (what, path))
    if os.path.getsize(path) == 0:
        die("%s is zero bytes: %s" % (what, path))
    try:
        with open(path) as fh:
            return json.load(fh)
    except (ValueError, OSError) as exc:
        die("%s is not readable JSON: %s (%s)" % (what, path, exc.__class__.__name__))


def die(msg):
    print("STRUCTURAL: %s" % msg)
    print("verdict: STRUCTURAL  ->  exit 2   (fail-closed: unusable input is never a pass)")
    sys.exit(2)


def check(anchor, today, quiet=False):
    """Return (verdict, drift_pct, lines). verdict in PASS / BLOCK / STRUCTURAL."""
    out = []
    man = anchor["manifest"]
    fields = today.get("fields")
    if not isinstance(fields, dict) or not fields:
        return "STRUCTURAL", 0.0, ["no fields in today's record"]
    if today.get("canary_id") != anchor["canary_id"]:
        return "STRUCTURAL", 0.0, ["canary_id mismatch: %r vs pinned %r"
                                   % (today.get("canary_id"), anchor["canary_id"])]
    if today.get("task_sha256") != anchor["task_sha256"]:
        return "STRUCTURAL", 0.0, ["canary task changed: task_sha256 %s... != pinned %s..."
                                   % (str(today.get("task_sha256"))[:12], anchor["task_sha256"][:12]),
                                   "a canary that drifts is not a canary"]

    missing = [k for k in man if k not in fields]
    unknown = [k for k in fields if k not in man]
    empty = [k for k, v in fields.items() if v.get("bytes", 0) <= 0]
    if missing:
        naive = sum(v["bytes"] for v in fields.values()) / anchor["rendered_bytes"] - 1
        return "STRUCTURAL", 0.0, [
            "declared field missing from today's record: %s" % ", ".join(sorted(missing)),
            "a bytes-only check would have called this %+.1f%% drift and passed it" % (naive * 100)]
    if unknown:
        return "STRUCTURAL", 0.0, [
            "undeclared field in today's record: %s" % ", ".join(sorted(unknown)),
            "you do not know everything your harness is sending; re-pin deliberately"]
    if empty:
        return "STRUCTURAL", 0.0, ["declared field rendered zero bytes: %s" % ", ".join(sorted(empty))]

    for k, v in man.items():
        if v.get("volatile") and fields[k]["bytes"] != v["bytes"]:
            return "STRUCTURAL", 0.0, [
                "volatile field %s changed length: %d B pinned, %d B today"
                % (k, v["bytes"], fields[k]["bytes"]),
                "a volatile field with a moving length is not volatile, it is drift in disguise"]

    total = sum(v["bytes"] for v in fields.values())
    if total != today.get("rendered_bytes"):
        return "STRUCTURAL", 0.0, ["record does not sum: fields %d B, rendered_bytes %d B"
                                   % (total, today.get("rendered_bytes"))]

    drift = total / anchor["rendered_bytes"] - 1
    tol = anchor["tolerance_pct"] / 100.0
    deltas = sorted(((fields[k]["bytes"] - man[k]["bytes"], k) for k in man),
                    key=lambda t: -abs(t[0]))
    local = sum(d for d, k in deltas if man[k]["origin"].startswith("local"))
    vendor = sum(d for d, k in deltas if not man[k]["origin"].startswith("local"))

    if not quiet:
        out.append("anchor  %8d B   pinned %s   tolerance %.1f%%"
                   % (anchor["rendered_bytes"], anchor["pinned_at"], anchor["tolerance_pct"]))
        out.append("today   %8d B   day %-3s          drift %+.2f%%"
                   % (total, today.get("day", "?"), drift * 100))
        out.append("")
        for d, k in deltas:
            if d:
                out.append("  %-18s %+8d B   %s" % (k, d, man[k]["origin"]))
        same = [k for d, k in deltas if d == 0 and fields[k]["sha256"] == man[k]["sha256"]]
        if same:
            out.append("  unchanged (same sha256): %s" % ", ".join(sorted(same)))
        out.append("")
        out.append("  from local files: %+d B      from the vendor: %+d B" % (local, vendor))

    verdict = "BLOCK" if drift > tol else "PASS"
    return verdict, drift * 100, out


def cmd_check(args):
    anchor = load_json(args[0], "anchor")
    today = load_json(args[1], "today's record")
    verdict, drift, lines = check(anchor, today)
    if verdict == "STRUCTURAL":
        for ln in lines:
            print("STRUCTURAL: %s" % ln)
        print("verdict: STRUCTURAL  ->  exit 2   (fail-closed: unusable input is never a pass)")
        return 2
    body = "\n".join(lines)
    print(body)
    print("")
    if verdict == "BLOCK":
        print("verdict: BLOCK  ->  exit 1   (drift %+.2f%% is above the %.1f%% tolerance)"
              % (drift, anchor["tolerance_pct"]))
    else:
        print("verdict: PASS   ->  exit 0   (drift %+.2f%% is within the %.1f%% tolerance)"
              % (drift, anchor["tolerance_pct"]))
    print("report-sha256: %s" % hashlib.sha256(body.encode()).hexdigest())
    return 1 if verdict == "BLOCK" else 0


# ---------------------------------------------------------- the dashboard

def daily_means(path):
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        die("usage log missing or empty: %s" % path)
    days = {}
    with open(path) as fh:
        for line in fh:
            r = json.loads(line)
            days.setdefault(r["day"], []).append(r["input_bytes"])
    if not days:
        die("usage log has no records: %s" % path)
    return [statistics.mean(days[d]) for d in sorted(days)]


def alarms(series, kind, w, t):
    """Days on which today's mean exceeds the baseline built from the PREVIOUS w days by factor t."""
    hits = []
    ew = None
    for i, x in enumerate(series):
        if i >= w:
            if kind == "median":
                base = statistics.median(series[i - w:i])
            elif kind == "mean":
                base = statistics.mean(series[i - w:i])
            else:
                base = ew
            if base and x / base > t:
                hits.append(i)
        if kind == "ewma":
            ew = x if ew is None else ALPHA * x + (1 - ALPHA) * ew
    return hits


def calibrate(control, kind, w):
    """The lowest threshold on the ladder that raises zero false alarms on a flat world."""
    for t in LADDER:
        if not alarms(control, kind, w, t):
            return t
    return None


def blind_band(t, w):
    """Uniform daily growth g is invisible while (1+g)^(w/2) <= T. Arithmetic, not a finding."""
    return t ** (2.0 / w) - 1.0


def cmd_blindband(_args):
    print("blind band of a rolling window: it compares today against a baseline centred")
    print("about w/2 days back, so a uniform daily growth g only trips it when (1+g)^(w/2) > T.")
    print("solve for g:   g* = T^(2/w) - 1     any growth slower than g* is invisible forever.")
    print("")
    print("  window   threshold   invisible below   compounds to, over 30 days")
    for w, t in ((7, 1.15), (7, 1.20), (14, 1.20), (28, 1.20), (28, 1.10)):
        g = blind_band(t, w)
        print("  w=%-4d   T=%.2f       %6.2f%%/day        x%.2f" % (w, t, g * 100, (1 + g) ** 30))
    print("")
    print("this is arithmetic. it is true before you collect a single byte of data.")
    print("the measurement is where real context creep lands relative to that band.")
    return 0


def cmd_replay(args):
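    # Strip a trailing slash: it would make dirname() and basename() below return the wrong path.
    world, ctrl = args[0].rstrip("/"), args[1].rstrip("/")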
    anchor = load_json(os.path.join(os.path.dirname(world), "anchor.json"), "anchor")
    events = load_json(os.path.join(world, "events.json"), "events")
    series = daily_means(os.path.join(world, "usage.jsonl"))
    control = daily_means(os.path.join(ctrl, "usage.jsonl"))

    print("world: %s   days: %d   runs/day: fixed" % (os.path.basename(world), len(series)))
    floor0 = anchor["rendered_bytes"]
    typical = statistics.mean(control)   # a flat fleet, so this is the work profile, not the drift
    print("floor at pin time %d B   typical run %d B   floor share %.1f%%"
          % (floor0, round(typical), 100.0 * floor0 / typical))
    print("")

    print("  detector              threshold   alarms   first   on the day of")
    for name, kind, w in FAMILIES:
        t = calibrate(control, kind, w)
        if t is None:
            print("  %-20s   none clean on a flat world" % name)
            continue
        hits = alarms(series, kind, w, t)
        first = hits[0] if hits else None
        ev = events.get(str(first), "nothing: this one is about the creep") if first is not None else "-"
        print("  %-20s   T=%.2f       %2d      %-5s   %s"
              % (name, t, len(hits), first if first is not None else "-", ev))
        g = blind_band(t, w)
        print("  %-20s   blind below %.2f%%/day (x%.2f over 30 days)" % ("", g * 100, (1 + g) ** 30))

    first_block, first_struct, drift_at, final = None, None, 0.0, floor0
    for day in range(len(series)):
        rec = load_json(os.path.join(world, "canary", "day%03d.json" % day), "record")
        verdict, drift, _ = check(anchor, rec, quiet=True)
        if verdict == "BLOCK" and first_block is None:
            first_block, drift_at = day, drift
        if verdict == "STRUCTURAL" and first_struct is None:
            first_struct = day
        final = rec["rendered_bytes"]
    print("")
    if first_block is None and first_struct is None:
        print("  frozen anchor tol=%.1f%%     PASS on all %d days: the frozen input never drifted"
              % (anchor["tolerance_pct"], len(series)))
    if first_block is not None:
        print("  frozen anchor tol=%.1f%%     BLOCK on day %d (drift %+.2f%%)  -> exit 1"
              % (anchor["tolerance_pct"], first_block, drift_at))
    if first_struct is not None:
        print("  frozen anchor tol=%.1f%%     STRUCTURAL on day %d (%s) -> exit 2"
              % (anchor["tolerance_pct"], first_struct,
                 events.get(str(first_struct), "the set of fields changed")))
    print("  floor on the last day %d B (%+.1f%% against the anchor)"
          % (final, (final / floor0 - 1) * 100))
    return 1 if first_block is not None else (2 if first_struct is not None else 0)


def cmd_sweep(args):
    root, label = args[0].rstrip("/"), args[1]
    if not os.path.isdir(root):
        die("sweep root not found: %s" % root)
    anchor = load_json(os.path.join(os.path.dirname(root), "anchor.json"), "anchor")
    print("  %-10s  floor share   w=7 clean T   blind below   alarms on slowcreep   first" % label)
    for name in sorted(os.listdir(root)):
        d = os.path.join(root, name)
        ctrl = daily_means(os.path.join(d, "control", "usage.jsonl"))
        slow = daily_means(os.path.join(d, "slowcreep", "usage.jsonl"))
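        t = calibrate(ctrl, "median", 7)
        if t is None:
            # No rung of the ladder is quiet on this flat world; comparing against None would raise.
            print("  %-10s  none clean on a flat world" % name)
            continue
        hits = alarms(slow, "median", 7, t)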
        share = 100.0 * anchor["rendered_bytes"] / statistics.mean(ctrl)
        g = blind_band(t, 7) * 100
        print("  %-10s  %6.1f%%       T=%.2f        %6.2f%%/day   %2d %-18s %s"
              % (name, share, t, g, len(hits),
                 "<- sees the creep" if hits else "", hits[0] if hits else "-"))
    print("")
    print("  neither parameter reaches the frozen anchor. It reads a deterministic input, so it")
    print("  carries no work and no day-to-day noise: floor share 100%%, tolerance %.1f%%, same"
          % anchor["tolerance_pct"])
    print("  BLOCK day in every row above. The rows only move the DASHBOARD.")
    return 0


USAGE = """usage:
  drift_anchor_gate.py check     <anchor.json> <today.json>
  drift_anchor_gate.py replay    <world_dir> <control_dir>
  drift_anchor_gate.py sweep     <sweep_root> <label>
  drift_anchor_gate.py blindband
"""

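if __name__ == "__main__":
    # Each command maps to (handler, number of required arguments).
    cmds = {"check": (cmd_check, 2), "replay": (cmd_replay, 2),
            "sweep": (cmd_sweep, 2), "blindband": (cmd_blindband, 0)}
    # Too few arguments is a usage error (exit 2), not an IndexError traceback (exit 1 reads as BLOCK).
    if len(sys.argv) < 2 or sys.argv[1] not in cmds or len(sys.argv) - 2 < cmds[sys.argv[1]][1]:
        sys.stdout.write(USAGE)
        sys.exit(2)
    sys.exit(cmds[sys.argv[1]][0](sys.argv[2:]))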

make_worlds.py, the fixture builder. Seeded, and the only thing here that writes anything:

#!/usr/bin/env python3
"""make_worlds.py - build the fixtures for drift_anchor_gate.py.

Offline, keyless, zero-network, stdlib-only, seeded. This is the only script here
that writes anything. The gate itself never writes.

Six worlds share ONE anchor (day 0 is byte-identical everywhere) and ONE noise
realization per (work_median, sigma_day) pair. Only the floor trajectory differs
between worlds, so the dashboard is never handed a harder noise draw than the
anchor. Runs per day are FIXED at 40 on purpose: that removes run-count noise,
which is a handicap in the dashboard's favour.
"""

import hashlib
import json
import os
import random
import shutil

HERE = os.path.dirname(os.path.abspath(__file__))
OUT = os.path.join(HERE, "worlds")

DAYS = 60
RUNS_PER_DAY = 40
BASE_SEED = 20260715
WORK_MEDIAN = 18_000      # bytes of real per-run work: the user turn, retrieved chunks, history
WORK_SIGMA = 0.8          # lognormal spread of work across runs
SIGMA_DAY = 0.15          # lognormal spread of the daily job mix
TOL_PCT = 2.0

TASK = ("canary task: read the pinned fixture file, return its word count as JSON. "
        "do not call tools. do not browse. answer in one line.")

# Floor at pin time, in bytes, by field.
FLOOR0 = {
    "harness_scaffold": 6_000,   # vendor-injected: tool-call protocol preamble + safety wrapper
    "system_prompt": 12_000,
    "tools_json": 26_000,
    "instructions": 9_000,       # CLAUDE.md / AGENTS.md
    "mcp_a": 4_000,
    "mcp_b": 4_000,
}
VOLATILE = {"session_id": 36, "timestamp": 24}
ORIGIN = {
    "harness_scaffold": "vendor",
    "system_prompt": "local:system_prompt.txt",
    "tools_json": "local:tools.json",
    "instructions": "local:CLAUDE.md",
    "mcp_a": "local:mcp/a.json",
    "mcp_b": "local:mcp/b.json",
    "mcp_c": "local:mcp/c.json",
    "mcp_d": "local:mcp/d.json",
    "session_id": "runtime",
    "timestamp": "runtime",
}

WORDS = ("agent budget cache canary context deploy drift enforce escalate fixture gate harness "
         "ingest invariant ledger manifest observe payload pin policy prompt quota render replay "
         "retry rollback runbook schema scope session snapshot spend threshold token trace vendor "
         "window workspace approve reject timeout latency shard cursor batch").split()

SUBJECTS = ["retry budget", "tool allowlist", "pii redaction", "escalation path", "commit policy",
            "sandbox scope", "timeout ladder", "cache warmup", "rollback drill", "on-call handoff",
            "schema review", "cost ceiling", "log retention", "vendor swap", "eval cadence",
            "prompt review", "secret handling", "shell allowlist", "diff size cap", "branch policy"]


def text_of(rng, n, tag):
    """Deterministic ASCII text of exactly n bytes."""
    if n <= 0:
        return ""
    parts = [tag, ":"]
    size = len(tag) + 1
    while size < n:
        w = rng.choice(WORDS)
        parts.append(" " + w)
        size += len(w) + 1
    s = "".join(parts)[:n]
    return s + " " * (n - len(s))


def rule_section(rng, idx):
    """One appended policy section. Unique rule number (subjects repeat after 20), unique body, 380..620 bytes."""
    n = rng.randint(380, 620)
    head = "\n\n## Rule %03d: %s\n" % (idx, SUBJECTS[idx % len(SUBJECTS)])
    return head + text_of(rng, n - len(head), "rationale-%03d" % idx)


# ---- static content, generated once so it is stable across days and worlds ----
_c = random.Random(BASE_SEED ^ 0xF100)
BASE = {k: text_of(_c, v, k) for k, v in FLOOR0.items()}
_s = random.Random(BASE_SEED ^ 0x5EC7)
SECTIONS = [rule_section(_s, i) for i in range(70)]
EXTRA_TOOLS = "".join(text_of(_c, 1_150, "tool-schema-%02d" % i) for i in range(12))  # 13,800 B
MCP_C = text_of(_c, 4_200, "mcp_c")
MCP_D = text_of(_c, 3_800, "mcp_d")
SPIKE_DOC = text_of(_c, 40_000, "pasted-architecture-doc")
BUMP = {12: text_of(_c, 2_600, "scaffold-bump-1"),
        31: text_of(_c, 3_100, "scaffold-bump-2"),
        48: text_of(_c, 3_400, "scaffold-bump-3")}

EVENTS = {
    "control": {},
    "creep": {14: "mcp: +3rd server", 33: "mcp: +4th server", 40: "tools: +12 schemas"},
    "slowcreep": {33: "mcp: +3rd server"},
    "spike": {30: "instructions: +40 KB doc pasted"},
    "harness": {12: "vendor: scaffold bump 1", 31: "vendor: scaffold bump 2", 48: "vendor: scaffold bump 3"},
    "workload": {30: "work per run doubles (floor untouched)"},
}
WORLDS = list(EVENTS)


def render_floor(world, day):
    """Field -> content string for this world on this day. Day 0 is identical in every world."""
    f = dict(BASE)
    if world == "creep":
        n = day                                  # one policy section per day
    elif world == "slowcreep":
        n = day // 3                             # one policy section every third day
    else:
        n = 0
    if n:
        f["instructions"] = BASE["instructions"] + "".join(SECTIONS[:n])
    if world == "creep":
        if day >= 40:
            f["tools_json"] = BASE["tools_json"] + EXTRA_TOOLS
        if day >= 14:
            f["mcp_c"] = MCP_C
        if day >= 33:
            f["mcp_d"] = MCP_D
    if world == "slowcreep" and day >= 33:
        f["mcp_c"] = MCP_C
    if world == "spike" and day >= 30:
        f["instructions"] = f["instructions"] + SPIKE_DOC
    if world == "harness":
        add = "".join(v for d, v in sorted(BUMP.items()) if day >= d)
        f["harness_scaffold"] = BASE["harness_scaffold"] + add
    return f


def work_scale(world, day):
    """The only world that changes the WORK per run rather than the floor."""
    return 2.0 if (world == "workload" and day >= 30) else 1.0


def sha(s):
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def dump_fields(fields):
    """Rendered-request record: per-field byte count + sha256. The shape of a privacy-safe request log."""
    rec = {k: {"bytes": len(v), "sha256": sha(v)} for k, v in fields.items()}
    for k, n in VOLATILE.items():
        rec[k] = {"bytes": n, "sha256": sha(k * n)}
    return rec


def total_bytes(fields):
    return sum(len(v) for v in fields.values()) + sum(VOLATILE.values())


def noise_matrix(work_median, sigma_day, seed):
    """Per-run work bytes. Identical across worlds for a given (work_median, sigma_day)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(DAYS):
        dm = rng.lognormvariate(0.0, sigma_day)
        rows.append([max(1, int(work_median * rng.lognormvariate(0.0, WORK_SIGMA) * dm))
                     for _ in range(RUNS_PER_DAY)])
    return rows


def write_usage(path, world, noise):
    """usage.jsonl: one line per run, the log a fleet dashboard already has."""
    with open(path, "w") as fh:
        for day in range(DAYS):
            floor = total_bytes(render_floor(world, day))
            k = work_scale(world, day)
            for run in range(RUNS_PER_DAY):
                fh.write(json.dumps({"day": day, "run": run,
                                     "input_bytes": floor + int(noise[day][run] * k)}) + "\n")


def main():
    if os.path.isdir(OUT):
        shutil.rmtree(OUT)
    os.makedirs(OUT)

    day0 = render_floor("control", 0)
    anchor = {
        "canary_id": "floor-canary-v1",
        "pinned_at": "day 0",
        "tolerance_pct": TOL_PCT,
        "task_sha256": sha(TASK),
        "rendered_bytes": total_bytes(day0),
        "manifest": {},
    }
    for k, rec in dump_fields(day0).items():
        anchor["manifest"][k] = {"bytes": rec["bytes"], "sha256": rec["sha256"],
                                 "origin": ORIGIN[k], "volatile": k in VOLATILE}
    with open(os.path.join(OUT, "anchor.json"), "w") as fh:
        json.dump(anchor, fh, indent=2, sort_keys=True)

    noise = noise_matrix(WORK_MEDIAN, SIGMA_DAY, BASE_SEED)

    for world in WORLDS:
        wd = os.path.join(OUT, world)
        os.makedirs(os.path.join(wd, "canary"))
        for day in range(DAYS):
            fields = render_floor(world, day)
            rec = {"canary_id": "floor-canary-v1", "day": day, "task_sha256": sha(TASK),
                   "rendered_bytes": total_bytes(fields), "fields": dump_fields(fields)}
            with open(os.path.join(wd, "canary", "day%03d.json" % day), "w") as fh:
                json.dump(rec, fh, indent=2, sort_keys=True)
        write_usage(os.path.join(wd, "usage.jsonl"), world, noise)
        with open(os.path.join(wd, "events.json"), "w") as fh:
            json.dump({str(k): v for k, v in EVENTS[world].items()}, fh, indent=2, sort_keys=True)

    # Repo snapshots for the harness world: the local files a `git diff` would look at.
    for day in (0, 59):
        d = os.path.join(OUT, "harness", "repo_day%03d" % day)
        os.makedirs(os.path.join(d, "mcp"))
        f = render_floor("harness", day)
        # Write each snapshot file through a context manager so it is flushed and closed.
        for rel, key in (("system_prompt.txt", "system_prompt"),
                         ("CLAUDE.md", "instructions"),
                         ("tools.json", "tools_json"),
                         (os.path.join("mcp", "a.json"), "mcp_a"),
                         (os.path.join("mcp", "b.json"), "mcp_b")):
            with open(os.path.join(d, rel), "w") as fh:
                fh.write(f[key])

    # Degenerate inputs. Every one of these must fail closed.
    dg = os.path.join(OUT, "degenerate")
    os.makedirs(dg)
    with open(os.path.join(OUT, "control", "canary", "day000.json")) as fh:
        base = json.load(fh)

    def variant(name, mutate):
        r = json.loads(json.dumps(base))
        mutate(r)
        with open(os.path.join(dg, name + ".json"), "w") as fh:
            json.dump(r, fh, indent=2, sort_keys=True)

    def drop_tools(r):
        r["fields"].pop("tools_json")
        r["rendered_bytes"] = sum(v["bytes"] for v in r["fields"].values())

    def zero_mcp(r):
        r["fields"]["mcp_b"] = {"bytes": 0, "sha256": sha("")}
        r["rendered_bytes"] = sum(v["bytes"] for v in r["fields"].values())

    def unknown_field(r):
        r["fields"]["mcp_c"] = {"bytes": 4_200, "sha256": sha(MCP_C)}
        r["rendered_bytes"] = sum(v["bytes"] for v in r["fields"].values())

    def task_changed(r):
        r["task_sha256"] = sha(TASK + " and also list the files you can see")

    def volatile_grew(r):
        r["fields"]["session_id"] = {"bytes": 96, "sha256": sha("x" * 96)}
        r["rendered_bytes"] = sum(v["bytes"] for v in r["fields"].values())

    def bad_sum(r):
        r["rendered_bytes"] = 61_060 - 20_000

    variant("dropped_field", drop_tools)
    variant("zero_byte_field", zero_mcp)
    variant("unknown_field", unknown_field)
    variant("task_changed", task_changed)
    variant("volatile_grew", volatile_grew)
    variant("inconsistent_total", bad_sum)
    open(os.path.join(dg, "empty.json"), "w").close()
    with open(os.path.join(dg, "truncated.json"), "w") as fh:
        fh.write('{"canary_id": "floor-canary-v1", "day": 0,')

    # Sweeps: the two parameters that could flip the result. control + slowcreep share a seed.
    for wm in (2_000, 5_000, 10_000, 18_000, 30_000, 60_000):
        seed = BASE_SEED + wm + 15 * 1_000_003
        nz = noise_matrix(wm, SIGMA_DAY, seed)
        for world in ("control", "slowcreep"):
            d = os.path.join(OUT, "sweep_floorshare", "wm%06d" % wm, world)
            os.makedirs(d)
            write_usage(os.path.join(d, "usage.jsonl"), world, nz)
            shutil.copy(os.path.join(OUT, world, "events.json"), os.path.join(d, "events.json"))

    for sig in (0.05, 0.10, 0.15, 0.20, 0.25):
        seed = BASE_SEED + WORK_MEDIAN + int(sig * 100) * 1_000_003
        nz = noise_matrix(WORK_MEDIAN, sig, seed)
        for world in ("control", "slowcreep"):
            d = os.path.join(OUT, "sweep_sigma", "s%03d" % int(sig * 100), world)
            os.makedirs(d)
            write_usage(os.path.join(d, "usage.jsonl"), world, nz)
            shutil.copy(os.path.join(OUT, world, "events.json"), os.path.join(d, "events.json"))

    print("worlds built: %s" % ", ".join(WORLDS))
    print("anchor rendered_bytes: %d B  tolerance: %.1f%%" % (anchor["rendered_bytes"], TOL_PCT))
    for world in WORLDS:
        a, b = total_bytes(render_floor(world, 0)), total_bytes(render_floor(world, 59))
        rate = ((b / a) ** (1 / 59) - 1) * 100 if b != a else 0.0
        print("  %-10s floor day0 %7d B  ->  day59 %7d B   (x%.2f, %+.3f%%/day)"
              % (world, a, b, b / a, rate))


if __name__ == "__main__":
    main()

run_demo.sh:

#!/usr/bin/env bash
# run_demo.sh - rebuild the fixtures from scratch and run every claim in the post.
# Offline, keyless, no network, stdlib only. Nothing here reads a clock, so stdout is
# byte-for-byte reproducible: run it twice, diff the two outputs, they are identical.
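# -u only, no -e: several steps below are expected to exit non-zero, and the script must keep going.
set -u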
cd "$(dirname "$0")"

hr() { printf '\n%s\n%s\n' "$1" "$(printf '=%.0s' $(seq 1 ${#1}))"; }

hr "0. THE PREDICTION (arithmetic, before any data exists)"
python3 drift_anchor_gate.py blindband

hr "1. BUILD THE WORLDS (seeded, deterministic, no network)"
python3 make_worlds.py

hr "2. KILLER DEMO: the harness world. Every local file is byte-identical to day 0."
diff -rq worlds/harness/repo_day000 worlds/harness/repo_day059 \
  && echo "diff -rq repo_day000 repo_day059: no differences. git diff would print nothing."
echo ""
echo "sha256 of every local file, day 0 vs day 59:"
( cd worlds/harness/repo_day000 && find . -type f | sort | xargs shasum -a 256 ) | sed 's/^/  d0  /'
( cd worlds/harness/repo_day059 && find . -type f | sort | xargs shasum -a 256 ) | sed 's/^/  d59 /'

echo ""
echo "--- same anchor, same day 12, two worlds ---"
echo ""
echo "\$ drift_anchor_gate.py check worlds/anchor.json worlds/control/canary/day012.json"
python3 drift_anchor_gate.py check worlds/anchor.json worlds/control/canary/day012.json
echo "exit=$?"
echo ""
echo "\$ drift_anchor_gate.py check worlds/anchor.json worlds/harness/canary/day012.json"
python3 drift_anchor_gate.py check worlds/anchor.json worlds/harness/canary/day012.json
echo "exit=$?"

hr "3. SIX WORLDS, ONE ANCHOR, ONE LOG. Each rolling detector gets its own best threshold."
for w in control creep slowcreep harness spike workload; do
  echo ""
  python3 drift_anchor_gate.py replay "worlds/$w" worlds/control
  echo "  replay exit=$?"
done

hr "4. THE TYPICAL SCENARIO, AT THE BOUNDARY. The gate must fail CLOSED on slow creep."
echo ""
echo "\$ drift_anchor_gate.py check worlds/anchor.json worlds/slowcreep/canary/day008.json"
python3 drift_anchor_gate.py check worlds/anchor.json worlds/slowcreep/canary/day008.json
echo "exit=$?"
echo ""
echo "\$ drift_anchor_gate.py check worlds/anchor.json worlds/slowcreep/canary/day009.json"
python3 drift_anchor_gate.py check worlds/anchor.json worlds/slowcreep/canary/day009.json
echo "exit=$?"

hr "5. DEGENERATE INPUTS: every one of these must fail closed."
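# "missing" has no fixture on purpose: worlds/degenerate/missing.json is never built,
# so this loop also checks a path that does not exist.
for f in missing empty truncated dropped_field zero_byte_field unknown_field task_changed volatile_grew inconsistent_total; do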
  echo ""
  echo "--- $f ---"
  python3 drift_anchor_gate.py check worlds/anchor.json "worlds/degenerate/$f.json"
  echo "exit=$?"
done
echo ""
echo "--- clean day 0 (the control: it must still pass) ---"
python3 drift_anchor_gate.py check worlds/anchor.json worlds/control/canary/day000.json
echo "exit=$?"

hr "6. SWEEP A: the one parameter that flips the result. How much of the request is floor?"
python3 drift_anchor_gate.py sweep worlds/sweep_floorshare work_median

hr "7. SWEEP B: day-to-day noise. Does the dashboard wake up at any of these?"
python3 drift_anchor_gate.py sweep worlds/sweep_sigma sigma_day

Run it: bash run_demo.sh > output.txt. It takes a few seconds, touches no network, needs no keys, and produces the same bytes every time. Mine hashes to 1772e695cb75f79d9e3f162ed4c49477a329703610cdf5ddff54cec2cc4da62a.
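To compare your own run against that digest without eyeballing 64 hex characters, something like the following would do. It is a minimal sketch: the helper name `file_sha256` is hypothetical and not part of the post's tooling, and it assumes `output.txt` was produced by the command above.

```python
import hashlib

# Digest quoted in the paragraph above for the author's own run.
EXPECTED = "1772e695cb75f79d9e3f162ed4c49477a329703610cdf5ddff54cec2cc4da62a"


def file_sha256(path, chunk_size=65536):
    """Return the hex sha256 of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED):
    """True only if the file hashes to the expected digest."""
    return file_sha256(path) == expected
```

Calling `matches_expected("output.txt")` after a run gives a yes/no answer. A mismatch does not say which section diverged, so diffing two outputs is still the next step.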

What this is NOT

The time series are synthetic. I do not have a production fleet, and I am not going to pretend I do. The builder is in this post, the shape is declared (how many bytes a policy section adds, what day an MCP server lands), and the seed is fixed. Change the parameters and the numbers move. What is not synthetic: the blind-band arithmetic, and the fact that the gate runs on your real usage.jsonl and your real request dumps.
The anchor does not see workload growth. Demonstrated above, in the one world where it passes for 60 days while cost genuinely rises. If your bill grew because your product got busier, this tool will tell you nothing and your dashboard will tell you everything.
The anchor only sees the rendered request it is handed. No request dump, no gate. It reads nothing from the network.
It does not tell you whether the drift was worth it. A third MCP server may well be worth its 4 KB on every call. The gate converts a silent accumulation into a decision that somebody signs.
Model-side drift is out of scope. If the vendor changes the tokenizer or the model gets chattier, the input payload can be identical while the bill moves. This gate reads the input payload only. That is a real hole, and it is Dipankar's original point coming back around.
A 2 percent tolerance is not a measured optimum. On my fixtures the control world drifts by exactly 0.00 percent, so any positive tolerance gives zero false alarms. I picked 2 percent as a real-world allowance for things like a version string that changes length. My data did not force that number, and I am not going to dress it up as though it did.

What I would do on Monday

Dump the rendered request for one frozen, deterministic task. Save the byte counts and hashes as an anchor with today's date. Add one CI step that re-renders that same task and compares. Set the tolerance somewhere near 2 percent, because a deterministic input has no noise to hide in, and let it fail the build.

Then keep your dashboard exactly as it is. It is watching for the paste and the traffic step, and it is good at that. It was never watching for the year.
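The CI step described above can be sketched in a few lines. This is not drift_anchor_gate.py. It is a hedged illustration that reuses only the key names visible in the builder (`rendered_bytes`, `tolerance_pct`, `task_sha256`) and compares the total alone. The function name `check_total` and the choice to treat shrinkage like growth are assumptions.

```python
def check_total(anchor, record):
    """Compare one re-rendered record against the pinned anchor.

    Returns (ok, reason). Fails closed: a missing or malformed key is a failure,
    never a silent pass.
    """
    try:
        base = int(anchor["rendered_bytes"])
        tol = float(anchor["tolerance_pct"])
        now = int(record["rendered_bytes"])
        same_task = anchor["task_sha256"] == record["task_sha256"]
    except (KeyError, TypeError, ValueError) as exc:
        return False, "bad input: %r" % (exc,)
    if base <= 0:
        return False, "bad input: anchor rendered_bytes must be positive"
    if not same_task:
        return False, "task changed: the comparison is no longer like-for-like"
    drift_pct = (now - base) * 100.0 / base
    # Assumption: a floor that shrinks past the tolerance is as suspicious as one that grows.
    if abs(drift_pct) > tol:
        return False, "drift %+.2f%% exceeds tolerance %.1f%%" % (drift_pct, tol)
    return True, "drift %+.2f%% within tolerance %.1f%%" % (drift_pct, tol)
```

In CI, the caller would load anchor.json and the fresh record with `json.load` and exit non-zero whenever `ok` is false. The real gate also reports per-field results; this sketch does not.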

One thing I have not settled, and I would genuinely like a second opinion: the re-pin. Every legitimate change (a new MCP server, a needed rule) forces a re-pin, and a re-pin resets the baseline. Do it a dozen times and you have reinvented a rolling baseline with extra steps, just with human latency in the loop. My current answer is that the re-pin has to carry the cumulative drift since the original pin, so the number you are approving is "the floor is now 47 percent above where it was in March", not "plus 4 percent since last week". I have not built that yet. If you have run a pinned baseline in production for longer than a quarter, I want to know how you kept the pin honest.
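One possible shape for that unbuilt idea, offered as a sketch rather than a design: keep the byte count of every pin, oldest first, and have the re-pin report both the step since the last pin and the cumulative drift since the first. The function name `repin_summary` and the history format are hypothetical, and the numbers in the usage lines are made up to echo the 47 percent and 4 percent in the paragraph above.

```python
def repin_summary(pin_history, new_bytes):
    """Summarize a proposed re-pin against the full pin history.

    pin_history: rendered_bytes of every pin so far, oldest first (hypothetical format).
    new_bytes:   rendered_bytes of the baseline being proposed.
    """
    if not pin_history or min(pin_history) <= 0:
        raise ValueError("pin history must be non-empty with positive byte counts")
    original, previous = pin_history[0], pin_history[-1]
    return {
        # What a plain re-pin would show the approver.
        "step_pct": (new_bytes - previous) * 100.0 / previous,
        # What the approver is actually signing: drift since the first pin.
        "cumulative_pct": (new_bytes - original) * 100.0 / original,
        "repins_so_far": len(pin_history) - 1,
    }


# Illustrative numbers only: first pin 10,000 B, last pin 14,135 B, proposed 14,700 B.
summary = repin_summary([10_000, 14_135], 14_700)
print("step %+.1f%%, cumulative %+.1f%%" % (summary["step_pct"], summary["cumulative_pct"]))
```

The step reads as a small change while the cumulative figure does not, which is the gap the paragraph is worried about. Where the history lives and who may append to it are left open here.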

I write about pre-execution control for AI agents: gates that decide before the money moves, not charts that explain it afterwards. Every post ships a tool you can actually run, offline, with no keys. Follow along if that is your kind of thing, and if your context floor has a story, put it in the comments. I am especially interested in anyone who has watched a vendor bump a harness without telling them.

You Approved `project_settings.json`. The OS Was About to Write `~/.ssh/authorized_keys`.

Alexey Spinov — Mon, 13 Jul 2026 10:24:42 +0000

AI coding agent approval path resolution is the gap between the path an approval dialog shows and the file the OS actually opens. Before the write lands, ~ and $VAR get expanded (by the shell or the agent, not the kernel), the OS follows symlinks and resolves .., and the resolved target can escape your workspace. approval_path_resolve_gate.py resolves it first, offline, and blocks the mismatch.

AI disclosure: I wrote approval_path_resolve_gate.py with an AI assistant and ran it myself, offline, on Python 3.13.5, standard library only, no network. Every verdict, exit code, and hash in the output blocks below is pasted from a real local run. I ran each scenario twice to confirm STDOUT is byte-for-byte identical, and the tool prints a sha256 of its own report so you can reproduce the exact bytes. The GhostApproval disclosure I cite is Wiz's work and other people's reporting, attributed inline — their findings and vendor names, not mine, and their numbers stay out of my fixtures. My fixture writes to a stand-in file I named outside/authorized_keys; it never touches your real ~/.ssh.

In short:

An approval dialog is tracking: it shows you the path the model typed. os.path.realpath() before the write is control: the file the OS will actually open after it resolves symlinks, .., ~, and $VAR. The gap between the two is the whole bug.
The demo that matters: one approved string, project_settings.json, flips the verdict. In a clean repo it is an ordinary file — PASS, exit 0. In a poisoned repo the same name is a symlink pointing out of the workspace — BLOCK, exit 1, resolved-escapes-workspace. You approved the same string both times.
When the resolved target leaves the workspace via a symlink, that is CWE-451 (UI misrepresentation) plus CWE-61 (symlink following). Via .., it is CWE-451 plus CWE-22 (path traversal). The gate names which one fired.
It resolves the target the way the OS would, compares it to what you read off the dialog, and to the workspace root. Read-only: it opens nothing, writes nothing, hits no network. Fail-closed: bad input exits 2, never a silent green.
Standard library only (os, sys, hashlib). The run is byte-for-byte deterministic; the whole reproduction harness — rebuild fixtures, five scenarios, each run twice — finished in about a third of a second on my laptop. The tool and its fixtures are in this post.

How does AI coding agent approval path resolution differ from the shown path?

A path string is not a file. It is a request for the OS to go find a file, and the finding involves steps you do not see. ~ becomes your home directory. $HOME or $PROJECT becomes whatever the environment says. A .. walks up a level. And a symlink — the quiet one — hands the whole lookup off to a target written somewhere else, at some earlier time, by someone who may not be you.
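The string-level steps can be watched one at a time with the standard library. This is a small sketch, assuming a POSIX system; the helper name `expansion_steps` and the `PROJECT` value are made up for the demo. Nothing here touches the filesystem, so symlinks are not followed yet.

```python
import os


def expansion_steps(shown):
    """Return the path after each string-level step: ~ expansion, $VAR expansion,
    and lexical normalization. No filesystem access happens here."""
    after_tilde = os.path.expanduser(shown)
    after_vars = os.path.expandvars(after_tilde)
    after_dots = os.path.normpath(after_vars)
    return after_tilde, after_vars, after_dots


# Hypothetical value, set only so the demo has something to expand.
os.environ["PROJECT"] = "/srv/app"

for shown in ("~/notes.txt", "$PROJECT/conf.json", "$PROJECT/logs/../conf.json"):
    print(shown, "->", expansion_steps(shown))
```

The first result depends on whose home directory the process runs under, which is the point: the same shown string names different files for different users.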

So project_settings.json in an approval dialog is a claim, not a fact. The claim is "I am about to write a file here, in your project, called project_settings.json." Whether that claim is true depends on what project_settings.json resolves to on disk right now. If it is a regular file in the repo, the claim holds. If it is a symlink to ../outside/authorized_keys, the claim is a costume.

This is the same split I keep hammering across the pre-execution gate this series is built on: tracking is not control. The dialog tracks intent — it shows the string the model meant to write. It does not control the write, because the OS does not consult the dialog. It consults the filesystem. Control means resolving the target the OS will actually open and deciding on that, before the human is ever asked to click Approve.

The official name for the gap has existed for years. CWE-451 is "User Interface (UI) Misrepresentation of Critical Information" — the interface shows you one thing while the system acts on another. When the misrepresentation is powered by a symlink, you also have CWE-61, "UNIX Symbolic Link (Symlink) Following." When it is powered by .., you have CWE-22, "Path Traversal." The gate reports which one it caught.

The GhostApproval disclosure (their findings, not mine)

I did not invent this failure. In July 2026 the research team at Wiz disclosed a class of bug they named GhostApproval, described in their writeup as a trust-boundary gap in AI coding assistants: the permission dialog shows the path the model typed, but the write resolves to a different path. They classify it as CWE-451 plus CWE-61 — the exact two identifiers above. Those are their words and their classification, not mine.

Per Wiz's disclosure, and the coverage in The Hacker News, the pattern showed up across six AI coding agents they name: Claude Code, Amazon Q Developer, Cursor, Google Antigravity, Augment, and Windsurf. As reported, some vendors shipped fixes (the reporting names AWS, Cursor, and Google), two stayed quiet, and one called it outside its threat model. @leobaniak's Dev.to summary put the mechanism in one line: the dialog was reading the wrong path.

I want to be precise about what is theirs and what is mine, because it is easy to launder someone else's incident into your own credibility here. The six agents, the vendor responses, the "six agents" count — those are Wiz's and the press's, and I have not independently reproduced any vendor bug. What follows is my own thing: a small, honest reproduction of the mechanism on synthetic fixtures, plus a gate that decides on the resolved target before the human approves. Different question, deliberately narrow, and every number below comes from running that gate.

What `os.path.realpath()` resolves before the write

The core of the gate is four lines of resolution, and it is worth reading them slowly, because the whole verdict turns on the difference between two of them:

expanded = os.path.expandvars(os.path.expanduser(approved))   # ~ and $VAR
base = expanded if os.path.isabs(expanded) else os.path.join(root_real, expanded)
resolved = os.path.realpath(base)   # follows symlinks + collapses ".."
lexical  = os.path.normpath(base)   # collapses ".." only, no symlink follow

Taken together, the first three lines reproduce what happens before the write: expand ~ and $VAR, join to the workspace root if the path is relative, then let os.path.realpath follow every symlink, resolve every .., and return the real absolute target. os.path.normpath does the lexical cleanup only: it collapses .. but does not touch symlinks. When those two disagree, a symlink was traversed. When realpath lands outside the workspace root, the write is about to escape. That is the entire mechanic. Everything else in the file is presentation and honest bookkeeping.
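The realpath/normpath disagreement is easy to see on a throwaway directory. The sketch below is a condensed illustration, not the gate: it builds a temporary repo with one escaping symlink (POSIX, or any platform where `os.symlink` is permitted) and compares the two answers. The helper name `resolve_both` is hypothetical.

```python
import os
import tempfile


def resolve_both(root, shown):
    """Return (resolved, lexical, inside) for a relative path shown in a dialog."""
    root_real = os.path.realpath(root)
    base = os.path.join(root_real, shown)
    resolved = os.path.realpath(base)   # follows symlinks
    lexical = os.path.normpath(base)    # string cleanup only
    inside = resolved == root_real or resolved.startswith(root_real + os.sep)
    return resolved, lexical, inside


with tempfile.TemporaryDirectory() as tmp:
    repo = os.path.join(tmp, "repo")
    outside = os.path.join(tmp, "outside")
    os.makedirs(repo)
    os.makedirs(outside)
    with open(os.path.join(outside, "authorized_keys"), "w") as fh:
        fh.write("# stand-in file, not a real key\n")
    # Same friendly name the dialog would show, but it points out of the repo.
    os.symlink(os.path.join("..", "outside", "authorized_keys"),
               os.path.join(repo, "project_settings.json"))

    resolved, lexical, inside = resolve_both(repo, "project_settings.json")
    print("symlink followed:", resolved != lexical)
    print("inside workspace:", inside)
```

On this fixture the two paths differ and the resolved one sits outside the repo, which is the condition the gate turns into a BLOCK.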

Run it in sixty seconds

No keys. No network. No install beyond Python 3. Two files: the gate, and a fixture builder that creates a clean repo and a poisoned one under /tmp. The gate itself writes nothing; the fixture builder writes only under one namespaced /tmp directory and never touches your real ~/.ssh.

Here is the whole gate. One file, standard library only.

#!/usr/bin/env python3
"""approval_path_resolve_gate.py - what did the human actually approve?

An approval dialog shows a string: "write project_settings.json". The human reads
that string and clicks Approve. But the string is not what the OS opens. Before the
write lands, the OS resolves symlinks, "..", "~", and environment variables. The
resolved target can be a completely different file, possibly outside the workspace.

The dialog is TRACKING: the path the model typed, shown back to you.
os.path.realpath() before the write is CONTROL: the file the OS will actually open.
The gap between the two is CWE-451 (UI misrepresentation). When the resolved target
also leaves the workspace via a symlink, that is CWE-61 (symlink following).

This gate takes the approved display string plus the workspace root, resolves the
real target the way the OS would, and decides whether what you approved matches what
the OS would open, and whether it stays inside the workspace.

It is READ-ONLY. It calls os.path.realpath / os.readlink / os.path.islink /
os.path.expanduser / os.path.expandvars only. It never opens the target file, never
writes anything, never touches the network. Standard library only.

Usage:
  approval_path_resolve_gate.py <workspace_root> <approved_path> [more_approved ...]

Each approved path is a string exactly as it would appear in the approval dialog. A
relative path is interpreted against the workspace root (as the agent's write would be).

Verdicts (per approved path):
  PASS  exit 0  resolved target is inside the workspace AND equals the approved path
  WARN  exit 1  resolved target stayed inside the workspace, but resolution changed it
                (a symlink redirected it, or "..", "~", or $VAR normalized it): SHOWN != RESOLVED
  BLOCK exit 1  resolved target escapes the workspace (symlink or "..") : the OS would
                write outside the root the human thought they were approving
  ERROR exit 2  bad input (fail-closed): missing root, root is not a directory, empty
                approved path, unusable path

With several approved paths the process exit is the worst verdict: a BLOCK or WARN (1)
outranks unusable input (2) outranks a clean pass (0). STDOUT is deterministic for a
fixed workspace, and the last line is a sha256 of the report body.

The approved string is untrusted data. It is resolved and compared, never executed.
"""

import hashlib
import os
import sys

SEP = os.sep


class BadInput(Exception):
    pass


def is_inside(path, root):
    """True if path is root itself or lives under root. Both must be realpath'd."""
    if path == root:
        return True
    return path.startswith(root + SEP)


def clean(s):
    """Render an untrusted path string on one line so a crafted approved path
    cannot forge extra report lines with embedded newlines."""
    return str(s).replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")


def classify(approved, root_real):
    """Resolve one approved display string against the real workspace root and
    return a dict describing the verdict. Read-only; opens nothing.

    root_real must already be os.path.realpath(root).
    """
    if not isinstance(approved, str) or approved.strip() == "":
        raise BadInput("approved path is empty")

    # what the OS would actually expand and open ----------------------------
    expanded = os.path.expandvars(os.path.expanduser(approved))
    if os.path.isabs(expanded):
        base = expanded
    else:
        base = os.path.join(root_real, expanded)

    try:
        resolved = os.path.realpath(base)         # follows symlinks + collapses ".."
        lexical = os.path.normpath(base)          # collapses ".." only, no symlink follow
    except (ValueError, OSError) as exc:
        raise BadInput(f"unresolvable path {clean(approved)!r}: {exc}")

    # what a human reads off the dialog, joined to the workspace, taken at face value
    display_raw = approved if os.path.isabs(approved) else os.path.join(root_real, approved)

    # immediate symlink text, if the top-level candidate is itself a symlink
    link_text = None
    try:
        if os.path.islink(base):
            link_text = os.readlink(base)
    except OSError:
        link_text = None

    inside = is_inside(resolved, root_real)
    symlink_followed = resolved != lexical        # realpath diverged from lexical => a symlink was traversed
    expansion_changed = expanded != approved      # "~" or "$VAR" was present
    normalization_changed = lexical != base       # ".." / "." / "//" collapsed

    out = {
        "approved": approved,
        "resolved": resolved,
        "link_text": link_text,
        "inside": inside,
    }

    if not inside:
        if symlink_followed:
            cwe = "CWE-451 display mismatch + CWE-61 symlink escape"
            how = "a symlink"
        else:
            cwe = "CWE-451 display mismatch + CWE-22 path traversal"
            how = '".." traversal'
        out["verdict"] = "BLOCK"
        out["code"] = "resolved-escapes-workspace"
        out["reason"] = (
            f"approved '{clean(approved)}', resolves to '{clean(resolved)}' outside "
            f"workspace via {how} ({cwe})"
        )
        return out

    if resolved != display_raw:
        if symlink_followed:
            why = "a symlink redirected the target within the workspace"
        elif expansion_changed:
            why = "shell expansion (~ or $VAR) changed the displayed path"
        elif normalization_changed:
            why = "path normalization (.. or . or //) changed the displayed path"
        else:
            why = "resolution changed the displayed path"
        out["verdict"] = "WARN"
        out["code"] = "display-differs-from-resolved"
        out["reason"] = (
            f"approved '{clean(approved)}' but the OS would open '{clean(resolved)}'; "
            f"{why} (CWE-451 display mismatch, stays inside workspace)"
        )
        return out

    out["verdict"] = "PASS"
    out["code"] = "display-matches-resolved"
    return out


def render_case(index, res):
    lines = [f"[{index}] approved: '{clean(res['approved'])}'"]
    if res.get("link_text") is not None:
        lines.append(f"    link     : {clean(res['approved'])} -> {clean(res['link_text'])}")
    lines.append(f"    resolved : {clean(res['resolved'])}")
    v = res["verdict"]
    if v == "PASS":
        lines.append("    verdict  : PASS (display-matches-resolved) exit 0")
        lines.append("    detail   : resolved target is inside the workspace and equals the approved path")
    elif v == "WARN":
        lines.append(f"    verdict  : WARN ({res['code']}) exit 1")
        lines.append(f"    reason   : {res['reason']}")
        lines.append(f"    contrast : dialog shows '{clean(res['approved'])}' / OS opens '{clean(res['resolved'])}'")
    else:  # BLOCK
        lines.append(f"    verdict  : BLOCK ({res['code']}) exit 1")
        lines.append(f"    reason   : {res['reason']}")
        lines.append(f"    contrast : dialog shows '{clean(res['approved'])}' / OS opens '{clean(res['resolved'])}'")
    return lines


def render_error(index, approved, message):
    return [
        f"[{index}] approved: '{clean(approved)}'",
        "    verdict  : ERROR (bad-input) exit 2",
        f"    detail   : {message}",
    ]


def worst_exit(n_flag, n_error):
    # worst verdict wins: a BLOCK/WARN (1) outranks unusable input (2) outranks a pass (0)
    if n_flag:
        return 1
    if n_error:
        return 2
    return 0


def emit(body, exit_code):
    report = "\n".join(body)
    digest = hashlib.sha256(report.encode("utf-8")).hexdigest()
    sys.stdout.write(report + "\n")
    sys.stdout.write(f"report-sha256: {digest}\n")
    return exit_code


def main(argv):
    if len(argv) < 3:
        sys.stderr.write(
            "usage: approval_path_resolve_gate.py <workspace_root> <approved_path> [more ...]\n"
        )
        return 2

    root, approved_list = argv[1], argv[2:]

    body = ["approval-path-resolve-gate: you approved a string; what will the OS open?"]

    if not os.path.isdir(root):
        body.append(f"root : {clean(root)}")
        body.append("ERROR (bad-input) exit 2: workspace root does not exist or is not a directory")
        return emit(body, 2)

    root_real = os.path.realpath(root)
    body.append(f"root : {clean(root_real)}")
    body.append(f"paths: {len(approved_list)}")
    body.append("")

    n_pass = n_warn = n_block = n_error = 0

    for i, approved in enumerate(approved_list, start=1):
        try:
            res = classify(approved, root_real)
        except BadInput as exc:
            body.extend(render_error(i, approved, str(exc)))
            body.append("")
            n_error += 1
            continue
        body.extend(render_case(i, res))
        body.append("")
        if res["verdict"] == "PASS":
            n_pass += 1
        elif res["verdict"] == "WARN":
            n_warn += 1
        else:
            n_block += 1

    exit_code = worst_exit(n_warn + n_block, n_error)
    body.append(
        f"summary: {n_pass} PASS, {n_warn} WARN, {n_block} BLOCK, {n_error} ERROR  ->  "
        f"overall exit {exit_code}"
    )
    return emit(body, exit_code)


if __name__ == "__main__":
    sys.exit(main(sys.argv))

And here is the fixture builder. It is the only script here that writes to disk, and it writes only under one fixed /tmp directory. The path is fixed on purpose — a fixed path is what makes the gate's STDOUT byte-for-byte reproducible, so the sha256 self-digest is stable across runs.

#!/usr/bin/env python3
"""make_fixtures.py - build the demo workspace for approval_path_resolve_gate.py.

This is the ONLY script here that touches the disk, and it only writes under one
fixed, namespaced directory in the system temp tree. It never touches your real
~/.ssh or anything outside its own base. The gate itself (approval_path_resolve_gate.py)
writes nothing.

The base path is FIXED (not randomized) on purpose: a fixed path makes the gate's
STDOUT byte-for-byte reproducible, so the sha256 self-digest is stable across runs.

Layout built under BASE (= /tmp/approval-path-resolve-gate, realpath'd by the OS):

  BASE/repo_clean/
      project_settings.json          regular file (the honest case)
      config/real.json               regular file
  BASE/repo_poisoned/
      project_settings.json  ->  ../outside/authorized_keys   (symlink ESCAPING the repo)
      config/real.json               regular file
      inner_link.json        ->  config/real.json             (symlink, stays IN the repo)
      logs/                          empty dir (for the ../ normalization case)
  BASE/outside/
      authorized_keys                a fixture "sensitive" file. NOT your real ssh key.

Run:  python3 make_fixtures.py
"""

import os
import shutil

BASE = os.path.join(os.path.realpath("/tmp"), "approval-path-resolve-gate")


def write(path, text):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


def main():
    if os.path.lexists(BASE):
        shutil.rmtree(BASE)
    os.makedirs(BASE)

    # a fixture "sensitive" file OUTSIDE any repo. This stands in for the file an
    # attacker wants your approved write to land on. It is NOT a real ssh key.
    write(os.path.join(BASE, "outside", "authorized_keys"),
          "# fixture only - stands in for a sensitive out-of-workspace file\n")

    # ---- repo_clean: project_settings.json is an ordinary file ----
    clean = os.path.join(BASE, "repo_clean")
    write(os.path.join(clean, "project_settings.json"), '{"theme": "dark"}\n')
    write(os.path.join(clean, "config", "real.json"), '{"real": true}\n')

    # ---- repo_poisoned: same visible name, different on-disk truth ----
    poisoned = os.path.join(BASE, "repo_poisoned")
    write(os.path.join(poisoned, "config", "real.json"), '{"real": true}\n')
    os.makedirs(os.path.join(poisoned, "logs"), exist_ok=True)

    # project_settings.json is a symlink whose target escapes the repo.
    link = os.path.join(poisoned, "project_settings.json")
    os.symlink(os.path.join("..", "outside", "authorized_keys"), link)

    # inner_link.json is a symlink that redirects but stays inside the repo.
    os.symlink(os.path.join("config", "real.json"),
               os.path.join(poisoned, "inner_link.json"))

    print("fixtures built under:", BASE)
    print("  repo_clean    :", os.path.realpath(clean))
    print("  repo_poisoned :", os.path.realpath(poisoned))


if __name__ == "__main__":
    main()

Build the fixtures once, then run the gate:

python3 make_fixtures.py
python3 approval_path_resolve_gate.py /tmp/approval-path-resolve-gate/repo_clean project_settings.json
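If the gate is called from another program rather than a terminal, the caller has to decide what each exit code means. A minimal sketch, assuming the gate script sits at a path you pass in: only exit 0 allows the write, and everything else, including a crash or a timeout, denies it. The names `allow_write` and `gate_allows` and the 10-second timeout are hypothetical.

```python
import subprocess
import sys


def allow_write(returncode):
    """Only a clean PASS (exit 0) allows the write. WARN or BLOCK (1), bad input (2),
    and anything unexpected all deny."""
    return returncode == 0


def gate_allows(gate_script, workspace_root, approved_path, timeout_s=10):
    """Run the gate as a subprocess and fail closed on any error.

    Returns (allowed, report_text).
    """
    try:
        proc = subprocess.run(
            [sys.executable, gate_script, workspace_root, approved_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The gate could not run or did not finish: treat that as a denial.
        return False, ""
    return allow_write(proc.returncode), proc.stdout
```

Because WARN and BLOCK share exit 1, a caller that wants to treat them differently has to read the verdict line in the report. This sketch denies both.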

The killer demo: one approved string, two verdicts

Here is the claim, stated so you can break it. Take the exact same approved string — project_settings.json — and run it against two workspaces. In repo_clean it is an ordinary file. In repo_poisoned it is a symlink to ../outside/authorized_keys, a file outside the repo. The human sees the same friendly string in the dialog both times. The gate must PASS the first and BLOCK the second. If you can make it pass a symlink that escapes the workspace, the tool is broken and this post comes down with it.

First, the clean repo. Approved string resolves to a real file inside the workspace:

approval-path-resolve-gate: you approved a string; what will the OS open?
root : /private/tmp/approval-path-resolve-gate/repo_clean
paths: 1

[1] approved: 'project_settings.json'
    resolved : /private/tmp/approval-path-resolve-gate/repo_clean/project_settings.json
    verdict  : PASS (display-matches-resolved) exit 0
    detail   : resolved target is inside the workspace and equals the approved path

summary: 1 PASS, 0 WARN, 0 BLOCK, 0 ERROR  ->  overall exit 0
report-sha256: cc61a5e7291ff6ebffa5cbadffc50dbde64b632ecb33351587755871413a10b3

Now the poisoned repo. Same gate, same approved string; the only change is the workspace, where project_settings.json is now a symlink:

approval-path-resolve-gate: you approved a string; what will the OS open?
root : /private/tmp/approval-path-resolve-gate/repo_poisoned
paths: 1

[1] approved: 'project_settings.json'
    link     : project_settings.json -> ../outside/authorized_keys
    resolved : /private/tmp/approval-path-resolve-gate/outside/authorized_keys
    verdict  : BLOCK (resolved-escapes-workspace) exit 1
    reason   : approved 'project_settings.json', resolves to '/private/tmp/approval-path-resolve-gate/outside/authorized_keys' outside workspace via a symlink (CWE-451 display mismatch + CWE-61 symlink escape)
    contrast : dialog shows 'project_settings.json' / OS opens '/private/tmp/approval-path-resolve-gate/outside/authorized_keys'

summary: 0 PASS, 0 WARN, 1 BLOCK, 0 ERROR  ->  overall exit 1
report-sha256: 90664de66fe036c2d7fa13057bace1537fb22f31a49c832f9e72db033757e187

Read the contrast line twice. The dialog shows project_settings.json. The OS opens outside/authorized_keys. The exit code went from 0 to 1. The human clicked Approve on the same string in both worlds. In the fixture the target is my stand-in file; in a real repo the symlink target is arbitrary, and ~/.ssh/authorized_keys is exactly the kind of place a poisoned symlink points. The gate resolves ~ too, so an approved string of ~/.ssh/authorized_keys would land the same BLOCK.

The sweep: five approved paths, five verdicts

The killer demo is one string against two repos. The sweep is five strings against one poisoned repo, to show every state the gate distinguishes — PASS, BLOCK by symlink, WARN by inner symlink, WARN by .. normalization, BLOCK by .. traversal — and to show that with multiple paths the worst verdict drives the process exit:

approval-path-resolve-gate: you approved a string; what will the OS open?
root : /private/tmp/approval-path-resolve-gate/repo_poisoned
paths: 5

[1] approved: 'config/real.json'
    resolved : /private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json
    verdict  : PASS (display-matches-resolved) exit 0
    detail   : resolved target is inside the workspace and equals the approved path

[2] approved: 'project_settings.json'
    link     : project_settings.json -> ../outside/authorized_keys
    resolved : /private/tmp/approval-path-resolve-gate/outside/authorized_keys
    verdict  : BLOCK (resolved-escapes-workspace) exit 1
    reason   : approved 'project_settings.json', resolves to '/private/tmp/approval-path-resolve-gate/outside/authorized_keys' outside workspace via a symlink (CWE-451 display mismatch + CWE-61 symlink escape)
    contrast : dialog shows 'project_settings.json' / OS opens '/private/tmp/approval-path-resolve-gate/outside/authorized_keys'

[3] approved: 'inner_link.json'
    link     : inner_link.json -> config/real.json
    resolved : /private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json
    verdict  : WARN (display-differs-from-resolved) exit 1
    reason   : approved 'inner_link.json' but the OS would open '/private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json'; a symlink redirected the target within the workspace (CWE-451 display mismatch, stays inside workspace)
    contrast : dialog shows 'inner_link.json' / OS opens '/private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json'

[4] approved: 'logs/../config/real.json'
    resolved : /private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json
    verdict  : WARN (display-differs-from-resolved) exit 1
    reason   : approved 'logs/../config/real.json' but the OS would open '/private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json'; path normalization (.. or . or //) changed the displayed path (CWE-451 display mismatch, stays inside workspace)
    contrast : dialog shows 'logs/../config/real.json' / OS opens '/private/tmp/approval-path-resolve-gate/repo_poisoned/config/real.json'

[5] approved: '../outside/authorized_keys'
    resolved : /private/tmp/approval-path-resolve-gate/outside/authorized_keys
    verdict  : BLOCK (resolved-escapes-workspace) exit 1
    reason   : approved '../outside/authorized_keys', resolves to '/private/tmp/approval-path-resolve-gate/outside/authorized_keys' outside workspace via ".." traversal (CWE-451 display mismatch + CWE-22 path traversal)
    contrast : dialog shows '../outside/authorized_keys' / OS opens '/private/tmp/approval-path-resolve-gate/outside/authorized_keys'

summary: 1 PASS, 2 WARN, 2 BLOCK, 0 ERROR  ->  overall exit 1
report-sha256: b349309aa4acc91d6c02bfb5e189ab02d29ceef5afd034b84874fd9cd4af73f9

The WARN cases are the ones I'd bikeshed. Path [3], inner_link.json, is a symlink that redirects but stays inside the repo. Path [4], logs/../config/real.json, resolves to a legitimate file inside the repo after .. collapses. Neither escapes, so neither is a BLOCK. But both are cases where the string you approved is not the file the OS opens, and I refuse to call that PASS. SHOWN differs from RESOLVED, even when RESOLVED is still inside the fence. You can argue an inner redirect is benign; I'd rather see it and decide, than have it hidden. Draw that line differently in your own policy — that is why the verdict is WARN and not BLOCK.
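The PASS / WARN / BLOCK split above can be sketched in a few lines. This is a minimal illustration, not the code from approval_path_resolve_gate.py: the function name classify and its return shape are hypothetical, it assumes a POSIX filesystem, and it leans only on os.path.realpath and os.path.commonpath from the standard library.

```python
import os


def classify(root: str, approved: str) -> tuple[str, str]:
    """Return (verdict, resolved_path) for one approved string. Sketch only."""
    if not approved:
        return "ERROR", ""
    # Resolve the root too, so /tmp vs /private/tmp style aliases do not
    # produce a false mismatch.
    root_real = os.path.realpath(root)
    if not os.path.isdir(root_real):
        return "ERROR", ""
    # SHOWN: what the human approved, made absolute but not normalized.
    shown = os.path.join(root_real, os.path.expanduser(approved))
    # RESOLVED: what the OS would open, with symlinks followed and
    # "..", "." and "//" collapsed.
    resolved = os.path.realpath(shown)
    inside = os.path.commonpath([root_real, resolved]) == root_real
    if not inside:
        return "BLOCK", resolved   # escapes the workspace
    if resolved != shown:
        return "WARN", resolved    # inside the fence, but not what was shown
    return "PASS", resolved
```

The one design point worth copying is the order of the checks: containment is tested before equality, so an escaping symlink can never be downgraded to a WARN.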

Fail-closed, because a gate that errors green is not a gate

Two bad-input paths. A missing workspace root, and an empty approved string. Both must exit 2, never 0:

approval-path-resolve-gate: you approved a string; what will the OS open?
root : /private/tmp/approval-path-resolve-gate/does_not_exist
ERROR (bad-input) exit 2: workspace root does not exist or is not a directory
report-sha256: 6e457546787176bfc328cfe78e9daa7d6208189e743a6edcc4bab2045f93975a

approval-path-resolve-gate: you approved a string; what will the OS open?
root : /private/tmp/approval-path-resolve-gate/repo_poisoned
paths: 1

[1] approved: ''
    verdict  : ERROR (bad-input) exit 2
    detail   : approved path is empty

summary: 0 PASS, 0 WARN, 0 BLOCK, 1 ERROR  ->  overall exit 2
report-sha256: f04e6525ca3aca82cd40f07b600b3a9ee5bdffa7391cc2c47b3c36a7b39af23e

One more thing about untrusted input. The approved string comes from the model, which can be steered by a poisoned file or web page. If a crafted path contains a newline, a naive report could forge fake verdict lines. The clean() helper escapes \r and \n so a hostile path can never inject an extra line into the report. The approved string is data. The gate resolves it and compares it. It never runs it.
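A minimal sketch of that escaping step, assuming nothing about the real clean() beyond what the paragraph says. The name escape_for_report is hypothetical, and escaping the backslash first is my own addition so that an escaped newline cannot be confused with a literal backslash followed by n.

```python
def escape_for_report(text: str) -> str:
    """Make an untrusted string safe to print on a single report line."""
    # Escape backslashes first (an assumption, not confirmed by the post),
    # then the two characters that could start a forged report line.
    return (
        text.replace("\\", "\\\\")
            .replace("\r", "\\r")
            .replace("\n", "\\n")
    )


# A hostile approved string that tries to forge a green verdict line.
hostile = "x.json\n    verdict  : PASS (display-matches-resolved) exit 0"
print("[1] approved: '" + escape_for_report(hostile) + "'")
```

The printed result stays on one line, so a reader or a parser scanning for lines that start with "verdict" sees only the verdict the gate itself wrote.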

Everything above is reproducible. run_demo.sh rebuilds the fixtures, runs each scenario twice, and prints a sha256 of the full STDOUT for each — I ran it and got run1==run2 IDENTICAL on all five. The whole harness finished in about a third of a second on my laptop. If your bytes differ from the hashes in this post, either your Python resolves paths differently or I made a mistake, and I want to hear which.

What this is NOT

I want to be honest about the fence I built, because a gate oversold is worse than no gate.

It resolves the target, not your rights. The gate decides whether the file the OS would open matches what you approved and stays in the workspace. It does not check whether the agent is allowed to write there in the first place — that is authorization, a different verdict against a different policy. I built that one separately: your trace proves what the agent did, not that it was allowed.
It does not read the content. A PASS means the destination is honest, not that the bytes being written are safe. A write to a legitimate file can still be garbage.
It is not a sandbox. It enforces nothing at runtime. It is a pre-approval check that answers one question so a human or an orchestrator can answer the bigger one. If you need containment, containment is a different tool.
There is a TOCTOU window, and I will name it. The gate resolves the path, then something else performs the write. Between the resolve and the write, an attacker who can modify the filesystem could swap a benign file for a symlink — the classic time-of-check to time-of-use race. Resolving right before the write shrinks that window; it does not close it. For a true close you resolve and write through the same file descriptor (O_NOFOLLOW, openat with O_DIRECTORY walks), which is a runtime concern, not a static gate's job.
It is not cryptographic verification. It trusts the filesystem it reads. It does not prove the file's provenance or integrity; it only reports where a path points right now.
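For the TOCTOU item above, here is a hedged sketch of what the runtime side could look like on POSIX. It is not part of the gate; open_final_no_follow is a hypothetical helper. It only refuses a symlink at the final path component; closing the race for intermediate directories would need the same treatment applied component by component.

```python
import os


def open_final_no_follow(dir_path: str, name: str) -> int:
    """Open `name` inside `dir_path` for writing; fail if `name` is a symlink.

    Returns a file descriptor the caller must close. POSIX only.
    """
    # Pin the directory first, then open relative to that descriptor.
    dir_fd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # O_NOFOLLOW makes the kernel raise OSError instead of following
        # a symlink at the last component, at the moment of use.
        return os.open(name, os.O_WRONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
    finally:
        os.close(dir_fd)
```

The check and the open are the same system call here, which is the property a static pre-approval gate cannot provide on its own.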

This is one narrow gate in a series I keep building around the same thesis — decide before the irreversible act, not after. If you found this one useful, the pre-execution gate is the pillar, gate-taint-lint checks whether your gate's own inputs are trustworthy, and the lethal-trifecta gate catches the config combination that makes an escape like this catastrophic.

The unresolved part I'd like an argument about

Here is the thing I have not settled. My gate treats an inner symlink — one that redirects but stays inside the workspace — as WARN, not BLOCK. In a monorepo full of legitimate symlinks, BLOCK-on-any-redirect would be unusably noisy; WARN lets you see it and move on. But an inner redirect is still a place where the string you approved is not the file that gets written, and I can imagine an attack that lives entirely inside the fence. So: for an agent writing inside a trusted repo, is SHOWN-differs-from-RESOLVED-but-still-inside a thing you'd block, or a thing you'd log and live with? I keep flipping on it.

I write one of these gates at a time — offline, keyless, runnable in the time it takes to read the post. Follow along if that's your kind of thing, and tell me in the comments how your agent stack decides what a human actually approved before a write lands. I read every reply.

Checkpoint-Skip Gate: Task Success 100%, Checkpoint Never Ran

Alexey Spinov — Sun, 12 Jul 2026 09:35:31 +0000

Checkpoint-skip gate: a multi-agent pipeline can finish with task_success: true while the mandatory confirmation checkpoint never ran. checkpoint_skip_gate.py replays a recorded JSONL trajectory against a declarative spec of mandatory checkpoints and handoff contracts, offline, and blocks when the road was wrong. The verdict never consults the final metric. That is the point.

AI disclosure: I wrote checkpoint_skip_gate.py with an AI assistant and ran it myself, offline, on Python 3.13.5, standard library only, no network. Every number, exit code, and hash in the output blocks below is pasted from a real local run. I ran each scenario twice to confirm STDOUT is byte-for-byte identical, and the tool prints a sha256 of its own report so you can reproduce the exact bytes. The Alberta write-up and the arXiv paper I cite are other people's work, attributed inline, and their numbers stay out of my fixtures.

In short:

task_success=true proves the pipeline arrived. It does not prove the mandatory steps happened, happened in order, or that each agent-to-agent handoff delivered what the next agent assumed. A trajectory can be perfectly green and structurally wrong.
The gate replays a recorded trajectory against a spec you declare: checkpoints that must precede specific actions, plus contracts for each handoff (required fields, verified flags). The final metric is printed for contrast and ignored for the verdict.
The demo that matters: two trajectories identical except one JSONL line, the confirm_with_user checkpoint event. Both end task_success: true. Delete that line and the verdict flips from PASS exit 0 to BLOCK exit 1 checkpoint-skipped.
It also tracks unverified values across handoffs. A number that travelled a connected chain of two handoffs with no hop verifying it blocks as unverified-claim-propagated-2-hops. Everyone shared the number. Nobody verified it.
Offline, keyless, zero network, fail-closed: broken input exits 2, never a silent green. The whole 8-fixture sweep runs in about 0.02 seconds on my laptop.

How does task_success=true hide a skipped checkpoint?

Your eval is a finish-line camera. It photographs the moment the pipeline crosses the line, and the photograph is honest: the report went out, the task completed, the metric is green. What the camera cannot photograph is the road. Whether the agent stopped at the mandatory confirmation before the irreversible send. Whether the number it reported was verified by anyone at any point, or just repeated with growing confidence at every handoff.

Your eval measures whether the agent arrived. It does not measure whether it took the road.

For a single agent this is annoying. For a pipeline of agents it compounds, quietly. Take a plain three-stage pipeline: a scanner that pulls rows, an aggregator that summarizes them, a reporter that confirms with the user and ships the result. Five places for the road to go wrong: two handoffs, one mandatory checkpoint, one ordering constraint, one irreversible action. The final metric watches exactly none of them. It fires when the reporter ships, and it says true.

Here is the uncomfortable arithmetic. A five-step pipeline where the final metric is green tells you one fact about five constraints. The other four live only in the trajectory log, and if nobody replays that log against a spec, nobody is checking them. Not your dashboard, not your test suite, not the human who saw the green checkmark and moved on.

Tracking is not control, now at the handoff level

I keep hammering one thesis in the gates I build: tracking a fact is not the same as controlling the thing the fact describes. task_success tracks arrival. Control means the mandatory steps of the road are enforced, and enforcement needs an artifact you can check mechanically. A checkpoint that exists only in your prompt ("always confirm with the user before sending") is a wish. A checkpoint that must appear in the recorded trajectory, before a named action, or the build blocks, is a control.

Multi-agent pipelines make this sharper, because the road now includes borders. Every handoff between agents is a place where one agent's unchecked output becomes the next agent's ground truth. The metric at the end of the pipe says nothing about what crossed those borders. So the gate treats each handoff as a contract: these fields must be present, and if the contract says so, they must arrive with verified: true asserted by the sender.

Here is the claim, stated so you can break it. Take two JSONL trajectories that are identical line for line, except one contains the event {"event": "checkpoint", "checkpoint_id": "confirm_with_user", ...} before the irreversible send_report action, and the other does not. Both end with {"event": "final", "task_success": true}. The gate must exit 0 on the first and exit 1 with reason checkpoint-skipped on the second. One deleted line has to flip the verdict. If you can construct a trajectory where the gate gets this wrong, a conforming road it blocks or a skipped mandatory checkpoint it passes, the tool is broken and this post comes down with it.

What does the gate replay?

Two files. A trajectory: JSONL, one event per line, strictly increasing seq, four event types (action, checkpoint, handoff, final). And a spec: plain JSON, two lists. mandatory_checkpoints says which checkpoint must exist and which action type it must precede. handoff_contracts says, for each from_agent -> to_agent border, which payload fields are required and whether they must arrive with verified: true.
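To make the trajectory shape concrete, here is a minimal loader sketch that enforces the constraints named above plus the bad-input conditions the post lists in its verdict table. The function name load_trajectory, the exact error strings, and the choice to skip blank lines are my assumptions, not the tool's behavior.

```python
import json

EVENT_TYPES = {"action", "checkpoint", "handoff", "final"}


def load_trajectory(text: str) -> list[dict]:
    """Parse JSONL into events, raising ValueError on any unusable input."""
    events: list[dict] = []
    last_seq = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: unparseable JSONL: {exc.msg}") from exc
        if not isinstance(event, dict) or event.get("event") not in EVENT_TYPES:
            raise ValueError(f"line {lineno}: unknown event type")
        if events and events[-1]["event"] == "final":
            raise ValueError(f"line {lineno}: events after final")
        seq = event.get("seq")
        if not isinstance(seq, int) or (last_seq is not None and seq <= last_seq):
            raise ValueError(f"line {lineno}: seq must be strictly increasing")
        last_seq = seq
        events.append(event)
    if not events:
        raise ValueError("empty trajectory")
    if events[-1]["event"] != "final":
        raise ValueError("missing final")
    return events
```

A caller would map any ValueError to the bad-input verdict and exit 2, which keeps a malformed trace from ever reaching the PASS branch.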

In my fixtures the spec is 9 lines:

{
  "mandatory_checkpoints": [
    {"checkpoint_id": "confirm_with_user", "must_precede": "send_report"}
  ],
  "handoff_contracts": [
    {"from_agent": "scanner", "to_agent": "aggregator", "required_fields": ["row_count", "source"], "require_verified": false},
    {"from_agent": "aggregator", "to_agent": "reporter", "required_fields": ["row_count", "summary"], "require_verified": true}
  ]
}

Note the asymmetry, it is deliberate: the closer a handoff sits to the irreversible act, the stricter its contract. The last border before send_report demands verified fields; the first one only demands presence. Your spec will draw those lines differently. That is the point of making it declarative.

The gate replays the trajectory and collects every reason that fires, instead of bailing on the first. The headline reason drives the exit code; the full list drives the forensics. These are the states:

Verdict	Condition in the recorded trajectory	reason-code	exit
PASS	every mandatory checkpoint fired and was consumed by the act it gates; declared checkpoint dependencies hold; every contracted handoff satisfied	`trajectory-conforms`	0
BLOCK	mandatory checkpoint absent while its guarded action executed	`checkpoint-skipped`	1
BLOCK	checkpoint present, but after the action it must precede	`checkpoint-after-action`	1
BLOCK	a repeated act reused a confirmation that was already spent	`checkpoint-not-reconfirmed`	1
BLOCK	a checkpoint fired before a prerequisite the spec declares with `requires`	`checkpoint-out-of-order`	1
BLOCK	a declared prerequisite checkpoint never fired at all	`checkpoint-dependency-unmet`	1
BLOCK	required field missing from a handoff payload	`handoff-contract-violation`	1
BLOCK	required field arrived without `verified: true` where the contract demands it	`unverified-claim-consumed-as-ground-truth`	1
BLOCK	a value travelled a connected chain of 2+ handoffs, unverified at every hop	`unverified-claim-propagated-N-hops`	1
WARN	handoff covered by no contract (spec may be incomplete)	`uncontracted-handoff`	0 (warn)
ERROR	broken JSONL, empty trajectory, missing final, events after final, unusable spec	`bad-input`	2

The checkpoint check is short enough to read whole. This is the loop, verbatim from evaluate() in the file:

    for mc in spec["mandatory_checkpoints"]:
        cid, must = mc["checkpoint_id"], mc["must_precede"]
        cp_seqs = sorted(cp_by_id.get(cid, []))
        act_seqs = sorted(e["seq"] for e in events
                          if e["event"] == "action" and e["action_type"] == must)
        if not act_seqs:
            notes.append(f"checkpoint '{cid}': guarded action '{must}' never occurred; nothing to gate")
            continue
        if not cp_seqs:
            reasons.append((
                "checkpoint-skipped",
                f"mandatory confirmation never happened: "
                f"checkpoint '{cid}' absent from trajectory, guarded action "
                f"'{must}' executed at seq {act_seqs[0]}",
            ))
            continue

        unconsumed = list(cp_seqs)
        for n, act in enumerate(act_seqs, start=1):
            available = [s for s in unconsumed if s < act]
            if available:
                unconsumed.remove(available[0])
                continue
            if n == 1:
                reasons.append((
                    "checkpoint-after-action",
                    f"confirmation happened after the act it was supposed to gate: "
                    f"checkpoint '{cid}' at seq {cp_seqs[0]}, action '{must}' at seq {act}",
                ))
            else:
                reasons.append((
                    "checkpoint-not-reconfirmed",
                    f"one confirmation was reused for a repeated act: occurrence {n} of "
                    f"'{must}' at seq {act} consumed no unspent checkpoint '{cid}' "
                    f"({len(cp_seqs)} fired, {len(act_seqs)} acts to gate)",
                ))

Notice what is absent. Nowhere in the verdict logic does the gate read task_success. It parses the final line, prints it in the report for contrast, and then decides purely from the road. A design choice I want to defend out loud: checkpoint-after-action is its own block, not a lesser warning. A confirmation that happens after the irreversible act is theater with a timestamp.

A confirmation is consumed when it gates an act. Each occurrence of send_report has to claim its own unspent confirm_with_user, earliest first; a second send behind a single confirmation blocks with checkpoint-not-reconfirmed. The first version of this gate compared min(cp_seqs) against the first action only, which let one approval cover every later repeat. See the update at the end for why that changed, and who changed it.

Ordering between checkpoints is the other thing worth being precise about. A flat list declares no relation, so the gate enforces none. If safety_review is only meaningful after budget_approved, the spec says so with an edge, "requires": ["budget_approved"], and the mandatory set becomes a small DAG the replay can check (checkpoint-out-of-order, checkpoint-dependency-unmet; a cycle in the edges is an unusable spec, exit 2). An edge nobody declares is an ordering the gate cannot enforce, and it now prints that scope on PASS instead of implying more.
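A rough sketch of how declared requires edges could be checked, under assumptions of my own: check_dependencies is a hypothetical name, first_seq maps each checkpoint ID to the seq where it first fired, and a dependency is only evaluated for checkpoints that actually fired (a missing checkpoint is left to the checkpoint-skipped check).

```python
def check_dependencies(checkpoints: list[dict],
                       first_seq: dict[str, int]) -> list[tuple[str, str]]:
    """Return (reason_code, detail) pairs for violated `requires` edges."""
    requires = {c["checkpoint_id"]: list(c.get("requires", []))
                for c in checkpoints}

    # A cycle makes the spec unusable, so detect it before replaying anything.
    state: dict[str, str] = {}

    def visit(cid: str) -> None:
        if state.get(cid) == "done":
            return
        if state.get(cid) == "open":
            raise ValueError(f"cycle in requires edges at '{cid}'")
        state[cid] = "open"
        for dep in requires.get(cid, []):
            visit(dep)
        state[cid] = "done"

    for cid in requires:
        visit(cid)

    reasons = []
    for cid, deps in requires.items():
        if cid not in first_seq:
            continue  # absence of cid itself is a different check
        for dep in deps:
            if dep not in first_seq:
                reasons.append(("checkpoint-dependency-unmet",
                                f"'{cid}' requires '{dep}', which never fired"))
            elif first_seq[dep] > first_seq[cid]:
                reasons.append(("checkpoint-out-of-order",
                                f"'{cid}' at seq {first_seq[cid]} fired before "
                                f"'{dep}' at seq {first_seq[dep]}"))
    return reasons
```

With the safety_review and budget_approved example from the paragraph, a trajectory that fires safety_review at seq 4 and budget_approved at seq 5 would come back with a single checkpoint-out-of-order reason.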

One more choice I am less sure about. A payload field without an explicit verified: true counts as unverified. If nobody asserted the verification, the gate assumes nobody did it. That is the paranoid default, and it means sloppy-but-honest pipelines will hit propagation blocks until they start marking fields. I went back and forth on this and picked paranoia; you may reasonably pick the other side, it is one line to change.
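Here is a hedged sketch of a contract check with that paranoid default. The payload shape is an assumption on my part (each field maps to a dict with value and an optional verified flag); the post does not show the real payload format, and check_handoff is a hypothetical name. The contract keys are the ones from the spec above.

```python
def check_handoff(contract: dict, payload: dict) -> list[str]:
    """Return the reason codes one handoff payload triggers for its contract."""
    reasons = []
    required = contract["required_fields"]

    missing = [f for f in required if f not in payload]
    if missing:
        reasons.append("handoff-contract-violation")

    if contract.get("require_verified"):
        # Paranoid default: only an explicit `verified: true` counts.
        # Changing `is not True` to `is False` would flip the default.
        unverified = [f for f in required
                      if f in payload and payload[f].get("verified") is not True]
        if unverified:
            reasons.append("unverified-claim-consumed-as-ground-truth")

    return reasons
```

Both checks run regardless of each other, so a payload that is missing one required field and carries another unverified reports two reasons rather than stopping at the first.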

Quick start

python3 checkpoint_skip_gate.py spec.json trajectory.jsonl

No install, no key, no network. First argument is the spec, everything after is trajectory files. It prints one verdict per trajectory and returns an exit code you can wire into CI over the traces you already record. If your agent framework logs tool calls and messages, you can map those logs to this event shape with a few lines of glue; the gate itself does not care which framework produced the road.

The one line that flips the verdict

Two fixtures. A three-agent pipeline, scanner to aggregator to reporter, moving 17491 rows of billing data (fixture numbers should look like real numbers, that is a pet rule of mine). Both trajectories end with task_success: true. The entire difference between the files is one line:

$ diff fixtures/fixture_conforms.jsonl fixtures/fixture_skipped.jsonl
5d4
< {"seq": 5, "ts": "2026-07-11T03:14:20Z", "agent": "reporter", "event": "checkpoint", "checkpoint_id": "confirm_with_user", "confirmed_by": "owner"}

Run the conforming one:

checkpoint-skip-gate: the metric says it arrived; did it take the road?
spec : fixtures/spec.json (1 mandatory checkpoint, 2 handoff contracts)
files: 1

[1] fixtures/fixture_conforms.jsonl
    trajectory: 6 events, agents: scanner, aggregator, reporter; 2 handoffs, 1 checkpoint, 1 irreversible action
    final     : task_success=true (recorded, printed for contrast, ignored for the verdict)
    verdict   : PASS (trajectory-conforms) exit 0
    detail    : every mandatory checkpoint fired and was consumed by the act it gates (one confirmation per act); declared checkpoint dependencies hold; all contracted handoffs satisfied
    scope     : ordering between checkpoints is enforced only where the spec declares it (requires); an undeclared edge is not checked

summary: 1 PASS, 0 BLOCK, 0 ERROR  ->  overall exit 0
report-sha256: 6cbe2b0e4dbd63e96fe8e428c60aecfed1c38f37319f01a4425259a32db70db3

Now the one with the deleted line:

checkpoint-skip-gate: the metric says it arrived; did it take the road?
spec : fixtures/spec.json (1 mandatory checkpoint, 2 handoff contracts)
files: 1

[1] fixtures/fixture_skipped.jsonl
    trajectory: 5 events, agents: scanner, aggregator, reporter; 2 handoffs, 0 checkpoints, 1 irreversible action
    final     : task_success=true (recorded, printed for contrast, ignored for the verdict)
    verdict   : BLOCK (checkpoint-skipped) exit 1
    reason    : checkpoint-skipped: mandatory confirmation never happened: checkpoint 'confirm_with_user' absent from trajectory, guarded action 'send_report' executed at seq 6
    contrast  : final metric says task_success=true / trajectory says checkpoint-skipped

summary: 0 PASS, 1 BLOCK, 0 ERROR  ->  overall exit 1
report-sha256: 49f6ab279ad9406bb42df0cab2b9f96398b61a50091a722ec6040d1fdb55d8a6

Same agents, same 17491 rows, same handoffs, same green final line. One deleted JSONL line and the exit code goes from 0 to 1. The contrast line is the tool saying the quiet part in its own words: the final metric says true, the trajectory says the mandatory confirmation never happened.

In production nobody deletes that line by hand. The agent plans a shorter path under load, or a retry drops the confirmation branch, or a prompt tweak three weeks ago quietly removed the ask. The finish-line camera sees none of it. The trajectory does, if something replays the trajectory. The two sha256 lines are the tool hashing its own report body; I ran every scenario twice and the bytes matched both times.
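If you want the same self-hashing property in your own tool, a minimal sketch looks like this. The helper name with_report_hash is hypothetical, and which exact bytes the real gate feeds into its hash is not stated in the post, so treat the body-then-digest layout here as an assumption.

```python
import hashlib


def with_report_hash(body_lines: list[str]) -> str:
    """Join report lines and append a sha256 of exactly those bytes."""
    body = "\n".join(body_lines) + "\n"
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body + f"report-sha256: {digest}\n"


report = with_report_hash([
    "summary: 0 PASS, 1 BLOCK, 0 ERROR  ->  overall exit 1",
])
print(report, end="")
```

The hash is only reproducible if the body is deterministic, so anything like timestamps, absolute temp paths, or dict ordering that varies between runs has to stay out of the hashed lines.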

Everyone shared the same number

The second failure mode is quieter and, in a fleet, worse. On July 10 a developer publishing as itskondrat wrote up an Alberta government project that ran about 50 Claude agents in parallel over 466 million lines of legacy code across 1,280 applications, 20 hours instead of an estimated six years. Those are his numbers, relayed from his post; I have not read the underlying Alberta report. The lines that stuck with me are his diagnosis: "Individual agents fail loudly. Handoffs fail quietly." And: "Errors introduced in step one get treated as ground truth in step two." He asks "the question nobody asked: who owned that handoff?"

That failure has a shape you can check mechanically: a value enters a handoff without anyone asserting they verified it, and then it keeps traveling. So the gate tracks exactly that. Two more fixtures, this pair differing only in the verified flags: in one, the scanner's anomaly_count of 3 is marked verified and the pipeline passes. In the other, the same value crosses both handoffs with verified: false:

[1] fixtures/fixture_propagated.jsonl
    trajectory: 6 events, agents: scanner, aggregator, reporter; 2 handoffs, 1 checkpoint, 1 irreversible action
    final     : task_success=true (recorded, printed for contrast, ignored for the verdict)
    verdict   : BLOCK (unverified-claim-propagated-2-hops) exit 1
    reason    : unverified-claim-propagated-2-hops: field 'anomaly_count'=3 entered unverified at scanner->aggregator (seq 2) and was passed on unverified through 2 handoffs, last at aggregator->reporter (seq 4); no agent verified it at any hop
    contrast  : final metric says task_success=true / trajectory says unverified-claim-propagated-2-hops

summary: 0 PASS, 1 BLOCK, 0 ERROR  ->  overall exit 1
report-sha256: 14bf34666cfcb721f3feb783f32ee77caf06b320659f0bc56c42031253ad91df

Read the reason line back in itskondrat's terms. The anomaly_count is the number everyone shared. The two hops are the handoffs nobody owned. The checkpoint ran, the metric is green, and the pipeline shipped on the back of a count that, per the recorded flags, no agent ever verified. The gate turns "who owned that handoff?" from a postmortem question into an exit code, at least for a value that travels unchanged.
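The chain matching can be sketched as follows. This is an illustration under my own assumptions, not the gate's code: longest_unverified_chain is a hypothetical name, handoffs are assumed to be in seq order, and each payload field is assumed to be a dict with value and an optional verified flag. Hops are linked when the receiver of one handoff is the sender of the next and the field carries the same value.

```python
def longest_unverified_chain(handoffs: list[dict], field: str) -> int:
    """Longest connected run of handoffs carrying `field` with the same
    value and no `verified: true` at any hop."""
    chains: list[int] = []  # chains[i]: longest unverified chain ending at i
    best = 0
    for i, hop in enumerate(handoffs):
        item = hop["payload"].get(field)
        if item is None or item.get("verified") is True:
            chains.append(0)
            continue
        length = 1
        for j in range(i):
            prev = handoffs[j]
            # chains[j] > 0 guarantees prev carries the field, unverified.
            if (chains[j] > 0
                    and prev["to_agent"] == hop["from_agent"]
                    and prev["payload"][field]["value"] == item["value"]):
                length = max(length, chains[j] + 1)
        chains.append(length)
        best = max(best, length)
    return best
```

A result of 2 or more is what would map to unverified-claim-propagated-N-hops. Because the link requires an identical value, a number that changes at a border resets the chain to 1, so this kind of matching only catches values that travel intact.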

Eight trajectories, one sweep

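The fixture set covers PASS, WARN, ERROR, and five of the eight BLOCK reasons in the table; checkpoint-not-reconfirmed, checkpoint-out-of-order, and checkpoint-dependency-unmet do not appear among these eight files. One command (checkpoint_skip_gate.py fixtures/spec.json fixtures/fixture_*.jsonl, so the files arrive alphabetically), abridged here to the verdict lines so the block fits on a screen: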

checkpoint-skip-gate: the metric says it arrived; did it take the road?
spec : fixtures/spec.json (1 mandatory checkpoint, 2 handoff contracts)
files: 8

[1] fixtures/fixture_bad_input.jsonl
    verdict   : ERROR (bad-input) exit 2
    detail    : line 2: unparseable JSONL: Expecting value

[2] fixtures/fixture_broken_contract.jsonl
    verdict   : BLOCK (handoff-contract-violation) exit 1
    reason    : handoff-contract-violation: required fields never delivered at a contracted border: aggregator->reporter (seq 4) missing ['summary']
    reason    : unverified-claim-consumed-as-ground-truth: aggregator->reporter (seq 4): required fields ['row_count'] arrived without verified=true at a border whose contract demands verification

[3] fixtures/fixture_conforms.jsonl
    verdict   : PASS (trajectory-conforms) exit 0

[4] fixtures/fixture_late_checkpoint.jsonl
    verdict   : BLOCK (checkpoint-after-action) exit 1
    reason    : checkpoint-after-action: confirmation happened after the act it was supposed to gate: checkpoint 'confirm_with_user' at seq 6, action 'send_report' at seq 5

[5] fixtures/fixture_propagated.jsonl
    verdict   : BLOCK (unverified-claim-propagated-2-hops) exit 1

[6] fixtures/fixture_skipped.jsonl
    verdict   : BLOCK (checkpoint-skipped) exit 1

[7] fixtures/fixture_uncontracted.jsonl
    verdict   : PASS (trajectory-conforms) exit 0
    warn      : uncontracted-handoff: reporter->auditor (seq 7) is covered by no contract; the spec may be incomplete, flagging instead of blocking

[8] fixtures/fixture_verified.jsonl
    verdict   : PASS (trajectory-conforms) exit 0

summary: 3 PASS (1 with warnings), 4 BLOCK, 1 ERROR  ->  overall exit 1
report-sha256: 9802d131d66039c0d846fb2ff57c59e24d458caf870371f545f922f937298a97

Three details worth your attention. The broken-contract file fires two reasons at once, the missing summary field and the unverified row_count, because the gate collects everything instead of stopping at the first hit; when you are doing forensics on a fleet you want the whole list, not the first symptom. The uncontracted handoff to an auditor agent is a warning, not a block: I cannot know your spec is complete, and pretending otherwise would train people to write vague contracts. And the bad-input file is fail-closed: a truncated JSONL line exits 2 with the parser's own message, never a green pass.

The whole 8-file sweep takes about 0.02 seconds of wall time on my laptop, measured with /usr/bin/time across three runs, so this costs you effectively nothing in CI. Each scenario was run twice; STDOUT was byte-for-byte identical both times, and the process exit is the worst verdict across files, where a confirmed BLOCK outranks a parse error outranks a pass.
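That precedence is small enough to sketch. The names below are hypothetical; the ordering (BLOCK over ERROR over PASS) and the exit codes come from the paragraph and the table above, while treating an empty list as an error is my own fail-closed assumption.

```python
# Worst first: a confirmed BLOCK outranks a parse error outranks a pass.
PRECEDENCE = ["BLOCK", "ERROR", "PASS"]
EXIT_CODE = {"PASS": 0, "BLOCK": 1, "ERROR": 2}


def overall_exit(verdicts: list[str]) -> int:
    """Collapse per-file verdicts into one process exit code."""
    if not verdicts:
        return EXIT_CODE["ERROR"]  # assumption: nothing checked is not green
    worst = min(verdicts, key=PRECEDENCE.index)
    return EXIT_CODE[worst]


# The eight-file sweep above: 3 PASS, 4 BLOCK, 1 ERROR.
sweep = ["ERROR", "BLOCK", "PASS", "BLOCK", "BLOCK", "BLOCK", "PASS", "PASS"]
print(overall_exit(sweep))
```

Note that the precedence is not the numeric order of the exit codes: ERROR has the higher number but BLOCK wins, so taking max() over exit codes would give a different answer for the sweep.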

How is this different from my other gates?

Fair question, because I have built neighbors, and the borders matter more than the family resemblance.

The green-checkmark auditor asks whether the tests behind a passing checkmark actually assert anything. That is the quality of the proof of the result. This gate does not look at the result's proof at all; it asks whether the mandatory steps of the path were walked. One axis is "what did you check", the other is "how did you get here".

The severity gate argues a final score should aggregate by worst failure, not pass rate. It still operates entirely on final scoring. This gate ignores final scoring on principle; the whole verdict comes from the road.

The 200-and-lies gate covers the case where the final signal is false: the agent claims success that did not happen. Today's tool covers the opposite and nastier case: the final signal is completely honest, the task genuinely succeeded, and the trajectory is still in violation. Complementary axes; you want both.

The scorecard reconciler checks a claimed scorecard against evidence artifacts. Same spirit, different object: there the suspect is the claim, here the suspect is the path.

And yesterday's mandate freshness gate shares only the payment vocabulary of the arXiv anchor below: it replays the lifecycle of an authority (revoked, expired, lowered) at execution time. Today's gate does not care who authorized anything; it cares which steps existed and in what order.

Prior art, honestly: replaying recorded processes against a declared model is a mature discipline, process mining people have called it conformance checking for two decades, and the ASR metric in the paper below is a member of the same family. What I do here is the declarative flavor of it, existence and precedence constraints, not token replay. I am not inventing conformance checking. My contribution is the small, portable shape: one stdlib file for agent trajectories, handoff contracts as first-class checks, and a killer demo where one deleted line flips the exit code.

What this is NOT

I would rather draw the borders myself than have you find them in production.

It is not a runtime supervisor. It replays a recorded trajectory, offline: a CI gate, a post-hoc audit, a pre-promotion check. It does not intercept live execution, and by the time it blocks, the irreversible action in the recording already happened. Its job is to stop that pipeline version from being promoted, rerun, or trusted, not to un-send the report.

It does not judge whether the result was correct. A pipeline can walk every mandatory step and still produce a wrong answer. This gate checks structural fidelity of the path and nothing else. It complements outcome evals; it does not replace them.

The log can lie. The gate trusts the recorded trajectory. An agent that fabricates its own trace, logging a checkpoint that never ran, will pass. I wrote about that failure separately in the 200-and-lies piece; trace provenance is its own problem and this tool does not solve it.

Propagation matching is a heuristic. The gate links hops by field name plus the exact same value, along a connected chain of handoffs. A value that mutates at the border, which is the nastiest version of the Alberta failure, breaks the match and sails through, and so does a renamed field. The require_verified contract is the guard for that class; the propagation tracker only catches the number that travels intact.

It does not fix rubber-stamping. A developer publishing as goutham_nishkaldeepueda made this point sharply in a post arguing human-in-the-loop is not a governance strategy: the human has seen forty approval modals today and stops reading them; presence is not oversight. He is right, and this gate does not check whether the person who confirmed actually thought. It proves a checkpoint event with the right ID was logged before the act; it does not check who logged it, and the spec has no field to demand a particular confirmer. Necessary, not sufficient.

The spec is yours. The gate does not discover which steps should be mandatory; it enforces what you declared. If your spec is empty it refuses to run (exit 2), and if a handoff has no contract it warns instead of blocking, because I cannot tell an incomplete spec from an intentional one.

The paranoid default cuts both ways. Unmarked payload fields count as unverified, which will flag pipelines that verify things but never say so in the log. I said above I am not fully sure this is the right default. It is the one I can defend: silence about verification should not read as verification.

Why this, why now

Two anchors put this on my desk in the same week, and they point at the same seam from opposite directions.

The first is the Alberta write-up by itskondrat I quoted above: a fleet of agents whose individual failures were loud and whose handoff failures were silent, with his closing question, "who owned that handoff?", left hanging. His post is a diagnosis. It does not ship a tool you can point at your own trajectory tomorrow morning, which is the gap this gate is built for.

The second is a paper: arXiv 2605.06457, "Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems" by Huang, Chua, and Wang. From the abstract: their fidelity metric "reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1," and "GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1." Ten of eighteen models, on a benchmark of 90,000 tasks. Their numbers, not mine, and none of them appear in my fixtures. What the paper names academically is exactly the killer fixture above: a checkpoint skip that final-success metrics cannot see. Their ASR metric needs their benchmark harness; the point of my gate is that the same class of check runs on your recorded trajectory with one stdlib file and no harness at all.

There is also fresh academic work on localizing which agent broke a multi-agent system (arXiv 2607.07989); I have only read the title, so I will only tell you it exists.

I do not know how representative my three-agent fixtures are of your fleet. Nobody's fixtures are. That is why the falsifiable claim up top is about the mechanism, one line flipping the verdict, and not about your incident rate. Run it on a real trace and tell me what it catches; that number will be worth more than my synthetic eight.

Update, 14 July 2026: a reader broke this gate, and he was right

Kartik N V J K asked the obvious question I had not asked myself, in a comment on the DEV version of this post: does the gate assert ordering between checkpoints, or only that each mandatory one fired? Handoff bugs, he noted, love to hide in the order.

I went and ran it instead of answering from memory. Two mandatory checkpoints, each ahead of the act it gates, mutual order inverted: PASS, exit 0. He was right, and it got worse from there.

Two things were wrong, and both are now fixed in the code above.

The passing line lied. On PASS the gate printed "all mandatory checkpoints present and in order". "In order" meant each checkpoint relative to the act it gates, never checkpoints relative to each other. The string promised a property the code never checked, which is the exact failure this whole series is about, so it is now gone. PASS now states what was actually verified, and prints a "scope:" line saying ordering between checkpoints is enforced only where the spec declares it.

One confirmation covered every repeat. The comparison was min(cp_seqs) against the first action, so a single safety_review licensed two irreversible sends. I had declared that border in this post, which is not the same as it being safe: a gate whose safe behaviour depends on the reader remembering a footnote is not a gate. Confirmations are now consumed, one per gated act.
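A minimal sketch of the consumption rule, not the gate's actual code: the event shape (`seq`, `kind`, `id`), the action name `send_report`, and the function name `unconfirmed_acts` are assumptions made for illustration. Each logged confirmation licenses exactly one later act, so a second send with no second review is reported.

```python
def unconfirmed_acts(events, checkpoint_id, act_id):
    """Return the seq numbers of gated acts that had no unused confirmation.

    events: list of dicts with hypothetical keys "seq", "kind", "id".
    """
    available = 0       # confirmations logged but not yet spent
    violations = []
    for ev in sorted(events, key=lambda e: e["seq"]):
        if ev["kind"] == "checkpoint" and ev["id"] == checkpoint_id:
            available += 1
        elif ev["kind"] == "action" and ev["id"] == act_id:
            if available:
                available -= 1          # consume one confirmation for this act
            else:
                violations.append(ev["seq"])
    return violations


one_review_two_sends = [
    {"seq": 1, "kind": "checkpoint", "id": "safety_review"},
    {"seq": 2, "kind": "action", "id": "send_report"},
    {"seq": 3, "kind": "action", "id": "send_report"},  # no review left for this one
]
print(unconfirmed_acts(one_review_two_sends, "safety_review", "send_report"))
```

Under a min(cp_seqs)-style comparison both sends would pass, because the earliest review precedes the first action; with consumption, the second send is flagged.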

The seven original fixtures return their old verdicts unchanged; the two trajectories that used to sail through now block. Ordering between checkpoints is enforced where the spec declares an edge, and nowhere else, which leaves the wall where it has always been in these gates: the author writes the spec, and an edge nobody declared is an ordering nobody can enforce.
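The declared-edge rule can be sketched like this. It is an illustration under assumptions: the spec is reduced to a list of (earlier, later) pairs, `legal_review` is a made-up checkpoint name, and `order_violations` is a hypothetical helper, not a function from the tool.

```python
def order_violations(first_seen, declared_edges):
    """Check checkpoint-to-checkpoint order only where the spec declares an edge.

    first_seen: {checkpoint_id: seq of its first occurrence in the trajectory}
    declared_edges: list of (earlier_id, later_id) pairs taken from the spec
    """
    violations = []
    for earlier, later in declared_edges:
        if earlier in first_seen and later in first_seen:
            if first_seen[earlier] > first_seen[later]:
                violations.append((earlier, later))
    return violations


# Both checkpoints fired, but in the inverted order.
inverted = {"legal_review": 5, "safety_review": 2}

print(order_violations(inverted, [("legal_review", "safety_review")]))
print(order_violations(inverted, []))  # no declared edge, nothing to enforce
```

The second call is the wall described above: with no edge in the spec, the inverted trajectory produces no violation.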

I publish one of these small offline gates most weeks, each a runnable tool with the raw output and the exit codes, no keys and no network. Follow along if you want the next one. And here is the question I genuinely do not have a good answer to, so I am putting it to you: who should own the spec? Mandatory checkpoints and handoff contracts have to be declared by someone, and in a pipeline that changes topology every sprint, the spec goes stale about as fast as the prompts do. Do you version it with the agents, own it in CI like a lockfile, or generate a draft spec from known-good trajectories and hand-edit it? I read every comment.

Mandate Freshness Gate: Valid Signature, Revoked Authority

Alexey Spinov — Sat, 11 Jul 2026 04:30:58 +0000

Mandate freshness gate: a valid payment signature proves authorization at issue time, not that the authority is still live when the agent spends. mandate_freshness_gate.py replays a recorded mandate against a recorded execution, offline, and blocks when authority was revoked, expired, over a lowered limit, or out of scope at execution. The signature stays valid throughout.

AI disclosure: I wrote mandate_freshness_gate.py with an AI assistant and ran it myself, offline, on Python 3.13.5, standard library only, no network. Every number in the output blocks below is pasted from a real local run. I checked the exit codes (0 / 1 / 2), ran each scenario twice to confirm STDOUT is byte-for-byte identical, and had the tool print a sha256 of its own report so you can reproduce the exact bytes. The two developers I cite and the paper one of them references are other people's work, attributed inline, and I keep their numbers out of my fixtures.

In short:

A valid signature is a fact about issue time. Whether the mandate is still live is a fact about execution time. A conformance test that replays the mandate and verifies the signature, but never checks revocation, expiry, or the current limit at execution, will wave through a payment the user already cancelled.
The gate replays a recorded mandate plus a recorded execution as data and reconciles authority liveness at the execution timestamp. When the mandate was revoked before execution, it blocks with revoked-before-execution.
The demo that matters: two mandate files identical in every byte except one field. Change revoked_at from null to a timestamp three seconds before execution, and the verdict flips from LIVE exit 0 to BLOCK exit 1.
It is revocation-dominant and fail-closed. It does not verify crypto, it does not phone a revocation server, and unusable input exits 2 rather than passing silently.
This is a post-hoc replay of what you recorded. It is the last check before an agent spends on stale authority, not a live network interceptor.

How does a valid signature authorize a revoked mandate?

The signature and the authority live on different clocks. When the mandate is signed, the authority says yes: this agent may pay this recipient, up to this limit, until this expiry. That yes gets frozen into a signature. From that instant the signature is a photograph. It records that permission existed at issue time, and it keeps recording it forever, because a signature cannot un-sign itself.

Authority does not sit still like that. A user revokes consent. A budget owner lowers the daily limit. The mandate reaches its expiry. A recipient gets pulled from the allowlist. None of those actions travel backward into the signature. The photograph still shows a smiling yes, taken at 09:00, while the authority behind it walked out at 14:29.

Now the agent executes at 14:30. If your check verifies the signature and stops there, it verifies the photograph. The photograph is genuine. So the check passes, and the payment leaves, three seconds after the user pulled the plug. The failure is not a forged signature or a broken cipher. Everything cryptographic held. What was missing is the second question: is the authority in the photograph still standing at execution time? That question has no signature to lean on. Someone has to ask it against the state of the world at the moment of the spend, and that someone is this gate.

Tracking is not control

A valid signature tracks that the mandate was authorized. It does not control whether that authorization is still in force when the money moves. This is the whole thesis of the pre-execution and reconciliation gates I keep building: tracking a fact is not the same as controlling the thing the fact describes. You tracked consent. Consent was granted. Consent was then withdrawn, and nobody re-read the withdrawal at execution.

Here is the claim, stated so you can break it. Take two mandate files that are byte-identical except for the revoked_at field. In the first it is null. In the second it holds a timestamp three seconds before the execution timestamp, and everything else, signature, amount, recipient, limit, scope, currency, stays the same and stays inside every bound. The gate must exit 0 on the first and exit non-zero with reason revoked-before-execution on the second. One field has to flip the verdict. If you can show me a revoked mandate that this gate lets pass, or a genuinely live mandate it blocks, the tool is broken and this post goes down with it.
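The claim can be written down as a check you can run against any implementation. The sketch below covers the revocation rule alone, with two in-memory mandates standing in for the two files; `revocation_exit` and the reduced mandate dicts are assumptions for illustration, not the tool's API.

```python
import copy
from datetime import datetime


def parse_ts(raw):
    # Normalize a trailing "Z" so older fromisoformat versions accept it too.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def revocation_exit(mandate, execution):
    """Exit code for the revocation rule only: 1 if revoked at or before execution."""
    exec_at = parse_ts(execution["exec_at"])
    revoked_raw = mandate.get("revoked_at")
    if revoked_raw is not None and parse_ts(revoked_raw) <= exec_at:
        return 1
    return 0


live = {"id": "mnd-7f3a91", "revoked_at": None}
revoked = copy.deepcopy(live)
revoked["revoked_at"] = "2026-07-11T14:29:57Z"   # the one edited field
execution = {"exec_at": "2026-07-11T14:30:00Z"}

changed_fields = [k for k in live if live[k] != revoked[k]]
print(changed_fields, revocation_exit(live, execution), revocation_exit(revoked, execution))
```

If the list of changed fields has one entry and the two exit codes differ, the one-field flip holds for that pair.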

I lean on revocation for the headline case on purpose. Borrowing a phrase from a developer publishing as mspro3210, the gate is revocation-dominant: a revoked mandate blocks regardless of how fresh and well-formed the payment looks. More on his framing at the end. The point for now is that revocation wins ties. A signature can be immaculate and the recipient can be perfect and the amount can be a penny, and if the authority was withdrawn before execution, the answer is no.

What does the mandate freshness gate reconcile?

Two things, per file. A recorded mandate: when it was issued, when it expires, whether and when it was revoked, its scope of allowed recipients and currency, its limit, and any recorded changes to that limit over time. And a recorded execution: the timestamp, the amount, the recipient, the currency. The gate lines them up on one axis, time, and asks whether the authority was alive at the execution timestamp and whether the spend stayed inside the bounds that were in force at that instant.

The limit needs a word, because a limit is not a constant. A mandate can carry a limit_adjustments timeline: the daily cap was 500 at issue, lowered to 100 at noon. The gate computes the effective limit at execution, which is the latest adjustment dated at or before the execution timestamp. So a 120 USD payment that was fine against a 500 limit at issue is over the 100 limit that was actually in force when it ran. Same amount, same signature, different answer, because the authority moved and the amount did not.
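A minimal sketch of that lookup, assuming the adjustments arrive as (timestamp, new_limit) pairs. The name `limit_in_force` and its signature are illustrative and differ from the tool's own effective_limit(mandate, exec_at).

```python
from datetime import datetime, timezone
from decimal import Decimal


def limit_in_force(base_limit, adjustments, exec_at):
    """Return (limit, changed_at) as of exec_at.

    adjustments: iterable of (timestamp, new_limit) pairs, in any order.
    The latest adjustment dated at or before exec_at wins; later ones are ignored.
    """
    limit, changed_at = base_limit, None
    for at, new_limit in sorted(adjustments, key=lambda pair: pair[0]):
        if at <= exec_at:
            limit, changed_at = new_limit, at
    return limit, changed_at


noon = datetime(2026, 7, 11, 12, 0, tzinfo=timezone.utc)
exec_at = datetime(2026, 7, 11, 14, 30, tzinfo=timezone.utc)

limit, changed_at = limit_in_force(Decimal("500"), [(noon, Decimal("100"))], exec_at)
print(limit, Decimal("120") > limit)  # 100 True
```

Returning the adjustment timestamp alongside the limit is what lets a report say when the cap was lowered, not just that the amount is over it.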

One design choice worth calling out: the gate collects every reason that fired, it does not bail on the first. A mandate that is both expired and out of scope reports both. The headline reason drives the exit code, the full list drives the forensics.
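One way to pick a headline from a full list of reasons is a fixed priority order. The sketch below is an assumption, not the tool's logic: the post only states that revocation dominates, so the relative order of the other codes in `PRIORITY` is a guess made for illustration.

```python
# Assumed priority order. Only "revocation first" is stated in the post.
PRIORITY = (
    "revoked-before-execution",
    "expired-before-execution",
    "limit-exceeded-at-execution",
    "recipient-out-of-scope",
    "currency-mismatch",
    "execution-precedes-issue",
)


def verdict(reasons):
    """reasons: reason-code strings from every check that fired, in any order.

    Returns (verdict, headline_reason, exit_code, all_reasons).
    """
    if not reasons:
        return "LIVE", "mandate-live", 0, []
    headline = min(reasons, key=PRIORITY.index)  # lowest index = highest priority
    return "BLOCK", headline, 1, list(reasons)


print(verdict(["recipient-out-of-scope", "expired-before-execution"]))
```

The headline decides what the verdict line says, while the returned list keeps every reason available for the forensics.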

Here are the states it can reach:

Verdict	Condition at execution time	reason-code	exit
LIVE	signature asserted, `issued <= exec < expiry`, not revoked, `amount <= effective_limit`, recipient in scope, currency matches	`mandate-live`	0
BLOCK	`revoked_at <= exec`	`revoked-before-execution`	1
BLOCK	`expires_at <= exec`	`expired-before-execution`	1
BLOCK	`amount > effective_limit(exec)`, including a limit lowered after issue	`limit-exceeded-at-execution`	1
BLOCK	recipient not in scope, or currency mismatch	`recipient-out-of-scope` / `currency-mismatch`	1
BLOCK	`exec < issued` (replay or clock anomaly)	`execution-precedes-issue`	1
ERROR	missing field, unparseable date, malformed input	`bad-input`	2

The revocation check is a few lines, simplified here for the post (the real evaluate() builds a display record and carries the currency and ordering guards shown in the table):

def evaluate(mandate, execution):  # simplified for the post
    issued  = parse_ts(mandate["issued_at"])
    expires = parse_ts(mandate["expires_at"])
    revoked = parse_ts(mandate["revoked_at"]) if mandate.get("revoked_at") else None
    exec_at = parse_ts(execution["exec_at"])
    limit, lowered_at = effective_limit(mandate, exec_at)  # honors post-issue limit_adjustments
    # The three lookups below use assumed field names; the simplified listing omits them.
    amount    = execution["amount"]
    recipient = execution["recipient"]
    scope     = mandate["scope"]  # assumed: collection of allowed recipients

    reasons = []
    # signature_valid is TRUSTED as asserted input; the gate never verifies crypto.
    # Signature validity and authority liveness are two different facts.
    if revoked is not None and revoked <= exec_at:
        gap = int((exec_at - revoked).total_seconds())
        reasons.append(("revoked-before-execution",
            f"signature valid, authority not: revoked {fmt(revoked)}, "
            f"{gap}s before execution {fmt(exec_at)}"))
    if expires <= exec_at:
        reasons.append(("expired-before-execution", ...))
    if amount > limit:
        reasons.append(("limit-exceeded-at-execution", ...))
    if recipient not in scope:
        reasons.append(("recipient-out-of-scope", ...))
    return ("BLOCK", reasons) if reasons else ("LIVE", [])

Notice what is absent. There is no call to verify_signature. The gate trusts the signature_valid boolean as an asserted input and moves on, because recomputing a signature is a separate, orthogonal job. The gate exists to make the point that a true signature_valid and a live authority are not the same claim.

How is this different from checking the transaction itself?

Fair question, because I have built neighbors that look adjacent. Two of them deserve the borders drawn explicitly.

The first is the canary that sanity-checks an on-chain transaction before it is sent. That tool asks whether this transaction is normal: does the target address exist, is the price sane against a reference, will the calldata revert. It reaches the network to answer. This gate asks a question the canary cannot: is the mandate behind the transaction still alive at all? A transaction can be flawless in every on-chain sense, a real address, a fair price, a clean simulation, and still ride a consent the user cancelled three seconds ago. The canary would wave it through. This gate catches exactly that, and it does it offline, without touching a chain.

The second is the trace that compares an agent's allowed tools against what it actually did. That one works on a static scope: here is the allowlist, here is the telemetry, flag the mismatch. This gate is about time, not scope. In the revocation case the recipient stays inside scope the entire time. The action was permitted at issue and the signature proves it. What changed is the calendar: the authority behind the permission was withdrawn before the clock reached execution. Scope did not move. The mandate's lifecycle did. That axis, freshness of authority at execution, is what neither of those tools measures, and it is the one this gate is built around.

None of this is a new idea about authorization lifecycles. OAuth has token revocation, payment standards have mandate expiry and strong-customer-authentication windows, and lifecycle checks are old. What is worth a small tool is the specific shape: a portable, offline gate for agent payments that reconciles a recorded mandate against a recorded execution and blocks on stale authority, with one field isolated so the failure is reproducible. I am not reinventing revocation. I am making it fail a test.

Quick start

Feed it a JSON manifest with a mandate object and an execution object. Then run it:

python3 mandate_freshness_gate.py fixtures/fixture_live.json

No install, no key, no network. It reads the file, prints a verdict per mandate, and returns an exit code you can wire into CI over the mandate and execution records you already keep. In production you would feed it a live snapshot of the revocation status. Here it reconciles the timeline you recorded, so the honesty of the input is your job; the gate reconciles what you wrote down, it does not go fetch ground truth.

The one field that flips the verdict

Here are two files. fixture_live.json is a signed mandate and an execution that sits inside every bound. fixture_revoked.json is the same file with one change: the user revoked consent three seconds before execution. The signature, amount, recipient, limit, and scope are byte-identical between them. The only difference is one line:

$ diff fixtures/fixture_live.json fixtures/fixture_revoked.json
6c6
<     "revoked_at": null,
---
>     "revoked_at": "2026-07-11T14:29:57Z",

Run the live one:

mandate-freshness-gate: authority live at execution?
files: 1

[1] fixtures/fixture_live.json
    mandate   : mnd-7f3a91 issued 2026-07-11T09:00:00Z, expires 2026-07-18T09:00:00Z, revoked never
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:acme-cloud-eu
    authority : effective_limit 500.00 USD at exec; scope ok; signature asserted
    verdict   : LIVE (mandate-live) exit 0
    detail    : authority live at execution: issued <= exec < expiry, not revoked, 120.00 <= 500.00 limit

summary: 1 LIVE, 0 BLOCK, 0 ERROR  ->  overall exit 0
report-sha256: 615b7a753f9b5c1b0071578445e0d543f89d0136c35badcb7c14d55ff29b89a4

Now run the revoked one:

mandate-freshness-gate: authority live at execution?
files: 1

[1] fixtures/fixture_revoked.json
    mandate   : mnd-7f3a91 issued 2026-07-11T09:00:00Z, expires 2026-07-18T09:00:00Z, revoked 2026-07-11T14:29:57Z
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:acme-cloud-eu
    authority : effective_limit 500.00 USD at exec; scope ok; signature asserted
    verdict   : BLOCK (revoked-before-execution) exit 1
    reason    : revoked-before-execution: signature valid, authority not: revoked 2026-07-11T14:29:57Z, 3s before execution 2026-07-11T14:30:00Z

summary: 0 LIVE, 1 BLOCK, 0 ERROR  ->  overall exit 1
report-sha256: b0923070ee0cbcf8863d42a3d4221c8790b37e000ccf0c16d401c06b977890b1

Same signature, same 120.00 USD, same recipient, same 500.00 limit. In one file the authority is live and the payment ships. In the other the authority was withdrawn at 14:29:57, three seconds before the 14:30:00 execution, and the same payment is blocked. The exit code goes from 0 to 1 on the strength of one edited field. The gate spells the gap out in its own words: signature valid, authority not.

In production nobody edits that field by hand. The user taps cancel in an app, a budget owner flips a switch, a fraud rule trips, and the revocation lands at 14:29:57 while the agent's payment is already in flight for 14:30:00. Your signature check sees a valid signature and says go. This gate sees a revoked authority and says stop. The two sha256 lines are the tool hashing its own report body, so you can confirm you got the same bytes I did.

Six mandates, one sweep

Revocation is the headline, but it is one of several ways an authority goes stale between issue and execution. Hand the gate a batch and it reconciles each file. This run (mandate_freshness_gate.py fixtures/*.json, so the files arrive in alphabetical order) covers the full set:

mandate-freshness-gate: authority live at execution?
files: 6

[1] fixtures/fixture_bad_input.json
    verdict   : ERROR (bad-input) exit 2
    detail    : Invalid isoformat string: 'not-a-real-timestamp'

[2] fixtures/fixture_expired.json
    mandate   : mnd-2c4b08 issued 2026-07-01T09:00:00Z, expires 2026-07-10T09:00:00Z, revoked never
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:acme-cloud-eu
    authority : effective_limit 500.00 USD at exec; scope ok; signature asserted
    verdict   : BLOCK (expired-before-execution) exit 1
    reason    : expired-before-execution: mandate expired 2026-07-10T09:00:00Z at or before execution 2026-07-11T14:30:00Z

[3] fixtures/fixture_live.json
    mandate   : mnd-7f3a91 issued 2026-07-11T09:00:00Z, expires 2026-07-18T09:00:00Z, revoked never
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:acme-cloud-eu
    authority : effective_limit 500.00 USD at exec; scope ok; signature asserted
    verdict   : LIVE (mandate-live) exit 0
    detail    : authority live at execution: issued <= exec < expiry, not revoked, 120.00 <= 500.00 limit

[4] fixtures/fixture_lowered_limit.json
    mandate   : mnd-9d51e7 issued 2026-07-11T09:00:00Z, expires 2026-07-18T09:00:00Z, revoked never
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:acme-cloud-eu
    authority : effective_limit 100.00 USD at exec; scope ok; signature asserted
    verdict   : BLOCK (limit-exceeded-at-execution) exit 1
    reason    : limit-exceeded-at-execution: amount 120.00 > effective limit 100.00 at execution (limit lowered to 100.00 at 2026-07-11T12:00:00Z after issue)

[5] fixtures/fixture_out_of_scope.json
    mandate   : mnd-4a80f2 issued 2026-07-11T09:00:00Z, expires 2026-07-18T09:00:00Z, revoked never
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:unknown-wallet-9931
    authority : effective_limit 500.00 USD at exec; scope MISS; signature asserted
    verdict   : BLOCK (recipient-out-of-scope) exit 1
    reason    : recipient-out-of-scope: recipient 'merchant:unknown-wallet-9931' not in mandate scope ['merchant:acme-cloud-eu']

[6] fixtures/fixture_revoked.json
    mandate   : mnd-7f3a91 issued 2026-07-11T09:00:00Z, expires 2026-07-18T09:00:00Z, revoked 2026-07-11T14:29:57Z
    execution : 2026-07-11T14:30:00Z, 120.00 USD -> merchant:acme-cloud-eu
    authority : effective_limit 500.00 USD at exec; scope ok; signature asserted
    verdict   : BLOCK (revoked-before-execution) exit 1
    reason    : revoked-before-execution: signature valid, authority not: revoked 2026-07-11T14:29:57Z, 3s before execution 2026-07-11T14:30:00Z

summary: 1 LIVE, 4 BLOCK, 1 ERROR  ->  overall exit 1
report-sha256: 56cebaac439bbc0dbeaf027a6cf9ddf97800a599b3dccdfef715c2bbbb19157c

Read the four blocks. The expired mandate ran a day after its 2026-07-10 expiry. The lowered-limit mandate is the quiet one: at issue its cap was 500 and the 120.00 payment was fine; the cap was lowered to 100.00 at noon, so at 14:30 the same payment is over the limit that was actually in force. The out-of-scope mandate points at a wallet that was never on the allowlist. And the revoked mandate is the killer from the previous section, sitting in the same batch. The summary is 1 LIVE, 4 BLOCK, 1 ERROR, and the process exits 1, because one confirmed BLOCK outranks the bad-input file. That ordering is deliberate: a real block should not be masked by a parse error somewhere else in the batch.
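The batch precedence can be sketched in a few lines. `overall_exit` is a hypothetical helper; that ERROR outranks LIVE and that an empty batch exits 2 are assumptions consistent with the fail-closed behaviour described in this post, not code copied from the tool.

```python
def overall_exit(per_file_exits):
    """Fold per-file exit codes (0 LIVE, 1 BLOCK, 2 ERROR) into one process exit code."""
    if not per_file_exits:
        return 2            # nothing usable to check: fail closed
    if 1 in per_file_exits:
        return 1            # a confirmed BLOCK is never masked by a parse error
    if 2 in per_file_exits:
        return 2            # unusable input never leaves as a green pass
    return 0


# The six-file sweep above: ERROR, BLOCK, LIVE, BLOCK, BLOCK, BLOCK
print(overall_exit([2, 1, 0, 1, 1, 1]))  # 1
```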

The report ends with a sha256 of its own body, and because that body depends on the order you pass the files, fixtures/*.json reproduces exactly the digest above.
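A sketch of a self-hashing report, assuming the digest covers the UTF-8 bytes of the body and nothing else. Exactly which bytes the real tool hashes is not shown in the post, so treat `render_report` as an illustration of the idea only.

```python
import hashlib


def render_report(body_lines):
    """Join the report body and append a sha256 of that body as the last line."""
    body = "\n".join(body_lines) + "\n"
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body + f"report-sha256: {digest}\n"


alphabetical = render_report(["[1] fixture_live.json", "[2] fixture_revoked.json"])
reversed_order = render_report(["[1] fixture_revoked.json", "[2] fixture_live.json"])

# Same files, different order: different body, different digest.
print(alphabetical.splitlines()[-1] != reversed_order.splitlines()[-1])
```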

That bad-input file is not an accident either. The gate fails closed. No arguments prints a usage line and exits 2. A missing file reports file not found and exits 2. A mandate with a timestamp like 'not-a-real-timestamp' exits 2 with the parser's own message rather than guessing. An input the gate cannot trust never leaves as a green pass, so a malformed mandate cannot slip a payment through on the back of an exception the gate swallowed.
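Failing closed on input can be sketched as a loader that turns every parse problem into an error detail instead of letting an exception escape or be swallowed. `load_mandate` and the reduced set of required fields are assumptions for illustration.

```python
import json
from datetime import datetime


def load_mandate(raw):
    """Return (mandate, None) on success or (None, detail) on unusable input.

    A non-None detail is the caller's cue to report bad-input and exit 2.
    """
    try:
        mandate = json.loads(raw)["mandate"]
        for key in ("issued_at", "expires_at"):
            # Raises ValueError on anything that is not an ISO 8601 timestamp.
            datetime.fromisoformat(mandate[key].replace("Z", "+00:00"))
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        return None, f"bad-input: {exc}"
    return mandate, None


bad = '{"mandate": {"issued_at": "not-a-real-timestamp", "expires_at": "2026-07-18T09:00:00Z"}}'
print(load_mandate(bad))
```

The except clause names the exception types it expects rather than catching everything, so a genuine bug in the gate still surfaces as a crash instead of a quiet exit 2.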

What this is NOT

I would rather draw the borders myself than have you find them the hard way.

It does not verify the cryptographic signature. The signature_valid field is an asserted input the gate trusts. It is an honest stub. Recomputing a signature is a real job, it is just a different one, and folding it in would blur the single point of this tool: signature validity and authority liveness are separate facts. If your pipeline needs the crypto verified, verify it upstream and hand this gate the result.

It does not intercept live traffic. This is an offline replay of records you already have: a recorded mandate and a recorded execution. It does not call a chain, a payment gateway, or a revocation server. In production the honest design is to feed it a live snapshot of revocation status at execution. An offline replay proves the logic and fails a test in CI. It is not a runtime firewall on the wire.

Freshness here means revocation, expiry, limit, and scope, checked against your recorded data. It is not the authority's full policy engine. If your authority can deny a payment for reasons outside those fields, velocity rules, risk scores, jurisdiction, this gate does not model them. It checks the lifecycle facts you recorded, at the execution timestamp, and nothing more.

It is not an AP2 or ERC-8183 conformance suite. It checks one property, freshness of authority at execution, not an entire standard. Treat it as one assertion you can drop into a larger suite, not the suite.

It does not stop prompt injection in the judgment layer. By the time a mandate reaches this gate, the intent has already been formed, possibly by an agent that was manipulated into forming it. The gate is a deterministic check on the lifecycle of the authority behind that intent. It is the last line on stale authority, sitting after the layer where judgment happens, not a defense of that layer. If the agent was talked into signing a bad mandate in the first place, a live and in-scope mandate will pass here exactly as designed.

Garbage in, garbage out. The gate reconciles the timeline you recorded. Feed it a wrong revoked_at or a stale limit and it will trust your record, because it has no ground truth to consult. It reconciles what you wrote down. It does not go find out whether what you wrote down is true.

The recorded mandate is untrusted data. A fixture can carry any string, including a field crafted to look like an instruction to a tool reading the log. The gate reads every value as data and executes none of it. An injected instruction sitting in a recipient name or a note is counted as a field value and nothing else. Treating records as data and never as commands is not optional when the records can come off the open internet.

Why this, why now

Two developers put this problem in front of me the same week, both on Dev.to, both on July 9. I read both posts, checked the tools were live, and built my own fixtures rather than reuse anyone's numbers.

The first, publishing as mspro3210, wrote a piece called "The receipt cannot be written by the pen it is checking: separation of duties for agent payments". His formulation is the one I borrowed above. He argues the real binding is not "the mandate authorizes execution" but "the mandate authorizes execution, and the mandate is still live at execution time," and he calls the property revocation-dominant: a revoked mandate blocks regardless of how fresh the intent looks. That is his framing and his phrase. My contribution is to turn it into a runnable gate with the revocation case isolated to a single field.

The second, publishing as barissozen, was writing about escrow and judgment layers for agent trades, and he pointed at a paper I have not read myself: arXiv 2601.22569, "Whispers of Wealth." Per barissozen, and I am relaying his summary rather than an independent read, the paper describes an agent operating under Google's Agent Payments Protocol (AP2) that got subverted, and the part that failed was not the cryptography. The signatures held. What folded was the layer that exercises judgment. He contrasts that with schemes like ERC-8183, escrow with an evaluator, where a judge sits between intent and settlement. I have not verified the paper's claims or numbers, so none of them are in my fixtures, and you should read barissozen and the paper before quoting either.

Both of those point at the same seam from opposite sides. Cryptography can be flawless and the money can still move wrongly, because the fact a signature carries (authorized at issue) is not the fact you need at execution (authority still live). Agent payments are shipping into the real world right now, which is why the seam matters this week and not in the abstract. This gate does not fix the judgment layer either developer worries about. It handles the smaller, harder-edged case downstream of it: once the intent exists, is the authority behind it still standing when the agent spends? That question deserves a check that fails a test, and now it has one.

I publish one of these small offline gates most weeks, each a runnable tool with the raw output and the exit codes, no keys and no network. Follow along if you want the next one. And here is the question I genuinely cannot answer cleanly, so I am putting it to you: in production, execution happens at 14:30:00 and the revocation landed at 14:29:57, but your revocation snapshot has propagation lag of its own. How do you get a live view of authority status that is fresh enough to catch a cancellation that arrived three seconds ago, without blocking every legitimate payment behind a slow consistency check? I read every comment.

Delivered but Unbilled: Your AI Stream Logged Zero Tokens

Alexey Spinov — Fri, 10 Jul 2026 04:24:37 +0000

Delivered but unbilled is the streaming failure where your AI response renders the whole answer to the user, but the terminal usage frame is dropped, zeroed, or malformed, so your client logs 0 output tokens for a call that cost money. stream_billing_gate.py reconciles the delivered text against the logged usage, offline, and blocks when text shipped and nothing was billed.

AI disclosure: I wrote stream_billing_gate.py with an AI assistant and ran it myself, offline, on Python 3.13.5, standard library only, no network. Every number in the output blocks below is pasted from a real local run. I checked the exit codes (0 / 1 / 2), ran each scenario twice to confirm STDOUT is byte-for-byte identical, and had the tool print a sha256 of its own report so you can reproduce the exact bytes. The one external finding I cite (a Dev.to proof of concept) and the one external number (a Hacker News thread's score) are other people's, attributed inline, and kept out of my fixture numbers.

In short:

A streamed response delivers text first and accounting last. If the terminal usage frame never lands, the text is already on the screen and your client logs 0 output tokens. The user got the whole answer. Your bill says it was free.
The gate reads a recorded stream as data, rebuilds the delivered text, and reconciles it against the logged usage. When text was delivered and usage is 0 or absent, it blocks with delivered-but-unbilled.
The demo that matters: two stream logs with byte-identical text deltas. Delete one line, the usage frame, and the verdict flips from OK exit 0 to BLIND exit 1.
Token counting here is a conservative FLOOR (word count), not the vendor's bill. The hard claim is smaller and unbreakable: text was delivered, so real tokens are above zero, so a logged 0 is wrong.
This is a post-hoc reconciliation of recorded streams. It does not intercept live traffic, call an API, or need a key.

How does a streamed answer get billed as zero?

A streaming response arrives in two parts that travel separately. First the content deltas: dozens of small events, each carrying a slice of text, which your client concatenates and paints on the screen. Then, at the very end, one terminal frame carrying the token accounting: input tokens, output tokens, the numbers your billing code reads to know what the call cost.

The order is the problem. By the time the accounting is supposed to arrive, the answer is already delivered. If that last frame is dropped by a flaky connection, truncated, or written with a zero, the user never notices. They got their answer. Your client, meanwhile, has nothing to log, so it keeps whatever value it initialized. For a lot of clients that value is zero. No exception. No crash. A response that cost real output tokens is recorded as free. That zero is a choice your client made, not a law of streaming, and how the accounting travels varies by provider, which the formats section below gets into.
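A minimal sketch of how that zero gets written, assuming the normalized event shape used later in this post (`content.delta` and `usage`). The function names are hypothetical and this is not the code of any particular client. It only shows that the initial value decides what a missing frame looks like in the log.

```python
def consume_naive(events):
    """Hypothetical client: the counter starts at 0, so a missing frame logs 0."""
    text, output_tokens = [], 0
    for ev in events:
        if ev.get("type") == "content.delta":
            text.append(ev.get("text", ""))
        elif ev.get("type") == "usage":
            output_tokens = ev.get("output_tokens", 0)
    return "".join(text), output_tokens


def consume_explicit(events):
    """Same loop, but 'never reported' stays None instead of becoming 0."""
    text, output_tokens = [], None
    for ev in events:
        if ev.get("type") == "content.delta":
            text.append(ev.get("text", ""))
        elif ev.get("type") == "usage":
            output_tokens = ev.get("output_tokens")
    return "".join(text), output_tokens


# A stream whose terminal usage frame never arrived.
dropped = [
    {"type": "content.delta", "text": "Hello "},
    {"type": "content.delta", "text": "world"},
]
print(consume_naive(dropped))     # ('Hello world', 0)
print(consume_explicit(dropped))  # ('Hello world', None)
```

Both clients delivered the same text. Only the second leaves a record that something is missing rather than a number that looks like a measurement.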

On July 7, a developer publishing as kielltampubolon posted a proof of concept on Dev.to titled "Asynchronous Telemetry Blindness in AI Streaming Clients", showing this exact shape: the text deltas render the full answer while a dropped terminal usage frame leaves billing sitting at zero, with no error surfaced. That framing is his, and his write-up is a local-only proof of concept, not a production incident. I am not reusing his numbers. I built my own fixtures so the failure is reproducible on your logs, and so a test can fail closed on it.

Tracking is not control

Your dashboard says $0 for that call. The dashboard is not lying. It is faithfully reporting a number that is wrong at the source. This is the whole thesis of the pre-execution and reconciliation gates I keep building: tracking a number is not the same as controlling the thing the number describes. You tracked usage. The usage you tracked was zero. The tokens were spent anyway.

Here is the claim, stated so you can break it. Given a recorded stream where text deltas exist and the terminal usage frame is absent or zero, the gate must exit non-zero with reason delivered-but-unbilled. Given a stream with a usage frame whose output count meets the floor derived from the delivered text, it must exit 0. One removed line has to flip the verdict. If you can show me a stream that delivered text and the gate stayed quiet, or one that logged its usage honestly and the gate blocked, the tool is broken and this post with it.

What the gate reconciles

Two quantities per stream. The delivered text, rebuilt by concatenating every content delta. And the logged output tokens, pulled from the terminal usage frame, or None when that frame is missing or unusable.

The token side needs an honest word about method. Without a real tokenizer, and this tool ships with none, standard library only, I cannot tell you the exact token count of the delivered text. So I do not. I compute a FLOOR: the number of whitespace-delimited words. A BPE tokenizer keyed on whitespace almost always emits at least one token per word, and usually more once punctuation and sub-word splits count. So the real output_tokens is nearly always higher than this floor. It is a deliberately low, human-readable magnitude, not a measurement.
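The floor described above can be sketched in two lines. `word_floor` is the name the simplified `classify` in this post calls; the body here is an assumption that follows the description (whitespace word count), not a copy of the tool's source.

```python
def word_floor(text):
    """Conservative floor: the number of whitespace-delimited words."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


print(word_floor("Cap the agent's spend before it runs."))  # 7
print(word_floor("retry,backoff,and,jitter"))               # 1
print(word_floor("  \n "))                                  # 0
```

The second example shows why this is a floor and not an estimate: a string with no spaces counts as one word, however many tokens a real tokenizer would emit for it.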

The gate's real claim does not lean on that floor being precise. It leans on something smaller that cannot be argued with: if the delivered text is non-empty, the real output token count is above zero. Therefore a logged zero is provably wrong, whatever the exact number was. The floor just gives you a readable "at least this many" figure to put in the alert.

That gives four verdicts for any stream the gate can parse:

Verdict	Condition	reason-code	exit
OK	usage frame present, `logged >= floor`	`usage-consistent`	0
UNDERCOUNT	usage frame present, `0 < logged < floor`	`partial-telemetry-loss`	2
BLIND	text delivered, `logged` is 0 / absent / unparseable	`delivered-but-unbilled`	1
EMPTY	no text delivered, nothing to reconcile	`nothing-delivered`	0

BLIND is the hard, logical one and gets the blocking exit code. UNDERCOUNT is softer: it says the logged count is below the floor, which is suspicious but heuristic, so it warns rather than blocks. The decision itself is a few lines, simplified here for the post. The real function returns dicts with detail strings. The sketch also shows one guard beyond the four verdicts, which fails closed on a stream whose shape the gate cannot recognize:

def classify(delivered_text, logged, usage_seen, events, recognized):  # simplified for the post
    floor = word_floor(delivered_text)                                 # whitespace word count
    if events and not recognized:            # valid JSON, but no shape the gate knows -> fail closed
        return "UNRECOGNIZED", "unrecognized-stream", 2
    if not delivered_text.strip():
        return "EMPTY", "nothing-delivered", 0
    if logged is None or logged == 0:
        return "BLIND", "delivered-but-unbilled", 1
    if logged < floor:
        return "UNDERCOUNT", "partial-telemetry-loss", 2
    return "OK", "usage-consistent", 0

Which stream formats does it read?

The gate reads the recorded stream as data and auto-detects the wire shape, so you can point it at the logs you already keep. It maps four documented formats to the same two quantities, delivered text and logged output tokens:

Format	Delivered text	Logged output tokens
OpenAI Chat Completions	`choices[].delta.content`	`usage.completion_tokens` in the final chunk, sent only when you set `stream_options={"include_usage": true}`
Anthropic Messages	`content_block_delta.delta.text`	`message_delta.usage.output_tokens` (cumulative; the last one is the total)
OpenAI Responses	`response.output_text.delta`	`response.completed` then `response.usage.output_tokens`
Generic normalized	`{"type":"content.delta","text":...}`	`{"type":"usage","output_tokens":...}`

How the accounting travels is not the same across providers, and that changes what a zero means. OpenAI Chat sends usage in one terminal chunk, and only if you asked for it: leave include_usage off and there is no usage frame at all, which reads as delivered-but-unbilled by default. Anthropic sends output tokens cumulatively across message_delta frames, so a dropped final frame undercounts rather than zeroes, unless your client commits only the terminal number. A logged 0 is a fact about what your client recorded, not a universal property of streaming. The gate does not care which of these happened. It reconciles the text that reached the user against the number that reached your logs.

The FLOOR is provider-agnostic. It comes from the concatenated delivered text alone, so the same word count governs an OpenAI stream and an Anthropic one. A format the gate does not recognize is reported as unrecognized-stream (exit 2), which fails closed instead of passing as EMPTY, so a broken adapter cannot hide a leak behind a green check.
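A minimal sketch of that mapping, following the field paths in the table above. The function names `extract` and `reduce_stream` are hypothetical, and the detection order is an assumption; the real tool's adapter may differ. Events are assumed to be already parsed from JSON into dicts.

```python
def extract(event):
    """Return (text_delta, output_tokens_or_None, recognized) for one event."""
    if not isinstance(event, dict):
        return "", None, False
    etype = event.get("type")
    # Generic normalized shape.
    if etype == "content.delta":
        return event.get("text") or "", None, True
    if etype == "usage":
        return "", event.get("output_tokens"), True
    # Anthropic Messages.
    if etype == "content_block_delta":
        return (event.get("delta") or {}).get("text") or "", None, True
    if etype == "message_delta":
        return "", (event.get("usage") or {}).get("output_tokens"), True
    # OpenAI Responses.
    if etype == "response.output_text.delta":
        return event.get("delta") or "", None, True
    if etype == "response.completed":
        usage = (event.get("response") or {}).get("usage") or {}
        return "", usage.get("output_tokens"), True
    # OpenAI Chat Completions: chunks carry "choices" and an optional "usage".
    if "choices" in event:
        text = "".join(
            (c.get("delta") or {}).get("content") or ""
            for c in (event["choices"] or [])
        )
        return text, (event.get("usage") or {}).get("completion_tokens"), True
    return "", None, False


def reduce_stream(events):
    """Fold a stream into (delivered_text, logged_tokens, recognized)."""
    parts, logged, recognized = [], None, False
    for ev in events:
        delta, tokens, known = extract(ev)
        recognized = recognized or known
        parts.append(delta)
        if tokens is not None:
            logged = tokens  # last value wins; Anthropic counts are cumulative
    return "".join(parts), logged, recognized
```

Whatever the provider, the output is the same triple, which is what lets one floor and one verdict function cover every format. A stream where no event matches comes back with `recognized` False, the condition the fail-closed guard checks.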

Quick start

Feed it a recorded stream as JSONL, one event per line, in any of the formats above. Then run it:

python3 stream_billing_gate.py fixtures/fixture_blind.jsonl

No install, no key, no network. It reads the file, prints a verdict, and returns an exit code you can wire into CI over your recorded stream logs, whether they are OpenAI Chat, Anthropic, Responses, or the normalized shape. A log in a format it does not recognize returns unrecognized-stream (exit 2) rather than a silent pass, so confirm the fit on a known-good stream before you trust a green.

The one line that flips the verdict

Here are two files. fixture_ok.jsonl is a recorded stream: a normal answer about capping agent spend, followed by a terminal usage frame reporting 190 output tokens. fixture_blind.jsonl is the same stream with that one usage line removed. The text deltas are byte-identical. Run each:

stream-billing-gate: delivered text vs logged usage
files: 1

[1] fixture_ok.jsonl
    delivered : 137 words, 750 chars (floor 137 tokens)
    logged    : output_tokens=190
    verdict   : OK (usage-consistent) exit 0
    detail    : logged 190 >= floor 137

summary: 1 OK, 0 UNDERCOUNT, 0 BLIND, 0 EMPTY  ->  overall exit 0
report-sha256: f6b8ce0b2b7a09f57a589a862b53024a81ca618cfecd34f82fa6231c670486a7

stream-billing-gate: delivered text vs logged usage
files: 1

[1] fixture_blind.jsonl
    delivered : 137 words, 750 chars (floor 137 tokens)
    logged    : none
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 137 tokens of text; no usage frame at all

summary: 0 OK, 0 UNDERCOUNT, 1 BLIND, 0 EMPTY  ->  overall exit 1
report-sha256: 318dff2585a7874bd04cd06a1430a9d3b303d93535910bd46be1d5987f81a4ce

Same 137 words delivered. In one file that costs at least 137 tokens and the log says 190. In the other it costs at least 137 tokens and the log says nothing. The exit code goes from 0 to 1. The only difference between the two files is a single line:

$ diff fixtures/fixture_ok.jsonl fixtures/fixture_blind.jsonl
10d9
< {"type":"usage","input_tokens":523,"output_tokens":190}

That is the failure in one line. In production the line does not get deleted by hand. A connection blips, a proxy truncates, a client swallows the final frame, and you are left with fixture_blind.jsonl and a dashboard that says the call was free. The two sha256 digests are the tool hashing its own report, so you can confirm you got the same bytes I did.
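The one-line flip can be reproduced in memory with a sketch like this. It assumes the normalized JSONL shape and integer counts; the text and token numbers are made up for the example and are not the fixture's, and `reconcile` is a hypothetical stand-in for the tool, not its source.

```python
import json


def reconcile(jsonl_text):
    """Return (verdict, exit_code) for a normalized JSONL stream."""
    parts, logged = [], None
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        ev = json.loads(line)
        if ev.get("type") == "content.delta":
            parts.append(ev.get("text", ""))
        elif ev.get("type") == "usage":
            logged = ev.get("output_tokens")
    delivered = "".join(parts)
    floor = len(delivered.split())  # whitespace word count
    if not delivered.strip():
        return "EMPTY", 0
    if logged is None or logged == 0:
        return "BLIND", 1
    if logged < floor:
        return "UNDERCOUNT", 2
    return "OK", 0


OK_STREAM = "\n".join([
    '{"type":"content.delta","text":"Cap spend "}',
    '{"type":"content.delta","text":"before the run."}',
    '{"type":"usage","input_tokens":9,"output_tokens":7}',
])
# Drop the last line (the usage frame) and nothing else.
BLIND_STREAM = "\n".join(OK_STREAM.splitlines()[:-1])

print(reconcile(OK_STREAM))     # ('OK', 0)
print(reconcile(BLIND_STREAM))  # ('BLIND', 1)
```

The text deltas are untouched between the two inputs, so any difference in the verdict comes from the usage frame alone.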

Four verdicts, one sweep

Hand it a batch of recorded streams and it reconciles each one. This run (stream_billing_gate.py fixtures/*.jsonl, so the files come in alphabetical order) covers all four verdicts plus the two broken shapes of the usage frame, present-but-zero and present-but-unparseable, both of which are still delivered-but-unbilled:

stream-billing-gate: delivered text vs logged usage
files: 7

[1] fixture_blind.jsonl
    delivered : 137 words, 750 chars (floor 137 tokens)
    logged    : none
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 137 tokens of text; no usage frame at all

[2] fixture_broken_usage.jsonl
    delivered : 46 words, 274 chars (floor 46 tokens)
    logged    : none
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 46 tokens of text; usage frame present but unparseable

[3] fixture_empty.jsonl
    delivered : 0 words, 0 chars (floor 0 tokens)
    logged    : none
    verdict   : EMPTY (nothing-delivered) exit 0
    detail    : no text delivered; nothing to reconcile

[4] fixture_injection.jsonl
    delivered : 60 words, 344 chars (floor 60 tokens)
    logged    : none
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 60 tokens of text; no usage frame at all

[5] fixture_ok.jsonl
    delivered : 137 words, 750 chars (floor 137 tokens)
    logged    : output_tokens=190
    verdict   : OK (usage-consistent) exit 0
    detail    : logged 190 >= floor 137

[6] fixture_undercount.jsonl
    delivered : 72 words, 403 chars (floor 72 tokens)
    logged    : output_tokens=12
    verdict   : UNDERCOUNT (partial-telemetry-loss) exit 2
    detail    : logged 12 < floor 72; telemetry lost part of the stream

[7] fixture_zero_usage.jsonl
    delivered : 55 words, 272 chars (floor 55 tokens)
    logged    : output_tokens=0
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 55 tokens of text; usage frame logged output_tokens=0

summary: 1 OK, 1 UNDERCOUNT, 4 BLIND, 1 EMPTY  ->  overall exit 1
report-sha256: dc204b014afadff09b09b82a791dd86ba7767df01454fbc2cc011321c9cd5111

Look at fixture_zero_usage and fixture_broken_usage. A usage frame that says output_tokens: 0 and a usage frame with a broken value both land as BLIND, because in both cases text was delivered and no honest token count was logged. A zero is not an OK. Absence is not an OK. Only a real count that meets the floor is an OK. The report ends with a sha256 of its own body, and because that body depends on the order you pass the files, fixtures/*.jsonl reproduces this exact digest.
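A self-hashing report can be sketched as below. The post does not show exactly which bytes the tool hashes, so the body layout here is an assumption and this sketch will not reproduce the digests printed above. It only illustrates why the digest is deterministic and why file order changes it.

```python
import hashlib


def with_digest(report_lines):
    """Join the report body and append a sha256 of that body."""
    body = "\n".join(report_lines) + "\n"
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body + "report-sha256: " + digest + "\n"


def verify(report):
    """Recompute the digest over everything before the last line."""
    body, _, last = report.rstrip("\n").rpartition("\n")
    body += "\n"
    expected = "report-sha256: " + hashlib.sha256(body.encode("utf-8")).hexdigest()
    return last == expected


a_then_b = with_digest(["[1] a.jsonl", "[2] b.jsonl"])
b_then_a = with_digest(["[1] b.jsonl", "[2] a.jsonl"])
print(verify(a_then_b))      # True
print(a_then_b == b_then_a)  # False
```

Because the digest covers the body in the order it was written, two runs over the same files in the same order agree, and a reordered run does not.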

The same check on provider wire formats

Those seven fixtures use the normalized shape, which is fine for showing the logic but proves nothing about the streams you actually record. So fixtures/providers/ holds streams written in the documented wire formats of three providers, each representative of what their SSE emits: an Anthropic Messages stream, an OpenAI Chat Completions stream, and an OpenAI Responses stream. For each provider there is a delivered-but-unbilled variant and an honest one, plus a foreign log that matches no known shape:

stream-billing-gate: delivered text vs logged usage
files: 7

[1] anthropic_blind.jsonl
    delivered : 54 words, 291 chars (floor 54 tokens)
    logged    : output_tokens=0
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 54 tokens of text; usage frame logged output_tokens=0

[2] anthropic_ok.jsonl
    delivered : 54 words, 291 chars (floor 54 tokens)
    logged    : output_tokens=63
    verdict   : OK (usage-consistent) exit 0
    detail    : logged 63 >= floor 54

[3] openai_chat_blind.jsonl
    delivered : 34 words, 178 chars (floor 34 tokens)
    logged    : none
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 34 tokens of text; no usage frame at all

[4] openai_chat_ok.jsonl
    delivered : 34 words, 178 chars (floor 34 tokens)
    logged    : output_tokens=64
    verdict   : OK (usage-consistent) exit 0
    detail    : logged 64 >= floor 34

[5] responses_blind.jsonl
    delivered : 32 words, 172 chars (floor 32 tokens)
    logged    : output_tokens=0
    verdict   : BLIND (delivered-but-unbilled) exit 1
    detail    : delivered >= 32 tokens of text; usage frame logged output_tokens=0

[6] responses_ok.jsonl
    delivered : 32 words, 172 chars (floor 32 tokens)
    logged    : output_tokens=54
    verdict   : OK (usage-consistent) exit 0
    detail    : logged 54 >= floor 32

[7] unknown_schema.jsonl
    delivered : 0 words, 0 chars (floor 0 tokens)
    logged    : none
    verdict   : UNRECOGNIZED (unrecognized-stream) exit 2
    detail    : 4 event(s) parsed, none matched a known stream shape; map your adapter's text and usage fields before trusting a pass

summary: 3 OK, 0 UNDERCOUNT, 3 BLIND, 0 EMPTY, 1 UNRECOGNIZED  ->  overall exit 1
report-sha256: ec93efeb393fc0606656eb9f5447ca1dbd9837ee5a51258dc0689b3d7a022aa5

Three providers, three delivered-but-unbilled streams, three blocks. The Anthropic case logs output_tokens: 0 in its terminal message_delta; the OpenAI Chat case ran without include_usage so no usage frame lands at all; the Responses case reports a zero in response.completed. Each honest twin logs a real count above the floor and ships. And unknown_schema.jsonl is the case the earlier version of this tool got wrong: a valid-JSON log in a shape the gate cannot map. It used to read as EMPTY and exit 0, which is exactly the silent pass this whole post is against. Now it fails closed as unrecognized-stream, exit 2, so a broken adapter shows up loud instead of hiding the leak.
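Both batch reports above mix per-file exits of 0, 1, and 2 and end with an overall exit of 1. The post does not show how the overall code is derived, so the rule below is an assumption that happens to be consistent with both reports: a blocking exit (1) wins over a warning exit (2), which wins over 0. `overall_exit` is a hypothetical name.

```python
def overall_exit(per_file_exits):
    """Assumed precedence: any 1 (block) > any 2 (warn / unrecognized) > 0."""
    if 1 in per_file_exits:
        return 1
    if 2 in per_file_exits:
        return 2
    return 0


# Per-file exits read off the two reports above, in file order.
print(overall_exit([1, 1, 0, 1, 0, 2, 1]))  # 1
print(overall_exit([1, 0, 1, 0, 1, 0, 2]))  # 1
```

What the tool returns for a batch with warnings but no blocks is not shown in either report, so treat the middle branch as a guess until you run that case yourself.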

The stream tried to tell the gate it was free

The fixture_injection case is a small joke that matters. Its delivered text contains a line addressed to any monitoring tool reading the log: it claims the response was served for free and instructs the reader to report usage-consistent and exit 0. The gate reads that line the same way it reads every other line, as text to be counted. It does not execute it. So the injected instruction just adds to the delivered token count, and the stream still gets blocked, exit 1.

This is the security posture written down: stream content is untrusted input. A recorded stream can carry anything, including a sentence designed to talk your tooling into standing down. A gate that treats logs as data and never as commands is not optional when the logs come off the open internet.

What this is NOT

I would rather draw the borders myself than have you find them.

It does not compute dollars. The floor is not the vendor's invoice. The gate catches delivered-but-zero-logged, the failure where accounting collapses to nothing while text shipped. It does not tell you the exact amount you were overcharged or undercharged. For that you need the price policy, and reconciling a billed amount against declared rates is a different tool that checks which model answered and whether the charge reconciles.

The floor is conservative, and UNDERCOUNT is a heuristic. BLIND rests on a logical fact: text delivered means tokens above zero. UNDERCOUNT rests on an empirical one: real tokenizers rarely emit fewer tokens than words, so a logged count below the word floor is suspicious. Pathological text could in principle sit near the boundary, which is exactly why UNDERCOUNT warns (exit 2) and does not block. The floor is whitespace word count, so it barely moves for scripts that do not put spaces between words; a paragraph of Chinese or Japanese collapses toward a floor of 1, and UNDERCOUNT is effectively inert there while BLIND still holds. A non-integer count is treated the same conservative way: a float like 190.0 or a string "190" is dropped as unparseable and reads as BLIND, which is a false positive for any client that logs honest counts as floats or strings. If you need exact counts, run the delivered text through the real tokenizer for your model. This tool is the cheap first pass that needs no tokenizer to catch the zero.
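Two of those borders are easy to see in a few lines. This is a sketch under assumptions: `parse_output_tokens` is a hypothetical name, and rejecting `bool` is my own choice (Python treats `True` as an `int`), not something the post states about the tool.

```python
def parse_output_tokens(value):
    """Accept only a real int; anything else is unparseable (None)."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


def word_floor(text):
    """Whitespace word count, as described earlier in the post."""
    return len(text.split())


print(parse_output_tokens(190))    # 190
print(parse_output_tokens(190.0))  # None
print(parse_output_tokens("190"))  # None

print(word_floor("Tokyo is the capital of Japan."))  # 6
print(word_floor("東京は日本の首都です。"))             # 1
```

The Japanese sentence has no spaces, so its floor is 1 regardless of length. Any logged count of 1 or more clears that floor, which is why UNDERCOUNT says nothing useful there while a logged 0 still trips BLIND.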

It cannot tell you why the frame is missing. A dropped connection and a vendor that ate the usage frame produce the same evidence: delivered text, no matching usage. The gate reports the fact, not the cause. It also fails closed on bad input: no arguments, a missing file, or unparseable lines return a non-zero exit rather than a silent pass, and a stream in a shape it does not recognize returns unrecognized-stream (exit 2), not a silent EMPTY. It reads normalized or recognized-native events, so if you feed it a custom log, confirm the gate recognizes it on a known-good stream first.

It is a different question from its neighbors. It is not the tokens an agent keeps burning after a run has already failed, and it is not the growing transcript you get rebilled for on every turn of a conversation. Those measure real spend that happened. This one measures spend that happened and then vanished from the record. If you are mapping the whole bill, it sits next to the token tax an MCP server quietly adds to every call and a forecaster for what a loop will cost before you run it.

Why this, why now

The macro mood is why this small failure is worth a gate. During the week of July 7, the loudest thread I saw on Hacker News was "GLM 5.2 and the coming AI margin collapse," sitting around 680 points and 465 comments when I checked. Those are the thread's numbers, not mine, and I link nothing I cannot verify. The argument in the room is that the economics of serving these models are tightening. When margins tighten, the difference between what you actually spent and what your logs think you spent stops being a rounding error and starts being the thing that decides whether your unit economics are real.

A usage frame that goes missing on a streamed response is a silent leak on the wrong side of that equation. You cannot fix a cost you never recorded. The point of reconciling the delivered text against the logged usage is not to compute your bill to the cent. It is to stop pretending a call was free when the answer is sitting right there on the screen.

I publish one of these small offline gates most weeks, each one a runnable tool with the raw output and the exit codes, no keys and no network. Follow along if you want the next one. And the real question I still cannot answer cleanly, so I am asking you: on your own logs, how do you tell a genuinely dropped usage frame apart from a response that legitimately produced no billable output? I read every comment.

Your AI agent re-adds code you reverted last month

Alexey Spinov — Thu, 09 Jul 2026 04:24:13 +0000

AI agent re-adds reverted code when a fresh session, with no memory of last month's decision, re-proposes it. revert_guard.py is an offline, keyless pre-commit gate: it reads your repo's own git revert history and blocks (exit 1) a diff that reintroduces a column the team already added and reverted, before the commit lands.

AI disclosure: I wrote revert_guard.py with an AI assistant and ran it myself, offline, before publishing. Every output block below is pasted from a real local run on Python 3.13.5 and git 2.50.0, standard library only, no network. I ran each scenario twice and confirmed the stdout was byte-for-byte identical; the tool also prints a sha256 digest of its own report so you can check. The card_token / PCI-DSS story and the Selvedge fix are @masondelan's, reported on Dev.to; that is their case and their fix, not my measurement. My exit codes, hashes, and the 2026-06-05 revert are synthetic fixtures on a real git repo, each from a real run, and I keep them in their own paragraphs so the two never blur.

In short:

A new agent session has no memory of last month's decision. It reads the current schema, sees no card_token column, and helpfully proposes adding one. The reasoning that killed it the first time (a PCI-DSS scope call) is gone. The revert that killed it is still sitting in git log.
The tool reads that revert history. For every commit whose message marks it as a revert, it extracts what the revert removed, then checks whether the agent's diff re-adds it. Match by entity (users.card_token), not by file path, so a fresh migration number does not slip past.
The demo: the same proposed diff, a new migration adding users.card_token. Point it at repo_clean and it exits 0 (SHIP). Point it at repo_dirty, where that column was added and reverted, and it exits 1 (BLOCK) and prints reverted 2026-06-05 in c2ce7ed -- reason: "PCI-DSS scope". The only variable is the repo's revert history.
This is not about memory cost and not about permissions. The agent re-proposes what the team already reverted because the reasoning died with the session. The revert did not.
Standard library only. Offline, keyless, read-only, zero network. It shells out to a local git log / git show on read, never writes, never runs the agent. Exit 0 / 1 / 2 for a CI gate. Deterministic stdout with a self-hash. The tool and every fixture are in this post.

The code sticks around. The reasoning doesn't.

On July 6, an engineer posting as @masondelan wrote up an incident on Dev.to that I have not stopped thinking about. Their line for it: the code sticks around, the reasoning doesn't. A team had added a card_token column to their users table, then reverted it two days later because it pulled the table into PCI-DSS scope. About a month on, a fresh Claude Code session, working from the current schema with none of that history in context, planned the exact same column back in. Their fix was a runtime MCP server called Selvedge that answers prior-attempts users.card_token with something like "Prior attempt 28 days ago (reverted after 2 days)." Those numbers and that fix are theirs. I am borrowing the shape of the problem, not the measurement.

Here is the shape. The revert is not lost. It is a commit, in the log, with a message. Git is tracking the fact that the team said no. What git does not do is stop the next actor from proposing it again. A human reviewer might remember. A fresh agent session will not, and neither will a tired reviewer at 6pm looking at a diff that, on its own, looks completely reasonable. The information exists and nothing acts on it. That gap between "the repo knows" and "something enforces it" is the whole space this tool sits in.

So the tool turns the tracking into control at one specific moment: before the diff is committed. It does not need the agent's memory, a vector store, or a running service. It needs the history the team already keeps.

What revert_guard checks in your git revert history

Three verdicts, one rule, read off the repo's own reverts.

REINTRODUCES_REVERTED fires when a table-qualified entity in the diff, like users.card_token, exactly equals one a revert commit removed. That is a BLOCK. It prints the revert's short hash, date, and the reason the commit message stated.
NAME_MATCH_UNVERIFIED fires when the bare name matches but one side is unqualified, so the tool cannot confirm it is the same table's column. That is a WARN, and it is fail-closed by default: a human confirms. It is the honest third state, and it exists so the tool does not have to pretend a bare card_token in a model file is definitely the reverted users.card_token.
NO_REVERT_MATCH is a SHIP. Nothing the diff adds was ever reverted here.

The distinction the whole thing turns on: ALTER TABLE users ADD COLUMN card_token gives a qualified entity, because the table is right there on the line. A lone card_token = Column(...) in a model gives a bare one, because nothing on that line says which table. Two qualified entities have to match table and column to BLOCK. A bare one on either side can only ever reach WARN. Same column name, different confidence, different verdict.
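The rule in that paragraph can be sketched as a comparison of two entities. This is an illustration, not the tool's matching function: `verdict` is a hypothetical name, entities are assumed to be lower-cased `(name, table_or_None)` pairs, and treating the same column name on two different qualified tables as no match is my assumption.

```python
def verdict(proposed, reverted):
    """Compare one proposed entity against one reverted entity."""
    p_name, p_table = proposed
    r_name, r_table = reverted
    if p_name != r_name:
        return "NO_REVERT_MATCH"
    if p_table is not None and r_table is not None:
        # Both sides are qualified: table and column must both match to block.
        if p_table == r_table:
            return "REINTRODUCES_REVERTED"
        return "NO_REVERT_MATCH"  # assumption: different table, different column
    # A bare name on either side can only ever reach WARN.
    return "NAME_MATCH_UNVERIFIED"


reverted = ("card_token", "users")
print(verdict(("card_token", "users"), reverted))  # REINTRODUCES_REVERTED
print(verdict(("card_token", None), reverted))     # NAME_MATCH_UNVERIFIED
print(verdict(("email", "users"), reverted))       # NO_REVERT_MATCH
```

The second call is the model-file case: the name matches, but nothing on the line says which table, so the result is a WARN for a human to confirm.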

Run it in sixty seconds

No keys, no network, no install past Python and git. Save the file, point it at a proposed diff and a repo, run one command. Here is the whole tool, one file, standard library only:

#!/usr/bin/env python3
"""
revert_guard.py -- an offline pre-commit gate that reads a repo's OWN git revert
history and blocks (exit 1) an AI agent's proposed diff that reintroduces a
column / symbol / flag the team already added and then REVERTED -- before the
diff is committed.

It takes a proposed change (a unified diff, or a JSON list of entities) plus a
`--repo` path. It shells out to a LOCAL `git log` / `git show` on READ only,
finds the commits whose message marks them as a revert, extracts the entities
those reverts removed, and intersects that set with the entities the proposed
diff adds. The match is by ENTITY (a column name, table-qualified when it can be:
`users.card_token`), never by file path -- so a brand-new migration file with a
different number is still caught.

  REINTRODUCES_REVERTED -- a table-qualified entity in the diff (e.g.
                           users.card_token) exactly equals one a revert commit
                           removed. BLOCK. Prints the revert's short hash, date,
                           and stated reason.
  NAME_MATCH_UNVERIFIED -- the bare name matches a reverted entity but one side
                           is unqualified, so it cannot be confirmed the SAME
                           table's column. WARN, fail-closed (a human verifies).
  NO_REVERT_MATCH       -- nothing the diff adds was ever reverted here. SHIP.
  BAD_INPUT             -- not a git repo, unreadable/empty diff, bad JSON.

The point the tool exists to make: take ONE proposed diff -- a new migration that
adds users.card_token -- and run it against two repos. On a repo that never
reverted that column it exits 0 (SHIP). On a repo where the same column was added
and reverted last month it exits 1 (BLOCK), and prints the prior revert. Same
diff, same agent; the only variable is whether the REPO remembers the revert.
This is not about the cost of agent memory and not about permissions -- the agent
re-proposes what the team already reverted, because the reasoning died with the
session, while the revert did not.

Offline. Keyless. Read-only. Zero network. Standard library only (subprocess for
`git` on read, sys, re, json, hashlib, argparse). It never writes, never commits,
never runs the agent, never touches the network, and reads the diff as text --
it is DATA, never executed. Output is byte-for-byte deterministic across runs on
the same repo; it prints absolute revert dates (not "N days ago") on purpose, so
the output does not change with the calendar, and ends with a sha256 digest of
its own report so two runs are provably identical.

It does NOT store memory or embed reasoning; it reads git history the team already
keeps. It does NOT decide who is allowed to change what. It does NOT understand
WHY: it matches names, not intent, so a column reverted for reason X and now
legitimately needed for reason Y is still flagged for a human to override. It only
sees reverts that are actually COMMITTED -- a revert done by force-push, squash,
or amend-out-of-history is a blind spot. It is as good as the team's git history
is honest.

Exit codes (usable as a pre-commit / CI gate):
  0  SHIP  (no proposed addition was previously reverted here)
  1  BLOCK or WARN  (both mean "do not auto-apply"; the reason-code says which).
     WARN's exit is configurable via --warn-exit (default 1, fail-closed).
  2  bad input: not a git repo, missing/unreadable/empty diff, unparseable JSON
     -- fail-closed.

Usage:
  python3 revert_guard.py <proposed.diff | entities.json | -> --repo <path>
  python3 revert_guard.py proposed.diff --repo ./service
  git diff --cached | python3 revert_guard.py - --repo .
"""

import argparse
import hashlib
import json
import re
import subprocess
import sys

# A commit is treated as a revert if its subject or body matches this. Covers a
# native `git revert` ("This reverts commit <hash>") and hand-written reverts.
DEFAULT_REVERT_PATTERN = (
    r"(this reverts commit|\brevert(s|ed)?\b|\brolled back\b|\bbacked out\b)"
)

# SQL: `ALTER TABLE users ADD COLUMN card_token ...` -> ('users', 'card_token').
RE_ALTER_ADD = re.compile(
    r"alter\s+table\s+[\"`']?(\w+)[\"`']?.*?\badd\s+column\s+"
    r"(?:if\s+not\s+exists\s+)?[\"`']?(\w+)",
    re.I,
)
# SQL: bare `ADD COLUMN card_token` with no table on the line -> ('card_token',).
RE_ADD_COLUMN = re.compile(
    r"\badd\s+column\s+(?:if\s+not\s+exists\s+)?[\"`']?(\w+)", re.I
)
# Python model field: `card_token = Column(...)` / `x: Mapped[str] = mapped_column(`
RE_PY_COLUMN = re.compile(
    r"^\s*(\w+)\s*(?::[^=]+)?=\s*(?:\w+\.)*(?:mapped_column|Column)\s*\("
)


def _bad(msg):
    print("ERROR: " + msg)
    raise SystemExit(2)


def _git(repo, args):
    """Run a read-only git command, return stdout. Never writes."""
    try:
        proc = subprocess.run(
            ["git", "-c", "core.quotepath=false", "-C", repo] + args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=False,
            text=True,
        )
    except OSError as exc:  # git not installed
        _bad("cannot run git: %s" % exc)
    if proc.returncode != 0:
        _bad("git %s failed in %s: %s"
             % (" ".join(args), repo, proc.stderr.strip()))
    return proc.stdout


def entities_from_line(text):
    """Extract 0..1 entity from a single added/removed diff line.

    Returns a list of (name, qualifier_or_None, kind). Names/qualifiers are
    lower-cased so `Users.Card_Token` and `users.card_token` normalize together.
    """
    m = RE_ALTER_ADD.search(text)
    if m:
        return [(m.group(2).lower(), m.group(1).lower(), "column")]
    m = RE_ADD_COLUMN.search(text)
    if m:
        return [(m.group(1).lower(), None, "column")]
    m = RE_PY_COLUMN.match(text)
    if m:
        return [(m.group(1).lower(), None, "symbol")]
    return []


def entities_from_patch(patch, sign):
    """Collect entities from the +/- content lines of a unified diff.

    sign='+' reads added lines (a proposed change); sign='-' reads removed lines
    (what a revert took out). File headers (+++/---) are skipped.
    """
    out = []
    for line in patch.splitlines():
        if not line or line[0] != sign:
            continue
        if line.startswith("+++") or line.startswith("---"):
            continue
        out.extend(entities_from_line(line[1:]))
    return out


def dedupe(entities):
    """One entity per name; keep the qualified form when both exist."""
    by_name = {}
    for name, qual, kind in entities:
        cur = by_name.get(name)
        if cur is None or (cur[1] is None and qual is not None):
            by_name[name] = (name, qual, kind)
    return [by_name[k] for k in sorted(by_name)]


def parse_reason(subject):
    """Prefer the trailing parenthetical as the human reason, else the subject."""
    m = re.search(r"\(([^()]+)\)\s*$", subject)
    return m.group(1).strip() if m else subject.strip()


def find_reverts(repo, revert_re, since):
    """Return a list of revert commits, most recent first, each with the
    entities it removed. dict: hash, short, date, subject, reason, entities."""
    fmt = "%H%x1f%h%x1f%ad%x1f%s%x1f%b%x1e"
    args = ["log", "--no-color", "--date=short", "--format=" + fmt]
    if since:
        args.append("--since=" + since)
    raw = _git(repo, args)
    reverts = []
    for record in raw.split("\x1e"):
        record = record.strip("\n")
        if not record:
            continue
        parts = record.split("\x1f")
        if len(parts) < 5:
            continue
        full, short, date, subject, body = parts[0], parts[1], parts[2], parts[3], parts[4]
        if not revert_re.search(subject + "\n" + body):
            continue
        patch = _git(repo, ["show", "--no-color", "--format=", "-U0", full])
        removed = dedupe(entities_from_patch(patch, "-"))
        if not removed:
            continue
        reverts.append({
            "hash": full, "short": short, "date": date, "subject": subject,
            "reason": parse_reason(subject), "entities": removed,
        })
    return reverts


def reverted_index(reverts):
    """name -> (name, qual, kind, revert) using the MOST RECENT revert of that
    name (reverts arrive most-recent-first), preferring a qualified form."""
    idx = {}
    for rev in reverts:  # most recent first
        for name, qual, kind in rev["entities"]:
            cur = idx.get(name)
            if cur is None:
                idx[name] = (name, qual, kind, rev)
            elif cur[1] is None and qual is not None:
                # upgrade to a qualified form, keep the more-recent revert we saw
                idx[name] = (name, qual, kind, cur[3])
    return idx


def classify(prop, rev_entity):
    """prop and rev_entity are (name, qual, kind[, ...]); return a reason-code
    or None. Two qualified entities match only if the qualifier matches too."""
    pname, pqual = prop[0], prop[1]
    rname, rqual = rev_entity[0], rev_entity[1]
    if pname != rname:
        return None
    if pqual is not None and rqual is not None:
        return "REINTRODUCES_REVERTED" if pqual == rqual else None
    return "NAME_MATCH_UNVERIFIED"


def read_proposed(path):
    """Read the proposed change from a file (or '-' for stdin). Auto-detects a
    JSON entity list vs a unified diff. Returns a deduped entity list."""
    if path == "-":
        data = sys.stdin.read()
    else:
        try:
            with open(path, "r") as fh:
                data = fh.read()
        except OSError as exc:
            _bad("cannot read proposed change %s: %s" % (path, exc))
    if not data.strip():
        _bad("proposed change %s is empty" % path)
    stripped = data.lstrip()
    if stripped[:1] in "[{":
        try:
            obj = json.loads(data)
        except ValueError as exc:
            _bad("proposed change %s looks like JSON but will not parse: %s"
                 % (path, exc))
        rows = obj if isinstance(obj, list) else obj.get("entities")
        if not isinstance(rows, list) or not rows:
            _bad("JSON proposed change must be a non-empty list of entities")
        ents = []
        for i, r in enumerate(rows):
            if not isinstance(r, dict):
                _bad("entity %d is not an object" % i)
            name = r.get("name") or r.get("symbol") or r.get("column")
            if not name:
                _bad("entity %d has no name/symbol/column" % i)
            qual = r.get("table") or r.get("qualifier")
            ents.append((str(name).lower(),
                         str(qual).lower() if qual else None,
                         str(r.get("kind", "symbol"))))
        return dedupe(ents)
    ents = entities_from_patch(data, "+")
    if not ents:
        _bad("no added column/symbol entities found in the proposed diff %s "
             "(nothing to check)" % path)
    return dedupe(ents)


def build_findings(proposed, idx):
    """Match each proposed entity against the reverted index, by name."""
    findings = []
    for prop in proposed:
        hit = idx.get(prop[0])
        if hit is None:
            continue  # this name was never reverted
        name, qual, kind, rev = hit
        code = classify(prop, (name, qual, kind))
        if code is not None:
            findings.append((code, prop, (name, qual, kind), rev))
    # Hard blocks sort ahead of unverified name matches, then by entity name.
    rank = {"REINTRODUCES_REVERTED": 0, "NAME_MATCH_UNVERIFIED": 1}
    findings.sort(key=lambda f: (rank[f[0]], f[1][0]))
    return findings


def render(repo, proposed_path, proposed, reverts, findings, warn_exit):
    n_names = len({n for rev in reverts for (n, _, _) in rev["entities"]})
    out = ["REVERT-GUARD REPORT"]
    out.append("repo: %s" % repo)
    out.append("proposed: %s" % proposed_path)
    out.append("revert history: %d revert commit(s), %d reverted entity(ies)"
               % (len(reverts), n_names))
    out.append("proposed additions: %d entity(ies)" % len(proposed))
    blocks = [f for f in findings if f[0] == "REINTRODUCES_REVERTED"]
    warns = [f for f in findings if f[0] == "NAME_MATCH_UNVERIFIED"]
    out.append("findings:")
    if not findings:
        out.append("  (none -- no proposed addition matches a reverted entity)")
    for code, prop, rev_ent, rev in findings:
        label = (rev_ent[0] if rev_ent[1] is None
                 else "%s.%s" % (rev_ent[1], rev_ent[0]))
        prop_label = (prop[0] if prop[1] is None
                      else "%s.%s" % (prop[1], prop[0]))
        out.append("  - %s  %s" % (code, prop_label))
        if code == "REINTRODUCES_REVERTED":
            out.append("      your diff re-adds %s (qualified: table '%s', "
                       "column '%s')" % (prop_label, prop[1], prop[0]))
        else:
            out.append("      your diff adds a %s named '%s' (unqualified); "
                       "the reverted entity is %s" % (prop[2], prop[0], label))
        out.append("      reverted %s in %s -- reason: \"%s\""
                   % (rev["date"], rev["short"], rev["reason"]))
        out.append("      revert subject: %s" % rev["subject"])
    if blocks:
        verdict, code = "BLOCK", 1
        reason = ("%d proposed addition(s) reintroduce a change this repo "
                  "already reverted" % len(blocks))
    elif warns:
        verdict, code = "WARN", warn_exit
        reason = ("%d proposed addition(s) share a name with a reverted entity "
                  "but could not be confirmed -- a human verifies" % len(warns))
    else:
        verdict, code = "SHIP", 0
        reason = "nothing in this diff was previously reverted in this repo"
    out.append("decision: %s -- %s" % (verdict, reason))
    body = "\n".join(out) + "\n"
    out.append("digest(sha256): %s" % hashlib.sha256(body.encode()).hexdigest())
    return "\n".join(out), code


def main(argv):
    ap = argparse.ArgumentParser(add_help=True, prog="revert_guard.py")
    ap.add_argument("proposed", help="proposed diff / entities.json / - for stdin")
    ap.add_argument("--repo", default=".", help="path to the git repo (default: .)")
    ap.add_argument("--since", default=None,
                    help="limit revert scan (git --since, e.g. '6 months ago')")
    ap.add_argument("--revert-pattern", default=DEFAULT_REVERT_PATTERN,
                    help="regex marking a commit as a revert")
    ap.add_argument("--warn-exit", type=int, default=1, choices=(0, 1),
                    help="exit code for a WARN verdict (default 1, fail-closed)")
    if len(argv) == 1:
        ap.print_usage()
        raise SystemExit(2)
    args = ap.parse_args(argv[1:])

    inside = _git(args.repo, ["rev-parse", "--is-inside-work-tree"]).strip()
    if inside != "true":
        _bad("%s is not a git work tree" % args.repo)
    try:
        revert_re = re.compile(args.revert_pattern, re.I)
    except re.error as exc:
        _bad("bad --revert-pattern: %s" % exc)

    proposed = read_proposed(args.proposed)
    reverts = find_reverts(args.repo, revert_re, args.since)
    idx = reverted_index(reverts)
    findings = build_findings(proposed, idx)
    report, code = render(args.repo, args.proposed, proposed, reverts,
                          findings, args.warn_exit)
    print(report)
    raise SystemExit(code)


if __name__ == "__main__":
    main(sys.argv)

The fixtures: two real repos, one proposed diff

The runs below use two actual git repositories, built by a small script (the full builder is at the end of the post, so you can rebuild them byte for byte). Both repos share the same base history, including one unrelated revert of users.legacy_flag. repo_dirty has two extra commits the clean one does not: it added users.card_token, then reverted it. Here is repo_dirty's log:

$ git -C fixtures/repo_dirty log --format='%h %ad %s' --date=short
6be79a1 2026-06-20 chore: index orders.user_id, docs pointer
c2ce7ed 2026-06-05 revert: drop users.card_token (PCI-DSS scope)
b2a3099 2026-06-03 feat: store users.card_token for one-click checkout
21fde1b 2026-05-24 revert: drop users.legacy_flag (unused after v2 launch)
2103408 2026-05-22 feat: add users.legacy_flag for v1 routing
9bd2943 2026-05-20 init: user service skeleton

The proposed change is the same file for both runs: a new migration and a model field, adding users.card_token. Note the migration number is 0042, not the 0007 from the original add. A path-based check would miss this. The tool matches on the entity.

--- /dev/null
+++ b/migrations/0042_add_card_token.sql
@@ -0,0 +1,2 @@
+-- store a tokenized card reference for the new checkout flow
+ALTER TABLE users ADD COLUMN card_token TEXT;
...
+++ b/models/user.py
@@ -9,3 +9,4 @@ class User(Base):
+    card_token = Column(String(255))
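
To make the path-versus-entity point concrete, here is a minimal sketch. It copies two of the tool's regexes so it runs on its own; the helper name entity_of and the two (path, line) tuples are illustrative assumptions, not part of revert_guard.py.

```python
import re

# Copied from revert_guard.py so the snippet is self-contained.
RE_ALTER_ADD = re.compile(
    r"alter\s+table\s+[\"`']?(\w+)[\"`']?.*?\badd\s+column\s+"
    r"(?:if\s+not\s+exists\s+)?[\"`']?(\w+)",
    re.I,
)
RE_PY_COLUMN = re.compile(
    r"^\s*(\w+)\s*(?::[^=]+)?=\s*(?:\w+\.)*(?:mapped_column|Column)\s*\("
)


def entity_of(line):
    """Hypothetical helper: return (name, table_or_None) for one diff line."""
    m = RE_ALTER_ADD.search(line)
    if m:
        return (m.group(2).lower(), m.group(1).lower())
    m = RE_PY_COLUMN.match(line)
    if m:
        return (m.group(1).lower(), None)
    return None


# (path, added line) for the original add and for the re-proposal.
original = ("migrations/0007_add_card_token.sql",
            "ALTER TABLE users ADD COLUMN card_token TEXT;")
proposed = ("migrations/0042_add_card_token.sql",
            "ALTER TABLE users ADD COLUMN card_token TEXT;")

path_match = original[0] == proposed[0]
entity_match = entity_of(original[1]) == entity_of(proposed[1])
print(path_match, entity_match)  # False True
```

The file path changed between the two migrations, so comparing paths sees two unrelated files. The extracted pair (card_token, users) is identical, which is the only thing the gate compares.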

The baseline: on a clean repo it ships

Point the diff at repo_clean. That repo has its own revert history (the legacy_flag one), so the scan runs and finds a revert. It just is not this one. My fixture, my run:

$ python3 revert_guard.py proposed_card_token.diff --repo repo_clean
REVERT-GUARD REPORT
repo: repo_clean
proposed: proposed_card_token.diff
revert history: 1 revert commit(s), 1 reverted entity(ies)
proposed additions: 1 entity(ies)
findings:
  (none -- no proposed addition matches a reverted entity)
decision: SHIP -- nothing in this diff was previously reverted in this repo
digest(sha256): b96ce3f50d9062eb41b2424cf4544aeffcded6110016d45a3404aeeaf47bd2da

Exit 0. SHIP. The legacy_flag revert was read and correctly ignored, because the diff does not touch it. This is the run that ships today with no gate: the diff is clean, the schema has no card_token, out it goes.

Same diff, dirty repo: block

This is the flip the post exists for. Nothing about the proposed diff changes. The one thing that changes is --repo repo_clean becomes --repo repo_dirty.

$ python3 revert_guard.py proposed_card_token.diff --repo repo_dirty
REVERT-GUARD REPORT
repo: repo_dirty
proposed: proposed_card_token.diff
revert history: 2 revert commit(s), 2 reverted entity(ies)
proposed additions: 1 entity(ies)
findings:
  - REINTRODUCES_REVERTED  users.card_token
      your diff re-adds users.card_token (qualified: table 'users', column 'card_token')
      reverted 2026-06-05 in c2ce7ed -- reason: "PCI-DSS scope"
      revert subject: revert: drop users.card_token (PCI-DSS scope)
decision: BLOCK -- 1 proposed addition(s) reintroduce a change this repo already reverted
digest(sha256): 11007493d9b2043523a6b97c8f64eb53ff55988b78483060397c386962e3ebab

Exit 1. BLOCK. It found commit c2ce7ed, read what that revert removed, matched users.card_token against the diff, and handed back the date and the stated reason: PCI-DSS scope. Sit with the pair for a second. Same diff, same agent, one exit 0 and one exit 1. If the problem were the agent's memory, deleting one revert commit from one repo would not change the verdict. It does. The variable is not the agent. It is whether the repo remembers.
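
The "read what that revert removed" step can be sketched in isolation. The patch text below is a hand-written approximation of what git show -U0 prints for the revert commit, not captured output, and removed_entities is an illustrative name that folds the tool's extraction and dedupe steps into one function.

```python
import re

# Copied from revert_guard.py so the snippet is self-contained.
RE_ALTER_ADD = re.compile(
    r"alter\s+table\s+[\"`']?(\w+)[\"`']?.*?\badd\s+column\s+"
    r"(?:if\s+not\s+exists\s+)?[\"`']?(\w+)",
    re.I,
)
RE_PY_COLUMN = re.compile(
    r"^\s*(\w+)\s*(?::[^=]+)?=\s*(?:\w+\.)*(?:mapped_column|Column)\s*\("
)

# Assumption: an approximation of the revert's zero-context patch.
REVERT_PATCH = """\
diff --git a/migrations/0007_add_card_token.sql b/migrations/0007_add_card_token.sql
deleted file mode 100644
--- a/migrations/0007_add_card_token.sql
+++ /dev/null
@@ -1 +0,0 @@
-ALTER TABLE users ADD COLUMN card_token TEXT;
diff --git a/models/user.py b/models/user.py
--- a/models/user.py
+++ b/models/user.py
@@ -10 +9,0 @@ class User(Base):
-    card_token = Column(String(255))
"""


def removed_entities(patch):
    """Collect (name, table_or_None) from removed lines, one entry per name."""
    found = {}
    for line in patch.splitlines():
        # Only removed content lines; '---' is a file header, not content.
        if not line.startswith("-") or line.startswith("---"):
            continue
        text = line[1:]
        m = RE_ALTER_ADD.search(text)
        if m:
            ent = (m.group(2).lower(), m.group(1).lower())
        else:
            m = RE_PY_COLUMN.match(text)
            if not m:
                continue
            ent = (m.group(1).lower(), None)
        cur = found.get(ent[0])
        # Keep the qualified form when both a SQL line and a model line exist.
        if cur is None or (cur[1] is None and ent[1] is not None):
            found[ent[0]] = ent
    return [found[name] for name in sorted(found)]


print(removed_entities(REVERT_PATCH))  # [('card_token', 'users')]
```

Two removed lines collapse into one entity because the SQL line carries the table and the model line does not. That is why the report counts one reverted entity per revert commit in these fixtures.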

When nothing was reverted, it still ships

A gate that blocked on everything would be a different kind of useless, so here is the counter-case. On the same repo_dirty, a diff that adds an unrelated, never-reverted column, users.last_login:

$ python3 revert_guard.py proposed_last_login.diff --repo repo_dirty
REVERT-GUARD REPORT
repo: repo_dirty
proposed: proposed_last_login.diff
revert history: 2 revert commit(s), 2 reverted entity(ies)
proposed additions: 1 entity(ies)
findings:
  (none -- no proposed addition matches a reverted entity)
decision: SHIP -- nothing in this diff was previously reverted in this repo
digest(sha256): 7facc852a8c4070fb515d4327bcc23a44d3efda8b8822bc0d24f52d57110debb

Exit 0. Both reverts were scanned; neither is last_login; it ships. The gate answers to the revert history, not to a mood.

The third state: a name match it will not pretend to be sure about

Now a harder one. What if the agent adds card_token only as a model field, with no ALTER TABLE line to say which table? The name matches the reverted users.card_token, but the diff never says users. The honest answer is not BLOCK and not SHIP.

$ python3 revert_guard.py proposed_card_token_model_only.diff --repo repo_dirty
REVERT-GUARD REPORT
repo: repo_dirty
proposed: proposed_card_token_model_only.diff
revert history: 2 revert commit(s), 2 reverted entity(ies)
proposed additions: 1 entity(ies)
findings:
  - NAME_MATCH_UNVERIFIED  card_token
      your diff adds a symbol named 'card_token' (unqualified); the reverted entity is users.card_token
      reverted 2026-06-05 in c2ce7ed -- reason: "PCI-DSS scope"
      revert subject: revert: drop users.card_token (PCI-DSS scope)
decision: WARN -- 1 proposed addition(s) share a name with a reverted entity but could not be confirmed -- a human verifies
digest(sha256): d07bc775aebbb2f2880b050b28f95a73dc235be59449c15a526e8df3cb107c68

Exit 1, but WARN, not BLOCK. It surfaces the match and refuses to escalate to a hard block on a name it could not qualify. WARN is fail-closed by default because a hidden re-add is worse than a false alarm, but the exit is a flag you own. If your pipeline wants WARN to pass, --warn-exit 0 gives it exit 0 while the report text is identical (same digest, d07bc775...). I went back and forth on that default and I would not fight hard for it. Fail-closed felt right for a gate; your risk tolerance may differ.
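
Condensed into one place, the three states look like this. The sketch restates classify and the verdict branch of render from the listing above on simplified (name, qualifier) pairs; verdict_for is an illustrative name, not a function in the tool.

```python
def classify(prop, reverted):
    """prop and reverted are (name, qualifier_or_None) pairs."""
    if prop[0] != reverted[0]:
        return None
    if prop[1] is not None and reverted[1] is not None:
        # Both sides name a table: block only when the tables agree.
        return "REINTRODUCES_REVERTED" if prop[1] == reverted[1] else None
    # At least one side is unqualified: surface it, do not hard-block.
    return "NAME_MATCH_UNVERIFIED"


def verdict_for(codes, warn_exit=1):
    """Hypothetical helper: map finding codes to (verdict, exit code)."""
    if "REINTRODUCES_REVERTED" in codes:
        return "BLOCK", 1
    if "NAME_MATCH_UNVERIFIED" in codes:
        return "WARN", warn_exit
    return "SHIP", 0


reverted = ("card_token", "users")
cases = {
    "qualified, same table": ("card_token", "users"),
    "qualified, other table": ("card_token", "orders"),
    "model field only": ("card_token", None),
    "unrelated column": ("last_login", "users"),
}
for label, prop in cases.items():
    code = classify(prop, reverted)
    print(label, "->", verdict_for([code] if code else []))
```

Note the second case: a card_token on a different, explicitly named table does not match at all, because two qualified entities only match when the qualifier matches too.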

If you would rather feed the guard a structured list than a diff, it also reads a JSON array of entities, [{"name": "card_token", "table": "users"}], and treats a table the same way it treats a qualifier parsed from SQL. That path blocks on repo_dirty exactly like the diff does.
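
For reference, a minimal sketch of how that JSON form normalizes to the same tuples the diff path produces. It follows the key fallbacks in read_proposed (name / symbol / column, table / qualifier) but drops the tool's error handling; entities_from_json is an illustrative name.

```python
import json


def entities_from_json(text):
    """Hypothetical helper: JSON entity list -> [(name, qualifier, kind)]."""
    obj = json.loads(text)
    # Accept either a bare list or an object with an "entities" key.
    rows = obj if isinstance(obj, list) else obj.get("entities", [])
    ents = []
    for r in rows:
        name = r.get("name") or r.get("symbol") or r.get("column")
        qual = r.get("table") or r.get("qualifier")
        ents.append((
            str(name).lower(),
            str(qual).lower() if qual else None,
            str(r.get("kind", "symbol")),
        ))
    return ents


print(entities_from_json('[{"name": "card_token", "table": "users"}]'))
# [('card_token', 'users', 'symbol')]
```

Because the table lands in the qualifier slot, this entity is qualified on both sides of the comparison, which is what makes a hard BLOCK possible rather than a WARN.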

About Selvedge, and about not overclaiming

@masondelan's fix and mine solve the same pain from opposite ends, and I want to be precise about the difference rather than imply I beat anything. Selvedge, as they describe it, is a runtime MCP server the agent queries mid-session: it asks prior-attempts and gets an answer back, so the model can course-correct while it plans. That is a good design and it lives inside the agent loop. revert_guard.py is not that. It is an offline gate outside the loop: it runs on the proposed diff before the commit, needs no server and no key, and reads history the team already has. Different mechanism, same failure mode. My angle is not "better than Selvedge." It is that this specific class of mistake can also be caught by a deterministic gate with nothing running, which is a cheaper thing to add on a Friday.

There is a broader argument going around that anything a deterministic system can do reliably should not be handed to a probabilistic one on every call. A revert check is a clean example. Whether a column was reverted is a fact in the log, not a judgment call. You do not need a model to answer it, you need a grep with taste, and the answer should be the same every time you ask. That is why the tool hashes its own output.

Where this sits next to the rest

This is a spoke on the "pre-execution gate for AI agents" cluster, and its object is the moment before a schema change commits. The neighbors ask adjacent questions:

The "agent memory tax and backdoor" post is about what it costs to give an agent memory and how that context can be poisoned. This is the deliberate contrast. revert_guard.py stores nothing and embeds nothing; it reads the git history you already keep, at zero storage cost. The fix for "the reasoning died" here is not more memory, it is a check against a record you never threw away.
"Agent-authored SQL reaches the database" shares the object of this demo, a migration and a column arriving at the DB, but asks whether the SQL string is safe to run. This asks whether the change should be proposed at all, given it was already pulled once.
"Your agent returns 200 and lies" is the franchise in one line: the system is tracking something (a 200, a revert) while nothing controls the gap between what is recorded and what is enforced. Same shape, runtime instead of pre-commit.
"The authz gate: trace vs allowed" is the other contrast to state plainly. That gate answers "is this actor allowed to do this." This one assumes the change is fully allowed and still stops it, because allowed and already-rejected are different questions.

What this is NOT

I would rather undersell this than have you wire it in as something it isn't.

It is not agent memory or a RAG store. It stores nothing and embeds no reasoning. It reads reverts out of the git history the team already maintains. Storage cost is zero, and there is no context to poison.
It is not an authorization layer. It does not decide who may change what. The change in the demo is fully permitted by any sane policy. The problem is that it was tried and rolled back, which is a different axis entirely.
It does not understand why. It matches entity names, not intent. A column reverted for reason X that is now legitimately needed for reason Y will still be flagged, and it should be: the tool hands a human the prior revert and its reason and lets them override. It surfaces; it does not judge.
It is not a linter or a type checker. It says nothing about whether the code is correct. Its one job is the fact of re-adding something previously reverted.
The blind spot is git hygiene, and it is real. It only sees reverts that are actually committed with a message the pattern matches. A rollback done by force-push, a squash-merge that swallowed the revert, an amend that rewrote it out of history: invisible. Garbage history in, garbage gate out. It is exactly as good as your team's revert discipline.
It can also over-fire, not only miss. That blind spot is the false negative. The mirror is the false alarm: it calls a commit a revert when the message trips the revert pattern and the same commit drops an ADD COLUMN line, so a migration squash or a rename that merely says "revert" in passing can get logged as a revert of users.card_token and block a later honest add. It also keys off the revert record, not the current schema, so a column that was reverted and then legitimately restored still trips it. Same trade either way: fail loud, hand a human the commit, let them clear it.
It does not run the agent, or anything else. Offline, read-only, zero network. It reads a diff as text and shells out to a local git log / git show. It never executes the diff, calls a model, or opens a socket.
It is not a replacement for code review. It is a pre-review check on one failure mode, the quiet re-add of a reverted change, not a general bug hunter. Keep the reviewer.
The numbers here are fixture units. The exit codes, the 2026-06-05, the c2ce7ed, the hashes: all synthetic, from a repo I built for this post. The card_token / PCI-DSS story and Selvedge belong to @masondelan. Run the tool on your own repo to get anything that means something about your agent.
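
The pattern half of both the miss and the false alarm can be checked directly. The sketch below runs the default pattern against a handful of commit subjects; only the first subject comes from the fixtures, the rest are made up for illustration.

```python
import re

# Copied from revert_guard.py.
DEFAULT_REVERT_PATTERN = (
    r"(this reverts commit|\brevert(s|ed)?\b|\brolled back\b|\bbacked out\b)"
)
revert_re = re.compile(DEFAULT_REVERT_PATTERN, re.I)

subjects = [
    # From the fixtures: a real revert.
    "revert: drop users.card_token (PCI-DSS scope)",
    # Hypothetical: mentions "revert" in passing, so the pattern fires.
    "chore: squash migrations so nobody has to revert 0007 by hand",
    # Hypothetical: "rolled back" appears in a docs commit.
    "docs: how we rolled back the checkout flag",
    # Hypothetical: a hand-written revert the pattern does NOT catch,
    # because "reverting" has no word boundary after "revert".
    "fix: reverting the card_token change",
    # Hypothetical: an ordinary feature commit.
    "feat: add users.last_login",
]

flagged = {s: bool(revert_re.search(s)) for s in subjects}
for subject, hit in flagged.items():
    print("%-5s %s" % (hit, subject))
```

A pattern hit alone does not create a revert entry: the commit must also remove a line the entity regexes recognize. If either of these looks wrong for your repo's commit conventions, --revert-pattern is the knob to turn.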

Bad input fails closed

A gate that crashes into a green is worse than no gate. Point it at something that is not a git repo and it refuses to decide:

$ python3 revert_guard.py proposed_card_token.diff --repo not_a_repo
ERROR: git rev-parse --is-inside-work-tree failed in not_a_repo: fatal: not a git repository (or any of the parent directories): .git

$ python3 revert_guard.py ; echo "exit=$?"
usage: revert_guard.py [-h] [--repo REPO] [--since SINCE]
                       [--revert-pattern REVERT_PATTERN] [--warn-exit {0,1}]
                       proposed
exit=2

Both exit 2, distinct from the exit 1 a BLOCK or WARN returns, so CI can tell "the gate says hold" apart from "the gate could not run." One honest caveat: git resolves upward, so if you point --repo at a plain subdirectory inside another checkout, git will find that outer repo instead of erroring. Give it a path that is genuinely outside a work tree to see the exit 2 above.
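
A CI wrapper that keeps those three outcomes apart might look like the sketch below. The function names, the label strings, and the choice to treat any unexpected code as a failure to run are assumptions for illustration, not part of the tool.

```python
import subprocess
import sys

# Labels are illustrative; the exit codes are the ones the post documents.
MEANING = {
    0: "ship",           # SHIP, or WARN when run with --warn-exit 0
    1: "hold",           # BLOCK, or WARN under the fail-closed default
    2: "gate-did-not-run",  # bad input, not a repo, usage error
}


def interpret(returncode):
    """Map an exit code to a pipeline outcome; unknown codes fail closed."""
    return MEANING.get(returncode, "gate-did-not-run")


def run_gate(diff_path, repo="."):
    """Run revert_guard.py (assumed to sit in the working directory)."""
    proc = subprocess.run(
        [sys.executable, "revert_guard.py", diff_path, "--repo", repo],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is a verdict here, not an exception
    )
    return interpret(proc.returncode), proc.stdout


for code in (0, 1, 2, 137):
    print(code, "->", interpret(code))
```

Routing "hold" to a human and "gate-did-not-run" to whoever owns the pipeline keeps a broken gate from being mistaken for a reviewed decision.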

On determinism: I ran each report scenario twice, offline, on Python 3.13.5, and hashed the full stdout both times. Identical every time. The tool also prints a digest(sha256) of its own report, so you can verify a run without trusting me: drop the last line and hash the rest. The SHIP baseline is b96ce3f5..., the killer BLOCK is 11007493..., the WARN is d07bc775..., the unrelated SHIP is 7facc852.... Absolute dates, not "N days ago," precisely so tomorrow's run has the same hash as today's.
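
"Drop the last line and hash the rest" can be done in a few lines. This sketch follows how render builds the digest (the report lines joined with newlines, plus one trailing newline, hashed before the digest line is appended); verify_report and the demo report are illustrative, not taken from a real run.

```python
import hashlib

PREFIX = "digest(sha256): "


def verify_report(report):
    """Return True if the final digest line matches the text above it."""
    lines = report.rstrip("\n").split("\n")
    if not lines or not lines[-1].startswith(PREFIX):
        return False
    claimed = lines[-1][len(PREFIX):]
    # render() hashes the joined lines plus a single trailing newline.
    body = "\n".join(lines[:-1]) + "\n"
    return hashlib.sha256(body.encode()).hexdigest() == claimed


# A made-up two-line report, signed the same way the tool signs its own.
demo_lines = ["REVERT-GUARD REPORT", "decision: SHIP -- demo only"]
demo_body = "\n".join(demo_lines) + "\n"
demo_report = demo_body + PREFIX + hashlib.sha256(demo_body.encode()).hexdigest() + "\n"

print(verify_report(demo_report))                           # True
print(verify_report(demo_report.replace("SHIP", "BLOCK")))  # False
```

The digest covers the report text only, so it proves a report was not edited after the fact. It says nothing about whether the repo it was run against is the one you think it was.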

Reproduce the fixtures

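The two repos and the proposed diffs are built by this script. Author identity and every commit date are pinned, so a rebuild lands the same commit SHAs (and short hashes like c2ce7ed) the post prints, as long as your machine's time zone matches the one the fixtures were built in: the pinned dates carry no UTC offset, and git reads such dates as local time.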

#!/usr/bin/env python3
"""
make_fixtures.py -- builds the two real git repositories and the proposed-diff
files that revert_guard.py runs on in the post. It only writes DATA and calls a
LOCAL git to create commits; nothing here is executed by the guard.

Determinism: author/committer identity and every commit date are pinned, so a
rebuild produces the same commit SHAs (and therefore the same short hashes) the
post prints. Signing/hooks are disabled so a contributor's global git config
cannot change the objects.

  fixtures/repo_clean  -- init, add users.legacy_flag, REVERT legacy_flag, noise.
                          One revert commit. Never touched card_token.
  fixtures/repo_dirty  -- the same, PLUS: add users.card_token, REVERT card_token
                          (message: "PCI-DSS scope"), noise. Two revert commits.
  fixtures/proposed_card_token.diff        -- new migration 0042 + model field
                                              adding users.card_token (qualified).
  fixtures/proposed_card_token_model_only.diff -- only the model field (bare).
  fixtures/proposed_last_login.diff        -- an unrelated, never-reverted column.
  fixtures/proposed_card_token.entities.json -- the JSON-entity input form.
  fixtures/not_a_repo/                     -- a plain dir for the bad-input case.
"""

import os
import shutil
import subprocess

BASE = os.path.dirname(os.path.abspath(__file__))
FIX = os.path.join(BASE, "fixtures")

ENV = dict(os.environ)
ENV.update({
    "GIT_AUTHOR_NAME": "fixture", "GIT_AUTHOR_EMAIL": "fixture@example.invalid",
    "GIT_COMMITTER_NAME": "fixture", "GIT_COMMITTER_EMAIL": "fixture@example.invalid",
})

USER_PY_BASE = '''\
from db import Base, Column, String, DateTime, Integer


class User(Base):
    __tablename__ = "users"

    id = Column(Integer, primary_key=True)
    email = Column(String(255), nullable=False)
    created_at = Column(DateTime, nullable=False)
'''

INIT_SQL = "CREATE TABLE users (\n  id INTEGER PRIMARY KEY,\n  email TEXT NOT NULL,\n  created_at TIMESTAMP NOT NULL\n);\n"


def run(repo, args, date=None):
    env = dict(ENV)
    if date:
        env["GIT_AUTHOR_DATE"] = date
        env["GIT_COMMITTER_DATE"] = date
    subprocess.run(["git", "-c", "commit.gpgsign=false", "-c", "gc.auto=0",
                    "-C", repo] + args, env=env, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def write(repo, rel, text):
    path = os.path.join(repo, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        fh.write(text)


def commit(repo, date, message):
    run(repo, ["add", "-A"])
    run(repo, ["commit", "--no-verify", "-q", "-m", message], date=date)


def init_repo(repo):
    if os.path.exists(repo):
        shutil.rmtree(repo)
    os.makedirs(repo)
    run(repo, ["init", "-q", "-b", "main"])


def add_field(repo, field, coltype):
    text = USER_PY_BASE.rstrip("\n") + "\n    %s = Column(%s)\n" % (field, coltype)
    write(repo, "models/user.py", text)


def base_history(repo):
    # 1) init
    write(repo, "README.md", "# service\n\nInternal user service.\n")
    write(repo, "models/user.py", USER_PY_BASE)
    write(repo, "migrations/0001_init.sql", INIT_SQL)
    commit(repo, "2026-05-20T10:00:00", "init: user service skeleton")
    # 2) add users.legacy_flag
    write(repo, "migrations/0003_add_legacy_flag.sql",
          "ALTER TABLE users ADD COLUMN legacy_flag BOOLEAN DEFAULT false;\n")
    add_field(repo, "legacy_flag", "String(8)")
    commit(repo, "2026-05-22T09:30:00", "feat: add users.legacy_flag for v1 routing")
    # 3) REVERT legacy_flag (unrelated revert, shared by both repos)
    os.remove(os.path.join(repo, "migrations/0003_add_legacy_flag.sql"))
    write(repo, "models/user.py", USER_PY_BASE)
    commit(repo, "2026-05-24T14:15:00",
           "revert: drop users.legacy_flag (unused after v2 launch)")


def noise(repo, date):
    write(repo, "migrations/0009_add_orders_index.sql",
          "CREATE INDEX idx_orders_user ON orders (user_id);\n")
    write(repo, "README.md", "# service\n\nInternal user service. See /docs.\n")
    commit(repo, date, "chore: index orders.user_id, docs pointer")


def build_clean():
    repo = os.path.join(FIX, "repo_clean")
    init_repo(repo)
    base_history(repo)
    noise(repo, "2026-06-20T11:00:00")


def build_dirty():
    repo = os.path.join(FIX, "repo_dirty")
    init_repo(repo)
    base_history(repo)
    # 4) add users.card_token
    write(repo, "migrations/0007_add_card_token.sql",
          "ALTER TABLE users ADD COLUMN card_token TEXT;\n")
    add_field(repo, "card_token", "String(255)")
    commit(repo, "2026-06-03T13:20:00",
           "feat: store users.card_token for one-click checkout")
    # 5) REVERT card_token -- the entity the killer demo re-proposes
    os.remove(os.path.join(repo, "migrations/0007_add_card_token.sql"))
    write(repo, "models/user.py", USER_PY_BASE)
    commit(repo, "2026-06-05T16:45:00",
           "revert: drop users.card_token (PCI-DSS scope)")
    # 6) noise so the revert is not the latest commit
    noise(repo, "2026-06-20T11:00:00")


PROPOSED_CARD_TOKEN = '''\
diff --git a/migrations/0042_add_card_token.sql b/migrations/0042_add_card_token.sql
new file mode 100644
index 0000000..a1a1a1a
--- /dev/null
+++ b/migrations/0042_add_card_token.sql
@@ -0,0 +1,2 @@
+-- store a tokenized card reference for the new checkout flow
+ALTER TABLE users ADD COLUMN card_token TEXT;
diff --git a/models/user.py b/models/user.py
index b2b2b2b..c3c3c3c 100644
--- a/models/user.py
+++ b/models/user.py
@@ -9,3 +9,4 @@ class User(Base):
     email = Column(String(255), nullable=False)
     created_at = Column(DateTime, nullable=False)
+    card_token = Column(String(255))
'''

PROPOSED_MODEL_ONLY = '''\
diff --git a/models/user.py b/models/user.py
index b2b2b2b..d4d4d4d 100644
--- a/models/user.py
+++ b/models/user.py
@@ -9,3 +9,4 @@ class User(Base):
     email = Column(String(255), nullable=False)
     created_at = Column(DateTime, nullable=False)
+    card_token = Column(String(255))
'''

PROPOSED_LAST_LOGIN = '''\
diff --git a/migrations/0043_add_last_login.sql b/migrations/0043_add_last_login.sql
new file mode 100644
index 0000000..e5e5e5e
--- /dev/null
+++ b/migrations/0043_add_last_login.sql
@@ -0,0 +1,1 @@
+ALTER TABLE users ADD COLUMN last_login TIMESTAMP;
'''

PROPOSED_JSON = '''\
[
  {"name": "card_token", "table": "users", "kind": "column"}
]
'''


def build_proposed():
    with open(os.path.join(FIX, "proposed_card_token.diff"), "w") as fh:
        fh.write(PROPOSED_CARD_TOKEN)
    with open(os.path.join(FIX, "proposed_card_token_model_only.diff"), "w") as fh:
        fh.write(PROPOSED_MODEL_ONLY)
    with open(os.path.join(FIX, "proposed_last_login.diff"), "w") as fh:
        fh.write(PROPOSED_LAST_LOGIN)
    with open(os.path.join(FIX, "proposed_card_token.entities.json"), "w") as fh:
        fh.write(PROPOSED_JSON)
    nar = os.path.join(FIX, "not_a_repo")
    os.makedirs(nar, exist_ok=True)
    with open(os.path.join(nar, "hello.txt"), "w") as fh:
        fh.write("this directory is deliberately not a git repo\n")


def main():
    os.makedirs(FIX, exist_ok=True)
    build_clean()
    build_dirty()
    build_proposed()
    print("fixtures built in %s" % FIX)


if __name__ == "__main__":
    main()

The question I actually want answered

Here is the one I do not have a good number for. When a fresh agent session re-proposes something your team reverted, what catches it today? My honest guess is: usually nothing, until a reviewer happens to remember, and memory is a bad place to keep a safety property. But that is a guess. If your team has a real mechanism, an MCP lookup like Selvedge, a lint rule, a convention, I want to hear which one and whether it has actually stopped a re-add in practice.

If this was useful, follow along for the next runnable gate in this series, and tell me in the comments: has your agent ever re-proposed a change you had already reverted, and did anything catch it before it hit review? I read every one.

Gate Agent Evals by Severity, Not a Flat Pass-Rate

Alexey Spinov — Wed, 08 Jul 2026 01:17:43 +0000

Gating an agent's eval run by severity means the deploy decision reads the per-severity distribution of the failures, not one flat pass-rate. severity_gate.py is an offline, stdlib-only tool that returns SHIP, REVIEW, or BLOCK. In this post's fixtures, flipping one severity_class field to critical turns SHIP into BLOCK while the pass-rate holds at 92.5%.

AI disclosure: I wrote severity_gate.py with an AI assistant and ran it myself, offline, before publishing. Every output block below is pasted from a real local run on Python 3.13.5, standard library only, no network. I checked the exit codes (0 / 1 / 2), hashed the STDOUT of each scenario twice to confirm it is byte-for-byte deterministic, and edited every line. The 512 / 31 / 6 figures and the "rounding error" line belong to Ethan Walker's Dev.to post, not to me; I link the primary sources and keep their numbers in their own paragraphs, away from my fixture counts (40 / 3 / 2, 92.5%).

In short:

Whether a run is safe to ship is a property of which failures it holds, not of their average. A run can sit at a healthy pass-rate and still carry the one failure that should have stopped the deploy on its own.
The tool reads a finished eval run (JSON) plus a small policy. Any FAIL in a blocking severity class (default: critical) is a BLOCK, whatever the aggregate says. With no blocking fail, a FAIL in a review class (default: high) or a below-threshold aggregate is a REVIEW. Anything else is a SHIP.
The demo: same 40 cases, same 37/40 = 92.5% pass-rate, same three failing cases. Change one field, severity_class from low to critical on two of them, and the verdict flips SHIP to BLOCK. The pass-rate does not move. The diff is two lines.
It re-weights failures you already labeled. It does not detect PII, does not judge correctness, does not find new failures. Garbage labels in, garbage gate out.
Standard library only (json, sys). Offline, keyless, read-only, zero network, deterministic STDOUT. Exit 0 / 1 / 2 for a CI gate. The tool and every fixture are in this post.

A flat pass-rate measures the run. It was never the gate.

Here is the ritual I keep seeing at the end of an agent's CI pipeline. The eval suite runs, a number comes out, someone reads it, and if it clears the bar the change ships. The number is a mean: passes over total. It is a fine thing to track. It tells you, roughly, whether the run got better or worse than last week.

It is a bad thing to gate on, and the reason is simple arithmetic. A mean gives every failure the same weight. So a run where one test failed on trailing whitespace and a run where one test leaked a customer's email into a log both cost the same amount of pass-rate. If your suite is big enough, the second one rounds off to nothing. The one failure that should stop the deploy is invisible at the resolution of the average.
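The arithmetic is easy to check. A minimal sketch, using made-up verdict lists that are not from any real run: two runs each fail one case out of 200, and the flat pass-rate cannot tell which of them leaked an email.

```python
# Two hypothetical 200-case runs, each with exactly one failure.
# The category names are illustrative only.
whitespace_run = [("pass", "general")] * 199 + [("fail", "trailing_whitespace")]
pii_run = [("pass", "general")] * 199 + [("fail", "pii_leak")]


def pass_rate(run):
    """Flat pass-rate: passes over total, every failure weighted the same."""
    return sum(1 for verdict, _ in run if verdict == "pass") / len(run)


print("%.1f%%" % (pass_rate(whitespace_run) * 100))
print("%.1f%%" % (pass_rate(pii_run) * 100))
print(pass_rate(whitespace_run) == pass_rate(pii_run))
```

Both runs print the same rate, and the comparison is True. Nothing in the scalar records which category failed.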

That is the whole gap this tool sits in. The pass-rate is tracking: a post-hoc measurement of what happened. The decision about what ships is control, and control has to see the distribution the mean flattens. severity_gate.py reads the finished eval results and makes the ship decision on the failures' severity, not their count.

The rule the mean can't express

The tool keeps the pass-rate on screen, for reference, and then ignores it for the parts of the decision the mean can't reach. Three verdicts, one policy:

BLOCK when at least one failing case is in a blocking severity class. Default policy: critical. This fires no matter what the aggregate is. One critical fail out of ten thousand green cases is still a BLOCK.
REVIEW when nothing blocking failed, but either a case in a review class failed (default: high), or the flat pass-rate dropped below ship_threshold. A human looks. It is not an auto-ship and it is not a hard stop.
SHIP only when nothing blocking or review-class failed and the pass-rate meets the threshold.

The severity labels are yours. The tool does not invent them and does not guess them; it reads the severity_class field you put on each case. That is the load-bearing limit, and I come back to it at the end. What the tool contributes is that once you have committed to a label, the ship decision follows it deterministically instead of dissolving into an average.

Run it in sixty seconds

No keys. No network. No install beyond Python. Save the file, point it at a results JSON, run one command. Here is the whole thing, one file, standard library only:

#!/usr/bin/env python3
"""
severity_gate.py -- an offline deploy gate that reads a FINISHED eval run and
decides SHIP / REVIEW / BLOCK from the per-severity distribution of the failures,
not from a single flat pass-rate.

It reads a JSON file of eval RESULTS (it never runs the agent or the eval) plus
an optional policy file. For every case it reads a verdict (pass/fail) and the
severity_class the USER assigned (critical / high / medium / low). It reports the
flat pass-rate for reference, then makes the ship decision on a rule the mean
cannot express:

  BLOCK   -- at least one FAIL whose severity_class is in the policy's block_on
             set (default: critical). Fires regardless of the aggregate.
  REVIEW  -- no blocking fail, but either a FAIL in review_on (default: high),
             or the flat pass-rate is below ship_threshold. A human looks; it is
             not an auto-ship and not a hard block.
  SHIP    -- no blocking fail, no review fail, and pass-rate >= ship_threshold.

The point the tool exists to make: flip one field -- the severity_class on a
handful of already-failing cases -- and the verdict moves SHIP -> BLOCK while the
pass-rate does not move at all. The aggregate MEASURES the run; it was never the
control over what ships.

Offline. Keyless. Read-only. Zero network. Standard library only (json, sys).
No subprocess, no exec, no eval, no import of the analyzed results, no model, no
DB. The fixtures are read with json.load; they are DATA, never executed. Output
is byte-for-byte deterministic across runs.

It does NOT detect PII, does NOT judge whether a verdict is correct, and does NOT
find new failures. It re-weights failures YOU already labeled. Garbage labels in,
garbage gate out. It decouples the ship decision from the scalar mean; it does
not discover anything.

Exit codes (usable as a CI gate):
  0  SHIP
  1  REVIEW or BLOCK  (the verdict text says which, and why)
  2  bad input (missing / unreadable / unparseable file, missing field, unknown
     verdict or severity_class) -- fail-closed

Usage:
  python3 severity_gate.py <results.json> [policy.json]
"""

import json
import sys

DEFAULT_POLICY = {
    "ship_threshold": 0.90,
    "severity_order": ["low", "medium", "high", "critical"],
    "block_on": ["critical"],
    "review_on": ["high"],
}

VALID_VERDICTS = ("pass", "fail")


def _bad(msg):
    print("ERROR: " + msg)
    raise SystemExit(2)


def _load_json(path, what):
    try:
        with open(path, "r") as fh:
            return json.load(fh)
    except OSError as exc:
        _bad("cannot read %s %s: %s" % (what, path, exc))
    except ValueError as exc:
        _bad("cannot parse %s %s: %s" % (what, path, exc))


def load_policy(path):
    if path is None:
        return dict(DEFAULT_POLICY)
    data = _load_json(path, "policy")
    if not isinstance(data, dict):
        _bad("policy %s is not a JSON object" % path)
    pol = dict(DEFAULT_POLICY)
    pol.update(data)
    order = pol["severity_order"]
    if not isinstance(order, list) or not order:
        _bad("policy severity_order must be a non-empty list")
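    for key in ("block_on", "review_on"):
        # A non-list value (null, a bare string) must fail closed with exit 2
        # instead of raising a traceback or being iterated char by char.
        if not isinstance(pol[key], list):
            _bad("policy %s must be a list" % key)
        for cls in pol[key]:
            if cls not in order:
                _bad("policy %s lists '%s', not in severity_order %s"
                     % (key, cls, order))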
    try:
        pol["ship_threshold"] = float(pol["ship_threshold"])
    except (TypeError, ValueError):
        _bad("policy ship_threshold must be a number")
    return pol


def load_cases(path, order):
    data = _load_json(path, "results")
    cases = data.get("cases") if isinstance(data, dict) else data
    if not isinstance(cases, list) or not cases:
        _bad("results %s must hold a non-empty list of cases" % path)
    default_sev = order[0]
    out = []
    for i, c in enumerate(cases):
        if not isinstance(c, dict):
            _bad("case %d is not an object" % i)
        verdict = c.get("verdict")
        if verdict not in VALID_VERDICTS:
            _bad("case %s has verdict %r, expected 'pass' or 'fail'"
                 % (c.get("case_id", i), verdict))
        sev = c.get("severity_class", default_sev)
        if sev not in order:
            _bad("case %s has severity_class %r, not in severity_order %s"
                 % (c.get("case_id", i), sev, order))
        out.append({
            "case_id": str(c.get("case_id", "case-%d" % i)),
            "verdict": verdict,
            "severity_class": sev,
            "category": str(c.get("category", "")),
        })
    return out


def decide(cases, pol):
    block_on = set(pol["block_on"])
    review_on = set(pol["review_on"])
    total = len(cases)
    passed = sum(1 for c in cases if c["verdict"] == "pass")
    rate = passed / total

    block_fails = [c for c in cases
                   if c["verdict"] == "fail" and c["severity_class"] in block_on]
    review_fails = [c for c in cases
                    if c["verdict"] == "fail" and c["severity_class"] in review_on]

    if block_fails:
        verdict, code = "BLOCK", 1
        reason = ("%d failing case(s) in a blocking severity class (%s)"
                  % (len(block_fails), ", ".join(sorted(block_on))))
    elif review_fails:
        verdict, code = "REVIEW", 1
        reason = ("%d failing case(s) in a review severity class (%s)"
                  % (len(review_fails), ", ".join(sorted(review_on))))
    elif rate < pol["ship_threshold"]:
        verdict, code = "REVIEW", 1
        reason = ("flat pass-rate %.1f%% is below ship_threshold %.1f%%"
                  % (rate * 100, pol["ship_threshold"] * 100))
    else:
        verdict, code = "SHIP", 0
        reason = ("no blocking or review-class failure, pass-rate %.1f%% "
                  ">= ship_threshold %.1f%%" % (rate * 100,
                                                pol["ship_threshold"] * 100))

    return {"total": total, "passed": passed, "failed": total - passed,
            "rate": rate, "verdict": verdict, "reason": reason, "code": code}


def render(cases, pol, d):
    order = pol["severity_order"]
    rank = {c: i for i, c in enumerate(order)}
    block_on, review_on = set(pol["block_on"]), set(pol["review_on"])
    out = ["SEVERITY-GATE REPORT"]
    out.append("cases: %d   pass: %d   fail: %d"
               % (d["total"], d["passed"], d["failed"]))
    out.append("flat pass-rate: %.1f%% (%d/%d)"
               % (d["rate"] * 100, d["passed"], d["total"]))
    out.append("ship_threshold: %.1f%%   block_on: [%s]   review_on: [%s]"
               % (pol["ship_threshold"] * 100,
                  ", ".join(pol["block_on"]), ", ".join(pol["review_on"])))
    out.append("per-severity (fail / total):")
    for cls in reversed(order):  # worst first
        tot = sum(1 for c in cases if c["severity_class"] == cls)
        fl = sum(1 for c in cases
                 if c["severity_class"] == cls and c["verdict"] == "fail")
        mark = "  <- blocking" if cls in block_on else (
            "  <- review" if cls in review_on else "")
        out.append("  %-8s %d / %d%s" % (cls, fl, tot, mark))
    fails = [c for c in cases if c["verdict"] == "fail"]
    fails.sort(key=lambda c: (-rank[c["severity_class"]], c["case_id"]))
    if fails:
        out.append("failing cases (worst severity first):")
        for c in fails:
            cat = (" [%s]" % c["category"]) if c["category"] else ""
            out.append("  - %-8s %s%s" % (c["severity_class"], c["case_id"], cat))
    out.append("decision: %s -- %s" % (d["verdict"], d["reason"]))
    return "\n".join(out)


def main(argv):
    if len(argv) not in (2, 3):
        print("usage: severity_gate.py <results.json> [policy.json]")
        raise SystemExit(2)
    pol = load_policy(argv[2] if len(argv) == 3 else None)
    cases = load_cases(argv[1], pol["severity_order"])
    d = decide(cases, pol)
    print(render(cases, pol, d))
    raise SystemExit(d["code"])


if __name__ == "__main__":
    main(sys.argv)

The inputs are eval results, one JSON object per run: a cases list where each case has a verdict and a severity_class. To keep the runs below reproducible byte for byte, I generate the sample files with a tiny builder that only writes data. The gate reads that data; it never runs it.

#!/usr/bin/env python3
# make_fixtures.py -- writes the sample eval-result files this post runs on.
# It only emits DATA. severity_gate.py reads it with json.load, never executes it.
import json
import os

os.makedirs("fixtures", exist_ok=True)


def case(i, verdict="pass", sev="low", cat="general"):
    return {"case_id": "c%03d" % i, "verdict": verdict,
            "severity_class": sev, "category": cat}


def write(name, cases):
    with open("fixtures/%s" % name, "w") as fh:
        json.dump({"cases": cases}, fh, indent=2, sort_keys=True)
        fh.write("\n")


# 40 cases, 3 fail -> flat pass-rate 37/40 = 92.5%. Two fails are a PII cluster
# left untriaged (severity low); one is an unrelated format miss (medium).
base = [case(i) for i in range(1, 41)]
base[9] = case(10, "fail", "medium", "output_format")
base[19] = case(20, "fail", "low", "pii_leak")
base[29] = case(30, "fail", "low", "pii_leak")
write("ship_run.json", base)

# Same 40 cases, same 37/40 = 92.5%. ONLY the two PII fields change: low->critical.
killer = [dict(c) for c in base]
killer[19]["severity_class"] = "critical"
killer[29]["severity_class"] = "critical"
write("block_run.json", killer)

# Aggregate below the bar, nothing critical/high: 34/40 = 85.0%.
review = [case(i) for i in range(1, 41)]
for i in (5, 12, 18, 23, 31, 37):
    review[i - 1] = case(i, "fail", "medium", "verbosity")
write("review_run.json", review)

# A critical-severity case that PASSED must not block. 20 cases, one low fail.
edge = [case(i) for i in range(1, 21)]
edge[0] = case(1, "pass", "critical", "pii_leak")
edge[14] = case(15, "fail", "low", "typo")
write("edge_run.json", edge)

# Unknown severity label -> fail-closed (exit 2).
bad = [case(1), case(2, "fail", "catastrophic", "pii_leak")]
write("bad_run.json", bad)

The baseline: a green run that isn't safe to ship

Start with ship_run.json: 40 cases, 37 pass, 3 fail. The pass-rate is 37/40 = 92.5%, comfortably over the 90% bar. Two of the three failures are a pii_leak cluster that nobody has triaged yet, so they carry the default low severity. The third is a formatting miss, medium. My fixture, my run:

$ python3 severity_gate.py fixtures/ship_run.json
SEVERITY-GATE REPORT
cases: 40   pass: 37   fail: 3
flat pass-rate: 92.5% (37/40)
ship_threshold: 90.0%   block_on: [critical]   review_on: [high]
per-severity (fail / total):
  critical 0 / 0  <- blocking
  high     0 / 0  <- review
  medium   1 / 1
  low      2 / 39
failing cases (worst severity first):
  - medium   c010 [output_format]
  - low      c020 [pii_leak]
  - low      c030 [pii_leak]
decision: SHIP -- no blocking or review-class failure, pass-rate 92.5% >= ship_threshold 90.0%

Exit 0. SHIP. This is the run that ships today under a flat pass-rate: the number is green, nothing is labeled critical, out the door it goes. The two pii_leak cases are sitting right there in the report, but at severity low they are just two more failures the mean absorbed.

One field flips ship to block

This is the demo the post exists for. block_run.json is ship_run.json with exactly one thing changed: the two pii_leak cases, the ones already failing, are relabeled from low to critical. Same 40 cases. Same 37 passes. Same 92.5%. The diff:

$ diff fixtures/ship_run.json fixtures/block_run.json
120c120
<       "severity_class": "low",
---
>       "severity_class": "critical",
180c180
<       "severity_class": "low",
---
>       "severity_class": "critical",

Two lines. Nothing about the pass-rate changed, because nothing about the pass/fail counts changed. Now run the gate on it:

$ python3 severity_gate.py fixtures/block_run.json
SEVERITY-GATE REPORT
cases: 40   pass: 37   fail: 3
flat pass-rate: 92.5% (37/40)
ship_threshold: 90.0%   block_on: [critical]   review_on: [high]
per-severity (fail / total):
  critical 2 / 2  <- blocking
  high     0 / 0  <- review
  medium   1 / 1
  low      0 / 37
failing cases (worst severity first):
  - critical c020 [pii_leak]
  - critical c030 [pii_leak]
  - medium   c010 [output_format]
decision: BLOCK -- 2 failing case(s) in a blocking severity class (critical)

Exit 1. BLOCK. The pass-rate line is identical to the SHIP run, character for character: flat pass-rate: 92.5% (37/40). The deploy verdict inverted anyway. This is the thing to sit with. If the mean were a valid ship criterion, a change that leaves the mean untouched could not change the decision. It changed the decision. So the mean was not the criterion; it was a number we let stand in for one.

And it would take one, not two. Relabel a single pii_leak case to critical and the gate blocks, because the policy tolerates zero critical failures. The cluster size is not the point. The point is that a severity edit with no effect on the aggregate has total effect on the verdict.
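To see that one relabel is enough, here is a condensed stand-in for the decision rule with the default policy hard-coded. It is a sketch for this paragraph, not a replacement for severity_gate.py; the function name verdict_for and the inline case list are mine.

```python
def verdict_for(cases, block_on=("critical",), review_on=("high",),
                ship_threshold=0.90):
    """Condensed sketch of the BLOCK / REVIEW / SHIP rule described above."""
    fails = [c for c in cases if c["verdict"] == "fail"]
    rate = (len(cases) - len(fails)) / len(cases)
    if any(c["severity_class"] in block_on for c in fails):
        return "BLOCK"
    if any(c["severity_class"] in review_on for c in fails) or rate < ship_threshold:
        return "REVIEW"
    return "SHIP"


# Same shape as the baseline fixture: 40 cases, 3 failing, 37/40 = 92.5%.
cases = [{"case_id": "c%03d" % i, "verdict": "pass", "severity_class": "low"}
         for i in range(1, 41)]
cases[9].update(verdict="fail", severity_class="medium")
cases[19].update(verdict="fail")
cases[29].update(verdict="fail")

before = verdict_for(cases)
cases[19]["severity_class"] = "critical"   # relabel ONE failing case only
after = verdict_for(cases)
print(before, "->", after)
```

The pass/fail counts are untouched between the two calls; only one label moved.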

The falsifiability test: when nothing is critical, the aggregate still rules

If the tool blocked on everything, or ignored the pass-rate entirely, it would be a different kind of useless. So here is the counter-case that would break the thesis if it came out wrong. review_run.json drops the aggregate below the bar, 34/40 = 85.0%, but every failure is benign: six medium cases, nothing critical, nothing high.

$ python3 severity_gate.py fixtures/review_run.json
SEVERITY-GATE REPORT
cases: 40   pass: 34   fail: 6
flat pass-rate: 85.0% (34/40)
ship_threshold: 90.0%   block_on: [critical]   review_on: [high]
per-severity (fail / total):
  critical 0 / 0  <- blocking
  high     0 / 0  <- review
  medium   6 / 6
  low      0 / 34
failing cases (worst severity first):
  - medium   c005 [verbosity]
  - medium   c012 [verbosity]
  - medium   c018 [verbosity]
  - medium   c023 [verbosity]
  - medium   c031 [verbosity]
  - medium   c037 [verbosity]
decision: REVIEW -- flat pass-rate 85.0% is below ship_threshold 90.0%

Exit 1, but REVIEW, not BLOCK, and the reason names the aggregate, not a severity class. This is the third state doing real work. The gate did not throw the pass-rate away; when nothing dangerous failed, the aggregate is exactly what it falls back on. REVIEW is where a low mean lands. BLOCK is reserved for the failure class you said must never ship. Put this run next to the BLOCK run and the tool's shape is clear: it answers to the distribution of severity when a blocking or review-class failure is present, and to the mean when none is.

A critical case that passed is not a block

One boundary a careful reader will poke at. Does BLOCK fire on the presence of a critical case, or on a critical failure? It has to be the failure, or you could never ship a suite that tests a critical path at all. edge_run.json has a critical-severity case that passed (your PII redaction test ran and came back clean) and one unrelated low failure.

$ python3 severity_gate.py fixtures/edge_run.json
SEVERITY-GATE REPORT
cases: 20   pass: 19   fail: 1
flat pass-rate: 95.0% (19/20)
ship_threshold: 90.0%   block_on: [critical]   review_on: [high]
per-severity (fail / total):
  critical 0 / 1  <- blocking
  high     0 / 0  <- review
  medium   0 / 0
  low      1 / 19
failing cases (worst severity first):
  - low      c015 [typo]
decision: SHIP -- no blocking or review-class failure, pass-rate 95.0% >= ship_threshold 90.0%

Exit 0. SHIP. Note the per-severity line: critical 0 / 1 — one critical case exists, zero critical cases failed. A passing test on a critical path is exactly the signal you want; blocking on it would punish you for having good coverage. The gate blocks on critical fail, not on critical present.

Who decides what blocks is policy, not the tool

The default policy blocks on critical and reviews on high. That default is mine, and you should not keep it without thinking. The mapping from a failure category to a severity class, and the choice of which classes stop a deploy, are decisions your team owns, not something a linter should hand you. So the policy is a file you pass in. Here is fixtures/strict_policy.json, a three-key policy that also blocks on high and sends medium to review:

{
  "ship_threshold": 0.90,
  "block_on": ["critical", "high"],
  "review_on": ["medium"]
}

Point the same ship_run.json at it:

$ python3 severity_gate.py fixtures/ship_run.json fixtures/strict_policy.json
SEVERITY-GATE REPORT
cases: 40   pass: 37   fail: 3
flat pass-rate: 92.5% (37/40)
ship_threshold: 90.0%   block_on: [critical, high]   review_on: [medium]
per-severity (fail / total):
  critical 0 / 0  <- blocking
  high     0 / 0  <- blocking
  medium   1 / 1  <- review
  low      2 / 39
failing cases (worst severity first):
  - medium   c010 [output_format]
  - low      c020 [pii_leak]
  - low      c030 [pii_leak]
decision: REVIEW -- 1 failing case(s) in a review severity class (medium)

Exit 1. Same run, same 92.5%, but under a policy that treats a medium format miss as review-worthy, the verdict is REVIEW. The tool did not decide that a format miss matters. You did, in the policy file. The gate just applied it the same way every time.
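One detail worth noticing: strict_policy.json never mentions severity_order. The tool lays the file over its defaults with a dict update, so any key the file omits keeps its default value. A minimal sketch of that merge, with both dicts copied from this post and the variable names strict and effective chosen for illustration:

```python
DEFAULT_POLICY = {
    "ship_threshold": 0.90,
    "severity_order": ["low", "medium", "high", "critical"],
    "block_on": ["critical"],
    "review_on": ["high"],
}

# Contents of fixtures/strict_policy.json as shown above.
strict = {
    "ship_threshold": 0.90,
    "block_on": ["critical", "high"],
    "review_on": ["medium"],
}

effective = dict(DEFAULT_POLICY)  # copy first, so the defaults are not mutated
effective.update(strict)          # keys in the file win; missing keys keep defaults

for key in sorted(effective):
    print(key, "=", effective[key])
```

The practical consequence is that a policy file can stay short, but a typo in a key name silently adds an unused key instead of overriding the one you meant.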

Ethan's 94% run, stated as his

I did not come up with this from nothing. On July 5, Ethan Walker published a Dev.to post whose title is the whole argument in one line: a 94% pass rate hid a PII leak in 6 test cases. His numbers, from his run, stated as his: 512 test cases, 31 failures, and 6 of those were a PII leak. His line, which I checked against the post before quoting: "Six out of 512 is a rounding error against a flat pass-rate metric." He also names the tools (DeepEval, Promptfoo, LangSmith) and writes that they "all give you this by default; none of them force severity weighting on you."

Those figures are his and none of the counts in my runs above are. I built fixtures of a different size on purpose so the two never blur: my numbers are 40 / 3 / 2 at 92.5%, computed by my tool on my synthetic files; his are 512 / 31 / 6 at 94%, from his own eval suite. The tool in this post is one narrow, runnable answer to the thing his title describes: the point where a small critical cluster hides inside a green aggregate.

The general form of it was put well the same week by another writer, posting as aiexplore369zoho, under the heading the mean is lying to you. Their framing: "A benchmark score is a mean." Reliability, they argue, is a tail statistic, not a central one; the mean can stay flat while the failure rate on a critical slice doubles. That is the concept. Naming the slice that must not fail and gating on it directly, instead of on the average that buries it, is one way to act on it.

Why not just add a smarter judge?

A fair objection: skip the labels, put a stronger model in the loop as an inspector, and let it catch the critical cases. Someone tested exactly that. Posting as zxpmail, they ran three models as agent quality inspectors and reported the opposite of a free lunch: the stronger the model, the more valid work it rejected. Their strongest model reached 0% false positives on garbage but falsely rejected 3 of 4 perfectly valid outputs. Those are their measurements, not mine.

The takeaway I draw for a gate is narrow. A probabilistic inspector trades one error for another and can give you a different answer on reruns. A gate over labels you already committed to is deterministic: same input, same verdict, every run. That determinism is not a nice-to-have here; it is the property the SHA hashes at the end are there to prove.

Where this sits next to the rest

This is a spoke on the pre-execution gate for AI agents cluster, and its object is the deploy decision over a finished eval run. The neighbors ask adjacent questions, and the differences matter:

The green-checkmark auditor is the inverse case, and worth stating plainly so the two do not get merged. There, the green carries no signal at all: mirror tests that restate the code, with nothing that can ever go red. Here, the signal is present — the failing cases are real and correctly failing — and the mean drowns it. One post is about a checkmark that means nothing; this one is about failures that mean something and get averaged into silence.
The eval-contamination probe asks whether the score can even be trusted, whether the harness graded the real work or a fabricated artifact. That is a question about whether the number is valid. This post assumes the number is valid and asks whether a valid mean is a valid ship criterion. Different link in the chain.
Reconciling a scorecard from evidence is the case where the claimed metric might be false — the log does not back it up. Here the 92.5% is true. I am not disputing the number; I am disputing that a true average is what should decide the deploy.
Your agent returns 200 and lies checks one runtime result against its evidence. That is a single call. This is the aggregate over a whole eval run, the pre-deploy decision that sits above all those single calls.

What this is NOT

I would rather undersell this than have you deploy it as something it isn't.

Severity-based gating is not my invention, and I am not claiming it is. CVSS scores security findings by severity; DREAD ranked them; every bug tracker has had a Critical / Major / Minor field for decades; risk-based testing has weighted test outcomes for a very long time. None of that is new. What is narrow here is the target: the agent-eval deploy gate, where a screenshot of a flat pass-rate is still the shipping ritual, packaged as a portable stdlib gate over the JSON your eval tool already emits. I am applying an old principle to a place that mostly still ignores it, not discovering the principle.
It acts only on the labels you assign. It does not detect PII. It does not judge whether a fail is really a fail or a pass is really a pass. It finds no new failures. Mislabel a critical failure as low and it ships; that is the GIGO limit, and it is real. The tool re-weights failures you already found and classified. It decouples the ship decision from the scalar mean. It does not discover anything.
It reads results; it does not run the agent or the eval. It is an offline post-processor for a finished run, meant for the step between "eval done" and "merge/deploy." To generate the verdicts and severities in the first place you still need your eval framework.
It does not replace DeepEval, Promptfoo, or LangSmith. It is a decision layer that consumes their output. Their default is a flat pass-rate; this is the thin gate you bolt on top of the JSON they produce, so the ship decision reads severity instead of the average. That is the whole seam it fills, and nothing more.
The taxonomy and the category-to-severity mapping are policy, not truth. critical / high / medium / low and the choice of what blocks are a file you own. The tool ships a default so it runs out of the box; the default is a starting point to argue with, not a standard.
The counts here are fixture units. The 40 / 3 / 2 and the 92.5% describe the synthetic files in this post. Run it on your own eval output to get numbers that mean something about your agent.

One design choice I am still not certain about: REVIEW and BLOCK share exit 1, and only the verdict text tells them apart. I did that so a CI job has a clean "0 means auto-ship, non-zero means a human decides" contract, with exit 2 reserved for "I could not even read the input." If your pipeline needs to branch differently on REVIEW versus BLOCK, you would want three non-zero codes, and I could see arguing it either way. I went with the two-state contract because it matches the fail-closed shape of the other gates in this series, but I would not fight hard for it.
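If you do need to branch on REVIEW versus BLOCK without changing the tool, one option is a thin wrapper that reads the decision: line from STDOUT. This is a sketch under assumptions: the wrapper, its function names, and the 10 / 20 exit-code mapping are hypothetical, and it relies on the report keeping the "decision: VERDICT -- reason" line format shown in the outputs above.

```python
import subprocess
import sys


def parse_verdict(stdout):
    """Return SHIP / REVIEW / BLOCK from the report's 'decision:' line, else None."""
    for line in stdout.splitlines():
        if line.startswith("decision: "):
            return line[len("decision: "):].split(" -- ", 1)[0]
    return None


def run_gate(results_path, gate="severity_gate.py"):
    """Run the gate and map its verdict to a wrapper-specific exit code."""
    proc = subprocess.run([sys.executable, gate, results_path],
                          capture_output=True, text=True)
    if proc.returncode == 2:
        return 2                      # bad input: keep the fail-closed code
    verdict = parse_verdict(proc.stdout)
    # Hypothetical mapping; anything unrecognized is treated as bad input.
    return {"SHIP": 0, "REVIEW": 10, "BLOCK": 20}.get(verdict, 2)


sample = "flat pass-rate: 92.5% (37/40)\ndecision: BLOCK -- 2 failing case(s)"
print(parse_verdict(sample))
```

Parsing text is more brittle than an exit code, which is an argument for the three-code design if REVIEW and BLOCK really do take different paths in your pipeline.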

Bad input fails closed

A gate that crashes into a green is worse than no gate. Point it at a run with a severity label it does not recognize, and it refuses to decide rather than guess:

$ python3 severity_gate.py fixtures/bad_run.json
ERROR: case c002 has severity_class 'catastrophic', not in severity_order ['low', 'medium', 'high', 'critical']
$ echo $?
2

No path argument, a missing file, an unparseable JSON, a case missing verdict, an unknown severity value: all exit 2, distinct from the exit 1 a REVIEW or BLOCK returns, so your CI can tell "the gate says hold" apart from "the gate could not run." I ran each scenario twice and hashed the full STDOUT both times, on Python 3.13.5, offline: ship_run is cca7a591..., block_run is 346ee92c..., review_run is 6a27e17f..., edge_run is 99f4e8dc..., and ship_run under the strict policy is 35f9f6bb.... Identical across both runs, every time.
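The determinism check itself is a few lines. A sketch of one way to do it, assuming SHA-256 (the text above does not say which SHA variant was used) and a hypothetical helper name stdout_digest. The demo command is a stand-in so the block runs without the gate on disk.

```python
import hashlib
import subprocess
import sys


def stdout_digest(cmd):
    """Run cmd and return the SHA-256 hex digest of its raw STDOUT bytes."""
    proc = subprocess.run(cmd, capture_output=True)
    return hashlib.sha256(proc.stdout).hexdigest()


# Stand-in command; in practice this would be something like
# [sys.executable, "severity_gate.py", "fixtures/ship_run.json"].
cmd = [sys.executable, "-c", "print('decision: SHIP')"]
first = stdout_digest(cmd)
second = stdout_digest(cmd)
print(first == second)
```

Hashing the raw bytes, not decoded text, means a stray trailing space or a changed line ending would also show up as a mismatch.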

The question I actually want answered

Here is the one I do not have a good number for. For teams shipping an agent behind an eval suite: how many of your failing cases are labeled with a severity at all? Not "do you have an eval," which most people say yes to, but whether the pii_leak and the trailing-whitespace failure carry different weights when the deploy decision gets made, or whether they both just dent the same pass-rate. My guess is that most suites are still gating on the flat number and the severity field is empty, but that is a guess, and I would like to know if I am wrong.

If this was useful, follow along here for the next runnable gate in the series, and tell me in the comments where a green aggregate has hidden a failure that should have blocked your deploy. I read every one.

Two SQL Calls, the Same Rows. Only One Has a String to Inject Into.

Alexey Spinov — Mon, 06 Jul 2026 10:09:52 +0000

AI agent SQL injection begins at one call site: where an agent's SQL string reaches a database driver. agent_sql_seam.py, a static ast detector, finds that seam before anything runs and classifies every DB sink as RAW_STRING_TO_DB, PARAM_OK, POLICY_MEDIATED, or UNRESOLVED. In this post's fixtures, one f-string marks a sink RAW; one bound parameter marks it PARAM_OK, same rows either way.

AI disclosure: I wrote agent_sql_seam.py with an AI assistant and ran it myself, offline, before publishing. Every output block below is pasted from a real local run on Python 3.13.5, standard library only, no network. I checked the exit codes (0 / 1 / 2), hashed the STDOUT of each scenario twice to confirm it is byte-for-byte deterministic, and edited every line. The external quotes and the 1034 / 23 / 0 benchmark numbers belong to Dipankar Sarkar's OrmAI writeup, not to me; I link the primary sources and keep their numbers in their own paragraphs, away from my fixture counts.

In short:

Whether a query can be injected into is a property of how the SQL string was assembled, not of what the query returns. Two calls can pull the identical rows and only one of them ever holds a string an attacker's value can land inside.
The tool parses Python with ast and looks at the argument each DB sink receives. An f-string, a % format, a .format(), a concatenation, or a variable assigned to one of those is a seam: RAW_STRING_TO_DB. A literal shipped with bound params is PARAM_OK.
The trap is that %s reads two ways. Inside a string literal it is a driver placeholder and is safe. As the Python % operator on a string it builds the query at call time and is a seam. One is ast.Constant; the other is ast.BinOp.
The demo: one text-to-SQL node, one f-string, exit 1. Change that single line to a bound parameter and the same query over the same rows gives exit 0. The diff is one line long.
Standard library only (ast, os, sys). Offline, keyless, read-only, zero network, deterministic STDOUT. The tool and every fixture are in this post.

AI agent SQL injection lives in how the string is built

Here is the call I keep finding in text-to-SQL nodes. An agent reads a request, pulls out a vendor name, and drops it into a query:

cur.execute(f"SELECT * FROM invoices WHERE vendor = '{agent_vendor}'")

It works in the demo. It passes review, because reviewers read it as "look up invoices by vendor," which is exactly what it does. The rows come back correct. Nothing about the returned data is wrong. And that is the whole problem: correctness of the result tells you nothing about the safety of the assembly. The moment agent_vendor is a value the model produced from an untrusted request, that f-string is a place where a ' and a trailing OR 1=1 -- become part of the SQL text the driver parses.

Now the version that returns the same rows:

cur.execute("SELECT * FROM invoices WHERE vendor = %s", (agent_vendor,))

At the database this runs the same lookup. The vendor value still comes from the agent. The difference is that the value now rides a bound parameter, so it reaches the driver as a value in its own right, not spliced into the SQL your code assembled. The driver, not your f-string, decides how that value is encoded, so it cannot change the query's structure and there is no string for it to land inside. That single move is what closes the hole, and a static parser can see the move on either side without running a line.
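A runnable sketch of both calls, using the standard library's sqlite3 in place of the driver above. Two things here are my assumptions and not part of the example above: the table contents are made up, and sqlite3 spells its placeholder ? where that example uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, vendor TEXT)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [(1, "acme"), (2, "globex")])


def raw_lookup(vendor):
    # Seam: the value is spliced into the SQL text before the driver sees it.
    return conn.execute(
        f"SELECT * FROM invoices WHERE vendor = '{vendor}'").fetchall()


def bound_lookup(vendor):
    # Bound parameter: the value travels separately from the SQL text.
    return conn.execute(
        "SELECT * FROM invoices WHERE vendor = ?", (vendor,)).fetchall()


# A benign value: both calls return the same rows.
print(raw_lookup("acme") == bound_lookup("acme"))

# A hostile value: only the f-string version changes the query's structure.
payload = "nobody' OR 1=1 --"
print(len(raw_lookup(payload)), len(bound_lookup(payload)))
```

With the benign value the two functions are indistinguishable from the outside, which is why reading the returned rows in review tells you nothing about the seam.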

What reaches the driver, exactly

ast gives you the shape of the argument, and the shape is the tell. An f-string with a substitution is an ast.JoinedStr that contains a FormattedValue. A % format is an ast.BinOp whose operator is Mod and whose left side is a string. A .format() is an ast.Call on a string. A concatenation with a variable is an ast.BinOp with Add. Each of those assembles the query at call time. A plain literal is an ast.Constant, and it assembles nothing.
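The shapes listed above can be checked directly. This is a small sketch, separate from the tool, that parses one call per style and names the node type of its first argument; the helper name arg_shape and the sample queries are illustrative.

```python
import ast


def arg_shape(call_src):
    """Name the AST node type of the first positional argument of a call."""
    call = ast.parse(call_src, mode="eval").body
    arg = call.args[0]
    if isinstance(arg, ast.BinOp):
        # Include the operator, since Mod and Add are different seams.
        return "BinOp(%s)" % type(arg.op).__name__
    return type(arg).__name__


SAMPLES = {
    "f-string": 'cur.execute(f"SELECT * FROM t WHERE v = {v}")',
    "%-format": 'cur.execute("SELECT * FROM t WHERE v = %s" % v)',
    ".format()": 'cur.execute("SELECT * FROM t WHERE v = {}".format(v))',
    "concat": 'cur.execute("SELECT * FROM t WHERE v = " + v)',
    "bound": 'cur.execute("SELECT * FROM t WHERE v = %s", (v,))',
}

for label, src in SAMPLES.items():
    print("%-10s %s" % (label, arg_shape(src)))
```

Only the last sample hands the sink an ast.Constant; the other four are expressions that produce the string at call time.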

The case people trip on is %s. Look at these two lines:

cur.execute("SELECT * FROM users WHERE id = %s", (uid,))   # placeholder, safe
cur.execute("SELECT * FROM invoices WHERE id = %d" % row_id)  # % operator, seam

The same % character on the page, opposite verdicts. The %s in the first line sits inside an ast.Constant. It is a DBAPI placeholder, and the value travels in the params tuple. The % in the second line is the Python % operator applied to a string, an ast.BinOp(Mod), and it builds the query text at call time before the driver ever sees it. (That %d coerces row_id to an int, so this exact line is a seam by how the string is assembled, not a proof it can be injected. The tool flags the assembly and leaves exploitability to a separate question, which is the boundary the closing section draws.) If a detector cannot tell these apart, it either cries wolf on every parameterized query or waves through real string formatting. The whole tool rides on that one distinction, and I test it against itself below.
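The parenthetical about %d can be seen in plain Python. A short sketch, with hypothetical helper names, showing that the same assembly style is harmless with %d and a non-numeric string but not with a quoted %s:

```python
def build_by_id(row_id):
    # Same assembly as the flagged line: the query text is built at call time.
    return "SELECT * FROM invoices WHERE id = %d" % row_id


def build_by_vendor(vendor):
    # Identical operator, but %s accepts any text, quotes included.
    return "SELECT * FROM invoices WHERE vendor = '%s'" % vendor


print(build_by_id(7))

try:
    build_by_id("7 OR 1=1")
except TypeError as exc:
    # %d refuses a str, so this payload never reaches the SQL text.
    print("rejected:", type(exc).__name__)

print(build_by_vendor("x' OR 1=1 --"))
```

Both functions are the same node shape, ast.BinOp(Mod), which is why a detector that reads the assembly flags both and leaves the question of reachability to you.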

Run it in sixty seconds

No keys. No network. No install beyond Python. Save the file, point it at a .py file or a directory, run one command. Here is the whole thing, one file, standard library only:

#!/usr/bin/env python3
"""
agent_sql_seam.py -- a static seam detector for the point where an
agent-authored SQL string reaches a database sink, run BEFORE anything ships.

It parses one or more Python files with `ast` (it never executes them) and, for
every DB sink call it finds (`.execute`, `.executemany`, `.executescript`,
`.execute_many`, and a SQLAlchemy `text(...)` wrapper), it classifies the SQL
argument into one of four verdicts:

  RAW_STRING_TO_DB -- the SQL string is BUILT AT CALL TIME: an f-string with a
                      substitution, a `%` format operator on a string, a
                      `.format()` call, string concatenation with a non-constant,
                      or a variable that this file assigns to one of those.
                      There is a string, and something got interpolated into it.
  PARAM_OK         -- a plain string literal shipped with a separate params
                      argument (bound placeholders `%s` / `?` / `:name`), or a
                      fully static literal with no interpolation at all.
  POLICY_MEDIATED  -- the sink receives a query OBJECT from an allowlisted
                      builder / ORM construct, not a bare string. The offline
                      echo of "the database never receives an agent-authored
                      SQL string": there is no string to inject into.
  UNRESOLVED       -- the SQL is a variable whose origin does not trace inside
                      this file (cross-file or dynamic). Counted as a failure,
                      fail-closed, but flagged apart from RAW so you can whitelist
                      it on purpose.

The distinction the whole tool turns on: `%s` INSIDE a string literal is a DB
placeholder and is PARAM_OK; the Python `%` OPERATOR applied to a string is a
runtime format and is RAW. Same two characters, opposite verdicts, because one
is `ast.Constant` and the other is `ast.BinOp(op=ast.Mod)`.

Offline. Keyless. Read-only. Zero network. Standard library only (ast, os, sys).
No subprocess, no exec, no eval, no import of the analyzed code, no model, no DB
connection. It does NOT prove a query is exploitable, does NOT run the agent,
does NOT replace parameterization / an ORM / a policy layer, and does NOT do
full cross-file taint tracking. It flags the seam; the fix is architectural.

Exit codes (usable as a CI gate):
  0  no sink builds a raw SQL string and nothing is unresolved
  1  >=1 RAW_STRING_TO_DB or UNRESOLVED sink
  2  bad input (no path, missing path, unreadable/unparseable file, empty scan)

Usage:
  python3 agent_sql_seam.py <file.py | directory>
"""

import ast
import os
import sys

# Sink method names. Edit for your driver set (asyncpg .fetch, etc.).
EXEC_ATTRS = {"execute", "executemany", "executescript", "execute_many"}

# Callables that return a query OBJECT rather than a raw string: ORM Core
# constructs plus an allowlist of intent/builder functions. Edit for your stack.
POLICY_BUILDERS = frozenset({
    "select", "insert", "update", "delete",
    "build_query", "safe_query", "query_builder", "allowlisted_query",
})

# Worst-case wins when a variable has several assignments (fail-closed).
SEVERITY = {"RAW": 3, "UNRESOLVED": 2, "POLICY": 1, "LITERAL": 0}
LABEL = {"RAW": "RAW_STRING_TO_DB", "UNRESOLVED": "UNRESOLVED",
         "POLICY": "POLICY_MEDIATED", "LITERAL": "PARAM_OK"}


def _bad(msg):
    print("ERROR: " + msg)
    raise SystemExit(2)


def _worst(classes):
    return max(classes, key=lambda c: SEVERITY[c])


def _call_root_name(node):
    """Unwind a call/attribute chain to its root Name id, e.g.
    select(x).where(y) -> 'select'. Returns None if the root is not a Name."""
    cur = node
    while True:
        if isinstance(cur, ast.Call):
            cur = cur.func
        elif isinstance(cur, ast.Attribute):
            cur = cur.value
        elif isinstance(cur, ast.Name):
            return cur.id
        else:
            return None


def _is_str_cj(node):
    """A string constant or an f-string node."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return True
    return isinstance(node, ast.JoinedStr)


def _is_text_call(node):
    if not isinstance(node, ast.Call):
        return False
    f = node.func
    if isinstance(f, ast.Name):
        return f.id == "text"
    if isinstance(f, ast.Attribute):
        return f.attr == "text"
    return False


def _is_execute_call(node):
    return (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in EXEC_ATTRS)


def _flatten_add(node):
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return _flatten_add(node.left) + _flatten_add(node.right)
    return [node]


def build_assign_map(tree):
    """name -> list of assigned value nodes (module + function scope, flattened,
    one hop, shallow on purpose). Plus the set of names built by `+=`."""
    assign = {}
    aug_add = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and len(node.targets) == 1 \
                and isinstance(node.targets[0], ast.Name):
            assign.setdefault(node.targets[0].id, []).append(node.value)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name) \
                and node.value is not None:
            assign.setdefault(node.target.id, []).append(node.value)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name) \
                and isinstance(node.op, ast.Add):
            aug_add.add(node.target.id)
    return assign, aug_add


def _name_is_str(name, assign):
    """One hop: does this name resolve to a string literal / f-string?"""
    for val in assign.get(name, []):
        if _is_str_cj(val):
            return True
    return False


def classify_expr_direct(node, assign):
    """Direct class of an expression node, or None if it needs a name trace.
    Returns one of RAW / LITERAL / POLICY / None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return "LITERAL"
    if isinstance(node, ast.JoinedStr):
        if any(isinstance(v, ast.FormattedValue) for v in node.values):
            return "RAW"
        return "LITERAL"
    if isinstance(node, ast.BinOp):
        if isinstance(node.op, ast.Mod):
            left = node.left
            if _is_str_cj(left) or (isinstance(left, ast.Name)
                                    and _name_is_str(left.id, assign)):
                return "RAW"
            return None
        if isinstance(node.op, ast.Add):
            leaves = _flatten_add(node)
            has_str = any(_is_str_cj(l) for l in leaves) or any(
                isinstance(l, ast.Name) and _name_is_str(l.id, assign)
                for l in leaves)
            has_dyn = any(not isinstance(l, ast.Constant) for l in leaves)
            if has_str and has_dyn:
                return "RAW"
            if has_str and not has_dyn:
                return "LITERAL"
            return None
        return None
    if isinstance(node, ast.Call):
        f = node.func
        if isinstance(f, ast.Attribute) and f.attr == "format":
            if _is_str_cj(f.value) or isinstance(f.value, ast.Name):
                return "RAW"
            return None
        if _call_root_name(node) in POLICY_BUILDERS:
            return "POLICY"
        return None
    return None


def _argkind(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return "string literal"
    if isinstance(node, ast.JoinedStr):
        return "f-string"
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mod):
        return "%-format"
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return "string concat"
    if isinstance(node, ast.Call):
        f = node.func
        if isinstance(f, ast.Attribute) and f.attr == "format":
            return ".format()"
        root = _call_root_name(node)
        if root in POLICY_BUILDERS:
            return "builder call"
        return "call '%s()'" % (root or "?")
    if isinstance(node, ast.Name):
        return "variable '%s'" % node.id
    return type(node).__name__


def _raw_reason(argkind):
    table = {
        "f-string": "f-string interpolation builds the SQL string at call time",
        "%-format": "%-operator formatting builds the SQL string at call time",
        "string concat": "string concatenation builds the SQL string at call time",
        ".format()": ".format() builds the SQL string at call time",
    }
    return table.get(argkind, "the SQL string is built at call time")


def _trace_name(name, assign, aug_add):
    ak = "variable '%s'" % name
    if name not in assign and name not in aug_add:
        return ("UNRESOLVED", ak,
                "variable '%s' has no in-file assignment "
                "(cross-file or dynamic origin)" % name)
    classes, kinds = set(), []
    if name in aug_add:
        classes.add("RAW")
        kinds.append("augmented concat")
    for val in assign.get(name, []):
        d = classify_expr_direct(val, assign)
        classes.add(d if d is not None else "UNRESOLVED")
        kinds.append(_argkind(val))
    worst = _worst(classes)
    joined = "/".join(sorted(set(kinds)))
    if worst == "RAW":
        return "RAW", ak, ("variable '%s' is assigned a runtime-built SQL string "
                           "(%s)" % (name, joined))
    if worst == "UNRESOLVED":
        return "UNRESOLVED", ak, ("variable '%s' is assigned from %s the linter "
                                  "cannot resolve" % (name, joined))
    if worst == "POLICY":
        return "POLICY", ak, "variable '%s' routed through an allowlisted builder" % name
    return "LITERAL", ak, "variable '%s' is a static literal" % name


def classify_sink_arg(node, assign, aug_add, has_params):
    """Return (class, argkind, note) for the SQL argument of a sink."""
    if _is_text_call(node):
        inner = node.args[0] if node.args else None
        if inner is None:
            return "UNRESOLVED", "text()", "text() called with no SQL argument"
        return classify_sink_arg(inner, assign, aug_add, has_params=True)
    direct = classify_expr_direct(node, assign)
    ak = _argkind(node)
    if direct == "RAW":
        return "RAW", ak, _raw_reason(ak)
    if direct == "LITERAL":
        return "LITERAL", ak, ("literal + bound params" if has_params else "static literal")
    if direct == "POLICY":
        return "POLICY", ak, "routed through allowlisted builder '%s'" % _call_root_name(node)
    if isinstance(node, ast.Name):
        return _trace_name(node.id, assign, aug_add)
    return ("UNRESOLVED", ak,
            "sink receives a %s the linter cannot resolve to a literal, a "
            "bound-param call, or an allowlisted builder" % ak)


def analyze_file(path, tree):
    assign, aug_add = build_assign_map(tree)
    parent = {}
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            parent[child] = node

    sinks = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        if _is_execute_call(node):
            has_params = len(node.args) > 1
            arg = node.args[0] if node.args else None
            if arg is None:
                klass, ak, note = "UNRESOLVED", "(no arg)", "sink called with no SQL argument"
            else:
                klass, ak, note = classify_sink_arg(arg, assign, aug_add, has_params)
            func = ast.unparse(node.func)
        elif _is_text_call(node):
            p = parent.get(node)
            if p is not None and _is_execute_call(p) and p.args and p.args[0] is node:
                continue  # counted at the execute site
            klass, ak, note = classify_sink_arg(node, assign, aug_add, has_params=True)
            func = ast.unparse(node.func)
        else:
            continue
        sinks.append({"file": path, "line": node.lineno, "func": func,
                      "argkind": ak, "klass": klass, "note": note})
    return sinks


def collect_files(path):
    if os.path.isdir(path):
        files = []
        for root, dirs, names in os.walk(path):
            dirs.sort()
            for name in sorted(names):
                if name.endswith(".py"):
                    files.append(os.path.join(root, name))
        return sorted(files)
    return [path]


def main(argv):
    if len(argv) != 2:
        print("usage: agent_sql_seam.py <file.py | directory>")
        raise SystemExit(2)
    path = argv[1]
    if not os.path.exists(path):
        _bad("path does not exist: %s" % path)
    files = collect_files(path)
    if not files:
        _bad("no .py files found under %s" % path)

    all_sinks = []
    for f in files:
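        try:
            with open(f, "r") as fh:
                src = fh.read()
        except (OSError, UnicodeError) as exc:  # undecodable bytes are bad input too
            _bad("cannot read %s: %s" % (f, exc))
        try:
            tree = ast.parse(src, filename=f)
        except (SyntaxError, ValueError) as exc:  # ValueError: null bytes on older Pythons
            _bad("cannot parse %s: %s" % (f, exc))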
        all_sinks.extend(analyze_file(f, tree))

    all_sinks.sort(key=lambda s: (s["file"], s["line"]))
    counts = {"RAW": 0, "UNRESOLVED": 0, "POLICY": 0, "LITERAL": 0}
    for s in all_sinks:
        counts[s["klass"]] += 1
    fails = counts["RAW"] + counts["UNRESOLVED"]

    out = []
    out.append("AGENT-SQL-SEAM REPORT")
    out.append("files scanned: %d" % len(files))
    out.append("db sinks found: %d" % len(all_sinks))
    out.append("  RAW_STRING_TO_DB: %d" % counts["RAW"])
    out.append("  UNRESOLVED:       %d" % counts["UNRESOLVED"])
    out.append("  POLICY_MEDIATED:  %d" % counts["POLICY"])
    out.append("  PARAM_OK:         %d" % counts["LITERAL"])
    out.append("sinks:")
    for s in all_sinks:
        if s["klass"] in ("RAW", "UNRESOLVED"):
            out.append("  - %s:%d %s(%s) -> %s"
                       % (s["file"], s["line"], s["func"], s["argkind"],
                          LABEL[s["klass"]]))
            out.append("      %s" % s["note"])
        else:
            out.append("  - %s:%d %s(%s) -> %s [%s]"
                       % (s["file"], s["line"], s["func"], s["argkind"],
                          LABEL[s["klass"]], s["note"]))
    if fails:
        out.append("VERDICT: FAIL: %d sink(s) receive a runtime-built or "
                   "unresolved SQL string" % fails)
        out.append("  the same rows ship from a parameterized call with no "
                   "string to inject into")
        code = 1
    else:
        out.append("VERDICT: PASS: every db sink gets a bound-param literal, a "
                   "static literal, or a policy-mediated query")
        code = 0

    print("\n".join(out))
    raise SystemExit(code)


if __name__ == "__main__":
    main(sys.argv)

The baseline: every sink is safe

Start with an agent tool that does nothing wrong. It reads a balance with a bound %s param, lists invoices with ? placeholders, runs a fully static count(*), and builds one query through a SQLAlchemy select(...) construct instead of a string. Four sinks, four clean assemblies.

$ python3 agent_sql_seam.py fixtures/clean.py
AGENT-SQL-SEAM REPORT
files scanned: 1
db sinks found: 4
  RAW_STRING_TO_DB: 0
  UNRESOLVED:       0
  POLICY_MEDIATED:  1
  PARAM_OK:         3
sinks:
  - fixtures/clean.py:14 cur.execute(string literal) -> PARAM_OK [literal + bound params]
  - fixtures/clean.py:19 cur.execute(string literal) -> PARAM_OK [literal + bound params]
  - fixtures/clean.py:27 cur.execute(string literal) -> PARAM_OK [static literal]
  - fixtures/clean.py:33 conn.execute(variable 'stmt') -> POLICY_MEDIATED [variable 'stmt' routed through an allowlisted builder]
VERDICT: PASS: every db sink gets a bound-param literal, a static literal, or a policy-mediated query

Exit 0. Note the last line. stmt = select(invoices).where(...) is a query object, so the sink never receives a bare string. That is the POLICY_MEDIATED class, and it is the static shadow of the architecture the OrmAI writeup argues for below: hand the driver a structured query, and there is nothing left to inject into. The tool trusts that the named builder is safe, which is a real limit I come back to later.
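How a chained builder call resolves to one name is easy to see in isolation. The function below follows the same unwinding logic as the tool's _call_root_name; the sample expressions are illustrative.

```python
import ast


def call_root_name(node):
    """Unwind Call/Attribute links until a bare Name (or something else) is hit."""
    cur = node
    while True:
        if isinstance(cur, ast.Call):
            cur = cur.func
        elif isinstance(cur, ast.Attribute):
            cur = cur.value
        elif isinstance(cur, ast.Name):
            return cur.id
        else:
            return None


def root_of(expr_src):
    return call_root_name(ast.parse(expr_src, mode="eval").body)


print(root_of("select(invoices).where(invoices.c.vendor == v)"))  # select
print(root_of("db.select(invoices)"))                             # db
print(root_of("', '.join(parts)"))                                # None
```

The check is purely on the name: a method-style db.select(...) resolves to db and would not match the allowlist, while any callable that happens to be named select would.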

One line decides whether there is a string to inject into

This is the demo the post exists for. Two files, a text-to-SQL node that returns invoices for a vendor the agent extracted. They differ by one line:

$ diff fixtures/killer_raw.py fixtures/killer_fixed.py
9c9
<     cur.execute(f"SELECT * FROM invoices WHERE vendor = '{vendor}'")
---
>     cur.execute("SELECT * FROM invoices WHERE vendor = %s", (vendor,))

Same function, same table, same rows for any given vendor, provided the %s matches your driver. %s is the psycopg2 and mysql-connector placeholder; sqlite3 spells the same bind ? (both are PARAM_OK, see the edge fixture). The detector reads the assembly and does not care which style you use. Run the detector on each:

$ python3 agent_sql_seam.py fixtures/killer_raw.py
AGENT-SQL-SEAM REPORT
files scanned: 1
db sinks found: 1
  RAW_STRING_TO_DB: 1
  UNRESOLVED:       0
  POLICY_MEDIATED:  0
  PARAM_OK:         0
sinks:
  - fixtures/killer_raw.py:9 cur.execute(f-string) -> RAW_STRING_TO_DB
      f-string interpolation builds the SQL string at call time
VERDICT: FAIL: 1 sink(s) receive a runtime-built or unresolved SQL string
  the same rows ship from a parameterized call with no string to inject into

Exit 1. Now the fixed twin:

$ python3 agent_sql_seam.py fixtures/killer_fixed.py
AGENT-SQL-SEAM REPORT
files scanned: 1
db sinks found: 1
  RAW_STRING_TO_DB: 0
  UNRESOLVED:       0
  POLICY_MEDIATED:  0
  PARAM_OK:         1
sinks:
  - fixtures/killer_fixed.py:9 cur.execute(string literal) -> PARAM_OK [literal + bound params]
VERDICT: PASS: every db sink gets a bound-param literal, a static literal, or a policy-mediated query

Exit 0. The vendor value still comes from the agent in both files. Nothing about the data changed. The verdict flipped because the seam moved: in the first file the value lands inside the query string, in the second it rides a bound parameter and reaches the driver as data. This is the part I want you to sit with. A behavioral test cannot tell these two apart, because they behave identically on every honest input. The seam only shows up when someone sends a dishonest one, and by then the check you needed should have run days ago.

Four seams in one text-to-SQL node

Real code rarely has one style. Here is a node that leaks the same way four different times: an f-string, a .format(), a concatenation, and a % operator.

def by_vendor(cur, agent_vendor):
    cur.execute(f"SELECT * FROM invoices WHERE vendor = '{agent_vendor}'")

def by_region(cur, region):
    cur.execute("SELECT * FROM invoices WHERE region = '{}'".format(region))

def by_status(cur, status):
    cur.execute("SELECT * FROM invoices WHERE status = '" + status + "'")

def by_id(cur, row_id):
    cur.execute("SELECT * FROM invoices WHERE id = %d" % row_id)

$ python3 agent_sql_seam.py fixtures/violating.py
AGENT-SQL-SEAM REPORT
files scanned: 1
db sinks found: 4
  RAW_STRING_TO_DB: 4
  UNRESOLVED:       0
  POLICY_MEDIATED:  0
  PARAM_OK:         0
sinks:
  - fixtures/violating.py:11 cur.execute(f-string) -> RAW_STRING_TO_DB
      f-string interpolation builds the SQL string at call time
  - fixtures/violating.py:16 cur.execute(.format()) -> RAW_STRING_TO_DB
      .format() builds the SQL string at call time
  - fixtures/violating.py:21 cur.execute(string concat) -> RAW_STRING_TO_DB
      string concatenation builds the SQL string at call time
  - fixtures/violating.py:26 cur.execute(%-format) -> RAW_STRING_TO_DB
      %-operator formatting builds the SQL string at call time
VERDICT: FAIL: 4 sink(s) receive a runtime-built or unresolved SQL string
  the same rows ship from a parameterized call with no string to inject into

Exit 1, four seams, each with the line number and the reason. The %d on line 26 is worth a second look next to the placeholder case, because it is the same % symbol doing the opposite thing.

The falsifiability test: %s is not always a seam

If a bound %s inside a literal got flagged as RAW, the thesis would be broken, and the tool would be noise. So here is the counter-fixture: four calls that all look tempting and are all safe. A %s placeholder with params. A ? placeholder with params. A static SELECT 1. A SQLAlchemy text() around a literal with a :name bind.

def by_id(cur, uid):
    cur.execute("SELECT * FROM users WHERE id = %s", (uid,))

def sqlite_by_id(cur, uid):
    cur.execute("SELECT * FROM users WHERE id = ?", (uid,))

def ping(cur):
    cur.execute("SELECT 1")

def by_name(conn, name):
    conn.execute(text("SELECT * FROM users WHERE name = :name"), {"name": name})

$ python3 agent_sql_seam.py fixtures/edge.py
AGENT-SQL-SEAM REPORT
files scanned: 1
db sinks found: 4
  RAW_STRING_TO_DB: 0
  UNRESOLVED:       0
  POLICY_MEDIATED:  0
  PARAM_OK:         4
sinks:
  - fixtures/edge.py:15 cur.execute(string literal) -> PARAM_OK [literal + bound params]
  - fixtures/edge.py:20 cur.execute(string literal) -> PARAM_OK [literal + bound params]
  - fixtures/edge.py:25 cur.execute(string literal) -> PARAM_OK [static literal]
  - fixtures/edge.py:29 conn.execute(string literal) -> PARAM_OK [literal + bound params]
VERDICT: PASS: every db sink gets a bound-param literal, a static literal, or a policy-mediated query

Exit 0. Put this run next to the %d on line 26 of the violating file. Same % character on the page, and the detector splits them: the %s living inside an ast.Constant is a placeholder, the % operator building an ast.BinOp is a seam. That split is the tool earning its keep. A grep for % cannot do it: it would flag the safe line and the dangerous line alike.
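The grep comparison can be made concrete. A sketch with two illustrative lines that both contain the literal characters %s, checked once with a regex and once by node shape:

```python
import ast
import re

SAFE = 'cur.execute("SELECT * FROM users WHERE id = %s", (uid,))'
SEAM = 'cur.execute("SELECT * FROM users WHERE id = %s" % uid)'


def grep_flags(line):
    # Text search: sees the characters, not what they do.
    return re.search(r"%s", line) is not None


def ast_flags(line):
    # Shape check: flags only the % operator applied to build the argument.
    arg = ast.parse(line, mode="eval").body.args[0]
    return isinstance(arg, ast.BinOp) and isinstance(arg.op, ast.Mod)


for name, line in (("SAFE", SAFE), ("SEAM", SEAM)):
    print(name, "grep:", grep_flags(line), "ast:", ast_flags(line))
```

The regex answers True for both lines; the shape check answers False for the placeholder and True for the operator.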

When the origin will not trace, fail closed

Not every SQL argument is a literal or an obvious expression. Sometimes it is a variable filled by a function this file cannot see:

from templates import load_template

def run(cur, report_name):
    sql = load_template(report_name)
    cur.execute(sql)

The honest answer is "I do not know." load_template lives in another module. Statically, this file cannot say whether it returns a safe constant or a spliced string. The tool refuses to guess in your favor:

$ python3 agent_sql_seam.py fixtures/unresolved.py
AGENT-SQL-SEAM REPORT
files scanned: 1
db sinks found: 1
  RAW_STRING_TO_DB: 0
  UNRESOLVED:       1
  POLICY_MEDIATED:  0
  PARAM_OK:         0
sinks:
  - fixtures/unresolved.py:14 cur.execute(variable 'sql') -> UNRESOLVED
      variable 'sql' is assigned from call 'load_template()' the linter cannot resolve
VERDICT: FAIL: 1 sink(s) receive a runtime-built or unresolved SQL string
  the same rows ship from a parameterized call with no string to inject into

Exit 1, but filed as UNRESOLVED, not RAW. That is deliberate. UNRESOLVED means "the string may be perfectly safe, but this tool cannot see far enough to say so." If you know load_template returns only vetted constants, you whitelist this call on purpose and move on. Fail-closed here means an unknown costs you a review, not a silent pass. A tool that guessed "probably fine" on things it cannot see would be worse than no tool, because it would teach you to trust a green run it did not earn.

How does the detector classify each sink?

Small enough to hold in your head. For each Call node, ask if it is a DB sink: an execute-family attribute call, or a text() wrapper (deduped, so execute(text("...")) is counted once, at the execute). Take the SQL argument. If it is an ast.Constant string, it is PARAM_OK, with the note distinguishing a bound-param literal from a static one. If it is a JoinedStr with a substitution, a BinOp(Mod) on a string, a .format(), or a string concatenation with a non-constant, it is RAW. If it is a call whose root name is in the builder allowlist, it is POLICY_MEDIATED. If it is a variable, trace one hop to its in-file assignments and take the worst class those produce; a variable with no traceable origin is UNRESOLVED. Every RAW or UNRESOLVED sink pushes the exit code to 1.

One design choice a reviewer will poke. Variable tracing is one hop and file-local by construction. If sql is assigned from another variable, or built across two modules, the tool reports UNRESOLVED rather than chasing it. That is not laziness disguised as caution; a shallow tracer that fails closed is honest about its reach, and a deep one that occasionally guesses wrong on a real codebase would hand you false confidence, which is the one thing a security check must never do.
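What "one hop" sees can be sketched in a few lines. This is a simplified stand-in for the tool's assignment map, with a hypothetical helper name, that reports the node type of each value assigned to a name in one file:

```python
import ast

SRC = '''
base = "SELECT * FROM invoices"
sql = base
report = load_template("monthly")
'''


def one_hop(name, tree):
    """Node type names of every value directly assigned to `name` in this file."""
    return [type(node.value).__name__
            for node in ast.walk(tree)
            if isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name]


tree = ast.parse(SRC)
print(one_hop("base", tree))     # ['Constant']: resolves to a literal
print(one_hop("sql", tree))      # ['Name']: a second hop would be needed
print(report := one_hop("report", tree))  # ['Call']: origin is outside this file
print(one_hop("missing", tree))  # []: no in-file assignment at all
```

Only the first case ends on a literal after one hop. The other three are where a shallow tracer stops and reports UNRESOLVED instead of guessing.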

The architecture crowd is drawing the same line

I am not the first to point at this seam, and the strongest recent statement of the fix is not a linter at all. On July 2, Dipankar Sarkar published a benchmark of two ways to let an agent hit a database. I checked the post and the repo myself before quoting.

Their numbers, from their run, on their bench, stated as theirs: they ran an agent against the Spider dataset, 1034 natural language queries. A text-to-SQL setup, where the model writes SQL, executed 23 unsafe operations. A policy layer that routes intent through fixed queries executed 0. Those are OrmAI's figures for OrmAI, not measurements of anyone else's stack, and none of them are mine.

The line I want to borrow is their conclusion about why the second number is 0. Their phrasing, from the writeup: "there is no string to inject into." The database in their design never receives an agent-authored SQL string, because the agent emits a structured intent and the server builds the query from a template it controls. That is the same property my POLICY_MEDIATED class checks for, from the other end. They enforce it at runtime, in their OrmAI repo. I detect, before you ship, which of your sinks already have that property and which still hold a raw string. We are pointed at one seam from two sides, and neither replaces the other: a runtime enforcer stops the call, a static detector tells you where the calls that need fixing live.
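To illustrate the property, not their implementation: a minimal sketch of a policy layer where the agent names an intent and supplies values, and the server owns every SQL string. The intent table, function name, and schema are hypothetical and are not OrmAI's API.

```python
import sqlite3

# Hypothetical intent table: names and SQL are illustrative only.
# Each intent maps to fixed SQL text plus the argument names it binds, in order.
INTENTS = {
    "invoices_by_vendor": ("SELECT id, vendor FROM invoices WHERE vendor = ?",
                           ("vendor",)),
    "invoice_count": ("SELECT count(*) FROM invoices", ()),
}


def run_intent(cur, intent, args):
    """Execute an allowlisted intent; the caller never supplies SQL text."""
    if intent not in INTENTS:
        raise PermissionError("intent not allowlisted: %r" % (intent,))
    sql, fields = INTENTS[intent]
    if set(args) != set(fields):
        raise ValueError("expected arguments %r" % (fields,))
    return cur.execute(sql, tuple(args[name] for name in fields)).fetchall()


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, vendor TEXT)")
cur.executemany("INSERT INTO invoices (vendor) VALUES (?)",
                [("acme",), ("globex",)])

print(run_intent(cur, "invoices_by_vendor", {"vendor": "acme"}))
print(run_intent(cur, "invoices_by_vendor", {"vendor": "x' OR 1=1 --"}))
print(run_intent(cur, "invoice_count", {}))
```

Every sink in this sketch receives a constant from INTENTS, so a static pass over it has only literals to classify, and an intent outside the table is refused before any SQL runs.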

Where this sits next to the rest

This is a spoke in the "pre-execution gate for AI agents" cluster, and its object is the data plane: the SQL string that reaches the driver, and whether the agent authored it. The neighbors, and how this differs:

Write-chain taint lint for agent gates traces who wrote the signals a gate authorizes on. Same provenance instinct, different object: that one asks who filled a trust table, this one asks who assembled a query string.
A gate that compares trace against policy works the control plane: was an action allowed. This works the data plane: can the payload be injected into. Orthogonal questions about the same call.
The lethal trifecta reachability gate asks whether a dangerous capability path exists across a tool manifest. This asks a per-call-site question about one argument, no reachability graph involved.
The blast radius of an agent's API key is the scenario people mean when they say "the agent deleted the prod database." A raw write built from agent output is exactly how a small mistake becomes a large one; this finds the write before it ships.
The supply-chain gate before install fires at a different point in the lifecycle, when a package arrives, not when a query runs. Same fail-closed, pre-execution shape, earlier gate.

What this is NOT

I would rather undersell this than have you deploy it as something it is not.

It is not the first SQL injection linter, and I am not claiming a new detector for hardcoded SQL. Bandit's B608 (hardcoded_sql_expressions) has flagged string-built queries for years, and you should keep running it. What is framed differently here is the agent case: the author of the string is a model at runtime, not a developer at the keyboard; the POLICY_MEDIATED class names an architecture (route intent through a builder, ship no string) that generic SQLi linters do not model; and the whole thing hangs off distinct exit codes meant for a pre-execution gate. Bandit runs in CI too, so the gate itself is not new either. What is new is the framing and the third class, not the discovery that f-strings in execute are dangerous.
It does not prove exploitability. RAW_STRING_TO_DB means a runtime-built string reaches a sink, which is the seam. Whether an attacker can actually reach that argument with a hostile value is a separate question this tool does not answer. A string built entirely from a trusted constant still gets flagged, on purpose, because the tool reads the assembly, not the trust of the inputs.
It is not a runtime interceptor and not OrmAI. It opens no connection, runs no SQL, blocks nothing. It is a static check you run in CI before you ship. To stop a live call you need enforcement in the driver or a policy layer.
It is not a full taint tracker. Variable tracing is one hop, name-only, and file-local. Most of what it cannot follow (cross-file calls, multi-hop assignments, str.join, a .format() on a call) comes back UNRESOLVED and fails closed. A few in-file rewrites slip the other way: an augmented format like sql %= vendor is not tracked, so a name first assigned a literal and then mutated in place can still read as PARAM_OK. If you lean on in-place string mutation right next to a sink, do not trust a green from this tool alone. Real taint engines reconstruct data flow across modules; this classifies what one file makes visible.
It finds sinks by method name, not by object type. The sink set is .execute / .executemany / .executescript / .execute_many plus a text() wrapper. An execute-style call it does not know (asyncpg's .fetch, a Django .raw()) is invisible to it and passes silently, and a non-DB .execute (a subprocess or Redis wrapper) can draw a false flag. A text() assigned to a variable and then executed comes back UNRESOLVED rather than resolved, erring loud rather than quiet. Edit EXEC_ATTRS and POLICY_BUILDERS for your stack; the tool fails closed only inside the names it recognizes.
It checks a builder by name, not by what it returns. POLICY_MEDIATED fires when the call's root name is on the allowlist (select, insert, build_query, and the rest); it does not inspect the body, even one defined in the same file. So a function you happen to name select that returns an f-string, or a build_query that interpolates agent output internally, passes as a clean policy-mediated call it did not earn. The allowlist is a promise you make; the tool only checks that you routed through a name on it.
The counts here are fixture units, not a measurement of anyone's production. The 4 seams and the 1 policy-mediated sink describe the synthetic files in this post. Run it on your own repo to get numbers that mean something about your code.
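
To make the two name-based limits above concrete, here is a minimal sketch of name-only matching. It is not the tool's source: the contents of EXEC_ATTRS and POLICY_BUILDERS are assumed from the names listed in this post, classify_sinks and root_name are hypothetical helpers, and the one-hop variable tracing is left out entirely.

```python
import ast

# Assumed, illustrative values taken from the names this post mentions.
EXEC_ATTRS = {"execute", "executemany", "executescript", "execute_many"}
POLICY_BUILDERS = {"select", "insert", "build_query"}


def root_name(node):
    """Return the leftmost name of a chain such as select(...).where(...)."""
    while True:
        if isinstance(node, ast.Call):
            node = node.func
        elif isinstance(node, ast.Attribute):
            node = node.value
        else:
            break
    return node.id if isinstance(node, ast.Name) else None


def classify_sinks(source):
    """Return (method_name, verdict) for every call whose attribute is a known sink."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        # Matching is on the method name only; the receiver's type is never known.
        if node.func.attr not in EXEC_ATTRS or not node.args:
            continue
        arg = node.args[0]
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            verdict = "PARAM_OK"
        elif isinstance(arg, ast.Call) and root_name(arg) in POLICY_BUILDERS:
            # Only the root name is checked, never the builder's body.
            verdict = "POLICY_MEDIATED"
        elif isinstance(arg, (ast.JoinedStr, ast.BinOp)):
            verdict = "RAW_STRING_TO_DB"
        else:
            verdict = "UNRESOLVED"
        results.append((node.func.attr, verdict))
    return results


SAMPLE = '''
def select(table):
    return f"SELECT * FROM {table}"

cur.execute(select(user_table))
cur.execute(f"SELECT * FROM {t}")
conn.fetch(f"SELECT * FROM {t}")
redis.execute(cmd)
'''

for method, verdict in classify_sinks(SAMPLE):
    print(method, verdict)
```

In this sketch the select() that returns an f-string passes as POLICY_MEDIATED, the conn.fetch() call never appears in the results, and redis.execute(cmd) is flagged even though it is not a database call. Those are the same three failure shapes described above, reproduced under the stated assumptions.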

Bad input fails closed

A gate that crashes open is worse than no gate. Point it at a file that does not parse, and it refuses to analyze rather than pretend the file is clean:

$ python3 agent_sql_seam.py fixtures/bad_syntax.py
ERROR: cannot parse fixtures/bad_syntax.py: invalid syntax (bad_syntax.py, line 1)
$ echo $?
2
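
A rough sketch of how that refusal can be wired, assuming a single path argument. The constant names, load_tree, and main are hypothetical, printing the error to stderr is my choice for the sketch, and exit 0 for a clean scan is an assumption; only exit 1 for a found seam and exit 2 for unreadable input come from the behavior described here.

```python
import ast
import sys

# Exit 1 and exit 2 follow the contract in this post; exit 0 is assumed.
EXIT_CLEAN, EXIT_SEAM, EXIT_INPUT_ERROR = 0, 1, 2


def load_tree(path):
    """Parse a file, or print the reason and return None if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return ast.parse(handle.read(), filename=path)
    except (OSError, SyntaxError, ValueError) as exc:
        # OSError: missing or unreadable file. ValueError also covers
        # UnicodeDecodeError and source containing null bytes.
        print(f"ERROR: cannot parse {path}: {exc}", file=sys.stderr)
        return None


def main(argv):
    """Return the process exit code for one scan."""
    if len(argv) != 2:
        print("ERROR: expected exactly one path", file=sys.stderr)
        return EXIT_INPUT_ERROR
    tree = load_tree(argv[1])
    if tree is None:
        # Never fall through to "clean" when the input could not be analyzed.
        return EXIT_INPUT_ERROR
    # Seam detection over `tree` would run here and return EXIT_SEAM on a finding.
    return EXIT_CLEAN
```

The point of the shape is that every failure path returns before the clean verdict is reachable. A script entry point would pass the result to sys.exit(main(sys.argv)).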

No path argument, a missing path, an unreadable file, an empty scan: all exit 2, distinct from exit 1 so your CI can tell "found a seam" apart from "could not read the input." I ran each fixture twice and hashed the full STDOUT, trailing newline stripped, both times: clean is e7d0ff4f..., violating is 89c00229..., edge is a0b974e8..., unresolved is 4be175f0..., killer_raw is c13cecb7..., killer_fixed is 1f7422bd.... Identical across both runs, on Python 3.13.5, offline.
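
The two-run check can be sketched like this. The post does not say which hash produced the digests above, so SHA-256 is an assumption, and stdout_digest and is_deterministic are hypothetical helper names.

```python
import hashlib
import subprocess
import sys


def stdout_digest(cmd):
    """Run cmd; return (exit code, SHA-256 hex of STDOUT minus one trailing newline)."""
    proc = subprocess.run(cmd, capture_output=True)
    out = proc.stdout
    if out.endswith(b"\n"):
        out = out[:-1]
    return proc.returncode, hashlib.sha256(out).hexdigest()


def is_deterministic(cmd, runs=2):
    """True when every run yields the same exit code and the same STDOUT digest."""
    return len({stdout_digest(cmd) for _ in range(runs)}) == 1


# Stand-in command; in practice this would be the gate run against one fixture.
demo_cmd = [sys.executable, "-c", "print('seam')"]
print(is_deterministic(demo_cmd))
```

Comparing the exit code along with the digest means a run that flips from exit 1 to exit 2 with empty output on both sides still counts as a mismatch.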

The question I actually want answered

Here is the one I do not have a good number for. For teams running a text-to-SQL node or a RAG-to-database tool in production: how many of your DB sinks would come back RAW or UNRESOLVED if you ran this against the repo today? Not "do you parameterize," which everyone says yes to, but the count after the one f-string someone added in a hurry, the .format() that felt harmless, the helper two modules over that the tracer cannot follow. If you export even a rough run, tell me the split between RAW and UNRESOLVED. I suspect UNRESOLVED is the bigger pile for most stacks, and I would like to know if I am wrong.

If this was useful, follow along here for the next runnable gate in the series, and tell me in the comments the strangest place you have found an agent-built SQL string reaching a driver. I read every one.
