Joseph Yeo

Posted on Jun 21 • Edited on Jun 26

The Gate Fired 198 Times. I Called It "Working."

#ai #agents #softwareengineering

Why a blocked count is not a success metric

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

I built a gate to block bad code. It blocked 198 pieces of code, and I took that number as evidence the gate was working well.

Then I opened the blocked cases and read them one by one, checking each against the acceptance criteria for the task it came from. A large share of them weren't bad code. The gate had been wrong often enough that I could no longer read the block count as evidence it was working — it had been firing constantly, exactly as it was designed to fire, and I'd mistaken "it fires a lot" for "it's doing its job." Those are not the same statement.

This is the second post in a short run about something I kept tripping over while building this agent: the things I use to verify my system can themselves be broken, and they tend to break in ways that look like success. The last post was about a test run that lied by passing. This one is about a gate that lied by blocking.

What the gate is, and why I trusted the count

The agent works in a test-driven loop, and one step in that loop is a gate: before certain code is allowed through, the gate checks that it meets a standard, and if it doesn't, it blocks the code and sends the work back to be redone. The gate exists to keep weak or malformed attempts from getting committed.

For a while, the gate's headline number was how many times it had blocked something: 198. I looked at that and felt good about it. The reasoning felt obvious: the gate is catching a lot of bad attempts, so the gate is valuable, so the system is healthier for having it. High block count, hard-working gate, fewer bad commits. Why look closer?

That reasoning has a hole in it — and that hole is what this post is about.

Two claims I'd quietly merged

When I went through the blocked cases — not the count, the actual cases — I found that a large share of what the gate had blocked was work that was fine. Not malformed, not weak. Legitimate attempts that happened to take a shape the gate didn't recognize, so it rejected them.

I want to be precise about what "fine" means here, because "the gate was wrong" needs a ground truth or it's just my opinion. By ground truth I don't mean "I liked the code." I mean each step had explicit acceptance criteria: the targeted tests it was meant to pass, the behavior it was meant to produce, and the stated constraints for that step. A block was a false positive when those criteria were already satisfied and the gate rejected the work anyway, for a property that wasn't part of the task's success condition.

For example: a solution that passed every test it was supposed to pass, but got rejected because its internal structure didn't match what the gate's check expected — the same task-level behavior, in a representation the task itself never required. That isn't a question of taste. The work met the criteria; the gate said no for a reason that sat outside them.
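That definition of a false positive can be written down as a small predicate. This is a minimal sketch, not ForgeFlow's actual code: `BlockedCase`, its fields, and the reason codes are hypothetical names chosen for illustration. It assumes each blocked case records which targeted tests were required, which passed, whether the step's stated constraints held, and the reason the gate gave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockedCase:
    """One blocked attempt (hypothetical record shape, not ForgeFlow's schema)."""
    required_tests: frozenset[str]    # targeted tests the step was meant to pass
    passed_tests: frozenset[str]      # tests the blocked attempt actually passed
    constraints_met: bool             # the step's stated constraints were satisfied
    block_reason: str                 # reason code the gate reported
    criteria_reasons: frozenset[str]  # reasons that belong to the task's success condition


def is_false_positive(case: BlockedCase) -> bool:
    """True when the acceptance criteria were already met and the gate
    blocked for a property outside the task's success condition."""
    criteria_met = case.required_tests <= case.passed_tests and case.constraints_met
    reason_outside_criteria = case.block_reason not in case.criteria_reasons
    return criteria_met and reason_outside_criteria


# Illustrative case: every required test passed, but the gate objected to structure.
example = BlockedCase(
    required_tests=frozenset({"test_parse", "test_roundtrip"}),
    passed_tests=frozenset({"test_parse", "test_roundtrip"}),
    constraints_met=True,
    block_reason="unexpected_structure",
    criteria_reasons=frozenset({"test_failure", "constraint_violation"}),
)
print(is_false_positive(example))  # True
```

The useful part is that "fine" stops being a matter of taste: the predicate only looks at the criteria the task itself declared.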

So the 198 was real. Every one of those blocks happened. What was false was the meaning I'd attached to it. I had collapsed two different claims into one:

1. The gate fired. (True. 198 times. Verifiable.)
2. The firing was justified. (Never checked and, it turned out, often not.)

"It blocked something" and "it was right to block that something" are independent facts. A gate can be extremely active and extremely wrong at the same time. A miscalibrated gate will tend to be both, because the same flaw that makes it reject good work also makes it reject a lot of it. The block count I'd treated as evidence of value is equally consistent with a gate that's simply trigger-happy. The block count alone can't separate a justified rejection from a false positive.

This seems like an easy mistake to make when building guards — linter rules, CI checks, validation layers, policy filters. At least it was in my case, and I suspect it's more common than we'd like to admit. The dashboard shows you activity. Activity feels like protection. But activity is not the same as correct activity, and the dashboard usually doesn't know the difference, so it shows you the comforting number and lets you supply the flattering interpretation.

"Then how did anything get through?"

That's the fair question, and it's the one that finally made me look. If the gate was wrong that often, how did the system make any progress at all?

The answer is uncomfortable: it made progress despite the gate, not because of it. A typical failure looked like this. The first attempt satisfied the tests but used a structure the gate distrusted, so it got blocked. The retry didn't improve the behavior — it just reshaped the same behavior into a form the gate would accept. From the dashboard, that looked like "the gate forced an improvement." From the case review, it was adaptation to the gate. (When even that didn't work, I'd step in and wave the work through, because I could see it was fine — another quiet sign the gate wasn't earning its place.)

That was the tell I'd missed. A gate doing real work makes the loop converge on better code. Mine was making the loop converge on gate-shaped code — which is not the same thing, and is sometimes worse. The retry didn't make the code more correct. It made it more acceptable to the gate.

How I now try to tell a real block from a false one

Catching this forced me to write down what would actually distinguish a justified block from a noisy one. The count clearly wasn't it. What I landed on is a three-part check — not elegant, but it's caught things since, so I'll offer it as a working heuristic for this setup rather than a rule.

1. Look at the distribution of reasons, not the total. For a healthy gate, I'd expect the block reasons to map to substantive defects. If they instead cluster on the same shallow surface feature, regardless of whether the work is good, the gate is probably pattern-matching on the wrong thing instead of judging quality. (This is more useful for a broad quality gate than for a narrow, single-purpose check.)

2. Watch what happens on retry. This turned out to be the most useful signal. In my loop, a justified block tended to make the work stick on the same underlying defect across retries, until that defect was actually addressed; a false positive produced shape-shifting attempts that changed the surface without improving the behavior. It's a tendency, not a law — a model can wander even when the defect is real — but the shape of the retry sequence carried information the single block event didn't.

3. Check final convergence. A justified block should eventually resolve: the work gets rewritten, the real problem gets fixed, and it passes on its own merits. If blocked work never converges — or only "passes" once you weaken the gate — then either the gate was wrong, or it was right and your loop can't act on it. Both are problems, and both are invisible if you only count how many times it fired.

None of these is a clean pass/fail on its own. Together they let me ask the question I'd skipped — was this block justified? — instead of reading the answer off the fact that a block happened.
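The three checks can be sketched roughly as follows. This is an illustration under stated assumptions, not the code running in ForgeFlow: the `Attempt` record, the reason codes, and the labels returned are all hypothetical. It assumes each attempt in a retry sequence records whether the gate blocked it, the gate's reason code, and the set of targeted tests that attempt failed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One attempt in a retry sequence (hypothetical record shape)."""
    blocked: bool
    reason: str                    # gate's reason code; "" when not blocked
    failing_tests: frozenset[str]  # targeted tests this attempt failed


def reason_distribution(sequences: list[list[Attempt]]) -> Counter:
    """Check 1: count block reasons instead of reading the total."""
    return Counter(a.reason for seq in sequences for a in seq if a.blocked)


def retry_shape(seq: list[Attempt]) -> str:
    """Check 2: label what the blocked attempts in one sequence looked like."""
    blocked = [a for a in seq if a.blocked]
    if not blocked:
        return "never_blocked"
    if all(not a.failing_tests for a in blocked):
        # Every blocked attempt already passed its tests, so retries
        # could only change the surface: a false-positive smell.
        return "shape_shifting"
    # Tests that failed on every blocked attempt point at one persistent defect.
    common = frozenset.intersection(*(a.failing_tests for a in blocked))
    return "stuck_on_defect" if common else "mixed"


def converged(seq: list[Attempt]) -> bool:
    """Check 3: the final attempt got through with no failing tests."""
    return bool(seq) and not seq[-1].blocked and not seq[-1].failing_tests


# Two made-up sequences, for illustration only.
reshaped = [
    Attempt(True, "unexpected_structure", frozenset()),
    Attempt(True, "unexpected_structure", frozenset()),
    Attempt(False, "", frozenset()),
]
real_defect = [
    Attempt(True, "test_failure", frozenset({"test_edge"})),
    Attempt(True, "test_failure", frozenset({"test_edge"})),
    Attempt(False, "", frozenset()),
]
print(reason_distribution([reshaped, real_defect]))
print(retry_shape(reshaped), retry_shape(real_defect))
print(converged(reshaped), converged(real_defect))
```

Note that both made-up sequences converge, which is why convergence alone isn't a verdict: the retry shape is what separates them. The `"mixed"` label is there for the wandering case, where a real defect exists but the failing tests shift between retries.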

The deeper version of the mistake

There's a more general trap underneath this, and I want to name it plainly, because I fell into it without noticing.

A test that only checks "the gate blocks bad input" is testing the easy half. The hard half is: does the gate let through good input that simply looks unusual? If you only ever feed a guard the inputs it's supposed to reject, of course it rejects them, and of course the tests pass — but you've proven nothing about its false-positive behavior, which is exactly where mine was failing. The gate's own tests were green for the same reason the gate looked healthy: I'd only ever asked it the flattering question.

So now, when I test a gate, I deliberately include cases that are legitimate but oddly shaped — valid work in a form the gate might naively distrust — and check that it lets them through. In this case, the negative cases (reject the bad stuff) were the easier half. The risk I'd under-tested was the good stuff in unfamiliar clothing.
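Here's roughly what testing both halves looks like. The gate below is a toy stand-in invented for this example, not the real gate (whose rule isn't spelled out here): it accepts any source that parses and defines at least one function, wherever that function sits. The point is only the shape of the test suite: cases the gate must reject, next to legitimate-but-unusual cases it must accept.

```python
import ast


def gate_allows(source: str) -> bool:
    """Toy gate (hypothetical rule): allow code that parses and defines a function."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    # ast.walk visits nested nodes too, so methods inside classes count.
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


# The easy half: inputs the gate is supposed to reject.
MUST_BLOCK = [
    "def broken(:\n    pass\n",  # does not parse
    "x = 1\n",                   # parses, but defines no function
]

# The under-tested half: valid work in shapes a naive check might distrust.
MUST_ALLOW = [
    "def add(a, b):\n    return a + b\n",                             # familiar shape
    "class Calc:\n    def add(self, a, b):\n        return a + b\n",  # method, not top-level
    "async def add(a, b):\n    return a + b\n",                       # async variant
]


def test_blocks_bad_input() -> None:
    for src in MUST_BLOCK:
        assert not gate_allows(src), f"gate let through bad input: {src!r}"


def test_allows_unusual_good_input() -> None:
    for src in MUST_ALLOW:
        assert gate_allows(src), f"gate rejected legitimate input: {src!r}"


test_blocks_bad_input()
test_allows_unusual_good_input()
```

A version of this toy gate that only inspected top-level statements for a plain `def` would still pass `test_blocks_bad_input` and would only be caught by the second test, which is the asymmetry described above.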

What this didn't prove

I don't want to inflate this into "all gates are bad" or "block counts are meaningless." Neither is true.

Gates can earn their keep. A well-calibrated one really does stop real problems, and the block count is a perfectly good operational signal — useful for noticing that something is happening, or spotting a sudden spike. The mistake wasn't tracking the count. It was treating the count as evidence of correctness when it's only evidence of activity. Those are different axes, and I'd conflated them.

I'd also flag that my three-part check is shaped by this particular system — a test-driven loop where blocked work gets automatically retried, so "watch the retries" is even available to me as a signal. If your setup doesn't produce that kind of trajectory, parts of this won't transfer. I'm offering it as something that worked in one place, not a general theorem about guards. I've been wrong about generality before in this project, so I'm holding it loosely.

The takeaway, stated honestly

A guard blocking something tells you it's active. It does not tell you it's right. Those are separate facts, and the gap between them is where a confident-looking gate can quietly turn into a noise machine that punishes good work and calls it protection.

If you run gates, linters, validators, policy filters — anything that stops things — it's worth auditing a sample of what it blocked, not just the total it blocked. The total can easily look like diligence. The sample is where you find out whether the diligence was real.
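A sample audit can be as small as this. It's a sketch under assumptions: `blocked_ids` stands in for however your system identifies blocked cases, and `review` stands in for a human (or scripted) judgment of each case against its task's acceptance criteria. Both names are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def audit_sample(
    blocked_ids: Sequence[str],
    review: Callable[[str], bool],  # returns True when the block was justified
    sample_size: int,
    seed: int = 0,
) -> dict:
    """Review a random sample of blocks and report how many were justified."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-reviewed
    k = min(sample_size, len(blocked_ids))
    sample = rng.sample(list(blocked_ids), k)
    justified = sum(1 for case_id in sample if review(case_id))
    rate: Optional[float] = justified / k if k else None
    return {
        "total_blocks": len(blocked_ids),  # activity
        "sampled": k,
        "justified": justified,
        "justified_rate": rate,            # correctness, estimated from the sample
    }
```

Reporting `total_blocks` next to `justified_rate` keeps the two axes apart in the output itself, so the first can't quietly stand in for the second. The rate is only an estimate from the sample, and it is only as good as the `review` judgment behind it.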

I'm curious how others handle this. If you operate a gate or a strict CI rule, do you ever sample its blocks to check they were justified — and if so, how do you decide "justified" without it turning subjective? I worked out a rough method for my case, but it leans on my system's particular shape, and I'd like to hear how it's done elsewhere.

Next in the series: I asked an AI agent to count something for me. It said 12. The real number was 13, and that off-by-one gap changed a rule I now follow.

Top comments (2)

Armorer Labs • Jun 21

This is a sharp lesson. A high block count feels comforting, but gate metrics need sampled evidence behind them.

The pattern I like is: every gate decision should leave a small receipt with input digest, rule/policy version, decision, reason code, and enough context to replay or spot-check later. Then the metric becomes "how often was this gate correct?", not just "how often did it fire?"

That is a lot of what I am thinking through with Armorer Guard: gates are only useful if their decisions can be inspected after the run.

Joseph Yeo • Jun 26

The receipt idea is exactly right — the metric only becomes useful once each decision carries enough context to be replayed. "How often did it fire" and "how often was it correct" really are different axes, and you can't get from the first to the second without the trail.

What I keep running into is the next layer of the same problem: even a well-justified rule can quietly stop being correct over time, and the receipt for why it was added doesn't expire on its own. I've been treating each rule as a hypothesis with an evidence record rather than a permanent truth — which is the subject of the next post in this run. Thanks for the thoughtful read.