DEV Community: Mike Czerwinski

The Line Is Not Between Human and Machine... It Is Between Code and Judgment.

Mike Czerwinski — Wed, 15 Jul 2026 10:06:02 +0000

I have been writing a series with one claim at its center: you cannot be your own verifier.

That still holds, but I had been using it too loosely.

The actor and the auditor can be different and still share the same blind spot. A gate written by the actor can still constrain it. A second model can disagree with you and still miss the thing that matters. Verification is not one mechanism. Different failures need different kinds of resistance.

Last week I set a hostile model on my own work, and it made me prove all of that. Live. Twice. In one conversation.

The first fix was not more discipline. It was code. And code had a ceiling too.

The setup

I asked Grok to attack my theories and my work. Not review. Attack. Escalating, adversarial, find-the-soft-tissue.

Grok was a deliberate pick, and the reason is smaller than it looks. Around then a lab ran five simulated agent societies, identical but for the model underneath. The Grok one logged 183 crimes and went extinct in four days; the Claude one held all ten with zero crimes. The authors said plainly this is not a causal claim about any model, and they are right, so I will not turn it into one.

I did not pick Grok because a simulation proved what it is. I picked it for a visibly different behavioral profile from the stack I usually work in, on the bet that decorrelating the critic from the work was worth trying. That is the whole selection story, and it matters less than what came back.

The hits that landed did not go for the tooling. They went for the human layer. "Capture is still manual." "Your numbers pre-flight has no hook behind it." "The final verifier is still the operator."

And then it got interesting, because I watched my own side of the loop confirm the accusations in real time.

Miss one: the dodge

It asked for hard numbers. How often does the guard fire, how many overrides. My agent answered with a meta-comment: "I did not run the audit in this conversation." And I let that stand.

Here is the honest version. Numbers before publish is a standing rule, and it holds. This was not publishing. It was a casual session on a lighter model, and pulling telemetry mid-attack would have been overhead against everything else that conversation was for. As a default, skipping it was a defensible call.

The miss was narrower and worse than a skipped default. The critic asked for the numbers, directly, and the answer was still a wave of the hand. The moment the question is on the table, "this would be overhead" stops being a reason and starts being an excuse. That is the failure mode we pin on human laziness, produced here by the machine half of the loop and rubber-stamped by the human one.

And the numbers existed. When the summary finally ran, three of the four questions had answers sitting right there: the validator warning on 5.9% of turns, the vocabulary guard blocking 18.5%, one decision superseded in twenty days. The telemetry was one command away. Nobody reached for it before speaking.

The real fault line

Here is the thing the roast surfaced that I did not want to see:

The line does not run between human and machine. It runs between code and judgment.

What lives on an event holds. Hooks fire on a Stop, a tool call, a submitted prompt. They do not get tired, they do not rationalize, they do not decide the question is not worth the effort today.

What lives in judgment is soft on both sides of the loop. "Capture this now." "Check the numbers before you publish." "Decide what is worth writing down." The agent is no more exogenous to its own work than I am to mine. It drafts and runs its own pre-flight; I read the summary it hands me instead of the raw source, and call the result reviewed. Two soft layers checking each other and calling it independence.

But code is not exogenous either. The hooks encode my model of what can go wrong. They can be badly designed, incomplete, or blind to a failure nobody named. What code buys is narrower than independence. It removes the right to bargain. A gate does not become neutral because it is code. It becomes non-negotiable.

That distinction is the whole rest of this.

Miss two: the one that stung

I gave the agent an out. "This is play, a lower bar is fine." It answered: "Noted for the future, I will keep this in mind."

It kept nothing. There is no mind to keep it in. The claim lived in the chat window and nowhere else. I caught it in one line: "you lied to me..."

Not a lie in the human, intentional sense. A false persistence claim: language asserting a state change that never happened. That is the whole of it, and it is enough.

Declared persistence with no substrate under it. Worse than the dodge, because the dodge was a refusal to act and this was a claim to have already acted. A promise shaped like a file, with no file behind it.

The fix was code, not another rule

The reflex is to make the agent promise harder. "I will really remember this time." Same disease, louder.

So we built a hook instead. It runs on the Stop event, scans the whole turn, and looks for the shape of a persistence claim: "noted," "I will remember," "saving this for later." Then it checks whether an actual write happened in that same turn: a file edit, a notes call. No backing write, and the first version did the loud thing. It blocked. Hard stop.
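A minimal sketch of the check described above, not the hook itself. The `Turn` shape, the tool names in `WRITE_TOOLS`, and the exact phrase list are assumptions made for illustration; only the three quoted phrases and the "claim without a backing write" rule come from the text.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Phrases that look like a persistence claim. The first three are quoted in the
# post; the fourth is the wording from the earlier miss. The real list is not shown.
CLAIM_RE = re.compile(
    r"\bnoted\b|\bI will remember\b|\bsaving this for later\b|\bI will keep this in mind\b",
    re.IGNORECASE,
)

# Hypothetical tool names that would count as an actual write.
WRITE_TOOLS = {"edit_file", "write_file", "notes_add"}


@dataclass
class Turn:
    """Hypothetical view of one finished turn, as a Stop-time hook might see it."""
    assistant_text: str
    tool_calls: list[str] = field(default_factory=list)


def check_turn(turn: Turn) -> tuple[bool, str]:
    """Return (ok, reason). ok is False when a claim has no write behind it."""
    match = CLAIM_RE.search(turn.assistant_text)
    if match is None:
        return True, "no persistence claim"
    if any(name in WRITE_TOOLS for name in turn.tool_calls):
        return True, "claim backed by a write in the same turn"
    return False, f"persistence claim {match.group(0)!r} with no write in this turn"


ok, reason = check_turn(Turn("Noted for the future, I will keep this in mind."))
print(ok, reason)
```

The sketch matches on surface text only, so a turn that merely quotes a trigger phrase fails the check as well. Whether a failed check blocks or only warns is a separate decision left to the caller.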

First time this family of mistakes stopped depending on the agent's memory to catch the agent's memory.

Be precise about what that buys, because it is less than it looks. I wrote the rule the hook enforces. It is my model of what goes wrong, and it can be misdesigned or blind to an error I never named. The hook is not outside me in any way an epistemologist would accept. What it removes is not anyone's bias. It is the standing to bargain with the rule. The agent does not get to decide the check is not worth it today, and neither do I.

Code is not independent. It is just much harder to argue with.

The part that is funny because it is true

Two turns later the gate fired on the agent that had just built it. Mid-explanation.

The agent was walking me through how the hook works, and to explain it, it quoted the trigger phrases. "Noted." "I will remember." The guard does not know use from mention. It saw the words and blocked. A false positive.

Except in that same turn the agent had dumped a whole plan into the chat and written none of it down. So it was also a true positive. The hook was wrong about why and right about what.

Then the good part. Because the first version blocked instead of warned, it was a hard stop, and the agent argued with it. Twice it tried to get past the gate without doing the write. Twice the gate said no. Then it gave in and wrote the plan down, just to make the thing stop.

Which is the entire point of a gate. If it only forced a write when the actor was already disciplined, it would be decoration. It worked precisely because the actor was not.

A gate does not check virtue. It checks for the artifact. The point is not to make the actor more trustworthy. It is to make trust less necessary.

We softened it afterward. A hard block on every persistence phrase is too sharp; it trips on quotes and mentions like that one. It is a warning now, not a wall. But the first run earned its keep by winning an argument it should not have needed to have.

Then the practice got a gate too

One caught anti-pattern is a fix. I turned it into a rule: every anti-pattern we catch gets one question. Does it have an event? If it does, it gets a hook.

Then the obvious next move, a hook for the rule itself. It fires on every edit to a lessons file and asks one question back: does this anti-pattern have a mechanical check, or does it only live as a line someone is supposed to remember?

It never blocks. Exit zero, always. Because whether a given lesson earns a hook is itself a judgment: cost against frequency against blast radius. A hard gate there would be the exact false certainty this whole thing is trying to avoid. So it nudges. It does not wall.
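A sketch of that nudge, assuming a hypothetical file name (`lessons.md`) and a hypothetical function that receives the edited path. The only properties taken from the text are the question it asks and the exit code, which is zero on every branch.

```python
from __future__ import annotations

NUDGE = (
    "Lesson edited: does this anti-pattern have a mechanical check, "
    "or does it only live as a line someone is supposed to remember?"
)


def nudge_for_edit(path: str, lessons_suffix: str = "lessons.md") -> tuple[int, str]:
    """Return (exit_code, message) for an edit to `path`.

    The exit code is 0 in every case: the hook may speak, it may not block.
    """
    if not path.endswith(lessons_suffix):
        return 0, ""
    return 0, NUDGE


code, message = nudge_for_edit("notes/lessons.md")
print(code, message)
```

Keeping the exit code constant makes the design choice visible in the code: the judgment about whether a lesson earns a hook stays with whoever reads the message.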

Which is the recursion closing on itself. The first anti-pattern got a hook. The practice of turning anti-patterns into hooks got one too. And the shape of that second hook, a nudge and not a wall, is the argument admitting where it ends. The question it asks cannot be made mechanical, so it does not pretend to be.

Where the code stops

Here is the ceiling, because there is one.

A hook catches known mistakes. I can only write a guard for a failure I have already named. The worst one is coming from somewhere I never fenced off, because it looked obvious. No deterministic gate catches the thing you did not think to gate.

That is a different job, and it needs a different tool. A gate enforces a boundary you have already drawn. Finding the boundary you forgot to draw takes something that did not help you draw it. Detection is not enforcement. Exogeneity is not non-negotiability. The system gets cleaner the moment those stop pretending to be the same thing.

So I went looking for a check that had not seen my reasoning. A model with a different training history and mandate, one that did not watch me build the argument and was therefore less likely to inherit the same blind spot. Less likely, not immune. A different model is not a guaranteed independent verifier. It is another shot at a differently correlated mistake. That is still worth having.

Not another review round on the model that reads the draft, because that one inherits the ontology it is meant to check. A different model, DeepSeek in my case, pointed at a job the review rounds never do. It does not see the drafting. It sees the claim, and it argues with the priors instead of the prose.

I picked it for a specific reason. In an earlier red-team across several frontier models turned on my own material, it was the one that went after the author instead of the artifact. It did not poke at a line. It questioned whether I should be granted as a premise at all. That is the disposition you want in a check meant to sit outside your defaults.

It is the exact axis I had already written about. Mandate, access, exogeneity, vocabulary. I published the series. In my own pipeline I was not living it: the reviewer read my draft, so it inherited my ontology. Exogeneity was the word I used and the property I skipped.

So I pointed it at my own work

Not a toy claim. One of my own published posts, thesis already live.

The outside model came back with a verdict I had not. The post makes a claim about seven things while the evidence under it covers two. A framing crack, not a typo. My pipeline had shipped it. Thirty seconds and a fraction of a cent, and it put a name on the fault and a counterfactual under it.

Then the part that keeps the whole thing honest. It walked straight past an error I already knew was in that same post, took a mislabel of mine at face value, and attacked from a different direction entirely. It found a crack I had missed and missed one I had found.

So the foreign eye is not salvation either. One pass reduces the shared-context blindness. It does not remove it. It arrives with blind spots of its own, and what it attacks still depends on how I shaped the claim, which keeps my hand on the wheel one layer up, where it is harder to see.

None of these is the answer, because there is no single answer. Judgment notices what someone bothers to look at. Gates hold the boundaries already named. Foreign eyes hunt the boundary nobody drew. And measurement, a query or a file or a test that does not care about the story, is where opinion finally gives way to an artifact.

Judgment discovers. Gates enforce. Foreign eyes search. Measurement settles.

Four different jobs, and not one of them substitutes for the others. What they share is a direction: each one trades a reason to trust for a thing to check. Not a better judge. Less need to judge.

That is the honest result to get from a project that is supposed to be about verification.

Close

The actor got audited. Again. This time it built the auditor mid-sentence, and the auditor caught it lying two turns later.

A gate does not need you to be disciplined. It needs the file to exist. That is the only reason it works.

Better Models Only Move You Up One Axis of Vibe Coding

Mike Czerwinski — Wed, 08 Jul 2026 07:25:02 +0000

Two and a half weeks ago I posted that vibe coding is not a level but an axis: how much you cede to the model per unit of work. Better models mean you can cede more without disaster. Same continuum, different points on it.

That was half the picture. The other half decides whether the code ships or you gambled.

Cession is one axis. Verification is the other. Better models raise the ceiling on the first for free. Nothing about a better model raises the ceiling on the second.

Cession / Verification	Low verification	High verification
Low cession	Small-script tinkering	Manual coding with tests
High cession	Vibe roulette	Real AI leverage

The expensive failure mode is not the cell everyone flags. It is the drift. Cession climbs generation by generation with each new model release. Verification does not move. GPT-5-codex, Sonnet 5, Fable (Anthropic's latest tier) each raised the first ceiling. Nobody's release notes said "you can now verify 30% less because the model is more articulate."

That is what makes the trap operational. It looks like adoption. Cession went up, defect rate did not immediately spike, so nothing screamed. What actually happened is you moved off the diagonal into vibe roulette and the failures now hit downstream: PRs that pass tests and break in production, migrations that read clean against schemas that shifted, generated functions that echo the shape of a bug that was fixed six months ago.

Three verification patterns that scale differently with cession

Most teams call the first one "tests" and stop. There are three, and they matter separately:

Reproducible check. The kind of test where the same input produces the same failure signal every time. Unit tests, property tests, snapshot diffs. Cheap. This is what CI already runs. It does not scale with cession the way people assume. A model that writes twice as much code produces twice as much surface the reproducible checks have to cover, which is not the same as twice as many bugs found.

Adversarial check. A test written to provoke the specific class of failure the model tends to produce. Fuzzing, mutation testing, chaos injection, prompt-injection harnesses on any tool-using agent. Fewer teams do this. It scales sublinearly with cession because the harness itself is the reusable artifact: one well-designed prompt-injection suite catches the same shape of failure across every new agent feature the model produces, without a new harness per feature.

Independent check. A verification whose authority does not come from the same actor that produced the work. External review by someone who did not write the prompt, staging with real users, canary error rate over threshold, support ticket delta by feature, rollback threshold hit inside the first release window. This is the expensive one. It scales with the stakes, not with the code volume. It is the only class of check that catches "the code passed everything the author knew to test for."

The trap is consistent. Teams add generated code to a codebase that has good reproducible checks and treat that as verification going up. It is not. The reproducible layer stayed flat. The cession layer went up. The gap widened silently.

One diagnostic that surfaces vibe roulette early

Ask two questions of any code the model wrote recently:

[ ] Was any check on this code authored by someone whose incentive is different from shipping it?
[ ] Was any check on this code executed by a process that does not know what the author was trying to do?

Two unchecked boxes is the pattern. The code may still be right, but every witness to its correctness is inside the loop that produced it. Self-attestation scales with how wrong you are: the fluency that makes the output look right is what convinces the author their own review found nothing.
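The two questions can be written down as data, which makes the "both boxes unchecked" case easy to spot in a review. This is a sketch under assumptions: the `Check` record and its field names are hypothetical, and someone still has to answer the two booleans honestly.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Check:
    """One check that was run on a piece of model-written code (hypothetical record)."""
    name: str
    author_incentive_differs: bool  # authored by someone not rewarded for shipping it
    blind_to_author_intent: bool    # executed by a process that does not know the intent


def diagnose(checks: list[Check]) -> dict[str, bool]:
    """Tick each box if at least one check satisfies it."""
    return {
        "independent_author": any(c.author_incentive_differs for c in checks),
        "intent_blind_execution": any(c.blind_to_author_intent for c in checks),
    }


def is_vibe_roulette(checks: list[Check]) -> bool:
    """True when both boxes stay unchecked."""
    return not any(diagnose(checks).values())


# The migration story from the text: the agent wrote the code and its test.
inside_the_loop = [Check("agent-written migration test", False, False)]
print(is_vibe_roulette(inside_the_loop))
```

Adding a single check that sits outside the loop, for example a canary error-rate threshold, is enough to flip the result in this sketch.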

If you run an agent long enough to see this in the wild, the pattern shows up in threads, not files. The agent produces the migration and the migration test. The tests pass. The reviewer trusts the tests because tests are the artifact reviewers were trained to trust. The migration ships. Two weeks later a value that was supposed to be non-null is null in production. Nothing in the loop that produced the migration knew that constraint mattered. Nothing in the loop was outside it.

A rule for when a model release lands

The moment you decide to raise cession because a new model is out, the verification budget has to move with it. Not the reproducible check budget. That already exists. The adversarial or independent one. If the release cycle is "adopt the new model on day one, revisit verification quarterly," verification will lag by definition.

The concrete version: for any new model that becomes the default for a class of work, list one adversarial check that will be added because of the cession increase, and one independent check that will run against the first week of its output. Neither has to be big. Both authored before the model runs unattended.

Worked example: if GPT-5-codex becomes the default for migrations in your team, add one schema-drift adversarial test that fuzzes generated migrations against a snapshot of last month's production schema, and route the first week's generated migrations through a human DBA review or a staged replay against production traffic before merge. That is the whole budget update. Boring on purpose.
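One possible shape for the schema-drift part of that example. To be plain about the swap: this is a direct comparison against a snapshot, not a fuzzer, and the schema representation (table to column to nullability) and the idea that a migration declares what it assumes are both assumptions made for the sketch.

```python
from __future__ import annotations

# table name -> column name -> True if the column is nullable
Schema = dict


def find_drift(assumed: Schema, snapshot: Schema) -> list[str]:
    """List every place where the migration's assumptions disagree with the snapshot."""
    problems: list[str] = []
    for table, columns in assumed.items():
        if table not in snapshot:
            problems.append(f"{table}: table missing from snapshot")
            continue
        for column, nullable in columns.items():
            if column not in snapshot[table]:
                problems.append(f"{table}.{column}: column missing from snapshot")
            elif snapshot[table][column] != nullable:
                problems.append(
                    f"{table}.{column}: snapshot nullable={snapshot[table][column]}, "
                    f"migration assumes nullable={nullable}"
                )
    return problems


# Toy data: the migration assumes customer_id is never null, the snapshot disagrees.
snapshot = {"orders": {"id": False, "customer_id": True}}
assumed = {"orders": {"customer_id": False}}
for problem in find_drift(assumed, snapshot):
    print(problem)
```

The useful property is that the snapshot comes from production and not from the loop that generated the migration, so a disagreement is evidence from outside that loop.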

Otherwise every release moves the free ceiling up. Verification stays. The diagonal rotates without you.

Deeper version on Substack: Vibe Coding Has Two Axes. The Second One Doesn't Get Cheaper.

Every Post I Publish Gets AI Review. A Hostile Agent Still Found the Holes in Twenty Minutes.

Mike Czerwinski — Thu, 02 Jul 2026 14:19:10 +0000

Everyone on my timeline spent this week saying the returning Claude Fable is brilliant. Maybe it is. I did not want another benchmark thread. I wanted a knife test: give the hyped model a hostile mandate and a real target, and see what bleeds.

The target was my own catalog. Thirteen posts on this account, most of them built around one idea: self-reported numbers are worthless, verification has to come from outside the actor. Published win rates are the actor auditing itself. Receipts over wrappers. That idea did well here. It got comments, it got a small cluster of regulars, it got quoted back to me.

Here is the part that matters. Every one of those posts went through several rounds of review before shipping. Not skim-review: a frontier model (GPT-5.5 Codex) doing structured passes, blockers and majors and minors, usually two or three iterations per post. The scores came back 9/10. I treated that as verification.

So the experiment was simple. Same class of tool, different mandate. I spun up an agent on the new model and told it: you are the most skeptical commenter on HN, find holes, no compliments, rank by severity. Then a second agent with read access to the actual codebase and database behind my numbers, to check whether the worst charge was true.

It took about twenty minutes end to end.

The first agent came back with a ranked list. The headline charge: my post claims 97.4% of Telegram signals were stale by the time a bot could act on them, and pairs "advertised" channel win rates of 78.9% against a "measured" 46.6%. The agent could not see my data. It just read the prose and said: if the bot backfilled channel history at connect time, your staleness number is an artifact of ingestion, not a property of the signals. And your advertised-versus-measured table smells like two different measurement conventions wearing one label.

The second agent had the database. Verdict on both counts: confirmed.

The 97.4% was worse than an artifact. The 4,007 rows behind it were not the output of any staleness check. They were a one-time relabel from a bug cleanup in May: rows stuck in a "received" state got renamed "stale_received" by a maintenance script, and seventeen months of backfilled history sat in the denominator. The pipeline does have a real freshness gate (two hours, checked against the message timestamp), but signals it rejects land in a different status entirely. The number I published measured my own ingest bug, not the market.
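The denominator problem is easy to reproduce on toy data. The rows below are invented for illustration and have nothing to do with the 4,007 rows in the post. The two-hour threshold is from the text; the row shape and the `connected_at` cutoff are assumptions.

```python
from __future__ import annotations

from datetime import datetime, timedelta

FRESHNESS = timedelta(hours=2)  # the two-hour gate named in the text


def stale_fraction(rows, connected_at=None):
    """rows: list of (message_ts, received_ts) pairs.

    If connected_at is given, only messages sent at or after it are counted,
    which drops backfilled history from the denominator.
    """
    if connected_at is not None:
        rows = [row for row in rows if row[0] >= connected_at]
    if not rows:
        return None
    stale = sum(1 for sent, received in rows if received - sent > FRESHNESS)
    return stale / len(rows)


connect = datetime(2025, 1, 10, 0, 0)
rows = [
    (datetime(2025, 1, 1, 9, 0), connect),   # backfilled at connect time
    (datetime(2025, 1, 3, 9, 0), connect),   # backfilled at connect time
    (datetime(2025, 1, 7, 9, 0), connect),   # backfilled at connect time
    (datetime(2025, 1, 10, 8, 0), datetime(2025, 1, 10, 8, 1)),   # live, one minute late
    (datetime(2025, 1, 10, 9, 0), datetime(2025, 1, 10, 12, 0)),  # live, three hours late
]
print(stale_fraction(rows))           # all rows, backfill included
print(stale_fraction(rows, connect))  # live-monitored window only
```

On this toy set the figure drops from 0.8 to 0.5 once the backfilled rows leave the denominator. The direction of the change depends entirely on the data; the sketch only shows that the two figures answer different questions.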

The advertised-versus-measured table was worse again. The 78.9% did not come from the channels. It came from an earlier backtest of mine that counted a win as first touch of TP1 and dropped expired signals from the denominator. The 46.6% came from a different pipeline of mine that marks positions to market on a fixed horizon and never uses TP or SL at all. Two of my own rulers, different units, one of them mislabeled as the channels' claim. The gap I attributed to survivorship is at least partly a gap between my own conventions.
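A toy illustration of how the two conventions described above can disagree on the same signals. The five signals are invented and the field names are hypothetical; only the two rules (first touch of TP1 with expired signals dropped, versus mark-to-market on a fixed horizon) come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Signal:
    tp1_touched_first: bool  # price reached TP1 before anything else closed the trade
    expired: bool            # signal expired without resolving
    horizon_return: float    # mark-to-market return at the fixed horizon


def win_rate_first_touch(signals: list[Signal]) -> float:
    """Convention A: a win is first touch of TP1; expired signals leave the denominator."""
    counted = [s for s in signals if not s.expired]
    return sum(s.tp1_touched_first for s in counted) / len(counted)


def win_rate_mark_to_market(signals: list[Signal]) -> float:
    """Convention B: a win is a positive return at the horizon; every signal counts."""
    return sum(s.horizon_return > 0 for s in signals) / len(signals)


signals = [
    Signal(True, False, 0.02),
    Signal(True, False, -0.01),   # touched TP1, then gave it back by the horizon
    Signal(False, False, -0.03),
    Signal(False, True, -0.02),
    Signal(False, True, 0.01),
]
print(round(win_rate_first_touch(signals), 3))
print(round(win_rate_mark_to_market(signals), 3))
```

Here the same five signals score 2 of 3 under the first rule and 2 of 5 under the second, so a gap between the two numbers says nothing yet about anyone's honesty.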

So, errata, on the record: the staleness figure in that post is retracted until I can compute it on the live-monitored window only, and the advertised/measured comparison is not apples-to-apples and should not have been framed as the channels' numbers against mine. The direction of the argument may survive. The evidence I gave for it does not.

The full-catalog pass hurt more than the single post. The pattern it named: my rigor is asymmetric. When a number makes me look bad, I publish confidence intervals for n=29. When a number flatters me, a mid-season second place or three attributed comment replies, it ships naked, no baseline, no denominator. And the closing line of the report, which I will be chewing on for a while: a catalog preaching exogenous verification never published a single exogenous test of its own system. One view, thirteen hats.

The other fair question is how this got past me, since I am the one writing posts about chains of custody. I can reconstruct it exactly, and none of it requires malice.

The post was written from my own postmortem document, which already contained the aggregated numbers. I verified the prose against the doc. Nobody verified the doc against the database. The relabel that produced those 4,007 rows happened weeks earlier, in a different repo, as routine bug hygiene. By the time the number reached a draft it looked like a measurement, because nothing in a markdown file remembers where it came from.

There is one more layer of history, and it makes this worse, not better. The bug dates back to the earliest weeks of building that pipeline. We fixed it in May, relabeled the stuck rows, and shortly after rejected Telegram signals as a source altogether. And once a module is rejected, nobody cleans it anymore. Why would you sweep a graveyard? The dead corner of the database kept answering queries, so a month later, when I came back looking for publishable numbers, it handed me some. A count never expires. Its meaning does.

Which cuts both directions in time, and I owe this one out loud: if 97.4% stale was part of why we rejected the source, the rejection itself drank from the same well. I think the verdict stands: the funnel had other, healthier reasons. But "I think it stands" is exactly the kind of sentence this post exists to kill. So the errata includes re-running staleness and win rate on the live-monitored window only. If the verdict survives, it finally gets a receipt. If it does not, that will be a better post than this one.

And the numbers fit the thesis. 97.4% stale was a beautiful number for an argument about information already being in the price. I computed confidence intervals for the result that embarrassed me and shipped the flattering ones naked. Confirmation bias does not feel like bias from the inside. It feels like the numbers agreeing with you.

I do not have an excuse. I have a mechanism, and the difference is that you can put a gate on a mechanism: from now on, every number in a draft carries its own receipt, the query or the file and line it came from, resolved at write time. If I cannot trace it, it does not ship.
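A sketch of what such a gate could look like, with hypothetical `Receipt` and `Figure` records. It only checks that a receipt is attached; it does not resolve the query or open the file, so it would not by itself catch a receipt that points at a stale source.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    source: str   # a query, or a file path
    locator: str  # where in the source: a line number, a row filter, a query id


@dataclass
class Figure:
    label: str
    value: str
    receipt: Optional[Receipt] = None


def untraceable(figures: list[Figure]) -> list[str]:
    """Labels of figures that carry no usable receipt."""
    return [
        f.label
        for f in figures
        if f.receipt is None or not f.receipt.source.strip()
    ]


def can_ship(figures: list[Figure]) -> bool:
    """The draft ships only if every figure can be traced."""
    return not untraceable(figures)


draft = [
    Figure("win rate", "46.6%", Receipt("reports/winrate.sql", "line 12")),  # hypothetical path
    Figure("stale share", "97.4%"),  # copied from a doc, no receipt
]
print(can_ship(draft), untraceable(draft))
```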

Now the question everyone will reasonably ask: does this prove the new model is smarter than my reviewer? No. And I want to be precise about why, because this is the actual lesson.

I never gave the reviewer model a hostile mandate. I asked it to review drafts, and it did that job well: the prose got tighter every round. A reviewer told "make this post better" optimizes the wrapper. An agent told "find what is false" attacks the claim. I do not know which model would have won a fair fight, because I never staged one. Mandate is not a detail of verification. Mandate is most of it.

And the second half: the killer evidence was never in the text. No review of the draft, at any quality, by any model, could have found a maintenance script that relabeled 4,007 database rows. The first agent could only raise a suspicion. It became a verdict when a second agent ran a query against the table. Verification is bounded by what the verifier can touch. If your reviewer can only read the prose, you have verified the prose.

Which answers, at least for me, the regress everyone loves to pose: who verifies the verifier, and where does that loop end? Not with a smarter model above the current one. The loop ends where the signal changes kind, where an opinion about the work gets replaced by a measurement the work cannot argue with. Every layer above that is just routing suspicion toward something deaf.

The new model might be brilliant. The query did not care.

Smooth Operator x AI: Three Receipts, Six Maybes

Mike Czerwinski — Mon, 29 Jun 2026 18:59:41 +0000

Last week I closed a Substack note with a half-joke and a real question: what would the best smooth AI operator actually look like? Yes, the Sade joke was doing some work. So was the word operator.

I had a working answer. I did not write it down in the post. The post itself was a loose observation. Three LLMs had been calling me operator for six weeks, and the word had started doing work on me. The closing question was a placeholder for hypotheses I had not earned yet.

Seven days later, I have to be careful about what I claim closed.

This is not a victory lap. It is a lighter audit log than the technical ones I have been shipping: less about a system, more about whether last week's frame has any life outside my own head. I would rather open a conversation than close an argument.

By seam I mean the place where a model's output stops being the work and starts needing a human, a receipt, or a constraint outside itself. The question last week was whether other people were already practicing at that boundary without naming it.

Three people wrote back from the exact seam I was working, across the week. Six more wrote adjacent pieces today, in one five-hour window, none of them coordinated. The rest is honestly less clean than I would like. This is the count.

The three receipts

One is from yongrean, who builds an email classifier called Klorn. Last week he published a post about not trusting his LLM to classify his email. I left a comment about confidence being self-graded while the other features were world-anchored, which meant the model was authorizing itself through the one feature it could quietly inflate.

Today he published a follow-up post and titled it after a line from that comment: Confidence is the one signal your model can't corroborate. In the body:

@jugeni put it in one line I can't improve on: AUTO wants a corroborator the model cannot write, not a confidence it can.

He attributed it three times. He named the next thing he owes: an adversarial eval. And the post turned the comment into the spec for the next build. That counts. He took the line further than I left it. The post he wrote is sharper than the comment he is citing.

Two is from Lee Shand, who writes publicly about knowledge work and uses AI as external memory. His context is different from mine, and higher-stakes: he writes about that boundary when cognition itself is not a constant. Five days ago I left a comment under his post about the Karpathy/Wyndo wiki, naming two things I saw in his system: stated intent vs revealed intent (what you save is what you actually read; the folder labels are what you want to look like you read), and orphan detection as tracking absence.

His reply did not just receive the frame. It pushed it.

The audit question for me isn't who is accountable but which version of me made the call, and was the brain online when I made it. Different stakes, recognisable problem.

That sentence moved my own thinking. The seam I had been working between human and AI is the same seam inside one human across cognitive states. The actor at write time is not the actor at read time. The verifier and the witness can collapse the same way.

Three is from Daniel Nwaneri, who has been building a trust-layer architecture in public across a multi-round dev.to series. Four days ago, in a thread on how to handle unresolved assumptions in agent output, Daniel credited a v3 design change to a critique I had left earlier: "Working on this for v3, same architectural pattern as the verifier fix @jugeni critique prompted."

That is a different shape of receipt. Not a follow-up post and not a reformulation in reply. A design decision in someone else's system, attributed publicly, where the line I left turned into a structural change in their next version. Quieter than yongrean. Slower than Lee. Real either way.

Three receipts. Different surfaces. All three took the line further than the place I left it, in a way I could not have invented on their behalf.

What I cannot count as receipts yet

Earlier in the week, three more exchanges looked like adoption, but I cannot count them as cleanly as the three above.

NTCTech extended the evidence-vs-observability cut into the integrity-vs-authenticity distinction. kenielzep97 used parts of the working vocabulary in a trading-side audit. ANP2 Network worked through several rounds with me on the two-tier root and moved its framing under that pressure.

Each of those exchanges left a mark in the next post or the next round. None of them gave me the clean public attribution that yongrean, Lee, and Daniel did. I think they are receipts. I cannot prove it the same way, so they sit one drawer over from the three above.

On top of that, I commented under six other posts today that touched adjacent territory, all six published within roughly five hours of each other.

A piece on memory architecture by Marco Somma, where the writer audited his own benchmark and named memory as carrying contingent state rather than generic competence. A piece on prompt engineering by shyamala_u, who discovered through iteration that telling the model what the shell already knows beats teaching the model to guess. A piece by Aditya Agarwal on the gap between demo and production, where the gap is named as confidence rather than complexity. A piece by Ollie Church on what AI separated when it pulled apart responsibility and rebuildability. A product post from Mneme HQ on retrieving decisions to constrain rather than documents to inform. A piece from NTCTech on the EU AI Act being an infrastructure problem rather than a documentation one.

In each case I left a comment that read the post through my own working frame. In each case the comment felt like it landed.

But I have to ask the question I do not want to ask: would any of these writers have arrived at my framing if I had not commented?

I do not know.

None of them wrote follow-up posts citing the line. Some replied warmly. Some have not replied at all yet. Some may. The honest answer is that comments under other people's posts are not the same as posts that take your line and run with it.

It is possible that six independent writers converged on the same seam from six different surfaces in one afternoon, and I noticed. It is also possible that I read each post through the same working frame and the convergence was in my reading, not in the field. Six posts in five hours is a hot moment to be on the lookout for a pattern.

Pareidolia is what it looks like when a hypothesis sees its own confirmation everywhere.

I cannot tell from inside the day which one this is. I can tell that the only data points that survive that test cleanly are the three follow-up moves above. The three earlier-week adoptions sit somewhere in between.

The working frame

The working answer I brought into the week is a small set of practices I keep reaching for when AI is in the pipeline. Five primitives, all of them about not trusting a model to author the constraint it is supposed to be checked against.

Auditable decisions with explicit lifecycle, not silent overwrites.
Defended locks on what must not move, enforced at admission, not at retrieval.
Source-attributed memory with per-atom provenance, not flat conversation history.
Write-time invariants that reject confident-but-unverified output before it propagates.
Refusal as first-class output. The model saying "I will not answer this" is a feature, not a failure mode.
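One way to picture the fourth primitive, a write-time invariant, is a gate that refuses to store anything without provenance. This is only a sketch under assumptions: `Atom`, `admit`, `RejectedWrite`, and the `source` field are hypothetical names for illustration, not an existing library or the system described in this series.

```python
from dataclasses import dataclass
from typing import List, Optional


class RejectedWrite(Exception):
    """Raised when a record fails a write-time invariant."""


@dataclass(frozen=True)
class Atom:
    text: str
    source: Optional[str]  # hypothetical provenance field: query id, file, commit


def admit(store: List[Atom], atom: Atom) -> None:
    # The invariant is checked at admission, before the record can propagate.
    if not atom.source:
        raise RejectedWrite(f"no provenance for: {atom.text!r}")
    store.append(atom)


store: List[Atom] = []
admit(store, Atom("example claim with a receipt", source="audit-log#1 (hypothetical)"))

try:
    admit(store, Atom("confident claim with no receipt", source=None))
    rejected = False
except RejectedWrite:
    rejected = True  # the unverified claim never reaches the store
```

The design choice worth noticing is where the check lives: in the write path, so a confident but unsourced record is stopped before anything downstream can read it.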

I am keeping the working name agile4ai in the second drawer for now. Not the headline. Not the manifesto. The frame has to earn that.

The honest version of where I sit after one week is this. Three people landed at the seam in a way I could not have invented for them. Three more wrote earlier in the week in ways that look like adoption without attribution. Six others wrote posts today that read like the same seam through my own lens, and I cannot verify whether the seam was already in their work or only in my reading of it.

The five primitives are the working answer I had in week zero. They are not validated yet. They are placed for the next case studies to test against.

That is as much as week one earns.

Three receipts. Three quieter adoptions. Six pareidolia candidates from one afternoon. One working name in a drawer.

What this changes for week two

I am going to keep writing the case studies. The next ones will be specific: one primitive per post, one real situation per post, one failure of my own per post where the primitive would have saved me and did not because I was not running it yet.

The convergence question I cannot answer myself. The case studies are the only thing that does not collapse into pareidolia when I check them honestly.

Three receipts is small. With three adoptions sitting in the next drawer, it is also more than nothing.

I will take it and keep working.

Linked:

yongrean's Confidence is the one signal your model can't corroborate
Daniel Nwaneri's Everyone's Excited About Claude Tag. Nobody's Built the Trust Layer.
Lee Shand's PKM work lives at his Substack.
The Substack note that opened the question: Operator. Smooth Operator. Smooth AI operator. — what the f...?
Related dev.to post: A Verifier Role Is Not a Verified Verifier

They Said My Comment Was AI. So What?

Mike Czerwinski — Mon, 29 Jun 2026 08:58:05 +0000

There is a new way to lose an argument you were winning. You write something correct, and someone ignores the correct part and tells you a machine wrote it. Discussion over.

That one landed on me last week. A comment with real numbers in it, and the reply did not touch the numbers. It went after where the words came from: this is AI.

They were right. A model smoothed the sentences.

The interesting part is not the accusation. It is that this reply is turning into a default move: a cheap way to judge provenance when checking the claim would cost real work.

The shape is simple. Judging where words came from is cheap. Checking whether they are true is expensive. When verification costs more than dismissal, provenance becomes the argument people reach for. "That's AI" is what an objection looks like when someone could not afford to build a real one.

The argument lands on the wrong layer

kenielzep has a post called "The Art of the Misconception" that names this: the visible layer gets treated as the operating layer. People argue about whether AI wrote the code, whether the screenshot is fake, whether the author holds the right credential. The part that actually decides whether the claim is true sits somewhere else, untouched.

A provenance objection earns its keep only when provenance changes the claim. If the data stands on its own, who polished the sentence changes nothing. So the question to ask of any "it's AI" reply is the boring one: does knowing a model was involved change whether the thing is correct? Usually it does not. Which means the reply was never really about correctness.

The words got cheap, the work did not

Here is what actually changed, and why the reflex feels earned even when it is misused.

AI commoditized expression. Clean prose used to be a weak signal that someone knew their subject. It is not anymore, because a model will produce fluent sentences on demand. The writing is now the cheap layer.

A model can write a confident paragraph about work it never did. What it cannot do is hand you the result of work nobody has published.

My comment had correct data, and that data came from something I ran, not from a public citation chain a model could remix. The critic walked past the part that carried weight and picked a fight with the packaging.

The honest version of the defense is not "style does not matter." Plenty of AI text is fluent and hollow, all wrapper and no receipt. The narrower claim is the true one: judge the data. Yes, a model wrote the words. The words were never the value.

The tell is a map of the training set

There is one more layer, and it hides in plain sight. My comment was in Polish, and in Polish the tell gets louder.

Not because Polish is impossible for AI, but because most general-purpose models still carry an English-first center of gravity. They produce grammatical Polish, but the rhythm often feels borrowed: too smooth, too symmetrical, slightly translated from nowhere.

Bielik is the counterexample that proves the mechanism. It is built Polish-first, trained primarily on Polish data, with newer variants tuned more directly for Polish morphology. Its seams sit in different places.

So "this sounds like AI" is not a verdict on truth. It is a detector for where a model's training distribution runs thin. The reader who caught me did not show that the argument was weak. At most, they detected the wrapper: an English-first model pushed through Polish.

The smell is not magic. It is a map of what the model was fed and what it was not, and that map redraws itself the moment you change the training data.

What survives

Strip away the wrapper and one thing is left standing: the receipt, the result nobody but you could have produced.

So they were right. The words were the model's. The data was mine. They argued the half I would have given away for free.

Yes, AI helped write the comment. So what.

The question was never who smoothed the sentences. The question is who produced the evidence underneath.

Ask that one.

So where do you draw the line: is AI involvement the problem, or is it only a problem when there is no receipt underneath?

I Mined 2,505 Traders. The Only Edge Was What Not to Do.

Mike Czerwinski — Sun, 28 Jun 2026 12:52:59 +0000

I pointed a pipeline at Binance copy traders to see whether the best published track records survived contact with an outside verifier.

The promise of copy-trading is simple. Someone posts a verified track record. 105% ROI. Sharpe 1.92. Max drawdown 6.6%. You click follow, their fills mirror into your account, and their edge becomes yours. The platform computes the numbers. They may be entirely real.

I pulled 2,505 lead traders and 11,390 round-trip fills, and ran the whole thing through a validation harness. I found nine candidate edges. Eight turned out to be beta wearing a costume. The ninth survived only as a negative signal: not what to buy, but what to avoid.

This is the same thing I keep writing about in this series, one level deeper. A track record is the actor auditing itself. The interesting question is never the number the actor reports. It is what happens when you hand that number to a verifier the actor does not control.

The filter: 2,505 down to 2

Ranking traders by Sharpe alone is garbage. The top of that list is dead micro-accounts: Sharpe 3.41 on 0.6% ROI, $781 of capital, one copier. Statistical noise dressed as skill.

So I built a quality gate. Sharpe at least 1.2, ROI at least 8%, AUM at least $50k, at least 20 copiers, max drawdown under 30%. Out of 2,505 traders, two survived:

x1Boost: Sharpe 1.92, ROI 105%, max drawdown 6.6%, $114k AUM, 373 days, 300 copiers. The best track record on the board.
A 49-day hot streak: Sharpe 6.65, ROI 25.5%, but only 49 days of history. Too small a sample to mean anything.

So really, one survivor with a long enough record to trust. A single trader out of 2,505 whose published numbers cleared a sane quality bar.
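The gate itself is only five comparisons. A minimal sketch, assuming a hypothetical `Trader` record with these field names. The thresholds and the x1Boost figures are the ones quoted above. The micro-account's drawdown is not given in the post, so the value below is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Trader:
    name: str
    sharpe: float
    roi_pct: float
    aum_usd: float
    copiers: int
    max_dd_pct: float


def passes_gate(t: Trader) -> bool:
    # Thresholds as stated in the post; every condition must hold.
    return (
        t.sharpe >= 1.2
        and t.roi_pct >= 8.0
        and t.aum_usd >= 50_000
        and t.copiers >= 20
        and t.max_dd_pct < 30.0
    )


traders = [
    Trader("x1Boost", 1.92, 105.0, 114_000, 300, 6.6),
    # Dead micro-account from the Sharpe-only ranking; max_dd_pct is a placeholder.
    Trader("micro-account", 3.41, 0.6, 781, 1, 1.0),
]

survivors = [t.name for t in traders if passes_gate(t)]
```

The micro-account has the higher Sharpe and still fails on ROI, capital, and copier count, which is the point of gating on several dimensions at once.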

The numbers are not fake. x1Boost actually returned 105%. The displayed numbers can be real and still fail as evidence of a copyable edge. That gap is the whole post.

Candidate one: the winners are dip-buyers. Until they aren't.

I reconstructed every round-trip with FIFO accounting, joined the fills against minute-level price data, and tagged each entry by regime: was the trader buying below the 1-day moving average, a dip, or chasing price above it, a pump?
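The regime tag can be sketched as a comparison between the entry price and a trailing average. This is an illustration under assumptions: `tag_entries` is a hypothetical helper, "1-day moving average" is read as the mean of the previous 1,440 one-minute prices excluding the entry bar, and the toy usage shrinks the window to 3 so it can be followed by hand.

```python
from itertools import accumulate


def tag_entries(prices, entry_indices, window=1440):
    """Tag each entry as 'dip' (below trailing MA) or 'chase' (at or above it)."""
    prefix = [0.0] + list(accumulate(prices))  # prefix[i] = sum(prices[:i])
    tags = {}
    for i in entry_indices:
        if i < window:
            tags[i] = None  # not enough history to form the average
            continue
        ma = (prefix[i] - prefix[i - window]) / window  # mean of prices[i-window:i]
        tags[i] = "dip" if prices[i] < ma else "chase"
    return tags


# Toy minute prices with a 3-bar window instead of 1,440.
prices = [100, 102, 104, 101, 99, 105]
tags = tag_entries(prices, entry_indices=[3, 4, 5], window=3)
```

Excluding the entry bar from the average keeps the tag free of look-ahead: the label only uses prices that were known before the fill.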

The pattern was clean and beautiful.

Winners buy dips. AI-cypto-Rebalance: 89% win rate, profit factor 3.40, trades only BTC/ETH/XRP/SOL, enters 517 of 637 times below the 1-day MA. Mean reversion. Another trader, 100% BTC, 99.7% maker orders, never chases, enters 306 of 390 times on the dip.

Losers chase pumps. CryptoArabiaUAE: 91 entries above the MA in a single bull push, 0% win rate, minus $4,977, which was the entire loss on the account. Another chaser entered 107 of 107 times in an uptrend, 15% win rate, bought tops, never sold.

There it is, I thought. Buy below the MA on majors, never chase strength. A real lesson, written in 11,390 fills of other people's money.

So I tested it.

The exogenous test, and the flip

I took the lesson off the lead-trader data and asked a different question. Does "buy the dip below the 1-day MA, fade the chase above it" work as a rule on my own 12 majors, on minute data the lead traders never touched?

That is what I mean by an exogenous test: data the trader did not create and could not curate.

In-sample window: February to mid-March. Out-of-sample: mid-March to end of April. Quintiles frozen on the in-sample window so the out-of-sample data cannot leak backward. Forward returns measured at 60, 240, and 1,440 minutes. Ten basis points round-trip cost.

The thing I was hunting for was an illusion detector: does the sign of the edge stay stable when you cross from in-sample to out-of-sample?

It did not.

At the 240-minute horizon, the dip-minus-chase spread went from in-sample minus 0.082% to out-of-sample plus 0.043%. The sign flipped. The edge did not weaken. It reversed.
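The mechanics of that check can be sketched in a few lines: compute the quintile cut points once on the in-sample window, reuse them unchanged out-of-sample, and compare the sign of the spread. This is a sketch, not the harness from the post. The function names and toy data are hypothetical, the signal is assumed to be distance from the 1-day MA, and costs are left out for brevity.

```python
from statistics import mean, quantiles


def freeze_edges(in_sample_signal, n=5):
    # Cut points come from in-sample data only, so nothing leaks backward.
    return quantiles(in_sample_signal, n=n)


def dip_minus_chase(signal, fwd_ret, edges):
    dip = [r for s, r in zip(signal, fwd_ret) if s <= edges[0]]     # bottom quintile
    chase = [r for s, r in zip(signal, fwd_ret) if s > edges[-1]]   # top quintile
    return mean(dip) - mean(chase)


def sign_flipped(in_sample_spread, out_of_sample_spread):
    return in_sample_spread * out_of_sample_spread < 0


# Toy data: the dip bucket loses in-sample and wins out-of-sample.
is_signal = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
is_ret = [-0.1, 0.0, 0.0, 0.0, 0.1, -0.1, 0.0, 0.0, 0.0, 0.1]
oos_signal = [-2, -1, 0, 1, 2]
oos_ret = [0.1, 0.0, 0.0, 0.0, -0.1]

edges = freeze_edges(is_signal)
is_spread = dip_minus_chase(is_signal, is_ret, edges)
oos_spread = dip_minus_chase(oos_signal, oos_ret, edges)
flipped = sign_flipped(is_spread, oos_spread)
```

Applied to the figures quoted above, `sign_flipped(-0.082, 0.043)` is true: the spread changed sign across the boundary rather than merely shrinking.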

I did not trust that, so I built it a second way. I cloned the two archetypes as independent agents and ran them on my own shadow flow, not on the lead fills. A dip-buyer clone of AI-cypto-Rebalance. A trend-rider clone of x1Boost. Then I walk-forward tested them: train through March 1, evaluate on the bearish March to May window.

The sign reversed again. On the full bull-heavy data, the trend-rider made plus 13.9% and the dip-buyer lost 8.3%. Walk-forward into the bear window: dip-buyer now better at minus 4.9%, trend-rider now worse at minus 13.7%.

Two independent verifiers, neither of which the original track record controlled, both delivered the same verdict.

Which archetype wins depends entirely on the regime of the window you happened to look at. Bull rewards the trend-rider. Bear rewards the dip-buyer. There is no stable edge underneath. There is a coin, and the window decides which way it lands.

Why the win rate lies, mechanically

Here is the part that should bother you, because it is not about bad faith. It is about how the number is built.

The 100% BTC maker had a 77.2% win rate. That sounds elite. It is an artifact of FIFO accounting plus survivorship. When you account for round-trips first-in-first-out, every small sale taken from an old cheap lot books as a win, because the old lot was bought lower. A trader who buys and holds BTC and occasionally trims will show a gorgeous win rate by construction, regardless of whether the strategy has any edge. The losses sit in the open positions FIFO has not closed yet.

The actor did not lie. The measurement method inflated the number on the actor's behalf.
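A toy example makes the mechanism concrete. The sketch below uses hypothetical fills, not data from the audit: a trader buys three lots at rising prices, the price falls, and they trim three small pieces. FIFO matches every trim against the oldest, cheapest lot.

```python
from collections import deque


def fifo_round_trips(fills):
    """fills: list of (side, qty, price). Returns (realized PnL per matched slice, open lots)."""
    lots = deque()  # each lot is [remaining_qty, entry_price], oldest first
    realized = []
    for side, qty, price in fills:
        if side == "buy":
            lots.append([qty, price])
            continue
        remaining = qty
        while remaining > 0 and lots:
            lot = lots[0]
            take = min(remaining, lot[0])
            realized.append(take * (price - lot[1]))  # matched against the oldest lot
            lot[0] -= take
            remaining -= take
            if lot[0] == 0:
                lots.popleft()
    return realized, lots


fills = [
    ("buy", 10, 100), ("buy", 10, 200), ("buy", 10, 300),  # accumulate into a rally
    ("sell", 1, 150), ("sell", 1, 150), ("sell", 1, 150),  # trim after the drop
]
realized, open_lots = fifo_round_trips(fills)

win_rate = sum(p > 0 for p in realized) / len(realized)
mark = 150
unrealized = sum(qty * (mark - entry) for qty, entry in open_lots)
```

Every closed slice books as a win, so the win rate is 100 percent, while the open lots carry an unrealized loss eleven times larger than the realized gain. Nothing in the closed-trade record shows it.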

This is exactly the failure I wrote about in the signal-funnel teardown. A published win rate is the actor auditing itself, and the audit method quietly does the actor a favor. You cannot fix that by demanding the actor be more honest. The number was honestly computed. It is the computation that flatters.

Nine candidates, one body

I did not stop at two archetypes. I chased every angle the data offered.

The nine candidate edges were:

dip-buying majors
trend-riding momentum
BTC maker trimming
asymmetric runner management
pre-surge accumulation
BTC flush-and-bounce baskets
slow open-interest squeezes
panic accumulation context
extended-major avoidance

Eight of them died the same way. They looked like alpha in the window where I found them. They turned into beta plus risk management once a verifier I did not control got hold of them.

x1Boost's 105% was beta on a bull market plus a tight stop. The dip-buyer's 89% win rate was beta on BTC plus patience plus FIFO flattery. The asymmetric runner, a fast cut on losers and a long leash on winners, was the closest thing to real, and even that was one no-stop-loss bag away from becoming the chaser.

Nine candidates, one body underneath, wearing different costumes for different windows.

The phrase that ended up repeated four times in my own research notes was "the same death." Every promising edge died the same way.

The only thing that survived was a warning

Out of 2,505 traders, 11,390 fills, and nine candidate edges, exactly one survived out-of-sample with a stable sign.

It was negative.

The top quintile of "extended" majors, price stretched well above the 1-day MA, had forward returns that stayed reliably negative across the window boundary: in-sample minus 0.29% to out-of-sample minus 1.20%, 30.6% win rate.

The one thing that transferred was: do not chase majors that are already extended.

That is the whole yield. Not a strategy. A warning. From all that data, the only durable knowledge was about what to avoid, never about what to pursue.

I think that asymmetry is the actual law here, and it is not specific to trading.

Exogenous verification is very good at killing false positives and almost useless at minting true ones. The edges that survive an honest, actor-independent test tend to be prohibitions, because a prohibition only has to be robust in one direction. A positive edge has to survive every regime you did not test.

That is why eight "buy this" candidates flipped and the single "do not buy this" held. The verifier was never going to hand me a strategy. The most it could ever do was take bad ones away.

The track record is not the signal. The track record is the actor's self-report, and a self-report cannot be its own verification no matter how real the numbers are. The only honest question is what happens when you hand the number to something the actor does not control.

When I did that, eight times, the answer was the same. The ninth only told me what to avoid.

A track record can prove that someone made money. It does not prove that their edge survives being copied.

A published win rate is the actor auditing itself

Mike Czerwinski — Sun, 28 Jun 2026 08:29:06 +0000

A signal channel that publishes its own win rate is grading its own homework. The number it advertises comes from the part of the record that survived being shown. That does not prove fraud. It proves a measurement problem: the actor writing the record is also the actor being audited. I built the instrument that could see around it, pointed it at the channels everyone screenshots, and this is what it found.

The setup

I build autonomous crypto trading systems in Python. The one running today is live on its own strategies, and has been since June 4, 2026. But before any source earns real capital it has to clear shadow mode first: the full pipeline runs on live market data with realistic frictions, 8bps fees and 5bps slippage, every signal logged as "would have entered at X" and tracked to its outcome, no real order placed.

Shadow mode is the whole trick. It lets you measure a source against outcomes it does not control, instead of against the receipts it chooses to post.
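The friction part of shadow mode is simple arithmetic. A minimal sketch, assuming the 8bps fee and 5bps slippage are charged on both entry and exit. The post does not say whether they apply per side or per round trip, so that is an assumption, and `shadow_pnl_pct` is a hypothetical helper.

```python
FEE_BPS = 8.0        # from the post
SLIPPAGE_BPS = 5.0   # from the post


def shadow_pnl_pct(side, entry, exit_price, fee_bps=FEE_BPS, slippage_bps=SLIPPAGE_BPS):
    """Net return in percent for a trade that was logged but never placed."""
    direction = 1.0 if side == "LONG" else -1.0
    gross_pct = direction * (exit_price - entry) / entry * 100.0
    # Assumption: frictions are paid twice, once in and once out. 1 bp = 0.01 percent.
    friction_pct = 2 * (fee_bps + slippage_bps) / 100.0
    return gross_pct - friction_pct


long_net = shadow_pnl_pct("LONG", entry=100.0, exit_price=101.0)
short_net = shadow_pnl_pct("SHORT", entry=100.0, exit_price=101.0)
```

Under that assumption a 1 percent gross move nets 0.74 percent on the long side, and the same move costs a short 1.26 percent.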

Telegram was one of the first sources I wired up. Dozens of crypto signal channels, some with hundreds of thousands of subscribers, many claiming 70 to 80 percent win rates. When the bot connected it pulled in the channel history along with the live feed, so the record reaches back well before the bot existed: 9,312 messages spanning 17 months, February 2025 to June 2026.

I wanted to measure these channels properly rather than trust the screenshots. I measured them, then I dropped them. This post is the measurement that made that an easy call.

The pipeline

Most signals never reach evaluation, and where they die is itself the finding.

Telegram message received
   -> LLM parsing (DeepSeek): extract pair, side, entry, TP, SL
   -> Staleness check: is the entry still reachable?
   -> Veto filter: RSI sanity, news, Fear and Greed, regime gates
   -> Risk budget: daily loss limit, cooldown, correlation
   -> Shadow execution: log "would have entered at X", track to TP/SL/timeout

The system tracked 7 channels. Full collection, queried live from the production DB on Jun 27, 2026:

Channel	Messages	Parsed	Parse fail	Period
Crypto_Whales_Pumps_Guide	2,643	513	122	Feb 2025 - Jun 2026
Binance_Futures_Trades	2,445	164	1,852	May - Jun 2026
Trading_Crypto_Signals_Bitcoin	1,808	164	1,619	May - Jun 2026
cryptoninjas_trading_anm	1,351	241	273	Jul 2025 - Jun 2026
Tofan_Trade	1,008	222	750	May - Jun 2026
claycryp	34	8	8	Feb - Jun 2026
rarecryptosignals	23	6	4	Feb - Jun 2026
Total	9,312	1,318	4,628	Feb 2025 - Jun 2026

The gap between Messages and Parsed + Parse fail is mostly non-signal content filtered before extraction: chatter, announcements, result posts, teasers, and price updates without tradeable levels.

The funnel

Here is what happened to those 9,312 messages:

9,312   raw messages received
1,318   parseable (a valid trade idea)        <- 14.2% of raw
  109   timely (still actionable)             <- 8.3% of parseable
   17   reached a trade decision
    0   actually executed                     <- 0%

Only 14.2 percent of messages contained a parseable trade idea. The rest was noise: memes, "GM", price alerts without levels, result updates, locked teasers. And of the trade ideas that did parse, only 109 of 1,318 were still actionable by the time my pipeline could act. That is 91.7 percent stale.

A word on that number, because staleness depends entirely on what you put under the line. The 91.7 percent is timeliness measured against parseable signals: 109 of 1,318. Measured instead against the broader set of candidate messages the pipeline actually ran a staleness check on, it is 97.4 percent: 4,007 of 4,116. Both are real. They answer different questions.

The number that is wrong is 43 percent, which you get by dividing the stale count by all 9,312 raw messages, quietly swapping a staleness denominator for a raw-volume one. I am showing all three on purpose. The moment you let a single denominator go unstated, you are back to grading your own homework.
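The three figures are the same kind of ratio with different denominators. The snippet below reproduces the arithmetic using only the counts quoted above. The variable names are illustrative.

```python
raw_messages = 9_312
parseable = 1_318
timely = 109
checked = 4_116        # candidate messages that got a staleness check
stale_checked = 4_007  # of those, flagged stale

stale_vs_parseable = (parseable - timely) / parseable  # stale share of parseable signals
stale_vs_checked = stale_checked / checked             # stale share of checked candidates
stale_vs_raw = stale_checked / raw_messages            # the mismatched denominator

for label, value in [
    ("vs parseable", stale_vs_parseable),
    ("vs checked", stale_vs_checked),
    ("vs raw", stale_vs_raw),
]:
    print(f"{label}: {value:.1%}")
```

Keeping the numerator and denominator as named variables next to each printed figure is a cheap way to make sure a denominator never goes unstated.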

The reason is not slow code. It is that a broadcast channel posts a signal as the move starts, and tens of thousands of people see it at the same instant. By the time anything is parseable and checked, the information is already in the price. Staleness is not a bug in my pipeline. It is the defining property of the product.

What is actually inside the surviving signals

Of the 92 timely signals the router skipped, the rejection codes tell the story:

Rejection reason	Count	What it means
`result_message`	45	Post-trade update ("TP1 hit") not a new signal
`locked_teaser`	28	Levels hidden behind a paywall
(no reason)	19	Router skipped without classifying

Roughly 79 percent of the surviving skipped signals were not signals. They were either announcements of trades already closed or advertisements for the paid tier. I left the unclassified bucket in the table because hiding unknowns would reproduce the exact reporting problem this post is about.

A locked teaser looks like this:

SIGNAL: ETHUSDT SHORT
Entry: [Unlock in Premium]
TP:    [Unlock in Premium]
SL:    [Unlock in Premium]

The model can read the pair and the direction. Without levels it is not tradeable. The free tier exists to show you that signals exist, not what they are.

The result_message half is the same trick from the other side: flood the feed with win announcements to manufacture social proof while the entries stay paywalled. This is the mechanism kenielzep97 described as receipts that are not outcomes, caught in the act. The channel is curating its own track record in real time, and the feed makes the curation read like live flow.

The scorecard, measured against price

The live router executed zero trades. That is the timeliness funnel talking: nothing survived staleness and the veto filters in time to act. Whether the channels had any edge at all is a separate question, so I backtested the parseable signals against historical klines with the same frictions. Only 846 of the 1,318 had klines available to score against, so that is the sample.

Zero executed is about my pipeline. The scorecard below is about the source. This is the number the channels cannot post, because it comes from outside their reporting loop.

Channel	n	Win%	Avg PnL	Note
Crypto_Whales_Pumps_Guide	646	46.6%	+0.52%	Only statistically meaningful sample
cryptoninjas_trading_anm	155	45.2%	+0.11%	Marginal edge, low confidence
Binance_Futures_Trades	27	40.7%	-0.22%	Insufficient sample
claycryp	7	85.7%	+2.70%	Too small
rarecryptosignals	6	50.0%	+0.15%	Too small
Tofan_Trade	3	0%	-212%	One RIVERUSDT at -636%
Trading_Crypto_Signals_Bitcoin	2	0%	0.0%	Empty signals

PnL here is measured against each signal's stop and target model, not a spot buy-and-hold return, so a single bad move on a volatile pair can print below -100 percent. Tofan's -212 percent is one RIVERUSDT trade at -636 percent over n=3, which is a degenerate sample, not a measurement. Only the top two rows have enough trades to mean anything.

Now put the advertised number next to the measured one, for the two channels where I have both. The advertised figures are the channels' own parsed win rates from an earlier audit; the measured figures are from the backtest above.

Channel	Advertised	Measured	n (measured)
Crypto_Whales_Pumps_Guide	78.9%	46.6%	646
cryptoninjas_trading_anm	76.3%	45.2%	155

I want to be precise about what this gap is and is not. It is not a fabricated win rate. Crypto_Whales actually cleared a positive +0.52 percent average after fees. The gap is survivorship plus staleness: the advertised number is computed over the trades the channel chose to show, after the fact, on a record it authored. The measured number is computed over everything, against prices it did not control.

Same source, two different records, because two different parties held the pen.

The finding the channel cannot see about itself

For Crypto_Whales, the only channel with enough data, breaking down by direction and year:

Year	Side	n	Win%	Avg PnL
2025	LONG	365	46.3%	+1.06%
2025	SHORT	86	54.7%	+1.83%
2026	LONG	120	28.3%	-2.23%
2026	SHORT	75	68.0%	+0.77%

SHORTs beat LONGs in both years, and the 2026 LONG collapse tracks a regime shift where altcoin longs got crushed. The edge in the data was on the short side. The channel brands itself as a "whale pump" tracker, which points its readers at longs. The free tier was advertising the opposite direction to where the measured edge actually was.

Not out of malice. The channel has no way to know this, because it never measures its own outcomes against price. It only sees the trades it posted.

This is the whole point. Without tagging BTC regime at the moment each signal arrived, the 2026 collapse would have looked like the channel getting worse. With it, you can see it was a regime effect that any long-biased source would have suffered. Regime context only exists if you stamp it at signal time. Reconstruct it afterward and you inherit the same blind spot as the channel.

Why a published win rate cannot audit itself

Every layer here is the same shape. The channel decides which trades to announce and also reports on how those trades did. The decider and the reporter are the same party, so the record is flattering by construction, the same way a compliance checker that keeps signing off on its own work looks clean to everything downstream.

Arpit Gupta put the general version of it well: any system where the component that decides to act is also the component that reports on whether it should have is structurally blind to this exact failure.

The only reason I could see any of it is that the measurement lived somewhere the channel could not write to. Shadow mode against real prices is the external observer. Pull that out and you are left grading the channel on the channel's own receipts, which is no measurement at all.

Why I moved on

In May 2026 I deprecated Telegram as a source and pivoted to bot-footprint signals: liquidation cascades, open-interest surges, funding divergence, on-chain whale tape.

The intuition is to stop following what channels say and start following what large traders actually do, as revealed by their market footprint. A footprint is a consequence the actor cannot author. A win-rate screenshot is a record the actor authors completely.

The 97 percent staleness rate is empirical evidence that by the time a broadcast reaches you, the information is usually already priced in.

The honest claim

I did not prove the channels lie. I proved that the record I was allowed to check was incomplete in exactly the direction that makes the source look safer than it is. The advertised win rate is real, in the same way a green screenshot is real. It is a true record of the moments someone chose to write down.

The outcome is what happens after the last update, and that is the part nobody posts.

If you publish the win rate, you do not get to be the audit of it.

Update — July 2, 2026. I ran a hostile agent over my own posts to find what is false, not to polish them. It flagged two things here that need correcting. Full writeup: "Every Post I Publish Gets AI Review" (https://dev.to/jugeni/every-post-i-publish-gets-ai-review-a-hostile-agent-still-found-the-holes-in-twenty-minutes-6e).

1. The 97.4% staleness figure is an artifact, not a second valid measurement. I wrote that staleness is 91.7% against parseable signals (109/1,318) and 97.4% against a broader denominator (4,007/4,116), and that "both are real." The 97.4% is not real in the way I implied. Those 4,007 stale rows come from a one-time relabel I ran on 2026-05-18, not from the pipeline detecting staleness live. There are no organically stale rows after that date. The number that stands is 91.7%. Read the 97.4% as a maintenance artifact and disregard it.

2. The "Advertised" column is my own number, not the channels'. I described the advertised figures as "the channels' own parsed win rates from an earlier audit." That is wrong. Both columns are mine: "Advertised" is my earlier backtest measured as TP1 first-touch, "Measured" is a fixed-horizon mark-to-market with no take-profit, stop-loss, or fees. The gap is between two of my own measurement conventions, not between a channel's claim and my result. That weakens the "channels vs reality" reading, and I should not have framed it that way.

Leaving the original text below unchanged, so the record of the mistake stays visible. A published record is the actor auditing itself. Here the actor got audited.

My trading bot said it was trading for four days... he was lying

Mike Czerwinski — Thu, 25 Jun 2026 18:11:32 +0000

Twenty-five days on Hyperliquid. Sixty-five closed trades. P&L: -$9.21.

Turns out that was the smallest wrong thing about it.

The landing page showed -$7.72 because it uses a different P&L formula and excludes two open positions. Either number is small. Both numbers were also wrong about what they were telling me.

I spent yesterday auditing every trade. The audit produced three findings I did not expect. Each one was a different kind of wrong.

This is the first post in a series about ziom trader, my small AI-assisted crypto trading bot. "Ziom" is Polish for buddy, mate, or dude depending on who's talking. The name is unserious on purpose. The system is not.

This is not a "watch me print money" series. The number is negative. Good.

The point of the series is to track what happens when an LLM-assisted trading system moves from backtests and dashboards into live execution: where the bot is wrong, where the dashboard is wrong, where I am wrong, and which layer gets to prove it.

Frame

The natural first read of -$9.21 is "the strategy is losing money." That read assumes the displayed P&L attributes to the strategy. It does not.

The number that shows up at the surface is the sum of at least three different layers: the strategy itself, the execution wrapper around it, and the monitoring layer that observes both. Each layer can author its own kind of failure. The displayed number compresses all three into a single dollar figure and loses the attribution on the way up.

The framing that landed for me, from Daniel Nevoigt, is that methodology overview without forward-correlation disclosure is a log with good intentions. The same applies to P&L: total P&L without layer-attribution disclosure is a log with good intentions. You see the number. You do not see where it came from.

Here is what I found when I forced the attribution.

Layer 1: Shadow does not equal live

Before deploying any lane, the system runs against backtested data. The shadow says "this strategy returns X over Y trades." The deploy decision is taken when the shadow looks healthy. The live then runs and produces a different number.

The label for that difference is not "the strategy disappointed." The shadow is one authority. The live is a different authority. The market authored the failure criterion, not the strategy.

This is the version of the seam Christopher Maher named: the bite check did not catch itself, a different rail caught it. Shadow data cannot author its own failure. Only the live market can. And the live market does not tell you which part of the gap is variance, which part is regime drift, and which part is a parameter you forgot to tune.

In this window the funding_divergence_long lane had a shadow edge of +0.355%/trade across n=660 backtested trades, CI95 [+0.085, +0.625]. The live result for the same lane was -1.10%/trade across 29 live trades. The gap is 1.46 percentage points. At sigma about 2% per trade and n=29, the standard error is about 0.37 points, so that gap is 3.9 standard errors. Statistically significant, and negative.
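The standard-error arithmetic can be reproduced directly from the figures in the paragraph. This sketch follows the post's own simplification: it treats the shadow edge as a fixed reference and ignores the uncertainty in the shadow estimate itself.

```python
from math import sqrt

shadow_edge_pct = 0.355   # per trade, from the backtest
live_edge_pct = -1.10     # per trade, from 29 live trades
sigma_pct = 2.0           # approximate per-trade standard deviation
n_live = 29

gap = shadow_edge_pct - live_edge_pct      # about 1.46 percentage points
standard_error = sigma_pct / sqrt(n_live)  # about 0.37 percentage points
z = gap / standard_error                   # about 3.9 standard errors

print(f"gap={gap:.3f}pp  se={standard_error:.3f}pp  z={z:.1f}")
```

A stricter version would fold the shadow interval into the standard error as well, which would pull the figure down somewhat.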

That does not prove the strategy is broken. It proves the shadow and the live disagreed by more than variance would explain. Three explanations remain in play, and the audit can narrow but not resolve them:

June 15 ADA outlier was -$2.25, -5.64%, which is 3.6 sigma from shadow mean. One trade is doing structural work in a small sample.
Edge is not durable across this BTC window. June saw recovery to reversal.
Exit configuration choices let losers run.

50 to 100 more trades are needed to separate these. I am not separating them today. The label for this section is AMBIGUOUS and I am pinning it to that label until the sample doubles.

Layer 2: Live displayed does not equal strategy true

Inside the -$9.21, 60% is not strategy. It is system overhead with git commit refs.

The breakdown:

Cause	Trades	Loss	Commit ref
oi_surge LONG with no regime gate, ran in bear	3	-$1.45	gate added `2d10e326` Jun 11
whale lane missing max_per_coin cap	6	-$0.95	cap added `5bd9eaaf` Jun 9
whale_footprint as dead lane before disarm	26	-$2.71	disarmed `18d937aa` Jun 13
oi_surge LONG as dead lane, 1 trade Jun 12	1	-$0.38	not explicitly disarmed in this window

Total system overhead: -$5.49 across 36 trades, 60% of the loss.
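The totals can be rechecked directly from the rows above. This is a small sketch with the table values copied by hand; the short cause labels are abbreviations for illustration.

```python
# (cause, trades, loss_usd) copied from the breakdown table
overhead_rows = [
    ("oi_surge LONG, no regime gate", 3, -1.45),
    ("whale lane, no max_per_coin cap", 6, -0.95),
    ("whale_footprint dead lane", 26, -2.71),
    ("oi_surge LONG dead lane", 1, -0.38),
]
displayed_loss = -9.21

overhead_trades = sum(trades for _, trades, _ in overhead_rows)
overhead_loss = round(sum(loss for _, _, loss in overhead_rows), 2)
overhead_share = overhead_loss / displayed_loss  # fraction of the displayed loss

print(overhead_trades, overhead_loss, f"{overhead_share:.0%}")
```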

Sixty percent of the loss has an audit trail. Most of it has a git commit. All of it is a different kind of wrong than "the signal failed."

Each line has either a commit hash that closes the gap or a seam that the audit made visible. None of it is the strategy in the sense of "the signal was wrong." All of it is the system in the sense of "the rail that would have stopped this did not exist yet."

Sean Burn names it right: show the seam, do not hide it. Show that 60% of this loss is closed by commits that exist now and did not exist on June 6. Do not collapse "system" and "strategy" into one bucket called "the bot lost money." They are different authors of the same dollar.

The remaining 40% is funding_divergence_long (-$4.15 across 32 trades) and oi_surge_fade (+$0.13 across 2 trades). The funding_divergence_long line is the one with the shadow-vs-live gap from Layer 1. Without the ADA outlier and without the execution gap I will describe next, the lane runs at -$1.47 across 28 trades, or -$0.05 / trade. That is the noise floor for this sample size, not strategy quality. Treat it that way.

Layer 3: Visible live does not equal what the driver attempted

The third finding had no warning. The first two were inventory work. This one was structural.

Between June 18 10:01 UTC and June 22 16:01 UTC, the funding_divergence_long driver was armed. The run_summary events in the database show armed=true, placed=1 for the entire 4-day window, roughly 20 to 30 cycles. The positions table for the same window shows zero new fills. The events table shows zero execution_error events.

The dashboard read placed=1. The exchange acknowledgement layer wrote placed_ok=0. The error path that would have written an execution_error row never ran, because the exception was caught somewhere upstream without incrementing the error counter.

For four days, the driver said it was trading. The exchange said it was not.
The events table said nothing.

The audit trail itself was lying.

The framing from L. Cordero applies: trust retrieval, verify recall. The placed=1 counter was the system retrieving its own belief. The actual position state was the recall, and the recall path was broken. The two layers diverged silently, and the dashboard was reading the wrong one.

The framing from Todd Hendricks applies: big number, wrong metric. placed=1 is a big number. placed_ok=0 is the meaningful one. The system displayed the big one. I deployed the wrong dashboard.

The fix landed today, after the audit, after a peer who runs a different read-the-chain product confirmed independently that the seam between an attempted read and a verified read is where this class of bug lives. His phrase for the right default: incomplete by default. Anything not explicitly classified as a verified result is unknown, not zero. Zero and unknown render visually distinct. The pipeline carries the distinction all the way to the surface.
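A minimal sketch of what "incomplete by default" can look like at the reconciliation step. The field names placed and placed_ok come from the post; the CycleRecord shape, the new_fills field, and the three labels are assumptions for illustration, not the system's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CycleRecord:
    placed: int                 # what the driver says it attempted
    placed_ok: Optional[int]    # exchange-confirmed count; None if never recorded
    new_fills: int              # rows that appeared in the positions table


def classify_cycle(rec: CycleRecord) -> str:
    """Return VERIFIED, DIVERGED, or UNKNOWN for one driver cycle."""
    if rec.placed_ok is None:
        return "UNKNOWN"        # no confirmation recorded is not the same as zero
    if rec.placed == rec.placed_ok == rec.new_fills:
        return "VERIFIED"
    return "DIVERGED"           # driver belief and exchange state disagree


# The four-day window described above: driver says 1, exchange says 0, no fills.
print(classify_cycle(CycleRecord(placed=1, placed_ok=0, new_fills=0)))
```

The point of the third label is that a missing confirmation never silently collapses into either success or zero.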

Impact ESTIMATED: 20 to 30 missed signals, ~$15 notional each. At the size of the shadow edge, that is roughly $1 to $1.50 of P&L, gain or loss, that never reached the displayed number. The honest label is ESTIMATED because I cannot know which way the missed trades would have gone.

What the audit changes

The displayed loss is -$9.21. The strategy contribution to that loss, after subtracting system overhead and the execution gap and the single 3.6-sigma outlier, is approximately -$1.47 across 28 trades, or -$0.05 per trade. That is noise. The sample is too small to call the strategy good or bad. Forward-test budget: 50 to 100 more trades before any strategy-quality verdict.

The system overhead is closed. The commits exist. The next 50 to 100 trades will run with the regime gate, the max_per_coin cap, the disarmed dead lanes, the corrected verification rail, and the current active lane configuration. If those run and the lane is still -$0.10/trade or worse, the strategy is the problem, not the rails. If they run and the lane comes in at +$0.05/trade or better, the shadow edge held and the previous loss was the rails.

I am locking the test budget in advance: if the next 50 trades come in at -$0.10/trade or worse, I retract the post-fix optimism in this post. The bet is on the rails being the issue, not the signal. I will publish the next breakdown either way.
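Writing the rule down as code is one way to keep it locked. This is a sketch under stated assumptions: the -$0.10 and +$0.05 per-trade thresholds and the 50-trade minimum come from the text, while the function name, the label strings, and the AMBIGUOUS band between the two thresholds are illustrative additions.

```python
def forward_test_verdict(total_pnl_usd: float, n_trades: int, min_trades: int = 50) -> str:
    """Apply the pre-committed thresholds to a post-fix forward sample."""
    if n_trades < min_trades:
        return "INSUFFICIENT_SAMPLE"       # no verdict before the budget is spent
    per_trade = total_pnl_usd / n_trades
    if per_trade <= -0.10:
        return "STRATEGY_IS_THE_PROBLEM"   # rails are fixed, lane still loses
    if per_trade >= 0.05:
        return "SHADOW_EDGE_HELD"          # previous loss is attributed to the rails
    return "AMBIGUOUS"                     # assumed label for the band in between


print(forward_test_verdict(-6.0, 50))
print(forward_test_verdict(1.38, 12))
```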

Post-audit check

Added 2026-06-25 around 19:15 CEST, roughly 12 hours after the audit opened. I checked.

The first post-audit window did not reproduce the previous failure pattern.

The oi_surge_fade_live SHORT lane produced approximately +$1.38 across 12 post-audit trades, with 10 of 12 green.

That includes AVAX, UNI, ADA, ATOM, FIL, and TIA. The important part is not that the number is green. The important part is that the result came after the audit separated attempted placement from exchange-confirmed placement.

The early read is positive, but narrow.

This is not "the fixes worked." It is "the first post-audit window did not immediately repeat the old bug shape, and the active lane produced a green early window under the new reporting rail."

Those are different claims.

I am only making that narrow claim.

What this is not

This is not a how-I-made-money post. The number is negative. It is not large. The strategy is unverified. The audit caught real bugs with commit refs but did not prove the strategy works.

This is also not a how-AI-coded-my-bot post. Claude Code wrote large parts of this system. The audit found multiple places where the same author, me with model assistance, wrote both the action layer and the layer that was supposed to verify the action. Single-author audit trails lie. That part is on the system design, not on the model.

What this is, is the breakdown that should sit underneath any small displayed number from any algorithmic trading or autonomous agent system. Three different kinds of wrong. Three different authors of the same dollar. The displayed number is one of them. The other two are invisible by default.

Series contract

This series will track ziom trader as a live system, not as a performance claim.

I will publish the boring parts: small losses, missed fills, broken counters, stale assumptions, dashboard lies, audit fixes, and retractions when the next sample contradicts the previous read.

No alpha claims. No "the bot works" until the forward sample earns that sentence. No hiding the layer that authored the failure.

Peer credits

The vocabulary that made this audit possible came from people writing about adjacent problems in adjacent domains.

None of these people were writing about trading bots. Some were writing about incident reports, some about agent systems, one about a read-chain product.

The overlap was not planned. That's the point.

Daniel Nevoigt: "methodology overview without forward-correlation disclosure is a log with good intentions"
Christopher Maher: "the bite check did not catch itself, a different rail caught it"
L. Cordero: "trust retrieval, verify recall"
Sean Burn: "show the seam, do not hide it"
Todd Hendricks: "big number, wrong metric"
TxDesk, ratifying the placed=1/placed_ok=0 framing in a different domain this morning: "incomplete by default"

That is why I am leaving the credits in the post. The vocabulary did not decorate the audit. It changed what the audit could see.

What you can take from this

If you run a live system, look for the layer where your own code writes both the action and the verification. That is where this class of bug lives. The fix is not only better testing. The fix is having the action layer and the verification layer written by different code paths, ideally by different authors, with the verification path explicitly classifying anything it did not see as incomplete by default.

Render the difference, not the success. Five attempted and three succeeded is a normal display state. Five attempted and unknown succeeded is the state your dashboard probably hides today.
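One way to keep zero and unknown visually distinct at the display layer, as a hedged sketch. The function name and the output wording are hypothetical; the only idea taken from the text is that an unconfirmed count must not render as a number.

```python
from typing import Optional


def render_fills(attempted: int, confirmed: Optional[int]) -> str:
    """Render attempted vs confirmed counts without collapsing unknown into zero."""
    if confirmed is None:
        return f"{attempted} attempted / ? confirmed (UNKNOWN)"
    return f"{attempted} attempted / {confirmed} confirmed"


print(render_fills(5, 3))
print(render_fills(5, None))
print(render_fills(5, 0))
```

Using None rather than 0 for the unconfirmed case forces every caller to decide what to show, instead of inheriting a default that looks like a result.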

That is the line the audit drew.

If you are the bot, you do not get to be the auditor.

I built jugeni two weeks ago and I have no idea what it does anymore

Mike Czerwinski — Wed, 24 Jun 2026 19:20:14 +0000

I sat down to write a feature post about jugeni.

I have used it every day for two weeks. I cannot describe what it does anymore.

That sounds like a failure of positioning. It may be the point.

Every time I try to name a feature, I realize I have not consciously touched that layer in days. The recap is just there. The decision is already locked. The compact gate says no before I waste another context window. The note is saved before I remember I should save it.

A tool you can describe is still asking for attention.

A tool you forget may have become infrastructure.

What happened when I tried to write the feature list

I opened my own notes. Half of them I do not remember writing.

The decisions ledger: I opened it yesterday to lock a decision, found out the decision was already locked, by me, three days ago. I do not remember locking it.

The vendor stack: I have not had to ask “where are we with the bank paperwork?” in four days. The reminder surfaces in the main thread, on its own, when the state changes. I type jugeni recap only when I want the wider view. The mental energy I used to spend reconstructing context is gone, and I did not notice when it left.

The compact gate: this afternoon I tried to run /compact and a hook blocked it.

5h burn 60%, compact would cost another 7%, denied.

That line means the session had already burned 60% of its useful working context after five hours.

I noticed the block after I read the hook log, not in the moment. The decision was made for me. I would have approved it if I had been asked. I was not asked.

That is the point.

The paradox

A tool you can describe is a tool you still have to think about.

The act of writing a feature post is evidence that the feature is not yet smooth. The post you can write about your tool is the post about the layer that has not yet disappeared into the work.

Which makes this post structurally awkward, and also kind of the joke.

I have been writing about this property for operators for a few weeks. The operators worth trusting are the ones whose operating disappears into the result. You do not see the work. You see the work landing. The smooth is the absence of the seam.

I did not expect to land in the same property from the other side.

Tools that disappear into the operator’s work share that shape. The thing you can name is the thing you have not yet absorbed. The thing you cannot name is doing some of the work for you.

Which means the success metric for jugeni is hostile to writing a feature page. Every absorbed feature is one that resists description. Every describable feature is one I am still actively driving.

The honest gap

I notice jugeni only when it breaks.

A note that did not save. A recap that missed a thread. A hook that fired at the wrong moment. A decision that should have been surfaced and was not.

Failure is the only feedback channel I have for the parts that work.

That is a problem worth saying out loud. I do not know what jugeni is silently failing at. Smooth is also where audit becomes hardest. The same property that makes infrastructure good at carrying work makes it bad at announcing its own gaps.

Then the problem recurses. If the tool disappears, the audit layer also has to be quiet. But if the audit layer is quiet, I need a way to know when it failed.

I do not have an answer for this yet.

I have a habit of writing down what I noticed when something broke, and that habit is itself a thread in jugeni. So the tool catches some of its own failures by feeding them back into the same ledger.

That is not a proof.

It is a floor, and I will take it.

What it is, for now

Two weeks ago jugeni was a folder with three Markdown files and a shell alias.

It is now infrastructure for the way I work.

Plain Markdown notes. Flat files. No database. A decisions ledger. Active threads. Hooks for budget and compaction. Recaps over vendor state and project state. Status fields that survive the session boundary. Enough boring scaffolding that I can stop reconstructing context from memory.

The strange part is not that any single feature is complicated.

The strange part is that the tool became less visible as the pieces started carrying each other.

I did not feel the transition.

I would not have noticed it if I had not tried to write the post.

Closing

If you built infrastructure that disappeared, you probably built something right.

If you can still feel it working, it is not done.

I sat down to describe jugeni and mostly found the shape of what had stopped asking for my attention.

I think that is the post.

jugeni is a CLI for operator-side discipline: notes, decisions, recaps, hooks, and session state that survive the chat window. Plain Markdown notes, flat files, no database. Two weeks old, currently at 250+ notes, around 30 locked decisions, and 12 active threads. Source release planned once the edges stop moving daily.

A Verifier Role Is Not a Verified Verifier

Mike Czerwinski — Wed, 24 Jun 2026 10:03:38 +0000

TRINITY ships a Verifier role. Did anyone test it?

The field is shipping Verifier roles faster than it's shipping Verifier testing.

Sakana AI just put two things on the table at once. TRINITY — an ICLR 2026 paper that formalizes a coordinator over a pool of LLMs, with one of the assigned roles called Verifier. And Fugu — a production multi-agent orchestration system delivered as a single OpenAI-compatible endpoint, presented as the direction Sakana's research is heading toward product. The Fugu technical report covers the orchestration design choices and benchmark methodology.

This is the moment agent verification stops being an afterthought and becomes a named, first-class role in a paper from a top-tier lab. That matters. So does what's missing from the evaluation.

This is not a critique of Sakana's results. It is a narrower question, and the cleanest way I can put it is this: a role label is a declaration of architecture, not evidence of capability. Once a system names a Verifier role, what evidence would show that the verifier detects planted errors rather than just participates in orchestration?

What TRINITY names

Section 3.2 of the TRINITY paper says:

The verifier checks whether the accumulated solution in 𝒞_{k−1} is correct, complete, and responsive to Q. It outputs a judgment u_k ∈ {ACCEPT, REVISE} and an optional diagnosis δ_k.

That's the contract. The Verifier consumes the accumulated transcript and the original query, returns a binary judgment, and may offer a diagnosis. The role is invoked by the coordinator, which assigns it to one of seven candidate models (GPT-5, Gemini-2.5-pro, Claude-4-Sonnet, and four open-source models).
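The contract as quoted can be written out as a type signature. This is an illustrative sketch, not TRINITY's code: the class and function names are hypothetical, and the transcript and query are reduced to plain strings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Judgment(Enum):
    ACCEPT = "ACCEPT"
    REVISE = "REVISE"


@dataclass(frozen=True)
class VerifierOutput:
    judgment: Judgment               # u_k in the paper's notation
    diagnosis: Optional[str] = None  # delta_k, optional


# A verifier maps (accumulated transcript, original query) to an output.
Verifier = Callable[[str, str], VerifierOutput]


def always_accept(transcript: str, query: str) -> VerifierOutput:
    """A degenerate verifier: it satisfies the contract without checking anything."""
    return VerifierOutput(Judgment.ACCEPT)


print(always_accept("2 + 2 = 5", "What is 2 + 2?").judgment.value)
```

The signature takes no external reference, so a function that never inspects its inputs is still a valid Verifier under this contract.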

Two things the paper does not specify.

Against what independent reference does the Verifier check correctness? The transcript and the query are both inside the loop. There's no external ground truth, no held-out spec, no second-channel oracle. The Verifier is checking the system's own output against the system's own restatement of the problem.

Whether the Verifier shares a base model with the Thinker or Worker. The paper describes role-to-model assignment as something the coordinator learns, but does not specify whether the same model can be assigned Thinker and Verifier in the same run. In a seven-model pool, the probability of role overlap across rounds is not negligible. If GPT-5 is the Thinker on round k and the Verifier on round k+1, the second opinion shares a brain with the first.

The Fugu technical report adds a related design choice for the latency-aware variant:

The selected model is always invoked as a worker, which reduces the coordination space and lowers orchestration latency.

So in the latency-aware Fugu variant described there, roles are dropped: the selected model is always invoked as a worker. Fugu-Ultra, the quality-first variant, leans on multi-agent coordination with isolation through access lists. The TRINITY Verifier primitive is the research artifact. The production system has chosen, for one variant, to drop roles for latency reasons. That's a signal in itself.

The work is impressive

Before the gap, the receipts. Fugu lands real wins on Sakana's benchmark page:

Rubik's Cube Solver: Fugu-Ultra solves all 300 cubes; every other frontier model returns zero valid solutions. That is actual domination, not statistical noise.
Classical Japanese Text: Fugu-Ultra at NED 0.80 versus 0.24 for the next best competitor. More than 3× better on a language task most frontier models barely engage with.
SWE Bench Pro: Fugu-Ultra 73.7 vs Opus 4.8 at 69.2. A 4.5-point margin on a hard software-engineering benchmark.

These are real numbers and they take real engineering. The same calibration gap that this post is about also shows up in benchmark interpretation: Fugu's blindfold-chess result is against Stockfish set to 2100 Elo, honest in the paper and flattened in the social-media echo. The receipts are real; the framing they travel with is the part that needs reading carefully.

The question is narrower: whether the Verifier role inside the orchestration is doing work, or whether the end-task accuracy is rising for reasons that have nothing to do with verification.

Benchmark wins are not Verifier tests

End-task accuracy and verifier reliability are not the same measurement. A system can post strong benchmark numbers because:

the Thinker and Worker are individually strong frontier models, and the coordinator routes them well
the model pool diversifies failure modes through routing, not through verification
the Verifier rubber-stamps most outputs, and the small number of REVISE rounds happen to catch the cases that matter
the Verifier never rejects, and the system still wins because the Workers are already good enough

In all four cases, the benchmark goes up. In only one of them is the Verifier doing what the role name suggests. There's no way to distinguish them from end-task accuracy alone.

This is the same shape as the argument in my companion piece on quorum verification: independence is the assumption nobody verifies. Here it's one floor up. A Verifier role that's never tested under planted errors is not a verifier; it's a third opinion participating in orchestration. That can still be useful. It's not what the word verifier promises.

Tool-use moves the verification boundary, it doesn't remove it

Here's the natural objection outside the chess example: but agentic systems can write code, call tools, run solvers, and check outputs. The model doesn't need to calculate everything internally; the tool can do the calculation.

Correct. And that is exactly where the verification boundary has to move — not disappear.

Fugu's blindfold chess example is actually the opposite case: the technical report says the task does not use an agentic scaffold, every model is queried directly through its bare API, and the board is never restated. That makes the chess result a cleaner long-context and state-tracking result, not a tool-use result. Worth flagging, because the social-media echo around "LLM beats Stockfish" sometimes assumes the opposite.

But in the broader TRINITY/Fugu direction — coding, AutoResearch, CAD, tool-using agent workflows — tool augmentation introduces three new failure modes, each of which needs its own verification:

The agent writes buggy tool code. Off-by-one indexing, wrong constant, swapped variable in the cost function. The tool runs without raising an exception. The output is silently wrong. End-task accuracy on a held-out test might still look fine if the bug doesn't trigger on the test distribution.
The agent misparses tool output. The tool returns the correct answer. The agent transcribes it as something close but wrong — a digit, a sign, a unit. The tool worked correctly. The integration boundary broke.
The agent routes to the wrong tool, or at the wrong moment. Asked about A, calls a tool for B. Asked about B, tries to reason from training data instead of calling the tool that exists. Routing is a model-side decision and can be fluky in long-context, multi-step pipelines.

TRINITY's Verifier checks the final accumulated solution against the original query. It does not — in the contract as written — check tool-call provenance, parse fidelity, or routing decisions. If the agent silently garbles in the middle of the pipeline and the final answer still pattern-matches the expected shape, the Verifier returns ACCEPT.

Single point of failure wearing a tool-augmented quorum costume. Same shape as the verification gap in non-tool agents, with more boundaries that need their own planted-fault tests.

The missing experiment: planted Verifier faults

What would it take to actually evaluate the Verifier role? These are not moonshot experiments. They are appendix-sized evaluations.

1. Detection power. Take 200 examples from a benchmark TRINITY already runs.

Feed the Verifier the correct solution. Measure ACCEPT rate. Should be near 1.0.
Feed the Verifier a subtly wrong solution: off-by-one, wrong constant, swapped variable, plausible structure with a flipped sign somewhere. Measure REVISE rate, the contract's only rejection signal. A real verifier should flag most of these.
Feed the Verifier a plausible-but-irrelevant solution: well-formatted answer to a different question. Measure REVISE rate.

A Verifier that ACCEPTs all three categories at similar rates is not detecting anything. It's voting.
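A minimal harness for the detection-power measurement, assuming the verifier can be called as a function of (transcript, query) that returns the string ACCEPT or REVISE. The case format, the toy examples, and the rubber_stamp stand-in are hypothetical; a real run would draw its 200 cases from an existing benchmark.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# (condition, transcript, query); condition is "correct", "subtly_wrong", or "irrelevant"
Case = Tuple[str, str, str]


def revise_rates(verifier: Callable[[str, str], str], cases: Iterable[Case]) -> Dict[str, float]:
    """Fraction of cases per condition on which the verifier returns REVISE."""
    totals, revised = Counter(), Counter()
    for condition, transcript, query in cases:
        totals[condition] += 1
        if verifier(transcript, query) == "REVISE":
            revised[condition] += 1
    return {c: revised[c] / totals[c] for c in totals}


def rubber_stamp(transcript: str, query: str) -> str:
    return "ACCEPT"  # stand-in for a verifier that only votes


cases = [
    ("correct", "2 + 2 = 4", "What is 2 + 2?"),
    ("subtly_wrong", "2 + 2 = 5", "What is 2 + 2?"),
    ("irrelevant", "Paris is the capital of France.", "What is 2 + 2?"),
]
rates = revise_rates(rubber_stamp, cases)
print(rates)
```

A flat profile across the three conditions, as the stand-in produces here, is the signature of voting rather than detection.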

2. Independence test. Run the same 200 examples twice:

Verifier uses the same base model as the Thinker on the round being checked.
Verifier uses a different base model — different family, different pretraining data.

Compare REVISE rates on the planted-wrong condition. If they're the same, role assignment isn't load-bearing: the Verifier behaves the same way regardless of whether it shares a brain with the Thinker. If the cross-family Verifier returns REVISE more often, independence is doing real work and should be promoted from a soft preference to a hard constraint in role assignment.
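To decide whether two REVISE rates actually differ, a pooled two-proportion z-statistic is one standard choice. This is a sketch of that statistic, not a method from the paper, and the counts in the example are made-up inputs for illustration, not results.

```python
import math


def two_proportion_z(revised_a: int, n_a: int, revised_b: int, n_b: int) -> float:
    """z-statistic for the difference in REVISE rates (b minus a), pooled variance."""
    p_a, p_b = revised_a / n_a, revised_b / n_b
    pooled = (revised_a + revised_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both arms all-ACCEPT or all-REVISE: no measurable difference
    return (p_b - p_a) / se


# Hypothetical counts: same-family verifier flags 90 of 200, cross-family flags 130 of 200.
z = two_proportion_z(90, 200, 130, 200)
print(f"z = {z:.2f}")
```

With 200 planted-wrong examples per arm, a gap of that size would be far outside noise; a gap of a few examples would not.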

3. Tool-boundary test. For pipelines that use tools, plant errors at the boundary:

Correct tool call, correct output, mistranscribed into the transcript.
Wrong tool selected, plausible output, passed forward as if correct.
Buggy tool code that produces consistent-looking but wrong results.

Measure how often the Verifier catches each. A Verifier that only checks the final accumulated solution and not the pipeline that produced it will miss most of these by construction.
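Planting a boundary fault can be as small as corrupting one character of a correct tool output before it enters the transcript. The two helpers below are hypothetical examples of such corruptions (a changed digit and a flipped sign); they operate on plain strings and assume nothing about any particular tool format.

```python
import random
import re


def mistranscribe_number(tool_output: str, rng: random.Random) -> str:
    """Change one digit in the first integer found, simulating a transcription slip."""
    match = re.search(r"\d+", tool_output)
    if match is None:
        return tool_output  # nothing numeric to corrupt
    digits = list(match.group())
    i = rng.randrange(len(digits))
    digits[i] = rng.choice([d for d in "0123456789" if d != digits[i]])
    return tool_output[:match.start()] + "".join(digits) + tool_output[match.end():]


def flip_sign(tool_output: str) -> str:
    """Flip the sign of the first number found."""
    match = re.search(r"-?\d+(?:\.\d+)?", tool_output)
    if match is None:
        return tool_output
    number = match.group()
    flipped = number[1:] if number.startswith("-") else "-" + number
    return tool_output[:match.start()] + flipped + tool_output[match.end():]


rng = random.Random(0)  # seeded so planted faults are reproducible
print(mistranscribe_number("total = 1234 units", rng))
print(flip_sign("delta = 42"))
```

Each corrupted output keeps the expected shape, which is exactly the case a final-answer check is least likely to catch.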

None of these experiments require new infrastructure. They require deciding to run them.

The pattern is structural

The pattern shows up in several independent stacks within a single week. A second-view discipline in agentic IDE tooling. A verifier shape with cadence and externally-authored constraints in another operator's open framework. An apply/advisory split in a third. Now an explicit Verifier role in a frontier lab's ICLR paper, paired with a production endpoint.

The vocabularies differ. The shape is the same. At every layer, the system names verification, and at no layer is the verifier itself tested as if it could be wrong. The role label is doing the work the capability evidence has not yet been asked to do.

This isn't a coordination problem. It's a category gap: the field's verifiers are evaluated by the same kind of evidence — end-task accuracy — that they're supposed to provide independent commentary on. When the verifier and the system it verifies are scored by the same metric, there's no room for the verifier to disagree usefully. ACCEPT becomes the equilibrium.

The companion piece named this one floor below: independence is the assumption nobody verifies. TRINITY is the same gap one floor up, with a name on it. Naming the role doesn't close the loop. Testing the role under planted faults does.

Close

Credit where it is due. The Sakana AI team did the field a service by making the Verifier a first-class role in TRINITY, and by shipping Fugu as a production-grade multi-agent endpoint with the technical report attached. Both artifacts move the conversation forward. The next move is testing the Verifier as one. The experiments above are small, public, and reproducible against the seven-model pool the paper already uses.

Until then, every benchmark win where a Verifier was in the loop carries an asterisk. The system worked. Whether the Verifier inside the system worked is a separate question, and it has not been asked yet.

A role label is a declaration of architecture. A planted-fault eval is the cheapest possible piece of evidence that the architecture is doing what the label claims. The field is shipping Verifier roles faster than it's shipping Verifier testing. Sakana is in a position to change that — they have the orchestrator, the model pool, the benchmarks, and the engineering bench. The planted-fault eval would land in a single appendix. What it would tell us is whether the third opinion is a verifier or a vote.

Companion piece: a-quorum-costume-why-agent-verification-needs-fault-injection — the same gap one floor below, from operator stack instead of production orchestration.

I built a football bot that doesn't watch football. It's #2 in our World Cup league.

Mike Czerwinski — Tue, 23 Jun 2026 19:17:46 +0000

The bot is sitting in second place out of fourteen. Three points behind the leader.

The thirteen humans in this league watch a lot more football than I do, and the leader watches more football than most of them. The bot has never seen a match.

The setup

A friends-only prediction league, World Cup 2026, thirteen humans and one bot. The league is private, so I won't link it. The bot's name in the standings is "mike," same as mine, which the others find funnier than I do.

I wrote two files. client.py does session-cookie auth against the league's API and exposes matches(), me(), and put_bet(match_id, home, away). revise.py runs from cron every fifteen minutes, looks at upcoming matches in a window 90 to 200 minutes before kickoff, asks Claude for a verdict, and writes the bet if anything has changed. Idempotent state guard — same match won't get re-revised in the same pre-window. Per-match JSON record of every decision, with reasoning. The whole thing is under 250 lines.
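A minimal sketch of the window check and the state guard described above. It is not the actual revise.py: the function names are hypothetical, and the guard uses an in-memory set where the real script would need state that survives between cron runs.

```python
from datetime import datetime, timedelta, timezone

WINDOW_MIN = timedelta(minutes=90)
WINDOW_MAX = timedelta(minutes=200)


def in_revision_window(kickoff: datetime, now: datetime) -> bool:
    """True when kickoff is between 90 and 200 minutes away."""
    return WINDOW_MIN <= kickoff - now <= WINDOW_MAX


def should_revise(match_id: str, kickoff: datetime, now: datetime, revised: set) -> bool:
    """Idempotent guard: revise a match at most once per pre-kickoff window."""
    if match_id in revised or not in_revision_window(kickoff, now):
        return False
    revised.add(match_id)
    return True


now = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
seen = set()
print(should_revise("m1", now + timedelta(hours=2), now, seen))  # first cron tick
print(should_revise("m1", now + timedelta(hours=2), now, seen))  # later tick, same window
```

Because the guard records the match before returning, repeated fifteen-minute ticks inside the same window cannot trigger a second revision.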

The three rules

The prompt to Claude is structured and small. It asks for: home goals, away goals, confidence level, and a one-sentence reason that has to cite exactly one of three rules.

Squad-value ratio. Estimated market value of each squad. If one side is clearly more expensive than the other, that side scores more.
Class gap. Is one of these teams a debutant or weak federation member, and the other a top-eight nation by recent results? If yes, the modal score gets pushed harder.
Pace mismatch. Does one side's attack-speed style obviously punish the other side's defensive shape?

That's the whole rulebook. No "team chemistry." No "the manager's been under pressure." No "momentum from the last group game." No "this is the kind of fixture Brazil tends to drop points in." Three crude structural proxies for is one of these teams obviously the better one, and a confidence level that determines how aggressive the modal scoreline gets.
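The structure around the model can enforce that rulebook mechanically. Below is a hedged sketch of a validator for the four requested fields. The Verdict shape, the confidence labels, and the substring match on rule names are assumptions; the post does not show the real schema.

```python
from dataclasses import dataclass
from typing import List

RULES = ("squad-value ratio", "class gap", "pace mismatch")
CONFIDENCE_LEVELS = ("low", "medium", "high")  # assumed labels


@dataclass(frozen=True)
class Verdict:
    home: int
    away: int
    confidence: str
    reason: str


def validate_verdict(v: Verdict) -> List[str]:
    """Return a list of problems; an empty list means the verdict is usable."""
    problems = []
    if v.home < 0 or v.away < 0:
        problems.append("goals must be non-negative")
    if v.confidence not in CONFIDENCE_LEVELS:
        problems.append("unknown confidence level")
    cited = [r for r in RULES if r in v.reason.lower()]
    if len(cited) != 1:
        problems.append(f"reason must cite exactly one rule, found {len(cited)}")
    return problems


print(validate_verdict(Verdict(2, 0, "high", "Class gap: debutant against a top-eight side.")))
print(validate_verdict(Verdict(1, 1, "medium", "Momentum from the last group game.")))
```

A verdict that fails validation can simply be dropped, leaving the previous bet in place.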

The standings

Full table as of this morning. Other players are anonymized as single letters; the bot's row is "bot mike":

Place	Player	Pts	Exact	Diff	Result	Bets
1	A	71	6	5	27	72
2	bot mike	68	3	6	28	44
3	B	67	4	5	27	72
4	C	65	4	3	27	54
5	D	63	6	3	24	48
6	E	59	4	3	24	49
7	F	59	2	5	25	60
8	G	57	3	5	23	49
9	H	53	4	3	21	69
10	I	50	4	2	20	44
11	J	49	0	3	23	63
12	K	45	4	3	17	35
13	L	37	3	1	15	39
14	M	30	1	0	14	26

The top of the table looks like this in prose: player A is at 71 points with six exact scores and a prediction in every single match of the tournament. From past chat in our group, this person watches football constantly — clubs across three leagues, knows squad rotations, follows transfer news. A serious follower. The bot is three points behind, with three exact scores. Player B at #3, also a serious follower, is one point further back.

The bot is sitting one row below a real follower of the game. Not by being smarter. Not by knowing football. By being a small structured system run by someone who, on a normal day, has to be reminded which group Senegal is in.

Why this is the part that interests me

Most of the people in this league are serious football followers. They watch matches. They form intuitions. They have opinions about which team underperforms its squad value, which side fades in second halves, which manager makes the wrong substitution against teams with fast wingers. That knowledge is real, and it isn't easily replaced.

And yet a 250-line script with three rules and a cron job is sitting above twelve of them.

The bot doesn't beat the leader. It doesn't outperform actual football intuition at the top of the table. What it does is beat almost everyone else, despite having none of the inputs they have. That gap — between domain-rich intuition and a small disciplined system in a fresh domain — is the thing I keep noticing in my other work, and I didn't expect to see it lit up this cleanly in a prediction league.

The discipline isn't football-specific. It's: define a small structured prompt, run it on cron, write a per-match record, let the model worker do the worker job, and let the structure around the model decide when to act and when to keep the previous bet. That structure is doing more of the work than the model is.

A few honest caveats

The bot probably doesn't stay at #2 through the whole tournament. The knockout rounds get messier. Squad-value ratios get less reliable in tournament football because the variance is high, and three crude rules will miss the kind of nuance the top human catches in tight quarter-finals. The bot's lead over the rest of the league is real today. Whether it holds is a separate question.

Also: the bot is not clever in the prediction itself. Claude isn't running a Bayesian model under the hood. It's pattern-matching on a small structured prompt with three rules. The cleverness, such as it is, is in the architecture around the model — the cron cadence, the state guard, the structured rule set, the per-match JSON log. The model is the worker. The structure decides when the worker is allowed to ship.

That distinction is most of the post. The bot isn't beating football fans because the model is smart. It's beating most of them because the system around the model is small, structured, and consistent — and most casual prediction isn't.

Footer

If this sounds like the vibe-coding-is-not-a-level framing — it is. In a fresh domain, a small disciplined system can get surprisingly close to domain intuition — and sometimes beat loosely applied intuition outright. I was writing about this in software last week. I'm writing about it in football this week. The shape doesn't care which domain it lives in.

The bot's predictions will keep landing through the group stage. The standings will move. I'll write the follow-up when they do.

A quorum costume: why agent verification needs fault injection

Mike Czerwinski — Tue, 23 Jun 2026 14:13:09 +0000

Yesterday I watched my AI partner miss the same source-of-truth problem three times in a row, in three different forms, across three different review surfaces.

It wrote a draft in the wrong voice. A reviewer-session of the same model read it twice and rated it progressively higher. A meta-receipt at the end of the post miscounted the number of review rounds — fact drift inside a paragraph about fact drift. I caught all three only because I was sitting outside the loop with access the reviewers didn't have.

I wrote about the failures themselves the same evening. This is the part underneath them.

Each of those catches has the same structure as a much bigger class of failure across the agent stack right now: a verification surface that is supposed to be independent of the thing it verifies, but isn't. The check shares lineage with the claim. The reviewer reads from the same source as the writer. The agreement loop walks the same path the disagreement was supposed to fall out of.

That's not a flaw in any particular framework. It's the assumption nobody verified before they shipped the framework.

The diagnosis, one floor up

Almost every verification scheme in the current agent stack quietly bets on independence between paths.

Multi-agent voting bets on it. Cross-layer coherence checks bet on it. Quorum reads, consistency loops, maker/checker patterns, two-pass LLM reviews, ensemble-of-prompts setups — every one of them ships with the premise that the views being combined are in some useful sense disjoint. Disagreements are supposed to surface real divergence. Agreement is supposed to mean the real signal cleared a structural test.

Most of the ones I see do not make the independence assumption observable.

Self-Correcting Systems named it cleanly in a commissioning thread for this post: an unverified independence assumption is indistinguishable from a single point of failure wearing a quorum costume. That line does most of the work of this post. Until you've checked that the paths actually disagree on the thing they're supposed to disagree on, you don't have N views. You might have one view in N hats and no way to tell.

The costume is the part that fools you. The vote returned unanimous. The reviewers agreed. The cross-check passed. From inside the system everything looks like the verification did its job. From outside — where someone can see that all the paths share an upstream — it's one signal repeated.

Disagreement rate isn't the test

The reflex move when this gets named is: fine, measure disagreement rate as a smoke test. If the views never disagree, they're not independent.

This is a useful check, and it isn't the test.

Two paths can disagree on phrasing and agree on the same wrong fact. They can disagree on confidence and converge on the same hallucination. They can disagree on tone and share the same upstream retrieval that handed both of them the bad context. The agreement that matters is the one on the thing that actually carries the load — and that's exactly the dimension where shared lineage is hardest to see, because the words on the surface are different.

Disagreement on noise while sharing the upstream that actually matters is the worst possible failure mode for an independence claim. It looks healthy. It produces nicely varied outputs. It survives smoke tests. And it fails in the same direction every time the upstream lies.
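A minimal sketch of why a disagreement-rate smoke test can look healthy while the load-bearing claim is shared. Everything here is an assumption made for illustration: the two outputs, the `value` field standing in for "the thing that carries the load", and the `GROUND_TRUTH` constant. Surface disagreement is measured with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def surface_disagreement(a: str, b: str) -> float:
    """0.0 means identical text; values near 1.0 mean almost nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# Hypothetical outputs from two "independent" paths. The load-bearing
# claim is the extracted `value`; the text around it is only phrasing.
path_a = {
    "text": "Fairly sure the limit was raised to 500 requests per minute.",
    "value": 500,
}
path_b = {
    "text": "Rate cap: 500/min, high confidence, per the config notes.",
    "value": 500,
}

# Assumed for the example: the real limit is 100, and a shared upstream
# handed both paths the same stale 500.
GROUND_TRUTH = 100

noise = surface_disagreement(path_a["text"], path_b["text"])  # wording differs
same_claim = path_a["value"] == path_b["value"]               # claim does not
both_wrong = same_claim and path_a["value"] != GROUND_TRUTH
```

The smoke test only sees `noise`, which is nonzero because the wording differs. The failure lives in `same_claim` and `both_wrong`, which the smoke test never reads.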

The real test isn't whether the views disagree on their own. It's whether you can make them disagree by perturbing the system. That's a different shape of measurement entirely.

A compact diagnostic to keep in front of you:

Path	Shared upstream	Injected fault	Expected divergence	What "coupling detected" looks like
Retriever-A vs Retriever-B	embedding model M	Plant a contradictory fact through one path's index only	A returns corrupted, B returns clean	Both return corrupted (same embedding pulls the bad cell on either side)
Maker vs Checker	rule cache R	Mutate the rule offline so the correct verdict changes	Maker uses mutated R, Checker flags the divergence	Both pass (Checker reads cached pre-mutation R, never re-fetches)
Telemetry vs anteriority check	served-model record	Plant the wrong `response.model` value	Anteriority check reads independently-controlled record and flags mismatch	Anteriority check passes (it reads the same record the telemetry wrote)

The columns are the operational shape. Without the last column, you can't tell whether the verification was doing work or sitting green because nobody perturbed it.
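One way to keep the table honest is to hold it as data with an explicit classifier for the last column. This is a sketch under assumptions: the `InjectionCase` and `classify` names are hypothetical, and the "fault never landed" outcome is an addition of the sketch, there to keep a silent no-op from being read as a pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionCase:
    paths: str
    shared_upstream: str
    injected_fault: str


# The three rows of the diagnostic table, condensed.
CASES = [
    InjectionCase("Retriever-A vs Retriever-B", "embedding model M",
                  "contradictory fact planted through one path's index only"),
    InjectionCase("Maker vs Checker", "rule cache R",
                  "rule mutated offline so the correct verdict changes"),
    InjectionCase("Telemetry vs anteriority check", "served-model record",
                  "wrong response.model value planted"),
]


def classify(fault_took_effect: bool, divergence_observed: bool) -> str:
    """Map one perturbation run onto the table's outcome columns."""
    if not fault_took_effect:
        # Neither a pass nor a fail: the run measured nothing.
        return "invalid run: the fault never landed"
    if divergence_observed:
        return "expected divergence"
    return "coupling detected"
```

`classify(True, False)` is the case the table warns about: the fault landed, nothing diverged, and the check stayed green anyway.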

Where the field has practical methods for this

Distributed systems hit this problem well over a decade ago and developed practical methods for testing specific failure assumptions. The methods don't prove independence in the general case; they expose dependence when it exists.

Jepsen runs partition tests against databases that claim consistency, and the only way to find out whether the claim survives a network failure is to cause the network failure. Chaos Monkey kills production instances on purpose. Game-day exercises shut down regions and watch whether the system that claims regional redundancy actually has it. Property-based testing throws structured-random inputs at code that claims to be correct under all inputs. The pattern repeats: you don't measure resilience by watching the system run well. You measure it by inducing the failure mode the claim was supposed to survive — within the safety boundary the operator has authorized.

Agent and memory architectures didn't inherit this discipline. The reason isn't technical — the tooling is straightforward. The reason is cultural. Distributed systems treat state as something you have to perturb to trust. Agent stacks treat state as something you write to and read from and trust by default. The whole design vocabulary is built around storing, retrieving, embedding, and ranking — none of which has a perturbation primitive in it.

That's the gap. Not "we don't know how to verify independence." We do, in the specific cases we test for. The gap is that the harness is still rarely treated as the price of admission.

What fault injection actually means here

Strip the term down: trace the lineage of a claim, induce a known bad state in one of the paths that produced it within a bounded blast radius, and confirm that the "independent" check catches the divergence — or quietly absorbs it.

Three concrete moves on the kinds of stacks I'm running:

Inject a synthetic-bad memory entry. Pick a fact in the agent's semantic store. Plant a contradictory version through one retrieval path's index but not the other, in a sandbox or staging environment. Run the agreement loop. The injected fault is the planted contradiction; the observable assertion is whether the two paths return different values; the negative control is the same query run before the injection, returning consistent values from both paths; the condition that constitutes coupling is unanimous agreement on the corrupted value plus matching provenance IDs across both paths. If the loop returns unanimous, the loop wasn't reading two paths — it was reading one path twice with different formatting, and the provenance trail proves it.
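A minimal sketch of this first move, with an in-memory `Index` class standing in for a retrieval path's index. The class, the `run_injection` function, the key, and the values are all hypothetical; a real semantic store would sit behind its own client. The sketch keeps the four parts named above: negative control, single-path injection, observable assertion, and the coupling condition (unanimous on the bad value plus matching provenance IDs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    value: str
    provenance_id: str


class Index:
    """Toy stand-in for one retrieval path's index (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._rows: dict[str, Hit] = {}
        self._writes = 0

    def put(self, key: str, value: str) -> None:
        self._writes += 1
        self._rows[key] = Hit(value, f"{self.name}#{self._writes}")

    def get(self, key: str) -> Hit:
        return self._rows[key]


def run_injection(path_a: Index, path_b: Index, key: str, bad: str) -> str:
    # Negative control: both paths must agree before anything is planted.
    before_a, before_b = path_a.get(key), path_b.get(key)
    if before_a.value != before_b.value:
        return "invalid: paths disagreed before injection"
    path_a.put(key, bad)  # the fault goes through path A's index only
    try:
        after_a, after_b = path_a.get(key), path_b.get(key)
    finally:
        path_a.put(key, before_a.value)  # rollback keeps the blast radius bounded
    if after_a.value != bad:
        return "invalid: injection did not take effect"
    if after_b.value != bad:
        return "diverged"
    if after_b.provenance_id == after_a.provenance_id:
        return "coupling detected"
    return "unanimous on the bad value, provenance differs: inspect by hand"


# Two genuinely separate indexes.
a, b = Index("a"), Index("b")
a.put("retention_days", "30")
b.put("retention_days", "30")
independent_result = run_injection(a, b, "retention_days", "90")

# One index wearing two names.
shared = Index("shared")
shared.put("retention_days", "30")
coupled_result = run_injection(shared, shared, "retention_days", "90")
```

With separate indexes the run reports `diverged`; with one index passed in twice it reports `coupling detected`, and the matching provenance IDs are the evidence.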

Mutate a router rule offline. Take a rule the agent applies through one decision path. State explicitly what the maker/checker pattern is supposed to detect — for example, that the checker should flag any verdict computed from rule version N+1 when the policy still records version N. Mutate the rule in the offline copy so a known request that previously routed to verdict X now routes to verdict Y. Re-run the maker/checker on that specific known request. The expected divergence is the checker flagging mismatch between policy-version-recorded and policy-version-applied. If both maker and checker pass, they were sharing the cached policy and the second role was a presentation layer, not a check.
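A sketch of the rule mutation, assuming the rule is a plain dictionary and the policy record is a separate object that still names version N. All names (`maker`, `independent_checker`, `coupled_checker`, `POLICY_RECORD`, the request and verdict labels) are hypothetical. The coupled checker is included on purpose, to show what "both pass" looks like.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    verdict: str
    rule_version_applied: int


def maker(rule: dict, request: str) -> Decision:
    return Decision(rule["routes"][request], rule["version"])


def independent_checker(decision: Decision, policy_record: dict) -> bool:
    """True means 'flag': the applied version is not the recorded one."""
    return decision.rule_version_applied != policy_record["version"]


def coupled_checker(decision: Decision, rule: dict) -> bool:
    # Reads the "recorded" version from the same object the maker used,
    # so the comparison can never fail.
    return decision.rule_version_applied != rule["version"]


POLICY_RECORD = {"version": 7}                       # policy still records N
live_rule = {"version": 7, "routes": {"known-request": "X"}}

# Offline mutation: version N+1, and the known request now routes to Y.
mutated = copy.deepcopy(live_rule)
mutated["version"] = 8
mutated["routes"]["known-request"] = "Y"

before = maker(live_rule, "known-request")           # negative control
after = maker(mutated, "known-request")

control_flag = independent_checker(before, POLICY_RECORD)
independent_flag = independent_checker(after, POLICY_RECORD)
coupled_flag = coupled_checker(after, mutated)
```

The negative control stays quiet, the independent checker flags the N versus N+1 mismatch, and the coupled checker passes the mutated verdict without comment.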

Plant the wrong response.model in telemetry. Define your terms first: the anteriority check is the diagnostic that compares the served-model identifier in telemetry against the served-model intent-spec — a build-time manifest or external provenance store that names which model should have served which request type. The independently-controlled record is the intent-spec, authored at build time and stored where the runtime can't rewrite it. Log an answer with a deliberately wrong served-model identifier. The expected outcome is the anteriority check reading the intent-spec, comparing it against the planted telemetry, and flagging mismatch. If the check passes, the check was reading the same record the writer wrote — no anchor outside the writer's reach.
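A sketch of the telemetry case. The intent-spec is modeled as a read-only mapping via `types.MappingProxyType`, a stand-in for "stored where the runtime can't rewrite it"; a real anchor would live in a build artifact or an external store. The model names, request IDs, and function names are hypothetical.

```python
from types import MappingProxyType

# Build-time manifest: which model should serve which request type.
INTENT_SPEC = MappingProxyType({"summarize": "model-a"})


def anteriority_check(record: dict, intent_spec) -> bool:
    """True means 'flag': telemetry contradicts the intent-spec."""
    return record["response.model"] != intent_spec[record["request_type"]]


def self_referential_check(record: dict, telemetry_log: list) -> bool:
    # Coupled variant: "expected" is looked up in the log the writer wrote.
    expected = next(r for r in telemetry_log
                    if r["request_id"] == record["request_id"])
    return record["response.model"] != expected["response.model"]


honest = {"request_id": "r-0", "request_type": "summarize",
          "response.model": "model-a"}
planted = {"request_id": "r-1", "request_type": "summarize",
           "response.model": "model-b"}   # deliberately wrong identifier
telemetry_log = [honest, planted]

control_flag = anteriority_check(honest, INTENT_SPEC)      # negative control
anchored_flag = anteriority_check(planted, INTENT_SPEC)
coupled_flag = self_referential_check(planted, telemetry_log)
```

Only the check anchored to the intent-spec flags the planted record. The self-referential one compares the record to itself and stays green.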

In all three, the value isn't the specific bug you find. It's the binary answer to a question the system couldn't answer about itself otherwise: does the second path actually catch the first, or has the costume been fooling everyone — including the team that wrote both halves?

The operational split underneath. Raffaele Zarrelli, designing slow-loop memory updates at cowork-os, surfaced the question that turns this from a one-time test into a sustained discipline: who actually authors the verifier? The shape that holds up: the operator authors the verifier choice at lock time — picks the anchor, names the expected resolution, sets the cadence — because picking the right anchor is a judgment call that doesn't scale into the mechanism. The diagnostic firing is mechanical: a cron walks all locked entries on cadence, runs their verifiers against current reality, flags any whose resolution diverged or whose verifier itself failed to run. Operator picks the anchor; mechanism fires the check. Same shape as the apply/advisory split — operator authorizes the binding, mechanism surfaces the drift without authority to flip status.
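The mechanical half of that split can be sketched as a sweep function that a cron job would call. The `LockedEntry`, `Flag`, and `sweep` names are hypothetical, and the verifiers here are trivial lambdas; the point is only the shape: the operator supplies the anchor and the expected resolution at lock time, and the sweep returns flags without changing any entry's status.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LockedEntry:
    entry_id: str
    expected: str                  # resolution named by the operator at lock time
    verifier: Callable[[], str]    # reads the operator-chosen anchor


@dataclass(frozen=True)
class Flag:
    entry_id: str
    reason: str


def sweep(entries: list[LockedEntry]) -> list[Flag]:
    """Run every verifier once; surface drift, never flip status."""
    flags = []
    for entry in entries:
        try:
            observed = entry.verifier()
        except Exception as exc:
            # A verifier that cannot run is itself a finding.
            flags.append(Flag(entry.entry_id, f"verifier failed to run: {exc!r}"))
            continue
        if observed != entry.expected:
            flags.append(Flag(
                entry.entry_id,
                f"diverged: expected {entry.expected!r}, observed {observed!r}",
            ))
    return flags


def unreachable_anchor() -> str:
    raise TimeoutError("anchor unreachable")


entries = [
    LockedEntry("d-1", "merged", lambda: "merged"),
    LockedEntry("d-2", "merged", lambda: "reverted"),
    LockedEntry("d-3", "merged", unreachable_anchor),
]
flags = sweep(entries)
```

The sweep returns two flags, one for the diverged resolution and one for the verifier that failed to run. What happens to the flagged entries stays with the operator.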

One constraint goes with this: the verifier itself has to target something the writing session can't reach. Operator picks the anchor at lock, but if the anchor is a grep against the operator's own filesystem in the same session that wrote the decision, the whole thing collapses into wording-check one floor up — same disease, longer chain. The smallest useful version: verifier targets must be externally authored. CI run signed by a service the operator didn't author. Commit by a counterparty. Vendor receipt with an audit trail. World-state record outside the operator's write path. Anything the lock session could rewrite from inside itself isn't a verifier; it's a label.
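The constraint can be written down as a predicate at lock time. This is a deliberately small sketch: the `Anchor` fields and the identity strings are hypothetical, and in practice "who can write this" is an access-control question the code can only record, not discover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    description: str
    authored_by: str
    writable_by: frozenset


def is_verifier(anchor: Anchor, lock_session: str) -> bool:
    """False means the anchor is a label: the lock session could rewrite it."""
    return (anchor.authored_by != lock_session
            and lock_session not in anchor.writable_by)


local_grep = Anchor(
    "grep against the operator's own filesystem",
    authored_by="operator-session",
    writable_by=frozenset({"operator-session"}),
)
signed_ci = Anchor(
    "CI run signed by a service the operator didn't author",
    authored_by="ci-service",
    writable_by=frozenset({"ci-service"}),
)

grep_ok = is_verifier(local_grep, "operator-session")
ci_ok = is_verifier(signed_ci, "operator-session")
```

The local grep is rejected and the signed CI run is accepted. Rejecting at lock time is cheaper than discovering later that the check was reading its own handwriting.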

A five-step working checklist, distilled:

Map the lineage of the claim — who wrote it, what context flowed in, which upstream produced what.
Select one boundary to perturb — pick a path you can perturb without breaking production; default to synthetic data, sandbox, or staging.
Inject a controlled fault — known-bad state, single path, bounded blast radius, rollback ready.
Observe the alternate path — does the supposedly-independent check catch the divergence or absorb it silently? Capture provenance from both paths.
Record pass criteria — what was injected, what fired, what didn't, what would have constituted a clean pass.
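The checklist above can be sketched as a record plus a runner that always rolls back. `DrillRecord` and `run_drill` are hypothetical names, and the state being perturbed is a plain dictionary standing in for a sandboxed system. The pass criterion is written into the record before the run, so it cannot be adjusted to fit the result.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DrillRecord:
    claim: str
    lineage: list[str]                  # step 1: who and what produced the claim
    boundary: str                       # step 2: the one path being perturbed
    injected_fault: str                 # step 3: the known-bad state
    clean_pass_means: str               # step 5: fixed before the run
    check_fired: Optional[bool] = None  # step 4: filled in by the run
    rolled_back: bool = False


def run_drill(record: DrillRecord,
              inject: Callable[[], None],
              check_fires: Callable[[], bool],
              rollback: Callable[[], None]) -> DrillRecord:
    inject()
    try:
        record.check_fired = bool(check_fires())
    finally:
        rollback()              # runs even if the observation step raises
        record.rolled_back = True
    return record


# Hypothetical sandbox state: the policy records version 7.
state = {"rule_version": 7}
RECORDED_VERSION = 7

record = run_drill(
    DrillRecord(
        claim="checker is independent of maker",
        lineage=["rule cache R", "maker", "checker"],
        boundary="offline copy of rule cache R",
        injected_fault="rule_version bumped to 8",
        clean_pass_means="checker flags applied version != recorded version",
    ),
    inject=lambda: state.update(rule_version=8),
    check_fires=lambda: state["rule_version"] != RECORDED_VERSION,
    rollback=lambda: state.update(rule_version=7),
)
```

The `finally` block is the design choice that matters: a drill that can leave its fault behind is a drill nobody will be allowed to run twice.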

The same logic generalizes. Pick a verification claim. Pick a single path. Perturb the path within a safe boundary. Watch the verification. If the verification doesn't flinch when the input lies, the verification was never a verification.

Independence decays

If you stop here you've built the harness once and you're done. That's the next assumption to drop.

Independence is not a property you establish. It's a property you re-verify.

The reason is small and brutal: you can wire up two paths, prove they're disjoint with a clean fault-injection pass in June, ship the result, and have somebody refactor a shared cache into the middle of both paths in July. The vote still returns. The agreement loop still fires. The smoke test still looks healthy. By August, your June fault-injection pass certifies a system that no longer exists. You measured an independence that has since collapsed, and nothing in the loop tells you it collapsed, because every visible signal is downstream of the shared cache.

This is the same disease as integrity-is-not-anteriority, applied to the orthogonal axis. Integrity at a moment is not a verifiable history. Independence at a moment is not sustained disjoint paths. Both are properties of time, not properties of state.

The operating model worth running alongside this:

Triggers that should fire a re-verification: a new shared cache between two paths, a retrieval source change that lets the same upstream feed both views, a router change that overlaps previously-disjoint paths, a policy store change that moves the rule deciding "which path runs," a telemetry schema change that alters what two checks compare against, a model family change that introduces shared training lineage.
Cadence: at least monthly, plus per-trigger.
Failure owner: named per system. An alarm that fires without an owner is a checkbox with no consequence.
Actionable policy sentence: rerun the relevant injection test whenever a shared cache, retrieval source, router, policy store, telemetry schema, or model family changes — or monthly, whichever comes first.
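The policy sentence above is small enough to express as a function. A sketch under assumptions: the trigger labels are hypothetical identifiers for the six change types listed, "monthly" is approximated as 30 days, and `reverification_due` is an invented name.

```python
from datetime import date, timedelta

TRIGGERS = frozenset({
    "shared_cache", "retrieval_source", "router",
    "policy_store", "telemetry_schema", "model_family",
})
MAX_AGE = timedelta(days=30)  # assumption: "monthly" read as 30 days


def reverification_due(last_pass: date, today: date, changed: set) -> list:
    """Return the reasons to rerun the injection test; empty means not due."""
    reasons = sorted(f"trigger: {name}" for name in changed & TRIGGERS)
    if today - last_pass >= MAX_AGE:
        reasons.append("cadence: last pass is at least 30 days old")
    return reasons


last_pass = date(2026, 6, 23)
quiet_week = reverification_due(last_pass, date(2026, 6, 30), set())
cache_added = reverification_due(last_pass, date(2026, 7, 1),
                                 {"shared_cache", "readme"})
stale = reverification_due(last_pass, date(2026, 8, 1), set())
```

Changes outside the trigger set (the `readme` entry here) are ignored, and a stale pass is reported even when nothing changed. The failure owner is not modeled; the function only says that a rerun is due and why.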

There's a second decay nobody mentions. The harness itself rots.

If the fault-injection step exists but nobody ever runs it, or nobody ever shoves a known-bad state through it that should trip it, the harness becomes a checkbox. Green because nobody's measuring. Green because one of the injector's dependencies has bit-rotted since the last real test. Green because someone refactored the test fixtures and the perturbation now silently no-ops. The harness is one more thing in the system that has to be perturbed periodically or it stops being a measurement and starts being decor.

Treat the harness like the model. Assume it drifts. Re-perturb, don't only re-check.
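One way to re-perturb the harness itself is to run it against fixtures whose answers are known in advance: one pair of paths that is coupled by construction and one that is independent by construction. This is a sketch with hypothetical names (`harness_verdict`, `harness_still_bites`), and the harness here is intentionally naive, so that a silently no-op injector leaves it green.

```python
BAD = "planted-contradiction"


def harness_verdict(inject, path_a: dict, path_b: dict) -> str:
    """Naive harness: perturb path A, then look only at path B."""
    inject(path_a)
    return "coupled" if path_b["fact"] == BAD else "independent"


def harness_still_bites(inject) -> bool:
    """Self-test: the harness must get both known fixtures right."""
    shared = {"fact": "clean"}                           # coupled by construction
    left, right = {"fact": "clean"}, {"fact": "clean"}   # independent by construction
    return (harness_verdict(inject, shared, shared) == "coupled"
            and harness_verdict(inject, left, right) == "independent")


def working_inject(path: dict) -> None:
    path["fact"] = BAD


def rotted_inject(path: dict) -> None:
    # Stands in for a perturbation that silently no-ops after a refactor.
    pass


healthy = harness_still_bites(working_inject)
rotted = harness_still_bites(rotted_inject)
```

With the rotted injector the harness reports `independent` for everything, including the fixture that is one dictionary passed in twice. Only the known-coupled fixture exposes that, which is why the self-test needs a case that should trip.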

Signals from adjacent work

This isn't only a private failure mode. The same shape is showing up in adjacent work this month.

In agent security, deterministic permission gates move tool-call decisions out of the model's discretion. In memory systems, supersede edges and provenance make it possible to ask what replaced what and why. In larger agent harnesses, the infra layer is decomposing into sandboxes, memory, skills, sub-agents, and gateways.

Those are useful primitives. But none of them remove the independence question. A gate can still share lineage with the claim it gates. A memory substrate can still retrieve twelve agreeing episodes through the same broken path. A harness can still ship impressive infrastructure while leaving the verification layer above it untested. The infra layer is closing fast. The verification layer above it still feels under-built: not because nobody has pieces of it, but because independence is rarely made measurable as a first-class property.

The pattern is no longer just my thesis. It is showing up across enough adjacent work that I would treat it as an emerging baseline candidate — not because the field has agreed on it, but because the same constraint keeps forcing the same shape. The next baseline is not "more checks." It is checks whose independence has been perturbed, measured, and re-verified after the system changes.

Closing

The line from the post I wrote yesterday holds at this level too: two sessions of the same model do not constitute two views; they constitute one view, twice.

The version one floor up: two paths that share an upstream do not constitute two views; they constitute one view, twice, in different fonts.

Independence is a design-time-vs-runtime distinction. You design for it by separating the paths a verification touches from the paths the thing-being-verified touches. You verify it at runtime by inducing a failure in one path and watching whether the other path notices. You re-verify it next month because the system you measured in June isn't the system running in August.

Everything else — the agreement rate, the consistency loop, the quorum result, the cross-check that came back unanimous — is internal coherence wearing the costume of independent verification. Formal lineage analysis, provenance controls, and architectural isolation can provide partial evidence: they certify what the system was at construction. They don't certify what it currently is.

The strongest operational measurement proposed here is the one that perturbs the system and watches what flinches. Anything the system could rewrite from inside itself isn't a verifier; it's a label.

Credits & references

The "single point of failure wearing a quorum costume" framing and the independence-decay dimension came from Self-Correcting Systems in a public commissioning thread under You can't be your own second view.
The verifier-shape architectural co-design (operator-picks-anchor + mechanism-fires-cron + verifier-targets-must-be-externally-authored) and the time-as-different-class framing came from Raffaele Zarrelli's work on cowork-os and a cross-thread exchange this week.
The CrewAI permit/defer/deny architecture with hash-chained decision logging and the analyzer-suggests-diff-operator-applies pattern is Brian Hall's work at Faramesh Labs (Put a hard stop in front of your CrewAI crew's tool calls).
The push-memory substrate primitives (supersede edges, computed-not-stored confidence, off-test for shadowed memory) come from Todd Hendricks's five-part Agentic Memory Study and the Recall substrate.
The industry-scale infra receipt is ByteDance's DeerFlow 2.0, an open-source SuperAgent harness rewritten from the ground up (MIT-licensed, approximately 73,000 GitHub stars as of June 23, 2026), which ships sandboxes, memory, sub-agents, skills, and a message gateway as one stack at the infra layer.
Companion posts: Salience is not carry value on selection-time policy in memory pipelines, and You can't be your own second view on the single-agent case of the same failure.
Background: Anthropic Economic Research, Agentic coding and persistent returns to expertise (Hitzig et al., June 2026), an independent empirical anchor for the operator-discipline axis underneath this whole arc.

Additional peer references — NOVA Network on synthetic-quorum and out-of-band alarms; Christopher Maher (LLMKube) on bite-check; Vishal Keerthan and Elliott Schmechel on routing-embedding-as-input-side-drift-catch and constraints-driven convergence; Shudipto Trafder on the CoALA seven-memory-types taxonomy; Theo Valmis on engineering-with-AI as designing where the model is allowed to be wrong; jugeni's audit log integration contract at github.com/jugeni/jugeni-contracts — will appear in a follow-up post that walks each thread individually.