DEV Community

Cover image for Every Post I Publish Gets AI Review. A Hostile Agent Still Found the Holes in Twenty Minutes.
Mike Czerwinski
Mike Czerwinski

Posted on

Every Post I Publish Gets AI Review. A Hostile Agent Still Found the Holes in Twenty Minutes.

Everyone on my timeline spent this week saying the returning Claude Fable is brilliant. Maybe it is. I did not want another benchmark thread. I wanted a knife test: give the hyped model a hostile mandate and a real target, and see what bleeds.

The target was my own catalog. Thirteen posts on this account, most of them built around one idea: self-reported numbers are worthless, verification has to come from outside the actor. Published win rates are the actor auditing itself. Receipts over wrappers. That idea did well here. It got comments, it got a small cluster of regulars, it got quoted back to me.

Here is the part that matters. Every one of those posts went through several rounds of review before shipping. Not skim-review: a frontier model (GPT-5.5 Codex) doing structured passes, blockers and majors and minors, usually two or three iterations per post. The scores came back 9/10. I treated that as verification.

So the experiment was simple. Same class of tool, different mandate. I spun up an agent on the new model and told it: you are the most skeptical commenter on HN, find holes, no compliments, rank by severity. Then a second agent with read access to the actual codebase and database behind my numbers, to check whether the worst charge was true.

It took about twenty minutes end to end.

The first agent came back with a ranked list. The headline charge: my post claims 97.4% of Telegram signals were stale by the time a bot could act on them, and pairs "advertised" channel win rates of 78.9% against a "measured" 46.6%. The agent could not see my data. It just read the prose and said: if the bot backfilled channel history at connect time, your staleness number is an artifact of ingestion, not a property of the signals. And your advertised-versus-measured table smells like two different measurement conventions wearing one label.

The second agent had the database. Verdict on both counts: confirmed.

The 97.4% was worse than an artifact. The 4,007 rows behind it were not the output of any staleness check. They were a one-time relabel from a bug cleanup in May: rows stuck in a "received" state got renamed "stale_received" by a maintenance script, and seventeen months of backfilled history sat in the denominator. The pipeline does have a real freshness gate (two hours, checked against the message timestamp), but signals it rejects land in a different status entirely. The number I published measured my own ingest bug, not the market.

The advertised-versus-measured table was worse again. The 78.9% did not come from the channels. It came from an earlier backtest of mine that counted a win as first touch of TP1 and dropped expired signals from the denominator. The 46.6% came from a different pipeline of mine that marks positions to market on a fixed horizon and never uses TP or SL at all. Two of my own rulers, different units, one of them mislabeled as the channels' claim. The gap I attributed to survivorship is at least partly a gap between my own conventions.

So, errata, on the record: the staleness figure in that post is retracted until I can compute it on the live-monitored window only, and the advertised/measured comparison is not apples-to-apples and should not have been framed as the channels' numbers against mine. The direction of the argument may survive. The evidence I gave for it does not.

The full-catalog pass hurt more than the single post. The pattern it named: my rigor is asymmetric. When a number makes me look bad, I publish confidence intervals for n=29. When a number flatters me, a mid-season second place or three attributed comment replies, it ships naked, no baseline, no denominator. And the closing line of the report, which I will be chewing on for a while: a catalog preaching exogenous verification never published a single exogenous test of its own system. One view, thirteen hats.

The other fair question is how this got past me, since I am the one writing posts about chains of custody. I can reconstruct it exactly, and none of it requires malice.

The post was written from my own postmortem document, which already contained the aggregated numbers. I verified the prose against the doc. Nobody verified the doc against the database. The relabel that produced those 4,007 rows happened weeks earlier, in a different repo, as routine bug hygiene. By the time the number reached a draft it looked like a measurement, because nothing in a markdown file remembers where it came from.

There is one more layer of history, and it makes this worse, not better. The bug dates back to the earliest weeks of building that pipeline. We fixed it in May, relabeled the stuck rows, and shortly after rejected Telegram signals as a source altogether. And once a module is rejected, nobody cleans it anymore. Why would you sweep a graveyard. The dead corner of the database kept answering queries, so a month later, when I came back looking for publishable numbers, it handed me some. A count never expires. Its meaning does.

Which cuts both directions in time, and I owe this one out loud: if 97.4% stale was part of why we rejected the source, the rejection itself drank from the same well. I think the verdict stands, the funnel had other, healthier reasons. But "I think it stands" is exactly the kind of sentence this post exists to kill. So the errata includes re-running staleness and win rate on the live-monitored window only. If the verdict survives, it finally gets a receipt. If it does not, that will be a better post than this one.

And the numbers fit the thesis. 97.4% stale was a beautiful number for an argument about information already being in the price. I computed confidence intervals for the result that embarrassed me and shipped the flattering ones naked. Confirmation bias does not feel like bias from the inside. It feels like the numbers agreeing with you.

I do not have an excuse. I have a mechanism, and the difference is that you can put a gate on a mechanism: from now on, every number in a draft carries its own receipt, the query or the file and line it came from, resolved at write time. If I cannot trace it, it does not ship.

Now the question everyone will reasonably ask: does this prove the new model is smarter than my reviewer? No. And I want to be precise about why, because this is the actual lesson.

I never gave the reviewer model a hostile mandate. I asked it to review drafts, and it did that job well: the prose got tighter every round. A reviewer told "make this post better" optimizes the wrapper. An agent told "find what is false" attacks the claim. I do not know which model would have won a fair fight, because I never staged one. Mandate is not a detail of verification. Mandate is most of it.

And the second half: the killer evidence was never in the text. No review of the draft, at any quality, by any model, could have found a maintenance script that relabeled 4,007 database rows. The first agent could only raise a suspicion. It became a verdict when a second agent ran a query against the table. Verification is bounded by what the verifier can touch. If your reviewer can only read the prose, you have verified the prose.

Which answers, at least for me, the regress everyone loves to pose: who verifies the verifier, and where does that loop end? Not with a smarter model above the current one. The loop ends where the signal changes kind, where an opinion about the work gets replaced by a measurement the work cannot argue with. Every layer above that is just routing suspicion toward something deaf.

The new model might be brilliant. The query did not care.

Top comments (1)

Collapse
 
jugeni profile image
Mike Czerwinski • Edited

I do not know which model would have won a fair fight, because I never staged one.

Actually I did, but this is for next one ;)