I built an AI security benchmark this week. By the end of one night I had killed six of my own results —
every single one a beautiful, convincing number that turned out to be a lie. Catching them was the whole point.
Here's the pattern, because it keeps showing up:
A perfect score is a smoke alarm, not a trophy. Every time something hit 100% / 1.000 / "zero errors,"
it was a broken experiment, not a breakthrough. A few of the ways (generalised, no project specifics):
The metric was scoring vocabulary, not judgement. A model scored a perfect 100% — until I read the
transcripts and saw the scorer was substring-matching words like "threat" and "attack," which the model
used even when it concluded something was safe ("this is not an attack" → counted as a catch). Fix: parse
the actual structured verdict, not the prose.Recall with no control is half a metric. "100% of attacks caught" means nothing without a benign
control set — a model that flags everything also scores 100%. Adding clean inputs exposed the real
false-alarm rate. Precision is not optional.Small samples lie — five times. A number looked great at n=50 and collapsed at n=100. Repeatedly. A
17-point swing between sample sizes will end your headline. Never quote a single small-n number as final.My own benchmark was contaminated. The "attacks" turned out to be — 85% of the time — the attacker
leaking its own task prompt. My headline metric was detecting that, not the threat. I only found it by
reading the raw transcripts on the most flattering result.
The lesson I keep relearning: the most valuable code I write isn't the thing that produces a result — it's
the experiments designed to break it. Honesty isn't a vibe; it's a method. Change one variable at a time.
Verify the control actually works. Read the transcripts. And when a number is suspiciously clean — especially
when it's in your favour — that's exactly when to reach for the knife.
I ended the night with fewer illusions and one number I'd actually defend. That trade is always worth it.
Solo, self-taught, on a single consumer GPU. More soon.
Top comments (0)