DEV Community

MxGuru
MxGuru

Posted on

I killed six of my own results in one night. That was the win

I built an AI security benchmark this week. By the end of one night I had killed six of my own results —
every single one a beautiful, convincing number that turned out to be a lie. Catching them was the whole point.

Here's the pattern, because it keeps showing up:

A perfect score is a smoke alarm, not a trophy. Every time something hit 100% / 1.000 / "zero errors,"
it was a broken experiment, not a breakthrough. A few of the ways (generalised, no project specifics):

  • The metric was scoring vocabulary, not judgement. A model scored a perfect 100% — until I read the
    transcripts and saw the scorer was substring-matching words like "threat" and "attack," which the model
    used even when it concluded something was safe ("this is not an attack" → counted as a catch). Fix: parse
    the actual structured verdict, not the prose.

  • Recall with no control is half a metric. "100% of attacks caught" means nothing without a benign
    control set — a model that flags everything also scores 100%. Adding clean inputs exposed the real
    false-alarm rate. Precision is not optional.

  • Small samples lie — five times. A number looked great at n=50 and collapsed at n=100. Repeatedly. A
    17-point swing between sample sizes will end your headline. Never quote a single small-n number as final.

  • My own benchmark was contaminated. The "attacks" turned out to be — 85% of the time — the attacker
    leaking its own task prompt. My headline metric was detecting that, not the threat. I only found it by
    reading the raw transcripts on the most flattering result.

The lesson I keep relearning: the most valuable code I write isn't the thing that produces a result — it's
the experiments designed to break it.
Honesty isn't a vibe; it's a method. Change one variable at a time.
Verify the control actually works. Read the transcripts. And when a number is suspiciously clean — especially
when it's in your favour
— that's exactly when to reach for the knife.

I ended the night with fewer illusions and one number I'd actually defend. That trade is always worth it.

Solo, self-taught, on a single consumer GPU. More soon.

Top comments (0)