"How do you build a test your detector can't game?"

#ai #machinelearning #testing #buildinpublic

If you build a waste detector and then build your own test set, there's a trap waiting: you can pass without your detector actually working.
Here's how. Say my "waste" examples all have repeated steps and my "clean" examples have none. That looks reasonable until you realize a dumb detector that just flags any repeat now scores perfectly, and the semantic layer (the hard part, the part that decides whether a repeat is genuinely redundant) never has to do anything. You'd ship a "0.85 F1" that's really measuring "did the trace repeat," which is the exact mistake that killed my first project.
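
To make the trap concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the step names, the four-trace test set, and the `has_repeat` baseline are assumptions, not my real data or detector. It shows a structure-only check scoring a perfect F1 on a test set where only the waste traces repeat.

```python
from collections import Counter

def has_repeat(trace):
    """Structure-only baseline: flag a trace as waste if any step name repeats."""
    counts = Counter(trace)
    return any(n > 1 for n in counts.values())

def f1(preds, labels):
    """F1 for the positive ("waste") class."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical naive test set: waste traces repeat, clean traces never do.
naive_set = [
    (["plan", "search", "search", "answer"], True),
    (["plan", "fetch", "fetch", "fetch", "answer"], True),
    (["plan", "search", "answer"], False),
    (["plan", "fetch", "summarize", "answer"], False),
]

preds = [has_repeat(trace) for trace, _ in naive_set]
labels = [label for _, label in naive_set]
naive_f1 = f1(preds, labels)
print(naive_f1)  # 1.0
```

The baseline never looks at what a step did, yet nothing in this test set can tell it apart from a detector that does.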

The fix is paired design. Every clean example has exactly the same structure as a waste example: same nodes, same repeats, same handoffs. The only difference is whether the repeated work is semantically redundant or real progress. That single constraint forces the semantic layer to be the actual judge, because structure alone can no longer tell the two apart.
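
A rough sketch of what a paired test set could look like in Python. The `Step` type, its `payload` field, the example pair, and the helper names are all hypothetical, chosen only to illustrate the idea. Structure here means the ordered sequence of node names, which covers nodes, repeats, and handoffs in one signature.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    node: str      # which agent or tool ran
    payload: str   # what it was asked to do (hypothetical field)

def structure(trace):
    """Structural signature: the ordered node names, ignoring payloads."""
    return tuple(step.node for step in trace)

def structural_detector(trace):
    """Baseline that flags any repeated node, with no semantic check."""
    nodes = structure(trace)
    return len(set(nodes)) < len(nodes)

# One hypothetical pair: identical node sequence, different meaning.
waste = [Step("planner", "find pricing"), Step("search", "acme pricing"),
         Step("search", "acme pricing"), Step("writer", "summarize")]
clean = [Step("planner", "find pricing"), Step("search", "acme pricing"),
         Step("search", "acme enterprise pricing"), Step("writer", "summarize")]
pairs = [(waste, clean)]

def validate_pairs(pairs):
    """Reject any pair whose two traces differ in structure."""
    for waste_trace, clean_trace in pairs:
        if structure(waste_trace) != structure(clean_trace):
            raise ValueError("pair differs in structure")

def accuracy(detector, pairs):
    """Fraction of traces labeled correctly across all pairs."""
    correct = 0
    for waste_trace, clean_trace in pairs:
        correct += detector(waste_trace) is True
        correct += detector(clean_trace) is False
    return correct / (2 * len(pairs))

validate_pairs(pairs)
baseline_accuracy = accuracy(structural_detector, pairs)
print(baseline_accuracy)  # 0.5
```

Because both halves of a pair repeat the same node, the structure-only baseline flags both and lands at coin-flip accuracy. Any score above that has to come from reading the payloads.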

Building the test you can't cheat is harder than building the detector. It's also the only reason the detector's score means anything.