My own test set lied to me

#machinelearning #computervision #kaggle #python

Builder Journal · Hyperspectral Object Tracking Challenge 2026

In every heist movie there is the moment the perfect plan meets the real vault, and something the crew never rehearsed goes wrong. I had the perfect plan this week. My own test set signed off on it. The real vault disagreed.

This is the fourth entry in a log I'm keeping while I compete in the Hyperspectral Object Tracking Challenge 2026, where the task is to track a single object through video shot in colors the eye can't see. An earlier entry was about the home scorer I built so I could measure my work without spending a rationed leaderboard submission. This one is about the day that scorer lied to my face, and why the fault was mine, not the scorer's.

The change that won at home

I built a change to how the tracker holds onto a target when it starts to slip away. On my home harness it was a clean winner, a solid gain over the version I was running. The best result I had produced. Every signal said ship it, so I spent one of my scarce submissions and sent it to the real leaderboard.

It came back worse. Not flat, not a wash. Worse than the thing it replaced.

Why the home result was a mirage

The forensics took a night. The problem was not the change. The problem was the test set I had used to bless it.

I had assembled that particular dev set out of my hardest scenes, the ones where the tracker was already falling apart. So the change looked brilliant on it, because on a track that is already broken, anything that recovers a lost target is pure upside. But the real competition footage is mostly the opposite: healthy tracks, moving fast, not broken at all. On those, the same change did quiet damage. My test set had measured my best case and reported it as my average.

A dev set built from failure cases certifies rescues. It does not certify deployments. The two are not the same population, and I had let myself forget that.
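Here is a minimal sketch of that arithmetic. The per-group effects and the 10% failure rate are invented for illustration, not scores from this competition, and `blended_delta` is a hypothetical helper. It only shows how a change that helps broken tracks and slightly hurts healthy ones flips sign once the mix changes.

```python
def blended_delta(delta_broken, delta_healthy, broken_fraction):
    """Expected score change over a mix of broken and healthy tracks."""
    if not 0.0 <= broken_fraction <= 1.0:
        raise ValueError("broken_fraction must be in [0, 1]")
    return broken_fraction * delta_broken + (1.0 - broken_fraction) * delta_healthy


# Hypothetical per-group effects of the change (illustrative numbers only).
DELTA_BROKEN = 0.08    # gain on tracks that were already failing
DELTA_HEALTHY = -0.02  # small loss on tracks that were fine

# A dev set made only of failure cases sees nothing but the gain.
dev_delta = blended_delta(DELTA_BROKEN, DELTA_HEALTHY, broken_fraction=1.0)

# An assumed realistic mix where only 1 track in 10 is failing.
real_delta = blended_delta(DELTA_BROKEN, DELTA_HEALTHY, broken_fraction=0.1)

print(f"dev set (all failure cases): {dev_delta:+.3f}")
print(f"assumed realistic mix:       {real_delta:+.3f}")
```

Same change, same scorer, opposite sign. Only the population moved.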

The second mistake, for free

I made it worse than it had to be. I had stacked two unproven changes into that single submission. So when the bad number came back, it could not tell me which of the two was the culprit. One measurement, two suspects, no way to question them separately. It took the whole night of detective work, with no answer key, to pull them apart. When measurements are scarce, never put two unvalidated ideas in one of them. You learn half as much and pay the full price.
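One way to question the suspects separately before spending a submission is to score every on/off combination on the home harness. The sketch below is an assumption about how that could look, not the harness I actually run: `change_a`, `change_b`, and `toy_evaluate` are hypothetical stand-ins with invented effects, and the answer is only as trustworthy as the dev set behind the scorer.

```python
from itertools import product


def ablation_grid(evaluate, flags):
    """Score every on/off combination of the given flags with `evaluate`."""
    results = {}
    for combo in product([False, True], repeat=len(flags)):
        config = dict(zip(flags, combo))
        results[combo] = evaluate(**config)
    return results


def main_effects(results, flags):
    """Average score change from switching each flag on, over the other flags."""
    effects = {}
    for i, name in enumerate(flags):
        on = [score for combo, score in results.items() if combo[i]]
        off = [score for combo, score in results.items() if not combo[i]]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    return effects


def toy_evaluate(change_a=False, change_b=False):
    """Stand-in scorer with invented effects; a real one would run the tracker."""
    score = 0.60
    if change_a:
        score += 0.03
    if change_b:
        score -= 0.05
    return score


FLAGS = ["change_a", "change_b"]
grid = ablation_grid(toy_evaluate, FLAGS)
effects = main_effects(grid, FLAGS)

for name, effect in effects.items():
    print(f"{name}: {effect:+.3f}")
```

Two flags cost four local runs and zero submissions, and each change gets its own number instead of sharing one.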

What it actually means

This is the sequel to a lesson from earlier in this log. I had been so pleased that my home scorer matched the official one that I forgot a scorer is only ever as honest as the data you pour into it. A faithful metric on a biased sample is not a small error. It is a confident lie, which is the most expensive kind.

The fix is not a better model. The fix is a test set that looks like the world you are actually going to be graded on.
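A minimal sketch of what that could mean in practice: sample the dev set so each kind of scene appears in roughly the proportion expected in the graded footage. The category labels, the 90/10 split, and `build_matched_dev_set` are all assumptions for illustration. The real mix is not published, so the target proportions would have to be estimated.

```python
import random
from collections import defaultdict


def build_matched_dev_set(scenes, target_mix, size, seed=0):
    """Sample `size` scenes so each category appears in its target proportion."""
    rng = random.Random(seed)

    # Group scene ids by their category label.
    by_category = defaultdict(list)
    for scene_id, category in scenes:
        by_category[category].append(scene_id)

    dev_set = []
    for category, fraction in target_mix.items():
        want = round(size * fraction)
        pool = by_category[category]
        if want > len(pool):
            raise ValueError(f"need {want} '{category}' scenes, have {len(pool)}")
        dev_set.extend(rng.sample(pool, want))
    return dev_set


# Hypothetical scene pool: 10 hard scenes and 30 healthy ones.
scenes = [(f"scene_{i:03d}", "hard" if i % 4 == 0 else "healthy") for i in range(40)]

# Assumed target mix; a guess at the graded footage, not a known figure.
target_mix = {"healthy": 0.9, "hard": 0.1}

dev_set = build_matched_dev_set(scenes, target_mix, size=20)
print(len(dev_set), "scenes selected")
```

The failure-case set is still worth keeping as a separate diagnostic. It just should not be the one that decides what ships.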

Where I'm standing right now

Still sitting just under the podium line, exactly where I was before the detour. The progress this week was subtraction. I now know one more thing that does not work and precisely why it doesn't, which is most of what these weeks really are.

Next entry: a win my parameter sweep handed me on a silver platter that turned out to be a coin landing on its edge.

More in this series

← Start here: I entered a competition to track objects in light you can't see (what the challenge is, and why it's hard).
The leaderboard you can't score against (the home scorer this entry's lesson is the sequel to).
The full Hyperspectral Object Tracking Challenge thread (every entry from this competition).
The Builder Journal (the live log across all my competitions).

This is part of an ongoing builder's log written from inside live competitions. You're reading where I was, not where I am.
