SportZone turns a parent's phone video of a youth game into a highlight reel. The hard part is finding the decisive moment. On real footage we were missing nearly half of them. Here's how a week of measuring — not guessing — fixed it.
01 THE PROBLEM
On real phone footage, the model went half-blind
Our classifier scored 0.82 on curated YouTube clips. Confident, we ran it on genuine parent-filmed smartphone video for the first time. The number that matters — did we catch the decisive moment? — came back at 0.56. We were missing 44% of the highlights. A highlight tool that misses half the highlights isn't a tool.
The curated-vs-real gap is the whole game. So the first thing we built wasn't a fix — it was a way to measure honestly on the real domain.
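For readers who want something concrete, here is a minimal sketch of what a moment-level measurement could look like. It assumes each clip has hand-labeled timestamps for its decisive moments and that a moment counts as caught when any predicted highlight lands within a tolerance window. The function name, the 1-second tolerance, and the sample timestamps are illustrative assumptions, not our production harness.

```python
from typing import Sequence


def moment_recall(true_moments: Sequence[float],
                  predicted: Sequence[float],
                  tolerance_s: float = 1.0) -> float:
    """Fraction of labeled moments that have a prediction within tolerance_s seconds."""
    if not true_moments:
        raise ValueError("need at least one labeled moment")
    caught = sum(
        any(abs(t - p) <= tolerance_s for p in predicted)
        for t in true_moments
    )
    return caught / len(true_moments)


# Hypothetical clip: three labeled moments, two predicted highlights (seconds).
labels = [12.0, 47.5, 80.0]
preds = [12.4, 79.2]
recall = moment_recall(labels, preds)  # two of the three moments are caught
```

The point of scoring per moment rather than per frame is that it answers the question a parent actually cares about: did the reel include the goal or not.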
02 THE WRONG TURN
We bet on the obvious culprit. We lost.
"The boxes look tight — the detector must be missing people." Reasonable. So we did the expensive thing: assembled ~9,100 commercially-licensed sports images, fine-tuned a detector on a cloud GPU, hit a healthy mAP of 0.716, and plugged it back in.
Hours of data-wrangling and training bought us nothing. Frustrating — but the failure was the clue. If a better detector changes nothing, detection was never the bottleneck. We just didn't have the evidence yet.
03 THE MEASUREMENT THAT CHANGED EVERYTHING
We decomposed the failures instead of arguing about them
We took every missed moment and tagged where in the pipeline it leaked: detection → tracking → pose → impact. One script, no opinions. The breakdown ended the debate:
Detection was 15%. The real leak — 55% — was downstream: the person was found and tracked, but the pose signal was too weak for our kinematics to register the impact. We'd been paving the wrong road. Lesson: decompose before you invest.
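The tagging idea can be sketched in a few lines. This is an illustration under assumptions: each missed moment is represented as a dict of per-stage pass/fail flags, and a miss is attributed to the earliest stage that failed. The data structure and the toy records are hypothetical and do not reproduce our real breakdown.

```python
from collections import Counter

# Pipeline order as described above.
STAGES = ["detection", "tracking", "pose", "impact"]


def first_failed_stage(stage_ok: dict) -> str:
    """Return the earliest stage that failed for one missed moment."""
    for stage in STAGES:
        if not stage_ok.get(stage, False):
            return stage
    raise ValueError("no stage failed; this moment was not a miss")


def breakdown(misses: list) -> dict:
    """Share of misses attributed to each stage (shares sum to 1.0)."""
    if not misses:
        raise ValueError("no misses to decompose")
    counts = Counter(first_failed_stage(m) for m in misses)
    return {stage: counts[stage] / len(misses) for stage in STAGES}


# Toy records, made up purely to show the mechanics.
toy_misses = [
    {"detection": False},
    {"detection": True, "tracking": False},
    {"detection": True, "tracking": True, "pose": False},
    {"detection": True, "tracking": True, "pose": False},
]
shares = breakdown(toy_misses)
```

Attributing each miss to the first failing stage keeps the shares additive, so the stages can be ranked directly against each other.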
04 FIXING THE REAL BOTTLENECK
Sharper pose signal broke a ceiling we'd been stuck under
Now aimed at the right target, the fixes were cheap and pure-software: upscale the pose crop (256 → 384), raise model precision (complexity 1 → 2), and stop cropping off the legs with an asymmetric crop bias. No data. No GPU. This wasn't moving an operating point — it was a genuinely cleaner signal, and it broke past an F1 ceiling of 0.54 that pure threshold-tuning had never cracked.
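To make the asymmetric crop concrete, here is a rough sketch of one way a bottom bias could work. It assumes the bias is a fraction of the person box height added below the box on top of a symmetric margin. The `margin` parameter, the function name, and that interpretation of the bias are assumptions for illustration; the result would then be resized to the pose input size.

```python
def pose_crop_box(box, frame_w, frame_h, margin=0.10, bottom_bias=0.15):
    """Expand a person box (x1, y1, x2, y2) for pose estimation.

    A symmetric margin is added on every side, plus an extra bottom_bias
    fraction of the box height below the box so the legs stay in frame.
    The result is clamped to the frame and returned as integer pixels.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1

    left = x1 - margin * w
    right = x2 + margin * w
    top = y1 - margin * h
    bottom = y2 + (margin + bottom_bias) * h  # the asymmetric part

    def clamp(value, upper):
        return int(min(max(round(value), 0), upper))

    return (clamp(left, frame_w), clamp(top, frame_h),
            clamp(right, frame_w), clamp(bottom, frame_h))


# Hypothetical 640x480 frame with a person box in the middle.
crop = pose_crop_box((100, 50, 200, 250), frame_w=640, frame_h=480)
```

In this example the box grows by 20 px at the top but 50 px at the bottom, which is the whole idea: spend the extra pixels where the feet are.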
05 THE TRAP
The improvement first looked like a regression
We flipped on four improvements at once and recall dropped: 0.769 → 0.692. The tempting move: lower the detection threshold until the number looks good again. That would have papered over a real defect instead of finding it. So we ran a clean one-at-a-time ablation.
This is the same lesson as the detector detour, one layer deeper: the number lying to you is more dangerous than the number that's low. Ablation is how you tell them apart.
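A one-at-a-time ablation needs very little machinery. The sketch below assumes an `evaluate(config)` function that returns the metric of interest; here it is replaced by a toy scorer with made-up numbers in arbitrary units, and the change names are placeholders rather than our real flags.

```python
from typing import Any, Callable, Dict

Config = Dict[str, Any]


def ablate_one_at_a_time(baseline: Config,
                         changes: Config,
                         evaluate: Callable[[Config], float]) -> Dict[str, float]:
    """Apply each change alone on top of the baseline; return metric delta per change."""
    base_score = evaluate(baseline)
    deltas = {}
    for key, value in changes.items():
        trial = {**baseline, key: value}  # exactly one change per trial
        deltas[key] = evaluate(trial) - base_score
    return deltas


def toy_evaluate(cfg: Config) -> float:
    """Stand-in for a real evaluation run. The weights are invented."""
    return (10
            + 2 * cfg["change_a"]
            + 1 * cfg["change_b"]
            + 1 * cfg["change_c"]
            - 6 * cfg["change_d"])


baseline = {"change_a": False, "change_b": False, "change_c": False, "change_d": False}
changes = {"change_a": True, "change_b": True, "change_c": True, "change_d": True}

deltas = ablate_one_at_a_time(baseline, changes, toy_evaluate)
culprit = min(deltas, key=deltas.get)            # the change with the worst solo delta
all_on = toy_evaluate({**baseline, **changes})   # lower than baseline despite three gains
```

In the toy setup, three changes help and one hurts enough to drag the combined score below baseline. That is the pattern that makes an all-at-once comparison misleading.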
06 THE RESULT
0.56 → 0.86, and the weak sports came home
● Shipping note: Precision sits at 0.51, so roughly half of the flagged moments are false highlights. For a highlight reel that's the right trade: better to over-catch than miss the goal. We tighten it after beta.
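One way to put a number on that trade is an F-beta score, which weights recall more heavily than precision when beta is above 1. The sketch below assumes the headline 0.86 is recall measured on the same evaluation set as the 0.51 precision, which this post does not state outright, so treat the outputs as illustrative arithmetic and not a reported result.

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Assumed operating point: precision 0.51, recall 0.86.
f1 = fbeta(0.51, 0.86, beta=1.0)  # balanced view
f2 = fbeta(0.51, 0.86, beta=2.0)  # recall-weighted view
```

Under that assumption F1 works out to about 0.64 and F2 to about 0.76. A recall-weighted score is the more natural fit for a product where a missed goal hurts more than an extra clip.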
07 WHAT WE'D TATTOO ON OUR ARM
Three lessons, paid for in wasted GPU hours
i. Decompose before you invest. We burned a cloud fine-tune chasing a bottleneck that was 15% of the problem. A one-afternoon failure breakdown would have redirected the whole week.
ii. Ablate one change at a time. Four fixes at once hid a regression inside a net gain. Isolation named the single culprit in one pass.
iii. Distrust the number that recovers too easily. Lowering a threshold would have masked the video-mode bug. The convenient fix and the correct fix are rarely the same move.
the winning config — all software, no new data
z_threshold: 2.0 # impact sensitivity
video_mode: false # the one flag that was the culprit
crop_size: 384 # sharper pose signal
model_complexity: 2 # pose precision
bottom_bias: 0.15 # stop cropping off the legs
tcn_rescue: on # classifier rescues weak candidates
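For readers wondering what a setting like `z_threshold` might control, here is a hedged sketch. It assumes the threshold is applied to a z-score of a per-frame kinematic signal, such as limb speed derived from pose keypoints, and that frames far above the clip average become impact candidates. That reading of the flag, the function name, and the toy signal are assumptions, not a description of our actual detector.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def impact_candidates(signal: Sequence[float], z_threshold: float = 2.0) -> List[int]:
    """Indices of frames whose z-score exceeds z_threshold.

    A lower threshold is more sensitive (more candidates, more false alarms).
    """
    mu = mean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        return []  # a flat signal has no outliers
    return [i for i, v in enumerate(signal) if (v - mu) / sigma > z_threshold]


# Toy per-frame speed signal with one sharp spike at the last frame.
speeds = [1.0] * 9 + [10.0]
candidates = impact_candidates(speeds, z_threshold=2.0)
```

With this toy signal the spike has a z-score of 3.0 and is the only frame flagged. A weak pose signal flattens exactly this kind of spike, which is why cleaning up the input can matter more than moving the threshold.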




