Build-in-public engineering log. Every number below is a real measurement pulled straight from our test reports — nothing rounded up to look nicer, nothing invented. If you only ever hear the wins, the wins mean nothing.
In Log #01 we told the messy story of getting highlight recall from 0.56 → 0.86. Since then we built the piece that sits underneath that pipeline: a video action classifier that decides what is happening in a clip. This log is the honest scoreboard for both — because we were curious ourselves what the model actually scores today.
TL;DR current state:
- Action classifier: 94.0% test accuracy, macro-F1 0.94, across 4 classes.
- Highlight pipeline (end to end): recall 0.86, up from 0.56.
- Training data grew 475 → 762 clips between v1 and v2.
Part 1 — The action classifier: 94% today
The classifier answers a narrow but load-bearing question: given a short clip, is this a baseball swing, a basketball layup, a soccer shot, or nothing (background play)? Get this wrong and every downstream highlight decision inherits the mistake.
The model is nothing exotic — an R(2+1)D-18 backbone, 16 frames sampled per clip at 112×112. Deliberately small: it has to run cheaply on real uploaded footage, not win a leaderboard.
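As a rough illustration of what "16 frames sampled per clip" can look like, here is a minimal sketch that assumes uniform sampling across the clip. The sampling strategy isn't spelled out in this log, so `sample_frame_indices` and `CLIP_SHAPE` are hypothetical names, not the project's actual code.

```python
def sample_frame_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across a clip.

    Assumption: uniform sampling. The real pipeline may sample differently.
    """
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    if num_samples == 1:
        return [num_frames // 2]
    # Evenly spaced positions from the first frame to the last frame,
    # rounded to the nearest valid index. Clips shorter than
    # `num_samples` frames simply repeat some indices.
    last = num_frames - 1
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]


# Per-clip tensor layout for 16 RGB frames at 112x112:
# (channels, frames, height, width).
CLIP_SHAPE = (3, 16, 112, 112)

indices = sample_frame_indices(num_frames=90)
print(len(indices), indices[0], indices[-1])
```

Uniform sampling covers the whole clip, so a swing or shot that happens late in the clip is still represented in the 16 frames the model sees.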
Here's the current test report, verbatim from video_scorer_testreport.json:
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| baseball_swing | 1.000 | 0.971 | 0.986 | 35 |
| soccer_shooting | 0.933 | 0.933 | 0.933 | 15 |
| none | 0.938 | 0.918 | 0.928 | 49 |
| basketball_layup | 0.850 | 0.944 | 0.895 | 18 |
| Overall accuracy | | | 0.940 | 117 |
Macro-F1 lands at 0.94. Baseball swing is nearly solved (precision 1.000: it never fired falsely on this test set). Basketball layup is our weakest link at F1 0.895, and the confusion matrix says exactly why.
Reading the confusion matrix honestly
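| true ↓ / predicted → | baseball_swing | basketball_layup | none | soccer_shooting |
|---|---|---|---|---|
| baseball_swing | 34 | 0 | 1 | 0 |
| basketball_layup | 0 | 17 | 1 | 0 |
| none | 0 | 3 | 45 | 1 |
| soccer_shooting | 0 | 0 | 1 | 14 |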
Every one of the 7 errors is a confusion with none, not a cross-sport mix-up. On this test set the model never calls a layup a swing; it occasionally can't decide whether a play is a highlight at all. Concretely: 3 background clips get misread as layups (that's what drags layup precision to 0.85), 1 background clip gets misread as a soccer shot, and one real action from each sport leaks into none.
That's the good kind of error profile. The sports are cleanly separable; the remaining work is sharpening the action vs. no-action boundary, which is a data problem, not an architecture problem.
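To make the arithmetic behind the table checkable, here is a small sketch that recomputes precision, recall, F1, accuracy, and macro-F1 from the confusion matrix. It assumes rows are true classes and columns are predictions in the same order as the rows, which is the ordering under which the matrix reproduces the reported precision figures. The function and variable names are illustrative only.

```python
LABELS = ["baseball_swing", "basketball_layup", "none", "soccer_shooting"]

# Rows are true classes, columns are predictions, both in LABELS order
# (assumed ordering, see the note above).
CONFUSION = [
    [34, 0, 1, 0],
    [0, 17, 1, 0],
    [0, 3, 45, 1],
    [0, 0, 1, 14],
]


def per_class_metrics(matrix, labels):
    """Compute precision, recall, F1, and support for every class."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum
        actual = sum(matrix[k])                    # row sum (support)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return metrics


metrics = per_class_metrics(CONFUSION, LABELS)
total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[k][k] for k in range(len(LABELS)))
accuracy = correct / total
# Macro-F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)

# Errors where neither the true nor the predicted class is "none",
# i.e. genuine cross-sport confusions.
none_idx = LABELS.index("none")
cross_sport_errors = sum(
    CONFUSION[t][p]
    for t in range(len(LABELS))
    for p in range(len(LABELS))
    if t != p and none_idx not in (t, p)
)

for label, m in metrics.items():
    print(f"{label:17s} P={m['precision']:.3f} R={m['recall']:.3f} "
          f"F1={m['f1']:.3f} n={m['support']}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} "
      f"cross_sport_errors={cross_sport_errors}")
```

Run as-is, this gives accuracy 110/117 ≈ 0.940, macro-F1 ≈ 0.935 (0.94 at two decimals), and zero cross-sport errors, matching the report.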
Part 2 — More data, honestly labeled
The single biggest lever between v1 and v2 wasn't a clever loss function. It was clips.
| | v1 | v2 |
|---|---|---|
| Total clips | 475 | 762 |
| none (train) | 119 | 335 |
We roughly tripled the negatives (none): 119 → 335 training clips, about 2.8×. That's not glamorous, but the confusion matrix above is exactly why it mattered: our errors live on the action/no-action boundary, and you only teach that boundary by showing the model far more of what a non-highlight looks like.
One caveat we keep visible in our own metadata: the current set is a research prototype. The note in dataset_meta.json literally reads "commercial model = rebuild on user / commercially-usable video." The 94% is real, but it's 94% on prototype-domain data — we're not going to quietly let that number imply more than it earned.
The augmentation bet (this one paid off)
Phone footage is ugly: motion blur, compression noise, players filmed tiny and far away, handheld shake. So we simulate all of it at train time rather than pretend clean clips generalize:
- motion blur up to a 21px kernel
- downscale to 0.35–0.7× ("player filmed far away")
- JPEG quality floor of 40 (phone compression artifacts)
- mild shift/scale/rotate for handheld shake
Each real clip spawns ~2 augmented variants. The point isn't more data for its own sake — it's data that looks like the mess the model meets in production.
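The recipe above can be sketched as a per-clip parameter sampler. This is a hedged illustration, not the project's augmentation code: the 21px blur ceiling, the 0.35–0.7× downscale range, the JPEG floor of 40, and the 2 variants per clip come from the list above, while the JPEG ceiling of 95, the shift/scale/rotate ranges, and every name (`AugmentParams`, `sample_params`, `variants_for_clip`) are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentParams:
    blur_kernel: int    # motion blur kernel size in pixels (odd)
    downscale: float    # resize factor simulating a far-away player
    jpeg_quality: int   # JPEG re-encode quality
    shift: float        # translation as a fraction of frame size
    scale: float        # zoom factor
    rotate_deg: float   # rotation in degrees


def sample_params(rng: random.Random) -> AugmentParams:
    """Draw one set of augmentation parameters."""
    return AugmentParams(
        blur_kernel=rng.randrange(3, 22, 2),  # odd sizes 3..21
        downscale=rng.uniform(0.35, 0.7),
        jpeg_quality=rng.randint(40, 95),     # floor 40 from the post; ceiling 95 assumed
        shift=rng.uniform(-0.05, 0.05),       # assumed "mild" range
        scale=rng.uniform(0.9, 1.1),          # assumed "mild" range
        rotate_deg=rng.uniform(-5.0, 5.0),    # assumed "mild" range
    )


def variants_for_clip(clip_id: str, n_variants: int = 2, seed: int = 0):
    """Return reproducible augmentation parameters for one clip.

    Seeding from the clip id means the same clip always gets the same
    variants, which keeps training runs comparable.
    """
    rng = random.Random(f"{seed}:{clip_id}")
    return [sample_params(rng) for _ in range(n_variants)]


for params in variants_for_clip("clip_0001"):
    print(params)
```

One design choice worth noting in this sketch: parameters are drawn once per clip rather than once per frame, so all 16 frames of a variant share the same blur, scale, and compression. Per-frame sampling would make consecutive frames flicker in a way real phone footage does not.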
Part 3 — The pipeline recall story (the Log #01 recap, with the receipts)
The classifier feeds a larger highlight-detection pipeline. That pipeline is the one that went 0.56 → 0.86 recall. Two lessons from that arc are worth re-stating, because they shaped how we built Part 1:
1. We fixed the wrong layer first. Our instinct was "make the detector see better," so we fine-tuned RF-DETR to a respectable 0.716 mAP — and end-to-end recall moved 0.60 → 0.59. Backwards. When we decomposed every miss by stage, detection owned only 15% of failures; 55% died at the pose/kinematic stage. We'd polished the layer that wasn't broken.
2. Never trust a batched change. We turned on four good features at once and the score dropped 0.769 → 0.692. A one-at-a-time ablation exposed a single culprit (video_mode) dragging the batch down. Three of four features were genuinely helping; one bad apple made the whole basket look rotten.
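The second lesson can be sketched as a one-at-a-time ablation loop. This is a minimal illustration under stated assumptions: it enables each feature alone on top of the baseline (a leave-one-out variant is equally valid), `toy_evaluate` stands in for a real end-to-end evaluation run, and the effect sizes and the names `feature_a`, `feature_b`, and `feature_c` are invented for the demo. Only `video_mode` is a name taken from this log.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def one_at_a_time_ablation(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Score delta of each feature when enabled alone on top of the baseline."""
    baseline = evaluate(frozenset())
    return {name: evaluate(frozenset({name})) - baseline for name in features}


# Toy stand-in for a real evaluation run. These effect sizes are
# invented for illustration and are NOT measurements from the project.
TOY_EFFECTS = {
    "feature_a": 0.02,
    "feature_b": 0.01,
    "feature_c": 0.015,
    "video_mode": -0.12,
}


def toy_evaluate(enabled: FrozenSet[str]) -> float:
    """Pretend score: a fixed baseline plus additive per-feature effects."""
    return 0.70 + sum(TOY_EFFECTS[name] for name in enabled)


FEATURES = list(TOY_EFFECTS)
deltas = one_at_a_time_ablation(FEATURES, toy_evaluate)
batched_delta = toy_evaluate(frozenset(FEATURES)) - toy_evaluate(frozenset())
culprits = [name for name, delta in deltas.items() if delta < 0]

print(f"batched change: {batched_delta:+.3f}")
for name, delta in deltas.items():
    print(f"{name:12s} {delta:+.3f}")
print("culprits:", culprits)
```

In the toy run the batched change is negative even though three of the four features help on their own, which is the pattern described above. Enabling features one at a time costs one evaluation per feature and does not capture interactions between features, so a surprising result is worth confirming with a leave-one-out run as well.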
Per-sport, the recall gains held across the board rather than being carried by one lucky category — basketball alone went 0.43 → 0.82.
Where we actually stand
- Classifier: 94.0% accuracy, macro-F1 0.94. Sports are cleanly separated; the only soft spot is the highlight/no-highlight boundary, and we know it's a negatives-data problem.
- Pipeline: 0.86 recall end-to-end, robust across sports.
- Known asterisk: prototype-domain data. The commercial rebuild on user footage is the next real test, and that number could move — we'll report it either way.
What's next
Rebuild the dataset on genuinely user-uploaded, commercially-usable footage and re-run this exact report. If 94% survives the domain shift, we have a product. If it doesn't, you'll see the drop here first.
Building something in the same space and hitting the action/no-action boundary too? I'd love to compare confusion matrices in the comments.