Engineering Log #02: What our sports CV model actually scores right now (94% classifier, 0.86 highlight recall)

#machinelearning #computervision #buildinpublic #python

Build-in-public engineering log. Every number below is a real measurement pulled straight from our test reports — nothing rounded up to look nicer, nothing invented. If you only ever hear the wins, the wins mean nothing.

In Log #01 we told the messy story of getting highlight recall from 0.56 → 0.86. Since then we built the piece that sits underneath that pipeline: a video action classifier that decides what is happening in a clip. This log is the honest scoreboard for both, because we wanted to know for ourselves what the model actually scores today.

TL;DR current state:

Action classifier: 94.0% test accuracy (110 of 117 test clips correct), macro-F1 0.94, across 4 classes.
Highlight pipeline (end to end): recall 0.86, up from 0.56.
Training data grew 475 → 762 clips between v1 and v2.

Part 1 — The action classifier: 94% today

The classifier answers a narrow but load-bearing question: given a short clip, is this a baseball swing, a basketball layup, a soccer shot, or nothing (background play)? Get this wrong and every downstream highlight decision inherits the mistake.

The model is nothing exotic — an R(2+1)D-18 backbone, 16 frames sampled per clip at 112×112. Deliberately small: it has to run cheaply on real uploaded footage, not win a leaderboard.
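How the 16 frames get picked is not spelled out above. As a minimal sketch, assuming they are spread evenly across the clip, the index selection could look like this. `sample_frame_indices` is a hypothetical helper name, not something from our codebase, and decoding or resizing the frames to 112×112 is left to whatever video library you use.

```python
def sample_frame_indices(num_frames_in_clip: int, num_samples: int = 16) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    ASSUMPTION: uniform temporal sampling. The log only states that
    16 frames are sampled per clip, not how they are chosen.
    """
    if num_frames_in_clip <= 0:
        raise ValueError("clip has no frames")
    # Take the centre of each of num_samples equal-width segments.
    # Clips shorter than num_samples simply repeat some frames.
    return [
        min(num_frames_in_clip - 1, int((i + 0.5) * num_frames_in_clip / num_samples))
        for i in range(num_samples)
    ]


indices = sample_frame_indices(90)  # e.g. a 3-second clip at 30 fps
print(len(indices), indices[0], indices[-1])
```

Sampling segment centres rather than the first 16 frames keeps the whole motion (wind-up, contact, follow-through) in view even when clips differ in length.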

Here's the current test report, verbatim from video_scorer_testreport.json:

Class	Precision	Recall	F1	Support
baseball_swing	1.000	0.971	0.986	35
soccer_shooting	0.933	0.933	0.933	15
none	0.938	0.918	0.928	49
basketball_layup	0.850	0.944	0.895	18
Overall accuracy			0.940	117

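Macro-F1, the unweighted mean of the four per-class F1 scores, lands at 0.94 (about 0.935 before rounding). Baseball swing is nearly solved: precision 1.000 means it never fired falsely on this test set. Basketball layup is our weakest link at F1 0.895, and the confusion matrix says exactly why.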

Reading the confusion matrix honestly

                 pred: base  bball   none  soccer
baseball_swing  [   34      0      1      0  ]
basketball_layup[    0     17      1      0  ]
none            [    0      3     45      1  ]
soccer_shooting [    0      0      1     14  ]

All 7 test errors are confusions with none; there is not a single cross-sport mix-up. The model never calls a layup a swing on this test set; it occasionally can't decide whether a play is a highlight at all. Concretely: 3 background clips get misread as layups (that's what drags layup precision to 0.85), 1 background clip gets misread as a soccer shot, and 3 real actions (one per sport) leak into none.
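If you want to check the per-class table against the matrix yourself, here is a small sketch that recomputes everything from the raw counts. It assumes the columns follow the same order as the rows (baseball, basketball, none, soccer), which is the only order that reproduces the precision and recall values in the test report. The function and variable names are ours for illustration only.

```python
labels = ["baseball_swing", "basketball_layup", "none", "soccer_shooting"]

# Rows = true class, columns = predicted class, both in the order of `labels`.
confusion = [
    [34, 0, 1, 0],
    [0, 17, 1, 0],
    [0, 3, 45, 1],
    [0, 0, 1, 14],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for each class index."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[r][k] for r in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


metrics = per_class_metrics(confusion)
n = len(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(n)) / total
macro_f1 = sum(f1 for _, _, f1 in metrics) / n

# Count off-diagonal cells where either the truth or the prediction is `none`.
none_idx = labels.index("none")
total_errors = total - sum(confusion[k][k] for k in range(n))
errors_involving_none = sum(
    confusion[r][c]
    for r in range(n)
    for c in range(n)
    if r != c and none_idx in (r, c)
)

for name, (p, r, f1) in zip(labels, metrics):
    print(f"{name:17s} P={p:.3f} R={r:.3f} F1={f1:.3f}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
print(f"errors touching none: {errors_involving_none} of {total_errors}")
```

The last line is the quick test for "is this an action vs. no-action problem or a cross-sport problem": if the two counts match, every mistake sits on the none boundary.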

That's the good kind of error profile. The sports are cleanly separable; the remaining work is sharpening the action vs. no-action boundary, which is a data problem, not an architecture problem.

Part 2 — More data, honestly labeled

The single biggest lever between v1 and v2 wasn't a clever loss function. It was clips.

	v1	v2
Total clips	475	762
`none` (train)	119	335

We nearly tripled the none negatives in the training split (119 → 335, about 2.8×). That's not glamorous, but the confusion matrix above is exactly why it mattered: our errors live on the action/no-action boundary, and you only teach that boundary by showing the model far more of what a non-highlight looks like.

One caveat we keep visible in our own metadata: the current set is a research prototype. The note in dataset_meta.json literally reads "commercial model = rebuild on user / commercially-usable video." The 94% is real, but it's 94% on prototype-domain data — we're not going to quietly let that number imply more than it earned.

The augmentation bet (this one paid off)

Phone footage is ugly: motion blur, compression noise, players filmed tiny and far away, handheld shake. So we simulate all of it at train time rather than pretend clean clips generalize:

motion blur up to a 21px kernel
downscale to 0.35–0.7× ("player filmed far away")
JPEG quality floor of 40 (phone compression artifacts)
mild shift/scale/rotate for handheld shake

Each real clip spawns ~2 augmented variants. The point isn't more data for its own sake — it's data that looks like the mess the model meets in production.
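To make the recipe concrete, here is a hedged sketch of how the per-variant parameters could be drawn. It only samples the settings; applying them to pixels is left to an image library of your choice. Everything not listed above is an assumption and marked as such in the comments: the minimum blur kernel of 3, the JPEG quality ceiling of 100, and all three limits for "mild" shake are placeholders, and `AugParams`, `sample_aug_params` and `augment_plan` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass
class AugParams:
    blur_kernel_px: int   # motion blur kernel size (odd)
    downscale: float      # resize factor simulating a far-away player
    jpeg_quality: int     # re-encode quality simulating phone compression
    shift_frac: float     # handheld shake: translation as a fraction of frame size
    scale_delta: float    # handheld shake: relative zoom change
    rotate_deg: float     # handheld shake: rotation in degrees


def sample_aug_params(rng: random.Random) -> AugParams:
    return AugParams(
        # Up to 21 px per the list above; lower bound of 3 is an ASSUMPTION.
        blur_kernel_px=rng.choice(range(3, 22, 2)),
        downscale=rng.uniform(0.35, 0.7),
        # Floor of 40 per the list above; ceiling of 100 is an ASSUMPTION.
        jpeg_quality=rng.randint(40, 100),
        # "Mild" is not quantified in the log; these limits are ASSUMED.
        shift_frac=rng.uniform(-0.05, 0.05),
        scale_delta=rng.uniform(-0.10, 0.10),
        rotate_deg=rng.uniform(-5.0, 5.0),
    )


def augment_plan(clip_ids, variants_per_clip=2, seed=0):
    """Pair every real clip with variants_per_clip sampled parameter sets."""
    rng = random.Random(seed)
    return [
        (clip_id, sample_aug_params(rng))
        for clip_id in clip_ids
        for _ in range(variants_per_clip)
    ]


plan = augment_plan(["clip_001", "clip_002", "clip_003"])
for clip_id, params in plan:
    print(clip_id, params)
```

One design choice worth flagging: the sketch draws one parameter set per variant, not per frame, on the assumption that blur, scale and compression should stay consistent across the 16 frames of a clip so the augmentation doesn't introduce flicker the model would never see in real footage.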

Part 3 — The pipeline recall story (the Log #01 recap, with the receipts)

The classifier feeds a larger highlight-detection pipeline. That pipeline is the one that went 0.56 → 0.86 recall. Two lessons from that arc are worth re-stating, because they shaped how we built Part 1:

1. We fixed the wrong layer first. Our instinct was "make the detector see better," so we fine-tuned RF-DETR to a respectable 0.716 mAP — and end-to-end recall moved 0.60 → 0.59. Backwards. When we decomposed every miss by stage, detection owned only 15% of failures; 55% died at the pose/kinematic stage. We'd polished the layer that wasn't broken.

2. Never trust a batched change. We turned on four good features at once and the score dropped 0.769 → 0.692. A one-at-a-time ablation exposed a single culprit (video_mode) dragging the batch down. Three of four features were genuinely helping; one bad apple made the whole basket look rotten.
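Both lessons come down to small bookkeeping routines, sketched below. This is an illustration, not our pipeline code: the function names are hypothetical, the miss list is toy data sized so the shares match the 15% and 55% quoted above, and the feature names other than video_mode plus every per-feature delta are invented placeholders chosen only so the batch total reproduces the 0.769 → 0.692 drop. The toy scorer is also purely additive, which real features usually are not.

```python
from collections import Counter


def failure_share_by_stage(miss_stages):
    """Fraction of missed highlights attributed to each pipeline stage."""
    counts = Counter(miss_stages)
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.most_common()}


def one_at_a_time(evaluate, features):
    """Score change from enabling each feature alone on top of the baseline."""
    baseline = evaluate(frozenset())
    return {f: evaluate(frozenset({f})) - baseline for f in features}


# Lesson 1: tag every miss with the stage where it died, then count.
# TOY DATA: 20 misses arranged to match the shares reported in the text.
toy_misses = ["detection"] * 3 + ["pose_kinematic"] * 11 + ["other"] * 6
shares = failure_share_by_stage(toy_misses)
print(shares)

# Lesson 2: flip one feature at a time instead of all four together.
# PLACEHOLDER deltas, not measurements; only the 0.769 baseline and the
# 0.692 all-on score come from the log.
TOY_DELTAS = {
    "feature_a": 0.010,
    "feature_b": 0.010,
    "feature_c": 0.010,
    "video_mode": -0.107,
}


def toy_evaluate(enabled):
    """Stand-in for a real end-to-end evaluation run."""
    return 0.769 + sum(TOY_DELTAS[f] for f in enabled)


deltas = one_at_a_time(toy_evaluate, list(TOY_DELTAS))
culprits = [f for f, d in deltas.items() if d < 0]
print(f"all four on: {toy_evaluate(frozenset(TOY_DELTAS)):.3f}")
print(f"culprits: {culprits}")
```

Enabling features one by one from the baseline is only one way to run the ablation. The complementary check is leave-one-out from the full set, which is the better choice when you suspect features interact.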

Per sport, the recall gains held across the board rather than being carried by one lucky category; basketball, for example, went 0.43 → 0.82.

Where we actually stand

Classifier: 94.0% accuracy, macro-F1 0.94. Sports are cleanly separated; the only soft spot is the highlight/no-highlight boundary, and we know it's a negatives-data problem.
Pipeline: 0.86 recall end-to-end, robust across sports.
Known asterisk: prototype-domain data. The commercial rebuild on user footage is the next real test, and that number could move — we'll report it either way.

What's next

Rebuild the dataset on genuinely user-uploaded, commercially-usable footage and re-run this exact report. If 94% survives the domain shift, we have a product. If it doesn't, you'll see the drop here first.

Building something in the same space and hitting the action/no-action boundary too? I'd love to compare confusion matrices in the comments.