DEV Community

yubin hong

Engineering Log #02: What our sports CV model actually scores right now (94% classifier, 0.86 highlight recall)

Build-in-public engineering log. Every number below is a real measurement pulled straight from our test reports — nothing rounded up to look nicer, nothing invented. If you only ever hear the wins, the wins mean nothing.

In Log #01 we told the messy story of getting highlight recall from 0.56 → 0.86. Since then we built the piece that sits underneath that pipeline: a video action classifier that decides what is happening in a clip. This log is the honest scoreboard for both — because we were curious ourselves what the model actually scores today.

TL;DR current state:

  • Action classifier: 94.0% test accuracy, macro-F1 0.94, across 4 classes.
  • Highlight pipeline (end to end): recall 0.86, up from 0.56.
  • Dataset grew 475 → 762 total clips between v1 and v2.

Part 1 — The action classifier: 94% today

The classifier answers a narrow but load-bearing question: given a short clip, is this a baseball swing, a basketball layup, a soccer shot, or nothing (background play)? Get this wrong and every downstream highlight decision inherits the mistake.

The model is nothing exotic — an R(2+1)D-18 backbone, 16 frames sampled per clip at 112×112. Deliberately small: it has to run cheaply on real uploaded footage, not win a leaderboard.
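The post does not say how the 16 frames are picked from each clip. A minimal sketch, assuming evenly spaced sampling (one frame from the middle of each of 16 equal segments); `sample_frame_indices` is a hypothetical helper, and the channels-first tensor layout is likewise an assumption:

```python
def sample_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick num_samples frame indices spread evenly over a clip.

    Each index is the midpoint of one of num_samples equal segments, so the
    very first and last frames are not over-represented. Clips shorter than
    num_samples repeat frames instead of failing.
    """
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_samples
    return [min(total_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]


# Example: a 3-second clip at 30 fps has 90 frames.
indices = sample_frame_indices(90)
print(indices)

# Assumed input shape for one clip: (channels, frames, height, width).
clip_shape = (3, len(indices), 112, 112)
print(clip_shape)
```

Whatever sampling rule is actually used, the fixed 16-frame budget means long clips get sparser temporal coverage than short ones.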

Here's the current test report, verbatim from video_scorer_testreport.json:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| baseball_swing | 1.000 | 0.971 | 0.986 | 35 |
| soccer_shooting | 0.933 | 0.933 | 0.933 | 15 |
| none | 0.938 | 0.918 | 0.928 | 49 |
| basketball_layup | 0.850 | 0.944 | 0.895 | 18 |

Overall accuracy: 0.940 on 117 test clips.

Macro-F1 lands at 0.94. Baseball swing is nearly solved (precision 1.000: no false positives on this test set). Basketball layup is our weakest link at F1 0.895, and the confusion matrix says exactly why.

Reading the confusion matrix honestly

true \ pred          base  bball   none soccer
baseball_swing   [     34      0      1      0  ]
basketball_layup [      0     17      1      0  ]
none             [      0      3     45      1  ]
soccer_shooting  [      0      0      1     14  ]

Every one of the 7 errors is a confusion with none, not a cross-sport mix-up. The model never calls a layup a swing in this test set; it occasionally can't decide whether a play is a highlight at all. Concretely: 3 background clips get misread as layups (that's what drags layup precision to 0.85), 1 background clip gets misread as a soccer shot, and one real clip from each sport leaks into none.
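The per-class numbers in the test report can be re-derived from the confusion matrix alone. A minimal sketch, assuming rows are true classes and columns follow the same class order as the rows (which is what the diagonal implies); `per_class_report` is a hypothetical helper, not the project's reporting code:

```python
LABELS = ["baseball_swing", "basketball_layup", "none", "soccer_shooting"]

# Rows are true classes, columns are predicted classes, both in LABELS order.
CONFUSION = [
    [34, 0, 1, 0],
    [0, 17, 1, 0],
    [0, 3, 45, 1],
    [0, 0, 1, 14],
]


def per_class_report(matrix, labels):
    """Return {label: (precision, recall, f1, support)} from a confusion matrix."""
    report = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum: times class k was predicted
        support = sum(matrix[k])                   # row sum: true clips of class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        report[label] = (precision, recall, f1, support)
    return report


report = per_class_report(CONFUSION, LABELS)
total = sum(map(sum, CONFUSION))
accuracy = sum(CONFUSION[k][k] for k in range(len(LABELS))) / total
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)

for label, (p, r, f1, n) in report.items():
    print(f"{label:<17} {p:.3f} {r:.3f} {f1:.3f} {n:>3}")
print(f"accuracy {accuracy:.3f}  macro-F1 {macro_f1:.3f}  n={total}")
```

Layup precision is 17 correct out of 20 layup predictions, and the 3 extra predictions all come from the none row.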

That's the good kind of error profile. The sports are cleanly separable; the remaining work is sharpening the action vs. no-action boundary, which is a data problem, not an architecture problem.
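One way to put a number on that boundary is to collapse the four classes into action vs. no-action and count what is left. This is a sketch built only on the confusion matrix above, again assuming columns follow the row order; the binary view is an illustration, not a metric the test report itself contains:

```python
LABELS = ["baseball_swing", "basketball_layup", "none", "soccer_shooting"]
CONFUSION = [  # rows = true class, columns = predicted class
    [34, 0, 1, 0],
    [0, 17, 1, 0],
    [0, 3, 45, 1],
    [0, 0, 1, 14],
]
NONE = LABELS.index("none")

tp = fp = fn = tn = cross_sport = 0
for i, row in enumerate(CONFUSION):
    for j, count in enumerate(row):
        true_action, pred_action = i != NONE, j != NONE
        if true_action and pred_action:
            tp += count                # an action flagged as some action
            if i != j:
                cross_sport += count   # right that it is an action, wrong sport
        elif true_action:
            fn += count                # real action predicted as none
        elif pred_action:
            fp += count                # background predicted as an action
        else:
            tn += count

print(f"cross-sport mix-ups: {cross_sport}")
print(f"missed actions (action -> none): {fn}")
print(f"false alarms (none -> action): {fp}")
print(f"action precision {tp / (tp + fp):.3f}, recall {tp / (tp + fn):.3f}")
```

With zero cross-sport mix-ups, the binary counts account for every error in the matrix, so tracking just these two rates across dataset versions would show whether added negatives are moving the boundary.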


Part 2 — More data, honestly labeled

The single biggest lever between v1 and v2 wasn't a clever loss function. It was clips.

| | v1 | v2 |
|---|---|---|
| Total clips | 475 | 762 |
| none (train) | 119 | 335 |

We nearly tripled the training negatives (none): 119 → 335, about 2.8×. That's not glamorous, but the confusion matrix above is exactly why it mattered: our errors live on the action/no-action boundary, and you only teach that boundary by showing the model far more of what a non-highlight looks like.

One caveat we keep visible in our own metadata: the current set is a research prototype. The note in dataset_meta.json literally reads "commercial model = rebuild on user / commercially-usable video." The 94% is real, but it's 94% on prototype-domain data — we're not going to quietly let that number imply more than it earned.

The augmentation bet (this one paid off)

Phone footage is ugly: motion blur, compression noise, players filmed tiny and far away, handheld shake. So we simulate all of it at train time rather than pretend clean clips generalize:

  • motion blur up to a 21px kernel
  • downscale to 0.35–0.7× ("player filmed far away")
  • JPEG quality floor of 40 (phone compression artifacts)
  • mild shift/scale/rotate for handheld shake

Each real clip spawns ~2 augmented variants. The point isn't more data for its own sake — it's data that looks like the mess the model meets in production.
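The list above translates naturally into a per-clip parameter draw. The following is a sketch of that sampling step only, not the project's augmentation code: `ClipAugmentation`, `sample_augmentation`, and `augmentation_plan` are hypothetical names, the shift/scale/rotate ranges are placeholders because the post only says "mild", and drawing one parameter set per clip (so all 16 frames degrade consistently) is an assumption. Applying the parameters to pixels would need an image library and is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipAugmentation:
    blur_kernel: int    # motion-blur kernel size in pixels (odd)
    downscale: float    # shrink factor applied before resizing back up
    jpeg_quality: int   # re-encode quality; lower means more artifacts
    shift: float        # translation as a fraction of frame size
    scale: float        # zoom factor
    rotate_deg: float   # rotation in degrees


def sample_augmentation(rng: random.Random) -> ClipAugmentation:
    """Draw one set of degradation parameters for a whole clip."""
    return ClipAugmentation(
        blur_kernel=rng.randrange(3, 22, 2),   # odd sizes 3..21
        downscale=rng.uniform(0.35, 0.7),
        jpeg_quality=rng.randint(40, 100),     # quality floor of 40
        # Placeholder ranges for "mild"; the real limits are not given.
        shift=rng.uniform(-0.05, 0.05),
        scale=rng.uniform(0.9, 1.1),
        rotate_deg=rng.uniform(-5.0, 5.0),
    )


def augmentation_plan(clip_ids, variants_per_clip=2, seed=0):
    """Map each real clip to the parameter sets for its augmented variants."""
    rng = random.Random(seed)
    return {
        clip_id: [sample_augmentation(rng) for _ in range(variants_per_clip)]
        for clip_id in clip_ids
    }


plan = augmentation_plan(["clip_0001", "clip_0002"])
for clip_id, variants in plan.items():
    for params in variants:
        print(clip_id, params)
```

Seeding the generator keeps the augmented set reproducible, which matters when comparing test reports across dataset versions.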


Part 3 — The pipeline recall story (the Log #01 recap, with the receipts)

The classifier feeds a larger highlight-detection pipeline. That pipeline is the one that went 0.56 → 0.86 recall. Two lessons from that arc are worth re-stating, because they shaped how we built Part 1:

1. We fixed the wrong layer first. Our instinct was "make the detector see better," so we fine-tuned RF-DETR to a respectable 0.716 mAP — and end-to-end recall moved 0.60 → 0.59. Backwards. When we decomposed every miss by stage, detection owned only 15% of failures; 55% died at the pose/kinematic stage. We'd polished the layer that wasn't broken.

2. Never trust a batched change. We turned on four good features at once and the score dropped 0.769 → 0.692. A one-at-a-time ablation exposed a single culprit (video_mode) dragging the batch down. Three of four features were genuinely helping; one bad apple made the whole basket look rotten.
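Both lessons come down to small bookkeeping routines. The sketch below assumes each missed highlight has already been tagged with the stage where it was lost, and that a hypothetical `evaluate` callable runs the pipeline with a given set of features enabled. All inputs are toy values invented to exercise the functions (the miss list is sized to reproduce the 55% and 15% shares quoted above, and the "other" bucket and feature names other than `video_mode` are made up).

```python
from collections import Counter


def failure_share_by_stage(missed_stages):
    """Given the stage where each missed highlight was lost, return each stage's share."""
    counts = Counter(missed_stages)
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.most_common()}


def ablate_one_at_a_time(evaluate, features):
    """Score the baseline, then each feature enabled on its own.

    `evaluate` takes a frozenset of enabled feature names and returns a metric.
    Returns {feature: change versus the baseline}.
    """
    baseline = evaluate(frozenset())
    return {f: evaluate(frozenset({f})) - baseline for f in features}


# Toy inputs, invented purely for illustration.
toy_misses = ["pose"] * 11 + ["detection"] * 3 + ["other"] * 6
shares = failure_share_by_stage(toy_misses)
print(shares)

TOY_EFFECT = {"feat_a": 0.02, "feat_b": 0.01, "feat_c": 0.03, "video_mode": -0.10}


def toy_evaluate(enabled):
    # Stand-in for a full pipeline run; real feature effects need not be additive.
    return 0.70 + sum(TOY_EFFECT[f] for f in enabled)


deltas = ablate_one_at_a_time(toy_evaluate, TOY_EFFECT)
culprits = [f for f, d in deltas.items() if d < 0]
print(culprits)
```

Enabling features one by one is only one ablation style; removing one feature at a time from the full batch is the mirror image and can expose interactions this version misses.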

Per-sport, the recall gains held across the board rather than being carried by one lucky category — basketball alone went 0.43 → 0.82.


Where we actually stand

  • Classifier: 94.0% accuracy, macro-F1 0.94. Sports are cleanly separated; the only soft spot is the highlight/no-highlight boundary, and we know it's a negatives-data problem.
  • Pipeline: 0.86 recall end-to-end, robust across sports.
  • Known asterisk: prototype-domain data. The commercial rebuild on user footage is the next real test, and that number could move — we'll report it either way.

What's next

Rebuild the dataset on genuinely user-uploaded, commercially-usable footage and re-run this exact report. If 94% survives the domain shift, we have a product. If it doesn't, you'll see the drop here first.
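When the report is re-run, it helps to know how much movement a 117-clip test set can produce by chance. A rough sketch using the Wilson score interval for a proportion; this is a standard statistical check added here for illustration, not something the test report contains, and `wilson_interval` is a hypothetical helper:

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# 110 of 117 test clips were classified correctly (the confusion-matrix diagonal).
low, high = wilson_interval(110, 117)
print(f"accuracy 95% interval: {low:.3f} to {high:.3f}")
```

With a test set this small the interval spans several points, so a drop of a point or two on new footage would not by itself indicate a domain-shift problem.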

Building something in the same space and hitting the action/no-action boundary too? I'd love to compare confusion matrices in the comments.
