Distilling SAM 2 into a 6MB student for industrial inspection

#machinelearning #pytorch #mlops #computervision

TL;DR: We took Meta's SAM 2 small (around 224M params) and distilled it into a 6.3MB student that runs at 31 FPS on a Jetson Orin Nano for an automotive surface-defect pipeline. Mask IoU drops from 0.91 to 0.84, which is acceptable for the defect shapes we care about. The single biggest lever was a feature-alignment loss on the image embedding, not the mask logits.

Most of my year goes into event-camera work at Prophesee, but a side contract this spring with an automotive supplier outside Brescia ate two months of my evenings. They make aluminium body panels and wanted real-time masks for surface defects: scratches, dents, paint pinholes. The cameras are boring CMOS at 25 FPS and 4MP. The target hardware is a Jetson Orin Nano because the PLCs on the line already talk to one over Ethernet.

The first thing we tried was fine-tuning SAM 2 small directly and shipping it with TensorRT FP16. That took about 1.2 seconds per image on the Orin, roughly 30x too slow for a line imaged at 25 FPS. We needed something a lot smaller.
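The 30x figure is just the per-frame budget at the camera rate compared with the measured latency. A quick check using only the numbers quoted above:

```python
# Frame-budget arithmetic using the figures quoted in the text.
camera_fps = 25            # line cameras
teacher_latency_s = 1.2    # fine-tuned SAM 2 small, TensorRT FP16, on the Orin

budget_s = 1.0 / camera_fps                # time available per frame
slowdown = teacher_latency_s / budget_s    # how far over budget the teacher is

print(f"budget per frame: {budget_s * 1000:.0f} ms")   # budget per frame: 40 ms
print(f"teacher is {slowdown:.0f}x over budget")        # teacher is 30x over budget
```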

The student architecture

We did a MobileSAM-style backbone transplant but kept going further. TinyViT-5M as the image encoder, prompt encoder stripped down to dense point prompts only (we feed it candidate locations from a cheap saliency head, no box prompts), and we cut the mask decoder's upsample to 1/2 instead of 1/4. That last bit lost us a tiny amount of edge precision but it was the difference between 19 FPS and 31 FPS on the Orin.
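To make the "candidate locations from a saliency head" step concrete, here is a minimal sketch of turning a 2D saliency map into point prompts by picking local maxima. It is an illustration under assumptions, not the production code: the function name `saliency_to_points`, the window size, the score threshold and `k` are all hypothetical.

```python
import numpy as np

def saliency_to_points(saliency, k=8, window=5, min_score=0.5):
    """Return up to k (x, y) peak locations from a 2D saliency map.

    A pixel counts as a peak if it is the maximum of its window x window
    neighbourhood and scores at least min_score. All defaults here are
    illustrative, not values from the real pipeline.
    """
    saliency = np.asarray(saliency, dtype=float)
    pad = window // 2
    padded = np.pad(saliency, pad, mode="constant", constant_values=-np.inf)
    # Sliding-window maximum via a strided view (no SciPy dependency).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (window, window))
    local_max = windows.max(axis=(2, 3))
    is_peak = (saliency >= local_max) & (saliency >= min_score)
    ys, xs = np.nonzero(is_peak)
    # Strongest peaks first, then keep the top k.
    order = np.argsort(-saliency[ys, xs], kind="stable")[:k]
    return [(int(xs[i]), int(ys[i])) for i in order]

sal = np.zeros((16, 16))
sal[4, 3] = 0.9
sal[10, 12] = 0.7
sal[10, 13] = 0.6   # suppressed: sits next to a stronger peak
print(saliency_to_points(sal))   # [(3, 4), (12, 10)]
```

Note that a flat plateau would yield several tied peaks in this sketch; a real pipeline would likely need a tie-break or a minimum distance between points.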

The student is 1.6M params. Storage is 6.3MB after INT8 weight-only quantisation. The teacher was held in FP16 on a separate A6000 during training.

The loss that actually worked

Here is the part where I want to be very honest. We spent two weeks on cleverness: multi-scale logit matching, attention transfer, contrastive losses on the prompt embedding. None of it beat the obvious thing once we put it in.

The obvious thing: align the student's image embedding to the teacher's in cosine space, alongside the usual soft mask BCE and a supervised dice term on the actual ground truth.

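import torch
import torch.nn.functional as F


def distill_loss(s_emb, t_emb, s_logits, t_logits, gt_mask):
    """Feature alignment + soft-mask distillation + supervised dice.

    s_emb, t_emb: (B, C, H, W) image embeddings with identical shapes.
    s_logits, t_logits, gt_mask: 4D tensors of the same shape; gt_mask is float.
    """
    # feature alignment in cosine space: flatten to (B, C, H*W) and compare
    # each channel's spatial map between student and teacher
    s_n = F.normalize(s_emb.flatten(2), dim=-1)
    t_n = F.normalize(t_emb.detach().flatten(2), dim=-1)  # no gradient into the teacher
    feat = 1.0 - (s_n * t_n).sum(-1).mean()

    # soft mask distillation: teacher logits softened with temperature 2.0
    soft_target = torch.sigmoid(t_logits.detach() / 2.0)
    soft = F.binary_cross_entropy_with_logits(s_logits, soft_target)

    # supervised dice on hand-labelled defects
    p = torch.sigmoid(s_logits)
    inter = (p * gt_mask).sum((1, 2, 3))
    total = p.sum((1, 2, 3)) + gt_mask.sum((1, 2, 3))  # sum of both mask areas
    dice = 1.0 - (2 * inter / (total + 1e-6)).mean()

    return 0.4 * feat + 0.3 * soft + 0.3 * dice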

Without the feat term, we plateaued at IoU 0.71 on the held-out set of 4,200 defect crops. With it, 0.84. Same student, same data, same schedule. That is the kind of result that makes you doubt your prior assumptions about what knowledge transfer actually means.
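For anyone wanting to sanity-check the feature term in isolation, the sketch below pulls it out of distill_loss and probes three properties: identical embeddings give zero loss, the term ignores overall scale, and sign-flipped embeddings give the maximum of 2. The tensor shape is an arbitrary assumption for the test, not our real embedding size, and the explicit shape check is an addition for clarity.

```python
import torch
import torch.nn.functional as F

def feature_alignment(s_emb: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
    """Cosine feature term for (B, C, H, W) embeddings, as in distill_loss."""
    if s_emb.shape != t_emb.shape:
        raise ValueError(
            f"shape mismatch: {tuple(s_emb.shape)} vs {tuple(t_emb.shape)}"
        )
    # (B, C, H*W): each channel's spatial map is scaled to unit norm.
    s_n = F.normalize(s_emb.flatten(2), dim=-1)
    t_n = F.normalize(t_emb.flatten(2), dim=-1)
    # Per-channel cosine similarity, averaged over batch and channels.
    return 1.0 - (s_n * t_n).sum(-1).mean()

torch.manual_seed(0)
t = torch.randn(2, 8, 16, 16)             # stand-in teacher embedding
same = feature_alignment(t, t)            # identical embeddings
scaled = feature_alignment(3.0 * t, t)    # same direction, different scale
flipped = feature_alignment(-t, t)        # opposite direction
print(same.item(), scaled.item(), flipped.item())  # approximately 0, 0 and 2
```

Because the cosine is taken along the flattened spatial axis, the term only sees the shape of each channel's activation map, not its magnitude.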

What the numbers look like

Variant	Params	Size	IoU	FPS (Orin)
SAM 2 small (teacher)	224M	884 MB	0.91	0.8
MobileSAM-style transplant	9.8M	39 MB	0.78	18
Ours, no feature loss	1.6M	6.3 MB	0.71	31
Ours, with feature loss	1.6M	6.3 MB	0.84	31

Where a VLM judge crept in

One thing that surprised me during training: about 6% of held-out crops had teacher masks that were visibly wrong. Pinholes the teacher missed entirely, or scratches it bled into adjacent reflections. If you blindly distill on those, the student learns the teacher's bad habits.

For the crops where student and teacher disagreed by more than 0.15 IoU, we ran a VLM-as-judge step. Claude Sonnet 4.5 was the primary judge with Gemini 2.5 Pro as a backup; each was asked to pick which mask better followed the actual defect contour, given the original crop and both overlays. We routed those calls through Bifrost (https://github.com/maximhq/bifrost) because we needed automatic failover between the two providers during peak hours when one would rate-limit us, and writing that retry logic ourselves felt like wasted time. About 38% of disagreements were judged in the student's favour, so we re-weighted those samples in the next epoch.
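The routing logic is simple enough to sketch in plain Python. Two assumptions are baked in: "disagreed by more than 0.15 IoU" is read here as the student-versus-teacher mask IoU falling below 0.85, and the judges are hypothetical callables standing in for the real provider calls. This is not Bifrost's API, only the shape of the failover it handled for us.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two binary masks; two empty masks count as perfect agreement."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(np.logical_and(a, b).sum() / union)

def needs_judge(student_mask, teacher_mask, max_gap=0.15):
    """Flag a crop when student/teacher overlap falls short by more than max_gap."""
    return 1.0 - mask_iou(student_mask, teacher_mask) > max_gap

def judge_with_failover(crop, judges):
    """Try each (name, callable) judge in order; return the first success.

    Each callable is a hypothetical wrapper that returns "student" or
    "teacher", or raises (for example on a rate limit).
    """
    errors = []
    for name, judge in judges:
        try:
            return name, judge(crop)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all judges failed: " + "; ".join(errors))

# Toy stand-ins for the two providers.
def flaky_primary(crop):
    raise TimeoutError("rate limited")

def steady_backup(crop):
    return "student"

judges = [("primary", flaky_primary), ("backup", steady_backup)]
print(judge_with_failover({"id": 17}, judges))   # ('backup', 'student')
```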

I will say: the VLM-as-judge is not magic and we hand-checked a stratified sample of its calls. The agreement with our QA lead was around 84%, which was good enough to drive sample re-weighting but not good enough to use as ground truth.
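The hand-check itself is easy to reproduce. Below is a minimal sketch of drawing a stratified sample of judge verdicts and scoring agreement against a human label. The record fields and the choice of defect type as the stratum are hypothetical; the toy data is made up for the example and is not our QA data.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, seed=0):
    """Pick up to per_stratum records from each stratum, reproducibly."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    sample = []
    for stratum in sorted(groups):
        items = groups[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample

def agreement_rate(sample, judge_key="judge", human_key="human"):
    """Fraction of sampled records where judge and human chose the same mask."""
    if not sample:
        raise ValueError("empty sample")
    return sum(r[judge_key] == r[human_key] for r in sample) / len(sample)

# Toy records: which mask the judge preferred vs. which the human preferred.
records = [
    {"defect": "scratch", "judge": "student", "human": "student"},
    {"defect": "scratch", "judge": "teacher", "human": "student"},
    {"defect": "dent", "judge": "teacher", "human": "teacher"},
    {"defect": "pinhole", "judge": "student", "human": "student"},
]
sample = stratified_sample(records, key="defect", per_stratum=2)
print(agreement_rate(sample))   # 0.75
```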

Trade-offs and limitations

Hairline scratches below about 80 microns at our working distance are still missed. The teacher catches some of those. We are debating whether to keep a slow second pass for ambiguous frames.
The feature-alignment loss requires teacher and student image embeddings at the same spatial resolution. That ruled out a couple of backbone choices we wanted to try.
Training was nine days on 4x A6000s. Distillation is not cheap, even if the result is.
We have not validated on glass or carbon-fibre panels. Aluminium only.
INT8 quantisation cost us 0.6 IoU points. Worth it for the size but not free.
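For readers new to weight-only quantisation, here is a minimal NumPy sketch of one common scheme: symmetric, per-output-channel INT8. It is shown as an assumption for illustration, not a record of the exact export path, and the weight matrix is random.

```python
import numpy as np

def quantize_weights_int8(w):
    """Symmetric per-output-channel INT8 quantisation of a 2D weight matrix.

    Returns (q, scale): q is int8 with w's shape, scale is float32 of
    shape (out_channels, 1).
    """
    w = np.asarray(w, dtype=np.float32)
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # All-zero rows get scale 1.0 to avoid dividing by zero.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from int8 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)
q, scale = quantize_weights_int8(w)
err = np.abs(dequantize(q, scale) - w).max()   # bounded by half a quantisation step

print(q.dtype, q.nbytes, w.nbytes)   # int8 8192 32768
```

The rounding error per weight is at most half a step of that row's scale, which is where the small accuracy cost comes from.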