
TildAlice

Posted on • Originally published at tildalice.io

BERT vs DistilBERT vs TinyBERT: Speed-Accuracy Trade-offs

The 300MB Model That Runs in 40ms

BERT-base sits at 110M parameters and 440MB on disk. DistilBERT cuts that to 66M parameters and 260MB. TinyBERT goes further — 14.5M parameters, 55MB, and inference that actually runs on a Raspberry Pi without thermal throttling.

But here's the catch: you're not just shrinking the model. You're fundamentally changing how it encodes language. DistilBERT uses knowledge distillation, halving the layer count from 12 to 6 while keeping the full 768-dimensional hidden states. TinyBERT applies distillation at every layer AND shrinks the hidden size to 312 dimensions across 4 layers. That architectural difference matters more than the parameter count suggests.
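You can see the size gap directly by loading all three and counting parameters. A minimal sketch, assuming publicly hosted checkpoints — the TinyBERT ID below is the 4-layer/312-dim general-distillation release, swap in whichever weights you actually use:

```python
# Sketch: compare parameter counts and hidden sizes of the three models.
from transformers import AutoModel

checkpoints = {
    "BERT-base":  "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    "TinyBERT":   "huawei-noah/TinyBERT_General_4L_312D",  # assumed checkpoint ID
}

for name, ckpt in checkpoints.items():
    model = AutoModel.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in model.parameters())
    # DistilBERT's config calls the hidden size "dim", so fall back to that
    hidden = getattr(model.config, "hidden_size", getattr(model.config, "dim", "?"))
    print(f"{name}: {n_params / 1e6:.1f}M params, hidden size {hidden}")
```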

I'm comparing all three on a real-world task: sentiment classification on 50k IMDB reviews. The goal is to see where the accuracy drop actually hurts, and where it's just noise.
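For reference, here is roughly what the evaluation loop looks like. This is a minimal sketch, not the full benchmark: the checkpoint path is a placeholder for whichever of the three fine-tuned models you're measuring, and it subsamples the test split for a quick run.

```python
# Sketch of the benchmark harness: classify IMDB reviews, track accuracy
# and per-example latency for one fine-tuned model.
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/finetuned-model"  # placeholder: your fine-tuned BERT / DistilBERT / TinyBERT
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

test = load_dataset("imdb", split="test").shuffle(seed=0).select(range(1000))
correct, latencies = 0, []

with torch.no_grad():
    for example in test:
        inputs = tokenizer(example["text"], truncation=True, max_length=256,
                           return_tensors="pt")
        start = time.perf_counter()
        logits = model(**inputs).logits
        latencies.append(time.perf_counter() - start)
        correct += int(logits.argmax(-1).item() == example["label"])

median_ms = sorted(latencies)[len(latencies) // 2] * 1000
print(f"accuracy: {correct / len(test):.3f}, median latency: {median_ms:.1f} ms")
```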


BERT-base: The Baseline You're Probably Overpaying For

BERT-base (Devlin et al., 2019) uses 12 transformer layers, 768 hidden dimensions, and 12 attention heads. The attention mechanism computes:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$
Continue reading the full article on TildAlice
