
TildAlice

Posted on • Originally published at tildalice.io

BERT vs DistilBERT vs TinyBERT: Speed-Accuracy Trade-offs

The 300MB Model That Runs in 40ms

BERT-base sits at 110M parameters and 440MB on disk. DistilBERT cuts that to 66M parameters and 260MB. TinyBERT goes further — 14.5M parameters, 55MB, and inference that actually runs on a Raspberry Pi without thermal throttling.

But here's the catch: you're not just shrinking the model. You're fundamentally changing how it encodes language. DistilBERT uses knowledge distillation, halving the layer count from 12 to 6 while keeping the full 768-dimensional hidden states. TinyBERT applies distillation at every layer AND shrinks the hidden size to 312 dimensions across 4 layers. That architectural difference matters more than the parameter count suggests.
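You can see the size gap directly by loading all three and counting parameters. A minimal sketch, assuming publicly hosted checkpoints — the TinyBERT ID below is the 4-layer/312-dim general-distillation release, swap in whichever weights you actually use:

```python
# Sketch: compare parameter counts and hidden sizes of the three models.
from transformers import AutoModel

checkpoints = {
    "BERT-base":  "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    "TinyBERT":   "huawei-noah/TinyBERT_General_4L_312D",  # assumed checkpoint ID
}

for name, ckpt in checkpoints.items():
    model = AutoModel.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in model.parameters())
    # DistilBERT's config calls the hidden size "dim", so fall back to that
    hidden = getattr(model.config, "hidden_size", getattr(model.config, "dim", "?"))
    print(f"{name}: {n_params / 1e6:.1f}M params, hidden size {hidden}")
```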

I'm comparing all three on a real-world task: sentiment classification on 50k IMDB reviews. The goal is to see where the accuracy drop actually hurts, and where it's just noise.
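For reference, here is roughly what the evaluation loop looks like. This is a minimal sketch, not the full benchmark: the checkpoint path is a placeholder for whichever of the three fine-tuned models you're measuring, and it subsamples the test split for a quick run.

```python
# Sketch of the benchmark harness: classify IMDB reviews, track accuracy
# and per-example latency for one fine-tuned model.
import time
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/finetuned-model"  # placeholder: your fine-tuned BERT / DistilBERT / TinyBERT
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

test = load_dataset("imdb", split="test").shuffle(seed=0).select(range(1000))
correct, latencies = 0, []

with torch.no_grad():
    for example in test:
        inputs = tokenizer(example["text"], truncation=True, max_length=256,
                           return_tensors="pt")
        start = time.perf_counter()
        logits = model(**inputs).logits
        latencies.append(time.perf_counter() - start)
        correct += int(logits.argmax(-1).item() == example["label"])

median_ms = sorted(latencies)[len(latencies) // 2] * 1000
print(f"accuracy: {correct / len(test):.3f}, median latency: {median_ms:.1f} ms")
```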


BERT-base: The Baseline You're Probably Overpaying For

BERT-base (Devlin et al., 2019) uses 12 transformer layers, 768 hidden dimensions, and 12 attention heads. The attention mechanism computes:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$
Continue reading the full article on TildAlice
