
TildAlice

Originally published at tildalice.io

BERT vs RoBERTa vs DistilBERT: GLUE Scores Decoded

The 88.5% That Started Everything

RoBERTa hit 88.5% on the GLUE benchmark in July 2019. BERT, released just 8 months earlier, scored 80.5% with BERT-base and about 84.5% with BERT-large. That 4-point jump from "large" to RoBERTa came from exactly zero architectural changes.

Wait, what?

Yes, RoBERTa (Liu et al., 2019) is literally BERT with better training. Same transformer encoder, same attention mechanism, same masked language modeling objective (the one tweak there is dynamic masking: the masked positions are re-sampled every time a sequence is seen, instead of being fixed once during preprocessing). The gains came from training longer, on more data, with bigger batches, and from dropping the next sentence prediction (NSP) task that BERT insisted was important.
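
If you want to check the "same architecture" claim yourself, here's a minimal sketch. It assumes the Hugging Face `transformers` package is installed and can download the published configs; the checkpoint names are the standard Hub ones, nothing specific to this post.

```python
# Compare the encoder hyperparameters of bert-base and roberta-base.
# Requires: pip install transformers (plus network access to fetch the configs).
from transformers import AutoConfig

bert = AutoConfig.from_pretrained("bert-base-uncased")
roberta = AutoConfig.from_pretrained("roberta-base")

# The transformer encoder itself is shaped identically in both models.
for field in ("num_hidden_layers", "hidden_size",
              "num_attention_heads", "intermediate_size"):
    print(f"{field:22s} bert={getattr(bert, field)}  roberta={getattr(roberta, field)}")

# The vocabulary is where they diverge: BERT ships a ~30K WordPiece vocab,
# RoBERTa a ~50K byte-level BPE vocab. That's tokenization, not architecture.
print(f"{'vocab_size':22s} bert={bert.vocab_size}  roberta={roberta.vocab_size}")
```

Both configs report 12 layers, 768 hidden units, and 12 attention heads; the only field that differs is the vocabulary size, and swapping a tokenizer is a data-pipeline decision, not an architectural one.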

This post walks through what actually changed between these three models, why the GLUE numbers moved the way they did, and which one I'd actually deploy today.

Photo: a cup of coffee and a city map in a cozy Copenhagen café (Berna Deniz, Pexels)

What Is GLUE and Why Should You Care?


Continue reading the full article on TildAlice
