
TildAlice

Originally published at tildalice.io

BERT vs RoBERTa vs DistilBERT: GLUE Scores Decoded

The 88.5% That Started Everything

RoBERTa hit 88.5% on the GLUE benchmark in July 2019. BERT, released just 8 months earlier, scored 80.5% with BERT-base and about 84.5% with BERT-large. That 4-point jump from "large" to RoBERTa came from exactly zero architectural changes.

Wait, what?

Yes, RoBERTa (Liu et al., 2019) is literally BERT with better training. Same transformer encoder, same attention mechanism, same masked language modeling objective (the one tweak there is dynamic masking: the masked positions are re-sampled every time a sequence is seen, instead of being fixed once during preprocessing). The gains came from training longer, on more data, with bigger batches, and from dropping the next sentence prediction (NSP) task that BERT insisted was important.
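
If you want to check the "same architecture" claim yourself, here's a minimal sketch. It assumes the Hugging Face `transformers` package is installed and can download the published configs; the checkpoint names are the standard Hub ones, nothing specific to this post.

```python
# Compare the encoder hyperparameters of bert-base and roberta-base.
# Requires: pip install transformers (plus network access to fetch the configs).
from transformers import AutoConfig

bert = AutoConfig.from_pretrained("bert-base-uncased")
roberta = AutoConfig.from_pretrained("roberta-base")

# The transformer encoder itself is shaped identically in both models.
for field in ("num_hidden_layers", "hidden_size",
              "num_attention_heads", "intermediate_size"):
    print(f"{field:22s} bert={getattr(bert, field)}  roberta={getattr(roberta, field)}")

# The vocabulary is where they diverge: BERT ships a ~30K WordPiece vocab,
# RoBERTa a ~50K byte-level BPE vocab. That's tokenization, not architecture.
print(f"{'vocab_size':22s} bert={bert.vocab_size}  roberta={roberta.vocab_size}")
```

Both configs report 12 layers, 768 hidden units, and 12 attention heads; the only field that differs is the vocabulary size, and swapping a tokenizer is a data-pipeline decision, not an architectural one.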

This post walks through what actually changed between these three models, why the GLUE numbers moved the way they did, and which one I'd actually deploy today.

Photo: a cup of coffee and a city map in a cozy Copenhagen café (Berna Deniz, Pexels)

What Is GLUE and Why Should You Care?


Continue reading the full article on TildAlice
