Shiva Charan

🧠 BERTScore

BERTScore is a text evaluation metric used to measure how similar two pieces of text are based on meaning, not just exact words.

It is mainly used to evaluate:

  • Machine translation
  • Text summarization
  • Text generation (LLMs)

Think of it as:

“Do these two sentences mean the same thing, even if they are worded differently?”


❌ Why traditional metrics were not enough

Older metrics like BLEU and ROUGE work by:

  • Counting exact word matches
  • Matching n-grams (word sequences)

Problem:

They fail when:

  • Different words express the same meaning
  • Synonyms are used
  • Sentence structure changes

Example failure:

Reference: The cat is sitting on the mat
Candidate: The feline is resting on the rug

BLEU or ROUGE → ❌ low score
Human → ✅ same meaning

This gap is why BERTScore exists.
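
You can see this gap in code. Below is a minimal sketch (assuming NLTK is installed; the exact number depends on the smoothing method used) showing BLEU scoring the example above poorly even though the meaning is the same:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The cat is sitting on the mat".lower().split()
candidate = "The feline is resting on the rug".lower().split()

# Smoothing avoids a hard zero when higher-order n-grams have no matches
smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {bleu:.3f}")  # low, even though a human reads the same meaning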


✅ Why BERTScore was created

BERTScore uses contextual embeddings from BERT to compare words by semantic similarity.

Key idea:

Words are compared by their meaning in context, not by exact string match.

So:

  • cat ≈ feline
  • sitting ≈ resting
  • mat ≈ rug

🔍 How BERTScore works

Step 1: Tokenize sentences

Both reference and candidate sentences are split into tokens.
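
For illustration, here is roughly what this step looks like with a Hugging Face tokenizer (assuming the transformers package; bert-base-uncased is just an example model, and BERTScore's default English model may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

ref_tokens = tokenizer.tokenize("The cat is sitting on the mat")
cand_tokens = tokenizer.tokenize("The feline is resting on the rug")

# WordPiece tokens; rare words may be split into sub-word pieces
print(ref_tokens)
print(cand_tokens)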

Step 2: Convert tokens to embeddings

Each token is converted into a vector using BERT.

Example (simplified):

cat → [0.21, 0.88, -0.13, ...]
feline → [0.20, 0.87, -0.12, ...]
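
A hedged sketch of this step with the transformers package (same example model as above; BERTScore actually reads the hidden states of a specific layer, which is omitted here for simplicity):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat is sitting on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token (including the [CLS] and [SEP] markers)
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)  # roughly (number of tokens, 768) for a base-size model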

Step 3: Compare tokens using cosine similarity

Each token in the candidate is matched to its most similar token in the reference (and vice versa), using cosine similarity between their embedding vectors.
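
In code, this matching step reduces to a pairwise cosine-similarity matrix plus a row/column maximum. A simplified sketch, with random tensors standing in for the token embeddings from Step 2:

import torch
import torch.nn.functional as F

# Stand-ins for the contextual token embeddings of each sentence
ref_emb = torch.randn(7, 768)   # reference tokens x hidden size
cand_emb = torch.randn(8, 768)  # candidate tokens x hidden size

# After L2 normalization, a dot product equals cosine similarity
ref_norm = F.normalize(ref_emb, dim=-1)
cand_norm = F.normalize(cand_emb, dim=-1)

sim = cand_norm @ ref_norm.T           # candidate tokens x reference tokens
best_per_cand = sim.max(dim=1).values  # best reference match for each candidate token
best_per_ref = sim.max(dim=0).values   # best candidate match for each reference token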

Step 4: Compute scores

BERTScore reports:

  • Precision: how well each candidate token matches something in the reference
  • Recall: how well each reference token is covered by the candidate
  • F1: the harmonic mean of both (the score most commonly reported)
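
Given the similarity matrix from Step 3, these three numbers are just averages of the best-match similarities (ignoring the optional IDF weighting described in the BERTScore paper). A self-contained toy example:

import torch

# A toy candidate-x-reference cosine-similarity matrix, as produced in Step 3
sim = torch.tensor([[0.95, 0.10, 0.05],
                    [0.20, 0.88, 0.15],
                    [0.05, 0.12, 0.91]])

precision = sim.max(dim=1).values.mean()  # average best match per candidate token
recall = sim.max(dim=0).values.mean()     # average best match per reference token
f1 = 2 * precision * recall / (precision + recall)
print(float(precision), float(recall), float(f1))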

🧪 Simple example

Reference sentence:

A man is playing guitar

Candidate sentence:

A person is playing an instrument

What happens internally:

Reference word | Closest candidate word | Similarity
man            | person                 | high
guitar         | instrument             | high
playing        | playing                | exact match

Result:

BERTScore (F1) ≈ 0.90+

Even though words differ, meaning is preserved.
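
To try this example yourself, the reference implementation is published as the bert-score package on PyPI. A minimal sketch (the exact score depends on the underlying model, so treat the ≈0.90 figure above as illustrative):

from bert_score import score

refs = ["A man is playing guitar"]
cands = ["A person is playing an instrument"]

# lang="en" selects a default English model; weights are downloaded on first use
P, R, F1 = score(cands, refs, lang="en")
print(f"Precision={P.item():.3f}  Recall={R.item():.3f}  F1={F1.item():.3f}")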


🆚 Comparison with older metrics

Metric    | Exact word match | Understands meaning | Handles synonyms | Sentence structure aware
BLEU      | ✅ Required      | ❌ No               | ❌ No            | ❌ No
ROUGE     | ✅ Required      | ❌ No               | ❌ No            | ❌ No
BERTScore | ❌ Not required  | ✅ Yes              | ✅ Yes           | ✅ Yes

🎯 Where BERTScore is used

  • Evaluating LLM outputs
  • Machine translation quality
  • Summarization accuracy
  • Comparing chatbot responses to human answers

Especially useful when:

  • Multiple correct answers exist
  • Wording can vary naturally

⚠️ Limitations of BERTScore

Be aware of these:

  • Computationally expensive
  • Depends on pretrained language models
  • High score does not always mean factually correct
  • Can reward fluent but wrong answers

So it is best used alongside human evaluation or factual checks.


One-Line Summary

BERTScore evaluates text similarity using contextual embeddings from BERT, allowing semantic comparison rather than exact word matching.

