BERTScore is a text evaluation metric used to measure how similar two pieces of text are based on meaning, not just exact words.
It is mainly used to evaluate:
- Machine translation
- Text summarization
- Text generation (LLMs)
Think of it as:
"Do these two sentences mean the same thing, even if they are worded differently?"
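A minimal usage sketch in Python, assuming the `bert-score` package is installed (`pip install bert-score`); the sentence pair here is just an illustration:

```python
# Minimal BERTScore usage sketch -- assumes `pip install bert-score`.
from bert_score import score

candidates = ["The weather is lovely today"]    # system / model output
references = ["It is a beautiful day outside"]  # human-written reference

# Returns one precision, recall and F1 entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```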
Why traditional metrics were not enough
Older metrics like BLEU and ROUGE work by:
- Counting exact word matches
- Matching n-grams (word sequences)
Problem:
They fail when:
- Different words express the same meaning
- Synonyms are used
- Sentence structure changes
Example failure:
Reference: The cat is sitting on the mat
Candidate: The feline is resting on the rug
BLEU or ROUGE → low score
Human judgment → same meaning
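To make the failure concrete, here is a tiny sketch (plain Python, no libraries) that counts exact n-gram matches for the pair above:

```python
# Count exact n-gram overlap between the reference and the candidate.
reference = "the cat is sitting on the mat".split()
candidate = "the feline is resting on the rug".split()

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

for n in (1, 2, 3, 4):
    overlap = ngrams(reference, n) & ngrams(candidate, n)
    print(f"{n}-gram matches: {sorted(overlap)}")

# Only "the", "is", "on" and the bigram "on the" match exactly,
# so n-gram based scores stay low even though the meaning is the same.
```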
This gap is why BERTScore exists.
Why BERTScore was created
BERTScore uses contextual embeddings from BERT to compare words by semantic similarity.
Key idea:
Words are compared by their meaning in context, not by exact string match.
So:
- cat ≈ feline
- sitting ≈ resting
- mat ≈ rug
How BERTScore works
Step 1: Tokenize sentences
Both reference and candidate sentences are split into tokens.
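For illustration, this is roughly what tokenization looks like with the Hugging Face `transformers` library (the model name here is just an assumption; the actual BERTScore implementation picks its own default model):

```python
# Step 1 sketch: WordPiece tokenization with a BERT tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The feline is resting on the rug"))
# Note: BERT uses sub-word tokens, so rare words may be split into pieces.
```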
Step 2: Convert tokens to embeddings
Each token is converted into a vector using BERT.
Example (simplified):
cat → [0.21, 0.88, -0.13, ...]
feline → [0.20, 0.87, -0.12, ...]
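A simplified sketch of this step using `transformers` and `torch` (the real implementation selects a specific hidden layer per model; using the last layer here is a simplification):

```python
# Step 2 sketch: one contextual vector per token.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(sentence):
    """Return the tokens and one contextual vector per token ([CLS]/[SEP] dropped)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]         # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens[1:-1], hidden[1:-1]                          # drop special tokens

tokens, vectors = token_embeddings("The cat is sitting on the mat")
print(len(tokens), vectors.shape)   # number of tokens and their vector size
```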
Step 3: Compare tokens using cosine similarity
Each token in one sentence is matched to its most similar token in the other sentence, in both directions (candidate to reference and reference to candidate).
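A sketch of the matching, assuming two token-embedding matrices like those produced in step 2 (one row per token):

```python
# Step 3 sketch: cosine similarity matrix + greedy best-match in both directions.
import torch
import torch.nn.functional as F

def greedy_match(ref_vecs: torch.Tensor, cand_vecs: torch.Tensor):
    ref = F.normalize(ref_vecs, dim=-1)    # unit-length rows -> dot product = cosine
    cand = F.normalize(cand_vecs, dim=-1)
    sim = ref @ cand.T                     # sim[i, j]: ref token i vs cand token j
    best_for_candidate = sim.max(dim=0).values   # used for precision
    best_for_reference = sim.max(dim=1).values   # used for recall
    return sim, best_for_candidate, best_for_reference
```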
Step 4: Compute scores
BERTScore reports:
- Precision: how well each candidate token is matched by something in the reference
- Recall: how well each reference token is covered by the candidate
- F1: the harmonic mean of the two (most commonly reported)
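Under those definitions, the computation reduces to averaging the best matches from step 3 (a sketch; the official implementation can additionally apply IDF weighting and baseline rescaling):

```python
# Step 4 sketch: precision, recall and F1 from the similarity matrix.
import torch

def bertscore_from_sim(sim: torch.Tensor):
    """sim[i, j] = cosine similarity of reference token i and candidate token j."""
    precision = sim.max(dim=0).values.mean()   # average best match per candidate token
    recall = sim.max(dim=1).values.mean()      # average best match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()
```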
Simple example
Reference sentence:
A man is playing guitar
Candidate sentence:
A person is playing an instrument
What happens internally:
| Reference word | Closest candidate word | Similarity |
|---|---|---|
| man | person | high |
| guitar | instrument | high |
| playing | playing | exact |
Result:
BERTScore (F1) ≈ 0.90+
Even though words differ, meaning is preserved.
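The table above can be reproduced approximately by combining the earlier steps; the exact similarity values depend on the model used, so treat this as a sketch:

```python
# Reconstruct the "closest candidate word" table for the example above.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens[1:-1], F.normalize(hidden[1:-1], dim=-1)    # drop [CLS]/[SEP]

ref_tokens, ref_vecs = embed("A man is playing guitar")
cand_tokens, cand_vecs = embed("A person is playing an instrument")

sim = ref_vecs @ cand_vecs.T                 # cosine similarities
for i, token in enumerate(ref_tokens):
    j = int(sim[i].argmax())                 # closest candidate token
    print(f"{token:>10} -> {cand_tokens[j]:<12} {sim[i, j].item():.2f}")
```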
Comparison with older metrics
| Metric | Looks at words | Understands meaning | Handles synonyms | Sentence structure aware |
|---|---|---|---|---|
| BLEU | Exact match | No | No | No |
| ROUGE | Exact match | No | No | No |
| BERTScore | Exact match not required | Yes | Yes | Yes |
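To see the difference in practice, one option is the Hugging Face `evaluate` package, which wraps all three metrics (extra dependencies such as `rouge_score` and `bert_score` are assumed to be installed):

```python
# Run BLEU, ROUGE and BERTScore on the earlier cat/feline example.
import evaluate

predictions = ["The feline is resting on the rug"]
references = ["The cat is sitting on the mat"]

bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=[references])  # list of reference lists
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=references)
bertscore = evaluate.load("bertscore").compute(predictions=predictions,
                                               references=references, lang="en")

print("BLEU:        ", bleu["bleu"])         # near zero: no exact 4-gram overlap
print("ROUGE:       ", rouge)                # low: few overlapping words
print("BERTScore F1:", bertscore["f1"][0])   # noticeably higher: meaning is close
```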
Where BERTScore is used
- Evaluating LLM outputs
- Machine translation quality
- Summarization accuracy
- Comparing chatbot responses to human answers
Especially useful when:
- Multiple correct answers exist
- Wording can vary naturally
Limitations of BERTScore
Be aware of these:
- Computationally expensive
- Depends on pretrained language models
- High score does not always mean factually correct
- Can reward fluent but wrong answers
So it is best used alongside human evaluation or factual checks.
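As an illustration of the "fluent but wrong" point, a negated candidate often still scores high because most of its tokens align well with the reference. A sketch with the `bert-score` package (the sentences are hypothetical and the exact numbers depend on the model):

```python
# High BERTScore does not imply factual correctness.
from bert_score import score

reference = ["The medication is safe for children"]
paraphrase = ["Children can safely take the medication"]     # same meaning
contradiction = ["The medication is not safe for children"]  # opposite meaning

_, _, f1_para = score(paraphrase, reference, lang="en")
_, _, f1_contra = score(contradiction, reference, lang="en")

print("Paraphrase F1:   ", round(f1_para[0].item(), 3))
print("Contradiction F1:", round(f1_contra[0].item(), 3))
# If the two scores come out close, the metric alone cannot tell you which
# candidate is factually correct -- hence the advice above to pair it with
# human evaluation or factual checks.
```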
One-Line Summary
BERTScore evaluates text similarity using contextual embeddings from BERT, allowing semantic comparison rather than exact word matching.