ROUGE - Recall-Oriented Understudy for Gisting Evaluation
Compares the overlap of words or phrases between the generated and reference texts.
• Focuses on recall — did the model capture the key ideas?
• Best for summarization and measuring information (content) coverage.
Reference Summary:
“The quick brown fox jumps over the lazy dog.”
Generated Summary:
“The brown fox leaps over a lazy dog.”
ROUGE-1 = unigram overlap
Compares single-word overlaps.
Matching Words:
• the, brown, fox, over, lazy, dog → 6 matches
Not Matching:
• quick, jumps (in reference)
• leaps, a (in generated)
ROUGE-1 Score (recall) =
Overlapping unigrams / Total unigrams in reference = 6/9 ≈ 0.667
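A minimal Python sketch of this unigram-recall count. The tokenize helper and the function name are just for illustration, not from any particular library:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def rouge_1_recall(reference, candidate):
    """ROUGE-1 recall: clipped overlapping unigrams / total unigrams in the reference."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_1_recall(reference, candidate), 3))  # 0.667  (6 matched words / 9 reference words)
```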
ROUGE-2 = bigram overlap
Compares 2-word sequences that appear in order.
Matching Bigrams:
• “brown fox”
• “lazy dog” → 2 matches
Not Matching:
• “the quick”, “quick brown”, “fox jumps”, “jumps over”, “over the”, “the lazy” (reference bigrams with no match in the generated text)
ROUGE-2 Score =
Overlapping bigrams / Total bigrams in reference = 2/8 = 0.25
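The same idea generalized to n-grams, in a small sketch with illustrative helper names, reproduces the 2-of-8 bigram count:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """ROUGE-N recall: clipped overlapping n-grams / total n-grams in the reference."""
    ref = ngram_counts(tokenize(reference), n)
    cand = ngram_counts(tokenize(candidate), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_n_recall(reference, candidate, 2), 3))  # 0.25  (2 of 8 reference bigrams)
```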
ROUGE-S = Skip-Bigram Overlap
Compares skip-bigrams — word pairs that occur in the same order but not necessarily adjacent.
Examples of Matching Skip-Bigrams:
• (“the”, “fox”)
• (“brown”, “dog”)
• (“fox”, “over”)
• (“the”, “dog”)
These words appear in order in both sentences, though not next to each other.
Skip-bigrams in reference: 36 possible (9 words → C(9,2) = 36 in-order pairs)
Matched skip-bigrams: 15 when any gap is allowed (limiting the skip distance, as in ROUGE-S4, lowers this count)
ROUGE-S Score =
15/36 ≈ 0.42
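A sketch of the skip-bigram count with no limit on the gap between the two words; itertools.combinations yields exactly the in-order pairs, and the helper names are again illustrative:

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def skip_bigrams(tokens):
    """All in-order word pairs, with any number of words allowed between them."""
    return Counter(combinations(tokens, 2))

def rouge_s_recall(reference, candidate):
    """ROUGE-S recall: clipped matching skip-bigrams / total skip-bigrams in the reference."""
    ref = skip_bigrams(tokenize(reference))
    cand = skip_bigrams(tokenize(candidate))
    overlap = sum(min(count, cand[pair]) for pair, count in ref.items())
    return overlap / sum(ref.values())

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_s_recall(reference, candidate), 3))  # 0.417  (15 of 36 reference skip-bigrams)
```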
ROUGE-L = longest common subsequence
Finds the longest sequence of words that appear in order (not necessarily adjacent) in both texts.
LCS:
• “the brown fox over lazy dog” (“quick” and “jumps” are missing, and the generated text has only one “the”)
Length of LCS = 6 words
ROUGE-L Score =
6/9 ≈ 0.667
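A sketch of ROUGE-L recall using the standard dynamic-programming LCS computation; function names are illustrative:

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length / number of tokens in the reference."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return lcs_length(ref, cand) / len(ref)

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_l_recall(reference, candidate), 3))  # 0.667  (LCS length 6 / 9 reference words)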
ROUGE Variants - Summary
• ROUGE-1: single-word (unigram) overlap, 6/9 ≈ 0.67 on this example
• ROUGE-2: two-word sequence (bigram) overlap, 2/8 = 0.25
• ROUGE-S: in-order word pairs with gaps (skip-bigrams), 15/36 ≈ 0.42
• ROUGE-L: longest common subsequence, 6/9 ≈ 0.67
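If you would rather not count n-grams by hand, one option is Google's rouge-score package (pip install rouge-score). It covers ROUGE-1, ROUGE-2, and ROUGE-L (not ROUGE-S); a minimal sketch follows, and the exact numbers can vary slightly with tokenization and stemming settings:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(
    "The quick brown fox jumps over the lazy dog.",  # reference summary
    "The brown fox leaps over a lazy dog.",          # generated summary
)
for name, result in scores.items():
    # Each entry carries precision, recall, and F-measure.
    print(name, f"recall={result.recall:.3f}", f"f1={result.fmeasure:.3f}")
```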
BLEU - Bilingual Evaluation Understudy
• Looks at n-gram precision — how much of the generated text matches the reference exactly.
• Originally designed for machine translation.
• Focuses on precision — are the predicted words correct?
• Includes a brevity penalty to discourage overly short translations.
Example:
Reference sentence:
“The quick brown fox jumps over the lazy dog”
Model output:
“The fox”
Both words match the reference (unigram precision = 2/2), but the output is only 2 words against a 9-word reference.
The brevity penalty therefore sharply reduces the BLEU score.
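A stripped-down sketch of how the brevity penalty interacts with precision. Real BLEU also multiplies in 2-, 3-, and 4-gram precisions (usually with smoothing); the bleu_1 function here is illustrative, not the reference implementation:

```python
import math
from collections import Counter

def bleu_1(reference, candidate):
    """Clipped unigram precision times the brevity penalty (a stripped-down BLEU-1)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Clipped unigram precision: how many candidate words also appear in the reference.
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty: 1 if the candidate is at least as long as the reference,
    # otherwise exp(1 - reference_length / candidate_length).
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * precision

reference = "The quick brown fox jumps over the lazy dog"
candidate = "The fox"
print(round(bleu_1(reference, candidate), 3))  # precision = 2/2 = 1.0, but BP ≈ 0.03, so BLEU-1 ≈ 0.03
```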
BERT Score
BERTScore measures how similar two pieces of text are in meaning using BERT, a powerful language model.
Unlike ROUGE and BLEU, which compare words exactly,
BERTScore compares the meanings of words, even when the exact words are different.
BERTScore checks whether the words in the generated sentence mean the same thing as the words in the reference sentence.
It does this using word embeddings (numeric vectors that capture word meaning).
Example:
Reference:
“The dog barked loudly.”
Generated:
“The canine made noise.”
• BLEU/ROUGE = low (few exact matches)
• BERTScore = high (words mean similar things: dog = canine, barked = made noise)
How Does It Work?
• Turns each word in both sentences into vectors using BERT (context-aware).
• For each word in one sentence, it finds the most similar word in the other.
• Calculates precision, recall, and F1 score based on semantic similarity.
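A short sketch using the bert-score package (pip install bert-score). The exact score depends on which pretrained model the library downloads and loads, so treat the printed value as illustrative:

```python
# pip install bert-score  (the first call downloads a pretrained model)
from bert_score import score

candidates = ["The canine made noise."]
references = ["The dog barked loudly."]

# Returns precision, recall, and F1 tensors with one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")  # relatively high, despite little exact word overlap
```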