ROUGE - Recall-Oriented Understudy for Gisting Evaluation
Compares the overlap of words or phrases between the generated and reference texts.
• Focuses on recall — did the model capture the key ideas?
• Best for summarization and measuring information (content) coverage.
Reference Summary:
“The quick brown fox jumps over the lazy dog.”
Generated Summary:
“The brown fox leaps over a lazy dog.”
ROUGE-1 = unigram overlap
Compares single-word overlaps.
Matching Words:
• the, brown, fox, over, lazy, dog → 6 matches
Not Matching:
• quick, jumps (in reference)
• leaps, a (in generated)
ROUGE-1 Score (recall) =
Overlapping unigrams / Total unigrams in reference = 6/9 ≈ 0.667
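A minimal Python sketch of this unigram-recall count. The tokenize helper and the function name are just for illustration, not from any particular library:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def rouge_1_recall(reference, candidate):
    """ROUGE-1 recall: clipped overlapping unigrams / total unigrams in the reference."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_1_recall(reference, candidate), 3))  # 0.667  (6 matched words / 9 reference words)
```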
ROUGE-2 = bigram overlap
Compares 2-word sequences that appear in order.
Matching Bigrams:
• “brown fox”
• “lazy dog” → 2 matches
Not Matching:
• “the quick”, “quick brown”, “fox jumps”, “jumps over”, “over the”, “the lazy” (reference bigrams with no match in the generated text)
ROUGE-2 Score =
Overlapping bigrams / Total bigrams in reference = 2/8 = 0.25
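The same idea generalized to n-grams, in a small sketch with illustrative helper names, reproduces the 2-of-8 bigram count:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """ROUGE-N recall: clipped overlapping n-grams / total n-grams in the reference."""
    ref = ngram_counts(tokenize(reference), n)
    cand = ngram_counts(tokenize(candidate), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_n_recall(reference, candidate, 2), 3))  # 0.25  (2 of 8 reference bigrams)
```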
ROUGE-S = Skip-Bigram Overlap
Compares skip-bigrams — word pairs that occur in the same order but not necessarily adjacent.
Examples of Matching Skip-Bigrams:
• (“the”, “fox”)
• (“brown”, “dog”)
• (“fox”, “over”)
• (“the”, “dog”)
These words appear in order in both sentences, though not next to each other.
Skip-bigrams in reference: 36 possible (9 words → C(9,2) = 36 in-order pairs)
Matched skip-bigrams: 15 when any gap is allowed (limiting the skip distance, as in ROUGE-S4, lowers this count)
ROUGE-S Score =
15/36 ≈ 0.42
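A sketch of the skip-bigram count with no limit on the gap between the two words; itertools.combinations yields exactly the in-order pairs, and the helper names are again illustrative:

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def skip_bigrams(tokens):
    """All in-order word pairs, with any number of words allowed between them."""
    return Counter(combinations(tokens, 2))

def rouge_s_recall(reference, candidate):
    """ROUGE-S recall: clipped matching skip-bigrams / total skip-bigrams in the reference."""
    ref = skip_bigrams(tokenize(reference))
    cand = skip_bigrams(tokenize(candidate))
    overlap = sum(min(count, cand[pair]) for pair, count in ref.items())
    return overlap / sum(ref.values())

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_s_recall(reference, candidate), 3))  # 0.417  (15 of 36 reference skip-bigrams)
```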
ROUGE-L = longest common subsequence
Finds the longest sequence of words that appear in order (not necessarily adjacent) in both texts.
LCS:
• “the brown fox over lazy dog” (“quick” and “jumps” are missing, and the generated text has only one “the”)
Length of LCS = 6 words
ROUGE-L Score =
6/9 ≈ 0.667
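A sketch of ROUGE-L recall using the standard dynamic-programming LCS computation; function names are illustrative:

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length / number of tokens in the reference."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return lcs_length(ref, cand) / len(ref)

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The brown fox leaps over a lazy dog."
print(round(rouge_l_recall(reference, candidate), 3))  # 0.667  (LCS length 6 / 9 reference words)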
ROUGE Variants - Summary
• ROUGE-1: single-word (unigram) overlap, 6/9 ≈ 0.67 on this example
• ROUGE-2: two-word sequence (bigram) overlap, 2/8 = 0.25
• ROUGE-S: in-order word pairs with gaps (skip-bigrams), 15/36 ≈ 0.42
• ROUGE-L: longest common subsequence, 6/9 ≈ 0.67
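If you would rather not count n-grams by hand, one option is Google's rouge-score package (pip install rouge-score). It covers ROUGE-1, ROUGE-2, and ROUGE-L (not ROUGE-S); a minimal sketch follows, and the exact numbers can vary slightly with tokenization and stemming settings:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(
    "The quick brown fox jumps over the lazy dog.",  # reference summary
    "The brown fox leaps over a lazy dog.",          # generated summary
)
for name, result in scores.items():
    # Each entry carries precision, recall, and F-measure.
    print(name, f"recall={result.recall:.3f}", f"f1={result.fmeasure:.3f}")
```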
BLEU - Bilingual Evaluation Understudy
• Looks at n-gram precision — how much of the generated text matches the reference exactly.
• Originally designed for machine translation.
• Focuses on precision — are the predicted words correct?
• Includes a brevity penalty to discourage overly short translations.
Example:
Reference sentence:
“The quick brown fox jumps over the lazy dog”
Model output:
“The fox”
Both words match the reference (unigram precision = 2/2), but the output is only 2 words against a 9-word reference.
The brevity penalty therefore sharply reduces the BLEU score.
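A stripped-down sketch of how the brevity penalty interacts with precision. Real BLEU also multiplies in 2-, 3-, and 4-gram precisions (usually with smoothing); the bleu_1 function here is illustrative, not the reference implementation:

```python
import math
from collections import Counter

def bleu_1(reference, candidate):
    """Clipped unigram precision times the brevity penalty (a stripped-down BLEU-1)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Clipped unigram precision: how many candidate words also appear in the reference.
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty: 1 if the candidate is at least as long as the reference,
    # otherwise exp(1 - reference_length / candidate_length).
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * precision

reference = "The quick brown fox jumps over the lazy dog"
candidate = "The fox"
print(round(bleu_1(reference, candidate), 3))  # precision = 2/2 = 1.0, but BP ≈ 0.03, so BLEU-1 ≈ 0.03
```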
BERT Score
BERTScore measures how similar two pieces of text are in meaning using BERT, a powerful language model.
Unlike ROUGE and BLEU, which compare words exactly,
BERTScore compares the meanings of words, even when the exact words are different.
BERTScore checks whether the words in the generated sentence mean the same thing as the words in the reference sentence.
It does this using word embeddings (numeric vectors that capture word meaning).
Example:
Reference:
“The dog barked loudly.”
Generated:
“The canine made noise.”
• BLEU/ROUGE = low (few exact matches)
• BERTScore = high (words mean similar things: dog = canine, barked = made noise)
How Does It Work?
• Turns each word in both sentences into vectors using BERT (context-aware).
• For each word in one sentence, it finds the most similar word in the other.
• Calculates precision, recall, and F1 score based on semantic similarity.
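A short sketch using the bert-score package (pip install bert-score). The exact score depends on which pretrained model the library downloads and loads, so treat the printed value as illustrative:

```python
# pip install bert-score  (the first call downloads a pretrained model)
from bert_score import score

candidates = ["The canine made noise."]
references = ["The dog barked loudly."]

# Returns precision, recall, and F1 tensors with one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")  # relatively high, despite little exact word overlap
```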