Ananya S

The Science of Prompt Evaluation: From BLEU & ROUGE to Real Human Feedback

Prompt engineering feels magical—change a few words and the model behaves differently.
But how do you measure whether one prompt is actually better than another?
Just reading the outputs is not enough.
For real AI applications, you need evaluation metrics.

1. BLEU Score

BLEU (Bilingual Evaluation Understudy) is precision-oriented: it checks how many n-grams from the generated output also appear in the reference text.
An n-gram is a sequence of n consecutive words (or tokens) in a sentence.
Unigram: “the”, “cat”, “sat”
Bigram: “the cat”, “cat sat”
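
To make n-grams concrete, here is a tiny sketch in plain Python (no libraries) that extracts them with a sliding window:

# Minimal sketch: extract n-grams from a tokenized sentence.
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams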

Great for:

  • summarization
  • translation
  • structured responses

Not great for:

  • creative writing
  • open-ended generation
  • varied response styles

Let's say the reference is:

The cat sits on the mat.

And the LLM output is:

A cat is sitting on the mat.

BLEU gives this output partial credit: unigrams like “cat”, “on”, “the”, and “mat” match, and so do the bigrams “on the” and “the mat”, but longer n-grams such as “cat sits on” do not, so the score is decent rather than perfect.
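
If you want to check this yourself, here is a minimal sketch using NLTK's sentence_bleu (my choice of library; any BLEU implementation works). Smoothing is applied because short sentences often have zero 4-gram matches, which would otherwise push BLEU to zero:

# Minimal sketch: sentence-level BLEU with NLTK (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The cat sits on the mat".lower().split()
candidate = "A cat is sitting on the mat".lower().split()

# Smoothing keeps the score from collapsing to zero when longer n-grams don't match.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")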

2. ROUGE Score

ROUGE is more recall-focused than BLEU. ROUGE measures how much of the reference (ground-truth) text appears in the generated output.

ROUGE Recall = (overlapping units) / (total units in reference)
This is why ROUGE is used heavily in summarization, where the model must capture key points from the original text.
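
To make the formula concrete, here is a tiny sketch that computes ROUGE-1 recall and precision by hand using clipped unigram counts (no stemming, so different inflections such as "sits" and "sitting" won't match):

# Minimal sketch: ROUGE-1 recall = overlapping unigrams / unigrams in the reference.
from collections import Counter

reference = "The cat sits on the mat".lower().split()
generated = "The mat has a cat sitting on it".lower().split()

overlap = Counter(reference) & Counter(generated)  # clipped per-word counts
recall = sum(overlap.values()) / len(reference)
precision = sum(overlap.values()) / len(generated)
print(f"ROUGE-1 recall: {recall:.2f}, precision: {precision:.2f}")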

ROUGE-1: unigram overlap (Measures overlap of single words.)
ROUGE-2: bigram overlap (Measures overlap of word pairs.)
ROUGE-L: longest common subsequence (Measures longest in-order word sequence shared by both texts.)

Best for:

  • summarization
  • Q&A where key facts must be present
  • comparing prompt improvements

Example:

from rouge_score import rouge_scorer

# Score ROUGE-1, ROUGE-2, and ROUGE-L; stemming normalizes inflections like "sitting" -> "sit".
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "The cat sits on the mat"
generated = "The mat has a cat sitting on it"

# score(target, prediction): the reference goes first, the model output second.
scores = scorer.score(reference, generated)
print(scores)

use_stemmer usually increases the ROUGE score because it reduces words to their stems before comparing them, so “sits” and “sitting” both become “sit” and count as matches (irregular forms such as “sat” are not normalized). In other words, the generated text gets credit for expressing the same word in a different inflection; see the quick comparison sketch after the output below.

Output:

{
 'rouge1': Score(precision=0.625, recall=0.7142857142857143, fmeasure=0.6666666666666666),
 'rouge2': Score(precision=0.2727272727272727, recall=0.3333333333333333, fmeasure=0.3),
 'rougeL': Score(precision=0.5, recall=0.5555555555555556, fmeasure=0.5263157894736842)
}


Your exact numbers may differ slightly depending on the library version; fmeasure is the F1 score.
These evaluation techniques come from Natural Language Processing (NLP) and are commonly used to assess the quality of text generation models such as chatbots, summarizers, and translation systems.
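
To see the stemmer's effect directly, here is a quick sketch that scores the same strings once without and once with use_stemmer:

# Quick sketch: ROUGE-1 with and without stemming (same strings as above).
from rouge_score import rouge_scorer

reference = "The cat sits on the mat"
generated = "The mat has a cat sitting on it"

for use_stemmer in (False, True):
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=use_stemmer)
    r1 = scorer.score(reference, generated)['rouge1']
    print(f"use_stemmer={use_stemmer}: recall={r1.recall:.2f}, precision={r1.precision:.2f}")
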
To run the example code, make sure to install the rouge_score library first:

pip install rouge_score


ROUGE-L is especially useful because its longest-common-subsequence matching rewards outputs that preserve the reference's word order, which serves as a rough proxy for sentence-structure similarity.
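
If you're curious what that looks like under the hood, here is a minimal sketch of the longest-common-subsequence computation behind ROUGE-L (plain Python, no stemming):

# Minimal sketch: ROUGE-L is built on the longest common subsequence (LCS) of word sequences.
def lcs_length(a, b):
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "The cat sits on the mat".lower().split()
generated = "The mat has a cat sitting on it".lower().split()

lcs = lcs_length(reference, generated)
print(f"LCS length: {lcs}")
print(f"ROUGE-L recall: {lcs / len(reference):.2f}")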

For prompt performance, you care about:

  • correctness
  • coherence
  • reasoning
  • readability
  • hallucinations

→ BLEU/ROUGE cannot measure these.

That’s where Human Evaluation comes in.

3. Human Evaluation

Human eval means real people rate outputs based on criteria like:

Metric       | What It Measures
-------------|-----------------------------------
Accuracy     | Is the answer correct?
Relevance    | Is it on topic?
Clarity      | Easy to read and understand?
Completeness | Did it answer the full question?
Safety       | No harmful or biased content?

Simple Human Evaluation Procedure

  1. Pick a prompt (Prompt A)
  2. Generate output
  3. Modify prompt → (Prompt B)
  4. Generate output
  5. Ask 3–5 evaluators to rate each output on a 1–5 scale.
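
Here is a minimal, hypothetical sketch of step 5: each evaluator gives both prompts a 1–5 rating and we compare the averages (the numbers are placeholders, not real data):

# Hypothetical sketch: average 1-5 human ratings for two prompts (placeholder numbers).
from statistics import mean

ratings = {
    "Prompt A": [3, 4, 3, 4, 3],  # one rating per evaluator
    "Prompt B": [5, 4, 5, 4, 4],
}

for prompt, scores in ratings.items():
    print(f"{prompt}: mean human score = {mean(scores):.1f}")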

Comparing Two Prompts

Prompt A

Summarize this article in one sentence.


Prompt B

Provide a short, factual one-sentence summary of the article below.

Let's say, after generating summaries for 50 samples, we get:

Metric      | Prompt A | Prompt B
------------|----------|---------
BLEU        | 0.42     | 0.57
ROUGE-L     | 0.61     | 0.72
Human Score | 3.4      | 4.5

Prompt B is clearly superior on all three metrics.
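
In practice you would compute the automatic metrics over the whole sample set, not a single example. Here is a hypothetical sketch that averages ROUGE-L F1 across a batch of outputs per prompt (the lists are placeholders standing in for your 50 samples):

# Hypothetical sketch: average ROUGE-L F1 over a batch of outputs per prompt.
from statistics import mean
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

references = ["The cat sits on the mat"]      # ground-truth summaries (placeholders)
outputs_a = ["A cat sat somewhere on a mat"]  # Prompt A outputs (placeholders)
outputs_b = ["A cat is sitting on the mat"]   # Prompt B outputs (placeholders)

def avg_rouge_l(outputs, refs):
    return mean(scorer.score(ref, out)['rougeL'].fmeasure
                for ref, out in zip(refs, outputs))

print("Prompt A ROUGE-L:", round(avg_rouge_l(outputs_a, references), 3))
print("Prompt B ROUGE-L:", round(avg_rouge_l(outputs_b, references), 3))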

Conclusion

Evaluating prompt quality shouldn't be based on guesswork. Use metrics like:

BLEU → precision of n-gram match
ROUGE → recall and similarity
Human Evaluation → truth, clarity, completeness

Together, they give you a reliable, real-world view of how good your prompts actually are.

If you found this useful, hit the ❤️ or 🔥 so others can discover it too.
Which evaluation metric do you prefer—BLEU, ROUGE, or human evaluation?
Drop your answer in the comments. I reply to every comment!
