Ananya S

The Science of Prompt Evaluation: From BLEU & ROUGE to Real Human Feedback

Prompt engineering feels magical—change a few words and the model behaves differently.
But how do you measure whether one prompt is actually better than another?
Just reading the outputs is not enough.
For real AI applications, you need evaluation metrics.

1. BLEU Score

BLEU (Bilingual Evaluation Understudy) is precision-oriented: it checks how many n-grams from the generated output also appear in the reference text.
An n-gram is a sequence of n consecutive words (or tokens) in a sentence.
Unigram: “the”, “cat”, “sat”
Bigram: “the cat”, “cat sat”
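
To make n-grams concrete, here is a tiny sketch in plain Python (no libraries) that extracts them with a sliding window:

# Minimal sketch: extract n-grams from a tokenized sentence.
def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams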

Great for:

  • summarization
  • translation
  • structured responses

Not great for:

  • creative writing
  • open-ended generation
  • varied response styles

Let's say the reference is:

The cat sits on the mat.

And the LLM output is:

A cat is sitting on the mat.

BLEU gives this output partial credit: unigrams like “cat”, “on”, “the”, and “mat” match, and so do the bigrams “on the” and “the mat”, but longer n-grams such as “cat sits on” do not, so the score is decent rather than perfect.
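
If you want to check this yourself, here is a minimal sketch using NLTK's sentence_bleu (my choice of library; any BLEU implementation works). Smoothing is applied because short sentences often have zero 4-gram matches, which would otherwise push BLEU to zero:

# Minimal sketch: sentence-level BLEU with NLTK (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The cat sits on the mat".lower().split()
candidate = "A cat is sitting on the mat".lower().split()

# Smoothing keeps the score from collapsing to zero when longer n-grams don't match.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")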

2. ROUGE Score

ROUGE is more recall-focused than BLEU. ROUGE measures how much of the reference (ground-truth) text appears in the generated output.

ROUGE Recall = (overlapping units) / (total units in reference)
This is why ROUGE is used heavily in summarization, where the model must capture key points from the original text.
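
To make the formula concrete, here is a tiny sketch that computes ROUGE-1 recall and precision by hand using clipped unigram counts (no stemming, so different inflections such as "sits" and "sitting" won't match):

# Minimal sketch: ROUGE-1 recall = overlapping unigrams / unigrams in the reference.
from collections import Counter

reference = "The cat sits on the mat".lower().split()
generated = "The mat has a cat sitting on it".lower().split()

overlap = Counter(reference) & Counter(generated)  # clipped per-word counts
recall = sum(overlap.values()) / len(reference)
precision = sum(overlap.values()) / len(generated)
print(f"ROUGE-1 recall: {recall:.2f}, precision: {precision:.2f}")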

ROUGE-1: unigram overlap (Measures overlap of single words.)
ROUGE-2: bigram overlap (Measures overlap of word pairs.)
ROUGE-L: longest common subsequence (Measures longest in-order word sequence shared by both texts.)

Best for:

  • summarization
  • Q&A where key facts must be present
  • comparing prompt improvements

Example:

from rouge_score import rouge_scorer

# Score ROUGE-1, ROUGE-2, and ROUGE-L; stemming normalizes inflections like "sitting" -> "sit".
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "The cat sits on the mat"
generated = "The mat has a cat sitting on it"

# score(target, prediction): the reference goes first, the model output second.
scores = scorer.score(reference, generated)
print(scores)

use_stemmer usually increases the ROUGE score because it reduces words to their stems before comparing them, so “sits” and “sitting” both become “sit” and count as matches (irregular forms such as “sat” are not normalized). In other words, the generated text gets credit for expressing the same word in a different inflection; see the quick comparison sketch after the output below.

Output:

{
 'rouge1': Score(precision=0.625, recall=0.7142857142857143, fmeasure=0.6666666666666666),
 'rouge2': Score(precision=0.2727272727272727, recall=0.3333333333333333, fmeasure=0.3),
 'rougeL': Score(precision=0.5, recall=0.5555555555555556, fmeasure=0.5263157894736842)
}


Your exact numbers may differ slightly depending on the library version; fmeasure is the F1 score.
These evaluation techniques come from Natural Language Processing (NLP) and are commonly used to assess the quality of text generation models such as chatbots, summarizers, and translation systems.
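
To see the stemmer's effect directly, here is a quick sketch that scores the same strings once without and once with use_stemmer:

# Quick sketch: ROUGE-1 with and without stemming (same strings as above).
from rouge_score import rouge_scorer

reference = "The cat sits on the mat"
generated = "The mat has a cat sitting on it"

for use_stemmer in (False, True):
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=use_stemmer)
    r1 = scorer.score(reference, generated)['rouge1']
    print(f"use_stemmer={use_stemmer}: recall={r1.recall:.2f}, precision={r1.precision:.2f}")
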
To run the example code, make sure to install the rouge_score library first:

pip install rouge_score


ROUGE-L is especially useful because its longest-common-subsequence matching rewards outputs that preserve the reference's word order, which serves as a rough proxy for sentence-structure similarity.
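
If you're curious what that looks like under the hood, here is a minimal sketch of the longest-common-subsequence computation behind ROUGE-L (plain Python, no stemming):

# Minimal sketch: ROUGE-L is built on the longest common subsequence (LCS) of word sequences.
def lcs_length(a, b):
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "The cat sits on the mat".lower().split()
generated = "The mat has a cat sitting on it".lower().split()

lcs = lcs_length(reference, generated)
print(f"LCS length: {lcs}")
print(f"ROUGE-L recall: {lcs / len(reference):.2f}")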

For prompt performance, you care about:

  • correctness
  • coherence
  • reasoning
  • readability
  • hallucinations

→ BLEU/ROUGE cannot measure these.

That’s where Human Evaluation comes in.

3. Human Evaluation

Human eval means real people rate outputs based on criteria like:

Metric       | What It Measures
-------------|-----------------------------------
Accuracy     | Is the answer correct?
Relevance    | Is it on topic?
Clarity      | Easy to read and understand?
Completeness | Did it answer the full question?
Safety       | No harmful or biased content?

Simple Human Evaluation Procedure

  1. Pick a prompt (Prompt A)
  2. Generate output
  3. Modify prompt → (Prompt B)
  4. Generate output
  5. Ask 3–5 evaluators to rate each output on a 1–5 scale.
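
Here is a minimal, hypothetical sketch of step 5: each evaluator gives both prompts a 1–5 rating and we compare the averages (the numbers are placeholders, not real data):

# Hypothetical sketch: average 1-5 human ratings for two prompts (placeholder numbers).
from statistics import mean

ratings = {
    "Prompt A": [3, 4, 3, 4, 3],  # one rating per evaluator
    "Prompt B": [5, 4, 5, 4, 4],
}

for prompt, scores in ratings.items():
    print(f"{prompt}: mean human score = {mean(scores):.1f}")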

Comparing Two Prompts

Prompt A

Summarize this article in one sentence.


Prompt B

Provide a short, factual one-sentence summary of the article below.

Let's say, after generating summaries for 50 samples, we get:

Metric      | Prompt A | Prompt B
------------|----------|---------
BLEU        | 0.42     | 0.57
ROUGE-L     | 0.61     | 0.72
Human Score | 3.4      | 4.5

Prompt B is clearly superior on all three metrics.
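
In practice you would compute the automatic metrics over the whole sample set, not a single example. Here is a hypothetical sketch that averages ROUGE-L F1 across a batch of outputs per prompt (the lists are placeholders standing in for your 50 samples):

# Hypothetical sketch: average ROUGE-L F1 over a batch of outputs per prompt.
from statistics import mean
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

references = ["The cat sits on the mat"]      # ground-truth summaries (placeholders)
outputs_a = ["A cat sat somewhere on a mat"]  # Prompt A outputs (placeholders)
outputs_b = ["A cat is sitting on the mat"]   # Prompt B outputs (placeholders)

def avg_rouge_l(outputs, refs):
    return mean(scorer.score(ref, out)['rougeL'].fmeasure
                for ref, out in zip(refs, outputs))

print("Prompt A ROUGE-L:", round(avg_rouge_l(outputs_a, references), 3))
print("Prompt B ROUGE-L:", round(avg_rouge_l(outputs_b, references), 3))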

Conclusion

Evaluating prompt quality shouldn't be based on guesswork. Use metrics like:

BLEU → precision of n-gram match
ROUGE → recall and similarity
Human Evaluation → truth, clarity, completeness

Together, they give you a reliable, real-world view of how good your prompts actually are.

If you found this useful, hit the ❤️ or 🔥 so others can discover it too.
Which evaluation metric do you prefer—BLEU, ROUGE, or human evaluation?
Drop your answer in the comments. I reply to every comment!
