Prompt engineering feels magical—change a few words and the model behaves differently.
But how do you measure whether one prompt is actually better than another?
Just reading the outputs is not enough.
For real AI applications, you need evaluation metrics.
1. BLEU Score
BLEU (Bilingual Evaluation Understudy) is precision-oriented: it checks how many n-grams from the generated output also appear in the reference, with a brevity penalty for outputs that are too short.
An n-gram is a sequence of n consecutive words (or tokens) in a sentence.
Unigram: “the”, “cat”, “sat”
Bigram: “the cat”, “cat sat”
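To make that concrete, here's a tiny sketch in Python (the `ngrams` helper is just illustrative, not from any library):

```python
def ngrams(text, n):
    # Slide a window of n tokens across the sentence.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the cat sat", 1))  # ['the', 'cat', 'sat']
print(ngrams("the cat sat", 2))  # ['the cat', 'cat sat']
```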
Great for:
- summarization
- translation
- structured responses
Not great for:
- creative writing
- open-ended generation
- varied response styles
Let's say the reference is:
The cat sits on the mat.
And the LLM output is:
A cat is sitting on the mat.
BLEU rewards this output because many of its words and short n-grams ("cat", "on the mat") also appear in the reference, even though the exact phrasing differs.
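Here's a minimal sketch of scoring this pair with NLTK's sentence_bleu (NLTK is an assumed library choice here). Smoothing and unigram/bigram weights keep the score meaningful on a sentence this short, where longer n-grams rarely match:

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The cat sits on the mat".lower().split()
candidate = "A cat is sitting on the mat".lower().split()

# Use only unigram and bigram precision (weights 0.5/0.5); smoothing avoids
# a zero score when higher-order n-grams don't match.
score = sentence_bleu(
    [reference],  # BLEU accepts multiple references, so pass a list
    candidate,
    weights=(0.5, 0.5),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.2f}")
```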
2. ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is more recall-focused than BLEU: it measures how much of the reference (ground-truth) text appears in the generated output.
ROUGE Recall = (overlapping units) / (total units in reference)
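For example, if 5 of the 7 words in the reference also show up in the generated text, ROUGE-1 recall is 5/7 ≈ 0.71.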
This is why ROUGE is used heavily in summarization, where the model must capture key points from the original text.
ROUGE-1: unigram overlap (Measures overlap of single words.)
ROUGE-2: bigram overlap (Measures overlap of word pairs.)
ROUGE-L: longest common subsequence (Measures longest in-order word sequence shared by both texts.)
Best for:
- summarization
- Q&A where key facts must be present
- comparing prompt improvements
Example:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = "The cat sits on the mat"
generated = "The mat has a cat sitting on it"
scores = scorer.score(reference, generated)
print(scores)
use_stemmer=True reduces words to their stems before matching, so "sits" and "sitting" both count as "sit". Because the generated sentence conveys the same action as the reference, stemming lets ROUGE credit that overlap, and the scores come out higher than they would with exact word matching.
Output:
{
'rouge1': Score(precision=0.625, recall=0.7142857142857143, fmeasure=0.6666666666666666),
'rouge2': Score(precision=0.2727272727272727, recall=0.3333333333333333, fmeasure=0.3),
'rougeL': Score(precision=0.5, recall=0.5555555555555556, fmeasure=0.5263157894736842)
}
Your exact numbers might differ slightly depending on the library version. fmeasure is the F1 score, the harmonic mean of precision and recall.
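As a quick sanity check on the ROUGE-1 row above: F1 = 2 × precision × recall / (precision + recall) = 2 × 0.625 × 0.714 / (0.625 + 0.714) ≈ 0.667.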
These evaluation techniques come from Natural Language Processing (NLP) and are commonly used to assess the quality of text generation models such as chatbots, summarizers, and translation systems.
To run the example code, make sure to install the rouge_score library first:
pip install rouge_score
ROUGE-L is especially useful because the longest common subsequence rewards outputs that keep the reference's words in the same order, making it a rough check of sentence-structure similarity.
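To see what that means in practice, here's a toy sketch of the longest-common-subsequence idea behind ROUGE-L (plain Python, no stemming, so the numbers won't match the rouge_score output above exactly):

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref = "the cat sits on the mat".split()
gen = "the mat has a cat sitting on it".split()
lcs = lcs_length(ref, gen)  # 3: "the ... cat ... on"
print(f"ROUGE-L recall = {lcs/len(ref):.2f}, precision = {lcs/len(gen):.2f}")
```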
For prompt performance, you care about:
- correctness
- coherence
- reasoning
- readability
- hallucinations
→ BLEU/ROUGE cannot measure these.
That’s where Human Evaluation comes in.
3. Human Evaluation
Human eval means real people rate outputs based on criteria like:
| Metric | What It Measures |
|---|---|
| Accuracy | Is the answer correct? |
| Relevance | Is it on topic? |
| Clarity | Easy to read and understand? |
| Completeness | Did it answer the full question? |
| Safety | No harmful or biased content? |
Simple Human Evaluation Procedure
- Pick a prompt (Prompt A)
- Generate output
- Modify prompt → (Prompt B)
- Generate output
- Ask 3–5 evaluators to rate each output on a 1–5 scale for every criterion.
- Average the ratings and compare the prompts (a minimal aggregation sketch follows below).
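Here's what that aggregation step might look like, assuming the ratings end up in a plain dictionary (all of the numbers below are hypothetical):

```python
from statistics import mean

# Hypothetical 1-5 ratings from three evaluators, per criterion and prompt.
ratings = {
    "Accuracy":     {"Prompt A": [4, 3, 4], "Prompt B": [5, 4, 5]},
    "Clarity":      {"Prompt A": [3, 3, 4], "Prompt B": [5, 5, 4]},
    "Completeness": {"Prompt A": [3, 4, 3], "Prompt B": [4, 5, 5]},
}

for criterion, by_prompt in ratings.items():
    averages = {prompt: round(mean(scores), 2) for prompt, scores in by_prompt.items()}
    print(criterion, averages)
```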
Comparing Two Prompts
Prompt A
Summarize this article in one sentence.
Prompt B
Provide a short, factual one-sentence summary of the article below.
Let's say that, after generating summaries for 50 samples, we get:
| Metric | Prompt A | Prompt B |
|---|---|---|
| BLEU | 0.42 | 0.57 |
| ROUGE-L | 0.61 | 0.72 |
| Human Score | 3.4 | 4.5 |
→ Prompt B is clearly superior.
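If you want to produce that kind of table for the automatic metrics, a small aggregation loop is enough. Here's a sketch, assuming `references`, `outputs_a`, and `outputs_b` are hypothetical lists of 50 strings each (one per sample):

```python
from statistics import mean
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def avg_rouge_l(references, generations):
    # Average ROUGE-L F1 over all (reference, generated) pairs.
    return mean(
        scorer.score(ref, gen)["rougeL"].fmeasure
        for ref, gen in zip(references, generations)
    )

# Hypothetical usage:
# print("Prompt A ROUGE-L:", avg_rouge_l(references, outputs_a))
# print("Prompt B ROUGE-L:", avg_rouge_l(references, outputs_b))
```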
Conclusion
Evaluating prompt quality shouldn't be based on guesswork. Use metrics like:
- BLEU → precision of n-gram match
- ROUGE → recall and similarity
- Human Evaluation → truth, clarity, completeness
Together, they give you a reliable, real-world view of how good your prompts actually are.
If you found this useful, hit the ❤️ or 🔥 so others can discover it too.
Which evaluation metric do you prefer—BLEU, ROUGE, or human evaluation?
Drop your answer in the comments. I reply to every comment!