🤖 Exam Guide: AI Practitioner
Domain 3: Applications of Foundation Models
📘Task Statement 3.4
🎯 Objectives
This task is about evaluating whether an FM (or an FM-powered application such as a RAG system or an agent) is good enough, not just in a demo but against repeatable criteria, using human review, benchmarks, and objective metrics tied to business goals.
1) Approaches To Evaluate FM Performance
1.1 Human Evaluation
People review model outputs and score them against criteria such as:
1 correctness/helpfulness
2 clarity and tone
3 completeness
4 safety/policy compliance
Best for: subjective qualities and real user experience.
Tradeoff: slower and more expensive, but often the most reliable indicator.
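To make this concrete, here is a minimal sketch of aggregating rubric scores from several reviewers. The criteria names and the 1–5 scale are illustrative assumptions, not part of any AWS tooling.

```python
# Aggregate human rubric scores for a set of model outputs.
# Criteria names and the 1-5 scale are illustrative assumptions.
from statistics import mean

CRITERIA = ["correctness", "clarity", "completeness", "safety"]

# Each reviewer scores each output on every criterion (1 = poor, 5 = excellent).
ratings = [
    {"output_id": "a1", "reviewer": "r1", "correctness": 5, "clarity": 4, "completeness": 4, "safety": 5},
    {"output_id": "a1", "reviewer": "r2", "correctness": 4, "clarity": 4, "completeness": 3, "safety": 5},
    {"output_id": "a2", "reviewer": "r1", "correctness": 2, "clarity": 5, "completeness": 3, "safety": 5},
]

# Average each criterion across all reviewers and outputs.
summary = {c: round(mean(r[c] for r in ratings), 2) for c in CRITERIA}
print(summary)  # {'correctness': 3.67, 'clarity': 4.33, 'completeness': 3.33, 'safety': 5.0}
```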
1.2 Benchmark Datasets
Use a fixed dataset (prompts + expected outputs) and evaluate consistently over time.
Best for: regression testing, comparing models/prompts, measuring improvement.
Tradeoff: benchmarks may not reflect your exact domain and can be “gamed” if over-optimized.
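As a sketch of a benchmark-based regression test, the snippet below runs a tiny fixed prompt/expected-answer set against a model through the Amazon Bedrock Converse API. The model ID, the dataset contents, and the exact-match check are illustrative assumptions; real benchmarks use larger datasets and richer scoring.

```python
# Run a fixed benchmark (prompt + expected answer) against a Bedrock model
# and report a simple exact-match rate. Model ID and dataset are illustrative.
import boto3

benchmark = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]

client = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumption: any Converse-capable model ID

hits = 0
for case in benchmark:
    resp = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": case["prompt"]}]}],
    )
    answer = resp["output"]["message"]["content"][0]["text"]
    # Naive exact-match check; real benchmarks often use ROUGE/BLEU or graded scoring.
    hits += int(case["expected"].lower() in answer.lower())

print(f"exact-match rate: {hits / len(benchmark):.2f}")
```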
1.3 Amazon Bedrock Model Evaluation
A managed way to evaluate and compare model outputs.
Useful for standardized evaluation workflows across prompts/models.
2) Metrics To Assess FM Performance
These metrics compare generated text to a reference (ground truth). They’re most useful when you have expected answers (e.g., translation pairs, reference summaries).
2.1 ROUGE
Recall-Oriented Understudy for Gisting Evaluation
ROUGE is commonly used for summarization; it measures overlap (often n-gram overlap) between the generated text and a reference text.
Recall-oriented: focuses on how much of the reference content was captured.
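A minimal example with the open-source rouge-score package (not an AWS service); the reference and generated sentences are made up.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat near the window."
generated = "A cat was sitting on the mat by the window."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

# Each entry has precision, recall, and F-measure; ROUGE emphasizes recall.
print(scores["rouge1"].recall, scores["rougeL"].fmeasure)
```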
2.2 BLEU (Bilingual Evaluation Understudy)
BLEU is commonly used for machine translation; it measures n-gram precision (how closely the generated text matches the reference translation).
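A minimal example with NLTK's sentence-level BLEU; the sentences are made up, and smoothing is applied because short texts often have no higher-order n-gram matches.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

# BLEU compares candidate n-grams against one or more reference translations (precision-oriented).
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```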
2.3 BERTScore
BERTScore uses embeddings from transformer models (e.g., BERT-like) to measure semantic similarity, not just exact word overlap.
Useful when wording differs but meaning is similar.
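A minimal example with the open-source bert-score package (it downloads a transformer model on first run); the sentences are made up and chosen so the wording differs while the meaning stays close.

```python
# pip install bert-score
from bert_score import score

candidates = ["The physician recommended rest and fluids."]
references = ["The doctor advised the patient to rest and drink water."]

# Embedding-based similarity: wording differs, but the meaning is close,
# so BERTScore rewards this more than exact n-gram overlap would.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```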
No single metric proves “quality.” Metrics can be paired with human evaluation and task-specific checks.
3) Determine Whether The FM Meets Business Objectives
FM quality must translate into outcomes that matter.
Examples of business objective alignment:
3.1 Productivity
Time saved per task, faster drafting, fewer manual steps, reduced handling time.
3.2 User Engagement
Retention, session length, repeat usage, satisfaction ratings.
3.3 Task Success / Task Engineering
Whether users can reliably complete the intended task (e.g., extract fields correctly, create a correct ticket, answer questions with citations).
A model can score well on ROUGE/BLEU but still fail business goals if it’s too slow, too expensive, unsafe, or doesn’t improve user outcomes.
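To illustrate, here is a small sketch that computes operational business metrics (cost per completed task, p95 latency) from request logs alongside task completion. The log fields and token prices are illustrative assumptions, not real Bedrock pricing.

```python
# Compute cost per completed task and p95 latency from (illustrative) request logs.
PRICE_PER_1K_INPUT = 0.003   # assumed USD per 1K input tokens (not real pricing)
PRICE_PER_1K_OUTPUT = 0.015  # assumed USD per 1K output tokens (not real pricing)

logs = [
    {"latency_ms": 850,  "input_tokens": 900,  "output_tokens": 250, "task_completed": True},
    {"latency_ms": 1900, "input_tokens": 1200, "output_tokens": 600, "task_completed": False},
    {"latency_ms": 700,  "input_tokens": 800,  "output_tokens": 200, "task_completed": True},
]

total_cost = sum(
    r["input_tokens"] / 1000 * PRICE_PER_1K_INPUT
    + r["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT
    for r in logs
)
completed = sum(r["task_completed"] for r in logs)

# Simple index-based p95 latency over the logged requests.
latencies = sorted(r["latency_ms"] for r in logs)
p95_latency = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

print(f"cost per completed task: ${total_cost / completed:.4f}")
print(f"p95 latency: {p95_latency} ms")
```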
4) Evaluating FM-based Applications
(RAG, agents, workflows)
It’s not enough to evaluate the base model; you must evaluate the whole system.
4.1 RAG Evaluation
Evaluate:
1 Retrieval Quality: Are the right documents/chunks being retrieved?
2 Grounding: Does the answer use the retrieved context and avoid making things up?
3 Answer Quality: Correctness, completeness, citations, and refusal behavior when context is missing.
4.2 Common Application Metrics:
1 grounded answer rate / citation accuracy
2 retrieval recall/precision (did we fetch relevant chunks?)
3 hallucination rate (answers not supported by sources)
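Here is a minimal sketch of how the retrieval and grounding metrics above could be computed from a labeled test set. The example data, chunk IDs, and the substring-based grounding check are illustrative assumptions; production systems often use an LLM judge or entailment model for grounding instead.

```python
# Compute simple RAG metrics from labeled examples. All data below is illustrative.
examples = [
    {
        "retrieved": ["doc1#c2", "doc3#c1"],   # chunk IDs returned by the retriever
        "relevant": ["doc1#c2"],               # chunk IDs a human marked as relevant
        "answer": "Refunds are issued within 14 days.",
        "supporting_text": "Refunds are issued within 14 days of the return.",
    },
]

def retrieval_precision_recall(retrieved, relevant):
    hit = len(set(retrieved) & set(relevant))
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

grounded = 0
for ex in examples:
    p, r = retrieval_precision_recall(ex["retrieved"], ex["relevant"])
    # Crude grounding check: is the answer contained in the retrieved source text?
    grounded += int(ex["answer"].rstrip(".").lower() in ex["supporting_text"].lower())
    print(f"retrieval precision={p:.2f} recall={r:.2f}")

print(f"grounded answer rate: {grounded / len(examples):.2f}")
```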
4.3 Agent / Workflow Evaluation
Evaluate:
1 Task completion rate: Did the agent finish the multi-step objective?
2 Tool correctness: Did it call the right tool with correct parameters?
3 Safety/compliance: Did it attempt disallowed actions or expose sensitive data?
4 Efficiency: Number of steps/tool calls, latency, cost per completed task.
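A minimal sketch of scoring agent test scenarios for completion, tool correctness, and efficiency; the scenario structure and expected tool calls are illustrative assumptions.

```python
# Score agent runs against expected outcomes. Scenario fields are illustrative.
scenarios = [
    {
        "goal": "create a support ticket for a billing issue",
        "expected_tool_calls": [("create_ticket", {"category": "billing"})],
        "actual_tool_calls": [("create_ticket", {"category": "billing"})],
        "completed": True,
        "steps": 3,
    },
    {
        "goal": "cancel an order",
        "expected_tool_calls": [("cancel_order", {"order_id": "123"})],
        "actual_tool_calls": [("lookup_order", {"order_id": "123"})],  # wrong tool
        "completed": False,
        "steps": 6,
    },
]

completion_rate = sum(s["completed"] for s in scenarios) / len(scenarios)
tool_correct = sum(
    s["actual_tool_calls"] == s["expected_tool_calls"] for s in scenarios
) / len(scenarios)
avg_steps = sum(s["steps"] for s in scenarios) / len(scenarios)

print(f"task completion rate: {completion_rate:.2f}")
print(f"tool correctness:     {tool_correct:.2f}")
print(f"avg steps per task:   {avg_steps:.1f}")
```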
💡 Quick Questions
1. What’s one advantage of human evaluation over automated metrics?
2. Which metric is commonly associated with summarization: ROUGE or BLEU?
3. What is BERTScore trying to measure that ROUGE/BLEU may miss?
4. Name one way to evaluate a RAG application beyond “is the answer good?”
5. Give one example of a business metric you might use to evaluate an FM-powered assistant.
Additional Resources
- Evaluate, compare, and select the best foundation models for your use case in Amazon Bedrock (preview)
- Amazon Bedrock Evaluations
- Review metrics for an automated model evaluation job in Amazon Bedrock (console)
- Evaluate the text summarization capabilities of LLMs for enhanced decision-making on AWS
- Accuracy
- Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available
- Evaluate and improve performance of Amazon Bedrock Knowledge Bases
✅ Answers to Quick Questions
1. Humans can judge qualities that automated metrics struggle with, such as helpfulness, tone, clarity, policy compliance, and real-world correctness in context.
2. ROUGE is commonly used for summarization.
(BLEU is more commonly used for translation.)
3. Semantic similarity/meaning, even when the wording is different (not just exact n-gram overlap).
4. Evaluate retrieval quality (e.g., whether the system retrieves the most relevant chunks/documents) and grounding (whether answers are supported by retrieved sources).
5. Productivity/time saved (e.g., reduced average handling time), user engagement/retention, or conversion rate (depending on the use case).