🤖 Exam Guide: AI Practitioner
Domain 3: Applications of Foundation Models
📘Task Statement 3.4
🎯 Objectives
This task is about evaluating whether an FM (or an FM-powered application such as a RAG system or an agent) is good enough, not just in a demo but against repeatable criteria, using human review, benchmarks, and objective metrics tied to business goals.
1) Approaches To Evaluate FM Performance
1.1 Human Evaluation
People review model outputs and score them against criteria such as:
1 correctness/helpfulness
2 clarity and tone
3 completeness
4 safety/policy compliance
Best for: subjective qualities and real user experience.
Tradeoff: slower and more expensive, but often the most reliable indicator.
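To make this concrete, here is a minimal sketch of aggregating rubric scores from several reviewers. The criteria names and the 1–5 scale are illustrative assumptions, not part of any AWS tooling.

```python
# Aggregate human rubric scores for a set of model outputs.
# Criteria names and the 1-5 scale are illustrative assumptions.
from statistics import mean

CRITERIA = ["correctness", "clarity", "completeness", "safety"]

# Each reviewer scores each output on every criterion (1 = poor, 5 = excellent).
ratings = [
    {"output_id": "a1", "reviewer": "r1", "correctness": 5, "clarity": 4, "completeness": 4, "safety": 5},
    {"output_id": "a1", "reviewer": "r2", "correctness": 4, "clarity": 4, "completeness": 3, "safety": 5},
    {"output_id": "a2", "reviewer": "r1", "correctness": 2, "clarity": 5, "completeness": 3, "safety": 5},
]

# Average each criterion across all reviewers and outputs.
summary = {c: round(mean(r[c] for r in ratings), 2) for c in CRITERIA}
print(summary)  # {'correctness': 3.67, 'clarity': 4.33, 'completeness': 3.33, 'safety': 5.0}
```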
1.2 Benchmark Datasets
Use a fixed dataset (prompts + expected outputs) and evaluate consistently over time.
Best for: regression testing, comparing models/prompts, measuring improvement.
Tradeoff: benchmarks may not reflect your exact domain and can be “gamed” if over-optimized.
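As a sketch of a benchmark-based regression test, the snippet below runs a tiny fixed prompt/expected-answer set against a model through the Amazon Bedrock Converse API. The model ID, the dataset contents, and the exact-match check are illustrative assumptions; real benchmarks use larger datasets and richer scoring.

```python
# Run a fixed benchmark (prompt + expected answer) against a Bedrock model
# and report a simple exact-match rate. Model ID and dataset are illustrative.
import boto3

benchmark = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]

client = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumption: any Converse-capable model ID

hits = 0
for case in benchmark:
    resp = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": case["prompt"]}]}],
    )
    answer = resp["output"]["message"]["content"][0]["text"]
    # Naive exact-match check; real benchmarks often use ROUGE/BLEU or graded scoring.
    hits += int(case["expected"].lower() in answer.lower())

print(f"exact-match rate: {hits / len(benchmark):.2f}")
```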
1.3 Amazon Bedrock Model Evaluation
A managed way to evaluate and compare model outputs.
Useful for standardized evaluation workflows across prompts/models.
2) Metrics To Assess FM Performance
These metrics compare generated text to a reference (ground truth). They’re most useful when you have expected answers (e.g., translation pairs, reference summaries).
2.1 ROUGE
Recall-Oriented Understudy for Gisting Evaluation
ROUGE is commonly used for summarization; it measures overlap (often n-gram overlap) between the generated text and a reference text.
Recall-oriented: focuses on how much of the reference content was captured.
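A minimal example with the open-source rouge-score package (not an AWS service); the reference and generated sentences are made up.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat near the window."
generated = "A cat was sitting on the mat by the window."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

# Each entry has precision, recall, and F-measure; ROUGE emphasizes recall.
print(scores["rouge1"].recall, scores["rougeL"].fmeasure)
```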
2.2 BLEU (Bilingual Evaluation Understudy)
BLEU is commonly used for machine translation; it measures n-gram precision (how closely the generated text matches the reference translation).
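A minimal example with NLTK's sentence-level BLEU; the sentences are made up, and smoothing is applied because short texts often have no higher-order n-gram matches.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

# BLEU compares candidate n-grams against one or more reference translations (precision-oriented).
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```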
2.3 BERTScore
BERTScore uses embeddings from transformer models (e.g., BERT-like) to measure semantic similarity, not just exact word overlap.
Useful when wording differs but meaning is similar.
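A minimal example with the open-source bert-score package (it downloads a transformer model on first run); the sentences are made up and chosen so the wording differs while the meaning stays close.

```python
# pip install bert-score
from bert_score import score

candidates = ["The physician recommended rest and fluids."]
references = ["The doctor advised the patient to rest and drink water."]

# Embedding-based similarity: wording differs, but the meaning is close,
# so BERTScore rewards this more than exact n-gram overlap would.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```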
No single metric proves “quality.” Metrics can be paired with human evaluation and task-specific checks.
3) Determine Whether The FM Meets Business Objectives
FM quality must translate into outcomes that matter.
Examples of business objective alignment:
3.1 Productivity
Time saved per task, faster drafting, fewer manual steps, reduced handling time.
3.2 User Engagement
Retention, session length, repeat usage, satisfaction ratings.
3.3 Task Success / Task Engineering
Whether users can reliably complete the intended task (e.g., extract fields correctly, create a correct ticket, answer questions with citations).
A model can score well on ROUGE/BLEU but still fail business goals if it’s too slow, too expensive, unsafe, or doesn’t improve user outcomes.
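To illustrate, here is a small sketch that computes operational business metrics (cost per completed task, p95 latency) from request logs alongside task completion. The log fields and token prices are illustrative assumptions, not real Bedrock pricing.

```python
# Compute cost per completed task and p95 latency from (illustrative) request logs.
PRICE_PER_1K_INPUT = 0.003   # assumed USD per 1K input tokens (not real pricing)
PRICE_PER_1K_OUTPUT = 0.015  # assumed USD per 1K output tokens (not real pricing)

logs = [
    {"latency_ms": 850,  "input_tokens": 900,  "output_tokens": 250, "task_completed": True},
    {"latency_ms": 1900, "input_tokens": 1200, "output_tokens": 600, "task_completed": False},
    {"latency_ms": 700,  "input_tokens": 800,  "output_tokens": 200, "task_completed": True},
]

total_cost = sum(
    r["input_tokens"] / 1000 * PRICE_PER_1K_INPUT
    + r["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT
    for r in logs
)
completed = sum(r["task_completed"] for r in logs)

# Simple index-based p95 latency over the logged requests.
latencies = sorted(r["latency_ms"] for r in logs)
p95_latency = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

print(f"cost per completed task: ${total_cost / completed:.4f}")
print(f"p95 latency: {p95_latency} ms")
```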
4) Evaluating FM-based Applications
(RAG, agents, workflows)
It’s not enough to evaluate the base model; you must evaluate the whole system.
4.1 RAG Evaluation
Evaluate:
1 Retrieval Quality: Are the right documents/chunks being retrieved?
2 Grounding: Does the answer use the retrieved context and avoid making things up?
3 Answer Quality: Correctness, completeness, citations, and refusal behavior when context is missing.
4.2 Common Application Metrics:
1 grounded answer rate / citation accuracy
2 retrieval recall/precision (did we fetch relevant chunks?)
3 hallucination rate (answers not supported by sources)
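Here is a minimal sketch of how the retrieval and grounding metrics above could be computed from a labeled test set. The example data, chunk IDs, and the substring-based grounding check are illustrative assumptions; production systems often use an LLM judge or entailment model for grounding instead.

```python
# Compute simple RAG metrics from labeled examples. All data below is illustrative.
examples = [
    {
        "retrieved": ["doc1#c2", "doc3#c1"],   # chunk IDs returned by the retriever
        "relevant": ["doc1#c2"],               # chunk IDs a human marked as relevant
        "answer": "Refunds are issued within 14 days.",
        "supporting_text": "Refunds are issued within 14 days of the return.",
    },
]

def retrieval_precision_recall(retrieved, relevant):
    hit = len(set(retrieved) & set(relevant))
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall

grounded = 0
for ex in examples:
    p, r = retrieval_precision_recall(ex["retrieved"], ex["relevant"])
    # Crude grounding check: is the answer contained in the retrieved source text?
    grounded += int(ex["answer"].rstrip(".").lower() in ex["supporting_text"].lower())
    print(f"retrieval precision={p:.2f} recall={r:.2f}")

print(f"grounded answer rate: {grounded / len(examples):.2f}")
```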
4.3 Agent / Workflow Evaluation
Evaluate:
1 Task completion rate: Did the agent finish the multi-step objective?
2 Tool correctness: Did it call the right tool with correct parameters?
3 Safety/compliance: Did it attempt disallowed actions or expose sensitive data?
4 Efficiency: Number of steps/tool calls, latency, cost per completed task.
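A minimal sketch of scoring agent test scenarios for completion, tool correctness, and efficiency; the scenario structure and expected tool calls are illustrative assumptions.

```python
# Score agent runs against expected outcomes. Scenario fields are illustrative.
scenarios = [
    {
        "goal": "create a support ticket for a billing issue",
        "expected_tool_calls": [("create_ticket", {"category": "billing"})],
        "actual_tool_calls": [("create_ticket", {"category": "billing"})],
        "completed": True,
        "steps": 3,
    },
    {
        "goal": "cancel an order",
        "expected_tool_calls": [("cancel_order", {"order_id": "123"})],
        "actual_tool_calls": [("lookup_order", {"order_id": "123"})],  # wrong tool
        "completed": False,
        "steps": 6,
    },
]

completion_rate = sum(s["completed"] for s in scenarios) / len(scenarios)
tool_correct = sum(
    s["actual_tool_calls"] == s["expected_tool_calls"] for s in scenarios
) / len(scenarios)
avg_steps = sum(s["steps"] for s in scenarios) / len(scenarios)

print(f"task completion rate: {completion_rate:.2f}")
print(f"tool correctness:     {tool_correct:.2f}")
print(f"avg steps per task:   {avg_steps:.1f}")
```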
💡 Quick Questions
1. What’s one advantage of human evaluation over automated metrics?
2. Which metric is commonly associated with summarization: ROUGE or BLEU?
3. What is BERTScore trying to measure that ROUGE/BLEU may miss?
4. Name one way to evaluate a RAG application beyond “is the answer good?”
5. Give one example of a business metric you might use to evaluate an FM-powered assistant.
Additional Resources
- Evaluate, compare, and select the best foundation models for your use case in Amazon Bedrock (preview)
- Amazon Bedrock Evaluations
- Review metrics for an automated model evaluation job in Amazon Bedrock (console)
- Evaluate the text summarization capabilities of LLMs for enhanced decision-making on AWS
- Accuracy
- Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available
- Evaluate and improve performance of Amazon Bedrock Knowledge Bases
✅ Answers to Quick Questions
1. Humans can judge qualities that automated metrics struggle with, such as helpfulness, tone, clarity, policy compliance, and real-world correctness in context.
2. ROUGE is commonly used for summarization.
(BLEU is more commonly used for translation.)
3. Semantic similarity/meaning, even when the wording is different (not just exact n-gram overlap).
4. Evaluate retrieval quality (e.g., whether the system retrieves the most relevant chunks/documents) and grounding (whether answers are supported by retrieved sources).
5. Productivity/time saved (e.g., reduced average handling time), user engagement/retention, or conversion rate (depending on the use case).