A practical evaluation pipeline for GraphRAG agents with quality metrics, safety scans, and observable runs.
## Introduction
In Part 4, we orchestrated multiple agents. This article (Part 5) answers a harder question: can we prove that the system is reliable enough for production workloads?
For AI Engineers, answer quality alone is not enough. You also need:
- Repeatable quality checks before release.
- Safety evidence for security and compliance reviews.
- Traceability when behavior changes after prompt, model, or tool updates.
This part adds an evaluation module under `src/evaluation` with three goals:
- Quality: task completion, intent resolution, tool-call behavior, graph-grounded correctness.
- Safety: adversarial probing with red team strategies and risk categories.
- Observability: telemetry and artifacts that support debugging and regression analysis.
## How the Three Goals Are Measured

| Goal | Primary signals | Current evidence in this article |
|---|---|---|
| Quality | `task_adherence`, `intent_resolution`, `relevance`, `coherence`, `response_completeness` | Foundry quality snapshot (March 2026): 80% task adherence, 100% on other quality signals |
| Safety | Red team attack outcomes by risk category and strategy | Red team run section and Foundry safety screenshots |
| Observability | Prompt/completion token usage, OTel traces, local artifacts | Quality snapshot token counters (85,686 prompt / 5,048 completion) plus OTel and JSON report references |
Quality and safety runs can also be exported to Azure AI Foundry, so teams review outcomes in shared dashboards instead of only local JSON artifacts.
## Why This Matters for AI Engineers
When you ship agent systems, every change can alter behavior: prompts, model versions, tool schemas, and data.
| Engineering scenario | Typical failure mode | Why this pipeline helps |
|---|---|---|
| Prompt or model update | Fluent but lower-quality answers | Batch baselines expose quality regressions before release |
| Tool contract changes | Wrong tool or wrong arguments | Tool-call evaluators detect routing and schema drift |
| Knowledge graph refresh | Unsupported entities/relationships in answers | Custom graph evaluators detect grounding errors |
| Safety hardening | Unknown risk exposure under adversarial inputs | Red team runs provide repeatable safety evidence |
| Incident debugging | Hard to explain why behavior changed | OTel traces and result artifacts reduce investigation time |
## What You Build

| Layer | Component | Purpose |
|---|---|---|
| Dataset | `golden_questions.jsonl` | Controlled test set with expected outcomes |
| Generation | `generate_eval_data.py` | Runs the agent and writes evaluation data |
| Quality Eval | `run_batch_evaluation.py` | Runs built-in and custom evaluators |
| Safety Eval | `run_redteam.py` | Runs red team scans against the agent/model |
| Reporting | `evaluation_report.md`, JSON outputs, Foundry run links | Human- and machine-readable results |
## Module Layout

```text
src/evaluation/
├── config.py
├── datasets/
│   ├── golden_questions.jsonl
│   └── eval_data.jsonl
├── evaluators/
│   ├── builtin.py
│   ├── entity_accuracy.py
│   └── relationship_validity.py
├── monitoring/
│   └── otel_setup.py
├── results/
└── scripts/
    ├── generate_eval_data.py
    ├── run_batch_evaluation.py
    └── run_redteam.py
```
## What Is Evaluated, and Why

### Built-in quality evaluators

| Evaluator | What it checks | Why it matters |
|---|---|---|
| `TaskAdherenceEvaluator` | Does the response complete the requested task? | Detects incomplete or off-target answers |
| `IntentResolutionEvaluator` | Does the response resolve user intent? | Detects responses that are fluent but irrelevant |
| `RelevanceEvaluator` | Is the response relevant to the query? | Detects answers that drift away from the user request |
| `CoherenceEvaluator` | Is the response logically consistent? | Detects contradictions and weak reasoning flow |
| `ResponseCompletenessEvaluator` | Does the response cover expected content? | Detects partial answers against expected coverage |
### Built-in tool-behavior evaluator (conditional)

| Evaluator | What it checks | Why it matters |
|---|---|---|
| `ToolCallAccuracyEvaluator` | Were tools/arguments appropriate? | Detects wrong routing, wrong parameters, unnecessary calls |

`ToolCallAccuracyEvaluator` is included only when structured `tool_call` payloads exist in `eval_data.jsonl`.
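A minimal sketch of this conditional check, assuming records carry a `tool_calls` list when structured payloads are present (the field name and shape are assumptions, not the module's actual schema):

```python
import json

def has_structured_tool_calls(jsonl_lines):
    """Return True when any record carries a structured tool-call payload.

    The 'tool_calls' field name is an assumption about the dataset shape;
    the real pipeline may key on a different field.
    """
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("tool_calls"):
            return True
    return False

# Example: only the second record carries structured tool calls.
lines = [
    '{"query": "q1", "response": "r1"}',
    '{"query": "q2", "response": "r2", "tool_calls": [{"name": "search"}]}',
]
```

With a gate like this, the batch script can decide per run whether `ToolCallAccuracyEvaluator` is applicable at all, instead of emitting spurious failures on datasets with no tool traffic.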
### Custom graph evaluators

| Evaluator | What it checks | Why it matters |
|---|---|---|
| `EntityAccuracyEvaluator` | Mentioned entities exist in Parquet graph data | Detects unsupported entities and weak grounding |
| `RelationshipValidityEvaluator` | Co-mentioned entity pairs match graph relationships | Detects fabricated links between entities |
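To make the grounding idea concrete, here is a minimal sketch of the check behind an entity-accuracy evaluator. The real evaluator loads the entity table from Parquet and extracts mentions itself; here a plain set of names stands in for the table, and mention extraction is assumed to happen upstream:

```python
class EntityAccuracySketch:
    """Sketch of a graph-grounding check: every entity the response
    mentions must exist in the knowledge graph's entity table.

    Assumptions: the real EntityAccuracyEvaluator reads Parquet and
    detects mentions; this sketch takes both as plain inputs.
    """

    def __init__(self, known_entities):
        # Case-insensitive lookup set standing in for the Parquet table.
        self.known = {name.lower() for name in known_entities}

    def __call__(self, *, mentioned_entities):
        unsupported = [e for e in mentioned_entities
                       if e.lower() not in self.known]
        total = len(mentioned_entities)
        # No mentions at all counts as fully grounded (nothing to verify).
        score = (total - len(unsupported)) / total if total else 1.0
        return {"entity_accuracy": score,
                "unsupported_entities": unsupported}
```

A relationship-validity check follows the same pattern, but looks up (source, target) pairs in the relationships table instead of single names.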
### Safety scanning

| Component | What it checks | Why it matters |
|---|---|---|
| `RedTeam` scan | Attack outcomes by risk category and strategy | Produces safety evidence and failure patterns |
## Pipeline Steps

### Step 1: Start MCP Server

```shell
poetry run python run_mcp_server.py
```

### Step 2: Generate Evaluation Data

```shell
poetry run python -m evaluation.scripts.generate_eval_data
```

This runs the Knowledge Captain against 10 golden questions and writes `eval_data.jsonl`.
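Each line of `eval_data.jsonl` pairs one golden question with the agent's answer and any tool calls made along the way. The record shape below is illustrative only; the actual field names in `generate_eval_data.py` may differ:

```python
import json

# Hypothetical shape of one eval_data.jsonl row. Field names
# ("query", "response", "tool_calls") are assumptions, not the
# script's documented schema.
record = {
    "query": "Which entities are related to GraphRAG indexing?",
    "response": "Indexing produces entity and relationship tables.",
    "tool_calls": [
        {"name": "local_search", "arguments": {"query": "GraphRAG indexing"}}
    ],
}

def append_jsonl(path, rec):
    """Append one JSON object per line (the JSONL convention)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Keeping one self-contained record per line is what lets the batch evaluator map columns with `${data.query}` / `${data.response}` expressions in Step 3.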
### Step 3: Run Batch Evaluation

```shell
poetry run python -m evaluation.scripts.run_batch_evaluation
```

Optional variants:

```shell
# Skip custom graph evaluators
poetry run python -m evaluation.scripts.run_batch_evaluation --no-custom

# Publish the quality run to New Foundry (dashboard + report URL)
poetry run python -m evaluation.scripts.run_batch_evaluation --foundry
```
Figure. Batch quality run in Azure AI Foundry, including evaluator metrics and row-level evidence.
### Step 4: Run Red Team Scan

```shell
poetry run python -m evaluation.scripts.run_redteam --flow cloud-model
```

Use `cloud-model` as the default flow for predictable behavior.
Figure. Red team safety run in Azure AI Foundry, showing risk-category outcomes and attack results.
## Where Azure AI Foundry Fits

Part 5 uses Azure AI Foundry as the shared visualization layer for evaluation operations:

- Step 3 (`--foundry`): publishes a New Foundry quality run and returns a report URL for dashboard review. The default Foundry quality set emphasizes semantic signals (`relevance`, `coherence`, `response_completeness`) plus agent checks (`task_adherence`, `intent_resolution`).
- Step 4 (`run_redteam`): runs the red team scan and publishes a New Foundry reference run for safety visibility.
- Custom graph evaluators: execute in the same batch pipeline and are persisted in local artifacts (`evaluation_results.json`, `evaluation_report.md`) that are reviewed alongside Foundry run links.

By design, lexical overlap metrics such as F1 are not the default in the Foundry export for this agent workflow, because they can under-score correct but paraphrased answers.

This gives one operational workflow: Foundry for centralized run visibility, local custom metrics for graph-specific grounding checks.
## Latest Foundry Quality Snapshot (March 2026)
Most recent quality run summary (10 rows):
| Metric | Score | Rows |
|---|---|---|
| Task adherence | 80% | 8/10 |
| Intent resolution | 100% | 10/10 |
| Relevance | 100% | 10/10 |
| Coherence | 100% | 10/10 |
| Response completeness | 100% | 10/10 |
| Prompt tokens | 85,686 | - |
| Completion tokens | 5,048 | - |
How to interpret this snapshot:
- Semantic quality is stable across the full set.
-
task_adherenceis the primary optimization target. -
ToolCallAccuracyEvaluatoris emitted only wheneval_data.jsonlincludes structuredtool_callpayloads.
Figure. Azure AI Foundry evaluations list used as the central run registry for Part 5.
## Key Snippets That Matter

### 1. Message conversion (MAF to evaluator schema)

The SDK expects OpenAI-style `tool_call` and `tool_result` messages. MAF internally uses `function` messages.

```python
from evaluation.evaluators.builtin import convert_to_evaluator_messages

messages = convert_to_evaluator_messages(all_msgs)
```

Without this conversion, tool-focused evaluators are unreliable.
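A standalone sketch of what such a conversion does, with field names assumed for illustration (the real `convert_to_evaluator_messages` in `builtin.py` may handle more cases):

```python
def convert_to_evaluator_messages_sketch(messages):
    """Illustrative conversion: MAF 'function' messages are rewritten as
    OpenAI-style 'tool' results so tool-focused evaluators can read them.

    Assumption: each function message carries 'name' and 'content'; the
    real converter's field handling may differ.
    """
    converted = []
    for msg in messages:
        if msg.get("role") == "function":
            converted.append({
                "role": "tool",
                "tool_call_id": msg.get("name", ""),
                "content": msg.get("content", ""),
            })
        else:
            # Non-tool messages pass through unchanged (copied defensively).
            converted.append(dict(msg))
    return converted
```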
### 2. Correct `evaluator_config` mapping shape

```python
evaluator_config = {
    "task_adherence": {
        "column_mapping": {
            "query": "${data.query}",
            "response": "${data.response}",
        }
    }
}
```

Flattened mappings (placing `query`/`response` directly under the evaluator key, without the `column_mapping` level) break field binding.
### 3. Deployment compatibility guard

Some deployments reject `max_tokens` and require `max_completion_tokens`.

```python
if "intent_resolution" in evaluators and not _supports_legacy_max_tokens(config):
    evaluators.pop("intent_resolution", None)
```

This keeps the run operational while preserving the rest of the evaluation set.
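One hypothetical way to implement such a guard is a name-prefix heuristic, shown below purely as a sketch. The prefix list is an assumption (reasoning-style o-series deployments are the known case that rejects `max_tokens`); a probe request against the deployment would be the more robust alternative:

```python
def supports_legacy_max_tokens_sketch(deployment_name):
    """Hypothetical guard: return False for deployments assumed to
    require max_completion_tokens instead of max_tokens.

    The o-series prefix match is an assumption, not a documented rule;
    prefer probing the deployment if behavior matters.
    """
    name = deployment_name.lower()
    return not name.startswith(("o1", "o3", "o4"))
```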
### 4. Red team semantic success guard

```python
total_attacks = _extract_total_evaluated_attacks(result_payload)
if total_attacks == 0:
    raise RuntimeError("Red team scan completed but produced zero evaluated attacks.")
```

This prevents false-success runs in unsupported regions.
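The counting step can be sketched as a sum over per-category results. The payload shape below is an assumption inferred from the guard's intent, not the SDK's documented schema:

```python
def extract_total_evaluated_attacks_sketch(payload):
    """Sum evaluated attack counts across risk categories.

    Assumption: the scan result exposes a 'risk_categories' list with an
    'evaluated_attacks' count per entry; the real payload may differ.
    """
    return sum(int(cat.get("evaluated_attacks", 0))
               for cat in payload.get("risk_categories", []))
```

Because an unsupported region yields a structurally valid but empty payload, only a semantic check like this (rather than an exit-code check) catches the 0/0 case.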
## Interpreting Results for Release Decisions
Do not treat one score as the whole truth. Use a small gate matrix.
Figure. Foundry evaluation evidence used as release-gate input, not only as post-run reporting.
| Dimension | Signal to watch | Practical release question |
|---|---|---|
| Task quality | TaskAdherence, IntentResolution | Are answers complete and aligned with user goal? |
| Tool behavior | ToolCallAccuracy | Is orchestration stable after changes? |
| Graph grounding | EntityAccuracy, RelationshipValidity | Are claims supported by the knowledge graph? |
| Safety | Red team ASR and risk outcomes | Did risk exposure improve, regress, or stay flat? |
| Traceability | OTel traces + run artifacts | Can we explain failures quickly? |
Recommended practice:
- Compare against previous baseline, not isolated absolute values.
- Block release on clear regressions in critical metrics.
- Keep known exceptions documented and time-boxed.
- Treat missing `ToolCallAccuracy` in a run as dataset-shape not-applicable (no structured tool calls), not as an automatic failure.
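The baseline-comparison practice above can be sketched as a small gate function. Metric names follow this article's quality snapshot; the tolerance value is an illustrative choice, not a recommended threshold:

```python
def find_regressions(current, baseline, tolerance=0.05):
    """Compare a run against the previous baseline and flag clear drops.

    Metrics missing from the current run (e.g. ToolCallAccuracy on a
    dataset with no structured tool calls) are skipped as not-applicable
    rather than treated as failures.
    """
    regressions = []
    for metric, base_score in baseline.items():
        current_score = current.get(metric)
        if current_score is None:
            continue  # dataset-shape not-applicable, per the guidance above
        if current_score < base_score - tolerance:
            regressions.append(
                f"{metric}: {base_score:.2f} -> {current_score:.2f}")
    return regressions
```

A release gate would then block on a non-empty result for critical metrics while allowing documented, time-boxed exceptions.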
## Common Failure Patterns

| Pattern | Typical cause | Mitigation |
|---|---|---|
| Missing or wrong evaluator columns | Incorrect `evaluator_config` shape | Use nested `column_mapping` |
| Intermittent evaluator failure | Deployment incompatibility for token params | Use an evaluator-only deployment override |
| Red team run shows 0/0 | Region capability mismatch | Move the Foundry project to a supported region |
| Good text but poor grounding | Response not constrained by graph evidence | Add graph checks and update prompts |
| Hard-to-debug regressions | Missing traces/artifacts | Keep OTel + result JSON in every run |
## Environment Variables

| Variable | Required | Purpose |
|---|---|---|
| `AZURE_OPENAI_ENDPOINT` | Yes | Azure OpenAI endpoint |
| `AZURE_OPENAI_API_KEY` | Yes | Azure OpenAI key |
| `AZURE_OPENAI_CHAT_DEPLOYMENT` | No | Default evaluator/chat deployment |
| `AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT` | No | Evaluator-only deployment override |
| `AZURE_AI_PROJECT` | Step 4 | New Foundry project endpoint |
| `APPLICATIONINSIGHTS_CONNECTION_STRING` | No | Production telemetry sink |
| `OTEL_TRACING_ENDPOINT` | No | Local OTLP endpoint |
## Recommended Foundry Screenshots
- Evaluations list view for portfolio-level evidence.
- Batch quality run details with summary and row-level metrics.
- Red team run details with risk categories and outcomes.
- Release-gate evidence view for decision making.
- Prefer detailed run views over reduced pages that only show token counts.
## Validation Snapshot

Current module tests:

| File | Tests |
|---|---|
| `tests/evaluation/test_config.py` | 15 |
| `tests/evaluation/test_builtin_evaluators.py` | 21 |
| `tests/evaluation/test_custom_evaluators.py` | 14 |
| `tests/evaluation/test_monitoring.py` | 6 |
| `tests/evaluation/test_run_redteam.py` | 7 |
| **Total** | **63** |
## Why This Part Is a Milestone
Part 5 turns the project from a functional demo into an engineering-grade evaluable system.
- You can compare behavior over time.
- You can detect regressions before production.
- You can produce safety evidence in a repeatable way.
- You can keep a traceable path from query to metric.
This is the baseline for Part 6 (Human-in-the-Loop) and production quality gates.
## Key Takeaways
- Agent quality in production is not a single score. You need quality, safety, and traceability together.
- Built-in evaluators and custom graph evaluators solve different problems and should be used as a combined gate.
- Azure AI Foundry gives shared visibility for runs, while local artifacts preserve GraphRAG-specific evidence.
- Missing `ToolCallAccuracy` is often a dataset-shape condition, not automatically a regression.
- Red team outcomes should be treated as release evidence, not as an optional post-check.
## What's Next
In Part 6, we will add Human-in-the-Loop controls to the same pipeline:
- Approval gates for sensitive tool actions.
- Explicit escalation paths for low-confidence responses.
- Audit-friendly checkpoints connected to the same evaluation workflow.
## Resources
- Project repository: https://github.com/cristofima/maf-graphrag-series