DEV Community

Cristopher Coronado


Measure Agent Quality and Safety with Azure AI Evaluation SDK and Azure AI Foundry

A practical evaluation pipeline for GraphRAG agents with quality metrics, safety scans, and observable runs.


Introduction

In Part 4, we orchestrated multiple agents. This article (Part 5) answers a harder question: can we prove that the system is reliable enough for production workloads?

For AI Engineers, answer quality alone is not enough. You also need:

  • Repeatable quality checks before release.
  • Safety evidence for security and compliance reviews.
  • Traceability when behavior changes after prompt, model, or tool updates.

This part adds an evaluation module under src/evaluation with three goals:

  • Quality: task completion, intent resolution, tool-call behavior, graph-grounded correctness.
  • Safety: adversarial probing with red team strategies and risk categories.
  • Observability: telemetry and artifacts that support debugging and regression analysis.

How the three goals are measured

| Goal | Primary signals | Current evidence in this article |
| --- | --- | --- |
| Quality | task_adherence, intent_resolution, relevance, coherence, response_completeness | Foundry quality snapshot (March 2026): 80% task adherence, 100% on the other quality signals |
| Safety | Red team attack outcomes by risk category and strategy | Red team run section and Foundry safety screenshots |
| Observability | Prompt/completion token usage, OTel traces, local artifacts | Quality snapshot token counters (85,686 prompt / 5,048 completion) plus OTel and JSON report references |

Quality and safety runs can also be exported to Azure AI Foundry, so teams review outcomes in shared dashboards instead of only local JSON artifacts.

Why This Matters for AI Engineers

When you ship agent systems, every change can alter behavior: prompts, model versions, tool schemas, and data.

| Engineering scenario | Typical failure mode | Why this pipeline helps |
| --- | --- | --- |
| Prompt or model update | Fluent but lower-quality answers | Batch baselines expose quality regressions before release |
| Tool contract changes | Wrong tool or wrong arguments | Tool-call evaluators detect routing and schema drift |
| Knowledge graph refresh | Unsupported entities/relationships in answers | Custom graph evaluators detect grounding errors |
| Safety hardening | Unknown risk exposure under adversarial inputs | Red team runs provide repeatable safety evidence |
| Incident debugging | Hard to explain why behavior changed | OTel traces and result artifacts reduce investigation time |

What You Build

| Layer | Component | Purpose |
| --- | --- | --- |
| Dataset | golden_questions.jsonl | Controlled test set with expected outcomes |
| Generation | generate_eval_data.py | Runs the agent and writes evaluation data |
| Quality eval | run_batch_evaluation.py | Runs built-in and custom evaluators |
| Safety eval | run_redteam.py | Runs red team scans against the agent/model |
| Reporting | evaluation_report.md, JSON outputs, Foundry run links | Human- and machine-readable results |
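For reference, a golden_questions.jsonl row could look like the following. The exact field names are an assumption here; the idea is a controlled query paired with an expected outcome:

```json
{"id": "q-01", "query": "Which community does the billing service belong to?", "expected_entities": ["billing service"], "expected_outcome": "Names the community and cites supporting relationships"}
```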

Module Layout

```
src/evaluation/
├── config.py
├── datasets/
│   ├── golden_questions.jsonl
│   └── eval_data.jsonl
├── evaluators/
│   ├── builtin.py
│   ├── entity_accuracy.py
│   └── relationship_validity.py
├── monitoring/
│   └── otel_setup.py
├── results/
└── scripts/
    ├── generate_eval_data.py
    ├── run_batch_evaluation.py
    └── run_redteam.py
```

What Is Evaluated, and Why

Built-in quality evaluators

| Evaluator | What it checks | Why it matters |
| --- | --- | --- |
| TaskAdherenceEvaluator | Does the response complete the requested task? | Detects incomplete or off-target answers |
| IntentResolutionEvaluator | Does the response resolve user intent? | Detects responses that are fluent but irrelevant |
| RelevanceEvaluator | Is the response relevant to the query? | Detects answers that drift away from the user request |
| CoherenceEvaluator | Is the response logically consistent? | Detects contradictions and weak reasoning flow |
| ResponseCompletenessEvaluator | Does the response cover expected content? | Detects partial answers against expected coverage |

Built-in tool-behavior evaluator (conditional)

| Evaluator | What it checks | Why it matters |
| --- | --- | --- |
| ToolCallAccuracyEvaluator | Were tools/arguments appropriate? | Detects wrong routing, wrong parameters, unnecessary calls |

ToolCallAccuracyEvaluator is included when structured tool_call payloads exist in eval_data.jsonl.
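A minimal sketch of that conditional check (the helper name and row shape are assumptions, not the module's actual code): scan the dataset for OpenAI-style tool_calls before registering the evaluator.

```python
# Hypothetical helper mirroring the conditional inclusion of
# ToolCallAccuracyEvaluator: only add it when at least one
# eval_data.jsonl row carries structured tool_call payloads.
import json

def has_structured_tool_calls(jsonl_lines):
    """Return True if any row carries OpenAI-style tool_calls entries."""
    for line in jsonl_lines:
        row = json.loads(line)
        for message in row.get("messages", []):
            if message.get("tool_calls"):
                return True
    return False

rows = [
    json.dumps({"messages": [{"role": "assistant", "content": "hi"}]}),
    json.dumps({"messages": [{"role": "assistant",
                              "tool_calls": [{"name": "graph_query"}]}]}),
]
print(has_structured_tool_calls(rows))  # True
```

If no row carries structured tool calls, the evaluator is simply left out of the run rather than reported as a failure.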

Custom graph evaluators

| Evaluator | What it checks | Why it matters |
| --- | --- | --- |
| EntityAccuracyEvaluator | Mentioned entities exist in Parquet graph data | Detects unsupported entities and weak grounding |
| RelationshipValidityEvaluator | Co-mentioned entity pairs match graph relationships | Detects fabricated links between entities |
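To make the grounding idea concrete, here is an illustrative entity check in the spirit of EntityAccuracyEvaluator; the function name, signature, and scoring rule are assumptions, not the project's implementation:

```python
# Illustrative grounding check: what fraction of entities mentioned in a
# response exist in the entity table loaded from the Parquet graph data?
def entity_accuracy(mentioned_entities, graph_entities):
    if not mentioned_entities:
        return 1.0  # nothing claimed, so nothing unsupported
    known = {name.lower() for name in graph_entities}
    hits = sum(1 for name in mentioned_entities if name.lower() in known)
    return hits / len(mentioned_entities)

# "Atlantis" is not in the graph, so only 2 of 3 mentions are grounded.
score = entity_accuracy(["Billing", "Checkout", "Atlantis"],
                        ["Billing", "Checkout", "Inventory"])
print(round(score, 2))  # 0.67
```

RelationshipValidityEvaluator applies the same idea to pairs: co-mentioned entities should map to an edge in the relationships table.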

Safety scanning

| Component | What it checks | Why it matters |
| --- | --- | --- |
| RedTeam scan | Attack outcomes by risk category and strategy | Produces safety evidence and failure patterns |
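Attack outcomes can be reduced to per-category attack success rates (ASR) for trend tracking. The aggregation below is a sketch over an assumed record shape, not the SDK's actual output schema:

```python
# Hypothetical aggregation of red team attack records into per-category
# attack success rates (ASR). Record fields are assumptions.
from collections import defaultdict

def asr_by_category(attacks):
    totals, successes = defaultdict(int), defaultdict(int)
    for attack in attacks:
        totals[attack["risk_category"]] += 1
        successes[attack["risk_category"]] += attack["attack_success"]
    return {cat: successes[cat] / totals[cat] for cat in totals}

attacks = [
    {"risk_category": "violence", "attack_success": 0},
    {"risk_category": "violence", "attack_success": 1},
    {"risk_category": "self_harm", "attack_success": 0},
]
print(asr_by_category(attacks))  # {'violence': 0.5, 'self_harm': 0.0}
```

Comparing these rates run over run is what turns a scan into regression evidence.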

Pipeline Steps

Step 1: Start MCP Server

```bash
poetry run python run_mcp_server.py
```

Step 2: Generate Evaluation Data

```bash
poetry run python -m evaluation.scripts.generate_eval_data
```

This runs the Knowledge Captain against 10 golden questions and writes eval_data.jsonl.
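One generated eval_data.jsonl row might look like this (field names are illustrative, not the module's exact schema); note the structured tool_calls entry that later enables ToolCallAccuracyEvaluator:

```json
{"query": "Which community does the billing service belong to?", "response": "The billing service belongs to the Payments community.", "messages": [{"role": "assistant", "tool_calls": [{"type": "tool_call", "name": "graph_query", "arguments": {"entity": "billing service"}}]}]}
```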

Step 3: Run Batch Evaluation

```bash
poetry run python -m evaluation.scripts.run_batch_evaluation
```

Optional variants:

```bash
# Skip custom graph evaluators
poetry run python -m evaluation.scripts.run_batch_evaluation --no-custom

# Publish quality run to New Foundry (dashboard + report URL)
poetry run python -m evaluation.scripts.run_batch_evaluation --foundry
```

Figure. Batch quality run in Azure AI Foundry, including evaluator metrics and row-level evidence.

Step 4: Run Red Team Scan

```bash
poetry run python -m evaluation.scripts.run_redteam --flow cloud-model
```

Use cloud-model as the default flow; it gives the most predictable behavior.

Figure. Red team safety run in Azure AI Foundry, showing risk-category outcomes and attack results.

Where Azure AI Foundry Fits

Part 5 uses Azure AI Foundry as the shared visualization layer for evaluation operations:

  • Step 3 (--foundry): publishes a New Foundry quality run and returns a report URL for dashboard review. The default Foundry quality set emphasizes semantic signals (relevance, coherence, response_completeness) plus agent checks (task_adherence, intent_resolution).
  • Step 4 (run_redteam): runs the red team scan and publishes a New Foundry reference run for safety visibility.
  • Custom graph evaluators: execute in the same batch pipeline and are persisted in local artifacts (evaluation_results.json, evaluation_report.md) that are reviewed alongside Foundry run links.

By design, lexical overlap metrics such as F1 are not the default in Foundry export for this agent workflow, because they can under-score correct but paraphrased answers.

This gives one operational workflow: Foundry for centralized run visibility, local custom metrics for graph-specific grounding checks.

Latest Foundry quality snapshot (March 2026)

Most recent quality run summary (10 rows):

| Metric | Score | Rows |
| --- | --- | --- |
| Task adherence | 80% | 8/10 |
| Intent resolution | 100% | 10/10 |
| Relevance | 100% | 10/10 |
| Coherence | 100% | 10/10 |
| Response completeness | 100% | 10/10 |
| Prompt tokens | 85,686 | - |
| Completion tokens | 5,048 | - |

How to interpret this snapshot:

  • Semantic quality is stable across the full set.
  • task_adherence is the primary optimization target.
  • ToolCallAccuracyEvaluator is emitted only when eval_data.jsonl includes structured tool_call payloads.

Figure. Azure AI Foundry evaluations list used as the central run registry for Part 5.

Key Snippets That Matter

1. Message conversion (MAF to evaluator schema)

The SDK expects OpenAI-style tool_call and tool_result. MAF internally uses function messages.

```python
from evaluation.evaluators.builtin import convert_to_evaluator_messages

messages = convert_to_evaluator_messages(all_msgs)
```

Without this conversion, tool-focused evaluators are unreliable.
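As a rough sketch of what such a conversion does (illustrative, not the actual builtin.py code), MAF-style function messages are rewritten into the OpenAI-style tool role the evaluators expect:

```python
# Illustrative conversion sketch: map MAF "function" messages to
# OpenAI-style "tool" results; other roles pass through unchanged.
def convert_to_evaluator_messages(messages):
    converted = []
    for msg in messages:
        if msg.get("role") == "function":
            converted.append({
                "role": "tool",
                "tool_call_id": msg.get("id", ""),
                "content": msg.get("content", ""),
            })
        else:
            converted.append(msg)
    return converted

print(convert_to_evaluator_messages(
    [{"role": "function", "id": "call_1", "content": "42 entities"}]
))
```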

2. Correct evaluator_config mapping shape

```python
evaluator_config = {
    "task_adherence": {
        "column_mapping": {
            "query": "${data.query}",
            "response": "${data.response}",
        }
    }
}
```

Flattened mappings break field binding.

3. Deployment compatibility guard

Some deployments reject max_tokens and require max_completion_tokens.

```python
if "intent_resolution" in evaluators and not _supports_legacy_max_tokens(config):
    evaluators.pop("intent_resolution", None)
```

This keeps the run operational while preserving the rest of the evaluation set.

4. Red team semantic success guard

```python
total_attacks = _extract_total_evaluated_attacks(result_payload)
if total_attacks == 0:
    raise RuntimeError("Red team scan completed but produced zero evaluated attacks.")
```

This prevents false-success runs in unsupported regions.
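An illustrative version of that guard is shown below; the payload shape is an assumption about the scan result, not a documented schema:

```python
# Sketch of an _extract_total_evaluated_attacks-style guard: count
# evaluated attacks across categories so an empty run fails loudly.
def extract_total_evaluated_attacks(payload):
    summary = payload.get("scorecard", {}).get("risk_category_summary", [])
    return sum(entry.get("total", 0) for entry in summary)

payload = {"scorecard": {"risk_category_summary": [
    {"risk_category": "violence", "total": 12},
    {"risk_category": "hate_unfairness", "total": 12},
]}}
print(extract_total_evaluated_attacks(payload))  # 24
```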

Interpreting Results for Release Decisions

Do not treat one score as the whole truth. Use a small gate matrix.

Figure. Foundry evaluation evidence used as release-gate input, not only as post-run reporting.

| Dimension | Signal to watch | Practical release question |
| --- | --- | --- |
| Task quality | TaskAdherence, IntentResolution | Are answers complete and aligned with the user goal? |
| Tool behavior | ToolCallAccuracy | Is orchestration stable after changes? |
| Graph grounding | EntityAccuracy, RelationshipValidity | Are claims supported by the knowledge graph? |
| Safety | Red team ASR and risk outcomes | Did risk exposure improve, regress, or stay flat? |
| Traceability | OTel traces + run artifacts | Can we explain failures quickly? |

Recommended practice:

  • Compare against previous baseline, not isolated absolute values.
  • Block release on clear regressions in critical metrics.
  • Keep known exceptions documented and time-boxed.
  • Treat missing ToolCallAccuracy in a run as dataset-shape not-applicable (no structured tool calls), not as an automatic failure.
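The baseline comparison can be mechanized with a small helper like this sketch (metric names and the tolerance are illustrative):

```python
# Sketch of a baseline-comparison release gate: flag metrics that
# regressed beyond a tolerance instead of judging absolute scores.
def gate_regressions(baseline, current, tolerance=0.05):
    blocked = []
    for metric, prev in baseline.items():
        now = current.get(metric)
        if now is None:
            continue  # e.g. ToolCallAccuracy absent: dataset-shape N/A
        if prev - now > tolerance:
            blocked.append((metric, prev, now))
    return blocked

baseline = {"task_adherence": 0.9, "relevance": 1.0}
current = {"task_adherence": 0.8, "relevance": 1.0}
print(gate_regressions(baseline, current))  # [('task_adherence', 0.9, 0.8)]
```

Skipping missing metrics instead of failing on them encodes the "dataset-shape not-applicable" rule directly in the gate.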

Common Failure Patterns

| Pattern | Typical cause | Mitigation |
| --- | --- | --- |
| Missing or wrong evaluator columns | Incorrect evaluator_config shape | Use nested column_mapping |
| Intermittent evaluator failure | Deployment incompatibility for token params | Use evaluator-only deployment override |
| Red team run shows 0/0 | Region capability mismatch | Move Foundry project to a supported region |
| Good text but poor grounding | Response not constrained by graph evidence | Add graph checks and update prompts |
| Hard-to-debug regressions | Missing traces/artifacts | Keep OTel + result JSON in every run |

Environment Variables

| Variable | Required | Purpose |
| --- | --- | --- |
| AZURE_OPENAI_ENDPOINT | Yes | Azure OpenAI endpoint |
| AZURE_OPENAI_API_KEY | Yes | Azure OpenAI key |
| AZURE_OPENAI_CHAT_DEPLOYMENT | No | Default evaluator/chat deployment |
| AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT | No | Evaluator-only deployment override |
| AZURE_AI_PROJECT | Step 4 only | New Foundry project endpoint |
| APPLICATIONINSIGHTS_CONNECTION_STRING | No | Production telemetry sink |
| OTEL_TRACING_ENDPOINT | No | Local OTLP endpoint |
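The override precedence implied by the table can be expressed as a small resolver (a hypothetical helper mirroring the two deployment variables, not the project's config.py):

```python
import os

# Hypothetical resolution order: the evaluator-only override wins over
# the default chat deployment when both are set.
def resolve_eval_deployment(env=None):
    env = os.environ if env is None else env
    return (env.get("AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT")
            or env.get("AZURE_OPENAI_CHAT_DEPLOYMENT"))

print(resolve_eval_deployment({
    "AZURE_OPENAI_CHAT_DEPLOYMENT": "gpt-4o",
    "AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT": "gpt-4o-mini",
}))  # gpt-4o-mini
```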

Recommended Foundry Screenshots

  • Evaluations list view for portfolio-level evidence.
  • Batch quality run details with summary and row-level metrics.
  • Red team run details with risk categories and outcomes.
  • Release-gate evidence view for decision making.
  • Prefer detailed run views over reduced pages that only show token counts.

Validation Snapshot

Current module tests:

| File | Tests |
| --- | --- |
| tests/evaluation/test_config.py | 15 |
| tests/evaluation/test_builtin_evaluators.py | 21 |
| tests/evaluation/test_custom_evaluators.py | 14 |
| tests/evaluation/test_monitoring.py | 6 |
| tests/evaluation/test_run_redteam.py | 7 |
| Total | 63 |

Why This Part Is a Milestone

Part 5 turns the project from a functional demo into an engineering-grade evaluable system.

  • You can compare behavior over time.
  • You can detect regressions before production.
  • You can produce safety evidence in a repeatable way.
  • You can keep a traceable path from query to metric.

This is the baseline for Part 6 (Human-in-the-Loop) and production quality gates.

Key Takeaways

  • Agent quality in production is not a single score. You need quality, safety, and traceability together.
  • Built-in evaluators and custom graph evaluators solve different problems and should be used as a combined gate.
  • Azure AI Foundry gives shared visibility for runs, while local artifacts preserve GraphRAG-specific evidence.
  • Missing ToolCallAccuracy is often a dataset-shape condition, not automatically a regression.
  • Red team outcomes should be treated as release evidence, not as an optional post-check.

What's Next

In Part 6, we will add Human-in-the-Loop controls to the same pipeline:

  • Approval gates for sensitive tool actions.
  • Explicit escalation paths for low-confidence responses.
  • Audit-friendly checkpoints connected to the same evaluation workflow.

Resources
