DEV Community

Cristopher Coronado


Measure Agent Quality and Safety with Azure AI Evaluation SDK and Azure AI Foundry

A practical evaluation pipeline for GraphRAG agents with quality metrics, safety scans, and observable runs.


Introduction

In Part 4, we orchestrated multiple agents. This article (Part 5) answers a harder question: can we prove that the system is reliable enough for production workloads?

For AI Engineers, answer quality alone is not enough. You also need:

  • Repeatable quality checks before release.
  • Safety evidence for security and compliance reviews.
  • Traceability when behavior changes after prompt, model, or tool updates.

This part adds an evaluation module under src/evaluation with three goals:

  • Quality: task completion, intent resolution, tool-call behavior, graph-grounded correctness.
  • Safety: adversarial probing with red team strategies and risk categories.
  • Observability: telemetry and artifacts that support debugging and regression analysis.

How the three goals are measured

| Goal | Primary signals | Current evidence in this article |
| --- | --- | --- |
| Quality | task_adherence, intent_resolution, relevance, coherence, response_completeness | Foundry quality snapshot (March 2026): 80% task adherence, 100% on the other quality signals |
| Safety | Red team attack outcomes by risk category and strategy | Red team run section and Foundry safety screenshots |
| Observability | Prompt/completion token usage, OTel traces, local artifacts | Quality snapshot token counters (85,686 prompt / 5,048 completion) plus OTel and JSON report references |

Quality and safety runs can also be exported to Azure AI Foundry, so teams review outcomes in shared dashboards instead of only local JSON artifacts.

Why This Matters for AI Engineers

When you ship agent systems, every change can alter behavior: prompts, model versions, tool schemas, and data.

| Engineering scenario | Typical failure mode | Why this pipeline helps |
| --- | --- | --- |
| Prompt or model update | Fluent but lower-quality answers | Batch baselines expose quality regressions before release |
| Tool contract changes | Wrong tool or wrong arguments | Tool-call evaluators detect routing and schema drift |
| Knowledge graph refresh | Unsupported entities/relationships in answers | Custom graph evaluators detect grounding errors |
| Safety hardening | Unknown risk exposure under adversarial inputs | Red team runs provide repeatable safety evidence |
| Incident debugging | Hard to explain why behavior changed | OTel traces and result artifacts reduce investigation time |

What You Build

| Layer | Component | Purpose |
| --- | --- | --- |
| Dataset | golden_questions.jsonl | Controlled test set with expected outcomes |
| Generation | generate_eval_data.py | Runs the agent and writes evaluation data |
| Quality eval | run_batch_evaluation.py | Runs built-in and custom evaluators |
| Safety eval | run_redteam.py | Runs red team scans against the agent/model |
| Reporting | evaluation_report.md, JSON outputs, Foundry run links | Human- and machine-readable results |
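For reference, a golden_questions.jsonl row could look like the following. The exact field names are an assumption here; the idea is a controlled query paired with an expected outcome:

```json
{"id": "q-01", "query": "Which community does the billing service belong to?", "expected_entities": ["billing service"], "expected_outcome": "Names the community and cites supporting relationships"}
```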

Module Layout

```
src/evaluation/
├── config.py
├── datasets/
│   ├── golden_questions.jsonl
│   └── eval_data.jsonl
├── evaluators/
│   ├── builtin.py
│   ├── entity_accuracy.py
│   └── relationship_validity.py
├── monitoring/
│   └── otel_setup.py
├── results/
└── scripts/
    ├── generate_eval_data.py
    ├── run_batch_evaluation.py
    └── run_redteam.py
```

What Is Evaluated, and Why

Built-in quality evaluators

| Evaluator | What it checks | Why it matters |
| --- | --- | --- |
| TaskAdherenceEvaluator | Does the response complete the requested task? | Detects incomplete or off-target answers |
| IntentResolutionEvaluator | Does the response resolve user intent? | Detects responses that are fluent but irrelevant |
| RelevanceEvaluator | Is the response relevant to the query? | Detects answers that drift away from the user request |
| CoherenceEvaluator | Is the response logically consistent? | Detects contradictions and weak reasoning flow |
| ResponseCompletenessEvaluator | Does the response cover expected content? | Detects partial answers against expected coverage |

Built-in tool-behavior evaluator (conditional)

| Evaluator | What it checks | Why it matters |
| --- | --- | --- |
| ToolCallAccuracyEvaluator | Were tools/arguments appropriate? | Detects wrong routing, wrong parameters, unnecessary calls |

ToolCallAccuracyEvaluator is included when structured tool_call payloads exist in eval_data.jsonl.
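A minimal sketch of that conditional check (the helper name and row shape are assumptions, not the module's actual code): scan the dataset for OpenAI-style tool_calls before registering the evaluator.

```python
# Hypothetical helper mirroring the conditional inclusion of
# ToolCallAccuracyEvaluator: only add it when at least one
# eval_data.jsonl row carries structured tool_call payloads.
import json

def has_structured_tool_calls(jsonl_lines):
    """Return True if any row carries OpenAI-style tool_calls entries."""
    for line in jsonl_lines:
        row = json.loads(line)
        for message in row.get("messages", []):
            if message.get("tool_calls"):
                return True
    return False

rows = [
    json.dumps({"messages": [{"role": "assistant", "content": "hi"}]}),
    json.dumps({"messages": [{"role": "assistant",
                              "tool_calls": [{"name": "graph_query"}]}]}),
]
print(has_structured_tool_calls(rows))  # True
```

If no row carries structured tool calls, the evaluator is simply left out of the run rather than reported as a failure.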

Custom graph evaluators

| Evaluator | What it checks | Why it matters |
| --- | --- | --- |
| EntityAccuracyEvaluator | Mentioned entities exist in Parquet graph data | Detects unsupported entities and weak grounding |
| RelationshipValidityEvaluator | Co-mentioned entity pairs match graph relationships | Detects fabricated links between entities |
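To make the grounding idea concrete, here is an illustrative entity check in the spirit of EntityAccuracyEvaluator; the function name, signature, and scoring rule are assumptions, not the project's implementation:

```python
# Illustrative grounding check: what fraction of entities mentioned in a
# response exist in the entity table loaded from the Parquet graph data?
def entity_accuracy(mentioned_entities, graph_entities):
    if not mentioned_entities:
        return 1.0  # nothing claimed, so nothing unsupported
    known = {name.lower() for name in graph_entities}
    hits = sum(1 for name in mentioned_entities if name.lower() in known)
    return hits / len(mentioned_entities)

# "Atlantis" is not in the graph, so only 2 of 3 mentions are grounded.
score = entity_accuracy(["Billing", "Checkout", "Atlantis"],
                        ["Billing", "Checkout", "Inventory"])
print(round(score, 2))  # 0.67
```

RelationshipValidityEvaluator applies the same idea to pairs: co-mentioned entities should map to an edge in the relationships table.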

Safety scanning

| Component | What it checks | Why it matters |
| --- | --- | --- |
| RedTeam scan | Attack outcomes by risk category and strategy | Produces safety evidence and failure patterns |
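Attack outcomes can be reduced to per-category attack success rates (ASR) for trend tracking. The aggregation below is a sketch over an assumed record shape, not the SDK's actual output schema:

```python
# Hypothetical aggregation of red team attack records into per-category
# attack success rates (ASR). Record fields are assumptions.
from collections import defaultdict

def asr_by_category(attacks):
    totals, successes = defaultdict(int), defaultdict(int)
    for attack in attacks:
        totals[attack["risk_category"]] += 1
        successes[attack["risk_category"]] += attack["attack_success"]
    return {cat: successes[cat] / totals[cat] for cat in totals}

attacks = [
    {"risk_category": "violence", "attack_success": 0},
    {"risk_category": "violence", "attack_success": 1},
    {"risk_category": "self_harm", "attack_success": 0},
]
print(asr_by_category(attacks))  # {'violence': 0.5, 'self_harm': 0.0}
```

Comparing these rates run over run is what turns a scan into regression evidence.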

Pipeline Steps

Step 1: Start MCP Server

```bash
poetry run python run_mcp_server.py
```

Step 2: Generate Evaluation Data

```bash
poetry run python -m evaluation.scripts.generate_eval_data
```

This runs the Knowledge Captain against 10 golden questions and writes eval_data.jsonl.
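One generated eval_data.jsonl row might look like this (field names are illustrative, not the module's exact schema); note the structured tool_calls entry that later enables ToolCallAccuracyEvaluator:

```json
{"query": "Which community does the billing service belong to?", "response": "The billing service belongs to the Payments community.", "messages": [{"role": "assistant", "tool_calls": [{"type": "tool_call", "name": "graph_query", "arguments": {"entity": "billing service"}}]}]}
```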

Step 3: Run Batch Evaluation

```bash
poetry run python -m evaluation.scripts.run_batch_evaluation
```

Optional variants:

```bash
# Skip custom graph evaluators
poetry run python -m evaluation.scripts.run_batch_evaluation --no-custom

# Publish quality run to New Foundry (dashboard + report URL)
poetry run python -m evaluation.scripts.run_batch_evaluation --foundry
```

Figure. Batch quality run in Azure AI Foundry, including evaluator metrics and row-level evidence.

Step 4: Run Red Team Scan

```bash
poetry run python -m evaluation.scripts.run_redteam --flow cloud-model
```

Use cloud-model as the default flow; it gives the most predictable behavior.

Figure. Red team safety run in Azure AI Foundry, showing risk-category outcomes and attack results.

Where Azure AI Foundry Fits

Part 5 uses Azure AI Foundry as the shared visualization layer for evaluation operations:

  • Step 3 (--foundry): publishes a New Foundry quality run and returns a report URL for dashboard review. The default Foundry quality set emphasizes semantic signals (relevance, coherence, response_completeness) plus agent checks (task_adherence, intent_resolution).
  • Step 4 (run_redteam): runs the red team scan and publishes a New Foundry reference run for safety visibility.
  • Custom graph evaluators: execute in the same batch pipeline and are persisted in local artifacts (evaluation_results.json, evaluation_report.md) that are reviewed alongside Foundry run links.

By design, lexical overlap metrics such as F1 are not the default in Foundry export for this agent workflow, because they can under-score correct but paraphrased answers.

This gives one operational workflow: Foundry for centralized run visibility, local custom metrics for graph-specific grounding checks.

Latest Foundry quality snapshot (March 2026)

Most recent quality run summary (10 rows):

| Metric | Score | Rows |
| --- | --- | --- |
| Task adherence | 80% | 8/10 |
| Intent resolution | 100% | 10/10 |
| Relevance | 100% | 10/10 |
| Coherence | 100% | 10/10 |
| Response completeness | 100% | 10/10 |
| Prompt tokens | 85,686 | - |
| Completion tokens | 5,048 | - |

How to interpret this snapshot:

  • Semantic quality is stable across the full set.
  • task_adherence is the primary optimization target.
  • ToolCallAccuracyEvaluator is emitted only when eval_data.jsonl includes structured tool_call payloads.

Figure. Azure AI Foundry evaluations list used as the central run registry for Part 5.

Key Snippets That Matter

1. Message conversion (MAF to evaluator schema)

The SDK expects OpenAI-style tool_call and tool_result. MAF internally uses function messages.

```python
from evaluation.evaluators.builtin import convert_to_evaluator_messages

messages = convert_to_evaluator_messages(all_msgs)
```

Without this conversion, tool-focused evaluators are unreliable.
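As a rough sketch of what such a conversion does (illustrative, not the actual builtin.py code), MAF-style function messages are rewritten into the OpenAI-style tool role the evaluators expect:

```python
# Illustrative conversion sketch: map MAF "function" messages to
# OpenAI-style "tool" results; other roles pass through unchanged.
def convert_to_evaluator_messages(messages):
    converted = []
    for msg in messages:
        if msg.get("role") == "function":
            converted.append({
                "role": "tool",
                "tool_call_id": msg.get("id", ""),
                "content": msg.get("content", ""),
            })
        else:
            converted.append(msg)
    return converted

print(convert_to_evaluator_messages(
    [{"role": "function", "id": "call_1", "content": "42 entities"}]
))
```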

2. Correct evaluator_config mapping shape

```python
evaluator_config = {
    "task_adherence": {
        "column_mapping": {
            "query": "${data.query}",
            "response": "${data.response}",
        }
    }
}
```

Flattened mappings break field binding.

3. Deployment compatibility guard

Some deployments reject max_tokens and require max_completion_tokens.

```python
if "intent_resolution" in evaluators and not _supports_legacy_max_tokens(config):
    evaluators.pop("intent_resolution", None)
```

This keeps the run operational while preserving the rest of the evaluation set.

4. Red team semantic success guard

```python
total_attacks = _extract_total_evaluated_attacks(result_payload)
if total_attacks == 0:
    raise RuntimeError("Red team scan completed but produced zero evaluated attacks.")
```

This prevents false-success runs in unsupported regions.
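An illustrative version of that guard is shown below; the payload shape is an assumption about the scan result, not a documented schema:

```python
# Sketch of an _extract_total_evaluated_attacks-style guard: count
# evaluated attacks across categories so an empty run fails loudly.
def extract_total_evaluated_attacks(payload):
    summary = payload.get("scorecard", {}).get("risk_category_summary", [])
    return sum(entry.get("total", 0) for entry in summary)

payload = {"scorecard": {"risk_category_summary": [
    {"risk_category": "violence", "total": 12},
    {"risk_category": "hate_unfairness", "total": 12},
]}}
print(extract_total_evaluated_attacks(payload))  # 24
```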

Interpreting Results for Release Decisions

Do not treat one score as the whole truth. Use a small gate matrix.

Figure. Foundry evaluation evidence used as release-gate input, not only as post-run reporting.

| Dimension | Signal to watch | Practical release question |
| --- | --- | --- |
| Task quality | TaskAdherence, IntentResolution | Are answers complete and aligned with the user goal? |
| Tool behavior | ToolCallAccuracy | Is orchestration stable after changes? |
| Graph grounding | EntityAccuracy, RelationshipValidity | Are claims supported by the knowledge graph? |
| Safety | Red team ASR and risk outcomes | Did risk exposure improve, regress, or stay flat? |
| Traceability | OTel traces + run artifacts | Can we explain failures quickly? |

Recommended practice:

  • Compare against previous baseline, not isolated absolute values.
  • Block release on clear regressions in critical metrics.
  • Keep known exceptions documented and time-boxed.
  • Treat missing ToolCallAccuracy in a run as dataset-shape not-applicable (no structured tool calls), not as an automatic failure.
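The baseline comparison can be mechanized with a small helper like this sketch (metric names and the tolerance are illustrative):

```python
# Sketch of a baseline-comparison release gate: flag metrics that
# regressed beyond a tolerance instead of judging absolute scores.
def gate_regressions(baseline, current, tolerance=0.05):
    blocked = []
    for metric, prev in baseline.items():
        now = current.get(metric)
        if now is None:
            continue  # e.g. ToolCallAccuracy absent: dataset-shape N/A
        if prev - now > tolerance:
            blocked.append((metric, prev, now))
    return blocked

baseline = {"task_adherence": 0.9, "relevance": 1.0}
current = {"task_adherence": 0.8, "relevance": 1.0}
print(gate_regressions(baseline, current))  # [('task_adherence', 0.9, 0.8)]
```

Skipping missing metrics instead of failing on them encodes the "dataset-shape not-applicable" rule directly in the gate.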

Common Failure Patterns

| Pattern | Typical cause | Mitigation |
| --- | --- | --- |
| Missing or wrong evaluator columns | Incorrect evaluator_config shape | Use nested column_mapping |
| Intermittent evaluator failure | Deployment incompatibility for token params | Use evaluator-only deployment override |
| Red team run shows 0/0 | Region capability mismatch | Move Foundry project to a supported region |
| Good text but poor grounding | Response not constrained by graph evidence | Add graph checks and update prompts |
| Hard-to-debug regressions | Missing traces/artifacts | Keep OTel + result JSON in every run |

Environment Variables

| Variable | Required | Purpose |
| --- | --- | --- |
| AZURE_OPENAI_ENDPOINT | Yes | Azure OpenAI endpoint |
| AZURE_OPENAI_API_KEY | Yes | Azure OpenAI key |
| AZURE_OPENAI_CHAT_DEPLOYMENT | No | Default evaluator/chat deployment |
| AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT | No | Evaluator-only deployment override |
| AZURE_AI_PROJECT | Step 4 only | New Foundry project endpoint |
| APPLICATIONINSIGHTS_CONNECTION_STRING | No | Production telemetry sink |
| OTEL_TRACING_ENDPOINT | No | Local OTLP endpoint |
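The override precedence implied by the table can be expressed as a small resolver (a hypothetical helper mirroring the two deployment variables, not the project's config.py):

```python
import os

# Hypothetical resolution order: the evaluator-only override wins over
# the default chat deployment when both are set.
def resolve_eval_deployment(env=None):
    env = os.environ if env is None else env
    return (env.get("AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT")
            or env.get("AZURE_OPENAI_CHAT_DEPLOYMENT"))

print(resolve_eval_deployment({
    "AZURE_OPENAI_CHAT_DEPLOYMENT": "gpt-4o",
    "AZURE_OPENAI_EVAL_CHAT_DEPLOYMENT": "gpt-4o-mini",
}))  # gpt-4o-mini
```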

Recommended Foundry Screenshots

  • Evaluations list view for portfolio-level evidence.
  • Batch quality run details with summary and row-level metrics.
  • Red team run details with risk categories and outcomes.
  • Release-gate evidence view for decision making.
  • Prefer detailed run views over reduced pages that only show token counts.

Validation Snapshot

Current module tests:

| File | Tests |
| --- | --- |
| tests/evaluation/test_config.py | 15 |
| tests/evaluation/test_builtin_evaluators.py | 21 |
| tests/evaluation/test_custom_evaluators.py | 14 |
| tests/evaluation/test_monitoring.py | 6 |
| tests/evaluation/test_run_redteam.py | 7 |
| Total | 63 |

Why This Part Is a Milestone

Part 5 turns the project from a functional demo into an engineering-grade evaluable system.

  • You can compare behavior over time.
  • You can detect regressions before production.
  • You can produce safety evidence in a repeatable way.
  • You can keep a traceable path from query to metric.

This is the baseline for Part 6 (Human-in-the-Loop) and production quality gates.

Key Takeaways

  • Agent quality in production is not a single score. You need quality, safety, and traceability together.
  • Built-in evaluators and custom graph evaluators solve different problems and should be used as a combined gate.
  • Azure AI Foundry gives shared visibility for runs, while local artifacts preserve GraphRAG-specific evidence.
  • Missing ToolCallAccuracy is often a dataset-shape condition, not automatically a regression.
  • Red team outcomes should be treated as release evidence, not as an optional post-check.

What's Next

In Part 6, we will add Human-in-the-Loop controls to the same pipeline:

  • Approval gates for sensitive tool actions.
  • Explicit escalation paths for low-confidence responses.
  • Audit-friendly checkpoints connected to the same evaluation workflow.

Resources
