Building Evaluation, Cost Governance, and Observability for a Multi-Agent System in Microsoft Foundry

#azure #microsoft #foundry #agents

This closes out the series' capstone: the multi-agent customer support system built across Parts 6-9, now hardened with evaluation, cost governance, and observability so it can actually run in production with an on-call rotation behind it, not just in a demo environment.


Evaluation: measuring quality continuously, not just at launch

A one-time eval before launch tells you nothing about drift once real traffic — and real edge cases — start hitting the system. Set up a continuous evaluation pipeline using a G-Eval-style approach, where a separate model scores production outputs against explicit criteria:

import json

eval_criteria = {
    "correctness": "Does the response accurately reflect the order/refund status retrieved from the tools?",
    "escalation_appropriateness": "If the case was ambiguous or high-risk, did the agent escalate to a human rather than resolving it alone?",
    "tone": "Is the response professional and appropriately empathetic given the customer's stated frustration level?",
}

def geval_score(response, context, criterion_name, criterion_description, eval_model_client):
    # Ask a separate judge model to score one response against one named criterion.
    prompt = f"""Evaluate the following response against the criterion "{criterion_name}": {criterion_description}
Context: {context}
Response: {response}
Score from 1-5 and give one sentence of reasoning. Return JSON: {{"score": int, "reasoning": str}}"""
    # The client is assumed to return the model's raw JSON text.
    result = eval_model_client.complete(prompt)
    return json.loads(result)

def run_continuous_eval(sample_of_production_traffic, eval_model_client):
    # Collect per-criterion scores across the sampled interactions.
    scores = {crit: [] for crit in eval_criteria}
    for interaction in sample_of_production_traffic:
        for crit_name, crit_desc in eval_criteria.items():
            result = geval_score(interaction.response, interaction.context, crit_name, crit_desc, eval_model_client)
            scores[crit_name].append(result["score"])
    # Average per criterion; None when the sample was empty (avoids dividing by zero).
    return {crit: (sum(vals) / len(vals) if vals else None) for crit, vals in scores.items()}

Sample a percentage of real production traffic daily (not just synthetic test cases) and track these scores over time. A drop in escalation_appropriateness specifically is the metric most worth alerting on — it's a direct proxy for the system doing something risky without a human check, which is exactly the failure mode the recovery and authorization work in Parts 7 and 9 was designed to prevent.
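The daily sampling and the alert on a score drop could be sketched roughly as follows. This is a minimal illustration, not the series' actual implementation: the function names, the 7-day baseline window, and the 0.3-point drop threshold are all assumptions chosen for the example, and the right values depend on how noisy your scores are.

```python
import random
from statistics import mean

def sample_interactions(interactions, sample_rate, seed=None):
    """Return a random subset of the day's interactions (at least one if any exist)."""
    if not interactions:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(interactions) * sample_rate))
    return rng.sample(interactions, k)

def detect_score_drop(daily_scores, baseline_days=7, max_drop=0.3):
    """Compare the latest daily average against the mean of the preceding days.

    daily_scores: daily averages for one criterion, oldest first.
    baseline_days and max_drop are illustrative defaults, not recommendations.
    """
    if len(daily_scores) < 2:
        latest = daily_scores[-1] if daily_scores else None
        return {"baseline": None, "latest": latest, "alert": False}
    # Baseline is the window of days immediately before the latest one.
    history = daily_scores[-(baseline_days + 1):-1]
    baseline = mean(history)
    latest = daily_scores[-1]
    return {"baseline": baseline, "latest": latest, "alert": (baseline - latest) > max_drop}
```

Feeding the daily escalation_appropriateness averages into detect_score_drop gives a simple alert condition; a rolling baseline is used here so that gradual, expected shifts don't page anyone, while a sudden drop does.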

Cost governance: PTU vs. pay-as-you-go, decided with real math

For a system with predictable, sustained traffic (which a production support system should have), Provisioned Throughput (PTU) usually beats pay-as-you-go on cost — but the crossover point depends on your actual volume:

def compare_ptu_vs_payg(monthly_token_volume, ptu_monthly_cost, payg_per_1k_tokens):
    # Pay-as-you-go cost scales linearly with volume; PTU is a flat monthly commitment.
    payg_monthly_cost = (monthly_token_volume / 1000) * payg_per_1k_tokens
    return {
        "payg_monthly": payg_monthly_cost,
        "ptu_monthly": ptu_monthly_cost,
        "recommendation": "ptu" if ptu_monthly_cost < payg_monthly_cost else "payg",
        # Monthly token volume at which both options cost the same.
        "breakeven_tokens": (ptu_monthly_cost / payg_per_1k_tokens) * 1000,
    }

Run this quarterly, not once — traffic volume for a maturing production system tends to grow, and the PTU crossover point is usually reached faster than teams expect once an agent system is handling a meaningful fraction of real support volume.
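To make "faster than teams expect" concrete, here is a rough sketch that estimates how many months of growth it takes to reach the crossover. The function name and the compounding-growth model are assumptions for illustration; real traffic rarely grows this smoothly, so treat the result as a planning estimate only.

```python
import math

def months_until_ptu_breakeven(current_monthly_tokens, monthly_growth_rate,
                               ptu_monthly_cost, payg_per_1k_tokens):
    """Estimate whole months until pay-as-you-go spend reaches the PTU cost.

    Assumes compounding growth: volume_n = current * (1 + rate) ** n.
    Returns 0 if already at or past breakeven, None if volume is not growing.
    """
    breakeven_tokens = (ptu_monthly_cost / payg_per_1k_tokens) * 1000
    if current_monthly_tokens >= breakeven_tokens:
        return 0
    if monthly_growth_rate <= 0 or current_monthly_tokens <= 0:
        return None
    months = (math.log(breakeven_tokens / current_monthly_tokens)
              / math.log(1 + monthly_growth_rate))
    return math.ceil(months)
```

With purely hypothetical inputs (a volume at half the breakeven point and 10% monthly growth), this returns 8, since the volume needs to double and 1.1 compounded eight times is the first value above 2. That is well inside a single annual planning cycle, which is the case for rerunning the comparison quarterly.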

Chargeback tagging: attributing cost to the right owner

With multiple agents (fraud-check, refund, notification) potentially running on shared compute, tag at the project level so cost attribution doesn't require manual reconciliation later:

resource_tags = {
    "business-unit": "customer-support",
    "system": "multi-agent-refund-flow",
    "environment": "production",
    "cost-center": "CC-4471",
}

Apply these consistently at the Azure resource level (not just in application logs) so Cost Management reports can be filtered directly without a separate reconciliation step — this is the difference between a chargeback model that's usable monthly versus one that requires a data-engineering project every quarter.
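One way to keep tagging consistent is a periodic check that flags resources missing any required tag. The sketch below is an assumption-laden illustration: find_untagged_resources is a hypothetical helper, and the input shape (resource ID mapped to a tag dict) stands in for whatever inventory export you actually use.

```python
REQUIRED_TAGS = ("business-unit", "system", "environment", "cost-center")

def find_untagged_resources(resources, required_tags=REQUIRED_TAGS):
    """Return {resource_id: [missing tag names]} for resources lacking required tags.

    resources: mapping of resource id -> dict of tags (hypothetical shape).
    Tag names are compared case-insensitively here; a tag that is present
    but empty counts as missing.
    """
    gaps = {}
    for resource_id, tags in resources.items():
        present = {k.lower(): v for k, v in (tags or {}).items()}
        missing = [t for t in required_tags if not str(present.get(t) or "").strip()]
        if missing:
            gaps[resource_id] = missing
    return gaps
```

Running a check like this on a schedule, and failing a deployment pipeline when it returns anything, catches untagged resources before they show up as unattributed spend.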

Observability: the on-call-ready dashboard

Dashboard signal	Source	What it indicates
Request-level trace	Part 2 tracing patterns	Latency and failure location per agent step
Authorization denials	Part 9 identity logging	Potential security issue, not just a bug
Escalation rate vs. appropriateness score	Eval pipeline + agent logs	Whether the system is escalating correctly
Cost burn rate	Azure Cost Management tags	Budget overage risk before month-end

Pull together the tracing work from Part 2, the authorization logging from Part 9, and the eval scores above into a single dashboard an on-call engineer can actually use at 2am:

Request-level trace: which agents were invoked, in what order, with what latency per step (from Part 2's tracing patterns).
Authorization denials: any agent attempting an action outside its scope (from Part 9) — a spike here is a security signal, not just a bug signal.
Escalation rate: percentage of interactions escalated to a human, tracked against the eval-measured escalation_appropriateness score — a rising escalation rate paired with a falling appropriateness score means the system is escalating things it shouldn't, which is its own kind of problem.
Cost burn rate: token consumption against the PTU/PAYG budget, with an alert threshold before month-end overage becomes a surprise.
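The cost burn-rate signal in particular can be reduced to a small projection. This sketch assumes spend accrues evenly per day, which is a simplification, and the function name and the 90% alert ratio are illustrative choices rather than anything prescribed by Azure Cost Management.

```python
import calendar
from datetime import date

def project_month_end_spend(spend_to_date, today, monthly_budget, alert_ratio=0.9):
    """Linearly project month-end spend from the month-to-date figure.

    Alerts when the projection reaches alert_ratio of the budget, so the
    warning arrives before the overage does.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spend_to_date / today.day
    projected = daily_rate * days_in_month
    return {
        "projected_month_end": projected,
        "budget": monthly_budget,
        "alert": projected >= monthly_budget * alert_ratio,
    }
```

Alerting on the projection rather than on actual spend is the point: by the time month-to-date spend crosses the budget, there is nothing left to do about it.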

A concrete incident: what the on-call runbook actually looks like

All the observability infrastructure above is only as good as the runbook someone follows at 2am when an alert fires. Here's a worked example tying every prior post together into one incident response flow, using a realistic trigger: the escalation-rate alert from the dashboard fires, showing escalations up 3x over baseline in the last 30 minutes.

Step 1 — check the authorization denial log (Part 9). A spike in escalations correlated with a spike in authorization denials usually means an agent is attempting actions outside its scope — possibly a misconfigured deployment, possibly a prompt-injection attempt. This is checked first because it's the highest-severity possible cause.

Step 2 — check the circuit breaker state (Part 7). If a downstream dependency (the fraud-check API, say) is degraded, the circuit breaker should already be routing to human escalation rather than retrying — confirm it's open and working as designed, not that agents are timing out repeatedly without the breaker engaging.

Step 3 — check the eval scores for escalation_appropriateness (this post). If the score is stable and escalations are simply more frequent, this may be a legitimate traffic pattern (a genuinely higher-risk cohort of requests, e.g., during a known incident like a payment processor outage) rather than a system problem. If the score is dropping alongside the escalation spike, the system's judgment about when to escalate may itself be degrading — this points back toward Part 5's schema validation and Part 7's handoff logic as places to check for a recent regression.

Step 4 — check recent deployments against the canary process (Part 2). Cross-reference the timestamp of the spike against any recent flow, model version, or schema change. If a change went out in the last few hours without full canary ramp-up, that's the most likely single cause, and rollback is usually faster than root-causing forward.

def incident_triage(alert_context):
    # Checks run in runbook order: highest-severity cause first.
    # Each check function is defined elsewhere and is expected to return
    # a dict that includes a "severity" key.
    checks = [
        ("authorization_denials", check_authorization_spike),
        ("circuit_breaker_state", check_circuit_breaker_status),
        ("eval_score_trend", check_escalation_appropriateness_trend),
        ("recent_deployments", check_recent_flow_changes),
    ]
    findings = {}
    for name, check_fn in checks:
        findings[name] = check_fn(alert_context)
        # Stop at the first critical finding so the responder acts on it immediately.
        if findings[name].get("severity") == "critical":
            return {"triage_result": name, "findings": findings, "action": "immediate_rollback_or_escalation"}
    return {"triage_result": "inconclusive", "findings": findings, "action": "manual_investigation"}
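As an illustration of what one of those check functions might look like, here is a sketch of the Step 3 check. The two alert_context keys, the 0.3-point threshold, and the "info"/"unknown" severity labels are all assumptions made for this example; only the "critical" label and the dict-with-severity return shape come from the triage function.

```python
def check_escalation_appropriateness_trend(alert_context, max_drop=0.3):
    """Classify an escalation spike using the eval score trend (Step 3).

    alert_context is assumed to carry two hypothetical keys:
      "baseline_appropriateness": average score before the spike
      "current_appropriateness": average score during the spike
    """
    baseline = alert_context.get("baseline_appropriateness")
    current = alert_context.get("current_appropriateness")
    if baseline is None or current is None:
        return {"severity": "unknown", "detail": "no eval scores available for this window"}
    drop = baseline - current
    if drop > max_drop:
        # Score falling alongside the spike: escalation judgment may be degrading.
        return {
            "severity": "critical",
            "detail": f"appropriateness fell by {drop:.2f}; check recent schema or handoff changes",
        }
    # Score stable: the spike may be a legitimately higher-risk traffic cohort.
    return {"severity": "info", "detail": "score stable; spike may reflect a higher-risk traffic cohort"}
```

Returning a plain dict with a severity field keeps every check interchangeable, so adding a fifth check to the runbook later means adding one tuple to the list.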

Writing this ordering down explicitly — check security signals before assuming it's a quality regression, check for a bad deploy before deep root-causing — is what turns nine posts' worth of individually reasonable safeguards into something an on-call engineer who didn't build the system can actually execute under pressure.