This closes out the series' capstone: the multi-agent customer support system built across Parts 6-9, now hardened with evaluation, cost governance, and observability so it can actually run in production with an on-call rotation behind it, not just in a demo environment.
Evaluation: measuring quality continuously, not just at launch
A one-time eval before launch tells you nothing about drift once real traffic — and real edge cases — start hitting the system. Set up a continuous evaluation pipeline using a G-Eval-style approach, where a separate model scores production outputs against explicit criteria:
```python
import json

# Criteria the judge model scores each sampled interaction against.
eval_criteria = {
    "correctness": "Does the response accurately reflect the order/refund status retrieved from the tools?",
    "escalation_appropriateness": "If the case was ambiguous or high-risk, did the agent escalate to a human rather than resolving it alone?",
    "tone": "Is the response professional and appropriately empathetic given the customer's stated frustration level?",
}


def geval_score(response, context, criterion_name, criterion_description, eval_model_client):
    """Ask the judge model for a 1-5 score on one criterion and parse its JSON reply."""
    prompt = f"""Evaluate the following response against this criterion: {criterion_description}
Context: {context}
Response: {response}
Score from 1-5 and give one sentence of reasoning. Return JSON: {{"score": int, "reasoning": str}}"""
    # eval_model_client stands for your judge-model wrapper; complete() is assumed to return a JSON string.
    result = eval_model_client.complete(prompt)
    return json.loads(result)


def run_continuous_eval(sample_of_production_traffic, eval_model_client):
    """Return the mean score per criterion over a sample of production interactions."""
    scores = {crit: [] for crit in eval_criteria}
    for interaction in sample_of_production_traffic:
        for crit_name, crit_desc in eval_criteria.items():
            result = geval_score(interaction.response, interaction.context, crit_name, crit_desc, eval_model_client)
            scores[crit_name].append(result["score"])
    # Skip criteria with no scores so an empty sample does not divide by zero.
    return {crit: sum(vals) / len(vals) for crit, vals in scores.items() if vals}
```
Sample a percentage of real production traffic daily (not just synthetic test cases) and track these scores over time. A drop in `escalation_appropriateness` is the signal most worth alerting on: it is a direct proxy for the system doing something risky without a human check, which is exactly the failure mode the recovery and authorization work in Parts 7 and 9 was designed to prevent.
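One way to turn daily scores into an alert is to compare a recent window against a longer baseline. The sketch below is illustrative only and not part of the series' code: the function name `detect_score_drop`, the 7-day window, and the 0.3-point threshold are all assumptions to tune against your own score variance.

```python
from statistics import mean


def detect_score_drop(daily_scores, recent_days=7, max_drop=0.3):
    """Flag a drop when the recent mean falls more than max_drop below the baseline mean.

    daily_scores: daily mean scores on the 1-5 scale, oldest first.
    recent_days and max_drop are illustrative defaults, not recommended values.
    """
    if len(daily_scores) <= recent_days:
        # Not enough history to form a baseline yet.
        return {"alert": False, "reason": "insufficient_history"}
    baseline = mean(daily_scores[:-recent_days])
    recent = mean(daily_scores[-recent_days:])
    drop = baseline - recent
    return {"alert": drop > max_drop, "baseline": baseline, "recent": recent, "drop": drop}
```

Feeding it the per-day `escalation_appropriateness` averages from the eval run gives a single boolean to wire into whatever alerting you already have. A windowed mean is deliberately crude; it trades sensitivity for fewer false alarms on a noisy judge model.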
Cost governance: PTU vs. pay-as-you-go, decided with real math
For a system with predictable, sustained traffic (which a production support system should have), Provisioned Throughput (PTU) usually beats pay-as-you-go on cost — but the crossover point depends on your actual volume:
```python
def compare_ptu_vs_payg(monthly_token_volume, ptu_monthly_cost, payg_per_1k_tokens):
    """Compare a flat monthly PTU commitment against pay-as-you-go at a given token volume."""
    payg_monthly_cost = (monthly_token_volume / 1000) * payg_per_1k_tokens
    return {
        "payg_monthly": payg_monthly_cost,
        "ptu_monthly": ptu_monthly_cost,
        "recommendation": "ptu" if ptu_monthly_cost < payg_monthly_cost else "payg",
        # Monthly token volume at which both options cost the same.
        "breakeven_tokens": (ptu_monthly_cost / payg_per_1k_tokens) * 1000,
    }
```
Run this quarterly, not once — traffic volume for a maturing production system tends to grow, and the PTU crossover point is usually reached faster than teams expect once an agent system is handling a meaningful fraction of real support volume.
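To make "faster than teams expect" concrete, the breakeven volume can be combined with a growth estimate. This is a rough sketch under a strong assumption, steady compound month-over-month growth; `months_until_breakeven` and its parameters are hypothetical names, and the numbers in the usage note are made up for illustration.

```python
import math


def months_until_breakeven(current_monthly_tokens, breakeven_tokens, monthly_growth_rate):
    """Estimate months of compound growth until volume reaches the PTU breakeven.

    Returns 0 if volume is already at or past breakeven, and None if volume
    is not growing (the breakeven is never reached under this model).
    """
    if current_monthly_tokens >= breakeven_tokens:
        return 0
    if monthly_growth_rate <= 0 or current_monthly_tokens <= 0:
        return None
    # Solve current * (1 + g)^n >= breakeven for n, then round up to whole months.
    months = math.log(breakeven_tokens / current_monthly_tokens) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)
```

For example, a system at 200 million tokens per month with a 400 million breakeven and 10% monthly growth crosses over in 8 months, which is inside the horizon of two or three quarterly reviews.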
Chargeback tagging: attributing cost to the right owner
With multiple agents (fraud-check, refund, notification) potentially running on shared compute, tag at the project level so cost attribution doesn't require manual reconciliation later:
```python
resource_tags = {
    "business-unit": "customer-support",
    "system": "multi-agent-refund-flow",
    "environment": "production",
    "cost-center": "CC-4471",
}
```
Apply these consistently at the Azure resource level (not just in application logs) so Cost Management reports can be filtered directly without a separate reconciliation step — this is the difference between a chargeback model that's usable monthly versus one that requires a data-engineering project every quarter.
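Consistency is easier to keep if something checks for it. The following is a small sketch of a tag audit, assuming you can export your resource inventory as a list of dicts with `name` and `tags` keys; that shape and the name `find_untagged_resources` are hypothetical, not an Azure API.

```python
REQUIRED_TAGS = {"business-unit", "system", "environment", "cost-center"}


def find_untagged_resources(resources, required_tags=REQUIRED_TAGS):
    """Return {resource_name: sorted missing tag keys} for resources lacking required tags.

    resources: iterable of dicts like {"name": str, "tags": dict or None}.
    This shape is an assumption standing in for your own inventory export.
    """
    gaps = {}
    for resource in resources:
        present = set((resource.get("tags") or {}).keys())
        missing = sorted(required_tags - present)
        if missing:
            gaps[resource["name"]] = missing
    return gaps
```

Running a check like this on a schedule catches the resource someone created by hand during an incident, which is typically where untagged spend comes from.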
Observability: the on-call-ready dashboard
Pull together the tracing work from Part 2, the authorization logging from Part 9, and the eval scores above into a single dashboard an on-call engineer can actually use at 2am:
- Request-level trace: which agents were invoked, in what order, with what latency per step (from Part 2's tracing patterns).
- Authorization denials: any agent attempting an action outside its scope (from Part 9). A spike here is a security signal, not just a bug signal.
- Escalation rate: percentage of interactions escalated to a human, tracked against the eval-measured `escalation_appropriateness` score. A rising escalation rate paired with a falling appropriateness score means the system is escalating things it shouldn't, which is its own kind of problem.
- Cost burn rate: token consumption against the PTU/PAYG budget, with an alert threshold before month-end overage becomes a surprise.

| Dashboard signal | Source | What it indicates |
|---|---|---|
| Request-level trace | Part 2 tracing patterns | Latency and failure location per agent step |
| Authorization denials | Part 9 identity logging | Potential security issue, not just a bug |
| Escalation rate vs. appropriateness score | Eval pipeline + agent logs | Whether the system is escalating correctly |
| Cost burn rate | Azure Cost Management tags | Budget overage risk before month-end |
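The cost burn-rate signal can be approximated with a simple month-end projection. The sketch below assumes spend accrues roughly linearly through the month, which fits pay-as-you-go usage better than a flat PTU commitment; `project_month_end_spend` and the 90% alert ratio are illustrative assumptions.

```python
import calendar
from datetime import date


def project_month_end_spend(spend_to_date, today, monthly_budget, alert_ratio=0.9):
    """Linearly project month-end spend from spend so far and flag likely overage.

    alert_ratio is the fraction of budget at which to alert; 0.9 is an arbitrary default.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_rate = spend_to_date / today.day
    projected = daily_rate * days_in_month
    return {
        "projected_spend": projected,
        "alert": projected >= alert_ratio * monthly_budget,
    }
```

Alerting on the projection rather than on spend to date is what gives the on-call engineer time to act before the budget is actually exceeded.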
A concrete incident: what the on-call runbook actually looks like
All the observability infrastructure above is only as good as the runbook someone follows at 2am when an alert fires. Here's a worked example tying every prior post together into one incident response flow, using a realistic trigger: the escalation-rate alert from the dashboard fires, showing escalations up 3x over baseline in the last 30 minutes.
Step 1 — check the authorization denial log (Part 9). A spike in escalations correlated with a spike in authorization denials usually means an agent is attempting actions outside its scope — possibly a misconfigured deployment, possibly a prompt-injection attempt. This is checked first because it's the highest-severity possible cause.
Step 2 — check the circuit breaker state (Part 7). If a downstream dependency (the fraud-check API, say) is degraded, the circuit breaker should already be routing to human escalation rather than retrying — confirm it's open and working as designed, not that agents are timing out repeatedly without the breaker engaging.
Step 3 — check the eval scores for escalation_appropriateness (this post). If the score is stable and escalations are simply more frequent, this may be a legitimate traffic pattern (a genuinely higher-risk cohort of requests, e.g., during a known incident like a payment processor outage) rather than a system problem. If the score is dropping alongside the escalation spike, the system's judgment about when to escalate may itself be degrading — this points back toward Part 5's schema validation and Part 7's handoff logic as places to check for a recent regression.
Step 4 — check recent deployments against the canary process (Part 2). Cross-reference the timestamp of the spike against any recent flow, model version, or schema change. If a change went out in the last few hours without full canary ramp-up, that's the most likely single cause, and rollback is usually faster than root-causing forward.
```python
def incident_triage(alert_context):
    """Run the runbook checks in severity order and stop at the first critical finding."""
    # Order matters: security signals first, recent deployments last.
    checks = [
        ("authorization_denials", check_authorization_spike),
        ("circuit_breaker_state", check_circuit_breaker_status),
        ("eval_score_trend", check_escalation_appropriateness_trend),
        ("recent_deployments", check_recent_flow_changes),
    ]
    findings = {}
    for name, check_fn in checks:
        findings[name] = check_fn(alert_context)
        if findings[name].get("severity") == "critical":
            return {"triage_result": name, "findings": findings, "action": "immediate_rollback_or_escalation"}
    return {"triage_result": "inconclusive", "findings": findings, "action": "manual_investigation"}
```
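The check functions that `incident_triage` calls are not defined in this post. As a rough sketch of what two of them might look like, here are versions of the Step 3 and Step 4 checks; the `alert_context` keys, the 0.3-point drop threshold, and the 6-hour lookback are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def check_escalation_appropriateness_trend(alert_context, max_drop=0.3):
    """Step 3: separate 'more escalations, same judgment' from degrading judgment."""
    drop = alert_context["baseline_appropriateness"] - alert_context["recent_appropriateness"]
    if drop > max_drop:
        return {"severity": "critical", "detail": f"appropriateness fell by {drop:.2f}"}
    return {"severity": "info", "detail": "score stable; spike may be a legitimate traffic pattern"}


def check_recent_flow_changes(alert_context, lookback=timedelta(hours=6)):
    """Step 4: flag deployments shortly before the spike that skipped full canary ramp-up."""
    spike_at = alert_context["spike_at"]
    suspects = [
        d["name"]
        for d in alert_context["deployments"]
        if timedelta(0) <= spike_at - d["deployed_at"] <= lookback and not d["canary_complete"]
    ]
    if suspects:
        return {"severity": "critical", "suspects": suspects}
    return {"severity": "info", "suspects": []}


# Hypothetical alert context: stable eval score, one un-canaried deploy 2.5 hours before the spike.
context = {
    "baseline_appropriateness": 4.4,
    "recent_appropriateness": 4.3,
    "spike_at": datetime(2025, 1, 15, 2, 0),
    "deployments": [
        {"name": "refund-flow-v12", "deployed_at": datetime(2025, 1, 14, 23, 30), "canary_complete": False},
    ],
}
print(check_escalation_appropriateness_trend(context)["severity"])
print(check_recent_flow_changes(context)["suspects"])
```

Each check returns a dict with a `severity` key, which is the only contract `incident_triage` relies on; anything else in the dict is context for whoever reads the findings.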
Writing this ordering down explicitly — check security signals before assuming it's a quality regression, check for a bad deploy before deep root-causing — is what turns nine posts' worth of individually reasonable safeguards into something an on-call engineer who didn't build the system can actually execute under pressure.
References
- Azure AI Foundry evaluation SDK: https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk
- G-Eval and LLM-as-judge evaluation approach: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/observability
- Provisioned Throughput Units (PTU) for Azure OpenAI: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput
- Azure Cost Management and tagging: https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-best-practices

