Why Your AI Agent Monitoring is Wrong (And How to Fix It)

#ai #sre #devops #fintech

As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement semantic monitoring in production.
Your AI agent is running in production.
HTTP 200. Uptime 99.9%. All dashboards are green.
But it's making the wrong decision 30% of the time.
Your monitoring won't tell you.
The Gap
I spent six months figuring this out the hard way. Traditional SRE monitoring measures infrastructure. Network latency. Error rates. Uptime. It's designed for services that crash when they break. But agents don't crash. They degrade. Slowly. Silently.
An agent can be:

94% accurate (still 94%)
But losing confidence (0.92 to 0.41)
Compensating by calling tools 3x more (1.1x to 3.1x)
While humans reject more of its output (1% to 19%)
As work piles up waiting for approval (8 to 340 items)

Your monitoring sees "everything is fine."
You see $2M impact by the time you notice.
What We Actually Need to Measure
Not infrastructure metrics. Semantic metrics.
Four things:
Decision Quality Rate (DQR)
Is the agent picking the right tool?
Healthy: 92%+
Threshold for action: <85%
Tool Invocation Efficiency (TIE)
Is it over-compensating by calling tools more than normal?
Healthy: 1.0-1.2x baseline
Threshold for action: >1.5x
Human Escalation Rate (HER)
Are humans rejecting its decisions?
Healthy: <2%
Threshold for action: >5%
Approval Queue Depth Drift (AQDD)
Is work piling up waiting for approval?
Healthy: <20 pending
Threshold for action: >50 pending
When any of these drifts, semantic failure is roughly 48 hours away.
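A minimal sketch of how the four thresholds above could be encoded. It assumes a three-state classification where "warning" is whatever falls between the healthy range and the action threshold; that middle band, and the names `Band`, `THRESHOLDS`, and `classify`, are illustrative and not taken from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    healthy: float          # edge of the healthy range
    action: float           # edge past which action is needed
    higher_is_better: bool  # True for DQR, False for the other three


# Numbers come from the metric list above. The "warning" band between
# them is an assumption of this sketch, and values sitting exactly on
# the healthy edge are treated as healthy for simplicity.
THRESHOLDS = {
    "dqr": Band(healthy=92.0, action=85.0, higher_is_better=True),
    "tie": Band(healthy=1.2, action=1.5, higher_is_better=False),
    "her": Band(healthy=2.0, action=5.0, higher_is_better=False),
    "aqdd": Band(healthy=20.0, action=50.0, higher_is_better=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'healthy', 'warning', or 'action' for one metric reading."""
    band = THRESHOLDS[metric]
    if band.higher_is_better:
        if value >= band.healthy:
            return "healthy"
        return "action" if value < band.action else "warning"
    if value <= band.healthy:
        return "healthy"
    return "action" if value > band.action else "warning"
```

Under these assumptions, a DQR of 88% or a TIE of 1.4x classifies as "warning": outside the healthy range, but not yet past the action threshold.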
Real Scenario
Tuesday 2pm: Agent starts degrading. DQR drops from 94% to 88%. TIE increases from 1.1x to 1.4x. Nothing alarming yet by traditional metrics.
Your infrastructure monitoring stays green.
Thursday 10am: DQR at 62%. TIE at 3.1x. Queue at 340 items.
Your first alert finally fires - from your infrastructure monitoring noticing error rates creeping up.
You've just lost 40+ hours of bad decisions.
With semantic SLIs, you would have known Tuesday at 2:15pm.
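One way an early signal like this could be produced is by comparing each reading against a baseline instead of waiting for a hard threshold. The sketch below is an assumption about how such a check might look, not a description of the actual system: the 5% tolerance and the function name `drifted_metrics` are made up for illustration.

```python
def drifted_metrics(baseline, snapshot, tolerance=0.05):
    """Return metrics whose relative change from baseline exceeds tolerance.

    The result maps metric name to signed fractional change,
    e.g. -0.064 means a 6.4% drop from baseline.
    """
    flagged = {}
    for name, base in baseline.items():
        change = (snapshot[name] - base) / base
        if abs(change) > tolerance:
            flagged[name] = round(change, 3)
    return flagged


# Baseline and Tuesday readings from the scenario above.
baseline = {"dqr": 94.0, "tie": 1.1}
tuesday = {"dqr": 88.0, "tie": 1.4}

for name, change in drifted_metrics(baseline, tuesday).items():
    print(f"{name} drifted {change:+.1%} from baseline")
```

With a 5% tolerance, both Tuesday readings are flagged even though neither has crossed its action threshold yet. A real detector would likely smooth over a rolling window to avoid firing on a single noisy sample.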
How We Built It
We built a semantic SLI monitoring system that:

Tracks what matters - DQR, TIE, HER, AQDD (not uptime)
Detects degradation early - 48 hours before traditional SLIs
Suggests remediation - Not just "something's wrong"
Automates response - Progressive autonomy constraints

When degradation is detected:

Agent autonomy automatically constrained (FULL → GUIDED → SUPERVISED → BLOCKED)
Slack notification sent with context
Remediation steps suggested (prioritized by success rate)
Everything tracked for audit and learning
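The autonomy ladder can be sketched as an ordered enum with a step-down rule. The four level names come from the list above. The policy itself (one step down per breached metric, capped at BLOCKED) is an assumption for illustration, as are the names `Autonomy` and `constrain`.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    FULL = 0
    GUIDED = 1
    SUPERVISED = 2
    BLOCKED = 3


def constrain(current: Autonomy, breached_metrics: int) -> Autonomy:
    """Step down one level per breached metric, never past BLOCKED.

    Hypothetical policy: the real escalation rule may weigh metrics
    differently or require sustained breaches before stepping down.
    """
    steps = max(breached_metrics, 0)
    return Autonomy(min(int(current) + steps, int(Autonomy.BLOCKED)))


level = constrain(Autonomy.FULL, breached_metrics=2)
print(level.name)
```

This sketch only covers tightening. How and when autonomy is restored after recovery is a separate decision and is not shown.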

Code Example
    from agentsre.orchestration import FintechSREOrchestrator, AgentRole, AlertManager

    # Initialize orchestrator
    orch = FintechSREOrchestrator()
    orch.register_agent("payment-1", AgentRole.PAYMENT_PROCESSOR)

    # Initialize alerts
    alerts = AlertManager()

    def on_critical_alert(alert_dict):
        # send_to_slack is your own Slack integration (not shown here)
        send_to_slack(alert_dict)

    alerts.slack_handler = on_critical_alert

    # Update metrics as agent runs
    orch.update_metrics(
        agent_id="payment-1",
        dqr=62.0,  # Decision quality degraded
        tie=2.8,   # Tool calls increased
        her=15.0,  # Escalations up
        aqd=180,   # Queue growing
        confidence=0.42,
        cost=0.0003
    )

    # Create alert with remediation suggestions
    alert = alerts.create_alert(
        agent_id="payment-1",
        reason="Semantic degradation detected",
        triggered_metrics=["DQR", "TIE", "HER"],
        current_values={
            "dqr": 62.0,
            "tie": 2.8,
            "her": 15.0,
            "aqd": 180
        }
    )

    # Get remediation steps
    for step in alert.suggested_remediations[:3]:
        print(f"→ {step.action} ({step.estimated_time_minutes}min)")
Output:
→ Review latest 10 agent decisions - identify pattern (15min)
→ Check upstream service - likely returning bad data (10min)
→ Agent over-compensating - check confidence scores (10min)
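Remediation steps are described earlier as "prioritized by success rate". How the library ranks them is not shown here, so the following is only one plausible approach: track how often each step resolved past incidents and sort on that. The fields `action` and `estimated_time_minutes` mirror the attributes used in the example above. `attempts`, `successes`, and `prioritize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Remediation:
    action: str
    estimated_time_minutes: int
    attempts: int = 0   # hypothetical: times this step was tried
    successes: int = 0  # hypothetical: times it resolved the incident

    @property
    def success_rate(self) -> float:
        # Steps with no history rank last instead of dividing by zero.
        return self.successes / self.attempts if self.attempts else 0.0


def prioritize(steps):
    """Highest success rate first; ties broken by the quicker step."""
    return sorted(
        steps,
        key=lambda s: (-s.success_rate, s.estimated_time_minutes),
    )
```

Breaking ties on estimated time means that, between two equally reliable steps, the on-call engineer sees the cheaper one first.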
What This Means for SRE
You're not just detecting problems. You're understanding them.
Instead of:

"Error rate is high"
"Latency is up"
"Something's wrong"

You get:

"Agent decision quality dropped 15%, tool calls increased 2.8x, humans rejecting 15% of output, 180 items pending approval"
Suggested fix: Check upstream service (likely corrupting data)
Severity: CRITICAL

That's the difference between reactive and proactive reliability.
Open Source
Built all this open source. MIT licensed.
Tested in production at scale. Works with LangChain, CrewAI, Bedrock.
GitHub: https://github.com/Ajay150313/agentsre
For Your Team
If you're running agents in production, you probably have this problem too. You just don't know it yet.
Try semantic SLIs. If you catch something you didn't know was degrading (most teams do), you'll know it was worth it.
The cost of not knowing? Sometimes it's $2M.

Top comments (2)

Jasmine Park • May 25

Semantic SLOs are the right reframe. The piece I would add is OTel GenAI semantic conventions as the standardization layer underneath. Otherwise every team rolls a different schema for prompt_id, tool_call_id, model_name, and you cannot move off your tracing vendor without rewriting your alerts. We picked OTel-native from day 1, joined trace spans to invoice-level cost rollups nightly with per-team attribution, and got rid of three bespoke monitoring layers. The HTTP 200 problem you described is exactly why an LLM gateway emitting OTel spans with output_quality_score (from an inline judge) lets you alert on quality, not just uptime. Spec is still evolving but stable enough to commit. What schema are you using for the semantic signals?
