As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement semantic monitoring in production.
Your AI agent is running in production.
HTTP 200. Uptime 99.9%. All dashboards are green.
But it's making the wrong decision 30% of the time.
Your monitoring won't tell you.
The Gap
I spent six months figuring this out the hard way. Traditional SRE monitoring measures infrastructure. Network latency. Error rates. Uptime. It's designed for services that crash when they break. But agents don't crash. They degrade. Slowly. Silently.
An agent can be:
94% accurate (still 94%)
But losing confidence (0.92 to 0.41)
Compensating by calling tools 3x more (1.1x to 3.1x)
While humans reject more of its output (1% to 19%)
As work piles up waiting for approval (8 to 340 items)
Your monitoring sees "everything is fine."
You see $2M impact by the time you notice.
What We Actually Need to Measure
Not infrastructure metrics. Semantic metrics.
Four things:
Decision Quality Rate (DQR)
Is the agent picking the right tool?
Healthy: 92%+
Threshold for action: <85%
Tool Invocation Efficiency (TIE)
Is it over-compensating by calling tools more than normal?
Healthy: 1.0-1.2x baseline
Threshold for action: >1.5x
Human Escalation Rate (HER)
Are humans rejecting its decisions?
Healthy: <2%
Threshold for action: >5%
Approval Queue Depth Drift (AQDD)
Is work piling up waiting for approval?
Healthy: <20 pending
Threshold for action: >50 pending
When any of these drifts, semantic failure is roughly 48 hours away.
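The thresholds above translate into a small classifier. This is a minimal sketch, not the agentsre implementation: the `Band` type, the `classify` and `classify_all` functions, and the middle "warning" band between healthy and action are assumptions made for illustration, as is treating boundary values as healthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Healthy limit and action threshold for one semantic SLI."""
    healthy: float           # edge of the healthy range
    action: float            # value past which action is needed
    higher_is_better: bool   # True for DQR; False for TIE, HER, AQDD


# Numbers taken from the thresholds listed above.
BANDS = {
    "dqr": Band(healthy=92.0, action=85.0, higher_is_better=True),
    "tie": Band(healthy=1.2, action=1.5, higher_is_better=False),
    "her": Band(healthy=2.0, action=5.0, higher_is_better=False),
    "aqdd": Band(healthy=20.0, action=50.0, higher_is_better=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'healthy', 'warning', or 'action' for a single reading."""
    band = BANDS[metric]
    if band.higher_is_better:
        if value < band.action:
            return "action"
        return "healthy" if value >= band.healthy else "warning"
    if value > band.action:
        return "action"
    return "healthy" if value <= band.healthy else "warning"


def classify_all(readings: dict) -> dict:
    """Classify every reading in a {metric_name: value} snapshot."""
    return {name: classify(name, value) for name, value in readings.items()}


# The degraded agent described earlier: accuracy unchanged, everything else off.
print(classify_all({"dqr": 94.0, "tie": 3.1, "her": 19.0, "aqdd": 340}))
# {'dqr': 'healthy', 'tie': 'action', 'her': 'action', 'aqdd': 'action'}
```

Accuracy alone reads as healthy here. The other three signals are what expose the degradation.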
Real Scenario
Tuesday 2pm: Agent starts degrading. DQR drops from 94% to 88%. TIE increases from 1.1x to 1.4x. Nothing alarming yet by traditional metrics.
Your infrastructure monitoring stays green.
Thursday 10am: DQR at 62%. TIE at 3.1x. Queue at 340 items.
Your first alert finally fires - from your infrastructure monitoring noticing error rates creeping up.
You've just lost 40+ hours to bad decisions.
With semantic SLIs, you would have known Tuesday at 2:15pm.
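Catching drift that early depends on computing DQR over recent decisions rather than over all time. A rough sketch, assuming each decision can be labeled correct or incorrect (for example, from human review). The `RollingDQR` class, its window size, and the labeling source are hypothetical and not part of any library.

```python
from collections import deque


class RollingDQR:
    """Decision Quality Rate over the most recent `window` decisions."""

    def __init__(self, window: int = 200, healthy: float = 92.0):
        self.outcomes = deque(maxlen=window)  # True = correct decision
        self.healthy = healthy                # lower edge of the healthy range

    def record(self, correct: bool) -> None:
        """Add one labeled decision; the oldest one falls out when full."""
        self.outcomes.append(correct)

    def rate(self) -> float:
        """Percentage of correct decisions in the window (100.0 if empty)."""
        if not self.outcomes:
            return 100.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def drifting(self) -> bool:
        """True once the window is full and the rate is below healthy."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.rate() < self.healthy


dqr = RollingDQR(window=50)
for i in range(50):
    dqr.record(i % 8 != 0)  # 7 of these 50 decisions are marked wrong
print(dqr.rate(), dqr.drifting())  # 86.0 True
```

A short window reacts faster but is noisier. The right size depends on decision volume.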
How We Built It
We built a semantic SLI monitoring system that:
Tracks what matters - DQR, TIE, HER, AQDD (not uptime)
Detects degradation early - 48 hours before traditional SLIs
Suggests remediation - Not just "something's wrong"
Automates response - Progressive autonomy constraints
When degradation is detected:
Agent autonomy automatically constrained (FULL → GUIDED → SUPERVISED → BLOCKED)
Slack notification sent with context
Remediation steps suggested (prioritized by success rate)
Everything tracked for audit and learning
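The autonomy ladder can be modeled as an ordered enum. A minimal sketch under stated assumptions: the level names come from the list above, but the `Autonomy` enum, the `constrain` function, and the "one step per breached SLI" policy are hypothetical and may differ from what agentsre actually does.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Autonomy levels, ordered from least to most constrained."""
    FULL = 0
    GUIDED = 1
    SUPERVISED = 2
    BLOCKED = 3


def constrain(current: Autonomy, breached: int) -> Autonomy:
    """Return the stricter of the current level and the level implied by
    the number of breached SLIs (hypothetical policy: one step per breach)."""
    implied = Autonomy(max(0, min(breached, int(Autonomy.BLOCKED))))
    return max(current, implied)


level = constrain(Autonomy.FULL, breached=2)
print(level.name)  # SUPERVISED
```

Because `constrain` never returns a looser level than the current one, restoring autonomy stays a deliberate, separate step in this sketch.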
Code Example
```python
from agentsre.orchestration import FintechSREOrchestrator, AgentRole, AlertManager

# Initialize orchestrator
orch = FintechSREOrchestrator()
orch.register_agent("payment-1", AgentRole.PAYMENT_PROCESSOR)

# Initialize alerts
alerts = AlertManager()

def on_critical_alert(alert_dict):
    # send_to_slack is not defined in this snippet; supply your own Slack poster
    send_to_slack(alert_dict)

alerts.slack_handler = on_critical_alert

# Update metrics as agent runs
orch.update_metrics(
    agent_id="payment-1",
    dqr=62.0,         # Decision quality degraded
    tie=2.8,          # Tool calls increased
    her=15.0,         # Escalations up
    aqd=180,          # Queue growing
    confidence=0.42,
    cost=0.0003
)

# Create alert with remediation suggestions
alert = alerts.create_alert(
    agent_id="payment-1",
    reason="Semantic degradation detected",
    triggered_metrics=["DQR", "TIE", "HER"],
    current_values={
        "dqr": 62.0,
        "tie": 2.8,
        "her": 15.0,
        "aqd": 180
    }
)

# Get remediation steps
for step in alert.suggested_remediations[:3]:
    print(f"→ {step.action} ({step.estimated_time_minutes}min)")
```
Output:
→ Review latest 10 agent decisions - identify pattern (15min)
→ Check upstream service - likely returning bad data (10min)
→ Agent over-compensating - check confidence scores (10min)
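An ordering like the one above can come from tracking how often each step has actually resolved an incident. This is a sketch of one way to do that, not the library's method: the `Remediation` class and `rank` function are hypothetical, the counts are made up for the demo, and Laplace smoothing is used here so untried steps start at 0.5 instead of 0 or 1.

```python
from dataclasses import dataclass


@dataclass
class Remediation:
    action: str
    attempts: int = 0    # times this step was tried
    successes: int = 0   # times it resolved the incident

    def success_rate(self) -> float:
        """Laplace-smoothed success rate (an assumption, see lead-in)."""
        return (self.successes + 1) / (self.attempts + 2)


def rank(steps):
    """Sort remediation steps from most to least likely to help."""
    return sorted(steps, key=lambda s: s.success_rate(), reverse=True)


# Counts below are illustrative only, not measured results.
candidates = [
    Remediation("Check upstream service", attempts=10, successes=3),
    Remediation("Review latest 10 agent decisions", attempts=10, successes=8),
    Remediation("Check confidence scores"),
]
for step in rank(candidates):
    print(f"→ {step.action} ({step.success_rate():.2f})")
```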
What This Means for SRE
You're not just detecting problems. You're understanding them.
Instead of:
"Error rate is high"
"Latency is up"
"Something's wrong"
You get:
"Agent decision quality dropped to 62%, tool calls at 2.8x baseline, humans rejecting 15% of output, 180 items pending approval"
Suggested fix: Check upstream service (likely corrupting data)
Severity: CRITICAL
That's the difference between reactive and proactive reliability.
Open Source
Built all this open source. MIT licensed.
Tested in production at scale. Works with LangChain, CrewAI, Bedrock.
GitHub: https://github.com/Ajay150313/agentsre
For Your Team
If you're running agents in production, you probably have this problem too. You just don't know it yet.
Try semantic SLIs. If you catch something you didn't know was degrading (most teams do), you'll know it was worth it.
The cost of not knowing? Sometimes it's $2M.