
Ajay Devineni


Your OTel Traces Are Lying to You: Observability for the Reasoning Layer

Three weeks ago someone on the AWS Builders Slack posted something that stopped me cold. Their production AI agent had been running for six hours. CPU normal. Memory stable. Latency within SLO. Zero error rate in CloudWatch.
The agent was re-planning on every single task. One tool kept returning stale data. The agent recognized it, switched tools, got a different failure, re-planned again. It completed tasks — slowly, expensively, with degrading output quality. Nothing in the dashboard moved.
This is not an edge case. This is the default failure mode of agentic AI in production, and your current observability stack cannot see it.
Why OTel Misses the Problem
OpenTelemetry is the best thing that's happened to observability in a decade. Traces, metrics, logs — stable across all three signal types as of the 2026 CNCF milestone. Auto-instrumentation is production-grade. The ecosystem is mature.
And for agent reasoning behavior, it is the wrong level of abstraction.
OTel traces infrastructure execution. A trace shows you: this request arrived, it called this service, that service called this database, the database returned in 42ms, the response went back. Perfect for distributed systems.
An agent doesn't execute a fixed call graph. An agent reasons. It evaluates state, picks a tool, observes the result, decides whether to continue or re-plan, picks another tool. The reasoning path is dynamic. The same input can produce different call graphs on different runs depending on what the tools return.
The key shift is that once agent reasoning is exported into your observability stack, traces stop showing infrastructure execution and start showing reasoning behavior, but only if you're emitting the right data (Kore.ai).
Most teams aren't. They're emitting infrastructure spans. The reasoning is invisible.
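To make "emitting the right data" concrete, here is a minimal sketch of an agent loop that records its own decisions alongside its tool calls. Everything in it is hypothetical (`ReasoningLog`, `run_task`, the `accept` callback, the toy tools); it assumes the agent can judge whether a tool result is usable, which is the step that infrastructure spans never see.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReasoningLog:
    """Hypothetical per-task record of what the agent decided, not just what it called."""
    tools_called: List[str] = field(default_factory=list)
    replan_reasons: List[str] = field(default_factory=list)
    completed: bool = False
    escalated: bool = False


def run_task(
    plan: List[str],
    tools: Dict[str, Callable[[], dict]],
    accept: Callable[[dict], bool],
) -> ReasoningLog:
    """Walk the candidate tools in order, logging a re-plan each time a result is rejected."""
    log = ReasoningLog()
    for position, name in enumerate(plan):
        result = tools[name]()          # the call succeeds: an OK span
        log.tools_called.append(name)
        if accept(result):              # the agent's own judgment of the data
            log.completed = True
            return log
        if position < len(plan) - 1:    # another option remains, so this is a re-plan
            log.replan_reasons.append(f"{name}: result rejected")
    log.escalated = True                # nothing left to try: hand off to a human
    return log


# Toy tools: both return without error, but only the fallback has usable data.
tools = {
    "primary": lambda: {"fresh": False},
    "fallback": lambda: {"fresh": True},
}
log = run_task(["primary", "fallback"], tools, accept=lambda r: r["fresh"])
```

Both tool calls would show up as healthy spans. The single re-plan only exists in `log.replan_reasons`, which is the part most teams never export.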

The Pattern: Silent Degradation via Re-Planning Loops
Here's what silent agent degradation looks like in a trace when you're not capturing reasoning:
span: agent-task-processor duration: 4.2s status: OK
span: tool-call-cloudwatch duration: 0.8s status: OK
span: tool-call-dynamodb duration: 0.4s status: OK
span: tool-call-cloudwatch duration: 0.8s status: OK
Looks fine. Three tool calls, all successful, task completed.
Here's what's actually happening:
agent receives task
→ plans: use CloudWatch metric X
→ calls CloudWatch: returns stale data (tool succeeds, data is wrong)
→ agent evaluates result: doesn't match expected state
→ RE-PLANS: try DynamoDB instead
→ calls DynamoDB: schema mismatch (tool succeeds, data wrong format)
→ RE-PLANS: back to CloudWatch, different metric
→ calls CloudWatch: stale again
→ RE-PLANS: escalate to human
Four successful spans. Two re-planning cycles. One escalation to a human (the HER signal from earlier in this series). Zero errors in your monitoring.
This is the retry storm that RSI (Retry Storm Index) measures, playing out at the reasoning level rather than the HTTP retry level.
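The gap between the two views is easy to show in code. The sketch below replays the walkthrough above as a list of reasoning steps and counts what each view would report. The step kinds (`plan`, `tool_ok`, `tool_error`, `replan`, `escalate`) are labels invented for this illustration, not an OTel or vendor schema.

```python
from typing import Dict, List, Tuple

# Each step is (kind, detail), mirroring the walkthrough above.
STEPS: List[Tuple[str, str]] = [
    ("plan", "use CloudWatch metric X"),
    ("tool_ok", "cloudwatch"),      # span status OK, data is stale
    ("replan", "stale data, try DynamoDB"),
    ("tool_ok", "dynamodb"),        # span status OK, schema mismatch
    ("replan", "schema mismatch, back to CloudWatch"),
    ("tool_ok", "cloudwatch"),      # span status OK, stale again
    ("escalate", "hand off to human"),
]


def summarize(steps: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count what an infrastructure view sees (errors) versus a reasoning view."""
    kinds = [kind for kind, _ in steps]
    return {
        "span_errors": kinds.count("tool_error"),
        "replans": kinds.count("replan"),
        "escalations": kinds.count("escalate"),
    }


summary = summarize(STEPS)
# summary == {"span_errors": 0, "replans": 2, "escalations": 1}
```

The only number an infrastructure dashboard would alert on is `span_errors`, and it is zero.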

Introducing Reasoning Trace Depth
I want to introduce a new observable to pair with RSI: Reasoning Trace Depth (RTD).
RTD = the number of re-planning cycles an agent goes through before either completing a task or escalating.
Baseline for a healthy agent on routine tasks: 0–1 re-planning cycles.
Warning threshold: 3+ re-planning cycles.
Critical threshold: 5+ re-planning cycles (agent is effectively stuck).
RTD is your earliest signal. It rises before HER (because the agent is still trying before escalating), before latency becomes visible to users, and before cost metrics show anomalous spend.
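As a sketch, the bands above map to a small classifier. The function name and the `elevated` label for exactly 2 re-plans are assumptions: the thresholds above leave that value between baseline and warning, so treat it however your own baseline data suggests.

```python
def classify_rtd(replan_count: int) -> str:
    """Map a per-task re-plan count onto the RTD bands described above."""
    if replan_count < 0:
        raise ValueError("replan_count cannot be negative")
    if replan_count >= 5:
        return "critical"   # agent is effectively stuck
    if replan_count >= 3:
        return "warning"
    if replan_count <= 1:
        return "healthy"    # baseline for routine tasks
    return "elevated"       # exactly 2: above baseline, below warning (assumed label)
```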
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentDecisionTrace:
    """
    Structured reasoning trace for a single agent task execution.
    Emitted once per task, NOT once per tool call.
    This is your reasoning observability layer.
    """
    agent_id: str
    session_id: str
    task_id: str
    timestamp: str

    # Reasoning behavior
    initial_plan: str
    tools_called: List[str] = field(default_factory=list)
    replan_count: int = 0           # RTD: Reasoning Trace Depth
    replan_reasons: List[str] = field(default_factory=list)

    # Outcome
    task_completed: bool = False
    human_escalated: bool = False   # HER signal

    # Cost signals
    total_tool_calls: int = 0
    latency_ms: int = 0

    # Quality proxy (if available)
    confidence_proxy: Optional[float] = None
```

```python
def emit_decision_trace(trace: AgentDecisionTrace) -> dict:
    """
    Build the structured decision trace record for your log aggregator.
    This sits ABOVE your OTel infrastructure spans.
    One entry per agent task: your reasoning observability layer.
    """
    record = {
        "trace_type": "agent_decision",
        "agent_id": trace.agent_id,
        "session_id": trace.session_id,
        "task_id": trace.task_id,
        "timestamp": trace.timestamp,
        "reasoning": {
            "initial_plan": trace.initial_plan,
            "replan_count": trace.replan_count,  # RTD
            "replan_reasons": trace.replan_reasons,
            "tools_sequence": trace.tools_called,
        },
        "outcome": {
            "completed": trace.task_completed,
            "human_escalated": trace.human_escalated,  # HER
        },
        "cost": {
            "tool_calls_total": trace.total_tool_calls,
            "latency_ms": trace.latency_ms,
        },
    }

    # Flag for immediate attention; the critical check runs last so it overrides the warning
    if trace.replan_count >= 3:
        record["alert"] = "RTD_WARNING"
    if trace.replan_count >= 5:
        record["alert"] = "RTD_CRITICAL"

    return record
```
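`emit_decision_trace` returns a dict and leaves shipping it up to you. One simple option, sketched here under the assumption that your log aggregator ingests JSON lines from stdout or a file, is to write each record as a single line. `write_json_line` and the sample record are illustrative names and values, not part of any library.

```python
import io
import json
import sys
from typing import TextIO


def write_json_line(record: dict, stream: TextIO = sys.stdout) -> None:
    """Write one decision-trace record as a single JSON line."""
    # default=str keeps the writer from failing on values like datetimes
    stream.write(json.dumps(record, sort_keys=True, default=str) + "\n")
    stream.flush()


# Example with a hand-built record shaped like the output of emit_decision_trace
sample = {
    "trace_type": "agent_decision",
    "task_id": "task-123",
    "reasoning": {"replan_count": 3},
    "alert": "RTD_WARNING",
}
buf = io.StringIO()
write_json_line(sample, buf)
```

One line per task keeps the reasoning layer cheap to query: filter on `trace_type`, then aggregate `reasoning.replan_count`.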

The Three-Layer Observability Model for Agents
Your current stack has two layers. You need three.
Layer 1 — Infrastructure (you already have this)
OTel traces, Prometheus metrics, structured logs. Tool call latency, error rates, resource utilization. This is what Datadog, Grafana, and CloudWatch show you. It's correct and necessary. It just doesn't see reasoning.
Layer 2 — Control Plane (from Post 7 — RAR, RSI, DCS)
Routing accuracy, retry patterns at the orchestration level, decomposition quality. This is your agent behavior at the workflow level — are tasks being routed correctly? Is the orchestrator stable?
Layer 3 — Reasoning (what's missing)
RTD (Reasoning Trace Depth), re-plan reasons, plan-to-execution delta, decision confidence proxies. One structured log entry per agent task. This is the layer your dashboards don't have.
The diagnostic flow when something feels wrong but dashboards are green:

  1. Check Layer 1: Is infrastructure healthy?
    → Yes → move to Layer 2

  2. Check Layer 2: Is RSI elevated? Is RAR degraded?
    → RSI elevated → move to Layer 3

  3. Check Layer 3: Is RTD above baseline?
    → RTD ≥ 3 → agent is re-planning, find the tool/data source causing it
    → RTD normal, HER elevated → agent is escalating cleanly, check decision envelope
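The flow above can be sketched as a single triage function. It uses the 3+ warning threshold defined earlier. The function name, its inputs, and the returned messages are hypothetical, and the branches the flow does not spell out (unhealthy infrastructure, no elevated RSI, nothing abnormal at Layer 3) are filled in as assumptions and marked as such.

```python
def triage(
    infra_healthy: bool,
    rsi_elevated: bool,
    rtd: float,
    her_elevated: bool,
    rtd_warning: float = 3,
) -> str:
    """Return the next place to look when dashboards are green but something feels wrong."""
    if not infra_healthy:
        # Assumption: the flow only describes the healthy branch
        return "layer1: fix infrastructure first"
    if not rsi_elevated:
        # Assumption: without a retry storm, routing (RAR) is the next suspect
        return "layer2: no retry storm, check routing (RAR)"
    if rtd >= rtd_warning:
        return "layer3: agent is re-planning, find the tool or data source causing it"
    if her_elevated:
        return "layer3: agent is escalating cleanly, check the decision envelope"
    # Assumption: nothing abnormal at Layer 3
    return "layer3: reasoning looks normal, revisit layer 2"
```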

What This Looks Like in CloudWatch
```python
import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')


def publish_rtd_metric(agent_id: str, rtd_value: int) -> None:
    """
    Publish Reasoning Trace Depth to CloudWatch.
    Alert when RTD reaches 3 or more: the agent is re-planning excessively.
    """
    cw.put_metric_data(
        Namespace='AgentSRE/Reasoning',
        MetricData=[{
            'MetricName': 'ReasoningTraceDepth',
            'Dimensions': [{'Name': 'AgentId', 'Value': agent_id}],
            'Value': float(rtd_value),
            'Unit': 'Count'
        }]
    )
```
Set your alarm at RTD ≥ 3 sustained over a 5-minute window. That's your early warning before HER spikes, before users feel latency, before cost anomalies appear in your billing dashboard.
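A minimal sketch of that alarm, written as a plain function that builds the arguments for CloudWatch's `put_metric_alarm`. Several choices here are assumptions rather than anything prescribed above: the alarm name, the `Average` statistic, one 300-second evaluation period as the reading of "sustained over a 5-minute window", and treating missing data as not breaching. It uses the 3+ warning threshold from the RTD definition.

```python
from typing import Any, Dict


def rtd_alarm_params(agent_id: str, threshold: float = 3.0) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm on the RTD metric."""
    return {
        "AlarmName": f"rtd-warning-{agent_id}",      # hypothetical naming scheme
        "Namespace": "AgentSRE/Reasoning",
        "MetricName": "ReasoningTraceDepth",
        "Dimensions": [{"Name": "AgentId", "Value": agent_id}],
        "Statistic": "Average",                      # assumption: mean RTD per window
        "Period": 300,                               # 5-minute window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",          # assumption: idle agents are not alarms
    }


params = rtd_alarm_params("agent-1")
```

To create the alarm, pass the result to the same client used for publishing: `cw.put_metric_alarm(**rtd_alarm_params("agent-1"))`. Keeping the parameters in a pure function makes them easy to review and unit-test without touching AWS.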

The Connection to Your Existing SLI Framework
If you've been following this series:

Post 4 introduced HER — your human escalation signal. HER is what happens after the agent gives up re-planning.
Post 7 introduced RSI — your retry storm signal at the control plane level.
This post introduces RTD — the earlier, reasoning-level signal that predicts both RSI and HER before they breach.

RTD → feeds → RSI → feeds → HER
The three form a causal chain. If you're only watching HER, you're watching the end of the chain. RTD gives you the front.

The Practical Checklist
Before your next agent ships, add to your production-readiness checklist:
☐ Decision trace structured logging configured (one JSON entry per task, not per span)
☐ RTD metric emitting to CloudWatch / Prometheus
☐ RTD baseline established (30-day shadow mode — same as HER baseline protocol)
☐ RTD alarm set at the warning threshold (RTD ≥ 3)
☐ RTD correlated to HER in your dashboards — rising RTD without rising HER means the agent is struggling but not yet escalating
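For the last item, here is a rough sketch of how the RTD-to-HER correlation could be checked from decision-trace records. It assumes records shaped like the decision trace output shown earlier, computes HER as the fraction of tasks escalated in a window (an assumption, since HER is defined in an earlier post), and uses made-up comparison parameters (`rtd_ratio`, `her_tolerance`, and a floor of 0.5 on the baseline) that you would tune against your own baseline.

```python
from typing import Dict, List


def window_summary(records: List[dict]) -> Dict[str, float]:
    """Mean RTD and escalation fraction over one window of decision-trace records."""
    if not records:
        return {"mean_rtd": 0.0, "her": 0.0}
    n = len(records)
    mean_rtd = sum(r["reasoning"]["replan_count"] for r in records) / n
    her = sum(1 for r in records if r["outcome"]["human_escalated"]) / n
    return {"mean_rtd": mean_rtd, "her": her}


def struggling_not_escalating(
    current: Dict[str, float],
    baseline: Dict[str, float],
    rtd_ratio: float = 2.0,        # hypothetical: RTD at least doubled
    her_tolerance: float = 0.02,   # hypothetical: HER within 2 points of baseline
) -> bool:
    """True when RTD has risen well above baseline while HER has stayed flat."""
    # Floor the baseline so a near-zero baseline does not trigger on noise
    rtd_rising = current["mean_rtd"] >= rtd_ratio * max(baseline["mean_rtd"], 0.5)
    her_flat = current["her"] <= baseline["her"] + her_tolerance
    return rtd_rising and her_flat
```

A `True` result is the "struggling but not yet escalating" state: the front of the chain has moved and the end has not.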
Your OTel traces are correct. They're just answering the wrong question.
https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-observability-activity-7462294037518159872-iF29?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU

Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer | github.com/Ajay150313/agentsre
