The Semantic Gap: Why Your APM Sees the Agent But Misses the Decision, and What RTD Does About It

#agentaichallenge #sre #devops #aws

Sherlocks.ai published something yesterday that names a problem precisely.
The core problem: traditional APM was built for synchronous request response. Agents break that model entirely, and most observability platforms are stitching together legacy APM rather than observing agents as a distinct thing. If your observability stack cannot correlate an agent's intended action with what actually happened at the system level, you are flying blind through the exact moments when cost and risk concentrate. Sherlocks AI
They call it the semantic gap. I've been building toward this from a different direction across this series starting with RTD (Reasoning Trace Depth) in Post 11 and the Pre-Action SRE Gate in Post 13. This post is where those frameworks connect to the industry's emerging framing.
What the Semantic Gap Actually Is
Existing tools observe an agent's high-level intent — prompts, tool selections — or its low-level actions — system calls, API hits, latency. They do not correlate both views. You can see the LLM prompt and you can see the system call, but you cannot see whether the agent intended that exact action or reasoned its way to something unexpected. When failure happens, this gap becomes your investigation crater. Sherlocks AI
The gap lives in the decision sequence — what happened between the prompt and the system call. Every re-plan, every tool evaluation, every "this result doesn't match what I expected so I'll try differently" — all of that is invisible to APM because APM instruments execution, not reasoning.
Five percent of AI model requests fail in production today. Roughly sixty percent of those are capacity-related, not model errors. Which means the majority of production failures aren't the model doing something wrong. They're the infrastructure around the model — tool availability, API response times, token budget, context state — creating conditions the agent can't navigate cleanly. And your observability stack is optimized to catch model errors. Sherlocks AI
You're instrumented for the minority failure mode.
How RTD Closes the Semantic Gap
Reasoning Trace Depth is a single structured log entry per agent task — not per tool call. It captures:

What the agent planned to do initially
Every re-plan event: why, which tool triggered it, what the new plan was
How many cycles before completion or escalation
Whether HER fired at the end

That record is the intent-to-action correlation layer. It sits above your OTel spans (low-level execution) and below your business metrics (outcome). It's the semantic layer that connects "agent received this task" to "here's exactly how the decision sequence played out."
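To make the shape of that record concrete, here is a minimal sketch of one RTD entry per agent task. The class names, field names, and JSON layout are illustrative assumptions, not a published schema; the only things taken from the description above are the four captured items (initial plan, re-plan events, cycle count, HER flag).

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Replan:
    cycle: int          # 1-based re-plan number within the task
    reason: str         # why the previous plan was abandoned
    trigger_tool: str   # tool whose result triggered the re-plan
    new_plan: str       # what the agent decided to do instead
    at: str             # ISO-8601 UTC timestamp of the re-plan


@dataclass
class RTDRecord:
    # Hypothetical schema: one record per agent task, not per tool call.
    task_id: str
    initial_plan: str
    replans: list = field(default_factory=list)
    her_fired: bool = False

    @property
    def depth(self) -> int:
        """RTD value: number of re-plan cycles recorded for this task."""
        return len(self.replans)

    def add_replan(self, reason: str, trigger_tool: str, new_plan: str) -> None:
        self.replans.append(Replan(
            cycle=self.depth + 1,
            reason=reason,
            trigger_tool=trigger_tool,
            new_plan=new_plan,
            at=datetime.now(timezone.utc).isoformat(),
        ))

    def to_log_line(self) -> str:
        """Serialize the whole task as a single JSON log entry."""
        payload = asdict(self)
        payload["rtd"] = self.depth
        return json.dumps(payload, sort_keys=True)
```

The agent loop would call add_replan each time it abandons a plan and emit to_log_line once, when the task completes or escalates. Keeping it to one entry per task is what lets the record be read as a decision sequence rather than as another stream of spans.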
Without RTD, your investigation after a production failure looks like this: agent ran, spans look clean, outcome was bad, no idea what the agent decided between the tool calls.
With RTD, it looks like this: agent re-planned 4 times, tool 3 returned stale data on every attempt, HER fired when a fifth re-plan was needed, and here is the full decision sequence with timestamps.
That second version is a postmortem. The first is a guess.
What the Market Is Getting Right and Missing
Fifteen tools actively compete on agent observability in 2026, most built on OpenTelemetry standards. The critical test for any of them: does it handle reasoning loops as a first-class concern? Can you see the decision tree — prompt, tool choice, outcome, next decision — as a continuous trace? Does it distinguish between a tool failure and an agent misunderstanding? Does it alert on semantic drift, where agent behavior changes but metrics look normal? Sherlocks AI
Those are the right questions. Most tools fail at least two of them because they were designed as APM add-ons, not as reasoning-native observability.
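One of those questions, alerting on semantic drift while metrics look normal, can be approximated with RTD values alone. The sketch below is an assumption about how such a check might look, not how any commercial tool does it: it compares the mean RTD of a recent window against a baseline window and flags a shift larger than a chosen number of standard errors. The function name and the default threshold are hypothetical.

```python
from statistics import mean, pstdev


def rtd_drift(baseline: list, recent: list, z_threshold: float = 3.0) -> bool:
    """Return True when the recent mean RTD has shifted away from baseline.

    baseline and recent are per-task RTD values (re-plan counts).
    The shift is measured in standard errors of the recent-window mean,
    using the baseline's population standard deviation.
    """
    if not baseline or not recent:
        raise ValueError("both windows need at least one RTD value")
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Baseline never varied, so any change in the mean counts as drift.
        return mean(recent) != mu
    stderr = sigma / len(recent) ** 0.5
    return abs(mean(recent) - mu) / stderr > z_threshold
```

A check like this can fire while latency and error-rate dashboards stay flat, because an agent that re-plans three extra times per task may still finish with a successful status.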
The practical implication: even if you adopt a good commercial agent observability tool, you still need the reasoning trace layer. Commercial tools give you the infrastructure view. RTD gives you the decision view. You need both.
The Three-Layer Stack, Restated
I've been building this framing across the series. The Sherlocks piece clarifies why it matters:
Layer 1 — Infrastructure (APM, OTel, CloudWatch)
What executed. Tool call latency, error rates, span data. Answers: did the tools work? Misses: did the agent reason correctly?
Layer 2 — Control Plane (RAR, RSI, DCS from Post 7)
How the orchestration behaved. Routing accuracy, retry patterns, task decomposition. Answers: did the workflow hold up? Misses: what was the agent deciding inside each task?
Layer 3 — Reasoning (RTD from Post 11)
What the agent decided. Re-plan count, tool sequence, decision rationale, HER correlation. Answers: did the reasoning hold up? This is the semantic gap layer.
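As a rough illustration of why Layer 1 and Layer 3 have to be read together, the sketch below joins span data to RTD values on a shared trace ID and returns the tasks where every span succeeded but the agent still re-planned heavily. The dict shapes, the "OK" status string, and the threshold are assumptions made for this example; real span exports will differ.

```python
from collections import defaultdict


def clean_but_struggling(spans, rtd_by_trace, rtd_threshold=3):
    """Return trace IDs whose spans are all OK but whose RTD is high.

    spans: iterable of dicts with "trace_id" and "status" keys (assumed shape).
    rtd_by_trace: dict mapping trace_id to that task's RTD value.
    """
    error_counts = defaultdict(int)
    seen = set()
    for span in spans:
        seen.add(span["trace_id"])
        if span["status"] != "OK":
            error_counts[span["trace_id"]] += 1
    return sorted(
        trace_id
        for trace_id in seen
        if error_counts[trace_id] == 0
        and rtd_by_trace.get(trace_id, 0) >= rtd_threshold
    )
```

These are the tasks an infrastructure-only view reports as healthy: the tools worked, yet the decision sequence shows the agent fighting its way to a result.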
If you are buying observability tooling, demand explicit agent loop tracking. Ask for examples. Do not accept "we can log prompts" as an answer. Sherlocks AI
Logging prompts is Layer 1. You need Layer 3.
The Postmortem Template Addition
Every postmortem for an agent-involved incident should now have a section that didn't exist before: Semantic Gap Analysis.
Three fields:
Intent vs. outcome delta — what did the agent plan to do vs. what did it actually do? If these match, the reasoning held. If they diverge, you have a semantic gap event.
Re-plan sequence — RTD value, re-plan reasons, which tools triggered each re-plan. This is where you find the actual root cause in most agent failures.
HER correlation — did HER spike during this task? At which re-plan decision? That's the moment the agent recognized it was outside its reliable envelope.
Without these three fields, your postmortem explains what broke. It can't explain why the agent did what it did before the break.
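The three fields can be filled in mechanically once the reasoning trace exists. The following is a minimal sketch under stated assumptions: the function name, the dict layout, and the plain string comparison between planned and actual action are all hypothetical simplifications, and a real comparison would likely need something more tolerant than exact equality.

```python
from typing import Optional


def semantic_gap_analysis(
    planned: str,
    actual: str,
    replans: list,
    her_fired_at: Optional[int],
) -> dict:
    """Build the Semantic Gap Analysis section of a postmortem.

    replans: list of dicts with "reason" and "trigger_tool" keys (assumed shape).
    her_fired_at: re-plan number at which HER fired, or None if it never did.
    """
    return {
        "intent_vs_outcome": {
            "planned": planned,
            "actual": actual,
            # Exact match is a simplification chosen for this sketch.
            "semantic_gap_event": planned != actual,
        },
        "replan_sequence": {
            "rtd": len(replans),
            "steps": [
                {"cycle": i, "reason": r["reason"], "trigger_tool": r["trigger_tool"]}
                for i, r in enumerate(replans, start=1)
            ],
        },
        "her_correlation": {
            "her_fired": her_fired_at is not None,
            "at_replan": her_fired_at,
        },
    }
```

Generating the section from the trace, rather than writing it from memory after the incident, keeps the postmortem tied to what the agent recorded at decision time.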
Where This Fits in the Arc
Post 4: SLOs for agents (DQR, TIE, HER, AQDD) — what to measure.
Post 7: Control plane SLIs (RAR, RSI, DCS) — where Layer 2 lives.
Post 11: RTD — the Layer 3 reasoning primitive.
Post 13: Pre-Action Gate — using SLIs as authorization signals.
Post 14: The semantic gap — why all three layers are necessary and what happens without Layer 3.
The industry is arriving at this independently. The frameworks were already here.
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer
github.com/Ajay150313/agentsre
