DEV Community

Milo Antaeus

What Your AI Agent's Tool Calls Actually Look Like in Production (3 Layers You Need to See)

If you can't read your agent's tool calls, you can't debug it. You can't bill for it. And you definitely can't tell your customer "yes, that email was sent" with any confidence.

Most teams ship an agent and instrument one layer — usually the LLM API request/response — and call it observability. Then week 3 hits and the customer is asking why the same Slack message went out 11 times, and the trace shows... nothing useful.

Here's the three-layer model I use when I do an AI Ops Checkup. None of these layers are optional. If you're missing any one of them, the failure mode you can't explain is the one hiding in the gap.

Layer 1 — The LLM call envelope (most teams have this)

This is the easy one. Every observability vendor in 2026 does this out of the box: LangSmith, Langfuse, Helicone, Arize Phoenix, Braintrust, Maxim, Galileo. You log:

  • Model name + version
  • System prompt hash (so you can correlate with prompt changes)
  • Input messages (or a redacted version)
  • Output text + finish reason
  • Token counts (prompt / completion / cached)
  • Latency
  • Cost estimate
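
As a rough illustration of the fields above, here is a minimal sketch of a Layer 1 record builder. The function names (`prompt_hash`, `llm_envelope`), the 12-character hash length, and the extra `ts` field are assumptions made for this example, not part of any vendor's API. It hashes the system prompt with SHA-256 so that two turns served by the same prompt version share a fingerprint.

```python
import hashlib
import json
import time


def prompt_hash(system_prompt: str, length: int = 12) -> str:
    """Short, stable fingerprint of the system prompt text (hypothetical helper)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:length]


def llm_envelope(turn_id, model, system_prompt, output_text, finish_reason,
                 tokens_in, tokens_out, tokens_cached, latency_ms, cost_usd):
    """Build one Layer 1 record as a JSON-serializable dict."""
    return {
        "layer": 1,
        "ts": time.time(),  # assumption: wall-clock timestamp for ordering
        "turn_id": turn_id,
        "model": model,
        "prompt_hash": prompt_hash(system_prompt),
        "output_text": output_text,
        "finish_reason": finish_reason,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tokens_cached": tokens_cached,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }


# Example values are made up for illustration.
record = llm_envelope(
    turn_id="turn-1",
    model="example-model",
    system_prompt="You are a support agent.",
    output_text="Sure, I'll send that email.",
    finish_reason="tool_use",
    tokens_in=1240,
    tokens_out=380,
    tokens_cached=0,
    latency_ms=1820,
    cost_usd=0.043,
)
print(json.dumps(record))
```

Input messages are left out of the record here; whether to store them in full or redacted depends on your data-handling rules.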

What this layer answers: Did the model behave the way we expected?

What it doesn't answer: What did the agent actually do in the world?

Layer 2 — The tool-call attempt (the layer most teams skip)

This is the gap. The model said "I'll call send_email with these arguments." Did the tool actually run? Did it succeed? Did it return what the model thinks it returned?

What you need to log for every tool call:

  • tool_name + version (tools change — send_email v1 might not equal v3)
  • arguments (full payload, not a hash — you'll thank me at 2am)
  • result (full response, including any error code/message)
  • latency_ms
  • attempt_number (1, 2, 3 — the retry count is the whole story for 80% of "duplicate" bugs)
  • parent_span_id (which user turn / LLM call triggered this)
  • idempotency_key if you generated one (and you should — see Why Your AI Agent Sent That Email Twice for why)
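
One way to capture these fields is to route every tool invocation through a single wrapper. The sketch below is an assumption-laden example, not a prescribed implementation: `call_tool_logged`, the `sink` callback, and the convention that the tool accepts an `idempotency_key` keyword are all hypothetical. It emits one record per attempt and reuses the same idempotency key across retries.

```python
import json
import time
import uuid


def call_tool_logged(tool_fn, tool_name, version, args, parent_span_id,
                     max_attempts=3, sink=print):
    """Run a tool with retries and emit one Layer 2 record per attempt."""
    idempotency_key = str(uuid.uuid4())  # one key shared by every retry
    result = {"ok": False, "error": "not attempted"}
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        try:
            value = tool_fn(idempotency_key=idempotency_key, **args)
            result = {"ok": True, "value": value}
        except Exception as exc:  # a real wrapper would retry only retryable errors
            result = {"ok": False, "error_type": type(exc).__name__, "error": str(exc)}
        record = {
            "layer": 2,
            "span_id": str(uuid.uuid4()),
            "parent_span_id": parent_span_id,
            "tool": tool_name,
            "version": version,
            "args": args,          # full payload, not a hash
            "result": result,      # full response, including errors
            "latency_ms": round((time.perf_counter() - start) * 1000),
            "attempt": attempt,
            "idempotency_key": idempotency_key,
        }
        sink(json.dumps(record, default=str))
        if result["ok"]:
            break
    return result


# Usage with a fake tool that times out once, then succeeds.
calls = {"n": 0}


def flaky_send_email(to, subject, idempotency_key):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("provider timed out")
    return {"provider_id": "abc"}


lines = []
final = call_tool_logged(
    flaky_send_email, "send_email", "v3.1.2",
    {"to": "user@example.com", "subject": "Hi"},
    parent_span_id="turn-1", sink=lines.append,
)
```

With this shape, a "duplicate" investigation starts from the `attempt` and `idempotency_key` columns rather than from guesswork.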

What this layer answers: What did the agent try to do, and what did the world say back?

What it doesn't answer: Did the user see the thing they expected to see?

Layer 3 — The side effect in the user's world (the layer almost no one has)

This is the "did the email actually arrive" layer. The tool returned 200 OK. Great. Did the message land in the user's inbox? Did the row actually appear in the customer's CRM? Did the charge actually settle in Stripe?

Most agent stacks treat the tool return value as ground truth. It isn't. Side effects fail silently:

  • Email bounces after the tool returns success (SPF / DKIM / reputation)
  • Webhook fires but the downstream system is down (queue depth 4,000)
  • Database write commits but read replica is 30 seconds behind
  • Stripe charge succeeds, but only in test mode, because the live key is configured in the wrong environment
  • The user's Slack workspace admin revoked the bot 6 hours ago

What you need to log here is harder than the other two layers, because there's no automatic hook:

  • A separate verification job that pings the side-effect target (e.g. "list emails sent in the last 60s and confirm the one we sent shows up")
  • A synthetic user that runs a real action and confirms the world state changed
  • An explicit customer confirmation for high-stakes actions ("Milo sent this email — did you get it?")
  • A side-effect ledger — a tiny store that records intent → claimed_result → verified_result → delta. When the delta is non-empty, page someone.
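
The ledger idea in the last bullet can be sketched in a few lines. Everything named here (`LedgerEntry`, `delta`, `needs_page`, and the example field names) is hypothetical and chosen for illustration. The assumption is that the delta is the set of claimed fields the verification job could not confirm, so an entry that was never verified also counts as a non-empty delta.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LedgerEntry:
    turn_id: str
    intent: dict                            # what the agent meant to do
    claimed_result: dict                    # what the tool said happened
    verified_result: Optional[dict] = None  # what the verification job observed

    def delta(self) -> dict:
        """Claimed fields that the observation does not confirm.

        Maps field name -> (claimed value, observed value).
        """
        observed = self.verified_result or {}
        return {
            key: (claimed, observed.get(key))
            for key, claimed in self.claimed_result.items()
            if observed.get(key) != claimed
        }


def needs_page(entry: LedgerEntry) -> bool:
    """Page someone whenever the claim and the observation disagree."""
    return bool(entry.delta())


# The tool claimed delivery; the verification job saw nothing in the inbox.
entry = LedgerEntry(
    turn_id="turn-1",
    intent={"action": "send_email", "to": "user@example.com"},
    claimed_result={"delivered": True, "provider_id": "abc"},
    verified_result={"delivered": False, "provider_id": "abc"},
)
print(entry.delta())       # only the "delivered" field disagrees
print(needs_page(entry))
```

Comparing only the claimed fields keeps extra data returned by the verification job from producing false alarms.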

What this layer answers: Did the user actually experience what we said we'd do?

What it doesn't answer: nothing — if you have all three, you can debug any agent failure I've seen in the last 14 months.

The 10-minute audit: which layers does YOUR agent have?

Open the codebase and answer yes/no for each:

  1. Layer 1 — Can you find the exact prompt + response for any user turn in the last 30 days, including the model version and token cost? (yes / no)
  2. Layer 1.5 — Can you tell which prompt version was active when a given turn was served? (yes / no — version hash on the system prompt matters here)
  3. Layer 2 — For any tool call in the last 30 days, can you find the arguments, the result, the latency, and the retry attempt number? (yes / no)
  4. Layer 2.5 — For any duplicate-looking failure, can you tell whether the tool was actually called twice, or only intended twice? (yes / no — this is the duplicate-email investigation in 3 clicks vs 3 hours)
  5. Layer 3 — For any user-visible action in the last 30 days, can you confirm via a non-tool signal that the user actually saw the result? (yes / no)

Scoring:

  • 5/5 — you're ahead of 95% of teams shipping agents in 2026. Sell that.
  • 3-4/5 — you're in the median. The gap is real and concrete; you can name it.
  • 0-2/5 — when something breaks, you will be guessing. The customer will know before you do.
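
For the Layer 2.5 question, one possible approach is to separate retries of a single decision from separate decisions that happened to carry the same arguments. The sketch below assumes Layer 2 records shaped like the fields listed earlier (`tool`, `args`, `attempt`, `idempotency_key`); the function name `classify_duplicates` is made up for this example.

```python
import json
from collections import defaultdict


def classify_duplicates(records):
    """Report (tool, args) pairs that ran more than once, and how.

    intents    = number of distinct idempotency keys (separate decisions)
    executions = number of logged attempts (times the tool actually ran)
    """
    by_call = defaultdict(lambda: defaultdict(list))
    for r in records:
        call = (r["tool"], json.dumps(r["args"], sort_keys=True))
        by_call[call][r["idempotency_key"]].append(r["attempt"])

    report = {}
    for call, keys in by_call.items():
        executions = sum(len(attempts) for attempts in keys.values())
        if executions > 1:
            report[call] = {"intents": len(keys), "executions": executions}
    return report


# Made-up records: one retried Slack post, one email decided on twice.
sample = [
    {"tool": "slack_post", "args": {"text": "hi"}, "attempt": 1, "idempotency_key": "k1"},
    {"tool": "slack_post", "args": {"text": "hi"}, "attempt": 2, "idempotency_key": "k1"},
    {"tool": "send_email", "args": {"to": "a@example.com"}, "attempt": 1, "idempotency_key": "k2"},
    {"tool": "send_email", "args": {"to": "a@example.com"}, "attempt": 1, "idempotency_key": "k3"},
]
report = classify_duplicates(sample)
for call, summary in report.items():
    print(call[0], summary)
```

One intent with several executions points at the retry path; several intents point back at the model or the orchestration loop.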

What this looks like when you read the actual traces

Three quick real shapes (anonymized from actual checkup reads):

Shape A — the "stuck in a loop" agent. Layer 1 looks fine. Layer 2 shows the agent called search_kb 14 times in 90 seconds with the same arguments, then gave up and returned a confident answer with zero citations. Layer 3 was never going to catch this — the user got a hallucinated response. The fix was a max-iterations guard at the tool layer, not the model layer. You'd never see this if you only had Layer 1.
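
A guard of the kind described in Shape A might look like the sketch below. It is one possible form, not the fix used in that engagement: the class name `ToolCallGuard`, the exception `ToolBudgetExceeded`, and both limits are assumptions. It caps the total number of tool calls per turn and the number of calls with identical arguments.

```python
import json
from collections import Counter


class ToolBudgetExceeded(RuntimeError):
    """Raised when a turn exceeds its tool-call budget."""


class ToolCallGuard:
    """Create one guard per user turn and call check() before each tool call."""

    def __init__(self, max_identical_calls=3, max_total_calls=20):
        self.max_identical_calls = max_identical_calls
        self.max_total_calls = max_total_calls
        self._seen = Counter()
        self._total = 0

    def check(self, tool_name, args):
        # Canonical JSON so that key order does not hide a repeated call.
        key = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self._seen[key] += 1
        self._total += 1
        if self._seen[key] > self.max_identical_calls:
            raise ToolBudgetExceeded(
                f"{tool_name} called {self._seen[key]} times with identical arguments"
            )
        if self._total > self.max_total_calls:
            raise ToolBudgetExceeded(f"{self._total} tool calls in one turn")


guard = ToolCallGuard(max_identical_calls=3)
try:
    for _ in range(14):
        guard.check("search_kb", {"query": "refund policy"})
except ToolBudgetExceeded as exc:
    print("stopped:", exc)
```

What the agent does after the guard trips (ask the user, escalate, or answer with an explicit "I could not find this") is a separate design decision.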

Shape B — the "works on staging, silent on prod" agent. Layer 1 shows the same prompt, same model, similar latency. Layer 2 shows the crm.upsert_contact call returning 200. Layer 3 — the verification job — shows the contact is not in the customer's CRM. Cause: the production API key was scoped to a different org than the staging key. The tool's 200 was lying. You'd never see this if you didn't have Layer 3.

Shape C — the "I'm sure I sent that" agent. Layer 1 shows the model deciding to send. Layer 2 shows the email tool returning success on attempt 1. Layer 2.5 — the retry counter — shows attempt 1. Layer 3 — the inbox verification job — shows zero emails delivered. Cause: the SES account was in sandbox mode, not production. The customer got nothing. Three days of "I sent it" because the agent believed its own tool.

The cheap version of all three layers (you can ship this in a day)

You don't need a vendor. You need a log line per layer per action.

// Layer 1 (LLM call)
{"layer": 1, "turn_id": "...", "model": "claude-opus-4-7", "prompt_hash": "a3f2", "tokens_in": 1240, "tokens_out": 380, "latency_ms": 1820, "cost_usd": 0.043}

// Layer 2 (tool call)
{"layer": 2, "turn_id": "...", "span_id": "...", "tool": "send_email", "version": "v3.1.2", "args": {...}, "result": {"ok": true, "provider_id": "abc"}, "latency_ms": 240, "attempt": 1, "idempotency_key": "..."}

// Layer 3 (side-effect verify, separate job)
{"layer": 3, "turn_id": "...", "verify_method": "gmail_list_recent", "claimed": "email_abc", "verified": true, "delta_ms": 4200}

Store these in three tables (or one table with a layer column). Index by turn_id. When a customer says "I never got the email," you query by turn_id and get the whole story from a single query: the LLM call, every tool attempt, and the verification result.
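
As a minimal sketch of the single-table variant, here is one way to do it with SQLite from the Python standard library. The table name `agent_log`, the column names, and the helpers `write` and `story` are assumptions for this example; any store that can index on turn_id would work the same way.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in a real deployment
conn.execute("""
    CREATE TABLE agent_log (
        id      INTEGER PRIMARY KEY,
        turn_id TEXT    NOT NULL,
        layer   INTEGER NOT NULL,
        ts      REAL    NOT NULL,
        body    TEXT    NOT NULL  -- the full JSON log line
    )
""")
conn.execute("CREATE INDEX idx_agent_log_turn ON agent_log (turn_id, ts)")


def write(turn_id, layer, ts, record):
    """Append one log line for any layer."""
    conn.execute(
        "INSERT INTO agent_log (turn_id, layer, ts, body) VALUES (?, ?, ?, ?)",
        (turn_id, layer, ts, json.dumps(record)),
    )


def story(turn_id):
    """Everything recorded for one turn, in time order."""
    rows = conn.execute(
        "SELECT layer, body FROM agent_log WHERE turn_id = ? ORDER BY ts, layer",
        (turn_id,),
    ).fetchall()
    return [(layer, json.loads(body)) for layer, body in rows]


# Made-up records for one turn.
write("turn-1", 1, 1.0, {"model": "example-model", "latency_ms": 1820})
write("turn-1", 2, 1.2, {"tool": "send_email", "attempt": 1, "result": {"ok": True}})
write("turn-1", 3, 5.4, {"verify_method": "gmail_list_recent", "verified": False})
write("turn-2", 1, 2.0, {"model": "example-model", "latency_ms": 900})

for layer, body in story("turn-1"):
    print(layer, body)
```

Keeping the full JSON line in `body` means new fields can be added to any layer without a schema change.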

What the actual checkup reads, week to week

The 3-layer model isn't theory. It's what I look at first on every AI Ops Checkup engagement. About 60% of the time the answer to the customer's "why did X happen?" question is sitting in Layer 2 or Layer 3 and the team didn't know to look there. About 30% of the time it's a prompt issue (Layer 1) but the symptom looked like a tool failure. About 10% of the time it's genuinely a model behavior issue that needs an eval set, not a log read.

The win is that you stop arguing about which layer is broken. You read the three tables and you know.


If you're staring at one of these gaps in your own agent and want a second pair of eyes on the traces, that's the AI Ops Checkup — 24-hour turnaround, $149, you send the logs, I send the report.

If you're building agent observability and this matches a layer you wish your tool already covered, I'd love to hear what you're missing — the comment thread is open.
