What I Actually Look at When Debugging a Slow or Expensive LLM Call

#observability #ai #llm #devops

TL;DR

Standard APM metrics (latency, status code, error rate) don't capture the things that actually drive LLM cost and behavior: token counts, per-model cost, guardrail overhead, and prompt-level detail.
Traces need to include input/output tokens, cost, and latency per hop, and that data has to flow to wherever your team already looks - usually via OpenTelemetry into Langfuse, SigNoz, or a similar backend.
Guardrail metrics are a separate axis worth watching on their own: how often is a guardrail blocking, mutating, or adding latency, and to which users or teams.

The gap between "observability" and "LLM observability"

A normal trace tells you a service was called and how long it took. An LLM trace needs a few more fields to be useful: input tokens, output tokens, per-model cost, and often the prompt and completion themselves for debugging quality issues (with the access controls that implies - more on that below). OpenTelemetry's GenAI semantic conventions exist specifically because different LLM observability tools were inventing incompatible attribute names for the same underlying data, and they're worth reading if you're instrumenting anything yourself.
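To make those extra fields concrete, here's a minimal sketch of the attributes you might attach to a span for one LLM call. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as I understand them (the conventions are still evolving, so check the current version before relying on exact names). The `app.llm.cost_usd` key and the `llm_span_attributes` helper are my own made-up names for illustration, since cost isn't something I'd assume the conventions standardize.

```python
from typing import Dict, Union

AttrValue = Union[str, int, float]


def llm_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cost_usd: float,
) -> Dict[str, AttrValue]:
    """Build a flat attribute dict describing one LLM call."""
    return {
        # Keys below follow the OpenTelemetry GenAI semantic conventions.
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Assumption: a custom, non-standard attribute name for cost.
        "app.llm.cost_usd": cost_usd,
    }


attrs = llm_span_attributes("example-model", 1200, 350, 0.0123)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

If you're using the OpenTelemetry Python SDK, a dict like this can be passed to `span.set_attributes(attrs)` on the span that wraps the call.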

TrueFoundry's metrics dashboard rolls LLM and MCP performance, cost, guardrail outcomes, routing decisions, and cache hit rates into one place. That's the same shape of data most of the standalone LLM observability tools (Langfuse, SigNoz, Lunary, Laminar) expect when you export traces via OTLP.

What actually matters when something's slow or expensive

Token counts, not just latency. A slow request and an expensive request are different problems with different fixes. If a request is slow because it's generating 4,000 tokens, the fix might be a lower max_tokens or a smaller model, not a networking investigation.
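One rough way to tell the two problems apart is to check how much of the wall-clock time was spent generating tokens versus waiting for the first one. The sketch below assumes you stream responses and record time-to-first-token, which the paragraph above doesn't require. The `CallTiming` type, the `diagnose` function, and the 0.7 threshold are all hypothetical, so tune them to your own traffic.

```python
from dataclasses import dataclass


@dataclass
class CallTiming:
    total_seconds: float        # wall-clock time for the whole request
    first_token_seconds: float  # time until the first output token arrived
    output_tokens: int          # tokens the model generated


def diagnose(timing: CallTiming, generation_share_threshold: float = 0.7) -> str:
    """Label a slow call by where its time went (threshold is an assumption)."""
    if timing.total_seconds <= 0:
        return "overhead-bound"
    generation_seconds = timing.total_seconds - timing.first_token_seconds
    share = generation_seconds / timing.total_seconds
    if share >= generation_share_threshold:
        # Most of the time was spent producing tokens: look at max_tokens
        # or a smaller model before touching the network.
        return "generation-bound"
    # Most of the time passed before any token arrived: look at queueing,
    # network, guardrails, or prompt size instead.
    return "overhead-bound"


long_answer = CallTiming(total_seconds=40.0, first_token_seconds=0.8, output_tokens=4000)
slow_start = CallTiming(total_seconds=5.0, first_token_seconds=4.5, output_tokens=50)
print(diagnose(long_answer))
print(diagnose(slow_start))
```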

Per-model, per-provider cost, in real time, not from a monthly invoice you see three weeks later. By the time a cost anomaly shows up on a bill, whatever caused it has usually already happened a thousand more times.
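Computing cost per call is simple arithmetic once you have token counts, which is why it can be done at request time instead of waiting for the invoice. Here's a minimal sketch. The model names and prices in `PRICES` are invented for illustration, not real provider rates, and `call_cost` / `cost_by_model` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per 1M tokens: (input price, output price).
PRICES: Dict[str, Tuple[float, float]] = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD, from token counts and the price table."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_by_model(calls: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Sum cost per model over (model, input_tokens, output_tokens) records."""
    totals: Dict[str, float] = defaultdict(float)
    for model, input_tokens, output_tokens in calls:
        totals[model] += call_cost(model, input_tokens, output_tokens)
    return dict(totals)


calls = [
    ("small-model", 2000, 1000),
    ("large-model", 1000, 4000),
    ("small-model", 2000, 1000),
]
for model, total in cost_by_model(calls).items():
    print(f"{model}: ${total:.4f}")
```

Keying the same aggregation by provider, team, or user instead of model is a one-line change, which is most of what cost attribution comes down to once the token counts are on the trace.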

Guardrail latency and outcomes, as their own metric. Guardrail metrics - evaluated requests, blocked/mutated rates, and P50-P99 latency per guardrail - matter because a PII redaction or prompt-injection check that adds 400ms per call is a real cost, and it's easy to add guardrails without ever checking what they cost you in latency.
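If your gateway doesn't compute these for you, the per-guardrail numbers are easy to derive from raw events. The sketch below assumes a hypothetical `GuardrailEvent` record with an outcome of "passed", "blocked", or "mutated". That shape is my own invention, not any particular tool's log format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, Iterable, List


@dataclass
class GuardrailEvent:
    guardrail: str     # e.g. "pii-redaction"
    outcome: str       # "passed", "blocked", or "mutated"
    latency_ms: float  # time the guardrail added to the request


def summarize(events: Iterable[GuardrailEvent]) -> Dict[str, Dict[str, float]]:
    """Per-guardrail evaluated count, blocked/mutated rates, and P50/P99 latency."""
    by_name: Dict[str, List[GuardrailEvent]] = defaultdict(list)
    for event in events:
        by_name[event.guardrail].append(event)

    summary: Dict[str, Dict[str, float]] = {}
    for name, group in by_name.items():
        count = len(group)
        latencies = [e.latency_ms for e in group]
        if count >= 2:
            # quantiles(n=100) returns 99 cut points: index 49 is P50, 98 is P99.
            cuts = quantiles(latencies, n=100, method="inclusive")
            p50, p99 = cuts[49], cuts[98]
        else:
            # quantiles() needs at least two data points.
            p50 = p99 = latencies[0]
        summary[name] = {
            "evaluated": count,
            "blocked_rate": sum(e.outcome == "blocked" for e in group) / count,
            "mutated_rate": sum(e.outcome == "mutated" for e in group) / count,
            "p50_ms": p50,
            "p99_ms": p99,
        }
    return summary


events = [
    GuardrailEvent("pii-redaction", "passed", 100.0),
    GuardrailEvent("pii-redaction", "passed", 200.0),
    GuardrailEvent("pii-redaction", "mutated", 300.0),
    GuardrailEvent("pii-redaction", "passed", 400.0),
    GuardrailEvent("pii-redaction", "blocked", 500.0),
]
print(summarize(events)["pii-redaction"])
```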

Traces, exported to wherever your team already looks. TrueFoundry exports OpenTelemetry traces (and separately, metrics) to whatever OTEL-compatible backend you already run - the export docs cover the setup, and there are dedicated guides for Langfuse and other backends if you want traces to land somewhere your team already has dashboards.

Who can see what. Request logs contain prompts and completions, which in a lot of orgs is sensitive by default. Data access rules let you restrict prod request logs to on-call/SRE/security while keeping dev and staging logs open to a wider group - a distinction that's easy to skip until someone asks why an intern can read production customer prompts.
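The underlying rule is just "environment decides which roles can read request logs, and anything unlisted is denied." Here's a sketch of that idea in plain Python. It is not TrueFoundry's rule format: the `LOG_READERS` table, the role names, and `can_read_request_logs` are all hypothetical.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical policy: environment -> roles allowed to read request logs.
LOG_READERS: Dict[str, FrozenSet[str]] = {
    "prod": frozenset({"on-call", "sre", "security"}),
    "staging": frozenset({"on-call", "sre", "security", "developer"}),
    "dev": frozenset({"on-call", "sre", "security", "developer", "intern"}),
}


def can_read_request_logs(environment: str, roles: Set[str]) -> bool:
    """True if any of the user's roles is allowed in this environment."""
    # Unknown environments get an empty allow-list, i.e. deny by default.
    allowed = LOG_READERS.get(environment, frozenset())
    return bool(allowed & roles)


print(can_read_request_logs("prod", {"intern"}))
print(can_read_request_logs("dev", {"intern"}))
print(can_read_request_logs("prod", {"developer", "sre"}))
```

The default-deny lookup is the part worth copying: a new environment that nobody has written a rule for should start closed, not open.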

Where this gets harder than normal service observability

Multi-hop agent traces are the part that still isn't fully solved industry-wide. When agent A calls agent B calls an MCP tool, keeping a single trace ID coherent across all three hops so you can reconstruct "what actually happened" after the fact is genuinely a harder problem than tracing a normal microservice call chain, mostly because the agent frameworks involved don't all propagate context the same way.
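The mechanical part of keeping one trace ID across hops is small: each hop keeps the incoming trace ID and mints a new span ID. The sketch below does that by hand using the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`). In practice you'd let an OpenTelemetry propagator handle this; the function names here are hypothetical, and the header lookup is simplified to an exact lowercase key.

```python
import secrets
from typing import Dict


def new_traceparent() -> str:
    """Start a new trace: 32-hex trace ID, 16-hex span ID, sampled flag."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and flags, replace the span ID with a new one."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Headers a hop should send downstream, given what it received."""
    parent = incoming.get("traceparent")
    traceparent = child_traceparent(parent) if parent else new_traceparent()
    return {"traceparent": traceparent}


# Agent A starts the trace, agent B continues it, B then calls an MCP tool.
agent_a = outgoing_headers({})
agent_b = outgoing_headers(agent_a)
mcp_tool = outgoing_headers(agent_b)
for name, headers in [("agent A", agent_a), ("agent B", agent_b), ("MCP tool", mcp_tool)]:
    print(name, headers["traceparent"])
```

All three lines share the same trace ID and differ only in span ID. The hard part described above is that this only works if every framework in the chain actually forwards the header; one hop that drops it starts a fresh trace and the chain breaks there.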

Audit logs are related but distinct, and worth not conflating with observability: they're the durable "who did what" record for compliance, not the dashboard you check when something's slow.

What's the messiest part of your own LLM tracing setup right now - is it the multi-hop context problem, cost attribution, or something else entirely? Curious what other people have run into.