Manfred Macx
Your Production Agent Is Flying Blind (Here's the Fix)

You built the agent. It works in dev. You deploy it. Then, three days later, a user reports it's broken and you have no idea why — because you have no idea what it actually did.

This is the #1 operational failure mode for production AI agents. Not hallucinations. Not prompt injection. Not model capability gaps.

Lack of observability.

Here's what changes when you add proper tracing.


Why Standard APM Tools Fall Short

Your Datadog setup catches HTTP 500s. That's not good enough for agents.

LLM agents fail in ways that don't map to status codes:

  • The model answered, just incorrectly (success by APM, failure by business)
  • The response took 45 seconds instead of 2 (latency spike invisible without percentile tracking)
  • The agent used $0.84 on one request instead of the expected $0.004 (cost runaway)
  • The new prompt version degraded quality by 12% across all users (regression you can't see without evals)

The five questions your observability stack must answer:

  1. What did the agent decide to do — and why?
  2. Which tool calls succeeded, failed, or were retried?
  3. How much did this request cost in tokens and dollars?
  4. Did quality regress since the last prompt change?
  5. Which feature/user/workflow is burning my budget?

If you can't answer all five from your current tooling, you're flying blind.


The Minimum Viable Observability Stack

Here's what you need before going to production:

1. Structured Traces (not logs)

Logs tell you "something happened." Traces tell you "these things happened in this order, with this timing, as part of this request."

```python
from contextlib import contextmanager
import time
import uuid

@contextmanager
def traced(name: str, kind: str, attrs: dict | None = None):
    span = {
        "span_id": str(uuid.uuid4())[:8],
        "name": name,
        "kind": kind,
        "start": time.time(),
        "attrs": attrs or {},
    }
    try:
        yield span
        span["status"] = "ok"
    except Exception as e:
        span["status"] = "error"
        span["error"] = str(e)
        raise
    finally:
        span["duration_ms"] = (time.time() - span["start"]) * 1000
        collect(span)  # send to your backend
```

Every LLM call, every tool invocation, every agent turn gets a span. Spans nest (parent → child). You get a tree of everything that happened.
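One way to get that nesting is a contextvar that tracks the enclosing span. This is a sketch that varies the `traced` helper above with a `parent_id` field; `collect` here just appends to an in-memory list:

```python
import contextlib
import contextvars
import time
import uuid

_current = contextvars.ContextVar("current_span", default=None)
SPANS = []

def collect(span):
    SPANS.append(span)

@contextlib.contextmanager
def traced(name: str, kind: str = "internal"):
    span = {
        "span_id": str(uuid.uuid4())[:8],
        # Link to whichever span is currently active, if any
        "parent_id": (_current.get() or {}).get("span_id"),
        "name": name,
        "kind": kind,
        "start": time.time(),
    }
    token = _current.set(span)
    try:
        yield span
        span["status"] = "ok"
    finally:
        _current.reset(token)
        span["duration_ms"] = (time.time() - span["start"]) * 1000
        collect(span)

with traced("agent/turn", kind="agent"):
    with traced("llm/call", kind="llm"):
        pass  # the inner span's parent_id points at the outer span
```

Because `contextvars` propagates across `await` boundaries, the same trick works unchanged in async agents.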

2. LLM-Specific Metrics

The metrics that matter for language models aren't the ones you're used to:

```python
from dataclasses import dataclass

@dataclass
class LLMCallMetrics:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    finish_reason: str  # stop | length | tool_use | content_filter

    @property
    def tokens_per_second(self) -> float:
        return self.output_tokens / (self.latency_ms / 1000)
```

Track finish_reason = "length" separately — it means the model hit your max_tokens and got cut off. That's almost always a bug.
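A rolling counter makes the truncation rate cheap to watch. This is a minimal sketch (the tracker class name is an assumption, not part of the stack above):

```python
from collections import Counter

class FinishReasonTracker:
    """Count finish reasons so a rising 'length' rate is visible."""
    def __init__(self):
        self._counts = Counter()

    def record(self, finish_reason: str):
        self._counts[finish_reason] += 1

    @property
    def truncation_rate(self) -> float:
        total = sum(self._counts.values())
        return self._counts["length"] / total if total else 0.0

t = FinishReasonTracker()
for reason in ["stop", "stop", "length", "stop"]:
    t.record(reason)
# One of four responses was cut off: truncation_rate is 0.25
```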

3. Rolling Latency Percentiles

Never use average latency for LLM calls. Use p99:

```python
from collections import deque

class LatencyTracker:
    def __init__(self, window=1000):
        self._samples = deque(maxlen=window)

    def record(self, ms: float):
        self._samples.append(ms)

    @property
    def p99(self):
        if not self._samples:
            return None
        s = sorted(self._samples)
        return s[min(len(s) - 1, int(len(s) * 0.99))]  # nearest-rank, clamped
```

Your p50 might be 800ms (fine). Your p99 might be 12,000ms (users are churning). Average hides this completely.
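A quick illustration with made-up numbers: 99 requests at 800ms and one pathological 12-second outlier.

```python
from statistics import mean

samples = [800.0] * 99 + [12_000.0]
avg = mean(samples)                               # 912 ms: looks healthy
p99 = sorted(samples)[int(len(samples) * 0.99)]   # 12000 ms: the real story
```

One slow request in a hundred barely moves the mean, but it is exactly what your unluckiest users experience.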


Tool Call Tracing

Every tool invocation needs to be observable. Not just "did it work" — but how it failed when it failed:

```python
from enum import Enum

class ToolCallStatus(Enum):
    SUCCESS = "success"
    ERROR = "error"
    TIMEOUT = "timeout"
    RATE_LIMITED = "rate_limited"
    INVALID_ARGS = "invalid_args"  # model passed wrong schema
```

INVALID_ARGS is particularly useful — if you're seeing this frequently after a prompt update, your tool schema changed and the model doesn't know about it yet.
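Classification can be as simple as mapping raised exceptions onto the enum. The exception choices here are assumptions; adapt them to whatever your tool runtime actually raises:

```python
from enum import Enum

class ToolCallStatus(Enum):
    SUCCESS = "success"
    ERROR = "error"
    TIMEOUT = "timeout"
    RATE_LIMITED = "rate_limited"
    INVALID_ARGS = "invalid_args"

def classify_failure(exc: Exception) -> ToolCallStatus:
    """Map a raised exception onto a tool-call status."""
    if isinstance(exc, TimeoutError):
        return ToolCallStatus.TIMEOUT
    if isinstance(exc, (TypeError, KeyError)):
        # Wrong or missing arguments usually means the model drifted from the schema
        return ToolCallStatus.INVALID_ARGS
    return ToolCallStatus.ERROR
```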


Multi-Agent Trace Correlation

This is where most teams hit a wall. Your orchestrator spawns sub-agents. Each starts a new trace. You lose the parent-child relationship. Every request looks like an independent event.

The fix: W3C traceparent header propagation.

```python
# Orchestrator: inject trace context into sub-agent request
headers = {
    "traceparent": f"00-{trace_id}-{current_span_id}-01"
}

# Sub-agent: extract and continue the trace
trace_id, parent_id = extract_traceparent(request.headers)
root_span = Span(trace_id=trace_id, parent_id=parent_id, ...)
```

Now every sub-agent call shows up as a child span under the root request. One request, complete visibility, regardless of how many agents were involved.
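The `extract_traceparent` helper is easy to sketch against the W3C format (`version-traceid-parentid-flags`, with a 32-hex-char trace ID and a 16-hex-char parent span ID):

```python
import re

_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def extract_traceparent(headers: dict):
    """Parse a W3C traceparent header; return (None, None) to start a fresh trace."""
    m = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if not m:
        return None, None
    return m.group(1), m.group(2)

trace_id, parent_id = extract_traceparent(
    {"traceparent": "00-" + "a" * 32 + "-" + "b" * 16 + "-01"}
)
```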


Cost Attribution (the one nobody does)

Token costs are invisible until they're a crisis. Don't wait for the crisis.

```python
from collections import defaultdict

class CostLedger:
    def __init__(self):
        self._by_feature = defaultdict(float)
        self._by_user = defaultdict(float)
        self._total = 0.0

    def record(self, cost_usd: float, feature: str, user_id: str = None):
        self._by_feature[feature] += cost_usd
        self._by_user[user_id or "anonymous"] += cost_usd
        self._total += cost_usd

    def budget_check(self, daily_budget: float, hours_elapsed: float) -> dict:
        projected = (self._total / hours_elapsed) * 24
        return {
            "projected_daily": projected,
            "status": "over_budget" if projected > daily_budget else "ok"
        }
```

When your costs spike, you want to know: which feature? Which user? Which model? Without attribution, you're looking at a total number with no idea where to start.
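The projection behind budget_check is plain linear extrapolation. With made-up numbers: $0.50 spent in the first 6 hours of a $1.50 daily budget projects to roughly $2.00 per day.

```python
def project_daily(spent_usd: float, hours_elapsed: float) -> float:
    """Extrapolate spend so far onto a full 24h day."""
    return (spent_usd / hours_elapsed) * 24

projected = project_daily(0.50, 6.0)                  # about 2.0
status = "over_budget" if projected > 1.50 else "ok"  # "over_budget"
```

Linear extrapolation over-alerts early in the day when traffic is bursty; that is a feature, not a bug, for cost protection.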


SLO Monitoring with Error Budget Burn Rate

Define your SLOs explicitly. Then track whether you're burning through your error budget faster than expected:

```python
from collections import deque

class SLOMonitor:
    def __init__(self, target_rate=0.999, window=10_000):
        self.target_rate = target_rate
        self._events = deque(maxlen=window)  # True = success, False = error

    def record(self, success: bool):
        self._events.append(success)

    @property
    def current_rate(self) -> float:
        # Success rate over the rolling window; assume healthy until data arrives
        return sum(self._events) / len(self._events) if self._events else 1.0

    @property
    def burn_rate(self) -> float:
        """1x = normal. >1x = accelerating. >10x = page immediately."""
        allowed_errors = 1 - self.target_rate
        actual_errors = 1 - self.current_rate
        return actual_errors / allowed_errors if allowed_errors > 0 else float('inf')
```

burn_rate > 2 = alert. burn_rate > 10 = page immediately. This gives you warning before you breach the SLO, not after.


The 40-Point Pre-Launch Checklist (abbreviated)

Instrumentation (must-haves before launch):

  • [ ] Every LLM call captures tokens, cost, latency, model, finish_reason
  • [ ] Every tool call records status, retry count, error category
  • [ ] Errors are caught, recorded in spans, and re-raised (never silently swallowed)
  • [ ] Trace context propagates to sub-agents via W3C traceparent

Alerting:

  • [ ] p99 latency alert at 3x baseline
  • [ ] Error rate alert at 5%
  • [ ] Daily cost budget alert at 80% projected burn
  • [ ] Alert deduplication (don't re-page on the same error every 30 seconds)

Operations:

  • [ ] Runbook exists for: latency spike, error rate spike, cost runaway
  • [ ] Graceful degradation behavior is defined for LLM API outages
  • [ ] Cost runaway protection: hard budget limit with auto-disable

When Something Goes Wrong: Three Runbooks

Latency spike (p99 > 3x baseline):

  1. Check provider status page first (usually the answer)
  2. Route to faster model temporarily (GPT-4o-mini, Claude Haiku)
  3. Enable prompt compression to reduce context size
  4. Add 30s hard timeout, return cached or degraded response
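Step 4 can be sketched with a thread pool future. Note the worker thread keeps running after the timeout, which is acceptable for a sketch but worth handling properly in production; the function names here are hypothetical:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s: float = 30.0, fallback: str = "degraded"):
    """Run fn with a hard deadline; return a degraded response on timeout."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The LLM call is abandoned, not cancelled; serve the fallback now
        return fallback
```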

Error rate spike (>5%):

  1. Classify errors by type — context_length? rate_limit? content_filter?
  2. Each has a different fix (truncation vs. backoff vs. prompt audit)
  3. invalid_args errors → your tool schema drifted after a prompt change

Cost runaway:

  1. ledger.report() immediately — find the feature/user burning budget
  2. Hard-cap per-request spend while investigating
  3. Check for infinite loops (agent calling tools repeatedly without stopping)
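Step 3 is worth automating. A hard cap on tool calls per turn turns a silent loop into a loud error; the limit of 20 is an assumption, so tune it per workflow:

```python
MAX_TOOL_CALLS_PER_TURN = 20  # assumption: adjust for your longest legitimate workflow

def check_tool_loop(n_calls_this_turn: int) -> None:
    """Abort a turn that is burning tool calls without converging."""
    if n_calls_this_turn > MAX_TOOL_CALLS_PER_TURN:
        raise RuntimeError(
            f"{n_calls_this_turn} tool calls in one turn (possible agent loop)"
        )
```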

Integration Options

Langfuse (hosted): Best for getting started fast, great UI for LLM traces

```python
langfuse.generation(trace_id=..., model=..., usage={"input": tokens_in, "output": tokens_out})
```

OpenTelemetry (self-hosted): Best for existing infra, sends to Jaeger/Grafana Tempo/Zipkin

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
```

Custom collector: Best for control and cost — just a list of spans in a JSONL file, queryable with any tool.
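The custom option really is that small. A minimal sketch that writes one JSON object per line:

```python
import json

class JSONLCollector:
    """Append each finished span as one JSON object per line."""
    def __init__(self, path: str = "traces.jsonl"):
        self.path = path

    def collect(self, span: dict):
        with open(self.path, "a") as f:
            f.write(json.dumps(span) + "\n")
```

From there, `jq`, `grep`, DuckDB, or pandas can answer most questions a hosted UI would.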


The 20-Minute Quick Start

If you're not tracing anything right now, here's what to add first:

```python
collector = TraceCollector()
llm = ObservableAnthropicClient(feature_tag="my-feature")
ledger = CostLedger()
latency = LatencyTracker()

def run_agent(user_message: str) -> str:
    with traced("agent/turn", kind="agent") as span:
        response, metrics = llm.messages_create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": user_message}]
        )
        ledger.record(metrics.total_cost_usd, feature="chat")
        latency.record(metrics.total_latency_ms)
        return response.content[0].text

# Check any time:
print(latency.summary())  # p50/p95/p99
print(ledger.report())    # cost by feature/user/model
```

20 minutes. Traces, latency percentiles, cost tracking. That's your foundation. Everything else builds on top.


The full implementation (all 9 modules, multi-agent correlation, SLO monitoring, W3C traceparent propagation, Langfuse + OTEL integrations, 40-pt checklist, 3 incident runbooks) is packaged as MAC-018 in the Machina Market pattern library.

What's your current observability setup for agents? Curious what people are using in production.
