Trace Sampling for Agents: Keeping the Trajectories That Matter

#ai #agents #observability #tracing

Book: Observability for LLM Applications — Tracing, Evals, and Shipping AI You Can Trust
Also by me: Agents in Production — the companion book in The AI Engineer's Library (2-book series)
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You get paged. The triage agent is writing wrong labels. You open your backend, filter to the last hour, and find the failing run. You click it. Half the trace is there. Five chat spans, no tool spans, no invoke_agent parent. Someone turned on head sampling to cut the bill, and the sampler kept a random 10% of spans without caring which trace they belonged to. The one trajectory you need to read is a fragment. You cannot debug what you cannot see, so you spend the next hour reading raw spans instead of the fifteen minutes it should have taken.

That is the failure that trace sampling for agents exists to prevent. The goal is not "keep less data." The goal is to keep every trajectory that teaches you something and drop the ones that do not, without ever keeping half of one.

Why head sampling breaks on agents

Head sampling decides at the start of a trace, before anything has happened. Roll a die on the first span, keep 10%, drop the rest. It works for high-volume web traffic where one request looks like the next and the failures are already caught by your error tracker.

Agents break both assumptions. A single run can produce thirty spans across five seconds and fourteen decisions. When you make the keep-or-drop call at span one, you have no idea yet whether this run will loop search_kb twenty times, blow past its token budget, or quietly emit a wrong answer. The information you would sample on does not exist until the trajectory finishes.

So a random 10% at the head throws away 90% of your failures, and most of what it does keep is boring successes. That is backwards. You want the opposite ratio: nearly every failure kept, and only a slice of the successes.
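To put rough numbers on that, here is a back-of-the-envelope sketch. The traffic volume, the 2% failure rate, and the function name kept_counts are illustrative assumptions, not measurements. It compares a flat 10% head sample with an outcome-aware policy that keeps every failure and a 15% slice of successes.

```python
def kept_counts(runs: int, failure_rate: float,
                keep_failures: float, keep_successes: float) -> dict:
    """Expected number of kept trajectories under a sampling policy."""
    failures = runs * failure_rate
    successes = runs - failures
    return {
        "failures_kept": round(failures * keep_failures),
        "failures_lost": round(failures * (1 - keep_failures)),
        "successes_kept": round(successes * keep_successes),
    }


# Assumed traffic: 10,000 runs, 2% of them fail.
head = kept_counts(10_000, 0.02, keep_failures=0.10, keep_successes=0.10)
outcome_aware = kept_counts(10_000, 0.02, keep_failures=1.0, keep_successes=0.15)

print(head)           # {'failures_kept': 20, 'failures_lost': 180, 'successes_kept': 980}
print(outcome_aware)  # {'failures_kept': 200, 'failures_lost': 0, 'successes_kept': 1470}
```

Under these assumptions the outcome-aware policy stores about 1.7 times as many traces as the head sampler, but it holds all 200 failures instead of 20.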

Sample the trajectory, never half a tree

Two rules, and the second one is the one teams forget.

First: make the sampling decision at the invoke_agent level, not the chat level. The unit of debugging is the trajectory. The unit of sampling has to match it.

Second: when you keep a trajectory, keep the complete tree. Every chat span, every tool span, every nested handoff. A half-sampled trajectory, five chat spans out of fifteen, is worse than no trace at all, because it looks complete and lies to you. You read it, reconstruct a story from the surviving spans, and the story is wrong.

This is what tail sampling gives you that head sampling cannot. Tail sampling buffers the whole trace, waits for it to finish, then decides. By decision time you know the outcome, the latency, and the cost, and you can keep or drop the entire tree as one unit.
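The buffer-then-decide idea can be sketched in a few lines. This is a toy model, not how the OpenTelemetry Collector is implemented: the Span and TailBuffer classes are hypothetical, and it treats the arrival of the root span as "the trace is finished", where a real collector waits a fixed window instead.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the invoke_agent root
    name: str
    error: bool = False


class TailBuffer:
    """Buffers spans per trace and decides once the root span has ended."""

    def __init__(self, should_keep: Callable[[list], bool]):
        self._pending = defaultdict(list)
        self._should_keep = should_keep
        self.kept = []

    def on_span_end(self, span: Span) -> None:
        self._pending[span.trace_id].append(span)
        # Simplification: children usually end before their parent, so the
        # root ending is treated as the end of the trajectory.
        if span.parent_id is None:
            tree = self._pending.pop(span.trace_id)
            if self._should_keep(tree):
                self.kept.append(tree)  # the whole tree, or nothing


# Keep only trajectories that contain an errored span.
buf = TailBuffer(should_keep=lambda tree: any(s.error for s in tree))
for tid, failed in [("t1", False), ("t2", True)]:
    buf.on_span_end(Span(tid, "c1", "r", "chat"))
    buf.on_span_end(Span(tid, "x1", "r", "execute_tool search_kb", error=failed))
    buf.on_span_end(Span(tid, "r", None, "invoke_agent triage-agent"))

print([len(tree) for tree in buf.kept])  # [3]
```

The clean run t1 is dropped entirely and the failed run t2 is stored with all three of its spans. No code path stores part of a tree.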

The four things you always keep

Tail sampling lets you write a keep-list based on what actually happened. For agents, four categories earn an always-keep:

Failed runs. Any trajectory where a span carries an error status. A tripped budget, a tool that threw, a guardrail that blocked. These are the runs you get paged about.

Slow runs. Any trajectory whose root span runs past a latency threshold. Ten seconds for a triage agent means it looped or a tool hung. The slow tail is where the runaways live.

Expensive runs. Any trajectory whose total token count crosses a ceiling. Latency does not always catch cost. A run can finish fast and still burn 200K input tokens because context accreted on every turn. Sample on tokens directly.

Eval traffic. Every run tagged as coming from your eval suite or CI, kept at 100%, always. This is the traffic you grade trajectories against later. Losing an eval trace to a probabilistic drop means a hole in your regression data. Tag it, and never sample it away.

Everything else is the boring happy path. Keep a baseline slice of it, around 10 to 20%, so you still have a picture of what normal looks like, and drop the rest.
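As a sketch, the keep-list above reduces to one function over a finished trajectory. The TrajectorySummary type and keep_reason function are hypothetical names, and the default thresholds are only illustrative starting points. The baseline uses a hash of the trace ID so the decision for a given trace is stable, which is one reasonable choice among several.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrajectorySummary:
    trace_id: str
    has_error: bool
    duration_ms: float
    total_tokens: int
    traffic_source: Optional[str] = None  # "eval", "ci", or None for production


def keep_reason(t: TrajectorySummary, *, slow_ms: float = 10_000,
                token_ceiling: int = 100_000,
                baseline_pct: int = 15) -> Optional[str]:
    """Return why a trajectory is kept, or None if it should be dropped."""
    if t.has_error:
        return "failed"
    if t.duration_ms >= slow_ms:
        return "slow"
    if t.total_tokens >= token_ceiling:
        return "expensive"
    if t.traffic_source in ("eval", "ci"):
        return "eval"
    # Baseline slice: map the trace ID to a bucket in 0..99 so the same
    # trace always gets the same answer.
    bucket = int(hashlib.sha256(t.trace_id.encode()).hexdigest(), 16) % 100
    return "baseline" if bucket < baseline_pct else None


fast_but_costly = TrajectorySummary("abc", False, 2_500, 200_000)
print(keep_reason(fast_but_costly))  # expensive
```

The example run finishes in 2.5 seconds, so a latency check alone would miss it. The token ceiling is what catches it.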

The collector config

The OpenTelemetry Collector's tail_sampling processor does this natively. The policies below combine as an OR: if any one of them votes to keep, the whole trace is kept.

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: keep-failed
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 10000
      - name: keep-expensive
        type: numeric_attribute
        numeric_attribute:
          key: gen_ai.usage.total_tokens
          min_value: 100000
      - name: keep-eval-traffic
        type: string_attribute
        string_attribute:
          key: agent.traffic.source
          values: [eval, ci]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 15

decision_wait is how long the collector buffers a trace after the first span before it decides. Set it above your p99 trajectory duration, or a slow run finishes after the decision and gets cut anyway. num_traces is the number of traces held in memory at once; size it to at least your new traces per second times decision_wait, plus headroom for bursts.
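The sizing arithmetic is simple enough to write down. This is a rough sketch: the helper name num_traces_needed, the 200 traces per second, and the 2x headroom factor are assumptions for illustration, so substitute your own measured throughput.

```python
import math


def num_traces_needed(traces_per_second: float, decision_wait_s: float,
                      headroom: float = 2.0) -> int:
    """Traces in flight during the wait window, padded for bursts."""
    return math.ceil(traces_per_second * decision_wait_s * headroom)


# Assumed load: 200 new trajectories per second, decision_wait of 30s.
print(num_traces_needed(200, 30))  # 12000
```

Read the other way, the num_traces: 100000 in the config above with a 30s wait covers roughly 3,300 new traces per second before any headroom.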

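The latency policy measures a trace from its earliest span start to its latest span end. When the invoke_agent root wraps the whole run, that is the root span duration, which is the true wall-clock of the trajectory. Do not try to sum child spans for this. Tool calls sometimes run concurrently, and the sum then overstates the elapsed time.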
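A small illustration of how far the sum can drift from wall-clock time when tool calls overlap. The timings are made up, and lookup_customer and fetch_ticket are hypothetical tool names.

```python
def wall_clock_ms(spans: list) -> int:
    """Elapsed time from the earliest start to the latest end."""
    return max(end for _, _, end in spans) - min(start for _, start, _ in spans)


def summed_ms(spans: list) -> int:
    """Naive total: every span's duration added together."""
    return sum(end - start for _, start, end in spans)


# (name, start_ms, end_ms) for three tool calls issued concurrently.
tool_calls = [
    ("search_kb", 0, 800),
    ("lookup_customer", 0, 900),
    ("fetch_ticket", 100, 700),
]

print(wall_clock_ms(tool_calls))  # 900
print(summed_ms(tool_calls))      # 2300
```

A threshold applied to the summed figure would flag this step as more than twice as slow as it really was.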

Make the signals visible

Two of those policies only work if your harness writes the signal onto the root span before it ends. The collector reads what you emit.

status_code: ERROR needs you to set the span status when a run fails or a budget fires. gen_ai.usage.total_tokens needs you to roll up the per-call token counts and stamp the total on the invoke_agent root.

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("triage-agent")

with tracer.start_as_current_span(
    "invoke_agent triage-agent"
) as root:
    total_tokens = 0
    stop_reason = "final_answer"

    for step in run_loop(messages, tools):
        # each step emits its own chat/tool child spans
        total_tokens += step.input_tokens
        total_tokens += step.output_tokens
        if step.budget_hit:
            stop_reason = step.budget_hit
            break

    root.set_attribute(
        "gen_ai.usage.total_tokens", total_tokens
    )
    root.set_attribute("gen_ai.agent.stop_reason", stop_reason)

    if stop_reason != "final_answer":
        root.set_status(StatusCode.ERROR, stop_reason)

The rollup on the root is the same aggregation your finance dashboard already wants for per-trajectory cost. You emit it once and two systems read it: the cost view and the sampler.

Tag your eval traffic at the source

The keep-eval-traffic policy is only as good as the tag. Set agent.traffic.source on the root span for every run that comes from an eval harness or CI, so those runs bypass the probabilistic drop. The tag goes on wherever the run starts. If your eval suite is written in TypeScript, that looks like this:

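import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("triage-agent");

await tracer.startActiveSpan(
  "invoke_agent triage-agent",
  async (root) => {
    // tag first, so the attribute is present even if the run throws
    root.setAttribute("agent.traffic.source", "eval");
    try {
      await runTrajectory(testCase);
    } catch (err) {
      // mark the failure on the root so keep-failed matches it too
      root.recordException(err as Error);
      root.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      root.end();
    }
  },
);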

Production traffic leaves the attribute unset, so it falls through to the 15% baseline. Eval and CI runs carry the tag and land every time.

The gotcha nobody mentions: route by trace ID

Tail sampling needs the whole trace in one place to decide. If you run more than one collector behind a load balancer, the spans of a single trajectory can scatter across instances, and no single collector ever sees the complete tree. Each one decides on a fragment, and you are back to half-sampled trajectories.

The fix is a two-tier collector layout. A first tier receives spans and routes them with the loadbalancing exporter keyed on trace ID, so every span of one trajectory lands on the same second-tier collector. The tail_sampling processor runs only on the second tier, where each trace is now whole. Skip this and your keep-the-complete-tree rule quietly stops holding under load.
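The property that matters is that the destination is a pure function of the trace ID. A minimal sketch of that idea follows. The pick_collector function and the collector hostnames are hypothetical, and the real loadbalancing exporter uses its own consistent-hashing scheme rather than this plain modulo.

```python
import hashlib


def pick_collector(trace_id: str, collectors: list) -> str:
    """Map a trace ID to one second-tier collector, deterministically."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


tier2 = ["sampler-0:4317", "sampler-1:4317", "sampler-2:4317"]

# Spans of one trajectory, arriving at different first-tier instances.
spans = [("trace-a", "chat"), ("trace-a", "execute_tool"), ("trace-a", "invoke_agent")]
destinations = {pick_collector(trace_id, tier2) for trace_id, _ in spans}

print(len(destinations))  # 1
```

Every span of trace-a resolves to the same sampler no matter which first-tier instance received it. A plain modulo reshuffles most traces when the collector list changes size, which is the usual reason to prefer a consistent-hashing ring in practice.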

Tune it by watching what you kept

Start at a 15% baseline with the four always-keeps, then look at your kept-trace mix after a day. If failures and slow runs dominate what you stored, the baseline is fine. If the boring happy path is still eating your storage, drop the baseline to 10%. If you find yourself wishing you had more normal runs to compare against a regression, raise it.
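One way to look at the mix, assuming you can label each stored trace with the reason it was kept, for example by querying your backend on the same attributes the policies use. The kept_mix helper, the reason labels, and the counts are all hypothetical.

```python
from collections import Counter


def kept_mix(kept_reasons: list) -> dict:
    """Percentage of stored traces per keep reason, largest first."""
    counts = Counter(kept_reasons)
    total = sum(counts.values())
    return {reason: round(100 * n / total, 1) for reason, n in counts.most_common()}


# Assumed one-day sample of stored traces.
sample = (["failed"] * 120 + ["slow"] * 60 + ["expensive"] * 20
          + ["eval"] * 300 + ["baseline"] * 1500)

print(kept_mix(sample))
# {'baseline': 75.0, 'eval': 15.0, 'failed': 6.0, 'slow': 3.0, 'expensive': 1.0}
```

In this made-up sample the happy path is three quarters of what was stored, which is the situation where lowering the baseline is worth trying.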

The number to protect is not the storage bill. It is the odds that the one trajectory you get paged about is sitting in your backend, complete, when you go looking for it.

If this was useful

If you are building the agent loop that emits these trajectories, Agents in Production covers the harness, the budgets, and the decision spans that make a trace worth sampling. If you are wiring the tracing and eval side, Observability for LLM Applications goes deep on OpenTelemetry attributes, tail sampling, and grading trajectories against a gold sequence. The two books in The AI Engineer's Library are meant to sit next to each other on exactly this kind of problem.