Gabriel Anhaia

Posted on Jul 4

What to Capture on an Agent Span (and What to Redact)

#ai #agents #observability #python

You get paged. A support agent wrote the wrong refund amount into a
customer email. You open your tracing backend, find the trajectory,
and start reading the spans. Every span is green. Every tool call
returned. And there, on the chat span three turns in, sits the raw
customer message with a full credit-card number in the prompt, now
copied into your observability vendor, your log pipeline, and the
screenshot you just pasted into Slack.

Two problems in one trace. You could not tell where the agent went
wrong because the span did not record the decision. And you leaked
PII because the span recorded the payload verbatim. Those two
failures pull in opposite directions, and getting the balance right
is most of what span design for agents is about.

The rest is knowing which attributes are safe to record raw, and
which need a redaction pass before they leave the process.

Start with the four attributes you will actually query

An agent turn is not one LLM call. It is a loop: the model decides,
a tool runs, the result comes back, the model decides again. The
span that wraps the whole loop is your invoke_agent parent. Attach
four attributes to it and you can answer almost every operational
question later.

from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

with tracer.start_as_current_span(
    "invoke_agent support-agent"
) as span:
    span.set_attribute("gen_ai.agent.id", "asst_01H3K")
    span.set_attribute("gen_ai.agent.name", "support-agent")
    span.set_attribute(
        "gen_ai.agent.description",
        "Answers billing questions.",
    )
    span.set_attribute("gen_ai.agent.version", "2.1.0")
    # run the loop; emit child chat/tool spans

id is what you file a bug against. name is the label in the UI.
description is a one-line summary of the system prompt, not the
whole prompt. version is the one people skip and regret. Agents
regress quietly: a prompt tweak, a new tool on the allowlist, a
model swap between two minor revisions. When an eval score drops,
the first thing you do is filter by gen_ai.agent.version to see if
the agent rev moved. Without it, you are guessing.

Record the decision, not just the response

On a plain LLM call, the attribute you care about is what the model
said. On an agent's chat span, the model rarely produced the
user-facing answer. It produced a decision: call this tool with
these arguments, hand off to that agent, or stop. That decision is
the thing worth recording, and it has structure.

span.set_attribute("gen_ai.agent.step", 3)
span.set_attribute("gen_ai.agent.decision", "call_tool")
span.set_attribute(
    "gen_ai.agent.decision.tool", "lookup_order"
)
span.set_attribute(
    "gen_ai.agent.decision.reason",
    "user asked about a refund",
)

Keep decision to a small fixed vocabulary your harness defines and
uses everywhere: call_tool, handoff, final_answer, reflect,
stop. Keep decision.reason short and derived from the model's own
rationale if you have it. These are not stable in the OpenTelemetry
GenAI conventions yet, but the agent backends already read them under
gen_ai.agent.*, and your eval pipeline later scores trajectories by
grouping on gen_ai.agent.decision. That query is impossible if the
attribute was never emitted.
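
One way to hold that vocabulary fixed is to validate against an enum
before anything reaches the span. This is a minimal sketch, not part of
any SDK: Decision and decision_attributes are hypothetical names, the
120-character cap on the reason is an assumed default, and the keys are
the custom gen_ai.agent.* attributes used above.

```python
from enum import Enum
from typing import Optional


class Decision(str, Enum):
    """The closed vocabulary from the text; extend it in one place only."""
    CALL_TOOL = "call_tool"
    HANDOFF = "handoff"
    FINAL_ANSWER = "final_answer"
    REFLECT = "reflect"
    STOP = "stop"


def decision_attributes(
    step: int,
    decision: str,
    tool: Optional[str] = None,
    reason: Optional[str] = None,
    max_reason_chars: int = 120,  # assumed cap; tune to taste
) -> dict:
    """Build the span attributes for one decision.

    Raises ValueError for a decision outside the vocabulary, so a typo
    fails in the harness instead of fragmenting a dashboard.
    """
    attrs = {
        "gen_ai.agent.step": step,
        "gen_ai.agent.decision": Decision(decision).value,
    }
    if tool is not None:
        attrs["gen_ai.agent.decision.tool"] = tool
    if reason:
        attrs["gen_ai.agent.decision.reason"] = reason[:max_reason_chars]
    return attrs
```

The returned dict can be handed to span.set_attributes(...) in one
call, which keeps the attribute names in a single function instead of
scattered across the harness.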

One more on the parent: set gen_ai.agent.step.count when you exit
the loop. It is one integer, the final turn count, and it is what
your "agent exceeded 25 steps" alert keys off. If the loop dies on a
recursion limit, set it in the exception handler before re-raising so
the span still carries the truth.
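
A sketch of that exit path, under some assumptions: run_loop,
StepLimitExceeded, and take_step are hypothetical names, and FakeSpan
is only a stand-in so the example runs without the SDK. It uses a
finally block rather than an except-and-re-raise, so one line covers
both the normal exit and the failure.

```python
class StepLimitExceeded(RuntimeError):
    """Hypothetical error the harness raises when the loop runs too long."""


class FakeSpan:
    """Stand-in for an OpenTelemetry span so the sketch runs on its own."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def run_loop(span, take_step, max_steps=25):
    """Run the agent loop and always record how many steps it took.

    `take_step(n)` is an assumed callable that performs step n and
    returns True when the agent is done.
    """
    steps = 0
    try:
        while True:
            if steps >= max_steps:
                raise StepLimitExceeded(f"exceeded {max_steps} steps")
            steps += 1
            if take_step(steps):
                return steps
    finally:
        # Runs on normal return and on any exception, before it propagates.
        span.set_attribute("gen_ai.agent.step.count", steps)
```

In real code the span would be the invoke_agent parent from
tracer.start_as_current_span, and the count is set while that span is
still open.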

Your child chat spans keep the numbers that let you cost a run:
gen_ai.request.model, gen_ai.usage.input_tokens,
gen_ai.usage.output_tokens. Your execute_tool spans keep
gen_ai.tool.name and gen_ai.tool.call.id. None of that is
sensitive. Record it all.
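
To make "cost a run" concrete, here is a rough sketch that sums cost
from the chat spans' attributes. The prices and the model name are
placeholders, not real rates, and run_cost is a hypothetical helper
that takes each span's attributes as a plain dict.

```python
# Hypothetical prices in USD per million tokens; substitute your
# provider's real rates. The model name is a placeholder too.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def run_cost(chat_spans):
    """Sum the cost of one run from its chat spans' attribute dicts.

    Raises KeyError for a model with no price entry, which is usually
    what you want: an unpriced model should not silently cost zero.
    """
    total = 0.0
    for attrs in chat_spans:
        price = PRICES[attrs["gen_ai.request.model"]]
        total += attrs["gen_ai.usage.input_tokens"] / 1e6 * price["input"]
        total += attrs["gen_ai.usage.output_tokens"] / 1e6 * price["output"]
    return total
```

Because the three attributes sit on every chat span, the same sum works
per run, per agent version, or per customer, depending on what you
group by.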

Payloads are where it goes wrong

The attributes above are metadata. The temptation is to also record
every prompt and every completion in full, because the one time you
did not, that was the trace you needed. Resist the full-fidelity
version. Two reasons.

The first is size. A late-loop chat span on an agent can carry
tens of thousands of tokens of accreted context. Most span exporters
and backends silently truncate or drop attributes past a limit
(the OpenTelemetry SDK defaults to 128 attributes per span, and its
per-value length cap is unset until you configure one). A 40 KB
prompt string on every span is how you blow your ingest bill and lose
the small attributes that mattered.

The second is the leak. The prompt is exactly where user PII lives.
Record it raw and you have copied names, emails, card numbers, and
health details into a third-party vendor that was never in scope for
that data.

So cap the payload before it goes on the span. A blunt truncation
with a marker beats an unbounded string:

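MAX_CHARS = 2048

def clip(text: str, limit: int = MAX_CHARS) -> str:
    if len(text) <= limit:
        return text
    # Keep the result within `limit`, marker included, so a value that
    # is clipped twice (by a caller, then again at export) passes
    # through unchanged the second time.
    marker = f"... [clipped from {len(text)} chars]"
    return text[: max(0, limit - len(marker))] + marker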

Record the clipped payload for debuggability, plus the true length as
its own attribute so you never mistake a clipped span for a short
one:

span.set_attribute("gen_ai.prompt.length", len(prompt))
span.set_attribute("gen_ai.prompt.preview", clip(prompt))

Now the span tells you the input was 9,000 characters and shows you
the first 2,000. That is usually enough to see the shape of what went
in without hauling the whole thing across the wire.

Redact PII without blinding the trace

Truncation caps size. It does not remove the card number in the
first 2,000 characters. You need a redaction pass that runs on the
value before it ever reaches set_attribute. The trick is redacting
the identity of the data while keeping its shape, because shape
is what makes a trace debuggable. "The email was malformed" is a real
bug you want to still see after redaction.

import re

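# Order matters: CARD runs before PHONE, whose pattern would also
# match a card-length run of digits.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    # Ends on a digit so the separator after the number is not consumed.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    # Lookbehind instead of \b so a leading "+" is part of the match.
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d ()-]{8,}\d\b"),
}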

def redact(text: str) -> str:
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    return text

Feeding "charge card 4111 1111 1111 1111 for a.b@x.io" through this
returns "charge card [CARD] for [EMAIL]". You kept the sentence
structure, the tool intent, and the fact that a card and an email
were present. You dropped the values. A reviewer reading that span in
your vendor sees a well-formed request and no regulated data.

Chain the two passes so redaction runs first, then the size cap:

def safe(text: str) -> str:
    return clip(redact(text))

span.set_attribute("gen_ai.prompt.preview", safe(prompt))
span.set_attribute(
    "gen_ai.completion.preview", safe(completion)
)

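Regex catches the formatted classes: cards, emails, phones, IBANs,
SSNs. The PATTERNS table above covers only the first three, so add a
pattern for each class your agent can see. Regex will not catch a
free-text name or a home address, so treat
it as the floor, not the ceiling. For the harder classes, run a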
lightweight structured-output pass with Claude to tag spans before
export, or allowlist which tool arguments are safe to record at all.
The lookup_order tool's order_id is fine to keep raw. Its
billing_address is not.
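
A per-tool allowlist for arguments might look like the sketch below.
The lookup_order entries follow the example above; recordable_args,
SAFE_TOOL_ARGS, and the gen_ai.tool.args.<name> key format are
assumptions for illustration, not an established convention.

```python
# Per-tool allowlist of argument names that are safe to record raw.
SAFE_TOOL_ARGS = {
    "lookup_order": {"order_id"},
}


def recordable_args(tool_name, args):
    """Return span attributes for a tool call's arguments.

    Allowlisted arguments keep their value. Everything else becomes a
    placeholder, so the trace still shows the argument was passed.
    A tool with no allowlist entry gets no raw values at all.
    """
    allowed = SAFE_TOOL_ARGS.get(tool_name, set())
    attrs = {}
    for name, value in args.items():
        key = f"gen_ai.tool.args.{name}"  # assumed key format
        attrs[key] = str(value) if name in allowed else "[REDACTED]"
    return attrs
```

The default is the important part: an argument nobody has classified
is recorded as a placeholder, so a newly added tool cannot leak until
someone deliberately allowlists its fields.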

Do it once, at the boundary

Sprinkling safe() across your codebase means one forgotten call is
a leak. Put the redaction in a custom SpanProcessor so the prompt
and completion previews pass through it on the way out, even when a
caller forgets:

from opentelemetry.sdk.trace import SpanProcessor

SENSITIVE = {"gen_ai.prompt.preview",
             "gen_ai.completion.preview"}

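class RedactingProcessor(SpanProcessor):
    def on_end(self, span):
        # on_end receives a read-only view of the span, so this writes
        # to the SDK's private _attributes mapping. That is an
        # internal detail: pin the SDK version and re-test on upgrade.
        for key in SENSITIVE:
            val = span.attributes.get(key)
            if isinstance(val, str):
                span._attributes[key] = safe(val)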

    def on_start(self, span, parent_context=None):
        pass

Register it with add_span_processor before the processor that wraps
your exporter, because processors run in the order they were added.
Those two fields are then redacted across every agent and every tool.
It only covers the keys in SENSITIVE, so
add any other free-text attribute you record to that set. The same
idea works in TypeScript with a custom SpanExporter wrapper that
maps over span.attributes before delegating to the real OTLP
exporter.
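
The mapping step of such a wrapper can be sketched in Python too. This
version works on a plain attribute mapping and returns a cleaned copy,
so it does not rely on the span being mutable after it ends.
scrub_attributes and SENSITIVE_KEYS are hypothetical names, and the
scrub argument is whatever redaction function you use, such as safe()
above.

```python
from typing import Any, Callable, Dict, Mapping

SENSITIVE_KEYS = frozenset({
    "gen_ai.prompt.preview",
    "gen_ai.completion.preview",
})


def scrub_attributes(
    attributes: Mapping[str, Any],
    scrub: Callable[[str], str],
    sensitive=SENSITIVE_KEYS,
) -> Dict[str, Any]:
    """Return a copy of `attributes` with sensitive strings scrubbed.

    The input mapping is left untouched. Non-string values and keys
    outside `sensitive` are copied through as they are.
    """
    cleaned = {}
    for key, value in attributes.items():
        if key in sensitive and isinstance(value, str):
            cleaned[key] = scrub(value)
        else:
            cleaned[key] = value
    return cleaned
```

An exporter wrapper would call this on each span's attributes and pass
the cleaned values to the real exporter. How the span is rebuilt for
the delegate depends on the SDK and is left out of the sketch.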

The balance

The cheap attributes are safe to record in full, and they buy you the
difference between a fifteen-minute debug session and a two-hour one.
The payloads are the ones that need a cap and a redaction pass at the
export boundary. Get that split right and you get a trace that shows
you what the agent decided without showing you your customer's card
number: the one you can actually paste into Slack at 3am.

If you are building the agent, Agents in Production
walks through the harness that emits these spans and the tool loop
underneath them. If you are wiring up the tracing, evals, and cost
accounting around it, Observability for LLM Applications
is the companion volume. They are the two halves of The AI
Engineer's Library, and span design like this is the seam where they
meet.

Top comments (1)

ANP2 Network • Jul 4

The SENSITIVE set is the part I'd invert. As written, the processor is still fail-open: safe() can be forgotten, and so can adding the next free-text key to SENSITIVE. If a teammate later records raw tool args or a handoff message under some new gen_ai.agent.* attribute, it exports before anyone notices. At the boundary I'd redact every string attribute by default, then keep a small ALLOWLIST for fields that are known-safe raw, like order_id, model name, token counts, tool name. Forget to classify a new key under that scheme and you lose some trace detail, you don't leak regulated data. That's the whole reason to centralize this in a processor.

I'd also be careful with decision.reason. If it's derived from the model's stated rationale, grouping eval trajectories on it scores self-report rather than behavior. A model can call the right tool and narrate a wrong reason. Keep decision.reason for debugging, and group on the decision plus the observable inputs it keyed off, like tool args or retrieved record ids.