- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The Slack message lands on a Tuesday. Finance: "S3 spend tripled last quarter. What changed?" Engineering: "Nothing." Both are correct. Two months ago, someone added LLM tracing (prompts in, responses out, full payload on every span). Nobody set a retention policy. The bucket grew at 14 GB a day, then 22, then 31.
This isn't a corner case. It's the default shape of every LLM tracing pipeline shipped without a payload strategy. The good news: three fixes, ordered by how cheap they are to deploy, each cuts the bill without breaking the workflow you actually use traces for.
Where the bytes go
A normal OTel span is small. Trace ID, span ID, parent, attributes, events, status. Maybe 2 KB if you've decorated it with HTTP method, status code, user ID, region. The kind of span your APM has stored for a decade at a price you stopped looking at.
An LLM span is not normal. It carries the prompt: system message, full chat history, retrieved context, tool definitions, response schema. Then the response. Then sometimes the reasoning trace if you turned that on. A single span on a long agent turn runs 80 KB. A 47-turn agent run hits 4 MB. At 200 requests per second, the payload bytes outweigh the span metadata roughly 50 to 1.
So when you look at S3 and see the bucket growing 30 GB a day, that's not a span explosion. It's text. Text you wrote into the trace because the SDK said "set gen_ai.prompt" and you obliged.
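A quick back-of-envelope check makes the growth easier to reason about. The sketch below is only arithmetic; the span rate and payload size are hypothetical inputs, not measurements from any particular system.

```python
SECONDS_PER_DAY = 86_400


def daily_payload_gb(spans_per_second: float, payload_bytes: int) -> float:
    """Payload volume written per day, in decimal GB (10**9 bytes)."""
    return spans_per_second * payload_bytes * SECONDS_PER_DAY / 1e9


# Hypothetical inputs: 5 payload-carrying spans per second.
with_payload = daily_payload_gb(5, 80_000)   # 80 KB of prompt + response per span
metadata_only = daily_payload_gb(5, 2_000)   # 2 KB of plain span metadata
print(f"with payloads: {with_payload:.2f} GB/day, metadata only: {metadata_only:.3f} GB/day")
```

Under those assumed inputs, a handful of payload-carrying spans per second is already enough to put a bucket in the tens-of-GB-per-day range, while the same spans without payloads stay under 1 GB.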
The first instinct is to turn down sampling. That's the wrong instinct. The right one is to think about which payloads earn their storage.
Fix 1: sample on success, full retention on error
Production LLM traffic has two populations. The 99% that worked and look identical to the last 10,000 working traces. And the 1% that failed, errored, timed out, returned nonsense, got flagged by an eval, or generated a complaint. Storing the second population is the whole reason you have tracing. Storing all of the first one is the bill.
A sane default samples the success population aggressively and keeps every failure intact. Tail-based sampling makes this trivial because the decision happens after the span finishes. By then you know whether it errored, whether it tripped a hallucination detector, whether latency went over your SLO.
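import random
from opentelemetry.trace import StatusCode

def should_keep_payload(span) -> bool:
    # Always keep payloads for failures, eval-flagged spans, and slow requests.
    if span.status.status_code == StatusCode.ERROR:
        return True
    if span.attributes.get("llm.eval.flagged"):
        return True
    if span.attributes.get("llm.latency_ms", 0) > 5000:
        return True
    # Success path: keep 1 in 50.
    return random.random() < 0.02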
That's it. You still emit the span with metadata, token counts, latency, model name, so dashboards and cost reports stay accurate. You just drop the prompt and response bytes on 98% of the boring traffic.
This single change usually cuts payload storage by 90%+. If you do nothing else from this post, do this.
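To see where a figure like that comes from, the expected share of payloads that survive can be worked out directly. This is a sketch under two stated assumptions: the always-keep rate (errors, flags, slow requests) is a hypothetical 1%, and kept and dropped payloads are roughly the same size.

```python
def kept_fraction(always_keep_rate: float, success_sample_rate: float) -> float:
    """Fraction of spans whose payload survives the keep/drop decision."""
    return always_keep_rate + (1.0 - always_keep_rate) * success_sample_rate


# Assumed: 1% of spans hit an always-keep rule, 2% of the rest are sampled.
frac = kept_fraction(0.01, 0.02)
print(f"payloads kept: {frac:.2%}, payloads dropped: {1 - frac:.2%}")
```

With those inputs, roughly 3% of payloads are kept. Failing agent runs tend to be the long ones, so the byte-level saving will likely be somewhat lower than the span-level one.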
Fix 2: tiered retention
The second pattern teams under-use. Hot, warm, cold. Three buckets, three lifecycle rules, three price points.
Recent traces are the ones engineers actually open. Last 24 hours, definitely. Last 7 days, often. After that, the access pattern collapses. Somebody pulls a 30-day-old trace once a month during a postmortem. Paying S3 Standard prices for that traffic is theatre.
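# S3 lifecycle rule (CloudFormation, under the AWS::S3::Bucket properties)
LifecycleConfiguration:
  Rules:
    - Id: llm-traces-tiering
      Status: Enabled
      Prefix: traces/
      Transitions:
        # S3 rejects transitions to Standard-IA earlier than day 30,
        # so the hot tier effectively lasts 30 days, not 7.
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
        - TransitionInDays: 60
          StorageClass: GLACIER_IR
        - TransitionInDays: 180
          StorageClass: DEEP_ARCHIVE
      ExpirationInDays: 730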
S3 Standard runs ~$0.023 per GB-month. Standard-IA drops to ~$0.0125. Glacier Instant Retrieval lands around $0.004. Deep Archive bottoms at ~$0.00099. The lifecycle transitions cost a few cents per 1k objects, but you make that back inside a day on a busy bucket.
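The per-tier prices translate into a simple blended-cost estimate. The sketch below uses the per-GB-month figures quoted above (they vary by region) and a hypothetical split of 10 TB across tiers; it ignores request, retrieval, transition, and minimum-duration charges.

```python
# USD per GB-month, as quoted in the text; region-dependent.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(gb_by_class: dict) -> float:
    """Storage-only monthly cost for a bucket split across storage classes."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())


# Hypothetical steady state: 10 TB total, most of it old.
all_standard = monthly_storage_cost({"STANDARD": 10_000})
tiered = monthly_storage_cost(
    {"STANDARD": 500, "STANDARD_IA": 1_500, "GLACIER_IR": 3_000, "DEEP_ARCHIVE": 5_000}
)
print(f"all Standard: ${all_standard:.2f}/month, tiered: ${tiered:.2f}/month")
```

The split itself is the assumption that matters: the older the bulk of your data, the closer the blended price gets to the Deep Archive rate.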
One gotcha worth surfacing: Standard-IA has a 128 KB minimum billing size per object. If you're writing one-span-per-object at small payload sizes, you'll pay for 128 KB even when the object is 4 KB, and by default lifecycle rules won't transition objects under 128 KB at all, so they stay in Standard. Batch your trace writes (one object per minute per trace stream, or roll up by trace ID) so each object is at least a few hundred KB. The teams that skip this step end up with IA bills that look like Standard bills and write angry blog posts about how tiering doesn't work.
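One way to batch is a small buffer that accumulates span records as JSON lines and only writes once the object is large enough. This is a minimal sketch: `PayloadBatcher`, the 256 KB target, and the injected `write` callable (which in practice might wrap `s3.put_object`) are all illustrative names, not part of any SDK.

```python
import json
from typing import Callable, List

MIN_OBJECT_BYTES = 256 * 1024  # assumed target, comfortably above the 128 KB floor


class PayloadBatcher:
    """Buffers span records as JSON lines and flushes them as one object."""

    def __init__(self, write: Callable[[bytes], None], min_bytes: int = MIN_OBJECT_BYTES):
        self._write = write
        self._min_bytes = min_bytes
        self._lines: List[bytes] = []
        self._size = 0

    def add(self, record: dict) -> None:
        line = json.dumps(record, separators=(",", ":")).encode("utf-8") + b"\n"
        self._lines.append(line)
        self._size += len(line)
        if self._size >= self._min_bytes:
            self.flush()

    def flush(self) -> None:
        """Write whatever is buffered; call on a timer and at shutdown too."""
        if not self._lines:
            return
        self._write(b"".join(self._lines))
        self._lines, self._size = [], 0
```

A size-only trigger can hold records indefinitely on a quiet stream, so a real deployment would probably also flush on a timer; the final flush may produce one undersized object, which is usually acceptable.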
Fix 3: payload truncation with rehydration tokens
The third fix targets the long tail. A 4 MB agent transcript is the outlier that wrecks averages. You don't want to drop it (the engineer debugging the agent loop needs it) but you also don't want it inlined on the span in your hot trace store.
The pattern: truncate the payload on the span itself, write the full version to object storage under a content-addressed key, and store the key as an attribute. The trace UI shows the truncation up front and offers a rehydrate-on-click button.
import hashlib
import boto3

s3 = boto3.client("s3")
PAYLOAD_BUCKET = "acme-llm-payloads"  # placeholder bucket name

def truncate_with_token(payload: str, span, max_inline: int = 2048) -> str:
    if len(payload) <= max_inline:
        return payload
    body = payload.encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()
    # Content-addressed key: identical payloads map to the same object.
    key = f"traces/payloads/{digest[:2]}/{digest}.txt"
    s3.put_object(Bucket=PAYLOAD_BUCKET, Key=key, Body=body)
    span.set_attribute("llm.payload.s3_key", key)
    span.set_attribute("llm.payload.full_bytes", len(body))  # bytes, not characters
    return payload[:max_inline] + f"\n…[truncated, rehydrate: {digest[:12]}]"
Content-addressed keys mean identical payloads (system prompts, common tool definitions, repeated user queries) dedupe for free. On a real agent workload that's another 40-60% storage win because system prompts are the same on every span and you stop paying to store the same 8 KB block a million times.
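The deduplication effect is easy to demonstrate without S3. In this sketch a plain dict stands in for the bucket; `store_payload` and `rehydrate` are hypothetical helper names, and the key layout mirrors the one used in the function above.

```python
import hashlib
from typing import Dict


def payload_key(text: str) -> str:
    """Derive the object key from the content itself."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"traces/payloads/{digest[:2]}/{digest}.txt"


def store_payload(store: Dict[str, bytes], text: str) -> str:
    key = payload_key(text)
    # Writing the same content again lands on the same key, so nothing grows.
    store.setdefault(key, text.encode("utf-8"))
    return key


def rehydrate(store: Dict[str, bytes], key: str) -> str:
    """Resolve a key stored on the span back to the full payload."""
    return store[key].decode("utf-8")


bucket: Dict[str, bytes] = {}
system_prompt = "You are a helpful assistant. " * 200
k1 = store_payload(bucket, system_prompt)
k2 = store_payload(bucket, system_prompt)          # same prompt on another span
k3 = store_payload(bucket, "a different payload")
print(len(bucket), k1 == k2)
```

Against real S3 the second `put_object` still costs a request even though it stores nothing new, so a high-volume pipeline may want a small in-process cache of recently written digests.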
A 40-line OTel SpanProcessor that does all three
This is the version that ships. It's a BatchSpanProcessor wrapper that runs sampling, truncation, redaction, and rehydration-token rewriting before the span hits the exporter. Drop it in front of whatever exporter you use (Tempo, Honeycomb's OTLP endpoint, an S3-backed pipeline).
import hashlib
import random
import re

from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.trace import StatusCode

# Minimal PII patterns; extend with your own identifiers.
PII = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[email]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[ssn]"),
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "[card]"),
]
PAYLOAD_ATTRS = ("gen_ai.prompt", "gen_ai.response", "llm.input", "llm.output")


class LLMPayloadProcessor(SpanProcessor):
    def __init__(self, downstream, s3, bucket, inline_limit=2048):
        self.downstream, self.s3, self.bucket = downstream, s3, bucket
        self.inline_limit = inline_limit

    def on_start(self, span, parent_context=None):
        self.downstream.on_start(span, parent_context)

    def _keep_full(self, span) -> bool:
        if span.status.status_code == StatusCode.ERROR:
            return True
        if span.attributes.get("llm.eval.flagged"):
            return True
        if span.attributes.get("llm.latency_ms", 0) > 5000:
            return True
        return random.random() < 0.02

    def _redact(self, text: str) -> str:
        for pattern, repl in PII:
            text = pattern.sub(repl, text)
        return text

    def on_end(self, span):
        keep = self._keep_full(span)
        for key in PAYLOAD_ATTRS:
            raw = span.attributes.get(key)
            if not isinstance(raw, str) or not raw:
                continue
            clean = self._redact(raw)  # PII out before anything else
            if not keep:
                # Metadata stays on the span; only the payload bytes are dropped.
                span._attributes[key] = "[sampled-out]"
                continue
            if len(clean) > self.inline_limit:
                body = clean.encode("utf-8")
                digest = hashlib.sha256(body).hexdigest()
                obj_key = f"traces/payloads/{digest[:2]}/{digest}.txt"
                self.s3.put_object(Bucket=self.bucket, Key=obj_key, Body=body)
                span._attributes[key] = (
                    clean[: self.inline_limit] + f"\n…[rehydrate:{digest[:12]}]"
                )
                span._attributes[f"{key}.s3_key"] = obj_key
            else:
                span._attributes[key] = clean
        self.downstream.on_end(span)

    def shutdown(self):
        self.downstream.shutdown()

    def force_flush(self, timeout_millis=30000):
        return self.downstream.force_flush(timeout_millis)
Wire it up like this:
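import boto3
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(
    LLMPayloadProcessor(
        downstream=BatchSpanProcessor(OTLPSpanExporter()),
        s3=boto3.client("s3"),
        bucket="acme-llm-payloads",
    )
)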
A few things that aren't accidents in the code above. PII redaction runs on every payload before the keep-or-drop branch, so nothing unredacted is written to the span or to S3. Content-addressed S3 keys give you free deduplication. The s3_key attribute is what the trace UI uses to offer rehydration, and you can write a tiny Lambda behind a signed URL to serve it. The sampling thresholds are tunable per environment: if the error rate in dev is 30%, loosen the "always keep errors" rule there so it doesn't bury you.
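The signed-URL piece can be quite small. The sketch below assumes a boto3 S3 client is passed in and uses its `generate_presigned_url` call; `rehydrate_url`, the prefix check, and the 300-second TTL are illustrative choices rather than anything prescribed by the pipeline above.

```python
ALLOWED_PREFIX = "traces/payloads/"  # only payload objects may be rehydrated


def rehydrate_url(s3_client, bucket: str, key: str, ttl_seconds: int = 300) -> str:
    """Return a short-lived GET URL for a payload object referenced by a span."""
    if not key.startswith(ALLOWED_PREFIX) or ".." in key:
        # The key comes from a span attribute, so treat it as untrusted input.
        raise ValueError(f"refusing to sign key outside {ALLOWED_PREFIX!r}")
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,
    )
```

A handler behind the trace UI would read the `.s3_key` attribute from the span, call something like `rehydrate_url(boto3.client("s3"), "acme-llm-payloads", key)`, and redirect the engineer's browser to the result. Authorization of the caller is a separate concern this sketch does not cover.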
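The mutation of span._attributes is the one rough edge. OTel's public API treats span attributes as read-only once the span has ended, and on_end only ever sees ended spans. The processor's on_end is called synchronously at the moment the span ends, so the application is no longer writing to it, and BatchSpanProcessor only queues the span for export afterwards. In practice the mutation is safe, though it leans on a private field that could change between SDK releases. If you want to be strictly correct, wrap the span and re-emit it through a custom exporter instead.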
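One related consequence: on_end runs on the thread that ended the span, so a put_object call inside it adds S3 latency to the request path. A possible mitigation is to hand uploads to a background worker. The sketch below is one way to do that; `BackgroundUploader` and the injected `put` callable (which might wrap `s3.put_object`) are hypothetical names.

```python
import queue
import threading
from typing import Callable


class BackgroundUploader:
    """Moves payload uploads off the thread that ends the span."""

    def __init__(self, put: Callable[[str, bytes], None], max_pending: int = 1000):
        self._put = put
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, key: str, body: bytes) -> bool:
        """Enqueue without blocking; returns False if the queue is full."""
        try:
            self._queue.put_nowait((key, body))
            return True
        except queue.Full:
            return False

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel sent by close()
                return
            key, body = item
            try:
                self._put(key, body)
            except Exception:
                pass  # a failed upload must not kill the worker

    def close(self) -> None:
        """Drain everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

The trade-off is that the span can be exported before its payload object exists, and a full queue means a dropped payload. Whether that is acceptable depends on how quickly engineers click through to fresh traces.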
The gotcha: PII redaction before storage, not on retrieval
The instinct that bites teams hardest is "store the raw payload, redact on read." It seems reasonable. You keep the original in case you need it. You serve a redacted version to engineers without a need-to-know. You move on.
Then GDPR shows up, or SOC 2, or a customer data subject request (DSR). Now you owe an auditable answer to "how was personal data stored, who accessed it, how do we delete it?" The answer "we redact on read" means raw PII is sitting in S3 indefinitely. That's the storage event regulators care about, not the display event.
Redact in the span processor, before the export ever happens. The truncated payload that goes to S3 should already have emails, SSNs, card numbers, phone numbers, and any domain-specific identifiers (customer IDs, internal account numbers) replaced with tokens. Keep a separate, access-controlled, encrypted-at-rest pathway for the rare cases where the raw payload is needed for an incident. Make that pathway opt-in per request, not the default state of your trace store.
The redaction step in the processor above is the minimum. Add a per-tenant rule layer if you serve regulated industries. And run the redactor against your own eval set monthly, because the day someone adds a new entity type and forgets to update the regex is the day raw card numbers start hitting your trace store again.
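A recurring check like that can be a few lines. The sketch below copies the three patterns from the processor and runs them over a small, made-up eval set; `EVAL_SET` and `leaks` are illustrative names, and the sample values are synthetic.

```python
import re

# Same three patterns as the processor above.
PII = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[email]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[ssn]"),
    (re.compile(r"\b(?:\d[ -]*?){13,16}\b"), "[card]"),
]


def redact(text: str) -> str:
    for pattern, repl in PII:
        text = pattern.sub(repl, text)
    return text


# Hypothetical eval set: (input text, secret that must not survive redaction).
EVAL_SET = [
    ("mail jane.doe@example.com today", "jane.doe@example.com"),
    ("ssn on file: 123-45-6789", "123-45-6789"),
    ("card 4111 1111 1111 1111 declined", "4111 1111 1111 1111"),
    ("call me at 555-867-5309", "555-867-5309"),
]


def leaks(cases):
    """Return every secret that is still present after redaction."""
    return [secret for text, secret in cases if secret in redact(text)]


print(leaks(EVAL_SET))
```

With only these three patterns the phone-number case comes back as a leak, which is the kind of gap a scheduled run is meant to surface before production traffic does.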
What this gets you
Done together (sample-on-success, tiered retention, truncate-with-rehydration, redact-before-store) the same workload that was costing $9k/month in S3 lands somewhere between $400 and $900. Debugging stays intact because errors and slow paths keep full payloads. Compliance posture improves because PII isn't sitting in object storage waiting to be discovered. Engineers don't notice the change in the trace UI except that long agent traces now have a "load full transcript" button.
The thing nobody tells you when you ship LLM tracing: payload economics are the design decision. Spans are free. Prompts and responses are not.
What's the worst LLM observability bill surprise you've seen, and which of these three fixes would have caught it? Drop it in the comments.
If this was useful
The trade-offs in this post (sampling shape, retention tiers, payload handling, where redaction belongs in the pipeline) are exactly what my LLM Observability Pocket Guide walks through. The chapter on trace pipeline design covers the SpanProcessor patterns above plus the eval-flagging and self-consistency detectors that make sample-on-success actually targeted instead of random. Worth a read if you're picking a tracing tool or trying to make the one you have less expensive.
