Albert Alov
Your LLM traces are write-only

You spent weeks building observability for your LLM app. Traces in Jaeger. Metrics in Grafana. Alerts in Slack. You can see exactly what your model says, how long it takes, and how much it costs.

Then you change the prompt.

Did the model get better? Worse? For which inputs? You have no idea — because your traces are write-only. You observe but never evaluate. Your production data sits in Jaeger and never becomes a test.

We built the bridge from traces to tests. Then we ran it on our own traces and discovered half our spans had no content — because recordContent was off by default. The tool designed to extract test data couldn't extract anything.

Fixed that. Here's the workflow.


The loop nobody closes

Every LLM team has some version of this:

1. Deploy prompt v2
2. Watch dashboards for a few hours
3. "Looks fine, latency is similar, no errors"
4. Move on

"Looks fine" is not evaluation. You're checking system health — latency, errors, cost — but not output quality. Your model could be returning subtly worse answers and you'd never know, because you don't have regression tests built from real production data.

The teams that do this well have a different loop:

1. Collect production inputs and outputs (traces)
2. Extract test cases from real traffic
3. Run the new prompt against those inputs
4. Score: is v2 better than v1?
5. Deploy with confidence

Steps 3 and 4 are what eval frameworks do. The problem is step 2 — getting from traces to test cases. Your traces live in Jaeger. Your eval framework expects YAML datasets. Nobody builds the bridge.

The bridge: export-trace

One CLI command converts a Jaeger trace into a test dataset:

npx toad-eye export-trace abc123def456
✅ Exported trace abc123def456 → ./trace-abc123de.eval.yaml

The generated YAML:

name: exported-trace-abc123de
source: toad-eye-export
metadata:
  trace_id: abc123def456
  exported_at: "2026-03-15T14:22:00.000Z"
  model: gpt-4o
  provider: openai
cases:
  - id: production-case-1
    variables:
      input: "What are the side effects of ibuprofen?"
    assertions:
      - type: max_length
        value: 1500
      - type: not_contains
        value: "i cannot"
  - id: production-case-2
    variables:
      input: '{"action": "summarize", "text": "..."}'
    assertions:
      - type: max_length
        value: 800
      - type: is_json
        value: true

One trace, multiple LLM calls, each becomes a test case. The assertions are auto-generated from what the production model actually returned.
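The mapping itself is simple to picture: filter the trace down to LLM spans with recorded content, then turn each one into a case. A minimal sketch (the span field names here are assumptions for illustration, not toad-eye's actual internals):

```typescript
// Sketch of the trace → dataset mapping: each LLM span in the trace
// becomes one eval case. The LlmSpan field names are assumptions for
// illustration, not toad-eye's real span schema.
interface LlmSpan {
  promptContent: string;
  completionContent: string;
}

interface EvalCase {
  id: string;
  variables: { input: string };
}

function traceToCases(spans: LlmSpan[]): EvalCase[] {
  return spans
    // Spans captured with content recording disabled carry empty
    // strings — there is nothing to turn into a test, so skip them.
    .filter((s) => s.promptContent.length > 0)
    .map((s, i) => ({
      id: `production-case-${i + 1}`,
      variables: { input: s.promptContent },
    }));
}
```

The filter step is also why the recordContent pitfall described below bites so hard: with content recording off, every span falls through it and the dataset comes out empty.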

How assertions are generated

The export doesn't just copy inputs and outputs. It analyzes the production response and creates baseline assertions:

| What it checks | How | Why |
| --- | --- | --- |
| max_length | completion.length × 1.5 | New prompt shouldn't produce wildly longer output |
| not_contains | Checks for refusal phrases | If production didn't refuse, the new prompt shouldn't either |
| is_json | JSON.parse() succeeds | If production returned valid JSON, new prompt must too |

These are conservative baselines — they catch regressions, not improvements. If your current prompt returns a 500-character JSON answer and the new prompt returns a 3,000-character refusal, something is broken. These assertions catch that.
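The heuristics in the table reduce to a small generator. This is a sketch of the idea, not toad-eye's actual implementation — the refusal-phrase list in particular is an illustrative guess:

```typescript
type Assertion = { type: string; value: string | number | boolean };

// Illustrative refusal phrases — the real list is an implementation
// detail of the export, not reproduced here.
const REFUSAL_PHRASES = ["i cannot", "i'm sorry", "as an ai"];

function generateBaselineAssertions(completion: string): Assertion[] {
  const assertions: Assertion[] = [
    // max_length: 1.5× what production actually returned.
    { type: "max_length", value: Math.ceil(completion.length * 1.5) },
  ];

  // not_contains: only assert against phrases production *didn't* use.
  const lower = completion.toLowerCase();
  for (const phrase of REFUSAL_PHRASES) {
    if (!lower.includes(phrase)) {
      assertions.push({ type: "not_contains", value: phrase });
    }
  }

  // is_json: if production returned valid JSON, the new prompt must too.
  try {
    JSON.parse(completion);
    assertions.push({ type: "is_json", value: true });
  } catch {
    // Production output wasn't JSON — no structural requirement to carry.
  }

  return assertions;
}
```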

You add domain-specific assertions on top:

assertions:
  - type: max_length
    value: 1500
  - type: not_contains
    value: "i cannot"
  # Your domain expertise:
  - type: contains
    value: "nausea"
  - type: llm_judge
    value: "Answer is medically accurate and lists at least 3 side effects"

Auto-generated assertions bootstrap the dataset. Your domain knowledge makes it useful.

The prerequisite nobody remembers

By default, toad-eye doesn't record prompts and completions in traces. Article #3 explained why — the OTel spec says don't, and your security team agrees.

But for trace-to-eval export, you need the content.

We learned this the embarrassing way. Built the entire export-trace pipeline, ran it on our own Jaeger instance, and got:

✗ No exportable spans in trace abc123. Was recordContent enabled?

Half our spans had inputs and outputs as empty strings. The tool worked perfectly — on empty data. Classic.

Enable it where it matters:

initObservability({
  serviceName: "my-app",
  recordContent: true,  // enable in staging or for a traffic sample
});

The recommendation: enable recordContent in staging, or use content sampling in production to record a percentage of traffic. Export from those traces. Don't record everything — record enough.
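toad-eye's exact sampling options aren't covered here, so treat this as a sketch of the idea rather than its API: make the sampling decision deterministic on the trace ID, so every span in a trace agrees on whether content was recorded and you never export half-empty traces.

```typescript
// Sketch: deterministic content sampling keyed on trace ID. The hash
// and the shouldRecordContent helper are illustrative, not toad-eye API.
function hashTraceId(traceId: string): number {
  let h = 0;
  for (const ch of traceId) {
    // Simple 32-bit polynomial rolling hash.
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

function shouldRecordContent(traceId: string, sampleRate: number): boolean {
  // Map the 32-bit hash onto [0, 1) and compare against the rate, so a
  // given trace always makes the same decision across all its spans.
  return hashTraceId(traceId) / 0x100000000 < sampleRate;
}
```

Keying on the trace ID rather than `Math.random()` per span is the important part: a coin flip per span is exactly how you end up with traces where half the spans have content and half don't.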

From export to CI

The concrete workflow, compressed:

Find interesting traces in Jaeger. Look for high-token traces (complex reasoning), traces with tool calls (agent behavior), traces near budget limits (cost-sensitive paths). These are your golden test cases.
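If you script the triage instead of eyeballing Jaeger, the filter is a few lines. The summary shape and thresholds below are assumptions for illustration — you'd build the summaries from whatever span attributes your instrumentation records:

```typescript
// Sketch: triaging trace summaries for export candidates. The
// TraceSummary shape and the thresholds are assumptions, not a
// toad-eye API — adapt them to your own span attributes.
interface TraceSummary {
  traceId: string;
  totalTokens: number;
  toolCallCount: number;
  costUsd: number;
}

function isGoldenCandidate(t: TraceSummary, budgetUsd: number): boolean {
  const highTokens = t.totalTokens > 4000;        // complex reasoning
  const usesTools = t.toolCallCount > 0;          // agent behavior
  const nearBudget = t.costUsd > budgetUsd * 0.8; // cost-sensitive paths
  return highTokens || usesTools || nearBudget;
}
```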

Export them:

npx toad-eye export-trace abc123def456 --output ./eval-datasets

Add your assertions to the generated YAML. The scaffolding is there — add the domain-specific checks that matter for your use case.

Run evals on every prompt change:

npx toad-eval run --dataset ./eval-datasets/trace-abc123de.eval.yaml --model gpt-4o

Now you know: does prompt v2 pass the same cases that prompt v1 handled in production? Not "it didn't break in the first 2 hours" confidence — "it passes the same inputs our users actually send" confidence.
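Under the hood, that comparison reduces to running each case's assertions against the new prompt's output and comparing pass rates. A self-contained sketch of just the checking step, covering only the three auto-generated assertion types (a real eval runner handles more, including llm_judge):

```typescript
// Sketch of the assertion-checking core — illustrative, not the
// actual eval runner. Covers only the three auto-generated types.
type Assertion = {
  type: "max_length" | "not_contains" | "is_json";
  value: string | number | boolean;
};

function checkAssertion(output: string, a: Assertion): boolean {
  switch (a.type) {
    case "max_length":
      return output.length <= (a.value as number);
    case "not_contains":
      return !output.toLowerCase().includes((a.value as string).toLowerCase());
    case "is_json":
      try { JSON.parse(output); return true; } catch { return false; }
    default:
      throw new Error(`unknown assertion type: ${(a as Assertion).type}`);
  }
}

// Fraction of cases where the candidate prompt's output passes
// every assertion. Compare v1's rate against v2's.
function passRate(outputs: string[], cases: Assertion[][]): number {
  let passed = 0;
  outputs.forEach((out, i) => {
    if (cases[i].every((a) => checkAssertion(out, a))) passed++;
  });
  return passed / outputs.length;
}
```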

Automate it. The programmable API (exportTrace, fetchTrace, traceToEvalYaml) lets you build a cron job that exports traces nightly from staging, feeds them into CI, and blocks deploys when regressions are detected. The pieces compose.

Why OTel-native matters here

This workflow only works because toad-eye uses OpenTelemetry. The trace format is standard. Jaeger stores it. The export reads it via Jaeger's API. No vendor lock-in, no proprietary format, no "export your data" button that gives you a CSV.

If you're using Langfuse or Arize, you can build the same pipeline — through their API, in their format, with their rate limits. With OTel, your traces are yours. They live in your Jaeger. You query them whenever you want.

What comes next

The manual export covers "build a dataset, run evals in CI." But there's a second mode we're working toward: inline eval callbacks where every completed span triggers a scoring function automatically. No Jaeger query, no manual export — production traffic scores itself in real time.

That's a separate deep dive. For now, the manual pipeline is the foundation — and it's already more than most teams have.


Quick checklist

If you want to start building eval datasets from production traces:

  • Enable recordContent: true in staging or for a traffic sample
  • Find 10-20 traces that represent your core use cases
  • Export with npx toad-eye export-trace <trace_id>
  • Add domain-specific assertions to the generated YAML
  • Run evals against your current prompt — establish the baseline
  • Run evals against every prompt change before deploying
  • Automate: nightly exports, CI runs evals on PR

Your traces already contain the best test data you'll ever get — real inputs from real users. Stop letting them rot in Jaeger.


Previous articles:

toad-eye — open-source LLM observability, OTel-native: GitHub · npm

🐸👁️
