You spent weeks building observability for your LLM app. Traces in Jaeger. Metrics in Grafana. Alerts in Slack. You can see exactly what your model says, how long it takes, and how much it costs.
Then you change the prompt.
Did the model get better? Worse? For which inputs? You have no idea — because your traces are write-only. You observe but never evaluate. Your production data sits in Jaeger and never becomes a test.
We built the bridge from traces to tests. Then we ran it on our own traces and discovered half our spans had no content — because recordContent was off by default. The tool designed to extract test data couldn't extract anything.
Fixed that. Here's the workflow.
## The loop nobody closes
Every LLM team has some version of this:
1. Deploy prompt v2
2. Watch dashboards for a few hours
3. "Looks fine, latency is similar, no errors"
4. Move on
"Looks fine" is not evaluation. You're checking system health — latency, errors, cost — but not output quality. Your model could be returning subtly worse answers and you'd never know, because you don't have regression tests built from real production data.
The teams that do this well have a different loop:
1. Collect production inputs and outputs (traces)
2. Extract test cases from real traffic
3. Run the new prompt against those inputs
4. Score: is v2 better than v1?
5. Deploy with confidence
Steps 3 and 4 are what eval frameworks do. The problem is step 2: getting test cases out of your traces. Your traces live in Jaeger. Your eval framework expects YAML datasets. Nobody builds the bridge.
## The bridge: export-trace
One CLI command converts a Jaeger trace into a test dataset:
```shell
npx toad-eye export-trace abc123def456
✅ Exported trace abc123def456 → ./trace-abc123de.eval.yaml
```
The generated YAML:
```yaml
name: exported-trace-abc123de
source: toad-eye-export
metadata:
  trace_id: abc123def456
  exported_at: "2026-03-15T14:22:00.000Z"
  model: gpt-4o
  provider: openai
cases:
  - id: production-case-1
    variables:
      input: "What are the side effects of ibuprofen?"
    assertions:
      - type: max_length
        value: 1500
      - type: not_contains
        value: "i cannot"
  - id: production-case-2
    variables:
      input: '{"action": "summarize", "text": "..."}'
    assertions:
      - type: max_length
        value: 800
      - type: is_json
        value: true
```
One trace, multiple LLM calls, each becomes a test case. The assertions are auto-generated from what the production model actually returned.
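The mapping can be sketched in a few lines. This is an illustration of the idea, not toad-eye's internals; the span shape (an `input` and `output` string per LLM call, captured via `recordContent`) is an assumption.

```typescript
// Illustrative sketch: turn the LLM spans of one trace into eval cases.
// The LlmSpan shape is assumed, not toad-eye's actual internal type.
interface LlmSpan {
  input: string;
  output: string;
}

interface EvalCase {
  id: string;
  variables: { input: string };
}

export function spansToCases(spans: LlmSpan[]): EvalCase[] {
  return spans
    // Spans with empty content are useless as tests -- this is exactly
    // what an export with recordContent disabled produces.
    .filter((s) => s.input.length > 0 && s.output.length > 0)
    .map((s, i) => ({
      id: `production-case-${i + 1}`,
      variables: { input: s.input },
    }));
}
```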
## How assertions are generated
The export doesn't just copy inputs and outputs. It analyzes the production response and creates baseline assertions:
| What it checks | How | Why |
|---|---|---|
| `max_length` | `completion.length × 1.5` | New prompt shouldn't produce wildly longer output |
| `not_contains` | Checks for refusal phrases | If production didn't refuse, the new prompt shouldn't either |
| `is_json` | `JSON.parse()` succeeds | If production returned valid JSON, new prompt must too |
These are conservative baselines — they catch regressions, not improvements. If your current prompt returns a 500-character JSON answer and the new prompt returns a 3,000-character refusal, something is broken. These assertions catch that.
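Those heuristics fit in one small function. A minimal sketch, assuming a refusal-phrase list of our own; the `Assertion` type and `generateAssertions` name are illustrative, not toad-eye's API:

```typescript
// Sketch of the baseline-assertion heuristics described in the table.
type Assertion =
  | { type: "max_length"; value: number }
  | { type: "not_contains"; value: string }
  | { type: "is_json"; value: boolean };

// Assumed refusal phrases -- the real list would be longer.
const REFUSAL_PHRASES = ["i cannot", "i can't", "as an ai"];

export function generateAssertions(completion: string): Assertion[] {
  const assertions: Assertion[] = [
    // Headroom: the new prompt may run somewhat longer, not wildly longer.
    { type: "max_length", value: Math.ceil(completion.length * 1.5) },
  ];

  // If production didn't refuse, a refusal from the new prompt is a regression.
  const lower = completion.toLowerCase();
  for (const phrase of REFUSAL_PHRASES) {
    if (!lower.includes(phrase)) {
      assertions.push({ type: "not_contains", value: phrase });
    }
  }

  // If production returned valid JSON, the new prompt must too.
  try {
    JSON.parse(completion);
    assertions.push({ type: "is_json", value: true });
  } catch {
    // Not JSON: no structural assertion.
  }
  return assertions;
}
```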
You add domain-specific assertions on top:
```yaml
assertions:
  - type: max_length
    value: 1500
  - type: not_contains
    value: "i cannot"
  # Your domain expertise:
  - type: contains
    value: "nausea"
  - type: llm_judge
    value: "Answer is medically accurate and lists at least 3 side effects"
```
Auto-generated assertions bootstrap the dataset. Your domain knowledge makes it useful.
## The prerequisite nobody remembers
By default, toad-eye doesn't record prompts and completions in traces. Article #3 explained why — the OTel spec says don't, and your security team agrees.
But for trace-to-eval export, you need the content.
We learned this the embarrassing way. Built the entire export-trace pipeline, ran it on our own Jaeger instance, and got:
```
✗ No exportable spans in trace abc123. Was recordContent enabled?
```
Half our spans had inputs and outputs as empty strings. The tool worked perfectly — on empty data. Classic.
Enable it where it matters:
```typescript
initObservability({
  serviceName: "my-app",
  recordContent: true, // enable in staging or for a traffic sample
});
```
The recommendation: enable recordContent in staging, or use content sampling in production to record a percentage of traffic. Export from those traces. Don't record everything — record enough.
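One way to get a traffic sample is to decide per process at init time. This helper and the 5% rate are our own sketch, not a toad-eye feature; check whether your wrapper supports a finer-grained per-request option.

```typescript
// Sketch: record content for roughly `rate` of instances, decided once
// at startup. The rng parameter exists so the decision is testable.
export function shouldRecordContent(
  rate: number,
  roll: number = Math.random()
): boolean {
  return roll < rate;
}

// e.g. content recorded on ~5% of processes (helper name is ours):
// initObservability({
//   serviceName: "my-app",
//   recordContent: shouldRecordContent(0.05),
// });
```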
## From export to CI
The concrete workflow, compressed:
Find interesting traces in Jaeger. Look for high-token traces (complex reasoning), traces with tool calls (agent behavior), traces near budget limits (cost-sensitive paths). These are your golden test cases.
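You don't have to click through the Jaeger UI for this. Jaeger exposes its traces over HTTP (e.g. `GET http://localhost:16686/api/traces?service=my-app&limit=100`), so ranking by token usage is a small script. The token attribute key below follows the OTel GenAI semantic conventions and is an assumption about what your spans record:

```typescript
// Sketch: rank Jaeger traces by total token usage to find golden cases.
// Types mirror the shape of Jaeger's /api/traces JSON response.
interface JaegerTag { key: string; value: unknown }
interface JaegerSpan { tags: JaegerTag[] }
interface JaegerTrace { traceID: string; spans: JaegerSpan[] }

// Sum input + output tokens across all spans of a trace.
// Attribute keys assumed per OTel GenAI semantic conventions.
export function totalTokens(trace: JaegerTrace): number {
  let total = 0;
  for (const span of trace.spans) {
    for (const tag of span.tags) {
      if (
        tag.key === "gen_ai.usage.input_tokens" ||
        tag.key === "gen_ai.usage.output_tokens"
      ) {
        total += Number(tag.value) || 0;
      }
    }
  }
  return total;
}

// Highest-token traces first: complex reasoning, likely golden cases.
export function rankByTokens(traces: JaegerTrace[], top = 20): JaegerTrace[] {
  return [...traces].sort((a, b) => totalTokens(b) - totalTokens(a)).slice(0, top);
}
```

Feed the top trace IDs straight into the export step below.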
Export them:
```shell
npx toad-eye export-trace abc123def456 --output ./eval-datasets
```
Add your assertions to the generated YAML. The scaffolding is there — add the domain-specific checks that matter for your use case.
Run evals on every prompt change:
```shell
npx toad-eval run --dataset ./eval-datasets/trace-abc123de.eval.yaml --model gpt-4o
```
Now you know: does prompt v2 pass the same cases that prompt v1 handled in production? Not "it didn't break in the first 2 hours" confidence — "it passes the same inputs our users actually send" confidence.
Automate it. The programmable API (exportTrace, fetchTrace, traceToEvalYaml) lets you build a cron job that exports traces nightly from staging, feeds them into CI, and blocks deploys when regressions are detected. The pieces compose.
## Why OTel-native matters here
This workflow only works because toad-eye uses OpenTelemetry. The trace format is standard. Jaeger stores it. The export reads it via Jaeger's API. No vendor lock-in, no proprietary format, no "export your data" button that gives you a CSV.
If you're using Langfuse or Arize, you can build the same pipeline — through their API, in their format, with their rate limits. With OTel, your traces are yours. They live in your Jaeger. You query them whenever you want.
## What comes next
The manual export covers "build a dataset, run evals in CI." But there's a second mode we're working toward: inline eval callbacks where every completed span triggers a scoring function automatically. No Jaeger query, no manual export — production traffic scores itself in real time.
That's a separate deep dive. For now, the manual pipeline is the foundation — and it's already more than most teams have.
## Quick checklist
If you want to start building eval datasets from production traces:
- Enable `recordContent: true` in staging or for a traffic sample
- Find 10-20 traces that represent your core use cases
- Export with `npx toad-eye export-trace <trace_id>`
- Add domain-specific assertions to the generated YAML
- Run evals against your current prompt — establish the baseline
- Run evals against every prompt change before deploying
- Automate: nightly exports, CI runs evals on PR
Your traces already contain the best test data you'll ever get — real inputs from real users. Stop letting them rot in Jaeger.
Previous articles:
- #3: OpenTelemetry just standardized LLM tracing
- #4: Your LLM streaming traces are lying to you
- #5: Your AI agent re-sends 80% of your budget every loop
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️