You spent weeks building observability for your LLM app. Traces in Jaeger. Metrics in Grafana. Alerts in Slack. You can see exactly what your model says, how long it takes, and how much it costs.
Then you change the prompt.
Did the model get better? Worse? For which inputs? You have no idea — because your traces are write-only. You observe but never evaluate. Your production data sits in Jaeger and never becomes a test.
We built the bridge from traces to tests. Then we ran it on our own traces and discovered half our spans had no content — because recordContent was off by default. The tool designed to extract test data couldn't extract anything.
Fixed that. Here's the workflow.
## The loop nobody closes
Every LLM team has some version of this:
1. Deploy prompt v2
2. Watch dashboards for a few hours
3. "Looks fine, latency is similar, no errors"
4. Move on
"Looks fine" is not evaluation. You're checking system health — latency, errors, cost — but not output quality. Your model could be returning subtly worse answers and you'd never know, because you don't have regression tests built from real production data.
The teams that do this well have a different loop:
1. Collect production inputs and outputs (traces)
2. Extract test cases from real traffic
3. Run the new prompt against those inputs
4. Score: is v2 better than v1?
5. Deploy with confidence
Steps 3 and 4 are what eval frameworks do. The problem is step 2: getting test cases out of your traces. Your traces live in Jaeger. Your eval framework expects YAML datasets. Nobody builds the bridge.
## The bridge: export-trace
One CLI command converts a Jaeger trace into a test dataset:
```shell
npx toad-eye export-trace abc123def456
✅ Exported trace abc123def456 → ./trace-abc123de.eval.yaml
```
The generated YAML:
```yaml
name: exported-trace-abc123de
source: toad-eye-export
metadata:
  trace_id: abc123def456
  exported_at: "2026-03-15T14:22:00.000Z"
  model: gpt-4o
  provider: openai
cases:
  - id: production-case-1
    variables:
      input: "What are the side effects of ibuprofen?"
    assertions:
      - type: max_length
        value: 1500
      - type: not_contains
        value: "i cannot"
  - id: production-case-2
    variables:
      input: '{"action": "summarize", "text": "..."}'
    assertions:
      - type: max_length
        value: 800
      - type: is_json
        value: true
```
One trace, multiple LLM calls, each becomes a test case. The assertions are auto-generated from what the production model actually returned.
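The mapping can be sketched in a few lines. This is an illustration of the idea, not toad-eye's internals; the span shape (an `input` and `output` string per LLM call, captured via `recordContent`) is an assumption.

```typescript
// Illustrative sketch: turn the LLM spans of one trace into eval cases.
// The LlmSpan shape is assumed, not toad-eye's actual internal type.
interface LlmSpan {
  input: string;
  output: string;
}

interface EvalCase {
  id: string;
  variables: { input: string };
}

export function spansToCases(spans: LlmSpan[]): EvalCase[] {
  return spans
    // Spans with empty content are useless as tests -- this is exactly
    // what an export with recordContent disabled produces.
    .filter((s) => s.input.length > 0 && s.output.length > 0)
    .map((s, i) => ({
      id: `production-case-${i + 1}`,
      variables: { input: s.input },
    }));
}
```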
## How assertions are generated
The export doesn't just copy inputs and outputs. It analyzes the production response and creates baseline assertions:
| What it checks | How | Why |
|---|---|---|
| `max_length` | `completion.length × 1.5` | New prompt shouldn't produce wildly longer output |
| `not_contains` | Checks for refusal phrases | If production didn't refuse, the new prompt shouldn't either |
| `is_json` | `JSON.parse()` succeeds | If production returned valid JSON, new prompt must too |
These are conservative baselines — they catch regressions, not improvements. If your current prompt returns a 500-character JSON answer and the new prompt returns a 3,000-character refusal, something is broken. These assertions catch that.
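Those heuristics fit in one small function. A minimal sketch, assuming a refusal-phrase list of our own; the `Assertion` type and `generateAssertions` name are illustrative, not toad-eye's API:

```typescript
// Sketch of the baseline-assertion heuristics described in the table.
type Assertion =
  | { type: "max_length"; value: number }
  | { type: "not_contains"; value: string }
  | { type: "is_json"; value: boolean };

// Assumed refusal phrases -- the real list would be longer.
const REFUSAL_PHRASES = ["i cannot", "i can't", "as an ai"];

export function generateAssertions(completion: string): Assertion[] {
  const assertions: Assertion[] = [
    // Headroom: the new prompt may run somewhat longer, not wildly longer.
    { type: "max_length", value: Math.ceil(completion.length * 1.5) },
  ];

  // If production didn't refuse, a refusal from the new prompt is a regression.
  const lower = completion.toLowerCase();
  for (const phrase of REFUSAL_PHRASES) {
    if (!lower.includes(phrase)) {
      assertions.push({ type: "not_contains", value: phrase });
    }
  }

  // If production returned valid JSON, the new prompt must too.
  try {
    JSON.parse(completion);
    assertions.push({ type: "is_json", value: true });
  } catch {
    // Not JSON: no structural assertion.
  }
  return assertions;
}
```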
You add domain-specific assertions on top:
```yaml
assertions:
  - type: max_length
    value: 1500
  - type: not_contains
    value: "i cannot"
  # Your domain expertise:
  - type: contains
    value: "nausea"
  - type: llm_judge
    value: "Answer is medically accurate and lists at least 3 side effects"
```
Auto-generated assertions bootstrap the dataset. Your domain knowledge makes it useful.
## The prerequisite nobody remembers
By default, toad-eye doesn't record prompts and completions in traces. Article #3 explained why — the OTel spec says don't, and your security team agrees.
But for trace-to-eval export, you need the content.
We learned this the embarrassing way. Built the entire export-trace pipeline, ran it on our own Jaeger instance, and got:
```
✗ No exportable spans in trace abc123. Was recordContent enabled?
```
Half our spans had inputs and outputs as empty strings. The tool worked perfectly — on empty data. Classic.
Enable it where it matters:
```typescript
initObservability({
  serviceName: "my-app",
  recordContent: true, // enable in staging or for a traffic sample
});
```
The recommendation: enable recordContent in staging, or use content sampling in production to record a percentage of traffic. Export from those traces. Don't record everything — record enough.
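One way to get a traffic sample is to decide per process at init time. This helper and the 5% rate are our own sketch, not a toad-eye feature; check whether your wrapper supports a finer-grained per-request option.

```typescript
// Sketch: record content for roughly `rate` of instances, decided once
// at startup. The rng parameter exists so the decision is testable.
export function shouldRecordContent(
  rate: number,
  roll: number = Math.random()
): boolean {
  return roll < rate;
}

// e.g. content recorded on ~5% of processes (helper name is ours):
// initObservability({
//   serviceName: "my-app",
//   recordContent: shouldRecordContent(0.05),
// });
```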
## From export to CI
The concrete workflow, compressed:
Find interesting traces in Jaeger. Look for high-token traces (complex reasoning), traces with tool calls (agent behavior), traces near budget limits (cost-sensitive paths). These are your golden test cases.
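You don't have to click through the Jaeger UI for this. Jaeger exposes its traces over HTTP (e.g. `GET http://localhost:16686/api/traces?service=my-app&limit=100`), so ranking by token usage is a small script. The token attribute key below follows the OTel GenAI semantic conventions and is an assumption about what your spans record:

```typescript
// Sketch: rank Jaeger traces by total token usage to find golden cases.
// Types mirror the shape of Jaeger's /api/traces JSON response.
interface JaegerTag { key: string; value: unknown }
interface JaegerSpan { tags: JaegerTag[] }
interface JaegerTrace { traceID: string; spans: JaegerSpan[] }

// Sum input + output tokens across all spans of a trace.
// Attribute keys assumed per OTel GenAI semantic conventions.
export function totalTokens(trace: JaegerTrace): number {
  let total = 0;
  for (const span of trace.spans) {
    for (const tag of span.tags) {
      if (
        tag.key === "gen_ai.usage.input_tokens" ||
        tag.key === "gen_ai.usage.output_tokens"
      ) {
        total += Number(tag.value) || 0;
      }
    }
  }
  return total;
}

// Highest-token traces first: complex reasoning, likely golden cases.
export function rankByTokens(traces: JaegerTrace[], top = 20): JaegerTrace[] {
  return [...traces].sort((a, b) => totalTokens(b) - totalTokens(a)).slice(0, top);
}
```

Feed the top trace IDs straight into the export step below.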
Export them:
```shell
npx toad-eye export-trace abc123def456 --output ./eval-datasets
```
Add your assertions to the generated YAML. The scaffolding is there — add the domain-specific checks that matter for your use case.
Run evals on every prompt change:
```shell
npx toad-eval run --dataset ./eval-datasets/trace-abc123de.eval.yaml --model gpt-4o
```
Now you know: does prompt v2 pass the same cases that prompt v1 handled in production? Not "it didn't break in the first 2 hours" confidence — "it passes the same inputs our users actually send" confidence.
Automate it. The programmable API (exportTrace, fetchTrace, traceToEvalYaml) lets you build a cron job that exports traces nightly from staging, feeds them into CI, and blocks deploys when regressions are detected. The pieces compose.
## Why OTel-native matters here
This workflow only works because toad-eye uses OpenTelemetry. The trace format is standard. Jaeger stores it. The export reads it via Jaeger's API. No vendor lock-in, no proprietary format, no "export your data" button that gives you a CSV.
If you're using Langfuse or Arize, you can build the same pipeline — through their API, in their format, with their rate limits. With OTel, your traces are yours. They live in your Jaeger. You query them whenever you want.
## What comes next
The manual export covers "build a dataset, run evals in CI." But there's a second mode we're working toward: inline eval callbacks where every completed span triggers a scoring function automatically. No Jaeger query, no manual export — production traffic scores itself in real time.
That's a separate deep dive. For now, the manual pipeline is the foundation — and it's already more than most teams have.
## Quick checklist
If you want to start building eval datasets from production traces:
- Enable `recordContent: true` in staging or for a traffic sample
- Find 10-20 traces that represent your core use cases
- Export with `npx toad-eye export-trace <trace_id>`
- Add domain-specific assertions to the generated YAML
- Run evals against your current prompt — establish the baseline
- Run evals against every prompt change before deploying
- Automate: nightly exports, CI runs evals on PR
Your traces already contain the best test data you'll ever get — real inputs from real users. Stop letting them rot in Jaeger.
Previous articles:
- #3: OpenTelemetry just standardized LLM tracing
- #4: Your LLM streaming traces are lying to you
- #5: Your AI agent re-sends 80% of your budget every loop
toad-eye — open-source LLM observability, OTel-native: GitHub · npm
🐸👁️