One Triage Pass, Every Trace Format: Stop Letting Fragmentation Shrink Your Eval Coverage

#ai #agents #observability #evaluation

Your agent traces are scattered across four incompatible formats, and that fragmentation is quietly the reason your evals don't cover production. You run OpenClaw in one service, someone bolted LangSmith onto the Python side, the platform team standardized on OpenTelemetry, and your homegrown recorder writes its own JSON. Four shapes. Four schemas. Zero shared triage. So when you finally sit down to find the production runs worth turning into eval cases, you either write four parsers or — far more likely — you look at one source and call it a day.

I just built the adapter layer that makes that a non-problem, and the exercise taught me something about honest tooling I want to show you, bug and all.

The premise: your eval set should come from production, not imagination

I've argued before that the hardest part of agent evaluation isn't the scorer, it's the corpus — that a rigorous judge over twelve hand-invented cases is grading fiction. The only honest source of eval cases is the traffic you actually serve. Your users run a free, adversarial fuzzing campaign against your agent every day; the job is to capture the runs that broke and promote them into permanent regression cases.

But there's a step zero nobody talks about: before you can promote a trace, you have to be able to read it. And "read it" is where the fragmentation tax hits. A trace store is only useful if the thing that grades runs can ingest whatever recorded them. Otherwise your beautiful trace archive is four silos, and your eval coverage quietly collapses to whichever silo was easiest to parse.

This is exactly why I treat tracing and evaluation as one workflow. AgentLens captures the full execution trace of every run — the resolved input the model actually saw after template interpolation, every tool call with its arguments, the raw outputs, the final answer. agent-eval is the other half: it takes those runs, applies deterministic checks, and returns a pass/fail verdict you can gate on. AgentLens decides which runs are worth testing; agent-eval decides whether the agent passed. But that pairing only pays off if agent-eval can eat traces from tools that aren't AgentLens — because real teams are never on one stack.

One triage pass, four formats

So I wrote adapters. agent-eval now normalizes four native trace shapes into a single session contract and triages them in one pass:

OpenClaw logs
LangSmith / LangGraph runs
any OpenTelemetry GenAI export — which means Arize Phoenix, Traceloop / OpenLLMetry, and the raw OTel SDK, all at once
AgentLens session exports

That OpenTelemetry entry is the high-leverage one: because Phoenix, Traceloop, and OpenLLMetry all emit the same OpenTelemetry GenAI semantic conventions, one adapter swallows the entire OpenTelemetry-native ecosystem. You don't have to standardize your stack to get unified triage; the adapter layer absorbs the fragmentation for you.

Each adapter maps its native shape onto the same normalized session:

// The shared contract every adapter produces. Whatever recorded the run —
// OpenClaw, LangSmith, OTLP, AgentLens — it comes out looking like this.
interface BuiltSession {
  sessionId: string;
  label: string;              // the task line, for triage output
  tokenUsage: number;         // total tokens burned = cost signal
  runtimeMs: number;          // wall-clock duration
  endedCleanly: boolean;      // did it actually finish?
  trajTimedOut: boolean;      // hit a cap / never returned
  abortedAny: boolean;        // errored or abandoned
  errorEvents: number;
}

// Adapters are pure functions: raw export text -> normalized sessions.
// No network, no AI, no state. Just parsing. (Signatures only; bodies omitted.)
export declare function parseOtlp(text: string): BuiltSession[];       // Phoenix, Traceloop, OpenLLMetry, raw OTel
export declare function parseLangSmith(text: string): BuiltSession[];  // LangChain / LangGraph
export declare function parseAgentLens(text: string): BuiltSession[];  // AgentLens exporter

// Then the same deterministic triage ranks them, regardless of origin:
const report = triageOtlp(rawTrace, {
  dollarsPerMillionTokens: 9,
  costlyTokenThreshold: 100_000,
});
// -> sessions ranked by wasted spend + failure mode:
//    timeouts, abandoned runs, token bonfires — the ones worth freezing into eval cases.
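
To make the mapping concrete, here is a minimal sketch of what an OTLP adapter could look like. It is not agent-eval's actual parseOtlp: the names sketchParseOtlp and SketchSession are made up for illustration, and the choices to group spans by traceId, take the first span's name as the label, and treat any ERROR span as an abort are my assumptions. The span layout (resourceSpans, scopeSpans, spans) and the gen_ai.* attribute keys follow the OTLP JSON encoding and the OpenTelemetry GenAI semantic conventions.

```typescript
// Illustrative only: a toy OTLP/JSON adapter, not agent-eval's real parseOtlp.
interface SketchSession {
  sessionId: string;
  label: string;
  tokenUsage: number;
  runtimeMs: number;
  endedCleanly: boolean;
  trajTimedOut: boolean;
  abortedAny: boolean;
  errorEvents: number;
}

interface OtlpValue {
  stringValue?: string;
  intValue?: string | number; // OTLP/JSON usually encodes int64 as a string
  arrayValue?: { values?: OtlpValue[] };
}
interface OtlpSpan {
  traceId: string;
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes?: { key: string; value: OtlpValue }[];
  status?: { code?: number };
}
interface OtlpExport {
  resourceSpans?: { scopeSpans?: { spans?: OtlpSpan[] }[] }[];
}

const STATUS_CODE_ERROR = 2; // OTLP Status.code enum value for ERROR

const attr = (span: OtlpSpan, key: string): OtlpValue | undefined =>
  span.attributes?.find((a) => a.key === key)?.value;

// Nanosecond timestamps overflow Number's safe integer range, so drop the
// last six digits of the decimal string to get exact milliseconds.
const nanosToMs = (nanos: string): number => Number(nanos.slice(0, -6));

function sketchParseOtlp(text: string): SketchSession[] {
  const doc = JSON.parse(text) as OtlpExport;

  // Assumption: one trace == one agent session.
  const byTrace = new Map<string, OtlpSpan[]>();
  for (const rs of doc.resourceSpans ?? []) {
    for (const ss of rs.scopeSpans ?? []) {
      for (const span of ss.spans ?? []) {
        const group = byTrace.get(span.traceId) ?? [];
        group.push(span);
        byTrace.set(span.traceId, group);
      }
    }
  }

  const sessions: SketchSession[] = [];
  byTrace.forEach((spans, traceId) => {
    let tokenUsage = 0;
    let errorEvents = 0;
    let trajTimedOut = false;
    for (const span of spans) {
      tokenUsage += Number(attr(span, "gen_ai.usage.input_tokens")?.intValue ?? 0);
      tokenUsage += Number(attr(span, "gen_ai.usage.output_tokens")?.intValue ?? 0);
      if (span.status?.code === STATUS_CODE_ERROR) errorEvents += 1;
      const reasons = attr(span, "gen_ai.response.finish_reasons")?.arrayValue?.values ?? [];
      if (reasons.some((r) => r.stringValue === "length")) trajTimedOut = true;
    }
    const start = Math.min(...spans.map((s) => nanosToMs(s.startTimeUnixNano)));
    const end = Math.max(...spans.map((s) => nanosToMs(s.endTimeUnixNano)));
    sessions.push({
      sessionId: traceId,
      label: spans[0].name, // assumption: the first span names the task
      tokenUsage,
      runtimeMs: end - start,
      endedCleanly: !trajTimedOut && errorEvents === 0,
      trajTimedOut,
      abortedAny: errorEvents > 0,
      errorEvents,
    });
  });
  return sessions;
}
```

The point of the sketch is how little judgment is involved: every field is read or summed from something the tracer recorded, which is what keeps the adapter a parser rather than an opinion.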

Notice what these adapters are and aren't. They are parsers, not judges, and the signals they extract are Tier 1 checks in agent-eval's independence model: externally observable proof the agent can't forge. Did the run finish within its timeout? Did it error? How many tokens did it actually burn? A finish_reason of "length" in an OTLP span, or a still-active AgentLens session with no ended_at, is unforgeable evidence that the run hit a cap or never returned, which is exactly what trajTimedOut records. The model can't argue its way out of it. That's the whole point of parsing traces rather than asking a model "did this go okay?"

And critically: this triage runs over the agent's trajectory — the full sequence of steps — because Tier 1 is allowed to. A deterministic check reading token counts and finish reasons has independent ground truth. A model-as-judge does not: a model grading another model's reasoning is circular, because judge and judged share a substrate. So the judge never sees the trajectory; it only ever inspects final artifacts the judged agent didn't get to author, and even then it's a signal, not a verdict. Triage is deterministic, costs about nothing, and runs fast enough to sit inline. That's why it's the front door and the judge is the offline back room.
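
A deterministic ranking of this kind fits in a few lines. The sketch below is hypothetical (sketchTriage, TriageInput, and the failure-mode labels are illustrative names, not agent-eval's API). It reuses the two option names from the call above and makes two loud assumptions: it trusts the flags on the normalized session, and it counts every token of a flagged run as wasted spend.

```typescript
// Hypothetical triage sketch; not agent-eval's real implementation.
interface TriageInput {
  sessionId: string;
  tokenUsage: number;
  trajTimedOut: boolean;
  abortedAny: boolean;
}
interface TriageOptions {
  dollarsPerMillionTokens: number;
  costlyTokenThreshold: number;
}
type FailureMode = "timeout" | "abandoned" | "token-bonfire";
interface FlaggedSession {
  sessionId: string;
  modes: FailureMode[];
  wastedDollars: number;
}

function sketchTriage(sessions: TriageInput[], opts: TriageOptions): FlaggedSession[] {
  const flagged: FlaggedSession[] = [];
  for (const s of sessions) {
    const modes: FailureMode[] = [];
    if (s.trajTimedOut) modes.push("timeout");
    if (s.abortedAny) modes.push("abandoned");
    if (s.tokenUsage >= opts.costlyTokenThreshold) modes.push("token-bonfire");
    if (modes.length === 0) continue; // healthy run: nothing to promote

    flagged.push({
      sessionId: s.sessionId,
      modes,
      // Assumption: all tokens of a flagged run count as wasted spend.
      wastedDollars: (s.tokenUsage / 1_000_000) * opts.dollarsPerMillionTokens,
    });
  }
  // Most expensive failures first.
  return flagged.sort((a, b) => b.wastedDollars - a.wastedDollars);
}
```

Nothing here calls a model or touches the network, which is why a check like this can sit inline: the same input always produces the same ranking.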

The part where the tool caught my own bug

Here's the moment that mattered. Each adapter was written against a real export emitted by that tool's own SDK — not a hand-authored mock. For OTLP I installed the actual opentelemetry-sdk, emitted real GenAI spans, and serialized them through the SDK's own exporter. For AgentLens I built genuine session objects and ran them through its real SessionExporter. Authoritative shapes, because a mock only proves your adapter agrees with your imagination — the exact failure mode I keep warning about with eval sets.

When I ran the AgentLens adapter's test, triage reported zero flagged sessions — even though my adapter had correctly marked a never-ended run as a timeout. That looked like a bug in the adapter. It wasn't. The default triage gate keys off observable timeline gaps, not the status flags an adapter sets. AgentLens encodes failure in a richer place — session.status — and the deterministic staleness check wasn't consulting it. The tool wasn't wrong; it was telling me my assumption about how failure gets detected was wrong.

I chased the why instead of forcing the assertion green, and the fix was real: AgentLens runs should be triaged in the mode that consumes their status verdict. That's the discipline the whole approach is built on. An eval that you can bend until it passes is worthless; the entire value proposition is a check that tells you the truth even when the truth is inconvenient. If I'd "fixed" that test by loosening the assertion, I'd have shipped an adapter that silently ignored abandoned runs — the precise category of failure I built the thing to catch.
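
The mismatch is easier to see in code. This is a hypothetical reconstruction, not agent-eval's source: the mode names, the staleAfterMs parameter, and the session shape are all assumptions. It shows how a gate that only looks for gaps in the event timeline can pass a run whose status says it never finished, while a status-aware mode flags it.

```typescript
// Hypothetical reconstruction of the two triage modes; names are illustrative.
interface TimelineSession {
  sessionId: string;
  eventTimesMs: number[]; // timestamps of recorded events, ascending
  trajTimedOut: boolean;  // status verdict set by the adapter
  abortedAny: boolean;
}

type TriageMode = "timeline-gaps" | "status-aware";

// Largest silence between two consecutive events.
function maxGapMs(times: number[]): number {
  let max = 0;
  for (let i = 1; i < times.length; i++) {
    max = Math.max(max, times[i] - times[i - 1]);
  }
  return max;
}

function isFlagged(s: TimelineSession, mode: TriageMode, staleAfterMs: number): boolean {
  const stale = maxGapMs(s.eventTimesMs) > staleAfterMs;
  if (mode === "timeline-gaps") {
    return stale; // ignores whatever status the adapter recorded
  }
  return stale || s.trajTimedOut || s.abortedAny;
}

// A run that logged steadily and then simply never ended.
const neverEnded: TimelineSession = {
  sessionId: "s1",
  eventTimesMs: [0, 1_000, 2_000],
  trajTimedOut: true,
  abortedAny: false,
};
```

With a 30-second staleness window, isFlagged(neverEnded, "timeline-gaps", 30_000) is false and isFlagged(neverEnded, "status-aware", 30_000) is true. Both the adapter and the gate were behaving as designed; the wrong piece was my assumption about which one decides.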

The takeaway

Stop letting format fragmentation quietly shrink your eval coverage to one silo. Your traces are already being recorded — by OpenClaw, by LangSmith, by whatever OpenTelemetry tracer your platform team blessed, by your own recorder. The move is an ingest layer that reads all of them into one triage pass, ranks the runs by wasted spend and failure mode, and hands you the exact production failures worth freezing into permanent eval cases. AgentLens captures the trace; agent-eval grades it; the adapters mean it doesn't matter which tool did the recording.
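
One possible shape for that ingest layer, as a sketch rather than a description of agent-eval: a registry of adapters keyed by format, plus an explicit list of formats nobody could read. Every name here (ingestAll, TraceExport, the format keys) is hypothetical.

```typescript
// Hypothetical ingest layer; every name here is illustrative.
interface RawSession {
  sessionId: string;
  tokenUsage: number;
}
interface IngestedSession extends RawSession {
  source: string; // which recorder produced the run
}
type Adapter = (text: string) => RawSession[];

interface TraceExport {
  format: string; // e.g. "otlp" or "langsmith"; the keys are an assumption
  text: string;
}
interface IngestResult {
  sessions: IngestedSession[];
  unreadFormats: string[]; // formats with no adapter: the coverage gap, made visible
}

function ingestAll(traceExports: TraceExport[], adapters: Map<string, Adapter>): IngestResult {
  const sessions: IngestedSession[] = [];
  const unreadFormats: string[] = [];
  for (const exp of traceExports) {
    const adapter = adapters.get(exp.format);
    if (!adapter) {
      // Record the silo instead of silently skipping it.
      if (unreadFormats.indexOf(exp.format) === -1) unreadFormats.push(exp.format);
      continue;
    }
    for (const s of adapter(exp.text)) {
      sessions.push({ ...s, source: exp.format });
    }
  }
  return { sessions, unreadFormats };
}
```

Returning unreadFormats is the design choice that matters: a silo you can't parse shows up in the report instead of disappearing from your coverage.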

Your users are writing your test cases for you, every day, across every stack you run. The only question is whether your tooling can read all of it — or just the parts that were convenient.

Top comments (1)

Max Quimby • Jul 4

Strong agree on the core claim — grading twelve hand-invented cases is grading fiction, and the only honest corpus is traffic you actually served. The adapter layer is the right move too; collapsing the whole OTel GenAI ecosystem through one parser because Phoenix/Traceloop/OpenLLMetry share the semantic conventions is the high-leverage bit. Where I'd push is the step after triage. Your BuiltSession captures outcome signals — endedCleanly, trajTimedOut, tokenUsage — beautifully, but promoting a trace into a permanent regression case needs the input to be replayable, and production traces usually aren't. The resolved prompt is deterministic, sure, but the tool calls hit a database that's since moved on, an API that returns something new, a clock that's advanced. So the trace that broke yesterday might not break — or even run — today. How are you handling that? Freezing tool outputs into the fixture, mocking at the adapter boundary, or grading the recorded trajectory as-is with no re-execution? That choice quietly decides whether your suite is testing the agent or testing a museum piece.