Trace-to-Training: how agent runs become learning data

#ai #agents #machinelearning #typescript

Trace-to-Training: how agent runs become learning data

Every agent run is a data point. Most frameworks throw it away.

WasmAgent keeps it — evaluated by the compliance engine, ranked by outcome, exported as a typed ComplianceEvalRecord ready for SFT or DPO training. No human labeling.

Three repair modes

import { ComplianceRun } from "@wasmagent/compliance";

const run = new ComplianceRun({
  mode: "full_pcl",   // "direct" | "prompt_retry" | "full_pcl"
  taskSpec: {
    instruction: "Write a summary in exactly 3 bullet points.",
    constraints: [{ type: "format", rule: "bullet_count", value: 3 }],
  },
});

const result = await run.execute(agent, input);
// result.complianceEvalRecord → typed, versioned, schema-validated

direct — one shot, record pass/fail.

prompt_retry — retry once with a rephrased prompt.

full_pcl — full repair loop: run → evaluate → patch/regenerate → re-evaluate → record the entire trace.

What the numbers show

IFEval × Qwen2.5-1.5B-Q4 (3 seeds × 50 samples):

Mode	Pass rate	Std dev
prompt_retry	46.0%	±2.0pp
full_pcl	54.7%	±1.2pp

+8.7pp. The variance drop (±2.0 → ±1.2) matters for production reliability.

Reproduce: bun packages/compliance/benchmarks/ifeval/run.ts --limit=50 --seed=42

The repair trace is the training data

When full_pcl repairs a failing output, RepairPlanner records every attempt:

// Inside ComplianceEvalRecord
attempts: [
  { strategy: "direct",     output: "...", passed: false },
  { strategy: "patch",      output: "...", passed: false },
  { strategy: "regenerate", output: "...", passed: true  },
]

The full sequence — what failed, what was tried, what worked — is what feeds DPO training. The model learns from failure traces, not just final outputs.

Parallel rollouts for preference pairs

import { RolloutForkRunner, RolloutRanker } from "@wasmagent/core";

const runner = new RolloutForkRunner({ forks: 4 });
const rollouts = await runner.run(agent, input, taskSpec);

const ranked = new RolloutRanker().rank(rollouts);
// ranked[0] → chosen (SFT)
// ranked[1..] → rejected (DPO pairs)

The compliance verifier is the reward signal. No human annotation.

Try it

git clone https://github.com/WasmAgent/wasmagent-js
bun test packages/compliance/   # 113 pass / 0 fail

Code: packages/compliance · RolloutForkRunner · RolloutRanker

Series: AEP (part 1) · MCP Trust Pack (part 2) · Trace-to-Training (part 3)

Top comments (1)

Raju Dandigam • Jun 30

I like the idea that traces should not be treated as disposable logs. If a run contains attempts, repairs, failures, and final success, it can become both debugging evidence and learning data. The important part is making the record typed, versioned, and tied to the evaluation context so future training does not lose the reason the repair worked. I’m exploring a lighter local-first trace layer in agent-inspect, and this trace-to-training direction is a natural next step for more mature agent systems.