DEV Community

Teller


Trace-to-Training: how agent runs become learning data


Every agent run is a data point. Most frameworks throw it away.

WasmAgent keeps it — evaluated by the compliance engine, ranked by outcome, exported as a typed ComplianceEvalRecord ready for SFT or DPO training. No human labeling.

Three repair modes

import { ComplianceRun } from "@wasmagent/compliance";

const run = new ComplianceRun({
  mode: "full_pcl",   // "direct" | "prompt_retry" | "full_pcl"
  taskSpec: {
    instruction: "Write a summary in exactly 3 bullet points.",
    constraints: [{ type: "format", rule: "bullet_count", value: 3 }],
  },
});

const result = await run.execute(agent, input);
// result.complianceEvalRecord → typed, versioned, schema-validated

direct — one shot, record pass/fail.

prompt_retry — retry once with a rephrased prompt.

full_pcl — full repair loop: run → evaluate → patch/regenerate → re-evaluate → record the entire trace.
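One way to read the three modes is as a single loop with different attempt budgets. The sketch below illustrates that reading under assumptions; it is not WasmAgent's implementation. `runWithRepair`, `Generate`, `Check`, the `PLAN` table, and the `"retry"` strategy name are hypothetical, the full_pcl order mirrors the repair trace shown later in this post, and the generator is synchronous only to keep the example short.

```typescript
// Hypothetical types for illustration; not the @wasmagent/compliance API.
type Mode = "direct" | "prompt_retry" | "full_pcl";
type Strategy = "direct" | "retry" | "patch" | "regenerate";
interface Attempt { strategy: Strategy; output: string; passed: boolean }

// A generator receives the strategy to apply and the previous failing output (if any).
type Generate = (strategy: Strategy, previous: string | null) => string;
type Check = (output: string) => boolean;

// Which strategies each mode may try, in order (assumed).
const PLAN: Record<Mode, Strategy[]> = {
  direct: ["direct"],
  prompt_retry: ["direct", "retry"],
  full_pcl: ["direct", "patch", "regenerate"],
};

function runWithRepair(mode: Mode, generate: Generate, check: Check): Attempt[] {
  const attempts: Attempt[] = [];
  let previous: string | null = null;
  for (const strategy of PLAN[mode]) {
    const output = generate(strategy, previous);
    const passed = check(output);
    attempts.push({ strategy, output, passed }); // record every attempt, pass or fail
    if (passed) break; // stop at the first compliant output
    previous = output;
  }
  return attempts;
}

// A checker for the "exactly 3 bullet points" constraint from the example above.
const hasThreeBullets: Check = (text) =>
  text.split("\n").filter((line) => /^\s*[-*•]\s+/.test(line)).length === 3;
```

Because failed attempts are pushed before the loop decides whether to continue, the returned array is already the full trace, not just the final output.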

What the numbers show

IFEval × Qwen2.5-1.5B-Q4 (3 seeds × 50 samples):

Mode          Pass rate  Std dev
prompt_retry  46.0%      ±2.0pp
full_pcl      54.7%      ±1.2pp

full_pcl beats prompt_retry by 8.7pp (54.7% vs. 46.0%). The spread across seeds also tightens (standard deviation ±2.0pp → ±1.2pp), which matters for production reliability.
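For readers aggregating their own runs, a mean and standard deviation over per-seed pass rates can be computed as in the sketch below. It assumes the sample (n − 1) standard deviation; the post does not say which estimator the benchmark uses. `summarize` is a hypothetical helper, and the inputs are placeholders, not the benchmark's per-seed results.

```typescript
// Summarize per-seed pass rates (in percent) as mean and sample standard deviation.
function summarize(passRates: number[]): { mean: number; std: number } {
  if (passRates.length < 2) throw new Error("need at least two seeds");
  const n = passRates.length;
  const mean = passRates.reduce((sum, r) => sum + r, 0) / n;
  // Sample variance: squared deviations divided by n - 1 (an assumption, see above).
  const variance = passRates.reduce((acc, r) => acc + (r - mean) ** 2, 0) / (n - 1);
  return { mean, std: Math.sqrt(variance) };
}

// Placeholder inputs only; these are NOT the benchmark's per-seed numbers.
const example = summarize([40, 50, 60]);
console.log(`${example.mean.toFixed(1)}% ±${example.std.toFixed(1)}pp`);
```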

Reproduce: bun packages/compliance/benchmarks/ifeval/run.ts --limit=50 --seed=42

The repair trace is the training data

When full_pcl repairs a failing output, RepairPlanner records every attempt:

// Inside ComplianceEvalRecord
attempts: [
  { strategy: "direct",     output: "...", passed: false },
  { strategy: "patch",      output: "...", passed: false },
  { strategy: "regenerate", output: "...", passed: true  },
]

The full sequence — what failed, what was tried, what worked — is what feeds DPO training. The model learns from failure traces, not just final outputs.
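A minimal sketch of how such a trace could be turned into preference pairs: the first passing attempt becomes the chosen completion, and each failed attempt becomes a rejected one. `pairsFromTrace` and `PreferencePair` are hypothetical names, and the `Attempt` shape is copied from the snippet above; the real ComplianceEvalRecord schema and export format may differ.

```typescript
// Hypothetical shapes for illustration; the real ComplianceEvalRecord may differ.
interface Attempt { strategy: string; output: string; passed: boolean }
interface PreferencePair { prompt: string; chosen: string; rejected: string }

function pairsFromTrace(prompt: string, attempts: Attempt[]): PreferencePair[] {
  const winner = attempts.find((a) => a.passed);
  if (winner === undefined) return []; // no passing attempt: nothing to prefer
  const chosen = winner.output;
  return attempts
    .filter((a) => !a.passed)
    .map((a) => ({ prompt, chosen, rejected: a.output }));
}
```

With the three-attempt trace above, this yields two pairs: the regenerated output preferred over the direct output, and over the patched output.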

Parallel rollouts for preference pairs

import { RolloutForkRunner, RolloutRanker } from "@wasmagent/core";

const runner = new RolloutForkRunner({ forks: 4 });
const rollouts = await runner.run(agent, input, taskSpec);

const ranked = new RolloutRanker().rank(rollouts);
// ranked[0] → chosen (SFT)
// ranked[1..] → rejected (DPO pairs)

The compliance verifier is the reward signal. No human annotation.
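To make the ranking step concrete, here is a sketch of how scored rollouts could be split into one SFT example and a set of DPO pairs. It assumes each rollout carries a numeric verifier score (for instance, the fraction of constraints satisfied); `Rollout`, `TrainingData`, and `toTrainingData` are hypothetical and not part of `@wasmagent/core`.

```typescript
// Hypothetical shapes; the score is assumed to come from the compliance verifier.
interface Rollout { output: string; score: number }
interface TrainingData {
  sft: { prompt: string; completion: string } | null;
  dpo: { prompt: string; chosen: string; rejected: string }[];
}

function toTrainingData(prompt: string, rollouts: Rollout[]): TrainingData {
  // Sort a copy so the caller's array is left untouched; highest score first.
  const ranked = [...rollouts].sort((a, b) => b.score - a.score);
  const best = ranked[0];
  if (best === undefined) return { sft: null, dpo: [] };
  const bestOutput = best.output;
  const bestScore = best.score;
  const dpo = ranked
    .slice(1)
    .filter((r) => r.score < bestScore) // ties carry no preference signal
    .map((r) => ({ prompt, chosen: bestOutput, rejected: r.output }));
  return { sft: { prompt, completion: bestOutput }, dpo };
}
```

Dropping ties is a design choice in this sketch: a pair whose two sides scored the same tells the model nothing about which one to prefer. A stricter pipeline might also require the top rollout to pass outright before using it for SFT.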

Try it

git clone https://github.com/WasmAgent/wasmagent-js
cd wasmagent-js
bun install
bun test packages/compliance/   # 113 pass / 0 fail

Code: packages/compliance · RolloutForkRunner · RolloutRanker


Series: AEP (part 1) · MCP Trust Pack (part 2) · Trace-to-Training (part 3)
