Trace-to-Training: how agent runs become learning data
Every agent run is a data point. Most frameworks throw it away.
WasmAgent keeps it: each run is evaluated by the compliance engine, ranked by outcome, and exported as a typed ComplianceEvalRecord ready for SFT or DPO training. No human labeling.
Three repair modes
import { ComplianceRun } from "@wasmagent/compliance";

const run = new ComplianceRun({
  mode: "full_pcl", // "direct" | "prompt_retry" | "full_pcl"
  taskSpec: {
    instruction: "Write a summary in exactly 3 bullet points.",
    constraints: [{ type: "format", rule: "bullet_count", value: 3 }],
  },
});

const result = await run.execute(agent, input);
// result.complianceEvalRecord → typed, versioned, schema-validated
direct — one shot, record pass/fail.
prompt_retry — retry once with a rephrased prompt.
full_pcl — full repair loop: run → evaluate → patch/regenerate → re-evaluate → record the entire trace.
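The full_pcl loop described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the `Attempt` shape mirrors the trace shown later in this post, while `Generate`, `countBullets`, `passesBulletCount`, `repairLoop`, and `fakeAgent` are hypothetical names introduced here, and generation is modeled as a synchronous function for brevity.

```typescript
// Hypothetical types for illustration; not the @wasmagent/compliance API.
type Strategy = "direct" | "patch" | "regenerate";

interface Attempt {
  strategy: Strategy;
  output: string;
  passed: boolean;
}

// Hypothetical generator: produces an output for a strategy and may look at
// the previous failed output (synchronous here for brevity).
type Generate = (strategy: Strategy, previous?: string) => string;

// Counts lines that start with a bullet marker ("-", "*", or "•").
function countBullets(text: string): number {
  return text
    .split("\n")
    .filter((line) => /^\s*[-*•]\s+\S/.test(line)).length;
}

// Evaluator for a bullet_count constraint like the one in the task spec.
function passesBulletCount(text: string, expected: number): boolean {
  return countBullets(text) === expected;
}

// run → evaluate → patch/regenerate → re-evaluate, recording every attempt.
function repairLoop(generate: Generate, expectedBullets: number): Attempt[] {
  const strategies: Strategy[] = ["direct", "patch", "regenerate"];
  const attempts: Attempt[] = [];
  let previous: string | undefined;

  for (const strategy of strategies) {
    const output = generate(strategy, previous);
    const passed = passesBulletCount(output, expectedBullets);
    attempts.push({ strategy, output, passed });
    if (passed) break; // stop at the first compliant output
    previous = output;
  }
  return attempts;
}

// Usage: a fake agent that only gets the bullet count right when regenerating.
const fakeAgent: Generate = (strategy) =>
  strategy === "regenerate" ? "- a\n- b\n- c" : "- a\n- b";

const trace = repairLoop(fakeAgent, 3);
console.log(trace.map((a) => `${a.strategy}:${a.passed}`).join(", "));
```

In this sketch, direct mode corresponds to stopping after the first attempt, and the recorded array is what makes the later attempts useful as training signal.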
What the numbers show
IFEval × Qwen2.5-1.5B-Q4 (3 seeds × 50 samples):
| Mode | Pass rate | Std dev |
|---|---|---|
| prompt_retry | 46.0% | ±2.0pp |
| full_pcl | 54.7% | ±1.2pp |
full_pcl beats prompt_retry by 8.7pp (54.7% vs. 46.0%). The drop in standard deviation across seeds (±2.0pp → ±1.2pp) matters for production reliability.
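For readers who want to check how a mean and a seed-to-seed spread of this kind are computed, here is a small worked sketch. The per-seed values below are hypothetical: they are chosen only because they round to the aggregates in the table, and they are not the actual per-seed results. The use of the sample standard deviation (n − 1 denominator) is also an assumption, since the post does not say which estimator the benchmark script uses.

```typescript
// Mean of per-seed pass rates (in percent).
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator). Assumption: the benchmark
// may use a different estimator.
function sampleStd(xs: number[]): number {
  const m = mean(xs);
  const sumSq = xs.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

// HYPOTHETICAL per-seed pass rates (3 seeds), picked only to be consistent
// with the reported aggregates. They are not measured values.
const promptRetry = [44, 46, 48];
const fullPcl = [54, 54, 56];

// Difference in mean pass rate, in percentage points.
const delta = mean(fullPcl) - mean(promptRetry);

console.log(mean(promptRetry).toFixed(1), sampleStd(promptRetry).toFixed(1));
console.log(mean(fullPcl).toFixed(1), sampleStd(fullPcl).toFixed(1));
console.log(delta.toFixed(1));
```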
Reproduce: bun packages/compliance/benchmarks/ifeval/run.ts --limit=50 --seed=42
The repair trace is the training data
When full_pcl repairs a failing output, RepairPlanner records every attempt:
// Inside ComplianceEvalRecord
attempts: [
  { strategy: "direct", output: "...", passed: false },
  { strategy: "patch", output: "...", passed: false },
  { strategy: "regenerate", output: "...", passed: true },
]
The full sequence — what failed, what was tried, what worked — is what feeds DPO training. The model learns from failure traces, not just final outputs.
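One way a repair trace could be turned into DPO pairs is to treat the first passing attempt as the chosen response and each failed attempt as a rejected one. The sketch below assumes the `attempts` shape shown above; `PreferencePair`, `pairsFromTrace`, and the example strings are hypothetical, and the prompt/chosen/rejected layout is the shape commonly used by DPO trainers, not a documented WasmAgent export format.

```typescript
// Mirrors the attempts entries shown in the post.
interface Attempt {
  strategy: string;
  output: string;
  passed: boolean;
}

// Hypothetical export shape: the prompt/chosen/rejected triple commonly
// consumed by DPO training code.
interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

// Pairs the first passing attempt with every failed attempt in the trace.
function pairsFromTrace(prompt: string, attempts: Attempt[]): PreferencePair[] {
  const winner = attempts.find((a) => a.passed);
  if (winner === undefined) return []; // no passing attempt: nothing to prefer
  const chosen = winner.output;
  return attempts
    .filter((a) => !a.passed)
    .map((a) => ({ prompt, chosen, rejected: a.output }));
}

// Usage with an illustrative trace (outputs are made up for the example).
const exampleTrace: Attempt[] = [
  { strategy: "direct", output: "one long paragraph", passed: false },
  { strategy: "patch", output: "- a\n- b", passed: false },
  { strategy: "regenerate", output: "- a\n- b\n- c", passed: true },
];

const pairs = pairsFromTrace(
  "Write a summary in exactly 3 bullet points.",
  exampleTrace,
);
console.log(pairs.length);
```

A trace with no passing attempt yields no pairs in this sketch, since there is no verified output to prefer.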
Parallel rollouts for preference pairs
import { RolloutForkRunner, RolloutRanker } from "@wasmagent/core";
const runner = new RolloutForkRunner({ forks: 4 });
const rollouts = await runner.run(agent, input, taskSpec);
const ranked = new RolloutRanker().rank(rollouts);
// ranked[0] → chosen (SFT)
// ranked[1..] → rejected (DPO pairs)
The compliance verifier is the reward signal. No human annotation.
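To make the ranking step concrete, here is a minimal sketch of how verifier scores could turn a set of rollouts into one SFT example plus DPO pairs. It does not reproduce `RolloutRanker`: the `Rollout` shape, the numeric `score` field, and the `rankRollouts` and `toTrainingData` helpers are all assumptions made for illustration.

```typescript
// Hypothetical rollout shape: score is the verifier's reward, for example
// the fraction of constraints satisfied.
interface Rollout {
  output: string;
  score: number;
}

interface SftExample {
  prompt: string;
  completion: string;
}

interface DpoPair {
  prompt: string;
  chosen: string;
  rejected: string;
}

// Highest score first; copies the array so the input is not mutated.
function rankRollouts(rollouts: Rollout[]): Rollout[] {
  return [...rollouts].sort((a, b) => b.score - a.score);
}

// Top rollout becomes the SFT target; strictly worse rollouts become
// rejected responses paired against it.
function toTrainingData(
  prompt: string,
  rollouts: Rollout[],
): { sft: SftExample | null; dpo: DpoPair[] } {
  const ranked = rankRollouts(rollouts);
  const best = ranked[0];
  if (best === undefined) return { sft: null, dpo: [] };

  const chosen = best.output;
  const bestScore = best.score;
  const dpo = ranked
    .slice(1)
    .filter((r) => r.score < bestScore) // ties carry no preference signal
    .map((r) => ({ prompt, chosen, rejected: r.output }));

  return { sft: { prompt, completion: chosen }, dpo };
}

// Usage with four illustrative rollouts (scores are made up for the example).
const data = toTrainingData("Summarize in exactly 3 bullet points.", [
  { output: "A", score: 0.5 },
  { output: "B", score: 1.0 },
  { output: "C", score: 1.0 },
  { output: "D", score: 0.25 },
]);
console.log(data.sft?.completion, data.dpo.length);
```

Dropping tied rollouts is a design choice in this sketch: a pair whose two sides score equally gives the trainer no signal about which one to prefer.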
Try it
git clone https://github.com/WasmAgent/wasmagent-js
cd wasmagent-js
bun install
bun test packages/compliance/ # 113 pass / 0 fail
Code: packages/compliance · RolloutForkRunner · RolloutRanker
Series: AEP (part 1) · MCP Trust Pack (part 2) · Trace-to-Training (part 3)