Last Tuesday, Sonnet 4.5 spent forty-three minutes implementing JWT authentication in a project I run. It read four files, wrote a 180-line patch, ran the test suite, watched two tests fail, traced one of the failures to a stale fixture, fixed both, ran the suite again, watched it pass, then squash-merged the work to main with a commit message that read like a senior engineer wrote it. The whole exchange consumed about 50,000 tokens of model output, broken into nineteen AssistantMessage turns interleaved with twenty-three ToolUseBlock calls and twenty-one ToolResultBlock returns.
I have the final code. I have the commit. I do not have the trajectory.
I had nineteen turns of expert reasoning — the kind of demonstration that, if you handed it to a smaller model as supervised fine-tuning data, would teach that smaller model how to act like a coding agent, not just how to write Python. And I threw it on the floor the moment the ResultMessage arrived, because my harness was wrapped around claude_agent_sdk.query() like this:
```python
result_text = ""
async for message in run_agent(prompt, ...):
    if message.__class__.__name__ == "ResultMessage":
        result_text = message.result or ""
return result_text
```
Look at that loop. Every message walked past it for free. The last one paid the rent.
This is the post about why I decided that was insane, what I built to fix it, and what it now lets me do — including, eventually, train my own Qwen2.5-Coder fine-tune on Sonnet's distilled coding behavior.
1. The thing nobody is doing yet, but should be
If you are running an agent harness at any scale — even hobby scale, even one-developer scale — you are paying a Frontier-model API bill and generating a continuous stream of high-quality expert demonstrations, then throwing them away. The math on this is depressing once you actually run it. A two-week sprint with one agent running ten hours a day at modest concurrency produces something like 500 task trajectories (ten hours a day for fourteen days is 140 agent-hours; at forty-odd minutes per task that's roughly 200 tasks, and concurrency of two to three gets you to 500). Each one is, on average, six thousand to twenty thousand tokens of expert thinking, tool use, and code edits, paired with the canonical "right answer" diff that landed on main.
This is the shape of training data people pay for. Coding-specific SFT corpora don't fall out of the sky. The teams shipping the leading code models scrape GitHub, run synthetic generation pipelines, hire annotators. You have a smaller, narrower, higher-quality version of that already happening in your dev environment for free, modulo the fact that you are not capturing it.
The reason most teams aren't doing this isn't technical difficulty. It's a missing primitive. The agent SDK gives you a stream of messages. Most harnesses iterate the stream once and discard it. Adding a tee — a "yield to the caller AND write to a database" wrapper — is eighty lines of code. The hard part is not the tee. The hard part is figuring out what to capture and what shape to capture it in so that six months from now, when someone says "let's actually try training that model now," you don't discover you stored the wrong thing.
2. The two design questions that actually matter
Before any code, two decisions:
- What format do you store? The naive answer is "store it in the format your fine-tuning library wants." That answer is wrong. Fine-tuning libraries change. The chat template you use today (let's say OpenAI tool-use) is not the chat template you'll use in eighteen months. ShareGPT had its moment, ChatML is having its moment, the next thing is already in someone's repo. If you store in the trained-model format, you've locked yourself in.
- What's your training label? A trajectory by itself is imitation-learning data — "here's what the expert did, copy it." That gets you to mid-tier capability, full stop. The reason DPO and rejection-sampling matter is they let you do preference learning: "of these K candidate solutions, which one matches the actual answer?" To do that, you need a label — a canonical "this is what the correct final state looked like" against which candidate completions get scored. If you only stored the trajectory, you've half-stored the dataset.
The answers I landed on, after going down both wrong paths first:
Capture the superset. Store the raw SDK message stream — every AssistantMessage with its ThinkingBlock and TextBlock and ToolUseBlock content, every UserMessage with its ToolResultBlock content, every model name, every usage tally. Don't project to a chat format at capture time. Projection is cheap and reversible from the superset; the reverse direction isn't true. This is the same principle as event-sourcing in databases: store the events, project the views.
Capture the diff. When the agent's branch squash-merges to main, the resulting commit hash is the ground-truth label. git show <sha> gives you the canonical patch the expert eventually landed. Add one nullable column to your task table, PATCH the SHA back after squash, and at export time you can attach the diff to every successful trajectory. Now your dataset isn't "trajectory." It's "trajectory plus the right answer." DPO and rejection sampling become trivial future work because the label is already on disk.
That's the design. The implementation is small enough to fit on a napkin.
3. The recorder is a tee, and it's eighty lines
The whole capture surface is a single async iterator wrapper:
```python
import logging
from dataclasses import dataclass
from typing import Any, AsyncIterator

import httpx

log = logging.getLogger(__name__)


@dataclass
class RecordingDestination:
    state_url: str
    session_id: str
    task_id: str


async def record_messages(
    messages: AsyncIterator[Any],
    *,
    dest: RecordingDestination,
    client: httpx.AsyncClient | None = None,
) -> AsyncIterator[Any]:
    own_client = client is None
    active = client if client else httpx.AsyncClient(timeout=5.0)
    try:
        turn = 0
        async for message in messages:
            try:
                payload = _serialize_message(message, turn=turn)
                await active.post(
                    f"{dest.state_url}/sessions/{dest.session_id}/events",
                    json={
                        "event_type": "agent_message",
                        "task_id": dest.task_id,
                        "payload": payload,
                    },
                )
            except Exception:
                # Best-effort: a lost trace must never abort the agent run.
                log.warning(
                    "trace recording failed for task %s turn %d; continuing",
                    dest.task_id, turn, exc_info=True,
                )
            turn += 1
            yield message
    finally:
        if own_client:
            await active.aclose()
```
That's it. Yield to the caller; tee to the events table. The four design choices baked into those eighty lines are worth naming because they're the ones that go wrong if you skip past them:
- Caller-side, not runner-side. The wrapper sits at the call site that already knows `session_id` and `task_id`. The agent runner stays a pure SDK wrapper. This is the boring choice and the right choice — it keeps the runner module reusable in contexts (testing, ad-hoc scripts) where there's no state service to record into.
- Best-effort. A network blip, a state-service restart, a transient permission error — none of them abort the agent. The recorder catches every exception, logs a warning, and continues. The asymmetry is correct: the agent's job is to ship the feature, not to ship the trace. Lost traces are a nuisance. Lost agent runs are a fire.
- Lossless serialization. `_serialize_message` walks the SDK Message object's attributes generically — model, stop_reason, usage, content blocks — and JSON-serializes them with no projection, no opinion. Whatever shape the SDK emits is what lands in the database. When the SDK adds a new content-block type next quarter, the recorder doesn't break.
- One event per Message, not per content-block. Tool-use ↔ tool-result correlation stays implicit via the SDK's IDs; reconstructing the conversation at export time is straightforward; and the events table doesn't 5x its row count for marginal queryability.
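For concreteness, a minimal sketch of what that generic walk might look like, assuming the SDK messages are dataclass-like objects; the real `_serialize_message` may handle more types:

```python
import dataclasses
from typing import Any


def _serialize_message(message: Any, *, turn: int) -> dict:
    """Lossless-ish generic serialization: walk attributes, no projection."""
    def to_jsonable(obj: Any) -> Any:
        if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
            return {f.name: to_jsonable(getattr(obj, f.name))
                    for f in dataclasses.fields(obj)}
        if isinstance(obj, (list, tuple)):
            return [to_jsonable(x) for x in obj]
        if isinstance(obj, dict):
            return {k: to_jsonable(v) for k, v in obj.items()}
        if isinstance(obj, (str, int, float, bool)) or obj is None:
            return obj
        return repr(obj)  # last resort: never raise, never drop the event

    return {
        "turn": turn,
        "message_type": message.__class__.__name__,
        "data": to_jsonable(message),
    }
```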
The storage is the existing events table. No new schema. The payload is a JSON column. SQLite handles 1–3 MB per task comfortably. A hundred tasks is 100–300 MB. Disk is cheap. WAL mode makes the writes essentially free at this volume. The state service this lands inside has been doing this for other event types since v0.1.
4. The merge_commit_sha column does most of the conceptual work
The single largest design decision in this whole feature is one nullable String(40) column on the task table. Everything else is mechanism. This column is meaning.
When the harness squash-merges a feature branch to main, squash_merge() returns {"merged": True, "commit_hash": "abc1234"}. The cli.py task handler PATCHes that hash back to the corresponding task row. The PATCH is best-effort and try/excepted because the task is already complete by then — a failed PATCH costs you the diff label for that record, not the agent run.
At export time, --include-diff reads the column and shells out to git show --pretty=format: <sha> against the project's git repo. The diff lands on the JSONL record as final_diff. Now every outcome="success" trajectory carries the canonical patch the expert eventually shipped — the one that survived the test suite, the code review, the squash merge.
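The export-side lookup is a thin wrapper around subprocess. A sketch, with `final_diff_for` as a hypothetical name:

```python
import subprocess


def final_diff_for(repo_path: str, sha: str) -> str:
    # --pretty=format: suppresses the commit header, leaving only the patch.
    return subprocess.run(
        ["git", "show", "--pretty=format:", sha],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
```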
This is the difference between "imitation data" and "imitation + reward". It's also the difference between "a corpus you can SFT on" and "a corpus you can DPO on later." You don't need the DPO pipeline today — the schema's already forward-compatible, so when you decide it's time, the labels are sitting there on disk waiting.
I did not appreciate how much this column matters until I started thinking about evaluation. If you're going to fine-tune a smaller model on captured trajectories, you need a metric that says "did the smaller model learn to land the right diff?" Not "did the smaller model produce text that looks like the expert" — that's BLEU on assistant content, and BLEU on assistant content is a vanity metric. The honest metric is diff similarity: reconstruct the smaller model's proposed patch from its tool-call sequence (Edit / Write blocks), score it with line-level Jaccard plus `difflib.SequenceMatcher.ratio()` against `final_diff`, and call that your eval. You cannot run that eval without the ground-truth column. The column is the experiment.
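A sketch of that scorer, assuming both patches have already been reconstructed as unified-diff strings; the 50/50 blend of the two signals is illustrative, not tuned:

```python
import difflib


def diff_similarity(candidate: str, reference: str) -> float:
    """Line-level Jaccard blended with a character-level sequence ratio."""
    cand, ref = set(candidate.splitlines()), set(reference.splitlines())
    union = cand | ref
    jaccard = len(cand & ref) / len(union) if union else 1.0
    ratio = difflib.SequenceMatcher(None, candidate, reference).ratio()
    return (jaccard + ratio) / 2
```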
5. Format projection is a one-page module
With the superset captured, projection to any chat format at export time is mechanical:
- OpenAI tool-use — fold thinking + text + tool_use blocks into one assistant message with `tool_calls`; emit each tool_result block as its own `role: tool` message. Default format. Reads natively into Hugging Face `apply_chat_template(tools=...)`.
- ShareGPT — flatten tool calls to `<tool_call name="X">{...}</tool_call>` text. Lossy, but trl/Axolotl ShareGPT loaders eat it without complaining.
- ChatML — generic `<|im_start|>` tags; no tool semantics; useful for non-tool-using base models.
- raw-jsonl — direct dump of the SDK message stream. Use when you want to write your own templating.
The projector module is two hundred lines. The interesting half is _assistant_from_blocks, which folds an assistant message's heterogeneous content blocks into one OpenAI-format message. Thinking blocks become a thinking field (a non-standard extension that most loaders silently drop, which is fine — if you want chain-of-thought training, use --format raw-jsonl). Text blocks concatenate to content. Tool-use blocks become tool_calls[] with their JSON arguments stringified. The shape mirrors what apply_chat_template expects when you pass tools=....
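A condensed sketch of that fold, assuming the content blocks have already been serialized to dicts in the Anthropic wire shape, where each block carries a type field:

```python
import json


def _assistant_from_blocks(blocks: list[dict]) -> dict:
    """Fold heterogeneous content blocks into one OpenAI-style message."""
    msg: dict = {"role": "assistant", "content": ""}
    tool_calls = []
    for block in blocks:
        if block["type"] == "thinking":
            msg["thinking"] = block["thinking"]  # non-standard extension
        elif block["type"] == "text":
            msg["content"] += block["text"]
        elif block["type"] == "tool_use":
            tool_calls.append({
                "id": block["id"],
                "type": "function",
                "function": {
                    "name": block["name"],
                    # OpenAI format wants arguments as a JSON string
                    "arguments": json.dumps(block["input"]),
                },
            })
    if tool_calls:
        msg["tool_calls"] = tool_calls
    return msg
```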
Hygiene at the JSONL layer is two more functions:
- Dedupe — drop trajectories where `(prompt, final_diff)` already appears in the corpus. Default mode is "both must match." Cheap and obvious — protects against the user re-running the same task five times during debugging and polluting their training set.
- Deterministic split — train/val/test by SHA-256 of `task_id`. The same input set always partitions the same way, so the val and test holdouts stay stable across re-exports. Important when you're iterating on the export pipeline and want to know whether a metric change came from new data or a new partition. (Both are sketched below.)
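A sketch of both, with `is_duplicate` and `split_for` as hypothetical names:

```python
import hashlib


def is_duplicate(record: dict, seen: set[tuple[str, str]]) -> bool:
    """Default dedupe mode: both prompt and final_diff must match."""
    key = (record.get("prompt", ""), record.get("final_diff", ""))
    if key in seen:
        return True
    seen.add(key)
    return False


def split_for(task_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministic split: hash the task_id, map to [0, 1), bucket."""
    digest = hashlib.sha256(task_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"
```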
That's the export mechanism. Reader → filter → projector → redactor → splitter → JSONL. Each stage is replaceable. The reader is the only one with database access. Everything downstream operates on dicts.
6. Redaction has to happen at export, not capture
This was the choice I almost got wrong, and I want to flag it because the wrong instinct is very tempting.
The wrong instinct: "I should redact secrets at capture time, before they hit the database." This feels safer. It's not. It's destructive. If your redaction rule has a bug — and your redaction rule will have a bug, because regex secret-detection is not a solved problem — you've lost the original data forever. You can't re-export with a fixed rule. You can't audit what was actually said. The DB is downstream of the redactor and you've thrown away your ground truth.
The right instinct: redact at export time. Keep the database authoritative. Treat the project-local .claw-forge/state.db as having the same trust boundary as the source code itself — if a laptop compromise leaks the DB, the source code is the bigger problem. The export pipeline applies redaction rules to projected JSONL records after projection, so re-exporting with new rules is a one-line operation. You can also generate a fully-faithful, un-redacted export for local fine-tuning experiments, then a redacted export for sharing. The DB is the same.
The redaction module is three composable rules:
- `SecretsRule` — well-known patterns: AWS keys, GitHub PATs, Stripe keys, Anthropic keys, OpenAI keys, GCP API keys, Authorization headers. Conservative by design — better to miss some than to mangle innocuous text that happens to look secret-shaped.
- `UsernamesRule` — substitutes `/Users/<you>/` and `/home/<you>/` with `<REDACTED:username>` while preserving directory structure. File layout is meaningful learning signal; the username is not.
- `CustomPatternsRule` — user-supplied regex list from the YAML config. For project-specific stuff: customer IDs, internal hostnames, ticket prefixes.
The Redactor walks records recursively. Strings get every rule applied. Dicts and lists recurse. Everything else passes through. Replacement markers are structured (<REDACTED:secret>, <REDACTED:username>) so the model never learns to fabricate the redacted form — it learns "this is a placeholder, ignore."
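A compressed sketch of that walk; the patterns are illustrative stand-ins for the real rule set:

```python
import re
from typing import Any

# Illustrative patterns only — the real SecretsRule covers more providers.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_-]{20,}"),  # Anthropic API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),        # GitHub classic PATs
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key IDs
]
HOME_PATH = re.compile(r"(/Users|/home)/[^/\s]+")


def redact(value: Any) -> Any:
    """Strings get every rule; dicts and lists recurse; all else passes through."""
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("<REDACTED:secret>", value)
        return HOME_PATH.sub(r"\1/<REDACTED:username>", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value
```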
7. The opt-in is two flags and a banner
Anthropic's Usage Policies prohibit using Claude outputs to develop models that compete with their services. This feature is squarely in the grey zone unless you treat it carefully. I built the gate with that in mind.
There are two flags in claw-forge.yaml:
```yaml
training_traces:
  enabled: true
  acknowledged_terms: true
```
Both must be true before the recorder emits anything. If enabled: true but acknowledged_terms: false, the state service logs a one-time banner at startup with the relevant policy excerpt and does not record traces. The user has to flip the second flag explicitly.
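The gate check itself is a few lines. A sketch, with `traces_enabled` as a hypothetical name:

```python
import logging

log = logging.getLogger("claw_forge.state")


def traces_enabled(config: dict) -> bool:
    """Record traces only when both flags are explicitly true."""
    tt = config.get("training_traces", {})
    if tt.get("enabled") and not tt.get("acknowledged_terms"):
        # One-time startup banner; the policy excerpt would go here.
        log.warning(
            "training_traces.enabled is set but acknowledged_terms is not; "
            "no traces will be recorded until you acknowledge the policy."
        )
    return bool(tt.get("enabled")) and bool(tt.get("acknowledged_terms"))
```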
In v0.7.1 I made claw-forge init scaffold both flags as true by default — the scaffold itself acknowledges the policy via the comments in the YAML, and the user is opting in by running init. Existing claw-forge.yaml files are untouched (the scaffold only writes when the file is absent). This was a deliberate friction-vs-discovery tradeoff: gate-by-default would mean nobody ever discovers the feature; default-on means everyone discovers it but the policy reminder is one cat claw-forge.yaml | head -100 away. I chose discovery.
This is a feature that's intended for personal/internal distillation — building a smaller model that imitates your own Claude usage on your own code, for your own internal use. Distribution of derived models is the user's responsibility and emphatically out of scope. The provenance fields on every JSONL record (model name, capture date, claw-forge version, applied redaction rules) preserve a verifiable lineage if you ever need to demonstrate "this corpus came from my own Claude usage."
8. The training recipe is short and lives outside the harness
claw-forge stops at JSONL export. The downstream Unsloth/Axolotl/trl pipeline lives in a separate user repo. The harness has no train command, no model registry, no inference layer. The reasons are scope hygiene: training stacks change fast, GPU dependencies are heavyweight, and the harness is supposed to run on any laptop. The recipe is documented in a markdown file (docs/training/unsloth-recipe.md) and ships as a reference, not as code.
The recipe at the time of writing:
- Base model: `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit`. Coder-specialised, native tool-use chat template, Apache 2.0 licence, fits a 24 GB consumer GPU with LoRA. DeepSeek-Coder-V2-Lite-Instruct is the alternative if you have raw eval scores to chase and don't mind MoE finickiness with LoRA.
- LoRA config: r=16, alpha=32, target modules q/k/v/o + gate/up/down, dropout 0, gradient checkpointing on.
- Training: per-device batch 2, grad accumulation 8 (effective 16), 3 epochs, lr 2e-4, cosine schedule, adamw_8bit, max_seq_length 8192, bf16 if available.
- Eval on the held-out test split: ROUGE-L on assistant content (a cheap reasoning-quality proxy), and diff similarity on the model's reconstructed patch vs the captured `final_diff` (the real correctness signal). The latter is the one that matters; the former is the one you watch during training to spot collapse.
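Translated into code, the recipe looks roughly like this; a sketch following the Unsloth quickstart pattern at the time of writing, where the dataset path and the pre-rendered text field are assumptions about your export:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=8192,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=32, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# train.jsonl: the OpenAI tool-use export, pre-rendered to a "text" field
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=8192,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch 16
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        optim="adamw_8bit",
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```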
For a corpus of ~500 trajectories at average 6K tokens each, expect single-digit hours on a 4090 for a full training run. Calibrate with a 50-task dry run before committing. That's not a thousand-GPU pretraining job; it's a weekend's worth of consumer-grade compute on top of months of accumulated agent traces. The economics make sense at single-developer scale, which is the part nobody seems to be talking about yet.
9. What this doesn't solve (be honest)
Some things this design explicitly does not handle:
- Distribution rights. I keep coming back to this because it's the part most likely to bite someone. Training a model on Sonnet's outputs and using it internally on your own code is one thing. Distributing that model — uploading weights to HuggingFace, releasing a derivative product — is a different thing and not protected by anything in this pipeline. Read the policy. Talk to a lawyer. The provenance stamping helps with audit; it does not authorize redistribution.
- Eval beyond diff similarity. Diff similarity catches "did the model land the right code change" but it doesn't catch "did the model produce a clean, well-reasoned, well-commented solution." For that you need either human eval or LLM-as-judge eval, both of which sit outside the harness. The corpus enables both, but the harness doesn't ship them.
- Multi-Claude-version mixing. Every trace stamps the originating model name. Mixing trajectories captured under Sonnet 4.5 with trajectories captured under Opus 4.7 gives you a heterogeneous teacher signal. Sometimes that's what you want — pooled expert demonstrations across model strengths — and sometimes it isn't (when the lower-capability traces are noise). The provenance field lets you filter, but the harness has no opinion about whether you should.
- Capturing failure modes that didn't go through claw-forge. If the engineer drops out of the harness and edits a file by hand, none of that lands in the trace. The corpus represents what the agent did, not what the human did to clean up after the agent. For pure agent-distillation that's fine; for "train a model that handles the real workflow including human-in-the-loop fixups," this is a gap.
- Cross-project corpus building. Each project has its own state.db. Combining corpora across projects is `cat *.jsonl` plus a check that the `provenance.claude_model` and `claw_forge_version` fields are compatible. Works fine for SFT, but if you're seriously building a multi-project corpus you want a manifest and a deduplicator that operates across files. That's tooling I haven't built yet.
The honest framing: this design captures a very specific kind of training data — the agentic coding loop on your own codebase, paired with the ground-truth diff that landed. That kind of data is unusually hard to come by and unusually valuable. It's not a substitute for general-purpose pre-training data, and it's not going to give you a model that handles tasks outside your codebase's distribution. It is going to give you a model that, on tasks similar to the ones you've been running, behaves more like Sonnet than the base Qwen2.5-Coder weights do. That's the win.
10. The cultural shift I keep coming back to
There's a meta-point that took me too long to internalize.
If you're paying for Frontier-model API calls, the expensive artifact isn't the code that ships. The code that ships is checkable, reviewable, reversible. The expensive artifact is the expert demonstration — the nineteen turns of senior-engineer reasoning that took the model forty-three minutes to produce. You're paying for the trajectory whether you save it or not. Saving it is the line between "I rented a senior engineer for an hour" and "I rented a senior engineer for an hour and learned how they work."
The harness equivalent of this insight is: the events table is a training corpus in disguise. The schema was already there. The state service was already writing to it. Adding a new event type and tee-ing the SDK message stream into it is eighty lines of code. The data was always going to be high-value; the only question was whether you'd capture it.
I think more harnesses are going to do this in the next twelve months, and I think it's going to start showing up as a competitive feature. The teams running large agent fleets without trace capture are paying for expert demonstrations and discarding them. The teams running with trace capture have a data flywheel: every agent run produces both a feature and a training example. After six months of that, you have something to fine-tune. After twelve months, you might have a smaller model that handles the easy 60% of your tasks for an order of magnitude less per-call cost than the Frontier model that produced the training data. The Frontier model still handles the hard 40%. The cost curve bends.
That's not a hypothetical; that's just SFT plus rejection-sampling with a corpus you already paid for. The mechanism is well-understood. The piece nobody is shipping yet — at least, not in the open-source agent harness landscape I follow — is the capture primitive. I built it because I wanted it. I'm sharing the design because I think the rest of the ecosystem will arrive at this primitive eventually, and the sooner it's a commodity, the sooner the interesting work above it can start.
If your harness throws away every Message except the final ResultMessage, you are walking past free training data every day. The fix is eighty lines, one nullable column, and a config gate. Build it before you next run the swarm.
Alex Chen builds AI-coding-agent infrastructure shipped to production. He runs ten-agent swarms daily and is currently waiting for his Qwen2.5-Coder fine-tune to finish so he can find out whether the months of captured Sonnet trajectories were worth the disk.