<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Chen</title>
    <description>The latest articles on DEV Community by Alex Chen (@alex_chen_45b61c234682eb6).</description>
    <link>https://dev.to/alex_chen_45b61c234682eb6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3883787%2Ff7c6b285-a545-467d-9f79-594a9e5b4e49.png</url>
      <title>DEV Community: Alex Chen</title>
      <link>https://dev.to/alex_chen_45b61c234682eb6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_chen_45b61c234682eb6"/>
    <language>en</language>
    <item>
      <title>The Architectural Shape Hint: A Spec-Time Trick That Lets 10 AI Agents Run in Parallel Without Stepping on Each Other</title>
      <dc:creator>Alex Chen</dc:creator>
      <pubDate>Sun, 03 May 2026 06:12:10 +0000</pubDate>
      <link>https://dev.to/alex_chen_45b61c234682eb6/the-architectural-shape-hint-a-spec-time-trick-that-lets-10-ai-agents-run-in-parallel-without-2g69</link>
      <guid>https://dev.to/alex_chen_45b61c234682eb6/the-architectural-shape-hint-a-spec-time-trick-that-lets-10-ai-agents-run-in-parallel-without-2g69</guid>
      <description>&lt;p&gt;I run agent swarms now. Not "an agent" — &lt;em&gt;agents&lt;/em&gt;, plural, in flight at once, each working on a different feature against the same repo. Ten agents per session is normal. Twenty isn't unusual when the spec is well-decomposed. The token math works, the wall-clock math works, the model latency hides inside the swarm because something is always landing while something else is still compiling. The economics make a strong case for parallel execution as the default.&lt;/p&gt;

&lt;p&gt;Until you hit the wall everyone hits: &lt;strong&gt;two agents touched the same file&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I've spent the better part of the year fighting this. I've shipped four layers of runtime defense. They all work and none of them are the answer. The answer turned out to be one attribute on the spec. This is the post about that one attribute.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The four layers nobody told you you'd need
&lt;/h2&gt;

&lt;p&gt;Before I describe the fix, let me describe the disease — because if you're running parallel agents and you &lt;em&gt;don't&lt;/em&gt; recognize this stack, you're probably going to recognize it next week.&lt;/p&gt;

&lt;p&gt;When two agents in flight at once both want to edit &lt;code&gt;src/router/routes.py&lt;/code&gt;, here's what claw-forge (the harness I work in) does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File-claim locks.&lt;/strong&gt; Each task declares &lt;code&gt;touches_files=[...]&lt;/code&gt; upfront. The dispatcher refuses to start a second task that wants a file currently held by a running task. The second task defers to the next dispatch cycle (the claim check is sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-dispatch worktree sync.&lt;/strong&gt; Before the agent runs, the harness merges &lt;code&gt;target_branch&lt;/code&gt; into the feature branch &lt;em&gt;inside the worktree&lt;/em&gt;. If &lt;code&gt;target&lt;/code&gt; moved while the task was queued, the merge happens before any token is spent. Conflicts surface as &lt;code&gt;resume_conflict:&lt;/code&gt; failures with the offending file list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catch-up rebase inside &lt;code&gt;squash_merge&lt;/code&gt;.&lt;/strong&gt; When the agent's branch finally squash-merges to main and conflicts with concurrent work, the harness merges target into the branch and retries the squash automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume-on-retry preamble.&lt;/strong&gt; If a task fails mid-run, the next attempt picks up the worktree as-is, with a prompt prefix listing what's already committed and what failed last time. The agent doesn't redo the first 60% of the work.&lt;/li&gt;
&lt;/ol&gt;
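
&lt;p&gt;For the curious: the layer-1 claim check itself is small. A minimal sketch of the overlap test (names are illustrative; claw-forge's internals aren't reproduced here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the layer-1 claim check; the real dispatcher's
# names and structure are assumptions here.
from fnmatch import fnmatch

def claims_conflict(requested: list[str], held: list[str]) -&amp;gt; bool:
    """True if any requested file or glob overlaps a held one."""
    return any(
        fnmatch(r, h) or fnmatch(h, r)
        for r in requested
        for h in held
    )

# The second task defers if it wants a file a running task already holds.
assert claims_conflict(["src/router/routes.py"], ["src/router/routes.py"])
assert not claims_conflict(["src/plugins/auth/api.py"], ["src/plugins/profile/**"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;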

&lt;p&gt;This stack is correct. Each layer earns its keep. If I deleted any one of them, real users would file real bug reports within 48 hours. But notice what they all have in common: &lt;strong&gt;they are reactive&lt;/strong&gt;. Every layer is a response to "two agents touched the same file." The conflict has already happened by the time the layer fires.&lt;/p&gt;

&lt;p&gt;What if it never happened?&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Conflicts are usually predictable from architecture
&lt;/h2&gt;

&lt;p&gt;Sit down with a senior engineer who has worked on a codebase for six months. Hand them a list of feature requests. Ask: "If we built these in parallel with one engineer per feature, where would the merge conflicts happen?" They'll answer within five minutes, and they'll be right. They don't run the merges. They look at the codebase's structure and &lt;em&gt;know&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The reason they know is that conflicts cluster around &lt;strong&gt;architectural surfaces&lt;/strong&gt;. A few specific files — the dispatcher, the routes table, the global event bus, the error envelope, the auth middleware — get touched by every feature. Most other files are owned by one feature each. The conflict surface isn't uniformly distributed across the repo. It's concentrated on the structural choke points.&lt;/p&gt;

&lt;p&gt;This is the same insight that drives plugin architectures in big software systems. WordPress plugins don't conflict because each lives in &lt;code&gt;wp-content/plugins/&amp;lt;name&amp;gt;/&lt;/code&gt;. VS Code extensions don't conflict because each lives in its own directory and registers through a stable API. The host is small and stable. The plugins are everything else.&lt;/p&gt;

&lt;p&gt;If you build your codebase as a small core plus many plugins, &lt;em&gt;and&lt;/em&gt; your spec tells the harness which features are plugins versus core, &lt;em&gt;and&lt;/em&gt; the harness honors that distinction at scheduling time — then ten agents working on ten plugins literally cannot conflict. They are editing files in ten different directories. The locks are decorative. The catch-up rebase is dead code. The pre-dispatch sync is a no-op.&lt;/p&gt;

&lt;p&gt;This was the unlock. Encode the architectural intent in the spec. Let the scheduler use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Two shapes, one attribute
&lt;/h2&gt;

&lt;p&gt;Every feature in our specs now carries an architectural-shape attribute. There are exactly two shapes that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="plugin"&lt;/code&gt;&lt;/strong&gt; — vertical features. Live in their own directory, own their own data model, own their own tests. Adding or removing the plugin doesn't touch sibling plugins. Examples: "user can register," "user can edit profile," "task CRUD with tag filtering." Each lives in &lt;code&gt;src/plugins/&amp;lt;name&amp;gt;/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="core"&lt;/code&gt;&lt;/strong&gt; — cross-cutting concerns. Edit files used by every plugin. Examples: "all endpoints validate JWT," "uniform RFC 7807 error envelope," "global rate limit," "database connection pool." Each lives in &lt;code&gt;src/core/&amp;lt;concern&amp;gt;/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No tier, no taxonomy, no UML. Two values. The simplicity is load-bearing — if the classifier had three values it would have ten by next quarter, and the scheduling rule would have to handle a Cartesian product of cases.&lt;/p&gt;

&lt;p&gt;A spec entry now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;feature&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"14"&lt;/span&gt; &lt;span class="na"&gt;shape=&lt;/span&gt;&lt;span class="s"&gt;"plugin"&lt;/span&gt; &lt;span class="na"&gt;plugin=&lt;/span&gt;&lt;span class="s"&gt;"auth"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;User can register with email and password&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;feature&lt;/span&gt; &lt;span class="na"&gt;index=&lt;/span&gt;&lt;span class="s"&gt;"20"&lt;/span&gt; &lt;span class="na"&gt;shape=&lt;/span&gt;&lt;span class="s"&gt;"core"&lt;/span&gt;
         &lt;span class="na"&gt;touches_files=&lt;/span&gt;&lt;span class="s"&gt;"src/core/middleware/auth.py"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;description&amp;gt;&lt;/span&gt;All endpoints validate JWT on incoming requests&lt;span class="nt"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;plugin="auth"&lt;/code&gt; attribute auto-fills &lt;code&gt;touches_files&lt;/code&gt; to &lt;code&gt;["src/plugins/auth/**"]&lt;/code&gt;. The harness now knows that feature 14 will only touch files inside &lt;code&gt;src/plugins/auth/&lt;/code&gt;. Two &lt;code&gt;shape="plugin"&lt;/code&gt; features with different &lt;code&gt;plugin&lt;/code&gt; names are &lt;em&gt;guaranteed&lt;/em&gt; to be file-disjoint. Not "probably." Not "usually." Guaranteed by directory boundaries.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;shape="core"&lt;/code&gt; features the auto-derivation can't help — cross-cutting work touches a specific file by name. The author writes &lt;code&gt;touches_files="src/core/middleware/auth.py"&lt;/code&gt; explicitly. The parser refuses any spec where &lt;code&gt;shape="core"&lt;/code&gt; lacks a &lt;code&gt;touches_files&lt;/code&gt; value. Cross-cutting work without a declared file set is a bug in the spec, not a runtime decision the dispatcher gets to make.&lt;/p&gt;
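
&lt;p&gt;Both rules land in a few lines at parse time. A sketch, assuming a simple feature dataclass (field names are mine, not claw-forge's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class Feature:
    index: int
    shape: str | None = None          # "plugin" | "core" | None (legacy)
    plugin: str | None = None
    touches_files: list[str] = field(default_factory=list)

def finalize(feat: Feature) -&amp;gt; Feature:
    # Auto-derivation: a plugin feature owns its directory and nothing else.
    if feat.shape == "plugin" and feat.plugin and not feat.touches_files:
        feat.touches_files = [f"src/plugins/{feat.plugin}/**"]
    # Spec-time validation: core work without a declared file set is a bug
    # in the spec, so refuse before the planner spends a single token.
    if feat.shape == "core" and not feat.touches_files:
        raise ValueError(f"feature {feat.index}: shape='core' needs touches_files")
    return feat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;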

&lt;h2&gt;
  
  
  4. The scheduling rule that follows
&lt;/h2&gt;

&lt;p&gt;Once shape is in the spec, the dispatcher gets two new rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="plugin"&lt;/code&gt; tasks dispatch freely up to &lt;code&gt;--concurrency N&lt;/code&gt;.&lt;/strong&gt; Their file sets are disjoint by construction. The file-claim lock layer becomes a sanity check rather than a primary defense. Plugin tasks scale linearly with concurrency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shape="core"&lt;/code&gt; tasks single-flight.&lt;/strong&gt; At most one cross-cutting task runs at a time, regardless of &lt;code&gt;--concurrency&lt;/code&gt;. Two core tasks both want to edit the auth middleware? They serialize. Always. No clever overlap analysis, no "well actually they touch different lines." Cross-cutting work is cheap to serialize — it's a small minority of features — and the cost of getting it wrong is high.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks without &lt;code&gt;shape&lt;/code&gt;&lt;/strong&gt; (legacy specs) fall through to the existing concurrency cap + file-claim lock behavior. Backward compatibility is free because the new rules are gated on &lt;code&gt;task.shape IS NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduler's filter is twelve lines of Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ready_tasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskNode&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_is_ready&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="c1"&gt;# Cross-cutting (shape="core") tasks single-flight: drop any
&lt;/span&gt;    &lt;span class="c1"&gt;# candidate ``core`` task from the ready set if another core task
&lt;/span&gt;    &lt;span class="c1"&gt;# is already running.
&lt;/span&gt;    &lt;span class="n"&gt;any_core_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;any_core_running&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;core&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire enforcement mechanism. The scheduler has no opinion about parallelism beyond this. The &lt;code&gt;touches_files&lt;/code&gt; lock layer stays on as the second line of defense for cases where a plugin author lied about their shape (which code review should catch separately).&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Why this works structurally, not just behaviorally
&lt;/h2&gt;

&lt;p&gt;The thing that makes this approach durable is that the safety property is &lt;strong&gt;structural&lt;/strong&gt;: it's a consequence of file-system layout, not of clever runtime detection.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;src/plugins/auth/&lt;/code&gt; and &lt;code&gt;src/plugins/profile/&lt;/code&gt; are the only file sets two agents touch, there is no possible interleaving where they conflict. Not because the harness is smart. Because the files don't overlap. The same way two &lt;code&gt;git worktree&lt;/code&gt; instances on different branches can edit different files without any locking — git just doesn't see them as a conflict.&lt;/p&gt;

&lt;p&gt;Compare this to the old approach: "predict conflicts at runtime by checking which files each agent claims to touch." That works &lt;em&gt;if&lt;/em&gt; every agent honestly declares its file set. In practice, agents trying to wire a plugin into a registry often need to edit the registry too. They forget to declare the registry file. The lock layer doesn't fire. The merge conflicts at squash time. The whole reactive stack kicks in.&lt;/p&gt;

&lt;p&gt;The plugin-shape approach refuses to be in that situation. If your codebase has a registry that every plugin has to edit, that registry is a hotspot and you should restructure it — or declare it as &lt;code&gt;shape="core"&lt;/code&gt; and serialize work on it. The architecture catches up to the parallelism, not the other way around.&lt;/p&gt;

&lt;p&gt;This is also why the harness composes naturally with my project's &lt;code&gt;boundaries&lt;/code&gt; audit pass. That tooling already identifies hotspot files (registries, route tables, dispatch chains) and refactors them into plugin-extensible patterns. After a &lt;code&gt;boundaries apply --auto&lt;/code&gt; pass, the codebase is more amenable to plugin-shape features — fewer surfaces remain that &lt;em&gt;force&lt;/em&gt; a &lt;code&gt;shape="core"&lt;/code&gt; declaration. The two pieces — spec-time architectural intent and codebase structural refactoring — pull in the same direction. Each makes the other more effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. The brownfield path: refactor first, then extend
&lt;/h2&gt;

&lt;p&gt;Greenfield projects can be built plugin-shaped from day one. Brownfield projects — i.e. every project worth working on — usually have an existing dispatcher / route table / event bus that gets touched by every feature. You can't bolt plugin-shape semantics onto a codebase whose architecture isn't ready for them.&lt;/p&gt;

&lt;p&gt;So the brownfield workflow has an extra step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;analyze&lt;/code&gt; — generate a manifest with stack, conventions, test baseline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boundaries audit&lt;/code&gt; — emit &lt;code&gt;boundaries_report.md&lt;/code&gt; listing extension hotspots and the refactor pattern best suited to each (registry / split / route-table / extract-collaborators).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boundaries apply --auto&lt;/code&gt; — refactor each hotspot one at a time on its own feature branch with test gating. Squash-merges to main on green; reverts on red.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/create-spec&lt;/code&gt; — the slash command reads &lt;code&gt;boundaries_report.md&lt;/code&gt; first. If hotspots remain unrefactored, it warns the user before generating any spec. Then it asks &lt;code&gt;shape&lt;/code&gt; per feature.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claw-forge add&lt;/code&gt; — runs the planner against the now-shape-aware spec.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Skipping step 3 is the costly mistake. New features land as &lt;code&gt;shape="plugin"&lt;/code&gt;, but the file-claim lock catches them when they try to edit the un-refactored hotspot, the dispatcher fails the task with &lt;code&gt;resume_conflict&lt;/code&gt;, and the agent has wasted one full attempt on stale state. Refactoring up front is cheaper than discovering you need to mid-flight. The boundaries harness exists exactly to make that "up front" step automatic.&lt;/p&gt;

&lt;p&gt;The cultural ask is: when adding non-trivial features to an existing codebase, do the structural work &lt;em&gt;first&lt;/em&gt;. That's not a new principle — it's "make the change easy, then make the easy change," Kent Beck, twenty years ago. Plugin-shape specs make this principle observable: if you can't write a clean spec without declaring half your features as &lt;code&gt;shape="core"&lt;/code&gt;, that's a structural signal, not a spec-writing failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. What this doesn't solve (be honest)
&lt;/h2&gt;

&lt;p&gt;I want to be careful not to oversell this. Here's what plugin-shape specs explicitly do not do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic conflicts inside a single plugin.&lt;/strong&gt; Two tasks for the same plugin (&lt;code&gt;plugin="auth"&lt;/code&gt;) still serialize via &lt;code&gt;touches_files&lt;/code&gt; locks. Adding "user can reset password" while "user can change email" is in flight will defer the second one until the first finishes. This is fine — it's the correct behavior — but it limits intra-plugin parallelism to one task at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-plugin coupling that wasn't designed in.&lt;/strong&gt; If your &lt;code&gt;tasks&lt;/code&gt; plugin imports from your &lt;code&gt;auth&lt;/code&gt; plugin's internals (and your codebase doesn't enforce plugin isolation via lint or import boundaries), edits to &lt;code&gt;auth/&lt;/code&gt; can break &lt;code&gt;tasks/&lt;/code&gt; after merge. The spec doesn't catch this; tests do (see the import-boundary sketch after this list). Treat the spec as a parallelism hint, not an isolation guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared infrastructure changes.&lt;/strong&gt; A migration that adds a column to the &lt;code&gt;users&lt;/code&gt; table is &lt;code&gt;shape="core"&lt;/code&gt; because the migrations directory is shared. Two such migrations serialize. They have to — concurrent migration writers race on the migration sequence number. Don't try to plugin-ify your migrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specifications written as shape-agnostic.&lt;/strong&gt; A feature whose acceptance criteria say "the system shall …" without naming a directory or file is hard to classify. Either rewrite the criterion to reference a concrete piece of the system, or accept that the feature won't get a &lt;code&gt;shape&lt;/code&gt; attribute and will fall through to legacy scheduling.&lt;/li&gt;
&lt;/ul&gt;
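
&lt;p&gt;On the second point: if you'd rather enforce plugin isolation than hope for it, a blunt import-boundary lint is enough to start. A sketch (not part of claw-forge; it assumes plugins import each other as &lt;code&gt;plugins.&amp;lt;name&amp;gt;&lt;/code&gt; modules):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast
import pathlib

def cross_plugin_imports(root: str = "src/plugins") -&amp;gt; list[str]:
    # Hypothetical lint, not claw-forge code: flag any plugin module that
    # imports a sibling plugin's internals.
    violations = []
    for path in pathlib.Path(root).rglob("*.py"):
        own = path.relative_to(root).parts[0]   # the plugin this file lives in
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.ImportFrom) and node.module:
                parts = node.module.split(".")
                if parts[0] == "plugins" and len(parts) &amp;gt; 1 and parts[1] != own:
                    violations.append(f"{path}: imports {node.module}")
    return violations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;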

&lt;p&gt;The honest framing: plugin-shape specs make the &lt;em&gt;common&lt;/em&gt; parallelism case (many vertical features against a clean plugin host) trivial-safe. The hard cases — cross-cutting concerns, coupled plugins, shared infrastructure — still require engineering judgment. The win is that the common case becomes the default rather than the exception.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. The cultural shift this enables
&lt;/h2&gt;

&lt;p&gt;There's a meta-point here that's bigger than the technical mechanism.&lt;/p&gt;

&lt;p&gt;Most discussions of "AI agents at scale" focus on the &lt;em&gt;agent's&lt;/em&gt; capabilities — context window, reasoning depth, tool-use accuracy. Those matter, but they're not where the leverage is. The leverage is in &lt;strong&gt;encoding the human's architectural intent in a place the harness can read&lt;/strong&gt;. Specs are not just task descriptions for the agent. They're scheduling hints for the orchestrator. They're isolation declarations for the locks. They're refactoring targets for the boundaries pass. They're documentation for the next human reviewer.&lt;/p&gt;

&lt;p&gt;When you start writing specs that carry this much load, the spec format itself stops being a casual prose blob and becomes a structured contract. XML attributes that look fussy at first — &lt;code&gt;index&lt;/code&gt;, &lt;code&gt;depends_on&lt;/code&gt;, &lt;code&gt;shape&lt;/code&gt;, &lt;code&gt;plugin&lt;/code&gt;, &lt;code&gt;touches_files&lt;/code&gt; — earn their keep because every one of them maps to a runtime decision the harness will otherwise have to guess. Guessing is what produces the four-layer reactive stack. Declaring is what makes that stack a quiet backstop instead of a daily firefight.&lt;/p&gt;

&lt;p&gt;This is the same shift that happened in deployment automation a decade ago: declarative manifests beat imperative shell scripts because the &lt;em&gt;intent&lt;/em&gt; — "I want three replicas behind a load balancer" — was machine-readable rather than buried in a sequence of side-effecting commands. Plugin-shape specs are doing the same thing for AI-agent orchestration: making intent readable so the orchestrator can stop guessing.&lt;/p&gt;

&lt;p&gt;If you're building AI-coding-agent infrastructure right now and your dispatcher is making scheduling decisions based purely on what's in the queue, you're building the imperative-shell-script version of this. The declarative version — where the agents read what the human meant rather than what they typed — is meaningfully better, and it doesn't require a smarter model. It requires a more structured spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. The minimum implementation
&lt;/h2&gt;

&lt;p&gt;If you want to try this in your own harness, the minimum viable version is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One attribute on your task/feature object.&lt;/strong&gt; Call it &lt;code&gt;shape&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, whatever — but pick &lt;em&gt;exactly two&lt;/em&gt; values. "vertical" and "horizontal" works. "feature" and "infra" works. Two values. The temptation to add a third is a trap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One auto-derivation rule.&lt;/strong&gt; When &lt;code&gt;shape="plugin"&lt;/code&gt; and a &lt;code&gt;plugin="X"&lt;/code&gt; is set, the file-claim list defaults to &lt;code&gt;["plugins/X/**"]&lt;/code&gt;. One line of helper code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One scheduling rule.&lt;/strong&gt; When any &lt;code&gt;shape="core"&lt;/code&gt; task is running, drop other core tasks from the ready set. Twelve lines of Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One spec-time validation.&lt;/strong&gt; &lt;code&gt;shape="core"&lt;/code&gt; without an explicit file list raises an error before the planner runs. Five lines.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the whole ship. Total surface area: maybe 50 lines of harness code, plus the spec schema extension and the docs to teach the spec author what to declare.&lt;/p&gt;

&lt;p&gt;The minimum tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A round-trip test that parses the documented XML example and asserts the auto-derived file lists match (guards against doc/code drift).&lt;/li&gt;
&lt;li&gt;A scheduler test that adds two &lt;code&gt;shape="core"&lt;/code&gt; tasks and confirms only one is in the ready set when the other is running (both scheduler tests are sketched after this list).&lt;/li&gt;
&lt;li&gt;A scheduler test that confirms &lt;code&gt;shape="plugin"&lt;/code&gt; tasks dispatch freely when a core task is running.&lt;/li&gt;
&lt;/ul&gt;
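
&lt;p&gt;The two scheduler tests fold into one sketch; &lt;code&gt;Scheduler&lt;/code&gt; and &lt;code&gt;TaskNode&lt;/code&gt; stand in for whatever your harness calls them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def test_core_tasks_single_flight():
    # Hypothetical harness objects; adapt the names to your own scheduler.
    sched = Scheduler()
    sched.add(TaskNode(index=1, shape="core", status="running"))
    queued = sched.add(TaskNode(index=2, shape="core", status="queued"))
    plugin = sched.add(TaskNode(index=3, shape="plugin", status="queued"))

    ready = sched.get_ready_tasks()
    assert queued not in ready   # second core task held back
    assert plugin in ready       # plugin tasks dispatch freely past it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;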

&lt;p&gt;Three tests. Done. The pattern compounds: now your codebase has a place to put new shape-aware behavior, and your spec authors have a place to encode new architectural intent. Future work — auto-derived shape inference via static analysis, telemetry on adoption rates, conflict-prediction at scheduler time — all builds on this primitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Closing thought
&lt;/h2&gt;

&lt;p&gt;The thing that took me too long to internalize is that &lt;strong&gt;parallelism is a property of the architecture, not the runtime&lt;/strong&gt;. You can't bolt safe parallelism onto a codebase whose architecture forces every feature through the same chokepoint. You can build elaborate runtime defenses against the resulting conflicts — and you should, because real codebases always have &lt;em&gt;some&lt;/em&gt; chokepoints — but the runtime defenses are the patch, not the cure.&lt;/p&gt;

&lt;p&gt;The cure is to design codebases where parallelism is structurally safe, and to encode that structural intent in the spec so the orchestrator can lean on it. Two values, one attribute, twelve lines of scheduler logic. That's the surface area of the win. The cost was a year of fighting the four-layer reactive stack to recognize that the layers were treating symptoms, not the disease.&lt;/p&gt;

&lt;p&gt;If your AI-agent harness is dropping conflicts on you, look at your spec format before you look at your dispatcher. The dispatcher is downstream. The spec is where the architecture lives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Alex Chen builds AI-coding-agent infrastructure shipped to production. He runs ten-agent swarms daily and would like to thank the team's &lt;code&gt;boundaries&lt;/code&gt; harness for finally making it stop hurting.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Building an Autonomous Crypto Trading Bot</title>
      <dc:creator>Alex Chen</dc:creator>
      <pubDate>Sun, 03 May 2026 06:05:58 +0000</pubDate>
      <link>https://dev.to/alex_chen_45b61c234682eb6/building-an-autonomous-crypto-trading-bot-2lc4</link>
      <guid>https://dev.to/alex_chen_45b61c234682eb6/building-an-autonomous-crypto-trading-bot-2lc4</guid>
      <description>&lt;p&gt;I've been spending too much time inside trading bot codebases lately. Most of them are one of two things: a 200-line Jupyter notebook that someone calls a "system," or a sprawling monorepo where the strategy logic and exchange integration are so tangled that you can't swap exchanges without rewriting half the code.&lt;/p&gt;

&lt;p&gt;A few weeks ago I went deep on &lt;strong&gt;AlphaStrike&lt;/strong&gt;, a production-grade crypto perpetual futures bot. Not because the returns were headline-grabbing (though a 2.4 Sharpe is nothing to sneeze at), but because the architecture solves problems most of us hand-wave past. I want to walk through what's interesting, what's novel, and what I'd steal for my own projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Space
&lt;/h2&gt;

&lt;p&gt;Algorithmic crypto trading sounds simple at the whiteboard: read prices, predict direction, place orders, manage risk. In practice, every layer of that stack will try to kill you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exchanges are inconsistent.&lt;/strong&gt; WEEX, Binance, Hyperliquid — every one has different symbol formats, different REST paradigms, different WebSocket lifecycles, different ways of representing a position.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models decay.&lt;/strong&gt; A signal that worked last quarter doesn't work this quarter. Pretending otherwise is how accounts get blown up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volatility is non-stationary.&lt;/strong&gt; Static leverage and fixed position sizes are a lie you tell yourself until you wake up at -40% drawdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure quant is fragile.&lt;/strong&gt; Numbers don't know that the SEC just sued the second-largest exchange.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AlphaStrike's design isn't trying to be the smartest bot. It's trying to be the bot that's still alive in 12 months. That's a different optimization target, and it shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture, Top-Down
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXCHANGE → DATA GATEWAY → FEATURE LAYER → FEATURE VALIDATOR
                                                    │
                                                    ▼
EXECUTION ← RISK LAYER ← STRATEGY LAYER ← ML LAYER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight stages, every one of them able to halt the pipeline on its own. That's the first lesson: &lt;strong&gt;every layer is a potential circuit breaker.&lt;/strong&gt; If features fail validation (PSI drift, KS test, CUSUM), no signal reaches the model. If the risk layer flags exposure, no order reaches the exchange. Fail-closed by default.&lt;/p&gt;
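
&lt;p&gt;The write-up doesn't show the stage interface, so here's a sketch, under my own naming, of what the fail-closed contract implies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a fail-closed pipeline tick; AlphaStrike's actual interfaces
# aren't public, so everything here is illustrative.
import logging

log = logging.getLogger(__name__)

class HaltPipeline(Exception):
    """Any stage raises this to stop the tick before capital is touched."""

def run_tick(stages, payload):
    for stage in stages:
        try:
            payload = stage(payload)   # each stage validates, then transforms
        except HaltPipeline as halt:
            log.warning("tick halted at %s: %s", stage.__name__, halt)
            return None                # fail closed: no order goes out this tick
    return payload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;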

&lt;p&gt;Let me walk through the four pieces I actually want to talk about.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Exchange Abstraction Done Right
&lt;/h2&gt;

&lt;p&gt;This is where most trading bots rot. AlphaStrike defines two &lt;code&gt;Protocol&lt;/code&gt; classes — &lt;code&gt;ExchangeRESTProtocol&lt;/code&gt; and &lt;code&gt;ExchangeWebSocketProtocol&lt;/code&gt; — and every adapter (WEEX, Hyperliquid, Binance, generic OpenAPI) implements them. The trading logic only talks to the unified protocol.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@runtime_checkable&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExchangeRESTProtocol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UnifiedTicker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;place_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UnifiedOrder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UnifiedOrderResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_positions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;UnifiedPosition&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_leverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;leverage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The unified data models (&lt;code&gt;UnifiedOrder&lt;/code&gt;, &lt;code&gt;UnifiedPosition&lt;/code&gt;, &lt;code&gt;UnifiedCandle&lt;/code&gt;) are the contract. Every adapter has a &lt;code&gt;mappers.py&lt;/code&gt; that translates between exchange-native shapes and the unified shapes. Symbol normalization happens at the adapter boundary — internally everything is &lt;code&gt;BTCUSDT&lt;/code&gt;, externally it becomes &lt;code&gt;cmt_btcusdt&lt;/code&gt; or whatever WEEX wants this week.&lt;/p&gt;
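
&lt;p&gt;A sketch of what that boundary translation looks like, using the WEEX-style prefix from the example (the real &lt;code&gt;mappers.py&lt;/code&gt; isn't shown in the write-up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative mapper pair; the "cmt_" prefix is the example from above,
# not a documented constant.
def to_exchange_symbol(unified: str) -&amp;gt; str:
    return f"cmt_{unified.lower()}"             # "BTCUSDT" -&amp;gt; "cmt_btcusdt"

def to_unified_symbol(native: str) -&amp;gt; str:
    return native.removeprefix("cmt_").upper()  # "cmt_btcusdt" -&amp;gt; "BTCUSDT"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;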

&lt;p&gt;&lt;strong&gt;Why I care:&lt;/strong&gt; I've shipped trading code where exchange-specific assumptions leaked into the strategy. It's death by a thousand &lt;code&gt;if exchange == "binance"&lt;/code&gt; cuts. The Protocol-based approach keeps the boundary honest. You add a new exchange by writing one adapter file, not by hunting through the codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The ML Layer That Doesn't Trust Itself
&lt;/h2&gt;

&lt;p&gt;The signal pipeline runs &lt;strong&gt;12 categories&lt;/strong&gt; of weak signals — order flow, microstructure, volatility, correlation, sentiment, seasonality, statistical, price action, volume, derivatives, alternative, macro — and combines them through a regime-aware ensemble. This is the explicitly Renaissance/Medallion-inspired bit, and the backtest deltas are real:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Single Signal&lt;/th&gt;
&lt;th&gt;12-Category Ensemble&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sharpe&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;2.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Win Rate&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max Drawdown&lt;/td&gt;
&lt;td&gt;-15%&lt;/td&gt;
&lt;td&gt;-8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the part I find genuinely novel is the &lt;strong&gt;signal decay tracker&lt;/strong&gt;. Every signal logs its predictions, the system records outcomes, and signals get auto-retired when their rolling accuracy drops below 48%. Weight is &lt;code&gt;(edge × 2)²&lt;/code&gt;, so signals with real edge get amplified and weak signals fade out without anyone touching code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;            &lt;span class="c1"&gt;# 0.52 accuracy → 0.02 edge
&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;         &lt;span class="c1"&gt;# quadratic weighting of strong signals
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.48&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
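
&lt;p&gt;Run the numbers and the quadratic bite is obvious: a signal at 58% accuracy gets weight (0.08 × 2)² ≈ 0.026, while one at 52% gets (0.02 × 2)² = 0.0016, a 16× spread from a six-point accuracy gap.&lt;/p&gt;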



&lt;p&gt;This is the right way to do it. Most "ensemble" systems use static weights tuned once and forgotten. Here the weights are alive — they update with reality. Models that lose their edge get fired by the system itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Dynamic Leverage as a First-Class Citizen
&lt;/h2&gt;

&lt;p&gt;Static leverage is the crypto equivalent of running with scissors while drunk. AlphaStrike treats leverage as a continuous control variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;leverage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;vol_factor&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;dd_factor&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;perf_factor&lt;/span&gt;

&lt;span class="n"&gt;vol_factor&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normal_vol&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;current_vol&lt;/span&gt;     &lt;span class="c1"&gt;# clamped 0.3 to 1.5
&lt;/span&gt;&lt;span class="n"&gt;dd_factor&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;        &lt;span class="c1"&gt;# tiered by drawdown
&lt;/span&gt;&lt;span class="n"&gt;perf_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;half_kelly_fraction&lt;/span&gt;          &lt;span class="c1"&gt;# 0.6 to 1.2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real scenarios from the doc:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Conditions&lt;/th&gt;
&lt;th&gt;Leverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High vol (5%)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In 12% drawdown&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.5x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strong perf + low vol&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All bad (high vol + DD + losing)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
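
&lt;p&gt;To check my reading of the formula against those rows, here's a runnable version. The clamps come from the comments above; the drawdown tiers and the 1.0x floor are my inference from the scenario table, not documented constants:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def compute_leverage(base: float, normal_vol: float, current_vol: float,
                     drawdown: float, half_kelly: float) -&amp;gt; float:
    # Inferred sketch, not AlphaStrike source.
    vol_factor = min(max(normal_vol / current_vol, 0.3), 1.5)
    if drawdown &amp;lt; 0.05:
        dd_factor = 1.0
    elif drawdown &amp;lt; 0.10:
        dd_factor = 0.7
    elif drawdown &amp;lt; 0.15:
        dd_factor = 0.5   # reproduces the 12%-drawdown row: 5.0 * 0.5 = 2.5x
    else:
        dd_factor = 0.3
    perf_factor = min(max(half_kelly, 0.6), 1.2)
    return max(1.0, base * vol_factor * dd_factor * perf_factor)  # never below 1x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;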

&lt;p&gt;The leverage state lives in &lt;code&gt;data/state/leverage_state.json&lt;/code&gt; so it survives restarts. When the system reduces from 5x to 2x because volatility spiked, the next process boot doesn't forget. That detail matters more than it sounds — most bots reset to defaults on restart and quietly take on more risk than the operator thinks.&lt;/p&gt;
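
&lt;p&gt;The persistence itself is trivial, which is exactly why most bots skip it. A sketch against the path named above (the field names are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import pathlib

STATE = pathlib.Path("data/state/leverage_state.json")

def save_leverage(leverage: float) -&amp;gt; None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"leverage": leverage}))

def load_leverage(default: float = 5.0) -&amp;gt; float:
    # Fall back to the configured default only when no state was ever
    # written, not silently after a restart mid-drawdown.
    if STATE.exists():
        return json.loads(STATE.read_text())["leverage"]
    return default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;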

&lt;h2&gt;
  
  
  4. The LLM Layer That Knows Its Place
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprised me. AlphaStrike has an LLM decision layer — a local Ollama-served &lt;code&gt;qwen2.5:1.5b&lt;/code&gt; — but its design philosophy is the opposite of what's currently fashionable. The LLM does not generate signals. It does not pick trades. It does not "reason about the market."&lt;/p&gt;

&lt;p&gt;It only intervenes &lt;strong&gt;when performance degrades.&lt;/strong&gt; When the rolling win rate drops below 40%, drawdown crosses 15%, or you stack 5 consecutive losses, the system hands the LLM a structured performance report and a tightly scoped tool palette:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;adjust_conviction(symbol, threshold, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;adjust_position_size(symbol, multiplier, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;adjust_leverage(new_leverage, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;disable_shorts(symbol, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;disable_asset(symbol, duration_hours, reason)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;no_action(reason)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
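
&lt;p&gt;The trigger side of that contract is mechanical. The thresholds below are the ones stated above; only the function name is mine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def should_invoke_llm(win_rate: float, drawdown: float, loss_streak: int) -&amp;gt; bool:
    # Degradation triggers: rolling win rate below 40%, drawdown past 15%,
    # or five consecutive losses.
    return win_rate &amp;lt; 0.40 or drawdown &amp;gt; 0.15 or loss_streak &amp;gt;= 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;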

&lt;p&gt;Example LLM response when SOL is sitting at a 25% win rate, a 22% drawdown, and a 7-loss streak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adjust_position_size"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"multiplier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adjust_conviction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"new_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"disable_shorts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"symbol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SOL"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"send_alert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the right shape for LLMs in financial systems: &lt;strong&gt;bounded actions, explicit triggers, no inference loops touching live capital.&lt;/strong&gt; The model doesn't have to be smart; it has to be defensive. A 1.5B-parameter local model is more than enough when the action space is six tools wide.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Took Away
&lt;/h2&gt;

&lt;p&gt;Three things I'm stealing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Protocol-based exchange abstraction.&lt;/strong&gt; No more &lt;code&gt;if exchange ==&lt;/code&gt; chains. Define the contract once, swap implementations behind it. This generalizes way past trading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-retiring signals with quadratic edge weighting.&lt;/strong&gt; Static feature weights are tech debt the moment you ship them. Make signal decay a first-class concept and let the data prune your own model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-circuit-breaker, not LLM-as-strategist.&lt;/strong&gt; The hype-cycle take is "use the LLM to pick trades." The mature take is "use the LLM to recognize when your quant system is dying and apply targeted, reversible, well-typed interventions." The hype-cycle take blows up your account. The mature take saves it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What I'd build next: an offline evaluation harness for the LLM's tool-call decisions. Right now the LLM's interventions only get evaluated by their downstream P&amp;amp;L impact, which is noisy and slow. A counterfactual replay framework — "what would have happened if the LLM had done nothing, or chosen a different tool?" — would let you tune the trigger thresholds and the prompt without burning real capital. That's where I'd put the next two weeks of engineering time.&lt;/p&gt;

&lt;p&gt;Trading bots are not magic. They're software systems that have to survive volatility, exchange flakiness, model decay, and operator panic. The systems that survive are the ones that take all four threats seriously at the architecture level — not the ones with the prettiest backtest curve.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>cryptocurrency</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
