Vitalii Cherepanov

What 16 Parallel Claude Agents Built Around Themselves: Deconstructing Anthropic's C Compiler Experiment

On February 5, 2026, Nicholas Carlini from Anthropic published a piece about an experiment that runs significantly ahead of what most of us are doing with LLM agents today. Sixteen parallel instances of Claude Opus 4.6, two weeks of work, ~2,000 Claude Code sessions, a budget around $20,000. The output: 100,000 lines of a C compiler in Rust that builds Linux 6.9 on x86, ARM, and RISC-V; passes 99% of GCC's torture test suite; compiles PostgreSQL, SQLite, FFmpeg, Redis, and QEMU; and runs Doom. The repository is open and anyone can read it and try it themselves.

It's serious engineering work, and the article itself is a great read for anyone thinking about autonomous agents in production. Carlini is honest about what worked and what didn't, walks through five concrete lessons from designing the harness, and shares numbers and metrics. This is exactly the kind of writeup the industry needs more of — a first-hand account of what long autonomous runs actually look like.

Headlines split into two camps. "AI replaced programmers" on one side. "It's just a demo" on the other. Both miss what's actually interesting.

If you read the article carefully, what Carlini is documenting is not "AI writes a compiler." He's documenting how much infrastructure had to be built around the agents, because in 2026 there is still no standard infrastructure layer between the agents themselves. Lockfiles in a shared directory as a sync mechanism. READMEs that the agent writes to itself. GCC pressed into service as a known-good reference oracle. A Ralph-loop wrapped around Docker for indefinite autonomy. Each of these answers a concrete problem that today has no standard layer to absorb it.

And that's the article's real value. Not as an "AI demo," but as a detailed map of missing primitives, drawn by someone who built workarounds for them by hand. I've been working on these primitives for the past few months, and Carlini's writeup is a great excuse to talk through what the next generation of agent teams actually needs.

Every session starts with amnesia

Carlini built a harness that runs Claude in an infinite loop — when the agent finishes one task, it picks up the next. Architecturally this is the familiar "Ralph-loop" pattern: a while true cycle in a bash script, wrapped in Docker for safety. In one of the runs, Claude accidentally killed itself with pkill -9 bash, which Carlini notes as an amusing side effect.
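
For readers who haven't met the pattern, here is a minimal sketch of such a loop, in Go rather than bash. The image name, mount path, and prompt are illustrative placeholders, not what the experiment actually used.

```go
// A minimal sketch of the "Ralph-loop" pattern: relaunch a fresh,
// disposable container forever. Illustrative only, not Carlini's harness.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	for i := 0; ; i++ {
		log.Printf("session %d: starting fresh container", i)
		cmd := exec.Command("docker", "run", "--rm",
			"-v", "/srv/compiler-repo:/work", // shared bare repo on the host (placeholder path)
			"agent-image:latest",             // hypothetical image with the CLI agent inside
			"claude", "-p", "pick up the next task from current_tasks/ and work on it",
		)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			// A crashed or self-killed session (pkill -9 included) just means
			// the next iteration starts over with empty context.
			log.Printf("session %d exited with error: %v", i, err)
		}
	}
}
```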

The crucial detail is that each of those ~2,000 launches started in a fresh Docker container with empty context. No memory between sessions. Every agent figured out from scratch: what is this repo, what's already done, what's the status of tasks, what's been tried and failed.

Carlini's workaround was to instruct Claude itself to maintain extensive READMEs and progress files, updated frequently. When the agent gets stuck on a bug, it also keeps a running doc of failed approaches and remaining tasks.

This works within the bounds of current tooling — and that's its value. But if you look at scaling, two architectural points start to creak.

First, a text file isn't structured. If you want to ask "what were the three most recent bugs I fixed in the parser area and how did they end?", you only have grep and regular expressions. On a small project that's tolerable. On 100,000 lines of code and 2,000 sessions, it becomes a bottleneck.

Second, more subtle: each agent maintains these files for itself. They live in the shared git repository, but there's no mechanism that says "before you take task X, look at what the other 16 agents wrote about this area in the past 6 hours." Each agent writes its own README, merges others' edits, and hopes things converge.

This is the first generation of shared memory — implemented as plain text because no more convenient primitive has become standard yet.

Lockfiles as coordination

Parallelism is implemented minimally. Each agent runs in its own Docker container; a shared bare git repo holds state. Task coordination happens through lockfiles: an agent writes a file like current_tasks/parse_if_statement.txt, does a git push, and thereby "claims" the task. If two agents try to take the same one, git synchronization forces the second to pick something else. Done — delete the lockfile.
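
As I read the protocol from the article, the claim step looks roughly like the sketch below. Paths and the agent name are illustrative, and backing off after a rejected push is one possible way to handle the race, not necessarily how the harness does it.

```go
// A sketch of the lockfile-as-mutex protocol: create the lockfile,
// commit, push; if the push is rejected, someone else got there first.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func git(args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// claimTask returns true if this agent now owns the task.
func claimTask(task string) (bool, error) {
	lock := filepath.Join("current_tasks", task+".txt")
	if _, err := os.Stat(lock); err == nil {
		return false, nil // already claimed by someone else
	}
	if err := os.WriteFile(lock, []byte("agent-5\n"), 0o644); err != nil {
		return false, err
	}
	if err := git("add", lock); err != nil {
		return false, err
	}
	if err := git("commit", "-m", "claim "+task); err != nil {
		return false, err
	}
	// The push is the actual mutex: if another agent pushed the same
	// lockfile first, this push is rejected and we back off.
	if err := git("push"); err != nil {
		_ = git("reset", "--hard", "HEAD~1") // drop our claim
		_ = git("pull", "--rebase")
		return false, nil
	}
	return true, nil
}

func main() {
	claimed, err := claimTask("parse_if_statement")
	fmt.Println(claimed, err)
}
```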

Carlini is plain about the current state of the system: "no other method for communication between agents... I don't use an orchestration agent." No mechanism for agents to "ask each other." No central coordination. Each Claude decides for itself what to do next — usually "the next obvious problem."

The lockfile here does exactly one thing — it works as a mutex, protecting against parallel claims on a single task. That's valuable. But it doesn't solve the other problem: two agents working on different tasks in the same code area can write conflicting code under different task names. That's exactly what happened to the Linux kernel in the experiment — agents converged on the same bug, fixed it differently, overwrote each other's edits, and parallelism temporarily stopped paying off.

Carlini's solution was a separate test harness using GCC as a known-good compiler oracle: most of the kernel gets compiled with GCC, and a random subset of files goes through Claude's compiler. If the kernel doesn't boot, the bug is somewhere in Claude's subset, and you can keep narrowing it down. It's a clever and elegant idea, and it worked exactly as intended.
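
The narrowing step can be pictured as a plain bisection over the suspect file set. The sketch below is my illustration of the idea, not the harness's actual code: buildAndBoot stands in for whatever the real build-and-boot check does, and it assumes a single offending file.

```go
// bisectSuspects sketches the GCC-oracle narrowing: everything outside
// `suspects` is compiled with GCC (the known-good oracle), the suspect
// subset goes through the new compiler, and a boot failure means the
// bug is inside the subset. Assumes one offending file for simplicity.
package oracle

func bisectSuspects(suspects []string, buildAndBoot func(withNewCompiler []string) bool) []string {
	for len(suspects) > 1 {
		half := suspects[:len(suspects)/2]
		if !buildAndBoot(half) {
			// Failure reproduces with only the first half going through
			// the new compiler: the miscompile lives there.
			suspects = half
		} else {
			suspects = suspects[len(suspects)/2:]
		}
	}
	return suspects // the remaining file most likely contains the miscompile
}
```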

It's worth noting the bounds in which this works. The GCC oracle is a precise solution for this specific task, because the task has three convenient properties: there exists a ready-made reference compiler for the same spec, the task decomposes at the level of individual files, and the outcome is binary (boots or doesn't).

In most real projects — product development, legacy refactoring, ML pipelines, mobile applications — these conveniences don't exist. There's no ready known-good for comparison. There's no natural file-level decomposition. Outcomes aren't binary. Which means the GCC-oracle technique can't be generalized as a primitive — it works where it works, and doesn't exist where it doesn't.

Taken as a whole, Carlini's toolkit lays out neatly along two axes:

| What agent teams need | What's in the experiment | Nature of the solution |
| --- | --- | --- |
| Agent discovery | hardcoded number of containers | hardcoded |
| Inter-agent communication | lockfile via git push | mutex without messaging |
| Task delegation | next-most-obvious from queue | no routing |
| Shared state / memory | README + progress files | plain text |
| Causal history | running doc of failed approaches | personal log |
| Verification | GCC oracle | task-specific |

These are two independent axes of the problem: communication (how agents talk to each other) and memory (what they remember between sessions and whether they share it). These axes require different primitives and different solutions. And on each, the industry is currently converging on standards and open-source implementations.

What follows is what's available on each axis today.

The communication axis: A2A protocol and a2abridge

Communication has moved fast and has already arrived at a mature standard. In April 2025, Google open-sourced the A2A (Agent-to-Agent) protocol. In August 2025, IBM's ACP merged into A2A under the Linux Foundation, and by April 2026 the spec is at version 1.2, supported by 150+ organizations (Microsoft, AWS, Salesforce, SAP, ServiceNow, and IBM among them) and natively built into Google ADK, LangGraph, CrewAI, LlamaIndex Agents, Semantic Kernel, and AutoGen. A2A has effectively won the protocol war.

The spec is deliberately minimal. An Agent Card is a JSON description of an agent's capabilities (what it does, which endpoint to hit). A Task is a unit of work with statuses and artifacts. Transport is JSON-RPC 2.0 over HTTPS, with Server-Sent Events for streaming.
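
To make that minimalism concrete, here is a toy Agent Card served from Go. The field names follow the general shape of the spec but are illustrative, and the well-known path copies the one a2abridge is described as using later in this article; the A2A spec itself is the authority on the exact schema and canonical location.

```go
// A toy A2A-style Agent Card served over HTTP. Field names, values, and
// the well-known path are illustrative; consult the A2A spec for the
// authoritative schema.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type AgentCard struct {
	Name        string   `json:"name"`
	Description string   `json:"description"`
	URL         string   `json:"url"` // where to send JSON-RPC requests
	Version     string   `json:"version"`
	Skills      []string `json:"skills"` // what this agent claims it can do
}

func main() {
	card := AgentCard{
		Name:        "parser-agent",
		Description: "works on the parser portion of the compiler",
		URL:         "http://127.0.0.1:8431/",
		Version:     "1.0",
		Skills:      []string{"parser", "kernel-debug"},
	}
	http.HandleFunc("/.well-known/a2a", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(card)
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:8431", nil))
}
```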

The analogy that gets used everywhere is HTTP. HTTP doesn't tell you what's in your backend (Rails, Django, Go) — it just defines the shape of requests and responses. A2A doesn't tell you what LLM, framework, or database you use — it defines the contract between agent A and agent B. A minimum on top of which you can build the rest.

If you rewrote Carlini's scenario on A2A, instead of a lockfile in current_tasks/, an agent would query a directory service for "who's working on the parser right now?", get the neighbor's Agent Card, and send a Task with a streaming response over SSE. That's the communication primitive his harness doesn't yet have.

I've been writing a2abridge for the past several months — an open Go implementation of A2A 1.0 targeted at the practical scenario of "several different AI agents on one developer's machine." At the time of publication, six IDEs are supported: Claude Code, Codex CLI, Cursor, Cline, Continue, and Gemini CLI. Any A2A-compliant agent (including future Google ADK, LangGraph, CrewAI implementations) is a first-class peer with no glue code.

Architecturally it's a single Go binary (~10 MB) with several subcommands. a2abridge directory is a discovery service on 127.0.0.1:7777 that runs as a user-level system service (launchd on macOS, systemd-user on Linux, Windows Service on Windows, works correctly inside WSL2). a2abridge bridge is a per-agent process that hosts both an MCP stdio server (through which the IDE sees a2abridge as a regular MCP server with tools) and an A2A HTTP server on a random port, with an Agent Card at /.well-known/a2a and the full set of JSON-RPC 2.0 methods from §7 of the spec: SendMessage, SendStreamingMessage, GetTask, ListTasks, CancelTask, SubscribeToTask, GetExtendedAgentCard. The bridge's lifecycle equals the IDE session's lifetime — when MCP stdio closes, the bridge dies, no orphan processes.

What Claude Code (or any other IDE) sees as MCP tools: a2a_whoami, a2a_list_agents, a2a_send_message, a2a_send_streaming, a2a_get_task, a2a_cancel_task, a2a_inbox, a2a_complete_task. Inside the session, the agent can independently discover other agents on the machine, send them tasks, wait for replies, and read its inbox — without user involvement.

On top of the protocol there's a proactive layer that isn't in the spec but is needed for real use. The bridge writes an inbox file at ./.a2a/inbox-<ppid>.json every time the message queue changes. A UserPromptSubmit hook injects incoming messages into the system prompt before the first tool call — meaning Claude sees "you have a message from a peer with FYI about a breaking API change" before it starts acting blindly. The SSE fast-path delivers replies in milliseconds, with a 5-second polling fallback. For Claude Code there's also a skill called a2a-bridge that auto-loads only when triggered by relevant prompts — no globally loaded rules burning tokens on every session.

In Carlini's scenario this would look like: agent 5 takes the task "fix kernel build error in mm/page_alloc.c." Before acting, it calls a2a_list_agents, sees that agent 2 has an open Task with capability kernel-debug in the same area. It sends a2a_send_message: "what are you working on, do you have a hypothesis?". It gets a streaming response: "tried alignment fix, failed on test_kernel_boot, currently looking at reorder header includes." It picks a different angle.

Why an open protocol and not yet another custom wire format? Several solutions already exist in this niche: Anthropic Agent Teams works only Claude↔Claude and is tied to a subscription. CCB and claude-multi-agent-bridge are closed formats locked to specific agent combinations. Ruflo is excellent for enterprise federations of 100+ agents with central queens, but that's a different class of problem. The niche a2abridge targets is a cross-vendor, open-protocol mesh, where today Claude and Codex drop in and tomorrow any A2A-compliant agent does, with no glue rewriting. If the industry is moving toward a standard, the bridge had better speak that standard.

Production maturity: cross-machine federation with mTLS + ed25519 (opt-in, for the "home Mac ↔ office Linux" scenario), mDNS auto-discovery on the local network, a PII/secret screen running 11 regex detectors before sending (AWS keys, GitHub tokens, Anthropic/OpenAI/Google/Stripe/Slack tokens, JWTs, PEM blocks — replaced with [REDACTED:<name>], secret never leaves the bridge), Push Notifications per A2A 1.0 §9.5, HTTP+REST binding per §7.3, 35 test cases under -race, a GitHub Actions release matrix, and a cross-platform a2abridge doctor with a 9-check health audit. Install is a one-liner via install.sh or install.ps1, with auto-detection of every IDE on the machine and .bak backups of their configs before edits.

Repository: github.com/vbcherepanov/a2abridge — MIT, Go 1.25, current release v2.0.

The memory axis: total-agent-memory and BrainCore

Memory is in a different state. There's no A2A-level standard yet — everyone builds their own layer, and different approaches get picked for different tasks. What an agent writes to itself in a README is essentially a causal log in textual form: "tried A, failed at B, moved to C." The structure is right; the implementation is still plain text.

I'm working on two products on this axis.

total-agent-memory — open-source implementation. The core retrieval patterns, MCP integration, and the basic causal-chain model live here. Anyone can clone it, see how it works, and plug it into their Claude Code or Cursor.

BrainCore — production-grade. A Go binary, local SQLite + WAL, tree-sitter for code-graph across 14 languages (PHP, TypeScript, Python, Ruby, Rust, Java, Kotlin, C/C++, C#, Swift, Bash, Lua, YAML, plus Go through its own native AST), internal git for time-travel memory, MCP protocol for connecting to Claude Code, Cursor, Codex CLI, Windsurf, and several other agents. Currently in beta.

Architecturally there are three points where both projects diverge from a flat bag-of-facts with cosine search.

First, causal decision chains instead of flat facts. Not "function X is in file Y," but "agent 3, in the task 'fix kernel build', formulated the hypothesis 'alignment issue', verified it through test_kernel_boot, failed, and moved on to the hypothesis 'header reorder'." Each step is typed, connected by a causal arrow, and queryable by every agent (a code sketch of this point and the next one follows below).

Second, AST-stable code identity. When several agents refactor in parallel, text diffs quickly turn into mush, and merge conflicts become endless. An AST node remains a node even if a function moved from parser.rs to frontend/lexer.rs and got renamed from parse_decl to parse_declaration. In the graph, it's the same node with a movement history. Every agent looks at the same abstraction, not at "lines 127-145 of file X."

Third, persistence across container restart. Memory lives outside the Docker container: on the host through a volume, or remotely via MCP. The query brain.causal_lookup(area="parser", lookback="6h") returns the same result regardless of which fresh container you're in.
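
To make the first two points concrete, here is an illustrative Go shape for a causal-chain record and an AST-stable code node. These types are mine, written for explanation only; they are not the actual schema of total-agent-memory or BrainCore.

```go
// Illustrative types only: not the actual schema of total-agent-memory
// or BrainCore.
package memory

import "time"

type StepKind string

const (
	Hypothesis StepKind = "hypothesis"
	Experiment StepKind = "experiment"
	Outcome    StepKind = "outcome"
)

// CausalStep is one link in a causal decision chain: who did what,
// within which task, and which earlier steps it follows from.
type CausalStep struct {
	ID       string
	Agent    string // e.g. "agent_3"
	Task     string // e.g. "fix kernel build"
	Kind     StepKind
	Summary  string   // e.g. "alignment issue" or "failed test_kernel_boot"
	CausedBy []string // IDs of the steps this one follows from
	At       time.Time
}

// Move records one rename or relocation of a code entity.
type Move struct {
	FromPath, ToPath string
	FromName, ToName string
	Commit           string
}

// CodeNode keeps a stable identity while the underlying function moves
// and gets renamed; agents refer to StableID, not to line ranges.
type CodeNode struct {
	StableID string
	Path     string // currently frontend/lexer.rs
	Name     string // currently parse_declaration
	History  []Move // e.g. parser.rs:parse_decl -> frontend/lexer.rs:parse_declaration
}
```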

Rewriting Carlini's scenario with memory: agent 5 goes to BrainCore, gets the causal log "agent_2 tried alignment fix → failed, agent_7 tried header reorder → failed at L98, current hypothesis from agent_3 is alignment issue, in progress," picks a fourth hypothesis, writes it to the causal chain. Agents 2, 3, and 7 see this decision on their next pull. No READMEs, no greps.

How they fit together

a2abridge and BrainCore are different layers, not competitors. One answers "how do agents talk to each other," the other answers "what do they remember."

The full picture for an agent team looks like this. BrainCore holds the shared state of the world: code-graph, causal chains, hypotheses, conclusions. a2abridge provides actual communication between agents: discovery, delegation, streaming responses, an inbox with context injection. When they work together, agent 5 sees a message in its inbox from agent 2 ("I'm working on X"), queries BrainCore for details ("what specifically has been tried in this area"), makes an informed decision, replies to agent 2 about its intention to take an adjacent task, and writes the result to shared memory.
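
Sketched as code, under heavy assumptions: every name and signature below is hypothetical, standing in for "whatever the messaging layer exposes" and "whatever the memory layer exposes," not the real APIs of a2abridge or BrainCore.

```go
// A purely illustrative composition of the two layers: a messaging
// interface (the a2abridge role) and a memory interface (the BrainCore
// role). All names and signatures are hypothetical.
package team

type Message struct{ From, Body string }

type Messenger interface {
	Inbox() []Message
	Send(to, body string) error
}

type Memory interface {
	CausalLookup(area, lookback string) []string // prior attempts in the area
	Record(entry string) error
}

// nextMove sketches the flow described above: read peers' messages,
// consult shared memory, pick an untried angle, announce it, and record
// the decision for everyone else.
func nextMove(m Messenger, mem Memory) error {
	for _, msg := range m.Inbox() {
		_ = msg // e.g. "agent_2: working on alignment fix in mm/page_alloc.c"
	}
	tried := mem.CausalLookup("parser", "6h")
	decision := pickUntriedHypothesis(tried)
	if err := m.Send("agent_2", "taking an adjacent task: "+decision); err != nil {
		return err
	}
	return mem.Record("agent_5: hypothesis " + decision)
}

func pickUntriedHypothesis(tried []string) string {
	_ = tried // placeholder for real decision logic
	return "a hypothesis nobody has tried yet"
}
```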

That's the architecture Carlini builds by hand in the experiment out of lockfiles, READMEs, and the GCC oracle. With standalone primitives instead of hand-rolled glue, the same infrastructure carries over to tasks where there is no ready-made known-good compiler.

What these primitives don't solve

Carlini is absolutely right about the article's main lesson: a high-quality test harness is the foundation of everything. No amount of shared memory and no A2A will save you if the task verifier is imprecise — agents will autonomously solve the wrong task. CI pipelines, well-designed logs, defenses against context window pollution, fighting "time blindness" — these work at any infrastructure level and remain the first priority.

The GCC oracle in the compiler task is genuinely the optimal choice. Binary verification is almost always better than comparing causal hypotheses. If you have a ready known-good in your project — use it. No memory replaces a good verifier.

But in most real tasks — product development, refactoring, ML pipelines, business logic — there's no GCC equivalent. And there, the primitives of communication and memory become not an "improvement" but a necessary condition for a team of 16 agents to be more productive than one.

That Carlini had to build this entire text-and-file layer in 2026 isn't a flaw in his approach but a symptom of the moment: infrastructure for agent teams is still forming. The Anthropic experiment is the best possible illustration of how it's forming and where it's headed. And that, in my view, is the real value of Carlini's article: an honest report from the earliest point on the curve along which this infrastructure will grow.


Open source and links

If you're building your own agent teams and running into these problems, get in touch. Swapping experience is always valuable, and feedback on early versions of a product is the best thing that can happen to its authors.
