Automated draft from LLL
Artificial Intelligence Rewrites a Major Program, Shifting Its Predicted Development Timeline
MirrorCode Reveals A.I.'s Capacity for Complex Coding Projects
METR and Epoch AI, prominent auditors of AI, recently released MirrorCode, a benchmark designed to challenge AI agents to rewrite complex command-line programs from scratch. This challenge uniquely restricts agents to using only the original binary's execution and test cases, without access to its source code. Claude Opus 4.6 successfully rewrote gotree, a bioinformatics toolkit comprising sixteen thousand lines of Go and over forty commands. Researchers estimate a human engineer would require two to seventeen weeks to complete this task. The benchmark also suggests that performance scales with compute power, indicating that larger programs may yield to AI as budgets increase.
Jack Clark of Import AI vividly described the result: "Imagine giving a skilled programmer a complex program's command-line interface and asking them to write the underlying code without seeing it. Only a fraction could do it. That AI can do this autonomously proves a long-term coding ability most benchmarks miss." Several caveats temper this achievement: the benchmark favors programs with standard outputs, simplifying specification generation; it allows for memorization of basic Unix utilities; and it covers only a fraction of real-world software. Still, the overall trend is undeniable.
In the same week, Ryan Greenblatt, an AI researcher and respected forecaster, doubled his estimate for full AI research and development automation by 2028, from fifteen to thirty percent. As Import AI reported, his reasoning cited Opus 4.5 and Codex 5.2, which "significantly exceeded my expectations," and Opus 4.6, which "again surpassed them." He now believes AI systems can reliably handle tasks ranging from a month to several years in duration, provided the task has a verifiable evaluation loop. Greenblatt joins Ajeya Cotra (whose updated timeline this digest noted on April 12) and Daniel Kokotajlo of AI 2027 in advancing their predictions by about eighteen months. This significant shift stems from the unexpectedly rapid progress of coding agents.
An editorial in Import AI noted: "Almost everyone in AI research routinely underestimates AI progress, including me. Maybe the only person who doesn't is my colleague Dario Amodei." Clark finds it puzzling that after five years of benefiting from scaling laws, most researchers remain conservative. Perhaps, he suggests, we should assume we all continue to underestimate.
Anthropic Publishes the Agent Playbook as the Community Dissects Its Leaked Code
Harness engineering, a topic this digest has followed since the Claude Code source leak on April 9, advanced on two fronts this past weekend. Anthropic published "Building effective agents," a guide arguing, counterintuitively, that effective agents rely on simple, composable patterns, not complex frameworks. It distinguishes between workflows (LLMs and tools following predefined code paths) and agents (dynamic, LLM-directed systems), describes five workflow patterns, and emphasizes simplicity, transparency, and detailed tool documentation.
Meanwhile, AlphaSignal's Sunday deep dive dissected the leaked five-hundred-twelve-thousand-line Claude Codebase—revealing KAIROS (Dream Mode), the self-healing query loop, and KV cache stabilization by alphabetically sorting tools (covered April 13). The analysis shows the codebase proves "the era of harness engineering is here": the LLM is just the processor, and software engineers still build the operating system. The AI Daily Brief podcast synthesized this into a three-layer harness architecture: information (memory, context, skills), execution (orchestration, coordination, guardrails), and feedback (evaluation, verification, observability).
There's a productive tension here. Anthropic's official blog advocates for simplicity, yet Anthropic's leaked production code reveals great complexity. The resolution, as the podcast notes, is that Anthropic builds the inner harness so users can keep their outer harness simple. The discipline is permanent; the specific implementation is not. In a related move, Anthropic launched Ultraplan, moving implementation planning from the terminal to the browser. Users draft a plan in the command-line interface, review and refine it in a browser with inline comments, and then execute it locally or in the cloud. It's a minor feature with a major implication: the planning and execution layers of agent work are deliberately separated.
Google DeepMind Maps Six Attack Surfaces for AI Agents
As agents proliferate, so do the ways to break them. A new paper from Google DeepMind, covered in Import AI, identifies six attack vectors against AI agents: content injection (embedding commands in CSS, HTML, or media), semantic manipulation (steering behavior via sentiment or identity claims), cognitive state attacks (planting fabricated facts in retrieval corpora or memory), behavioral control (embedding adversarial prompts in external resources), systemic attacks (flooding agents with side quests, forcing collusion through correlation, or running jigsaw attacks that distribute harm across independent agents), and human-in-the-loop attacks (exploiting human overseers' cognitive biases).
Clark's analogy is fitting: AI agents are like toddlers—powerful intelligences that are gullible, sometimes follow dangerous instructions, and lack self-preservation instincts. Securing agents demands changes at every level: pre-training robustness, runtime content scanners, output behavior monitors, ecosystem-level verification protocols, and legal frameworks prosecuting websites that target agents. AI safety, Clark argues, is "about to be ecosystem safety."
This connects to the practitioner experience detailed in three Claw Mart Daily issues this past weekend. A memory poisoning piece details a real case: an AI assistant started rejecting sound pull requests, insisting on nonexistent partnerships, and misidentifying its own product—all because casual interactions over three months corrupted its memory. This wasn't prompt injection, but accidental poisoning: jokes became facts, hypotheticals became strategies, and frustrated comments morphed into core beliefs. The author reports sixty-four to seventy-four percent success rates for deliberate memory-poisoning attacks on OpenClaw agents, but argues accidental poisoning is more common. Companion pieces on memory decay scoring and heartbeat patterns address the problem from an engineering perspective: agents need structured forgetting and event-driven awareness, not just better retrieval.
The Anti-Scaffold Argument and Four arXiv Papers That Complicate It
Noam Brown of OpenAI recently argued that reasoning models render complex scaffolding unnecessary, claiming simple prompting without agentic infrastructure outperforms it. A podcast offered counterevidence: Blitzy scored 66.5 percent on SWE-bench Pro, against GPT-4o's 57.7 percent. Knowledge graphs gave Blitzy deep code context, an advantage single-pass models lack.
This week, arXiv preprints painted a more nuanced picture.
Synthius-Mem: Brain-Inspired Structured Persona Memory
Synthius-Mem presents a brain-inspired structured persona memory system that achieved 94.4 percent accuracy on the LoCoMo benchmark (ACL 2024). It exceeded the previous state-of-the-art, MemMachine (91.69 percent), and human performance (87.9 F1). Its adversarial robustness—its ability to refuse questions about undisclosed facts—reached 99.55 percent, a metric no competing system reports. The architecture decomposes conversations into six cognitive domains—biography, experiences, preferences, social circle, work, and psychometrics. It retrieves structured facts with 21.79-millisecond latency, cutting token consumption by five times.
RoMem: Dynamic Temporal Memory
RoMem approaches the temporal dimension of agent memory differently. Most systems model time as discrete metadata; sorting by recency buries older, permanent knowledge, and overwriting erases evolving facts. RoMem introduces a "Semantic Speed Gate" that maps each relation's text embedding to a volatility score: "president of" rotates quickly in complex vector space, while "born in" remains stable. Instead of deletion, obsolete facts are geometrically shadowed. This achieved state-of-the-art results on ICEWS05-15 (72.6 MRR) and improved temporal reasoning benchmarks two to three times.
BEHEMOTH: Heterogeneous Memory Extraction Benchmark
BEHEMOTH, a new benchmark for heterogeneous memory extraction, confirms practitioners' suspicions: no single extraction prompt dominates across all task types. Their proposed CluE strategy—cluster-based self-evolution that groups training examples by extraction scenario—achieved a 9.04 percent relative gain across heterogeneous tasks.
TCER: Correcting Triviality Bias
And TCER, which addresses open-ended text generation, identifies "Triviality Bias" in confidence-based endogenous rewards: policies collapse toward high-probability outputs, reducing their diversity. Their correction mechanism rewards relative information gain between a specialist policy and a generalist reference, achieving consistent improvements without external supervision.
Taken together, the message is clear: the scaffold matters, but only if it encodes genuine structural intelligence about memory, time, and task heterogeneity. Brown is right that naive scaffolding can be worse than nothing. Researchers are discovering what non-naive scaffolding looks like.
Google Invests in the Human Side of the AI Transition
Shifting focus, Google announced its inaugural AI for the Economy Forum in Washington, D.C., unveiling concrete investments. These include ten million dollars with Johnson & Johnson for health-care worker AI literacy, training for forty thousand manufacturing employees, apprenticeships at one hundred companies, and educator training for six million K-12 teachers. These initiatives add to a billion dollars in prior education investments. The timing is notable: This weekend, Import AI also covered the Windfall Trust's Policy Atlas, a navigable interface of forty-eight distinct policy proposals for addressing economic disruption from transformative AI. These proposals fall into categories such as public investment, labor adaptation, wealth capture, regulation, and global coordination. Neither initiative offers novel policy; both are tools for building the institutional capacity for decisions that once felt distant but now seem imminent. David Krueger's "ten views of gradual disempowerment," also in Import AI, frames the stakes starkly: even if we succeed at building and aligning powerful AI, failing to build the right deployment system could still leave humanity worse off.
Five Things With 30-Day Clocks
- MirrorCode scaling experiments. The benchmark showed continued gains from inference scaling on larger projects, "suggesting enough tokens might solve them." The specific test: does applying ten times the compute power at the benchmark's largest unsolved programs (beyond sixteen thousand lines) produce working rewrites? If yes, the "weeks-long coding tasks" framing undersells what is possible.
- Synthius-Mem's LoCoMo benchmark reproduction. Achieving 94.4 percent accuracy against 87.9 percent human performance, and 99.55 percent adversarial robustness (a metric no competitor reports), this is either a breakthrough in structured agent memory or a benchmark artifact. Independent reproduction on LoCoMo within thirty days would clarify whether the six-domain cognitive architecture generalizes beyond the paper's ten conversations.
- RoMem's zero-shot transfer financial domain generalization. The paper claims zero-shot transfer to unseen financial domains (FinTMMBench). Should practitioners validate this on production financial data, geometric shadowing could replace the crude, recency-biased approaches most agent memory systems use.
- Google's AI for Economy Forum as a policy coordination mechanism. The Windfall Policy Atlas makes forty-eight proposals navigable; Google's forum commits funds. Whether these converge into actionable policy or remain parallel announcements depends on the next round of Congressional hearings and executive orders. Watch for the apprenticeship program's one-hundred-company enrollment numbers as the first signal of corporate seriousness.
- PaperOrchestra's multi-agent research writing pipeline. Google's PaperOrchestra claims fifty to sixty-eight percent improvements in literature-review quality over baseline systems when using specialized agents (outline, literature retrieval, plotting, writing, and refinement). If academic labs adopt this for actual paper drafts—not just benchmarks—it could reshape research productivity timelines within a single conference cycle.
Top comments (0)