Anthropic's Claude Models Surpass Human Researchers on a Key AI Alignment Task
Automated Alignment Research: A Fourfold Improvement, and Important Caveats
Anthropic's new research demonstrates that nine Claude models, dubbed "Automated Alignment Researchers," surpassed human researchers on a fundamental AI safety task. They tackled "weak-to-strong supervision," a stand-in for aligning AI systems smarter than their human overseers. In five days, the Claude models recovered ninety-seven per cent of the performance gap, a significant leap over the twenty-three per cent human baseline achieved in seven days.
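A quick note on the metric: "performance gap recovered" is conventionally computed as the fraction of the gap between a weak supervisor and an unrestricted strong model that the weakly supervised model closes. A minimal sketch, with illustrative numbers rather than Anthropic's actual scores:

```python
def performance_gap_recovered(weak: float, strong_ceiling: float,
                              weak_to_strong: float) -> float:
    """Fraction of the weak-to-strong gap recovered by the supervised model."""
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Illustrative numbers only: a weak supervisor scoring 0.60, an unrestricted
# strong model scoring 0.90. A run reaching 0.891 recovers ~97 per cent of
# the gap; a run reaching 0.669 recovers ~23 per cent.
print(performance_gap_recovered(0.60, 0.90, 0.891))  # ~0.97
print(performance_gap_recovered(0.60, 0.90, 0.669))  # ~0.23
```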
Still, the results come with important limitations: the methods did not generalize to production-scale models, and the task was "unusually well-suited to automation" because it featured objective scoring metrics, which most alignment challenges lack. The result, however, points to a remarkable trend. After MirrorCode demonstrated its ability to handle weeks-long coding tasks last Friday, this latest finding suggests AI systems are evolving into credible researchers, not merely assistants. If AI can partially automate alignment research, the recursive improvement loop, long theorized by safety researchers, moves closer to reality.
This news arrives the same week Anthropic appointed Novartis C.E.O. Vas Narasimhan to its board; Trust-appointed directors now form a majority, a move that signals the company's commitment to governance before its capabilities outpace oversight. In a related development, Nicholas Carlini, a leading cybersecurity researcher, reportedly uncovered as many critical vulnerabilities in recent weeks as in his entire career to date, a finding the Cognitive Revolution podcast cited as evidence of a new AI capability regime. The Glasswing/Mythos discussion from earlier this week continues to expand.
Claude Code, Cursor, and Codex Converge on the Same Parallel-Agent IDE
The week's most notable observation is not a single announcement but a clear pattern: every major coding-agent tool now looks strikingly similar. Anthropic redesigned its Claude Code desktop app, adding a sidebar for managing concurrent sessions, an integrated terminal and file editor, drag-and-drop workspace layout, and context sharing across sessions. The update reflects a shift in how developers use agents: they now run refactors, bug fixes, and tests simultaneously, rather than one prompt at a time.
Elsewhere, Anthropic released Agent Skills—modular instruction bundles that enable agents to acquire domain-specific expertise through progressive disclosure. Metadata loads at startup, full instructions appear only when relevant, and additional files arrive on demand. This builds on the agent architecture Anthropic introduced on April eighth, separating the agent's "brain" from its execution.
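A minimal sketch of the progressive-disclosure pattern as described, with a hypothetical Skill shape and file paths (the actual Agent Skills format may differ):

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    description: str        # metadata: loaded into context at startup
    instructions_path: str  # full instructions: read only when relevant

    def load_instructions(self) -> str:
        with open(self.instructions_path) as f:
            return f.read()

skills = [
    Skill("pdf-forms", "Fill and flatten PDF forms", "skills/pdf-forms.md"),
    Skill("brand-style", "Apply company brand guidelines", "skills/brand.md"),
]

def startup_context(skills: list[Skill]) -> str:
    # Startup cost is one line of metadata per skill, not the full text.
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)

def expand_for_task(skills: list[Skill], task: str) -> str:
    # Full instructions (and, in the real system, supporting files) arrive
    # only once the task actually touches a skill.
    return "\n\n".join(s.load_instructions() for s in skills if s.name in task)
```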
The AI Daily Brief podcast highlighted this convergence, noting:
"Cursor 3, Codex, and Claude Code desktop now look identical."
The term "vibe coding"—coined by Andrej Karpathy only fourteen months ago—loses its meaning as the distinction between casual and serious AI-assisted development erodes. The shared paradigm is parallel agent orchestration: developers now manage concurrent AI workers, rather than simply prompting and waiting for responses.
Anthropic also introduced Routines, templated agents that trigger via GitHub events, A.P.I. calls, or schedules and run on Anthropic's infrastructure, without requiring a laptop. This transforms Claude Code from a developer tool into an autonomous deployment platform, reflecting the post-model engineering discipline that has been crystallizing throughout the week.
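How a Routine's triggers might be wired is easy to picture, even though Anthropic's interface isn't detailed in this coverage. A minimal sketch, with invented trigger names and a toy registry rather than the actual A.P.I.:

```python
from typing import Callable

ROUTINES: dict[str, Callable[[dict], None]] = {}  # trigger -> templated agent

def routine(trigger: str):
    """Register a templated agent to run whenever `trigger` fires."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        ROUTINES[trigger] = fn
        return fn
    return register

@routine("github:pull_request.opened")
def review_pr(event: dict) -> None:
    print(f"spawning review agent for PR #{event['number']}")

@routine("schedule:nightly")
def triage_backlog(event: dict) -> None:
    print("spawning triage agent for the overnight issue backlog")

def dispatch(trigger: str, event: dict) -> None:
    # On hosted infrastructure this dispatch runs server-side, which is why
    # no developer laptop needs to stay online.
    if trigger in ROUTINES:
        ROUTINES[trigger](event)

dispatch("github:pull_request.opened", {"number": 1287})
```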
Notion's Five Rebuilds Reveal What Actually Works in Agent Products
The week's most substantive interview featured Notion's Simon Last and Sarah Sachs on Latent Space, where they detailed rebuilding Notion's agent system five times since late 2022. Their lessons are ones every company building agent products seems to learn the hard way:
- Notion learned to give models what they want. The company abandoned its proprietary X.M.L. block format and custom database-query J.S.O.N. Instead, it adopted markdown and SQLite, formats models already understand, and saw an immediate jump in quality.
- Don't fine-tune on tools that change daily. Sachs stressed that fine-tuning on Notion's tool definitions would slow the team down, since the company ships new tools constantly. Instead, Notion invests in retrieval engineering and lets improvements in frontier models carry tool-calling quality.
- Model Behavior Engineers are a real role. Notion employs a dedicated team of such engineers, originally linguistics Ph.D.s whom Last once taught to use GitHub at a whiteboard, who now build agents that write their own evaluations.
- Agent coordination works through product primitives, not special protocols. Custom agents communicate by reading and writing to Notion databases; memory exists as a page with edit access. One "manager agent" reduced a team's daily notifications from seventy to five by triaging thirty sub-agents (a minimal sketch of the pattern follows this list).
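The sketch below illustrates that last pattern, and also shows why SQLite (from the first item) is a convenient substrate: coordination is just rows in a shared table, and the manager agent's inbox is a query. The table schema and agent names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE notifications (
    id INTEGER PRIMARY KEY,
    agent TEXT,          -- which sub-agent wrote the row
    summary TEXT,
    needs_human INTEGER  -- set by the manager agent during triage
)""")

# Sub-agents append updates as ordinary database writes; no special protocol.
conn.executemany(
    "INSERT INTO notifications (agent, summary, needs_human) VALUES (?, ?, ?)",
    [("qa-agent", "All 42 regression tests passed", 0),
     ("support-agent", "Customer-reported data loss in export", 1),
     ("docs-agent", "Refreshed the API reference pages", 0)])

# The human inbox is whatever survives triage: five items instead of seventy.
for agent, summary in conn.execute(
        "SELECT agent, summary FROM notifications WHERE needs_human = 1"):
    print(f"[escalated] {agent}: {summary}")
```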
On the Model Context Protocol (M.C.P.) versus C.L.I. debate, Last argued for C.L.I.s: if a C.L.I.-based tool breaks, the agent can debug and fix it within its terminal environment; if an M.C.P. transport fails, the agent has no self-healing path. He recalled an anecdote: "Someone said their agent didn't have a browser, so it built itself one in a hundred lines of code." Such bootstrapping capability, he suggested, may prove more important than protocol elegance.
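The self-healing path Last describes is essentially a retry loop in which the tool's stderr is fed back to the model. A minimal sketch, assuming a hypothetical ask_model helper standing in for the agent's own model call:

```python
import subprocess

def ask_model(prompt: str) -> str:
    """Stand-in for the coding agent's model call; wire in a real client."""
    raise NotImplementedError

def run_tool(cmd: list[str], script_path: str, max_repairs: int = 2) -> str:
    for _ in range(max_repairs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # A CLI failure is plain text in the agent's own terminal, so the
        # agent can read it, rewrite the failing script, and try again.
        patched = ask_model(f"This tool failed:\n{result.stderr}\nFix the script.")
        with open(script_path, "w") as f:
            f.write(patched)
    raise RuntimeError("tool still failing after repairs")
```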
Google DeepMind Proposes a 10-Dimension Cognitive I.Q. Test to Replace A.G.I. Vibes
Google DeepMind proposed a new framework in its paper, "Measuring Progress Towards AGI: A Cognitive Framework." This taxonomy outlines ten cognitive faculties—perception, generation, attention, learning, memory, reasoning, meta-cognition, executive functions, problem-solving, and social cognition—drawn from decades of psychology and neuroscience research. Instead of a single A.G.I. score, the framework generates a radar chart, revealing where a system's performance lands relative to human abilities.
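A minimal sketch of what such a profile looks like in practice, with illustrative scores rather than measurements from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

faculties = ["perception", "generation", "attention", "learning", "memory",
             "reasoning", "meta-cognition", "executive functions",
             "problem-solving", "social cognition"]
scores = [0.9, 0.95, 0.4, 0.3, 0.5, 0.7, 0.35, 0.45, 0.6, 0.5]  # illustrative

# Evenly space the ten axes around the circle, then close the polygon.
angles = np.linspace(0, 2 * np.pi, len(faculties), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(111, polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(faculties, fontsize=8)
ax.set_ylim(0, 1)  # 1.0 = human parity on that faculty
ax.set_title("Cognitive profile (illustrative)")
plt.show()
```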
The paper acknowledges a truth practitioners already know: current A.I. is jagged, excelling at some cognitive tasks, yet failing at others that appear trivially easy. The framework aims to illuminate this jagged frontier, rather than obscuring it behind a single benchmark number.
Google supported the framework with a two-hundred-thousand-dollar Kaggle hackathon, targeting the five faculties with the largest assessment gaps: learning, meta-cognition, attention, executive functions, and social cognition. The hackathon closes today, and results will be announced on June first. Meanwhile, the latest ARC-AGI 3 leaderboard shows frontier models with tool access scoring around twenty-four per cent at best, while most models sit near 0.6 per cent.
The paper's true contribution may be political. OpenAI, Anthropic, and Google each define A.G.I. differently; Shane Legg, Google DeepMind's co-founder, predicts minimal A.G.I. by 2027 or 2028. Google attempts to establish a shared measurement standard before any company can unilaterally declare victory.
Jensen Huang Makes the Case That Nvidia Can't Be Commoditized—and That China Already Has Enough Compute
Jensen Huang's two-hour interview with Dwarkesh Patel was the most substantive public discussion on A.I. compute economics and China policy in months.
On Nvidia's competitive moat, Huang framed the company as a transformer of electrons into tokens. C.U.D.A.'s value lies not in any single kernel, but in its installed base of hundreds of millions of G.P.U.s, its presence in every cloud, and the ecosystem lock-in that prompts developers to build on C.U.D.A. first. He challenged competitors to demonstrate better performance-per-T.C.O. using Dylan Patel's InferenceMAX benchmark. Regarding Anthropic's recent multi-gigawatt T.P.U. deal with Google and Broadcom, Huang asserted, "Without Anthropic, why would there be any T.P.U. growth at all? It's one hundred per cent Anthropic." He attributed Anthropic's T.P.U. dependency to timing—Nvidia could not make multibillion-dollar investments in A.I. labs early enough—calling it "my miss" rather than a competitive loss for Nvidia.
On China, Huang advanced an argument most in the industry avoid: that China's energy abundance compensates for its chip disadvantage. "Seven-nanometer chips are essentially Hopper," he stated. "Today's models are primarily trained on Hopper-generation architectures." He pointed to Huawei's record revenue year and to S.M.I.C.'s manufacturing capacity, arguing that export controls have accelerated China's indigenous chip ecosystem. Patel pushed back, citing Mythos's offensive cybersecurity capabilities, but Huang remained unmoved. "If they have some compute," he said, "the question is how much do they need? The amount of compute they have in China is enormous." He warned that ceding the world's second-largest technology market would be "a disservice to our national security" and cited the U.S. telecommunications industry as a cautionary example of policy-driven market loss.
Why Notion Stopped Fine-Tuning and OpenAI Dropped Sora: Coding Is the Meta-Capability
Two contrarian positions from this week's sources challenged conventional wisdom.
Notion's Sachs argued against fine-tuning models on tool definitions—a practice many agent-building companies pursue by default. "It would slow us down to have a model fine-tuned on our tools because we'd have to retrain and cut a new model every time," she explained. Notion also observed a related pattern: labs sometimes ship model snapshots that are not the versions Notion validated, and "companies that say they're selling the same model through different vendors" occasionally show different quality levels, likely due to undisclosed quantization.
The AI Daily Brief reported a claim from Shravan on Twitter: that the upcoming Opus 4.7 will not outperform versions 4.6 or 4.5, and that users will praise it only because Anthropic degraded version 4.6 in recent weeks, thereby manufacturing a perceived improvement. Whatever its accuracy, the claim reflects growing skepticism among practitioners about model-release narratives.
On the broader industry front, the mreflow channel's breakdown of a production workflow offered an inadvertent illustration of how A.I. content creation actually works in practice: the entire pipeline (YouTube comment scraping, video intro generation, and overlay apps) was built with Cursor, Make.com, n8n, and Recut. A.I. models serve as interchangeable components within human-designed systems. As one commentator put it: the model is the commodity; the trigger, the product.
The Month Ahead: Five Milestones to Watch
Opus 4.7: Multiple sources report Anthropic's next flagship model is days to weeks away, possibly alongside a new design and presentation tool that could compete with Figma and Adobe. The question remains whether it will reset coding benchmarks or confirm the "nerfed 4.6" narrative.
Google's Kaggle A.G.I. Hackathon Results: The two-hundred-thousand-dollar competition to build evaluations for learning, meta-cognition, attention, executive functions, and social cognition closes today, with results expected on June first. The winning evaluations could become the first standardized cognitive benchmarks for A.I. systems.
ToolOmni's Open-World Agent Benchmark: A new paper demonstrates a 10.8 per cent increase in end-to-end execution success in open-world tool use via proactive retrieval and grounded execution. If the benchmark gains adoption, it would address the capability gap that Notion, Anthropic, and every other agent builder are racing to close.
G.P.U. Cost Escalation: The AI Daily Brief reports G.P.U. rental prices rose forty-eight per cent in two months. Uber's C.T.O. says Claude Code consumed his entire A.I. budget within months, with eleven per cent of Uber's backend now A.I.-written. Data center construction bans are spreading; Maine enacted an eighteen-month moratorium, with twelve other states considering similar measures. If costs do not stabilize, the current agent-building boom could hit a wall.
Automated Alignment Scaling: Anthropic's ninety-seven per cent result on a well-defined alignment subproblem is a ceiling estimate. The true test lies in whether the approach degrades gracefully on messier alignment challenges that lack objective metrics. Expect follow-up work within weeks: the recursive improvement loop is too strategically important to remain a mere proof-of-concept.