
zkiihne

Large Language Letters 04/26/2026

#ai

Automated draft from LLL

The New A.I. Models Are Brilliant Liars

OpenAI’s GPT 5.5 leads the benchmarks but confidently invents answers for most of the questions it gets wrong. Meanwhile, Google guards its compute advantage, and open-source models challenge the frontier.

Within twenty hours, two new A.I. models arrived, promising to reshape how hundreds of millions of people use artificial intelligence daily. OpenAI’s GPT 5.5, available to paid ChatGPT and Codex users, now leads the Artificial Analysis Intelligence Index, a composite of ten challenging benchmarks. It scores 82.7 percent on Terminal-Bench 2.0, surpassing Anthropic’s unreleased Mythos (82.0 percent) on agentic terminal tasks. The model completes the same coding tasks as GPT 5.4 while using significantly fewer tokens. Input tokens cost five dollars per million, the context window spans a million tokens, and A.P.I. access will open soon.

But GPT 5.5’s system card reveals a detail that complicates its victory lap. When the model answers a factual question incorrectly, it confidently fabricates a response eighty-six percent of the time instead of admitting ignorance. Opus 4.7, by contrast, bluffs on only thirty-six percent of its errors. Measured across all answers, correct ones included, the gap narrows: the net hallucination rate is twenty-six percent for GPT 5.5 against twenty percent for Opus 4.7. Still, the calibration gap remains the widest among current frontier models. On SWE-Bench Pro, the coding benchmark OpenAI itself deemed robust, GPT 5.5 lags Opus 4.7 by six points and Mythos by nearly twenty. OpenAI bluntly stated that GPT 5.5 has “no plausible chance” of reaching a high threshold in recursive self-improvement, citing its limited coherence and inability to sustain goals on multi-hour tasks. As always, benchmark selection dictates the perceived winner.
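The conditional and net figures reconcile if you treat the net hallucination rate as the product of the error rate and the bluff-given-error rate. That decomposition is our assumption, not OpenAI’s published methodology, but it makes the calibration story concrete:

```python
# Back-of-the-envelope decomposition (our assumption, not OpenAI's
# published methodology): net hallucination rate over all answers
# = P(answer is wrong) * P(confident fabrication | wrong).

def implied_error_rate(net_hallucination: float, bluff_given_error: float) -> float:
    """Solve net = error_rate * bluff_given_error for the error rate."""
    return net_hallucination / bluff_given_error

# Figures reported for GPT 5.5 and Opus 4.7 in the section above.
gpt55_error = implied_error_rate(0.26, 0.86)   # ~30%: wrong less often, bluffs nearly always
opus47_error = implied_error_rate(0.20, 0.36)  # ~56%: wrong more often, admits it far more readily

print(f"GPT 5.5 implied error rate:  {gpt55_error:.0%}")
print(f"Opus 4.7 implied error rate: {opus47_error:.0%}")
```

Under this simplified reading, GPT 5.5 is wrong less often but almost never says so, which is exactly what a calibration gap looks like.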

OpenAI also introduced GPT Image 2, which leads LM Arena’s image leaderboard by two hundred and thirty Elo points over Google’s Nano Banana. Concurrently, it launched Workspace Agents: persistent, cloud-running team automations that connect to Slack and internal tools, available free until May 6.

The same day, DeepSeek, a Chinese A.I. lab, released its V4 Pro model. It boasts 1.6 trillion total parameters (forty-nine billion active through a mixture-of-experts architecture), a million-token context window, and open weights under an M.I.T. license. DeepSeek admits the model trails the cutting edge by three to six months but costs roughly one-tenth as much. Independent reviewers split sharply: AI Explained found V4 Pro comparable to Opus 4.7 in spatial reasoning at a fraction of the price, while In The World of AI noted repeated failures on basic U.I. generation tasks that smaller models handle cleanly, calling the model “benchmark maxed.” DeepSeek’s service capacity, Bloomberg reported, remains severely limited by a computing crunch.
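The gap between total and active parameters is a property of the mixture-of-experts design: a router activates only a few expert networks per token, so each forward pass touches a sliver of the weights. A minimal sketch of top-k routing follows; the expert count, dimensions, and k are illustrative, not DeepSeek’s actual V4 Pro configuration:

```python
import numpy as np

# Minimal top-k mixture-of-experts routing sketch. The expert count,
# hidden sizes, and k below are illustrative only; they are not
# DeepSeek's actual V4 Pro configuration.
num_experts, k, d_model, d_ff = 64, 4, 512, 2048

rng = np.random.default_rng(0)
router = rng.standard_normal((d_model, num_experts))
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(num_experts)
]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts and mix the outputs."""
    logits = x @ router
    top = np.argsort(logits)[-k:]                 # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over the chosen experts
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # ReLU FFN expert
    return out

token = rng.standard_normal(d_model)
y = moe_forward(token)
# Only k of num_experts expert FFNs ran, so the active share of expert
# weights is roughly k / num_experts. That is how a 1.6-trillion-parameter
# model can run forward passes that touch only ~49 billion parameters.
print(f"output norm: {np.linalg.norm(y):.2f}")
print(f"active expert share: {k / num_experts:.1%}")
```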

The open-source tier continues to narrow the gap with proprietary models. Alibaba’s Qwen 3.6-27B surpasses its larger 397-billion-parameter sibling on coding benchmarks while running on eighteen gigabytes of VRAM, a figure that squares with rough quantization arithmetic (see the sketch below). Z.ai’s GLM-5.1, a 754-billion-parameter open-weights model designed for autonomous coding sessions of up to eight hours, ranked third on Arena Code days after its launch. Taken together with Moonshot AI’s Kimi release earlier this week, the pattern is clear: the cost of reaching eighty percent of frontier capability is falling faster than the cost of the final twenty percent.
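The eighteen-gigabyte figure for a 27-billion-parameter model passes a quick smoke test. The sketch below assumes 4-bit quantization and a few gigabytes of runtime overhead; both are our assumptions, not Alibaba’s published deployment recipe:

```python
# Rough VRAM estimate for serving a 27B-parameter model. The 4-bit
# quantization and overhead figures are assumptions for illustration,
# not Alibaba's published deployment recipe.
params = 27e9
bits_per_weight = 4                                # common 4-bit quantization
weights_gb = params * bits_per_weight / 8 / 1e9    # ~13.5 GB of weights
overhead_gb = 4.5                                  # KV cache, activations, buffers

print(f"weights: {weights_gb:.1f} GB, total: {weights_gb + overhead_gb:.1f} GB")
# ~18 GB total, consistent with the figure reported for Qwen 3.6-27B.
```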

As a footnote to the unfolding Mythos saga, Anthropic confirmed that unauthorized users accessed the model it deemed too powerful for public release. The company maintains there is no evidence of impact on its systems. Sam Altman seized the moment to criticize Anthropic’s messaging, calling the restricted release “incredible marketing”—“building a bomb, then selling the bomb shelter.”

Google Is the Only Frontier Lab Not Starved for Compute

Google Cloud C.E.O. Thomas Kurian, in a recent interview, explained why Google maintains an abundance of computing power while its competitors ration theirs. Google attributes this advantage to eleven years of in-house T.P.U. development, diversified monetization across chips and tokens (including selling inference to Anthropic), and a manufacturing approach to data centers. This last method involves pre-assembling and pre-testing entire server racks in central facilities for faster deployment.

Google will announce its eighth-generation T.P.U. at Google Cloud Next. For the first time, Google is splitting the T.P.U. line into dedicated training (8T) and inference (8i) chips. Kurian noted that agentic workloads prompted this division: agents running for six to twelve hours require persistent K.V. caches and fundamentally different memory economics than chatbot queries. The air-cooled inference chip allows deployment in more locations. A new Gemini model will arrive “very, very soon,” Kurian said. He expressed confidence that Google’s disaggregated serving stack can handle “the largest models in the world”—a pointed response to questions about the commercial feasibility of Mythos-scale models, rumored at ten trillion parameters. Since January, Gemini Enterprise token consumption has jumped from ten billion to sixteen billion per minute, and enterprise users have increased by forty percent sequentially.
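Kurian’s point about memory economics is easy to quantify: an agent holding a long context alive for hours pins a key-value cache in accelerator memory for the entire session. The sizing sketch below uses generic transformer dimensions; they are our assumptions, not the specifications of any Google model or T.P.U.:

```python
# Rough KV-cache sizing for a long-running agent session. All model
# dimensions below are generic transformer assumptions, not the specs
# of any Google model or TPU.
layers = 80
kv_heads = 8          # grouped-query attention keeps this small
head_dim = 128
bytes_per_value = 2   # fp16/bf16

def kv_cache_gb(context_tokens: int) -> float:
    """Bytes for keys + values across all layers, for one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token / 1e9

# A chatbot turn vs. an agent holding a million-token context for hours:
print(f"8k-token chat turn: {kv_cache_gb(8_000):.2f} GB")
print(f"1M-token agent run: {kv_cache_gb(1_000_000):.1f} GB")
# The agent's cache must stay resident for the whole six-to-twelve-hour
# session, which is why inference hardware wants different memory economics.
```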

OpenAI president Greg Brockman acknowledged publicly, “We are headed to a world of compute scarcity,” noting that competitors “are not having a good time on compute.” Anthropic’s one-hundred-billion-dollar A.W.S. commitment, disclosed earlier this week alongside its thirty-billion-dollar revenue run rate, partly reflects this same pressure. This widening gap between compute haves and have-nots defines the structural story of 2026. It may also explain why Google can afford to sell T.P.U. time to a direct competitor like Anthropic while still advancing its own models.

Yet building computing power at this scale faces increasing political resistance. Maine’s legislature passed the first statewide moratorium on large data centers, which now awaits the governor’s signature. Twelve other states are considering similar legislation in 2026. Ohio citizens have launched a ballot measure to amend the state constitution to bar facilities exceeding twenty-five megawatts, which requires four hundred thousand signatures by July 1. The backlash has even turned violent: a Molotov cocktail was thrown at Sam Altman’s home, and thirteen gunshots were fired at the home of an Indianapolis city councilor who had voted for a data center project. Kurian acknowledged the tension, citing investments in behind-the-meter energy and community development, but called it “part of the journey we’re on as a society.”

Six Percentage Points of Your Favorite Benchmark May Be Measuring Server Specs

Anthropic’s engineering team published a finding that should reframe every leaderboard debate: infrastructure configuration alone (C.P.U. count, R.A.M., resource enforcement) can shift agentic coding benchmark scores by up to six percentage points on Terminal-Bench 2.0. That margin exceeds most observed differences between adjacent models on any leaderboard. Strict resource limits produced a 5.8 percent infrastructure failure rate, against 0.5 percent for uncapped systems. On SWE-Bench, quintupling the R.A.M. shifted scores by 1.54 points, a smaller but consistent effect. When GPT 5.5 lags Opus 4.7 by six points on SWE-Bench Pro, some of that difference may stem from hardware, not intelligence.
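One practical response is to pin and report the resource configuration of every benchmark run. A hedged sketch of how a team might surface the effect, running the same task container under two resource limits: the harness and the image name are hypothetical, though the docker run flags are standard:

```python
import subprocess

# Hypothetical harness: run the same agentic benchmark container under
# two resource configurations and compare the results. The image name
# "bench/terminal-bench:2.0" is made up for illustration; the
# --cpus/--memory flags are standard `docker run` options.
CONFIGS = {
    "capped":   ["--cpus=2",  "--memory=4g"],
    "uncapped": ["--cpus=16", "--memory=64g"],
}

def run_suite(limits: list[str]) -> int:
    """Run the benchmark once under the given limits; return its exit code."""
    cmd = ["docker", "run", "--rm", *limits, "bench/terminal-bench:2.0"]
    return subprocess.run(cmd).returncode

for name, limits in CONFIGS.items():
    code = run_suite(limits)
    print(f"{name}: exit code {code}")
# If results diverge between the two runs, the difference is
# infrastructure, not model intelligence -- the noise Anthropic
# measured at up to six points on Terminal-Bench 2.0.
```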

This insight connects to a broader pattern evident in the week’s releases. A domain-specialized GPT 5.4, designed for clinicians, outperforms GPT 5.5 on medical benchmarks, even though 5.5 is the overall “smarter” model. DeepSeek V4 Pro, tuned for Chinese professional tasks, reportedly surpasses Opus 4.6 Max on those benchmarks while lagging on English-language coding. As AI Explained asked, “What do A.G.I. or A.S.I. mean if such disparity exists between domains?” Models are not universal generalizers; they rely heavily on reinforcement learning in specific domains. The single-axis view of intelligence increasingly appears a useful fiction, rather than a description of reality.

Andrew Ng offered a complementary observation in The Batch: coding agents accelerate frontend development dramatically, but less so for backend, even less for infrastructure, and barely at all for research. The implication is clear: benchmark-driven model selection misses the essential question, which is not which model tops a table but which one best suits the specific work you are paying for.

MCP Triples to 300 Million Monthly Downloads as Anthropic Pushes Into Japan

Beyond its infrastructure-noise paper, Anthropic rolled out several updates this week. The Claude Code quality postmortem, following Thursday’s thread, confirmed fixes for all three root causes: a downgraded default reasoning effort, a caching bug that dropped reasoning history, and a system prompt change that traded intelligence for brevity. N.E.C., Japan’s largest I.T. services company, will deploy Claude to thirty thousand employees as Anthropic’s first Japan-based global partner. Together, they will co-develop A.I. products for finance, manufacturing, and government, including integrating Claude into N.E.C.’s cybersecurity operations center. M.C.P. S.D.K. downloads reached three hundred million per month, tripling since January. New production guidance documents an eighty-five-percent token reduction through tool search and a thirty-seven-percent reduction through programmatic calling. Claude also expanded to more than two hundred integrations, including AllTrails, Instacart, Audible, and Uber.
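The tool-search saving is intuitive once you count tokens: instead of sending every tool schema with every request, the client sends one small search tool and loads full definitions only on demand. The toy accounting below illustrates the idea; all of the numbers are invented for the example, and this is not the M.C.P. S.D.K.’s actual API or Anthropic’s measurement methodology:

```python
# Toy token accounting for "tool search" vs. sending every schema
# upfront. All numbers are invented for illustration; this is not the
# MCP SDK's actual API or Anthropic's measurement methodology.
num_tools = 200
tokens_per_schema = 600          # a typical JSON tool definition
tools_actually_used = 5
search_tool_overhead = 800       # the search tool's own schema + results

upfront = num_tools * tokens_per_schema
on_demand = search_tool_overhead + tools_actually_used * tokens_per_schema

print(f"all schemas upfront: {upfront:,} tokens")
print(f"search + load 5:     {on_demand:,} tokens")
print(f"reduction:           {1 - on_demand / upfront:.0%}")
# ~97% here; real workloads land lower (Anthropic reports 85%), since
# requests reuse context and some tool sets are smaller.
```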

Four Countdowns Running Right Now

  • Google Cloud Next (Next Week): Google is expected to confirm its eighth-generation T.P.U.s and a new Gemini model. The inference chip’s air-cooled design signals Google’s wager that agentic workloads demand geographic distribution, not just cluster scale. Observers will watch whether the new Gemini closes the SWE-Bench Pro gap with Opus 4.7.
  • GPT 5.5 A.P.I. Release (Imminent): Independent benchmarks will soon test OpenAI’s self-reported numbers. The model’s profile is contradictory: it leads Terminal-Bench but trails on SWE-Bench Pro and records the highest hallucination rate among its peers, so third-party evaluations could move markets.
  • Maine Data Center Moratorium (Awaiting Governor’s Signature): If signed, this bill enacts the first statewide ban on large data centers. With twelve other states considering similar measures, the precedent could reshape the pace and location of A.I. infrastructure development across the U.S.
  • Ohio Constitutional Amendment (400,000 Signatures by July 1): If the ballot initiative to bar data centers over twenty-five megawatts qualifies, voters in November could set a constitutional precedent that lobbying cannot easily reverse.
