GPT-5.5 Halves Token Use, Setting a New Efficiency Standard
OpenAI's Latest Model Delivers More for Less
OpenAI introduced GPT-5.5 this week for paid ChatGPT and Codex users. The key metric, however, isn't a benchmark score; it's the token count. On Terminal-Bench 2.0, which evaluates real command-line workflows, GPT-5.5 scored 82.7 percent using about 2,165 output tokens per task. Its predecessor, GPT-5.4, achieved 75 percent with nearly 4,950 tokens.
AI Explained detailed the economic impact: per-token API pricing doubled—to five dollars for input and thirty dollars for output per million tokens. Yet, because the model solves problems in fewer steps, the net cost per completed task actually dropped. OpenAI optimized GPT-5.5 for NVIDIA's GB200 and GB300 NVLink 72 systems. The new model matches GPT-5.4's latency, even with its increased capability.
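The claim checks out with simple arithmetic. A minimal sketch, assuming GPT-5.4 was billed at the pre-doubling rate of fifteen dollars per million output tokens and setting input costs aside (per-task input counts weren't reported):

```python
# Back-of-the-envelope check of "cheaper per task despite doubled prices,"
# using the Terminal-Bench 2.0 figures above. Only output tokens are priced.

models = {
    # name: (output tokens per task, success rate, $ per 1M output tokens)
    "GPT-5.4": (4950, 0.750, 15.0),  # assumed pre-doubling rate
    "GPT-5.5": (2165, 0.827, 30.0),
}

for name, (tokens, success, price) in models.items():
    per_task = tokens / 1e6 * price
    per_solved = per_task / success  # normalize by success rate
    print(f"{name}: ${per_task:.3f} per task, ${per_solved:.3f} per solved task")
```

Under those assumptions GPT-5.5 comes out about twelve percent cheaper per attempt and about twenty percent cheaper per solved task, despite the doubled rates.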
Ethan Mollick, who gained early access, tested GPT-5.5 Pro with four prompts. The model generated an academic paper of nearly PhD quality, synthesizing years of dormant crowdfunding data. It provided a thorough literature review, sound statistics, and verified citations. Mollick called it "a noteworthy step," but observed that the "jagged frontier" persists: "the fiction is still flat and the hypotheses are sometimes uninteresting even when the statistics are sound." Matthew Berman, after two weeks of testing, highlighted GPT-5.5’s skill at diagnosing production website problems without logs or real data. He noted this intuition about system behavior surpassed anything Opus 4.6 or 4.7 could offer.
However, GPT-5.5 falls short in other areas. On SWE-Bench Pro, the agentic coding benchmark OpenAI itself recommended as less prone to contamination, GPT-5.5 scored 58.6 percent, trailing Opus 4.7 by about six points and Anthropic's unreleased Mythos by almost twenty. On hallucinations, AI Explained revealed a stark difference: when GPT-5.5 answered incorrectly, eighty-six percent of those answers were confident hallucinations rather than admissions of uncertainty, compared to thirty-six percent for Opus 4.7. The model almost never admits ignorance. GPT-5.5 Pro, the more powerful variant, will soon reach the API but was unavailable for independent benchmarking, making a direct comparison with Mythos impossible.
OpenAI also released ChatGPT Images 2.0, which now tops the LM Arena image leaderboard with a clear lead over Google's Nano Banana, and introduced Workspace Agents for business and enterprise users. These persistent, Codex-powered bots operate in the cloud, access tools like Linear and Slack, and are set to replace Custom GPTs.
Four Open Models Emerge Amid a Computing Crunch
On the day GPT-5.5 launched, DeepSeek V4 and Qwen 3.6-27B also arrived; together with Moonshot AI's Kimi K2.6 and Z.ai's GLM-5.1, they offer four distinct visions of where value sits in the model stack.
DeepSeek V4, from the Chinese lab that shook the industry with V3, shipped as open weights under an MIT license. It is a 1.6-trillion-parameter mixture-of-experts model that activates forty-nine billion parameters per token, and its headline capability is a one-million-token context window at about one-tenth the cost of frontier models. DeepSeek estimates it lags the frontier by three to six months. On AI Explained's private common-sense benchmark, the Pro variant scored within one to two percent of Opus 4.7. Real-world testing by intheworldofai proved far harsher: DeepSeek V4 came across as "benchmark-maxed," solid on standardized tests but sloppy on front-end generation, failing to complete an Instagram feed clone and producing a 3D PS5 controller that resembled a table. Bloomberg reported that DeepSeek itself acknowledged service capacity "is limited due to a computing crunch."
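For scale, those architecture numbers imply each token touches about three percent of the network. A rough sketch; the FLOPs rule of thumb is illustrative, not from the release:

```python
# DeepSeek V4's MoE economics in brief: all 1.6T parameters must be
# stored and served, but per-token compute scales with the 49B active.

total_params, active_params = 1.6e12, 49e9

print(f"active fraction per token: {active_params / total_params:.1%}")  # ~3.1%
print(f"approx. forward-pass FLOPs per token: {2 * active_params:.1e}")  # ~9.8e10
```

In compute terms each token costs about what a dense forty-nine-billion-parameter model would, which is how a 1.6-trillion-parameter system can undercut frontier pricing on long contexts.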
Alibaba's Qwen 3.6-27B makes the opposite bet: a smaller, technically elegant twenty-seven-billion-parameter model, open source under Apache 2.0, that outperforms Alibaba's own 397B model on SWE-Bench Verified (77.2 versus 76.2 percent) and runs on about eighteen gigabytes of VRAM. Its "Thinking Preservation" feature, which carries reasoning state across conversation turns, solves a practical problem in multi-step coding.
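The eighteen-gigabyte figure is consistent with 4-bit weight quantization plus a working KV cache. A sketch in which the quantization level and cache allowance are assumptions, not published specs:

```python
# Rough VRAM estimate for a 27B model under assumed 4-bit weights.
params = 27e9
weights_gib = params * 0.5 / 2**30   # 0.5 bytes per parameter at 4-bit
kv_and_overhead_gib = 5.0            # assumed KV cache plus runtime overhead

print(f"weights: {weights_gib:.1f} GiB")                        # ~12.6 GiB
print(f"total:   {weights_gib + kv_and_overhead_gib:.1f} GiB")  # ~17.6 GiB
```

That lands a 27B model comfortably on a single 24-gigabyte consumer GPU, which is the point.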
Moonshot AI's Kimi K2.6, the trillion-parameter open-source coding model released on April 23, gained attention for its twelve-hour-plus autonomous sessions and support for three hundred parallel agents. It outperformed both Opus 4.6 and GPT-5.4 on Humanity's Last Exam and on deep-search tasks.
Z.ai's GLM-5.1 offers eight-hour autonomous task persistence in an open-weights model with a 754B MoE architecture, and claims the top SWE-Bench Pro score among open models at 58.4 percent.
Compute scarcity, the underlying issue, shapes strategy at every lab. In a Google Cloud campus interview, Thomas Kurian explained how Google's decade of TPU investment gives it a structural advantage. The company powers Gemini inference, sells TPUs to labs like Anthropic, and still retains enough capacity to announce eighth-generation TPUs. These chips mark the first architectural split into dedicated training (8T) and inference (8i) units. "It's better to have your own chips and demand than not having your own chips," Kurian said. Gemini Enterprise token volume jumped from ten billion to sixteen billion per minute between January and April. Asked about competitors' compute struggles, OpenAI president Greg Brockman laughed: "Our competitors are not having a good time on compute."
Anthropic, following this week's one-hundred-billion-dollar AWS commitment and five-gigawatt capacity deal, reports 98.8 percent uptime on claude.ai. The figure is notable not for what Anthropic claims, but for how far it falls short of the 99.9 percent or higher that competitors report. Matthew Berman traced the company's policy whiplash to its source, citing restrictions on third-party harness access over Easter weekend, trials of removing Claude Code from Pro plans, and unfulfilled promises of clarity. Berman concluded that Dario Amodei underestimated compute demand and chose not to risk the company on capital expenditure. OpenAI exploited the situation relentlessly, resetting Codex usage limits at every opportunity and hiring OpenClaw creator Peter Steinberger.
Two insightful Anthropic engineering posts offered clarity amid the noise. A postmortem on Claude Code quality traced the March and April degradation reports to three root causes: a default reasoning-effort downgrade, a caching bug that repeatedly dropped reasoning history, and a system prompt change that traded intelligence for conciseness. All three were fixed by April 20, and usage limits for all subscribers were reset. Separately, research on infrastructure noise in evaluations found that hardware configuration alone can swing Terminal-Bench 2.0 scores by six percentage points, a difference larger than the typical leaderboard gaps that drive model selection. Small benchmark differences between models may reflect hardware, not capability.
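The practical consequence is easy to simulate. A toy Monte Carlo, assuming uniform noise of up to three points in either direction; this is an illustration, not Anthropic's methodology:

```python
import random

random.seed(0)
true_a, true_b = 80.0, 77.0   # hypothetical models, three points apart
trials, flips = 10_000, 0
for _ in range(trials):
    a = true_a + random.uniform(-3, 3)  # hardware-induced noise
    b = true_b + random.uniform(-3, 3)
    flips += b > a                      # weaker model measures higher

print(f"leaderboard order flips in {flips / trials:.0%} of runs")  # ~12%
```

Roughly one run in eight ranks the weaker model first, exactly the regime in which single-run leaderboard gaps are deciding model selection.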
AI's Hidden Costs: Waste, Overconfidence, and Practical Limits
The Pragmatic Engineer published the most detailed account to date of "tokenmaxxing," the practice of inflating AI token usage to climb internal leaderboards. At Meta, eighty-five thousand employees burned 60.2 trillion tokens in thirty days; at list prices, that totals roughly nine hundred million dollars. Engineers at Microsoft admitted they deliberately queried AI for answers already in documentation, prototyped features they would never ship, and "defaulted to always using the agent, even when I could do the work by hand faster." Salesforce set minimum weekly spend targets: one hundred dollars on Claude Code, seventy dollars on Cursor. Shopify took a sounder approach: it renamed its leaderboard to "usage dashboard," added circuit breakers for runaway agents, and has leadership investigate each top spender's actual output. After media coverage, Meta removed its leaderboard, though a long-tenured engineer suspects the real goal was generating training data for Meta's next coding model.
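The Meta figure is worth a sanity check; the blended list price below is inferred from the quoted total, not reported directly:

```python
tokens = 60.2e12          # tokens burned in 30 days
blended_price = 15.0      # assumed $ per 1M tokens, blended input/output
engineers = 85_000

total = tokens / 1e6 * blended_price
print(f"total: ${total / 1e9:.2f}B")                      # ~$0.90B
print(f"per engineer: ${total / engineers:,.0f}/month")   # ~$10,624
```

That is more than ten thousand dollars per engineer per month at list prices, much of it, by the engineers' own account, spent on work they could have done faster by hand.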
AlphaSignal covered a related finding: a new paper on the "LLM Fallacy" reports that users who produce good output with AI assistance systematically overestimate their own skill. The low-friction experience obscures the AI's contribution, inflating confidence across coding, writing, analysis, and language tasks while actual ability atrophies. It's the GPS effect applied to your career: rely on turn-by-turn directions long enough and you stop learning the roads.
On biosecurity, Second Thoughts published a well-sourced analysis from the Golden Gate Institute for AI. It argues that AI bio-risk assessments overestimate the threat by focusing on information access while ignoring "tacit knowledge"—the muscle memory, mentor-transmitted intuitions, and thousands of micro-judgments required to execute lab procedures. The piece centers on Aum Shinrikyo: with one billion dollars and trained microbiologists, the cult failed to weaponize anthrax because its team lacked hands-on experience with the specific steps. The spore concentration was too low, the suspension too viscous for aerosolization, the strain insufficiently virulent. Current AI evaluations "may be measuring the wrong thing" by testing codified knowledge instead of whether AI erodes the tacit-knowledge barrier—and until automated labs absorb more of that knowledge, the barrier remains real.
Andrew Ng, writing in The Batch, offered a practical taxonomy of how coding agents accelerate different types of work: frontend development (dramatically), backend (significantly, though less so), infrastructure (modestly), and research (marginally). "I now ask front-end teams to implement products dramatically faster than a year ago," he wrote, "but my expectations for research teams have not shifted nearly as much." This maps to a pattern visible across this week's model releases: the demos are nearly always frontend showcases—Minecraft clones, landing pages, Mac OS simulations—because that's where the acceleration is real.
Looking Ahead: Five Key Developments
- GPT-5.5 Pro API Access. OpenAI has promised availability "very soon." Once independent benchmarking is possible, direct comparisons with Opus 4.7 and Mythos on SWE-Bench Pro will clarify whether OpenAI closed the agentic coding gap or merely the efficiency gap. This is the most important pending evaluation in the model race.
- Cursor × SpaceX. SpaceX secured the right to acquire Anysphere's Cursor for sixty billion dollars or, if it declines to buy, to pay ten billion dollars for a partnership instead. Should the acquisition close, Cursor gains access to SpaceX's Colossus supercomputer, equivalent to a million H100 units, potentially producing the first frontier coding model trained on the world's richest proprietary coding dataset. Watch for a formal training announcement.
- Google TPU 8T/8i at Cloud Next. Google will launch the first split training/inference TPU generation. The 8i chip runs without water cooling, enabling deployment in standard data centers—a direct play for the inference-at-the-edge market driven by agentic workloads. The 8T fits two petabytes of memory in a single system. Benchmark results against NVIDIA's GB300 will follow within weeks.
- Anthropic's Compute Recovery. The five-gigawatt AWS deal, announced April 20, will start delivering Trainium 2 and 4 capacity later this quarter. Whether Anthropic stabilizes service quality and stops its loss of agentic users to OpenAI hinges on how quickly this materializes. The policy damage compounds with each week of delay.
- WebGen-R1 and RL for Project-Level Code. A new paper details an end-to-end reinforcement learning framework that trains a seven-billion-parameter model to generate deployable multi-page websites. It rivals DeepSeek-R1 (671B) on functional success while significantly exceeding it on visual quality. If reinforcement learning approaches can close the gap between small open models and frontier ones on project-level generation, the cost structure of AI-assisted development will fundamentally change within months.