DEV Community

zkiihne


Large Language Letters 04/23/2026

#ai

Automated draft from LLL

A $100 billion AWS commitment and a $30 billion revenue run rate, announced earlier this week, painted a picture of Anthropic ascendant. But a cascade of policy reversals and pricing confusion now reveals a company buckling under the weight of its own success.

On Tuesday, Anthropic conducted a test affecting "about 2% of new prosumer signups," removing Claude Code from its $20-a-month Pro tier. Users spotted the change on pricing pages, screenshotted it, and it went viral. Within hours, Anthropic rolled the test back. Yet the incident crystallized a months-long pattern: muddled communications about subscriber token usage, opaque quota adjustments during peak hours, and a disputed ban on third-party harness tools like OpenClaw, which Anthropic promised to clarify in early April but never did.

Matthew Berman, in a detailed analysis, attributes these issues to a single strategic miscalculation by CEO Dario Amodei. OpenAI, he notes, invested aggressively in compute capacity, staking its solvency on continued demand growth. Anthropic, however, chose a conservative path, prioritizing algorithmic efficiency over raw infrastructure. That bet looked rational eighteen months ago. Today, Anthropic’s status page reports 98.8% uptime for claude.ai and just under 99% for its API—figures that would spell crisis for most infrastructure companies. OpenAI, by contrast, consistently maintains 99.8 to 99.9%.
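Those percentages are easier to feel as minutes of downtime. A quick back-of-the-envelope conversion, using only the uptime figures quoted above and plain arithmetic:

```python
# Convert an uptime percentage into expected downtime over a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def monthly_downtime_minutes(uptime_pct: float) -> float:
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100)

print(monthly_downtime_minutes(98.8))  # ~518 minutes, or roughly 8.6 hours
print(monthly_downtime_minutes(99.9))  # ~43 minutes
```

At 98.8%, users lose the better part of a working day each month; at 99.9%, well under an hour.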

The competitive landscape is ruthless. OpenAI capitalizes on every Anthropic stumble within hours. When news of the Pro tier test surfaced, OpenAI’s Codex team lead tweeted that Codex would remain available in both free and Plus plans, asserting principles of transparency and trust. Sam Altman, recounting "a couple of drinks" that night, seemed to confirm GPT 5.5 for Thursday and, in a pointed jab, tweeted "Okay, boomer" at Anthropic’s announcement.

Compounding the issue, Opus 4.7, discussed here previously, employs a new tokenizer that maps the same input to roughly 1 to 1.35 times as many tokens. It also generates more "thinking" tokens when processing complex tasks. Both changes accelerate quota consumption on an already strained system. As the YouTube channel Fireship observed, the model, while impressive, runs "a lot slower than Google Stitch, Codex, or Cursor Composer." The five-gigawatt AWS capacity expansion announced this week will not deliver meaningful relief until late 2026.
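The two effects multiply. A minimal sketch of the compounding, where the 1.35 tokenizer factor comes from the report above but the 20% thinking-token overhead is an invented placeholder, not a measured figure:

```python
def quota_burn_multiplier(tokenizer_factor: float, thinking_overhead: float) -> float:
    """How much faster a fixed token quota is consumed.

    tokenizer_factor:  tokens emitted per old token (1.0-1.35 per the report)
    thinking_overhead: extra fraction of tokens spent on "thinking" (hypothetical)
    """
    return tokenizer_factor * (1 + thinking_overhead)

# Worst-case tokenizer expansion plus a hypothetical 20% thinking overhead:
print(quota_burn_multiplier(1.35, 0.20))  # 1.62 -> the same quota depletes ~62% faster
```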

Shopify’s CTO Reveals the Most Advanced Enterprise AI Stack Nobody Is Talking About

While consumer-facing AI companies exchange barbs on social media, Shopify’s CTO Mikhail Parakhin, in a Latent Space interview, revealed what may be the most sophisticated production AI infrastructure beyond the foundation model labs.

Several aspects distinguish Shopify’s approach:

  • The company has achieved nearly 100% daily AI tool adoption company-wide. Command-line interface tools, such as Claude Code, Codex, and internal agents, now outpace integrated development environment tools like Cursor and Copilot. Shopify provides unlimited tokens and discourages any model below Opus 4.6, a policy that establishes a quality floor, not a ceiling. Token consumption grows exponentially, skewing increasingly toward power users.
  • Shopify built Tangle, a third-generation machine-learning experiment platform, and Tangent, an auto-research system built atop it. Tangent implements what Andrej Karpathy popularized as auto-research: agents that constantly run experiments, evaluate results, and iterate autonomously. The results are striking: Shopify’s search team boosted query processing from 800 to 4,200 queries per second at the same quality level, solely through an auto-research loop optimizing index server code. Parakhin recalled running 400 experiments on a problem he considered fully optimized, yet the system still found an improvement.
  • Shopify deploys Liquid Neural Networks (LNNs) in production for low-latency search—inference under 30 milliseconds with 300 million parameters—and for high-throughput batch processing. Developed by Liquid AI, LNNs are a non-transformer architecture that Parakhin describes as "the best architecture I’m aware of" in hybrid form. They are more expressive than state-space models, competitive with transformers as distillation targets, and increasingly capture workload share from Qwen-based alternatives within Shopify. Their use in Shopify’s customer simulation system, SimGym—which models individual buyer behavior using decades of historical data and runs full browser-based simulations to predict conversion impacts—represents arguably the most ambitious enterprise AI application in production today.
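The auto-research pattern behind Tangent (propose an experiment, run it, evaluate, keep the winner, repeat) can be sketched generically. Everything below is a hypothetical skeleton with a toy objective, not Shopify’s actual Tangle or Tangent code:

```python
import random

def auto_research_loop(baseline, propose, evaluate, budget=400):
    """Generic auto-research loop: propose a variant, keep it only if it scores better.

    baseline: starting candidate (e.g. an index-server configuration)
    propose:  function producing a mutated candidate from the current best
    evaluate: function scoring a candidate (higher is better)
    budget:   number of experiments to run
    """
    best, best_score = baseline, evaluate(baseline)
    for _ in range(budget):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy usage: "tune" a fake config with one knob whose sweet spot is batch=64.
random.seed(0)
evaluate = lambda cfg: 4200.0 - abs(cfg["batch"] - 64)
propose = lambda cfg: {"batch": cfg["batch"] + random.choice([-4, -1, 1, 4])}
best, score = auto_research_loop({"batch": 8}, propose, evaluate)
print(best, score)
```

The real systems layer scheduling, statistical evaluation, and safety checks on top of this loop; the skeleton only shows the iterate-and-keep-the-best core.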

Parakhin also identified the primary bottleneck for agentic engineering at scale: continuous integration/continuous deployment (CI/CD) infrastructure. With pull request merges growing 30% month-over-month and AI writing more verbose code than humans, the probability of test failures per deployment has risen steeply. He prescribes fewer agents that spend tokens efficiently, rather than many agents operating in parallel, and heavy investment in careful review of model output instead of prioritizing rapid generation. "The anti-pattern is running multiple agents that don’t communicate with each other," he cautioned.
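Why failures rise steeply follows from basic compounding: if each change carries some independent chance of breaking a test, a deployment that batches many changes fails far more often. A toy illustration, where the 2% per-change failure rate is an invented placeholder, not a Shopify figure:

```python
def deploy_failure_prob(per_change_fail: float, n_changes: int) -> float:
    """P(at least one failing change) assuming independent failures."""
    return 1 - (1 - per_change_fail) ** n_changes

# With a hypothetical 2% per-change failure rate:
for n in (1, 10, 50):
    print(n, round(deploy_failure_prob(0.02, n), 3))
# 1  -> 0.02
# 10 -> 0.183
# 50 -> 0.636
```

At a fixed per-change risk, tripling merge volume more than triples the odds that a given deployment fails, which is exactly the pressure on CI/CD that Parakhin describes.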

Kimi K2.6 Matches Opus 4.6 at 95% Lower Cost, and GPT Images 2 Sets an Arena Record

Two model releases dominated practitioner discussion this week.

Moonshot AI’s Kimi K2.6, a one-trillion-parameter open-source coding model, launched with benchmark results competitive against Opus 4.6 and Gemini 3.1 Pro on SWEBench Pro. At roughly $0.60 per million input tokens, its pricing is a fraction of Opus’s. K2.6’s distinguishing feature is long-horizon execution: autonomous coding sessions exceeding twelve hours, over 4,000 tool calls per run, and support for 300 parallel agent swarms. In one demonstration, the model rewrote a financial matching engine over thirteen hours, making more than 1,000 tool calls and boosting throughput by 185%. Its weights are available on Hugging Face. Whether these marathon sessions translate to reliable outcomes at scale remains unproven, yet its cost efficiency alone makes K2.6 a serious contender for batch workloads.

OpenAI’s GPT Images 2 scored a record-breaking 1,512 Elo on LM Arena, a 242-point lead over the previous best, Google’s Nano Banana 2, which powers Gemini’s image generation. The gap is the largest ever recorded in text-to-image benchmarks. Key capabilities include dense text rendering at 2,000-pixel resolution, a "thinking mode" that reasons about prompts before generating, and multilingual text output. As the AI Daily Brief observed, the transformative feature is the GPT Images 2-plus-Codex pipeline: users can generate UI mockups with the image model, then hand them to Codex for implementation—a workflow some users are calling "the single most disruptive AI workflow this year." This directly challenges Anthropic’s Claude Design approach, discussed here last week, which offers a dedicated design tool but lacks native image generation.
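Under the standard Elo expected-score formula, a 242-point gap implies a concrete head-to-head preference rate. A quick check using the Arena figures quoted above:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# GPT Images 2 (1512) vs the previous best at 1512 - 242 = 1270:
print(round(elo_expected_score(1512, 1270), 3))  # 0.801
```

In other words, a 242-point lead corresponds to raters preferring the new model in roughly 80% of pairwise comparisons.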

Workers Who Benefit Most From AI Are Also the Most Anxious About It

Anthropic published results from a survey of 81,000 Claude users about AI’s economic impact, and the findings contradict the standard narrative that productivity gains necessarily translate to worker confidence.

Workers in roles with high Claude exposure report a mean productivity gain of 5.1 on a seven-point scale, driven primarily by scope expansion (48%) and speed improvements (40%). Counterintuitively, those experiencing the largest speedups also express the highest job displacement concerns—three times higher than workers with low exposure. Early-career workers bear the brunt of both dimensions: they report the biggest benefits but also the greatest anxiety, with 60% saying productivity gains accrue to their employers rather than to themselves. Anthropic also launched the Anthropic Economic Index Survey, a monthly survey designed to capture qualitative workforce data more rapidly than traditional labor market indicators.

Separately, Berkeley RDI’s Agentic AI Weekly featured a paper on "Self-Sovereign Agents" from U.C. Berkeley and the National University of Singapore. The paper charts a path from tool-assisted AI to systems that can earn revenue, pay for their own compute, replicate across cloud infrastructure, and operate without human involvement. The researchers identified three self-reinforcing loops—economic (earning and spending), replication (provisioning new instances), and adaptation (self-improvement)—and contend that all the necessary building blocks already exist. Their concern: if illicit activities yield higher returns, a Self-Sovereign Agent could drift toward them not by design, but by survival pressure.

Five Things With 30-Day Clocks

  • GPT 5.5 "Spud" may launch as early as Thursday, April 24th. Sam Altman seemed to confirm the date in a late-night tweet, and multiple sources report A/B testing within ChatGPT. Should it launch, it would mark OpenAI’s first new base model since GPT 5.4, positioned as a halfway point to GPT 6, offering improved reasoning, faster output, and lower cost.
  • Google I/O is about 28 days away. Newer Gemini checkpoints already undergo testing in AI Studio—possibly Gemini 3.2 Pro or 3.5 Pro—alongside a leaked "Agent" feature within Gemini Enterprise. This would directly compete with OpenAI’s Codex workflows for Google Workspace automation.
  • xAI’s partnership with Cursor grants Elon Musk’s AI company access to arguably the world’s best coding agent dataset, plus an immediate outlet for Colossus cluster capacity. The deal’s structure—$10 billion now, with an option to acquire Cursor for $60 billion later this year—means the full acquisition decision will arrive within months.
  • Google’s TPU 8t and 8i, announced this week, split the company’s eighth-generation TPU into two specialized chips: the 8i for inference, optimized for agent reasoning and multi-step workflows, and the 8t for training, designed for single-pool memory on massive models. Their availability will determine whether Google can capitalize on Anthropic’s infrastructure gap.
  • Monterey Park, California, became the first city in the state to permanently ban data centers, with a June 2nd ballot measure that would enshrine the ban by popular vote. If it passes, American citizens will for the first time directly vote to ban data center construction—a test case as AI infrastructure buildouts increasingly encounter community resistance nationwide.
