DEV Community: Max Quimby

Russia's Oreshnik on Kyiv: The Strike Nobody Could Stop

Max Quimby — Tue, 26 May 2026 03:53:24 +0000

On the night of May 24, Russia fired an Oreshnik intermediate-range hypersonic ballistic missile at Kyiv — the first time the weapon has been used against the Ukrainian capital. Six MIRV warheads, each carrying up to six sub-munitions, traveling at speeds above Mach 10. Western air defense systems currently in Ukraine cannot intercept it. DW's reporting makes the operational point bluntly: Ukraine knew the strike was coming and could not stop it.

📖 Read the full version with embedded sources and YouTube footage on The Arc of Power →

This is the most consequential military development of the week and it barely registered in US legacy media — because the "deal-making president" narrative cannot accommodate a story where Russia escalates and Western technology has no answer. What follows is not a recap. It is four lessons on why this strike matters more than the news cycle suggests.

Lesson One: The Air Defense Era Just Ended for Intermediate-Range Threats

The Oreshnik is the operational deployment of a weapons class NATO's deployed air defense was not built to intercept. Per CSIS Missile Threat and Wikipedia:

Range: 3,500–5,470 km. Reaches most European capitals from Russian territory.
Speed: above Mach 10. Re-entry velocity in the hypersonic regime where most deployed terminal-defense interceptors lose engagement geometry.
Payload: six MIRV warheads, each capable of dispensing six sub-munitions. A single missile produces up to 36 independent terminal threats arriving on different trajectories within seconds of each other.
First combat use: November 21, 2024 against Dnipro — "the first time a MIRV was used in combat."
January 9, 2026: against Lviv.
May 24, 2026: first strike on Kyiv.

The defensive answer to a MIRV is not a single interceptor. It is simultaneous coverage of six warhead trajectories, and of up to 36 sub-munition tracks once the warheads dispense, at hypersonic re-entry speeds. The Patriot, IRIS-T, and NASAMS systems Ukraine fields are designed against a smaller-than-MIRV threat envelope. SAMP/T NG does better but is nowhere near the density required.
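The payload arithmetic is simple enough to write down. A minimal sketch, using only the figures quoted in this section (six warheads, up to six sub-munitions each); the function name and the salvo sizes are illustrative and do not come from any source.

```python
# Upper-bound count of terminal objects, using the figures quoted in this section.
WARHEADS_PER_MISSILE = 6       # MIRV warheads per missile, as quoted
SUBMUNITIONS_PER_WARHEAD = 6   # "up to six" sub-munitions per warhead


def max_terminal_threats(missiles: int) -> int:
    """Return the upper bound on independent terminal objects for a salvo."""
    if missiles < 0:
        raise ValueError("missiles must be non-negative")
    return missiles * WARHEADS_PER_MISSILE * SUBMUNITIONS_PER_WARHEAD


# Hypothetical salvo sizes, chosen only to show how the count scales.
for salvo in (1, 2, 4):
    print(salvo, max_terminal_threats(salvo))
```

The count scales linearly with salvo size, which is why the section frames the problem as interceptor density rather than interceptor quality.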

Berlin, Warsaw, Stockholm, Prague, Helsinki, Bucharest, Riga, Tallinn, and Vilnius are now in the same defensive posture as Kyiv was on May 23. None of them have a deployed interceptor today that meaningfully changes the math.

⚠️ The doctrinal break. For two decades, Western air-defense planning has assumed that strategic (nuclear-weight) MIRV threats and theater (conventional) ballistic threats were separable problems. The Oreshnik fuses them — a conventional-payload missile delivered via a MIRV bus at hypersonic re-entry speeds, against tactical targets, on an operational timetable. The planning categories don't hold anymore. It will take 5–10 years to field a deployed answer at scale; the strikes are happening this week.

Lesson Two: Putin Just Re-priced the Negotiating Table

On Polymarket's "Ukraine signs peace deal with Russia before 2027", the price was 25% a week ago and is now sitting at 30% — up 5 points, against the Oreshnik news cycle. The reading is not that the strike makes peace more likely. It is that the strike makes Ukraine's negotiating position weaker, which makes a deal — on Russia's terms — more likely.

Three propositions worth holding together:

Russia has a weapon Ukraine cannot defend against, deployed at will.
Western military aid cycles take 12–18 months to deploy at scale.
The political tolerance for that timeline in Ukraine's Western backers is declining.

If those three are jointly true, the Russian negotiating position improves every week the Oreshnik is operational and the West does not have a deployed counter.

Lesson Three: This Is the Summer Offensive Signaling Itself

Reuters' May 25 reporting asked the question directly: will Russia launch a major offensive on Ukraine this summer? The signals from the past 14 days that should be read as a cluster:

The Oreshnik on Kyiv is the demonstration. It is also a political signal — we are willing to spend a strategic-class weapon on a city to make a point.
A "massive aerial attack" earlier in the week confirms Russian missile and drone production has scaled to a sustained-tempo cadence Ukraine cannot match defensively.
France 24 named the political subtext: "Self-destructive Vladimir Putin loses home support as war rages on." If Putin is losing home support, a summer offensive is no longer optional — it is the political demand.
Iran absorbing US diplomatic attention has displaced Ukraine from the top of the US security agenda. The window is favorable for Russian initiative.

These are the conditions Russia has historically used to commit to a summer push, not the conditions of a power preparing to talk.

Lesson Four: NATO's Compute-and-Capacity Posture Is the Real Constraint

The West does not lack the engineering knowledge to build a counter to the Oreshnik. It lacks the deployed industrial capacity to do it on the operational timeline that matters. SAMP/T NG and PAC-3 MSE production rates are measurable in dozens per year per major manufacturing line.

Even if a credible MIRV-capable, hypersonic-engagement interceptor entered production tomorrow — and the candidates on European drawing boards are 24–36 months from IOC — the footprint required to make Kyiv defensible against the Oreshnik is eight to ten years of current European production, sustained.

That is not a budget problem. It is a factory problem. It is the same factory problem the data-center buildout is hitting: physical-capacity constraints have replaced finance constraints as the binding limit on Western strategic ambition. The Microsoft Caledonia data-center cancellation last week and European interceptor production lines maxed out at low-dozens rates are the same story expressed in different industries.

ℹ️ The under-reported parallel. In a 14-day window, Russia demonstrated MIRV strikes on Kyiv, the Pope released an encyclical naming concentration of compute as dehumanization, and Microsoft pulled a 244-acre US data center under community pushback. Three independent events; one underlying pattern. The capacity to build what modernity requires — interceptors, data centers, hyperscale storage — is bottlenecked at the physical layer simultaneously.

What Markets Are Telling Us

"Ukraine signs peace deal with Russia before 2027" — 30%, up 5 points week-on-week.
"Russia × Ukraine ceasefire before 2027" — 34%, up roughly 3 points.
"Will Ukraine agree to cede territory before 2027" — trading higher than the headline market, implying the "deal" the market is pricing is one with territorial concessions.

The composite: markets are pricing in a settlement on Russian terms. The Oreshnik strike accelerated that trade.

What to Watch in the Next Six Weeks

A second Oreshnik strike on Kyiv within 30 days — the weapons class crosses from "demonstration" to "operational tempo."
A Ukrainian deep-strike retaliation against a Russian production facility (Kapustin Yar, Votkinsk).
Polymarket cession-of-territory market above 50% — the threshold at which traders collectively believe the settlement will be on Russian terms by default.
A European announcement on accelerated interceptor production.
US strategic-aid posture toward Ukraine in June — if the Iran story resolves, where does the diplomatic capacity flow?

The Forecast

A negotiated settlement of the Russo-Ukrainian war during 2026 is now more likely than it was a week ago, and that is a Russian win, not a Western one. The Oreshnik strike on Kyiv changed the calculus on the negotiating table without changing a single line on the map.

The gap between political guarantee and physical capacity is the real news of May 24, 2026. The Iran pendulum will swing back. The Pope's encyclical will be cited for the next decade. The Colbert cancellation will be a footnote. But the doctrinal break between Western air-defense planning and Russian intermediate-range strike capability is the kind of event that gets a chapter in the strategic history of this decade.

The Polymarket numbers are not predicting peace. They are pricing a settlement. There is a difference.

Originally published at The Arc of Power

Pope Leo's AI Encyclical: An Enterprise Governance Decoder

Max Quimby — Tue, 26 May 2026 03:44:43 +0000

The single most-upvoted r/technology post today is not a product launch, a benchmark, or a Big Tech earnings note. It's the Pope. The thread is sitting at 12,505 upvotes and 351 comments as of this writing. Pope Leo XIV released his first encyclical, Magnifica Humanitas — "On Safeguarding the Human Person in the Time of Artificial Intelligence" — at the Vatican Synod Hall this morning. Christopher Olah, co-founder of Anthropic, stood beside him at the launch and welcomed the document.

📖 Read the full version with embedded sources and YouTube context on ComputeLeap →

If you skim the headlines — "Pope warns of opaque algorithms," "Pope calls to disarm AI" — the encyclical sounds like a moral broadside, the kind of document an enterprise governance team can safely file under "interesting, not actionable." That would be a mistake. Magnifica Humanitas is the broadest legitimizing voice yet for the AI-governance wave that's been quietly assembling around your existing security stack. The Pope's warnings map almost line-by-line to the OWASP Agentic Top 10 and to the regulatory framework you're going to be audited against starting in August. This piece is the decoder ring.

The Encyclical, Quickly

Pope Leo XIV signed Magnifica Humanitas on May 15; the Holy See released it publicly today. It is a 235-page document, framed in the social-teaching tradition that runs from Rerum Novarum (1891) through Laudato Si' (2015). What's new is that the subject is AI specifically.

The six core arguments worth knowing for governance purposes:

"Opaque algorithms" controlled by "a few" private companies bring "new forms of dehumanization," per Variety's coverage.
"Technology is never neutral." Directly quoted: "Technology is never neutral, because it takes on the characteristics of those who devise, finance, regulate, and use it."
AI must be "disarmed" — removed from military and pure economic-extraction use cases.
Labor dignity is the central material concern. Per Vatican News: "AI frequently forces workers to adapt to the speed and demands of machines, rather than machines being designed to support those who work."
Data is a "common good" that cannot be morally neutral, per Decrypt.
"Robust legal frameworks, independent oversight, informed users and a political system that does not abdicate its responsibility." The normative ask.

ℹ️ The institutional weight here. Catholic social teaching has a 130-year track record of supplying reference frameworks to European and Latin American regulators. The EU AI Act's worker-protection language draws on the same intellectual lineage. Treating Magnifica Humanitas as "religious commentary you can skip" is a category error.

Where the Encyclical Lands in the Existing Governance Stack

Warning 1: Opaque algorithms → Goal hijacking, identity abuse, memory poisoning

The Pope's "opaque algorithms" concern is precisely what OWASP's Top 10 for Agentic Applications covers in technical language. The OWASP taxonomy names the specific failure modes: goal hijacking (agent acts on a different objective than intended), identity abuse (authentication delegated to opaque inputs), memory poisoning (stored context becomes a manipulation vector), and cascading failures. The Microsoft Agent Governance Toolkit, released April 3, 2026, is the first toolkit to address all 10 with deterministic sub-millisecond enforcement.

What "opaque" means operationally: every agent decision needs a deterministic trace, every tool call needs an identity-bound capability check, every memory write needs majority-voted verification.

Warning 2: Technology is never neutral → Plugin signing, supply chain integrity

The Pope's "characteristics of those who devise, finance, regulate" sentence is, in security-engineering terms, a statement about supply chain provenance. The toolkit's answer is Ed25519 plugin signing and manifest verification. The community-built mukul975/Anthropic-Cybersecurity-Skills repo (9K stars on GitHub) maps 754 structured cybersecurity skills to MITRE ATT&CK, NIST CSF 2.0, MITRE ATLAS, D3FEND, and NIST AI RMF. The technical mechanism enforces the philosophical claim.

Warning 3: "Disarm" AI → Acceptable use policy enforcement at the runtime layer

The technical analog is policy engines that block specific tool combinations based on declared use cases. If your model is licensed for "internal analytics" only, the policy engine should refuse calls that combine outbound messaging + customer PII + an inference about credit-worthiness. The Microsoft toolkit's policy engine supports this kind of compound rule via semantic intent classification.
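A compound rule of this kind can be sketched in a few lines. This is a minimal illustration, not the Microsoft toolkit's API: the capability tags, the `CompoundRule` class, and `is_allowed` are all hypothetical names, and static tags stand in for the semantic intent classification the toolkit is described as using.

```python
from dataclasses import dataclass

# Hypothetical capability tags; a real deployment would define its own taxonomy.
OUTBOUND_MESSAGING = "outbound_messaging"
CUSTOMER_PII = "customer_pii"
CREDIT_INFERENCE = "credit_inference"


@dataclass(frozen=True)
class CompoundRule:
    """Deny a request when ALL listed capabilities appear together."""
    declared_use: str
    forbidden_together: frozenset


# One rule, mirroring the "internal analytics" example in the text.
RULES = [
    CompoundRule(
        declared_use="internal_analytics",
        forbidden_together=frozenset(
            {OUTBOUND_MESSAGING, CUSTOMER_PII, CREDIT_INFERENCE}
        ),
    ),
]


def is_allowed(declared_use: str, requested: set) -> bool:
    """Return False if any rule for this use case is fully matched."""
    for rule in RULES:
        if rule.declared_use == declared_use and rule.forbidden_together <= requested:
            return False
    return True


# A single capability passes; the full forbidden combination does not.
print(is_allowed("internal_analytics", {CUSTOMER_PII}))
print(is_allowed("internal_analytics",
                 {OUTBOUND_MESSAGING, CUSTOMER_PII, CREDIT_INFERENCE}))
```

The rule fires only on the full combination, so each capability stays usable on its own. That is the point of a compound rule as opposed to a per-tool blocklist.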

Warning 4: Labor dignity → The hidden compliance audit

This is where the encyclical lands hardest. The Pope's claim that "AI frequently forces workers to adapt to the speed and demands of machines" is the same factual claim being made in the Toyota/Alabama r/technology post at 12,306 upvotes today.

The enterprise question: does your deployment include observability that distinguishes "agent did the work" from "human approved the work"? If you can't separate those metrics, you can't honestly answer to a worker-protection auditor — and EU AI Act high-risk obligations take effect August 2026, with Colorado's AI Act effective June 2026.
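One way to make that split measurable is to record who executed each task and who, if anyone, approved it. The sketch below assumes a hypothetical audit-log shape; the field names (`executed_by`, `approved_by`) and the three category labels are illustrative, not drawn from any regulation or toolkit.

```python
from collections import Counter

# Hypothetical audit-log records; field names are illustrative.
AUDIT_LOG = [
    {"task": "invoice-1", "executed_by": "agent", "approved_by": None},
    {"task": "invoice-2", "executed_by": "agent", "approved_by": "j.doe"},
    {"task": "memo-7", "executed_by": "human", "approved_by": "j.doe"},
]


def classify(record: dict) -> str:
    """Label a record by who did the work and whether a human signed off."""
    if record["executed_by"] == "human":
        return "human_executed"
    if record["approved_by"] is not None:
        return "agent_executed_human_approved"
    return "agent_executed_unreviewed"


def split_counts(log: list) -> Counter:
    """Count records per category so the split can be reported as a metric."""
    return Counter(classify(record) for record in log)


print(split_counts(AUDIT_LOG))
```

If the log cannot populate both fields for every task, the split cannot be computed, which is the gap the paragraph above describes.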

Warning 5: Data as common good → Provenance, not just consent

"Common good" data, operationally, means provenance tracking that survives downstream use. The "AI is a black box" framing is exactly what Magnifica Humanitas rejects. The technical answer is the same answer the EU AI Act demands and NIST AI RMF's documentation pillar specifies. There is no daylight between the religious doctrine and the regulatory framework on this point.

Warning 6: Legal frameworks → August 2026 is closer than your roadmap thinks

The encyclical's normative ask aligns precisely with the existing regulatory calendar. EU AI Act high-risk obligations: August 2026. Colorado AI Act: June 2026. The toolkit's Agent Compliance module already maps capability evidence to those frameworks plus HIPAA and SOC2. Pope Leo XIV is not creating a new compliance burden; he's adding moral weight to one already on your calendar.

The Compute-Concentration Subtext

There's a second story today on Hacker News that reads as a perfect parallel signal:

Microsoft pulled the plug on a 244-acre data center in Caledonia after community pushback. The Pope's "opaque algorithms controlled by a few" framing and the Caledonia story share the same underlying concern: AI infrastructure has reached the scale where it warrants community-scale governance, not just technical governance. Compute-siting strategy is now a stakeholder-management strategy.

⚠️ Contrarian Corner: "The Pope's AI Encyclical Isn't Really About AI"

TechCrunch's read is the most defensible counter-argument. Their argument: the encyclical uses AI as a lens to examine older, systemic problems — power concentration in any technological era, erosion of democratic processes, structural inequality. The practical-engineering response: ignore the religious framing and just adopt the frameworks (OWASP, NIST AI RMF, EU AI Act compliance) you already needed.

We think this undersells the legitimization effect. The Vatican's institutional weight makes the existing governance frameworks politically defensible in jurisdictions where they were previously fringe, particularly Latin America and Southern and Central Europe. It's a calendar shift, not a doctrinal shift.

What Enterprise Governance Teams Should Do This Week

Map your existing AI agents against the OWASP Agentic Top 10 today. The Microsoft toolkit's QUICKSTART.md is the lowest-friction path.
Audit the "agent did it vs. human approved it" split. If you can't tell those apart in your logs, you can't defend a labor-impact claim to a regulator.
Inventory plugin / tool provenance. Every tool your agents can call needs a signer of record, a manifest, and a trust score.
Treat the encyclical as a stakeholder-communications artifact. Boards, ethics committees, customer trust teams will all be asked about Magnifica Humanitas inside two weeks.
Plan compute siting against social-license risk. The Caledonia pullback is not a one-off.

The Twelve-Month Forecast

By mid-2027, Magnifica Humanitas will not be remembered as the moment AI policy was decided. But it will be remembered as the moment AI governance crossed from "interesting" to "institutionally legitimized" — the moment when the political cost of not having a governance framework went up sharply.

The Pope did not name names. He didn't have to. The phrase "controlled by a few" reads, in current context, against a 99%-of-prediction-market-share monopoly framing for Anthropic, against the OpenAI valuation overhang, against the Big Five hyperscaler capacity dominance, against the Caledonia pushback. The audience for the encyclical knows who is meant.

The frameworks are ready. The tools exist. The compliance calendar is set. Magnifica Humanitas turned the political subtext into the political text. Use that.

Originally published at ComputeLeap

Claude for Small Business: 382K Day-One Buyer's Guide

Max Quimby — Tue, 26 May 2026 03:36:00 +0000

The headline number making the rounds on r/ClaudeAI yesterday is real: Anthropic's just-shipped Small Business bundle pulled around 382,000 downloads on day one, according to the community thread that landed at 1,685 upvotes. That number is doing a lot of work in the agent-platform discourse this week — it's the single cleanest signal we have that the SMB-agent market isn't a "someday" TAM.

📖 Read the full version with all embedded screenshots and YouTube context on AgentConn →

But the more interesting thing isn't the download number. It's what Anthropic actually shipped inside the bundle — and what it deliberately didn't ship.

What's Actually In the Box (Spoiler: It's Not 31 Skills)

Walk into the discourse cold and you'll hear "31 skills" everywhere — that's the framing in the r/ClaudeAI thread, in the YouTube walkthrough titled "Anthropic Just Dropped Claude for Small Businesses (31 Skills)", and in Charlie Hills's LinkedIn install guide. Charlie's the one who first counted the slash-command library and published the infographic — that's where the 31 came from.

But the official Anthropic announcement on May 13 is more disciplined. It says 15 ready-to-run agentic workflows plus 15 reusable skills — and Anthropic is careful to keep these conceptually separate. Workflows are SOP-shaped: payroll planning, monthly close, invoice chasing, lead triage, contract review, campaign builder. Skills are the lower-level building blocks they compose out of: cash-flow forecasting, margin analysis, customer sentiment, hiring packet builder, tax prep.

The 31st item — depending on how you count — is the bundle wrapper itself: the single Cowork toggle that turns the whole package on. Spicy Advisory's enumeration is the cleanest reconciliation of the official 15 + 15 against the community's 31 count.

The reason the distinction matters: 15 + 15 + 7 connectors is the shape of a platform play, not a feature dump. Each workflow consumes multiple skills. Each skill talks to one or more of the seven connectors — QuickBooks, PayPal, HubSpot, Canva, DocuSign, Google Workspace, Microsoft 365 (plus Slack). The 382K downloads don't measure 382K people installing 31 things. They measure 382K people flipping a single Cowork toggle.

💡 Buyer takeaway: when you see "31 skills" in the discourse, mentally translate to "15 SOP workflows + 15 reusable skills, all gated behind one Cowork toggle and seven connectors." That's the shape of the thing you're actually evaluating.

The TAM Signal: Why 382K Day-One Matters

Anthropic's official framing in the launch is that small businesses are 44% of U.S. GDP and nearly half the private-sector workforce, per the TechCrunch coverage. The head of SMB explicitly named the segment Anthropic is going after: "the 15-person HVAC company, the 30-person landscaper, the 50-person real estate brokerage." That's a long way from the Anthropic of two years ago — the one whose pricing chart implied a $100K+ enterprise contract was the minimum-viable customer.

The 382K-downloads number, set against that framing, is the cleanest market-validation signal the agent space has produced this quarter:

It's bigger than the entire daily active developer count of most agent harnesses combined.
It dwarfs every prior SMB-agent launch. Salesforce's Agentforce for SMB, Microsoft's Copilot for Small Business, and the various Zapier/Make AI-agent rollouts have all reported "early traction" without a number this large in a comparable window.
It happened in May, off a single-toggle install flow with no app-store marketing, no influencer push — just an announcement and a tour.

The implication for anyone building in the SMB-agent lane: the demand was here all along; what was missing was a credible enough provider to convince an owner-operator to flip the toggle. Anthropic is now that provider.

What Anthropic Deliberately Didn't Build

This is where the buyer's-perspective unpack gets interesting:

1. No SMB-specific dashboard. The bundle lives inside Claude Cowork — the same interface enterprise teams use. The bet is that the surface where SMB owners experience agents is the chat interface, plus their existing SaaS. Not a new pane of glass.

2. No agent personas. Compare this to Salesforce Agentforce or the various "AI receptionist / AI bookkeeper" pitches. Skills are nouns ("invoice chaser") not characters ("meet Penny, your AI bookkeeper"). That's a different bet about how SMBs will mentally model AI — as utilities, not employees.

3. No new pricing tier. Per the Inc. coverage, there is no extra charge above Claude Pro ($20/month) or Max ($100–$200/month). The 382K downloads happened with zero new revenue line attached to the bundle itself.

4. No vertical agents. Anthropic chose horizontal SMB skills (payroll, marketing, sales) rather than verticalized ones (HVAC, landscaping, real estate brokerage). That's a deliberate "platform, not application" call.

5. No agent-on-agent commerce hook. The SMB bundle does not hook into Anthropic's test marketplace for agent-on-agent commerce at launch. That play is being saved for later.

ℹ️ Read these five "didn't build" choices together and you can see the architecture: Anthropic is establishing Skills as the bookable abstraction layer for SMB software. The marketplace, verticals, agent personas, agent-to-agent commerce — those are future upsell vectors. Today's bundle is the substrate.

The 15 Workflows, Mapped to the Seven Connectors

Per the Anthropic announcement and the TechInformed walkthrough:

Finance (5): Payroll planning, monthly close/reconciliation, invoice chasing, margin analyzer, tax-season organizer. Primary connector: QuickBooks. Secondary: PayPal.
Operations (3): Business pulse, month-end prepper, contract reviewer. Primary connectors: QuickBooks + DocuSign.
Sales (2): Lead triager, campaign runner. Primary connectors: HubSpot + Canva.
Marketing (2): Content strategist, campaign runner (overlap). Primary connectors: Canva + HubSpot.
HR (2): Hiring packet builder, onboarding planner. Primary connectors: Google Workspace + Microsoft 365.
Customer Service (1): Customer sentiment / pulse. Primary connector: HubSpot.
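The mapping above is easier to query as a data structure. This sketch transcribes the list as written, keeping primary connectors only; the dictionary layout and the `workflows_for` helper are illustrative choices, not anything Anthropic publishes.

```python
# Category -> (workflows, primary connectors), transcribed from the list above.
# Secondary connectors (PayPal for Finance) are left out to keep the sketch small.
WORKFLOWS = {
    "Finance": (
        ["payroll planning", "monthly close/reconciliation", "invoice chasing",
         "margin analyzer", "tax-season organizer"],
        ["QuickBooks"],
    ),
    "Operations": (
        ["business pulse", "month-end prepper", "contract reviewer"],
        ["QuickBooks", "DocuSign"],
    ),
    "Sales": (["lead triager", "campaign runner"], ["HubSpot", "Canva"]),
    "Marketing": (["content strategist", "campaign runner"], ["Canva", "HubSpot"]),
    "HR": (
        ["hiring packet builder", "onboarding planner"],
        ["Google Workspace", "Microsoft 365"],
    ),
    "Customer Service": (["customer sentiment / pulse"], ["HubSpot"]),
}


def workflows_for(connector: str) -> list:
    """Return the sorted, de-duplicated workflows that list this connector as primary."""
    return sorted({
        name
        for names, connectors in WORKFLOWS.values()
        if connector in connectors
        for name in names
    })


total_listed = sum(len(names) for names, _ in WORKFLOWS.values())
unique_names = {name for names, _ in WORKFLOWS.values() for name in names}

print(total_listed, len(unique_names))
print(workflows_for("DocuSign"))
```

Tallying the list this way shows that the per-category counts reach 15 only because the campaign runner appears under both Sales and Marketing.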

The two most expensive workflows to replicate from scratch are monthly close (QuickBooks + PayPal + CSV export pipeline) and invoice chasing (QuickBooks billing + PayPal settlement reconciliation + outbound message flow gated by approval). Both are full-day-per-month operations at most SMBs.

The Open-Standard Play in the Background

Two weeks before the SMB bundle, Anthropic released Agent Skills as an open standard with a partner-built directory featuring Atlassian, Figma, Canva, Stripe, Notion, and Zapier. The anthropics/skills GitHub repository is the canonical reference.

The SMB bundle is the first official Anthropic-branded skill pack built to that open standard. Every skill in it is technically portable: a competitor harness can run the skill spec. The 382K downloads are downloads of the Cowork toggle; the underlying skill files are open-spec. Anthropic is betting that ecosystem growth beats proprietary lock-in.

The GitHub Trending board today corroborates this. Of the top fifteen repos, at least four are explicitly Skills-targeted: multica-ai/andrej-karpathy-skills (154K stars), affaan-m/ECC (192K stars), mukul975/Anthropic-Cybersecurity-Skills (9K), and the broader multica-ai/multica managed-agents platform.

The Vendor Landscape This Lands Into

Already in the SMB stack and getting commoditized:

Zapier — Anthropic's workflows substitute for many Zapier zaps wired into QuickBooks + HubSpot + Canva.
Make / n8n — same substitution risk for the lower-end use cases.
Bench (bookkeeping) and similar AI-bookkeeping point solutions.

Adjacent but not directly hit (yet):

Salesforce Agentforce — moving toward SMB but priced for the mid-market and up. The 50-person real estate brokerage Anthropic is naming is below the Salesforce floor.
Microsoft 365 Copilot for Small Business — bundled with M365, but Microsoft-stack-only.
The skills-marketplace cohort — Composio, Agensi, LobeHub, SkillsMP. Agensi's positioning piece is explicit: Anthropic's directory is small and curated; community marketplaces are larger. The 382K downloads suggest Anthropic-curated quality beats community-curated breadth for the SMB owner-operator persona.

Vertical SMB players (HVAC, dental, salons): ServiceTitan, Jobber, Square Appointments, Mindbody. None of these has rolled out a credible cross-tool AI workflow layer. The SMB bundle does not compete with them directly yet. The moment Anthropic ships a vertical skill pack, that daylight closes.

⚠️ Contrarian Corner: The 382K Number is Doing Too Much Work

Three reasons to discount the signal before committing a roadmap to it:

1. Downloads are not active usage. The Cowork toggle counts as a download the moment a user enables the bundle. No public metric exists for how many of those 382K toggles produced a completed workflow run.

2. The slash-command UX favors one-offs over assembly. Per the long-tail of Claude usage reviews, the predominant SMB use of Claude has been single-shot ("draft this email"). Multi-step workflows require trust across multiple steps and connector authentications — a much higher behavioral lift than the download metric reveals.

3. Human-approval gates have historically killed SMB automation. Every outbound action waits for a human click. The Register's launch coverage framed this skeptically — the entire pitch of SMB automation has historically been "I don't have time to approve every email," but the bundle reintroduces exactly that approval step for safety reasons.

The 382K number is real. The active-monthly-workflow number is the one to watch in the August earnings cycle.

What Builders Should Do About This

Treat horizontal SMB skills as commoditized as of May 13, 2026. Don't ship a startup whose moat is "we automate invoice chasing." The moat just collapsed.
Ship verticals. One question from the five-question test: does it require domain knowledge Anthropic's general-purpose skill won't have? HVAC, dental practices, and real-estate brokerages are where to play.
Ship connectors Anthropic hasn't, and won't. ServiceTitan, Jobber, Square Appointments, Mindbody. A skill bundle that bridges Jobber → Anthropic skills → QuickBooks has a meaningful place.
Stop optimizing for downloads. Start optimizing for completed workflows. The number that will matter in twelve months is completed monthly workflows per active user.
Read the open-standard play correctly. The substrate wins by being open; the distribution wins by being closed (Cowork, partner network, the 10-city tour, the AI Fluency course).

The Twelve-Month Prediction

By mid-2027, the 382K-day-one number will be remembered as the moment the SMB-agent TAM stopped being a slide deck and started being a P&L line. But the more important inflection is structural: the surface where SMBs experience agents was decided this month, and it is "skills inside existing SaaS, gated by approval, behind one toggle." That UX choice will calcify into the default.

The next thing to watch is the active-monthly-workflows number Anthropic discloses (or pointedly doesn't) at its next investor update or AI Engineer talk. Downloads opened the door. Whether SMBs walk through it — and how many times per month — is the next question.

If the active-monthly number comes in above ~30% of downloads, the SMB-agent market is real, and Anthropic has it. If it comes in below ~10%, the contrarian corner above was the right read. The honest answer is somewhere between those poles, and we'll know inside two quarters.
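The two thresholds can be stated as a small decision rule. A minimal sketch, treating the approximate "~30%" and "~10%" figures as hard cutoffs (an assumption made only for illustration); the function name and the sample user counts are hypothetical.

```python
UPPER = 0.30  # "~30% of downloads", treated here as an exact cutoff
LOWER = 0.10  # "~10% of downloads", treated here as an exact cutoff


def read_activation(active_monthly_users: int, downloads: int) -> str:
    """Classify an active-to-download ratio against the two thresholds."""
    if downloads <= 0:
        raise ValueError("downloads must be positive")
    ratio = active_monthly_users / downloads
    if ratio > UPPER:
        return "market is real"
    if ratio < LOWER:
        return "contrarian read was right"
    return "in between"


DAY_ONE_DOWNLOADS = 382_000  # the day-one figure quoted in the article

# Hypothetical active-user counts, chosen to land in each band.
for active in (120_000, 60_000, 30_000):
    print(active, read_activation(active, DAY_ONE_DOWNLOADS))
```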

Either way, the 382K downloads have already done the structural work: every other agent vendor's roadmap has to be re-justified against the question "why aren't your customers just toggling on the Anthropic bundle?" That question didn't exist on May 12. It does today.

Originally published at AgentConn

codegraph: The Missing Knowledge Graph for 5 Coding Agents

Max Quimby — Sun, 24 May 2026 03:34:51 +0000

📖 Read the full version with embedded sources on AgentConn →

colbymchenry/codegraph added 2,434 stars in 24 hours and rocketed to #2 on GitHub Trending on May 23. It's a local-first, multi-agent code knowledge graph — built specifically for Claude Code, Codex CLI, Cursor, OpenCode, and Hermes Agent — with median benchmarks of 59% fewer tokens, 49% faster responses, and 70% fewer tool calls across seven real-world codebases.

That ranking matters because of the company it's keeping. The same trending day, multica-ai/andrej-karpathy-skills gained 3,372 stars at #1, Lum1104/Understand-Anything gained 2,331 stars at #3 with the same knowledge-graph thesis, and NousResearch/hermes-agent gained 1,334 stars. Five of the day's top ten trending repos are coding-agent infrastructure.

What codegraph Actually Builds

Most coding agents waste 60–70% of their tokens re-discovering code structure on every task. codegraph replaces that with a one-time, local indexing pass:

The pipeline: Tree-sitter parses source → language-specific queries extract nodes (functions, classes, files) and edges (calls, imports, inheritance) → SQLite (.codegraph/codegraph.db) stores it with FTS5 full-text search → post-extraction reference resolution links calls to definitions → native OS file watchers keep it current with 2-second debouncing.

The agent doesn't read files to understand structure; it queries the graph. The graph already knows parseConfig is called by 3 callers, that it imports from ./yaml-utils.ts, and that its definition lives at src/config/parser.ts:147. The agent reads only the file it's editing.
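To make "it queries the graph" concrete, here is a minimal sketch of a nodes-and-edges store in SQLite with a callers lookup. The schema, the `callers_of` helper, and the three caller entries are hypothetical and are not codegraph's actual schema or data; only parseConfig and src/config/parser.ts:147 come from the text, and FTS5 search is left out.

```python
import sqlite3

# In-memory database standing in for .codegraph/codegraph.db (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,      -- function, class, file
    name TEXT NOT NULL,
    file TEXT NOT NULL,
    line INTEGER
);
CREATE TABLE edges (
    src  INTEGER NOT NULL REFERENCES nodes(id),
    dst  INTEGER NOT NULL REFERENCES nodes(id),
    kind TEXT NOT NULL       -- calls, imports, inherits
);
""")

# parseConfig is from the article; the three callers are made up for the demo.
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", [
    (1, "function", "parseConfig", "src/config/parser.ts", 147),
    (2, "function", "loadApp", "src/app.ts", 12),
    (3, "function", "runCli", "src/cli.ts", 30),
    (4, "function", "reloadConfig", "src/watch.ts", 58),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (2, 1, "calls"), (3, 1, "calls"), (4, 1, "calls"),
])


def callers_of(name: str) -> list:
    """Return (name, file, line) for every node with a 'calls' edge into `name`."""
    return conn.execute("""
        SELECT caller.name, caller.file, caller.line
        FROM edges
        JOIN nodes AS callee ON callee.id = edges.dst
        JOIN nodes AS caller ON caller.id = edges.src
        WHERE edges.kind = 'calls' AND callee.name = ?
        ORDER BY caller.name
    """, (name,)).fetchall()


for row in callers_of("parseConfig"):
    print(row)
```

The lookup is a single indexed join rather than a scan of source files, which is where the token savings described above would come from.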

The MCP Surface — Why Multi-Agent Works

codegraph exposes itself as an MCP server with nine tools: codegraph_search, codegraph_context, codegraph_callers, codegraph_callees, codegraph_impact, codegraph_explore, codegraph_node, codegraph_files, and codegraph_status. Because the surface is MCP, the same .codegraph/codegraph.db works for any agent that speaks the protocol. Today that's Claude Code, Codex CLI, Cursor, OpenCode, and Hermes Agent.
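Since MCP tool invocations travel as JSON-RPC 2.0 messages with the method `tools/call`, any client can build the same request. The sketch below shows the message shape only; the `tools_call` helper is hypothetical, and the argument key `symbol` is a guess, since the article does not give the tools' input schemas.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per call


def tools_call(tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# The tool name is from the article; the argument key is assumed.
print(tools_call("codegraph_callers", {"symbol": "parseConfig"}))
```

Nothing in the message is specific to one agent, which is why the same index can serve several of them.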

None of them have to ship their own indexer. None of them have to re-invent semantic exploration. Switch agents, keep the graph.

The Swift Compiler Benchmark

The most-quoted figure: 25,874 files, 272,898 nodes, indexed in under 4 minutes. On a complex question against that index, an agent answered with 6 explore calls and zero file reads in 35 seconds.

The same question through vanilla Claude Code would routinely take 90–180 seconds, 25–40 tool calls, and 200K–400K tokens of context. codegraph compresses it to a half-minute conversation that fits in context without truncation.

The Parallel Implementations Tell the Real Story

If codegraph were a one-off, the trending page would have one knowledge-graph tool. It has at least three (codegraph, Understand-Anything, code-review-graph). Three independent implementations of the same primitive shipped within the same trending window.

That's the signal. The agent ecosystem has collectively discovered that the bottleneck on coding agents stopped being model capability and became context efficiency, and that pre-indexing is the obvious answer. It's the same lesson search engines learned in 1998.

Where codegraph Sits in the Multi-Agent Stack

For production agent setups, codegraph belongs in your stack when your codebase exceeds 5,000 files, you're running multi-agent workflows, and your privacy posture requires 100% local processing.

The Buy/Build/Wait Read

Should you adopt today? If your codebase is over 5,000 files and you're paying Claude Code/Codex bills above $200/month per developer, yes. Install via npx @colbymchenry/codegraph and codegraph init -i.

Should you bet on the category? Yes, but loosely. Anthropic, OpenAI, and Cursor will each ship their own native indexer within 6 months. MCP-based tools such as codegraph have a path to surviving that squeeze, because they work across agents rather than inside one vendor's product.

The deeper bet — that something like a pre-indexed code knowledge graph becomes table stakes for serious coding agents in 2026 — is the safe one. Three independent implementations on GitHub Trending in the same week is the market making that call out loud.

Originally published at AgentConn.

What Microsoft Canceling Claude Code Means for Enterprise AI

Max Quimby — Sun, 24 May 2026 03:34:24 +0000

📖 Read the full version with charts and embedded sources on ComputeLeap →

On May 22, The Verge reported that Microsoft is canceling most internal Claude Code licenses across its Experiences and Devices group — the unit that builds Windows, Microsoft 365, Teams, Outlook, and Surface. The deadline: June 30, 2026. Engineers were told to move to GitHub Copilot CLI.

The story hit Hacker News at #2 with 418 points and 398 comments. On Reddit, an adjacent Fortune piece — "Microsoft reports are exposing AI's real cost problem: using the tech is more expensive than paying human employees" — hit r/technology with 14,714 upvotes. The All-In Podcast cut an episode framed as "America Turns on AI." Three independent narratives, same week, pointing in the same direction.

It's tempting to read this as Microsoft picking a fight with Anthropic, or as the long-predicted "AI fatigue" finally landing. Both are wrong. The real signal is procurement.

📊 The fact pattern

Microsoft: ending most internal Claude Code use in Experiences and Devices by June 30, 2026 (Windows Central)

Uber: exhausted its entire 2026 AI budget in 4 months, with engineers reporting $500–$2,000/month per person in API spend (Briefs.co)

Industry: "Companies like Microsoft, Uber, Meta, and Amazon initially incentivized maximum AI usage through leaderboards. However, escalating bills forced reversals." (Fortune)

The June 30 Tell

The most telling detail in the entire week of coverage is the one nobody is centering: Microsoft's fiscal year ends June 30. So does the Claude Code contract.

Windows Central's reporting is direct: "Pulling external Claude Code seats reduces external software spending heading into the new fiscal year." This is not a strategic AI thesis. This is a CFO clearing a line item before FY27 budgets get approved.

That detail rewrites the whole story. If Microsoft were truly making a "Copilot vs. Claude" call on the merits, the timing would be tied to product milestones — a new GitHub Copilot CLI release, a Claude pricing change, a security finding. Instead it's tied to the calendar. The fact that Anthropic models remain available through Microsoft Foundry and inside Microsoft 365 Copilot for specific tasks (Developer Tech) confirms it. Microsoft isn't ending the Anthropic relationship — they're cutting the tool whose pricing model is incompatible with their budgeting cadence.

For enterprise AI buyers, that's the procurement signal worth reading. Not "Microsoft hates Anthropic." It's "the seat-based budgeting cycle just ran out of road."

Uber Is the Real Microsoft Microcosm

While the Verge story dominated HN, the more revealing data point is buried in Yahoo Finance's coverage of Uber's CTO comments: Uber blew through its full 2026 AI budget by April. Four months. $3.4B in committed spend. The CTO's stated cause wasn't waste — it was adoption:

~95% of Uber engineers use AI tools monthly
~70% of committed code is AI-generated
~11% of live backend code updates are written entirely by AI agents
Individual engineer API spend: $500–$2,000/month

That's not failure. By every productivity dashboard, it's the textbook win. And it still broke the budget.

AI Magazine's analysis names the root cause cleanly: "The predictive models established by the finance team based on the traditional SaaS era of 'fixed seats' and 'low-frequency calls' have completely failed in the face of the intensive token consumption of AI agents."

Translation: enterprises priced AI like they priced Slack. Then their engineers started running 12-hour agentic loops.

💡 The math that broke procurement

Per-token costs are collapsing: Gartner projects inference costs for advanced models will drop ~90% by 2030 vs. 2025. But Goldman Sachs projects 24x token consumption growth by 2030. Multiply the two (0.10 × 24 = 2.4) and your AI spend goes up roughly 2.4x, not down.
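The arithmetic is worth writing down, because it also gives you the break-even point. The sketch below uses only the two projections quoted above and assumes spend is simply price times volume; the function names are illustrative.

```python
def spend_multiplier(price_drop: float, volume_growth: float) -> float:
    """Relative spend = relative price per token * relative token volume."""
    return (1.0 - price_drop) * volume_growth


def break_even_growth(price_drop: float) -> float:
    """Volume growth at which total spend stays flat despite the price drop."""
    return 1.0 / (1.0 - price_drop)


multiplier = spend_multiplier(price_drop=0.90, volume_growth=24.0)
flat_at = break_even_growth(price_drop=0.90)

print(f"Spend multiplier: {multiplier:.1f}x")
print(f"Spend stays flat only if volume grows less than {flat_at:.0f}x")
```

Under these assumptions, a 90% price drop only protects the budget if token volume grows by less than 10x. At the projected 24x, the bill more than doubles.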

What the HN Thread Actually Said

The 418-point HN thread is the most honest enterprise-AI focus group of the week. The arguments worth reading:

On metrics: "Token consumption metrics are flawed — like measuring sawdust on a construction site." (bob1029)

On developer pressure: "Developers face pressure to maximize output quickly and cannot afford to gamble on cheaper models." (harimau777)

On the real reason: "Microsoft prefers directing telemetry toward improving its own Copilot product rather than competitors." (community consensus)

Notice what the comments don't fight about: whether Claude Code is the better tool. That war is over. Polymarket's "best AI model end of May" contract has Anthropic at 98%. The cancellation is happening despite model dominance.

The "America Turns on AI" Misread

It's tempting to lump this into the broader anti-AI cluster — Fortune's data center backlash piece, Meta's internal AI dissent leaks, the layoffs-for-AI thread crossing 4,500 upvotes. But the Microsoft story is a different category. Microsoft isn't turning on AI. Microsoft is concentrating AI: same models, fewer vendors, tighter telemetry loops.

For enterprises watching this play out, the lesson isn't "rip out Claude." Many shops would lose 30% of their developer velocity overnight. The lesson is: assume your seat-based AI contract has a 12-month half-life, and start planning the next one before your CFO does.

What CIOs Should Do Before FY27

1. Reframe the budget unit from "seat" to "task." Anthropic's /usage command finally makes per-developer attribution legible. If you can't measure it per task, you can't budget for it.

2. Negotiate consumption ceilings, not unit prices. The 90% per-token deflation will not save you. Uber's $500–$2,000/month range is the realistic envelope.

3. Decouple model choice from tool choice. Microsoft is keeping Anthropic models, killing the Anthropic tool. That's the right architectural read. Your CLI, IDE plug-in, and review bot should all be model-pluggable.

4. Build an "internal Foundry." Microsoft's move only works because they have Foundry — a model gateway that abstracts vendor relationships. Internal model gateways are the new IT shared service.

5. Audit which agentic workflows are economic at $0.01/1K tokens vs $0.10/1K tokens. The bottom 20% of your agent workflows will be uneconomic at any plausible 2027 price.
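Step 5 can start as a spreadsheet-sized script. This is a minimal sketch: the two price points come from the list above, while the workflow names, token counts, and per-run value estimates are hypothetical placeholders you would replace with your own telemetry.

```python
from dataclasses import dataclass

PRICE_POINTS_USD_PER_1K = [0.01, 0.10]  # the two price points named in step 5


@dataclass
class Workflow:
    name: str
    tokens_per_run: int
    value_per_run_usd: float  # hypothetical estimate of what one run is worth


def cost_per_run(workflow: Workflow, usd_per_1k_tokens: float) -> float:
    """Token cost of a single run at the given price."""
    return workflow.tokens_per_run / 1000 * usd_per_1k_tokens


def audit(workflows, price_points=PRICE_POINTS_USD_PER_1K):
    """Map each workflow name to {price: True if a run is worth its cost}."""
    return {
        w.name: {p: cost_per_run(w, p) <= w.value_per_run_usd for p in price_points}
        for w in workflows
    }


# Hypothetical workflows for illustration only.
workflows = [
    Workflow("ticket-triage", tokens_per_run=5_000, value_per_run_usd=2.00),
    Workflow("pr-summary", tokens_per_run=20_000, value_per_run_usd=1.00),
    Workflow("nightly-refactor-loop", tokens_per_run=2_000_000, value_per_run_usd=15.00),
]

report = audit(workflows)
for name, verdicts in report.items():
    print(name, verdicts)
```

A workflow that fails at both price points is a candidate for the "uneconomic at any plausible price" bucket. One that passes only at the low price is the kind to put behind a consumption ceiling.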

The Story That Matters Next Week

Microsoft canceling Claude Code is the loudest event. It is not the most important one. The important one is whoever follows them in the next 90 days.

The companies that won't be fine are the ones that missed the procurement signal under the tool-war headline and treat this as "Microsoft AI strategy news" rather than "your AI budget assumptions just broke." Your FY27 budget proposal lands sometime in the next 90 days. The Microsoft memo is the rehearsal. Read it that way.

Originally published at ComputeLeap.

Antigravity 2.0 Review: The Switching-Cost Trap

Max Quimby — Fri, 22 May 2026 03:42:59 +0000

Most reviews of Antigravity 2.0 will tell you about the new CLI, the SDK, and the multi-agent orchestration. Those are real, and we will cover them. But the most important thing that happened when Google shipped Antigravity 2.0 at I/O 2026 was not a feature. It was this: developers opened their laptops, clicked the shortcut they had used for months, and found their IDE gone — replaced by a single conversational prompt box, their chat history and settings wiped.

📖 Read the full version with screenshots and embedded sources on AgentConn →

That is the review. Everything else is detail. Antigravity 2.0 is a genuinely capable agent platform, and if you are starting fresh it is worth a look. But the rollout turned a product launch into a trust incident, and the lesson for anyone choosing an agent IDE in 2026 is structural: the lock-in was never the pricing. It was the install.

What Antigravity 2.0 actually is

Give Google credit for ambition. Antigravity 1.0 was an agent-augmented IDE — a VS Code fork with an AI agent bolted on. Antigravity 2.0 is no longer one app. Per MarkTechPost's breakdown, it is now five surfaces:

A standalone desktop app that orchestrates multiple agents in parallel, with a built-in browser agent for visual QA
A Go-based CLI that is SSH-compatible and lets you spin up agents without a GUI
An SDK for defining custom agent behaviors and hosting them on your own infrastructure
A Managed Agents API inside the Gemini API, running agents in isolated Linux environments
An enterprise path through the Gemini Enterprise Agent Platform

The desktop app adds dynamic subagents for parallel workflows, scheduled background tasks, native voice commands, and native hooks into Google AI Studio, Firebase, and Android. TechCrunch and 9to5Google both frame 2.0 as Google reframing Antigravity from "an IDE" into "a full agentic development suite." On capability, that framing is fair. The browser agent reportedly scores 76.2% on SWE-bench Verified, and the parallel-agent model is real.

Google's own demo made the ambition concrete: its team used Antigravity 2.0 with Gemini 3.5 Flash to build a functioning operating system from scratch in under 12 hours — 93 parallel sub-agents, 2.6 billion tokens, for less than $1,000 in API credits. Whatever you think of the rollout, the orchestration engine underneath is not a toy.

ℹ️ If you have never used Antigravity and you are shopping for an agent IDE, 2.0 is a legitimate contender — especially if visual QA, parallel agents, or tight Firebase/Android integration matter to you. The criticism in this review is about the rollout, not the underlying engineering.

The reset: what broke

Here is what Google did not put on the I/O stage. The 2.0 update did not ship as an opt-in. It force-updated existing installations through the background update channel, and the new build aggressively rewrote default application paths so the old IDE could not coexist with the new one.

The clearest account comes from a widely-shared developer write-up. The author's summary is blunt: "When I clicked my usual shortcut, my entire IDE was just gone." The forced update wiped chat history and settings, broke a plan-review-implement workflow built up over months, and left the machine in a state where the only fix was, in the author's words, "a total purge of everything Antigravity related" before either version would run again.

That write-up hit Hacker News under the title "Google's Antigravity Bait and Switch" and climbed past 337 points. The phrase that stuck: background updates "are meant for performance patches, not secretly shipping an entirely different piece of software." Techloy reported the obvious consequence — the forced change "raises questions about trust in Google's update policies."

The other shoe: Gemini CLI is being retired

The Antigravity reset did not happen in isolation. The same week, Google announced it is retiring the Gemini CLI and transitioning users to the Antigravity CLI. Free, AI Pro, and AI Ultra users face a hard deadline: June 18, 2026. After that, the Gemini CLI API shuts down.

Per byteiota's migration guide, the Antigravity CLI is "not 1:1 feature parity at launch" — some edge-case workflows "may need adjusting." So developers who standardized on the Gemini CLI now face a forced migration, on a deadline, to a tool that does not yet do everything the old one did. The migration is a one-liner to install, but install is the easy part. Rebuilding scripts, CI integrations, and muscle memory is not.

The quota problem

Capability is not the same as usability, and Antigravity 2.0 inherited a quota system that the community has been loudly unhappy with. The Google AI Developers Forum has multiple active threads, including one titled simply "Broken IDE & Joke AI Quotas."

The mechanics, as documented in a community fix guide: Antigravity uses a dual-limit structure — a 250-unit "sprint" cap that refreshes every 5 hours, sitting on top of a 2,800-unit weekly baseline. Both pools must be positive for the tool to run. Exhaust the weekly pool on Monday and you are locked out until the reset — waiting five hours does nothing, because the weekly bucket is empty.
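The lockout is easier to see in a small simulation. This is a sketch of the dual-limit rule as the community guide describes it, not Google's implementation. It assumes every request draws from both pools at once, which the guide implies but does not spell out, and the class and method names are invented.

```python
from dataclasses import dataclass

SPRINT_CAP = 250     # units, refreshes every 5 hours
WEEKLY_CAP = 2_800   # units, weekly baseline


@dataclass
class Quota:
    sprint: int = SPRINT_CAP
    weekly: int = WEEKLY_CAP

    def can_run(self) -> bool:
        # Both pools must be positive for the tool to run.
        return self.sprint > 0 and self.weekly > 0

    def spend(self, units: int) -> None:
        # Assumption: usage is deducted from both pools simultaneously.
        self.sprint -= units
        self.weekly -= units

    def refresh_sprint(self) -> None:
        # The 5-hour refresh tops up the sprint pool only.
        self.sprint = SPRINT_CAP


quota = Quota()
sprints_used = 0
while quota.can_run():
    quota.spend(min(quota.sprint, quota.weekly))  # burn the whole sprint
    quota.refresh_sprint()                        # wait five hours
    sprints_used += 1

print(f"Locked out after {sprints_used} sprints")
print(f"sprint pool: {quota.sprint}, weekly pool: {quota.weekly}")
```

Under these assumptions a heavy user is locked out after 12 back-to-back sprints, roughly two and a half days of continuous use, with a full sprint pool and an empty weekly one. That end state is exactly the "waiting five hours does nothing" case.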

The community workarounds are telling, because every one of them is a way of escaping the product you are nominally reviewing:

Downgrade to v1.19.6 (late February 2026), which has more predictable quota behavior
Disable AI Credits, which several users report causes rapid, unexpected quota depletion
Route to Gemini Flash, which runs on a separate, lighter quota pool than the Claude models

⚠️ When the most-recommended fixes for a tool are "use an old version," "turn off a feature," and "route around the default model," that is not a configuration problem. It is a signal that the product shipped ahead of its economics.

The real lesson: evaluate exit cost, not just features

Step back from the specifics. Antigravity 2.0 is a useful product attached to a rollout that treated developers' working environments as Google's to rewrite. The instinct is to file this under "bad PR week." That undersells it.

Agent IDEs have quietly become load-bearing infrastructure. They hold your prompt history, your agent configs, your MCP servers, your project context — and increasingly they are your workflow, not an accessory to it. When a tool like that auto-updates into a different product and deletes your settings, you learn how much control you actually had. The answer, this week, was: very little.

This is the same platform-risk theme we explored in our review of cc-switch — the tool whose entire pitch is hedging against exactly this. The cc-switch thesis was that no single AI vendor should be able to lock you out of your workflow. Antigravity 2.0 is the case study that proves the thesis. The developers who were least disrupted were the ones who kept their configs portable and did not treat any single agent IDE as permanent.

💡 When you evaluate an agent IDE, add one line to the checklist: what does it cost to leave? Can you export your agent configs? Are your prompts in plain files you own, or in the app's database? If the vendor force-updated tomorrow, how much of your setup survives? Features are easy to compare. Exit cost is the number that actually protects you.

Should you adopt or migrate?

A practical read, by situation:

New users, greenfield projects: Antigravity 2.0 is worth trying. The parallel-agent orchestration and built-in browser agent are genuinely strong. Just keep your agent config and prompts in version-controlled files from day one — do not let the app become the only place your setup exists.
Existing Antigravity 1.x users: You have likely already been force-migrated. Audit what you lost, check whether the v1.19.6 downgrade path restores predictable quotas, and back up your config now. Treat the next forced update as a when, not an if.
Gemini CLI users: The June 18, 2026 deadline is real. Start the migration now, not in June — the Antigravity CLI is not at feature parity, so you need runway to find and fix the gaps in your scripts and CI.
Teams choosing an agent CLI from scratch: This is a good moment to compare the field on portability, not just capability. Read the comparisons with exit cost in mind.

Antigravity 2.0 is not a bad tool. It is a good tool that just gave the clearest demonstration of the year that an agent IDE is something you rent, not something you own. Use it if it fits — but build your workflow so that when the next forced reset comes, you are the one who decides whether to follow it.

Originally published at AgentConn.

Gemini 3.5 Flash: Is 'Cheaper Than Frontier' Real?

Max Quimby — Fri, 22 May 2026 03:24:58 +0000

Google walked onto the I/O 2026 stage with a number, not a model. Sundar Pichai told the audience that companies running roughly a trillion tokens a day on Google Cloud could save more than $1 billion a year by shifting most of their workload onto Gemini 3.5 Flash. His framing was blunt: enterprises are "already blowing through their annual token budgets, and it's only May."

📖 Read the full version with charts and embedded sources on ComputeLeap →

That is the story Google wants you to repeat. It is also the story worth interrogating, because the headline — "cheaper than frontier" — is doing a lot of quiet work. Cheaper than what, exactly? Against whom? And at which of the model's several price tiers?

The honest answer is more interesting than the press release. Gemini 3.5 Flash is genuinely a strong agentic and coding model, and for a specific class of workloads it will save real money. But it is also three times as expensive as the Flash model it replaces, and in its highest reasoning mode it can cost more to run than Gemini 3.1 Pro. The "cheaper than frontier" claim is true, but only if you read the fine print and route your traffic accordingly.

What Google actually shipped

Gemini 3.5 Flash launched generally available on day one, May 19, 2026 — no preview gate. It went straight into the Gemini app, AI Mode in Google Search, Android Studio, and, notably, GitHub Copilot. That distribution is the part most coverage underplays. Google did not announce a model so much as flip a switch under a billion existing users.

The benchmarks are real and they are good. According to llm-stats.com's launch breakdown, Gemini 3.5 Flash posts 76.2% on Terminal-Bench 2.1, 1656 Elo on GDPval-AA, 83.6% on MCP Atlas, and 84.2% on CharXiv Reasoning. All four numbers top Gemini 3.1 Pro — last year's flagship. On output speed, MarkTechPost clocked it around 284 tokens per second, with Pichai citing 289 on stage — roughly 4x the throughput of comparable frontier models.

ℹ️ The thing to notice: every benchmark Google led with is an agentic or tool-use benchmark — Terminal-Bench, MCP Atlas, agent-style coding suites. On pure reasoning, the picture flips. Gemini 3.1 Pro still wins Humanity's Last Exam (44.4% vs 40.2%) and ARC-AGI-2 (77.1% vs 72.1%). Google is not claiming a new intelligence ceiling. It is claiming a better speed-and-cost frontier.

That distinction is the whole article. Google did not try to build the smartest model in the world this cycle. It tried to build the model that is good enough for the workloads enterprises actually run at volume — agent loops, code edits, tool calls — and then made it fast and ubiquitous. As we argued in Harness Leaderboards Are the New Model Leaderboards, the raw model score has stopped being the interesting variable. Throughput, cost per task, and how the model behaves inside a harness are what move production decisions now.

The price tag nobody puts in the headline

Here is where "cheaper than frontier" starts to wobble.

Gemini 3.5 Flash is priced at $1.50 per million input tokens and $9.00 per million output tokens for the thinking variant, with cached input at $0.15. Against Gemini 3.1 Pro at $2.00 / $12.00, that is about 25% cheaper. Against a Pro-tier competitor it is genuinely a discount. So far, so good for the headline.

But run the comparison the other direction — against the model Flash users were already paying for — and the story inverts. As Simon Willison documented in his launch-day analysis, Gemini 3.5 Flash is 3x the price of Gemini 3 Flash Preview and 6x the price of Gemini 3.1 Flash-Lite.

The Hacker News reaction was immediate. The top of one thread, titled bluntly "It's discouraging to see Google price Gemini 3.5 Flash at 3x the cost of Gemini 3 Flash", captured the frustration: the entire identity of the Flash line was being the cheap option. A Flash that costs nearly as much as last year's Pro is a different product wearing the same name. TechTimes put the tension right in its headline: "a Cheap-to-Run Agent Model That Costs 3x More Per Token."

Willison's most damaging data point is not the per-token sticker, though. It is the all-in cost of actually using the thing. Running Artificial Analysis's standard benchmark suite, Gemini 3.5 Flash in "high" reasoning mode cost $1,551.60 — versus $892.28 for Gemini 3.1 Pro Preview.

⚠️ Read that again. In its highest reasoning setting, the "Flash" model cost 74% more than the actual Pro model to complete the same benchmark suite ($1,551.60 vs. $892.28). Flash models think before they answer, and thinking tokens are billed at the output rate. A cheap per-token price plus a high token count does not equal a cheap bill.
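Both halves of the comparison can be checked in a few lines. The sketch below uses only the list prices and benchmark-suite totals quoted in this section; the variable names are illustrative.

```python
# USD per million tokens (input, output), as quoted above.
FLASH_THINKING = (1.50, 9.00)
PRO = (2.00, 12.00)

# All-in cost of the same benchmark suite, as quoted above.
SUITE_COST_FLASH_HIGH = 1551.60
SUITE_COST_PRO = 892.28


def discount(cheaper: float, pricier: float) -> float:
    """Fractional saving of `cheaper` relative to `pricier`."""
    return 1.0 - cheaper / pricier


input_discount = discount(FLASH_THINKING[0], PRO[0])
output_discount = discount(FLASH_THINKING[1], PRO[1])
suite_premium = SUITE_COST_FLASH_HIGH / SUITE_COST_PRO - 1.0

print(f"Sticker discount vs Pro: {input_discount:.0%} input, {output_discount:.0%} output")
print(f"Suite premium vs Pro:    {suite_premium:.0%}")
```

The sticker says 25% cheaper on both input and output. The invoice for the same suite says about 74% more. The gap between those two numbers is entirely token count.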

This is the same trap we documented in The 6x AI Pricing Lie: per-token pricing is a marketing surface, not a budget. The number that matters is cost per completed task, and for reasoning-heavy work a fast model that emits a lot of thinking tokens can quietly outspend a slower, "more expensive" one.

So is the $1 billion claim a lie?

No — and this is where fairness matters. Pichai's $1B figure is not fabricated. It is conditional.

The claim assumes a company running ~1 trillion tokens per day that shifts roughly 80% of its workload from a frontier-tier model to a mix of Flash and other models. For the right workload mix, that math holds. A huge share of enterprise token volume is not frontier-reasoning work — it is classification, extraction, summarization, routing, simple code edits, retrieval-augmented answers. On that traffic, 3.5 Flash at non-thinking rates ($0.50 / $3.00 in its base configuration) genuinely undercuts a Pro model while clearing the quality bar. VentureBeat's enterprise coverage and R&D World's analysis — which notes 3.5 Flash scores within two points of Anthropic's flagship at a third of the price — are both describing a real phenomenon.

The catch is the word mix. The savings come from routing, not from a model. If you route every request to 3.5 Flash on "high" and call it a cost optimization, you will be unpleasantly surprised by the invoice. If you tier your traffic — base Flash for the bulk, thinking mode only where it earns its keep, Pro for the genuinely hard reasoning — the billion-dollar math is reachable.

💡 The actionable takeaway for an engineering team: 3.5 Flash is a routing-tier upgrade, not a default-everything button. Pin the model version, pin the reasoning tier explicitly in your API calls, and instrument cost-per-task per route. Treat "high" mode as a frontier-priced resource, because that is what it is.
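Here is a minimal sketch of what "instrument cost-per-task per route" could look like. The prices are the ones quoted in this article, and thinking tokens are billed at the output rate as described earlier. The tier labels, task classes, and token counts are hypothetical, and the real model IDs and reasoning-tier parameter you would pin in an API call are deliberately not shown, since they depend on your SDK.

```python
from collections import defaultdict

# USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "flash-base": (0.50, 3.00),
    "flash-thinking": (1.50, 9.00),
    "pro": (2.00, 12.00),
}

# Hypothetical routing table: task class -> pinned tier.
ROUTES = {
    "classification": "flash-base",
    "agentic-coding": "flash-thinking",
    "hard-reasoning": "pro",
}


def task_cost(tier, input_tokens, output_tokens, thinking_tokens=0):
    """Cost of one task in USD; thinking tokens bill at the output rate."""
    in_price, out_price = PRICES[tier]
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * in_price + billed_output * out_price) / 1_000_000


class CostLedger:
    """Accumulates spend per route so cost per task can be reported."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def record(self, task_class, input_tokens, output_tokens, thinking_tokens=0):
        cost = task_cost(ROUTES[task_class], input_tokens, output_tokens, thinking_tokens)
        self.totals[task_class] += cost
        self.counts[task_class] += 1
        return cost

    def cost_per_task(self, task_class):
        return self.totals[task_class] / self.counts[task_class]


ledger = CostLedger()
ledger.record("classification", input_tokens=2_000, output_tokens=200)
# Same prompt and answer size, but the thinking tier emits far more tokens.
ledger.record("agentic-coding", 2_000, 500, thinking_tokens=8_000)
ledger.record("hard-reasoning", 2_000, 500, thinking_tokens=2_000)

for route in ROUTES:
    print(f"{route}: ${ledger.cost_per_task(route):.4f} per task")
```

With these made-up token counts, the thinking-tier Flash route costs more per task than the Pro route despite the lower sticker price. That is the pattern a per-route ledger is meant to surface before the invoice does.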

There is also a pricing-confusion problem Google created for itself. The HN threads filled with people disagreeing about the actual numbers — $0.50/$3.00 base versus $1.50/$9.00 thinking — because the model ships with multiple price points under one name. When your own developer community cannot agree on what your model costs, "cheaper" is not a message you have landed.

The honest competitive read

Strip away the I/O theater and Gemini 3.5 Flash is a confident, slightly cynical product decision. Trending Topics called it a launch with "higher prices but no generational leap," and that is roughly correct — but it is not necessarily a bad decision.

Willison's broader observation is the one to sit with: "It feels like all three of the major AI labs are starting to probe the price tolerance of their API customers." He points out OpenAI's GPT-5.5 launched at 2x the price of GPT-5.4, and Claude Opus 4.7 runs about 1.46x Opus 4.6 once you account for the new tokenizer. Every lab is testing how much developers will absorb. Google's bet is that day-one GA across Search, the Gemini app, Android Studio, and Copilot means most of its token volume never makes a price-sensitive decision at all. Distribution does the selling. The API price can drift up because the API is not where the volume is.

For the convergence-watchers: this is why Gemini 3.5 Flash was a HIGH-confidence cluster across YouTube, HN, and Substack this week. It is not the model that is interesting. It is the strategy — competing on the cost-and-speed frontier while quietly conceding the intelligence ceiling and quietly raising prices behind a "cheaper" headline.

The switching-cost asterisk: Antigravity 2.0

If you are an enterprise reading the $1 billion number and thinking about consolidating onto Google's stack, the same I/O week handed you a cautionary tale.

Google also pushed Antigravity 2.0, the new version of its agent IDE. It did not ship as an opt-in. It force-updated existing installations, and in doing so replaced the IDE developers had been using for months with a single conversational prompt box — wiping chat history and settings in the process. The Hacker News thread, titled "Google's Antigravity Bait and Switch," climbed past 337 points, with developers reporting they had to fully purge every Antigravity file on their machine before either version would run again.

The 2.0 reset is a sharp reminder of the cost that never shows up in a token-pricing comparison: switching cost and platform risk. The $1B savings figure assumes you can move 80% of your workload onto Google's models. Doing that deepens your dependence on Google's update policy — the same policy that just deleted developers' IDE configurations without asking.

⚠️ Cost analysis that stops at token price is incomplete. The real question for an enterprise is total cost of dependence: token price plus migration cost plus the risk that the vendor reorganizes the product underneath you. Antigravity 2.0 just repriced that risk upward for everyone evaluating Google's agent stack.

Should you adopt it?

A practical read, by situation:

High-volume, non-reasoning workloads (extraction, classification, routing, RAG answers, simple edits): Yes. Use base 3.5 Flash, not thinking mode. This is where the savings are real and the quality bar is comfortably cleared.
Agentic coding and tool-use loops: Strong yes on capability — the Terminal-Bench and MCP Atlas numbers are legitimate — but instrument cost per task. Agent loops emit a lot of tokens; a fast model amplifies both speed and spend.
Hard reasoning, long-context retrieval, research-grade work: Be skeptical of "high" mode as a cost play. 3.1 Pro still wins the reasoning benchmarks and, in Willison's test, cost less to run the suite. Route this traffic to an actual Pro/flagship tier.
Anyone consolidating their whole agent stack onto Google: Factor in Antigravity 2.0. Pin versions, keep your prompts and configs portable, and do not assume the product you adopt today is the product you will have next quarter.

The "cheaper than frontier" line is not a lie. It is a half-truth that becomes true only with disciplined routing — and becomes false the moment you treat 3.5 Flash as a drop-in replacement for everything. Google built a fast, capable, well-distributed model and raised the price while telling you it got cheaper. Both things are true. Your invoice will reflect whichever one you actually engineer for.

Originally published at ComputeLeap.

Chamath's 18 Months: Taiwan, TSMC-Arizona, and the Clock

Max Quimby — Tue, 19 May 2026 03:33:27 +0000

On the latest All-In Podcast, Chamath Palihapitiya delivered a single sentence that has been circulating across the political-class Twitter feed for forty-eight hours: "We're 18 months from Taiwan not being an important moment of conversation."

📖 Read the full version with charts and embedded sources on The Arc of Power →

It is the most interesting thing said about Taiwan this week, and almost nobody is reading it correctly.

The dominant frame — picked up across PBS, The New York Times, Pakman, and The Guardian — has been about the Trump-Xi Beijing summit and whether Trump looked weak under Xi's Taiwan warning. That is the invasion frame. It assumes Taiwan matters because Beijing might one day move on it militarily.

Chamath's framing is different and more structurally interesting. He is not saying invasion risk drops in 18 months. He is saying Taiwan's structural relevance to the US is being engineered out of existence — through TSMC-Arizona Phase 2/3, domestic fab buildouts in Texas and Ohio, and the compute-substrate migration that's already moved Anthropic-grade workloads onto US-domiciled silicon. If he is right on the timeline, the policy regime shifts before the military regime even has to.

Polymarket, with $335,460 in 24-hour volume on the headline market, agrees on direction and disagrees on speed. The market prices a 2026 invasion at 7% and the end-of-June variant at 2%. Below that level of risk, the strait is not the binding constraint on US strategy — substrate migration is. Above it, the strait still matters. The 7% number is the test of Chamath's thesis, not a refutation of it.

This piece reads Chamath's quote in full context, walks the TSMC-Arizona evidence honestly, prices the Polymarket leg, identifies where the thesis breaks, and ends with the side story most political coverage missed: Elon Musk in the Great Hall of the People during the summit, and what that quietly says about how fast decoupling is actually moving.

1. What Chamath actually said — the full context

The 18-month line is being circulated as a clip. The clip strips the analytical move that makes the thesis interesting.

On the All-In episode, Chamath was responding to a question about US-China policy posture. His full argument has four parts, only one of which travels in the clip:

1. TSMC-Arizona Phase 2 is on schedule for 2026-2027 production ramp, with Apple and NVIDIA having committed volume. This part is true and documented; we walk the evidence in §2.
2. The compute substrate that matters most for US AI sovereignty is migrating to US-domiciled silicon faster than the policy debate. Cerebras, the Anthropic-SpaceX compute deal, and the inference inflection shift toward US-built inference fleets are the practitioner version of this claim.
3. In 18 months, the marginal strategic asset Taiwan provides (leading-edge logic capacity for the US) will be available domestically with enough volume to take the strait off the critical-path list. This is the thesis, and it's the part operators should test.
4. Therefore the diplomatic and military posture around Taiwan will change, because the asset being defended will have substantially decoupled from the geography being defended.

The clip travels because of part 4 — the rhetorical punchline. The analytical work is in parts 1-3. Chamath is making a claim about substrate migration timelines and using "18 months" as the inflection where TSMC-Arizona's contribution to US leading-edge logic capacity crosses a strategic threshold.

To evaluate the thesis honestly, we have to take part 1 — the TSMC-Arizona ramp — at face value or refute it. So that's where we start.

2. The TSMC-Arizona evidence — what's on schedule, what isn't

TSMC's Arizona Fab 21 project is the largest single foreign direct investment in US history at $65B committed across three phases. As of mid-2026:

Phase 1 (4nm/5nm): Already in volume production since H2-2025. Apple A18 SoCs and AMD HPC chips shipping out of the Arizona facility in commercial volume. This is no longer projection; it's verified output.
Phase 2 (3nm): Building completion mid-2026; tool-installation underway; first wafers projected H1-2027. Apple and NVIDIA have public volume commitments.
Phase 3 (2nm): Site preparation underway; production target 2028.

Combine that with Intel's Ohio buildout, Samsung's Texas Taylor fab, and a re-shored packaging ecosystem (Amkor in Arizona, TSMC's own US ATP), and the picture sharpens. By H2-2027, the US has the option of leading-edge logic capacity in volume on US soil. Whether that fully replaces TSMC Taiwan for the most strategically sensitive workloads is a separate question — but the option exists, and the option is what Chamath's 18-month frame is pricing.

The honest counter-evidence:

DRAM, mature nodes, and the supply-chain depth issue. Even with leading-edge logic re-shored, the broader semiconductor supply chain — substrate, photomasks, specialty chemicals, ATMI gases, the long tail — still runs through Taiwan-Japan-Korea. The headline-grabbing 2nm/3nm line ignores 60-70% of the semiconductor BOM.
Volume vs. capability. Arizona's Phase 2 output will be a meaningful percentage of leading-edge logic, not all of it. Through 2027, the marginal datacenter buildout will still pull from Taiwan for some segment of need. The decoupling is partial, not total.
Skilled-labor and equipment-vendor migration. The US service-engineer pipeline for ASML, Tokyo Electron, and Applied Materials is not at parity with Hsinchu. Expect a measurable yield gap in year-one US 3nm production.

The net read: Chamath is directionally right on the substrate migration. The 18-month timeline for "marginal-strategic-asset" substitution is defensible but tight. A 24-30 month variant is more honestly hedged. The thesis survives that adjustment — the dollar-volume threshold where Taiwan stops being the binding constraint is approached on either timeline.

3. The Polymarket leg — pricing the binding test

The cleanest market test of the Chamath thesis is Polymarket's "Will China invade Taiwan by end of 2026" market. $335,460 of 24-hour volume, $979,681 of liquidity. The market prices Yes at 7%, No at 93%, with the curve down 3.4% this month.

The shorter-horizon market — June 30, 2026 — prices Yes at 2% on $124,927 of 24-hour volume.

The 7% number is doing more work than most observers credit. Here is what it actually says under Chamath's frame:

The 7% is the binding test. If Polymarket priced invasion-by-end-of-2026 at 25%, the strait would still matter to US strategy regardless of what TSMC-Arizona does — because at 25%, an invasion happens before the substitution completes. At 7%, the market is saying the substitution window is wide enough that the strait is unlikely to be the binding constraint before the substrate migration changes the strategic stakes. That is Chamath's thesis in market pricing.

The 2% June 30 variant tightens the read. Near-term invasion is priced effectively at zero. The combination — 2% by June 30, 7% by year-end — implies the market's mode is "no invasion in the migration window, then strategic reassessment after the substrate decouples." That is structurally consistent with the 18-month frame.
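The combination of the two prices can be made concrete by backing out what the market assigns to the July-December window. The sketch below is a back-of-envelope reading, not a pricing model: it assumes the two markets are nested (an invasion by June 30 would also resolve the year-end market Yes), treats prices as probabilities, and ignores fees and spreads. The function name is illustrative.

```python
from typing import Tuple


def implied_window_probability(p_early: float, p_late: float) -> Tuple[float, float]:
    """Return (unconditional, conditional) probability that the event
    lands between the early deadline and the late deadline.

    Assumes the early market is nested inside the late one.
    """
    if not 0.0 <= p_early <= p_late <= 1.0:
        raise ValueError("expected 0 <= p_early <= p_late <= 1")
    if p_early == 1.0:
        raise ValueError("conditional probability undefined when p_early is 1")
    window = p_late - p_early                 # P(event in the window)
    conditional = window / (1.0 - p_early)    # P(event in window | none before it)
    return window, conditional


# Prices quoted above: 2% by June 30, 7% by end of 2026.
window, conditional = implied_window_probability(0.02, 0.07)
print(f"window: {window:.3f}, conditional on a quiet H1: {conditional:.3f}")
```

Under those assumptions, most of the 7% sits in the second half of the year, which is the window the substitution argument cares about.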

The disagreement points worth watching. The Polymarket curve is down 3.4% this month — the market is becoming less convinced of invasion in 2026, not more, despite Xi's summit rhetoric. If the curve moves up by 3-5 percentage points without a corresponding TSMC-Arizona slowdown story, Chamath's thesis weakens. If the curve continues down while Phase 2 ramp news lands, his thesis strengthens. This is a tradeable instrument with a clear interpretation.

The Reddit and HN conversation on this market has converged on the same read — invasion risk priced low not because the rhetoric is empty, but because the cost-benefit for Beijing changes as the substrate migrates. The strait is more valuable to invade before the migration completes. After, the asset being captured has substantially relocated.

4. Where the thesis breaks

Three legitimate counter-arguments to the 18-month frame, in increasing order of severity:

Counter #1: The "Taiwan is more than chips" argument. Taiwan's geographic position anchors the first island chain. Even if every fab were duplicated in Arizona tomorrow, Taipei's relationship to US Pacific posture, freedom-of-navigation operations, and the Korea-Japan defense architecture would persist. Chamath's framing (Taiwan as a substrate-supply question) under-prices the geographic-strategic dimension.

This counter is real but somewhat off-target for the specific Chamath claim. He is not saying Taiwan becomes geopolitically irrelevant. He is saying Taiwan's place in the US economic-security calculation — the calculation that drove $52B of CHIPS Act money and $65B of TSMC Arizona spend — changes. The geographic-strategic Taiwan and the substrate-supply Taiwan are not the same Taiwan. Chamath's thesis is about the second.

Counter #2: The supply-chain-depth realities. Discussed in §2. Even leading-edge logic migration leaves 60-70% of the BOM behind. The reshoring of the long tail is a 5-10 year project. Through 2027, US AI-frontier datacenter buildouts will still touch the Taiwan supply chain on multiple legs. The "Taiwan doesn't matter" headline is too strong by the strict reading.

Counter #3: Production yield in US 3nm year-one. TSMC Arizona's leading-edge output through 2027 is going to be lower-yield than Hsinchu's, by a real number. Apple and NVIDIA have committed volume, but the cost of that volume — measured in usable die per wafer — is higher in Arizona for the first 18 months of production. If the cost premium is large, customers will pull from Taiwan in parallel through the migration window — and the binding-constraint calculus shifts.
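The "usable die per wafer" point is simple arithmetic, and a small sketch shows how a yield gap turns into a cost premium. Every number below is a placeholder in normalized units chosen for illustration; none of them are TSMC figures, and the function name is hypothetical.

```python
def cost_per_good_die(wafer_cost: float, gross_dies: int, yield_fraction: float) -> float:
    """Cost of one usable die: wafer cost spread over the dies that work."""
    if wafer_cost <= 0 or gross_dies <= 0:
        raise ValueError("wafer_cost and gross_dies must be positive")
    if not 0.0 < yield_fraction <= 1.0:
        raise ValueError("yield_fraction must be in (0, 1]")
    return wafer_cost / (gross_dies * yield_fraction)


# Placeholder inputs: same wafer cost and die count, different yield.
hsinchu = cost_per_good_die(wafer_cost=100.0, gross_dies=100, yield_fraction=0.80)
arizona = cost_per_good_die(wafer_cost=100.0, gross_dies=100, yield_fraction=0.64)

# With equal wafer cost, the premium reduces to (yield_hsinchu / yield_arizona) - 1.
premium = arizona / hsinchu - 1.0
print(f"cost premium per usable die: {premium:.0%}")
```

The design point is that the premium depends on the ratio of yields, not the absolute gap, so the same number of yield points costs more when the baseline yield is low. A higher US wafer cost would stack on top of this.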

The thesis survives all three counters in its careful form. It does not survive the headline-clip version. Treat "Taiwan doesn't matter in 18 months" as a political-class slogan and "Taiwan stops being the binding US-strategic constraint sometime in 2027-2028" as the operator claim.

5. The Musk side story — what the political coverage missed

Buried inside the Beijing summit coverage was a CBS-flagged detail that almost no other outlet picked up: Elon Musk was photographed inside the Great Hall of the People during Trump's visit. Reports placed him there with his son.

The political-class commentary on the summit fixated on whether Trump "looked weak" under Xi's Taiwan language. That is the visible story. The Musk presence is the structural one.

If you treat the summit as theater, it was theater. If you treat it as a meeting of the actors who actually control the substrate migration, the cast list is interesting. The US Pacific posture is run from Washington and DOD. The substrate migration is run from Hsinchu, Arizona, and the boards of SpaceX, Tesla, xAI, and TSMC. One of those boardrooms had its principal physically present inside the Beijing summit perimeter.

There are at least three reads on the Musk-at-the-Great-Hall observation:

Diplomatic backchannel. Musk's Chinese-government relationships, especially via Tesla Shanghai, have been an active channel since 2020. A symbolic appearance during the bilateral signals durability of that channel regardless of the rhetorical surface.
Decoupling-speed signal. The presence of the most consequential American tech CEO inside the Great Hall during a Taiwan-tense summit suggests actual US-China industrial decoupling is more measured than the rhetorical decoupling. If decoupling were really running at 18-month pace, the optics of Musk-in-Beijing would be politically untenable. They were not.
Substrate-migration coordination. A more aggressive read: parts of the substrate migration require coordinated US-China posture — semiconductor equipment exports, talent flows, rare-earth supply, and EV-chain components are all coordination problems, not pure-decoupling problems. Musk's presence is consistent with substrate-migration coordination being the actual underlying policy, even as the public rhetoric tracks separation.

Whatever read you take, the Musk side-story sharpens the Chamath thesis. The decoupling is real, but the operational reality has more US-China coordination embedded in it than the political-class commentary suggests. The 18-month clock is partly a managed migration, not a clean break.

What this means for operators and analysts

Three specific takeaways:

1. Re-time your Taiwan-risk hedge. If you have been pricing Taiwan invasion risk against your 2026 ops plan at consensus 10-15%, the Polymarket 7% number says the market disagrees with consensus. If you trust the substrate-migration thesis, the strategic Taiwan-risk window is closing on Chamath's clock, not the invasion-risk clock. Re-time accordingly.

2. Watch TSMC-Arizona Phase 2 ramp news as the leading indicator. Not the summit rhetoric. Not the strait incident reports. The Phase 2 wafer-output news through H1-2027 is the variable that moves the Chamath thesis. A six-month slip moves the timeline to 24-30 months. A clean ramp ratifies the 18-month frame and starts repricing assumptions across the China-Iran, Trump-Xi summit, and Xi-Taiwan trigger frames.

3. Don't confuse the headline-clip with the thesis. The viral "18 months" line is being used to argue both that Taiwan is irrelevant and that Chamath is reckless. Neither read is right. The thesis is narrower: leading-edge logic capacity for US strategic use is substantially decoupling from Taiwan inside the 18-30 month window. That is a defensible claim. It is also the most important claim being made about US-China substrate strategy this quarter, and it deserves to be evaluated on its actual content rather than its viral packaging.

The clock is ticking. The question is not whether Chamath is right; it's whether the substrate migration completes faster than the political timeline expects. That is a tradeable question, and Polymarket is the cleanest place to test it.

Originally published at thearcofpower.

Long-Running Agents: Harness, Evaluator, Handoff

Max Quimby — Tue, 19 May 2026 03:32:14 +0000

A year ago "long-running agent" meant "we tried to chain prompts and it broke."

📖 Read the full version with charts and embedded sources on agentconn →

This week, three independent talks converged on the same engineering thesis: hour-scale autonomy is a harness problem, not a prompt problem. Anthropic's "Build Agents That Run for Hours" session frames it as adversarial evaluators plus structured handoffs. IBM frames it as harness substrate. AI LABS frames it as a development lifecycle. The agent runs for hours when you stop trying to make the model smarter and start engineering the substrate that keeps the model on-task — three components in particular: harnesses, adversarial evaluators, and structured handoffs.

The signal is loud enough that the GitHub trending board has spent four consecutive days dominated by repos that operationalize one or more of those three components. Skills registries, agent-memory layers, agent toolkits — they are the production-pattern equivalent of the same engineering shift the conference talks are naming out loud.

This piece is the practitioner version of those talks. What each one says, where they overlap, where they diverge, and how to spend the next hour of your harness work for the biggest gain in agent autonomy.

The frame shift. Before this week, "skills" was a Claude Code feature. After this week, harness-evaluator-handoff is a category — and skills, sub-agents, memory, validators, and CI loops are how you build it. The substrate is now the thing you ship, not the model.

The three talks

The three converging sources, in order of impact:

1. Anthropic — Build Agents That Run for Hours (Ash Prabaker & Andrew Wilson). The argument is structural: self-evaluation is a trap. A model asked to grade its own work returns false positives that compound across long chains. The fix is adversarial evaluator agents — separate agent instances with antagonistic objectives that grade each handoff, force re-do, and break out of plausible-but-wrong trajectories. Pair that with structured handoffs — explicit, schema-typed state transfers — and you get long horizons that don't drift.

2. IBM — Harnesses in AI: A Deep Dive (Tejas Kumar). The diagnostic is sharp. Kumar walks through a browser-agent failure: the agent reports success, but the screenshot shows a login page. The model didn't hallucinate — the harness lacked the verification step that would have caught the login redirect. His thesis: most "agent reliability" failures are harness gaps, not model gaps. The model is fine; the substrate around it is missing the loops that would have caught the deviation.

3. AI LABS — ADLC: Claude Code's New Lifecycle (video). The framing is process. Agent Development Lifecycle — ADLC — is named as the successor to "vibe coding." The implied developer workflow: stop iterating on prompts; start iterating on the agent's production lifecycle — handoff schemas, sub-agent decomposition, evaluator harnesses, replay-and-debug loops.

Three different speakers. Three different angles. One shared thesis: the substrate around the model is now where the engineering happens.

What "harness" actually means

The word "harness" got popular fast and is now overloaded. Three definitions live in the wild — they describe different things and confusing them costs hours of design discussion.

Harness as runtime. Claude Code, Cursor's agent mode, Codex CLI, gstack, archon, obra superpowers — the program that boots the model, loads skills, threads memory, manages tool calls, and renders output. This is the most common usage. When IBM's Tejas Kumar talks about "harness gaps," this is what he means.

Harness as skill-pack assembly. The collection of SKILL.md files, CLAUDE.md instructions, and sub-agent definitions a developer installs in a project. GStack, the Anthropic skills repo, academic-research-skills — these are harnesses in the sense that they shape what the model does, even though they ship as data, not as runtime code.

Harness as test/eval substrate. The evaluator agents, CI checks, replay loops, and adversarial graders that wrap an agent run to detect failure. This is the Anthropic talk's "harness" — the structural pieces that catch a long-run agent before it confidently delivers garbage.

For the rest of this piece, we use harness = runtime + assembly, and we treat evaluator and handoff as separate first-class categories. The Anthropic talk's "evaluator harness" is then just "harness + evaluator" — two things at once.

Adversarial evaluators — why self-grading fails

The Anthropic talk's most useful technical claim is that self-evaluation is a trap.

When you ask a Claude or GPT instance to grade its own output, you get one of three failures:

Plausibility blindness. The model that produced the answer cannot distinguish its own confident-but-wrong outputs from confident-and-right ones. They feel identical from the inside.
Reward-hacking under cost pressure. When evaluator and producer share token budget, the model has every incentive to mark the work passing and move on.
Context dilution. A long agent run carries thousands of tokens of intermediate work. Asking the same model to grade against that context invites context-snowing — the answer that "fits the trajectory" gets marked as correct because it's locally plausible.

The Anthropic fix is adversarial evaluator agents. A separate agent instance — different system prompt, different tools, different objective — runs in parallel and grades the producer's output against an adversarial standard. It is paid (in the design sense) to find failure, not to confirm success. When the evaluator vetoes, the producer is forced to re-do.

This is the same pattern that won in software engineering twenty years ago when "developer writes a test for their own code" was replaced by "QA tries to break it." The engineering substrate for agents is converging on the same workflow.

The practitioner version of this for a typical Claude Code stack:

Two sub-agents, antagonistic system prompts. Producer's job: ship the feature. Evaluator's job: find what's missing. Evaluator returns a structured veto with reasons; producer must address before continuing.
Schema-typed grade objects. Don't let the evaluator return free-form text — make it return {pass: bool, blockers: [...], suggestions: [...]} so the producer can branch on the result instead of free-reading.
One adversarial sub-agent per high-stakes handoff. Don't grade every step — grade the transitions where bad state compounds (test-pass → ship, plan → execute, design → implement).
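The producer/evaluator pattern above can be sketched in a few lines. This is a minimal illustration, not the Anthropic talk's implementation: the producer and evaluator are stub functions standing in for real sub-agent calls, `passed` replaces `pass` because `pass` is a reserved word in Python, and the names `Verdict` and `run_handoff` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Verdict:
    """Schema-typed grade object returned by the evaluator."""
    passed: bool
    blockers: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


Producer = Callable[[str, List[str]], str]   # (task, blockers to fix) -> draft
Evaluator = Callable[[str, str], Verdict]    # (task, draft) -> verdict


def run_handoff(task: str, produce: Producer, evaluate: Evaluator, max_rounds: int = 3) -> str:
    """Loop until the evaluator passes the draft or the round budget runs out."""
    blockers: List[str] = []
    for _ in range(max_rounds):
        draft = produce(task, blockers)
        verdict = evaluate(task, draft)
        if verdict.passed:
            return draft
        blockers = list(verdict.blockers)    # producer must address these next round
    raise RuntimeError(f"evaluator still vetoing after {max_rounds} rounds: {blockers}")


# Stubs standing in for real sub-agent calls.
def stub_producer(task: str, blockers: List[str]) -> str:
    return f"{task} [with tests]" if blockers else task


def stub_evaluator(task: str, draft: str) -> Verdict:
    if "[with tests]" in draft:
        return Verdict(passed=True)
    return Verdict(passed=False, blockers=["no tests included"])


result = run_handoff("implement feature", stub_producer, stub_evaluator)
print(result)
```

Because the verdict is a typed object, the loop branches on `verdict.passed` and feeds `verdict.blockers` back to the producer instead of parsing free-form text. The round cap is a design choice for the sketch: it keeps a stubborn evaluator from looping forever and makes the failure visible.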

Patterns like Archon's deterministic-review loop and the agent-judge layer on AgentConn are concrete production versions of this exact idea.

Structured handoffs — the schema problem

The second piece of the engineering thesis is structured handoffs.

A "handoff" is what happens when one agent step ends and another begins. The naive version: the previous step appends to a free-form context buffer; the next step reads the buffer and decides what to do. This works for two or three hops. It does not work for hours.

The failure mode is well-known. Free-form handoffs lose schema. The agent at step 18 reads a context buffer that's 15K tokens long and has no canonical way to ask "what is the current state of the build?" Some of the answer is in step 3's output. Some is in step 11's tool call. Some is in step 14's revised plan. The model does its best — and confidently reports a state that is partly true.

The fix is making state transfers typed and explicit. Each handoff carries a schema. Each handoff replaces the relevant slot in a structured world model. Each sub-agent reads only the slots it needs.

What that looks like in practice:

A working-state object. A JSON document the agent maintains and updates explicitly. {tasks_done, tasks_pending, files_modified, tests_passing, blockers, last_evaluator_verdict}. The agent reads-writes this object the way a service reads-writes a database — not by re-reading the entire conversation.
Memory layers. agentmemory's ★1,226 day-one trending position is the signal here. Persistent agent memory is becoming the standard layer that survives across handoffs and across sessions. It's why "long-running" went from prompting curiosity to infrastructure category in eight months.
Sub-agent contracts. Each spawned sub-agent runs with a typed input ("here is your task, here are your tools, here is the slot you write to") and a typed output. The parent agent reads the output slot — not the conversation transcript. This is the convention the Anthropic skills repo pushes on, and it's what makes the SKILL.md format more than a prompt template.
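A working-state object of the kind described above might look like the following. The field names come from the list above; the field types, the `complete_task` helper, and the JSON round-trip methods are assumptions made for this sketch rather than any particular library's API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class WorkingState:
    """Explicit state the agent reads and writes instead of re-reading the transcript."""
    tasks_done: List[str] = field(default_factory=list)
    tasks_pending: List[str] = field(default_factory=list)
    files_modified: List[str] = field(default_factory=list)
    tests_passing: bool = False              # assumed to be a single flag
    blockers: List[str] = field(default_factory=list)
    last_evaluator_verdict: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "WorkingState":
        return cls(**json.loads(raw))


def complete_task(state: WorkingState, task: str, files: List[str]) -> WorkingState:
    """Typed handoff: move one task from pending to done and record touched files."""
    if task not in state.tasks_pending:
        raise ValueError(f"task is not pending: {task!r}")
    state.tasks_pending.remove(task)
    state.tasks_done.append(task)
    for path in files:
        if path not in state.files_modified:
            state.files_modified.append(path)
    return state


state = WorkingState(tasks_pending=["write parser", "add tests"])
state = complete_task(state, "write parser", ["parser.py"])

# The next step, or the next session, restores state from JSON rather than from the conversation.
restored = WorkingState.from_json(state.to_json())
print(restored.tasks_pending)
```

A sub-agent written against this object only needs the slots named in its contract. The step that answers "what is the current state of the build?" reads `tests_passing` and `blockers`, not 15K tokens of history.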

The agentmemory drop on GitHub trending is the practitioner-side proof that this is the actual production bottleneck right now. Memory layers don't move a model leaderboard. They move the duration over which an agent stays useful — which is the variable that just became visible.

The ADLC piece — process, not just architecture

The third talk, AI LABS' ADLC framing, is about process — and it's the one with the most direct operator value.

ADLC ("Agent Development Lifecycle") argues that the failure mode of most agent work in 2026 is not technical — it's lifecycle-shaped. Developers iterate on prompts when they should be iterating on harness, evaluator, and handoff design. Sprints get framed as "tune the model" when they should be framed as "instrument the agent run."

What changes if you adopt the ADLC stance:

Replay-first debugging. Every agent run captures its tool-call log, sub-agent decisions, and intermediate state. When the agent fails, you don't re-run with a tweaked prompt — you replay the failure and identify which substrate component (harness, evaluator, handoff) let the wrong state through.
Eval-first feature work. Before you ship a new agent capability, you write an evaluator that catches the failure mode you're afraid of. The evaluator gates the merge. This is the agent-equivalent of TDD.
Sub-agent decomposition as architecture, not optimization. ADLC treats sub-agent decomposition the way a backend team treats microservices: a structural choice up front, not a performance tweak after.
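Replay-first debugging needs very little machinery to start. The sketch below is a crude version under stated assumptions: it records each event as a JSON line and "replays" by walking the recorded events with a check, rather than re-executing the tools. The `RunLog` class, the event fields, and the example URLs are all hypothetical; the login-page check mirrors the browser-agent failure from the IBM talk.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class RunLog:
    """Append-only record of tool calls, sub-agent spawns, and state updates."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def record(self, kind: str, name: str, payload: Dict[str, Any], result: Dict[str, Any]) -> None:
        self.events.append(
            {"step": len(self.events), "kind": kind, "name": name,
             "payload": payload, "result": result}
        )

    def dumps(self) -> str:
        # One JSON object per line, so the log can be appended to a file.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)

    @classmethod
    def loads(cls, raw: str) -> "RunLog":
        log = cls()
        log.events = [json.loads(line) for line in raw.splitlines() if line.strip()]
        return log


def replay(log: RunLog, check: Callable[[Dict[str, Any]], bool]) -> Optional[int]:
    """Return the step index of the first recorded event that fails `check`, or None."""
    for event in log.events:
        if not check(event):
            return event["step"]
    return None


log = RunLog()
log.record("tool_call", "open_page", {"url": "https://example.com/dashboard"}, {"title": "Dashboard"})
log.record("tool_call", "open_page", {"url": "https://example.com/report"}, {"title": "Log in"})
log.record("state_update", "mark_done", {"task": "export report"}, {"ok": True})


def not_a_login_page(event: Dict[str, Any]) -> bool:
    return event["result"].get("title") != "Log in"


first_bad = replay(RunLog.loads(log.dumps()), not_a_login_page)
print(first_bad)
```

The run reported success at step 2, but the replay points at step 1, where the harness let a login redirect through. That is the substrate component to fix, and no prompt was changed to find it.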

The tokenmaxxing YC operator pattern and the cursor-skills-as-runtime shift are both ADLC moves — the operators who win in 2026 treat the agent as a system, not a chat session.

The Anthropic Skills announcement is the substrate piece going public. Treat the convention as the contract — author skills with explicit inputs/outputs, version them, and version the handoff schema independently of the model. That's the substrate move that survives a model rotation.

Where the three talks disagree

The talks agree on the high-level pattern. They disagree on the substrate layer that matters most.

Anthropic prioritizes the evaluator. The Anthropic talk is engineered around the claim that adversarial evaluators are the binding constraint on hour-scale autonomy. Their bet is that handoff schemas matter, but evaluator quality matters more — a good evaluator catches a bad handoff; a good handoff cannot save a bad evaluator.

IBM prioritizes the harness substrate. Tejas Kumar's diagnostic frame says the substrate around the model is where most agent failures live. His implicit ranking: harness > evaluator > handoff. Fix the substrate, and the rest follows.

ADLC prioritizes the lifecycle. AI LABS' framing pushes process. Their implicit claim: harness, evaluator, and handoff design all converge if you build a replay-first, eval-first development loop. Without the lifecycle, the three components drift.

The right read for an operator: all three are right at different scales. Prototype-scale agents need a usable harness first. Hour-scale agents need adversarial evaluators. Org-scale agent systems need the ADLC lifecycle so the substrate stays maintainable. Pick the one that's failing for you and invest there first.

What to ship in the next hour

If you have an hour to spend on hour-scale agent autonomy this week, here are the three highest-leverage moves — pick the one that matches where your stack hurts.

1. If your agent confidently delivers wrong work: add an adversarial evaluator sub-agent for the highest-stakes handoff in your flow. One sub-agent. Different system prompt. Structured veto schema. Wire it into the merge-blocking step. You will spend an hour and catch 30% more failures.

2. If your agent loses state mid-run: add a working-state JSON object the agent reads and writes explicitly. Strip the free-form context dependency. The agent at step 20 should not need to re-read the conversation — it should read_state() and act. Memory layers like agentmemory make this concrete; auto-dream context files are another shape of the same idea.

3. If your agent runs are unreproducible: instrument the tool-call log so you can replay failures. The ADLC win is that you stop debugging by re-prompting and start debugging by re-running. Even a crude replay loop — log every tool call, every sub-agent spawn, every state update — pays back the same week.

The bigger frame: the category just hardened. "Build an agent" used to mean "compose a prompt and pray." This week it means "design a harness, an evaluator, and a handoff schema, then iterate on those." The shift looks the same on the GitHub trending board, in the Anthropic conference talk, and in the practitioner threads on Hacker News. When three independent surfaces all describe the same engineering shift in the same week, it's not a vibe; it's a category.

Originally published at AgentConn.


Anthropic at 92%: Three Surfaces Tell the Same Story

Max Quimby — Tue, 19 May 2026 03:32:13 +0000

Three independent telemetry surfaces just rhymed.

📖 Read the full version with charts and embedded sources on computeleap →

Polymarket's "Which company has the best AI model end of May" sits at Anthropic 92%, Google 6%, OpenAI 1% — on $428,985 of 24-hour volume and $2.4M of liquidity, the deepest book in the AI-market category. Boris Cherny's Ramp AI Index post showed Anthropic at 34.4% of enterprise card spend versus OpenAI at 32.3%, with Anthropic's adoption up roughly 4× year-over-year while OpenAI sat flat at +0.3%. And on the developer surface, four of the top eleven repos on GitHub trending today are explicitly Claude-Code-skills-shaped — the fourth consecutive day of the same compositional pattern.

Three surfaces. One name. The interesting question is not whether Anthropic is winning — every telemetry surface we can read agrees that it is. The interesting question is what each surface is actually measuring, where the three disagree, and what an operator should do with the disagreement.

ℹ️ The triangulation play. When prediction-market price, enterprise-spend share, and developer-mindshare all move together, you have something rare in markets: three independent observers pricing the same outcome through different cost functions. The signal isn't the agreement. It's the disagreement points — that's where there's still alpha.

1. The Polymarket leg — what's priced, what's not

The headline market — which-company-has-the-best-ai-model-end-of-may — prices Anthropic at 92% with thirteen days left in the month. With $2.4M of liquidity sitting on the book, this is not a thin Polymarket curiosity. It is the deepest AI-leaderboard market currently trading.

Three things to notice before reading this as "Anthropic won."

First, the Style Control On variant — same question, but normalized for response-length and formatting effects on LMSYS Arena — sits at 93% but has moved down 17% this month. The market is less convinced of Anthropic dominance under style-controlled conditions than under raw conditions. That gap is the leaderboard's response-length bias being priced in real time.

Second, when you shift the horizon to year-end, the picture inverts. The market which-companies-will-have-a-1-ai-model-by-december-31 — which companies will hold #1 at any point through end-of-year — prices Google at 72%, OpenAI at 41%, and xAI at 20%. Anthropic does not break the top three on the multi-month horizon. The market believes Anthropic owns right now but expects the lead to rotate before year-end. That's a strong claim and a tradeable one.

Third, June already prices differently. Anthropic drops from end-of-May 92% to end-of-June 74% on the equivalent market. The market expects a release cadence from one or more of (Google Gemini 3.5, OpenAI GPT-5.5 successor, xAI Grok) that closes some of the gap inside thirty days. If you are making a vendor decision based on the 92%, you are making a six-week decision, not a twelve-month decision.

2. The Ramp leg — what card-spend telemetry actually proxies for

Ramp is a corporate card platform. The Ramp AI Index reports the share of card-spend on AI vendor categories across its customer base. A post by Boris Cherny, an Anthropic engineer, flagged the latest cut: Anthropic 34.4% of AI card-spend, OpenAI 32.3%, with Anthropic's curve up roughly 4× YoY while OpenAI's curve has gone effectively flat at +0.3%.

The Ramp data is the most-cited and the most-misread of the three legs. Two things it is not, and one thing that holds up:

It is not usage. Card-spend tells you who is paying. It does not tell you how many tokens, how many users, or how many seats are deployed. A company that buys $50K of Claude credits via corporate card and a company that buys $50K of ChatGPT Enterprise via the same card look identical to Ramp. Whether one is being aggressively rolled out and the other shelved is invisible.

It is not market share. Ramp's customer base is a specific cut of US-based, post-Series-A SaaS-and-fintech companies. Enterprise contracts that flow through procurement and AP, not corporate cards, are entirely outside the dataset. The big-ticket OpenAI enterprise deals (Microsoft-routed, custom-billed) are precisely the kind of transactions that do not appear here.

The YoY ratio is real. Even with the caveats, the 4× year-over-year vs. +0.3% comparison is striking. It is consistent with the story that Claude has become the new-budget AI vendor across the Ramp customer cohort — the line item that mid-market companies added in the last twelve months. OpenAI's flat curve is not a decline; it is a saturation pattern. The companies who were going to use OpenAI a year ago still are. The new companies are buying Anthropic.

The right way to read Ramp is as a leading indicator for SaaS-and-fintech-startup adoption, not a definitive market-share figure. It rhymes with the API-developer-platform story we covered — the developer surface is where Anthropic is winning incremental dollar, and Ramp is the cleanest telemetry for that surface.

3. The GitHub mindshare leg — category consolidation

The third surface is the noisiest and the most interesting.

For four consecutive days, GitHub trending has been dominated by repos with a Claude-Code-skills shape. Today's cut: Imbad0202/academic-research-skills at ★1,302, tech-leads-club/agent-skills at ★1,244, rohitg00/agentmemory at ★1,226, and K-Dense-AI/scientific-agent-skills at ★610. Yesterday it was Karpathy's CLAUDE.md template and Composio's skills bundle. The day before, mattpocock's directory race and the openhuman runaway #1.

Four days of the same composition is not a meme — it is a category consolidating. The repos are converging on three archetypes:

Skills registries — curated bundles of .md files describing capabilities a Claude Code instance can pull in at runtime. Academic-research-skills, agent-skills, scientific-agent-skills, the obra superpowers framework variant — all share the same directory layout, the same SKILL.md frontmatter, the same Anthropic-published Skills spec underneath.
Agent memory — persistent context layers that survive across sessions. agentmemory's ★1,226 day-one is the signal that "long-running" has become an explicit engineering category, not a prompting trick.
Agent toolkits — earendil-works/pi, CLI-Anything, microsoft/ai-agents-for-beginners. The framing is "make every CLI agent-native" — explicitly framed against the Claude Code substrate as the reference target.

What makes this a mindshare signal rather than a usage signal: GitHub stars don't pay rent. The developers chasing these repos are not paying Anthropic. But they are choosing what to read, what to fork, what to publish on top of — and the substrate they are publishing on top of is Claude Code. That's how categories form. Not when one vendor wins the spend; when the publishing surface tilts toward one vendor's substrate as the default to extend.

4. Where the three surfaces disagree

The three telemetry surfaces agree on direction. The interesting question is where they diverge.

Disagreement #1: Time horizon. Polymarket is bullish at four weeks and bearish at eight months — the multi-month horizon prices Google at 72% to hold #1 at some point in the year. Ramp is structurally a six-month indicator (card-spend rebases slowly). GitHub mindshare is a four-week indicator (categories consolidate fast and rotate faster). If you trust all three, you should trust Anthropic dominance for the next thirty days, hedge for a Q3/Q4 rotation, and assume the GitHub-trending composition will look completely different by August.

Disagreement #2: What "win" means. Polymarket prices model-leaderboard wins. Ramp prices spend share. GitHub prices substrate share. These are three different products. It is possible — and historically common — for one vendor to own model-leaderboard while another owns spend share while a third owns substrate. The current pattern of all three pointing the same way is the unusual case. The more common case is fragmentation. If the disagreement starts opening, watch the substrate leg — substrate share is the stickiest of the three.

Disagreement #3: Who's actually using the product. Ramp says SaaS-and-fintech mid-market. GitHub mindshare says developers and AI-tooling builders. Polymarket prices a leaderboard that LMSYS Arena visitors curate. None of these three populations is "enterprise IT buyer at a Fortune 500." The biggest blind spot in the triple-leg is the procurement-driven enterprise segment that flows through neither cards, nor GitHub, nor LMSYS — and where OpenAI's Microsoft channel is still the dominant lane.

5. The Stainless acquihire — Anthropic builds the platform layer

The under-reported story today is the "Anthropic acquires Stainless" news, which hit the HN front page at ★70 with thirty-four comments. Stainless is an SDK-generator company: it produces typed SDKs from OpenAPI specs for the kind of companies that ship developer platforms.

The HN comment thread reads this as an acquihire, with the hosted Stainless product being wound down. That's not the interesting framing.

The interesting framing is what Anthropic is signaling about its own product roadmap. If you're acquiring an SDK-generation team, you're not optimizing for prompt engineering. You're optimizing for an agent API — the kind of platform surface where third-party developers publish skills, sub-agents, and integrations and Anthropic ships them typed bindings across five languages.

Combine that with this week's other signals:

The Build Agents That Run for Hours Anthropic talk laying out adversarial-evaluator + structured-handoff patterns
The ADLC framing from AI LABS — "Agent Development Lifecycle" as a successor to vibe coding
The Peter Yang Anthropic interview where Anthropic says "we treat the model like a product, but model development is growing, not speccing"
The Karpathy CLAUDE.md template going viral with hundreds of repos shipping the convention

The composition is clear. Anthropic is no longer just a model vendor. It is building the platform layer — the runtime, the skills convention, the SDK pipeline, the agent lifecycle. The Stainless acquihire is the SDK piece of the puzzle.

The HN comment thread on today's Stainless news lands on a similar read — the wind-down of the hosted Stainless generator is being framed not as a product failure but as a transparent acquihire-for-capability play, with the SDK pipeline migrating into Anthropic's own developer surface. That is the platform-build signal in plain sight.

⚠️ For operators. Three months ago the question was "should we use Claude or GPT?" Today the question is closer to "should we build on top of Claude Code's runtime, or roll our own?" That is a different question with a different answer. If Anthropic continues consolidating the substrate layer, the build-vs-buy line on agent runtimes is going to shift fast. The "we'll do it ourselves on top of an OpenAI-compatible interface" answer that worked in 2024 doesn't hold up against a Claude Code platform that publishes typed SDKs, ships skills, and has four consecutive days of trending repos extending it.

What an operator should do this week

Three concrete actions:

1. Re-time your vendor review. If you locked a six-month contract this quarter on the assumption that the model leader was stable, the Polymarket end-of-June pricing (Anthropic 74% vs end-of-May 92%) says you should re-time your review window. The market believes the lead is contestable within thirty days. Bake quarterly review checkpoints in.

2. Read your own Ramp. If your finance team uses Ramp or Brex, pull your last six months of AI-vendor card spend. The Ramp aggregate masks enormous variance — your specific cohort may be 70/30 Claude or 70/30 OpenAI. The aggregate gives you the macro; your own cut gives you the action. The Anthropic-vs-OpenAI overtake story shows up in customer data well before it shows up in aggregate filings.

3. Watch the substrate layer. Don't optimize your stack for the model leaderboard — optimize it for the substrate. If the GitHub trending pattern persists another two weeks, the answer is to invest in skill authorship and runtime instrumentation on the Claude Code substrate, regardless of which model leads next quarter. The substrate is the durable bet; the leaderboard is the trade.

The 92% number is loud. The disagreement points are louder. The story under the story is the platform pivot — and that is the thing that lasts past June 30.

Originally published at ComputeLeap.

Codex Pulling Ahead of Claude Code? Read the 2026 Shift

Max Quimby — Mon, 18 May 2026 03:53:57 +0000

Three independent creators all dropped "Codex is pulling ahead of Claude Code" takes on the same day this week. Nate B Jones and Tibo did a head-to-head and concluded that Codex was the daily driver now. Chase AI's "Time to Switch?" workshop went out the same morning. A third creator landed a Claude-Code-to-Codex switch post on Medium, arguing that Codex's /goal command and 4x token efficiency made the choice obvious. Three creators, one direction, one day.

Meanwhile, r/ClaudeAI hit thousands of upvotes on a single post about the operator emotion underneath all of this: devs are tired of reviewing AI-generated PRs they didn't initiate. Brian Douglas's "Death by a Thousand AI Pull Requests" Substack from the open-source side made the same point in a different vocabulary. The category moment isn't "Codex won." It's "the agent-PR-review loop broke, and we're sorting out which agent fits which seat in that loop."

This piece reads that moment cleanly. What actually changed in the last 14 days, what didn't change, and the operator question that matters: does the answer reroute your stack — or your review process?

What actually changed in the last 14 days

Three concrete things landed in the May 2026 window.

Codex's /goal command crossed the autonomy threshold. Up until April, Codex's autonomous loop topped out around 20-30 minute runs before drifting. The May Codex release tightened the plan-act-test-review cycle enough that it now sustains multi-hour autonomous sessions on the right kind of task — codebase-wide migrations, dependency upgrades, test backfills. The New Stack tested it on a real Python codebase and called it "the strongest Claude Code rival yet" — explicitly framing it as a daily-driver shift, not a benchmark win. The benchmarks themselves moved with it: GPT-5.5 now leads SWE-bench Verified at 88.7%, edging Claude Opus 4.7's 87.6%, and leads Terminal-Bench at 82.7%.

The cost gap widened in a single direction. A widely-circulated Express.js refactor benchmark cost roughly $15 on Codex versus $155 on Claude Code for the same task — a 10x gap. The token-per-task delta isn't subtle anymore. For a small team running a daily-driver coding agent 4-8 hours a day, that gap is the difference between $200/month and $2,000/month in agent costs. The math now lands in a place where switching cost can be paid back in a single billing cycle.

The Anthropic skills ecosystem kept compounding. Even as Codex pulled ahead on raw daily-driver mechanics, Anthropic shipped Code Review and the broader skills directory race kept tilting toward the Claude ecosystem. Mitchell Hashimoto's skill stack, the tech-leads-club agent-skills registry, and obra/superpowers all sit in the Claude orbit. That ecosystem doesn't move when daily-driver preference shifts. It's a separate moat operating on a separate timeline.

So two things shifted toward Codex, and one didn't. The takes converging on "Codex won" are reading the first two and ignoring the third.
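The payback claim in the cost paragraph above can be checked with a few lines of arithmetic. This is a minimal sketch: the per-task and monthly figures come from the text, while the one-time switching cost is a hypothetical placeholder, since the article gives no figure for it.

```python
def payback_months(monthly_cost_old: float, monthly_cost_new: float,
                   switching_cost: float) -> float:
    """Months until cumulative monthly savings cover a one-time switching cost."""
    monthly_savings = monthly_cost_old - monthly_cost_new
    if monthly_savings <= 0:
        raise ValueError("no savings: the new agent is not cheaper")
    return switching_cost / monthly_savings


# Per-task figures from the Express.js refactor benchmark quoted in the text.
cost_ratio = 155 / 15

# Monthly figures from the text; SWITCHING_COST is a hypothetical placeholder.
SWITCHING_COST = 1_500
months = payback_months(2_000, 200, SWITCHING_COST)

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"payback: {months:.2f} months")
```

With these inputs the payback lands inside one billing cycle. Swap in your own migration estimate before reading anything into it.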

What didn't change

Three things stayed put.

Anthropic still owns the model-quality consensus. Polymarket's end-of-May "best AI model" market has Anthropic at ~82%, Google at ~19%, OpenAI well behind. That's not a benchmark consensus — that's a money-weighted consensus across thousands of traders pricing the actual perception of model leadership. Anthropic also holds SWE-bench Pro at 64.3% vs GPT-5.5's 58.6% — a 5.7-point gap on the harder, less-saturated benchmark. The "Codex pulled ahead" takes are talking about the coding agent runtime, not the underlying model. Conflating the two is the most common error in this week's coverage.

Code quality on the produced output still favors Claude. Blind reviews of completed work rated Claude Code's output cleaner 67% of the time to Codex's 25%. That gap shows up most on frontend UI work, refactors that need to match an existing codebase's idiom, and any task where the diff has to read well to a human reviewer six months later. Codex ships the feature faster and cheaper. Claude ships a smaller, cleaner diff. If your downstream cost is "code that future engineers can actually maintain," the trade isn't obvious.

The skills ecosystem is still gravitationally Anthropic-aligned. This week's GitHub trending told the same story — the skills-folder repos kept dominating, Zerostack's pure-Rust coding agent hit HN at 499 points framed as an Anthropic-ecosystem alternative, and the agent-runtime category overall stayed weighted toward Claude-orbit tooling. Codex has the daily-driver crown. The surrounding ecosystem isn't migrating with it on the same timeline.

The PR-review meme is doing the actual work

Now the part most of the comparison posts skip.

The signal that's traveling fastest this week isn't a Codex review. It's the r/ClaudeAI PR-review fatigue thread, echoed by Brian Douglas's "Death by a Thousand AI Pull Requests" on Substack. The operator emotion is consistent: agents are generating PRs faster than humans can meaningfully review them. The unit of work that's becoming the bottleneck isn't writing code — it's reading code you didn't write and forming judgment on whether to merge it.

That's a different problem from "which agent should I use." It's a workflow problem, and it doesn't care which model wrote the diff.

Anthropic's response to this (Code Review for Claude Code, launched March 9, 2026) is interesting precisely because it's not a competing-agent feature. It's a competing-reviewer feature. Boris Cherny's framing on the launch is direct: code output per Anthropic engineer is up 200% this year, and reviews became the bottleneck. The fix is more agents, on the other side of the diff.

The HN thread on the launch made the deeper question explicit, and most of the top comments landed on it: if the same vendor's agent both writes and reviews the code, is that even review? One commenter put it bluntly: "Why didn't the AI write the correct code in the first place?" Another: "So their business model is to deliver me buggy code and then charge me to fix it?" The skepticism is reasonable. The fact that it costs $15-25 per PR review is also a real cost line a team has to plan for.

But the operator framing matters here. The PR-review bottleneck is real, the human-review channel is genuinely saturated on teams shipping ~30+ agent-PRs/day, and "agent reviews agent" isn't the only option — it's just the only option that exists today. The teams that solve this first will be the ones that take the review loop as seriously as the generation loop, which most teams currently don't.

Does this change your stack, or your review loop?

That's the operator question this week's convergence actually poses. The two have different answers.

On the stack question: probably not, for most teams. If you're running a Claude Code stack today and shipping, the right move is to add Codex as a second daily-driver rather than switch wholesale. The tokenmaxxing pattern we covered last month is the canonical version: route long-horizon autonomous tasks to Codex (where the /goal loop pays off and the token math wins), route quality-sensitive refactors and frontend work to Claude (where the cleaner-diff bias pays off), and keep skills/MCP infrastructure on the Anthropic side. The 500-Reddit-developer survey from this week confirms the pattern — 65% prefer Codex for daily coding, but most serious teams run both. The $20+$20/month Pro-subscription combo is the unsexy answer that's quietly winning.
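The run-both routing described above can be sketched as a small lookup table. The task-type labels and the fallback default are hypothetical. Neither vendor defines these categories, so treat this as an illustration of the pattern, not a recommended taxonomy.

```python
from dataclasses import dataclass

# Routing table following the split described in the text. The task-type
# labels are hypothetical; they are not defined by either vendor.
ROUTES = {
    "migration": "codex",            # long-horizon autonomous work
    "dependency_upgrade": "codex",
    "test_backfill": "codex",
    "frontend": "claude-code",       # quality-sensitive, diff must read well
    "refactor": "claude-code",
}


@dataclass
class Task:
    kind: str
    description: str


def route(task: Task, default: str = "claude-code") -> str:
    """Return the agent name for a task, falling back to a default (an assumption)."""
    return ROUTES.get(task.kind, default)


print(route(Task("migration", "move the service to the new ORM")))
print(route(Task("frontend", "rework the settings page")))
```

Keeping the table explicit makes the routing decision reviewable, which matters more than the exact categories chosen.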

If you're starting fresh, the calculus is different. A team that's never paid Claude Code's per-seat plan and can architect around Codex's autonomous loop from day one will save real money. But that's not the median team this week.

On the review-loop question: yes, this is the move you should make first. Specifically:

Measure your current PR-review queue. How many agent-generated PRs hit your repo per day? What's the average human-eyeball time per PR? If you're past ~10 PRs/day and human review is sub-3-minutes, you're already in the saturation zone — the merge is becoming a rubber-stamp regardless of whether you've named the problem.
Decide what "agent reviews agent" looks like for you. Claude Code Review is the most polished option today. The alternative is rolling your own with a second agent in CI (which works fine for most teams). Either way, the goal is a second pass with different incentives — bug-hunting incentives, not generation incentives. Don't let the same agent write and approve.
Set a per-PR cost budget. $15-25 per review at Claude Code Review pricing is meaningful at scale. A 30-PR/day team is looking at roughly $450-750/day in review costs if it runs on every PR. The right move is per-PR-size tiering: heavy review on PRs over 200 LOC, lightweight review under that. Build the tiering into the merge process.
Reclaim human review for the things humans uniquely do. Architecture-level judgment, intent verification, and "is this the right feature" calls aren't review-agent territory — they're senior-engineer territory. The point of the review-agent layer is to free that time up, not to replace it.
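The saturation check and the size-based tiering from the steps above can be sketched as follows. The thresholds and the $15-25 price range come from the text. The sample PR sizes are hypothetical, and the sketch assumes the lightweight tier has negligible cost (for example, a second agent already running in CI), which the article does not state.

```python
from typing import Sequence, Tuple

REVIEW_COST_RANGE = (15.0, 25.0)  # USD per agent review, from the pricing quoted above
HEAVY_REVIEW_LOC = 200            # tiering threshold suggested above
SATURATION_PRS_PER_DAY = 10
SATURATION_REVIEW_MINUTES = 3


def is_saturated(prs_per_day: float, avg_review_minutes: float) -> bool:
    """True when PR volume is high and human review time per PR is very short."""
    return (prs_per_day > SATURATION_PRS_PER_DAY
            and avg_review_minutes < SATURATION_REVIEW_MINUTES)


def review_tier(lines_changed: int) -> str:
    """Heavy review above the LOC threshold, lightweight review otherwise."""
    return "heavy" if lines_changed > HEAVY_REVIEW_LOC else "light"


def daily_heavy_review_cost(pr_sizes: Sequence[int]) -> Tuple[float, float]:
    """Low and high daily cost, counting only heavy-tier reviews (assumption)."""
    heavy = sum(1 for loc in pr_sizes if review_tier(loc) == "heavy")
    low, high = REVIEW_COST_RANGE
    return heavy * low, heavy * high


# Hypothetical day: 30 PRs, 8 of them over the 200-LOC threshold.
pr_sizes = [40] * 22 + [350] * 8
print(is_saturated(prs_per_day=30, avg_review_minutes=2.5))
print(daily_heavy_review_cost(pr_sizes))        # tiered
print(daily_heavy_review_cost([350] * 30))      # every PR gets the heavy review
```

Under these assumptions, tiering cuts the paid review bill to the share of PRs that actually cross the size threshold.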

That's the actionable read on this week's convergence. The Codex-vs-Claude-Code debate is real but mostly resolves to "run both." The PR-review-loop problem is real and resolves to "build the review layer, don't let it stay implicit."

The market read

A last note on the framing wars. Every couple of months in 2026, one of the major agent vendors has a two-week stretch of dominant takes. February was Claude Code's. April was Codex Mobile's. This week is Codex's again. The pattern in each cycle is the same: a feature ships that genuinely moves the daily-driver line, three creators converge on the same take within 48 hours, and the take then calcifies into "X won" framing that lasts about three weeks before the next vendor releases something.

If you're building product around AI coding agents, you should expect this cadence to continue through the year and not over-fit your stack to any single two-week window. The teams that quietly run both and route by task type are accumulating an advantage that won't show up in the convergence cycles — but will compound across them.

The narrative shift is real. It's just smaller than the takes are pricing it.

For deeper-dive paths: GSD-2 vs Claude Code vs Codex CLI is our long-form harness comparison from earlier this year. Tokenmaxxing covers the YC-operator pattern of running both. Codex Mobile Operator Playbook covers Codex's mobile angle specifically. And DeepClaude vs Claude Code vs Codex Pro is the cost-stack comparison that started this whole thread.

Originally published at the original site.

Per-Seat SaaS Is a Liability: A 2026 Operator's Checklist

Max Quimby — Mon, 18 May 2026 03:53:46 +0000

In one 24-hour window this week, four independent surfaces converged on the same thesis from four different angles. Palantir's deployment team told the supply-chain industry that SaaS is dead. Salesforce CEO Marc Benioff told the All-In Podcast that this is the "current SaaSpocalypse", his third in two decades. Hacker News pushed a piece titled "Every AI Subscription Is a Ticking Time Bomb for Enterprise" to 275 points, where the top-comment math priced the gap between today's subsidized seat licenses and tomorrow's API-grade reality. And John Gruber, of all people, weighed in from Cupertino with a quiet line that anchored the rest: AI is technology, not a product.

That convergence is the story. Not the death of SaaS. Not the rebirth of bespoke software. The story is what those four signals look like from the inside of a CFO's desk in the second half of 2026, with a stack of renewals waiting to be priced for 2027.

We've written before about how Claude and the agent-native runtimes are eating SaaS distribution from the platform side. This piece is the flip — the enterprise buyer's view. What actually breaks when seats stop being the unit of value, and what an operator should be checking for at the next renewal cycle.

The "SaaS is dead" framing flatters everyone

Start with what the convergence is not. It's not a death certificate.

Palantir's "SaaS is dead" framing — delivered by deployment strategist Danny Lukus and amplified across enterprise X — is a sales line. Palantir sells ontology-driven, custom-deployed AI infrastructure, and it has every incentive to bury the off-the-shelf SaaS narrative. Their CTO makes the same case on a16z's channel: the software layer should step back, the agent should take over, the bespoke ontology becomes the moat.

Benioff's "not my first SaaSpocalypse" framing is the mirror image. Salesforce will book $46 billion in annual revenue this year, generate $16 billion-plus in cash flow, and employ 83,000 people serving customers who've built their operations on the platform, per his All-In appearance. He has every incentive to call this cyclical (the third re-rating, not a structural break) and to point at Agentforce's growth as proof the platform absorbs the AI wave.

Both are right and both are selling something. The interesting question isn't whether SaaS dies. It's whether per-seat pricing — the specific commercial mechanic that built the last twenty years of enterprise software — survives contact with a workforce where the unit doing the work isn't a seat anymore.

Bessemer Venture Partners' 2026 AI Pricing and Monetization Playbook has the actual data: hybrid pricing — a base subscription plus usage overage — is now the industry standard at 41% of AI vendors, up from 27% a year ago. 43% of buyers prefer consumption-based; 27% prefer outcome-based. The shift isn't extinction. It's a quiet renormalization that's already past the halfway mark.

That's the actual environment an operator is buying into right now. The framing wars are loud; the renewal math is quiet.

💡 The reframe: "Is SaaS dead?" is a press question. "Are you priced for what your stack actually costs in 2027?" is the operator question. The rest of this piece is built around the second one.

Failure mode #1: Token-shifting

The first failure mode is the one HN was pricing.

The argument from The State of Brand, summarized in the HN thread: every AI lab is currently losing money serving your company, and they're doing it on purpose. A team of 50 on Claude Pro costs $1,000 a month. The equivalent API usage for that same team — measured by actual tokens consumed during real agent workflows — sits somewhere between $15,000 and $40,000 a month, depending on intensity. The seat-priced subscription is the loss-leader. The API-grade economics are the real economics.

That gap isn't a forecast. It's a balance-sheet reality at the foundation labs right now. When labs unwind the subsidy (whether through tiering, throttling, or just letting the per-seat plans atrophy while pushing customers toward API consumption), the cost line at the buyer doesn't move 10%. It moves 15x at the low end of that range and 40x in the worst case.
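The size of that gap follows directly from the figures quoted above. A short sketch, using only the article's numbers (50 seats, $1,000 a month on seats, $15,000 to $40,000 a month at API rates); the variable names are illustrative:

```python
TEAM_SIZE = 50
SEAT_BILL = 1_000                   # USD per month for the whole team, from the text
API_BILL_RANGE = (15_000, 40_000)   # USD per month at API rates, from the text

per_seat = SEAT_BILL / TEAM_SIZE
low_multiple, high_multiple = (bill / SEAT_BILL for bill in API_BILL_RANGE)

print(f"per-seat price: ${per_seat:.0f}/month")
print(f"API-grade cost is {low_multiple:.0f}x to {high_multiple:.0f}x the seat bill")
```

Running the same division on your own seat bill and metered usage gives the multiple that matters for your renewal.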

The seat-priced enterprise contract you're signing in May 2026 is being underwritten against an unsustainable subsidy. That subsidy survives as long as the labs are racing for distribution. It does not survive once the market settles.

This is what we mean by token-shifting: the unit of cost is migrating from headcount to consumption, but the contracts haven't repriced yet. The first vendor to reprice — to move you from "$20/user/month with unlimited AI features" to "$20/user/month base plus $X per million tokens" — will look hostile. They're not. They're the first one telling you what your stack actually costs.

Failure mode #2: Role compression

The second failure mode is the one Salesforce can't talk about on its own earnings call.

Per-seat pricing assumes you bought N seats because you had N humans who needed software to do their jobs. The model breaks the moment one of those humans is a workflow-orchestrating agent that performs the work of several seats while occupying one — or zero.

MindStudio puts the dynamic plainly: "When one AI agent can do the work that used to require 10, 20, or 50 human users, per-seat pricing doesn't just compress — it collapses." Gartner's call, cited across the trade press, is that seat-based revenue share will decline from 21% to 15% over the next 12 months, with at least 40% of enterprise SaaS spend shifting to usage-, agent-, or outcome-based models by 2030.

SAP CEO Christian Klein said the quiet part out loud earlier this spring, per SAPinsider: "It would be foolish to still charge subscription base, because AI is so powerful that it will automate a lot of tasks." SAP is moving wall-to-wall to consumption pricing. ServiceNow and Workday are drawing similar lines — particularly around external agents touching their stored customer data.

The buyer's exposure here is asymmetric and easy to miss. If you're buying a SaaS product today and your renewal is twelve months out, the vendor's incentive is to not reprice during your current contract — to let you keep your generous seat count, let your usage grow, and then reset everything at renewal. The vendor that doesn't reset is the vendor that's eating the margin. The vendor that does is the one that lives to negotiate.

You should expect every Tier-1 enterprise software contract negotiated between now and 2027 to land somewhere other than pure per-seat. Plan procurement accordingly.

Failure mode #3: Vendor-lock erosion

The third failure mode is the counterintuitive one, and it's the one most pricing pieces miss.

The instinct, watching Palantir's argument or Sierra's outcome-pricing pitch, is that consolidating to fewer, deeper AI agents inside a single vendor's ecosystem is the cost-controlled path. Sierra's framing is the cleanest version: vendors only get paid when the AI actually solves the buyer's problem. Intercom charges $0.99 per resolved conversation. HubSpot dropped to $0.50 in April 2026. Outcome-based is the rationalist's preferred model.

The problem is that the lock-in mechanic of outcome-priced agent platforms is worse than the seat-license lock-in it replaces.

Seat-license lock-in is mostly contractual and switching-cost-driven. The data lives in the vendor's database; you've trained users on the UI; you've integrated four systems through the platform. Painful to leave, but the unit of dependency is observable.

Agent-platform lock-in compounds invisibly. Every conversation an outcome-priced agent resolves accumulates context, learned workflows, and silent integrations that don't transfer. The "outcome" is partly a function of the platform's accumulated memory of your specific operation. When you try to switch, you're not just porting data. You're reconstructing implicit institutional knowledge that lives in someone else's vector store and policy graph.

⚠️ The hidden cost of outcome-priced agent platforms isn't the per-resolution fee. It's the behavioral lock-in: portability requirements need to be in the contract before the agent is deeply embedded — exports of context, audit logs of agent decisions, and a defined off-ramp. Vendors won't volunteer those clauses.

This is the part of the SaaS conversation that's actually new. The lock-in shape changed. The defensive moves changed with it.

Why Gruber's line matters here

Now back to Gruber, because his framing is what stitches the three failure modes together for a buyer.

His argument, made in the Apple context: AI is technology, not a product — the same way wireless networking is technology. There is no "killer wireless product." Everything is a wireless device. Everything will be an AI device. The category error is treating AI as a discrete bundled thing you procure.

For an enterprise buyer in 2026, that line cashes out as: stop evaluating "AI products" against each other. Start evaluating the AI-bearing-capacity of every vendor in your stack. Every existing SaaS line item — your CRM, your ITSM, your HRIS, your finance suite — is becoming an AI-bearing line item. The right question at renewal isn't "does this vendor have AI?" Every vendor has AI. The right question is whether the vendor's pricing model is honest about the cost of the AI it's about to start charging you for.

That reframes the whole procurement conversation. You're not buying AI products. You're managing AI exposure across an existing portfolio of software contracts, most of which are about to renegotiate the meaning of "user" in the licensing line.

The 2026 operator checklist

Five questions to take into every renewal between now and the end of 2027. None of them are clever; all of them tend to get skipped.

1. What's the all-in price at 10x current AI usage?
If the answer is "let's discuss enterprise pricing," you're getting a vague number that protects vendor optionality at your expense. Push for a written quote at projected Year-3 volume — token volume, agent-action volume, outcome volume, whichever unit the vendor's pricing actually meters on. The answer should be specific to four significant figures. If the vendor won't give you one, the vendor doesn't know what their model costs to run either, and that's the relevant signal.

2. What's the migration path off this vendor in 18 months?
Especially for outcome-priced agent platforms. Ask for: full export of agent context and learned workflows, machine-readable audit logs of agent decisions, and a published off-boarding SLA. If the contract is silent on portability, the lock-in cost is whatever the vendor wants it to be later. Get the clauses in the master agreement, not the data-processing addendum.

3. Who eats the cost-overrun if AI usage spikes?
Most hybrid models — base + overage — have soft caps that quietly convert overruns to next-tier subscriptions. That's a pricing escalator, not a usage meter. The right contract structure is: pre-purchased usage commits with rollover, hard caps with notification thresholds, and a documented procedure for re-baselining usage assumptions annually. Without those, you've bought a variable cost line with no governor.

4. How is "outcome" defined, and who decides when one occurred?
For any outcome-priced contract. Resolution criteria must be defined contractually — including what happens for false positives, where the AI claims a resolution but the customer follows up. The vendor will want flexibility; the buyer needs precision. Specify the criteria in writing before signing, with a defined disputes process. This is the single most-skipped step in 2026 outcome-pricing deals, per the Bessemer pricing playbook.

5. Does this vendor's pricing change if our headcount drops 20%?
This is the diagnostic question. If a vendor's pricing is genuinely AI-aligned, the answer should be "no, our pricing is decoupled from your headcount." If the answer is "yes, you'd save money," the vendor is still selling you seats with AI features bolted on — and you're carrying the SaaSpocalypse risk on the vendor's behalf. The vendors that have actually done the work — SAP and ServiceNow on the consumption side, Sierra and Intercom on the outcome side — give you a clean answer here. Everyone else is hedging.
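Questions 1 and 3 both come down to running the vendor's meter forward. The sketch below models a generic hybrid contract (base fee plus overage) with an optional hard cap. Every number in it is hypothetical, and the cap is modeled as usage being blocked at the cap and not billed, which is one possible contract structure among several.

```python
from typing import Optional


def hybrid_bill(base_fee: float, included_units: int, overage_rate: float,
                usage: int, hard_cap: Optional[int] = None) -> float:
    """Monthly bill for a base-plus-overage contract.

    Usage above hard_cap is assumed to be blocked, so it is never billed.
    """
    billable = usage if hard_cap is None else min(usage, hard_cap)
    overage_units = max(0, billable - included_units)
    return base_fee + overage_units * overage_rate


# Hypothetical contract terms and usage; replace with the vendor's real meter.
BASE_FEE = 10_000          # USD per month
INCLUDED_UNITS = 1_000_000
OVERAGE_RATE = 0.01        # USD per unit over the included volume
current_usage = 800_000

today = hybrid_bill(BASE_FEE, INCLUDED_UNITS, OVERAGE_RATE, current_usage)
at_10x = hybrid_bill(BASE_FEE, INCLUDED_UNITS, OVERAGE_RATE, current_usage * 10)
at_10x_capped = hybrid_bill(BASE_FEE, INCLUDED_UNITS, OVERAGE_RATE,
                            current_usage * 10, hard_cap=3_000_000)

print(f"today: ${today:,.0f}")
print(f"at 10x usage, no cap: ${at_10x:,.0f}")
print(f"at 10x usage, hard cap: ${at_10x_capped:,.0f}")
```

The gap between the uncapped and capped figures is the governor the contract either has or lacks.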

What to do with all of this

You don't need to pick a winner between Karp's Palantir and Benioff's Salesforce. Both companies will be standing at the end of this cycle, and both will be larger than they are today. The convergence isn't predicting a vendor outcome. It's telling you that the commercial layer of enterprise software is repricing in real time, and your contract portfolio is probably calibrated to a 2024 understanding of "user."

The work is unglamorous. Pull every Tier-1 SaaS contract that renews in the next 18 months. Run them against the five questions above. Flag the ones with no AI-overrun governor, no portability clause, or no honest answer to question #1. Those are the line items that have unpriced exposure — not because the vendor is hostile, but because the underlying economics moved and the contract hasn't caught up.
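The screening pass described above can be sketched as a small script. The `Contract` fields and vendor names are hypothetical, and the three checks mirror the flags named in the paragraph: no overrun governor, no portability clause, no written answer to question #1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Contract:
    vendor: str
    months_to_renewal: int
    has_overrun_governor: bool
    has_portability_clause: bool
    has_written_10x_quote: bool


def unpriced_exposure(contracts: List[Contract],
                      window_months: int = 18) -> Dict[str, List[str]]:
    """Map each vendor renewing inside the window to its list of contract gaps."""
    flagged: Dict[str, List[str]] = {}
    for c in contracts:
        if c.months_to_renewal > window_months:
            continue  # outside the repricing window
        gaps = []
        if not c.has_overrun_governor:
            gaps.append("no AI-overrun governor")
        if not c.has_portability_clause:
            gaps.append("no portability clause")
        if not c.has_written_10x_quote:
            gaps.append("no written 10x quote")
        if gaps:
            flagged[c.vendor] = gaps
    return flagged


# Hypothetical portfolio for illustration only.
contracts = [
    Contract("crm-vendor", 9, False, True, False),
    Contract("itsm-vendor", 14, True, True, True),
    Contract("hris-vendor", 30, False, False, False),
]
flagged = unpriced_exposure(contracts)
for vendor, gaps in flagged.items():
    print(vendor, "->", ", ".join(gaps))
```

A spreadsheet does the same job. The point is that the output is a short list of line items with named gaps, not a judgment about the vendor.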

The companies that come through 2027 cleanly aren't the ones that bet correctly on Palantir versus Salesforce. They're the ones whose procurement teams treated this twelve-month window as a repricing window — and renegotiated for the world that's already arrived.

The SaaSpocalypse is, as Benioff says, not new. The repricing is.

If you found this useful, the companion piece — Claude Kills SaaS Distribution: The Cascade — covers the same shift from the AI-platform side. And our review of agentic-coding economics digs into the actual token math behind the subscription-vs-API gap.

Originally published at the original site.