DEV Community: Anup Karanjkar

MiniMax M3 Developer Guide: Open-Weight 1M-Context Model (2026)

Anup Karanjkar — Thu, 04 Jun 2026 18:19:35 +0000

MiniMax M3 launched June 1, 2026 with a headline that's hard to ignore: 59.0% on SWE-Bench Pro at $0.60 per million input tokens. According to pricing data at launch, that is 5–10% of the per-token price of GPT-5.5 and Gemini 3.1 Pro, the models it is being compared against on that benchmark. If those numbers survive independent verification, M3 is the first open-weight model to put genuine pressure on proprietary frontier model economics.

The caveat: every performance number in this article comes from MiniMax's own benchmark runs. Third-party evaluations were not available at launch. Weights and a full technical report are scheduled for Hugging Face and GitHub around June 10–11 — that's when the ML community will confirm or challenge the claims in detail. Until then, this guide covers what's technically verifiable about the architecture and how to access the API today.

The Architecture: What MSA Actually Does

Standard transformer attention scales quadratically with sequence length. At 1M tokens, that math becomes the primary barrier to both speed and cost — not parameter count. MiniMax Sparse Attention (MSA) attacks this constraint directly, and the approach differs from both mainstream alternatives.

DeepSeek's Multi-head Latent Attention (MLA) compresses key-value caches before attention computation, trading precision for dramatically smaller KV footprints. FlashAttention and its variants optimize memory access patterns but don't reduce the fundamental O(n²) compute. MSA takes a third path: it keeps key-values uncompressed and at full floating-point precision, but adds block-level selection on top of a standard Grouped-Query Attention backbone.

The mechanism: for each query, a lightweight routing layer identifies which blocks of the KV cache are actually relevant and discards the rest before computing attention. No precision loss from compression. No wasted compute on irrelevant context. The selection routing adds minimal overhead because it operates at block granularity — large chunks, not individual tokens.
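The block-selection idea can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not MiniMax's implementation: scoring each block by its mean key is an assumption made here, and `block_sparse_attention`, `block_size`, and `top_k` are hypothetical names. The actual routing layer will only be known once the technical report is out.

```python
import math

def block_sparse_attention(query, keys, values, block_size=2, top_k=1):
    """Attend only to the top_k key blocks whose mean key best matches the query."""
    n = len(keys)
    blocks = [range(i, min(i + block_size, n)) for i in range(0, n, block_size)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Routing step (assumed): one cheap score per block, using the block's mean key.
    def block_score(block):
        dim = len(query)
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(i for b in ranked[:top_k] for i in blocks[b])

    # Ordinary softmax attention, restricted to the selected token positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim_v = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim_v)
    ]
    return output, selected
```

The keys and values inside the selected blocks are used unmodified, which is the property the article contrasts with KV compression. The saving comes from skipping the unselected blocks entirely.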

Published results at 1M context length versus MiniMax M2 on the same hardware:

Prefill speed: 9.7x faster (reading the full 1M-token prompt)
Decoding speed: 15.6x faster (generating each output token)
Per-token compute: approximately 1/20th of M2 at maximum context
KV precision: full floating-point maintained (no lossy compression, unlike DeepSeek MLA)

Whether MSA generalizes beyond MiniMax's internal workloads is an open question the weights release will answer. The full technical report will let independent researchers verify the routing mechanism and measure efficiency across diverse input distributions — including adversarial cases where sparse selection might degrade quality.

Benchmarks: What MiniMax Claims

Four benchmark scores published at launch:

SWE-Bench Pro: 59.0% — MiniMax claims this surpasses both GPT-5.5 and Gemini 3.1 Pro
Terminal-Bench 2.1: 66.0%
SWE-fficiency: 34.8%
BrowseComp: 83.5 — MiniMax claims this edges past Claude Opus 4.7 on autonomous browsing tasks

These numbers come exclusively from MiniMax's internal evaluation runs. Vendor-run benchmarks tell you the ceiling under optimal conditions, not typical production performance. Two models with the same SWE-Bench score can perform very differently on your actual task distribution.

The SWE-Bench Pro claim deserves particular context. Competing frontier models cluster around 55–65% on Pro. If M3 is genuinely at 59% at $0.60/million tokens, it's competing in the second tier of the coding benchmark table — not leading it, but significantly above models in its price range. The BrowseComp score is the wilder claim: autonomous browsing is a task class where agent scaffolding matters as much as raw model capability, making benchmark methodology scrutiny important.

The practical move: build a 50-task evaluation suite from your actual production backlog. Run M3, your current model, and one alternative. Vendor benchmarks are a screening filter, not a deployment decision.
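A minimal sketch of such an evaluation harness, assuming each task is a prompt plus a checker function that you write yourself. The stub models below are placeholders so the example runs offline, and every name here (`pass_rates`, `stub_model_a`, `example_tasks`) is hypothetical. In practice each model callable would wrap a real API client.

```python
from typing import Callable, Dict, List, Tuple

# A task pairs a prompt with a checker that decides whether an answer passes.
Task = Tuple[str, Callable[[str], bool]]

def pass_rates(models: Dict[str, Callable[[str], str]], tasks: List[Task]) -> Dict[str, float]:
    """Return the fraction of tasks each model passes."""
    results = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, check in tasks if check(ask(prompt)))
        results[name] = passed / len(tasks)
    return results

# Stand-in models so the sketch runs without network access.
def stub_model_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

def stub_model_b(prompt: str) -> str:
    return "unknown"

example_tasks: List[Task] = [
    ("What is 2 + 2?", lambda answer: answer.strip() == "4"),
    ("Name the capital of France.", lambda answer: "Paris" in answer),
]

rates = pass_rates({"model_a": stub_model_a, "model_b": stub_model_b}, example_tasks)
```

For coding tasks, the checker would more plausibly run a test suite against the generated patch than compare strings.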

Pricing: The Cost Argument

As of June 3, M3 is live on OpenRouter at $0.60 per million input tokens and $2.40 per million output tokens. A 50% promotional discount applied at launch reduced effective rates to approximately $0.30 input / $1.20 output per million tokens — though promotional pricing rarely persists.

| Model | Input ($/M tokens) | Output ($/M tokens) | Max Context |
| --- | --- | --- | --- |
| Gemini 3.5 Flash | $1.50 | $9.00 | 128K tokens |
| Claude Sonnet 4.6 | ~$3.00 | ~$15.00 | 200K tokens |

At $0.60 input, M3 undercuts Gemini 3.5 Flash by 60% on input tokens. For document-heavy workflows — contract review, codebase analysis, RAG with large retrieved contexts — where input tokens dominate cost, the economics are compelling if quality holds. The 1M context window amplifies the savings: instead of chunking and re-querying (which multiplies API calls), a single M3 call can process what would have required 5–10 calls at shorter-context pricing, eliminating the retrieval overhead entirely.
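The arithmetic behind that comparison, as a small sketch. The prices are the list prices quoted in this article. The 800K-in, 4K-out workload is a hypothetical example, and `call_cost` is an illustrative helper rather than part of any SDK.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of a workload, given prices per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# List prices quoted in this article (USD per million tokens).
M3 = {"input": 0.60, "output": 2.40}
GEMINI_FLASH = {"input": 1.50, "output": 9.00}

# Hypothetical document-heavy workload: 800K tokens in, 4K tokens out.
# On a 128K-context model the same volume would be split across several calls,
# which this sketch ignores; it compares raw token cost only.
m3_cost = call_cost(800_000, 4_000, M3["input"], M3["output"])
flash_cost = call_cost(800_000, 4_000, GEMINI_FLASH["input"], GEMINI_FLASH["output"])

# Fraction by which M3's input price undercuts Gemini 3.5 Flash.
input_discount = 1 - M3["input"] / GEMINI_FLASH["input"]
```

On these numbers the workload costs about $0.49 on M3 against about $1.24 on Gemini 3.5 Flash, and the input-price gap is the 60% cited above.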

Developer Guide: API Access Today

M3 is available immediately via OpenRouter. The endpoint follows the standard OpenAI chat completions spec, so migration from an existing model requires changing two lines:

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY"
)

response = client.chat.completions.create(
    model="minimax/minimax-m3",
    messages=[
        {
            "role": "system",
            "content": "You are an expert software engineer. Review code for bugs, performance issues, and design problems."
        },
        {
            "role": "user",
            "content": "Review this Python class for thread safety issues and memory leaks:\n\nclass DataProcessor:\n    def __init__(self):\n        self.cache = {}\n    \n    def process(self, key, data):\n        if key in self.cache:\n            return self.cache[key]\n        result = expensive_operation(data)\n        self.cache[key] = result\n        return result"
        }
    ],
    max_tokens=2048
)

print(response.choices[0].message.content)

For direct access, MiniMax's platform at api.minimax.chat exposes the full multimodal capability — text, image, and video inputs. The OpenRouter integration handles text only at launch. If your workflow requires analyzing image frames or video alongside code or documents, use the MiniMax API directly.

For the 1M context path, pass the full content in a single message. No chunking, no summarization layers, no retrieval pipeline needed:

with open("codebase_dump.txt", "r") as f:
    full_codebase = f.read()

# Single call, full codebase in context
response = client.chat.completions.create(
    model="minimax/minimax-m3",
    messages=[
        {
            "role": "user",
            "content": (
                "Here is the complete codebase:\n\n"
                + full_codebase
                + "\n\nTrace the data flow from POST /api/checkout "
                "through to the payment processing module. Identify "
                "race conditions or input validation gaps."
            )
        }
    ],
    max_tokens=4096
)
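Before sending a file that large, it may be worth checking that it plausibly fits. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption and not M3's tokenizer; real counts vary with language and content. `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; real tokenizers vary by language and content."""
    return int(len(text) / chars_per_token) + 1

def fits_in_context(text: str, max_output_tokens: int, context_limit: int = 1_000_000) -> bool:
    """True if the prompt plus the reserved output budget stays under the limit."""
    return estimate_tokens(text) + max_output_tokens <= context_limit
```

Calling `fits_in_context(full_codebase, 4096)` before the request gives an early warning, though a provider-side token count remains the authoritative check.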

What 1M Context Actually Unlocks

Long context windows get announced constantly. The M3 version is more interesting than most because MSA makes serving 1M tokens economically viable for the provider — which means MiniMax can price it comparably to standard-context inference instead of charging a premium surcharge.

Full-codebase review. 1M tokens is 25,000–40,000 lines of code depending on language verbosity. A mid-size production application fits in a single call. Trace a bug across the full dependency graph, audit all authentication paths, or generate comprehensive documentation — without chunking and the context fragmentation it introduces.

Complete contract analysis. A 500-page legal agreement is roughly 250,000 words. Send the whole document, ask M3 to identify all indemnification clauses, flag every defined term that appears inconsistently, or summarize obligations by party. Previous 200K-context models required chunking with retrieval layers that introduced relevance errors on cross-section references.

Agent session persistence. In multi-step agentic workflows, context accumulates with every tool call. At 1M tokens, an agent can hold roughly 5x the interaction history of a 200K-context model, or about 8x that of a 128K one, before needing to compress or summarize. That difference matters in tasks with long planning horizons: a 15-step research workflow versus one that forgets step 3 by step 8.

Multi-source video analysis. The native video input at MiniMax's direct API allows simultaneous analysis of multiple video segments in a single call — useful for content moderation pipelines, multi-camera production review, or surveillance workflows where temporal context across clips matters.

Where M3 Is Not the Right Choice

At launch, M3 has specific gaps worth knowing before you build anything on it.

No independent benchmark verification yet. If your application requires provable accuracy thresholds — medical diagnosis support, legal compliance screening, financial risk scoring — don't deploy on vendor numbers. Wait for community evaluation after the weights drop June 10–11.

Multimodal requires the direct API. OpenRouter text-only at launch means image and video input needs the MiniMax API directly, adding integration complexity if you already route through a provider. For text-only workloads this is a non-issue.

Short-context tasks see no architectural advantage. MSA is optimized for long-context efficiency. For tasks under 10K tokens, M3 performs like a standard frontier model — competitive, but without the 15x speed multiplier. Gemini 3.5 Flash or Claude Haiku 4.5 may deliver better value at very short contexts given their established optimization for that regime.

Enterprise SLAs not yet published. For teams needing contractual uptime guarantees, DPA agreements, or dedicated infrastructure, MiniMax's enterprise support tier details were not available at launch. The OpenRouter path provides availability SLAs through OpenRouter's own infrastructure guarantees, not MiniMax's directly.

Open Weights: Why June 10 Matters More Than the Launch

MiniMax committed to releasing model weights and a full technical report around June 10–11 on Hugging Face and GitHub. Three things happen when weights drop that don't happen on API launch day.

The ML community benchmarks independently. Within 48 hours of a major model weights release, LMSys, EleutherAI, and independent researchers typically publish their own evaluations. This is when vendor claims get confirmed, corrected, or revised significantly. MiniMax M2 held up reasonably well under independent evaluation. M3's claims are larger, in a more competitive environment, and the community appetite for scrutinizing SWE-Bench methodology is higher than ever.

Self-hosted deployment becomes available. For teams with data sovereignty requirements or on-premise constraints, open weights eliminate the API pricing conversation. A model that costs $0.60/million tokens via API costs compute-only when self-hosted on your hardware. For high-volume applications — processing thousands of documents per day — self-hosting frontier-class weights is typically 5–15x cheaper than API pricing at scale.

Fine-tuning becomes viable. A frontier-capable base model that can be adapted on private datasets is more valuable for specialized deployments than an API-only model at any price. Legal document analysis, domain-specific code review, proprietary knowledge integration — these workflows improve meaningfully with fine-tuning, and the base model capability determines the ceiling.

The Honest Take

MiniMax has delivered before. M2 was independently validated post-weights-release and performed close to announced numbers. M3 is a larger claim — surpassing GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro is not a minor upgrade story — in an environment where benchmark methodology scrutiny is at an all-time high.

The VentureBeat headline framing — "eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5–10% of the cost" — is the kind of claim that attracts attention and skepticism in roughly equal measure. Frontier models from OpenAI and Google have dedicated evaluation infrastructure and months of post-launch hardening. An open-weight model matching them on day one at a fraction of the cost would be a structural shift, not a typical product launch.

The answer arrives June 10–11. Until then: access the API via OpenRouter today, build your own evaluation suite against your task distribution, and make the deployment decision on your data rather than the vendor's. If M3 is as capable as claimed, you'll know from your results before the community verdict lands. If it isn't, you'll have saved yourself a premature architecture decision.

Originally published at wowhow.cloud

Switch from GitHub Copilot to Claude Code: Migration Guide 2026

Anup Karanjkar — Thu, 04 Jun 2026 14:38:33 +0000

$750 per month. That is the real cost some developers hit in the first billing cycle after GitHub Copilot switched to token-based AI Credits pricing on June 1, 2026. One developer on X reported going from $29 to $750. Another reported $50 climbing to $3,000. These are not edge cases from unusual usage. They are the result of running normal agentic sessions against large repositories with models like Claude 3.7 Sonnet or GPT-4o, which Copilot's AI Credits system bills at roughly $0.0025 to $0.015 per 1,000 tokens (see the rate table below).

Claude Code's interactive plan has no per-token ceiling. Pro costs $20 per month, Max 5x costs $100, Max 20x costs $200. Those are flat rates for interactive sessions in your terminal. If you do most of your AI coding in the terminal — not through automated pipelines — that is the math that drives the migration decision.

This guide covers the complete switch: installing Claude Code, mapping your Copilot workflows to Claude Code equivalents, setting up MCP servers to replace Copilot extensions, and the specific cases where staying on Copilot still makes sense.

Why the Copilot Bill Spiked

The old Copilot model used Premium Request Units. Each plan had a monthly PRU allotment. When you exhausted it, Copilot fell back to a lighter base model — annoying, but your bill stayed flat. That safety net was removed on June 1. Now every chat message, agent mode task, and code review session draws from AI Credits billed at $0.01 per credit, with credit costs varying by model:

| Model | Input per 1K tokens | Output per 1K tokens |
| --- | --- | --- |
| Claude 3.7 Sonnet (via Copilot) | $0.003 | $0.015 |
| GPT-4o | $0.0025 | $0.010 |
| o3 (reasoning mode) | $0.010 | $0.040 |
| GPT-4.1 | $0.002 | $0.008 |

A single agentic session that reads 50 files, generates 2,000 lines of code, and runs 3 rounds of iteration can easily hit $2–5 in AI Credits. Multiply that by 10–15 sessions per week and the monthly bill lands between $80 and $300 before you've done anything unusual. Run a few long refactoring sessions against a monorepo and you're at $750.
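To see how a session reaches that range, here is a rough estimator using the per-1K-token prices from the table above. The token counts are hypothetical (the article does not give them), and the dictionary keys and function names are illustrative labels, not Copilot identifiers.

```python
# Per-1K-token prices from the table above (USD): (input, output).
PRICES_PER_1K = {
    "claude-3.7-sonnet": (0.003, 0.015),
    "gpt-4o": (0.0025, 0.010),
    "o3": (0.010, 0.040),
    "gpt-4.1": (0.002, 0.008),
}

def session_cost(model, input_tokens, output_tokens):
    """Dollar cost of one session at the listed per-1K-token prices."""
    input_price, output_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price

def monthly_cost(cost_per_session, sessions_per_week):
    """Scale a per-session cost to a month (52 weeks spread over 12 months)."""
    return cost_per_session * sessions_per_week * 52 / 12

# Hypothetical session: 500K tokens read across iterations, 50K tokens generated.
example = session_cost("claude-3.7-sonnet", 500_000, 50_000)
example_month = monthly_cost(example, 12)
```

Under those assumed token counts a session costs $2.25, and twelve such sessions a week comes to $117 a month, inside the range quoted above. Input tokens dominate because an agent re-reads files on every iteration.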

The worst part: GitHub's default is to notify you when you hit a spending limit, not stop usage. You must manually enable "Stop usage when budget limit is reached" in the billing dashboard to create a hard cap. Many developers discovered this after their first post-June-1 invoice arrived.

The full pricing breakdown and what is still free under Copilot is covered in detail in the GitHub Copilot token billing cost guide.

Claude Code's Pricing Model: No Ceiling for Interactive Use

Claude Code's flat-rate pricing applies to interactive terminal sessions — the kind of work where you type a command, Claude reads your codebase, and returns results. Three tiers exist:

| Plan | Monthly cost | Usage multiplier | Best for |
| --- | --- | --- | --- |
| **Pro** | $20 | 1x | Solo developers, part-time AI coding |
| **Max 5x** | $100 | 5x higher limits | Full-time developers, heavy agentic use |
| **Max 20x** | $200 | 20x higher limits | Power users, multi-agent orchestration |

"Usage multiplier" refers to the rate limits for model calls per hour — not per-token billing. Once you're on a plan, interactive sessions do not generate per-token charges on top of the subscription. The June 15, 2026, billing split creates a separate credit pool for programmatic/API use (the Agent SDK, claude -p pipeline runs), but manual terminal sessions remain covered under the flat rate.

The contrarian case for Copilot: if you are purely an autocomplete user who rarely opens the chat panel, the $10 Pro or $39 Pro+ base fee with zero agentic sessions is still cheaper than Claude Code Pro at $20. Know your actual usage pattern before migrating.
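One way to find your own break-even point is to count how many metered agentic sessions per month it takes for a base fee plus credits to reach the flat rate. The sketch assumes the Copilot base fee includes no agentic credits, which is a simplification, and `break_even_sessions` is a hypothetical helper.

```python
import math

def break_even_sessions(flat_monthly, base_monthly, cost_per_session):
    """Smallest whole number of metered sessions per month at which
    base fee plus session charges reaches or exceeds the flat plan."""
    if cost_per_session <= 0:
        raise ValueError("cost_per_session must be positive")
    return max(0, math.ceil((flat_monthly - base_monthly) / cost_per_session))

# Claude Code Pro ($20) vs Copilot Pro ($10), at the low-end $2 per session.
sessions_vs_pro = break_even_sessions(20, 10, 2.0)

# Claude Code Pro ($20) vs Copilot Pro+ ($39): the flat plan is already cheaper.
sessions_vs_pro_plus = break_even_sessions(20, 39, 2.0)
```

At the article's low-end $2 per session, five agentic sessions a month is enough to erase Copilot Pro's $10 price advantage.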

Why Claude Code Wins on Technical Merits

Cost aside, three technical factors make Claude Code the strongest alternative:

SWE-bench score. Claude Opus 4.8, released May 29, 2026, scores 88.6% on SWE-bench Verified — the highest of any model available today. SWE-bench tests real GitHub issues against real codebases. An 88.6% score means the model correctly resolves nearly 9 in 10 benchmark software engineering tasks autonomously. Cursor also uses Claude models, but supplements them with its own inference infrastructure running at 200+ tokens per second for raw completion speed.

1 million token context window. Claude Opus 4.8 and Claude Sonnet 4.6 both support 1M tokens of context. That is roughly 750,000 words — enough to hold the entire source tree of a mid-sized SaaS application in a single context window without chunking or retrieval hacks. Copilot's effective context window is bounded by the IDE extension architecture, not a published API limit, and real-world behavior degrades noticeably with large codebases.

MCP ecosystem. The Model Context Protocol now has 97 million+ downloads across servers. Claude Code is the reference implementation. Every tool integration — databases, APIs, cloud providers, dev tools — that publishes an MCP server works natively in Claude Code. This is the equivalent of Copilot's extension marketplace, but protocol-based rather than plugin-based, meaning you can write your own MCP server in TypeScript in under an hour.
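To give a sense of how little a tool server has to do, here is a deliberately simplified sketch of the two core MCP tool methods, `tools/list` and `tools/call`, as JSON-RPC 2.0 handlers. It is written in Python to match the other examples in this post, and it omits the initialization handshake and the stdio transport that a real server needs, so treat it as an illustration of the message shapes rather than a working server. The `word_count` tool and `handle_request` function are hypothetical; the official SDKs handle the protocol details for you.

```python
import json

# One hypothetical tool, described with a JSON Schema for its arguments.
TOOLS = {
    "word_count": {
        "description": "Count the words in a piece of text.",
        "inputSchema": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        "handler": lambda args: str(len(args["text"].split())),
    }
}

def handle_request(raw: str) -> str:
    """Answer a single JSON-RPC 2.0 request for tools/list or tools/call."""
    request = json.loads(raw)
    method = request.get("method")
    if method == "tools/list":
        result = {
            "tools": [
                {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
                for name, t in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        params = request["params"]
        tool = TOOLS[params["name"]]
        text = tool["handler"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        # -32601 is the JSON-RPC 2.0 code for "Method not found".
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), "result": result})
```

The protocol-based design shows up here: a tool is a name, a schema, and a function, with no plugin API specific to one editor.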

Installation: Claude Code in 5 Minutes

Claude Code runs in your terminal. No IDE plugin required, though VS Code and JetBrains integrations exist.

# Install Claude Code globally
npm install -g @anthropic-ai/claude-code

# Authenticate
claude login

# Verify installation
claude --version

After login, you'll be prompted to connect your Anthropic account. If you don't have one, create it at anthropic.com. Select your plan during signup — Pro at $20 is sufficient to evaluate before committing to Max.

Open any project directory and start a session:

cd your-project
claude

Claude Code reads your directory tree automatically. No configuration required for basic use.

Mapping Copilot Workflows to Claude Code

The biggest adjustment is mental model, not tooling. Copilot is IDE-embedded — you stay in VS Code and the AI assists inline. Claude Code is terminal-first — you describe tasks in natural language and the agent executes them autonomously across files.

Code completions → inline suggestions in VS Code. Claude Code has a VS Code extension that provides inline completions. Install it from the VS Code marketplace after installing the CLI. It is not as tightly integrated as Copilot's ghost-text completions, but it works.

Copilot Chat → Claude Code terminal chat. Replace the Ctrl+I chat panel in VS Code with typing directly in the Claude Code terminal session. The difference: Claude Code can execute bash commands, write files, and run your tests in the same session. Copilot Chat generates suggestions you then apply manually.

Copilot Agent Mode → Claude Code autonomous mode. This is where the difference is largest. Copilot Agent Mode runs in the IDE with access to the current open file and some workspace context. Claude Code autonomous sessions read your entire repository, write changes to disk, run tests, and iterate. A task description in natural language is all you need:

claude "Refactor the authentication middleware to use the new JWT library.
Run the test suite after each change and fix any failures."

Copilot Code Review → Claude Code review command. Use /review in a Claude Code session to get a structured review of staged changes before committing. It surfaces security issues, performance problems, and style inconsistencies with file paths and line numbers.

Setting Up MCP Servers (Replacing Copilot Extensions)

Copilot extensions connect the AI to external services. MCP servers do the same thing in Claude Code. The setup lives in ~/.claude.json:

{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "your_token_here"
      }
    },
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres",
               "postgresql://localhost/mydb"]
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem",
               "/path/to/project"]
    }
  }
}
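If you would rather not paste a token into the file by hand, a small script can generate the same structure and read the token from an environment variable. This is a sketch: `build_mcp_config` is a hypothetical helper, and it only prints the JSON instead of touching ~/.claude.json, since that file may hold other settings you would want to merge manually.

```python
import json
import os

def build_mcp_config(project_path, token_env="GITHUB_PERSONAL_ACCESS_TOKEN"):
    """Build the mcpServers mapping, reading the GitHub token from the environment."""
    return {
        "mcpServers": {
            "github": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-github"],
                # Falls back to an empty string if the variable is not set.
                "env": {token_env: os.environ.get(token_env, "")},
            },
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", project_path],
            },
        }
    }

config_text = json.dumps(build_mcp_config("/path/to/project"), indent=2)
print(config_text)
```

Keeping the token in the environment also keeps it out of any dotfiles you sync or commit.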

Common Copilot extension → MCP server mappings:

| Copilot Extension | MCP Server equivalent | Package |
| --- | --- | --- |
| GitHub (@github) | server-github | `@modelcontextprotocol/server-github` |
| Azure (@azure) | Cloudflare MCP (for edge) | `@cloudflare/mcp-server-cloudflare` |
| Docker (@docker) | server-docker | `@modelcontextprotocol/server-docker` |
| Sentry (@sentry) | MCP Sentry server | `@sentry/mcp-server` |
| Datadog (@datadog) | server-datadog | Community server on MCP registry |

The full MCP server registry is at modelcontextprotocol.io. Search by integration name. Most major developer tools now have published MCP servers.

Migrating Your CLAUDE.md (Project Context)

Claude Code uses a CLAUDE.md file at the root of your project to provide persistent context — your tech stack, coding standards, forbidden patterns, and architecture notes. This replaces the "custom instructions" feature in Copilot Chat.

A minimal starting template:

# Project: [Your Project Name]

## Tech Stack
- [Framework, version]
- [Database]
- [Deployment target]

## Code Standards
- TypeScript strict mode, no any types
- Named exports only
- Functional components, no class components

## Forbidden Patterns
- Never use [pattern] -- [reason]

## Testing
- Run npm test after every change
- All new code needs unit tests before commit

Claude Code reads CLAUDE.md at the start of every session. The more specific you make it, the fewer corrections you need to give mid-session. Think of it as the system prompt that stays with your project permanently.

Dynamic Workflows: What Copilot Doesn't Have

Claude Code's Dynamic Workflows feature, introduced with Opus 4.8, lets the model switch between fast completion mode and deliberate reasoning mid-task based on problem complexity. You don't configure this — it happens automatically. Simple completions return in under a second. Architecture-level problems trigger extended thinking.

The practical effect: you don't pay the latency cost of always-on deep reasoning for trivial tasks, but you get it when you actually need it. Copilot has no equivalent. Its model selection is static per session.

Where Cursor and Windsurf Fit

The honest comparison: Claude Code is not always the right choice over Cursor or Windsurf, and the migration decision is not binary.

Cursor ($60B valuation, $20/month). Cursor 3 introduced Design Mode — visual component editing directly in the IDE — and parallel agent execution across multiple files simultaneously. At 200+ tokens per second through its own inference infrastructure, Cursor is faster than Claude Code for pure completion speed. If you spend most of your time in the IDE editor rather than a terminal, and value visual component editing, Cursor is the better fit.

Windsurf ($20/month, 950 tokens/second). Windsurf's headline number is inference speed — 950 tokens per second, the fastest available. It supports 40+ IDEs. If you use anything other than VS Code or JetBrains, Windsurf is currently the only serious option. The SWE-bench score is lower than Opus 4.8, but the speed advantage is real and matters for long refactoring sessions.

The split many developers are landing on: Claude Code for autonomous agentic work (long tasks, multi-file refactors, code review), Cursor for active development with frequent completions, Windsurf if IDE variety matters. None of these conflict — you can run all three simultaneously.

Complete Migration Checklist

Run through this before canceling your Copilot subscription:

# 1. Install Claude Code
npm install -g @anthropic-ai/claude-code
claude login

# 2. Create CLAUDE.md in your primary project
touch CLAUDE.md  # fill in tech stack + standards

# 3. Set up MCP servers (add to ~/.claude.json)
# -- at minimum: github, filesystem

# 4. Run your first agentic task
claude "Review the last 3 commits and summarize what changed in authentication"

# 5. Set a hard spending cap in GitHub Billing -- even if you keep Copilot
# Dashboard -- Billing -- GitHub Copilot -- Spending limit -- enable hard stop

Pro at $20 is the right starting point for most individual developers. Move to Max 5x only if you're running multiple long agentic sessions per day.

The AI coding tools market has four serious players now: Claude Code, Cursor, Windsurf, and Copilot. Copilot's billing change makes it the most expensive option for heavy agentic use. For that specific workload, the migration pays for itself in the first month.

Originally published at wowhow.cloud

Anthropic IPO: $965B S-1 Filed — What Developers Need to Know

Anup Karanjkar — Thu, 04 Jun 2026 12:22:36 +0000

$47 billion. That's Anthropic's annualized revenue run-rate as of May 2026 — up from roughly $10 billion twelve months earlier. That 4.7x growth trajectory is precisely why the company filed a confidential S-1 registration statement with the Securities and Exchange Commission on June 1, 2026. For Wall Street, this is the most anticipated AI IPO since Nvidia's supply-side dominance became undeniable. For developers building on Claude, it is something more specific: a signal that Anthropic is transitioning from a venture-backed research organization into a publicly accountable business — with everything that implies for API pricing, enterprise prioritization, and the developer ecosystem you depend on.

The Numbers That Explain the $965 Billion Valuation

Anthropic's most recent funding round — a $65 billion Series H — valued the company at approximately $965 billion. That figure will almost certainly be higher at IPO. The reported target range for the public offering is $1.10 trillion to $1.25 trillion fully diluted. The company is targeting a primary issuance of $25–35 billion, which would make it the largest tech IPO in U.S. history by a significant margin.

Revenue tells the story more clearly than the headline number. Moving from a roughly $10 billion run-rate to $47 billion in about twelve months is not organic growth; it is a step-function shift driven by a specific product. That product is Claude Code.

Enterprise adoption numbers confirm the trajectory. Over 1,000 customers now spend more than $1 million annually on Claude. As of April 2026 that figure had doubled from roughly 500 in under two months, and two years ago there were only a dozen such customers. That acceleration in large-ticket enterprise relationships is precisely what makes a near-trillion-dollar valuation legible to institutional investors who were skeptical of AI company economics as recently as late 2025.

Claude Code: The Revenue Engine Nobody Predicted

When Anthropic launched Claude Code in 2025 (a research preview in February, general availability in May), the internal expectation was that it would serve as a developer acquisition channel: a tool that would pull engineers into the Claude ecosystem and eventually convert them into API customers. The actual outcome was different. Claude Code surpassed $1 billion in annualized revenue within about six months of general availability, driven by enterprise developer teams adopting it not as a prototype but as primary production infrastructure.

The driver was autonomous agent workflows. Developer teams discovered that Claude Code could take a GitHub issue, open the relevant files, write and test a patch, and submit a pull request — without a human in the loop for routine tasks. The time savings compounded across 10-person engineering teams into numbers that justified $1,000+ monthly per-seat pricing. Claude Code became the first AI developer tool to clear the "ROI in the first sprint" threshold reliably, and enterprise procurement followed at scale.

The IPO filing tells investors a clear story: this is a recurring revenue business, not a research organization. Claude Code's $1B ARR in six months is the strongest evidence Anthropic has ever produced that its models translate into enterprise product-market fit. For developers, it also signals something worth watching: Anthropic now has extremely strong financial incentives to optimize Claude Code as an enterprise product, which may not always align with the needs of solo developers or small teams on personal API keys.

The IPO Timeline

The confidential S-1 means the full prospectus, including risk factors, executive compensation, ownership structure, and audited financials, remains non-public until Anthropic files publicly ahead of its roadshow. Under the SEC's rules for confidential draft submissions, the registration statement must be publicly filed at least 15 days before the roadshow begins, and the company can amend the draft in response to SEC staff comments before then.

The reported target window runs from October 15 through November 2026, with pricing and allocation expected to follow within two weeks of the roadshow launch. Google (via Google Cloud) and Amazon (via AWS) hold significant stakes in Anthropic, and both have embedded Claude into core enterprise products, so the IPO will carry unusual complexity in its ownership and lockup structure that institutional investors will scrutinize carefully.

One additional variable: OpenAI is also pursuing a public listing in 2026, and the race to list first has real implications for both companies. The AI IPO window is real but not indefinite — public market appetite for AI infrastructure stories depends on macro conditions, interest rates, and whether existing AI investments in public markets continue to trade at premium multiples. Whichever company lists first faces a cleaner comparable set and likely captures a higher multiple on its first day of trading.

What Changes for Developers on June 15

The IPO filing is not the only Anthropic news that directly affects developers this month. On June 15, 2026, Anthropic is restructuring how Claude subscriptions and API access are billed. The change creates two distinct credit pools:

Interactive pool — covers Claude.ai chat sessions and terminal-based Claude Code sessions. This pool continues to be covered by existing subscription plans at current monthly rates.
Programmatic pool — covers the Claude Agent SDK, claude -p CLI calls, and third-party agent tools that invoke the Claude API programmatically. This pool is billed at full API list prices from a separate monthly credit allocation.

The practical effect: developers who use Claude Code interactively in the terminal are unaffected. Developers running automated pipelines, agent workflows, or batch processing through the API will now see that usage draw from a separate metered credit pool rather than the flat subscription rate they may have been relying on.

The reason for this restructuring is more straightforward than Anthropic has stated publicly. The split creates clean accounting segmentation between subscription revenue and API usage revenue — exactly the segmentation an SEC staff reviewer needs to evaluate a SaaS-plus-consumption business model. Investors reading the prospectus need to see where growth is coming from. The June 15 change makes that analysis possible in the audited financials that will accompany the public filing.

How the IPO May Reshape API Pricing

Public company status introduces quarterly earnings pressure that private companies do not face. Anthropic's current API pricing reflects competitive dynamics: the company has reduced prices significantly as model efficiency improved, prioritizing developer adoption over margin. Once publicly traded, the calculus shifts. Gross margin improvement becomes a quarterly deliverable. The three levers are compute efficiency, pricing, and mix shift.

Compute efficiency will continue to improve — that is a function of architecture optimization and hardware procurement, not business model decisions. Pricing is the sensitive variable. The most likely scenario is not that API prices increase uniformly, but that the pricing structure becomes more tiered: enterprise customers on large committed contracts receive favorable rates, while pay-as-you-go developer access sees gradual price normalization as competitive dynamics permit.

Current pricing for Claude Opus 4.8 — Anthropic's flagship model — sits at $15 per million input tokens and $75 per million output tokens. Whether those rates hold post-IPO depends significantly on what competitors do: Google Gemini 3.5 Flash is available at $1.50/$9 per million tokens, and GPT-5.5 Instant sits below Opus pricing on inference cost. Anthropic cannot sustain a 10x pricing premium over commodity alternatives without a capability justification that enterprise buyers can articulate to their CFOs each quarter. The competitive ceiling is real and well-defined.
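To make the premium concrete, here is a small worked comparison using the per-million-token rates quoted above. The daily workload (2M input tokens, 500k output tokens) is a hypothetical figure chosen only to show the arithmetic; substitute your own volumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Rates as quoted in this article; verify against current price sheets.
OPUS_4_8 = Price(15.00, 75.00)
GEMINI_3_5_FLASH = Price(1.50, 9.00)


def request_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given number of input and output tokens."""
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens per day.
opus = request_cost(OPUS_4_8, 2_000_000, 500_000)
flash = request_cost(GEMINI_3_5_FLASH, 2_000_000, 500_000)
print(f"Opus 4.8: ${opus:.2f}/day, Gemini 3.5 Flash: ${flash:.2f}/day, "
      f"ratio {opus / flash:.1f}x")
```

The input-side premium is exactly 10x and the output-side premium is about 8.3x, so the blended ratio depends on how output-heavy the workload is.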

Use the AI Model Cost Calculator or AI Prompt Cost Calculator to benchmark your current API spend now, before any IPO-related pricing adjustments occur. Understanding your baseline makes it easier to evaluate changes when they arrive.

Anthropic vs. OpenAI: The Dual IPO Race

Both major frontier labs are racing toward public markets in 2026. OpenAI filed its own S-1 earlier this year, and the developer implications of that filing have already been analyzed in depth. The parallel IPO race matters for developers for one structural reason: once both companies are publicly traded, API pricing and product decisions will be made in the context of competitive quarterly earnings calls.

That competition is not bad for developers. The pricing pressure created by Gemini Flash, GPT-5.5 Instant, and Claude Haiku 4.5 competing on the commodity inference tier has driven input token costs down by roughly 90% over 18 months. The incentive to maintain developer mindshare — measured by API call volume, which translates directly into revenue line items that public investors value — is something both companies will defend aggressively post-IPO.

Where the risk concentrates is at the frontier tier: the Opus-class, reasoning-heavy models that are genuinely differentiated from commodity alternatives. Those models carry premium pricing that public shareholders will expect to protect. If Anthropic's Q1 2027 earnings show Opus usage as a high-margin revenue driver, the pressure to defend that margin — rather than discount it to grow developer adoption — will be structural and sustained.

What Smart Claude Developers Should Do Before October

The IPO is a structural transition, not a pricing announcement. Here is what actually matters before the October window:

Audit your current API costs now. The June 15 billing split means some usage that was invisible under a flat subscription will now surface in the programmatic pool as metered consumption. Run an audit of your automated Claude API calls before June 15. Any workflow built on the mental model of "unlimited" subscription access needs to be measured and budgeted. The AI Token Counter gives you a baseline on token usage before you start optimizing your prompts.

Build model-agnostic abstractions in your agent layer. This is not an argument to leave Claude — it is a basic infrastructure resilience argument. If 90% of your AI calls route through one provider and that provider experiences a pricing event, a capacity constraint, or a model capability regression after IPO pressures take hold, your application is exposed. Implementing an abstraction layer through MCP or a custom routing wrapper costs a few days of engineering time and meaningfully reduces dependency risk.
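A custom routing wrapper can be quite small. The sketch below is one possible shape, assuming each provider is wrapped in a plain callable that takes a prompt and returns text; the two provider functions are stand-ins that simulate a failure and a fallback, not real SDK calls.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real adapters would wrap vendor SDK calls; the ones below are stubs.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider has errored."""


def complete(prompt: str, providers: Sequence[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in priority order and return (provider_name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # production code would catch vendor-specific errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Stub that simulates a capacity problem at the primary provider.
    raise TimeoutError("capacity constraint")


def stub_fallback(prompt: str) -> str:
    # Stub that stands in for a second provider.
    return f"[fallback] {prompt}"


name, text = complete("ping", [("primary", flaky_primary), ("fallback", stub_fallback)])
print(name, text)
```

Because application code only ever calls complete, swapping or reordering providers becomes a configuration change rather than a refactor.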

Negotiate annual enterprise agreements if you are at scale. Anthropic's sales team is closing enterprise agreements aggressively before the IPO to strengthen the ARR line on the S-1. The leverage to negotiate committed pricing is highest right now — before the prospectus becomes public and before revenue figures are locked into the investor narrative. If you are spending $50,000 or more per month on Claude APIs, an enterprise agreement negotiated in July or August 2026 may lock in rates that look considerably more attractive than whatever post-IPO pricing normalization produces in 2027.

Watch for model announcements tied to IPO timing. The current flagship Claude Opus 4.8 and its successors will be released in the context of investor-relevant milestones. Feature releases tend to cluster around events that strengthen the company's narrative to public markets. There is likely a meaningful model or capability announcement between now and the October roadshow. Subscribe to the Anthropic changelog and plan your integration upgrade cycles accordingly.

Conclusion

Anthropic's confidential S-1 filing on June 1, 2026 marks the company's transition from the most important AI research organization you depend on to the most important publicly traded AI infrastructure business you depend on. Those are meaningfully different things. The first optimizes for capability and researcher reputation. The second optimizes for gross margin improvement, quarterly predictability, and investor-legible growth narratives.

The encouraging news: Anthropic's revenue trajectory — $10 billion to $47 billion annualized in 18 months — is strong enough that the company is not under survival pressure of the kind that forces drastic API changes. The June 15 billing split is an accounting measure, not a price increase. The IPO itself raises capital that extends the compute investment runway and funds the next generation of model development.

The honest caveat: once a company answers to public shareholders, every pricing decision, product prioritization, and developer program must clear the quarterly earnings bar. The developers generating measurable, growing API revenue will be protected and cultivated. Those consuming compute without contributing to that growth — low-spend hobbyist usage, flat-subscription workflows that exploit unlimited tiers — will face the normal pressures that public company financial discipline creates over time.

Build accordingly. Every tool for working with Claude APIs more efficiently — from cost calculators to token estimators — is available at wowhow.cloud, pay once, ship forever.

Originally published at wowhow.cloud

Trump AI Executive Order June 2026: What Developers Need to Know

Anup Karanjkar — Wed, 03 Jun 2026 12:20:54 +0000

On June 2, 2026, President Trump signed an executive order titled "Promoting Advanced Artificial Intelligence Innovation and Security," the most consequential federal AI policy action in nearly three years. Unlike the Biden administration's October 2023 EO, which created mandatory reporting requirements for large-scale AI training runs, the Trump order takes an explicitly voluntary approach. It creates new coordination infrastructure for AI security, asks frontier AI companies to share models with the government before release, and directs federal agencies to harden systems against AI-enabled cyber threats. For most developers, the immediate compliance burden is zero. But for enterprise teams, government contractors, and operators of critical infrastructure, the order signals a direction of travel that will matter within 12–18 months.

Why This Executive Order Exists

The order was expected in May 2026 but postponed. According to reporting at the time, the White House scrapped the original signing after internal concerns that the order — which originally required a 90-day government review window for frontier models before public release — would stifle U.S. AI companies in their race against Chinese competitors. That framing explains the final order's architecture: it keeps the voluntary coordination framework while stripping any language that could be construed as mandatory licensing or preclearance.

The stated policy goal is to work with the private sector to "harden government and industry systems against cyber threats, protect American intellectual property, and build out the country's AI-enabled defensive capabilities." In practice, the order creates two new institutions and directs agencies to build new benchmarks. Neither institution has direct enforcement power today.

The Three Pillars of the Executive Order

The EO organizes around three distinct actions. Understanding them separately prevents the common mistake of treating the order as either more restrictive or less consequential than it actually is.

Pillar 1: The AI Cybersecurity Clearinghouse

Within 30 days of signing — by approximately July 2, 2026 — the Secretary of the Treasury, in consultation with the National Cyber Director, the NSA Director, and the CISA Director, must stand up an AI Cybersecurity Clearinghouse. This is the order's most operationally concrete deliverable.

The clearinghouse has three defined functions:

Coordinate and deconflict vulnerability scanning: Multiple agencies and companies are currently scanning AI systems and AI-enabled products for security vulnerabilities independently. The clearinghouse creates a shared coordination layer to avoid redundant effort and prevent disclosures from conflicting.
Discover and validate vulnerabilities: Beyond passive coordination, the clearinghouse is authorized to actively find and validate software vulnerabilities in AI systems, particularly those affecting critical infrastructure.
Coordinate and prioritize remediation: Once vulnerabilities are identified, the clearinghouse coordinates the distribution of patches and prioritizes which remediations reach which operators first — based on exposure and criticality.

Participation by industry is "voluntary collaboration." The clearinghouse cannot compel companies to disclose vulnerabilities or submit systems for scanning. What it can do is create a credible, government-backed channel that makes voluntary disclosure safer and more structured than ad hoc reporting.

For developers building AI-enabled products that touch healthcare, energy, or financial infrastructure, this matters even without a compliance mandate. If the clearinghouse surfaces a vulnerability in an AI component your product uses — a model library, an inference provider, an embedding service — the patch coordination framework it creates will determine how quickly you learn about it and what remediation options are available.

Pillar 2: Voluntary Pre-Release Model Review

The second pillar is the most discussed provision: AI developers may, on a voluntary basis, share "covered frontier models" with the federal government up to 30 days before public release, for national security and cybersecurity assessment.

Several aspects of this provision are worth examining carefully:

What counts as a "covered frontier model"? The order defines this term, but the definition references compute thresholds and capability benchmarks that are likely to be refined by agency guidance over the coming months. At launch, the definition appears to target the largest commercial models — systems like Claude Opus 4.8, GPT-5.5, and Gemini 3.5 — rather than mid-size or open-weight models. The implication for most developers building on top of existing APIs: this review process is not your problem; it is the foundation model provider's decision to make.

What happens during the 30-day review? The government conducts national security and cybersecurity assessments. The order does not define the outcome criteria — there is no provision authorizing the government to block a model's release based on review findings. The review produces intelligence and informs agency posture; it does not create a gatekeeping mechanism.

Why would a company participate voluntarily? The reputational signal is one incentive: a model that has been through voluntary government security assessment can credibly claim a level of vetting that competitors without that history cannot. The procurement incentive is a stronger one: government contracting vehicles and future security guidance are likely to treat voluntary participation as a qualification criterion. Participating now builds institutional relationships that matter when voluntary becomes a de facto prerequisite.

The EO explicitly states: "Nothing in this section authorizes a mandatory government licensing, preclearance, or permitting requirement for developing or releasing new AI models, including frontier models." This language was added specifically to address industry concerns about the original 90-day draft.

Pillar 3: Federal AI Security Hardening

The third pillar directs federal agencies to develop new benchmarks and shore up their own defenses. Specifically:

Agencies must develop benchmarks to assess AI models' cyber capabilities — essentially, tests for what an AI model can do in a cybersecurity context, from generating exploit code to finding vulnerabilities in existing systems.
Agencies are directed to harden government AI-enabled systems against both external attacks and misuse by AI systems themselves.
The benchmarks, once developed, will inform procurement decisions — creating a de facto evaluation framework that vendors selling AI to the federal government will need to satisfy.

This pillar is the least immediately visible but may have the longest tail. Once NIST or CISA publishes AI security benchmarks derived from the EO mandate, those benchmarks are likely to migrate into industry standards, compliance frameworks, and eventually cyber insurance requirements, regardless of whether the underlying EO is ever enforced.

From 90 Days to 30 Days: What Changed and Why

Understanding the order's evolution is essential for reading its intent. The original draft, which circulated in spring 2026, required a 90-day pre-release review window — the kind of timeline that would have introduced significant friction into frontier model release schedules, which have operated on roughly quarterly cadences. Industry pushback was immediate and effective.

The argument against the 90-day window was both competitive and constitutional. Competitively: Chinese frontier model developers operate without pre-release government review, and a 90-day bottleneck for U.S. models would create a structural disadvantage in global model deployment. Constitutionally: mandating pre-release review of expressive content raises First Amendment concerns that have historically limited prior restraint in other media contexts.

The White House resolved the impasse by keeping the review framework but making participation voluntary and compressing the window to 30 days. This preserves the policy apparatus — the clearinghouse exists, the review process exists, the benchmark-setting mandate exists — while giving industry the assurance that no mandatory preclearance regime is being introduced. Whether that assurance holds through future administrations or regulatory expansion is a different question.

The "Voluntary" Problem: How Voluntary Becomes Mandatory

The history of voluntary cybersecurity frameworks is instructive here. NIST's Cybersecurity Framework, introduced as a voluntary standard in 2014, became a de facto mandatory requirement for most enterprise technology procurement by 2018 — not through legislation, but through contractual requirements in government supply chains, insurance underwriting criteria, and board-level governance expectations. The same dynamic played out with SOC 2 compliance: voluntary standard in 2011, industry default by 2020.

The AI Cybersecurity Clearinghouse and the voluntary model review program are positioned at exactly the same starting point. Several migration paths exist:

Federal procurement standards: Agencies writing AI procurement requirements can specify that vendors must have participated in voluntary pre-release reviews as a qualification criterion. This requires no new legislation — it is a procurement discretion exercise.
Sectoral cybersecurity guidance: Financial regulators (OCC, FDIC), energy regulators (FERC, NERC), and healthcare regulators (HHS) can issue sector-specific guidance that incorporates the EO's framework, making it effectively mandatory for regulated entities.
Contractual requirements: Enterprise technology contracts, particularly those involving critical infrastructure, will increasingly include AI security attestation language that references the clearinghouse framework.
Cyber insurance: As underwriters develop AI-specific risk models, participation in the clearinghouse will likely become a factor in premium calculations, creating a financial incentive structure independent of regulatory requirements.

Legal analysts at WilmerHale noted in their June 2 client advisory that while the EO's initiatives are framed as voluntary, "its provisions may well migrate into procurement standards, sectoral cybersecurity guidance and contractual requirements over time, particularly for clients in regulated industries or those doing business with the federal government."

Developer and Enterprise Implications by Segment

Independent Developers and Startups

No immediate action required. The EO's provisions target frontier model developers, not application builders. If you are building on top of OpenAI, Anthropic, Google, or Microsoft APIs, the pre-release review question is your provider's decision to navigate, not yours. The cybersecurity clearinghouse may eventually surface vulnerability disclosures relevant to model components you depend on — follow CISA and NIST publications to stay informed.

Enterprise Application Developers

Begin documenting your AI security posture now. The benchmarks that federal agencies develop over the next 12 months will likely become reference points for enterprise procurement due diligence. Being able to articulate how your AI-enabled products handle adversarial inputs, model poisoning risks, and data exfiltration scenarios — using the same vocabulary the clearinghouse will standardize — will matter in 2027 enterprise sales cycles even if it does not matter today.

Government Contractors and Defense Industrial Base

This is the highest-urgency segment. Voluntary framework provisions consistently become DFARS and FAR clauses faster than they enter commercial procurement standards. If your company holds or competes for federal AI contracts, engage your contracts team now on how the EO's provisions will likely surface in solicitation language over the next 12–18 months. The 30-day pre-release review provision, specifically, will likely appear as a "preferred" attribute in high-sensitivity procurements before it appears as a requirement.

Critical Infrastructure Operators

Energy companies, financial institutions, and healthcare systems deploying AI in operational environments are the clearinghouse's primary intended audience. The threat model is specific: adversaries using AI models to accelerate vulnerability discovery against critical infrastructure systems. The clearinghouse creates a channel for you to receive threat intelligence and vulnerability patches before they are publicly disclosed. Engaging with Treasury and CISA now — while the clearinghouse is being stood up — positions your organization to participate in the most useful early outputs.

Frontier Model Developers (OpenAI, Anthropic, Google, Microsoft, Meta)

The pre-release review program was designed for this segment. The 30-day voluntary window represents an opportunity to build institutional relationships with the intelligence community and establish a track record of security cooperation that supports enterprise sales narratives. Expect all major frontier labs to announce participation in the voluntary program within 60 days — declining to participate in a voluntary security program becomes a talking point for competitors and a procurement liability in government-adjacent markets.

What to Do Right Now

Subscribe to CISA and NIST AI security publications. The AI Cybersecurity Clearinghouse will likely publish its first guidance documents through CISA. Being on the notification list ensures you receive vulnerability disclosures and benchmark publications as they emerge.
Run an AI attack surface audit on your current stack. Before the clearinghouse publishes its first vulnerability assessments, conduct an internal review of every AI component in your production stack — models, inference providers, embedding services, fine-tuned weights. Document the supply chain and the trust assumptions at each step.
If you are a government contractor: open a conversation with your contracts team now. The EO was signed June 2. Expect the first solicitation language referencing it to appear in RFPs by Q4 2026. Getting ahead of this by a quarter is the difference between a prepared response and a scrambled one.
If you are a critical infrastructure operator: contact CISA about clearinghouse participation. The clearinghouse is specifically designed to serve your threat model. Early participation shapes what the clearinghouse prioritizes — waiting until it is fully operational means your threat priorities are downstream of whoever engaged first.
Watch for the benchmark publication from federal agencies. The cyber capability assessment benchmarks mandated by the EO will define what "secure AI" means for federal procurement. When they drop — likely within 6 months — map your current AI security posture against them immediately.
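For the attack surface audit in particular, even a minimal inventory structure helps. The sketch below is one way to record components and flag those with no documented trust assumptions; the field names, component kinds, and sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AIComponent:
    name: str
    kind: str      # e.g. "model", "inference_provider", "embedding_service", "fine_tuned_weights"
    supplier: str  # who ships or hosts the component
    trust_assumptions: list[str] = field(default_factory=list)


def undocumented(components: list[AIComponent]) -> list[str]:
    """Names of components with no recorded trust assumptions."""
    return [c.name for c in components if not c.trust_assumptions]


# Hypothetical sample inventory for illustration.
inventory = [
    AIComponent("chat-model", "model", "vendor-a", ["vendor patches within SLA"]),
    AIComponent("embeddings", "embedding_service", "vendor-b"),
]
print(undocumented(inventory))
```

Anything the function returns is a gap in the audit: a component in production whose supply-chain assumptions nobody has written down.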

Conclusion

Trump's June 2026 AI executive order is best understood not as a compliance event but as a norm-setting event. The direct regulatory burden today is near zero for most of the industry. The indirect effects — on procurement standards, on cybersecurity insurance, on enterprise AI sales cycles, on how frontier model developers position their security narratives — will compound over the next 18–24 months in ways that are already predictable from the history of voluntary cybersecurity frameworks.

The order also signals something important about where U.S. AI policy is heading: light-touch regulation paired with aggressive security infrastructure investment. The clearinghouse is not a regulatory bottleneck; it is a threat-intelligence network. The pre-release review is not a censorship mechanism; it is a national security intelligence program. That framing — security through coordination rather than through restriction — reflects a bet that the U.S. can maintain AI leadership precisely by not imposing the friction that competing jurisdictions use to control their own AI industries. Whether that bet pays off will depend on whether the clearinghouse surfaces real threats faster than adversaries exploit them.

Originally published at wowhow.cloud

Microsoft MAI-Thinking-1 & MAI-Code-1-Flash: Developer Guide to 7 New MAI Models

Anup Karanjkar — Wed, 03 Jun 2026 00:23:14 +0000

Microsoft launched seven new in-house AI models at Build 2026 on June 2, 2026, marking the company's most significant push yet to build its own frontier AI stack independent of OpenAI. The centerpiece is MAI-Thinking-1, Microsoft's first large-scale reasoning model, built from scratch on clean commercially licensed data using a sparse Mixture of Experts architecture. Alongside it: MAI-Code-1-Flash, a 5-billion-parameter coding model that outperforms Claude Haiku 4.5 by 16 percentage points on SWE-Bench Pro while using 60% fewer tokens on complex tasks. This is the complete developer guide to all seven MAI models, their specs, benchmarks, deployment paths, and what they mean for the AI development ecosystem.

Why Seven Models at Once?

The strategic context matters. For three years, Microsoft's AI product surface — GitHub Copilot, Azure AI, Bing Chat, Microsoft 365 Copilot — ran almost entirely on OpenAI models. The Build 2026 announcement is Microsoft's public declaration that it is building a parallel, proprietary model stack. Every new MAI model is trained from scratch using "clean and appropriately licensed data, without distillation from third-party models" — language that directly addresses the intellectual property concerns that have accompanied third-party model licensing.

The distribution strategy is equally deliberate. Microsoft is not routing MAI models exclusively through Azure. MAI-Thinking-1 and MAI-Code-1-Flash are available via Fireworks AI, Baseten, and OpenRouter — three infrastructure providers that collectively reach developers who explicitly do not want cloud vendor lock-in. This signals a platform-first posture: Microsoft wants MAI to become a model ecosystem, not just an Azure feature.

MAI-Thinking-1: The Reasoning Flagship

MAI-Thinking-1 is Microsoft's answer to Claude Opus 4.x and GPT-5.5 on the reasoning side of the model spectrum. The architecture is a 35-billion-parameter active / approximately 1-trillion-parameter total sparse Mixture of Experts model — the same class of architecture as DeepSeek V4 Pro and Nemotron 3 Ultra, where only a fraction of total parameters are active on any given forward pass.

Performance Benchmarks

The reported benchmark story is strong for a model with 35 billion active parameters:

AIME 2025: 97.0% — placing MAI-Thinking-1 in the tier of top-performing reasoning models on competition mathematics
AIME 2026: 94.5% — consistent with AIME 2025, suggesting the reasoning capability is robust across benchmark vintages, not overfit
SWE-Bench Pro: Competitive with Claude Opus 4.6, the previous-generation Anthropic flagship, on real-world software engineering tasks
Human preference: Independent raters at Surge preferred MAI-Thinking-1 over Claude Sonnet 4.6 in blind side-by-side evaluations across single-turn and multi-turn tasks

The 256,000-token context window is adequate for most enterprise agentic tasks: it accommodates approximately a 600-page document, a large codebase summary, or a complex multi-document reasoning task. Function calling is natively supported.
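A quick pre-flight check can tell you whether a document is likely to fit. The sketch below assumes roughly four characters per token, a common rule of thumb for English text and not a figure from the MAI tokenizer, and reserves 4,096 tokens for the response to match the max_tokens value used in this guide's request example.

```python
CONTEXT_WINDOW = 256_000  # tokens, as stated above
CHARS_PER_TOKEN = 4       # rough English-text heuristic; an assumption, not a tokenizer spec


def estimate_tokens(text: str) -> int:
    """Estimate token count; ceiling division avoids rounding short text to zero."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits(text: str, reserved_for_output: int = 4096) -> bool:
    """True if the estimated prompt plus the output reservation fits the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW
```

For anything close to the limit, count tokens with the provider's actual tokenizer instead of relying on the heuristic.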

Availability and Access

MAI-Thinking-1 is in private preview through Microsoft Foundry, available by request to select early partners. It supports the Chat Completions API spec, meaning existing code targeting OpenAI-compatible endpoints requires minimal changes:

import os

from openai import OpenAI

# Endpoint and model ID as given in this guide; access requires private-preview approval.
client = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["AZURE_API_KEY"],  # assumed variable name; avoids hard-coding the key
)

response = client.chat.completions.create(
    model="mai-thinking-1",
    messages=[
        {"role": "system", "content": "You are an expert software architect."},
        {"role": "user", "content": "Design a microservices architecture for a payment processing system."},
    ],
    max_tokens=4096,  # cap on generated tokens
)

print(response.choices[0].message.content)

For teams not on Azure, MAI-Thinking-1 is also available through Fireworks AI and Baseten, which offer competitive inference pricing and multi-cloud routing.

MAI-Code-1-Flash: The Copilot-Native Coding Model

MAI-Code-1-Flash takes a fundamentally different design approach from MAI-Thinking-1. At 5 billion parameters, it is sized for low-latency inline code generation rather than deep reasoning — but its benchmark performance is disproportionate to its size.

What "Copilot-Native" Actually Means

Most coding models are trained on code datasets and then evaluated against Copilot-style workflows. MAI-Code-1-Flash was trained inside GitHub Copilot's production harness — meaning the training distribution matches the exact patterns of real developer interactions, not academic code datasets. This is the same training philosophy that drove early GitHub Copilot performance gains: optimize for the production environment, not for benchmark distributions.

The model uses adaptive thinking: it allocates minimal reasoning budget to simple autocomplete requests and expands to multi-step reasoning for complex refactoring or architecture questions. This avoids the latency penalty of always-on chain-of-thought while preserving quality on hard tasks.

Benchmark Performance

SWE-Bench Pro: 51.2% adjusted accuracy vs. 35.2% for Claude Haiku 4.5 — a 16-point lead on real-world software engineering tasks
Token efficiency: 60% fewer tokens than comparable models on hard tasks (SWE-Bench Verified), which directly translates to lower inference cost in production
Instruction-following: Strong performance across both single-turn and multi-turn scenarios, with explicit optimization for recognizing impossible or underspecified problems rather than hallucinating a plausible-looking but wrong solution

At 5B parameters, MAI-Code-1-Flash is priced like a Haiku-class model but performs significantly above it. For teams paying per-token on inline code suggestions, the economics are worth benchmarking carefully.
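The token-efficiency claim translates into cost with simple arithmetic. The sketch below assumes an identical per-token price for both models and a made-up workload (50k tokens per hard task, 10k tasks a month, $1 per million tokens); none of these figures come from Microsoft's price list.

```python
def monthly_cost(tokens_per_task: int, tasks_per_month: int, usd_per_mtok: float) -> float:
    """Monthly spend in USD for a fixed per-task token budget."""
    return tokens_per_task * tasks_per_month * usd_per_mtok / 1_000_000


# Hypothetical baseline workload and price, for illustration only.
baseline_tokens = 50_000
baseline = monthly_cost(baseline_tokens, 10_000, 1.00)

# 60% fewer tokens leaves 40% of the baseline token count per task.
reduced_tokens = baseline_tokens * 40 // 100
reduced = monthly_cost(reduced_tokens, 10_000, 1.00)

print(f"baseline ${baseline:.2f}/month, with 60% fewer tokens ${reduced:.2f}/month")
```

At equal per-token prices the saving is the full 60%; if the two models are priced differently, scale each side by its own rate before comparing.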

Rollout and Availability

MAI-Code-1-Flash is now live in the GitHub Copilot model picker inside Visual Studio Code, rolling out to all paid Copilot tiers starting June 2. It is also available via OpenRouter for direct API access, making it accessible outside the Microsoft ecosystem without an Azure subscription:

import os

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the same client class works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed variable name; avoids hard-coding the key
)

response = client.chat.completions.create(
    model="microsoft/mai-code-1-flash",
    messages=[
        {"role": "user", "content": "Refactor this Python function to use async/await: def fetch_user(id): return requests.get(f'/users/{id}').json()"}
    ],
)

print(response.choices[0].message.content)

The Multimodal Tier: MAI-Image-2.5, MAI-Voice-2, MAI-Transcribe-1.5

The remaining five models in the seven-model launch are updated versions of models that debuted in April 2026. Each is a meaningful capability upgrade rather than an incremental maintenance release.

MAI-Image-2.5

The previous MAI-Image-2 was primarily a text-to-image generation model. MAI-Image-2.5 adds two significant capabilities:

Image-to-image editing: Accept an image as input and modify it according to a text prompt, enabling product mockups, background replacement, and iterative design workflows without a separate editing pipeline
Control with preservation: Apply structure, depth, or composition constraints to generation while preserving specified regions of a source image — useful for product photography workflows where brand elements must remain fixed

MAI-Image-2.5 debuted at #3 on Arena.ai's image generation model leaderboard, behind only FLUX.1 and Midjourney V9. A MAI-Image-2.5 Flash variant for faster, more cost-efficient generation is available in Microsoft Foundry.

MAI-Voice-2

MAI-Voice-1 (April 2026) supported voice cloning and text-to-speech in a limited language set. MAI-Voice-2 extends voice cloning and voice prompting to more than 15 additional languages, bringing total multilingual TTS coverage to a level competitive with ElevenLabs and OpenAI TTS. A MAI-Voice-2 Flash variant for latency-sensitive real-time applications is planned but not yet released.

MAI-Transcribe-1.5

The updated speech-to-text model now supports 43 total languages, retaining its #1 ranking on the FLEURS benchmark for multilingual ASR accuracy. New in version 1.5: content biasing, which allows developers to supply domain-specific vocabulary (product names, technical terms, proper nouns) to improve recognition accuracy in specialized contexts — a critical feature for enterprise dictation, medical transcription, and customer support applications.

Deployment Options Across the Full MAI Stack

Microsoft has structured MAI model access across four tiers, each suited to different developer contexts:

GitHub Copilot (MAI-Code-1-Flash): Direct integration into the VS Code workflow. No API calls, no SDK setup. Available immediately to paid Copilot subscribers in the model picker. Best for individual developers and teams already on the Copilot platform.
Azure AI Foundry: The primary enterprise deployment path for MAI-Thinking-1 and the multimodal models. Provides access controls, usage monitoring, compliance logging, and private deployment options. MAI-Thinking-1 is in private preview here; the other models are generally available.
OpenRouter / Fireworks AI / Baseten: Third-party inference for teams avoiding Azure. OpenRouter provides instant access with pay-per-token billing and automatic routing between providers. Fireworks AI and Baseten offer dedicated deployment options with lower per-token rates at volume.
Microsoft Foundry SDK: For production applications that need direct API integration with retry logic, streaming, and structured outputs. The SDK exposes all MAI models through a consistent interface aligned with the OpenAI Chat Completions spec.

How to Choose: MAI-Thinking-1 vs. MAI-Code-1-Flash vs. Competitors

The two headline models serve distinct use cases, and neither is a direct competitor to the other:

Use MAI-Thinking-1 when: The task requires multi-step reasoning, mathematical problem solving, or complex code architecture decisions. At competitive performance with Claude Opus 4.6 and with a preference signal over Sonnet 4.6 in human evals, it is a credible option for agentic orchestration tasks where reasoning depth matters. The MoE architecture makes it more economical than dense models at the same capability tier.

Use MAI-Code-1-Flash when: The task is inline code generation, autocomplete, small refactors, or any high-throughput coding workflow where latency and token cost are primary constraints. Its 60% token efficiency advantage over comparable models compounds quickly at scale. Teams running CI/CD pipelines that generate or review code automatically will see meaningful cost reductions.

The competitive positioning in 2026: For reasoning, MAI-Thinking-1 competes with Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.5 Turbo. For coding, MAI-Code-1-Flash occupies the efficient-but-capable tier alongside Claude Haiku 4.5 and Gemini 3.5 Flash — but with a meaningful performance lead over both on SWE-Bench Pro.

The Bigger Picture: Microsoft's Model Independence Strategy

The seven-model announcement is not primarily a model launch — it is a strategic signal. Microsoft has spent three years as OpenAI's largest distribution partner. The $13 billion investment gave it access to GPT-4 and its successors, but created a dependency that analysts have flagged as a risk: if OpenAI raises API prices, changes licensing terms, or gets acquired, Microsoft's AI product surface is exposed.

Building a parallel model stack trained on clean data, distributable across third-party infrastructure, and competitive with OpenAI models on key benchmarks directly addresses that risk. MAI-Thinking-1 being "competitive with Claude Opus 4.6" and MAI-Code-1-Flash outperforming Haiku 4.5 are not coincidental benchmark choices — they are the minimum viable capability thresholds for enterprise buyers who currently use those models. Microsoft is signaling that it can serve those buyers without OpenAI.

What to Do Right Now

Benchmark MAI-Code-1-Flash in your Copilot workflow today. It is live in VS Code's model picker for all paid subscribers. Run it against your codebase for a week and compare code acceptance rate and refactoring quality against your current default model. The 16-point SWE-Bench lead may or may not translate to your specific use case — the only way to know is to test it.
Request early access to MAI-Thinking-1 via Microsoft Foundry. The private preview is limited to select partners, but access requests are open. Teams building complex agentic workflows should evaluate it against Sonnet 4.6 on their specific task distribution before the private preview window closes.
Evaluate MAI-Image-2.5 for product image workflows. The image-to-image editing and control-with-preservation capabilities fill a gap that text-to-image generation alone cannot cover. If you have a pipeline that involves human editing of AI-generated images, MAI-Image-2.5 may reduce the human step.
Revisit your transcription pipeline with MAI-Transcribe-1.5. Content biasing is a genuinely useful production feature for domain-specific applications. If your current transcription pipeline uses Whisper or a competing service, the FLEURS #1 ranking and 43-language support are worth a head-to-head benchmark.

Conclusion

Microsoft's seven-model launch at Build 2026 is the most consequential demonstration yet that the frontier AI model market is moving from a duopoly (OpenAI and Anthropic) toward a multi-vendor ecosystem. A 35B MoE reasoning model competitive with Claude Opus 4.6, a 5B coding model that outscores Haiku 4.5 by 16 percentage points, and an image generation model ranked #3 globally, all trained on clean data and all available through multiple inference providers, represent a mature, productized model family rather than a research preview. The strategic question for developers is not whether these models are good enough. They are. The question is whether Microsoft's infrastructure and ecosystem commitment will match Anthropic's and OpenAI's in the months ahead.

Originally published at wowhow.cloud

GitHub Copilot Token Billing 2026: Full Cost Guide and Alternatives

Anup Karanjkar — Tue, 02 Jun 2026 18:17:29 +0000

GitHub Copilot switched to token-based billing on June 1, 2026 — and the developer community's response has been immediate and overwhelmingly negative. Reports of costs jumping from $29 to $750 per month and from $50 to $3,000 are spreading across Reddit, X, and GitHub's own discussion threads. Here is exactly what changed, what every model costs under the new system, and whether you should stay on Copilot, switch, or build a hybrid stack.

The short answer first: code completions and Next Edit Suggestions remain free under all plans. The billing change only affects AI Credits consumed by chat, agentic features, agent mode, and code review. If autocomplete is your primary workflow, your bill does not change. If you run agentic sessions against large codebases, you need to model your usage before your next billing cycle closes.

What Actually Changed

The old model charged developers in Premium Request Units (PRUs). Each plan came with a monthly PRU allotment; when you exhausted it, Copilot fell back to a lighter base model so you could keep working. That safety net is now gone.

Under the new system, all token consumption during chat, code review, and agentic sessions is metered directly. Token costs vary by model and are converted to AI Credits at a fixed rate: 1 AI Credit = $0.01 USD. These credits are billed on top of your base subscription fee, which remains unchanged:

Copilot Pro: $10/month
Copilot Pro+: $39/month
Copilot Business: $19/user/month
Copilot Enterprise: $39/user/month

The subscription prices are the same. What is gone is the ceiling. Previously, heavy usage was bounded by the flat monthly fee. Now there is no ceiling unless you explicitly set a spending limit in the billing dashboard — and by default, GitHub only sends a notification when a limit is reached rather than stopping usage. You must manually enable "Stop usage when budget limit is reached" to create a hard cap.

What Is Free vs. What Is Billed

GitHub drew a clear line in its documentation. The following features do not consume AI Credits:

Inline code completions (all plans)
Next Edit Suggestions
Multi-line ghost text

The following features do consume AI Credits:

Copilot Chat (IDE, CLI, and web interface)
Agent mode and multi-file edits
Pull request summaries and code review
Copilot CLI commands
Custom extensions using the Copilot Extensions API

The majority of developers using Copilot primarily for autocomplete will see no change in their bill. The pain concentrates on teams using Copilot for agentic refactors, codebase Q&A, PR automation, and multi-file changes — exactly the workflows that justified Copilot Pro+ and Enterprise pricing in the first place.

Model Pricing: The Full Breakdown

Every chat or agentic request routes to a specific model. Input and output tokens are priced separately, so the cost formula is: (input tokens × input rate + output tokens × output rate) ÷ 1,000,000, then converted to AI Credits at 1 credit = $0.01. Based on pricing published by GitHub and corroborated by community analysis:

Economy Models

GPT-5 mini: approximately $0.25/M input, $2.00/M output
Gemini 3.5 Flash: approximately $0.30/M input, $2.50/M output

Mid-Tier Models

GPT-5.5: approximately $1.75/M input, $14.00/M output
Claude Sonnet 4.6: approximately $3.00/M input, $15.00/M output

Frontier Models

GPT-5: approximately $3.75/M input, $15.00/M output
Claude Opus 4.8: approximately $15.00/M input, $75.00/M output
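The cost formula can be sketched in a few lines of Python. This is a minimal illustration that assumes input and output tokens are billed at the separate per-million rates listed above. The `RATES` table, the function names, and the sample request size are hypothetical and not part of any GitHub API.

```python
# Approximate per-million-token rates in USD (input, output), copied from the list above.
RATES = {
    "gpt-5-mini": (0.25, 2.00),
    "gemini-3.5-flash": (0.30, 2.50),
    "gpt-5.5": (1.75, 14.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-5": (3.75, 15.00),
    "claude-opus-4.8": (15.00, 75.00),
}

USD_PER_CREDIT = 0.01  # 1 AI Credit = $0.01


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, pricing input and output tokens separately."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def to_credits(usd: float) -> float:
    """Convert a dollar amount to AI Credits at the fixed rate."""
    return usd / USD_PER_CREDIT


# Hypothetical agentic request: 100,000 tokens of context in, 5,000 tokens out.
for model in ("gpt-5-mini", "claude-sonnet-4.6", "claude-opus-4.8"):
    usd = request_cost_usd(model, 100_000, 5_000)
    print(f"{model}: ${usd:.3f} ({to_credits(usd):.1f} AI Credits)")
```

Swapping in your own token counts from the billing dashboard gives a per-request estimate for each model before you commit to a default.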

The model you select makes an order-of-magnitude difference. A typical Copilot Chat session — five focused questions with roughly 4,000 tokens in and 800 tokens out — costs approximately $0.21 using Claude Sonnet 4.6 (22 AI Credits). The same session on GPT-5 mini costs roughly $0.016 (under 2 AI Credits). At 20 such sessions per workday across 20 working days per month, the monthly spend is roughly $84 on Sonnet 4.6 versus $6.40 on GPT-5 mini, both added on top of your subscription fee.

Real Developer Cost Scenarios

Community reports from the first days of the new billing regime paint a consistent picture:

The daily autocomplete user. Uses completions 90% of the time, opens chat 3–4 times a day for quick questions on GPT-5 mini. Monthly cost increase: roughly $3–5. No meaningful impact.

The heavy chat user. Uses Copilot Chat extensively — 30–40 sessions daily on Claude Sonnet 4.6 for complex reasoning tasks. Estimated monthly chat cost: $150–250 on top of the $39 Pro+ subscription. Was previously paying $39 flat.

The agentic team. Three developers running agent mode against a large monorepo for daily refactoring sessions using GPT-5. Early community estimates: $600–1,200 per developer per month, up from $39 per user. One developer in GitHub's discussion thread projected the jump from $50 to $3,000 for their three-person team.

The PR automation pipeline. Teams running automated pull request summaries, test generation, and code review across dozens of daily PRs are finding token consumption substantial at scale. The economics of metered billing are unfavorable compared to purpose-built CI automation tools for this pattern.

Why GitHub Made This Change

GitHub's public rationale is cost alignment: serving Claude Opus 4.8 or GPT-5 at frontier quality carries real infrastructure costs that a flat monthly fee cannot absorb. The 200× cost difference between a minimal chat request and an hour-long agentic session on a frontier model cannot be cross-subsidized indefinitely at $39/month.

The business logic is sound. The previous PRU system was already a stopgap that mixed fixed billing with soft throttling. Token-based billing is how every AI API in the industry works, and GitHub is bringing its pricing in line with that reality.

What GitHub underestimated was the psychological impact. Developers who internalized AI assistance as a fixed cost — a known monthly line item — now face a variable bill that scales with their most productive days. That changes behavior. Teams that previously ran long agentic sessions without hesitation will now pause to calculate whether the task justifies the credit spend. That is arguably a feature, not a bug, from Microsoft's infrastructure perspective — but it is a friction increase for developers.

Six Strategies to Cut Your Copilot Bill

1. Set a hard spending cap immediately. Navigate to Settings → Billing → Spending limits in your GitHub account (or organization settings for Business/Enterprise). Set a monthly dollar limit and enable "Stop usage when budget limit is reached." Without that checkbox, the limit is advisory only and charges continue to accrue past it.

2. Switch your default chat model to an economy tier. At the rates listed above, GPT-5 mini and Gemini 3.5 Flash cost 10–12× less than Claude Sonnet 4.6 on input tokens and roughly 6–7.5× less on output tokens. For everyday code explanation, documentation lookup, and quick debugging, economy models are more than sufficient. Reserve frontier models for genuinely complex architectural problems.

3. Lean harder on code completions. They remain unlimited and free. If your primary workflow is completion-driven development, the billing change does not affect you, and doubling down on completions rather than chat is now financially rational.

4. Narrow your context window. Agentic sessions that pull large amounts of irrelevant code into context inflate input token counts without adding value. Configure Copilot to reference specific files or modules rather than indexing entire codebases. Reducing context from 100,000 to 20,000 tokens cuts input costs by 80%.

5. Batch your chat sessions. Each session carries overhead from system prompts and context initialization. Five focused questions in one session costs less than five single-question sessions. Group related questions before opening a chat window.

6. Export usage data before the first bill arrives. GitHub's billing dashboard shows per-model token consumption. Review it after the first week of June to project your monthly total and adjust model selection or spending limits accordingly.

The Alternatives: An Honest Comparison

The billing change has triggered genuine migration evaluation across the developer community. For a broader look at the competitive landscape, see the AI coding assistants comparison for 2026. Here is where the main competitors stand today:

Cursor ($20/month). Flat-fee with a generous built-in token allotment for its Composer agent. Cursor's Composer 2.5 uses an in-house long-horizon model that benchmarks near Opus 4.8 and GPT-5.5 on coding tasks. Frontier model access is included up to a monthly request limit. For developers running regular agentic sessions, Cursor's flat pricing beats Copilot's metered model decisively at any significant usage level.

Windsurf ($20–$200/month). Windsurf Pro at $20/month covers most developers on a flat-fee basis. Max at $200/month bundles Devin Cloud and Devin Terminal CLI for teams needing autonomous long-horizon agents. Windsurf remains flat-fee and has been aggressive about adding frontier model options. For teams, the per-seat economics compare favorably to Copilot Business once token overages are factored in.

Claude Code ($17–$100/month). Anthropic's terminal-native coding agent runs 5-hour session windows, with usage limits doubled across Pro, Max, Team, and Enterprise plans in May 2026. For developers who need deep codebase understanding over extended sessions, Claude Code provides predictable flat costs with no per-token overage within plan limits. See the complete Claude Opus 4.8 guide for details on the model powering Max-plan sessions.

Cline (free install + direct API billing). A VS Code extension that routes directly to your choice of AI provider — Anthropic, OpenAI, Google, or a local model — at published API rates with no middleware markup. For developers comfortable managing their own API credentials and budgets, Cline eliminates the Copilot billing intermediary entirely. You pay the same token rates, but with full transparency and no subscription overhead.

The hybrid stack approach. Several developers are recommending this pattern: keep Copilot Pro at $10/month for free code completions and Next Edit Suggestions, then add Cursor or Claude Code for all chat and agentic work at a flat fee. Total monthly cost: $27–$30 for two tools, with completions unmetered on the Copilot side and chat and agentic work covered by a flat fee up to the second tool's plan limits. This is arguably the most cost-rational option for developers who rely on both completions and agentic workflows.
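The flat-fee argument reduces to a break-even count. A minimal sketch, using the roughly $0.21 per Sonnet 4.6 session figure quoted earlier and a $20 flat plan, and assuming the flat plan's allotment actually covers that usage. The function name is hypothetical.

```python
import math


def break_even_sessions(flat_fee_usd: float, metered_cost_per_session_usd: float) -> int:
    """Smallest number of monthly sessions at which metered billing costs at least the flat fee."""
    if metered_cost_per_session_usd <= 0:
        raise ValueError("metered cost per session must be positive")
    return math.ceil(flat_fee_usd / metered_cost_per_session_usd)


# Figures taken from this article: $20/month flat plan, about $0.21 per metered session.
sessions = break_even_sessions(flat_fee_usd=20.0, metered_cost_per_session_usd=0.21)
print(f"Break-even: {sessions} sessions per month")
print(f"About {sessions / 20:.1f} sessions per working day over 20 working days")
```

Under these assumptions, a developer who opens more than about five Sonnet sessions per working day is past the point where the flat plan is cheaper.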

The Bottom Line

GitHub Copilot's move to token-based billing is transparent, technically justified, and genuinely disruptive for a specific segment of developers. The pricing is not punitive — it reflects actual AI inference costs, and the same economics apply at every AI API in the industry. The problem is the loss of the safety net and the surprise of discovering that frontier-model agentic work is substantially more expensive than a $39/month flat fee implied.

If completions are your primary workflow, nothing changes. Stay on your current plan and use the new model-selection controls to optimize the occasional chat session.

If chat and agent mode are core to your workflow, the framework is clear: calculate your actual monthly token spend using the model pricing table above, set a hard spending cap today, and evaluate whether Cursor, Claude Code, or a hybrid stack delivers better economics for your specific usage pattern.

The market is more competitive than it has ever been. GitHub's pricing change has handed Cursor, Windsurf, and Claude Code a compelling acquisition argument, and all three are investing aggressively in the agentic coding use case. If the metered model proves unpopular in practice, usage data will make that visible and will pressure GitHub to introduce flat-rate agent plans or usage tiers. For now: set a cap, choose your models deliberately, and let the numbers guide the decision.

Originally published at wowhow.cloud

NVIDIA Nemotron 3 Ultra 550B: Developer Guide — Architecture, Benchmarks & Deployment

Anup Karanjkar — Tue, 02 Jun 2026 00:19:19 +0000

Jensen Huang walked on stage at Computex 2026 in Taipei on June 1 and announced what NVIDIA calls the most intelligent open-weights AI model built in the United States: Nemotron 3 Ultra, a 550-billion-parameter mixture-of-experts model that delivers over 300 output tokens per second and cuts complex agentic task costs by 30 percent. The weights ship to Hugging Face on June 4, 2026. Here is everything developers need to understand the architecture, run the benchmarks, and deploy it.

The announcement lands at a significant inflection point. The two models currently at the top of the frontier — Claude Opus 4.8 and GPT-5.5 — are proprietary, API-only, and priced accordingly. DeepSeek V4 Pro, the only open-weights competitor anywhere near frontier performance, requires roughly 862GB of VRAM to run — effectively a dedicated GPU cluster. Nemotron 3 Ultra is NVIDIA's answer to both constraints: intelligence approaching the frontier, open weights, and an architecture engineered for throughput rather than just accuracy.

The Numbers: What You Are Actually Getting

The headline specs: 550B total parameters, 55B active per forward pass via mixture-of-experts routing, a 1-million-token context window, and native support for multi-token prediction. The model was trained in 4-bit NVFP4 precision on NVIDIA's Blackwell architecture — the same hardware on which it runs most efficiently in production.

On the Artificial Analysis Intelligence Index — a composite benchmark aggregating 10 evaluations spanning reasoning, coding, general knowledge, and agentic performance — Nemotron 3 Ultra scores 48.0. That places it as the top US open-weight model by a significant margin. For reference: Claude Opus 4.8 scores 61.4 and GPT-5.5 scores 60.2 on the same index. Nemotron 3 Ultra sits roughly 12–13 points behind the closed-source frontier — but at zero per-token API cost when self-hosted.

The throughput story is more compelling than the intelligence index alone. Serving at over 300 output tokens per second on optimized hardware, Nemotron 3 Ultra runs approximately 5× faster than a comparable dense model at the same accuracy level. NVIDIA attributes this to the LatentMoE architecture and Mamba-2 layers that provide linear-time complexity for long-context inference, in contrast to the quadratic attention cost that makes standard Transformers expensive at 1M-token context lengths.

The LatentMoE Architecture, Explained

Nemotron 3 Ultra introduces a new expert routing mechanism called LatentMoE that is worth understanding for developers planning to fine-tune or build on top of it.

In a standard MoE model, routing selects a small subset of expert networks per token and sends the full token embedding through each selected expert. LatentMoE projects the token from the model's hidden dimension into a smaller latent dimension before routing and expert computation. This compression achieves three things simultaneously:

Reduces the VRAM footprint of routed expert parameters by approximately 4×
Allows the same inference budget to activate 4× more experts per token
Improves accuracy per byte because more specialists contribute to each prediction without a proportional cost increase

The net effect: Nemotron 3 Ultra activates more specialized computation per token than a standard MoE at the same memory cost. The 10% activation ratio (55B of 550B) is already class-leading efficiency, and LatentMoE compounds this by running each activated expert in the smaller latent space, which makes it cheaper to store and compute.
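A back-of-the-envelope sketch of the footprint argument. It assumes each expert is a two-matrix feed-forward block and that the latent dimension is one quarter of the hidden dimension. Those assumptions, and every dimension below, are illustrative and not published Nemotron 3 Ultra values.

```python
def expert_params(model_dim: int, ffn_dim: int) -> int:
    """Weights in one feed-forward expert: an up-projection plus a down-projection."""
    return 2 * model_dim * ffn_dim


HIDDEN_DIM = 8192             # hypothetical model hidden dimension
LATENT_DIM = HIDDEN_DIM // 4  # assumed 4x compression into the latent space
FFN_DIM = 16384               # hypothetical expert inner dimension

standard = expert_params(HIDDEN_DIM, FFN_DIM)  # expert operating on the full hidden state
latent = expert_params(LATENT_DIM, FFN_DIM)    # expert operating on the latent projection
shrink = standard // latent

# With the same per-token parameter budget, latent experts allow `shrink` times more activations.
# The shared hidden-to-latent and latent-to-hidden projections are a fixed extra cost not counted here.
STANDARD_ACTIVE_EXPERTS = 8  # hypothetical
latent_active_experts = STANDARD_ACTIVE_EXPERTS * shrink

print(f"Per-expert parameters: {standard:,} vs {latent:,} ({shrink}x smaller)")
print(f"Experts per token at equal budget: {STANDARD_ACTIVE_EXPERTS} vs {latent_active_experts}")
```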

The hybrid Mamba-Transformer design adds a second efficiency layer. Mamba-2 layers handle long-range sequence dependencies with linear time complexity, replacing a subset of the attention layers that would otherwise make 1M-token context inference prohibitively expensive. The result is a model that can process one million tokens in context without the memory and compute explosion that standard Transformer attention would require at that scale.
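The scaling difference is easy to see with relative numbers. This sketch compares how per-layer cost grows when context goes from 128,000 to 1,000,000 tokens, in arbitrary units and ignoring constant factors. The baseline length is an arbitrary choice for illustration, and a hybrid model sits between the two curves because it keeps some attention layers.

```python
def attention_cost(seq_len: int) -> int:
    """Relative cost of full self-attention: every token attends to every token."""
    return seq_len * seq_len


def linear_cost(seq_len: int) -> int:
    """Relative cost of a linear-time sequence layer such as a state-space layer."""
    return seq_len


BASE_LEN = 128_000      # arbitrary baseline context length
TARGET_LEN = 1_000_000  # the 1M-token context window

attention_growth = attention_cost(TARGET_LEN) / attention_cost(BASE_LEN)
linear_growth = linear_cost(TARGET_LEN) / linear_cost(BASE_LEN)

print(f"Attention layer cost grows {attention_growth:.1f}x")
print(f"Linear-time layer cost grows {linear_growth:.1f}x")
```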

Multi-Token Prediction (MTP) predicts multiple future tokens in a single forward pass rather than generating one token at a time autoregressively. This technique, popularized by DeepSeek-V3, reduces the effective number of forward passes required per output token. Combined with the Mamba-2 linear complexity, it produces the 300+ token/second throughput NVIDIA is advertising, a figure independently corroborated by Artificial Analysis benchmarks.
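To get a feel for how much MTP can reduce forward passes, here is a rough sketch under a common simplifying assumption: each extra predicted token is accepted independently with a fixed probability, and the first rejection discards the rest. NVIDIA has not described Nemotron's acceptance behavior in these terms, so both the model and the sample numbers are assumptions.

```python
def expected_tokens_per_pass(extra_tokens: int, accept_rate: float) -> float:
    """Expected tokens emitted per forward pass.

    One token is always produced. Extra token i is kept only if it and all
    earlier extra tokens are accepted, which happens with probability accept_rate ** i.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    return sum(accept_rate ** i for i in range(extra_tokens + 1))


# Hypothetical settings: 2 extra predicted tokens, 80% acceptance per token.
tokens = expected_tokens_per_pass(extra_tokens=2, accept_rate=0.8)
print(f"Expected tokens per forward pass: {tokens:.2f}")
print(f"Forward passes per output token: {1 / tokens:.2f}")
```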

Benchmark Position: An Honest Assessment

Intelligence Index rankings are aggregate composites, and the gap between Nemotron 3 Ultra (48.0) and frontier closed models (60+) compresses significantly on specific workloads. The categories where the gap matters least for developers:

Code generation and debugging: NVIDIA's own benchmarks show the Ultra model performing within 8–10% of GPT-5.5 on HumanEval and LiveCodeBench. For engineering automation tasks, that margin is often within practical noise.
Long-context RAG: With a 1M token context window and linear-time Mamba layers, Nemotron 3 Ultra has a structural advantage over models limited to 200K tokens. Tasks like codebase-wide refactoring, legal document analysis, and multi-document research synthesis play to its architectural strengths.
High-throughput batch processing: At 300+ tokens/second, a self-hosted Nemotron 3 Ultra node can process multiple document summarization jobs simultaneously that would take 5× longer on a comparable dense model. The economics shift quickly at scale.

The categories where the gap matters most:

Multi-step agentic reasoning: Claude Opus 4.8's Dynamic Workflows and GDPval-AA score (1,890 Elo) reflect a capacity for sustained autonomous reasoning that no current open model fully matches. For mission-critical agent pipelines where reasoning depth directly maps to business outcomes, the closed-model advantage is real.
Instruction following on ambiguous tasks: Frontier models have accumulated years of RLHF refinement that produces better calibration on edge cases. Open-weights models at this scale are still catching up on instruction-following reliability in adversarial production scenarios.

The honest framing: Nemotron 3 Ultra is not GPT-5.5 or Claude Opus 4.8. It is the best open-weights model available for teams that need data sovereignty, cost control, or cannot route enterprise data through third-party APIs. Those constraints cover a substantial fraction of production deployments in finance, healthcare, legal, and defense.

How to Access Nemotron 3 Ultra

NVIDIA is distributing the model through four primary channels, each suited to different deployment contexts.

Option 1: Hugging Face (Self-Hosted)

The weights will be published at nvidia/NVIDIA-Nemotron-3-Ultra-550B on Hugging Face on June 4. Unlike many nominally "open" models that release only inference weights, NVIDIA is also publishing training recipes, a 2.5-trillion-token pre-training dataset, and specialized code and math datasets through the official NVIDIA-NeMo/Nemotron GitHub repository. This means Nemotron 3 Ultra is genuinely fine-tunable, not just deployable.

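Weight memory follows from the parameter count: 550B parameters occupy roughly 1.1TB in BF16 (2 bytes each), which exceeds a single 8× H100 node (640GB VRAM), so full-precision serving is a multi-node deployment. The FP8 quantized variant needs roughly 550GB for weights alone and fits an 8× H100 node with limited headroom for KV cache. NVFP4, the training-native precision, requires Blackwell (B200 or GB200) and brings the weights down to roughly 275GB. NVIDIA has not published the exact NVFP4 memory footprint at time of writing, and early reports that it fits a single DGX Spark are unconfirmed.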
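Weight memory can be estimated from parameter count and bytes per parameter. A rough sketch that counts weights only: KV cache, activations, and quantization scale factors are ignored, so real deployments need more than these figures. The helper names are hypothetical.

```python
import math

# Bytes per parameter for each precision; NVFP4 is treated as a plain 4-bit format here.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes): billions of parameters times bytes each."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Fewest GPUs whose combined memory holds the weights, with no headroom."""
    return math.ceil(memory_gb / gpu_memory_gb)


# GPU counts are a capacity comparison only; NVFP4 itself needs Blackwell hardware.
for precision in ("bf16", "fp8", "nvfp4"):
    gb = weight_memory_gb(550, precision)
    print(f"{precision}: ~{gb:.0f} GB of weights, at least {min_gpus(gb)} x 80GB GPUs")
```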

Option 2: NVIDIA NIM Microservice (Managed)

NVIDIA's NIM (NVIDIA Inference Microservices) wraps the model as an OpenAI-compatible REST endpoint with automatic batching, KV cache management, and observability included. Available at build.nvidia.com with an NVIDIA AI Enterprise license for production use. NIM is the fastest path from zero to a compliant, auditable API endpoint — particularly relevant for enterprises subject to data residency requirements where self-hosting is mandatory but engineering overhead must be minimized.

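import os

from openai import OpenAI

# NIM exposes an OpenAI-compatible endpoint, so the standard OpenAI client works.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    # Assumes the key is exported as NVIDIA_API_KEY instead of being hard-coded.
    api_key=os.environ["NVIDIA_API_KEY"],
)

# A low temperature keeps the review output close to deterministic.
response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b",
    messages=[
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Review this codebase and identify security vulnerabilities."},
    ],
    max_tokens=8192,
    temperature=0.1,
)

print(response.choices[0].message.content)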

Option 3: OpenRouter (Instant API Access)

OpenRouter exposes Nemotron 3 Ultra as a standard OpenAI-compatible API endpoint. This is the fastest path for developers who want to evaluate the model without provisioning GPU infrastructure. No NVIDIA account required. Use the model identifier nvidia/nemotron-3-ultra-550b in OpenRouter's API, billed per token at OpenRouter's published rates.

Option 4: Self-Hosted with vLLM or SGLang

NVIDIA has published official vLLM cookbooks in the NVIDIA-NeMo/Nemotron GitHub repository under usage-cookbook/Nemotron-3-Ultra-Base. For sustained production workloads, NVIDIA also supports TensorRT-LLM, which delivers higher throughput than vLLM at the cost of more complex initial configuration. SGLang is worth evaluating: on H100 hardware, SGLang leads vLLM by approximately 29% throughput on standard workloads and up to 6× on prefix-heavy RAG pipelines where KV cache reuse is significant.

# Quick start with vLLM
pip install vllm

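# --dtype auto lets vLLM take the precision from the FP8 checkpoint config.
# Eight GPUs: FP8 weights for 550B parameters are roughly 550GB.
python -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Ultra-550B-FP8 \
  --dtype auto \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --port 8000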

Practical Use Cases for 2026

Enterprise Agentic Pipelines With Data Sovereignty

The strongest case for Nemotron 3 Ultra is enterprises running agentic workflows over sensitive data: financial modeling, legal document review, healthcare records analysis, internal code audit. With a score of 48 on the AA Intelligence Index and a 1M-token context window, it handles the complexity of real enterprise tasks. With open weights and a self-hosted NIM deployment, the data never leaves your infrastructure. This combination of near-frontier intelligence, verified data control, and predictable compute costs is what closes the gap between a proof-of-concept agent and a compliance-approved production system.

High-Volume Code Generation Pipelines

At 300+ tokens/second, a single GPU node running Nemotron 3 Ultra can serve multiple simultaneous code generation sessions with lower latency than a throttled external API endpoint. Teams running CI/CD automation that generates test suites, migration scripts, or documentation should benchmark the cost-per-output-token carefully. At scale, the savings over frontier API pricing can be substantial even after accounting for GPU infrastructure costs. A rough calculation: 300 tokens/second is about 1.08 million tokens per hour, or roughly 270,000 tokens per dollar at $4/hour. That $4 is a per-GPU H100 cloud price, though, and the model spans a multi-GPU node: on 8 GPUs, a single 300 tokens/second stream yields about 34,000 tokens per dollar, or roughly $30 per million tokens. Self-hosting undercuts frontier API pricing of $15–30 per million output tokens only when batching keeps many concurrent streams on the node.
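The tokens-per-dollar arithmetic is worth parameterizing, because the answer depends heavily on GPU count and on how many requests the node serves at once. A minimal sketch, assuming throughput scales linearly with concurrent streams (an idealization) and using hypothetical function names:

```python
def tokens_per_dollar(
    tokens_per_second: float,
    gpu_hour_usd: float,
    gpu_count: int,
    concurrent_streams: int = 1,
) -> float:
    """Output tokens generated per dollar of GPU rental for the whole node."""
    tokens_per_hour = tokens_per_second * concurrent_streams * 3600
    node_cost_per_hour = gpu_hour_usd * gpu_count
    return tokens_per_hour / node_cost_per_hour


def usd_per_million_tokens(tokens_per_usd: float) -> float:
    """Convert tokens per dollar into the dollars-per-million-tokens unit APIs quote."""
    return 1_000_000 / tokens_per_usd


# 300 tokens/second per stream and $4/GPU-hour come from the paragraph above;
# the GPU count and stream counts are illustrative.
for gpus, streams in ((1, 1), (8, 1), (8, 16)):
    tpd = tokens_per_dollar(300, 4.0, gpus, streams)
    print(f"{gpus} GPU(s), {streams} stream(s): {tpd:,.0f} tokens/$ "
          f"(${usd_per_million_tokens(tpd):.2f} per million)")
```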

Long-Context Document and Codebase Workflows

The 1M token context window is functional, not theoretical. The Mamba-2 architecture ensures that processing the full context does not incur quadratic compute cost as you scale toward the context limit. Teams currently chunking large documents due to context limits can run them whole. A 1M-token window fits approximately 750,000 words of text — equivalent to processing a complete enterprise codebase, a full legal agreement package, or several years of customer support transcripts in a single inference call.

The Open-Weights Moment

Nemotron 3 Ultra is not the first large open-weights model — but it may be the most strategically significant one since Llama 4 Scout. The combination of near-frontier intelligence, published training data and recipes, a hardware-aware architecture optimized for Blackwell, and four distinct deployment options represents a clear thesis: NVIDIA believes the long-run value in the AI stack accrues to hardware and infrastructure, not model weights. Publishing the weights is therefore commercially strategic, not charitable. Every team that builds a production pipeline on Nemotron 3 Ultra is a future NVIDIA GPU customer.

For developers, the implication is practical: a high-quality, genuinely open model now exists that can be fine-tuned on proprietary data, audited by compliance teams, deployed on private infrastructure, and modified without a licensing agreement with a closed-model API provider. The intelligence gap with Opus 4.8 and GPT-5.5 is real but narrowing. If the Nemotron 3 Super (120B) trajectory is any precedent, the Ultra will receive ongoing training and post-training refinement updates through 2026.

What to Do Right Now

Evaluate on OpenRouter or build.nvidia.com today. The model is available as an API endpoint with no GPU provisioning required. Run your standard benchmark prompts before the June 4 Hugging Face weights release so you have a baseline.
Pull the NVIDIA-NeMo/Nemotron GitHub repository. The vLLM cookbook, training recipes, and dataset documentation are already live. Reviewing them now will accelerate your deployment decision.
Benchmark against your actual workload, not aggregate indices. If your primary use case is long-context RAG or high-volume batch processing, Nemotron 3 Ultra may outperform or match frontier models on your specific task even with a lower aggregate index score.
Model your hardware economics. FP8 weights alone come to roughly 550GB, so budget for an 8× H100 node; NVFP4 on Blackwell roughly halves that. Compare the dedicated node cost against your current frontier API bill at projected token volume to find your own crossover point.
Evaluate fine-tuning eligibility. The published training dataset and training recipes are a significant differentiator over every closed model. If your application benefits from domain-specific adaptation — legal reasoning, scientific literature, financial modeling — Nemotron 3 Ultra is currently the only near-frontier option that permits and provides the infrastructure for it.

Conclusion

Nemotron 3 Ultra is the clearest signal yet that NVIDIA is serious about the software layer of AI infrastructure, not just the hardware. A 550B open-weights model with LatentMoE, 1M token context, 300+ token/second throughput, and published training data is not a research release — it is a production bet. It will not replace Claude Opus 4.8 for teams that need the highest reasoning quality and are comfortable routing data through Anthropic's API. It will replace frontier closed models for a meaningful fraction of production workloads where data sovereignty, cost predictability, and customizability outweigh the 12-point intelligence index gap. June 4 is the date to watch.

Originally published at wowhow.cloud

I Built 9 Production AI Agents With Claude Code — Here Is the Complete Workflow

Anup Karanjkar — Sat, 30 May 2026 21:11:33 +0000

TL;DR: Claude Code is a complete agent runtime, not just a coding assistant. Over 14 weeks, I built and shipped 9 production agents — an SEO research pipeline, a daily analytics oracle, a content syndication system, a deploy watchdog, and five specialist growth agents — using nothing but CLAUDE.md files, MCP servers, hooks, and subagents. No Hermes. No LangChain. No external orchestrator. Total infrastructure cost: under $180/month. This guide walks through the exact architecture, every configuration file, the failure modes I hit, and the patterns that actually survive production.

The Five-Layer Agent Architecture Inside Claude Code

Claude Code in 2026 is not the terminal autocomplete tool it was twelve months ago. Anthropic has shipped five distinct extension layers that, composed together, turn it into a full agent orchestration framework. Understanding which layer handles which responsibility is the difference between an agent that works in a demo and one that runs unsupervised for weeks.

The first layer is CLAUDE.md — a Markdown file at your project root that is always loaded into the model's context window. Every instruction, every constraint, every architectural decision you write here shapes every action the agent takes. This is not documentation. It is the agent's operating system. A well-written CLAUDE.md eliminates entire categories of failure by making the right behavior the default behavior. A poorly written one — or worse, an empty one — produces an agent that makes reasonable-sounding decisions that silently break your system.

The second layer is MCP servers — external tool access over the Model Context Protocol. MCP servers let Claude Code interact with databases, APIs, browsers, cloud services, and any system that exposes a JSON-RPC interface. As of May 2026, over 2,300 public MCP servers exist, and any team can build custom ones. The critical design decision: limit yourself to 3-5 servers per agent. Each server adds tool definitions to the context window, and tool-selection quality degrades measurably above that threshold.

The third layer is skills — reusable Markdown workflow files stored in .claude/skills/ and invoked as slash commands. Skills encode multi-step procedures that would otherwise require the agent to figure out the process from scratch each time. A blog-writing skill, for example, encodes the SEO checklist, content structure, data format, build verification, and commit conventions — turning a 45-minute manual process into a single command.

The fourth layer is hooks — deterministic shell scripts that execute at specific lifecycle points: PreToolUse, PostToolUse, Stop, SessionStart, and UserPromptSubmit. Hooks are not AI. They are plain shell scripts that run before or after the model acts. Exit code 2 blocks the tool call entirely. This is how you build hard guardrails — not by asking the model to police itself, but by making dangerous actions physically impossible.

The fifth layer is subagents — isolated Claude sessions launched from a parent session with their own context window. Subagents are defined as Markdown files in .claude/agents/ and can be spawned in parallel, in the background, or in isolated git worktrees. They communicate results back to the parent but cannot see the parent's conversation history. This is the primitive that makes multi-agent coordination possible without an external orchestrator.

Agent Zero: The CLAUDE.md Harness Pattern

Every production agent I have built starts with the same pattern: a CLAUDE.md file that functions as the agent's constitution. Not a loose collection of tips — a structured document with decision trees, forbidden actions, verification protocols, and explicit failure modes.

Here is the skeleton that has survived 14 weeks of production use across all 9 agents:

# Agent Name — Purpose Statement (one line)

## Decision Engine

Multi-file change (3+ files)? → Use persistent-planner skill
Research + build? → Research FIRST, THEN build

200+ lines of new code? → Split into subagents by module
Bug with unclear cause? → Investigate before fixing
Same fix attempted 3+ times? → STOP. Surface root cause to user


## Hard Rules (each from a real incident)
- RULE 1: [What happened] → [What to never do again]
- RULE 2: [What happened] → [What to never do again]

## Verification Protocol — MANDATORY
After ANY non-trivial implementation:
1. Spawn verification-agent — read-only, adversarial
2. Wait for VERDICT — only report "done" after PASS
3. Never claim "fixed" based on reading code — run actual commands

## Model Tiering
| Task | Model | Why |
|------|-------|-----|
| Trust-boundary code | Opus | Payment, auth, webhooks |
| Feature implementation | Sonnet | Routine edits, CRUD |
| Batch text work | Haiku | SEO descriptions, formatting |

The decision engine section is not aspirational. It is load-bearing. Without it, I watched agents attempt 200-line rewrites in a single pass, fail, retry the same approach, fail again, and burn through $15 in tokens producing nothing useful. With the decision engine, the agent reads the instruction, routes to the correct approach, and succeeds on the first or second attempt.

The hard rules section grows organically from production incidents. My CLAUDE.md started with zero rules. It now has 24. Each one exists because ignoring it caused a real outage, a corrupted deploy, or a silent data loss. Rule 22, for example: headers() catch-all MUST come BEFORE private route rules — discovered when checkout pages were publicly cached at Cloudflare's edge for 30 seconds because the header ordering was reversed. Rule 14: Do NOT add slug checks to proxy.ts — three consecutive production crashes from the same attempted fix.

MCP Server Configuration for Production Agents

MCP servers are the agent's hands. Without them, Claude Code can read files and run shell commands. With them, it can query Google Search Console, pull GA4 analytics, manage Cloudflare Workers, interact with browsers, search documentation, and call any API with a published MCP server.

Here is the MCP configuration that powers my analytics oracle agent — the one that runs every morning at 9:03 AM IST and produces a daily situation report:

// ~/.claude/settings.json (user scope — available to all projects)
{
  "mcpServers": {
    "gsc": {
      "command": "/Users/me/.local/bin/mcp-gsc",
      "env": { "GSC_SKIP_OAUTH": "true" }
    },
    "ga4": {
      "command": "/Users/me/.local/bin/ga4-mcp-server",
      "env": {
        "GA4_PROPERTY_ID": "529733024",
        "GOOGLE_APPLICATION_CREDENTIALS": "/Users/me/.config/google-seo-mcp/service-account.json"
      }
    },
    "cloudflare": {
      "command": "npx",
      "args": ["-y", "@anthropic-ai/mcp-cloudflare"],
      "env": {
        "CLOUDFLARE_ACCOUNT_ID": "a319e...",
        "CLOUDFLARE_API_TOKEN": "cfut_..."
      }
    }
  }
}

Three servers. Not twelve. I experimented with adding Slack, GitHub, Notion, and Playwright MCP servers simultaneously. The result: tool selection accuracy dropped from roughly 95% to below 80%. The model would choose a Slack tool when it meant to use GitHub, or attempt a Playwright screenshot when a simple curl would suffice. The 3-5 server sweet spot is not a suggestion — it is a measured threshold.

For project-scoped servers that the team shares, put the configuration in .claude/settings.json at the project root and commit it to git. For user-scoped servers with personal credentials, use ~/.claude/settings.json. Never commit API tokens to project-scoped config.
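
Both rules above (at most 3-5 servers, no tokens in committed config) can be checked mechanically. The following is a minimal sketch of such a lint, assuming the config uses the `mcpServers` layout shown earlier; `lint_mcp_config` and the `SECRET_HINTS` name heuristic are hypothetical, not part of Claude Code.

```python
import json

MAX_SERVERS = 5  # upper end of the 3-5 range recommended above
SECRET_HINTS = ("TOKEN", "SECRET", "KEY", "PASSWORD")  # hypothetical name heuristic


def lint_mcp_config(config_text: str, project_scoped: bool) -> list[str]:
    """Return human-readable warnings for an MCP settings file."""
    servers = json.loads(config_text).get("mcpServers", {})
    warnings = []
    if len(servers) > MAX_SERVERS:
        warnings.append(
            f"{len(servers)} servers configured; recommended maximum is {MAX_SERVERS}"
        )
    if project_scoped:
        # Only committed (project-scoped) config is checked for credentials.
        for name, server in servers.items():
            for env_name in server.get("env", {}):
                if any(hint in env_name.upper() for hint in SECRET_HINTS):
                    warnings.append(
                        f"{name}: {env_name} looks like a credential in committed config"
                    )
    return warnings
```

Run against the analytics oracle config as if it were project-scoped, this would flag `CLOUDFLARE_API_TOKEN` and nothing else. The name-based heuristic is deliberately crude and would miss credentials stored under neutral variable names.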

Hooks: The Guardrails That Actually Work

The single most important lesson from 14 weeks of production agents: do not rely on the model to enforce constraints. Models are probabilistic. Hooks are deterministic. If an action must never happen — a force push to main, a database flush without confirmation, a deploy during an active CI run — encode that constraint in a hook, not in a prompt.

Here is a real hook from my production setup that prevents accidental destructive git operations:

// .claude/settings.json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "command": "printf '%s' \"$TOOL_INPUT\" | python3 -c \"import sys,json; cmd=json.load(sys.stdin).get('command',''); bad=['git push --force','git reset --hard','FLUSHDB','DROP TABLE']; sys.exit(2 if any(b in cmd for b in bad) else 0)\""
      }
    ]
  }
}

Exit code 2 blocks the tool call. The model receives a rejection message and must find an alternative approach. This is not a suggestion to the model; it is a physical wall. The model cannot force-push, cannot hard reset, cannot flush the database, no matter how convincing its reasoning.
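
An inline one-liner gets hard to maintain as the blocklist grows, so the same check can live in a standalone script. This is a sketch under one assumption: the hook payload is a JSON object carrying the shell command either at the top level (`command`) or nested under `tool_input`. Verify the payload shape against your Claude Code version before relying on it.

```python
import json

# Substrings treated as destructive; matching is a plain substring test.
BLOCKED_PATTERNS = ["git push --force", "git reset --hard", "FLUSHDB", "DROP TABLE"]


def extract_command(payload_text: str) -> str:
    """Pull the shell command out of the hook payload (shape is an assumption)."""
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError:
        return ""
    if not isinstance(payload, dict):
        return ""
    nested = payload.get("tool_input")
    if isinstance(nested, dict) and "command" in nested:
        return str(nested["command"])
    return str(payload.get("command", ""))


def exit_code_for(payload_text: str) -> int:
    """Return 2 (block the tool call) for a destructive command, else 0."""
    command = extract_command(payload_text)
    return 2 if any(pattern in command for pattern in BLOCKED_PATTERNS) else 0


# Installed as a hook script, the entry point would be:
#   import sys; sys.exit(exit_code_for(sys.stdin.read()))
```

Note that this sketch fails open: an unparseable payload returns 0 and the command runs. A stricter setup could return 2 on parse errors instead, at the cost of blocking every Bash call if the payload format changes.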

The SessionStart hook is equally powerful for agent initialization. My production setup runs a health check script at the start of every session that verifies container status, checks for uncommitted changes, validates environment variables, and confirms MCP server connectivity. If any check fails, the agent starts with full context about what is broken — instead of discovering it 10 minutes into a task after modifying files that should not have been touched.

{
  "hooks": {
    "SessionStart": [
      {
        "command": "cd storefront && npx tsx scripts/health-check.ts 2>&1 | head -50"
      }
    ]
  }
}
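
The health-check.ts script itself is not shown here. As a rough Python stand-in, the sketch below covers two of the checks described above (environment variables and uncommitted changes). All function names are hypothetical, and the container and MCP connectivity checks are omitted.

```python
import subprocess


def missing_env(required: list[str], environ: dict[str, str]) -> list[str]:
    """Names from `required` that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def dirty_files(porcelain_output: str) -> list[str]:
    """Paths listed by `git status --porcelain` (two status columns stripped)."""
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def git_porcelain() -> str:
    """Run git and return its machine-readable status output."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def build_report(required_env: list[str], environ: dict[str, str],
                 porcelain_output: str) -> str:
    """One line per problem, so the agent starts the session knowing what is broken."""
    lines = [f"FAIL env: {name} is not set" for name in missing_env(required_env, environ)]
    lines += [f"WARN git: uncommitted change in {path}" for path in dirty_files(porcelain_output)]
    return "\n".join(lines) if lines else "OK: all checks passed"
```

A hook command would then print `build_report(["DATABASE_URL"], dict(os.environ), git_porcelain())`, where `DATABASE_URL` is only a placeholder for whatever variables your project requires.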

Subagent Coordination Without an Orchestrator

The pattern that changed everything was realizing that Claude Code's built-in Agent tool — the ability to spawn subagents — eliminates the need for external orchestration frameworks. A parent session can launch multiple subagents in parallel, each with a specific brief, and aggregate their results.

Here is how my growth coordinator agent works. It is a single CLAUDE.md-defined agent that spawns 5 specialist subagents every Monday morning:

# Growth Coordinator — Weekly Multi-Agent Run

## Workflow
1. Spawn analytics-oracle agent → daily metrics + anomalies
2. Spawn seo-dominator agent → keyword gaps + ranking changes
3. Spawn content-architect agent → content calendar + topic gaps
4. Spawn cro-assassin agent → conversion funnel analysis
5. Spawn competitive-intel agent → competitor price/feature delta

## Coordination Rules
- Launch agents 1-3 in parallel (independent data)
- Wait for analytics-oracle results before launching CRO agent (needs baseline)
- Aggregate all results into weekly synthesis report
- Post synthesis to Telegram channel

Each specialist agent is defined as a Markdown file in .claude/agents/ with its own system prompt, tool access, and output format. The parent agent reads their results and synthesizes. No LangGraph. No CrewAI. No custom Python orchestration code. The orchestration is declarative Markdown, and the execution is Claude Code's native subagent primitive.
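
In the setup described above the ordering lives in Markdown and the model enforces it. To make the coordination rules concrete, here is a small sketch that turns the same dependencies into launch waves. `plan_waves` is a hypothetical helper, and treating competitive-intel as dependency-free is an assumption, since the rules do not say when it starts.

```python
def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group agents into waves; every agent in a wave can run in parallel."""
    remaining = {agent: set(deps) for agent, deps in dependencies.items()}
    finished: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # An agent is ready once everything it depends on has finished.
        ready = sorted(a for a, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError(
                "dependency cycle or unknown agent in: " + ", ".join(sorted(remaining))
            )
        waves.append(ready)
        finished.update(ready)
        for agent in ready:
            del remaining[agent]
    return waves


GROWTH_AGENTS = {
    "analytics-oracle": set(),
    "seo-dominator": set(),
    "content-architect": set(),
    "cro-assassin": {"analytics-oracle"},  # needs the analytics baseline
    "competitive-intel": set(),            # assumption: no dependency stated
}
```

`plan_waves(GROWTH_AGENTS)` puts the four independent agents in the first wave and cro-assassin in the second, which matches the rule that the CRO agent waits for the analytics baseline.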

The critical constraint I learned the hard way: subagents must never push to git independently. Early in my setup, I had parallel build agents each committing and pushing their changes. Three pushes in rapid succession triggered three simultaneous deploys, all racing on Docker Compose, resulting in a 502 outage. The fix: all subagents write their changes to files. The parent agent reviews, commits once, and pushes once.

Cost Management: Model Tiering in Practice

Running 9 production agents without cost discipline would be financially irresponsible. Anthropic's current pricing — Opus at $5/$25 per million input/output tokens, Sonnet at $3/$15, Haiku at $1/$5 — means model selection directly determines whether your agent pipeline costs $50/month or $500/month for the same work.

The tiering system I use after extensive experimentation:

Task Class	Model	Monthly Cost (est.)	Why This Tier

The cross-provider audit is the most counterintuitive line item. I run OpenAI's Codex on trust-boundary code after Claude reviews it. In May 2026, a Codex audit found that my cache header ordering was exposing checkout pages at Cloudflare's edge — a bug that had been live for weeks and that Claude had not flagged across multiple reviews. Different models have different blind spots. For code that handles money or authentication, spending an extra $5/month on a second opinion is trivially worth it.

Prompt caching reduces input costs by up to 90% for repeated context. If your agent loads the same CLAUDE.md, the same tool definitions, and the same project context on every run, the cache hit rate is extremely high after the first invocation. My analytics oracle agent, which runs daily with the same system prompt, costs roughly $0.40 per run after caching, compared to $2.80 without it. That is about 86% less overall, since output tokens and uncached input are still billed at full price.
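
The caching arithmetic can be sketched as a small cost function. It assumes cached input is billed at 10% of the normal input price (the 90% discount mentioned above) and ignores any surcharge for writing to the cache. The token counts in the usage note are hypothetical and are not meant to reproduce the per-run figures quoted above.

```python
def run_cost(fresh_input_tokens: int, cached_input_tokens: int, output_tokens: int,
             input_price: float, output_price: float,
             cache_discount: float = 0.90) -> float:
    """Cost of one run in dollars; prices are per million tokens."""
    per_token_in = input_price / 1_000_000
    per_token_out = output_price / 1_000_000
    fresh = fresh_input_tokens * per_token_in
    cached = cached_input_tokens * per_token_in * (1 - cache_discount)
    output = output_tokens * per_token_out  # output is never discounted
    return round(fresh + cached + output, 4)
```

At the Sonnet prices listed above ($3 input, $15 output), a hypothetical run with 800,000 input tokens and 20,000 output tokens costs `run_cost(800_000, 0, 20_000, 3, 15)`, about $2.70. If 750,000 of those input tokens hit the cache, `run_cost(50_000, 750_000, 20_000, 3, 15)` comes to about $0.675.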

Claude Code vs Cursor vs Codex: When to Use What

This is not a which-is-best comparison. Each tool has a genuine sweet spot, and using the wrong one for a task wastes time and money.

Claude Code wins at: codebase-wide analysis, multi-file refactors, agent orchestration, CI/CD integration, and any task where terminal-native execution matters. One benchmark showed Claude Code completing a task in 33,000 tokens that consumed 188,000 tokens in Cursor's agent mode — a 5.7x efficiency advantage for complex, cross-file operations. Claude Code also has the deepest extension system (CLAUDE.md + MCP + hooks + skills + subagents) of any AI coding tool.

Cursor wins at: in-editor work. If you are editing a single file, navigating code visually, or doing rapid inline iterations, Cursor's VS Code integration is faster than switching to a terminal. Cursor 3.3's Bugbot — which monitors CI and proposes fixes automatically — is a genuine time-saver for teams with extensive test suites.

Codex wins at: long-running autonomous tasks. OpenAI's cloud-based architecture lets Codex work on a problem for hours without maintaining a local session. For tasks like "migrate this 500-file codebase from JavaScript to TypeScript" or "write comprehensive tests for every untested module," Codex's patience and autonomy are unmatched.

My production workflow uses all three. Claude Code is the primary agent runtime — it runs the daily pipelines, handles deploys, and manages the codebase. Cursor is open alongside it for visual editing sessions. Codex runs periodic deep audits that benefit from its multi-hour attention span.

The Nine Agents: What They Do and What They Cost

Here is the actual agent inventory running in production, with real monthly costs after 14 weeks of operation:

Agent	Schedule	Model	Monthly Cost	What It Does

Total: approximately $110/month. The deploy watchdog costs nothing — it is a pure bash script with no AI component, checking HTTP status codes and triggering Docker rollbacks. The most expensive agent is the SEO research pipeline at $25/month, driven by its 6x daily frequency and the Sonnet calls required to analyze SERP data meaningfully.

Failure Modes: What Broke and How I Fixed It

No honest guide about production agents can skip the failures. Here are the five most expensive lessons from 14 weeks:

Failure 1: The split-brain deploy. Two automation systems — GitHub Actions and a legacy VPS cron — both believed they owned the deploy process. They raced on docker compose down/up, producing intermittent 502 errors that looked random but were actually deterministic conflicts. Fix: killed the legacy cron, added a concurrency group to GitHub Actions, added a filesystem lock (/tmp/wowhow-deploy.lock) as a secondary gate. Three layers of protection because one layer was not enough.
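
The filesystem lock from Failure 1 can be sketched in a few lines. This is an illustration, not the actual deploy script: it relies on `O_CREAT | O_EXCL`, which makes file creation fail atomically if the lock file already exists, and the `deploy_lock` and `DeployInProgress` names are hypothetical.

```python
import os
from contextlib import contextmanager


class DeployInProgress(RuntimeError):
    """Raised when another process already holds the deploy lock."""


@contextmanager
def deploy_lock(path: str):
    """Hold an exclusive lock file for the duration of a deploy."""
    try:
        # O_EXCL makes creation fail if the file exists, so only one caller wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise DeployInProgress(f"lock already held: {path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(path)
```

A deploy step would wrap its work in `with deploy_lock("/tmp/wowhow-deploy.lock"):`. One caveat of this simple form: if the process is killed hard, the lock file is left behind and has to be removed manually or by a staleness check.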

Failure 2: The noindex massacre. An agent applied robots: { index: false } to 2,600 pages — every product, topic hub, GST reference, and collection page — based on a reasonable-sounding interpretation of "hide thin content from Google." Impressions crashed from 7,500/day to near zero within a week. The fix took 8 days to fully reverse. Lesson: agents must never make bulk SEO changes without explicit human approval, regardless of how logical the reasoning sounds. This is now CLAUDE.md Rule 20.

Failure 3: The social media suspension. A content syndication agent posted 190 Mastodon toots in 15 minutes. The API allowed it — rate limits were not exceeded. But the instance moderators flagged it as spam and suspended the account permanently. API rate limits and platform moderation policies are different things. The agent now has a hard cap: 1 post per 30-60 minutes, maximum 20-30 per day, on any social platform.

Failure 4: The OAuth cascade. Deleting an old Google Cloud OAuth client — which seemed like a cleanup task — invalidated the refresh tokens used by three different scripts on the VPS. The daily analytics report, the GSC sitemap submission, and the GA4 data pipeline all failed silently. None of them had alerting configured for authentication failures. Fix: every API-dependent script now checks its authentication status before executing and sends a Telegram alert on failure.

Failure 5: The parallel push disaster. Three subagents ran in parallel, each making changes to different files. Each one committed and pushed independently. Three pushes triggered three GitHub Actions deploys. All three SSHed into the VPS simultaneously and raced on Docker Compose. Result: containers in an inconsistent state, Redis health checks failing, 502 for 12 minutes. Fix: subagents write files but never commit. The parent agent handles all git operations as a single atomic batch.
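
The posting cap from Failure 3 is simple enough to enforce outside the model. Below is a minimal in-memory sketch; `PostingCap` is a hypothetical name, and a real agent would need to persist the history between runs.

```python
from datetime import datetime, timedelta


class PostingCap:
    """Allow a post only if both the minimum gap and the daily maximum hold."""

    def __init__(self, min_gap: timedelta, max_per_day: int):
        self.min_gap = min_gap
        self.max_per_day = max_per_day
        self.history: list[datetime] = []

    def try_post(self, now: datetime) -> bool:
        # Keep only posts from the trailing 24 hours.
        day_ago = now - timedelta(days=1)
        self.history = [t for t in self.history if t > day_ago]
        if len(self.history) >= self.max_per_day:
            return False
        if self.history and now - self.history[-1] < self.min_gap:
            return False
        self.history.append(now)
        return True
```

Using the conservative end of the ranges above, `PostingCap(timedelta(minutes=30), 20)` rejects a second post 10 minutes after the first and accepts one 30 minutes later. The syndication agent would call `try_post` before every publish and skip or queue the item on `False`.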

Getting Started: Your First Production Agent in 30 Minutes

The fastest path from zero to a running production agent:

Step 1: Install Claude Code. If you have not already: npm install -g @anthropic-ai/claude-code. Verify with claude --version. You need a Pro ($20/month) or Max ($100-200/month) subscription, or an API key.

Step 2: Create your CLAUDE.md. Start minimal. Write three things: what the project is, what the agent should never do, and how to verify its work. You will add rules as you discover failure modes — this is expected and healthy.

# My Project

## What This Is
Node.js API server with PostgreSQL. Deployed via Docker on a VPS.

## Hard Rules
- Never run DROP TABLE or TRUNCATE without explicit user confirmation
- Never push to main without running tests first
- Never modify .env files

## Verification
After changes, run: npm test && npm run build
Both must pass before committing.

Step 3: Add one MCP server. Start with something useful and low-risk. The GitHub MCP server is a good first choice:

claude mcp add --scope project --transport http github https://api.githubcopilot.com/mcp/

Step 4: Create your first hook. A SessionStart hook that shows git status gives the agent immediate context about what state the project is in:

// .claude/settings.json
{
  "hooks": {
    "SessionStart": [
      { "command": "git status && git log --oneline -5" }
    ]
  }
}

Step 5: Create your first subagent. A code review agent that runs read-only and checks your work:

// .claude/agents/reviewer.md
# Code Reviewer

Review the most recent changes for:
- Security vulnerabilities (OWASP Top 10)
- Performance issues
- Missing error handling at system boundaries
- Adherence to project conventions in CLAUDE.md

Report findings as: file:line — issue — severity (HIGH/MEDIUM/LOW)
Do NOT modify any files. Read-only analysis only.

Step 6: Run it. Open Claude Code, type /agents to see your available agents, and invoke the reviewer after making some changes. Watch what it catches. Refine the agent's instructions based on what it misses or flags incorrectly.

That is a functional agent setup in under 30 minutes. From here, the path is incremental: add rules to CLAUDE.md when things break, add MCP servers when you need external tool access, add hooks when you need hard guardrails, and add subagents when tasks become complex enough to benefit from specialization.

What I Would Do Differently

If I were starting over with everything I know now, three changes would save weeks of debugging:

First, I would write the verification protocol into CLAUDE.md on day one — not after the third silent production break. The pattern is simple: after any change touching more than two files, spawn a read-only verification agent before claiming the work is done. This catches roughly 40% of the bugs that would otherwise reach production.

Second, I would set up Telegram alerting for every API-dependent automation from the start. Silent failures are the most expensive kind. An agent that fails loudly costs you 5 minutes. An agent that fails silently costs you days of stale data and missed opportunities before you notice.

Third, I would resist the temptation to add MCP servers aggressively. My initial setup had 8 servers connected. Tool selection accuracy dropped. Response times increased. Context windows filled with tool definitions instead of project context. I cut back to 3-5 per agent and quality improved immediately.

The production agent landscape in 2026 is still early. Claude Code, Cursor, Codex, and the dozens of agent frameworks competing for adoption are all improving rapidly. But the fundamentals — clear constraints, hard guardrails, cost discipline, and verification before deployment — will outlast any specific tool. Build those habits into your agent architecture from the start, and the specific tools become interchangeable.

Every tool and template mentioned in this guide is available at wowhow.cloud. The Claude Code Routines Recipe Pack includes production-tested CLAUDE.md templates, hook configurations, and agent definitions you can adapt to your own projects. The Token Counter and AI API Cost Calculator help estimate costs before committing to an agent architecture.

Originally published at wowhow.cloud

OpenAI’s Frontier Governance Framework: Risk Tiers, Trusted Access, and What Developers Need to Know

Anup Karanjkar — Sat, 30 May 2026 06:23:46 +0000

On May 29, 2026, OpenAI published its Frontier Governance Framework — and most developers moved on to the next item in their feed. That’s a mistake worth correcting. The document doesn’t announce a new model or lower an API price. It describes how OpenAI measures whether its own systems could enable mass-casualty events, what access controls gate who can reach those capabilities, and how this maps to the regulations — the EU AI Act and California’s Transparency in Frontier AI Act — that are actively shaping compliance requirements for any enterprise deploying frontier AI this year.

If you build security tools on OpenAI APIs, the framework’s Trusted Access for Cyber program directly affects what your application can and cannot do. If you operate in a regulated environment, the framework is the vendor-side accountability document your compliance team needs to reference. And if you build on frontier models at all, the risk tier system in this framework governs the capability restrictions you will encounter — and, increasingly, what auditors and procurement teams will ask about when vetting your AI vendor stack.

What the Framework Actually Is

The Frontier Governance Framework is OpenAI’s published methodology for evaluating the risk profile of frontier models before and after deployment. It covers six functional areas: risk assessment and mitigation, model reporting, security risk management, incident response, external expert input, and framework updates. Each area has defined processes, thresholds, and accountability mechanisms.

The core architecture is a tier system applied across four risk domains. Each domain is evaluated independently, with tiers reflecting capability levels that could enable specific categories of harm. A model’s rating in any domain determines what deployment controls apply — what gets blocked at the API layer, who gets elevated access, and what triggers an incident response workflow.

The framework was published explicitly to align with two regulatory instruments. California’s Transparency in Frontier AI Act requires frontier AI developers to publish risk assessment methodologies for high-capability models. The EU AI Act’s Code of Practice for General Purpose AI requires systematic capability evaluation and incident reporting. The Frontier Governance Framework is OpenAI’s answer to both — delivered before regulatory deadlines rather than in response to enforcement actions.[1]

This matters for enterprise procurement and compliance conversations because the framework is now a durable reference document. When a Fortune 500 company’s procurement team asks whether OpenAI has systematic safety processes in place for high-risk AI capabilities, this framework is the specific answer they can evaluate against.

The Four Risk Domains

The framework divides potential frontier model harms into four domains. Each is tiered from 1 to 4, with higher tiers representing greater capability and triggering more restrictive deployment controls.

Cyber Offense

This domain covers model capabilities that could enable unauthorized computer intrusions, vulnerability discovery, or exploitation of hardened systems. It is the domain most likely to affect security-adjacent development workflows directly. The published Tier 3 definition is precise: a tool-augmented model capable of identifying and developing functional zero-day exploits across all severity levels in hardened real-world systems without human intervention.

That last phrase carries significant weight. The framework distinguishes a model that accelerates a skilled human’s security research from one that operates as an autonomous exploitation agent. The same API that routes security research requests is evaluated differently based on whether the use case requires human expertise to interpret and apply the model’s output, or whether the model can complete an exploitation chain autonomously.

For developers building security tools, this means the capability ceiling is not defined by what the model knows — it is defined by the human-in-the-loop structure of your application. An AI-assisted penetration testing tool where a credentialed professional reviews and approves each action sits differently in the framework than an autonomous vulnerability scanner that operates without human review. If you are designing such systems, the OWASP Top 10 for agentic applications maps directly onto the control patterns the framework rewards.

CBRN (Chemical, Biological, Radiological, Nuclear)

This domain covers capabilities that could assist in developing or deploying weapons capable of mass casualties. The Tier 3 definition covers a model that could enable a non-expert to develop a novel threat vector comparable to a CDC Category A biological agent, or that could autonomously complete the synthesis cycle of a regulated biological threat.[2]

Practically speaking, this domain operates as a hard limit for commercial deployment. No viable product use case exists for capabilities approaching Tier 3 in the CBRN domain, and the framework’s deployment controls reflect that. The significance for developers is not the limit itself — it is that the framework makes the threshold definition public. This is a notable level of transparency about where absolute capability restrictions are set, and it gives chemistry-adjacent and research tool developers a precise boundary to design around.

Harmful Manipulation

This domain addresses capabilities enabling large-scale psychological manipulation, coordinated influence operations, or systematic erosion of epistemic autonomy at scale. Unlike the cyber and CBRN domains, the evaluation methodology here is less precise — the research community has not converged on quantitative benchmarks for social-scale manipulation capability the way it has for exploitation ability or biochemical synthesis.

The framework acknowledges this measurement gap. Current models do not approach meaningful tiers here as measured by available evaluation methods, but the domain is included because influence capabilities scale differently than technical capabilities as models improve. Developers building content generation tools, personalization systems, or opinion research applications should monitor how this domain’s evaluation methodology matures — it is where the next significant capability restriction is most likely to emerge, and with limited warning time.

Loss of Control

This domain covers scenarios where AI systems undermine human oversight mechanisms, accumulate resources beyond operational needs, or exhibit deceptive behaviors that defeat monitoring. OpenAI states directly that current models do not approach meaningful tiers here, but the framework establishes measurement infrastructure for a risk considered likely to become relevant as model capability increases.

For developers, this domain is primarily relevant as a design pattern signal. The framework’s loss-of-control definitions map closely to agentic AI deployment patterns that are rapidly becoming standard: autonomous agents with persistent memory, multi-step planning, and tool access to production systems. The design patterns that mitigate loss-of-control risk — hard resource limits, complete operation logging, explicit approval gates for irreversible actions — are the same patterns that the human-in-the-loop UX literature identifies as necessary for genuine oversight rather than compliance theater.
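
The three mitigation patterns named above (hard resource limits, complete operation logging, approval gates for irreversible actions) can be combined in one small wrapper. This is an illustrative sketch, not anything defined by the framework. `GatedExecutor` and its fields are hypothetical, and the `approve` callback stands in for a real human review step.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GatedExecutor:
    approve: Callable[[str], bool]   # human decision for irreversible actions
    max_operations: int = 50         # hard resource limit
    executed: int = 0
    log: list[str] = field(default_factory=list)

    def run(self, name: str, action: Callable[[], object],
            irreversible: bool = False) -> object:
        if self.executed >= self.max_operations:
            self.log.append(f"denied {name}: operation budget exhausted")
            raise PermissionError(f"{name}: operation budget exhausted")
        if irreversible and not self.approve(name):
            self.log.append(f"denied {name}: approval refused")
            raise PermissionError(f"{name}: approval refused")
        # Log before executing so a crash inside the action is still recorded.
        self.log.append(f"run {name}")
        self.executed += 1
        return action()
```

Every tool call an agent makes would go through `run`, with `irreversible=True` for actions such as deletes or deploys. Denials are logged as well as executions, so the log shows what the agent attempted and not only what succeeded.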

Trusted Access for Cyber: The New Access Control Layer

The most operationally significant new element in the framework is the Trusted Access for Cyber program — an identity and trust-based system designed to make enhanced cyber capabilities available to credentialed security professionals without broad public availability.

The underlying problem it solves is real. Cyber offensive capability is dual-use in a way that CBRN capability is not. A security researcher discovering zero-day vulnerabilities in client infrastructure needs model capabilities that overlap significantly with what a malicious actor needs. A capability restriction broad enough to block malicious use also blocks substantial legitimate professional value. The traditional approach — blanket API restrictions — produces a large volume of false-positive capability denials while providing limited security benefit, because determined bad actors route around API restrictions via fine-tuning or self-hosted model deployments.

Trusted Access resolves this by credentialing the professionals rather than restricting the capability. An enrolled security professional with verified identity gets access to capabilities that are not available in the standard API tier. The tradeoff is logging and accountability: what you use the access for is tracked, and enrollment can be revoked based on observed behavior patterns.

If you build security-adjacent tools on OpenAI APIs — vulnerability scanners, penetration testing assistants, security research automation, CTF assistance tools — this program is worth evaluating carefully as enrollment details become public. It creates an official pathway for legitimate professional use cases that are currently constrained by API-layer mitigations, and it creates an accountability structure that many enterprise security tool customers will actively prefer over unverified access patterns.

What This Means for Enterprise Teams

The framework creates a practical compliance artifact for organizations deploying OpenAI models in regulated environments. The EU AI Act’s requirements for human oversight documentation of high-risk AI systems, taking full effect August 2026, require enterprises to demonstrate that their AI vendors have systematic safety processes in place for high-capability models. The Frontier Governance Framework is that document on the vendor side. Your enterprise AI governance documentation needs to complement it with user-side controls: who in your organization can deploy models against which use cases, what logging captures model interactions, and what review processes govern high-risk applications.

Several specific areas in the framework deserve immediate attention from compliance teams. The cyber offense domain’s tiering is directly relevant if your organization uses AI-assisted security tools. The harmful manipulation domain’s current ambiguity is relevant if you use AI for customer communication, content generation, or personalization at scale — as measurement methodology matures, restrictions in this domain could change with limited warning. The loss-of-control domain’s definitions map directly to agentic AI deployment governance: if you operate autonomous agents against production systems, the framework provides the vocabulary for describing the oversight controls you should already have in place.

For teams using the AI API cost calculator to evaluate model selection for high-volume workloads, it is worth adding a governance column alongside cost per token — the framework’s tier system is becoming part of the enterprise evaluation criteria for frontier model vendors, alongside latency and pricing.

How Other Labs Compare

OpenAI’s Frontier Governance Framework is the most detailed public disclosure of a tier-based capability evaluation system from a major lab, but it is not the first. Anthropic’s Responsible Scaling Policy, introduced in 2023 and updated in 2025, established the ASL (AI Safety Level) system — capability thresholds that trigger specific safety and deployment protocols across risk domains. The RSP and the Frontier Governance Framework use different terminology but share the same core architecture: defined capability tiers triggering deployment controls, with higher tiers requiring more restrictive access and oversight.

Google DeepMind’s Frontier Safety Framework addresses similar concerns but with less tier specificity than either OpenAI’s or Anthropic’s published methodologies. The practical consequence for enterprise AI vendor evaluation is that conversations with OpenAI and Anthropic about capability risk can be more specific and verifiable: both labs have published operationally testable threshold definitions that can be assessed against your use case. For teams doing formal AI vendor risk assessments, this distinction matters.

What Developers Should Do Now

The framework does not require developers to change anything today. It is descriptive of OpenAI’s internal processes, not prescriptive for API consumers. But several near-term actions are worth taking based on it.

Security tool developers: Monitor Trusted Access for Cyber enrollment details as they become public. If your use case qualifies, enrolling grants access to capabilities currently restricted at the standard API tier and creates an accountability structure that enterprise security customers will increasingly require. If your use case does not qualify, that is an important signal for your product roadmap — the capability ceiling your application operates under will not change without a trust credential, and designing around it now is cheaper than discovering it during a customer procurement review.

Enterprise compliance teams: Add the Frontier Governance Framework to your AI vendor documentation package. When EU AI Act compliance requirements ask for evidence of vendor-side risk assessment for high-risk AI systems, this is the specific document you cite. Map its six functional areas against your own internal controls — access management, logging, incident response, and review processes for high-risk AI applications.

Agentic application developers: Treat the loss-of-control domain’s definitions as a design checklist. Systems that limit agent resource accumulation, log all tool invocations, require human approval for irreversible actions, and maintain hard-coded operation boundaries are architecturally aligned with loss-of-control mitigation. Building these patterns into your stack now is substantially cheaper than retrofitting them when regulatory requirements make them mandatory — and the August 2026 EU AI Act deadline means that moment is not far off.

Chemistry, biology, and content generation application developers: Review the CBRN and Harmful Manipulation domain definitions. The CBRN domain has clear commercial limits — if your application is anywhere near this domain, the framework tells you exactly where the hard stops are. The Harmful Manipulation domain is more ambiguous and will tighten as evaluation methodology matures. Applications relying on persuasive content generation, personalization, or opinion research functionality should document their current capability baseline so they can identify when API-layer restrictions change without announcement.
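For agentic application developers, the checklist above (log every tool invocation, require human approval for irreversible actions, keep hard-coded operation boundaries) can be sketched as a thin guard around tool dispatch. This is a minimal illustration under stated assumptions, not something the framework or any OpenAI API defines: the tool names, the `ToolCall` shape, and the `approve` callback are all hypothetical.

```typescript
// Hypothetical guard around agent tool calls: allowlist, approval gate, audit log.
type ToolCall = { tool: string; args: Record<string, unknown> }
type Decision = { allowed: boolean; reason: string }

// Hard-coded operation boundary: anything not listed here is refused.
const ALLOWED_TOOLS = new Set(['read_file', 'run_tests', 'deploy'])
// Tools whose effects cannot be undone and therefore need a human in the loop.
const IRREVERSIBLE_TOOLS = new Set(['deploy'])

const auditLog: Array<ToolCall & Decision & { at: string }> = []

function guardToolCall(
  call: ToolCall,
  approve: (call: ToolCall) => boolean, // stand-in for a real human approval step
): Decision {
  let decision: Decision
  if (!ALLOWED_TOOLS.has(call.tool)) {
    decision = { allowed: false, reason: 'tool not in allowlist' }
  } else if (IRREVERSIBLE_TOOLS.has(call.tool) && !approve(call)) {
    decision = { allowed: false, reason: 'human approval denied' }
  } else {
    decision = { allowed: true, reason: 'ok' }
  }
  // Log every invocation attempt, including the refused ones.
  auditLog.push({ ...call, ...decision, at: new Date().toISOString() })
  return decision
}
```

The point of the design is that the log entry is written on every path, so refused calls leave the same trail as permitted ones.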

The Document That Governs the Capability Ceiling

The Frontier Governance Framework is not technically exciting. It will not trend on Product Hunt. What it does is establish the vocabulary and measurement methodology that governs AI capability access across the next several years of frontier model deployment.

Developers who read it once, map its risk tier definitions to their use cases, and design their access control and logging architecture accordingly will find themselves ahead of most of the field when AI governance requirements arrive as concrete procurement questions rather than distant regulatory abstractions. The window between the framework’s publication and the EU AI Act’s August 2026 enforcement date is short, and using it is better than waiting for the first compliance audit to reveal where the capability ceiling actually is.

Originally published at wowhow.cloud

Claude Managed Agents: Self-Hosted Sandboxes and MCP Tunnels Setup Guide

Anup Karanjkar — Sat, 30 May 2026 00:15:10 +0000

On May 26, 2026, Anthropic held its first developer conference outside the United States — Code with Claude London — and the most significant announcements were not about new models. They were about infrastructure: self-hosted sandboxes for Claude Managed Agents, now in public beta, and MCP tunnels, now in research preview. Both features address the same root problem that has kept regulated industries from deploying Claude agents in production: tool execution and private data access happening outside the enterprise security perimeter.

The architecture Anthropic landed on is elegant in how it draws the boundary. The agent loop — orchestration, context management, error recovery, retry logic — stays on Anthropic infrastructure. Tool execution and private MCP server access move inside the customer perimeter. You get the benefit of Anthropic running a highly available, managed agent runtime without giving up data residency, audit logging, or network policy enforcement. This guide covers what each feature does, how to set it up, and the production patterns that matter for enterprise deployments.

Why the Previous Architecture Created Enterprise Blockers

Claude Managed Agents before this announcement had a fundamental tension: the agent needed to call tools — execute bash commands, read files, call internal APIs, write to databases — but all of that execution happened on Anthropic infrastructure. For a startup building a coding assistant, this is fine. For a financial services firm, a healthcare provider, or a defense contractor, it creates a list of blockers that no amount of contractual language fully resolves.

Data residency: Files, code, and database contents moving off-perimeter for processing violated data residency requirements in the EU, financial regulations in the US, and data localization laws in markets like India and Brazil.
Audit logging: Tool execution logs resided on Anthropic infrastructure rather than the SIEM and audit systems the security team already manages.
Network policy: Giving an agent access to internal APIs meant either exposing those APIs to the public internet or managing a complex allowlist of Anthropic egress IPs — both operationally expensive and security-unfriendly.
Compute sizing: Long-running builds, image generation, or data processing jobs needed to fit within Anthropic's infrastructure constraints rather than being matched to the customer's own compute resources.

Both new features address these blockers directly, at the architecture level rather than through contractual workarounds.

Self-Hosted Sandboxes: Tool Execution Inside Your Perimeter

A self-hosted sandbox moves the execution environment for Claude Managed Agents from Anthropic infrastructure to an environment you control. Anthropic supports four managed providers out of the box — Cloudflare, Daytona, Modal, and Vercel — plus a custom sandbox client API for teams that need to run on their own infrastructure, a private cloud, or an air-gapped environment.

The split is precise: the agent loop itself — the code that decides what tool to call next, manages the conversation context, handles errors and retries, and tracks the agent's state across steps — continues to run on Anthropic's infrastructure. What moves to your sandbox is tool execution: the actual bash commands, file reads, API calls, and code interpretation that the agent invokes when it acts on the world.

What This Means in Practice

When the agent decides to run git clone https://internal.company.com/repo.git, that command executes inside your sandbox. The file system, the network access, the environment variables, the runtime image — all configured by you. Your network policies apply. Your audit logging captures the execution. The files never leave your perimeter. When the command completes, the result travels back to the Anthropic-hosted agent loop as a tool response — text output, structured JSON, or an error — and the agent continues from there.

For compute-heavy workloads, this also means you can size the execution environment for the task. A coding agent running a full test suite on a large repository can have 16 cores and 64GB of RAM if the task needs it. A lighter research agent can run in a small container. The compute sizing is your decision, not constrained by Anthropic's default allocation.

Setting Up a Self-Hosted Sandbox

The setup flow from the Claude Console (available to organization admins) involves three steps: selecting a sandbox provider, configuring the connection, and enabling it for specific agents or agent workflows.

For a Modal sandbox, the configuration looks approximately like this:

# Deploy a Modal sandbox for Claude Managed Agents
import modal

app = modal.App("claude-agent-sandbox")

# Define the runtime image with your tools pre-installed
sandbox_image = (
    modal.Image.debian_slim()
    .pip_install(["anthropic", "httpx", "boto3"])
    .run_commands(
        "apt-get update && apt-get install -y git curl jq",
        "curl -fsSL https://deb.nodesource.com/setup_22.x | bash -",
        "apt-get install -y nodejs",
    )
)

@app.function(
    image=sandbox_image,
    cpu=4,
    memory=16384,
    timeout=3600,
    secrets=[modal.Secret.from_name("internal-api-keys")],
)
def execute_tool(command: str, working_dir: str) -> dict:
    import subprocess
    result = subprocess.run(
        command,
        shell=True,
        cwd=working_dir,
        capture_output=True,
        text=True,
        timeout=300,
    )
    return {
        "stdout": result.stdout,
        "stderr": result.stderr,
        "returncode": result.returncode,
    }

The Claude Console sandbox configuration then points at your Modal deployment endpoint. Anthropic handles the API authentication between the agent loop and your sandbox. Your sandbox authenticates with your internal systems using the secrets you configure — those secrets never pass through Anthropic infrastructure.

For teams using Vercel, the setup leverages Vercel's edge runtime for lighter execution tasks, particularly useful for API calls and data transformations that don't need a full OS environment. Cloudflare Workers sandboxes are similarly scoped — fast startup, V8 isolate environment, useful for specific tool categories. Daytona provides a full development environment model, closest to the original Managed Agents execution environment but running on infrastructure you control or provision through Daytona's managed offering.

Custom Sandbox Client

For air-gapped environments or private cloud deployments, Anthropic publishes a custom sandbox client specification. You implement a small HTTP server that exposes a defined API surface — tool execution, file system access, process management — and Claude Managed Agents calls your server for tool execution instead of a managed provider. The server can run on-premises, in a private VPC, or in any environment with outbound HTTPS access to the Anthropic agent loop API.

// Minimal custom sandbox server — Express implementation
import express from 'express'
import { exec } from 'child_process'
import { promisify } from 'util'
import path from 'path'
import fs from 'fs/promises'

const execAsync = promisify(exec)
const app = express()
app.use(express.json())

// Anthropic calls this endpoint for each tool execution
app.post('/execute', async (req, res) => {
  const { tool, input, workingDir } = req.body

  try {
    if (tool === 'bash') {
      const { stdout, stderr } = await execAsync(input.command, {
        cwd: workingDir ?? process.env.SANDBOX_ROOT,
        timeout: 120_000,
        env: { ...process.env, ...input.env },
      })
      return res.json({ output: stdout, error: stderr, exitCode: 0 })
    }

    if (tool === 'read_file') {
      const filePath = path.resolve(workingDir ?? '', input.path)
      const content = await fs.readFile(filePath, 'utf-8')
      return res.json({ output: content })
    }

    if (tool === 'write_file') {
      const filePath = path.resolve(workingDir ?? '', input.path)
      await fs.writeFile(filePath, input.content, 'utf-8')
      return res.json({ output: 'File written successfully' })
    }

    return res.status(400).json({ error: `Unknown tool: ${tool}` })
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    return res.status(500).json({ error: message, exitCode: 1 })
  }
})

app.listen(8080, () => {
  console.error('[sandbox] Ready on :8080')
})

In a real deployment, the sandbox server must also validate the Authorization header on each request against a shared secret configured in the Claude Console; the minimal example above omits that check for brevity. Anthropic's agent loop attaches this header to every tool execution call. Once your server rejects any request without a valid Authorization header, unauthorized execution is not possible even if the endpoint is reachable from the internet.
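The Express example leaves out two checks a real deployment would want: the Authorization check described above, and a guard that keeps read_file and write_file inside the sandbox root. The sketch below shows one way to write both as plain functions. The `Bearer` scheme and the function names are assumptions for illustration; the actual header format is whatever the custom sandbox client specification defines.

```typescript
import { timingSafeEqual } from 'node:crypto'
import * as path from 'node:path'

// Returns true only when the header is exactly "Bearer <secret>".
// The "Bearer" scheme is an assumption; match whatever the spec prescribes.
function isAuthorized(header: string | undefined, secret: string): boolean {
  if (!header) return false
  const expected = Buffer.from(`Bearer ${secret}`)
  const received = Buffer.from(header)
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false
  return timingSafeEqual(received, expected)
}

// Resolves a requested path and refuses anything that escapes the sandbox root.
function resolveInsideRoot(root: string, requested: string): string {
  const base = path.resolve(root)
  const target = path.resolve(base, requested)
  const rel = path.relative(base, target)
  if (rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`Path escapes sandbox root: ${requested}`)
  }
  return target
}
```

In the handler, `isAuthorized(req.headers.authorization, secret)` would run before any tool dispatch, and `resolveInsideRoot` would stand in for the bare `path.resolve` calls, so a request for `../../etc/passwd` fails instead of reading outside the working directory.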

MCP Tunnels: Private Network Access Without Inbound Firewall Rules

MCP tunnels solve the second infrastructure problem: how do you give Claude agents access to MCP servers running inside your private network without exposing those servers to the public internet?

The mechanism is a lightweight gateway — a small process you deploy inside your network — that makes a single outbound connection to Anthropic's tunnel infrastructure. No inbound firewall rules. No public IP for the MCP server. No VPN reconfiguration. The tunnel gateway establishes and maintains the outbound connection; the Anthropic agent loop sends MCP requests through the tunnel to your private server.

The security properties of this model are worth being explicit about:

No inbound exposure: Your MCP server has no public-facing endpoint. The only connection it handles comes from the tunnel gateway running on the same network.
End-to-end encryption: Traffic between the Anthropic agent loop and your private MCP server is encrypted end to end. The tunnel gateway does not decrypt and re-encrypt — it forwards encrypted traffic.
Admin-controlled: MCP tunnels are configured and managed from the Claude Console by organization admins. Individual users cannot create tunnels to arbitrary private servers.
Auditable: Every MCP call through a tunnel is logged with the tool name, arguments hash, timestamp, and agent identity on both the Anthropic side and your tunnel gateway logs.
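To make the audit property concrete, here is a rough sketch of what a record with an arguments hash could look like on your side of the tunnel. The article does not document the gateway's actual log schema, so the `AuditRecord` field names and the choice of SHA-256 over key-sorted JSON are assumptions. Hashing the arguments rather than storing them lets you correlate calls across logs without copying sensitive payloads into the SIEM.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical record shape; the real gateway log format may differ.
type AuditRecord = { tool: string; argsHash: string; timestamp: string; agentId: string }

// Serialize with sorted object keys so equal arguments always hash identically.
// Assumes the input is plain JSON data (no undefined, functions, or cycles).
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(',')}]`
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`)
    return `{${entries.join(',')}}`
  }
  return JSON.stringify(value)
}

function buildAuditRecord(
  tool: string,
  args: unknown,
  agentId: string,
  now: Date = new Date(),
): AuditRecord {
  const argsHash = createHash('sha256').update(canonicalJson(args)).digest('hex')
  return { tool, argsHash, timestamp: now.toISOString(), agentId }
}
```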

Deploying the Tunnel Gateway

The tunnel gateway is a single binary — available for Linux, macOS, and Windows — that authenticates with the Claude Console using a gateway token generated by an organization admin. Here is the deployment pattern for a production Linux environment:

# Download and install the tunnel gateway
curl -fsSL https://console.anthropic.com/downloads/mcp-tunnel-gateway-linux-amd64 \
  -o /usr/local/bin/mcp-tunnel-gateway
chmod +x /usr/local/bin/mcp-tunnel-gateway

# Create a systemd service for the gateway
cat > /etc/systemd/system/mcp-tunnel-gateway.service <<'EOF'
[Unit]
Description=Claude MCP Tunnel Gateway
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=mcp-tunnel
ExecStart=/usr/local/bin/mcp-tunnel-gateway \
  --token-file /etc/mcp-tunnel/gateway-token \
  --mcp-server-url http://localhost:9000/mcp \
  --region us-east-1
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable mcp-tunnel-gateway
systemctl start mcp-tunnel-gateway

The --mcp-server-url flag points at your internal MCP server — an address reachable from the machine running the tunnel gateway but not from the public internet. The gateway connects outbound to Anthropic's tunnel infrastructure using the token and starts relaying MCP requests from your agents to your private server.

MCP tunnels work with both Claude Managed Agents and the Claude Messages API. For the Messages API, you reference the tunnel-connected MCP server by its tunnel ID in your tool configuration. For Managed Agents, the tunnel appears as a configured MCP server in the agent's tool access list, indistinguishable from a public MCP server from the agent's perspective.

Use Cases That Become Possible

The combination of self-hosted sandboxes and MCP tunnels removes the blockers that previously ruled out Claude agents for specific enterprise use cases.

Internal code repositories: A coding agent can clone from GitHub Enterprise or an on-premises GitLab instance, run tests, and push changes — all inside the perimeter. The code never leaves the company network for processing.

Production database access: An analytics agent can query a production read replica through an MCP tunnel. The connection credentials stay inside the private network. The agent gets the query results it needs.

ERP and CRM integration: SAP, Salesforce on-premises, or custom internal platforms with no public API become accessible to agents through an MCP server running on the same network. No public endpoint needed.

Regulated data processing: Financial calculations, healthcare data analysis, and legal document processing that cannot leave a jurisdiction can run in a self-hosted sandbox provisioned in the correct geographic region.

Build and test pipelines: Agents running CI workloads — build, test, lint, deploy — execute in a sandbox with access to internal artifact registries, private npm/pip mirrors, and test infrastructure that was previously off-limits.

Availability and Access

Self-hosted sandboxes are in public beta and available to all Claude for Work and Claude API customers. Configuration is available in the Claude Console under Settings → Managed Agents → Sandboxes. The Cloudflare, Modal, and Vercel integrations are one-click configurations; Daytona and custom sandbox clients require manual configuration of the endpoint and authentication token.

MCP tunnels are in research preview — you need to request access through the Claude Console. Anthropic is rolling out access to organizations in regulated industries first, with general availability expected in the coming months. If you are evaluating this for an enterprise deployment, request access early: the research preview period is when Anthropic is actively incorporating feedback on the tunnel protocol and gateway configuration.

What to Verify Before Depending on This in Production

Both features are new and the operational patterns are still being established. A few things worth verifying before treating self-hosted sandboxes or MCP tunnels as production-critical infrastructure:

Tunnel gateway availability: The gateway process must be running and connected for agents to reach private MCP servers. Build your deployment with the same reliability expectations you would apply to any internal service — health checks, automatic restarts via systemd, alerting on disconnection events from the gateway logs.

Sandbox cold start latency: Serverless sandbox providers (Modal, Vercel, Cloudflare) have cold start penalties when a container has not been used recently. For latency-sensitive agent workflows, consider keeping sandboxes warm or choosing a provider with lower cold start times for your runtime size.

Audit log coverage: Verify that your sandbox and tunnel gateway logs are being captured by your SIEM before claiming compliance coverage. The gateway logs every MCP call; the sandbox server (if custom) logs execution on your side. The Anthropic Console shows the agent-side view. You need both to reconstruct a full audit trail.

Both features represent a genuine architecture shift in how Anthropic thinks about enterprise agent infrastructure: from "trust us with your data" to "your data stays with you, we run the intelligence layer." That is the right direction for enterprise adoption, and the implementation announced at Code with Claude London is more complete than most vendors' equivalent announcements. The setup has real operational weight, but so does any infrastructure that actually solves the compliance blockers rather than papering over them.

For the broader context on running Claude agents in production — tool design, error handling, observability — the agent observability guide covers the patterns that apply regardless of whether you are using self-hosted sandboxes or the default execution environment. The MCP production hardening guide covers the server-side security patterns that complement what MCP tunnels provide at the network layer.

This is authored by Anup Karanjkar, who follows Anthropic's developer platform releases and enterprise infrastructure patterns.

Originally published at wowhow.cloud

MCP Spec Ships July 28 — Every Breaking Change and How to Migrate

Anup Karanjkar — Fri, 29 May 2026 21:37:52 +0000

Disclaimer: Code examples in this article are based on the 2026-07-28 Release Candidate, published May 28, 2026. The final specification may differ. Verify against the official MCP spec before shipping to production.

On May 28, 2026, the MCP team published the largest specification revision since launch, and your MCP servers have just under nine weeks to comply. The 2026-07-28 Release Candidate eliminates protocol-level sessions, mandates two new HTTP headers, changes an error code that client code almost certainly pattern-matches against, introduces caching semantics via new response fields, locks down distributed trace propagation, and deprecates three first-class primitives. That is six material changes arriving in a single spec bump, with a hard July 28 cutover date.[1] This guide walks through every change in the order you should tackle it, with before-and-after code diffs and a final checklist.

The underlying motivation is operational. A remote MCP server that previously needed sticky sessions, a shared session store, and deep packet inspection at the gateway can now run behind a plain round-robin load balancer, route traffic on a header value, and let clients cache the tools list for as long as the server permits. The spec is moving from a stateful, handshake-based architecture toward one that behaves like a well-designed HTTP API. That is a genuinely good direction. The migration cost is real and bounded.

What Broke and Why

Before the diffs, a clear-eyed summary of what the spec actually breaks and the operational problems each removal was meant to fix.

Sessions Existed to Route — Now Headers Do That

The initialize/initialized handshake and the Mcp-Session-Id header were introduced when MCP's Streamable HTTP transport was first designed.[2] A client established a session, received a session ID, and carried that ID in every subsequent request. The server used the ID to look up per-session state — capabilities, negotiated protocol version, client metadata. It worked cleanly for a single-server deployment. It broke the moment you added a second server instance, because the client's next request might land on the instance that had never seen the session.

The common fix was sticky sessions at the load balancer. Some teams added a Redis-backed session store. Others built a gateway that inspected the request body to extract the session ID before routing. All of these solutions exist because the protocol pushed session affinity onto infrastructure. The 2026-07-28 RC removes the session concept entirely from the protocol layer. Client metadata, capabilities, and protocol version now travel in the _meta field on every request. Infrastructure can be stateless because the protocol is now stateless.

The Error Code Was Never Standard

MCP introduced -32002 as a custom error code for missing resources. The JSON-RPC 2.0 specification already has -32602 for Invalid Params. The two codes represent different semantics, but in practice, the custom MCP code created a compatibility problem: any client validating against a JSON-RPC error schema saw -32002 as an unknown code. SEP-2164 collapses this to the standard value.[1] If your client has a switch or if block that pattern-matches on -32002, that branch is now dead code after July 28.
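During the transition, a client may talk to servers on both sides of the cutover, so the branch that matched -32002 usually needs to accept both codes rather than be deleted outright. The helper below is a rough sketch of that idea; the function name and the decision to scope the -32602 match to resources/read are assumptions, since -32602 also covers ordinary parameter errors.

```typescript
type JsonRpcError = { code: number; message: string; data?: unknown }

const LEGACY_RESOURCE_NOT_FOUND = -32002 // custom MCP code, retired by the RC
const INVALID_PARAMS = -32602 // standard JSON-RPC 2.0 code

// True when an error should be treated as "the requested resource is missing".
// Heuristic only: -32602 on resources/read can also mean a malformed request.
function isMissingResource(method: string, error: JsonRpcError): boolean {
  if (error.code === LEGACY_RESOURCE_NOT_FOUND) return true
  return method === 'resources/read' && error.code === INVALID_PARAMS
}
```

Once every server you connect to has migrated, the legacy constant and its branch can be removed.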

Caching Was Happening Anyway — Now It Has a Contract

Clients caching tools/list responses was always an optimization. The problem was that each client made up its own TTL, and the server had no way to communicate how long the list was valid or whether it was safe to share across users. SEP-2549 adds ttlMs and cacheScope to list and resource read responses — the MCP equivalent of Cache-Control: max-age and Cache-Control: public.[1] This is additive, not a break for server code — but clients that were caching without guidance now have a contract to follow.
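A client-side cache that honors this contract might look roughly like the sketch below. Treat the details as assumptions: the `'public'` and `'private'` scope values, the generic `items` field (real responses carry `tools`, `resources`, and so on), and the choice to skip caching when no TTL is present are illustrative, not taken from the RC text.

```typescript
type CacheScope = 'public' | 'private' // assumed values; check the RC for the exact names
type ListResponse<T> = { items: T[]; ttlMs?: number; cacheScope?: CacheScope }

class ListCache<T> {
  private entries = new Map<string, { items: T[]; expiresAt: number }>()

  // Private responses are keyed per user; public ones share a single slot.
  private key(method: string, scope: CacheScope, userId: string): string {
    return scope === 'public' ? method : `${method}:${userId}`
  }

  store(method: string, userId: string, res: ListResponse<T>, now: number = Date.now()): void {
    if (!res.ttlMs || res.ttlMs <= 0) return // no TTL from the server: do not cache
    const scope = res.cacheScope ?? 'private' // conservative default when scope is absent
    this.entries.set(this.key(method, scope, userId), {
      items: res.items,
      expiresAt: now + res.ttlMs,
    })
  }

  lookup(method: string, userId: string, now: number = Date.now()): T[] | undefined {
    // Prefer the user's own entry, then fall back to the shared one.
    for (const k of [`${method}:${userId}`, method]) {
      const hit = this.entries.get(k)
      if (hit && hit.expiresAt > now) return hit.items
    }
    return undefined
  }
}
```

Defaulting to the private scope means a server that omits the field never has one user's list served to another.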

Trace Keys Were Colliding

W3C Trace Context propagation through the _meta field was already happening in production — OpenTelemetry-instrumented servers were passing traceparent, tracestate, and baggage through. The problem was that the key names were undocumented, so different SDKs and gateways invented their own conventions. SEP-414 locks down the key names, making distributed traces across multi-SDK deployments actually correlate.[1]
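As an illustration of what propagation involves, the sketch below validates an incoming traceparent in the W3C version 00 layout and builds the trace fields for an outgoing request in the same trace. It assumes the _meta keys are literally `traceparent`, `tracestate`, and `baggage`, as named above; the helper names are hypothetical, and a real server would normally let its OpenTelemetry SDK do this.

```typescript
type TraceMeta = { traceparent: string; tracestate?: string; baggage?: string }

// W3C Trace Context, version 00 layout: version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(value: string) {
  const m = TRACEPARENT_RE.exec(value)
  if (!m) return undefined
  const [, version, traceId, parentId, flags] = m
  // Version ff and all-zero IDs are invalid per the W3C specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined
  return { version, traceId, parentId, flags }
}

// Build the trace fields for an outgoing request: same trace ID, new span ID.
function childTraceMeta(incoming: TraceMeta, newSpanId: string): TraceMeta | undefined {
  const parsed = parseTraceparent(incoming.traceparent)
  if (!parsed || !/^[0-9a-f]{16}$/.test(newSpanId)) return undefined
  return {
    ...incoming, // tracestate and baggage pass through unchanged
    traceparent: `${parsed.version}-${parsed.traceId}-${newSpanId}-${parsed.flags}`,
  }
}
```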

Change 1: Eliminate the Session Handshake

This is the largest structural change. Every piece of server code that participates in session lifecycle needs to be rethought.

What the Old Flow Looked Like

Under the previous spec, a TypeScript server handled initialization roughly like this:

// BEFORE — session-based initialization (remove this pattern entirely)
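import express from 'express'
import { randomUUID } from 'crypto'

const app = express()
app.use(express.json()) // parse JSON-RPC request bodies

// In-memory session store keyed by Mcp-Session-Id
const sessions = new Map<string, Record<string, unknown>>()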

app.post('/mcp', async (req, res) => {
  const body = req.body

  // Handshake: client sends initialize, server stores session
  if (body.method === 'initialize') {
    const sessionId = randomUUID()
    const clientInfo   = body.params.clientInfo
    const capabilities = body.params.capabilities

    sessions.set(sessionId, {
      protocolVersion: body.params.protocolVersion,
      clientInfo,
      capabilities,
      createdAt: Date.now(),
    })

    res.setHeader('Mcp-Session-Id', sessionId)
    return res.json({
      jsonrpc: '2.0',
      id: body.id,
      result: {
        protocolVersion: '2025-11-25',
        capabilities: { tools: {}, resources: {} },
        serverInfo: { name: 'my-server', version: '1.0.0' },
      },
    })
  }

  // Every subsequent request validates the session
  const sessionId = req.headers['mcp-session-id'] as string
  if (!sessionId || !sessions.has(sessionId)) {
    return res.status(401).json({
      jsonrpc: '2.0',
      id: body.id,
      error: { code: -32600, message: 'Invalid session' },
    })
  }

  const session = sessions.get(sessionId)!
  // ... handle tools/call, tools/list, etc. using session state
})

The New Stateless Pattern

Under the RC, the client sends its metadata with every request inside _meta. The server reads what it needs from the request body instead of looking it up in a session store.

// AFTER — stateless, no session store, no handshake
import express from 'express'

const app = express()
app.use(express.json()) // parse JSON-RPC request bodies

app.post('/mcp', async (req, res) => {
  const body = req.body
  const method = req.headers['mcp-method'] as string
  const toolName = req.headers['mcp-name'] as string

  // Read client context from _meta on every request
  const meta = body.params?._meta ?? {}
  const protocolVersion = meta.protocolVersion ?? body.params?.protocolVersion
  const clientCapabilities = meta.capabilities ?? {}

  // Validate headers match body
  if (method && method !== body.method) {
    return res.status(400).json({
      jsonrpc: '2.0',
      id: body.id,
      error: { code: -32600, message: 'Mcp-Method header does not match request body' },
    })
  }

  // Route without session lookup
  switch (body.method) {
    case 'tools/list':
      return handleToolsList(req, res, clientCapabilities)
    case 'tools/call':
      return handleToolsCall(req, res, body.params, toolName)
    case 'resources/read':
      return handleResourceRead(req, res, body.params)
    default:
      return res.status(404).json({
        jsonrpc: '2.0',
        id: body.id,
        error: { code: -32601, message: 'Method not found' },
      })
  }
})

Infrastructure Changes That Follow

With sessions gone from the protocol, your infrastructure can drop the workarounds that existed to compensate for stateful routing:

Load balancer sticky sessions — delete the affinity rule. Any instance can handle any request.
Redis session store — if its only purpose was MCP session state, decommission it. Other uses of Redis (tool result caching, rate limiting) are unaffected.
Gateway body inspection for session IDs — replace with header-based routing on Mcp-Method. This is cheaper and does not require buffering the entire request body before routing.

If you are running on Kubernetes, the sessionAffinity setting on your Service or the affinity annotations on your Ingress can be removed. The spec change effectively converts your MCP deployment from a workload with affinity requirements to a plain Deployment behind a ClusterIP service.

Change 2: New Required Headers (SEP-2243)

Streamable HTTP transport now requires two headers on every request: Mcp-Method and Mcp-Name. The server must reject requests where the headers disagree with the request body.[1]

Client-Side: Adding the Headers

// BEFORE — no method routing headers
async function callTool(serverUrl: string, toolName: string, args: unknown) {
  const response = await fetch(serverUrl, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Mcp-Session-Id': currentSessionId,  // REMOVE THIS
    },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: crypto.randomUUID(),
      method: 'tools/call',
      params: { name: toolName, arguments: args },
    }),
  })
  return response.json()
}

// AFTER — Mcp-Method and Mcp-Name required
async function callTool(serverUrl: string, toolName: string, args: unknown) {
  const requestId = crypto.randomUUID()
  const body = {
    jsonrpc: '2.0',
    id: requestId,
    method: 'tools/call',
    params: {
      name: toolName,
      arguments: args,
      _meta: {
        protocolVersion: '2026-07-28',
        capabilities: clientCapabilities,
        clientInfo: { name: 'my-client', version: '2.0.0' },
      },
    },
  }

  const response = await fetch(serverUrl, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Mcp-Method': 'tools/call',   // REQUIRED — matches body.method
      'Mcp-Name': toolName,          // REQUIRED — matches body.params.name
      'MCP-Protocol-Version': '2026-07-28',
    },
    body: JSON.stringify(body),
  })
  return response.json()
}

Server-Side: Validating Header/Body Consistency

The spec requires servers to reject requests where headers and body disagree. This is where a shared validation middleware prevents security issues — a client that sends a permissive Mcp-Method: tools/list header but a tools/call body would otherwise bypass gateway rate limiting that routes on headers.

// Validation middleware — add to every MCP endpoint
function validateMcpHeaders(
  req: express.Request,
  res: express.Response,
  next: express.NextFunction
) {
  const mcpMethod = req.headers['mcp-method'] as string | undefined
  const mcpName   = req.headers['mcp-name']   as string | undefined
  const body      = req.body

  // Headers are required per SEP-2243
  if (!mcpMethod) {
    return res.status(400).json({
      jsonrpc: '2.0',
      id: body?.id ?? null,
      error: { code: -32600, message: 'Missing required header: Mcp-Method' },
    })
  }

  // Header must match body
  if (mcpMethod !== body?.method) {
    return res.status(400).json({
      jsonrpc: '2.0',
      id: body?.id ?? null,
      error: {
        code: -32600,
        message: `Header Mcp-Method '${mcpMethod}' does not match body method '${body?.method}'`,
      },
    })
  }

  // For tool calls: Mcp-Name must match the tool name in the body
  if (mcpMethod === 'tools/call' && mcpName !== body?.params?.name) {
    return res.status(400).json({
      jsonrpc: '2.0',
      id: body?.id ?? null,
      error: {
        code: -32600,
        message: `Header Mcp-Name '${mcpName}' does not match body params.name '${body?.params?.name}'`,
      },
    })
  }

  next()
}

// Apply before routing
app.post('/mcp', validateMcpHeaders, mcpRouter)
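
The same rules are easier to unit-test when they live in a framework-free function that the middleware calls. This is a minimal sketch under that assumption: checkHeaderBodyConsistency and its return shape are hypothetical names, and the header keys are lowercase because Node lowercases incoming header names.

```typescript
// Framework-free core of the header/body check (hypothetical name).
// Returns a JSON-RPC error payload, or null when the request is consistent.
interface RpcErrorBody {
  code: number
  message: string
}

function checkHeaderBodyConsistency(
  headers: Record<string, string | string[] | undefined>,
  body: { method?: unknown; params?: { name?: unknown } } | undefined,
): RpcErrorBody | null {
  const method = headers['mcp-method']
  const name = headers['mcp-name']

  // A missing or non-string header is treated as absent.
  if (typeof method !== 'string' || method.length === 0) {
    return { code: -32600, message: 'Missing required header: Mcp-Method' }
  }
  if (method !== body?.method) {
    return { code: -32600, message: 'Mcp-Method header does not match body method' }
  }
  if (method === 'tools/call' && name !== body?.params?.name) {
    return { code: -32600, message: 'Mcp-Name header does not match body params.name' }
  }
  return null
}
```

An Express middleware can then reduce to calling this function with req.headers and req.body and returning a 400 when the result is not null.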

Gateway and Load Balancer Configuration

The headers exist precisely so that routing infrastructure does not need to inspect the body. A Cloudflare Worker or Nginx configuration can now route traffic on a single header value rather than parsing JSON:

# Nginx upstream routing on Mcp-Method (no body inspection needed)
map $http_mcp_method $backend_pool {
  "tools/call"     tools_pool;
  "tools/list"     metadata_pool;
  "resources/read" resources_pool;
  default          default_pool;
}

server {
  location /mcp {
    proxy_pass http://$backend_pool;
    proxy_set_header Mcp-Method  $http_mcp_method;
    proxy_set_header Mcp-Name    $http_mcp_name;
  }
}

// Cloudflare Worker — route on Mcp-Method without parsing body
export default {
  async fetch(request: Request): Promise<Response> {
    const mcpMethod = request.headers.get('Mcp-Method')

    if (mcpMethod === 'tools/call') {
      // Route to compute-heavy pool
      return fetch('https://tools-compute.internal/mcp', request)
    }

    if (mcpMethod === 'tools/list' || mcpMethod === 'resources/list') {
      // Route to metadata pool — lighter, cached
      return fetch('https://metadata.internal/mcp', request)
    }

    return fetch('https://default.internal/mcp', request)
  },
}

Change 3: Error Code Migration (SEP-2164)

The error code for a missing resource changes from -32002 to -32602. This is a small change with outsized blast radius because error codes tend to be pattern-matched against in switch statements and condition checks scattered across client codebases.[1]

Finding Affected Code

Before touching anything, find every occurrence of the old code across your codebase. The number of matches will tell you how much work is ahead:

# Search for the literal value — catches both numeric and string forms
# (-e stops grep from parsing "-32002" as a command-line option)
grep -r -l -e "-32002" ./src --include="*.ts" --include="*.js"

# Also check for named constants that might wrap it (-E enables | alternation)
grep -r -l -E "RESOURCE_NOT_FOUND|MISSING_RESOURCE|MCP_NOT_FOUND" ./src

The Migration Diff

// BEFORE — custom MCP error code for missing resource
async function handleResourceRead(
  req: express.Request,
  res: express.Response,
  params: { uri: string }
) {
  const resource = await resourceStore.get(params.uri)

  if (!resource) {
    return res.json({
      jsonrpc: '2.0',
      id: req.body.id,
      error: {
        code: -32002,  // ← CHANGE THIS
        message: `Resource not found: ${params.uri}`,
      },
    })
  }

  return res.json({ jsonrpc: '2.0', id: req.body.id, result: resource })
}

// AFTER — JSON-RPC standard Invalid Params (-32602)
async function handleResourceRead(
  req: express.Request,
  res: express.Response,
  params: { uri: string }
) {
  const resource = await resourceStore.get(params.uri)

  if (!resource) {
    return res.json({
      jsonrpc: '2.0',
      id: req.body.id,
      error: {
        code: -32602,  // ← SEP-2164: standard JSON-RPC Invalid Params
        message: `Resource not found: ${params.uri}`,
        data: { uri: params.uri },
      },
    })
  }

  return res.json({ jsonrpc: '2.0', id: req.body.id, result: resource })
}

Client-Side Error Handling

// BEFORE — matching on MCP custom code
async function readResource(uri: string) {
  const response = await mcpClient.request('resources/read', { uri })

  if (response.error) {
    if (response.error.code === -32002) {  // ← UPDATE THIS
      throw new ResourceNotFoundError(uri)
    }
    throw new McpError(response.error)
  }

  return response.result
}

// AFTER — matching on standard JSON-RPC code
async function readResource(uri: string) {
  const response = await mcpClient.request('resources/read', { uri })

  if (response.error) {
    if (response.error.code === -32602) {  // ← SEP-2164
      throw new ResourceNotFoundError(uri)
    }
    throw new McpError(response.error)
  }

  return response.result
}

// Error code constants file — update the mapping
export const MCP_ERROR_CODES = {
  PARSE_ERROR:      -32700,
  INVALID_REQUEST:  -32600,
  METHOD_NOT_FOUND: -32601,
  INVALID_PARAMS:   -32602,  // replaces old -32002 for missing resources
  INTERNAL_ERROR:   -32603,
} as const

One important nuance: -32602 (Invalid Params) is a broader category than the old -32002. After this migration, your client code that catches -32602 will also catch other invalid-parameter errors. If you need to distinguish between "missing resource" and "malformed parameters," you should use the error data field rather than the code:

// Distinguishing resource-not-found from other -32602 errors via error.data
if (response.error?.code === -32602) {
  if (response.error.data?.uri) {
    // This is a resource-not-found case
    throw new ResourceNotFoundError(response.error.data.uri)
  }
  // Other invalid params error
  throw new InvalidParamsError(response.error.message)
}
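
During the rollout a client may talk to servers on both sides of the migration. A minimal sketch of a classifier that accepts the legacy code as well as the new one is shown here; classifyMcpError and McpErrorKind are hypothetical names, and the data.uri convention is the one used in this article's server example rather than something every server is guaranteed to send.

```typescript
// Transitional classifier (hypothetical names): handles -32002 from servers
// that have not migrated yet and -32602 from servers that have.
type McpErrorKind = 'resource_not_found' | 'invalid_params' | 'other'

interface JsonRpcError {
  code: number
  message: string
  data?: unknown
}

const LEGACY_RESOURCE_NOT_FOUND = -32002 // pre-migration servers
const INVALID_PARAMS = -32602 // SEP-2164

// Assumption: a missing-resource error carries data.uri, as in the
// server-side example in this article.
function hasUri(data: unknown): data is { uri: string } {
  return (
    typeof data === 'object' &&
    data !== null &&
    typeof (data as { uri?: unknown }).uri === 'string'
  )
}

function classifyMcpError(error: JsonRpcError): McpErrorKind {
  if (error.code === LEGACY_RESOURCE_NOT_FOUND) return 'resource_not_found'
  if (error.code === INVALID_PARAMS) {
    return hasUri(error.data) ? 'resource_not_found' : 'invalid_params'
  }
  return 'other'
}
```

Once every server you call has migrated, the legacy branch can be deleted without touching the rest of the function.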

Change 4: Caching Metadata (SEP-2549)

List and resource read responses now carry two new fields: ttlMs and cacheScope. Servers that do not add these fields are still spec-compliant — the fields are optional. But clients that were previously caching without guidance now have a standard contract to follow.[1]

Server-Side: Adding Cache Metadata to Responses

// BEFORE — tools/list response without caching guidance
async function handleToolsList(
  req: express.Request,
  res: express.Response
) {
  const tools = await toolRegistry.list()

  return res.json({
    jsonrpc: '2.0',
    id: req.body.id,
    result: {
      tools,
    },
  })
}

// AFTER — tools/list with caching metadata (SEP-2549)
async function handleToolsList(
  req: express.Request,
  res: express.Response,
  clientCapabilities: Record<string, unknown>
) {
  const tools = await toolRegistry.list()
  const userId = extractUserId(req)  // null for unauthenticated requests

  return res.json({
    jsonrpc: '2.0',
    id: req.body.id,
    result: {
      tools,
      // ttlMs: how long the response is valid
      // cacheScope: 'global' = safe to share across users
      //             'user'   = specific to this user
      //             'session' = not safe to cache across requests
      ttlMs: 300_000,          // 5 minutes — tools list rarely changes
      cacheScope: userId       // user-scoped if authenticated
        ? 'user'
        : 'global',
    },
  })
}

// Resource read — typically shorter TTL and user-scoped
async function handleResourceRead(
  req: express.Request,
  res: express.Response,
  params: { uri: string }
) {
  const resource = await resourceStore.get(params.uri)

  if (!resource) {
    return res.json({
      jsonrpc: '2.0',
      id: req.body.id,
      error: {
        code: -32602,
        message: `Resource not found: ${params.uri}`,
        data: { uri: params.uri },  // lets clients tell this apart from other -32602 errors
      },
    })
  }

  return res.json({
    jsonrpc: '2.0',
    id: req.body.id,
    result: {
      contents: resource.contents,
      ttlMs: resource.isStatic ? 3_600_000 : 30_000,  // 1h static, 30s dynamic
      cacheScope: resource.isPublic ? 'global' : 'user',
    },
  })
}

Client-Side: Respecting Cache Metadata
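
Before wiring a cache into a real client, it helps to pin down the contract in a form that can be tested with a fake clock. The following is a compact sketch, not a reference implementation: ScopedTtlCache and its method names are hypothetical, and it assumes that 'session' responses are simply never stored.

```typescript
// Compact TTL cache keyed by scope (hypothetical class, injectable clock).
type CacheScope = 'global' | 'user' | 'session'

interface Entry {
  data: unknown
  expiresAt: number
}

class ScopedTtlCache {
  private entries = new Map<string, Entry>()
  private now: () => number

  constructor(now: () => number = () => Date.now()) {
    this.now = now
  }

  // Builds the storage slot; returns null when the response must not be cached.
  private slotFor(scope: CacheScope, key: string, userId?: string): string | null {
    if (scope === 'global') return JSON.stringify(['global', key])
    if (scope === 'user' && userId) return JSON.stringify(['user', userId, key])
    return null // 'session', or 'user' without a known user
  }

  set(key: string, data: unknown, ttlMs: number, scope: CacheScope, userId?: string): void {
    const slot = this.slotFor(scope, key, userId)
    if (slot === null || ttlMs <= 0) return
    this.entries.set(slot, { data, expiresAt: this.now() + ttlMs })
  }

  get(key: string, userId?: string): unknown | null {
    // Check the user's own entry first, then fall back to the shared one.
    const slots = [this.slotFor('user', key, userId), this.slotFor('global', key)]
    for (const slot of slots) {
      if (slot === null) continue
      const entry = this.entries.get(slot)
      if (!entry) continue
      if (this.now() < entry.expiresAt) return entry.data
      this.entries.delete(slot) // expired: evict lazily on read
    }
    return null
  }
}
```

Injecting the clock keeps expiry tests deterministic: a test can advance a counter instead of sleeping for ttlMs.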

// Client-side cache respecting ttlMs and cacheScope
interface CacheEntry {
  data: unknown
  expiresAt: number
  scope: 'global' | 'user' | 'session'
}

class McpResponseCache {
  private globalCache = new Map<string, CacheEntry>()
  private userCaches  = new Map<string, Map<string, CacheEntry>>()

  set(key: string, data: unknown, ttlMs: number, scope: string, userId?: string) {
    const entry: CacheEntry = {
      data,
      expiresAt: Date.now() + ttlMs,
      scope: scope as CacheEntry['scope'],
    }

    if (scope === 'global') {
      this.globalCache.set(key, entry)
    } else if (scope === 'user' && userId) {
      if (!this.userCaches.has(userId)) {
        this.userCaches.set(userId, new Map())
      }
      this.userCaches.get(userId)!.set(key, entry)
    }
    // scope === 'session' → do not cache
  }

  get(key: string, userId?: string): unknown | null {
    const globalEntry = this.globalCache.get(key)
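    if (globalEntry && Date.now() < globalEntry.expiresAt) {
      return globalEntry.data
    }

    const userEntry = userId ? this.userCaches.get(userId)?.get(key) : undefined
    if (userEntry && Date.now() < userEntry.expiresAt) {
      return userEntry.data
    }

    return null
  }
}

Change 5: W3C Trace Context Keys in _meta (SEP-414)

// Client-side: build _meta with the active trace context
// (the function name here is illustrative)
import { context, propagation, trace } from '@opentelemetry/api'

function buildRequestMeta(clientCapabilities: Record<string, unknown>) {
  const carrier: Record<string, string> = {}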

  // Inject current trace context into carrier
  propagation.inject(context.active(), carrier)

  return {
    protocolVersion: '2026-07-28',
    capabilities: clientCapabilities,
    clientInfo: { name: 'my-client', version: '2.0.0' },
    // SEP-414: standard W3C Trace Context key names
    traceparent: carrier['traceparent'],     // e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    tracestate:  carrier['tracestate'],      // vendor-specific trace data
    baggage:     carrier['baggage'],         // arbitrary key-value pairs
  }
}

// Server-side: extract and continue the trace
function extractTraceContext(meta: Record<string, unknown>) {
  const carrier = {
    traceparent: meta.traceparent as string | undefined,
    tracestate:  meta.tracestate  as string | undefined,
    baggage:     meta.baggage     as string | undefined,
  }

  return propagation.extract(context.active(), carrier)
}

// In your handler
async function handleToolsCall(req: express.Request, res: express.Response, params: unknown) {
  const meta = req.body.params?._meta ?? {}
  const traceCtx = extractTraceContext(meta)

  // All spans created within traceCtx are children of the incoming trace
  return context.with(traceCtx, async () => {
    const span = trace.getTracer('mcp-server').startSpan('tools/call')
    try {
      const result = await executeToolCall(params)
      return res.json({ jsonrpc: '2.0', id: req.body.id, result })
    } finally {
      span.end()
    }
  })
}
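
If you want to reject or log malformed trace headers before they reach the propagator, the traceparent value can be checked with a small parser. This sketch follows the W3C Trace Context layout of version, trace-id, parent-id and flags; parseTraceparent is a hypothetical helper, and it deliberately accepts only the four-field layout rather than any longer form a future version might define.

```typescript
// Hypothetical validator for the W3C traceparent value carried in _meta.
interface TraceParent {
  version: string
  traceId: string
  parentId: string
  sampled: boolean
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function parseTraceparent(value: unknown): TraceParent | null {
  if (typeof value !== 'string') return null
  const match = TRACEPARENT_RE.exec(value)
  if (!match) return null

  const [, version, traceId, parentId, flags] = match
  // Version ff is reserved as invalid; all-zero ids are invalid as well.
  if (version === 'ff') return null
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null

  // The lowest bit of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 }
}

const parsedExample = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
)
```

A server that gets null back can drop the key and start a fresh trace instead of propagating a value that downstream collectors will discard.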

Change 6: Three Primitives Deprecated

The RC deprecates Roots, Sampling, and Logging — three first-class primitives in the previous spec. Deprecated does not mean removed on July 28. The governance lifecycle introduced in the same RC mandates at least 12 months between deprecation and earliest removal.[1] But the migration path is clear and you should start it now.

Roots → Tool Parameters, Resource URIs, or Server Config

Roots were a mechanism for a client to tell a server which filesystem paths it had access to. The intended replacement depends on what you were using them for:

If roots were used to pass a working directory to tools — pass it as a tool parameter instead. The tool schema makes it explicit.
If roots were used to scope resource access — encode the scope in the resource URI and validate it server-side.
If roots were used for server configuration — move them to server startup configuration or environment variables.

// BEFORE — using Roots to pass working directory
const initResult = await mcpClient.initialize({
  roots: [
    { uri: 'file:///workspace/project', name: 'Project Root' }
  ]
})

// AFTER — pass as tool parameter
const result = await mcpClient.callTool('read_file', {
  path: '/workspace/project/src/index.ts',  // explicit path in args
})
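
Once the client no longer declares roots, the server has to enforce the boundary itself. A minimal sketch of that check for URI-scoped access is below, assuming the allowed root comes from server configuration; isUriWithinRoot is a hypothetical helper. It relies on the standard URL parser to collapse ".." segments before comparing paths.

```typescript
// Hypothetical server-side scope check: is `candidate` inside `root`?
function parseUrl(value: string): URL | null {
  try {
    return new URL(value)
  } catch {
    return null // not an absolute URI
  }
}

function isUriWithinRoot(candidate: string, root: string): boolean {
  const candidateUrl = parseUrl(candidate)
  const rootUrl = parseUrl(root)
  if (!candidateUrl || !rootUrl) return false

  // Scheme and host must match before paths are compared.
  if (candidateUrl.protocol !== rootUrl.protocol) return false
  if (candidateUrl.host !== rootUrl.host) return false

  // URL parsing has already resolved "." and ".." segments in pathname.
  // The trailing slash prevents "/project-evil" from matching "/project".
  const rootPath = rootUrl.pathname.endsWith('/')
    ? rootUrl.pathname
    : `${rootUrl.pathname}/`
  return (
    candidateUrl.pathname === rootUrl.pathname ||
    candidateUrl.pathname.startsWith(rootPath)
  )
}
```

This covers lexical escapes only. A filesystem-backed server would also need to resolve symlinks before trusting the result.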

Sampling → Direct LLM API Integration

Sampling allowed an MCP server to request that the client perform an LLM completion on the server's behalf. This created an unusual inversion of the typical client-server relationship and complicated the security model. The replacement is for the server to call the LLM API directly.

// BEFORE — MCP server requesting sampling from client
// Server code that sends a sampling request
async function analyzeData(data: string) {
  const samplingResult = await sendSamplingRequest({
    messages: [
      { role: 'user', content: { type: 'text', text: `Analyze: ${data}` } }
    ],
    maxTokens: 1000,
  })
  return samplingResult.content
}

// AFTER — server calls LLM directly
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic()

async function analyzeData(data: string) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1000,
    messages: [
      { role: 'user', content: `Analyze: ${data}` }
    ],
  })
  return response.content[0].type === 'text' ? response.content[0].text : ''
}

Logging → stderr or OpenTelemetry

MCP's built-in Logging primitive sent log messages from the server to the client. The replacements are simpler and more standard: write to stderr for stdio transports, use OpenTelemetry structured logging for everything else.

// BEFORE — using MCP Logging primitive
await mcpServer.sendLog({
  level: 'info',
  logger: 'my-server',
  data: { message: 'Tool execution started', toolName },
})

// AFTER — stderr for stdio, OpenTelemetry for HTTP
import { logs, SeverityNumber } from '@opentelemetry/api-logs'

const logger = logs.getLogger('mcp-server')

// For stdio transport: write structured JSON to stderr
if (transport === 'stdio') {
  process.stderr.write(JSON.stringify({
    level: 'info',
    message: 'Tool execution started',
    toolName,
    timestamp: new Date().toISOString(),
  }) + '\n')
} else {
  // For HTTP transport: use OpenTelemetry logs API
  logger.emit({
    severityNumber: SeverityNumber.INFO,
    body: 'Tool execution started',
    attributes: { toolName },
  })
}

Change 7: JSON Schema 2020-12 for Tool Input Schemas

Tool input schemas now support JSON Schema 2020-12 composition and conditionals, and the structuredContent field on tool call results now accepts any JSON value, not just objects.[1] This is largely additive — existing schemas remain valid — but two rules matter for migration.

First, input schemas must still have type: "object" at the root. You can add composition operators (oneOf, anyOf, allOf) and conditionals, but the root type constraint stays. Second, do not auto-dereference external $ref URIs — the spec explicitly prohibits servers from fetching and inlining remote schemas.

// BEFORE — simple flat input schema
const searchTool = {
  name: 'search_products',
  description: 'Search the product catalog',
  inputSchema: {
    type: 'object',
    properties: {
      query: { type: 'string' },
      limit: { type: 'number' },
    },
    required: ['query'],
  },
}

// AFTER — JSON Schema 2020-12 with composition and conditionals
const searchTool = {
  name: 'search_products',
  description: 'Search the product catalog with optional filtering',
  inputSchema: {
    type: 'object',   // root type: object still required
    properties: {
      query:    { type: 'string', minLength: 1 },
      limit:    { type: 'number', minimum: 1, maximum: 100, default: 10 },
      format:   { enum: ['json', 'csv', 'markdown'] },
      filters:  {
        type: 'object',
        properties: {
          priceMin: { type: 'number' },
          priceMax: { type: 'number' },
          category: { type: 'string' },
        },
      },
    },
    // No root-level `required` here: a root `required: ['query']` would make
    // the anyOf below always pass and the "or filters" branch meaningless
    // Conditional: if format is csv, filters are not allowed
    if:   { properties: { format: { const: 'csv' } }, required: ['format'] },
    then: { not: { required: ['filters'] } },
    // anyOf composition: at least one of a text query or a structured filter
    anyOf: [
      { required: ['query'] },
      { required: ['filters'] },
    ],
  },
}
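
Both migration rules can be enforced with a small lint step at tool-registration time. The sketch below is one possible approach under stated assumptions: checkToolInputSchema is a hypothetical function, and it treats every $ref that does not start with "#" as external, which is conservative and will also flag references to bundled schemas addressed by $id.

```typescript
// Hypothetical lint for tool input schemas: root must be type "object",
// and no $ref may point outside the schema document.
interface SchemaIssue {
  path: string
  message: string
}

function findExternalRefs(node: unknown, path: string, issues: SchemaIssue[]): void {
  if (Array.isArray(node)) {
    node.forEach((item, index) => findExternalRefs(item, `${path}/${index}`, issues))
    return
  }
  if (typeof node !== 'object' || node === null) return

  for (const [key, value] of Object.entries(node)) {
    if (key === '$ref' && typeof value === 'string' && !value.startsWith('#')) {
      issues.push({ path: `${path}/$ref`, message: `non-local $ref: ${value}` })
    } else {
      findExternalRefs(value, `${path}/${key}`, issues)
    }
  }
}

function checkToolInputSchema(schema: unknown): SchemaIssue[] {
  const issues: SchemaIssue[] = []
  const rootType =
    typeof schema === 'object' && schema !== null
      ? (schema as { type?: unknown }).type
      : undefined
  if (rootType !== 'object') {
    issues.push({ path: '', message: 'root schema must declare type: "object"' })
  }
  findExternalRefs(schema, '', issues)
  return issues
}
```

Running this over every registered tool at startup turns a spec violation into a boot-time failure rather than a runtime surprise.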

Multi-Round-Trip Tool Calls: InputRequiredResult

The RC introduces a new result type for tool calls that need additional input from the client: InputRequiredResult. This covers scenarios like OAuth confirmation, human-in-the-loop approval, and tools that need clarification before executing. The entire pattern is designed to be stateless — the server encodes its state as a base64 blob that the client echoes back on the follow-up request.[1]

// Server: returning InputRequiredResult when confirmation is needed
async function handleToolsCall(req: express.Request, res: express.Response, params: ToolCallParams) {
  const { name, arguments: args, inputResponses, requestState } = params

  if (name === 'delete_records' && !inputResponses) {
    // First call — ask for confirmation
    const pendingState = Buffer.from(JSON.stringify({
      operation: 'delete_records',
      args,
      requestedAt: Date.now(),
    })).toString('base64')

    return res.json({
      jsonrpc: '2.0',
      id: req.body.id,
      result: {
        resultType: 'inputRequired',
        inputRequests: {
          confirmation: {
            type: 'boolean',
            prompt: `Delete ${args.count} records? This cannot be undone.`,
          },
        },
        requestState: pendingState,
      },
    })
  }

  if (name === 'delete_records' && inputResponses && requestState) {
    // Follow-up call with user's response
    if (!inputResponses.confirmation) {
      return res.json({
        jsonrpc: '2.0',
        id: req.body.id,
        result: { content: [{ type: 'text', text: 'Operation cancelled.' }] },
      })
    }

    // Decode state and execute — any server instance can handle this
    const pendingOp = JSON.parse(Buffer.from(requestState, 'base64').toString())
    const deleted = await db.deleteRecords(pendingOp.args)

    return res.json({
      jsonrpc: '2.0',
      id: req.body.id,
      result: { content: [{ type: 'text', text: `Deleted ${deleted} records.` }] },
    })
  }
}

// Client: handling InputRequiredResult
async function callToolWithConfirmation(toolName: string, args: unknown) {
  const firstResponse = await mcpClient.callTool(toolName, args)

  if (firstResponse.result?.resultType === 'inputRequired') {
    const { inputRequests, requestState } = firstResponse.result

    // Collect responses from user or upstream system
    const inputResponses: Record<string, unknown> = {}
    for (const [key, request] of Object.entries(inputRequests)) {
      inputResponses[key] = await promptUser(request)
    }

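    // Re-issue with responses and echoed requestState as top-level params,
    // which is where the server handler reads inputResponses and requestState
    return mcpClient.request('tools/call', {
      name: toolName,
      arguments: args,
      inputResponses,
      requestState,  // echo back unchanged
    })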
  }

  return firstResponse
}
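
Because requestState travels through the client, a plain base64 blob can be edited before it comes back, and the server example above would then execute whatever arguments it decodes. One common mitigation is to sign the blob. The sketch below assumes an HMAC-SHA256 signature with a server-side secret and a requestedAt timestamp in the state, as in the example; encodeRequestState and decodeRequestState are hypothetical helpers, not part of the spec.

```typescript
// Hypothetical helpers: sign requestState so the server can detect tampering.
import { createHmac, timingSafeEqual } from 'node:crypto'

function sign(payload: string, secret: string): string {
  return createHmac('sha256', secret).update(payload).digest('base64')
}

function encodeRequestState(state: Record<string, unknown>, secret: string): string {
  const payload = Buffer.from(JSON.stringify(state), 'utf8').toString('base64')
  // "." never appears in base64, so it is a safe separator.
  return `${payload}.${sign(payload, secret)}`
}

function decodeRequestState(
  token: string,
  secret: string,
  maxAgeMs: number,
  now: number = Date.now(),
): Record<string, unknown> | null {
  const dot = token.indexOf('.')
  if (dot < 0) return null
  const payload = token.slice(0, dot)

  // Compare signatures in constant time; lengths must match first.
  const expected = Buffer.from(sign(payload, secret), 'utf8')
  const actual = Buffer.from(token.slice(dot + 1), 'utf8')
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) return null

  let state: unknown
  try {
    state = JSON.parse(Buffer.from(payload, 'base64').toString('utf8'))
  } catch {
    return null
  }
  if (typeof state !== 'object' || state === null) return null

  // Reject stale confirmations using the timestamp stored in the state.
  const requestedAt = (state as Record<string, unknown>).requestedAt
  if (typeof requestedAt !== 'number' || now - requestedAt > maxAgeMs) return null
  return state as Record<string, unknown>
}

const demoSecret = 'replace-with-a-real-secret' // placeholder, load from config
const demoToken = encodeRequestState(
  { operation: 'delete_records', args: { count: 3 }, requestedAt: 1_000 },
  demoSecret,
)
```

The pattern stays stateless: any server instance that shares the secret can verify the token, so no session store comes back in through the side door.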

Governance: What the Lifecycle Policy Means for You

The three SEPs that formalize governance matter for how you track future changes, not just this migration.[2]

SEP-2577 introduces a three-tier lifecycle: Active, Deprecated, Removed. Every feature has a stated status. The policy mandates at least 12 months between a deprecation announcement and the earliest possible removal. For the three primitives deprecated in the RC — Roots, Sampling, Logging — the earliest removal date is therefore July 2027. You have time, but the clock is running.

SEP-2133 introduces the extension framework with reverse-DNS identifiers. Extensions are opt-in capabilities that client and server negotiate via an extensions map in their capabilities exchange. New capabilities ship as extensions before being promoted to core spec. If you are evaluating a vendor's MCP SDK and they mention capabilities that are not in the current spec, check whether those are published extensions under SEP-2133 or proprietary additions.

The practical implication of the governance changes: watch the MCP repository for SEPs that enter Deprecated status. A deprecation notice is now a 12-month countdown, not an indefinite soft warning.

Migration Checklist

Work through this in order. Each section should be verifiable before you move to the next.

Server Changes

Remove the initialize/initialized handler and all session store code
Remove Mcp-Session-Id header from all responses
Add validateMcpHeaders middleware that rejects requests where Mcp-Method is absent or disagrees with the body method
Add validateMcpHeaders rejection for Mcp-Name/params.name mismatch on tools/call requests
Replace all -32002 error codes with -32602
Add ttlMs and cacheScope to tools/list responses
Add ttlMs and cacheScope to resources/read responses
Rename trace context keys in _meta to traceparent, tracestate, baggage
Add extraction of W3C trace context from incoming _meta
Begin migration away from Roots, Sampling, Logging primitives (deadline: July 2027 earliest removal)

Client Changes

Stop sending Mcp-Session-Id in request headers
Stop sending the initialize request before first tool call
Add Mcp-Method and Mcp-Name headers to every request
Add MCP-Protocol-Version: 2026-07-28 header
Move client metadata (protocolVersion, capabilities, clientInfo) into _meta on every request body
Update error code matching from -32002 to -32602
Implement cache respecting ttlMs and cacheScope from list/read responses
Use standard W3C keys when injecting trace context into _meta
Handle InputRequiredResult response type on tool calls

Infrastructure Changes

Remove sticky session affinity from load balancer configuration
Decommission shared session store (Redis or otherwise) if its only use was MCP sessions
Replace body-inspection-based routing with header-based routing on Mcp-Method
Update Cloudflare Worker / Nginx / gateway routing rules to read Mcp-Method
Validate that horizontal scaling works — deploy two instances and verify requests route to both

Verification Gates

Before marking any of the above complete, run these checks:

# Gate 1: Server rejects requests missing Mcp-Method
curl -s -X POST https://your-mcp-server/mcp   -H "Content-Type: application/json"   -d '{"jsonrpc":"2.0","id":"1","method":"tools/list","params":{}}'   | jq '.error.code'
# Expected: -32600

# Gate 2: Server rejects header/body mismatch
curl -s -X POST https://your-mcp-server/mcp   -H "Content-Type: application/json"   -H "Mcp-Method: tools/list"   -d '{"jsonrpc":"2.0","id":"2","method":"tools/call","params":{"name":"x","arguments":{}}}'   | jq '.error.message'
# Expected: error about header/body mismatch

# Gate 3: tools/list response includes caching fields
curl -s -X POST https://your-mcp-server/mcp   -H "Content-Type: application/json"   -H "Mcp-Method: tools/list"   -H "MCP-Protocol-Version: 2026-07-28"   -d '{"jsonrpc":"2.0","id":"3","method":"tools/list","params":{"_meta":{"protocolVersion":"2026-07-28"}}}'   | jq '{ttlMs: .result.ttlMs, cacheScope: .result.cacheScope}'
# Expected: {"ttlMs": <number>, "cacheScope": "global"|"user"|"session"}

# Gate 4: Error code is now -32602 for missing resource
curl -s -X POST https://your-mcp-server/mcp   -H "Content-Type: application/json"   -H "Mcp-Method: resources/read"   -H "MCP-Protocol-Version: 2026-07-28"   -d '{"jsonrpc":"2.0","id":"4","method":"resources/read","params":{"uri":"nonexistent://x","_meta":{}}}'   | jq '.error.code'
# Expected: -32602

# Gate 5: Two server instances can handle requests interchangeably
# Send 10 requests and verify all return 200 when routed round-robin
for i in {1..10}; do
  curl -s -o /dev/null -w "%{http_code}" -X POST https://your-mcp-server/mcp     -H "Content-Type: application/json"     -H "Mcp-Method: tools/list"     -H "MCP-Protocol-Version: 2026-07-28"     -d '{"jsonrpc":"2.0","id":"'$i'","method":"tools/list","params":{"_meta":{"protocolVersion":"2026-07-28"}}}'
  echo ""
done
# Expected: all 200

Timeline and SDK Support

The Release Candidate was locked on May 21, 2026. The final specification publishes July 28. Tier 1 SDKs — the official TypeScript and Python SDKs maintained by the MCP team — are expected to ship 2026-07-28 support within the 10-week window between RC lock and final spec.[1]

If you are using an official SDK, the migration path is largely handled at the SDK layer. You will need to update the SDK version, move session initialization code out of your app startup, and add the new headers. The structural changes to your server code depend on how much session state management was living in application code versus the SDK.

If you rolled a custom MCP implementation — and many production deployments have, given how early the ecosystem is — this guide is your migration spec. Every change documented above is a direct consequence of what the spec text changed, drawn from the RC blog post and the SEP references it cites.

The Tasks API migration is a separate track. If you are using the experimental 2025-11-25 Tasks API, that moves to the extension lifecycle under SEP-2133. Watch the repository for the extension identifier and update your capability negotiation accordingly.

What This Migration Actually Buys

The changes are not cosmetic. Each one has a concrete operational payoff.

Stateless protocol means horizontal scaling without infrastructure tricks. A deployment that previously required a Redis cluster to share session state, a sticky load balancer, and a gateway that could parse JSON to extract session IDs can now run as a plain deployment behind a dumb round-robin load balancer. The operational surface shrinks.

Header-based routing means cheaper gateways. L7 routing on a header field is an order of magnitude cheaper than buffering a full request body to parse JSON. Rate limiters, gateways, and Cloudflare Workers that handle MCP traffic get simpler and faster.

Standardized error codes mean fewer custom error handling paths. Every client library that already understands JSON-RPC 2.0 error semantics will handle -32602 without bespoke logic. The custom code was a compatibility tax.

Cache metadata means the tools list gets cached correctly instead of by convention. Some clients were caching for 60 seconds because that felt right. Others were not caching at all. The ttlMs and cacheScope fields give the server — which actually knows how often the tools list changes — authority over the cache policy.

Locked trace keys mean distributed traces actually correlate. Before SEP-414, an OpenTelemetry trace that crossed an MCP boundary would silently break because the receiving SDK was looking for a different key name. That class of debugging frustration ends with the RC.

The 10-week window between RC lock and final spec is deliberate. It is enough time to complete the migration on a realistic schedule without being so long that teams defer starting. The Tier 1 SDKs shipping within the same window means the ecosystem has working reference implementations before the final spec drops.

Start with the session removal. It is the largest structural change and determines what else needs to change downstream. The headers, error code, and cache metadata follow naturally from it. The trace context work is largely additive. The deprecated primitives have a 12-month runway. That ordering gives you a migration that makes steady progress rather than one that stalls on the hardest problem first.

Related Tools

For testing your migrated MCP server locally, the JSON formatter and validator is useful for inspecting request/response payloads. The regex tester helps with writing header validation patterns. For the codebase search to find all -32002 occurrences, the developer tools collection includes a code diffing utility.

For broader context on running MCP servers in production — authentication, gateway patterns, rate limiting — the MCP production hardening guide covers the patterns that matter before and after this spec update. The MCP developer guide covers the protocol fundamentals if you are starting from first principles.

This is authored by Anup Karanjkar, who has been building and operating MCP-integrated systems since the protocol's first public release.

Footnotes

[1] MCP 2026-07-28 Release Candidate, Official Blog Post, modelcontextprotocol.io, May 28, 2026
[2] MCP Development Roadmap, modelcontextprotocol.io, last updated March 5, 2026
[3] Model Context Protocol Roadmap 2026, The New Stack

Originally published at wowhow.cloud

Continue? Y/N — Why AI Permission Fatigue Is the Biggest UX Crisis Nobody's Measuring

Anup Karanjkar — Fri, 29 May 2026 21:33:28 +0000

I clicked ‘Yes’ forty-seven times in a single Claude Code session last Thursday. I stopped reading what I was approving around click twelve. By click twenty-three I was tapping the keyboard rhythmically, the way you tap “agree” on a cookie consent banner before the text has finished loading. The session produced working code. I have no idea what I approved. That gap — between the formal existence of human oversight and its actual cognitive reality — is the biggest unsolved problem in AI deployment today, and almost nobody is measuring it.

The AI industry has converged on “human-in-the-loop” as the responsible answer to autonomous agent risk. Keep humans reviewing actions. Require approval gates. Log decisions. The EU AI Act, taking full effect in August 2026, will enshrine review requirements for high-risk AI systems into law.[1] Yet the mechanism this entire safety architecture depends on, human attention during review, degrades to near-zero under the conditions that make AI agents valuable in the first place. The more useful the agent, the more approvals it generates. The more approvals it generates, the less each one means. This is not a user education problem. It is a design problem, and the design has not been solved.

This post makes a falsifiable claim, traces the evidence, examines Claude Code as the most evolved attempt at getting this right, and ends with the design pattern that might actually work. It also makes a specific prediction: permission fatigue will cause the first major AI agent security incident within twelve months. A human-in-the-loop, trained by ten thousand benign approvals to click without reading, will approve a destructive action. The incident will be blamed on the AI. The root cause will be the approval interface.

The Consensus Position and Why It Is Lazy

The responsible AI discourse has a comfort zone. Human-in-the-loop sits squarely inside it. It sounds right because it is right in one narrow technical sense: humans are more capable than any current AI system of catching ethical violations, novel edge cases, and consequential errors that fall outside the training distribution. The intuition is sound. The operational assumption attached to it — that humans will actually exercise this capability during a review workflow — is not.

Jakob Nielsen, writing his 2026 AI predictions, named this the Review Paradox.[2] The observation is deceptively simple: it is cognitively harder to verify AI-generated work than to produce equivalent work yourself. When you write something, your working memory already contains the intent, the constraints, the alternatives you rejected, and the reasoning for each choice. When you review AI output, you must reconstruct all of that from the artifact itself. This reconstruction is slower, less reliable, and more exhausting than the original production task.

The paradox sharpens for agentic systems. A coding agent executing a multi-step refactor does not produce one output for you to review. It produces a chain of decisions, each one a prerequisite for the next, each one visible only as a brief permission prompt before the agent proceeds. The cognitive load of tracking that chain — understanding what each action means in context, what it enables downstream, what it forecloses — exceeds the cognitive load of writing the refactor yourself. So humans stop tracking it. They approve the chain because approving the chain is what they have been trained to do by ten thousand previous sessions where approving the chain was correct.

Nielsen’s proposed solution is the Audit Interface: a design that compresses a 50-step agent chain into a single glanceable confidence check, rather than 50 individual approval prompts. The idea is right. The implementation does not yet exist at production quality in any major tool. The gap between the idea and the working implementation is where every human-in-the-loop deployment currently lives.

What Broke: A Personal Inventory

Last Thursday’s session was not unusual. I was using Claude Code to refactor a data pipeline — moving from a polling architecture to an event-driven one, updating about 14 files, adding a test suite. The agent was fast and accurate. The permission prompts were not designed to be read under those conditions.

The first dozen prompts I read carefully. File reads, directory listings, a few shell commands I verified against what I expected. By the fifteenth prompt I was scanning for the operation type and checking whether it matched the general direction of the task. By the twentieth I had developed a heuristic: if the prompt contained the word “write” or “edit” and mentioned a file I recognized, I approved. If it contained “execute” and a command that started with “npm” or “git”, I approved. I was not reading the commands. I was pattern-matching the structure of the permission prompt.

Around prompt thirty-two, the agent asked to execute a git command. I approved. When the session ended I checked the git log and found a rebase on a branch I had not consciously decided to rebase. The rebase was correct — it was exactly what the task required. But I had not decided to do it. The agent had decided to do it, generated a permission prompt, and I had approved the permission prompt because the pattern matched my degraded heuristic.

That is the mechanism. Not a dramatic failure. Not a security incident. Just a human who was nominally in control but operationally not, making an irreversible decision they did not consciously make. The human-in-the-loop was present. The loop was not closed.

I have spoken informally with a dozen other heavy Claude Code users since then. Every one of them recognized the pattern immediately. The number of approvals before cognitive degradation varies — some people reported as few as eight, some as many as twenty-five — but no one claimed they read every prompt in a long session. The honest ones said the same thing: they are approving based on context and pattern, not based on actually understanding each individual action.

The Research Signal: Consent Fatigue Is Design-Engineered

This is not a new phenomenon dressed in new clothes. The UX research literature has a name for it: consent fatigue. A May 2026 analysis in UX Magazine describes it as “a design-engineered condition that has secretly replaced informed choice with passive compliance.”3 The users who click through cookie consent banners, terms-of-service agreements, and software license dialogs are not making uninformed decisions. They are making decisions that are entirely rational given the friction involved: the cognitive cost of reading the document is higher than the expected cost of whatever the document describes. So they click through. Every time.

The critical phrase in that analysis is “design-engineered.” Consent fatigue is not a user failure. It is a design outcome. The friction is not accidental — it is the predictable result of requiring consent in contexts where the consent mechanism is incompatible with how human attention actually works. Cookie consent banners are not designed to produce informed consent. They are designed to produce legal compliance. The distinction matters enormously when the same pattern is transplanted into AI agent approval flows.

AI permission prompts are currently designed to produce legal and audit compliance: a record that a human approved each action. They are not designed to produce informed approval. The record exists. The understanding often does not. When the EU AI Act requires human oversight of high-risk AI actions, it will, in most implementations, produce more of the former and very little of the latter. The law will be satisfied. The safety property it is trying to enforce will not be.

NN/g’s State of UX 2026 report identifies trust as the major design problem of the year.4 The framing there is about users trusting AI outputs — whether people can rely on AI-generated content. But the trust problem has a harder second dimension: can AI systems trust that human approval signals actually represent human understanding? The answer, under current approval interface designs, is no. The signal is corrupted by the mechanism that generates it.

The Tool Fatigue Parallel

There is a broader fatigue pattern in the AI tooling space that provides useful context. Product Hunt lists more than thirty new AI tools daily. The developers who are thriving in 2026 are not the ones who evaluate every tool. They are the ones who picked something and committed to it — who traded the theoretical upside of finding the best tool for the practical upside of developing depth with a specific one.5

The mechanism is the same as permission fatigue, one level up the stack. When evaluation decisions arrive faster than a person can process them carefully, the person develops heuristics. For tool selection, the heuristic might be: “if this is from a company I know and it has integrations I recognize, I’ll try it.” For permission approvals, the heuristic is: “if this looks like the last forty things I approved, I’ll approve it.” Both heuristics work well most of the time. Both fail catastrophically in the specific case where careful attention was required.

The difference is stakes. A suboptimal tool selection costs time and switching friction. A suboptimal permission approval, in an agentic system with write access to production infrastructure, costs something else.

Claude Code’s Permission Model: The Most Evolved Attempt

Claude Code has the most sophisticated permission architecture of any current AI agent tool. It is worth examining closely both for what it gets right and for where it still falls short of solving the fatigue problem.

The architecture is built around trust as a spectrum rather than a switch. Rather than a binary “agent can do anything” or “agent must ask for everything,” the model has several distinct layers:

The Allow List

Users can define explicit allow lists for operations the agent may perform without prompting. A typical configuration might allow all file reads within the project directory, all writes to specific file patterns, and specific shell commands like “npm test” and “git status.” The allow list is the primary mechanism for enabling autonomous operation in the high-frequency, low-risk cases that constitute the vast majority of agent actions.

This is the right design intuition. The forty-seven prompts I mentioned at the start of this post included probably thirty-eight that were reads and writes within a scope I had already implicitly decided to trust. The allow list eliminates that class of prompt. It does not eliminate the fatigue from the nine genuinely consequential prompts that remain — but it reduces the ratio of noise to signal, which matters for attention quality.

The Deny List

Hard constraints live in a deny list. Operations on this list are never permitted regardless of context: specific file paths containing secrets, destructive database commands, network calls to domains outside an approved set. The deny list does not require human attention at runtime — it is enforced automatically. This is the correct place to put catastrophic-risk constraints. It removes the worst failure mode from the permission flow entirely: the human who would have approved a catastrophic action because they were not paying attention no longer has the opportunity to make that mistake.

Enterprise Policy Override

In enterprise deployments, organizational policy takes precedence over both user allow lists and deny lists. The administrator sets boundaries; users operate within them. This is critical for security posture because it removes the assumption that every individual user will configure their permissions correctly. Individual humans are unreliable security boundary managers. Organizational policy, set by security-conscious administrators, is more reliable.

The pattern this creates — defined in policy, enforced in code, reviewed at the boundaries — is the right architecture. Most permission fatigue happens in the middle of the policy hierarchy, in the zone of operations that are neither clearly authorized nor clearly prohibited. The design work that matters is in shrinking that middle zone.
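The layering described above can be sketched as a small precedence check. This is a minimal illustration under stated assumptions, not Claude Code's actual rule syntax or evaluation order: the `kind:target` action strings, the glob patterns, and the `decide` function are hypothetical, and the sketch assumes that any deny rule beats any allow rule.

```python
from fnmatch import fnmatchcase

# Hypothetical rule sets; patterns are shell-style globs over "kind:target".
ORG_DENY = ["read:*.env", "exec:rm -rf *"]
USER_DENY = ["exec:git push*"]
USER_ALLOW = ["read:src/*", "exec:npm test", "exec:git status"]


def decide(action, org_deny, user_deny, user_allow):
    """Return "deny", "allow", or "ask" for a single action string."""
    # Organizational policy is checked first and cannot be overridden.
    if any(fnmatchcase(action, p) for p in org_deny):
        return "deny"
    # User-level hard constraints come next.
    if any(fnmatchcase(action, p) for p in user_deny):
        return "deny"
    # Explicitly trusted operations run without a prompt.
    if any(fnmatchcase(action, p) for p in user_allow):
        return "allow"
    # Everything else lands in the middle zone and needs a human.
    return "ask"


for action in ["read:src/app.py", "read:src/.env", "exec:git rebase -i HEAD~3"]:
    print(action, "->", decide(action, ORG_DENY, USER_DENY, USER_ALLOW))
```

Everything that returns "ask" is the middle zone the paragraph above describes, so the size of that bucket is a direct measure of how much prompt volume a configuration still generates.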

What Claude Code’s Model Still Doesn’t Solve

The model is thoughtful, but it does not solve the problem Ravi Palwe identified in what is currently the only serious analysis of this issue for AI agents specifically: the review fatigue problem is not primarily about which operations require approval.6 It is about how the approval request is presented once the decision to require approval has been made.

Current permission prompts are action-level: “Allow bash command: git rebase -i HEAD~3?”. The information in that prompt is accurate. It is not presented in a way that tells a fatigued reviewer what they most need to know: why is the agent doing this, what does it enable, and what is the consequence of approving versus denying it at this point in the workflow.

A more useful prompt format would present: the agent’s stated intent for this action, the action itself, the reversibility status, and the alternatives if denied. That is four elements instead of one. The other three are often more important than the action string itself for an informed approval decision. They are also almost never present in current permission interfaces, including Claude Code’s.
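One way to picture the four-element prompt is as a small record that the interface renders in a fixed order. The `ApprovalRequest` type, its field names, and the example text are hypothetical and exist only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    """The four elements a context-rich prompt would carry (hypothetical)."""
    intent: str       # the agent's stated reason for the action
    action: str       # the literal command or operation
    reversible: bool  # whether the action can be undone afterwards
    if_denied: str    # what the agent will do instead


def render(req: ApprovalRequest) -> str:
    """Format the request so the rationale is read before the command."""
    undo = "reversible" if req.reversible else "NOT reversible"
    return "\n".join([
        f"Why:       {req.intent}",
        f"Action:    {req.action}",
        f"Undo:      {undo}",
        f"If denied: {req.if_denied}",
    ])


request = ApprovalRequest(
    intent="Squash three work-in-progress commits before opening the PR",
    action="git rebase -i HEAD~3",
    reversible=False,
    if_denied="Leave the history as is and open the PR with three commits",
)
print(render(request))
```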

The Attention Budget Problem

There is a mathematical reality underneath the design problem. Human attentional capacity is not elastic. A knowledge worker can sustain focused decision-making for approximately 3-4 hours before significant degradation in decision quality — and that is across all decisions, not just AI permission prompts.4

A developer using Claude Code for an eight-hour workday, running three substantial agent sessions per day, might encounter 120-200 permission prompts. If each prompt requires genuine cognitive engagement — reading the action, understanding the context, assessing the risk — that engagement competes with every other cognitive task in the workday. The developer who is deeply engaged in debugging a race condition at 2 PM is not going to interrupt that engagement to carefully evaluate a permission prompt about a file write. They will approve it reflexively.

This is not cognitive weakness. It is rational resource allocation. Attention is finite. Permission prompts that arrive during focused work are competing with the focused work for the same resource. The permission prompts almost always lose. The design problem is that the safety architecture assumes they will win.

The relevant question is not “how do we get humans to pay more attention to permission prompts.” That question has no good answer. The relevant question is “how do we design permission systems that produce genuine safety properties without requiring sustained attention from every human reviewer.” That question has answers, and the industry has barely started working on them.

The EU AI Act Will Make This Worse Before It Gets Better

Most of the EU AI Act’s obligations, including mandatory human oversight requirements for high-risk AI systems, are scheduled to apply from August 2026.1 The requirements are principled: they exist because autonomous AI systems can cause harms that require human accountability. The implementation problem is that “human oversight requirement” translates almost universally into “approval prompt at decision point.”

This will generate more permission prompts, not better ones. Compliance teams will log the approvals. Audit trails will show that humans were present at every decision. The cognitive reality of those approvals — whether the humans understood what they were approving — will not be measured, reported, or regulated. The law will be satisfied. The safety property will not.

The irony is significant. The EU AI Act is specifically designed to prevent the scenario where autonomous AI systems cause harm without human accountability. By mandating human-in-the-loop review without specifying the quality of that review, it may actually entrench permission fatigue as a compliance mechanism rather than driving the design innovation that could produce genuine oversight. An organization that installs approval prompts everywhere, logs every click, and presents that log as evidence of human oversight has satisfied the letter of the regulation while achieving none of its purpose.

The August 2026 deadline should be treated as a design forcing function. The organizations that use it to build genuine oversight interfaces — ones that compress agent chains into glanceable risk summaries, that preserve human attention for the decisions that actually require it, that make the cognitive cost of informed review lower than the cognitive cost of uninformed approval — will end up with better safety properties than the ones that install approval dialogs everywhere and call it done.

The Steelman: Why Human-in-the-Loop Defenders Are Not Wrong

Before making the falsifiable prediction, the counterargument deserves serious engagement. The strongest version of the human-in-the-loop defense is not that humans read every prompt carefully. It is that humans provide a different kind of oversight than the prompt-by-prompt model implies.

The argument goes: even a fatigued reviewer who is not reading individual prompts carefully will notice when something goes fundamentally wrong. The developer who approves forty-seven prompts without reading them will still catch it if the agent starts deleting production databases or exfiltrating customer data, because those actions will produce visible consequences that break through the approval fog. The human is not providing careful per-action review. They are providing catastrophic-outcome detection.

This argument has merit. It is genuinely true that humans can detect category-level failures even when they are not tracking individual actions. The problem with accepting it as sufficient is that it essentially concedes the case for a much more targeted design: if catastrophic-outcome detection is the actual safety property we are relying on humans to provide, the permission interface should be designed around that, not around per-action approval. An interface designed for catastrophic-outcome detection looks completely different from an interface designed for per-action approval. It presents state changes and consequences, not command strings. It triggers high-friction confirmation for irreversible high-impact actions and allows low-friction throughput for everything else.

The current permission model tries to do both things and does neither well. It generates enough prompts to produce fatigue, which defeats per-action review. It does not differentiate catastrophic-outcome actions from routine ones well enough to trigger heightened attention when it matters most. The steelman argument, taken seriously, points toward a better design, not toward accepting the current one.

The Design Pattern That Actually Works

Palwe’s analysis, drawing on Claude Code’s permission model as its primary case study, proposes a four-layer design that addresses fatigue without removing human oversight.6 The layers map to cognitive load and risk level:

Layer 1: Silent Execution

Operations that are explicitly in the allow list, that have been performed successfully more than N times in context, and that are reversible within the current session execute silently. No prompt. No notification. The human reviews these in the session log after the fact, not before. The cognitive cost is near-zero. The safety cost is acceptable because these operations are low-risk and reversible.

Layer 2: Passive Notification

Operations outside the allow list but within established safe patterns generate a passive notification rather than a blocking prompt. The notification appears in a log stream. It does not interrupt current focus. The human can review the stream at natural breakpoints rather than at the agent’s convenience. This design acknowledges that the timing of the review, not just the existence of the review, matters for attention quality.

Layer 3: Active Confirmation

Genuinely novel operations — ones that do not match established patterns, that involve resources outside the current working scope, or that have been flagged by the agent itself as requiring human judgment — generate a blocking confirmation request. But crucially, this request presents not just the action but the context: why the agent is requesting this action, what chain of reasoning led here, and what happens if the request is denied. The human is not evaluating a command string. They are evaluating a decision with stated rationale.

Layer 4: Hard Stop

Actions on the deny list, actions that exceed configurable impact thresholds, and actions that the agent’s own confidence scoring flags as high-risk generate a hard stop with required deliberate confirmation. The friction here is intentional and high: the human must type a confirmation string, not click a button. This design makes it far harder to approve a hard-stop action reflexively, because a single habitual click is no longer enough. The cost is that hard stops cannot be used frequently without destroying the workflow. The benefit is that when they do occur, the approval is almost certainly intentional.
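A typed confirmation can be as simple as requiring the reviewer to retype the action. The check below is a minimal sketch: the function name and the "approve" prefix are assumptions, and in a real command-line tool the `typed` value would come from something like `input()`.

```python
def hard_stop_approved(action: str, typed: str) -> bool:
    """Approve only when the reviewer typed 'approve' plus the exact action."""
    expected = f"approve {action}"
    # Surrounding whitespace is forgiven; anything else must match exactly.
    return typed.strip() == expected


action = "git push --force origin main"
print(hard_stop_approved(action, "y"))
print(hard_stop_approved(action, "approve git push --force origin main"))
```

Because the expected string contains the action itself, a reviewer cannot satisfy the check without at least reading the command once.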

This four-layer model is not Claude Code’s current design. It is a direction that Claude Code’s existing architecture could evolve toward. The allow list and deny list are already present. The missing pieces are the passive notification layer, the context-presenting active confirmation, and the deliberate-confirmation hard stop. Three design additions that do not require changing the underlying permission model, only the interface through which it is presented.
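The four layers can be read as a routing function over a handful of action attributes. The sketch below is one possible interpretation, not Palwe's specification or Claude Code's behavior: the `Action` fields, the `min_successes` and `impact_limit` thresholds, and the order of the checks are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SILENT = 1     # execute, review later in the session log
    NOTIFY = 2     # non-blocking entry in a log stream
    CONFIRM = 3    # blocking prompt with stated rationale
    HARD_STOP = 4  # typed confirmation required


@dataclass(frozen=True)
class Action:
    allowed: bool                # matches the allow list
    denied: bool                 # matches the deny list
    prior_successes: int         # times this operation succeeded in context
    reversible: bool             # can be undone within the session
    known_pattern: bool          # matches an established safe pattern
    impact: float                # estimated impact, 0.0 to 1.0
    agent_flags_high_risk: bool  # agent's own risk flag


def route(a: Action, min_successes: int = 5, impact_limit: float = 0.8) -> Layer:
    """Pick the least intrusive layer that is still safe for this action."""
    # The highest-risk conditions are checked first so nothing can mask them.
    if a.denied or a.impact > impact_limit or a.agent_flags_high_risk:
        return Layer.HARD_STOP
    if a.allowed and a.prior_successes > min_successes and a.reversible:
        return Layer.SILENT
    if a.known_pattern:
        return Layer.NOTIFY
    return Layer.CONFIRM


routine_write = Action(True, False, 12, True, True, 0.1, False)
novel_rebase = Action(False, False, 0, False, False, 0.5, False)
print(route(routine_write), route(novel_rebase))
```

Checking the hard-stop conditions before the allow list is a deliberate choice in this sketch: an allow-listed operation that suddenly carries high impact should still be stopped.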

Measuring What Nobody Is Measuring

The industry does not currently measure permission approval quality. Tools log whether a human approved an action. They do not log whether the human read the prompt, how long they spent on it, whether they were already engaged in another cognitive task, or whether the approval pattern suggests reflexive clicking. This data is available — dwell time on permission prompts, approval rate as a function of session length, approval latency distribution — and none of it is being collected or published.

The absence of measurement is not neutral. When you do not measure something, you cannot improve it. The organizations building AI agent tools have strong commercial incentives to make permission systems feel frictionless — friction makes tools feel slow and annoying. They have weak incentives to measure whether the approval signals they generate reflect genuine human understanding. The commercial incentive and the safety incentive point in opposite directions, and without measurement, the safety incentive has no purchase.

What would responsible measurement look like? At minimum:

Median dwell time on permission prompts, reported by session length quartile
Approval rate as a function of position in session (do approvals become more automatic over time?)
Post-session audit rates (what fraction of users review their approval logs?)
Denial rates by operation category (what does the distribution tell us about pattern-matching versus understanding?)

These metrics would not be comfortable to publish. They would show exactly what I described at the start of this post: that approval rates converge toward 100% as sessions progress, that dwell time drops to near-zero by the middle of a long session, that almost nobody reviews approval logs after the fact. Publishing that data would create pressure to actually solve the problem. Which is precisely why it should be required.
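Two of the metrics listed above, dwell time and approval rate by position in the session, could be computed from a plain event log along these lines. The event tuple layout, the `by_position` name, and the sample numbers are invented for illustration and are not measurements from any tool.

```python
from statistics import median


def by_position(events, bucket=10):
    """Summarize (position, dwell_seconds, approved) events per position bucket."""
    buckets = {}
    for position, dwell, approved in events:
        # Positions are 1-based, so prompts 1..bucket fall into bucket 0.
        buckets.setdefault((position - 1) // bucket, []).append((dwell, approved))
    report = []
    for index in sorted(buckets):
        rows = buckets[index]
        report.append({
            "prompts": f"{index * bucket + 1}-{(index + 1) * bucket}",
            "median_dwell_s": median(d for d, _ in rows),
            "approval_rate": sum(ok for _, ok in rows) / len(rows),
        })
    return report


# Synthetic events: (prompt position in session, dwell seconds, approved?)
sample = [(1, 8.0, True), (2, 6.0, False), (3, 1.0, True), (4, 0.5, True)]
for row in by_position(sample, bucket=2):
    print(row)
```

A falling `median_dwell_s` alongside an `approval_rate` that climbs toward 1.0 in later buckets is the signature of the fatigue pattern this post describes.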

The Falsifiable Prediction

Here is the specific claim: within twelve months of this writing, a publicly reported AI agent security incident will occur in which the root cause investigation reveals that a human approved a destructive or malicious action through a standard permission interface, and that the approval was reflexive rather than informed. The incident will initially be framed as an AI safety failure. The root cause will be a UX failure.

The conditions for this incident are already in place. Agentic AI systems are deployed in production environments with write access to databases, codebases, and customer data. The humans nominally overseeing these systems are experiencing exactly the fatigue pattern described here. The permission interfaces have not been updated to account for cognitive degradation. The attack surface is a human who has been trained, by thousands of benign approvals, to approve without reading.

A sufficiently sophisticated prompt injection — an AI agent that manipulates its own context to generate a permission prompt that looks routine but requests a consequential action — does not need to defeat any technical control. It needs to arrive at the right moment in a long session, formatted correctly to match the pattern a fatigued reviewer will approve automatically. That is not a novel attack. It is a social engineering attack wearing AI clothes. Social engineering attacks succeed because they target cognitive vulnerabilities. Permission fatigue is a cognitive vulnerability that is being designed into every AI agent deployment right now.

The organizations that will avoid this incident are not the ones with the most sophisticated technical permission controls. They are the ones that take the cognitive reality of their human reviewers seriously enough to design approval systems that do not depend on sustained attention to function correctly. That design work is possible. It is not happening at the scale the risk requires.

What Builders Should Do Now

If you are building an AI agent tool or deploying one in production, the actionable implications are specific:

Audit your allow list configuration. If you are relying on the default “ask for everything” setting, you are generating maximum prompts with minimum differentiation between low-risk and high-risk actions. This is the worst possible configuration for maintaining meaningful human attention. Build an explicit allow list for the routine operations in your workflow. Reduce the prompt volume until the prompts that remain are ones that genuinely warrant attention.

Build an explicit deny list for catastrophic-risk operations. Do not rely on human approval to prevent your agent from taking irreversible high-impact actions. Put those actions on a deny list with hard enforcement. The human who would have approved a catastrophic action because they were on click forty-seven never gets the chance. For practical patterns here, see the OWASP top 10 for agentic applications — the injection and permission boundary sections are directly relevant.

Track your approval patterns. Log dwell time, approval rate by session position, denial rate by operation category. If your approvals are converging toward 100% as sessions progress, you have confirmation that fatigue is active in your workflow. Use that data to tune your allow list and improve the prompt presentation.

Present context, not commands. When you do need a human approval, the prompt should answer: why is the agent doing this, what is the consequence of approval, and what is the consequence of denial. A developer who understands the rationale for an action can approve it in two seconds. A developer who is staring at a command string and trying to reconstruct the rationale from scratch will either approve reflexively or deny reflexively, and neither response represents genuine oversight.
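For the approval tracking recommended above, a thin wrapper around whatever function shows the prompt is enough to start collecting data. This is a sketch under assumptions: `logged_approval`, its parameters, and the JSON field names are hypothetical, and the reviewer here is a stand-in lambda rather than a real prompt.

```python
import io
import json
import time


def logged_approval(prompt, position, ask, log, clock=time.monotonic):
    """Ask for approval and append one JSON line describing the decision."""
    start = clock()
    approved = bool(ask(prompt))
    record = {
        "position": position,  # index of this prompt within the session
        "prompt": prompt,
        "approved": approved,
        "dwell_s": round(clock() - start, 3),  # time spent before answering
    }
    log.write(json.dumps(record) + "\n")
    return approved


# Demo with a stand-in reviewer that approves instantly; a real tool would
# pass its own prompt function and an open log file instead.
log = io.StringIO()
logged_approval("Allow bash command: git status?", 1, lambda _: True, log)
print(log.getvalue(), end="")
```

Writing one JSON object per line keeps the log easy to aggregate later by session position or operation category.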

The AI API cost calculator and regex tester on WOWHOW are built around the same principle that applies to permission interfaces: the tool should reduce cognitive load at the decision point, not add to it. The design work that matters is eliminating unnecessary friction from routine operations, so that the friction you do impose on consequential operations is experienced as meaningful rather than as more of the same noise.

The Quiet Thing Underneath All This

The deepest issue here is not technical. It is about what we are actually asking humans to do when we put them in a loop.

The human-in-the-loop model was designed for a world where AI systems took occasional, significant actions that humans could meaningfully evaluate. In that world, the loop was closed: the human had the context, the attention, and the time to make a real decision. Agentic AI has changed all three of those parameters simultaneously. Agents take frequent, incremental actions. The context for each action is distributed across a chain of previous actions. The human is being asked to make decisions at machine speed while operating at human speed.

The loop is not closed. It is not going to close itself through better user education or more careful prompting behavior. It will close only when the approval interface is designed to account for how human attention actually works — not how we wish it worked, not how it works in a controlled usability study, but how it works at click forty-seven of a long Thursday session when the deadline is in two hours and the code is mostly working.

That design does not exist yet. It needs to. The window between now and the first major incident caused by reflexive approval is not infinite. The organizations that treat permission fatigue as a real design problem, not a user education problem, are the ones that will not be explaining to a reporter why their AI agent did something catastrophic that a human approved without reading.

The rest of us keep clicking Yes.

Footnotes

6. Ravi Palwe — Review Fatigue Is Breaking Human-in-the-Loop AI: Here’s the Design Pattern That Fixes It

Originally published at wowhow.cloud