DEV Community: David Aronchick

The Map Is the Moat

David Aronchick — Fri, 31 Jul 2026 18:26:49 +0000

On June 17, a coalition of the biggest names in enterprise software published a new open standard called Agentic Resource Discovery, ARD for short. Google, Microsoft, and Salesforce headlined it, with Cisco, Databricks, GitHub, Hugging Face, NVIDIA, ServiceNow, and Snowflake signing on. The stated goal is boring in the way that important infrastructure is always boring: a common way for an AI agent to find out what tools and other agents exist and how to call them, so nobody has to hand-wire every connection between every piece of enterprise software. It is the latest in a run of big-tech moves to set the open standards for agentic AI, and on the surface it looks like more of the same cooperative plumbing.

The two companies not on the list are OpenAI and Anthropic, which seems like a pretty big miss.

Strip the branding off ARD and you are looking at service discovery, an area I spent quite a bit of time on with Kubernetes and know and love, and something distributed systems have needed since there were two computers to introduce to each other. A service comes online and has to announce what it is and where it lives, and other services have to be able to look it up without someone editing a config file by hand. DNS is service discovery for hostnames, while Consul and etcd and ZooKeeper are service discovery for microservices. ARD is the same primitive aimed at agents and tools: each organization publishes an ai-catalog.json manifest under its own domain, an agent describes what it is trying to do, and the discovery layer tells it what is available to do it with. Federated, Apache-licensed, hosted under the Linux Foundation, no single company owning the registry.
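To make the primitive concrete, here is a minimal sketch of a catalog lookup. The manifest field names (`resources`, `kind`, `description`, `endpoint`), the `discover` function, and the keyword matching are all invented for illustration; none of them come from the ARD specification, which I have not reproduced here.

```python
import json

# Hypothetical manifest. Field names are made up for this sketch and are
# NOT taken from the ARD specification.
MANIFEST_JSON = """
{
  "organization": "example.com",
  "resources": [
    {"name": "approvals", "kind": "tool",
     "description": "approve purchase requests",
     "endpoint": "https://example.com/approvals"},
    {"name": "budget", "kind": "tool",
     "description": "check remaining budget by team",
     "endpoint": "https://example.com/budget"},
    {"name": "vendor-db", "kind": "agent",
     "description": "look up vendor records",
     "endpoint": "https://example.com/vendors"}
  ]
}
"""


def discover(manifest: dict, intent: str) -> list[str]:
    """Return names of resources whose description shares a word with the intent."""
    wanted = set(intent.lower().split())
    hits = []
    for resource in manifest["resources"]:
        words = set(resource["description"].lower().split())
        if wanted & words:
            hits.append(resource["name"])
    return hits


manifest = json.loads(MANIFEST_JSON)
print(discover(manifest, "check budget"))  # the budget tool is listed, so it is found
print(discover(manifest, "book travel"))   # nothing listed for this, so nothing exists
```

The second call is the interesting one: from the agent's side, a capability that is not in the catalog is indistinguishable from a capability that does not exist.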

This, in a lot of ways, could be the new Google (or Yahoo, depending how old you are). Whoever controls how services find each other controls which services get found, which is pretty fucking powerful, since the thing that resolves names decides what is allowed to exist. A service that isn't in the registry is a service that, functionally, is not there. Put another way, it's a map, and the map is the moat, and the enterprise incumbents are taking a stab at deciding how you draw it.

People have already been circling around this. Anthropic gave the world the Model Context Protocol in late 2024, which standardized how an agent connects to a tool, and it was good enough that everyone adopted it, all the way to 10,000-plus public servers and 80% of the Fortune 500 touching it within a year. Google shipped Agent2Agent for how agents talk to each other. Both of those layers eventually got donated to the Linux Foundation. And while the labs sit on the governance of how agents connect and how agents converse, ARD stakes out the one layer above both, the catalog of what is worth connecting to in the first place. It's not ALWAYS binary, but this is a move that lets the more traditional enterprise folks (Google, Microsoft, Salesforce, Snowflake, etc.) push what they already own. The labs have every incentive to push another angle, since the enterprise folks are trying to redraw the boundary of the fight to a place where they hold the land, where the labs only show up as entries in a directory somebody else defines.

There is a second, sharper move buried in the design, and it is a bet against the labs' entire product surface. ARD assumes the agent discovers and calls tools on its own, without a conversational interface as the bottleneck. Read that as what it is: a wager that the chatbot, the thing OpenAI and Anthropic have built their enterprise businesses around, is not where real work will happen, right as the two labs are visibly pulling in different directions on what an agent should even be. In the ARD worldview the chat window is a demo, and the actual economy is agents quietly resolving a procurement request against the approval system and the budget tool and the vendor database with no human typing in a box. If that bet is right, the most valuable real estate in enterprise AI is the registry, not the model, and the registry is the thing the labs conspicuously do not have.

Now, credit where it's due, because this is solid engineering and a real fix for a real problem. Before a discovery standard, every autonomous workflow is bespoke integration, hand-built and brittle, which is most of why "agentic AI" has been a great demo and a miserable production system. A federated catalog you publish under your own domain, keeping control of what you expose and to whom, is the correct architecture. It keeps the description of your capabilities next to your capabilities instead of shipping the whole map to a vendor who rents it back to you. I have argued for a while that any catalog that lives away from the thing it describes is a catalog that is already wrong, and a federated, locally-owned manifest is the first design I've seen from the majors that takes that seriously.

But the word "federated" is doing a lot of work, because the history of federated systems is the history of things that were going to stay decentralized and didn't. Email is federated, and most of the world's mail now flows through a handful of providers who decide what counts as spam. DNS is federated, and there is still a root, and there are still registrars, and there is still an afternoon where a name can stop resolving. A2A's own pitch is that it mitigates vendor lock-in, which is precisely the promise every federation makes on day one. Federation describes where the data sits, not where the power settles, and power settles wherever the defaults are set. If ARD becomes how agents find things, then a tool that does not publish an ARD manifest becomes invisible to every agent that uses it, and the companies that authored the schema and ship the most popular catalogs get to define what "discoverable" means. The quieter fight underneath all of this is not who runs the agent but what you are allowed to take back out, and a discovery layer is where that gets decided first. Apache 2.0 licensing means nobody can charge rent on the standard itself. It says nothing about who benefits when your agents reflexively discover Workspace, M365, and Salesforce first.

So the fight nobody framed correctly at the time is not model versus model. It is the layer that owns the enterprise's tools and data drawing a border against the layer that owns the models, and choosing to fight it at the map. Whether you build agents or just buy them, the question worth asking your vendors is not whose model is smartest. It is who controls the directory your agents read before they do anything at all, and whether the map of your own capabilities is something you own or something you have quietly agreed to rent. Being left off that map is what OpenAI and Anthropic are worried about this month. Eventually it is what you should be worried about too.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!

Originally published at The Map Is the Moat.

Free Binaries, Again

David Aronchick — Tue, 28 Jul 2026 18:24:58 +0000

The phrase "open source" has a birthday. February 3, 1998, at a strategy session in Palo Alto, called in a hurry after Netscape announced it was releasing its browser code. Christine Peterson coined it in that room, and her reasoning was specific: "free software" made everyone think about price, and price was the wrong axis. She wanted a word that pointed at the source code itself. Within a few weeks Netscape and O'Reilly were both using it, and the Open Source Initiative existed to defend the meaning.

The thing the word was invented to be different FROM matters more than the word. There was already freeware, free binaries, software you could download and run and pass to a friend at zero cost, and could not read, could not rebuild, and could not carry forward without the company that made it. The whole point of 1998 was to say that those are different categories and that we should stop confusing them, because one makes you a participant and the other makes you a guest.

Twenty-eight years later, we are calling model weights "open," and I would like someone to explain to me which category they're actually in.

Tobi Knaup made the case this week that open-weight AI is having its Kubernetes moment, and I want to be careful, because I agree with nearly everything he wants to happen. He also has standing on this; he co-founded Mesosphere, built DC/OS, and then watched Kubernetes take the category out from under him. I was on the other end of that trade. I was the first (non-founding) product manager on Kubernetes, which means I spent 2015 and 2016 doing the thing that ran him over.

So when he says "Kubernetes moment," I know exactly which moment he means. And I want to push on the analogy harder than he does, because if you take it seriously — actually, seriously, not as a compliment — it is asking for a great deal more than a download link.

Start with the definition, since we already have one. The OSI published its Open Source AI Definition in late 2024, and it asks for four freedoms (based on the Stallman-coined freedoms): use the system for any purpose, study how it works and inspect its components, modify it, and share it. Open weights nail use and share. However, models are more complicated than source code. You cannot study a model the way you study source code; you can probe its behavior, which is a different activity that we call evals precisely because it is not reading. And you cannot modify how it was made. You can fine-tune the output of a process you were never shown, which is closer to sanding a table than to changing the design. Meta's LLaMa ships under a custom license with a monthly-active-user threshold attached, and publishes nothing meaningful about its training data, which is why OSI has had to keep posting things with titles like "Meta's LLaMa license is still not Open Source" and why Meta simply rejected the definition rather than argue with it.

I've been in this rodeo before. I have argued that open weights are the most important thing happening in AI economics, and I still believe it. The commodity tier is going to eat this market from below. Good enough at a fraction of the cost, running on hardware you control, is the winning bid for most enterprise work, and the labs sealing off the top of the index do not change that. None of what follows is a case against open weights. It's a case against the sentence people say right after "we went with the open model," which is usually some version of: so now we're not dependent on anyone.

A bit from my history: I was there when, on July 21, 2015, Kubernetes hit 1.0, and Google handed it to a foundation that did not exist the week before. While we released it under Apache 2, I believe the real novelty was the governance. There was real friction from people who did not want to be tied to Google or its schedule. Giving the project away was how you removed our veto. And that's what we wanted! We knew that if we exercised too tight a control over the top, we never would be able to get the industry moving in that direction. I was only one of the people weighing in on this decision, but I am so glad we did.

With the foundation and the license, people also had the freedom to completely fork. Not grab binaries and then hope upstream stayed compatible; people could take every element of Kubernetes (other than the name) and fork away. Not that anyone planned to use this threat, but it kept the community honest. If the steward went bad, or slow, or greedy, you could take the whole project and keep going, and everyone knew it, which is exactly why nobody had to. That is what "open" purchased: Continuity without asking.

This is where I start to reject the concept of "open" models. You cannot fork a model; there is nothing to fork. A fork of Kubernetes is a living project with a build and a roadmap you now control. A "fork" of an open-weight model is a fine-tune of a snapshot, because the two inputs that would let you continue its development, the training data and the compute, were never in the box. The lab kept the factory and shipped you the output. That is not a criticism of the lab! It's a description of what you received.

At the end of 2017, Jeremy Lewi, Vishnu Kannan, and I announced Kubeflow, which was our attempt to open-source the way Google ran machine learning internally. It was ALSO Apache 2.0, with contributions from Google, Cisco, IBM, and Red Hat. It had all four freedoms, for real, the full 1998 package, no asterisks. Sadly, one thing that really hurt us in the beginning (and even now) was that it was miserable to install, miserable to upgrade, and miserable to keep running. We shipped a pile of genuinely open components and told people they had a platform.

We weren't alone; Kubernetes was ALSO miserable. In 2015 you stood a cluster up by hand, in the right order, and if you got certificate rotation wrong you found out about it eleven days later. Kelsey Hightower wrote a tutorial called Kubernetes The Hard Way, and it wasn't satire; it was the documentation that a lot of us actually used. Difficulty didn't kill Kubernetes; it barely slowed it down.

So, difficulty was never the variable. Two projects, same license, same four freedoms, both a nightmare on day one. What Kubernetes had was a few hundred people at Red Hat and Rancher and three cloud providers whose paychecks depended on making it installable, plus a foundation that made it safe for all of them to show up in the same room, with a license that made it legal for them to try.

I think this is the thing missing from all the discussions of model "openness." Publishing something and getting it adopted are two different projects with two different budgets, and getting it adopted and having a community are two more. We shipped the first one and wrote "platform" on the box. We are doing a version of the same thing with open weights, and it is going to cost somebody a couple of years.

Because when a lab hands you weights, you're getting the cheapest artifact in the building. Not cheap to make — training is the most expensive thing anyone does with a GPU. Cheap relative to what it costs to keep a model running in production, for other people, which is where the actual bill lives.

Go count what a frontier lab actually operates. Serving infrastructure that holds a tail-latency target while traffic swings 10x over an afternoon. A batching and caching strategy that decides whether the unit economics work at all. Quantized builds for whatever silicon the customer actually bought, not the silicon you wish they'd bought. An eval suite that catches the regression before the customer does. A safety layer, an abuse pipeline, a deprecation policy, capacity planning eighteen months out, and a human being holding a pager at 3 AM with the authority to roll the whole thing back.

None of that shipped with the weights. And, most importantly, almost none of it is represented in open source either. Not because open source is bad at it, but because open source has never been in the business of operating things on behalf of strangers. Nobody's pager is attached to your cluster. That was equally true of Kubernetes, which is precisely why Red Hat and the clouds got to build real businesses on top of a free thing, and why nobody found that outrageous.

The labs and the hyperscalers are doing an enormous amount of this work, and the open-weight conversation has picked up a bad habit of calling all of it rent. Most of what they're charging for is the product. You can hate the price and still be accurate about what's being priced.

Knaup is right that a serving stack has shown up, and it's a good one. vLLM, SGLang, llama.cpp, Ollama, MLX — I use these; they're excellent, and honestly, they were never in doubt. Engineers build inference runtimes because inference runtimes are FUN. Nobody open-sources a deprecation policy. Nobody sends a pull request with a capacity plan. The missing pieces are missing because they're boring and because somebody has to be accountable for them, and accountability is the one thing a download cannot transfer.

People are choosing open-weight models right now as an insurance policy, a hedge against a lab repricing them, deprecating them, or quietly re-aligning the thing they built on. I understand the instinct completely; I've written about what it costs when the wire in front of the model goes away. That's not to say weights on your own disk, which let you keep running what you already have, aren't genuinely worth something. But they do not let you extend it in the spirit of open source. When the version you're on stops being good enough, your options are to wait for whatever the lab decides to release next or to stop.

Run it, or stop. That's the whole menu.

So let's total up what "open" has actually bought us so far. You can run a very good model on hardware you already own, and the top slice of the inference bill, the expensive slice with somebody's margin stapled to it, goes away. I've spent a lot of words arguing that it's going to reshape this market from the bottom, and I still believe that. It is also, start to finish, an argument about price.

But I want to bring the wisdom of those people in Palo Alto in February 1998 back into the current discussion. They reached for a new phrase precisely because the old one made everybody argue about price, and price was the wrong axis. It took about twenty-eight years for us to need their word again, and we've spent it on a discount.

Originally published at Free Binaries, Again.

Acceptable for Inference

David Aronchick — Tue, 21 Jul 2026 18:24:28 +0000

Starcloud closed $170 million at a $1.1 billion valuation this month, the fastest company in Y Combinator's history to reach a billion. Starcloud-2 launches later this year carrying Blackwell B200s to run commercial cloud workloads in orbit for customers that already include AWS and Google Cloud. This is pretty rare for a YC company! ACTUAL paying tenants, this year, in a rack that is going around the Earth every ninety minutes.

When Google published Project Suncatcher, the press took the obvious angle: Google wants data centers in space, fleets of TPUs linked by free-space optics into kilometer-wide arrays of 81, two test birds going up with Planet by early 2027. Solar power that never sets, which seems exactly right! Let's do it!

But Google ran its TPUs through a particle accelerator to simulate the dose of low Earth orbit, and the compute chips came through fine. The high-bandwidth memory took uncorrectable errors that the error-correcting code could not catch and repair, at a rate Google described as "likely acceptable for inference."

Not "acceptable" for everything, but "acceptable for inference." That is pretty specific guidance: running in space is suitable only for a job that will forgive the occasional wrong bit.

This makes sense! A model writing the seventh paragraph of a product description genuinely does not care if one weight in one layer got nudged by a passing cosmic ray. The output was a probability distribution to begin with, so a little noise in the machine is a rounding error inside a process that was already rolling dice.
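A minimal sketch of what "one wrong bit" means for a single stored weight, assuming 32-bit IEEE 754 floats. The `flip_bit` helper and the example weight are mine, not anything from Google's test, and real accelerators often store weights in narrower formats. The point it shows is that how forgiving the flip is depends heavily on which bit gets hit.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 and return the result."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
low = flip_bit(weight, 0)    # lowest mantissa bit: a tiny nudge
high = flip_bit(weight, 30)  # top exponent bit: a very different number

print("original:        ", weight)
print("low-bit flip:    ", low, "relative change", abs(low - weight) / weight)
print("exponent-bit flip:", high)
```

A flip in the low mantissa bits really is a rounding error. A flip in the exponent is not, and nothing in the stored value tells you which kind you got.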

Precision was always a marketing choice

We have spent the entire AI era letting people believe these systems are precise, and orbit just makes the imprecision physical. Down here, the fuzziness hides inside phrasing that sounds authoritative. Up there, it's a particle flipping a one to a zero in a memory cell, and the model downstream will report the result with exactly the same confidence it would have had if the bit were correct. I wrote a while back about a support chatbot that told a customer they had 365 days to return a product when the real policy was 30, every dashboard green, the model perfectly poised while it was flatly wrong. Now picture that same confidence, except this time the error was injected by the sky.

For a chatbot, who cares. The trouble starts the instant somebody wires a forgiving workload to an unforgiving job. Starcloud has filed to put 88,000 satellites in orbit to process data rather than relay it, and somewhere in the addressable market for eighty-eight thousand orbiting accelerators is a company that will run something that counts on hardware whose spec sheet says, in so many words, good enough to be wrong sometimes. The problem isn't JUST that it can be wrong, but that it can be wrong silently, since nobody in that chain is going to be told which rack the answer came from. That is the entire product promise of cloud: you don't think about the hardware. It is a very good promise right up until the hardware develops opinions.

The rot is already in the building

If you're about to file this under "space is weird," don't. Silent data corruption is not JUST an orbital phenomenon. Meta went looking on its own fleet and found corrupted computations coming out of perfectly healthy-looking CPUs at a rate high enough to matter across a datacenter, and the industry now has an Open Compute working group and a whitepaper about it specifically because inference multiplies the blast radius: one marginal device quietly wrong, hundreds of thousands of inferences an hour, every one of them delivered to a customer with full confidence. The causes are mundane and unfixable: timing violations, aging, marginal defects, temperature, voltage, and yes, cosmic rays hitting silicon at sea level.
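One generic way to catch a quietly wrong device is to run the same deterministic computation twice on different hardware and compare the results. The sketch below simulates that with two plain functions; `healthy_device`, `marginal_device`, and `cross_check` are hypothetical stand-ins, and this is not a description of what Meta or the Open Compute group actually do.

```python
import hashlib


def digest(values: list[int]) -> str:
    """Checksum of a result so two runs can be compared cheaply."""
    return hashlib.sha256(repr(values).encode()).hexdigest()


def healthy_device(xs: list[int]) -> list[int]:
    return [x * x for x in xs]


def marginal_device(xs: list[int]) -> list[int]:
    out = [x * x for x in xs]
    out[2] ^= 1  # simulate one silently flipped bit in one result
    return out


def cross_check(xs, primary, shadow):
    """Run on both devices; return the primary result and whether they agree."""
    a, b = primary(xs), shadow(xs)
    return a, digest(a) == digest(b)


inputs = [1, 2, 3, 4]
_, agree = cross_check(inputs, healthy_device, healthy_device)
_, agree_with_bad = cross_check(inputs, marginal_device, healthy_device)
print("two healthy devices agree:", agree)
print("marginal device agrees:   ", agree_with_bad)
```

The catch is the price: you pay for the work twice, which is exactly why most fleets only spot-check.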

So orbit is not introducing a new bargain; it is turning up the gain on one we already made and mostly declined to discuss. What Google did that's genuinely new is write the terms down. Every terrestrial datacenter is running some rate of silent corruption it does not advertise and cannot fully measure. Google put a number next to it, attached a workload class, and called it acceptable. That candor is rare enough here that it reads as alarming, but it shouldn't. That candor should come stapled to the side of every output, like an FDA label.

We named this bargain once already

Distributed systems made exactly this trade a long time ago, on the ground, and we even gave it a name. Eventual consistency. We decided that for a shopping cart or a like counter, the right answer soon-ish beats the exact answer slower, and we built half the modern internet on top of that call. It sits close enough to the point of this whole blog that it's in the subtitle. But the expensive lesson, the one every architect learns once and never forgets, was working out which systems you are absolutely not allowed to make eventually consistent. Amazon runs its catalog eventually consistent and its payments emphatically not, and the entire art was knowing exactly where that line sat.
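For a feel of what the "right answer soon-ish" trade looks like in code, here is a sketch of a grow-only counter, one of the textbook data structures for an eventually consistent like counter. The choice of this structure and the `GCounter` class are my illustration; the argument above does not depend on any particular implementation.

```python
class GCounter:
    """Grow-only counter: each replica counts locally, merges take the max per replica."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Taking the max per replica makes merges safe to repeat and reorder.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


us, eu = GCounter("us"), GCounter("eu")
us.increment()
us.increment()
eu.increment()

before = (us.value(), eu.value())  # replicas disagree for a while
us.merge(eu)
eu.merge(us)
after = (us.value(), eu.value())   # and converge once they have exchanged state
print(before, after)
```

Between the increments and the merges, two readers can see two different totals. That is fine for likes and a disaster for a ledger, which is the whole point of knowing where the line sits.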

The orbital memory result is that same fork, pushed down to the level of a single bit and handed to radiation to decide. Correctness is about to become a per-workload dial that gets set, in part, by how much cosmic radiation a given satellite happened to eat that week. And the thing generating the answer is not going to print which setting it was running on.
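If correctness is a per-workload dial, somebody has to write the dial down. A minimal sketch, assuming each piece of hardware declares whether it gives exact results and each workload declares whether it needs them; every name here (`Hardware`, `Workload`, `Answer`, `place`, `run`, the rack names) is hypothetical, and the `"result"` value is a placeholder for real output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    name: str
    exact: bool  # False means "acceptable for inference" grade memory


@dataclass(frozen=True)
class Workload:
    name: str
    needs_exact: bool


@dataclass(frozen=True)
class Answer:
    workload: str
    value: str
    ran_on: str   # provenance travels with the answer
    exact: bool


def place(workload: Workload, fleet: list[Hardware]) -> Hardware:
    """Pick the first hardware whose guarantee satisfies the workload."""
    for hw in fleet:
        if hw.exact or not workload.needs_exact:
            return hw
    raise RuntimeError(f"no hardware meets the correctness needs of {workload.name}")


def run(workload: Workload, fleet: list[Hardware]) -> Answer:
    hw = place(workload, fleet)
    return Answer(workload.name, "result", hw.name, hw.exact)


FLEET = [Hardware("orbital-rack-7", exact=False), Hardware("ground-rack-2", exact=True)]
copy = run(Workload("product-copy", needs_exact=False), FLEET)
ledger = run(Workload("payments-ledger", needs_exact=True), FLEET)
print(copy)
print(ledger)
```

The design choice that matters is the `ran_on` and `exact` fields on the answer: the setting is recorded next to the output instead of being left for someone to reconstruct later.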

So the question worth asking about compute in space was never whether we can do it. We can (at least to some degree), the first paying workloads go up this year, and the solar-power argument is real and worth taking seriously. The question is who keeps track of which answers came back from a place where the memory does not reliably hold, because the model certainly won't volunteer it. It'll sound exactly as confident either way. It always does.

Google, to its credit, told us the setting. Ask your own vendors what theirs is.

Originally published at Acceptable for Inference.

The Paramount Problem

David Aronchick — Sun, 19 Jul 2026 00:16:12 +0000

In 1948 the Supreme Court told the movie studios they could not own the theaters. At the time the five majors, Paramount, MGM, Warner, Fox, and RKO, made the films, controlled how they were distributed, and owned the cinemas that showed them. They used that integrated grip to run a racket called block booking: a theater that wanted the one film audiences were actually lining up for had to agree, sight unseen, to rent a year's worth of the studio's other dreck along with it. The Court looked at the arrangement, called it what it was, and forced the studios to sell the theaters. That settlement, the Paramount Decrees, governed Hollywood for seventy-two years.

Then in 2020 the Justice Department asked a judge to throw the whole thing out. The argument was that the decrees were a relic, that the market had moved on, that streaming had made the old theater monopoly irrelevant. The judge agreed, with a two-year sunset. And here is the part that should stop you cold. The instant the rule came off, the structure it had banned reassembled itself, just with fiber instead of film reels. Netflix makes the content, owns the distribution pipe, and owns the direct relationship with the customer. So does Amazon. So does Apple, so does Disney. The exact vertical integration the Supreme Court spent a generation dismantling is now the default business model of everyone who streams video, and we call it innovation.

I bring this up because the AI industry is running the same experiment right now, at roughly ten times the speed, and oddly no one is bringing up the Paramount Decrees. In late June OpenAI revealed its first co-designed chip, stood up a consulting subsidiary to deploy its own models, and kept building its own data centers. The model vendor increasingly also owns the silicon, the cloud, the agent framework, and the people who walk into your office to install it. Every cloud provider is racing to own more of that same stack. So, while the pieces are different, the shape is looking a whole lot like the old thing.

The thing the Court objected to was not bigness for its own sake; it was a practice the antitrust laws had put off-limits half a century earlier: tying. That means using control of the one input you cannot do without to force you into taking the things you would never choose on their own merits. If there is something that destabilizes the AI stack, that's going to be it. When the company that owns the model you depend on also owns the chip it runs on, the cloud it runs in, the framework that orchestrates it, the consulting arm that deploys it, and the eval harness that grades whether it's working, the bundle stops being a convenience so significant you're willing to overlook all the downsides. If people wanted the model and find they are now renting a year's worth of dreck along with it, sight unseen, they are going to get mad.

None of this means integration is evil, and the Columbia Law Review crowd that treats every merger as a crime misses the same point the deregulators do. Vertical integration genuinely lowers transaction costs, can make the product better, and, sometimes (as with electricity right now), it is the only way to secure a critical input at all. The studios made some of the best films in the history of the medium inside the integrated system, and nobody who saw them in a studio-owned theater felt robbed in the moment. The point of the decree was never that integration is always bad. It was narrower and more durable than that: an integrated incumbent can quietly convert a great product into a captive market, and by the time enough people notice, the only available fix is a court order and thirty years of waiting.

If you build systems rather than antitrust briefs, the reason to care is that the architecture decision and the market-structure decision turn out to be the same decision wearing two hats. A modular stack, open weights you can run yourself, a data plane you control, compute that moves to where your data already lives, clean seams between layers you can actually pull apart, is not only the better engineering pattern. It is the thing that keeps you from waking up one morning inside someone else's block-booking arrangement with no exit that doesn't cost you a rewrite. Every layer you let collapse into a single vendor's bundle is a layer you have agreed, in advance, never to have an opinion about again.
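In code, a "seam" is usually nothing fancier than an interface your application owns. A minimal sketch, assuming a text-completion call is the only thing the application needs from a model; `ModelBackend`, `HostedModel`, `LocalOpenWeights`, and `summarize` are hypothetical names, and the two backends just return tagged strings instead of calling anything real.

```python
from typing import Protocol


class ModelBackend(Protocol):
    """The seam: the only thing application code is allowed to know about a model."""

    def complete(self, prompt: str) -> str: ...


class HostedModel:
    # Stand-in for a vendor API client; a real one would make a network call.
    def complete(self, prompt: str) -> str:
        return f"[hosted] {prompt}"


class LocalOpenWeights:
    # Stand-in for a locally served open-weight model.
    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


def summarize(ticket: str, model: ModelBackend) -> str:
    """Application logic depends on the interface, not on any vendor."""
    return model.complete(f"Summarize: {ticket}")


print(summarize("printer on fire", HostedModel()))
print(summarize("printer on fire", LocalOpenWeights()))
```

Swapping vendors here is a one-line change at the call site. Every vendor-specific feature that leaks past the interface into `summarize` makes that line longer.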

Hollywood needed the Supreme Court to unbundle it because the studios were never going to do it themselves. Nobody standing inside an integrated monopoly wakes up wanting to break it apart. The companies assembling the AI stack this year won't either, and we should stop expecting them to. The only thing that keeps the layers separable is whether the people writing the checks insist on the seams while the seams still exist. Right now, mostly, they are buying the bundle and calling it the future. We have seen this movie. We even know the runtime: about seventy-two years to break the thing up, and roughly eighteen months to put it back together.

Wondering whether your AI stack has any seams left to pull apart? Check out Expanso. Or don't. Who am I to tell you what to do.

Originally published at The Paramount Problem.

Three Bridges, Same River

David Aronchick — Tue, 14 Jul 2026 18:19:13 +0000

On June 29, the Supreme Court decided Trump v. Slaughter, a 6-3 ruling about whether a president can fire Federal Trade Commission commissioners without cause. And, setting aside the politics of it, it may have knocked the legal floor out from under every byte of European personal data sitting on an American server.

The EU-US Data Privacy Framework is the agreement that lets companies move European personal data to the United States without individually lawyering every single transfer. It rests on a European Commission finding that the US provides protection "essentially equivalent" to EU law, and that finding leans on the FTC acting as an independent enforcer. How hard does it lean? By one count, there are separate references to the FTC in the shared legislation. Since 2000, every version of the EU-US data deal has named the FTC as the cop on the beat, and the Supreme Court just ruled that the cop serves at the pleasure of the president. One day after the ruling, noyb sent the Commission a letter asking it to withdraw the adequacy decision in an orderly fashion, and started preparing a challenge before the Court of Justice of the European Union. The Commission, for its part, says it is assessing the implications, which is what you say when your lawyers are already in the building on a Saturday.

By some accounts, the ruling does not touch the redress mechanism, because the Data Protection Review Court sits inside the Department of Justice, not the FTC. That is a strange comfort, though, because the DPRC exists by executive order, inside the executive branch, revocable by the same pen that created it. And the Privacy and Civil Liberties Oversight Board, which the framework also cites for oversight, has been functionally headless since January 2025, when its Democratic members were fired. The defense amounts to this: the specific pillar the Court demolished isn't load-bearing, as long as you ignore the other pillars already lying in the yard.

This isn't unprecedented, by the way. Safe Harbor was adopted in 2000 and lasted fifteen years before the CJEU struck it down in Schrems I. Privacy Shield was adopted in July 2016 and lasted four years before Schrems II killed it, on the grounds that US surveillance law and the lack of independent redress made the promises unenforceable. The Data Privacy Framework was adopted on July 10, 2023, which means the challenge now being drafted lands days after its third birthday. Each bridge was negotiated faster than the last and is failing faster than the last, and every one of them failed for the same underlying reason: the European legal system requires independent oversight of data access, and the American legal system keeps demonstrating, in increasingly explicit terms, that it does not have any to offer.

In civil engineering, when the same bridge design collapses twice in the same river, nobody commissions a third from the same blueprints and calls the problem solved. They ask what's wrong with the design. And the design flaw here is not the FTC, or the DPRC, or whichever acronyms get shuffled in the Framework 4.0 that gets negotiated in a panic next year. The design flaw is the assumption underneath all three frameworks: that the data has to cross the river at all.

Think about what an adequacy decision actually is. It is a stack of paper asserting that a warehouse in Virginia is, legally speaking, in Europe. Everything else follows from trying to make that fiction hold against a legal system that keeps telling you, on the record, that it won't. For twenty-five years the compliance industry has been building ever more elaborate versions of the same paper bridge, while treating the underlying act, copying the data out of its jurisdiction, as a law of nature.

It isn't remotely a law of nature. It is an architectural choice, and it stopped being a necessary one years ago. Leave the personal data where it was collected, run the processing next to it, and move the outputs: the aggregates, the model updates, the answers. (Have I mentioned we have a platform that helps you do just that?) Those cross borders just fine, because what GDPR governs is personal data, not arithmetic performed on it. In other words, the entire quarter-century of transatlantic legal drama exists to legalize a data transfer that, for a growing share of workloads, you no longer need to perform. The cheapest adequacy decision is the transfer you never make. (Yes, that is the same lesson as your egress bill. Funny how physics and law keep converging on the same answer.)
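A minimal sketch of that pattern, assuming a hypothetical per-region store of personal records. The summarizing function runs where the records live and returns only aggregates, so no individual row leaves its jurisdiction. The record fields, region names, and function names are all invented for illustration, and whether a given aggregate really counts as non-personal (small groups can still identify people) is a legal question this sketch does not settle.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Record:
    """One row of personal data. It never leaves the region that holds it."""
    user_id: str
    age: int
    monthly_spend: float


@dataclass(frozen=True)
class RegionalSummary:
    """The only thing that crosses the border: counts and averages."""
    region: str
    count: int
    mean_age: float
    mean_spend: float


def summarize_in_region(region: str, records: list[Record]) -> RegionalSummary:
    """Runs next to the data and returns aggregates only."""
    return RegionalSummary(
        region=region,
        count=len(records),
        mean_age=mean(r.age for r in records),
        mean_spend=mean(r.monthly_spend for r in records),
    )


def combine(summaries: list[RegionalSummary]) -> RegionalSummary:
    """Runs centrally, on summaries alone, weighting each region by its count."""
    total = sum(s.count for s in summaries)
    return RegionalSummary(
        region="all",
        count=total,
        mean_age=sum(s.mean_age * s.count for s in summaries) / total,
        mean_spend=sum(s.mean_spend * s.count for s in summaries) / total,
    )


# Hypothetical data: each list lives in its own jurisdiction.
eu_records = [Record("u1", 30, 100.0), Record("u2", 40, 200.0)]
us_records = [Record("u3", 50, 300.0)]

eu_summary = summarize_in_region("eu", eu_records)  # computed in the EU
us_summary = summarize_in_region("us", us_records)  # computed in the US
overall = combine([eu_summary, us_summary])         # only summaries travel
```

The central step never sees a `Record`, only three numbers per region, which is the whole point of the design.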

The EU is not waiting around, incidentally. Brussels adopted a tech sovereignty package in June built on the observation that Europe depends on foreign suppliers for over 80% of its key digital products and infrastructure, and the Commission is already weighing whether sensitive government workloads should sit on US clouds at all. The winds, as they say, are blowing in a direction. Pay attention.

The adequacy decision remains in effect today, and the thousands of certified companies can keep relying on it, right up until the CJEU says otherwise, on whatever schedule the CJEU feels like. If your compliance posture depends on the third attempt at a bridge whose first two attempts are at the bottom of the river, you do not have a compliance posture. You have a countdown, and you don't get to see the number.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

Originally published at Three Bridges, Same River.

Painted Seams

David Aronchick — Fri, 10 Jul 2026 18:27:12 +0000

Apple raised the price of a Mac and an iPad this month, somewhere between 15 and 25% depending on the configuration, and the machines did not get better. Sadly, the thing that made them more expensive doesn't even get to be part of a keynote: the memory. Framework jacked up its DDR5 upgrade prices by half. And they aren't the only ones: Dell warned of hardware increases measured in hundreds of dollars, and back in February, Micron quietly retired Crucial, the consumer memory brand a whole generation of people who built their own PCs grew up on, so it could point every wafer it makes at enterprise AI. If you have shopped for a laptop lately and felt like you were being mugged, you were not imagining it, and it has almost nothing to do with the laptop.

The mechanism is cleaner and crueler than a shortage.

One bit up top, three bits gone below

One thing to understand is that the topology of a machine (or cluster) that supports ML is very different from even the God Box you may have sitting under your desk. The AI buildout wants high-bandwidth memory, HBM, the exotic stacked stuff that feeds a GPU, and there are (for now anyway) exactly three companies on the planet that can make it: Micron, SK Hynix, and Samsung. The catch is that HBM and the plain DRAM in your phone come off the same fabs and the same finite pile of wafers. When Micron commits a wafer to HBM, every bit of HBM it produces costs roughly three bits of the conventional memory it could have sold to everyone else. As a result, HBM has quietly grown to claim around 23% of all DRAM wafer output, up from 19% a year earlier, and every point of that came out of the supply that used to go into cheap laptops, mid-range phones, game consoles, and the SSD in your camera.
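The trade is easier to see as arithmetic. The sketch below takes the figures in this section (a roughly three-to-one bit penalty, and HBM's wafer share moving from 19% to 23%) and assumes, purely for illustration, that total wafer output stays flat and that every non-HBM wafer yields the same number of conventional bits.

```python
# Figures from the text. The flat-total-output assumption is illustrative only.
TRADE_RATIO = 3.0          # conventional bits forgone per HBM bit
hbm_share_before = 0.19    # share of DRAM wafers going to HBM a year earlier
hbm_share_after = 0.23     # share now

# Measure everything in conventional-bit equivalents of one year's wafers.
conventional_before = 1.0 - hbm_share_before
conventional_after = 1.0 - hbm_share_after

# Fraction of last year's conventional supply that disappeared.
conventional_drop = (conventional_before - conventional_after) / conventional_before

# The wafers that moved over yield only a third as many bits as HBM.
conventional_bits_lost = hbm_share_after - hbm_share_before
hbm_bits_gained = conventional_bits_lost / TRADE_RATIO

print(f"conventional supply change: -{conventional_drop:.1%}")
print(f"conventional bits lost:     {conventional_bits_lost:.4f}")
print(f"HBM bits gained:            {hbm_bits_gained:.4f}")
```

Under those assumptions, a four-point shift in wafer share removes about 5% of the conventional supply while adding only a third as many HBM bits.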

So the price did what prices do when a giant new buyer corners a fixed supply. Consumer DRAM ran up as much as 90 to 95% quarter over quarter in the first three months of 2026 alone. Contract DRAM prices were up more than 170% year over year heading into the year, and enterprise SSDs doubled. Gartner is telling buyers to brace for a 130% memory cost surge, Bloomberg has been calling it a genuine crisis since February, and IDC does not expect real relief until new fabs come online in 2027 or 2028. Intel's Lip-Bu Tan put it more bluntly: no relief until 2028. Two full years in which the memory inside a device that has nothing to do with AI costs more because of AI.

And the allocation is already locked. Micron's entire 2026 HBM output sold out under binding contracts before the year even started, some of it under multi-year deals that lock in roughly $100 billion in minimum contracted revenue and $22 billion in upfront customer cash. What you have is a perfect storm of supply chain constraints. The consumer is not being outbid in a live auction, and has no say in the price at all. Instead, they are last in a line that was already full when the doors opened.

The stockings went to the parachutes

There is a rhyme here, and it is not from the chip industry.

In 1942, nylon and silk stopped showing up in American stores, and the reason was not that DuPont had forgotten how to make them. The War Production Board requisitioned the material for parachutes, glider tow ropes, and powder bags. Silk stockings, the small everyday luxury of an entire generation of women, simply vanished, and the vanishing had nothing to do with anyone's feelings about stockings. It was a straightforward consequence of a bigger buyer with a bigger priority taking the whole supply. The famous part is what people did about it: they drew seams up the backs of their bare legs with eyeliner to fake the look of a stocking that no longer existed. Painted seams: a cosmetic workaround for a supply chain that had been pointed somewhere else.

That is where the consumer memory market sits in 2026, minus the war and minus the ration book that at least made the trade honest. In 1942 the government stood up and said out loud that the material was going to the front, and it handed you a coupon so you understood the deal. In 2026 there is no declaration and no coupon. There is just a price, and a laptop that costs 25% more for reasons the person at the counter cannot explain, and a memory maker retiring its consumer brand rather than say in plain words that you are no longer the customer.

The shape of the demand is the problem

It is true that memory has always been cyclical (gluts and shortages are the heartbeat of that industry), and the fabs really are coming in 2027. But the cyclicality is showing up elsewhere, because right now the ENTIRE STACK is cyclical. The electric bill and the model that just got eaten from the bottom are showing the exact same characteristics. Turns out when you concentrate an enormous demand into one synchronized spike aimed at one finite shared resource, the shock does not stay where you put it. It radiates until it finds the person who never placed an order and hands them the bill, whether the resource is wafers or watts.

A single, centralized, all-at-once draw is the thing that breaks a shared pool. It was true of the grid in one Virginia county and it is true of three fabs in Asia. Spread the demand across time and place and the same total consumption stops being a crisis and goes back to being a line item. That is the physics of shared resources, and we keep relearning it the expensive way, one requisition at a time.

Your kid's laptop went to war this year. Nobody told you, because this time there was no coupon to hand out. Just eyeliner, and a longer wait for the seams to come back.

Originally published at Painted Seams.

The Cheapest Connection You Never Build

David Aronchick — Tue, 07 Jul 2026 18:36:39 +0000

On June 18, FERC stopped studying the data center power problem and started issuing orders about it. No notice of proposed rulemaking, no request for comment, none of the usual multi-year administrative slow walk. Instead, six show cause orders under Section 206 of the Federal Power Act, aimed at the six organized markets that keep the lights on for roughly 200 million Americans across more than thirty states: PJM, MISO, SPP, the California ISO, ISO New England, and the New York ISO. Each operator gets 30 days to account for its spare capacity and 60 to defend or rewrite its rules, with the goal being to get big loads onto the grid fast, or let them co-locate with their own generation, and either way, stop sticking ordinary ratepayers with the bill.

There's a great reason for this decision: the interconnection queue in this country is genuinely broken. Getting a large load connected can take years, and by Bloomberg's reporting PJM now projects it will be six gigawatts short of its own reliability requirement by 2027, while wholesale electricity has climbed as much as 267% against where it sat five years ago. Data center demand, for its part, is on track to nearly triple through 2035. Back in April I wrote about eleven gigawatts of announced capacity sitting frozen because the grid physically could not deliver the power. So FERC moved. This is good.

And the ratepayer protection is also good! Ish. If you want priority access, you pay for it. The data center pays for its own interconnection: the wire, the substation, the transformer, the steel. This kills a real and ugly abuse, where a utility quietly smears a hyperscaler's hookup costs across every residential meter in the region and calls it the cost of doing business. Fixing that is worth doing, and FERC deserves credit for doing it under threat of a federal enforcement deadline instead of a strongly worded letter.

But the interconnection cost is the part you can itemize, and the part you can itemize is the cheap part. The expensive part has no line item, and it cannot be assigned to anyone, because a grid is a shared pool and the price clears at the margin. When you bolt six gigawatts of new demand onto a system that is already six gigawatts short, the marginal price moves for every single person drawing off that pool. The data center can write a check for its on-ramp, but it cannot write a check for the price of the electricity it just bid up, because that cost never arrives as an invoice. It arrives as everyone's rate. That 267% did not happen because some grandmother in Toledo had her interconnection mispriced. It happened because demand outran supply in a market that prints exactly one clearing price for all of us. In other words, FERC can ring-fence the wire, but physics does not recognize the fence around the scarcity.
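The single-clearing-price mechanism is simple enough to sketch. What follows is a toy merit-order market with made-up generators and prices, not PJM's actual auction: generators are dispatched cheapest first, and the offer of the last unit needed sets the price every buyer pays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generator:
    name: str
    capacity_gw: float
    cost_per_mwh: float  # offer price, dollars per MWh


def clearing_price(stack: list[Generator], demand_gw: float) -> float:
    """Dispatch cheapest-first; the marginal unit's offer sets the price for all."""
    remaining = demand_gw
    for gen in sorted(stack, key=lambda g: g.cost_per_mwh):
        remaining -= gen.capacity_gw
        if remaining <= 0:
            return gen.cost_per_mwh
    raise ValueError("demand exceeds total supply")


# Hypothetical supply stack; every number here is illustrative.
stack = [
    Generator("nuclear", 30, 20.0),
    Generator("gas_combined_cycle", 40, 45.0),
    Generator("gas_peaker", 10, 150.0),
]

before = clearing_price(stack, demand_gw=68)  # met by nuclear + combined cycle
after = clearing_price(stack, demand_gw=74)   # six more GW pulls in the peaker

existing_load_gw = 68
# Extra cost per hour borne by load that was already there (GW to MW is x1000).
extra_cost_per_hour = existing_load_gw * 1000 * (after - before)
```

In this toy case the new load pays for its own six gigawatts, and the existing 68 gigawatts still pay the higher price on every megawatt-hour. That second number is the part no interconnection invoice captures.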

If we needed evidence of this, it arrived six days later, when the same PJM agreed to bolt a new capacity advisory onto its emergency playbook. The advisory is basically a way to warn its 67 million customers that supply can now run short even on ordinary days, without the heat waves that used to be the only thing that rationed power. That is the operator conceding, in its own paperwork, the part the order cannot itemize. The scarcity is already here, it is shared, and it does not read the invoice.

I like to think of this as a "new stadium" problem. You can make a new stadium pay for its own parking structure and its own freeway on-ramp, and then stand at the ribbon-cutting and announce that on game day there will be no trouble getting here to enjoy your $73 beer, hotdog, and soft serve out of a plastic helmet. But every road for ten miles received no upgrades whatsoever, was certainly not in the budget, received no zoning variance, and, generally, will just degrade much more quickly. And the people who will eat it are the ones six blocks away who were sitting at home PROBABLY watching the game on TV (which is what they were doing before the new stadium went in anyway). We have spent a century learning that the parking lot is never the part of the development that costs the neighborhood something. The road is.

The whole order reveals, starkly, the bigger point: the demand should be measured against the shared grid, not against the interconnects. I really like the colocation option, which says fine, go sit next to your own generation (in the stadium scenario, this would be the equivalent of adding some high-rise hotels right next door, so you could just walk to the stadium). The cheapest interconnection in the world is the one you never have to build, because the load already lives where the power is. Compute that sits next to its own power doesn't bid up grandma's rate, because it isn't standing in grandma's line. That is not a regulatory trick. It's just where the physics has been pointing the entire time, and it is the opposite of hauling a gigawatt of demand three states over to a substation that was already maxed out.

NVIDIA, for its part, published a blog the same week calling the FERC orders a win for affordability. The orders may well be. The only question worth asking is affordability for whom, and the invoice, conveniently, doesn't say.

Originally published at The Cheapest Connection You Never Build.

Eaten From the Bottom

David Aronchick — Tue, 30 Jun 2026 18:31:03 +0000

On June 17, a Beijing company most people in American boardrooms still cannot pronounce released the best open model on Earth and barely made the front page. Z.ai, formerly Zhipu, shipped GLM-5.2, a roughly 750-billion-parameter model with a million-token context window, under an MIT license, which means you can download the weights, run them on your own hardware, modify them, and ship a product on top of them without asking anyone's permission or paying anyone a toll. The independent benchmarker Artificial Analysis put it at number one among open-weight models and number four overall, behind only the closed Western frontier, and it does that at roughly one-sixth the price of the model just above it. Chinese open models now hold most of the top slots on the open leaderboards and supply a majority of the world's open-model tokens.

Six weeks ago I wrote The Frontier Became a Club, about Anthropic's Mythos preview going to eleven named organizations with a hundred million dollars in credits attached and to nobody else. That post was about the top of the market sealing itself off, and it was correct. The genuinely hardest reasoning still lives behind the closed labs, the index still has a Western model at the summit, and four points of separation on a capability benchmark is four real points. The club is right that the very top still matters.

While everyone watched the top, the floor moved. And the floor is where these things always get decided.

The rebar nobody wanted

In 1969, a company called Nucor built a steel mill in Darlington, South Carolina, that did something the giants of American steel found mildly amusing. It melted scrap in an electric furnace and rolled it into rebar, the cheap reinforcing bar that gets buried in concrete where nobody can see it and nobody checks the metallurgy. It was the garbage tier of the steel business. Low margin, low status, low everything. US Steel and Bethlehem were happy to let it go, because they owned the high end, the structural beams and the sheet steel that went into car doors and appliances, the stuff that actually required good steelmaking. Ceding rebar to the upstarts was the obvious call. Why fight over the worst product in your catalog?

So the mini-mills took rebar. Then, with the rebar money, they got a little better and took angle iron and merchant bar. The integrated mills retreated up the ladder again, and again it was the rational move, because each tier they gave up was lower margin than the tier they kept. Then in 1989 Nucor opened a plant in Crawfordsville, Indiana, using thin-slab casting to make flat-rolled sheet, the crown jewel, the product the giants had told themselves the upstarts could never touch. By 2001 Bethlehem Steel was in bankruptcy. The integrated mills were right about quality at every single step of the retreat. Their steel really was better at each tier, right up until the moment "good enough and a sixth the price" climbed all the way up the ladder and there was no higher rung to retreat to. This is the most thoroughly documented pattern in business history, and it still fools the incumbent every time, because every individual decision to abandon the low end looks smart in isolation.

Open weights are rebar. Four points behind the frontier on the index, free to download, a sixth of the cost, and they run in a building you control. For the overwhelming majority of what enterprises actually do with these models, which is not frontier mathematics but classification, extraction, summarization, routing, and the ten-thousand boring tasks that make up real production work, "good enough at a sixth the cost and I can run it on my own machines" is not a compromise. It is the winning bid. The closed labs keep the genuinely hardest tier, and they are right that they have it, and they are watching the price of everything below it get set by a company in Beijing that licenses its weights for the cost of agreeing to an MIT license.

Where the moat went

Here is the part the leaderboard does not measure, and it is the whole game. The word that matters in "open-weight model" is not "model." It is "open." A closed frontier model is a dependency. You rent it, you live on its pricing, its release schedule, its content policies, and its jurisdiction, which is the exact bind I described when Apple wired Siri to a competitor's Gemini and could not attest its way back out of renting the part that thinks. An open-weight model you run yourself is the structural opposite of that. Nobody can reprice it on you, nobody can deprecate it out from under you, and nobody can change what it will and will not say after you have built on it.

Which means that when the model itself becomes free, open, and good enough, the leverage stops living in the model. It moves to the two things the leaderboard will never score: where you run the thing, and what data you feed it. If the weights are a commodity you can put anywhere, then the entire competitive question becomes whether you can put them next to your data instead of shipping your data to them. The moat drains out of the model and pools in the data plane, in locality, in the boring infrastructure that decides whether your near-free intelligence runs against a local cache or racks up egress fees round-tripping to a central cluster. Beijing just did the industry the favor of making the model the cheap part. The expensive part is the part nobody is benchmarking.

The integrated mills kept making the best steel in America right up until the day the best steel stopped being the thing that decided who survived. Figure out where your leverage actually sits before the commodity tier figures it out for you.

Originally published at Eaten From the Bottom.

We Rented the Mainframe Back

David Aronchick — Fri, 26 Jun 2026 18:29:52 +0000

On Wednesday, June 10, Google's Gemini stopped answering worldwide. It threw Error 1076 and Error 1099 at users in at least nine countries for roughly seven hours, from 3:26 in the morning Pacific until 10:30, when Google called it resolved and pointed at a backend database. The next afternoon, Microsoft's Copilot went dark for thousands of people, with DownDetector reports spiking past twelve thousand and the company eventually tracing it to a botched software update it had to roll back. Two assistants, two companies, about thirty-six hours apart.

In neither case did the model break; the thing that broke was the wire.

Platforms go down! And, as a former employee of both companies, I can tell you that many thousands of employees are working INCREDIBLY hard to prevent this. But even so, a whole bunch of people (those tasked with choosing an AI model/company) who have never thought about tail latency or how many 9s of uptime they need are suddenly having to learn the basics of service availability. And, sadly, we've done them no favors, since we have spent two years arguing about whether these systems can reason, whether they're conscious, whether they'll take everyone's job. Last I checked, a team of humans fairly rarely disappears for hours on end. You can have the smartest model ever built and it is worth exactly nothing to the person staring at a spinner because the token issuer two hops upstream just fell on its face.

So who cares if a chatbot takes an afternoon off? Well, in our new world, the chatbot has become load-bearing. Copilot is wired into Windows, into Edge, into the guts of Microsoft 365, doing code completion and drafting and the actual minute-to-minute of how a lot of people get work done. When it goes quiet, those people don't fall back to doing it the old way, because for a lot of them there is no old way anymore. And, as these assistants become more load-bearing, they are also facing growing pains. Network monitors logged a 30 percent jump in public-cloud outage events that same week, and Forrester has been saying out loud for months that the AI build-out will trigger two multi-day hyperscaler outages this year. This is not a fluke; it is the shape of the thing.

NOW WE GET TO THE STUPID THING THAT MAKES ME SHAKE MY HEAD. We spent forty years walking away from this exact architecture, and last week we walked right back into it. The entire arc of computing from about 1980 to 2010 was decentralization. The PC pulled compute off the mainframe and put it on your desk, and the reason that mattered wasn't speed, it was blast radius. If your machine died, the company kept running. Then the cloud quietly recentralized all of it, which was a perfectly good trade when the cloud was mostly where your files lived and your email got sorted. But the AI assistant is a different animal. It generally isn't something you can route around, or hide behind a caching layer that papers over intermittent outages. It's become the core of the engine that makes these local rich apps work, and welcome to timesharing on a PARC-MAXC in 1981. (AS AN ASIDE: If you have not watched Halt and Catch Fire, PLEASE go do so. It is both an exceptional story about really interesting characters and a love letter to the entire computing industry of that time.)

This in NO WAY is saying that Google and Microsoft are bad at this! They are about as good at running infrastructure as anyone who has ever lived, and it happened anyway, because at this level of concentration it is supposed to happen. When one backend database sits in the path of every Gemini query on Earth, that database is not a database. It's a fuse. The only open question is when it blows, and the status page will say everything is fine right up until the smoke clears. What we, the industry, need to do is build a multi-layer inference strategy, as we have been doing for other services for 20+ years, and enable some or all of that inference to live near each other and survive each other. An assistant baked into your editor ought to degrade to something small and local when the mothership is unreachable, not transform into a loading animation. Interestingly, part of Gemini DID stay up during the outage: Flash Lite, the smallest, cheapest tier, kept partially answering. The "dumb" little model that ran closer to the edge survived because it wasn't routed through the expensive part that fell over.
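One way to picture that degradation path is a small wrapper that tries the hosted model first and falls back to a local one when the call fails. This is a sketch under stated assumptions: `remote_complete` and `local_complete` are hypothetical stand-ins for whatever clients you actually use, and the wrapper assumes the remote client enforces its own timeout and raises `TimeoutError` or `ConnectionError` when the backend is unreachable.

```python
from typing import Callable

Completer = Callable[[str], str]


def with_local_fallback(
    remote: Completer,
    local: Completer,
    retryable: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> Callable[[str], tuple[str, str]]:
    """Prefer `remote`; degrade to `local` on the listed failures.

    Returns (source, text) so the caller can tell the user which tier answered.
    """

    def complete(prompt: str) -> tuple[str, str]:
        try:
            return ("remote", remote(prompt))
        except retryable:
            # Remote is slow or unreachable: answer with the small local model.
            return ("local", local(prompt))

    return complete


# Hypothetical clients, for illustration only.
def remote_complete(prompt: str) -> str:
    raise ConnectionError("backend database unavailable")


def local_complete(prompt: str) -> str:
    return f"[small local model] {prompt[:40]}"


complete = with_local_fallback(remote_complete, local_complete)
source, text = complete("rename this variable")
```

Errors outside the `retryable` tuple still propagate, on the theory that a bad request should fail loudly rather than be silently re-answered by a weaker model.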

A few weeks ago I wrote that Apple had subcontracted Siri's brain to Gemini. Two days after that post went up, Gemini spent seven hours returning error codes to half the planet. There's zero schadenfreude here; it's a super annoying problem that no amount of engineering can prevent. What I hope happens is that we figure out how to augment the existing choices in architecture. "The Cloud" is already the number-two line item on a lot of IT budgets, right behind payroll, and InfoWorld has gone ahead and called 2026 the year we stop trusting any single cloud. We solved this problem in 1995 and then we just un-solved it, because renting was easier than owning. The bill for that decision doesn't come due as a price. It comes due as a Thursday when nobody can work.

Originally published at We Rented the Mainframe Back.

The Token Got Cheaper. Your Bill Didn't.

David Aronchick — Tue, 23 Jun 2026 18:35:22 +0000

An enterprise client of an AI consultant SUPPOSEDLY spent half a billion dollars on Claude, by accident, in a single calendar month. (I am going to leave whether or not this is true as an exercise to the reader, because it LIKELY will happen... let's call it historical fiction?) Apparently, they had failed to set per-employee usage limits on the licenses, the agentic workflows their employees were running compounded against each other until the bill landed where it did, and the consultant told Axios about it in late May. And while it is being called a "cautionary tale," the reality is that the cost structure of enterprise AI in 2026 is mismatched against the way it is being priced, sold, and budgeted for, by enough that one missing license control compounded to nine figures inside thirty days.

This is one of my BIGGEST pet peeves in the industry right now... per-token pricing for end-users.

Gartner's latest forecast says inference on a trillion-parameter LLM will cost more than 90% less in 2030 than it did in 2025. Epoch AI's tracker puts the year-over-year drop at roughly 10x for equivalent capability. Equivalent-to-GPT-4 performance, which cost more than $400 per million tokens in 2023, now sits at $0.40. That is, by every reasonable benchmark, the single largest deflation in price-per-unit performance any computing platform has ever produced.

And yet.

The companies actually deploying this technology in production are watching their monthly AI bills go up by roughly 320% year-over-year, against unit prices that fell something like 280x. Uber's CTO admitted (claimed?) in April the company had already burned through its entire 2026 Claude Code budget. We have a structural mismatch between what the industry is pricing and what the industry is actually buying. This is not going to last.

The vendors know it, and the ones closest to the cost structure are repricing first. In the first week of June, the three tools that own agentic coding all stopped pretending the flat seat could survive contact with the actual cost of inference. GitHub Copilot moved to usage-based billing on June 1, a monthly credit allotment and metered tokens after that, and developers running long agent sessions opened their first invoice to jumps of 10x to 50x. Within forty-eight hours Cursor had carved its team plans into tiers with separate usage pools, and Cognition had relaunched Windsurf as a metered Devin. Three competitors who would happily watch each other go bankrupt made the identical unpopular move inside one week, which is as good a leading indicator as anything. The all-you-can-eat seat was a venture subsidy against a bill that has now come due, and a subsidy is the most expensive thing in the world to be a customer of right up until the moment it ends.

The arithmetic of the loop

The thing the per-token price chart does not tell you is how many tokens a single user request actually generates. In 2023, the typical "AI feature" inside an application was a single model call. The user typed a question, the model returned an answer, the bill was one round trip. The unit economics were simple enough: price per token times tokens per response times number of responses per day.

In 2026, however, a modern agentic workflow, the kind every enterprise vendor is selling and every Fortune 500 is buying, calls the model somewhere between 10 and 20 times per user task. There is a planner call, a retrieval call, a verifier call, a tool-use call, a critique call, a refinement call, possibly a second retrieval informed by the critique, and a final answer-formatting call. Each of those calls is cheaper than the one call it replaced. The sum of all of them, against the same user task, is more expensive than the original was.

The RAG pipelines that are now mandatory in any enterprise deployment make this worse, not better. Every retrieval-augmented call inflates the context window with retrieved documents, which means the input token count for the model balloons by a factor of three to five. The cost of an input token is lower than it has ever been, and the number of input tokens being shoved into every call is higher than it has ever been, and the two trends are not converging. They are diverging, and the divergence is the bill.
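Put numbers on it and the divergence is easy to reproduce. The sketch below applies the multipliers described in this section to a simple per-task cost: 10 to 20 calls per task, and a three-to-fivefold larger input per call. The token counts and prices are illustrative assumptions, not quotes from any vendor's price list.

```python
def cost_per_task(calls: int, input_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float,
                  context_multiplier: float = 1.0) -> float:
    """Dollars per user task: calls x (input cost + output cost) per call."""
    inflated_input = input_tokens * context_multiplier
    per_call = (inflated_input * price_in_per_m
                + output_tokens * price_out_per_m) / 1_000_000
    return calls * per_call


# 2023-style feature: one call, a short prompt, older and pricier tokens.
single_call = cost_per_task(calls=1, input_tokens=1_000, output_tokens=500,
                            price_in_per_m=10.0, price_out_per_m=30.0)

# 2026-style agent: tokens are 10x cheaper, but 15 calls and 4x the context.
agent_loop = cost_per_task(calls=15, input_tokens=1_000, output_tokens=500,
                           price_in_per_m=1.0, price_out_per_m=3.0,
                           context_multiplier=4.0)

print(f"single call: ${single_call:.4f} per task")
print(f"agent loop:  ${agent_loop:.4f} per task")
```

With these made-up inputs the per-token price falls tenfold while the per-task cost more than triples, and each individual call in the loop is still cheaper than the single call it replaced.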

Always-on monitoring agents, the ones every cybersecurity vendor and every observability platform is now shipping with a default-on toggle, are the third factor. A monitoring agent that runs continuously against a production data feed does not generate a single request per user. It generates a continuous request per data point. The unit cost of that request is trivial, but the product of unit cost and request rate, over a month, is not trivial. It is the largest line item the buyer did not budget for. The unnamed half-a-billion-dollar customer is what happens when you stack all three of those factors on top of each other, give the result a default-on toggle, and then go home for a long weekend.
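The same multiplication applies to an always-on agent. A rough sketch, with an assumed event rate and an assumed per-call cost (neither number comes from any real deployment):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_cost(events_per_second: float, cost_per_call: float,
                 days: int = 30) -> float:
    """One model call per data point, around the clock."""
    calls = events_per_second * SECONDS_PER_DAY * days
    return calls * cost_per_call


# Hypothetical feed: 50 events per second at a tenth of a cent per call.
monitoring_bill = monthly_cost(events_per_second=50, cost_per_call=0.001)
print(f"${monitoring_bill:,.0f} per month")
```

A tenth of a cent per call looks like nothing on a price sheet. At fifty events a second it comes to roughly $130,000 a month, from a toggle nobody remembers turning on.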

Containers got cheap. The shipping business didn't.

The cleanest analogy here is the shipping container, and I am going to use it because the parallel is exact, not because it is fashionable.

Containerization, which arrived as a serious industrial standard in the late 1960s, reduced the cost of moving a ton of goods across an ocean by roughly an order of magnitude in fifteen years. The container itself became a commodity and the price of a single trans-Pacific shipment plummeted. By every measurable unit, the cost of moving cargo went down. YET the result was not that shipping got cheaper as a category. The result was that the volume of cargo being shipped exploded, because the cost reduction made entire product categories economic that previously were not. Cheap electronics. Fast fashion. Perishable food on long-haul routes. Just-in-time global manufacturing. None of it existed at meaningful scale before the container. All of it exists now.

The visible cost is the container price, which fell. The invisible cost is what the cheap container made possible: warehousing networks the size of small countries, the inventory-financing operations needed to keep them stocked, the customs and compliance infrastructure that absorbs the friction, and the consumer behaviors that assume a six-day delivery window from anywhere on Earth. The container did not save the world money. It moved the money from the moving of goods to the storing, financing, choreographing, and consuming of them. The bill went up. The container got cheaper. Both can be true.

A token is a container. The model call is the box. The thing you actually pay for in a 2026 production AI deployment is not the boxes. It is the warehouse: the data plane, the retrieval substrate, the orchestration layer, the eval harness, the safety review, the monitoring system that runs against your monitoring system. The token is what the vendor quotes you on. The warehouse is what you actually built.

The bill you have not seen yet

The dominant cost of a 2026 enterprise AI deployment is not the LLM bill. It is the data movement that feeds the LLM: every RAG retrieval pulls data from somewhere, and every agent invocation reads context from a database, a vector store, a cached document, a tool call, an upstream system. The bytes moved per useful answer have gone up by orders of magnitude. The price of moving a byte across a public cloud has not gone down. In some regions, against some egress paths, it has gone up.

This is the place where the entire architecture conversation should be happening, and it isn't. The vendors are competing on price-per-token because that is the metric the customer is measuring. The customer is measuring price-per-token because that is the metric the vendor is publishing. Both sides agree to compete on the part of the bill that is collapsing, and quietly ignore the part of the bill that is growing. The result is a market in which the headline cost is falling 10x per year and the actual cost is going up, and nobody is willing to put both numbers on the same slide.
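Putting both numbers on the same slide is not hard. The sketch below does it under loudly stated assumptions: the 10x yearly token price drop comes from the paragraph above, but the starting token price, the tokens per answer, the data moved per answer, its 5x yearly growth, and the flat egress price are all invented for illustration.

```python
def cost_per_answer(year: int) -> tuple[float, float]:
    """Return (token_cost, data_movement_cost) in USD for one answer."""
    # Assumption: $10 per million tokens in year 0, falling 10x per year.
    token_price_per_million = 10.0 / (10 ** year)
    tokens_per_answer = 2_000  # assumption, held constant
    # Assumption: 0.1 GB moved per answer in year 0, growing 5x per year.
    gb_moved_per_answer = 0.1 * (5 ** year)
    egress_price_per_gb = 0.09  # assumption, held flat
    token_cost = token_price_per_million * tokens_per_answer / 1_000_000
    movement_cost = gb_moved_per_answer * egress_price_per_gb
    return token_cost, movement_cost


rows = [cost_per_answer(year) for year in range(3)]
for year, (token_cost, movement_cost) in enumerate(rows):
    total = token_cost + movement_cost
    print(f"year {year}: tokens ${token_cost:.4f}  movement ${movement_cost:.4f}  total ${total:.4f}")
```

Under these made-up inputs the token column shrinks every year while the total column grows every year. Change the assumptions and the crossover moves, but the exercise of printing both columns side by side is the part that matters.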

There is a version of enterprise AI architecture that handles this correctly, and it is the version where the compute moves to the data rather than the data moving to the compute. If the retrieval substrate sits next to the model, you stop paying egress fees. If the agent loop runs against a local cache of the relevant context, you stop paying for the redundant retrieval round-trips. If the monitoring agents run at the edge against the data they are monitoring, you stop paying to ship that data into a central inference cluster and back out again. The unit-cost-of-token chart says nothing about this, because it is not measuring it. The total bill does.
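The local-cache point can be sketched in a few lines. Everything named here (`LocalContextCache`, `fake_remote_fetch`, the document names, the 1 KiB payload size) is hypothetical, and the sketch ignores cache invalidation entirely; it only counts how many bytes an agent loop pulls across the wire with and without a cache sitting next to it.

```python
from typing import Callable, Dict


class LocalContextCache:
    """Serve repeated context lookups locally and count bytes fetched remotely."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote
        self._store: Dict[str, bytes] = {}
        self.remote_bytes = 0

    def get(self, key: str) -> bytes:
        # Only a cache miss pays for a remote round trip.
        if key not in self._store:
            payload = self._fetch_remote(key)
            self.remote_bytes += len(payload)
            self._store[key] = payload
        return self._store[key]


def fake_remote_fetch(key: str) -> bytes:
    """Stand-in for a cross-region read; pads every document to 1 KiB."""
    return key.encode().ljust(1024, b".")


docs = ["policy.md", "runbook.md", "schema.sql"]
steps = 5  # hypothetical agent loop that rereads its context on every step

cache = LocalContextCache(fake_remote_fetch)
for _ in range(steps):
    for doc in docs:
        cache.get(doc)

uncached_bytes = steps * len(docs) * 1024
print(f"with local cache: {cache.remote_bytes} bytes moved")
print(f"without cache:    {uncached_bytes} bytes moved")
```

In this toy loop the cached path moves 3,072 bytes and the uncached path moves 15,360. A real deployment has to decide when cached context goes stale, which is where most of the actual engineering lives.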

Akamai and Comcast ran a benchmark on this in March: a voice small language model on four NVIDIA RTX PRO 6000 GPUs, deployed as a single centralized cluster versus an AI Grid distributed across four sites, under burst traffic. The distributed deployment ran 52.8% cheaper at baseline and 76.1% cheaper during bursts, with sub-500ms latency at P99 and an 80.9% throughput gain at peak. That is what the architecture conversation looks like when you measure the right thing. It is not a per-token comparison. It is a total-cost-of-delivering-the-answer comparison, and the centralized model loses.

Stop watching the wrong number

If you are signing a contract for AI infrastructure this quarter, stop optimizing for the per-token price. The price will keep falling, on a timescale that makes any contract you sign for it irrelevant inside of a year. The vendors competing on it are competing on the visibly cheap part of a cost structure that is shifting somewhere else.

Optimize for where the data sits, what it costs to move, and which calls have to round-trip through your central inference path. The bill you have not yet seen is in the egress line item, the vector store retrieval costs, and the monitoring spend that compounds while you sleep. The bill on the model is the easy one. It is also, increasingly, not the bill.

The half-a-billion-dollar customer set their license limits wrong. That was a control failure. The control failure is interesting because the thing it failed to control got large enough in a single month to make the news. Two years ago that same control failure would have produced a five-figure bill, the CFO would have noticed at the next quarterly review, and nobody would have written about it. The control failures are getting expensive faster than the controls are getting better. That gap is the part of the cost curve nobody has put on a chart.

The token got cheaper. Your bill didn't. Both of those things are true at the same time, and the gap between them is where the next decade of enterprise AI architecture is going to be decided.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

Originally published at The Token Got Cheaper. Your Bill Didn't..

Six Hundred Ways Not to Connect a Hose

David Aronchick — Sat, 20 Jun 2026 18:24:44 +0000

On the morning of February 7, 1904, fire crews from Washington, Philadelphia, and as far off as New York loaded their pumpers onto railroad flatcars and raced to Baltimore, which was on fire. When they got there, they ran their hoses to Baltimore's hydrants, and the couplings didn't fit. So a good number of those men stood in the freezing street and watched the city burn, because the threads on their hose wouldn't bite the threads on Baltimore's plugs.

The fire took thirty hours and leveled more than 1,500 buildings across seventy blocks of downtown, some 140 acres. Thirty-five thousand people lost their jobs in a weekend. Adjusted forward, the loss runs north of five billion dollars. It's still the third-worst urban conflagration in American history, behind only the Great Chicago Fire and the San Francisco quake. And the craziest thing is that the water was there, the pumps were there, and the firefighters were there. What stopped them were several hundred thin sheets of metal that made up the threading on fire hydrants. By 1903 the United States had more than six hundred different sizes and variations of fire-hose coupling.

Six. Hundred.

Ask yourself how a country ends up with six hundred incompatible ways to attach a hose to a water source, and you get to the real lesson, which has nothing to do with fire. The incompatibility wasn't an accident or an oversight; manufacturers patented their own couplings and guarded them. That locked every city into whatever system it had already sunk real money into. The standards efforts that had been kicking around since the 1870s went exactly nowhere, because the people who would have had to adopt a standard were doing fine without one and the people selling the equipment made more money when you couldn't take your business across the street. (W3C and IETF folks, I KNOW you are shaking your heads right now.) It took an enormous disaster to expose a market structure where proprietary couplings were more important than safety.

I'm now supposed to say "and isn't that just like the cloud," and the irritating truth is that it is, in almost perfect detail. Your data sits in one provider's object store, in their preferred table format, and the day you decide to move it to a competitor you discover the coupling: the egress fee on the way out, the proprietary format that needs translating, the API that is almost but not quite the one next door. It's the hose that won't thread onto the other guy's hydrant, and like the 1904 version, it is engineered to not fit. Brussels finally lost patience with it, and the EU Data Act bans cloud switching fees outright as of January 12, 2027, caps the notice period for leaving at two months, and forces providers to publish a full list of the data categories you're allowed to take with you when you go. That is, almost word for word, a fire-hose coupling standard, about 122 years after the lesson learned by the National Fire Protection Association.

But the Baltimore story does not end with "and then they fixed it." After 1904 the NFPA did the obvious thing and published a national standard coupling. A century later, a NIST study went and checked, and found that only 18 of the 48 largest American cities had actually adopted it. Thirty cities, a hundred years on, were still running their own thread. The standard has been put in place! It's been around FOR A CENTURY. But inertia is still winning today. The 1991 Oakland Hills firestorm burned hotter and longer in part because Oakland's hydrants used a three-inch coupling while the mutual-aid crews showed up with the two-and-a-half-inch national standard. Twenty-five people died in a fire made worse by a thread mismatch, eighty-seven years after Baltimore made that exact lesson free for anyone willing to read it.

So you CAN say "standardize everything," but that's just not enough. A world where every system speaks one format and routes through one provider is a world where a single bad config push knocks everyone flat at the same instant, which happens ALL the time (I'm not providing links, because I don't want to shame people, but search for yourself; no matter what date you are reading this, today or ten years from now, I will bet dollars to donuts that there's a "down time" notice from a major provider in the past week). The real lesson is narrower and more annoying than "pick a standard." It's that the failure is almost never inside the box. It's in the coupling between boxes, the seam everyone treats as a tedious detail because it's boring and it's somebody else's job. Baltimore had no shortage of water. The cloud has no shortage of compute. What neither one reliably had was a connection that worked when it actually mattered, owned by someone whose business depended on it working rather than on it quietly not.

We are very good at building magnificent boxes. We ship the seams as an afterthought, or worse, as a moat. Mayor McLane stood in the ashes in 1904 and told the cities offering help that "Baltimore will take care of its own, thank you." It's a great line. It's also what every cloud contract says to you, in much smaller type, on the way in.

Originally published at Six Hundred Ways Not to Connect a Hose.

Apple Just Subcontracted the Voice

David Aronchick — Sat, 20 Jun 2026 00:23:07 +0000

On June 8, the keynote that opened WWDC unveiled "Siri AI", the rebuilt assistant Apple has been promising and delaying since 2024. The demo was really good! And it did all the things that we would EXPECT an AI to do in 2026. I thought it was particularly interesting that the new Siri runs on Google's Gemini. Apple licensed a custom Gemini build of around 1.2 trillion parameters, is reportedly paying something close to a billion dollars a year for it, and quietly retired the ChatGPT hand-off that was the showpiece of the 2024 launch. The most tightly controlled hardware in consumer technology now does its hardest thinking on a competitor's model.

I want to be fair to the engineering, because it IS very good, and it is not the cartoon version where Apple ships your diary to Mountain View. Apple built a three-tier stack: simple requests stay on the device, moderately hard ones go to Apple's Private Cloud Compute, and only the heaviest reasoning routes out to Google Cloud, where the custom Gemini runs on what is probably some combination of TPUs and NVIDIA accelerators. Queries that leave the phone are anonymized and tokenized so that, by Apple's account, neither Apple nor Google can tie a request back to a person. If you are going to rent a brain, this is close to the most careful way to wire it, and the byte-level privacy story mostly survives the announcement. That is not the part of the announcement that is interesting.
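For readers who think in code, the three-tier shape described above looks roughly like the sketch below. This is not Apple's implementation: the difficulty score, the thresholds, and every name here are hypothetical, and "anonymize" is reduced to dropping a user identifier before a request leaves the device.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Tier(Enum):
    ON_DEVICE = "on-device"
    PRIVATE_CLOUD = "private-cloud"
    EXTERNAL_MODEL = "external-model"


@dataclass(frozen=True)
class Request:
    text: str
    user_id: Optional[str]
    difficulty: float  # hypothetical 0.0 to 1.0 score from an upstream classifier


def pick_tier(difficulty: float) -> Tier:
    """Choose the cheapest tier that can handle the request (invented thresholds)."""
    if difficulty < 0.3:
        return Tier.ON_DEVICE
    if difficulty < 0.7:
        return Tier.PRIVATE_CLOUD
    return Tier.EXTERNAL_MODEL


def prepare(request: Request) -> Tuple[Tier, Request]:
    """Route a request, stripping the identifier from anything that leaves the device."""
    tier = pick_tier(request.difficulty)
    if tier is Tier.ON_DEVICE:
        return tier, request
    return tier, Request(request.text, None, request.difficulty)


tier, outbound = prepare(Request("plan a three-city trip", "user-123", 0.9))
print(tier.value, outbound.user_id)
```

Notice what the sketch can and cannot express: where the bytes go and what they carry is a few lines of routing, while who trains and ships the model behind the last tier appears nowhere in it.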

What the architecture used to say

In June 2024, Apple staked Apple Intelligence on a specific architectural claim. The premium property of an Apple model was that it ran on the device, your data never left, and the rare query that exceeded on-device capacity went to Private Cloud Compute, Apple's own hardware in Apple-controlled enclaves with cryptographic attestation. Third-party models were a fallback, available when you explicitly chose them. ChatGPT was the named partner; Gemini was discussed but not shipped. The hierarchy was on-device first, Apple's cloud second, somebody else's model last and only by choice.

The bet Apple was making was that its silicon team and its model team, running the same roadmap, would close the gap to frontier capability inside two years. The need for an outside frontier model was supposed to be temporary.

And in many ways it was! But the external world kept going even faster.

Why it widened

The on-device model Apple shipped in late 2024 was not the one the original pitch implied. Its capable cousin, the internal frontier model, slipped twice and landed in restructured form after the WWDC 2025 reorganization. Apple's foundation-model group lost senior people to Meta's superintelligence group and to Anthropic over the same stretch. Google, meanwhile, shipped Gemini 2.5, then 3.0, then 3.1 Pro on roughly a six-month clock, each one clearing a bar the last one missed. By early 2026 Apple's choices on the assistant had narrowed to two: ship a Siri that worked, or ship a Siri whose architecture matched the 2024 marketing. Monday told you which one Apple picked.

What actually changed

The thing that changed on Monday is not where your bytes go, because Apple engineered that fairly well. The thing that changed is who supplies the intelligence. For a decade Apple's entire argument, the one that justified designing its own chips and writing its own frameworks and refusing the easy integration, was that owning every layer of the stack was the only way to keep the promises it made about the device. On Monday Apple kept the assistant promise by renting the most important layer from the one company it competes with most directly across phones, ads, browsers, and now models. Their "we own the whole stack" became "we own the stack except the part that does the thinking," and you cannot attest your way out of that sentence.

Lots of folks are calling this a blow to "sovereign AI", and in the small and specific sense that matters to anyone who builds systems, it kind of is. Apple's most strategic consumer feature now carries a hard dependency on a competitor's model, a competitor's pricing, and a competitor's release schedule, and for the heaviest queries it runs inside a jurisdiction Apple does not control. Most users will never notice and most queries will never matter.

The biggest thing that changed here is the strategy behind Apple's shift in position, not any individual query. They admitted that the industry (and customer expectations) are moving too fast for them to keep up.

Right in physics, wrong in calendar

The on-device thesis was the architecturally correct answer to the question Apple was asking, where privacy by construction beats privacy by contract, and on-device latency beats a data-center round trip. Apple's silicon division spent ten years building the substrate that should have made on-device frontier intelligence a category.

However, the calendar call missed, and the rest of the world did not wait. Apple bet its model team could reach the frontier as fast as its silicon team and product team could ship, and the frontier moved faster than any single company's roadmap. By the time an on-device path would have reached parity, Google had three more model generations out, OpenAI had four, and Anthropic had the tier jump that produced Mythos. Right on the physics, wrong on the calendar, and in product the calendar wins every time.

There is a pattern here that is going to define the next couple of years. The vertically integrated "own every layer" architecture is the correct answer to the long-horizon question about control. However, for a while anyway, it will lose to the federated "compose across whoever is best this quarter" architecture on the short-horizon question of what ships now.

The part to watch starts about eighteen months out. It might show up as Google's Gemini roadmap shipping on a clock that is inconvenient for Apple's launch calendar, or as the billion-a-year tenancy getting renegotiated in a direction that pinches the Services margin Apple has spent years defending, or as a Google policy change that moves what Siri will and will not say, on a timeline that is not Apple's. None of that has happened yet, but it could, and any of it would open a huge chasm. These are certainly uncharted waters (or at least waters uncharted for many years) for a company that prided itself on owning everything down to the silicon and now has possibly huge decisions riding on a schedule it does not fully set.

Apple spent a decade telling you that owning the whole stack was the only way to keep a promise. On Monday it kept the promise by leasing the part that thinks.

NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!

Originally published at Apple Just Subcontracted the Voice.