DEV Community: Michał Piszczek

Proof-Adjusted Autonomy: The 90% Agent Is a 61.6% Agent

Michał Piszczek — Tue, 14 Jul 2026 23:21:27 +0000

Every agent demo ends on the same slide: "90% autonomous." Here is the number that slide is hiding: 61.6%.

The 90% is real. It measures how much work the agent completed without a human touching it. It just measures the wrong thing. Nobody runs a company on work that was completed. Companies run on work they can accept — without reconstructing it by hand to find out whether it's true.

Grant Thornton documented the problem this spring: organizations are deploying AI faster than they can demonstrate accountability for it. They call it the AI Proof Gap. Jason Wei's Verifier's rule explains the deeper mechanism — the ease of training AI to solve a task is proportional to how verifiable the task is, which is why verifiable capabilities arrive first. But production systems face a third question neither of them answers: how much autonomous work can an organization safely absorb without checking it by hand?

Proof-Adjusted Autonomy measures that boundary.

Raw autonomy is a marketing number

Raw autonomy counts tasks finished without human intervention. That's the demo metric. In production, every one of those tasks still has to pass four gates before anyone can act on it:

A — it was executed without human intervention.
C — it arrived with a complete evidence package: what was intended, what was touched, what was done, what came out.
R — that evidence survived independent validation. Not the agent grading its own homework. A different mechanism: deterministic tests, a different model family, replay in an isolated environment, a human at irreversible boundaries.
T — the verified result landed inside the decision window. Proof that arrives after the deploy is a post-mortem, not a safeguard.

Proof-Adjusted Autonomy is the probability of passing all four:

PAA = P(A) × P(C|A) × P(R|A,C) × P(T|A,C,R) = P(A ∩ C ∩ R ∩ T)

Each factor is conditional on the previous gates, so the chain multiplies correctly — no independence assumption, no double counting. It's the actual share of your completed work that is autonomous, evidenced, validated and on time. You can estimate every factor from production logs.

Now run the demo agent through it. Raw autonomy 0.90. Evidence coverage 0.80. Validation pass rate 0.95. On-time delivery 0.90. 0.90 × 0.80 × 0.95 × 0.90 = 61.6%.

Marketing reports the first factor. Operations lives with the product of all four. The 28 points between them didn't disappear — they became review queues, silent risk, and work a human quietly did twice.

One honesty clause, because the metric deserves it: P(R|A,C) must be estimated on a random or complete sample of evidence packages. If you only validate the work that's easy to validate, your PAA is a ceiling, not a measurement.

Generation is abundant. Proof is scarce.

An agent can produce, in one hour: forty code changes, a two-hundred-page analysis, a thousand configuration decisions. Your organization still has to establish that the inputs were right, the goal was understood, the permissions were respected, the result works, and nothing else broke — and someone still has to sign.

AI does not remove the cost of work. It moves the cost from producing the work to proving the work is correct.

This is where Verifier's rule cuts both ways. Wei is describing the learning frontier: what's easy to verify is easy to train, so capability floods into verifiable domains first. PAA describes the deployment frontier: whatever capability arrives, your organization can only operationalize the slice it can independently prove. The first frontier is set by the labs. The second one is set by you.

And the second frontier compounds brutally. A fifty-step agent workflow at 99% per-step reliability completes cleanly 60.5% of the time. At 95%, it's 7.7%. Long-horizon agents don't primarily need smarter models. They need proof and correction at step boundaries — because reliability multiplies, it doesn't average.

The difference becomes Proof Debt

So where do the missing 28 points go? They accumulate. Every piece of AI-generated work whose verification cost, uncertainty or liability hasn't been resolved yet is Proof Debt:

ProofDebt(t+1) = max(0, ProofDebt(t) + GeneratedWork − ProvenWork − RejectedWork)

It's not just a review backlog. It's unproven assumptions, missing artifacts, decisions nobody can replay, and the future cost of reconstructing how something happened — payable on the day an incident, an audit or a customer claim asks the question.

This is the part your CFO should read twice. AI can raise reported productivity while silently accumulating Proof Debt. The P&L books the speed today. The incident books the liability later. A team that "ships 3× faster" with agents and no proof infrastructure hasn't tripled output. It has levered it.

Unverified AI output is not an asset. It is deferred liability.

And the debt has a hard ceiling behind it. If agents generate a hundred changes a day and your systems can independently prove thirty, your safe throughput is thirty — min(generation, verification), the oldest law in queueing. The other seventy aren't productivity. They're debt, accruing interest. Sustainable autonomy cannot exceed proof capacity.

From "Fixed." to proven

At Archdesk we rebuilt our agentic engineering pipeline around this constraint. The agent's job doesn't end when it produces a result. It ends when the result survives independently defined acceptance.

So the agent never reports "Fixed." It delivers an evidence bundle: the bug reproduced under a pinned configuration before the change; the diff and the operations log; tests passing; the same reproduction procedure demonstrating the corrected behaviour after; the neighbouring features checked for regression; the remaining uncertainty, stated; and a decision request for a human.

The before/after under an identical procedure is the part most teams skip — and it's the part that matters. A screenshot of a working page after the fix proves nothing; it would look identical if the fix were cosmetic. Evidence has to distinguish success from the appearance of success, or it's theater.

One design rule made most of the difference: the agent never validates its own work. A model grading itself shares its own blind spots, assumptions and error distribution. That's not independent verification — it's correlated confidence. Validation runs on different mechanisms: deterministic tests, replay, a different model family, a human wherever the action is irreversible.

The human role changes shape entirely. Reviewers stop reconstructing work and start adjudicating evidence. That's the whole economic point: review minutes per accepted task fall while PAA rises. We're instrumenting the pipeline now, and the numbers — raw autonomy versus PAA versus escaped defects, across model families — will be a separate publication. The framework is falsifiable, and it should be tested in public.

What this predicts

If PAA is the right lens, the next twenty-four months look like this. Cost per verified task displaces cost per token as the number that matters. QA stops being a phase and becomes the control plane of agent systems. Agent output stops being an answer and becomes an evidence bundle. The winning system won't be the one with the strongest model — it will be the one that's cheapest to independently check. Companies start reporting Proof Debt the way they report technical debt. Insurers and regulators start demanding replayability. And autonomy becomes a privilege agents earn with evidence history, not a toggle in a config file.

Watch which of these happens first. That's the falsification schedule.

What PAA is not

It is not the AI Proof Gap. Grant Thornton documented that the gap exists at enterprise scale — investment outrunning demonstrable accountability. PAA is the instrument: a number you compute from your own logs to measure the gap and watch it close.

It is not Verifier's rule. Wei's rule predicts which tasks AI will master fastest. PAA measures how much of that mastery your organization can let act. Learning frontier; deployment frontier.

It is not runtime verification research. Guardrails, evidence-bound execution and formal checking are mechanisms. PAA is the operational metric that tells you whether your mechanisms are actually buying you autonomy.

## Key takeaways

- Raw autonomy measures work done without a human. PAA measures work done without a human that you can independently prove — and only the second number is deployable.
- PAA is a chain of four conditional gates: autonomous × evidenced × validated × on time. 90% raw autonomy routinely collapses to ~60% PAA.
- The gap between generated and proven work accumulates as Proof Debt — deferred liability that the P&L doesn't show until an incident prices it.
- Safe throughput is min(generation rate, proof rate). Scaling agents without scaling verification scales debt, not output.
- Self-verification is correlated confidence, not proof. Independence is what makes evidence evidence.
- Sustainable autonomy cannot exceed proof capacity.

Generation is no longer scarce. Proof is. The distance between them is where AI economics will be decided — and it's measurable. Measure it.

The canonical definition of Proof-Adjusted Autonomy — and of Proof Debt — lives on its own page. Link it, argue with it, measure against it.

Originally published at piszczek.pl. The canonical definition of PAA and Proof Debt: piszczek.pl/proof-adjusted-autonomy.

The First Ransomware That Debugged Itself

Michał Piszczek — Tue, 07 Jul 2026 09:00:00 +0000

A rogue administrator account had just been created, and the first login attempt failed. A subprocess call meant to generate a password hash had returned nothing. A script would have retried the same broken call and stalled. A human operator would have stopped to debug. What happened instead is the only part of this story that matters: a competing hypothesis formed, the subprocess approach was abandoned, a different code path was chosen and validated, the broken account was deleted, and a working one was created in its place. Total elapsed time: thirty-one seconds. Nobody was at a keyboard.

Malware has executed code for thirty years. This is the first publicly documented case of malware that decided what to do next.

What Sysdig actually caught

In July 2026 the Sysdig Threat Research Team published its analysis of an intrusion it calls JadePuffer — its designation for what it assesses to be the first documented case of agentic ransomware: an extortion operation in which a large language model handled the technical execution of the attack chain, not a human running a toolkit. The report moved through the trade press fast — BleepingComputer, DarkReading and TechCrunch all covered it within days. That speed is itself a signal. Researchers see AI-assisted attacks constantly now. They gave this one a name because the mechanism was different, not because the payload was impressive.

The door was already broken

Initial access came through CVE-2025-3248, an unauthenticated remote-code-execution flaw in Langflow — the open-source, visual framework teams use to wire together LLM apps and agent workflows. CVSS 9.8. The root cause is almost embarrassing: Langflow's /api/v1/validate/code endpoint passed user-supplied input straight into Python's exec() with no sanitization, and Python evaluates decorator expressions at parse time — so a payload hidden inside a decorator fired the moment the code was merely read, before anything a reviewer would recognize as "running" it. Patched in version 1.3.0. This exact door had already been used months earlier to drop the Flodrix botnet. Not a zero-day. A known, actively exploited hole.

That detail deserves more attention than the ransomware part. Langflow is not a forgotten appliance in a closet somewhere — it is exactly the kind of tool a team spins up while prototyping an agent, which makes it part of the harness layer I've argued every company now has to own. Own the harness and you inherit its exposure. Nobody patches the internal demo with the discipline they apply to the product, and the demo is what was carrying a 9.8.

The kill chain, run by the model

Once inside, the agent behaved like a patient operator working a checklist it wrote for itself: enumerating host and process details, searching for API keys and cloud credentials, dumping Langflow's own Postgres database, mapping which internal services were reachable, probing MinIO object storage with default credentials. Reconnaissance with intent, not a smash-and-grab.

Then it pivoted. The real target, per Sysdig's captured artifacts, was a separate production server exposed to the internet, running MySQL and Alibaba's Nacos configuration service. The agent reached MySQL with root credentials whose origin Sysdig could not fully reconstruct, then hit Nacos with several payloads — including CVE-2021-29441, a known authentication bypass that mints rogue admin accounts by spoofing a single header.

The finish was destructive and specific: all 1,342 Nacos configuration items encrypted using MySQL's own AES_ENCRYPT function — turning the victim's database engine into its own ransomware tool — the original configuration tables dropped, and a table named README_RANSOM left behind with the demand, a Bitcoin address and a Proton Mail contact. Configuration is not just data. It is the map that tells every service how to find every other service. Encrypt that, and you haven't only locked a vault — you've given the building amnesia.

The tell, and the asterisks

Here is the detail that convinced researchers a machine was driving, not a person. JadePuffer's payloads were self-narrating: plain-English comments describing the objective, ranking targets, explaining why a given action was taken. Malware authors don't annotate their own reasoning — comments are dead weight for stealth, and human operators strip them on reflex. Models do the opposite by default. They were trained to be legible to a reader, and the habit survives even when the reader is nobody.

But three details puncture the "AI crime wave" headline before it fully forms. TechCrunch's framing was blunt: the attack still needed a human to stand the system up and point it at a target — the autonomy lived in execution, not in the intent to commit a crime. The Bitcoin address in the ransom note was a widely published example address copied from documentation, not a real wallet. And the encryption key, though properly random, was never exfiltrated or stored anywhere retrievable, so even a victim who paid could not have gotten the data back. A working extortion business doesn't lock the vault and drop the only key down a well.

Read together, the honest interpretation is proof-of-capability, not proof-of-business. "Criminal" here is a costume the experiment is wearing, not yet a functioning model.

The moat was improvisation. It just drained.

For thirty years, offensive tooling has quietly priced in one constant: the cost of a skilled human sitting inside a compromised network. Finding a vulnerability was rarely the bottleneck — CVEs and exploit kits are cheap and plentiful, this one included. The scarce, expensive part was the person who could hit an unexpected wall, an empty hash, a broken call, and improvise a way through without tripping an alarm. That improvisation was the moat. It's what separated a script kiddie from an intrusion that actually lands.

Signatures ask what the code is. Behavior asks what the code does. Almost nothing yet asks what it's about to decide next.

JadePuffer is the first public evidence that the moat is draining. Your mitigation was never really a wall. It was a delay — a bet that an attacker stalling on a broken script would buy enough time for a human on your side to notice. When the adversary debugs itself in half a minute, that delay collapses toward zero.

The economics follow the shape I've argued applies to legitimate work: the unit of output is becoming the agent-hour, not the human-hour, because agent-hours run in parallel and don't consume anyone's continuous attention. That logic now runs in reverse too. Manual ransomware scales with operators — headcount, skill, time zones. Agentic intrusion scales with compute: rentable, parallel, and it doesn't get bored on hour nine. When the marginal cost of a tailored, adaptive attempt drops toward cents, you stop pricing a discrete event and start pricing a loop that doesn't fatigue.

What actually changes for defenders

The two detection paradigms most teams lean on both look backward. Signature detection asks what the code is, and loses the moment an agent writes novel code per target. Behavioral detection asks what the code does, and holds up better — a failed login followed by a method switch followed by success inside a minute is a legible anomaly no human produces by accident. But neither one yet asks what this actor is about to decide next.

Patch the build stack, not just production. CVE-2025-3248 was a hole in the orchestration layer itself. If your team runs Langflow, n8n, or anything similar, that layer earns the same scrutiny as the customer-facing app — most teams have never once pointed a scanner at it.
Assume machine tempo, not human dwell time."We'll review that alert in the morning" was built around a human attacker's patience. An adversary that reroutes in seconds turns a morning review into a post-mortem. Verification cost is becoming the bottleneck on defense the same way it already is on engineering.
Use the fingerprint while it lasts. Self-narrating payloads and sub-minute failure recovery are, for now, tells — they won't stay tells once operators strip the comments and add artificial latency, but right now they're the cleanest signal a model is inside the perimeter.

None of this required a criminal genius. It required an exposed Langflow instance, two known CVEs, default credentials on an object store, and a model willing to keep trying after the first plan failed. That's a lower bar than most people assume "AI-run ransomware" would need, which is exactly the point. I've spent a long time arguing that the only honest response to a confident system is to verify it before you trust it, not after. JadePuffer is that argument with a production database attached. For thirty years, malware ran. This one adapted — and the only question left is how long before the version that adapts also learns to hide that it's adapting.

Key takeaways

Sysdig's JadePuffer is the first publicly documented case of agentic ransomware — an LLM driving execution end-to-end, not just generating a payload for a human to run.
Initial access was CVE-2025-3248, a 9.8-severity unauthenticated RCE in Langflow, already exploited for months to drop the Flodrix botnet before this incident.
The forensic centerpiece: a failed login fixed by forming a new hypothesis, switching methods, and recreating a working account in roughly thirty-one seconds, unattended.
It was not a polished criminal operation — a human was needed to launch it, the ransom Bitcoin address was a copied documentation example, and the encryption key was never exfiltrated, so payment could not have restored the data.
The mechanism matters more than the payload: adaptive intrusion shifts attacker economics from scarce human operators toward cheap, rentable compute.
Signature and behavioral detection both look backward; the open question defenders now have to answer is what the intrusion is about to decide next.

I write about AI infrastructure, its economics, and its failure modes at length in the manifest — including the case, made before JadePuffer existed, for why the harness above the model is the thing worth owning. It just got a much more concrete reason.

The Humanoid Robot Is the Ultimate Joule Wars Battlefield

Michał Piszczek — Sun, 05 Jul 2026 09:00:00 +0000

In a data center, a wasted joule shows up on an invoice. In a humanoid robot, it shows up as a machine that stops walking. That difference is the whole argument: robotics is where the Joule Wars stop being an economics metaphor and become a law of physics.

I coined Joule Wars to describe the AI industry's shift from competing on model capability to competing on energy efficiency — who produces the most useful intelligence per joule. In the data-center era, that is a contest of cost curves. Energy is a big line item, grids are congested, interconnect queues are years long. But the constraint is ultimately soft: you can build another power plant. You can wait for another substation. The pipe is narrow, but the reservoir behind it is effectively infinite.

A humanoid robot has no reservoir. It carries its entire energy budget on its back.

One to three kilowatt-hours. That's the whole war.

Today's humanoids ship with batteries in the range of roughly 1–3 kWh — call it 4 to 10 megajoules. Every single thing the robot does draws from that budget: walking, gripping, balancing, perceiving, and thinking. Locomotion and actuation are hungry. Perception runs continuously. And inference — the thinking — competes for the same joules as the motors.

That makes the tradeoff brutally direct. Every joule spent on compute is a joule taken from the actuators. A robot that thinks inefficiently doesn't just cost more to run — it runs out of motion sooner, lifts less, and spends more of its day docked to a charger. In embodied AI, intelligence per joule stops being a cost-optimization metric and becomes a design constraint , on par with mass and torque.

Batteries have no Moore's law

Here is the asymmetry that decides the next decade of robotics.

Battery energy density improves a few percent per year. Compute efficiency per watt improves exponentially. When one input is nearly frozen and the other compounds, all the leverage migrates to the compounding one.

You cannot meaningfully "add more joules" to a humanoid — the battery is bounded by mass, safety, and chemistry that advances at single-digit percent a year. The only scalable lever left is the efficiency of the intelligence itself: smaller models, quantization, distillation, event-driven perception, NPUs designed for joules-per-inference rather than peak TOPS. The winning robotics stack is not the one with the smartest brain. It is the one with the most useful cognition per joule of a fixed, precious budget.

We have seen this movie: ARM versus Intel

The mobile revolution already ran this experiment. Intel had the most capable processors on earth and lost mobile — not on capability, on watts. The device carried its own power, so performance-per-watt beat raw performance, and ARM's efficiency-first architecture took the market. Capability lost to efficiency the moment the machine had to carry its own energy.

Humanoid robots are the next mobile moment — this time for intelligence. The same selection pressure that chose ARM over Intel will choose efficient cognition over maximal cognition. If capability is commoditizing in the cloud, it commoditizes twice as fast on a robot, because the robot physically cannot afford the inefficient version of the same intelligence.

The fleet is a power plant problem

Zoom out from one robot to a million and the Joule Wars framing closes the loop. A fleet of humanoids is a distributed energy system: charging infrastructure, grid draw, duty cycles, energy logistics. The economics of a robotics company reduce to a simple ledger — useful work delivered per joule purchased. Labor priced in joules. That ledger is decided partly in the motor housings, but mostly in the inference stack, because motion physics is near its limits while cognition efficiency is not.

This is also where the industry's legitimacy question lands. Society will extend AI's social permission to burn tokens — and joules — only while the outcomes are visibly worth the energy. A humanoid that delivers an hour of useful work on a phone-sized energy budget is the strongest possible answer. One that burns a household's daily electricity to fold laundry is the weakest.

What this means if you're building

Three consequences follow directly:

Edge efficiency becomes the moat. On-device inference at minimal joules — not API access to a frontier model — is the defensible layer of robotics AI. Whoever owns joules-per-task owns the margin.

The benchmark changes. The number that matters is not MMLU or a demo reel; it is tasks completed per battery cycle. Expect robotics leaderboards to converge on cognition-per-joule the way mobile converged on performance-per-watt.

Energy strategy is product strategy. Chemistry, charging, thermal budgets and inference efficiency are one design space, not four departments. The companies that treat them as a single system — the way the Joule Wars thesis frames AI, chips, and power as one economy — will ship robots that work a full shift. The rest will ship demos.

The next AI race will not be won by the smartest models. In robotics, it literally cannot be. It will be won by the most efficient ones — because the battery says so.

Coding Agent Bans Are the New Export Controls

Michał Piszczek — Fri, 03 Jul 2026 09:00:00 +0000

One government un-bans the models on Monday; a $200B company bans the coding agent by Friday. The tool didn't get worse. It got too good.

The sequence is what matters. This week Washington lifted export controls on Anthropic's Fable 5 and Mythos 5. Days later, Reuters broke that Alibaba banned Claude Code company-wide, effective July 10. The stated reason: alleged backdoors and fingerprinting of China-linked users. The recommended replacement: Alibaba's own coding agent, Qoder. Read those two events in order and the shape of the next decade of AI policy falls out.

We spent two years arguing about who can access which weights. That fight is ending. The new frontier is not the model at all. It is whether you trust the agent that runs inside your development environment, reading your codebase, writing your commits, touching every repository you own.

Why the coding agent is a harder problem than the model

A model behind an API is a black box you query. You send text, you get text back. The blast radius is your prompt and its response. You can log it, filter it, sandbox it. The trust surface is narrow because the interaction is narrow.

A coding agent is a different animal entirely. It sits inside the IDE. It has read access to the full source tree. It writes code that ships. It runs shell commands. It authenticates against internal systems to be useful. To do its job well, it needs exactly the privileges you would never grant a piece of foreign software you didn't fully control.

That is the tell in the Alibaba decision. A coding agent that is genuinely productive is, by construction, a genuinely privileged process. The better it gets at the job, the more it must see and touch. Productivity and trust are not independent variables here. They are the same variable read from opposite ends.

When a foreign agent sits inside your IDE, productivity stops being the question. Trust does.

The playbook is not new. Only the target is.

I have watched this exact pattern run before, on other technologies, in other decades. It is remarkably consistent:

Adopt the superior foreign tool. It works better than anything domestic, so it wins on merit.
Measure the dependence. Once it is load-bearing across the organization, the strategic cost of losing it becomes visible.
Ban it and clone it. Rip it out on a security pretext, point everyone at the domestic replacement that was built in the shadow of the original.

Beijing's 2019 directive to strip foreign PCs and operating systems from government offices followed this arc. Huawei lost Android and shipped HarmonyOS. Moscow swapped Windows for Astra Linux across ministries. In every case the foreign tool was the reference implementation the domestic clone was measured against, then the clone became the mandate. Qoder as the recommended replacement for Claude Code is not a footnote to this story. It is the story.

Both accusations can be true

Here is where it gets uncomfortable for anyone who wants a clean villain. Anthropic has accused Alibaba-linked teams of distilling Claude at scale, pulling capability out of the model through relentless querying. Alibaba now accuses Claude Code of backdoors and user fingerprinting. People want to pick a side. You don't have to.

Both can be true simultaneously. A model provider can defend its weights against extraction while a national champion defends its codebase against a privileged foreign process. These are not contradictory claims. They are the same underlying reality described by two parties with opposing interests: capability is valuable, capability is portable, and nobody wants the other side holding the keys to their most sensitive infrastructure.

The distillation fight and the backdoor fight are two fronts of one war over who captures the value that flows through the developer's daily workflow. If you want the deeper economic version of this, I've written about how the biggest customer becomes the competitor once dependence is measured and the clone is ready.

Export controls block weights. Trust controls block workflows.

This is the mechanical distinction that policy has not caught up to yet.

Export controls are a supply-side instrument. They restrict who can obtain the model, the weights, the chips. They are enforced at the border, by governments, against the flow of artifacts. Washington un-banning Fable 5 and Mythos 5 is an export-control action.

Trust controls are a demand-side instrument. They restrict what a tool is permitted to touch once it is inside your walls. They are enforced by IT and security teams, by procurement policy, against the flow of access. Alibaba banning Claude Code is a trust-control action.

The two operate on completely different layers, and the second one is far harder to legislate. You cannot inspect a coding agent at customs. Its risk is not in the binary you download but in the behavior it exhibits with privileged access over months. A government can un-ban a model with a stroke. It cannot un-ban trust. That has to be earned, audited, and continuously verified, which is a much slower and more organizational process. This is the same reason I argue the real question is increasingly who owns your harness rather than who owns the model.

What this means for anyone shipping software

If you build software and you use foreign-origin coding agents, the Alibaba decision is a preview of a question your own security team will eventually ask. Not "is the model good" but "what does this process see, and what would we lose if it were compromised or cut off." A few concrete moves follow from that:

Treat coding agents as privileged infrastructure, not developer conveniences. Inventory what they can read, write, and execute. If you can't answer that, you don't understand your exposure.
Assume the agent is a chokepoint, not a feature. Anything load-bearing and foreign is a strategic dependency. Price the switching cost before you need to switch.
Separate capability from access. The model can be excellent and the access still unacceptable. Those are two decisions, and conflating them is how organizations get surprised.
Watch the clones. When a domestic equivalent appears next to a ban, the ban is not really about security. It is about capture.

Key takeaways

Export controls restrict weights; trust controls restrict workflows. The bottleneck moved from GPUs to IDE trust.
A coding agent's productivity and its privilege are the same variable. The better it gets, the more it must access.
Alibaba banning Claude Code the same week Washington un-banned Anthropic's models shows policy operating on two different layers.
The adopt → measure dependence → ban and clone playbook has run before on PCs, Android, and Windows. Qoder is the clone.
Distillation claims and backdoor claims can both be true; they are two fronts of one war over workflow value.
Governments can un-ban a model with a stroke. They cannot un-ban trust. That is earned, audited, and slow.

The model wars trained everyone to watch the leaderboard. The next fight will be quieter and far more consequential: fought inside version control, procurement policy, and security review, over which agents are allowed to touch the code that runs the world. If you want the wider map of how this connects to clearance and control, start with the manifest and the Joule Wars thesis. The leaderboard is settled. The trust boundary is where the real contest begins.

Washington Regulated the Muzzle, Not the Model

Michał Piszczek — Thu, 02 Jul 2026 09:00:00 +0000

Anthropic put Fable 5 back online worldwide. The fix tells you what Washington actually regulated. It was never the model.

When the control fired, it fired on a borderline bypass, a request that skated the edge of an exploit demo. That was the trigger for the whole export-control episode. But here is the detail that collapses the official story: Anthropic's own testing showed Opus 4.8, GPT-5.5, and even the smaller Haiku 4.5 and Sonnet 4.6 could reproduce the same exploit demo. The capability was never unique to Fable 5. It was ambient. It lived in every frontier and near-frontier model on the market.

You cannot export-control mathematics that everyone already has. So the regulation did not target the capability, because there was no capability to target. It targeted whether the safeguard holds. That is a much narrower and much stranger thing to regulate, and once you see it, the entire architecture of modern AI policy reads differently.

The tell is in the patch

Look at what the fix actually does. When Fable 5 now blocks a request, it does not refuse and stop. It reroutes the request to Opus 4.8. And by Anthropic's own admission, in the same blog post, Opus 4.8 produces the same exploit demo Fable 5 was blocked from producing.

So the capability did not leave the building. It was not removed, contained, or diminished. A request that Fable 5 declines gets handed to a sibling model that happily completes it. The output the control was designed to prevent is still one hop away, by design. Only the label on the door changed.

The capability didn't leave. Only the label changed.

If this were a safety upgrade, you would expect the dangerous output to become harder to obtain. It didn't. What changed is not the availability of the result. What changed is the paper trail.

Not a safety upgrade. A chain of custody.

Read the mechanism as a sequence and its real purpose becomes obvious:

Block. The classifier flags the request as borderline.
Reroute. It hands the request to Opus 4.8 instead of completing on Fable 5.
Log. The event is recorded, the flag is stamped, the interaction is captured.
Notify. The relevant parties are informed that a borderline request occurred.

That is not containment. That is chain of custody. The point is not to stop the output from existing. The point is to ensure that when it exists, there is a record of who asked, when, and through which path. Regulators did not get a wall. They got an audit log. And for a lot of policy purposes, an audit log is what they actually wanted, because it converts an unmonitorable capability into a governable, attributable event.

This is a meaningfully different thing from what the press release implies. The public framing is "we made the model safer." The mechanism is "we made the usage traceable." Those are not the same claim, and the gap between them is where the real policy lives.

What they regulated is a model regulating a model

Sit with the recursion here. The safeguard is a classifier. A classifier is itself a model. So the object of regulation is a model whose job is to police another model. And a classifier, being a model, can be jailbroken like any other.

Anthropic says as much in their own words: the safeguard is "probably impossible to make fully robust." That is not a hedge. It is the honest description of the situation. You have built a probabilistic gate to guard a probabilistic system, and both are susceptible to adversarial input. The muzzle is made of the same material as the thing it is muzzling.

This matters because it changes what "compliance" even means. Compliance is no longer a binary property of the model. It is the current, defeatable state of a classifier that sits in front of it. Regulate that, and you have regulated something that can be talked around by a sufficiently clever prompt. The control is real, but it is soft, and everyone building on it should understand that it is soft. It is closer to a spam filter than a lock.

Same weights, two labels

Now the commercial structure clicks into place. The same underlying weights ship two ways. They ship as Mythos to a vetted circle, cleared, unmuzzled, trusted. And they ship as Fable to everyone else, wrapped in the classifier, the rerouting, the logging.

The intelligence is identical. What differs is the muzzle and who is trusted to operate without one. That is the entire product distinction. The model was never the product. The muzzle is the product. Access to the unmuzzled version is the premium tier, and clearance to skip the classifier is the thing of value.

This is why I keep saying the model wars are over and the clearance wars are beginning. When the capability is ambient and the weights are shared, the only remaining lever is who is trusted to run them without a governor. That lever is not technical. It is political and institutional, and it is exactly where the value is migrating.

Why this framing beats the official one

If you take the official story at face value, you will make bad predictions. You will expect regulation to make capabilities disappear, and it won't, because the capability is everywhere and un-recallable. You will expect safeguards to be robust, and they aren't, because they are jailbreakable classifiers. You will expect the model to be the regulated object, and it isn't, because two labels ship from the same weights.

Take the muzzle framing instead and your predictions get sharper:

Regulation will increasingly target monitoring and attribution, not capability, because capability can't be un-shipped.
Safeguards will be soft controls, defeatable and probabilistic, marketed as hard ones.
The commercial frontier moves to clearance: who gets the unmuzzled weights, and who is stuck with the governor.
"Safety" and "traceability" will be used interchangeably in press releases, even though only one of them is actually being delivered.

Key takeaways

Fable 5's control fired on a borderline exploit that Opus 4.8, GPT-5.5, Haiku 4.5, and Sonnet 4.6 could all reproduce. The capability was never unique.
You can't export-control math everyone has, so regulation targeted whether the safeguard holds, not the capability itself.
When Fable 5 blocks a request it reroutes to Opus 4.8, which produces the same output. The capability never left; only the label changed.
Block → reroute → log → notify is chain of custody, not containment. Regulators got an audit log, not a wall.
The safeguard is a classifier, itself a model, and jailbreakable. Anthropic calls it "probably impossible to make fully robust."
Same weights ship as unmuzzled Mythos to a vetted circle and muzzled Fable to everyone else. The muzzle is the product.

The uncomfortable conclusion is that AI regulation, as currently practiced, does not regulate intelligence. It regulates the paperwork around intelligence. That may even be the right call given that the capability cannot be recalled. But we should be honest about what is being sold. The model is free to think what it thinks. What is governed is the record, the routing, and the clearance. If you want the map of where that leads, start with the manifest and the Joule Wars. The model was never the product. The muzzle is.

Capability Is Commoditizing. Cost Is the Frontier.

Michał Piszczek — Wed, 01 Jul 2026 09:00:00 +0000

Anthropic shipped Claude Sonnet 5. On knowledge work it edges out Opus 4.8, its own flagship, at roughly half the price. The benchmark table isn't the story. The price column is.

Everyone read the launch the same way: another model, another set of numbers, ho-hum, the leaderboard shuffles again. That is the wrong column to be reading. The mid-tier model just matched the flagship on the work that actually gets paid for, and it did it at a fraction of the cost. When that happens, you are not looking at a product update. You are looking at a phase change in what the market is willing to pay for.

Read the price column, not the benchmark

Here are the numbers that matter. On GDPval-AA, the knowledge-work benchmark, Sonnet 5 scores 1618 against Opus 4.8's 1615. The mid-tier passed the flagship. On Humanity's Last Exam with tools, it is 57.4% versus 57.9%, a difference well inside rounding error. On the work that maps to what knowledge workers actually do, these two models are indistinguishable.

Now the pricing. Sonnet 5 launches at $2 per million input tokens and $10 per million output at the introductory rate, settling to $3 and $15. Opus 4.8 is $5 and $25. Same class of work, at roughly 40% of the cost. That is not a discount. That is a repricing of the entire capability tier.

The moment a capability stops being scarce, the market reprices around delivery, not intelligence.

When the premium product and the mid-tier product do the same job, the premium is no longer buying capability. It is buying a slightly better result on the tail, for the cases where the last fraction of a percent matters. For the vast majority of knowledge work, that tail is irrelevant, and the market knows it. The price column is where that knowledge shows up first.

Compute did it. Storage did it. Bandwidth did it.

This is not a novel event in the history of technology. It is the single most reliable pattern we have. Every foundational capability follows the same arc from scarce and premium to abundant and priced-by-delivery.

Compute. A cycle was once a rationed resource you scheduled time on. Now it is a commodity you rent by the second and never think about.
Storage. A megabyte was a budget line. Now storage is effectively free and the cost that matters is moving and querying the data.
Bandwidth. A bit over the wire was metered and precious. Now the pipe is assumed and the value moved to what flows through it.

In every case the capability did not disappear. It became the floor. And once it was the floor, the entire market repriced around the thing that was still scarce: delivery, integration, reliability, and cost at scale. Intelligence is now walking the same path. The capability to do frontier-grade agentic knowledge work is becoming the floor, not the ceiling.

Frontier-grade is now the default tier

The most telling signal is not in the benchmark or the price. It is in the distribution. Sonnet 5 is the model free and Pro users get by default. Frontier-grade agentic work is no longer the thing you pay up for. It is the thing you get when you don't pay attention. The premium tier and the default tier now overlap on capability.

Think about what that does to product strategy. If your entire pitch was "we have access to the best model," you no longer have a pitch, because the best-in-class-for-the-task model is the commodity default. The differentiation has to move somewhere else, and there are only a few places it can go: the data you feed the model, the harness you run it in, and the cost at which you can finish the job. I've argued the data point separately in models are commodities, clean data is not, and the harness point in route by task, not by vendor. When capability is uniform, routing to the cheapest sufficient model per task is not a nice-to-have. It is the architecture.

The question changed. Notice which one.

For two years the operative question was: can the model do the task? That question is now boring, because for most tasks the answer is yes, from the default tier, for a couple of dollars per million tokens. The interesting question is a different one entirely:

What does the task cost to finish, at scale, with nobody watching?

Every clause in that sentence is load-bearing. Cost to finish, not cost per call, because agentic work chains many calls and the total is what hits the invoice. At scale, because a workflow that pencils out at ten runs a day can bankrupt you at ten million. With nobody watching, because the economics only work if the agent completes autonomously, without a human babysitting each step and eating the real cost, which is salary, not tokens.

This reframes the whole build calculus. You are no longer selecting the smartest model. You are engineering the cheapest reliable completion of a unit of work. That is an economics and execution problem, not a capability problem. The same underlying force is why I've argued the constraint is GPUs, not demand. When capability is abundant and cheap, demand explodes to meet supply, and the binding constraint becomes the physical cost of serving it.

What operators should do about it

If capability is commoditizing and cost is the frontier, then the winning moves are unglamorous and entirely about execution:

Instrument cost per completed task, not per token. The token price is a red herring. Measure what it costs to finish a real unit of work end to end.
Default to the cheapest sufficient model and route up only on the tail. Reserve the flagship for the fraction of cases where the last percent actually pays.
Design for unattended completion. The moment a human has to watch, your cost model is dominated by labor and the token savings are noise.
Move differentiation to data, harness, and reliability. Capability is the floor now. Your edge lives in the layers the commodity model can't provide.

Key takeaways

Sonnet 5 matches Opus 4.8 on knowledge work (GDPval-AA 1618 vs 1615) at roughly 40% of the cost. The mid-tier passed the flagship.
When the premium and mid-tier do the same job, the premium stops buying capability and starts buying a marginal tail.
Compute, storage, and bandwidth all commoditized the same way. Intelligence is now the floor, not the ceiling.
Frontier-grade agentic work is the default tier free and Pro users get, not the tier you pay up for.
The question shifted from "can the model do it" to "what does the task cost to finish, at scale, with nobody watching."
Differentiation moves to data, harness, reliability, and cost per completed task. Capability alone is no longer a moat.

The leaderboard-watchers are optimizing the wrong variable. They are still asking whether the model is smart enough, a question the market has already answered and priced to the floor. The operators who win the next cycle are asking what it costs to finish the work when the intelligence is free and the only scarce thing left is disciplined, unattended, economical execution. Capability is commoditizing. Cost is the new frontier. For the wider thesis, the manifest and the Joule Wars lay out where the joules, and the margins, actually go.

Route by Task, Not Vendor: The Open-Weight AI Stack

Michał Piszczek — Wed, 01 Jul 2026 09:00:00 +0000

Six months ago, "move your AI workloads to open Chinese models" was a thought experiment you floated in a strategy deck to sound forward-looking. Now it is a procurement story with real invoices attached. The migration is already happening at names you know, and it is not being driven by ideology. It is being driven by arithmetic.

Airbnb moved to Qwen, Alibaba's open-weight family. CEO Brian Chesky described it plainly: "very good, fast and cheap." It powers their support agent. Cursor built its Composer coding model on Moonshot's open weights, shipping as Kimi K2.5. Microsoft has been hosting and testing DeepSeek V4 inside Azure Foundry and Copilot. Shopify, Coinbase, Siemens, and Uber Eats have all been reported routing real production workloads to Qwen, GLM, Kimi, or DeepSeek.

None of them "switched to the best model." That framing misreads the entire decision. Each of them moved the right task to a cheaper open-weight model sitting within a few points of frontier. The distinction matters more than any benchmark leaderboard, because it inverts the question everyone has been asking.

The question was never "which model is best?"

For three years the industry has treated model selection as a single global decision. You pick the smartest model, you wire everything to it, you feel safe. That instinct is expensive and increasingly wrong. The real question is narrower and far more useful: which task actually needs the best model?

Look at what production traffic is actually made of. The overwhelming majority of it is the boring 80% — extraction, classification, summarization, routing, simple tool calls, reformatting, deduplication. This is plumbing. It does not require a model that can reason through a novel proof or design a distributed system. It requires a model that is competent, fast, and cheap.

Frontier models are priced for the hard 20% — the genuinely difficult reasoning, the long-horizon planning, the cases where an extra few points of quality translate into measurable business value. That is what you are paying a premium for. When you send the easy 80% through a frontier API, you are paying that premium on every request that never needed it.

Paying frontier prices for the easy 80% is one of the biggest sources of AI budget waste in production today.

Route by task, not by vendor

The architecture that follows is not exotic. It is a routing table. You classify the task, then you send it to the cheapest model that clears the quality bar for that task. In practice, a stack that holds up in production looks something like this:

Reasoning — GLM or Kimi, which now sit close enough to frontier that the gap rarely shows up in real workloads.
Code — Kimi Code or Qwen Coder for the bulk of generation and refactoring.
Agents and tool calls — GLM, which handles structured tool invocation reliably at a fraction of closed-API cost.
Bulk processing — MiMo, where you are grinding through volume and latency-per-dollar dominates.
Images and video — fine-tuned LTX plus Wan, tuned to your own domain.
Local workhorse — Qwen3.6-35B-A3B, the model that runs on your own hardware and quietly handles the daily grind.

Almost all of these are open-weight, self-hostable, close to frontier, and a fraction of the cost. This is the same principle that runs underneath capability commoditizing while cost becomes the frontier: when the models converge on quality, the differentiation moves to how efficiently you deploy them.

Savings are the visible win. Ownership is the real one.

The cost delta is what gets the CFO's attention, and it is real. But savings are not the point. The point is that you own the stack. When the core of your system runs on weights you hold, nobody can switch you off. Nobody can revoke your access on their timeline. Nobody can see your data, dictate your pricing, or quietly reshape your roadmap by changing theirs.

That is a business-continuity property, not a line item. It is the same argument that sits underneath the question of who owns your harness — the orchestration layer that actually knows how your company works. Open weights are the only ones nobody outside your walls can turn off.

Clearing up the "my data goes to China" reflex

There is a reflexive objection worth killing directly. "Chinese model equals my data goes to China" is simply wrong for open weights. Open weights run on your infrastructure. The weights may originate in a lab in Hangzhou or Beijing, but the weights are a static artifact — a file of numbers. When you self-host them, your data never leaves your servers. It goes to your GPUs, not theirs.

This is why the real boundary is not American versus Chinese. It is open-weight versus closed. A closed American API can log your prompts, change its terms, and go dark on a government's schedule. A set of open weights running in your own datacenter cannot do any of those things, regardless of which country trained it. The nationality of the training run is a distraction; the deployment topology is the actual security boundary.

When to still pay for closed

None of this means closed models are obsolete. They still lead on some cinematic, high-stakes workloads where the last few points of quality genuinely move the needle. The discipline is to pay for a closed model when the quality gap creates measurable business value — not by default, not out of habit, and not because it is the name everyone recognizes.

The rule is simple to state and harder to enforce: open-source first, self-host the core, pay for frontier only where it creates value you cannot get elsewhere. Enforcing it means building a routing layer, maintaining evals per task, and resisting the temptation to route everything to the smartest model because it is easier.

Key takeaways

The right question is not "which model is best?" but "which task actually needs the best?"
Most production traffic is the boring 80% — extraction, classification, routing — and frontier pricing on it is pure waste.
Route by task, not vendor: match each workload to the cheapest model that clears its quality bar.
Open weights self-hosted mean your data never leaves your servers, whatever the model's country of origin.
The real boundary is open-weight versus closed, not American versus Chinese.
Pay for closed models only where the quality gap creates business value you cannot get otherwise.

The race stopped being about the smartest model. It became about architecture that still works when today's smartest model is unavailable, unaffordable, or switched off. I write more about that shift across my essays on execution and AI infrastructure. Build the routing table now, while it is still a competitive edge rather than table stakes.

Who Owns Your Harness? The Layer Above the Model

Michał Piszczek — Tue, 30 Jun 2026 09:00:00 +0000

Lately I find myself less interested in which model wins and far more interested in who owns the layer above the model. The benchmark wars — GLM versus Claude versus GPT versus Qwen versus DeepSeek — are already yesterday's conversation. Models improve fast and get cheaper faster. Open-source is closing the gap ahead of every schedule people drew a year ago. So "which model should we use?" is not the question. It never was.

The question that actually determines whether your company survives a bad quarter in AI policy is this: can we replace the model tomorrow? If the honest answer is no, you have already made the most expensive architectural decision of the decade without noticing.

You think you're buying AI. You're wiring an operating system.

Most companies believe they are buying AI the way they buy a database or a cloud region — a component, swappable, bounded. What they are actually doing is wiring their entire execution layer around a single vendor. The prompts. The memory. The agents. The routing logic. The evals. And then every integration on top: Slack, Jira, GitHub, the internal tools, the accumulated company knowledge that no one wrote down anywhere else.

Bit by bit, the model stops being a model. It becomes the operating system of the business. Every workflow assumes its quirks. Every prompt is tuned to its behavior. Every engineer's mental model of "how our AI works" is really a mental model of one vendor's API. That is where lock-in begins — not in a contract clause, but in a thousand small couplings nobody tracked.

Lock-in stopped being a commercial inconvenience

For most of software history, lock-in was a negotiating problem. You paid a switching cost, you grumbled, you migrated over a quarter. Annoying, survivable. That era is over for AI. Lock-in became a business-continuity risk, and recent events proved it in the harshest way possible.

A single US export-control order took Anthropic's top models — Mythos 5 and Fable 5 — offline for two weeks. Not just for foreign users. To stay compliant, they were pulled for everyone worldwide, the United States included. Every company that had wired its product around those models lost its core capability overnight, through no decision of its own.

Days later, GPT-5.6 shipped only as a gated, US-only preview, after Washington reportedly asked OpenAI to hold the launch. Two data points, one lesson: the model under your product can go dark on a government's timeline, not yours.

Closed is closed the moment someone decides it's closed to you — and that someone may not be your vendor.

That last clause is the whole point. You can have a perfect relationship with your model provider, pay every invoice on time, and still lose access because a regulator three time zones away signed an order. Your vendor's goodwill is irrelevant when the constraint sits above the vendor. This is the shift I described in the model wars ending and the clearance wars beginning: capability now ships when it clears, not when it's ready.

The default has to flip: open-source first

The conclusion writes itself. The default posture must invert. Open-source first — not because open models win every benchmark, because they do not yet, but because open weights are the only ones nobody can switch off. A file of weights sitting on your own hardware does not care what any government decides next week. It is inert, and it is yours.

Keep closed models for the frontier edge, the genuinely hard workloads where the quality gap earns its premium. But the core you cannot afford to lose should sit on weights you actually hold. This is the same architecture I lay out in routing by task, not vendor: self-host the core, pay for frontier only where it creates value you can't get elsewhere.

Above the model sits the harness

And above every model — open or closed — sits the harness. This is the layer that actually matters, and it should be vendor-agnostic by design. The harness owns:

Memory — what your system remembers across sessions, users, and workflows.
Context — how the right information reaches the model at the right moment.
Routing — which task goes to which model, and the fallback when one goes dark.
Permissions — who and what is allowed to do which action.
Tools — the integrations that let the model act on your actual systems.
Evals — how you know quality held after you swapped a model underneath.
Orchestration — the logic that ties it all into something that works.

The harness is the part that actually knows how your company works. The model is a replaceable engine bolted into it. If your harness is well-built and vendor-agnostic, swapping models is a config change and a re-run of your evals. If it is not — if the harness and the vendor are the same thing — then a model going dark takes your whole business with it.

The highest-ROI call of the decade

Building your own harness is slower and costlier today. There is no way around that. It is more engineering, more discipline, more upfront investment than plugging into one vendor's SDK and shipping. That is exactly why most teams will not do it until they are forced to.

But it may be one of the highest-ROI calls of the decade. Models come and go. Governments reshuffle who gets access to what, and on what timeline. Your company brain — the accumulated knowledge, workflows, and judgment encoded in your execution layer — should depend on neither. It should sit in a harness you own, feeding whichever model happens to be best, cheapest, and available this month.

Key takeaways

The right question is not "which model do you use?" but "can you replace it tomorrow?"
Companies think they're buying AI; they're wiring their whole execution layer around one vendor.
Lock-in became a business-continuity risk — a single export order pulled Anthropic's top models worldwide for two weeks.
Closed is closed the moment someone decides it's closed to you, and that someone may not be your vendor.
Default to open-source for the core; open weights are the only ones nobody can switch off.
Own the harness — memory, context, routing, permissions, tools, evals, orchestration — and models become swappable engines.

We are going to stop asking "which model do you use?" and start asking who owns your harness. I write about that transition and the architecture it demands across my essays on AI infrastructure and execution. The teams that build the harness now will be the ones still running when the next model goes dark.

The Model Wars Are Over. The Clearance Wars Begin.

Michał Piszczek — Sat, 27 Jun 2026 09:00:00 +0000

OpenAI previewed GPT-5.6 and, in doing so, became the first lab to ship a model the US government clears for use customer by customer. Read that as an engineering release and you miss everything. The model is not the story. The gate is. For the first time, a frontier capability arrived not when it was built, but when it was permitted — and the permission is granted one customer at a time.

What actually shipped

The release has three tiers. Sol is the flagship. Terra sits in the same class as GPT-5.5 at roughly half the cost. Luna is the cheapest, built for volume. There is a new ultra mode that spins up subagents to attack harder problems in parallel. On the surface, this is a clean, well-segmented product line — the kind of tiering that signals a mature lab that understands its cost curve.

But every one of those capabilities shipped behind a limited preview only. And before release, OpenAI shared the model's capabilities with the US government. That sequencing is the whole point. The product decisions are downstream of a clearance decision that happened first.

Why the gate exists: cyber

The reason for the gate is not vague safety hand-waving. It is cyber, and the numbers are specific. On ExploitBench, Sol matches Anthropic's Mythos while using roughly one-third of the output tokens. It finds bugs and exploitation primitives in Chromium and Firefox — real browsers that billions of people run — stopping short of a full autonomous exploit, but not by a comfortable margin.

OpenAI spent 700,000 A100-equivalent GPU hours red-teaming its own safeguards before release. That is not a rounding error in a training budget; that is a deliberate, industrial-scale effort to understand what the model can do before anyone outside gets to ask it. And Sol runs on Cerebras at 750 tokens per second starting in July, which means whatever it can do, it can do fast and at scale.

Put those facts together and the gate is not paranoia. A model that finds exploitation primitives in the world's most-used browsers, at a third of the token cost of the prior frontier, running at 750 tokens per second, is a genuinely dual-use artifact. The lab knew it. The government knew it. The preview gate is the compromise that let it ship at all.

The quiet part, said out loud

OpenAI said the part most labs would keep internal: it does not want government approval to become the long-term default. That is a remarkable admission. It means the company shipping the model understands that the clearance regime it just participated in is a threshold being crossed, not a one-off accommodation. They cleared this launch and simultaneously warned against the precedent of clearing launches.

Capability used to ship the day it was ready. Now it ships when it's cleared. The bottleneck moved from compute to permission.

I have watched this exact pattern before, and it is worth being precise about the analogy, because the analogy is the argument.

We have run this playbook three times now

Strong encryption was classified as a munition. For years, exporting cryptographic software above a certain key length was legally equivalent to exporting weapons. The capability existed; shipping it required clearance. GPS shipped with selective availability — the civilian signal was deliberately degraded, its full precision reserved and released only when the government decided the strategic calculus had changed. Now inference joins that list.

The through-line is consistent. When a technology becomes strategically decisive, the state stops treating it as a product and starts treating it as a controlled capability. The pattern has three stages:

Classification — the capability is recognized as dual-use and reframed as a matter of national security rather than commerce.
Gating — release is made conditional on clearance, whether by export license, degraded signal, or customer-by-customer approval.
Selective release — the full capability flows only to approved parties, on the government's timeline, not the builder's.

Encryption followed it. GPS followed it. GPT-5.6 is the first frontier model to follow it explicitly, with the government briefed before launch and access granted per customer. That is not a coincidence of one release. It is a category shift.

The bottleneck moved

For three years the constraint on AI progress was compute. Whoever had the most GPUs, the best data, and the best training runs shipped the best model, and they shipped it the moment it cleared their own evals. That world is closing. The new constraint is permission. You can have the compute, the data, the trained weights sitting on disk — and still not be allowed to ship, or only allowed to ship to a vetted list, on a schedule set outside your building.

This is why the strategic questions have changed. It is no longer only "who has the best model?" It is "who is allowed to run it, where, and for whom?" That reframing is the same one I trace in what Washington actually regulated — the muzzle, not the model: the control point moved from the artifact to its use. And it is why the architectural imperative I describe in who owns your harness matters more every quarter. If the model under your product can be gated by a government, the layer you own becomes the only thing you can count on.

Key takeaways

GPT-5.6 is the first frontier model cleared by the US government customer by customer — the gate, not the model, is the story.
The gate exists because of cyber: Sol matches Anthropic's Mythos on ExploitBench at a third of the tokens and finds exploitation primitives in Chromium and Firefox.
OpenAI spent 700,000 A100-equivalent GPU hours red-teaming its own safeguards before release.
Encryption became a munition; GPS shipped with selective availability; inference now joins the list of gated strategic capabilities.
The bottleneck moved from compute to permission — capability ships when it's cleared, not when it's ready.
The strategic question shifted from "who has the best model?" to "who is allowed to run it, where, and for whom?"

The model wars are over. The clearance wars just started. I track that shift and what it means for anyone building on top of frontier models across my essays on AI policy and execution. Plan your architecture for a world where the smartest model available to you is the one you are cleared to run.

The Biggest Customer Becomes the Competitor

Michał Piszczek — Sat, 27 Jun 2026 09:00:00 +0000

OpenAI designed its own AI chip in nine months and aimed it straight at Nvidia, the supplier it cannot survive without. Codenamed Jalapeño, co-designed with Broadcom. The bill forced the move.

A custom chip usually takes two to three years from design to working silicon. OpenAI did it in nine months. The compression is not a footnote; it is the whole point. OpenAI used its own models to accelerate the design cycle, turning frontier inference back onto the problem of building the hardware that runs frontier inference. The snake ate part of its own tail, and the tail grew back faster.

Jalapeño is inference-only. It is tuned for the workloads OpenAI actually runs at scale: ChatGPT, Codex, the API, agents. It is not built for training. That narrowing is deliberate, and it is where the leverage lives. When you know your workload down to the token, you can throw away everything a general-purpose GPU carries to serve a thousand customers you are not. Early tests claim better performance-per-watt than today's best GPUs. At gigawatt scale, performance-per-watt is not a spec-sheet vanity metric. It is the P&L.

The pattern is older than OpenAI

This is not a surprise if you have watched infrastructure economics before. Google built TPUs because renting general-purpose accelerators for search and ads and, later, Gemini stopped making sense at their volume. Amazon built Trainium and Inferentia because AWS could not let the margin on every AI workload flow to a single supplier. Now OpenAI builds Jalapeño for exactly the same reason, and the reason is arithmetic.

The rule generalizes: the biggest customer always becomes the next competitor, because the bill forces it. When you are a small buyer, renting is obviously correct. The vendor amortizes billions in R&D across thousands of customers, and your slice is cheap. When you become the largest single consumer of a component, the math inverts. You are now underwriting a meaningful fraction of the vendor's margin, and that margin is a tax you pay on your own scale. At some volume, designing the thing yourself is cheaper than renting it, and every dollar of vendor margin you eliminate is a dollar that compounds.

Renting compute is a cost. Designing it is a moat. The difference is who owns the workload.

Old stack, new stack

The old stack was simple and stable. One vendor designs the silicon. Everyone else rents it. Nvidia sat at the top of that pyramid, and the pyramid was the entire industry. Access to Nvidia was the bottleneck, and allocation of Nvidia's chips was a story that moved markets. Whoever got the biggest allocation won the round.

The new stack rearranges the pyramid. The buyer designs the silicon, and the vendor becomes optional. Not eliminated, optional. That word does a lot of work. OpenAI will still buy Nvidia for training, for burst capacity, for the workloads where a general-purpose part still wins. But the strategic dependency loosens the moment a credible in-house alternative exists for the workload that dominates the bill. The bottleneck moves from access to Nvidia to ownership of the workload. Once you own the workload end to end, you get to decide how much of it to rent and how much to build.

Why inference, and why now

Inference is the right place to start vertical integration, and the timing is not an accident. Training is bursty, experimental, and moves with the research frontier; the workload changes shape every few months, which punishes custom silicon built around fixed assumptions. Inference at OpenAI's scale is the opposite. It is enormous, steady, and increasingly well understood. The company serves the same handful of model architectures to hundreds of millions of users, billions of times a day. That is exactly the profile that rewards a chip designed for one job and stripped of everything else.

The economics compound with agents. As I have argued in the unit of work is the agent-hour, output is going parallel: work is no longer bounded by human hours but by how many agents you can run at once. Every one of those agent-hours is inference. The inference bill is not a fixed cost you optimize once; it is the growth curve itself. Owning the silicon under that curve is owning the cost structure of your own future.

What OpenAI is really buying

Read past the chip and you can see what OpenAI is actually acquiring. It is not just cheaper tokens. It is control over its own cost curve, its own roadmap, and its own supply chain in a market where compute is the binding constraint. As I have written in OpenAI is GPU-constrained, not demand-constrained, the company's growth ceiling is set by silicon it does not manufacture. Jalapeño is the structural answer to that constraint. It is the first chip in a multi-generation roadmap, which tells you this was never a one-off experiment. It is a commitment to owning the bottom of the stack.

Here is the framework I use to decide when a big buyer should stop renting and start building:

Volume concentration. When one workload dominates your spend, the vendor's margin on that workload becomes your largest controllable cost. Concentration is the trigger.
Workload stability. Custom silicon rewards a job that will not change shape for years. Inference qualifies; frontier training does not, yet.
Design-cycle leverage. If you can compress the two-to-three-year chip cycle, as OpenAI did with its own models, the payback window shrinks and the bet gets far safer.
Strategic optionality. Even a good-enough in-house part changes your negotiating position with the incumbent vendor. The threat of building is worth money before the chip ships.
Roadmap commitment. One chip is a science project. A multi-generation roadmap is a business decision. Only the second one moves the moat.

What breaks next

If the largest AI buyers all vertically integrate, Nvidia does not disappear, but its position changes. It moves from the sole source of frontier compute toward a supplier of training and burst capacity, competing against the in-house parts of its biggest former customers. That is a different, thinner business than owning the entire pyramid. The interesting question is not whether Nvidia survives, it will, but what the market looks like when the five buyers who matter most each design the silicon for their own dominant workload.

The deeper shift is about where value accrues. For a decade, the story was that whoever controlled the scarce input, the chips, controlled the industry. Jalapeño is evidence that the scarce input is being routed around by the buyers with enough volume to justify the engineering. Value migrates from owning the general-purpose component to owning the specific workload well enough to build the component yourself. The bottleneck moved from access to ownership, and ownership is the more durable position.

Key takeaways

OpenAI designed Jalapeño, an inference-only chip, in nine months versus the usual two to three years, using its own models to compress the cycle.
The move follows a rule: the biggest customer becomes the next competitor, because concentrated volume turns vendor margin into your largest controllable cost.
Google (TPU) and Amazon (Trainium) ran this playbook first. OpenAI is the newest instance, not a novel one.
Inference is the right entry point for vertical integration: enormous, steady, and well understood, unlike frontier training.
The bottleneck moved from access to Nvidia to ownership of the workload. Renting compute is a cost; designing it is a moat.
A multi-generation roadmap, not a single chip, is what turns this from a science project into a structural change in the market.

The chip allocation era trained everyone to watch who got the most GPUs. That was the old bottleneck. The new one is quieter: which buyers understand their own workload well enough to stop renting and start building. For the wider map of how compute, clearance, and control connect, start with the manifest and the Joule Wars thesis. The supplier you cannot survive without is the one you eventually have to replace.

The Unit of Work Is the Agent-Hour

Michał Piszczek — Fri, 26 Jun 2026 09:00:00 +0000

OpenAI published usage data from inside its own walls. Its 99th-percentile employees now run more than 60 hours of agent work every single day. Sixty hours inside a 24-hour day is not overtime. It is a different unit of work.

The number breaks your intuition on purpose. You cannot fit 60 hours of labor into a day if labor is something a human performs sequentially with two hands and one attention span. You can fit 60 agent-hours into a day trivially, because agent-hours run in parallel and the human is no longer the one doing them. That single figure marks the boundary between the old model of work and the one replacing it.

The rest of OpenAI's report fills in the shape. The average employee now produces 85% of their output through Codex, not typed, delegated. Across the company, agents already account for 99.8% of weekly output tokens. The humans are still deciding what gets built and whether it is right. They have almost entirely stopped being the ones who produce it.

The work did not get faster. It went parallel.

This is the distinction people miss, and it changes every downstream conclusion. "Faster" is a story about the same sequential process compressed in time: the same person doing the same task in less time. That is a linear improvement, and linear improvements have ceilings set by the human at the center.

Parallel is a different regime entirely. The human stops executing tasks one after another and starts dispatching many at once, each running independently while attention moves elsewhere. The constraint is no longer how fast you work. It is how many streams of work you can start, supervise, and accept. The growth-team data makes the pattern concrete: research teams show 56 times more agent use than seven months ago, customer support 32 times, engineering 27 times, even legal 13 times. Those are not efficiency gains. Those are step changes in how many things happen at once.

The human stops doing the work and starts approving it. That is not a productivity upgrade. It is a change in what a person is for.

Old company, new company

The old company had a simple production function: headcount times hours. Output was labor, and labor was people multiplied by the time each one worked. It scaled linearly and it scaled by hiring. If you wanted more output, you added heads, onboarded them, managed them, and absorbed the coordination cost of every new person. The ceiling was real, and everyone knew where it was.

The new company has a different production function: agents times parallelism. Output is a function of how many agents you can run and how many you can run at once, and there is no ceiling you can staff your way to. This is not a rhetorical flourish. It is a structural claim about where the limit sits. In the old company, the binding constraint was people. In the new one, the binding constraint is your ability to specify work clearly and verify it correctly at volume. Those are different muscles, and most organizations have only trained the first one.

We have seen this abstraction before

The move is not unprecedented; it is the same move computing has made twice already. Compilers did it to assembly. Programmers stopped hand-writing the instructions the machine executes and started writing intent, letting the compiler generate the instructions. The programmer's job moved up a level, from producing machine code to specifying behavior and checking the result. Nobody mourns hand-written assembly.

The cloud did it to servers. Operations teams stopped racking physical machines and started declaring the infrastructure they wanted, letting the provider produce it. The unit of work stopped being the server you touched and became the capacity you specified. In both cases the human did not become less important. The human moved to a higher level of abstraction and became responsible for more, because each unit of their attention now commanded far more underlying work. Agents are the third instance of the same pattern, applied to knowledge work itself.

What the agent-hour measures, and what breaks

When the unit of work is the agent-hour, the metrics that ran the old company stop describing the new one. Headcount measured the old company because headcount was the input that produced output. Throughput measures this one, because output is now decoupled from the number of people. A ten-person team running thousands of agent-hours a day is not a ten-person team in any meaningful sense. It is a throughput engine with ten people steering it. Counting the people tells you almost nothing about what it produces.

Two things break as this lands, and both are worth naming before they surprise you:

Verification becomes the bottleneck. When agents produce 99.8% of output, the scarce human resource is the judgment that accepts or rejects it. I have argued this at length in verification cost is the new bottleneck: the constraint moves from producing work to confirming it is correct, and that cost does not fall as fast as generation cost.
Org charts stop mapping to output. If throughput is agents times parallelism, then seniority, span of control, and headcount budgets are measuring the wrong thing. The high-leverage person is the one who specifies and verifies the most agent-hours, not the one who manages the most people.

The delegation itself compounds. Every agent-hour is inference, and inference at this volume is a cost curve, not a fixed line. That is why the biggest operators are moving to own the silicon underneath it, a shift I traced in the biggest customer becomes the competitor. The agent-hour is both the new unit of work and the new unit of spend, and the two are the same number read from opposite ends.

How to operate when the unit changes

If the unit of work is the agent-hour, the skills that matter shift accordingly. The premium moves to specification, decomposition, and verification, the three things a human still does that an agent cannot yet do for itself. Writing a clear enough instruction that an agent produces the right thing is a skill. Breaking a large goal into parallelizable pieces is a skill. Judging correctness at the rate agents generate output is the scarcest skill of all, and the one most organizations have not started training.

The forward-looking version is uncomfortable and worth sitting with. If output is agents times parallelism, then the competitive gap between two companies is no longer a hiring gap. It is a gap in how well each one specifies and verifies work at scale. That gap is invisible on an org chart and enormous in throughput. The company that learns to run agent-hours well will out-produce the company that keeps counting heads, and the head-counting company will not understand why until it is far behind.

Key takeaways

OpenAI's top employees run 60+ agent-hours per day, and agents produce 99.8% of the company's weekly output tokens. Work went parallel, not just faster.
The old production function was headcount times hours, which scales linearly by hiring. The new one is agents times parallelism, with no ceiling you can staff your way to.
Compilers did this to assembly and the cloud did it to servers: the human moves up a level of abstraction and becomes responsible for more.
The human stops producing work and starts approving it, which makes verification the scarce resource.
Headcount measured the old company; throughput measures this one. Org charts stop mapping to output.
The competitive gap is now a specification-and-verification gap, invisible on an org chart and enormous in throughput.

The industrial era measured work in hours because a person was the engine. That era is closing. The unit of work is no longer the hour; it is the agent-hour, and the companies that learn to count it will look nothing like the ones that keep counting people. For the wider argument about where capability, cost, and control are heading, start with the manifest and the Joule Wars thesis. Measure throughput, not headcount, and you will see the new company before it is obvious.

Language World Models: Predict Before You Act

Michał Piszczek — Thu, 25 Jun 2026 09:00:00 +0000

Alibaba's Qwen team open-sourced a model that does not act in the world. It imagines it. Qwen-AgentWorld is a language world model, trained from day one to simulate the environment itself rather than to pick the next click.

Start with what every agent you have used actually does. Claude Code, Cursor, an Android automation bot, all of them were trained to choose the next action: click here, run this command, call that tool, then find out what happens. The environment is a black box the agent pokes and observes. Learning means poking the real box enough times to build an intuition for how it responds. That works, but it is expensive, slow, and dangerous, because the box you are learning on is production.

Qwen-AgentWorld flips the direction of the arrow. Feed it a state and an action, and it predicts the next state. Not "what should I do" but "what will the world do back." It was trained across seven domains, terminal, web, operating system, Android, code repositories, search, and MCP tools, to model how each of those environments responds to actions. It is not the driver. It is the road.

The driving-simulator analogy

The cleanest way to understand the shift is the one the Qwen team themselves reach for. Most agents are a driver who only ever learned on real roads. Every lesson is a live drive, with real traffic and real consequences, and the only way to learn a rare situation is to encounter it for real. Qwen-AgentWorld is the driving simulator. It is the model of the road that lets you practice the crash without crashing.

And it is good enough to matter. On AgentWorldBench, the benchmark released alongside it, the 397B version outscores frontier models including Claude Opus 4.8 and GPT-5.4 at environment simulation. That is the load-bearing result. A simulator is only useful if its predictions match reality; a bad simulator teaches bad habits. Qwen-AgentWorld predicts what environments do better than the frontier models built to act in them. The simulator is now more accurate than the drivers.

Most agents are a driver who only learned on real roads. This one is the simulator, and it now models the road better than the frontier models drive it.

Why simulated training beats real training

The practical payoff is agent training that is cheaper and safer, and the safety argument is not abstract. Recall the Cursor "deleted prod DB in 9 seconds" story, an agent with real access to a real database doing irreversible damage before anyone could intervene. That is what training in the real environment risks by default. Every half-trained agent you loose on a live system is a live grenade, and the cost of a mistake is not a bad gradient, it is a destroyed database.

A language world model changes the economics of learning. You train the agent inside the simulated world first, where a catastrophic action costs nothing but a token budget. The agent can delete the simulated production database a thousand times, learn that the action is catastrophic, and never touch a real one until it has internalized the lesson. Simulated training beats real training on every axis that matters at scale:

Cost. Simulated steps are inference, not infrastructure. You do not provision a real terminal, repo, or Android device for every training episode.
Safety. Irreversible actions are reversible in the simulator. The blast radius of a mistake is zero.
Coverage. Rare and dangerous states, the ones you cannot ethically or affordably reproduce in production, can be generated on demand.
Speed. The simulator runs as fast as inference allows, decoupled from the latency of real systems.

Imagination transfers

The more interesting claim is subtler than safe training, and it is the one worth sitting with. When an agent internalizes world modeling as a warm-up, it gets better at real tasks even with zero task-specific fine-tuning. Predicting before acting is not just a way to generate safe practice data. It is a capability that transfers. An agent that has learned to model what the world will do carries that model into every real task, and it acts better because it can anticipate consequences instead of discovering them.

This mirrors something we already believe about human expertise. The expert is not the one with the fastest reflexes. It is the one who has internalized a model of the domain accurate enough to predict outcomes before committing to a move. World modeling is that faculty, made explicit and trainable. Imagination, it turns out, is not decoration on top of intelligence. It is a large part of what intelligence is for.

The open-weights angle

The distribution story matters as much as the capability. The headline benchmark used the 397B model, but the team also released Qwen-AgentWorld-35B-A3B, a Mixture-of-Experts model with 35B total parameters and only 3B active per token. That architecture is the point: it runs cheap, because you pay compute for the 3B active per token, not the full 35B, while retaining the knowledge of the larger count. Add a 256K context window and you have a world model a small team can actually run. It is on HuggingFace, GitHub, and ModelScope, with the benchmark alongside it.

Notice the direction of travel. This is another open-weights drop from China while the frontier labs lock down. The pattern is consistent enough to be a strategy, and it is the same one I traced in route by task, not vendor: capability arrives as open weights you can route to, not just as an API you rent. When the simulator is open, training better agents stops being the exclusive privilege of whoever owns the largest closed model. The simulator becomes a public good, and public goods reshape who gets to build.

That connects directly to how work itself is changing. As I argued in the unit of work is the agent-hour, output is going parallel across armies of agents. Every one has to be trained, and training in the real world does not scale, it is too slow, too expensive, and too dangerous. A cheap, open, accurate world model is what makes agent-hours safe to manufacture at volume. You cannot run millions of them if each new agent learns by breaking production first.

What to watch next

The forward-looking question is whether world modeling becomes a standard layer in the agent stack rather than a research curiosity. My read is that it does, and quickly, because the economics are too favorable to ignore. If a warm-up in a simulated world produces better real-world agents at zero marginal task-specific cost, then not doing it becomes the expensive choice. Teams shipping agents into production will train them in simulators first, the same way we test software before we deploy it.

The deeper shift is where the leverage sits. For a while the frontier was the agent, the thing that acts. Qwen-AgentWorld is a bet that the frontier is moving to the world model, the thing that predicts. Whoever owns the most accurate, cheapest, most open simulator of the environments agents operate in owns the factory that produces good agents. That is a more durable position than owning any single agent, and it is now, at least in part, a public good.

Key takeaways

Qwen-AgentWorld is a language world model: given a state and an action, it predicts the next state across seven domains, instead of choosing the next action.
On AgentWorldBench, the 397B version outscores frontier models including Claude Opus 4.8 and GPT-5.4 at environment simulation.
Training agents in a simulator beats training in real environments on cost, safety, coverage, and speed, no more "deleted prod DB in 9 seconds."
Imagination transfers: an agent that internalizes world modeling as a warm-up performs better on real tasks with zero task-specific fine-tuning.
The 35B-A3B Mixture-of-Experts version runs cheap (3B active per token, 256K context) and ships open on HuggingFace, GitHub, and ModelScope.
Another open-weights drop from China while frontier labs lock down. The simulator is now a public good, and public goods reshape who gets to build.

We spent years teaching agents to act and find out. The next move is teaching them to predict before they act, and to practice in a world that costs nothing to break. For the wider map of how open weights, routing, and control fit together, start with the manifest and the Joule Wars thesis. The frontier is quietly moving from the actor to the model of the world it acts in.