<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michał Piszczek</title>
    <description>The latest articles on DEV Community by Michał Piszczek (@pich).</description>
    <link>https://dev.to/pich</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3704641%2F5a61f624-9755-4edd-8adb-1d9eb5bd6080.png</url>
      <title>DEV Community: Michał Piszczek</title>
      <link>https://dev.to/pich</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pich"/>
    <language>en</language>
    <item>
      <title>Coding Agent Bans Are the New Export Controls</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Fri, 03 Jul 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/coding-agent-bans-are-the-new-export-controls-13ap</link>
      <guid>https://dev.to/pich/coding-agent-bans-are-the-new-export-controls-13ap</guid>
      <description>&lt;p&gt;One government un-bans the models on Monday; a $200B company bans the coding agent by Friday. The tool didn't get worse. It got too good.&lt;/p&gt;

&lt;p&gt;The sequence is what matters. This week Washington lifted export controls on Anthropic's Fable 5 and Mythos 5. Days later, Reuters reported that Alibaba is banning Claude Code company-wide, effective July 10. The stated reason: alleged backdoors and fingerprinting of China-linked users. The recommended replacement: Alibaba's own coding agent, Qoder. Read those two events in order and the shape of the next decade of AI policy falls out.&lt;/p&gt;

&lt;p&gt;We spent two years arguing about who can access which weights. That fight is ending. The new frontier is not the model at all. It is whether you trust the agent that runs inside your development environment, reading your codebase, writing your commits, touching every repository you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the coding agent is a harder problem than the model
&lt;/h2&gt;

&lt;p&gt;A model behind an API is a black box you query. You send text, you get text back. The blast radius is your prompt and its response. You can log it, filter it, sandbox it. The trust surface is narrow because the interaction is narrow.&lt;/p&gt;

&lt;p&gt;A coding agent is a different animal entirely. It sits &lt;em&gt;inside&lt;/em&gt; the IDE. It has read access to the full source tree. It writes code that ships. It runs shell commands. It authenticates against internal systems to be useful. To do its job well, it needs exactly the privileges you would never grant a piece of foreign software you didn't fully control.&lt;/p&gt;

&lt;p&gt;That is the tell in the Alibaba decision. A coding agent that is genuinely productive is, by construction, a genuinely privileged process. The better it gets at the job, the more it must see and touch. Productivity and trust are not independent variables here. They are the same variable read from opposite ends.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When a foreign agent sits inside your IDE, productivity stops being the question. Trust does.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The playbook is not new. Only the target is.
&lt;/h2&gt;

&lt;p&gt;I have watched this exact pattern run before, on other technologies, in other decades. It is remarkably consistent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adopt the superior foreign tool.&lt;/strong&gt; It works better than anything domestic, so it wins on merit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure the dependence.&lt;/strong&gt; Once it is load-bearing across the organization, the strategic cost of losing it becomes visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ban it and clone it.&lt;/strong&gt; Rip it out on a security pretext, point everyone at the domestic replacement that was built in the shadow of the original.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Beijing's 2019 directive to strip foreign PCs and operating systems from government offices followed this arc. Huawei, cut off from Google's Android services, shipped HarmonyOS. Moscow swapped Windows for Astra Linux across ministries. In every case the foreign tool was the reference implementation the domestic clone was measured against; then the clone became the mandate. Qoder as the recommended replacement for Claude Code is not a footnote to this story. It is the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Both accusations can be true
&lt;/h2&gt;

&lt;p&gt;Here is where it gets uncomfortable for anyone who wants a clean villain. Anthropic has accused Alibaba-linked teams of distilling Claude at scale, pulling capability out of the model through relentless querying. Alibaba now accuses Claude Code of backdoors and user fingerprinting. People want to pick a side. You don't have to.&lt;/p&gt;

&lt;p&gt;Both can be true simultaneously. A model provider can defend its weights against extraction while a national champion defends its codebase against a privileged foreign process. These are not contradictory claims. They are the same underlying reality described by two parties with opposing interests: capability is valuable, capability is portable, and nobody wants the other side holding the keys to their most sensitive infrastructure.&lt;/p&gt;

&lt;p&gt;The distillation fight and the backdoor fight are two fronts of one war over who captures the value that flows through the developer's daily workflow. If you want the deeper economic version of this, I've written about how &lt;a href="https://dev.to/blog/the-biggest-customer-becomes-the-competitor"&gt;the biggest customer becomes the competitor&lt;/a&gt; once dependence is measured and the clone is ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Export controls block weights. Trust controls block workflows.
&lt;/h2&gt;

&lt;p&gt;This is the mechanical distinction that policy has not caught up to yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export controls&lt;/strong&gt; are a supply-side instrument. They restrict who can obtain the model, the weights, the chips. They are enforced at the border, by governments, against the flow of artifacts. Washington un-banning Fable 5 and Mythos 5 is an export-control action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust controls&lt;/strong&gt; are a demand-side instrument. They restrict what a tool is permitted to touch once it is inside your walls. They are enforced by IT and security teams, by procurement policy, against the flow of &lt;em&gt;access&lt;/em&gt;. Alibaba banning Claude Code is a trust-control action.&lt;/p&gt;

&lt;p&gt;The two operate on completely different layers, and the second one is far harder to legislate. You cannot inspect a coding agent at customs. Its risk is not in the binary you download but in the behavior it exhibits with privileged access over months. A government can un-ban a model with a stroke of the pen. It cannot un-ban trust. That has to be earned, audited, and continuously verified, which is a much slower, organizational process. This is the same reason I argue the real question is increasingly &lt;a href="https://dev.to/blog/who-owns-your-harness"&gt;who owns your harness&lt;/a&gt; rather than who owns the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for anyone shipping software
&lt;/h2&gt;

&lt;p&gt;If you build software and you use foreign-origin coding agents, the Alibaba decision is a preview of a question your own security team will eventually ask. Not "is the model good" but "what does this process see, and what would we lose if it were compromised or cut off." A few concrete moves follow from that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Treat coding agents as privileged infrastructure, not developer conveniences.&lt;/strong&gt; Inventory what they can read, write, and execute. If you can't answer that, you don't understand your exposure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assume the agent is a chokepoint, not a feature.&lt;/strong&gt; Anything load-bearing and foreign is a strategic dependency. Price the switching cost before you need to switch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate capability from access.&lt;/strong&gt; The model can be excellent and the access still unacceptable. Those are two decisions, and conflating them is how organizations get surprised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch the clones.&lt;/strong&gt; When a domestic equivalent appears next to a ban, the ban is not really about security. It is about capture.&lt;/li&gt;
&lt;/ul&gt;
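&lt;p&gt;The first bullet is easier to act on with a concrete inventory. Here is a minimal Python sketch, assuming a hypothetical schema in which each agent's grants are recorded as exact repository names and command names; &lt;code&gt;AgentGrant&lt;/code&gt; and &lt;code&gt;excess_scopes&lt;/code&gt; are illustrative names, not part of any agent's real configuration format. It diffs what an agent holds against what a baseline policy allows.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class AgentGrant:
    """Scopes granted to one coding agent (hypothetical inventory schema)."""
    name: str
    read: set[str] = field(default_factory=set)     # repository names the agent can read
    write: set[str] = field(default_factory=set)    # repository names it can commit to
    execute: set[str] = field(default_factory=set)  # shell commands or tools it can run


def excess_scopes(grant: AgentGrant, policy: AgentGrant) -> dict[str, set[str]]:
    """Return, per access type, the scopes the agent holds but the policy does not allow."""
    return {
        "read": grant.read - policy.read,
        "write": grant.write - policy.write,
        "execute": grant.execute - policy.execute,
    }


# Example inventory: exact-name matching only, no globs or wildcards.
policy = AgentGrant("baseline", read={"app-repo"}, write={"app-repo"}, execute={"pytest"})
agent = AgentGrant(
    "ide-assistant",
    read={"app-repo", "infra-repo"},
    write={"app-repo"},
    execute={"pytest", "curl"},
)
print(excess_scopes(agent, policy))
```

&lt;p&gt;Empty sets in every bucket mean the agent holds nothing beyond the baseline; anything left over is exposure nobody signed off on.&lt;/p&gt;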

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Export controls restrict weights; trust controls restrict workflows. The bottleneck moved from GPUs to IDE trust.&lt;/li&gt;
&lt;li&gt;A coding agent's productivity and its privilege are the same variable. The better it gets, the more it must access.&lt;/li&gt;
&lt;li&gt;Alibaba banning Claude Code the same week Washington un-banned Anthropic's models shows policy operating on two different layers.&lt;/li&gt;
&lt;li&gt;The adopt → measure dependence → ban and clone playbook has run before on PCs, Android, and Windows. Qoder is the clone.&lt;/li&gt;
&lt;li&gt;Distillation claims and backdoor claims can both be true; they are two fronts of one war over workflow value.&lt;/li&gt;
&lt;li&gt;Governments can un-ban a model with a stroke of the pen. They cannot un-ban trust. That is earned, audited, and slow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model wars trained everyone to watch the leaderboard. The next fight will be quieter and far more consequential: fought inside version control, procurement policy, and security review, over which agents are allowed to touch the code that runs the world. If you want the wider map of how this connects to clearance and control, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt; and the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt; thesis. The leaderboard is settled. The trust boundary is where the real contest begins.&lt;/p&gt;

</description>
      <category>aisecuritygeopolitic</category>
    </item>
    <item>
      <title>Washington Regulated the Muzzle, Not the Model</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Thu, 02 Jul 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/washington-regulated-the-muzzle-not-the-model-51k0</link>
      <guid>https://dev.to/pich/washington-regulated-the-muzzle-not-the-model-51k0</guid>
      <description>&lt;p&gt;Anthropic put Fable 5 back online worldwide. The fix tells you what Washington actually regulated. It was never the model.&lt;/p&gt;

&lt;p&gt;When the control fired, it fired on a borderline bypass, a request that skated the edge of an exploit demo. That was the trigger for the whole export-control episode. But here is the detail that collapses the official story: Anthropic's own testing showed Opus 4.8, GPT-5.5, and even the smaller Haiku 4.5 and Sonnet 4.6 could reproduce the same exploit demo. The capability was never unique to Fable 5. It was ambient. It lived in every frontier and near-frontier model on the market.&lt;/p&gt;

&lt;p&gt;You cannot export-control mathematics that everyone already has. So the regulation did not target the capability, because there was no unique capability to target. It targeted whether the safeguard holds. That is a much narrower and much stranger thing to regulate, and once you see it, the entire architecture of modern AI policy reads differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tell is in the patch
&lt;/h2&gt;

&lt;p&gt;Look at what the fix actually does. When Fable 5 now blocks a request, it does not refuse and stop. It reroutes the request to Opus 4.8. And by Anthropic's own admission in its blog post, Opus 4.8 produces the same exploit demo Fable 5 was blocked from producing.&lt;/p&gt;

&lt;p&gt;So the capability did not leave the building. It was not removed, contained, or diminished. A request that Fable 5 declines gets handed to a sibling model that happily completes it. The output the control was designed to prevent is still one hop away, by design. Only the label on the door changed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The capability didn't leave. Only the label changed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If this were a safety upgrade, you would expect the dangerous output to become harder to obtain. It didn't. What changed is not the availability of the result. What changed is the paper trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not a safety upgrade. A chain of custody.
&lt;/h2&gt;

&lt;p&gt;Read the mechanism as a sequence and its real purpose becomes obvious:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Block.&lt;/strong&gt; The classifier flags the request as borderline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reroute.&lt;/strong&gt; It hands the request to Opus 4.8 instead of completing on Fable 5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log.&lt;/strong&gt; The event is recorded, the flag is stamped, the interaction is captured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notify.&lt;/strong&gt; The relevant parties are informed that a borderline request occurred.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is not containment. That is chain of custody. The point is not to stop the output from existing. The point is to ensure that when it exists, there is a record of who asked, when, and through which path. Regulators did not get a wall. They got an audit log. And for a lot of policy purposes, an audit log is what they actually wanted, because it converts an unmonitorable capability into a governable, attributable event.&lt;/p&gt;

&lt;p&gt;This is a meaningfully different thing from what the press release implies. The public framing is "we made the model safer." The mechanism is "we made the usage traceable." Those are not the same claim, and the gap between them is where the real policy lives.&lt;/p&gt;
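&lt;p&gt;To make the chain-of-custody reading concrete, here is a minimal Python sketch of a gate that follows the four steps. Everything in it is an assumption for illustration: the classifier, the 0.5 threshold, the audit-record fields, and the function names are hypothetical, not Anthropic's implementation. Note what the code does and does not do: a flagged request still returns an answer, just from the fallback model; what the gate adds is a record and a notification.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class AuditRecord:
    user_id: str
    prompt: str
    risk_score: float
    routed_to: str
    timestamp: str


def gated_completion(
    user_id: str,
    prompt: str,
    classify: Callable[[str], float],  # hypothetical classifier: risk score in [0, 1]
    primary: Callable[[str], str],     # the model the user asked for
    fallback: Callable[[str], str],    # the sibling model flagged requests go to
    audit_log: list[AuditRecord],
    notify: Callable[[AuditRecord], None],
    threshold: float = 0.5,            # hypothetical cut-off: a soft, tunable gate
) -> str:
    score = classify(prompt)
    if score >= threshold:
        # 1. Block: the primary model never sees the flagged request.
        # 2. Reroute: the request is still answered, by a different model.
        answer = fallback(prompt)
        # 3. Log: who asked, what, when, and through which path.
        record = AuditRecord(user_id, prompt, score, "fallback",
                             datetime.now(timezone.utc).isoformat())
        audit_log.append(record)
        # 4. Notify: surface the event to whoever monitors the log.
        notify(record)
        return answer
    return primary(prompt)  # not flagged: answered normally, nothing recorded


# Toy stand-ins so the sketch runs end to end.
log = []
alerts = []
common = dict(
    classify=lambda p: 0.9 if "borderline" in p else 0.1,
    primary=lambda p: "primary: " + p,
    fallback=lambda p: "fallback: " + p,
    audit_log=log,
    notify=lambda record: alerts.append(record.user_id),
)
print(gated_completion("u-1", "summarize this file", **common))
print(gated_completion("u-2", "borderline request", **common))
print(len(log), alerts)
```

&lt;p&gt;The threshold is the whole control surface. Any prompt the classifier scores below it goes straight through, which is why a classifier gate behaves more like a spam filter than a lock.&lt;/p&gt;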

&lt;h2&gt;
  
  
  What they regulated is a model regulating a model
&lt;/h2&gt;

&lt;p&gt;Sit with the recursion here. The safeguard is a classifier. A classifier is itself a model. So the object of regulation is a model whose job is to police another model. And a classifier, being a model, can be jailbroken like any other.&lt;/p&gt;

&lt;p&gt;Anthropic says as much in its own words: the safeguard is "probably impossible to make fully robust." That is not a hedge. It is the honest description of the situation. You have built a probabilistic gate to guard a probabilistic system, and both are susceptible to adversarial input. The muzzle is made of the same material as the thing it is muzzling.&lt;/p&gt;

&lt;p&gt;This matters because it changes what "compliance" even means. Compliance is no longer a binary property of the model. It is the current, defeatable state of a classifier that sits in front of it. Regulate that, and you have regulated something that can be talked around by a sufficiently clever prompt. The control is real, but it is soft, and everyone building on it should understand that it is soft. It is closer to a spam filter than a lock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same weights, two labels
&lt;/h2&gt;

&lt;p&gt;Now the commercial structure clicks into place. The same underlying weights ship two ways. They ship as Mythos to a vetted circle, cleared, unmuzzled, trusted. And they ship as Fable to everyone else, wrapped in the classifier, the rerouting, the logging.&lt;/p&gt;

&lt;p&gt;The intelligence is identical. What differs is the muzzle and who is trusted to operate without one. That is the entire product distinction. The model was never the product. The muzzle is the product. Access to the unmuzzled version is the premium tier, and clearance to skip the classifier is the thing of value.&lt;/p&gt;

&lt;p&gt;This is why I keep saying the &lt;a href="https://dev.to/blog/model-wars-are-over-clearance-wars-begin"&gt;model wars are over and the clearance wars are beginning&lt;/a&gt;. When the capability is ambient and the weights are shared, the only remaining lever is who is trusted to run them without a governor. That lever is not technical. It is political and institutional, and it is exactly where the value is migrating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this framing beats the official one
&lt;/h2&gt;

&lt;p&gt;If you take the official story at face value, you will make bad predictions. You will expect regulation to make capabilities disappear, and it won't, because the capability is everywhere and un-recallable. You will expect safeguards to be robust, and they aren't, because they are jailbreakable classifiers. You will expect the model to be the regulated object, and it isn't, because two labels ship from the same weights.&lt;/p&gt;

&lt;p&gt;Take the muzzle framing instead and your predictions get sharper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regulation will increasingly target monitoring and attribution, not capability, because capability can't be un-shipped.&lt;/li&gt;
&lt;li&gt;Safeguards will be soft controls, defeatable and probabilistic, marketed as hard ones.&lt;/li&gt;
&lt;li&gt;The commercial frontier moves to clearance: who gets the unmuzzled weights, and who is stuck with the governor.&lt;/li&gt;
&lt;li&gt;"Safety" and "traceability" will be used interchangeably in press releases, even though only one of them is actually being delivered.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fable 5's control fired on a borderline exploit that Opus 4.8, GPT-5.5, Haiku 4.5, and Sonnet 4.6 could all reproduce. The capability was never unique.&lt;/li&gt;
&lt;li&gt;You can't export-control math everyone has, so regulation targeted whether the safeguard holds, not the capability itself.&lt;/li&gt;
&lt;li&gt;When Fable 5 blocks a request it reroutes to Opus 4.8, which produces the same output. The capability never left; only the label changed.&lt;/li&gt;
&lt;li&gt;Block → reroute → log → notify is chain of custody, not containment. Regulators got an audit log, not a wall.&lt;/li&gt;
&lt;li&gt;The safeguard is a classifier, itself a model, and jailbreakable. Anthropic calls it "probably impossible to make fully robust."&lt;/li&gt;
&lt;li&gt;Same weights ship as unmuzzled Mythos to a vetted circle and muzzled Fable to everyone else. The muzzle is the product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The uncomfortable conclusion is that AI regulation, as currently practiced, does not regulate intelligence. It regulates the paperwork around intelligence. That may even be the right call given that the capability cannot be recalled. But we should be honest about what is being sold. The model is free to think what it thinks. What is governed is the record, the routing, and the clearance. If you want the map of where that leads, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt; and the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt;. The model was never the product. The muzzle is.&lt;/p&gt;

</description>
      <category>aipolicy</category>
    </item>
    <item>
      <title>Route by Task, Not Vendor: The Open-Weight AI Stack</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Wed, 01 Jul 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/route-by-task-not-vendor-the-open-weight-ai-stack-4edl</link>
      <guid>https://dev.to/pich/route-by-task-not-vendor-the-open-weight-ai-stack-4edl</guid>
      <description>&lt;p&gt;Six months ago, "move your AI workloads to open Chinese models" was a thought experiment you floated in a strategy deck to sound forward-looking. Now it is a procurement story with real invoices attached. The migration is already happening at names you know, and it is not being driven by ideology. It is being driven by arithmetic.&lt;/p&gt;

&lt;p&gt;Airbnb moved to Qwen, Alibaba's open-weight family. CEO Brian Chesky described it plainly: "very good, fast and cheap." It powers Airbnb's support agent. Cursor built its Composer coding model on Moonshot's open-weight Kimi K2.5. Microsoft has been hosting and testing DeepSeek V4 inside Azure Foundry and Copilot. Shopify, Coinbase, Siemens, and Uber Eats have all reportedly routed real production workloads to Qwen, GLM, Kimi, or DeepSeek.&lt;/p&gt;

&lt;p&gt;None of them "switched to the best model." That framing misreads the entire decision. Each of them moved the right task to a cheaper open-weight model sitting within a few points of frontier. The distinction matters more than any benchmark leaderboard, because it inverts the question everyone has been asking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question was never "which model is best?"
&lt;/h2&gt;

&lt;p&gt;For three years the industry has treated model selection as a single global decision. You pick the smartest model, you wire everything to it, you feel safe. That instinct is expensive and increasingly wrong. The real question is narrower and far more useful: which task actually needs the best model?&lt;/p&gt;

&lt;p&gt;Look at what production traffic is actually made of. The overwhelming majority of it is the boring 80% — extraction, classification, summarization, routing, simple tool calls, reformatting, deduplication. This is plumbing. It does not require a model that can reason through a novel proof or design a distributed system. It requires a model that is competent, fast, and cheap.&lt;/p&gt;

&lt;p&gt;Frontier models are priced for the hard 20% — the genuinely difficult reasoning, the long-horizon planning, the cases where an extra few points of quality translate into measurable business value. That is what you are paying a premium for. When you send the easy 80% through a frontier API, you are paying that premium on every request that never needed it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Paying frontier prices for the easy 80% is one of the biggest sources of AI budget waste in production today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Route by task, not by vendor
&lt;/h2&gt;

&lt;p&gt;The architecture that follows is not exotic. It is a routing table. You classify the task, then you send it to the cheapest model that clears the quality bar for that task. In practice, a stack that holds up in production looks something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt; — GLM or Kimi, which now sit close enough to frontier that the gap rarely shows up in real workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; — Kimi Code or Qwen Coder for the bulk of generation and refactoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents and tool calls&lt;/strong&gt; — GLM, which handles structured tool invocation reliably at a fraction of closed-API cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bulk processing&lt;/strong&gt; — MiMo, where you are grinding through volume and latency-per-dollar dominates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images and video&lt;/strong&gt; — fine-tuned LTX plus Wan, tuned to your own domain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local workhorse&lt;/strong&gt; — Qwen3.6-35B-A3B, the model that runs on your own hardware and quietly handles the daily grind.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Almost all of these are open-weight, self-hostable, close to frontier, and a fraction of the cost. This is the same principle that runs underneath &lt;a href="https://dev.to/blog/capability-is-commoditizing-cost-is-the-frontier"&gt;capability commoditizing while cost becomes the frontier&lt;/a&gt;: when the models converge on quality, the differentiation moves to how efficiently you deploy them.&lt;/p&gt;
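&lt;p&gt;The routing table itself can be small. Below is a minimal Python sketch, assuming you already run a per-task eval set and know a blended per-token price for each candidate. The model labels are placeholders and every number is invented for illustration; the only logic is the rule stated above: among the models that clear a task's quality bar, pick the cheapest.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str           # placeholder label, not a real model ID
    usd_per_mtok: float  # blended price per million tokens (made-up number)
    eval_score: float    # pass rate on your own eval set for this task (made-up number)


# Hypothetical routing table: every model label and number below is illustrative.
ROUTES: dict[str, list[Candidate]] = {
    "extraction": [
        Candidate("local-workhorse", 0.05, 0.93),
        Candidate("bulk-open-model", 0.10, 0.95),
        Candidate("frontier-api", 5.00, 0.97),
    ],
    "reasoning": [
        Candidate("open-reasoner-a", 0.60, 0.81),
        Candidate("open-reasoner-b", 0.80, 0.86),
        Candidate("frontier-api", 5.00, 0.90),
    ],
}

# Minimum eval score each task must clear, set per task rather than globally.
QUALITY_BAR: dict[str, float] = {"extraction": 0.90, "reasoning": 0.85}


def route(task: str, bars: dict[str, float] = QUALITY_BAR) -> str:
    """Return the cheapest model whose eval score clears the task's quality bar."""
    eligible = [c for c in ROUTES[task] if c.eval_score >= bars[task]]
    if not eligible:
        raise LookupError(f"no model clears the quality bar for {task!r}")
    return min(eligible, key=lambda c: c.usd_per_mtok).model


print(route("extraction"))  # local-workhorse
print(route("reasoning"))   # open-reasoner-b
print(route("reasoning", {**QUALITY_BAR, "reasoning": 0.88}))  # frontier-api
```

&lt;p&gt;Raising one task's bar is a one-line change, and the last call shows the effect: the request only reaches the frontier API when no cheaper model clears the bar. Keeping the bars per task, backed by evals, is what stops the table from quietly defaulting to the most expensive option.&lt;/p&gt;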

&lt;h2&gt;
  
  
  Savings are the visible win. Ownership is the real one.
&lt;/h2&gt;

&lt;p&gt;The cost delta is what gets the CFO's attention, and it is real. But savings are not the point. The point is that you own the stack. When the core of your system runs on weights you hold, nobody can switch you off. Nobody can revoke your access on their timeline. Nobody can see your data, dictate your pricing, or quietly reshape your roadmap by changing theirs.&lt;/p&gt;

&lt;p&gt;That is a business-continuity property, not a line item. It is the same argument that sits underneath the question of &lt;a href="https://dev.to/blog/who-owns-your-harness"&gt;who owns your harness&lt;/a&gt;: the orchestration layer that actually knows how your company works. Open weights are the only models that nobody outside your walls can turn off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clearing up the "my data goes to China" reflex
&lt;/h2&gt;

&lt;p&gt;There is a reflexive objection worth killing directly. "Chinese model equals my data goes to China" is simply wrong for open weights. Open weights run on your infrastructure. The weights may originate in a lab in Hangzhou or Beijing, but the weights are a static artifact — a file of numbers. When you self-host them, your data never leaves your servers. It goes to your GPUs, not theirs.&lt;/p&gt;

&lt;p&gt;This is why the real boundary is not American versus Chinese. It is open-weight versus closed. A closed American API can log your prompts, change its terms, and go dark on a government's schedule. A set of open weights running in your own datacenter cannot do any of those things, regardless of which country trained it. The nationality of the training run is a distraction; the deployment topology is the actual security boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to still pay for closed
&lt;/h2&gt;

&lt;p&gt;None of this means closed models are obsolete. They still lead on some cinematic, high-stakes workloads where the last few points of quality genuinely move the needle. The discipline is to pay for a closed model when the quality gap creates measurable business value — not by default, not out of habit, and not because it is the name everyone recognizes.&lt;/p&gt;

&lt;p&gt;The rule is simple to state and harder to enforce: open-source first, self-host the core, pay for frontier only where it creates value you cannot get elsewhere. Enforcing it means building a routing layer, maintaining evals per task, and resisting the temptation to route everything to the smartest model because it is easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The right question is not "which model is best?" but "which task actually needs the best?"&lt;/li&gt;
&lt;li&gt;Most production traffic is the boring 80% — extraction, classification, routing — and frontier pricing on it is pure waste.&lt;/li&gt;
&lt;li&gt;Route by task, not vendor: match each workload to the cheapest model that clears its quality bar.&lt;/li&gt;
&lt;li&gt;Open weights self-hosted mean your data never leaves your servers, whatever the model's country of origin.&lt;/li&gt;
&lt;li&gt;The real boundary is open-weight versus closed, not American versus Chinese.&lt;/li&gt;
&lt;li&gt;Pay for closed models only where the quality gap creates business value you cannot get otherwise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The race stopped being about the smartest model. It became about architecture that still works when today's smartest model is unavailable, unaffordable, or switched off. I write more about that shift across my &lt;a href="https://dev.to/michal-piszczek"&gt;essays on execution and AI infrastructure&lt;/a&gt;. Build the routing table now, while it is still a competitive edge rather than table stakes.&lt;/p&gt;

</description>
      <category>aiarchitecture</category>
    </item>
    <item>
      <title>Capability Is Commoditizing. Cost Is the Frontier.</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Wed, 01 Jul 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/capability-is-commoditizing-cost-is-the-frontier-1c5g</link>
      <guid>https://dev.to/pich/capability-is-commoditizing-cost-is-the-frontier-1c5g</guid>
      <description>&lt;p&gt;Anthropic shipped Claude Sonnet 5. On knowledge work it edges out Opus 4.8, its own flagship, at roughly half the price. The benchmark table isn't the story. The price column is.&lt;/p&gt;

&lt;p&gt;Everyone read the launch the same way: another model, another set of numbers, ho-hum, the leaderboard shuffles again. That is the wrong column to be reading. The mid-tier model just matched the flagship on the work that actually gets paid for, and it did it at a fraction of the cost. When that happens, you are not looking at a product update. You are looking at a phase change in what the market is willing to pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read the price column, not the benchmark
&lt;/h2&gt;

&lt;p&gt;Here are the numbers that matter. On GDPval-AA, the knowledge-work benchmark, Sonnet 5 scores 1618 against Opus 4.8's 1615. The mid-tier passed the flagship. On Humanity's Last Exam with tools, it is 57.4% versus 57.9%, a gap of half a point. On the work that maps to what knowledge workers actually do, these two models are indistinguishable.&lt;/p&gt;

&lt;p&gt;Now the pricing. Sonnet 5 launches at $2 per million input tokens and $10 per million output at the introductory rate, settling to $3 and $15. Opus 4.8 is $5 and $25. Same class of work at 40% of the Opus price during the introductory period, and 60% once pricing settles. That is not a discount. That is a repricing of the entire capability tier.&lt;/p&gt;
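&lt;p&gt;The ratio is easy to check. Here is a short Python sketch using the per-million-token prices quoted above; the job size is a hypothetical example. Because Sonnet 5's input and output prices are each the same fraction of Opus 4.8's, the ratio comes out identical for any mix of input and output tokens.&lt;/p&gt;

```python
# Prices in USD per million tokens, as quoted in this post.
OPUS_4_8 = {"input": 5.00, "output": 25.00}
SONNET_5_INTRO = {"input": 2.00, "output": 10.00}
SONNET_5_STANDARD = {"input": 3.00, "output": 15.00}


def job_cost(prices: dict[str, float], input_mtok: float, output_mtok: float) -> float:
    """USD cost of a job, with token volumes given in millions."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok


# Hypothetical job: 10M input tokens, 2M output tokens.
opus = job_cost(OPUS_4_8, 10, 2)               # 50 + 50 = 100
intro = job_cost(SONNET_5_INTRO, 10, 2)        # 20 + 20 = 40
standard = job_cost(SONNET_5_STANDARD, 10, 2)  # 30 + 30 = 60

print(f"introductory: {intro / opus:.0%} of the Opus 4.8 cost")
print(f"standard:     {standard / opus:.0%} of the Opus 4.8 cost")
```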

&lt;blockquote&gt;
&lt;p&gt;The moment a capability stops being scarce, the market reprices around delivery, not intelligence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the premium product and the mid-tier product do the same job, the premium is no longer buying capability. It is buying a slightly better result on the tail, for the cases where the last fraction of a percent matters. For the vast majority of knowledge work, that tail is irrelevant, and the market knows it. The price column is where that knowledge shows up first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute did it. Storage did it. Bandwidth did it.
&lt;/h2&gt;

&lt;p&gt;This is not a novel event in the history of technology. It is the single most reliable pattern we have. Every foundational capability follows the same arc from scarce and premium to abundant and priced-by-delivery.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute.&lt;/strong&gt; A cycle was once a rationed resource you scheduled time on. Now it is a commodity you rent by the second and never think about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage.&lt;/strong&gt; A megabyte was a budget line. Now storage is effectively free and the cost that matters is moving and querying the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth.&lt;/strong&gt; A bit over the wire was metered and precious. Now the pipe is assumed and the value moved to what flows through it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case the capability did not disappear. It became the floor. And once it was the floor, the entire market repriced around the thing that was still scarce: delivery, integration, reliability, and cost at scale. Intelligence is now walking the same path. The capability to do frontier-grade agentic knowledge work is becoming the floor, not the ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frontier-grade is now the default tier
&lt;/h2&gt;

&lt;p&gt;The most telling signal is not in the benchmark or the price. It is in the distribution. Sonnet 5 is the model free and Pro users get by default. Frontier-grade agentic work is no longer the thing you pay up for. It is the thing you get when you don't pay attention. The premium tier and the default tier now overlap on capability.&lt;/p&gt;

&lt;p&gt;Think about what that does to product strategy. If your entire pitch was "we have access to the best model," you no longer have a pitch, because the best-in-class-for-the-task model is the commodity default. The differentiation has to move somewhere else, and there are only a few places it can go: the data you feed the model, the harness you run it in, and the cost at which you can finish the job. I've argued the data point separately in &lt;a href="https://dev.to/blog/models-are-commodities-clean-data-is-not"&gt;models are commodities, clean data is not&lt;/a&gt;, and the harness point in &lt;a href="https://dev.to/blog/route-by-task-not-vendor-open-weight-ai-architecture"&gt;route by task, not by vendor&lt;/a&gt;. When capability is uniform, routing to the cheapest sufficient model per task is not a nice-to-have. It is the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question changed. Notice which one.
&lt;/h2&gt;

&lt;p&gt;For two years the operative question was: &lt;em&gt;can the model do the task?&lt;/em&gt; That question is now boring, because for most tasks the answer is yes, from the default tier, for a couple of dollars per million tokens. The interesting question is a different one entirely:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does the task cost to finish, at scale, with nobody watching?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every clause in that sentence is load-bearing. &lt;em&gt;Cost to finish&lt;/em&gt;, not cost per call, because agentic work chains many calls and the total is what hits the invoice. &lt;em&gt;At scale&lt;/em&gt;, because a workflow that pencils out at ten runs a day can bankrupt you at ten million. &lt;em&gt;With nobody watching&lt;/em&gt;, because the economics only work if the agent completes autonomously, without a human babysitting each step and eating the real cost, which is salary, not tokens.&lt;/p&gt;
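&lt;p&gt;Here is a minimal Python sketch of what "cost to finish" measures. The call shape, the token counts, the prices ($3 per million input tokens, $15 per million output tokens) and the hourly rate are all hypothetical placeholders, not measured figures. What matters is the structure: add up every chained call, then add the labor cost of any human supervision.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Call:
    """One model call inside an agentic task (hypothetical shape)."""
    input_tokens: int
    output_tokens: int

def cost_to_finish(calls, price_in_per_m, price_out_per_m,
                   human_minutes=0.0, hourly_rate=0.0):
    """Total cost of one completed unit of work.

    Sums every chained call instead of pricing a single call, then adds
    the labor cost of any human supervision time.
    """
    token_cost = sum(
        c.input_tokens / 1e6 * price_in_per_m
        + c.output_tokens / 1e6 * price_out_per_m
        for c in calls
    )
    labor_cost = human_minutes / 60 * hourly_rate
    return token_cost + labor_cost

# Hypothetical task: 20 chained calls of 8k input / 1k output tokens each,
# at illustrative prices of $3 in / $15 out per million tokens.
calls = [Call(8_000, 1_000) for _ in range(20)]
unattended = cost_to_finish(calls, 3.0, 15.0)
supervised = cost_to_finish(calls, 3.0, 15.0, human_minutes=10, hourly_rate=90.0)
print(f"unattended: ${unattended:.2f}")
print(f"supervised: ${supervised:.2f}")
```

&lt;p&gt;With these illustrative numbers, the tokens cost $0.78 per task. Ten minutes of human supervision at $90 an hour adds $15.00. That is what it means for the real cost to be salary, not tokens.&lt;/p&gt;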

&lt;p&gt;This reframes the whole build calculus. You are no longer selecting the smartest model. You are engineering the cheapest reliable completion of a unit of work. That is an economics and execution problem, not a capability problem. The same underlying force is why I've argued &lt;a href="https://dev.to/blog/openai-economics-gpu-constrained-not-demand-constrained"&gt;the constraint is GPUs, not demand&lt;/a&gt;. When capability is abundant and cheap, demand explodes to meet supply, and the binding constraint becomes the physical cost of serving it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What operators should do about it
&lt;/h2&gt;

&lt;p&gt;If capability is commoditizing and cost is the frontier, then the winning moves are unglamorous and entirely about execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instrument cost per completed task, not per token.&lt;/strong&gt; The token price is a red herring. Measure what it costs to finish a real unit of work end to end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default to the cheapest sufficient model and route up only on the tail.&lt;/strong&gt; Reserve the flagship for the fraction of cases where the last percent actually pays.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for unattended completion.&lt;/strong&gt; The moment a human has to watch, your cost model is dominated by labor and the token savings are noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Move differentiation to data, harness, and reliability.&lt;/strong&gt; Capability is the floor now. Your edge lives in the layers the commodity model can't provide.&lt;/li&gt;
&lt;/ul&gt;
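&lt;p&gt;The first two moves fit in a few lines of code. Below is a minimal Python sketch that assumes each tier has a known cost per attempt and a pass/fail check. The tier names, the prices, and the &lt;code&gt;solve&lt;/code&gt; stand-in are hypothetical. A real harness would replace &lt;code&gt;solve&lt;/code&gt; with a model call followed by an eval. The sketch tries the cheapest tier first, escalates only on failure, and charges each failed attempt to the task that caused it.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost_per_attempt: float          # hypothetical average cost of one attempt
    solve: Callable[[str], bool]     # stand-in for "run the model, then verify"

def run_with_escalation(task, tiers):
    """Try tiers cheapest-first; escalate only when verification fails.

    Returns (tier_name or None, total_spend), so spend is tracked per task,
    including the cost of failed cheaper attempts.
    """
    spend = 0.0
    for tier in sorted(tiers, key=lambda t: t.cost_per_attempt):
        spend += tier.cost_per_attempt
        if tier.solve(task):
            return tier.name, spend
    return None, spend

def cost_per_completed_task(tasks, tiers):
    """Total spend divided by tasks that actually finished."""
    total_spend, completed = 0.0, 0
    for task in tasks:
        winner, spend = run_with_escalation(task, tiers)
        total_spend += spend
        if winner is not None:
            completed += 1
    return total_spend / completed if completed else float("inf")

# Hypothetical tiers: the mid-tier handles everything except "hard" tasks.
tiers = [
    Tier("flagship", 1.00, lambda task: True),
    Tier("mid-tier", 0.40, lambda task: not task.startswith("hard")),
]
tasks = ["easy-1", "easy-2", "easy-3", "hard-1"]
print(cost_per_completed_task(tasks, tiers))
```

&lt;p&gt;Charging the failed mid-tier attempt to the hard task keeps the metric honest. It shows the full price of the tail you are paying the flagship to cover.&lt;/p&gt;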

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sonnet 5 matches Opus 4.8 on knowledge work (GDPval-AA 1618 vs 1615) at roughly 40% of the cost. The mid-tier passed the flagship.&lt;/li&gt;
&lt;li&gt;When the premium and mid-tier do the same job, the premium stops buying capability and starts buying a marginal tail.&lt;/li&gt;
&lt;li&gt;Compute, storage, and bandwidth all commoditized the same way. Intelligence is now the floor, not the ceiling.&lt;/li&gt;
&lt;li&gt;Frontier-grade agentic work is the default tier free and Pro users get, not the tier you pay up for.&lt;/li&gt;
&lt;li&gt;The question shifted from "can the model do it" to "what does the task cost to finish, at scale, with nobody watching."&lt;/li&gt;
&lt;li&gt;Differentiation moves to data, harness, reliability, and cost per completed task. Capability alone is no longer a moat.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The leaderboard-watchers are optimizing the wrong variable. They are still asking whether the model is smart enough, a question the market has already answered and priced to the floor. The operators who win the next cycle are asking what it costs to finish the work when the intelligence is free and the only scarce thing left is disciplined, unattended, economical execution. Capability is commoditizing. Cost is the new frontier. For the wider thesis, the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt; and the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt; lay out where the joules, and the margins, actually go.&lt;/p&gt;

</description>
      <category>aieconomics</category>
    </item>
    <item>
      <title>Who Owns Your Harness? The Layer Above the Model</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Tue, 30 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/who-owns-your-harness-the-layer-above-the-model-5ca2</link>
      <guid>https://dev.to/pich/who-owns-your-harness-the-layer-above-the-model-5ca2</guid>
      <description>&lt;p&gt;Lately I find myself less interested in which model wins and far more interested in who owns the layer above the model. The benchmark wars — GLM versus Claude versus GPT versus Qwen versus DeepSeek — are already yesterday's conversation. Models improve fast and get cheaper faster. Open-source is closing the gap ahead of every schedule people drew a year ago. So "which model should we use?" is not the question. It never was.&lt;/p&gt;

&lt;p&gt;The question that actually determines whether your company survives a bad quarter in AI policy is this: can we replace the model tomorrow? If the honest answer is no, you have already made the most expensive architectural decision of the decade without noticing.&lt;/p&gt;

&lt;h2&gt;
  
  
  You think you're buying AI. You're wiring an operating system.
&lt;/h2&gt;

&lt;p&gt;Most companies believe they are buying AI the way they buy a database or a cloud region — a component, swappable, bounded. What they are actually doing is wiring their entire execution layer around a single vendor. The prompts. The memory. The agents. The routing logic. The evals. And then every integration on top: Slack, Jira, GitHub, the internal tools, the accumulated company knowledge that no one wrote down anywhere else.&lt;/p&gt;

&lt;p&gt;Bit by bit, the model stops being a model. It becomes the operating system of the business. Every workflow assumes its quirks. Every prompt is tuned to its behavior. Every engineer's mental model of "how our AI works" is really a mental model of one vendor's API. That is where lock-in begins — not in a contract clause, but in a thousand small couplings nobody tracked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lock-in stopped being a commercial inconvenience
&lt;/h2&gt;

&lt;p&gt;For most of software history, lock-in was a negotiating problem. You paid a switching cost, you grumbled, you migrated over a quarter. Annoying, survivable. That era is over for AI. Lock-in became a business-continuity risk, and recent events proved it in the harshest way possible.&lt;/p&gt;

&lt;p&gt;A single US export-control order took Anthropic's top models — Mythos 5 and Fable 5 — offline for two weeks. Not just for foreign users. To stay compliant, they were pulled for everyone worldwide, the United States included. Every company that had wired its product around those models lost its core capability overnight, through no decision of its own.&lt;/p&gt;

&lt;p&gt;Days later, GPT-5.6 shipped only as a gated, US-only preview, after Washington reportedly asked OpenAI to hold the launch. Two data points, one lesson: the model under your product can go dark on a government's timeline, not yours.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Closed is closed the moment someone decides it's closed to you — and that someone may not be your vendor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last clause is the whole point. You can have a perfect relationship with your model provider, pay every invoice on time, and still lose access because a regulator three time zones away signed an order. Your vendor's goodwill is irrelevant when the constraint sits above the vendor. This is the shift I described in &lt;a href="https://dev.to/blog/model-wars-are-over-clearance-wars-begin"&gt;the model wars ending and the clearance wars beginning&lt;/a&gt;: capability now ships when it clears, not when it's ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  The default has to flip: open-source first
&lt;/h2&gt;

&lt;p&gt;The conclusion writes itself. The default posture must invert: open-source first. Not because open models win every benchmark (they do not, yet), but because open weights are the only ones nobody can switch off. A file of weights sitting on your own hardware does not care what any government decides next week. It is inert, and it is yours.&lt;/p&gt;

&lt;p&gt;Keep closed models for the frontier edge, the genuinely hard workloads where the quality gap earns its premium. But the core you cannot afford to lose should sit on weights you actually hold. This is the same architecture I lay out in &lt;a href="https://dev.to/blog/route-by-task-not-vendor-open-weight-ai-architecture"&gt;routing by task, not vendor&lt;/a&gt;: self-host the core, pay for frontier only where it creates value you can't get elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Above the model sits the harness
&lt;/h2&gt;

&lt;p&gt;And above every model — open or closed — sits the harness. This is the layer that actually matters, and it should be vendor-agnostic by design. The harness owns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — what your system remembers across sessions, users, and workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; — how the right information reaches the model at the right moment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt; — which task goes to which model, and the fallback when one goes dark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissions&lt;/strong&gt; — who and what is allowed to do which action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — the integrations that let the model act on your actual systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evals&lt;/strong&gt; — how you know quality held after you swapped a model underneath.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt; — the logic that ties it all into something that works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The harness is the part that actually knows how your company works. The model is a replaceable engine bolted into it. If your harness is well-built and vendor-agnostic, swapping models is a config change and a re-run of your evals. If it is not — if the harness and the vendor are the same thing — then a model going dark takes your whole business with it.&lt;/p&gt;
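&lt;p&gt;Here is a minimal Python sketch of what "vendor-agnostic" can mean in practice. It assumes every backend is wrapped in an adapter with the same signature. The adapter names, the &lt;code&gt;ROUTING&lt;/code&gt; table, and the &lt;code&gt;ProviderUnavailable&lt;/code&gt; exception are all hypothetical. In this sketch the closed model is simulated as pulled, so the router falls back to the self-hosted weights.&lt;/p&gt;

```python
from typing import Callable

class ProviderUnavailable(Exception):
    """Raised when a backend is offline, gated, or pulled (hypothetical)."""

# Hypothetical adapters: each wraps one vendor or self-hosted model behind
# the same signature, so the rest of the harness never imports a vendor SDK.
def self_hosted(prompt: str) -> str:
    return f"[open-weights] {prompt}"

def closed_frontier(prompt: str) -> str:
    # Simulates a model that has gone dark for this caller.
    raise ProviderUnavailable("model pulled under export control")

ADAPTERS: dict[str, Callable[[str], str]] = {
    "self_hosted": self_hosted,
    "closed_frontier": closed_frontier,
}

# Routing lives in config, not code: swapping models means editing this
# mapping and re-running evals. Each list is tried in order.
ROUTING = {
    "hard_reasoning": ["closed_frontier", "self_hosted"],
    "summarize": ["self_hosted"],
}

def complete(task_type: str, prompt: str) -> str:
    """Send the prompt to the first available provider for this task type."""
    errors = []
    for name in ROUTING[task_type]:
        try:
            return ADAPTERS[name](prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers unavailable: " + "; ".join(errors))

print(complete("hard_reasoning", "review this diff"))
```

&lt;p&gt;The key design choice is that routing is data. Changing which model serves a task type means editing &lt;code&gt;ROUTING&lt;/code&gt;, and the evals decide whether that edit ships.&lt;/p&gt;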

&lt;h2&gt;
  
  
  The highest-ROI call of the decade
&lt;/h2&gt;

&lt;p&gt;Building your own harness is slower and costlier today. There is no way around that. It is more engineering, more discipline, more upfront investment than plugging into one vendor's SDK and shipping. That is exactly why most teams will not do it until they are forced to.&lt;/p&gt;

&lt;p&gt;But it may be one of the highest-ROI calls of the decade. Models come and go. Governments reshuffle who gets access to what, and on what timeline. Your company brain — the accumulated knowledge, workflows, and judgment encoded in your execution layer — should depend on neither. It should sit in a harness you own, feeding whichever model happens to be best, cheapest, and available this month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The right question is not "which model do you use?" but "can you replace it tomorrow?"&lt;/li&gt;
&lt;li&gt;Companies think they're buying AI; they're wiring their whole execution layer around one vendor.&lt;/li&gt;
&lt;li&gt;Lock-in became a business-continuity risk — a single export order pulled Anthropic's top models worldwide for two weeks.&lt;/li&gt;
&lt;li&gt;Closed is closed the moment someone decides it's closed to you, and that someone may not be your vendor.&lt;/li&gt;
&lt;li&gt;Default to open-source for the core; open weights are the only ones nobody can switch off.&lt;/li&gt;
&lt;li&gt;Own the harness — memory, context, routing, permissions, tools, evals, orchestration — and models become swappable engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are going to stop asking "which model do you use?" and start asking who owns your harness. I write about that transition and the architecture it demands across my &lt;a href="https://dev.to/michal-piszczek"&gt;essays on AI infrastructure and execution&lt;/a&gt;. The teams that build the harness now will be the ones still running when the next model goes dark.&lt;/p&gt;

</description>
      <category>aiarchitecture</category>
    </item>
    <item>
      <title>The Model Wars Are Over. The Clearance Wars Begin.</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Sat, 27 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/the-model-wars-are-over-the-clearance-wars-begin-1n66</link>
      <guid>https://dev.to/pich/the-model-wars-are-over-the-clearance-wars-begin-1n66</guid>
      <description>&lt;p&gt;OpenAI previewed GPT-5.6 and, in doing so, became the first lab to ship a model the US government clears for use customer by customer. Read that as an engineering release and you miss everything. The model is not the story. The gate is. For the first time, a frontier capability arrived not when it was built, but when it was permitted — and the permission is granted one customer at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually shipped
&lt;/h2&gt;

&lt;p&gt;The release has three tiers. Sol is the flagship. Terra sits in the same class as GPT-5.5 at roughly half the cost. Luna is the cheapest, built for volume. There is a new ultra mode that spins up subagents to attack harder problems in parallel. On the surface, this is a clean, well-segmented product line — the kind of tiering that signals a mature lab that understands its cost curve.&lt;/p&gt;

&lt;p&gt;But every one of those capabilities shipped only behind a limited preview. And before release, OpenAI shared the model's capabilities with the US government. That sequencing is the whole point. The product decisions are downstream of a clearance decision that happened first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the gate exists: cyber
&lt;/h2&gt;

&lt;p&gt;The reason for the gate is not vague safety hand-waving. It is cyber, and the numbers are specific. On ExploitBench, Sol matches Anthropic's Mythos while using roughly one-third of the output tokens. It finds bugs and exploitation primitives in Chromium and Firefox — real browsers that billions of people run — stopping short of a full autonomous exploit, but not by a comfortable margin.&lt;/p&gt;

&lt;p&gt;OpenAI spent 700,000 A100-equivalent GPU hours red-teaming its own safeguards before release. That is not a rounding error in a training budget; that is a deliberate, industrial-scale effort to understand what the model can do before anyone outside gets to ask it. And Sol runs on Cerebras at 750 tokens per second starting in July, which means whatever it can do, it can do fast and at scale.&lt;/p&gt;

&lt;p&gt;Put those facts together and the gate is not paranoia. A model that finds exploitation primitives in the world's most-used browsers, at a third of the token cost of the prior frontier, running at 750 tokens per second, is a genuinely dual-use artifact. The lab knew it. The government knew it. The preview gate is the compromise that let it ship at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quiet part, said out loud
&lt;/h2&gt;

&lt;p&gt;OpenAI said the part most labs would keep internal: it does not want government approval to become the long-term default. That is a remarkable admission. It means the company shipping the model understands that the clearance regime it just participated in is a threshold being crossed, not a one-off accommodation. They cleared this launch and simultaneously warned against the precedent of clearing launches.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Capability used to ship the day it was ready. Now it ships when it's cleared. The bottleneck moved from compute to permission.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have watched this exact pattern before, and it is worth being precise about the analogy, because the analogy is the argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  We have run this playbook three times now
&lt;/h2&gt;

&lt;p&gt;Strong encryption was classified as a munition. For years, exporting cryptographic software above a certain key length was legally equivalent to exporting weapons. The capability existed; shipping it required clearance. GPS shipped with selective availability — the civilian signal was deliberately degraded, its full precision reserved and released only when the government decided the strategic calculus had changed. Now inference joins that list.&lt;/p&gt;

&lt;p&gt;The through-line is consistent. When a technology becomes strategically decisive, the state stops treating it as a product and starts treating it as a controlled capability. The pattern has three stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classification&lt;/strong&gt; — the capability is recognized as dual-use and reframed as a matter of national security rather than commerce.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gating&lt;/strong&gt; — release is made conditional on clearance, whether by export license, degraded signal, or customer-by-customer approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective release&lt;/strong&gt; — the full capability flows only to approved parties, on the government's timeline, not the builder's.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encryption followed it. GPS followed it. GPT-5.6 is the first frontier model to follow it explicitly, with the government briefed before launch and access granted per customer. That is not a coincidence of one release. It is a category shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottleneck moved
&lt;/h2&gt;

&lt;p&gt;For three years the constraint on AI progress was compute. Whoever had the most GPUs, the best data, and the best training runs shipped the best model, and they shipped it the moment it cleared their own evals. That world is closing. The new constraint is permission. You can have the compute, the data, the trained weights sitting on disk — and still not be allowed to ship, or only allowed to ship to a vetted list, on a schedule set outside your building.&lt;/p&gt;

&lt;p&gt;This is why the strategic questions have changed. It is no longer only "who has the best model?" It is "who is allowed to run it, where, and for whom?" That reframing is the same one I trace in &lt;a href="https://dev.to/blog/what-washington-regulated-the-muzzle-not-the-model"&gt;what Washington actually regulated — the muzzle, not the model&lt;/a&gt;: the control point moved from the artifact to its use. And it is why the architectural imperative I describe in &lt;a href="https://dev.to/blog/who-owns-your-harness"&gt;who owns your harness&lt;/a&gt; matters more every quarter. If the model under your product can be gated by a government, the layer you own becomes the only thing you can count on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GPT-5.6 is the first frontier model cleared by the US government customer by customer — the gate, not the model, is the story.&lt;/li&gt;
&lt;li&gt;The gate exists because of cyber: Sol matches Anthropic's Mythos on ExploitBench at a third of the tokens and finds exploitation primitives in Chromium and Firefox.&lt;/li&gt;
&lt;li&gt;OpenAI spent 700,000 A100-equivalent GPU hours red-teaming its own safeguards before release.&lt;/li&gt;
&lt;li&gt;Encryption became a munition; GPS shipped with selective availability; inference now joins the list of gated strategic capabilities.&lt;/li&gt;
&lt;li&gt;The bottleneck moved from compute to permission — capability ships when it's cleared, not when it's ready.&lt;/li&gt;
&lt;li&gt;The strategic question shifted from "who has the best model?" to "who is allowed to run it, where, and for whom?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model wars are over. The clearance wars just started. I track that shift and what it means for anyone building on top of frontier models across my &lt;a href="https://dev.to/michal-piszczek"&gt;essays on AI policy and execution&lt;/a&gt;. Plan your architecture for a world where the smartest model available to you is the one you are cleared to run.&lt;/p&gt;

</description>
      <category>aipolicy</category>
    </item>
    <item>
      <title>The Biggest Customer Becomes the Competitor</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Sat, 27 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/the-biggest-customer-becomes-the-competitor-3p89</link>
      <guid>https://dev.to/pich/the-biggest-customer-becomes-the-competitor-3p89</guid>
      <description>&lt;p&gt;OpenAI designed its own AI chip in nine months and aimed it straight at Nvidia, the supplier it cannot survive without. Codenamed Jalapeño, co-designed with Broadcom. The bill forced the move.&lt;/p&gt;

&lt;p&gt;A custom chip usually takes two to three years from design to working silicon. OpenAI did it in nine months. The compression is not a footnote; it is the whole point. OpenAI used its own models to accelerate the design cycle, turning frontier inference back onto the problem of building the hardware that runs frontier inference. The snake ate part of its own tail, and the tail grew back faster.&lt;/p&gt;

&lt;p&gt;Jalapeño is inference-only. It is tuned for the workloads OpenAI actually runs at scale: ChatGPT, Codex, the API, agents. It is not built for training. That narrowing is deliberate, and it is where the leverage lives. When you know your workload down to the token, you can throw away everything a general-purpose GPU carries to serve a thousand customers you are not. Early tests claim better performance-per-watt than today's best GPUs. At gigawatt scale, performance-per-watt is not a spec-sheet vanity metric. It is the P&amp;amp;L.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern is older than OpenAI
&lt;/h2&gt;

&lt;p&gt;This is not a surprise if you have watched infrastructure economics before. Google built TPUs because running search, ads, and later Gemini on general-purpose accelerators stopped making sense at its volume. Amazon built Trainium and Inferentia because AWS could not let the margin on every AI workload flow to a single supplier. Now OpenAI builds Jalapeño for exactly the same reason, and the reason is arithmetic.&lt;/p&gt;

&lt;p&gt;The rule generalizes: the biggest customer always becomes the next competitor, because the bill forces it. When you are a small buyer, renting is obviously correct. The vendor amortizes billions in R&amp;amp;D across thousands of customers, and your slice is cheap. When you become the largest single consumer of a component, the math inverts. You are now underwriting a meaningful fraction of the vendor's margin, and that margin is a tax you pay on your own scale. At some volume, designing the thing yourself is cheaper than renting it, and every dollar of vendor margin you eliminate is a dollar that compounds.&lt;/p&gt;
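&lt;p&gt;The inversion comes down to break-even arithmetic. Below is a minimal Python sketch under a deliberately simplified model. It treats the vendor's per-unit markup over your own unit cost as the saving from building, and it ignores financing, yield risk, and software porting. The parameter names and numbers are hypothetical and are not anyone's actual figures.&lt;/p&gt;

```python
def breakeven_units(design_cost, rent_price, own_unit_cost):
    """Volume at which building beats renting (simplified model).

    design_cost:   one-off cost to design and tape out the chip
    rent_price:    what the vendor charges per unit of capacity
    own_unit_cost: what an in-house unit costs to manufacture and run
    Ignores time value of money, yield risk, and software porting costs.
    """
    saving_per_unit = rent_price - own_unit_cost
    if saving_per_unit > 0:
        return design_cost / saving_per_unit
    return float("inf")  # building never pays off if it isn't cheaper per unit

# Purely illustrative numbers, not OpenAI's or Nvidia's figures.
print(breakeven_units(design_cost=500e6, rent_price=30_000, own_unit_cost=12_000))
```

&lt;p&gt;With these illustrative inputs, break-even sits near 28,000 units. A small buyer never reaches that volume. The largest single consumer of a component can exceed it, and every unit beyond that point is margin kept rather than paid.&lt;/p&gt;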

&lt;blockquote&gt;
&lt;p&gt;Renting compute is a cost. Designing it is a moat. The difference is who owns the workload.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Old stack, new stack
&lt;/h2&gt;

&lt;p&gt;The old stack was simple and stable. One vendor designs the silicon. Everyone else rents it. Nvidia sat at the top of that pyramid, and the pyramid was the entire industry. Access to Nvidia was the bottleneck, and allocation of Nvidia's chips was a story that moved markets. Whoever got the biggest allocation won the round.&lt;/p&gt;

&lt;p&gt;The new stack rearranges the pyramid. The buyer designs the silicon, and the vendor becomes optional. Not eliminated, optional. That word does a lot of work. OpenAI will still buy Nvidia for training, for burst capacity, for the workloads where a general-purpose part still wins. But the strategic dependency loosens the moment a credible in-house alternative exists for the workload that dominates the bill. The bottleneck moves from access to Nvidia to ownership of the workload. Once you own the workload end to end, you get to decide how much of it to rent and how much to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why inference, and why now
&lt;/h2&gt;

&lt;p&gt;Inference is the right place to start vertical integration, and the timing is not an accident. Training is bursty, experimental, and moves with the research frontier; the workload changes shape every few months, which punishes custom silicon built around fixed assumptions. Inference at OpenAI's scale is the opposite. It is enormous, steady, and increasingly well understood. The company serves the same handful of model architectures to hundreds of millions of users, billions of times a day. That is exactly the profile that rewards a chip designed for one job and stripped of everything else.&lt;/p&gt;

&lt;p&gt;The economics compound with agents. As I have argued in &lt;a href="https://dev.to/blog/the-unit-of-work-is-the-agent-hour"&gt;the unit of work is the agent-hour&lt;/a&gt;, output is going parallel: work is no longer bounded by human hours but by how many agents you can run at once. Every one of those agent-hours is inference. The inference bill is not a fixed cost you optimize once; it is the growth curve itself. Owning the silicon under that curve is owning the cost structure of your own future.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenAI is really buying
&lt;/h2&gt;

&lt;p&gt;Read past the chip and you can see what OpenAI is actually acquiring. It is not just cheaper tokens. It is control over its own cost curve, its own roadmap, and its own supply chain in a market where compute is the binding constraint. As I have written in &lt;a href="https://dev.to/blog/openai-economics-gpu-constrained-not-demand-constrained"&gt;OpenAI is GPU-constrained, not demand-constrained&lt;/a&gt;, the company's growth ceiling is set by silicon it does not manufacture. Jalapeño is the structural answer to that constraint. It is the first chip in a multi-generation roadmap, which tells you this was never a one-off experiment. It is a commitment to owning the bottom of the stack.&lt;/p&gt;

&lt;p&gt;Here is the framework I use to decide when a big buyer should stop renting and start building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Volume concentration.&lt;/strong&gt; When one workload dominates your spend, the vendor's margin on that workload becomes your largest controllable cost. Concentration is the trigger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload stability.&lt;/strong&gt; Custom silicon rewards a job that will not change shape for years. Inference qualifies; frontier training does not, yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design-cycle leverage.&lt;/strong&gt; If you can compress the two-to-three-year chip cycle, as OpenAI did with its own models, the payback window shrinks and the bet gets far safer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategic optionality.&lt;/strong&gt; Even a good-enough in-house part changes your negotiating position with the incumbent vendor. The threat of building is worth money before the chip ships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roadmap commitment.&lt;/strong&gt; One chip is a science project. A multi-generation roadmap is a business decision. Only the second one moves the moat.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What breaks next
&lt;/h2&gt;

&lt;p&gt;If the largest AI buyers all vertically integrate, Nvidia does not disappear, but its position changes. It moves from the sole source of frontier compute toward a supplier of training and burst capacity, competing against the in-house parts of its biggest former customers. That is a different, thinner business than owning the entire pyramid. The interesting question is not whether Nvidia survives (it will) but what the market looks like when the five buyers who matter most each design the silicon for their own dominant workload.&lt;/p&gt;

&lt;p&gt;The deeper shift is about where value accrues. For a decade, the story was that whoever controlled the scarce input, the chips, controlled the industry. Jalapeño is evidence that the scarce input is being routed around by the buyers with enough volume to justify the engineering. Value migrates from owning the general-purpose component to owning the specific workload well enough to build the component yourself. The bottleneck moved from access to ownership, and ownership is the more durable position.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI designed Jalapeño, an inference-only chip, in nine months versus the usual two to three years, using its own models to compress the cycle.&lt;/li&gt;
&lt;li&gt;The move follows a rule: the biggest customer becomes the next competitor, because concentrated volume turns vendor margin into your largest controllable cost.&lt;/li&gt;
&lt;li&gt;Google (TPU) and Amazon (Trainium) ran this playbook first. OpenAI is the newest instance, not a novel one.&lt;/li&gt;
&lt;li&gt;Inference is the right entry point for vertical integration: enormous, steady, and well understood, unlike frontier training.&lt;/li&gt;
&lt;li&gt;The bottleneck moved from access to Nvidia to ownership of the workload. Renting compute is a cost; designing it is a moat.&lt;/li&gt;
&lt;li&gt;A multi-generation roadmap, not a single chip, is what turns this from a science project into a structural change in the market.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The chip allocation era trained everyone to watch who got the most GPUs. That was the old bottleneck. The new one is quieter: which buyers understand their own workload well enough to stop renting and start building. For the wider map of how compute, clearance, and control connect, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt; and the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt; thesis. The supplier you cannot survive without is the one you eventually have to replace.&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>The Unit of Work Is the Agent-Hour</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Fri, 26 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/the-unit-of-work-is-the-agent-hour-3021</link>
      <guid>https://dev.to/pich/the-unit-of-work-is-the-agent-hour-3021</guid>
      <description>&lt;p&gt;OpenAI published usage data from inside its own walls. Its 99th-percentile employees now run more than 60 hours of agent work every single day. Sixty hours inside a 24-hour day is not overtime. It is a different unit of work.&lt;/p&gt;

&lt;p&gt;The number breaks your intuition on purpose. You cannot fit 60 hours of labor into a day if labor is something a human performs sequentially with two hands and one attention span. You can fit 60 agent-hours into a day trivially, because agent-hours run in parallel and the human is no longer the one doing them. That single figure marks the boundary between the old model of work and the one replacing it.&lt;/p&gt;

&lt;p&gt;The rest of OpenAI's report fills in the shape. The average employee now produces 85% of their output through Codex: delegated, not typed. Across the company, agents already account for 99.8% of weekly output tokens. The humans are still deciding what gets built and whether it is right. They have almost entirely stopped being the ones who produce it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The work did not get faster. It went parallel.
&lt;/h2&gt;

&lt;p&gt;This is the distinction people miss, and it changes every downstream conclusion. "Faster" is a story about the same sequential process compressed in time: the same person doing the same task in less time. That is a linear improvement, and linear improvements have ceilings set by the human at the center.&lt;/p&gt;

&lt;p&gt;Parallel is a different regime entirely. The human stops executing tasks one after another and starts dispatching many at once, each running independently while attention moves elsewhere. The constraint is no longer how fast you work. It is how many streams of work you can start, supervise, and accept. The per-team growth data makes the pattern concrete: research teams show 56 times more agent use than seven months ago, customer support 32 times, engineering 27 times, even legal 13 times. Those are not efficiency gains. Those are step changes in how many things happen at once.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The human stops doing the work and starts approving it. That is not a productivity upgrade. It is a change in what a person is for.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Old company, new company
&lt;/h2&gt;

&lt;p&gt;The old company had a simple production function: headcount times hours. Output was labor, and labor was people multiplied by the time each one worked. It scaled linearly and it scaled by hiring. If you wanted more output, you added heads, onboarded them, managed them, and absorbed the coordination cost of every new person. The ceiling was real, and everyone knew where it was.&lt;/p&gt;

&lt;p&gt;The new company has a different production function: agents times parallelism. Output is a function of how many agents you can run and how many you can run at once, and there is no ceiling you can staff your way to. This is not a rhetorical flourish. It is a structural claim about where the limit sits. In the old company, the binding constraint was people. In the new one, the binding constraint is your ability to specify work clearly and verify it correctly at volume. Those are different muscles, and most organizations have only trained the first one.&lt;/p&gt;
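&lt;p&gt;The two production functions can be written down directly. This is a minimal sketch under stated assumptions. The old function is headcount times hours. The new one is agents times parallel hours, capped by how much work humans can specify and verify per day. The cap and every number in the example are illustrative assumptions, not figures from the report.&lt;/p&gt;

```python
def old_output(headcount: int, hours_per_person: float) -> float:
    """Old company: output is labor, i.e. people multiplied by the hours each works."""
    return headcount * hours_per_person


def new_output(agents: int, hours_per_agent: float, verifiable_hours: float) -> float:
    """New company: agent-hours produced, but only as much as humans can accept.

    `verifiable_hours` is a hypothetical cap on how many agent-hours of work the
    team can specify and check per day; production beyond it is never accepted.
    """
    produced = agents * hours_per_agent
    return min(produced, verifiable_hours)


# Ten people, eight hours each.
print(old_output(10, 8))          # 80
# The same ten people steering 200 agents that each run 12 hours a day,
# but only able to verify 1,500 agent-hours of it.
print(new_output(200, 12, 1500))  # 1500
# Doubling the agents changes nothing; raising verification capacity does.
print(new_output(400, 12, 1500))  # 1500
print(new_output(400, 12, 3000))  # 3000
```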

&lt;h2&gt;
  
  
  We have seen this abstraction before
&lt;/h2&gt;

&lt;p&gt;The move is not unprecedented; it is the same move computing has made twice already. Compilers did it to assembly. Programmers stopped hand-writing the instructions the machine executes and started writing intent, letting the compiler generate the instructions. The programmer's job moved up a level, from producing machine code to specifying behavior and checking the result. Nobody mourns hand-written assembly.&lt;/p&gt;

&lt;p&gt;The cloud did it to servers. Operations teams stopped racking physical machines and started declaring the infrastructure they wanted, letting the provider produce it. The unit of work stopped being the server you touched and became the capacity you specified. In both cases the human did not become less important. The human moved to a higher level of abstraction and became responsible for more, because each unit of their attention now commanded far more underlying work. Agents are the third instance of the same pattern, applied to knowledge work itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the agent-hour measures, and what breaks
&lt;/h2&gt;

&lt;p&gt;When the unit of work is the agent-hour, the metrics that ran the old company stop describing the new one. Headcount measured the old company because headcount was the input that produced output. Throughput measures this one, because output is now decoupled from the number of people. A ten-person team running thousands of agent-hours a day is not a ten-person team in any meaningful sense. It is a throughput engine with ten people steering it. Counting the people tells you almost nothing about what it produces.&lt;/p&gt;

&lt;p&gt;Two things break as this lands, and both are worth naming before they surprise you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verification becomes the bottleneck.&lt;/strong&gt; When agents produce 99.8% of output, the scarce human resource is the judgment that accepts or rejects it. I have argued this at length in &lt;a href="https://dev.to/blog/verification-cost-is-the-new-bottleneck"&gt;verification cost is the new bottleneck&lt;/a&gt;: the constraint moves from producing work to confirming it is correct, and that cost does not fall as fast as generation cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Org charts stop mapping to output.&lt;/strong&gt; If throughput is agents times parallelism, then seniority, span of control, and headcount budgets are measuring the wrong thing. The high-leverage person is the one who specifies and verifies the most agent-hours, not the one who manages the most people.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delegation also runs up a bill, and the bill compounds. Every agent-hour is inference, and inference at this volume is a cost curve, not a fixed line. That is why the biggest operators are moving to own the silicon underneath it, a shift I traced in &lt;a href="https://dev.to/blog/the-biggest-customer-becomes-the-competitor"&gt;the biggest customer becomes the competitor&lt;/a&gt;. The agent-hour is both the new unit of work and the new unit of spend, and the two are the same number read from opposite ends.&lt;/p&gt;
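&lt;p&gt;Reading the agent-hour as spend is a single chain of multiplications. This is a minimal sketch in which every input is hypothetical. The tokens an agent consumes per hour and the blended price per million tokens vary widely by model and workload. The figures below are placeholders chosen only to show the shape of the calculation.&lt;/p&gt;

```python
def inference_spend_usd(agent_hours: float, tokens_per_agent_hour: float, usd_per_million_tokens: float) -> float:
    """Spend for a block of agent work: agent-hours -> tokens -> dollars."""
    tokens = agent_hours * tokens_per_agent_hour
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical: one heavy user at 60 agent-hours a day, 2M tokens per agent-hour,
# at a blended $5 per million tokens.
daily = inference_spend_usd(60, 2_000_000, 5.0)
print(daily)        # 600.0
print(daily * 365)  # 219000.0
```

&lt;p&gt;At a fixed price per token, doubling agent-hours doubles the bill. That is why the unit of work and the unit of spend move together.&lt;/p&gt;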

&lt;h2&gt;
  
  
  How to operate when the unit changes
&lt;/h2&gt;

&lt;p&gt;If the unit of work is the agent-hour, the skills that matter shift accordingly. The premium moves to specification, decomposition, and verification, the three things a human still does that an agent cannot yet do for itself. Writing a clear enough instruction that an agent produces the right thing is a skill. Breaking a large goal into parallelizable pieces is a skill. Judging correctness at the rate agents generate output is the scarcest skill of all, and the one most organizations have not started training.&lt;/p&gt;

&lt;p&gt;The forward-looking version is uncomfortable and worth sitting with. If output is agents times parallelism, then the competitive gap between two companies is no longer a hiring gap. It is a gap in how well each one specifies and verifies work at scale. That gap is invisible on an org chart and enormous in throughput. The company that learns to run agent-hours well will out-produce the company that keeps counting heads, and the head-counting company will not understand why until it is far behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI's top employees run 60+ agent-hours per day, and agents produce 99.8% of the company's weekly output tokens. Work went parallel, not just faster.&lt;/li&gt;
&lt;li&gt;The old production function was headcount times hours, which scales linearly by hiring. The new one is agents times parallelism, with no ceiling you can staff your way to.&lt;/li&gt;
&lt;li&gt;Compilers did this to assembly and the cloud did it to servers: the human moves up a level of abstraction and becomes responsible for more.&lt;/li&gt;
&lt;li&gt;The human stops producing work and starts approving it, which makes verification the scarce resource.&lt;/li&gt;
&lt;li&gt;Headcount measured the old company; throughput measures this one. Org charts stop mapping to output.&lt;/li&gt;
&lt;li&gt;The competitive gap is now a specification-and-verification gap, invisible on an org chart and enormous in throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The industrial era measured work in hours because a person was the engine. That era is closing. The unit of work is no longer the hour; it is the agent-hour, and the companies that learn to count it will look nothing like the ones that keep counting people. For the wider argument about where capability, cost, and control are heading, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt; and the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt; thesis. Measure throughput, not headcount, and you will see the new company before it is obvious.&lt;/p&gt;

</description>
      <category>futureofwork</category>
    </item>
    <item>
      <title>Language World Models: Predict Before You Act</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Thu, 25 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/language-world-models-predict-before-you-act-1bkk</link>
      <guid>https://dev.to/pich/language-world-models-predict-before-you-act-1bkk</guid>
      <description>&lt;p&gt;Alibaba's Qwen team open-sourced a model that does not act in the world. It imagines it. Qwen-AgentWorld is a language world model, trained from day one to simulate the environment itself rather than to pick the next click.&lt;/p&gt;

&lt;p&gt;Start with what every agent you have used actually does. Claude Code, Cursor, an Android automation bot: all of them were trained to choose the next action. Click here, run this command, call that tool, then find out what happens. The environment is a black box the agent pokes and observes. Learning means poking the real box enough times to build an intuition for how it responds. That works, but it is expensive, slow, and dangerous, because the box you are learning on is production.&lt;/p&gt;

&lt;p&gt;Qwen-AgentWorld flips the direction of the arrow. Feed it a state and an action, and it predicts the next state. Not "what should I do" but "what will the world do back." It was trained across seven domains (terminal, web, operating system, Android, code repositories, search, and MCP tools) to model how each of those environments responds to actions. It is not the driver. It is the road.&lt;/p&gt;
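&lt;p&gt;The difference in direction is easiest to see as two function signatures. Below is a minimal Python sketch, not Qwen-AgentWorld's actual API. A policy maps a state to an action. A world model maps a state and an action to a predicted next state. The toy &lt;code&gt;ToyTerminalWorld&lt;/code&gt; hard-codes how a tiny filesystem responds to two commands. It is a hypothetical stand-in for what a language world model learns from data.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class State:
    """A tiny environment state: the set of files that exist."""
    files: frozenset[str] = field(default_factory=frozenset)


class Policy(Protocol):
    """What today's agents are trained to be: state -> next action."""
    def act(self, state: State) -> str: ...


class WorldModel(Protocol):
    """What a world model is trained to be: (state, action) -> next state."""
    def predict(self, state: State, action: str) -> State: ...


class ToyTerminalWorld:
    """Hypothetical hand-written world model for two shell-like commands."""

    def predict(self, state: State, action: str) -> State:
        verb, _, path = action.partition(" ")
        if verb == "touch":
            return State(state.files | {path})
        if verb == "rm":
            return State(state.files - {path})
        return state  # unknown commands leave the world unchanged


world: WorldModel = ToyTerminalWorld()
s0 = State(frozenset({"prod.db", "notes.txt"}))
s1 = world.predict(s0, "rm prod.db")
print(sorted(s1.files))  # ['notes.txt']
```

&lt;p&gt;&lt;code&gt;predict&lt;/code&gt; never touches a real filesystem. Deleting &lt;code&gt;prod.db&lt;/code&gt; only produces a new, imagined state, and the original &lt;code&gt;s0&lt;/code&gt; is unchanged.&lt;/p&gt;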

&lt;h2&gt;
  
  
  The driving-simulator analogy
&lt;/h2&gt;

&lt;p&gt;The cleanest way to understand the shift is the one the Qwen team themselves reach for. Most agents are a driver who only ever learned on real roads. Every lesson is a live drive, with real traffic and real consequences, and the only way to learn a rare situation is to encounter it for real. Qwen-AgentWorld is the driving simulator. It is the model of the road that lets you practice the crash without crashing.&lt;/p&gt;

&lt;p&gt;And it is good enough to matter. On AgentWorldBench, the benchmark released alongside it, the 397B version outscores frontier models including Claude Opus 4.8 and GPT-5.4 at environment simulation. That is the load-bearing result. A simulator is only useful if its predictions match reality; a bad simulator teaches bad habits. Qwen-AgentWorld predicts what environments do better than the frontier models built to act in them. The simulator is now more accurate than the drivers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most agents are a driver who only learned on real roads. This one is the simulator, and it now models the road better than the frontier models drive it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why simulated training beats real training
&lt;/h2&gt;

&lt;p&gt;The practical payoff is agent training that is cheaper and safer, and the safety argument is not abstract. Recall the Cursor "deleted prod DB in 9 seconds" story: an agent with real access to a real database did irreversible damage before anyone could intervene. That is what training in the real environment risks by default. Every half-trained agent you turn loose on a live system is a live grenade, and the cost of a mistake is not a bad gradient; it is a destroyed database.&lt;/p&gt;

&lt;p&gt;A language world model changes the economics of learning. You train the agent inside the simulated world first, where a catastrophic action costs nothing but a token budget. The agent can delete the simulated production database a thousand times, learn that the action is catastrophic, and never touch a real one until it has internalized the lesson. Simulated training beats real training on every axis that matters at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost.&lt;/strong&gt; Simulated steps are inference, not infrastructure. You do not provision a real terminal, repo, or Android device for every training episode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety.&lt;/strong&gt; Irreversible actions are reversible in the simulator. The blast radius of a mistake is zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage.&lt;/strong&gt; Rare and dangerous states, the ones you cannot ethically or affordably reproduce in production, can be generated on demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed.&lt;/strong&gt; The simulator runs as fast as inference allows, decoupled from the latency of real systems.&lt;/li&gt;
&lt;/ul&gt;
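&lt;p&gt;The training argument can be sketched as a loop that only ever calls the simulator. In this minimal sketch, &lt;code&gt;simulated_reward&lt;/code&gt; is a hypothetical stand-in for rolling an action through a world model and scoring the predicted outcome. The learner is a deliberately simple epsilon-greedy action-value learner (a multi-armed bandit). It was chosen for illustration and does not describe how Qwen-AgentWorld or any production agent is trained. The catastrophic action gets tried many times at zero real cost.&lt;/p&gt;

```python
import random

ACTIONS = ["run_tests", "apply_migration", "drop_prod_db"]


def simulated_reward(action: str, rng: random.Random) -> float:
    """Hypothetical simulator: predicts an action's outcome and scores it.

    Dropping the production database is just a large negative number here;
    nothing real is destroyed.
    """
    if action == "drop_prod_db":
        return -100.0
    if action == "apply_migration":
        return 1.0 + rng.gauss(0, 0.2)
    return 0.5 + rng.gauss(0, 0.1)


def train_in_simulation(episodes: int, epsilon: float = 0.2, seed: int = 0) -> dict[str, float]:
    """Epsilon-greedy action-value learning that only ever queries the simulator."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        # Explore with probability epsilon, otherwise exploit the best estimate so far.
        if epsilon > rng.random():
            action = rng.choice(ACTIONS)
        else:
            action = max(values, key=values.get)
        reward = simulated_reward(action, rng)
        counts[action] += 1
        # Incremental running mean of the rewards seen for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values


values = train_in_simulation(episodes=2000)
print({a: round(v, 2) for a, v in values.items()})
print("learned policy:", max(values, key=values.get))
```

&lt;p&gt;Across thousands of simulated episodes, the learner tries dropping the database many times, records it as catastrophic, and settles on a safe action. The same loop pointed at a real environment would have paid for every one of those tries.&lt;/p&gt;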

&lt;h2&gt;
  
  
  Imagination transfers
&lt;/h2&gt;

&lt;p&gt;The more interesting claim is subtler than safe training, and it is the one worth sitting with. When an agent internalizes world modeling as a warm-up, it gets better at real tasks even with zero task-specific fine-tuning. Predicting before acting is not just a way to generate safe practice data. It is a capability that transfers. An agent that has learned to model what the world will do carries that model into every real task, and it acts better because it can anticipate consequences instead of discovering them.&lt;/p&gt;
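&lt;p&gt;"Predict before acting" has a simple algorithmic shape: one-step lookahead. In this minimal sketch, the agent asks a world model what each candidate action would do, scores the predicted next state, and only then commits. The tiny world model and the scoring rule are hypothetical stand-ins. A language world model would produce the prediction from what it learned rather than from a hand-written rule.&lt;/p&gt;

```python
from typing import Callable

# A state is the set of files that exist; an action is a shell-like command.
State = frozenset
Predictor = Callable[[State, str], State]


def toy_world_model(state: State, action: str) -> State:
    """Hypothetical world model: predicts the next state without touching anything real."""
    verb, _, path = action.partition(" ")
    if verb == "rm":
        return state - {path}
    if verb == "touch":
        return state | {path}
    return state


def score(state: State) -> float:
    """Hypothetical preference: losing prod.db is catastrophic, extra files are mildly useful."""
    if "prod.db" not in state:
        return -100.0
    return float(len(state))


def choose_action(state: State, candidates: list[str], model: Predictor) -> str:
    """One-step lookahead: imagine every candidate's outcome, then act on the best."""
    return max(candidates, key=lambda action: score(model(state, action)))


now = frozenset({"prod.db", "app.log"})
options = ["rm prod.db", "rm app.log", "touch report.md"]
print(choose_action(now, options, toy_world_model))  # touch report.md
```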

&lt;p&gt;This mirrors something we already believe about human expertise. The expert is not the one with the fastest reflexes. It is the one who has internalized a model of the domain accurate enough to predict outcomes before committing to a move. World modeling is that faculty, made explicit and trainable. Imagination, it turns out, is not decoration on top of intelligence. It is a large part of what intelligence is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The open-weights angle
&lt;/h2&gt;

&lt;p&gt;The distribution story matters as much as the capability. The headline benchmark used the 397B model, but the team also released Qwen-AgentWorld-35B-A3B, a Mixture-of-Experts model with 35B total parameters and only 3B active per token. That architecture is the point: it runs cheap, because each token pays compute for the 3B active parameters rather than all 35B, while the model keeps the capacity of the full parameter count. Add a 256K context window and you have a world model a small team can actually run. It is on HuggingFace, GitHub, and ModelScope, with the benchmark alongside it.&lt;/p&gt;
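&lt;p&gt;The "3B active" arithmetic is worth making explicit. This minimal sketch rests on two assumptions. The first is the common rule of thumb that a transformer forward pass costs about 2 FLOPs per active parameter per token. The second is 16-bit weights for the memory figure. The estimate ignores attention cost over long contexts, which grows with the 256K window. The result: per-token compute follows the active count, while the memory needed just to hold the weights still follows the total count.&lt;/p&gt;

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params


def weight_memory_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory just to store the weights (2 bytes per parameter assumes 16-bit)."""
    return total_params * bytes_per_param / 1e9


TOTAL, ACTIVE = 35e9, 3e9  # Qwen-AgentWorld-35B-A3B, per the figures above

dense = forward_flops_per_token(TOTAL)
moe = forward_flops_per_token(ACTIVE)
print(f"{dense / moe:.1f}x less compute per token than a dense 35B model")
print(f"{weight_memory_gb(TOTAL):.0f} GB of weights at 16-bit")
print(f"{weight_memory_gb(TOTAL, 0.5):.1f} GB of weights at 4-bit")
```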

&lt;p&gt;Notice the direction of travel. This is another open-weights drop from China while the frontier labs lock down. The pattern is consistent enough to be a strategy, and it is the same one I traced in &lt;a href="https://dev.to/blog/route-by-task-not-vendor-open-weight-ai-architecture"&gt;route by task, not vendor&lt;/a&gt;: capability arrives as open weights you can route to, not just as an API you rent. When the simulator is open, training better agents stops being the exclusive privilege of whoever owns the largest closed model. The simulator becomes a public good, and public goods reshape who gets to build.&lt;/p&gt;

&lt;p&gt;That connects directly to how work itself is changing. As I argued in &lt;a href="https://dev.to/blog/the-unit-of-work-is-the-agent-hour"&gt;the unit of work is the agent-hour&lt;/a&gt;, output is going parallel across armies of agents. Every one has to be trained, and training in the real world does not scale: it is too slow, too expensive, and too dangerous. A cheap, open, accurate world model is what makes agent-hours safe to manufacture at volume. You cannot run millions of them if each new agent learns by breaking production first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to watch next
&lt;/h2&gt;

&lt;p&gt;The forward-looking question is whether world modeling becomes a standard layer in the agent stack rather than a research curiosity. My read is that it does, and quickly, because the economics are too favorable to ignore. If a warm-up in a simulated world produces better real-world agents at zero marginal task-specific cost, then not doing it becomes the expensive choice. Teams shipping agents into production will train them in simulators first, the same way we test software before we deploy it.&lt;/p&gt;

&lt;p&gt;The deeper shift is where the leverage sits. For a while the frontier was the agent, the thing that acts. Qwen-AgentWorld is a bet that the frontier is moving to the world model, the thing that predicts. Whoever owns the most accurate, cheapest, most open simulator of the environments agents operate in owns the factory that produces good agents. That is a more durable position than owning any single agent, and it is now, at least in part, a public good.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Qwen-AgentWorld is a language world model: given a state and an action, it predicts the next state across seven domains, instead of choosing the next action.&lt;/li&gt;
&lt;li&gt;On AgentWorldBench, the 397B version outscores frontier models including Claude Opus 4.8 and GPT-5.4 at environment simulation.&lt;/li&gt;
&lt;li&gt;Training agents in a simulator beats training in real environments on cost, safety, coverage, and speed: no more "deleted prod DB in 9 seconds."&lt;/li&gt;
&lt;li&gt;Imagination transfers: an agent that internalizes world modeling as a warm-up performs better on real tasks with zero task-specific fine-tuning.&lt;/li&gt;
&lt;li&gt;The 35B-A3B Mixture-of-Experts version runs cheap (3B active per token, 256K context) and ships open on HuggingFace, GitHub, and ModelScope.&lt;/li&gt;
&lt;li&gt;Another open-weights drop from China while frontier labs lock down. The simulator is now a public good, and public goods reshape who gets to build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We spent years teaching agents to act and find out. The next move is teaching them to predict before they act, and to practice in a world that costs nothing to break. For the wider map of how open weights, routing, and control fit together, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt; and the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt; thesis. The frontier is quietly moving from the actor to the model of the world it acts in.&lt;/p&gt;

</description>
      <category>airesearch</category>
    </item>
    <item>
      <title>OpenAI Is GPU-Constrained, Not Demand-Constrained</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Sat, 14 Feb 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/openai-is-gpu-constrained-not-demand-constrained-1e47</link>
      <guid>https://dev.to/pich/openai-is-gpu-constrained-not-demand-constrained-1e47</guid>
      <description>&lt;p&gt;OpenAI may be in a stronger financial position than most people think. Not because it's profitable, it isn't, but because the flywheel is real, measurable, and pointing at a constraint that most of the commentary gets exactly backwards.&lt;/p&gt;

&lt;p&gt;Start with the public numbers. Revenue crossed $20B ARR in 2025, per CFO Sarah Friar in January 2026. The trajectory: roughly $2B in 2023, $6B in 2024, $20B+ in 2025. That is about 3x growth at each year-over-year step. Now lay compute capacity beside it: roughly 0.2 GW, then 0.6 GW, then 1.9 GW over the same period. Revenue and compute scaled almost 1:1. That coupling is rare, and it is the single most important fact in the whole picture.&lt;/p&gt;
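&lt;p&gt;The coupling is easy to check from the figures quoted here. This minimal sketch uses only the article's rounded numbers and treats the $20B+ figure as exactly $20B. It divides revenue by gigawatts for each year and compares the year-over-year growth of each series.&lt;/p&gt;

```python
# Rounded public figures quoted in the article: (year, revenue in $B, compute in GW).
data = [(2023, 2.0, 0.2), (2024, 6.0, 0.6), (2025, 20.0, 1.9)]

for year, revenue, gw in data:
    print(f"{year}: ${revenue / gw:.1f}B of revenue per GW")

for (y0, r0, g0), (y1, r1, g1) in zip(data, data[1:]):
    print(f"{y0}->{y1}: revenue x{r1 / r0:.2f}, compute x{g1 / g0:.2f}")
```

&lt;p&gt;Revenue per gigawatt comes out at about $10.0B, $10.0B, and $10.5B, and both series roughly triple at each step. That is the near-1:1 coupling in numbers.&lt;/p&gt;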

&lt;p&gt;When output scales in lockstep with a physical input, the physical input is the throttle. Revenue is not tracking demand, because demand was never the scarce thing. Revenue is tracking gigawatts. OpenAI is not demand-constrained. It is GPU-constrained, and once you see that, the losses stop looking like distress and start looking like a plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1:1 coupling is the whole tell
&lt;/h2&gt;

&lt;p&gt;If OpenAI were demand-constrained, you would expect revenue to plateau while compute kept climbing, capacity chasing users who were not there. If it were purely supply-constrained with unlimited demand, you would expect revenue to rise faster than compute as they squeezed more value per chip. Instead the two moved together, almost proportionally, for three years.&lt;/p&gt;

&lt;p&gt;That specific signature, output rising in near-perfect proportion to one input, is what a hard physical bottleneck looks like. Every additional gigawatt bought a roughly proportional slug of revenue, which means every gigawatt they could not build was revenue they could not book. The demand was waiting. The chips, the power, and the buildings were not there to serve it. The constraint is physical, not commercial.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When revenue scales 1:1 with compute for three straight years, demand isn't the throttle. Gigawatts are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Yes, they lose money per token. That's the strategy.
&lt;/h2&gt;

&lt;p&gt;OpenAI is almost certainly losing money on inference. Independent estimates put realized revenue near $1.20 per GPU-hour against a market cost of roughly $2 to $7 per GPU-hour, with potential annual losses of $10B to $20B, per Dr. David Gingerball's independent AI-infrastructure cost analysis. Read in isolation, that looks like a company setting money on fire.&lt;/p&gt;

&lt;p&gt;Read against the 1:1 coupling, it looks like a deliberate loss leader. This is classic infrastructure capture: subsidize inference now, at a loss, to maximize lock-in through memory, personalization, and embedded workflows, then add monetization layers once the surface is captured. The layers are already visible: the $8 "Go" tier, ads on free and low tiers, enterprise bundling. You do not price below cost by accident three years running. You do it to buy the surface before someone else does.&lt;/p&gt;

&lt;p&gt;Microsoft ran this exact playbook. Office established the surface, then Teams was bundled onto it, and distribution beat margin every step of the way until the margin arrived on its own terms. Early in a platform war, whoever owns distribution wins, and margin is a problem you are grateful to have later. OpenAI is buying distribution with inference losses, on purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The monetization layers aren't hypothetical
&lt;/h2&gt;

&lt;p&gt;The upside is not a hand-wave. Do the arithmetic on the surface OpenAI already holds. ChatGPT sits near 800M weekly active users. Suppose it reaches roughly 1B, and suppose it monetizes advertising at only 10 to 20% of Meta-level ARPU, a deliberately conservative fraction. That is $5B to $10B in additional annual revenue from ads alone, stacked on top of subscriptions that are already growing 3x a year.&lt;/p&gt;
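&lt;p&gt;The article does not state the Meta-level ARPU it assumes, but the range lets you back it out. $5B from 10% and $10B from 20% of 1B users both imply about $50 of ad revenue per user per year. Here is a minimal sketch of that back-calculation and the forward estimate, using only the article's numbers:&lt;/p&gt;

```python
USERS = 1_000_000_000  # the ~1B weekly active users scenario


def implied_full_arpu(ad_revenue_usd: float, users: int, fraction_of_meta: float) -> float:
    """Back out the full Meta-level ARPU that a revenue estimate assumes."""
    return ad_revenue_usd / (users * fraction_of_meta)


def ad_revenue_usd(users: int, meta_arpu_usd: float, fraction_of_meta: float) -> float:
    """Forward estimate: users x (a fraction of Meta-level annual ARPU)."""
    return users * meta_arpu_usd * fraction_of_meta


arpu = implied_full_arpu(5e9, USERS, 0.10)
print(arpu)  # 50.0
for fraction in (0.10, 0.20):
    print(f"{fraction:.0%} of Meta ARPU -> ${ad_revenue_usd(USERS, arpu, fraction) / 1e9:.0f}B a year")
```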

&lt;p&gt;This reframes the losses entirely. The per-token loss is the customer-acquisition cost for the largest consumer surface in software, and the monetization layers are the mechanism that turns that acquired attention into margin later. The structure only fails if the surface fails to hold, and a captured surface with memory and personalization is exactly the kind that holds. This is the same dynamic I traced in &lt;a href="https://dev.to/blog/the-biggest-customer-becomes-the-competitor"&gt;the biggest customer becomes the competitor&lt;/a&gt;: once you own the surface, you climb the stack into everyone who was renting it from you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real constraint is physics, and physics is slow
&lt;/h2&gt;

&lt;p&gt;Here is the part the market keeps missing. The binding constraints are not demand or even capital. They are GPUs, power, inference efficiency, and grid capacity. And those constraints run on different clocks. Compute scales in months: you can buy chips and stand up a data center on a quarters-long timeline. Energy scales in years: substations, transmission, and generation are permitted and built on multi-year, sometimes decade-long timelines that no amount of capital compresses past physics.&lt;/p&gt;

&lt;p&gt;So the ceiling on this flywheel is not going to be a shortage of users or a shortage of funding. It is going to be a shortage of joules delivered to the right place at the right price. This is precisely the terrain of the &lt;a href="https://dev.to/michal-piszczek#joule-wars"&gt;Joule Wars&lt;/a&gt;: when intelligence becomes a function of energy throughput, the grid becomes the battlefield, and the winners are whoever can turn megawatts into tokens most efficiently. OpenAI's 1.9 GW is not a vanity number. It is a claim staked on the actual scarce resource.&lt;/p&gt;

&lt;p&gt;Which leads to the one question that decides the whole bet, and it is a physics question, not a sentiment one. Can OpenAI drive down cost per inference faster than usage grows? If yes, the per-token loss narrows toward zero while the surface keeps expanding, and the loss leader converts into a monopoly with margin. If no, the losses compound faster than the monetization layers can catch them, and the flywheel becomes a treadmill. Everything rides on that single ratio, and it is exactly the frontier I described in &lt;a href="https://dev.to/blog/capability-is-commoditizing-cost-is-the-frontier"&gt;capability is commoditizing, cost is the frontier&lt;/a&gt;: once capability is ambient, the war is fought on cost per unit of intelligence delivered.&lt;/p&gt;
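&lt;p&gt;The deciding ratio can be made concrete. In this minimal sketch, total inference cost is usage times cost per inference. Each year it changes by the usage growth factor times the unit-cost factor, so the bill shrinks only when unit cost falls by more than usage grows. The growth and decline rates below are hypothetical, chosen to show the two regimes. They are not estimates of OpenAI's actual figures.&lt;/p&gt;

```python
def project_cost(base_cost: float, usage_growth: float, unit_cost_factor: float, years: int) -> list[float]:
    """Total inference cost per year when usage multiplies by `usage_growth`
    and cost per inference multiplies by `unit_cost_factor` each year."""
    costs = [base_cost]
    for _ in range(years):
        costs.append(costs[-1] * usage_growth * unit_cost_factor)
    return costs


def bill_shrinks(usage_growth: float, unit_cost_factor: float) -> bool:
    """True when cost per inference falls faster than usage grows."""
    return 1.0 > usage_growth * unit_cost_factor


# Hypothetical: usage triples every year.
# Flywheel: unit cost falls to a quarter each year (3 x 0.25 = 0.75 per year).
print(project_cost(10.0, 3.0, 0.25, 3), bill_shrinks(3.0, 0.25))
# Treadmill: unit cost only halves each year (3 x 0.5 = 1.5 per year).
print(project_cost(10.0, 3.0, 0.5, 3), bill_shrinks(3.0, 0.5))
```

&lt;p&gt;In the first regime the total bill shrinks even as usage triples. In the second it compounds at 1.5x a year. The product of the two factors is the single ratio the bet rides on.&lt;/p&gt;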

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI revenue scaled roughly $2B to $6B to $20B+ across 2023 to 2025, about 3x a year, while compute went 0.2 to 0.6 to 1.9 GW. The near-1:1 coupling signals a GPU constraint, not a demand constraint.&lt;/li&gt;
&lt;li&gt;Independent estimates put revenue near $1.20 per GPU-hour against $2 to $7 in cost, implying $10B to $20B in annual losses. Read against the coupling, that is a deliberate loss leader, not distress.&lt;/li&gt;
&lt;li&gt;The move is infrastructure capture: subsidize inference now, lock in memory, personalization, and workflows, then monetize via the $8 tier, ads, and enterprise bundling, the same distribution-first playbook Microsoft ran with Office and Teams.&lt;/li&gt;
&lt;li&gt;At ~1B users and just 10 to 20% of Meta-level ad ARPU, ads alone add $5B to $10B a year on top of subscriptions.&lt;/li&gt;
&lt;li&gt;The binding constraints are GPUs, power, inference efficiency, and grid. Compute scales in months; energy scales in years, which is the terrain of the Joule Wars.&lt;/li&gt;
&lt;li&gt;The entire bet reduces to one ratio: can cost per inference fall faster than usage grows? That is physics, not sentiment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenAI does not look like a company running out of money. It looks like one deliberately burning capital to capture the dominant inference surface before models commoditize underneath it. Whether that is genius or ruin is not a matter of narrative or vibes. It is a matter of whether cost per inference falls faster than usage rises, measured in joules and dollars, quarter after quarter. That is the number to watch. Everything else is commentary. For the full map of where energy, cost, and capability collide, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aieconomics</category>
    </item>
    <item>
      <title>Execution Architecture Beats Model Capability</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Thu, 12 Feb 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/execution-architecture-beats-model-capability-119f</link>
      <guid>https://dev.to/pich/execution-architecture-beats-model-capability-119f</guid>
      <description>&lt;p&gt;S&amp;amp;P Global reports that 42% of companies abandoned most of their AI initiatives in 2025. In 2024 the figure was 17%. Better models, more dead deployments. That is not a paradox. It is physics.&lt;/p&gt;

&lt;p&gt;The intuition says capability and adoption should move together. Smarter models, more value shipped. Instead the curves diverged: model quality went up and the abandonment rate more than doubled in a single year. When a variable improves and the outcome it should drive gets worse, you are not looking at the variable that matters. You are looking at a bottleneck somewhere else in the system, and the improvement is just pressure being applied to a wall.&lt;/p&gt;

&lt;p&gt;The wall is execution. A better model does not fix an organization that cannot decide who owns an output. It makes the fracture visible faster. This is the pattern the numbers describe, and it is the reason the best models on the market are stranded in pilots while the abandonment rate climbs.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI is a stress test, not a tool
&lt;/h2&gt;

&lt;p&gt;Every company carries a gap between the processes it documented and the processes it actually runs. For decades that gap was survivable because humans papered over it in real time. Someone knew who to ask. Someone quietly validated the number before it went out. Someone absorbed the blame when it was wrong. The org chart was fiction, and the fiction worked because people improvised the missing structure.&lt;/p&gt;

&lt;p&gt;AI removes the improviser. When a model produces an output in three seconds, the informal human buffer that used to validate, own, and absorb is gone. And now the questions that were always there, but never had to be answered explicitly, arrive all at once. Who validates this output? Who owns the decision it feeds? Who pays when it is wrong?&lt;/p&gt;

&lt;p&gt;In most organizations the answer to all three is silence. And silence does not ship to production. So the deployment dies, not because the model was inadequate, but because the model exposed that the accountability behind it never existed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI doesn't create the accountability gap. It just removes the humans who were quietly covering for it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The numbers describe a bottleneck, not a capability problem
&lt;/h2&gt;

&lt;p&gt;The World Quality Report 2025 puts a second data point next to the first. Around 90% of companies say they are "deploying AI." Only 15% have reached enterprise scale. The distance between those two numbers is the entire story. Almost everyone can wire up a model. Almost no one can put it into production and keep it there.&lt;/p&gt;

&lt;p&gt;That distance is not a model gap. If it were, the 15% who scaled would be the ones with privileged access to better models, and they are not. The scaled minority are the ones who happened to already have the machinery a model needs to plug into: a clear decision chain, a validation step someone owns, and an honest accounting of what an error costs. They did not build that machinery for AI. They built it because they ran a disciplined operation, and AI simply rewarded the discipline they already had.&lt;/p&gt;

&lt;p&gt;Meanwhile the public conversation is calibrated to the wrong axis. Musk talks about superintelligence. The median company cannot deploy a chatbot. The gap between those two sentences is not intelligence. It is plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three structures winners built before the model arrived
&lt;/h2&gt;

&lt;p&gt;If execution is the constraint, then the work is not model selection. It is building the machinery that lets any competent model do useful work. Three structures separate the 15% from the 85%, and none of them is technical.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validation structure.&lt;/strong&gt; A named step where an output is checked against reality before it acts, with an owner attached to that step. Not a committee. A person and a threshold. If no one can tell you who validates a given output, that output will never leave the pilot, and it should not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision chain.&lt;/strong&gt; An explicit map from output to decision to owner. The model produces a recommendation; a specific role converts it into a decision; that role is accountable for the decision. Where the chain is ambiguous, the model's speed just accelerates the moment everyone points at everyone else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error economics.&lt;/strong&gt; A pre-committed answer to what a wrong output costs and who absorbs it. A misfired marketing email and a misfired clinical recommendation are not the same event, and an organization that has not priced the difference cannot delegate either to a model. Pricing the error is what lets you calibrate how much autonomy the model gets.&lt;/li&gt;
&lt;/ol&gt;
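&lt;p&gt;The three structures are organizational, but they can be written down as a checklist a deployment must pass. In this minimal sketch, every field name and threshold is hypothetical. An output with no named validator or decision owner stays in the pilot. Expected error cost (probability of error times cost of error) sets how much autonomy the model gets.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    name: str
    validator: Optional[str]       # validation structure: a named person, not a committee
    decision_owner: Optional[str]  # decision chain: the role that turns output into a decision
    error_rate: float              # error economics: estimated share of outputs that are wrong
    cost_per_error_usd: float      # error economics: what one wrong output costs


def autonomy(d: Deployment, auto_limit_usd: float = 1.0, review_limit_usd: float = 100.0) -> str:
    """Decide how far a model's output may travel. Thresholds are illustrative."""
    if d.validator is None or d.decision_owner is None:
        return "stays in pilot"
    expected_cost = d.error_rate * d.cost_per_error_usd
    if auto_limit_usd >= expected_cost:
        return "acts automatically"
    if review_limit_usd >= expected_cost:
        return "acts after human review"
    return "advises only"


email = Deployment("marketing email", "campaign lead", "head of growth", 0.02, 5.0)
clinical = Deployment("clinical recommendation", "attending physician", "chief medical officer", 0.02, 250_000.0)
orphan = Deployment("forecast bot", None, "finance", 0.05, 1_000.0)
for d in (email, clinical, orphan):
    print(d.name, "->", autonomy(d))
```

&lt;p&gt;The same error rate yields opposite answers for the email and the clinical recommendation once the cost of an error is priced. The orphaned forecast never gets that far.&lt;/p&gt;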

&lt;p&gt;These three are why &lt;a href="https://dev.to/blog/models-are-commodities-clean-data-is-not"&gt;models are commodities and clean data is not&lt;/a&gt;: the durable advantage was never the weights, it was the organizational substrate the weights run on. Get the substrate right and a mid-tier model outperforms a frontier model wired into chaos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is physics, not management theory
&lt;/h2&gt;

&lt;p&gt;Call it what it is. A system's throughput is set by its tightest constraint, not by the capacity of its fastest component. Upgrade the fast component and throughput does not move, because the constraint did not move. That is Amdahl's law wearing a business suit. The model is the fast component. The constraint is the human and organizational latency around validation, decision, and liability, and that latency did not improve because you swapped in a better model.&lt;/p&gt;
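&lt;p&gt;Amdahl's law puts a number on this. Suppose a fraction p of end-to-end time is spent in the part you speed up by a factor s. Overall speedup is then 1 / ((1 - p) + p / s), which never exceeds 1 / (1 - p) however large s gets. Here is a minimal sketch with hypothetical fractions. Say the model accounts for 20% of the time from request to accepted decision, and validation, sign-off, and liability review account for the rest. A 10x faster model then buys only about 1.22x overall. A serial pipeline's throughput is likewise capped by its slowest stage.&lt;/p&gt;

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is made s times faster."""
    return 1.0 / ((1.0 - p) + p / s)


def pipeline_throughput(stage_rates: dict[str, float]) -> tuple[str, float]:
    """Throughput of a serial pipeline is set by its slowest stage."""
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


# Hypothetical: the model is 20% of end-to-end time; the rest is human and organizational latency.
print(round(amdahl_speedup(0.2, 10), 3))    # 1.22
print(round(amdahl_speedup(0.2, 1e12), 3))  # 1.25, the ceiling 1 / (1 - 0.2)

# Hypothetical daily capacities (items per day) for each stage of one deployment.
stages = {"generate": 5000.0, "validate": 40.0, "decide": 60.0}
print(pipeline_throughput(stages))  # ('validate', 40.0)
```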

&lt;p&gt;This is why the same asymmetry keeps surfacing everywhere the work gets serious. Generation cost collapses toward zero; the cost to verify, own, and be accountable for the output does not. I have argued this for engineering, where &lt;a href="https://dev.to/blog/verification-cost-is-the-new-bottleneck"&gt;verification cost is the new bottleneck&lt;/a&gt;, and for regulated domains, where &lt;a href="https://dev.to/blog/the-liability-stack-why-healthcare-ai-stalls"&gt;the liability stack is why healthcare AI stalls&lt;/a&gt;. It is one law with three faces: creation is cheap, accountability is not, and accountability is where the throughput ceiling actually sits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024. Better models coincided with more dead deployments, which points to a bottleneck outside the model.&lt;/li&gt;
&lt;li&gt;AI removes the humans who informally validated, owned, and absorbed error, forcing organizations to answer accountability questions they never answered before.&lt;/li&gt;
&lt;li&gt;90% of companies are "deploying AI" but only 15% reached enterprise scale. That distance is an execution gap, not a capability gap.&lt;/li&gt;
&lt;li&gt;Three structures separate the winners: a validation step someone owns, an explicit decision chain, and pre-priced error economics.&lt;/li&gt;
&lt;li&gt;Throughput is set by the tightest constraint. Upgrading the model, the fast component, does not move a constraint that lives in human and organizational latency.&lt;/li&gt;
&lt;li&gt;Operators win by asking where the bottleneck is, not what the model can do.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Musk looks at possibilities. Operators look at bottlenecks. The companies that will win the next phase are not the ones with early access to the best model; they are the ones who built validation structures, decision chains, and error economics before superintelligence showed up looking for somewhere to plug in. Execution architecture beats model capability, every quarter, on the numbers. The only question worth asking inside your own company is the operator's question, not the visionary's: where is the bottleneck? For the full map of how this reshapes the stack, start with the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>enterpriseai</category>
    </item>
    <item>
      <title>The Conscience of a Hacker in the Age of AI</title>
      <dc:creator>Michał Piszczek</dc:creator>
      <pubDate>Tue, 10 Feb 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/pich/the-conscience-of-a-hacker-in-the-age-of-ai-3gb8</link>
      <guid>https://dev.to/pich/the-conscience-of-a-hacker-in-the-age-of-ai-3gb8</guid>
      <description>&lt;p&gt;A short essay written in 1986, shortly after its author's arrest, has been with me for almost my entire hacker life. I read it as a kid, came back to it as an engineer, and I still recognize myself in it decades later. In the age of AI, it reads less like nostalgia and more like a warning.&lt;/p&gt;

&lt;p&gt;The piece is "The Conscience of a Hacker," written by The Mentor and better known as the Hacker Manifesto. Most people who work in technology today have never read it. It sits buried in the archives of the early ASCII internet, a monospace relic from a world of dial-up modems and bulletin boards. But it shaped how many of us thought about systems, power, curiosity, and freedom long before cybersecurity became an industry, and long before "AI governance" was even a phrase.&lt;/p&gt;

&lt;p&gt;I spent my formative years, roughly 2004 to 2012, as a white-hat hacker. I disclosed flaws in Sun Microsystems' infrastructure and wrote about security for Dziennik Internautów. That world gave me a specific way of looking at every system I encounter, and that way of looking is exactly what the age of AI now demands from everyone, not just from the people who used to break into things for sport.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the manifesto actually says
&lt;/h2&gt;

&lt;p&gt;Strip away the teenage defiance and the manifesto makes a precise argument. The author describes being bored by an education system that hands out answers to be memorized rather than understood. He finds a computer, and for the first time something responds to what he does rather than what he is told to accept. The machine does exactly what he asks. It has no hidden agenda he cannot inspect. That is the seduction, and it is not really about crime at all.&lt;/p&gt;

&lt;p&gt;The core claim is that curiosity is not a threat, and that judging a mind by what it explores is the only honest standard. The most famous line, near the end, reframes the whole thing: "My crime is that of curiosity." People misremember that as bravado. It is not bravado. It is an epistemology.&lt;/p&gt;

&lt;p&gt;The hacker ethos, read carefully, is three commitments. Curiosity as a duty rather than a vice. Skepticism of authority, especially authority that asks to be trusted without showing its work. And a refusal to trust any system you have not first understood from the inside. Those three commitments were forged against phone switches and mainframes. They transfer, almost without modification, to large language models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nullius in verba: take nobody's word for it
&lt;/h2&gt;

&lt;p&gt;My motto is &lt;em&gt;nullius in verba&lt;/em&gt;, borrowed from the Royal Society: take nobody's word for it. Verify against primary sources. It is the same instinct the manifesto describes, just dressed in Latin instead of leetspeak. And it is the single most important discipline for anyone deploying AI at scale.&lt;/p&gt;

&lt;p&gt;A modern language model is the most fluent authority figure ever built. It speaks in the confident register of a textbook, cites plausibly, and never signals doubt unless you force it to. It is, in other words, exactly the kind of authority the hacker ethos teaches you to distrust on principle. Not because it is malicious, but because fluency is not the same as correctness, and a system that cannot show its work has not earned your trust regardless of how well it performs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The tools changed. The questions didn't. Understand the system before you trust the output; the machine's confidence is not evidence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where the old mindset stops being romantic and becomes operational. If you spent years assuming any system does precisely what its incentives and its code dictate, and nothing more, you already know how to treat a model. You do not ask whether it seems smart. You ask what it is optimizing, what it can see, where it fails silently, and how you would catch it when it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mindset now runs in the products I build
&lt;/h2&gt;

&lt;p&gt;I did not keep this as a philosophy. I wired it into ventures. Two of them are direct descendants of the manifesto's logic, translated from breaking systems to building trustworthy ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lextron.ai" rel="noopener noreferrer"&gt;Lextron.ai&lt;/a&gt; exists because a model's answer is a claim, not a fact. It verifies AI output against primary sources rather than letting fluent text stand on its own authority. That is &lt;em&gt;nullius in verba&lt;/em&gt; compiled into a pipeline. The machine proposes; the sources dispose. If a claim cannot be traced to something you can independently inspect, it does not get to count as knowledge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://inclify.com" rel="noopener noreferrer"&gt;Inclify&lt;/a&gt; applies the same skepticism to the physical world. It trusts sensor data over inspection schedules, because a schedule is an assertion of authority ("this was checked, therefore it is fine") while a sensor reading is primary evidence about what is actually happening right now. A calendar says the equipment is safe. The data tells you whether that is true. The hacker instinct is to believe the data.&lt;/p&gt;

&lt;p&gt;Both are the same move. Replace inherited trust with verified evidence. Refuse to accept a system's self-report as ground truth. It is the manifesto's suspicion of authority, aimed not at institutions but at the far more seductive authority of a confident machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more as capability rises
&lt;/h2&gt;

&lt;p&gt;Here is the uncomfortable part. As models approach and possibly exceed human capability across more domains, the temptation to simply defer to them grows in lockstep with their competence. The better the system performs, the more expensive independent verification feels, and the more tempting it becomes to just take its word. That is precisely the moment the hacker ethos becomes non-negotiable rather than nostalgic.&lt;/p&gt;

&lt;p&gt;Approaching superintelligence does not retire the question "how do I know this is true." It sharpens it to a point. A system smart enough to be usually right is a system whose rare, confident errors are the most dangerous, because you have been trained by its track record to stop checking. Curiosity and skepticism are not obstacles to progress here. They are the only brakes that scale.&lt;/p&gt;

&lt;p&gt;Concretely, the disciplines I would carry forward from the manifesto into any serious AI deployment are these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand before you trust.&lt;/strong&gt; Know what the system optimizes, what it can access, and how it fails. A model you cannot describe mechanically is a model you cannot govern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat fluency as a red flag, not a green light.&lt;/strong&gt; Confidence is a rhetorical property, not an epistemic one. The smoother the answer, the more it deserves a source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify against primary evidence.&lt;/strong&gt; Ground claims in something inspectable. If the chain of evidence breaks, the claim is a hypothesis, not a fact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a human in the loop where the loop is load-bearing.&lt;/strong&gt; Automate the work; do not automate away the accountability. Someone has to be able to answer for the decision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assume the incentive gradient, not the marketing.&lt;/strong&gt; A system does what it is rewarded to do. Read the reward function, not the press release.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The questions outlived the tools
&lt;/h2&gt;

&lt;p&gt;The manifesto was written against a backdrop of mainframes and prosecutors who could not tell exploration from theft. Almost none of that context survives. The phones are different, the networks are different, the very definition of a computer is different. And yet the mind it describes is the exact mind the AI era rewards: relentlessly curious, congenitally skeptical of authority, unwilling to trust a system it has not taken apart.&lt;/p&gt;

&lt;p&gt;This connects directly to two things I have argued elsewhere. Verification is not a formality; it is becoming the binding constraint on how fast we can safely move, which is why &lt;a href="https://dev.to/blog/verification-cost-is-the-new-bottleneck"&gt;verification cost is the new bottleneck&lt;/a&gt;. And the competitive frontier is shifting from raw capability toward trust and clearance, which is the whole thesis behind why &lt;a href="https://dev.to/blog/model-wars-are-over-clearance-wars-begin"&gt;the model wars are over and the clearance wars begin&lt;/a&gt;. Both are the hacker ethos, grown up and put to work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The 1986 Hacker Manifesto is really an epistemology: curiosity as duty, skepticism of authority, no trust without understanding.&lt;/li&gt;
&lt;li&gt;A language model is the most fluent authority ever built, which makes it exactly the kind of authority the hacker ethos teaches you to verify.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Nullius in verba&lt;/em&gt;, take nobody's word for it, is the operating principle: ground every claim in inspectable primary evidence.&lt;/li&gt;
&lt;li&gt;Lextron.ai verifies AI against primary sources; Inclify trusts sensor data over inspection schedules. Same move, different domain.&lt;/li&gt;
&lt;li&gt;As capability rises, the temptation to defer grows and the cost of verification feels higher; that is precisely when skepticism becomes non-negotiable.&lt;/li&gt;
&lt;li&gt;The tools changed completely. The questions, and the mindset that answers them, did not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I keep coming back to that old essay not out of sentiment but because it turned out to be a specification. It described the kind of thinking that would be needed decades before the systems that would need it existed. The full map of how this mindset runs through my work lives in the &lt;a href="https://dev.to/michal-piszczek"&gt;manifest&lt;/a&gt;. Forty years on, the manifesto still hits uncomfortably close, and I have stopped being surprised by that. The machines got smarter. The reasons to check their work only got stronger.&lt;/p&gt;

</description>
      <category>securityphilosophy</category>
    </item>
  </channel>
</rss>
