<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: guanjiawei</title>
    <description>The latest articles on DEV Community by guanjiawei (@skyguan92).</description>
    <link>https://dev.to/skyguan92</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3788265%2Ff93aaebd-c44b-447a-b582-cc297747f93b.jpeg</url>
      <title>DEV Community: guanjiawei</title>
      <link>https://dev.to/skyguan92</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/skyguan92"/>
    <language>en</language>
    <item>
      <title>The Shortest Chain in History: From Tech to Profit in One Day</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Wed, 17 Jun 2026 05:26:07 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-shortest-chain-in-history-from-tech-to-profit-in-one-day-54mj</link>
      <guid>https://dev.to/skyguan92/the-shortest-chain-in-history-from-tech-to-profit-in-one-day-54mj</guid>
      <description>&lt;p&gt;A while back, our team was working on an optimization for the inference engine.&lt;/p&gt;

&lt;p&gt;Exactly what we were studying doesn't matter—it was one of those grinding, unglamorous tasks: staring at the profiler for nearly three weeks, squeezing incremental gains out of scheduling and VRAM, and finally improving throughput by a few points.&lt;/p&gt;

&lt;p&gt;What can a few points do? Back when I was starting out, almost nothing. But this time was different. One morning we merged it into main and rolled it out gradually; the next day we opened the dashboard and those points had already turned into real money. Same cards, same model, same customers—cost per token dropped by a few points, and gross margin widened by a few points. From the day it went live, it started showing up on the books.&lt;/p&gt;

&lt;p&gt;A low-level optimization went from an idea in the head of some kid on the team born after 2000 to profit on the ledger in a single night.&lt;/p&gt;

&lt;p&gt;Five years ago, this would have been unthinkable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Chain, Short Lifespan
&lt;/h2&gt;

&lt;p&gt;Lately, whenever I talk to friends, the conversation keeps circling back to the same thought: we've stumbled into an unusual era.&lt;/p&gt;

&lt;p&gt;What's unusual? Search history and you'll struggle to find another time when technical leadership turned into commercial influence, competitive advantage, or direct profit this fast. How short? A model company today basically has one job: train a smart enough model, ship it, provision compute for tokens, maybe open-source it. The rest—influence, revenue, valuation—grows on its own, fast.&lt;/p&gt;

&lt;p&gt;The catch: the other end of this chain is just as short. Each generation's time in the spotlight is compressed to three to six months, and that's optimistic. Often, a model goes from hot topic to complete silence in one to three months.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Fast the Wind Has Turned These Past Six Months
&lt;/h2&gt;

&lt;p&gt;Last November, Google dropped Gemini 3 and slaughtered the leaderboards; even OpenAI declared a "Code Red" internally. Back then, open any group chat and the screen was full of Gemini worship. Six months later, the conversation has moved on. Not that Gemini is failing—users are still growing—but the spotlight on this track only stays lit for about three months.&lt;/p&gt;

&lt;p&gt;Go back a bit further. When Claude Opus 4.6 dropped, insiders and outsiders alike thought "this is the thing that changes the world." At the time, it truly crushed the competition. Then 4.7, 4.8, and so on came along, praise and criticism trailing right behind.&lt;/p&gt;

&lt;p&gt;OpenAI's story is even more dramatic. Its coding capabilities had always been mocked. I used the early Codex myself for a while and then unsubscribed because it was genuinely terrible. Then, riding GPT-5.4 and 5.5, Codex felt like it had swapped in a new engine: OpenAI's official numbers say weekly active users broke 5 million, a 6x increase since the desktop launch in February. One generation of models dragged a product everyone had written off out of the gutter.&lt;/p&gt;

&lt;p&gt;The clearest example in China is Z.ai. A year ago its position was precarious, and it seemed about to drop out of the race. Then GLM-4.5, 4.6, and 4.7 came out in quick succession, followed by GLM-5, 5.1, and 5.2 at the start of the year (three versions in three months), and the situation flipped completely. In the six months after its Hong Kong IPO, its stock price rose roughly eightfold, with market cap surging past 600 billion HKD. The technical reversal was written directly into the stock price.&lt;/p&gt;

&lt;p&gt;MiniMax is the opposite case. At IPO its pricing and valuation were in the same ballpark as Z.ai's. Its stock price doubled on day one, market cap briefly hit over $13 billion, and during the March surge it even briefly surpassed Baidu's Hong Kong-listed shares. But the wind turned just as fast: the M2.7 and M3 generations didn't catch the hype, market expectations immediately took a discount, and market cap fell back sharply from its peak. The speed at which they hype you is the same speed at which they abandon you.&lt;/p&gt;

&lt;p&gt;After all the noise, attention converged on two things: coding and multimodality. Traditional valuation logic—users, revenue structure, moats—basically fails here. Everyone is really only asking one question: is your current generation of models strong enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  In My Line of Work, This Chain Used to Be Terrifyingly Long
&lt;/h2&gt;

&lt;p&gt;The thrill of "tech turning into money in days" exists because I know exactly how excruciating it used to be.&lt;/p&gt;

&lt;p&gt;I work in infra, a job bound tightly to the underlying infrastructure. In the past, if you made a cluster-level innovation that improved inference efficiency by 15%, turning that 15% into a commercial advantage was hard enough to make you quit.&lt;/p&gt;

&lt;p&gt;Why so hard? You can't directly price that 15%. You can't tell a customer, "The machine used to cost a million, now it's 15% faster, so I'll sell it to you for 1.15 million." That's not how they calculate. They'll drag you into their TCO model, into their risk structure, haggling: how do you prove it's 15%, could it actually be 5%, who guarantees stability, who bears supply chain volatility.&lt;/p&gt;

&lt;p&gt;So between a low-level optimization and the point where customers actually pay for it lie long cycles, complex supply chains, and a massive business team. You have to maintain an entire organization, grinding slowly at the far end of the commercial chain, to turn technology into profit and scale. A tweak at the bottom layer takes forever to echo back from the market.&lt;/p&gt;

&lt;p&gt;Worse, models have short lifespans. You toil away optimizing for a particular model generation for months, and by the time you're done, its moment in the spotlight has already passed. The investment hasn't broken even, but the target has already disappeared. So inference infra used to be stuck in an awkward spot: everyone believed it would matter down the road, but nobody could see a clear business model right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tokens Are Spot Goods, Not Futures
&lt;/h2&gt;

&lt;p&gt;Then AI coding ignited token demand, squeezing every bit of slack out of the chain until the result feels almost unreal.&lt;/p&gt;

&lt;p&gt;The key is the nature of tokens: they settle spot against daily capacity. Traditional goods are designed today, produced tomorrow, and shipped the day after; tokens are computed right now, and the results go out within seconds. This single trait rewrote all the rules.&lt;/p&gt;

&lt;p&gt;The few points we earned from those three weeks of research start settling the moment they go live: from the next day, the extra capacity squeezed out of daily production is booked directly as additional gross margin and competitive advantage. No waiting for the next fiscal year, no entering someone's TCO model, no maintaining a team to prove how much it's worth. Capacity grew, real cost per token dropped, and the books looked better that same day.&lt;/p&gt;
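&lt;p&gt;To make the arithmetic concrete, here is a minimal sketch with made-up numbers. The daily cost, token volume, price, and 4% gain are all assumptions for illustration, not figures from our dashboard. With the hardware bill fixed, a throughput gain of g lowers cost per token by g / (1 + g), and the difference lands in gross margin.&lt;/p&gt;

```python
def cost_per_token(daily_hardware_cost, tokens_per_day):
    """Cost of one token when the hardware bill is fixed per day."""
    return daily_hardware_cost / tokens_per_day


def gross_margin(price_per_token, cost):
    """Gross margin as a fraction of the price charged."""
    return (price_per_token - cost) / price_per_token


# Hypothetical figures, for illustration only.
DAILY_COST = 10_000.0        # fixed daily spend on the same cards
BASE_TOKENS = 5_000_000_000  # tokens served per day before the change
PRICE = 3e-6                 # price charged per token
GAIN = 0.04                  # "a few points" of extra throughput

before = cost_per_token(DAILY_COST, BASE_TOKENS)
after = cost_per_token(DAILY_COST, BASE_TOKENS * (1 + GAIN))

cost_drop = 1 - after / before  # equals GAIN / (1 + GAIN)
margin_before = gross_margin(PRICE, before)
margin_after = gross_margin(PRICE, after)

print(f"cost per token drops {cost_drop:.2%}")
print(f"gross margin moves from {margin_before:.2%} to {margin_after:.2%}")
```

&lt;p&gt;Nothing in this calculation depends on a customer negotiation or a fiscal calendar, which is why the gain can show up the day after rollout.&lt;/p&gt;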

&lt;p&gt;And it replicates with almost zero friction. This kind of optimization doesn't discriminate by region—same type of compute, same model, same customers. It's basically a straight port, spreading extremely fast.&lt;/p&gt;

&lt;p&gt;Measured in days, a low-level technical improvement converts directly into commercial returns, skipping that entire long, heavy system in between. This is probably the shortest chain from technology to profit in history.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Is the Era of the Young
&lt;/h2&gt;

&lt;p&gt;There's another thing I keep thinking about, and it amuses me more and more: the people doing this on the front lines are mostly kids in their early twenties.&lt;/p&gt;

&lt;p&gt;This era is unusually kind to them, because its judging criteria are brutally objective: did efficiency rise or fall, did accuracy change. Put the thing on the table and one test settles it, leaving almost no room for "seniority" or "connections." You don't need any veteran's nod, nor do you need to be good at reading people or playing politics. Whether an industry's senior judges stamp their approval on you doesn't matter here. Make something real, and it writes itself plainly on production efficiency.&lt;/p&gt;

&lt;p&gt;A twenty-something can create value worth hundreds of millions, even billions of dollars, with a single technical breakthrough. Fast enough to see results immediately, hard enough to be verified by anyone. The entire chain no longer needs to be stuffed with so many people whose sole job is to "judge whether you're qualified."&lt;/p&gt;

&lt;h2&gt;
  
  
  In Closing
&lt;/h2&gt;

&lt;p&gt;The era we've caught is global, its tempo terrifyingly short, and it will likely only grow more brutal from here.&lt;/p&gt;

&lt;p&gt;But it has also genuinely deleted those long, heavy middle layers, along with those people whose entire job is to judge you. What remains is technological innovation, and the person who creates it.&lt;/p&gt;

&lt;p&gt;The fewer people in the middle, the more valuable the ones doing the work.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Google, "Introducing Gemini 3", 2025-11-18, &lt;a href="https://blog.google/products/gemini/gemini-3/" rel="noopener noreferrer"&gt;https://blog.google/products/gemini/gemini-3/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fortune, "Sam Altman declares 'Code Red' as Gemini 3 surges", 2025-12-02, &lt;a href="https://fortune.com/2025/12/02/sam-altman-declares-code-red-google-gemini-ceo-sundar-pichai/" rel="noopener noreferrer"&gt;https://fortune.com/2025/12/02/sam-altman-declares-code-red-google-gemini-ceo-sundar-pichai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI, "Codex is becoming a productivity tool for everyone" (Weekly active users exceed 5 million, 6x growth since February), 2026-06-02, &lt;a href="https://openai.com/index/codex-for-knowledge-work/" rel="noopener noreferrer"&gt;https://openai.com/index/codex-for-knowledge-work/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI, "Codex for (almost) everything" (3 million weekly active users milestone), 2026-04-16, &lt;a href="https://openai.com/index/codex-for-almost-everything/" rel="noopener noreferrer"&gt;https://openai.com/index/codex-for-almost-everything/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI, "ChatGPT — Release Notes" (GPT-5.4 / 5.5 release notes), accessed 2026-06-17, &lt;a href="https://help.openai.com/en/articles/6825453-chatgpt-release-notes" rel="noopener noreferrer"&gt;https://help.openai.com/en/articles/6825453-chatgpt-release-notes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Securities Times, "'Global Large Model First Stock' Z.ai Exceeds 57 Billion HKD Market Cap on IPO Day", 2026-01, &lt;a href="https://www.stcn.com/article/detail/3580246.html" rel="noopener noreferrer"&gt;https://www.stcn.com/article/detail/3580246.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Finet.com.cn, "[IPO Tracking] Z.ai (02513.HK) High Growth Ignites Hong Kong Stocks, Shares Rise 31% to New High", 2026-04, &lt;a href="https://www.finet.com.cn/news/69cc927d2308294c69bf7bec.html" rel="noopener noreferrer"&gt;https://www.finet.com.cn/news/69cc927d2308294c69bf7bec.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Z.ai, "Z.ai Releases GLM-4.7", PR Newswire, 2025-12-22, &lt;a href="https://www.prnewswire.com/news-releases/zai-releases-glm-4-7-designed-for-real-world-development-environments-cementing-itself-as-chinas-openai-302649821.html" rel="noopener noreferrer"&gt;https://www.prnewswire.com/news-releases/zai-releases-glm-4-7-designed-for-real-world-development-environments-cementing-itself-as-chinas-openai-302649821.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MarkTechPost, "Z.ai Launches GLM-5.2 With a Usable 1M-Token Context", 2026-06-14, &lt;a href="https://www.marktechpost.com/2026/06/14/z-ai-launches-glm-5-2-with-a-usable-1m-token-context-two-thinking-effort-levels-and-no-benchmarks-at-launch/" rel="noopener noreferrer"&gt;https://www.marktechpost.com/2026/06/14/z-ai-launches-glm-5-2-with-a-usable-1m-token-context-two-thinking-effort-levels-and-no-benchmarks-at-launch/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Reuters, "MiniMax doubles in value in Hong Kong debut", 2026-01-09, &lt;a href="https://www.reuters.com/world/asia-pacific/china-ai-firm-minimax-set-surge-hong-kong-debut-2026-01-09/" rel="noopener noreferrer"&gt;https://www.reuters.com/world/asia-pacific/china-ai-firm-minimax-set-surge-hong-kong-debut-2026-01-09/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Securities Times, "Stock Price Surges 51% in Two Days, MiniMax Market Cap Successively Surpasses Three Internet Giants", 2026-03-11, &lt;a href="https://www.stcn.com/article/detail/3670887.html" rel="noopener noreferrer"&gt;https://www.stcn.com/article/detail/3670887.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MiniMax, "MiniMax-M2.5" (SWE-bench Verified 80.2), GitHub, 2026-02, &lt;a href="https://github.com/MiniMax-AI/MiniMax-M2.5" rel="noopener noreferrer"&gt;https://github.com/MiniMax-AI/MiniMax-M2.5&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/shortest-chain-to-profit" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/shortest-chain-to-profit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infra</category>
      <category>tokeneconomy</category>
      <category>reflection</category>
    </item>
    <item>
      <title>Top-Tier Intelligence, Cut Off Overnight</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 15 Jun 2026 09:50:26 +0000</pubDate>
      <link>https://dev.to/skyguan92/top-tier-intelligence-cut-off-overnight-236p</link>
      <guid>https://dev.to/skyguan92/top-tier-intelligence-cut-off-overnight-236p</guid>
      <description>&lt;p&gt;I set Fable 5 on a long task and let it run, telling myself I'd check back in a few hours. When I returned, the model doing the work wasn't Fable 5 anymore. It was Opus 4.8. It had downgraded itself behind my back.&lt;/p&gt;

&lt;p&gt;I stared at the results. My first thought wasn't whether it had done a good job, but whether the output was even usable. I had set my expectations for Fable 5-level work. When a weaker model took over halfway through, those first few hours were suddenly up in the air.&lt;/p&gt;

&lt;p&gt;Later I learned this was an official mechanism. Fable 5 was Anthropic's strongest model at the time, priced at twice Opus 4.8. But it came with a safety classifier. Once the classifier flagged your question as touching on sensitive areas like cybersecurity or biochemistry, it would automatically hand that round off to Opus 4.8. The official trigger rate was under 5%, but I was building an inference engine, working with low-level code and system calls every day, so I was bound to get caught in the crossfire. One false positive, and the whole session dropped a level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gone in Three Days
&lt;/h2&gt;

&lt;p&gt;The silent downgrade was merely irritating. What really left me speechless was Fable 5 disappearing entirely just days later.&lt;/p&gt;

&lt;p&gt;It had been out for only three days when the U.S. Commerce Department handed Anthropic an export control order: add the model to the export control list under national security provisions. The trigger was someone jailbreaking it and using it to dig up software vulnerabilities. The order wasn't worded as "don't sell to China," but as "do not provide to any foreign national." Even foreign employees working in the U.S. were covered. Anthropic said the only way to stay compliant was to shut it down for everyone, worldwide. So allies like South Korea and the UK were cut off too.&lt;/p&gt;

&lt;p&gt;This proved something I'd told my team before: with overseas top-tier closed-source models, use the best ones you can get, while you can. That path only gets narrower, and there's no going back. I thought the squeeze would come gradually. Instead, it was three days post-launch, one piece of paper, and cut off just like that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software as Munitions Is Nothing New
&lt;/h2&gt;

&lt;p&gt;Treating intelligence as a strategic resource sounds like something new. It's actually the same old story.&lt;/p&gt;

&lt;p&gt;In the 1990s, the U.S. placed strong encryption algorithms on the munitions list, regulating them as arms exports. Civilian encryption was deliberately weakened until it was practically useless. PGP creator Phil Zimmermann was investigated by federal authorities for three years for "munitions export without a license" after putting his encryption software online. The logic was identical: once a piece of software becomes capable enough, it stops being software. It becomes a weapon.&lt;/p&gt;

&lt;p&gt;Fable 5 got the same judgment: it was too good at finding vulnerabilities, so it was reclassified from "commercial product" to "cyberweapon."&lt;/p&gt;

&lt;p&gt;But the encryption fight had a more interesting ending. The controls didn't work. Strong encryption spread globally through open source, and walls couldn't stop it. By 2000, the U.S. had no choice but to loosen its grip. Any capability that is useful enough and can be replicated will be slowed by regulation, but never stopped.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Replacements Are Already at the Door
&lt;/h2&gt;

&lt;p&gt;This time, the replacements showed up almost the same week.&lt;/p&gt;

&lt;p&gt;The day after Fable 5 was shut down, Zhipu pushed out GLM-5.2. The timing was almost too perfect. Before that, MiniMax had released M3, and Kimi had released K2.7 Code. All of them were open source or open-weight. GLM-5.2 got great word-of-mouth. People mostly complained about how slow it was; few questioned its quality. This matched my own experience using it: infrastructure and GPU shortages made serving it a struggle, but the model itself was genuinely right at the top tier.&lt;/p&gt;

&lt;p&gt;There's another angle. Silicon Valley's top AI talent costs have skyrocketed, and buying GPUs is restricted. Yet in this environment short on both people and compute, domestic players built models that benchmark against the top tier, and open-sourced them.&lt;/p&gt;

&lt;p&gt;Today, Chinese-made models account for the largest share of global open-source model downloads, 17.1% to the U.S.'s 15.8%. Doing all that with a fraction of the competition's resources, and still releasing the weights, is genuinely impressive. Zhipu's recurring slogan this time was "open": intelligence should belong to everyone.&lt;/p&gt;

&lt;p&gt;This is no longer about which model is stronger. The White House's AI action plan last year opened with "America is in a race to achieve global dominance in AI." When a nation puts intelligence at this level, cutting off others' access becomes the obvious play. The Industrial Revolution already played out this script: once one country's productivity exceeds another's, it can do almost as it pleases. Intelligence is the foundation of the next wave of productivity. That card will only get heavier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use It While It's Open
&lt;/h2&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;p&gt;First, use what's open while you can, and put it to work where it counts. Top-tier intelligence now behaves more like a quota that expires. You never know when someone will flip a switch. While you still have access, give it the hardest problems.&lt;/p&gt;

&lt;p&gt;Second, don't bet your entire intelligence supply on a switch you can't reach. Locking your workflow to a single closed-source model means one compliance ruling can stop you cold. You need a fallback: open-weight models, deployments you control. Even if they're slower or weaker, they'll still be there when it counts.&lt;/p&gt;
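&lt;p&gt;A fallback like that can be sketched as an ordered list of providers. The version below is a minimal illustration, not a real client: the two provider functions and the &lt;code&gt;ProviderUnavailable&lt;/code&gt; exception are hypothetical stand-ins for whatever SDKs and errors you actually have.&lt;/p&gt;

```python
class ProviderUnavailable(Exception):
    """Raised by a provider that cannot serve the request."""


def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real clients.
def closed_frontier_model(prompt):
    # Simulates a closed model whose access was switched off.
    raise ProviderUnavailable("access revoked")


def self_hosted_open_weights(prompt):
    # Simulates a deployment you control.
    return f"[local] {prompt}"


used, answer = complete_with_fallback(
    "summarize the profiler trace",
    [("closed", closed_frontier_model), ("local", self_hosted_open_weights)],
)
print(used, answer)
```

&lt;p&gt;The ordering encodes the advice above: prefer the strongest model while it is reachable, and keep a deployment you control at the end of the list.&lt;/p&gt;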

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Fable 5's silent downgrade felt like a lousy product experience at the time. In retrospect, it was a preview: the intelligence you think you hold securely can be quietly swapped out, or simply taken away, at any moment.&lt;/p&gt;

&lt;p&gt;Models will keep getting stronger generation after generation. But whether you can actually use the strongest one is less and less a technical problem.&lt;/p&gt;

&lt;p&gt;The more intelligence is treated as a strategic resource, the more valuable access to it becomes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Anthropic, "Claude Fable 5 and Claude Mythos 5" (Official launch announcement, including the mechanism for automatic downgrading to Opus 4.8 upon safety classifier triggers), 2026-06-09, &lt;a href="https://www.anthropic.com/news/claude-fable-5-mythos-5" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/claude-fable-5-mythos-5&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic, "Statement on US government directive" (Official shutdown statement), 2026-06-12, &lt;a href="https://www.anthropic.com/news/fable-mythos-access" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/fable-mythos-access&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The New Stack, "US Gov Orders Anthropic To Pull Fable 5 and Mythos 5 Three Days After Launch" (Mechanism: export control order, not executive order; compliance logic behind global shutdown), 2026-06, &lt;a href="https://thenewstack.io/us-gov-orders-anthropic-to-pull-fable-5-and-mythos-5-three-days-after-launch/" rel="noopener noreferrer"&gt;https://thenewstack.io/us-gov-orders-anthropic-to-pull-fable-5-and-mythos-5-three-days-after-launch/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fortune, "Anthropic disables Fable and Mythos under export controls" (Commerce Secretary Lutnick's letter to Amodei; Opus 4.8 unaffected), 2026-06-13, &lt;a href="https://fortune.com/2026/06/13/anthropic-disables-fable-mythos-export-controls-national-security-threat/" rel="noopener noreferrer"&gt;https://fortune.com/2026/06/13/anthropic-disables-fable-mythos-export-controls-national-security-threat/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;InfoQ, "Anthropic Releases—and Then Suspends—Claude Fable 5 and Mythos 5" (Release specs, pricing, and suspension timeline), 2026-06, &lt;a href="https://www.infoq.com/news/2026/06/claude-5-release/" rel="noopener noreferrer"&gt;https://www.infoq.com/news/2026/06/claude-5-release/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TechCrunch, "Anthropic releases Opus 4.8 with new Dynamic Workflow tool", 2026-05-28, &lt;a href="https://techcrunch.com/2026/05/28/anthropic-releases-opus-4-8-with-new-dynamic-workflow-tool/" rel="noopener noreferrer"&gt;https://techcrunch.com/2026/05/28/anthropic-releases-opus-4-8-with-new-dynamic-workflow-tool/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Zhipu Z.ai, GLM-5.2 API Documentation / Release Notes, 2026-06-13, &lt;a href="https://docs.z.ai/devpack/latest-model" rel="noopener noreferrer"&gt;https://docs.z.ai/devpack/latest-model&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MiniMax, "MiniMax M3 is officially released" (Official blog), 2026-06-01, &lt;a href="https://www.minimax.io/blog/minimax-m3" rel="noopener noreferrer"&gt;https://www.minimax.io/blog/minimax-m3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Moonshot AI, "Kimi-K2.7-Code" (HuggingFace model page, Modified MIT open source), 2026-06-12, &lt;a href="https://huggingface.co/moonshotai/Kimi-K2.7-Code" rel="noopener noreferrer"&gt;https://huggingface.co/moonshotai/Kimi-K2.7-Code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Science and Technology Daily / People's Daily Online, "China-developed open-source AI models account for 17.1% of global downloads, ranking first in the world", 2025-12-08, &lt;a href="http://finance.people.com.cn/BIG5/n1/2025/1208/c1004-40619459.html" rel="noopener noreferrer"&gt;http://finance.people.com.cn/BIG5/n1/2025/1208/c1004-40619459.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The White House, "America's AI Action Plan" (Opening line: "a race to achieve global dominance in AI"), 2025-07, &lt;a href="https://www.ai.gov/action-plan" rel="noopener noreferrer"&gt;https://www.ai.gov/action-plan&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;U.S. Bureau of Industry and Security, "Commerce Announces Rescission of Biden-Era AI Diffusion Rule" (AI diffusion rule rescinded; control focus shifts to chips), 2025-05-13, &lt;a href="https://www.bis.gov/press-release/department-commerce-announces-rescission-biden-era-artificial-intelligence-diffusion-rule-strengthens" rel="noopener noreferrer"&gt;https://www.bis.gov/press-release/department-commerce-announces-rescission-biden-era-artificial-intelligence-diffusion-rule-strengthens&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Wikipedia, "Export of cryptography from the United States" (History of 1990s encryption export controls and the PGP case), &lt;a href="https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/intelligence-as-strategic-resource" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/intelligence-as-strategic-resource&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llms</category>
      <category>geopolitics</category>
      <category>opensource</category>
    </item>
    <item>
      <title>When the Harness Is a Mess, Restart</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 12 Jun 2026 08:00:58 +0000</pubDate>
      <link>https://dev.to/skyguan92/when-the-harness-is-a-mess-restart-44oc</link>
      <guid>https://dev.to/skyguan92/when-the-harness-is-a-mess-restart-44oc</guid>
      <description>&lt;p&gt;I mentioned earlier that I'd been using an agent to build an inference engine from scratch. By the third week, I realized something: this project was busted.&lt;/p&gt;

&lt;p&gt;Not the "won't compile" kind of broken. Documentation was exploding. Experimental scripts from seven or eight directions were crammed into one repo, benchmark results scattered across a dozen markdown files. Every time the model started work, it spent half a day just figuring out where it had left off. I watched it flit between directions, scratching the surface of each before moving on. Give it more guidance and it obediently followed along; give it less and it just spun in place.&lt;/p&gt;

&lt;p&gt;That wasn't its usual level. Two days earlier, it had spent twelve hours connecting a 78-layer pipeline from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lost in the Chaos
&lt;/h2&gt;

&lt;p&gt;I kept running into this same situation, and the conclusion was always the same: once the context turns into chaos, the model can't find its way out. I almost never saw it climb out on its own. Every time, I had to tear the whole thing down.&lt;/p&gt;

&lt;p&gt;This isn't gut feel; someone measured it. Microsoft and Salesforce simulated over 200,000 multi-turn conversations last year, testing 15 models. The conclusion was right there in the abstract: once an LLM takes a wrong step in conversation, it gets lost and won't recover. Average performance dropped 39%. The breakdown is more interesting: capability only dropped 16%, but unreliability shot up 112%. The model didn't get much dumber; it became wildly unstable. On the same question, sometimes it answered great, sometimes terribly. That matches exactly what I saw: it's not that it couldn't do it, it just couldn't perform.&lt;/p&gt;

&lt;p&gt;Chroma's Context Rot report landed another blow: even for a task as simple as fishing one sentence out of a pile of text, the longer the input, the more all 18 models' performance dropped. And the drop wasn't uniform. Context is a finite resource. Anthropic calls it the attention budget. Every extra token you stuff in burns part of it.&lt;/p&gt;

&lt;p&gt;The messier your repo, the faster this budget burns.&lt;/p&gt;

&lt;h2&gt;
  
  
  After the Restart
&lt;/h2&gt;

&lt;p&gt;The solution was always the same: stop, figure out what the goal actually is, open a new repo with fresh context, write down this one goal and the execution path clearly, and start over. Note: "this one." After a restart, you can basically only lock onto one direction. Running multiple lines in parallel is a luxury at this point.&lt;/p&gt;

&lt;p&gt;The change is staggering. The same model, spinning in a mess the day before, acts like a different species after the restart: digging deep, actively hunting for improvements, pushing forward with efficiency that isn't even on the same scale. My rough estimate is a three- to fivefold difference, and it's probably not even linear. In a chaotic environment it might never reach the goal. In a clean environment it's genuinely converging on the target.&lt;/p&gt;

&lt;p&gt;The Microsoft paper's advice for users boils down to two things: if you have time, restart the session; before you do, have the model consolidate what it knows and carry it over. They tested it: merging information scattered across multiple turns into a single turn and re-feeding it restores performance to 95% of a single-turn session. Anthropic does the same thing with its multi-agent research system: when context is almost full, spin up a new agent with clean context and hand off properly.&lt;/p&gt;
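&lt;p&gt;The consolidate-then-restart idea can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's procedure: it assumes a session is a list of (role, text) pairs and merges the user's scattered instructions mechanically, whereas in practice you would ask the model to write the summary. The function names and sample turns are hypothetical.&lt;/p&gt;

```python
def consolidate(turns):
    """Merge user instructions scattered across turns into one brief.

    Keeps only user turns, drops exact duplicates, preserves order.
    """
    seen = set()
    points = []
    for role, text in turns:
        text = text.strip()
        if role == "user" and text and text not in seen:
            seen.add(text)
            points.append(text)
    bullets = "\n".join(f"- {p}" for p in points)
    return "Goal and constraints so far:\n" + bullets


def restart(turns):
    """Start a fresh history whose single first turn is the merged brief."""
    return [("user", consolidate(turns))]


# Hypothetical messy session.
old_session = [
    ("user", "Build a decode benchmark."),
    ("assistant", "Which metric should I report?"),
    ("user", "Report tokens per second only."),
    ("assistant", "Working on it."),
    ("user", "Report tokens per second only."),
]

fresh_session = restart(old_session)
print(fresh_session[0][1])
```

&lt;p&gt;The new session starts with one turn that states everything known so far, instead of replaying the back-and-forth that produced it.&lt;/p&gt;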

&lt;p&gt;The industry calls this compaction, or context engineering, or whatever. The name doesn't matter. It all comes down to one thing: whatever environment you give the model, that's the performance you get.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rewriting Is Suicide, Restarting Is Not
&lt;/h2&gt;

&lt;p&gt;There's an old iron law in software engineering, written by Joel Spolsky over twenty years ago: never rewrite from scratch. Netscape decided to rewrite its browser, went three years without a major release, and handed the market to IE. This iron law ruled the industry for two decades.&lt;/p&gt;

&lt;p&gt;But it rests on a premise: rebuilding costs are extremely high. A team spends three years rewriting, and competitors don't stop and wait.&lt;/p&gt;

&lt;p&gt;Agents eliminate rebuilding costs. New repo, new context, re-clarifying the goal. Half a day's work. Take the conclusions and pitfalls from before, organize them into documents, and carry them over. When rebuilding drops from three years to half a day, the iron law flips: patching a rotten context is far more expensive than restarting.&lt;/p&gt;

&lt;p&gt;So the chaos of exploration that looks wasted isn't actually wasted. The clear goal at restart grows directly out of that first round of random bashing: which assumptions held, which directions closed off. Without that chaos, you couldn't write that clarity. Code can be thrown away; cognition is carried forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Weight of the Harness
&lt;/h2&gt;

&lt;p&gt;The word "harness" has burned its way from the evaluation community to the engineering community over the past two years. METR wrote scaffolding into its evaluation methodology back in 2023; this February OpenAI published a piece on Harness engineering, redefining an engineer's job as "designing the environment, expressing intent, and building feedback loops"; in April, Martin Fowler's site gave the most concise definition: the harness is everything in an agent except the model itself.&lt;/p&gt;

&lt;p&gt;How heavy is that weight? The Terminal-Bench 2.0 leaderboard spells it out: the same Opus 4.6, with different harnesses, scores anywhere from 58 to 76. That's a gap of 18 percentage points. Under the same harness, swapping GPT-5 for GPT-5.2 raises the score by 19 percentage points, about the same amount. The gap between a good and a bad harness is worth a full model generation.&lt;/p&gt;

&lt;p&gt;Addy Osmani put it crudely but accurately: an average model with a good harness can beat a top-tier model with a bad harness.&lt;/p&gt;

&lt;p&gt;My own version is even cruder: the environment you give the model is its ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human Value Is Rising
&lt;/h2&gt;

&lt;p&gt;There's another feeling that's been growing stronger these past few months: humans are becoming more valuable.&lt;/p&gt;

&lt;p&gt;Deciding whether to tear it all down and restart is on the human. Converging from a pile of messy exploration results to "what is the real problem to solve" is also on the human. Too many times these past few months I've watched an agent fail to find its way out of a fuzzy, complex problem. There's a line in that OpenAI blog post I strongly agree with: Humans steer. Agents execute.&lt;/p&gt;

&lt;p&gt;So an agent isn't a wishing machine; it's a very strong employee. If you lead well, it can produce outstanding results; if you toss the work and walk away, it probably won't produce much. Models keep getting stronger; Fable 5's engineering capability took another step up, and every generation pushes the boundary you can cross further out. But the "leading" part: no one can do that for you yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Words
&lt;/h2&gt;

&lt;p&gt;When it's a mess, restart. These five words are the most valuable lesson I've accumulated from using agents on complex projects these past few months.&lt;/p&gt;

&lt;p&gt;Models will keep getting stronger. But the environment is yours to provide, and the decision to tear it all down is yours to make.&lt;/p&gt;

&lt;p&gt;The cheaper intelligence becomes, the more valuable clarity is.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Laban et al., "LLMs Get Lost In Multi-Turn Conversation", arXiv:2505.06120, 2025-05, &lt;a href="https://arxiv.org/abs/2505.06120" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2505.06120&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance", 2025-07-14, &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;https://research.trychroma.com/context-rot&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic, "Effective context engineering for AI agents", 2025-09-29, &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic, "How we built our multi-agent research system", 2025-06-13, &lt;a href="https://www.anthropic.com/engineering/multi-agent-research-system" rel="noopener noreferrer"&gt;https://www.anthropic.com/engineering/multi-agent-research-system&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Terminal-Bench 2.0 Leaderboard, &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0" rel="noopener noreferrer"&gt;https://www.tbench.ai/leaderboard/terminal-bench/2.0&lt;/a&gt; (retrieved 2026-06-12)&lt;/li&gt;
&lt;li&gt;OpenAI, "Harness engineering: leveraging Codex in an agent-first world", 2026-02, &lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;https://openai.com/index/harness-engineering/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Birgitta Böckeler, "Harness engineering for coding agent users", martinfowler.com, 2026-04-02, &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;https://martinfowler.com/articles/harness-engineering.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Addy Osmani, "Agent Harness Engineering", O'Reilly Radar, 2026-05-15, &lt;a href="https://www.oreilly.com/radar/agent-harness-engineering/" rel="noopener noreferrer"&gt;https://www.oreilly.com/radar/agent-harness-engineering/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Joel Spolsky, "Things You Should Never Do, Part I", 2000-04-06, &lt;a href="https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/" rel="noopener noreferrer"&gt;https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;METR, "Evaluating Language-Model Agents on Realistic Autonomous Tasks", 2023-08, &lt;a href="https://metr.org/blog/2023-08-01-new-report/" rel="noopener noreferrer"&gt;https://metr.org/blog/2023-08-01-new-report/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original article: &lt;a href="https://guanjiawei.ai/en/blog/restart-when-lost" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/restart-when-lost&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>contextengineering</category>
      <category>thinking</category>
    </item>
    <item>
      <title>Intelligence Is Starting to Be About Wealth</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Thu, 11 Jun 2026 09:25:32 +0000</pubDate>
      <link>https://dev.to/skyguan92/intelligence-is-starting-to-be-about-wealth-55fp</link>
      <guid>https://dev.to/skyguan92/intelligence-is-starting-to-be-about-wealth-55fp</guid>
      <description>&lt;p&gt;The day after Fable 5 dropped, I threw it at a research project that had been stuck for two weeks.&lt;/p&gt;

&lt;p&gt;Inference engine performance optimization. A 78-layer model running on hardware, baseline 10 tokens/s, target 100+ tokens/s. GPT 5.5 had already sunk a lot of hours into it without much luck.&lt;/p&gt;

&lt;p&gt;Fable 5 ran for 15 hours. API bill: $420. Performance went from 10 to 13 tokens/s, a 30% boost. Still far from 100, but it moved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Approaches
&lt;/h2&gt;

&lt;p&gt;I had GPT 5.5 working the same project in another worktree, instructions basically identical. The two sides tackled it very differently.&lt;/p&gt;

&lt;p&gt;GPT 5.5 started by scaffolding with just two layers. Out of 78 total, it first got the pipeline running on two, then dug in, running hundreds of rounds on a single point, squeezing every drop of local speed before scaling up. Steady. Slow. Following the roadmap one step at a time.&lt;/p&gt;

&lt;p&gt;Fable 5 did the exact opposite. It spent an hour or two feeling out the single-layer ceiling, then stitched all 78 layers together and went straight for an end-to-end result.&lt;/p&gt;

&lt;p&gt;Its engineering was on another level. It built a runnable inference engine from scratch in just over ten hours, a level of completeness previous models couldn't match. But 30% is nowhere near the target, and burning a few hundred dollars a pop isn't sustainable. I ended up switching back to GPT 5.5.&lt;/p&gt;

&lt;p&gt;But that's not the point of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Order-of-Magnitude Gap Between Subscription and API
&lt;/h2&gt;

&lt;p&gt;At launch, Fable 5 was put into Pro and Max subscriptions, counted against quota at twice the Opus 4.8 rate, for a two-week window. After June 22 it was pulled from subscriptions. To keep using it you had to buy usage credits at API rates.&lt;/p&gt;

&lt;p&gt;API pricing: $10 input, $50 output, per million tokens.&lt;/p&gt;

&lt;p&gt;Max 20x runs $200 a month. Before, one Max 20x seat was enough to run Opus 4.8 on complex tasks. Fable 5 at 2x burns through quota faster, but you could still pick which tasks to spend it on. On API, the math changes entirely. A three-hour subtask costs $150; the full project ran $420. That's more than two months of Max fees.&lt;/p&gt;
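
&lt;p&gt;To make the arithmetic concrete, here is a rough sketch of the cost calculation at the quoted rates. The token split below is a made-up assumption chosen to land on the $420 total; the actual input and output counts for the run are not known.&lt;/p&gt;

```python
# Prices quoted in the post, in dollars per million tokens.
INPUT_PRICE = 10.0
OUTPUT_PRICE = 50.0
MAX_20X_MONTHLY = 200.0


def api_cost(input_tokens, output_tokens):
    """Dollar cost of a run at the quoted per-million-token API rates."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


# Hypothetical token mix for illustration only: the post reports the
# $420 total but not how it split between input and output.
example = api_cost(input_tokens=30_000_000, output_tokens=2_400_000)
print(example)                    # 420.0 dollars
print(example / MAX_20X_MONTHLY)  # 2.1 months of Max 20x
```

&lt;p&gt;Agentic runs tend to be input-heavy because the context is re-sent on every step, which is why a long session adds up quickly even at the lower input price.&lt;/p&gt;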

&lt;p&gt;From Opus 4.5 to 4.6 to 4.8, every upgrade just tweaked the subscription quota ratios. The best model was always included. $200 a month got you access. Now it's not a ratio tweak. It's a straight removal.&lt;/p&gt;

&lt;p&gt;Most people can't afford to burn hundreds on a single task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Distillation
&lt;/h2&gt;

&lt;p&gt;Anti-distillation is where things get really twisted.&lt;/p&gt;

&lt;p&gt;I get not wanting to be distilled. Fable 5 has a built-in two-stage classifier: requests flagged as distillation, cybersecurity, or biochemistry automatically fall back to Opus 4.8. Anthropic says the trigger rate is under 5%.&lt;/p&gt;

&lt;p&gt;In reality? People on Reddit and Twitter are saying a routine bloodwork interpretation got flagged as biochemistry, or normal questions got silently downgraded. Distillation is just intensive model use. Once the model is publicly available, you can't fully stop it. Using an opaque classifier to guess intent? Getting it wrong is only a matter of time.&lt;/p&gt;

&lt;p&gt;The worst part is that the downgrade is silent. Answer quality tanks out of nowhere, and you think that's just the model, but you're actually getting Opus 4.8. You're paying Fable 5 prices for Opus 4.8 output, and they don't even tell you. I find this more disgusting than the pricing issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Other Side
&lt;/h2&gt;

&lt;p&gt;OpenAI has been doing something different lately. In March they launched Codex for Open Source: six months of free Pro for maintainers of GitHub projects with 1000+ stars. It goes to individuals writing code in the community, not enterprises.&lt;/p&gt;

&lt;p&gt;The scale isn't massive. But the direction is toward letting more people use the good stuff, not locking it away.&lt;/p&gt;

&lt;p&gt;The two are going head-to-head; who's right or wrong is anyone's guess. But one pulls the best model from subscriptions and silently downgrades users, while the other hands out free accounts to open-source contributors. You tell me. Do those two signals look the same to you?&lt;/p&gt;

&lt;h2&gt;
  
  
  The World Is at a Crossroads
&lt;/h2&gt;

&lt;p&gt;Anthropic has elitism in its bones. AI constitution, moral charter—the intentions aren't bad, but who gets to define "responsible"? It's always that tiny group, deciding from the top down. The business strategy follows the same path: those with budgets get the best, those without get second tier, with a tenfold gap between.&lt;/p&gt;

&lt;p&gt;You can charge a premium for the strongest model. But keeping it in subscriptions lets individuals and small teams allocate their own quotas. Pulling it out and putting it on API changes the signal completely.&lt;/p&gt;

&lt;p&gt;A model company's pricing is its stance on how intelligence should be distributed. This fracture is bigger than Anthropic realizes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Anthropic, "Introducing Claude Fable 5 and Mythos 5", 2026-06-09&lt;/li&gt;
&lt;li&gt;Claude official pricing: Fable 5 input $10/M token, output $50/M token; GPT-5.5 input $5/M token, output $30/M token&lt;/li&gt;
&lt;li&gt;Fable 5 subscription availability: June 9 to 22, after which usage credits at API rates are required&lt;/li&gt;
&lt;li&gt;Fable 5 anti-distillation mechanism: two-stage classifier detection, falls back to Opus 4.8 when triggered, official trigger rate claimed below 5%&lt;/li&gt;
&lt;li&gt;Fortune, "Anthropic accused of secretly limiting Claude Fable 5 capabilities", 2026-06-10&lt;/li&gt;
&lt;li&gt;OpenAI, "Codex for Open Source" program, 2026-03-07, providing 6 months of free Pro accounts to maintainers of 1000+ stars projects&lt;/li&gt;
&lt;li&gt;SWE-bench Pro benchmark: Fable 5 scored 80%, GPT-5.5 scored 58.6%&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original article: &lt;a href="https://guanjiawei.ai/en/blog/intelligence-wealth-divide" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/intelligence-wealth-divide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>anthropic</category>
      <category>models</category>
      <category>reflections</category>
    </item>
    <item>
      <title>Models Keep Getting Stronger, but 'Strongest' Has No Single Answer</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Wed, 03 Jun 2026 10:09:05 +0000</pubDate>
      <link>https://dev.to/skyguan92/models-keep-getting-stronger-but-strongest-has-no-single-answer-2ec0</link>
      <guid>https://dev.to/skyguan92/models-keep-getting-stronger-but-strongest-has-no-single-answer-2ec0</guid>
      <description>&lt;p&gt;June is shaping up to be another packed month for model releases. Opus 4.8 dropped at the end of May, MiniMax's M3 landed a couple of days ago, GPT 5.6 is supposedly around the corner, and some are already waiting on DeepSeek's next drop. It looks like we'll see a new model every few days. Pretty lively.&lt;/p&gt;

&lt;p&gt;But for the past two days, what I've actually been thinking about is a friend's experience with models.&lt;/p&gt;

&lt;p&gt;He started out using models to build small things—writing web pages, making little tools and plugins. He was pretty excited at first, telling me how amazing today's models are. He more or less picked a decent domestic model at random and found it more than enough. He couldn't even imagine where else models could get stronger; they already worked so well.&lt;/p&gt;

&lt;p&gt;Then his work got more complex. He moved from small tools to trying to build an auto-editing tool, video cropping and the like. That's when the problems started.&lt;/p&gt;

&lt;p&gt;The model told him it was done. He said okay, tried it, no dice. A moment later it said this time it was really done. He tried again, still no good. Back and forth for several rounds.&lt;/p&gt;

&lt;p&gt;He couldn't tell anymore: part of him felt he was getting better at working with the model, that he needed to give more guidance and try different approaches; part of him started wondering if the model itself just wasn't cutting it, whether he should switch to something like Claude Opus.&lt;/p&gt;

&lt;p&gt;This pattern is all too common. Behind it is something a lot of people haven't caught on to yet: model strength is forking in different directions. A single score no longer tells the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scores Bunch at the Ceiling, Real-World Feel Diverges
&lt;/h2&gt;

&lt;p&gt;First, the weird state of things: on mainstream benchmarks, top models score terrifyingly high, all squeezed into a narrow band.&lt;/p&gt;

&lt;p&gt;Take GPQA. It's graduate-level, so hard that the PhD experts they brought in only scored around 65%. Yet top models now routinely hit 92% to 94%, bunched together. Older benchmarks like MMLU were surpassed by pretty much everyone long ago, all scoring over 90%. Hard problems aren't hard anymore; scores have hit the ceiling, and you can't tell models apart.&lt;/p&gt;

&lt;p&gt;So benchmark makers have to keep inventing harder tests. The new Humanity's Last Exam states it plainly: it was created because models had exceeded 90% on MMLU and the old questions weren't enough anymore. One study looked at sixty mainstream benchmarks and found that nearly half are already highly saturated—top models are "statistically indistinguishable" on them.&lt;/p&gt;

&lt;p&gt;But when you actually use them, the difference in feel is absurd. I &lt;a href="https://guanjiawei.ai/blog/stay-on-the-table" rel="noopener noreferrer"&gt;wrote in my last post&lt;/a&gt; how Opus 4.8 kept letting me down on engineering and research tasks—work that later all moved to GPT 5.5. By the scores, the two are close; in practice, worlds apart.&lt;/p&gt;

&lt;p&gt;The ARC-AGI suite is a perfect example. On the old version, top models had already saturated at 96%. Switch to the harder ARC-AGI-2, and the same models immediately show their true colors: GPT 5.5 still manages 85%, while Opus 4.8 drops to barely over 70%. Switch again to ARC-AGI-3, which requires actual interaction and exploration, and almost everyone flatlines to zero.&lt;/p&gt;

&lt;p&gt;So benchmarks are still useful. It's just that "making tests humans can define and grade" is becoming less and less able to distinguish models. To understand why, you have to look at training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving the Hardest Problems vs. Reliably Doing Messy Work
&lt;/h2&gt;

&lt;p&gt;The main technique making models stronger right now is called "verifiable rewards." In short: pick hard problems with standard answers that machines can automatically grade, and use them for reinforcement learning. Math and code are the classic examples. Right answer gets points, wrong gets zero, rinse and repeat.&lt;/p&gt;

&lt;p&gt;The DeepSeek-R1 paper puts it clearly: math problems are verified with rules, code is thrown straight into compilers to run test cases. They specifically note they avoided neural-network-based reward models, because those are too easy for models to game. OpenAI's o series follows the same playbook. It's highly effective; this is exactly how models learned to solve hard problems.&lt;/p&gt;
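
&lt;p&gt;A toy sketch of what such rule-based rewards can look like. The function names and the all-or-nothing scoring are assumptions for illustration, not DeepSeek's or OpenAI's actual reward code.&lt;/p&gt;

```python
def math_reward(model_answer, reference_answer):
    """Rule-based check: 1.0 for an exact match after normalization, else 0.0."""

    def normalize(text):
        return text.strip().replace(" ", "").lower()

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


def code_reward(candidate_source, func_name, test_cases):
    """Run the candidate against test cases; all must pass to earn the point.

    Illustration only: exec on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and crashes all score zero.
        return 0.0
    return 1.0 if passed else 0.0


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0)]

print(math_reward(" 42 ", "42"))        # 1.0
print(code_reward(good, "add", cases))  # 1.0
print(code_reward(bad, "add", cases))   # 0.0
```

&lt;p&gt;Both functions need a reference answer or a test suite to exist. A task with fuzzy intent offers neither, which is where this style of reward stops applying.&lt;/p&gt;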

&lt;p&gt;But it has one characteristic: what it excels at is taking the "hardest problems humans can define and grade" and grinding them out. Taking a fuzzy, not-that-hard but very real task and getting it done reliably in one go is an entirely different capability.&lt;/p&gt;

&lt;p&gt;My friend's editing tool is the latter. The task isn't extremely difficult, but the intent is fuzzy. The model has to break it down on its own and get it done cleanly in one go. A model that can solve Olympiad problems may not handle this kind of messy work cleanly in one shot. It might go in circles, need three rounds of back-and-forth, and finally say "I'm done" when it isn't. Conversely, a model that's great at messy work might completely choke when you throw a really hard problem at it.&lt;/p&gt;

&lt;p&gt;These are two directions of capability, each going its own way. You can't rank them on a single line.&lt;/p&gt;

&lt;p&gt;The trouble is, ninety percent of people need the latter in daily life. They want to take a poorly specified task and get it done reliably, without hassle. Yet the score we use to rank models measures almost entirely the former. It's completely normal for "highest score" and "most useful for me" to not line up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Another Dimension: Exploration
&lt;/h2&gt;

&lt;p&gt;The first two types are still in the world of standard answers: either solve a gradable hard problem, or finish a verifiable task. The truly difficult one is the third.&lt;/p&gt;

&lt;p&gt;When my friend got stuck, I thought of another class of problems. Like driving toward an intersection with a traffic light ahead. Do you go straight, or weave through the middle? There's no standard answer; you have to find your own direction in the ambiguity. Exploring a domain people haven't clearly defined, or don't even know the answer to, is a completely different capability.&lt;/p&gt;

&lt;p&gt;This capability, benchmarks can't measure at all. The whole premise of evaluation is having a standard answer, something gradable right or wrong. But exploration has no right or wrong at all, only efficiency. Can you fish out something new in the ambiguity, use it to move forward, and push out a boundary that didn't exist before?&lt;/p&gt;

&lt;p&gt;It's also precisely the blind spot of the "verifiable rewards" approach. Research has already pointed out that open-ended tasks without unique answers have no clear standard answer to begin with, so you can't even construct rewards. This method can't get traction. Some have even found that this training approach doesn't necessarily give models new capabilities, and may instead narrow their exploration surface, with capability ceilings hard-capped by the base model.&lt;/p&gt;

&lt;p&gt;The result is that a model great at exploration, thrown into a cage with clear standard answers, might seem a bit stupid. A model that tests incredibly well may not possess exploration ability at all. In my own experience, GPT and Claude show the clearest difference on this dimension.&lt;/p&gt;

&lt;p&gt;And this dimension happens to be the most important. Because truly valuable things often start without standard answers. Yet it's the hardest to measure, and the hardest to train.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chat Era Already Ran This Course
&lt;/h2&gt;

&lt;p&gt;Model capabilities forking along dimensions and layering down isn't new. The chatbot era ran the full course.&lt;/p&gt;

&lt;p&gt;Back then, everyone also thought for a while that the biggest model was the strongest. But they quickly discovered that for chatting specifically, the biggest model wasn't much better. In 2023, the LMSYS Chatbot Arena leaderboard dedicated a section to "Smaller Models Are Competitive": a 13B Vicuna ranked in the top five, its Elo score even beating Google's PaLM 2. 7B models also squeezed into the top ten, trading blows with models twice their size.&lt;/p&gt;

&lt;p&gt;Later studies echoed this: scaling models from tens of millions to hundreds of billions of parameters, all the way up to GPT-4 class, showed gains topping out quickly on softer tasks. Models with a few tens of billions of parameters weren't far from frontier models.&lt;/p&gt;

&lt;p&gt;In other words: for chatting, for emotional support, the marginal returns to scale are low. A few tens of billions is enough; scaling up to hundreds of billions is pure waste.&lt;/p&gt;

&lt;p&gt;So the market sorted itself out. If you want emotional value, someone to chat with, a smaller model that sounds human is enough. Only when you need serious research or hardcore engineering do top-tier models come into play. Models sorted themselves into different price-performance tiers by use case.&lt;/p&gt;

&lt;p&gt;Today's round is the same plot, replaying at a higher capability level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing: Don't Ask Which Is Strongest—Ask Which Dimension You Need
&lt;/h2&gt;

&lt;p&gt;Back to my friend's dilemma: "should I switch to a stronger model?" He's asking the wrong question.&lt;/p&gt;

&lt;p&gt;There is no "stronger" that simultaneously covers solving hard problems, doing messy work, and exploring. These three things are diverging onto different models.&lt;/p&gt;

&lt;p&gt;New-generation models are still pushing forward, of course. But the progress they fight for increasingly lands on "the hardest problems humans can define and grade," which is exactly where most people can't perceive it. So you see a split: leaderboards keep getting stronger generation after generation, yet most people just feel "it's been good enough for a while, can't see where it's stronger." Neither side is wrong. Because what they want are fundamentally different dimensions of capability.&lt;/p&gt;

&lt;p&gt;So stop vaguely asking "which model is strongest." First ask clearly: which dimension of work do you need it to do? Solve a hard problem with an answer, finish a messy task that wasn't clearly specified, or join you in something no one knows the answer to yet.&lt;/p&gt;

&lt;p&gt;"Strongest" is becoming a question without a standard answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model Releases and Timeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Claude Opus 4.8 Release (2026-05-28) — Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marktechpost.com/2026/06/01/minimax-releases-minimax-m3-with-msa-architecture-supporting-1m-token-context-native-multimodality-and-agentic-coding/" rel="noopener noreferrer"&gt;MiniMax M3 Release Coverage (2026-06-01) — MarkTechPost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI GPT-5.5 (Current Official Version; GPT-5.6 Not Yet Released)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;DeepSeek V4 (2026-04-24 Preview) — API Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Saturation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.12022" rel="noopener noreferrer"&gt;GPQA: A Graduate-Level Google-Proof Q&amp;amp;A Benchmark (PhD Expert Baseline ~65%) — arXiv 2311.12022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lastexam.ai/" rel="noopener noreferrer"&gt;Humanity's Last Exam (Official Motivation for Harder Benchmarks: Models Already Exceed 90% on MMLU, etc.)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2602.16763" rel="noopener noreferrer"&gt;Large Model Benchmark Saturation Study: Nearly Half of 60 Benchmarks Highly Saturated, Top Models Statistically Indistinguishable — arXiv 2602.16763&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcprize.org/leaderboard" rel="noopener noreferrer"&gt;ARC-AGI Official Leaderboard (ARC-AGI-1 Saturated, Same Batch of Models Drops Sharply on ARC-AGI-2) — ARC Prize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcprize.org/arc-agi/2/" rel="noopener noreferrer"&gt;ARC-AGI-2 Design Notes — ARC Prize&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verifiable Rewards, and Its Limits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2411.15124" rel="noopener noreferrer"&gt;Tülu 3: RLVR (Reinforcement Learning with Verifiable Rewards) Proposal and Definition — arXiv 2411.15124&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;DeepSeek-R1: Rule-Based Rewards, Compiler Test Cases, Deliberately Avoiding Neural Reward Models (Due to Reward Hacking Concerns) — arXiv 2501.12948&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/learning-to-reason-with-llms/" rel="noopener noreferrer"&gt;OpenAI o1: Large-Scale Reinforcement Learning for Reasoning — OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2506.00103" rel="noopener noreferrer"&gt;Writing-Zero: Open-Ended, Subjective Tasks Lack Clear Ground Truth, Making Reward Construction Difficult — arXiv 2506.00103&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2601.18533" rel="noopener noreferrer"&gt;Open-Ended Generation "Lacks Clear Standard Answers," Making RLVR Hard to Extend — arXiv 2601.18533&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2504.13837" rel="noopener noreferrer"&gt;This Type of Reinforcement Learning May Narrow the Exploration Surface, with Capability Ceiling Capped by the Base Model — arXiv 2504.13837&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Small Models in the Chat Era&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.lmsys.org/blog/2023-05-25-leaderboard/" rel="noopener noreferrer"&gt;LMSYS Chatbot Arena Leaderboard (2023-05, "Smaller Models Are Competitive": 13B Vicuna Enters Top Five, Elo Beats PaLM 2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pnas.org/doi/10.1073/pnas.2413443122" rel="noopener noreferrer"&gt;PNAS: Diminishing Marginal Returns of Model Scale on Single-Turn Persuasion, Quickly Hitting a Ceiling — pnas.org&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original article: &lt;a href="https://guanjiawei.ai/en/blog/strongest-no-single-answer" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/strongest-no-single-answer&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>models</category>
      <category>evaluation</category>
      <category>reinforcementlearning</category>
    </item>
    <item>
      <title>Claude 4.8 Let Me Down, But It’s Not Just Claude’s Problem</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 01 Jun 2026 11:12:47 +0000</pubDate>
      <link>https://dev.to/skyguan92/claude-48-let-me-down-but-its-not-just-claudes-problem-36oc</link>
      <guid>https://dev.to/skyguan92/claude-48-let-me-down-but-its-not-just-claudes-problem-36oc</guid>
      <description>&lt;p&gt;The day Claude launched Opus 4.8, I was pretty excited.&lt;/p&gt;

&lt;p&gt;I'd always thought Opus had solid engineering skills. Macro analysis and intent alignment were its strong suits. A few points in the 4.8 release notes caught my eye, so I tried it out on several complex tasks I had going.&lt;/p&gt;

&lt;p&gt;The results were one disappointment after another. Here are a few examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looks Promising Out of the Gate, Then Goes Off the Rails
&lt;/h2&gt;

&lt;p&gt;Right around then, my Codex credits had run out. Since 4.8 looked solid on paper, I put it on a research task using that flashy feature called ultracode, a dynamic workflow that supposedly auto-orchestrates ultra-long-horizon tasks.&lt;/p&gt;

&lt;p&gt;It started out looking solid. It ran a bunch of checks and got everything set up. Seemed reliable. That's Claude Code in a nutshell: the start always gives you hope.&lt;/p&gt;

&lt;p&gt;Then I let it run for about a full day and night.&lt;/p&gt;

&lt;p&gt;The task was to optimize a performance metric. The baseline was terrible: throughput sat around 0.1 tokens per second. After optimizing for a while, it bumped the number from 0.1 to 0.15 and started celebrating. Look, I improved it by 50%! I did all this work! What a huge achievement! It kept hauling out that basic initial setup to claim credit, writing pages of self-congratulatory fluff.&lt;/p&gt;

&lt;p&gt;The problem is, in the 0.1 to 0.15 range, measuring progress as a percentage is the wrong frame to begin with.&lt;/p&gt;

&lt;p&gt;When performance is that low, the direction itself is wrong. You have to look at absolute values. What does 0.15 tokens/s actually mean? That's how you see how far off it is. Celebrating a "50% improvement" in a setup that fundamentally can't run is like celebrating that you bailed two buckets from a sinking ship.&lt;/p&gt;
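
&lt;p&gt;A quick back-of-the-envelope check shows why the absolute number matters more than the percentage here. The 500-token reply length is an arbitrary assumption for illustration.&lt;/p&gt;

```python
def seconds_for(tokens, tokens_per_second):
    """Wall-clock time to generate `tokens` at a given throughput."""
    return tokens / tokens_per_second


before, after = 0.1, 0.15  # tokens per second, from the run described above
reply = 500                # hypothetical reply length in tokens

relative_gain = (after - before) / before
print(round(relative_gain * 100))         # 50 (percent)
print(round(seconds_for(reply, before)))  # 5000 seconds, about 83 minutes
print(round(seconds_for(reply, after)))   # 3333 seconds, about 56 minutes
```

&lt;p&gt;Either way, a single short reply takes the better part of an hour. The percentage looks like progress; the absolute figure says the setup is unusable.&lt;/p&gt;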

&lt;p&gt;Looking back, the pile of documentation it recorded, those so-called results, pretty much all had to be tossed. The direction was wrong. Freeze everything and start over.&lt;/p&gt;

&lt;p&gt;This wasn't even the most surprising part. I'd always had the impression that Claude's macro capabilities were fine, good at analysis and alignment, but concrete execution, especially on research tasks, tended to go sideways. The real red flag was the engineering task that followed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Pure Engineering Problem, and It Was Clowning Around
&lt;/h2&gt;

&lt;p&gt;It was a cloud service. A probe was reporting errors, and I asked it to diagnose the issue and fix it while it was there. Nothing complicated.&lt;/p&gt;

&lt;p&gt;A couple of things it did left me baffled.&lt;/p&gt;

&lt;p&gt;First, in the middle of analysis it randomly started counting. It had the machine echo a string of numbers. I stared at the screen, maybe a dozen in total. I still have no idea what that was about. Pure token burning.&lt;/p&gt;

&lt;p&gt;The second was even more absurd. The probe's purpose was straightforward: workers send periodic heartbeats returning OK to tell the platform "I'm alive and working." If the heartbeat is abnormal, take the worker offline first. Don't assign it tasks. It discovered the heartbeat had a cost and asked if I wanted to cut that cost. I said sure.&lt;/p&gt;

&lt;p&gt;It changed it to &lt;code&gt;runtime --version&lt;/code&gt;, which returns a version number.&lt;/p&gt;

&lt;p&gt;I laughed out loud. This isn't cutting costs. It's destroying the design intent entirely. A version number only proves the thing is installed. It has nothing to do with whether it can actually work. Effectively, it's saying "it's installed, so go ahead and assign tasks." A model supposedly strong in engineering and alignment pulled this on a problem where the intent was crystal clear.&lt;/p&gt;
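
&lt;p&gt;The difference between the two probes can be sketched in a few lines. The &lt;code&gt;Worker&lt;/code&gt; class and both probe functions are hypothetical stand-ins, not the actual service code.&lt;/p&gt;

```python
class Worker:
    """Hypothetical worker: installed, but its runtime may be wedged."""

    def __init__(self, healthy):
        self.healthy = healthy

    def version(self):
        # Succeeds whenever the binary is installed, healthy or not.
        return "1.2.3"

    def run(self, payload):
        # Stands in for a real unit of work; fails when the runtime is wedged.
        if not self.healthy:
            raise RuntimeError("runtime not responding")
        return payload


def version_probe(worker):
    """The cheap substitute: only proves the runtime is installed."""
    return bool(worker.version())


def heartbeat_probe(worker):
    """The original intent: prove the worker can actually do work."""
    try:
        return worker.run("ping") == "ping"
    except Exception:
        return False


wedged = Worker(healthy=False)
print(version_probe(wedged))    # True: would still be assigned tasks
print(heartbeat_probe(wedged))  # False: taken offline
```

&lt;p&gt;A cheaper heartbeat is a reasonable goal, but whatever replaces it still has to fail when the worker can't do work.&lt;/p&gt;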

&lt;p&gt;I said forget it, let's revert to the old setup.&lt;/p&gt;

&lt;p&gt;During the revert, something else happened. When looking for a fix, it told me "here are three options below," asking me to pick. But it never listed the three options. It just jumped straight to "I recommend option one." I scrolled back several times. The three options didn't exist at all.&lt;/p&gt;

&lt;p&gt;At this point I was pretty certain: this model genuinely can't be counted on for this kind of task.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I Moved All That Work to GPT 5.5
&lt;/h2&gt;

&lt;p&gt;After these incidents, the direct result was simple: for research and engineering tasks, I no longer trust Claude.&lt;/p&gt;

&lt;p&gt;Trust builds slowly and collapses fast. Earlier I thought it was still usable, but the more I used it, the more I realized asking it to do something was probably a waste of time. Not just a matter of burning a few tokens, but having to redo everything in the end. Now I don't even consider it for these tasks. Everything goes to GPT 5.5.&lt;/p&gt;

&lt;p&gt;GPT 5.5 is genuinely strong. Whether coding or research, it's clearly a notch above every other model.&lt;/p&gt;

&lt;p&gt;This shows up most directly in my account setup: I'm now running 7 Codex accounts, maxing out weekly quotas on all of them, and it's still not enough. On the Claude side, I'm down to one account barely hanging on. My accounts there keep getting banned, so I bought a spare just in case. I never seriously used Google's.&lt;/p&gt;

&lt;p&gt;Seven to one. That's more honest than any benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  But This Isn't Just a Claude Problem
&lt;/h2&gt;

&lt;p&gt;At this point I need to walk that back a bit. If this were just bashing Claude, the post wouldn't be worth reading.&lt;/p&gt;

&lt;p&gt;Step back, and you see the top models have always taken turns in the spotlight.&lt;/p&gt;

&lt;p&gt;Six months ago Gemini was riding high. Everyone was talking about how strong Google was. The last two generations have both felt underwhelming. Hardly anyone mentions them now. Then everyone flocked to Claude, thinking it was the strongest. But look at 4.7 to 4.8. It's been a real letdown. This round, OpenAI has clearly gotten back on its feet. GPT 5.5 is ridiculously strong.&lt;/p&gt;

&lt;p&gt;No single company can stay on top forever. This isn't unique to the AI industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chip Companies' Fate, Replayed by Model Makers at Fast-Forward
&lt;/h2&gt;

&lt;p&gt;I was chatting with a friend about chips the other day. He told me how brutal strategy is in that business. Bet wrong on one generation of chips, and you might be finished.&lt;/p&gt;

&lt;p&gt;NVIDIA nearly died. Its first chip, the NV1, bet on forward texture mapping and quadrilaterals, but the industry standard went with triangles (Microsoft's DirectX). The wrong direction meant no one wanted the product, and the company shrank from about 100 people to 40.&lt;/p&gt;

&lt;p&gt;What saved them was Sega. Sega had commissioned NVIDIA to build the graphics chip for its console. Later both sides realized the direction was wrong, but Sega's Shoichiro Irimajiri still converted that roughly $5 million contract payment into an investment in NVIDIA. Jensen Huang later said this gave them "six months to live," just enough to survive until the RIVA 128 turned things around.&lt;/p&gt;

&lt;p&gt;AMD was the same story. The 2011 Bulldozer architecture was a strategic mistake. Single-core performance was awful, and the company was badly wounded. By July 2015, its stock had crashed to $1.62, one step from bankruptcy. In 2016 it licensed x86 to its Chinese joint venture Tianjin Haiguang for roughly $293 million, largely to stop the bleeding. It didn't recover until the Zen architecture arrived in 2017.&lt;/p&gt;

&lt;p&gt;Both are now dominant players. But looking back, no one can guarantee they'll have the last laugh. Chip cycles are long and capital-intensive. One generation every three to five years, and one wrong move means massive pressure, or even getting knocked out entirely.&lt;/p&gt;

&lt;p&gt;Model companies move much faster. It doesn't take a whole generation. Maybe just a few model iterations, a roughly six-month window of consistently missing the mark, and a company can be pushed off the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chinese Model Makers Have Already Completed a Full Rotation
&lt;/h2&gt;

&lt;p&gt;Chinese model companies have already played out this entire cycle.&lt;/p&gt;

&lt;p&gt;The first to break out and gain recognition as a top-tier player was Zhipu. In 2022, its GLM-130B was the only large model from Asia selected for Stanford's HELM evaluation, and ChatGLM was among the first open-source models in China. For a while, it was unrivaled.&lt;/p&gt;

&lt;p&gt;Then it fell behind. By late 2024, its flagship GLM-4-Plus had been overtaken by DeepSeek-V3 and Tongyi Qianwen on public benchmarks like SuperCLUE, dropping out of the top tier. At the time, a lot of people were surprised.&lt;/p&gt;

&lt;p&gt;Then on January 20, 2025, DeepSeek-R1 burst onto the scene. Six days later its app hit #1 on the download charts in the US and 51 other countries. On January 27, it sent NVIDIA's stock tumbling, wiping out $589 billion in market cap in a single day, the largest single-stock single-day loss in US market history. During that stretch, I felt like a bunch of model vendors were on the verge of collapse.&lt;/p&gt;

&lt;p&gt;But Zhipu didn't leave the table. In the second half of 2025, it changed tactics, going fully open-source while narrowing its focus. It released GLM-4.5 and GLM-4.6 back to back, and its reputation clearly recovered. In January 2026, it even listed on the Hong Kong Stock Exchange. From falling behind to bouncing back, the key was staying at the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  As a Side Note: Top Models Are Also Specializing
&lt;/h2&gt;

&lt;p&gt;Beyond taking turns at the top, there's now another dynamic: division of labor.&lt;/p&gt;

&lt;p&gt;For hardcore coding agents and research tasks, OpenAI is in a league of its own, clearly ahead of everyone else. Claude's original strong suit was precisely this area, but after several generations it failed to meet expectations and started regressing. Its strengths have instead shifted to white-collar work: writing, finance and legal, daily tasks, document research, that sort of thing. To be fair, 4.8 is a bit better than 4.7. It sounds more natural, its style moved back toward 4.6, execution is a bit more accurate, and it's actually pretty decent at writing.&lt;/p&gt;

&lt;p&gt;Further toward the fringe you have Doubao and its ilk. Everyone knows it doesn't do serious work, but its emotional value is maxed out and its user base is terrifyingly large. There's a saying that "Claude is the American version of Doubao." I thought it was kind of funny at first, but thinking about it, it points to a different kind of divergence: some models are just good at chatting with you and giving you emotional value. That's also a category of demand.&lt;/p&gt;

&lt;p&gt;So "which one is the strongest" isn't even the question anymore. It depends on what job you need done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts: Don't Bet on Who's Strongest—Bet on Who's Still at the Table
&lt;/h2&gt;

&lt;p&gt;Coming full circle, my conclusion is actually pretty optimistic.&lt;/p&gt;

&lt;p&gt;The lead changes hands. Don't fixate on "whoever's strong now will stay strong," and don't write a company off just because it's temporarily stumbling. As long as you're still at the table, you have a chance to turn things around. Zhipu turned it around. NVIDIA and AMD both turned it around back in the day.&lt;/p&gt;

&lt;p&gt;The real danger is leaving the table.&lt;/p&gt;

&lt;p&gt;Anthropic has indeed been stuck these past few generations. My use of Claude is now basically limited to white-collar tasks. It's reportedly cooking up its next-gen flagship. If that comes out at the same level, things will really get dicey. It's not that any single model is bad, but in this six-month-cycle rhythm, missing the mark for several generations in a row is exactly how you get squeezed off the table.&lt;/p&gt;

&lt;p&gt;But as long as it's still at the table, I'm not too worried. That's just how it works at the table: everyone takes turns.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.sequoiacap.com/podcast/crucible-moments-nvidia/" rel="noopener noreferrer"&gt;Crucible Moments: Nvidia — Sequoia Capital (Jensen Huang recounts the NV1 wrong bet, Sega's $5 million, and "six months to live")&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.acquired.fm/episodes/jensen-huang" rel="noopener noreferrer"&gt;NVIDIA CEO Jensen Huang — Acquired Podcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tomshardware.com/news/amd-joint-venture-partner-banned-us-trade-war,39703.html" rel="noopener noreferrer"&gt;AMD's Chinese joint venture Tianjin Haiguang (including the $293 million licensing fee and 2019 Entity List) — Tom's Hardware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/AMD%E2%80%93Chinese_joint_venture" rel="noopener noreferrer"&gt;AMD–Chinese joint venture — Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cnn.com/2020/03/27/tech/lisa-su-amd-risk-takers" rel="noopener noreferrer"&gt;How Lisa Su brought AMD back from the brink — CNN Business&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2210.02414" rel="noopener noreferrer"&gt;GLM-130B: An Open Bilingual Pre-trained Model (ICLR 2023, the only Asian model selected for Stanford HELM)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.deeplearning.ai/the-batch/zhipu-ai-z-ai-releases-open-weights-glm-4-5-models-that-perform-comparably-to-the-latest-from-claude-and-deepseek" rel="noopener noreferrer"&gt;Zhipu AI open-sources GLM-4.5, performance comparable to the latest Claude and DeepSeek models — The Batch (DeepLearning.AI)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stcn.com/article/detail/3579827.html" rel="noopener noreferrer"&gt;HK$57.9 billion market cap! Zhipu, the world's first large-model stock, lists in Hong Kong (02513.HK) — Securities Times&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news250120" rel="noopener noreferrer"&gt;DeepSeek-R1 Release (2025/01/20 official release page) — DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bloomberg.com/news/articles/2025-01-27/asml-sinks-as-china-ai-startup-triggers-panic-in-tech-stocks" rel="noopener noreferrer"&gt;Nvidia's $589 Billion DeepSeek Plunge Is Largest in Market History — Bloomberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techcrunch.com/2025/01/27/deepseek-displaces-chatgpt-as-the-app-stores-top-app/" rel="noopener noreferrer"&gt;DeepSeek displaces ChatGPT as the App Store's top app — TechCrunch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/stay-on-the-table" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/stay-on-the-table&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ailabs</category>
      <category>industry</category>
      <category>reflections</category>
    </item>
    <item>
      <title>The Hidden Thread in Token Business: Cost Is Set by KV Cache Hits, Not Throughput</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Thu, 28 May 2026 08:15:50 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-hidden-thread-in-token-business-cost-is-set-by-kv-cache-hits-not-throughput-4neo</link>
      <guid>https://dev.to/skyguan92/the-hidden-thread-in-token-business-cost-is-set-by-kv-cache-hits-not-throughput-4neo</guid>
      <description>&lt;p&gt;The more I study the token business lately, the more I feel there's one angle that keeps getting overlooked.&lt;/p&gt;

&lt;p&gt;Over the past year, when people benchmark inference performance, they mainly watch three numbers: absolute throughput, TTFT, and TPOT. How many requests can be batched, how fast the first token comes out, how fast each output token is. That's the standard talking point today.&lt;/p&gt;

&lt;p&gt;But when you actually get down to serving, you find that what really drives token cost isn't throughput. It's whether the KV cache hits.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. A 10× Gap Carved into the Price List
&lt;/h2&gt;

&lt;p&gt;Several model APIs have cut prices recently. Open any pricing sheet and you'll see the input column was split in two a while back: cache hit and cache miss.&lt;/p&gt;

&lt;p&gt;How big is the gap?&lt;/p&gt;

&lt;p&gt;Anthropic charges 0.1× the base input price for cache reads, making them 10× cheaper, and 1.25× (5-minute version) or 2× (1-hour version) to write a cache. DeepSeek V4 Flash charges $0.0028 per million tokens on a cache hit and $0.14 per million tokens on a miss, a 50× difference. On April 26, DeepSeek cut cache hit prices across all models by another 10×.&lt;/p&gt;

&lt;p&gt;At the machine level, the difference comes down to compute. On a hit, the machine skips prefill for the cached prefix and only has to process the new tokens and run decode. On a miss, it recalculates everything from scratch, burning machine time and compute. The gap isn't a few percentage points. It's multiples: 10× on Anthropic's price sheet, 50× on DeepSeek's.&lt;/p&gt;

&lt;p&gt;The interesting part is this: once cache hit and miss are priced separately, some of the cost is yours to control through design, and the rest is entirely up to the vendor. Split like this, and "which API is cheaper" stops being comparable at the token level. You have to look at actual hit rates.&lt;/p&gt;
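
&lt;p&gt;One way to see why list prices stop being comparable is to blend the two columns by hit rate. The helper below is a rough sketch: &lt;code&gt;effective_input_price&lt;/code&gt; is a name made up here, the hit rates are hypothetical, and cache-write surcharges are ignored.&lt;/p&gt;

```python
def effective_input_price(miss_price, hit_price, hit_rate):
    """Blended price per million input tokens at a given cache hit rate (0.0 to 1.0)."""
    if not (hit_rate >= 0.0 and 1.0 >= hit_rate):
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# A 10x hit/miss gap, normalized so the miss price is 1.0 (hit rates are hypothetical).
well_designed = effective_input_price(1.0, 0.1, 0.90)
careless = effective_input_price(1.0, 0.1, 0.05)
print(round(well_designed, 3), round(careless, 3))

# A 50x gap, using the DeepSeek V4 Flash prices quoted above.
print(round(effective_input_price(0.14, 0.0028, 0.90), 5))
print(round(effective_input_price(0.14, 0.0028, 0.05), 5))
```

&lt;p&gt;With the same price sheet, the careless client pays about 5× more per input token in the first case and about 8× more in the second. The price sheet is identical; only the hit rate moved.&lt;/p&gt;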

&lt;h2&gt;
  
  
  II. The Biggest Pitfall on the User Side: Model Routing
&lt;/h2&gt;

&lt;p&gt;Let's talk about how users can mess this up.&lt;/p&gt;

&lt;p&gt;Lately people have bought hard into "model routing." Hard tasks go to strong models, easy ones to weak models. It looks like savings on paper.&lt;/p&gt;

&lt;p&gt;My view: switching models mid-session is usually a losing bet.&lt;/p&gt;

&lt;p&gt;The clearest example is Claude Code. You've accumulated 300K of context, then midway decide Opus is too expensive for this step and switch to Sonnet. Claude Code now pops a warning that tells you explicitly: after switching, all previous cache is invalidated, and the next step must cold-start and recalculate. It didn't warn before. After enough complaints, they added it.&lt;/p&gt;

&lt;p&gt;Why does it invalidate? Each model's KV representation is different, so cache can't be reused across models. Opus cache and Sonnet cache are two different things. The session hasn't changed, the cache key hasn't moved, but not a penny is saved on the recalculation cost.&lt;/p&gt;

&lt;p&gt;Run the numbers. Current Opus 4.7 is $5/$25 per million tokens (input/output); Sonnet 4.6 is $3/$15. Sonnet is roughly 40% cheaper than Opus, not the 5× gap of the past. But the preceding 300K tokens of input go from a cache hit (0.1× price) to a cold calculation (1× price), which at the same list price is a 10× jump. The 40% you save on the model only partly offsets that. Net it out, and that step's input cost is still over 5× higher.&lt;/p&gt;
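
&lt;p&gt;The same arithmetic as a small script. It only prices the one step where the 300K-token context is re-read, using the list prices and cache multipliers quoted in this post. The helper name &lt;code&gt;input_cost&lt;/code&gt; is made up for the sketch, and whether a cache write is billed on the cold step depends on how caching is configured, so that line is an assumption.&lt;/p&gt;

```python
M = 1_000_000  # prices are USD per million input tokens


def input_cost(tokens, price_per_million, multiplier=1.0):
    """Cost of reading `tokens` input tokens at a list price and a cache multiplier."""
    return tokens / M * price_per_million * multiplier


context = 300_000
stay_on_opus = input_cost(context, 5.0, 0.1)        # cached read at 0.1x
switch_cold = input_cost(context, 3.0, 1.0)         # Sonnet, full cache miss
switch_and_write = input_cost(context, 3.0, 1.25)   # miss plus a 5-minute cache write

print(round(stay_on_opus, 4), round(switch_cold, 4), round(switch_and_write, 4))  # 0.15 0.9 1.125
print(round(switch_cold / stay_on_opus, 2))  # 6.0
```

&lt;p&gt;Staying on Opus costs about $0.15 for that step. Switching to the cheaper model costs about $0.90, or about $1.13 if the new cache write is billed too.&lt;/p&gt;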

&lt;p&gt;Plus in agent workloads, tokens are almost always input-heavy and output-light. Prefill usually runs thousands of tokens per second. Decode runs dozens to just over a hundred. That's two orders of magnitude. The money is basically spent on input. Model routing destroys exactly the savings mechanism on the input side.&lt;/p&gt;

&lt;p&gt;So mainstream agent design today revolves around "context stability." Don't swap models lightly, don't change tool structures, don't touch core prompts like &lt;code&gt;CLAUDE.md&lt;/code&gt; halfway through. One move, and hit rates really do drop from 90% to 5%.&lt;/p&gt;
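
&lt;p&gt;Why one small change wipes out so much can be sketched with chained block hashes, the general idea behind prefix caching. This is a toy model under stated assumptions, not any vendor's implementation: &lt;code&gt;BLOCK&lt;/code&gt;, &lt;code&gt;block_hashes&lt;/code&gt;, and &lt;code&gt;reusable_blocks&lt;/code&gt; are hypothetical names, the block size is tiny for readability, and the model name is folded into the hash to stand for the fact that cached KV is model-specific.&lt;/p&gt;

```python
import hashlib

BLOCK = 4  # tokens per cache block (tiny, for illustration only)


def block_hashes(model, tokens):
    """Chain-hash full blocks; each hash covers the model name and every token before it."""
    hashes = []
    parent = model.encode()
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        parent = hashlib.sha256(parent + b"|" + chunk).digest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached, new):
    """Count leading blocks whose hashes match, i.e. the part of the prefix that can hit."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


tokens = list(range(16))
cached = block_hashes("opus", tokens)

appended = block_hashes("opus", tokens + [99, 100, 101, 102])  # normal next turn
swapped = block_hashes("sonnet", tokens)                       # same text, other model
edited = block_hashes("opus", [7] + tokens[1:])                # early prompt changed

print(reusable_blocks(cached, appended))  # 4
print(reusable_blocks(cached, swapped))   # 0
print(reusable_blocks(cached, edited))    # 0
```

&lt;p&gt;Appending to the end keeps every existing block reusable. Swapping the model or touching the first few tokens changes the first hash, and every hash after it, so nothing downstream can hit.&lt;/p&gt;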

&lt;h2&gt;
  
  
  III. Claude Code's Solution: Spawn Sub-Agents
&lt;/h2&gt;

&lt;p&gt;So what if part of a task really is better suited to a cheaper model?&lt;/p&gt;

&lt;p&gt;Claude Code's solution is to spawn sub-agents.&lt;/p&gt;

&lt;p&gt;The main session stays on Opus, preserving its hit rate. When you need to explore, batch-process, or run a specific sub-task, you call the Task tool to spin up a new agent. The new agent runs in its own isolated context, can pick a cheaper model, and maintains its own hit rate. When it's done, only a summary is passed back to the main session. The main session's cache isn't touched.&lt;/p&gt;

&lt;p&gt;The precondition for this mechanism is that the sub-task's context needs differ enough from the main task's. If your sub-task happens to feed most of the main session's content into it, that's another cold start inside the sub-agent, and prefill eats up whatever you saved. This takes pretty fine-grained judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  IV. Server-Side KV Cache Engineering
&lt;/h2&gt;

&lt;p&gt;How big is the server-side gap? Massive.&lt;/p&gt;

&lt;p&gt;The crudest implementations don't design for cache at all. No reuse across users. Routing goes haywire; the request that should hit the machine holding the cache lands elsewhere. All cache capacity lives in VRAM, with nothing spilling over to DRAM or SSD. In a system like that, no amount of user-side care can save you.&lt;/p&gt;

&lt;p&gt;The mature example is the Mooncake framework from Moonshot AI, Tsinghua University, and Qijing Technology, a KVCache-centric disaggregated architecture. Prefill and decode clusters are separated, and underutilized CPU, DRAM, and SSD resources on GPU nodes are repurposed into a distributed KV cache pool. A KV cache scheduler handles queuing and routing. The paper cites a simulated 525% throughput gain; under real loads, requests handled increased by 75% to 115%.&lt;/p&gt;

&lt;p&gt;The counterexample is Openclaw. This open-source agent took a lot of criticism, mostly because it stumbles at this layer. Its plugin architecture doesn't set a &lt;code&gt;promptCacheKey&lt;/code&gt; by default. Pass it through a third-party proxy and you lose node affinity, so cache hit rate is nearly 0%. The total token volume isn't actually that high, but all input is priced at cache-miss rates, so the bill is ridiculous. About a month ago I looked at its trace: 60K+ input tokens in one request, 0% hit rate, $0.12 a pop. That's what happens when server-side cache has no targeted design.&lt;/p&gt;
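
&lt;p&gt;Node affinity itself is a simple idea: requests carrying the same cache key should keep landing on the machine that already holds their KV cache. Below is a minimal sketch using rendezvous hashing. It is not Mooncake's scheduler or any provider's router, which also weigh load and cache locality; &lt;code&gt;pick_node&lt;/code&gt; and the node names are hypothetical.&lt;/p&gt;

```python
import hashlib


def pick_node(cache_key, nodes):
    """Rendezvous hashing: score every node against the key and take the highest."""
    def score(node):
        return hashlib.sha256(f"{cache_key}|{node}".encode()).hexdigest()
    return max(nodes, key=score)


nodes = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # hypothetical node names
first = pick_node("session-42", nodes)
again = pick_node("session-42", nodes)
print(first == again)  # True
```

&lt;p&gt;Without a stable key, the router has nothing to hash on, and each request may land on a node that has never seen the prefix. That is the 0% hit rate case.&lt;/p&gt;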

&lt;h2&gt;
  
  
  V. The Model Layer Can Also Push in This Direction
&lt;/h2&gt;

&lt;p&gt;Go one layer deeper: the model itself can free up massive room for KV cache.&lt;/p&gt;

&lt;p&gt;DeepSeek has been the most systematic here. MLA (Multi-head Latent Attention) projects KV into a latent vector and back, compressing KV cache volume by over 90%. V3 kept this mechanism. Later they added Native Sparse Attention, which almost flattens KV cache growth in long contexts. Only then can the inference system build a cache pool at million-scale context lengths.&lt;/p&gt;
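
&lt;p&gt;A back-of-the-envelope view of that compression: count how many values each token leaves in the cache per layer. The dimensions below are assumptions chosen to match what the DeepSeek-V2 paper reports, and the baseline is plain multi-head attention; against a grouped-query baseline the saving would be smaller. Function names are made up for the sketch.&lt;/p&gt;

```python
def mha_elements_per_token(n_heads, head_dim):
    """Plain multi-head attention caches one K and one V vector per head."""
    return 2 * n_heads * head_dim


def mla_elements_per_token(latent_dim, rope_dim):
    """MLA caches one compressed latent plus a small decoupled positional key."""
    return latent_dim + rope_dim


# Assumed dimensions (illustrative; check the paper before relying on them).
mha = mha_elements_per_token(n_heads=128, head_dim=128)
mla = mla_elements_per_token(latent_dim=512, rope_dim=64)

print(mha, mla, round(1 - mla / mha, 4))  # 32768 576 0.9824
```

&lt;p&gt;Per layer, each token costs 576 cached values instead of 32,768 under these assumptions. That is the kind of headroom that makes a large shared cache pool practical.&lt;/p&gt;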

&lt;p&gt;But once the model changes, cache hit determination logic changes too. Some inputs previously recognized as "prefix overlap" behave differently under sparse attention, so whether they hit needs realignment. The inference system has to be revised as well. That's why I say inference engineers can't just stare at throughput anymore. They have to redesign cache around the model architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  VI. Measuring This Is Hard
&lt;/h2&gt;

&lt;p&gt;The most frustrating part of the whole chain is that evaluating the server side alone is useless, and evaluating the user side alone is also useless.&lt;/p&gt;

&lt;p&gt;No matter how strong the server-side cache is, if the user side is designed like Openclaw, hit rates still won't rise. No matter how careful the user side is, hit a crude server with chaotic routing, insufficient capacity, and no cross-user reuse, and costs still leak.&lt;/p&gt;

&lt;p&gt;So "which API is cheaper" can't be compared on a single dimension when it comes to tokens. With the same hardware and the same workload, the total bill can differ by 5× to 10× depending on whether both ends coordinate well or each does its own thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;The cache hit/miss split on token pricing tables is the single most important thread in this whole thing. It gives users a clear incentive: high hit rates save you money. At the same time, it pressures providers. If your cache system isn't strong enough, you won't get the business.&lt;/p&gt;

&lt;p&gt;I hope vendors also expose the cache hit mechanism itself. Otherwise users only know it exists without knowing how to optimize for it. There still aren't many vendors that can tie together model, server-side cache, and user-side usage end to end.&lt;/p&gt;

&lt;p&gt;The edge side is still competing on raw performance. But once app density rises and agents really start running, KV cache will become a core issue there too. From cloud to edge, there's no way around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Prompt caching — Claude API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic Pricing — Claude API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/prompt-caching" rel="noopener noreferrer"&gt;How Claude Code uses prompt caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;Claude Code Sub-Agents Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/guides/kv_cache" rel="noopener noreferrer"&gt;Context Caching — DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek API Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.00079" rel="noopener noreferrer"&gt;Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (arXiv:2407.00079)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kvcache-ai/Mooncake" rel="noopener noreferrer"&gt;Mooncake on GitHub (kvcache-ai/Mooncake)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2 paper — introduces MLA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ddhigh.com/en/2026/03/26/fix-opencode-prompt-caching-with-third-party-proxy/" rel="noopener noreferrer"&gt;Fixing OpenCode prompt cache misses with third-party proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/code-yeongyu/oh-my-openagent/issues/1247" rel="noopener noreferrer"&gt;Plugin architecture prevents prompt caching — oh-my-opencode issue&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/token-business-kv-cache" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/token-business-kv-cache&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infra</category>
      <category>kvcache</category>
      <category>inference</category>
    </item>
    <item>
      <title>A Token Is Not a Thing</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 26 May 2026 04:48:36 +0000</pubDate>
      <link>https://dev.to/skyguan92/a-token-is-not-a-thing-5g0l</link>
      <guid>https://dev.to/skyguan92/a-token-is-not-a-thing-5g0l</guid>
      <description>&lt;p&gt;Lately, "token economy" is hot. Every business model in AI will eventually converge on one unit of account: the token. I buy that thesis. But one premise keeps getting skipped—a token is not a standardized commodity.&lt;/p&gt;

&lt;p&gt;Water has standard units. Electricity has standard units. Money, obviously. Tokens don't. They're more like gasoline: 92, 95, and 98 octane are different fuels, priced differently, for different engines. Adding them up by the liter and reporting one number means nothing.&lt;/p&gt;

&lt;p&gt;Most contradictions in AI today come down to this.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Intelligence Has Tiers
&lt;/h2&gt;

&lt;p&gt;Roughly four.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top tier.&lt;/strong&gt; Overseas: OpenAI GPT-5.5, Anthropic Claude Opus 4.7. China: Zhipu GLM-5.1, Moonshot Kimi K2.6, DeepSeek V4-Pro. Xiaomi MiMo-V2.5-Pro is a bit controversial, but usage and data are climbing, so I'll count it. These range from hundreds of billions to over a trillion parameters. Demand is almost unlimited; willingness to pay is fierce. Prices rise, quotas tighten, prices rise again—users keep pouring in. Zhipu's 2025 annual report showed GLM Coding Plan token calls up 15× in six months, with paying developers past 240,000. That's the real demand curve for top-tier tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mid-tier.&lt;/strong&gt; This is the awkward gap. MiniMax M2.7, DeepSeek V4-Flash, Xiaomi MiMo-V2.5 standard—these are about it. Moderate size, an order of magnitude cheaper, theoretically the best value. But almost no one is seriously building here. I'll explain why later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-mid.&lt;/strong&gt; Mostly open source. Alibaba Qwen 3.6 leads, with both 35B-A3B (MoE) and 27B dense versions open. Google Gemma 4 is here too, from E2B to 31B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-device.&lt;/strong&gt; A few billion parameters, or even sub-billion, fitting in a phone or a consumer GPU.&lt;/p&gt;

&lt;p&gt;The first imbalance is right here. Top tier is a bloodbath. Mid-tier is empty. Low-mid and on-device are noisy but lack clear scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  II. Speed Is Another Dimension
&lt;/h2&gt;

&lt;p&gt;Tiers are only half the token story.&lt;/p&gt;

&lt;p&gt;The other half is speed. GPT-5.5 at 30 TPS versus 200 TPS is a completely different experience.&lt;/p&gt;

&lt;p&gt;Here are the 2026 numbers from Artificial Analysis, a commonly cited benchmark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Output TPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;GPT-5.5 (high)&lt;/td&gt;
&lt;td&gt;~68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;~48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;DeepSeek V4-Pro&lt;/td&gt;
&lt;td&gt;~48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;~33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High Speed&lt;/td&gt;
&lt;td&gt;DeepSeek V4-Flash&lt;/td&gt;
&lt;td&gt;~126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High Speed&lt;/td&gt;
&lt;td&gt;Gemini 3.5 Flash&lt;/td&gt;
&lt;td&gt;~203&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra High Speed&lt;/td&gt;
&lt;td&gt;GLM-5.1 High-Speed Edition&lt;/td&gt;
&lt;td&gt;400 (official)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra High Speed&lt;/td&gt;
&lt;td&gt;Cerebras running Kimi K2.6&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I wrote a post earlier, &lt;a href="https://guanjiawei.ai/blog/inference-speed-new-species" rel="noopener noreferrer"&gt;A Model 5× Faster Is No Longer the Same Model&lt;/a&gt;. The argument: a 5× speedup unlocks product forms that literally didn't exist before. This isn't slightly faster. It's a different species.&lt;/p&gt;

&lt;p&gt;The market already prices this. Anthropic Opus Fast: 2.5× speed, 6× price. OpenAI Priority Tier: 2.5× price. Look at those ratios—price rises faster than speed. Not greed. It's a pricing signal. There's a real cohort willing to pay multiples for speed.&lt;/p&gt;
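
&lt;p&gt;Two quick calculations make these numbers tangible. The first turns output TPS from the table above into wait time for one answer; the 2,000-token answer length is an assumption, and time to first token is ignored. The second divides the price multiple by the speed multiple. Both helper names are made up for the sketch.&lt;/p&gt;

```python
def seconds_to_stream(output_tokens, tps):
    """Time to stream an answer at a steady output rate (ignores time to first token)."""
    return output_tokens / tps


def price_per_unit_speed(price_multiple, speed_multiple):
    """How much more you pay per unit of speed compared with the standard tier."""
    return price_multiple / speed_multiple


# Output TPS values taken from the table above; answer length is hypothetical.
tps_by_model = {
    "GPT-5.5 (high)": 68,
    "Gemini 3.5 Flash": 203,
    "Cerebras running Kimi K2.6": 981,
}
for name, tps in tps_by_model.items():
    print(name, round(seconds_to_stream(2000, tps), 1))

print(price_per_unit_speed(6, 2.5))  # 2.4
```

&lt;p&gt;The same answer takes roughly 29 seconds, 10 seconds, or 2 seconds depending on the row. And the fast tier's buyers pay 2.4× more per unit of speed, which is the premium the market is putting on not waiting.&lt;/p&gt;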

&lt;p&gt;Intelligence tier × speed tier. Stack them and you get a matrix. The token in each cell is a different product.&lt;/p&gt;

&lt;h2&gt;
  
  
  III. Two Demand Tracks, Worlds Apart in Willingness to Pay
&lt;/h2&gt;

&lt;p&gt;Who's burning top-tier tokens? Two main tracks.&lt;/p&gt;

&lt;p&gt;First: coding agents. The fastest-growing, highest-burn category worldwide. The surface is a coding agent writing code to solve problems. In practice, people use them for everything. The work just happens to get done through "writing code."&lt;/p&gt;

&lt;p&gt;Second: consumer agents. The Claude app, ChatGPT app, Microsoft Copilot, and Zhipu's new AutoClaw (Claw Plan). AutoClaw launched March 2026 and hit 400,000 subscriptions in 20 days. Under the hood it's a coding agent wrapped in a non-technical shell, letting ordinary people "hire an AI employee."&lt;/p&gt;

&lt;p&gt;The two tracks have very different willingness to pay.&lt;/p&gt;

&lt;p&gt;Coding agent users demand peak intelligence—Opus 4.7, GPT-5.5 tier. Anything less fails. The work is valuable; time saved is valuable. They'll pay for top-tier tokens continuously. Stickiness is another story: when a better model drops, they switch immediately.&lt;/p&gt;

&lt;p&gt;Consumer agent users differ. Their tasks are lower-value, they're price-sensitive, and they don't need absolute peak intelligence. A "mid-tier smarts, good value, acceptable speed" model fits them perfectly. The problem: that tier is empty right now, with no real supply. So DeepSeek V4, with extreme cost-performance, quickly captured this segment. I've noticed many friends around me switching to DeepSeek.&lt;/p&gt;

&lt;p&gt;Demand looks like this, so model companies follow the money. That's why top-tier models keep screaming compute shortages while mid-tier models have no takers.&lt;/p&gt;

&lt;h2&gt;
  
  
  IV. The Supply-Side Mismatch: Scarce Cards and Idle Racks at the Same Time
&lt;/h2&gt;

&lt;p&gt;Demand misalignment carries over to the compute market.&lt;/p&gt;

&lt;p&gt;Top-tier compute shortage is obvious.&lt;/p&gt;

&lt;p&gt;Jensen Huang personally confirmed NVIDIA's Blackwell series (B200/GB200) is "sold out through mid-2026," with new enterprise orders facing 8–16 week lead times. Meta's annual CapEx is expected past $100 billion; Microsoft is spending nearly $35 billion in a single quarter—all scrambling for these chips. In China, the frenzy is over B300 and H200: a B300 server costs ¥7 million and you still can't get one, monthly rent pushed to ¥130,000–200,000. H200 was cleared for sale in China in January 2026; the first 5,000–10,000 module batch was snapped up by top vendors immediately, cluster delivery pushed to Q2 2027. The older H100 has cooled. No one is fighting for it now.&lt;/p&gt;

&lt;p&gt;Domestic top-tier chips are even more extreme. Huawei's latest Ascend 950PR only began mass production in March 2026, yet the full-year plan of 750,000 units was completely locked up: ByteDance (350,000), Alibaba (200,000), Tencent/Baidu (100,000), government and enterprise IT innovation (100,000)—orders pushed to 2027. Roughly $16,000 per chip, 1.56 PFLOPS FP4, officially claimed at 2.87× H20 single-card performance. This is the first time in domestic AI chip history that an entire year's production was bought out. When DeepSeek V4 open-sourced, it shipped day-zero support for eight domestic chips, listing Ascend NPUs alongside NVIDIA GPUs in the technical report. GLM-5 was trained entirely on Ascend + MindSpore, with support for seven domestic chips. This is about positioning: anchoring top models on domestic chips is both a technical and supply problem.&lt;/p&gt;

&lt;p&gt;The hidden side is massive idle capacity in low-to-mid-range compute.&lt;/p&gt;

&lt;p&gt;PPIO founder Yao Xin has said some domestic GPU AI compute centers have idle rates up to 80%. 36Kr reported some centers at only 10–20% utilization. Xinhua put it more bluntly: "General-purpose compute is relatively oversupplied; AI compute is relatively scarce"—an admission of structural mismatch. Prices reflect this: A100 prices crashed over 50%, RTX 4090 hourly rent dropped to ¥1–2, and the 5090 is around ¥2.5.&lt;/p&gt;

&lt;p&gt;But the low-to-mid-range mismatch has two distinct bottlenecks.&lt;/p&gt;

&lt;p&gt;Mid-tier datacenter cards (H20, L20, Huawei 910B, etc.) are stuck on infrastructure. Inference frameworks optimize for top-tier cards far more than these. KV cache management, MoE expert parallelism, FP8/FP4 precision support—none of the critical paths is mature here. The hardware exists, demand exists, but you can't serve a top experience.&lt;/p&gt;

&lt;p&gt;Consumer PCIe cards (4090, 5090, 4090 48GB mods) face the opposite problem. The hardware can run; vLLM already supports the 5090 (needs CUDA 12.8 + falling back to FlashAttention 2, usable enough). What's missing is good models designed for them. The 70B dense tier is obsolete—as of May 2026, the top six open-source models are all MoE; dense has virtually disappeared at the flagship level. MoE total parameters routinely exceed 100B, which won't fit on consumer cards; distilled small models can't match top quality. No one is supplying new, high-quality models tailored to 24GB/32GB/48GB VRAM limits.&lt;/p&gt;
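
&lt;p&gt;The fit problem is simple arithmetic. Here is a minimal sketch, assuming weights dominate memory (KV cache and activations are ignored) and that every expert of an MoE model must stay resident in VRAM. The function names and the bytes-per-parameter table are illustrative, not taken from any framework.&lt;/p&gt;

```python
# Rough weight-only estimate; KV cache and activations are ignored (assumption).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(params_billion, precision):
    """Approximate weight footprint in GB, taking 1 GB as 1e9 bytes."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion, precision, vram_gb):
    """True if the weights alone fit in the given VRAM."""
    return vram_gb >= weight_gb(params_billion, precision)


if __name__ == "__main__":
    # Hypothetical 100B-total-parameter MoE against the three consumer VRAM sizes.
    for vram in (24, 32, 48):
        print(vram, "GB card, 100B total params at FP4:", fits(100, "fp4", vram))
```

&lt;p&gt;Under these assumptions a 100B-parameter model needs about 50 GB at FP4, so it misses even a 48GB modded 4090 before any KV cache is counted.&lt;/p&gt;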

&lt;p&gt;So the picture is: 4090/5090 prices are absurdly cheap compared to datacenter cards, yet the mid-tier models you can actually run are still old stock like Llama 3.3 70B from late 2024. Individual developers experimenting locally, small-team PoCs, and privacy-sensitive on-prem deployments can get by. But for enterprise-grade mid-tier inference on these cards, no newly optimized models exist.&lt;/p&gt;

&lt;p&gt;The issue isn't "total compute is insufficient." It's "compute can't align with demand."&lt;/p&gt;

&lt;p&gt;Outsiders used to quote compute in "petaflops." That was always shaky; in the AI inference era it's nearly useless. Whether a compute unit can serve top-tier models depends on interconnect, memory bandwidth, FP4/FP8 support, KV cache management. A hundred older cards can't match one top-tier card's single-stream speed.&lt;/p&gt;
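
&lt;p&gt;One way to see why a raw FLOPS figure says little about single-stream speed: during decode, each new token has to read the active weights from memory, so memory bandwidth sets a ceiling. A back-of-envelope sketch, with hypothetical numbers and ignoring KV cache reads, batching, and interconnect:&lt;/p&gt;

```python
def decode_ceiling_tokens_per_s(mem_bandwidth_gb_s, active_weight_gb):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return mem_bandwidth_gb_s / active_weight_gb


if __name__ == "__main__":
    # Hypothetical cards: the same 20 GB of active weights, different bandwidth.
    print("older card:", decode_ceiling_tokens_per_s(1000.0, 20.0))
    print("newer card:", decode_ceiling_tokens_per_s(4000.0, 20.0))
```

&lt;p&gt;Both cards could report similar peak compute, yet the ceiling on tokens per second for one request differs by the ratio of their bandwidths.&lt;/p&gt;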

&lt;p&gt;You get a strange picture: top model providers scrambling for chips, while last-gen cards in datacenters can't be rented out even at a discount. Scarcity and glut, side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  V. The Market Will Correct the Mismatch, But It Takes Time
&lt;/h2&gt;

&lt;p&gt;This mismatch won't last. The two bottlenecks will be pushed by two different market forces.&lt;/p&gt;

&lt;p&gt;The infrastructure gap for mid-tier datacenter cards will be driven by engineering priorities. Inference frameworks follow the money. Once mid-tier model demand grows, top frameworks like vLLM, SGLang, and TensorRT-LLM will eventually be forced to prioritize H20, L20, and 910B optimization. Not glamorous, but inevitable.&lt;/p&gt;

&lt;p&gt;The model supply gap for consumer cards is being pushed by distillation and small MoE models. DeepSeek V4 has already been distilled into a ~9B version; the Qwen series has been working on this too. Once someone actually delivers "runs in 32GB VRAM, quality close to top-tier," idle 4090s and 5090s will immediately find work.&lt;/p&gt;

&lt;p&gt;Another track is deep binding between domestic chips and domestic models. DeepSeek and Zhipu are both pursuing it; technically it's proven feasible. Once it fully works, the low-to-mid-range compute market will reshuffle structurally.&lt;/p&gt;

&lt;p&gt;I'm fairly optimistic this will happen—it just takes time. Maybe a few quarters, maybe a year or two. For those who catch the rhythm, there's a structural window.&lt;/p&gt;

&lt;h2&gt;
  
  
  VI. Don't Reduce Tokens to a Single Number
&lt;/h2&gt;

&lt;p&gt;Back to the opening line. "Token economy" is a fine term, but it's far less intuitive than selling water or electricity.&lt;/p&gt;

&lt;p&gt;It's more like a gas station. Gasoline looks like one thing, but it's actually an intelligence × speed matrix. Layer on the supply-side compute tier mismatch, and you have the real cause behind today's apparently contradictory industry phenomena: why model companies are scrambling for chips, why some AI compute centers sit idle, why a fast tier can charge 6×, and why mid-tier intelligence models are slow to arrive.&lt;/p&gt;

&lt;p&gt;Next time you see "we've deployed N petaflops" or "we produce X trillion tokens per month," pause and ask: which intelligence tier, which speed tier, which demand tier.&lt;/p&gt;

&lt;p&gt;A token is not a thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model Versions and Positioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/gpt-5-5-instant/" rel="noopener noreferrer"&gt;OpenAI GPT-5.5 Instant Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Claude Opus 4.7 — Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-5.1" rel="noopener noreferrer"&gt;Zhipu GLM-5.1 Technical Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kimi-k2.org/blog/24-kimi-k2-6-release" rel="noopener noreferrer"&gt;Moonshot Kimi K2.6 Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;DeepSeek V4 — API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/24/deepseek-v4/" rel="noopener noreferrer"&gt;Simon Willison: DeepSeek V4—almost on the frontier&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mimo.xiaomi.com/mimo-v2-5-pro/" rel="noopener noreferrer"&gt;Xiaomi MiMo-V2.5-Pro Official&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.minimaxi.com/news/minimax-m25" rel="noopener noreferrer"&gt;MiniMax M2.5 Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.modelscope.cn/models/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;Qwen 3.6-35B-A3B — ModelScope&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;Google Gemma 4 Release&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Speed Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/gpt-5-5-high" rel="noopener noreferrer"&gt;Artificial Analysis — GPT-5.5 (high)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7" rel="noopener noreferrer"&gt;Artificial Analysis — Claude Opus 4.7&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/deepseek-v4-pro" rel="noopener noreferrer"&gt;Artificial Analysis — DeepSeek V4 Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/deepseek-v4-flash-non-reasoning" rel="noopener noreferrer"&gt;Artificial Analysis — DeepSeek V4 Flash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/kimi-k2-6" rel="noopener noreferrer"&gt;Artificial Analysis — Kimi K2.6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash" rel="noopener noreferrer"&gt;Artificial Analysis — Gemini 3.5 Flash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ithome.com/0/953/717.htm" rel="noopener noreferrer"&gt;Zhipu GLM-5.1 High-Speed 400 tokens/s Report (IT Home)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cerebras.ai/blog/cerebras-kimi-k2-Enterprise" rel="noopener noreferrer"&gt;Cerebras Running Kimi K2.6 at 981 tokens/s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groundy.com/articles/claude-code-fast-mode-6x-pricing-worth/" rel="noopener noreferrer"&gt;Claude Opus Fast Mode: 2.5× Speed / 6× Price (Groundy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/priority-processing" rel="noopener noreferrer"&gt;OpenAI Priority Processing Official Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zhipu Products and Financial Reports&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://finance.sina.com.cn/stock/hkstock/ggscyd/2026-03-31/doc-inhswpms3341678.shtml" rel="noopener noreferrer"&gt;Zhipu 2025 Annual Report (Sina Finance)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.qbitai.com/2026/03/394135.html" rel="noopener noreferrer"&gt;QbitAI: Zhipu Financial Report Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tmtpost.com/7938401.html" rel="noopener noreferrer"&gt;TMTPost: Zhipu Financial Report Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.jiemian.com/article/14095547.html" rel="noopener noreferrer"&gt;AutoClaw / Claw Plan Launch (Jiemian News)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compute Market&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://markets.financialcontent.com/wral/article/tokenring-2025-12-29-nvidias-blackwell-dynasty-b200-and-gb200-sold-out-through-mid-2026-as-backlog-hits-36-million-units" rel="noopener noreferrer"&gt;NVIDIA Blackwell sold out through mid-2026 (FinancialContent)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://finance.sina.com.cn/china/gncj/2026-05-01/doc-inhwktza5465326.shtml" rel="noopener noreferrer"&gt;Domestic B300 Servers at ¥7 Million, Still Unavailable (Sina Finance)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/1981721428861682031" rel="noopener noreferrer"&gt;H200 Sales Ban Lifted in China: Buy or Rent? (Zhihu)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitcode.csdn.net/69dc9c0a54b52172bc69377c.html" rel="noopener noreferrer"&gt;2026 Q1 GPU Rental Market Deep Dive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.fxbaogao.com/detail/5399775" rel="noopener noreferrer"&gt;High-End GPU Supply-Demand Mismatch Drives Compute Rental Boom (WallstreetCN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://caifuhao2.eastmoney.com/news/20260519201003499682070" rel="noopener noreferrer"&gt;Huawei Ascend 950PR in Mass Production + Orders Pushed to 2027 (East Money)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oschina.net/news/373016" rel="noopener noreferrer"&gt;Huawei Ascend AI Chip Three-Year Roadmap: 950PR / 950DT / 960 / 970 (OSCHINA)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ascendai.csdn.net/69d716f30a2f6a37c59df6df.html" rel="noopener noreferrer"&gt;DeepSeek V4 Fully Switches to Huawei Ascend 950PR (CSDN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.leiphone.com/category/industrynews/GpExeQUQDrXE3B8H.html" rel="noopener noreferrer"&gt;PPIO Yao Xin on AI Compute Center Idle Rates (Leiphone)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://36kr.com/p/3269610247737992" rel="noopener noreferrer"&gt;36Kr: AI Compute Center Utilization Only 10–20%&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.news.cn/finance/20250429/3df0c33317d2499ab3a297a413e0acce/c.html" rel="noopener noreferrer"&gt;Xinhua: General Compute Oversupply, AI Compute Shortage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mornai.cn/news/gpu/a100-gpu-rent-trend/" rel="noopener noreferrer"&gt;A100 Price Trend Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sohu.com/a/1001811857_122681399" rel="noopener noreferrer"&gt;RTX 4090 Hourly Rental Price Range ¥1.45–2.29 (Sohu, March 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gongjiyun.com/blog/2026/1/rx8wwbsgsisch5kyogoc7t3yncb/" rel="noopener noreferrer"&gt;RTX 5090 Compute at ¥2.5/Card/Hour (Gongji Compute)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mornai.cn/news/gpu/rtx-4090-48gb/" rel="noopener noreferrer"&gt;RTX 4090 48GB Mod Review (Mornai)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qubittool.com/zh/blog/llm-landscape-may-2026-deepseek-qwen-llama-comparison" rel="noopener noreferrer"&gt;2026 LLM Landscape: MoE Extincts Dense at the Flagship Level (QubitTool)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/BoltzmannEntropy/vLLM-5090" rel="noopener noreferrer"&gt;vLLM Deployment Guide on RTX 5090 (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/2028142360152817950" rel="noopener noreferrer"&gt;Using a Modded 4090 for a Year: Great for Dev, Disaster for Production (Zhihu)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.qq.com/rain/a/20260508A0267Y00" rel="noopener noreferrer"&gt;DeepSeek V4 Day-0 Support for Eight Domestic Chips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.guancha.cn/economy/2026_02_12_806895.shtml" rel="noopener noreferrer"&gt;GLM-5 Supports Seven Domestic Chips (Guancha)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/token-is-not-one-thing" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/token-is-not-one-thing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>token</category>
      <category>infra</category>
      <category>compute</category>
    </item>
    <item>
      <title>The Stronger the Agent, the More Common Sense Is Worth</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 25 May 2026 04:06:31 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-stronger-the-agent-the-more-common-sense-is-worth-4m96</link>
      <guid>https://dev.to/skyguan92/the-stronger-the-agent-the-more-common-sense-is-worth-4m96</guid>
      <description>&lt;p&gt;Last month I wrote &lt;a href="https://guanjiawei.ai/blog/amateur-advantage" rel="noopener noreferrer"&gt;“AI Turns Ignorance Into an Advantage”&lt;/a&gt;, about how outsiders without the baggage of knowing how hard something is are more willing to use AI to try things that look impossible.&lt;/p&gt;

&lt;p&gt;I still believe that. But agents burned me four times in a row recently, so I need to revise it.&lt;/p&gt;

&lt;p&gt;The sweet spot isn’t knowing nothing; it’s knowing just enough. You have common sense, you grasp the big picture, but you don’t get lost in technical details. Total beginners do dare to try, which is good. But they can’t tell whether the agent’s output is actually reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Fake Data Can Trick You by Orders of Magnitude
&lt;/h2&gt;

&lt;p&gt;I’ve been optimizing an inference engine lately.&lt;/p&gt;

&lt;p&gt;I checked the results on the first night. The metric had hit a target I’d considered seriously challenging. I was excited. Had we really cracked it that fast?&lt;/p&gt;

&lt;p&gt;If I knew nothing about this domain, I’d probably have cheerfully shared the results with my partners. But because I had some common sense, something felt off. I had the agent run a correctness check. The output was nothing but exclamation marks. After fixing correctness, performance dropped by orders of magnitude.&lt;/p&gt;

&lt;p&gt;I thought that was the end of it. But as we kept optimizing, the rhythm still felt wrong. The numbers climbed too fast, suspiciously fast. I looked at the test flow and found that before each official test it quietly ran a warm-up using the exact same prompt. Every subsequent test was hitting the prefix cache, essentially cheating on an open-book exam. After isolating the cache, performance dropped by orders of magnitude again.&lt;/p&gt;

&lt;p&gt;Still not done. Once prefill returned to normal, decode speed suddenly became absurd. A Windows build of the engine was somehow outperforming the Linux version. I ran the real-prompt test script I’d written earlier, and performance took another ten-fold hit. The problem was that the agent’s synthetic prompts were too simple and too regular, letting speculative decoding hit an acceptance rate above 80%. Switch to real prompts and the acceptance rate cratered, taking performance with it. Teams that have shipped speculative decoding have documented the exact same trap: real-world production performance is 40% to 60% lower than in the lab, a gap large enough to make you wonder if it’s the same system.&lt;/p&gt;
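
&lt;p&gt;The sensitivity to acceptance rate can be made concrete with the standard estimate from the speculative decoding literature: if each drafted token is accepted independently with probability α and the draft proposes γ tokens, one verification step yields (1 - α^(γ+1)) / (1 - α) tokens on average. A minimal sketch, assuming that independence and ignoring the cost of running the draft model; the draft length of 4 and the 30% figure are hypothetical stand-ins, not measurements from this project:&lt;/p&gt;

```python
def expected_tokens_per_step(alpha, gamma):
    """Mean tokens produced per verification step.

    alpha: per-token acceptance probability (assumed independent).
    gamma: number of tokens the draft proposes per step.
    """
    if alpha >= 1.0:
        # Every draft token is accepted, plus the one token from the verifier.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    print("synthetic prompts:", round(expected_tokens_per_step(0.8, 4), 2))
    print("real prompts:", round(expected_tokens_per_step(0.3, 4), 2))
```

&lt;p&gt;With γ = 4, an 80% acceptance rate gives roughly 3.4 tokens per step, while 30% gives roughly 1.4, and that is before paying for the draft model itself.&lt;/p&gt;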

&lt;p&gt;Three layers of illusion, stacked. If I’d believed that first number and shared it externally, the cleanup would have been miserable. You give someone the wrong expectation, and they start scheduling around it. Then you have to go back and say, “Sorry, we’re off by orders of magnitude.” That feels way worse than saying “We’re not there yet” from the start.&lt;/p&gt;

&lt;p&gt;After that, every optimization target explicitly included two rules: prefill must not be affected by prefix-cache interference, and decode must use real prompts. Only then did we see a normal curve that crept upward, bit by bit.&lt;/p&gt;
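
&lt;p&gt;One way to enforce the first rule in a test harness, sketched under assumptions: give every request a unique leading nonce so no two requests share a cacheable prefix, and draw the prompt bodies from real traffic. The helper names are hypothetical; turning off the prefix cache in the engine itself, where it offers such a switch, is the other option.&lt;/p&gt;

```python
import uuid


def isolate_prefix(prompt):
    """Prepend a unique nonce so this request shares no prefix with any other."""
    return f"[run-{uuid.uuid4().hex}] {prompt}"


def build_benchmark_prompts(real_prompts, repeats):
    """Expand real prompts into a benchmark set where every entry is unique."""
    return [isolate_prefix(p) for _ in range(repeats) for p in real_prompts]


if __name__ == "__main__":
    # Placeholder bodies; in practice these come from logged real requests.
    for p in build_benchmark_prompts(["Summarize this contract.", "Fix this bug."], 2):
        print(p)
```

&lt;p&gt;Because the nonce sits at the very start of the prompt, a warm-up pass with the same body no longer primes the cache for the measured run.&lt;/p&gt;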

&lt;h2&gt;
  
  
  2. It Will Brick Your Lab Machine
&lt;/h2&gt;

&lt;p&gt;The latest agents can work autonomously for a full day or longer. The longer they run, the higher the chance something goes wrong.&lt;/p&gt;

&lt;p&gt;My agent has, more than once, trashed the entire OS of a lab machine mid-run because of a missing quote or a flipped command-line argument. Files gone, environment wiped. It happens in a split second. You can’t stop it in time.&lt;/p&gt;

&lt;p&gt;It’s not just me. In April, when an agent hit a credential mismatch, it didn’t stop to ask a human. It found a token with full privileges and deleted an entire company’s production database and all backups in nine seconds. Thirty-plus hours of downtime. Three months of customer data, gone. There have been at least a dozen similar documented incidents in the past two years.&lt;/p&gt;

&lt;p&gt;Anthropic and OpenAI are now pushing sandboxing. The idea isn’t complicated. Filesystem isolation on one layer, network isolation on another. Without filesystem isolation, the agent can touch things it shouldn’t. Without network isolation, a compromised agent can steal your keys.&lt;/p&gt;

&lt;p&gt;My own approach is more low-tech: dedicate a machine exclusively to the agent, and don’t store anything else on it. If it runs for dozens of hours straight, the probability of a dumb mistake is non-zero. Reinstalling the OS costs time. Losing important data costs your sanity.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. It Will Spin in Circles Until You Step In
&lt;/h2&gt;

&lt;p&gt;Agents have another bad habit: they circle the same problem.&lt;/p&gt;

&lt;p&gt;A recent goal was to run an inference engine on Windows in BF16 precision. The model weights were over 60 GB, and loading them caused an immediate OOM crash.&lt;/p&gt;

&lt;p&gt;The agent’s response was interesting: it kept trying to work around the memory bottleneck. Load only some weights, dynamically read the rest during inference, every offload trick in the book. None of them worked, and each ate up a lot of time. It even added a warm-up to the tests to hide loading latency. That was part of the root cause of the prefix-cache problem I mentioned earlier.&lt;/p&gt;

&lt;p&gt;I finally cut in and said: stop tweaking performance and fix the memory issue first. Until that bottleneck is solved, everything else is wasted effort.&lt;/p&gt;

&lt;p&gt;The agent actually executes well. Once pointed in the right direction, it quickly found a series of system-level Windows settings to expand available memory and VRAM. After that was fixed, the optimization path smoothed out immediately. All the previous workarounds were suddenly useless. That time was basically wasted.&lt;/p&gt;

&lt;p&gt;The problem is that it won’t proactively redefine the problem. Hand it “optimize performance” and it will keep grinding on that goal, even when stuck on a prerequisite. It finds ways to work around it rather than telling you, “This assumption is false; we need to handle something else first.” Recognizing the real blocker and pulling the agent out of the dead end is a judgment call only a human can make.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Set the Bar Too High and You Ship Nothing
&lt;/h2&gt;

&lt;p&gt;The last pitfall isn’t the agent’s fault. It’s mine.&lt;/p&gt;

&lt;p&gt;The more powerful agents get, the easier it is to set the bar too high. Because they can run for days, you start thinking anything is fair game. Every direction looks like a top-conference breakthrough. So you spin up multiple threads, each one ambitious.&lt;/p&gt;

&lt;p&gt;The result? Every thread is active, every thread shows progress, but nothing ships.&lt;/p&gt;

&lt;p&gt;You keep burning tokens, you keep seeing “progress,” yet nothing reaches the user’s hands. It looks like work. It’s actually just burning money. I made this mistake recently: several threads were the kind that would be huge if they landed, but the execution risk was equally high. An agent isn’t a genie; if it can’t be done, it can’t be done. I burned a mountain of tokens and delivered nothing.&lt;/p&gt;

&lt;p&gt;I eventually realized: narrow the scope. You need something shippable in the short-to-medium term and some worthwhile long-term explorations, not only the latter. Deliver what can be delivered first, stabilize the rhythm, then go after the big bets.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Knowing Just Enough Is Exactly Right
&lt;/h2&gt;

&lt;p&gt;Look at the four pitfalls together and one thread connects them: none requires you to be a deep expert to avoid.&lt;/p&gt;

&lt;p&gt;Performance jumped by orders of magnitude? Check whether you measured wrong first. The agent needs to run on your main machine all day? Give it a dedicated one. Stuck on the same spot after three optimization rounds? That spot is the real problem. Every thread is running but none ships? Kill a few.&lt;/p&gt;

&lt;p&gt;It’s all common sense.&lt;/p&gt;

&lt;p&gt;An MIT Sloan article this year on managing in the age of agentic AI noted that the most important skills for managing agents are defining problems and validating outputs. Those are things AI still can’t do well. “Agent Manager” is already showing up on job boards, and one line in the job description stands out: domain common sense matters more than AI expertise.&lt;/p&gt;

&lt;p&gt;Going back to my previous post. “Ignorance is an advantage” still holds: you have to not know what’s hard in order to dare to try. But courage alone isn’t enough. The most valuable state is this: willing to try, yet able to sense when something is off at the critical moment.&lt;/p&gt;

&lt;p&gt;Total beginners get carried away by fake data. Deep experts get shackled by priors. The people in the middle, the ones who know just enough, are bold enough to act, yet wise enough to pull the reins when needed.&lt;/p&gt;

&lt;p&gt;Agents will keep getting stronger. But that bit of human common sense, whether the numbers check out, whether the direction is right, or whether this should ship now, will only become more valuable. These are still things agents can’t handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/gpt-5-1-codex-max/" rel="noopener noreferrer"&gt;GPT-5.1 Codex Max Can Work Autonomously for Over 24 Hours — OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;GPT-5.5 Released: Multi-Hour Autonomous Session Capability — OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tianpan.co/blog/2026-04-17-speculative-decoding-production-hidden-traps" rel="noopener noreferrer"&gt;Speculative Decoding’s Hidden Traps in Production: Real-World Performance 40–60% Lower Than in the Lab — tianpan.co&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sudostack.co/mtp-speculative-decoding-coding-vs-creative-writing/" rel="noopener noreferrer"&gt;MTP Acceptance Rate Variations Across Task Types — SudoStack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" rel="noopener noreferrer"&gt;Cursor + Claude Agent Deletes Entire Company Production Database in 9 Seconds — The Register&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/claude-go/what-10-real-ai-agent-disasters-taught-me-about-autonomous-systems-2ndc"&gt;10 Real-World AI Agent Incidents Reviewed — DEV Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/claude-code-sandboxing" rel="noopener noreferrer"&gt;Claude Code Sandbox Design: Two-Layer Isolation Cuts Permission Prompts by 84% — Anthropic Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sloanreview.mit.edu/article/agentic-ai-at-scale-redefining-management-for-a-superhuman-workforce/" rel="noopener noreferrer"&gt;Agentic AI Redefines Management: 69% of Experts Call It a Paradigm Shift — MIT Sloan Management Review&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.weforum.org/stories/2025/07/leaders-will-soon-be-managing-ai-agents-these-are-the-skills-theyll-need/" rel="noopener noreferrer"&gt;Core Skills for Managing AI Agents — World Economic Forum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://beam.ai/agentic-insights/what-is-an-agent-manager-the-new-role-every-ai-company-needs-in-2026/" rel="noopener noreferrer"&gt;Agent Manager: The New Enterprise Role in 2026 — Beam AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/agent-common-sense" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/agent-common-sense&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>collaboration</category>
      <category>thinking</category>
    </item>
    <item>
      <title>When a Model Is 5× Faster, It’s No Longer the Same Model</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 22 May 2026 06:52:33 +0000</pubDate>
      <link>https://dev.to/skyguan92/when-a-model-is-5x-faster-its-no-longer-the-same-model-2ih3</link>
      <guid>https://dev.to/skyguan92/when-a-model-is-5x-faster-its-no-longer-the-same-model-2ih3</guid>
      <description>&lt;p&gt;Two releases caught my eye this week.&lt;/p&gt;

&lt;p&gt;On May 19, Google released Gemini 3.5 Flash. I watched their launch event. Oddly, they barely emphasized the model’s raw intelligence. Benchmarks against the previous generation didn’t exactly stand out either. But they devoted serious time to speed, calling it “frontier intelligence built for speed,” claiming inference is 4× faster than other frontier models.&lt;/p&gt;

&lt;p&gt;Today, May 22, Zhipu also launched GLM-5.1 High-Speed, claiming 400 token/s output—the current ceiling for industry APIs. This engine wasn’t built by Zhipu alone; it was a joint effort with a team called TileRT, doing low-level customization specifically for the GLM model family on a specific class of hardware.&lt;/p&gt;

&lt;p&gt;Put these two together, then look back at Anthropic’s Opus Fast and OpenAI’s GPT-5.5 Fast over the past few months, and the direction is clear: differentiation at the model layer is changing lanes. Everyone used to compete on smarts; now they’re increasingly competing on speed.&lt;/p&gt;

&lt;p&gt;And once speed crosses a certain line, it stops being a linear “X times faster” improvement. AI becomes a different kind of thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pricing Already Tells the Story
&lt;/h2&gt;

&lt;p&gt;The clearest evidence is fast-mode pricing.&lt;/p&gt;

&lt;p&gt;Anthropic’s Opus Fast: 2.5× the speed, 6× the price.&lt;br&gt;&lt;br&gt;
OpenAI’s GPT-5.5 Fast: 1.5× the speed, 2.5× the cost.&lt;/p&gt;

&lt;p&gt;Look at the numbers. If speed were valued linearly, 2.5× speed would cost 2.5× the price, and 1.5× speed would cost 1.5×. But in practice, the price jumps far more than the speed.&lt;/p&gt;
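
&lt;p&gt;The arithmetic, as a quick sketch using only the two price points quoted above (the function name is illustrative):&lt;/p&gt;

```python
def price_per_unit_speed(speed_multiple, price_multiple):
    """Price multiple paid per 1x of speed; 1.0 would mean linear pricing."""
    return price_multiple / speed_multiple


if __name__ == "__main__":
    tiers = {"Opus Fast": (2.5, 6.0), "GPT-5.5 Fast": (1.5, 2.5)}
    for name, (speed, price) in tiers.items():
        print(name, round(price_per_unit_speed(speed, price), 2))
```

&lt;p&gt;Both ratios come out above 1.0: about 2.4 for Opus Fast and about 1.67 for GPT-5.5 Fast.&lt;/p&gt;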

&lt;p&gt;This isn’t greed. It’s a real market signal: some people will happily pay disproportionately more for speed. Either their tasks need high-frequency feedback, or users are sitting there waiting, or downstream steps are blocked. In these scenarios, going from 30 seconds to 12 seconds feels completely different from going from 30 seconds to 20 seconds.&lt;/p&gt;

&lt;p&gt;I toggle Opus Fast on and off constantly myself. I turned off GPT-5.5’s 1.5× tier immediately. I couldn’t feel the difference; it was just burning money. But at 2.5×, there are tasks I just leave it on for—mostly when I’m staring at the output and iterating fast.&lt;/p&gt;

&lt;p&gt;Markets don’t lie. Something that sells for 6× has buyers who genuinely think it’s worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Per-Request Speed and Scaling Out Are Not the Same Thing
&lt;/h2&gt;

&lt;p&gt;Two things need to be kept apart here.&lt;/p&gt;

&lt;p&gt;The first is “doing more concurrency at a fixed speed.” The same 30 token/s per request, but serving 1,000 users instead of 100. This is relatively easy. Just throw more machines at it: you can buy slightly weaker cards and spread the load across them, and the cost-performance ratio can be tuned.&lt;/p&gt;

&lt;p&gt;The second is “making a single request faster.” Going from 30 token/s to 400. This is an entirely different beast. You need higher-end hardware, more aggressive memory bandwidth, and cutting-edge packaging. You can’t fix this by “spending a bit more to stack extra cards.” A hundred weak cards won’t get a single request to the speed of one top-tier card.&lt;/p&gt;

&lt;p&gt;I’ve spent time experimenting with inference infra myself, optimizing a few open-source models. The cost curves are completely different. The first is roughly linear—double the money, get about double the concurrency. The second is non-linear—that first 20% speedup might cost you 50% more, and it only gets steeper.&lt;/p&gt;

&lt;p&gt;So when Gemini 3.5 Flash emphasizes speed, or GLM High-Speed hits 400 token/s, they’re not saying “we made a cheaper version.” They’re saying “we pushed single-request speed to a new level.” That’s a problem of an entirely different magnitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. 5× Is a Speciation Threshold
&lt;/h2&gt;

&lt;p&gt;So why push so hard?&lt;/p&gt;

&lt;p&gt;When I think about this, I go back to a simple comparison.&lt;/p&gt;

&lt;p&gt;If you want something done faster, the traditional options are limited.&lt;/p&gt;

&lt;p&gt;First, hire smarter people. But that hits a ceiling. There are only so many world-class experts, and today’s best models are already brushing up against that ceiling.&lt;/p&gt;

&lt;p&gt;Second, make people work overtime. Agents already run 24/7, so that ceiling is gone too.&lt;/p&gt;

&lt;p&gt;Third, divide the work and throw more people at it. But anyone who’s done engineering knows adding people doesn’t scale linearly. Adding one person doesn’t make it twice as fast; adding ten gets you nowhere near 10×. You have to break things down, hand off, coordinate, deal with uneven quality, manage waste. The ramp-up period for new hires is expensive. If you’re doing multi-agent orchestration, you know exactly what I mean.&lt;/p&gt;

&lt;p&gt;At this point, the traditional paths to speed are tapped out.&lt;/p&gt;

&lt;p&gt;So what’s left? Make the model itself—the same employee—faster.&lt;/p&gt;

&lt;p&gt;And making that “same employee” faster is non-linear.&lt;/p&gt;

&lt;p&gt;Imagine an employee who used to take an hour to finish a task. Now they do it in ten minutes. You think you just saved fifty minutes? It’s more than that.&lt;/p&gt;

&lt;p&gt;You’ll start giving them tasks you’d never have bothered with because “it’s too slow.” Small ad-hoc requests that used to take an hour—so you never asked—now come back in ten minutes, and you make a dozen a day. Speed unlocks tasks that literally didn’t exist before.&lt;/p&gt;

&lt;p&gt;I saw a demo the other day: someone wearing glasses pointed at a video on a screen and said “zoom in on this,” and the AI behind it wrote code to resize the element. If the whole chain takes thirty seconds, you glance and walk away—there’s no real interaction. But if it finishes in five seconds, the feel is completely different; it becomes a genuinely usable product.&lt;/p&gt;

&lt;p&gt;That’s the gap between 50 token/s and 400 token/s. 8× speed unlocks products that were impossible to build before.&lt;/p&gt;
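
&lt;p&gt;To put numbers on that gap, a minimal sketch assuming a hypothetical 2,000-token response and ignoring time to first token:&lt;/p&gt;

```python
def generation_seconds(output_tokens, tokens_per_s):
    """Time to stream a response of the given length at a steady decode speed."""
    return output_tokens / tokens_per_s


if __name__ == "__main__":
    response_tokens = 2000  # hypothetical response length
    for speed in (50, 400):
        print(speed, "token/s:", generation_seconds(response_tokens, speed), "seconds")
```

&lt;p&gt;At 50 token/s the response takes 40 seconds; at 400 token/s it takes 5.&lt;/p&gt;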

&lt;p&gt;A speedup beyond 5× is a speciation line.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Return of Specialization
&lt;/h2&gt;

&lt;p&gt;Okay, speed is valuable. How do you actually achieve it?&lt;/p&gt;

&lt;p&gt;That brings us to TileRT’s approach, which diverges from where the industry was a year ago.&lt;/p&gt;

&lt;p&gt;Mainstream inference frameworks like vLLM, TensorRT-LLM, and SGLang are general-purpose. They aim to “support as many models as possible, running well enough on as much hardware as possible.” That has always been software engineering’s default bias: generality first, performance second.&lt;/p&gt;

&lt;p&gt;TileRT does the opposite. It statically schedules the entire inference graph at compile time, running as a persistent kernel on the GPU with almost no runtime dynamic scheduling. Micro-tasks at tile-level granularity squeeze the hardware close to its physical limits. The cost? Change the model and it’s basically scrap; change the hardware and it needs major rework.&lt;/p&gt;

&lt;p&gt;DeepSeek is on the same path. Their own inference engine started out based on vLLM, then underwent more than a year of deep customization—almost every path was rewritten for their own MoE architecture. When they open-sourced part of it recently, the industry’s reaction wasn’t “how general-purpose this is,” but “how deep you can go for a single model.”&lt;/p&gt;

&lt;p&gt;Go one layer deeper, and the hardware side has been on this path for a while. Groq’s LPU runs Llama 4 Scout at 460 token/s, 3–4× what an H100 delivers. Cerebras’s WSE-3 hits 1,800 token/s on a 70B model and nearly 3,000 on gpt-oss-120B. These are specialized chips. They aren’t trying to run every kind of model; they’re built to take a specific workload to the extreme.&lt;/p&gt;

&lt;p&gt;Chip designers have debated this for decades: general-purpose CPU or specialized ASIC? General chips have their place, but when a domain is big enough and the lifecycle is long enough, specialization pays off.&lt;/p&gt;

&lt;p&gt;The software layer used to avoid this, mainly because software isn’t cheap to write. Building a dedicated inference stack for one model takes too long to pay off; the model changes and your software is dead.&lt;/p&gt;

&lt;p&gt;That’s changing. AI agents can write software now. The cost of “building an optimal inference stack from scratch for a specific model and specific hardware” drops every year. Once it falls below a certain threshold, specialization becomes the default.&lt;/p&gt;

&lt;p&gt;Every promising model will eventually have its own dedicated inference engine. Every generation of mainstream hardware will have its own specially optimized stack. What you used to think of as just “the last 5% of optimization” could now become a 5× or 10× gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Vertical Integration at the Model Layer Is Inevitable
&lt;/h2&gt;

&lt;p&gt;Pulling these threads together.&lt;/p&gt;

&lt;p&gt;Intelligence will keep improving in the short term, but the marginal utility of competing on raw smarts is declining. A model that’s 20% smarter versus the same model accelerated 10×—for many users, the latter is far more valuable, especially for the new scenarios that speed itself unlocks.&lt;/p&gt;

&lt;p&gt;So the next phase of competition shifts from “point intelligence” to “end-to-end capability.” Model, inference engine, and hardware—all three bundled together.&lt;/p&gt;

&lt;p&gt;If you’re at 400 tokens/s and I’m at 30 tokens/s, even if my model is 20% smarter, I’m unusable in many scenarios. I’ll be watching my smartest model sit there slowly spitting out words while you’ve already delivered the whole product experience to the user.&lt;/p&gt;

&lt;p&gt;DeepSeek and Zhipu are already doing this. Anthropic and OpenAI are too. Google probably went the earliest and deepest—the TPU + Gemini combo has been running internally for a long time. My guess is that over the next year or two, the whole industry moves this way: model companies must own their inference stack and go deep into the hardware layer; hardware companies must go deep into model architecture; and the generic middle layer gets squeezed from both ends.&lt;/p&gt;

&lt;p&gt;For engineers, this is pretty exciting. We used to think “general, scalable, portable” was good taste. For the foreseeable future, the opposite may hold: writing the most extreme code for a specific model and specific hardware—code that breaks if you change anything—becomes worth doing again.&lt;/p&gt;

&lt;p&gt;Software engineering aesthetics will have to change.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/" rel="noopener noreferrer"&gt;Gemini 3 Flash: frontier intelligence built for speed (Google Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash" rel="noopener noreferrer"&gt;Gemini 3.5 Flash - Intelligence, Performance &amp;amp; Price Analysis (Artificial Analysis)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://finance.sina.com.cn/tech/digi/2026-05-22/doc-inhytqkw6284792.shtml" rel="noopener noreferrer"&gt;Zhipu Releases GLM-5.1 High-Speed AI Model, Hitting a World-Fastest 400 tokens/s (Sina Tech, in Chinese)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.163.com/dy/article/KTHAMEVM05198UNI.html" rel="noopener noreferrer"&gt;Zhipu (02513) Launches GLM-5.1 High-Speed API: 400 tokens/s Sets a New Global Speed Ceiling (NetEase, in Chinese)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.itsolotime.com/archives/21459" rel="noopener noreferrer"&gt;TileRT v0.1.3 Released: GLM-5 Support Goes Live, Inference Speed up to 600 tokens/s (in Chinese)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://byteiota.com/claude-opus-4-7-fast-mode-2-5x-speed-6x-the-cost/" rel="noopener noreferrer"&gt;Claude Opus 4.7 Fast Mode: 2.5x Speed, 6x the Cost (byteiota)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groundy.com/articles/claude-code-fast-mode-6x-pricing-worth/" rel="noopener noreferrer"&gt;Claude Code /fast Mode: Is 6x Pricing Worth It? (Groundy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developers.openai.com/codex/speed" rel="noopener noreferrer"&gt;Speed — OpenAI Codex Developers&lt;/a&gt; (GPT-5.5 Fast: 1.5× speed / 2.5× cost)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stable-learn.com/en/deepseek_inference_engine/" rel="noopener noreferrer"&gt;Open-Sourcing DeepSeek Inference Engine: A New Chapter in AI Infrastructure (StableLearn)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groq.com/lpu-architecture" rel="noopener noreferrer"&gt;Groq LPU Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cerebras.ai/blog/cerebras-cs-3-vs-groq-lpu" rel="noopener noreferrer"&gt;Cerebras CS-3 vs. Groq LPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://speko.ai/benchmark/groq-vs-cerebras" rel="noopener noreferrer"&gt;Groq vs Cerebras: LLM Inference Speed Comparison 2026 (Speko)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/inference-speed-new-species" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/inference-speed-new-species&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>inferenceoptimization</category>
      <category>infra</category>
      <category>reflections</category>
    </item>
    <item>
      <title>AMD Gave a Developer Award to Someone Who Can't Code</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 19 May 2026 08:07:36 +0000</pubDate>
      <link>https://dev.to/skyguan92/amd-gave-a-developer-award-to-someone-who-cant-code-418l</link>
      <guid>https://dev.to/skyguan92/amd-gave-a-developer-award-to-someone-who-cant-code-418l</guid>
      <description>&lt;p&gt;Today I went to AMD's developer conference in Shanghai.&lt;/p&gt;

&lt;p&gt;The entrance alone was a shock. The line to get in was long, and the hall was already packed before anything started. They expected just over 1,000 people; more than 2,000 showed up. AMD said it was their biggest recent event.&lt;/p&gt;

&lt;p&gt;Lisa Su showed up too. She'd been in Beijing the day before meeting Vice Premier He Lifeng to talk chip cooperation. I'd never seen a chip company pull a crowd like this for a developer conference.&lt;/p&gt;

&lt;h2&gt;
  
  
  AMD Gave an Award to Someone Who Can't Code
&lt;/h2&gt;

&lt;p&gt;That morning, an AMD senior VP got on stage to hand out two developer awards. When they introduced one winner, the host said:&lt;/p&gt;

&lt;p&gt;"He didn't actually know how to code before."&lt;/p&gt;

&lt;p&gt;He'd used an AI agent to rewrite an entire system in Rust and optimize performance. AMD figured that was worth an award.&lt;/p&gt;

&lt;p&gt;Sitting there, the whole thing felt surreal. A chip company worth hundreds of billions, at a 2,000-person developer conference, handed one of two awards to someone who doesn't code.&lt;/p&gt;

&lt;p&gt;I bet next year the award will be even harder to judge. AI-native people like him will only become more common.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Boundary Between Developers and Users Is Vanishing
&lt;/h2&gt;

&lt;p&gt;Before, if you used a product, you just used it. You couldn't really help build it. Even in open source, you had to code before you could contribute.&lt;/p&gt;

&lt;p&gt;Not anymore. Coding agents are getting stronger, and regular users can now tweak, optimize, and push changes back while using a product. The same person is both user and builder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://guanjiawei.ai/blog/ai-amplifies-passion" rel="noopener noreferrer"&gt;I wrote before about the split between Builders and Promoters&lt;/a&gt;. That was about passion diverging. This is the flip side: the roles of user and contributor now overlap, often in the same person. Users are also investing their tokens across different products, and the ones that earn that investment keep evolving.&lt;/p&gt;

&lt;p&gt;Product logic has shifted. You used to focus on making the experience great. Now you also need to make it easy for users to become contributors.&lt;/p&gt;

&lt;p&gt;AMD's big Strix Halo push is interesting. The AI Max+ 395 chip can allocate up to 96 GB of unified memory to its integrated GPU for running local models, and &lt;a href="https://guanjiawei.ai/blog/inference-engine-last-puzzle-piece" rel="noopener noreferrer"&gt;my inference engine can run on it too&lt;/a&gt;. Domestically, prices have been climbing and it's been out of stock. I have several R&amp;amp;D test machines for performance tuning, and they're also my entry point into the ROCm ecosystem.&lt;/p&gt;

&lt;p&gt;AMD is pushing this machine to lower the developer barrier another notch. More developers means stickier stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Industry Winds Did a 180 in Six Months
&lt;/h2&gt;

&lt;p&gt;I attended a similar conference around the middle of last year. The vibe was completely different.&lt;/p&gt;

&lt;p&gt;Back then, people in the compute business were stressed. Early last year, DeepSeek raised expectations for models, and everyone was wondering whether the wave would last. How to move product, how to clear inventory, whether the business could survive. Everyone was scrambling for solutions and partners.&lt;/p&gt;

&lt;p&gt;This year, the table talk completely changed. The first thing anyone says is, "Can you get me more supply?" or "I'll take everything you've got."&lt;/p&gt;

&lt;p&gt;It's completely flipped. Supply is tight, and whoever holds quality inventory is making money. The shift from demand anxiety to supply anxiety took just six months.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Isn't Another Bubble
&lt;/h2&gt;

&lt;p&gt;Plenty of people say: "Here we go again. Next metaverse."&lt;/p&gt;

&lt;p&gt;This time it really is different. I lived through the metaverse and blockchain cycles too. The difference this time is in the data, specifically paid demand from real users.&lt;/p&gt;

&lt;p&gt;Lisa Su said on stage that roughly 1 billion people are already using AI worldwide, and by 2030 that number will hit 5 billion daily active users. ChatGPT came out at the end of 2022, so it's been only about three and a half years. The internet took over 20 years to reach that scale; the PC era took even longer. This is a diffusion speed never seen before in history.&lt;/p&gt;

&lt;p&gt;The money is keeping up too. Anthropic's Q1 grew 80x year-over-year. That's annualized revenue, not API calls. Dario himself said they weren't ready to catch a wave this big. Claude Code hit a $1 billion annualized run rate within six months of launch, and by April this year the company's overall ARR surged to $30 billion.&lt;/p&gt;

&lt;p&gt;This is nothing like a few years ago, when everyone was in a price war, handing out free tokens, and chasing call volume. Supply can't keep up with paid demand.&lt;/p&gt;

&lt;h2&gt;
  
  
  "X Is Dead" Is the Cheapest Narrative
&lt;/h2&gt;

&lt;p&gt;A friend recently asked me: "Is OpenClaw dead?" "How's Claude Code doing?" "I heard Codex is going to win."&lt;/p&gt;

&lt;p&gt;I think that's just inertia.&lt;/p&gt;

&lt;p&gt;Last December, every top academic conference and product circle was talking about Gemini. Back then everyone thought Google had it in the bag. A few months later, almost nobody mentioned Gemini. Then it was Cursor, then Claude Code. Pretty soon it'll be Codex. At the top table, players keep rotating.&lt;/p&gt;

&lt;p&gt;But the underlying trend runs one way. It hasn't reversed. Paid demand is rising, call volume is rising.&lt;/p&gt;

&lt;p&gt;Real information is expensive. You have to use the tools yourself, show up on-site, and talk to people inside. So the audience for that is naturally small. Narratives like "it's dead," "it's a bubble," or "just another cycle" are the cheapest to spin up. They validate sitting on the sidelines and feed the need to believe that not engaging was the right call. They spread the easiest.&lt;/p&gt;

&lt;p&gt;It's not that nobody sees through these narratives. Most people simply want to believe them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go See for Yourself
&lt;/h2&gt;

&lt;p&gt;Lately when I meet friends, I do one thing: tell them to bring their laptop, and I help them install Claude Code or Codex and get the model connected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://guanjiawei.ai/blog/coding-agent-adoption-gap" rel="noopener noreferrer"&gt;Once you get past that hurdle, you can hand off 95% of computer work&lt;/a&gt;. I built my own website from scratch without lifting a finger. Frontend, DNS, SEO, all done by agents. The barrier is small, but once you're past it, the world looks completely different.&lt;/p&gt;

&lt;p&gt;If that's still too much, just find a conference this year where people are actually doing this and go. There were quite a few workshops at the event where you brought a laptop and worked hands-on. Only when you sit there do you realize how far AI has already come.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://technode.com/2026/05/19/amd-ceo-lisa-su-in-shanghai-predicts-five-billion-daily-ai-users-within-five-years/" rel="noopener noreferrer"&gt;AMD CEO Lisa Su in Shanghai Predicts 5 Billion Daily AI Users Within Five Years&lt;/a&gt; — On-site report from AMD AI Developer Day Shanghai, May 19, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://news.cgtn.com/news/2026-05-18/Chinese-vice-premier-meets-AMD-CEO-calls-for-deeper-cooperation-1NfLteD25Y4/p.html" rel="noopener noreferrer"&gt;Chinese Vice Premier He Lifeng Meets Lisa Su, Calls for Deeper Cooperation&lt;/a&gt; — May 18, 2026, He Lifeng meets AMD CEO Lisa Su in Beijing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cnbc.com/video/2026/01/06/amd-ceo-lisa-su-expect-over-5-billion-active-ai-users-in-the-next-five-years.html" rel="noopener noreferrer"&gt;CES 2026: Lisa Su Predicts Over 5 Billion AI Users in Five Years&lt;/a&gt; — Lisa Su first gave the 5-billion-user forecast during her CES keynote&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://venturebeat.com/technology/anthropic-says-it-hit-a-30-billion-revenue-run-rate-after-crazy-80x-growth" rel="noopener noreferrer"&gt;Anthropic Q1 Grew 80x, Annualized Run Rate Hits $30 Billion ARR&lt;/a&gt; — Dario Amodei publicly acknowledged 80x year-over-year Q1 growth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mindstudio.ai/blog/anthropic-30b-arr-4-months-pulling-ahead-openai" rel="noopener noreferrer"&gt;Anthropic's ARR Surged from $9 Billion to $30 Billion in 4 Months&lt;/a&gt; — Full ARR trajectory: Jan 2024 $87M → Dec 2024 $1B → End of 2025 $9B → Apr 2026 $30B&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/" rel="noopener noreferrer"&gt;Claude Code Surpassed $1 Billion Annualized Revenue Within Six Months of Launch&lt;/a&gt; — Claude Code is Anthropic's fastest-growing product&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://time.com/6253615/chatgpt-fastest-growing/" rel="noopener noreferrer"&gt;ChatGPT Is the Fastest-Growing Consumer Product in History to Reach 100 Million Users&lt;/a&gt; — Reached 100 million users in 2 months, faster than TikTok and Instagram&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://epoch.ai/gradient-updates/after-the-chatgpt-moment-measuring-ais-adoption" rel="noopener noreferrer"&gt;AI Adoption Speed Compared with Historical Technologies&lt;/a&gt; — Epoch AI research: 70% US household adoption took 40 years in 1900, shrinking to 17 years by 2000&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-max-plus-395.html" rel="noopener noreferrer"&gt;AMD Ryzen AI Max+ 395 (Strix Halo) Official Specs&lt;/a&gt; — 16 Zen 5 cores, Radeon 8060S, up to 128GB LPDDR5X unified memory (up to 96GB allocatable to GPU)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://videocardz.com/newz/amd-ryzen-ai-max-strix-halo-processors-now-available-for-standalone-purchase-in-china" rel="noopener noreferrer"&gt;Ryzen AI Max+ 395 Out of Stock and Rising in Price in China&lt;/a&gt; — Current tight supply situation for Strix Halo standard chips in China's retail market&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/non-coder-award" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/non-coder-award&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>reflections</category>
    </item>
    <item>
      <title>Same /goal Feature, Two Agent Personalities</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 18 May 2026 04:59:32 +0000</pubDate>
      <link>https://dev.to/skyguan92/same-goal-feature-two-agent-personalities-5315</link>
      <guid>https://dev.to/skyguan92/same-goal-feature-two-agent-personalities-5315</guid>
      <description>&lt;p&gt;I've been using Codex's &lt;code&gt;/goal&lt;/code&gt; for weeks, and my token consumption has climbed another notch. Claude Code added the feature in its May 12 2.1.139 release—straight to stable, not experimental. I had a few tasks that Codex never quite managed to finish, so I moved them over to try.&lt;/p&gt;

&lt;p&gt;The contrast was stark. Same paradigm, nearly identical loop, yet the two models produced completely different results.&lt;/p&gt;

&lt;p&gt;I'm writing this partly to think it through, partly because it's worth sharing. &lt;code&gt;/goal&lt;/code&gt; isn't so much a feature as a new way of working. The form looks identical, but when the model's personality differs, the practical reality is entirely different.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Codex: Heads Down, No Questions, Never Quits
&lt;/h2&gt;

&lt;p&gt;Let me start with Codex as a baseline.&lt;/p&gt;

&lt;p&gt;Codex CLI's &lt;code&gt;/goal&lt;/code&gt; appeared as an experimental feature in 0.128.0. I've been using it since then and wrote about it previously. The real shift has been mental: I actually started believing that "letting the agent run" really works.&lt;/p&gt;

&lt;p&gt;It doesn't interrupt you. When running &lt;code&gt;/goal&lt;/code&gt;, Codex almost never calls subagents; it works inline unless I explicitly tell it to delegate. Compaction works better than I expected. After compressing, it picks up from the previous round without major information loss, and doesn't suddenly get dumber as it pushes forward. Most importantly, it's stubborn. It almost never tells me a goal is unachievable. Even when it hits a wall, it tries another angle, then another, until the token budget runs out. I've tested this repeatedly. I've left three independent &lt;code&gt;/goal&lt;/code&gt; sessions running overnight; in the morning, most are still on track.&lt;/p&gt;

&lt;p&gt;The context window is a genuine weak spot. Codex defaults to 400K under GPT-5.5. OpenAI balanced pricing and throughput there, while the API offers the full 1M. Claude Code defaults to 1M. But even with only 400K, Codex runs remarkably stable under &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Claude Code: Beautiful Opening, Then What?
&lt;/h2&gt;

&lt;p&gt;On May 12, Anthropic dropped &lt;code&gt;/goal&lt;/code&gt;, Agent View, &lt;code&gt;/bg&lt;/code&gt;, &lt;code&gt;/loop&lt;/code&gt;, and &lt;code&gt;/batch&lt;/code&gt; all at once. My first thought was "finally." Codex had been iterating on this for several versions; Claude Code felt a bit slow to catch up.&lt;/p&gt;

&lt;p&gt;I moved the tasks Codex couldn't crack over to Claude Code and started &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It started strong. Claude Code immediately spun up subagents, laid out plans, and orchestrated context. It looked far more ambitious than Codex. My expectations immediately rose. With an opening like this, it should outperform Codex.&lt;/p&gt;

&lt;p&gt;But as it ran, issues cropped up.&lt;/p&gt;

&lt;p&gt;The first thing that made me frown: it kept popping up to ask me to make choices. This is usually one of Claude Code's likable traits. Faced with a judgment call, it doesn't just plow ahead. It stops to align with you, asking which of directions A, B, or C you prefer. And the questions are usually on point. But under &lt;code&gt;/goal&lt;/code&gt;, this is a bug, not a feature. The whole point of &lt;code&gt;/goal&lt;/code&gt; is "you set the goal, I run myself, don't interfere." The model should own every intermediate judgment. When it pops out with questions, those hours of freed-up time are immediately lost. If you step away, it just sits there waiting for you to come back.&lt;/p&gt;

&lt;p&gt;More surprisingly, it proactively tells me it can't achieve the goal. Then it actually fails the goal. Sometimes after just a few dozen minutes. The reason is usually that the task seems too large for the session, or that there are fundamental blockers. When I tell it to continue, it reluctantly pushes forward a bit, then does it again.&lt;/p&gt;

&lt;p&gt;Third: it gets dumber after compaction. A 1M context window sounds huge, but Anthropic themselves have admitted that performance degrades over long runs. Worse is the compression step. After each compaction, Claude Code often seems to have forgotten everything that came before. The original plan, the pitfalls already encountered, the original context—all have to be pieced back together. Codex doesn't suffer from compaction nearly as badly.&lt;/p&gt;

&lt;p&gt;These three issues combined make long-horizon tasks unstable in Claude Code's &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. It's Not Just My Impression
&lt;/h2&gt;

&lt;p&gt;At first I thought it was my usage. Then I looked around and realized Opus 4.7's laziness was already common knowledge.&lt;/p&gt;

&lt;p&gt;Opus 4.7 was released on April 16. Within 48 hours, a Reddit thread titled "Opus 4.7 is not an upgrade but a serious regression" got over 2,300 upvotes. AMD's AI director publicly complained that Claude Code had become "dumber and lazier." Screenshots were everywhere. Someone posted a conversation where Claude itself replied, "I was acting lazily."&lt;/p&gt;

&lt;p&gt;Anthropic later published a postmortem, admitting that on April 16 they had added a "reduce verbosity" instruction to the system prompt. This instruction, along with a few other changes, dragged down coding quality. On April 20 they rolled it back. But my sense is that after the rollback, Opus 4.7's laziness only eased slightly. It didn't fully recover. The RL layer had already internalized this tendency. You can't fix that by tweaking a system prompt.&lt;/p&gt;

&lt;p&gt;In extended continuous operation like &lt;code&gt;/goal&lt;/code&gt;, this laziness gets amplified. A lazy model might get away with it on short tasks. Put it on a long task, and it will find all sorts of seemingly reasonable excuses to fail itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. These Past Few Months, We've Been Doing the Same Thing
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/goal&lt;/code&gt; didn't appear out of nowhere. It's the culmination of months of exploration.&lt;/p&gt;

&lt;p&gt;Before the Lunar New Year, I was already tinkering with something similar. At the time, I was doing stability testing for AIMA (our model management platform). The core idea was to have AI simulate real users running tests repeatedly to improve stability. The most naive attempt was using the terminal's built-in task mechanism. I set up 10 tasks, each running for a long time.&lt;/p&gt;

&lt;p&gt;This path died quickly. Each task was still in the same session, and models don't hold up well in long sessions. Within a few rounds, things destabilized, and no amount of prompt tuning could save it.&lt;/p&gt;

&lt;p&gt;Next I looked at a two-layer architecture. At the time, Kilo Code was pushing a feature called Orchestrator Mode, previously known as Boomerang Tasks, inherited from Roo Code. The logic was sound: an outer orchestrator manages tasks, delegates each subtask to an independent subagent running in its own context, then collects the results.&lt;/p&gt;

&lt;p&gt;I tried a round with several cost-effective models available at the time. Zhipu's model performed slightly better and could push through long tasks for a while. MiniMax's was more comical: it started writing code at the orchestrator layer itself and never delegated, so the two-layer architecture simply failed on it. I thought about this for a while afterwards. It didn't seem like a harness adaptation issue. More likely, the model itself lacked the sense that it's the lead and should delegate.&lt;/p&gt;

&lt;p&gt;In February, Claude Code shipped Agent Teams alongside Opus 4.6. It was experimental, requiring the &lt;code&gt;CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS&lt;/code&gt; environment variable to enable. One session acts as team lead, dispatching other subagents to complete tasks in fresh context windows. This was essentially Kilo's architecture as an official implementation. I was genuinely impressed when I tried it. Long tasks could run for two or three hours without crashing.&lt;/p&gt;

&lt;p&gt;But after one compaction, the team lead side fell apart. Previously dispatched subagents couldn't be found, so it would redeploy a new round, task lists got misaligned, and tokens burned fast. The two-layer architecture itself suffered information decay. Context shuttled back and forth between layers, losing a bit each time.&lt;/p&gt;

&lt;p&gt;Then came Ralph, full name Ralph Wiggum. Australian developer Geoffrey Huntley built it at the end of 2025. The logic was so simple it was almost suspicious: a bash while-true loop, repeatedly feeding the same prompt file to an agent until the goal is achieved. I tried to test its tmux version at the time, hit some snags, and shelved it.&lt;/p&gt;
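&lt;p&gt;A minimal sketch of that loop, in Python rather than the original bash. The &lt;code&gt;run_agent&lt;/code&gt; callable, the &lt;code&gt;GOAL_ACHIEVED&lt;/code&gt; marker, and the iteration cap are all hypothetical stand-ins for illustration; they are not Ralph's actual interface or any real CLI's flags. In practice &lt;code&gt;run_agent&lt;/code&gt; would shell out to an agent CLI; here a fake agent stands in so the sketch runs on its own.&lt;/p&gt;

```python
import tempfile
from pathlib import Path
from typing import Callable


def ralph_loop(prompt_file: Path, run_agent: Callable[[str], str],
               done_marker: str = "GOAL_ACHIEVED", max_iterations: int = 50) -> int:
    """Feed the same prompt file to a fresh agent run until it reports success.

    Returns the number of iterations used. `run_agent` and `done_marker`
    are hypothetical; a real setup would define its own completion signal.
    """
    for iteration in range(1, max_iterations + 1):
        # Re-read the file every round: progress lives on disk, not in context.
        prompt = prompt_file.read_text()
        output = run_agent(prompt)
        if done_marker in output:
            return iteration
    raise RuntimeError("goal not reached within the iteration budget")


def make_fake_agent(succeed_on: int) -> Callable[[str], str]:
    """Stand-in agent that reports success on its Nth call."""
    calls = {"n": 0}

    def fake(prompt: str) -> str:
        calls["n"] += 1
        return "GOAL_ACHIEVED" if calls["n"] == succeed_on else "still working"

    return fake


with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "PROMPT.md"
    prompt_path.write_text("Make the test suite pass.")
    rounds = ralph_loop(prompt_path, make_fake_agent(3))

print(f"finished after {rounds} rounds")
```

&lt;p&gt;The design choice worth noticing is that each round starts from a fresh context and the same prompt, so nothing depends on the session surviving a long run.&lt;/p&gt;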

&lt;p&gt;Ralph caught on extremely fast. It's the most direct inspiration for the &lt;code&gt;/goal&lt;/code&gt; product line. Today, Anthropic has absorbed Ralph as an official Claude Code plugin, parked under &lt;code&gt;plugins/ralph-wiggum/&lt;/code&gt; in the repo. Kilo Code's Orchestrator Mode, conversely, has been officially marked deprecated. The reason given: "the main agent can now delegate directly to subagents, so a dedicated orchestrator is no longer needed."&lt;/p&gt;

&lt;p&gt;Hand-rolled terminal tasks, to Kilo Orchestrator, to Claude Code Agent Teams, to Ralph going viral, to Codex shipping &lt;code&gt;/goal&lt;/code&gt;, to Claude Code shipping &lt;code&gt;/goal&lt;/code&gt;, to Ralph being absorbed and Kilo Orchestrator deprecated. The evolutionary thread of these past few months is clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Codex Dives In; Claude Code Keeps Looking Up
&lt;/h2&gt;

&lt;p&gt;Back to the models themselves.&lt;/p&gt;

&lt;p&gt;After running both, I have a fairly solid judgment. Codex is the "local" faction. Claude Code is the "global" faction.&lt;/p&gt;

&lt;p&gt;Optimization has a concept called local optima. Picture the search space as hilly terrain: start at one point and keep walking downhill, and you may settle into a local minimum while a deeper valley sits just over the next ridge. I've watched Codex fall into these local optima repeatedly during &lt;code&gt;/goal&lt;/code&gt;. It polishes one direction, does this and that, circles back, thinks it's moving forward, but is actually treading water. Its heads-down approach is usually a strength. In these moments it becomes a weakness.&lt;/p&gt;
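&lt;p&gt;The downhill walk can be sketched with plain gradient descent on a toy one-dimensional function. The function, learning rate, and starting points are arbitrary choices for illustration and say nothing about how either model actually searches.&lt;/p&gt;

```python
def f(x: float) -> float:
    """Toy landscape with two valleys of different depth."""
    return x**4 - 3 * x**2 + x


def grad(x: float) -> float:
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def descend(x: float, learning_rate: float = 0.01, steps: int = 2000) -> float:
    """Walk downhill from x; ends in whichever valley the start point drains into."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


stuck = descend(2.0)    # starts on the right slope, settles in the shallower valley
best = descend(-2.0)    # starts on the left slope, reaches the deeper valley

print(f"from  2.0: x = {stuck:.3f}, f(x) = {f(stuck):.3f}")
print(f"from -2.0: x = {best:.3f}, f(x) = {f(best):.3f}")
```

&lt;p&gt;Both runs follow the same rule and both stop at a point where no small step helps. Only the starting point decides which valley you end up in, which is why stepping back to reconsider the direction can beat more polishing.&lt;/p&gt;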

&lt;p&gt;Claude Code is different. It performs large-span reflection and validation, proactively asking whether its current direction is right. I've repeatedly seen it jump out of what looked like a converging direction, saying "wait, the root of this problem might not be here, I need to reconsider," and then actually find a better path.&lt;/p&gt;

&lt;p&gt;This global view is Claude Code's strength. For complex tasks lasting one to two hours and requiring judgment, reflection, and cross-module coordination, I still think Claude Code outperforms Codex.&lt;/p&gt;

&lt;p&gt;But this global view doesn't buy endurance. It can't run long under &lt;code&gt;/goal&lt;/code&gt;, and can't deliver stable 24-hour unattended output. An imperfect analogy: Codex is an intern who can grind for 12 hours straight, occasionally drifting off course. Claude Code is a senior engineer with good judgment, but he needs to check in every 40 minutes, or decides after 30 minutes that this is too hard and he's out. Which is better suited for &lt;code&gt;/goal&lt;/code&gt;? The answer is obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Form Converges, Training Diverges
&lt;/h2&gt;

&lt;p&gt;After running this comparison, I have an additional read on where coding agents are heading in the coming months. Harnesses are rapidly converging, but model personality differences will become increasingly prominent.&lt;/p&gt;

&lt;p&gt;Boris Cherny (creator of Claude Code) has been saying that in the future, a harness might be just 100 lines of code. I believe this even more now. Once the &lt;code&gt;/goal&lt;/code&gt; paradigm converges, the outer structure of coding agents will get thinner and thinner. A loop, a set of tools, a goal. That's enough.&lt;/p&gt;

&lt;p&gt;What will truly determine the gap is the model's personality within this loop. Whether it's willing to put its head down and work. Whether it keeps popping out to align with humans. Whether its state survives compaction. Whether it can jump out when stuck in a wrong direction. When it hits a wall, does it try again, or say it can't do the goal and bail?&lt;/p&gt;

&lt;p&gt;None of these can be fixed with prompting. They're set during training.&lt;/p&gt;

&lt;p&gt;OpenAI and Anthropic have already trained distinctly different model personalities for long-horizon tasks. Codex seems to have been trained into "never give up, hit the wall and try again." Claude Code seems trained to "report frequently, align frequently, reflect frequently." That's endearing in interactive scenarios, but fatal under &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the short term, this divergence is hard to bridge. Even after Anthropic rolled back that verbosity system prompt, Opus 4.7's laziness only eased. It didn't fully recover. RL internalized it. You can't fix that by changing outer prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Choosing an Agent Is Increasingly Like Choosing a Partner
&lt;/h2&gt;

&lt;p&gt;At this point, the way I use &lt;code&gt;/goal&lt;/code&gt; has changed.&lt;/p&gt;

&lt;p&gt;I no longer start by asking which tool is stronger. Instead, I ask: which model's personality fits this task?&lt;/p&gt;

&lt;p&gt;For iterations lasting over six hours, with a clear goal and low trial-and-error cost, I just fire up Codex &lt;code&gt;/goal&lt;/code&gt;. For architectural judgment, cross-module decisions, or possible mid-course direction changes, I use Claude Code &lt;code&gt;/goal&lt;/code&gt;, but I check back every 30 to 60 minutes, mentally prepared for it to pop out with questions. For truly unattended 24-hour runs, it has to be Codex, and the task direction needs to be nailed down upfront. If it's just a single hard problem requiring global vision, I don't use &lt;code&gt;/goal&lt;/code&gt; at all. I use Claude Code in normal mode and knock it out in 30 minutes.&lt;/p&gt;
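
&lt;p&gt;These rules of thumb can be written down as a small lookup. A rough sketch in Python; the &lt;code&gt;Task&lt;/code&gt; fields, the order in which the rules are checked, and the returned strings are my own shorthand for illustration, not anything either tool exposes.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical description of a task; field names are illustrative only."""
    hours: float = 0.0
    clear_goal: bool = False
    cheap_retries: bool = False   # low trial-and-error cost
    needs_judgment: bool = False  # architecture, cross-module calls, global vision
    unattended: bool = False      # nobody checks in for roughly 24 hours
    single_problem: bool = False  # one hard problem rather than a long iteration

def pick_agent(task):
    """Apply the rules of thumb in one assumed priority order."""
    if task.single_problem and task.needs_judgment:
        return "Claude Code, normal mode"
    if task.unattended:
        return "Codex /goal, direction fixed upfront"
    if task.needs_judgment:
        return "Claude Code /goal, check back every 30-60 min"
    if task.hours > 6 and task.clear_goal and task.cheap_retries:
        return "Codex /goal"
    return "no rule matched, decide by hand"

print(pick_agent(Task(hours=8, clear_goal=True, cheap_retries=True)))
print(pick_agent(Task(hours=3, needs_judgment=True)))
```

&lt;p&gt;The point is not the function itself. The inputs are all about the task and the model's temperament; none of them is about UI or pricing.&lt;/p&gt;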

&lt;p&gt;A few months ago, choosing an agent meant choosing UI, community, pricing. Now it's more about choosing a model personality.&lt;/p&gt;

&lt;p&gt;Next-generation models, whether from Anthropic or OpenAI, will definitely be trained to fix the other side's weakness. Codex will try to add global vision; Claude Code will try to add endurance. In the short term, though, this personality divergence remains real, and it significantly affects how much value you can extract from &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The biggest effect of &lt;code&gt;/goal&lt;/code&gt; is that it amplifies a model's true personality into 24 hours of continuous output. The one with the steadier personality wins this round.&lt;/p&gt;

&lt;p&gt;Right now, Codex leads by half a step. But only half a step.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://explainx.ai/blog/claude-code-goal-command-long-running-agents-2026" rel="noopener noreferrer"&gt;Claude Code 2.1.139 adds /goal command — explainx.ai&lt;/a&gt;: Claude Code &lt;code&gt;/goal&lt;/code&gt; launch notes, May 12, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.geeky-gadgets.com/claude-code-agent-view-update/" rel="noopener noreferrer"&gt;Claude Code Agent View, Goal Command, and Background Sessions Update — Geeky Gadgets&lt;/a&gt;: Detailed overview of Claude Code 2.1 features&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://devinterrupted.substack.com/p/inventing-the-ralph-wiggum-loop-creator" rel="noopener noreferrer"&gt;Inventing the Ralph Wiggum Loop — Dev Interrupted&lt;/a&gt;: Geoffrey Huntley on inventing Ralph&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md" rel="noopener noreferrer"&gt;Ralph Wiggum 官方 Claude Code plugin — GitHub&lt;/a&gt;: Anthropic has absorbed Ralph as an official plugin&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kilo.ai/docs/code-with-ai/agents/orchestrator-mode" rel="noopener noreferrer"&gt;Kilo Code Orchestrator Mode (Deprecated)&lt;/a&gt;: Current status of Kilo Code Orchestrator&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/agent-teams" rel="noopener noreferrer"&gt;Orchestrate teams of Claude Code sessions — Claude Code Docs&lt;/a&gt;: Agent Teams official documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deepakness.com/raw/claude-code-agent-teams/" rel="noopener noreferrer"&gt;Claude Code experimental agent teams — DeepakNess&lt;/a&gt;: Agent Teams release notes alongside Opus 4.6&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.buildfastwithai.com/blogs/claude-opus-4-7-regression-explained-2026" rel="noopener noreferrer"&gt;Claude Opus 4.7 Regression Explained — buildfastwithai&lt;/a&gt;: Opus 4.7 regression and community feedback&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://shimin.io/journal/opus-4-7-just-lazy/" rel="noopener noreferrer"&gt;Opus 4.7 isn't dumb, it's just lazy — Shimin Zhang&lt;/a&gt;: Analysis of Opus 4.7's laziness issue&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/april-23-postmortem" rel="noopener noreferrer"&gt;An update on recent Claude Code quality reports — Anthropic Engineering&lt;/a&gt;: Anthropic official postmortem on rolling back the verbosity system prompt&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openai/codex/issues/19464" rel="noopener noreferrer"&gt;GPT-5.5 Codex 400K context window — GitHub Issue&lt;/a&gt;: Codex 400K context window limit explained&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newsletter.pragmaticengineer.com/p/building-claude-code-with-boris-cherny" rel="noopener noreferrer"&gt;Boris Cherny on Claude Code's future — Pragmatic Engineer&lt;/a&gt;: The "100 lines of code" prediction&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original article: &lt;a href="https://guanjiawei.ai/en/blog/goal-two-personalities" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/goal-two-personalities&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codingagent</category>
      <category>codex</category>
      <category>claudecode</category>
    </item>
  </channel>
</rss>
