<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: guanjiawei</title>
    <description>The latest articles on DEV Community by guanjiawei (@skyguan92).</description>
    <link>https://dev.to/skyguan92</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3788265%2Ff93aaebd-c44b-447a-b582-cc297747f93b.jpeg</url>
      <title>DEV Community: guanjiawei</title>
      <link>https://dev.to/skyguan92</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/skyguan92"/>
    <language>en</language>
    <item>
      <title>Models Keep Getting Stronger, but 'Strongest' Has No Single Answer</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Wed, 03 Jun 2026 10:09:05 +0000</pubDate>
      <link>https://dev.to/skyguan92/models-keep-getting-stronger-but-strongest-has-no-single-answer-2ec0</link>
      <guid>https://dev.to/skyguan92/models-keep-getting-stronger-but-strongest-has-no-single-answer-2ec0</guid>
      <description>&lt;p&gt;June is shaping up to be another packed month for model releases. Opus 4.8 dropped at the end of May, MiniMax's M3 landed a couple of days ago, GPT 5.6 is supposedly around the corner, and some are already waiting on DeepSeek's next drop. It looks like we'll see a new model every few days. Pretty lively.&lt;/p&gt;

&lt;p&gt;But for the past two days, what I've actually been thinking about is a friend's experience with models.&lt;/p&gt;

&lt;p&gt;He started out using models to build small things—writing web pages, making little tools and plugins. He was pretty excited at first, telling me how amazing today's models are. He more or less picked a decent domestic model at random and found it more than enough. He couldn't even imagine where else models could get stronger; they already worked so well.&lt;/p&gt;

&lt;p&gt;Then his work got more complex. He moved from small tools to trying to build an auto-editing tool, video cropping and the like. That's when the problems started.&lt;/p&gt;

&lt;p&gt;The model told him it was done. He said okay, tried it, no dice. A moment later it said this time it was really done. He tried again, still no good. Back and forth for several rounds.&lt;/p&gt;

&lt;p&gt;He couldn't tell anymore: part of him felt he was getting better at working with the model, that he needed to give more guidance and try different approaches; part of him started wondering if the model itself just wasn't cutting it, whether he should switch to something like Claude Opus.&lt;/p&gt;

&lt;p&gt;This pattern is all too common. Behind it is something a lot of people haven't caught on to yet: model strength is forking in different directions. A single score no longer tells the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scores Bunch at the Ceiling, Real-World Feel Diverges
&lt;/h2&gt;

&lt;p&gt;First, the weird state of things: on mainstream benchmarks, top models score terrifyingly high, all squeezed into a narrow band.&lt;/p&gt;

&lt;p&gt;Take GPQA. It's graduate-level, so hard that the PhD experts they brought in only scored around 65%. Yet top models now routinely hit 92% to 94%, bunched together. Older benchmarks like MMLU were surpassed by pretty much everyone long ago, all scoring over 90%. Hard problems aren't hard anymore; scores have hit the ceiling, and you can't tell models apart.&lt;/p&gt;

&lt;p&gt;So benchmark makers have to keep inventing harder tests. The new Humanity's Last Exam states it plainly: it was created because models had exceeded 90% on MMLU and the old questions weren't enough anymore. One study looked at sixty mainstream benchmarks and found that nearly half are already highly saturated—top models are "statistically indistinguishable" on them.&lt;/p&gt;

&lt;p&gt;But when you actually use them, the difference in feel is absurd. I &lt;a href="https://guanjiawei.ai/blog/stay-on-the-table" rel="noopener noreferrer"&gt;wrote in my last post&lt;/a&gt; how Opus 4.8 kept letting me down on engineering and research tasks—work that later all moved to GPT 5.5. By the scores, the two are close; in practice, worlds apart.&lt;/p&gt;

&lt;p&gt;The ARC-AGI suite is a perfect example. On the old version, top models had already saturated at 96%. Switch to the harder ARC-AGI-2, and the same models immediately show their true colors: GPT 5.5 still manages 85%, while Opus 4.8 drops to barely over 70%. Switch again to ARC-AGI-3, which requires actual interaction and exploration, and almost everyone flatlines to zero.&lt;/p&gt;

&lt;p&gt;So benchmarks are still useful. It's just that tests humans can define and grade are becoming less and less able to tell models apart. To understand why, you have to look at training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving the Hardest Problems vs. Reliably Doing Messy Work
&lt;/h2&gt;

&lt;p&gt;The main technique making models stronger right now is called "verifiable rewards." In short: pick hard problems with standard answers that machines can automatically grade, and use them for reinforcement learning. Math and code are the classic examples. Right answer gets points, wrong gets zero, rinse and repeat.&lt;/p&gt;

&lt;p&gt;The DeepSeek-R1 paper puts it clearly: math problems are verified with rules, code is thrown straight into compilers to run test cases. They specifically note they avoided neural-network-based reward models, because those are too easy for models to game. OpenAI's o series follows the same playbook. It's highly effective; this is exactly how models learned to solve hard problems.&lt;/p&gt;
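
&lt;p&gt;To make "right answer gets points, wrong gets zero" concrete, here is a minimal sketch of a rule-based reward for math answers. It is an illustration, not DeepSeek's or OpenAI's actual code: the function names and the convention that the final answer sits inside a boxed expression are assumptions made for the example.&lt;/p&gt;

```python
import re


def extract_final_answer(completion):
    """Return the text inside the last \\boxed{...}, or None if there is none.

    The boxed-answer convention is an assumption made for this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion, reference):
    """Binary rule-based reward: 1.0 on an exact match with the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


# A completion whose boxed answer matches the reference earns the full reward.
print(verifiable_reward(r"So the result is \boxed{42}.", "42"))
```

&lt;p&gt;The grader needs a reference answer to compare against, so a check like this only works for tasks that come with one.&lt;/p&gt;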

&lt;p&gt;But it has one characteristic: what it excels at is taking the "hardest problems humans can define and grade" and grinding them out. That's an entirely different capability from taking a fuzzy, not-that-hard but very real task and getting it done reliably in one go.&lt;/p&gt;

&lt;p&gt;My friend's editing tool is the latter. The task isn't extremely difficult, but the intent is fuzzy. The model has to break it down on its own, and it has to get it done cleanly in one go. A model that can solve Olympiad problems may not handle this kind of messy work cleanly in one shot. It might go in circles, need three rounds of back-and-forth, and finally say "I'm done" when it isn't. Conversely, a model that's great at messy work might completely choke when you throw a really hard problem at it.&lt;/p&gt;

&lt;p&gt;These are two directions of capability, each going its own way. You can't rank them on a single line.&lt;/p&gt;

&lt;p&gt;The trouble is, ninety percent of people need the latter in daily life. They want to take a poorly specified task and get it done reliably, without hassle. Yet the score we use to rank models measures almost entirely the former. It's completely normal for "highest score" and "most useful for me" to not line up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Another Dimension: Exploration
&lt;/h2&gt;

&lt;p&gt;The first two types are still in the world of standard answers: either solve a gradable hard problem, or finish a verifiable task. The truly difficult one is the third.&lt;/p&gt;

&lt;p&gt;When my friend got stuck, I thought of another class of problems. Like driving toward an intersection with a traffic light ahead. Do you go straight, or weave through the middle? There's no standard answer; you have to find your own direction in the ambiguity. Exploring a domain people haven't clearly defined, or don't even know the answer to, is a completely different capability.&lt;/p&gt;

&lt;p&gt;Benchmarks can't measure this capability at all. The whole premise of evaluation is having a standard answer, something gradable right or wrong. But exploration has no right or wrong at all, only efficiency. Can you fish out something new in the ambiguity, use it to move forward, and push out a boundary that didn't exist before?&lt;/p&gt;

&lt;p&gt;It's also precisely the blind spot of the "verifiable rewards" approach. Research has already pointed out that open-ended tasks have no clear ground truth to begin with, so you can't even construct a reward, and the method gets no traction. Some have even found that this training approach doesn't necessarily give models new capabilities, and may instead narrow their exploration surface, with capability ceilings capped by the base model.&lt;/p&gt;

&lt;p&gt;The result is that a model great at exploration, thrown into a cage with clear standard answers, might seem a bit stupid. A model that tests incredibly well may not possess exploration ability at all. In my own experience, GPT and Claude show the clearest difference on this dimension.&lt;/p&gt;

&lt;p&gt;And this dimension happens to be the most important. Because truly valuable things often start without standard answers. Yet it's the hardest to measure, and the hardest to train.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chat Era Already Ran This Course
&lt;/h2&gt;

&lt;p&gt;Model capabilities forking along different dimensions and settling into tiers isn't new. The chatbot era ran the full course.&lt;/p&gt;

&lt;p&gt;Back then, everyone also thought for a while that the biggest model was the strongest. But they quickly discovered that for chatting specifically, the biggest model wasn't much better. In 2023, the LMSYS Chatbot Arena leaderboard dedicated a section to "Smaller Models Are Competitive": a 13B Vicuna ranked in the top five, its Elo score even beating Google's PaLM 2. 7B models also squeezed into the top ten, trading blows with models twice their size.&lt;/p&gt;

&lt;p&gt;Later studies echoed this: scaling models from tens of millions to hundreds of billions of parameters, all the way up to GPT-4 class, showed gains topping out quickly on softer tasks. Models with a few tens of billions of parameters weren't far from frontier models.&lt;/p&gt;

&lt;p&gt;In other words: for chatting, for emotional support, the marginal returns to scale are low. A few tens of billions is enough; scaling up to hundreds of billions is pure waste.&lt;/p&gt;

&lt;p&gt;So the market sorted itself out. If you want emotional value, someone to chat with, a smaller model that sounds human is enough. Only when you need serious research or hardcore engineering do top-tier models come into play. Models sorted themselves into different price-performance tiers by use case.&lt;/p&gt;

&lt;p&gt;Today's round is the same plot, replaying at a higher capability level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing: Don't Ask Which Is Strongest—Ask Which Dimension You Need
&lt;/h2&gt;

&lt;p&gt;Back to my friend's dilemma: "should I switch to a stronger model?" He's asking the wrong question.&lt;/p&gt;

&lt;p&gt;There is no "stronger" that simultaneously covers solving hard problems, doing messy work, and exploring. These three things are diverging onto different models.&lt;/p&gt;

&lt;p&gt;New-generation models are still pushing forward, of course. But the progress they fight for increasingly lands on "the hardest problems humans can define and grade," which is exactly where most people can't perceive it. So you see a split: leaderboard scores keep climbing generation after generation, yet most people just feel "it's been good enough for a while, can't see where it's stronger." Neither side is wrong, because what they want are fundamentally different dimensions of capability.&lt;/p&gt;

&lt;p&gt;So stop vaguely asking "which model is strongest." First ask clearly: which dimension of work do you need it to do? Solve a hard problem with an answer, finish a messy task that wasn't clearly specified, or join you in something no one knows the answer to yet.&lt;/p&gt;

&lt;p&gt;"Strongest" is becoming a question without a standard answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model Releases and Timeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-8" rel="noopener noreferrer"&gt;Claude Opus 4.8 Release (2026-05-28) — Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.marktechpost.com/2026/06/01/minimax-releases-minimax-m3-with-msa-architecture-supporting-1m-token-context-native-multimodality-and-agentic-coding/" rel="noopener noreferrer"&gt;MiniMax M3 Release Coverage (2026-06-01) — MarkTechPost&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI GPT-5.5 (Current Official Version; GPT-5.6 Not Yet Released)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;DeepSeek V4 (2026-04-24 Preview) — API Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Saturation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.12022" rel="noopener noreferrer"&gt;GPQA: A Graduate-Level Google-Proof Q&amp;amp;A Benchmark (PhD Expert Baseline ~65%) — arXiv 2311.12022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lastexam.ai/" rel="noopener noreferrer"&gt;Humanity's Last Exam (Official Motivation for Harder Benchmarks: Models Already Exceed 90% on MMLU, etc.)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2602.16763" rel="noopener noreferrer"&gt;Large Model Benchmark Saturation Study: Nearly Half of 60 Benchmarks Highly Saturated, Top Models Statistically Indistinguishable — arXiv 2602.16763&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcprize.org/leaderboard" rel="noopener noreferrer"&gt;ARC-AGI Official Leaderboard (ARC-AGI-1 Saturated, Same Batch of Models Drops Sharply on ARC-AGI-2) — ARC Prize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arcprize.org/arc-agi/2/" rel="noopener noreferrer"&gt;ARC-AGI-2 Design Notes — ARC Prize&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verifiable Rewards, and Its Limits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2411.15124" rel="noopener noreferrer"&gt;Tülu 3: RLVR (Reinforcement Learning with Verifiable Rewards) Proposal and Definition — arXiv 2411.15124&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2501.12948" rel="noopener noreferrer"&gt;DeepSeek-R1: Rule-Based Rewards, Compiler Test Cases, Deliberately Avoiding Neural Reward Models (Due to Reward Hacking Concerns) — arXiv 2501.12948&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/learning-to-reason-with-llms/" rel="noopener noreferrer"&gt;OpenAI o1: Large-Scale Reinforcement Learning for Reasoning — OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2506.00103" rel="noopener noreferrer"&gt;Writing-Zero: Open-Ended, Subjective Tasks Lack Clear Ground Truth, Making Reward Construction Difficult — arXiv 2506.00103&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2601.18533" rel="noopener noreferrer"&gt;Open-Ended Generation "Lacks Clear Standard Answers," Making RLVR Hard to Extend — arXiv 2601.18533&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2504.13837" rel="noopener noreferrer"&gt;This Type of Reinforcement Learning May Narrow the Exploration Surface, with Capability Ceiling Capped by the Base Model — arXiv 2504.13837&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Small Models in the Chat Era&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.lmsys.org/blog/2023-05-25-leaderboard/" rel="noopener noreferrer"&gt;LMSYS Chatbot Arena Leaderboard (2023-05, "Smaller Models Are Competitive": 13B Vicuna Enters Top Five, Elo Beats PaLM 2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pnas.org/doi/10.1073/pnas.2413443122" rel="noopener noreferrer"&gt;PNAS: Diminishing Marginal Returns of Model Scale on Single-Turn Persuasion, Quickly Hitting a Ceiling — pnas.org&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/strongest-no-single-answer" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/strongest-no-single-answer&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>models</category>
      <category>evaluation</category>
      <category>reinforcementlearning</category>
    </item>
    <item>
      <title>Claude 4.8 Let Me Down, But It’s Not Just Claude’s Problem</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 01 Jun 2026 11:12:47 +0000</pubDate>
      <link>https://dev.to/skyguan92/claude-48-let-me-down-but-its-not-just-claudes-problem-36oc</link>
      <guid>https://dev.to/skyguan92/claude-48-let-me-down-but-its-not-just-claudes-problem-36oc</guid>
      <description>&lt;p&gt;The day Claude launched Opus 4.8, I was pretty excited.&lt;/p&gt;

&lt;p&gt;I'd always thought Opus had solid engineering skills. Macro analysis and intent alignment were its strong suits. A few points in the 4.8 release notes caught my eye, so I tried it out on several complex tasks I had going.&lt;/p&gt;

&lt;p&gt;The results were one disappointment after another. Here are a few examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looks Promising Out of the Gate, Then Goes Off the Rails
&lt;/h2&gt;

&lt;p&gt;Right around then, my Codex credits had run out. Since 4.8 looked solid on paper, I put it on a research task using that flashy feature called ultracode, a dynamic workflow that supposedly auto-orchestrates ultra-long-horizon tasks.&lt;/p&gt;

&lt;p&gt;It started out looking solid. It ran a bunch of checks and got everything set up. Seemed reliable. That's Claude Code in a nutshell: the start always gives you hope.&lt;/p&gt;

&lt;p&gt;Then I let it run for about a full day and night.&lt;/p&gt;

&lt;p&gt;The task was to optimize a performance metric. The baseline was terrible: throughput sat around 0.1 tokens per second. After optimizing for a while, it bumped the number from 0.1 to 0.15 and started celebrating. Look, I improved it by 50%! I did all this work! What a huge achievement! It kept hauling out that basic initial setup to claim credit, writing pages of self-congratulatory fluff.&lt;/p&gt;

&lt;p&gt;The problem is, in the 0.1 to 0.15 range, measuring progress in relative terms is wrong to begin with.&lt;/p&gt;

&lt;p&gt;When performance is that low, your direction is completely wrong. You have to look at absolute values. What does 0.15 tokens/s actually mean? That's how you see how far off it is. Celebrating a "50% improvement" in a setup that fundamentally can't run is like celebrating that you bailed two buckets from a sinking ship.&lt;/p&gt;
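
&lt;p&gt;A quick back-of-the-envelope calculation shows what the absolute numbers look like. The 1,000-token response length is a hypothetical figure picked for illustration; only the 0.1 and 0.15 tokens/s values come from the run described here.&lt;/p&gt;

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Wall-clock seconds needed to emit num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 1000  # hypothetical response length, chosen for illustration

for rate in (0.10, 0.15):
    hours = seconds_to_generate(RESPONSE_TOKENS, rate) / 3600
    print(f"{rate:.2f} tokens/s: {hours:.2f} hours per {RESPONSE_TOKENS} tokens")
```

&lt;p&gt;Under that assumption, the "50% improvement" moves a single response from roughly 2.8 hours to roughly 1.9 hours. Both numbers describe a system that is unusable.&lt;/p&gt;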

&lt;p&gt;Looking back, the pile of documentation it recorded, those so-called results, pretty much all had to be tossed. The direction was wrong. Freeze everything and start over.&lt;/p&gt;

&lt;p&gt;This wasn't even the most surprising part. I'd always had the impression that Claude's macro capabilities were fine, good at analysis and alignment, but concrete execution, especially on research tasks, tended to go sideways. The real red flag was the engineering task that followed.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Pure Engineering Problem, and It Was Clowning Around
&lt;/h2&gt;

&lt;p&gt;It was a cloud service. A probe was reporting errors, and I asked it to diagnose the issue and fix it while it was there. Nothing complicated.&lt;/p&gt;

&lt;p&gt;A couple of things it did left me baffled.&lt;/p&gt;

&lt;p&gt;First, in the middle of analysis it randomly started counting. It had the machine echo a string of numbers. I stared at the screen, maybe a dozen in total. I still have no idea what that was about. Pure token burning.&lt;/p&gt;

&lt;p&gt;The second was even more absurd. The probe's purpose was straightforward: workers send periodic heartbeats returning OK to tell the platform "I'm alive and working." If the heartbeat is abnormal, take the worker offline first. Don't assign it tasks. It discovered heartbeats had a cost and asked if I wanted to cut it. I said sure.&lt;/p&gt;

&lt;p&gt;It changed it to &lt;code&gt;runtime --version&lt;/code&gt;, which returns a version number.&lt;/p&gt;

&lt;p&gt;I laughed out loud. This isn't cutting costs. It's destroying the design intent entirely. A version number only proves the thing is installed. It has nothing to do with whether it can actually work. Effectively, it's saying "it's installed, so go ahead and assign tasks." A model supposedly strong in engineering and alignment pulled this on a problem where the intent was crystal clear.&lt;/p&gt;
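
&lt;p&gt;The difference between the two probes can be shown with a toy model. Everything here is hypothetical: the &lt;code&gt;Worker&lt;/code&gt; class, its &lt;code&gt;healthy&lt;/code&gt; flag, and both scheduling checks are made up for illustration and are not the actual service code.&lt;/p&gt;

```python
class Worker:
    """Hypothetical worker; `healthy` stands in for whether it can really run jobs."""

    def __init__(self, installed, healthy):
        self.installed = installed
        self.healthy = healthy

    def version(self):
        # Succeeds whenever the runtime is installed, even if it cannot do work.
        return "1.0.0" if self.installed else None

    def heartbeat(self):
        # Reports "OK" only when the worker can actually take a task.
        return "OK" if self.installed and self.healthy else "ERROR"


def schedulable_by_version(worker):
    """The replacement probe: passes as long as a version string comes back."""
    return worker.version() is not None


def schedulable_by_heartbeat(worker):
    """The original probe: passes only when the heartbeat reports OK."""
    return worker.heartbeat() == "OK"


# Installed but unable to work: exactly the case the heartbeat exists to catch.
broken = Worker(installed=True, healthy=False)
print(schedulable_by_version(broken))    # True: the platform keeps assigning tasks
print(schedulable_by_heartbeat(broken))  # False: the worker is taken offline
```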

&lt;p&gt;I said forget it, let's revert to the old setup.&lt;/p&gt;

&lt;p&gt;During the revert, something else happened. When looking for a fix, it told me "here are three options below," asking me to pick. But it never listed the three options. It just jumped straight to "I recommend option one." I scrolled back several times. The three options didn't exist at all.&lt;/p&gt;

&lt;p&gt;At this point I was pretty certain: this model genuinely can't be counted on for this kind of task.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I Moved All That Work to GPT 5.5
&lt;/h2&gt;

&lt;p&gt;After these incidents, the direct result was simple: for research and engineering tasks, I no longer trust Claude.&lt;/p&gt;

&lt;p&gt;Trust builds slowly and collapses fast. Earlier I thought it was still usable, but the more I used it, the more I realized asking it to do something was probably a waste of time. Not just a matter of burning a few tokens, but having to redo everything in the end. Now I don't even consider it for these tasks. Everything goes to GPT 5.5.&lt;/p&gt;

&lt;p&gt;GPT 5.5 is genuinely strong. Whether coding or research, it's clearly a notch above every other model.&lt;/p&gt;

&lt;p&gt;This shows up most directly in my account setup: I'm now running 7 Codex accounts, maxing out weekly quotas on all of them, and it's still not enough. On the Claude side, I'm down to one account barely hanging on. My accounts keep getting banned there, so I bought a spare just to hold onto. I never seriously used Google's.&lt;/p&gt;

&lt;p&gt;Seven to one. That's more honest than any benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  But This Isn't Just a Claude Problem
&lt;/h2&gt;

&lt;p&gt;At this point I need to walk that back a bit. If this were just bashing Claude, the post wouldn't be worth reading.&lt;/p&gt;

&lt;p&gt;Step back, and you see the top models have always taken turns in the spotlight.&lt;/p&gt;

&lt;p&gt;Six months ago Gemini was riding high. Everyone was talking about how strong Google was. The last two generations have both felt underwhelming. Hardly anyone mentions them now. Then everyone flocked to Claude, thinking it was the strongest. But look at 4.7 to 4.8. It's been a real letdown. This round, OpenAI has clearly gotten back on its feet. GPT 5.5 is ridiculously strong.&lt;/p&gt;

&lt;p&gt;No single company can stay on top forever. This isn't unique to the AI industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chip Companies' Fate, Replayed by Model Makers at Fast-Forward
&lt;/h2&gt;

&lt;p&gt;I was chatting with a friend about chips the other day. He told me how brutal strategy is in that business. Bet wrong on one generation of chips, and you might be finished.&lt;/p&gt;

&lt;p&gt;NVIDIA nearly died. Its first chip, the NV1, bet on forward texture mapping and quadrilaterals, but the industry standard went with triangles (Microsoft's DirectX). The wrong direction meant no one wanted the product, and the company shrank from about 100 people to 40.&lt;/p&gt;

&lt;p&gt;What saved them was Sega. Sega had commissioned NVIDIA to build the graphics chip for its console. Later both sides realized the direction was wrong, but Sega's Shoichiro Irimajiri still converted that roughly $5 million contract payment into an investment in NVIDIA. Jensen Huang later said this gave them "six months to live," just enough to survive until the RIVA 128 turned things around.&lt;/p&gt;

&lt;p&gt;AMD was the same story. The 2011 Bulldozer architecture was a strategic mistake. Single-core performance was awful, and the company was badly wounded. By July 2015, its stock had crashed to $1.62, one step from bankruptcy. In 2016 it licensed x86 to its Chinese joint venture Tianjin Haiguang for roughly $293 million, largely to stop the bleeding. It didn't recover until the Zen architecture arrived in 2017.&lt;/p&gt;

&lt;p&gt;Both are now dominant players. But looking back, no one can guarantee they'll have the last laugh. Chip cycles are long and capital-intensive. One generation every three to five years, and one wrong move means massive pressure, or even getting knocked out entirely.&lt;/p&gt;

&lt;p&gt;Model companies move much faster. It doesn't take a whole generation. Maybe just a few model iterations, a roughly six-month window of consistently missing the mark, and a company can be pushed off the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chinese Model Makers Have Already Completed a Full Rotation
&lt;/h2&gt;

&lt;p&gt;Chinese model companies have already played out this entire cycle.&lt;/p&gt;

&lt;p&gt;The first to break out and gain recognition as a top-tier player was Zhipu. In 2022, its GLM-130B was the only large model from Asia selected for Stanford's HELM evaluation, and ChatGLM was among the first open-source models in China. For a while, it was unrivaled.&lt;/p&gt;

&lt;p&gt;Then it fell behind. By late 2024, its flagship GLM-4-Plus had been overtaken by DeepSeek-V3 and Tongyi Qianwen on public benchmarks like SuperCLUE, dropping out of the top tier. At the time, a lot of people were surprised.&lt;/p&gt;

&lt;p&gt;Then on January 20, 2025, DeepSeek-R1 burst onto the scene. Six days later its app hit #1 on the US download charts, along with 51 other countries. On January 27, it directly tanked NVIDIA's stock, wiping out $589 billion in market cap in a single day, the largest single-stock single-day loss in US market history. During that stretch, I felt like a bunch of model vendors were on the verge of collapse.&lt;/p&gt;

&lt;p&gt;But Zhipu didn't leave the table. In the second half of 2025, it changed tactics, going fully open-source while narrowing its focus. It released GLM-4.5 and GLM-4.6 back to back, and its reputation clearly recovered. In January 2026, it even listed on the Hong Kong Stock Exchange. From falling behind to bouncing back, the key was staying at the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  As a Side Note: Top Models Are Also Specializing
&lt;/h2&gt;

&lt;p&gt;Beyond taking turns at the top, there's now another dynamic: division of labor.&lt;/p&gt;

&lt;p&gt;For hardcore coding agents and research tasks, OpenAI is in a league of its own, clearly ahead of everyone else. Claude's original strong suit was precisely this area, but after several generations it failed to meet expectations and started regressing. Its strengths have instead shifted to white-collar work: writing, finance and legal, daily tasks, document research, that sort of thing. To be fair, 4.8 is a bit better than 4.7. It sounds more natural, its style moved back toward 4.6, execution is a bit more accurate, and it's actually pretty decent at writing.&lt;/p&gt;

&lt;p&gt;Further toward the fringe you have Doubao and its ilk. Everyone knows it doesn't do serious work, but its emotional value is maxed out and its user base is terrifyingly large. There's a saying that "Claude is the American version of Doubao." I thought it was kind of funny at first, but thinking about it, it points to a different kind of divergence: some models are just good at chatting with you and giving you emotional value. That's also a category of demand.&lt;/p&gt;

&lt;p&gt;So "which one is the strongest" isn't even the question anymore. It depends on what job you need done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts: Don't Bet on Who's Strongest—Bet on Who's Still at the Table
&lt;/h2&gt;

&lt;p&gt;Coming full circle, my conclusion is actually pretty optimistic.&lt;/p&gt;

&lt;p&gt;The lead changes hands. Don't fixate on "whoever's strong now will stay strong," and don't write a company off just because it's temporarily stumbling. As long as you're still at the table, you have a chance to turn things around. Zhipu turned it around. NVIDIA and AMD both turned it around back in the day.&lt;/p&gt;

&lt;p&gt;The real danger is leaving the table.&lt;/p&gt;

&lt;p&gt;Anthropic has indeed been stuck these past few generations. My use of Claude is now basically limited to white-collar tasks. It's reportedly cooking up its next-gen flagship. If that comes out at the same level, things will really get dicey. It's not that any single model is bad, but in this six-month-cycle rhythm, missing the mark for several generations in a row is exactly how you get squeezed off the table.&lt;/p&gt;

&lt;p&gt;But as long as it's still at the table, I'm not too worried. That's just how it works at the table: everyone takes turns.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.sequoiacap.com/podcast/crucible-moments-nvidia/" rel="noopener noreferrer"&gt;Crucible Moments: Nvidia — Sequoia Capital (Jensen Huang recounts the NV1 wrong bet, Sega's $5 million, and "six months to live")&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.acquired.fm/episodes/jensen-huang" rel="noopener noreferrer"&gt;NVIDIA CEO Jensen Huang — Acquired Podcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tomshardware.com/news/amd-joint-venture-partner-banned-us-trade-war,39703.html" rel="noopener noreferrer"&gt;AMD's Chinese joint venture Tianjin Haiguang (including the $293 million licensing fee and 2019 Entity List) — Tom's Hardware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/AMD%E2%80%93Chinese_joint_venture" rel="noopener noreferrer"&gt;AMD–Chinese joint venture — Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cnn.com/2020/03/27/tech/lisa-su-amd-risk-takers" rel="noopener noreferrer"&gt;How Lisa Su brought AMD back from the brink — CNN Business&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2210.02414" rel="noopener noreferrer"&gt;GLM-130B: An Open Bilingual Pre-trained Model (ICLR 2023, the only Asian model selected for Stanford HELM)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.deeplearning.ai/the-batch/zhipu-ai-z-ai-releases-open-weights-glm-4-5-models-that-perform-comparably-to-the-latest-from-claude-and-deepseek" rel="noopener noreferrer"&gt;Zhipu AI open-sources GLM-4.5, performance comparable to the latest Claude and DeepSeek models — The Batch (DeepLearning.AI)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stcn.com/article/detail/3579827.html" rel="noopener noreferrer"&gt;HK$57.9 billion market cap! Zhipu, the world's first large-model stock, lists in Hong Kong (02513.HK) — Securities Times&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news250120" rel="noopener noreferrer"&gt;DeepSeek-R1 Release (2025/01/20 official release page) — DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bloomberg.com/news/articles/2025-01-27/asml-sinks-as-china-ai-startup-triggers-panic-in-tech-stocks" rel="noopener noreferrer"&gt;Nvidia's $589 Billion DeepSeek Plunge Is Largest in Market History — Bloomberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techcrunch.com/2025/01/27/deepseek-displaces-chatgpt-as-the-app-stores-top-app/" rel="noopener noreferrer"&gt;DeepSeek displaces ChatGPT as the App Store's top app — TechCrunch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/stay-on-the-table" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/stay-on-the-table&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ailabs</category>
      <category>industry</category>
      <category>reflections</category>
    </item>
    <item>
      <title>The Hidden Thread in Token Business: Cost Is Set by KV Cache Hits, Not Throughput</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Thu, 28 May 2026 08:15:50 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-hidden-thread-in-token-business-cost-is-set-by-kv-cache-hits-not-throughput-4neo</link>
      <guid>https://dev.to/skyguan92/the-hidden-thread-in-token-business-cost-is-set-by-kv-cache-hits-not-throughput-4neo</guid>
      <description>&lt;p&gt;The more I study the token business lately, the more I feel there's one angle that keeps getting overlooked.&lt;/p&gt;

&lt;p&gt;Over the past year, when people benchmark inference performance, they mainly watch three numbers: absolute throughput, TTFT, and TPOT. How many requests can be batched, how fast the first token comes out, how fast each output token is. That's the standard talking point today.&lt;/p&gt;

&lt;p&gt;But when you actually get down to serving, you find that what really drives token cost isn't throughput. It's whether the KV cache hits.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. A 10× Gap Carved into the Price List
&lt;/h2&gt;

&lt;p&gt;Several model APIs have cut prices recently. Open any pricing sheet and you'll see the input column was split in two a while back: cache hit and cache miss.&lt;/p&gt;

&lt;p&gt;How big is the gap?&lt;/p&gt;

&lt;p&gt;Anthropic charges 0.1× the base input price for cache reads, making them 10× cheaper, and 1.25× (5-minute version) or 2× (1-hour version) to write a cache. DeepSeek V4 Flash charges $0.0028 per million tokens on a cache hit and $0.14 per million tokens on a miss, a 50× difference. On April 26, DeepSeek cut cache hit prices across all models by another 10×.&lt;/p&gt;

&lt;p&gt;At the machine level, the difference comes down to compute. A hit skips prefill for the cached prefix; the machine only processes the new tokens and runs decode. A miss means recalculating from scratch, burning machine time and compute. The gap isn't a few percentage points. It's multiples: 10× on Anthropic's price list, 50× on DeepSeek's.&lt;/p&gt;

&lt;p&gt;The interesting part is this: once cache hit and miss are priced separately, some of the cost is yours to control through design, and the rest is entirely up to the vendor. Split like this, and "which API is cheaper" stops being comparable at the token level. You have to look at actual hit rates.&lt;/p&gt;
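&lt;p&gt;A minimal sketch of why the hit rate, not the list price, decides the bill. It blends the two DeepSeek V4 Flash input prices quoted above by hit rate. The function name &lt;code&gt;blended_input_price&lt;/code&gt; is made up for illustration, and cache-write surcharges are ignored as a simplifying assumption.&lt;/p&gt;

```python
def blended_input_price(hit_rate, hit_price, miss_price):
    """Average price per million input tokens at a given cache hit rate.

    hit_rate is the fraction of input tokens served from cache (0.0 to 1.0).
    Cache-write surcharges are ignored; that is a simplifying assumption.
    """
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# DeepSeek V4 Flash prices quoted in the text, in dollars per million input tokens.
HIT_PRICE = 0.0028
MISS_PRICE = 0.14

for rate in (0.05, 0.50, 0.90):
    price = blended_input_price(rate, HIT_PRICE, MISS_PRICE)
    print(f"hit rate {rate:.0%}: ${price:.4f} per million input tokens")
```

&lt;p&gt;Two users on the same price sheet can end up paying very different effective rates, which is why comparing list prices alone says little.&lt;/p&gt;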

&lt;h2&gt;
  
  
  II. The Biggest Pitfall on the User Side: Model Routing
&lt;/h2&gt;

&lt;p&gt;Let's talk about how users can mess this up.&lt;/p&gt;

&lt;p&gt;Lately people have bought hard into "model routing." Hard tasks go to strong models, easy ones to weak models. It looks like savings on paper.&lt;/p&gt;

&lt;p&gt;My view: switching models mid-session is usually a losing bet.&lt;/p&gt;

&lt;p&gt;The clearest example is Claude Code. You've accumulated 300K of context, then midway decide Opus is too expensive for this step and switch to Sonnet. Claude Code now pops a warning that tells you explicitly: after switching, all previous cache is invalidated, and the next step must cold-start and recalculate. It didn't warn before. After enough complaints, they added it.&lt;/p&gt;

&lt;p&gt;Why does it invalidate? Each model's KV representation is different, so cache can't be reused across models. Opus cache and Sonnet cache are two different things. The session hasn't changed, the cache key hasn't moved, but not a penny is saved on the recalculation cost.&lt;/p&gt;

&lt;p&gt;Run the numbers. Current Opus 4.7 is $5/$25 per million tokens (input/output); Sonnet 4.6 is $3/$15. Sonnet is roughly 40% cheaper than Opus, not the 5× gap of the past. But that preceding 300K of input goes from a cache hit (0.1× price) to a cold calculation (1× price). As an Opus cache read it costs about $0.15; as a cold Sonnet read it costs about $0.90. You save 40% on the model itself and pay 10× on the cache multiplier. Net it out, and that input costs about 6× more.&lt;/p&gt;
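&lt;p&gt;The arithmetic can be checked with a few lines. This is a sketch under stated assumptions: it prices only the 300K tokens of existing context, ignores output tokens and any new input, and uses the prices and multipliers quoted in this post. The helper &lt;code&gt;input_cost&lt;/code&gt; is a made-up name.&lt;/p&gt;

```python
PER_MILLION = 1_000_000


def input_cost(tokens, price_per_million, multiplier=1.0):
    """Dollar cost of processing `tokens` input tokens at a price multiplier."""
    return tokens / PER_MILLION * price_per_million * multiplier


context_tokens = 300_000
opus_input = 5.0        # $/M input, Opus 4.7 figure from the text
sonnet_input = 3.0      # $/M input, Sonnet 4.6 figure from the text
cache_read = 0.1        # cache reads billed at 0.1x base input
cache_write_5m = 1.25   # 5-minute cache writes billed at 1.25x base input

# Stay on Opus: the whole context is a cache read.
stay = input_cost(context_tokens, opus_input, cache_read)
# Switch to Sonnet: the whole context is recomputed cold.
switch_cold = input_cost(context_tokens, sonnet_input)
# Switch and also write a fresh 5-minute cache for the next turn.
switch_with_write = input_cost(context_tokens, sonnet_input, cache_write_5m)

print(f"stay on Opus (cache hit):   ${stay:.3f}")
print(f"switch to Sonnet (cold):    ${switch_cold:.3f}")
print(f"switch plus cache write:    ${switch_with_write:.3f}")
print(f"cold switch vs staying:     {switch_cold / stay:.1f}x")
```

&lt;p&gt;If the switched session also pays the cache-write premium on that context, the gap widens further.&lt;/p&gt;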

&lt;p&gt;Plus in agent workloads, tokens are almost always input-heavy and output-light. Prefill usually runs thousands of tokens per second. Decode runs dozens to just over a hundred. That's two orders of magnitude. The money is basically spent on input. Model routing destroys exactly the savings mechanism on the input side.&lt;/p&gt;

&lt;p&gt;So mainstream agent design today revolves around "context stability." Don't swap models lightly, don't change tool structures, don't touch core prompts like &lt;code&gt;CLAUDE.md&lt;/code&gt; halfway through. One move, and hit rates really do drop from 90% to 5%.&lt;/p&gt;

&lt;h2&gt;
  
  
  III. Claude Code's Solution: Spawn Sub-Agents
&lt;/h2&gt;

&lt;p&gt;So what if part of a task really is better suited to a cheaper model?&lt;/p&gt;

&lt;p&gt;Claude Code's solution is to spawn sub-agents.&lt;/p&gt;

&lt;p&gt;The main session stays on Opus, preserving its hit rate. When you need to explore, batch-process, or run a specific sub-task, you call the Task tool to spin up a new agent. The new agent runs in its own isolated context, can pick a cheaper model, and maintains its own hit rate. When it's done, only a summary is passed back to the main session. The main session's cache isn't touched.&lt;/p&gt;

&lt;p&gt;The precondition for this mechanism is that the sub-task's context needs differ enough from the main task's. If your sub-task happens to feed most of the main session's content into it, that's another cold start inside the sub-agent, and prefill eats up whatever you saved. This takes pretty fine-grained judgment.&lt;/p&gt;
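&lt;p&gt;One rough way to frame that judgment is a single-step cost comparison. The sketch below assumes the Opus and Sonnet prices quoted earlier, a hypothetical 2,000-token output, and a sub-agent that makes exactly one cold call; real sub-agents run many turns and build their own cache, so treat this as an illustration, not a cost model. &lt;code&gt;step_cost&lt;/code&gt; is a made-up helper.&lt;/p&gt;

```python
PER_MILLION = 1_000_000


def step_cost(input_tokens, output_tokens, input_price, output_price,
              input_multiplier=1.0):
    """Cost of one model call: (possibly cached) input plus output."""
    input_part = input_tokens / PER_MILLION * input_price * input_multiplier
    output_part = output_tokens / PER_MILLION * output_price
    return input_part + output_part


OPUS = (5.0, 25.0)     # $/M input, $/M output, figures quoted in the text
SONNET = (3.0, 15.0)
CACHE_READ = 0.1       # cache reads billed at 0.1x base input

main_context = 300_000
step_output = 2_000    # hypothetical output size for the sub-task

# Option A: run the step in the main session, whole context read from cache.
in_main = step_cost(main_context, step_output, *OPUS,
                    input_multiplier=CACHE_READ)
print(f"in main session: ${in_main:.2f}")

# Option B: a sub-agent on the cheaper model, cold start on its own context.
for sub_context in (20_000, 100_000, 250_000):
    in_sub = step_cost(sub_context, step_output, *SONNET)
    cheaper = "sub-agent" if min(in_sub, in_main) == in_sub else "main session"
    print(f"sub-agent with {sub_context} tokens: ${in_sub:.2f} (cheaper: {cheaper})")
```

&lt;p&gt;Under these assumptions the sub-agent only wins while its context stays small relative to the main session's.&lt;/p&gt;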

&lt;h2&gt;
  
  
  IV. Server-Side KV Cache Engineering
&lt;/h2&gt;

&lt;p&gt;How big is the server-side gap? Massive.&lt;/p&gt;

&lt;p&gt;The crudest implementations don't design for cache at all. There's no reuse across users. Routing goes haywire: the request that should hit the machine holding the cache lands elsewhere. All cache capacity has to live in VRAM. In a system like that, no amount of user-side care can save you.&lt;/p&gt;

&lt;p&gt;The mature example is the Mooncake framework from Moonshot AI, Tsinghua University, and Qijing Technology, a KVCache-centric disaggregated architecture. Prefill and decode clusters are separated, and underutilized CPU, DRAM, and SSD resources on GPU nodes are repurposed into a distributed KV cache pool. A KV cache scheduler handles queuing and routing. The paper cites a simulated 525% throughput gain; under real loads, requests handled increased by 75% to 115%.&lt;/p&gt;

&lt;p&gt;The counterexample is Openclaw. This open-source agent took a lot of criticism, mostly because it stumbles at this layer. Its plugin architecture doesn't set a &lt;code&gt;promptCacheKey&lt;/code&gt; by default. Pass it through a third-party proxy and you lose node affinity, so cache hit rate is nearly 0%. The total token volume isn't actually that high, but all input is priced at cache-miss rates, so the bill is ridiculous. About a month ago I looked at its trace: 60K+ input tokens in one request, 0% hit rate, $0.12 a pop. That's what happens when server-side cache has no targeted design.&lt;/p&gt;
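&lt;p&gt;To make "node affinity" concrete, here is a minimal sketch of key-based routing. It is not how Mooncake or any vendor's scheduler works; the node names and the &lt;code&gt;pick_node&lt;/code&gt; function are hypothetical. The idea is only that a stable cache key lets every request in a session land on the node that already holds its KV cache.&lt;/p&gt;

```python
import hashlib


def pick_node(cache_key, nodes):
    """Map a cache key to one node deterministically (simple hash affinity)."""
    digest = hashlib.sha256(cache_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical serving nodes.
nodes = ["gpu-node-0", "gpu-node-1", "gpu-node-2", "gpu-node-3"]

# Requests in the same session carry the same key, so they reach the same node.
first = pick_node("session-42", nodes)
second = pick_node("session-42", nodes)
print(first, second, first == second)
```

&lt;p&gt;Without a stable key, a router has nothing to hash on and requests can scatter across nodes, each one a cold start. Plain modulo hashing also reshuffles keys when the node count changes, so a production router would more likely use consistent hashing or a cache-aware scheduler.&lt;/p&gt;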

&lt;h2&gt;
  
  
  V. The Model Layer Can Also Push in This Direction
&lt;/h2&gt;

&lt;p&gt;Go one layer deeper: the model itself can free up massive room for KV cache.&lt;/p&gt;

&lt;p&gt;DeepSeek has been the most systematic here. MLA (Multi-head Latent Attention) projects KV into a latent vector and back, compressing KV cache volume by over 90%. V3 kept this mechanism. Later they added Native Sparse Attention, which almost flattens KV cache growth in long contexts. Only then can the inference system build a cache pool at million-scale context lengths.&lt;/p&gt;
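&lt;p&gt;A back-of-the-envelope sketch of where the compression comes from. Standard multi-head attention caches a key and a value vector per head, while MLA caches one shared latent vector plus a small positional key. The dimensions below are illustrative assumptions at roughly the DeepSeek-V2 scale, and the baseline is full multi-head attention with the same head count, so the result is not the paper's own reported figure.&lt;/p&gt;

```python
def kv_elements_mha(n_heads, head_dim):
    """Cached values per token per layer for standard multi-head attention."""
    return 2 * n_heads * head_dim  # one key and one value vector per head


def kv_elements_mla(latent_dim, rope_dim):
    """Cached values per token per layer for MLA: shared latent plus RoPE key."""
    return latent_dim + rope_dim


# Illustrative dimensions, assumed for this sketch.
n_heads, head_dim = 128, 128
latent_dim, rope_dim = 512, 64

mha = kv_elements_mha(n_heads, head_dim)
mla = kv_elements_mla(latent_dim, rope_dim)
reduction = 1 - mla / mha

print(f"MHA caches {mha} values per token per layer")
print(f"MLA caches {mla} values per token per layer")
print(f"reduction vs. full MHA: {reduction:.1%}")
```

&lt;p&gt;Smaller per-token cache means more sessions fit in the same memory pool, which is what makes long-context cache reuse practical on the serving side.&lt;/p&gt;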

&lt;p&gt;But once the model changes, cache hit determination logic changes too. Some inputs previously recognized as "prefix overlap" behave differently under sparse attention, so whether they hit needs realignment. The inference system has to be revised as well. That's why I say inference engineers can't just stare at throughput anymore. They have to redesign cache around the model architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  VI. Measuring This Is Hard
&lt;/h2&gt;

&lt;p&gt;The most frustrating part of the whole chain is that evaluating the server side alone is useless, and evaluating the user side alone is also useless.&lt;/p&gt;

&lt;p&gt;No matter how strong the server-side cache is, if the user side is designed like Openclaw, hit rates still won't rise. No matter how careful the user side is, hit a crude server with chaotic routing, insufficient capacity, and no cross-user reuse, and costs still leak.&lt;/p&gt;

&lt;p&gt;So "which API is cheaper" can't be compared on a single dimension when it comes to tokens. Same hardware, same target. Good coordination between both ends versus each doing its own thing, and the total bill can differ by 5× to 10×.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;The cache hit/miss split on token pricing tables is the single most important thread in this whole thing. It gives users a clear incentive: high hit rates save you money. At the same time, it pressures providers. If your cache system isn't strong enough, you won't get the business.&lt;/p&gt;

&lt;p&gt;I hope vendors also expose the cache hit mechanism itself. Otherwise users only know it exists without knowing how to optimize for it. There still aren't many vendors that can tie together model, server-side cache, and user-side usage end to end.&lt;/p&gt;

&lt;p&gt;The edge side is still competing on raw performance. But once app density rises and agents really start running, KV cache will become a core issue there too. From cloud to edge, there's no way around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Prompt caching — Claude API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic Pricing — Claude API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/prompt-caching" rel="noopener noreferrer"&gt;How Claude Code uses prompt caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;Claude Code Sub-Agents Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/guides/kv_cache" rel="noopener noreferrer"&gt;Context Caching — DeepSeek API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek API Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2407.00079" rel="noopener noreferrer"&gt;Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (arXiv:2407.00079)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kvcache-ai/Mooncake" rel="noopener noreferrer"&gt;Mooncake on GitHub (kvcache-ai/Mooncake)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2 paper — introduces MLA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ddhigh.com/en/2026/03/26/fix-opencode-prompt-caching-with-third-party-proxy/" rel="noopener noreferrer"&gt;Fixing OpenCode prompt cache misses with third-party proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/code-yeongyu/oh-my-openagent/issues/1247" rel="noopener noreferrer"&gt;Plugin architecture prevents prompt caching — oh-my-opencode issue&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/token-business-kv-cache" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/token-business-kv-cache&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infra</category>
      <category>kvcache</category>
      <category>inference</category>
    </item>
    <item>
      <title>A Token Is Not a Thing</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 26 May 2026 04:48:36 +0000</pubDate>
      <link>https://dev.to/skyguan92/a-token-is-not-a-thing-5g0l</link>
      <guid>https://dev.to/skyguan92/a-token-is-not-a-thing-5g0l</guid>
      <description>&lt;p&gt;Lately, "token economy" is hot. Every business model in AI will eventually converge on one unit of account: the token. I buy that thesis. But one premise keeps getting skipped—a token is not a standardized commodity.&lt;/p&gt;

&lt;p&gt;Water has standard units. Electricity has standard units. Money, obviously. Token doesn't. It's more like gasoline: 92, 95, and 98 octane are different fuels, priced differently, for different engines. Adding them up by the liter and reporting one number means nothing.&lt;/p&gt;

&lt;p&gt;Most contradictions in AI today come down to this.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Intelligence Has Tiers
&lt;/h2&gt;

&lt;p&gt;Roughly four.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top tier.&lt;/strong&gt; Overseas: OpenAI GPT-5.5, Anthropic Claude Opus 4.7. China: Zhipu GLM-5.1, Moonshot Kimi K2.6, DeepSeek V4-Pro. Xiaomi MiMo-V2.5-Pro is a bit controversial, but usage and data are climbing, so I'll count it. These range from hundreds of billions to over a trillion parameters. Demand is almost unlimited; willingness to pay is fierce. Prices rise, quotas tighten, prices rise again—users keep pouring in. Zhipu's 2025 annual report showed GLM Coding Plan token calls up 15× in six months, with paying developers past 240,000. That's the real demand curve for top-tier tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mid-tier.&lt;/strong&gt; This is the awkward gap. MiniMax M2.7, DeepSeek V4-Flash, Xiaomi MiMo-V2.5 standard—these are about it. Moderate size, an order of magnitude cheaper, theoretically the best value. But almost no one is seriously building here. I'll explain why later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low-mid.&lt;/strong&gt; Mostly open source. Alibaba Qwen 3.6 leads, with both 35B-A3B (MoE) and 27B dense versions open. Google Gemma 4 is here too, from E2B to 31B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-device.&lt;/strong&gt; A few billion parameters, or even sub-billion, fitting in a phone or a consumer GPU.&lt;/p&gt;

&lt;p&gt;The first imbalance is right here. Top tier is a bloodbath. Mid-tier is empty. Low-mid and on-device are noisy but lack clear scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  II. Speed Is Another Dimension
&lt;/h2&gt;

&lt;p&gt;Tiers are only half the token story.&lt;/p&gt;

&lt;p&gt;The other half is speed. GPT-5.5 at 30 TPS versus 200 TPS is a completely different experience.&lt;/p&gt;

&lt;p&gt;Here are the 2026 numbers from Artificial Analysis, a commonly cited benchmark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Output TPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;GPT-5.5 (high)&lt;/td&gt;
&lt;td&gt;~68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;~48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;DeepSeek V4-Pro&lt;/td&gt;
&lt;td&gt;~48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flagship Standard&lt;/td&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;~33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High Speed&lt;/td&gt;
&lt;td&gt;DeepSeek V4-Flash&lt;/td&gt;
&lt;td&gt;~126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High Speed&lt;/td&gt;
&lt;td&gt;Gemini 3.5 Flash&lt;/td&gt;
&lt;td&gt;~203&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra High Speed&lt;/td&gt;
&lt;td&gt;GLM-5.1 High-Speed Edition&lt;/td&gt;
&lt;td&gt;400 (official)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra High Speed&lt;/td&gt;
&lt;td&gt;Cerebras running Kimi K2.6&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I wrote a post earlier, &lt;a href="https://guanjiawei.ai/blog/inference-speed-new-species" rel="noopener noreferrer"&gt;A Model 5× Faster Is No Longer the Same Model&lt;/a&gt;. The argument: a 5× speedup unlocks product forms that literally didn't exist before. This isn't slightly faster. It's a different species.&lt;/p&gt;

&lt;p&gt;The market already prices this. Anthropic Opus Fast: 2.5× speed, 6× price. OpenAI Priority Tier: 2.5× price. Look at those ratios—price rises faster than speed. Not greed. It's a pricing signal. There's a real cohort willing to pay multiples for speed.&lt;/p&gt;

&lt;p&gt;Intelligence tier × speed tier. Stack them and you get a matrix. The token in each cell is a different product.&lt;/p&gt;

&lt;h2&gt;
  
  
  III. Two Demand Tracks, Worlds Apart in Willingness to Pay
&lt;/h2&gt;

&lt;p&gt;Who's burning top-tier tokens? Two main tracks.&lt;/p&gt;

&lt;p&gt;First: coding agents. The fastest-growing, highest-burn category worldwide. The surface is a coding agent writing code to solve problems. In practice, people use them for everything. The work just happens to get done through "writing code."&lt;/p&gt;

&lt;p&gt;Second: consumer agents. The Claude app, ChatGPT app, Microsoft Copilot, and Zhipu's new AutoClaw (Claw Plan). AutoClaw launched March 2026 and hit 400,000 subscriptions in 20 days. Under the hood it's a coding agent wrapped in a non-technical shell, letting ordinary people "hire an AI employee."&lt;/p&gt;

&lt;p&gt;The two tracks have very different willingness to pay.&lt;/p&gt;

&lt;p&gt;Coding agent users demand peak intelligence—Opus 4.7, GPT-5.5 tier. Anything less fails. The work is valuable; time saved is valuable. They'll pay for top-tier tokens continuously. Stickiness is another story: when a better model drops, they switch immediately.&lt;/p&gt;

&lt;p&gt;Consumer agent users differ. Their tasks are lower-value, they're price-sensitive, and they don't need absolute peak intelligence. A "mid-tier smarts, good value, acceptable speed" model fits them perfectly. The problem: that tier is empty right now, with no real supply. So DeepSeek V4, with extreme cost-performance, quickly captured this segment. I've noticed many friends around me switching to DeepSeek.&lt;/p&gt;

&lt;p&gt;Demand looks like this, so model companies follow the money. That's why top-tier models keep screaming compute shortages while mid-tier models have no takers.&lt;/p&gt;

&lt;h2&gt;
  
  
  IV. The Supply-Side Mismatch: Scarce Cards and Idle Racks at the Same Time
&lt;/h2&gt;

&lt;p&gt;Demand misalignment carries over to the compute market.&lt;/p&gt;

&lt;p&gt;Top-tier compute shortage is obvious.&lt;/p&gt;

&lt;p&gt;Jensen Huang personally confirmed NVIDIA's Blackwell series (B200/GB200) is "sold out through mid-2026," with new enterprise orders facing 8–16 week lead times. Meta's annual CapEx is expected past $100 billion; Microsoft is spending nearly $35 billion in a single quarter—all scrambling for these chips. In China, the frenzy is over B300 and H200: a B300 server costs ¥7 million and you still can't get one, monthly rent pushed to ¥130,000–200,000. H200 was cleared for sale in China in January 2026; the first 5,000–10,000 module batch was snapped up by top vendors immediately, cluster delivery pushed to Q2 2027. The older H100 has cooled. No one is fighting for it now.&lt;/p&gt;

&lt;p&gt;Domestic top-tier chips are even more extreme. Huawei's latest Ascend 950PR only began mass production in March 2026, yet the full-year plan of 750,000 units was completely locked up: ByteDance (350,000), Alibaba (200,000), Tencent/Baidu (100,000), government and enterprise IT innovation (100,000)—orders pushed to 2027. Roughly $16,000 per chip, 1.56 PFLOPS FP4, officially claimed at 2.87× H20 single-card performance. This is the first time in domestic AI chip history that an entire year's production was bought out. When DeepSeek V4 open-sourced, it shipped day-zero support for eight domestic chips, listing Ascend NPUs alongside NVIDIA GPUs in the technical report. GLM-5 was trained entirely on Ascend + MindSpore, with support for seven domestic chips. This is about positioning: anchoring top models on domestic chips is both a technical and supply problem.&lt;/p&gt;

&lt;p&gt;The hidden side is massive idle capacity in low-to-mid-range compute.&lt;/p&gt;

&lt;p&gt;PPIO founder Yao Xin has said some domestic GPU AI compute centers have idle rates up to 80%. 36Kr reported some centers at only 10–20% utilization. Xinhua put it more bluntly: "General-purpose compute is relatively oversupplied; AI compute is relatively scarce"—an admission of structural mismatch. Prices reflect this: A100 prices crashed over 50%, RTX 4090 hourly rent dropped to ¥1–2, and the 5090 is around ¥2.5.&lt;/p&gt;

&lt;p&gt;But the low-to-mid-range mismatch has two distinct bottlenecks.&lt;/p&gt;

&lt;p&gt;Mid-tier datacenter cards (H20, L20, Huawei 910B, etc.) are stuck on infrastructure. Inference frameworks optimize for top-tier cards far more than these. KV cache management, MoE expert parallelism, FP8/FP4 precision support—none of the critical paths is mature here. The hardware exists, demand exists, but you can't serve a top experience.&lt;/p&gt;

&lt;p&gt;Consumer PCIe cards (4090, 5090, 4090 48GB mods) face the opposite problem. The hardware can run; vLLM already supports the 5090 (needs CUDA 12.8 + falling back to FlashAttention 2, usable enough). What's missing is good models designed for them. The 70B dense tier is obsolete—as of May 2026, the top six open-source models are all MoE; dense has virtually disappeared at the flagship level. MoE total parameters routinely exceed 100B, which won't fit on consumer cards; distilled small models can't match top quality. No one is supplying new, high-quality models tailored to 24GB/32GB/48GB VRAM limits.&lt;/p&gt;

&lt;p&gt;So the picture is: 4090/5090 prices are absurdly cheap compared to datacenter cards, yet the mid-tier models you can actually run are still old stock like Llama 3.3 70B from late 2024. Individual developers experimenting locally, small-team PoCs, and privacy-sensitive on-prem deployments can get by. But for enterprise-grade mid-tier inference on these cards, no newly optimized models exist.&lt;/p&gt;

&lt;p&gt;The issue isn't "total compute is insufficient." It's "compute can't align with demand."&lt;/p&gt;

&lt;p&gt;Outsiders used to quote compute in "petaflops." That was always shaky; in the AI inference era it's nearly useless. Whether a compute unit can serve top-tier models depends on interconnect, memory bandwidth, FP4/FP8 support, KV cache management. A hundred older cards can't match one top-tier card's single-stream speed.&lt;/p&gt;

&lt;p&gt;You get a strange picture: top model providers scrambling for chips, while last-gen cards in datacenters can't be rented out even at a discount. Scarcity and glut, side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  V. The Market Will Correct the Mismatch, But It Takes Time
&lt;/h2&gt;

&lt;p&gt;This mismatch won't last. The two bottlenecks will be pushed by two different market forces.&lt;/p&gt;

&lt;p&gt;The infrastructure gap for mid-tier datacenter cards will be driven by engineering priorities. Inference frameworks follow the money. Once mid-tier model demand grows, top frameworks like vLLM, SGLang, and TensorRT-LLM will eventually be forced to prioritize H20, L20, and 910B optimization. Not glamorous, but inevitable.&lt;/p&gt;

&lt;p&gt;The model supply gap for consumer cards is being pushed by distillation and small MoE. DeepSeek-V4 has already distilled a ~9B version; the Qwen series has been working on this. Once someone actually delivers "runs in 32GB VRAM, quality close to top-tier," idle 4090s and 5090s will immediately find work.&lt;/p&gt;

&lt;p&gt;Another track is deep binding between domestic chips and domestic models. DeepSeek and Zhipu are both pursuing it; technically it's proven feasible. Once it fully works, the low-to-mid-range compute market will reshuffle structurally.&lt;/p&gt;

&lt;p&gt;I'm fairly optimistic this will happen—it just takes time. Maybe a few quarters, maybe a year or two. For those who catch the rhythm, there's a structural window.&lt;/p&gt;

&lt;h2&gt;
  
  
  VI. Don't Reduce Tokens to a Single Number
&lt;/h2&gt;

&lt;p&gt;Back to the opening line. "Token economy" is a fine term, but it's far less intuitive than selling water or electricity.&lt;/p&gt;

&lt;p&gt;It's more like a gas station. Tokens, like gasoline, look like one thing, but they're actually an intelligence × speed matrix. Layer on the supply-side compute tier mismatch, and you have the real cause behind today's apparently contradictory industry phenomena: why model companies are scrambling for chips, why some AI compute centers sit idle, why the fast tier can charge 6×, and why mid-tier intelligence models are slow to arrive.&lt;/p&gt;

&lt;p&gt;Next time you see "we've deployed N petaflops" or "we produce X trillion tokens per month," pause and ask: which intelligence tier, which speed tier, which demand tier.&lt;/p&gt;

&lt;p&gt;A token is not a thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model Versions and Positioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/gpt-5-5-instant/" rel="noopener noreferrer"&gt;OpenAI GPT-5.5 Instant Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Claude Opus 4.7 — Anthropic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.bigmodel.cn/cn/guide/models/text/glm-5.1" rel="noopener noreferrer"&gt;Zhipu GLM-5.1 Technical Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kimi-k2.org/blog/24-kimi-k2-6-release" rel="noopener noreferrer"&gt;Moonshot Kimi K2.6 Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api-docs.deepseek.com/news/news260424" rel="noopener noreferrer"&gt;DeepSeek V4 — API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/24/deepseek-v4/" rel="noopener noreferrer"&gt;Simon Willison: DeepSeek V4—almost on the frontier&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mimo.xiaomi.com/mimo-v2-5-pro/" rel="noopener noreferrer"&gt;Xiaomi MiMo-V2.5-Pro Official&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.minimaxi.com/news/minimax-m25" rel="noopener noreferrer"&gt;MiniMax M2.5 Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.modelscope.cn/models/Qwen/Qwen3.6-35B-A3B" rel="noopener noreferrer"&gt;Qwen 3.6-35B-A3B — ModelScope&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;Google Gemma 4 Release&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Speed Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/gpt-5-5-high" rel="noopener noreferrer"&gt;Artificial Analysis — GPT-5.5 (high)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/claude-opus-4-7" rel="noopener noreferrer"&gt;Artificial Analysis — Claude Opus 4.7&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/deepseek-v4-pro" rel="noopener noreferrer"&gt;Artificial Analysis — DeepSeek V4 Pro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/deepseek-v4-flash-non-reasoning" rel="noopener noreferrer"&gt;Artificial Analysis — DeepSeek V4 Flash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/kimi-k2-6" rel="noopener noreferrer"&gt;Artificial Analysis — Kimi K2.6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash" rel="noopener noreferrer"&gt;Artificial Analysis — Gemini 3.5 Flash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ithome.com/0/953/717.htm" rel="noopener noreferrer"&gt;Zhipu GLM-5.1 High-Speed 400 tokens/s Report (IT Home)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cerebras.ai/blog/cerebras-kimi-k2-Enterprise" rel="noopener noreferrer"&gt;Cerebras Running Kimi K2.6 at 981 tokens/s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groundy.com/articles/claude-code-fast-mode-6x-pricing-worth/" rel="noopener noreferrer"&gt;Claude Opus Fast Mode: 2.5× Speed / 6× Price (Groundy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/priority-processing" rel="noopener noreferrer"&gt;OpenAI Priority Processing Official Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zhipu Products and Financial Reports&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://finance.sina.com.cn/stock/hkstock/ggscyd/2026-03-31/doc-inhswpms3341678.shtml" rel="noopener noreferrer"&gt;Zhipu 2025 Annual Report (Sina Finance)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.qbitai.com/2026/03/394135.html" rel="noopener noreferrer"&gt;QbitAI: Zhipu Financial Report Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tmtpost.com/7938401.html" rel="noopener noreferrer"&gt;TMTPost: Zhipu Financial Report Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.jiemian.com/article/14095547.html" rel="noopener noreferrer"&gt;AutoClaw / Claw Plan Launch (Jiemian News)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compute Market&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://markets.financialcontent.com/wral/article/tokenring-2025-12-29-nvidias-blackwell-dynasty-b200-and-gb200-sold-out-through-mid-2026-as-backlog-hits-36-million-units" rel="noopener noreferrer"&gt;NVIDIA Blackwell sold out through mid-2026 (FinancialContent)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://finance.sina.com.cn/china/gncj/2026-05-01/doc-inhwktza5465326.shtml" rel="noopener noreferrer"&gt;Domestic B300 Servers at ¥7 Million, Still Unavailable (Sina Finance)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/1981721428861682031" rel="noopener noreferrer"&gt;H200 Sales Ban Lifted in China: Buy or Rent? (Zhihu)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitcode.csdn.net/69dc9c0a54b52172bc69377c.html" rel="noopener noreferrer"&gt;2026 Q1 GPU Rental Market Deep Dive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.fxbaogao.com/detail/5399775" rel="noopener noreferrer"&gt;High-End GPU Supply-Demand Mismatch Drives Compute Rental Boom (WallstreetCN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://caifuhao2.eastmoney.com/news/20260519201003499682070" rel="noopener noreferrer"&gt;Huawei Ascend 950PR in Mass Production + Orders Pushed to 2027 (East Money)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oschina.net/news/373016" rel="noopener noreferrer"&gt;Huawei Ascend AI Chip Three-Year Roadmap: 950PR / 950DT / 960 / 970 (OSCHINA)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ascendai.csdn.net/69d716f30a2f6a37c59df6df.html" rel="noopener noreferrer"&gt;DeepSeek V4 Fully Switches to Huawei Ascend 950PR (CSDN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.leiphone.com/category/industrynews/GpExeQUQDrXE3B8H.html" rel="noopener noreferrer"&gt;PPIO Yao Xin on AI Compute Center Idle Rates (Leiphone)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://36kr.com/p/3269610247737992" rel="noopener noreferrer"&gt;36Kr: AI Compute Center Utilization Only 10–20%&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.news.cn/finance/20250429/3df0c33317d2499ab3a297a413e0acce/c.html" rel="noopener noreferrer"&gt;Xinhua: General Compute Oversupply, AI Compute Shortage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mornai.cn/news/gpu/a100-gpu-rent-trend/" rel="noopener noreferrer"&gt;A100 Price Trend Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sohu.com/a/1001811857_122681399" rel="noopener noreferrer"&gt;RTX 4090 Hourly Rental Price Range ¥1.45–2.29 (Sohu, March 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gongjiyun.com/blog/2026/1/rx8wwbsgsisch5kyogoc7t3yncb/" rel="noopener noreferrer"&gt;RTX 5090 Compute at ¥2.5/Card/Hour (Gongji Compute)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mornai.cn/news/gpu/rtx-4090-48gb/" rel="noopener noreferrer"&gt;RTX 4090 48GB Mod Review (Mornai)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://qubittool.com/zh/blog/llm-landscape-may-2026-deepseek-qwen-llama-comparison" rel="noopener noreferrer"&gt;2026 LLM Landscape: MoE Drives Dense Models Extinct at the Flagship Level (QubitTool)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/BoltzmannEntropy/vLLM-5090" rel="noopener noreferrer"&gt;vLLM Deployment Guide on RTX 5090 (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/2028142360152817950" rel="noopener noreferrer"&gt;Using a Modded 4090 for a Year: Great for Dev, Disaster for Production (Zhihu)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.qq.com/rain/a/20260508A0267Y00" rel="noopener noreferrer"&gt;DeepSeek V4 Day-0 Support for Eight Domestic Chips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.guancha.cn/economy/2026_02_12_806895.shtml" rel="noopener noreferrer"&gt;GLM-5 Supports Seven Domestic Chips (Guancha)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/token-is-not-one-thing" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/token-is-not-one-thing&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>token</category>
      <category>infra</category>
      <category>compute</category>
    </item>
    <item>
      <title>The Stronger the Agent, the More Common Sense Is Worth</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 25 May 2026 04:06:31 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-stronger-the-agent-the-more-common-sense-is-worth-4m96</link>
      <guid>https://dev.to/skyguan92/the-stronger-the-agent-the-more-common-sense-is-worth-4m96</guid>
      <description>&lt;p&gt;Last month I wrote &lt;a href="https://guanjiawei.ai/blog/amateur-advantage" rel="noopener noreferrer"&gt;“AI Turns Ignorance Into an Advantage”&lt;/a&gt;, about how outsiders without the baggage of knowing how hard something is are more willing to use AI to try things that look impossible.&lt;/p&gt;

&lt;p&gt;I still believe that. But agents burned me four times in a row recently, so I need to revise.&lt;/p&gt;

&lt;p&gt;The sweet spot isn’t knowing nothing; it’s knowing just enough. You have common sense, you grasp the big picture, but you don’t get lost in technical details. Total beginners do dare to try, which is good. But they can’t tell whether the agent’s output is actually reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Fake Data Can Trick You by Orders of Magnitude
&lt;/h2&gt;

&lt;p&gt;I’ve been optimizing an inference engine lately.&lt;/p&gt;

&lt;p&gt;I checked the results on the first night. The metric had hit a target I’d considered seriously challenging. I was excited. Had we really cracked it that fast?&lt;/p&gt;

&lt;p&gt;If I knew nothing about this domain, I’d probably have cheerfully shared the results with my partners. But because I had some common sense, something felt off. I had it run a correctness check. The output was nothing but exclamation marks. After fixing correctness, performance dropped by orders of magnitude.&lt;/p&gt;

&lt;p&gt;I thought that was the end of it. But as we kept optimizing, the rhythm still felt wrong. The numbers climbed too fast, suspiciously fast. I looked at the test flow and found that before each official test it quietly ran a warm-up using the exact same prompt. Every subsequent test was hitting the prefix cache, essentially cheating on an open-book exam. After isolating the cache, performance dropped by orders of magnitude again.&lt;/p&gt;

&lt;p&gt;Still not done. Once prefill returned to normal, decode speed suddenly became absurd. A Windows build of the engine was somehow outperforming the Linux version. I ran the real-prompt test script I’d written earlier, and performance took another ten-fold hit. The problem was that the agent’s synthetic prompts were too simple and too regular, letting speculative decoding hit an acceptance rate above 80%. Switch to real prompts and the acceptance rate cratered, taking performance with it. Teams that have shipped speculative decoding have documented the exact same trap: real-world production performance is 40% to 60% lower than in the lab, a gap large enough to make you wonder if it’s the same system.&lt;/p&gt;
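&lt;p&gt;To see why acceptance rate matters so much, here is a rough sketch using the standard simplified model of speculative decoding, in which each drafted token is accepted independently with probability &lt;code&gt;alpha&lt;/code&gt;. The draft length of 4 and the lower acceptance rates are illustrative assumptions, not numbers from my runs, and the model ignores the cost of the draft pass itself.&lt;/p&gt;

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model pass when k tokens are drafted.

    Assumes every drafted token is accepted independently with probability alpha.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

DRAFT_LENGTH = 4  # assumed draft length, for illustration only

for alpha in (0.8, 0.6, 0.4):
    tokens = expected_tokens_per_step(alpha, DRAFT_LENGTH)
    print(f"acceptance {alpha:.0%}: {tokens:.2f} tokens per step")
```

&lt;p&gt;Under these assumptions, an 80% acceptance rate yields about 3.36 tokens per step, while 40% yields about 1.65. Decode throughput roughly halves from the prompt distribution alone, with no change to the engine.&lt;/p&gt;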

&lt;p&gt;Three layers of illusion, stacked. If I’d believed that first number and shared it externally, the cleanup would have been miserable. You give someone the wrong expectation, and they start scheduling around it. Then you have to go back and say, “Sorry, we’re off by orders of magnitude.” That feels way worse than saying “We’re not there yet” from the start.&lt;/p&gt;

&lt;p&gt;After that, every optimization target explicitly included two rules: prefill must not be affected by prefix-cache interference, and decode must use real prompts. Only then did we see a normal curve that crept upward, bit by bit.&lt;/p&gt;
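&lt;p&gt;One way to enforce the first rule in a test harness is sketched below. This is not the script I actually used; the helper names are hypothetical, and prepending a nonce is just one assumed technique for keeping a prefix cache from matching. It also changes the prompt slightly, so it suits prefill timing rather than quality checks.&lt;/p&gt;

```python
import uuid

def isolate_from_prefix_cache(real_prompt):
    """Prepend a fresh nonce so this request shares no leading text with earlier ones."""
    nonce = uuid.uuid4().hex
    return f"{nonce}\n{real_prompt}"

def check_benchmark_inputs(warmup_prompts, measured_prompts):
    """Refuse to run if any measured prompt was already sent during warm-up."""
    reused = set(warmup_prompts).intersection(measured_prompts)
    if reused:
        raise ValueError(f"{len(reused)} measured prompt(s) were also used for warm-up")

# Hypothetical usage: warm up on one prompt, measure on different, nonce-prefixed ones.
warmup = ["Say hello."]
measured = [isolate_from_prefix_cache("Summarize this design document.")]
check_benchmark_inputs(warmup, measured)
```

&lt;p&gt;The check is deliberately strict: it fails loudly instead of silently producing a number that is too good.&lt;/p&gt;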

&lt;h2&gt;
  
  
  2. It Will Brick Your Lab Machine
&lt;/h2&gt;

&lt;p&gt;The latest agents can work autonomously for a full day or longer. The longer they run, the higher the chance something goes wrong.&lt;/p&gt;

&lt;p&gt;My agent has, more than once, trashed the entire OS of a lab machine mid-run because of a missing quote or a flipped command-line argument. Files gone, environment wiped. It happens in a split second. You can’t stop it in time.&lt;/p&gt;

&lt;p&gt;It’s not just me. In April, when an agent hit a credential mismatch, it didn’t stop to ask a human. It found a token with full privileges and deleted an entire company’s production database and all backups in nine seconds. Thirty-plus hours of downtime. Three months of customer data, gone. There have been at least a dozen similar documented incidents in the past two years.&lt;/p&gt;

&lt;p&gt;Anthropic and OpenAI are now pushing sandboxing. The idea isn’t complicated. Filesystem isolation on one layer, network isolation on another. Without filesystem isolation, the agent can touch things it shouldn’t. Without network isolation, a compromised agent can steal your keys.&lt;/p&gt;

&lt;p&gt;My own approach is more low-tech: dedicate a machine exclusively to the agent, and don’t store anything else on it. If it runs for dozens of hours straight, the probability of a dumb mistake is non-zero. Reinstalling the OS costs time. Losing important data costs your sanity.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. It Will Spin in Circles Until You Step In
&lt;/h2&gt;

&lt;p&gt;Agents have another bad habit: they circle the same problem.&lt;/p&gt;

&lt;p&gt;A recent goal was to run an inference engine on Windows in BF16 precision. The model weights were over 60 GB, and loading them caused an immediate OOM crash.&lt;/p&gt;

&lt;p&gt;The agent’s response was interesting: it kept trying to work around the memory bottleneck. Load only some weights, dynamically read the rest during inference, every offload trick in the book. None of them worked, and each ate up a lot of time. It even added a warm-up to the tests to hide loading latency. That was part of the root cause of the prefix-cache problem I mentioned earlier.&lt;/p&gt;

&lt;p&gt;I finally cut in and said: stop tweaking performance and fix the memory issue first. Until that bottleneck is solved, everything else is wasted effort.&lt;/p&gt;

&lt;p&gt;The agent actually executes well. Once pointed in the right direction, it quickly found a series of system-level Windows settings to expand available memory and VRAM. After that was fixed, the optimization path smoothed out immediately. All the previous workarounds were suddenly useless. That time was basically wasted.&lt;/p&gt;

&lt;p&gt;The problem is that it won’t proactively redefine the problem. Hand it “optimize performance” and it will keep grinding on that goal, even when stuck on a prerequisite. It finds ways to work around it rather than telling you, “This assumption is false; we need to handle something else first.” Recognizing the real blocker and pulling the agent out of the dead end is a judgment call only a human can make.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Set the Bar Too High and You Ship Nothing
&lt;/h2&gt;

&lt;p&gt;The last pitfall isn’t the agent’s fault. It’s mine.&lt;/p&gt;

&lt;p&gt;The more powerful agents get, the easier it is to set the bar too high. Because they can run for days, you start thinking anything is fair game. Every direction looks like a top-conference breakthrough. So you spin up multiple threads, each one ambitious.&lt;/p&gt;

&lt;p&gt;The result? Every thread is active, every thread shows progress, but nothing ships.&lt;/p&gt;

&lt;p&gt;You keep burning tokens, you keep seeing “progress,” yet nothing reaches the user’s hands. It looks like work. It’s actually just burning money. I made this mistake recently: several threads were the kind that would be huge if they landed, but the execution risk was equally high. An agent isn’t a genie; if it can’t be done, it can’t be done. I burned a mountain of tokens and delivered nothing.&lt;/p&gt;

&lt;p&gt;I eventually realized: narrow the scope. You need something shippable in the short-to-medium term and some worthwhile long-term explorations, not only the latter. Deliver what can be delivered first, stabilize the rhythm, then go after the big bets.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Knowing Just Enough Is Exactly Right
&lt;/h2&gt;

&lt;p&gt;Look at the four pitfalls together and one thread connects them: none requires you to be a deep expert to avoid.&lt;/p&gt;

&lt;p&gt;Performance jumped by orders of magnitude? Check whether you measured wrong first. The agent needs to run on your main machine all day? Give it a dedicated one. Stuck on the same spot after three optimization rounds? That spot is the real problem. Every thread is running but none ships? Kill a few.&lt;/p&gt;

&lt;p&gt;It’s all common sense.&lt;/p&gt;

&lt;p&gt;An MIT Sloan article this year on managing in the age of agentic AI noted that the most important skills for managing agents are defining problems and validating outputs. Those are things AI still can’t do well. “Agent Manager” is already showing up on job boards, and one line in the job description stands out: domain common sense matters more than AI expertise.&lt;/p&gt;

&lt;p&gt;Going back to my previous post. “Ignorance is an advantage” still holds: you have to not know what’s hard in order to dare to try. But courage alone isn’t enough. The most valuable state is this: willing to try, yet able to sense when something is off at the critical moment.&lt;/p&gt;

&lt;p&gt;Total beginners get carried away by fake data. Deep experts get shackled by priors. The people in the middle, the ones who know just enough, are bold enough to act, yet wise enough to pull the reins when needed.&lt;/p&gt;

&lt;p&gt;Agents will keep getting stronger. But that bit of human common sense, whether the numbers check out, whether the direction is right, or whether this should ship now, will only become more valuable. These are still things agents can’t handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/gpt-5-1-codex-max/" rel="noopener noreferrer"&gt;GPT-5.1 Codex Max Can Work Autonomously for Over 24 Hours — OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;GPT-5.5 Released: Multi-Hour Autonomous Session Capability — OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tianpan.co/blog/2026-04-17-speculative-decoding-production-hidden-traps" rel="noopener noreferrer"&gt;Speculative Decoding’s Hidden Traps in Production: Real-World Performance 40–60% Lower Than in the Lab — tianpan.co&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sudostack.co/mtp-speculative-decoding-coding-vs-creative-writing/" rel="noopener noreferrer"&gt;MTP Acceptance Rate Variations Across Task Types — SudoStack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/" rel="noopener noreferrer"&gt;Cursor + Claude Agent Deletes Entire Company Production Database in 9 Seconds — The Register&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/claude-go/what-10-real-ai-agent-disasters-taught-me-about-autonomous-systems-2ndc"&gt;10 Real-World AI Agent Incidents Reviewed — DEV Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/claude-code-sandboxing" rel="noopener noreferrer"&gt;Claude Code Sandbox Design: Two-Layer Isolation Cuts Permission Prompts by 84% — Anthropic Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sloanreview.mit.edu/article/agentic-ai-at-scale-redefining-management-for-a-superhuman-workforce/" rel="noopener noreferrer"&gt;Agentic AI Redefines Management: 69% of Experts Call It a Paradigm Shift — MIT Sloan Management Review&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.weforum.org/stories/2025/07/leaders-will-soon-be-managing-ai-agents-these-are-the-skills-theyll-need/" rel="noopener noreferrer"&gt;Core Skills for Managing AI Agents — World Economic Forum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://beam.ai/agentic-insights/what-is-an-agent-manager-the-new-role-every-ai-company-needs-in-2026/" rel="noopener noreferrer"&gt;Agent Manager: The New Enterprise Role in 2026 — Beam AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/agent-common-sense" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/agent-common-sense&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>collaboration</category>
      <category>thinking</category>
    </item>
    <item>
      <title>When a Model Is 5× Faster, It’s No Longer the Same Model</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 22 May 2026 06:52:33 +0000</pubDate>
      <link>https://dev.to/skyguan92/when-a-model-is-5x-faster-its-no-longer-the-same-model-2ih3</link>
      <guid>https://dev.to/skyguan92/when-a-model-is-5x-faster-its-no-longer-the-same-model-2ih3</guid>
      <description>&lt;p&gt;Two releases caught my eye this week.&lt;/p&gt;

&lt;p&gt;On May 19, Google released Gemini 3.5 Flash. I watched their launch event. Oddly, they barely emphasized the model’s raw intelligence. Benchmarks against the previous generation didn’t exactly stand out either. But they devoted serious time to speed, calling it “frontier intelligence built for speed,” claiming inference is 4× faster than other frontier models.&lt;/p&gt;

&lt;p&gt;Today, May 22, Zhipu also launched GLM-5.1 High-Speed, claiming 400 token/s output—the current ceiling for industry APIs. This engine wasn’t built by Zhipu alone; it was a joint effort with a team called TileRT, doing low-level customization specifically for the GLM model family on a specific class of hardware.&lt;/p&gt;

&lt;p&gt;Put these two together, then look back at Anthropic’s Opus Fast and OpenAI’s GPT-5.5 Fast over the past few months, and the direction is clear: differentiation at the model layer is changing lanes. Everyone used to compete on smarts; now they’re increasingly competing on speed.&lt;/p&gt;

&lt;p&gt;And once speed crosses a certain line, it stops being a linear “X times faster” improvement. AI becomes a different kind of thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pricing Already Tells the Story
&lt;/h2&gt;

&lt;p&gt;The clearest evidence is fast-mode pricing.&lt;/p&gt;

&lt;p&gt;Anthropic’s Opus Fast: 2.5× the speed, 6× the price.&lt;br&gt;&lt;br&gt;
OpenAI’s GPT-5.5 Fast: 1.5× the speed, 2.5× the cost.&lt;/p&gt;

&lt;p&gt;Look at the numbers. If speed were valued linearly, 2.5× speed would cost 2.5× the price, and 1.5× speed would cost 1.5×. But in practice, the price jumps far more than the speed.&lt;/p&gt;

&lt;p&gt;This isn’t greed. It’s a real market signal: some people will happily pay disproportionately more for speed. Either their tasks need high-frequency feedback, or users are sitting there waiting, or downstream steps are blocked. In these scenarios, going from 30 seconds to 12 seconds feels completely different from going from 30 seconds to 20 seconds.&lt;/p&gt;
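&lt;p&gt;The arithmetic behind those two tiers can be spelled out in a few lines. The speed and price multiples are the ones quoted above and the 30-second baseline is the same example; the function names are only for illustration.&lt;/p&gt;

```python
# (speedup, price multiple) as quoted in the post
FAST_TIERS = {
    "Opus Fast": (2.5, 6.0),
    "GPT-5.5 Fast": (1.5, 2.5),
}
BASELINE_SECONDS = 30.0

def price_per_unit_speed(speedup, price_multiple):
    """How many times the linear price you pay for each unit of speedup."""
    return price_multiple / speedup

def accelerated_seconds(baseline_seconds, speedup):
    """Wall-clock time of the same task after the speedup."""
    return baseline_seconds / speedup

for name, (speedup, price_multiple) in FAST_TIERS.items():
    premium = price_per_unit_speed(speedup, price_multiple)
    seconds = accelerated_seconds(BASELINE_SECONDS, speedup)
    print(f"{name}: {seconds:.0f} s per task, {premium:.2f}x the linear price")
```

&lt;p&gt;Opus Fast comes out at 2.4 times what linear pricing would charge, and GPT-5.5 Fast at roughly 1.67 times. The steeper premium sits on the tier with the larger speedup.&lt;/p&gt;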

&lt;p&gt;I toggle Opus Fast on and off constantly myself. I turned off GPT-5.5’s 1.5× tier immediately. I couldn’t feel the difference; it was just burning money. But at 2.5×, there are tasks I just leave it on for—mostly when I’m staring at the output and iterating fast.&lt;/p&gt;

&lt;p&gt;Markets don’t lie. Something that sells for 6× has buyers who genuinely think it’s worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Per-Request Speed and Scaling Out Are Not the Same Thing
&lt;/h2&gt;

&lt;p&gt;Two things need to be kept apart here.&lt;/p&gt;

&lt;p&gt;The first is “doing more concurrency at a fixed speed.” Each user still gets 30 token/s, but you serve 1,000 users instead of 100. This is relatively easy. Just throw more machines at it: you can buy slightly weaker cards and spread the load across them, and the cost-performance ratio can be tuned.&lt;/p&gt;

&lt;p&gt;The second is “making a single request faster.” Going from 30 token/s to 400. This is an entirely different beast. You need higher-end hardware, more aggressive memory bandwidth, and cutting-edge packaging. You can’t fix this by “spending a bit more to stack extra cards.” A hundred weak cards won’t get a single request to the speed of one top-tier card.&lt;/p&gt;

&lt;p&gt;I’ve spent time experimenting with inference infra myself, optimizing a few open-source models. The cost curves are completely different. The first is roughly linear—double the money, get about double the concurrency. The second is non-linear—that first 20% speedup might cost you 50% more, and it only gets steeper.&lt;/p&gt;

&lt;p&gt;So when Gemini 3.5 Flash emphasizes speed, or GLM High-Speed hits 400 token/s, they’re not saying “we made a cheaper version.” They’re saying “we pushed single-request speed to a new level.” That’s a problem of an entirely different magnitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. 5× Is a Speciation Threshold
&lt;/h2&gt;

&lt;p&gt;So why push so hard?&lt;/p&gt;

&lt;p&gt;When I think about this, I go back to a simple comparison.&lt;/p&gt;

&lt;p&gt;If you want something done faster, the traditional options are limited.&lt;/p&gt;

&lt;p&gt;First, hire smarter people. But that hits a ceiling. There are only so many world-class experts, and today’s best models are already brushing up against that ceiling.&lt;/p&gt;

&lt;p&gt;Second, make people work overtime. Agents already run 24/7, so that ceiling is gone too.&lt;/p&gt;

&lt;p&gt;Third, divide the work and throw more people at it. But anyone who’s done engineering knows adding people doesn’t scale linearly. Adding one person doesn’t make it twice as fast; adding ten gets you nowhere near 10×. You have to break things down, hand off, coordinate, deal with uneven quality, manage waste. The ramp-up period for new hires is expensive. If you’re doing multi-agent orchestration, you know exactly what I mean.&lt;/p&gt;

&lt;p&gt;At this point, the traditional paths to speed are tapped out.&lt;/p&gt;

&lt;p&gt;So what’s left? Make the model itself—the same employee—faster.&lt;/p&gt;

&lt;p&gt;And making that “same employee” faster is non-linear.&lt;/p&gt;

&lt;p&gt;Imagine an employee who used to take an hour to finish a task. Now they do it in ten minutes. You think you just saved fifty minutes? It’s more than that.&lt;/p&gt;

&lt;p&gt;You’ll start giving them tasks you’d never have bothered with because “it’s too slow.” Small ad-hoc requests that used to take an hour—so you never asked—now come back in ten minutes, and you make a dozen a day. Speed unlocks tasks that literally didn’t exist before.&lt;/p&gt;

&lt;p&gt;I saw a demo the other day: someone wearing glasses pointed at a video on a screen and said “zoom in on this,” and the AI behind it wrote code to resize the element. If the whole chain takes thirty seconds, you glance and walk away—there’s no real interaction. But if it finishes in five seconds, the feel is completely different; it becomes a genuinely usable product.&lt;/p&gt;

&lt;p&gt;That’s the gap between 50 token/s and 400 token/s. 8× speed unlocks products that were impossible to build before.&lt;/p&gt;
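&lt;p&gt;A back-of-the-envelope conversion shows how token rate turns into waiting time. The 1,500-token output size is an assumption chosen for illustration, and the sketch ignores prefill and network latency.&lt;/p&gt;

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time spent streaming the output, ignoring prefill and network latency."""
    return output_tokens / tokens_per_second

OUTPUT_TOKENS = 1500  # assumed size of the generated code change, not from the post

for rate in (50, 400):
    seconds = generation_seconds(OUTPUT_TOKENS, rate)
    print(f"{rate} token/s: {seconds:.2f} s")
```

&lt;p&gt;With that assumed output size, 50 token/s means a 30-second wait and 400 token/s means under 4 seconds, which is the difference between walking away and staying in the loop.&lt;/p&gt;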

&lt;p&gt;A speedup beyond 5× is a speciation line.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Return of Specialization
&lt;/h2&gt;

&lt;p&gt;Okay, speed is valuable. How do you actually achieve it?&lt;/p&gt;

&lt;p&gt;That brings us to TileRT’s approach, which diverges from where the industry was a year ago.&lt;/p&gt;

&lt;p&gt;Mainstream inference frameworks like vLLM, TensorRT-LLM, and SGLang are general-purpose. They aim to “support as many models as possible, running well enough on as much hardware as possible.” That has always been software engineering’s default bias: generality first, performance second.&lt;/p&gt;

&lt;p&gt;TileRT does the opposite. It statically schedules the entire inference graph at compile time, running as a persistent kernel on the GPU with almost no runtime dynamic scheduling. Micro-tasks at tile-level granularity squeeze the hardware close to its physical limits. The cost? Change the model and it’s basically scrap; change the hardware and it needs major rework.&lt;/p&gt;

&lt;p&gt;DeepSeek is on the same path. Their own inference engine started out based on vLLM, then underwent more than a year of deep customization—almost every path was rewritten for their own MoE architecture. When they open-sourced part of it recently, the industry’s reaction wasn’t “how general-purpose this is,” but “how deep you can go for a single model.”&lt;/p&gt;

&lt;p&gt;Go one layer deeper, and the hardware side has been on this path for a while. Groq’s LPU runs Llama 4 Scout at 460 token/s, 3–4× what an H100 delivers. Cerebras’s WSE-3 hits 1,800 token/s on a 70B model and nearly 3,000 on gpt-oss-120B. These are specialized chips. They aren’t trying to run every kind of model; they’re built to take a specific workload to the extreme.&lt;/p&gt;

&lt;p&gt;Chip designers have debated this for decades: general-purpose CPU or specialized ASIC? General chips have their place, but when a domain is big enough and the lifecycle is long enough, specialization pays off.&lt;/p&gt;

&lt;p&gt;The software layer used to avoid this, mainly because software isn’t cheap to write. Building a dedicated inference stack for one model takes too long to pay off; the model changes and your software is dead.&lt;/p&gt;

&lt;p&gt;That’s changing. AI agents can write software now. The cost of “building an optimal inference stack from scratch for a specific model and specific hardware” drops every year. Once it falls below a certain threshold, specialization becomes the default.&lt;/p&gt;

&lt;p&gt;Every promising model will eventually have its own dedicated inference engine. Every generation of mainstream hardware will have its own specially optimized stack. What you used to think of as just “the last 5% of optimization” could now become a 5× or 10× gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Vertical Integration at the Model Layer Is Inevitable
&lt;/h2&gt;

&lt;p&gt;Pulling these threads together.&lt;/p&gt;

&lt;p&gt;Intelligence will keep improving in the short term, but the marginal utility of competing on raw smarts is declining. A model that’s 20% smarter versus the same model accelerated 10×—for many users, the latter is far more valuable, especially for the new scenarios that speed itself unlocks.&lt;/p&gt;

&lt;p&gt;So the next phase of competition shifts from “point intelligence” to “end-to-end capability.” Model, inference engine, and hardware—all three bundled together.&lt;/p&gt;

&lt;p&gt;If you’re at 400 token/s and I’m at 30 token/s, even if my model is 20% smarter, I’m unusable in many scenarios. I’ll be watching my smartest model sit there slowly spitting out words while you’ve already delivered the whole product experience to the user.&lt;/p&gt;

&lt;p&gt;DeepSeek and Zhipu are already doing this. Anthropic and OpenAI are too. Google probably went the earliest and deepest—the TPU + Gemini combo has been running internally for a long time. My guess is that over the next year or two, the whole industry moves this way: model companies must own their inference stack and go deep into the hardware layer; hardware companies must go deep into model architecture; and the generic middle layer gets squeezed from both ends.&lt;/p&gt;

&lt;p&gt;For engineers, this is pretty exciting. We used to think “general, scalable, portable” was good taste. For the foreseeable future, the opposite may hold: writing the most extreme code for a specific model and specific hardware—code that breaks if you change anything—becomes worth doing again.&lt;/p&gt;

&lt;p&gt;Software engineering aesthetics will have to change.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/" rel="noopener noreferrer"&gt;Gemini 3 Flash: frontier intelligence built for speed (Google Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://artificialanalysis.ai/models/gemini-3-5-flash" rel="noopener noreferrer"&gt;Gemini 3.5 Flash - Intelligence, Performance &amp;amp; Price Analysis (Artificial Analysis)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://finance.sina.com.cn/tech/digi/2026-05-22/doc-inhytqkw6284792.shtml" rel="noopener noreferrer"&gt;Zhipu Releases GLM-5.1 High-Speed AI Model, Reaching a World-Fastest 400 tokens/s (Sina Tech)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.163.com/dy/article/KTHAMEVM05198UNI.html" rel="noopener noreferrer"&gt;Zhipu (02513) Launches GLM-5.1 High-Speed API: 400 tokens/s Sets a New Global Speed Ceiling (NetEase)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.itsolotime.com/archives/21459" rel="noopener noreferrer"&gt;TileRT v0.1.3 Released: GLM-5 Support Goes Live, Inference Speed Up to 600 tokens/s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://byteiota.com/claude-opus-4-7-fast-mode-2-5x-speed-6x-the-cost/" rel="noopener noreferrer"&gt;Claude Opus 4.7 Fast Mode: 2.5x Speed, 6x the Cost (byteiota)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groundy.com/articles/claude-code-fast-mode-6x-pricing-worth/" rel="noopener noreferrer"&gt;Claude Code /fast Mode: Is 6x Pricing Worth It? (Groundy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/codex/speed" rel="noopener noreferrer"&gt;Speed — OpenAI Codex Developers&lt;/a&gt; (GPT-5.5 Fast: 1.5× speed / 2.5× cost)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stable-learn.com/en/deepseek_inference_engine/" rel="noopener noreferrer"&gt;Open-Sourcing DeepSeek Inference Engine: A New Chapter in AI Infrastructure (StableLearn)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groq.com/lpu-architecture" rel="noopener noreferrer"&gt;Groq LPU Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cerebras.ai/blog/cerebras-cs-3-vs-groq-lpu" rel="noopener noreferrer"&gt;Cerebras CS-3 vs. Groq LPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://speko.ai/benchmark/groq-vs-cerebras" rel="noopener noreferrer"&gt;Groq vs Cerebras: LLM Inference Speed Comparison 2026 (Speko)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/inference-speed-new-species" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/inference-speed-new-species&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>inferenceoptimization</category>
      <category>infra</category>
      <category>reflections</category>
    </item>
    <item>
      <title>AMD Gave a Developer Award to Someone Who Can't Code</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 19 May 2026 08:07:36 +0000</pubDate>
      <link>https://dev.to/skyguan92/amd-gave-a-developer-award-to-someone-who-cant-code-418l</link>
      <guid>https://dev.to/skyguan92/amd-gave-a-developer-award-to-someone-who-cant-code-418l</guid>
      <description>&lt;p&gt;Today I went to AMD's developer conference in Shanghai.&lt;/p&gt;

&lt;p&gt;The entrance alone was a shock. The line to get in was long, and the hall was already packed before anything started. They expected just over 1,000 people; more than 2,000 showed up. AMD said it was their biggest recent event.&lt;/p&gt;

&lt;p&gt;Lisa Su showed up too. She'd been in Beijing the day before meeting Vice Premier He Lifeng to talk chip cooperation. I'd never seen a chip company pull a crowd like this for a developer conference.&lt;/p&gt;

&lt;h2&gt;
  
  
  AMD Gave an Award to Someone Who Can't Code
&lt;/h2&gt;

&lt;p&gt;That morning, an AMD senior VP got on stage to hand out two developer awards. When they introduced one winner, the host said:&lt;/p&gt;

&lt;p&gt;"He didn't actually know how to code before."&lt;/p&gt;

&lt;p&gt;He'd used an AI agent to rewrite an entire system in Rust and optimize performance. AMD figured that was worth an award.&lt;/p&gt;

&lt;p&gt;Sitting there, the whole thing felt surreal. A chip company worth hundreds of billions, at a 2,000-person developer conference, handed one of two awards to someone who doesn't code.&lt;/p&gt;

&lt;p&gt;I bet next year the award will be even harder to judge. AI-native people like him will only become more common.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Boundary Between Developers and Users Is Vanishing
&lt;/h2&gt;

&lt;p&gt;Before, if you used a product, you just used it. You couldn't really help build it. Even in open source, you had to code before you could contribute.&lt;/p&gt;

&lt;p&gt;Not anymore. Coding agents are getting stronger, and regular users can now tweak, optimize, and push changes back while using a product. The same person is both user and builder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://guanjiawei.ai/blog/ai-amplifies-passion" rel="noopener noreferrer"&gt;I wrote before about the split between Builders and Promoters&lt;/a&gt;. That was about passion diverging. This is the flip side: the roles of user and contributor now overlap, often in the same person. Users are also investing their tokens across different products, and the ones that earn that investment keep evolving.&lt;/p&gt;

&lt;p&gt;Product logic has shifted. You used to focus on making the experience great. Now you also need to make it easy for users to become contributors.&lt;/p&gt;

&lt;p&gt;AMD's big Strix Halo push is interesting. The AI Max+ 395 chip can allocate up to 96 GB of unified memory to its integrated GPU for running local models, and &lt;a href="https://guanjiawei.ai/blog/inference-engine-last-puzzle-piece" rel="noopener noreferrer"&gt;my inference engine can run on it too&lt;/a&gt;. Domestically, prices have been climbing and it's been out of stock. I have several R&amp;amp;D test machines for performance tuning, and they're also my entry point into the ROCm ecosystem.&lt;/p&gt;

&lt;p&gt;AMD is pushing this machine to lower the developer barrier another notch. More developers means stickier stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Industry Winds Did a 180 in Six Months
&lt;/h2&gt;

&lt;p&gt;I attended a similar conference around the middle of last year. The vibe was completely different.&lt;/p&gt;

&lt;p&gt;Back then, people in the compute business were stressed. Early last year, DeepSeek raised expectations for models, and everyone was wondering whether the wave would last. How to move product, how to clear inventory, whether the business could survive. Everyone was scrambling for solutions and partners.&lt;/p&gt;

&lt;p&gt;This year, the table talk completely changed. The first thing anyone says is, "Can you get me more supply?" or "I'll take everything you've got."&lt;/p&gt;

&lt;p&gt;It's completely flipped. Supply is tight, and whoever holds quality inventory is making money. The shift from demand anxiety to supply anxiety took just six months.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Isn't Another Bubble
&lt;/h2&gt;

&lt;p&gt;Plenty of people say: "Here we go again. Next metaverse."&lt;/p&gt;

&lt;p&gt;This time it really is different. I lived through the metaverse and blockchain cycles too. The difference this time is in the data, specifically paid demand from real users.&lt;/p&gt;

&lt;p&gt;Lisa Su said on stage that roughly 1 billion people are already using AI worldwide, and by 2030 that number will hit 5 billion daily active users. ChatGPT came out at the end of 2022, so it's been less than four years. The internet took over 20 years to reach that scale; the PC era took even longer. Diffusion this fast has no historical precedent.&lt;/p&gt;

&lt;p&gt;The money is keeping up too. Anthropic's Q1 grew 80x year-over-year. That's annualized revenue, not API calls. Dario himself said they weren't ready to catch a wave this big. Claude Code hit a $1 billion annualized run rate within six months of launch, and by April this year the company's overall ARR surged to $30 billion.&lt;/p&gt;

&lt;p&gt;This is nothing like a few years ago, when everyone was in a price war, handing out free tokens, and chasing call volume. Supply can't keep up with paid demand.&lt;/p&gt;

&lt;h2&gt;
  
  
  "X Is Dead" Is the Cheapest Narrative
&lt;/h2&gt;

&lt;p&gt;A friend recently asked me: "Is OpenClaw dead?" "How's Claude Code doing?" "I heard Codex is going to win."&lt;/p&gt;

&lt;p&gt;I think that's just inertia.&lt;/p&gt;

&lt;p&gt;Last December, every top academic conference and product circle was talking about Gemini. Back then everyone thought Google had it in the bag. A few months later, almost nobody mentioned Gemini. Then it was Cursor, then Claude Code. Pretty soon it'll be Codex. At the top table, players keep rotating.&lt;/p&gt;

&lt;p&gt;But the underlying trend runs one way. It hasn't reversed. Paid demand is rising, call volume is rising.&lt;/p&gt;

&lt;p&gt;Real information is expensive. You have to use the tools yourself, show up on-site, and talk to people inside. So the audience for that is naturally small. Narratives like "it's dead," "it's a bubble," or "just another cycle" are the cheapest to spin up. They validate sitting on the sidelines and feed the need to believe that not engaging was the right call. They spread the easiest.&lt;/p&gt;

&lt;p&gt;It's not that these narratives are convincing. Most people just want to believe them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go See for Yourself
&lt;/h2&gt;

&lt;p&gt;Lately when I meet friends, I do one thing: tell them to bring their laptop, and I help them install Claude Code or Codex and get the model connected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://guanjiawei.ai/blog/coding-agent-adoption-gap" rel="noopener noreferrer"&gt;Once you get past that hurdle, you can hand off 95% of computer work&lt;/a&gt;. I built my own website from scratch without lifting a finger. Frontend, DNS, SEO, all done by agents. The barrier is small, but once you're past it, the world looks completely different.&lt;/p&gt;

&lt;p&gt;If that's still too much, just find a conference this year where people are actually doing this and go. There were quite a few workshops at the event where you brought a laptop and worked hands-on. Only when you sit there do you realize how far AI has already come.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://technode.com/2026/05/19/amd-ceo-lisa-su-in-shanghai-predicts-five-billion-daily-ai-users-within-five-years/" rel="noopener noreferrer"&gt;AMD CEO Lisa Su in Shanghai Predicts 5 Billion Daily AI Users Within Five Years&lt;/a&gt; — On-site report from AMD AI Developer Day Shanghai, May 19, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://news.cgtn.com/news/2026-05-18/Chinese-vice-premier-meets-AMD-CEO-calls-for-deeper-cooperation-1NfLteD25Y4/p.html" rel="noopener noreferrer"&gt;Chinese Vice Premier He Lifeng Meets Lisa Su, Calls for Deeper Cooperation&lt;/a&gt; — May 18, 2026, He Lifeng meets AMD CEO Lisa Su in Beijing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cnbc.com/video/2026/01/06/amd-ceo-lisa-su-expect-over-5-billion-active-ai-users-in-the-next-five-years.html" rel="noopener noreferrer"&gt;CES 2026: Lisa Su Predicts Over 5 Billion AI Users in Five Years&lt;/a&gt; — Lisa Su first gave the 5-billion-user forecast during her CES keynote&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://venturebeat.com/technology/anthropic-says-it-hit-a-30-billion-revenue-run-rate-after-crazy-80x-growth" rel="noopener noreferrer"&gt;Anthropic Q1 Grew 80x, Annualized Run Rate Hits $30 Billion ARR&lt;/a&gt; — Dario Amodei publicly acknowledged 80x year-over-year Q1 growth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mindstudio.ai/blog/anthropic-30b-arr-4-months-pulling-ahead-openai" rel="noopener noreferrer"&gt;Anthropic's ARR Surged from $9 Billion to $30 Billion in 4 Months&lt;/a&gt; — Full ARR trajectory: Jan 2024 $87M → Dec 2024 $1B → End of 2025 $9B → Apr 2026 $30B&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/" rel="noopener noreferrer"&gt;Claude Code Surpassed $1 Billion Annualized Revenue Within Six Months of Launch&lt;/a&gt; — Claude Code is Anthropic's fastest-growing product&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://time.com/6253615/chatgpt-fastest-growing/" rel="noopener noreferrer"&gt;ChatGPT Is the Fastest-Growing Consumer Product in History to Reach 100 Million Users&lt;/a&gt; — Reached 100 million users in 2 months, faster than TikTok and Instagram&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://epoch.ai/gradient-updates/after-the-chatgpt-moment-measuring-ais-adoption" rel="noopener noreferrer"&gt;AI Adoption Speed Compared with Historical Technologies&lt;/a&gt; — Epoch AI research: 70% US household adoption took 40 years in 1900, shrinking to 17 years by 2000&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-max-plus-395.html" rel="noopener noreferrer"&gt;AMD Ryzen AI Max+ 395 (Strix Halo) Official Specs&lt;/a&gt; — 16 Zen 5 cores, Radeon 8060S, up to 128GB LPDDR5X unified memory (up to 96GB allocatable to GPU)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://videocardz.com/newz/amd-ryzen-ai-max-strix-halo-processors-now-available-for-standalone-purchase-in-china" rel="noopener noreferrer"&gt;Ryzen AI Max+ 395 Out of Stock and Rising in Price in China&lt;/a&gt; — Current tight supply situation for Strix Halo standard chips in China's retail market&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;原文链接：&lt;a href="https://guanjiawei.ai/en/blog/non-coder-award" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/non-coder-award&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>reflections</category>
    </item>
    <item>
      <title>Same /goal Feature, Two Agent Personalities</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 18 May 2026 04:59:32 +0000</pubDate>
      <link>https://dev.to/skyguan92/same-goal-feature-two-agent-personalities-5315</link>
      <guid>https://dev.to/skyguan92/same-goal-feature-two-agent-personalities-5315</guid>
      <description>&lt;p&gt;I've been using Codex's &lt;code&gt;/goal&lt;/code&gt; for weeks, and my token consumption has climbed another notch. Claude Code added the feature in its 2.1.139 release on May 12, straight to stable rather than experimental. I had a few tasks that Codex never quite managed to finish, so I moved them over to try.&lt;/p&gt;

&lt;p&gt;The contrast was stark. Same paradigm, nearly identical loop, yet the two models produced completely different results.&lt;/p&gt;

&lt;p&gt;I'm writing this partly to think it through, partly because it's worth sharing. &lt;code&gt;/goal&lt;/code&gt; isn't so much a feature as a new way of working. The form looks identical, but when the model's personality differs, the practical reality is entirely different.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Codex: Heads Down, No Questions, Never Quits
&lt;/h2&gt;

&lt;p&gt;Let me start with Codex as a baseline.&lt;/p&gt;

&lt;p&gt;Codex CLI's &lt;code&gt;/goal&lt;/code&gt; appeared as an experimental feature in 0.128.0. I've been using it since then and wrote about it previously. The real shift has been mental: I actually started believing that "letting the agent run" really works.&lt;/p&gt;

&lt;p&gt;It doesn't interrupt you. When running &lt;code&gt;/goal&lt;/code&gt;, Codex almost never calls subagents; it works inline unless I explicitly tell it to delegate. Compaction works better than I expected. After compressing, it picks up from the previous round without major information loss, and doesn't suddenly get dumber as it pushes forward. Most importantly, it's stubborn. It almost never tells me a goal is unachievable. Even when it hits a wall, it tries another angle, then another, until the token budget runs out. I've tested this repeatedly. I've left three independent &lt;code&gt;/goal&lt;/code&gt; sessions running overnight; in the morning, most are still on track.&lt;/p&gt;

&lt;p&gt;The context window is a genuine weak spot. Codex defaults to 400K under GPT-5.5. OpenAI balanced pricing and throughput there, while the API offers the full 1M. Claude Code defaults to 1M. But even with only 400K, Codex runs remarkably stable under &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Claude Code: Beautiful Opening, Then What?
&lt;/h2&gt;

&lt;p&gt;On May 12, Anthropic dropped &lt;code&gt;/goal&lt;/code&gt;, Agent View, &lt;code&gt;/bg&lt;/code&gt;, &lt;code&gt;/loop&lt;/code&gt;, and &lt;code&gt;/batch&lt;/code&gt; all at once. My first thought was "finally." Codex had been iterating on this for several versions; Claude Code felt a bit slow to catch up.&lt;/p&gt;

&lt;p&gt;I moved the tasks Codex couldn't crack over to Claude Code and started &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It started strong. Claude Code immediately spun up subagents, laid out plans, and orchestrated context. It looked far more ambitious than Codex. My expectations immediately rose. With an opening like this, it should outperform Codex.&lt;/p&gt;

&lt;p&gt;But as it ran, issues cropped up.&lt;/p&gt;

&lt;p&gt;The first thing that made me frown: it kept popping up to ask me to make choices. This is usually one of Claude Code's likable traits. Faced with a judgment call, it doesn't just plow ahead. It stops to align with you, asking which of directions A, B, or C you prefer. And the questions are usually on point. But under &lt;code&gt;/goal&lt;/code&gt;, this is a bug, not a feature. The whole point of &lt;code&gt;/goal&lt;/code&gt; is "you set the goal, I run myself, don't interfere." The model should own every intermediate judgment. When it pops out with questions, those hours of freed-up time are immediately lost. If you step away, it just sits there waiting for you to come back.&lt;/p&gt;

&lt;p&gt;More surprisingly, it proactively tells me it can't achieve the goal, and then it actually gives up on the goal, sometimes after just a few dozen minutes. The reason it gives is usually that the task seems too large for the session, or that there are fundamental blockers. When I tell it to continue, it reluctantly pushes forward a bit, then does it again.&lt;/p&gt;

&lt;p&gt;Third: it gets dumber after compaction. A 1M context window sounds huge, but Anthropic themselves have admitted that performance degrades over long runs. Worse is the compression step. After each compaction, Claude Code often seems to have forgotten everything that came before. The original plan, the pitfalls already encountered, the original context—all have to be pieced back together. Codex doesn't suffer from compaction nearly as badly.&lt;/p&gt;

&lt;p&gt;These three issues combined make long-horizon tasks unstable in Claude Code's &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. It's Not Just My Impression
&lt;/h2&gt;

&lt;p&gt;At first I thought it was my usage. Then I looked around and realized Opus 4.7's laziness was already common knowledge.&lt;/p&gt;

&lt;p&gt;Opus 4.7 was released on April 16. Within 48 hours, a Reddit thread titled "Opus 4.7 is not an upgrade but a serious regression" got over 2,300 upvotes. AMD's AI director publicly complained that Claude Code had become "dumber and lazier." Screenshots were everywhere. Someone posted a conversation where Claude itself replied, "I was acting lazily."&lt;/p&gt;

&lt;p&gt;Anthropic later published a postmortem, admitting that on April 16 they had added a "reduce verbosity" instruction to the system prompt. This instruction, along with a few other changes, dragged down coding quality. On April 20 they rolled it back. But my sense is that after the rollback, Opus 4.7's laziness only eased slightly. It didn't fully recover. The RL layer had already internalized this tendency. You can't fix that by tweaking a system prompt.&lt;/p&gt;

&lt;p&gt;In extended continuous operation like &lt;code&gt;/goal&lt;/code&gt;, this laziness gets amplified. A lazy model might get away with it on short tasks. Put it on a long task, and it will find all sorts of seemingly reasonable excuses to fail itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. These Past Few Months, We've Been Doing the Same Thing
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/goal&lt;/code&gt; didn't appear out of nowhere. It's the culmination of months of exploration.&lt;/p&gt;

&lt;p&gt;Before the Lunar New Year, I was already tinkering with something similar. At the time, I was doing stability testing for AIMA (our model management platform). The core idea was to have AI simulate real users running tests repeatedly to improve stability. The most naive attempt was using the terminal's built-in task mechanism. I set up 10 tasks, each running for a long time.&lt;/p&gt;

&lt;p&gt;This path died quickly. Each task was still in the same session, and models don't hold up well in long sessions. Within a few rounds, things destabilized, and no amount of prompt tuning could save it.&lt;/p&gt;

&lt;p&gt;Next I looked at a two-layer architecture. At the time, Kilo Code was pushing a feature called Orchestrator Mode, previously known as Boomerang Tasks, inherited from Roo Code. The logic was sound: an outer orchestrator manages tasks, delegates each subtask to an independent subagent running in its own context, then collects the results.&lt;/p&gt;
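
&lt;p&gt;For readers who haven't seen the pattern, here is a minimal Python sketch of that two-layer idea. It is not Kilo Code's implementation: &lt;code&gt;Subagent&lt;/code&gt; and &lt;code&gt;orchestrate&lt;/code&gt; are hypothetical names, and the model call is replaced by a stub that only records the subtask.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Subagent:
    """Hypothetical worker that starts each subtask with an empty context."""
    name: str
    context: list[str] = field(default_factory=list)

    def run(self, subtask: str) -> str:
        # A real subagent would call a model here; this stub only records the work.
        self.context.append(subtask)
        return f"{self.name} finished: {subtask}"


def orchestrate(subtasks: list[str]) -> list[str]:
    """Outer layer: delegate every subtask and keep only the short results."""
    results = []
    for index, subtask in enumerate(subtasks):
        worker = Subagent(name=f"subagent-{index}")  # fresh context per subtask
        results.append(worker.run(subtask))
    return results


if __name__ == "__main__":
    for line in orchestrate(["write tests", "fix flaky login"]):
        print(line)
```

&lt;p&gt;The point of the split is that the orchestrator only ever sees the one-line results, so its own context grows slowly while each subagent's working context is thrown away.&lt;/p&gt;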

&lt;p&gt;I tried a round with several cost-effective models available at the time. Zhipu's model performed slightly better and could push through long tasks for a while. MiniMax was more comical: it started writing code at the orchestrator layer itself and never delegated. The two-layer architecture simply failed on it. I thought about this for a while afterwards. It didn't seem like a harness adaptation issue. More likely, the model itself lacked the sense that it's the lead and should delegate.&lt;/p&gt;

&lt;p&gt;In February, Claude Code shipped Agent Teams alongside Opus 4.6. It was experimental, requiring the &lt;code&gt;CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS&lt;/code&gt; environment variable to enable. One session acts as team lead, dispatching other subagents to complete tasks in fresh context windows. This was essentially Kilo's architecture as an official implementation. I was genuinely impressed when I tried it. Long tasks could run for two or three hours without crashing.&lt;/p&gt;

&lt;p&gt;But after one compaction, the team lead side fell apart. Previously dispatched subagents couldn't be found, so it would redeploy a new round, task lists got misaligned, and tokens burned fast. The two-layer architecture itself suffered information decay. Context shuttled back and forth between layers, losing a bit each time.&lt;/p&gt;

&lt;p&gt;Then came Ralph, full name Ralph Wiggum. Australian developer Geoffrey Huntley built it at the end of 2025. The logic was so simple it was almost suspicious: a bash while-true loop, repeatedly feeding the same prompt file to an agent until the goal is achieved. I tried to test its tmux version at the time, hit some snags, and shelved it.&lt;/p&gt;
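
&lt;p&gt;The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Huntley's script: the &lt;code&gt;agent&lt;/code&gt; callable stands in for whatever CLI is being driven, and the &lt;code&gt;done_marker&lt;/code&gt; string and &lt;code&gt;max_rounds&lt;/code&gt; cap are hypothetical additions so the sketch terminates (the original is a plain while-true loop).&lt;/p&gt;

```python
from pathlib import Path
from typing import Callable


def ralph_loop(prompt_file: Path, agent: Callable[[str], str],
               done_marker: str = "DONE", max_rounds: int = 50) -> int:
    """Feed the same prompt to the agent until its output contains done_marker.

    Returns the number of rounds used, or -1 if max_rounds was exhausted.
    """
    prompt = prompt_file.read_text(encoding="utf-8")
    for round_number in range(1, max_rounds + 1):
        output = agent(prompt)  # same prompt every round, nothing carried over
        if done_marker in output:
            return round_number
    return -1
```

&lt;p&gt;Because the prompt never changes, any progress has to live outside the loop, for example in the files the agent edits. That is what keeps each round's context fresh.&lt;/p&gt;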

&lt;p&gt;Ralph caught on extremely fast. It's the most direct inspiration for the &lt;code&gt;/goal&lt;/code&gt; product line. Today, Anthropic has absorbed Ralph as an official Claude Code plugin, parked under &lt;code&gt;plugins/ralph-wiggum/&lt;/code&gt; in the repo. Kilo Code's Orchestrator Mode, conversely, has been officially marked deprecated. The reason given: "the main agent can now delegate directly to subagents, so a dedicated orchestrator is no longer needed."&lt;/p&gt;

&lt;p&gt;Hand-rolled terminal tasks, to Kilo Orchestrator, to Claude Code Agent Teams, to Ralph going viral, to Codex shipping &lt;code&gt;/goal&lt;/code&gt;, to Claude Code shipping &lt;code&gt;/goal&lt;/code&gt;, to Ralph being absorbed and Kilo Orchestrator deprecated. The evolutionary thread of these past few months is clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Codex Dives In; Claude Code Keeps Looking Up
&lt;/h2&gt;

&lt;p&gt;Back to the models themselves.&lt;/p&gt;

&lt;p&gt;After running both, I have a fairly solid judgment. Codex is the "local" faction. Claude Code is the "global" faction.&lt;/p&gt;

&lt;p&gt;Math has a concept called local optima. The optimization space is like a valley. Starting from one point and walking downhill, you might end up in a local minimum, but over the next ridge there's a deeper valley. I've watched Codex fall into these local optima repeatedly during &lt;code&gt;/goal&lt;/code&gt;. It polishes one direction, does this and that, circles back, thinks it's moving forward, but is actually treading water. Its heads-down approach is usually a strength. In these moments it becomes a weakness.&lt;/p&gt;
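
&lt;p&gt;A toy Python example makes the local-optimum picture concrete. The function, step size, and starting points below are illustrative choices, not anything measured from Codex: plain gradient descent only ever walks downhill, so where it ends up depends entirely on where it starts.&lt;/p&gt;

```python
def f(x: float) -> float:
    """Toy landscape: a shallow valley near x = 1.13, a deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x


def slope(x: float) -> float:
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def descend(start: float, step: float = 0.01, rounds: int = 2000) -> float:
    """Plain gradient descent: always walk downhill from the starting point."""
    x = start
    for _ in range(rounds):
        x -= step * slope(x)
    return x


if __name__ == "__main__":
    for start in (2.0, -2.0):
        x = descend(start)
        print(f"start={start:+.1f} ends at x={x:.2f}, f(x)={f(x):.2f}")
```

&lt;p&gt;Starting from 2.0, the walk settles in the shallow valley and never sees the deeper one on the other side of the ridge. Only a different starting point, or a deliberate jump, reaches it.&lt;/p&gt;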

&lt;p&gt;Claude Code is different. It performs large-span reflection and validation, proactively asking whether its current direction is right. I've repeatedly seen it jump out of what looked like a converging direction, saying "wait, the root of this problem might not be here, I need to reconsider," and then actually find a better path.&lt;/p&gt;

&lt;p&gt;This global view is Claude Code's strength. For complex tasks lasting one to two hours and requiring judgment, reflection, and cross-module coordination, I still think Claude Code outperforms Codex.&lt;/p&gt;

&lt;p&gt;But this global view doesn't buy endurance. It can't run long under &lt;code&gt;/goal&lt;/code&gt;, and can't deliver stable 24-hour unattended output. An imperfect analogy: Codex is an intern who can grind for 12 hours straight, occasionally drifting off course. Claude Code is a senior engineer with good judgment, but he needs to check in every 40 minutes, or decides after 30 minutes that this is too hard and he's out. Which is better suited for &lt;code&gt;/goal&lt;/code&gt;? The answer is obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Form Converges, Training Diverges
&lt;/h2&gt;

&lt;p&gt;After running this comparison, I have an additional read on where coding agents are heading in the coming months. Harnesses are rapidly converging, but model personality differences will become increasingly prominent.&lt;/p&gt;

&lt;p&gt;Boris Cherny (creator of Claude Code) has been saying that in the future, a harness might be just 100 lines of code. I believe this even more now. Once the &lt;code&gt;/goal&lt;/code&gt; paradigm converges, the outer structure of coding agents will get thinner and thinner. A loop, a set of tools, a goal. That's enough.&lt;/p&gt;

&lt;p&gt;What will truly determine the gap is the model's personality within this loop. Whether it's willing to put its head down and work. Whether it keeps popping out to align with humans. Whether its state survives compaction. Whether it can jump out when stuck in a wrong direction. When it hits a wall, does it try again, or say it can't do the goal and bail?&lt;/p&gt;

&lt;p&gt;None of these can be fixed with prompting. They're set during training.&lt;/p&gt;

&lt;p&gt;OpenAI and Anthropic have already trained distinctly different model personalities for long-horizon tasks. Codex seems to have been trained into "never give up, hit the wall and try again." Claude Code seems trained to "report frequently, align frequently, reflect frequently." That's endearing in interactive scenarios, but fatal under &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the short term, this divergence is hard to bridge. Even after Anthropic rolled back that verbosity system prompt, Opus 4.7's laziness only eased. It didn't fully recover. RL internalized it. You can't fix that by changing outer prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Choosing an Agent Is Increasingly Like Choosing a Partner
&lt;/h2&gt;

&lt;p&gt;At this point, the way I use &lt;code&gt;/goal&lt;/code&gt; has changed.&lt;/p&gt;

&lt;p&gt;I no longer start by asking which tool is stronger. Instead, I ask: which model's personality fits this task?&lt;/p&gt;

&lt;p&gt;For iterations lasting over six hours, with a clear goal and low trial-and-error cost, I just fire up Codex &lt;code&gt;/goal&lt;/code&gt;. For architectural judgment, cross-module decisions, possible mid-course direction changes, I use Claude Code &lt;code&gt;/goal&lt;/code&gt;, but I check back every 30 to 60 minutes, mentally prepared for it to pop out with questions. For truly unattended 24-hour runs, it has to be Codex, and the task direction needs to be clearly nailed down upfront. If it's just a single hard problem requiring global vision, I actually don't use &lt;code&gt;/goal&lt;/code&gt; at all. I use Claude Code in normal mode and knock it out in 30 minutes.&lt;/p&gt;

&lt;p&gt;A few months ago, choosing an agent meant choosing UI, community, pricing. Now it's more about choosing a model personality.&lt;/p&gt;

&lt;p&gt;Next-generation models, whether from Anthropic or OpenAI, will definitely train toward fixing the other side's weakness. Codex will try to add global vision; Claude Code will try to add endurance. In the short term, this personality divergence remains real, and it significantly affects how much value you can extract from &lt;code&gt;/goal&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The biggest effect of &lt;code&gt;/goal&lt;/code&gt; is that it amplifies a model's true personality into 24 hours of continuous output. The one with the steadier personality wins this round.&lt;/p&gt;

&lt;p&gt;Right now, Codex leads by half a step. But only half a step.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://explainx.ai/blog/claude-code-goal-command-long-running-agents-2026" rel="noopener noreferrer"&gt;Claude Code 2.1.139 adds /goal command — explainx.ai&lt;/a&gt;: Claude Code &lt;code&gt;/goal&lt;/code&gt; launch notes, May 12, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.geeky-gadgets.com/claude-code-agent-view-update/" rel="noopener noreferrer"&gt;Claude Code Agent View, Goal Command, and Background Sessions Update — Geeky Gadgets&lt;/a&gt;: Detailed overview of Claude Code 2.1 features&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://devinterrupted.substack.com/p/inventing-the-ralph-wiggum-loop-creator" rel="noopener noreferrer"&gt;Inventing the Ralph Wiggum Loop — Dev Interrupted&lt;/a&gt;: Geoffrey Huntley on inventing Ralph&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md" rel="noopener noreferrer"&gt;Ralph Wiggum 官方 Claude Code plugin — GitHub&lt;/a&gt;: Anthropic has absorbed Ralph as an official plugin&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kilo.ai/docs/code-with-ai/agents/orchestrator-mode" rel="noopener noreferrer"&gt;Kilo Code Orchestrator Mode (Deprecated)&lt;/a&gt;: Current status of Kilo Code Orchestrator&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/agent-teams" rel="noopener noreferrer"&gt;Orchestrate teams of Claude Code sessions — Claude Code Docs&lt;/a&gt;: Agent Teams official documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deepakness.com/raw/claude-code-agent-teams/" rel="noopener noreferrer"&gt;Claude Code experimental agent teams — DeepakNess&lt;/a&gt;: Agent Teams release notes alongside Opus 4.6&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.buildfastwithai.com/blogs/claude-opus-4-7-regression-explained-2026" rel="noopener noreferrer"&gt;Claude Opus 4.7 Regression Explained — buildfastwithai&lt;/a&gt;: Opus 4.7 regression and community feedback&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://shimin.io/journal/opus-4-7-just-lazy/" rel="noopener noreferrer"&gt;Opus 4.7 isn't dumb, it's just lazy — Shimin Zhang&lt;/a&gt;: Analysis of Opus 4.7's laziness issue&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/april-23-postmortem" rel="noopener noreferrer"&gt;An update on recent Claude Code quality reports — Anthropic Engineering&lt;/a&gt;: Anthropic official postmortem on rolling back the verbosity system prompt&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openai/codex/issues/19464" rel="noopener noreferrer"&gt;GPT-5.5 Codex 400K context window — GitHub Issue&lt;/a&gt;: Codex 400K context window limit explained&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newsletter.pragmaticengineer.com/p/building-claude-code-with-boris-cherny" rel="noopener noreferrer"&gt;Boris Cherny on Claude Code's future — Pragmatic Engineer&lt;/a&gt;: The "100 lines of code" prediction&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;原文链接：&lt;a href="https://guanjiawei.ai/en/blog/goal-two-personalities" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/goal-two-personalities&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codingagent</category>
      <category>codex</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>AI Transformation Doesn't Come from Training</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Sat, 16 May 2026 13:19:33 +0000</pubDate>
      <link>https://dev.to/skyguan92/ai-transformation-doesnt-come-from-training-4c67</link>
      <guid>https://dev.to/skyguan92/ai-transformation-doesnt-come-from-training-4c67</guid>
      <description>&lt;p&gt;Lately, when AI agents come up in conversation with friends, I've fallen into a habit. I pull out my phone, remote into my computer, and show them the agents I've had running over the past 24 hours. One has been autonomously chasing a goal for over ten hours straight. Another is running experiments and tuning parameters.&lt;/p&gt;

&lt;p&gt;Their reactions are pretty much always the same: "Oh, so it's already at this stage. That's not what I pictured."&lt;/p&gt;

&lt;p&gt;What they say next is the interesting part.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. "Help Me Explain This to My Boss"
&lt;/h2&gt;

&lt;p&gt;After watching, the first thing friends say usually goes like this:&lt;/p&gt;

&lt;p&gt;"Can you come explain this to our boss?"&lt;/p&gt;

&lt;p&gt;"I want to bring our tech lead over to see this."&lt;/p&gt;

&lt;p&gt;"Can you give us a training session?"&lt;/p&gt;

&lt;p&gt;Fair enough. You think this is important, and you want to bring in the people who need to see it. Good instinct.&lt;/p&gt;

&lt;p&gt;But looking back at how AI agents actually spread through our own company, real change never came from a single class or presentation.&lt;/p&gt;

&lt;p&gt;It started with someone who just did it themselves. They created something in their own work that made people around them do a double take.&lt;/p&gt;

&lt;p&gt;A salesperson who suddenly talks like an engineer, while closing deals faster than ever. An admin or HR person who turns out to be doing technical work and marketing, shipping product-grade work from a role that never used to do that. People around them start to wonder. Why don't you seem like the same salesperson, the same admin anymore?&lt;/p&gt;

&lt;p&gt;At that point, the curious ones show up naturally. Colleagues, bosses, friends. The change is happening right beside them, they can see it, and only then do they actually absorb what you're saying. Then it spreads from you to the next colleague, and the next, and out from there.&lt;/p&gt;

&lt;p&gt;To be honest, trying to drive change by "getting the boss to sit through a lesson" rarely works. Unless that boss personally got their hands dirty on day one. Because right now, knowing what AI can and can't do comes entirely from bumping up against its boundaries yourself, not from hearing about them.&lt;/p&gt;

&lt;p&gt;The data backs this up. A BCG report from early 2025 said 75% of executives rank AI as a top-three priority, but only a quarter have actually captured significant value. McKinsey put it more bluntly: 70% of employees skip their company's formal AI training videos entirely, learning instead by tinkering and word of mouth.&lt;/p&gt;

&lt;p&gt;Training can only convey so much. What's scarier is that someone who hasn't deeply used AI themselves, if they go on to set policy, easily falls into one of two extremes. Either they fantasize that AI can do anything, piling on unrealistic KPIs that make their team's life miserable while they think it's all simple. Or they dismiss it entirely—"another bubble, here we go again"—and miss the real window.&lt;/p&gt;

&lt;p&gt;So the first misconception, and I think the biggest: don't start by trying to change others. Start with yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. A PhD-Level AI Writing Weekly Reports
&lt;/h2&gt;

&lt;p&gt;The second thing really strikes me as a shame.&lt;/p&gt;

&lt;p&gt;A lot of top companies give their employees excellent AI infrastructure. The best models, unlimited usage, loose policies. But most people, once they get access, instinctively reach for the most routine tasks. Meeting summaries. Reports. Weekly and monthly updates. And then they stop.&lt;/p&gt;

&lt;p&gt;I'm not saying those tasks don't matter; AI really is useful for them. But stopping there is a waste.&lt;/p&gt;

&lt;p&gt;If you look deeper along the company's value chain, at the most painful links, whether that's marketing, sales, the product itself, or R&amp;amp;D, couldn't AI do something there too? You don't have to be an expert in that domain, but your industry understanding plus AI's execution ability could let you build something at those nodes.&lt;/p&gt;

&lt;p&gt;Think about it. A PhD-level AI told to write weekly reports will dutifully write weekly reports. It does what you assign. But tell it to research cutting-edge math, biology, or medicine, to run experiments and work through deductions, and it does that well too. One's a clerk, the other's a scientist. The gap is massive.&lt;/p&gt;

&lt;p&gt;Worklytics data says that within an organization, truly deep AI power users probably account for only 20–30%. The rest hold the exact same tools and use them only for the shallowest tasks. A BCG report from October 2025 also noted that 74% of enterprises get stuck when trying to expand AI adoption. It's not that the tools don't work. It's that the users only used one corner of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Long-Term Without Short-Term Is Unsustainable
&lt;/h2&gt;

&lt;p&gt;This one is harder to spot than the first two.&lt;/p&gt;

&lt;p&gt;After using AI for a while, a lot of people go through an emotional arc. At first they're amazed: "This is so powerful." Then they gradually shift to: "What exactly should I do?" The directions seem plentiful, all viable, but deciding specifically what to do and how to keep going is actually the hardest part.&lt;/p&gt;

&lt;p&gt;I've fallen into this trap myself.&lt;/p&gt;

&lt;p&gt;AI agents can do remarkable things, but they don't grant wishes. For some bigger directions, agents still burn through massive amounts of tokens and take forever. They need round after round of experimentation to explore and tune before they might yield results. They might not yield anything at all. You're at the boundary of knowledge, and probing forward was never easy. If you bet everything on projects like that, it's easy for your enthusiasm to fizzle out. You work for ages without seeing results, and when people ask what you're doing, you can't really explain it.&lt;/p&gt;

&lt;p&gt;So you need a mix.&lt;/p&gt;

&lt;p&gt;Short-term things with fast positive feedback. My shortest feedback loop comes from &lt;a href="https://guanjiawei.ai/blog/digital-identity-biggest-leverage" rel="noopener noreferrer"&gt;working on my digital identity&lt;/a&gt;. Optimizing my website for SEO and having people find me through search. Writing blog posts and having readers get something out of them and want to share and engage. In between, I do small AI projects for friends. &lt;a href="https://guanjiawei.ai/blog/worker-before-manager" rel="noopener noreferrer"&gt;Helping a friend with a crawfish business&lt;/a&gt;. Making games for people. All of them show results quickly.&lt;/p&gt;

&lt;p&gt;Mid-term, you need products that accumulate. The AIMA system, for example. When I show it to potential partners, some are willing to install it and promote it. That's a sturdier kind of positive feedback than "I ran an experiment."&lt;/p&gt;

&lt;p&gt;And those deep, long-term explorations in the trenches keep running quietly in the background.&lt;/p&gt;

&lt;p&gt;Kotter's eight-step change model has a step called "Generate Short-Term Wins." Same idea. Short-term results sustain confidence, giving you the nerve to keep chewing on hard problems. If the process also brings in some revenue to cover the token costs, the positive loop gets even stronger.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Prompt Engineering Is Yesterday's News
&lt;/h2&gt;

&lt;p&gt;The last one, and I think a lot of people are still stuck here.&lt;/p&gt;

&lt;p&gt;When people talk about using AI, they still fixate on prompts, thinking they need to master prompt engineering.&lt;/p&gt;

&lt;p&gt;That was fine two years ago. Not anymore.&lt;/p&gt;

&lt;p&gt;Give today's models a goal and a couple of sentences, and they'll go execute complex tasks. Prompts stopped being the bottleneck a while ago.&lt;/p&gt;

&lt;p&gt;The bottleneck is the harness: how to build an environment where the agent can actually get work done.&lt;/p&gt;

&lt;p&gt;What you need to think about has changed. How do you design the document structure of the working directory? How do you give it machines for experiments? When do you check if it's gone off track? When should you have it pivot direction or change methods? How do you do periodic summaries and archiving?&lt;/p&gt;

&lt;p&gt;In early 2025 Karpathy coined the term "vibe coding," casually using natural language to have AI write code, very freeform. A year later, looking back, he said the industry had moved from vibe coding to "agentic engineering," with value shifting up from syntax and implementation to judgment, taste, and management capability. Shopify's Tobi Lutke offered another term, "context engineering." It's not about how to write a good prompt, but about how to fill the agent's context window with the right information.&lt;/p&gt;

&lt;p&gt;At the end of the day, AI is a digital employee. When you work with an employee, you don't think the most important thing is crafting their first email, right? That email is a tiny piece. What you really need to figure out is how to set up a proper work environment and guidance that leverages your sense of direction and their execution power, while steering clear of the mistakes they're prone to make.&lt;/p&gt;

&lt;p&gt;Shift your thinking from "how to write one good sentence" to "how to manage a digital employee," and collaborating with AI feels completely different.&lt;/p&gt;




&lt;p&gt;Looking back, these four points are really one thing.&lt;/p&gt;

&lt;p&gt;Start doing it yourself. Don't wait for others. Once you do, don't stay in the comfort zone. Look deeper along the value chain. Set your own rhythm so short-term feedback never dries up. And shift your attention from prompts to environment and collaboration.&lt;/p&gt;

&lt;p&gt;The change you create doesn't need pushing. It spreads on its own.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;BCG, &lt;em&gt;From Potential to Profit: Closing the AI Impact Gap&lt;/em&gt;, January 2025.&lt;/li&gt;
&lt;li&gt;McKinsey, &lt;em&gt;Superagency in the Workplace: Empowering People to Unlock AI's Full Potential at Work&lt;/em&gt;, 2025.&lt;/li&gt;
&lt;li&gt;Tobi Lutke (Shopify CEO), Internal Memo on AI Usage Expectations, April 2025.&lt;/li&gt;
&lt;li&gt;Andrej Karpathy, Sequoia AI Ascent 2026: From Vibe Coding to Agentic Engineering, April–May 2026.&lt;/li&gt;
&lt;li&gt;Tobi Lutke &amp;amp; Andrej Karpathy on "Context Engineering," 2025.&lt;/li&gt;
&lt;li&gt;BCG, &lt;em&gt;The Widening AI Value Gap&lt;/em&gt;, October 2025.&lt;/li&gt;
&lt;li&gt;Worklytics, &lt;em&gt;AI Adoption Benchmarks 2025&lt;/em&gt;, Q3 2025.&lt;/li&gt;
&lt;li&gt;McKinsey, &lt;em&gt;The State of AI in 2025&lt;/em&gt;, March 2025.&lt;/li&gt;
&lt;li&gt;John P. Kotter, &lt;em&gt;Leading Change: Generate Short-Term Wins&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/ai-transformation-not-from-training" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/ai-transformation-not-from-training&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>thoughts</category>
    </item>
    <item>
      <title>Two Generations Was All It Took</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 15 May 2026 03:43:43 +0000</pubDate>
      <link>https://dev.to/skyguan92/two-generations-was-all-it-took-16j5</link>
      <guid>https://dev.to/skyguan92/two-generations-was-all-it-took-16j5</guid>
      <description>&lt;p&gt;Yesterday I watched the footage of Trump's state visit to China, and honestly it hit me. Red carpet, military band, state dinner. Musk, Jensen Huang, Tim Cook all came along, even Defense Secretary Hegseth. First time a US president visited China in nearly nine years.&lt;/p&gt;

&lt;p&gt;Think about who this is. The president of the most powerful country on earth, bringing some of the biggest names in tech, sitting down to talk. Not coming to lecture. Coming to negotiate.&lt;/p&gt;

&lt;p&gt;Everyone knows Trump's style. With countries he considers weaker, he doesn't even bother pretending — might makes right. These past few years, the way many world leaders have looked standing next to him has been, frankly, painful to watch. Some of it bordered on comical.&lt;/p&gt;

&lt;p&gt;But watch him in China. Completely different person. Polite, restrained, saying nice things.&lt;/p&gt;

&lt;p&gt;Why? Because you're strong enough. Weak countries in today's world have no dignity to speak of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stand first
&lt;/h2&gt;

&lt;p&gt;When the People's Republic was founded in 1949, how bad was it? A century of getting beaten from every direction. Foreign powers, the Japanese, civil war. The country had nothing left.&lt;/p&gt;

&lt;p&gt;But the first priority wasn't getting rich. It was making sure nobody could beat you again.&lt;/p&gt;

&lt;p&gt;The Korean War broke out in 1950. When Chinese volunteers crossed the Yalu River, the country's per capita GDP was a few dozen dollars. The Americans had tanks, artillery, fighter jets. China didn't even have an air force, and logistics were basically nonexistent. Under those conditions, over a million troops went in, fighting on guts and willingness to die, and pushed the front line back to a ceasefire.&lt;/p&gt;

&lt;p&gt;Over a hundred thousand never came home.&lt;/p&gt;

&lt;p&gt;Then came the Two Bombs, One Satellite program.&lt;/p&gt;

&lt;p&gt;First atomic bomb in 1964. Hydrogen bomb test in 1967, just 32 months from fission to fusion, the fastest of any nuclear state. "Dongfanghong-1" satellite launched in 1970. China became the fifth country to independently put a satellite in orbit.&lt;/p&gt;

&lt;p&gt;Of the 23 scientists honored for this, 10 had studied in America, 6 in Britain, others in France, Germany, the Soviet Union. They finished their studies abroad and came back. Back to a country that had nothing. In the chaos of the Great Leap Forward and the Cultural Revolution, they built world-class strategic technology.&lt;/p&gt;

&lt;p&gt;What made it extraordinary? The window.&lt;/p&gt;

&lt;p&gt;Try building nuclear weapons today. You can't. The treaties killed that. The window closed. Those scientists, working with raw brilliance and pure stubbornness on barren ground, grabbed it while it was still open.&lt;/p&gt;

&lt;h2&gt;
  
  
  From poor to prosperous
&lt;/h2&gt;

&lt;p&gt;The first thirty years solved the "can't be beaten" problem. Next: "can't eat."&lt;/p&gt;

&lt;p&gt;Deng Xiaoping's Southern Tour in 1992. Reform and opening up had nearly stalled, conservative forces were gaining ground. An 87-year-old man, instead of arguing with bureaucrats in Beijing, went straight to Wuhan, Shenzhen, and Zhuhai and said one line: "Development is the only hard truth."&lt;/p&gt;

&lt;p&gt;GDP growth went from 3.9% in 1990 to 14.3% in 1992. That same year, the 14th Party Congress formalized the "socialist market economy."&lt;/p&gt;

&lt;p&gt;I was born right at that inflection point. For as long as I can remember, China was growing. My generation got lucky — never went hungry, never lived through a war.&lt;/p&gt;

&lt;p&gt;There's a concept in economics called the middle income trap. The World Bank studied it: out of 101 middle-income economies since 1960, only 13 made it to high income. South Korea, Taiwan, Hong Kong, Singapore. That's the short list.&lt;/p&gt;

&lt;p&gt;Now it's China's turn.&lt;/p&gt;

&lt;p&gt;Per capita GNI in 2024 was roughly $13,500. The World Bank's high-income line is $14,005, a 4% gap. Probably cleared within a year or two. A 1.4-billion-person economy crossing that line. Never happened before.&lt;/p&gt;

&lt;p&gt;Why do some countries get stuck? It's not a shortage of smart people. Some countries have plenty of brilliant minds. But the talent leaves and doesn't come back. Society itself is too fractured, no stable foundation to channel all that energy into something coherent. Having a big population and having a deep talent pool are different things. Look at India and Brazil.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI parallel
&lt;/h2&gt;

&lt;p&gt;Back to windows of opportunity. AI works the same way.&lt;/p&gt;

&lt;p&gt;I wrote a piece before about how the AI industry is already in wartime. Look at global AI today. The only two players that actually compete at the frontier model level are the US and China.&lt;/p&gt;

&lt;p&gt;Stanford's 2026 AI Index has some interesting numbers. The top US model leads the top Chinese model by just 2.7%. DeepMind's CEO Hassabis himself said the gap is only "a few months."&lt;/p&gt;

&lt;p&gt;But there's another number that's even more interesting: US private AI investment totaled $285.9 billion. China's was $12.4 billion. A 23x spending gap producing less than a 3-point performance gap. So who's more efficient?&lt;/p&gt;

&lt;p&gt;Europe has Mistral, valued at €11.7 billion and growing. But on the frontier model leaderboards, the gap between Mistral and the US-China top tier is clear. Everywhere else isn't even in the conversation.&lt;/p&gt;

&lt;p&gt;Why can only the US and China compete?&lt;/p&gt;

&lt;p&gt;I think the answer is the same as why those scientists pulled off Two Bombs, One Satellite some sixty years ago. Stable environment, sustained investment in education, deep enough talent base, and making the right calls when the window was open. China now holds close to 70% of global AI patents, and leads in research output and industrial robot deployment.&lt;/p&gt;

&lt;p&gt;Foundation, environment, timing. Take away any one and it falls apart. Same logic as sixty years ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  None of this was guaranteed
&lt;/h2&gt;

&lt;p&gt;From 1949 to 2026. Seventy-seven years. About two generations.&lt;/p&gt;

&lt;p&gt;Zoom in a bit. My parents' generation, thirty, forty years ago, still going hungry. One generation before that, Li Hongzhang signing the Treaty of Shimonoseki after the Sino-Japanese War, ceding Taiwan and the Liaodong Peninsula. Then the Boxer Protocol after the Eight-Nation Alliance. World War II ended, China was one of the victors, and its territory was still carved up at will.&lt;/p&gt;

&lt;p&gt;Less than a century later, this country stands at the dead center of the world stage, sitting across the table from the most powerful nation on earth as an equal.&lt;/p&gt;

&lt;p&gt;Flip through history. Britain's rise via the Industrial Revolution took the better part of a century. Germany after unification, decades. Japan from the Meiji Restoration to genuine great-power status, about the same. And all of them started from a much better position than China did.&lt;/p&gt;

&lt;p&gt;Watching yesterday's footage, it's worth stopping to think about what it actually took. People back then carrying millet and rifles, owning nothing, trading their lives for the space to survive. Scientists building world-class technology from absolutely nothing. Then generation after generation of ordinary people grinding it out until we got here. Sitting comfortably, eating whatever we want, drinking whatever we want, living with dignity.&lt;/p&gt;

&lt;p&gt;None of this fell from the sky.&lt;/p&gt;

&lt;p&gt;From standing up, to getting prosperous, to sitting at the center of the world while it falls apart around you. Two generations. That's all it took.&lt;/p&gt;

&lt;p&gt;So what about us? What does our generation do next?&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/2026_state_visit_by_Donald_Trump_to_China" rel="noopener noreferrer"&gt;Trump's 2026 State Visit to China&lt;/a&gt; — May 13–15, 2026, first US presidential visit to China in nearly nine years; Elon Musk, Jensen Huang, Tim Cook, and Defense Secretary Hegseth accompanied&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.pbs.org/newshour/world/who-was-on-trumps-plane-to-china-elon-musk-nvidia-ceo-and-more" rel="noopener noreferrer"&gt;Who Was on Trump's Plane to China (PBS)&lt;/a&gt; — delegation included multiple tech CEOs and the Secretary of Defense&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.cnbc.com/2026/05/14/trump-xi-beijing-summit-trade-taiwan-ai-iran-rare-earths-tariffs.html" rel="noopener noreferrer"&gt;Trump–Xi Beijing Summit Trade Talks (CNBC)&lt;/a&gt; — both sides reached "generally balanced and positive outcomes"&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Deng_Xiaoping%27s_southern_tour" rel="noopener noreferrer"&gt;Deng Xiaoping's Southern Tour&lt;/a&gt; — Jan–Feb 1992, visited Wuhan, Shenzhen, Zhuhai, Shanghai; GDP growth surged from 3.9% to 14.3%&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/China_in_the_Korean_War" rel="noopener noreferrer"&gt;China in the Korean War&lt;/a&gt; — 1950–1953, China deployed over 1 million volunteers under extreme material disadvantage&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Two_Bombs,_One_Satellite" rel="noopener noreferrer"&gt;Two Bombs, One Satellite&lt;/a&gt; — atomic bomb 1964, hydrogen bomb 1967, satellite 1970; 23 honored scientists&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://thebulletin.org/2024/04/the-short-march-to-chinas-hydrogen-bomb/" rel="noopener noreferrer"&gt;China's 32 Months from A-Bomb to H-Bomb (Bulletin of the Atomic Scientists)&lt;/a&gt; — the shortest fission-to-fusion timeline of any nuclear state&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cepr.org/voxeu/columns/trapped-china-and-middle-income-trap" rel="noopener noreferrer"&gt;The Middle Income Trap and China (CEPR)&lt;/a&gt; — World Bank high-income threshold $14,005; only 13 of 101 middle-income economies since 1960 successfully crossed&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.scmp.com/opinion/china-opinion/article/3319278/china-could-become-high-income-country-year-can-it-stay-one" rel="noopener noreferrer"&gt;China's Per Capita GNI Approaching High-Income Threshold (SCMP)&lt;/a&gt; — ~$13,500 in 2024, ~4% gap&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hai.stanford.edu/ai-index/2026-ai-index-report" rel="noopener noreferrer"&gt;Stanford 2026 AI Index Report&lt;/a&gt; — US–China AI performance gap narrowed to 2.7%; China holds ~70% of global AI patents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://vertu.com/lifestyle/the-global-ai-race-why-china-is-just-months-behind-the-us-according-to-deepminds-ceo/" rel="noopener noreferrer"&gt;DeepMind CEO: US–China AI Gap Is Only "Months"&lt;/a&gt; — Hassabis's assessment of the US–China AI gap&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.morganstanley.com/insights/articles/global-ai-race-us-vs-china-investment-opportunities" rel="noopener noreferrer"&gt;US–China AI Investment Gap (Morgan Stanley)&lt;/a&gt; — US private AI investment $285.9B vs China $12.4B&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Treaty_of_Shimonoseki" rel="noopener noreferrer"&gt;Treaty of Shimonoseki&lt;/a&gt; — signed 1895 after the First Sino-Japanese War by Li Hongzhang, ceding Taiwan, Penghu, and the Liaodong Peninsula&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Boxer_Protocol" rel="noopener noreferrer"&gt;Boxer Protocol&lt;/a&gt; — signed 1901 after the Eight-Nation Alliance; Li Hongzhang was China's signatory and died shortly after&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/two-generations" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/two-generations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>thoughts</category>
      <category>china</category>
      <category>ai</category>
      <category>internationalaffairs</category>
    </item>
    <item>
      <title>One Hour for the Demo, Three for the Production Line</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Thu, 14 May 2026 08:17:04 +0000</pubDate>
      <link>https://dev.to/skyguan92/one-hour-for-the-demo-three-for-the-production-line-3ad8</link>
      <guid>https://dev.to/skyguan92/one-hour-for-the-demo-three-for-the-production-line-3ad8</guid>
      <description>&lt;p&gt;You often see people online saying that in the AI era, reliability matters most.&lt;/p&gt;

&lt;p&gt;The first time I saw it, it sounded like a tired cliché. Every era gets assigned its own buzzword; "intelligence" and "execution" have already had their turn. Does "reliability" actually fit the AI era any better? Not really.&lt;/p&gt;

&lt;p&gt;A few recent projects finally drove the point home.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Characters a Day, Thousands of Assets Over a Hundred Days
&lt;/h2&gt;

&lt;p&gt;I made a PAW Patrol reading game for my son. Three Chinese characters a day, three hundred over a hundred days. The number looks small.&lt;/p&gt;

&lt;p&gt;The thing is, each of those hundred days is an independent mini-level. Every day needs about six or seven images and dozens of audio clips. The voices are cloned from a PAW Patrol character, each line matching a specific script. Add it up across a hundred days and you're looking at thousands of assets, easy.&lt;/p&gt;

&lt;p&gt;The day-one demo worked. My son and I sat there playing for twenty minutes, having fun. That's exactly where the problem started.&lt;/p&gt;

&lt;p&gt;I thought the rest was just running that demo ninety-nine more times. Turns out the real thing and the demo are two completely different beasts.&lt;/p&gt;

&lt;p&gt;I randomly picked two clips from the first batch of twenty. Both were bad. One dropped two characters; the other's emotion completely mismatched the line. Would I dare use the other eighteen? I sampled again. Still broken. With over a thousand assets across a hundred days, how many would actually be usable? I had no idea.&lt;/p&gt;

&lt;p&gt;That feeling of uncertainty is the critical part. It's not a minor issue; it stops you cold.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Extra Layer Is Called Quality Assurance
&lt;/h2&gt;

&lt;p&gt;Demos work because a human is watching. Generate one, listen, if it's no good try again, pick the best and keep it. The whole process is manual. The human is an invisible QA layer.&lt;/p&gt;

&lt;p&gt;To make it fully automated, you have to swap "human in the loop" for "model in the loop." That's QA.&lt;/p&gt;

&lt;p&gt;Sounds simple. Doing it opens a whole new world.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Audio Clip, Scored by Three or Four Models at Once
&lt;/h2&gt;

&lt;p&gt;I started digging into niche models in the industry. The usual suspects are ASR and TTS. ASR understands speech, TTS generates it. But scoring TTS output for quality? There's a whole category of models built just for that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNSMOS&lt;/strong&gt; is from Microsoft. Originally built to score noise-suppression algorithms, it doesn't need the original clean audio as reference; from a single clip it judges how much noise is present and whether the overall result is listenable. Later people found it's also sensitive to TTS artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NISQA&lt;/strong&gt; comes from Gabriel Mittag's team at TU Berlin. It includes a NISQA-TTS weight specifically for TTS naturalness. Instead of a single score, it breaks things down into dimensions: noise, coloration, discontinuity, loudness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UTMOS&lt;/strong&gt; is from the SaruLab team at the University of Tokyo, winner of the VoiceMOS Challenge 2022, and now the de-facto baseline for TTS scoring. I use it as the outermost backstop.&lt;/p&gt;

&lt;p&gt;Finally there's a &lt;strong&gt;reverse ASR&lt;/strong&gt; pass: feed the generated audio through Whisper, compare the transcript to the original script, and reject it if the gap is too big. It's the crudest check, but the most reliable.&lt;/p&gt;
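
&lt;p&gt;A minimal sketch of that reverse-ASR check, assuming the transcript has already been produced by an ASR model such as Whisper and arrives as plain text. The helper names and the 10% threshold are illustrative assumptions, not values from the project.&lt;/p&gt;

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text):
    """Lowercase and keep only letters and digits so punctuation is ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def char_error_rate(script, transcript):
    """Edit distance divided by the length of the normalized script."""
    ref, hyp = normalize(script), normalize(transcript)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


MAX_CER = 0.10  # assumed threshold; tune it on real clips


def passes_reverse_asr(script, transcript, max_cer=MAX_CER):
    """True when the transcript is close enough to the original script."""
    return max_cer >= char_error_rate(script, transcript)
```

&lt;p&gt;Comparing at the character level rather than the word level means a clip that drops two characters from a short line shows up as a large error rate instead of slipping through.&lt;/p&gt;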

&lt;p&gt;Add up the four scores; pass the threshold and it's good, fail and it triggers regeneration. I spent a day wiring it up, and the output was clearly better than running TTS alone.&lt;/p&gt;
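
&lt;p&gt;The gate described above can be sketched roughly as follows. The generator and the scorers are passed in as plain callables, so nothing here depends on the real DNSMOS, NISQA, UTMOS or Whisper APIs; the function names, the attempt cap and the return shape are all assumptions for illustration.&lt;/p&gt;

```python
def qa_passes(scores, threshold):
    """Sum the per-model scores and compare the total to one threshold."""
    return sum(scores.values()) >= threshold


def generate_with_qa(generate, scorers, threshold, max_attempts=9):
    """Regenerate until the summed score clears the threshold.

    generate: callable(attempt) returning a clip (stand-in for a TTS call).
    scorers:  dict mapping a scorer name to callable(clip) returning a float.
    Returns (clip, attempts_used); clip is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        clip = generate(attempt)
        scores = {name: score(clip) for name, score in scorers.items()}
        if qa_passes(scores, threshold):
            return clip, attempt
    return None, max_attempts
```

&lt;p&gt;Returning the attempt count alongside the clip makes it easy to log how many retries each line needed, which is the number that ends up driving the bill.&lt;/p&gt;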

&lt;h2&gt;
  
  
  From 1× to 3× Is Not an Exaggeration
&lt;/h2&gt;

&lt;p&gt;But the cost went through the roof.&lt;/p&gt;

&lt;p&gt;Before adding QA, I figured it would add maybe 50% more time. Pick out the bad ones and regenerate. Worst case, the new batch is all bad too, so you run it again. At most 1.5×.&lt;/p&gt;

&lt;p&gt;In reality, one hour became three.&lt;/p&gt;

&lt;p&gt;The reason is that the model simply can't clear a certain line. The same script, different random seeds, seven, eight, nine tries and it still can't pass the QA threshold. Sometimes you have to fall back to changing the prompt, the speaking rate, the emotion tag, just to squeeze it through. Every regeneration is a full model call, burning tokens each time.&lt;/p&gt;
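
&lt;p&gt;A back-of-the-envelope way to see why the 1.5× estimate breaks, assuming (as a simplification) that every attempt passes QA independently with the same probability and that retries are capped. Under that assumption a pass rate of about one in three is enough to turn one hour into roughly three. The function name and parameters are hypothetical.&lt;/p&gt;

```python
def expected_attempts(pass_rate, max_attempts):
    """Expected TTS calls per line when retries stop at max_attempts.

    Sums the chance of reaching each attempt, which has the closed form
    (1 - fail ** max_attempts) / pass_rate for a fixed pass rate.
    """
    if pass_rate == 0:
        return float(max_attempts)  # never passes, so every retry is used
    fail = 1.0 - pass_rate
    return (1.0 - fail ** max_attempts) / pass_rate
```

&lt;p&gt;Real pipelines are worse than this model suggests, because some scripts fail for a systematic reason and keep failing no matter how many seeds you try.&lt;/p&gt;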

&lt;p&gt;Run this in the cloud, billed by the minute or by the call, and racking up a hefty bill in minutes is no exaggeration. I later did a rough calculation: a voice generation task I had planned to run entirely in the cloud would cost roughly 7 to 10× the demo bill once you factor in retry rates.&lt;/p&gt;

&lt;p&gt;This was the math I hadn't done: from demo to production line, costs jump by multiples, not percentages.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Same Thing Happened Again on Another Project
&lt;/h2&gt;

&lt;p&gt;Recently I've been playing with another project called &lt;strong&gt;MiroFish&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's a pretty interesting open-source project by Guo Hangjiang, an undergraduate in China. It hit #1 on global GitHub Trending in March 2026, with investment from Chen Tianqiao of Shanda Group. It generates a large population of agents with personalities, memories, and social relationships, then runs them across two simulation platforms where they discuss, debate, form alliances, and shift opinions. Finally, a ReportAgent summarizes the conclusions of the entire evolution to predict how an event will unfold.&lt;/p&gt;

&lt;p&gt;My config wasn't large. About 54 agents per event, across a 20-round timeline. Every round requires every agent to run once. 54 × 20 = 1,080, so roughly 1,000 full calls.&lt;/p&gt;

&lt;p&gt;I used Kimi K2.6 Thinking. The problem is you can't turn off thinking mode; it thinks before every output. Thousands of thinking tokens per call is normal. Multiply by 1,000, and the token burn hurts.&lt;/p&gt;
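
&lt;p&gt;The arithmetic is simple enough to keep as a small helper. The agent and round counts come from the config above; the per-call token figures are placeholders you would replace with numbers from your own logs.&lt;/p&gt;

```python
def simulation_calls(agents, rounds):
    """Every agent runs once per round."""
    return agents * rounds


def token_budget(agents, rounds, thinking_tokens_per_call, output_tokens_per_call):
    """Total generated tokens for one simulated event."""
    calls = simulation_calls(agents, rounds)
    return calls * (thinking_tokens_per_call + output_tokens_per_call)


calls = simulation_calls(agents=54, rounds=20)
# 3000 thinking tokens and 200 output tokens per call are assumed figures.
tokens = token_budget(54, 20, thinking_tokens_per_call=3000, output_tokens_per_call=200)
print(calls, tokens)
```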

&lt;p&gt;After a few runs, I started wondering: does this scenario really need a top-tier model?&lt;/p&gt;

&lt;p&gt;Each agent, on its turn, just scans the context, says a line or casts a vote based on its persona, then gets aggregated. The intelligence threshold for each call is actually low. Swap in one of last year's models, something around GPT-4o level, and the results are probably similar, only faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mid-Tier Slot That No One Has Clearly Defined
&lt;/h2&gt;

&lt;p&gt;For the past year, one question has gone unanswered: what scenarios actually need a second-tier model? Everyone races for the best and most expensive, leaving the mid-tier in an awkward spot.&lt;/p&gt;

&lt;p&gt;I now see two very specific slots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First is quality assurance.&lt;/strong&gt; Judging whether an audio clip sounds natural, whether an image matches the style, whether a conversation stays on topic, these tasks require mid-tier intelligence. Using a top model here is like using Claude Opus to review GPT-4o's code. It works, but it's not cost-effective. A lightweight vision model plus a specialized scorer like NISQA costs far less than one top-tier call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second is large-scale agent simulation.&lt;/strong&gt; A setup like MiroFish strings together 1,000 inferences to reach a collective evolutionary result. It's not sensitive to the quality of any single call, but extremely sensitive to total cost. The "best model" for this scenario isn't the smartest; it's whatever gives you the best mix of per-token price and inference speed.&lt;/p&gt;

&lt;p&gt;These two scenarios hadn't been clearly spelled out because hardly anyone was actually batch-producing content at industrial scale. Once you need to generate thousands of audio clips or tens of thousands of agent inferences, the two slots jump right out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Second Reason for Going Local
&lt;/h2&gt;

&lt;p&gt;This is also when I finally understood why local compute matters so much.&lt;/p&gt;

&lt;p&gt;The two reasons usually cited for local deployment: speed and privacy. Both are valid, but neither is the most critical.&lt;/p&gt;

&lt;p&gt;The real killer is cost per call.&lt;/p&gt;

&lt;p&gt;An industrial pipeline is bound to retry heavily. Cloud TTS is billed by duration, token models by the call. Every retry is another invoice. Local is different. A DGX Spark running open-source models like F5-TTS or VoxCPM incurs zero marginal cost beyond electricity. Leave it running for a day and you get enough material for a week. Failed? Run it again, no big deal.&lt;/p&gt;

&lt;p&gt;This is the fundamental difference between cloud and local models in industrial scenarios. The former charges by usage; the latter only charges once for hardware. In a high-retry-rate pipeline, that gap gets magnified by orders of magnitude.&lt;/p&gt;

&lt;p&gt;The reason local deployment never made sense in past discussions is that everyone compared it to demo costs. A demo TTS run costs pennies. Set that against a local machine costing thousands, and the math never works. But compare it to industrial-scale costs, factoring in retry rates, QA, and agent simulation, and the math flips immediately.&lt;/p&gt;
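
&lt;p&gt;One way to make that flip concrete is a break-even count: how many usable assets it takes before the hardware pays for itself. This is a rough sketch with all prices in cents; the function name and every number fed into it are hypothetical, and it ignores depreciation and engineering time.&lt;/p&gt;

```python
import math


def break_even_assets(hardware_cost, cloud_price_per_attempt,
                      attempts_per_asset, local_cost_per_attempt=0):
    """Usable assets needed before local hardware beats per-call cloud billing.

    Returns None when local attempts are not cheaper than cloud attempts.
    """
    saving = (cloud_price_per_attempt - local_cost_per_attempt) * attempts_per_asset
    if not saving > 0:
        return None
    return math.ceil(hardware_cost / saving)
```

&lt;p&gt;The retry rate sits in the denominator, so going from one attempt per asset to eight cuts the break-even volume by a factor of eight. That is why comparing against demo costs gives the wrong answer.&lt;/p&gt;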

&lt;h2&gt;
  
  
  Three Tiers, Three Positions
&lt;/h2&gt;

&lt;p&gt;Writing this, I suddenly realized that industrialized AI content production actually needs three tiers of models running simultaneously.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Top-tier models&lt;/strong&gt; at the front, handling the hardest generation tasks. Expensive per call, but you don't run them often.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mid-tier models&lt;/strong&gt; for QA and agent simulation, handling high-volume, medium-intelligence tasks. Called repeatedly, so each call must stay cheap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local models&lt;/strong&gt; at the bottom doing the heavy lifting. Asset generation, vectorization, transcription, alignment, the grunt work. If it can run locally, don't send it to the cloud.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
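
&lt;p&gt;As a rough illustration, the three tiers can be written down as a routing table. The task labels are taken from the bullets above; the enum, the table and the fallback to the top tier for unknown tasks are assumptions made for this sketch.&lt;/p&gt;

```python
from enum import Enum


class Tier(Enum):
    TOP = "top"
    MID = "mid"
    LOCAL = "local"


# Illustrative task labels mapped to the tier that usually handles them.
ROUTES = {
    "hard_generation": Tier.TOP,
    "qa_scoring": Tier.MID,
    "agent_simulation": Tier.MID,
    "asset_generation": Tier.LOCAL,
    "vectorization": Tier.LOCAL,
    "transcription": Tier.LOCAL,
    "alignment": Tier.LOCAL,
}


def route(task_kind, can_run_locally=False):
    """Pick a tier; anything that can run locally stays local."""
    if can_run_locally:
        return Tier.LOCAL
    # Unknown tasks fall back to the top tier (an assumed, cautious default).
    return ROUTES.get(task_kind, Tier.TOP)
```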

&lt;p&gt;You won't find these three tiers in any official tutorial; the setup is still evolving. But once you actually get your hands dirty with industrial content production, you'll end up piecing together this structure yourself.&lt;/p&gt;

&lt;p&gt;Looking back at that opening line that sounded like empty talk, I actually think it understated things. In the AI era, what matters most isn't "reliability" itself; it's the cost curve of reliability. From demo to production line, that curve starts at 3×.&lt;/p&gt;

&lt;p&gt;Understand that curve, and you know how to spend money. Otherwise you'll budget for 1× and get a bill for 7×.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/publication/dnsmos-a-non-intrusive-perceptual-objective-speech-quality-metric-to-evaluate-noise-suppressors/" rel="noopener noreferrer"&gt;DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors (Microsoft Research)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2110.01763" rel="noopener noreferrer"&gt;DNSMOS P.835 ICASSP 2022 Paper (arXiv)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/gabrielmittag/NISQA" rel="noopener noreferrer"&gt;NISQA: Non-Intrusive Speech Quality and TTS Naturalness Assessment (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://depositonce.tu-berlin.de/items/3b08adda-9fe5-485a-ae0b-813c89975235" rel="noopener noreferrer"&gt;NISQA Speech Quality Corpus (TU Berlin DepositOnce)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sarulab-speech/UTMOS22" rel="noopener noreferrer"&gt;UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022 (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/SWivid/F5-TTS" rel="noopener noreferrer"&gt;F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OpenBMB/VoxCPM" rel="noopener noreferrer"&gt;VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation (OpenBMB)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/whisper" rel="noopener noreferrer"&gt;Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (OpenAI)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/moonshotai/kimi-k2-thinking" rel="noopener noreferrer"&gt;Kimi K2 Thinking API Pricing (OpenRouter)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/666ghj/MiroFish" rel="noopener noreferrer"&gt;MiroFish: A Simple and Universal Swarm Intelligence Engine (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://agentnativedev.medium.com/mirofish-swarm-intelligence-with-1m-agents-that-can-predict-everything-114296323663" rel="noopener noreferrer"&gt;MiroFish: Swarm Intelligence with 1M Agents That Can Predict Everything (Medium)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original post: &lt;a href="https://guanjiawei.ai/en/blog/from-demo-to-production" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/from-demo-to-production&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>contentproduction</category>
      <category>tts</category>
      <category>agents</category>
    </item>
    <item>
      <title>Three Questions After the AI Job Wave</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Wed, 13 May 2026 04:35:47 +0000</pubDate>
      <link>https://dev.to/skyguan92/three-questions-after-the-ai-job-wave-3hp7</link>
      <guid>https://dev.to/skyguan92/three-questions-after-the-ai-job-wave-3hp7</guid>
      <description>&lt;p&gt;The drive to hire is weak right now.&lt;/p&gt;

&lt;p&gt;Before, when you wanted to build something bigger, your gut reaction was "we need more people." Now it's the opposite: "how do we squeeze more out of who we already have?" Even bringing on interns feels less appealing.&lt;/p&gt;

&lt;p&gt;It's a small shift in sentiment, but it points to something that isn't reversing.&lt;/p&gt;

&lt;p&gt;Two days ago, Kim Yong-beom, policy chief at the South Korean presidential office, posted on Facebook: the "excess returns" of the AI era shouldn't belong only to individual companies; part should flow back to the public as a "citizen dividend." The next day, the KOSPI plunged 5.1%. He later clarified he wasn't suggesting confiscating profits, only discussing how to spend the "excess tax revenue" created by AI dividends. The market settled.&lt;/p&gt;

&lt;p&gt;A single Facebook post moving the market 5% tells you the issue is already on the table.&lt;/p&gt;

&lt;p&gt;The context is plain enough. SK hynix posted a 72% operating profit margin in the first quarter; spread across employees, bonuses averaged nearly $500,000 per person. Samsung Electronics' semiconductor division logged 53.7 trillion won in operating profit for the same period, but the 74,000 workers represented by the union received a far smaller slice than their SK counterparts. The Samsung union has threatened an 18-day strike starting May 21. The workers aren't after a slightly larger bonus. They want a respectable cut of the AI dividend chain.&lt;/p&gt;

&lt;p&gt;I see this as the middle of three questions AI is really throwing at society. Ahead of it is the unemployment question. Behind it is a deeper one about identity. The three are linked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question One: Net Job Loss Is the Pattern
&lt;/h2&gt;

&lt;p&gt;First, let's get the definitions straight.&lt;/p&gt;

&lt;p&gt;Mainstream reports don't actually agree on "global net job losses." The World Economic Forum's &lt;em&gt;Future of Jobs Report 2025&lt;/em&gt; still predicts a net increase of 78 million jobs by 2030. The ILO's GenAI exposure index uses "transformation" rather than "replacement": one-quarter of jobs globally are touched by GenAI, one-third in high-income countries. The IMF uses the broadest brush: 40% globally, 60% in advanced economies.&lt;/p&gt;

&lt;p&gt;But macro models are neat; actual labor pools are not. When a clerk gets "transformed" into an "AI-collaborating high-efficiency role," the macro report counts that as transformation. For the clerk, it's unemployment. New jobs like AI product managers, data governance specialists, model evaluators, and robot maintenance technicians may not be in the same city, may not suit the same age group, and may not absorb the former customer service reps, copywriters, junior programmers, and admin staff. Someone always gets pushed off along the way.&lt;/p&gt;

&lt;p&gt;The current data already shows how painful this transition is. In Challenger's March 2026 U.S. layoff report, AI was the top reason given for cuts: 15,341 people, 25% of all monthly layoffs. Tech layoffs in the U.S. reached 52,050 in Q1, up 40% year over year. Goldman Sachs estimates that over the past year AI has eliminated roughly 25,000 jobs a month while creating fewer than 9,000. Net loss: 16,000. Gen Z and entry-level white-collar workers are hit hardest. Research from Stanford Digital Economy Lab also shows that in the jobs most exposed to AI, employment among 22- to 25-year-olds has dropped markedly.&lt;/p&gt;

&lt;p&gt;This is what I mean by "net loss." I'm not forecasting global employment in 2030; I'm looking at the real demand for specific occupations, specific age groups, and specific companies right now. When a company realizes that ten people plus AI can do what used to take fifteen, it doesn't first ask a macro model whether new jobs will appear in five years. It freezes hiring, trims headcount, and cuts peripheral roles.&lt;/p&gt;

&lt;p&gt;The sectors that traditionally soaked up labor are running in reverse, too. Autonomous vehicles are replacing drivers; drones are replacing delivery riders; industrial robots are replacing line workers. IFR &lt;em&gt;World Robotics&lt;/em&gt; data shows global manufacturing robot density doubled in seven years. In China in 2023 there were 470 robots per 10,000 manufacturing employees. New infrastructure no longer naturally brings large numbers of low-barrier jobs the way building bridges and roads once did. Computing centers, ultra-high-voltage grids, battery plants, and dark factories are all capital-intensive and light on labor.&lt;/p&gt;

&lt;p&gt;Some pin their hopes on "one-person companies." OPCs have been hyped plenty over the last couple of years, and I do think they'll become a real new organizational form. By June 2025, China had over 16 million one-person limited liability companies; 2.86 million were newly registered in the first half of 2025 alone, up 47% year over year. Shangcheng District in Hangzhou is already piloting OPC community policies.&lt;/p&gt;

&lt;p&gt;But judging whether OPCs can carry employment means looking past the headline numbers and anecdotes. Most of those 16 million are traditional self-employed operations and micro-entities that existed long before AI. Two things matter: the growth rate, and what share of that growth can actually sustain middle-class incomes.&lt;/p&gt;

&lt;p&gt;The growth rate is eye-catching: 47% year over year. But the distribution is ugly. Industry reports show OPC revenue is extremely long-tail: more than half are still stuck in a product-validation phase earning a few thousand yuan a month; fewer than one in ten steadily clear a million yuan a year. Annualizing the first-half figure gives just under 6 million new OPCs a year; even with an optimistic 10% success rate, only about 600,000 of them would reach middle-class income levels.&lt;/p&gt;

&lt;p&gt;China's 2025 statistical bulletin puts year-end employment at 725 million. The IMF's exposure metric is 60% for advanced economies; apply a more conservative 30% to China and that is still over 200 million people. 600,000 versus 200 million: more than two orders of magnitude apart. OPCs will buoy some super-individuals, but they can't hold up the labor market.&lt;/p&gt;
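
&lt;p&gt;The comparison above can be checked with a few lines of arithmetic. This is only a rough sketch: the input figures are the ones quoted in the text, while the 6 million annual total, the 10% success rate, and the 30% exposure rate are the essay's own rounded assumptions, not measured values. The variable names are made up for illustration.&lt;/p&gt;

```python
import math

# Figures quoted in the text.
new_opcs_first_half = 2_860_000   # new one-person companies, first half of 2025
total_employment = 725_000_000    # year-end employment, 2025 statistical bulletin

# Assumptions taken from the essay's own rounding, not from data.
new_opcs_per_year = 6_000_000     # round figure; 2 x 2.86M is about 5.7M
middle_class_share = 0.10         # the "optimistic 10%"
exposure_share = 0.30             # the "more conservative 30%" for China

# OPCs that would reach middle-class income under the optimistic share.
opc_middle_class = new_opcs_per_year * middle_class_share
# Workers exposed to AI under the conservative exposure share.
exposed_workers = total_employment * exposure_share

# Size of the gap, as a plain ratio and in orders of magnitude.
ratio = exposed_workers / opc_middle_class
orders_of_magnitude = math.log10(ratio)

print(f"Annualized first-half registrations: {2 * new_opcs_first_half:,}")
print(f"OPCs reaching middle-class income:   {opc_middle_class:,.0f}")
print(f"Workers exposed at 30%:              {exposed_workers:,.0f}")
print(f"Gap: {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```

&lt;p&gt;Changing &lt;code&gt;middle_class_share&lt;/code&gt; or &lt;code&gt;exposure_share&lt;/code&gt; shows how sensitive the gap is to those two guesses; under any plausible pair it stays in the hundreds.&lt;/p&gt;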

&lt;p&gt;Why is this shock so sharp? I boil it down to one word: concentration.&lt;/p&gt;

&lt;p&gt;SK hynix posted 37.6 trillion won in Q1 operating profit with roughly 35,000 employees company-wide. That works out to about 1 billion won in operating profit per employee for the quarter, over 4 billion won annualized, or more than 20 million RMB per person. Not every employee actually creates that much, but the number makes it viscerally clear how few hands the AI dividend is squeezed into.&lt;/p&gt;
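
&lt;p&gt;For readers who want to reproduce the per-employee figure, here is a minimal sketch. The profit and headcount are the numbers quoted above; the exchange rate &lt;code&gt;KRW_PER_RMB&lt;/code&gt; of 200 won per yuan is an illustrative assumption and not a figure from the source, and multiplying one quarter by four is a naive annualization.&lt;/p&gt;

```python
# Figures quoted in the text.
q1_operating_profit_krw = 37.6e12   # 37.6 trillion won, Q1 operating profit
employees = 35_000                  # rough company-wide headcount

# Assumption (not from the text): illustrative rate of 200 won per RMB.
KRW_PER_RMB = 200.0

# Operating profit per employee for the quarter.
per_employee_quarter_krw = q1_operating_profit_krw / employees
# Naive annualization: assume the other three quarters match Q1.
per_employee_annual_krw = per_employee_quarter_krw * 4
# Convert the annualized figure to RMB at the assumed rate.
per_employee_annual_rmb = per_employee_annual_krw / KRW_PER_RMB

print(f"Per employee, Q1:         {per_employee_quarter_krw / 1e9:.2f} billion won")
print(f"Per employee, annualized: {per_employee_annual_krw / 1e9:.2f} billion won")
print(f"Per employee, annualized: {per_employee_annual_rmb / 1e6:.1f} million RMB")
```

&lt;p&gt;The result is operating profit divided by headcount, not pay or individual output, so it measures concentration rather than what any one employee earns.&lt;/p&gt;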

&lt;p&gt;Xiaomi isn't as extreme, but the direction is the same. In 2025 the group recorded 457.3 billion yuan in revenue; its smart EV and AI innovation business contributed 106.1 billion yuan at a 24.3% gross margin, delivering 410,000 vehicles for the year. Automakers used to compete on production capacity; now they also compete on algorithms, supply-chain software, automated production lines, and data loops. Capacity is scaling up; headcount isn't scaling with it.&lt;/p&gt;

&lt;p&gt;Since the Industrial Revolution, every major industrial wave has pulled new job chains along with it. Labor scale and industrial scale mostly moved together. This AI wave runs the other way: the greater the output, the fewer people it needs. Chips, cloud, models, data centers, plus the handful of teams that can push AI to its limits. These swallow most of the dividends.&lt;/p&gt;

&lt;p&gt;You can think of it this way: one person out of a thousand, armed with super-productivity, flattens part of the work that the other 999 used to do. Not all jobs are erased, but the demand curve flattens. I don't see any new direction that could regenerate labor demand on that scale in the same window.&lt;/p&gt;

&lt;p&gt;Accepting this is a prerequisite for discussing the next two questions. Net job loss isn't a panic slogan; it's a magnitude mismatch already happening in local labor markets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question Two: Distribution Needs a Reset
&lt;/h2&gt;

&lt;p&gt;Which brings us to the second question: distribution.&lt;/p&gt;

&lt;p&gt;The Korean incident is a template for the distribution question. Once AI dividends concentrate in a handful of companies, who gets them? Shareholders, executives, core engineers, all employees, or society at large through taxes and public spending? This question will inevitably spread from Korea to Japan, Europe, and China.&lt;/p&gt;

&lt;p&gt;If concentrated dividends are still distributed under old rules, the outcome is almost predetermined. The bulk goes to shareholders and top management; core employees get hefty bonuses; ordinary workers get fixed salaries; and most people outside the supply chain only feel rising prices, rents, fewer openings, and tougher competition. This structure was already widening inequality; AI only steepens the curve.&lt;/p&gt;

&lt;p&gt;The group squeezed hardest isn't the lowest earners. It's the middle class, especially those living on salaries, professional skills, and stable jobs.&lt;/p&gt;

&lt;p&gt;The reason is simple. Wages are the most thoroughly taxed form of income; there's nowhere to hide. In China's individual income tax system, comprehensive income faces progressive rates from 3% to 45%. Salaries are withheld at source, and social insurance plus tax are deducted before the money ever reaches your hands. High-salary workers should certainly pay tax. The problem is that wealthier people have far more types of income: capital gains, equity incentives, dividends, corporate structures, trusts, family offices, cross-border arrangements. I'm not saying these are illegal. I'm saying they have more choices of tax base and more room to defer.&lt;/p&gt;

&lt;p&gt;This structure was already obvious in the internet era; in the AI era it will only get worse, because the core gains of the industry concentrate in fewer hands. The EU Tax Observatory's global tax evasion report also notes that billionaires worldwide face extremely low effective tax rates relative to their wealth. The more mobile wealth becomes across borders, the harder it is for any single country to carry out redistribution alone.&lt;/p&gt;

&lt;p&gt;The other side of the middle-class squeeze is how fast they are being replaced. The jobs AI currently hits most easily are middle-class occupations: programmers, designers, customer service reps, junior legal staff, junior analysts, copywriters, translators, operations specialists, admin staff, finance assistants. They shoulder the heaviest taxes and face the fastest replacement. They are the ones hurting most in the current structure.&lt;/p&gt;

&lt;p&gt;Back to Korea. The Samsung and SK unions aren't fighting over a one-time bonus; they're fighting over a long-term rule. The companies will only offer a "special bonus." The unions want the profit-sharing ratio locked into a formal agreement that takes effect every year. On the surface it's about the bonus amount. In reality it's about whether this distribution rule will still hold next year.&lt;/p&gt;

&lt;p&gt;Using "excess profits" or "excess tax revenue" for redistribution isn't a new framework in itself. Nordic countries have been running this for decades. Denmark's top marginal income tax rate is pushed to 60.5% in 2026. Sweden, Finland, and Norway have long maintained high labor-tax burdens and public services. The OECD's &lt;em&gt;Taxing Wages&lt;/em&gt; also shows that the average tax wedge on labor in European countries is markedly higher than in the U.S. or Korea.&lt;/p&gt;

&lt;p&gt;But the AI era introduces a new problem: productivity itself can move.&lt;/p&gt;

&lt;p&gt;Heavy-asset, fab-heavy players like Samsung and SK hynix can't move; the Korean government can at least capture some corporate income tax and supply-chain revenue. But the more typical AI business doesn't look like that. Compute is rented in Singapore; the company is registered in Ireland; the team is spread across five time zones; settlements run through global payment networks. Teams of three to five people generating hundreds of millions in revenue will become more common, and nations have far fewer levers to tax them than they had with traditional manufacturing.&lt;/p&gt;

&lt;p&gt;So an "AI tax" can't be read as simply slapping higher taxes on a few companies. It's more like a bundle of questions. What is the tax base? Compute, profits, capital gains, data, or the labor costs displaced by robots? And who receives the revenue? Is it poured into new infrastructure, or used to shore up social security, education, health care, pensions, unemployment insurance, even direct cash transfers to residents?&lt;/p&gt;

&lt;p&gt;What needs guarding against here is path dependence. Many countries have grown used to propping up the economy with investment and infrastructure, but AI-era infrastructure may not prop up employment. Building more computing centers, data centers, ultra-high-voltage grids, and battery plants will likely continue to raise the productivity of leading firms, benefiting capital and a narrow slice of high-skill jobs, while doing little directly for displaced middle-class and low-income workers.&lt;/p&gt;

&lt;p&gt;This is why people at OpenAI have been talking publicly about UBI for years. OpenResearch, funded by Sam Altman, ran a three-year experiment in Texas and Illinois: 1,000 low-income participants received $1,000 per month, alongside a control group of 2,000. The results, published in 2024, weren't miraculous. Recipients worked an average of 1.3 fewer hours per week, had a 2 percentage-point lower employment probability, and saw household income excluding subsidies decline. But they were more proactive in looking for work, valued meaningful work more, had more room to relocate, see doctors, and plan long term, and were more likely to have entrepreneurial ideas.&lt;/p&gt;

&lt;p&gt;This experiment matters, not because it proves UBI is right, but because it drags the debate from slogans back to data. Cash doesn't automatically make people stop working, nor does it automatically give them dignity. What it provides is a buffer and choice. For a society with excess productivity and rapid job restructuring, choice itself may be infrastructure.&lt;/p&gt;

&lt;p&gt;I don't think UBI is the standard answer for the AI era. But it's one of the few options that has been seriously tested and has data behind it. Compared with patching old rules, it at least offers a different starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question Three: Where Does Value Come From for People Who Don't Work?
&lt;/h2&gt;

&lt;p&gt;This is the hardest of the three. The first two can still be moved forward with policy, tax systems, and redistribution. This one cannot.&lt;/p&gt;

&lt;p&gt;In the Chinese context, "not working" is a very heavy verdict. At family gatherings, when someone asks "What do you do?" the expected answer is an occupation. If you reply, "I don't currently have a job," the atmosphere changes instantly. This isn't just about face.&lt;/p&gt;

&lt;p&gt;The National Bureau of Statistics' 2025 bulletin lists 5.95 million urban residents and 33.4 million rural residents on subsistence allowances at year-end. China does have a welfare system. But subsistence allowances and relief still carry stigma in many places. Families who qualify but don't apply have always existed. The reason isn't insufficient money; it's the fear of being whispered about for "living off handouts."&lt;/p&gt;

&lt;p&gt;This sense of shame runs deep. Our generation grew up on a narrative that said "Work hard and you'll be rewarded; effort deserves respect." Education, media, and the people around you all tell you the same thing: your value equals your output. Labor is the anchor of identity; salary is the measure of it. I wrote &lt;a href="https://guanjiawei.ai/blog/beyond-ai-anxiety" rel="noopener noreferrer"&gt;a piece on AI anxiety&lt;/a&gt; before, touching on the other side of this. When AI fortune-telling trends and young people flock to mysticism for certainty, what's really happening is that this anchor is loosening.&lt;/p&gt;

&lt;p&gt;AI has simply brought this narrative's expiration date forward. What it truly shakes isn't just income. It's the sense of identity. You receive a basic income, food and housing are covered, friends respect you, but you wake up with nothing to look forward to. That hollowness is something policy cannot answer.&lt;/p&gt;

&lt;p&gt;A society whose time has been freed by AI doesn't lack welfare distribution points. It lacks a narrative that can give people a new identity. This isn't something engineers can code or models can compute. It requires telling a new story about what kind of person is worthy of respect and what kind of life is decent.&lt;/p&gt;

&lt;p&gt;A thousand years ago the story was "study to become an official"; a hundred years ago it was "industry saves the nation"; thirty years ago it was "go into business." Over the last decade or so it was "get into a big tech firm," "buy an apartment," "start a company," "financial freedom." No one can yet give a clean answer for what it should be in the AI era.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The three questions are linked. The first makes the second urgent. No matter how well the second is handled, the third cannot be avoided.&lt;/p&gt;

&lt;p&gt;The market tremor triggered by that May 11 proposal in Korea is only the opening act. I expect these discussions to spread to Japan, Europe, and China in the second half of the year. Every country will craft different answers based on its own politics and culture. AI taxes, UBI, tax-base reforms, new infrastructure, promoting one-person companies—each will have its trial runs. Trial and error itself is part of the answer.&lt;/p&gt;

&lt;p&gt;What individuals can actually do isn't complicated: build more skills, keep more capital on hand, and don't let any single narrative sweep you away emotionally. What society must do is harder: stop brushing things off with "new jobs will always appear," and stop pushing the unemployed back into shame. No one can answer all three questions at once. We can only walk through them one by one.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://koreajoongangdaily.joins.com/news/2026-05-12/national/politics/Senior-Blue-House-official-calls-for-returning-Samsung-SKs-excess-chip-profits-to-the-public/2590097" rel="noopener noreferrer"&gt;Senior Blue House official calls for returning Samsung, SK's "excess" chip profits to the public（Korea JoongAng Daily）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.moneycontrol.com/news/business/pay-citizens-a-dividend-from-ai-windfall-south-korea-roils-market-by-floating-plan-to-redistribute-gains-13917168.html" rel="noopener noreferrer"&gt;South Korea roils market by floating "citizen dividend" from AI（Bloomberg / Moneycontrol）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apnews.com/article/korea-samsung-union-strike-ai-38e7a5030d3688850d3e8d8baf240f58" rel="noopener noreferrer"&gt;Samsung workers protest for higher pay, threaten to strike（AP）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.samsung.com/global/samsung-electronics-announces-first-quarter-2026-results" rel="noopener noreferrer"&gt;Samsung Electronics Announces First Quarter 2026 Results（Samsung Global Newsroom）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.skhynix.co.kr/q1-2026-business-results/" rel="noopener noreferrer"&gt;SK hynix Announces First Quarter 2026 Business Results（SK hynix Newsroom）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ir.mi.com/system/files-encrypted/nasdaq_kms/assets/2026/03/24/6-20-53/Xiaomi%20Corp_25Q4_ER_ENG%20vF.pdf" rel="noopener noreferrer"&gt;Xiaomi Corporation 2025 Fourth Quarter and Annual Results（Xiaomi IR）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ilo.org/publications/generative-ai-and-jobs-2025-update" rel="noopener noreferrer"&gt;Generative AI and jobs: A 2025 update（International Labour Organization）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ilo.org/resource/news/one-four-jobs-risk-being-transformed-genai-new-ilo%E2%80%93nask-global-index-shows" rel="noopener noreferrer"&gt;One in four jobs at risk of being transformed by GenAI — ILO–NASK Global Index（International Labour Organization）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.imf.org/en/Blogs/Articles/2024/01/14/ai-will-transform-the-global-economy-lets-make-sure-it-benefits-humanity" rel="noopener noreferrer"&gt;AI will transform the global economy（International Monetary Fund）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.weforum.org/publications/the-future-of-jobs-report-2025/in-full/2-jobs-outlook/" rel="noopener noreferrer"&gt;The Future of Jobs Report 2025（World Economic Forum）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.challengergray.com/blog/challenger-report-march-cuts-rise-25-from-february-ai-leads-reasons/" rel="noopener noreferrer"&gt;Challenger Report: March 2026 cuts rise 25%, AI leads reasons（Challenger, Gray &amp;amp; Christmas）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fortune.com/2026/04/06/ai-tech-displacement-effect-gen-z-16000-jobs-per-month/" rel="noopener noreferrer"&gt;AI is cutting 16,000 U.S. jobs a month（Fortune / Goldman Sachs）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://digitaleconomy.stanford.edu/publication/canaries-in-the-coal-mine-six-facts-about-the-recent-employment-effects-of-artificial-intelligence/" rel="noopener noreferrer"&gt;Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of AI（Stanford Digital Economy Lab）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ifr.org/ifr-press-releases/news/global-robot-density-in-factories-doubled-in-seven-years" rel="noopener noreferrer"&gt;Global Robot Density in Factories Doubled in Seven Years（International Federation of Robotics）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://global.chinadaily.com.cn/a/202605/06/WS69fa9a65a310d6866eb47064.html" rel="noopener noreferrer"&gt;One-person companies rise in popularity, gain policy support（China Daily）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://finance.eastmoney.com/a/202604093699351321.html" rel="noopener noreferrer"&gt;2026 One-Person Company Insights Report Released: 1 Yuan of AI Cost Leverages 72x Human Labor（Securities Times / East Money）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stats.gov.cn/sj/zxfbhjd/202602/t20260228_1962662.html" rel="noopener noreferrer"&gt;Statistical Bulletin of the People's Republic of China on National Economic and Social Development in 2025（National Bureau of Statistics of China）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.chinatax.gov.cn/eng/c102441/c5211934/content.html" rel="noopener noreferrer"&gt;Individual Income Tax Law of the People's Republic of China（State Taxation Administration English Site）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://taxfoundation.org/data/all/eu/top-personal-income-tax-rates-europe/" rel="noopener noreferrer"&gt;Top Personal Income Tax Rates in Europe 2026（Tax Foundation）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oecd.org/en/publications/taxing-wages-2026_3a5169ef-en/full-report/overview_d93131c3.html" rel="noopener noreferrer"&gt;Taxing Wages 2026 overview（OECD）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://taxobservatory.world/publication/global-tax-evasion-report-2024/" rel="noopener noreferrer"&gt;Global Tax Evasion Report 2024（EU Tax Observatory）&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.openresearchlab.org/projects/unconditional-cash-study" rel="noopener noreferrer"&gt;Unconditional Cash Study（OpenResearch）&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Original link: &lt;a href="https://guanjiawei.ai/en/blog/three-questions-after-jobs" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/three-questions-after-jobs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>employment</category>
      <category>distribution</category>
      <category>ubi</category>
    </item>
  </channel>
</rss>
