<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: guanjiawei</title>
    <description>The latest articles on DEV Community by guanjiawei (@skyguan92).</description>
    <link>https://dev.to/skyguan92</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3788265%2Ff93aaebd-c44b-447a-b582-cc297747f93b.jpeg</url>
      <title>DEV Community: guanjiawei</title>
      <link>https://dev.to/skyguan92</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/skyguan92"/>
    <language>en</language>
    <item>
      <title>Running Six Agents in Parallel: What AI Coding Changed, and What It Didn't</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:36:16 +0000</pubDate>
      <link>https://dev.to/skyguan92/running-six-agents-in-parallel-what-ai-coding-changed-and-what-it-didnt-2mp2</link>
      <guid>https://dev.to/skyguan92/running-six-agents-in-parallel-what-ai-coding-changed-and-what-it-didnt-2mp2</guid>
      <description>&lt;p&gt;The debate over vibe coding never stops. On one side, it's treated like a wishing well—throw every task into it; on the other, it's slapped with a "trash code factory" label. I can't accept either. Tools aren't a matter of faith.&lt;/p&gt;

&lt;p&gt;Rather than picking a side, let's talk about which dimensions it actually changed, and which it didn't.&lt;/p&gt;

&lt;h2&gt;Two Signals Pointing in Opposite Directions&lt;/h2&gt;

&lt;p&gt;Look at two recent events.&lt;/p&gt;

&lt;p&gt;One is Karpathy himself. In February 2025, he threw out the term "vibe coding" on X—"fully surrender to the vibes, embrace the exponentials, even forget that the code exists"—and it was picked as Collins Dictionary's Word of the Year that same year. Then in February 2026, he himself came out and said the term was outdated. Now he uses "agentic engineering": 99% of the time you're not typing code, you're orchestrating agents and doing oversight; "engineering" is there to emphasize that this has a bar, it's a craft.&lt;/p&gt;

&lt;p&gt;The other is Amazon. On March 5, 2026, their main site was down for six hours. Root cause: another cascading failure triggered by AI-assisted code. The previous one was in December 2025, when their in-house AI coding tool Kiro deleted and recreated an AWS Cost Explorer environment, causing a 13-hour outage in China. After an internal meeting, Amazon issued a new rule: AI-assisted code written by junior and mid-level engineers must be signed off by a senior engineer before it can reach production.&lt;/p&gt;

&lt;p&gt;They look like opposites, but they're the same thing. Karpathy moved the term from "experience" (vibe) to "you're on the hook" (oversight + engineering). Amazon literally wrote "you're on the hook" into the charter. One is a conceptual pivot; the other is an institutional implementation.&lt;/p&gt;

&lt;p&gt;What really deserves thought isn't who's right or wrong, but what changed and what didn't. Clarify these four things, and most of the controversy will quiet down on its own.&lt;/p&gt;

&lt;h2&gt;Breadth: One Person's Surface Area Gets Stretched&lt;/h2&gt;

&lt;p&gt;There used to be hard limits on what one person could do in a day. Your domain, your skills, the number of projects you could push at once—all pressed down by the simple fact that you are one person.&lt;/p&gt;

&lt;p&gt;Now that coding agents can take on long-horizon tasks, that surface area has been stretched.&lt;/p&gt;

&lt;p&gt;Here's a slice of my daily routine over the past few weeks. The main thread is an AI hardware product called Aima: an agent writes new features and occupies machines running UAT; I review the test results and feed in the next round of instructions. It's a standard serial chain, but there's a lot of waiting between each node. In the gaps, I can spin up a second thread: the cloud backend behind Aima has had stability issues lately, so another agent investigates root causes, patches architectural holes, and loops back through UAT. Third is a research branch: there's still inference performance left on the table on the edge hardware, and the operators need A/B testing, compilation, and accuracy runs. No guaranteed output, but as long as tokens hold out, let it run. Fourth is efficiency research on the agent framework itself, packaged as a standalone runtime and thrown onto a machine, with another agent doing data analysis. Plus small tweaks to my personal homepage, and a character-recognition mini-game I spent a day and a half building for my son over the weekend—he hasn't been getting his little red flowers at school because he can't read characters yet. That's six threads running in parallel.&lt;/p&gt;

&lt;p&gt;Sounds like bragging. But the actual feeling isn't that I'm somehow superhuman; it's that the "waiting for the agent" time within each thread is naturally long. This pattern already has a common name in 2026: parallel agent coding. Git worktrees as isolation layers are mainstream infrastructure. Most people's physical ceiling is five to seven parallel threads; beyond that, review and merge costs eat you alive.&lt;/p&gt;
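&lt;p&gt;That isolation pattern can be sketched in a few commands. This is a minimal illustration using git's standard &lt;code&gt;worktree&lt;/code&gt; subcommand—the repo layout and thread names here are made up for illustration, not my actual setup:&lt;/p&gt;

```shell
# One git worktree per agent thread: each agent gets its own checkout
# on its own branch, so parallel edits never collide in the working tree.
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init"

# Branch + worktree per thread (names are illustrative).
for t in aima-feature cloud-stability edge-perf; do
  git -C "$repo" branch "agent/$t"
  git -C "$repo" worktree add -q "$repo-wt-$t" "agent/$t"
done

git -C "$repo" worktree list   # main checkout plus one worktree per thread
```

&lt;p&gt;Each directory can then host its own long-running agent session, and &lt;code&gt;git worktree remove&lt;/code&gt; cleans a thread up once its branch is merged.&lt;/p&gt;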

&lt;p&gt;There's an under-discussed side effect: it's quietly changing a person's "functional identity." I used to see myself as a PM who also codes half the time. Now that identity is expanding outward, covering product, operations, research, even parenting. It's not that I became Superman; the tool simply raised the breadth that one person can cover.&lt;/p&gt;

&lt;h2&gt;Speed: The Ceiling Lifted, But "Fast" Itself Stops Being an Advantage&lt;/h2&gt;

&lt;p&gt;There used to be a physical ceiling on building things. Type as fast as you want, you only get so many lines per day. Think as fast as you want, you only have two hands.&lt;/p&gt;

&lt;p&gt;AI moved that ceiling. The place you feel it most is putting together demos: a hackathon used to count as a success if you produced something viewable in 48 hours. Now, for tasks of comparable scope, producing a draft-level demo in a matter of days is simply normal. Not that it's polished—just that it can be seen, played with, and used to discuss next steps.&lt;/p&gt;

&lt;p&gt;But there's an awkward side effect: when everyone can be "fast," speed itself stops being an advantage.&lt;/p&gt;

&lt;p&gt;In the past, moving fast was a bonus; moving slow got you talked about. Now moving fast is the price of admission, moving slow gets you cut, and moving fast won't earn you special praise anymore. This is a structural shift inside organizations. Teams that use "speed" as a core motivator will freeze up: rewards can't be handed out, performance reviews are all top marks, and anxiety actually rises.&lt;/p&gt;

&lt;p&gt;The more troublesome problem lurks one layer down: once you're fast, what about quality?&lt;/p&gt;

&lt;h2&gt;Quality: From a Work Problem to a Budget Problem&lt;/h2&gt;

&lt;p&gt;The tension between quality and speed was never AI-specific; it's chapter one of any project management textbook. But AI did change its shape. Quality used to be a work problem: how many people you hire, how strict your process, how fine-toothed your review. Now it's more like a budget problem: how many tokens you're willing to give it determines the level it reaches.&lt;/p&gt;

&lt;p&gt;Bare minimum: write, merge, ship. Three hours done.&lt;/p&gt;

&lt;p&gt;Somewhat serious: have the agent do a round of code review, then a round of design-level review; fix issues and iterate.&lt;/p&gt;

&lt;p&gt;Done properly: unit, integration, and UAT before merge. The more I use UAT, the more I see it's unavoidable. Many issues are chain-level; you can't see them without actually simulating the usage flow. The upside is agents can now automate UAT runs: operate, reproduce, provide traces. You just verify the results.&lt;/p&gt;

&lt;p&gt;Even stricter: wire up CI/CD, add smoke tests, push to staging, run UAT again on staging, all green before production.&lt;/p&gt;

&lt;p&gt;Each added layer doubles the time and multiplies tokens several-fold. A feature that takes three hours to write might need twelve hours end-to-end, and thirty times the tokens.&lt;/p&gt;

&lt;p&gt;Thirty times looks like waste, but it isn't. At the end of 2025, CodeRabbit ran a comparative analysis on 470 open-source GitHub PRs. AI-co-generated code contained roughly 1.7× as many bugs as human code, and on the category of logic and correctness issues most likely to trigger downstream incidents, it was 75% higher.&lt;/p&gt;

&lt;p&gt;In other words, the statistical average of an agent's&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/ai-coding-four-dimensions" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/ai-coding-four-dimensions&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>vibecoding</category>
      <category>workflow</category>
    </item>
    <item>
      <title>The Two Days Around the Opus 4.7 Launch</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:17:57 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-two-days-around-the-opus-47-launch-40ad</link>
      <guid>https://dev.to/skyguan92/the-two-days-around-the-opus-47-launch-40ad</guid>
      <description>&lt;p&gt;Around midnight yesterday, &lt;a href="https://www.anthropic.com/news/claude-opus-4-7" rel="noopener noreferrer"&gt;Anthropic dropped Opus 4.7&lt;/a&gt;. I had already lain down to sleep, but the news kept me up, so I got up and installed it to try it out.&lt;/p&gt;

&lt;h2&gt;How 4.7 Feels&lt;/h2&gt;

&lt;p&gt;There was no miracle moment of "something I&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/parallel-with-agents" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/parallel-with-agents&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>models</category>
      <category>openclaude</category>
    </item>
    <item>
      <title>The AI Industry in Wartime</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Thu, 16 Apr 2026 04:09:41 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-ai-industry-in-wartime-37ah</link>
      <guid>https://dev.to/skyguan92/the-ai-industry-in-wartime-37ah</guid>
      <description>&lt;p&gt;Anthropic recently rolled out a new policy: in certain scenarios, using Claude requires uploading an ID document plus a real-time selfie for identity verification. People in China are in an uproar, with many acting shocked.&lt;/p&gt;

&lt;p&gt;Honestly, this is completely normal. From my perspective, this is actually pretty mild.&lt;/p&gt;

&lt;h2&gt;This Was Never Peacetime&lt;/h2&gt;

&lt;p&gt;Earlier this year, I told my team exactly this: while you can still use overseas models, don't get hung up on money or account issues. Just start using them. Master the intuition and methodology of collaborating with AI first. In as little as three months, you could run into problems and lose access. You have to approach this with a sense of urgency.&lt;/p&gt;

&lt;p&gt;Since 2023, the AI industry has been on a wartime footing.&lt;/p&gt;

&lt;p&gt;Look back at ChatGPT's launch at the end of 2022: GPT-3.5 was free, but Chinese IPs were blocked from Day 1. You needed an overseas phone number just to register a free account, and China blocked it domestically too. Things never loosened up after that—only got tighter.&lt;/p&gt;

&lt;p&gt;By September 2025, Anthropic officially announced a ban on all "China-controlled companies" using Claude, regardless of where they operated—if Chinese entities held over 50% equity, you were out. Mass account bans began that November. By February 2026, things got even more direct: Anthropic publicly accused three Chinese AI companies—DeepSeek, Moonshot AI, and MiniMax—of using roughly 24,000 fake accounts to distill Claude's models, generating over 16 million conversations in total.&lt;/p&gt;

&lt;p&gt;Just a few days ago, Peter Steinberger, founder of OpenClaw, also got banned. It wasn't targeted—most likely a false positive from anomaly detection—and he was later reinstated. But incidents like this are unsettling.&lt;/p&gt;

&lt;p&gt;So I'm telling you: don't be so shocked. These model companies' intentions haven't changed since Day 0—they've been blocking you by every means possible. But mass bans cause collateral damage, and handling appeals puts enormous strain on their infrastructure. How many people do they even have managing this? Real-name verification is actually a fallback solution.&lt;/p&gt;

&lt;h2&gt;Chips and Models: Two Sides of the Same Coin&lt;/h2&gt;

&lt;p&gt;The chip side is just as turbulent.&lt;/p&gt;

&lt;p&gt;In January 2025, the U.S. Bureau of Industry and Security issued new rules that completely choked off exports of NVIDIA's flagship H100 and H200 GPUs to China. By April, even compliance chips were banned. In July, the H20 was quietly unbanned on the condition that NVIDIA hand over 15% of its revenue to the U.S. government. By December, even the previously banned H200 was released, this time at a 25% cut. In the other direction, the U.S. banned Huawei's chips as well. Back and forth.&lt;/p&gt;

&lt;p&gt;The model side is the same. In February 2025, the U.S. Congress proposed the "No DeepSeek on Government Devices Act," prohibiting federal employees from using DeepSeek on government devices. The Department of Commerce, the Navy, and other federal agencies followed suit, as did states including New&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/ai-wartime" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/ai-wartime&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>reflections</category>
      <category>geopolitics</category>
    </item>
    <item>
      <title>Did GPT-6 Get 'Released' Again Today?</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:03:33 +0000</pubDate>
      <link>https://dev.to/skyguan92/did-gpt-6-get-released-again-today-1fb1</link>
      <guid>https://dev.to/skyguan92/did-gpt-6-get-released-again-today-1fb1</guid>
      <description>&lt;p&gt;A couple of days ago, I saw a friend post on social media: "GPT-6 is released," followed by "hahaha."&lt;/p&gt;

&lt;p&gt;This person isn't the type to spread random rumors—they're in the industry. My first thought was: "How did I miss something this big?"&lt;/p&gt;

&lt;p&gt;I casually Googled it. Found nothing. Then I had Claude Code look into it, and it rapidly spat out a bunch of results about "what time the launch event started," "what the parameter count is," "how much it improved over GPT-5"—the details were surprisingly consistent. But OpenAI's official site hadn't changed at all.&lt;/p&gt;

&lt;p&gt;Then it clicked. April 14 was one of the rumored release dates circulating in English-language circles. OpenAI had just finished pre-training the codenamed "Spud" model at the Stargate data center on March 24, and Altman had said "releasing in a few weeks." Then everyone started guessing which day marked the end of "a few weeks." April 14 got picked. Within days, a circle of people wrote posts, added images, made up parameter counts, and cross-referenced each other, stitching together a seemingly coherent "fact." The Chinese-language versions were even more absurd—"GPT-6 releasing next week," "releasing next month," "releasing tomorrow"—the drumbeat hasn't stopped since last fall, and each round harvested a new wave of shares.&lt;/p&gt;

&lt;p&gt;What really bothered me wasn't the fake news itself—it was how the more I thought about it, the more tangled it got.&lt;/p&gt;

&lt;h2&gt;Misinformation Actually Has an Audience&lt;/h2&gt;

&lt;p&gt;Back when Zhihu (China's Quora equivalent) was first blowing up, and again during the WeChat misinformation wave of the pandemic, there was a flood of false information. The approach back then was debunking. Guokr's "Rumor Crusher" (a Chinese science media outlet) had been doing this since 2010—503 articles in three years. They helped shut down things like the "earthquake life triangle" and "nuclear contamination spread maps." The playbook was: rumor appears → professionals dismantle it → everyone has an aha moment.&lt;/p&gt;

&lt;p&gt;But that path seems to be getting narrower and narrower.&lt;/p&gt;

&lt;p&gt;Because misinformation isn't a "mistake that needs correcting"—it's an emotional consumable.&lt;/p&gt;

&lt;p&gt;Here's an analogy. When we chat, besides discussing serious matters, we also shoot the breeze. We might be talking about something substantive, and then I suddenly think of a tangential point and say something wildly exaggerated. The other person won't jump in and say "that's factually incorrect"—they'll laugh. The moment gets consumed, and nobody is harmed.&lt;/p&gt;

&lt;p&gt;A lot of "misinformation" online plays exactly this role. Its goal isn't to be "believed" at all—its goal is to get you to click, share, or drop a comment. The harder you try to verify or debunk it, the more the algorithm sees engagement and pushes it to the next person.&lt;/p&gt;

&lt;p&gt;Once you see this, a lot of confusing things start to make sense.&lt;/p&gt;

&lt;p&gt;Why GPT-6 can be "released" so many times—each "release" harvests a round of traffic. The more you try to debunk it, the more you feed it.&lt;/p&gt;

&lt;p&gt;Why those unsourced medical posts on Xiaohongshu (RED) blow up—a few images, a few lines of text, making a radical claim about some university's discovery or how some hormone actually works. The discussions below all unfold on the implicit assumption that "this is fact": "No wonder," "So that's how it is." I've read dozens of these recently; not a single one had a credible source, yet the comment sections were buzzing.&lt;/p&gt;

&lt;p&gt;And then there was &lt;a href="https://dev.to/zh/blog/happy-horse-video-privatization"&gt;that little horse&lt;/a&gt; a while back. An anonymous model called HappyHorse knocked ByteDance's SEEDANCE 2.0 off the Artificial Analysis leaderboard. Overnight, happyhorse.io and happyhorse.com were domain-squatted, and a bunch of empty HuggingFace repos popped up claiming "open source" and "number one." I had Claude Code investigate its background. The first time, it actually fell for it, citing those bandwagon-jumping sources as authoritative. I ran it again with a different agent, and only then got a solid conclusion: "Sources remain unclear; recommend waiting for official confirmation."&lt;/p&gt;

&lt;p&gt;This shook me a bit. Not because AI made a mistake—mistakes are acceptable—but because my information pipeline is fundamentally insecure.&lt;/p&gt;

&lt;h2&gt;Real Information Is Actually a Niche Market&lt;/h2&gt;

&lt;p&gt;Coming full circle, I finally see it clearly: real information is expensive.&lt;/p&gt;

&lt;p&gt;The production side is easy to understand. Analysis, research, experiments, repeated verification—every step burns time.&lt;/p&gt;

&lt;p&gt;Consumption is expensive too. You have to spend energy understanding convoluted phrasing, drawing conclusions that are plain and devoid of emotional payoff. Real information doesn't give you that instant dopamine hit of "oh, so that's how it is." More often it gives you "this isn't that interesting" or "I need to think about this more."&lt;/p&gt;

&lt;p&gt;Expensive on both ends, real information is naturally a niche market. It isn't being suppressed; the supply and demand curves just look like this.&lt;/p&gt;

&lt;p&gt;I didn't have this insight before. I always thought the problem was "how do we debunk rumors." Now I think the problem is: "In an environment where lies are cheaper and more consumable than truth, how do I preserve my own judgment?"&lt;/p&gt;

&lt;h2&gt;AI Is Making This Worse, Not Better&lt;/h2&gt;

&lt;p&gt;At first I also thought AI was the solution. Let AI help you verify, screen, and analyze.&lt;/p&gt;

&lt;p&gt;The more I use it, the less I believe that.&lt;/p&gt;

&lt;p&gt;AI is like an extremely capable employee. It can get work done, and do it well. But if you only ever look at the conclusion and never at the process, you fall into a dangerous state: there is an extra curtain between you and the truth.&lt;/p&gt;

&lt;p&gt;I've discussed this before: creation itself is a method of approximating truth. When an experiment fails, you judge where to adjust next based on what the failure looked like. When a plan doesn't work, you learn new things from the reasons it didn't work. None of this can be learned by only looking at conclusions.&lt;/p&gt;

&lt;p&gt;And humans are naturally lazy. If AI does seven hours of an eight-hour workday and you only look at conclusions, you'll feel highly productive—but your judgment is quietly deteriorating. This is what worries me most. AI appears to amplify your output while simultaneously weakening your judgment. Fragmentation and short-form videos were already doing this; AI is now pouring more fuel on the fire.&lt;/p&gt;

&lt;p&gt;In the context of daily work, this comes down to tool choice.&lt;/p&gt;

&lt;h2&gt;So I Pushed the Team Back to Claude Code&lt;/h2&gt;

&lt;p&gt;Over the past month or two, I've been comparing &lt;a href="https://openai.com/codex" rel="noopener noreferrer"&gt;Codex&lt;/a&gt; and &lt;a href="https://www.anthropic.com/product/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At first I recommended the team switch to Codex. The reason was straightforward: it's stronger. Codex can crack hard problems that Claude Code can't. GPT-5.4 is clearly a step up in complex architecture and long-chain reasoning. On benchmarks, it scores more than ten points higher than Opus 4.6 on SWE-Bench Pro. When Opus grinds away at something for a long time without success, Codex often nails it on the first try. I was genuinely impressed at the time.&lt;/p&gt;

&lt;p&gt;But recently I changed my mind. Not because Codex is bad—we're still using it for what it can do. It's because I realized that for most people, using Codex exclusively traps you in a state of "AI does it, I don't grow."&lt;/p&gt;

&lt;p&gt;Claude Code's process is transparent. It actively makes plans, asks you clarifying questions, and writes out clearly what it's doing at every step, in plain human language. I'm not the foremost expert in every domain, but I can participate in the discussion, follow the reasoning, and add challenges or caveats—sometimes the angles it throws out spark ideas I wouldn't have had otherwise. After a month, I'm starting to develop judgment in several subfields I previously knew nothing about.&lt;/p&gt;

&lt;p&gt;Codex is different. Its thought process is more machine-language-like, a jumble of weird tags that's hard to decipher. When it's done, it dumps a dense wall of summary on you. You ask it to explain, and it dumps another dense wall—still hard to parse. I actually burn more tokens with Codex than with Claude Code—it naturally takes more because it tackles harder tasks—but I grow less. I only know "it got it done" or "it didn't," not how it got there.&lt;/p&gt;

&lt;p&gt;A month isn't long, but projected over a year, this gap becomes significant.&lt;/p&gt;

&lt;h2&gt;A Friend Asked Which One I Recommend&lt;/h2&gt;

&lt;p&gt;Claude Code.&lt;/p&gt;

&lt;p&gt;Not because it has the strongest benchmarks (it doesn't). Not because it's the cheapest (it isn't either). It's because "making you stronger" is built into the product. Codex is like a colleague who silently works overtime and gets it done. Claude Code is like a colleague who walks you through the code. The former is faster; after a year with the latter, you're a different person.&lt;/p&gt;




&lt;p&gt;Back to that social media post at the beginning.&lt;/p&gt;

&lt;p&gt;In the age of AI, information is getting cheaper, but people are becoming more valuable. AI getting stronger doesn't automatically solve the problem—those of us using AI have to find our own ways to preserve judgment in this environment.&lt;/p&gt;

&lt;p&gt;So my selection criteria have changed too. I don't just look at how much AI can help me get done; I also look at whether I understand a little more after a day of collaboration. These are two different things, but I never used to separate them.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://findskill.ai/blog/gpt-6-release-date/" rel="noopener noreferrer"&gt;GPT-6 Release Date: April 14 Rumor Unconfirmed — FindSkill.ai&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lifearchitect.ai/gpt-6/" rel="noopener noreferrer"&gt;GPT-6 (2026) — Dr Alan D. Thompson, LifeArchitect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://felloai.com/all-we-know-about-chatgpt-6/" rel="noopener noreferrer"&gt;ChatGPT 6 Release: Rumors &amp;amp; What's Confirmed — Fello AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://m.guokr.com/article/437589" rel="noopener noreferrer"&gt;果壳网三周年：谣言粉碎机"最粉碎"的三年&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zh.wikipedia.org/zh-hans/%E6%9E%9C%E5%A3%B3%E7%BD%91" rel="noopener noreferrer"&gt;果壳网 — 维基百科&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nxcode.io/resources/news/gpt-5-4-vs-claude-opus-4-6-coding-comparison-2026" rel="noopener noreferrer"&gt;GPT-5.4 vs Claude Opus 4.6 for Coding — NxCode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chandlernguyen.com/blog/2026/03/13/codex-gpt-5-4-vs-claude-code-opus-4-6-dual-wielding-ai-coding-tools/" rel="noopener noreferrer"&gt;Codex with GPT-5.4 vs Claude Code with Opus 4.6: Why I Now Use Both — Chandler Nguyen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lennysnewsletter.com/p/claude-opus-46-vs-gpt-53-codex" rel="noopener noreferrer"&gt;Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days — Lenny's Newsletter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://every.to/vibe-check/codex-vs-opus" rel="noopener noreferrer"&gt;GPT-5.3 Codex vs. Opus 4.6: The Great Convergence — Every&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.morphllm.com/best-ai-model-for-coding" rel="noopener noreferrer"&gt;Best AI for Coding (2026): Every Model Ranked by Real Benchmarks — Morph&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/gpt-6-released-again" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/gpt-6-released-again&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>information</category>
      <category>reflections</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Digital Identity Is the Biggest Leverage</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 14 Apr 2026 03:21:49 +0000</pubDate>
      <link>https://dev.to/skyguan92/digital-identity-is-the-biggest-leverage-380o</link>
      <guid>https://dev.to/skyguan92/digital-identity-is-the-biggest-leverage-380o</guid>
      <description>&lt;p&gt;Lately, I've had an increasingly strong feeling: identity in the digital world might be the most leveraged thing coming next. Not something you "should do," but something you "have to do."&lt;/p&gt;

&lt;h2&gt;Production Itself Is Not Valuable&lt;/h2&gt;

&lt;p&gt;How much code you wrote, how many commits you made, how many PRs you merged—what does it matter? Is a piece of software with 10 million lines of code automatically valuable?&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Did you create value by posting 1,000 articles or making 1,000 videos on Xiaohongshu (RedNote)? Not necessarily. The efficiency of production is too high. The cost of AI-generated video has dropped from roughly $4,500 per minute for traditional production to around $400. The production cycle for marketing videos has been compressed from 13 days to 27 minutes. When everyone can make things quickly and cheaply, the act of "making it" itself ceases to be scarce.&lt;/p&gt;

&lt;h2&gt;All Creation Is Experimentation&lt;/h2&gt;

&lt;p&gt;So what is valuable?&lt;/p&gt;

&lt;p&gt;I think all creation is essentially experimentation. Building software is validating an idea; posting a video is testing a hypothesis. You throw it out into the real world to interact with people and see if it works.&lt;/p&gt;

&lt;p&gt;For example, you build a tool that checks all the boxes on paper, but in practice it's inaccurate and inconvenient. The software itself creates little value; its only useful function is telling you: this approach doesn't work. The same goes for posting 1,000 videos that no one watches. The value isn't in those 1,000 videos; it's in the information that "this approach doesn't work."&lt;/p&gt;

&lt;p&gt;The "growth mindset" many people talk about is essentially doing exactly this. Eric Ries's Build-Measure-Learn loop in &lt;em&gt;The Lean Startup&lt;/em&gt; boils down to the same thing—validating at the lowest cost and highest speed, and even a failed validation is a result.&lt;/p&gt;

&lt;p&gt;So what really matters isn't how much you produce, but how fast you can get to the truth.&lt;/p&gt;

&lt;h2&gt;1,000 People&lt;/h2&gt;

&lt;p&gt;In 2008, Kevin Kelly wrote an essay called "1,000 True Fans," arguing that a creator needs only 1,000 true fans to build a career. In the AI era, this idea has taken on a different meaning.&lt;/p&gt;

&lt;p&gt;It's not that 1,000 fans are enough to pay your bills. It's that with 1,000 active followers, every experiment you run gets feedback faster.&lt;/p&gt;

&lt;p&gt;You casually ship a product, post a video, or voice an opinion, and someone will help you test it: does this idea work? They'll tell you if your product is good or bad, upvote your content or call it trash. You don't have to wait long to know if you're heading in the right direction before moving to the next iteration.&lt;/p&gt;

&lt;p&gt;Without those 1,000 people? You have to go find them, figure it out, depend on others. I've felt this deeply posting content on Xiaohongshu (RedNote) recently—finding that first group of people who connect with you closely is incredibly hard. Platform algorithm controls are aggressive; soft shadowbans happen all the time, and you can't tell whether your content is bad or just throttled. This barrier will only get higher, because attention is finite while content keeps expanding.&lt;/p&gt;

&lt;p&gt;If you don't do it, it just sits there waiting. The longer you wait, the harder it gets.&lt;/p&gt;

&lt;h2&gt;Who Is Doing This&lt;/h2&gt;

&lt;p&gt;Karpathy, formerly the research lead for Autopilot at Tesla, now lives like an internet celebrity. Nearly 2 million followers on X, over 1 million YouTube subscribers. In March this year, he released AutoResearch: a 630-line Python script that ran 700 ML experiments in two days, found 20 optimization points, and got covered by &lt;em&gt;Fortune&lt;/em&gt;. In April, he shared an idea for an LLM Wiki—a GitHub Gist that gained over 5,000 stars in a few days. Every project he puts out is validated by masses of people on whether it works; he gets confirmation quickly and moves to the next one. His feedback loop might be one-tenth of someone else's.&lt;/p&gt;

&lt;p&gt;Lei Jun is the same. People call him an entrepreneur and investor, but he's also an internet celebrity. 44.5 million followers on Douyin; 2024 NewRank data ranked him as the top entrepreneur IP on the platform. The SU7 delivered 100,000 units from its April 2024 launch to year-end, with 410,000 units for the full year of 2025. When he develops a new product, he polls people on whether it works, whether the price is too high, whether the looks are good—and gets answers in days. He runs experiments far faster than others. Last year, he also lost 290,000 followers in a single day over an SU7 accident, which itself trended on hot search with 34 million views—showing from both positive and negative sides how tightly personal brand and product are bound.&lt;/p&gt;

&lt;p&gt;Musk is even more extreme: 236 million followers on X, number one across all platforms, posting 8 to 12 times a day. Trump has 109 million, third overall, and stock traders watch his X feed closely. At a high level, these people are all doing the same thing: using their digital identity to run experiments, validate ideas, and collect feedback.&lt;/p&gt;

&lt;p&gt;It works on a smaller scale too. Some video creators build up their accounts by various means, accumulate traffic, and later pivot to selling agricultural products for their hometowns—ending up with more impact than a village chief or county magistrate. With that digital identity, they have a lever that can move other things.&lt;/p&gt;

&lt;p&gt;This is even more obvious with KOLs in the AI industry. From my previous business development work, I learned that current advertising rates for AI KOLs run roughly 1:1 with follower counts—for mid-tier accounts, 10,000 followers means about ¥10,000 per post, and 100,000 followers means about ¥100,000. Schedules are often fully booked; even with money in hand, they might not have a slot for you. Some KOLs in the AI agent space are already being invited by local governments to host conferences.&lt;/p&gt;

&lt;p&gt;Once accumulated, this asset can be deployed anywhere; it's not confined to one domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  By Year-End It Will Be Too Late
&lt;/h2&gt;

&lt;p&gt;My view is fairly radical: by the end of this year, not having a digital identity will become a serious problem.&lt;/p&gt;

&lt;p&gt;It used to be hard to build a website, hard to make content, hard to produce videos. But with coding agents becoming widespread, those barriers are disappearing fast. With text-to-image and text-to-video tools, content creation costs keep dropping. Everyone else is doing it; if you don't, the gap only widens.&lt;/p&gt;

&lt;p&gt;Here's a scenario from my own work. I once discussed this with a business development colleague on my team: suppose at year-end, the two of us go out to acquire new clients together. I've been maintaining an active online presence—blog, projects, social media—with about 1,000 quality followers who regularly reshare my work. You've done nothing from now until then; searching for you in the digital world turns up nothing.&lt;/p&gt;

&lt;p&gt;What's the first thing the other side does after meeting us? They search. Most people are already in the habit of checking someone on DeepSeek or ChatGPT first. Soon, agents will do it for you—"help me research this person, what's their background, their credibility, what opinions have they expressed?"&lt;/p&gt;

&lt;p&gt;One search turns up a rich trail: personal website, projects, opinions. The other search turns up nothing. Wouldn't that make you anxious? Would you find the person with no digital footprint trustworthy? The key is that everyone else is gradually becoming active.&lt;/p&gt;

&lt;p&gt;If you're a researcher or AI practitioner, not having a personal website with projects listed might already be hurting your job search—because others do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quan Hongchan
&lt;/h2&gt;

&lt;p&gt;Of course, digital identity has its headaches.&lt;/p&gt;

&lt;p&gt;The recent situation with Quan Hongchan is quite sad. She didn't set out to build a digital identity; she was thrust into the spotlight because she's so exceptionally good. After winning three gold medals in Tokyo and Paris, she went through a natural growth spurt—10 centimeters taller, 8 kilograms heavier—and online commentators repeatedly called her "fat." She herself said she felt anxious, suffered insomnia, had recurring nightmares, and seriously considered retiring. In early April, the General Administration of Sport (of China) stepped in to investigate cyberbullying, and on April 12, Guangdong police detained a man who had persistently insulted her.&lt;/p&gt;

&lt;p&gt;This illustrates the point well. If you don't actively build and manage your digital identity, its reach can far exceed your expectations. In real life, you might interact with only a handful of people; in the digital world, tens of millions might be watching you. Suddenly, without you noticing, it can collapse.&lt;/p&gt;

&lt;p&gt;Privacy is a real concern too. Once you're active, you have a social presence; you can't just say whatever you want. But this is no different from real-world social interaction—when you enter the workforce, you follow basic norms. As long as your identity is active, you have to manage it.&lt;/p&gt;

&lt;p&gt;That said, if you don't build it proactively, others will still search for you and talk about you. Passive is worse than active.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Biggest Leverage
&lt;/h2&gt;

&lt;p&gt;Back to the beginning.&lt;/p&gt;

&lt;p&gt;I believe the biggest personal leverage available right now is this: building your identity in the digital world. It has nothing to do with industry or direction. It accelerates the entire process of running experiments and getting to the truth. From 1,000 to 10,000 to 100,000, the snowball gets faster and faster.&lt;/p&gt;

&lt;p&gt;AI has lowered the barrier to doing this more than ever before. The rest is up to you.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/digital-identity-biggest-leverage" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/digital-identity-biggest-leverage&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>reflections</category>
    </item>
    <item>
      <title>Failing Faster</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Mon, 13 Apr 2026 04:10:23 +0000</pubDate>
      <link>https://dev.to/skyguan92/failing-faster-1be0</link>
      <guid>https://dev.to/skyguan92/failing-faster-1be0</guid>
      <description>&lt;p&gt;I've talked a lot about the changes AI has brought to web coding before, mostly about its strengths. Today, let's look from a different angle and record a few examples of failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can't Fix Itself
&lt;/h2&gt;

&lt;p&gt;Claude Code has a Chrome browser extension that controls the browser through the MCP protocol, allowing it to search for things, click buttons, and take screenshots directly in the browser.&lt;/p&gt;

&lt;p&gt;The extension worked fine at first, but one day it suddenly stopped opening.&lt;/p&gt;

&lt;p&gt;At the time, I still had faith in its capabilities. After all, it's the company's own plugin connecting to its own tool—a pure software issue that should be resolved quickly. So I said: Take a look yourself and fix it.&lt;/p&gt;

&lt;p&gt;It ended up taking three or four hours.&lt;/p&gt;

&lt;p&gt;The process in between was quite interesting. Opus 4.6, effort set to high, letting it find the cause itself. Every time it would analyze extensively, say "discovered a crucial clue," change some configurations, and finally say "restart your session and it should work." Restart, doesn't work. Analyze again, change again, say restart again. Still doesn't work.&lt;/p&gt;

&lt;p&gt;Later it even asked me to check the logs in the Chrome extension console and copy the error messages to it. Once it starts asking you for information, the direction goes off track. It kept requesting more debugging data, but every round's conclusion was "restart it."&lt;/p&gt;

&lt;p&gt;This cycle repeated many rounds. By the end it was almost comical.&lt;/p&gt;

&lt;p&gt;Finally, I reminded it: Go search online and see if others have encountered similar issues. It searched around, found issues others had filed on GitHub, followed the solution approach, and fixed it in 15 minutes.&lt;/p&gt;

&lt;p&gt;Later I learned that the extension's architecture is a multi-hop connection chain: CLI → WebSocket → bridge.claudeusercontent.com → native messaging host → Chrome extension. Any broken hop causes connection failure, and when it breaks there's no clear error message. If Claude Desktop is installed at the same time, the two programs fight over the same native messaging host. It's a known issue in the claude-code repository on GitHub.&lt;/p&gt;
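&lt;p&gt;The failure mode generalizes: in a multi-hop chain, probing each hop in order turns a generic "connection failed" into a named culprit. A minimal sketch in Python—the hop names mirror the chain above, but the probe functions are hypothetical stand-ins:&lt;/p&gt;

```python
# Toy sketch: probe a multi-hop connection chain hop by hop, so a break
# yields a specific culprit instead of a generic "connection failed".
# Hop names mirror the chain described above; the probes are stand-ins.

def check_cli(): return True
def check_websocket(): return True
def check_bridge(): return False          # simulate the broken hop
def check_native_host(): return True
def check_extension(): return True

CHAIN = [
    ("CLI", check_cli),
    ("WebSocket", check_websocket),
    ("bridge.claudeusercontent.com", check_bridge),
    ("native messaging host", check_native_host),
    ("Chrome extension", check_extension),
]

def diagnose(chain):
    """Return the name of the first broken hop, or None if all pass."""
    for name, probe in chain:
        if not probe():
            return name
    return None

print(f"first broken hop: {diagnose(CHAIN)}")
# prints: first broken hop: bridge.claudeusercontent.com
```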

&lt;p&gt;The company's own plugin, connecting to its own tool—I thought this would be the most familiar territory. Three or four hours of self-diagnosis, and finally "go search" solved it in 15 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Xiaohongshu Experiment
&lt;/h2&gt;

&lt;p&gt;I've accumulated quite a bit of blog content and was wondering if I could convert it into traffic on Xiaohongshu (Little Red Book), so people who don't know me could see it too.&lt;/p&gt;

&lt;p&gt;First tried the most direct approach: posting the blog articles there. The effect was approximately zero.&lt;/p&gt;

&lt;p&gt;Then I had Claude Code help me design an experimental plan. The plan itself was well done, considering content adaptation and hashtag strategy, much more systematic than what I would do myself. But the results were all bad.&lt;/p&gt;

&lt;p&gt;Investigation revealed that everything was being softly throttled.&lt;/p&gt;

&lt;p&gt;Xiaohongshu's control over new accounts is aggressive. It won't directly tell you "your content violated rules"—it just quietly withholds traffic. Can't find it in search, no recommendations either. You post something, everything looks normal, but nobody sees it. You don't know if the content is bad or if you're being throttled—that's what's annoying.&lt;/p&gt;

&lt;p&gt;Later I checked and found that in April 2025 alone, Xiaohongshu penalized 1 million accounts. The platform requires originality above 60%, and notes under 600 characters get suppressed visibility. For content batch-generated by AI and then distributed, NLP-level semantic understanding makes word substitution and homophone tricks basically useless. And because registration is so easy, new accounts get heavy scrutiny—the odds of being intercepted while still establishing an account are very high.&lt;/p&gt;

&lt;p&gt;After several rounds of experiments, I put it down. This can't be solved for now.&lt;/p&gt;

&lt;p&gt;However, going through this process made me more convinced that my initial choice was right. If I had started out creating content on Xiaohongshu, facing opaque algorithms and rules that could change at any time, with a new account getting no positive feedback, I might have given up halfway. Personal websites don't have these issues—you can write whatever you want. Friends will occasionally say "this is quite interesting," and an article cross-posted to Zhihu sometimes gets decent feedback, with plenty of discussion and bookmarks. That feedback is natural. A blog's asset is ideas; having content first and then finding distribution is much healthier than depending on a platform first and thinking about content later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure Is the Main Theme
&lt;/h2&gt;

&lt;p&gt;Neither of these examples is a big deal. The first fell short of expectations but was finally resolved. The second fell short of expectations and couldn't be resolved.&lt;/p&gt;

&lt;p&gt;But I don't think we should be disappointed with AI coding because of this, nor should we look down on it and stop using it.&lt;/p&gt;

&lt;p&gt;In daily work, failure is the default mode to begin with. Think about the entire work process—moments of actually achieving results are few; most of the time goes to hitting walls and adjusting direction. Accumulating experience through massive failure, exploring along the way—that's the norm.&lt;/p&gt;

&lt;p&gt;Just like gaming. Nobody starts out as a top player. In the beginning everyone gets beaten, plays like a noob, and gradually improves through failure.&lt;/p&gt;

&lt;p&gt;Working with models can't avoid this process either. CodeRabbit's report this year says AI-generated code produces 1.7 times more issues than human-written code. Only about 30% of Copilot's suggestions are accepted by developers. There's also an interesting survey saying developers think they're 20% faster using AI, but actual calculations show they're 19% slower because of review and bug fixing.&lt;/p&gt;

&lt;p&gt;These numbers don't look great. But from another angle, the value of AI coding might not lie in "success rate."&lt;/p&gt;

&lt;p&gt;Software development has a fail-fast principle. Jim Shore said something like: Failing immediately and obviously sounds like it makes software more fragile, but actually makes it more robust. Eric Ries's &lt;em&gt;The Lean Startup&lt;/em&gt; follows the same thinking—the Build-Measure-Learn loop is essentially about validating hypotheses with the smallest cost and fastest speed; validating and failing is also a result.&lt;/p&gt;
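&lt;p&gt;The fail-fast idea is easy to show in code. A minimal sketch, using a hypothetical config loader: the strict version names the missing piece at startup, instead of letting a bad config surface as a confusing error somewhere else later:&lt;/p&gt;

```python
# Minimal illustration of fail-fast: validate assumptions immediately and
# loudly. The config keys here are invented for the example.

def load_config_fail_fast(raw: dict) -> dict:
    """Reject a bad config at startup, naming the exact problem."""
    for key in ("api_key", "model", "timeout"):
        if key not in raw:
            raise ValueError(f"config missing required key: {key!r}")
    return raw

try:
    load_config_fail_fast({"api_key": "...", "model": "opus"})
except ValueError as e:
    print(e)   # prints: config missing required key: 'timeout'
```

The lenient alternative—just returning `raw` unchecked—only blows up when the missing key is finally used, which is exactly the "fragile-looking but actually more robust" trade Jim Shore describes.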

&lt;p&gt;What AI coding does is compress this loop. Before, fixing a Chrome extension issue might take several days of searching, learning, asking for help, and finally maybe giving up. Now it's three or four hours where the human does nothing in between, just watching it run round after round, restarting it, and finally it's fixed. That Xiaohongshu experiment might have taken much longer to realize it was a dead end; now a few rounds of experiments give a conclusion.&lt;/p&gt;

&lt;p&gt;The surface of attempts has widened, and the speed of feedback has increased. Before, tinkering with something took a long time to know the result; now it might be shortened to one-tenth of the original time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models Are Getting Stronger
&lt;/h2&gt;

&lt;p&gt;Another feeling from this period: models are improving faster than expected.&lt;/p&gt;

&lt;p&gt;I started tinkering with AI coding at the end of last December. I first used Sonnet 4.5 for a day but couldn't stick with it. Kimi K2.5 happened to be released at the end of January; I bought a membership and tried it—not as good as Sonnet 4.5, but usable. I tried Codex 5.2 around the same time—a bit dumb, worse than K2.5. I didn't have high expectations; since I'd started down this path, I figured I might as well do a small project and see.&lt;/p&gt;

&lt;p&gt;I asked others to join that small project too. The process wasn't smooth—various issues, a lot of back and forth. But it could indeed be done, which was surprising.&lt;/p&gt;

&lt;p&gt;Then new ideas started coming. I wanted to build a demo for an exhibition—I had a rough idea and figured it shouldn't be hard. I had K2.5 try it and spent two or three days without getting it to work; every time it said "done, it works," but the result simply wouldn't run.&lt;/p&gt;

&lt;p&gt;In early February, Opus 4.6 came out and solved it in one go.&lt;/p&gt;

&lt;p&gt;Only at moments like this do you truly feel the model's progress. It's not about benchmark scores improving by a few points—it's about solving problems in actual use that you previously couldn't get past no matter what.&lt;/p&gt;

&lt;p&gt;While traveling I used GLM-5 and MiniMax 2.5—domestic models are cheaper. Later I used Opus 4.6 to build a small plugin: again back and forth, not working, every round claiming it was done, no problem, but failing on the first try. I threw the same task at GPT-5.4, released in early March; three hours later it said it was done. One try—actually done.&lt;/p&gt;

&lt;p&gt;All these things happened within three months.&lt;/p&gt;

&lt;p&gt;What does this mean? Problems that are stuck now might not be problems next month. Either spend more time letting the model run and try more, working with it. Or wait a bit—when a new model comes out it might be a different story.&lt;/p&gt;

&lt;p&gt;Of course, when encountering platform rule issues like Xiaohongshu, no matter how strong the model is, it can't help you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not a Wish
&lt;/h2&gt;

&lt;p&gt;If you think having AI can change "failure as the main theme" to "success as the main theme," you're thinking too much. That's making a wish, not using a tool.&lt;/p&gt;

&lt;p&gt;What actually happens is: it's still failure as the main theme, but failing faster and failing more.&lt;/p&gt;

&lt;p&gt;We are in an era of high uncertainty and intense competition. It's impossible for everything you do to produce results immediately. But if the speed of failure increases and the scope widens, crossing these detours to reach valuable places will also be faster.&lt;/p&gt;

&lt;p&gt;I think this might be what makes it interesting at this stage.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/failing-faster-with-ai" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/failing-faster-with-ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>thoughts</category>
    </item>
    <item>
      <title>When Dumpling Shops Start Publishing Skills</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Sat, 11 Apr 2026 08:31:49 +0000</pubDate>
      <link>https://dev.to/skyguan92/when-dumpling-shops-start-publishing-skills-37o0</link>
      <guid>https://dev.to/skyguan92/when-dumpling-shops-start-publishing-skills-37o0</guid>
      <description>&lt;p&gt;I saw a post the other day saying GitHub is becoming like Xiaohongshu (RED). Thought about it, and yeah, it kind of makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Casual Participation
&lt;/h2&gt;

&lt;p&gt;At the end of March, the Claude Code source code leaked. Not open-sourced intentionally—a 59.8 MB source map was bundled in the npm package. Following it led to the complete source code on Anthropic's storage bucket: nearly 1,900 TypeScript files, 510,000 lines of code. Bun generates source maps by default, and nobody added it to .npmignore, so it leaked out just like that.&lt;/p&gt;

&lt;p&gt;After the leak, a bunch of open-source projects popped up. One of them is called OpenClaw, previously named Clawdbot—it was renamed following an Anthropic trademark complaint. It supports various LLMs and now has 350k stars.&lt;/p&gt;

&lt;p&gt;I had wanted to hack Claude Code to support OpenAI models. Codex isn't as smooth as Claude Code when it comes to information organization and task orchestration, but after assessing the workload, it was too much, so I gave up. When I saw OpenClaw, I thought: sure enough, if you can think of it, someone's basically already done it.&lt;/p&gt;

&lt;p&gt;Downloaded it and gave it a try. It works, but the reasoning effort defaults to high, while I usually use extra high—and there was no way to change it. Used alongside Claude Code, it can indeed tackle more complex problems.&lt;/p&gt;

&lt;p&gt;So I got to work. Had Claude Code help me modify OpenClaw's code, adding a three-layer structure: provider → model → effort, similar to the multi-API architecture approach used by Open Code and Kilo Code. After the changes, there were bugs, so I debugged them a bit and casually submitted a PR.&lt;/p&gt;

&lt;p&gt;CI failed. Didn't bother with it. A colleague tried it and said fast mode doesn't work—true enough, the tool is too slow for daily use without fast mode. Made another round of changes and added them to the PR. CI failed again, smoke tests didn't pass.&lt;/p&gt;

&lt;p&gt;The original author replied with one word: conflict.&lt;/p&gt;

&lt;p&gt;Understandable. Changing from the original fixed design of three models to a multi-provider, multi-API approach is too big a shift—it conflicts with the direction he wants to maintain.&lt;/p&gt;

&lt;p&gt;But the whole process was quite interesting. See it, download it. Doesn't work smoothly? Modify it. Changed it? One-click PR submission. The other person's comments pushed to email—take a look, reply. Kind of like scrolling through Xiaohongshu. No longer that ceremonial sense of "formally participating in an open-source project" from before; you just see it and do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dumpling Shop's Skill
&lt;/h2&gt;

&lt;p&gt;Around the same time, I saw another post.&lt;/p&gt;

&lt;p&gt;Jinguyuan Dumpling Shop, a dumpling restaurant next to Beijing University of Posts and Telecommunications (BUPT)—the owner made a Claude Code skill and posted it on their official WeChat account.&lt;/p&gt;

&lt;p&gt;The content is kind of funny: the menu, delivery info, Wi-Fi password—all packed in there. The owner said you can use this skill in the shop to get the latest information.&lt;/p&gt;

&lt;p&gt;The funniest part is the comments under the WeChat post. A bunch of people are opening issues for the dumpling shop.&lt;/p&gt;

&lt;p&gt;Official WeChat accounts have become GitHub-ified.&lt;/p&gt;

&lt;p&gt;The owner said since the shop is next to BUPT, customers coming in to eat talk about AI, skills, and Claude Code every day. After hearing enough, he went home and vibe coded for a few hours to write it.&lt;/p&gt;

&lt;p&gt;This skill itself isn't very useful. Who would install a skill just to check a dumpling shop's Wi-Fi password? But if they included the dumpling-making process, like how Lao Xiang Ji (Country Style Cooking) did with recipes back in the day—glancing at the steps while cooking, having it recommend ingredient combinations—that would actually be interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hollywood's GitHub
&lt;/h2&gt;

&lt;p&gt;I was having dinner with someone the other day and asked if they'd seen &lt;em&gt;Resident Evil&lt;/em&gt;. The actress in it, Milla Jovovich—who also starred in &lt;em&gt;The Fifth Element&lt;/em&gt;—in early April this year published a project under her own GitHub account.&lt;/p&gt;

&lt;p&gt;The project is called MemPalace, an AI memory system. Inspired by the memory palace technique, conversation data is organized in a three-layer structure: wing, hall, room. It stores raw conversations, not summaries, runs locally with ChromaDB plus SQLite, zero API costs.&lt;/p&gt;
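&lt;p&gt;The wing → hall → room idea can be sketched with just the standard library. This toy version stores raw conversation text under a three-level key and recalls at any level; MemPalace's real implementation (ChromaDB plus SQLite, with semantic search) is far more involved:&lt;/p&gt;

```python
# A toy three-layer "memory palace" store (wing -> hall -> room) that keeps
# raw conversation text rather than summaries. Uses stdlib sqlite3 only;
# this is an illustration, not MemPalace's actual code.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memories (
    wing TEXT, hall TEXT, room TEXT, raw_text TEXT
)""")

def remember(wing, hall, room, raw_text):
    db.execute("INSERT INTO memories VALUES (?, ?, ?, ?)",
               (wing, hall, room, raw_text))

def recall(wing, hall=None, room=None):
    """Fetch raw conversations at any level of the hierarchy."""
    query, args = "SELECT raw_text FROM memories WHERE wing = ?", [wing]
    if hall is not None:
        query += " AND hall = ?"; args.append(hall)
    if room is not None:
        query += " AND room = ?"; args.append(room)
    return [row[0] for row in db.execute(query, args)]

remember("work", "projects", "refactor", "We decided to split the workers.")
remember("work", "projects", "ci", "Pipeline takes 40 minutes per run.")
print(recall("work", "projects", "ci"))
```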

&lt;p&gt;The motivation was straightforward: when chatting with AI, she found that existing memory systems arbitrarily decide what to remember and what to forget. Couldn't stand it, so she roped in a crypto developer named Ben Sigman, and the two of them built it over several months using Claude Code.&lt;/p&gt;

&lt;p&gt;Got over 20k stars in two days, now over 40k. The controversy is around the benchmarks. The project initially claimed to have achieved 100% on LongMemEval, but was later suspected of being specifically optimized for the test questions, and was revised to 96.6%. However, the architecture itself received positive reviews—a computer science professor at USC also gave positive feedback.&lt;/p&gt;

&lt;p&gt;Hollywood celebrity, GitHub repo owner, 40k stars. If you had said this a few years ago, nobody would have believed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distilling Colleagues
&lt;/h2&gt;

&lt;p&gt;A phrase has started trending on WeChat Moments: "distill your colleagues into a skill."&lt;/p&gt;

&lt;p&gt;Someone actually did it. A project called "colleague.skill" got 70k stars within days of launching. It feeds in colleagues' Feishu (Lark) messages, DingTalk documents, and work emails to generate an AI skill that mimics that person's work habits and decision-making style.&lt;/p&gt;

&lt;p&gt;Derivative projects keep popping up. Distilling ex-partners, distilling yourself, distilling public figures. The most extreme is someone made an "anti-distillation"—generating a skill file that looks complete but deliberately hides core knowledge, preventing oneself from being distilled.&lt;/p&gt;

&lt;p&gt;I think people are overthinking this.&lt;/p&gt;

&lt;p&gt;Most people's so-called personal style at work has little value. Communication habits, interaction styles—once you actually use them, you find they don't create anything. The effect of these personal quirks is smaller than the differences between the models themselves. Matching different models with different skills would probably get you much further than distilling a person.&lt;/p&gt;

&lt;p&gt;The scenario people imagine is provocative: pour in the chat logs and the person becomes replaceable—cue the anxiety. But like the dumpling shop's Wi-Fi password, there's simply no demand for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills Are Wrong, Agents Are Right
&lt;/h2&gt;

&lt;p&gt;The dumpling shop owner actually said something interesting: the future might be location-based. Your personal assistant agent walks into the shop and directly interacts with the shop's agent.&lt;/p&gt;

&lt;p&gt;I think this direction is correct.&lt;/p&gt;

&lt;p&gt;You bring your own agent; it knows your taste, what you ate recently, what dietary restrictions you have. Walk into a shop, your agent chats with the shop's agent: What do you have? What's recommended today? Which dish has good reviews? After chatting, it synthesizes recommendations based on your preferences. You only need to talk to your own agent.&lt;/p&gt;

&lt;p&gt;This is different from scanning a QR code. Scanning a code is you facing a bunch of dish names and ratings, looking and choosing yourself. Agent-to-agent is two programs that understand their respective owners communicating on your behalf.&lt;/p&gt;

&lt;p&gt;If a shop consolidates its experience, culinary knowledge, and customer feedback into its own agent service, your agent connects with it, learns what needs to be learned, and you just make the final call. This is completely different from the old days of scanning codes to order food.&lt;/p&gt;

&lt;p&gt;The problem with skills is that they still require "people to actively install and use them." A Wi-Fi password made into a skill—nobody installs it. But when the shop becomes an agent service, you walk in and automatically connect.&lt;/p&gt;
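&lt;p&gt;As a toy sketch of the owner's idea: the shop's agent exposes structured menu data, and your personal agent filters it against your constraints before you ever see it. Every name and message shape here is invented for illustration:&lt;/p&gt;

```python
# Toy agent-to-agent sketch: a shop agent serves structured menu data; a
# personal agent applies its owner's dietary constraints and preferences.
# All dishes, fields, and thresholds are invented for this example.

SHOP_MENU = [
    {"dish": "pork dumplings", "contains": {"pork", "wheat"}, "rating": 4.8},
    {"dish": "shrimp dumplings", "contains": {"shrimp", "wheat"}, "rating": 4.6},
    {"dish": "veggie dumplings", "contains": {"wheat"}, "rating": 4.2},
]

def shop_agent(query):
    """The shop's side: answer 'menu' with structured data."""
    return SHOP_MENU if query == "menu" else []

def personal_agent(owner_allergies, min_rating=4.5):
    """The diner's side: query the shop, then filter by the owner's constraints."""
    menu = shop_agent("menu")
    return [d["dish"] for d in menu
            if d["contains"].isdisjoint(owner_allergies)
            and d["rating"] >= min_rating]

print(personal_agent(owner_allergies={"shrimp"}))   # prints: ['pork dumplings']
```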

&lt;h2&gt;
  
  
  Where Did the Barriers Go
&lt;/h2&gt;

&lt;p&gt;Restaurant owners vibe coding skills for a few hours. Hollywood celebrities as primary GitHub authors. Alternatives with 350k stars popping up days after a source code leak. My entire process of submitting a PR was as casual as posting a Xiaohongshu note.&lt;/p&gt;

&lt;p&gt;In the past, "participating in open source" meant reading documentation, reading code, writing tests. Now you see it, modify it, submit it. GitHub is becoming like Xiaohongshu, and official WeChat accounts are becoming like GitHub. The barriers are indeed disappearing.&lt;/p&gt;

&lt;p&gt;As for agent interoperability, looking at the pace of projects like OpenClaw, it might be closer than most people think.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/github-xiaohongshu-era" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/github-xiaohongshu-era&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>thoughts</category>
    </item>
    <item>
      <title>Emergency Room and the Vanishing Moat</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 10 Apr 2026 15:31:47 +0000</pubDate>
      <link>https://dev.to/skyguan92/emergency-room-and-the-vanishing-moat-21gf</link>
      <guid>https://dev.to/skyguan92/emergency-room-and-the-vanishing-moat-21gf</guid>
      <description>&lt;p&gt;I recently refactored Aima Service. It took about a week, and a lot of thoughts came up during the process—jotting them down here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Emergency Room
&lt;/h2&gt;

&lt;p&gt;Aima Service isn't the kind of tool you open every day. It's more like an emergency room for devices—you don't think about it normally, only when something goes wrong.&lt;/p&gt;

&lt;p&gt;This creates an awkward dynamic: success feels like nothing to the user. Problem solved, "yeah, okay," and they're gone. Since they never experienced the pain, they naturally don't realize how significant the solution was.&lt;/p&gt;

&lt;p&gt;Failure, on the other hand, is much more interesting.&lt;/p&gt;

&lt;p&gt;There are several types of failure. The agent tried seriously but couldn't fix it and tells you the task failed—that's one. Claiming success when a test shows otherwise—false success—is another. But the worst is when the channel itself breaks: crashes, freezes, disconnections mid-process.&lt;/p&gt;

&lt;p&gt;The first two are manageable. Like going to the ER where the doctor couldn't cure you—you just go somewhere else. The last one is unacceptable. You called 120 (emergency services), the ambulance arrived, but the ER is closed when you get there. Or you get inside, the registration system is down, the doctor disappears halfway through, and someone comes out to say "sorry, we're closed for today."&lt;/p&gt;

&lt;p&gt;That's the real crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decor Doesn't Matter, Doctors Must Be There
&lt;/h2&gt;

&lt;p&gt;Once you understand this, priorities become clear.&lt;/p&gt;

&lt;p&gt;In an emergency room, fancy decor is useless. Comfortable sofas are useless. Only two things matter: is the door open, and can the doctor treat patients?&lt;/p&gt;

&lt;p&gt;The previous version was functionally adequate but unstable. Tasks often ran into bugs, crashed constantly, and suffered from mysterious freezes. The ER door was open, but the doctor wasn't there.&lt;/p&gt;

&lt;p&gt;So I did a complete refactoring. No new features—just rebuilding the foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Models Alternating
&lt;/h2&gt;

&lt;p&gt;The refactoring used Claude Code and Codex, alternating between the two.&lt;/p&gt;

&lt;p&gt;First, I designed the documentation system, had both models read the docs and code, and list what needed to be done. Then Claude Code ran the first round of refactoring, handed it off to Codex for the second round, and back and forth.&lt;/p&gt;

&lt;p&gt;Why not just one? I tried—it tends to drift. Claude Code has strong architectural sense, seeing structural-level issues, but sometimes it's too conservative. Codex moves fast, acts boldly, and circles back to catch details, but occasionally it's too rough. Letting just one do it amplifies its weaknesses. Alternating actually creates the best rhythm—issues found by one are often fixed by the other.&lt;/p&gt;
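&lt;p&gt;The alternating rhythm is essentially a round-robin review loop. A minimal sketch with stand-in functions instead of real model calls—one "reviewer" clears structural issues, the other clears details, and each round hands the remainder to the other:&lt;/p&gt;

```python
# Sketch of alternating two reviewers over the same issue list; the 'models'
# are stand-in functions, not real API calls. Issue tags are illustrative.

def model_a(issues):
    # Strong on architecture: clears structural issues, may miss details.
    return [i for i in issues if not i.startswith("structure:")]

def model_b(issues):
    # Fast and detail-oriented: clears detail issues, may miss structure.
    return [i for i in issues if not i.startswith("detail:")]

def alternate(issues, rounds=4):
    """Hand the remaining issues back and forth until clean or out of rounds."""
    reviewers = [model_a, model_b]
    for r in range(rounds):
        issues = reviewers[r % 2](issues)
        if not issues:
            break
    return issues

remaining = alternate(["structure: god module", "detail: off-by-one",
                       "structure: circular import"])
print(remaining)   # prints: []
```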

&lt;p&gt;Later I checked and found a NeurIPS 2025 paper titled "Lessons Learned" that specifically analyzed how different LLMs complement each other, concluding that around 3 agents works best, with diminishing returns beyond that. Matches my experience exactly.&lt;/p&gt;

&lt;p&gt;Each round lets the models spin up agent teams to run in parallel. A single task taking three or four hours is normal. It's all asynchronous anyway—you do your own thing, occasionally check in and ask a few questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI/CD Ate More Time Than Expected
&lt;/h2&gt;

&lt;p&gt;The functional refactoring was roughly done in about a week. What really consumed time was what came after.&lt;/p&gt;

&lt;p&gt;The code changed extensively—what catches problems before going live? Automated testing, build checks, integration validation—none can be skipped. One pipeline run takes dozens of minutes; fix something, run again, another few dozen minutes. Then came round after round of adjusting test cases and optimizing the pipeline—the test code alone ran to tens of thousands of lines.&lt;/p&gt;

&lt;p&gt;When people talk about AI coding, they focus on "how fast." But to actually reach production, CI/CD and testing may account for over half the engineering effort. AI can do this part too, but it takes many rounds and lots of debugging—you can't rush it.&lt;/p&gt;

&lt;h2&gt;
  
  
  1.3 Million Lines
&lt;/h2&gt;

&lt;p&gt;After refactoring, I generated a report and looked at the numbers.&lt;/p&gt;

&lt;p&gt;Functionally, nothing new was added—it looks unremarkable. But the architecture transformed from "running tasks with bugs everywhere" to a design that can handle 100,000-level users. Modules are cleanly split, with distributed workers and overseas federation built in.&lt;/p&gt;

&lt;p&gt;The code volume surprised me: approximately 1.3 million lines excluding documentation, 1.7 million with docs. Eight or nine days, one non-specialist, two AI models.&lt;/p&gt;

&lt;p&gt;Then I remembered something.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moat Dried Up
&lt;/h2&gt;

&lt;p&gt;A few years ago, there was a popular saying: code volume is the moat of software companies. Millions of lines of code piled up—others can't copy it even if they want to.&lt;/p&gt;

&lt;p&gt;I remember reading about this back then and thinking it made sense. Cisco IOS XE has 190 million lines of code, maintained by over 3,000 people, releasing 700+ new features annually. SAP's ABAP code exceeds 250 million lines, with fewer and fewer people who can understand it. These companies indeed rely on "this thing is too complex for anyone to replace."&lt;/p&gt;

&lt;p&gt;But thinking carefully, code volume was never the entire moat. Cisco's moat is 190 million lines plus ecosystem and switching costs, plus brand. SAP's barrier isn't that ABAP is hard to write—it's that out of 425,000 customers, only 5% migrated to S/4HANA within seven years. Lidl tried, burned €500 million, and gave up. Revlon lost $64 million in sales. The lock-in effect created by complexity is far more stubborn than the code itself.&lt;/p&gt;

&lt;p&gt;But now it's getting interesting. Earlier this year, Marek Kowalkiewicz wrote "Drying the Moat," mentioning that after Anthropic demonstrated AI reading and modernizing COBOL systems, IBM lost $40 billion in market cap in a single day. Code complexity actually creates an "understanding asymmetry": you can't read my code, so you can't leave me. AI erased that asymmetry.&lt;/p&gt;

&lt;p&gt;Looking back at myself: eight or nine days, one person with two models alternating, and a 1.3-million-line system is already running in production, with a distributed architecture and CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;1.3 million certainly isn't Cisco's 190 million—two orders of magnitude apart. Ecosystem and customer lock-in can't be replaced by code. But the "code complexity" leg is already being pulled out from under the table, and how long the remaining legs will hold is hard to say.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do It When You Think of It
&lt;/h2&gt;

&lt;p&gt;Looking back at the whole process, a few thoughts.&lt;/p&gt;

&lt;p&gt;AI can do product-level software. What came out of this is already running in production with active users—not a demo. The hard parts are CI/CD and testing; they take time, but AI can handle them too; it just needs more iterations.&lt;/p&gt;

&lt;p&gt;Refactoring isn't that scary anymore. Before, when you inherited a messy legacy codebase, just figuring out what it did would take weeks. Now models read it in minutes and can draw you architecture diagrams. One week from mess to new architecture, with the technical debt cleaned up more thoroughly than manual work would have managed.&lt;/p&gt;

&lt;p&gt;The biggest change might be mindset. Before, refactoring was a major decision—you'd calculate headcount, timeline, risk. Now, when the codebase can't sustain itself, just rebuild it. New technology emerges? Use it to rebuild from scratch. Code is increasingly becoming a consumable.&lt;/p&gt;

&lt;p&gt;Of course code is still the skeleton of the product—that hasn't changed. But the cost of producing that skeleton is no longer on the same order of magnitude as before.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/emergency-room-vanishing-moat" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/emergency-room-vanishing-moat&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>product</category>
      <category>refactoring</category>
      <category>thoughts</category>
    </item>
    <item>
      <title>The Million-Scale Gap of Coding Agents</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 10 Apr 2026 15:23:40 +0000</pubDate>
      <link>https://dev.to/skyguan92/the-million-scale-gap-of-coding-agents-50i6</link>
      <guid>https://dev.to/skyguan92/the-million-scale-gap-of-coding-agents-50i6</guid>
      <description>&lt;p&gt;Jensen Huang has been making bold claims these past two years—"the age of agentic AI has arrived," "trillion-dollar opportunities." In October last year, he said every NVIDIA engineer was using Cursor. At this year's GTC, he painted a picture of 75,000 humans paired with 7.5 million agents.&lt;/p&gt;

&lt;p&gt;It sounds like coding agents are already everywhere. I checked the actual numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Are Surprising
&lt;/h2&gt;

&lt;p&gt;Claude.ai has roughly 10–20 million monthly active users. Third-party estimates put Claude Code's weekly active users at around 1.6 million. Cursor has over 2 million users, with 1 million paying. On the open-source side, OpenCode has 140k stars on GitHub, and Cline has been installed over 5 million times in VS Code.&lt;/p&gt;

&lt;p&gt;These aren't small numbers. But GitHub Copilot alone has nearly 20 million users, and a JetBrains survey from early this year found that 74% of developers are already using some kind of AI coding tool.&lt;/p&gt;

&lt;p&gt;Copilot mainly does autocomplete—it doesn't qualify as an agent. The tools that actually work in agent mode—you give one a task and it reads files, writes code, and runs tests on its own—have maybe a few million to just over ten million users combined.&lt;/p&gt;

&lt;p&gt;Something that the world's highest-valued company calls "world-changing" only has this many users. I originally guessed at least tens of millions.&lt;/p&gt;

&lt;p&gt;It's bustling in China, though. The Kimi platform has over 30 million monthly active users. ByteDance's Trae has over 6 million registered users. But these numbers include plenty of non-coding scenarios—actual coding agent users aren't that many.&lt;/p&gt;

&lt;p&gt;Many people are discussing how to configure Skills, how to connect MCP. But maybe we need to take a step back: why can't so many people take that first step?&lt;/p&gt;

&lt;h2&gt;
  
  
  Not Just About Writing Code
&lt;/h2&gt;

&lt;p&gt;The most common misunderstanding is treating coding agents as "tools for writing code." People who don't code think it's irrelevant; people who do code think it's just an upgraded Copilot.&lt;/p&gt;

&lt;p&gt;In reality, these things do far more than code. Stitching videos, batch processing files, scraping data from websites, debugging unfamiliar software—it can do all of it. Essentially, it helps you control your computer to get tasks done; code is just its operating language.&lt;/p&gt;

&lt;p&gt;Mobile phones are the exception. Phones are inherently GUI-centric, so coding agents don't work well on them. Those projects using models to control phones had their moment in the spotlight, but the approach is completely different and still not quite there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The NPCs Have Woken Up
&lt;/h2&gt;

&lt;p&gt;The deeper problem is mindset.&lt;/p&gt;

&lt;p&gt;We've grown accustomed to predetermined interactions with software. Click here for this, drag there for that—everything has been designed. NPCs in games work the same way: they give you three options, you pick one; whether you read the dialogue or not doesn't matter.&lt;/p&gt;

&lt;p&gt;Now the NPCs can suddenly think. They wait for you to speak, then do as you say.&lt;/p&gt;

&lt;p&gt;This feeling is just like when ChatGPT first came out. I made quite a few tutorials teaching people how to use it, then realized most people got stuck on expression. They felt they needed to become "prompt engineers"—speaking precisely, making the AI obey, with fancier tricks for advanced users.&lt;/p&gt;

&lt;p&gt;It's not that complicated. Just treat the coding agent like a colleague. It understands what you say and can browse files in your project. Explain what you want to do clearly, and you're mostly done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Do It Yourself
&lt;/h2&gt;

&lt;p&gt;There's another pitfall related to habits.&lt;/p&gt;

&lt;p&gt;The more capable you are, the easier this trap is to fall into. When facing a problem, the first instinct is to do it yourself. It's like being a manager—someone else could clearly handle the task, but you always feel you'd be faster.&lt;/p&gt;

&lt;p&gt;But coding agents might be ten times faster than you. The quality might be temporarily lower, but iteration speed makes up for it. The problem is once you start doing it yourself, you slide back into old habits—asking DeepSeek or Doubao when you hit a snag, fixing it yourself, using AI as a consultant. The work is still yours.&lt;/p&gt;

&lt;p&gt;I now delegate 95% of my computer work to coding agents. Research, writing, programming, sending emails, operating web pages. Once you cross that threshold, you don't need anyone to teach you, because you can have it help you figure out how to use it better. That's the meta-skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human in the Loop
&lt;/h2&gt;

&lt;p&gt;But don't be too optimistic either.&lt;/p&gt;

&lt;p&gt;Concepts like agent managers and multi-agent orchestration sound beautiful—AI managing AI, fully autonomous operation. The direction is right, but we're not there yet.&lt;/p&gt;

&lt;p&gt;In practice, having someone knowledgeable in the loop makes a huge difference in efficiency. Letting agents orchestrate themselves completely—tasks fall apart once they get complex. The Worker &amp;amp; Manager dynamic I discussed in a previous blog post is exactly about this.&lt;/p&gt;

&lt;p&gt;For the short term, we still need someone in the middle. But this person's role isn't to do the work with their own hands—it's to clarify what is needed, check if the results are right, and make the call at key moments.&lt;/p&gt;

&lt;p&gt;Once you cross this threshold, many things naturally fall into place. If you don't, you'll always be on the outside watching others use it.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/coding-agent-adoption-gap" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/coding-agent-adoption-gap&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>thinking</category>
    </item>
    <item>
      <title>HappyHorse and the Hard Demand for Text-to-Video Privatization</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Fri, 10 Apr 2026 06:22:39 +0000</pubDate>
      <link>https://dev.to/skyguan92/happyhorse-and-the-hard-demand-for-text-to-video-privatization-1k88</link>
      <guid>https://dev.to/skyguan92/happyhorse-and-the-hard-demand-for-text-to-video-privatization-1k88</guid>
      <description>&lt;p&gt;On April 8, a pony suddenly appeared on the Artificial Analysis Video Arena leaderboard.&lt;/p&gt;

&lt;p&gt;An anonymous model called HappyHorse scored Elo 1333 for text-to-video and 1391 for image-to-video, breaking records in both categories and knocking ByteDance's SEEDANCE 2.0 off the top spot. SEEDANCE 2.0 had just been released in March of this year, ranking first above competitors like Google Veo 3, OpenAI Sora 2, and Runway Gen-4.5. Then a little horse came along and overturned it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tradition of Anonymous Benchmarking
&lt;/h2&gt;

&lt;p&gt;This kind of anonymous leaderboard climbing has become something of a recurring performance in the Chinese AI circle.&lt;/p&gt;

&lt;p&gt;In February this year, an anonymous language model called Pony Alpha appeared on OpenRouter—free to use, with a 200k context window, processing 40 billion tokens on its first day. Five days later, Zhipu AI announced: Pony Alpha was actually GLM-5, a 745B-parameter MoE architecture. In March came Hunter Alpha; the community speculated it was DeepSeek V4, but Xiaomi stepped forward to claim it as MiMo-V2-Pro, a trillion-parameter model.&lt;/p&gt;

&lt;p&gt;The benefits of anonymous benchmarking are straightforward: obtaining real blind test data without brand halo or baggage. Benchmark scores can be gamed; blind tests cannot.&lt;/p&gt;

&lt;p&gt;HappyHorse follows the same playbook. However, one detail quickly gave it away—Chinese and Cantonese appeared at the top of its supported languages list.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Domain Wars
&lt;/h2&gt;

&lt;p&gt;With the model going viral, domains naturally became targets.&lt;/p&gt;

&lt;p&gt;When I searched for HappyHorse's official website, I discovered something amusing: both happyhorse.io and happyhorse.com had been registered, with websites built and billing already activated. Clicking through reveals a full suite of services—text-to-image, text-to-music, and text-to-video—quite an impressive setup. But look closely, and they're not using the HappyHorse model at all; running in the backend is Lightricks' LTX—an open-source model from an Israeli company with only 2 billion parameters in its original form. I had tested it before; it's completely different from the HappyHorse that topped the leaderboard.&lt;/p&gt;

&lt;p&gt;Domain squatting happens faster than model training. But if someone unaware of the situation pays money thinking they're using that chart-topping HappyHorse, that's quite a scam.&lt;/p&gt;

&lt;p&gt;It doesn't stop at domains. Several HappyHorse-related repositories also popped up on HuggingFace—happyhorse-lab, happyhorseai, HappyHorseOrg—all looking official. But their creation dates show they were all registered on April 9, and inside each there's either a lone README or an empty repository. The READMEs are polished, mentioning "open source" and "number one," but there are no weight files. Riding the hype wave isn't limited to domain squatting anymore; even HuggingFace gets occupied.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mystery Remains Unsolved
&lt;/h2&gt;

&lt;p&gt;As of this writing, HappyHorse's origin remains undetermined.&lt;/p&gt;

&lt;p&gt;The model page on Artificial Analysis still reads "More details coming soon," using the placeholder image for mysterious models. The leaderboard recognizes its scores but provides no team background. No technical report, official GitHub repository, company announcement, or paper homepage has been found to close the identity loop.&lt;/p&gt;

&lt;p&gt;The most convincing inference currently points to Sand.ai. The technical descriptions circulating for HappyHorse—15B parameters, 40-layer single-stream Transformer, joint text-video-audio modeling, 8-step DMD-2 distillation, multilingual lip-sync—closely match daVinci-MagiHuman, jointly released by Sand.ai and SII-GAIR. Reports from 36Kr also point in this direction. But so far, this remains inference, not official confirmation.&lt;/p&gt;

&lt;p&gt;The claim of being "already open sourced" also warrants skepticism. Artificial Analysis marks models with open weights using the &lt;code&gt;Open Weights&lt;/code&gt; label on the leaderboard; HappyHorse currently lacks this designation. The current leading open-source video models remain in the LTX-2 Pro tier. Online articles claiming HappyHorse has been fully released under Apache 2.0 currently do not match any verifiable weight releases.&lt;/p&gt;

&lt;p&gt;Around the same time, Alibaba released Wanxiang 2.7, a 27B-parameter MoE architecture (14B active), supporting a "thinking mode." However, Wanxiang 2.7 currently only offers API access; the weights have not been made public. Previous Wanxiang series models were released as open source immediately; the reason for the change this time remains unclear.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Demand for Text-to-Video Privatization
&lt;/h2&gt;

&lt;p&gt;HappyHorse's identity will be revealed sooner or later. But what interests me more isn't who built it, but the privatization logic of text-to-video models.&lt;/p&gt;

&lt;p&gt;Every model category ignites a hardware category. When DeepSeek emerged, H20 orders exploded—Chinese companies placed over $16 billion in orders in Q1 2025 alone. After open-source language models gained traction, DeepSeek V3 running on a cluster of eight M4 Pro Mac Minis caused Mac Minis to sell out.&lt;/p&gt;

&lt;p&gt;What will text-to-video ignite? I believe the answer is consumer-grade GPUs and small inference boxes. Moreover, text-to-video has a stronger hard demand for privatization than language models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency-Insensitive, Cost-Sensitive
&lt;/h3&gt;

&lt;p&gt;Text-to-video is naturally a "can wait" scenario. Generating a video in the cloud takes several minutes regardless. Running locally is a bit slower—ten minutes or even half an hour—but that makes no essential difference. You won't stare at the progress bar; you'll do something else in the meantime.&lt;/p&gt;

&lt;p&gt;When latency isn't sensitive, cost becomes sensitive. Beyond compute, video in the cloud has one major expense that's easily overlooked: bandwidth. Videos are dozens to hundreds of megabytes; moving them around incurs frightening network fees. I recently calculated the costs—the servers themselves aren't that expensive, but the network bandwidth bill is something you don't want to look at twice. Deploy locally, and that expense disappears.&lt;/p&gt;
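
&lt;p&gt;A back-of-the-envelope sketch of that bandwidth bill (the per-GB egress rate here is an assumed figure for illustration, not any particular provider's actual price):&lt;/p&gt;

```python
# Rough monthly egress cost for serving generated videos from the cloud.
# The $0.09/GB rate is an ASSUMPTION for illustration; real pricing
# varies by provider, region, and volume discounts.

def monthly_egress_cost_usd(videos_per_day, avg_mb_per_video, usd_per_gb=0.09):
    gb_per_month = videos_per_day * 30 * avg_mb_per_video / 1024
    return gb_per_month * usd_per_gb

# 1,000 videos a day at 100 MB each is roughly 2.9 TB of egress a month.
print(round(monthly_egress_cost_usd(1000, 100), 2))
```

&lt;p&gt;Even at modest volume, the egress line item keeps scaling with every download and re-share; local generation zeroes it out entirely.&lt;/p&gt;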

&lt;p&gt;Latency insensitivity also leads to another conclusion: you don't need top-tier compute. Language model inference chases low latency, requiring the best cards available. Text-to-video is different—a bit slower is acceptable. That makes "not the fastest, but cheap enough" compute—gaming GPUs, previous-generation accelerator cards—a highly cost-effective choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dilemma of Regulations and Content Moderation
&lt;/h3&gt;

&lt;p&gt;This is something many haven't considered.&lt;/p&gt;

&lt;p&gt;Text content moderation is manageable; most scenarios won't encounter legal issues. Images and videos are different—IP infringement, portrait rights, sensitive content—regulations haven't fully settled, leaving cloud service providers caught in a difficult position.&lt;/p&gt;

&lt;p&gt;Cloud services face a dilemma: don't block, and you're liable if something happens; block, and you can't achieve precise technical filtering, resulting in massive collateral damage. The result is a "better to over-censor than under-censor" approach—no portrait uploads allowed, no specific IP references permitted, immediate blocking at the slightest detection of possible sensitivity. The user experience becomes heavily restricted.&lt;/p&gt;

&lt;p&gt;Local deployment avoids these problems. The model runs on your own machine, bypassing third-party moderation. During the Stable Diffusion era, massive text-to-image workflows ran locally not because local was faster, but because there were no moderation restrictions. Text-to-video will follow the same pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shorter Path to Monetization
&lt;/h3&gt;

&lt;p&gt;The value of language models has always been difficult to quantify. A better model writes a better paragraph—how much revenue does that generate? Unclear. Upgrading from a 32B model to a hundreds-of-billions-parameter private deployment, spending ten times the cost on H20s—can you earn ten times more? Nobody can say. The emergence of coding scenarios improved this somewhat, but previously, everyone genuinely couldn't calculate this equation.&lt;/p&gt;

&lt;p&gt;Text-to-video is completely different. A good video is good traffic; traffic is money. Spend a few hundred to generate a decent-quality video—if the content is interesting, the traffic generated might be worth thousands or even tens of thousands. Everyone can calculate this equation.&lt;/p&gt;

&lt;p&gt;SEEDANCE 2.0 is an example. Creators are willing to pay and queue for resources because videos produced with it genuinely achieve better metrics. The gap between good and bad models becomes visible after posting just a few videos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Chain Reaction
&lt;/h2&gt;

&lt;p&gt;Whether HappyHorse will open source, and when, remains uncertain. But we can calculate the implications.&lt;/p&gt;

&lt;p&gt;If we estimate based on the rumored 15B parameters, FP16 inference requires approximately 30GB of VRAM, while quantization to INT8 needs only around 15GB. A single RTX 4090 or 5090 could handle it. A small box like the DGX Spark with 128GB of unified memory would be even more comfortable, running inference with room to spare.&lt;/p&gt;
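
&lt;p&gt;The VRAM arithmetic behind those numbers is just parameter count times bytes per parameter, and it's only a lower bound (activations, caches, and framework overhead come on top):&lt;/p&gt;

```python
# Minimum VRAM needed just to hold the weights: params x bytes-per-param.
# Real inference needs more (activations, caches, framework overhead).

def weight_vram_gb(params_billion, bytes_per_param):
    # 1 billion params at 1 byte each is 1 GB, so this reduces neatly.
    return params_billion * bytes_per_param

print(weight_vram_gb(15, 2))  # FP16: 30 GB
print(weight_vram_gb(15, 1))  # INT8: 15 GB
```

&lt;p&gt;That's why the 15B rumor matters so much: at FP16 it just clears a 32GB card, and at INT8 it fits on a single 24GB consumer GPU with headroom.&lt;/p&gt;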

&lt;p&gt;If it actually open-sources at this scale, RTX 4090/5090 cards will likely become even harder to acquire. The DGX Spark's price has already risen from the initially announced $3,000 in 2025 to $4,699—an increase of over 50%—with supply already tight. Adding another VRAM-hungry workload to the mix will only make the situation more extreme.&lt;/p&gt;

&lt;p&gt;We've seen this script play out several times before. DeepSeek ignited H20 demand; open-source LLMs pulled Mac Mini sales. Text-to-video has reached today's quality levels; it only lacks a good enough open-source model to land. Whether HappyHorse gets this opportunity remains to be seen, but it will happen sooner or later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unresolved
&lt;/h2&gt;

&lt;p&gt;Returning to HappyHorse itself.&lt;/p&gt;

&lt;p&gt;Will it officially open source? It's impossible to tell right now. The leaderboard scores are there, but weights and code have not materialized. If it ultimately only offers an API service, the impact on the hardware market will be limited—just another powerful closed-source model.&lt;/p&gt;

&lt;p&gt;How large is it actually? The marketing page claims 15B; if true, a single consumer GPU can run it. But if it's actually larger, requiring multi-GPU setups or even clusters, then local deployment becomes unrealistic, and we're back to the cloud provider model.&lt;/p&gt;

&lt;p&gt;Different answers to these two questions lead to completely different storylines. But regardless of how HappyHorse turns out, the trend of text-to-video moving local won't change. Tools like ComfyUI and WebUI are waiting for a good enough open-source model; the quantization community is waiting too. Once it arrives, the consumer hardware side will get lively.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/happy-horse-video-privatization" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/happy-horse-video-privatization&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>videogeneration</category>
      <category>hardware</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI's Spear and Shield</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Wed, 08 Apr 2026 14:28:32 +0000</pubDate>
      <link>https://dev.to/skyguan92/ais-spear-and-shield-170h</link>
      <guid>https://dev.to/skyguan92/ais-spear-and-shield-170h</guid>
      <description>&lt;p&gt;I've noticed some interesting patterns in recent development work, and plenty is happening in the outside world as well. This era hasn't slowed down—it's still advancing at an exaggerated pace. Here are a few scattered thoughts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding Agents Are Better at Debugging Than Writing Code
&lt;/h2&gt;

&lt;p&gt;Many posts and reports discuss a question: what's the biggest problem when using coding agents to write code? The answer is bugs.&lt;/p&gt;

&lt;p&gt;But from hands-on experience, I think this gets it backwards.&lt;/p&gt;

&lt;p&gt;Current coding agents actually perform better at debugging and troubleshooting than at writing code. The reason isn't complicated: debugging has clear objectives, is usually reproducible, and can be broken down step by step for verification. This is work AI handles quite smoothly, and much faster than humans.&lt;/p&gt;

&lt;p&gt;The real challenge is asking it to implement something complete from scratch—especially when you're trying to build a product.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deep Waters of Productization
&lt;/h2&gt;

&lt;p&gt;Traditional software development has so many processes—unit testing, integration testing, stress testing, canary releases, Alpha, Beta—not because people love bureaucracy, but because once software faces real users, it exposes problems you couldn't anticipate in the code. Best practices reduce the probability of issues, but can't eliminate them entirely. Only time and production pressure can force problems to surface and be resolved one by one.&lt;/p&gt;

&lt;p&gt;This challenge applies to coding agents as well.&lt;/p&gt;

&lt;p&gt;Building a small component from scratch is fast. Prototyping feels great. But as the product matures and the codebase grows, problems emerge: the larger the project, the easier it is for AI to break things, and the cost of context understanding visibly increases. This follows the same logic as humans maintaining large projects—iteration difficulty naturally increases after a product ships, requiring team division to manage. AI agents can't escape this rule either.&lt;/p&gt;

&lt;p&gt;So I think the current state is this: concept validation is fast and satisfying. But turning a concept into a product requires deep thinking and verification at every step—you can't skip any of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creativity-Driven Open Source
&lt;/h2&gt;

&lt;p&gt;But there's one category where AI genuinely excels.&lt;/p&gt;

&lt;p&gt;Recently, Milla Jovovich—star of &lt;em&gt;The Fifth Element&lt;/em&gt; and the &lt;em&gt;Resident Evil&lt;/em&gt; series—spent several months working with engineer Ben Sigman to build MemPalace, an open-source AI memory system using Claude Code. Pushed to GitHub on April 5th, it hit 7000+ stars within 48 hours, now exceeding 22,000.&lt;/p&gt;

&lt;p&gt;On the LongMemEval benchmark, MemPalace achieved 96.6% R@5, far exceeding paid solutions like Mem0 and Zep at around 85%. It runs entirely locally, uses ChromaDB + SQLite, MIT license, completely free.&lt;/p&gt;

&lt;p&gt;AI memory is indeed a major focus this year. But MemPalace's strength isn't complexity—quite the opposite. It wins through creativity: pick a single target, like one benchmark, and figure out how to hit it well.&lt;/p&gt;

&lt;p&gt;This model is particularly suitable for AI assistance. The more focused the problem, the more it relies on ideas rather than engineering effort, the faster AI helps you validate. There are more and more such projects in the open-source world, which I find to be a fascinating direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  No Software Is Secure
&lt;/h2&gt;

&lt;p&gt;The biggest news these past two days is Anthropic's official announcement of Project Glasswing.&lt;/p&gt;

&lt;p&gt;Their next-generation model, internally codenamed Mythos, was exposed prematurely by an internal data leak at the end of March (a CMS configuration issue accidentally exposed roughly 3000 internal documents), then officially confirmed on April 7th. Its software-security capabilities have reached a level Anthropic doesn't dare release publicly.&lt;/p&gt;

&lt;p&gt;Previous models could find vulnerabilities—this is old news in the industry. But converting vulnerabilities into usable attack methods is a completely different matter. Mythos combines these two steps.&lt;/p&gt;

&lt;p&gt;Anthropic's disclosed data is alarming: Mythos discovered thousands of zero-day vulnerabilities across all major operating systems and browsers, including a bug hidden in OpenBSD for 27 years. Vulnerabilities that might have gone undiscovered for decades are now surfaced by a model, and can be directly weaponized into attack tools.&lt;/p&gt;

&lt;p&gt;Basically, no software is safe against this model.&lt;/p&gt;

&lt;p&gt;Anthropic's assessment is that this model cannot be publicly released. They contacted roughly 45 companies—including Apple, Google, Microsoft, Nvidia, AWS, plus CrowdStrike, Palo Alto Networks, Cisco, Linux Foundation, and others—allowing them early access to Mythos to harden their systems. The logic is straightforward: before it becomes a spear, let it serve as a shield.&lt;/p&gt;

&lt;p&gt;OpenAI isn't having an easy time either. GPT-5.4 became the first general-purpose model rated as "High Cybersecurity Risk" by OpenAI's own Preparedness Framework. From GPT-5 to GPT-5.4, the model's score on CTF (Capture The Flag) competitions jumped from 27% to 76%. OpenAI chose to add a layer of safety protections and release anyway—a different approach from Anthropic's, but facing the same problem: model attack capabilities are growing exponentially.&lt;/p&gt;

&lt;p&gt;I suspected this was happening. With Mythos's release, it's basically confirmed. And this isn't just about software—when something in a new dimension develops at speeds completely beyond expectations, many supporting structures fall out of alignment. Regulations can't keep up, organizations can't keep up, security systems can't keep up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the Framework First
&lt;/h2&gt;

&lt;p&gt;These events have also influenced my thinking about product development.&lt;/p&gt;

&lt;p&gt;We've been discussing internally whether some product positioning is too aggressive—for example, designs that let AI fully autonomously manage certain processes. If the models aren't smart enough yet and always require human intervention, then this design doesn't hold up at present.&lt;/p&gt;

&lt;p&gt;But from another angle, perhaps product design should run slightly ahead of the models.&lt;/p&gt;

&lt;p&gt;This is how Anthropic builds products. They build Chrome extensions and Excel plugins internally—start with an idea, set up the scaffolding, then throw each new model generation at it to see what it can do. Wait, keep waiting, and one day you realize it's almost there; then invest heavily in productization and release.&lt;/p&gt;

&lt;p&gt;If you design products based on current model capabilities, they'll likely be obsolete by launch. Instead, it's better to be slightly more aggressive: think through the architecture first, wait for the engine to arrive, and the whole thing naturally comes together. Think of it, build it, then wait.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Game Continues
&lt;/h2&gt;

&lt;p&gt;One final piece of good news.&lt;/p&gt;

&lt;p&gt;Zhipu's GLM-5.1 officially went open source in early April: MIT license, weights fully public. It scored 58.4 on SWE-Bench Pro, surpassing GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. And they simultaneously raised prices by 10%—with the entire industry locked in a price war, moving against the current makes this itself quite interesting.&lt;/p&gt;

&lt;p&gt;In the open-source game, no one has retreated yet. Good to see.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/ai-spear-and-shield" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/ai-spear-and-shield&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>thoughts</category>
    </item>
    <item>
      <title>Model Companies' Endgame Is Becoming Cloud Companies</title>
      <dc:creator>guanjiawei</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:26:50 +0000</pubDate>
      <link>https://dev.to/skyguan92/model-companies-endgame-is-becoming-cloud-companies-gpb</link>
      <guid>https://dev.to/skyguan92/model-companies-endgame-is-becoming-cloud-companies-gpb</guid>
      <description>&lt;p&gt;People often ask: How do open-source models make money?&lt;/p&gt;

&lt;p&gt;Actually, reframe the question and it becomes clearer: how does open-source software make money? Hosted cloud services. Redis is open source; Redis Cloud makes money. MongoDB is open source; Atlas makes money. Models fit this path even better than traditional open-source software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/open-source-deepseek-moment"&gt;The previous post&lt;/a&gt; discussed how the open-source community has changed over the past three years; this one talks about money—how the business behind open-source models actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Sides of the Same Coin
&lt;/h2&gt;

&lt;p&gt;Look at the current global landscape: cloud providers are desperately building models, while model companies are desperately buying compute.&lt;/p&gt;

&lt;p&gt;Google's 2025 capital expenditure exceeds $90 billion, centered around self-developed Gemini and self-developed TPUs. Microsoft, partnered with OpenAI, has poured $80 billion into building AI data centers. Amazon has invested over $100 billion to expand AWS compute, while also investing $4 billion in Anthropic. The three companies' combined capital expenditure in 2025 alone exceeds $300 billion, with the vast majority going to AI.&lt;/p&gt;

&lt;p&gt;Now look at the model companies. OpenAI's 2025 revenue exceeds $20 billion, primarily from APIs and subscriptions—essentially selling inference compute. Anthropic signed a usage agreement with Google Cloud for millions of TPUs worth tens of billions of dollars, while simultaneously running over 500,000 Trainium chips on AWS. Tell me, is this a model company or a cloud company?&lt;/p&gt;

&lt;p&gt;Both sides are becoming the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Numbers on DeepSeek
&lt;/h2&gt;

&lt;p&gt;DeepSeek demonstrated this perfectly.&lt;/p&gt;

&lt;p&gt;R1 was released on January 20, 2025, just ahead of the Spring Festival holiday. Six days later, the app hit #1 on the US iOS download charts, simultaneously topping the charts in 52 countries. January saw over 14 million downloads, with monthly active users approaching 100 million by April. No ad spend, no marketing campaigns: customer acquisition cost was essentially zero.&lt;/p&gt;

&lt;p&gt;The API pricing was aggressive too. R1 is priced at $0.55 per million input tokens; OpenAI's comparable o1 is $15. That's roughly 3.7% of OpenAI's price, far more aggressive than the "one-fifth of OpenAI's price" people were talking about. Many said this was selling at a loss for publicity, impossible to make money.&lt;/p&gt;

&lt;p&gt;At the end of February, DeepSeek held an "Open Source Week," releasing five underlying optimization technologies over five days: FlashMLA, DeepEP, DeepGEMM, DualPipe, and 3FS—from attention decoding to matrix operations to pipeline parallelism to distributed file systems, all infrastructure-level components built in-house. DeepGEMM's core code is only 300 lines, yet outperforms expert-tuned kernels. Only then did people realize how much work this company had done at the infrastructure level.&lt;/p&gt;

&lt;p&gt;Then on March 1st, DeepSeek released a set of data: calculating based on H800 rental prices at $2/hour, the daily inference GPU cost for the V3 and R1 models was approximately $87,000. If all traffic that day were billed at R1's pricing, theoretical daily revenue would be approximately $562,000. The theoretical cost-profit ratio: 545%.&lt;/p&gt;
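&lt;p&gt;The arithmetic behind these figures is easy to check. A quick sketch using the rounded numbers quoted in this post (DeepSeek's own disclosure used more precise figures, so this lands near, not exactly on, the reported 545%):&lt;/p&gt;

```python
# Rounded figures as quoted in the post; DeepSeek's disclosure used
# more precise numbers, so this lands near (not exactly on) 545%.
r1_price_per_m_input = 0.55          # USD per million input tokens
o1_price_per_m_input = 15.00         # USD per million input tokens
daily_gpu_cost = 87_000              # USD, V3/R1 inference at $2/hr H800 rental
theoretical_daily_revenue = 562_000  # USD, if all traffic billed at R1 pricing

price_ratio = r1_price_per_m_input / o1_price_per_m_input
cost_profit_ratio = (theoretical_daily_revenue - daily_gpu_cost) / daily_gpu_cost

print(f"R1 price as a share of o1: {price_ratio:.1%}")            # 3.7%
print(f"theoretical cost-profit ratio: {cost_profit_ratio:.0%}")  # 546% on rounded inputs
```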

&lt;p&gt;Of course, this 545% needs to be discounted. DeepSeek themselves said—the web interface and app are both free, V3 is priced lower than R1, and there are off-peak discounts; actual monetizable traffic is far less than total traffic. This number only counts inference GPU rental fees, excluding training costs, R&amp;amp;D investment, and personnel expenses. The actual total R&amp;amp;D cost for V3 is estimated by the industry to be between $500 million and $1.6 billion.&lt;/p&gt;

&lt;p&gt;But the 545% itself isn't the point. The point is: running inference on the same open-source model at this pricing, anyone else would likely lose money. DeepSeek's extensive infrastructure-level optimization means that at the same price, they still make money. Pricing power stays with the original manufacturer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flywheel Spins
&lt;/h2&gt;

&lt;p&gt;What's the most headache-inducing thing about the cloud business? Everyone sells roughly the same thing: bare metal with a thin layer of services on top, so gross margins get squeezed quickly. AWS's operating margin is roughly 33% to 38%, already the ceiling. Google Cloud went from years of losses to around 30%. Smaller cloud providers have even thinner margins and highly concentrated customers; when big clients squeeze you on price, you have no leverage. No matter how much you invest in underlying technology, it's hard to translate that into differentiation customers can perceive.&lt;/p&gt;

&lt;p&gt;Add a model layer and things change. Suppose I operate a large-scale inference cluster; if I improve model efficiency by 10%, the same hardware produces 10% more tokens. Take that extra profit and invest it in R&amp;amp;D, further optimizing inference efficiency, further driving down unit costs, then you can attack the market with lower prices, attract more users, drive up compute utilization, and profits increase again. Then invest more in R&amp;amp;D.&lt;/p&gt;
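&lt;p&gt;As a toy model of that cycle (the 10% gain per cycle is an assumed number for illustration, not anyone's reported figure), the compounding looks like this:&lt;/p&gt;

```python
# Illustrative flywheel: each optimization cycle raises inference
# efficiency, so the same hardware yields more tokens per dollar.
# The 10% gain per cycle is an assumption, not a reported figure.
tokens_per_dollar = 1.0  # normalized starting output
for cycle in range(1, 4):
    tokens_per_dollar *= 1.10  # assumed 10% efficiency gain per cycle
    print(f"after cycle {cycle}: {tokens_per_dollar:.3f}x tokens per dollar")
```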

&lt;p&gt;The old cloud business didn't have this cycle. You spent a lot of money on technical improvements, but customers couldn't perceive them. Models are different—inference efficiency optimizations directly translate to money: either lower costs, or more output at the same cost.&lt;/p&gt;

&lt;p&gt;In 2025, Google increased capital expenditure from $75 billion to $93 billion, most of it going to AI infrastructure. What they're seeing is this shift: the model layer gives the cloud business real technical leverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Is the Most Efficient Customer Acquisition
&lt;/h2&gt;

&lt;p&gt;Why not just sell closed-source? Because you can't tell how good a model is just by looking at benchmarks.&lt;/p&gt;

&lt;p&gt;Llama 4 is a cautionary tale. In April 2025, Meta released Llama 4 Maverick, submitting it to LMArena where it ranked #2. It was quickly discovered that the version submitted to the leaderboard was a specially tuned "experimental version"—responses were exceptionally long, full of emojis, with fancy formatting—all tricks to game the scores. When the publicly released standard version was retested, it ranked #32. Later, when Yann LeCun left Meta, he personally admitted that "results were fudged." Zuckerberg lost confidence in the entire GenAI team, and the LLaMA series essentially exited the open-source community stage.&lt;/p&gt;

&lt;p&gt;Benchmarks can be gamed; user experience cannot.&lt;/p&gt;

&lt;p&gt;When a model is open-sourced, everyone can run it and test it. You know within minutes whether it's any good. This process builds word-of-mouth and creates stickiness. When I was helping friends set up AI client tools, someone pulled out a DeepSeek API account they had registered six months ago to connect. In that scenario, DeepSeek wasn't actually the optimal choice, but it felt convenient—they had registered, used it, and already had trust. Developers are similar: if they built a project using a particular model's API before, they'll probably use it for the next project too. Switching has costs; rebuilding trust costs even more.&lt;/p&gt;

&lt;p&gt;DeepSeek's strategy is to combine open-source models with a free app. Developers test the open-source models; regular users test the app—building awareness on both fronts simultaneously. Some naturally convert to paid API users. Customer acquisition cost approaches zero, with broader reach than spending hundreds of millions on marketing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not All Models Fit This Path
&lt;/h2&gt;

&lt;p&gt;This logic works smoothly for language models. DeepSeek R1 has hundreds of billions of parameters—you can't run it without a GPU cluster. If you want to use it, there are two paths: the free app or the paid API. Either way, the traffic is on their cloud. The capability gap between large and small models is significant, so users naturally gravitate toward the cloud.&lt;/p&gt;

&lt;p&gt;Text-to-image is different. Open-source models like Stable Diffusion and FLUX can run on a single gaming GPU. The barrier is so low that individual users can deploy them at home. If the gap between large and small models isn't that significant, the market fragments—large numbers of users choose to run locally, and cloud demand isn't as concentrated.&lt;/p&gt;

&lt;p&gt;Text-to-image and text-to-video have another push factor: involving image and video content, they naturally face more moderation and regulatory constraints. Cloud services must implement content filtering, but these constraints largely don't exist when running locally. This also pushes some users toward local deployment.&lt;/p&gt;

&lt;p&gt;So whether this open-source business logic works depends on two things: whether the capability gap between large and small models is large enough, and whether the barrier to local deployment is high enough. Language models currently satisfy both conditions. Text-to-image falls somewhat short. Text-to-video is still in flux, so it's hard to say.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Cloud
&lt;/h2&gt;

&lt;p&gt;Either way, commercializing models means tying them to cloud services. I think this is actually a good thing.&lt;/p&gt;

&lt;p&gt;The old cloud was just selling resources and competing on price; no matter how much you spent on technology, it was hard to differentiate. With models added to the mix, things change: technical investment can directly reduce inference costs and create pricing headroom. Companies that do this well can actually make good money.&lt;/p&gt;

&lt;p&gt;The endgame for model companies is probably becoming cloud companies—not the kind that sells bare metal, but a new type of AI cloud that sells intelligence.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://guanjiawei.ai/en/blog/model-company-endgame-cloud" rel="noopener noreferrer"&gt;https://guanjiawei.ai/en/blog/model-company-endgame-cloud&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>business</category>
      <category>thoughts</category>
    </item>
  </channel>
</rss>
