DEV Community: guanjiawei

AI Transformation Doesn't Come from Training

guanjiawei — Sat, 16 May 2026 13:19:33 +0000

Lately, when AI agents come up in conversation with friends, I've fallen into a habit. I pull out my phone, remote into my computer, and show them the agents I've had running over the past 24 hours. One has been autonomously chasing a goal for over ten hours straight. Another is running experiments and tuning parameters.

Their reactions are pretty much always the same: "Oh, so it's already at this stage. That's not what I pictured."

What they say next is the interesting part.

1. "Help Me Explain This to My Boss"

After watching, the first thing friends say usually goes like this:

"Can you come explain this to our boss?"

"I want to bring our tech lead over to see this."

"Can you give us a training session?"

Fair enough. You think this is important, and you want to bring in the people who need to see it. Good instinct.

But looking back at how AI agents actually spread through our own company, real change never came from a single class or presentation.

It started with someone who just did it themselves. They created something in their own work that made people around them do a double take.

A salesperson who suddenly talks like an engineer, while closing deals faster than ever. An admin or HR person who turns out to be doing technical work and marketing, shipping product-grade work from a role that never used to do that. People around them start to wonder. Why don't you seem like the same salesperson, the same admin anymore?

At that point, the curious ones show up naturally. Colleagues, bosses, friends. The change is happening right beside them, they can see it, and only then do they actually absorb what you're saying. Then it spreads from you to the next colleague, and the next, and out from there.

To be honest, trying to drive change by "getting the boss to sit through a lesson" rarely works. Unless that boss personally got their hands dirty on day one. Because right now, knowing what AI can and can't do comes entirely from bumping up against its boundaries yourself, not from hearing about them.

The data backs this up. A BCG report from early 2025 said 75% of executives rank AI as a top-three priority, but only a quarter have actually captured significant value. McKinsey put it more bluntly: 70% of employees skip their company's formal AI training videos entirely, learning instead by tinkering and word of mouth.

Training can only convey so much. What's scarier is that someone who hasn't deeply used AI themselves, if they go on to set policy, easily falls into one of two extremes. Either they fantasize that AI can do anything, piling on unrealistic KPIs that make their team's life miserable while they think it's all simple. Or they dismiss it entirely—"another bubble, here we go again"—and miss the real window.

So the first misconception, and I think the biggest: don't start by trying to change others. Start with yourself.

2. A PhD-Level AI Writing Weekly Reports

The second thing really strikes me as a shame.

A lot of top companies give their employees excellent AI infrastructure. The best models, unlimited usage, loose policies. But most people, once they get access, instinctively reach for the most routine tasks. Meeting summaries. Reports. Weekly and monthly updates. And then they stop.

I'm not saying those tasks don't matter; AI really is useful for them. But stopping there is a waste.

If you look deeper along the company's value chain, at the most painful links, whether that's marketing, sales, the product itself, or R&D, couldn't AI do something there too? You don't have to be an expert in that domain, but your industry understanding plus AI's execution ability could let you build something at those nodes.

Think about it. A PhD-level AI told to write weekly reports will dutifully write weekly reports. It does what you assign. But tell it to research cutting-edge math, biology, or medicine, to run experiments and work through deductions, and it does that well too. One's a clerk, the other's a scientist. The gap is massive.

Worklytics data says that within an organization, truly deep AI power users probably account for only 20–30%. The rest hold the exact same tools and use them only for the shallowest tasks. A BCG report from October 2025 also noted that 74% of enterprises get stuck when trying to expand AI adoption. It's not that the tools don't work. It's that the users only touch one corner of them.

3. Long-Term Without Short-Term Is Unsustainable

This one is harder to spot than the first two.

After using AI for a while, a lot of people go through an emotional arc. At first they're amazed: "This is so powerful." Then they gradually shift to: "What exactly should I do?" The directions seem plentiful, all viable, but deciding specifically what to do and how to keep going is actually the hardest part.

I've fallen into this trap myself.

AI agents can do remarkable things, but they don't grant wishes. For some bigger directions, agents still burn through massive amounts of tokens and take forever. They need round after round of experimentation to explore and tune before they might yield results. They might not yield anything at all. You're at the boundary of knowledge, and probing forward was never easy. If you bet everything on projects like that, it's easy for your enthusiasm to fizzle out. You work for ages without seeing results, and when people ask what you're doing, you can't really explain it.

So you need a mix.

Short-term things with fast positive feedback. My shortest feedback loop comes from working on my digital identity. Optimizing my website for SEO and having people find me through search. Writing blog posts and having readers get something out of them and want to share and engage. In between, I do small AI projects for friends. Helping a friend with a crawfish business. Making games for people. All of them show results quickly.

Mid-term, you need products that accumulate. The AIMA system, for example. When I show it to potential partners, some are willing to install it and promote it. That's a sturdier kind of positive feedback than "I ran an experiment."

And those deep, long-term explorations in the trenches keep running quietly in the background.

Kotter's eight-step change model has a step called "Generate Short-Term Wins." Same idea. Short-term results sustain confidence, giving you the nerve to keep chewing on hard problems. If the process also brings in some revenue to cover the token costs, the positive loop gets even stronger.

4. Prompt Engineering Is Yesterday's News

The last one, and I think a lot of people are still stuck here.

When people talk about using AI, they still fixate on prompts, thinking they need to master prompt engineering.

That was fine two years ago. Not anymore.

Give today's models a goal and a couple of sentences, and they'll go execute complex tasks. Prompts stopped being the bottleneck a while ago.

The bottleneck is the harness: how to build an environment where the agent can actually get work done.

What you need to think about has changed. How do you design the document structure of the working directory? How do you give it machines for experiments? When do you check if it's gone off track? When should you have it pivot direction or change methods? How do you do periodic summaries and archiving?

In early 2025 Karpathy coined the term "vibe coding," casually using natural language to have AI write code, very freeform. A year later, looking back, he said the industry had moved from vibe coding to "agentic engineering," with value shifting up from syntax and implementation to judgment, taste, and management capability. Shopify's Tobi Lutke offered another term, "context engineering." It's not about how to write a good prompt, but about how to fill the agent's context window with the right information.

At the end of the day, AI is a digital employee. When you work with an employee, you don't think the most important thing is crafting their first email, right? That email is a tiny piece. What you really need to figure out is how to set up a proper work environment and guidance that leverages your sense of direction and their execution power, while steering clear of the mistakes they're prone to make.

Shift your thinking from "how to write one good sentence" to "how to manage a digital employee," and collaborating with AI feels completely different.

Looking back, these four points are really one thing.

Start doing it yourself. Don't wait for others. Once you do, don't stay in the comfort zone. Look deeper along the value chain. Set your own rhythm so short-term feedback never dries up. And shift your attention from prompts to environment and collaboration.

The change you create doesn't need pushing. It spreads on its own.

References

BCG, From Potential to Profit: Closing the AI Impact Gap, January 2025.
McKinsey, Superagency in the Workplace: Empowering People to Unlock AI's Full Potential at Work, 2025.
Tobi Lutke (Shopify CEO), Internal Memo on AI Usage Expectations, April 2025.
Andrej Karpathy, Sequoia AI Ascent 2026: From Vibe Coding to Agentic Engineering, April–May 2026.
Tobi Lutke & Andrej Karpathy on "Context Engineering," 2025.
BCG, The Widening AI Value Gap, October 2025.
Worklytics, AI Adoption Benchmarks 2025, Q3 2025.
McKinsey, The State of AI in 2025, March 2025.
John P. Kotter, Leading Change: Generate Short-Term Wins.

Original link: https://guanjiawei.ai/en/blog/ai-transformation-not-from-training

Two Generations Was All It Took

guanjiawei — Fri, 15 May 2026 03:43:43 +0000

Yesterday I watched the footage of Trump's state visit to China, and honestly it hit me. Red carpet, military band, state dinner. Musk, Jensen Huang, Tim Cook all came along, even Defense Secretary Hegseth. First time a US president visited China in nearly nine years.

Think about who this is. The president of the most powerful country on earth, bringing some of the biggest names in tech, sitting down to talk. Not coming to lecture. Coming to negotiate.

Everyone knows Trump's style. With countries he considers weaker, he doesn't even bother pretending — might makes right. These past few years, the way many world leaders have looked standing next to him has been, frankly, painful to watch. Some of it bordered on comical.

But watch him in China. Completely different person. Polite, restrained, saying nice things.

Why? Because you're strong enough. Weak countries in today's world have no dignity to speak of.

Stand first

When the People's Republic was founded in 1949, how bad was it? A century of getting beaten from every direction. Foreign powers, the Japanese, civil war. The country had nothing left.

But the first priority wasn't getting rich. It was making sure nobody could beat you again.

The Korean War broke out in 1950. When Chinese volunteers crossed the Yalu River, the country's per capita GDP was a few dozen dollars. The Americans had tanks, artillery, fighter jets. China didn't even have an air force, and logistics were basically nonexistent. Under those conditions, over a million troops went in, fighting on guts and willingness to die, and pushed the front line back to a ceasefire.

Over a hundred thousand never came home.

Then came the Two Bombs, One Satellite program.

First atomic bomb in 1964. Hydrogen bomb test in 1967, just 32 months from fission to fusion, the fastest of any nuclear state. "Dongfanghong-1" satellite launched in 1970. China became the fifth country to independently put a satellite in orbit.

Of the 23 scientists honored for this, 10 had studied in America, 6 in Britain, others in France, Germany, the Soviet Union. They finished their studies abroad and came back. Back to a country that had nothing. In the chaos of the Great Leap Forward and the Cultural Revolution, they built world-class strategic technology.

What made it extraordinary? The window.

Try building nuclear weapons today. You can't. The treaties killed that. The window closed. Those scientists, working with raw brilliance and pure stubbornness on barren ground, grabbed it while it was still open.

From poor to prosperous

The first thirty years solved the "can't be beaten" problem. Next: "can't eat."

Deng Xiaoping's Southern Tour in 1992. Reform and opening up had nearly stalled, conservative forces were gaining ground. An 87-year-old man, instead of arguing with bureaucrats in Beijing, went straight to Wuhan, Shenzhen, and Zhuhai and said one line: "Development is the only hard truth."

GDP growth went from 3.9% in 1990 to 14.3% in 1992. That same year, the 14th Party Congress formalized the "socialist market economy."

I was born right at that inflection point. For as long as I can remember, China was growing. My generation got lucky — never went hungry, never lived through a war.

There's a concept in economics called the middle income trap. The World Bank studied it: out of 101 middle-income economies since 1960, only 13 made it to high income. South Korea, Taiwan, Hong Kong, Singapore. That's the short list.

Now it's China's turn.

Per capita GNI in 2024 was roughly $13,500. The World Bank's high-income line is $14,005, a 4% gap. Probably cleared within a year or two. A 1.4-billion-person economy crossing that line. Never happened before.

Why do some countries get stuck? It's not a shortage of smart people. Some countries have plenty of brilliant minds. But the talent leaves and doesn't come back. Society itself is too fractured, no stable foundation to channel all that energy into something coherent. Having a big population and having a deep talent pool are different things. Look at India and Brazil.

The AI parallel

Back to windows of opportunity. AI works the same way.

I wrote a piece before about how the AI industry is already in wartime. Look at global AI today. The only two players that actually compete at the frontier model level are the US and China.

Stanford's 2026 AI Index has some interesting numbers. The top US model leads the top Chinese model by just 2.7%. DeepMind's CEO Hassabis himself said the gap is only "a few months."

But there's another number that's even more interesting: US private AI investment totaled $285.9 billion. China's was $12.4 billion. A 23x spending gap producing less than a 3-point performance gap. So who's more efficient?

Europe has Mistral, valued at €11.7 billion and growing. But on the frontier model leaderboards, the gap between Mistral and the US-China top tier is clear. Everywhere else isn't even in the conversation.

Why can only the US and China compete?

I think the answer is the same as why those scientists pulled off Two Bombs, One Satellite some sixty years ago. Stable environment, sustained investment in education, deep enough talent base, and making the right calls when the window was open. China now holds close to 70% of global AI patents, and leads in research output and industrial robot deployment.

Foundation, environment, timing. Take away any one and it falls apart. Same logic as sixty years ago.

None of this was guaranteed

From 1949 to 2026. Seventy-seven years. About two generations.

Zoom in a bit. My parents' generation, thirty, forty years ago, still going hungry. One generation before that, Li Hongzhang signing the Treaty of Shimonoseki after the Sino-Japanese War, ceding Taiwan and the Liaodong Peninsula. Then the Boxer Protocol after the Eight-Nation Alliance. World War II ended, China was one of the victors, and its territory was still carved up at will.

Less than a century later, this country stands at the dead center of the world stage, sitting across the table from the most powerful nation on earth as equals.

Flip through history. Britain's rise via the Industrial Revolution took the better part of a century. Germany after unification, decades. Japan from the Meiji Restoration to genuine great-power status, about the same. And all of them started from a much better position than China did.

Watching yesterday's footage, it's worth stopping to think about what it actually took. People back then carrying millet and rifles, owning nothing, trading their lives for the space to survive. Scientists building world-class technology from absolutely nothing. Then generation after generation of ordinary people grinding it out until we got here. Sitting comfortably, eating whatever we want, drinking whatever we want, living with dignity.

None of this fell from the sky.

From standing up, to getting prosperous, to sitting at the center of the world while it falls apart around you. Two generations. That's all it took.

So what about us? What does our generation do next?

References

Trump's 2026 State Visit to China — May 13–15, 2026, first US presidential visit to China in nearly nine years; Elon Musk, Jensen Huang, Tim Cook, and Defense Secretary Hegseth accompanied
Who Was on Trump's Plane to China (PBS) — delegation included multiple tech CEOs and the Secretary of Defense
Trump–Xi Beijing Summit Trade Talks (CNBC) — both sides reached "generally balanced and positive outcomes"
Deng Xiaoping's Southern Tour — Jan–Feb 1992, visited Wuhan, Shenzhen, Zhuhai, Shanghai; GDP growth surged from 3.9% to 14.3%
China in the Korean War — 1950–1953, China deployed over 1 million volunteers under extreme material disadvantage
Two Bombs, One Satellite — atomic bomb 1964, hydrogen bomb 1967, satellite 1970; 23 honored scientists
China's 32 Months from A-Bomb to H-Bomb (Bulletin of the Atomic Scientists) — the shortest fission-to-fusion timeline of any nuclear state
The Middle Income Trap and China (CEPR) — World Bank high-income threshold $14,005; only 13 of 101 middle-income economies since 1960 successfully crossed
China's Per Capita GNI Approaching High-Income Threshold (SCMP) — ~$13,500 in 2024, ~4% gap
Stanford 2026 AI Index Report — US–China AI performance gap narrowed to 2.7%; China holds ~70% of global AI patents
DeepMind CEO: US–China AI Gap Is Only "Months" — Hassabis's assessment of the US–China AI gap
US–China AI Investment Gap (Morgan Stanley) — US private AI investment $285.9B vs China $12.4B
Treaty of Shimonoseki — signed 1895 after the First Sino-Japanese War by Li Hongzhang, ceding Taiwan, Penghu, and the Liaodong Peninsula
Boxer Protocol — signed 1901 after the Eight-Nation Alliance; Li Hongzhang was China's signatory and died shortly after

Original link: https://guanjiawei.ai/en/blog/two-generations

One Hour for the Demo, Three for the Production Line

guanjiawei — Thu, 14 May 2026 08:17:04 +0000

You often see people online saying that in the AI era, reliability matters most.

The first time I saw it, it sounded like a tired cliché. Every era gets assigned its own buzzword; "intelligence" and "execution" have already had their turn. Does "reliability" actually fit the AI era any better? Not really.

A few recent projects finally drove the point home.

Three Characters a Day, Thousands of Assets Over a Hundred Days

I made a PAW Patrol reading game for my son. Three characters a day, three hundred over a hundred days. The number looks small.

The thing is, each of those hundred days is an independent mini-level. Every day needs about six or seven images and dozens of audio clips. The voices are cloned from a PAW Patrol character, each line matching a specific script. Add it up across a hundred days and you're looking at thousands of assets, easy.

The day-one demo worked. My son and I sat there playing for twenty minutes, having fun. That's exactly where the problem started.

I thought the rest was just running that demo ninety-nine more times. Turns out the real thing and the demo are two completely different beasts.

I randomly picked two clips from the first batch of twenty. Both were bad. One dropped two characters; the other's emotion completely mismatched the line. Would I dare use the other eighteen? I sampled again. Still broken. With thousands of assets across a hundred days, how many would actually be usable? I had no idea.

That feeling of uncertainty is the critical part. It's not a minor issue; it stops you cold.

The Extra Layer Is Called Quality Assurance

Demos work because a human is watching. Generate one, listen, if it's no good try again, pick the best and keep it. The whole process is manual. The human is an invisible QA layer.

To make it fully automated, you have to swap "human in the loop" for "model in the loop." That's QA.

Sounds simple. Doing it opens a whole new world.

One Audio Clip, Scored by Three or Four Models at Once

I started digging into niche models in the industry. The usual suspects are ASR and TTS. ASR understands speech, TTS generates it. But scoring TTS output for quality? There's a whole category of models built just for that.

DNSMOS is from Microsoft. Originally built to score noise-suppression algorithms, it doesn't need the original clean audio as reference; from a single clip it judges how much noise is present and whether the overall result is listenable. Later people found it's also sensitive to TTS artifacts.

NISQA comes from Gabriel Mittag's team at TU Berlin. It includes a NISQA-TTS weight specifically for TTS naturalness. Instead of a single score, it breaks things down into dimensions: noise, coloration, discontinuity, loudness.

UTMOS is from the SaruLab team at the University of Tokyo, winner of the VoiceMOS Challenge 2022, and now the de-facto baseline for TTS scoring. I use it as the outermost backstop.

Finally there's a reverse ASR pass: feed the generated audio through Whisper, compare the transcript to the original script, and reject it if the gap is too big. It's the crudest check, but the most reliable.
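Once a transcript exists, the reverse ASR check is just a string comparison. Here is a minimal sketch, assuming the transcript has already been produced by an ASR model (the Whisper call itself is left out) and that "the gap is too big" means a character error rate above some cutoff. The function names and the 10% cutoff are illustrative assumptions, not the settings used in the project.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so punctuation and spacing are not counted as errors."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if ch.isalnum())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_error_rate(script: str, transcript: str) -> float:
    """Edit distance between script and transcript, divided by the script length."""
    ref, hyp = normalize(script), normalize(transcript)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def transcript_matches(script: str, transcript: str, max_cer: float = 0.10) -> bool:
    """Accept the clip only if the transcript stays within the (assumed) error budget."""
    return char_error_rate(script, transcript) <= max_cer
```

A clip that drops two characters from a six-character line scores a character error rate of about 0.33 and is rejected under this cutoff. That is exactly the kind of failure spot-checking by ear caught in the first batch.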

Add up the four scores; pass the threshold and it's good, fail and it triggers regeneration. I spent a day wiring it up, and the output was clearly better than running TTS alone.
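The gate itself can be sketched as a small retry loop. This is an illustration under stated assumptions, not the actual pipeline: `generate` stands in for whatever TTS call is used, each scorer is a hypothetical wrapper that returns a higher-is-better number, and the scores are summed as described above. A real setup would probably need to bring the scorers onto a comparable scale first.

```python
from typing import Callable, Optional, Sequence, Tuple

# A scorer maps raw audio bytes to a quality score (higher is better).
Scorer = Callable[[bytes], float]


def total_score(audio: bytes, scorers: Sequence[Scorer]) -> float:
    """Sum the individual scores into one number."""
    return sum(scorer(audio) for scorer in scorers)


def generate_with_qa(
    generate: Callable[[int], bytes],
    scorers: Sequence[Scorer],
    threshold: float,
    max_attempts: int = 5,
) -> Tuple[Optional[bytes], int]:
    """Regenerate until the summed score clears the threshold or attempts run out.

    Returns (audio, attempts_used); audio is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        audio = generate(attempt)  # hypothetical: the attempt number doubles as the seed
        if total_score(audio, scorers) >= threshold:
            return audio, attempt
    return None, max_attempts
```

Returning the attempt count alongside the audio makes the retry rate measurable, and the retry rate is the number that ends up driving cost.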

From 1× to 3× Is Not an Exaggeration

But the cost went through the roof.

Before adding QA, I figured it would add maybe 50% more time. Pick out the bad ones and regenerate. Worst case, the new batch is all bad too, so you run it again. At most 1.5×.

In reality, one hour became three.

The reason is that the model simply can't clear a certain line. The same script, different random seeds, seven, eight, nine tries and it still can't pass the QA threshold. Sometimes you have to fall back to changing the prompt, the speaking rate, the emotion tag, just to squeeze it through. Every regeneration is a full model call, burning tokens each time.

Run this in the cloud, billed by the minute or by the call, and racking up a hefty bill in minutes is no exaggeration. I later did a rough calculation: a voice generation task I had planned to run entirely in the cloud would cost roughly 7 to 10× the demo bill once you factor in retry rates.

This was the math I hadn't done: from demo to production line, costs jump by multiples, not percentages.
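A back-of-the-envelope model shows why the 1.5× guess was off. If each attempt passes QA independently with probability p, the number of attempts follows a geometric distribution with mean 1/p, and each attempt also pays for its QA pass. The pass rates and overhead below are made-up inputs for illustration, not measurements from the project.

```python
def expected_cost_multiplier(pass_rate: float, qa_overhead: float = 0.0) -> float:
    """Expected cost per accepted asset, relative to one bare generation.

    pass_rate:   probability that a single attempt clears QA (assumed independent).
    qa_overhead: cost of scoring one attempt, as a fraction of one generation.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return (1.0 + qa_overhead) / pass_rate


# Hypothetical inputs: a 40% pass rate with QA costing 20% of a generation
# already lands at roughly 3x; a 1.5x budget implicitly assumes about 67% with free QA.
optimistic = expected_cost_multiplier(pass_rate=2 / 3)
observed_like = expected_cost_multiplier(pass_rate=0.4, qa_overhead=0.2)
```

The independence assumption is generous. Scripts that fail seven or eight times in a row are hard for the model in a systematic way, so the real tail is heavier than this estimate.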

The Same Thing Happened Again on Another Project

Recently I've been playing with another project called MiroFish.

It's a pretty interesting open-source project by Guo Hangjiang, an undergraduate in China. It hit #1 on global GitHub Trending in March 2026, with investment from Chen Tianqiao of Shanda Group. It generates a large population of agents with personalities, memories, and social relationships, then runs them across two simulation platforms where they discuss, debate, form alliances, and shift opinions. Finally, a ReportAgent summarizes the conclusions of the entire evolution to predict how an event will unfold.

My config wasn't large. About 54 agents per event, across a 20-round timeline. Every round requires every agent to run once. 54 × 20, roughly 1,000 full calls.

I used Kimi K2.6 Thinking. The problem is you can't turn off thinking mode; it thinks before every output. Thousands of thinking tokens per call is normal. Multiply by 1,000, and the token burn hurts.
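The token burn is easy to estimate before launching a run. A rough sketch: the 54 agents and 20 rounds come from the config above, while the 3,000 thinking tokens per call is an assumed figure standing in for "thousands". Input tokens are ignored here and would add to the total.

```python
from typing import Tuple


def simulation_tokens(
    agents: int,
    rounds: int,
    thinking_tokens_per_call: int,
    output_tokens_per_call: int = 0,
) -> Tuple[int, int]:
    """Return (number of calls, generated tokens) for one simulated event."""
    calls = agents * rounds  # every agent runs once per round
    tokens = calls * (thinking_tokens_per_call + output_tokens_per_call)
    return calls, tokens


# 54 agents x 20 rounds; 3,000 thinking tokens per call is an assumption.
calls, tokens = simulation_tokens(agents=54, rounds=20, thinking_tokens_per_call=3_000)
```

Under that assumption a single event generates a little over three million thinking tokens, before counting the visible output or any reruns.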

After a few runs, I started wondering: does this scenario really need a top-tier model?

Each agent, on its turn, just scans the context, says a line or casts a vote based on its persona, then gets aggregated. The intelligence threshold for each call is actually low. Swap in one of last year's models, something around GPT-4o level, and the results are probably similar, only faster.

The Mid-Tier Slot That No One Has Clearly Defined

For the past year, one question has gone unanswered: what scenarios actually need a second-tier model? Everyone races for the best and most expensive, leaving the mid-tier in an awkward spot.

I now see two very specific slots.

First is quality assurance. Judging whether an audio clip sounds natural, whether an image matches the style, whether a conversation stays on topic: these tasks require only mid-tier intelligence. Using a top model here is like using Claude Opus to review GPT-4o's code. It works, but it's not cost-effective. A lightweight vision model plus a specialized scorer like NISQA costs far less than one top-tier call.

Second is large-scale agent simulation. A setup like MiroFish strings together 1,000 inferences to reach a collective evolutionary result. It's not sensitive to the quality of any single call, but extremely sensitive to total cost. The "best model" for this scenario isn't the smartest; it's whatever gives you the best mix of per-token price and inference speed.

These two scenarios hadn't been clearly spelled out because so few people were actually batch-producing content at industrial scale. Once you need to generate thousands of audio clips or tens of thousands of agent inferences, these two slots jump right out.

The Second Reason for Going Local

This is also when I finally understood why local compute matters so much.

The two reasons usually cited for local deployment: speed and privacy. Both are valid, but neither is the most critical.

The real killer is cost per call.

An industrial pipeline is bound to retry heavily. Cloud TTS is billed by duration, token models by the call. Every retry is another invoice. Local is different. A DGX Spark running open-source models like F5-TTS or VoxCPM incurs zero marginal cost beyond electricity. Leave it running for a day and you get enough material for a week. Failed? Run it again, no big deal.

This is the fundamental difference between cloud and local models in industrial scenarios. The former charges by usage; the latter only charges once for hardware. In a high-retry-rate pipeline, that gap gets magnified by orders of magnitude.

The reason local deployment never made sense in past discussions is that everyone compared it to demo costs. A demo TTS run costs pennies. Set that against a local machine costing thousands, and the math never works. But compare it to industrial-scale costs, factoring in retry rates, QA, and agent simulation, and the math flips immediately.
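The flip can be made concrete with a break-even calculation. This is a sketch with placeholder prices, not real quotes: the question is how many accepted assets it takes before the one-time hardware purchase is cheaper than paying the cloud per call, and how retries move that point.

```python
import math
from typing import Optional


def break_even_assets(
    hardware_cost: float,
    cloud_cost_per_call: float,
    attempts_per_asset: float = 1.0,
    local_cost_per_call: float = 0.0,
) -> Optional[int]:
    """Number of accepted assets after which local hardware has paid for itself.

    attempts_per_asset folds in the retry rate; local_cost_per_call covers electricity.
    Returns None if a local call is not cheaper than a cloud call.
    """
    saving_per_asset = (cloud_cost_per_call - local_cost_per_call) * attempts_per_asset
    if saving_per_asset <= 0:
        return None
    return math.ceil(hardware_cost / saving_per_asset)


# Placeholder numbers: a 3,000-unit machine versus 0.5 units per cloud call.
demo_view = break_even_assets(3000, 0.5)                             # one attempt per asset
pipeline_view = break_even_assets(3000, 0.5, attempts_per_asset=3)   # three attempts per asset
```

With these placeholder numbers, tripling the attempts per asset cuts the break-even volume to a third. Retries make every cloud asset more expensive while the hardware bill stays fixed.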

Three Tiers, Three Positions

Writing this, I suddenly realized that industrialized AI content production actually needs three tiers of models running simultaneously.

Top-tier models at the front, handling the hardest generation tasks. Expensive per call, but you don't run them often.
Mid-tier models for QA and agent simulation, handling high-volume, medium-intelligence tasks. Called repeatedly, so each call must stay cheap.
Local models at the bottom doing the heavy lifting. Asset generation, vectorization, transcription, alignment, the grunt work. If it can run locally, don't send it to the cloud.
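In a pipeline, the three tiers above amount to a routing table. A minimal sketch follows; the task names are labels invented here to mirror the descriptions above, and a real system would map each tier to concrete models and endpoints.

```python
from enum import Enum


class Tier(Enum):
    TOP = "top"      # expensive per call, used sparingly
    MID = "mid"      # high-volume, medium-intelligence work
    LOCAL = "local"  # grunt work on owned hardware


# Hypothetical task labels, following the three descriptions above.
ROUTES = {
    "hard_generation": Tier.TOP,
    "qa_scoring": Tier.MID,
    "agent_simulation": Tier.MID,
    "asset_generation": Tier.LOCAL,
    "vectorization": Tier.LOCAL,
    "transcription": Tier.LOCAL,
    "alignment": Tier.LOCAL,
}


def route(task_kind: str) -> Tier:
    """Look up which tier should handle a task; unknown kinds are an error, not a silent default."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"no tier configured for task kind: {task_kind!r}") from None
```

Failing loudly on an unknown task kind is deliberate. A silent default would most likely send new work to the most expensive tier.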

You won't find these three tiers in any official tutorial; the setup is still evolving. But once you actually get your hands dirty with industrial content production, you'll end up piecing together this structure yourself.

Looking back at that opening line that sounded like empty talk, I actually think it understated things. In the AI era, what matters most isn't "reliability" itself; it's the cost curve of reliability. From demo to production line, that curve starts at 3×.

Understand that curve, and you know how to spend money. Otherwise you'll budget for 1× and get a bill for 7×.

Original link: https://guanjiawei.ai/en/blog/from-demo-to-production

Three Questions After the AI Job Wave

guanjiawei — Wed, 13 May 2026 04:35:47 +0000

The drive to hire is weak right now.

Before, when you wanted to build something bigger, your gut reaction was "we need more people." Now it's the opposite: "how do we squeeze more out of who we already have?" Even bringing on interns feels less appealing.

It's a small shift in sentiment, but it points to something that isn't reversing.

Two days ago, Kim Yong-beom, policy chief at the South Korean presidential office, posted on Facebook: the "excess returns" of the AI era shouldn't belong only to individual companies; part should flow back to the public as a "citizen dividend." The next day, the KOSPI plunged 5.1%. He later clarified he wasn't suggesting confiscating profits, only discussing how to spend the "excess tax revenue" created by AI dividends. The market settled.

A single Facebook post moving the market 5% tells you the issue is already on the table.

The context is plain enough. SK hynix posted a 72% operating profit margin in the first quarter; spread across employees, bonuses averaged nearly $500,000 per person. Samsung Electronics' semiconductor division logged 53.7 trillion won in operating profit for the same period, but the 74,000 workers represented by the union received a far smaller slice than their SK counterparts. The Samsung union has threatened an 18-day strike starting May 21. The workers aren't after a slightly larger bonus. They want a respectable cut of the AI dividend chain.

I see this as the middle of three questions AI is really throwing at society. Ahead of it is the unemployment question. Behind it is a deeper one about identity. The three are linked.

Question One: Net Job Loss Is the Pattern

First, let's get the definitions straight.

Mainstream reports don't actually agree on "global net job losses." The World Economic Forum's Future of Jobs Report 2025 still predicts a net increase of 78 million jobs by 2030. The ILO's GenAI exposure index uses "transformation" rather than "replacement": one-quarter of jobs globally are touched by GenAI, one-third in high-income countries. The IMF uses the broadest brush: 40% globally, 60% in advanced economies.

But macro models are neat; actual labor pools are not. When a clerk gets "transformed" into an "AI-collaborating high-efficiency role," the macro report counts that as transformation. For the clerk, it's unemployment. New jobs like AI product managers, data governance specialists, model evaluators, and robot maintenance technicians may not be in the same city, may not suit the same age group, and may not absorb the former customer service reps, copywriters, junior programmers, and admin staff. Someone always gets pushed off along the way.

The current data already shows how painful this transition is. In Challenger's March 2026 U.S. layoff report, AI was the top reason given for cuts: 15,341 people, 25% of all monthly layoffs. Tech layoffs in the U.S. reached 52,050 in Q1, up 40% year over year. Goldman Sachs estimates that over the past year AI has eliminated roughly 25,000 jobs a month while creating fewer than 9,000. Net loss: at least 16,000 a month. Gen Z and entry-level white-collar workers are hit hardest. Research from Stanford Digital Economy Lab also shows that in the jobs most exposed to AI, employment among 22- to 25-year-olds has dropped markedly.

This is what I mean by "net loss." I'm not forecasting global employment in 2030; I'm looking at the real demand for specific occupations, specific age groups, and specific companies right now. When a company realizes that ten people plus AI can do what used to take fifteen, it doesn't first ask a macro model whether new jobs will appear in five years. It freezes hiring, trims headcount, and cuts peripheral roles.

The sectors that traditionally soaked up labor are running in reverse, too. Autonomous vehicles are replacing drivers; drones are replacing delivery riders; industrial robots are replacing line workers. IFR World Robotics data shows global manufacturing robot density doubled in seven years. In China in 2023 there were 470 robots per 10,000 manufacturing employees. New infrastructure no longer naturally brings large numbers of low-barrier jobs the way building bridges and roads once did. Computing centers, ultra-high-voltage grids, battery plants, and dark factories are all capital-intensive and light on labor.

Some pin their hopes on "one-person companies." OPCs have been hyped plenty over the last couple of years, and I do think they'll become a real new organizational form. By June 2025, China had over 16 million one-person limited liability companies; 2.86 million were newly registered in the first half of 2025 alone, up 47% year over year. Shangcheng District in Hangzhou is already piloting OPC community policies.

But judging whether OPCs can carry employment means looking past the headline numbers and anecdotes. Most of those 16 million are traditional self-employed operations and micro-entities that existed long before AI. Two things matter: the growth rate, and what share of that growth can actually sustain middle-class incomes.

The growth rate is eye-catching: 47% year over year. But the distribution is ugly. Industry reports show OPC revenue is extremely long-tail: more than half are still stuck in a product-validation phase earning a few thousand yuan a month; fewer than one in ten steadily clear a million yuan a year. Take the 2.86 million first-half registrations, double them, and round up to about 6 million new OPCs a year. Even at an optimistic 10%, only 600,000 of those would reach middle-class levels.

China's 2025 statistical bulletin puts year-end employment at 725 million. The IMF estimates that about 60% of jobs in advanced economies are exposed to AI. Use a more conservative 30% for China, and that's still over 200 million people. 600,000 versus 200 million: more than two orders of magnitude apart. OPCs will buoy some super-individuals, but they can't hold up the labor market.
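The comparison above is simple enough to reproduce. The sketch below uses only the numbers quoted in the text; the 6 million annual figure is the article's rounding of the first-half registrations, and the 10% and 30% shares are the stated assumptions, not measured values.

```python
import math

new_opcs_per_year = 6_000_000   # rounded up from 2 x 2.86 million
middle_class_percent = 10       # optimistic share reaching middle-class income
employed = 725_000_000          # year-end employment, 2025 bulletin
exposure_percent_china = 30     # conservative share of AI-exposed jobs

# Integer arithmetic keeps the results exact.
opcs_reaching_middle_class = new_opcs_per_year * middle_class_percent // 100
exposed_workers = employed * exposure_percent_china // 100

ratio = exposed_workers / opcs_reaching_middle_class
orders_of_magnitude = math.log10(ratio)

print(opcs_reaching_middle_class, exposed_workers, ratio)
print(round(orders_of_magnitude, 2))
```

Even if the optimistic share were doubled or tripled, the gap would still be more than a hundredfold.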

Why is this shock so sharp? I boil it down to one word: concentration.

SK hynix posted 37.6 trillion won in Q1 operating profit with roughly 35,000 employees company-wide. That's roughly 1 billion won in operating profit per employee for the quarter, over 4 billion won annualized, or more than 20 million RMB per person. Not every employee actually creates that much, but the number makes it viscerally clear how few hands the AI dividend is squeezed into.
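For anyone who wants to check the per-employee figure, here is a rough sketch. The profit and headcount come from the paragraph above; the exchange rate of about 200 won per yuan is my own round-number assumption for illustration, not a quoted rate.

```python
operating_profit_q1_krw = 37_600_000_000_000  # 37.6 trillion won, Q1
employees = 35_000                            # approximate company-wide headcount
KRW_PER_RMB = 200                             # assumed round exchange rate

profit_per_employee_q1 = operating_profit_q1_krw / employees
profit_per_employee_annualized = profit_per_employee_q1 * 4
profit_per_employee_rmb = profit_per_employee_annualized / KRW_PER_RMB

print(f"{profit_per_employee_q1:,.0f} won per employee for the quarter")
print(f"{profit_per_employee_annualized:,.0f} won annualized")
print(f"{profit_per_employee_rmb:,.0f} RMB annualized")
```

The annualized figure simply multiplies one quarter by four, so it assumes the Q1 pace holds all year.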

Xiaomi isn't as extreme, but the direction is the same. In 2025 the group recorded 457.3 billion yuan in revenue; its smart EV and AI innovation business contributed 106.1 billion yuan at a 24.3% gross margin, delivering 410,000 vehicles for the year. Automakers used to compete on production capacity; now they also compete on algorithms, supply-chain software, automated production lines, and data loops. Capacity is scaling up; headcount isn't scaling with it.

Since the Industrial Revolution, every major industrial wave has pulled new job chains along with it. Labor scale and industrial scale mostly moved together. This AI wave runs the other way: the greater the output, the fewer people it needs. Chips, cloud, models, data centers, plus the handful of teams that can push AI to its limits. These swallow most of the dividends.

You can think of it this way: one person out of a thousand, armed with super-productivity, flattens part of the work that the other 999 used to do. Not all jobs are erased, but the demand curve flattens. I don't see any new direction that could regenerate labor demand on that scale in the same window.

Accepting this is a prerequisite for discussing the next two questions. Net job loss isn't a panic slogan; it's a magnitude mismatch already happening in local labor markets.

Question Two: Distribution Needs a Reset

Which brings us to the second question: distribution.

The Korean incident is a template for the distribution question. Once AI dividends concentrate in a handful of companies, who gets them? Shareholders, executives, core engineers, all employees, or society at large through taxes and public spending? This question will inevitably spread from Korea to Japan, Europe, and China.

If concentrated dividends are still distributed under old rules, the outcome is almost predetermined. The bulk goes to shareholders and top management; core employees get hefty bonuses; ordinary workers get fixed salaries; and most people outside the supply chain feel only rising prices and rents, fewer openings, and tougher competition. This structure was already widening inequality; AI only steepens the curve.

The group squeezed hardest isn't the lowest earners. It's the middle class, especially those living on salaries, professional skills, and stable jobs.

The reason is simple. Wages are the most thoroughly taxed form of income; there's nowhere to hide. In China's individual income tax system, comprehensive income faces progressive rates from 3% to 45%. Tax on salaries is withheld at source, and social insurance is deducted before the money ever reaches your hands. High-salary workers should certainly pay tax. The problem is that wealthier people have far more types of income: capital gains, equity incentives, dividends, corporate structures, trusts, family offices, cross-border arrangements. I'm not saying these are illegal. I'm saying they have more choices of tax base and more room to defer.

This structure was already obvious in the internet era; in the AI era it will only get worse, because the core gains of the industry concentrate in fewer hands. The EU Tax Observatory's global tax evasion report also notes that billionaires worldwide face extremely low effective tax rates relative to their wealth. The more mobile wealth becomes across borders, the harder it is for any single country to carry out redistribution alone.

The other side of the middle-class squeeze is how fast they are being replaced. The jobs AI currently hits easiest are middle-class occupations: programmers, designers, customer service reps, junior legal staff, junior analysts, copywriters, translators, operations specialists, admin staff, finance assistants. They shoulder the heaviest taxes and face the fastest replacement. They are the ones hurting most in the current structure.

Back to Korea. The Samsung and SK unions aren't fighting over a one-time bonus; they're fighting over a long-term rule. The companies will only offer a "special bonus." The unions want the profit-sharing ratio locked into a formal agreement that takes effect every year. On the surface it's about the bonus amount. In reality it's about whether this distribution rule will still hold next year.

Using "excess profits" or "excess tax revenue" for redistribution isn't a new framework in itself. Nordic countries have been running this for decades. Denmark's top marginal income tax rate stands at 60.5% in 2026. Sweden, Finland, and Norway have long maintained high labor-tax burdens and extensive public services. The OECD's Taxing Wages also shows that the average tax wedge on labor in European countries is markedly higher than in the U.S. or Korea.

But the AI era introduces a new problem: productivity itself can move.

Heavy-asset, fab-heavy players like Samsung and SK hynix can't move; the Korean government can at least capture some corporate income tax and supply-chain revenue. But the more typical AI business doesn't look like that. Compute is rented in Singapore; the company is registered in Ireland; the team is spread across five time zones; settlements run through global payment networks. Teams of three to five people generating hundreds of millions in revenue will become more common, and nations have far fewer levers to tax them than they had with traditional manufacturing.

So an "AI tax" can't be read as simply slapping higher taxes on a few companies. It's more like a bundle of questions. What is the tax base? Compute, profits, capital gains, data, or the labor costs displaced by robots? And who receives the revenue? Is it poured into new infrastructure, or used to shore up social security, education, health care, pensions, unemployment insurance, even direct cash transfers to residents?

What needs guarding against here is path dependence. Many countries have grown used to propping up the economy with investment and infrastructure, but AI-era infrastructure may not prop up employment. Building more computing centers, data centers, ultra-high-voltage grids, and battery plants will likely continue to raise the productivity of leading firms, benefiting capital and a narrow slice of high-skill jobs, while doing little directly for displaced middle-class and low-income workers.

This is why people at OpenAI have been talking publicly about UBI for years. OpenResearch, funded by Sam Altman, ran a three-year experiment in Texas and Illinois: 1,000 low-income participants received $1,000 per month, alongside a control group of 2,000. The results, published in 2024, weren't miraculous. Recipients worked an average of 1.3 fewer hours per week, had a 2 percentage-point lower employment probability, and saw household income excluding subsidies decline. But they were more proactive in looking for work, valued meaningful work more, had more room to relocate, see doctors, and plan long term, and were more likely to have entrepreneurial ideas.

This experiment matters, not because it proves UBI is right, but because it drags the debate from slogans back to data. Cash doesn't automatically make people stop working, nor does it automatically give them dignity. What it provides is a buffer and choice. For a society with excess productivity and rapid job restructuring, choice itself may be infrastructure.

I don't think UBI is the standard answer for the AI era. But it's one of the few options that has been seriously tested and has data behind it. Compared with patching old rules, it at least offers a different starting point.

Question Three: Where Does Value Come From for People Who Don't Work?

This is the hardest of the three. The first two can still be moved forward with policy, tax systems, and redistribution. This one cannot.

In the Chinese context, "not working" is a very heavy verdict. At family gatherings, when someone asks "What do you do?" the expected answer is an occupation. If you reply, "I don't currently have a job," the atmosphere changes instantly. This isn't just about face.

The National Bureau of Statistics' 2025 bulletin lists 5.95 million urban residents and 33.4 million rural residents on subsistence allowances at year-end. China does have a welfare system. But subsistence allowances and relief still carry stigma in many places. Families who qualify but don't apply have always existed. The reason isn't insufficient money; it's the fear of being whispered about for "living off handouts."

This sense of shame runs deep. Our generation grew up on a narrative that said "Work hard and you'll be rewarded; effort deserves respect." Education, media, and the people around you all tell you the same thing: your value equals your output. Labor is the anchor of identity; salary is the measure of it. I wrote a piece on AI anxiety before, touching on the other side of this. When AI fortune-telling trends and young people flock to mysticism for certainty, what's really happening is that this anchor is loosening.

AI has simply brought this narrative's expiration date forward. What it truly shakes isn't just income. It's the sense of identity. You receive a basic income, food and housing are covered, friends respect you, but you wake up with nothing to look forward to. That hollowness is something policy cannot answer.

A society whose time has been freed by AI doesn't lack welfare distribution points. It lacks a narrative that can give people a new identity. This isn't something engineers can code or models can compute. It requires telling a new story about what kind of person is worthy of respect and what kind of life is decent.

A thousand years ago the story was "study to become an official"; a hundred years ago it was "industry saves the nation"; thirty years ago it was "go into business." Over the last decade or so it was "get into a big tech firm," "buy an apartment," "start a company," "financial freedom." No one can yet give a clean answer for what it should be in the AI era.

Closing

The three questions are linked. The first makes the second urgent. No matter how well the second is handled, the third cannot be avoided.

The market tremor triggered by that May 11 proposal in Korea is only the opening act. I expect these discussions to spread to Japan, Europe, and China in the second half of the year. Every country will craft different answers based on its own politics and culture. AI taxes, UBI, tax-base reforms, new infrastructure, promoting one-person companies—each will have its trial runs. Trial and error itself is part of the answer.

What individuals can actually do isn't complicated: build more skills, keep more capital on hand, and don't let any single narrative sweep away your emotions. What society must do is harder: stop brushing things off with "new jobs will always appear," and stop pushing the unemployed back into shame. No one can answer all three questions at once. We can only walk through them one by one.

References

Original link: https://guanjiawei.ai/en/blog/three-questions-after-jobs

AI Vanguard: 10 Weeks Left

guanjiawei — Tue, 12 May 2026 13:11:13 +0000

A few recent numbers, taken together, are quite telling.

Sam Altman has clearly been energized since the GPT-5.5 launch. Codex weekly downloads surged to 90 million. Paid users climbed from 3 million in March to over 4 million by late April. My own gut feeling lines up with that. Before 5.5, I had one $200 Codex account; now I have four, at $800 a month. Many are saying 5.5 shouldn't be called 5.5—it should be GPT-6. I once posted that "this is not a minor version," and that claim is being validated.

The Anthropic side is even more extreme. Dario Amodei did the math in a CNBC interview last week. From the start, they bet on exponential AI growth and built infrastructure for "10x per year." Q1 growth annualized to 80x. Annualized revenue ran from $9 billion at the start of the year to $30 billion in April. Infrastructure got crushed, so they signed a deal with SpaceX to lease the entire 220,000-GPU Colossus 1 data center in Memphis, unlocking 5-hour usage limits for Claude Pro / Max users.

That's the backdrop.

Against that backdrop, I spent a few hours yesterday talking with a friend. He uses AI heavily, knows his way around Terminal, builds things on open-source frameworks like OpenClaw, and is a power user at his company. A firm offered him a role as an AI transformation expert and he asked what I thought.

After the conversation, I realized a few judgments from it are worth pulling out on their own.

1. When Evaluating an Offer, Look for "Unrestricted Access to the Strongest Models"

I told him that at this particular moment, the offer itself isn't the most important thing. You're already in a good spot.

What truly matters is whether the platform can give you resources for near-unlimited use of the absolute best models, plus enough freedom to tinker in whatever direction you want.

What do I mean? Someone like me, with a bit more tenure, has ways around resource limits: I can burn $1,000 a month out of my own pocket for top-tier access and set my own direction. But for younger people just getting fired up about AI, the company still matters, because during work hours you're creating value for someone else. If the model isn't the best, you're always one step behind. You can feel the gap, but you can't close it.

Without that foundational access, it's actually very hard to be an AI transformation expert.

2. Stop Running GPT-5.5 on Medium Effort

The second thing—I guessed right. I asked him, when you use GPT-5.5, you don't crank effort to the max, do you? To save money, control costs, do "routing," and dial effort down? Or just switch to Kimi, DeepSeek, Minimax?

He confirmed it. Most people I know use them that way.

I think there's a problem here.

I've reached a conclusion through repeated trial and error. Back with Opus 4.6, dropping effort from high to medium on the same model slashed accuracy from around 80% to roughly 30%. Same model, just one effort tier lower, and the entire workflow performed completely differently. After that, I never used medium effort again; Claude has stayed on x high ever since.

With GPT, I've been on x high from day one. The reason is simple. If a task runs well in Claude Code, I generally won't bother trying GPT. The moments that made me think "this thing is genuinely different" all came at maximum effort. That was true in the 5.4 era, and the gap is even more pronounced since 5.5.

So in one sentence: don't optimize costs prematurely at this stage.

Let's do the math. One top-tier model, one account, $200 a month. Two accounts give you more than enough headroom to run a single vertical task nearly 24/7. That's $400 a month, about 3,000 RMB. That's cheaper than hiring an intern, yet the output approaches the quality of a PhD-level researcher. What is the point of saving that small amount of money?

I currently spend $1,000 a month on myself, rotating across five accounts. If I switched to API calls at the same intensity, it would cost roughly $10,000.
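Here is that cost math as a small sketch. The $200 plan price, the account counts, and the $10,000 API estimate come from the text; the exchange rate of 7.2 RMB per dollar is an assumed round figure, and the names are illustrative.

```python
PLAN_PRICE_USD = 200   # one top-tier subscription per month
RMB_PER_USD = 7.2      # assumed exchange rate, for illustration only

def monthly_cost_usd(accounts: int) -> int:
    """Flat subscription cost for a given number of accounts."""
    return accounts * PLAN_PRICE_USD

two_account_cost = monthly_cost_usd(2)           # enough for one near-24/7 task
two_account_cost_rmb = two_account_cost * RMB_PER_USD
five_account_cost = monthly_cost_usd(5)          # my current rotation

api_estimate_usd = 10_000                        # same intensity via API calls
api_multiple = api_estimate_usd / five_account_cost

print(two_account_cost, round(two_account_cost_rmb), five_account_cost, api_multiple)
```

The flat subscriptions come out about ten times cheaper than metered API use at the same intensity, which is the whole argument for not routing to cheaper models yet.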

Why not optimize? Because right now you don't know where the strongest model's boundary lies. The researchers don't know either—they haven't tested it in your domain. This is uncharted territory. What you need to do is take the strongest, most expensive, highest-effort setup and slam against that boundary, pushing it to the limit. If it truly can't do something, you'll know for certain.

As for "why can't we control costs first like traditional software"—wait until everyone has mapped out the boundaries and we're in the mass-deployment phase, then consider cost-effectiveness. We're not in that phase right now.

3. Ten Weeks Left

My friend asked: What about time? Can't I just take more time and experiment slowly?

I said, let me do the time math for you.

The world hasn't been broadly stunned yet because top-tier models are coming too fast, too dense, with intervals too short. From my own use, I've found something counterintuitive. Even with the most elite models like GPT-5.5 and Claude Opus, cranked to maximum effort, it still takes time to run a valuable direction to completion. It's fast, but not so fast that you "think it and it's instantly done." It proceeds step by step. What used to take a team months gets compressed to weeks. But you still have to walk through the steps.

Plot this on a timeline:

GPT-5.5 launched on April 23—two weeks ago.
At that point, a cohort already realized this time was different and started running their most ambitious directions on top of it.
Conservatively, within three months, a batch of things will ship from unimaginable people, unimaginable directions, unimaginable domains.

Three months = 12 weeks. Two have passed. Ten remain.
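The week count can be checked with a few lines of date arithmetic. This is a minimal sketch: the year 2026 and the May 12 writing date are assumptions taken from this post's dateline, and "three months" is treated as 12 weeks, as in the text.

```python
from datetime import date, timedelta

launch = date(2026, 4, 23)    # GPT-5.5 launch (year assumed from the dateline)
written = date(2026, 5, 12)   # assumed writing date, from the post's dateline
HORIZON_WEEKS = 12            # "three months", rounded as in the text

elapsed_days = (written - launch).days
full_weeks_elapsed = elapsed_days // 7
weeks_remaining = HORIZON_WEEKS - full_weeks_elapsed
horizon_end = launch + timedelta(weeks=HORIZON_WEEKS)

print(elapsed_days, full_weeks_elapsed, weeks_remaining, horizon_end)
```

Counting in full weeks is the generous reading; by calendar days, a bit more than two weeks are already gone.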

In ten weeks, remarkable results will enter the world and make everyone realize "the world is different." These results won't wait for you. In ten weeks, your market position, your label, your place in the seniority hierarchy may all need to be recalculated.

Is ten weeks long? No. It's short enough that you cannot afford to waste a week building an inefficient workflow or hesitating over which tool to use.

4. Don't Use IDEs, and Don't Mess with Third-Party Frameworks

Next question: so what should I use?

My answer might offend some people, but I want to be clear.

Don't use IDEs. In the coding-agent scenario, the IDE is, in my view, an obsolete form factor. It's especially unfriendly to people who aren't already senior programmers. You have to spend time learning a complex interface that means nothing to you, and in the end you still can't read the code the agent writes. Squeezing a tiny Terminal pane into the middle of all that is fundamentally awkward.

The name "Terminal" has done Claude Code and Codex a disservice. When many people hear CLI or Terminal, they immediately think "programmer stuff." But after I helped some friends with zero programming background install Claude Code and Codex, not a single one said "I can't figure this out." It shows you what's important, compresses unimportant steps into logs, and you just judge at the key decision points. Beginners actually pick it up faster than an IDE.

If you can avoid it, don't use third-party frameworks like OpenClaw or Hermes. This is a bit counterintuitive. I have indeed helped people install these before. But looking at it now, Terminal plus the official CLI has matured to the point where it can do everything those frameworks do, and better.

Why? Because the official CLI is tailored to the official model's behavior. Claude Code connects to Claude models; Codex connects to GPT models. Caching mechanisms, error recovery, risk guardrails, context compression—all tuned for that specific model. Switch to "using OpenClaw with GPT" or "using Claude Code with Kimi," and it may run in theory, but in practice the effect is noticeably worse.

A recently popular open-source project illustrates this point. Someone built a Deep Code CLI specifically for DeepSeek V4, similar in form to Claude Code but tailored solely for the DeepSeek model. Many find this counterintuitive: aren't these tools supposed to "connect to every model"? This path is actually the right one. Each model has its own behavior, and a CLI built around a specific model delivers better results and cost efficiency.

5. Never Use a Completely Black-Boxed Agent

OpenClaw has another "advantage" that I find dangerous. It can dispatch tasks remotely and deliver results without you watching the process. Sounds great.

For some people this is a good thing. But for those exploring the boundaries, this feature should be disabled.

The foundation of collaborating with an agent is understanding. What kinds of work it does well, what it does poorly, what cognitive habits it has—you only learn these by watching it work. Once it becomes a black box, what you lose is judgment, not just the details.

Managing AI is like managing a new hire. The fastest way to learn is to watch them do every step. Treating it as a wishing machine that gives you results won't make you a stronger leader.

6. Want to Operate Anywhere, Anytime? SSH + tailscale + tmux Is Enough

My friend said another reason he likes OpenClaw is that it can be controlled from a phone, letting him dispatch tasks anytime, anywhere. This point has been overlooked far too much, so I'll address it specifically.

If you only use one laptop, feel free to skip this section.

If you want to control your home desktop from your phone, the required infrastructure is already mature. SSH is a decades-old protocol for logging into one machine from another; once in, you have a full shell with that account's privileges. Tailscale is a mesh VPN with a free personal tier: add your desktop, laptop, and phone to the same private network and they can reach each other directly via stable internal IPs. tmux is a terminal multiplexer that keeps sessions alive in the background: open a session on the desktop, cd into the project directory, launch Claude Code or Codex, and that session keeps running after you detach. Disconnecting from the network or turning off your phone doesn't affect it, as long as the desktop stays on. You can attach anytime to check progress.

On the phone side, pair it with a terminal app like Termius and connect in. The whole setup takes less than an hour.
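To make the moving parts concrete, here is a small sketch that builds the two commands involved. It only assembles and prints them; nothing is executed. The session name, user, project path, and the host name desktop are hypothetical placeholders, and I'm assuming the agent is started with the claude command (swap in codex as needed).

```python
import shlex

def start_agent_session(session: str, project_dir: str, agent_cmd: str) -> list[str]:
    """argv to start a detached tmux session in project_dir running agent_cmd.

    -d: do not attach, -s: session name, -c: start directory.
    """
    return ["tmux", "new-session", "-d", "-s", session, "-c", project_dir, agent_cmd]

def attach_over_ssh(user: str, host: str, session: str) -> list[str]:
    """argv to SSH into the desktop and attach to the running session.

    ssh -t forces a terminal, which tmux needs in order to attach.
    """
    return ["ssh", "-t", f"{user}@{host}", "tmux", "attach-session", "-t", session]

# Hypothetical values: run the first command on the desktop,
# the second from the phone's terminal app or a laptop.
on_desktop = start_agent_session("agent", "/home/me/project", "claude")
from_phone = attach_over_ssh("me", "desktop", "agent")

print(shlex.join(on_desktop))
print(shlex.join(from_phone))
```

Once both devices are on the same Tailscale network, the host part is just the desktop's internal name or IP. Detaching from tmux (Ctrl-b, then d) leaves the agent running.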

The workflow after setup looks something like this: Before leaving home in the morning, drop a task into the desktop's tmux session. During the commute, attach from your phone to take a look; if progress looks off, adjust direction. While you're in meetings at the office, the agent keeps running. At lunch, attach again to review and give more feedback. When you get home, pick up right where you left off at the computer.

The entire chain is seamless. I currently have about 50 hours a week where the agent works on its own, out of sight, but I have a clear sense of what it's doing. The phone is the controller.

This kind of infrastructure is common among programmers, but it's severely undervalued in the context of "using AI to drive work transformation." It lets you scale from one laptop to a building's worth of compute without any middleware.

Putting All of the Above Together

Back to my friend's original question: Should I take the AI transformation expert offer?

I didn't give him a direct answer; I gave him my decision framework. First, see if that company can give you near-unlimited access to the strongest models. If not, the offer's value is limited. Second, in your daily work you must use the top-tier model at maximum effort. The vanguard phase is no time to save money. Third, there are only about ten weeks left; don't waste them on inefficient toolchains or debating whether "third-party frameworks might be better."

It all comes down to one sentence. Right now, no one knows the true boundary of the strongest model. What you need to do is not optimize costs, not adapt to existing workflows, but take the most powerful tools available and slam into that boundary to see if it can be pushed outward.

In ten weeks, the world will be shaken by a wave of unexpected breakthroughs. At that moment, the last thing you want is to look back and realize: these past few months I spent tweaking IDE configurations and optimizing model routing costs.

Time is the most expensive resource. Attention is second. Money is last. Don't get this order reversed for the next ten weeks.

References

Original link: https://guanjiawei.ai/en/blog/ai-vanguard-ten-weeks

Won Over by Cheng Li-wun: What Should the New Generation of Leaders Look Like?

guanjiawei — Tue, 12 May 2026 13:09:46 +0000

I've been won over by Cheng Li-wun lately.

Cheng Li-wun is the new KMT chair. On October 18, 2025, she won with 50.15% of the vote — only the second woman to chair the KMT, and the first chair to come from a DPP background. After winning, her position was crystal clear: return to the 1992 Consensus, oppose Taiwan independence, push cross-strait exchange through peaceful means.

Last week, April 7 to 12, she led a delegation to the mainland — the first KMT chair-led visit in ten years. The previous one was Eric Chu meeting Xi Jinping in 2015. This time she met Xi at the Great Hall of the People — a high-level reception.

The visit itself was a major event. But what really moved me was what she did after returning to Taiwan.

She Doesn't Do Traditional Press Tours

A typical politician comes back, holds a press conference, chats with state media. She did something different.

She went on livestream. On the night of April 16, she connected with Taiwanese internet personality Chen Zhihan ("Guan Zhang") for over two hours. They covered everything — Xiaomi cars, Taiwan's youth brain drain, mainland smart manufacturing, the gap between the two sides. Direct. No talking points.

Picture this: a politician at the absolute center of cross-strait attention, fresh from being received by the highest leader on the other side, comes home and instead of going to mainstream media, sits down with a streamer and just talks for hours on camera.

Counterintuitive. But also natural.

I watched a stretch of it. What she said had real impact. She wasn't reading prepared lines — she was actually thinking through it, actually saying things she believed. People like that are rare in politics.

She and Guan Zhang weren't a one-off pairing either. Back in June 2025, they had already started something called the "Non-Party Opposition Alliance," pulling people from across the political spectrum into dialogue. The livestream was a natural extension.

The Risk She's Carrying

The other thing I respect is the risk.

Pushing cross-strait peace right now is dangerous, period. The American side has been running operations for years — espionage, infiltration, assassinations — and it has never stopped. None of it acknowledged in public, but always running. As the most visible person pushing on this issue, she's a target by definition.

Her willingness to step forward, push this visit through, come back and keep extending the work — that's not nothing. The risk to her physical safety is real, not rhetorical.

The Lei Jun Segment Says Something Specific

The part of the livestream where she talked about meeting Lei Jun stuck with me.

She said the last day of the trip was a visit to the Beijing Xiaomi car super-factory. Lei Jun received her personally, she test-drove the YU7, and she even got a Xiaomi phone. She said she's a Lei Jun fan, and her husband is even more so. Their household runs on Xiaomi — cups, phones, wristbands, backpacks, tablets, all of it. When Lei Jun was explaining a Xiaomi cup set at 29.9 yuan, she laughed and said "everything in our home is Xiaomi."

This segment says one specific thing.

A politician just received by the highest leader on the other side comes home and openly says she was thrilled to meet Lei Jun, because she's a fan. Who is Lei Jun? An entrepreneur. Someone expressing himself daily on Douyin and Weibo. Someone continuously extending his identity through products and content.

The identity he has built in the digital world has this much energy: it can make one of the most politically influential people in the region voluntarily declare herself a fan.

The reverse is true too. Why has Cheng Li-wun been able to generate so much discussion so fast? Not through official statements. Through livestreams, dialogues with internet personalities — pushing her thinking directly to people, her own way.

This reinforced a take I've held: digital identity is the biggest individual leverage we have right now. Cuts across industry, cuts across direction. Politicians, entrepreneurs, creators, researchers — whoever does it earliest comes out ahead.

This Era Especially Needs Political Thinkers

Bigger point.

Everyone talks about AI like the technology by itself solves everything. I'm increasingly convinced it doesn't.

This wave of AI is changing economics — the underlying logic and the production formula are shifting hard. Every historical shift of this magnitude has triggered a reshuffling of society. Power gets redistributed, wealth gets redistributed, sometimes nations even go to war.

The thing is, the technology itself is neutral. People all over the world are using more or less the same AI; the threshold isn't that high. But look around: Silicon Valley engineers in stable environments working out how to use Claude to boost research efficiency; parts of Africa where people don't have drinking water; Ukraine living daily with the threat of incoming missiles.

Why?

Not because they can't get the technology. Because the modes of organization, the modes of distribution, the political structures are completely different.

So what really shapes a society or a world has never been only the scientists. Scientists matter, but they're not the only narrative. King's "I have a dream" pushed racial equality. Gandhi did nonviolent resistance. Deng took China from absolute poverty to a different state. What all of these people did was change how humans organize, distribute, and relate to each other.

That kind of work matters more than ever right now.

An Era When the Rites Have Broken Down

Bluntly: the world is entering a new phase.

I've told friends this — the configuration looks very much like the transition from late Spring and Autumn into the Warring States. The rites have broken down.

The old world order was held by the US: IMF, UN, World Bank, cultural exports, the whole package. Not necessarily fair, but at least everyone was paying surface lip service to the rules. A bit like the Spring and Autumn period: feudal lords going to war still needed a pretext, still had to invoke the Zhou king for legitimacy, still had to dress their actions in the language of rites and morality.

What Trump is doing now is no longer that.

Venezuela is one example. On January 3, 2026, US forces struck inside Venezuela and arrested President Maduro, who is now held in the US to face drug-trafficking charges. Trump immediately announced a "historic energy agreement" with Venezuela — his exact words were that the US would now "run Venezuela," and that Venezuelan oil interests would belong to the US. Venezuela's proven oil reserves are about 303 billion barrels, the largest in the world.

Iran was even harsher. On February 28, 2026, the US and Israel jointly killed Iran's Supreme Leader Khamenei. There was no formal declaration of war, no congressional authorization. Picture this: two countries already tense but not in a state of declared hostility, and one suddenly kills the other's top leader, then announces it as a successful military operation. Any head of state watching that gets cold chills.

Greenland is another thread. Trump has openly said he won't rule out using military force to take Greenland from Denmark, and he threatened a 25% tariff to push Denmark to hand it over. He temporarily walked it back after Davos in January 2026, but the posture is already on the table: I want it, you'd better hand it over.

Put these three together and the signal is clear: between major powers, the pretense is gone. Whoever has the bigger fist makes the rules. Morality, rules, procedure — all of it can be thrown out.

This is what late Spring and Autumn felt like.

A Hundred Schools Contending

But late Spring and Autumn had another side: the contention of a hundred schools.

When everyone is stable and settled, ideas and new theories of governance don't have a market — nobody needs them. But when the existing modes of organization start collapsing, while a new economic foundation is being born at extreme speed, that is exactly when ideas have to appear.

That's right now.

One side: the old order is breaking down. The other: a new economic foundation from AI. Both sides are moving.

What this era needs, in my view, is not just scientists. We need more thinkers, more political practitioners, more people experimenting with modes of organization — people exploring what kinds of social arrangements fit this technology, this configuration, this reality.

That's the biggest thing Cheng Li-wun has stirred in me. Her approach is exactly what this era calls for: direct dialogue, livestreaming with internet personalities, thinking out loud on camera. Carrying personal risk to push something forward — that, too, is the posture this era needs.

The Form Doesn't Have to Be Fancy

When it comes to transmitting ideas, there's an easy wrong turn now: trying to make the form fancy, very produced, full of innovation cues. A short video, an animation, an interactive installation. All impressive — but not the only path.

Words and writing themselves carry enormous power. No matter how saturated short video gets, that doesn't go away.

What Cheng Li-wun did was: sit down, turn on the livestream, talk about what she's seeing, what she's thinking, how she feels. No editing, no filters, no script. But in those few hours of conversation, the volume of information and the resonance she transmitted exceeded a hundred carefully produced PR videos.

China Is in a Window Right Now

Back to ourselves.

Against the backdrop of a violently shaking world, China right now is a relatively stable patch. Outside, the fighting is fierce; inside, things are relatively calm. Tense outside, relaxed inside. That in itself is a remarkable achievement, and a precious window.

In this window, there is already a wave of very young people doing real political work on the front lines. At the county level, the village level, the district level — young leaders running real experiments and finding new ways to express themselves.

But that's still far from enough.

The hundred-schools moment in Spring and Autumn wasn't everyone going their own way and talking past each other. It was dozens of small states and hundreds of thinkers, each running experiments in their own corner, then converging — debating, exchanging, comparing results. Confucianism, Daoism, Legalism, Mohism — they all grew in that environment. If today we only have scattered practice with no exchange or collision, the hundred schools can't take off.

So I hope more people — researchers, founders, political workers, local officials, content creators — bring out their own thinking, run experiments, use digital identity to put it out there, find both resonance and disagreement.

What one person can change is amplified now. It's not that nothing can be done — it's that there's an extraordinary amount you can try.

What Cheng Li-wun did is not a grand-power strategy. It's one visit, a few livestreams, a few conversations. But it really did change how a lot of people felt and thought.

We can do the same.

References

Cheng Li-wun elected KMT chair — October 18, 2025, won with 50.15% (65,122 votes). Second woman to chair the KMT, first chair from a DPP background.
Cheng Li-wun's delegation arrives on the mainland — April 7–12, 2026 "Journey of Peace 2026," visiting Jiangsu, Shanghai, and Beijing.
Xi Jinping meets Cheng Li-wun — April 10, 2026 at the Great Hall of the People, the first meeting between KMT and CCP leaders in nearly a decade (the previous one was Eric Chu and Xi in 2015).
Taiwan's opposition leader arrives in China for a 'Journey of Peace' — NPR coverage of the trip.
The Domestic Politics of Cheng Li-wun's China Trip — The Diplomat's analysis of the trip's political significance.
Cheng Li-wun visits Xiaomi factory; Lei Jun receives her personally — April 12, 2026 visit to the Beijing Xiaomi auto super-factory; test drive of the YU7.
Cheng Li-wun openly a Lei Jun fan — Publicly stated she and her husband are Lei Jun fans; their household is full of Xiaomi products.
Cheng Li-wun's livestream with Chen Zhihan — April 16, 2026 evening; over two hours of livestreamed conversation covering Xiaomi cars, Taiwan's youth brain drain, the cross-strait gap.
Founding of the Non-Party Opposition Alliance — June 2025; co-founded by Cheng Li-wun and Chen Zhihan, gathering different parts of the political spectrum into dialogue.
US strike on Venezuela and arrest of Maduro — January 3, 2026 US military strike inside Venezuela; arrest of President Maduro.
Trump Claims 'Historic' Venezuela Oil Deal After Maduro Arrest — Military.com; Trump announces taking over Venezuelan oil.
Assassination of Ali Khamenei — Wikipedia — February 28, 2026 joint US-Israel operation kills Iran's Supreme Leader.
U.S. and Israel launch a major attack on Iran — PBS coverage of the operation.
Greenland crisis — Wikipedia — Trump threatens 25% tariff on Denmark and won't rule out military means to take Greenland.
Proposed United States acquisition of Greenland — Background on US sovereignty intentions toward Greenland.

Original link: https://guanjiawei.ai/en/blog/zheng-liwen-hundred-schools

There Are Only Two Ways to Start Vibe Coding

guanjiawei — Tue, 12 May 2026 13:09:06 +0000

The question friends and colleagues have been asking most lately: Vibe Coding sounds pretty cool, but where exactly do you start?

It's a good question. AI Coding is a productivity tool, not an entertainment tool. If you try to tinker with it in daily life, you'll mostly be at a loss. The vast majority of daily needs have already been saturated by cheap or even free apps over the past decade or so. Want bubble tea? There's Meituan. Want to edit videos? There's Jianying (CapCut). Why vibe-code another one yourself?

My answer offers only two paths: either start from work, or start from building your own digital identity. There is no third path.

1. Start from Work

Let me first explain why work.

Tools should grow wherever human attention is focused. Whether it's eight hours or ten hours a day, that's the time you're compelled to produce value. The pressure is high, the feedback is direct. There's no better testing ground for transforming your work paradigm.

Scenarios like playing ball, cooking, or binge-watching shows seem more "free," but the actual time you invest weekly is pitifully thin, and your attention isn't sufficiently attached. Vibe coding doesn't easily take root in these places because your mind isn't really there.

Work is the opposite. You spend the vast majority of your time immersed in it. Off-the-shelf tools mostly cost money and aren't particularly good. Output gets validated by others, feedback doesn't disappear. Room for improvement is concrete, not imagined.

My own metric is AI penetration during work hours. Early on, I was probably only spending 10% of my time interacting with AI, with the remaining 90% being my original way of working—that state basically equals zero penetration. Later I gradually pushed it above 80%. When penetration is high, the only things left in my day that still use the old paradigm are purely human activities like meetings, signing off, and talking with clients.

This number is actually closer than you might think. In Stack Overflow's 2025 Developer Survey, 84% of developers are already using or planning to use AI tools, with 51% using them daily. Anthropic's own Economic Index data is even more direct: 46% of conversations on Claude.ai are work-related, and on the API side it's 74%. Heavy users treat it as a work partner, not a toy.

But pushing penetration from 10% to 80% isn't just a matter of adding AI to your workflow. It requires breaking apart your current tasks and redistributing them: which segments AI should write, which it should research, which it can run on its own, and which a human must monitor. This restructuring is the most natural training ground for getting started with vibe coding.

2. Start from Building Your Digital Identity

If you can't find an opening at work right now—say the process is too rigid, or the team isn't ready yet—the second path works just as well. Build your own personal digital identity.

Why place it second, outside of work? Because its ROI is absurdly high.

The investment is so small it's barely worth calculating. A personal website can now be built with vibe coding in a day or two.

The potential payoff is compounding. What you write is yours, not the platform's. The same piece published on Zhihu, WeChat Official Accounts, Geekbang, X, and Xiaohongshu (RED) represents five completely different exposure formats. Each platform's algorithm is different. The probabilities of hitting a viral piece stack together, making it far more stable than betting on a single platform.

I cold-started for just over a month, averaging about 2,000 daily views across all platforms. This number actually means very little. What matters is the distribution. I had an article about software bidding and procurement that got pushed to over 20,000 reads and hundreds of comments on Zhihu. Many other articles sit at just dozens to hundreds of views on each platform. But as long as your inventory is large enough, one or two will eventually hit. That's the compounding effect of distribution.
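To make the "stacking" intuition concrete, here is a minimal sketch. It assumes every post on every platform is an independent draw with the same small chance of taking off; the 2% figure and the function name are hypothetical illustrations, not measured data.

```python
from typing import Iterable


def p_at_least_one_hit(hit_probs: Iterable[float]) -> float:
    """Probability that at least one of several independent posts 'hits'."""
    p_all_miss = 1.0
    for p in hit_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be in [0, 1]")
        p_all_miss *= 1.0 - p  # every post has to miss for the total to miss
    return 1.0 - p_all_miss


P_HIT = 0.02  # hypothetical chance that one post on one platform takes off

one_article = p_at_least_one_hit([P_HIT] * 5)         # 1 article x 5 platforms
thirty_articles = p_at_least_one_hit([P_HIT] * 150)   # 30 articles x 5 platforms

print(f"one article, five platforms: {one_article:.1%}")
print(f"thirty articles, five platforms: {thirty_articles:.1%}")
```

Under these assumptions a single article cross-posted five times has a bit under a 10% chance of producing a hit, while thirty articles push that above 95%. Real platforms are not independent and real hit rates vary, so treat the shape of the curve, not the numbers, as the point.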

Here's a counterintuitive fact. profy.dev surveyed over 60 technical hiring managers: 93% will look at a candidate's portfolio website, but 51% flat-out said "not having a portfolio website doesn't lower a candidate's chances." What does that mean? Static showcase "vanity sites" are largely useless. What actually works are vehicles that let content gradually accumulate. GitHub profiles, blogs, project notes—these things with "output traces" are what constitute a digital identity.

Applied to vibe coding, this path is now much easier to walk. AI has compressed the most tedious parts of the content pipeline to near-zero cost: writing, reformatting, adapting for different platforms, generating different media. What used to take half a day for one person to spread an article across five platforms now takes ten-plus minutes. Even investing just 30 minutes daily adds up significantly over the long term.

As for "why you should build a digital identity," I wrote a previous piece that covers it more completely, so I won't repeat it here.

3. The Prerequisite: It Must Be the Real You

My friend followed up with a second question that I think deserves more elaboration than "where to start."

He said: I feel a bit hesitant about putting my own stuff out publicly online. What should I post? How should I post it?

My answer has only one rule: post, but what you put out must be the real you.

Don't perform, and don't chase traffic by imitating a style you don't even endorse yourself. This is where long-term ROI most easily crashes.

Many people misunderstand, thinking that building a personal identity means "performing a better version of yourself." This path only works for full-time influencers who depend on traffic for their livelihood. That's their survival mode. For most people, digital identity is a compounding bonus beyond work, not their main livelihood. Once you start performing, you incur an invisible liability: every piece you publish is a "raw archive" that may be dug up and cross-examined someday in the future.

The fact that the internet has memory is, I think, grossly underestimated.

The classic example is Justine Sacco. In 2013, she dashed off a tweet before flying to South Africa; she had only about 170 followers at the time. Eleven hours later, when her flight landed, she was trending worldwide amid outrage, and her employer fired her soon after. Follower count is no shield. James Gunn was fired by Disney in 2018 over tweets that were roughly a decade old. Kevin Hart stepped down as Oscars host the same year, also over decade-old jokes that were dug up.

Once your influence crosses a certain threshold, past content gets examined under a magnifying glass.

So before I publish anything, I ask myself: at that point in time, is this the real me? Is this an opinion I'm willing to claim later when I look back?

If yes, publish. Even if your thinking changes later and you find it naive—that's fine. People naturally change, and looking back at youthful naivety is normal; it's part of your history. But if you couldn't even endorse what you wrote at that point in time, looking back it becomes a stain. These stains accumulate and become constraints holding you back.

Posting your authentic self has another hidden benefit. It forces you to think clearly about "what is the real me." The value of this process itself may exceed the exposure the content generates.

Not Posting Means Zero

Back to my friend's original question.

Two paths: starting from work is the most natural, since your attention is already there; starting from digital identity is the most cost-effective, with small investment and compounding returns. Whichever one you can start doing immediately, take it—it's far more effective than agonizing for six months.

On digital identity, I also said in my last piece: if you don't do it, the opportunity just hangs there waiting. The probability of any event hitting you is zero. If you don't buy a lottery ticket, the probability is zero; if you buy one, at least there's a probability.

But rather than grand narratives like "digital identity is leverage," what I want to emphasize more is: start today, don't wait. The barrier is already ridiculously low. Few things are so universally applicable across industries and effective for everyone. There's only one prerequisite: that it's the real you.

Everyone should do it. Musk, Lei Jun—they're all doing it, even Trump is doing it. You're using the same infrastructure, the same medium as them. There's no reason you shouldn't do it.

Original link: https://guanjiawei.ai/en/blog/where-to-start-vibe-coding

No More Babysitting the Agent

guanjiawei — Tue, 12 May 2026 13:08:18 +0000

I spent two days automating with the Codex app, burned through a Pro account, and made virtually no progress. Switching to Codex CLI's /goal feature, everything immediately clicked.

It felt counterintuitive at first. Same model, same task—how could swapping the shell make such a huge difference? Two days later, I figured it out: it's not the app's fault. The agent form factor is still in transition, and the app is making a "mass-market" push that runs ahead of where the models actually are.

What surprised me more than the burned account was something else entirely. Mindset.

1. A Puzzle

There's something I haven't been able to figure out lately, so I'm putting it out here for discussion.

The same coding agent behaves so differently in the terminal versus the official app that they barely seem like the same entity. It runs beautifully in the terminal, but noticeably dumbs down when moved to the app.

Theoretically, there shouldn't be such a gap. An app is just a shell—wrap a framework, change the visuals, behavior should be consistent. But power users at the front lines have already voted with their feet: the vast majority still live in the terminal. The app has been pushed for so long without truly taking off.

After testing this myself, I found it's probably more complicated than it looks.

2. Codex App "Automations": Running Through Every Pitfall in Two Days

It started with that post-GPT-5.5 feeling of "this model is already strong enough."

GPT-5.5 pushed Codex's end-to-end execution capability up a notch. In long-horizon tasks, it spontaneously chooses to "verify first, then advance." My sense is that one loop can sustain roughly 30 minutes before it naturally pauses at a milestone. After watching it for days, I realized that in the terminal I was basically just saying "continue" on repeat. Its next-step judgment was already reliable enough.

So if that's the case, why not make it highly automated? After all, it was just "continue."

I remembered that the Codex app previously had a Routines feature—perfect. I downloaded it, only to find Routines were gone, renamed to "Automations." I read the docs; the capabilities looked similar, so I got started.

The first pitfall was the trigger mode. It supports fixed-time execution or interval triggers. I set a custom interval to fire every 30 minutes, letting it run continuously around a single goal for two days.

I initially wanted to simulate the terminal experience: the same session repeatedly "continuing," checking the results every few iterations, pulling it back if it drifted off course. In the app, this maps to thread automation—hanging a heartbeat-style timed wake-up on the current session.

It sounded reasonable, but testing revealed a hidden limitation: within a single session, the automation only triggers successfully once. After the first run, the system swallows subsequent heartbeats, apparently to prevent infinite loops. This made continuous iteration impossible.

Stepping back, I switched to standalone automation: spawning a new session each time, triggered on a schedule. This could run, but at two costs.

First, price. Every new session carries no historical context, so previous state has to be handed off as documents. With no cache hits, the new session crawls and reads all relevant documents from scratch. Under OpenAI's prompt caching mechanism, cached tokens cost roughly one-tenth of uncached ones. For the input that would otherwise have hit the cache, that is roughly a 10× price increase. My gut feeling checks out: running three independent automations simultaneously on one Pro account burned through the quota in about two days.
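The arithmetic behind that cost gap can be sketched in a few lines. This assumes, as stated above, that a cached input token costs about one-tenth of an uncached one; the 90% hit rate for a long-lived terminal session is a hypothetical figure, and output tokens are ignored.

```python
def relative_input_cost(cached_fraction: float,
                        cached_price_ratio: float = 0.1) -> float:
    """Input cost relative to paying full price for every token (1.0 = no cache)."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be in [0, 1]")
    cached_part = cached_fraction * cached_price_ratio
    uncached_part = 1.0 - cached_fraction
    return cached_part + uncached_part


warm_session = relative_input_cost(0.9)  # hypothetical: 90% of the prompt hits the cache
cold_session = relative_input_cost(0.0)  # fresh session: nothing is cached

print(f"cold vs. warm input cost: {cold_session / warm_session:.1f}x")
```

Under that assumption, 10× is the limiting case where the entire prompt would have been cached. With a 90% hit rate the gap is closer to 5×, which is still enough to explain a quota disappearing in two days.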

Second, performance. I narrowed the automation scope so it would only attempt small, concrete problems at night. Checking in the next morning: eight hours of runtime, virtually no progress.

This fell far short of my expectations for this model in the terminal. When I repeatedly said "continue" for four hours in the terminal, the output quality was consistently higher than these eight hours of unsupervised runtime. At first I thought the model itself wasn't that strong; I later rejected that hypothesis. The app path is the problem.

3. Codex CLI's Goal: The First Respectable Long-Distance Run in the Terminal

While researching, I noticed that Codex CLI had added an experimental feature called /goal in version 0.128.0.

Its design logic is straightforward: you give it a goal (not a single prompt), and it continuously loops through plan → act → test → review → iterate until the goal is achieved or the budget runs out. State persists across sessions; it can be paused, resumed, or cleared. This is essentially the exact thing I had been struggling to simulate with app automations.
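The shape of that loop can be sketched as follows. This is not how Codex CLI implements /goal, which I can't see from the outside; `GoalState`, `run_goal`, and the toy `plan`/`act`/`review` functions are all hypothetical stand-ins for the model and its tools. The sketch only illustrates "loop until done or out of budget, with state that survives a pause."

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, List


@dataclass
class GoalState:
    goal: str
    budget_steps: int                      # hypothetical budget: total loop iterations allowed
    history: List[str] = field(default_factory=list)
    done: bool = False


def run_goal(state: GoalState,
             plan: Callable[[GoalState], str],
             act: Callable[[str], str],
             review: Callable[[GoalState], bool],
             max_steps_this_session: int) -> GoalState:
    """Loop plan -> act -> review until the goal is met, the budget is spent,
    or this session's step allowance is used up (which acts as a pause)."""
    steps = 0
    while not state.done and state.budget_steps > 0 and steps < max_steps_this_session:
        step = plan(state)
        result = act(step)
        state.history.append(f"{step}: {result}")
        state.budget_steps -= 1
        steps += 1
        state.done = review(state)
    return state


# Toy stand-ins for the model and its tools.
def plan(state: GoalState) -> str:
    return f"step {len(state.history) + 1}"


def act(step: str) -> str:
    return "ok"


def review(state: GoalState) -> bool:
    return len(state.history) >= 3


state = GoalState(goal="finish three steps", budget_steps=10)
state = run_goal(state, plan, act, review, max_steps_this_session=2)  # pauses early
saved = json.dumps(asdict(state))                                     # persist across sessions
resumed = run_goal(GoalState(**json.loads(saved)), plan, act, review,
                   max_steps_this_session=5)                          # resume later
print(resumed.done, len(resumed.history), resumed.budget_steps)
```

The key design choice is that everything the loop needs lives in a serializable state object, so "pause" and "resume" are just save and load.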

I was setting up a new machine anyway, so I enabled this feature to try it out. My conclusion: the feel is close to what I expected. Long-horizon execution holds up; most of the time it pushes forward around a goal on its own without deviating too wildly. This is the first time I've seen a truly respectable "long-distance" form factor in the terminal.

This is where it gets awkward. The app's polished interface, visual diffs, and chat bubbles—ultimately less effective than a plain terminal loop.

Why?

Honestly, I haven't fully figured it out. The app isn't open source; I can't see how it's actually implemented behind the scenes. But one observation might hold some weight: the products actually doing long-horizon tasks (Codex terminal Goal, Boris's own Claude Code setup, community projects like OpenCode) all converge on the same form. A loop in the background—model plus tool calls plus context appending, running repeatedly. State persisted via files, git, or worktrees; the UI is just an observation window.

Meanwhile, the app experience, designed around a chat dialog, becomes a drag in long-distance scenarios. It treats every interaction as an independent "Q&A," breaking the shared context that should persist, altering the prefix that should stay cached, and artificially fragmenting the "next step" that the model should judge for itself into multiple separate events.

4. Letting Go for 24 Hours: A Shift in Mindset

Returning to that two-day experiment—setting aside cost and performance—I noticed something completely unexpected. Mindset.

Before, when using an agent, no matter how automated it was, your mind was still tethered to it. You'd constantly wait for the next turn, watch what it was doing, judge whether to intervene. This burden isn't in the hours worked; it's in your awareness. You haven't truly "handed it off"; you've only delegated execution, while decision-making remains with you.

During those 24 hours when it was running on three branches by itself, I felt "no need to watch" for the first time. I knew it was working, knew it had the capability, so I let it run. That day, I did other things. Ate dinner with family, read some philosophy, watched a movie.

The results fell short of expectations; on paper, it was a loss. But that shift in mindset felt like a real inflection point.

I hadn't realized before just how heavy the cognitive tax of "keeping an eye on the agent" actually is. You think you're not spending time; true, you only check in for a few minutes here and there. But your attention remains anchored to it. Once letting go becomes viable, that entire block of time truly belongs to you again.

This clarified for me the evolutionary stages of agents.

Stage One: Human-centric. Chatbots, code completion. AI is a productivity tool; humans must pay attention to every detail.

Stage Two: Human-directed agent. This is the mainstream form today. Let the agent run wild and it'll mess up, so you have to guide it through the process, babysitting it as it works. Here's the counterintuitive part: this stage is actually more exhausting than doing it manually. Your output is much higher, but the physical and mental toll is heavier. Someone in my family sees the change in me most clearly. She says I'm truly on call 24/7 now.

Stage Three: Agent autonomy. Humans only define goals and occasionally supervise; the agent handles the rest. I think GPT-5.5 is the first model that makes this stage look genuinely possible.

Stage Two is a transitional state. Its defining feature is "expectations exceed capabilities." You want the agent to work independently like an employee, but the model is still one step short, so you have to fill the gap. By Stage Three, that filling-in is no longer needed.

5. Model + Goal + Loop

Boris Cherny (creator of Claude Code) has been repeating a prediction lately: within a year, Claude Code might be down to just 100 lines of code.

His logic goes like this: most of what's currently in the harness—permission gates, context management, tool routing, prompt-injection guards, human-review hooks—exists because "the model isn't smart enough yet." Once the model can make the right judgments on its own, all of that becomes baggage. What remains as the truly necessary core is just one loop:

while (model returns tool calls):
  execute tool → capture result → append to context → call model again

This is the shared foundation underlying all frameworks: Claude Code, Codex, Cursor agent, Manus. One loop, plus a goal and tool calls.
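As a runnable illustration of that loop, here is a minimal Python sketch. It is not code from Claude Code, Codex, or any other product; `agent_loop`, `scripted_model`, and the single `upper` tool are hypothetical, and the "model" is a scripted function so the example runs without any API.

```python
from typing import Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, str]  # (tool name, argument)


def agent_loop(model: Callable[[List[str]], Optional[ToolCall]],
               tools: Dict[str, Callable[[str], str]],
               context: List[str],
               max_turns: int = 20) -> List[str]:
    """While the model returns a tool call: execute it, capture the result,
    append it to the context, and call the model again."""
    for _ in range(max_turns):           # hard cap so a confused model can't loop forever
        call = model(context)
        if call is None:                 # no tool call means the model is finished
            break
        name, arg = call
        result = tools[name](arg)
        context.append(f"{name}({arg}) -> {result}")
    return context


def scripted_model(context: List[str]) -> Optional[ToolCall]:
    # Toy policy: request one tool call, then stop once its result is in context.
    if not any(line.startswith("upper(") for line in context):
        return ("upper", "hello")
    return None


tools = {"upper": str.upper}
transcript = agent_loop(scripted_model, tools, ["goal: shout hello"])
print(transcript)
```

Everything else in a harness (permissions, context trimming, review hooks) would wrap around this loop rather than replace it.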

This judgment aligns with my own intuition. Features like /goal work because the model itself is already stable enough on "what should I do next." You no longer need external constraints forcing it to do A before B; it plans on its own, judges on its own whether to step back and verify.

Once the form factor solidifies, the next evolutionary direction becomes clear: one person overseeing multiple agents, each running in a loop around its own goal, potentially even collaborating with each other. It's an agent team, not an agent copilot.

6. The Token Logic Flips

There's an interesting side effect here: token consumption will decouple from headcount.

Most paying users currently suffer from "token anxiety." Billing is settled weekly or monthly, so you feel compelled to squeeze every drop out of your quota, or else you feel guilty. The underlying logic is similar to hiring people: you're buying time, so you squeeze them 24/7.

But this mechanism is actually quite distorted relative to real value creation. Not every task needs to run 24 hours a day; in fact, most don't.

In Stage One the bottleneck was "capability": the model wasn't strong enough to accomplish many things independently. In Stage Two the bottleneck is "human": you only have so much attention, and juggling three agents is your ceiling. By Stage Three, the bottleneck shifts to "goals": how many valuable directions does one person actually have that are worth pushing an agent toward?

Once this holds true, token consumption has nothing to do with user count. Some people have never used an agent at all; others are already spending over a thousand dollars a month on agents (I'm approaching that range myself). This divergence will only intensify. One person with 100 agents running 24/7 for them is entirely plausible.

The near-term phenomenon: token demand will explode further, but the shape of that demand may be completely uncoupled from user growth. A small number of people in heavy-use scenarios will consume the vast majority of compute.

7. What to Do with the Liberated Time

This brings me back to that mindset shift I mentioned at the beginning.

Industrialization has placed humans in a contradictory position: managers expect employees to work 24/7, have autonomy, and yet be supervised in real time. These demands are inherently contradictory. Because "wages are paid monthly," managers will push this contradiction to its absolute limit.

Agents have no wage contradiction; they're billed by the token. In theory, you can let them truly run 24/7. The prerequisite is that they can produce without supervision. I think that prerequisite is beginning to hold true post-GPT-5.5.

Once it does, what does it mean for humans?

Someone in my family understood this better than I did: humans shouldn't be defined as production units. Seeing me in that "unsupervised" state for two days, she said it directly: the tense state you were in before is unsustainable from a health perspective. She was right. The fact that Stage Two is "more exhausting than manual work" isn't just a complaint; it's really happening. You're constantly searching for ways to collaborate with the agent, and that exploratory cost is real on a mental level.

If Stage Three truly materializes, what can people do with the freed-up time? Express, connect, pursue things they're interested in. A simple answer, but the more I think about it, the more it feels like the direction actually worth taking. Studying philosophy, watching films, returning to family—whatever adds value on the dimension of being human. These are things agents can't replace.

8. How We Evaluate People Is Changing

Finally, a more macro-level observation.

The entire value system is being reshaped. The old way of measuring someone's output was by their hours worked, lines of code, or PR count. These metrics will become increasingly hollow in the agent era. Boris alone filed 150 PRs in a single day—how do you evaluate him by PR count?

What will truly differentiate people in the future is leadership and goal orientation. Can you define high-quality goals? Can you identify genuinely valuable directions? Can you get a team of agents running toward the same North Star?

These qualities mattered in "human-human collaboration" too, but you could still fall back on organizational processes, KPIs, and incentives. In "human-agent collaboration," that fallback disappears. A poorly defined goal is direct waste—100 dollars burned overnight on a direction that drifts off course.

Letting go is hard. But some things only move forward once you let go.

Original link: https://guanjiawei.ai/en/blog/let-the-agent-run

DeepSeek Valued at 300 Billion, Liang Wenfeng Invests 20 Billion

guanjiawei — Tue, 12 May 2026 13:07:37 +0000

Yesterday when the DeepSeek funding news broke, I was taken aback scrolling through my social media feed.

The outside world is discussing the 300 billion RMB valuation, the lineup led by China's major funds with Tencent and Alibaba following, and the fact that this is the first time the company has ever raised money since its founding. All of this matters. But after reading the news several times, what truly shocked me was another set of numbers: DeepSeek contributed 20 billion internally in this round, entering alongside external investors at this same valuation. And Liang Wenfeng holds nearly 90% of the shares.

In all my years working in the AI industry, I've never seen a move like this.

1. Buying Your Own Cake

An analogy makes clear how abnormal this is.

Imagine you have a cake. You baked it for several years, and your employees baked it with you. One day the market says it wants to put a price on this cake, and the estimate comes out to 300 billion. Everyone is thrilled, because the small slice each person holds is suddenly worth money.

Normally, at this point the founder breathes a sigh of relief: the money previously spent is water under the bridge, external capital is coming in at a high valuation, the company has cash, and employee stock options have an anchor price.

But Liang Wenfeng did one more thing. At the 300 billion valuation he himself set, he pulled out another 20 billion to buy in alongside external investors.

He himself accepts this 300 billion price tag, and is willing to double down. Translated into the polite language of a shareholders' meeting, that's "I'm very optimistic about the company." But polite words don't require 20 billion in real cash.

The source of the funds is also intriguing. Liang had never raised external capital before; all the company's money in its first few years came out of his own pocket. Founders like this are usually wary of outside investors, and they only open up for the first time because they're running out of cash. But DeepSeek isn't short on money. High-Flyer Quant alone had assets under management exceeding 70 billion RMB in 2025, and the dividends he personally receives from this one quantitative hedge fund are enough to fund research for many more years.

The real purpose of this round is to give employee stock options a market-recognized price anchor. The model industry is currently in a fierce war for talent; without a clear valuation, you can't retain people with equity. The fundraising was for this purpose, but the founder himself following on with 20 billion far exceeds that purpose. This is a statement made with money: I believe in this price myself, and I'm willing to keep adding to my position at this price.

2. The First Time I Heard About This Company

Rewind to late 2023.

Back then, the hottest labels for large models were the "Four Little Tigers" and "Six Little Tigers": Zhipu AI, Moonshot AI, Baichuan, MiniMax, 01.AI, and StepFun — the rankings shifted depending on who you asked. Investors and media were all circling this list. I was at Zhipu AI at the time, and no one in the entire circle had heard of DeepSeek.

Until one day a friend asked me on social media: Have you heard of DeepSeek? I heard their cost advantage is pretty insane.

I went to look up this company's background. It was spun off from a private equity fund called High-Flyer Quant, formally established to work on large models only in May 2023. The earliest versions weren't particularly impressive in terms of capability. They had originally aimed for performance, but when performance fell short of expectations, the MoE architecture unexpectedly reduced costs by an order of magnitude. Then they did something quite surprising: since costs were low, they simply lowered API prices.

This was the first time the industry heard the name DeepSeek. Not because the model was stunning, but because the price was cheap, and the training cost data was written in their paper — anyone in the tech community could see at a glance that they really had something.

3. At That Time, Only a Few Nationwide Could Field a Ten-Thousand-Card Cluster

2023 was the height of the US-China chip war. The A100 had long been placed on the restriction list, and available stock was dwindling. I was at Zhipu AI at the time and could directly feel that the A100 was a scarce resource. There weren't many players domestically who could truly field a ten-thousand-card A100 cluster. SenseTime was one; by its 2022 annual report disclosure it had 27,000 GPUs, with A100 reserves exceeding ten thousand.

Then DeepSeek's numbers surfaced: High-Flyer had already invested 1 billion yuan in 2021 in "Firefly No. 2," building a cluster of 10,000 A100s. In other words, before large models had become a mainstream topic, they had already prepared the hardware foundation for today's large model battles.

The industry began taking this company seriously from this point on. A quant private equity firm hoarding ten thousand A100 cards and building its own deep learning platform is not something ordinary secondary-market players would do.

4. The Price Butcher

In May 2024, V2 was released.

The moment the price list came out, everyone realized something was different. Input at 1 yuan per million tokens, output at 2 yuan per million tokens — roughly 1% of GPT-4 Turbo's price at the time. This wasn't a price cut; it was moving prices onto an entirely different track.
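The "roughly 1%" figure can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the DeepSeek V2 prices come from the paragraph above, while the GPT-4 Turbo list price (10 and 30 USD per million input and output tokens) and the exchange rate are assumptions supplied here for illustration, not numbers from this post.

```python
# Assumed reference values, not taken from the article:
# GPT-4 Turbo list price in USD per million tokens, and a rough USD->RMB rate.
GPT4_TURBO_USD_PER_M = {"input": 10.0, "output": 30.0}
USD_TO_RMB = 7.2  # assumption: approximate mid-2024 exchange rate

# DeepSeek V2 prices quoted above, in RMB per million tokens.
DEEPSEEK_V2_RMB_PER_M = {"input": 1.0, "output": 2.0}


def price_ratio(kind: str) -> float:
    """Return DeepSeek V2's price as a fraction of GPT-4 Turbo's for one token kind."""
    gpt_rmb = GPT4_TURBO_USD_PER_M[kind] * USD_TO_RMB
    return DEEPSEEK_V2_RMB_PER_M[kind] / gpt_rmb


for kind in ("input", "output"):
    print(f"{kind}: {price_ratio(kind):.1%} of GPT-4 Turbo")
```

Under these assumptions both ratios land close to one percent, which matches the "two orders of magnitude" framing.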

The chain reaction followed quickly. Zhipu AI slashed GLM-4-Plus by 90%, from 50 yuan all the way down to 5 yuan; ByteDance's Doubao directly quoted 0.0008 yuan per thousand tokens, claiming the industry's lowest; Alibaba's Tongyi and Baidu's Ernie also followed with cuts. For a whole month, the industry's pricing baseline was rewritten.

This price war was different from any seen before. Previously, model companies cut prices following OpenAI; however OpenAI cut, everyone followed. This time, a previously unknown Chinese company set an anchor two orders of magnitude lower than OpenAI, dragging the entire domestic market onto a completely different cost curve.

5. So Low-Key It Was Abnormal

DeepSeek doesn't resemble a typical Chinese AI company in many ways.

It's in Hangzhou, not Beijing. It has no government background, nor did it seek local government support. The CEO doesn't do roadshows, doesn't give interviews, doesn't post on social media. Before fundraising, it hadn't cultivated VC relationships; employee hiring relied on word-of-mouth in technical circles, not headhunters.

The most interesting thing was the positions it hired for. At the time, a job title was circulating in the industry called "Data Know-It-All." The requirements were: strong interest in various domains, high self-drive, and passion for achieving AGI. No particular educational background required; knowing how to read files and call APIs in Python was enough. It explicitly stated "no algorithm derivation or handwritten code required." The essence of this role was teaching AI how to be more human, so it actually valued general knowledge and curiosity more. At the time, a product manager friend of mine saw the job description and told me he wanted to apply; in the end, he didn't get in.

I myself also once wanted to go. There was a sense of mystery about this company, stemming from a rare purity: head down doing technology, nothing else. Later there was never quite the right opportunity, but I kept paying attention.

6. From V3 to R1, Pulling the Industry Back to the Main Thread

In December 2024, V3 was released.

Performance directly reached top-tier levels among open-source models. Then before long, on January 20, 2025, R1 went live, and they directly released an app. This was DeepSeek's first consumer-facing product, coinciding right before the Spring Festival.

Everyone saw what happened the following week. R1 topped the app stores during the Spring Festival holiday, overseas media reported on it collectively, and people from OpenAI were commenting on X repeatedly. The entire AI agenda for the first half of 2025 was rewritten by this single event.

But R1's real impact wasn't its popularity itself. It was that it pulled a batch of Chinese large model companies, which capital and markets were pushing into "forced commercialization," back to the main thread of "first make the model good."

Before DeepSeek, most Chinese large model companies were closed-source. The logic was: open source gets freeloaded, hurting commercialization. R1 directly proved this logic untenable. A completely open-source model can simultaneously have the best technical influence and the broadest user base; commercialization is something that will follow naturally.

By mid-2025, Kimi open-sourced K2, Zhipu AI open-sourced GLM-4.5, MiniMax open-sourced M2.1, Qwen continued its day-0 open-source strategy, and a whole group of companies previously insisting on closed source all switched. The market no longer discussed the question of "open source or closed source"; it shifted to "is your model good enough?"

DeepSeek held no launch event, made no loud noise, but every time it made a move, the way the industry played its cards changed.

7. xAI Is a Different Script

Someone might say: DeepSeek isn't the only idealistic AI company; isn't xAI one too?

When xAI started, its mission was also to use AI to seek truth, and its narrative had some resemblance to DeepSeek's. It was also founder-funded, team-built from scratch, and challenging big tech. But looking at this company today, the story is completely different.

In February 2026, SpaceX announced the acquisition of xAI, merging at a SpaceX valuation of 1 trillion dollars and an xAI valuation of 250 billion dollars, for a combined 1.25 trillion dollars. In March, all 11 of xAI's original co-founders resigned. In May, xAI ceased to exist as an independent company, was merged into SpaceX, and renamed SpaceXAI. Musk turned around and rented most of the 220,000 GPUs to Anthropic.

This is another reasonable script: sufficiently large compute, sufficiently strong capital narrative, ultimately absorbed into a larger parent body. If you look at xAI's current product cadence and research intensity, it no longer looks like the original "research for research's sake."

This isn't saying xAI took a wrong turn; it's saying idealism doesn't automatically lead to DeepSeek's path. Whether a company can maintain research priority and product restraint after raising tens of billions is an independent choice, and the further you go, the harder it becomes.

DeepSeek's contrast lies right here. It has reached the level of a 300 billion RMB valuation, with a funding round of roughly 7 billion USD, and market attention on par with the top AI companies globally. But Liang Wenfeng still stays out of the spotlight. In the recent V4 Preview release notes, he quoted a line from Xunzi's Against the Twelve Masters: "Not tempted by praise, not frightened by slander; follow the Way and act, correct yourself with dignity." He didn't say this for PR; his behavior over these years has been exactly this.

8. Geek Spirit

China's AI industry isn't short of technical people; what's lacking are people possessed by the obsession of changing the world through technology and willing to get their hands dirty.

Wang Jian was the previous representative of this type. He didn't come from a formal computer science background; his 1990 doctorate was in psychology from Hangzhou University, and from 1993 to 1998 he was even head of the psychology department at Zhejiang University. But in 2009, as CTO of Alibaba Software, he presided over the R&D of the Apsara Cloud operating system from scratch. At the time, few believed China could build a cloud OS from the ground up, but he kept his head down and did it. In the end, Alibaba Cloud grew from lines of code to a business worth tens of billions. In 2019, he was elected as an academician of the Chinese Academy of Engineering — the first from a private enterprise background.

My understanding of geek spirit isn't about how hardcore one's technical skills are; it's about believing technology can make the world better, and being willing to roll up your sleeves and do it.

What I admire about DeepSeek is that it can still preserve this at the scale of a 7 billion USD funding round. At a time when everyone is being pushed to commercialize, pushed to become a hot topic, pushed to exit, there are still people willing to treat research itself as the end goal.

China should have more teams like this. Not all need to work on large models; they can do anything. The key is having this degree of purity in what you do.

That 20 billion isn't about the money. It's a declaration: this company will continue down the research path, and I myself am willing to continue walking that path with it.

Whether this company ultimately succeeds or fails, whether V4 or V5 performs well or not next — as long as this declaration continues to be made, as long as this state is maintained, it deserves everyone's attention.

Respect.

References

Original post: https://guanjiawei.ai/en/blog/deepseek-respect

AI Coding: A Meta-Ability

guanjiawei — Tue, 12 May 2026 13:06:25 +0000

Yesterday, while introducing our product to a friend, I suddenly realized something.

Foundational technology is evolving fast—how fast? By the day. But the speed at which the general public—and even many white-collar professionals—adopt these things is far slower than imagined.

Take a recent example. OpenClaw went viral, and everyone was drawn to this technical breakthrough. Someone asked me: "You use Claude Code? That's for writing code, right?"

I said yes.

Their first reaction was almost identical: "That sounds complicated. I probably won't need it in my lifetime."

Not What You Think

I demonstrated in the most intuitive way: installed in 5 minutes, ready to go with just an account, as long as you can talk.

They stared at my terminal for a moment and fired off a series of questions. Is the installation tedious? Is it difficult? What problems can it solve?

I said, installation is just one line of code, 5 minutes. An account costs just a few dozen yuan per month to get started.

They were stunned. "Is that really it? I assumed something this powerful must be complicated to install."

Not at all.

What's even more surreal is OpenClaw itself. Online, there are 15-step, 20-step installation guides that look headache-inducing. Many people complain: the "crayfish" (OpenClaw's nickname) works great, but installation is too difficult. On Xianyu (a Chinese secondhand marketplace), services charging 500 yuan for on-site installation even appeared.

I demonstrated right then and there. When you have Claude Code installed on your computer, you only need to say one sentence—"Help me install OpenClaw on this computer."

It will find out what OpenClaw is, where to download it, and how to deploy it all by itself. You tell it what the Feishu (Lark) App ID is, and it helps you configure and test everything until it's running. You go grab a coffee, and by the time you're back, it's deployed.

One line of code, 5 minutes. That simple.

Why Is AI Coding a Meta-Ability?

Someone asked me: "Does your software have any out-of-the-box applications? Like text-to-image?"

I didn't answer directly. I was thinking about something else.

Suppose everyone mastered an efficient AI coding tool, a coding agent. Then whenever you want something, you simply have the agent install it for you.

Text-to-image? You say: "I want a text-to-image application, download the model and deploy it locally so I can generate images with a single sentence." It might take an hour or two to help you download the model, deploy it, and get it running.

This ability has already transcended the concept of an "application."

It is helping every ordinary person—at an extremely low barrier to entry—truly take control of their computer. By "control," I mean at a level surpassing that of 90% of engineers and senior engineers. And your computer is connected to the internet.

So the old idea of "I need to install an out-of-the-box application" is no longer necessary. You need something, you tell it. It does it for you.

What Can It Do?

Honestly, at this point, if you asked me to use a computer without software like this, I wouldn't know how.

It can unlock the 90% of capabilities on your computer that normally go unused. This is also why edge AI devices should run Linux—Linux is inherently designed for programs to control other programs, and agents experience zero friction on it.

You have 20 Excel files to process? Put them in a folder, toss it to the agent, and tell it how you need them analyzed. It will invoke Python, download plugins if they're missing, and solve problems on its own.
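To make this concrete, here is a rough sketch of the kind of throwaway script an agent might write for that request. It is an illustration only: the function name and column name are hypothetical, and it assumes the spreadsheets have been exported to CSV so that only the standard library is needed. A real agent would more likely install and use a spreadsheet library.

```python
import csv
from pathlib import Path


def summarize_folder(folder: str, column: str) -> dict[str, float]:
    """Sum one numeric column in every CSV file under `folder`.

    Returns a mapping of file name -> column total. Rows where the
    column is missing or not a number are skipped rather than crashing.
    """
    totals: dict[str, float] = {}
    for path in sorted(Path(folder).glob("*.csv")):
        total = 0.0
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                try:
                    total += float(row[column])
                except (KeyError, ValueError, TypeError):
                    continue  # skip blank, missing, or malformed cells
        totals[path.name] = total
    return totals
```

Calling `summarize_folder("reports", "amount")` would return one total per file. The point is less the script itself than the fact that you never have to write or even read it.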

Want to implement some interesting networking features? It can help with that too.

As long as the model is strong enough and fast enough, things that previously required professional skills become within reach.

What Does the Future Look Like?

In the future, people will become increasingly lazy about using computers, unwilling to manually operate them. I rarely actually operate my computer anymore—at most, I open a few Claude Code instances and switch between them to assign different tasks.

The way we work will change.

Before: Human → Tool → Computer.

In the future, it will likely be: Human → Agent → Computer.

What's the point of building a complex interface? It's a burden for AI, and humans don't look at it either. Just give instructions directly to the agent and let it execute. Done.

Software infrastructure may change. I'm not sure exactly what it will look like, but the general direction is probably this.

The Difference Between Claude Code and OpenClaw

Many people ask me this.

Strictly speaking, OpenClaw puts a shell around Claude Code's capabilities. Some call this shell the "crayfish shell"—I think that makes sense.

What Claude Code does: you give it a vague instruction, and it executes end-to-end, largely meeting expectations along the way without needing you to correct its course. It can run for tens of minutes or even hours, and finally gives you a result.

What OpenClaw does: it adds another layer on top—distributing instructions, managing results, forming long-term memory, acting like a project manager or supervisor. It can interact with you at a higher level, becoming your personal assistant.

But the prerequisite for all of this is that the underlying coding agent executes each task beautifully, without you needing to watch over it.

This is also why OpenClaw connected to different models is an entirely different species.

Currently, there aren't many models that can truly "execute well from a vague instruction." In my view, only Claude's Opus and possibly Codex's latest model can do it. Versions connected to other models feel like toys: each execution is unreliable, making the coordination layer above even more difficult. However, model progress is faster than expected, and the once-huge gap in this area is rapidly narrowing.

The reason OpenClaw wowed everyone is that it was connected to Opus. But Opus's coding plan is no longer available: too expensive.

This is the current pain point. I don't know when it will be resolved.

Original post: https://guanjiawei.ai/en/blog/coding-agent-meta-ability

The AI Coding Business Ate Itself

guanjiawei — Tue, 12 May 2026 13:05:05 +0000

Today I got talking with someone about AI coding. I told them it's a strange space. Everyone's bullish on the direction, yet the earliest true believers still can't turn a profit.

I've been in AI from the start, and AI coding was a direction I locked onto early. Back when we were building edge devices, this was the first scenario we went after. But looking back from 2026, trying to build AI coding software looks more and more like a dead end.

The direction isn't wrong. It's too right.

The Completion Era

Roll back to 2021. GitHub Copilot went live. OpenAI trained a model called Codex for Microsoft, wrapped it into an editor, and sold it as code completion. That move started the whole race.

By 2023, Copilot had over a million paying users and $100 million in annualized revenue. At the time, those were eye-popping numbers for a brand-new category. But that year, Microsoft itself was still losing over $20 per Copilot user per month. The subscription price couldn't cover inference costs, but no one cared; grab the ecosystem first.

Then the first wave of startups rushed in. The logic was simple: programming is its own vertical, so why not train a model just for code? General models were expensive and slow; a small, specialized coding model should be cheaper and better.

Microsoft itself released Phi-1, a 1.3B-parameter model that scored 50.6% on HumanEval, matching models ten times its size at the time. In China, Tsinghua and Zhipu's CodeGeeX began iterating in September 2022, with a new generation every six months. A Peking University team called Silicon Intelligence open-sourced the 7B CodeShell at the end of 2023, surpassing CodeLlama-7B and StarCoderBase-7B on HumanEval, and released the IDE plugin alongside it. The shared bet: programming data is concentrated, so a specialized model should beat general-purpose ones on both price and quality.

A whole ecosystem of tool-chain builders sprang up too: evaluation, RAG, vector databases, completion integrations. RAG was hot then, and in programming it gave people plenty to pitch.

But in 2023, there was no real business model yet. Model coding ability was shaky, and developers only used it lightly. The whole sector felt like a sure thing that just hadn't popped.

The Workflow Era

The real inflection point was 2024.

In June 2024, Anthropic dropped Claude 3.5 Sonnet and coding ability jumped to a new level. Anysphere's Cursor was the first to plug this model into its own IDE. It wasn't just an API call; they redesigned the whole workflow around it. Reading context, cross-file edits, multi-turn iterations, acceptance-rate feedback—features that had been scattered across completions were finally strung into an actual pipeline.

The business model changed too. Cursor used monthly subscriptions to subsidize the underlying token costs. You paid them $20 a month, and they bundled in Claude API quota. Buy that much token volume straight from Anthropic and you'd easily pay over a thousand dollars. This arbitrage—selling model capacity cheap to buy growth—pushed user numbers up fast. Anysphere hit $500 million ARR and a $9.9 billion valuation in June 2025; by November 2025, its valuation had reached $29.3 billion.

But that's where the problems started. Cursor's workflow itself had no real moat.

Why didn't anyone copy it right away? Because only Claude 3.5 Sonnet could actually run that workflow; plug in other models and they were either too slow or too stupid. So the second wave of AI coding companies mostly got stuck between two paths: integrate Claude and race Cursor to the bottom on price, or use their own or open-source models—and watch users bounce after one comparison.

The One-Two Punch of DeepSeek and Claude Code

At the end of 2024, DeepSeek V3 went open source. That changed the game.

V3 still hadn't caught Claude 3.5 Sonnet on coding—82.6% against Claude's 93.7% on HumanEval, a gap of more than ten points. But it was absurdly cheap, with token prices about one-forty-second of Claude's. That price-to-performance ratio meant players who couldn't afford Claude before, especially domestic vendors, could suddenly run the same workflow Cursor had built.

During that stretch, pretty much every AI coding software company integrated V3. Their products started looking identical, and the differentiation window slammed shut fast.

Right after that, Anthropic launched Claude Code in February 2025. No IDE, just a CLI. It chats in your terminal, reads your project, edits code, runs tests. When it first came out, I treated it like a demo. After a week of using it, I had a sudden realization: when the model is strong enough, you might not need an IDE at all.

The business numbers were even more striking. Claude Code hit $1 billion in annualized revenue within a year, and by early 2026 estimates had it near $2 billion. That growth rate blew the gap between software makers and model makers wide open.

The CLI form itself has no software moat either. Open source quickly spit out OpenCode, Aider, Cline, and a bunch of others; Cursor later started moving toward CLI too. The result: even the IDE layer, the last place left to differentiate, got flattened.

All Three Layers Were Eaten

Look back at what got eaten.

The first wave of small coding model companies is basically silent now. The original logic was "general models aren't specialized, so I'll build a small one." But every new general model release treated coding as a core benchmark to beat, so the window where specialized small models had an edge barely opened. CodeShell, CodeGeeX, Phi—today they're footnotes in papers, not products.

The middleware folks building toolchains and workflows had it even worse. As models got stronger, all the engineering value built around patching model weaknesses evaporated overnight. RAG integrations, prompt orchestration, and completion enhancers that were selling six months ago were rendered useless by the next model drop.

Even standalone IDEs started feeling the heat. Cursor is still growing, but its core paradigm is being drained away by CLI tools like Claude Code. When the model itself can do most of the work inside a terminal, the IDE's value gets pushed back to the margins.

It isn't that these companies did anything wrong. They picked the right direction and nailed every inflection point. The problem is that this sector is too important and too lucrative, so every foundation model company treats it as a primary battleground. And a primary battleground means the layers above it get cleared out.

From Corsets to Bras

I've been thinking about this recently, and it reminded me of a story from fashion history.

In the early 20th century, women in Europe and America still wore corsets—stiff structures built with steel boning. It was a centuries-old industry, and countless small workshops lived off it. In 1914, a New York woman named Mary Phelps Jacob made a simple brassiere from two handkerchiefs and ribbon, and patented it. No one took it seriously at the time.

The real turning point was World War I. The U.S. War Department asked women to stop wearing corsets—not because they cared about women's comfort, but because the steel bones had to be melted down for weapons. One order killed the entire market.

Most corset makers died. But one company, Warner Brothers Corset, read the shift and switched to bras, hitting $12.6 million in annual revenue by the 1920s.

What's interesting is that the underlying demand never changed. Women still wanted to look good; only the vessel changed. First corsets, then bras. The corset makers died because they were defending a product shape, not a need.

AI coding is at that stage now. Everyone who bet on this direction five years ago was right. Developers using AI to write code, efficiency up tenfold, production relations changing—it's all happening. No one called it wrong. But the vessel carrying that demand is no longer IDE plugins, workflow software, or specialized small models. It's converging fast on "foundation model + CLI/agent."

Everything caught in the middle is being fast-tracked to obsolescence.

What Can Still Be Sold

The sector has been eaten, but the demand is still there. From another angle, what's left to sell probably collapses into three buckets: quality, cost, and everything else.

First, quality. Programming is high-stakes; one bad line can blow up production. Developers will keep paying extra for output that's more accurate and more reliable. That demand will only grow, but the carriers are mostly foundation model companies. If you can push your model to the front of the pack, the business works. The so-called "high-quality token" is essentially this tier.

Second, cost. Same capability, lower price wins. Open-source models, inference optimization, caching, scheduling—every layer has room to cut costs. This is actually an opening for application-layer companies, because not every model company is good at engineering optimization, and not all of them want to push prices to the floor.

Third, everything else. This bucket is bigger than the first two combined. Right now most teams use AI coding chaotically—everyone runs their own setup, and the organization never gets any real efficiency. Stitching together workflow, code review, knowledge retention, and accountability with agent work patterns is a real need. Go further out and security, privacy, recoverability, and compliance—these haven't really unfolded yet, but they'll soon be the core reasons big companies pay. This doesn't look like traditional SaaS; it's more like consulting, engineering, and training rolled together, and the pricing has to be rebuilt from scratch.

No matter which bucket, the old model of "how much for this software package" is basically broken. You either convert to tokens, or convert to services.

Closing Thoughts

We've watched this pattern repeat throughout the AI wave. The more certain your directional bet, the faster your original business form dies. Getting the direction right means the space matters; if it matters, foundation model companies will turn it into a primary battleground; and that means every layer above gets stripped away.

AI coding is the clearest example of this pattern so far, but probably not the last. In the coming months I expect AI customer service, AI design, AI sales, and other domains to go through the same cycle.

Small companies can turn on a dime. The most dangerous spot is the middle—mid-sized companies with sales, delivery, and customer-success teams already built around selling software. They're the ones who find it hardest to walk away from their engineering assets and pick a new angle.

If you're grinding away in AI coding too, or you've seen a similar "being eaten" process in another sector, feel free to leave a comment.

References

Original post: https://guanjiawei.ai/en/blog/ai-coding-eats-itself

Codex 5.5: Don't Trust the Version Number

guanjiawei — Tue, 12 May 2026 12:14:44 +0000

OpenAI quietly launched GPT-5.5 on April 23, codenamed SPUD (potato). The version number was conservative—5.4 to 5.5, a mere 0.1 bump. After two weeks of use, my gut feeling is that the actual leap is far wider than that. It is the first base model fully retrained since GPT-4.5, not a fine-tune on top of 5.4.

The change in my own behavior is the most direct proof. I was a heavy Claude Code user; now I barely open it. The tab is still there, used only in a handful of scenarios. The rest of the time, I'm in Codex 5.5.

The reason for switching isn't that it's 5% or 10% better. It's that several of Claude Code's most frustrating weaknesses were fixed in one go with 5.5.

1. The Moments That Made You Want to Hit "Deny" Are Gone

Anyone who's used Claude Code has felt this: when you first start, it asks for permission for everything. Can it read this file? Run this command? Connect to the internet? Every step requires a click.

Then people realized: they were basically clicking "Allow" every time. So they flipped the switch to dangerously bypass permissions and let everything through by default.

Indirectly, this shows that the tool was mature enough. You hadn't genuinely wanted to hit "Deny" in a long time. It doesn't go rogue, or at least it doesn't easily break things. But it also often struggled to get things right on the first try. Claude's hallmark is speed: it can show you something within minutes, but when you review it, obvious problems are sitting right there; it simply missed them.

This "fast but inaccurate" experience is the biggest weakness of this generation of Claude Code.

What surprised me about 5.5 is that it fixed several long-standing weaknesses all at once.

The process started speaking human. Previously, Codex's intermediate steps were an opaque blob of internal output; even if you stared at it, you couldn't tell what was going on. After 5.5, it tells you what it is doing at each step and why. This doesn't sound complicated, but it makes a huge difference for long-running tasks.

No more dithering. Before, it felt like a task would try left, then try right, taking a huge detour before finding the right spot. Now its judgment is clearer; it knows when to accelerate. If it sees a task is waiting, it will go do something else in parallel and prepare what the next step needs ahead of time. I had never seen this behavior before, and it surprised me the first time. End-to-end execution efficiency took a step up because of it.

Next-step suggestions suddenly became valuable. Codex has always had a habit: at the end of every task, it offers a next step. In the GPT-5.4 era, this suggestion was often filler. After 5.5, I found its proposals became quite reliable. Working continuously on one project for eight hours, I basically clicked "continue" every time. This means its ability to grasp "high-value directions" has come online.

The old compaction problem is finally solved. I didn't use 5.4 deeply enough to be sensitive to compaction issues. But after switching over in earnest, I found that long sessions inevitably require context compression. Codex's compression is much more stable than Claude Code's. Most information survives the squeeze; with Claude Code, a single compaction can wipe out half of the context.

This has been studied and compared. Codex hands the entire conversation to the model to rewrite into a structured "handoff summary" carrying metadata and tool states; Claude Code uses a hierarchical human-readable summary, which is legible to the naked eye but inherently lossy. Both approaches have their merits, but in actual long sessions, my sense is that Codex loses less information.
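To picture what a structured handoff summary might carry, here is a small sketch. It is purely illustrative: the class names and fields are hypothetical and are not the format either tool actually uses. The idea it shows is that goals, decisions, open tasks, and tool states each get their own slot, so they are less likely to be dropped than they would be in a free-form recap.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolState:
    name: str            # hypothetical tool identifier, e.g. "shell"
    last_command: str    # most recent invocation worth remembering
    succeeded: bool      # whether that invocation worked


@dataclass
class HandoffSummary:
    goal: str
    decisions: list[str] = field(default_factory=list)
    open_tasks: list[str] = field(default_factory=list)
    tool_states: list[ToolState] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize the summary so it can seed a fresh context window."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)
```

A prose summary can cover the same ground, but nothing forces it to, which is one plausible reason it loses more detail in long sessions.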

The 5.4-era problem of compaction often being interrupted by network issues is also completely fixed. Sessions can run continuously for an entire day with almost no attention needed.

2. The Shape Has Changed

Back in 5.4, research capability was already quite strong. Once 5.5 fixed the weaknesses, its entire shape changed.

I can now have it complete complex tasks with a very high degree of automation—provided the task can be verified end-to-end. For example, performance tuning at the operator layer. These scenarios have fine granularity, fast iteration, and A/B comparability. 5.5 runs them very professionally. It proactively designs queries, runs improvements, and then verifies whether the next step actually got better. The whole chain pushes forward on its own. Without a human present, it can run dozens of rounds continuously.
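The loop described here (propose a change, measure it, keep it only if the measurement improved) is essentially a verified hill climb. Below is a minimal sketch under simplifying assumptions: a single numeric parameter, a `measure` function standing in for a real benchmark, and random proposals standing in for the model's ideas. None of this is how Codex works internally.

```python
import random
from typing import Callable


def tune(
    measure: Callable[[float], float],
    start: float,
    rounds: int = 30,
    step: float = 0.5,
    seed: int = 0,
) -> tuple[float, float]:
    """Hill-climb one parameter, keeping a change only if the cost improves.

    `measure` returns a cost where lower is better, e.g. kernel latency.
    """
    rng = random.Random(seed)
    best_param, best_cost = start, measure(start)
    for _ in range(rounds):
        candidate = best_param + rng.uniform(-step, step)  # propose a variant
        cost = measure(candidate)                          # run the A/B measurement
        if cost < best_cost:                               # verify it actually got better
            best_param, best_cost = candidate, cost
    return best_param, best_cost
```

What makes such tasks safe to automate is the `if cost < best_cost` line: every step is checked against a measurement, so a bad idea costs one round rather than derailing the run.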

It's now hard to trip it up in these dense loops. The numbers OpenAI published support this intuition: 82.7% on Terminal-Bench 2.0, 51.7% on FrontierMath tiers 1–3, and 95% on ARC-1 reasoning tasks. Feedback from researchers is even more direct: people are starting to let it run experimental variants overnight and come back in the morning to completed sweep dashboards.
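The loop described above (propose a change, run it, keep it only if the measurement improved) can be sketched in a few lines. This is a minimal illustration under my own assumptions, not how Codex works internally: `tune`, `toy_measure`, and `toy_propose` are hypothetical names, and the "operator" is a toy cost function whose best tile size is 64.

```python
import random
from typing import Callable


def tune(
    baseline: dict,
    propose: Callable[[dict, random.Random], dict],
    measure: Callable[[dict], float],
    rounds: int = 30,
    seed: int = 0,
) -> tuple[dict, float, list[float]]:
    """Greedy A/B loop: keep a candidate only if it measures strictly better."""
    rng = random.Random(seed)
    best, best_cost = baseline, measure(baseline)
    history = [best_cost]
    for _ in range(rounds):
        candidate = propose(best, rng)   # design the next experiment
        cost = measure(candidate)        # run it end-to-end
        if cost < best_cost:             # verify: did it actually get better?
            best, best_cost = candidate, cost
        history.append(best_cost)        # a rejected candidate changes nothing
    return best, best_cost, history


# Toy stand-ins (hypothetical): a "kernel" whose cost is lowest at tile size 64.
def toy_measure(cfg: dict) -> float:
    return abs(cfg["tile"] - 64) + 1.0


def toy_propose(cfg: dict, rng: random.Random) -> dict:
    step = rng.choice([-16, -8, 8, 16])
    return {"tile": max(8, cfg["tile"] + step)}


best, cost, history = tune({"tile": 8}, toy_propose, toy_measure)
print(best, cost)
```

The point of the sketch is the `if cost < best_cost` line: because every round is measured against the current best, a bad proposal costs one round and nothing more, which is what makes dozens of unattended rounds safe.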

It's far from perfect, of course. I've noticed two clear ceilings.

The first is big-picture systemic thinking. Ask it to look at a system from the architectural level, and its field of view is limited; the solutions it gives often stay local.

The second is innovation and creativity. If you hand a problem off to it completely, the solution it gives will most likely be by-the-book. You have to throw your own ideas at it first and let it expand and evaluate along your line of thought.

But even with these shortcomings, the whole thing already feels different. You used to assume that if you didn't watch it, it would drift off course. Now you find that after running for several hours, looking back, the vast majority of what it did was reasonable. On the main thread of writing code, it spontaneously prioritizes verification and cautiously advances step by step, rather than charging forward recklessly.

This side of it suddenly made one thing clear to me: the design philosophy of agents has actually already solidified.

3. "Assume the Next Step Will Fail" Is a Philosophy

From Claude Code to Codex, the design of this class of coding agents is built on the same assumption: the next step might fail.

The entire engineering effort is constructed around this assumption. How do I do this step well given that the next step might fail? If it really fails, do I have a path to roll back or bypass? Tool call didn't succeed? Try a different angle. What was written doesn't match expectations? Go back and re-verify.

It sounds simple, but this is what 98% of the code in Claude Code does. Permission gates, context management, tool routing, error recovery logic: most of it is deterministic engineering infrastructure. The model itself is just one component in this harness.
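To make the "next step might fail" assumption concrete, here is a minimal sketch of a single harness step. It is my own illustration, not code from Claude Code or Codex: `Step`, `run_step`, and the three toy attempts are hypothetical. Each approach runs against a copy of the state, only a verified result is committed, and if everything fails the original state comes back untouched.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    attempts: list[Callable[[dict], dict]]  # primary approach first, then fallbacks
    verify: Callable[[dict], bool]          # deterministic check on the new state


def run_step(state: dict, step: Step) -> tuple[dict, Optional[str]]:
    """Try each approach on a copy of the state; commit only a verified result."""
    for attempt in step.attempts:
        snapshot = dict(state)      # shallow copy: enough for a flat state dict
        try:
            candidate = attempt(snapshot)
        except Exception:
            continue                # tool call failed: try a different angle
        if step.verify(candidate):
            return candidate, attempt.__name__
        # output did not match expectations: discard it and move on
    return state, None              # every approach failed: state is unchanged


# Toy attempts (hypothetical): one crashes, one is wrong, one is correct.
def fast_but_flaky(state: dict) -> dict:
    raise TimeoutError("simulated tool failure")


def wrong_output(state: dict) -> dict:
    state["result"] = -1
    return state


def careful(state: dict) -> dict:
    state["result"] = state["x"] * 2
    return state


step = Step(
    name="double",
    attempts=[fast_but_flaky, wrong_output, careful],
    verify=lambda s: s.get("result") == s["x"] * 2,
)
initial = {"x": 21}
final, used = run_step(initial, step)
print(final, used)
```

Note that nothing in `run_step` depends on a model. The retry order, the snapshot, and the check are all deterministic plumbing, which is the kind of infrastructure the paragraph above is talking about.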

What 5.5 showed me is that this philosophy is starting to bear fruit. In its execution, "verify first, then advance" has become a spontaneous choice. It cares about making sure every step holds up, not about how far it can sprint in one go.

This is the so-called "slow is fast."

Our previous expectation of agents was always "Hey, charge straight through and give me a result in half an hour." Then you look back and it's riddled with errors; the time spent fixing things is 3× or 5× the original. It feels good in the short term, but in the long run it's actually the slowest approach.

Once every small step is steady, the thing to do is figure out how to make it fast. Fast mode, parallel subtasks, 24/7 uninterrupted operation—these multipliers only work if the underlying steps don't go wrong.
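A back-of-the-envelope way to see why per-step reliability comes first: if a run is a chain of steps that each succeed with probability p, and (a simplifying assumption of mine) failures are independent and never recovered, the chance the whole chain comes out clean is p raised to the number of steps. The function name `chain_success` is hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n independent steps all succeed, with no recovery."""
    return p_step ** n_steps


for p in (0.95, 0.99, 0.999):
    print(f"p={p}: 100 steps -> {chain_success(p, 100):.3f}")
```

Under that toy model, a 100-step run finishes clean well under 1% of the time at 95% per-step reliability, about 37% of the time at 99%, and about 90% of the time at 99.9%. Real agents verify and recover along the way, so the true picture is better than this, but the direction holds: speed and parallelism multiply whatever the underlying steps already are.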

4. Why I Say This Is the Eve of an Explosion

Stringing together the facts above:

It can already surpass a considerable portion of domain experts.
It can work tirelessly for hours on end.
It can do fairly complex research-grade work, as long as the task can be verified end-to-end.
Its cost hasn't risen significantly. For the same Codex task, 5.5 uses even fewer tokens than 5.4.

What does this mean? It means the explosion will happen first in domains that are "fine-grained, verifiable, and iterable": operator optimization, performance tuning, research in specific subdomains, automated workflows. Add a tireless colleague whose reasoning approaches that of top experts and who can design its own experiments, and the speed of progress in these areas will be frightening.

Beyond that line—big-picture judgment for large systems, original ideas, cross-domain synthesis—the human role remains vital. But with every minor version iteration, the model shifts a little further in that direction.

Three months ago, I was having dinner with a friend who posed a question: coding agents are so strong now, but have you actually seen people without an engineering background ship things with them?

At the time I said, "It'll happen gradually." Now it's already happening. A yoga studio owner with zero programming experience used Lovable to build their own booking system in two hours—login, scheduling, payments, all included—and launched it that same week. Examples like this are multiplying. Inside NVIDIA, over ten thousand employees use Codex for daily work, spanning engineering, legal, finance, and operations. Codex is no longer just a tool for engineers.

I believe the next step will go even further. These non-engineering people will start shipping real research results. They'll first appear in those fine-grained, verifiable, iterable subdomains.

5.4 showed me capability gains. 5.5 showed me the path. Agents evolving under this design philosophy will genuinely transform the shape of knowledge production.

If the current pace of development isn't interrupted, the knowledge explosion will arrive faster than many expect.

A leader from China's Ministry of Industry and Information Technology (MIIT) once asked me which directions are worth watching. I said at the time to keep an eye on the GPT-5 line; its capability trajectory is different from that of Claude Code and of the earlier generation of lightweight agents, and the types of problems it can solve are different too. After 5.5 came out, I confirmed this judgment again. These are two different evolutionary paths, and they need to be viewed separately.

Next, there will definitely be a large wave of interesting products and results shipped to the market. We'll see in three months.

References

Original article: https://guanjiawei.ai/en/blog/codex-5-5-paradigm-shift