guanjiawei

Posted on • Originally published at guanjiawei.ai

On LLM Pricing: Supply Is Locked by Chips, the Rest Is Business Philosophy

I recently chatted with some friends at model companies and found that everyone is struggling with pricing.

Back to the Economics Textbook

In economics, price is fundamentally a supply-demand regulator. When things get expensive, demand drops and supply rises, ultimately shrinking transaction volume; when things are cheap, the reverse happens. After a few rounds of tug-of-war, the market finds an equilibrium: no one wants to raise prices further, and no one wants to lower them either.

The strange thing about large language models is that neither side is playing by these rules.

Supply Is Locked by Chip Controls

Domestic market demand for models is polarized: demand for good models is ridiculously high, while slightly weaker models barely get used. The main driver behind this is coding—once programming takes off, agents follow suit, and capabilities like tool calling and long-chain reasoning depend heavily on the very top-tier models. This is completely different from the old landscape where "chat APIs work for everyone."

But supply can't keep up.

Top-tier inference requires NVIDIA Hopper (H100/H200) or the newer Blackwell (B200/B300). The current status of these chips: B200 and B300 are directly embargoed by the US; H200 was partially loosened under new rules in January 2026, but carries a 25% export tax and total volume is capped at half of historical shipment levels. Domestic filing and approval processes are also extremely strict—earlier this year, Shenzhen customs reportedly refused H200 customs declarations outright.

The domestic replacement path can't be rushed either. Huawei's most powerful Ascend 910C has a target production capacity of 600,000 units this year, which sounds like a lot. But HBM is the real bottleneck—CXMT's HBM capacity is only enough to ultimately package roughly 250,000 to 300,000 910C units. The next generation of domestic chips won't see mass delivery until around 2027. Semiconductor capacity ramp-up is slow work; neither technological breakthroughs nor joint venture signatures can accelerate it.

This creates a problem. Even if you raise prices by 10x, the market can't conjure up more B300s. Supply in this market has almost no elasticity.

In the short term, only two layers can move: the model layer making inference cheaper and faster, and the infra layer making interconnect and scheduling across cards more efficient. Both are happening, and the effects are immediate: over the past three years, inference costs for large models dropped from $20 per million tokens to $0.40 per million tokens, roughly a 50x reduction. But notice: this is the entire supply curve shifting downward, not supply elasticity improving. Technological breakthroughs happen when they happen; they have nothing to do with how much you're willing to pay.

Demand Has Been Carved into Three Completely Different Shapes

The demand side is even more interesting.

The top-tier users have virtually no price elasticity. If your model can genuinely solve problems they can't solve themselves, a 10% price increase is completely irrelevant. This resembles the early GPT-4 era—that small group of people truly using AI to reconstruct their workflows cares about capability boundaries, not unit price. Seedance 2.0 is a fairly typical example: roughly 1 RMB per second, which doesn't sound cheap. But for a commercial user who would otherwise spend over 10,000 RMB to produce a polished video, this price is utterly irrelevant—whether it rises to 1.5 RMB or drops to 0.8 RMB per second, they don't feel it.

The least intensive users are extremely sensitive. Chinese internet history has cultivated a habit of "try it for free first." These users might have been customers of DeepSeek's bargain-basement API ($0.27 per million tokens input, $1.10 per million tokens output), where a few dozen RMB could last them for months. In their eyes, a 9.9 RMB membership or first-batch token giveaways are worth spending an entire evening figuring out how to optimize. This layer is genuinely sensitive.
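To make the price sensitivity of this layer concrete, here is a minimal sketch of how far a small prepaid budget stretches at the bargain-basement rates quoted above. The per-token prices come from the post; the daily usage profile and the RMB/USD rate are illustrative assumptions.

```python
# How long a small prepaid budget lasts at bargain API rates.
# Prices from the post: $0.27 per million input tokens, $1.10 per million
# output tokens. Usage volume and exchange rate below are assumptions.

INPUT_PRICE = 0.27 / 1_000_000   # USD per input token
OUTPUT_PRICE = 1.10 / 1_000_000  # USD per output token

def months_of_use(budget_rmb, daily_input_tokens, daily_output_tokens,
                  usd_per_rmb=0.14):
    """Months a prepaid RMB budget lasts at a given daily token volume."""
    daily_cost = (daily_input_tokens * INPUT_PRICE
                  + daily_output_tokens * OUTPUT_PRICE)
    return (budget_rmb * usd_per_rmb) / (daily_cost * 30)

# A casual user burning ~200k input / 50k output tokens a day (assumed):
print(round(months_of_use(50, 200_000, 50_000), 1))  # prints 2.1
```

Under these assumptions, 50 RMB lasts a couple of months, which is exactly why a 9.9 RMB membership feels like a decision worth optimizing for an evening.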

The truly difficult-to-price layer is the one in the middle. Heavy-duty programming engineers, researchers, and agent tinkerers. They care about price, but care even more about the value of output per unit of time. Pricing for this layer gets tricky, and it's precisely this layer that has sparked the Coding Plan controversy.

Is Coding Plan a False Proposition?

Recently I've repeatedly heard a view in my social circle: Coding Plans shouldn't exist at all.

This view isn't without merit. The reasoning: a model's cost is token cost, regardless of whether you subscribe monthly; fundamentally, it's the same as electricity. And in practice there have indeed been plenty of shenanigans. Some companies offer Coding Plans but can't actually serve the promised quotas: users are unable to connect for half of a 5-hour window and are throttled for the rest, and the so-called "quota exhausted" message amounts to a fraudulent experience.

Hearing this debate, I had a strong sense of déjà vu. It's the same old argument from the platform-economy years about "membership vs. commission," just wearing a different shell.

My conclusion is similar to what it was back then: both models have a reason to exist.

The benefit of Coding Plans isn't to make things easier for model companies, but to give buyers peace of mind. Note that the user is not necessarily the customer—especially in B2B scenarios, employees are users while the enterprise is the customer. If an enterprise wants to equip 100 engineers with AI coding assistants, token-based billing alone would torment finance and IT with budget management. Giving everyone a fixed-quota Coding Plan essentially encapsulates the uncertainty of token billing entirely, providing the enterprise with a hard budget constraint. This is what Coding Plans actually solve: not pricing, but uncertainty management.
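The "uncertainty management" point can be sketched numerically: simulate the monthly bill for 100 engineers under per-token billing versus a flat per-seat plan. All numbers here (seat price, blended token price, the heavy-tailed usage distribution) are assumptions for illustration, not any vendor's actual rates.

```python
# Sketch: monthly spend variance under per-token billing vs a flat plan
# for 100 engineers. All prices and the usage distribution are assumed.
import random

random.seed(42)
ENGINEERS = 100
PLAN_PRICE = 200.0     # flat monthly price per seat, USD (assumed)
PRICE_PER_M = 5.0      # blended USD per million tokens (assumed)

def per_token_monthly_bill():
    # Heavy-tailed usage: most engineers are light users, a few are extreme.
    return sum(random.lognormvariate(2.5, 1.2) * PRICE_PER_M
               for _ in range(ENGINEERS))

bills = sorted(per_token_monthly_bill() for _ in range(1000))
flat_budget = ENGINEERS * PLAN_PRICE   # the same number every month
print(f"flat plan:  {flat_budget:,.0f} USD, every month")
print(f"per-token:  median {bills[500]:,.0f} USD, p95 {bills[950]:,.0f} USD")
```

Finance doesn't fear a large number; it fears a number it can't forecast. The flat plan trades some efficiency for a hard budget constraint, which is the whole sale.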

Think about it—this is exactly the essence of the Costco model. Costco's membership fees contribute roughly two-thirds of its net operating profit, while the stores themselves basically break even. But what membership brings is something else: it filters out the core users with high frequency, high repurchase rates, and high average order values, front-loading the "decide whether to come" decision to a single annual moment, eliminating subsequent decision costs for each purchase. This is precisely what Coding Plans do in B2B scenarios.

Anthropic and OpenAI Represent Two Business Philosophies

Looking back at the evolution of the Coding Plan product is quite interesting.

ChatGPT launched with both API and membership tracks from day one. But early ChatGPT Plus wasn't primarily about capped monthly subscriptions—it was "pay for membership to access better models." It was a consumer benefits-oriented membership, with caps merely serving as throttling limits. The company that truly established the name "Coding Plan" was Anthropic, following the Claude Code product, with Claude Code alone reaching an annualized revenue of $2.5 billion by February 2026.

Behind this lie two fundamentally different commercial logics for Anthropic and OpenAI.

Anthropic is playing a focus game: aiming at the heaviest, most willing-to-pay, stickiest users and pushing model quality to the limit. Its business model closely resembles Costco's: make money from membership fees rather than individual transactions, filter the right people in, then lock them in with an extremely strong product experience. Those people are software developers. So you see Anthropic's ARR reach $30 billion in April 2026, surpassing OpenAI's $25 billion for the first time, with 80% of revenue coming from enterprises.

OpenAI is playing a coverage game: broad reach. ChatGPT's weekly active users have exceeded 900 million, and it is going all-in on multimodality. Its consumer-origin logic means its membership corresponds to broader "usage rights" rather than token-package quotas. The two approaches have grown into completely different shapes over the past year.

But once the Coding Plan model targeting heavy users truly worked, a problem emerged: users were too heavy, so heavy that even Anthropic itself couldn't afford to serve them. On April 4 this year, Anthropic formally banned Claude Pro and Max subscriptions from being used in third-party agent frameworks like OpenClaw, with a blunt reason: subscriptions weren't built for the usage patterns of these third-party tools. At the time, an estimated 135,000 OpenClaw instances were running, with subscription prices and equivalent API costs differing by more than 5x. This was essentially a group of people using Anthropic's money to do things Anthropic was unwilling to do; getting cut off was only a matter of time.

Some domestic companies originally benchmarking against OpenAI are also beginning to shift toward the Anthropic direction. Zhipu raised GLM Coding Plan prices twice in the past six months, in February and April, with subscriptions rising 30% to 60% overall and enterprise API prices increasing 67% to 100%. This isn't simply about wanting to charge more; supply is genuinely so tight that price adjustments have become unavoidable.

So What?

Putting all of the above together, we can arrive at a few unsexy but fairly solid conclusions:

Supply won't become elastic in the short term. Rising prices are the trend. What model companies can do is make models smaller, inference faster, and infra more efficient, all of which push the supply curve downward as a whole. But this is entirely different from "raising prices gets you more supply."

The three-layer demand split makes "unified pricing" impossible. The top tier doesn't care that it's expensive, while the most casual users won't accept any charge at all. The middle layer is where the real negotiation happens, and it is further divided into two payment scenarios: individual customers paying out of pocket, and employees paid for by enterprises. The former is better suited to usage-based billing, the latter to Coding Plans.
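The individual-vs-enterprise split has a simple arithmetic core: a break-even usage level above which a flat plan beats paying per token. The prices below are assumptions for illustration, not any vendor's actual rates.

```python
# Break-even point between usage-based billing and a flat plan.
# Both prices are illustrative assumptions.

PLAN_PRICE = 100.0   # flat monthly subscription, USD (assumed)
API_PER_M = 3.0      # blended API price, USD per million tokens (assumed)

def cheaper_option(monthly_million_tokens):
    """Return which billing mode is cheaper at a given monthly volume."""
    api_cost = monthly_million_tokens * API_PER_M
    return "per-token" if api_cost < PLAN_PRICE else "flat plan"

break_even = PLAN_PRICE / API_PER_M
print(f"break-even at ~{break_even:.1f}M tokens/month")
print(cheaper_option(10))    # light individual use -> prints "per-token"
print(cheaper_option(200))   # heavy enterprise seat -> prints "flat plan"
```

Individuals hovering below the break-even line churn the moment they see a price list; enterprise seats sit far above it, which is why the two scenarios want different billing models.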

The two models aren't mutually exclusive; they serve different purposes. Those shouting "Coding Plans should disappear" are underestimating the uncertainty management needed in enterprise scenarios. Those shouting "usage-based billing is the only real deal" are underestimating the instinctive "see the price list, then churn" reaction in individual customer scenarios.

If the goal is democratized AGI, the Coding Plan model won't work: one person consuming the compute of a hundred is unsustainable no matter how much you subsidize. If the goal is the ultimate experience, usage-based billing can't retain enterprises; it's not a money issue, it's a budget-management issue. It's nearly impossible for any single company to walk both paths simultaneously, which is why we've seen Anthropic and OpenAI diverge to this degree.

When I look at LLM company pricing now, it's similar to looking at Costco versus Walmart in retail—there's no right or wrong, just different philosophies. Which path gets walked to the end depends on which type of user the company wants to capture, and whether it can continuously deepen the value for that group.


Originally published at https://guanjiawei.ai/en/blog/llm-pricing-no-silver-bullet
