The 10x-a-Year Price Collapse Is an Architecture Bet, Not a Prompt Trick

#ai #technology #opinion

Somewhere right now an engineer is spending a Tuesday afternoon rewriting a system prompt to save 180 tokens per call. The diff is real, the savings are real, and the effort is almost entirely wasted. Not because tokens are free— they aren't— but because the entire cost curve under that optimization is collapsing at a rate that makes the rewrite irrelevant before it ships to production.

This is the part of the AI build-out almost nobody prices into their roadmap: the cost of a fixed capability tier falls roughly 10x per year. Not the frontier, which keeps moving and stays expensive. The fixed tier. "GPT-4-quality output" cost one thing in 2023 and a small fraction of that today, for the same answer. At 10x per year the price halves roughly every three and a half months, so optimizing around it is optimizing around a number that will not sit still long enough to matter.
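To put a number on that half-life, here is a back-of-the-envelope sketch. It assumes a smooth exponential decline, which real price cuts don't follow (they arrive in steps), and the function name is mine, not something from the cited sources.

```python
import math


def price_half_life_months(annual_decline_factor: float) -> float:
    """Months for a price to halve if it falls by this factor every year.

    Assumes a smooth exponential decline; real price cuts arrive in steps.
    """
    if annual_decline_factor <= 1:
        raise ValueError("annual_decline_factor must be greater than 1")
    return 12 * math.log(2) / math.log(annual_decline_factor)


if __name__ == "__main__":
    for factor in (10, 50):
        months = price_half_life_months(factor)
        print(f"{factor}x per year -> price halves every {months:.1f} months")
```

At 10x per year the half-life comes out to about 3.6 months. Faster decline rates shorten it further.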

The number is real, and it is brutal

Start with the canonical framing. Andreessen Horowitz's Guido Appenzeller published "LLMflation" on November 12, 2024, and the headline finding has held up: the cost for an LLM of equivalent performance falls about 10x per year. The example is almost absurd: GPT-3-level quality (roughly MMLU 42) cost about $60 per million tokens in November 2021 and dropped to about $0.06 per million with Llama 3.2 3B. That's a ~1,000x collapse in three years. The harder MMLU-83 tier fell ~62x after GPT-4 launched in March 2023.

Epoch AI's price-trend analysis (data-insight dated March 12, 2025) sharpens it and adds a necessary caveat: the decline is rapid but uneven, ranging from 9x to 900x per year depending on which benchmark you pin to. The median is around 50x per year, rising toward 200x per year for data since January 2024. GPT-4-level performance on GPQA-Diamond got roughly 40x cheaper per year. Pick your benchmark and you pick your multiplier, but every honest version of the number is enormous.

And from the Stanford AI Index (2025 vintage): the cost to run GPT-3.5-quality output (about 64.8% on MMLU) fell from $20 per million tokens in November 2022 to $0.07 per million by October 2024. That's more than 280x for the same tier of answer. The figure is usually quoted as "roughly 18 months," though the dates above are about 23 months apart.
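The sources quote declines over different windows, so it helps to put them on the same per-year footing. A minimal sketch, assuming a constant exponential rate between the two quoted price points. The month counts are my reading of the dates above, not figures taken from the reports.

```python
def annualized_decline(price_start: float, price_end: float, months: float) -> float:
    """Per-year price decline factor implied by two price points."""
    if price_start <= 0 or price_end <= 0 or months <= 0:
        raise ValueError("prices and months must be positive")
    return (price_start / price_end) ** (12 / months)


if __name__ == "__main__":
    # (label, start $/M tokens, end $/M tokens, assumed months between quotes)
    observations = [
        ("GPT-3 tier, three years from Nov 2021", 60.00, 0.06, 36),
        ("GPT-3.5 tier, Nov 2022 to Oct 2024", 20.00, 0.07, 23),
    ]
    for label, start, end, months in observations:
        factor = annualized_decline(start, end, months)
        print(f"{label}: about {factor:.0f}x per year")
```

Under these assumptions the first row works out to about 10x per year and the second to about 19x per year, so even the conservative reading sits at or above the headline number.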

The frontier is a moving target that stays expensive. The capability you actually shipped against is a falling floor. Confusing the two is the most common strategic error in AI products right now.

Look at today's price sheet

As of June 2026 the spread tells the whole story. The frontier, GPT-5.5, runs about $5 in / $30 out per million tokens (Artificial Analysis, April 23, 2026). Meanwhile DeepSeek V4-Flash is $0.14 in / $0.28 out, and a stunning $0.0028 per million on cache hits. Kimi K2.6 from Moonshot sits around the #4 slot on the Intelligence Index (~43) at $0.95 / $4.00, which blends to roughly $1.70 per million at a 3:1 input-to-output mix. Gemini 3.1 Pro reportedly lands near $2 / $12 (secondary reporting, so treat the exact figures as soft).

Stare at that gap. The cheap tier is now roughly 35x below the frontier on input and more than 100x below it on output, and it is itself running last year's frontier-class reasoning. That is the curve, frozen into a single pricing page. Whatever you're paying top-of-market for in June 2026, something will do 90% of it for a tenth of the price by mid-2027. That isn't optimism. It's the base rate.
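The gap is easy to quantify from the prices quoted above. A rough sketch: the 3:1 input-to-output weighting used for the blended figure is an assumption about workload mix, and your own ratio may differ a lot.

```python
# Prices in USD per million tokens as (input, output), as quoted in the text.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "DeepSeek V4-Flash": (0.14, 0.28),
}


def blended_price(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price for an assumed input:output token mix."""
    total = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total


def price_gap(frontier, cheap):
    """How many times more the frontier costs on input, output, and blended."""
    f_in, f_out = PRICES[frontier]
    c_in, c_out = PRICES[cheap]
    return {
        "input": f_in / c_in,
        "output": f_out / c_out,
        "blended": blended_price(f_in, f_out) / blended_price(c_in, c_out),
    }


if __name__ == "__main__":
    for kind, ratio in price_gap("GPT-5.5", "DeepSeek V4-Flash").items():
        print(f"{kind}: frontier costs about {ratio:.0f}x more")
```

On these numbers the frontier costs about 36x more on input, about 107x more on output, and about 64x more at the assumed 3:1 mix.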

Why this is an architecture decision

If you genuinely believe the 10x number— and the data leaves little room not to— then the correct response is structural, not tactical. The token you shave today is a rounding error against a price that's going to fall by an order of magnitude anyway. What actually compounds is whether your system can swap the model underneath it without a rewrite.

Concretely, that means:

Treat the model as a component, not a foundation. Expose one narrow interface (prompt in, structured result out) and keep the provider and model behind a config value, not hard-coded across forty call sites.
Build an evaluation harness before you build clever prompt scaffolding. The thing that lets you adopt next quarter's cheaper model in an afternoon is a test set that tells you, objectively, whether the swap held quality. Without it, every migration is a leap of faith and you'll stay frozen on the expensive model out of fear.
Push complexity into the parts that don't get cheaper. Your retrieval, your data quality, your guardrails, your UX— none of that falls 10x a year. The model does. Over-investing in model-specific glue is investing in the one layer that's evaporating.
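The first two points can be sketched together: one interface, a config-driven lookup, and an eval gate that decides whether a cheaper model may replace the current one. Everything here is hypothetical scaffolding. `StubModel` stands in for a real provider client, exact-match scoring is the crudest possible grader, and the 2% regression budget is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class Model(Protocol):
    """The one narrow interface the rest of the system is allowed to see."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubModel:
    """Hypothetical stand-in for a provider client; a real one would call an API."""

    answers: Dict[str, str]

    def complete(self, prompt: str) -> str:
        return self.answers.get(prompt, "")


@dataclass
class EvalCase:
    prompt: str
    expected: str


def load_model(config: Dict[str, str], registry: Dict[str, Model]) -> Model:
    """Resolve the model from config so call sites never name a provider."""
    return registry[config["model"]]


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of cases where the model's output matches exactly."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(model.complete(c.prompt).strip() == c.expected for c in cases)
    return hits / len(cases)


def safe_to_swap(current: Model, candidate: Model, cases: List[EvalCase],
                 max_regression: float = 0.02) -> bool:
    """True if the candidate scores within max_regression of the current model."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - max_regression


if __name__ == "__main__":
    cases = [EvalCase("2+2?", "4"), EvalCase("Capital of France?", "Paris")]
    registry = {
        "incumbent": StubModel({"2+2?": "4", "Capital of France?": "Paris"}),
        "cheaper": StubModel({"2+2?": "4", "Capital of France?": "Paris"}),
    }
    current = load_model({"model": "incumbent"}, registry)
    candidate = load_model({"model": "cheaper"}, registry)
    print("swap ok:", safe_to_swap(current, candidate, cases))
```

The point of the shape is that a migration becomes a config change plus one eval run. A real harness would need graded or rubric-based scoring, since exact match is rarely the right test for generated text.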

The team that hard-couples to one provider's exact response format, one model's quirks, one vendor's fine-tuned checkpoint, has built a moat around a constraint that is about to dissolve. They engineered for "expensive forever." Expensive is not forever.

When "wait six months" is the right call

Here's the contrarian product move most roadmaps refuse to make: sometimes the right decision is to not build the feature yet. If a capability is technically possible today but only at frontier prices— and only barely works— you are paying a premium to ship a fragile version of something that will be cheap and reliable in two quarters. For a non-core feature, waiting is not laziness. It's capital discipline. You let the curve do the engineering for you.

The discipline is knowing which features those are. If the capability is your product— the thing customers pay for, the thing competitors can't match— build it now at frontier prices and let your margins recover as the floor drops beneath you. The cost of being early on your core differentiator is almost always worth it; the cost of being early on a "nice to have" almost never is.
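The waiting argument is mostly arithmetic. A sketch, assuming the 10x-per-year base rate holds smoothly (it won't exactly) and using made-up prices purely for illustration:

```python
import math


def projected_price(price_today: float, months_ahead: float,
                    annual_decline_factor: float = 10.0) -> float:
    """Price after months_ahead, assuming a constant exponential decline."""
    return price_today / annual_decline_factor ** (months_ahead / 12)


def months_until_affordable(price_today: float, target_price: float,
                            annual_decline_factor: float = 10.0) -> float:
    """Months to wait until the price reaches target_price (0 if already there)."""
    if price_today <= target_price:
        return 0.0
    return 12 * math.log(price_today / target_price) / math.log(annual_decline_factor)


if __name__ == "__main__":
    today = 30.00  # hypothetical $/M output tokens for a feature that barely works
    print(f"in six months: ${projected_price(today, 6):.2f} per million")
    print(f"months until $3.00: {months_until_affordable(today, 3.00):.0f}")
```

Two quarters at the base rate cut the price by about 3.2x (the square root of 10), and a feature that only pencils out at a tenth of today's price is about a year away. That is the trade you weigh against shipping a fragile version now.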

The fine-tuning trap

This is where the architecture bet bites hardest. Fine-tuning feels like the responsible move— squeeze a smaller, cheaper model to match a bigger one on your task. But fine-tuning is a bet against the curve. You're locking effort, and often data pipelines and serving infrastructure, to a specific base model at a specific moment. Then the base model improves and cheapens underneath you, and your hard-won fine-tune is now a worse, more expensive option than just calling the new base with a good prompt.

The rule I'd defend: don't fine-tune until prompting plus retrieval has provably failed against a real eval, and even then prefer the lightest intervention that works. Fine-tune for durable things— format, tone, a proprietary taxonomy that won't change— not for raw capability the next base model will hand you for free. Lock-in to a checkpoint is lock-in to a price floor that is actively falling away. You want the opposite: the ability to ride the floor down.

The cheap-tier-tomorrow bet

Every architectural decision in an AI product is implicitly a bet about which model you'll be running in twelve months. Most teams bet, by inaction, that it's the model they're running today. That's the losing bet. The winning bet is that today's frontier model is next year's cheap tier— so you build to make the swap trivial and you spend your scarce engineering effort on the layers the curve can't touch.

Stop counting tokens. The price is going to fall 10x with or without your prompt golf. Build the swap, build the eval, and let LLMflation pay for the rest. The constraint you're optimizing around today has an expiration date— design like you believe it.