The Open-Model Cost Chart Everyone's Sharing Is API Prices. Here's What Self-Hosting Actually Gets You (Measured)

#ai #llm #opensource #selfhosting

There's a chart going around: intelligence on the y-axis, cost to run on the x-axis, and a green "most attractive" quadrant in the upper left where high intelligence meets low cost. The takeaway everyone's posting is that the green quadrant is almost entirely open source. DeepSeek, GLM, MiniMax, Kimi, Qwen all show up smart-enough and cheap, while the closed frontier models sit expensive on the right.

It's a real trend and the chart isn't wrong. But read the x-axis label: cost to run is a blended API price. That number answers "what does it cost to call this model through somebody's API," which is a different question from "what does it cost to run this yourself." For those of us who self-host, the second question is the whole point, and the chart quietly hides the answer.

So here's what it skips — measured on the two cards I own.

The catch: you can't self-host the green quadrant

The open models winning that value quadrant aren't small. Take GLM-5.2, the one everyone points to when they say the open frontier finally caught up. It's coding-first and currently the strongest open-weight model on the coding benchmarks: a ~744B-parameter MoE with about 40B active per token. The cheap API price (around $1.40 in and $4.40 out per million tokens, roughly a sixth of GPT-5.5) is the headline. But the thing that sets it apart from the closed frontier models is the other half: the weights are MIT-licensed, so you can run it yourself, with no per-token fee, on your own box.

Then you try to. 744B at Q4 (about 4 bits per weight) is roughly 372GB of weights alone. The other value-quadrant models are in the same class, with DeepSeek and Kimi running from the high hundreds of billions up toward a trillion parameters. None of that fits a desktop GPU, or two, or four. "Self-hostable" here means a server with several 80GB datacenter cards, the exact infra headache the "open is cheap" story was supposed to spare you. The option is real; it's just not real on the hardware most people own.
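The 372GB figure is plain arithmetic, sketched below under the assumption of a flat 4 bits per weight and 1 GB = 1e9 bytes. The helper names are made up for illustration, and the card count ignores KV cache and runtime overhead, so treat it as a floor.

```python
import math


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def cards_needed(size_gb: float, card_gb: float = 80.0) -> int:
    """Minimum number of cards to hold the weights (no KV cache, no overhead)."""
    return math.ceil(size_gb / card_gb)


glm_gb = weight_gb(744, 4)            # 372.0 GB at an assumed flat 4 bits/weight
print(glm_gb, cards_needed(glm_gb))   # 372.0 5
```

Five 80GB cards is the best case. Real Q4 files carry a bit more than 4 bits per weight, and context needs room too.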

So when you self-host, you don't get the green quadrant. You get whatever fits on the card in front of you, which is a tier below. The useful question is: how far below, and is it good enough? That part I can answer with numbers instead of a chart.

What actually runs on a consumer card

Two tiers, both single consumer GPUs, models running fully on the GPU through Ollama. These are my own measured runs from earlier write-ups, pulled into one place:

GPU (used price)	best model that fits well	gen tok/s	prefill tok/s	context headroom
11GB — GTX 1080 Ti (~$200)	Gemma 4 12B QAT	~32	~315	12B at 16k with q8 KV
	Qwen3 8B	~46	~1390	comfortable
24GB — RTX 3090 (~$800)	Qwen3.6 27B Q4 + MTP	~75	—¹	dense 27B fits in VRAM

¹ Prefill doesn't reduce to one number on this card; it scales hard with context. At 64k the first token took about 59s. See the long-context paragraph under "What the API number hides" below.

The 11GB card tops out comfortably at a 12B. A dense 27B doesn't fit on it at all. The 24GB card moves you up to a dense 27B at a fast ~75 tok/s once speculative decoding is on, and that's the sweet spot: a 27B is a real step up in capability from a 12B, and it still lives entirely in VRAM.

On the intelligence chart, those are the mid-tier models, well below the green-quadrant frontier-open ones. So that's the real answer to "what does self-hosting get you": solid, useful, a tier under the cheap-API winners.

What the API number hides

Three costs that never show up as a dollar figure on that chart, and all three bit me at some point.

The VRAM ceiling is a wall, not a slope. A model either fits or it doesn't. The 27B that flies on a 3090 simply won't load on an 11GB card — no "a bit slower" middle ground at the boundary, it just fails, and your only move is a smaller model or a bigger card.
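A rough sketch of that fit-or-fail check. It assumes the weights take params × bits / 8 and sets aside a flat allowance for KV cache and runtime buffers. Both the flat 4 bits per weight and the `reserve_gb` value are assumptions for illustration, not measured numbers.

```python
def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float,
    reserve_gb: float = 1.5,  # assumed allowance for KV cache and buffers
) -> bool:
    """Return True if the weights plus the reserve fit in the card's VRAM."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + reserve_gb <= vram_gb


print(fits_in_vram(27, 4, vram_gb=24))  # True:  13.5 GB of weights on a 24GB card
print(fits_in_vram(27, 4, vram_gb=11))  # False: same model on an 11GB card
print(fits_in_vram(12, 4, vram_gb=11))  # True:  6 GB of weights on an 11GB card
```

The result is a boolean on purpose: there's no partial credit at the boundary.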

Spilling a MoE to system RAM looks like the obvious escape hatch when a model is too big. It isn't. I tried it with a 35B-A3B across two 1080 Tis and got about 17 tok/s — once the experts get mmapped to system RAM the whole thing goes memory-bandwidth-bound, and a CPU nearly tied it. A 12B living entirely in VRAM often feels snappier than a 35B that spills, which isn't what the parameter count would tell you.
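One way to see why the spill hurts is a back-of-envelope ceiling: if every generated token has to read the active weights once, tokens per second can't exceed memory bandwidth divided by bytes read per token. The sketch below assumes 3B active parameters at a flat 4 bits per weight. The two bandwidth values are round illustrative placeholders, not measurements from the runs above.

```python
def decode_ceiling_tok_s(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on generation speed when decode is memory-bandwidth-bound."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed, illustrative bandwidths (GB/s); substitute your own hardware's numbers.
VRAM_GB_S = 480.0
SYSTEM_RAM_GB_S = 48.0

print(decode_ceiling_tok_s(3, 4, VRAM_GB_S))        # 320.0
print(decode_ceiling_tok_s(3, 4, SYSTEM_RAM_GB_S))  # 32.0
```

These are ceilings, not predictions. Real runs land well under them, but the tenfold gap between the two lines is the point: once the experts live in system RAM, the slower memory sets the pace.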

The 3090's catch shows up at long context. It generates fast, but prompt processing scales hard: at 64k tokens the first token took about 59 seconds before generation even started. That latency never appears in a tokens-per-dollar number, and for anything retrieval-heavy it's the thing you feel.
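To put that in tokens-per-second terms, here is a small sketch that converts between prompt length, prefill rate, and time to first token. It treats "64k" as 64,000 tokens and assumes a constant prefill rate, which is a simplification: prefill slows as context grows, so the average it prints hides a slower tail.

```python
def implied_prefill_tok_s(prompt_tokens: int, ttft_s: float) -> float:
    """Average prefill rate implied by a measured time to first token."""
    return prompt_tokens / ttft_s


def estimate_ttft_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to first token, assuming a constant prefill rate."""
    return prompt_tokens / prefill_tok_s


# 3090: about 59 s to first token on a 64k prompt (from the run above).
print(round(implied_prefill_tok_s(64_000, 59)))  # 1085 tok/s on average

# 1080 Ti with the 12B: ~315 tok/s prefill (from the table), 16k prompt.
print(round(estimate_ttft_s(16_000, 315)))       # 51 s
```

Either way, a long prompt means most of a minute of waiting before the first token appears.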

So is it worth self-hosting?

If you're chasing the cheapest intelligence-per-token, the chart is right and the answer is often no. A cheap API to something like GLM-5.2 will beat your 3090 on raw capability per dollar, because you're not paying to keep a card idle between prompts, and you're getting a 744B model instead of a 27B.
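A quick sketch of that comparison, using only numbers already in this post: the GLM-5.2 output price, the used 3090 price, and the ~75 tok/s generation rate. It ignores input tokens, electricity, and the fact that the two models aren't the same tier, so it's a rough sense of scale and not a real cost model.

```python
def breakeven_output_mtok(card_usd: float, out_price_per_mtok: float) -> float:
    """Millions of output tokens the card's price would buy on the API."""
    return card_usd / out_price_per_mtok


def days_to_generate(mtok: float, tok_s: float) -> float:
    """Days of nonstop generation needed to produce that many tokens locally."""
    return mtok * 1e6 / tok_s / 86_400


mtok = breakeven_output_mtok(card_usd=800, out_price_per_mtok=4.40)
print(round(mtok))                          # 182 (million output tokens)
print(round(days_to_generate(mtok, 75)))    # 28 (days at ~75 tok/s, nonstop)
```

So the card only pays for itself after about a month of uninterrupted generation, and that's before power costs and while running the smaller model.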

Self-hosting is a bad way to win the cost game. What it buys you is the stuff that axis never measures: your data stays on the box, it runs offline, you can fine-tune and pin versions, and nobody deprecates a model out from under you. That last one is less abstract than it sounds. A weight already sitting on your disk under MIT is the one version nobody can reprice, retire, or region-lock on you later, which is part of why the open releases are starting to get talked about as insurance and not just a cheaper API. I run a local research assistant over my own papers for exactly that reason, and "a tier below the frontier" is completely fine for it. That's what you're paying for: privacy and control. The per-token math is a side issue.

So that's the bit the chart leaves out. On API the open models do win on price — no argument there. But once the weights are on your own card you drop a tier, you hit the VRAM wall, and long prompts crawl. Nobody self-hosting at home is doing it to shave a few dollars a month. They do it because the weights are theirs, sitting on a disk nobody can reprice or retire.

Caveats

These numbers come from two cards I actually own, an 11GB Pascal and a 24GB Ampere, run through Ollama with the specific quants from my earlier posts. The table is single-GPU; the MoE spill test is the one run that used two 1080 Tis. I don't have a 4090 or a 5090, and beyond that one spill test I haven't run multi-card setups, so I can't speak to those tiers and I'm not going to guess at them. The model sizes for the big MoEs are approximate; if you're quoting them, check the current model cards. The numbers are from my own runs and held steady across them, but I'm not claiming them to the decimal.