How I Stopped Self-Hosting LLMs — A Backend Engineer's Notes
Six months ago I had a 4× A100 box humming in a colocation facility and I felt like a king. I'd deployed Qwen3-32B, wired up a FastAPI wrapper, set up Prometheus exporters, and even wrote a little Grafana dashboard so I could watch tokens-per-second in real time. It was beautiful. It was also, fwiw, a complete waste of money for what I actually needed.
This is the post I wish I'd read before I bought those GPUs. I'm going to walk through what open-source LLMs actually cost when you self-host versus what they cost through a unified API endpoint, where the break-even really sits, and — since I'm a backend engineer, not a marketing copywriter — I'll throw in some code so you can see what the integration looks like.
The model landscape in late 2025
Before we talk about infrastructure, let's talk about what we're actually picking from. The open-weights ecosystem has gotten genuinely ridiculous. I've been running evals against my private datasets, and honestly, several of these models punch well above what you'd expect from the license terms.
Here's the lineup I've been personally comparing. Prices are output token rates per million, since that's usually the dominant cost line for inference-heavy workloads:
| Model | License | API Output ($/M) | Self-host monthly est. |
|---|---|---|---|
| DeepSeek V4 Flash | Open weights | $0.25 | $500–2,000 |
| DeepSeek V3.2 | Open weights | $0.38 | $800–3,000 |
| Qwen3-32B | Apache 2.0 | $0.28 | $400–1,500 |
| Qwen3-8B | Apache 2.0 | $0.01 | $200–800 |
| Qwen3.5-27B | Apache 2.0 | $0.19 | $300–1,200 |
| ByteDance Seed-OSS-36B | Open weights | $0.20 | $500–2,000 |
| GLM-4-32B | Open weights | $0.56 | $400–1,500 |
| GLM-4-9B | Open weights | $0.01 | $200–800 |
| Hunyuan-A13B | Open weights | $0.57 | $300–1,000 |
| Ling-Flash-2.0 | Open weights | $0.50 | $300–1,000 |
Notice anything? Qwen3-8B and GLM-4-9B are sitting at $0.01 per million output tokens. Let that sink in. A million tokens of generated text for a penny. If you're not routing your classification and extraction jobs to those models, you're leaving money on the table, imo.
What self-hosting actually costs (the version nobody puts in the README)
I've seen a lot of "GPU cost" tables that just list the rental price and stop there. That's misleading. Let me give you the real number.
The GPU itself is just the entry fee. Once you have one humming in a rack, you've also got:
| Hidden line item | Monthly estimate |
|---|---|
| GPU servers (idle or loaded) | $400–8,000 |
| Load balancer / API gateway | $50–200 |
| Monitoring & alerting | $50–200 |
| DevOps engineer time (partial) | $500–3,000 |
| Model updates & maintenance | $100–500 |
| Electricity (on-prem) | $200–1,000 |
| Total hidden costs (excluding GPU servers) | $900–4,900/month |
The DevOps line is the killer. You can rent an A100 for $1.20/hr on Lambda Labs — that part is easy. But who is on call when vLLM segfaults at 3am? Who's writing the Helm chart for your autoscaler? Who's tuning the KV cache so you don't OOM under burst load? In my experience, that person costs more than the GPUs.
And here's the thing the cloud brochures skip: GPUs bill whether or not you're using them. If your traffic has a 4:1 peak-to-trough ratio and you provision for the peak, you're paying peak prices 24/7. API providers absorb that variance across thousands of tenants. You, with one cluster, cannot.
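To put a number on the idle-capacity point, here's a minimal sketch. It assumes a made-up two-level traffic shape (12 hours at peak, 12 hours at trough, 4:1 ratio); the function names and the traffic shape are illustrative assumptions, not measurements from my cluster.

```python
from typing import Sequence


def average_utilization(hourly_load: Sequence[float]) -> float:
    """Fraction of provisioned capacity in use when capacity is sized for the peak hour."""
    peak = max(hourly_load)
    if peak <= 0:
        raise ValueError("load must contain at least one positive value")
    return sum(hourly_load) / (len(hourly_load) * peak)


def idle_cost_multiplier(hourly_load: Sequence[float]) -> float:
    """How much more each served token costs versus a fully utilized cluster."""
    return 1.0 / average_utilization(hourly_load)


# Hypothetical day: 12 hours at peak (4 units), 12 hours at trough (1 unit).
day = [4.0] * 12 + [1.0] * 12
print(f"utilization: {average_utilization(day):.1%}")          # utilization: 62.5%
print(f"cost multiplier: {idle_cost_multiplier(day):.2f}x")    # cost multiplier: 1.60x
```

Under that assumed shape the cluster averages 62.5% utilization, so each token you actually serve costs about 1.6× what the rental price alone suggests. A spikier profile makes it worse.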
The GPU sizing cheat sheet
When I was planning my deployment, I sat down with a spreadsheet and mapped model sizes to hardware. This is roughly what the cloud market looks like for reserved capacity on Lambda Labs / RunPod / Vast.ai:
| Model size | Required GPU | Cloud rental ($/month) | On-prem, amortized ($/month) |
|---|---|---|---|
| 7–9B | 1× A100 40GB | $400–800 | $200–400 |
| 13–14B | 1× A100 80GB | $600–1,200 | $300–600 |
| 27–32B | 2× A100 80GB | $1,000–2,000 | $500–1,000 |
| 70–72B | 4× A100 80GB | $2,000–4,000 | $1,000–2,000 |
| 200B+ | 8× A100 80GB | $4,000–8,000 | $2,000–4,000 |
The on-prem column assumes a 36-month amortization on hardware you actually buy. If you're not amortizing, just look at the cloud column and add 50% for the headaches.
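If the monthly on-prem figures look small, it helps to multiply them back out to the upfront spend they imply. This is a rough sketch assuming straight-line amortization with no interest and no resale value (that simplification is mine; the only thing stated above is the 36-month window). The helper name is hypothetical.

```python
AMORTIZATION_MONTHS = 36  # the window the on-prem column assumes

# (low, high) monthly on-prem figures copied from the table above.
ON_PREM_MONTHLY = {
    "7-9B": (200, 400),
    "13-14B": (300, 600),
    "27-32B": (500, 1_000),
    "70-72B": (1_000, 2_000),
    "200B+": (2_000, 4_000),
}


def implied_upfront(monthly: float, months: int = AMORTIZATION_MONTHS) -> float:
    """Hardware spend implied by a straight-line amortized monthly figure."""
    return monthly * months


for size, (low, high) in ON_PREM_MONTHLY.items():
    print(f"{size:>7}: ${implied_upfront(low):,.0f} to ${implied_upfront(high):,.0f} upfront")
```

The bottom row works out to $72,000 to $144,000 you have to commit before serving a single token, which is the part the monthly column hides.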
The break-even math, with real numbers
This is the part that actually matters. Let me run three scenarios I see all the time in my consulting work.
Scenario A: Hobby / side project — 1M tokens/day
- API (DeepSeek V4 Flash): 30M tokens × $0.25/M = $7.50/month
- Self-host (smallest GPU): $400–800/month
You are paying at least 53× more ($400 vs. $7.50) to host that GPU than you would to just call an API. There is no scenario in which this makes sense unless "I like watching GPU utilization graphs" is a line item in your budget. Spoiler: it shouldn't be.
Scenario B: Growth-stage startup — 50M tokens/day
- API (DeepSeek V4 Flash): 1.5B tokens × $0.25/M = $375/month
- Self-host (2× A100 80GB): $1,000–2,000/month
API is still roughly 3–5× cheaper. And remember, this is the GPU-only cost; it doesn't include the DevOps line. Add another $500–1,500 for someone to keep it alive ($1,500–3,500 total against $375), and you're looking at a 4–9× gap.
Scenario C: Large enterprise — 500M tokens/day
- API (V4 Flash): 15B tokens × $0.25/M = $3,750
- API (Qwen3-32B): 15B tokens × $0.28/M = $4,200
- Self-host (8× A100 cloud): $4,000–8,000
- Self-host (8× A100 on-prem): $2,000–4,000
Now we're in the interesting zone. At this scale, if you already own the hardware and have an SRE team doing nothing else, self-hosting is genuinely cost-competitive. But "already own the hardware" is doing a lot of work in that sentence.
So here's the rough heuristic I'd give a fellow backend engineer:
- Below ~50M tokens/day → API wins, full stop, no discussion
- 50M–500M tokens/day → API still wins unless you have spare GPU capacity and idle DevOps hands
- Above 500M tokens/day → do the math with your actual numbers; self-hosting becomes defensible
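If you'd rather plug in your own numbers than trust my heuristic, the arithmetic behind the scenarios fits in two functions. This is a sketch under the same simplifications used above: a 30-day month, output tokens only, and a flat self-hosting bill. The function names are mine.

```python
DAYS_PER_MONTH = 30  # same 30-day month the scenarios use


def api_monthly_cost(tokens_per_day: float, price_per_million: float) -> float:
    """Monthly API bill in dollars, counting output tokens only."""
    return tokens_per_day * DAYS_PER_MONTH / 1_000_000 * price_per_million


def break_even_tokens_per_day(self_host_monthly: float, price_per_million: float) -> float:
    """Daily volume at which the API bill equals a flat self-hosting bill."""
    return self_host_monthly / price_per_million * 1_000_000 / DAYS_PER_MONTH


# Scenario B and C volumes at the DeepSeek V4 Flash rate from the table.
for label, tokens_per_day in {"B": 50_000_000, "C": 500_000_000}.items():
    print(f"Scenario {label}: ${api_monthly_cost(tokens_per_day, 0.25):,.2f}/month via API")

# GPU-only bill for 2x A100 80GB (cloud), low and high end of the range.
for bill in (1_000, 2_000):
    volume = break_even_tokens_per_day(bill, 0.25)
    print(f"${bill:,}/month breaks even at {volume / 1e6:.0f}M tokens/day")
```

On GPU rental alone, a $1,000–2,000 bill breaks even somewhere around 133M–267M tokens/day at $0.25/M. That figure says nothing about whether two A100s can actually serve that volume, and it ignores the DevOps line entirely, so treat it as a floor rather than a target.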
What the API integration actually looks like
Under the hood, this is the part I care about. I don't want to learn yet another SDK. I want to point my existing OpenAI-compatible client at a base URL and be done with it.
Here's a stripped-down Python example using the standard openai library, pointed at a unified endpoint that exposes all those models:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GLOBAL_API_KEY"],
    base_url="https://global-apis.com/v1",
)


def classify_ticket(ticket_text: str) -> str:
    """Cheap classification: route to a tiny model."""
    resp = client.chat.completions.create(
        model="qwen3-8b",  # $0.01/M output, basically free
        messages=[
            {"role": "system", "content": "Classify the ticket into: billing, bug, feature, other."},
            {"role": "user", "content": ticket_text},
        ],
        temperature=0.0,
        max_tokens=8,
    )
    return resp.choices[0].message.content.strip()


def summarize_doc(doc: str) -> str:
    """Heavier lift: use the bigger model."""
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",  # $0.25/M output
        messages=[
            {"role": "system", "content": "Summarize in 3 bullet points."},
            {"role": "user", "content": doc},
        ],
        temperature=0.3,
        max_tokens=300,
    )
    return resp.choices[0].message.content
```
Notice what I just did: I routed classification to Qwen3-8B (essentially free) and summarization to DeepSeek V4 Flash. With one client. If I want to swap the summarizer to Qwen3.5-27B or ByteDance Seed-OSS-36B for an A/B test, I change one string. No redeploy. No model download. No "did I quantize this right?"
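The "change one string" part gets even cheaper if the string lives in a lookup table instead of being hard-coded in each function. Here's one way that could look. The task names, the `model_for` helper, and the `"qwen3.5-27b"` ID are hypothetical (check your provider's model list for the exact identifier); the other two IDs are the ones used in the examples above.

```python
from typing import Dict, Optional

# Default routing: task name -> model ID.
MODEL_FOR_TASK: Dict[str, str] = {
    "classify": "qwen3-8b",
    "summarize": "deepseek-v4-flash",
}
DEFAULT_MODEL = "qwen3-32b"


def model_for(task: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Pick the model for a task; overrides win, unknown tasks get the default."""
    if overrides and task in overrides:
        return overrides[task]
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)


# A/B test: send summaries to a different model without touching the call sites.
ab_test = {"summarize": "qwen3.5-27b"}  # hypothetical model ID
print(model_for("summarize"))            # deepseek-v4-flash
print(model_for("summarize", ab_test))   # qwen3.5-27b
print(model_for("translate"))            # qwen3-32b
```

Each call site then passes `model=model_for("summarize", overrides)` and the experiment becomes a config change.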
Here's a streaming example, because every backend engineer eventually needs streaming:
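```python
def stream_response(prompt: str):
    """Yield text deltas as they arrive from the model."""
    stream = client.chat.completions.create(
        model="qwen3-32b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices (e.g. a trailing usage chunk); skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```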
That works against any of the 184 models exposed behind the same endpoint. Compare this to the version of my life where I was juggling vLLM, TGI, llama.cpp, and a custom Rust tokenizer wrapper. I do not miss it.
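If you're forwarding those deltas to a browser, you'll probably end up wrapping them as Server-Sent Events. Here's a small sketch of the framing, written against any iterable of strings so it can be exercised without a network call. The `to_sse` name and the `{"delta": ...}` payload shape are my own choices, and the closing `[DONE]` frame is a convention borrowed from OpenAI-style streams, not something the endpoint requires.

```python
import json
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text deltas as Server-Sent Events frames ("data: ...", blank line)."""
    for text in chunks:
        # JSON-encode so a newline inside a delta cannot split the frame.
        payload = json.dumps({"delta": text})
        yield f"data: {payload}\n\n"
    # Sentinel frame so the client knows the stream ended cleanly.
    yield "data: [DONE]\n\n"


# Stand-in for stream_response(...): any iterable of strings works.
frames = list(to_sse(["Hel", "lo\n"]))
for frame in frames:
    print(repr(frame))
```

In a real handler you would pass `stream_response(prompt)` in place of the list and hand the resulting generator to your framework's streaming response type.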
The operational comparison (TL;DR)
If you're a backend engineer evaluating this for real, the operational delta matters as much as the dollar delta:
| Concern | Self-hosting | API access |
|---|---|---|
| Setup time | Days to weeks | 5 minutes |
| Model switching | Re-deploy, re-config | Change one string |
| Scaling | Buy/rent more GPUs | Already scaled |
| Updates | Manual redeploy | Automatic |
| Multiple models | One per cluster | 184 models, 1 key |
| Uptime | Your pager | Provider's SLA |
| Cost at low volume | High (idle GPUs) | Pay-per-use |
| Cost at high volume | Competitive | Still competitive |
The "184 models, 1 API key" line is the one I underestimated. I used to think I'd pick one model and commit. In practice I run maybe six models in production right now, each for a different job. The blast radius of changing my mind is now zero.
When self-hosting still wins (the honest version)
I'm not going to pretend self-hosting is always wrong. Here are the cases where I'd still do it:
Regulatory/data-residency constraints. If your tokens literally cannot leave your VPC, you have no choice. In that case, budget for the full stack and don't whine about it.
Sustained >500M tokens/day with existing infrastructure. If you already own the GPUs and the team, the unit economics flip. At that point you're basically a hyperscaler and should be acting like one.
Latency-critical, single-tenant workloads. If you need p99 < 50ms and you're colocated next to your