My Honest Breakdown: Open Source LLM API vs Self-Hosting Costs
Last quarter I burned a weekend wrestling with vLLM configs, CUDA driver mismatches, and a Prometheus alert storm that paged me at 3 AM because a single H100 decided it was too hot to live. By Monday morning I had a working inference cluster serving a 70B model. By Wednesday I was staring at the AWS bill and wondering if I had made a terrible mistake. That experience is what pushed me to do the math properly on API access versus self-hosting, and fwiw, the numbers surprised me enough that I rewrote our entire inference layer.
This is the post I wish I had read before that weekend. It's a backend engineer's pragmatic look at open source LLMs through the API, with honest dollar figures and zero vendor sugarcoating. Under the hood, the choice between spinning up your own GPU rig and hitting a managed endpoint is mostly a question of scale, ops maturity, and how much you value your sleep. Let me walk you through what I found.
The Open Source LLM Landscape in 2026
The open weight ecosystem has gotten genuinely absurd in a good way. When I started this hobby in early 2023, "open source AI" meant running Llama 7B on a gaming GPU and hoping it didn't hallucinate your API keys. Now there are competitive models in basically every size bracket, and most of them are accessible via a clean REST endpoint that speaks the same wire protocol as OpenAI. RFC 7231 would be proud of how boring and predictable the surface area has become.
Here's the lineup I evaluated for our production workloads. Output prices are per million tokens, which is the standard unit you'll see across providers.
| Model | License | API Output Price | Self-Host GPU Estimate |
|---|---|---|---|
| DeepSeek V4 Flash | Open weights | $0.25/M | $500–2,000/month |
| DeepSeek V3.2 | Open weights | $0.38/M | $800–3,000/month |
| Qwen3-32B | Apache 2.0 | $0.28/M | $400–1,500/month |
| Qwen3-8B | Apache 2.0 | $0.01/M | $200–800/month |
| Qwen3.5-27B | Apache 2.0 | $0.19/M | $300–1,200/month |
| ByteDance Seed-OSS-36B | Open weights | $0.20/M | $500–2,000/month |
| GLM-4-32B | Open weights | $0.56/M | $400–1,500/month |
| GLM-4-9B | Open weights | $0.01/M | $200–800/month |
| Hunyuan-A13B | Open weights | $0.57/M | $300–1,000/month |
| Ling-Flash-2.0 | Open weights | $0.50/M | $300–1,000/month |
A few observations. Qwen3-8B and GLM-4-9B at $0.01/M output are essentially free for prototyping. I have a service that classifies support tickets using GLM-4-9B and the monthly bill is small enough that I forget it exists. On the opposite end, the GLM-4-32B and Hunyuan-A13B prices are higher because they are in the "premium reasoning" tier, but they still undercut GPT-4 class APIs by a factor of 5 to 20. Imo, the sweet spot for most production workloads is the 27B to 36B range, which is where Qwen3.5-27B, ByteDance Seed-OSS-36B, and DeepSeek V3.2 live.
The Real Cost of Self-Hosting: It's Never Just the GPU
This is the part of the calculator nobody shows you. When you see a tweet saying "I run Mixtral 8x7B on two A100s for $2/hour," that is the cost of the metal. It is not the cost of the service. After you add the load balancer, the observability stack, the model update pipeline, the on-call rotation, and the guy whose job is to keep the inference server from OOM-ing at peak load, the real number balloons fast.
Here is what GPU rental actually looks like across the major providers I'm familiar with (Lambda Labs, RunPod, Vast.ai reserved instances, etc.):
| Model Size | Required GPU | Cloud Rental | On-Prem (Amortized) |
|---|---|---|---|
| 7–9B | 1× A100 40GB | $400–800/mo | $200–400/mo |
| 13–14B | 1× A100 80GB | $600–1,200/mo | $300–600/mo |
| 27–32B | 2× A100 80GB | $1,000–2,000/mo | $500–1,000/mo |
| 70–72B | 4× A100 80GB | $2,000–4,000/mo | $1,000–2,000/mo |
| 200B+ | 8× A100 80GB | $4,000–8,000/mo | $2,000–4,000/mo |
But that table is the lie. The truth is in the hidden line items, which I have painfully itemized below. This is what I track in our internal cost dashboard, and it's what you should track too if you're serious about self-hosting.
| Line Item | Monthly Estimate |
|---|---|
| GPU servers (loaded or sitting idle) | $400–8,000 |
| Load balancer / API gateway | $50–200 |
| Monitoring and alerting (Grafana Cloud, Datadog, whatever) | $50–200 |
| DevOps engineer time (even partial allocation) | $500–3,000 |
| Model updates, weight downloads, eval runs | $100–500 |
| Electricity (on-prem only) | $200–1,000 |
| Total realistic hidden cost | $900–4,900/mo on top of GPU |
The DevOps line item is the one that kills most "we'll just self-host" plans. A competent SRE costs north of $150k fully loaded, and even a quarter of their time dedicated to keeping your inference cluster alive is roughly $3,000 a month ($150k ÷ 12 × 0.25 ≈ $3,125). If you're a five-person startup, that quarter-FTE is also the same person who needs to deploy your app, rotate your certs, and answer the occasional "why is the staging database on fire" Slack message. Good luck with that.
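To make the line-item table above concrete, here is a minimal sketch that adds the hidden costs to a GPU rental range. The figures are copied from the tables in this post; the function and variable names are illustrative, and the decision to skip electricity for cloud rentals is an assumption based on the "on-prem only" note.

```python
# (low, high) monthly estimates in USD, copied from the line-item table above.
HIDDEN_COSTS = {
    "load_balancer": (50, 200),
    "monitoring": (50, 200),
    "devops_time": (500, 3000),
    "model_updates": (100, 500),
    "electricity": (200, 1000),  # on-prem only, per the table
}


def total_monthly_cost(gpu_range, on_prem=False):
    """Add the hidden line items to a (low, high) GPU cost range."""
    low, high = gpu_range
    for name, (item_low, item_high) in HIDDEN_COSTS.items():
        if name == "electricity" and not on_prem:
            continue  # assumption: cloud rental already includes power
        low += item_low
        high += item_high
    return low, high


def fractional_fte_monthly(annual_loaded_usd, fraction):
    """Monthly cost of a slice of one engineer's time."""
    return annual_loaded_usd * fraction / 12


# 27-32B row of the GPU table: cloud rental vs. amortized on-prem.
print(total_monthly_cost((1000, 2000)))               # (1700, 5900)
print(total_monthly_cost((500, 1000), on_prem=True))  # (1400, 5900)
print(fractional_fte_monthly(150_000, 0.25))          # 3125.0
```

On these numbers the on-prem discount mostly disappears at the high end once electricity is counted, which is the kind of thing the GPU-only table hides.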
Break-Even Math: Three Realistic Scenarios
Let me run the numbers the way I actually run them: monthly token volume, the cheapest reasonable self-hosting setup, and the API price for our favorite workhorse (DeepSeek V4 Flash at $0.25/M output). I assume a 30-day month throughout, so 50M tokens/day becomes 1.5 billion tokens a month, and I price every token at the output rate to keep the arithmetic simple.
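The API side of that arithmetic fits in a few lines. This is a sketch under the same simplifications as the paragraph above (30-day month, every token billed at the $0.25/M output rate); the function name is illustrative.

```python
DAYS_PER_MONTH = 30  # same 30-day month used throughout this post


def api_monthly_cost(tokens_per_day: float, price_per_million_usd: float) -> float:
    """Monthly API spend, pricing every token at the given per-million rate."""
    monthly_tokens = tokens_per_day * DAYS_PER_MONTH
    return monthly_tokens / 1_000_000 * price_per_million_usd


# The three daily volumes used in the scenarios that follow.
for label, tokens_per_day in [("A", 1_000_000), ("B", 50_000_000), ("C", 500_000_000)]:
    print(label, api_monthly_cost(tokens_per_day, 0.25))
```

Swap in a different per-million price to compare models; real invoices will also depend on the input/output split, which this sketch ignores.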
Scenario A: 1M Tokens/Day (Hobby Project)
| Option | Monthly Cost | Notes |
|---|---|---|
| API (DeepSeek V4 Flash) | $7.50 | 30M tokens × $0.25/M |
| Self-host (smallest A100) | $400–800/mo | Idle GPU costs the same as busy GPU |
Verdict: API wins by roughly 50–100×. There is no scenario where self-hosting a 40GB A100 to serve 1M tokens a day makes financial sense. The GPU will sit idle almost all day. You are literally paying for hardware to sit there. Don't do it.
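If you want to sanity-check the idle-GPU claim for your own setup, the estimate is just daily volume divided by what the card could produce in a day. A rough sketch: the 1,000 tokens/second throughput is a placeholder assumption, not a benchmark, so measure your own model and batch settings before trusting the result.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gpu_utilization(tokens_per_day: float, tokens_per_second: float) -> float:
    """Fraction of the day the GPU spends generating at a sustained throughput."""
    return tokens_per_day / (tokens_per_second * SECONDS_PER_DAY)


ASSUMED_THROUGHPUT = 1_000  # tokens/second; hypothetical, replace with a measured value

busy = gpu_utilization(1_000_000, ASSUMED_THROUGHPUT)
print(f"busy {busy:.2%} of the day, idle {1 - busy:.2%}")
```

Even with a pessimistic throughput guess, a 1M tokens/day workload keeps the card busy for only a sliver of the day.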
Scenario B: 50M Tokens/Day (Growth-Stage Startup)
| Option | Monthly Cost | Notes |
|---|---|---|
| API (DeepSeek V4 Flash) | $375 | 1.5B tokens × $0.25/M |
| Self-host (2× A100 80GB) | $1,000–2,000 | Can handle ~50M/day with vLLM + batching |
Verdict: API wins by roughly 3–5× on GPU rental alone. This is the scenario I actually live in. Our RAG pipeline and a few internal tools chew through about 50M tokens a day, and my monthly invoice is consistently under $400. To match that with self-hosted infrastructure I would need a DevOps allocation, monitoring, and a 24/7 on-call rotation on top of the GPUs. The math isn't even close.
Scenario C: 500M Tokens/Day (Large Enterprise)
| Option | Monthly Cost | Notes |
|---|---|---|
| API (DeepSeek V4 Flash) | $3,750 | 15B tokens × $0.25/M |
| API (Qwen3-32B) | $4,200 | Different model, slightly different price |
| Self-host (8× A100 80GB) | $4,000–8,000/mo | Break-even zone |
| Self-host (on-prem, owned) | $2,000–4,000/mo | Only if you already own the hardware |
Verdict: Toss-up. At 500M tokens a day, self-hosting starts to make sense, but only if you already have a rack, a power contract, and someone who knows how to swap a failed NVLink cable without Googling it. If you're starting from scratch, the API is still cheaper for the first 6–12 months once you factor in procurement, setup, and the inevitable "why is inference latency spiking at 4 PM" investigation.
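The "break-even zone" can be worked backwards: take a self-hosting bill and ask how many tokens a day you would need to push through the API to spend the same amount. A minimal sketch, assuming the same 30-day month and $0.25/M rate as the tables above, and ignoring the hidden line items; the function name is illustrative.

```python
DAYS_PER_MONTH = 30


def break_even_tokens_per_day(self_host_monthly_usd: float,
                              api_price_per_million_usd: float) -> float:
    """Daily token volume at which API spend equals the self-hosting bill."""
    monthly_tokens = self_host_monthly_usd / api_price_per_million_usd * 1_000_000
    return monthly_tokens / DAYS_PER_MONTH


# Self-hosting bills from the Scenario C table, compared at $0.25/M.
for monthly in (2_000, 4_000, 8_000):
    per_day = break_even_tokens_per_day(monthly, 0.25)
    print(f"${monthly:,}/mo breaks even at about {per_day / 1e6:,.0f}M tokens/day")
```

At $0.25/M, a $4,000/month cluster only pays for itself above roughly 533M tokens a day, which is why 500M/day lands in toss-up territory rather than being a clear win for either side.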
Code: Actually Calling These Models
The beautiful thing about the current state of the open source API ecosystem is that the interface is identical to OpenAI's. You can swap providers by changing two lines of code. Here's how I wire it up in our services.
```python
import os

from openai import OpenAI

# Global API endpoint, OpenAI-compatible
client = OpenAI(
    api_key=os.environ["GLOBAL_API_KEY"],
    base_url="https://global-apis.com/v1",
)


def classify_ticket(text: str) -> str:
    """Cheap classification using Qwen3-8B at $0.01/M output."""
    response = client.chat.completions.create(
        model="qwen3-8b",
        messages=[
            {"role": "system", "content": "Classify the support ticket into one of: billing, bug, feature, other."},
            {"role": "user", "content": text},
        ],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```
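One caveat with a classifier like this: the model is free to answer "Billing." or "Category: bug" instead of the bare label, and the content field can come back empty. A small guard, sketched here with illustrative names (`ALLOWED_LABELS` and `normalize_label` are not part of any library), maps whatever comes back onto the four labels from the system prompt and falls back to "other".

```python
import re
from typing import Optional

# The four labels named in the system prompt above.
ALLOWED_LABELS = {"billing", "bug", "feature", "other"}


def normalize_label(raw: Optional[str]) -> str:
    """Return the first allowed label found in the model output, else 'other'."""
    if not raw:
        return "other"
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in ALLOWED_LABELS:
            return word
    return "other"


print(normalize_label("Billing."))           # billing
print(normalize_label("Category: feature"))  # feature
print(normalize_label(None))                 # other
```

Wrapping the raw model output in a check like this keeps downstream code from ever seeing a label it doesn't know about.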
And for the heavier reasoning tasks, here's the exact same pattern with a different model:
```python
def summarize_contract(document: str) -> str:
    """Long-context summarization using DeepSeek V3.2."""
    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=[
            {"role": "system", "content": "Summarize the contract in 5 bullet points, plain English."},
            {"role": "user", "content": document},
        ],
        max_tokens=500,
        temperature=0.3,
    )
    return response.choices[0].message.content
```
Notice that nothing changes except the model name. That's not an accident. The open source LLM ecosystem has converged on a wire format that lets you A/B test model providers the way you'd A/B test a database driver. Last month I switched our summarization pipeline from Qwen3-32B to DeepSeek V3.2 because the eval suite showed a 4% quality bump, and the entire migration took 11 minutes including the PR review.
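Since the model name is the only moving part, one way to make a swap like that a config change rather than a code change is to look the model up per task. This is a sketch, not how any particular provider recommends doing it: the `MODEL_<TASK>` environment variable convention, the task names, and the `qwen3-32b` identifier in the usage comment are all assumptions.

```python
import os
from typing import Any, Mapping

# Defaults mirror the two functions above; the task names are illustrative.
DEFAULT_MODELS = {
    "classify": "qwen3-8b",
    "summarize": "deepseek-v3.2",
}


def model_for(task: str, env: Mapping[str, str] = os.environ) -> str:
    """Pick the model for a task, letting MODEL_<TASK> in the environment override it."""
    return env.get(f"MODEL_{task.upper()}") or DEFAULT_MODELS[task]


def chat_kwargs(task: str, system_prompt: str, user_text: str,
                env: Mapping[str, str] = os.environ, **extra: Any) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model_for(task, env),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        **extra,
    }


# Usage with the client defined earlier (not executed here):
# client.chat.completions.create(
#     **chat_kwargs("summarize", "Summarize the contract.", document, max_tokens=500)
# )
# Setting MODEL_SUMMARIZE=qwen3-32b (hypothetical id) would route the same call elsewhere.
```

Running the eval suite against two models then comes down to running it twice with a different environment variable.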
Why API Beats Self-Hosting for 95% of Teams
Let me put the comparison in a table I can show to my manager when she asks "why are we paying someone else to run our models":
| Dimension | Self-Hosting | API Access |
|---|---|---|
| Time to first request | Days to weeks | 5 minutes |
| Switching models | Redeploy, reconfigure, restart | Change one string in code |
| Scaling | Buy or rent more GPUs | Automatic, transparent |
| Model updates | Manual download + redeploy | Provider handles it |
| Multi-model workflows | One model per GPU cluster | 184 models, one API key |
| Uptime responsibility | Yours | Provider's SLA |
| Low-volume cost | High (idle GPUs) | Pay only for what you use |
| High-volume cost | Competitive | Still competitive |
The "model switching" row is the one that really matters. When Llama 4 dropped