DEV Community

eagerspark


How I Stopped Self-Hosting LLMs — A Backend Engineer's Notes


Six months ago I had a 4× A100 box humming in a colocation facility and I felt like a king. I'd deployed Qwen3-32B, wired up a FastAPI wrapper, set up Prometheus exporters, and even written a little Grafana dashboard so I could watch tokens-per-second in real time. It was beautiful. It was also, fwiw, a complete waste of money for what I actually needed.

This is the post I wish I'd read before I bought those GPUs. I'm going to walk through what open-source LLMs actually cost when you self-host versus what they cost through a unified API endpoint, where the break-even really sits, and — since I'm a backend engineer, not a marketing copywriter — I'll throw in some code so you can see what the integration looks like.


The model landscape in late 2025

Before we talk about infrastructure, let's talk about what we're actually picking from. The open-weights ecosystem has gotten genuinely ridiculous. I've been running evals against my private datasets, and honestly, several of these models punch well above what you'd expect from the license terms.

Here's the lineup I've been personally comparing. Prices are output token rates per million, since that's usually the dominant cost line for inference-heavy workloads:

| Model | License | API output ($/M tokens) | Self-host estimate ($/month) |
|---|---|---|---|
| DeepSeek V4 Flash | Open weights | $0.25 | $500–2,000 |
| DeepSeek V3.2 | Open weights | $0.38 | $800–3,000 |
| Qwen3-32B | Apache 2.0 | $0.28 | $400–1,500 |
| Qwen3-8B | Apache 2.0 | $0.01 | $200–800 |
| Qwen3.5-27B | Apache 2.0 | $0.19 | $300–1,200 |
| ByteDance Seed-OSS-36B | Open weights | $0.20 | $500–2,000 |
| GLM-4-32B | Open weights | $0.56 | $400–1,500 |
| GLM-4-9B | Open weights | $0.01 | $200–800 |
| Hunyuan-A13B | Open weights | $0.57 | $300–1,000 |
| Ling-Flash-2.0 | Open weights | $0.50 | $300–1,000 |

Notice anything? Qwen3-8B and GLM-4-9B are sitting at $0.01 per million output tokens. Let that sink in. A million tokens of generated text for a penny. If you're not routing your classification and extraction jobs to those models, you're leaving money on the table, imo.


What self-hosting actually costs (the version nobody puts in the README)

I've seen a lot of "GPU cost" tables that just list the rental price and stop there. That's misleading. Let me give you the real number.

The GPU itself is just the entry fee. Once you have one humming in a rack, you've also got:

| Hidden line item | Monthly estimate |
|---|---|
| GPU servers (idle or loaded) | $400–8,000 |
| Load balancer / API gateway | $50–200 |
| Monitoring & alerting | $50–200 |
| DevOps engineer time (partial) | $500–3,000 |
| Model updates & maintenance | $100–500 |
| Electricity (on-prem) | $200–1,000 |
| Total hidden costs (excluding the GPU servers) | $900–4,900/month |

The DevOps line is the killer. You can rent an A100 for $1.20/hr on Lambda Labs — that part is easy. But who is on call when vLLM segfaults at 3am? Who's writing the Helm chart for your autoscaler? Who's tuning the KV cache so you don't OOM under burst load? In my experience, that person costs more than the GPUs.
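
To make the arithmetic behind that table explicit, here is a small sketch that sums the line items. The dollar ranges are copied from the table; the variable and function names are made up for this illustration.

```python
# Monthly (low, high) estimates in USD, copied from the table above.
HIDDEN_COSTS = {
    "load_balancer": (50, 200),
    "monitoring": (50, 200),
    "devops_time": (500, 3000),
    "model_maintenance": (100, 500),
    "electricity": (200, 1000),
}
GPU_SERVERS = (400, 8000)


def total_range(items):
    """Sum a collection of (low, high) pairs into one (low, high) pair."""
    pairs = list(items)
    return sum(lo for lo, _ in pairs), sum(hi for _, hi in pairs)


overhead = total_range(HIDDEN_COSTS.values())
all_in = total_range([*HIDDEN_COSTS.values(), GPU_SERVERS])

print(overhead)  # (900, 4900)
print(all_in)    # (1300, 12900)
```

The $900–4,900 total covers only the overhead rows. Adding the GPU servers back in gives an all-in range of $1,300–12,900 per month.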

And here's the thing the cloud brochures skip: GPUs bill whether or not you're using them. If your traffic has a 4:1 peak-to-trough ratio and you provision for the peak, you're paying peak prices 24/7. API providers absorb that variance across thousands of tenants. You, with one cluster, cannot.
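
A rough way to put a number on that idle capacity, as a sketch only. It assumes a simplified two-level traffic pattern (at peak for part of the day, at trough for the rest), which real traffic will not match exactly. The 50/50 split is an assumption for illustration, not a measurement.

```python
def average_utilization(peak_to_trough: float, peak_fraction: float) -> float:
    """Average load as a fraction of provisioned (peak) capacity.

    Assumes the cluster runs at peak for `peak_fraction` of the time
    and at trough for the remainder.
    """
    trough_level = 1.0 / peak_to_trough
    return peak_fraction * 1.0 + (1.0 - peak_fraction) * trough_level


def idle_overhead_multiplier(utilization: float) -> float:
    """How much more each served token costs versus a fully busy cluster."""
    return 1.0 / utilization


# 4:1 peak-to-trough, with half the day spent at peak (assumed split).
util = average_utilization(peak_to_trough=4.0, peak_fraction=0.5)
print(f"average utilization: {util:.1%}")
print(f"cost per token vs. fully utilized: {idle_overhead_multiplier(util):.2f}x")
```

Under those assumptions the cluster averages 62.5% utilization, so each token served costs 1.6× what it would on fully loaded hardware. The less time spent at peak, the worse that multiplier gets.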


The GPU sizing cheat sheet

When I was planning my deployment, I sat down with a spreadsheet and mapped model sizes to hardware. This is roughly what the cloud market looks like for reserved capacity on Lambda Labs / RunPod / Vast.ai:

| Model size | Required GPU | Cloud rental ($/month) | On-prem, amortized ($/month) |
|---|---|---|---|
| 7–9B | 1× A100 40GB | $400–800 | $200–400 |
| 13–14B | 1× A100 80GB | $600–1,200 | $300–600 |
| 27–32B | 2× A100 80GB | $1,000–2,000 | $500–1,000 |
| 70–72B | 4× A100 80GB | $2,000–4,000 | $1,000–2,000 |
| 200B+ | 8× A100 80GB | $4,000–8,000 | $2,000–4,000 |

The on-prem column assumes a 36-month amortization on hardware you actually buy. If you're not amortizing, just look at the cloud column and add 50% for the headaches.
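
As a sanity check on those two rules of thumb, here is a short sketch using the 27–32B row. It treats amortization as simple straight-line division over 36 months, which is an assumption; the function names are illustrative.

```python
AMORTIZATION_MONTHS = 36  # stated in the text above


def implied_hardware_price(monthly_amortized, months=AMORTIZATION_MONTHS):
    """Purchase price implied by a straight-line monthly amortized cost."""
    return monthly_amortized * months


def cloud_with_headache_markup(monthly_rental, markup=0.5):
    """Cloud rental plus the 50% 'headaches' markup suggested above."""
    return monthly_rental * (1 + markup)


# 27–32B row: 2× A100 80GB, $500–1,000 on-prem, $1,000–2,000 cloud.
print(implied_hardware_price(500), implied_hardware_price(1000))
print(cloud_with_headache_markup(1000), cloud_with_headache_markup(2000))
```

Read this way, the on-prem figures for that row imply roughly $18,000–36,000 of hardware, and the cloud figures become $1,500–3,000 per month once the markup is applied.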


The break-even math, with real numbers

This is the part that actually matters. Let me run three scenarios I see all the time in my consulting work.

Scenario A: Hobby / side project — 1M tokens/day

  • API (DeepSeek V4 Flash): 30M tokens × $0.25/M = $7.50/month
  • Self-host (smallest GPU): $400–800/month

You are paying roughly 53–107× more to host that GPU than you would to just call an API. There is no scenario in which this makes sense unless "I like watching GPU utilization graphs" is a line item in your budget. Spoiler: it shouldn't be.

Scenario B: Growth-stage startup — 50M tokens/day

  • API (DeepSeek V4 Flash): 1.5B tokens × $0.25/M = $375/month
  • Self-host (2× A100 80GB): $1,000–2,000/month

API is still roughly 3–5× cheaper. And remember, this is the GPU-only cost; it doesn't include the DevOps line. Add another $500–1,500 for someone to keep it alive, and you're looking at a 4–9× gap.

Scenario C: Large enterprise — 500M tokens/day

  • API (DeepSeek V4 Flash): 15B tokens × $0.25/M = $3,750/month
  • API (Qwen3-32B): 15B tokens × $0.28/M = $4,200/month
  • Self-host (8× A100 cloud): $4,000–8,000/month
  • Self-host (8× A100 on-prem): $2,000–4,000/month

Now we're in the interesting zone. At this scale, if you already own the hardware and have an SRE team doing nothing else, self-hosting is genuinely cost-competitive. But "already own the hardware" is doing a lot of work in that sentence.
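
The API side of all three scenarios is the same one-line formula. Here is a sketch of it, assuming a 30-day month and billing every token at the output rate, as the scenarios do. The function and variable names are illustrative.

```python
DAYS_PER_MONTH = 30  # the scenarios above use a 30-day month


def api_monthly_cost(tokens_per_day: float, price_per_million: float) -> float:
    """Monthly API spend in USD at a flat per-million-token price."""
    monthly_tokens = tokens_per_day * DAYS_PER_MONTH
    return monthly_tokens / 1_000_000 * price_per_million


SCENARIOS = {
    "A (hobby)": 1_000_000,
    "B (startup)": 50_000_000,
    "C (enterprise)": 500_000_000,
}

for name, tokens_per_day in SCENARIOS.items():
    cost = api_monthly_cost(tokens_per_day, price_per_million=0.25)
    print(f"{name}: ${cost:,.2f}/month")
```

At $0.25/M this gives $375/month for Scenario B and $3,750/month for Scenario C. Swap in your own volume and the price of whichever model you'd actually use.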

So here's the rough heuristic I'd give a fellow backend engineer:

  • Below ~50M tokens/day → API wins, full stop, no discussion
  • 50M–500M tokens/day → API still wins unless you have spare GPU capacity and idle DevOps hands
  • Above 500M tokens/day → do the math with your actual numbers; self-hosting becomes defensible
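
To see where those thresholds roughly come from, you can invert the cost formula: given a monthly self-hosting bill and an API price, find the daily volume at which the two are equal. This is a sketch under the same simplifications as the scenarios (30-day month, flat per-token price). It says nothing about whether the hardware can actually serve that volume, so treat it as a lower bound.

```python
def break_even_tokens_per_day(
    self_host_monthly_usd: float,
    api_price_per_million: float,
    days_per_month: int = 30,
) -> float:
    """Daily token volume at which API spend equals the self-hosting bill."""
    monthly_tokens = self_host_monthly_usd / api_price_per_million * 1_000_000
    return monthly_tokens / days_per_month


# Cheapest single-GPU rental vs. the low end of an 8× A100 cloud cluster,
# both compared against the $0.25/M price used in the scenarios.
for monthly_cost in (400, 4000):
    tokens = break_even_tokens_per_day(monthly_cost, api_price_per_million=0.25)
    print(f"${monthly_cost}/month breaks even at ~{tokens / 1e6:.0f}M tokens/day")
```

A $400/month GPU breaks even at about 53M tokens/day and a $4,000/month cluster at about 533M tokens/day, before any DevOps or overhead costs are added.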

What the API integration actually looks like

Under the hood, this is the part I care about. I don't want to learn yet another SDK. I want to point my existing OpenAI-compatible client at a base URL and be done with it.

Here's a stripped-down Python example using the standard openai library, pointed at a unified endpoint that exposes all those models:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GLOBAL_API_KEY"],
    base_url="https://global-apis.com/v1",
)

def classify_ticket(ticket_text: str) -> str:
    """Cheap classification: route to a tiny model."""
    resp = client.chat.completions.create(
        model="qwen3-8b",          # $0.01/M output, basically free
        messages=[
            {"role": "system", "content": "Classify the ticket into: billing, bug, feature, other."},
            {"role": "user", "content": ticket_text},
        ],
        temperature=0.0,
        max_tokens=8,
    )
    # content can be None, so fall back to an empty string before stripping
    return (resp.choices[0].message.content or "").strip()

def summarize_doc(doc: str) -> str:
    """Heavier lift: use the bigger model."""
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",  # $0.25/M output
        messages=[
            {"role": "system", "content": "Summarize in 3 bullet points."},
            {"role": "user", "content": doc},
        ],
        temperature=0.3,
        max_tokens=300,
    )
    return resp.choices[0].message.content or ""
```

Notice what I just did: I routed classification to Qwen3-8B (essentially free) and summarization to DeepSeek V4 Flash. With one client. If I want to swap the summarizer to Qwen3.5-27B or ByteDance Seed-OSS-36B for an A/B test, I change one string. No redeploy. No model download. No "did I quantize this right?"
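
One way to keep that "one string" in a single place is a small task-to-model table with an optional override for experiments. This is a sketch, not part of any SDK: the `MODEL_OVERRIDE_<TASK>` variable name is hypothetical, and the `qwen3.5-27b` identifier is a guess at how the endpoint spells that model.

```python
import os

# Task -> model name. The two defaults match the snippets above.
DEFAULT_ROUTES = {
    "classify": "qwen3-8b",
    "summarize": "deepseek-v4-flash",
}


def model_for(task: str, env=os.environ) -> str:
    """Return the model for a task, honoring a per-task override if set.

    MODEL_OVERRIDE_<TASK> is a hypothetical variable name for this sketch.
    """
    override = env.get(f"MODEL_OVERRIDE_{task.upper()}")
    if override:
        return override
    return DEFAULT_ROUTES[task]


# Default routing, then an A/B override for the summarizer.
print(model_for("classify", env={}))
print(model_for("summarize", env={"MODEL_OVERRIDE_SUMMARIZE": "qwen3.5-27b"}))
```

Each call site then passes `model=model_for("summarize")` instead of a hard-coded name, so an A/B test becomes a config change rather than a code change.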

Here's a streaming example, because every backend engineer eventually needs streaming:

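```python
def stream_response(prompt: str):
    """Yield text deltas as they arrive (reuses `client` from above)."""
    stream = client.chat.completions.create(
        model="qwen3-32b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g. a trailing usage chunk) carry no choices; skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```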

That works against any of the 184 models behind the same endpoint. Compare this to the version of my life where I was juggling vLLM, TGI, llama.cpp, and a custom Rust tokenizer wrapper. I do not miss it.


The operational comparison (TL;DR)

If you're a backend engineer evaluating this for real, the operational delta matters as much as the dollar delta:

| Concern | Self-hosting | API access |
|---|---|---|
| Setup time | Days to weeks | 5 minutes |
| Model switching | Re-deploy, re-config | Change one string |
| Scaling | Buy/rent more GPUs | Already scaled |
| Updates | Manual redeploy | Automatic |
| Multiple models | One per cluster | 184 models, 1 key |
| Uptime | Your pager | Provider's SLA |
| Cost at low volume | High (idle GPUs) | Pay-per-use |
| Cost at high volume | Competitive | Still competitive |

The "184 models, 1 API key" line is the one I underestimated. I used to think I'd pick one model and commit. In practice I run maybe six models in production right now, each for a different job. The blast radius of changing my mind is now zero.


When self-hosting still wins (the honest version)

I'm not going to pretend self-hosting is always wrong. Here are the cases where I'd still do it:

  1. Regulatory/data-residency constraints. If your tokens literally cannot leave your VPC, you have no choice. In that case, budget for the full stack and don't whine about it.

  2. Sustained >500M tokens/day with existing infrastructure. If you already own the GPUs and the team, the unit economics flip. At that point you're basically a hyperscaler and should be acting like one.

  3. Latency-critical, single-tenant workloads. If you need p99 < 50ms and you're colocated next to your
