---
title: "The Actual Cost of Self-Hosting Your LLM (Nobody Does This Math First)"
published: true
tags: [infrastructure, llm, devops, sre]
canonical_url:
---
TL;DR: Self-hosting an LLM looks cheaper until you add up everything else. This post walks through the real numbers — compute, storage, networking, and the ops overhead you will absolutely pay. Know the full picture before you spin up that g5 instance.
## The Decision That Seems Obvious
Your team is spending $4k/month on OpenAI. Someone runs the numbers on a g5.xlarge. "We could host our own model and cut costs by 70%." Sounds great on paper.
Then you actually do it.
I've watched this play out a few times. The initial compute estimate is accurate. It's just nowhere near the full picture. So let me walk you through what the bill actually looks like twelve months in.
## Compute: The Number You Actually Know
A g5.xlarge runs about $1.006/hour on-demand. Running inference 24/7 for a mid-sized model, that's roughly $740/month per instance. For any real load, you'll need more than one.
Step up to a g5.12xlarge (four A10Gs) to serve a 7B model at real throughput, or to fit something bigger? You're at $5.67/hour. That's $4,082/month before you've touched anything else.
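That arithmetic is worth scripting once so nobody re-derives it in a spreadsheet. A minimal sketch, assuming 730 billable hours per month (exact hours shift the total a few percent) and the on-demand rates quoted above:

```python
# Rough monthly on-demand compute cost, before storage, egress, or people.
HOURS_PER_MONTH = 730  # 24 * 365 / 12

def monthly_compute(hourly_rate: float, instances: int = 1) -> float:
    return hourly_rate * HOURS_PER_MONTH * instances

print(f"g5.xlarge  : ${monthly_compute(1.006):,.0f}/mo")
print(f"g5.12xlarge: ${monthly_compute(5.672):,.0f}/mo")
```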
Spot instances help. You can get 50–70% off. But spot instances get interrupted, and your inference service needs to handle that gracefully or users see errors. Not impossible, but that's engineering work.
```yaml
# EKS node group config for spot GPU instances
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: inference-cluster   # cluster name and region are placeholders
  region: us-east-1
managedNodeGroups:
  - name: gpu-spot
    instanceTypes: ["g5.xlarge", "g5.2xlarge"]
    spot: true
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    labels:
      workload: llm-inference
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    iam:
      withAddonPolicies:
        autoScaler: true
```
The autoscaler handles scaling down. You still need logic to drain in-flight requests before a spot interruption kills the node. AWS gives you a two-minute warning, so your request timeout has to live inside that window.
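A sketch of the gating logic, assuming the documented two-minute notice. The IMDS path is the real spot interruption endpoint (IMDSv2 would also need a session token, omitted here); the drain margin is a placeholder you'd tune:

```python
import urllib.request

# The spot interruption notice appears at this IMDS path roughly two
# minutes before reclaim; until then the endpoint returns 404.
IMDS_SPOT = "http://169.254.169.254/latest/meta-data/spot/instance-action"
WARNING_SECONDS = 120
DRAIN_MARGIN = 10  # assumed slack for pod teardown and connection draining

def interruption_pending(timeout: float = 1.0) -> bool:
    """True once EC2 has posted an interruption notice for this node."""
    try:
        urllib.request.urlopen(IMDS_SPOT, timeout=timeout)
        return True
    except OSError:
        return False

def request_budget(seconds_since_notice: float) -> float:
    """Max runtime for a request accepted after the notice; 0 means reject."""
    return max(0.0, WARNING_SECONDS - DRAIN_MARGIN - seconds_since_notice)
```

Poll `interruption_pending()` in a sidecar or readiness probe, and stop admitting requests whose budget would outlive the node.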
## Storage, Networking, and the Bits Everyone Forgets
Model weights are not small. Llama 3 8B in fp16 is about 16GB. Mixtral 8x7B hits 87GB. You're storing these in S3 or EFS and loading them on pod startup.
EFS for shared model weights runs roughly $0.30/GB/month. For a 50GB model, that's $15/month. Fine. But EFS read throughput matters at startup: loading 50GB cold takes minutes per pod. Pre-warm your nodes, or bake the weights into the node AMI or an EBS snapshot so each pod reads from local disk.
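The cold-start math is worth a back-of-envelope check before you commit to a storage layer. The 100 MB/s figure below is an assumption; actual EFS throughput depends on mode and provisioned capacity:

```python
def cold_load_minutes(model_gb: float, read_mb_per_s: float) -> float:
    """Time to stream model weights from shared storage into a fresh pod."""
    return model_gb * 1024 / read_mb_per_s / 60

# 50GB at an assumed 100 MB/s of sustained read throughput:
# a little over 8 minutes per pod, on every cold start.
```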
Networking is where it gets sneaky. Data transfer out of AWS runs $0.09/GB once you're past the monthly free tier. If you're serving inference to users outside AWS — which is most teams — every response adds up. And streamed responses are bigger than you'd guess: with per-token SSE framing, a 500-token completion easily averages 50–100KB on the wire. A million requests a month at 100KB is 100GB, about $9 in egress. At 10 million requests that's $90/month, and nobody put it in the budget.
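A tiny egress estimator makes the scaling visible. Decimal GB is an assumed simplification, close enough for budgeting:

```python
EGRESS_PER_GB = 0.09  # internet egress above AWS's monthly free tier

def monthly_egress(requests: int, avg_response_kb: float) -> float:
    gb = requests * avg_response_kb / 1e6  # decimal GB for simplicity
    return gb * EGRESS_PER_GB

print(monthly_egress(1_000_000, 100))   # 100GB, roughly $9
print(monthly_egress(10_000_000, 100))  # 1TB, roughly $90
```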
## The Ops Tax Is Real
This is the one that bites hardest. You now own:
- Model versioning and rollout
- GPU driver updates
- CUDA compatibility between driver and runtime
- OOM debugging (GPU OOM behaves nothing like CPU OOM)
- Observability for inference latency, token throughput, GPU utilisation
- On-call when the model service falls over at 2am
That last one is the real cost. Human time. If your team hasn't run LLM inference at scale before, budget a few months of learning. GPU OOM crashes are cryptic. NCCL errors for multi-GPU setups will make you question your life choices.
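The observability line item is concrete: at minimum you want per-replica request latency and token throughput. A stdlib-only sketch of the rolling stats you'd normally export to Prometheus instead:

```python
from collections import deque
from statistics import median

class InferenceStats:
    """Rolling window of per-request latency and token counts (a sketch;
    a real deployment would export these as Prometheus metrics)."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # (latency_seconds, tokens)

    def record(self, latency_s: float, tokens: int) -> None:
        self.samples.append((latency_s, tokens))

    def p50_latency(self) -> float:
        return median(latency for latency, _ in self.samples)

    def tokens_per_second(self) -> float:
        total_time = sum(latency for latency, _ in self.samples)
        total_tokens = sum(tokens for _, tokens in self.samples)
        return total_tokens / total_time if total_time else 0.0
```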
## Self-Hosted vs Managed API: The Honest Table
| Factor | Self-Hosted | Managed API |
|---|---|---|
| Compute cost | Medium (but tunable) | High per token |
| Ops burden | High | Near zero |
| Latency control | Full | Limited |
| Model flexibility | Full | Vendor-locked |
| Data privacy | Full | Depends on vendor |
| Setup time | Days to weeks | Minutes |
| Scaling complexity | You own it | Handled |
No universally right answer here. High-volume batch workloads with infrastructure chops? Self-hosting wins. Need to move fast? Managed APIs are fine.
Some teams go hybrid: managed API for interactive use, self-hosted batch for cost-sensitive workloads. Tools like Bifrost help route between providers and track actual spend across both, which makes the hybrid model easier to reason about.
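The routing itself can start embarrassingly simple. This is not Bifrost's API, just the shape of the decision a gateway ends up owning; both endpoint names are made up for the example:

```python
# Toy hybrid router: latency-sensitive traffic goes to a managed API,
# cost-sensitive batch work goes to the self-hosted cluster.
def route(request_kind: str) -> str:
    if request_kind == "interactive":
        return "managed-api"       # pay per token, zero ops
    return "self-hosted-batch"     # keep the GPUs busy, amortise the fleet
```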
## Trade-offs and Limitations
Self-hosting wins on cost only when GPU utilisation is high. A GPU sitting at 20% utilisation is still billed at 100%. You need to be hitting 60–70% to beat per-token API pricing.
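The utilisation dependence drops straight out of the arithmetic. The throughput and cost figures below are assumed examples, not benchmarks:

```python
HOURS_PER_MONTH = 730

def cost_per_million_tokens(monthly_compute: float,
                            peak_tokens_per_s: float,
                            utilisation: float) -> float:
    """$/1M tokens for a box with a given peak throughput and average load."""
    tokens = peak_tokens_per_s * utilisation * HOURS_PER_MONTH * 3600
    return monthly_compute / tokens * 1e6

# Assumed numbers: a $734/mo instance doing 1,000 tok/s at full batch.
# At 20% utilisation every token costs 3.5x what it does at 70%.
```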
Spot instances cut compute cost but add operational complexity. Async batch jobs can tolerate interruptions. Real-time inference is harder to design around.
Multi-GPU setups for larger models introduce inter-node networking costs and NCCL configuration pain. Worth it at scale. Not worth it for a team of five still figuring out their workload shape.
Model quantisation (int4, int8) can cut memory requirements dramatically. Running Llama 3 70B in int4 on two A10G GPUs is real. The quality trade-off is acceptable for most tasks, but always benchmark before committing to a quantisation level in production.
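The memory estimate is just parameters times bytes per parameter, plus headroom. The 20% allowance for KV cache and activations is a crude assumption; measure against your real batch sizes:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_b: float, dtype: str, headroom: float = 1.2) -> float:
    """Weight memory plus an assumed 20% allowance for KV cache/activations."""
    return params_b * BYTES_PER_PARAM[dtype] * headroom

# Llama 3 70B in int4: ~42GB, which fits across two 24GB A10Gs.
# The same model in fp16 wants ~168GB -- a different class of hardware.
```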
## The Actual Recommendation
Run the full numbers. Not the compute number. The full number: compute, storage, networking, engineering time, on-call time.
If the math works and your team has the capacity, self-hosting is genuinely solid. You get control, data privacy, and cost efficiency at volume.
If you're not sure yet, start with a managed API. Get your usage patterns nailed down. Then revisit.
No worries either way. Both paths are valid. The mistake is assuming one is clearly better without doing the math first.