---
title: "The Actual Cost of Self-Hosting Your LLM (Nobody Does This Math First)"
published: true
tags: [infrastructure, llm, devops, sre]
canonical_url:
---
TL;DR: Self-hosting an LLM looks cheaper until you add up everything else. This post walks through the real numbers — compute, storage, networking, and the ops overhead you will absolutely pay. Know the full picture before you spin up that g5 instance.
## The Decision That Seems Obvious
Your team is spending $4k/month on OpenAI. Someone runs the numbers on a g5.xlarge. "We could host our own model and cut costs by 70%." Sounds great on paper.
Then you actually do it.
I've watched this play out a few times. The initial compute estimate is accurate. It's just nowhere near the full picture. So let me walk you through what the bill actually looks like twelve months in.
## Compute: The Number You Actually Know
A g5.xlarge runs about $1.006/hour on-demand. Running inference 24/7 for a mid-sized model, that's roughly $740/month per instance. For any real load, you'll need more than one.
Step up to a g5.12xlarge (four A10Gs) to serve a 7B model at real throughput, or to fit something bigger? You're at $5.67/hour. That's $4,082/month before you've touched anything else.
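That arithmetic is worth scripting once so nobody re-derives it in a spreadsheet. A minimal sketch, assuming 730 billable hours per month (exact hours shift the total a few percent) and the on-demand rates quoted above:

```python
# Rough monthly on-demand compute cost, before storage, egress, or people.
HOURS_PER_MONTH = 730  # 24 * 365 / 12

def monthly_compute(hourly_rate: float, instances: int = 1) -> float:
    return hourly_rate * HOURS_PER_MONTH * instances

print(f"g5.xlarge  : ${monthly_compute(1.006):,.0f}/mo")
print(f"g5.12xlarge: ${monthly_compute(5.672):,.0f}/mo")
```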
Spot instances help. You can get 50–70% off. But spot instances get interrupted, and your inference service needs to handle that gracefully or users see errors. Not impossible, but that's engineering work.
```yaml
# EKS node group config for spot GPU instances
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: inference-cluster   # cluster name and region are placeholders
  region: us-east-1
managedNodeGroups:
  - name: gpu-spot
    instanceTypes: ["g5.xlarge", "g5.2xlarge"]
    spot: true
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    labels:
      workload: llm-inference
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    iam:
      withAddonPolicies:
        autoScaler: true
```
The autoscaler handles scaling down. You still need logic to drain in-flight requests before a spot interruption kills the node. AWS gives you a two-minute warning, so your request timeout has to live inside that window.
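A sketch of the gating logic, assuming the documented two-minute notice. The IMDS path is the real spot interruption endpoint (IMDSv2 would also need a session token, omitted here); the drain margin is a placeholder you'd tune:

```python
import urllib.request

# The spot interruption notice appears at this IMDS path roughly two
# minutes before reclaim; until then the endpoint returns 404.
IMDS_SPOT = "http://169.254.169.254/latest/meta-data/spot/instance-action"
WARNING_SECONDS = 120
DRAIN_MARGIN = 10  # assumed slack for pod teardown and connection draining

def interruption_pending(timeout: float = 1.0) -> bool:
    """True once EC2 has posted an interruption notice for this node."""
    try:
        urllib.request.urlopen(IMDS_SPOT, timeout=timeout)
        return True
    except OSError:
        return False

def request_budget(seconds_since_notice: float) -> float:
    """Max runtime for a request accepted after the notice; 0 means reject."""
    return max(0.0, WARNING_SECONDS - DRAIN_MARGIN - seconds_since_notice)
```

Poll `interruption_pending()` in a sidecar or readiness probe, and stop admitting requests whose budget would outlive the node.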
## Storage, Networking, and the Bits Everyone Forgets
Model weights are not small. Llama 3 8B in fp16 is about 16GB. Mixtral 8x7B hits 87GB. You're storing these in S3 or EFS and loading them on pod startup.
EFS for shared model weights runs roughly $0.30/GB/month. For a 50GB model, that's $15/month. Fine. But EFS read throughput matters at startup: loading 50GB cold takes minutes per pod. Pre-warm your nodes, or bake the weights into the node AMI or an EBS snapshot so each pod reads from local disk.
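The cold-start math is worth a back-of-envelope check before you commit to a storage layer. The 100 MB/s figure below is an assumption; actual EFS throughput depends on mode and provisioned capacity:

```python
def cold_load_minutes(model_gb: float, read_mb_per_s: float) -> float:
    """Time to stream model weights from shared storage into a fresh pod."""
    return model_gb * 1024 / read_mb_per_s / 60

# 50GB at an assumed 100 MB/s of sustained read throughput:
# a little over 8 minutes per pod, on every cold start.
```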
Networking is where it gets sneaky. Data transfer out of AWS runs $0.09/GB once you're past the monthly free tier. If you're serving inference to users outside AWS — which is most teams — every response adds up. And streamed responses are bigger than you'd guess: with per-token SSE framing, a 500-token completion easily averages 50–100KB on the wire. A million requests a month at 100KB is 100GB, about $9 in egress. At 10 million requests that's $90/month, and nobody put it in the budget.
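A tiny egress estimator makes the scaling visible. Decimal GB is an assumed simplification, close enough for budgeting:

```python
EGRESS_PER_GB = 0.09  # internet egress above AWS's monthly free tier

def monthly_egress(requests: int, avg_response_kb: float) -> float:
    gb = requests * avg_response_kb / 1e6  # decimal GB for simplicity
    return gb * EGRESS_PER_GB

print(monthly_egress(1_000_000, 100))   # 100GB, roughly $9
print(monthly_egress(10_000_000, 100))  # 1TB, roughly $90
```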
## The Ops Tax Is Real
This is the one that bites hardest. You now own:
- Model versioning and rollout
- GPU driver updates
- CUDA compatibility between driver and runtime
- OOM debugging (GPU OOM behaves nothing like CPU OOM)
- Observability for inference latency, token throughput, GPU utilisation
- On-call when the model service falls over at 2am
That last one is the real cost. Human time. If your team hasn't run LLM inference at scale before, budget a few months of learning. GPU OOM crashes are cryptic. NCCL errors for multi-GPU setups will make you question your life choices.
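The observability line item is concrete: at minimum you want per-replica request latency and token throughput. A stdlib-only sketch of the rolling stats you'd normally export to Prometheus instead:

```python
from collections import deque
from statistics import median

class InferenceStats:
    """Rolling window of per-request latency and token counts (a sketch;
    a real deployment would export these as Prometheus metrics)."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # (latency_seconds, tokens)

    def record(self, latency_s: float, tokens: int) -> None:
        self.samples.append((latency_s, tokens))

    def p50_latency(self) -> float:
        return median(latency for latency, _ in self.samples)

    def tokens_per_second(self) -> float:
        total_time = sum(latency for latency, _ in self.samples)
        total_tokens = sum(tokens for _, tokens in self.samples)
        return total_tokens / total_time if total_time else 0.0
```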
## Self-Hosted vs Managed API: The Honest Table
| Factor | Self-Hosted | Managed API |
|---|---|---|
| Compute cost | Medium (but tunable) | High per token |
| Ops burden | High | Near zero |
| Latency control | Full | Limited |
| Model flexibility | Full | Vendor-locked |
| Data privacy | Full | Depends on vendor |
| Setup time | Days to weeks | Minutes |
| Scaling complexity | You own it | Handled |
No universally right answer here. High-volume batch workloads with infrastructure chops? Self-hosting wins. Need to move fast? Managed APIs are fine.
Some teams go hybrid: managed API for interactive use, self-hosted batch for cost-sensitive workloads. Tools like Bifrost help route between providers and track actual spend across both, which makes the hybrid model easier to reason about.
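The routing itself can start embarrassingly simple. This is not Bifrost's API, just the shape of the decision a gateway ends up owning; both endpoint names are made up for the example:

```python
# Toy hybrid router: latency-sensitive traffic goes to a managed API,
# cost-sensitive batch work goes to the self-hosted cluster.
def route(request_kind: str) -> str:
    if request_kind == "interactive":
        return "managed-api"       # pay per token, zero ops
    return "self-hosted-batch"     # keep the GPUs busy, amortise the fleet
```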
## Trade-offs and Limitations
Self-hosting wins on cost only when GPU utilisation is high. A GPU sitting at 20% utilisation is still billed at 100%. You need to be hitting 60–70% to beat per-token API pricing.
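The utilisation dependence drops straight out of the arithmetic. The throughput and cost figures below are assumed examples, not benchmarks:

```python
HOURS_PER_MONTH = 730

def cost_per_million_tokens(monthly_compute: float,
                            peak_tokens_per_s: float,
                            utilisation: float) -> float:
    """$/1M tokens for a box with a given peak throughput and average load."""
    tokens = peak_tokens_per_s * utilisation * HOURS_PER_MONTH * 3600
    return monthly_compute / tokens * 1e6

# Assumed numbers: a $734/mo instance doing 1,000 tok/s at full batch.
# At 20% utilisation every token costs 3.5x what it does at 70%.
```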
Spot instances cut compute cost but add operational complexity. Async batch jobs can tolerate interruptions. Real-time inference is harder to design around.
Multi-GPU setups for larger models introduce inter-node networking costs and NCCL configuration pain. Worth it at scale. Not worth it for a team of five still figuring out their workload shape.
Model quantisation (int4, int8) can cut memory requirements dramatically. Running Llama 3 70B in int4 on two A10G GPUs is real. The quality trade-off is acceptable for most tasks, but always benchmark before committing to a quantisation level in production.
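The memory estimate is just parameters times bytes per parameter, plus headroom. The 20% allowance for KV cache and activations is a crude assumption; measure against your real batch sizes:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_b: float, dtype: str, headroom: float = 1.2) -> float:
    """Weight memory plus an assumed 20% allowance for KV cache/activations."""
    return params_b * BYTES_PER_PARAM[dtype] * headroom

# Llama 3 70B in int4: ~42GB, which fits across two 24GB A10Gs.
# The same model in fp16 wants ~168GB -- a different class of hardware.
```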
## The Actual Recommendation
Run the full numbers. Not the compute number. The full number: compute, storage, networking, engineering time, on-call time.
If the math works and your team has the capacity, self-hosting is genuinely solid. You get control, data privacy, and cost efficiency at volume.
If you're not sure yet, start with a managed API. Get your usage patterns nailed down. Then revisit.
No worries either way. Both paths are valid. The mistake is assuming one is clearly better without doing the math first.