## What We Are Building
By the end of this workshop, you will have a self-hosted LLM running on a budget VPS behind a request queue — and more importantly, you will know exactly when this setup makes financial sense versus just calling an API. Let me show you the numbers, the minimal setup, and the decision framework I use with every team that asks me about this.
## Prerequisites
- A VPS with at least 16 GB RAM and 4 vCPUs (Hetzner CX42 or equivalent, ~$24/month)
- Docker installed on the instance
- Basic comfort with the terminal
- An honest willingness to look at spreadsheets
## Step 1: Understand the Hardware Floor
Here is the gotcha that will save you hours: LLM inference is memory-bound, not compute-bound. The model's parameter count dictates your RAM floor before anything else.
| Model | Parameters | Min RAM (Q4 Quantized) | Recommended VPS |
|---|---|---|---|
| Phi-3 Mini | 3.8B | 3 GB | 8 GB / 4 vCPU |
| Llama 3.1 8B | 8B | 5 GB | 16 GB / 4 vCPU |
| Mistral 7B | 7.3B | 5 GB | 16 GB / 4 vCPU |
| Qwen2.5 32B | 32B | 20 GB | 64 GB / dedicated GPU |
The sweet spot for budget VPS is the 7–8B parameter range at Q4 quantization. Anything larger pushes you into $200+/month GPU territory, which destroys the cost argument entirely.
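The RAM floors in the table follow a rule of thumb you can sketch yourself: weight memory is roughly parameter count times the quantized bit width, plus a fixed allowance for KV cache and runtime overhead. The 4.5 bits/weight and 1 GB overhead below are assumptions chosen to roughly reproduce the table, not exact figures; longer context windows push the real number higher.

```python
def min_ram_gb(params_billion: float, bits_per_weight: float = 4.5,
               overhead_gb: float = 1.0) -> float:
    """Rough RAM floor: quantized weights plus a fixed allowance for
    KV cache and runtime overhead. A heuristic, not a guarantee."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, billions in [("Phi-3 Mini", 3.8), ("Llama 3.1 8B", 8.0),
                       ("Qwen2.5 32B", 32.0)]:
    print(f"{name}: ~{min_ram_gb(billions):.1f} GB")
```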
## Step 2: Get Ollama Running
Ollama is the Docker of LLMs. Here is the minimal setup to get this working:
```bash
docker run -d --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --restart unless-stopped \
  ollama/ollama

docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M
```
Test it immediately:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "prompt": "Summarize the benefits of containerization in two sentences.",
  "stream": false
}'
```
On a 4-vCPU machine, expect roughly 8–15 tokens/second for a single request. That is adequate for internal tools and batch processing. It falls apart under concurrency.
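If you want to measure the rate rather than eyeball it, Ollama's non-streaming response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which you can divide directly. A minimal sketch, assuming Ollama is listening on its default port:

```python
import json
import urllib.request

def generate(prompt: str, model: str = "llama3.1:8b-instruct-q4_K_M",
             host: str = "http://localhost:11434") -> dict:
    """Non-streaming call to Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())

def tokens_per_second(reply: dict) -> float:
    # eval_duration is reported in nanoseconds.
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)

# Usage: print(tokens_per_second(generate("Say hello in five words.")))
```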
## Step 3: When to Reach for vLLM Instead
If you have access to a GPU instance, vLLM gives you production-grade throughput with continuous batching:
```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --quantization awq
```
The performance gap is dramatic:
| Metric | Ollama (CPU, 8B Q4) | vLLM (GPU, 8B AWQ) |
|---|---|---|
| Tokens/sec (single) | 8–15 | 40–80 |
| Tokens/sec (10 concurrent) | 1–3 per req | 25–50 per req |
| Time-to-first-token | 1–4s | 0.1–0.3s |
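You can reproduce the concurrency columns yourself with a small thread-pool probe. The `call` argument here is a placeholder for whatever function performs one inference round-trip against your server; plug in your own HTTP client.

```python
import concurrent.futures
import time
from typing import Callable

def requests_per_second(call: Callable[[], None],
                        n_concurrent: int, n_requests: int) -> float:
    """Fire n_requests through n_concurrent workers and return
    wall-clock throughput in requests/second."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(n_concurrent) as pool:
        # Consume the iterator so we wait for every request to finish.
        list(pool.map(lambda _: call(), range(n_requests)))
    return n_requests / (time.perf_counter() - start)
```

Run it once with one worker and once with ten: on CPU-backed Ollama the per-request rate collapses, while vLLM's continuous batching keeps it nearly flat.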
## Step 4: Run the Cost Math
Let me show you a pattern I use in every project. Take a common workload — 500 requests/day, 500 input tokens, 300 output tokens each:
- API (Claude Sonnet): ~$112/month, instant responses, frontier-quality output.
- Ollama on $24/month VPS: Fixed cost, but each request takes 20–40 seconds. You need ~4 hours of sequential processing daily.
- vLLM on GPU (~$150/month): Handles load well, but costs more than the API for a less capable model.
The break-even point is roughly 2,000+ requests/day with relaxed latency requirements and an 8B model that meets your quality bar.
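The arithmetic behind those bullets is worth keeping as a function. The $3/$15 per-million-token rates below are illustrative assumptions (check your provider's current pricing), and the break-even computed here covers raw hosting cost only, before engineering time and quality trade-offs, which is why the practical threshold quoted above sits much higher.

```python
def monthly_api_cost(reqs_per_day: int, in_tokens: int, out_tokens: int,
                     usd_per_m_in: float, usd_per_m_out: float,
                     days: int = 30) -> float:
    """Token-billed API cost for a steady daily workload."""
    per_req = (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1e6
    return per_req * reqs_per_day * days

def break_even_reqs_per_day(vps_monthly: float, in_tokens: int,
                            out_tokens: int, usd_per_m_in: float,
                            usd_per_m_out: float, days: int = 30) -> float:
    """Daily volume at which a fixed-cost VPS matches API spend."""
    per_req = (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1e6
    return vps_monthly / (per_req * days)

# Workload from above: 500 req/day, 500 in / 300 out tokens,
# at assumed rates of $3/M input and $15/M output.
print(monthly_api_cost(500, 500, 300, 3.0, 15.0))  # 90.0 under these rates
print(break_even_reqs_per_day(24.0, 500, 300, 3.0, 15.0))
```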
## Step 5: Production Hardening
Do not skip these if you go to production:
```yaml
# docker-compose.yml — add health checks
services:
  ollama:
    image: ollama/ollama
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/"]
      interval: 30s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 14G
```
Put a queue in front of inference. CPU inference cannot handle burst traffic. Redis with BullMQ or even a simple PostgreSQL-backed queue works fine.
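The queue does not need to be elaborate to help. The sketch below is a minimal in-process version of the idea: a single worker thread drains requests one at a time, so bursts wait in line instead of contending for CPU. Here `infer` is a hypothetical stand-in for your actual Ollama call; a Redis- or PostgreSQL-backed queue applies the same shape across processes.

```python
import queue
import threading

def infer(prompt: str) -> str:
    # Hypothetical stand-in for the real Ollama HTTP call.
    return f"response to: {prompt}"

jobs: queue.Queue = queue.Queue()

def worker() -> None:
    # Single consumer: CPU inference only ever sees one request
    # at a time; burst traffic is absorbed by the queue.
    while True:
        prompt, reply_q = jobs.get()
        try:
            reply_q.put(infer(prompt))
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(prompt: str, timeout: float = 120.0) -> str:
    """Enqueue a prompt and block until its result is ready."""
    reply_q: queue.Queue = queue.Queue(maxsize=1)
    jobs.put((prompt, reply_q))
    return reply_q.get(timeout=timeout)
```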
## Gotchas

- OOM kills are your primary failure mode. Set Docker memory limits slightly below your VPS total RAM. Monitor with `docker stats`.
- Upstream quantization changes silently alter output quality. Pin your model hash. Do not just pull `latest`.
- The docs do not mention this, but Ollama's concurrent performance degrades non-linearly. Two concurrent requests do not halve throughput — they quarter it on CPU.
- Engineering time is a real cost. Eight hours of setup and tuning is $800–2,000 in salary. You need months of sustained savings to break even.
## Conclusion
Start with the API. Profile your actual usage for two weeks. Then migrate only the workloads that are high-volume, latency-tolerant, and where a 7–8B model genuinely meets your quality threshold. Budget VPS means CPU-only means Ollama — accept the 10 tok/s ceiling and design around async processing.
Self-hosting wins for data sovereignty, high-volume classification and extraction tasks, and predictable billing. For everything else, the API is still your best friend.