SoftwareDevs mvpfactory.io

Posted on • Originally published at mvpfactory.io

Self-Hosting AI Models on a Budget VPS: A Practical Workshop

What We Are Building

By the end of this workshop, you will have a self-hosted LLM running on a budget VPS behind a request queue — and more importantly, you will know exactly when this setup makes financial sense versus just calling an API. Let me show you the numbers, the minimal setup, and the decision framework I use with every team that asks me about this.

Prerequisites

  • A VPS with at least 16 GB RAM and 4 vCPUs (Hetzner CX42 or equivalent, ~$24/month)
  • Docker installed on the instance
  • Basic comfort with the terminal
  • An honest willingness to look at spreadsheets

Step 1: Understand the Hardware Floor

Here is the gotcha that will save you hours: LLM inference is memory-bound, not compute-bound. The model's parameter count dictates your RAM floor before anything else.

| Model | Parameters | Min RAM (Q4 Quantized) | Recommended VPS |
| --- | --- | --- | --- |
| Phi-3 Mini | 3.8B | 3 GB | 8 GB / 4 vCPU |
| Llama 3.1 8B | 8B | 5 GB | 16 GB / 4 vCPU |
| Mistral 7B | 7.3B | 5 GB | 16 GB / 4 vCPU |
| Qwen2.5 32B | 32B | 20 GB | 64 GB / dedicated GPU |

The sweet spot for budget VPS is the 7–8B parameter range at Q4 quantization. Anything larger pushes you into $200+/month GPU territory, which destroys the cost argument entirely.

Step 2: Get Ollama Running

Ollama is the Docker of LLMs. Here is the minimal setup to get this working:

```shell
docker run -d --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --restart unless-stopped \
  ollama/ollama

docker exec -it ollama ollama pull llama3.1:8b-instruct-q4_K_M
```

Test it immediately:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "prompt": "Summarize the benefits of containerization in two sentences.",
  "stream": false
}'
```

On a 4-vCPU machine, expect roughly 8–15 tokens/second for a single request. That is adequate for internal tools and batch processing. It falls apart under concurrency.
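You do not have to guess at your tokens/second: Ollama's non-streaming response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The canned response below is illustrative; in practice, pipe the output of the `curl` test above into the same `jq` filter:

```shell
# tokens/sec = eval_count / (eval_duration converted to seconds)
resp='{"eval_count":120,"eval_duration":10000000000}'  # sample values
echo "$resp" | jq '.eval_count / (.eval_duration / 1e9)'
# → 12
```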

Step 3: When to Reach for vLLM Instead

If you have access to a GPU instance, vLLM gives you production-grade throughput with continuous batching:

```shell
# Note: --quantization awq requires AWQ-quantized weights; the base
# meta-llama checkpoint ships FP16, so point --model at an AWQ build.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct --quantization awq
```
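Because vLLM exposes an OpenAI-compatible API, you can smoke-test it with a standard chat completions call. The model name must match what you passed to `--model`, and `jq` (assumed installed) extracts the reply:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }' | jq -r '.choices[0].message.content'
```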

The performance gap is dramatic:

| Metric | Ollama (CPU, 8B Q4) | vLLM (GPU, 8B AWQ) |
| --- | --- | --- |
| Tokens/sec (single) | 8–15 | 40–80 |
| Tokens/sec (10 concurrent) | 1–3 per req | 25–50 per req |
| Time-to-first-token | 1–4 s | 0.1–0.3 s |

Step 4: Run the Cost Math

Let me show you a pattern I use in every project. Take a common workload — 500 requests/day, 500 input tokens, 300 output tokens each:

  • API (Claude Sonnet): ~$112/month, instant responses, frontier-quality output.
  • Ollama on $24/month VPS: Fixed cost, but each request takes 20–40 seconds. You need ~4 hours of sequential processing daily.
  • vLLM on GPU (~$150/month): Handles load well, but costs more than the API for a less capable model.

The break-even point is roughly 2,000+ requests/day with relaxed latency requirements and an 8B model that meets your quality bar.

Step 5: Production Hardening

Do not skip these if you go to production:

```yaml
# docker-compose.yml — add health checks
services:
  ollama:
    image: ollama/ollama
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/"]
      interval: 30s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 14G
```

Put a queue in front of inference. CPU inference cannot handle burst traffic. Redis with BullMQ or even a simple PostgreSQL-backed queue works fine.
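As a sketch of the shape this takes (not a production queue): a minimal Redis-backed worker, assuming `redis-cli` and `jq` are installed and producers `LPUSH` JSON jobs onto a list. In a real deployment, BullMQ or a Postgres-backed table replaces this loop:

```shell
# Producer side: enqueue a job
redis-cli LPUSH inference:queue '{"prompt": "Summarize our Q3 report."}'

# Worker side: block until a job arrives, run it through Ollama sequentially
while true; do
  job=$(redis-cli BRPOP inference:queue 0 | tail -n 1)   # BRPOP prints key then value
  payload=$(echo "$job" | jq '{model: "llama3.1:8b-instruct-q4_K_M", prompt: .prompt, stream: false}')
  curl -s http://localhost:11434/api/generate -d "$payload" | jq -r '.response'
done
```

The point of the loop is that exactly one request hits the model at a time, which keeps CPU inference predictable under bursty producers.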

Gotchas

  • OOM kills are your primary failure mode. Set Docker memory limits slightly below your VPS total RAM. Monitor with docker stats.
  • Upstream quantization changes silently alter output quality. Pin your model hash. Do not just pull latest.
  • The docs do not mention this, but Ollama's concurrent performance degrades non-linearly. Two concurrent requests do not halve throughput — they quarter it on CPU.
  • Engineering time is real cost. Eight hours of setup and tuning is $800–2,000 in salary. You need months of sustained savings to break even.

Conclusion

Start with the API. Profile your actual usage for two weeks. Then migrate only the workloads that are high-volume, latency-tolerant, and where a 7–8B model genuinely meets your quality threshold. Budget VPS means CPU-only means Ollama — accept the 10 tok/s ceiling and design around async processing.

Self-hosting wins for data sovereignty, high-volume classification and extraction tasks, and predictable billing. For everything else, the API is still your best friend.
