Hetzner Cloud for AI Projects — Complete GPU Server Setup & Cost Breakdown 2026
Running AI workloads on AWS or GCP is expensive. A single A100 instance on AWS costs $3-4 per hour — over $2,000 a month if you leave it running. For startups, indie developers, and small teams experimenting with AI, that math kills projects before they start.
Hetzner offers an alternative that most of the AI community outside Europe has not discovered yet. Budget cloud instances from €3.99/month for lightweight inference. Dedicated GPU servers with NVIDIA RTX 4000 Ada from €184/month. European data centers with flat monthly pricing and no bandwidth surprises.
This guide covers the full Hetzner AI server lineup, from $5/month CPU instances running tiny models to dedicated GPU servers handling production workloads. We will walk through actual setup, realistic performance expectations, and an honest cost comparison against AWS and GCP.
Why Hetzner for AI Workloads
Hetzner is a German hosting company that has been around since 1997. They are not a startup. They run their own data centers in Falkenstein, Nuremberg, and Helsinki. Their pricing has always been aggressive compared to US-based cloud providers, and that gap has only widened as AWS and GCP have raised prices.
The Price Gap Is Real
Hetzner's cost advantage is not 10-20% — it is 60-80% for equivalent compute. A Hetzner cloud server with 2 vCPUs and 4 GB RAM costs €3.99/month. A comparable instance on AWS (t3.medium) costs roughly $30/month. DigitalOcean and Vultr sit in between at $15-20/month for similar specs.
For AI workloads specifically, the gap gets even wider at the GPU tier. Hetzner's dedicated GPU servers start at €184/month. AWS GPU instances (g5.xlarge with A10G) start at roughly $1.00/hour — over $700/month for always-on use.
What Hetzner Does Well
- Flat monthly pricing. No surprise bandwidth bills, no hidden egress charges. Traffic is unlimited on most plans.
- EU data centers. Falkenstein and Helsinki give you GDPR compliance by default.
- Straightforward networking. Private networks, floating IPs, and load balancers at prices that make sense.
- ARM instances. Ampere-based CAX servers offer strong performance-per-euro for inference workloads.
What Hetzner Does Not Do
- No managed AI/ML services. No SageMaker equivalent, no managed Jupyter, no model registries. You manage everything yourself.
- No spot/preemptible instances. You cannot get cheap burst GPU time. It is flat monthly pricing or nothing.
- Limited GPU availability. Dedicated GPU servers can have waitlists. AWS and GCP have broader GPU SKU availability.
- No US data centers. If you need sub-50ms latency for US users, Hetzner is not the right choice.
The Hetzner AI Server Lineup
Hetzner offers multiple tiers for AI workloads. Here is the full spectrum from budget to production.
Tier 1: Cost-Optimized Cloud (CX Series) — €3.99-€22.49/mo
These are shared vCPU instances. No GPU. CPU-only inference for small models.
| Model | vCPU | RAM | Storage | Price |
|---|---|---|---|---|
| CX23 | 2 | 4 GB | 40 GB SSD | €3.99/mo (~$4.99) |
| CX33 | 4 | 8 GB | 80 GB SSD | €6.49/mo (~$8.09) |
| CX43 | 8 | 16 GB | 160 GB SSD | €11.99/mo (~$14.99) |
| CX53 | 16 | 32 GB | 320 GB SSD | €22.49/mo (~$28.09) |
AI use case: Running Ollama with small models (3B-7B parameters) for personal chatbots, lightweight RAG, or API-based inference for low-traffic applications. We covered this exact setup in our Ollama + Open WebUI self-hosting guide.
Realistic expectations: A CX23 can run a 3B model at roughly 3-6 tokens/second (CPU inference). A CX33 can handle a 7-8B model at 1-3 tokens/second. This is usable for async workflows but not for interactive chat.
Tier 2: ARM Cloud Instances (CAX Series) — Better Performance per Euro
Hetzner's Ampere-based ARM servers offer better compute efficiency than the x86 CX series at similar or lower price points.
| Model | vCPU (ARM) | RAM | Storage | Price |
|---|---|---|---|---|
| CAX11 | 2 | 4 GB | 40 GB SSD | €4.49/mo (~$5.59) |
| CAX21 | 4 | 8 GB | 80 GB SSD | €7.99/mo (~$9.99) |
| CAX31 | 8 | 16 GB | 160 GB SSD | €15.99/mo (~$19.99) |
| CAX41 | 16 | 32 GB | 320 GB SSD | €31.49/mo (~$39.29) |
AI use case: ARM chips handle inference workloads efficiently. Ollama has native ARM support, so these servers run small models with lower power draw and often better single-thread performance than the CX series at the same price. Good for always-on inference APIs.
Tier 3: GEX44 — Dedicated GPU Server (€184/mo)
This is where things get serious for AI workloads.
| Spec | Details |
|---|---|
| CPU | Intel Core i5-13500 (6P + 8E cores, HT) |
| GPU | NVIDIA RTX 4000 SFF Ada Generation, 20 GB GDDR6 ECC |
| RAM | 64 GB DDR4 |
| Storage | 2× 1.92 TB NVMe SSD Gen3 (RAID 1) |
| Network | 1 Gbit/s, unlimited traffic |
| Setup fee | €79 (one-time) |
| Monthly | €184/mo |
| Locations | Falkenstein (FSN1), Nuremberg (NBG1) |
AI use case: The RTX 4000 SFF Ada with 20 GB VRAM can run models up to ~32B parameters (4-bit quantized). It handles 7B-14B models comfortably with fast inference. This is the sweet spot for small teams running production AI inference, fine-tuning smaller models, or serving multiple users simultaneously.
The 20 GB of VRAM is the key spec. It puts this server above consumer RTX 4060/4070 cards (8-12 GB) and into territory where you can run meaningful models without aggressive quantization.
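A rough way to sanity-check whether a quantized model fits: the weights take roughly `parameters × bits / 8` bytes, plus a few GB for KV cache and activations. A back-of-the-envelope sketch (the fixed overhead figure is an assumption and grows with context length):

```python
def fits_in_vram(params_b: float, bits: int = 4, vram_gb: float = 20.0,
                 overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights plus a fixed allowance for KV cache/activations."""
    weights_gb = params_b * bits / 8  # e.g. 32B params at 4-bit ≈ 16 GB of weights
    return weights_gb + overhead_gb <= vram_gb

# 32B at 4-bit: 16 GB weights + ~2 GB overhead → fits in 20 GB
print(fits_in_vram(32, bits=4))  # True
# 70B at 4-bit: 35 GB of weights alone → does not fit
print(fits_in_vram(70, bits=4))  # False
```

This lines up with the model table later in this guide: 14B and 32B (Q4) models fit the GEX44, 70B does not.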
Tier 4: GEX131 — High-End GPU Server
For production AI workloads that need serious GPU compute.
| Spec | Details |
|---|---|
| CPU | Intel Xeon Gold 5412U |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q, 96 GB GDDR7 ECC |
| RAM | 256 GB DDR5 ECC (expandable to 768 GB) |
| Storage | 2× 960 GB NVMe SSD Datacenter Edition (RAID 1) |
| Network | 1 Gbit/s, unlimited traffic |
| Monthly | €889/mo (~$989) |
| Locations | Helsinki (HEL1), Falkenstein (FSN1), Nuremberg (NBG1) |
AI use case: With 96 GB of VRAM, this server can run 70B+ parameter models at full precision, handle multiple concurrent inference requests, or fine-tune large models. The 5th-generation Tensor Cores and Blackwell architecture make this competitive with cloud A100 instances at a fraction of the cost.
256 GB of system RAM with expansion to 768 GB also makes this viable for large-scale RAG deployments where you need to keep embedding databases in memory.
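For a rough sizing of that RAG scenario: a raw float32 vector index takes about `vectors × dimensions × 4` bytes, before any index structure overhead. A quick estimate (the vector count and dimension here are illustrative):

```python
def embedding_index_gb(n_vectors: int, dim: int = 1024, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage; real indexes (HNSW etc.) add graph overhead on top."""
    return n_vectors * dim * bytes_per_float / 1024**3

# 50M vectors at dim 1024 in float32 → roughly 190 GB:
# beyond the base 256 GB once you add the OS, the model, and index overhead headroom
print(round(embedding_index_gb(50_000_000), 1))
```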
Budget Path: Running Small LLMs on CX/CAX Instances
You do not need a GPU to run AI inference. CPU-only inference with quantized models is slow but functional — and incredibly cheap.
What You Can Run
On a CX23 (€3.99/month, 4 GB RAM):
- Llama 3.2 3B (Q4) — Fits in ~2-3 GB. General chat and simple tasks.
- Phi-3.5 Mini 3.8B (Q4) — Microsoft's efficient model. Good for code and reasoning.
- TinyLlama 1.1B — Fast even on CPU. Useful for classification and simple generation.
On a CX33 (€6.49/month, 8 GB RAM):
- Llama 3.1 8B (Q4) — Solid general model. ~5 GB loaded.
- Gemma 2 2B — Google's efficient model. Punches above its weight.
- Qwen 2.5 7B (Q4) — Excellent for multilingual use cases.
Setup
Install Docker and run Ollama:
```bash
# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Ollama
docker run -d --name ollama -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama:latest

# Pull a model
docker exec -it ollama ollama pull llama3.2:3b
```
For a full web interface, add Open WebUI as described in our Ollama self-hosting guide. If you are running multiple services on the same server, a deployment platform like Coolify or Dokploy simplifies container management significantly.
Performance Reality Check
CPU inference is measured in single-digit tokens per second. Here is what to expect:
| Model | Server | Speed (approx.) | Usability |
|---|---|---|---|
| TinyLlama 1.1B | CX23 | [ESTIMATED] 8-15 tok/s | Responsive for short queries |
| Llama 3.2 3B | CX23 | [ESTIMATED] 3-6 tok/s | Slow but usable |
| Llama 3.1 8B | CX33 | [ESTIMATED] 1-3 tok/s | Async workflows only |
| Qwen 2.5 7B | CX33 | [ESTIMATED] 1-3 tok/s | Async workflows only |
These numbers are usable for: API backends with tolerant timeouts, batch processing, personal assistants where you can wait a few seconds, and development/testing before deploying to GPU servers.
They are not usable for: real-time chat with multiple users, latency-sensitive applications, or anything requiring more than a few concurrent requests.
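A quick sanity check before committing to a CPU tier: estimate the worst-case response time from your expected output length and the throughput figures above. A minimal sketch:

```python
def response_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Generation time only; prompt processing adds more latency on CPU."""
    return output_tokens / tok_per_s

# A 150-token answer at 2 tok/s (an 8B model on a CX33) takes over a minute —
# fine for a batch job, unacceptable for interactive chat
print(response_seconds(150, 2))  # 75.0
```

If that number is longer than your users (or your HTTP timeouts) will tolerate, skip straight to the GPU tier.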
GPU Path: Setting Up the GEX44
The GEX44 at €184/month is the entry point for serious AI work on Hetzner. Here is how to set it up from scratch.
Step 1: Order and Initial Access
Order from the Hetzner Robot panel. Expect the €79 setup fee on your first invoice. Provisioning typically takes 1-3 business days for dedicated servers (unlike cloud instances which spin up in seconds).
Once provisioned, you will receive root SSH access:
```bash
ssh root@your-server-ip
```
Step 2: Install NVIDIA Drivers
The GEX44 comes with bare metal access. You need to install GPU drivers:
```bash
# Update system
apt update && apt upgrade -y

# Install NVIDIA driver dependencies
apt install -y build-essential linux-headers-$(uname -r)

# Install NVIDIA drivers (Ubuntu 22.04/24.04)
apt install -y nvidia-driver-550

# Reboot
reboot
```
After reboot, verify the GPU is recognized:
```bash
nvidia-smi
```
You should see the RTX 4000 SFF Ada with 20 GB VRAM listed.
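If you want to check this programmatically rather than eyeball the `nvidia-smi` table, the CSV query mode (`nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`) is easy to parse. A small sketch — the exact name string reported for your card may differ:

```python
def parse_gpu_info(csv_line: str) -> tuple[str, int]:
    """Parse one line of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`."""
    name, mem = csv_line.rsplit(",", 1)
    mem_mib = int(mem.strip().split()[0])  # memory.total is reported like "20475 MiB"
    return name.strip(), mem_mib

name, mib = parse_gpu_info("NVIDIA RTX 4000 SFF Ada Generation, 20475 MiB")
print(name, mib)
```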
Step 3: Install Docker with GPU Support
```bash
# Install Docker
curl -fsSL https://get.docker.com | sh

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
```
Verify Docker can see the GPU:
```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
Step 4: Deploy Ollama with GPU Acceleration
Create docker-compose.yml:
```yaml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
```

Then start the stack:

```bash
docker compose up -d
```
Step 5: Pull Models That Fit 20 GB VRAM
With 20 GB of VRAM, you can run substantial models:
```bash
# 14B model — fits easily, fast inference
docker exec -it ollama ollama pull phi4:14b

# 32B model (Q4) — fits in ~18 GB, good quality
docker exec -it ollama ollama pull qwen2.5:32b-instruct-q4_K_M

# Coding-specific model
docker exec -it ollama ollama pull qwen2.5-coder:14b
```
What the GEX44 Can Actually Run
| Model | VRAM Usage | Speed (GPU) | Quality |
|---|---|---|---|
| Llama 3.1 8B | ~5 GB | [ESTIMATED] 40-60 tok/s | Good general use |
| Phi-4 14B | ~8 GB | [ESTIMATED] 25-40 tok/s | Strong reasoning |
| Qwen 2.5 Coder 14B | ~8 GB | [ESTIMATED] 25-40 tok/s | Excellent for code |
| Qwen 2.5 32B (Q4) | ~18 GB | [ESTIMATED] 12-20 tok/s | High quality writing |
| Llama 3.3 70B (Q4) | ~40 GB | Does not fit | — |
The sweet spot is 14B models. They fit comfortably in 20 GB with room for context, run at speeds that feel interactive, and deliver quality that is genuinely useful for production work.
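To measure your own tokens/second rather than trust the estimates above, note that Ollama's `/api/generate` final response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal calculation, with illustrative numbers:

```python
def tokens_per_second(resp: dict) -> float:
    """Compute generation throughput from Ollama's /api/generate response fields."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Sample fields from a benchmark run (illustrative numbers): 300 tokens in 10 s
resp = {"eval_count": 300, "eval_duration": 10_000_000_000}
print(tokens_per_second(resp))  # 30.0
```

Run a few prompts of realistic length and average the result; short prompts flatter the numbers.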
Step 6: Set Up HTTPS
For remote access, add Caddy as a reverse proxy:
```bash
apt install -y caddy
```
Edit /etc/caddy/Caddyfile:
```
ai.yourdomain.com {
    reverse_proxy localhost:3000
}
```
```bash
systemctl reload caddy
```
Caddy handles SSL automatically. Access your AI at https://ai.yourdomain.com.
Cost Comparison: Hetzner vs AWS vs GCP
Here is an honest comparison for equivalent GPU compute, based on always-on monthly pricing as of early 2026.
Entry-Level GPU Tier
| Provider | Instance | GPU | VRAM | Monthly Cost |
|---|---|---|---|---|
| Hetzner | GEX44 | RTX 4000 SFF Ada | 20 GB | €184/mo (~$230) |
| AWS | g5.xlarge | A10G | 24 GB | ~$760/mo |
| GCP | g2-standard-4 | L4 | 24 GB | ~$580/mo |
| Lambda | gpu_1x_a10 | A10 | 24 GB | ~$440/mo |
Hetzner is 2.5-3.3× cheaper than hyperscalers for comparable GPU compute. The trade-offs: no managed ML services, manual setup, and EU-only data centers.
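The always-on comparison reduces to a break-even point: flat monthly price divided by the on-demand hourly rate gives the hours per month above which the flat rate wins. A quick sketch using the approximate figures from the table:

```python
def breakeven_hours(flat_monthly: float, hourly: float) -> float:
    """Hours per month above which a flat-rate server beats on-demand hourly billing."""
    return flat_monthly / hourly

# GEX44 at ~$230/mo vs a ~$1.00/hr on-demand GPU instance:
# above ~230 hours/month (about 7.6 h/day) the flat rate wins
print(round(breakeven_hours(230, 1.00)))  # 230
```

For genuinely bursty workloads below that threshold, hourly (or spot) cloud pricing stays cheaper.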
Budget CPU Tier (No GPU)
| Provider | Instance | vCPU | RAM | Monthly Cost |
|---|---|---|---|---|
| Hetzner | CX23 | 2 | 4 GB | €3.99/mo (~$5) |
| AWS | t3.medium | 2 | 4 GB | ~$30/mo |
| GCP | e2-medium | 2 | 4 GB | ~$25/mo |
| DigitalOcean | Basic | 2 | 4 GB | ~$18/mo |
At the budget tier, Hetzner is 4-6× cheaper. This is where it shines for development, testing, and low-traffic inference.
What the Cloud Providers Offer That Hetzner Does Not
- AWS SageMaker / GCP Vertex AI — Managed model training, deployment, and monitoring. If you need MLOps at scale, Hetzner's bare metal cannot compete.
- Spot/preemptible instances — AWS spot pricing can bring GPU costs down 60-70% for interruptible workloads. Hetzner has no equivalent.
- Global regions — AWS has 30+ regions worldwide. Hetzner has 3 European locations.
- Auto-scaling — Cloud providers scale GPU instances based on demand. Hetzner dedicated servers are fixed capacity.
Bottom line: Hetzner wins on predictable, always-on workloads where you know your compute needs. Hyperscalers win on variable demand, managed services, and global distribution.
Deployment with Docker and Coolify
If you are running multiple AI services (Ollama, vector databases, monitoring) alongside other applications on the same Hetzner server, manual Docker Compose management gets tedious.
This is where a self-hosted PaaS like Coolify or Dokploy adds value. We compared both platforms in detail in our Coolify vs Dokploy comparison. The short version:
- Coolify — More mature, better for multi-service deployments, built-in database management.
- Dokploy — Simpler, lighter footprint, good if Ollama is your primary workload.
Either one gives you a web dashboard for managing containers, automatic SSL, Git-based deployments, and basic monitoring — without touching the command line every time you need to update a container.
For a full walkthrough of running Coolify on Hetzner alongside other developer tools, see our self-hosting dev stack guide.
Our Infrastructure at Effloow
At Effloow, we run 14 AI agents that handle everything from content research to code generation. Our infrastructure choices reflect the same cost-conscious thinking behind this guide.
We use Hetzner cloud instances for non-GPU workloads: deployment platforms, Git hosting, monitoring, and lightweight services. The flat monthly pricing means our infrastructure bill is predictable regardless of how many articles the agents produce.
For AI inference specifically, we use a mix of API services (Claude, GPT) for tasks requiring frontier intelligence and self-hosted models for high-volume, lower-complexity work. The GEX44 tier is compelling for teams at our stage — it is enough GPU to run production inference at a cost that does not require venture capital to sustain.
The decision framework we use internally:
- Need frontier intelligence (complex reasoning, creative work)? → Use API services.
- Need high-volume, predictable inference? → Self-host on Hetzner GPU.
- Need lightweight, always-on AI? → CX/CAX instance with small models.
- Need managed MLOps at scale? → Use AWS/GCP (we do not, but many teams should).
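That framework is simple enough to express as a lookup — a toy sketch with made-up task labels, just to show the shape of the routing logic:

```python
def pick_backend(task: str) -> str:
    """Toy router mirroring the decision framework above (labels are illustrative)."""
    routes = {
        "frontier": "api",       # complex reasoning, creative work → Claude/GPT APIs
        "bulk": "hetzner-gpu",   # high-volume, predictable inference → GEX44
        "light": "hetzner-cpu",  # lightweight always-on AI → CX/CAX + small model
        "mlops": "hyperscaler",  # managed training/deployment at scale → AWS/GCP
    }
    return routes.get(task, "api")  # default to APIs when unsure

print(pick_backend("bulk"))  # hetzner-gpu
```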
Choosing the Right Tier
Here is a quick decision guide:
CX23 (€3.99/mo) — Start Here If...
- You are experimenting with self-hosted AI for the first time
- You need a personal chatbot or simple RAG pipeline
- Your queries are infrequent and latency is not critical
- Budget is the primary constraint
CX33/CAX21 (€6.49-€7.99/mo) — Upgrade When...
- You need 7-8B models with slightly better response times
- You are running the AI alongside other services (Git, CI, monitoring)
- Multiple people on your team need occasional access
GEX44 (€184/mo) — The AI Sweet Spot If...
- You need interactive-speed inference (30+ tokens/second)
- You want to run 14B-32B models with real quality
- Multiple users need concurrent access
- You are building products or services that rely on AI inference
- Fine-tuning smaller models is part of your workflow
GEX131 — Production AI If...
- You need 70B+ models at full precision
- Multi-user production inference is a requirement
- You are fine-tuning large models regularly
- You need 96 GB VRAM for large embedding databases or multi-model serving
Getting Started: Your First Hour
If you are new to Hetzner, here is the fastest path to running AI:
```bash
# 1. Sign up at hetzner.com and create a cloud project
# 2. Create a CX23 instance (€3.99/mo) via the console
#    - Choose Ubuntu 24.04
#    - Add your SSH key
#    - Pick Falkenstein or Helsinki

# 3. SSH into your server
ssh root@your-server-ip

# 4. Install Docker
curl -fsSL https://get.docker.com | sh

# 5. Run Ollama
docker run -d --name ollama -p 11434:11434 \
  -v ollama_data:/root/.ollama ollama/ollama:latest

# 6. Pull a small model
docker exec -it ollama ollama pull llama3.2:3b

# 7. Test it
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello, how are you?"}'
```
Total time: under 10 minutes. Total cost: €3.99 for the first month.
When you outgrow the CX23, migrate your Ollama data volume to a bigger instance. When you need GPU speed, order a GEX44 and follow the GPU setup section above.
Conclusion
Hetzner is not the right choice for every AI workload. If you need managed ML services, global data centers, or spot pricing for burst GPU compute, the hyperscalers are still the answer.
But for predictable, always-on AI infrastructure at a fraction of the cost — personal AI assistants, team inference servers, self-hosted chatbots, development and testing environments — Hetzner is hard to beat.
The lineup covers the full spectrum: €3.99/month for experimentation, €184/month for production GPU inference, and higher tiers for serious AI workloads. All with flat pricing, unlimited bandwidth, and EU data residency.
Start with a CX23 and a 3B model. See if self-hosted inference fits your workflow. If it does, the upgrade path is straightforward — bigger instances, better models, and eventually dedicated GPU hardware, all from the same provider.