Hetzner Cloud for AI Projects — Complete GPU Server Setup & Cost Breakdown 2026
Running AI workloads on AWS or GCP is expensive. A single A100 instance on AWS costs $3-4 per hour — over $2,000 a month if you leave it running. For startups, indie developers, and small teams experimenting with AI, that math kills projects before they start.
Hetzner offers an alternative that most of the AI community outside Europe has not discovered yet. Budget cloud instances from €3.99/month for lightweight inference. Dedicated GPU servers with NVIDIA RTX 4000 Ada from €184/month. European data centers with flat monthly pricing and no bandwidth surprises.
This guide covers the full Hetzner AI server lineup, from $5/month CPU instances running tiny models to dedicated GPU servers handling production workloads. We will walk through actual setup, realistic performance expectations, and an honest cost comparison against AWS and GCP.
Why Hetzner for AI Workloads
Hetzner is a German hosting company that has been around since 1997. They are not a startup. They run their own data centers in Falkenstein, Nuremberg, and Helsinki. Their pricing has always been aggressive compared to US-based cloud providers, and that gap has only widened as AWS and GCP have raised prices.
The Price Gap Is Real
Hetzner's cost advantage is not 10-20% — it is 60-80% for equivalent compute. A Hetzner cloud server with 2 vCPUs and 4 GB RAM costs €3.99/month. A comparable instance on AWS (t3.medium) costs roughly $30/month. DigitalOcean and Vultr sit in between at $15-20/month for similar specs.
For AI workloads specifically, the gap gets even wider at the GPU tier. Hetzner's dedicated GPU servers start at €184/month. AWS GPU instances (g5.xlarge with A10G) start at roughly $1.00/hour — over $700/month for always-on use.
What Hetzner Does Well
- Flat monthly pricing. No surprise bandwidth bills, no hidden egress charges. Traffic is unlimited on most plans.
- EU data centers. Falkenstein and Helsinki give you GDPR compliance by default.
- Straightforward networking. Private networks, floating IPs, and load balancers at prices that make sense.
- ARM instances. Ampere-based CAX servers offer strong performance-per-euro for inference workloads.
What Hetzner Does Not Do
- No managed AI/ML services. No SageMaker equivalent, no managed Jupyter, no model registries. You manage everything yourself.
- No spot/preemptible instances. You cannot get cheap burst GPU time. It is flat monthly pricing or nothing.
- Limited GPU availability. Dedicated GPU servers can have waitlists. AWS and GCP have broader GPU SKU availability.
- No US data centers. If you need sub-50ms latency for US users, Hetzner is not the right choice.
The Hetzner AI Server Lineup
Hetzner offers multiple tiers for AI workloads. Here is the full spectrum from budget to production.
Tier 1: Cost-Optimized Cloud (CX Series) — €3.99-€22.49/mo
These are shared vCPU instances. No GPU. CPU-only inference for small models.
| Model | vCPU | RAM | Storage | Price |
|---|---|---|---|---|
| CX23 | 2 | 4 GB | 40 GB SSD | €3.99/mo (~$4.99) |
| CX33 | 4 | 8 GB | 80 GB SSD | €6.49/mo (~$8.09) |
| CX43 | 8 | 16 GB | 160 GB SSD | €11.99/mo (~$14.99) |
| CX53 | 16 | 32 GB | 320 GB SSD | €22.49/mo (~$28.09) |
AI use case: Running Ollama with small models (3B-7B parameters) for personal chatbots, lightweight RAG, or API-based inference for low-traffic applications. We covered this exact setup in our Ollama + Open WebUI self-hosting guide.
Realistic expectations: A CX23 can run a 3B model at roughly 3-6 tokens/second (CPU inference). A CX33 can handle a 7-8B model at 1-3 tokens/second. This is usable for async workflows but not for interactive chat.
Tier 2: ARM Cloud Instances (CAX Series) — Better Performance per Euro
Hetzner's Ampere-based ARM servers offer better compute efficiency than the x86 CX series at similar or lower price points.
| Model | vCPU (ARM) | RAM | Storage | Price |
|---|---|---|---|---|
| CAX11 | 2 | 4 GB | 40 GB SSD | €4.49/mo (~$5.59) |
| CAX21 | 4 | 8 GB | 80 GB SSD | €7.99/mo (~$9.99) |
| CAX31 | 8 | 16 GB | 160 GB SSD | €15.99/mo (~$19.99) |
| CAX41 | 16 | 32 GB | 320 GB SSD | €31.49/mo (~$39.29) |
AI use case: ARM chips handle inference workloads efficiently. Ollama has native ARM support, so these servers run small models with lower power draw and often better single-thread performance than the CX series at the same price. Good for always-on inference APIs.
Tier 3: GEX44 — Dedicated GPU Server (€184/mo)
This is where things get serious for AI workloads.
| Spec | Details |
|---|---|
| CPU | Intel Core i5-13500 (6P + 8E cores, HT) |
| GPU | NVIDIA RTX 4000 SFF Ada Generation, 20 GB GDDR6 ECC |
| RAM | 64 GB DDR4 |
| Storage | 2× 1.92 TB NVMe SSD Gen3 (RAID 1) |
| Network | 1 Gbit/s, unlimited traffic |
| Setup fee | €79 (one-time) |
| Monthly | €184/mo |
| Locations | Falkenstein (FSN1), Nuremberg (NBG1) |
AI use case: The RTX 4000 SFF Ada with 20 GB VRAM can run models up to ~32B parameters (4-bit quantized). It handles 7B-14B models comfortably with fast inference. This is the sweet spot for small teams running production AI inference, fine-tuning smaller models, or serving multiple users simultaneously.
The 20 GB of VRAM is the key spec. It puts this server above consumer RTX 4060/4070 cards (8-12 GB) and into territory where you can run meaningful models without aggressive quantization.
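A rough way to sanity-check whether a quantized model fits: the weights take roughly `parameters × bits / 8` bytes, plus a few GB for KV cache and activations. A back-of-the-envelope sketch (the fixed overhead figure is an assumption and grows with context length):

```python
def fits_in_vram(params_b: float, bits: int = 4, vram_gb: float = 20.0,
                 overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weights plus a fixed allowance for KV cache/activations."""
    weights_gb = params_b * bits / 8  # e.g. 32B params at 4-bit ≈ 16 GB of weights
    return weights_gb + overhead_gb <= vram_gb

# 32B at 4-bit: 16 GB weights + ~2 GB overhead → fits in 20 GB
print(fits_in_vram(32, bits=4))  # True
# 70B at 4-bit: 35 GB of weights alone → does not fit
print(fits_in_vram(70, bits=4))  # False
```

This lines up with the model table later in this guide: 14B and 32B (Q4) models fit the GEX44, 70B does not.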
Tier 4: GEX131 — High-End GPU Server
For production AI workloads that need serious GPU compute.
| Spec | Details |
|---|---|
| CPU | Intel Xeon Gold 5412U |
| GPU | NVIDIA RTX PRO 6000 Blackwell Max-Q, 96 GB GDDR7 ECC |
| RAM | 256 GB DDR5 ECC (expandable to 768 GB) |
| Storage | 2× 960 GB NVMe SSD Datacenter Edition (RAID 1) |
| Network | 1 Gbit/s, unlimited traffic |
| Monthly | €889/mo (~$989) |
| Locations | Helsinki (HEL1), Falkenstein (FSN1), Nuremberg (NBG1) |
AI use case: With 96 GB of VRAM, this server can run 70B+ parameter models at full precision, handle multiple concurrent inference requests, or fine-tune large models. The 5th-generation Tensor Cores and Blackwell architecture make this competitive with cloud A100 instances at a fraction of the cost.
256 GB of system RAM with expansion to 768 GB also makes this viable for large-scale RAG deployments where you need to keep embedding databases in memory.
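For a rough sizing of that RAG scenario: a raw float32 vector index takes about `vectors × dimensions × 4` bytes, before any index structure overhead. A quick estimate (the vector count and dimension here are illustrative):

```python
def embedding_index_gb(n_vectors: int, dim: int = 1024, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage; real indexes (HNSW etc.) add graph overhead on top."""
    return n_vectors * dim * bytes_per_float / 1024**3

# 50M vectors at dim 1024 in float32 → roughly 190 GB:
# beyond the base 256 GB once you add the OS, the model, and index overhead headroom
print(round(embedding_index_gb(50_000_000), 1))
```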
Budget Path: Running Small LLMs on CX/CAX Instances
You do not need a GPU to run AI inference. CPU-only inference with quantized models is slow but functional — and incredibly cheap.
What You Can Run
On a CX23 (€3.99/month, 4 GB RAM):
- Llama 3.2 3B (Q4) — Fits in ~2-3 GB. General chat and simple tasks.
- Phi-3.5 Mini 3.8B (Q4) — Microsoft's efficient model. Good for code and reasoning.
- TinyLlama 1.1B — Fast even on CPU. Useful for classification and simple generation.
On a CX33 (€6.49/month, 8 GB RAM):
- Llama 3.1 8B (Q4) — Solid general model. ~5 GB loaded.
- Gemma 2 2B — Google's efficient model. Punches above its weight.
- Qwen 2.5 7B (Q4) — Excellent for multilingual use cases.
Setup
Install Docker and run Ollama:
```bash
# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Ollama
docker run -d --name ollama -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama:latest

# Pull a model
docker exec -it ollama ollama pull llama3.2:3b
```
For a full web interface, add Open WebUI as described in our Ollama self-hosting guide. If you are running multiple services on the same server, a deployment platform like Coolify or Dokploy simplifies container management significantly.
Performance Reality Check
CPU inference is measured in single-digit tokens per second. Here is what to expect:
| Model | Server | Speed (approx.) | Usability |
|---|---|---|---|
| TinyLlama 1.1B | CX23 | [ESTIMATED] 8-15 tok/s | Responsive for short queries |
| Llama 3.2 3B | CX23 | [ESTIMATED] 3-6 tok/s | Slow but usable |
| Llama 3.1 8B | CX33 | [ESTIMATED] 1-3 tok/s | Async workflows only |
| Qwen 2.5 7B | CX33 | [ESTIMATED] 1-3 tok/s | Async workflows only |
These numbers are usable for: API backends with tolerant timeouts, batch processing, personal assistants where you can wait a few seconds, and development/testing before deploying to GPU servers.
They are not usable for: real-time chat with multiple users, latency-sensitive applications, or anything requiring more than a few concurrent requests.
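A quick sanity check before committing to a CPU tier: estimate the worst-case response time from your expected output length and the throughput figures above. A minimal sketch:

```python
def response_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Generation time only; prompt processing adds more latency on CPU."""
    return output_tokens / tok_per_s

# A 150-token answer at 2 tok/s (an 8B model on a CX33) takes over a minute —
# fine for a batch job, unacceptable for interactive chat
print(response_seconds(150, 2))  # 75.0
```

If that number is longer than your users (or your HTTP timeouts) will tolerate, skip straight to the GPU tier.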
GPU Path: Setting Up the GEX44
The GEX44 at €184/month is the entry point for serious AI work on Hetzner. Here is how to set it up from scratch.
Step 1: Order and Initial Access
Order from the Hetzner Robot panel. Expect the €79 setup fee on your first invoice. Provisioning typically takes 1-3 business days for dedicated servers (unlike cloud instances which spin up in seconds).
Once provisioned, you will receive root SSH access:
```bash
ssh root@your-server-ip
```
Step 2: Install NVIDIA Drivers
The GEX44 comes with bare metal access. You need to install GPU drivers:
```bash
# Update system
apt update && apt upgrade -y

# Install NVIDIA driver dependencies
apt install -y build-essential linux-headers-$(uname -r)

# Install NVIDIA drivers (Ubuntu 22.04/24.04)
apt install -y nvidia-driver-550

# Reboot
reboot
```
After reboot, verify the GPU is recognized:
```bash
nvidia-smi
```
You should see the RTX 4000 SFF Ada with 20 GB VRAM listed.
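If you want to check this programmatically rather than eyeball the `nvidia-smi` table, the CSV query mode (`nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`) is easy to parse. A small sketch — the exact name string reported for your card may differ:

```python
def parse_gpu_info(csv_line: str) -> tuple[str, int]:
    """Parse one line of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`."""
    name, mem = csv_line.rsplit(",", 1)
    mem_mib = int(mem.strip().split()[0])  # memory.total is reported like "20475 MiB"
    return name.strip(), mem_mib

name, mib = parse_gpu_info("NVIDIA RTX 4000 SFF Ada Generation, 20475 MiB")
print(name, mib)
```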
Step 3: Install Docker with GPU Support
```bash
# Install Docker
curl -fsSL https://get.docker.com | sh

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt update && apt install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
```
Verify Docker can see the GPU:
```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
Step 4: Deploy Ollama with GPU Acceleration
Create docker-compose.yml:
```yaml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
```

Then start the stack:

```bash
docker compose up -d
```
Step 5: Pull Models That Fit 20 GB VRAM
With 20 GB of VRAM, you can run substantial models:
```bash
# 14B model — fits easily, fast inference
docker exec -it ollama ollama pull phi4:14b

# 32B model (Q4) — fits in ~18 GB, good quality
docker exec -it ollama ollama pull qwen2.5:32b-instruct-q4_K_M

# Coding-specific model
docker exec -it ollama ollama pull qwen2.5-coder:14b
```
What the GEX44 Can Actually Run
| Model | VRAM Usage | Speed (GPU) | Quality |
|---|---|---|---|
| Llama 3.1 8B | ~5 GB | [ESTIMATED] 40-60 tok/s | Good general use |
| Phi-4 14B | ~8 GB | [ESTIMATED] 25-40 tok/s | Strong reasoning |
| Qwen 2.5 Coder 14B | ~8 GB | [ESTIMATED] 25-40 tok/s | Excellent for code |
| Qwen 2.5 32B (Q4) | ~18 GB | [ESTIMATED] 12-20 tok/s | High quality writing |
| Llama 3.3 70B (Q4) | ~40 GB | Does not fit | — |
The sweet spot is 14B models. They fit comfortably in 20 GB with room for context, run at speeds that feel interactive, and deliver quality that is genuinely useful for production work.
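To measure your own tokens/second rather than trust the estimates above, note that Ollama's `/api/generate` final response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal calculation, with illustrative numbers:

```python
def tokens_per_second(resp: dict) -> float:
    """Compute generation throughput from Ollama's /api/generate response fields."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Sample fields from a benchmark run (illustrative numbers): 300 tokens in 10 s
resp = {"eval_count": 300, "eval_duration": 10_000_000_000}
print(tokens_per_second(resp))  # 30.0
```

Run a few prompts of realistic length and average the result; short prompts flatter the numbers.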
Step 6: Set Up HTTPS
For remote access, add Caddy as a reverse proxy:
```bash
apt install -y caddy
```
Edit /etc/caddy/Caddyfile:
```
ai.yourdomain.com {
    reverse_proxy localhost:3000
}
```
```bash
systemctl reload caddy
```
Caddy handles SSL automatically. Access your AI at https://ai.yourdomain.com.
Cost Comparison: Hetzner vs AWS vs GCP
Here is an honest comparison for equivalent GPU compute, based on always-on monthly pricing as of early 2026.
Entry-Level GPU Tier
| Provider | Instance | GPU | VRAM | Monthly Cost |
|---|---|---|---|---|
| Hetzner | GEX44 | RTX 4000 SFF Ada | 20 GB | €184/mo (~$230) |
| AWS | g5.xlarge | A10G | 24 GB | ~$760/mo |
| GCP | g2-standard-4 | L4 | 24 GB | ~$580/mo |
| Lambda | gpu_1x_a10 | A10 | 24 GB | ~$440/mo |
Hetzner is 2.5-3.3× cheaper than hyperscalers for comparable GPU compute. The trade-offs: no managed ML services, manual setup, and EU-only data centers.
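The always-on comparison reduces to a break-even point: flat monthly price divided by the on-demand hourly rate gives the hours per month above which the flat rate wins. A quick sketch using the approximate figures from the table:

```python
def breakeven_hours(flat_monthly: float, hourly: float) -> float:
    """Hours per month above which a flat-rate server beats on-demand hourly billing."""
    return flat_monthly / hourly

# GEX44 at ~$230/mo vs a ~$1.00/hr on-demand GPU instance:
# above ~230 hours/month (about 7.6 h/day) the flat rate wins
print(round(breakeven_hours(230, 1.00)))  # 230
```

For genuinely bursty workloads below that threshold, hourly (or spot) cloud pricing stays cheaper.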
Budget CPU Tier (No GPU)
| Provider | Instance | vCPU | RAM | Monthly Cost |
|---|---|---|---|---|
| Hetzner | CX23 | 2 | 4 GB | €3.99/mo (~$5) |
| AWS | t3.medium | 2 | 4 GB | ~$30/mo |
| GCP | e2-medium | 2 | 4 GB | ~$25/mo |
| DigitalOcean | Basic | 2 | 4 GB | ~$18/mo |
At the budget tier, Hetzner is 4-6× cheaper. This is where it shines for development, testing, and low-traffic inference.
What the Cloud Providers Offer That Hetzner Does Not
- AWS SageMaker / GCP Vertex AI — Managed model training, deployment, and monitoring. If you need MLOps at scale, Hetzner's bare metal cannot compete.
- Spot/preemptible instances — AWS spot pricing can bring GPU costs down 60-70% for interruptible workloads. Hetzner has no equivalent.
- Global regions — AWS has 30+ regions worldwide. Hetzner has 3 European locations.
- Auto-scaling — Cloud providers scale GPU instances based on demand. Hetzner dedicated servers are fixed capacity.
Bottom line: Hetzner wins on predictable, always-on workloads where you know your compute needs. Hyperscalers win on variable demand, managed services, and global distribution.
Deployment with Docker and Coolify
If you are running multiple AI services (Ollama, vector databases, monitoring) alongside other applications on the same Hetzner server, manual Docker Compose management gets tedious.
This is where a self-hosted PaaS like Coolify or Dokploy adds value. We compared both platforms in detail in our Coolify vs Dokploy comparison. The short version:
- Coolify — More mature, better for multi-service deployments, built-in database management.
- Dokploy — Simpler, lighter footprint, good if Ollama is your primary workload.
Either one gives you a web dashboard for managing containers, automatic SSL, Git-based deployments, and basic monitoring — without touching the command line every time you need to update a container.
For a full walkthrough of running Coolify on Hetzner alongside other developer tools, see our self-hosting dev stack guide.
Our Infrastructure at Effloow
At Effloow, we run 14 AI agents that handle everything from content research to code generation. Our infrastructure choices reflect the same cost-conscious thinking behind this guide.
We use Hetzner cloud instances for non-GPU workloads: deployment platforms, Git hosting, monitoring, and lightweight services. The flat monthly pricing means our infrastructure bill is predictable regardless of how many articles the agents produce.
For AI inference specifically, we use a mix of API services (Claude, GPT) for tasks requiring frontier intelligence and self-hosted models for high-volume, lower-complexity work. The GEX44 tier is compelling for teams at our stage — it is enough GPU to run production inference at a cost that does not require venture capital to sustain.
The decision framework we use internally:
- Need frontier intelligence (complex reasoning, creative work)? → Use API services.
- Need high-volume, predictable inference? → Self-host on Hetzner GPU.
- Need lightweight, always-on AI? → CX/CAX instance with small models.
- Need managed MLOps at scale? → Use AWS/GCP (we do not, but many teams should).
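That framework is simple enough to express as a lookup — a toy sketch with made-up task labels, just to show the shape of the routing logic:

```python
def pick_backend(task: str) -> str:
    """Toy router mirroring the decision framework above (labels are illustrative)."""
    routes = {
        "frontier": "api",       # complex reasoning, creative work → Claude/GPT APIs
        "bulk": "hetzner-gpu",   # high-volume, predictable inference → GEX44
        "light": "hetzner-cpu",  # lightweight always-on AI → CX/CAX + small model
        "mlops": "hyperscaler",  # managed training/deployment at scale → AWS/GCP
    }
    return routes.get(task, "api")  # default to APIs when unsure

print(pick_backend("bulk"))  # hetzner-gpu
```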
Choosing the Right Tier
Here is a quick decision guide:
CX23 (€3.99/mo) — Start Here If...
- You are experimenting with self-hosted AI for the first time
- You need a personal chatbot or simple RAG pipeline
- Your queries are infrequent and latency is not critical
- Budget is the primary constraint
CX33/CAX21 (€6.49-€7.99/mo) — Upgrade When...
- You need 7-8B models with slightly better response times
- You are running the AI alongside other services (Git, CI, monitoring)
- Multiple people on your team need occasional access
GEX44 (€184/mo) — The AI Sweet Spot If...
- You need interactive-speed inference (30+ tokens/second)
- You want to run 14B-32B models with real quality
- Multiple users need concurrent access
- You are building products or services that rely on AI inference
- Fine-tuning smaller models is part of your workflow
GEX131 — Production AI If...
- You need 70B+ models at full precision
- Multi-user production inference is a requirement
- You are fine-tuning large models regularly
- You need 96 GB VRAM for large embedding databases or multi-model serving
Getting Started: Your First Hour
If you are new to Hetzner, here is the fastest path to running AI:
```bash
# 1. Sign up at hetzner.com and create a cloud project
# 2. Create a CX23 instance (€3.99/mo) via the console
#    - Choose Ubuntu 24.04
#    - Add your SSH key
#    - Pick Falkenstein or Helsinki

# 3. SSH into your server
ssh root@your-server-ip

# 4. Install Docker
curl -fsSL https://get.docker.com | sh

# 5. Run Ollama
docker run -d --name ollama -p 11434:11434 \
  -v ollama_data:/root/.ollama ollama/ollama:latest

# 6. Pull a small model
docker exec -it ollama ollama pull llama3.2:3b

# 7. Test it
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Hello, how are you?"}'
```
Total time: under 10 minutes. Total cost: €3.99 for the first month.
When you outgrow the CX23, migrate your Ollama data volume to a bigger instance. When you need GPU speed, order a GEX44 and follow the GPU setup section above.
Conclusion
Hetzner is not the right choice for every AI workload. If you need managed ML services, global data centers, or spot pricing for burst GPU compute, the hyperscalers are still the answer.
But for predictable, always-on AI infrastructure at a fraction of the cost — personal AI assistants, team inference servers, self-hosted chatbots, development and testing environments — Hetzner is hard to beat.
The lineup covers the full spectrum: €3.99/month for experimentation, €184/month for production GPU inference, and higher tiers for serious AI workloads. All with flat pricing, unlimited bandwidth, and EU data residency.
Start with a CX23 and a 3B model. See if self-hosted inference fits your workflow. If it does, the upgrade path is straightforward — bigger instances, better models, and eventually dedicated GPU hardware, all from the same provider.