Why Running AI Locally Actually Matters in 2026
Every AI tool you use — ChatGPT, Copilot, Claude — sends your data to someone else's server. For most developers, that's fine. For companies handling customer data under GDPR, it's a compliance nightmare waiting to happen.
I've spent the last year building a fully self-hosted AI stack for a small Austrian engineering firm. No cloud. No data leaving our datacenter. Full GDPR Article 30 compliance. And honestly — it's faster than most cloud APIs.
Here's what I learned.
The GDPR Problem With Cloud AI
When you send a query to a cloud LLM:
- Your data crosses into a third country (US data centers = Chapter V GDPR transfer)
- You need an Article 28 Data Processing Agreement with the provider
- You need to document it in your Article 30 Register
- If there's a breach, you must notify the supervisory authority within 72 hours (Article 33)
For a small team, this paperwork alone is a reason to go local.
The Stack (What We're Actually Running)
Here's our production setup on a 5-node Docker Swarm:
Hardware: 1x server + 1x workstation with RTX 3090 (24GB VRAM)
OS: Proxmox VE → Ubuntu VMs
Services:
- Ollama → Local LLM inference (Mistral, Llama3, Qwen)
- Open WebUI → Chat interface (like ChatGPT, but yours)
- Whisper STT → Speech-to-text, fully local
- Piper TTS → Text-to-speech, runs on CPU
- ChromaDB → Vector database for RAG
- n8n → Workflow automation (local, not cloud)
- Prometheus + Grafana → Monitoring
- Mattermost → Team communication (self-hosted Slack)
Total cost: ~€800-1200 for the GPU workstation (used RTX 3090).
Monthly running cost: ~€40 electricity.
Compare to: GPT-4-class APIs at $10-30 per million tokens. A team running 100K queries/month at roughly 1,000 tokens per query burns ~100M tokens = $1,000-3,000/month.
Break-even: 1-3 months.
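The break-even figure above is simple arithmetic: one-time hardware cost divided by the monthly saving. A quick sketch using this post's own estimates (the function and its inputs are illustrative, not measurements):

```python
def breakeven_months(hardware_eur: float, monthly_electricity_eur: float,
                     monthly_cloud_eur: float) -> float:
    """Months until the one-time hardware cost is recovered by the
    monthly saving versus a cloud API bill."""
    monthly_saving = monthly_cloud_eur - monthly_electricity_eur
    if monthly_saving <= 0:
        raise ValueError("cloud bill must exceed local running costs")
    return hardware_eur / monthly_saving

# Figures from this post: €800-1200 workstation, ~€40/month electricity,
# $500-2,000/month cloud equivalent (treated 1:1 as EUR here).
print(breakeven_months(800, 40, 2000))   # best case: heavy usage, cheap GPU
print(breakeven_months(1200, 40, 500))   # worst case: light usage, pricey GPU
```

The spread lands in the "1-3 months" range quoted above; the heavier your usage, the faster the hardware pays for itself.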
Getting Started: The Minimal Setup
You don't need a 5-node cluster. Here's the minimal viable self-hosted AI stack:
Prerequisites
- A machine with 16GB RAM (GPU optional, but recommended)
- Docker + Docker Compose
- 1 afternoon
Step 1: Install Ollama
```shell
# Linux/Mac
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Llama 3.2 3B runs on CPU, ~2GB download)
ollama pull llama3.2:3b

# Test it
ollama run llama3.2:3b "What is GDPR Article 5?"
```
Step 2: Add Open WebUI (ChatGPT-like Interface)
```yaml
# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  open-webui:
```

```shell
docker compose up -d
# Open http://localhost:3000
```
That's it. You now have a local ChatGPT alternative.
GPU Makes All the Difference
On CPU (Intel i7):
- llama3.2:3b → ~10 tokens/sec (usable)
- llama3.1:8b → ~3 tokens/sec (slow)
- mistral:7b → ~3 tokens/sec (slow)
On GPU (RTX 3090 24GB):
- llama3.2:3b → ~150 tokens/sec (fast)
- llama3.1:8b → ~80 tokens/sec (fast)
- mistral-small3.2:24b → ~35 tokens/sec (fast)
- qwen2.5:32b → ~25 tokens/sec (good for coding)
Recommendation: A used RTX 3060 12GB (~€250) is the sweet spot for small teams.
RAG: Making Your LLM Know Your Data
The real power of local AI is Retrieval Augmented Generation (RAG) — feeding your own documents to the model without fine-tuning.
```python
# Simple RAG with ChromaDB + Ollama
import chromadb
import ollama

# 1. Index your documents
client = chromadb.Client()
collection = client.create_collection("company-docs")

docs = [
    "Our GDPR policy requires...",
    "Customer data is stored in...",
    "The data retention period is 90 days...",
]

# Generate embeddings locally
for i, doc in enumerate(docs):
    embedding = ollama.embeddings(model="mxbai-embed-large", prompt=doc)
    collection.add(
        ids=[str(i)],
        embeddings=[embedding["embedding"]],
        documents=[doc],
    )

# 2. Query with context
def rag_query(question: str) -> str:
    q_embedding = ollama.embeddings(model="mxbai-embed-large", prompt=question)
    results = collection.query(query_embeddings=[q_embedding["embedding"]], n_results=3)
    context = "\n".join(results["documents"][0])
    response = ollama.chat(model="llama3.1:8b", messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": question},
    ])
    return response["message"]["content"]

print(rag_query("How long do we retain customer data?"))
# → "According to your policy, customer data is retained for 90 days."
```
All of this runs 100% locally. Zero data leaves your machine.
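The example above embeds three short strings whole; real policy documents are too long for that and should be split into overlapping chunks before indexing. A minimal sliding-window chunker (a hypothetical helper, not part of the stack above) could look like this:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap`
    words, so a fact spanning a chunk boundary stays retrievable."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks
```

Each chunk is then embedded and added to the collection exactly like the short strings in the example; the 200/40 window sizes are a starting point to tune for your documents.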
GDPR Compliance Checklist for Local AI
Article 25 — Privacy by Design ✅
- No third-party AI APIs = no data transfer by default
- Add access control to Open WebUI (SSO or local users)
Article 30 — Records of Processing Activities ✅
- Document: "AI inference on local hardware for internal use"
- No DPA with external processor needed (it's your hardware)
- List which models you use and for what purpose
Article 32 — Security of Processing ✅
- Put Ollama behind a reverse proxy (Nginx/Traefik), don't expose port 11434
- Use HTTPS even internally (Let's Encrypt, or your own private CA)
- Restrict access by IP or VPN
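As a sketch of the reverse-proxy point above: an Nginx server block that keeps Ollama bound to localhost and restricts access to an internal subnet. The hostname, certificate paths, and IP range are placeholders for your own environment:

```nginx
server {
    listen 443 ssl;
    server_name ollama.internal.example;               # placeholder hostname

    ssl_certificate     /etc/ssl/internal/ollama.crt;  # from your private CA
    ssl_certificate_key /etc/ssl/internal/ollama.key;

    # Only the internal subnet / VPN range may connect
    allow 10.0.0.0/24;
    deny  all;

    location / {
        proxy_pass http://127.0.0.1:11434;  # Ollama listening on localhost only
        proxy_set_header Host $host;
        # LLM responses stream token by token; don't buffer or time out early
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
```

With this in place, port 11434 never needs to be exposed beyond the host itself.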
Chapter V — Third Country Transfers ✅
- Zero transfers if fully local
- Carefully review any integrations (n8n, monitoring tools) for cloud callbacks
What Models to Use for What
| Use Case | Model | VRAM Needed |
|---|---|---|
| General chat / writing | llama3.1:8b | 6GB |
| Code generation | qwen2.5:14b | 10GB |
| German/multilingual | mistral-small3.2:24b | 16GB |
| Fast summaries | llama3.2:3b | 2GB (CPU ok) |
| Embeddings/RAG | mxbai-embed-large | 1GB |
| Long documents | llama3.1:70b | 40GB (2x GPU) |
Rule of thumb: 8B parameter model = minimum 5-6GB VRAM. 4-bit quantized versions use ~60% less.
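The rule of thumb can be turned into a rough estimator: parameter count times bytes per weight, plus a fudge factor for KV cache and runtime buffers. The ~20% overhead factor is my assumption, not a benchmark:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Back-of-the-envelope VRAM estimate: weights-only size plus a
    ~20% fudge factor for KV cache and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params @ 8 bits = 1GB
    return round(weight_gb * overhead, 1)

# 8B model, 4-bit quantized -> close to the 5-6GB minimum quoted above
print(estimate_vram_gb(8))
# Same model at 16-bit full precision
print(estimate_vram_gb(8, bits_per_weight=16))
```

Treat the result as a floor, not a guarantee: long contexts inflate the KV cache well past this estimate.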
Is It Worth It?
For a team of 5+ using AI daily: Yes, absolutely.
Cost comparison (monthly):
- Cloud APIs (GPT-4): $500-2000/month
- Self-hosted (amortized hardware + electricity): ~$50/month
- Savings: ~$450-1950/month
Compliance benefit:
- Zero GDPR transfer risk
- No Article 28 DPA paperwork with AI providers
- Full audit trail (you control the logs)
Resources
- Ollama: ollama.com — The easiest way to run LLMs locally
- Open WebUI: docs.openwebui.com — Self-hosted ChatGPT UI
- n8n self-hosted: n8n.io — Local workflow automation
Want the Full Guide?
I packaged everything into Playbook 01 — Der lokale AI-Stack (70+ pages, DACH-focused):
- Complete Docker Swarm setup from zero
- n8n workflow automation templates (production-tested)
- GDPR Article 30 compliance documentation templates
- All config files and code snippets included
€49 one-time — ai-engineering.at
Building this publicly at ai-engineering.at. Running the entire AI infrastructure in a 5-node Docker Swarm in my home lab. AMA in the comments.