Why Running AI Locally Actually Matters in 2026
Every AI tool you use — ChatGPT, Copilot, Claude — sends your data to someone else's server. For most developers, that's fine. For companies handling customer data under GDPR, it's a compliance nightmare waiting to happen.
I've spent the last year building a fully self-hosted AI stack for a small Austrian engineering firm. No cloud. No data leaving our datacenter. Full GDPR Article 30 compliance. And honestly — it's faster than most cloud APIs.
Here's what I learned.
The GDPR Problem With Cloud AI
When you send a query to a cloud LLM:
- Your data crosses into a third country (US data centers = Chapter V GDPR transfer)
- You need an Article 28 Data Processing Agreement with the provider
- You need to document it in your Article 30 Register
- If there's a breach, you must notify the supervisory authority within 72 hours (Article 33)
For a small team, this paperwork alone is a reason to go local.
The Stack (What We're Actually Running)
Here's our production setup on a 5-node Docker Swarm:
Hardware: 1x server + 1x workstation with RTX 3090 (24GB VRAM)
OS: Proxmox VE → Ubuntu VMs
Services:
- Ollama → Local LLM inference (Mistral, Llama3, Qwen)
- Open WebUI → Chat interface (like ChatGPT, but yours)
- Whisper STT → Speech-to-text, fully local
- Piper TTS → Text-to-speech, runs on CPU
- ChromaDB → Vector database for RAG
- n8n → Workflow automation (local, not cloud)
- Prometheus + Grafana → Monitoring
- Mattermost → Team communication (self-hosted Slack)
Total cost: ~€800-1200 for the GPU workstation (used RTX 3090).
Monthly running cost: ~€40 electricity.
Compare to: GPT-4-class APIs at $10-30 per million tokens. A team running 100K queries/month at roughly 1,000 tokens per query burns ~100M tokens = $1,000-3,000/month.
Break-even: 1-3 months.
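The break-even figure above is simple arithmetic: one-time hardware cost divided by the monthly saving. A quick sketch using this post's own estimates (the function and its inputs are illustrative, not measurements):

```python
def breakeven_months(hardware_eur: float, monthly_electricity_eur: float,
                     monthly_cloud_eur: float) -> float:
    """Months until the one-time hardware cost is recovered by the
    monthly saving versus a cloud API bill."""
    monthly_saving = monthly_cloud_eur - monthly_electricity_eur
    if monthly_saving <= 0:
        raise ValueError("cloud bill must exceed local running costs")
    return hardware_eur / monthly_saving

# Figures from this post: €800-1200 workstation, ~€40/month electricity,
# $500-2,000/month cloud equivalent (treated 1:1 as EUR here).
print(breakeven_months(800, 40, 2000))   # best case: heavy usage, cheap GPU
print(breakeven_months(1200, 40, 500))   # worst case: light usage, pricey GPU
```

The spread lands in the "1-3 months" range quoted above; the heavier your usage, the faster the hardware pays for itself.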
Getting Started: The Minimal Setup
You don't need a 5-node cluster. Here's the minimal viable self-hosted AI stack:
Prerequisites
- A machine with 16GB RAM (GPU optional, but recommended)
- Docker + Docker Compose
- 1 afternoon
Step 1: Install Ollama
```shell
# Linux/Mac
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Llama 3.2 3B runs on CPU, ~2GB download)
ollama pull llama3.2:3b

# Test it
ollama run llama3.2:3b "What is GDPR Article 5?"
```
Step 2: Add Open WebUI (ChatGPT-like Interface)
```yaml
# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  open-webui:
```

```shell
docker compose up -d
# Open http://localhost:3000
```
That's it. You now have a local ChatGPT alternative.
GPU Makes All the Difference
On CPU (Intel i7):
- llama3.2:3b → ~10 tokens/sec (usable)
- llama3.1:8b → ~3 tokens/sec (slow)
- mistral:7b → ~3 tokens/sec (slow)
On GPU (RTX 3090 24GB):
- llama3.2:3b → ~150 tokens/sec (fast)
- llama3.1:8b → ~80 tokens/sec (fast)
- mistral-small3.2:24b → ~35 tokens/sec (fast)
- qwen2.5:32b → ~25 tokens/sec (good for coding)
Recommendation: A used RTX 3060 12GB (~€250) is the sweet spot for small teams.
RAG: Making Your LLM Know Your Data
The real power of local AI is Retrieval Augmented Generation (RAG) — feeding your own documents to the model without fine-tuning.
```python
# Simple RAG with ChromaDB + Ollama
import chromadb
import ollama

# 1. Index your documents
client = chromadb.Client()
collection = client.create_collection("company-docs")

docs = [
    "Our GDPR policy requires...",
    "Customer data is stored in...",
    "The data retention period is 90 days...",
]

# Generate embeddings locally
for i, doc in enumerate(docs):
    embedding = ollama.embeddings(model="mxbai-embed-large", prompt=doc)
    collection.add(
        ids=[str(i)],
        embeddings=[embedding["embedding"]],
        documents=[doc],
    )

# 2. Query with context
def rag_query(question: str) -> str:
    q_embedding = ollama.embeddings(model="mxbai-embed-large", prompt=question)
    results = collection.query(query_embeddings=[q_embedding["embedding"]], n_results=3)
    context = "\n".join(results["documents"][0])
    response = ollama.chat(model="llama3.1:8b", messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": question},
    ])
    return response["message"]["content"]

print(rag_query("How long do we retain customer data?"))
# → "According to your policy, customer data is retained for 90 days."
```
All of this runs 100% locally. Zero data leaves your machine.
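The example above embeds three short strings whole; real policy documents are too long for that and should be split into overlapping chunks before indexing. A minimal sliding-window chunker (a hypothetical helper, not part of the stack above) could look like this:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap`
    words, so a fact spanning a chunk boundary stays retrievable."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks
```

Each chunk is then embedded and added to the collection exactly like the short strings in the example; the 200/40 window sizes are a starting point to tune for your documents.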
GDPR Compliance Checklist for Local AI
Article 25 — Privacy by Design ✅
- No third-party AI APIs = no data transfer by default
- Add access control to Open WebUI (SSO or local users)
Article 30 — Records of Processing Activities ✅
- Document: "AI inference on local hardware for internal use"
- No DPA with external processor needed (it's your hardware)
- List which models you use and for what purpose
Article 32 — Security of Processing ✅
- Put Ollama behind a reverse proxy (Nginx/Traefik), don't expose port 11434
- Use HTTPS even internally (Let's Encrypt, or your own private CA)
- Restrict access by IP or VPN
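As a sketch of the reverse-proxy point above: an Nginx server block that keeps Ollama bound to localhost and restricts access to an internal subnet. The hostname, certificate paths, and IP range are placeholders for your own environment:

```nginx
server {
    listen 443 ssl;
    server_name ollama.internal.example;               # placeholder hostname

    ssl_certificate     /etc/ssl/internal/ollama.crt;  # from your private CA
    ssl_certificate_key /etc/ssl/internal/ollama.key;

    # Only the internal subnet / VPN range may connect
    allow 10.0.0.0/24;
    deny  all;

    location / {
        proxy_pass http://127.0.0.1:11434;  # Ollama listening on localhost only
        proxy_set_header Host $host;
        # LLM responses stream token by token; don't buffer or time out early
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
```

With this in place, port 11434 never needs to be exposed beyond the host itself.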
Chapter V — Third Country Transfers ✅
- Zero transfers if fully local
- Carefully review any integrations (n8n, monitoring tools) for cloud callbacks
What Models to Use for What
| Use Case | Model | VRAM Needed |
|---|---|---|
| General chat / writing | llama3.1:8b | 6GB |
| Code generation | qwen2.5:14b | 10GB |
| German/multilingual | mistral-small3.2:24b | 16GB |
| Fast summaries | llama3.2:3b | 2GB (CPU ok) |
| Embeddings/RAG | mxbai-embed-large | 1GB |
| Long documents | llama3.1:70b | 40GB (2x GPU) |
Rule of thumb: 8B parameter model = minimum 5-6GB VRAM. 4-bit quantized versions use ~60% less.
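The rule of thumb can be turned into a rough estimator: parameter count times bytes per weight, plus a fudge factor for KV cache and runtime buffers. The ~20% overhead factor is my assumption, not a benchmark:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Back-of-the-envelope VRAM estimate: weights-only size plus a
    ~20% fudge factor for KV cache and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params @ 8 bits = 1GB
    return round(weight_gb * overhead, 1)

# 8B model, 4-bit quantized -> close to the 5-6GB minimum quoted above
print(estimate_vram_gb(8))
# Same model at 16-bit full precision
print(estimate_vram_gb(8, bits_per_weight=16))
```

Treat the result as a floor, not a guarantee: long contexts inflate the KV cache well past this estimate.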
Is It Worth It?
For a team of 5+ using AI daily: Yes, absolutely.
Cost comparison (monthly):
- Cloud APIs (GPT-4): $500-2000/month
- Self-hosted (amortized hardware + electricity): ~$50/month
- Savings: ~$450-1950/month
Compliance benefit:
- Zero GDPR transfer risk
- No Article 28 DPA paperwork with AI providers
- Full audit trail (you control the logs)
Resources
- Ollama: ollama.com — The easiest way to run LLMs locally
- Open WebUI: docs.openwebui.com — Self-hosted ChatGPT UI
- n8n self-hosted: n8n.io — Local workflow automation
Want the Full Guide?
I packaged everything into Playbook 01 — Der lokale AI-Stack (70+ pages, DACH-focused):
- Complete Docker Swarm setup from zero
- n8n workflow automation templates (production-tested)
- GDPR Article 30 compliance documentation templates
- All config files and code snippets included
€49 one-time — ai-engineering.at
Building this publicly at ai-engineering.at. Running the entire AI infrastructure in a 5-node Docker Swarm in my home lab. AMA in the comments.