DEV Community

Jörg Fuchs


Running AI Locally in 2026: A GDPR-Compliant Guide

Why Running AI Locally Actually Matters in 2026

Every AI tool you use — ChatGPT, Copilot, Claude — sends your data to someone else's server. For most developers, that's fine. For companies handling customer data under GDPR, it's a compliance nightmare waiting to happen.

I've spent the last year building a fully self-hosted AI stack for a small Austrian engineering firm. No cloud. No data leaving our datacenter. Full GDPR Article 30 compliance. And honestly — it's faster than most cloud APIs.

Here's what I learned.


The GDPR Problem With Cloud AI

When you send a query to a cloud LLM:

  • Your data crosses into a third country (US data centers = Chapter V GDPR transfer)
  • You need an Article 28 Data Processing Agreement with the provider
  • You need to document it in your Article 30 Register
  • If there's a breach, you must notify the supervisory authority within 72 hours (Article 33)

For a small team, this paperwork alone is a reason to go local.


The Stack (What We're Actually Running)

Here's our production setup on a 5-node Docker Swarm:

Hardware: 1x server + 1x GPU workstation with RTX 3090 (24GB VRAM)
OS: Proxmox VE on both machines → Ubuntu VMs providing the five Swarm nodes

Services:
- Ollama          → Local LLM inference (Mistral, Llama3, Qwen)
- Open WebUI      → Chat interface (like ChatGPT, but yours)
- Whisper STT     → Speech-to-text, fully local
- Piper TTS       → Text-to-speech, runs on CPU
- ChromaDB        → Vector database for RAG
- n8n             → Workflow automation (local, not cloud)
- Prometheus + Grafana → Monitoring
- Mattermost      → Team communication (self-hosted Slack)

Total cost: ~€800-1200 for the GPU workstation (used RTX 3090).
Monthly running cost: ~€40 electricity.

Compare to: GPT-4-class APIs at $10-30 per 1M tokens. A team doing 100K queries/month at roughly 1K tokens per query burns through ~100M tokens = $1,000-3,000/month.

Break-even: 1-3 months.
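A quick sanity check on that break-even claim. The figures below are the article's own estimates (hardware midpoint, electricity, cloud spend range), not measurements:

```python
# Rough break-even estimate: self-hosted vs. cloud API costs.
# All figures are assumptions for illustration.
HARDWARE_COST_EUR = 1000          # used RTX 3090 workstation (midpoint of EUR 800-1200)
ELECTRICITY_EUR_PER_MONTH = 40

for cloud_eur_per_month in (500, 2000):   # low and high end of API spend
    monthly_savings = cloud_eur_per_month - ELECTRICITY_EUR_PER_MONTH
    break_even = HARDWARE_COST_EUR / monthly_savings
    print(f"Cloud spend {cloud_eur_per_month}/mo -> break-even in {break_even:.1f} months")
```

At heavy cloud usage the hardware pays for itself in well under a quarter; at light usage it takes a couple of months.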


Getting Started: The Minimal Setup

You don't need a 5-node cluster. Here's the minimal viable self-hosted AI stack:

Prerequisites

  • A machine with 16GB RAM (GPU optional, but recommended)
  • Docker + Docker Compose
  • 1 afternoon

Step 1: Install Ollama

# Linux/Mac
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Llama3.2 3B runs on CPU, ~2GB)
ollama pull llama3.2:3b

# Test it
ollama run llama3.2:3b "What is GDPR Article 5?"
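Ollama also exposes a local HTTP API on port 11434 — this is the same endpoint Open WebUI talks to in the next step. A minimal non-streaming request against the documented /api/generate endpoint (model name as pulled above):

```shell
# Query the local Ollama server directly over HTTP — no cloud involved
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Summarize GDPR Article 5 in one sentence.",
  "stream": false
}'
```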

Step 2: Add Open WebUI (ChatGPT-like Interface)

# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  open-webui:
docker compose up -d
# Open http://localhost:3000

That's it. You now have a local ChatGPT alternative.
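If you'd rather not install Ollama on the host at all, it also ships an official Docker image (ollama/ollama), so the whole stack can live in one Compose file. A sketch — the GPU reservation block only applies if you have an NVIDIA card and the NVIDIA Container Toolkit installed:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama   # pulled models persist here
    # GPU passthrough (requires NVIDIA Container Toolkit); delete on CPU-only hosts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
```

With both services on the same Compose network, point Open WebUI at the service name instead: OLLAMA_BASE_URL=http://ollama:11434.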


GPU Makes All the Difference

On CPU (Intel i7):

  • llama3.2:3b → ~10 tokens/sec (usable)
  • llama3.1:8b → ~3 tokens/sec (slow)
  • mistral:7b → ~3 tokens/sec (slow)

On GPU (RTX 3090 24GB):

  • llama3.2:3b → ~150 tokens/sec (fast)
  • llama3.1:8b → ~80 tokens/sec (fast)
  • mistral-small3.2:24b → ~35 tokens/sec (fast)
  • qwen2.5:32b → ~25 tokens/sec (good for coding)

Recommendation: A used RTX 3060 12GB (~€250) is the sweet spot for small teams.
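You can reproduce these throughput numbers on your own hardware: Ollama's API responses include eval_count (tokens generated) and eval_duration (in nanoseconds), from which the rate falls out directly. A minimal helper:

```python
# Tokens/sec from the eval_count and eval_duration fields that Ollama's
# /api/generate and /api/chat responses report.
def tokens_per_sec(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# Example: 150 tokens generated in 1.5 s of eval time
print(tokens_per_sec(150, 1_500_000_000))  # → 100.0
```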


RAG: Making Your LLM Know Your Data

The real power of local AI is Retrieval Augmented Generation (RAG) — feeding your own documents to the model without fine-tuning.

# Simple RAG with ChromaDB + Ollama
import chromadb
import ollama

# 1. Index your documents
client = chromadb.Client()
collection = client.create_collection("company-docs")

docs = [
    "Our GDPR policy requires...",
    "Customer data is stored in...",
    "The data retention period is 90 days..."
]

# Generate embeddings locally
for i, doc in enumerate(docs):
    embedding = ollama.embeddings(model="mxbai-embed-large", prompt=doc)
    collection.add(
        ids=[str(i)],
        embeddings=[embedding["embedding"]],
        documents=[doc]
    )

# 2. Query with context
def rag_query(question: str) -> str:
    q_embedding = ollama.embeddings(model="mxbai-embed-large", prompt=question)
    results = collection.query(query_embeddings=[q_embedding["embedding"]], n_results=3)
    context = "\n".join(results["documents"][0])

    response = ollama.chat(model="llama3.1:8b", messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": question}
    ])
    return response["message"]["content"]

print(rag_query("How long do we retain customer data?"))
# → "According to your policy, customer data is retained for 90 days."

All of this runs 100% locally. Zero data leaves your machine.
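The example above indexes three one-line documents. Real policy documents need to be split into chunks before embedding, or retrieval quality collapses. A minimal fixed-size chunker with overlap — the sizes here are arbitrary starting points, not tuned values:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

doc = "Customer data is stored encrypted at rest. " * 40  # ~1.7KB of text
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))  # number of chunks, size of the first
```

Feed each chunk through the same mxbai-embed-large loop as above; the overlap keeps sentences that straddle a chunk boundary retrievable from either side.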


GDPR Compliance Checklist for Local AI

Article 25 — Privacy by Design

  • No third-party AI APIs = no data transfer by default
  • Add access control to Open WebUI (SSO or local users)

Article 30 — Records of Processing Activities

  • Document: "AI inference on local hardware for internal use"
  • No DPA with external processor needed (it's your hardware)
  • List which models you use and for what purpose

Article 32 — Security of Processing

  • Put Ollama behind a reverse proxy (Nginx/Traefik), don't expose port 11434
  • Use HTTPS even internally (Let's Encrypt, or your own private CA)
  • Restrict access by IP or VPN
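For the reverse-proxy point, a minimal Nginx sketch that fronts Ollama with TLS and an IP allowlist. The hostname, certificate paths, and allowed subnet are placeholders — substitute your own:

```nginx
server {
    listen 443 ssl;
    server_name ollama.internal.example;

    ssl_certificate     /etc/nginx/certs/internal.crt;
    ssl_certificate_key /etc/nginx/certs/internal.key;

    # Only the internal office subnet may reach the API
    allow 10.0.10.0/24;
    deny  all;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        # LLM responses stream slowly; don't buffer or time out early
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
```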

Chapter V — Third Country Transfers

  • Zero transfers if fully local
  • Carefully review any integrations (n8n, monitoring tools) for cloud callbacks

What Models to Use for What

Use Case                 Model                   VRAM Needed
General chat / writing   llama3.1:8b             6GB
Code generation          qwen2.5:14b             10GB
German / multilingual    mistral-small3.2:24b    16GB
Fast summaries           llama3.2:3b             2GB (CPU ok)
Embeddings / RAG         mxbai-embed-large       1GB
Long documents           llama3.1:70b            40GB (2x GPU)

Rule of thumb: an 8B-parameter model at 4-bit quantization (Ollama's default) needs a minimum of 5-6GB VRAM; unquantized FP16 weights need roughly 2-3x that, so 4-bit quantization cuts memory use by ~60%.
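That rule of thumb can be written down as a formula. This is a rough approximation; real usage also includes KV cache and runtime buffers, which the overhead factor below only guesses at:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM need: weight size at the given quantization, plus ~20%
    for KV cache and runtime buffers (an assumed factor, not measured)."""
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

print(vram_estimate_gb(8))       # 8B @ 4-bit → 4.8, close to the 5-6GB rule
print(vram_estimate_gb(8, 16))   # same model unquantized (FP16) → 19.2
```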


Is It Worth It?

For a team of 5+ using AI daily: Yes, absolutely.

Cost comparison (monthly):

  • Cloud APIs (GPT-4): $500-2000/month
  • Self-hosted (amortized hardware + electricity): ~$50/month
  • Savings: ~$450-1950/month

Compliance benefit:

  • Zero GDPR transfer risk
  • No Article 28 DPA paperwork with AI providers
  • Full audit trail (you control the logs)

Resources

  • Ollama: ollama.com — The easiest way to run LLMs locally
  • Open WebUI: docs.openwebui.com — Self-hosted ChatGPT UI
  • n8n self-hosted: n8n.io — Local workflow automation

Want the Full Guide?

I packaged everything into Playbook 01 — Der lokale AI-Stack (70+ pages, DACH-focused):

  • Complete Docker Swarm setup from zero
  • n8n workflow automation templates (production-tested)
  • GDPR Article 30 compliance documentation templates
  • All config files and code snippets included

€49 one-time, available at ai-engineering.at


Building this publicly at ai-engineering.at. Running the entire AI infrastructure in a 5-node Docker Swarm in my home lab. AMA in the comments.
