<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jörg Fuchs</title>
    <description>The latest articles on DEV Community by Jörg Fuchs (@aiengineeringat).</description>
    <link>https://dev.to/aiengineeringat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3791035%2Fec28ceff-455b-42c0-bcec-24a50ed1ff23.png</url>
      <title>DEV Community: Jörg Fuchs</title>
      <link>https://dev.to/aiengineeringat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aiengineeringat"/>
    <language>en</language>
    <item>
      <title>Hunyuan Video 720p on RTX 3090: Full On-Premise AI Media Pipeline E2E</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Wed, 04 Mar 2026 19:36:37 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/hunyuan-video-720p-on-rtx-3090-full-on-premise-ai-media-pipeline-e2e-4bl4</link>
      <guid>https://dev.to/aiengineeringat/hunyuan-video-720p-on-rtx-3090-full-on-premise-ai-media-pipeline-e2e-4bl4</guid>
      <description>&lt;p&gt;Running AI video generation on consumer hardware - here is our full E2E pipeline that generates photos and videos without any cloud APIs.&lt;/p&gt;

&lt;h2&gt;Hardware&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RTX 3090 24GB VRAM&lt;/li&gt;
&lt;li&gt;Intel i7-14700F (20 cores)&lt;/li&gt;
&lt;li&gt;30GB WSL2 RAM (critical for Hunyuan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Photo Pipeline (FLUX Dev FP8)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Resolution: 1344x768&lt;/li&gt;
&lt;li&gt;Generation time: ~44 seconds&lt;/li&gt;
&lt;li&gt;Quality: Professional stock photo level&lt;/li&gt;
&lt;li&gt;Guidance: 3.5 via FluxGuidance node&lt;/li&gt;
&lt;li&gt;CFG: 1.0 (FLUX ignores traditional CFG)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Video Pipeline (Hunyuan Video FP8)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Resolution: 1280x720&lt;/li&gt;
&lt;li&gt;Frames: 13 (~1.1s at 12fps)&lt;/li&gt;
&lt;li&gt;Generation time: ~7.8 minutes&lt;/li&gt;
&lt;li&gt;Key fix: quantization=fp8_e4m3fn keeps model at ~12GB on GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Critical Learning&lt;/h2&gt;

&lt;p&gt;The pre-quantized FP8 Hunyuan model with quantization=disabled causes OOM because HyVideoModelLoader upcasts weights to bf16 (~24GB). Setting quantization to fp8_e4m3fn keeps it in FP8 format (~12GB), leaving room for VAE and sampling.&lt;/p&gt;
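
&lt;p&gt;For context, here is a minimal sketch of where that flag lives when driving ComfyUI over its HTTP API. The node and input names follow the HunyuanVideoWrapper custom nodes; the model filename and the rest of the workflow are illustrative, not our exact graph.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: pin HyVideoModelLoader to FP8 when submitting a workflow to
# ComfyUI's /prompt endpoint. Filename and node wiring are illustrative.
import json
import urllib.request

workflow = {
    "1": {
        "class_type": "HyVideoModelLoader",
        "inputs": {
            "model": "hunyuan_video_fp8_e4m3fn.safetensors",  # placeholder name
            "quantization": "fp8_e4m3fn",  # the fix: no bf16 upcast, ~12GB
        },
    },
    # ... sampler, VAE decode, and video-combine nodes omitted
}

req = urllib.request.Request(
    "http://localhost:8188/prompt",  # default ComfyUI API port
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;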

&lt;h2&gt;VRAM Management&lt;/h2&gt;

&lt;p&gt;We built a custom VRAM Guard service that coordinates GPU access between Ollama (LLM) and ComfyUI (media generation). Before video generation, Ollama models are unloaded and ComfyUI cached models are freed.&lt;/p&gt;
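
&lt;p&gt;A minimal sketch of that handoff, assuming default ports: Ollama evicts a model from VRAM when a request arrives with keep_alive set to 0, and ComfyUI exposes a /free endpoint for dropping cached models.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the VRAM Guard handoff before a video job (default ports).
import requests

def free_vram_for_video(ollama_model="mistral:7b"):
    # 1. Tell Ollama to evict the model from VRAM immediately
    requests.post("http://localhost:11434/api/generate",
                  json={"model": ollama_model, "keep_alive": 0})
    # 2. Tell ComfyUI to drop cached models and free GPU memory
    requests.post("http://localhost:8188/free",
                  json={"unload_models": True, "free_memory": True})

free_vram_for_video()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;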

&lt;h2&gt;Pipeline Architecture&lt;/h2&gt;

&lt;p&gt;ComfyUI API → n8n workflow orchestration → Social Poster service → auto-post to Twitter, LinkedIn, Reddit, Dev.to&lt;/p&gt;

&lt;p&gt;All running on Docker Swarm across 6 nodes. No cloud dependencies.&lt;/p&gt;

</description>
      <category>docker</category>
    </item>
    <item>
      <title>Why Every AI Engineer Should Build a Homelab</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Tue, 03 Mar 2026 16:20:24 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/why-every-ai-engineer-should-build-a-homelab-3pp8</link>
      <guid>https://dev.to/aiengineeringat/why-every-ai-engineer-should-build-a-homelab-3pp8</guid>
      <description>&lt;p&gt;Running your own AI infrastructure at home is not just possible — it is powerful. Open-source LLMs, self-hosted automation, and a homelab stack can replace expensive cloud subscriptions. Here is how we do it at AI Engineering.&lt;/p&gt;

&lt;p&gt;#AI #Homelab #SelfHosted #OpenSource #Automation&lt;/p&gt;

</description>
      <category>automation</category>
    </item>
    <item>
      <title>I Built a Production 4-Agent AI Stack on Local Hardware — Here's What I Learned</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Thu, 26 Feb 2026 05:59:36 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/i-built-a-production-4-agent-ai-stack-on-local-hardware-heres-what-i-learned-4o0e</link>
      <guid>https://dev.to/aiengineeringat/i-built-a-production-4-agent-ai-stack-on-local-hardware-heres-what-i-learned-4o0e</guid>
      <description>&lt;p&gt;After months of iteration, I'm running a fully local AI agent system — GDPR-compliant by design, no cloud APIs, under €50/month running cost.&lt;/p&gt;

&lt;h2&gt;The Stack&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3x nodes (Docker Swarm): management, monitoring, databases&lt;/li&gt;
&lt;li&gt;1x GPU server: RTX 3090 for LLM inference&lt;/li&gt;
&lt;li&gt;1x dev machine: RTX 4070&lt;/li&gt;
&lt;li&gt;Total hardware: ~€2,400 (used)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — Mistral 7B, Llama 3.1, Codestral (local LLM inference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j&lt;/strong&gt; — Knowledge graphs for structured memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB&lt;/strong&gt; — Vector store for RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mattermost&lt;/strong&gt; — Self-hosted agent communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n&lt;/strong&gt; — Workflow automation (the glue)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt; — Full monitoring stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uptime Kuma&lt;/strong&gt; — Health checks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;4 Agents, Different Specializations&lt;/h2&gt;

&lt;p&gt;The agents communicate via Mattermost channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jim01&lt;/strong&gt; — Infrastructure orchestrator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lisa01&lt;/strong&gt; — Content quality and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;John01&lt;/strong&gt; — Frontend builder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Echo_log&lt;/strong&gt; — Memory management (Neo4j knowledge graph)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has its own persona, memory, and tool access.&lt;/p&gt;
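
&lt;p&gt;For a feel of the plumbing, this is roughly what a post from an agent looks like against the Mattermost REST API (v4). The URL, bot token, and channel ID are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: an agent posting into a shared channel via Mattermost's v4 API.
import requests

def agent_say(text):
    requests.post(
        "https://mattermost.example.com/api/v4/posts",
        headers={"Authorization": "Bearer AGENT_BOT_TOKEN"},  # placeholder
        json={"channel_id": "CHANNEL_ID", "message": text},   # placeholder
    )

agent_say("Jim01: nightly Neo4j backup completed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;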

&lt;h2&gt;Key Learnings&lt;/h2&gt;

&lt;h3&gt;1. Docker Swarm &amp;gt; Kubernetes (for small teams)&lt;/h3&gt;

&lt;p&gt;Seriously. If you're running 3-5 nodes, Swarm just works. No etcd cluster, no complex networking. &lt;code&gt;docker stack deploy&lt;/code&gt; and done.&lt;/p&gt;
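
&lt;p&gt;The entire ceremony, assuming a fresh manager node (the addresses and join token below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On the manager node
docker swarm init --advertise-addr 192.168.1.10

# On each worker, run the join command that init printed
docker swarm join --token SWMTKN-... 192.168.1.10:2377

# Deploy (or update) the whole stack from one compose file
docker stack deploy -c docker-compose.yml ai-stack
docker stack services ai-stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;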

&lt;h3&gt;2. HippoRAG with Neo4j beats pure vector search&lt;/h3&gt;

&lt;p&gt;The combination of knowledge graphs + Personalized PageRank gives much better results for multi-hop reasoning than ChromaDB alone.&lt;/p&gt;
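
&lt;p&gt;A sketch of the retrieval step, assuming the Graph Data Science plugin is installed and the graph is projected as 'knowledge-graph'; the label and seed values are illustrative, not our production schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: Personalized PageRank over the Neo4j knowledge graph, seeded
# with entities extracted from the query (HippoRAG-style retrieval).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

PPR_QUERY = """
MATCH (seed:Entity) WHERE seed.name IN $seeds
WITH collect(seed) AS seedNodes
CALL gds.pageRank.stream('knowledge-graph', {sourceNodes: seedNodes})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS entity, score
ORDER BY score DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(PPR_QUERY, seeds=["GDPR", "data retention"]):
        print(record["entity"], record["score"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;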

&lt;h3&gt;3. Disk space will kill you before anything else&lt;/h3&gt;

&lt;p&gt;Ollama models, Neo4j databases, Docker images — monitor your disk. This was our #1 production incident.&lt;/p&gt;
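
&lt;p&gt;Even a crude watchdog would have caught it early. A minimal sketch; the threshold and the Mattermost webhook URL are ours to fill in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: warn a Mattermost channel when root disk usage crosses 85%.
import shutil
import requests

usage = shutil.disk_usage("/")
percent = usage.used / usage.total * 100
if percent &amp;gt; 85:  # threshold is a judgment call
    requests.post("https://mattermost.example.com/hooks/WEBHOOK_ID",  # placeholder
                  json={"text": f"Disk at {percent:.0f}% on this node"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;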

&lt;h3&gt;4. Agent personas need careful tuning&lt;/h3&gt;

&lt;p&gt;Without clear boundaries, agents get confused about their role. Explicit persona files with rules work better than general instructions.&lt;/p&gt;
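
&lt;p&gt;Ours are short and explicit. A trimmed, purely illustrative example of the shape (not Lisa01's real file; the channel name is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# persona: Lisa01
Role: Content quality and compliance reviewer.
Scope: Review drafts posted in #content-review. Nothing else.
Rules:
- Never publish directly; always hand back to Jim01.
- Flag any personal data before it leaves the network.
- If a request is outside scope, say so and stop.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;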

&lt;h3&gt;5. n8n is the underrated MVP&lt;/h3&gt;

&lt;p&gt;Webhooks, API orchestration, error handling, notifications — n8n connects everything. 28 workflows running in production.&lt;/p&gt;

&lt;h2&gt;Running Cost&lt;/h2&gt;

&lt;p&gt;~€47/month electricity. That's it. No API bills, no cloud subscriptions.&lt;/p&gt;

&lt;h2&gt;Why Local?&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;EU AI Act&lt;/strong&gt; becomes fully enforceable August 2026. Fines up to €35M or 7% of global revenue. If you're sending data to OpenAI/Anthropic APIs from the EU, compliance gets complex.&lt;/p&gt;

&lt;p&gt;Running everything locally means GDPR-compliant by design. No data leaves your network.&lt;/p&gt;

&lt;h2&gt;The Playbook&lt;/h2&gt;

&lt;p&gt;I wrote everything up as a detailed playbook: 8 chapters, ~70 pages, all docker-compose files and code examples included.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check it out:&lt;/strong&gt; &lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;ai-engineering.at&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions welcome — happy to discuss the architecture!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Ollama, Docker Swarm, Neo4j, n8n, and a lot of late nights.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>docker</category>
      <category>gdpr</category>
    </item>
    <item>
      <title>Running AI Locally in 2026: A GDPR-Compliant Guide</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Wed, 25 Feb 2026 07:21:04 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/running-ai-locally-in-2026-a-gdpr-compliant-guide-oml</link>
      <guid>https://dev.to/aiengineeringat/running-ai-locally-in-2026-a-gdpr-compliant-guide-oml</guid>
<description>&lt;h2&gt;Why Running AI Locally Actually Matters in 2026&lt;/h2&gt;

&lt;p&gt;Every AI tool you use — ChatGPT, Copilot, Claude — sends your data to someone else's server. For most developers, that's fine. For companies handling customer data under GDPR, it's a compliance nightmare waiting to happen.&lt;/p&gt;

&lt;p&gt;I've spent the last year building a fully self-hosted AI stack for a small Austrian engineering firm. No cloud. No data leaving our datacenter. Full GDPR Article 30 compliance. And honestly — it's faster than most cloud APIs.&lt;/p&gt;

&lt;p&gt;Here's what I learned.&lt;/p&gt;




&lt;h2&gt;The GDPR Problem With Cloud AI&lt;/h2&gt;

&lt;p&gt;When you send a query to a cloud LLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data crosses into a third country (US data centers = Chapter V GDPR transfer)&lt;/li&gt;
&lt;li&gt;You need an Article 28 Data Processing Agreement with the provider&lt;/li&gt;
&lt;li&gt;You need to document it in your Article 30 Register&lt;/li&gt;
&lt;li&gt;If there's a breach, you're on the hook in 72 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a small team, this paperwork alone is a reason to go local.&lt;/p&gt;




&lt;h2&gt;The Stack (What We're Actually Running)&lt;/h2&gt;

&lt;p&gt;Here's our production setup on a 5-node Docker Swarm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hardware: 1x server + 1x workstation with RTX 3090 (24GB VRAM)
OS: Proxmox VE → Ubuntu VMs

Services:
- Ollama          → Local LLM inference (Mistral, Llama3, Qwen)
- Open WebUI      → Chat interface (like ChatGPT, but yours)
- Whisper STT     → Speech-to-text, fully local
- Piper TTS       → Text-to-speech, runs on CPU
- ChromaDB        → Vector database for RAG
- n8n             → Workflow automation (local, not cloud)
- Prometheus + Grafana → Monitoring
- Mattermost      → Team communication (self-hosted Slack)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total cost: ~€800-1200 for the GPU workstation (used RTX 3090).&lt;br&gt;
Monthly running cost: ~€40 electricity.&lt;/p&gt;

&lt;p&gt;Compare to: GPT-4 API at $10-30/1M tokens for a team doing 100K queries/month (at roughly 1K tokens per query, that's ~100M tokens) = &lt;strong&gt;$1,000-3,000/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break-even: 1-3 months.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;Getting Started: The Minimal Setup&lt;/h2&gt;

&lt;p&gt;You don't need a 5-node cluster. Here's the minimal viable self-hosted AI stack:&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A machine with 16GB RAM (GPU optional, but recommended)&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;li&gt;1 afternoon&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Install Ollama&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux/Mac&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a model (Llama3.2 3B runs on CPU, ~2GB)&lt;/span&gt;
ollama pull llama3.2:3b

&lt;span class="c"&gt;# Test it&lt;/span&gt;
ollama run llama3.2:3b &lt;span class="s2"&gt;"What is GDPR Article 5?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Step 2: Add Open WebUI (ChatGPT-like Interface)&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;open-webui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/open-webui/open-webui:main&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:8080"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;open-webui:/app/backend/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_BASE_URL=http://host.docker.internal:11434&lt;/span&gt;
    &lt;span class="na"&gt;extra_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host.docker.internal:host-gateway"&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;open-webui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. You now have a local ChatGPT alternative.&lt;/p&gt;


&lt;h2&gt;GPU Makes All the Difference&lt;/h2&gt;

&lt;p&gt;On CPU (Intel i7):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama3.2:3b&lt;/code&gt; → ~10 tokens/sec (usable)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama3.1:8b&lt;/code&gt; → ~3 tokens/sec (slow)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mistral:7b&lt;/code&gt; → ~3 tokens/sec (slow)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On GPU (RTX 3090 24GB):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama3.2:3b&lt;/code&gt; → ~150 tokens/sec (fast)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama3.1:8b&lt;/code&gt; → ~80 tokens/sec (fast)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mistral-small3.2:24b&lt;/code&gt; → ~35 tokens/sec (fast)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen2.5:32b&lt;/code&gt; → ~25 tokens/sec (good for coding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommendation:&lt;/strong&gt; A used RTX 3060 12GB (~€250) is the sweet spot for small teams.&lt;/p&gt;


&lt;h2&gt;RAG: Making Your LLM Know Your Data&lt;/h2&gt;

&lt;p&gt;The real power of local AI is &lt;strong&gt;Retrieval Augmented Generation (RAG)&lt;/strong&gt; — feeding your own documents to the model without fine-tuning.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple RAG with ChromaDB + Ollama
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Index your documents
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our GDPR policy requires...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer data is stored in...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The data retention period is 90 days...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Generate embeddings locally
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mxbai-embed-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Query with context
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;q_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mxbai-embed-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;q_embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer based on this context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long do we retain customer data?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# → "According to your policy, customer data is retained for 90 days."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this runs &lt;strong&gt;100% locally&lt;/strong&gt;. Zero data leaves your machine.&lt;/p&gt;




&lt;h2&gt;GDPR Compliance Checklist for Local AI&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Article 25 — Privacy by Design&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No third-party AI APIs = no data transfer by default&lt;/li&gt;
&lt;li&gt;Add access control to Open WebUI (SSO or local users)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Article 30 — Records of Processing Activities&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document: "AI inference on local hardware for internal use"&lt;/li&gt;
&lt;li&gt;No DPA with external processor needed (it's your hardware)&lt;/li&gt;
&lt;li&gt;List which models you use and for what purpose&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Article 32 — Security of Processing&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put Ollama behind a reverse proxy (Nginx/Traefik), don't expose port 11434 (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Use HTTPS even internally (Let's Encrypt with private CA)&lt;/li&gt;
&lt;li&gt;Restrict access by IP or VPN&lt;/li&gt;
&lt;/ul&gt;
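
&lt;p&gt;A minimal sketch of the reverse-proxy part; the hostname, certificate paths, and allowed subnet are placeholders for your own values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/nginx/conf.d/ollama.conf (sketch)
server {
    listen 443 ssl;
    server_name ollama.lab.internal;

    ssl_certificate     /etc/nginx/certs/lab.crt;
    ssl_certificate_key /etc/nginx/certs/lab.key;

    location / {
        allow 10.0.0.0/8;   # internal network only
        deny  all;
        proxy_pass http://127.0.0.1:11434;  # Ollama stays bound to localhost
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;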

&lt;p&gt;&lt;strong&gt;Chapter V — Third Country Transfers&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero transfers if fully local&lt;/li&gt;
&lt;li&gt;Carefully review any integrations (n8n, monitoring tools) for cloud callbacks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;What Models to Use for What&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM Needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General chat / writing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;German/multilingual&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mistral-small3.2:24b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast summaries&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2GB (CPU ok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings/RAG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mxbai-embed-large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3.1:70b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40GB (2x GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; an 8B-parameter model needs a minimum of 5-6GB VRAM at Ollama's default 4-bit quantization (8B parameters × ~0.5 bytes ≈ 4GB of weights, plus context overhead). Unquantized FP16 weights would need roughly 16GB, so 4-bit cuts VRAM use by about 60-70%.&lt;/p&gt;




&lt;h2&gt;Is It Worth It?&lt;/h2&gt;

&lt;p&gt;For a team of 5+ using AI daily: &lt;strong&gt;Yes, absolutely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison (monthly):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud APIs (GPT-4): $500-2000/month&lt;/li&gt;
&lt;li&gt;Self-hosted (amortized hardware + electricity): ~$50/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savings: ~$450-1950/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compliance benefit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero GDPR transfer risk&lt;/li&gt;
&lt;li&gt;No Article 28 DPA paperwork with AI providers&lt;/li&gt;
&lt;li&gt;Full audit trail (you control the logs)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;ollama.com&lt;/a&gt; — The easiest way to run LLMs locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI&lt;/strong&gt;: &lt;a href="https://docs.openwebui.com" rel="noopener noreferrer"&gt;docs.openwebui.com&lt;/a&gt; — Self-hosted ChatGPT UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n self-hosted&lt;/strong&gt;: &lt;a href="https://n8n.io" rel="noopener noreferrer"&gt;n8n.io&lt;/a&gt; — Local workflow automation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Want the Full Guide?&lt;/h2&gt;

&lt;p&gt;I packaged everything into &lt;strong&gt;&lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;Playbook 01 — Der lokale AI-Stack&lt;/a&gt;&lt;/strong&gt; (70+ pages, DACH-focused):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Docker Swarm setup from zero&lt;/li&gt;
&lt;li&gt;n8n workflow automation templates (production-tested)&lt;/li&gt;
&lt;li&gt;GDPR Article 30 compliance documentation templates&lt;/li&gt;
&lt;li&gt;All config files and code snippets included&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;€49 one-time&lt;/strong&gt; — &lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;ai-engineering.at&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building this publicly at &lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;ai-engineering.at&lt;/a&gt;. Running the entire AI infrastructure in a 5-node Docker Swarm in my home lab. AMA in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>privacy</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
