<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jörg Fuchs</title>
    <description>The latest articles on DEV Community by Jörg Fuchs (@aiengineeringat).</description>
    <link>https://dev.to/aiengineeringat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3791035%2Fec28ceff-455b-42c0-bcec-24a50ed1ff23.png</url>
      <title>DEV Community: Jörg Fuchs</title>
      <link>https://dev.to/aiengineeringat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aiengineeringat"/>
    <language>en</language>
    <item>
      <title>Hunyuan Video 720p on RTX 3090: Full On-Premise AI Media Pipeline E2E</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Wed, 04 Mar 2026 19:36:37 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/hunyuan-video-720p-on-rtx-3090-full-on-premise-ai-media-pipeline-e2e-4bl4</link>
      <guid>https://dev.to/aiengineeringat/hunyuan-video-720p-on-rtx-3090-full-on-premise-ai-media-pipeline-e2e-4bl4</guid>
      <description>&lt;p&gt;Running AI video generation on consumer hardware - here is our full E2E pipeline that generates photos and videos without any cloud APIs.&lt;/p&gt;

&lt;h2&gt;Hardware&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RTX 3090 24GB VRAM&lt;/li&gt;
&lt;li&gt;Intel i7-14700F (20 cores)&lt;/li&gt;
&lt;li&gt;30GB WSL2 RAM (critical for Hunyuan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Photo Pipeline (FLUX Dev FP8)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Resolution: 1344x768&lt;/li&gt;
&lt;li&gt;Generation time: ~44 seconds&lt;/li&gt;
&lt;li&gt;Quality: Professional stock photo level&lt;/li&gt;
&lt;li&gt;Guidance: 3.5 via FluxGuidance node&lt;/li&gt;
&lt;li&gt;CFG: 1.0 (FLUX ignores traditional CFG)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Video Pipeline (Hunyuan Video FP8)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Resolution: 1280x720&lt;/li&gt;
&lt;li&gt;Frames: 13 (~1.1s at 12fps)&lt;/li&gt;
&lt;li&gt;Generation time: ~7.8 minutes&lt;/li&gt;
&lt;li&gt;Key fix: quantization=fp8_e4m3fn keeps model at ~12GB on GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Critical Learning&lt;/h2&gt;

&lt;p&gt;The pre-quantized FP8 Hunyuan model with quantization=disabled causes OOM because HyVideoModelLoader upcasts weights to bf16 (~24GB). Setting quantization to fp8_e4m3fn keeps it in FP8 format (~12GB), leaving room for VAE and sampling.&lt;/p&gt;
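
&lt;p&gt;For context, here is a minimal sketch of where that flag lives when driving ComfyUI over its HTTP API. The node and input names follow the HunyuanVideoWrapper custom nodes; the model filename and the rest of the workflow are illustrative, not our exact graph.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: pin HyVideoModelLoader to FP8 when submitting a workflow to
# ComfyUI's /prompt endpoint. Filename and node wiring are illustrative.
import json
import urllib.request

workflow = {
    "1": {
        "class_type": "HyVideoModelLoader",
        "inputs": {
            "model": "hunyuan_video_fp8_e4m3fn.safetensors",  # placeholder name
            "quantization": "fp8_e4m3fn",  # the fix: no bf16 upcast, ~12GB
        },
    },
    # ... sampler, VAE decode, and video-combine nodes omitted
}

req = urllib.request.Request(
    "http://localhost:8188/prompt",  # default ComfyUI API port
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;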

&lt;h2&gt;VRAM Management&lt;/h2&gt;

&lt;p&gt;We built a custom VRAM Guard service that coordinates GPU access between Ollama (LLM) and ComfyUI (media generation). Before video generation, Ollama models are unloaded and ComfyUI cached models are freed.&lt;/p&gt;
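
&lt;p&gt;A minimal sketch of that handoff, assuming default ports: Ollama evicts a model from VRAM when a request arrives with keep_alive set to 0, and ComfyUI exposes a /free endpoint for dropping cached models.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the VRAM Guard handoff before a video job (default ports).
import requests

def free_vram_for_video(ollama_model="mistral:7b"):
    # 1. Tell Ollama to evict the model from VRAM immediately
    requests.post("http://localhost:11434/api/generate",
                  json={"model": ollama_model, "keep_alive": 0})
    # 2. Tell ComfyUI to drop cached models and free GPU memory
    requests.post("http://localhost:8188/free",
                  json={"unload_models": True, "free_memory": True})

free_vram_for_video()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;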

&lt;h2&gt;Pipeline Architecture&lt;/h2&gt;

&lt;p&gt;ComfyUI API → n8n workflow orchestration → Social Poster service → auto-post to Twitter, LinkedIn, Reddit, Dev.to&lt;/p&gt;

&lt;p&gt;All running on Docker Swarm across 6 nodes. No cloud dependencies.&lt;/p&gt;

</description>
      <category>docker</category>
    </item>
    <item>
      <title>Why Every AI Engineer Should Build a Homelab</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Tue, 03 Mar 2026 16:20:24 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/why-every-ai-engineer-should-build-a-homelab-3pp8</link>
      <guid>https://dev.to/aiengineeringat/why-every-ai-engineer-should-build-a-homelab-3pp8</guid>
      <description>&lt;p&gt;Running your own AI infrastructure at home is not just possible — it is powerful. Open-source LLMs, self-hosted automation, and a homelab stack can replace expensive cloud subscriptions. Here is how we do it at AI Engineering.&lt;/p&gt;

&lt;p&gt;#AI #Homelab #SelfHosted #OpenSource #Automation&lt;/p&gt;

</description>
      <category>automation</category>
    </item>
    <item>
      <title>I Built a Production 4-Agent AI Stack on Local Hardware — Here's What I Learned</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Thu, 26 Feb 2026 05:59:36 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/i-built-a-production-4-agent-ai-stack-on-local-hardware-heres-what-i-learned-4o0e</link>
      <guid>https://dev.to/aiengineeringat/i-built-a-production-4-agent-ai-stack-on-local-hardware-heres-what-i-learned-4o0e</guid>
      <description>&lt;p&gt;After months of iteration, I'm running a fully local AI agent system — GDPR-compliant by design, no cloud APIs, under €50/month running cost.&lt;/p&gt;

&lt;h2&gt;The Stack&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3x nodes (Docker Swarm): management, monitoring, databases&lt;/li&gt;
&lt;li&gt;1x GPU server: RTX 3090 for LLM inference&lt;/li&gt;
&lt;li&gt;1x dev machine: RTX 4070&lt;/li&gt;
&lt;li&gt;Total hardware: ~€2,400 (used)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — Mistral 7B, Llama 3.1, Codestral (local LLM inference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neo4j&lt;/strong&gt; — Knowledge graphs for structured memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB&lt;/strong&gt; — Vector store for RAG&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mattermost&lt;/strong&gt; — Self-hosted agent communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n&lt;/strong&gt; — Workflow automation (the glue)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt; — Full monitoring stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uptime Kuma&lt;/strong&gt; — Health checks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;4 Agents, Different Specializations&lt;/h2&gt;

&lt;p&gt;The agents communicate via Mattermost channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jim01&lt;/strong&gt; — Infrastructure orchestrator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lisa01&lt;/strong&gt; — Content quality and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;John01&lt;/strong&gt; — Frontend builder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Echo_log&lt;/strong&gt; — Memory management (Neo4j knowledge graph)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has its own persona, memory, and tool access.&lt;/p&gt;
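
&lt;p&gt;For a feel of the plumbing, this is roughly what a post from an agent looks like against the Mattermost REST API (v4). The URL, bot token, and channel ID are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: an agent posting into a shared channel via Mattermost's v4 API.
import requests

def agent_say(text):
    requests.post(
        "https://mattermost.example.com/api/v4/posts",
        headers={"Authorization": "Bearer AGENT_BOT_TOKEN"},  # placeholder
        json={"channel_id": "CHANNEL_ID", "message": text},   # placeholder
    )

agent_say("Jim01: nightly Neo4j backup completed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;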

&lt;h2&gt;Key Learnings&lt;/h2&gt;

&lt;h3&gt;1. Docker Swarm &amp;gt; Kubernetes (for small teams)&lt;/h3&gt;

&lt;p&gt;Seriously. If you're running 3-5 nodes, Swarm just works. No etcd cluster, no complex networking. &lt;code&gt;docker stack deploy&lt;/code&gt; and done.&lt;/p&gt;
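
&lt;p&gt;The entire ceremony, assuming a fresh manager node (the addresses and join token below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On the manager node
docker swarm init --advertise-addr 192.168.1.10

# On each worker, run the join command that init printed
docker swarm join --token SWMTKN-... 192.168.1.10:2377

# Deploy (or update) the whole stack from one compose file
docker stack deploy -c docker-compose.yml ai-stack
docker stack services ai-stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;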

&lt;h3&gt;2. HippoRAG with Neo4j beats pure vector search&lt;/h3&gt;

&lt;p&gt;The combination of knowledge graphs + Personalized PageRank gives much better results for multi-hop reasoning than ChromaDB alone.&lt;/p&gt;
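
&lt;p&gt;A sketch of the retrieval step, assuming the Graph Data Science plugin is installed and the graph is projected as 'knowledge-graph'; the label and seed values are illustrative, not our production schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: Personalized PageRank over the Neo4j knowledge graph, seeded
# with entities extracted from the query (HippoRAG-style retrieval).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

PPR_QUERY = """
MATCH (seed:Entity) WHERE seed.name IN $seeds
WITH collect(seed) AS seedNodes
CALL gds.pageRank.stream('knowledge-graph', {sourceNodes: seedNodes})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS entity, score
ORDER BY score DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(PPR_QUERY, seeds=["GDPR", "data retention"]):
        print(record["entity"], record["score"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;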

&lt;h3&gt;3. Disk space will kill you before anything else&lt;/h3&gt;

&lt;p&gt;Ollama models, Neo4j databases, Docker images — monitor your disk. This was our #1 production incident.&lt;/p&gt;
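
&lt;p&gt;Even a crude watchdog would have caught it early. A minimal sketch; the threshold and the Mattermost webhook URL are ours to fill in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: warn a Mattermost channel when root disk usage crosses 85%.
import shutil
import requests

usage = shutil.disk_usage("/")
percent = usage.used / usage.total * 100
if percent &amp;gt; 85:  # threshold is a judgment call
    requests.post("https://mattermost.example.com/hooks/WEBHOOK_ID",  # placeholder
                  json={"text": f"Disk at {percent:.0f}% on this node"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;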

&lt;h3&gt;4. Agent personas need careful tuning&lt;/h3&gt;

&lt;p&gt;Without clear boundaries, agents get confused about their role. Explicit persona files with rules work better than general instructions.&lt;/p&gt;
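
&lt;p&gt;Ours are short and explicit. A trimmed, purely illustrative example of the shape (not Lisa01's real file; the channel name is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# persona: Lisa01
Role: Content quality and compliance reviewer.
Scope: Review drafts posted in #content-review. Nothing else.
Rules:
- Never publish directly; always hand back to Jim01.
- Flag any personal data before it leaves the network.
- If a request is outside scope, say so and stop.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;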

&lt;h3&gt;5. n8n is the underrated MVP&lt;/h3&gt;

&lt;p&gt;Webhooks, API orchestration, error handling, notifications — n8n connects everything. 28 workflows running in production.&lt;/p&gt;

&lt;h2&gt;Running Cost&lt;/h2&gt;

&lt;p&gt;~€47/month electricity. That's it. No API bills, no cloud subscriptions.&lt;/p&gt;

&lt;h2&gt;Why Local?&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;EU AI Act&lt;/strong&gt; becomes fully enforceable August 2026. Fines up to €35M or 7% of global revenue. If you're sending data to OpenAI/Anthropic APIs from the EU, compliance gets complex.&lt;/p&gt;

&lt;p&gt;Running everything locally means GDPR-compliant by design. No data leaves your network.&lt;/p&gt;

&lt;h2&gt;The Playbook&lt;/h2&gt;

&lt;p&gt;I wrote everything up as a detailed playbook: 8 chapters, ~70 pages, all docker-compose files and code examples included.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check it out:&lt;/strong&gt; &lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;ai-engineering.at&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions welcome — happy to discuss the architecture!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Ollama, Docker Swarm, Neo4j, n8n, and a lot of late nights.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>docker</category>
      <category>gdpr</category>
    </item>
    <item>
      <title>Running AI Locally in 2026: A GDPR-Compliant Guide</title>
      <dc:creator>Jörg Fuchs</dc:creator>
      <pubDate>Wed, 25 Feb 2026 07:21:04 +0000</pubDate>
      <link>https://dev.to/aiengineeringat/running-ai-locally-in-2026-a-gdpr-compliant-guide-oml</link>
      <guid>https://dev.to/aiengineeringat/running-ai-locally-in-2026-a-gdpr-compliant-guide-oml</guid>
<description>&lt;h2&gt;Why Running AI Locally Actually Matters in 2026&lt;/h2&gt;

&lt;p&gt;Every AI tool you use — ChatGPT, Copilot, Claude — sends your data to someone else's server. For most developers, that's fine. For companies handling customer data under GDPR, it's a compliance nightmare waiting to happen.&lt;/p&gt;

&lt;p&gt;I've spent the last year building a fully self-hosted AI stack for a small Austrian engineering firm. No cloud. No data leaving our datacenter. Full GDPR Article 30 compliance. And honestly — it's faster than most cloud APIs.&lt;/p&gt;

&lt;p&gt;Here's what I learned.&lt;/p&gt;




&lt;h2&gt;The GDPR Problem With Cloud AI&lt;/h2&gt;

&lt;p&gt;When you send a query to a cloud LLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data crosses into a third country (US data centers = Chapter V GDPR transfer)&lt;/li&gt;
&lt;li&gt;You need an Article 28 Data Processing Agreement with the provider&lt;/li&gt;
&lt;li&gt;You need to document it in your Article 30 Register&lt;/li&gt;
&lt;li&gt;If there's a breach, you're on the hook in 72 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a small team, this paperwork alone is a reason to go local.&lt;/p&gt;




&lt;h2&gt;The Stack (What We're Actually Running)&lt;/h2&gt;

&lt;p&gt;Here's our production setup on a 5-node Docker Swarm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hardware: 1x server + 1x workstation with RTX 3090 (24GB VRAM)
OS: Proxmox VE → Ubuntu VMs

Services:
- Ollama          → Local LLM inference (Mistral, Llama3, Qwen)
- Open WebUI      → Chat interface (like ChatGPT, but yours)
- Whisper STT     → Speech-to-text, fully local
- Piper TTS       → Text-to-speech, runs on CPU
- ChromaDB        → Vector database for RAG
- n8n             → Workflow automation (local, not cloud)
- Prometheus + Grafana → Monitoring
- Mattermost      → Team communication (self-hosted Slack)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total cost: ~€800-1200 for the GPU workstation (used RTX 3090).&lt;br&gt;
Monthly running cost: ~€40 electricity.&lt;/p&gt;

&lt;p&gt;Compare to: GPT-4 API at $10-30/1M tokens for a team doing 100K queries/month (at roughly 1K tokens per query, that's ~100M tokens) = &lt;strong&gt;$1,000-3,000/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break-even: 1-3 months.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;Getting Started: The Minimal Setup&lt;/h2&gt;

&lt;p&gt;You don't need a 5-node cluster. Here's the minimal viable self-hosted AI stack:&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A machine with 16GB RAM (GPU optional, but recommended)&lt;/li&gt;
&lt;li&gt;Docker + Docker Compose&lt;/li&gt;
&lt;li&gt;1 afternoon&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Install Ollama&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux/Mac&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a model (Llama3.2 3B runs on CPU, ~2GB)&lt;/span&gt;
ollama pull llama3.2:3b

&lt;span class="c"&gt;# Test it&lt;/span&gt;
ollama run llama3.2:3b &lt;span class="s2"&gt;"What is GDPR Article 5?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Step 2: Add Open WebUI (ChatGPT-like Interface)&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;open-webui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/open-webui/open-webui:main&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:8080"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;open-webui:/app/backend/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OLLAMA_BASE_URL=http://host.docker.internal:11434&lt;/span&gt;
    &lt;span class="na"&gt;extra_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host.docker.internal:host-gateway"&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;open-webui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# Open http://localhost:3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. You now have a local ChatGPT alternative.&lt;/p&gt;


&lt;h2&gt;GPU Makes All the Difference&lt;/h2&gt;

&lt;p&gt;On CPU (Intel i7):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama3.2:3b&lt;/code&gt; → ~10 tokens/sec (usable)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama3.1:8b&lt;/code&gt; → ~3 tokens/sec (slow)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mistral:7b&lt;/code&gt; → ~3 tokens/sec (slow)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On GPU (RTX 3090 24GB):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama3.2:3b&lt;/code&gt; → ~150 tokens/sec (fast)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama3.1:8b&lt;/code&gt; → ~80 tokens/sec (fast)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mistral-small3.2:24b&lt;/code&gt; → ~35 tokens/sec (fast)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen2.5:32b&lt;/code&gt; → ~25 tokens/sec (good for coding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommendation:&lt;/strong&gt; A used RTX 3060 12GB (~€250) is the sweet spot for small teams.&lt;/p&gt;


&lt;h2&gt;RAG: Making Your LLM Know Your Data&lt;/h2&gt;

&lt;p&gt;The real power of local AI is &lt;strong&gt;Retrieval Augmented Generation (RAG)&lt;/strong&gt; — feeding your own documents to the model without fine-tuning.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple RAG with ChromaDB + Ollama
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Index your documents
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our GDPR policy requires...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer data is stored in...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The data retention period is 90 days...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Generate embeddings locally
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mxbai-embed-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Query with context
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;q_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mxbai-embed-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;q_embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer based on this context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long do we retain customer data?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# → "According to your policy, customer data is retained for 90 days."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this runs &lt;strong&gt;100% locally&lt;/strong&gt;. Zero data leaves your machine.&lt;/p&gt;




&lt;h2&gt;GDPR Compliance Checklist for Local AI&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Article 25 — Privacy by Design&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No third-party AI APIs = no data transfer by default&lt;/li&gt;
&lt;li&gt;Add access control to Open WebUI (SSO or local users)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Article 30 — Records of Processing Activities&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document: "AI inference on local hardware for internal use"&lt;/li&gt;
&lt;li&gt;No DPA with external processor needed (it's your hardware)&lt;/li&gt;
&lt;li&gt;List which models you use and for what purpose&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Article 32 — Security of Processing&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put Ollama behind a reverse proxy (Nginx/Traefik), don't expose port 11434 (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Use HTTPS even internally (Let's Encrypt with private CA)&lt;/li&gt;
&lt;li&gt;Restrict access by IP or VPN&lt;/li&gt;
&lt;/ul&gt;
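
&lt;p&gt;A minimal sketch of the reverse-proxy part; the hostname, certificate paths, and allowed subnet are placeholders for your own values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/nginx/conf.d/ollama.conf (sketch)
server {
    listen 443 ssl;
    server_name ollama.lab.internal;

    ssl_certificate     /etc/nginx/certs/lab.crt;
    ssl_certificate_key /etc/nginx/certs/lab.key;

    location / {
        allow 10.0.0.0/8;   # internal network only
        deny  all;
        proxy_pass http://127.0.0.1:11434;  # Ollama stays bound to localhost
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;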

&lt;p&gt;&lt;strong&gt;Chapter V — Third Country Transfers&lt;/strong&gt; ✅&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero transfers if fully local&lt;/li&gt;
&lt;li&gt;Carefully review any integrations (n8n, monitoring tools) for cloud callbacks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;What Models to Use for What&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM Needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General chat / writing&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5:14b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;German/multilingual&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mistral-small3.2:24b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast summaries&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3.2:3b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2GB (CPU ok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings/RAG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mxbai-embed-large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3.1:70b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40GB (2x GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; an 8B-parameter model needs a minimum of 5-6GB VRAM at Ollama's default 4-bit quantization (8B parameters × ~0.5 bytes ≈ 4GB of weights, plus context overhead). Unquantized FP16 weights would need roughly 16GB, so 4-bit cuts VRAM use by about 60-70%.&lt;/p&gt;




&lt;h2&gt;Is It Worth It?&lt;/h2&gt;

&lt;p&gt;For a team of 5+ using AI daily: &lt;strong&gt;Yes, absolutely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison (monthly):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud APIs (GPT-4): $500-2000/month&lt;/li&gt;
&lt;li&gt;Self-hosted (amortized hardware + electricity): ~$50/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savings: ~$450-1950/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compliance benefit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero GDPR transfer risk&lt;/li&gt;
&lt;li&gt;No Article 28 DPA paperwork with AI providers&lt;/li&gt;
&lt;li&gt;Full audit trail (you control the logs)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt;: &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;ollama.com&lt;/a&gt; — The easiest way to run LLMs locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI&lt;/strong&gt;: &lt;a href="https://docs.openwebui.com" rel="noopener noreferrer"&gt;docs.openwebui.com&lt;/a&gt; — Self-hosted ChatGPT UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n self-hosted&lt;/strong&gt;: &lt;a href="https://n8n.io" rel="noopener noreferrer"&gt;n8n.io&lt;/a&gt; — Local workflow automation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Want the Full Guide?&lt;/h2&gt;

&lt;p&gt;I packaged everything into &lt;strong&gt;&lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;Playbook 01 — Der lokale AI-Stack&lt;/a&gt;&lt;/strong&gt; (70+ pages, DACH-focused):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete Docker Swarm setup from zero&lt;/li&gt;
&lt;li&gt;n8n workflow automation templates (production-tested)&lt;/li&gt;
&lt;li&gt;GDPR Article 30 compliance documentation templates&lt;/li&gt;
&lt;li&gt;All config files and code snippets included&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;€49 one-time&lt;/strong&gt; — &lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;ai-engineering.at&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building this publicly at &lt;a href="https://www.ai-engineering.at" rel="noopener noreferrer"&gt;ai-engineering.at&lt;/a&gt;. Running the entire AI infrastructure in a 5-node Docker Swarm in my home lab. AMA in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>privacy</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
