The Homelab AI Stack in 2026: What Self-Hosters Are Actually Running
SIGNAL — Weekly intelligence for builders
Spend five minutes on r/selfhosted and you'll notice: the conversations have changed.
Two years ago everyone asked "what should I run?" Now they're sharing sophisticated stacks that rival small business infrastructure. The self-hosting AI movement has matured. Here's what's actually worth deploying in 2026.
The Core Stack (What Stayed)
Ollama — Local LLM Runtime
Ollama won. It beat LocalAI on simplicity, beat llama.cpp on UX, and the model library makes pulling new models trivial.
# Install
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the best value model for 16GB RAM
ollama pull qwen2.5:14b
# Or for 24GB+ (M4 Mac mini, high-RAM PC)
ollama pull qwen2.5:32b
# Test immediately
ollama run qwen2.5:14b "Explain what makes a good Docker Compose file"
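The CLI is only one interface. Ollama also serves a local REST API on port 11434, which is how the rest of the stack talks to it. A minimal stdlib-only sketch, assuming the qwen2.5:14b model pulled above:

```python
# Minimal sketch of calling Ollama's local REST API (stdlib only).
# Endpoint and payload shape follow Ollama's /api/generate API;
# the model name assumes you pulled qwen2.5:14b as above.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="qwen2.5:14b"):
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="qwen2.5:14b"):
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Explain what makes a good Docker Compose file")
# (requires the Ollama server to be running)
```

Every script in the rest of this article that "talks to Ollama" is doing some version of this call.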
Hardware reality check:
- 8GB RAM → 7B models, basic tasks
- 16GB RAM → 14B models, solid capability
- 24GB RAM (M4 Mac mini sweet spot) → 32B models, near GPT-4 quality
- 32GB+ → 70B models, excellent for everything
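The tiers above follow from simple arithmetic: at roughly 4-bit quantization, a model needs on the order of 0.6 GB per billion parameters, plus fixed overhead for the KV cache and runtime. A quick sketch, where both constants are ballpark assumptions rather than measured figures:

```python
# Ballpark RAM estimate behind the tiers above. Assumes ~Q4 quantization
# (~0.6 GB per billion parameters) plus ~2 GB of overhead for KV cache
# and runtime; both constants are rough assumptions, not measurements.
def est_ram_gb(params_b, gb_per_b_params=0.6, overhead_gb=2.0):
    return params_b * gb_per_b_params + overhead_gb

for size in (7, 14, 32, 70):
    print(f"{size}B model -> ~{est_ram_gb(size):.0f} GB")
```

Context length pushes the overhead up, so treat these as floors, not targets.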
Open WebUI — The Interface
Deploys in 2 minutes, gives you a ChatGPT-equivalent interface locally:
# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped
volumes:
  open-webui:
n8n — Automation Brain
For connecting AI to everything else. Self-hosted, no per-workflow limits, full control.
The killer use case in 2026: n8n + Ollama = private AI automations that cost $0/month to run.
My actual running workflows:
- Gmail → Ollama triage → priority flag → Telegram alert
- RSS feeds → Ollama summary → daily digest at 7am
- Server logs → Ollama anomaly check → alert if weird
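That last workflow's logic is simple enough to sketch outside n8n. In practice these are separate nodes; the prompt wording and the ALERT convention below are illustrative assumptions, not the exact configuration:

```python
# Sketch of the "server logs -> Ollama anomaly check -> alert" logic.
# In n8n these live in separate nodes; the prompt and the ALERT
# convention are assumptions for illustration.
def build_anomaly_prompt(log_lines, keep=50):
    logs = "\n".join(log_lines[-keep:])  # cap the prompt at the newest lines
    return (
        "You are monitoring a homelab server. If these logs look abnormal, "
        "reply ALERT plus one sentence; otherwise reply OK.\n\n" + logs
    )

def should_alert(model_reply):
    # Only fire the notification node on an explicit ALERT
    return model_reply.strip().upper().startswith("ALERT")

print(should_alert("OK, routine cron noise"))        # False
print(should_alert("ALERT: repeated SSH failures"))  # True
```

Keeping the "is this an alert?" decision to a strict string check makes the workflow deterministic even when the model gets chatty.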
What Got Replaced in 2026
LocalAI → Ollama
LocalAI was great for compatibility. Ollama is better for everything else.
Flowise → n8n
Flowise is excellent for RAG pipelines. But n8n handles AI and everything else — one tool beats two.
Custom Python scripts → n8n workflows
Maintainability. An n8n workflow is inspectable, editable, and debuggable without touching code.
What Got Added in 2026
Whisper.cpp — Local Audio Transcription
brew install whisper-cpp
# or build from source for maximum performance
# Download a ggml model once, then transcribe
# (Homebrew installs the binary as whisper-cli; feed it 16 kHz WAV)
ffmpeg -i audio.mp3 -ar 16000 -ac 1 audio.wav
whisper-cli -m ggml-base.en.bin -f audio.wav
Use cases: meeting transcription, voice notes → text, local podcast search.
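For the voice-notes use case, the CLI wraps cleanly in a script. A sketch that batch-transcribes a folder, assuming Homebrew's `whisper-cli` binary and a downloaded ggml model (both paths are examples):

```python
# Sketch: batch-transcribe voice notes with whisper.cpp's CLI. Assumes
# the whisper-cli binary (from Homebrew's whisper-cpp) and a downloaded
# ggml model file; the paths here are examples.
import pathlib
import subprocess

def transcribe_cmd(audio, model="models/ggml-base.en.bin"):
    # -m model file, -f input audio, -otxt writes transcript to <audio>.txt
    return ["whisper-cli", "-m", model, "-f", str(audio), "-otxt"]

def transcribe_folder(folder):
    for wav in sorted(pathlib.Path(folder).glob("*.wav")):
        subprocess.run(transcribe_cmd(wav), check=True)

print(transcribe_cmd("note.wav"))
```

Pipe the resulting .txt files into an Ollama summary workflow and voice notes become searchable text.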
LiteLLM — The Unified Proxy
This one changed everything. LiteLLM sits in front of all your AI models and presents a single OpenAI-compatible API endpoint.
# All your apps point to one endpoint
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- OPENAI_API_KEY=${OPENAI_API_KEY}
volumes:
- ./litellm_config.yaml:/app/config.yaml
Now every app in your stack — n8n, Open WebUI, your scripts — uses http://litellm:4000 and you switch models in one config file.
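The mounted litellm_config.yaml is where the routing lives. A minimal example; the aliases and model names are illustrative, and `os.environ/...` is LiteLLM's syntax for reading a key from the environment:

```yaml
# litellm_config.yaml: map friendly aliases to local and cloud models
model_list:
  - model_name: local-qwen            # what your apps request
    litellm_params:
      model: ollama/qwen2.5:14b       # served locally by Ollama
      api_base: http://host.docker.internal:11434
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
```

Swapping a workflow from local to cloud (or back) is then a one-word change to the alias it requests.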
ChromaDB + LlamaIndex — Private RAG
Search your own documents with AI. All local, all private.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Index your documents
docs = SimpleDirectoryReader('/your/docs/folder').load_data()
db = chromadb.PersistentClient(path='./chroma_db')
collection = db.get_or_create_collection('my_docs')
store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=store)

# Query them (point Settings.llm / Settings.embed_model at local models
# first; LlamaIndex defaults to OpenAI's API otherwise)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
engine = index.as_query_engine()
response = engine.query('What did we decide about the API architecture?')
print(response)
The Hardware Question
Everyone asks: GPU server vs Apple Silicon?
In 2026, for pure AI inference at homelab scale: Apple Silicon wins on value.
M4 Mac mini (24GB, ~$800):
- Runs 32B models at 10-15 tokens/sec
- Silent, 30W idle
- No separate GPU needed
- macOS means easy maintenance
NVIDIA GPU server (RTX 4090, 24GB VRAM):
- Faster inference on large batches
- Better for fine-tuning
- Loud, 450W under load
- Linux-only for serious use
For a homelab running 1-5 concurrent users doing text tasks: Mac mini M4. For serious inference throughput or training: GPU server.
The Monitoring Stack
Don't run AI services without knowing when they break:
- Uptime Kuma — check if Ollama/n8n/Open WebUI are responding
- Netdata — per-container resource usage
- Loki + Grafana — aggregate logs from all containers
# Add to any docker-compose.yml for log collection
labels:
  - logging=promtail
  - logging_jobname=containerlogs
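For a quick sanity check outside Uptime Kuma, the same health endpoints can be polled from a script. The paths below are the ones these projects commonly expose; treat them as assumptions and verify against the versions you run:

```python
# Quick sanity check for the stack's health endpoints (stdlib only).
# The paths are the ones these projects commonly expose; verify them
# against your installed versions.
import urllib.request

CHECKS = {
    "ollama":     "http://localhost:11434/api/tags",
    "open-webui": "http://localhost:3000/health",
    "n8n":        "http://localhost:5678/healthz",
}

def is_up(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, DNS failures, timeouts
        return False

if __name__ == "__main__":
    for name, url in CHECKS.items():
        print(f"{name}: {'UP' if is_up(url) else 'DOWN'}")
```

Drop it in cron with a Telegram ping on DOWN and you have a poor man's pager.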
What I'd Set Up First on a New Server
In order, if starting from scratch:
- Traefik — reverse proxy + automatic HTTPS (everything else goes behind it)
- Ollama — pull qwen2.5:14b first, add others as needed
- Open WebUI — immediate usable interface
- n8n — automation brain
- LiteLLM — unified API proxy
- Uptime Kuma — monitoring
- Vaultwarden — password manager (you'll need it)
The One Thing Most People Miss
Running models locally is only half the value.
The other half is connecting them to your actual workflow: your email, your calendar, your codebase, your documents. A local LLM that only answers questions in a chat window is just a slower, private ChatGPT.
A local LLM wired into n8n that automatically triages your email, monitors your servers, and summarizes your notes — that's actual leverage.
SIGNAL publishes weekly. Follow @signal-weekly for more practical builder content.
Next: How I use AI agents to automate the boring parts of running a homelab — specific n8n workflows, working code.