DEV Community: EveryLocalAI

How to Set Up a Local AI Coding Assistant in VS Code – Free & Private

EveryLocalAI — Thu, 18 Jun 2026 09:02:36 +0000

Want a Cursor/Copilot-style coding assistant that runs entirely on your machine? Your code never leaves your computer and there's no subscription fee. Here's how to set it up with VS Code, Continue, and Ollama.

What You'll Build

Tab autocomplete (like Copilot) that suggests code as you type
Chat with your codebase - ask questions, generate functions, write tests
100% local - zero data sent to any cloud service

Prerequisites

A GPU with 24GB+ VRAM (RTX 3090/4090 or better)
For smaller GPUs (8-12GB), use Qwen2.5 Coder 7B instead
Ollama installed (see ollama.com)
VS Code (free from code.visualstudio.com)

Step 1: Pull the Model

Open a terminal and pull a coding-focused model:

ollama pull qwen2.5-coder:14b

This takes a few minutes, depending on your connection speed. The download is roughly 9GB at the default Q4 quantization.

Step 2: Install Continue

In VS Code:

Open Extensions (Ctrl+Shift+X)
Search for "Continue"
Click Install
Reload VS Code when prompted

Step 3: Configure

Create or edit ~/.continue/config.yaml:

name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Qwen2.5 Coder 14B
    provider: ollama
    model: qwen2.5-coder:14b
    roles:
      - chat
      - edit
  - name: Qwen2.5 Coder (autocomplete)
    provider: ollama
    model: qwen2.5-coder:14b
    roles:
      - autocomplete
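Before opening VS Code, it can help to confirm that Ollama is actually serving the model named in the config. The sketch below is one way to check, assuming Ollama listens on its default port (11434) and that its /api/tags endpoint returns a JSON object with a "models" list. The helper names are illustrative and not part of Continue or Ollama.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumes the default Ollama port


def missing_models(tags_payload, required):
    """Return the required model names that are absent from an /api/tags payload."""
    installed = {m.get("name") for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=5):
    """Fetch the list of locally pulled models from a running Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example with a hand-written payload shaped like Ollama's response
sample = {"models": [{"name": "qwen2.5-coder:14b"}]}
print(missing_models(sample, ["qwen2.5-coder:14b", "qwen2.5-coder:7b"]))
```

With a live server, `missing_models(fetch_tags(), ["qwen2.5-coder:14b"])` should come back empty once the pull from Step 1 has finished.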

Step 4: Use It

Autocomplete: Start typing. Continue suggests completions in gray. Press Tab to accept.
Chat: Press Ctrl+L (or Cmd+L on Mac) to open the chat panel. Ask questions about your code.
Edit: Select code and press Ctrl+Shift+L to ask for changes.
Inline: Highlight code, press Ctrl+I, and describe what you want changed.

Performance Notes

GPU	Model	Speed	Quality
RTX 3090 (24GB)	Qwen2.5-Coder 14B	25-35 tok/s	Excellent
RTX 4090 (24GB)	Qwen2.5-Coder 14B	40-50 tok/s	Excellent
RTX 3060 (12GB)	Qwen2.5-Coder 7B	30-40 tok/s	Good
RTX 4060 (8GB)	Qwen2.5-Coder 7B (Q4)	20-30 tok/s	Good

Why Go Local?

$0/month vs $20/seat for Copilot or Cursor
Privacy: your proprietary code never touches a third-party server
Offline: works without internet
Model choice: swap models anytime, no vendor lock-in

Originally published on everylocalai.com

Build Your Own Private ChatGPT in 15 Minutes – Local AI, Zero Cloud Cost

EveryLocalAI — Thu, 18 Jun 2026 09:01:47 +0000

Want a ChatGPT-like experience that runs entirely on your own GPU? No monthly fees, no data leaving your machine, and it works offline. Here's how to set it up in 15 minutes.

What You'll Build

A full ChatGPT-style web UI running locally
Your choice of open-source LLM (Qwen3 14B or Llama 3.1 8B)
Multiple user accounts for your LAN
100% private - nothing leaves your network

Prerequisites

A GPU with 12GB+ VRAM (RTX 3060 12GB works great)
Docker + Docker Compose installed
NVIDIA Container Toolkit for GPU passthrough (Linux) or WSL2 (Windows)

Setup

Create a docker-compose.yml file:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    restart: unless-stopped

volumes:
  ollama:
  open-webui:

Run It

docker compose up -d
docker exec ollama ollama pull qwen3:14b

Open http://localhost:3000, create your admin account, pick qwen3:14b from the dropdown, and start chatting.

What Makes It Great

$0/month vs $20/month for ChatGPT Plus
Full privacy - conversations stay on your machine
Works offline - no internet connection needed after setup
Multi-user - share with family or your team on the same LAN
Model switching - swap between different models mid-conversation

Performance

On an RTX 3060 12GB with Qwen3 14B (Q4): ~20-25 tok/s, smooth for chat. For 8GB cards, use Llama 3.1 8B instead.

Originally published on everylocalai.com

Run Qwen3.6-27B Locally: The Most Capable Open Model for a Single GPU

EveryLocalAI — Thu, 18 Jun 2026 08:34:28 +0000

Qwen3.6-27B is a dense 27-billion-parameter model from Alibaba that scores 77.2% on SWE-bench Verified, matching closed-source models like Claude Sonnet 4.5 on real-world coding tasks. It ships under the Apache 2.0 license with native vision support, a 262K context window, and a hybrid thinking mode.

Paired with Ollama for one-command serving and Open WebUI for a ChatGPT-like interface, this stack gives you a private AI assistant that rivals cloud services with no monthly fee.

What makes Qwen3.6-27B special

Vision understanding — baked-in vision encoder, upload images and ask about them
262K context window — entire codebases or long documents in one pass
Hybrid thinking — shows reasoning before answering, skip with /no_think
77.2% SWE-bench — competes with Sonnet 4.5 on real PRs
Apache 2.0 license — free for any use

Hardware requirements

Quantization	VRAM needed	Hardware
Q4_K_M	16-18 GB	RTX 3090, RTX 4070 Ti Super, Mac 24GB+
Q8_0	28 GB	RTX 4090 (24GB, with partial CPU offload), Mac 32GB+
BF16	54 GB	2x RTX 4090, A100

The Q4_K_M sweet spot fits a single RTX 3090 (24GB, ~$750 used). On Mac, you need 24GB+ unified memory.
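The VRAM figures in the table follow roughly from parameter count times bits per weight. The sketch below reproduces that arithmetic. The bits-per-weight values are approximate assumptions for these formats, not exact figures, and the estimate covers the weights only (no KV cache or runtime overhead).

```python
# Approximate effective bits per weight; assumed values, not exact figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "BF16": 16.0}


def weight_size_gb(params_billion, quant):
    """Estimate the size of the model weights in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8  # billions of params * bytes per param


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weight_size_gb(27, quant):.1f} GB of weights")
```

The context cache and runtime buffers come on top of the weights, which is why the Q4_K_M row is listed as a range rather than a single number.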

One-command setup with Ollama

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3.6-27B (Ollama's default tag is normally a Q4 quantization)
ollama pull qwen3.6:27b

# Run it
ollama run qwen3.6:27b

That's it. Hybrid thinking is on by default — the model shows reasoning before answering. Use /no_think for faster responses.

Add a chat UI with Open WebUI

Run Open WebUI alongside Ollama for a polished ChatGPT experience:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
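      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    extra_hosts:
      # Needed on Linux so the container can reach Ollama running on the host
      - "host.docker.internal:host-gateway"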

Performance on consumer GPUs

Hardware	Q4 speed	Q8 speed
RTX 3090 (24GB)	25-35 tok/s	15-20 tok/s
RTX 4070 Ti Super (16GB)	10-15 tok/s	—
Mac M4 Max (48GB)	20-30 tok/s	12-18 tok/s
Mac M2 Pro (24GB)	10-15 tok/s	—

Cost vs cloud

Local: $0/month + $750 for a used RTX 3090. Claude Sonnet: $20/month + per-token charges. At roughly $95/month of combined subscription and API spend, the GPU pays for itself in about 8 months. You also get complete privacy and no rate limits.
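The break-even estimate is plain division, sketched below. It assumes the only costs are the one-time GPU purchase and a steady monthly cloud bill, and it ignores electricity. The $93.75 figure is worked backward from the 8-month claim, not taken from any price list.

```python
def payback_months(hardware_cost, monthly_cloud_spend):
    """Months until a one-time hardware cost equals cumulative cloud spend."""
    if monthly_cloud_spend <= 0:
        raise ValueError("monthly_cloud_spend must be positive")
    return hardware_cost / monthly_cloud_spend


print(payback_months(750, 20))     # subscription fee only
print(payback_months(750, 93.75))  # spend level that yields 8 months
```

On the $20 subscription alone the payback period is much longer, so the 8-month figure only holds for heavy per-token usage.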

Originally published on everylocalai.com

Build a Private Windows AI Assistant with LM Studio and AnythingLLM

EveryLocalAI — Thu, 18 Jun 2026 08:33:50 +0000

A fully private AI stack for Windows that never touches the cloud. LM Studio serves as your local model server with a visual interface — browse, download, and run models from HuggingFace without typing a single command. AnythingLLM adds document RAG, workspace isolation, and agent skills on top.

This stack is built for Windows users who prefer a graphical interface — no Docker, no terminal commands beyond the basics.

What you'll build

Visual model browser — search HuggingFace models inside LM Studio, download with one click
Drop-in document Q&A — PDF, DOCX, TXT, CSV, code files. Drag them into AnythingLLM and ask questions
No data leaves your PC — all inference and embedding runs locally, works completely offline
No Docker, no WSL, no CLI — both apps are native Windows desktop installers
$0/month — the only cost is the GPU you already own

Prerequisites

Windows 11 (64-bit)
GPU with 4GB+ VRAM (6GB+ preferred), CPU works but slower
16GB RAM minimum
10-30GB free disk for models

Step 1: Install LM Studio

Go to lmstudio.ai and download the Windows installer. Run it — default path is fine.

LM Studio is both a model manager and a local OpenAI-compatible API server. You search models from Hugging Face visually and serve them over a local HTTP endpoint.

Step 2: Download a model

In LM Studio, go to the Discover tab and search for Qwen2.5-14B. Look for a Q4_K_M quantized version — best balance of quality and size. Click Download and wait (~8 GB).

If you have 8GB VRAM or less, search for Qwen2.5-7B or Llama 3.2 3B instead.

Step 3: Start the local server

Go to the Developer tab in LM Studio, select your model, and click Start Server. You should see: Server listening on http://localhost:1234.
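Once the server is listening, any OpenAI-style client can talk to it. Here is a minimal sketch using only the Python standard library, assuming the server exposes the usual /v1/chat/completions route on port 1234. The model argument is a placeholder for whatever identifier LM Studio shows for your loaded model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's local OpenAI-compatible endpoint


def build_chat_request(prompt, model, base_url=BASE_URL):
    """Build a POST request for the /chat/completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model), timeout=120) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling `chat("Say hello", "your-model-id")` with the server running is a quick way to confirm the endpoint works before wiring up AnythingLLM.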

Step 4: Install AnythingLLM

Go to anythingllm.com/desktop and download the Windows installer. Install for Current User only — not All Users — to avoid a known spawn error.

Step 5: Connect AnythingLLM to LM Studio

In AnythingLLM Settings > LLM Preference, select LM Studio as the provider and set the base URL to http://localhost:1234/v1. Save changes. Then go to Embedding Model and select the built-in AnythingLLM embedder.

Step 6: Chat and upload documents

Create a workspace, then drag files into the chat area. AnythingLLM creates embeddings locally and lets you ask questions about your documents. Workspaces are isolated — perfect for keeping work and personal contexts separate.

Performance by GPU

GPU	Max model	Speed
RTX 3060 12GB	14B at Q4	15-20 tok/s
RTX 4060 8GB	7B at Q4	20-30 tok/s
CPU-only 16GB	3B at Q4	3-5 tok/s

Cost comparison

Local stack: $0/month + $200 for used RTX 3060. ChatGPT Plus: $20/month with no privacy guarantees. The GPU pays for itself in 10 months.

Originally published on everylocalai.com

Build a Private Voice Assistant with Whisper, Ollama, and Kokoro TTS

EveryLocalAI — Sun, 14 Jun 2026 22:09:16 +0000

Have you ever wanted your own Jarvis? A voice assistant that listens, thinks, and speaks back - all running privately on your own hardware? Here's how to build one with Whisper.cpp, Ollama, and Kokoro TTS.

No cloud, no wake-word fees, no data leaving your machine.

Prerequisites

Hardware: Any modern computer with a microphone
Software: Python 3.10+, Ollama installed, git and CMake (to build whisper.cpp), FFmpeg (the script plays audio with ffplay)
Time: ~30 minutes setup

Installation

1. Install Ollama and Pull a Model

ollama pull qwen3:14b

2. Install Whisper.cpp

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build --config Release
bash models/download-ggml-model.sh medium

3. Install Kokoro TTS

pip install kokoro soundfile pyaudio requests

Wiring It All Together

Save this as voice_assistant.py:

import os
import subprocess
import tempfile
import wave

import pyaudio
import requests
import soundfile as sf  # pip install soundfile
from kokoro import KPipeline

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:14b"
WHISPER_BIN = "./whisper.cpp/build/bin/whisper-cli"
WHISPER_MODEL = "./whisper.cpp/models/ggml-medium.bin"
VOICE = "af_heart"  # an American English Kokoro voice, matching lang_code='a'
tts_pipeline = KPipeline(lang_code='a')

def record_audio(duration=5, sample_rate=16000, chunk=1024):
    """Record mono 16-bit audio from the default microphone into a temp WAV file."""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1,
                    rate=sample_rate, input=True,
                    frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(sample_rate / chunk * duration))]
    stream.stop_stream(); stream.close(); p.terminate()
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        path = f.name
    # Closing the wave writer finalizes the WAV header
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(b''.join(frames))
    return path

def transcribe(audio_file):
    """Run whisper-cli on the recording; -nt keeps timestamps out of the text."""
    result = subprocess.run([WHISPER_BIN, '-m', WHISPER_MODEL, '-f', audio_file, '-nt'],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def ask_llm(prompt):
    """Send the transcript to Ollama and return the full (non-streamed) reply."""
    r = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": prompt, "stream": False},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]

def speak(text):
    """Synthesize each chunk with Kokoro (24 kHz audio) and play it with ffplay."""
    for _, _, audio in tts_pipeline(text, voice=VOICE):
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
            path = f.name
        sf.write(path, audio, 24000)
        subprocess.run(['ffplay', '-nodisp', '-autoexit', '-loglevel', 'quiet', path])
        os.remove(path)

# Run it
if __name__ == "__main__":
    print("Listening...")
    audio_file = record_audio(5)
    text = transcribe(audio_file)
    print(f"You: {text}")
    response = ask_llm(text)
    print(f"AI: {response}")
    speak(response)

Run it:

python voice_assistant.py

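The script records for 5 seconds as soon as it prints "Listening...", so start speaking right away. It then transcribes your speech, queries the model, and plays the spoken reply.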

Performance

Whisper medium on CPU: transcribes in 2-4 seconds
Qwen3 14B on RTX 3060: responds in 3-5 seconds
Kokoro TTS on CPU: speaks in real-time (< 1 second latency)
Total round-trip: ~10 seconds on modest hardware

For faster responses, use Whisper tiny or a smaller LLM like Llama 3.1 8B.
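One practical wrinkle: with thinking mode on, a Qwen3 reply may include its reasoning wrapped in <think> tags, and you probably don't want that read aloud. Whether the tags appear depends on your Ollama version and settings, so treat the helper below as a hedged sketch. The name strip_think is illustrative.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> blocks so only the final answer is spoken."""
    return THINK_BLOCK.sub("", text).strip()


print(strip_think("<think>User greets me.</think>\n\nHello there!"))
```

In voice_assistant.py this would wrap the model output, for example `speak(strip_think(response))`.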

Originally published on everylocalai.com

Give Your Local AI Tool-Calling Superpowers with Open WebUI and MCP

EveryLocalAI — Sun, 14 Jun 2026 22:07:37 +0000

Want a ChatGPT-like experience where your AI can search the web, read your files, query databases, and run code? Open WebUI + MCP makes it possible - all running locally on your hardware.

The Model Context Protocol (MCP) is an open standard that lets AI connect to external tools. Open WebUI supports MCP natively, turning your local Ollama setup into a tool-equipped AI assistant.

Prerequisites

GPU: RTX 3060 12GB or better for Qwen3 14B at Q4; the Q8 build pulled below is about 16GB and needs a 24GB card or partial CPU offload
Software: Docker + Docker Compose
Time: ~25 minutes setup

Installation

Create a docker-compose.yml:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - MCP_ENABLE=true
      - ENABLE_TOOLS=true
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
  open-webui:

docker compose up -d

Pull a model with strong tool-calling:

docker exec ollama ollama pull qwen3:14b-q8_0

Open http://localhost:3000 and create your admin account.

Adding MCP Tools

Go to Admin Panel → Settings → External Tools in Open WebUI.

Web Search Tool

npx -y @modelcontextprotocol/server-brave-search

Filesystem Access

npx -y @modelcontextprotocol/server-filesystem /allowed/path

Configure each tool in the Open WebUI admin panel to give your AI real-world capabilities.

Usage

Start a new chat and click the tools icon (wrench) next to the input box. Select which tools the AI can use, then ask:

"Search the web for latest AI news"
"Read my project's README and summarize it"
"Query the sales database for Q3 results"

The AI decides when to call tools and incorporates results into its responses.
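To make "the AI decides when to call tools" concrete, here is a small sketch of the dispatch step a client performs when the model asks for a tool. It assumes the OpenAI-style message shape, where an assistant message carries a tool_calls list. The tool and the message are made up for illustration, and this is not how Open WebUI or MCP is implemented internally.

```python
import json


def dispatch_tool_calls(message, registry):
    """Run each tool call in an assistant message and return tool-role replies."""
    replies = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        name, args = fn["name"], fn.get("arguments", {})
        if isinstance(args, str):  # some APIs send arguments as a JSON string
            args = json.loads(args)
        if name not in registry:
            result = f"unknown tool: {name}"
        else:
            result = registry[name](**args)
        replies.append({"role": "tool", "name": name, "content": str(result)})
    return replies


# Hypothetical tool and a hand-written assistant message for illustration
def read_file_summary(path):
    return f"(contents of {path} would go here)"


message = {"tool_calls": [{"function": {"name": "read_file_summary",
                                         "arguments": {"path": "README.md"}}}]}
print(dispatch_tool_calls(message, {"read_file_summary": read_file_summary}))
```

The replies are appended to the conversation and sent back to the model, which then writes its final answer using the tool output.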

Results

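With Qwen3 14B Q8 on an RTX 4070 Super: tool calls complete in 3-5 seconds. Web search results are returned in 2-3 seconds. Inference and chat history stay on your machine; only the requests a tool makes, such as web search queries, go out to external services.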

Originally published on everylocalai.com

AI Pair Programming in Your Terminal with Aider and Ollama

EveryLocalAI — Sun, 14 Jun 2026 22:05:39 +0000

Want an AI coding assistant that works on YOUR codebase, respects YOUR git history, and doesn't send your code to the cloud? Aider + Ollama gives you exactly that.

Aider is an AI pair programming tool that works directly in your terminal. It sees your files, understands your git repo, and makes real edits to your code. Paired with Ollama running a local model, you get a fully private coding assistant.

Prerequisites

GPU: RTX 3090 or 4090 (24GB VRAM) for Qwen3 Coder 30B at Q4
Software: Python 3.10+, Ollama installed
Time: ~10 minutes setup

Installation

# Install Aider
pip install aider-chat

# Pull a capable coding model
ollama pull qwen3-coder:30b-a3b

Configuration

Set Aider to use your local Ollama model:

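# For bash/zsh
# OLLAMA_CONTEXT_LENGTH is read by the Ollama server, so set it where the server starts
# (stop any Ollama instance that is already running first)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# In another terminal, point Aider at Ollama and run it
export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama_chat/qwen3-coder:30b-a3b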

For persistent config, create .env in your project:

OLLAMA_API_BASE=http://127.0.0.1:11434
AIDER_MODEL=ollama_chat/qwen3-coder:30b-a3b

Usage

# Start Aider in your project directory
cd my-project
aider --model ollama_chat/qwen3-coder:30b-a3b

# Now just describe what you want:
# "Add error handling to the API routes"
# "Refactor the database connection into a singleton"
# "Write unit tests for the user service"

Aider reads your files, applies the edits, and commits them to git with sensible messages. Review each commit's diff, and run /undo to revert the last one if you don't like it.

Results

On an RTX 4090 with Qwen3 Coder 30B at Q4: ~15-20 tok/s, enough for real-time code suggestions.

Qwen2.5 Coder 14B runs faster (~35 tok/s) and fits on a 12GB GPU, great for smaller projects.

Why Local?

Privacy - your proprietary code never leaves your machine
No API costs - unlimited suggestions for $0/month
Works offline - code on a plane, in a cafe, anywhere
No rate limits - use it all day without throttling

Originally published on everylocalai.com

Build a Unified AI Gateway with LiteLLM and Ollama

EveryLocalAI — Sun, 14 Jun 2026 21:54:58 +0000

Unify all your AI models - local and cloud - behind a single OpenAI-compatible API with LiteLLM and Ollama.

LiteLLM is a proxy server that exposes 100+ LLM providers through one endpoint. Connect it to Ollama for local inference, and you get load balancing, cost tracking, rate limits, and automatic fallback routing.

What You Need

Python 3.9+
Ollama installed and running
About 20 minutes

Setup

1. Install LiteLLM

pip install 'litellm[proxy]'

2. Create config.yaml

model_list:
  - model_name: qwen3-local
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      rpm: 30
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: sk-your-key

3. Start the Proxy

litellm --config config.yaml --port 4000

4. Use It

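from openai import OpenAI

# Point the OpenAI client at the LiteLLM proxy instead of api.openai.com
client = OpenAI(api_key="sk-your-key",
  base_url="http://localhost:4000/v1")
response = client.chat.completions.create(
  model="qwen3-local",
  messages=[{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)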

Key Features

Smart fallback - if the local model fails, auto-route to cloud (needs a fallbacks entry in the config; the minimal config above does not set one)
Load balancing - distribute across multiple GPU instances
Cost tracking - per-model spend dashboard
Rate limiting - control requests per user/key
One API - use any tool that supports OpenAI format
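The fallback idea is easy to see in miniature. The sketch below does it on the client side: try each model name in order and return the first reply that succeeds. It is only an illustration of the concept, not LiteLLM's router. The fake_call function stands in for a real client call so the example runs without a server.

```python
def complete_with_fallback(call_model, models, prompt):
    """Try each model name in order; return (model, reply) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Stand-in for a real client call, used here so the sketch runs without a server
def fake_call(model, prompt):
    if model == "qwen3-local":
        raise ConnectionError("local model unavailable")
    return f"[{model}] reply to: {prompt}"


print(complete_with_fallback(fake_call, ["qwen3-local", "gpt-4o-mini"], "Hello!"))
```

Doing this inside the proxy instead of in every client is the main reason to put a gateway in front of your models.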

Cost vs Cloud

	LiteLLM + Ollama	Direct Cloud APIs
Gateway	Free, self-hosted	Free
Local inference	$0	N/A
Model switching	One endpoint	Multiple SDKs
Failover	Automatic	Manual

Full guide with advanced config examples: https://everylocalai.com/stack/litellm-ollama-gateway

Generate Professional AI Images Locally with ComfyUI and FLUX

EveryLocalAI — Sun, 14 Jun 2026 21:54:00 +0000

Professional-grade image generation that runs entirely on your own GPU. ComfyUI + FLUX.1 Dev gives you Midjourney-quality output with full creative control and zero data leaving your machine.

What You Need

A GPU with 12GB+ VRAM (24GB recommended)
Python 3.10+ or the ComfyUI desktop app
About 20 minutes

Setup

Option A: Desktop App (Easiest)

Download from comfy.org, install, and use the built-in model manager to download FLUX.1 Dev.

Option B: Manual Install

git clone https://github.com/Comfy-Org/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py

Open http://localhost:8188.

Basic FLUX Workflow

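Add a Load Checkpoint node - load an all-in-one FLUX checkpoint such as flux1-dev-fp8.safetensors (the original flux1-dev.safetensors ships without text encoders or VAE and is loaded with Load Diffusion Model plus separate loaders)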
Add CLIP Text Encoder - enter your prompt
Add KSampler - connect model, CLIP, and empty latent
Add VAE Decode - decode to image
Add Save Image - save result
Click Queue Prompt

Prompt example: "a photorealistic cat sitting on a vintage leather chair, warm lighting, depth of field"

Advanced Features

LoRA - add a LoRA Loader node for style control
ControlNet - pose/edge guidance with extra nodes
Image-to-Image - feed an existing image through VAE Encode
API mode - integrate with n8n or custom apps
Batch generation - queue multiple prompts at once
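For API mode, ComfyUI accepts a workflow as JSON over HTTP. A minimal sketch, assuming the server runs on the default port 8188 and that `workflow` is a dict loaded from a workflow you exported in ComfyUI's API format; the function names are illustrative.

```python
import json
import urllib.request


def build_queue_request(workflow, base_url="http://localhost:8188"):
    """Build a POST to ComfyUI's /prompt route that queues one workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(workflow, base_url="http://localhost:8188"):
    """Send the workflow to a running ComfyUI server and return its JSON reply."""
    with urllib.request.urlopen(build_queue_request(workflow, base_url), timeout=30) as resp:
        return json.load(resp)
```

Calling queue_workflow in a loop with different prompt text in the workflow dict is one simple way to do batch generation from a script.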

Cost vs Cloud

	Local	Midjourney
Monthly	$0	$10-60
Per image	$0	$0.04-0.12
Privacy	Stays on your GPU	Sent to cloud
Control	Full node-level	Limited

Full guide with troubleshooting and hardware tips: https://everylocalai.com/stack/comfyui-flux-local-image

Chat With Your Documents Locally Using AnythingLLM and Ollama

EveryLocalAI — Sun, 14 Jun 2026 21:53:08 +0000

A private RAG system where you drop in PDFs, Word docs, and code files and ask questions. Runs on any machine, no cloud dependency.

What You Need

Any computer (GPU optional - CPU works fine)
Ollama installed
About 10 minutes

Architecture

Component	Role
AnythingLLM	Desktop/server app with RAG, agents, built-in vector DB
Ollama	Serves local LLM for chat + embeddings
Qwen3 14B	Default model for answering questions

Setup

1. Install Ollama

# Install from ollama.com, or run with Docker:
docker run -d --gpus all -p 11434:11434 --name ollama \
  -v ollama:/root/.ollama ollama/ollama

# Pull a model (with the Docker setup, prefix these with: docker exec ollama):
ollama pull qwen3:14b
# Pull an embedder:
ollama pull nomic-embed-text

2. Install AnythingLLM

Desktop app (easiest): Download from anythingllm.com

Docker:

docker run -d -p 3001:3001 --name anythingllm \
  --add-host host.docker.internal:host-gateway \
  -v anythingllm:/app/server/storage \
  mintplexlabs/anythingllm

3. Connect & Use

Open AnythingLLM (http://localhost:3001 or desktop app)
Settings > LLM Provider > Select Ollama, model qwen3:14b (from the AnythingLLM Docker container, use http://host.docker.internal:11434 as the Ollama URL)
Settings > Embedder > Select Ollama, model nomic-embed-text
Create a workspace, drop in documents, start asking questions
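Under the hood, document Q&A of this kind usually comes down to comparing embedding vectors. The toy sketch below shows the retrieval step with hand-written 3-dimensional vectors standing in for real embeddings. It illustrates the general technique only and is not AnythingLLM's actual code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts whose embeddings are closest to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]


# Toy 3-dimensional vectors standing in for real embeddings
chunks = [
    {"text": "Invoices are due in 30 days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "The office closes at 6pm.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Late payments incur a 2% fee.", "vector": [0.8, 0.3, 0.1]},
]
print(top_k([1.0, 0.2, 0.0], chunks))
```

The retrieved chunks are then pasted into the prompt so the chat model answers from your documents instead of from memory.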

What You Can Do

Chat with PDFs, Word docs, code files, web pages
Create isolated workspaces per project
Use built-in agent skills (web search, summarization)
Works on CPU-only machines like a mini PC

Cost vs Cloud

	Local	ChatGPT + GPTs
Monthly	$0	$20-200
Hardware	$0-300	$0
Privacy	Stays on your machine	Sent to cloud
Documents	Unlimited	Token-limited

Full guide with troubleshooting: https://everylocalai.com/stack/anythingllm-ollama-rag

Build Visual AI Agent Pipelines with Langflow and Ollama

EveryLocalAI — Sun, 14 Jun 2026 21:28:30 +0000

Prototype and deploy multi-agent and RAG applications with a visual drag-and-drop interface - all running locally with your own models.

Langflow is an open-source visual framework for building AI applications. Connect it to Ollama for local inference, and you get a powerful environment for designing agent architectures, RAG pipelines, and chatbot workflows without writing code.

What You Need

A GPU with 12GB+ VRAM (or CPU-only for prototyping)
Docker or Python 3.10+
About 15 minutes

Architecture

Component	Role
Langflow	Visual drag-and-drop flow builder and API server
Ollama	Serves local LLM models
Qwen3 14B	Default model - fits 12GB at Q4

Setup

Option A: Docker (Recommended)

Save this as docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  langflow:
    image: langflowai/langflow:latest
    container_name: langflow
    depends_on:
      - ollama
    ports:
      - "7860:7860"
    volumes:
      - langflow_data:/app/langflow
    environment:
      - LANGFLOW_AUTO_LOGIN=true
    restart: unless-stopped

volumes:
  ollama:
  langflow_data:

Launch it:

docker compose up -d
docker exec ollama ollama pull qwen3:14b

Open http://localhost:7860 to access Langflow.

Option B: pip Install

pip install langflow
langflow run
# In another terminal:
ollama pull qwen3:14b

Open http://localhost:7860.

Connect Langflow to Ollama

In the Langflow canvas, add:

Ollama Chat Model component - Base URL: http://ollama:11434 (Docker) or http://localhost:11434 (pip)
Select model: qwen3:14b
Connect to a Prompt node and Chat Output for a basic chatbot

What You Can Build

RAG Chatbot

Drag in: File > Ollama Embeddings > Vector Store (Chroma) > Ollama Chat Model > Chat Output. Upload a PDF, ask questions - answers come from your documents.

Multi-Agent Research System

Add an Agent node with a Web Search Tool + Ollama, add a second Agent for summarization. One agent gathers info, the other condenses it.

Document Processing Pipeline

Combine File Loader > Splitter > Ollama Embeddings > Vector Store. Add Ollama Chat Model with custom prompts for Q&A over your documents.
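Once a flow works in the canvas, you can call it from your own code through Langflow's API server. A hedged sketch, assuming the /api/v1/run/<flow_id> route and a chat-type flow. The flow ID is a placeholder you copy from your own Langflow instance, and the API key is only needed if your version or settings require one.

```python
import json
import urllib.request


def build_run_request(flow_id, message, base_url="http://localhost:7860", api_key=None):
    """Build a POST to Langflow's /api/v1/run/<flow_id> route for a chat flow."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # some Langflow versions or settings require an API key
        headers["x-api-key"] = api_key
    body = {"input_value": message, "input_type": "chat", "output_type": "chat"}
    return urllib.request.Request(
        f"{base_url}/api/v1/run/{flow_id}",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def run_flow(flow_id, message, **kwargs):
    """Send the message to a running Langflow server and return its JSON reply."""
    request = build_run_request(flow_id, message, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

For example, `run_flow("your-flow-id", "What does the PDF say about refunds?")` would send one question to the RAG chatbot flow.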

Cost vs Cloud

	Local Langflow + Ollama	Langflow Cloud + OpenAI
Monthly	$0	$50-200+
Hardware	~$300-600 once	$0
Data privacy	Stays on your machine	Sent to cloud
AI calls	Unlimited, free	Per-token billing

Full guide with detailed troubleshooting and alternatives: https://everylocalai.com/stack/langflow-ollama-rag-agent

Build a Local AI Workflow Automation with n8n and Ollama

EveryLocalAI — Sun, 14 Jun 2026 21:27:03 +0000

Automate tasks with AI-powered workflows that run entirely on your own hardware. n8n + Ollama = self-hosted Zapier with local LLM inference. No monthly fees, no data leaving your machine.

What You Need

A GPU with 12GB+ VRAM (for local AI) or any machine with Docker (n8n works CPU-only too)
Docker + Docker Compose
About 15 minutes

Architecture

Component	Role
n8n	Visual workflow engine with 500+ integrations and AI agent nodes
Ollama	Serves local LLM via OpenAI-compatible API
Qwen3 14B	Default model - strong reasoning, fits 12GB at Q4

Setup

Save this as docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  n8n:
    image: docker.n8n.io/n8nio/n8n
    container_name: n8n
    depends_on:
      - ollama
    environment:
      - N8N_RUNNERS_ENABLED=true
      - GENERIC_TIMEZONE=America/New_York
      - TZ=America/New_York
    volumes:
      - n8n_data:/home/node/.n8n
    ports:
      - "5678:5678"
    restart: unless-stopped

volumes:
  ollama:
  n8n_data:

Start it:

docker compose up -d
docker exec ollama ollama pull qwen3:14b

Open http://localhost:5678 to access n8n.

Connect n8n to Ollama

In n8n, add an Ollama Chat Model node and set:

Base URL: http://ollama:11434
Model: qwen3:14b

Use it with n8n's AI Agent node for agentic workflows.

Example Workflows

Email Summarizer

Trigger: New email → AI step: "Summarize this email in 2 sentences" → Output: Slack message

Content Generator

Trigger: Cron schedule → AI step: "Write a newsletter about [topic]" → Output: Email to subscribers

Smart Classifier

Trigger: Webhook (support tickets) → AI step: "Classify as billing/technical/feature" → Output: Route to different teams
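The classifier workflow depends on the model answering with exactly one label, and local models don't always comply. Here is a small sketch of the prompt plus a normalization step you could mirror in an n8n Code node. The function names are illustrative, and the "technical" default is an arbitrary assumption.

```python
LABELS = ("billing", "technical", "feature")


def build_prompt(ticket_text, labels=LABELS):
    """Ask the model to answer with exactly one label."""
    return (
        f"Classify this support ticket as one of: {', '.join(labels)}.\n"
        "Reply with the label only.\n\n"
        f"Ticket: {ticket_text}"
    )


def normalize_label(raw_reply, labels=LABELS, default="technical"):
    """Map a free-form model reply to a known label, falling back to a default."""
    reply = raw_reply.strip().lower()
    for label in labels:
        if label in reply:
            return label
    return default


print(normalize_label("  Billing.\n"))
print(normalize_label("I am not sure"))
```

Routing on the normalized label instead of the raw reply keeps the workflow from breaking when the model adds punctuation or extra words.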

Cost vs Cloud

	Local n8n + Ollama	Zapier + ChatGPT
Monthly	$0	$20-100+
Hardware	~$300 once	$0
Data safety	Stays on your LAN	Sent to cloud
AI calls	Unlimited, free	Token-limited
Workflows	Unlimited	Task-limited

At $50-100/month of cloud spend, the hardware pays for itself in 3-6 months.

Full guide with detailed troubleshooting and alternatives: https://everylocalai.com/stack/n8n-ollama-ai-automation