DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at aifoss.dev

The Open-Source AI Stack in 2026: What Works Together

This article was originally published on aifoss.dev

The tooling exists. Ollama, Open WebUI, AnythingLLM, Continue.dev, Aider, Flowise — five years ago this stack didn't exist at all. The problem in 2026 isn't finding open-source AI tools; it's figuring out which ones compose into a coherent workflow and which combinations quietly waste your weekend.

The short answer: almost any combination works, because all modern local AI tools speak the OpenAI HTTP API. The longer answer: there are real failure modes — port conflicts, silent context truncation, embedding model mismatches — that the tutorials skip. This is the guide that covers them.

Versions verified: Ollama v0.24.0 (May 14, 2026), Open WebUI v0.9.5 (May 2026), vLLM v0.21.0 (May 2026), Continue.dev v1.3.34 (early 2026), Aider v0.86.0.


The API layer that makes it all compose

Every tool in this stack exposes or consumes an OpenAI-compatible HTTP API. This one design decision is the reason cross-tool compatibility is almost automatic.

Ollama's REST API at localhost:11434 accepts the same request format as OpenAI's /v1/chat/completions. Open WebUI detects a local Ollama instance without configuration. AnythingLLM treats Ollama as a selectable LLM provider. Continue.dev has a first-party "provider": "ollama" setting in its config. Aider accepts --api-base http://localhost:11434 to redirect to any OpenAI-compatible server. Flowise has an Ollama LLM node baked in.

The practical implication: swapping the LLM runner underneath a UI or code tool is a URL change, not an integration project. Replace localhost:11434 with localhost:8000 and you're pointed at vLLM instead. Every tool described below works against any compliant backend.

Where it gets complicated is not the protocol — it's the port assignments, context window defaults, and embedding pipeline isolation. Those are the actual failure modes.


Layer 1 — The LLM Runner

Ollama v0.24.0 (MIT license, github.com/ollama/ollama) is the right starting point for single-developer and home-lab setups. One installer, model downloads by name, background daemon, hot-swap between models. The May 2026 release reworked the MLX sampler for Apple Silicon and added Codex App support. It stores models as GGUF and loads them into GPU memory on first request.

# Pull and run a model
ollama pull qwen3:14b
ollama run qwen3:14b

# Check running models and VRAM allocation
ollama ps

# Increase context window (default is often 2048)
OLLAMA_NUM_CTX=16384 ollama serve
Enter fullscreen mode Exit fullscreen mode

vLLM v0.21.0 (Apache 2.0, github.com/vllm-project/vllm) is the answer when Ollama's single-user throughput ceiling isn't enough. It runs on port 8000, exposes the identical OpenAI API, and every other tool in this stack points at it with a URL change. The tradeoffs: Linux-only, requires CUDA, no Apple Silicon support, and takes 30–90 seconds to load a model before serving.

# Serve a Qwen3 14B model with vLLM
vllm serve Qwen/Qwen3-14B \
  --max-model-len 32768 \
  --port 8000
Enter fullscreen mode Exit fullscreen mode

The decision rule is simple: Ollama for development, evaluation, and solo use; vLLM when you're serving more than one person, running a shared team endpoint, or benchmarking batch throughput. For a detailed breakdown of when the switch is worth the ops cost, see Ollama vs vLLM 2026.


Layer 2 — Chat UIs

Open WebUI v0.9.5 (MIT, github.com/open-webui/open-webui) is built first for Ollama. Docker installation detects a local Ollama instance automatically — no manual endpoint configuration. It runs on port 3000 and covers daily chat, model management, basic RAG via document upload, and in v0.9.5, a native desktop app for Mac, Windows, and Linux that removes the Docker requirement entirely for personal setups.

The v0.9.5 release added redirect-based SSRF protection and configurable iframe content security policy, which matter if you're exposing the interface on a local network to other users.

AnythingLLM (MIT, github.com/Mintplex-Labs/anything-llm) is a different tool with a similar surface. The distinction is architectural: AnythingLLM was designed around "workspaces" where document collections, embedding pipelines, and chat history are managed independently. It runs on port 3001 by default, which means Open WebUI and AnythingLLM can run simultaneously against the same Ollama instance without port conflict.

Both tools support the same Ollama backend. The routing decision: Open WebUI for general chat, model exploration, and multi-modal tasks; AnythingLLM when your primary workflow is interrogating documents. For deeper reviews, see Open WebUI review and AnythingLLM review.


Layer 3 — RAG

Both chat UIs include built-in RAG, but they handle embeddings and persistence differently. Knowing which to use before you ingest a large document corpus saves a painful rebuild later.

Open WebUI's RAG stores embeddings in SQLite-vec (since v0.9.x). Upload a document via the chat interface, and it becomes queryable in that conversation. The setup time is near zero. Configuration is limited — you pick an embedding model in admin settings, and that's about it. Good for ad-hoc document queries; not designed for managing multiple independent knowledge bases.

AnythingLLM's RAG uses Chroma by default, supports multiple embedding backends (including nomic-embed-text via Ollama), and lets you create isolated workspaces with separate document collections. You can inspect embedding status per document, rescan sources after updates, and configure retrieval parameters per workspace. It's more to configure but significantly more reliable for ongoing document-heavy workflows.

Flowise (Apache 2.0) handles the cases neither built-in solution covers: multi-step retrieval, reranking, conditional routing based on document metadata, or custom pre-processing pipelines. It talks to Ollama through a standard LLM node and has a visual interface for building chains. For setup, see Flowise local setup guide.

One rule that applies to all three options: embedding vectors are model-specific. Documents embedded with nomic-embed-text cannot be queried with mxbai-embed-large or Open WebUI's default embedding model. If you switch RAG tools or embedding models mid-project, you re-embed everything from scratch. Choose your embedding model before ingesting production data.


Layer 4 — Code Tooling

Both major code tools in this stack are OpenAI-API consumers. Neither requires Ollama specifically — they accept any compatible endpoint.

Continue.dev v1.3.34 (Apache 2.0, github.com/continuedev/continue, 2.4M VS Code installs as of early 2026) is the IDE-integrated option. Configuration lives in a single JSON file:

{
  "models": [
    {
      "title": "Qwen3 14B — chat",
      "provider": "ollama",
      "model": "qwen3:14b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Deepseek Coder V2 — autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}
Enter fullscreen mode Exit fullscreen mode

The two-model setup is standard practice on Tier 2 hardware: a 14B model for chat and edits, a faster smaller model for inline autocomplete that needs to respond in under a second. Both pull from the same Ollama daemon, so no additional ports or processes.

To point Continue.dev at vLLM instead, change "provider": "openai" and add "apiBase": "http://localhost:8000". The model names change to match whatever vLLM is serving, but everything else is identical.

Aider v0.86.0 (Apache 2.0, github.com/Aider-AI/aider) is the terminal alternative. It maps your repository structure, generates diffs, and

Top comments (0)