ZyVOP

Posted on Jun 1 • Edited on Jun 8 • Originally published at zyvop.com

Your Code Doesn't Have to Leave Your Machine. Here's How to Run a Full AI Coding Setup Locally.

#localai #ollama #selfhosting #opensource

Three years into the Copilot era, most developers don't think twice about what happens when they hit Tab. A code snippet leaves their machine, travels to a cloud server, gets processed by a model running on hardware they don't control, and comes back as a suggestion. It takes milliseconds. It's invisible.

It's also a choice — and in 2026, for the first time, it's a genuinely contestable one.

Ollama, the open-source framework for running AI models locally, hit 52 million monthly downloads in Q1 2026 and 2.5 billion total model downloads since launch. The number of open-source models available for local deployment grew over 300% between 2024 and 2026. A developer with a modern GPU can now run a coding model that scores within 5–10 benchmark points of Claude Sonnet — on their own machine, with zero API costs, zero data leaving the building, and zero dependency on any company's uptime.

This is not theoretical. This is what a significant and growing portion of the developer community is actually doing.

This post is the honest guide to it: the real reasons to do it, the real costs, what you give up, and exactly how to set it up.

Why developers are making this switch in 2026

It's not one reason. It's usually a combination of three.

Reason 1: Your source code is proprietary

Every time you use Cursor, GitHub Copilot, or Claude Code on proprietary work, code fragments leave your machine. Most providers have enterprise agreements that say they won't train on your code — but "won't train on it" is different from "it never touches our servers." The code travels. It gets logged, at least temporarily. It exists outside your environment.

For most consumer SaaS applications, this is fine. For healthcare companies with HIPAA obligations, financial firms with proprietary trading algorithms, defence contractors with classification requirements, and anyone with IP that represents genuine competitive advantage — the risk profile is different.

Organisations across healthcare, legal, and financial services are deploying local AI on internal infrastructure precisely because the combination of privacy, cost control, and model flexibility outweighs the convenience of cloud APIs.

Reason 2: The economics stop making sense at scale

Cursor Ultra is $200/month. Claude Code on the Max plan is $100/month. Copilot Business is $19/user/month. For a two-person team, the cloud costs are manageable. For a 50-person engineering team? The AI tooling line item is now $10,000–$120,000 per year, depending on tier.

A 24 GB GPU pays for itself in under a year for a two-person team compared to a Cursor Ultra subscription. For a team running heavy agent workloads, the ROI math flips in favour of local infrastructure within months.

The models are improving fast enough that the quality gap between local and cloud is narrowing every quarter. The cost gap between local and cloud is not narrowing. For price-sensitive teams doing the right kind of work, this arithmetic increasingly points toward self-hosting.

Reason 3: Vendor lock-in and reliability

Cursor changed its pricing model in June 2025 and caught users mid-month with depleted credits. Copilot has had service outages. Windsurf went through an acquisition that left users uncertain about continuity for several months.

A local model doesn't have pricing changes. It doesn't have service outages. It doesn't get acquired. The model you downloaded last Tuesday will work exactly the same way next Tuesday, regardless of what happens in AI company boardrooms.

The honest quality picture

Here's what you give up, stated plainly.

Benchmarking Qwen2.5-Coder 32B at Q5 on a single RTX 4090 against Claude Sonnet on a 500-task internal suite: you give up roughly 5 to 10 benchmark points of performance and gain zero marginal cost, near-zero latency, and absolute privacy.

For day-to-day coding — refactors, boilerplate, tests, docstrings — the difference is often not noticeable in practice. The benchmark gap doesn't translate linearly to real-world feel.

Where cloud models still clearly win:

Ultra-long context. Claude's 1M token window for entire monorepos is not replicable on consumer hardware today.
Cutting-edge reasoning. The very latest frontier reasoning models are weeks ahead of what's available open-source.
Newly released frameworks. A model trained six months ago doesn't know about a framework that shipped last month. Cloud models get updated training data more frequently.
Complex agentic tasks. Multi-step autonomous tasks that require sophisticated planning still perform better with frontier models.

For 80% of what developers actually do day-to-day, the difference between a local 32B model and Claude Sonnet is smaller than the difference between a well-written prompt and a vague one.

The hardware reality

Let's be direct about what you need, because this is where most guides get optimistic.

For acceptable local coding assistance: A GPU with at least 8GB VRAM runs models in the 7B–13B parameter range. Qwen2.5-Coder 7B performs respectably on single-file generation tasks. Not great for complex multi-file reasoning.

For genuinely good local coding assistance:24GB VRAM (RTX 4090, RTX 4500 Ada, or M2/M3/M4 Pro Mac with 36GB unified memory) runs 32B models at reasonable quality and speed. This is the practical sweet spot in 2026.

For team deployment: A dedicated inference server with an A100 or H100 can serve a team of 10–20 developers simultaneously. The per-seat economics become compelling at this scale.

Apple Silicon note: M2 Pro (18GB), M3 Pro (18–36GB), and M4 Pro chips use unified memory shared between CPU and GPU. This means a MacBook Pro M4 Pro with 48GB of RAM can run 32B models locally with surprisingly good performance — no discrete GPU required. This is the reason many privacy-conscious developers chose Apple hardware specifically for local AI work in 2025–2026.

# Check what you're working with:
# GPU VRAM guide:
# 8GB   → 7B models (Q4 or Q5 quantisation)
# 16GB  → 13B–14B models, or 7B at higher quality
# 24GB  → 32B models (the sweet spot)
# 48GB+ → 70B models, or 32B with higher quality quantisation

The setup: Ollama + Continue.dev in one afternoon

This is the stack the self-hosted coding community has converged on in 2026. It's not the only option, but it's the most production-ready combination with the least configuration overhead.

Ollama runs the models locally. Think of it as the equivalent of a local API server — it downloads models, manages them, and exposes an OpenAI-compatible API endpoint your editor can talk to.

Continue.dev is a VS Code and JetBrains extension that connects your editor to any local or cloud AI, including Ollama. It gives you Copilot-style inline completions, a chat panel, and multi-file editing — all pointing at your local model.

Step 1: Install Ollama

# macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Windows:
# Download installer from ollama.com

# Verify it's running:
ollama --version

Step 2: Pull a coding model

# The 2026 recommended starting point for coding:
ollama pull qwen2.5-coder:32b

# If you have <16GB VRAM, start here:
ollama pull qwen2.5-coder:7b

# Alternative: DeepSeek Coder V2 (strong on Python/JS)
ollama pull deepseek-coder-v2:16b

# Check what you have:
ollama list

Qwen2.5-Coder from Alibaba's research team has been the community benchmark leader for local coding models through Q1–Q2 2026. DeepSeek Coder V2 is strong for Python and JavaScript specifically. Both are genuinely good.

Step 3: Test the model directly

# Quick sanity check before wiring into your editor:
ollama run qwen2.5-coder:32b

# Try:
>>> Write a TypeScript function that validates an email address
>>> Explain the difference between useCallback and useMemo in React
>>> What's wrong with this code: [paste something broken]

# Ctrl+D to exit

If the responses feel reasonable, your model is working. Move on.

Step 4: Install Continue.dev in VS Code

# VS Code:
# Extensions panel → search "Continue" → install "Continue - Codestral, Claude, and more"

# Or via command line:
code --install-extension continue.continue

Step 5: Configure Continue to use Ollama

Open your Continue config file (~/.continue/config.json) and replace the default:

{
  "models": [
    {
      "title": "Qwen2.5-Coder 32B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B (Fast Completions)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  },
  "contextProviders": [
    { "name": "code" },
    { "name": "docs" },
    { "name": "diff" },
    { "name": "terminal" },
    { "name": "problems" },
    { "name": "folder" },
    { "name": "codebase" }
  ]
}

The split model setup is intentional: the 32B model for chat and multi-file reasoning (slower but smarter), the 7B model for inline Tab completions (fast enough to not break your flow). This mirrors how cloud-based editors often use a smaller model for completions and a larger one for chat.

Step 6: Verify the integration

Open any code file in VS Code. Start typing. Ghost text should appear from the local model. Open the Continue panel (Cmd/Ctrl+L) and ask it something about your code. If both work, you have a fully local AI coding setup running entirely on your machine.

The optional layer: Open WebUI for the ChatGPT-style interface

If you want a browser-based chat interface alongside your editor — useful for longer conversations, document analysis, and tasks where the editor panel feels cramped:

# Requires Docker:
docker run -d \
  -p 3000:80 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# Open http://localhost:3000
# Create a local account (nothing is sent anywhere)
# Your Ollama models appear automatically

Open WebUI gives you conversation history, model switching, document upload, and multi-user support. It's a legitimate, stable option for developers and small teams who want privacy, cost control, and the educational value of understanding how LLM inference works.

The traps that catch people

Thermal throttling. A GPU running inference for 8 hours a day generates substantial heat. Consumer GPUs (RTX 4090) are not designed for sustained server-style workloads. If you're running this professionally, ensure adequate cooling. Some developers run dedicated inference servers with workstation-grade GPUs (RTX 4500 Ada) for exactly this reason.

Slow first-token latency. Local models, even on good hardware, often have slower time-to-first-token than cloud models with optimised inference infrastructure. For chat, this is a minor annoyance. For Tab completion where every millisecond matters for flow state, the 7B completion model matters more than you'd expect — keep it lean and fast.

Outdated models. The open-source model landscape moves fast. The model you pulled six months ago may have been superseded by something significantly better. Make a habit of checking the Ollama model library monthly and pulling updates.

Context window limits. Most practical local models cap at 32k–128k context. Claude's 1M token window is not replicable locally on consumer hardware. For tasks requiring full monorepo context, local models are genuinely limited.

Running out of VRAM mid-task. If your model is close to your VRAM limit and another GPU application opens (a game, video rendering, another model), your inference will slow dramatically or crash. Profile your VRAM headroom before running production workloads.

Who this makes sense for (and who it doesn't)

Makes sense:

Developers working on genuinely proprietary codebases where data residency matters
Teams at the scale where cloud costs are material ($5k+ per month in AI tooling)
Developers in jurisdictions or industries with strict data governance requirements
Developers who work offline frequently (travel, spotty connectivity)
Anyone who wants to understand the AI stack they depend on by running it themselves

Doesn't make sense:

Developers doing cutting-edge work on new frameworks where model freshness matters
Teams running complex multi-file agentic tasks where frontier reasoning quality is the bottleneck
Developers who don't have or won't buy adequate hardware — an underpowered local model is worse than a cloud model, not better
Teams where setup and maintenance overhead is a real cost that outweighs privacy and cost benefits

The honest verdict: for the right use case and the right hardware, local AI coding in 2026 is genuinely competitive with cloud-based tools. For the wrong use case or insufficient hardware, it's a self-imposed limitation that will frustrate you. Know which category you're in before you invest the setup time.

The broader point

The fact that this is a legitimate option in 2026 — that a developer can run models competitive with GPT-4-class performance on their own machine — is itself remarkable. Two years ago, "local LLM" meant something slow and limited you might tinker with on the weekend. Today it means Qwen2.5-Coder 32B running at 30–50 tokens per second on an RTX 4090, available offline, at zero per-query cost, with no data leaving your machine.

The cloud coding tools have real advantages that matter for real use cases. They're also not the only option anymore. That's a genuine shift in developer autonomy — and it's one that more developers should know exists.

Originally published on ZyVOP

💡 For more articles like this, subscribe to the ZyVOP newsletter!

DEV Community