DEV Community

Alan West

How to Set Up a Local AI Coding Assistant That Actually Works

The AI coding assistant landscape keeps shifting. Names change, products get reshuffled, pricing tiers evolve. If you're tired of depending on cloud-hosted AI tools that might change their terms tomorrow, there's a better path: running your own local AI code completion stack.

I finally set this up properly last month after one too many "service unavailable" errors during a deadline. Here's what actually works.

The Problem: Cloud Dependency for Code Completion

We've all been there. You're in the zone, cranking through a refactor, and your AI suggestions just... stop. Maybe the service is down. Maybe your company's firewall started blocking it. Maybe the pricing changed and your free tier ran out.

Worse, there's the privacy angle. Not every codebase should be sent to a third-party server. If you're working on proprietary code, in a regulated industry, or just privacy-conscious, cloud-based AI assistants are a non-starter.

The good news: open-source local alternatives have gotten genuinely good in the last year.

The Stack: Ollama + Continue

After testing several combinations, my go-to setup is Ollama for model serving and Continue as the editor extension. Both are open source, both are actively maintained, and they work together cleanly.

Step 1: Install Ollama

Ollama handles downloading and running LLMs locally. It's dead simple.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify it's running
ollama --version

# Pull a code-focused model
# DeepSeek Coder V2 Lite is a solid balance of speed and quality
ollama pull deepseek-coder-v2:16b

# For tab completion, you want something faster
ollama pull qwen2.5-coder:7b

A few notes on model selection. Bigger models give better suggestions but slower responses. For real-time tab completion, you want something that responds in under 500ms. The 7B parameter models hit that sweet spot on most modern hardware. For chat-style Q&A about your code, you can afford a larger model since latency matters less.
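If you want to sanity-check whether a model clears the latency bar on your hardware, Ollama's non-streaming `/api/generate` responses include `eval_count` (tokens generated) and `eval_duration` (nanoseconds) fields. A quick sketch for turning those two numbers into a tokens-per-second figure — the helper name is mine:

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Converts Ollama's eval stats (from /api/generate) into tokens/sec.
tokens_per_sec() {
  awk -v c="$1" -v d="$2" 'BEGIN { printf "%.1f\n", c / (d / 1e9) }'
}

# Example: 120 tokens generated in 0.6s of eval time
tokens_per_sec 120 600000000   # prints 200.0
```

Anything comfortably above your typing speed at a few dozen tokens per completion is plenty for ghost text.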

Step 2: Install Continue in Your Editor

Continue is an open-source extension available for VS Code and JetBrains IDEs. Install it from your editor's marketplace — search for "Continue" by Continue.dev.

Once installed, you need to configure it to point at your local Ollama instance. Open the Continue config file:

// ~/.continue/config.json
{
  "models": [
    {
      "title": "DeepSeek Coder Local",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  },
  "tabAutocompleteOptions": {
    "debounceDelay": 500,
    "maxPromptTokens": 2048
  }
}

The key insight here: use two different models. A smaller, faster one for tab completion (where latency kills the experience) and a larger, smarter one for the chat panel where you ask questions about your code.

Step 3: Verify Everything Works

Open a code file and start typing. You should see ghost text suggestions appearing after a brief pause. Open the Continue chat panel (usually Cmd+L or Ctrl+L) and ask it something about your current file.

If suggestions aren't appearing, check that Ollama is running:

# Check if the Ollama server is responding
curl http://localhost:11434/api/tags

# You should see your downloaded models listed
# If not, start the server manually
ollama serve

Tuning for Better Results

Out of the box, this setup is decent. But a few tweaks make it significantly better.

Hardware Considerations

This is where I have to be honest: local AI coding assistants need decent hardware. Here's the rough breakdown:

  • 8GB RAM: You can run 7B models, but it'll be tight. Close your browser tabs.
  • 16GB RAM: Comfortable for 7B models, workable for 13-16B.
  • 32GB+ RAM: Run whatever you want without thinking about it.
  • GPU: Not strictly required (Ollama falls back to CPU when no GPU is available), but an NVIDIA GPU or Apple Silicon dramatically improves response times.

On my M2 MacBook Pro with 16GB, the 7B model responds in about 200-300ms for tab completions. Totally usable. The 16B model takes around 800ms for chat responses, which is fine for that use case.
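As a back-of-envelope rule (mine, not an official figure): a 4-bit quantized model needs roughly half a byte per parameter, plus another gigabyte or two of overhead for the context and KV cache. A tiny helper makes the arithmetic explicit:

```shell
# Rough RAM estimate for a 4-bit quantized model.
# Rule of thumb: ~0.5 bytes per parameter + ~1.5GB overhead for
# context/KV cache. An approximation, not an exact figure.
# est_ram_gb PARAMS_IN_BILLIONS
est_ram_gb() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b * 0.5 + 1.5 }'
}

est_ram_gb 7    # ~5.0 GB: fits on an 8GB machine, barely
est_ram_gb 16   # ~9.5 GB: wants 16GB of RAM
```

This is why the 7B models are the sweet spot for 16GB machines: the model, your editor, and a browser all fit at once.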

Context Window Tricks

One thing that tripped me up: local models have smaller context windows than the big cloud models. If you're working in a huge file, the model might not see enough context to give relevant suggestions.

Work around this by keeping your functions small (you should be doing this anyway) and using the chat panel with @file references when you need the model to understand multiple files:

@src/auth/middleware.ts @src/auth/types.ts

Why is the session token not being validated 
in the middleware?
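If you have RAM to spare, you can also raise the model's context window itself. Ollama exposes this as the `num_ctx` parameter in a Modelfile; a sketch (the `qwen2.5-coder-8k` name is mine, and 8192 is an assumption about what your hardware can handle, not a recommendation — larger contexts use noticeably more memory):

```shell
# Create a variant of the autocomplete model with a larger context
# window. num_ctx is Ollama's context-length parameter.
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
EOF

# Guarded so this is a no-op on machines without Ollama installed
command -v ollama > /dev/null && ollama create qwen2.5-coder-8k -f Modelfile || true
```

Then point the `model` field in your Continue config at the new name.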

Model Updates

Models improve frequently. Check for updates periodically:

# Re-pull to get the latest version of a model
ollama pull deepseek-coder-v2:16b

# List what you have installed
ollama list

# Remove models you're not using (they take disk space)
ollama rm old-model-name

Alternative: Tabby for Team Setups

If you need this for a whole team rather than just yourself, take a look at Tabby — another open-source, self-hosted AI coding assistant. It's designed more for the shared server use case, where you run it on a beefy machine and your team connects to it over the network.

Tabby has its own editor extensions and handles things like usage analytics and access control that matter in a team context. The tradeoff is more infrastructure to manage.

What You Lose (Being Honest)

I'd be lying if I said local models are as good as the largest cloud-hosted ones. They're not. Here's what you're trading off:

  • Quality: The best cloud models are still ahead, especially for complex multi-file reasoning. The gap is shrinking fast though.
  • Codebase awareness: Cloud solutions with indexing features can search your entire repo. Local setups are more limited to what fits in the context window.
  • Zero setup: Cloud tools just work out of the box. This takes 20 minutes to set up (but only once).

What you gain: privacy, reliability, no recurring costs, no usage limits, and the satisfaction of not having your code sent to someone else's servers.

Prevention: Making This Resilient

Once you have this running, a few things to keep it reliable:

  • Add Ollama to your system startup so it's always available when you open your editor
  • Pin your model versions if you find one that works well — model updates occasionally regress on specific languages
  • Keep a backup model downloaded in case your primary one acts up after an update
  • Monitor disk space — models are large (4-10GB each) and accumulate if you experiment a lot
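The first two concerns can be approximated with a tiny health check you run from your shell profile or a cron job. A sketch, assuming Ollama's default port of 11434 (the function name is mine):

```shell
# Is the local Ollama API answering? Port is configurable;
# 11434 is Ollama's default.
ollama_up() {
  curl -s --max-time 2 "http://localhost:${1:-11434}/api/tags" > /dev/null
}

if ! ollama_up; then
  echo "Ollama is down - starting it..."
  nohup ollama serve > /dev/null 2>&1 &
fi
```

On macOS, letting the Ollama desktop app launch at login accomplishes the same thing with less ceremony.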

The local AI coding assistant experience has crossed the threshold from "neat experiment" to "daily driver" for me. It's not perfect, but it's mine, it's private, and it doesn't go down when someone else's servers have a bad day.
