I stopped sending my code to external APIs six months ago. Not for privacy reasons—though that's a nice bonus—but because local LLMs for coding have gotten genuinely good.
Here's how to set up a complete local AI coding assistant in under 20 minutes. No subscriptions. No rate limits. No sending your proprietary code to someone else's servers.
## Why Local LLMs Actually Make Sense Now
The gap between cloud models and local ones has shrunk dramatically. For most coding tasks—autocomplete, explaining code, writing tests, refactoring—a well-tuned 7B or 14B model running locally performs at 80-90% of GPT-4's level.
That remaining 10-20%? It's usually in complex multi-file reasoning or obscure language edge cases. For daily coding, local models handle it fine.
The real wins:

- **Zero latency dependency** — works offline, on planes, in cafes with garbage wifi
- **No token costs** — run it 1,000 times a day, it costs nothing
- **Privacy** — your code stays on your machine
- **Customization** — fine-tune on your codebase if you want
## Step 1: Install Ollama
Ollama is the easiest way to run local LLMs. One binary, handles model downloads, provides an API.
**macOS/Linux:**

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:** Download from ollama.com and run the installer.

Verify it's running:

```shell
ollama --version
```
## Step 2: Pull a Coding Model
Not all models are created equal for code. Here's what actually works:
**Best all-rounder (7B, runs on 8GB RAM):**

```shell
ollama pull deepseek-coder:6.7b-instruct
```

**Better quality, needs 16GB RAM:**

```shell
ollama pull codellama:13b-instruct
```

**Best local coding model (needs 32GB RAM):**

```shell
ollama pull deepseek-coder:33b-instruct
```
My daily driver is `deepseek-coder:6.7b-instruct`: fast, accurate, and it fits in memory alongside my IDE and browser.
## Step 3: Test It Works

```shell
ollama run deepseek-coder:6.7b-instruct "Write a Python function to validate email addresses using regex"
```
You should see it generate code within seconds. If it's slow, you're either memory-constrained or need to close some Chrome tabs.
## Step 4: Connect to Your Editor

### VS Code with Continue

Continue is the best free extension for local LLM integration.

1. Install Continue from the VS Code marketplace
2. Open settings (Ctrl+Shift+P → "Continue: Open Config")
3. Add this config:
```json
{
  "models": [
    {
      "title": "DeepSeek Local",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct"
  }
}
```
Now you have:
- Inline autocomplete (like Copilot)
- Chat sidebar for questions
- Cmd+L to explain selected code
### Neovim with gen.nvim

```lua
-- In your lazy.nvim config
{
  "David-Kunz/gen.nvim",
  opts = {
    model = "deepseek-coder:6.7b-instruct",
    host = "localhost",
    port = "11434",
  },
}
```
## Step 5: API Integration for Scripts
Ollama exposes a REST API on port 11434. Use it in your tooling:
```python
import requests

def ask_llm(prompt: str) -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-coder:6.7b-instruct",
            "prompt": prompt,
            "stream": False,
        },
    )
    response.raise_for_status()  # fail loudly if Ollama isn't running
    return response.json()["response"]

# Generate a test
code = open("my_module.py").read()
tests = ask_llm(f"Write pytest tests for this code:\n\n{code}")
print(tests)
```
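If you drop `"stream": False`, the API streams its answer as newline-delimited JSON chunks, each carrying a piece of the output in its `response` field. A small helper can reassemble them (the name `join_stream` is mine, not part of any library):

```python
import json

def join_stream(ndjson_lines):
    """Stitch together the 'response' fields from Ollama's streaming NDJSON output."""
    return "".join(
        json.loads(line)["response"] for line in ndjson_lines if line.strip()
    )

# Simulated stream chunks, shaped like the lines the API sends
chunks = ['{"response": "def add(a, b):"}', '{"response": " return a + b"}']
print(join_stream(chunks))  # def add(a, b): return a + b
```

Streaming is worth the extra parsing when you pipe output into a terminal and want to see tokens as they arrive.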
I use this for:
- Pre-commit hooks that generate test stubs
- Documentation generators
- Code review bots in CI
## Performance Tuning
If responses are slow:

**Check memory usage:**

```shell
ollama ps
```

**Use a smaller context window.** Set it from inside an interactive session (or pass `num_ctx` in the API's `options`):

```shell
ollama run deepseek-coder:6.7b-instruct
>>> /set parameter num_ctx 2048
```

**Enable GPU acceleration (if you have NVIDIA):**

```shell
# Ollama should auto-detect the GPU, but verify the driver sees it
nvidia-smi
```
Most 7B models run fine on CPU with 16GB RAM. For 13B+, you really want a GPU.
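Those RAM figures follow from a simple rule of thumb: a quantized model needs roughly `parameters × bytes-per-weight` for its weights, plus headroom for the KV cache and runtime. A back-of-the-envelope sketch (the 1.2 overhead factor is my assumption, not an official Ollama number):

```python
def est_ram_gb(params_billions: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Rough RAM for a quantized model: weight bytes plus ~20% runtime overhead."""
    return round(params_billions * (bits_per_weight / 8) * overhead, 1)

print(est_ram_gb(6.7))  # roughly 4 GB for a 4-bit 6.7B model
print(est_ram_gb(33))   # roughly 20 GB for the 33B model
```

That's why the 6.7B model runs comfortably on an 8GB machine even after the OS and your editor take their share.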
## Model Recommendations by Use Case

| Task | Model | RAM Needed |
|---|---|---|
| Autocomplete | deepseek-coder:1.3b | 4GB |
| General coding | deepseek-coder:6.7b-instruct | 8GB |
| Complex refactoring | codellama:13b-instruct | 16GB |
| Architecture decisions | deepseek-coder:33b-instruct | 32GB |
Start small. The 6.7B model handles 90% of daily tasks. Scale up when you hit limits.
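If you script your setup, the table collapses into a tiny picker function (illustrative only; the thresholds simply mirror the table):

```python
def pick_model(ram_gb: int) -> str:
    """Map available RAM to the recommended model tag from the table above."""
    if ram_gb >= 32:
        return "deepseek-coder:33b-instruct"
    if ram_gb >= 16:
        return "codellama:13b-instruct"
    if ram_gb >= 8:
        return "deepseek-coder:6.7b-instruct"
    return "deepseek-coder:1.3b"

print(pick_model(16))  # codellama:13b-instruct
```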
## What Local LLMs Won't Do
Be realistic about limitations:
- **Large codebase understanding** — they can't hold 50 files in context
- **Cutting-edge frameworks** — training data has a cutoff
- **Complex debugging** — Claude and GPT-4 still win here
For those cases, I keep a cloud API as backup. But 80% of my AI-assisted coding now runs locally.
## Wrapping Up
The setup takes under 20 minutes. The models are free. The privacy is a bonus.
If you're still paying for Copilot and only use it for autocomplete and simple explanations, try this for a week. You might not go back.
More at dev.to/cumulus