I spent last weekend ditching cloud AI for coding. No more API rate limits, no more sending proprietary code to external servers, no more surprise bills. Just a local LLM running on my machine, integrated with my editor.
Here's exactly how to set it up in an afternoon.
Why Local LLMs for Coding?
Three reasons I made the switch:
- Privacy — My client code never leaves my machine
- Cost — $0/month after initial setup
- Speed — No network latency, works offline
The trade-off? You need decent hardware and the models aren't quite GPT-4 level. But for code completion, refactoring, and explaining code? They're surprisingly good.
What You'll Need
- RAM: 16GB minimum, 32GB recommended
- GPU: Optional but helps (NVIDIA with 8GB+ VRAM ideal)
- Storage: 10-50GB depending on models
- OS: Linux, macOS, or Windows with WSL2
No GPU? CPU inference works fine — just slower. I ran this on a 2-year-old laptop with no dedicated GPU and it was usable.
Step 1: Install Ollama
Ollama is the easiest way to run local LLMs. One binary, no Python environment hell.
# Linux/WSL
curl -fsSL https://ollama.ai/install.sh | sh
# macOS
brew install ollama
# Start the service
ollama serve
That's it. Ollama runs as a local API server on port 11434.
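Because everything goes through that HTTP API, any language that can make an HTTP request can talk to it. Here's a minimal Python sketch against the default endpoint (the helper names are mine, and it assumes you've already pulled the model from Step 2):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def generate_payload(prompt: str, model: str = "deepseek-coder:6.7b") -> dict:
    # /api/generate takes a model name and a prompt; stream=False asks for
    # a single JSON object instead of newline-delimited streaming chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "deepseek-coder:6.7b") -> str:
    """Send one completion request to the local Ollama server."""
    data = json.dumps(generate_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# usage (with the server running and the model pulled):
# print(generate("Write a Python one-liner that reverses a string"))
```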
Step 2: Pull a Coding Model
Not all models are equal for code. Here's what actually works:
# Best balance of speed and quality (6.7B params, ~4GB)
ollama pull deepseek-coder:6.7b
# Faster, smaller, good for completions (3B params, ~2GB)
ollama pull starcoder2:3b
# Heavy hitter if you have the RAM (34B params, ~20GB)
ollama pull codellama:34b
I use deepseek-coder:6.7b daily. It handles Python, TypeScript, Go, and Rust well. For quick completions, starcoder2:3b is snappier.
Test it works:
ollama run deepseek-coder:6.7b "Write a Python function to merge two sorted lists"
Step 3: Editor Integration
VS Code with Continue
Continue is my pick. Open source, actively maintained, works offline.
- Install the Continue extension from VS Code marketplace
- Open Continue settings (Cmd/Ctrl + Shift + P → "Continue: Open config.json")
- Add your Ollama model:
{
  "models": [
    {
      "title": "DeepSeek Coder",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder",
    "provider": "ollama",
    "model": "starcoder2:3b"
  }
}
Now you have:
- Chat with code context (highlight code → ask questions)
- Tab completions as you type
- Inline edits (Cmd+I to refactor selected code)
Neovim with Ollama.nvim
-- lazy.nvim
{
  "nomnivore/ollama.nvim",
  dependencies = { "nvim-lua/plenary.nvim" },
  cmd = { "Ollama", "OllamaModel" },
  opts = {
    model = "deepseek-coder:6.7b",
    url = "http://127.0.0.1:11434",
  },
}
Map it to a key:
vim.keymap.set("v", "<leader>oo", ":<c-u>lua require('ollama').prompt()<cr>")
Step 4: Terminal Integration
Sometimes I just want to ask a quick question without leaving the terminal.
# Add to .bashrc/.zshrc
ask() {
  ollama run deepseek-coder:6.7b "$*"
}
# Usage
ask "What's the time complexity of Python's sorted()?"
For piping code:
cat broken_script.py | ollama run deepseek-coder:6.7b "Fix the bugs in this code"
Performance Tuning
GPU Acceleration (NVIDIA)
Ollama auto-detects CUDA. Verify it's using your GPU:
ollama run deepseek-coder:6.7b --verbose
# --verbose prints eval rates after each response;
# single-digit tokens/sec usually means CPU inference.
# Watch nvidia-smi during generation to confirm GPU use.
If the GPU isn't being used, ensure your NVIDIA drivers are installed (plus nvidia-container-toolkit if you run Ollama inside Docker).
Reduce Memory Usage
Loading multiple models eats RAM. Ollama keeps a model in memory for five minutes after its last request, then unloads it automatically. To check what's installed and what's actually loaded:
# List installed models
curl http://localhost:11434/api/tags
# List models currently loaded in memory
ollama ps
# Or restart the service to clear everything immediately
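If you don't want to wait out the idle timer, the API can evict a model on demand: a generate request with no prompt and `keep_alive` set to 0 unloads it. A small sketch (the helper names are mine):

```python
import json
import urllib.request

def unload_payload(model: str) -> dict:
    # A request with no prompt and keep_alive=0 tells Ollama to
    # evict the model from memory immediately
    return {"model": model, "keep_alive": 0}

def unload(model: str, url: str = "http://localhost:11434") -> None:
    data = json.dumps(unload_payload(model)).encode()
    req = urllib.request.Request(f"{url}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

# usage: unload("deepseek-coder:6.7b")
```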
Speed vs Quality
Quantization trades accuracy for speed and memory. Note that Ollama's default tags are already 4-bit quantized (q4_0); you can pull an explicit tag to choose the trade-off yourself:
# q4_0 = 4-bit quantization: smaller and faster, slightly less accurate
ollama pull deepseek-coder:6.7b-instruct-q4_0
# q8_0 = 8-bit: larger and slower, closer to full precision
ollama pull deepseek-coder:6.7b-instruct-q8_0
I use a higher-precision tag for complex refactoring and the 4-bit default for quick completions.
Real-World Usage
After a month with this setup, here's what works well:
Great for:
- Code completion and boilerplate
- Explaining unfamiliar code
- Writing tests for existing functions
- Regex and SQL generation
- Git commit messages
Still use cloud AI for:
- Complex architectural decisions
- Multi-file refactoring
- Debugging truly weird issues
The local setup handles 80% of my daily AI coding needs. That's a win.
Troubleshooting
"Model not found" — Run ollama list to see installed models. Pull again if missing.
Slow responses — Try a smaller model or quantized version. Check if it's using GPU with --verbose.
Out of memory — Close other apps, use a smaller model, or add swap space.
Connection refused — Ensure ollama serve is running. Check nothing else is on port 11434.
What's Next
Once you're comfortable:
- Try different models — Mistral, Phi-3, Llama 3 all have coding variants
- Fine-tune on your codebase — Ollama supports custom Modelfiles
- Build custom tools — The Ollama API is dead simple to script against
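That last point is easier to show than tell. As one sketch of a custom tool, here's a commit-message suggester that pipes your staged diff through the local model (the helper names are mine; the model tag is the one used throughout this post):

```python
import json
import subprocess
import urllib.request

def commit_prompt(diff: str) -> str:
    # Small local models follow short, direct instructions best
    return "Write a one-line git commit message for this diff:\n" + diff

def suggest_commit_message(model: str = "deepseek-coder:6.7b",
                           url: str = "http://localhost:11434") -> str:
    # Grab the staged diff, send it to the local model, return its reply
    diff = subprocess.run(["git", "diff", "--staged"],
                          capture_output=True, text=True).stdout
    payload = json.dumps({"model": model,
                          "prompt": commit_prompt(diff),
                          "stream": False}).encode()
    req = urllib.request.Request(f"{url}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

# usage: stage some changes, then print(suggest_commit_message())
```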
The local LLM ecosystem is moving fast. Models that needed 64GB RAM two years ago now run on laptops. It's only getting better.
More at dev.to/cumulus