I was paying $10/month for GitHub Copilot. It's fine. It works. But it means every keystroke I type goes to Microsoft's servers, my code context gets shipped off somewhere, and I'm locked into whatever pricing they decide next year.
Then I found out I could run a Copilot-class setup on my own machine, completely free, with no data leaving my computer.
Here's exactly how I did it, and how you can too.
What You Actually Need
Before anything else — honest expectations:
- A machine with at least 8GB RAM (16GB is better)
- ~5GB free disk space per model
- A decent CPU, or an NVIDIA/Apple Silicon GPU for speed
- About 20 minutes of setup time
No API keys. No credit cards. No subscriptions.
Step 1: Install Ollama
Ollama is the piece that makes all of this possible. It's basically a runtime that lets you pull and run open-source LLMs the same way Docker lets you run containers.
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download the installer from ollama.com.
Verify it's running:
ollama --version
Step 2: Pull a Coding Model
This is the part where you choose your AI. For coding specifically, these are the ones worth using:
For most people (8GB RAM):
ollama pull qwen2.5-coder:7b
Qwen 2.5 Coder from Alibaba is genuinely impressive. Beats older Copilot versions on HumanEval benchmarks. Specialised entirely for code.
If you have 16GB+ RAM:
ollama pull qwen2.5-coder:14b
Noticeably better at multi-file context and explaining complex logic.
If you're on Apple Silicon (M1/M2/M3):
ollama pull deepseek-coder-v2:16b
Runs fast on Metal. Great at refactoring and docstring generation.
Absolute minimum (4GB RAM):
ollama pull qwen2.5-coder:3b
Smaller but still surprisingly capable for autocomplete and simple functions.
Test it immediately:
ollama run qwen2.5-coder:7b "write a Python function to flatten a nested list"
If you get a clean response, you're good.
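For reference, a correct answer to that prompt looks roughly like the sketch below (the model's exact output will vary run to run):

```python
def flatten(nested):
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            # Recurse into sublists and splice their contents in
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

# flatten([1, [2, [3, 4]], 5]) -> [1, 2, 3, 4, 5]
```

If the model produces something in this ballpark, it's working.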
Step 3: Pick Your Editor Integration
Now the fun part — making it feel like Copilot inside your actual editor.
VS Code → Continue
Continue is the open-source Copilot alternative. It's a VS Code (and JetBrains) extension that hooks directly into your local Ollama instance.
- Install the Continue extension from the VS Code marketplace
- Open Continue's config file (~/.continue/config.json) and add:
{
  "models": [
    {
      "title": "Qwen Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
- Press Tab to accept inline completions, Cmd+I (Mac) or Ctrl+I (Windows) to open the chat panel.
That's it. You now have inline autocomplete, a chat panel, and codebase-aware Q&A — all local.
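Under the hood, Continue is just talking to Ollama's local HTTP API (port 11434 by default). You can hit the same endpoint yourself, which is handy for scripting. A minimal sketch, assuming a stock Ollama install with the model already pulled:

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot generation
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="qwen2.5-coder:7b"):
    """Build the JSON body Ollama's /api/generate endpoint expects."""
    # stream=False returns one complete JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt, model="qwen2.5-coder:7b"):
    """Send a prompt to the local Ollama server and return the reply text."""
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires Ollama running locally:
# print(ask("write a Python function to flatten a nested list"))
```

Nothing leaves localhost, which is the whole point.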
Neovim → gen.nvim or avante.nvim
If you're a Neovim user, add this to your config with gen.nvim:
require('gen').setup({
  model = "qwen2.5-coder:7b",
  host = "localhost",
  port = "11434",
})
Then :Gen opens a prompt. Select code visually and run :Gen Enhance_Code or :Gen Add_Tests.
JetBrains → Continue (same plugin, different install)
Install Continue from the JetBrains marketplace. Same config file works.
Step 4: Supercharge It With Open WebUI (Optional but Worth It)
Open WebUI gives you a ChatGPT-like interface for your local models. Useful when you want to have a longer conversation about architecture, paste in a whole file, or explain a bug.
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, connect to your Ollama instance, and you have a full ChatGPT-style interface running entirely offline.
Real-World Performance
After a month of daily use on a MacBook Pro M2 with 16GB RAM, here's what I found:
| Task | Qwen 2.5 Coder 7B | GitHub Copilot |
|---|---|---|
| Simple function completion | ✅ Excellent | ✅ Excellent |
| Refactoring a 100-line file | ✅ Good | ✅ Good |
| Explaining unfamiliar code | ✅ Very good | ✅ Very good |
| Multi-file context | ⚠️ Limited | ✅ Better |
| Speed (M2 Mac) | ~2–3 tok/sec | Near instant |
| Privacy | ✅ 100% local | ❌ Sent to servers |
| Cost | ✅ Free | ❌ $10/month |
Speed is the real tradeoff. On CPU-only machines, responses are slower than a cloud API. On Apple Silicon or an NVIDIA GPU, the gap closes a lot.
The Part Nobody Tells You
Prompting matters more locally than with cloud models.
Cloud models like GPT-4 or Claude have been fine-tuned to be forgiving — they infer what you meant even if you're vague. Smaller local models are more literal. A vague prompt gets a vague answer.
Instead of:
fix this function
Try:
This Python function is supposed to parse a JWT token and return the
payload as a dict. It currently throws a KeyError when the token is
expired. Fix the expiry handling and add a try/except that returns None
on any decode failure.
More context = dramatically better output. Once I adjusted my prompting habit, the quality difference between local and cloud shrank a lot.
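If you find yourself writing these structured prompts often, it can help to template them. A hypothetical helper (the structure mirrors the example above; the function name and fields are my own):

```python
def build_fix_prompt(language, purpose, symptom, fix_request, code):
    """Assemble a specific, context-rich bug-fix prompt for a local model.

    Spelling out what the code should do, what it does instead, and what
    fix you want gives a small local model far more to work with than
    'fix this function'.
    """
    return (
        f"This {language} code is supposed to {purpose}. "
        f"It currently {symptom}. {fix_request}\n\n{code}"
    )

# Example:
# build_fix_prompt(
#     "Python",
#     "parse a JWT token and return the payload as a dict",
#     "throws a KeyError when the token is expired",
#     "Fix the expiry handling and return None on any decode failure.",
#     source_code,
# )
```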
Bonus: Free Cloud Options When Local Isn't Enough
Sometimes you need a bigger model for a hard problem. These are genuinely free with no credit card:
- Groq — Llama 3.1 70B running at insane speed. Free tier is generous.
- Google AI Studio — Gemini 1.5 Flash, 1M token context window, free.
- Cerebras — 1M tokens/day free, fastest inference available right now.
You can configure all of these in Continue the same way as Ollama — just swap the provider and add an API key.
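As a rough sketch, a Groq entry in ~/.continue/config.json could look like this (the provider name and model ID are assumptions based on Continue's provider list; check Continue's docs for your version):

```json
{
  "title": "Llama 3.1 70B (Groq)",
  "provider": "groq",
  "model": "llama-3.1-70b-versatile",
  "apiKey": "YOUR_GROQ_API_KEY"
}
```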
TL;DR
# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# 2. Pull a coding model
ollama pull qwen2.5-coder:7b
# 3. Install Continue extension in VS Code
# 4. Start coding for free
Your code stays on your machine. You pay nothing. It's genuinely good enough for daily use.
The setup takes 20 minutes and you'll never think about it again.
If this helped, I'm posting more practical AI dev workflow stuff — follow along. And if your local setup is different from mine, drop it in the comments — curious what models people are running.