This article was originally published on runaihome.com
Three open-weight coding models are worth taking seriously for local inference in 2026: Qwen2.5-Coder, DeepSeek-Coder-V2-Lite, and Codestral. The question isn't which one "wins" — it's which one your GPU can actually run at a useful speed, and whether you're optimizing for chat-style code generation or IDE autocomplete.
The answer splits cleanly by VRAM tier. At 8GB, one model dominates by benchmark. At 12–16GB, you're choosing between a dense model and a Mixture-of-Experts approach with meaningfully different trade-offs. At 24GB, the right answer depends on whether you spend most of your day pressing Tab in VS Code or asking Claude-style chat questions to a coding assistant. Below is the breakdown — with verified benchmark numbers and real VRAM requirements.
The models at a glance
| Model | Params | VRAM (Q4_K_M) | HumanEval | Context | License |
|---|---|---|---|---|---|
| Qwen2.5-Coder 7B-Instruct | 7B | ~5 GB | 88.4% | 128K | Apache 2.0 |
| Qwen2.5-Coder 14B-Instruct | 14B | ~10 GB | between 7B and 32B | 128K | Apache 2.0 |
| DeepSeek-Coder-V2-Lite | 16B (2.4B active) | ~12–13 GB | 81.1% (Python) | 128K | DeepSeek custom |
| Codestral 25.01 | 22B | ~18 GB | 86.6% | 256K | Mistral custom |
| Qwen2.5-Coder 32B-Instruct | 32B | ~20 GB | 92.7% | 128K | Apache 2.0 |
A few notes on reading that table. VRAM figures are for Q4_K_M quantization with a small context window; add 1–2 GB for a 16K context budget. DeepSeek-Coder-V2-Lite is a Mixture-of-Experts model: 16B total parameters, but only 2.4B active per token — it's more comparable in inference speed to a ~7B dense model, not a 16B one. And HumanEval measures "given this Python docstring, write the function" — important, but not the whole story if your main use case is autocomplete.
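As a rough way to turn the table into a tier check, here is a minimal sketch. The VRAM figures are copied from the table (upper end of any range), the 2 GB context allowance is the cautious end of the 1–2 GB note above, and the function name `models_that_fit` is made up for illustration.

```python
# Q4_K_M VRAM estimates in GB, copied from the table above.
# Where the table gives a range, the upper end is used to stay conservative.
MODEL_VRAM_GB = {
    "Qwen2.5-Coder 7B-Instruct": 5.0,
    "Qwen2.5-Coder 14B-Instruct": 10.0,
    "DeepSeek-Coder-V2-Lite": 13.0,
    "Codestral 25.01": 18.0,
    "Qwen2.5-Coder 32B-Instruct": 20.0,
}

# Assumption: 2 GB allowance for a 16K context (the cautious end of "1-2 GB").
CONTEXT_OVERHEAD_GB = 2.0


def models_that_fit(vram_gb: float,
                    context_overhead_gb: float = CONTEXT_OVERHEAD_GB) -> list[str]:
    """Return the models whose estimated footprint fits in vram_gb."""
    return [
        name
        for name, base_gb in MODEL_VRAM_GB.items()
        if base_gb + context_overhead_gb <= vram_gb
    ]


if __name__ == "__main__":
    for tier_gb in (8, 12, 16, 24):
        print(f"{tier_gb} GB: {models_that_fit(tier_gb)}")
```

This is a coarse filter, not a guarantee: with the cautious 2 GB allowance, DeepSeek-Coder-V2-Lite drops out of the 12GB tier even though it can be squeezed in with a shorter context.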
8GB VRAM: Qwen2.5-Coder 7B is the obvious call
If your GPU has 8GB of VRAM — RTX 3070, RTX 4060, RX 7600 — Qwen2.5-Coder 7B-Instruct is not a compromise. It's a genuinely impressive model. The 7B-Instruct variant scores 88.4% on HumanEval pass@1 and 84.1% on the harder HumanEval+ benchmark, according to the Qwen2.5-Coder technical report. For a 7B model you run at home with no API fees, those are numbers that sit alongside top-tier closed models from two years ago.
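For readers unsure what "pass@1" measures: it is the share of benchmark problems where a generated solution passes the unit tests. The sketch below uses the unbiased pass@k estimator introduced alongside HumanEval. It illustrates the metric only; how the cited report sampled its completions is not stated here, and `benchmark_score` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

With one sample per problem, pass@1 reduces to the plain fraction of problems solved, which is the way to read the percentages quoted in this article.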
At Q4_K_M quantization, the model file is around 4.7 GB and sits comfortably in 8GB VRAM with room for context. Speed on an RTX 4090 lands around 100–130 tokens per second in Ollama or llama.cpp, per the Home GPU LLM Leaderboard at awesomeagents.ai; on a 12GB RTX 3060, 7B Q4 benchmarks come in around 42 tok/s according to community inference speed tests, which is fast enough for interactive sessions.
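One way to sanity-check those speeds is the common rule of thumb that single-stream decoding is memory-bandwidth bound: each generated token requires reading roughly the whole weight file once. The sketch below applies it. The bandwidth figures are spec-sheet values assumed here, not numbers from the sources cited above, and the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Rough upper bound on tokens/s when decoding is memory-bandwidth bound."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# Q4_K_M file size for Qwen2.5-Coder 7B, as quoted in the article.
MODEL_FILE_GB = 4.7

# Assumption: spec-sheet memory bandwidth in GB/s; check your own card.
GPU_BANDWIDTH_GB_S = {
    "RTX 4090": 1008.0,
    "RTX 3060 12GB": 360.0,
}

if __name__ == "__main__":
    for gpu, bandwidth in GPU_BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bandwidth, MODEL_FILE_GB)
        print(f"{gpu}: ceiling of about {ceiling:.0f} tok/s")
```

Under these assumptions the ceilings come out near 214 and 77 tok/s, so the measured 100–130 and 42 tok/s sit at roughly half to 60% of the theoretical limit, which is plausible for real-world inference stacks.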
The 128K context window means you can feed entire files — 2,000-line Python files included — without chunking. That matters for the "refactor this function" use case more than you'd expect.
The one gap: fill-in-the-middle (FIM) autocomplete isn't Qwen2.5-Coder 7B's specialty. It supports FIM, but if your primary workflow is tab-completion in an editor, consider using the smaller Qwen2.5-Coder 1.5B model for the autocomplete slot and the 7B for chat, which is exactly how the Continue.dev dual-model setup works.
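To make "fill-in-the-middle" concrete, here is a minimal sketch of how an editor plugin might build a FIM prompt from the text around the cursor. The three control tokens are the ones listed for Qwen2.5-Coder in its model documentation as far as I know; treat them as an assumption and verify against the model card, since other models use different markers. `build_fim_prompt` is a hypothetical helper, not part of any library.

```python
# Assumption: Qwen2.5-Coder's FIM control tokens; confirm on the model card.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split source at the cursor and wrap both halves in FIM markers.

    The model is expected to generate the text that belongs at the cursor.
    """
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie within the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    code = "def add(a, b):\n    return \n"
    cursor = code.index("return ") + len("return ")
    print(build_fim_prompt(code, cursor))
```

The point of the format is that the model sees the code after the cursor as well as before it, which is why FIM quality matters more than chat benchmarks for tab-completion.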
# Install via Ollama (defaults to Q4_K_M)
ollama pull qwen2.5-coder:7b
Verdict: 8GB VRAM has a clear answer. Everything else in this tier is a step down.
12–16GB VRAM: Dense vs MoE trade-off
This is where the choice gets interesting. Two models deserve consideration:
Qwen2.5-Coder 14B-Instruct (dense, 14B parameters)
The 14B-Instruct sits between the 7B and 32B on benchmarks, scaling consistently with size across the Qwen2.5-Coder family. At Q4_K_M, the Ollama model file is roughly 8.5–9.0 GB, requiring about 10–11 GB of VRAM including the KV cache at standard context lengths. An RTX 3080 (10GB) can run it only with a shortened context or a few layers offloaded to the CPU; an RTX 3080 Ti (12GB) or RTX 4070 (12GB) is the comfortable minimum.
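The gap between file size and VRAM requirement is mostly KV cache, which grows linearly with context length. Here is a minimal sketch of the standard estimate. The layer count, KV-head count, and head dimension are values assumed here for the 14B model, not figures from the article; check the model's `config.json` before relying on them.

```python
def kv_cache_gib(
    context_tokens: int,
    n_layers: int = 48,        # assumption: transformer layers in the 14B model
    n_kv_heads: int = 8,       # assumption: grouped-query attention KV heads
    head_dim: int = 128,       # assumption: dimension per attention head
    bytes_per_value: int = 2,  # fp16 cache; a quantized cache would be smaller
) -> float:
    """Estimate KV cache size in GiB for a given context length."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


if __name__ == "__main__":
    for tokens in (4096, 8192, 16384):
        print(f"{tokens} tokens: {kv_cache_gib(tokens):.2f} GiB")
```

With these assumed values, an 8K context comes to 1.5 GiB of fp16 cache, which lines up with a roughly 9 GB file needing 10–11 GB in practice.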
Because all 14B parameters are active for every token, the 14B reads far more weight data per token than the DeepSeek option below, so expect slower generation in exchange for predictable memory bandwidth usage. The Apache 2.0 license means no commercial restrictions.
DeepSeek-Coder-V2-Lite (MoE, 16B total / 2.4B active)
This model scores 81.1% on HumanEval (Python) and 68.8% on MBPP+, per the DeepSeek-Coder-V2 paper. The MoE architecture is the key: despite 16B total parameters, only 2.4B are active per token, which means inference cost is closer to a 2.4B dense model in compute — but quality is closer to a 16B model because the router has access to specialized experts.
The Q4_K_M GGUF is 10.36 GB according to the bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF page on Hugging Face, requiring about 12–13 GB of VRAM to run fully on GPU. It fits on a 12GB card only with a slim margin; a 16GB card (RTX 4060 Ti 16GB, RTX 4070 Ti Super) runs it comfortably.
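The MoE trade-off is easy to see with a little arithmetic: memory is sized by total parameters, while per-token work follows the active ones. The sketch below uses only figures quoted above. It assumes the quantized bytes are spread evenly across parameters, which is only approximately true for Q4_K_M, and `moe_profile` is an illustrative name.

```python
def moe_profile(total_params_b: float, active_params_b: float,
                file_gb: float) -> dict[str, float]:
    """Summarize the memory vs. per-token cost of a Mixture-of-Experts model."""
    active_fraction = active_params_b / total_params_b
    return {
        "active_fraction": active_fraction,
        # Every expert must stay loaded, so the whole file is resident.
        "resident_gb": file_gb,
        # Assumption: bytes are evenly distributed across parameters.
        "approx_gb_read_per_token": file_gb * active_fraction,
    }


if __name__ == "__main__":
    profile = moe_profile(total_params_b=16.0, active_params_b=2.4, file_gb=10.36)
    for key, value in profile.items():
        print(f"{key}: {value:.2f}")
```

About 15% of the weights are touched per token, so the model occupies VRAM like a 16B model while doing per-token work closer to a much smaller one.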
The 128K context window and explicit support for 338 programming languages are genuine advantages. The DeepSeek custom license doesn't restrict personal or small-business use, but large-scale commercial deployment may require review.
Which one to pick at 12–16GB VRAM:
| Priority | Recommendation |
|---|---|
| Best code generation quality (by HumanEval) | Qwen2.5-Coder 14B |
| Fastest inference / lowest latency | DeepSeek-Coder-V2-Lite (2.4B active) |
| Simplest deployment and no license concerns | Qwen2.5-Coder 14B |
| Widest language coverage | DeepSeek-Coder-V2-Lite (338 languages) |
| Running on exactly 12GB VRAM | Qwen2.5-Coder 14B (more headroom) |
For most developers, Qwen2.5-Coder 14B is the safer default: straightforward dense inference, Apache 2.0, and a HumanEval score above the 7B's 88.4%. DeepSeek-Coder-V2-Lite is worth trying if you work in less common languages or want the lower per-token compute of its MoE design.
# Qwen2.5-Coder 14B (defaults to Q4_K_M)
ollama pull qwen2.5-coder:14b
# DeepSeek-Coder-V2-Lite
ollama pull deepseek-coder-v2:16b-lite-instruct-q4_K_M
# Or check available tags: https://ollama.com/library/deepseek-coder-v2
24GB VRAM: Generation vs autocomplete split
At 24GB (an RTX 3090, new or used, or an RTX 4090) you have two serious options with a clear use-case split.
Qwen2.5-Coder 32B-Instruct: the benchmark leader
The 32B-Instruct variant scores 92.7% on HumanEval and 87.2% on HumanEval+, making it the strongest open-weight coding model that fits on a single consumer 24GB card. At Q4_K_M, the model file is around 18–20 GB, which fits in a 24GB card with room for context — tight, but workable if you're not running a 50K-token context window simultaneously.
This is the model you want when you're asking it to architect a new service, review 500 lines of code, write a test suite from scratch, or debug a complex async issue. Chat-style code generation is where 92.7% HumanEval actually shows up.
Apache 2.0 license. No commercial restrictions.
ollama pull qwen2.5-coder:32b
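Once the model is pulled, chat-style generation can be scripted against Ollama's local HTTP API. Here is a minimal sketch using only the standard library. It assumes Ollama is running on its default port 11434 and uses the `/api/generate` endpoint with streaming turned off; the function names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str,
                  model: str = "qwen2.5-coder:32b") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen2.5-coder:32b",
             timeout_s: float = 300.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Write pytest tests for a function that parses ISO 8601 dates.")` returns the full completion as a string. The generous timeout is deliberate, since a 32B model on a single card can take a while on long answers.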
Codestral 25.01: the autocomplete champion
Codestral 25.01 scores 86.6% on HumanEval — good, but below Qwen2.5-Coder 32B. However, it reaches 95.3% average FIM pass@1 across Python, JavaScript, and Java in the January 2025 update, which is the highest fill-in-the-middle score of any model in 2025, including closed ones.
What does that mean practically? When your cursor is in the middle of a function and you press Tab, Codestral completes it correctly at a rate no other locally runnable model matches. And mid-function completion is what most developers are doing during most of their working hours.
Codestral 25.01 also added 256K