DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at runaihome.com

Codestral 2 for Local AI in 2026: Apache 2.0, 22B Params, 256K Context — Which GPU Runs It Best

This article was originally published on runaihome.com

TL;DR: Codestral 2 is Mistral's 22B dense coding model, now Apache 2.0 — fully commercial-use legal as of April 2026. The Q4_K_M GGUF is 13.3 GB, so it fits a 16 GB card with room for short context and runs comfortably on a 24 GB 3090. The catch: it's a dense 22B, so it's bandwidth-bound and slower than the MoE models everyone's switched to.

RTX 4060 Ti 16GB Used RTX 3090 24GB RTX 4090 24GB
Best for Q4_K_M, tight budget The sweet spot Speed + long context
Price (Jun 2026) ~$430 new ~$1,070 used avg ~$2,000+ used
Memory bandwidth 288 GB/s 936 GB/s 1,008 GB/s
Codestral 2 Q4_K_M speed ~18–22 tok/s ~40–50 tok/s ~60–75 tok/s
The catch Bandwidth-starved Best $/tok, runs hot Overkill for one model

Honest take: If you want Codestral 2 specifically and you're buying, a used RTX 3090 is the obvious pick — it has the bandwidth to make a dense 22B usable and the headroom to push context past the point a 16 GB card chokes. But before you commit, ask whether you actually need this model or just a good local coding model, because the MoE options are faster.

What changed: the license, not the weights

Codestral's original 22B release in 2024 shipped under the Mistral Non-Production License — you could play with it, but you could not legally use it inside a commercial product or paid service. That single clause kept it off most real dev stacks.

In April 2026, Mistral relicensed Codestral 2 under Apache 2.0. That removes the non-production restriction entirely: you can run it inside a paid product, ship it in a closed-source tool, fine-tune it and sell the result, no permission needed. For a coding model that's the whole ballgame — it's the biggest open-source coding license unlock since Llama 2 went commercial.

The model itself is a 22B dense transformer with a 256K context window — the largest context of any dedicated open coding model — fill-in-the-middle (FIM) support for IDE autocomplete, and coverage of 80+ programming languages. Mistral reports 86.6% on HumanEval. That's a strong single-file completion score, though HumanEval is a saturated benchmark in 2026 and shouldn't be read as a ranking against the latest agentic coders.

The number that decides everything: 13.3 GB

The practical question isn't "how good is it" — it's "does it fit, and how fast." Codestral 2 is a dense 22B, which means every token read needs all the active weights pulled from VRAM. There's no MoE sparsity hiding most of the model. That makes its memory footprint predictable and its speed a straight function of bandwidth.

Here are the real GGUF sizes from the community quants (bartowski's widely used build), which range from 6.64 GB at the smallest to 23.64 GB at Q8:

Quant File size Fits 12 GB? Fits 16 GB? Fits 24 GB?
Q4_K_M 13.3 GB No (with context) Yes (tight) Yes
Q5_K_M ~15.7 GB No Yes (very tight) Yes
Q6_K ~18.3 GB No No Yes
Q8_0 ~23.6 GB No No Barely

Q4_K_M is the one almost everyone runs. At 13.3 GB the weights alone leave about 2.7 GB free on a 16 GB card — enough for the KV cache at a few thousand tokens of context, but nowhere near enough to exploit the 256K context window. That context number is a server/API capability; on a 16 GB consumer card you'll be living at 8K–16K context, and even a 24 GB card runs out of room long before 256K. (If you slam into the wall, our CUDA out of memory fixes walk through the KV-cache and context knobs that buy you headroom.)

Speed: where dense bites you

Decode speed on a local LLM is governed by memory bandwidth, not raw compute — the GPU spends its time waiting on weights, not doing math. For a 13.3 GB model the theoretical ceiling is bandwidth ÷ model size, and real-world throughput lands at roughly half that after KV-cache reads and overhead.

That math plays out cleanly across the three cards worth considering:

  • RTX 4060 Ti 16GB (288 GB/s): This is the bottleneck card. A comparable 24B dense model (Mistral Small 3.2) was independently clocked at about 18.5 tok/s on 16 GB hardware — and Codestral 2 lands in the same ~18–22 tok/s range. Usable for autocomplete and short edits, sluggish for anything that streams a long answer.
  • Used RTX 3090 (936 GB/s): More than 3× the bandwidth of the 4060 Ti, and it shows. Expect roughly 40–50 tok/s at Q4_K_M — comfortably past reading speed (~7–10 tok/s), so generations feel responsive. This is the card the model is happiest on.
  • RTX 4090 (1,008 GB/s): A dense 32B at Q4 lands near 60 tok/s here, and the 4090 runs about 20% faster than a 3090 on 30B-class models, so a 22B comes in around 60–75 tok/s. Fast, but you're paying roughly double a 3090 for a model that doesn't need it.

The honest framing: on bandwidth-per-dollar, the used 3090 wins decisively for Codestral 2. The 4060 Ti makes it run; the 3090 makes it pleasant.

Running it: Ollama and llama.cpp

The fastest path is Ollama. Pull the model and point your editor at it:

ollama pull codestral
ollama run codestral "Write a Python function to debounce calls with a configurable delay"
Enter fullscreen mode Exit fullscreen mode

For FIM autocomplete inside your editor, Ollama exposes the completion endpoint on localhost:11434. Pair it with Continue.dev + Ollama for an in-IDE setup that uses Codestral 2 for both chat and tab-completion.

If you want explicit control over quant and context with llama.cpp:

# Grab the Q4_K_M GGUF (13.3 GB), then:
llama-server -m Codestral-22B-v0.1-Q4_K_M.gguf \
  -ngl 99 \
  -c 16384 \
  --host 0.0.0.0 --port 8080
Enter fullscreen mode Exit fullscreen mode

-ngl 99 offloads all layers to the GPU — essential, because partial CPU offload on a dense 22B tanks throughput. -c 16384 sets a realistic 16K context; don't reach for 256K on consumer VRAM, the KV cache will OOM you instantly.

Codestral 2 vs the models that overtook it

Here's the part the marketing won't tell you: in mid-2026, dense models lost the local-coding crown to MoE. A Mixture-of-Experts model with 30B+ total parameters but only 3B active per token reads far less from VRAM per step, so it runs faster than a dense 22B while often coding better.

That's the real competition for Codestral 2:

  • Qwen3-Coder-Next — Alibaba's MoE coding agent, faster decode at similar quality, also open-weight.
  • Devstral Small 2 — Mistral's own agentic coding model, built for multi-file/tool-use workflows Codestral wasn't designed for.

So why run Codestral 2 at all? Three reasons that still hold:

  1. The license. Apache 2.0 with no usage ceiling is cleaner than some competitors' terms if you're shipping a product.
  2. FIM quality. Codestral was built around fill-in-the-middle; its autocomplete inside an editor is excellent and low-latency on a 3090.
  3. Predictability. A dense model's VRAM and speed are dead simple to reason about — no expert-routing surprises, no "why did my MoE just slow down" debugging.

If you're picking a local coding stack from scratch, read our best local coding LLM comparison first — Codestral 2 is a strong FIM autocomplete engine, but it's no longer the default chat/agent pick. For a broader look at how MoE changed the speed math, Qwen3.6 35B-A3B and friends tell the story.

No GPU? Rent before you buy

If you don't have a 16 GB+ card yet and want to try Codestral 2 before spending $430–$1,070, rent an hour of a 24 GB GPU on RunPod. A 24 GB instance runs a few cents to ~$0.40/hour depending on the card, which is enough to load the Q4_K_M GGUF, wire it into your editor, and judge whether the FIM autocomplete is worth buying h

Top comments (0)