Devstral Small 2 Review 2026: 68% SWE-bench on RTX 4090

#devstral #mistral #localllm #coding

This article was originally published on aifoss.dev

TL;DR: Devstral Small 2 is a 24B Apache 2.0 coding model from Mistral that scores 68% on SWE-bench Verified, a strong result for a model that runs locally on a single RTX 4090. If you want an open-weight coding agent that keeps code on your machine and handles multi-file edits, this is the strongest 24B option available as of June 2026. The catch: it's purpose-built for agentic software engineering tasks, not casual code completion.

	Devstral Small 2	Devstral 2 (123B)	Claude Sonnet 4.5
Best for	Local deployment, solo devs	Team servers, max quality	Cloud API, highest accuracy
SWE-bench Verified	68.0%	72.2%	77.2%
VRAM (Q4_K_M)	~14 GB	~70 GB+	API only
License	Apache 2.0	Modified MIT	Proprietary
Context window	256K	256K	200K
API cost (input/1M)	$0.10	$0.40	~$3.00

Honest take: For local coding agents with a single consumer GPU, Devstral Small 2 is the model to run in mid-2026. It won't match Claude Sonnet 4.5 on hard tasks, but it costs you nothing per token and keeps your code off the internet.

What Devstral Small 2 Is

Mistral released Devstral Small 2 on December 9, 2025, alongside its larger sibling Devstral 2 (123B) and the Mistral Vibe CLI. Where Devstral 2 targets multi-GPU servers, Small 2 targets single-GPU workstations.

The model is fine-tuned specifically for software engineering agent tasks: exploring codebases, editing multiple files in a single pass, and calling tools in agentic loops. It handles those tasks differently from a general-purpose chat model — it's optimized to read file trees, understand diffs, and apply targeted edits rather than generate boilerplate from scratch.

Key specs (tested on Devstral-Small-2-24B-Instruct-2512):

Parameters: 24B
Context window: 256K tokens
License: Apache 2.0 — commercial use allowed, no revenue threshold restrictions
Released: December 9, 2025
Ollama tag: devstral-small-2

The Apache 2.0 license is meaningful here. The 123B Devstral 2 ships under a modified MIT license that restricts organizations with over $20M in monthly revenue. Small 2 has no such clause, so you can deploy it commercially under standard Apache 2.0 terms regardless of company size.

Benchmark Reality Check

68.0% on SWE-bench Verified is the headline number. Here's what that actually means.

SWE-bench Verified tests models on real GitHub issues from popular Python repositories. A successful "resolve" means the model read the issue, edited the codebase, and produced a patch that passes the tests associated with the original fix, without being shown the solution. It's a meaningful proxy for agentic software engineering capability.

For reference:

GPT-4o: ~38% at launch (early 2024 snapshot)
Claude Sonnet 3.5: ~49% at launch
Devstral Small 2 (24B, local): 68.0%
Devstral 2 (123B, API/server): 72.2%
Claude Sonnet 4.5 (API, current): 77.2%

A 4.2-point gap between Small 2 and the 123B version is smaller than you'd expect given a 5x parameter difference. The large gap vs. GPT-4o and older Claude versions reflects how much Mistral specialized this model for software agent tasks. General-purpose models trained to be chatty assistants perform worse on this benchmark than a 24B model trained specifically to edit files.
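The arithmetic behind that comparison is easy to check. The sketch below recomputes the point gap and the parameter ratio from the figures quoted in this article. The 500-task size of SWE-bench Verified is not stated here and is included only as an assumption, to turn percentages into approximate task counts.

```python
# Figures quoted in this article: (parameters in billions, SWE-bench Verified %).
MODELS = {
    "Devstral Small 2": (24, 68.0),
    "Devstral 2": (123, 72.2),
}

# Assumption: SWE-bench Verified contains 500 tasks (not stated in the article).
VERIFIED_TASKS = 500


def point_gap(a: str, b: str) -> float:
    """Score difference in percentage points (b minus a)."""
    return round(MODELS[b][1] - MODELS[a][1], 1)


def param_ratio(a: str, b: str) -> float:
    """How many times larger model b is than model a."""
    return round(MODELS[b][0] / MODELS[a][0], 1)


def resolved_tasks(name: str) -> int:
    """Approximate number of tasks resolved at the quoted score."""
    return round(MODELS[name][1] / 100 * VERIFIED_TASKS)
```

Under that assumption, the 4.2-point gap corresponds to roughly 21 additional resolved tasks for a model about five times the size.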

The benchmark also doesn't tell you everything. On tasks that require deep reasoning across a large, unfamiliar codebase, or refactors that span many files, you'll notice the quality gap between 68% and 77% more clearly. For standalone functions, unit tests, and targeted bug fixes, the difference is often imperceptible.

Installation: Ollama in 3 Commands

Ollama is the fastest path to running Devstral Small 2 locally. If you don't have Ollama installed, the full Ollama setup guide covers it on Linux, macOS, and Windows.

# Pull the model (Q4_K_M by default, ~15 GB)
ollama pull devstral-small-2

# Run interactively
ollama run devstral-small-2

# Or specify a tag explicitly
ollama pull devstral-small-2:24b-instruct-2512-q4_K_M

Quantization options and VRAM requirements:

Quantization	Size	Min VRAM	Quality
Q4_K_M (default)	~15 GB	16 GB	Good for coding tasks
Q6_K	~20 GB	22 GB	Noticeably better on complex edits
Q8_0	~26 GB	28 GB	Near-lossless
FP16	~48 GB	50 GB	Reference quality, multi-GPU only

The Q4_K_M default fits comfortably on an RTX 4090 (24 GB). A Mac Mini M4 Pro with 48 GB unified memory can run Q8_0 with headroom. If you have a 16 GB GPU, Q4_K_M fits but leaves little room for context: as the context grows past what VRAM can hold, part of the work is offloaded to the CPU and generation slows down.
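To make the table easier to apply, here is a minimal sketch that picks the highest-quality quantization whose minimum VRAM fits a given card. The numbers are copied from the table above; the function name and the `reserve_gb` parameter are hypothetical, added only for illustration.

```python
from typing import Optional

# (name, minimum VRAM in GB) as listed in the table above, lowest quality first.
QUANT_MIN_VRAM = [
    ("Q4_K_M", 16),
    ("Q6_K", 22),
    ("Q8_0", 28),
    ("FP16", 50),
]


def best_quant(available_gb: float, reserve_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quantization whose minimum fits, or None.

    reserve_gb is a hypothetical knob for memory you want to keep free,
    for example for long contexts or for the OS on a unified-memory Mac.
    """
    usable = available_gb - reserve_gb
    choice = None
    for name, min_gb in QUANT_MIN_VRAM:
        if min_gb <= usable:
            choice = name
    return choice
```

Note that this only compares against the listed minimums. On a 24 GB card it returns Q6_K, which meets the minimum but leaves less room for context than the Q4_K_M default; passing a reserve, such as `best_quant(24, reserve_gb=4)`, falls back to Q4_K_M.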

For a deeper look at how quantization levels affect output quality for coding tasks, see the GGUF quantization guide.

Once pulled, test the model:

ollama run devstral-small-2 "Write a Python function that finds all duplicate entries in a list of dicts by a given key."

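Output should start streaming as soon as the model has loaded, which takes longer on the first run. Ollama applies the model's prompt template, including tool-calling formatting, so no extra setup is needed.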
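If you would rather script the same smoke test, Ollama also exposes a local REST API. The sketch below is one way to call its `/api/generate` endpoint from Python using only the standard library. It assumes Ollama is listening on its default port and that the model tag matches the one pulled above; the `num_ctx` value is an optional setting you would choose yourself, not a recommendation from this article.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_generate_request(prompt, model="devstral-small-2", num_ctx=None):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # Optional: request a larger context window (uses more memory).
        payload["options"] = {"num_ctx": num_ctx}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Write a Python function that reverses a string.")` should return the model's answer as a single string.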

Use It With Aider

Aider is where Devstral Small 2 actually shines. Aider's architect mode closely matches how the model was designed to work: read the codebase, plan the edit, then apply it.

Install Aider if you haven't already — the Aider setup guide covers full configuration. Then point it at your local Ollama instance:

# Via local Ollama
aider --model ollama/devstral-small-2:latest

# Architect mode, using the same model to plan and to apply the edit
aider \
  --model ollama/devstral-small-2:latest \
  --architect \
  --editor-model ollama/devstral-small-2:latest

The --architect flag puts Aider in a two-step mode: the first call plans the edit, the second applies it. This maps well to how Devstral was trained — it expects to reason about a file tree before making changes.

One practical note: Devstral Small 2 generates longer "thinking" sections when given open-ended architecture questions. For targeted bug fixes (aider --message "fix the race condition in worker.py line 42"), it's fast and accurate. For open-ended feature requests on a large codebase, give it explicit file context with aider file1.py file2.py rather than letting it figure out which files to open on its own.
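For repeatable runs, the same invocations can be driven from a script. The following is a rough sketch that assembles the Aider command line described above and runs it with `subprocess`. The helper names are hypothetical; the flags are the ones already used in this section, and it assumes `aider` is on your PATH.

```python
import subprocess

DEFAULT_MODEL = "ollama/devstral-small-2:latest"


def aider_command(message, files, model=DEFAULT_MODEL, architect=False):
    """Build the argv for a one-shot Aider run against a local Ollama model."""
    cmd = ["aider", "--model", model]
    if architect:
        # Same model plans and applies the edit, as in the command above.
        cmd += ["--architect", "--editor-model", model]
    cmd += ["--message", message]
    cmd += list(files)  # explicit file context instead of letting the model search
    return cmd


def run_aider(message, files, **kwargs):
    """Run Aider in the current directory and return its exit code."""
    return subprocess.run(aider_command(message, files, **kwargs)).returncode
```

For example, `run_aider("fix the race condition in worker.py line 42", ["worker.py"])` mirrors the targeted bug-fix case, while passing `architect=True` and several files suits broader feature requests.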

Use It With Continue.dev

Continue.dev can use Devstral Small 2 via Ollama's OpenAI-compatible API. The model's 256K context window is an advantage here — you can index larger files as context without hitting limits.

In VS Code, open your Continue config (~/.continue/config.json) and add:

{
  "models": [
    {
      "title": "Devstral Small 2 (local)",
      "provider": "ollama",
      "model": "devstral-small-2:latest",
      "apiBase": "http://localhost:11434"
    }
  ]
}

For the agent tab in Continue 0.9+, set it as the default agent model — the tool-calling support Devstral was trained on maps directly to Continue's agent tool loop. In the VS Code sidebar, select the model from the dropdown and switch to Agent mode.

If you're using Continue with a team and want a shared Ollama instance, run Ollama on the server with OLLAMA_HOST=0.0.0.0 ollama serve and point the apiBase at the server IP. See the Continue.dev + Ollama guide for multi-user setup details.
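When several machines need the same entry, generating it can be less error-prone than hand-editing. This sketch builds a model entry in the same shape as the JSON shown earlier, with the host swapped for a shared server. The function names and the example address `192.168.1.50` are hypothetical.

```python
import json


def continue_model_entry(host="localhost", port=11434, model="devstral-small-2:latest"):
    """Build one entry for the "models" list in Continue's config.json."""
    where = "local" if host in ("localhost", "127.0.0.1") else host
    return {
        "title": f"Devstral Small 2 ({where})",
        "provider": "ollama",
        "model": model,
        "apiBase": f"http://{host}:{port}",
    }


def render_config(entries):
    """Render a config fragment with the given model entries."""
    return json.dumps({"models": list(entries)}, indent=2)
```

Calling `render_config([continue_model_entry(host="192.168.1.50")])` produces a fragment whose `apiBase` points at the shared server instead of localhost.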

Use It With Mistral Vibe CLI

Mistral Vibe is the native CLI that shipped alongside Devstral 2. It's open-source (MIT license, available on GitHub), built specifically for Devstral, and runs in your terminal without an IDE.

Install and configure it:


# Install via pip
pip install mistral-vibe

# Point at local Ollama (no API key needed)
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"

# Run in your project directory
vib