Proprietary coding agents are powerful, but every token you send to a cloud API is a token that leaves your codebase. Devstral 2, released December 9, 2025 by Mistral AI, changes that calculus. At 72.2% on SWE-bench Verified — the industry-standard benchmark for real GitHub issue resolution — it lands within five points of Claude Sonnet 4.5, while its 24B sibling, Devstral Small 2, runs fully on-premise on a single RTX 4090 or a Mac with 32GB RAM.
This guide covers what Devstral 2 and Devstral Small 2 are, their specs and benchmarks, how to pull and run the smaller variant via Ollama, and how to wire either model into your daily coding workflow.
Effloow Lab verified Ollama compatibility (v0.20.5), confirmed registry tags, and documented hardware requirements before writing this guide. See data/lab-runs/devstral-2-mistral-coding-agent-local-guide-2026.md for the full evidence record.
Why Devstral 2 Matters for Self-Hosting Teams
Most open-weight coding models have a dirty secret: they score well on toy benchmarks but fall apart on real multi-file, multi-step engineering tasks. SWE-bench Verified uses 500 manually screened GitHub issues from production repositories — not hand-curated puzzles. It measures whether an agent can actually read a codebase, reason about the bug, make targeted edits across multiple files, and produce a passing test run.
Devstral 2's 72.2% on that benchmark is not a coincidence. Mistral built the model in collaboration with All Hands AI, the team behind OpenHands — a production agentic framework used for real software engineering workflows. The training process was specifically designed around multi-step tool use: reading files, editing code, running tests, and iterating based on output.
For teams running high-volume coding tasks, the cost arithmetic is compelling. Claude Sonnet 4.5, which scores 77.2% on the same benchmark, costs $3/$15 per million tokens (input/output). Devstral 2 API pricing is $0.40/$2.00 — 7.5x cheaper per token. Trading five percentage points of accuracy for roughly 87% lower cost is a meaningful tradeoff for batch automation, PR review agents, or CI-integrated code generation.
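To make that concrete, here is a minimal sketch of the monthly arithmetic; the token volumes are hypothetical examples, so substitute your own:
# Hypothetical monthly volumes for a batch coding-agent workload.
INPUT_M, OUTPUT_M = 500, 50    # millions of input / output tokens per month (example values)

prices = {                     # USD per 1M tokens: (input, output)
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Devstral 2": (0.40, 2.00),
    "Devstral Small 2": (0.10, 0.30),
}

for model, (p_in, p_out) in prices.items():
    print(f"{model}: ${INPUT_M * p_in + OUTPUT_M * p_out:,.0f}/month")

# Claude Sonnet 4.5: $2,250/month
# Devstral 2: $300/month
# Devstral Small 2: $65/month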
For privacy-sensitive teams, the local option removes the cost equation entirely.
The Two Models: Which One Do You Need?
Mistral released two variants simultaneously on December 9, 2025.
| Feature | Devstral 2 | Devstral Small 2 |
|---|---|---|
| Parameters | 123B | 24B |
| Architecture | Dense transformer | Dense transformer |
| Context window | 256K tokens | 256K tokens |
| SWE-bench Verified | 72.2% | 68.0% |
| License | Modified MIT | Apache 2.0 |
| API input price | $0.40 / 1M tokens | $0.10 / 1M tokens |
| API output price | $2.00 / 1M tokens | $0.30 / 1M tokens |
| Minimum VRAM (local) | ~80GB (multi-GPU) | 24GB (single RTX 4090) |
| Ollama size (Q4_K_M) | ~73GB | ~15GB |
| Best for | API / cloud inference | Local self-hosting |
The license difference is notable. Devstral Small 2's Apache 2.0 license allows commercial use, modification, and redistribution without significant restriction — it's a clean open-source license for teams building products on top of the model. Devstral 2's modified MIT has additional terms around commercial deployment at scale; check the Hugging Face model card before productionizing at high volume.
For most self-hosters, Devstral Small 2 is the right target: it fits on a single consumer GPU, gets Apache 2.0 freedoms, and loses only four percentage points against its bigger sibling.
Prerequisites
Before setting up Devstral locally, confirm your environment meets the requirements:
Hardware minimums (Devstral Small 2):
- 24GB VRAM: a single NVIDIA RTX 4090 or 3090 Ti (24GB is a tight fit once context length grows)
- RAM alternative: Apple Silicon Mac with 32GB unified memory
- Storage: at least 20GB free (model is ~15GB after download)
Hardware for Devstral 2 (full model):
- 80GB+ VRAM across multiple GPUs, or a cloud instance
Software requirements:
- Ollama 0.13.3 or later (verified working with 0.20.5)
- macOS, Linux, or Windows with WSL2
Step 1: Install or Update Ollama
If Ollama is not installed, download it from ollama.com. If you already have an older version, update it first, since Devstral Small 2 requires Ollama 0.13.3 or later.
# macOS (Homebrew); if already installed, run: brew upgrade ollama
brew install ollama
# Verify version
ollama --version
# Should show 0.13.3 or higher
On Linux, the official install script works across most distributions:
curl -fsSL https://ollama.com/install.sh | sh
Step 2: Pull Devstral Small 2
The default tag uses Q4_K_M quantization (~15GB), which balances quality and VRAM usage for the 24B model.
ollama pull devstral-small-2:24b
This command pulls the model from the Ollama registry. Expect the download to take 15–45 minutes depending on your connection speed. The model identifier devstral-small-2:24b is an alias for devstral-small-2:24b-instruct-2512-q4_K_M.
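If you want a sanity check on that ~15GB figure, a back-of-envelope estimate gets close (the ~4.85 bits per weight average for Q4_K_M is an approximation, not an official spec):
# Rough estimate of a 24B model's download size at Q4_K_M quantization.
params = 24e9                  # parameter count
bits_per_weight = 4.85         # approximate Q4_K_M average (assumption)
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB of weights, before tokenizer and metadata")  # ~14.6 GB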
Once pulled, verify it's available:
ollama list
# devstral-small-2:24b ... 15GB just now
Step 3: Run Your First Coding Query
Start an interactive session to confirm the model is operational:
ollama run devstral-small-2:24b
Once the model loads (typically 10–30 seconds on first run), try a practical coding prompt:
>>> Fix this Python function so it handles empty input without raising an exception:
def get_first_item(items):
return items[0]
Devstral Small 2 will analyze the code, explain the issue (no empty-list guard), and return a corrected version. On an RTX 4090, a 24B model at Q4_K_M generates quickly enough for interactive use; your exact speed depends on context length and GPU memory bandwidth.
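A corrected version along these lines is what to expect back (illustrative, not a captured model transcript):
def get_first_item(items):
    """Return the first item, or None if the sequence is empty."""
    if not items:
        return None
    return items[0]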
To exit the interactive session:
>>> /bye
Step 4: Serve as a Local API Endpoint
Ollama exposes an OpenAI-compatible REST API, which means any tool that speaks the OpenAI chat completions format works out of the box.
Start the server:
ollama serve
# Listens on 127.0.0.1:11434 by default (set OLLAMA_HOST to change the bind address)
Test with curl:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "devstral-small-2:24b",
"messages": [
{
"role": "user",
"content": "Write a Python function that validates an email address using regex."
}
]
}'
You'll get a standard OpenAI-format response with the model's output. This endpoint is what IDE extensions like Continue use to talk to your local model.
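The same endpoint works from Python through the official openai package. This is a minimal sketch; the api_key value is a placeholder, since Ollama doesn't check it:
# Point the OpenAI client at the local Ollama server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="devstral-small-2:24b",
    messages=[
        {"role": "user", "content": "Write a Python function that validates an email address using regex."}
    ],
)
print(response.choices[0].message.content)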
Step 5: Wire into Your IDE with Continue
Continue is an open-source VS Code and JetBrains extension that supports local Ollama models natively.
Install Continue:
- Open the VS Code Extensions panel (Ctrl+Shift+X / Cmd+Shift+X)
- Search for Continue
- Install the extension by Continue Dev
Configure for Devstral:
Open ~/.continue/config.json and add the model:
{
"models": [
{
"title": "Devstral Small 2",
"provider": "ollama",
"model": "devstral-small-2:24b",
"contextLength": 32768
}
]
}
Set contextLength to what your VRAM allows at comfortable speed. The model supports up to 256K tokens, but 32K–64K is a practical limit for real-time IDE use on consumer hardware.
Once configured, Continue's Agent mode can handle multi-file edits, test runs, and iterative bug fixes — the same workflow Devstral was trained for.
Step 6: OpenHands Integration (Advanced)
Devstral 2 was built in collaboration with All Hands AI, the team behind OpenHands (formerly OpenDevin). The model was specifically trained to work within the OpenHands agentic scaffolding, which gives it structured access to file editing, terminal execution, and browser tools.
# Install OpenHands via Docker
docker pull ghcr.io/all-hands-ai/openhands:latest
docker run -it --rm \
-e LLM_API_KEY="your-mistral-api-key" \
-e LLM_MODEL="devstral-2-123b-instruct-2512" \
-v /var/run/docker.sock:/var/run/docker.sock \
-p 3000:3000 \
ghcr.io/all-hands-ai/openhands:latest
For local inference, point OpenHands at your Ollama endpoint (depending on your Docker setup, you may need to start Ollama with OLLAMA_HOST=0.0.0.0 so the container can reach it):
docker run -it --rm \
-e LLM_API_BASE="http://host.docker.internal:11434/v1" \
-e LLM_API_KEY="ollama" \
-e LLM_MODEL="ollama/devstral-small-2:24b" \
-v /var/run/docker.sock:/var/run/docker.sock \
-p 3000:3000 \
ghcr.io/all-hands-ai/openhands:latest
OpenHands' web interface then lets you assign GitHub issues directly to the agent — describe a bug, and Devstral handles codebase exploration, patch creation, and test execution autonomously.
Benchmark Context: Where Does It Actually Stand?
SWE-bench Verified is the clearest apples-to-apples comparison available for agentic coding models. Here's where Devstral 2 sits as of December 2025:
| Model | SWE-bench Verified | Open Weights? | Local? |
|---|---|---|---|
| Claude Sonnet 4.5 | 77.2% | No | No |
| Devstral 2 | 72.2% | Yes | API/cloud |
| Devstral Small 2 | 68.0% | Yes | Yes (RTX 4090) |
| DeepSeek V3.2 | [DATA NOT AVAILABLE] | Yes | Multi-GPU |
The 5pp gap between Devstral 2 and Claude Sonnet 4.5 is real, but narrow enough that for most tasks — especially those where your code and context stay in your infrastructure — the trade is worth making.
Mistral also conducted human preference evaluations: Devstral 2 won 42.8% of head-to-head comparisons against DeepSeek V3.2 (vs. 28.6% losses), and shows a measurable quality gap versus the open-source field.
Common Setup Issues
"CUDA out of memory" on first run:
The Q4_K_M quantization targets ~15GB of weights, but the KV cache grows with context length and can push a 24GB card over the edge. If you have exactly 24GB VRAM, reduce the context window first:
# Inside the interactive session, cap the context at 16K instead of the 256K maximum
ollama run devstral-small-2:24b
>>> /set parameter num_ctx 16384
Model responds but is very slow:
Check that Ollama is using GPU acceleration, not CPU fallback:
ollama ps
# The PROCESSOR column should read 100% GPU, not 100% CPU
If it shows 0% GPU, check your CUDA/ROCm drivers and confirm Ollama has GPU access.
Continue extension can't connect:
Ollama must be running before Continue tries to connect. Start Ollama with ollama serve and verify it's listening:
curl http://localhost:11434/api/version
# Should return {"version":"0.20.5"}
Context length too long / slow responses:
The 256K context window is a maximum, not a default. Most Ollama clients default to 2K–8K. If Continue is slow, set contextLength: 8192 in your config and work up from there.
Q: How does Devstral Small 2 compare to Devstral (the original)?
The original Devstral (May 2025) scored 46.8% on SWE-bench Verified and had 24B parameters. Devstral Small 2 (December 2025) uses the same parameter count but scores 68.0% — a 21-point jump. The architecture improvements and training pipeline revisions from the collaboration with All Hands AI account for most of that gain.
Q: Can I use Devstral Small 2 commercially without paying for a license?
Yes. Devstral Small 2 is Apache 2.0: you can use it in commercial products, modify or fine-tune the weights (which are available on Hugging Face), and build services on top of it. Review the full license text on the Hugging Face model card before any production deployment.
Q: Does Devstral support function/tool calling?
Devstral 2 is specifically trained for tool-use workflows — that is the core of its agentic design. Within frameworks like OpenHands and Continue Agent mode, it handles file reading, code editing, and shell execution via structured tool calls. Raw function-calling via the Ollama API follows the standard chat completions format compatible with any OpenAI tool-calling client.
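Here is a minimal sketch of a raw tool-calling request against the local endpoint. The run_tests tool is a made-up example, and it assumes the Ollama build of the model advertises tool support:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A hypothetical tool definition; any OpenAI-style function schema works here.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary line.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test file or directory"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="devstral-small-2:24b",
    messages=[{"role": "user", "content": "The tests in tests/test_parser.py fail on empty input. Investigate."}],
    tools=tools,
)

# If the model decides to use the tool, the structured call appears here:
print(response.choices[0].message.tool_calls)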
Q: What if I don't have an RTX 4090?
For Devstral Small 2, Apple Silicon Macs with 32GB unified memory are the most accessible alternative. The M3 Max and M4 Max chips handle 24B Q4_K_M models at practical speeds. An RTX 3090 (24GB) also works; it has the same VRAM as the 4090 but generates more slowly, and you'll still see memory pressure at longer context lengths. For anything smaller, the Mistral API at $0.10/$0.30 per 1M tokens is a better option than a degraded local experience.
Key Takeaways
Devstral 2 was the strongest open-weights coding agent at its December 2025 launch, and it remains highly competitive on the open-source leaderboard as of May 2026. Its 72.2% SWE-bench score, 256K context window, and API pricing that undercuts proprietary competitors by 7x make it a practical choice for teams running coding agents at scale. Devstral Small 2 goes further — Apache 2.0, 24B parameters, RTX 4090-compatible, and 68.0% SWE-bench — and represents a new quality ceiling for self-hosted coding agents.
The Ollama setup is four commands: install, pull, run, serve. From there, Continue and OpenHands handle the IDE and agentic scaffolding. The only real gating factor is hardware: you need 24GB VRAM or 32GB unified memory to run Small 2 at reasonable speed.
If your team has that hardware and cares about code privacy, latency, or API cost at scale, Devstral Small 2 via Ollama is the most defensible open-source coding agent stack available today.
Bottom Line
Devstral Small 2 is the best open-weights coding agent you can self-host on a single GPU: Apache 2.0, 68% SWE-bench Verified, and a 256K context window that fits on an RTX 4090 or Mac with 32GB RAM. If you're already running Ollama, adding it is four commands. The gap versus proprietary models is real but narrow — and the privacy and cost advantages often outweigh five percentage points on a benchmark.