On May 21, 2025, Mistral AI and All Hands AI jointly unveiled Devstral, an agentic large-language model purpose-built to automate software-engineering tasks. In benchmarking, Devstral outperformed every other open-source model by a clear margin. The model is distributed under the permissive Apache 2.0 license, so developers and organizations can adopt, modify, and deploy it freely. Devstral signals the next phase of open-source LLMs: not just competent completion models, but fully agentic systems that can plug into CI/CD pipelines and deliver patches. Because the model already "understands" standard test runners and VCS commands, scaffolds need far less prompt-engineering.
Belitsoft is an automation testing company with 20+ years of tech expertise, verified by customer reviews on authoritative platforms such as G2, Gartner, and Goodfirms. Belitsoft experts define how a solution should be tested, execute test cases (automated and manual), design automated test routines, capture any bugs or defects found, report errors to software engineering teams, ensure they are resolved, and provide recommendations on how they could be prevented. Belitsoft’s professionals also develop and maintain multiple test automation frameworks and automated scripts for visual, functional, and performance testing. They have experience in testing mobile and web apps and APIs through automation, as well as expertise in CI/CD and Continuous Testing.
What is Devstral
Devstral is tuned to act as a fully-autonomous software-engineering agent. It can read a repository, write or update tests, generate patches, run those tests, iterate until they pass, and finally open a ready-to-merge pull-request – completely hands-off once you give it the issue description.
Devstral can add a failing test if none exists, locate and edit the right source files, run pytest/npm test/etc. until the suite passes, then push a branch and open a pull request. This is the loop the OpenHands agent framework orchestrates.
The fine-tuning dataset included traces of tool usage (shell commands, git, common CI scripts). That lets the model produce commands this agent wrapper can execute directly instead of hallucinating free-form prose.
Problem framing & model design
Devstral marks a shift from "smart autocomplete" toward autonomous coding agents that can own an entire red-green loop.
Regular code-completion models ace small, isolated tasks, but they stumble when you throw them at a real repo with dozens of interdependent files and a failing test.
Models like CodeLlama or DeepSeek can autocomplete a function, but they lack persistent context about the rest of your codebase. They often break integration tests because they don't understand side-effects across files.
Devstral was trained specifically to handle that tougher, whole-project scenario. It learned by studying thousands of genuine GitHub issues, is evaluated against the 500 tasks of the SWE-Bench Verified benchmark, and is meant to run inside an "agent scaffold" such as OpenHands or SWE-Agent.
Those scaffolds feed Devstral the repo, call it repeatedly, run the tests after each patch, and stop only when everything is green.
In short, Devstral is designed to behave like a junior engineer who can read, reason, fix, and re-run tests – rather than just guess a single missing line.
You don't chat with Devstral directly – you plug it into a wrapper that: checks out the repo, calls the model for commands, executes them (edit files, run tests), and loops until the suite passes. The scaffold provides tool access and sandboxing, turning the model into a true autonomous worker.
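For orientation, here is a minimal sketch of that outer loop, assuming a pytest-based project. `ask_model` and `apply_patch` are placeholders for whatever model call and file-editing tool your scaffold provides; real scaffolds such as OpenHands add sandboxing, repository search, and richer tool schemas.

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(issue: str, ask_model, apply_patch, max_iters: int = 10) -> bool:
    """Ask the model for a patch, apply it, re-run the tests, stop when green.

    ask_model and apply_patch are hypothetical hooks standing in for the
    scaffold's model call and file-editing tool.
    """
    passed, log = run_tests()
    for _ in range(max_iters):
        if passed:
            return True                               # suite is green: ready for a PR
        patch = ask_model(issue=issue, test_log=log)  # model proposes an edit
        apply_patch(patch)                            # scaffold writes it to the working tree
        passed, log = run_tests()                     # verify and iterate (red-green loop)
    return passed
```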
Benchmark performance & community skepticism
Devstral's headline result is 46.8 % solved on the SWE-Bench Verified benchmark. Does that number actually prove that the model is superior in practice?
The "pro" side points out that 46.8 % is the best open-source score to date, beating the next-best by ~6 percentage points and even topping several closed models under identical test scaffolds.
The "con" side worries the number may be inflated because:
- Benchmark confusion – At least one commenter claims the team really measured on the easier SWE-Bench Lite, not on the stricter Verified subset.
- Possible over-fitting – Devstral was trained on thousands of GitHub issues, so critics ask whether the model "studied for the test".
- Mixed anecdotes – Developers in less-mainstream stacks (Clojure, Ruby, etc.) report that bigger competitors like Qwen-3 30 B or Claude 3.x sometimes do better on their day-to-day tasks. (These are social-media reports, not formal studies.)
How to read the result
If you run an agent scaffold (OpenHands, SWE-Agent, etc.) and your codebase is in a mainstream language with solid tests, Devstral may really solve ~half your benchmark-style issues out of the box.
If your project falls outside that "median" case – niche language, sparse tests, very slow CI - expect more modest gains and plan to measure locally.
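One practical way to measure locally is to replay a handful of your own already-fixed issues and count how many the agent can resolve end-to-end. The sketch below assumes a git repo with pytest; the issue fields (`pre_fix_commit`, `test_path`) and the `solve()` entry point are hypothetical placeholders for whatever your scaffold exposes.

```python
import subprocess

def attempt(issue: dict, solve) -> bool:
    """Replay one historical issue: reset to the pre-fix commit, let the agent
    try to solve it, then check whether the relevant tests now pass."""
    subprocess.run(["git", "checkout", issue["pre_fix_commit"]], check=True)
    solve(issue["description"])                      # your scaffold's entry point edits the tree
    result = subprocess.run(["pytest", "-q", issue["test_path"]])
    return result.returncode == 0

def local_solve_rate(issues: list[dict], solve) -> float:
    """Fraction of your own past issues the agent resolves end-to-end."""
    return sum(attempt(i, solve) for i in issues) / len(issues)
```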
Licensing, distribution & hardware requirements
How can you legally use, obtain, and run Devstral, and what kind of hardware do you need?
Apache 2.0 lets startups embed Devstral in paid products without legal gymnastics. It is the gold-standard "do-whatever-you-want" licence for open-weight models. You can fine-tune Devstral, bundle the new weights into a product, and sell it without owing royalties or publishing your changes. Because Apache 2.0 is already common in the software world, most legal teams sign off quickly – unlike newer "open-weight" licences that still restrict commercial use or require sharing derivatives.
No multi-GPU cluster is required, so you can keep proprietary code on-prem. If you don't want to self-host, the API is priced like Mistral's smaller SaaS models, so you can prototype in the cloud and later migrate on-site.
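As a sketch of that cloud-prototyping path, Mistral's platform exposes an OpenAI-style chat endpoint; the model identifier below is an assumption, so check the current model catalogue before relying on it.

```python
import os
import requests

# Hedged sketch: calls Mistral's OpenAI-compatible chat endpoint.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "devstral-small-latest",  # assumed model id; verify in the docs
        "messages": [
            {"role": "user", "content": "Summarize this failing test log and propose a minimal patch."}
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```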
Hardware requirements
Tokens-per-second on commodity hardware
Early testers measured "tokens per second," the standard speed metric for LLM inference, across a range of consumer setups. Their results show that more VRAM and higher memory bandwidth buy you speed, but even a solid laptop CPU can limp along for small jobs. The hosted API, sitting on datacenter-grade GPUs, is much faster.
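If you want to reproduce that kind of measurement yourself, Ollama's REST response includes token counts and timings. A rough sketch, assuming a local Ollama server on the default port and a Devstral tag already pulled:

```python
import requests

# Rough throughput check against a local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "devstral", "prompt": "Explain what a pytest fixture is.", "stream": False},
    timeout=600,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"generation speed: {tokens_per_second:.1f} tokens/s")
```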
Context-window and stability limits
Devstral supports a 128 k-token context window. Testers found that raising Ollama's context setting all the way to 128 k made the REST server unstable, sometimes dropping the session or reloading the model mid-conversation.
Large context + agentic chains (multiple internal prompts) create a lot of round-trips, so even fast local hardware can feel slower than a single call to a remote model that streams at datacenter speeds.
The front-end matters
Different launchers (Ollama, LM Studio, llama.cpp, vLLM) handle memory and I/O in very different ways.
Cursor's hosted Devstral instance sidesteps all of this: you pay per token, but you never wait for a cold-load and you get datacenter throughput.
Many practitioners now say the public leaderboard has drifted from the shop floor: benchmarks look ever less like real-world coding or data-science workloads, so they no longer predict day-to-day utility.
Tool integration & setup
Unlike most local LLM builds that behave like self-contained chatbots, Devstral was post-trained on the "cradle" scaffold from the open-source OpenHands project. The cradle teaches the model to call explicit functions – search_repo(), read_file(), edit_file(), run_tests() and so on – so the model can plan, navigate a real code-base, make edits and verify them with tests.
Because those function hooks live inside OpenHands, Devstral feels magical inside that environment.
Devstral is terrific if you give it the tool layer it was trained for – choose a frontend that either embeds the OpenHands cradle or lets you enforce a JSON tool schema.
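A hedged sketch of what "enforcing a JSON tool schema" can look like: most OpenAI-compatible servers accept tool declarations in the function-calling format below. The function names mirror the cradle functions described above; the exact schemas are illustrative.

```python
# Illustrative tool declarations in the OpenAI function-calling format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Return the contents of a file in the repository.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the project's test suite and return its output.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    # search_repo() and edit_file() would be declared the same way, and the
    # whole list is passed as the `tools` field of a chat-completion request.
]
```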
If you're just curious: try Devstral in LM Studio first; if the "pre-flight" check shows green, you're good to go.
If you want a fully local, automated coding agent: Pair Devstral with OpenHands through Ollama, Goose, or Aider – you'll see it read, edit, and test code on its own.
For production APIs on beefy hardware, put Devstral behind vLLM for maximum tokens-per-second, but watch the quantization caveats.
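A minimal sketch of that vLLM path, using the offline Python API rather than the HTTP server. The Hugging Face repo id and the Mistral-specific tokenizer flag are assumptions; check the vLLM docs for the exact settings your version needs.

```python
from vllm import LLM, SamplingParams

# Assumed checkpoint and settings; Mistral checkpoints often need
# tokenizer_mode="mistral" in recent vLLM releases.
llm = LLM(
    model="mistralai/Devstral-Small-2505",
    tokenizer_mode="mistral",
    max_model_len=32768,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Review this diff and list the tests it is most likely to break:\n<diff here>"],
    params,
)
print(outputs[0].outputs[0].text)
```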
User experiences & real-world performance
Here are reports from developers who have been stress-testing Devstral, Mistral AI's brand-new open-source coding model, and comparing it with other community favourites.
One reviewer hammered Devstral with tricky Ruby/RSpec tickets and it "sailed through," while another – inside the Cline VS Code agent – found its file-I/O tool calls clumsy. Cline lets any model read, edit and run code directly from your IDE, so weak tool use shows up fast.
Mistral advertises smooth local use on GPUs with 24 GB+ VRAM, but community members on 8 GB cards report slowdowns or context-length limits when they throw "broad, time-critical prompts" at it.
Like most small-to-mid-sized models, Devstral behaves best when you feed it problems step-by-step rather than dropping an entire refactor in one go.
Model comparisons
How it stacks up against other open models
Comparisons with other LLMs
Devstral (24 B, open weights) scores 46.8 % on the SWE-Bench Verified test, edging past the best open agent baselines such as o3-mini (OpenAI) and Claude 3.6 (updated Claude 3.5 Sonnet checkpoint) when those two are run in agent-less or community scaffolds. Anthropic's own in-house scaffold lets Claude 3.6 climb even higher, so the overall ranking flips once you compare "full-stack" systems rather than raw model weights.
For developers on a single consumer GPU or a high-RAM MacBook, the community still leans on:
Gemma 3 (4 B → 27 B) – Google's quantisation-aware variants run comfortably on 8-16 GB VRAM, yet handle quick summaries, doc generation, and light bug-fixing duties.
Mistral Small 3 / 3.1 (22 B → 24 B) – snappy, reasonably accurate for "pair-programming" use when prompted step-by-step, and still fits on a 16–24 GB card.
Choose the model-plus-workflow that fits your stack, budget and tooling pipeline, not the one that sits on top of a single leaderboard row.
EU involvement & Mistral's role
The EU is paying for the expensive computing infrastructure that modern AI startups need, then letting those startups use it almost for free so they can scale without fleeing to U.S. cloud giants.
Mistral was one of the first startups admitted to the EuroHPC pilot in late 2023, training early versions of its open-source 7B and Mixtral models on EU machines such as LUMI and MareNostrum 5.
In February 2025, the company revealed that its new, much larger training cluster will live in the publicly backed Éclairion data-centre, operated together with Fluidstack and designed specifically for GPU densities up to 200 kW per rack.
Compute remains in Europe, intellectual property stays in a European firm, and the public money turns into a "European champion" rather than another startup signing an exclusive Azure or AWS deal.
Road-map & technical notes
The current Devstral release is strictly a preview – i.e. fast to try, but not yet their "final" agentic coder. Mistral says a larger, more autonomous "agent-style" successor will follow within a few weeks, so you should expect another checkpoint (with a bigger context window, tool-use, and possibly function-calling baked in) before the end of June 2025.
Technical reminders
If you run Devstral (or any big model) on-device, note that a Mac with unified memory can load larger quantisations than a 24 GB PC GPU, but every extra token of context steals RAM from macOS. Decide whether to wait for the agent model or invest in more RAM/GPU now.
Even if your Mac/GPU can hold 32 K tokens of context, the server may silently fall back to its defaults unless you pin the settings on every request or rebuild the model file. Either script keep_alive: -1 and a fixed num_ctx (via a per-request option or a PARAMETER num_ctx line in the Modelfile), or try alternative runners (LM Studio, llama.cpp, etc.).
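One way to pin those settings per request, sketched against Ollama's REST chat API (the local model tag is assumed):

```python
import requests

# Pin the context window and keep the model resident on every call, so the
# server cannot silently drop back to its default num_ctx or unload the model.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "devstral",                 # assumed local tag
        "messages": [{"role": "user", "content": "Which files does this traceback implicate?"}],
        "options": {"num_ctx": 32768},       # request a 32 K context explicitly
        "keep_alive": -1,                    # never unload between requests
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```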