DEV Community

I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

By Vilius Vystartas | May 2026

I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families — Sao10k, Anthracite, Inflection, Mancer, Undi95 — and every single one scored 75% or higher on its first try. This is getting harder to keep up with.

Two more models tied the all-time record at 90%. The cheapest model ever tested cost $0.0001 for a full 10-task benchmark.


The New 90% Club Members

Eight models have now hit 90% on this benchmark. Batch 11 added two:

Mistral Large 2411 (90%, $0.008, 46s) — Mistral's November 2024 flagship matches their current Large 3. Sometimes the first version is still the best one. Zero hard fails, clean passes on 8/10 tasks.

DeepSeek Chat V3-0324 (90%, $0.002, 73s): The older V3 variant from March 2025 matches the original DeepSeek Chat at 90%. Every time I test a DeepSeek variant, it lands at 80-90%. The family is remarkably consistent.

The 90% club now includes: DeepSeek Chat (original), DeepSeek Chat V3-0324, Qwen3 Coder 30B, Nemotron 3 Nano 30B, Codestral 2508, Mistral Large 2411, MiniMax M2 Her, and Baidu Ernie 4.5 300B. Eight models. Seven of them cost less than a cent per full benchmark.


Five Families, First Try

Every new family debuted at 75% or higher: six models from five families, all on their first run.

| Family | Model | Score | Cost | Time |
| --- | --- | --- | --- | --- |
| Sao10k | L3.1 Euryale 70B | 85% | $0.002 | 29s |
| Sao10k | L3 Lunaris 8B | 85% | $0.0001 | 20s |
| Anthracite | Magnum V4 72B | 85% | $0.006 | 35s |
| Mancer | Weaver | 80% | $0.003 | 30s |
| Undi95 | Remm Slerp L2 13B | 75% | $0.002 | 31s |
| Inflection | Inflection 3 Productivity* | 75% | $0.012 | 42s |

*Inflection 3 result is provisional, awaiting lab response. Will update in due course.

L3 Lunaris 8B at $0.0001 is the cheapest model I've ever tested. A full 10-task benchmark for one ten-thousandth of a dollar. At this price, there's no reason not to test a model before you ship with it. Lunaris scored 85% — competitive with models that cost 100x more.

The Sao10k family (L3.1 Euryale 70B and L3 Lunaris 8B) is the standout. Both models scored 85%, both are Llama fine-tunes (Llama 3.1 and Llama 3, respectively), and both cost almost nothing. Community fine-tunes continue to punch above their weight.

The Recoveries

Two Qwen models from my previous failed batch completed successfully this time:

Qwen3 8B (80%, $0.02, 543s) — Needed per_call_timeout: 300 to finish. The model is competent (6 passes, 4 partials, zero fails) but painfully slow. Each API call takes 100-120 seconds on OpenRouter. Use it as a background job, not a real-time agent.

Qwen Plus 2025-07-28 (80%, $0.001, 19s): The dated variant works perfectly with enable_thinking: false. 80% at $0.0009 (rounded to $0.001 above) is great value. But use the current qwen/qwen-plus ID instead: it scores 85% and doesn't need the dated suffix.
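
Both recoveries came down to per-model settings. Here is a minimal sketch of how such overrides might be layered over shared defaults. The `per_call_timeout` and `enable_thinking` keys and the token limits come from this post. Everything else is an assumption for illustration: the names `DEFAULTS`, `OVERRIDES`, and `settings_for`, the two Qwen model ID strings, and the guess that the timeout is in seconds. This is not the actual harness or a description of OpenRouter's API.

```python
# Hypothetical harness settings. Values are from the post; the structure is assumed.
DEFAULTS = {"max_tokens": 300, "temperature": 0.1}

OVERRIDES = {
    # Model ID strings are assumed for illustration.
    "qwen/qwen3-8b": {"max_tokens": 600, "per_call_timeout": 300},  # presumably seconds
    "qwen/qwen-plus-2025-07-28": {"max_tokens": 600, "enable_thinking": False},
}


def settings_for(model_id: str) -> dict:
    """Return the shared defaults with any per-model overrides applied on top."""
    return {**DEFAULTS, **OVERRIDES.get(model_id, {})}


if __name__ == "__main__":
    for model in ("qwen/qwen3-8b", "qwen/qwen-plus-2025-07-28", "some/other-model"):
        print(model, settings_for(model))
```

Keeping overrides in one table like this would make it easy to see which models needed special handling when a batch is rerun.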


Price/Performance

| Model | Score | Cost | $/%-pt |
| --- | --- | --- | --- |
| L3 Lunaris 8B | 85% | $0.0001 | $0.0001 |
| DeepSeek Chat V3-0324 | 90% | $0.002 | $0.0017 |
| L3.1 Euryale 70B | 85% | $0.002 | $0.0021 |
| Remm Slerp L2 13B | 75% | $0.002 | $0.0020 |
| Mancer Weaver | 80% | $0.003 | $0.0041 |
| Anthracite Magnum V4 72B | 85% | $0.006 | $0.0066 |
| Mistral Large 2411 | 90% | $0.008 | $0.0093 |
| Inflection 3 Productivity | 75% | $0.012 | $0.0156 |
| Qwen3 8B | 80% | $0.020 | $0.0254 |

The ratio between the cheapest and most expensive $/%-pt is 254x: Lunaris at $0.0001/%-pt vs Qwen3 8B at $0.0254/%-pt. The two scores sit within five points of each other (85% vs 80%), yet the cost profiles are wildly different.
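
For anyone who wants to reproduce the arithmetic, here is a rough sketch. The published $/%-pt figures look consistent with cost divided by the score expressed as a fraction (that is, dollars per full 100%), computed from unrounded costs. That formula is an inference from the table, not something stated in the methodology, so estimates built from the rounded costs only approximate the published column.

```python
# (score %, rounded cost in $, published $/%-pt) copied from the table above.
ROWS = {
    "L3 Lunaris 8B": (85, 0.0001, 0.0001),
    "DeepSeek Chat V3-0324": (90, 0.002, 0.0017),
    "Mistral Large 2411": (90, 0.008, 0.0093),
    "Qwen3 8B": (80, 0.020, 0.0254),
}


def cost_per_full_score(score_pct: float, cost: float) -> float:
    """Assumed formula: cost divided by the score expressed as a fraction."""
    return cost / (score_pct / 100)


def spread(values: list[float]) -> float:
    """Ratio between the largest and smallest value."""
    return max(values) / min(values)


if __name__ == "__main__":
    for name, (score, cost, published) in ROWS.items():
        estimate = cost_per_full_score(score, cost)
        print(f"{name}: estimate ${estimate:.4f}, published ${published:.4f}")
    published_values = [row[2] for row in ROWS.values()]
    print(f"spread of published values: {spread(published_values):.0f}x")
```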


My Picks

  • Best overall: Mistral Large 2411 — 90%, 46s, $0.008
  • Best value: L3 Lunaris 8B — 85%, $0.0001 total. Absurd price/performance.
  • Best new family debut: Sao10k — both models at 85% first try. Watch this line.
  • Fastest of the core eight: L3 Lunaris 8B, 20 seconds for all 10 tasks (the Qwen Plus 2025-07-28 recovery run finished in 19s)

Methodology

Same setup as the previous 10 batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 600 (Qwen models), 300 (everyone else). Temperature: 0.1. Pattern-matching scoring against expected outputs.
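
A minimal sketch of what pattern-matching scoring can look like. The weights (pass = 1, partial = 0.5, fail = 0) are an assumption, chosen because they reproduce the 80% reported for Qwen3 8B's 6 passes and 4 partials. The regex patterns and the function names `grade` and `benchmark_score` are illustrative, not the actual harness.

```python
import re

# Assumed weights: consistent with 6 passes + 4 partials = 80% (Qwen3 8B).
WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def grade(output: str, patterns: list[str]) -> str:
    """Grade one task by how many expected regex patterns appear in the output."""
    hits = sum(1 for pattern in patterns if re.search(pattern, output))
    if hits == len(patterns):
        return "pass"
    return "partial" if hits > 0 else "fail"


def benchmark_score(grades: list[str]) -> float:
    """Convert per-task grades into a percentage."""
    return 100 * sum(WEIGHTS[g] for g in grades) / len(grades)


if __name__ == "__main__":
    expected = [r"SELECT\s+name", r"FROM\s+users"]
    print(grade("SELECT name FROM users WHERE id = 1;", expected))
    print(benchmark_score(["pass"] * 6 + ["partial"] * 4))
```

With only ten tasks, each partial moves the total by five points, which is why scores in this series land on multiples of 5%.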

Pre-flight verification caught zero failures this batch. All 10 candidates passed the simple-prompt test. Total cost: $0.05 for the core 8 models, then $0.02 for the Qwen recovery run. Total dataset: 158 models tested across cloud and local.
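
The pre-flight step can be sketched roughly like this, assuming it sends one trivial prompt per model and only checks that a non-empty reply comes back. The `preflight` function, the prompt text, the placeholder model IDs, and the `fake_ask` stand-in are all hypothetical. A real run would pass in a function that calls the API.

```python
from typing import Callable


def preflight(
    model_ids: list[str],
    ask: Callable[[str, str], str],
    prompt: str = "Reply with the single word OK.",
) -> dict[str, bool]:
    """Send one trivial prompt per model and record whether a usable reply came back."""
    results: dict[str, bool] = {}
    for model_id in model_ids:
        try:
            reply = ask(model_id, prompt)
            results[model_id] = bool(reply and reply.strip())
        except Exception:
            # Any transport or API error counts as a failed pre-flight.
            results[model_id] = False
    return results


def fake_ask(model_id: str, prompt: str) -> str:
    """Stand-in for a real API client so the sketch runs offline."""
    if model_id == "example/broken-model":
        raise TimeoutError("no response")
    return "OK"


if __name__ == "__main__":
    print(preflight(["example/good-model", "example/broken-model"], fake_ask))
```

Models that fail this check can be dropped before the full ten-task run, so a broken endpoint costs one call instead of ten.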

Full results and per-task scores: benchmarks.workswithagents.dev
