I Tested 10 More Models. Five Brand New Families Debuted. None Scored Below 75%.

#ai #llm #benchmark #agents

By Vilius Vystartas | May 2026

I ran another 10 models through the same agent coding benchmark. Six of them came from five previously untested families (Sao10k, Anthracite, Inflection, Mancer, Undi95), and every one of them scored 75% or higher on its first try. This is getting harder to keep up with.

Two more models tied the all-time record at 90%. The cheapest model ever tested cost $0.0001 for a full 10-task benchmark.

The New 90% Club Members

Eight models have now hit 90% on this benchmark. Batch 11 added two:

Mistral Large 2411 (90%, $0.008, 46s): Mistral's November 2024 flagship matches their current Large 3. Sometimes an older release holds up against its successor. Zero hard fails, with clean passes on 8 of 10 tasks.

DeepSeek Chat V3-0324 (90%, $0.002, 73s): The older V3 variant from March 2025 matches the original DeepSeek Chat at 90%. Every time I test a DeepSeek variant, it lands at 80-90%. The family is remarkably consistent.

The 90% club now includes: DeepSeek Chat (original), DeepSeek Chat V3-0324, Qwen3 Coder 30B, Nemotron 3 Nano 30B, Codestral 2508, Mistral Large 2411, MiniMax M2 Her, and Baidu Ernie 4.5 300B. Eight models. Seven of them cost less than a cent per full benchmark.

Five Families, First Try

Every new family debuted at 75% or higher. All six models from the five new families scored between 75% and 85% on their first run.

Family	Model	Score	Cost	Time
Sao10k	L3.1 Euryale 70B	85%	$0.002	29s
Sao10k	L3 Lunaris 8B	85%	$0.0001	20s
Anthracite	Magnum V4 72B	85%	$0.006	35s
Mancer	Weaver	80%	$0.003	30s
Undi95	Remm Slerp L2 13B	75%	$0.002	31s
Inflection	Inflection 3 Productivity	75%*	$0.012	42s

*The Inflection 3 Productivity result is provisional while I wait for a response from the lab. I will update it when I hear back.

L3 Lunaris 8B at $0.0001 is the cheapest model I've ever tested. A full 10-task benchmark for one ten-thousandth of a dollar. At this price, there's no reason not to test a model before you ship with it. Lunaris scored 85% — competitive with models that cost 100x more.

The Sao10k family (L3.1 Euryale 70B and L3 Lunaris 8B) is the standout. Both models scored 85%, both are fine-tunes of Llama (3.1 and 3 respectively), and both cost almost nothing. Community fine-tunes continue to punch above their weight.

The Recoveries

Two Qwen models from my previous failed batch completed successfully this time:

Qwen3 8B (80%, $0.02, 543s) — Needed per_call_timeout: 300 to finish. The model is competent (6 passes, 4 partials, zero fails) but painfully slow. Each API call takes 100-120 seconds on OpenRouter. Use it as a background job, not a real-time agent.

Qwen Plus 2025-07-28 (80%, $0.001, 19s): The dated variant works perfectly with enable_thinking: false. 80% at $0.0009 (rounded to $0.001 above) is great value. Still, use the current qwen/qwen-plus ID instead: it scores 85% and doesn't need the dated suffix.
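Both recoveries came down to per-model settings, so here is a minimal sketch of how such overrides could be layered on top of shared defaults. The key names `per_call_timeout` and `enable_thinking` come from the text above; the dictionary layout, the `settings_for` helper, the default timeout value, and the exact model ID strings are assumptions for illustration, not the harness's actual config format.

```python
from typing import Any

# Shared defaults. max_tokens and temperature follow the Methodology section;
# the default per_call_timeout is an ASSUMED placeholder, not a stated value.
DEFAULTS: dict[str, Any] = {
    "max_tokens": 300,
    "temperature": 0.1,
    "per_call_timeout": 60,
}

# Hypothetical per-model overrides. The model ID strings are illustrative.
OVERRIDES: dict[str, dict[str, Any]] = {
    "qwen/qwen3-8b": {"max_tokens": 600, "per_call_timeout": 300},
    "qwen/qwen-plus-2025-07-28": {"max_tokens": 600, "enable_thinking": False},
}


def settings_for(model_id: str) -> dict[str, Any]:
    """Return the defaults with any model-specific overrides applied on top."""
    return {**DEFAULTS, **OVERRIDES.get(model_id, {})}


if __name__ == "__main__":
    for model in ("qwen/qwen3-8b", "qwen/qwen-plus-2025-07-28", "some/other-model"):
        print(model, settings_for(model))
```

Keeping the overrides in one table makes it obvious which models needed special handling when a batch is rerun later.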

Price/Performance

Model	Score	Cost	$/%-pt
L3 Lunaris 8B	85%	$0.0001	$0.0001
DeepSeek Chat V3-0324	90%	$0.002	$0.0017
Remm Slerp L2 13B	75%	$0.002	$0.0020
L3.1 Euryale 70B	85%	$0.002	$0.0021
Mancer Weaver	80%	$0.003	$0.0041
Anthracite Magnum V4 72B	85%	$0.006	$0.0066
Mistral Large 2411	90%	$0.008	$0.0093
Inflection 3 Productivity	75%	$0.012	$0.0156
Qwen3 8B	80%	$0.020	$0.0254

The gap between the cheapest and the most expensive $/%-pt is 254x: Lunaris at $0.0001 against Qwen3 8B at $0.0254. The two scores sit in the same tier (85% and 80%), but the cost profiles are wildly different.
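For anyone who wants to reproduce the arithmetic, here is a small sketch. One inference to flag: the printed $/%-pt values look like the run cost divided by the score expressed as a fraction (for example $0.020 / 0.80), not the cost of a single percentage point. That reading is an assumption drawn from the table, and because the printed costs are rounded, recomputed values only approximately match the column.

```python
def cost_per_full_score(cost_usd: float, score_pct: float) -> float:
    """Run cost divided by the score as a fraction (85% -> 0.85).

    ASSUMPTION: this is how the $/%-pt column appears to be derived;
    the post does not state the formula.
    """
    if score_pct <= 0:
        raise ValueError("score_pct must be positive")
    return cost_usd / (score_pct / 100.0)


# Qwen3 8B with the rounded cost from the table: close to the printed $0.0254.
qwen3_8b = cost_per_full_score(0.020, 80)

# The 254x figure follows directly from the two printed column values.
spread = 0.0254 / 0.0001

if __name__ == "__main__":
    print(f"Qwen3 8B: ${qwen3_8b:.4f}")
    print(f"Spread: {round(spread)}x")
```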

My Picks

Best overall: Mistral Large 2411 (90%, 46s, $0.008)
Best value: L3 Lunaris 8B (85%, $0.0001 total). Absurd price/performance.
Best new family debut: Sao10k, with both models at 85% on the first try. Watch this line.
Fastest: L3 Lunaris 8B at 20 seconds for all 10 tasks, the quickest of the core batch. Only the Qwen Plus 2025-07-28 recovery run was faster, at 19 seconds.

Methodology

Same setup as the previous 10 batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 600 (Qwen models), 300 (everyone else). Temperature: 0.1. Pattern-matching scoring against expected outputs.
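To make "pattern-matching scoring" concrete, here is a rough sketch of one way it could work. The half-credit weight for partials is inferred from the numbers in this post (6 passes plus 4 partials giving 80%); the `grade` rule, the function names, and the sample patterns are hypothetical and not the benchmark's actual code.

```python
import re

# ASSUMED credit weights: half credit for a partial reproduces the
# 6 passes + 4 partials = 80% result reported for Qwen3 8B.
CREDIT = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def grade(output: str, patterns: list[str]) -> str:
    """Grade one task: all expected patterns found = pass, some = partial, none = fail."""
    if not patterns:
        raise ValueError("each task needs at least one expected pattern")
    hits = sum(1 for p in patterns if re.search(p, output, re.IGNORECASE))
    if hits == len(patterns):
        return "pass"
    return "partial" if hits > 0 else "fail"


def benchmark_score(grades: list[str]) -> float:
    """Convert per-task grades into a percentage score."""
    return 100.0 * sum(CREDIT[g] for g in grades) / len(grades)


if __name__ == "__main__":
    expected = [r"out\.txt", r"exit code 0"]  # illustrative patterns only
    print(grade("wrote out.txt, exit code 0", expected))
    print(benchmark_score(["pass"] * 6 + ["partial"] * 4))
```

A scheme like this is cheap and deterministic, but it only checks for expected strings, so it cannot tell a correct answer from one that merely mentions the right tokens.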

Pre-flight verification caught zero failures this batch. All 10 candidates passed the simple-prompt test. Total cost: $0.05 for the core 8 models, then $0.02 for the Qwen recovery run. Total dataset: 158 models tested across cloud and local.
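The pre-flight step can be pictured as a trivial prompt sent to each candidate before the paid run. The sketch below assumes a caller-supplied `send(model_id, prompt)` function that returns the reply text; that function, the prompt wording, and the pass criterion are all hypothetical stand-ins, since the post does not describe the actual check.

```python
from typing import Callable


def preflight(
    model_ids: list[str],
    send: Callable[[str, str], str],
    prompt: str = "Reply with the single word: ready",
    expected: str = "ready",
) -> dict[str, bool]:
    """Send one simple prompt per model and record whether a usable reply came back."""
    results: dict[str, bool] = {}
    for model_id in model_ids:
        try:
            reply = send(model_id, prompt)
            results[model_id] = expected in reply.lower()
        except Exception:
            # Any transport or API error counts as a pre-flight failure.
            results[model_id] = False
    return results


if __name__ == "__main__":
    # Fake sender used only to demonstrate the control flow.
    def fake_send(model_id: str, prompt: str) -> str:
        if model_id == "broken/model":
            raise TimeoutError("no response")
        return "Ready."

    print(preflight(["good/model", "broken/model"], fake_send))
```

Injecting `send` keeps the check independent of any particular client library, so the same function can be exercised with a fake sender in tests.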

Full results and per-task scores: benchmarks.workswithagents.dev