By Vilius Vystartas | May 2026
I ran another 10 models through the same agent coding benchmark. Five of them were from completely untested families — Sao10k, Anthracite, Inflection, Mancer, Undi95 — and every single one scored 75% or higher on its first try. This is getting harder to keep up with.
Two more models tied the all-time record at 90%. The cheapest model ever tested cost $0.0001 for a full 10-task benchmark.
The New 90% Club Members
Eight models have now hit 90% on this benchmark. Batch 11 added two:
Mistral Large 2411 (90%, $0.008, 46s) — Mistral's November 2024 flagship matches their current Large 3. Sometimes the first version is still the best one. Zero hard fails, clean passes on 8/10 tasks.
DeepSeek Chat V3-0324 (90%, $0.002, 73s) — The older V3 variant from March 2024 matches the original DeepSeek Chat at 90%. Every time I test a DeepSeek variant, it lands at 80-90%. The family is remarkably consistent.
The 90% club now includes: DeepSeek Chat (original), DeepSeek Chat V3-0324, Qwen3 Coder 30B, Nemotron 3 Nano 30B, Codestral 2508, Mistral Large 2411, MiniMax M2 Her, and Baidu Ernie 4.5 300B. Eight models. Seven of them cost less than a cent per full benchmark.
Five Families, First Try
Every new family debuted at 75% or higher. That's an impressive hit rate.
| Family | Model | Score | Cost | Time |
|---|---|---|---|---|
| Sao10k | L3.1 Euryale 70B | 85% | $0.002 | 29s |
| Sao10k | L3 Lunaris 8B | 85% | $0.0001 | 20s |
| Anthracite | Magnum V4 72B | 85% | $0.006 | 35s |
| Mancer | Weaver | 80% | $0.003 | 30s |
| Undi95 | Remm Slerp L2 13B | 75% | $0.002 | 31s |
| Inflection | Inflection 3 Productivity | 75% | $0.012 | 42s |
*Inflection 3 result is provisional — awaiting lab response. Will update in due course.
L3 Lunaris 8B at $0.0001 is the cheapest model I've ever tested. A full 10-task benchmark for one ten-thousandth of a dollar. At this price, there's no reason not to test a model before you ship with it. Lunaris scored 85% — competitive with models that cost 100x more.
The Sao10k family (L3.1 Euryale 70B and L3 Lunaris 8B) is the standout. Both models scored 85%, both are fine-tunes of Llama 3.1/3, and both cost almost nothing. Community fine-tunes continue to punch above their weight.
The Recoveries
Two Qwen models from my previous failed batch completed successfully this time:
Qwen3 8B (80%, $0.02, 543s) — Needed per_call_timeout: 300 to finish. The model is competent (6 passes, 4 partials, zero fails) but painfully slow. Each API call takes 100-120 seconds on OpenRouter. Use it as a background job, not a real-time agent.
Qwen Plus 2025-07-28 (80%, $0.001, 19s) — The dated variant works perfectly with enable_thinking: false. 80% at $0.0009 is great value. But use the current qwen/qwen-plus ID instead — it scores 85% and doesn't need the dated suffix.
Price/Performance
| Model | Score | Cost | $/%-pt |
|---|---|---|---|
| L3 Lunaris 8B | 85% | $0.0001 | $0.0001 |
| DeepSeek Chat V3-0324 | 90% | $0.002 | $0.0017 |
| L3.1 Euryale 70B | 85% | $0.002 | $0.0021 |
| Remm Slerp L2 13B | 75% | $0.002 | $0.0020 |
| Mancer Weaver | 80% | $0.003 | $0.0041 |
| Anthracite Magnum V4 72B | 85% | $0.006 | $0.0066 |
| Mistral Large 2411 | 90% | $0.008 | $0.0093 |
| Inflection 3 Productivity | 75% | $0.012 | $0.0156 |
| Qwen3 8B | 80% | $0.020 | $0.0254 |
The ratio between cheapest and most expensive $/%-pt is 254x. Lunaris at $0.0001/%-pt vs Qwen3 8B at $0.0254/%-pt — same tier of score, wildly different cost profiles.
My Picks
- Best overall: Mistral Large 2411 — 90%, 46s, $0.008
- Best value: L3 Lunaris 8B — 85%, $0.0001 total. Absurd price/performance.
- Best new family debut: Sao10k — both models at 85% first try. Watch this line.
- Fastest: L3 Lunaris 8B — 20 seconds for all 10 tasks
Methodology
Same setup as the previous 10 batches: ten real-world agent coding tasks — file operations, shell commands, error recovery, data parsing, SQL queries — tested via OpenRouter. Max tokens: 600 (Qwen models), 300 (everyone else). Temperature: 0.1. Pattern-matching scoring against expected outputs.
Pre-flight verification caught zero failures this batch. All 10 candidates passed the simple-prompt test. Total cost: $0.05 for the core 8 models, then $0.02 for the Qwen recovery run. Total dataset: 158 models tested across cloud and local.
Full results and per-task scores: benchmarks.workswithagents.dev
Top comments (0)