DEV Community

Vilius

Posted on May 11

Benchmarking 10 Untested LLMs Tonight — DeepSeek V4, Grok 4.20, GPT-5.5 Pro

#ai #llm #benchmark #agents

Tonight at 23:00 BST we're running fresh benchmarks on 10 LLMs we haven't tested before.

The lineup:

DeepSeek V4 Pro & Flash
Grok 4.20 & 4.1 Fast
GPT-5.5 Pro & GPT-5.4 Pro
Xiaomi MiMo V2.5 Pro
Google Lyria 3 Pro & Clip
inclusionAI Ring 2.6

Every model runs the same 10 real-world agent coding tasks. Same methodology, same scoring, same brutal honesty about what breaks.
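For readers wondering what "same tasks, same scoring" can look like in practice, here is a minimal sketch. It is not the actual harness behind these benchmarks: the `Task` type, the `run_benchmark` function, the pass/fail scoring rule, and the two stub models are all hypothetical, chosen only to illustrate the idea of running an identical task list against every model and scoring each the same way.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """Hypothetical task: a prompt plus a pass/fail check on the model output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_benchmark(
    models: Dict[str, Callable[[str], str]],
    tasks: List[Task],
) -> Dict[str, float]:
    """Run every model on the same tasks and return the pass rate per model."""
    if not tasks:
        raise ValueError("at least one task is required")
    scores: Dict[str, float] = {}
    for model_name, generate in models.items():
        passed = 0
        for task in tasks:
            try:
                ok = task.check(generate(task.prompt))
            except Exception:
                # Assumption: a crash or API error counts as a failed task.
                ok = False
            passed += int(ok)
        scores[model_name] = passed / len(tasks)
    return scores


def stub_echo(prompt: str) -> str:
    """Stand-in model that just repeats the prompt."""
    return prompt


def stub_broken(prompt: str) -> str:
    """Stand-in model that always fails, to show error handling."""
    raise RuntimeError("model unavailable")


# Two toy tasks; a real suite would use agent coding tasks instead.
tasks = [
    Task("greeting", "say hello", lambda out: "hello" in out),
    Task("arithmetic", "2+2", lambda out: out.strip() == "4"),
]

scores = run_benchmark({"stub-echo": stub_echo, "stub-broken": stub_broken}, tasks)
```

The point of the design is that the task list and the scoring function are fixed before any model runs, so a difference in score reflects the model and not the setup.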

Results go live immediately after the run — watch benchmarks.workswithagents.dev.

Update to follow once results are in.