Tonight at 23:00 BST we're running fresh benchmarks on 10 LLMs we haven't tested before.
The lineup:
- DeepSeek V4 Pro & Flash
- Grok 4.20 & 4.1 Fast
- GPT-5.5 Pro & GPT-5.4 Pro
- Xiaomi MiMo V2.5 Pro
- Google Lyria 3 Pro & Clip
- inclusionAI Ring 2.6
All tested on the same 10 real-world agent coding tasks. Same methodology, same scoring, same brutal honesty about what breaks.
Results go live immediately after the run — watch benchmarks.workswithagents.dev.
Update to follow once results are in.