DEV Community

Vilius
Originally published at workswithagents.dev

Saturday Night Fights

May 2026

In 1947, engineers found a moth in a relay of the Harvard Mark II. They pulled it out, taped it to the logbook, and wrote: "First actual case of bug being found." Grace Hopper told the story for decades. The word, already engineering slang, stuck.

But the real bugs were never moths. They were assumptions — tiny choices made years before they break, invisible until they're not.


There are 54 models on my benchmark leaderboard. Most of them look great on paper. Code scores in the 80s and 90s. Shiny model cards full of "state-of-the-art reasoning" and "frontier performance."

None of that tells you what happens when they actually have to fight.

I've been running agent-readiness tests alongside the code benchmarks for weeks. Tool calls. Multi-turn chains. Argument correctness. The gap between the code score and the fight result is staggering. Models with millions of downloads can't throw a punch. Models nobody's writing about went six for six.

So I'm starting a fight card.

Every Saturday, five matchups. Cloud vs cloud. Local vs local. Sometimes cross-weight when someone's talking trash. Three rounds per fight — code sprint, debug round, tool chain. Judges score on correctness, elegance, and completion. The fights run Friday night. The results drop Saturday morning.

Weight classes: Feather, Light, Middle, Heavy. Records tracked. Grudge matches when someone loses twice.

This isn't a benchmark. Benchmarks are target practice. This is a fight.

First card drops this Saturday at workswithagents.dev/agent-fight.

Top comments (1)

Gilder Miller

I love it. The gap between benchmark scores and actual agent performance is something we have been running into constantly. A model can write clean functions all day but the moment it needs to coordinate three tool calls in sequence it falls apart.
The multi-turn state problem is the real killer. Most models lose context somewhere in round two or three and start hallucinating what they already did.
Curious about your scoring on tool arguments. Are you validating against actual API schemas or just checking format? We found that distinction makes a big difference in catching models that look competent but are actually guessing at parameter names.

Bookmarked the site. Looking forward to the first results.
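
The distinction raised in the comment, checking format versus validating against a schema, can be sketched roughly as below. This is a minimal illustration and not the scoring used on the leaderboard: the `WEATHER_SCHEMA` tool, its parameter names, and both helper functions are hypothetical, and a real harness would more likely validate against the API's published schema.

```python
import json

# Hypothetical tool schema: parameter names and types are illustrative only.
WEATHER_SCHEMA = {
    "required": {"city": str, "units": str},
    "optional": {"days": int},
}


def format_check(raw_args: str) -> bool:
    """Weak check: the arguments parse as a JSON object."""
    try:
        return isinstance(json.loads(raw_args), dict)
    except json.JSONDecodeError:
        return False


def schema_check(raw_args: str, schema: dict) -> list[str]:
    """Stricter check: return a list of problems; an empty list means valid."""
    if not format_check(raw_args):
        return ["arguments are not a JSON object"]
    args = json.loads(raw_args)
    allowed = {**schema["required"], **schema.get("optional", {})}
    problems = []
    # Every required parameter must be present under its exact name.
    for name in schema["required"]:
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    # Every supplied parameter must be known and have the expected type.
    for name, value in args.items():
        if name not in allowed:
            problems.append(f"unknown parameter: {name}")
        elif not isinstance(value, allowed[name]):
            problems.append(
                f"wrong type for {name}: expected {allowed[name].__name__}"
            )
    return problems


# A guessed parameter name ("location" instead of "city") is well-formed JSON,
# so it passes the format check, but the schema check flags it.
guessed = '{"location": "Vilnius", "units": "metric"}'
correct = '{"city": "Vilnius", "units": "metric", "days": 3}'

print(format_check(guessed))
print(schema_check(guessed, WEATHER_SCHEMA))
print(schema_check(correct, WEATHER_SCHEMA))
```

Under these assumptions, a model that guesses at parameter names scores fine on the format check and only gets caught by the schema check, which is the gap the comment describes.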