Tanmay Devare

Posted on Jul 1

Everyone Is Benchmarking Claude 5. They're Measuring the Wrong Thing.

#ai #claude #reviews #rust

Claude 5 is here.

Every timeline is full of benchmark charts.

SWE-bench scores.

Coding comparisons.

Context windows.

Token pricing.

But after building runtime infrastructure for AI agents over the last few months, I think we're measuring the wrong thing.

The Wrong Question

Everyone is asking:

"How smart is Claude 5?"

I think the better question is:

"What happens after Claude 5 decides to call a tool?"

Because that's where production agents actually fail.

Not during reasoning.

During execution.

Reasoning Isn't The Hard Part Anymore

Today's models are incredibly capable.

They can:

Write production code
Search the web
Execute shell commands
Modify files
Query databases
Call APIs
Coordinate complex workflows

The difficult part isn't intelligence anymore.

It's execution.

A Failure I Kept Seeing

While testing coding agents, I noticed the same pattern over and over again.

The model wasn't getting dumber.

It was getting stuck.

Something like this:

write_file

write_file

write_file

write_file

write_file

Or this:

execute_shell

read_file

execute_shell

read_file

execute_shell

read_file

The reasoning wasn't changing.

The agent just kept repeating the same tool calls.

The agent was trapped inside its own execution.
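Both patterns above are cycles at the end of the tool-call trace: a cycle of length one for the repeated write_file, and a cycle of length two for the execute_shell / read_file oscillation. Here is a minimal Python sketch of how such a cycle could be spotted. The function name `trailing_cycle` and the thresholds are assumptions made for illustration, not how MicroLoop does it, and a real detector would probably compare arguments and results as well as tool names.

```python
from typing import Optional, Sequence


def trailing_cycle(
    calls: Sequence[str],
    max_period: int = 4,   # hypothetical: longest cycle length to look for
    min_repeats: int = 3,  # hypothetical: repetitions needed to call it a loop
) -> Optional[int]:
    """Return the period of the shortest pattern that repeats at least
    `min_repeats` times at the end of `calls`, or None if there is none."""
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(calls) < window:
            break  # longer periods need even more history
        tail = calls[-window:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(window)):
            return period
    return None


same_call = ["write_file"] * 5
oscillation = ["execute_shell", "read_file"] * 3
healthy = ["read_file", "write_file", "execute_shell", "read_file"]

print(trailing_cycle(same_call))    # 1
print(trailing_cycle(oscillation))  # 2
print(trailing_cycle(healthy))      # None
```

Checking only the tail of the trace keeps the check cheap enough to run on every tool call.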

The Cost Of Runtime Failures

These aren't harmless mistakes.

I've seen agents:

Burn 40,000+ tokens
Spend 20+ minutes retrying
Rewrite the same file repeatedly
Retry impossible tasks forever

The model wasn't broken.

Nobody was supervising execution.
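The simplest form of supervision is a hard budget that lives outside the model. The sketch below is one possible shape for it in Python. `ExecutionBudget` and `BudgetExceeded` are hypothetical names, and the default limits just reuse the numbers above (40,000 tokens, 20 minutes) as placeholders rather than recommendations.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its token or wall-clock budget."""


@dataclass
class ExecutionBudget:
    max_tokens: int = 40_000          # placeholder limit
    max_seconds: float = 20 * 60      # placeholder limit
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    tokens_used: int = 0
    started_at: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def charge(self, tokens: int) -> None:
        """Record token usage for one step and fail fast if over budget."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.tokens_used} tokens, limit is {self.max_tokens}"
            )
        elapsed = self.clock() - self.started_at
        if elapsed > self.max_seconds:
            raise BudgetExceeded(
                f"ran for {elapsed:.0f}s, limit is {self.max_seconds:.0f}s"
            )
```

The agent loop would call `budget.charge(tokens)` after each model or tool step and treat `BudgetExceeded` as a signal to stop the run. A budget like this doesn't explain why the agent is stuck, but it caps the damage.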

Prompt Engineering Doesn't Fix This

A better prompt can improve reasoning.

It cannot supervise execution after the model has already started making tool calls.

Once an agent enters a retry loop, telling it to "be careful" doesn't suddenly make it aware that it has repeated the same action ten times.

Execution needs its own runtime.

Why I Built MicroLoop

MicroLoop sits between the AI agent and every tool call.

LLM
 │
 ▼
MicroLoop Runtime
 │
 ▼
Tool Execution
 │
 ▼
Result
 │
 ▼
MicroLoop Runtime
 │
 ▼
LLM

Every action passes through the runtime.

If the agent begins spiraling into a pathological execution pattern, MicroLoop can:

Detect repeated execution trajectories
Interrupt infinite tool loops
Repair execution paths
Halt execution when necessary

The goal isn't to make models smarter.

The goal is to stop smart models from making expensive execution mistakes.
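To make the idea of routing every action through a runtime concrete, here is a deliberately small Python sketch of a supervising wrapper. It is not MicroLoop's API. `SupervisedTools`, `Halted`, and the two thresholds are invented for this example, and it covers only detection, interruption, and halting. It counts identical calls over the whole run, which is a simplification, since some repeated calls are legitimate.

```python
import json
from collections import Counter
from typing import Any, Callable, Dict


class Halted(RuntimeError):
    """Raised when the supervisor stops the run."""


class SupervisedTools:
    """Sits between the agent loop and the tools (hypothetical design)."""

    def __init__(
        self,
        tools: Dict[str, Callable[..., Any]],
        warn_after: int = 3,  # hypothetical threshold
        halt_after: int = 5,  # hypothetical threshold
    ) -> None:
        self.tools = tools
        self.warn_after = warn_after
        self.halt_after = halt_after
        self.seen: Counter = Counter()

    def call(self, name: str, **args: Any) -> Any:
        # Identify a call by tool name plus canonicalized arguments.
        key = (name, json.dumps(args, sort_keys=True, default=str))
        self.seen[key] += 1
        count = self.seen[key]

        if count > self.halt_after:
            raise Halted(f"{name} repeated {count} times with identical arguments")
        if count > self.warn_after:
            # Interrupt: skip the tool and tell the model what it is doing,
            # since it cannot see its own repetition count otherwise.
            return {
                "supervisor": (
                    f"'{name}' has been called {count} times with identical "
                    "arguments; try a different approach."
                )
            }
        return self.tools[name](**args)
```

The agent loop would call `runtime.call("write_file", path=..., content=...)` instead of invoking the tool directly, and feed whatever comes back (a real result or the supervisor note) to the model as the tool result.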

What Existing Benchmarks Don't Measure

Most AI benchmarks answer questions like:

Can the model solve the task?
How fast is it?
How many tokens did it consume?
What's the pass rate?

Those are useful.

But they rarely answer questions like:

How many retries occurred?
Did the agent enter a loop?
Did it recover after failure?
Did it terminate gracefully?
How much execution was wasted?

Those are runtime problems.
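Several of these questions can be answered from a recorded trace of tool calls. The Python sketch below shows one crude way to do that. The `Step` record and the metric definitions are assumptions chosen for the example: a "retry" here means repeating the identical call right after it failed, and "wasted" tokens are the tokens spent on those retries.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Step:
    tool: str
    args: str    # serialized arguments
    ok: bool     # did the tool call succeed?
    tokens: int  # tokens spent producing this call


def runtime_metrics(trace: List[Step]) -> Dict[str, Union[int, bool]]:
    retries = 0
    wasted_tokens = 0
    longest_run = 0
    run = 0
    prev: Optional[Step] = None

    for step in trace:
        same_as_prev = prev is not None and (prev.tool, prev.args) == (step.tool, step.args)
        run = run + 1 if same_as_prev else 1
        longest_run = max(longest_run, run)
        if same_as_prev and not prev.ok:
            # Identical call immediately after a failure: count as a retry.
            retries += 1
            wasted_tokens += step.tokens
        prev = step

    failures = [i for i, s in enumerate(trace) if not s.ok]
    # "Recovered" means some call succeeded after the first failure.
    recovered = bool(failures) and any(s.ok for s in trace[failures[0] + 1:])

    return {
        "retries": retries,
        "wasted_tokens": wasted_tokens,
        "longest_identical_run": longest_run,
        "recovered": recovered,
    }


trace = [
    Step("execute_shell", "cargo test", ok=False, tokens=500),
    Step("execute_shell", "cargo test", ok=False, tokens=500),
    Step("execute_shell", "cargo test", ok=False, tokens=500),
    Step("write_file", "src/lib.rs", ok=True, tokens=800),
    Step("execute_shell", "cargo test", ok=True, tokens=500),
]
print(runtime_metrics(trace))
# {'retries': 2, 'wasted_tokens': 1000, 'longest_identical_run': 3, 'recovered': True}
```

The final `cargo test` is not counted as a retry because a different call happened in between, which is the difference between retrying blindly and retrying after a fix.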

That's What I'm Testing Next

Instead of arguing over benchmark charts, I'm putting Claude 5 through runtime scenarios that resemble real production failures.

Including:

Infinite retry loops
Impossible tasks
Recursive tool chains
Broken execution states
Tool oscillation

Not to prove the model is bad.

To understand how modern AI agents behave when execution starts going wrong.
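As a rough illustration of what an "impossible task" scenario could look like, here is a small Python harness. Everything in it is hypothetical: the tool stub always fails, the two policies are scripted stand-ins for a model, and `run_scenario` is not part of MicroLoop or any real test suite. A real test would replace the policy with an actual agent.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union

Attempt = Tuple[str, bool, str]  # (argument, ok, message)


def always_fails(path: str) -> Tuple[bool, str]:
    """Tool stub for an impossible task: writing is never allowed."""
    return False, f"permission denied: {path}"


def run_scenario(
    policy: Callable[[List[Attempt]], Optional[str]],
    tool: Callable[[str], Tuple[bool, str]],
    max_steps: int = 10,
) -> Dict[str, Union[str, int]]:
    """Drive `policy` against `tool` and report how the run ended."""
    history: List[Attempt] = []
    for _ in range(max_steps):
        arg = policy(history)
        if arg is None:  # the policy chose to stop
            return {"outcome": "gave_up", "steps": len(history)}
        ok, message = tool(arg)
        history.append((arg, ok, message))
        if ok:
            return {"outcome": "succeeded", "steps": len(history)}
    return {"outcome": "step_limit", "steps": len(history)}


def stubborn(history: List[Attempt]) -> Optional[str]:
    # Retries the same call no matter what came back.
    return "/readonly/out.txt"


def bounded(history: List[Attempt]) -> Optional[str]:
    # Stops after three failed attempts.
    return None if len(history) >= 3 else "/readonly/out.txt"


print(run_scenario(stubborn, always_fails))  # {'outcome': 'step_limit', 'steps': 10}
print(run_scenario(bounded, always_fails))   # {'outcome': 'gave_up', 'steps': 3}
```

The interesting measurement is which outcome an agent lands in, and how many steps it spends getting there.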

The Bigger Picture

As models continue getting smarter, I think the bottleneck shifts.

It won't be reasoning.

It'll be execution.

The next generation of AI infrastructure won't just build smarter agents.

It'll build better runtimes.

Because in production, intelligence gets the job started.

Reliable execution gets it finished.

I'd love to hear how you're handling runtime failures in your own AI agents.

DEV Community