The Gap Between Agent Demos and Agent Production
Watch enough agent demos and you'll notice a pattern.
They work great in controlled environments. Give them a clear task, a fresh context window, a well-defined goal. The agent produces impressive results.
Then you deploy them.
And they drift.
Not catastrophically. Subtly. The fundraising agent that followed MEDDIC qualification perfectly in testing starts skipping discovery questions after a few weeks. The code review agent that caught security issues reliably begins missing edge cases. The data transformation agent that produced clean outputs 95% of the time suddenly hits 70%.
The demos never show this part.
Why Agents Drift
It's not the model degrading. It's not prompt decay.
It's that agents were never measured systematically in the first place.
Most agent development follows a demo-driven cycle:
- Write agent instructions
- Test manually on 3-5 examples
- Tweak when it fails
- Ship when it "works"
- Hope for the best
This is like shipping code without a test suite. Except that, unlike ordinary code, an agent can change its behavior based on subtle context shifts you didn't anticipate.
The agent that followed instructions perfectly in your test cases might interpret those same instructions differently when:
- The user asks a question slightly differently
- The context window fills with previous conversation
- A new model version ships with different biases
- The task has slightly different parameters
You won't know until you're debugging a production incident.
The Evaluation Problem
Andrej Karpathy recently open-sourced a method he calls "autoresearch"—a framework for systematically evaluating and improving ML systems through automated testing cycles.
The pattern translates directly to agents.
Instead of hoping your agent works, you run it through controlled evaluations that measure:
- Does it follow instructions? (instruction following rate)
- Does it stay on task? (goal drift detection)
- Does it handle edge cases? (failure mode coverage)
- Does it produce consistent outputs? (output variance analysis)
You define what "good" looks like with eval cases. The agent runs against them. You measure the gap. Then you iterate.
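That loop can be sketched as a minimal eval harness. Everything here is illustrative: `run_agent` stands in for however you actually invoke your agent, and the cases and pass criteria are placeholders for your own definitions of "good."

```python
# Minimal eval-harness sketch: define cases, run the agent, measure the gap.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                      # input handed to the agent
    check: Callable[[str], bool]     # did the output meet expectations?

def run_evals(run_agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case and return the pass rate (0.0 to 1.0)."""
    passed = sum(1 for c in cases if c.check(run_agent(c.prompt)))
    return passed / len(cases)

# Toy stand-in "agent" that upper-cases its input, just to show the shape.
cases = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("ship it", lambda out: "SHIP" in out),
]
pass_rate = run_evals(str.upper, cases)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 100%
```

The point is the structure, not the toy agent: a fixed case set, a deterministic check per case, and a single number you can track over time instead of "it looks good."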
This is how Anthropic tested Skills 2.0 internally—and found improvement opportunities on 5 out of 6 Skills built by their own team.
The A/B Problem
Here's the uncomfortable truth most agent builders avoid.
Your agent instructions might be making your outputs worse.
When Anthropic benchmarked their internal Skills, they found cases where raw Claude (no instructions) outperformed the carefully-crafted Skill.
Why? Because instructions written for an earlier, less capable model version now constrain the newer model unnecessarily. The Skill that helped Claude 3.5 is actively limiting Claude 4.
The solution is continuous A/B benchmarking:
- Run the same task set with your agent loaded
- Run the same task set with raw model (no agent)
- Compare outputs blind
- If raw wins, your agent needs rewrites—or removal
Most teams never do this. They assume their instructions help because they helped once.
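The A/B loop above fits in a few lines. This is a sketch under assumptions: `with_agent`, `raw_model`, and `score` are hypothetical callables for your two arms and your blind judge, not any real API.

```python
# Hypothetical A/B benchmark: run the same tasks with and without the
# agent's instructions, score both arms with the same judge, compare.
from typing import Callable

def ab_benchmark(
    tasks: list[str],
    with_agent: Callable[[str], str],   # model with agent instructions loaded
    raw_model: Callable[[str], str],    # same model, no instructions
    score: Callable[[str], float],      # one blind scorer for both arms
) -> dict:
    agent_avg = sum(score(with_agent(t)) for t in tasks) / len(tasks)
    raw_avg = sum(score(raw_model(t)) for t in tasks) / len(tasks)
    return {
        "agent": agent_avg,
        "raw": raw_avg,
        "verdict": "keep agent" if agent_avg > raw_avg else "rewrite or remove",
    }
```

Using one scorer for both arms is what keeps the comparison blind: the judge never knows which output came from the instructed agent and which came from the raw model.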
The Description Problem
The most common failure mode for agent systems: the agent doesn't activate when it should.
You build a sophisticated agent. You test it thoroughly. You ship it. Users type requests that should trigger it.
And nothing happens.
The agent sits dormant because the description—what tells the system when to use this agent—doesn't match how users actually ask for help.
You wrote "Email Drafter." Users type "write a message to my team."
You wrote "Data Cleaner." Users type "fix this spreadsheet."
The description optimization loop solves this:
- Generate test prompts that should trigger your agent
- Generate test prompts that should NOT trigger your agent
- Measure activation accuracy
- Rewrite descriptions to improve triggering
- Repeat until the right agent fires for the right requests
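Measuring step three of that loop is straightforward. In this sketch, `routes_to_agent` is a stand-in for whatever triggering logic your platform uses to decide which agent handles a prompt; the toy keyword router exists only to make the example runnable.

```python
# Sketch of measuring activation accuracy for an agent description.
from typing import Callable

def activation_accuracy(
    routes_to_agent: Callable[[str], bool],   # platform's triggering decision
    should_trigger: list[str],
    should_not_trigger: list[str],
) -> float:
    """Fraction of prompts routed correctly, counting both directions."""
    correct = sum(routes_to_agent(p) for p in should_trigger)
    correct += sum(not routes_to_agent(p) for p in should_not_trigger)
    return correct / (len(should_trigger) + len(should_not_trigger))

# Toy router standing in for real description matching: fires on "email".
router = lambda p: "email" in p.lower()
acc = activation_accuracy(
    router,
    should_trigger=["draft an email to my team", "email the client"],
    should_not_trigger=["fix this spreadsheet", "summarize the meeting"],
)
print(f"activation accuracy: {acc:.0%}")  # activation accuracy: 100%
```

Note that both prompt sets matter: a description that fires on everything scores perfectly on `should_trigger` and terribly on `should_not_trigger`, which is exactly the failure this metric exposes.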
What Production-Grade Agent Development Looks Like
Teams that ship reliable agents follow a different playbook.
Before shipping:
- Define eval cases with expected outputs
- Run the agent against the eval suite
- Measure pass rate, not "it looks good"
- A/B benchmark against raw model
- Optimize descriptions for triggering accuracy
After shipping:
- Log every agent activation
- Sample outputs for manual review weekly
- Track drift metrics (output consistency over time)
- Re-run evals after model updates
- Retire instructions that stop helping
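The drift-tracking item above can be as simple as comparing the latest eval pass rate against a rolling baseline. The numbers and the tolerance threshold here are placeholders; the shape mirrors the 95%-to-70% drop described earlier.

```python
# Illustrative drift check: flag when the latest pass rate falls more
# than `tolerance` below the historical baseline.
def drift_alert(history: list[float], latest: float, tolerance: float = 0.05) -> bool:
    """True if the latest pass rate dropped past the tolerance."""
    baseline = sum(history) / len(history)
    return baseline - latest > tolerance

weekly_pass_rates = [0.95, 0.94, 0.96]   # prior eval runs
print(drift_alert(weekly_pass_rates, 0.70))  # True: a 95% agent hitting 70% is drift
```

Run this on every scheduled eval and after every model update, and "the agent drifted" becomes an alert instead of a production incident.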
This isn't optional anymore.
The companies winning with agents aren't the ones with the cleverest prompts. They're the ones with the discipline to measure, test, and iterate systematically.
The Takeaway
Demos sell agents. Evaluations keep them working.
The gap between the two is where production failures live. Every agent that drifts in production is an agent that was never properly measured.
The tools are finally catching up. You can now:
- Eval your agent's instruction following
- A/B benchmark against raw model performance
- Optimize descriptions for activation accuracy
- Track drift metrics over time
The teams that adopt these practices will ship agents that stay reliable. The ones that don't will debug production incidents forever.
Your agent worked in the demo. Does it still work now?
This isn't theoretical. The difference between agents that scale and agents that fail isn't intelligence—it's measurement.