The Gap Between Agent Demos and Agent Production
Watch enough agent demos and you'll notice a pattern.
They work great in controlled environments. Give them a clear task, a fresh context window, a well-defined goal. The agent produces impressive results.
Then you deploy them.
And they drift.
Not catastrophically. Subtly. The fundraising agent that followed MEDDIC qualification perfectly in testing starts skipping discovery questions after a few weeks. The code review agent that caught security issues reliably begins missing edge cases. The data transformation agent that produced clean outputs 95% of the time suddenly hits 70%.
The demos never show this part.
Why Agents Drift
It's not the model degrading. It's not prompt decay.
It's that agents were never measured systematically in the first place.
Most agent development follows a demo-driven cycle:
- Write agent instructions
- Test manually on 3-5 examples
- Tweak when it fails
- Ship when it "works"
- Hope for the best
This is like shipping code without a test suite. Except that, unlike ordinary code, an agent can change its behavior based on subtle context shifts you didn't anticipate.
The agent that followed instructions perfectly in your test cases might interpret those same instructions differently when:
- The user asks a question slightly differently
- The context window fills with previous conversation
- A new model version ships with different biases
- The task has slightly different parameters
You won't know until you're debugging a production incident.
The Evaluation Problem
Andrej Karpathy recently open-sourced a method he calls "autoresearch"—a framework for systematically evaluating and improving ML systems through automated testing cycles.
The pattern translates directly to agents.
Instead of hoping your agent works, you run it through controlled evaluations that measure:
- Does it follow instructions? (instruction following rate)
- Does it stay on task? (goal drift detection)
- Does it handle edge cases? (failure mode coverage)
- Does it produce consistent outputs? (output variance analysis)
You define what "good" looks like with eval cases. The agent runs against them. You measure the gap. Then you iterate.
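That loop can be sketched as a minimal eval harness. Everything here is illustrative: `run_agent` stands in for however you actually invoke your agent, and the cases and pass criteria are placeholders for your own definitions of "good."

```python
# Minimal eval-harness sketch: define cases, run the agent, measure the gap.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                      # input handed to the agent
    check: Callable[[str], bool]     # did the output meet expectations?

def run_evals(run_agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case and return the pass rate (0.0 to 1.0)."""
    passed = sum(1 for c in cases if c.check(run_agent(c.prompt)))
    return passed / len(cases)

# Toy stand-in "agent" that upper-cases its input, just to show the shape.
cases = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("ship it", lambda out: "SHIP" in out),
]
pass_rate = run_evals(str.upper, cases)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 100%
```

The point is the structure, not the toy agent: a fixed case set, a deterministic check per case, and a single number you can track over time instead of "it looks good."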
This is how Anthropic tested Skills 2.0 internally—and found improvement opportunities on 5 out of 6 Skills built by their own team.
The A/B Problem
Here's the uncomfortable truth most agent builders avoid.
Your agent instructions might be making your outputs worse.
When Anthropic benchmarked their internal Skills, they found cases where raw Claude (no instructions) outperformed the carefully-crafted Skill.
Why? Because instructions written for an earlier, less capable model version now constrain the newer model unnecessarily. The Skill that helped Claude 3.5 is actively limiting Claude 4.
The solution is continuous A/B benchmarking:
- Run the same task set with your agent loaded
- Run the same task set with raw model (no agent)
- Compare outputs blind
- If raw wins, your agent needs rewrites—or removal
Most teams never do this. They assume their instructions help because they helped once.
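The A/B loop above fits in a few lines. This is a sketch under assumptions: `with_agent`, `raw_model`, and `score` are hypothetical callables for your two arms and your blind judge, not any real API.

```python
# Hypothetical A/B benchmark: run the same tasks with and without the
# agent's instructions, score both arms with the same judge, compare.
from typing import Callable

def ab_benchmark(
    tasks: list[str],
    with_agent: Callable[[str], str],   # model with agent instructions loaded
    raw_model: Callable[[str], str],    # same model, no instructions
    score: Callable[[str], float],      # one blind scorer for both arms
) -> dict:
    agent_avg = sum(score(with_agent(t)) for t in tasks) / len(tasks)
    raw_avg = sum(score(raw_model(t)) for t in tasks) / len(tasks)
    return {
        "agent": agent_avg,
        "raw": raw_avg,
        "verdict": "keep agent" if agent_avg > raw_avg else "rewrite or remove",
    }
```

Using one scorer for both arms is what keeps the comparison blind: the judge never knows which output came from the instructed agent and which came from the raw model.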
The Description Problem
The most common failure mode for agent systems: the agent doesn't activate when it should.
You build a sophisticated agent. You test it thoroughly. You ship it. Users type requests that should trigger it.
And nothing happens.
The agent sits dormant because the description—what tells the system when to use this agent—doesn't match how users actually ask for help.
You wrote "Email Drafter." Users type "write a message to my team."
You wrote "Data Cleaner." Users type "fix this spreadsheet."
The description optimization loop solves this:
- Generate test prompts that should trigger your agent
- Generate test prompts that should NOT trigger your agent
- Measure activation accuracy
- Rewrite descriptions to improve triggering
- Repeat until the right agent fires for the right requests
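Measuring step three of that loop is straightforward. In this sketch, `routes_to_agent` is a stand-in for whatever triggering logic your platform uses to decide which agent handles a prompt; the toy keyword router exists only to make the example runnable.

```python
# Sketch of measuring activation accuracy for an agent description.
from typing import Callable

def activation_accuracy(
    routes_to_agent: Callable[[str], bool],   # platform's triggering decision
    should_trigger: list[str],
    should_not_trigger: list[str],
) -> float:
    """Fraction of prompts routed correctly, counting both directions."""
    correct = sum(routes_to_agent(p) for p in should_trigger)
    correct += sum(not routes_to_agent(p) for p in should_not_trigger)
    return correct / (len(should_trigger) + len(should_not_trigger))

# Toy router standing in for real description matching: fires on "email".
router = lambda p: "email" in p.lower()
acc = activation_accuracy(
    router,
    should_trigger=["draft an email to my team", "email the client"],
    should_not_trigger=["fix this spreadsheet", "summarize the meeting"],
)
print(f"activation accuracy: {acc:.0%}")  # activation accuracy: 100%
```

Note that both prompt sets matter: a description that fires on everything scores perfectly on `should_trigger` and terribly on `should_not_trigger`, which is exactly the failure this metric exposes.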
What Production-Grade Agent Development Looks Like
Teams that ship reliable agents follow a different playbook.
Before shipping:
- Define eval cases with expected outputs
- Run the agent against the eval suite
- Measure pass rate, not "it looks good"
- A/B benchmark against raw model
- Optimize descriptions for triggering accuracy
After shipping:
- Log every agent activation
- Sample outputs for manual review weekly
- Track drift metrics (output consistency over time)
- Re-run evals after model updates
- Retire instructions that stop helping
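The drift-tracking item above can be as simple as comparing the latest eval pass rate against a rolling baseline. The numbers and the tolerance threshold here are placeholders; the shape mirrors the 95%-to-70% drop described earlier.

```python
# Illustrative drift check: flag when the latest pass rate falls more
# than `tolerance` below the historical baseline.
def drift_alert(history: list[float], latest: float, tolerance: float = 0.05) -> bool:
    """True if the latest pass rate dropped past the tolerance."""
    baseline = sum(history) / len(history)
    return baseline - latest > tolerance

weekly_pass_rates = [0.95, 0.94, 0.96]   # prior eval runs
print(drift_alert(weekly_pass_rates, 0.70))  # True: a 95% agent hitting 70% is drift
```

Run this on every scheduled eval and after every model update, and "the agent drifted" becomes an alert instead of a production incident.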
This isn't optional anymore.
The companies winning with agents aren't the ones with the cleverest prompts. They're the ones with the discipline to measure, test, and iterate systematically.
The Takeaway
Demos sell agents. Evaluations keep them working.
The gap between the two is where production failures live. Every agent that drifts in production is an agent that was never properly measured.
The tools are finally catching up. You can now:
- Eval your agent's instruction following
- A/B benchmark against raw model performance
- Optimize descriptions for activation accuracy
- Track drift metrics over time
The teams that adopt these practices will ship agents that stay reliable. The ones that don't will debug production incidents forever.
Your agent worked in the demo. Does it still work now?
This isn't theoretical. The difference between agents that scale and agents that fail isn't intelligence—it's measurement.