J. S. Morris


Why Agent Testing Is Broken

And what to do about it.


Software testing has been solved for decades. You write a function, you assert its output, your CI turns green, you ship. The contract is clear: same input, same output, always.

LLM agents broke this contract completely — and most teams haven’t noticed yet.


The Problem Nobody’s Talking About

Ask your agent “summarize this contract” today and get a good response. Ask it again tomorrow after a model update, a prompt tweak, or a context window change, and get something subtly different. Not wrong, exactly. Just… different. Different enough that the downstream system parsing it breaks silently at 2am.

This is not a hypothetical. It’s happening in production right now at companies that thought they were shipping stable systems.

The failure mode is insidious because:

  1. It doesn’t throw exceptions. The agent responds. It always responds. The response is even plausible. The failure is semantic, not syntactic.
  2. It’s not reproducible on demand. You can’t git bisect a drift in model behavior. Your code didn’t change; your prompts did, or the model got a silent update from your API provider, or the context you’re injecting shifted.
  3. Your existing tests don’t catch it. Unit tests mock the LLM entirely. Integration tests check that the API call completes. Neither checks whether the content of the response still satisfies your downstream expectations.
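Failure mode 3 is easy to demonstrate. Here is a minimal sketch, with hypothetical names (`call_llm_stub`, `summarize`), of the conventional unit test: it stubs out the model, asserts the plumbing works, and passes forever while the semantic check never runs.

```python
# A conventional unit test stubs the model call out entirely.
def call_llm_stub(prompt: str) -> str:
    return "A plausible summary."  # canned response: it never drifts

def summarize(text: str, llm=call_llm_stub) -> str:
    return llm(f"Summarize this contract clause: {text}")

# The usual assertions: the call completed and returned a string.
out = summarize("...termination upon 30 days notice...")
assert isinstance(out, str) and out  # syntactic check: always passes

# The check nobody writes: does the content satisfy downstream expectations?
has_anchor = "termination" in out.lower()  # False: a silent semantic failure
```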

You have no regression suite for cognition. You’re flying blind.


Why This Happens

Traditional software is deterministic. LLMs are stochastic systems operating on learned representations of language. When you update a model, you’re not patching a function — you’re shifting a distribution.

A 3% shift in how Claude-3.5 vs Claude-4 responds to a legal summarization prompt might be invisible in manual review and catastrophic in a pipeline that expects the word “termination” to appear in every output.
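To make that concrete, here is a hypothetical downstream router (the function and both outputs are invented for illustration) that silently misroutes a clause the moment the model rephrases “termination”:

```python
def route_clause(summary: str) -> str:
    # Downstream logic that assumes "termination" appears verbatim.
    if "termination" in summary.lower():
        return "legal-review"
    return "archive"  # silent misroute when the model rephrases

# Illustrative outputs before and after a silent model update:
v1 = "Either party may trigger termination with 30 days notice."
v2 = "Either party may end the agreement with 30 days notice."
# route_clause(v1) -> "legal-review"; route_clause(v2) -> "archive"
```

No exception, no failed request, no log line. The clause just stops reaching legal review.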

The industry’s response has been to add more evals — elaborate human preference datasets, MMLU benchmarks, red-teaming suites. These are valuable for model builders. They are nearly useless for application developers.

Application developers don’t need an answer to “is this model generally capable?” They need an answer to “does this model, with my specific prompts, in my specific context, still produce outputs my system can rely on?”

That question has no good answer today.


What Broken Looks Like in Practice

Here’s a real pattern seen across teams shipping LLM applications:

Month 1: Team writes prompts, ships agent, manually verifies outputs look good.

Month 2: Someone tweaks a system prompt “slightly” to improve tone. Three downstream parsers start failing intermittently.

Month 3: The model provider silently updates the model behind the same API endpoint. Response format drifts by 15%. The agent still works in demos.

Month 4: A customer reports that summarized contracts are missing liability clauses. Postmortem reveals the issue started in month 2. Nobody noticed because there were no behavioral tests.

This is the norm, not the exception.


The Right Mental Model

Stop thinking about agent outputs as function return values. Think about them as documents produced by a probabilistic process with a behavioral contract.

The contract is: given this class of inputs, the output must satisfy these structural and semantic properties.

Testing that contract requires:

1. Baseline capture. Run your scenarios against a known-good version of the system and record the outputs. This is your behavioral fingerprint.
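Baseline capture can be a few lines. A hedged sketch: `run_agent` stands in for whatever function calls your agent, and the JSON layout is an assumption, not a standard.

```python
import json

def capture_baseline(scenarios, run_agent, path="baseline.json"):
    """Run each scenario against the known-good system and record outputs."""
    baseline = {name: run_agent(prompt) for name, prompt in scenarios.items()}
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```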

2. Containment checks. Define what must appear in every output. Not the exact text — that would fail on every run. The semantic anchors: key terms, required sections, structural elements.
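A containment check is just membership testing over the output. A minimal version (`check_contains` is an illustrative name):

```python
def check_contains(output: str, required_terms: list[str]) -> list[str]:
    """Return the semantic anchors missing from an output (empty = pass)."""
    lowered = output.lower()
    return [term for term in required_terms if term.lower() not in lowered]
```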

3. Drift detection. Compare new outputs against your baseline. When similarity drops below your tolerance threshold, fail the build. Let the engineer decide if the change is intentional.
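For drift detection, one zero-dependency option is `difflib.SequenceMatcher`; a sketch under that assumption (it measures surface similarity only, so embedding cosine similarity would be a more semantic choice):

```python
from difflib import SequenceMatcher

def drift_check(baseline: str, current: str, tolerance: float = 0.8):
    """Compare a new output to its baseline; flag it when similarity drops."""
    similarity = SequenceMatcher(None, baseline, current).ratio()
    return similarity >= tolerance, similarity
```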

4. CI integration. Run this on every push. On every model version change. On every prompt edit. The same way you run unit tests.

This is not complicated. It’s just not being done.


Why Nobody’s Done It Yet

The tooling doesn’t exist yet in a usable form.

Existing evaluation frameworks (RAGAS, LangSmith, etc.) tend to be:

  • Coupled to specific frameworks (LangChain, etc.)
  • Focused on RAG quality metrics rather than behavioral regression
  • Dependent on hosted infrastructure and accounts
  • Too complex to add to a CI pipeline in an afternoon

What the market needs is a pytest for agents. Lightweight. Composable. Runs locally. Zero-infrastructure. Exits with code 1 when behavior breaks.


What a Solution Looks Like

```yaml
# scenarios/summarize_contract.yaml
name: summarize_contract
input: |
  Summarize this contract clause in 5 bullet points:
  "...The Contractor shall indemnify...termination upon 30 days notice..."
expected_contains:
  - liability
  - termination
max_tokens: 512
```
```shell
# Run against real model, compare to baseline
agentprobe run scenarios/ \
  --backend anthropic \
  --baseline baseline.json \
  --tolerance 0.8
```
```
✓ PASS  summarize_contract
✗ FAIL  extract_parties
        Drift detected: similarity 0.61 < 0.80
        Missing expected terms: ['indemnification']
```

Exit code 1. CI fails. Engineer investigates before merge.

This is the minimum viable interface for agent regression testing. One command. One config file. Works in any CI system. No accounts. No dashboards.


The Deeper Issue

The reason agent testing is broken isn’t technical. The tooling is straightforward to build.

The reason is cultural. The teams shipping LLM applications came from two worlds:

ML engineers think about evaluation as a training-time concern. You eval the model, you ship the model, done. Application behavior is someone else’s problem.

Software engineers think about testing as a code correctness concern. The LLM is a black box — you can’t unit test a neural network, so you don’t test at all.

Neither group has internalized that LLM applications are probabilistic systems with testable behavioral contracts. That’s a new thing. It requires a new practice.

That practice is agent regression testing. It needs to become as routine as writing unit tests. The tools to do it are simple — they just need to exist and be usable.


Start Today

You don’t need a framework. Here’s the minimum viable version:

  1. Pick your three most critical agent behaviors.
  2. Write a scenario YAML for each: input, and 2-3 terms that must appear in every valid output.
  3. Run your agent against those scenarios and save the outputs as a baseline JSON.
  4. On every deploy, run again and diff against the baseline.
  5. Fail the deploy if outputs drift beyond your tolerance.
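The five steps above can be sketched as one deploy-gate script. Everything here is illustrative: `run_agent` is whatever function calls your agent, the scenario dict mirrors the YAML shape shown earlier, and `SequenceMatcher` is a stand-in for whatever similarity measure you prefer.

```python
import json
from difflib import SequenceMatcher

TOLERANCE = 0.8

def gate(scenarios: dict, run_agent, baseline_path: str = "baseline.json") -> int:
    """Run every scenario, diff against the baseline, return a CI exit code."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    failed = False
    for name, spec in scenarios.items():
        output = run_agent(spec["input"])
        missing = [t for t in spec["expected_contains"] if t not in output.lower()]
        similarity = SequenceMatcher(None, baseline[name], output).ratio()
        if missing or similarity < TOLERANCE:
            failed = True
            print(f"FAIL {name}: similarity {similarity:.2f}, missing {missing}")
        else:
            print(f"PASS {name}")
    return 1 if failed else 0

# In a deploy script: raise SystemExit(gate(scenarios, run_agent))
```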

That’s it. If you want a tool that does this out of the box: github.com/fallenone269/agentprobe.


Agent testing is broken because nobody built the right tool yet. That’s a solvable problem.
