DEV Community

J. S. Morris


The Prompt Change That Broke Production at 2am

Why This Keeps Happening

When you test traditional software, you test a deterministic function. Same input, same output. If the output changes, something broke, the test fails, you investigate.

LLM agents are not deterministic functions. They’re probabilistic systems with behavioral contracts.

The contract isn’t “return exactly this string.” The contract is: given this class of inputs, the output must satisfy these structural and semantic properties.

The word “liability” should appear. The summary should be in bullet points. The termination clause should be mentioned. These are the invariants your downstream systems depend on — and they’re completely untested in most production LLM pipelines.
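Invariants like these can be written down as plain assertions over the output. A minimal sketch of such a contract check (a hypothetical helper, not agentprobe's implementation):

```python
def check_summary_contract(output: str) -> list[str]:
    """Return a list of violated invariants for a contract-summary output."""
    violations = []
    # Semantic anchors the downstream systems depend on
    for term in ("liability", "termination"):
        if term not in output.lower():
            violations.append(f"missing required term: {term!r}")
    # Structural invariant: the summary must be bullet points
    bullets = [ln for ln in output.splitlines()
               if ln.lstrip().startswith(("-", "*", "•"))]
    if not bullets:
        violations.append("no bullet points found")
    return violations
```

An empty list means the contract holds; anything else is a named violation you can surface in a test report.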

The industry’s response to this has been more evals. MMLU benchmarks, human preference ratings, red-team suites. Valuable for model builders. Useless for application developers who need to know whether their specific prompts still produce outputs their specific systems can rely on.

You’re not trying to measure whether Claude is generally intelligent. You’re trying to know whether your summarization prompt still hits the contract your parser expects. Those are completely different questions.


The Gap Nobody Is Filling

Here’s what the current tooling landscape looks like for an engineer who wants to regression-test their agent behavior:

Option 1: Unit tests with mocked LLMs.
Fast, deterministic, CI-friendly. Catches exactly nothing about actual model behavior because the model is mocked out.

Option 2: Manual spot-checking.
“Looks good to me.” Works until it doesn’t. Doesn’t scale. Doesn’t run on every deploy.

Option 3: Hosted eval platforms (LangSmith, etc.).
Powerful, but coupled to specific frameworks. Requires accounts, dashboards, infrastructure. Not a pip install and a YAML file.

Option 4: Nothing.
Most common option. “We’ll deal with it when something breaks.”

What nobody has built is the boring, obvious thing: a pytest for agent behavior. A tool that runs your scenarios, checks that outputs satisfy your contracts, compares against a baseline, and exits with code 1 when something drifts. Zero infrastructure. Works in any CI.
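The shape of that tool is small enough to sketch. Assuming a run_scenario helper that calls the model and checks its contract (stubbed here, since the real one is model-specific), the whole runner is a loop and an exit code:

```python
from pathlib import Path

def run_scenario(path: Path) -> tuple[bool, str]:
    # Hypothetical stand-in: a real runner would call the model with the
    # scenario's input and check the output against its contract.
    return True, ""

def run_all(scenario_dir: str) -> int:
    """Run every scenario file and return a CI-friendly exit code."""
    failures = 0
    for path in sorted(Path(scenario_dir).glob("*.yaml")):
        ok, report = run_scenario(path)
        print(f"  {'PASS' if ok else 'FAIL'}  {path.stem}")
        if not ok:
            print(f"          {report}")
            failures += 1
    return 1 if failures else 0  # non-zero exit code fails the CI job
```

Wrapping that in sys.exit(run_all("scenarios/")) is essentially all the CI plumbing a tool like this needs.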


What We Actually Need

The minimum viable agent regression test looks like this:

# scenarios/summarize_contract.yaml
name: summarize_contract
input: |
  Summarize this contract clause in 5 bullet points:
  "...The Contractor shall indemnify...termination upon 30 days notice..."
expected_contains:
  - liability
  - termination
max_tokens: 512

One file. Declares the input and the semantic anchors that must appear. Not exact strings — anchors. The things your downstream systems depend on.

Then you run it:

agentprobe run scenarios/ \
  --backend anthropic \
  --baseline baseline.json \
  --tolerance 0.8

And you get:

Model : claude-opus-4-6
────────────────────────────────────────────────
  ✓ PASS  summarize_contract
  ✗ FAIL  extract_parties
          Drift detected: similarity 0.61 < 0.80
          Missing expected terms: ['indemnification']
────────────────────────────────────────────────
  1/2 passed  (50%)

Exit code 1. CI fails. Nobody merges that prompt change until the contract is satisfied again.

The Tuesday incident gets caught on Wednesday morning, before it reaches production.


How Baseline Comparison Works

The drift detection is simple and effective: Jaccard similarity on output tokens compared against a saved baseline.

# Save a baseline after a known-good run
agentprobe run scenarios/ --backend anthropic --save-baseline baseline.json

# Future runs compare against it
agentprobe run scenarios/ --backend anthropic --baseline baseline.json --tolerance 0.8

--tolerance 0.8 means: allow up to 20% variance from baseline. Drop below that, fail.

This is deliberately not semantic similarity. Jaccard is fast, deterministic, and catches structural changes — the ones that break parsers — better than embeddings-based approaches for most cases. Semantic similarity is on the roadmap for v0.2.
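Jaccard similarity on token sets is only a few lines, which is part of the appeal. A sketch of what such a drift check can look like (not agentprobe's exact implementation):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two outputs' token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def drifted(baseline: str, current: str, tolerance: float = 0.8) -> bool:
    # Fail when similarity drops below the tolerance threshold
    return jaccard(baseline, current) < tolerance

print(jaccard("a b c d", "a b c e"))  # 3 shared / 5 total = 0.6
```

Because it operates on exact tokens, a paraphrase can trip it, but so does a vanished bullet list or a renamed field, which is exactly the failure mode you want surfaced.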

The baseline captures the behavioral fingerprint of a known-good run:

{
  "model": "claude-opus-4-6",
  "created_at": "2025-02-24T00:00:00+00:00",
  "scenarios": {
    "summarize_contract": {
      "raw_output": "...",
      "metrics": {
        "prompt_hash": "a1b2c3d4e5f6a7b8",
        "found_terms": ["liability", "termination"]
      }
    }
  }
}

The prompt_hash is there for a reason: if someone edits the prompt, the hash changes. You know the baseline may no longer be valid. You re-run and save a new one intentionally, rather than silently inheriting drift.
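The hash itself is nothing exotic. One way to produce a short, stable fingerprint (assuming SHA-256 truncated to 16 hex characters, matching the length in the JSON above; the real tool may differ):

```python
import hashlib

def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of a prompt's exact text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]

# Any edit to the prompt, even trailing whitespace, changes the hash
assert prompt_hash("Summarize this.") != prompt_hash("Summarize this. ")
```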


The CI Integration Is One Step

# .github/workflows/agent-tests.yml
- name: Agent regression tests
  run: |
    pip install "agentprobe[anthropic]"
    agentprobe run scenarios/ \
      --backend anthropic \
      --baseline baseline.json \
      --tolerance 0.8
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

That’s it. Every push. Every prompt change. Every model update. The test runs. If behavior drifts past tolerance, the build fails.

The Tuesday incident becomes: commit fails CI, engineer sees the drift report, reviews whether the change was intentional, either adjusts the prompt or updates the baseline deliberately.

No more 2am pages about empty liability clause arrays.


Start With Three Scenarios

You don’t need to test everything. Start with the three behaviors your system absolutely depends on.

For most LLM pipelines, these are:

  1. The happy path — core task with all expected output present
  2. A safety or refusal case — inputs the agent should decline or handle carefully
  3. The format-sensitive case — where downstream parsing depends on structural output

Write a YAML for each. Run them once with --save-baseline. Add the CI step. Done in an afternoon.
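For instance, the format-sensitive case might look like this (a hypothetical scenario, reusing the schema from the summarize_contract example above; the names and clause text are invented):

```yaml
# scenarios/extract_parties_json.yaml
name: extract_parties_json
input: |
  Extract the parties from this clause as a JSON array of names:
  "Agreement between Acme Corp and Globex LLC, effective immediately..."
expected_contains:
  - Acme Corp
  - Globex LLC
max_tokens: 256
```

If a model update starts wrapping the array in prose or dropping a party, the anchors go missing and the scenario fails before your JSON parser ever sees the output.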


Try It

The repo: github.com/fallenone269/agentprobe

pip install "agentprobe[anthropic]"
agentprobe init-scenario my_first_test scenarios/my_first_test.yaml
agentprobe run scenarios/ --backend mock  # no API key needed to start

If you try it and hit friction, open an issue. The roughest edges get smoothed first.

If it saves you a 2am page, star the repo. It helps other engineers find it.

If you have a use case that isn’t covered, start a discussion. The roadmap is driven by real production problems.


The agent testing gap is real, the 2am incidents are happening, and the tooling to prevent them is a pip install away.

The only missing piece was someone building it. That’s what this is.
