- Book: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture this: a pattern I've seen play out more than once. A team has 47 prompt tests in their CI. Green for six months. They bump the model version, and an extraction prompt starts returning the user's birth city instead of their billing city. The test suite stays green the whole time. Two of the 47 tests are assert "city" in output.lower(). Both pass. Both are useless.
This is the most common failure mode I see in prompt CI. Engineers who've spent ten years writing pytest assertions for deterministic functions reach for the same instinct on a probabilistic system: assert the output contains a string. It almost always misses the bug.
The thing nobody tells you when you start writing prompt tests: there is exactly one assertion shape that consistently catches regressions across model bumps, prompt edits, and silent provider drift. And one shape that looks similar but breaks every time the model gets better.
The wrong test, in detail
Here is the test pattern that shows up in roughly every prompt-eval PR I've seen:
```python
def test_extract_city_returns_billing():
    out = extract_address(invoice_text="""
Bill to: Acme Inc.
123 Market St, Berlin, 10115, Germany
""")
    assert "Berlin" in out
    assert "city" in out.lower()
```
It looks fine. It runs in 800ms. It catches the case where the model returns nothing. And it's nearly worthless.
Three problems. First, "Berlin" in out passes when the model says "The shipping address is in Berlin but no billing city was found." That is exactly the regression you're trying to catch: wrong field, right token. Second, "city" in out.lower() is asserting the model used the word "city" anywhere in its prose, which is a property of the prompt template, not the answer. Bump the system prompt to say "the response should be terse" and the test fails for a non-bug. Third, both assertions are oblivious to structure. If the model wraps the answer in markdown, returns JSON, or returns a polite refusal that happens to contain "city", the test can't tell the difference.
The bug here isn't laziness. It's that engineers learn to test by checking outputs against expected values, and LLM outputs are not values. They are populated structures.
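To make the failure concrete, here is the regressed output described above passing both assertions. The string is invented for illustration, but it is the exact shape of the bug:

```python
# Invented output for illustration: wrong field, phrased politely,
# containing the right token.
regressed_output = (
    "The shipping address is in Berlin, but no billing city was found."
)

# Both of the original assertions pass on this wrong answer.
assert "Berlin" in regressed_output
assert "city" in regressed_output.lower()
```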
The single test that actually catches regressions
Here is the test I want you to write first, before any other prompt test, for every LLM call you ship:
```python
import json

import pytest
from jsonschema import validate

from app.prompts import extract_address

ADDRESS_SCHEMA = {
    "type": "object",
    "required": ["street", "city", "postal_code", "country"],
    "properties": {
        "street": {"type": "string", "minLength": 3},
        "city": {"type": "string", "minLength": 2},
        "postal_code": {"type": "string"},
        "country": {"type": "string", "minLength": 2},
    },
    "additionalProperties": False,
}


@pytest.mark.parametrize("seed", range(5))
def test_address_extraction_is_structurally_valid(seed):
    out = extract_address(
        invoice_text=(
            "Bill to: Acme Inc.\n"
            "123 Market St, Berlin, 10115, Germany"
        ),
        temperature=0,
        seed=seed,
    )
    parsed = json.loads(out)
    validate(parsed, ADDRESS_SCHEMA)
    assert parsed["city"] == "Berlin"
    assert "Berlin" not in parsed["street"]
```
This test does four jobs at once, and each one catches a different class of regression.
It asserts the response parses as JSON. This is the cheapest, highest-signal check you can write. The vast majority of "the prompt broke" reports trace back to the model adding a markdown fence, an apologetic preamble, or a trailing comment. json.loads catches all three. If you only ever write one prompt test, write this one.
It validates the schema. Required keys, types, no extras. If the model decides to return "Berlin, Germany" as a single address field instead of structured components, the test fails immediately and points at the field. If the model invents a confidence key, additionalProperties: false rejects it. This is the assertion that the LLM eval community has converged on as the deterministic baseline. Anthropic's evals guidance specifically calls out JSON schema validity, required keys, enum membership, and forbidden strings as the fast, non-negotiable layer.
It pins the field value, not the prose. parsed["city"] == "Berlin" is an assertion about the answer. The phrasing of the answer is irrelevant. The model can be terser, more verbose, switch from JSON-stringified ints to plain strings: it doesn't matter. The field either equals "Berlin" or it doesn't.
It checks the negative. assert "Berlin" not in parsed["street"] is the assertion that catches the field-confusion class of bug. The model dumping the entire address into street is the failure mode that "city contains Berlin" never catches. Always include at least one negative assertion alongside a positive one.
The temperature=0 and the seed parametrize are doing the unglamorous work of making the test reproducible. At temperature 0 with a fixed seed, the same prompt to the same model returns the same output (subject to provider-side determinism, which most providers describe as "best effort"). Running over five seeds catches the case where the call is technically deterministic but the prompt is one re-roll away from a different field.
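One piece of that fast deterministic layer the schema can't express is the forbidden-strings check. A minimal sketch; the phrases below are illustrative placeholders, swap in the refusal and apology markers you actually see in your own logs:

```python
# Illustrative refusal/apology markers; replace with the ones you actually see.
FORBIDDEN_PHRASES = ["i'm sorry", "as an ai", "i cannot assist"]


def assert_no_forbidden_phrases(raw: str) -> None:
    lowered = raw.lower()
    hits = [phrase for phrase in FORBIDDEN_PHRASES if phrase in lowered]
    assert not hits, f"forbidden phrases in model output: {hits}"
```

Call it right after json.loads in any test where a polite refusal would otherwise slip through as a string-typed field.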
What this test still won't catch
Honest accounting. The structural test above won't tell you if the city is actually correct on a document it has never seen. It won't catch semantic regressions where the model starts returning the registered office instead of the billing address. It won't tell you when the prompt is suddenly 30% slower or 4× more expensive.
For those, you layer on top: a small golden set of 30–50 real documents with hand-labeled expected fields, a latency budget assertion, a token-count regression test. The structural test is the floor. It's the test that, alone, still gives you most of the regression coverage for a fraction of the engineering effort.
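Here is a sketch of the first two of those layers, assuming a hypothetical hand-labeled file at tests/prompts/golden.jsonl with one JSON case per line; both the path and the 5-second budget are stand-ins to tune for your own setup:

```python
import json
import time
from pathlib import Path

import pytest

from app.prompts import extract_address

# Hypothetical file: one hand-labeled case per line, e.g.
# {"invoice_text": "...", "expected": {"city": "Berlin", "country": "Germany"}}
GOLDEN = [
    json.loads(line)
    for line in Path("tests/prompts/golden.jsonl").read_text().splitlines()
    if line.strip()
]


@pytest.mark.parametrize("case", GOLDEN)
def test_golden_fields_and_latency(case):
    start = time.monotonic()
    out = extract_address(invoice_text=case["invoice_text"], temperature=0)
    elapsed = time.monotonic() - start

    parsed = json.loads(out)
    for field, expected in case["expected"].items():
        assert parsed[field] == expected
    assert elapsed < 5.0  # placeholder latency budget; tune to your own p95
```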
The "exact wording" test that breaks every Tuesday
The mirror image of the right test is the assertion engineers reach for when they want "stronger" coverage:
```python
def test_summary_phrasing():
    out = summarize(article=ARTICLE)
    assert out.startswith("This article discusses")
    assert "in summary" in out.lower()
    assert out.count(".") >= 3
    assert len(out) < 500
```
Every line of this test is a trap. startswith("This article discusses") couples the test to the model's stylistic prior. The next minor version of any frontier model will tighten its preambles and that line will fail across your whole eval suite simultaneously. assert "in summary" in out.lower() is a vocabulary check that has nothing to do with whether the summary is correct. The sentence count and length checks are a prose-detector pretending to be a quality metric.
The damage isn't just that these tests fail on irrelevant changes. It's that engineers, faced with a sea of red after a model bump, do the worst thing possible: they relax the assertions until the suite goes green. Now you have tests that don't fail on anything. The eval suite has become decorative.
If your assertion would change between two competent humans writing the same correct answer, it does not belong in your test suite.
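The salvage path for the summary test is the same move as before: ask for structure, then pin fields instead of phrasing. A sketch, assuming a hypothetical summarize_structured() prompt variant that returns JSON; ARTICLE stands in for the same fixture the phrasing test used:

```python
import json

from jsonschema import validate

from app.prompts import summarize_structured  # hypothetical JSON-returning variant

ARTICLE = "..."  # the same article fixture the phrasing test used

SUMMARY_SCHEMA = {
    "type": "object",
    "required": ["summary", "key_points"],
    "properties": {
        "summary": {"type": "string", "minLength": 1},
        "key_points": {
            "type": "array",
            "items": {"type": "string", "minLength": 1},
            "minItems": 1,
            "maxItems": 5,
        },
    },
    "additionalProperties": False,
}


def test_summary_is_structurally_valid():
    out = summarize_structured(article=ARTICLE, temperature=0)
    parsed = json.loads(out)
    validate(parsed, SUMMARY_SCHEMA)
```

None of those assertions move when the model drops its preamble or changes its sentence count.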
A pytest skeleton you can paste in
This is roughly the smallest useful prompt-test file. Drop it in tests/prompts/, swap the prompt and schema, and you have the structural floor working in your CI in under ten minutes:
```python
import json

import pytest
from jsonschema import validate

from app.prompts import call_prompt

SCHEMA = {
    "type": "object",
    "required": ["intent", "entities"],
    "properties": {
        "intent": {
            "type": "string",
            "enum": ["question", "command", "complaint"],
        },
        "entities": {"type": "array", "items": {"type": "string"}},
    },
    "additionalProperties": False,
}

CASES = [
    ("Where is my order?", "question"),
    ("Cancel my subscription.", "command"),
    ("This bill is wrong.", "complaint"),
]


@pytest.mark.parametrize("text,expected_intent", CASES)
def test_intent_classification(text, expected_intent):
    raw = call_prompt(text, temperature=0)
    parsed = json.loads(raw)
    validate(parsed, SCHEMA)
    assert parsed["intent"] == expected_intent
    assert isinstance(parsed["entities"], list)
```
Notice what it does not have. No assert "intent" in raw. No assert len(raw) > 20. No assert raw.startswith("{"). The schema validator is doing all of that, more rigorously, with a single line.
How to roll this out on an existing prompt
If you have a prompt in production and zero tests, the migration path is short. Pick the three most-trafficked calls in your app. For each, write the schema you wish the LLM was returning. Even if today it returns prose, write the schema for the structured version. Wrap the call in a thin layer that asks the model to return JSON matching the schema (most providers now support a structured-output flag). Add the structural test above. You are now better defended against silent regressions than most teams shipping LLM features.
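That thin layer can be very thin. A sketch under assumptions: llm_complete() stands in for whatever raw-completion helper you already have, and the fence-stripping is belt-and-suspenders on top of whatever structured-output support your provider offers:

```python
import json

from jsonschema import validate

from app.llm import llm_complete  # hypothetical: your existing raw-completion call


def call_structured(prompt: str, schema: dict, **kwargs) -> dict:
    """Ask for JSON matching `schema`, then enforce it before returning."""
    instruction = (
        f"{prompt}\n\n"
        "Return only a JSON object matching this schema, with no other text:\n"
        f"{json.dumps(schema)}"
    )
    raw = llm_complete(instruction, **kwargs).strip()

    # Strip a markdown fence if one sneaks in, so json.loads fails
    # only on genuinely malformed output.
    fence = "`" * 3
    if raw.startswith(fence):
        raw = raw.strip("`").removeprefix("json").strip()

    parsed = json.loads(raw)
    validate(parsed, schema)
    return parsed
```

Every call site and every test now goes through the same json.loads-plus-validate path, so the structural test exercises exactly what production runs.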
The teams that survive their next model bump aren't the ones with 200 tests. They're the ones with 20 tests that assert the right shape. Promptfoo and DeepEval both bake this pattern into their assertion DSLs (see their docs), and it's the pattern that survives in practice: deterministic structural checks catch the regressions; vibes-based assertions catch nothing and bleed credibility.
A prompt without a structural test is a prompt that will silently return the wrong field on a Tuesday afternoon, three months from now, when nobody is looking. Write the schema. Validate the JSON. Pin the field. Negative-assert the obvious failure mode. Then go work on the prompts that are actually hard.
If this was useful
The Prompt Engineering Pocket Guide has a chapter on writing prompts that pass schema-validation gates without the model fighting you: when to use structured outputs, when to use JSON mode, when to use a Pydantic-style schema in the prompt itself, and the prompt patterns that hold up under temperature 0 across model bumps.