I have a bias I should disclose up front. I think example-based
unit tests are the wrong default for systems that make decisions.
Decisions have shape — invariants that should hold regardless of
the input — and example tests verify shape on three or four hand-
picked inputs. Property-based tests verify the same shape on a
hundred randomly-generated inputs every CI run, and they find the

counterexamples I'd never have written by hand.
This is the story of fifteen correctness properties, the bug one of
them caught the day before the demo, and why I think every agent
codebase should have a property suite before it has unit tests.
The setup
The system I'm describing is an alert triage co-pilot built on
Hindsight for memory and
cascadeflow for runtime
intelligence. The mechanics aren't the point of this article —
there are separate write-ups on the bypass logic, the alert DNA
fingerprint, the cost curve, and the four-quadrant mode matrix.
What I want to talk about here is how I tested the thing.
The codebase is small (around 3,500 lines of Python), but it makes
a lot of decisions per analysis: extract a fingerprint, recall
prior incidents, score consistency, decide whether to bypass the
strong model, route inference, record cost, emit an audit trail.
Each of those decisions has invariants that don't depend on a
specific input.
I wrote those invariants as Hypothesis properties.
What a property looks like
The simplest one is the round-trip on the alert fingerprint. A
fingerprint serializes to a canonical string and parses back to an
equivalent fingerprint. The property:
# tests/property/test_fingerprint.py
@given(fp=alert_fingerprint_strategy())
@settings(deadline=None, max_examples=200)
def test_fingerprint_round_trip(fp: AlertFingerprint) -> None:
"""Feature: openrecall, Property 1: format/parse round-trip."""
serialized = format_fingerprint(fp)
assert parse_fingerprint(serialized) == fp
alert_fingerprint_strategy is a Hypothesis strategy I wrote
once, in a conftest.py shared across the property suite. It
generates AlertFingerprint instances with semi-realistic field
values — enums for the closed sets, free strings for the open
ones, empty strings explicitly included.
Hypothesis runs this property 200 times per CI build with 200
different fingerprints. If the format changes and someone forgets
to update the parser, the property fails on the first malformed
fingerprint and prints the minimal counterexample.
That's the entire pitch. One declaration replaces what would
otherwise be ten or twenty hand-written examples, and it shrinks
to the smallest counterexample on failure.
The fifteen properties
The full set, summarized:
| ID | Property |
|---|---|
| P1 | Fingerprint format/parse round-trip |
| P2 | Fingerprint extraction is deterministic given the same input |
| P3 | Triage with no memory matches always escalates |
| P4 | Triage decision is a member of the closed TriageDecision Literal |
| P5 |
triage_confidence is in [0, 1]
|
| P6 | Strong-match score above threshold raises consistent_decision_count
|
| P7 |
retain is idempotent on (fingerprint, decision)
|
| P8 | Cumulative cost is monotonic non-decreasing |
| P9 | Bypass routes record cost_usd = 0.0 and preserve baseline_cost_usd
|
| P10 | Memory match scores are in [0, 1]
|
| P11 | Bypass fires iff all four bypass clauses are satisfied |
| P12 | Attack-pattern presence blocks bypass unconditionally |
| P13 | Retain dedupe key handles whitespace/case-equivalent inputs |
| P14 | Savings band (baseline_cumulative - actual_cumulative) is monotonic |
| P15 |
len(audit_trace) == len(route_trace) for every analysis |
These weren't all obvious on day one. I wrote about half of them
up-front from the spec, and the rest accumulated as I noticed
shapes I wanted to enforce. Each one corresponds to a sentence in
the requirements doc that says "the system shall ..." — that's
how I find them. Wherever the spec uses "shall," I look for an
invariant.
The bug P15 caught
P15 is the most important of the set, and it's the one that
caught the bug I would have shipped.
The property says: every routing decision the system records in
the RouteTrace has to have a corresponding AuditTraceEntry.
One per step. Same length. Same order.
# tests/property/test_workflow.py
@given(state=workflow_input_strategy())
@settings(deadline=None, max_examples=80)
def test_audit_trace_parity(state: WorkflowInput) -> None:
"""Feature: openrecall, Property 15: audit-trace completeness."""
result = workflow.analyze(state.raw_alert)
assert len(result.audit_trace) == len(result.route_trace), (
f"audit/route mismatch: route={len(result.route_trace)}, "
f"audit={len(result.audit_trace)}"
)
I wrote this property the day I added the bypass route. The bypass
emits a synthetic RouteTrace step with model="memory-bypass",
and on the first run, P15 failed.
The shrunk counterexample was a single alert that triggered the
bypass cleanly: route trace had three steps (normalize, fingerprint,
memory-bypass), audit trace had two. I had remembered to add the
synthetic route step but forgotten to add the matching audit
entry. The fix was a one-liner — call audit.record_step next to
the route emit — but if I'd been writing example tests, I'd have
needed to know in advance to test exactly that flow with exactly
those inputs.
This is the value Hypothesis generates. I didn't know that bug
existed. The property knew the shape that was supposed to hold,
and the strategy generated an input that violated it.
The strategies are the hard part
People sometimes describe property-based testing as "you write
a property and Hypothesis does the rest." That's misleading. The
hard part is writing strategies that generate semantically
plausible inputs.
Here's the strategy for an alert fingerprint:
# tests/property/conftest.py
from hypothesis import strategies as st_strat
ERROR_CLASSES = ["", "crashloopbackoff", "oomkilled", "http_5xx",
"timeout", "tls_handshake", "auth_failure"]
SERVICE_ROLES = ["", "api_edge", "batch_worker", "db_primary",
"queue_consumer", "scheduler"]
DEPENDENCY_PATTERNS = ["", "configmap", "dns", "downstream_api",
"queue", "object_store"]
SIGNAL_SHAPES = ["", "spike", "sustained", "sawtooth", "regression"]
ATTACK_PATTERNS = ["", "brute_force", "port_scan", "suspicious_login"]
ENVIRONMENTS = ["", "prod", "staging", "dev"]
def alert_fingerprint_strategy() -> st_strat.SearchStrategy[AlertFingerprint]:
return st_strat.builds(
AlertFingerprint,
error_class=st_strat.sampled_from(ERROR_CLASSES),
service_role=st_strat.sampled_from(SERVICE_ROLES),
dependency_pattern=st_strat.sampled_from(DEPENDENCY_PATTERNS),
signal_shape=st_strat.sampled_from(SIGNAL_SHAPES),
attack_pattern=st_strat.sampled_from(ATTACK_PATTERNS),
environment=st_strat.sampled_from(ENVIRONMENTS),
)
The empty strings are deliberate — they're realistic, because real
alerts often miss one or two fields. The closed sets are pulled
from the same constants the production code uses, which means
adding a new error class to the regex extractor automatically
expands the test space.
I have similar strategies for MemoryMatch, TriageResult,
CostCurvePoint, and the workflow input. Each one lives in
conftest.py and is named <noun>_strategy. Naming them
consistently is the difference between "I have a property test
suite" and "I have a property test framework."
Determinism: the seed that lets you trust CI
Property tests are stochastic by default, which means a flake in
one CI run might not reproduce in the next. That's a non-starter
for a project where I want CI failures to be a binding signal.
I forced determinism with one environment variable:
# tests/property/conftest.py
@pytest.fixture(scope="session", autouse=True)
def seed_hypothesis() -> None:
seed = int(os.environ.get("OPENRECALL_PBT_SEED", "20260101"))
settings.register_profile(
"ci",
max_examples=100,
derandomize=True,
deadline=None,
database=None,
)
settings.load_profile("ci")
os.environ.setdefault("HYPOTHESIS_SEED", str(seed))
derandomize=True plus a fixed seed means the same set of 100
fingerprints, 100 memory matches, 100 cost curves runs on every CI
build. If property P11 fails on commit X and I revert to commit
X-1, P11 either fails again or it doesn't — there's no flake.
The seed is 20260101, which is the date I started keeping the
project journal. There's no significance to the number beyond that
it's stable.
The smoke test as a fourth layer
Property tests cover the invariants. There's a fourth layer below
them: a smoke test that runs the end-to-end pipeline with a real
config and asserts the cockpit-level numbers match expectations.
# scripts/smoke_test.py
def smoke_queue_with_seeded_memory() -> None:
workflow = build_workflow()
seed_six_false_positives_per_family(workflow)
results = workflow.analyze_queue(load_seed_alerts())
bypass_steps = sum(
1 for r in results
for step in r.route_trace
if step.model == "memory-bypass"
)
assert bypass_steps >= 1, (
f"Expected at least one memory-bypass step after seeding; got {bypass_steps}"
)
print(f"smoke_queue_with_seeded_memory: bypass_steps={bypass_steps}, results={len(results)}")
The property suite verifies that the bypass code is correct in
isolation. The smoke test verifies that when you wire the whole
system together with realistic seed data, the bypass actually
fires. Both layers are needed. Properties tell you the unit is
correct; smoke tells you the system has the configuration to
exercise the unit.
What I'd recommend
Three principles, learned the hard way.
Properties before unit tests, not after. If you're writing
unit tests for an agent codebase, you're hand-picking the inputs
you can reason about. The bugs are in the inputs you can't.
Name the properties with stable IDs. P11, P15, P9 — every
property has a number that maps to a sentence in the spec. When
a property fails, the error message includes the ID, and I can
go look up what invariant got broken without re-reading the
test.
Make every fix add a property. When I find a bug that the
suite missed, I add the property that would have caught it.
The test suite gets stricter on every fix, which is the only
direction I want a test suite to evolve.
The full property suite is in tests/property/ in the
repo. The
strategies are in tests/property/conftest.py, the properties
themselves are split by module
(test_fingerprint.py, test_triage.py, test_memory.py,
test_cost_curve.py, test_workflow.py), and CI runs them on
every commit under OPENRECALL_PBT_SEED=20260101.
If you're building any kind of system that recalls memory and
makes routing decisions —
Hindsight for the
memory side,
cascadeflow for the
runtime intelligence side —
Vectorize's agent-memory overview
is a good place to start on why these systems need to be
testable in the first place. Property-based testing is the only
testing technique I've found that scales to systems where the
input space is bigger than what any human can enumerate.
Fifteen properties caught a bug I'd have shipped. That's the
return on the investment.
Top comments (0)