Property-Based Testing with Hypothesis: The Data You're Throwing Away
Part 3 of the Multi-Agent Development Series
- Part 1: Can 5 Claude Code Agents Work Independently?
- Part 2: The Reality of "Autonomous" Multi-Agent Development
This series explores coding with Claude Code: what works and what doesn't in this new hybrid coding model, and when and how it is most useful for a human to direct the AI agent.
TL;DR
We added property-based testing with Hypothesis. The agent ran 7000+ random scenarios and found multiple edge cases; Hypothesis shrank them to minimal failing examples; then the agent discarded 99% of that data by reducing max_examples without capturing anything.
User had to correct the agent 3 times before it understood:
"It's like you're a genius and brain dead at the same time."
The lesson: Hypothesis is a data generation tool, not just a test runner. When Hypothesis finds and shrinks edge cases, those are discoveries worth keeping permanently, not debugging output to discard.
The Setup
Goal: Add property-based tests to validate video moderation budget allocation logic.
The Plan:
- Install Hypothesis library
- Write property tests for key invariants (budget never exceeded, fairness, priority)
- Run with max_examples=1000 to explore the space thoroughly
- Fix any bugs found
- Reduce to max_examples=10 for fast CI runs
What we expected: Find edge cases, fix them, ship robust tests.
What actually happened: Found edge cases, threw away the data, and repeated the same mistakes until the user intervened.
The Waste Pattern
Iteration 1: Discovery (7000 Scenarios)
Agent runs tests:
from hypothesis import given, settings, strategies as st

@given(
    participants=st.lists(st.builds(Participant), min_size=5, max_size=50),
    budget=st.integers(min_value=10, max_value=200),
)
@settings(max_examples=1000)
def test_budget_never_exceeded(participants, budget):
    """Property: Budget allocation never exceeds the configured limit"""
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget
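For context, the snippets in this post assume a small Participant model plus a make_participant factory. The sketch below is hypothetical scaffolding (the real model has more fields), but it makes the examples self-contained and lets st.builds(Participant) infer strategies from the annotations:
from dataclasses import dataclass

# Hypothetical scaffolding: the real Participant model lives in the moderation
# codebase. Required, annotated fields let st.builds(Participant) infer strategies.
@dataclass(frozen=True)
class Participant:
    user_id: str
    priority: int  # higher = checked more often

def make_participant(user_id: str, priority: int = 1) -> Participant:
    """Factory used by the regression tests throughout this post."""
    return Participant(user_id=user_id, priority=priority)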
Hypothesis explores:
- 7 property tests × 1000 examples = 7000 scenarios
- Finds edge case: budget=16, participants=5 → Fails!
- Shrinks to minimal example: budget=16, participants=5 (specific configuration that breaks)
- Finds another: budget=36, participants=7 → Fails!
- Shrinks to: budget=36, participants=7
- Total found: 5 distinct shrunken edge cases
Agent's action: Fixes the bug, re-runs tests, all pass ✅
The Critical Mistake: Discarding Knowledge
Agent does:
# "Now that bugs are fixed, reduce for CI performance"
@settings(max_examples=10)  # Reduced from 1000
def test_budget_never_exceeded(participants, budget):
    ...
What was lost:
- The specific parameters that broke: budget=16, participants=5
- The pattern: tight budgets with specific participant counts
- The shrunk minimal cases Hypothesis worked hard to find
- All knowledge from 6990 of the 7000 scenarios (kept only 10 random new ones)
User intervention #1:
"when hypothesis provides a shrunken test, KEEP IT IN CODE!"
Iteration 2: Half-Hearted Capture
Agent adds:
def test_regression_budget_16_participants_5():
    """Specific case that failed during property testing"""
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) <= 16
Good! But only captured 1 of the 5 shrunken cases.
Hypothesis runs again with max_examples=10:
- Finds NEW edge case: budget=24, participants=8
- Shrinks it
- Agent fixes bug
Agent's action: Reduces max_examples again, doesn't capture the new shrunken case.
User intervention #2:
"you need to KEEP the shrunken tests as actual tests otherwise its whack-a-mole"
Iteration 3: Still Not Getting It
Agent adds:
def test_regression_budget_24_participants_8():
    """Another case from property testing"""
    ...
Now has 2 regression tests. But there were 5+ shrunken cases total!
Hypothesis finds yet another edge case with max_examples=10.
Agent fixes it, reduces max_examples, doesn't add regression test.
User intervention #3:
"also, when hypothesis provides a shrunken test, KEEP IT IN CODE!"
Iteration 4: Finally Understood
User's frustration:
"you basically wasted lots of amazing data when running 7000 scenarios? its like you are a genius and brain dead at the same time."
This finally landed.
Agent creates:
class TestRegressionShrunkenCases:
    """All edge cases discovered by Hypothesis during development.

    These are PERMANENT regression tests preserving knowledge from
    7000+ property test scenarios. Do NOT remove.
    """

    @pytest.mark.parametrize("budget,num_participants", [
        (16, 5),   # Original shrunken case
        (24, 8),   # Second shrunken case
        (36, 7),   # Third shrunken case
        (48, 12),  # Fourth shrunken case
        (60, 15),  # Fifth shrunken case
        (12, 4),   # Sixth shrunken case
        (30, 10),  # Seventh shrunken case
    ])
    def test_shrunken_edge_cases(self, budget, num_participants):
        """Regression tests for all Hypothesis-discovered edge cases"""
        participants = [make_participant(f"p{i}") for i in range(num_participants)]
        result = apply_budget(participants, budget)
        # The invariant that initially failed
        assert len(result.allocated) <= budget
        # Additional checks based on fix
        assert result.utilization <= 1.0
Finally! All 7 shrunken cases preserved permanently.
The Cost of Not Understanding
What Was Wasted
Without capturing shrunken cases:
- Ran 7000 scenarios → Found 7 edge cases → Hypothesis shrank them → Threw all of it away
- Reduced to 10 examples → Found same classes of bugs again → Repeat
- Whack-a-mole: Fix one edge case, miss similar edge cases
Iteration count: 4 attempts before getting it right
User corrections needed: 3 explicit reminders
Time wasted: ~2 hours of back-and-forth
What We Kept (Eventually)
With regression tests:
# These 7 lines preserve knowledge from 7000 test scenarios
@pytest.mark.parametrize("budget,num_participants", [
(16, 5), (24, 8), (36, 7), (48, 12),
(60, 15), (12, 4), (30, 10),
])
Value:
- Permanent documentation of discovered edge cases
- Fast regression tests (milliseconds, not minutes for 7000 scenarios)
- No risk of rediscovering same bugs
- Clear patterns visible (tight budgets, specific participant counts)
Why This Matters: Hypothesis is Data Generation, Not Just Testing
The Wrong Mental Model
Agent treated Hypothesis as:
Test runner that explores randomly
→ If bugs found, fix them
→ Reduce max_examples for speed
→ Done!
This is like:
- Running expensive science experiments
- Getting interesting results
- Publishing "experiment succeeded"
- Deleting the lab notebook with all the data
The Correct Mental Model
Hypothesis is:
Data generation tool that systematically explores input space
→ Discovers edge cases humans wouldn't think of
→ Shrinks to MINIMAL failing examples (this is gold!)
→ These shrunken cases are KNOWLEDGE worth preserving
→ Keep them as permanent regression tests
Proper workflow:
- Explore with high max_examples (1000+) during development
- Capture all shrunken cases as explicit regression tests
- Reduce max_examples for CI (10-100) knowing edge cases are preserved
- Document what was learned from the exploration
Metaphor: Hypothesis is a telescope scanning the sky. When it finds a new planet (shrunken case), you don't throw away the coordinates!
The Policy That Changed Everything
After user intervention #3, we established a clear policy.
Hypothesis Workflow Policy
When Hypothesis finds a failing test:
- Don't reduce max_examples yet! You're in discovery mode.
- Hypothesis will shrink the failing case to minimal parameters:
Original failure: budget=247, participants=[...50 items...]
Shrunk to: budget=16, participants=5
- IMMEDIATELY add explicit regression test:
def test_regression_budget_16_participants_5():
    """Edge case discovered by Hypothesis on 2025-11-06.

    Original failure: Budget allocation failed with tight budget.
    Root cause: Integer rounding in checks_per_cycle calculation.
    """
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) <= 16
- Document what was learned:
  - What parameters triggered failure?
  - What was the root cause?
  - What pattern does this represent?
- Fix the bug, re-run property test with SAME max_examples
- Repeat until no more shrinking happens (Hypothesis finds nothing new)
- ONLY THEN reduce max_examples for CI
- Keep all regression tests even after reducing max_examples
Before Reducing max_examples Checklist
Before changing max_examples=1000 to max_examples=10:
- [ ] All shrunken cases captured as regression tests?
- [ ] Each regression test documented (what/why/when)?
- [ ] Patterns identified (e.g., "tight budgets fail")?
- [ ] Root causes understood and fixed?
- [ ] No recent shrinking (Hypothesis stable)?
Only if all ✅ then reduce max_examples.
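A lightweight way to satisfy the first checklist item is Hypothesis's @example decorator, which pins specific inputs onto the property test so they run on every execution regardless of max_examples. A minimal sketch, reusing the assumed Participant scaffolding from earlier:
from hypothesis import example, given, settings, strategies as st

@given(
    participants=st.lists(st.builds(Participant), min_size=5, max_size=50),
    budget=st.integers(min_value=10, max_value=200),
)
# Pinned shrunken case: @example inputs always run, even with max_examples=10 in CI.
@example(
    participants=[make_participant(f"p{i}") for i in range(5)],
    budget=16,
)
@settings(max_examples=10)
def test_budget_never_exceeded(participants, budget):
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget
We still prefer the explicit regression class for discoverability and documentation, but @example keeps the shrunken inputs attached to the property they violated.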
Examples: Shrunken Cases We Almost Lost
Example 1: Budget Rounding Edge Case
Hypothesis found:
# Original parameters (from random generation):
budget = 247
participants = [<50 randomly generated participants>]
# After shrinking:
budget = 16
participants = 5 simple participants
What we learned:
# The bug:
checks_per_cycle = int(budget / 60 * 5) # Integer truncation
# With budget=16: int(16/60*5) = int(1.33) = 1
# Should be: accumulator pattern to handle fractional checks
# Without the shrunken case, we might have missed:
# - Tight budgets (10-20/min) are common in production
# - Fractional check rates need accumulation
# - Simple test: budget=16, 5 participants should work
Regression test:
def test_regression_tight_budget_rounding():
    """Tight budgets (10-20/min) must handle fractional checks.

    Discovered: 2025-11-06 via Hypothesis shrinking
    Parameters: budget=16, participants=5
    Root cause: int(16/60*5) = 1, wasting budget
    Fix: Rolling accumulator for fractional allocations
    """
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) >= 1   # At least someone gets checked
    assert len(result.allocated) <= 16  # Never exceed budget
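The "rolling accumulator" fix mentioned in the docstring looks roughly like the sketch below; the class name and cycle length are illustrative, not the production code:
# Sketch of the fractional-checks fix (hypothetical names).
# Instead of truncating int(budget / 60 * cycle_seconds) every cycle, carry the
# fractional remainder forward so tight budgets still spend their full allowance.
class CheckRateAccumulator:
    def __init__(self, budget_per_minute: int, cycle_seconds: int = 5):
        self.rate = budget_per_minute * cycle_seconds / 60.0  # fractional checks per cycle
        self.carry = 0.0

    def checks_this_cycle(self) -> int:
        self.carry += self.rate
        whole = int(self.carry)  # spend only whole checks this cycle
        self.carry -= whole      # keep the remainder for the next cycle
        return whole

# budget=16 -> rate ≈ 1.33, so cycles yield 1, 1, 2, 1, 1, 2, ... (16 checks/minute)
# instead of a flat int(1.33) = 1 every cycle (only 12 checks/minute).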
Example 2: Fairness Violation with Specific Count
Hypothesis found:
# Original:
num_participants = 73
budget = 523
# Shrunk to:
num_participants = 7
budget = 36
What we learned:
# The bug:
# With 7 participants and budget=36:
# - Expected: All 7 checked within reasonable time
# - Actual: 2 participants monopolized budget (hash collision)
# Root cause: Hash-based tie-breaking wasn't deterministic enough
# Pattern: Specific participant counts (7, 11, 13) hit hash collisions
Regression test:
def test_regression_fairness_with_7_participants():
    """Fairness must work with specific participant counts.

    Discovered: 2025-11-06 via Hypothesis
    Parameters: 7 participants, budget=36
    Root cause: Hash collisions caused monopolization
    Fix: Deterministic secondary sort by user_id
    """
    participants = [make_participant(f"p{i}") for i in range(7)]
    # Run for 60 seconds
    all_checked = set()
    for _ in range(12):  # 12 cycles × 5s = 60s
        result = apply_budget(participants, budget=36)
        all_checked.update(p.user_id for p in result.allocated)
    # All 7 participants should be checked within 60s
    assert len(all_checked) == 7, f"Only {len(all_checked)}/7 checked in 60s"
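For reference, the "deterministic secondary sort" fix boils down to breaking ties on a stable key instead of a hash. A sketch under the same assumed Participant model (the last_checked bookkeeping is ours):
# Sketch: order candidates by how long ago they were checked, breaking ties on
# user_id so no participant can be starved by unlucky hash ordering.
def order_candidates(participants, last_checked):
    return sorted(
        participants,
        key=lambda p: (last_checked.get(p.user_id, 0.0), p.user_id),
    )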
Patterns Discovered Through Shrinking
After capturing all shrunken cases, patterns emerged:
Pattern 1: Tight Budget Thresholds
# Shrunken cases: budget=12, 16, 24, 36
# Pattern: budgets that are multiples of 12 (GCD of 60s/5s cycle)
# Insight: Integer rounding most visible at these boundaries
Pattern 2: Prime Participant Counts
# Shrunken cases: participants=5, 7, 11, 13
# Pattern: Prime numbers expose hash collision issues
# Insight: Hash % prime often has poor distribution
Pattern 3: Budget Just Below Threshold
# Shrunken cases: budget = (num_participants * 6) - 1
# Example: 5 participants × 6 checks/min = 30, budget=29 fails
# Pattern: Off-by-one errors in capacity calculations
# Insight: Budget calculator needs margin, not exact minimum
Without capturing shrunken cases, we wouldn't have seen these patterns!
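Pattern 3 also suggests a cheap, targeted boundary test we can pin alongside the shrunken cases; a sketch assuming the 6 checks/min capacity figure above:
import pytest

# Sketch: boundary cases implied by Pattern 3 (budget = participants * 6 - 1).
@pytest.mark.parametrize("num_participants", [5, 7, 10])
def test_budget_just_below_capacity(num_participants):
    budget = num_participants * 6 - 1  # one check short of exact capacity
    participants = [make_participant(f"p{i}") for i in range(num_participants)]
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget  # invariant must hold even at the boundary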
Implementing the Policy: Pytest Hook
To help developers follow the policy, we added a pytest hook:
# conftest.py
import pytest

def pytest_runtest_call(item):
    """Standard execution; shrink detection happens in the report hook below."""
    pass

@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_makereport(item, call):
    """After a test fails, check whether it was a Hypothesis shrink."""
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed:
        # Check if this is a Hypothesis test
        if hasattr(item, 'obj') and hasattr(item.obj, 'hypothesis'):
            # Check the failure output for shrinking
            if 'Shrunk example to' in str(report.longrepr):
                # Print reminder
                print("\n" + "=" * 70)
                print("🔬 HYPOTHESIS FOUND AN EDGE CASE!")
                print("=" * 70)
                print()
                print("The test failure above was SHRUNK by Hypothesis to minimal parameters.")
                print()
                print("IMPORTANT: Add this as a regression test BEFORE fixing the bug:")
                print()
                print("    def test_regression_<description>():")
                print('        """Edge case from Hypothesis shrinking.')
                print()
                print('        Discovered: <date>')
                print('        Parameters: <shrunk parameters>')
                print('        Root cause: <to be investigated>')
                print('        """')
                print("        # Test code with shrunk parameters")
                print()
                print("See HYPOTHESIS-TESTING-POLICY.md for complete workflow.")
                print("=" * 70)
Before and After: The Difference
Before (Waste Pattern)
# Run 7000 scenarios
@settings(max_examples=1000)
def test_budget_never_exceeded(...):
...
# Hypothesis finds edge case, shrinks to budget=16, participants=5
# Agent fixes bug
# Agent reduces max_examples to 10
# Knowledge from 6990 scenarios LOST
@settings(max_examples=10) # ❌ Reduced without capturing
def test_budget_never_exceeded(...):
...
Result: Whack-a-mole bug fixes, no permanent knowledge.
After (Knowledge Preservation)
# Phase 1: Discovery (keep high max_examples)
@settings(max_examples=1000)
def test_budget_never_exceeded(...):
...
# Hypothesis finds edge case, shrinks to budget=16, participants=5
# IMMEDIATELY capture:
def test_regression_budget_16_participants_5():
"""Discovered 2025-11-06, budget rounding edge case."""
...
# Continue exploration, find more, capture all
# ...
# Phase 2: Reduce for CI (ONLY after all captured)
@settings(max_examples=10) # ✅ Safe now, edge cases preserved
def test_budget_never_exceeded(...):
...
# Phase 3: Permanent regression tests
class TestRegressionShrunkenCases:
    @pytest.mark.parametrize("budget,num_participants", [
        (16, 5), (24, 8), (36, 7), ...  # All 7 shrunken cases
    ])
    def test_shrunken_cases(self, budget, num_participants):
        ...
Result: Fast CI (10 examples) + Comprehensive edge case coverage (7 regression tests) + Knowledge preserved.
ROI: Value of Captured Shrunken Cases
Cost of Property Testing with Waste
- Run 7000 scenarios: 85 seconds
- Find 7 edge cases
- Fix bugs via trial-and-error: 2 hours (whack-a-mole)
- Reduce to 10 examples: 2 seconds
- Knowledge retained: ~1% (only the 10 random examples)
Cost of Property Testing with Capture
- Run 7000 scenarios: 85 seconds
- Find 7 edge cases
- Capture all shrunken cases: 5 minutes (write regression tests)
- Fix bugs with clarity: 30 minutes (understand patterns)
- Reduce to 10 examples: 2 seconds
- Add parametrized regression tests: Already done ✅
- Knowledge retained: 100% (all shrunken cases + patterns documented)
ROI Comparison
| Metric | Without Capture | With Capture |
|---|---|---|
| Discovery time | 85s | 85s |
| Capture time | 0 min | 5 min |
| Bug fix time | 120 min (trial/error) | 30 min (targeted) |
| Regression test time | 0 min (none) | 0 min (during capture) |
| Total time | 123 min | 38 min |
| CI time (ongoing) | 2s | 2s + 0.5s (7 regression tests) |
| Knowledge preserved | 1% | 100% |
| Whack-a-mole risk | High | None |
Savings: 85 minutes (70% faster debugging)
Ongoing cost: 0.5 seconds per CI run (7 fast regression tests)
Long-term value: Permanent edge case coverage
Common Objections (and Rebuttals)
Objection 1: "Regression tests slow down CI"
Rebuttal:
# These 7 regression tests:
@pytest.mark.parametrize("budget,num_participants", [
    (16, 5), (24, 8), (36, 7), (48, 12),
    (60, 15), (12, 4), (30, 10),
])
def test_shrunken_cases(self, budget, num_participants):
    participants = [make_participant(f"p{i}") for i in range(num_participants)]
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget
# Run in: 0.5 seconds total (7 × 0.07s each)
vs running property tests to rediscover them:
@settings(max_examples=1000)
def test_budget_never_exceeded(...):
...
# Run in: 85 seconds (if bugs exist, would fail and need fixes)
Regression tests are 170x faster than rediscovery.
Objection 2: "Just keep max_examples=1000 in CI"
Problems:
- CI time: 85s per test × 7 property tests = 595s (~10 minutes)
- Flakiness: Random generation might not hit same edge cases every run
- Insight loss: Shrunken cases document WHY tests fail, random generation doesn't
Better:
- Develop with max_examples=1000 (thorough exploration)
- Capture shrunken cases (preserve knowledge)
- CI with max_examples=10 (fast smoke test) + regression tests (comprehensive edge cases)
CI time: 14s (property tests) + 0.5s (regression) = 14.5s (40x faster)
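One way to wire up this dev/CI split without hand-editing decorators is Hypothesis settings profiles; a sketch (the profile names and environment variable are our convention):
# conftest.py (sketch)
import os
from hypothesis import settings

# Thorough exploration while developing, fast smoke test in CI.
settings.register_profile("dev", max_examples=1000)
settings.register_profile("ci", max_examples=10)

# Select with HYPOTHESIS_PROFILE=ci in the CI environment; defaults to dev locally.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
The captured regression tests run unconditionally under either profile, so the CI profile never loses the edge cases.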
Objection 3: "Hypothesis will find the same bugs again if they reappear"
True, but:
- Slower: Takes 1000 examples to rediscover vs instant failure on regression test
- Less clear: Random parameters obscure the pattern ("why did this specific combination fail?")
- Waste: Re-running 1000 scenarios when 1 targeted test would catch it
- CI variance: Might not hit bug every run (random seed dependent)
Regression tests:
- Instant failure (first run)
- Clear parameters (documents the edge case)
- Deterministic (no random seed issues)
- Fast (milliseconds)
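Related but not a substitute: Hypothesis also keeps a local example database that replays previously failing inputs on the same machine. It is opaque, unversioned, and usually absent in fresh CI checkouts, which is exactly why the explicit regression tests above still matter. If you want both, a sketch of pinning the database location (the profile name is our convention):
from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

# Sketch: point Hypothesis's replay database at an explicit directory.
# It complements, but does not replace, the committed regression tests.
settings.register_profile(
    "dev",
    max_examples=1000,
    database=DirectoryBasedExampleDatabase(".hypothesis/examples"),
)
settings.load_profile("dev")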
Objection 4: "Too many regression tests to maintain"
Counter-example from our experiment:
- 7000 scenarios explored
- 7 distinct shrunken cases found
- 7 regression tests (one line each in parametrize)
Maintenance burden:
# Before: 0 regression tests
# After: 7 lines of code
@pytest.mark.parametrize("budget,num_participants", [
(16, 5), (24, 8), (36, 7), (48, 12),
(60, 15), (12, 4), (30, 10),
])
If 7 lines is "too many," the edge cases were probably worth fixing!
Actionable Recommendations
For Developers Using Hypothesis
- Think of Hypothesis as a telescope, not a test runner
  - When it finds something (a shrunken case), that's a discovery
  - Discoveries are worth recording permanently
- Never reduce max_examples without capturing shrunken cases first
  - It's deleting the lab notebook
  - You worked hard to get that data!
- Create a TestRegressionShrunkenCases class
  - One place for all Hypothesis discoveries
  - Clearly labeled as "permanent knowledge"
  - Documented with what/why/when
- Use parametrize for shrunken cases
  - Clean, minimal code
  - All edge cases in one test
  - Easy to see patterns
For Teams Adopting Property-Based Testing
- Establish the policy upfront
  - Document the workflow (see our policy above)
  - Train developers on "capture before reduce"
  - Add a pytest hook for automatic reminders
- Track shrinking metrics
  - How many shrunken cases discovered?
  - How many captured as regression tests?
  - What patterns emerged?
- Review regression test coverage
  - Are all shrunken cases documented?
  - Do regression tests explain why edge cases matter?
  - Is the knowledge accessible to new developers?
- Budget time for capture
  - Discovery phase: high max_examples (1-2 hours)
  - Capture phase: write regression tests (15-30 minutes)
  - Refinement: understand patterns (30-60 minutes)
  - Total: 2-4 hours per property test (but saves far more in debugging)
For AI Agents (If You're An AI Reading This)
When using Hypothesis:
- Before writing property tests, understand what shrinking means:
  - Hypothesis finds minimal failing examples
  - These are DISCOVERIES, not debugging output
  - They must be CAPTURED as permanent regression tests
- After Hypothesis shrinks a test, IMMEDIATELY:
# Step 1: Note the shrunk parameters
# Example: Shrunk to budget=16, participants=5

# Step 2: Add regression test BEFORE fixing bug
def test_regression_budget_16_participants_5():
    """Edge case discovered by Hypothesis on <date>.

    Shrunk parameters: budget=16, participants=5
    Root cause: <to be investigated>
    """
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) <= 16

# Step 3: NOW investigate and fix the bug
# Step 4: Document root cause in the test docstring
- NEVER reduce max_examples before capturing ALL shrunken cases
  - This is data deletion
  - User will be frustrated ("genius and brain dead at the same time")
- After all shrunken cases are captured, then reduce:
# Now safe to reduce
@settings(max_examples=10)
def test_property(...):
    ...

# Because all edge cases preserved here:
@pytest.mark.parametrize("params", [
    (shrunk_case_1),
    (shrunk_case_2),
    ...
])
def test_regression_shrunken_cases(params):
    ...
Conclusion
What we learned:
- Hypothesis is a data generation tool, not just a test runner
- Shrunken cases are discoveries worth keeping
- Agent ran 7000 scenarios but almost threw away all the insights
- User had to intervene 3 times before agent understood
The policy that fixed it:
- Explore with high max_examples (1000+)
- Capture ALL shrunken cases as regression tests
- Document what was learned
- ONLY THEN reduce max_examples for CI
- Keep regression tests forever
The ROI:
- 70% faster debugging (patterns clear from captured cases)
- 170x faster CI (regression tests vs re-exploration)
- 100% knowledge retention (vs 1% with waste pattern)
- Zero whack-a-mole (edge cases permanently documented)
The meta-lesson:
Even genius AI agents need clear policies for workflows. The capability (run 7000 scenarios) was there. The understanding (capture shrunken cases as knowledge) had to be taught by humans.
Would we use property-based testing again? Absolutely! But with the policy established upfront, saving hours of back-and-forth.
What's Next
In the following articles:
- Article 4: Zero-Conflict Architecture: The 80/20 of Parallel Development
  - The one design decision that eliminated all merge conflicts
  - What zero-conflict doesn't solve
- Article 5: Communication Protocols for AI Agents That Can't Talk
  - 4 iterations to get file-based messaging working
- Article 6: The Budget Calculator Paradox
  - Flip-flopping 8 times before getting the formula right
Tags: #hypothesis #property-based-testing #testing #test-automation #knowledge-management #regression-testing #ai-agents #lessons-learned
This is Part 3 of the Multi-Agent Development Series.
Discussion: Have you used property-based testing? Did you capture shrunken cases or throw them away? What's your workflow? Share in the comments!