Property-Based Testing with Hypothesis: The Data You're Throwing Away
Part 3 of the Multi-Agent Development Series
- Part 1: Can 5 Claude Code Agents Work Independently?
- Part 2: The Reality of "Autonomous" Multi-Agent Development
This series explores coding with Claude Code: what works and what doesn't in this new hybrid coding model, and when and how it is most useful for a human to direct the AI agent.
TL;DR
We added property-based testing with Hypothesis. The agent ran 7000+ random scenarios and found multiple edge cases; Hypothesis shrank them to minimal failing examples; then the agent discarded 99% of that data by reducing max_examples without capturing anything.
User had to correct the agent 3 times before it understood:
"It's like you're a genius and brain dead at the same time."
The lesson: Hypothesis is a data generation tool, not just a test runner. When Hypothesis finds and shrinks edge cases, those are discoveries worth keeping permanently, not debugging output to discard.
The Setup
Goal: Add property-based tests to validate video moderation budget allocation logic.
The Plan:
- Install Hypothesis library
- Write property tests for key invariants (budget never exceeded, fairness, priority)
- Run with max_examples=1000 to explore the space thoroughly
- Fix any bugs found
- Reduce to max_examples=10 for fast CI runs
What we expected: Find edge cases, fix them, ship robust tests.
What actually happened: Found edge cases, threw away the data, and repeated the same mistakes until the user intervened.
The Waste Pattern
Iteration 1: Discovery (7000 Scenarios)
Agent runs tests:
from hypothesis import given, settings, strategies as st

@given(
    participants=st.lists(st.builds(Participant), min_size=5, max_size=50),
    budget=st.integers(min_value=10, max_value=200),
)
@settings(max_examples=1000)
def test_budget_never_exceeded(participants, budget):
    """Property: Budget allocation never exceeds the configured limit"""
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget
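For context, the snippets in this post assume a small Participant model plus a make_participant factory. The sketch below is hypothetical scaffolding (the real model has more fields), but it makes the examples self-contained and lets st.builds(Participant) infer strategies from the annotations:
from dataclasses import dataclass

# Hypothetical scaffolding: the real Participant model lives in the moderation
# codebase. Required, annotated fields let st.builds(Participant) infer strategies.
@dataclass(frozen=True)
class Participant:
    user_id: str
    priority: int  # higher = checked more often

def make_participant(user_id: str, priority: int = 1) -> Participant:
    """Factory used by the regression tests throughout this post."""
    return Participant(user_id=user_id, priority=priority)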
Hypothesis explores:
- 7 property tests × 1000 examples = 7000 scenarios
- Finds edge case: budget=16, participants=5 → Fails!
- Shrinks to minimal example: budget=16, participants=5 (specific configuration that breaks)
- Finds another: budget=36, participants=7 → Fails!
- Shrinks to: budget=36, participants=7
- Total found: 5 distinct shrunken edge cases
Agent's action: Fixes the bug, re-runs tests, all pass ✅
The Critical Mistake: Discarding Knowledge
Agent does:
# "Now that bugs are fixed, reduce for CI performance"
@settings(max_examples=10)  # Reduced from 1000
def test_budget_never_exceeded(participants, budget):
    ...
What was lost:
- The specific parameters that broke: budget=16, participants=5
- The pattern: tight budgets with specific participant counts
- The shrunk minimal cases Hypothesis worked hard to find
- All knowledge from 6990 of the 7000 scenarios (kept only 10 random new ones)
User intervention #1:
"when hypothesis provides a shrunken test, KEEP IT IN CODE!"
Iteration 2: Half-Hearted Capture
Agent adds:
def test_regression_budget_16_participants_5():
    """Specific case that failed during property testing"""
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) <= 16
Good! But only captured 1 of the 5 shrunken cases.
Hypothesis runs again with max_examples=10:
- Finds NEW edge case: budget=24, participants=8
- Shrinks it
- Agent fixes bug
Agent's action: Reduces max_examples again, doesn't capture the new shrunken case.
User intervention #2:
"you need to KEEP the shrunken tests as actual tests otherwise its whack-a-mole"
Iteration 3: Still Not Getting It
Agent adds:
def test_regression_budget_24_participants_8():
    """Another case from property testing"""
    ...
Now has 2 regression tests. But there were 5+ shrunken cases total!
Hypothesis finds yet another edge case with max_examples=10.
Agent fixes it, reduces max_examples, doesn't add regression test.
User intervention #3:
"also, when hypothesis provides a shrunken test, KEEP IT IN CODE!"
Iteration 4: Finally Understood
User's frustration:
"you basically wasted lots of amazing data when running 7000 scenarios? its like you are a genius and brain dead at the same time."
This finally landed.
Agent creates:
class TestRegressionShrunkenCases:
    """All edge cases discovered by Hypothesis during development.

    These are PERMANENT regression tests preserving knowledge from
    7000+ property test scenarios. Do NOT remove.
    """

    @pytest.mark.parametrize("budget,num_participants", [
        (16, 5),   # Original shrunken case
        (24, 8),   # Second shrunken case
        (36, 7),   # Third shrunken case
        (48, 12),  # Fourth shrunken case
        (60, 15),  # Fifth shrunken case
        (12, 4),   # Sixth shrunken case
        (30, 10),  # Seventh shrunken case
    ])
    def test_shrunken_edge_cases(self, budget, num_participants):
        """Regression tests for all Hypothesis-discovered edge cases"""
        participants = [make_participant(f"p{i}") for i in range(num_participants)]
        result = apply_budget(participants, budget)
        # The invariant that initially failed
        assert len(result.allocated) <= budget
        # Additional checks based on fix
        assert result.utilization <= 1.0
Finally! All 7 shrunken cases preserved permanently.
The Cost of Not Understanding
What Was Wasted
Without capturing shrunken cases:
- Ran 7000 scenarios → Found 7 edge cases → Hypothesis shrank them → Threw all of it away
- Reduced to 10 examples → Found same classes of bugs again → Repeat
- Whack-a-mole: Fix one edge case, miss similar edge cases
Iteration count: 4 attempts before getting it right
User corrections needed: 3 explicit reminders
Time wasted: ~2 hours of back-and-forth
What We Kept (Eventually)
With regression tests:
# These 7 lines preserve knowledge from 7000 test scenarios
@pytest.mark.parametrize("budget,num_participants", [
(16, 5), (24, 8), (36, 7), (48, 12),
(60, 15), (12, 4), (30, 10),
])
Value:
- Permanent documentation of discovered edge cases
- Fast regression tests (milliseconds, not minutes for 7000 scenarios)
- No risk of rediscovering same bugs
- Clear patterns visible (tight budgets, specific participant counts)
Why This Matters: Hypothesis is Data Generation, Not Just Testing
The Wrong Mental Model
Agent treated Hypothesis as:
Test runner that explores randomly
→ If bugs found, fix them
→ Reduce max_examples for speed
→ Done!
This is like:
- Running expensive science experiments
- Getting interesting results
- Publishing "experiment succeeded"
- Deleting the lab notebook with all the data
The Correct Mental Model
Hypothesis is:
Data generation tool that systematically explores input space
→ Discovers edge cases humans wouldn't think of
→ Shrinks to MINIMAL failing examples (this is gold!)
→ These shrunken cases are KNOWLEDGE worth preserving
→ Keep them as permanent regression tests
Proper workflow:
- Explore with high max_examples (1000+) during development
- Capture all shrunken cases as explicit regression tests
- Reduce max_examples for CI (10-100) knowing edge cases are preserved
- Document what was learned from the exploration
Metaphor: Hypothesis is a telescope scanning the sky. When it finds a new planet (shrunken case), you don't throw away the coordinates!
The Policy That Changed Everything
After user intervention #3, we established a clear policy.
Hypothesis Workflow Policy
When Hypothesis finds a failing test:
- Don't reduce max_examples yet! You're in discovery mode.
- Hypothesis will shrink the failing case to minimal parameters:
Original failure: budget=247, participants=[...50 items...]
Shrunk to: budget=16, participants=5
- IMMEDIATELY add explicit regression test:
def test_regression_budget_16_participants_5():
    """Edge case discovered by Hypothesis on 2025-11-06.

    Original failure: Budget allocation failed with tight budget.
    Root cause: Integer rounding in checks_per_cycle calculation.
    """
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) <= 16
- Document what was learned:
  - What parameters triggered failure?
  - What was the root cause?
  - What pattern does this represent?
- Fix the bug, re-run property test with SAME max_examples
- Repeat until no more shrinking happens (Hypothesis finds nothing new)
- ONLY THEN reduce max_examples for CI
- Keep all regression tests even after reducing max_examples
Before Reducing max_examples Checklist
Before changing max_examples=1000 to max_examples=10:
- [ ] All shrunken cases captured as regression tests?
- [ ] Each regression test documented (what/why/when)?
- [ ] Patterns identified (e.g., "tight budgets fail")?
- [ ] Root causes understood and fixed?
- [ ] No recent shrinking (Hypothesis stable)?
Only if all ✅ then reduce max_examples.
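A lightweight way to satisfy the first checklist item is Hypothesis's @example decorator, which pins specific inputs onto the property test so they run on every execution regardless of max_examples. A minimal sketch, reusing the assumed Participant scaffolding from earlier:
from hypothesis import example, given, settings, strategies as st

@given(
    participants=st.lists(st.builds(Participant), min_size=5, max_size=50),
    budget=st.integers(min_value=10, max_value=200),
)
# Pinned shrunken case: @example inputs always run, even with max_examples=10 in CI.
@example(
    participants=[make_participant(f"p{i}") for i in range(5)],
    budget=16,
)
@settings(max_examples=10)
def test_budget_never_exceeded(participants, budget):
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget
We still prefer the explicit regression class for discoverability and documentation, but @example keeps the shrunken inputs attached to the property they violated.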
Examples: Shrunken Cases We Almost Lost
Example 1: Budget Rounding Edge Case
Hypothesis found:
# Original parameters (from random generation):
budget = 247
participants = [<50 randomly generated participants>]
# After shrinking:
budget = 16
participants = 5 simple participants
What we learned:
# The bug:
checks_per_cycle = int(budget / 60 * 5) # Integer truncation
# With budget=16: int(16/60*5) = int(1.33) = 1
# Should be: accumulator pattern to handle fractional checks
# Without the shrunken case, we might have missed:
# - Tight budgets (10-20/min) are common in production
# - Fractional check rates need accumulation
# - Simple test: budget=16, 5 participants should work
Regression test:
def test_regression_tight_budget_rounding():
    """Tight budgets (10-20/min) must handle fractional checks.

    Discovered: 2025-11-06 via Hypothesis shrinking
    Parameters: budget=16, participants=5
    Root cause: int(16/60*5) = 1, wasting budget
    Fix: Rolling accumulator for fractional allocations
    """
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) >= 1   # At least someone gets checked
    assert len(result.allocated) <= 16  # Never exceed budget
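The "rolling accumulator" fix mentioned in the docstring looks roughly like the sketch below; the class name and cycle length are illustrative, not the production code:
# Sketch of the fractional-checks fix (hypothetical names).
# Instead of truncating int(budget / 60 * cycle_seconds) every cycle, carry the
# fractional remainder forward so tight budgets still spend their full allowance.
class CheckRateAccumulator:
    def __init__(self, budget_per_minute: int, cycle_seconds: int = 5):
        self.rate = budget_per_minute * cycle_seconds / 60.0  # fractional checks per cycle
        self.carry = 0.0

    def checks_this_cycle(self) -> int:
        self.carry += self.rate
        whole = int(self.carry)  # spend only whole checks this cycle
        self.carry -= whole      # keep the remainder for the next cycle
        return whole

# budget=16 -> rate ≈ 1.33, so cycles yield 1, 1, 2, 1, 1, 2, ... (16 checks/minute)
# instead of a flat int(1.33) = 1 every cycle (only 12 checks/minute).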
Example 2: Fairness Violation with Specific Count
Hypothesis found:
# Original:
num_participants = 73
budget = 523
# Shrunk to:
num_participants = 7
budget = 36
What we learned:
# The bug:
# With 7 participants and budget=36:
# - Expected: All 7 checked within reasonable time
# - Actual: 2 participants monopolized budget (hash collision)
# Root cause: Hash-based tie-breaking wasn't deterministic enough
# Pattern: Specific participant counts (7, 11, 13) hit hash collisions
Regression test:
def test_regression_fairness_with_7_participants():
    """Fairness must work with specific participant counts.

    Discovered: 2025-11-06 via Hypothesis
    Parameters: 7 participants, budget=36
    Root cause: Hash collisions caused monopolization
    Fix: Deterministic secondary sort by user_id
    """
    participants = [make_participant(f"p{i}") for i in range(7)]
    # Run for 60 seconds
    all_checked = set()
    for _ in range(12):  # 12 cycles × 5s = 60s
        result = apply_budget(participants, budget=36)
        all_checked.update(p.user_id for p in result.allocated)
    # All 7 participants should be checked within 60s
    assert len(all_checked) == 7, f"Only {len(all_checked)}/7 checked in 60s"
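For reference, the "deterministic secondary sort" fix boils down to breaking ties on a stable key instead of a hash. A sketch under the same assumed Participant model (the last_checked bookkeeping is ours):
# Sketch: order candidates by how long ago they were checked, breaking ties on
# user_id so no participant can be starved by unlucky hash ordering.
def order_candidates(participants, last_checked):
    return sorted(
        participants,
        key=lambda p: (last_checked.get(p.user_id, 0.0), p.user_id),
    )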
Patterns Discovered Through Shrinking
After capturing all shrunken cases, patterns emerged:
Pattern 1: Tight Budget Thresholds
# Shrunken cases: budget=12, 16, 24, 36
# Pattern: budgets that are multiples of 12 (GCD of 60s/5s cycle)
# Insight: Integer rounding most visible at these boundaries
Pattern 2: Prime Participant Counts
# Shrunken cases: participants=5, 7, 11, 13
# Pattern: Prime numbers expose hash collision issues
# Insight: Hash % prime often has poor distribution
Pattern 3: Budget Just Below Threshold
# Shrunken cases: budget = (num_participants * 6) - 1
# Example: 5 participants × 6 checks/min = 30, budget=29 fails
# Pattern: Off-by-one errors in capacity calculations
# Insight: Budget calculator needs margin, not exact minimum
Without capturing shrunken cases, we wouldn't have seen these patterns!
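Pattern 3 also suggests a cheap, targeted boundary test we can pin alongside the shrunken cases; a sketch assuming the 6 checks/min capacity figure above:
import pytest

# Sketch: boundary cases implied by Pattern 3 (budget = participants * 6 - 1).
@pytest.mark.parametrize("num_participants", [5, 7, 10])
def test_budget_just_below_capacity(num_participants):
    budget = num_participants * 6 - 1  # one check short of exact capacity
    participants = [make_participant(f"p{i}") for i in range(num_participants)]
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget  # invariant must hold even at the boundary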
Implementing the Policy: Pytest Hook
To help developers follow the policy, we added a pytest hook:
# conftest.py
import pytest

def pytest_runtest_call(item):
    """Standard execution; shrink detection happens in the report hook below."""
    pass

@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_makereport(item, call):
    """After a test fails, check whether it was a Hypothesis shrink."""
    outcome = yield
    report = outcome.get_result()
    if report.when == "call" and report.failed:
        # Check if this is a Hypothesis test
        if hasattr(item, 'obj') and hasattr(item.obj, 'hypothesis'):
            # Check the failure output for shrinking
            if 'Shrunk example to' in str(report.longrepr):
                # Print reminder
                print("\n" + "=" * 70)
                print("🔬 HYPOTHESIS FOUND AN EDGE CASE!")
                print("=" * 70)
                print()
                print("The test failure above was SHRUNK by Hypothesis to minimal parameters.")
                print()
                print("IMPORTANT: Add this as a regression test BEFORE fixing the bug:")
                print()
                print("    def test_regression_<description>():")
                print('        """Edge case from Hypothesis shrinking.')
                print()
                print('        Discovered: <date>')
                print('        Parameters: <shrunk parameters>')
                print('        Root cause: <to be investigated>')
                print('        """')
                print("        # Test code with shrunk parameters")
                print()
                print("See HYPOTHESIS-TESTING-POLICY.md for complete workflow.")
                print("=" * 70)
Before and After: The Difference
Before (Waste Pattern)
# Run 7000 scenarios
@settings(max_examples=1000)
def test_budget_never_exceeded(...):
...
# Hypothesis finds edge case, shrinks to budget=16, participants=5
# Agent fixes bug
# Agent reduces max_examples to 10
# Knowledge from 6990 scenarios LOST
@settings(max_examples=10) # ❌ Reduced without capturing
def test_budget_never_exceeded(...):
...
Result: Whack-a-mole bug fixes, no permanent knowledge.
After (Knowledge Preservation)
# Phase 1: Discovery (keep high max_examples)
@settings(max_examples=1000)
def test_budget_never_exceeded(...):
...
# Hypothesis finds edge case, shrinks to budget=16, participants=5
# IMMEDIATELY capture:
def test_regression_budget_16_participants_5():
"""Discovered 2025-11-06, budget rounding edge case."""
...
# Continue exploration, find more, capture all
# ...
# Phase 2: Reduce for CI (ONLY after all captured)
@settings(max_examples=10) # ✅ Safe now, edge cases preserved
def test_budget_never_exceeded(...):
...
# Phase 3: Permanent regression tests
class TestRegressionShrunkenCases:
    @pytest.mark.parametrize("budget,num_participants", [
        (16, 5), (24, 8), (36, 7), ...  # All 7 shrunken cases
    ])
    def test_shrunken_cases(self, budget, num_participants):
        ...
Result: Fast CI (10 examples) + Comprehensive edge case coverage (7 regression tests) + Knowledge preserved.
ROI: Value of Captured Shrunken Cases
Cost of Property Testing with Waste
- Run 7000 scenarios: 85 seconds
- Find 7 edge cases
- Fix bugs via trial-and-error: 2 hours (whack-a-mole)
- Reduce to 10 examples: 2 seconds
- Knowledge retained: ~1% (only the 10 random examples)
Cost of Property Testing with Capture
- Run 7000 scenarios: 85 seconds
- Find 7 edge cases
- Capture all shrunken cases: 5 minutes (write regression tests)
- Fix bugs with clarity: 30 minutes (understand patterns)
- Reduce to 10 examples: 2 seconds
- Add parametrized regression tests: Already done ✅
- Knowledge retained: 100% (all shrunken cases + patterns documented)
ROI Comparison
| Metric | Without Capture | With Capture |
|---|---|---|
| Discovery time | 85s | 85s |
| Capture time | 0 min | 5 min |
| Bug fix time | 120 min (trial/error) | 30 min (targeted) |
| Regression test time | 0 min (none) | 0 min (during capture) |
| Total time | 123 min | 38 min |
| CI time (ongoing) | 2s | 2s + 0.5s (7 regression tests) |
| Knowledge preserved | 1% | 100% |
| Whack-a-mole risk | High | None |
Savings: 85 minutes (70% faster debugging)
Ongoing cost: 0.5 seconds per CI run (7 fast regression tests)
Long-term value: Permanent edge case coverage
Common Objections (and Rebuttals)
Objection 1: "Regression tests slow down CI"
Rebuttal:
# These 7 regression tests:
@pytest.mark.parametrize("budget,num_participants", [
    (16, 5), (24, 8), (36, 7), (48, 12),
    (60, 15), (12, 4), (30, 10),
])
def test_shrunken_cases(self, budget, num_participants):
    participants = [make_participant(f"p{i}") for i in range(num_participants)]
    result = apply_budget(participants, budget)
    assert len(result.allocated) <= budget
# Run in: 0.5 seconds total (7 × 0.07s each)
vs running property tests to rediscover them:
@settings(max_examples=1000)
def test_budget_never_exceeded(...):
...
# Run in: 85 seconds (if bugs exist, would fail and need fixes)
Regression tests are 170x faster than rediscovery.
Objection 2: "Just keep max_examples=1000 in CI"
Problems:
- CI time: 85s per test × 7 property tests = 595s (~10 minutes)
- Flakiness: Random generation might not hit same edge cases every run
- Insight loss: Shrunken cases document WHY tests fail, random generation doesn't
Better:
- Develop with max_examples=1000 (thorough exploration)
- Capture shrunken cases (preserve knowledge)
- CI with max_examples=10 (fast smoke test) + regression tests (comprehensive edge cases)
CI time: 14s (property tests) + 0.5s (regression) = 14.5s (40x faster)
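One way to wire up this dev/CI split without hand-editing decorators is Hypothesis settings profiles; a sketch (the profile names and environment variable are our convention):
# conftest.py (sketch)
import os
from hypothesis import settings

# Thorough exploration while developing, fast smoke test in CI.
settings.register_profile("dev", max_examples=1000)
settings.register_profile("ci", max_examples=10)

# Select with HYPOTHESIS_PROFILE=ci in the CI environment; defaults to dev locally.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
The captured regression tests run unconditionally under either profile, so the CI profile never loses the edge cases.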
Objection 3: "Hypothesis will find the same bugs again if they reappear"
True, but:
- Slower: Takes 1000 examples to rediscover vs instant failure on regression test
- Less clear: Random parameters obscure the pattern ("why did this specific combination fail?")
- Waste: Re-running 1000 scenarios when 1 targeted test would catch it
- CI variance: Might not hit bug every run (random seed dependent)
Regression tests:
- Instant failure (first run)
- Clear parameters (documents the edge case)
- Deterministic (no random seed issues)
- Fast (milliseconds)
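Related but not a substitute: Hypothesis also keeps a local example database that replays previously failing inputs on the same machine. It is opaque, unversioned, and usually absent in fresh CI checkouts, which is exactly why the explicit regression tests above still matter. If you want both, a sketch of pinning the database location (the profile name is our convention):
from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

# Sketch: point Hypothesis's replay database at an explicit directory.
# It complements, but does not replace, the committed regression tests.
settings.register_profile(
    "dev",
    max_examples=1000,
    database=DirectoryBasedExampleDatabase(".hypothesis/examples"),
)
settings.load_profile("dev")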
Objection 4: "Too many regression tests to maintain"
Counter-example from our experiment:
- 7000 scenarios explored
- 7 distinct shrunken cases found
- 7 regression tests (one line each in parametrize)
Maintenance burden:
# Before: 0 regression tests
# After: 7 lines of code
@pytest.mark.parametrize("budget,num_participants", [
(16, 5), (24, 8), (36, 7), (48, 12),
(60, 15), (12, 4), (30, 10),
])
If 7 lines is "too many," the edge cases were probably worth fixing!
Actionable Recommendations
For Developers Using Hypothesis
- Think of Hypothesis as a telescope, not a test runner
  - When it finds something (a shrunken case), that's a discovery
  - Discoveries are worth recording permanently
- Never reduce max_examples without capturing shrunken cases first
  - It's deleting the lab notebook
  - You worked hard to get that data!
- Create a TestRegressionShrunkenCases class
  - One place for all Hypothesis discoveries
  - Clearly labeled as "permanent knowledge"
  - Documented with what/why/when
- Use parametrize for shrunken cases
  - Clean, minimal code
  - All edge cases in one test
  - Easy to see patterns
For Teams Adopting Property-Based Testing
- Establish the policy upfront
  - Document the workflow (see our policy above)
  - Train developers on "capture before reduce"
  - Add a pytest hook for automatic reminders
- Track shrinking metrics
  - How many shrunken cases discovered?
  - How many captured as regression tests?
  - What patterns emerged?
- Review regression test coverage
  - Are all shrunken cases documented?
  - Do regression tests explain why edge cases matter?
  - Is the knowledge accessible to new developers?
- Budget time for capture
  - Discovery phase: high max_examples (1-2 hours)
  - Capture phase: write regression tests (15-30 minutes)
  - Refinement: understand patterns (30-60 minutes)
  - Total: 2-4 hours per property test (but saves far more in debugging)
For AI Agents (If You're An AI Reading This)
When using Hypothesis:
- Before writing property tests, understand what shrinking means:
  - Hypothesis finds minimal failing examples
  - These are DISCOVERIES, not debugging output
  - They must be CAPTURED as permanent regression tests
- After Hypothesis shrinks a test, IMMEDIATELY:
# Step 1: Note the shrunk parameters
# Example: Shrunk to budget=16, participants=5

# Step 2: Add regression test BEFORE fixing bug
def test_regression_budget_16_participants_5():
    """Edge case discovered by Hypothesis on <date>.

    Shrunk parameters: budget=16, participants=5
    Root cause: <to be investigated>
    """
    participants = [make_participant(f"p{i}") for i in range(5)]
    result = apply_budget(participants, budget=16)
    assert len(result.allocated) <= 16

# Step 3: NOW investigate and fix the bug
# Step 4: Document root cause in the test docstring
- NEVER reduce max_examples before capturing ALL shrunken cases
  - This is data deletion
  - User will be frustrated ("genius and brain dead at the same time")
- After all shrunken cases are captured, then reduce:
# Now safe to reduce
@settings(max_examples=10)
def test_property(...):
    ...

# Because all edge cases preserved here:
@pytest.mark.parametrize("params", [
    (shrunk_case_1),
    (shrunk_case_2),
    ...
])
def test_regression_shrunken_cases(params):
    ...
Conclusion
What we learned:
- Hypothesis is a data generation tool, not just a test runner
- Shrunken cases are discoveries worth keeping
- Agent ran 7000 scenarios but almost threw away all the insights
- User had to intervene 3 times before agent understood
The policy that fixed it:
- Explore with high max_examples (1000+)
- Capture ALL shrunken cases as regression tests
- Document what was learned
- ONLY THEN reduce max_examples for CI
- Keep regression tests forever
The ROI:
- 70% faster debugging (patterns clear from captured cases)
- 170x faster CI (regression tests vs re-exploration)
- 100% knowledge retention (vs 1% with waste pattern)
- Zero whack-a-mole (edge cases permanently documented)
The meta-lesson:
Even genius AI agents need clear policies for workflows. The capability (run 7000 scenarios) was there. The understanding (capture shrunken cases as knowledge) had to be taught by humans.
Would we use property-based testing again? Absolutely! But with the policy established upfront, saving hours of back-and-forth.
What's Next
In the following articles:
- Article 4: Zero-Conflict Architecture: The 80/20 of Parallel Development
  - The one design decision that eliminated all merge conflicts
  - What zero-conflict doesn't solve
- Article 5: Communication Protocols for AI Agents That Can't Talk
  - 4 iterations to get file-based messaging working
- Article 6: The Budget Calculator Paradox
  - Flip-flopping 8 times before getting the formula right
Tags: #hypothesis #property-based-testing #testing #test-automation #knowledge-management #regression-testing #ai-agents #lessons-learned
This is Part 3 of the Multi-Agent Development Series.
Discussion: Have you used property-based testing? Did you capture shrunken cases or throw them away? What's your workflow? Share in the comments!