The Budget Calculator Paradox: When Tests Don't Match Reality
Part 6 of the Multi-Agent Development Series
- Part 1: Can 5 Claude Code Agents Work Independently?
- Part 2: The Reality of "Autonomous" Multi-Agent Development
- Part 3: Property-Based Testing with Hypothesis
- Part 4: Zero-Conflict Architecture
- Part 5: Communication Protocols for AI Agents
TL;DR
The Problem: Create a budget calculator to determine minimum checks/minute needed for fair participant checking.
What happened: Agent flip-flopped 8 times on the formula, changing tier ordering and capacity calculations reactively based on test failures instead of proactively based on requirements.
User intervention (3 times):
"you keep changing the policy without reflecting on this constant changes"
"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"
"make the calculator CORRECT and then provide it some extra margin"
The lessons:
- Build the calculator FIRST, then use it to validate test expectations
- Don't react to test failures by changing the policy - ask "is my expectation realistic?"
- Account for reality: Cycle quantization, integer rounding, margins for variance
- Separate concerns: Policy (what should happen) vs Budget (what's needed for it to work)
The correct pattern:
1. Define policy requirements (tier ordering, deadlines)
2. Build budget calculator based on requirements
3. Add safety margin (1.2x - 1.5x minimum)
4. Use calculator in tests to set expectations
5. If tests fail, debug the bug - don't change calculator or policy
The Setup
Context: Video moderation system needs to check participants at different rates based on risk and staleness.
Core question: How many checks/minute budget is needed to ensure fairness?
Naive answer: num_participants / recheck_interval checks per second
- Example: 10 participants, 60s recheck → 10 checks/min
Reality: Way more complicated.
Iteration 1: The Optimistic Formula
Agent's first attempt:
def calculate_min_budget_for_fairness(num_participants, critical_deadline=20):
"""Calculate minimum budget needed to check all participants before critical deadline."""
# Continuous time formula
time_to_check_all = critical_deadline # seconds
checks_per_second = num_participants / time_to_check_all
checks_per_minute = checks_per_second * 60
return checks_per_minute
Example:
- 10 participants, critical_deadline=20s
- time_to_check_all = 20s
- checks_per_minute = (10 / 20) * 60 = 30 checks/min
Test using this:
def test_fairness_with_sufficient_budget():
participants = [make_participant(f"p{i}") for i in range(10)]
budget = calculate_min_budget_for_fairness(10, critical_deadline=20)
# budget = 30 checks/min
scenario = run_scenario(participants, budget=budget, duration=60)
# Expected: All 10 participants checked within 20s
first_cycle_checks = scenario.timeline[0:4] # First 20s (4 cycles × 5s)
all_checked = set()
for cycle in first_cycle_checks:
all_checked.update(cycle.checked_users)
assert len(all_checked) == 10, "All participants should be checked within 20s"
Result: ❌ Test failed! Only 7/10 participants checked in 20s.
Why It Failed
Problem 1: Cycle Quantization
Formula assumed continuous time:
checks_per_second = 10 / 20 = 0.5
Reality uses discrete 5s cycles:
cycle_interval = 5 # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)
= int(30 / 60 * 5)
= int(2.5)
= 2 # Integer truncation!
Actual budget: 2 checks per 5s cycle = 24 checks/min (not 30)
Problem 2: Tier Ordering
Formula assumed all participants get checked sequentially. Reality uses tier prioritization:
- Tier 1 (at deadline): Gets checked first
- Tier 2 (fairness/never moderated): Only if budget left after Tier 1
With budget=2 per cycle (a toy simulation follows this list):
- Cycle 1: Check 2 never-moderated participants; once checked, they are headed for Tier 1 (at deadline) when their deadline arrives
- Cycles 2-4: Each cycle checks 2 more never-moderated participants
- Cycle 5 onward: The earliest-checked participants hit their deadline, so Tier 1 absorbs the entire budget rechecking them
- Result: Tier 1 monopolizes the budget, and the remaining never-moderated participants starve
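A toy simulation of this starvation effect, under simplifying assumptions (5s cycles, a 20s deadline, 2 checks per cycle, staleness-first ordering; simulate_starvation is an illustrative name, not the real scheduler):
def simulate_starvation(num_participants=10, budget_per_cycle=2,
                        deadline_cycles=4, total_cycles=12):
    last_checked = {}   # participant -> cycle index of last check
    ever_checked = set()
    for cycle in range(total_cycles):
        # Tier 1: already-checked participants whose last check is at/past the deadline
        tier1 = [p for p, c in last_checked.items() if cycle - c >= deadline_cycles]
        # Tier 2: never-moderated participants (fairness)
        tier2 = [p for p in range(num_participants) if p not in last_checked]
        # Staleness-first: Tier 1 consumes the budget before Tier 2 gets any
        for p in (tier1 + tier2)[:budget_per_cycle]:
            last_checked[p] = cycle
            ever_checked.add(p)
    return ever_checked

print(len(simulate_starvation()))  # 8 -- under these assumptions, 2 of 10 participants are never checked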
Iteration 2: The Reactive Fix (First Flip-Flop)
Agent's reaction: "Tests fail because tier ordering is wrong!"
Change 1: Flip tier priority
# Before:
Tier 1: At deadline (staleness prevention)
Tier 2: Never moderated (fairness)
# After:
Tier 1: Never moderated (fairness FIRST)
Tier 2: At deadline (staleness second)
Rationale: "Fairness should come before staleness prevention"
Test:
budget = 30 # Same budget
scenario = run_scenario(participants, budget=30, duration=60)
# Now fairness-first ordering
Result: ✅ Test passed! All 10 checked within 20s.
But new test failed:
def test_staleness_prevention():
"""Participants at deadline should be checked before going stale"""
# This now fails because fairness beats staleness!
User Intervention #1
"you keep changing the policy without reflecting on this constant changes"
The question: Are you changing the policy based on requirements, or reactively based on test failures?
Agent realized: Policy should be based on requirements (what SHOULD happen), not test outcomes (making tests pass).
Iteration 3: Back to Staleness-First (Second Flip-Flop)
Agent reconsidered:
- Staleness prevention is more critical (safety requirement)
- Fairness is important but secondary
- Revert tier ordering back to staleness-first
But now the fairness test fails again!
Agent's realization: "The budget formula is wrong, not the tier ordering."
Iteration 4: Account for Tier Monopolization
The insight: With staleness-first ordering, every participant that has already been checked keeps returning to Tier 1 (at deadline) on each recheck, competing with never-checked participants for budget.
This means:
- Must check ALL N participants BEFORE the first one needs rechecking
- Not before the "critical deadline" (20s), but before the "recheck interval" (60s)
Updated formula:
def calculate_min_budget_for_fairness(num_participants, recheck_interval=60):
    """Calculate minimum budget to check all participants before first recheck."""
    cycle_interval = 5  # seconds
    num_cycles_available = recheck_interval / cycle_interval            # 60 / 5 = 12 cycles
    checks_per_cycle_needed = num_participants / num_cycles_available   # 10 / 12 = 0.83
    # Convert to checks/minute
    checks_per_minute = (checks_per_cycle_needed / cycle_interval) * 60  # (0.83 / 5) * 60 = 10 checks/min
    return checks_per_minute
Example:
- 10 participants, 60s recheck
- Need to check all 10 before first recheck (60s)
- 12 cycles available (60s / 5s)
- Need 0.83 checks/cycle → 10 checks/min
Test:
budget = calculate_min_budget_for_fairness(10, recheck_interval=60)
# budget = 10 checks/min
scenario = run_scenario(participants, budget=10, duration=60)
# Expected: All 10 checked within 60s (before first recheck)
Result: ❌ Still failed! Only 8/10 checked in 60s.
Why It Still Failed
Integer truncation strikes again:
checks_per_cycle = int(10 / 60 * 5) = int(0.83) = 0
0 checks per cycle! Budget completely wasted by integer truncation.
Iteration 5: Make the Formula TIGHTER (Wrong Direction!)
Agent's reaction: "Budget of 10/min should be sufficient but tests fail. Formula must be too loose!"
Change:
def calculate_min_budget_for_fairness(num_participants, critical_deadline=15):
    # TIGHTER deadline: 15s instead of the 60s recheck window
    ...
    return 40  # checks/min - much higher!
User Intervention #2
"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"
The issue: When budget "should be sufficient" but tests fail, agent made the formula STRICTER (requiring more budget) instead of adding MARGIN (accounting for waste).
The question: If 10/min "should work" but doesn't, is the formula wrong OR does it need margin for integer truncation?
Iteration 6: Add Margin for Reality
Agent's realization: Formula calculates theoretical minimum. Reality has waste (integer truncation, tier switching overhead).
Solution: Add margin
def calculate_min_budget_for_fairness(num_participants, recheck_interval=60, margin=1.5):
"""Calculate minimum budget with safety margin."""
cycle_interval = 5
num_cycles = recheck_interval / cycle_interval
# Theoretical minimum
min_checks_per_cycle = num_participants / num_cycles
min_checks_per_minute = (min_checks_per_cycle / cycle_interval) * 60
# Add margin for integer truncation and overhead
recommended_budget = min_checks_per_minute * margin
return recommended_budget
Example:
- 10 participants, 60s recheck
- Minimum: 10 checks/min
- With 1.5x margin: 15 checks/min
Test:
budget = calculate_min_budget_for_fairness(10, recheck_interval=60, margin=1.5)
# budget = 15 checks/min
scenario = run_scenario(participants, budget=15, duration=60)
Result: ✅ Test passed! All 10 participants checked within 60s.
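Why does the margin rescue it? A quick sanity check under the same 5s-cycle quantization used throughout (a small sketch):
cycle_interval = 5   # seconds per scheduler cycle
budget = 15          # checks/min: 10 minimum x 1.5 margin

checks_per_cycle = int(budget / 60 * cycle_interval)        # int(1.25) = 1
checks_in_60s = checks_per_cycle * (60 // cycle_interval)   # 1 x 12 cycles = 12
assert checks_in_60s >= 10   # 12 checks cover all 10 participants before the first recheck
# At the bare minimum of 10/min, int(0.83) = 0 checks per cycle -- nothing runs at all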
Iteration 7: Rolling Accumulator for Fractional Checks
But another problem surfaced:
Scenario: budget=20 checks/min (fractional: 1.67 checks/cycle)
checks_per_cycle = int(20 / 60 * 5) = int(1.67) = 1
Lost capacity: 0.67 checks/cycle wasted!
- Provided: 20 checks/min
- Actually used: 12 checks/min (1 per cycle × 12 cycles)
- Wasted: 40% of budget!
Solution: Rolling accumulator
class BudgetAccumulator:
def __init__(self, checks_per_minute):
self.checks_per_minute = checks_per_minute
self.accumulated = 0.0
def get_checks_this_cycle(self, cycle_interval=5):
# Add fractional checks to accumulator
self.accumulated += (self.checks_per_minute / 60) * cycle_interval
# Return integer checks available
available = int(self.accumulated)
self.accumulated -= available
return available
Example:
- budget=20 checks/min = 1.67 checks per 5s cycle
- Cycle 1: accumulated=0 + 1.67 = 1.67 → return 1, accumulated=0.67
- Cycle 2: accumulated=0.67 + 1.67 = 2.34 → return 2, accumulated=0.34
- Cycle 3: accumulated=0.34 + 1.67 = 2.01 → return 2, accumulated=0.01
- Pattern: 1, 2, 2, 1, 2, 2, ... → Averages to 1.67 checks/cycle ✅
Result: Budget utilized accurately, no waste from integer truncation.
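A quick way to confirm the accumulator delivers the full budget over a minute (a small sketch using the class above):
acc = BudgetAccumulator(checks_per_minute=20)
per_cycle = [acc.get_checks_this_cycle(cycle_interval=5) for _ in range(12)]  # one minute of 5s cycles
print(per_cycle)       # [1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2]
print(sum(per_cycle))  # 20 -- no budget lost to truncation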
Iteration 8: Build Calculator FIRST, Use It in Tests
Final insight: Agent was writing tests with ad-hoc budget assumptions, then building calculator to justify them. Backwards!
Correct workflow:
- Define policy requirements (tier ordering, recheck intervals)
- Build budget calculator based on policy
- Add safety margin (1.2x - 1.5x)
- Use calculator in tests to set expectations
Example:
# tests/test_budget_allocation.py
def test_fairness_with_sufficient_budget():
participants = [make_participant(f"p{i}") for i in range(10)]
# USE THE CALCULATOR to determine budget
budget = calculate_min_budget_for_fairness(
num_participants=10,
recheck_interval=60,
margin=1.5 # Safety margin
)
# budget = 15 checks/min
scenario = run_scenario(participants, budget=budget, duration=60)
# Now expectation is realistic (based on calculator)
all_checked = get_all_checked_users(scenario, duration=60)
assert len(all_checked) == 10
Key change: Tests don't assume arbitrary budgets. They CALCULATE needed budget, then verify it works.
The Lessons
Lesson 1: Build Calculator First, Use It Everywhere
Anti-pattern:
# Test makes up budget
def test_fairness():
budget = 20 # Seems reasonable?
...
assert all_checked # Fails!
# Agent fixes by adjusting budget
def test_fairness():
budget = 30 # Try higher?
...
assert all_checked # Still fails!
# Repeat until tests pass...
Correct pattern:
# Build calculator based on requirements
def calculate_min_budget(...):
# Account for cycle quantization
# Account for tier ordering
# Add safety margin
return recommended_budget
# Use calculator in tests
def test_fairness():
budget = calculate_min_budget(num_participants=10, margin=1.5)
...
assert all_checked # Passes because budget is correct!
The calculator is the single source of truth.
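One way to make that concrete is to expose the calculator through a shared test fixture, so no test can hard-code a budget. A pytest-flavored sketch, assuming the calculator lives in a module named budget_calculator (an illustrative name, adjust to the real import):
# conftest.py (sketch): budgets only ever come from the calculator
import pytest
from budget_calculator import calculate_min_budget_for_fairness

@pytest.fixture
def fair_budget():
    """Budget the calculator deems sufficient for the standard 10-participant scenario."""
    return calculate_min_budget_for_fairness(
        num_participants=10,
        recheck_interval_seconds=60,
        margin=1.5,
    )

# In a test module: no hard-coded budget anywhere
def test_fairness_with_sufficient_budget(fair_budget):
    participants = [make_participant(f"p{i}") for i in range(10)]
    scenario = run_scenario(participants, budget=fair_budget, duration=60)
    assert len(get_all_checked_users(scenario, duration=60)) == 10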
Lesson 2: Don't React to Test Failures by Changing Policy
Anti-pattern:
Test fails → Change tier ordering
Test fails → Change deadline
Test fails → Change formula
→ Flip-flopping, no stability
Correct pattern:
Test fails → Ask: "Is my expectation realistic?"
→ Debug: What's the actual behavior?
→ Understand: Why does it differ?
→ Fix: The bug OR the test expectation (not the policy)
Policy should be based on requirements, not test outcomes.
Lesson 3: Account for Reality (Cycle Quantization)
Continuous time formulas are optimistic:
# Theory:
checks_per_second = 0.5
checks_per_minute = 30
# Reality (5s cycles):
checks_per_cycle = int(30 / 60 * 5) = 2
actual_checks_per_minute = (2 / 5) * 60 = 24 # Not 30!
Formula must account for discrete cycles:
def calculate_budget(checks_per_minute):
cycle_interval = 5
checks_per_cycle = checks_per_minute / 60 * cycle_interval
# Account for integer truncation
actual_checks_per_cycle = int(checks_per_cycle)
actual_budget = (actual_checks_per_cycle / cycle_interval) * 60
if actual_budget < checks_per_minute:
# Warn about quantization loss
print(f"Warning: Requested {checks_per_minute}/min, actual {actual_budget}/min")
return actual_budget
Lesson 4: Separate Concerns
3 distinct concerns got mixed up:
1. Policy (WHAT should happen)
# Business logic
Tier 1: Staleness prevention (at deadline)
Tier 2: Fairness (never moderated)
Tier 3: Critical urgency (approaching deadline)
2. Budget (HOW MUCH is needed for policy to work)
# Capacity planning
def calculate_min_budget(num_participants, recheck_interval):
# Given policy, how much budget needed?
return minimum_budget * margin
3. Tests (VERIFY policy works with given budget)
# Validation
def test_policy_works():
budget = calculate_min_budget(...) # Use calculator
scenario = run_scenario(budget=budget)
assert policy_invariants_hold(scenario) # Verify
When test fails:
- Don't change Policy (unless requirements changed)
- Don't change Budget formula (unless calculation wrong)
- DO debug: Is there a bug in implementation?
The Correct Formula (Final Version)
def calculate_min_budget_for_fairness(
num_participants: int,
recheck_interval_seconds: int = 60,
cycle_interval_seconds: int = 5,
margin: float = 1.5
) -> float:
"""
Calculate minimum checks/minute needed to ensure all participants
are checked before first recheck (fairness requirement).
Args:
num_participants: Number of participants to check
recheck_interval_seconds: Time before participant needs recheck (default 60s)
cycle_interval_seconds: System cycle interval (default 5s)
margin: Safety margin to account for quantization and overhead (default 1.5x)
Returns:
Recommended checks per minute (float)
Example:
>>> calculate_min_budget_for_fairness(10, recheck_interval_seconds=60)
15.0 # 10 participants need 10 checks/min minimum, 15 with 1.5x margin
"""
# How many cycles available before first recheck?
num_cycles_available = recheck_interval_seconds / cycle_interval_seconds
# How many checks per cycle needed?
checks_per_cycle_needed = num_participants / num_cycles_available
# Convert to checks per minute
min_checks_per_minute = (checks_per_cycle_needed / cycle_interval_seconds) * 60
# Add safety margin for:
# - Integer truncation in checks_per_cycle
# - Tier switching overhead
# - Variance in participant arrival times
recommended_checks_per_minute = min_checks_per_minute * margin
return recommended_checks_per_minute
Usage:
# For 10 participants needing 60s recheck:
budget = calculate_min_budget_for_fairness(10, recheck_interval_seconds=60, margin=1.5)
# Returns: 15.0 checks/min
# For 50 participants needing 30s recheck:
budget = calculate_min_budget_for_fairness(50, recheck_interval_seconds=30, margin=1.2)
# Returns: 120.0 checks/min (50 participants × 2 checks/min × 1.2 margin)
Real-World Application
Use Case 1: API Rate Limiting
Problem: How many requests/second needed to process N items within deadline D?
Formula:
def calculate_min_rps(num_items, deadline_seconds, margin=1.5):
min_rps = num_items / deadline_seconds
recommended_rps = min_rps * margin
return recommended_rps
But account for:
- Batch size quantization (API processes in batches of 100)
- Network latency overhead
- Retry margin
- Concurrent request limits
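A hedged sketch of how those adjustments might look, applying the same quantize-then-add-margin pattern (the batch size and latency values are illustrative assumptions, not measured numbers):
import math

def calculate_min_rps(num_items, deadline_seconds, batch_size=100,
                      avg_latency_seconds=0.2, margin=1.5):
    """Requests/second needed to push num_items through before the deadline."""
    # Each request carries a full batch, so count whole batches (quantization)
    num_requests = math.ceil(num_items / batch_size)
    # Latency eats into the deadline: the last request must complete in time
    usable_seconds = max(deadline_seconds - avg_latency_seconds, 1e-9)
    min_rps = num_requests / usable_seconds
    # Margin covers retries, jitter, and concurrent-request limits
    return min_rps * margin

# Example: 10,000 items, 60s deadline, batches of 100
# -> 100 requests / 59.8s ~= 1.67 rps, ~2.5 rps with a 1.5x margin
print(calculate_min_rps(10_000, 60))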
Use Case 2: Worker Pool Sizing
Problem: How many workers needed to process N jobs within SLA S?
Formula:
import math

def calculate_min_workers(num_jobs, sla_seconds, avg_job_duration, margin=1.5):
    # Continuous-time minimum: total work divided by the time available
    min_workers = (num_jobs * avg_job_duration) / sla_seconds
    # Account for discrete worker count
    min_workers = math.ceil(min_workers)
    # Add margin for variance
    recommended_workers = int(min_workers * margin)
    return recommended_workers
Margin accounts for:
- Job duration variance (some take 2x average)
- Worker startup time
- Occasional failures requiring retries
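For example, with the sketch above (numbers are illustrative):
# 1,000 jobs, 5-minute SLA, jobs averaging 2s each:
# minimum = ceil(1000 * 2 / 300) = ceil(6.67) = 7 workers
# with 1.5x margin -> int(7 * 1.5) = 10 workers
workers = calculate_min_workers(num_jobs=1000, sla_seconds=300,
                                avg_job_duration=2, margin=1.5)
print(workers)  # 10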
Use Case 3: Cache Sizing
Problem: How much cache needed to keep N items with TTL T?
Formula:
def calculate_min_cache_size(num_items, ttl_seconds, request_rate_per_sec, margin=2.0):
    # Items alive at any time (arrival rate × TTL), capped at the number of distinct items
    items_in_cache = min(num_items, request_rate_per_sec * ttl_seconds)
    # Add margin for spikes
    recommended_size = items_in_cache * margin
    return recommended_size
Margin accounts for:
- Traffic spikes (2x normal)
- Non-uniform access patterns
- Cascading failures if cache full
Debugging Budget Issues: A Checklist
When tests fail with "sufficient" budget:
1. Check Integer Truncation
# Calculate what you're actually getting
budget = 20          # checks/min requested
cycle_interval = 5   # seconds per scheduler cycle
checks_per_cycle = int(budget / 60 * cycle_interval)
actual_budget = (checks_per_cycle / cycle_interval) * 60
if actual_budget != budget:
    print(f"Truncation loss: {budget} requested, {actual_budget} actual")
2. Check Cycle Quantization
# Ensure formula accounts for discrete cycles
# DON'T: continuous_time_formula(participants, deadline)
# DO: discrete_cycle_formula(participants, num_cycles, cycle_interval)
3. Check Tier Monopolization
# Verify tier ordering doesn't cause starvation
# If staleness-first: Must check ALL before first recheck
# If fairness-first: Must have enough budget after fairness for staleness
4. Check Margin Sufficiency
# Is 1.5x margin enough?
# Try 2.0x - if tests pass, margin was the issue
# If tests still fail, there's a real bug (not just insufficient budget)
5. Check Formula Matches Policy
# Does formula assume fairness-first but policy is staleness-first?
# Formula must match actual tier ordering
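A small diagnostic helper that bundles checks 1, 2, and 4 could look like this (a sketch under the same 5s-cycle assumptions; diagnose_budget is an illustrative name):
def diagnose_budget(requested_per_minute, num_participants,
                    recheck_interval_seconds=60, cycle_interval_seconds=5):
    """Report what a requested budget actually delivers under cycle quantization."""
    cycles = int(recheck_interval_seconds / cycle_interval_seconds)
    checks_per_cycle = int(requested_per_minute / 60 * cycle_interval_seconds)
    delivered = checks_per_cycle * cycles   # checks actually available before first recheck
    needed = num_participants               # everyone must be checked at least once
    print(f"Requested: {requested_per_minute}/min "
          f"-> {checks_per_cycle} checks per {cycle_interval_seconds}s cycle")
    print(f"Delivered in {recheck_interval_seconds}s: {delivered}, needed: {needed}")
    if delivered < needed:
        print(f"Insufficient after quantization (short by {needed - delivered}); "
              f"raise the margin or use a rolling accumulator")

diagnose_budget(10, num_participants=10)   # int(0.83) = 0 per cycle -> delivers 0
diagnose_budget(15, num_participants=10)   # int(1.25) = 1 per cycle -> delivers 12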
The Flip-Flop Timeline (Summary)
| Iteration | Change | Reason | Outcome |
|---|---|---|---|
| 1 | Formula: checks_per_minute = (N / deadline) * 60 | Initial attempt | ❌ Failed (quantization) |
| 2 | Flip tier order: fairness-first | Tests failed, assumed policy wrong | ✅ Fairness test passed, ❌ staleness test failed |
| 3 | Flip back: staleness-first | User intervention: stop flip-flopping | ❌ Fairness test failed again |
| 4 | Formula: checks before recheck (not deadline) | Realized monopolization issue | ❌ Still failed (truncation) |
| 5 | Make formula TIGHTER (wrong direction!) | Tests failed at "sufficient" budget | ❌ User intervention: wrong direction |
| 6 | Add 1.5x margin | Account for waste | ✅ Tests passed! |
| 7 | Add rolling accumulator | Eliminate truncation waste | ✅ Budget utilized fully |
| 8 | Build calculator FIRST, use in tests | Correct workflow | ✅ Stable, correct |
Total iterations: 8
User interventions: 3
Time wasted: ~2 hours
Time with correct approach: ~30 minutes
Lesson: Build calculator first, stop flip-flopping, add margin.
Conclusion
What we learned:
- ✅ Build budget calculator FIRST, then use it to validate tests
- ✅ Don't react to test failures by changing policy
- ✅ Account for reality: quantization, truncation, margin
- ✅ Separate concerns: Policy vs Budget vs Tests
The anti-pattern:
Write test → Guess budget → Test fails → Change policy → Test fails → Change formula → ...
The correct pattern:
Define policy → Build calculator → Add margin → Use in tests → Debug bugs (not policy)
Would we make this mistake again? Probably not! The lesson was learned through painful iteration.
Next time:
- Define policy requirements clearly upfront
- Build budget calculator based on policy
- Add realistic margin (1.5x - 2.0x)
- Use calculator in ALL tests
- If tests fail with calculator-provided budget, it's a BUG (not wrong budget)
Series Conclusion
We've covered 6 aspects of multi-agent AI development:
- Part 1: Can 5 Claude Code Agents Work Independently? - The optimistic hypothesis
- Part 2: The Reality of Autonomous Development - Human orchestration required (31% autonomy)
- Part 3: Property-Based Testing with Hypothesis - The data you're throwing away
- Part 4: Zero-Conflict Architecture - File-level ownership (100% auto-merge)
- Part 5: Communication Protocols for AI Agents - 4 iterations to file-based messaging
- Part 6: The Budget Calculator Paradox - Build it first, use it everywhere
Overall lessons:
- Zero-conflict architecture works (100% auto-merge)
- Human-AI collaboration > pure autonomy (orchestration essential)
- Verification before coding (model introspection prevents wasted effort)
- Knowledge preservation (capture Hypothesis shrunken cases)
- Build calculators first (don't guess in tests)
- Explicit communication (templates, commands, not assumptions)
Was it worth it? Absolutely. 75% time savings despite 12.5% orchestration overhead. Would do it again with lessons learned.
Tags: #budget-calculator #capacity-planning #testing #formulas #quantization #margin #lessons-learned
This is Part 6 (Final) of the Multi-Agent Development Series.
Discussion: Have you struggled with capacity formulas? Do you build calculators first or adjust tests reactively? What's your approach to margin and safety factors? Share in the comments!