Aviad Rozenhek
Vibe coding: What we learned from flip-flopping 8 times on a simple formula

The Budget Calculator Paradox: When Tests Don't Match Reality

Part 6 of the Multi-Agent Development Series

  • Part 1: Can 5 Claude Code Agents Work Independently?
  • Part 2: The Reality of "Autonomous" Multi-Agent Development
  • Part 3: Property-Based Testing with Hypothesis
  • Part 4: Zero-Conflict Architecture
  • Part 5: Communication Protocols for AI Agents

TL;DR

The Problem: Create a budget calculator to determine minimum checks/minute needed for fair participant checking.

What happened: Agent flip-flopped 8 times on the formula, changing tier ordering and capacity calculations reactively based on test failures instead of proactively based on requirements.

User intervention (3 times):

"you keep changing the policy without reflecting on this constant changes"
"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"
"make the calculator CORRECT and then provide it some extra margin"

The lessons:

  1. Build the calculator FIRST, then use it to validate test expectations
  2. Don't react to test failures by changing the policy - ask "is my expectation realistic?"
  3. Account for reality: Cycle quantization, integer rounding, margins for variance
  4. Separate concerns: Policy (what should happen) vs Budget (what's needed for it to work)

The correct pattern:

1. Define policy requirements (tier ordering, deadlines)
2. Build budget calculator based on requirements
3. Add safety margin (1.2x - 1.5x minimum)
4. Use calculator in tests to set expectations
5. If tests fail, debug the bug - don't change calculator or policy

The Setup

Context: Video moderation system needs to check participants at different rates based on risk and staleness.

Core question: How many checks/minute budget is needed to ensure fairness?

Naive answer: num_participants / recheck_interval (interval in minutes)

  • Example: 10 participants, 60s recheck → 10 checks/min

Reality: Way more complicated.
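
For reference, here is the naive version as code (a hypothetical helper, shown only as the baseline that the rest of this post picks apart):

def naive_budget(num_participants, recheck_interval_minutes=1.0):
    # e.g. 10 participants / 1 minute recheck -> 10 checks/min
    return num_participants / recheck_interval_minutes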


Iteration 1: The Optimistic Formula

Agent's first attempt:

def calculate_min_budget_for_fairness(num_participants, critical_deadline=20):
    """Calculate minimum budget needed to check all participants before critical deadline."""
    # Continuous time formula
    time_to_check_all = critical_deadline  # seconds
    checks_per_second = num_participants / time_to_check_all
    checks_per_minute = checks_per_second * 60
    return checks_per_minute

Example:

  • 10 participants, critical_deadline=20s
  • time_to_check_all = 20s
  • checks_per_minute = (10 / 20) * 60 = 30 checks/min

Test using this:

def test_fairness_with_sufficient_budget():
    participants = [make_participant(f"p{i}") for i in range(10)]
    budget = calculate_min_budget_for_fairness(10, critical_deadline=20)
    # budget = 30 checks/min

    scenario = run_scenario(participants, budget=budget, duration=60)

    # Expected: All 10 participants checked within 20s
    first_cycle_checks = scenario.timeline[0:4]  # First 20s (4 cycles × 5s)
    all_checked = set()
    for cycle in first_cycle_checks:
        all_checked.update(cycle.checked_users)

    assert len(all_checked) == 10, "All participants should be checked within 20s"

Result: ❌ Test failed! Only 7/10 participants checked in 20s.


Why It Failed

Problem 1: Cycle Quantization

Formula assumed continuous time:

checks_per_second = 10 / 20 = 0.5

Reality uses discrete 5s cycles:

cycle_interval = 5  # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)
                 = int(30 / 60 * 5)
                 = int(2.5)
                 = 2  # Integer truncation!

Actual budget: 2 checks per 5s cycle = 24 checks/min (not 30)
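
A few lines of Python make the loss concrete (assuming the same 5s cycle length as above):

budget = 30                                                 # checks/min from the naive formula
cycle_interval = 5                                          # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)        # int(2.5) == 2
effective_budget = checks_per_cycle / cycle_interval * 60   # 24.0 checks/min, not 30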


Problem 2: Tier Ordering

Formula assumed all participants get checked sequentially. Reality uses tier prioritization:

  • Tier 1 (at deadline): Gets checked first
  • Tier 2 (fairness/never moderated): Only if budget left after Tier 1

With budget=2 per cycle:

  • Cycle 1: Check 2 never-moderated participants → they move into Tier 1 (at deadline)
  • Cycle 2: Part of the budget goes to rechecking those Tier 1 participants, so fewer new ones get checked
  • Cycle 3: Tier 1 keeps growing while the budget stays at 2, so progress on new participants slows further
  • Result: Tier 1 monopolizes the budget, never-moderated participants starve

Iteration 2: The Reactive Fix (First Flip-Flop)

Agent's reaction: "Tests fail because tier ordering is wrong!"

Change 1: Flip tier priority

# Before:
Tier 1: At deadline (staleness prevention)
Tier 2: Never moderated (fairness)

# After:
Tier 1: Never moderated (fairness FIRST)
Tier 2: At deadline (staleness second)

Rationale: "Fairness should come before staleness prevention"

Test:

budget = 30  # Same budget
scenario = run_scenario(participants, budget=30, duration=60)
# Now fairness-first ordering

Result: ✅ Test passed! All 10 checked within 20s.

But new test failed:

def test_staleness_prevention():
    """Participants at deadline should be checked before going stale"""
    # This now fails because fairness beats staleness!

User Intervention #1

"you keep changing the policy without reflecting on this constant changes"

The question: Are you changing the policy based on requirements, or reactively based on test failures?

Agent realized: Policy should be based on requirements (what SHOULD happen), not test outcomes (making tests pass).


Iteration 3: Back to Staleness-First (Second Flip-Flop)

Agent reconsidered:

  • Staleness prevention is more critical (safety requirement)
  • Fairness is important but secondary
  • Revert tier ordering back to staleness-first

But now the fairness test fails again!

Agent's realization: "The budget formula is wrong, not the tier ordering."


Iteration 4: Account for Tier Monopolization

The insight: With staleness-first, once a participant is checked, they stay in Tier 1 (at deadline) until rechecked.

This means:

  • Must check ALL N participants BEFORE the first one needs rechecking
  • Not before the "critical deadline" (20s), but before the "recheck interval" (60s)

Updated formula:

def calculate_min_budget_for_fairness(num_participants, recheck_interval=60):
    """Calculate minimum budget to check all participants before first recheck."""
    cycle_interval = 5  # seconds
    num_cycles_available = recheck_interval / cycle_interval  # 60/5 = 12 cycles

    checks_per_cycle_needed = num_participants / num_cycles_available  # 10/12 = 0.83

    # Convert to checks per minute: (0.83 / 5) * 60 = 10 checks/min
    checks_per_minute = (checks_per_cycle_needed / cycle_interval) * 60

    return checks_per_minute

Example:

  • 10 participants, 60s recheck
  • Need to check all 10 before first recheck (60s)
  • 12 cycles available (60s / 5s)
  • Need 0.83 checks/cycle → 10 checks/min

Test:

budget = calculate_min_budget_for_fairness(10, recheck_interval=60)
# budget = 10 checks/min

scenario = run_scenario(participants, budget=10, duration=60)
# Expected: All 10 checked within 60s (before first recheck)

Result: ❌ Still failed! Only 8/10 checked in 60s.


Why It Still Failed

Integer truncation strikes again:

checks_per_cycle = int(10 / 60 * 5) = int(0.83) = 0

0 checks per cycle! Budget completely wasted by integer truncation.


Iteration 5: Make the Formula TIGHTER (Wrong Direction!)

Agent's reaction: "Budget of 10/min should be sufficient but tests fail. Formula must be too loose!"

Change:

def calculate_min_budget_for_fairness(num_participants, critical_deadline=15):
    # TIGHTER deadline: 15s instead of 60s
    ...
    return 40  # checks/min, much higher!

User Intervention #2

"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"

The issue: When budget "should be sufficient" but tests fail, agent made the formula STRICTER (requiring more budget) instead of adding MARGIN (accounting for waste).

The question: If 10/min "should work" but doesn't, is the formula wrong OR does it need margin for integer truncation?


Iteration 6: Add Margin for Reality

Agent's realization: Formula calculates theoretical minimum. Reality has waste (integer truncation, tier switching overhead).

Solution: Add margin

def calculate_min_budget_for_fairness(num_participants, recheck_interval=60, margin=1.5):
    """Calculate minimum budget with safety margin."""
    cycle_interval = 5
    num_cycles = recheck_interval / cycle_interval

    # Theoretical minimum
    min_checks_per_cycle = num_participants / num_cycles
    min_checks_per_minute = (min_checks_per_cycle / cycle_interval) * 60

    # Add margin for integer truncation and overhead
    recommended_budget = min_checks_per_minute * margin

    return recommended_budget

Example:

  • 10 participants, 60s recheck
  • Minimum: 10 checks/min
  • With 1.5x margin: 15 checks/min

Test:

budget = calculate_min_budget_for_fairness(10, recheck_interval=60, margin=1.5)
# budget = 15 checks/min

scenario = run_scenario(participants, budget=15, duration=60)

Result: ✅ Test passed! All 10 participants checked within 60s.
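
A quick back-of-the-envelope check shows why the margin absorbs the truncation (assuming the same 5s cycles):

budget = 15                                        # checks/min from the calculator
checks_per_cycle = int(budget / 60 * 5)            # int(1.25) == 1
effective_per_minute = checks_per_cycle / 5 * 60   # 12 checks/min after truncation
# 12 effective checks/min still exceeds the 10 checks/min theoretical minimum,
# so all 10 participants are reached by the 10th cycle (50s < 60s recheck).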


Iteration 7: Rolling Accumulator for Fractional Checks

But another problem surfaced:

Scenario: budget=20 checks/min (fractional: 1.67 checks/cycle)

checks_per_cycle = int(20 / 60 * 5) = int(1.67) = 1

Lost capacity: 0.67 checks/cycle wasted!

  • Provided: 20 checks/min
  • Actually used: 12 checks/min (1 per cycle × 12 cycles)
  • Wasted: 40% of budget!

Solution: Rolling accumulator

class BudgetAccumulator:
    def __init__(self, checks_per_minute):
        self.checks_per_minute = checks_per_minute
        self.accumulated = 0.0

    def get_checks_this_cycle(self, cycle_interval=5):
        # Add fractional checks to accumulator
        self.accumulated += (self.checks_per_minute / 60) * cycle_interval

        # Return integer checks available
        available = int(self.accumulated)
        self.accumulated -= available
        return available

Example:

  • budget=20 checks/min = 1.67 checks per 5s cycle
  • Cycle 1: accumulated=0 + 1.67 = 1.67 → return 1, accumulated=0.67
  • Cycle 2: accumulated=0.67 + 1.67 = 2.34 → return 2, accumulated=0.34
  • Cycle 3: accumulated=0.34 + 1.67 = 2.01 → return 2, accumulated=0.01
  • Pattern: 1, 2, 2, 1, 2, 2, ... → Averages to 1.67 checks/cycle ✅

Result: Budget utilized accurately, no waste from integer truncation.
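
A quick sanity check of the pattern above (assuming the BudgetAccumulator class is in scope):

acc = BudgetAccumulator(checks_per_minute=20)
per_cycle = [acc.get_checks_this_cycle(cycle_interval=5) for _ in range(12)]
print(per_cycle)       # roughly [1, 2, 2, 1, 2, 2, ...] - the fractional budget is preserved
print(sum(per_cycle))  # 20 checks over 12 cycles (one minute), nothing lost to truncation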


Iteration 8: Build Calculator FIRST, Use It in Tests

Final insight: The agent was writing tests with ad-hoc budget assumptions, then building the calculator to justify them. Backwards!

Correct workflow:

  1. Define policy requirements (tier ordering, recheck intervals)
  2. Build budget calculator based on policy
  3. Add safety margin (1.2x - 1.5x)
  4. Use calculator in tests to set expectations

Example:

# tests/test_budget_allocation.py

def test_fairness_with_sufficient_budget():
    participants = [make_participant(f"p{i}") for i in range(10)]

    # USE THE CALCULATOR to determine budget
    budget = calculate_min_budget_for_fairness(
        num_participants=10,
        recheck_interval=60,
        margin=1.5  # Safety margin
    )
    # budget = 15 checks/min

    scenario = run_scenario(participants, budget=budget, duration=60)

    # Now expectation is realistic (based on calculator)
    all_checked = get_all_checked_users(scenario, duration=60)
    assert len(all_checked) == 10

Key change: Tests don't assume arbitrary budgets. They CALCULATE needed budget, then verify it works.


The Lessons

Lesson 1: Build Calculator First, Use It Everywhere

Anti-pattern:

# Test makes up budget
def test_fairness():
    budget = 20  # Seems reasonable?
    ...
    assert all_checked  # Fails!

# Agent fixes by adjusting budget
def test_fairness():
    budget = 30  # Try higher?
    ...
    assert all_checked  # Still fails!

# Repeat until tests pass...

Correct pattern:

# Build calculator based on requirements
def calculate_min_budget(...):
    # Account for cycle quantization
    # Account for tier ordering
    # Add safety margin
    return recommended_budget

# Use calculator in tests
def test_fairness():
    budget = calculate_min_budget(num_participants=10, margin=1.5)
    ...
    assert all_checked  # Passes because budget is correct!

The calculator is the single source of truth.


Lesson 2: Don't React to Test Failures by Changing Policy

Anti-pattern:

Test fails → Change tier ordering
Test fails → Change deadline
Test fails → Change formula
→ Flip-flopping, no stability

Correct pattern:

Test fails → Ask: "Is my expectation realistic?"
            → Debug: What's the actual behavior?
            → Understand: Why does it differ?
            → Fix: The bug OR the test expectation (not the policy)

Policy should be based on requirements, not test outcomes.


Lesson 3: Account for Reality (Cycle Quantization)

Continuous time formulas are optimistic:

# Theory:
checks_per_second = 0.5
checks_per_minute = 30

# Reality (5s cycles):
checks_per_cycle = int(30 / 60 * 5) = 2
actual_checks_per_minute = (2 / 5) * 60 = 24  # Not 30!

Formula must account for discrete cycles:

def calculate_budget(checks_per_minute):
    cycle_interval = 5
    checks_per_cycle = checks_per_minute / 60 * cycle_interval

    # Account for integer truncation
    actual_checks_per_cycle = int(checks_per_cycle)
    actual_budget = (actual_checks_per_cycle / cycle_interval) * 60

    if actual_budget < checks_per_minute:
        # Warn about quantization loss
        print(f"Warning: Requested {checks_per_minute}/min, actual {actual_budget}/min")

    return actual_budget
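
For example, asking this helper for 20 checks/min with 5s cycles surfaces the quantization loss:

calculate_budget(20)
# Warning: Requested 20/min, actual 12.0/min
# returns 12.0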

Lesson 4: Separate Concerns

3 distinct concerns got mixed up:

1. Policy (WHAT should happen)

# Business logic
Tier 1: Staleness prevention (at deadline)
Tier 2: Fairness (never moderated)
Tier 3: Critical urgency (approaching deadline)

2. Budget (HOW MUCH is needed for policy to work)

# Capacity planning
def calculate_min_budget(num_participants, recheck_interval):
    # Given policy, how much budget needed?
    return minimum_budget * margin

3. Tests (VERIFY policy works with given budget)

# Validation
def test_policy_works():
    budget = calculate_min_budget(...)  # Use calculator
    scenario = run_scenario(budget=budget)
    assert policy_invariants_hold(scenario)  # Verify

When test fails:

  • Don't change Policy (unless requirements changed)
  • Don't change Budget formula (unless calculation wrong)
  • DO debug: Is there a bug in implementation?

The Correct Formula (Final Version)

def calculate_min_budget_for_fairness(
    num_participants: int,
    recheck_interval_seconds: int = 60,
    cycle_interval_seconds: int = 5,
    margin: float = 1.5
) -> float:
    """
    Calculate minimum checks/minute needed to ensure all participants
    are checked before first recheck (fairness requirement).

    Args:
        num_participants: Number of participants to check
        recheck_interval_seconds: Time before participant needs recheck (default 60s)
        cycle_interval_seconds: System cycle interval (default 5s)
        margin: Safety margin to account for quantization and overhead (default 1.5x)

    Returns:
        Recommended checks per minute (float)

    Example:
        >>> calculate_min_budget_for_fairness(10, recheck_interval_seconds=60)
        15.0  # 10 participants need 10 checks/min minimum, 15 with 1.5x margin
    """
    # How many cycles available before first recheck?
    num_cycles_available = recheck_interval_seconds / cycle_interval_seconds

    # How many checks per cycle needed?
    checks_per_cycle_needed = num_participants / num_cycles_available

    # Convert to checks per minute
    min_checks_per_minute = (checks_per_cycle_needed / cycle_interval_seconds) * 60

    # Add safety margin for:
    # - Integer truncation in checks_per_cycle
    # - Tier switching overhead
    # - Variance in participant arrival times
    recommended_checks_per_minute = min_checks_per_minute * margin

    return recommended_checks_per_minute

Usage:

# For 10 participants needing 60s recheck:
budget = calculate_min_budget_for_fairness(10, recheck_interval_seconds=60, margin=1.5)
# Returns: 15.0 checks/min

# For 50 participants needing 30s recheck:
budget = calculate_min_budget_for_fairness(50, recheck_interval_seconds=30, margin=1.2)
# Returns: 120.0 checks/min (50 participants × 2 checks/min × 1.2 margin)

Real-World Application

Use Case 1: API Rate Limiting

Problem: How many requests/second needed to process N items within deadline D?

Formula:

def calculate_min_rps(num_items, deadline_seconds, margin=1.5):
    min_rps = num_items / deadline_seconds
    recommended_rps = min_rps * margin
    return recommended_rps

But account for:

  • Batch size quantization (API processes in batches of 100)
  • Network latency overhead
  • Retry margin
  • Concurrent request limits
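
A sketch of what that can look like in practice (batch_size and avg_latency_seconds are illustrative assumptions, not part of the formula above):

import math

def calculate_min_rps_with_batching(num_items, deadline_seconds, batch_size=100,
                                    avg_latency_seconds=0.05, margin=1.5):
    # Requests are really batches, so quantize throughput to whole batches
    num_batches = math.ceil(num_items / batch_size)
    # Leave room for one round-trip of latency before the deadline
    effective_deadline = max(deadline_seconds - avg_latency_seconds, 1e-9)
    min_batch_rps = num_batches / effective_deadline
    # Margin covers retries and jitter from concurrent request limits
    return min_batch_rps * margin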

Use Case 2: Worker Pool Sizing

Problem: How many workers needed to process N jobs within SLA S?

Formula:

import math

def calculate_min_workers(num_jobs, sla_seconds, avg_job_duration, margin=1.5):
    # Continuous-time minimum
    min_workers = (num_jobs * avg_job_duration) / sla_seconds

    # Account for discrete worker count
    min_workers = math.ceil(min_workers)

    # Add margin for variance
    recommended_workers = int(min_workers * margin)

    return recommended_workers
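
For example (hypothetical numbers):

calculate_min_workers(num_jobs=1000, sla_seconds=300, avg_job_duration=2, margin=1.5)
# ceil(1000 * 2 / 300) = 7 workers minimum -> int(7 * 1.5) = 10 workers recommended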

Margin accounts for:

  • Job duration variance (some take 2x average)
  • Worker startup time
  • Occasional failures requiring retries

Use Case 3: Cache Sizing

Problem: How much cache needed to keep N items with TTL T?

Formula:

def calculate_min_cache_size(num_items, ttl_seconds, request_rate_per_sec, margin=2.0):
    # Items alive at any time (arrival rate × TTL), capped by the number of distinct items
    items_in_cache = min(num_items, request_rate_per_sec * ttl_seconds)

    # Add margin for spikes
    recommended_size = items_in_cache * margin

    return recommended_size

Margin accounts for:

  • Traffic spikes (2x normal)
  • Non-uniform access patterns
  • Cascading failures if cache full

Debugging Budget Issues: A Checklist

When tests fail with "sufficient" budget:

1. Check Integer Truncation

# Calculate what you're actually getting
budget = 20  # checks/min
cycle_interval = 5  # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)
actual_budget = (checks_per_cycle / cycle_interval) * 60

if actual_budget != budget:
    print(f"Truncation loss: {budget} requested, {actual_budget} actual")

2. Check Cycle Quantization

# Ensure formula accounts for discrete cycles
# DON'T: continuous_time_formula(participants, deadline)
# DO: discrete_cycle_formula(participants, num_cycles, cycle_interval)

3. Check Tier Monopolization

# Verify tier ordering doesn't cause starvation
# If staleness-first: Must check ALL before first recheck
# If fairness-first: Must have enough budget after fairness for staleness

4. Check Margin Sufficiency

# Is 1.5x margin enough?
# Try 2.0x - if tests pass, margin was the issue
# If tests still fail, there's a real bug (not just insufficient budget)
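
A minimal sketch of that check, assuming the participants list, run_scenario, and get_all_checked_users helpers from earlier and the final calculator above:

for margin in (1.2, 1.5, 2.0):
    budget = calculate_min_budget_for_fairness(10, recheck_interval_seconds=60, margin=margin)
    scenario = run_scenario(participants, budget=budget, duration=60)
    checked = get_all_checked_users(scenario, duration=60)
    print(f"margin={margin}: budget={budget}, checked={len(checked)}/10")
# If only the larger margins reach 10/10, the margin was the issue.
# If none of them do, there's a real bug - go debug the implementation.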

5. Check Formula Matches Policy

# Does formula assume fairness-first but policy is staleness-first?
# Formula must match actual tier ordering

The Flip-Flop Timeline (Summary)

Iteration | Change | Reason | Outcome
1 | Formula: checks_per_minute = (N / deadline) * 60 | Initial attempt | ❌ Failed (quantization)
2 | Flip tier order: fairness-first | Tests failed, assumed policy wrong | ✅ Fairness test passed, ❌ staleness test failed
3 | Flip back: staleness-first | User intervention: stop flip-flopping | ❌ Fairness test failed again
4 | Formula: checks before recheck (not deadline) | Realized monopolization issue | ❌ Still failed (truncation)
5 | Make formula TIGHTER (wrong direction!) | Tests failed at "sufficient" budget | ❌ User intervention: wrong direction
6 | Add 1.5x margin | Account for waste | ✅ Tests passed!
7 | Add rolling accumulator | Eliminate truncation waste | ✅ Budget utilized fully
8 | Build calculator FIRST, use in tests | Correct workflow | ✅ Stable, correct

Total iterations: 8
User interventions: 3
Time wasted: ~2 hours
Time with correct approach: ~30 minutes

Lesson: Build calculator first, stop flip-flopping, add margin.


Conclusion

What we learned:

  • ✅ Build budget calculator FIRST, then use it to validate tests
  • ✅ Don't react to test failures by changing policy
  • ✅ Account for reality: quantization, truncation, margin
  • ✅ Separate concerns: Policy vs Budget vs Tests

The anti-pattern:

Write test → Guess budget → Test fails → Change policy → Test fails → Change formula → ...

The correct pattern:

Define policy → Build calculator → Add margin → Use in tests → Debug bugs (not policy)

Would we make this mistake again? Probably not! The lesson was learned through painful iteration.

Next time:

  1. Define policy requirements clearly upfront
  2. Build budget calculator based on policy
  3. Add realistic margin (1.5x - 2.0x)
  4. Use calculator in ALL tests
  5. If tests fail with calculator-provided budget, it's a BUG (not wrong budget)

Series Conclusion

We've covered 6 aspects of multi-agent AI development:

  1. Part 1: Can 5 Claude Code Agents Work Independently? - The optimistic hypothesis
  2. Part 2: The Reality of Autonomous Development - Human orchestration required (31% autonomy)
  3. Part 3: Property-Based Testing with Hypothesis - The data you're throwing away
  4. Part 4: Zero-Conflict Architecture - File-level ownership (100% auto-merge)
  5. Part 5: Communication Protocols for AI Agents - 4 iterations to file-based messaging
  6. Part 6: The Budget Calculator Paradox - Build it first, use it everywhere

Overall lessons:

  • Zero-conflict architecture works (100% auto-merge)
  • Human-AI collaboration > pure autonomy (orchestration essential)
  • Verification before coding (model introspection prevents wasted effort)
  • Knowledge preservation (capture Hypothesis shrunken cases)
  • Build calculators first (don't guess in tests)
  • Explicit communication (templates, commands, not assumptions)

Was it worth it? Absolutely. 75% time savings despite 12.5% orchestration overhead. Would do it again with lessons learned.


Tags: #budget-calculator #capacity-planning #testing #formulas #quantization #margin #lessons-learned


This is Part 6 (Final) of the Multi-Agent Development Series.

Discussion: Have you struggled with capacity formulas? Do you build calculators first or adjust tests reactively? What's your approach to margin and safety factors? Share in the comments!
