Aviad Rozenhek
Vibe coding: What we learned from flip-flopping 8 times on a simple formula

The Budget Calculator Paradox: When Tests Don't Match Reality

Part 6 of the Multi-Agent Development Series

  • Part 1: Can 5 Claude Code Agents Work Independently?
  • Part 2: The Reality of "Autonomous" Multi-Agent Development
  • Part 3: Property-Based Testing with Hypothesis
  • Part 4: Zero-Conflict Architecture
  • Part 5: Communication Protocols for AI Agents

TL;DR

The Problem: Create a budget calculator to determine minimum checks/minute needed for fair participant checking.

What happened: Agent flip-flopped 8 times on the formula, changing tier ordering and capacity calculations reactively based on test failures instead of proactively based on requirements.

User intervention (3 times):

"you keep changing the policy without reflecting on this constant changes"
"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"
"make the calculator CORRECT and then provide it some extra margin"

The lessons:

  1. Build the calculator FIRST, then use it to validate test expectations
  2. Don't react to test failures by changing the policy - ask "is my expectation realistic?"
  3. Account for reality: Cycle quantization, integer rounding, margins for variance
  4. Separate concerns: Policy (what should happen) vs Budget (what's needed for it to work)

The correct pattern:

1. Define policy requirements (tier ordering, deadlines)
2. Build budget calculator based on requirements
3. Add safety margin (1.2x - 1.5x minimum)
4. Use calculator in tests to set expectations
5. If tests fail, debug the bug - don't change calculator or policy

The Setup

Context: Video moderation system needs to check participants at different rates based on risk and staleness.

Core question: How many checks/minute budget is needed to ensure fairness?

Naive answer: num_participants / recheck_interval (interval in minutes)

  • Example: 10 participants, 60s recheck → 10 checks/min

Reality: Way more complicated.
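
For reference, here is the naive version as code (a hypothetical helper, shown only as the baseline that the rest of this post picks apart):

def naive_budget(num_participants, recheck_interval_minutes=1.0):
    # e.g. 10 participants / 1 minute recheck -> 10 checks/min
    return num_participants / recheck_interval_minutes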


Iteration 1: The Optimistic Formula

Agent's first attempt:

def calculate_min_budget_for_fairness(num_participants, critical_deadline=20):
    """Calculate minimum budget needed to check all participants before critical deadline."""
    # Continuous time formula
    time_to_check_all = critical_deadline  # seconds
    checks_per_second = num_participants / time_to_check_all
    checks_per_minute = checks_per_second * 60
    return checks_per_minute

Example:

  • 10 participants, critical_deadline=20s
  • time_to_check_all = 20s
  • checks_per_minute = (10 / 20) * 60 = 30 checks/min

Test using this:

def test_fairness_with_sufficient_budget():
    participants = [make_participant(f"p{i}") for i in range(10)]
    budget = calculate_min_budget_for_fairness(10, critical_deadline=20)
    # budget = 30 checks/min

    scenario = run_scenario(participants, budget=budget, duration=60)

    # Expected: All 10 participants checked within 20s
    first_cycle_checks = scenario.timeline[0:4]  # First 20s (4 cycles × 5s)
    all_checked = set()
    for cycle in first_cycle_checks:
        all_checked.update(cycle.checked_users)

    assert len(all_checked) == 10, "All participants should be checked within 20s"

Result: ❌ Test failed! Only 7/10 participants checked in 20s.


Why It Failed

Problem 1: Cycle Quantization

Formula assumed continuous time:

checks_per_second = 10 / 20 = 0.5

Reality uses discrete 5s cycles:

cycle_interval = 5  # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)
                 = int(30 / 60 * 5)
                 = int(2.5)
                 = 2  # Integer truncation!

Actual budget: 2 checks per 5s cycle = 24 checks/min (not 30)
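
A few lines of Python make the loss concrete (assuming the same 5s cycle length as above):

budget = 30                                                 # checks/min from the naive formula
cycle_interval = 5                                          # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)        # int(2.5) == 2
effective_budget = checks_per_cycle / cycle_interval * 60   # 24.0 checks/min, not 30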


Problem 2: Tier Ordering

Formula assumed all participants get checked sequentially. Reality uses tier prioritization:

  • Tier 1 (at deadline): Gets checked first
  • Tier 2 (fairness/never moderated): Only if budget left after Tier 1

With budget=2 per cycle:

  • Cycle 1: Check 2 never-moderated participants → they move into Tier 1 (at deadline)
  • Cycle 2: Part of the budget goes to rechecking those Tier 1 participants, so fewer new ones get checked
  • Cycle 3: Tier 1 keeps growing while the budget stays at 2, so progress on new participants slows further
  • Result: Tier 1 monopolizes the budget, never-moderated participants starve

Iteration 2: The Reactive Fix (First Flip-Flop)

Agent's reaction: "Tests fail because tier ordering is wrong!"

Change 1: Flip tier priority

# Before:
Tier 1: At deadline (staleness prevention)
Tier 2: Never moderated (fairness)

# After:
Tier 1: Never moderated (fairness FIRST)
Tier 2: At deadline (staleness second)

Rationale: "Fairness should come before staleness prevention"

Test:

budget = 30  # Same budget
scenario = run_scenario(participants, budget=30, duration=60)
# Now fairness-first ordering

Result: ✅ Test passed! All 10 checked within 20s.

But new test failed:

def test_staleness_prevention():
    """Participants at deadline should be checked before going stale"""
    # This now fails because fairness beats staleness!

User Intervention #1

"you keep changing the policy without reflecting on this constant changes"

The question: Are you changing the policy based on requirements, or reactively based on test failures?

Agent realized: Policy should be based on requirements (what SHOULD happen), not test outcomes (making tests pass).


Iteration 3: Back to Staleness-First (Second Flip-Flop)

Agent reconsidered:

  • Staleness prevention is more critical (safety requirement)
  • Fairness is important but secondary
  • Revert tier ordering back to staleness-first

But now the fairness test fails again!

Agent's realization: "The budget formula is wrong, not the tier ordering."


Iteration 4: Account for Tier Monopolization

The insight: With staleness-first, once a participant is checked, they stay in Tier 1 (at deadline) until rechecked.

This means:

  • Must check ALL N participants BEFORE the first one needs rechecking
  • Not before the "critical deadline" (20s), but before the "recheck interval" (60s)

Updated formula:

def calculate_min_budget_for_fairness(num_participants, recheck_interval=60):
    """Calculate minimum budget to check all participants before first recheck."""
    cycle_interval = 5  # seconds
    num_cycles_available = recheck_interval / cycle_interval  # 60/5 = 12 cycles

    checks_per_cycle_needed = num_participants / num_cycles_available  # 10/12 = 0.83

    # Convert to checks per minute: (0.83 / 5) * 60 = 10 checks/min
    checks_per_minute = (checks_per_cycle_needed / cycle_interval) * 60

    return checks_per_minute

Example:

  • 10 participants, 60s recheck
  • Need to check all 10 before first recheck (60s)
  • 12 cycles available (60s / 5s)
  • Need 0.83 checks/cycle → 10 checks/min

Test:

budget = calculate_min_budget_for_fairness(10, recheck_interval=60)
# budget = 10 checks/min

scenario = run_scenario(participants, budget=10, duration=60)
# Expected: All 10 checked within 60s (before first recheck)

Result: ❌ Still failed! Only 8/10 checked in 60s.


Why It Still Failed

Integer truncation strikes again:

checks_per_cycle = int(10 / 60 * 5) = int(0.83) = 0

0 checks per cycle! Budget completely wasted by integer truncation.


Iteration 5: Make the Formula TIGHTER (Wrong Direction!)

Agent's reaction: "Budget of 10/min should be sufficient but tests fail. Formula must be too loose!"

Change:

def calculate_min_budget_for_fairness(num_participants, critical_deadline=15):
    # TIGHTER deadline: 15s instead of 60s
    ...
    return 40  # checks/min, much higher!

User Intervention #2

"did you just make the necessary budget in the calculator TIGHTER when you are not successfully running the scenarios with more relaxed budgets?"

The issue: When budget "should be sufficient" but tests fail, agent made the formula STRICTER (requiring more budget) instead of adding MARGIN (accounting for waste).

The question: If 10/min "should work" but doesn't, is the formula wrong OR does it need margin for integer truncation?


Iteration 6: Add Margin for Reality

Agent's realization: Formula calculates theoretical minimum. Reality has waste (integer truncation, tier switching overhead).

Solution: Add margin

def calculate_min_budget_for_fairness(num_participants, recheck_interval=60, margin=1.5):
    """Calculate minimum budget with safety margin."""
    cycle_interval = 5
    num_cycles = recheck_interval / cycle_interval

    # Theoretical minimum
    min_checks_per_cycle = num_participants / num_cycles
    min_checks_per_minute = (min_checks_per_cycle / cycle_interval) * 60

    # Add margin for integer truncation and overhead
    recommended_budget = min_checks_per_minute * margin

    return recommended_budget

Example:

  • 10 participants, 60s recheck
  • Minimum: 10 checks/min
  • With 1.5x margin: 15 checks/min

Test:

budget = calculate_min_budget_for_fairness(10, recheck_interval=60, margin=1.5)
# budget = 15 checks/min

scenario = run_scenario(participants, budget=15, duration=60)

Result: ✅ Test passed! All 10 participants checked within 60s.
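
A quick back-of-the-envelope check shows why the margin absorbs the truncation (assuming the same 5s cycles):

budget = 15                                        # checks/min from the calculator
checks_per_cycle = int(budget / 60 * 5)            # int(1.25) == 1
effective_per_minute = checks_per_cycle / 5 * 60   # 12 checks/min after truncation
# 12 effective checks/min still exceeds the 10 checks/min theoretical minimum,
# so all 10 participants are reached by the 10th cycle (50s < 60s recheck).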


Iteration 7: Rolling Accumulator for Fractional Checks

But another problem surfaced:

Scenario: budget=20 checks/min (fractional: 1.67 checks/cycle)

checks_per_cycle = int(20 / 60 * 5) = int(1.67) = 1

Lost capacity: 0.67 checks/cycle wasted!

  • Provided: 20 checks/min
  • Actually used: 12 checks/min (1 per cycle × 12 cycles)
  • Wasted: 40% of budget!

Solution: Rolling accumulator

class BudgetAccumulator:
    def __init__(self, checks_per_minute):
        self.checks_per_minute = checks_per_minute
        self.accumulated = 0.0

    def get_checks_this_cycle(self, cycle_interval=5):
        # Add fractional checks to accumulator
        self.accumulated += (self.checks_per_minute / 60) * cycle_interval

        # Return integer checks available
        available = int(self.accumulated)
        self.accumulated -= available
        return available

Example:

  • budget=20 checks/min = 1.67 checks per 5s cycle
  • Cycle 1: accumulated=0 + 1.67 = 1.67 → return 1, accumulated=0.67
  • Cycle 2: accumulated=0.67 + 1.67 = 2.34 → return 2, accumulated=0.34
  • Cycle 3: accumulated=0.34 + 1.67 = 2.01 → return 2, accumulated=0.01
  • Pattern: 1, 2, 2, 1, 2, 2, ... → Averages to 1.67 checks/cycle ✅

Result: Budget utilized accurately, no waste from integer truncation.
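
A quick sanity check of the pattern above (assuming the BudgetAccumulator class is in scope):

acc = BudgetAccumulator(checks_per_minute=20)
per_cycle = [acc.get_checks_this_cycle(cycle_interval=5) for _ in range(12)]
print(per_cycle)       # roughly [1, 2, 2, 1, 2, 2, ...] - the fractional budget is preserved
print(sum(per_cycle))  # 20 checks over 12 cycles (one minute), nothing lost to truncation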


Iteration 8: Build Calculator FIRST, Use It in Tests

Final insight: The agent was writing tests with ad-hoc budget assumptions, then building the calculator to justify them. Backwards!

Correct workflow:

  1. Define policy requirements (tier ordering, recheck intervals)
  2. Build budget calculator based on policy
  3. Add safety margin (1.2x - 1.5x)
  4. Use calculator in tests to set expectations

Example:

# tests/test_budget_allocation.py

def test_fairness_with_sufficient_budget():
    participants = [make_participant(f"p{i}") for i in range(10)]

    # USE THE CALCULATOR to determine budget
    budget = calculate_min_budget_for_fairness(
        num_participants=10,
        recheck_interval=60,
        margin=1.5  # Safety margin
    )
    # budget = 15 checks/min

    scenario = run_scenario(participants, budget=budget, duration=60)

    # Now expectation is realistic (based on calculator)
    all_checked = get_all_checked_users(scenario, duration=60)
    assert len(all_checked) == 10

Key change: Tests don't assume arbitrary budgets. They CALCULATE needed budget, then verify it works.


The Lessons

Lesson 1: Build Calculator First, Use It Everywhere

Anti-pattern:

# Test makes up budget
def test_fairness():
    budget = 20  # Seems reasonable?
    ...
    assert all_checked  # Fails!

# Agent fixes by adjusting budget
def test_fairness():
    budget = 30  # Try higher?
    ...
    assert all_checked  # Still fails!

# Repeat until tests pass...

Correct pattern:

# Build calculator based on requirements
def calculate_min_budget(...):
    # Account for cycle quantization
    # Account for tier ordering
    # Add safety margin
    return recommended_budget

# Use calculator in tests
def test_fairness():
    budget = calculate_min_budget(num_participants=10, margin=1.5)
    ...
    assert all_checked  # Passes because budget is correct!

The calculator is the single source of truth.


Lesson 2: Don't React to Test Failures by Changing Policy

Anti-pattern:

Test fails → Change tier ordering
Test fails → Change deadline
Test fails → Change formula
→ Flip-flopping, no stability

Correct pattern:

Test fails → Ask: "Is my expectation realistic?"
            → Debug: What's the actual behavior?
            → Understand: Why does it differ?
            → Fix: The bug OR the test expectation (not the policy)

Policy should be based on requirements, not test outcomes.


Lesson 3: Account for Reality (Cycle Quantization)

Continuous time formulas are optimistic:

# Theory:
checks_per_second = 0.5
checks_per_minute = 30

# Reality (5s cycles):
checks_per_cycle = int(30 / 60 * 5) = 2
actual_checks_per_minute = (2 / 5) * 60 = 24  # Not 30!

Formula must account for discrete cycles:

def calculate_budget(checks_per_minute):
    cycle_interval = 5
    checks_per_cycle = checks_per_minute / 60 * cycle_interval

    # Account for integer truncation
    actual_checks_per_cycle = int(checks_per_cycle)
    actual_budget = (actual_checks_per_cycle / cycle_interval) * 60

    if actual_budget < checks_per_minute:
        # Warn about quantization loss
        print(f"Warning: Requested {checks_per_minute}/min, actual {actual_budget}/min")

    return actual_budget
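
For example, asking this helper for 20 checks/min with 5s cycles surfaces the quantization loss:

calculate_budget(20)
# Warning: Requested 20/min, actual 12.0/min
# returns 12.0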

Lesson 4: Separate Concerns

3 distinct concerns got mixed up:

1. Policy (WHAT should happen)

# Business logic
Tier 1: Staleness prevention (at deadline)
Tier 2: Fairness (never moderated)
Tier 3: Critical urgency (approaching deadline)

2. Budget (HOW MUCH is needed for policy to work)

# Capacity planning
def calculate_min_budget(num_participants, recheck_interval):
    # Given policy, how much budget needed?
    return minimum_budget * margin

3. Tests (VERIFY policy works with given budget)

# Validation
def test_policy_works():
    budget = calculate_min_budget(...)  # Use calculator
    scenario = run_scenario(budget=budget)
    assert policy_invariants_hold(scenario)  # Verify

When test fails:

  • Don't change Policy (unless requirements changed)
  • Don't change Budget formula (unless calculation wrong)
  • DO debug: Is there a bug in implementation?

The Correct Formula (Final Version)

def calculate_min_budget_for_fairness(
    num_participants: int,
    recheck_interval_seconds: int = 60,
    cycle_interval_seconds: int = 5,
    margin: float = 1.5
) -> float:
    """
    Calculate minimum checks/minute needed to ensure all participants
    are checked before first recheck (fairness requirement).

    Args:
        num_participants: Number of participants to check
        recheck_interval_seconds: Time before participant needs recheck (default 60s)
        cycle_interval_seconds: System cycle interval (default 5s)
        margin: Safety margin to account for quantization and overhead (default 1.5x)

    Returns:
        Recommended checks per minute (float)

    Example:
        >>> calculate_min_budget_for_fairness(10, recheck_interval_seconds=60)
        15.0  # 10 participants need 10 checks/min minimum, 15 with 1.5x margin
    """
    # How many cycles available before first recheck?
    num_cycles_available = recheck_interval_seconds / cycle_interval_seconds

    # How many checks per cycle needed?
    checks_per_cycle_needed = num_participants / num_cycles_available

    # Convert to checks per minute
    min_checks_per_minute = (checks_per_cycle_needed / cycle_interval_seconds) * 60

    # Add safety margin for:
    # - Integer truncation in checks_per_cycle
    # - Tier switching overhead
    # - Variance in participant arrival times
    recommended_checks_per_minute = min_checks_per_minute * margin

    return recommended_checks_per_minute

Usage:

# For 10 participants needing 60s recheck:
budget = calculate_min_budget_for_fairness(10, recheck_interval_seconds=60, margin=1.5)
# Returns: 15.0 checks/min

# For 50 participants needing 30s recheck:
budget = calculate_min_budget_for_fairness(50, recheck_interval_seconds=30, margin=1.2)
# Returns: 120.0 checks/min (50 participants × 2 checks/min × 1.2 margin)

Real-World Application

Use Case 1: API Rate Limiting

Problem: How many requests/second needed to process N items within deadline D?

Formula:

def calculate_min_rps(num_items, deadline_seconds, margin=1.5):
    min_rps = num_items / deadline_seconds
    recommended_rps = min_rps * margin
    return recommended_rps

But account for:

  • Batch size quantization (API processes in batches of 100)
  • Network latency overhead
  • Retry margin
  • Concurrent request limits
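
A sketch of what that can look like in practice (batch_size and avg_latency_seconds are illustrative assumptions, not part of the formula above):

import math

def calculate_min_rps_with_batching(num_items, deadline_seconds, batch_size=100,
                                    avg_latency_seconds=0.05, margin=1.5):
    # Requests are really batches, so quantize throughput to whole batches
    num_batches = math.ceil(num_items / batch_size)
    # Leave room for one round-trip of latency before the deadline
    effective_deadline = max(deadline_seconds - avg_latency_seconds, 1e-9)
    min_batch_rps = num_batches / effective_deadline
    # Margin covers retries and jitter from concurrent request limits
    return min_batch_rps * margin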

Use Case 2: Worker Pool Sizing

Problem: How many workers needed to process N jobs within SLA S?

Formula:

import math

def calculate_min_workers(num_jobs, sla_seconds, avg_job_duration, margin=1.5):
    # Continuous-time minimum
    min_workers = (num_jobs * avg_job_duration) / sla_seconds

    # Account for discrete worker count
    min_workers = math.ceil(min_workers)

    # Add margin for variance
    recommended_workers = int(min_workers * margin)

    return recommended_workers
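
For example (hypothetical numbers):

calculate_min_workers(num_jobs=1000, sla_seconds=300, avg_job_duration=2, margin=1.5)
# ceil(1000 * 2 / 300) = 7 workers minimum -> int(7 * 1.5) = 10 workers recommended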

Margin accounts for:

  • Job duration variance (some take 2x average)
  • Worker startup time
  • Occasional failures requiring retries

Use Case 3: Cache Sizing

Problem: How much cache needed to keep N items with TTL T?

Formula:

def calculate_min_cache_size(num_items, ttl_seconds, request_rate_per_sec, margin=2.0):
    # Items alive at any time (arrival rate × TTL), capped by the number of distinct items
    items_in_cache = min(num_items, request_rate_per_sec * ttl_seconds)

    # Add margin for spikes
    recommended_size = items_in_cache * margin

    return recommended_size

Margin accounts for:

  • Traffic spikes (2x normal)
  • Non-uniform access patterns
  • Cascading failures if cache full

Debugging Budget Issues: A Checklist

When tests fail with "sufficient" budget:

1. Check Integer Truncation

# Calculate what you're actually getting
budget = 20  # checks/min
cycle_interval = 5  # seconds
checks_per_cycle = int(budget / 60 * cycle_interval)
actual_budget = (checks_per_cycle / cycle_interval) * 60

if actual_budget != budget:
    print(f"Truncation loss: {budget} requested, {actual_budget} actual")

2. Check Cycle Quantization

# Ensure formula accounts for discrete cycles
# DON'T: continuous_time_formula(participants, deadline)
# DO: discrete_cycle_formula(participants, num_cycles, cycle_interval)

3. Check Tier Monopolization

# Verify tier ordering doesn't cause starvation
# If staleness-first: Must check ALL before first recheck
# If fairness-first: Must have enough budget after fairness for staleness

4. Check Margin Sufficiency

# Is 1.5x margin enough?
# Try 2.0x - if tests pass, margin was the issue
# If tests still fail, there's a real bug (not just insufficient budget)
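
A minimal sketch of that check, assuming the participants list, run_scenario, and get_all_checked_users helpers from earlier and the final calculator above:

for margin in (1.2, 1.5, 2.0):
    budget = calculate_min_budget_for_fairness(10, recheck_interval_seconds=60, margin=margin)
    scenario = run_scenario(participants, budget=budget, duration=60)
    checked = get_all_checked_users(scenario, duration=60)
    print(f"margin={margin}: budget={budget}, checked={len(checked)}/10")
# If only the larger margins reach 10/10, the margin was the issue.
# If none of them do, there's a real bug - go debug the implementation.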

5. Check Formula Matches Policy

# Does formula assume fairness-first but policy is staleness-first?
# Formula must match actual tier ordering

The Flip-Flop Timeline (Summary)

Iteration | Change | Reason | Outcome
1 | Formula: checks_per_minute = (N / deadline) * 60 | Initial attempt | ❌ Failed (quantization)
2 | Flip tier order: fairness-first | Tests failed, assumed policy wrong | ✅ Fairness test passed, ❌ staleness test failed
3 | Flip back: staleness-first | User intervention: stop flip-flopping | ❌ Fairness test failed again
4 | Formula: checks before recheck (not deadline) | Realized monopolization issue | ❌ Still failed (truncation)
5 | Make formula TIGHTER (wrong direction!) | Tests failed at "sufficient" budget | ❌ User intervention: wrong direction
6 | Add 1.5x margin | Account for waste | ✅ Tests passed!
7 | Add rolling accumulator | Eliminate truncation waste | ✅ Budget utilized fully
8 | Build calculator FIRST, use in tests | Correct workflow | ✅ Stable, correct

Total iterations: 8
User interventions: 3
Time wasted: ~2 hours
Time with correct approach: ~30 minutes

Lesson: Build calculator first, stop flip-flopping, add margin.


Conclusion

What we learned:

  • ✅ Build budget calculator FIRST, then use it to validate tests
  • ✅ Don't react to test failures by changing policy
  • ✅ Account for reality: quantization, truncation, margin
  • ✅ Separate concerns: Policy vs Budget vs Tests

The anti-pattern:

Write test → Guess budget → Test fails → Change policy → Test fails → Change formula → ...

The correct pattern:

Define policy → Build calculator → Add margin → Use in tests → Debug bugs (not policy)

Would we make this mistake again? Probably not! The lesson was learned through painful iteration.

Next time:

  1. Define policy requirements clearly upfront
  2. Build budget calculator based on policy
  3. Add realistic margin (1.5x - 2.0x)
  4. Use calculator in ALL tests
  5. If tests fail with calculator-provided budget, it's a BUG (not wrong budget)

Series Conclusion

We've covered 6 aspects of multi-agent AI development:

  1. Part 1: Can 5 Claude Code Agents Work Independently? - The optimistic hypothesis
  2. Part 2: The Reality of Autonomous Development - Human orchestration required (31% autonomy)
  3. Part 3: Property-Based Testing with Hypothesis - The data you're throwing away
  4. Part 4: Zero-Conflict Architecture - File-level ownership (100% auto-merge)
  5. Part 5: Communication Protocols for AI Agents - 4 iterations to file-based messaging
  6. Part 6: The Budget Calculator Paradox - Build it first, use it everywhere

Overall lessons:

  • Zero-conflict architecture works (100% auto-merge)
  • Human-AI collaboration > pure autonomy (orchestration essential)
  • Verification before coding (model introspection prevents wasted effort)
  • Knowledge preservation (capture Hypothesis shrunken cases)
  • Build calculators first (don't guess in tests)
  • Explicit communication (templates, commands, not assumptions)

Was it worth it? Absolutely. 75% time savings despite 12.5% orchestration overhead. Would do it again with lessons learned.


Tags: #budget-calculator #capacity-planning #testing #formulas #quantization #margin #lessons-learned


This is Part 6 (Final) of the Multi-Agent Development Series.

Discussion: Have you struggled with capacity formulas? Do you build calculators first or adjust tests reactively? What's your approach to margin and safety factors? Share in the comments!
