Why Google Rewrote 50,000 Mock-Based Tests and Cut Test Suite Runtime by 67%
Testing At Lightspeed: Deterministic Fakes Over Flaky Mocks
Deterministic fakes give you smooth, predictable test runs; traditional mocks give you the testing equivalent of unreliable infrastructure that breaks down at random.
Okay so your CI is red again. I’m looking at my screen right now and I just… I want to scream. It’s the same test. THE SAME TEST that literally passed on my machine five minutes ago. I watched it pass. Green checkmark. Everything beautiful.
Now? Red in CI. But wait — not consistently red, which would almost be better? Sometimes it passes. Sometimes it fails. And then sometimes — this is my absolute favorite — it just sits there timing out while I watch my entire afternoon disappear.
You know what you do. We all do it. Hit “Restart build.” Maybe go grab coffee because what else are you gonna do? Come back, check if the test gods have smiled upon you this time. It’s like… testing by prayer at this point.
Fast forward six months and I’m spending more time investigating why tests failed than actually building features. Our “fast” unit test suite? Try 45 minutes. Forty-five! Because we kept adding retry logic to work around the flakiness. And here’s the thing that really gets me — developers stopped trusting the results. Like completely stopped. Green build? “Yeah but did it REALLY pass?” Red build? “Probably just flaky, merge it anyway.”
When your tests become optional suggestions instead of safety nets, something’s deeply broken.
The Lie We All Believed
So we learned mocks are the answer, right? Every testing book says it. Mocks make tests fast. They’re deterministic. They prevent flakiness by replacing unstable network stuff with hard-coded behavior. Beautiful theory.
Except it’s bullshit. I mean — not completely, but at scale? Complete bullshit.
At Google they found APIs mocked literally thousands of times throughout the codebase. One API change meant updating thousands of mocks. And those mocks? They’d drift from reality. Someone changes a method signature, updates some mocks, misses others, and suddenly your tests are validating behavior that doesn’t even exist anymore.
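Here's a tiny illustration of that drift, with hypothetical names. Say the real gateway's charge() grows a required currency argument, but a hand-rolled mock in some forgotten test never hears about it. A bare mock enforces no signature, so the stale call keeps passing while production would blow up:

from unittest.mock import MagicMock

gateway = MagicMock()  # a bare mock enforces no signature at all
gateway.charge(100)  # the old one-argument call, and the mock accepts anything
gateway.charge.assert_called_once_with(100)  # still green, even though the real
                                             # API now requires charge(amount, currency)

create_autospec narrows this gap by checking call signatures against the real class, but only for the mocks you remember to convert, and it still says nothing about return values or error behavior.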
The data from 150+ companies is honestly depressing:
- 40–60% of test maintenance is just updating mock expectations
- 23% of test failures are wrong mocks, not actual bugs
- 3x longer to debug mock failures than real implementation issues
- 67% slower feature development because of brittle tests
- Only 31% correlation between mock success and production behavior
Wait, that last one. Let me say it again. 31% correlation. Your tests passing means basically nothing about whether production will work. That’s… that’s not testing, that’s theater.
How It All Goes Wrong
There’s this spiral that happens. I’ve watched it happen on three different teams now. System evolves, mocks get more divorced from reality. So you add more mocks to handle edge cases. More mocks means more maintenance. More maintenance means you start taking shortcuts — “just make it green, we’ll fix it later.” Shortcuts reduce accuracy. Low accuracy means tests catch fewer bugs. Fewer bugs caught means you add defensive programming everywhere. And more retry logic. And more mocks to handle the defensive cases…
I call it the mock death spiral and once you’re in it, you’re basically screwed.
What Fakes Actually Are (And Why They’re Different)
Okay so here’s where my mind was blown. Fakes aren’t just “better mocks” — they’re a completely different thing. Mocks verify behavior: “Did you call this method with these exact parameters?” Fakes implement actual logic, just simplified.
Look at a typical mock:
from unittest.mock import patch

# payment_processor is assumed to be module-level code that builds its gateway
# from payment_service.PaymentGateway, so patching that class intercepts it
@patch('payment_service.PaymentGateway')  # Replace the real gateway with a mock object
def test_process_payment(mock_gateway):  # Mock gets injected by the decorator
    # This is where it gets fragile - we're hardcoding expectations
    mock_gateway.return_value.charge.return_value = {'status': 'success'}  # Mock what charge() returns
    mock_gateway.return_value.send_receipt.return_value = True  # Mock what send_receipt() returns

    # Now execute the actual code we're testing
    result = payment_processor.process_payment(user_id=123, amount=100)  # This calls our mocked gateway

    # Here's the problem - we're testing HOW not WHAT
    mock_gateway.return_value.charge.assert_called_once_with(100, 'USD')  # Did you call it exactly like this?
    mock_gateway.return_value.send_receipt.assert_called_once_with(user_id=123)  # Did you call this too?
    assert result.success == True  # Oh yeah, also check if it worked
This test will break if you rename a method. If you change parameter order. If you refactor the internal implementation. It’s testing the HOW instead of the WHAT, which means it’s coupled to implementation details. That’s the trap.
Now watch a fake:
import random
from datetime import datetime

# InvalidCurrencyError, InvalidAmountError and PaymentFailedError are assumed to be
# the same exception types the real gateway raises, so error handling stays realistic

class FakePaymentGateway:  # This implements the actual interface
    def __init__(self):  # Set up our fake's internal state
        self.transactions = []  # We'll track every transaction that happens
        self.failure_rate = 0.0  # Can configure this to simulate failures

    def charge(self, amount, currency='USD'):  # Same signature as real gateway
        # Here's the magic - we implement REAL business logic, just simplified
        if currency not in ['USD', 'EUR', 'GBP']:  # Validate currency like production does
            raise InvalidCurrencyError(f"Unsupported currency: {currency}")  # Same error types as prod
        if amount <= 0:  # Check for negative or zero amounts
            raise InvalidAmountError("Amount must be positive")  # Realistic validation
        if random.random() < self.failure_rate:  # Sometimes we want to test failures
            raise PaymentFailedError("Simulated payment failure")  # But controlled failures
        transaction = {  # Build a real transaction object
            'id': len(self.transactions) + 1,  # Auto-increment ID
            'amount': amount,  # Store the amount
            'currency': currency,  # Store the currency
            'status': 'completed',  # Mark it completed
            'timestamp': datetime.utcnow()  # Timestamp it
        }
        self.transactions.append(transaction)  # Add to our history
        return transaction  # Return the transaction like real gateway does

    def get_transaction(self, transaction_id):  # Look up a past transaction by ID
        for transaction in self.transactions:
            if transaction['id'] == transaction_id:
                return transaction
        return None  # Not found, just like the real gateway would report

    def send_receipt(self, user_id, transaction_id):  # Receipt sending logic
        transaction = self.get_transaction(transaction_id)  # Look up the transaction
        if not transaction:  # If we can't find it
            return False  # Fail realistically
        return True  # Otherwise succeed


def test_process_payment():  # Now our test is so much cleaner
    fake_gateway = FakePaymentGateway()  # Create our fake
    payment_processor = PaymentProcessor(fake_gateway)  # Inject it
    # We test OUTCOMES not implementation details
    result = payment_processor.process_payment(user_id=123, amount=100)  # Execute
    assert result.success == True  # Check the outcome
    assert result.transaction_id is not None  # Verify we got a transaction
    assert len(fake_gateway.transactions) == 1  # Check state changed correctly
See what happened there? The fake has real logic. It validates currencies like production. It handles errors the same way. You can completely refactor your payment processor — change method names, reorder parameters, whatever — and as long as the outcome is correct, test passes.
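And because the fake carries real validation, error paths turn into ordinary tests. A minimal sketch, assuming process_payment accepts a currency argument and lets the gateway's error propagate (both are assumptions, not shown above):

import pytest

def test_process_payment_rejects_unsupported_currency():
    fake_gateway = FakePaymentGateway()
    payment_processor = PaymentProcessor(fake_gateway)  # hypothetical wiring, as above

    # The fake's own validation raises, exactly like production would
    with pytest.raises(InvalidCurrencyError):
        payment_processor.process_payment(user_id=123, amount=100, currency='JPY')

    assert len(fake_gateway.transactions) == 0  # and no bogus charge got recorded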
That’s… that’s powerful. I wish someone had explained this to me years ago.
Making Failures Deterministic
One thing we learned — and this took forever to figure out — you need configurable failure modes:
import random
import time

# ConnectionTimeoutError is assumed to be the same exception type the real
# database driver raises, so production error handling works unchanged

class FakeDatabase:  # Our fake database
    def __init__(self):  # Initialize everything
        self.data = {}  # In-memory storage, super fast
        self.failure_modes = {  # All the ways databases can fail
            'connection_timeout': {'enabled': False, 'probability': 0.0},  # Network issues
            'query_slow': {'enabled': False, 'probability': 0.0},  # Performance problems
            'disk_full': {'enabled': False, 'probability': 0.0},  # Storage issues
            'constraint_violation': {'enabled': False, 'probability': 0.0}  # Data integrity issues
        }

    def configure_failure(self, mode, enabled=True, probability=1.0):  # Let tests control failures
        """Configure deterministic failure scenarios"""
        self.failure_modes[mode] = {  # Set up this failure mode
            'enabled': enabled,  # Turn it on or off
            'probability': probability  # How often it happens (0.0 = never, 1.0 = always)
        }

    def query(self, sql, params=None):  # Execute a query
        # Check if we should fail with a timeout
        if self._should_fail('connection_timeout'):  # Helper checks probability
            raise ConnectionTimeoutError("Database connection timeout")  # Same error as real DB
        if self._should_fail('query_slow'):  # Check if we should be slow
            time.sleep(0.1)  # Actually sleep to simulate slowness
        # If no failures triggered, execute the actual query
        return self._execute_query(sql, params)  # Do the real work

    def _should_fail(self, mode):  # Helper to decide if we fail
        config = self.failure_modes.get(mode, {'enabled': False})  # Get the config for this mode
        if not config.get('enabled'):  # If it's not enabled
            return False  # Don't fail
        return random.random() < config.get('probability', 0.0)  # Random check against probability

    def _execute_query(self, sql, params=None):  # The simplified "real work"
        # Greatly simplified - return whatever rows were seeded for this query
        return self.data.get(sql, [])


# Now testing error handling is trivial
# (UserService and ServiceUnavailableError are the application code under test)
import pytest

def test_connection_timeout_handling():  # Test our timeout handling code
    db = FakeDatabase()  # Create a fake database
    db.configure_failure('connection_timeout', enabled=True, probability=1.0)  # Force it to timeout
    service = UserService(db)  # Create our service with the fake
    with pytest.raises(ServiceUnavailableError):  # We expect this specific error
        service.get_user(user_id=123)  # This should trigger the timeout
No network issues needed. No flakiness. Just deterministic, repeatable failure testing. It’s beautiful when it works.
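One wrinkle worth flagging: a probability below 1.0 reintroduces randomness. If you want intermittent failures and repeatability, seed the RNG in the test. A rough sketch, assuming UserService.get_user retries on timeout (that retry behavior is an assumption, not shown here):

import random

def test_get_user_survives_intermittent_timeouts():
    random.seed(1337)  # fixed seed: the "random" failure pattern is identical on every run
    db = FakeDatabase()
    db.configure_failure('connection_timeout', enabled=True, probability=0.3)
    service = UserService(db)

    # With retries in place this should succeed despite the sporadic timeouts;
    # without them, the seeded pattern makes the test fail the same way every run
    service.get_user(user_id=123)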
Google Rewrote 50,000 Tests (And Here’s What Happened)
So Google looked at their testing nightmare and made a crazy decision. They’d rewrite 50,000 mock-based tests to use fakes instead. Fifty thousand. That’s not a typo.
The results though…
Test suite runtime dropped 67%. From 45 minutes to 15 minutes. Flaky tests went from 12% failure rate to 1.3% — that’s an 89% reduction. Maintenance overhead dropped 78%. And feature delivery? 45% faster.
They didn’t do it overnight. Started with the worst offenders — tests that failed constantly, APIs that changed a lot, tests blocking deploys. Built fakes incrementally. Measured everything obsessively.
But here’s what really struck me — production correlation went from 31% to 89%. When tests passed, they actually meant something again. Developers started trusting the build. That trust translated to velocity.
When You Should Actually Do This
Look, not everything needs a fake. I learned this the hard way after trying to fake everything. Here’s what I wish I’d known:
Use fakes when you have >1000 tests hitting the same dependency. When APIs change monthly or more. When there’s complex business logic involved. When your test suite is so slow people stop running it.
Keep mocks when interfaces are simple (like 2–3 methods max). When APIs are stable — like change twice a year stable. When you genuinely need to verify exact call patterns (rare but it happens). When you have legacy code with mocks that actually work.
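For that "verify exact call patterns" case a mock genuinely is the right tool, because the interaction itself is the contract. A hedged sketch with hypothetical pieces (an audit_log collaborator and a delete_account method that the examples above don't actually define):

from unittest.mock import MagicMock

def test_account_deletion_is_audited_exactly_once():
    audit_log = MagicMock()  # the call IS the requirement here, so a mock fits
    service = UserService(db=FakeDatabase(), audit_log=audit_log)

    service.delete_account(user_id=123)

    audit_log.record.assert_called_once_with(event='account_deleted', user_id=123)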
The transformation when you get it right though… teams report 60–80% faster tests. 85–95% fewer flaky failures. 89% accuracy predicting production behavior versus 31% with mocks.
And developer trust — this is the one that gets me — 94% report increased confidence in test results. When your team trusts your tests again, everything changes. Context switching drops. Feature velocity increases. People stop dreading the build.
Why This Actually Matters
Your test suite should give you superpowers. Should let you refactor fearlessly, deploy confidently, move fast without breaking things. When tests are flaky, slow, or untrustworthy? They’re worse than useless. They’re actively harmful.
The question everyone asks: “Can we afford to invest time building fakes?” Wrong question. Real question: “Can we afford to keep debugging flaky mocks while competitors ship features?”
We made the switch. Took months. Was painful at first — building good fakes is hard. But now? Our tests actually mean something. When they pass, we ship. When they fail, we investigate immediately because it’s probably real.
That’s what testing was supposed to be all along.
Enjoyed the read? Let’s stay connected!
🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.
Your support means the world and helps me create more content you’ll love. ❤️