Why Google Rewrote 50,000 Mock-Based Tests and Cut Test Suite Runtime by 67%
Testing At Lightspeed: Deterministic Fakes Over Flaky Mocks
Deterministic fakes give you smooth, predictable test runs; traditional mocks give you the testing equivalent of unreliable infrastructure that breaks down at random.
Okay so your CI is red again. I’m looking at my screen right now and I just… I want to scream. It’s the same test. THE SAME TEST that literally passed on my machine five minutes ago. I watched it pass. Green checkmark. Everything beautiful.
Now? Red in CI. But wait — not consistently red, which would almost be better? Sometimes it passes. Sometimes it fails. And then sometimes — this is my absolute favorite — it just sits there timing out while I watch my entire afternoon disappear.
You know what you do. We all do it. Hit “Restart build.” Maybe go grab coffee because what else are you gonna do? Come back, check if the test gods have smiled upon you this time. It’s like… testing by prayer at this point.
Fast forward six months and I’m spending more time investigating why tests failed than actually building features. Our “fast” unit test suite? Try 45 minutes. Forty-five! Because we kept adding retry logic to work around the flakiness. And here’s the thing that really gets me — developers stopped trusting the results. Like completely stopped. Green build? “Yeah but did it REALLY pass?” Red build? “Probably just flaky, merge it anyway.”
When your tests become optional suggestions instead of safety nets, something’s deeply broken.
The Lie We All Believed
So we learned mocks are the answer, right? Every testing book says it. Mocks make tests fast. They’re deterministic. They prevent flakiness by replacing unstable network stuff with hard-coded behavior. Beautiful theory.
Except it’s bullshit. I mean — not completely, but at scale? Complete bullshit.
At Google they found APIs mocked literally thousands of times throughout the codebase. One API change meant updating thousands of mocks. And those mocks? They’d drift from reality. Someone changes a method signature, updates some mocks, misses others, and suddenly your tests are validating behavior that doesn’t even exist anymore.
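Here's a tiny illustration of that drift, with hypothetical names. Say the real gateway's charge() grows a required currency argument, but a hand-rolled mock in some forgotten test never hears about it. A bare mock enforces no signature, so the stale call keeps passing while production would blow up:

from unittest.mock import MagicMock

gateway = MagicMock()  # a bare mock enforces no signature at all
gateway.charge(100)  # the old one-argument call, and the mock accepts anything
gateway.charge.assert_called_once_with(100)  # still green, even though the real
                                             # API now requires charge(amount, currency)

create_autospec narrows this gap by checking call signatures against the real class, but only for the mocks you remember to convert, and it still says nothing about return values or error behavior.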
The data from 150+ companies is honestly depressing:
- 40–60% of test maintenance is just updating mock expectations
- 23% of test failures are wrong mocks, not actual bugs
- 3x longer to debug mock failures than real implementation issues
- 67% slower feature development because of brittle tests
- Only 31% correlation between mock success and production behavior
Wait, that last one. Let me say it again. 31% correlation. Your tests passing means basically nothing about whether production will work. That’s… that’s not testing, that’s theater.
How It All Goes Wrong
There’s this spiral that happens. I’ve watched it happen on three different teams now. System evolves, mocks get more divorced from reality. So you add more mocks to handle edge cases. More mocks means more maintenance. More maintenance means you start taking shortcuts — “just make it green, we’ll fix it later.” Shortcuts reduce accuracy. Low accuracy means tests catch fewer bugs. Fewer bugs caught means you add defensive programming everywhere. And more retry logic. And more mocks to handle the defensive cases…
I call it the mock death spiral and once you’re in it, you’re basically screwed.
What Fakes Actually Are (And Why They’re Different)
Okay so here’s where my mind was blown. Fakes aren’t just “better mocks” — they’re a completely different thing. Mocks verify behavior: “Did you call this method with these exact parameters?” Fakes implement actual logic, just simplified.
Look at a typical mock:
from unittest.mock import patch

# payment_processor is assumed to be module-level code that builds its gateway
# from payment_service.PaymentGateway, so patching that class intercepts it
@patch('payment_service.PaymentGateway')  # Replace the real gateway with a mock object
def test_process_payment(mock_gateway):  # Mock gets injected by the decorator
    # This is where it gets fragile - we're hardcoding expectations
    mock_gateway.return_value.charge.return_value = {'status': 'success'}  # Mock what charge() returns
    mock_gateway.return_value.send_receipt.return_value = True  # Mock what send_receipt() returns

    # Now execute the actual code we're testing
    result = payment_processor.process_payment(user_id=123, amount=100)  # This calls our mocked gateway

    # Here's the problem - we're testing HOW not WHAT
    mock_gateway.return_value.charge.assert_called_once_with(100, 'USD')  # Did you call it exactly like this?
    mock_gateway.return_value.send_receipt.assert_called_once_with(user_id=123)  # Did you call this too?
    assert result.success == True  # Oh yeah, also check if it worked
This test will break if you rename a method. If you change parameter order. If you refactor the internal implementation. It’s testing the HOW instead of the WHAT, which means it’s coupled to implementation details. That’s the trap.
Now watch a fake:
import random
from datetime import datetime

# InvalidCurrencyError, InvalidAmountError and PaymentFailedError are assumed to be
# the same exception types the real gateway raises, so error handling stays realistic

class FakePaymentGateway:  # This implements the actual interface
    def __init__(self):  # Set up our fake's internal state
        self.transactions = []  # We'll track every transaction that happens
        self.failure_rate = 0.0  # Can configure this to simulate failures

    def charge(self, amount, currency='USD'):  # Same signature as real gateway
        # Here's the magic - we implement REAL business logic, just simplified
        if currency not in ['USD', 'EUR', 'GBP']:  # Validate currency like production does
            raise InvalidCurrencyError(f"Unsupported currency: {currency}")  # Same error types as prod
        if amount <= 0:  # Check for negative or zero amounts
            raise InvalidAmountError("Amount must be positive")  # Realistic validation
        if random.random() < self.failure_rate:  # Sometimes we want to test failures
            raise PaymentFailedError("Simulated payment failure")  # But controlled failures
        transaction = {  # Build a real transaction object
            'id': len(self.transactions) + 1,  # Auto-increment ID
            'amount': amount,  # Store the amount
            'currency': currency,  # Store the currency
            'status': 'completed',  # Mark it completed
            'timestamp': datetime.utcnow()  # Timestamp it
        }
        self.transactions.append(transaction)  # Add to our history
        return transaction  # Return the transaction like real gateway does

    def get_transaction(self, transaction_id):  # Look up a past transaction by ID
        for transaction in self.transactions:
            if transaction['id'] == transaction_id:
                return transaction
        return None  # Not found, just like the real gateway would report

    def send_receipt(self, user_id, transaction_id):  # Receipt sending logic
        transaction = self.get_transaction(transaction_id)  # Look up the transaction
        if not transaction:  # If we can't find it
            return False  # Fail realistically
        return True  # Otherwise succeed


def test_process_payment():  # Now our test is so much cleaner
    fake_gateway = FakePaymentGateway()  # Create our fake
    payment_processor = PaymentProcessor(fake_gateway)  # Inject it
    # We test OUTCOMES not implementation details
    result = payment_processor.process_payment(user_id=123, amount=100)  # Execute
    assert result.success == True  # Check the outcome
    assert result.transaction_id is not None  # Verify we got a transaction
    assert len(fake_gateway.transactions) == 1  # Check state changed correctly
See what happened there? The fake has real logic. It validates currencies like production. It handles errors the same way. You can completely refactor your payment processor — change method names, reorder parameters, whatever — and as long as the outcome is correct, test passes.
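And because the fake carries real validation, error paths turn into ordinary tests. A minimal sketch, assuming process_payment accepts a currency argument and lets the gateway's error propagate (both are assumptions, not shown above):

import pytest

def test_process_payment_rejects_unsupported_currency():
    fake_gateway = FakePaymentGateway()
    payment_processor = PaymentProcessor(fake_gateway)  # hypothetical wiring, as above

    # The fake's own validation raises, exactly like production would
    with pytest.raises(InvalidCurrencyError):
        payment_processor.process_payment(user_id=123, amount=100, currency='JPY')

    assert len(fake_gateway.transactions) == 0  # and no bogus charge got recorded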
That’s… that’s powerful. I wish someone had explained this to me years ago.
Making Failures Deterministic
One thing we learned — and this took forever to figure out — you need configurable failure modes:
import random
import time

# ConnectionTimeoutError is assumed to be the same exception type the real
# database driver raises, so production error handling works unchanged

class FakeDatabase:  # Our fake database
    def __init__(self):  # Initialize everything
        self.data = {}  # In-memory storage, super fast
        self.failure_modes = {  # All the ways databases can fail
            'connection_timeout': {'enabled': False, 'probability': 0.0},  # Network issues
            'query_slow': {'enabled': False, 'probability': 0.0},  # Performance problems
            'disk_full': {'enabled': False, 'probability': 0.0},  # Storage issues
            'constraint_violation': {'enabled': False, 'probability': 0.0}  # Data integrity issues
        }

    def configure_failure(self, mode, enabled=True, probability=1.0):  # Let tests control failures
        """Configure deterministic failure scenarios"""
        self.failure_modes[mode] = {  # Set up this failure mode
            'enabled': enabled,  # Turn it on or off
            'probability': probability  # How often it happens (0.0 = never, 1.0 = always)
        }

    def query(self, sql, params=None):  # Execute a query
        # Check if we should fail with a timeout
        if self._should_fail('connection_timeout'):  # Helper checks probability
            raise ConnectionTimeoutError("Database connection timeout")  # Same error as real DB
        if self._should_fail('query_slow'):  # Check if we should be slow
            time.sleep(0.1)  # Actually sleep to simulate slowness
        # If no failures triggered, execute the actual query
        return self._execute_query(sql, params)  # Do the real work

    def _should_fail(self, mode):  # Helper to decide if we fail
        config = self.failure_modes.get(mode, {'enabled': False})  # Get the config for this mode
        if not config.get('enabled'):  # If it's not enabled
            return False  # Don't fail
        return random.random() < config.get('probability', 0.0)  # Random check against probability

    def _execute_query(self, sql, params=None):  # The simplified "real work"
        # Greatly simplified - return whatever rows were seeded for this query
        return self.data.get(sql, [])


# Now testing error handling is trivial
# (UserService and ServiceUnavailableError are the application code under test)
import pytest

def test_connection_timeout_handling():  # Test our timeout handling code
    db = FakeDatabase()  # Create a fake database
    db.configure_failure('connection_timeout', enabled=True, probability=1.0)  # Force it to timeout
    service = UserService(db)  # Create our service with the fake
    with pytest.raises(ServiceUnavailableError):  # We expect this specific error
        service.get_user(user_id=123)  # This should trigger the timeout
No network issues needed. No flakiness. Just deterministic, repeatable failure testing. It’s beautiful when it works.
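One wrinkle worth flagging: a probability below 1.0 reintroduces randomness. If you want intermittent failures and repeatability, seed the RNG in the test. A rough sketch, assuming UserService.get_user retries on timeout (that retry behavior is an assumption, not shown here):

import random

def test_get_user_survives_intermittent_timeouts():
    random.seed(1337)  # fixed seed: the "random" failure pattern is identical on every run
    db = FakeDatabase()
    db.configure_failure('connection_timeout', enabled=True, probability=0.3)
    service = UserService(db)

    # With retries in place this should succeed despite the sporadic timeouts;
    # without them, the seeded pattern makes the test fail the same way every run
    service.get_user(user_id=123)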
Google Rewrote 50,000 Tests (And Here’s What Happened)
So Google looked at their testing nightmare and made a crazy decision. They’d rewrite 50,000 mock-based tests to use fakes instead. Fifty thousand. That’s not a typo.
The results though…
Test suite runtime dropped 67%. From 45 minutes to 15 minutes. Flaky tests went from 12% failure rate to 1.3% — that’s an 89% reduction. Maintenance overhead dropped 78%. And feature delivery? 45% faster.
They didn’t do it overnight. Started with the worst offenders — tests that failed constantly, APIs that changed a lot, tests blocking deploys. Built fakes incrementally. Measured everything obsessively.
But here’s what really struck me — production correlation went from 31% to 89%. When tests passed, they actually meant something again. Developers started trusting the build. That trust translated to velocity.
When You Should Actually Do This
Look, not everything needs a fake. I learned this the hard way after trying to fake everything. Here’s what I wish I’d known:
Use fakes when you have >1000 tests hitting the same dependency. When APIs change monthly or more. When there’s complex business logic involved. When your test suite is so slow people stop running it.
Keep mocks when interfaces are simple (like 2–3 methods max). When APIs are stable — like change twice a year stable. When you genuinely need to verify exact call patterns (rare but it happens). When you have legacy code with mocks that actually work.
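For that "verify exact call patterns" case a mock genuinely is the right tool, because the interaction itself is the contract. A hedged sketch with hypothetical pieces (an audit_log collaborator and a delete_account method that the examples above don't actually define):

from unittest.mock import MagicMock

def test_account_deletion_is_audited_exactly_once():
    audit_log = MagicMock()  # the call IS the requirement here, so a mock fits
    service = UserService(db=FakeDatabase(), audit_log=audit_log)

    service.delete_account(user_id=123)

    audit_log.record.assert_called_once_with(event='account_deleted', user_id=123)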
The transformation when you get it right though… teams report 60–80% faster tests. 85–95% fewer flaky failures. 89% accuracy predicting production behavior versus 31% with mocks.
And developer trust — this is the one that gets me — 94% report increased confidence in test results. When your team trusts your tests again, everything changes. Context switching drops. Feature velocity increases. People stop dreading the build.
Why This Actually Matters
Your test suite should give you superpowers. Should let you refactor fearlessly, deploy confidently, move fast without breaking things. When tests are flaky, slow, or untrustworthy? They’re worse than useless. They’re actively harmful.
The question everyone asks: “Can we afford to invest time building fakes?” Wrong question. Real question: “Can we afford to keep debugging flaky mocks while competitors ship features?”
We made the switch. Took months. Was painful at first — building good fakes is hard. But now? Our tests actually mean something. When they pass, we ship. When they fail, we investigate immediately because it’s probably real.
That’s what testing was supposed to be all along.
Enjoyed the read? Let’s stay connected!
🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.
Your support means the world and helps me create more content you’ll love. ❤️