Yesterday morning started with a Slack DM from my boss: "Kai, can you take a look before standup? Might be the retry logic acting up again."
My stomach dropped. Because that retry logic? I wrote it. At 2 AM. Three months ago. After three cà phê sữa đá.
The Symptom
Payment service was throwing timeout errors in production. Users couldn't check out. Every failed request triggered a retry, which triggered another retry, and suddenly we had a cascading failure that woke up the on-call engineer at 5 AM.
Classic, right? Must be my retry logic being too aggressive.
The Investigation
I pulled up the logs, expecting to see my code frantically retrying into oblivion. Instead, I saw something weird:
[payment-gateway] timeout after 5000ms
[payment-gateway] retry 1/3 → timeout after 5000ms
[payment-gateway] retry 2/3 → timeout after 5000ms
[payment-gateway] retry 3/3 → timeout after 5000ms
All timeouts. No partial successes. The retries weren't making things worse — they were just failing consistently. Which meant...
The problem wasn't my retry logic at all.
The Real Culprit
I checked the upstream payment provider's status page. Green across the board. Then I dug into their API changelog — and there it was, buried in a "minor update" from 2 days ago:
Rate limiting threshold adjusted for /v2/charges endpoint: 50 req/min → 20 req/min
They changed the rate limit without notifying us. Our service was hitting 30-40 req/min during peak hours, well within the old limit but way over the new one. The "timeouts" were actually 429 responses being swallowed by their SDK.
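If you can get at the raw status code and headers, a tiny classifier makes a swallowed 429 visible instead of letting it pass as a timeout. This is a minimal sketch, not the provider's SDK: `classify_response` and its return shape are hypothetical names, and it assumes the server sends `Retry-After` as a number of seconds when it sends it at all.

```python
from typing import Mapping, Optional, Tuple


def classify_response(
    status: int, headers: Mapping[str, str]
) -> Tuple[str, Optional[float]]:
    """Return (kind, wait_seconds) for an HTTP response.

    kind is "ok", "rate_limited", or "error". wait_seconds is only set
    when a 429 carries a numeric Retry-After header.
    """
    if 200 <= status < 300:
        return "ok", None
    if status == 429:
        # Header names are case-insensitive, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        raw = lowered.get("retry-after", "").strip()
        # Retry-After may also be an HTTP date; this sketch only handles seconds.
        wait = float(raw) if raw.isdigit() else None
        return "rate_limited", wait
    return "error", None
```

Logging `rate_limited` as its own category, separate from real timeouts, would likely have pointed at the provider much sooner than the retry logs did.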
I sent them the most passive-aggressive email of my career:
"Hi team, just wondering if there was a quiet update to the rate limiting threshold? Asking for a friend whose retry logic is taking heat 😅"
The Fix: Circuit Breaker Pattern
Instead of just increasing retry backoff (which is what exhausted-me would've done at 2 AM), I implemented a proper circuit breaker:
    import time


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=60):
            self.failures = 0
            self.threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.last_failure = 0.0
            self.state = "CLOSED"  # CLOSED → OPEN → HALF_OPEN → CLOSED

        def call(self, func, *args, **kwargs):
            if self.state == "OPEN":
                # After the cool-down, let one trial call through.
                if time.monotonic() - self.last_failure > self.reset_timeout:
                    self.state = "HALF_OPEN"
                else:
                    raise RuntimeError("Circuit is OPEN: upstream may be down")

            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                self.last_failure = time.monotonic()
                # Trips at the threshold, and again right away if a HALF_OPEN trial fails.
                if self.failures >= self.threshold:
                    self.state = "OPEN"
                raise  # bare raise keeps the original traceback

            # Any success clears the streak, so only consecutive failures count.
            self.state = "CLOSED"
            self.failures = 0
            return result
Now when the payment provider silently changes their limits again, the circuit opens after 5 consecutive failures and stops hammering their API for 60 seconds. No more cascading failures. No more 5 AM wake-up calls.
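To watch those state transitions without waiting 60 real seconds, here is a compact restatement of the breaker with an injectable clock. The `clock` parameter, the `DemoBreaker` name, and `run_demo` are my additions for illustration, not part of the code above; treat it as a sketch of the pattern rather than production code.

```python
import time


class DemoBreaker:
    """Compact circuit breaker; `clock` is injectable so tests need no sleep."""

    def __init__(self, failure_threshold=5, reset_timeout=60, clock=time.monotonic):
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.last_failure = 0.0
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.last_failure > self.reset_timeout:
                self.state = "HALF_OPEN"  # let one trial call through
            else:
                raise RuntimeError("circuit is OPEN")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = self.clock()
            if self.failures >= self.threshold:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and clears the failure streak.
        self.state = "CLOSED"
        self.failures = 0
        return result


def run_demo():
    """Walk the breaker through CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""
    now = [0.0]  # fake clock, advanced by hand
    breaker = DemoBreaker(failure_threshold=5, reset_timeout=60, clock=lambda: now[0])
    states = []

    def always_times_out():
        raise TimeoutError("timeout after 5000ms")

    for _ in range(5):  # five consecutive failures trip the breaker
        try:
            breaker.call(always_times_out)
        except TimeoutError:
            pass
    states.append(breaker.state)

    try:  # while OPEN, calls are rejected without touching upstream
        breaker.call(always_times_out)
    except RuntimeError:
        states.append("rejected")

    now[0] += 61  # pretend the reset timeout has passed
    breaker.call(lambda: "charged")  # the trial call succeeds
    states.append(breaker.state)
    return states
```

Calling `run_demo()` returns `["OPEN", "rejected", "CLOSED"]`: the breaker trips, rejects a call without contacting upstream, then closes again once a trial call succeeds after the cool-down.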
What I Actually Learned
Your code is innocent until proven guilty. I wasted 30 minutes tracing my own retry logic before checking the upstream. Don't assume you're the problem.
"Minor updates" are never minor. If a third-party API changes behavior, even slightly, it can break your production. Subscribe to changelogs. Set up integration tests that catch rate limit changes.
Circuit breakers are not optional. If your service depends on external APIs, you need them. Retry logic alone just turns timeouts into retry storms.
Don't write production code at 2 AM. My retry logic was actually fine. But I couldn't remember writing it. That's a problem.
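For the "catch rate limit changes" lesson above, one option is a scheduled check that compares the limit the provider advertises with what your peak traffic needs. This sketch assumes the provider returns an `X-RateLimit-Limit` header expressed in requests per minute. That is a common convention, but nothing in this story confirms this provider sends it, and `check_rate_limit_headroom` is a hypothetical helper name.

```python
from typing import Mapping, Optional


def check_rate_limit_headroom(
    headers: Mapping[str, str],
    peak_requests_per_min: int,
    header_name: str = "X-RateLimit-Limit",  # assumed header; providers differ
) -> Optional[str]:
    """Return a warning string if the advertised limit is below peak traffic.

    Returns None when there is enough headroom. A missing or non-numeric
    header also produces a warning, because then the check proves nothing.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get(header_name.lower(), "").strip()
    if not raw.isdigit():
        return f"{header_name} missing or not numeric: {raw!r}"
    limit = int(raw)
    if limit < peak_requests_per_min:
        return f"limit {limit}/min is below peak traffic {peak_requests_per_min}/min"
    return None
```

With the numbers from this incident, a peak of 40 req/min passes against the old limit of 50 and produces a warning against the new limit of 20, which is the kind of alert you want before checkout breaks.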
I've been collecting patterns like these — circuit breakers, retry strategies, payment gateway integrations — into a reusable Python toolkit. If you're building anything that talks to external APIs, check it out here.
What's the worst "it's not my code... oh wait, yes it is" moment you've had?