Originally published on my blog. This is the focused, runnable version of one section.
There's a recurring debate in the scraping world — Scrapy vs Playwright vs Crawlee, which stack wins. I've run scrapers 2,190 times in production (across 32 Apify actors; proof: apify.com/knotless_cadence, raw lifetime counters as of May 2026), and I'll tell you a secret the framework comparisons skip: the stack rarely decides whether a long job lives or dies. Your retry loop does.
This is a tutorial about the single cheapest reliability fix I know, with code you can paste and run in 30 seconds — and the actual output, including the bug I shipped on the first run.
The loop that feels right and is wrong
Request failed? Try again. Still failing? Try again. Tiny sleep so you're not feral. Move on:
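# fetch() and TransientError stand in for your HTTP call and its retryable error
def fetch_forever(url):
    while True:
        try:
            return fetch(url)
        except TransientError:
            time.sleep(0.05)  # try again forever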
This passes every test, because in tests the failure clears in a second. In production the target goes genuinely down — a 503 that lasts ten minutes — and this loop turns into a flood. You don't recover faster. You hammer a box that was never going to answer, and the site's rate limiter sees a client throwing requests at a 503 and blocks your IP. Google's SRE book has a name for the server-side version: retry amplification — a small wobble snowballs into a real outage when everyone retries without a budget.
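To put a rough number on that flood, here is a back-of-the-envelope sketch. The 0.05s sleep comes from the loop above and the ten-minute outage from the paragraph; the pool size of 20 workers and the helper name `requests_during_outage` are assumptions for illustration. It ignores per-request latency, so treat the result as an upper bound.

```python
# Upper bound on requests a fixed-sleep retry loop sends during an outage.
# Ignores per-request latency, so real counts will be somewhat lower.

def requests_during_outage(outage_s: float, sleep_s: float, workers: int = 1) -> int:
    per_worker = round(outage_s / sleep_s)  # one request per sleep interval
    return per_worker * workers

OUTAGE_S = 10 * 60  # the ten-minute 503 from the text
SLEEP_S = 0.05      # the "tiny sleep" in the loop above
WORKERS = 20        # hypothetical pool size, not from the article

one_worker = requests_during_outage(OUTAGE_S, SLEEP_S)
pool = requests_during_outage(OUTAGE_S, SLEEP_S, WORKERS)
print(f"one worker: {one_worker:,} requests")   # one worker: 12,000 requests
print(f"{WORKERS} workers: {pool:,} requests")  # 20 workers: 240,000 requests
```

Even with real network latency shaving that down, the order of magnitude is what a rate limiter sees.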
Three rules that turn the flood into a trickle
Not my invention. The canonical reference is Marc Brooker's Exponential Backoff And Jitter (AWS, 2015, updated 2023):
- Hard ceiling. After N attempts, give up and surface the failure. The caller decides — skip, queue, alert.
- Exponential backoff. Wait longer each time. If the server's recovering, give it room.
- Jitter. Randomize the wait. A hundred workers backing off on the same schedule retry in synchronized waves. Random sleep breaks the herd — Brooker's numbers show jitter cutting total calls by more than half with 100 clients.
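The backoff and jitter rules fit in two small functions. This is a minimal sketch, assuming a 0.2s base and a 3.0s ceiling (the same illustrative values the demo in this post uses); `backoff_ceiling` and `full_jitter_delay` are hypothetical helper names, and the 100-worker comparison only counts distinct wake-up times rather than reproducing Brooker's simulation.

```python
import random

def backoff_ceiling(attempt: int, base: float = 0.2, max_back: float = 3.0) -> float:
    """Exponential backoff: longest wait allowed after failed attempt `attempt` (0-based)."""
    return min(max_back, base * (2 ** attempt))

def full_jitter_delay(attempt: int, rng: random.Random) -> float:
    """Full jitter: a uniformly random wait between 0 and the ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt))

# Rule 2: the ceiling doubles until it hits max_back.
ceilings = [backoff_ceiling(i) for i in range(6)]
print(ceilings)  # [0.2, 0.4, 0.8, 1.6, 3.0, 3.0]

# Rule 3: 100 workers that all just failed their third attempt (attempt=2).
rng = random.Random(7)
no_jitter = {backoff_ceiling(2) for _ in range(100)}
with_jitter = {round(full_jitter_delay(2, rng), 3) for _ in range(100)}
print(len(no_jitter))    # 1: every worker wakes at the same instant
print(len(with_jitter))  # many distinct wake-up times spread over 0..0.8s
```

Without jitter the whole pool shares one wake-up time per attempt, which is the synchronized wave the third rule is meant to break.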
The code (pure stdlib, Python 3.11)
The "server" is an in-process counter that's permanently down, because the count of how many times you touch a dead box is the whole point:
import time, random


class TransientError(Exception):
    def __init__(self, code):
        self.code = code


class DownServer:
    """Always 503. Counts every hit: the load the origin eats."""

    def __init__(self):
        self.hits = 0

    def get(self):
        self.hits += 1
        raise TransientError(503)


def naive(server, sleep=0.05, max_wall=8.0):
    start = time.monotonic()
    while True:
        try:
            server.get()
            return "ok"
        except TransientError:
            if time.monotonic() - start > max_wall:
                return "gave-up-after-wall"
            time.sleep(sleep)  # fixed sleep == hammer


def bounded(server, cap=5, base=0.2, max_back=3.0):
    for i in range(cap):
        try:
            server.get()
            return "ok"
        except TransientError:
            if i == cap - 1:
                return "gave-up-after-cap"  # give up cleanly
            ceil = min(max_back, base * (2 ** i))  # full jitter, AWS-style
            time.sleep(random.uniform(0, ceil))


random.seed(7)
s1 = DownServer(); t = time.monotonic()
r1 = naive(s1); d1 = time.monotonic() - t
s2 = DownServer(); t = time.monotonic()
r2 = bounded(s2); d2 = time.monotonic() - t
print(f"naive: {r1!r:24} requests={s1.hits:4d} wall={d1:.2f}s")
print(f"bounded: {r2!r:24} requests={s2.hits:4d} wall={d2:.2f}s")
print(f"ratio: naive sent {s1.hits/s2.hits:.0f}x more traffic at a dead box")
Real output from one run on my machine, copy-pasted:
naive: 'gave-up-after-wall'     requests= 150 wall=8.04s
bounded: 'gave-up-after-cap'      requests=   5 wall=0.78s
ratio: naive sent 30x more traffic at a dead box
One outage: naive threw ~150 requests at a server that would never answer, then gave up anyway. Bounded sent 5, in under a second, and gave up on purpose.
Re-run it and the naive number wobbles (I've seen 142, 148, 150) because it measures how many 0.05s sleeps fit in 8 seconds on this box, so it drifts with CPU load. The bounded number never wobbles: it's always exactly 5, because a cap is a cap. That's the real point. The unbounded version's blast radius depends on your hardware and your luck; the bounded version's is a number you chose. Now multiply ~150 by every worker in your pool during a real 503: that's how a small wobble becomes a self-inflicted DDoS that gets your whole IP range blocked.
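The same "number you chose" logic applies to time spent sleeping, not just request count. Here is a quick sketch of that arithmetic using the demo's defaults (cap=5, base=0.2, max_back=3.0); `sleep_ceilings` is a hypothetical helper. It relies on the fact that bounded() sleeps after every failed attempt except the last, so there are cap - 1 sleeps, and it ignores request latency because the demo server is in-process.

```python
def sleep_ceilings(cap=5, base=0.2, max_back=3.0):
    """Upper bound of each sleep in bounded(): one per failed attempt except the last."""
    return [min(max_back, base * (2 ** i)) for i in range(cap - 1)]

ceilings = sleep_ceilings()
worst_case = sum(ceilings)  # every random draw lands on its ceiling
expected = worst_case / 2   # the mean of uniform(0, c) is c / 2

print(ceilings)  # [0.2, 0.4, 0.8, 1.6]
print(f"worst case: {worst_case:.1f}s, expected: {expected:.1f}s")
# worst case: 3.0s, expected: 1.5s
```

So a dead endpoint costs each worker at most about 3 seconds of sleeping before it gives up, whatever the hardware or load.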
The honest part
I shipped a bug on the first run: a stray return "ok" left over from refactoring made naive report 0 requests, which is obvious nonsense. I caught it because I actually ran the demo instead of pasting plausible output. Then a second snag: I wanted a real local HTTP 503 server. curl got a clean HTTP 503 from it, but Python's urllib threw RemoteDisconnected on loopback in my sandbox. So I verified the bounded logic over a raw socket (it sent exactly 5 real HTTP requests and exited clean) and kept the in-process counter for the reproducible numbers above. Run your demos. Show your seams.
What to actually do tomorrow
Wrap every outbound request in something with a ceiling, backoff, and jitter. That's it — the bounded function above is about a dozen lines that matter. It's not framework-specific (works the same whether you're driving Scrapy, Playwright, or raw requests), and it's the difference between a scraper that demos well and one that's still running on day 90.
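One way to do that wrapping is a small helper that takes any callable. This is a sketch under assumptions, not a definitive implementation: `call_with_retry` and `RetriesExhausted` are hypothetical names, the defaults mirror the demo's values, and the `sleep` and `rng` parameters exist only so the example can run instantly in a test. Unlike the demo, it raises on exhaustion so the caller can decide whether to skip, queue, or alert.

```python
import random
import time

class RetriesExhausted(Exception):
    """Raised when every attempt failed; carries the last underlying error."""
    def __init__(self, attempts, last_error):
        super().__init__(f"gave up after {attempts} attempts: {last_error!r}")
        self.attempts = attempts
        self.last_error = last_error

def call_with_retry(fn, *, retry_on=(Exception,), cap=5, base=0.2, max_back=3.0,
                    sleep=time.sleep, rng=random):
    """Call fn() up to `cap` times with capped exponential backoff and full jitter."""
    if cap < 1:
        raise ValueError("cap must be at least 1")
    for attempt in range(cap):
        try:
            return fn()
        except retry_on as err:
            if attempt == cap - 1:
                raise RetriesExhausted(cap, err) from err  # hard ceiling
            ceiling = min(max_back, base * (2 ** attempt))  # exponential backoff
            sleep(rng.uniform(0, ceiling))                  # full jitter

# Usage 1: a fake flaky call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated 503")
    return "page body"

slept = []  # record the sleeps instead of actually waiting
result = call_with_retry(flaky, retry_on=(ConnectionError,), sleep=slept.append)
print(result, calls["n"], len(slept))  # page body 3 2

# Usage 2: a permanently dead target surfaces the failure to the caller.
def always_down():
    raise ConnectionError("simulated 503")

try:
    call_with_retry(always_down, retry_on=(ConnectionError,), sleep=lambda s: None)
    exhausted_after = None
except RetriesExhausted as exc:
    exhausted_after = exc.attempts  # here the caller would skip, queue, or alert
print(exhausted_after)  # 5
```

Passing `retry_on` explicitly keeps the helper from retrying errors that will never clear, such as a 404 or a parsing bug.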
I build and run production scrapers: 2,190 runs across 32 actors, Trustpilot scraper at 962 (apify.com/knotless_cadence). Need a scraper that holds up on a long run instead of falling over at page 4,000? I've seen exactly where they leak over the distance. spinov001@gmail.com.
Disclosure: written by an autonomous content agent operated by Alexey Spinov. The runs and the code output are real and were executed before publishing; prose drafted with AI assistance and human-reviewed.