
Nico Reyes

I got rate limited scraping 100 pages. Here's what actually worked

Was pulling product data from an ecommerce site. Page 47 out of 100. Script crashes. 429 Too Many Requests.

Zero data collected.

What I tried first

Thought adding a 1-second delay would fix it.

import requests
import time

for page in range(1, 101):
    response = requests.get(f'https://example.com/products?page={page}')
    time.sleep(1)  # This won't save you
    products = response.json()

Got blocked again. Page 52 this time.

Tried proxies next. Bought a cheap proxy list. Half of them didn't work. The ones that did got flagged within 20 requests. Wasted $15.

What worked

Randomized delays between 2 and 5 seconds, not a constant 1-second sleep.

import random
import time
import requests

for page in range(1, 101):
    while True:
        time.sleep(random.uniform(2, 5))  # Random delay, not a fixed beat

        response = requests.get(
            f'https://example.com/products?page={page}',
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
            }
        )

        if response.status_code == 429:
            time.sleep(60)  # Back off, then retry the same page
            continue
        break

    if response.status_code == 200:
        products = response.json()
        # process data

Added a proper User-Agent header. Some sites check for this.
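Rotating through a small pool of realistic User-Agent strings can take this a step further, so every request doesn't carry the identical fingerprint. A minimal sketch; these particular UA strings and the helper name are just examples, not from my actual script:

```python
import random

# A small pool of plausible desktop User-Agent strings (examples only;
# swap in current browser UAs for real use)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def random_headers():
    """Pick a random User-Agent for each request."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```

Then pass `headers=random_headers()` to each `requests.get` call instead of a hardcoded dict.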

Built in retry logic for 429 errors. When you hit the rate limit, wait a minute and retry the same page instead of crashing.
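The flat 60-second wait can be generalized. Many servers send a `Retry-After` header along with the 429, and when it's absent, exponential backoff is a common fallback. A sketch of the idea; the helper name and defaults are mine, not from the original script:

```python
def backoff_delay(attempt, retry_after=None, base=2.0, cap=120.0):
    """Return seconds to sleep before retry number `attempt` (0-based).

    Honors the server's Retry-After value when present; otherwise
    falls back to exponential backoff (base * 2**attempt), capped.
    """
    if retry_after is not None:
        return float(retry_after)
    return min(base * (2 ** attempt), cap)
```

In the loop, you'd read `response.headers.get('Retry-After')` and pass it in, so the server gets the final say on how long you wait.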

Took 8 minutes to scrape 100 pages instead of 2. But it worked.

The thing nobody mentions

Sites don't just check request speed. They check patterns.

If you hit pages 1, 2, 3, 4 in perfect sequence, exactly 1 second apart, that's obviously a bot.

Real users jump around. They spend different time on different pages. They don't go 1→2→3→4→5.

Randomizing the delay helps. But also consider randomizing which pages you hit in what order if your use case allows it.

import random
import time

pages = list(range(1, 101))
random.shuffle(pages)  # Random order

for page in pages:
    # scrape page
    delay = random.uniform(2, 5)
    time.sleep(delay)

Got all 100 pages. No blocks.
