DEV Community

Nico Reyes

I got rate limited scraping 100 pages. Here's what actually worked

Got assigned to scrape competitor pricing from 100 product pages. Figured it would take like an hour max. Boy was I wrong.

Started with a basic requests loop. Got through maybe 15 pages before everything started timing out. Added delays between requests: 2 seconds, then 5, then 10. Still got blocked around page 30.

Tried rotating user agents. Found some list online with like 50 different browsers. Didn't matter. Still blocked.

Then spent a whole afternoon setting up proxy rotation with free proxies from some sketchy list. Half of them didn't work at all, and the other half were slower than doing it manually. Got nowhere.

What actually worked was way simpler than I thought. Turns out the site wasn't blocking the scraping itself, just the pattern. Hitting pages in sequential order at regular intervals screams "automated script", even with delays.

Ended up randomizing the page order and varying the delays. So instead of fixed 5 second waits, a random wait between 3 and 12 seconds. And instead of pages 1, 2, 3, hitting page 47, then 12, then 89, then 3. Suddenly no more blocks.

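import random
import time

import requests

# Visit the 100 pages in random order instead of 1, 2, 3, ...
pages = list(range(1, 101))
random.shuffle(pages)

for page_num in pages:
    url = f"https://example.com/products?page={page_num}"
    # A timeout keeps one stalled request from hanging the whole run
    response = requests.get(url, timeout=30)

    # Random delay between 3 and 12 seconds
    time.sleep(random.uniform(3, 12))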
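A rough way to see the difference between the two schedules is to measure how regular each one is. The sketch below is only an illustration: the `regularity` helper and its two measures are made up for this post, and nothing here is known about what the site actually checks.

```python
import random
import statistics


def regularity(order, delays):
    """Return (share of steps that go to the next page number, spread of the delays in seconds)."""
    steps = [b - a for a, b in zip(order, order[1:])]
    sequential_share = sum(1 for s in steps if s == 1) / len(steps)
    return sequential_share, statistics.pstdev(delays)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# First attempt: pages in order, fixed 5 second wait between requests
plain = regularity(list(range(1, 101)), [5.0] * 99)

# What worked: shuffled pages, random wait between 3 and 12 seconds
shuffled_pages = list(range(1, 101))
rng.shuffle(shuffled_pages)
mixed = regularity(shuffled_pages, [rng.uniform(3, 12) for _ in range(99)])

print("sequential + fixed delay:", plain)  # (1.0, 0.0): every step is +1, zero variation
print("shuffled + random delay: ", mixed)
```

The first schedule is perfectly regular on both measures. The shuffled one almost never steps to the next page number, and its delays vary by a couple of seconds.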

It also helped to add some realistic headers. Not just User-Agent but Accept-Language, Accept-Encoding, that stuff. Made the requests look more like actual browser traffic. Pass them along with each request, e.g. requests.get(url, headers=headers).

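headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    # Only advertise 'br' if the brotli package is installed,
    # otherwise requests can't decode a Brotli-compressed body
    'Accept-Encoding': 'gzip, deflate, br'
}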
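Putting the pieces together might look something like the sketch below. It is one possible arrangement, not the exact script from the job: `crawl_shuffled`, `make_fetcher`, and `BROWSER_HEADERS` are hypothetical names, and the 30 second timeout is an assumption. A `requests.Session` carries the headers on every request, and `'br'` is left out of Accept-Encoding so the sketch doesn't depend on the brotli package.

```python
import random
import time

# Browser-like headers; 'br' is omitted so no extra decoding package is needed
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
}


def crawl_shuffled(fetch, page_numbers, min_delay=3.0, max_delay=12.0,
                   sleep=time.sleep, rng=random):
    """Fetch every page once in random order, pausing a random time between requests."""
    order = list(page_numbers)
    rng.shuffle(order)
    results = {}
    for i, page_num in enumerate(order):
        results[page_num] = fetch(page_num)
        if i < len(order) - 1:  # no need to wait after the last page
            sleep(rng.uniform(min_delay, max_delay))
    return results


def make_fetcher(base_url):
    """Build a fetch(page_num) function that sends the browser-like headers."""
    import requests  # imported here so the scheduling logic above stays dependency-free

    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)

    def fetch(page_num):
        response = session.get(base_url, params={"page": page_num}, timeout=30)
        response.raise_for_status()  # fail loudly on a 4xx/5xx response
        return response.text

    return fetch
```

Calling `crawl_shuffled(make_fetcher("https://example.com/products"), range(1, 101))` would return a dict of page number to HTML. The `sleep` and `rng` parameters are only there so the loop can be tried out without waiting or touching the network.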

Whole job finished in like 2 hours instead of the week I thought it would take after all the blocking. Random order and random delays worked way better than proxies, honestly.
