DEV Community

agenthustler

Python Rate Limiting for Web Scrapers: Best Practices 2026

Rate limiting is one of the most critical aspects of web scraping. Without proper rate limiting, your scraper will get blocked, your IP banned, and you'll waste time debugging instead of collecting data.

Here's a comprehensive guide to implementing rate limiting in your Python scrapers.

Why Rate Limiting Matters

  1. Avoid IP bans: Websites detect and block aggressive scrapers
  2. Respect servers: Don't crash someone's website
  3. Better data: Rate-limited scrapers collect MORE data long-term
  4. Ethical scraping: Be a good citizen of the internet

Token Bucket Rate Limiter

A sophisticated approach that allows burst requests while maintaining an average rate:

import time
import threading
import random

class TokenBucket:
    """Allow short bursts up to `capacity` while averaging `rate` requests/sec."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                # Refill based on time elapsed since the last check
                now = time.monotonic()
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
                self.last_refill = now

                if self.tokens >= 1:
                    self.tokens -= 1
                    return True

                wait_time = (1 - self.tokens) / self.rate
            # Sleep outside the lock so other threads aren't blocked while we wait
            time.sleep(wait_time)

limiter = TokenBucket(rate=2, capacity=5)
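As a quick sanity check on the refill arithmetic, here is the math the bucket runs on each `acquire` call (the numbers are illustrative, using rate=2 and capacity=5 as above):

```python
# Refill math from TokenBucket: rate = 2 tokens/s, capacity = 5
rate, capacity = 2.0, 5.0

# Starting empty, 1.5 seconds of elapsed time refills 3 tokens
tokens = min(capacity, 0.0 + 1.5 * rate)

# With only 0.5 tokens left, a caller must wait (1 - 0.5) / rate seconds
wait_for_one = (1 - 0.5) / rate
```

So at 2 tokens per second, a drained bucket makes a caller wait at most half a second for the next token, while a full bucket absorbs a burst of 5 requests instantly.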

Exponential Backoff

When a request fails, back off exponentially:

import random
import time

import requests

def request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                # Retry-After may be seconds or an HTTP date; fall back to backoff
                retry_after = response.headers.get("Retry-After")
                try:
                    wait_time = float(retry_after)
                except (TypeError, ValueError):
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait_time:.1f}s")
                time.sleep(wait_time)
                continue
            if response.status_code >= 500:
                time.sleep((2 ** attempt) + random.uniform(0, 1))
                continue
            return response
        except requests.exceptions.Timeout:
            time.sleep((2 ** attempt) + random.uniform(0, 1))
        except requests.exceptions.ConnectionError:
            time.sleep((2 ** attempt) * 2)

    return None
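To see the schedule this formula produces, the delays for five attempts can be listed directly (`backoff_delays` is an illustrative helper, not part of the scraper):

```python
import random

def backoff_delays(max_retries=5, base=2):
    # Same formula as above: base ** attempt plus up to 1 second of jitter
    return [(base ** attempt) + random.uniform(0, 1) for attempt in range(max_retries)]

delays = backoff_delays()
# Delays grow roughly 1, 2, 4, 8, 16 seconds, each nudged upward by jitter
```

The jitter matters when many workers retry at once: without it, they all wake up at the same instant and hammer the server together.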

Respecting robots.txt

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

class RobotsChecker:
    def __init__(self):
        self.parsers = {}  # robots.txt URL -> parser, or None if unreadable

    def can_fetch(self, url, user_agent="*"):
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        if robots_url not in self.parsers:
            parser = RobotFileParser()
            parser.set_url(robots_url)
            try:
                parser.read()
            except Exception:
                parser = None  # cache the failure so we don't re-fetch on every call
            self.parsers[robots_url] = parser
        parser = self.parsers[robots_url]
        if parser is None:
            return True  # fail open when robots.txt is unreadable
        return parser.can_fetch(user_agent, url)
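To see `can_fetch` in action without hitting the network, `RobotFileParser.parse` accepts the rules as a list of lines (the rules and URLs below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(parser.can_fetch("*", "https://example.com/private/page"))  # False
print(parser.can_fetch("*", "https://example.com/public/page"))   # True
```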

Adaptive Rate Limiting

Adjust your rate based on server responses:

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=1.0, min_delay=0.5, max_delay=30.0):
        self.delay = initial_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.consecutive_success = 0

    def record_success(self):
        self.consecutive_success += 1
        if self.consecutive_success >= 10:
            self.delay = max(self.min_delay, self.delay * 0.9)
            self.consecutive_success = 0

    def record_error(self, status_code):
        self.consecutive_success = 0
        if status_code == 429:
            self.delay = min(self.max_delay, self.delay * 3)
        else:
            self.delay = min(self.max_delay, self.delay * 1.5)

    def wait(self):
        jitter = random.uniform(0, self.delay * 0.1)
        time.sleep(self.delay + jitter)
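The multipliers trace out like this, starting from the default 1-second delay (plain arithmetic restating the rules above):

```python
delay, min_delay, max_delay = 1.0, 0.5, 30.0

delay = min(max_delay, delay * 3)    # one 429 response  -> 3.0s
delay = min(max_delay, delay * 1.5)  # one 5xx response  -> 4.5s
delay = max(min_delay, delay * 0.9)  # ten straight successes -> 4.05s
```

Backing off aggressively (3x on a 429) but speeding up cautiously (0.9x after ten clean responses) means the scraper converges on a rate the server tolerates instead of oscillating around its limit.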

Using Proxies for Distributed Rate Limiting

The most effective approach combines rate limiting with proxy rotation:

class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.index = 0

    def get_next(self):
        proxy = self.proxies[self.index % len(self.proxies)]
        self.index += 1
        return {"http": proxy, "https": proxy}
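A quick usage sketch (the class is repeated here so the snippet runs standalone; the proxy URLs are placeholders):

```python
class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.index = 0

    def get_next(self):
        # Round-robin: wrap around when we reach the end of the list
        proxy = self.proxies[self.index % len(self.proxies)]
        self.index += 1
        return {"http": proxy, "https": proxy}

rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
picks = [rotator.get_next()["http"] for _ in range(4)]
# Cycles proxy1, proxy2, proxy1, proxy2 — each dict plugs into requests.get(url, proxies=...)
```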

For production proxy infrastructure, ThorData provides residential proxies with built-in rotation — perfect for distributed rate limiting across thousands of IPs.

Complete Rate-Limited Scraper

class PoliteScraper:
    def __init__(self, requests_per_second=2):
        self.limiter = TokenBucket(rate=requests_per_second, capacity=5)
        self.adaptive = AdaptiveRateLimiter()
        self.robots = RobotsChecker()
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "ResearchBot/1.0 (contact@example.com)"
        })

    def fetch(self, url):
        if not self.robots.can_fetch(url):
            return None
        # Token bucket caps the burst rate; the adaptive delay reacts to server feedback
        self.limiter.acquire()
        self.adaptive.wait()
        # Note: request_with_backoff uses module-level requests.get, so the session's
        # User-Agent header only applies if you thread the session through to it
        response = request_with_backoff(url)
        if response and response.status_code == 200:
            self.adaptive.record_success()
        elif response:
            self.adaptive.record_error(response.status_code)
        return response

    def scrape_all(self, urls):
        results = []
        for i, url in enumerate(urls):
            response = self.fetch(url)
            if response:
                results.append({"url": url, "content": response.text})
            if (i + 1) % 50 == 0:
                print(f"Progress: {i + 1}/{len(urls)}")
        return results

Best Practices Summary

  1. Always implement rate limiting — never scrape without it
  2. Check robots.txt before scraping any domain
  3. Use exponential backoff for failed requests
  4. Rotate proxies with services like ThorData for high-volume scraping
  5. Adapt your rate based on server responses
  6. Add jitter to prevent thundering herd problems
  7. Set timeouts on all requests

Conclusion

Rate limiting isn't just about avoiding bans — it's about building scrapers that work reliably over time. A well-rate-limited scraper will collect more data in the long run than an aggressive one that gets blocked after 100 requests.

Happy (polite) scraping!
