Muhammad Ikramullah Khan
Scrapy AutoThrottle & Rate Limiting: Stop Getting Blocked

The first time I got blocked by a website, I didn't understand why. I thought "I'm only scraping 100 pages, what's the problem?"

The problem was I scraped those 100 pages in 5 seconds. The website saw 20 requests per second from the same IP and banned me instantly.

I learned the hard way: it's not about WHAT you scrape, it's about HOW FAST you scrape it. Let me show you how to scrape politely and avoid getting blocked.


The Problem: You're Scraping Too Fast

Websites expect human visitors:

  • Humans read pages (5-30 seconds per page)
  • Humans click links (1-2 clicks per minute)
  • Humans take breaks

Your spider:

  • Downloads pages instantly
  • Follows all links immediately
  • Never stops

Result: Website thinks you're a bot (because you are!) and blocks you.


Solution 1: DOWNLOAD_DELAY (Simple Throttling)

The simplest fix: wait between requests.

# settings.py
DOWNLOAD_DELAY = 2  # Wait 2 seconds between requests

Now your spider waits 2 seconds between each request to the same domain.
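If you'd rather not change the project-wide settings, the same value can be set per spider or overridden for a single run. A quick sketch (the spider name here is just a placeholder):

# spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
    }

# Or override from the command line for one run:
# scrapy crawl my_spider -s DOWNLOAD_DELAY=2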

How It Works

Request 1 → Wait 2 seconds → Request 2 → Wait 2 seconds → Request 3

Without delay:

  • 100 requests in 5 seconds
  • 20 requests/second
  • Looks like an attack!

With 2 second delay:

  • 100 requests in 200 seconds
  • 0.5 requests/second
  • Looks more human

What the Docs Don't Tell You

The delay is per domain, not global:

DOWNLOAD_DELAY = 2

# These wait 2 seconds apart
example.com  wait  example.com  wait  example.com

# But these happen simultaneously
example.com  wait  example.com
othersite.com (no wait, different domain)
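One more knob worth knowing about: if one site is served from several hostnames, you can tell Scrapy to enforce the limits per IP instead of per domain:

# settings.py
# When non-zero, the download delay and the concurrency cap are
# enforced per IP address instead of per domain
CONCURRENT_REQUESTS_PER_IP = 4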

Randomize the delay:

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # Adds ±50% randomness

Now the spider waits 1-3 seconds at random. More human-like!
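Under the hood it's nothing fancy: Scrapy multiplies the configured delay by a random factor between 0.5 and 1.5. A minimal sketch of the idea (not Scrapy's actual code):

import random

DOWNLOAD_DELAY = 2

def next_delay(base=DOWNLOAD_DELAY):
    # Pick a delay between 0.5x and 1.5x the configured value
    return random.uniform(0.5 * base, 1.5 * base)

print(next_delay())  # e.g. 1.37, then 2.84, then 1.92, ...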


Solution 2: CONCURRENT_REQUESTS (Limit Parallel Requests)

Control how many requests happen at once.

# settings.py

# Maximum requests at once (all domains)
CONCURRENT_REQUESTS = 16  # Default

# Maximum requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # Default

Understanding Concurrency

CONCURRENT_REQUESTS = 16 means:

  • Scrapy can have 16 requests in-flight globally
  • Across all domains you're scraping

CONCURRENT_REQUESTS_PER_DOMAIN = 8 means:

  • Maximum 8 requests to same domain at once
  • Prevents overwhelming single site

Recommended Values

For aggressive scraping (large sites):

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5

For polite scraping (most sites):

CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 2

For very polite scraping (small sites):

CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 5

Solution 3: AutoThrottle (Smart Automatic Throttling)

AutoThrottle automatically adjusts speed based on:

  • Server response time
  • Server load
  • Error rates

Enable AutoThrottle

# settings.py

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1  # Initial delay
AUTOTHROTTLE_MAX_DELAY = 10   # Maximum delay if server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # Average parallel requests

How It Works

Server is fast:

  • Response time: 0.5 seconds
  • AutoThrottle: Shortens the delay until it averages TARGET_CONCURRENCY requests in flight

Server is slow:

  • Response time: 5 seconds
  • AutoThrottle: Slows down automatically

Server returns errors:

  • Gets 500 errors
  • AutoThrottle: Stops speeding up (error responses never lower the delay)

It's like cruise control for scraping!
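The documented throttling rule is simple enough to sketch: the next delay is pulled toward latency / TARGET_CONCURRENCY, error responses are never allowed to lower it, and the result stays between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. A rough, simplified sketch (not Scrapy's actual code):

def adjust_delay(current_delay, latency, status,
                 target_concurrency=2.0, min_delay=0.0, max_delay=10.0):
    # Target: the delay that would keep ~target_concurrency requests in flight
    target = latency / target_concurrency
    # Move halfway from the current delay toward the target
    new_delay = (current_delay + target) / 2.0
    # Error responses must never speed things up
    if status != 200:
        new_delay = max(new_delay, current_delay)
    # Clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY
    return max(min_delay, min(new_delay, max_delay))

print(adjust_delay(1.0, latency=0.4, status=200))  # fast server -> 0.6
print(adjust_delay(1.0, latency=5.0, status=200))  # slow server -> 1.75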

Understanding TARGET_CONCURRENCY

This is the average number of requests you want in-flight at once.

TARGET_CONCURRENCY = 1.0:

  • Wait for response before next request
  • Very polite
  • Slow

TARGET_CONCURRENCY = 2.0:

  • Average 2 requests in-flight
  • Balanced
  • Recommended for most sites

TARGET_CONCURRENCY = 5.0:

  • Average 5 requests in-flight
  • Aggressive
  • Only for robust sites

What the Docs Don't Tell You

AutoThrottle doesn't replace DOWNLOAD_DELAY, it builds on it:

If you enable AutoThrottle, it adjusts the delay dynamically, but DOWNLOAD_DELAY still matters: it becomes the minimum delay AutoThrottle will ever use, and AUTOTHROTTLE_MAX_DELAY is the maximum.
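In other words, the two settings bound each other. A short example of how they interact:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2   # Delay used for the first requests
AUTOTHROTTLE_MAX_DELAY = 30    # Ceiling when the server struggles
DOWNLOAD_DELAY = 1             # Floor: AutoThrottle never drops below this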

Debug AutoThrottle:

AUTOTHROTTLE_DEBUG = True  # See what AutoThrottle is doing

Logs show:

[autothrottle] slot: example.com | latency: 0.523 | delay: 1.046

When to use AutoThrottle:

  • Production spiders
  • Unknown server capacity
  • Scraping multiple sites with different speeds

When NOT to use AutoThrottle:

  • You know exact rate limits
  • Need consistent speed
  • Debugging (adds complexity)

Detecting Rate Limiting

Websites will tell you when you're going too fast.

429 Status Code

Most obvious sign:

HTTP 429 Too Many Requests

Handle it:

# settings.py
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # Retry on 429

Captchas

If pages start coming back as captchas, the site has flagged you as a bot: effectively a rate limit.

Blocked IPs

If all requests start timing out or returning 403, your IP might be blocked.

Performance Degradation

The server starts responding very slowly. That's a sign you're stressing it (or being deliberately slowed down).
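One practical note: by default Scrapy filters non-200 responses out before your callback sees them, so if you want to spot these signals in parse(), allow them through. A hedged sketch (the spider name and the captcha check are just illustrations):

import scrapy

class WatchfulSpider(scrapy.Spider):
    name = 'watchful'

    # Let 403/429 responses reach parse() instead of being filtered out
    custom_settings = {
        'HTTPERROR_ALLOWED_CODES': [403, 429],
    }

    def parse(self, response):
        if response.status in (403, 429):
            self.logger.warning('Possible rate limiting: %s on %s',
                                response.status, response.url)
            return
        if b'captcha' in response.body.lower():
            self.logger.warning('Captcha page served for %s', response.url)
            return
        yield {'url': response.url}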


Handling 429 Responses

When you get rate limited:

# spider.py
import time

import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite'

    # Let 429 responses reach parse() instead of being dropped
    # by the HttpError spider middleware
    handle_httpstatus_list = [429]

    custom_settings = {
        'RETRY_HTTP_CODES': [429, 500, 502, 503],
        'RETRY_TIMES': 5,
    }

    def parse(self, response):
        if response.status == 429:
            # We're being rate limited; honour Retry-After if present
            # (header values are bytes, and the header may be missing)
            retry_after = int(response.headers.get('Retry-After', b'60'))
            self.logger.warning(f'Rate limited! Waiting {retry_after} seconds')

            # Slow down. Note: time.sleep() blocks the whole reactor,
            # so every in-flight request pauses too -- crude but simple.
            time.sleep(retry_after)

            # Retry the same URL
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                priority=10,
            )
            return

        # Normal processing
        yield {'url': response.url}

Progressive Throttling

Start fast, slow down if needed:

# spider.py
import scrapy

class AdaptiveSpider(scrapy.Spider):
    name = 'adaptive'

    # Let rate-limit responses reach parse() instead of being filtered out
    handle_httpstatus_list = [429, 503]

    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,  # Start fast
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error_count = 0
        self.request_count = 0

    def parse(self, response):
        self.request_count += 1

        if response.status in (429, 503):
            self.error_count += 1

            # Calculate error rate
            error_rate = self.error_count / self.request_count

            if error_rate > 0.1:  # More than 10% errors
                # Slow down! This pokes at Scrapy internals: the downloader
                # keeps one slot per domain, keyed by the request's
                # 'download_slot' meta value.
                slot_key = response.meta.get('download_slot')
                slot = self.crawler.engine.downloader.slots.get(slot_key)
                if slot:
                    slot.delay *= 2
                    self.logger.warning(
                        f'Too many errors! Increasing delay to {slot.delay}s'
                    )

        # Continue scraping
        yield {'url': response.url}

Real-World Rate Limiting Strategies

Strategy 1: Time-Based Throttling

Scrape only during off-peak hours:

# spider.py
import time
from datetime import datetime

import scrapy

class TimedSpider(scrapy.Spider):
    name = 'timed'

    def parse(self, response):
        current_hour = datetime.now().hour

        # Only scrape 2 AM to 6 AM (server off-peak)
        if not (2 <= current_hour < 6):
            self.logger.info('Outside scraping hours, pausing')
            # Pause the engine, sleep, then resume.
            # Note: time.sleep() blocks the whole process, so this is a
            # blunt instrument -- an external scheduler (e.g. cron) is
            # usually the nicer way to restrict scraping hours.
            self.crawler.engine.pause()
            time.sleep(3600)  # Wait 1 hour, then re-check on the next page
            self.crawler.engine.unpause()

        # Continue scraping
        yield {'url': response.url}

Strategy 2: Respect robots.txt Crawl-Delay

Some sites specify delay in robots.txt:

User-agent: *
Crawl-delay: 5

Scrapy reads robots.txt when this is enabled:

# settings.py
ROBOTSTXT_OBEY = True  # Enforces the allow/disallow rules

One catch: ROBOTSTXT_OBEY only blocks disallowed URLs; it does not apply Crawl-delay for you. If a site asks for a delay, mirror it yourself with DOWNLOAD_DELAY.
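If you want to honour Crawl-delay anyway, you can read it yourself with the standard library and feed it into DOWNLOAD_DELAY. A minimal sketch (example.com is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()

delay = rp.crawl_delay('*')  # None if the site doesn't set one
if delay:
    print(f'robots.txt asks for a {delay}s delay')
    # e.g. run with: scrapy crawl my_spider -s DOWNLOAD_DELAY=5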

Strategy 3: Exponential Backoff

When errors occur, wait longer each time:

# spider.py
import time

import scrapy

class BackoffSpider(scrapy.Spider):
    name = 'backoff'

    # Let 429 responses reach parse()
    handle_httpstatus_list = [429]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.backoff_time = 1  # Start with 1 second

    def parse(self, response):
        if response.status == 429:
            # Double the wait time
            self.backoff_time *= 2

            self.logger.warning(f'Rate limited! Backing off for {self.backoff_time}s')

            # Note: this blocks the reactor, pausing all in-flight requests
            time.sleep(self.backoff_time)

            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True
            )
            return

        # Success! Reset backoff
        self.backoff_time = 1

        yield {'url': response.url}

Combining Strategies

Use multiple techniques together:

# settings.py

# Basic throttling
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# AutoThrottle (DOWNLOAD_DELAY above acts as its minimum delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Retry on rate limits
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
RETRY_TIMES = 5

# Respect robots.txt (allow/disallow rules; Crawl-delay is not applied)
ROBOTSTXT_OBEY = True

Monitoring Scraping Speed

Track how fast you're scraping:

from datetime import datetime

import scrapy

class MonitoredSpider(scrapy.Spider):
    name = 'monitored'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.request_count = 0

    def parse(self, response):
        self.request_count += 1

        # Calculate speed every 100 requests
        if self.request_count % 100 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            speed = self.request_count / elapsed

            self.logger.info(
                f'Speed: {speed:.2f} requests/second '
                f'({self.request_count} requests in {elapsed:.1f}s)'
            )

        yield {'url': response.url}
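Scrapy also keeps counters for you, so you don't have to count everything by hand. A small sketch that reads the built-in stats when the spider finishes (the spider name is a placeholder):

import scrapy

class StatsAwareSpider(scrapy.Spider):
    name = 'stats_aware'

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes
        stats = self.crawler.stats
        requests = stats.get_value('downloader/request_count', 0)
        throttled = stats.get_value('downloader/response_status_count/429', 0)
        self.logger.info('Finished (%s): %d requests, %d rate-limited (429)',
                         reason, requests, throttled)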

When to Slow Down vs Speed Up

Slow down when:

  • Getting 429 errors
  • Getting captchas
  • Server responses are slow
  • Small website (< 10k pages)
  • Scraping during peak hours

Speed up when:

  • Large website (100k+ pages)
  • API with rate limits specified
  • Server is fast and stable
  • Scraping during off-peak hours
  • Using multiple IPs (proxies)

IP Rotation (Advanced)

If you need to go fast without getting blocked, rotate IPs:

# settings.py with scrapy-rotating-proxies

ROTATING_PROXY_LIST = [
    'http://proxy1.com:8000',
    'http://proxy2.com:8000',
    'http://proxy3.com:8000',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

Now each request can go out through a different IP, so you can scrape faster with a much lower risk of bans.


Common Mistakes

Mistake #1: No Delay At All

# BAD (will get blocked)
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 100

Always add some delay!

Mistake #2: Same Delay for All Sites

# BAD (one size doesn't fit all)
DOWNLOAD_DELAY = 1  # For ALL sites

Different sites need different delays. Use spider-specific settings:

import scrapy

class FastSiteSpider(scrapy.Spider):
    name = 'fast_site'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5
    }

class SlowSiteSpider(scrapy.Spider):
    name = 'slow_site'
    custom_settings = {
        'DOWNLOAD_DELAY': 5
    }

Mistake #3: Ignoring 429s

# BAD (keeps hammering when rate limited)
# Just keeps scraping

Always handle 429 responses and slow down!


Quick Reference

Basic Throttling

# Polite scraping
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4

AutoThrottle

# Smart automatic throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True  # For monitoring

Handle Rate Limits

# Retry on rate limit
RETRY_HTTP_CODES = [429, 500, 502, 503]
RETRY_TIMES = 5

Summary

Why throttle:

  • Avoid getting blocked
  • Be respectful to servers
  • Scrape longer without issues

Three approaches:

  1. DOWNLOAD_DELAY - Simple, fixed delay
  2. CONCURRENT_REQUESTS - Limit parallel requests
  3. AutoThrottle - Smart automatic adjustment

Best practices:

  • Start slow (2-3 seconds delay)
  • Use AutoThrottle for production
  • Monitor speed and errors
  • Handle 429 responses
  • Randomize delays
  • Respect robots.txt

Rule of thumb:

  • Small sites: 2-5 second delay
  • Medium sites: 1-2 second delay
  • Large sites: 0.5-1 second delay
  • APIs: Check documentation

Remember:

  • Slower = more reliable
  • Getting blocked wastes more time than going slow
  • Be a good internet citizen

Start with AutoThrottle and adjust based on results. Better to scrape slow and steady than fast and banned!

Happy scraping! 🕷️
