The first time I got blocked by a website, I didn't understand why. I thought "I'm only scraping 100 pages, what's the problem?"
The problem was I scraped those 100 pages in 5 seconds. The website saw 20 requests per second from the same IP and banned me instantly.
I learned the hard way: it's not about WHAT you scrape, it's about HOW FAST you scrape it. Let me show you how to scrape politely and avoid getting blocked.
The Problem: You're Scraping Too Fast
Websites expect human visitors:
- Humans read pages (5-30 seconds per page)
- Humans click links (1-2 clicks per minute)
- Humans take breaks
Your spider:
- Downloads pages instantly
- Follows all links immediately
- Never stops
Result: Website thinks you're a bot (because you are!) and blocks you.
Solution 1: DOWNLOAD_DELAY (Simple Throttling)
The simplest fix: wait between requests.
# settings.py
DOWNLOAD_DELAY = 2 # Wait 2 seconds between requests
Now your spider waits 2 seconds between each request to the same domain.
How It Works
Request 1 → Wait 2 seconds → Request 2 → Wait 2 seconds → Request 3
Without delay:
- 100 requests in 5 seconds
- 20 requests/second
- Looks like an attack!
With 2 second delay:
- 100 requests in 200 seconds
- 0.5 requests/second
- Looks more human
What the Docs Don't Tell You
The delay is per domain, not global:
DOWNLOAD_DELAY = 2
# These wait 2 seconds apart
example.com → wait → example.com → wait → example.com
# But these happen simultaneously
example.com → wait → example.com
othersite.com (no wait, different domain)
Randomize the delay:
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True # Adds ±50% randomness
Now each wait is a random value between 1 and 3 seconds, which looks more human. (This setting is actually True by default; it just has no effect until DOWNLOAD_DELAY is non-zero.)
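Under the hood, the randomized delay is simply a uniform draw between 0.5x and 1.5x of DOWNLOAD_DELAY. A tiny standalone sketch of the same math (not Scrapy's actual code):

import random

DOWNLOAD_DELAY = 2

def next_delay(base_delay=DOWNLOAD_DELAY):
    # Uniform draw between 0.5x and 1.5x the base delay (so 1-3s for a 2s delay)
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)

print(next_delay())  # e.g. 1.37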
Solution 2: CONCURRENT_REQUESTS (Limit Parallel Requests)
Control how many requests happen at once.
# settings.py
# Maximum requests at once (all domains)
CONCURRENT_REQUESTS = 16 # Default
# Maximum requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # Default
Understanding Concurrency
CONCURRENT_REQUESTS = 16 means:
- Scrapy can have 16 requests in-flight globally
- Across all domains you're scraping
CONCURRENT_REQUESTS_PER_DOMAIN = 8 means:
- Maximum 8 requests to same domain at once
- Prevents overwhelming single site
Recommended Values
For aggressive scraping (large sites):
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5
For polite scraping (most sites):
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 2
For very polite scraping (small sites):
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 5
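These profiles don't have to live in settings.py. If one project has several spiders hitting very different sites, you can attach a profile to a single spider with custom_settings (the spider below is just an illustration):

import scrapy

class SmallSiteSpider(scrapy.Spider):
    name = 'small_site'  # hypothetical spider

    # Very polite profile, overriding the project-wide settings.py values
    custom_settings = {
        'CONCURRENT_REQUESTS': 4,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'DOWNLOAD_DELAY': 5,
    }

    def parse(self, response):
        yield {'url': response.url}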
Solution 3: AutoThrottle (Smart Automatic Throttling)
AutoThrottle automatically adjusts speed based on:
- Server response time
- Server load
- Error rates
Enable AutoThrottle
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1 # Initial delay
AUTOTHROTTLE_MAX_DELAY = 10 # Maximum delay if server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0 # Average parallel requests
How It Works
Server is fast:
- Response time: 0.5 seconds
- AutoThrottle: Speeds up to TARGET_CONCURRENCY
Server is slow:
- Response time: 5 seconds
- AutoThrottle: Slows down automatically
Server returns errors:
- Gets 500 errors
- AutoThrottle: Backs off
It's like cruise control for scraping!
Understanding TARGET_CONCURRENCY
This is the average number of requests you want in-flight at once.
TARGET_CONCURRENCY = 1.0:
- Wait for response before next request
- Very polite
- Slow
TARGET_CONCURRENCY = 2.0:
- Average 2 requests in-flight
- Balanced
- Recommended for most sites
TARGET_CONCURRENCY = 5.0:
- Average 5 requests in-flight
- Aggressive
- Only for robust sites
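The adjustment rule itself is documented and surprisingly small: the target delay is latency / TARGET_CONCURRENCY, the next delay is the average of the current delay and that target, and the result is clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY (error responses are never allowed to lower it). A simplified model of that update, not Scrapy's actual code:

def next_autothrottle_delay(current_delay, latency,
                            target_concurrency=2.0,
                            min_delay=0.0,    # acts like DOWNLOAD_DELAY
                            max_delay=10.0):  # acts like AUTOTHROTTLE_MAX_DELAY
    # Aim for `target_concurrency` requests in flight on average
    target_delay = latency / target_concurrency
    # Move halfway from the current delay toward the target
    new_delay = (current_delay + target_delay) / 2
    # Never drop below the floor or rise above the ceiling
    return max(min_delay, min(new_delay, max_delay))

print(next_autothrottle_delay(1.0, latency=0.5))  # fast server -> 0.625 (speeds up)
print(next_autothrottle_delay(1.0, latency=5.0))  # slow server -> 1.75  (slows down)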
What the Docs Don't Tell You
AutoThrottle takes over DOWNLOAD_DELAY:
Once AutoThrottle is enabled, it adjusts the delay on its own. Your DOWNLOAD_DELAY isn't ignored, though: it becomes the floor AutoThrottle will never go below, and AUTOTHROTTLE_MAX_DELAY is the ceiling.
Debug AutoThrottle:
AUTOTHROTTLE_DEBUG = True # See what AutoThrottle is doing
With debug on, each log line shows the slot, the measured latency, and the delay AutoThrottle picked, something like:
[autothrottle] slot: example.com | latency: 0.523 | delay: 1.046
When to use AutoThrottle:
- Production spiders
- Unknown server capacity
- Scraping multiple sites with different speeds
When NOT to use AutoThrottle:
- You know exact rate limits
- Need consistent speed
- Debugging (adds complexity)
Detecting Rate Limiting
Websites will tell you when you're going too fast.
429 Status Code
Most obvious sign:
HTTP 429 Too Many Requests
Handle it:
# settings.py
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429] # Retry on 429
Captchas
If you see captchas, you're being rate limited.
Blocked IPs
If all requests start timing out or returning 403, your IP might be blocked.
Performance Degradation
Server starts responding very slowly. Sign you're stressing it.
Handling 429 Responses
When you get rate limited:
# spider.py
import time

import scrapy


class PoliteSpider(scrapy.Spider):
    name = 'polite'

    # 429s are filtered out by HttpErrorMiddleware by default;
    # this lets them reach parse() once the retries run out
    handle_httpstatus_list = [429]

    custom_settings = {
        'RETRY_HTTP_CODES': [429, 500, 502, 503],
        'RETRY_TIMES': 5,
    }

    def parse(self, response):
        if response.status == 429:
            # We're being rate limited
            retry_after = int(response.headers.get('Retry-After', 60))
            self.logger.warning(f'Rate limited! Waiting {retry_after} seconds')

            # Slow down (careful: time.sleep blocks the whole reactor,
            # so every other in-flight request pauses too)
            time.sleep(retry_after)

            # Retry
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                priority=10,
            )
            return

        # Normal processing
        yield {'url': response.url}
Progressive Throttling
Start fast, slow down if needed:
# spider.py
import scrapy


class AdaptiveSpider(scrapy.Spider):
    name = 'adaptive'

    # Let rate-limit responses reach parse() instead of being filtered out
    handle_httpstatus_list = [429, 503]

    custom_settings = {
        'DOWNLOAD_DELAY': 0.5  # Start fast
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error_count = 0
        self.request_count = 0

    def parse(self, response):
        self.request_count += 1
        if response.status in [429, 503]:
            self.error_count += 1

        # Calculate error rate
        error_rate = self.error_count / self.request_count

        if error_rate > 0.1:  # More than 10% errors
            # Slow down! The downloader keeps one slot per domain; its key
            # is exposed in request.meta['download_slot']. This pokes at
            # Scrapy internals, so treat it as a pragmatic hack.
            slot_key = response.request.meta.get('download_slot')
            slot = self.crawler.engine.downloader.slots.get(slot_key)
            if slot:
                new_delay = slot.delay * 2
                self.logger.warning(f'Too many errors! Increasing delay to {new_delay}s')
                slot.delay = new_delay

        # Continue scraping
        yield {'url': response.url}
Real-World Rate Limiting Strategies
Strategy 1: Time-Based Throttling
Scrape only during off-peak hours:
# spider.py
import time
from datetime import datetime

import scrapy


class TimedSpider(scrapy.Spider):
    name = 'timed'

    def parse(self, response):
        current_hour = datetime.now().hour

        # Only scrape 2 AM to 6 AM (server off-peak)
        if not (2 <= current_hour < 6):
            self.logger.info('Outside scraping hours, pausing')
            # Pause the engine, then sleep. Note that time.sleep
            # blocks the whole process, so nothing else runs either.
            self.crawler.engine.pause()
            time.sleep(3600)  # Wait 1 hour
            self.crawler.engine.unpause()

        # Continue scraping
        yield {'url': response.url}
Strategy 2: Respect robots.txt Crawl-Delay
Some sites specify delay in robots.txt:
User-agent: *
Crawl-delay: 5
One catch: Scrapy's robots.txt support only enforces the Allow/Disallow rules. It does not apply Crawl-delay automatically:
# settings.py
ROBOTSTXT_OBEY = True # Respects allow/disallow rules, but NOT crawl-delay
So if a site publishes a Crawl-delay, read it yourself and set DOWNLOAD_DELAY to at least that value.
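A minimal sketch of reading Crawl-delay with the standard library (the URL is a placeholder), whose result you would then feed into DOWNLOAD_DELAY:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder URL
rp.read()

crawl_delay = rp.crawl_delay('*')  # None if the site doesn't set one
print(crawl_delay)  # e.g. 5 -> use as (or as a floor for) DOWNLOAD_DELAY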
Strategy 3: Exponential Backoff
When errors occur, wait longer each time:
# spider.py
import time

import scrapy


class BackoffSpider(scrapy.Spider):
    name = 'backoff'

    # Let 429 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [429]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.backoff_time = 1  # Start with 1 second

    def parse(self, response):
        if response.status == 429:
            # Double the wait time (again: time.sleep blocks the reactor)
            self.backoff_time *= 2
            self.logger.warning(f'Rate limited! Backing off for {self.backoff_time}s')
            time.sleep(self.backoff_time)

            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
            )
            return

        # Success! Reset backoff
        self.backoff_time = 1
        yield {'url': response.url}
Combining Strategies
Use multiple techniques together:
# settings.py
# Basic throttling
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# AutoThrottle (uses DOWNLOAD_DELAY above as its minimum delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Retry on rate limits
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
RETRY_TIMES = 5
# Respect robots.txt
ROBOTSTXT_OBEY = True
Monitoring Scraping Speed
Track how fast you're scraping:
# spider.py
from datetime import datetime

import scrapy


class MonitoredSpider(scrapy.Spider):
    name = 'monitored'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.request_count = 0

    def parse(self, response):
        self.request_count += 1

        # Calculate speed every 100 requests
        if self.request_count % 100 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            speed = self.request_count / elapsed
            self.logger.info(
                f'Speed: {speed:.2f} requests/second '
                f'({self.request_count} requests in {elapsed:.1f}s)'
            )

        yield {'url': response.url}
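Scrapy's built-in LogStats extension already reports crawl rate (pages and items per minute) in the log, so if you just want a periodic speed readout, tuning its interval may be enough:

# settings.py
LOGSTATS_INTERVAL = 60.0  # seconds between LogStats rate reports (60 is the default)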
When to Slow Down vs Speed Up
Slow down when:
- Getting 429 errors
- Getting captchas
- Server responses are slow
- Small website (< 10k pages)
- Scraping during peak hours
Speed up when:
- Large website (100k+ pages)
- API with rate limits specified
- Server is fast and stable
- Scraping during off-peak hours
- Using multiple IPs (proxies)
IP Rotation (Advanced)
If you need to go fast without getting blocked, rotate IPs:
# settings.py with scrapy-rotating-proxies
ROTATING_PROXY_LIST = [
'http://proxy1.com:8000',
'http://proxy2.com:8000',
'http://proxy3.com:8000',
]
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
Now requests go out through different IPs, so no single address takes all the traffic. You can scrape faster with a much lower risk of any one IP getting banned.
Common Mistakes
Mistake #1: No Delay At All
# BAD (will get blocked)
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 100
Always add some delay!
Mistake #2: Same Delay for All Sites
# BAD (one size doesn't fit all)
DOWNLOAD_DELAY = 1 # For ALL sites
Different sites need different delays. Use spider-specific settings:
class FastSiteSpider(scrapy.Spider):
    name = 'fast_site'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5
    }

class SlowSiteSpider(scrapy.Spider):
    name = 'slow_site'
    custom_settings = {
        'DOWNLOAD_DELAY': 5
    }
Mistake #3: Ignoring 429s
# BAD (keeps hammering when rate limited)
# Just keeps scraping
Always handle 429 responses and slow down!
Quick Reference
Basic Throttling
# Polite scraping
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AutoThrottle
# Smart automatic throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True # For monitoring
Handle Rate Limits
# Retry on rate limit
RETRY_HTTP_CODES = [429, 500, 502, 503]
RETRY_TIMES = 5
Summary
Why throttle:
- Avoid getting blocked
- Be respectful to servers
- Scrape longer without issues
Three approaches:
- DOWNLOAD_DELAY - Simple, fixed delay
- CONCURRENT_REQUESTS - Limit parallel requests
- AutoThrottle - Smart automatic adjustment
Best practices:
- Start slow (2-3 seconds delay)
- Use AutoThrottle for production
- Monitor speed and errors
- Handle 429 responses
- Randomize delays
- Respect robots.txt
Rule of thumb:
- Small sites: 2-5 second delay
- Medium sites: 1-2 second delay
- Large sites: 0.5-1 second delay
- APIs: Check documentation
Remember:
- Slower = more reliable
- Getting blocked wastes more time than going slow
- Be a good internet citizen
Start with AutoThrottle and adjust based on results. Better to scrape slow and steady than fast and banned!
Happy scraping! 🕷️