The first time I got blocked by a website, I didn't understand why. I thought "I'm only scraping 100 pages, what's the problem?"
The problem was I scraped those 100 pages in 5 seconds. The website saw 20 requests per second from the same IP and banned me instantly.
I learned the hard way: it's not about WHAT you scrape, it's about HOW FAST you scrape it. Let me show you how to scrape politely and avoid getting blocked.
The Problem: You're Scraping Too Fast
Websites expect human visitors:
- Humans read pages (5-30 seconds per page)
- Humans click links (1-2 clicks per minute)
- Humans take breaks
Your spider:
- Downloads pages instantly
- Follows all links immediately
- Never stops
Result: Website thinks you're a bot (because you are!) and blocks you.
Solution 1: DOWNLOAD_DELAY (Simple Throttling)
The simplest fix: wait between requests.
# settings.py
DOWNLOAD_DELAY = 2 # Wait 2 seconds between requests
Now your spider waits 2 seconds between each request to the same domain.
How It Works
Request 1 → Wait 2 seconds → Request 2 → Wait 2 seconds → Request 3
Without delay:
- 100 requests in 5 seconds
- 20 requests/second
- Looks like an attack!
With 2 second delay:
- 100 requests in 200 seconds
- 0.5 requests/second
- Looks more human
What the Docs Don't Tell You
The delay is per domain, not global:
DOWNLOAD_DELAY = 2
# These wait 2 seconds apart
example.com → wait → example.com → wait → example.com
# But these happen simultaneously
example.com → wait → example.com
othersite.com (no wait, different domain)
Randomize the delay:
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True # Adds ±50% randomness
Now each wait is a random value between 1 and 3 seconds, which looks more human. (This setting is actually True by default; it just has no effect until DOWNLOAD_DELAY is non-zero.)
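Under the hood, the randomized delay is simply a uniform draw between 0.5x and 1.5x of DOWNLOAD_DELAY. A tiny standalone sketch of the same math (not Scrapy's actual code):

import random

DOWNLOAD_DELAY = 2

def next_delay(base_delay=DOWNLOAD_DELAY):
    # Uniform draw between 0.5x and 1.5x the base delay (so 1-3s for a 2s delay)
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)

print(next_delay())  # e.g. 1.37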
Solution 2: CONCURRENT_REQUESTS (Limit Parallel Requests)
Control how many requests happen at once.
# settings.py
# Maximum requests at once (all domains)
CONCURRENT_REQUESTS = 16 # Default
# Maximum requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # Default
Understanding Concurrency
CONCURRENT_REQUESTS = 16 means:
- Scrapy can have 16 requests in-flight globally
- Across all domains you're scraping
CONCURRENT_REQUESTS_PER_DOMAIN = 8 means:
- Maximum 8 requests to same domain at once
- Prevents overwhelming single site
Recommended Values
For aggressive scraping (large sites):
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5
For polite scraping (most sites):
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 2
For very polite scraping (small sites):
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 5
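These profiles don't have to live in settings.py. If one project has several spiders hitting very different sites, you can attach a profile to a single spider with custom_settings (the spider below is just an illustration):

import scrapy

class SmallSiteSpider(scrapy.Spider):
    name = 'small_site'  # hypothetical spider

    # Very polite profile, overriding the project-wide settings.py values
    custom_settings = {
        'CONCURRENT_REQUESTS': 4,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'DOWNLOAD_DELAY': 5,
    }

    def parse(self, response):
        yield {'url': response.url}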
Solution 3: AutoThrottle (Smart Automatic Throttling)
AutoThrottle automatically adjusts speed based on:
- Server response time
- Server load
- Error rates
Enable AutoThrottle
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1 # Initial delay
AUTOTHROTTLE_MAX_DELAY = 10 # Maximum delay if server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0 # Average parallel requests
How It Works
Server is fast:
- Response time: 0.5 seconds
- AutoThrottle: Speeds up to TARGET_CONCURRENCY
Server is slow:
- Response time: 5 seconds
- AutoThrottle: Slows down automatically
Server returns errors:
- Gets 500 errors
- AutoThrottle: Backs off
It's like cruise control for scraping!
Understanding TARGET_CONCURRENCY
This is the average number of requests you want in-flight at once.
TARGET_CONCURRENCY = 1.0:
- Wait for response before next request
- Very polite
- Slow
TARGET_CONCURRENCY = 2.0:
- Average 2 requests in-flight
- Balanced
- Recommended for most sites
TARGET_CONCURRENCY = 5.0:
- Average 5 requests in-flight
- Aggressive
- Only for robust sites
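The adjustment rule itself is documented and surprisingly small: the target delay is latency / TARGET_CONCURRENCY, the next delay is the average of the current delay and that target, and the result is clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY (error responses are never allowed to lower it). A simplified model of that update, not Scrapy's actual code:

def next_autothrottle_delay(current_delay, latency,
                            target_concurrency=2.0,
                            min_delay=0.0,    # acts like DOWNLOAD_DELAY
                            max_delay=10.0):  # acts like AUTOTHROTTLE_MAX_DELAY
    # Aim for `target_concurrency` requests in flight on average
    target_delay = latency / target_concurrency
    # Move halfway from the current delay toward the target
    new_delay = (current_delay + target_delay) / 2
    # Never drop below the floor or rise above the ceiling
    return max(min_delay, min(new_delay, max_delay))

print(next_autothrottle_delay(1.0, latency=0.5))  # fast server -> 0.625 (speeds up)
print(next_autothrottle_delay(1.0, latency=5.0))  # slow server -> 1.75  (slows down)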
What the Docs Don't Tell You
AutoThrottle takes over DOWNLOAD_DELAY:
Once AutoThrottle is enabled, it adjusts the delay on its own. Your DOWNLOAD_DELAY isn't ignored, though: it becomes the floor AutoThrottle will never go below, and AUTOTHROTTLE_MAX_DELAY is the ceiling.
Debug AutoThrottle:
AUTOTHROTTLE_DEBUG = True # See what AutoThrottle is doing
With debug on, each log line shows the slot, the measured latency, and the delay AutoThrottle picked, something like:
[autothrottle] slot: example.com | latency: 0.523 | delay: 1.046
When to use AutoThrottle:
- Production spiders
- Unknown server capacity
- Scraping multiple sites with different speeds
When NOT to use AutoThrottle:
- You know exact rate limits
- Need consistent speed
- Debugging (adds complexity)
Detecting Rate Limiting
Websites will tell you when you're going too fast.
429 Status Code
Most obvious sign:
HTTP 429 Too Many Requests
Handle it:
# settings.py
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429] # Retry on 429
Captchas
If you see captchas, you're being rate limited.
Blocked IPs
If all requests start timing out or returning 403, your IP might be blocked.
Performance Degradation
Server starts responding very slowly. Sign you're stressing it.
Handling 429 Responses
When you get rate limited:
# spider.py
import time

import scrapy


class PoliteSpider(scrapy.Spider):
    name = 'polite'

    # 429s are filtered out by HttpErrorMiddleware by default;
    # this lets them reach parse() once the retries run out
    handle_httpstatus_list = [429]

    custom_settings = {
        'RETRY_HTTP_CODES': [429, 500, 502, 503],
        'RETRY_TIMES': 5,
    }

    def parse(self, response):
        if response.status == 429:
            # We're being rate limited
            retry_after = int(response.headers.get('Retry-After', 60))
            self.logger.warning(f'Rate limited! Waiting {retry_after} seconds')

            # Slow down (careful: time.sleep blocks the whole reactor,
            # so every other in-flight request pauses too)
            time.sleep(retry_after)

            # Retry
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                priority=10,
            )
            return

        # Normal processing
        yield {'url': response.url}
Progressive Throttling
Start fast, slow down if needed:
# spider.py
import scrapy


class AdaptiveSpider(scrapy.Spider):
    name = 'adaptive'

    # Let rate-limit responses reach parse() instead of being filtered out
    handle_httpstatus_list = [429, 503]

    custom_settings = {
        'DOWNLOAD_DELAY': 0.5  # Start fast
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error_count = 0
        self.request_count = 0

    def parse(self, response):
        self.request_count += 1
        if response.status in [429, 503]:
            self.error_count += 1

        # Calculate error rate
        error_rate = self.error_count / self.request_count

        if error_rate > 0.1:  # More than 10% errors
            # Slow down! The downloader keeps one slot per domain; its key
            # is exposed in request.meta['download_slot']. This pokes at
            # Scrapy internals, so treat it as a pragmatic hack.
            slot_key = response.request.meta.get('download_slot')
            slot = self.crawler.engine.downloader.slots.get(slot_key)
            if slot:
                new_delay = slot.delay * 2
                self.logger.warning(f'Too many errors! Increasing delay to {new_delay}s')
                slot.delay = new_delay

        # Continue scraping
        yield {'url': response.url}
Real-World Rate Limiting Strategies
Strategy 1: Time-Based Throttling
Scrape only during off-peak hours:
# spider.py
import time
from datetime import datetime

import scrapy


class TimedSpider(scrapy.Spider):
    name = 'timed'

    def parse(self, response):
        current_hour = datetime.now().hour

        # Only scrape 2 AM to 6 AM (server off-peak)
        if not (2 <= current_hour < 6):
            self.logger.info('Outside scraping hours, pausing')
            # Pause the engine, then sleep. Note that time.sleep
            # blocks the whole process, so nothing else runs either.
            self.crawler.engine.pause()
            time.sleep(3600)  # Wait 1 hour
            self.crawler.engine.unpause()

        # Continue scraping
        yield {'url': response.url}
Strategy 2: Respect robots.txt Crawl-Delay
Some sites specify delay in robots.txt:
User-agent: *
Crawl-delay: 5
One catch: Scrapy's robots.txt support only enforces the Allow/Disallow rules. It does not apply Crawl-delay automatically:
# settings.py
ROBOTSTXT_OBEY = True # Respects allow/disallow rules, but NOT crawl-delay
So if a site publishes a Crawl-delay, read it yourself and set DOWNLOAD_DELAY to at least that value.
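A minimal sketch of reading Crawl-delay with the standard library (the URL is a placeholder), whose result you would then feed into DOWNLOAD_DELAY:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder URL
rp.read()

crawl_delay = rp.crawl_delay('*')  # None if the site doesn't set one
print(crawl_delay)  # e.g. 5 -> use as (or as a floor for) DOWNLOAD_DELAY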
Strategy 3: Exponential Backoff
When errors occur, wait longer each time:
# spider.py
import time

import scrapy


class BackoffSpider(scrapy.Spider):
    name = 'backoff'

    # Let 429 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [429]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.backoff_time = 1  # Start with 1 second

    def parse(self, response):
        if response.status == 429:
            # Double the wait time (again: time.sleep blocks the reactor)
            self.backoff_time *= 2
            self.logger.warning(f'Rate limited! Backing off for {self.backoff_time}s')
            time.sleep(self.backoff_time)

            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
            )
            return

        # Success! Reset backoff
        self.backoff_time = 1
        yield {'url': response.url}
Combining Strategies
Use multiple techniques together:
# settings.py
# Basic throttling
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# AutoThrottle (uses DOWNLOAD_DELAY above as its minimum delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Retry on rate limits
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
RETRY_TIMES = 5
# Respect robots.txt
ROBOTSTXT_OBEY = True
Monitoring Scraping Speed
Track how fast you're scraping:
# spider.py
from datetime import datetime

import scrapy


class MonitoredSpider(scrapy.Spider):
    name = 'monitored'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.request_count = 0

    def parse(self, response):
        self.request_count += 1

        # Calculate speed every 100 requests
        if self.request_count % 100 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            speed = self.request_count / elapsed
            self.logger.info(
                f'Speed: {speed:.2f} requests/second '
                f'({self.request_count} requests in {elapsed:.1f}s)'
            )

        yield {'url': response.url}
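Scrapy's built-in LogStats extension already reports crawl rate (pages and items per minute) in the log, so if you just want a periodic speed readout, tuning its interval may be enough:

# settings.py
LOGSTATS_INTERVAL = 60.0  # seconds between LogStats rate reports (60 is the default)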
When to Slow Down vs Speed Up
Slow down when:
- Getting 429 errors
- Getting captchas
- Server responses are slow
- Small website (< 10k pages)
- Scraping during peak hours
Speed up when:
- Large website (100k+ pages)
- API with rate limits specified
- Server is fast and stable
- Scraping during off-peak hours
- Using multiple IPs (proxies)
IP Rotation (Advanced)
If you need to go fast without getting blocked, rotate IPs:
# settings.py with scrapy-rotating-proxies
ROTATING_PROXY_LIST = [
'http://proxy1.com:8000',
'http://proxy2.com:8000',
'http://proxy3.com:8000',
]
DOWNLOADER_MIDDLEWARES = {
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
Now requests go out through different IPs, so no single address takes all the traffic. You can scrape faster with a much lower risk of any one IP getting banned.
Common Mistakes
Mistake #1: No Delay At All
# BAD (will get blocked)
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 100
Always add some delay!
Mistake #2: Same Delay for All Sites
# BAD (one size doesn't fit all)
DOWNLOAD_DELAY = 1 # For ALL sites
Different sites need different delays. Use spider-specific settings:
class FastSiteSpider(scrapy.Spider):
    name = 'fast_site'
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5
    }

class SlowSiteSpider(scrapy.Spider):
    name = 'slow_site'
    custom_settings = {
        'DOWNLOAD_DELAY': 5
    }
Mistake #3: Ignoring 429s
# BAD (keeps hammering when rate limited)
# Just keeps scraping
Always handle 429 responses and slow down!
Quick Reference
Basic Throttling
# Polite scraping
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AutoThrottle
# Smart automatic throttling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True # For monitoring
Handle Rate Limits
# Retry on rate limit
RETRY_HTTP_CODES = [429, 500, 502, 503]
RETRY_TIMES = 5
Summary
Why throttle:
- Avoid getting blocked
- Be respectful to servers
- Scrape longer without issues
Three approaches:
- DOWNLOAD_DELAY - Simple, fixed delay
- CONCURRENT_REQUESTS - Limit parallel requests
- AutoThrottle - Smart automatic adjustment
Best practices:
- Start slow (2-3 seconds delay)
- Use AutoThrottle for production
- Monitor speed and errors
- Handle 429 responses
- Randomize delays
- Respect robots.txt
Rule of thumb:
- Small sites: 2-5 second delay
- Medium sites: 1-2 second delay
- Large sites: 0.5-1 second delay
- APIs: Check documentation
Remember:
- Slower = more reliable
- Getting blocked wastes more time than going slow
- Be a good internet citizen
Start with AutoThrottle and adjust based on results. Better to scrape slow and steady than fast and banned!
Happy scraping! 🕷️