If you've been using Scrapy for a while, you've probably heard about middlewares. Maybe you've even used a few. But if you're like most beginners, they still feel a bit mysterious—like this powerful feature hiding in the shadows of your scraper.
Here's the thing: middlewares are what separate scrapers that get blocked after 10 requests from ones that can run for hours without issues. They're the difference between a fragile script and a production-ready crawler.
In this guide, I'll walk you through Scrapy middlewares with practical, ethical examples. We'll cover what the documentation doesn't always explain clearly, and I'll show you patterns that actually work in the real world.
What Are Middlewares, Really?
Think of middlewares as checkpoints in your scraping pipeline. Every request and response flows through them, and you can intercept, modify, or even block them at these checkpoints.
There are two main types:
Downloader Middlewares — These sit between the Scrapy engine and the downloader. They handle:
- Modifying requests before they're sent (adding headers, rotating proxies)
- Processing responses before they reach your spider
- Handling errors and retries
Spider Middlewares — These sit between the engine and your spider. They handle:
- Processing responses before your spider's parse method
- Handling items and requests generated by your spider
- Managing exceptions from your spider
For most scraping tasks, you'll spend 90% of your time with downloader middlewares. That's where the magic happens.
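To make that concrete, a downloader middleware is just a class that implements some of a handful of hook methods. Here's a bare-bones sketch (the class name is a placeholder) showing where each hook sits:

# middlewares.py — minimal downloader middleware skeleton (placeholder name)
class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for every outgoing request before it is downloaded.
        # Return None to continue, a Response to short-circuit the download,
        # or a Request to schedule a different request instead.
        return None

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the spider.
        # Must return a Response (or a Request, e.g. to retry).
        return response

    def process_exception(self, request, exception, spider):
        # Called when the downloader or another middleware raises an exception.
        # Return None to let other middlewares (or the built-in retry) handle it.
        return None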
When You Actually Need Middlewares
Before we dive into code, let's talk about when you should use middlewares:
You DON'T need middlewares if:
- You're scraping a small, static site once
- The site has no bot protection
- You're just practicing or learning
You DO need middlewares if:
- You're getting 403/429 errors
- The site tracks user agents or IPs
- You need to respect rate limits properly
- You're building a production scraper
- You want your scraper to recover from failures
Setting Up Your First Middleware
Let's start with something simple: rotating user agents ethically. This is probably the most common use case.
The Problem
When you make requests with Scrapy's default user agent, it literally says "Scrapy" in the string. Many sites will block this immediately:
User-Agent: Scrapy/2.11.0 (+https://scrapy.org)
The Solution
Here's a middleware that rotates through realistic user agents:
# middlewares.py
import random

from scrapy import signals


class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
        ]

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info(f'User-Agent rotation enabled with {len(self.user_agents)} agents')

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agents)
        request.headers['User-Agent'] = user_agent
        spider.logger.debug(f'Using User-Agent: {user_agent[:50]}...')
Enabling It
Add this to your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # Disable default
    'myproject.middlewares.RandomUserAgentMiddleware': 400,  # Enable custom
}
What the Documentation Doesn't Tell You
1. The number (400) matters a lot
Lower numbers sit closer to the engine, higher numbers closer to the downloader. For outgoing requests, process_request hooks are called in ascending order (lowest number first); for incoming responses, process_response hooks are called in descending order on the way back. Scrapy's built-in downloader middlewares are spread across roughly 100-900 (UserAgentMiddleware sits at 500 and RetryMiddleware at 550, for example).
If you're modifying requests (like adding headers), a value around 400-500 works well. If you're post-processing responses, go higher, around 800-900.
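To make the ordering concrete, here's a small hypothetical setup (the middleware names are made up) and the order Scrapy calls the hooks in:

# settings.py — hypothetical names, just to illustrate ordering
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.HeaderTweakMiddleware': 400,
    'myproject.middlewares.ResponseCheckMiddleware': 800,
}
# Outgoing request:  engine -> 400.process_request -> 800.process_request -> downloader
# Incoming response: downloader -> 800.process_response -> 400.process_response -> spider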
2. Always disable the middleware you're replacing
Notice how we set UserAgentMiddleware to None? If you don't, both middlewares will run, and you end up with two pieces of code competing to manage the same header. Setting the built-in one to None makes it explicit that your middleware owns the User-Agent.
3. The from_crawler method is optional but useful
You don't need it, but it gives you access to:
- Settings: crawler.settings.get('MY_SETTING')
- Stats: crawler.stats
- Signals: to hook into spider lifecycle events
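As a quick sketch, here's the from_crawler from the user-agent middleware above extended to grab all three (MY_SETTING and the default value are made-up examples):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Read a custom setting (made-up name) with a fallback default
        middleware.my_setting = crawler.settings.get('MY_SETTING', 'default-value')
        # Keep a reference to the stats collector so hooks can record counters
        middleware.stats = crawler.stats
        # Hook into spider lifecycle events via signals (assumes: from scrapy import signals)
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware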
Smart Retry Logic (Beyond What's Built-In)
Scrapy has a retry middleware, but it's pretty basic. Here's how to make it smarter.
The Problem with Default Retries
The built-in retry middleware retries on:
- Specific HTTP codes (500, 502, 503, 504, etc.)
- Network errors
- Timeouts
But what if the site returns 200 with "Access Denied" in the body? Or what if you need exponential backoff? The default middleware can't handle these.
Smart Retry Middleware
# middlewares.py
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class SmartRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        self.max_retry_times = settings.getint('RETRY_TIMES', 3)
        # Patterns that indicate we should retry even on 200
        self.retry_patterns = [
            b'Access Denied',
            b'blocked',
            b'captcha',
            b'rate limit',
        ]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        # Honor the per-request opt-out, same as the built-in middleware
        if request.meta.get('dont_retry', False):
            return response

        # Check whether a "successful" response actually looks blocked
        if response.status == 200:
            for pattern in self.retry_patterns:
                if pattern in response.body:
                    spider.logger.warning(
                        f'Retry pattern "{pattern.decode()}" found in {response.url}'
                    )
                    return self._retry_with_backoff(request, 'blocked_content', spider) or response

        # Let the parent class handle status-code-based retries
        return super().process_response(request, response, spider)

    def _retry_with_backoff(self, request, reason, spider):
        retry_times = request.meta.get('retry_times', 0) + 1

        if retry_times <= self.max_retry_times:
            # Exponential backoff: 2^retry_times seconds
            delay = 2 ** retry_times
            spider.logger.info(
                f'Retrying {request.url} (attempt {retry_times}/{self.max_retry_times}) '
                f'after {delay}s delay. Reason: {reason}'
            )
            # Sleep before retrying (not ideal, since this blocks, but fine for simple cases).
            # For production, prefer Scrapy's download delay / AutoThrottle settings instead.
            time.sleep(delay)

            new_request = request.copy()
            new_request.meta['retry_times'] = retry_times
            new_request.dont_filter = True
            # self.priority_adjust is set by the parent RetryMiddleware from RETRY_PRIORITY_ADJUST
            new_request.priority = request.priority + self.priority_adjust
            return new_request
        else:
            spider.logger.error(
                f'Gave up retrying {request.url} (failed {retry_times} times): {reason}'
            )
            return None
Enable It
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.SmartRetryMiddleware': 550,
}

# Configure retry behavior
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
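Because the middleware honors Scrapy's standard dont_retry meta key, you can still opt individual requests out of retrying from the spider. For example (the URL and callback are placeholders):

# In a spider callback — skip retries for a request you don't care about
yield scrapy.Request(
    'https://example.com/optional-page',   # placeholder URL
    callback=self.parse_optional,          # placeholder callback
    meta={'dont_retry': True},
)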
Key Improvements
1. Content-based retry detection
Not all failures return error codes. Sometimes you get a 200 with "blocked" in the HTML.
2. Exponential backoff
Instead of retrying immediately, we wait progressively longer (2s, 4s, 8s).
3. Better logging
You can actually see what's happening and why retries occur.
Respecting Rate Limits (The Right Way)
Here's something the documentation barely covers: how to actually respect rate limits properly.
Dynamic Delay Middleware
# middlewares.py
import time


class AdaptiveDelayMiddleware:
    def __init__(self, initial_delay=1.0, max_delay=10.0):
        self.delay = initial_delay
        self.max_delay = max_delay
        self.last_request_time = None
        self.consecutive_errors = 0

    @classmethod
    def from_crawler(cls, crawler):
        initial = crawler.settings.getfloat('ADAPTIVE_DELAY_INITIAL', 1.0)
        maximum = crawler.settings.getfloat('ADAPTIVE_DELAY_MAX', 10.0)
        return cls(initial, maximum)

    def process_request(self, request, spider):
        # Skip the extra delay for retries (the retry middleware already backs off)
        if request.meta.get('retry_times', 0) > 0:
            return None

        # Apply the delay. Note: time.sleep blocks, which is tolerable at low
        # concurrency; see the "Common Mistakes" section for the trade-off.
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)

        self.last_request_time = time.time()
        return None

    def process_response(self, request, response, spider):
        # Decrease delay on success
        if response.status == 200:
            self.consecutive_errors = 0
            if self.delay > 0.5:
                self.delay *= 0.95  # Gradually speed up
                spider.logger.debug(f'Decreased delay to {self.delay:.2f}s')
        # Increase delay on rate limit
        elif response.status == 429:
            self.consecutive_errors += 1
            self.delay = min(self.delay * 2, self.max_delay)
            spider.logger.warning(
                f'Rate limited! Increased delay to {self.delay:.2f}s'
            )
        return response
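To turn it on, register the middleware and (optionally) tune the two settings that from_crawler reads. Note that ADAPTIVE_DELAY_INITIAL and ADAPTIVE_DELAY_MAX are custom settings for this middleware, not built-in Scrapy ones:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AdaptiveDelayMiddleware': 500,
}
ADAPTIVE_DELAY_INITIAL = 1.0   # starting delay between requests, in seconds
ADAPTIVE_DELAY_MAX = 10.0      # never wait longer than this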
What Makes This Better
1. It adapts automatically
Starts conservative, speeds up when things are going well, slows down when you hit rate limits.
2. It's respectful
You're not hammering the server. You're being a good citizen of the internet.
3. It improves efficiency
Instead of using a fixed slow delay, it finds the sweet spot between speed and not getting blocked.
Request Fingerprinting (Advanced but Useful)
Here's something most tutorials skip: making your requests look more realistic by adding headers that real browsers send.
Realistic Headers Middleware
# middlewares.py
import random


class RealisticHeadersMiddleware:
    def __init__(self):
        self.languages = [
            'en-US,en;q=0.9',
            'en-GB,en;q=0.9',
            'en-US,en;q=0.9,es;q=0.8',
        ]
        self.encodings = [
            'gzip, deflate, br',
            'gzip, deflate',
        ]

    def process_request(self, request, spider):
        # These headers make requests look more like real browsers
        request.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        request.headers['Accept-Language'] = random.choice(self.languages)
        request.headers['Accept-Encoding'] = random.choice(self.encodings)
        request.headers['DNT'] = '1'  # Do Not Track
        request.headers['Connection'] = 'keep-alive'
        request.headers['Upgrade-Insecure-Requests'] = '1'

        # Add a Referer for non-start URLs
        if request.meta.get('depth', 0) > 0:
            # Set it to the previous page (if the spider provided one)
            referer = request.meta.get('referer')
            if referer:
                request.headers['Referer'] = referer
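One caveat: the Referer branch only fires if your spider actually puts a referer value into request.meta. A minimal way to do that from a parse callback (the CSS selector is just an example) looks like this:

# In your spider — pass the current page along as the referer
def parse(self, response):
    for href in response.css('a::attr(href)').getall():
        yield response.follow(
            href,
            callback=self.parse,
            meta={'referer': response.url},  # picked up by RealisticHeadersMiddleware
        )

Scrapy's built-in RefererMiddleware also sets a Referer header on followed links by default, so treat this as a way to keep explicit control over the value rather than the only option.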
Why This Matters
Websites don't just look at your User-Agent. They check:
- Whether you send Accept headers
- If your Accept-Language makes sense
- Whether you're sending Connection: keep-alive
- If you have a Referer header on non-initial requests
Missing these headers is a red flag that screams "I'm a bot!"
Debugging Middlewares
This is crucial but rarely covered: how do you debug when middlewares aren't working?
Debug Logging Middleware
# middlewares.py
class DebugMiddleware:
    def process_request(self, request, spider):
        spider.logger.info(f"""
=== REQUEST ===
URL: {request.url}
Method: {request.method}
Headers: {dict(request.headers)}
Meta: {request.meta}
Priority: {request.priority}
================
""")

    def process_response(self, request, response, spider):
        spider.logger.info(f"""
=== RESPONSE ===
URL: {response.url}
Status: {response.status}
Headers: {dict(response.headers)}
Body length: {len(response.body)}
================
""")
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.error(f"""
=== EXCEPTION ===
URL: {request.url}
Exception: {exception}
Type: {type(exception).__name__}
================
""")
Pro tip: enable this middleware with a low order number (like 1) during development, so its process_request runs before everything else and its process_response runs last, then disable it in production.
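Assuming the same project layout as the rest of this guide, that toggle looks like this:

# settings.py (development only)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DebugMiddleware': 1,  # process_request first, process_response last
    # ...the rest of your middlewares...
}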
Common Mistakes to Avoid
After helping dozens of people with Scrapy middlewares, here are the mistakes I see repeatedly:
1. Not Returning Anything
For process_request, falling off the end returns None, which just means "continue", so that slip is mostly harmless. Where it really hurts is process_response, which must return a Response or a Request; returning None there makes Scrapy raise an error.
# WRONG - process_response must return something
def process_response(self, request, response, spider):
    spider.logger.debug(f'Got {response.status} from {response.url}')
    # Missing return!

# RIGHT
def process_response(self, request, response, spider):
    spider.logger.debug(f'Got {response.status} from {response.url}')
    return response  # Always hand the response back
2. Wrong Priority Order
# WRONG - your middleware's process_request runs after the retry middleware's
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 600,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

# RIGHT - your middleware's process_request runs before retry
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 540,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}
3. Blocking Operations in Middlewares
# WRONG - blocking call
def process_request(self, request, spider):
    time.sleep(5)  # Blocks the entire reactor!
    return None

# RIGHT - use Scrapy's delay settings
# In settings.py:
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True
4. Not Handling Edge Cases
# WRONG - will crash if the header isn't set
def process_request(self, request, spider):
    ua = request.headers['User-Agent']  # KeyError!

# RIGHT - always check
def process_request(self, request, spider):
    ua = request.headers.get('User-Agent', b'')
Testing Your Middlewares
Here's a simple way to test if your middlewares are working:
# test_spider.py
import json

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test_middleware'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        # This endpoint returns the headers you sent
        headers = json.loads(response.text)
        self.logger.info(f"Headers sent: {headers}")

        # Check if your custom headers are there
        user_agent = headers.get('headers', {}).get('User-Agent', '')
        self.logger.info(f"User-Agent: {user_agent}")
Run it with:
scrapy crawl test_middleware -L INFO
Production-Ready Middleware Setup
Here's what a real production setup looks like:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable defaults we're replacing
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

    # Custom middlewares
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.RealisticHeadersMiddleware': 410,
    'myproject.middlewares.AdaptiveDelayMiddleware': 500,
    'myproject.middlewares.SmartRetryMiddleware': 550,

    # Keep these enabled
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# General scraping settings
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Retry settings
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Enable AutoThrottle (works alongside the adaptive delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
Final Thoughts
Middlewares are powerful, but they come with responsibility. Here's my rule of thumb:
Always ask yourself:
- Am I respecting the site's robots.txt?
- Am I using reasonable delays?
- Would my scraper cause problems if 100 people used it?
If the answer to any of these is "no," slow down and rethink your approach.
Scrapy middlewares are tools, and like any tool, they can be used well or poorly. Use them to build resilient scrapers that respect the sites they access.
What's next? Try building a middleware that:
- Rotates proxies (if you have access to a proxy service)
- Handles cookie management for logged-in sessions
- Implements a circuit breaker (stops sending requests after X consecutive failures)
Drop a comment if you want to see any of these in detail!