Muhammad Ikramullah Khan

Scrapy Middlewares: A Practical Guide for Beginners (With Real-World Examples)

If you've been using Scrapy for a while, you've probably heard about middlewares. Maybe you've even used a few. But if you're like most beginners, they still feel a bit mysterious—like this powerful feature hiding in the shadows of your scraper.

Here's the thing: middlewares are what separate scrapers that get blocked after 10 requests from ones that can run for hours without issues. They're the difference between a fragile script and a production-ready crawler.

In this guide, I'll walk you through Scrapy middlewares with practical, ethical examples. We'll cover what the documentation doesn't always explain clearly, and I'll show you patterns that actually work in the real world.

What Are Middlewares, Really?

Think of middlewares as checkpoints in your scraping pipeline. Every request and response flows through them, and you can intercept, modify, or even block them at these checkpoints.

There are two main types:

Downloader Middlewares — These sit between the Scrapy engine and the downloader. They handle:

  • Modifying requests before they're sent (adding headers, rotating proxies)
  • Processing responses before they reach your spider
  • Handling errors and retries

Spider Middlewares — These sit between the engine and your spider. They handle:

  • Processing responses before your spider's parse method
  • Handling items and requests generated by your spider
  • Managing exceptions from your spider

For most scraping tasks, you'll spend 90% of your time with downloader middlewares. That's where the magic happens.
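To make that concrete, here's a do-nothing downloader middleware sketch showing the three hooks Scrapy calls; the class name is just a placeholder:

# A no-op downloader middleware: every hook passes things through unchanged.
class NoOpDownloaderMiddleware:
    def process_request(self, request, spider):
        # Return None to continue, a Response to short-circuit the download,
        # or a new Request to reschedule
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (continue) or a Request (re-schedule)
        return response

    def process_exception(self, request, exception, spider):
        # Return None to let other middlewares and the default error handling deal with it
        return None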

When You Actually Need Middlewares

Before we dive into code, let's talk about when you should use middlewares:

You DON'T need middlewares if:

  • You're scraping a small, static site once
  • The site has no bot protection
  • You're just practicing or learning

You DO need middlewares if:

  • You're getting 403/429 errors
  • The site tracks user agents or IPs
  • You need to respect rate limits properly
  • You're building a production scraper
  • You want your scraper to recover from failures

Setting Up Your First Middleware

Let's start with something simple: rotating user agents ethically. This is probably the most common use case.

The Problem

When you make requests with Scrapy's default user agent, it literally says "Scrapy" in the string. Many sites will block this immediately:

User-Agent: Scrapy/2.11.0 (+https://scrapy.org)

The Solution

Here's a middleware that rotates through realistic user agents:

# middlewares.py
import random
from scrapy import signals

class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
        ]

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info(f'User-Agent rotation enabled with {len(self.user_agents)} agents')

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agents)
        request.headers['User-Agent'] = user_agent
        spider.logger.debug(f'Using User-Agent: {user_agent[:50]}...')

Enabling It

Add this to your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # Disable default
    'myproject.middlewares.RandomUserAgentMiddleware': 400,  # Enable custom
}

What the Documentation Doesn't Tell You

1. The number (400) matters a lot

The number is the middleware's position in the chain: lower numbers sit closer to the engine, higher numbers closer to the downloader. Scrapy calls process_request() in ascending order and process_response() in descending order, and the built-in middlewares are spread across roughly 100-900 (RetryMiddleware sits at 550, for example).

If you're modifying outgoing requests (like adding headers), a value around 400-500 works well. If you're processing responses, position yourself relative to the built-ins you care about: a higher number (800-900) sees the response earlier, fresh from the downloader, while a lower number sees it only after the other middlewares have had their turn.

2. Always disable the middleware you're replacing

Notice how we set UserAgentMiddleware to None? If you don't, both middlewares stay in the chain: at best they do redundant work, at worst you're left debugging which one actually set the header.

3. The from_crawler method is optional but useful

You don't need it, but it gives you access to:

  • Settings: crawler.settings.get('MY_SETTING')
  • Stats: crawler.stats
  • Signals: To hook into spider lifecycle events
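Here's a minimal sketch of what that wiring can look like. MY_CUSTOM_DELAY and the my_middleware/requests stats key are hypothetical names used only for illustration:

# middlewares.py
from scrapy import signals

class ConfigurableMiddleware:
    def __init__(self, delay, stats):
        self.delay = delay
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(
            crawler.settings.getfloat('MY_CUSTOM_DELAY', 1.0),  # read a custom setting
            crawler.stats,                                       # access the stats collector
        )
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        spider.logger.info('Requests seen: %s', self.stats.get_value('my_middleware/requests', 0))

    def process_request(self, request, spider):
        self.stats.inc_value('my_middleware/requests')
        return None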

Smart Retry Logic (Beyond What's Built-In)

Scrapy has a retry middleware, but it's pretty basic. Here's how to make it smarter.

The Problem with Default Retries

The built-in retry middleware retries on:

  • Specific HTTP codes (500, 502, 503, 504, etc.)
  • Network errors
  • Timeouts

But what if the site returns 200 with "Access Denied" in the body? Or what if you need exponential backoff? The default middleware can't handle these.

Smart Retry Middleware

# middlewares.py
import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class SmartRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        self.max_retry_times = settings.getint('RETRY_TIMES', 3)
        # Patterns that indicate we should retry even on 200
        self.retry_patterns = [
            b'Access Denied',
            b'blocked',
            b'captcha',
            b'rate limit',
        ]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        # First, handle normal retry logic
        if request.meta.get('dont_retry', False):
            return response

        # Check if response looks like it should be retried
        if response.status == 200:
            # Check for patterns in response body
            for pattern in self.retry_patterns:
                if pattern in response.body:
                    spider.logger.warning(
                        f'Retry pattern "{pattern.decode()}" found in {response.url}'
                    )
                    return self._retry_with_backoff(request, 'blocked_content', spider) or response

        # Let parent class handle status code retries
        return super().process_response(request, response, spider)

    def _retry_with_backoff(self, request, reason, spider):
        retry_times = request.meta.get('retry_times', 0) + 1

        if retry_times <= self.max_retry_times:
            # Exponential backoff: 2^retry_times seconds
            delay = 2 ** retry_times
            spider.logger.info(
                f'Retrying {request.url} (attempt {retry_times}/{self.max_retry_times}) '
                f'after {delay}s delay. Reason: {reason}'
            )

            # Sleep before retry (not ideal but works for simple cases)
            # For production, use Scrapy's download delay settings instead
            time.sleep(delay)

            new_request = request.copy()
            new_request.meta['retry_times'] = retry_times
            new_request.dont_filter = True
            new_request.priority = request.priority + self.priority_adjust  # set by RetryMiddleware from RETRY_PRIORITY_ADJUST

            return new_request
        else:
            spider.logger.error(
                f'Gave up retrying {request.url} (failed {retry_times} times): {reason}'
            )
            return None

Enable It

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.SmartRetryMiddleware': 550,
}

# Configure retry behavior
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

Key Improvements

1. Content-based retry detection
Not all failures return error codes. Sometimes you get a 200 with "blocked" in the HTML.

2. Exponential backoff
Instead of retrying immediately, we wait progressively longer (2s, 4s, 8s).

3. Better logging
You can actually see what's happening and why retries occur.

Respecting Rate Limits (The Right Way)

Here's something the documentation barely covers: how to actually respect rate limits properly.

Dynamic Delay Middleware

# middlewares.py
import time

class AdaptiveDelayMiddleware:
    def __init__(self, initial_delay=1.0, max_delay=10.0):
        self.delay = initial_delay
        self.max_delay = max_delay
        self.last_request_time = None
        self.consecutive_errors = 0

    @classmethod
    def from_crawler(cls, crawler):
        initial = crawler.settings.getfloat('ADAPTIVE_DELAY_INITIAL', 1.0)
        maximum = crawler.settings.getfloat('ADAPTIVE_DELAY_MAX', 10.0)
        return cls(initial, maximum)

    def process_request(self, request, spider):
        # Skip if this is a retry
        if request.meta.get('retry_times', 0) > 0:
            return None

        # Apply delay. Note: time.sleep blocks the whole reactor, which is fine
        # for low-concurrency crawls but not for high-throughput production runs.
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)

        self.last_request_time = time.time()
        return None

    def process_response(self, request, response, spider):
        # Decrease delay on success
        if response.status == 200:
            self.consecutive_errors = 0
            if self.delay > 0.5:
                self.delay *= 0.95  # Gradually speed up
                spider.logger.debug(f'Decreased delay to {self.delay:.2f}s')

        # Increase delay on rate limit
        elif response.status == 429:
            self.consecutive_errors += 1
            self.delay = min(self.delay * 2, self.max_delay)
            spider.logger.warning(
                f'Rate limited! Increased delay to {self.delay:.2f}s'
            )

        return response
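Enabling It

To wire it up, add it to settings.py. The 500 slot matches the production setup later in this post, and the two setting names are the ones the middleware reads in from_crawler (the values shown are its defaults):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AdaptiveDelayMiddleware': 500,
}

ADAPTIVE_DELAY_INITIAL = 1.0   # starting delay in seconds
ADAPTIVE_DELAY_MAX = 10.0      # upper bound the delay can grow to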

What Makes This Better

1. It adapts automatically
Starts conservative, speeds up when things are going well, slows down when you hit rate limits.

2. It's respectful
You're not hammering the server. You're being a good citizen of the internet.

3. It improves efficiency
Instead of using a fixed slow delay, it finds the sweet spot between speed and not getting blocked.

Request Fingerprinting (Advanced but Useful)

Here's something most tutorials skip: making your requests look more realistic by adding headers that real browsers send.

Realistic Headers Middleware

# middlewares.py
import random

class RealisticHeadersMiddleware:
    def __init__(self):
        self.languages = [
            'en-US,en;q=0.9',
            'en-GB,en;q=0.9',
            'en-US,en;q=0.9,es;q=0.8',
        ]

        self.encodings = [
            'gzip, deflate, br',
            'gzip, deflate',
        ]

    def process_request(self, request, spider):
        # These headers make requests look more like real browsers
        request.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        request.headers['Accept-Language'] = random.choice(self.languages)
        request.headers['Accept-Encoding'] = random.choice(self.encodings)
        request.headers['DNT'] = '1'  # Do Not Track
        request.headers['Connection'] = 'keep-alive'
        request.headers['Upgrade-Insecure-Requests'] = '1'

        # Add referer for non-start URLs
        if request.meta.get('depth', 0) > 0:
            # Set referer to previous page (if available)
            referer = request.meta.get('referer')
            if referer:
                request.headers['Referer'] = referer

Why This Matters

Websites don't just look at your User-Agent. They check:

  • Whether you send Accept headers
  • If your Accept-Language makes sense
  • Whether you're sending Connection: keep-alive
  • If you have a Referer header on non-initial requests

Missing these headers is a red flag that screams "I'm a bot!"

Debugging Middlewares

This is crucial but rarely covered: how do you debug when middlewares aren't working?

Debug Logging Middleware

# middlewares.py
class DebugMiddleware:
    def process_request(self, request, spider):
        spider.logger.info(f"""
=== REQUEST ===
URL: {request.url}
Method: {request.method}
Headers: {dict(request.headers)}
Meta: {request.meta}
Priority: {request.priority}
================
        """)

    def process_response(self, request, response, spider):
        spider.logger.info(f"""
=== RESPONSE ===
URL: {response.url}
Status: {response.status}
Headers: {dict(response.headers)}
Body length: {len(response.body)}
================
        """)
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.error(f"""
=== EXCEPTION ===
URL: {request.url}
Exception: {exception}
Type: {type(exception).__name__}
================
        """)

Pro tip: Enable this middleware during development and disable it in production. Its order number decides what it sees: a low number (like 1) logs requests as they leave the engine, before other middlewares touch them, while a high number (like 900) logs them right before they reach the downloader, with every modification applied.
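For example, to inspect what your other middlewares are actually doing to each request, you could register it close to the downloader (900 here is an illustrative slot, not a requirement):

# settings.py (development only)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DebugMiddleware': 900,  # logs requests after the other middlewares have modified them
}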

Common Mistakes to Avoid

After helping dozens of people with Scrapy middlewares, here are the mistakes I see repeatedly:

1. Not Returning the Response

process_request is forgiving: returning None (explicitly or implicitly) just means "continue processing". process_response is not: it must return a Response or a Request, and if you forget the return, Scrapy raises an error and the response never reaches your spider.

# WRONG - process_response must return a Response or Request
def process_response(self, request, response, spider):
    spider.logger.info(f'Got {response.status} from {response.url}')
    # Missing return!

# RIGHT
def process_response(self, request, response, spider):
    spider.logger.info(f'Got {response.status} from {response.url}')
    return response  # Pass the response along the chain

2. Wrong Priority Order

# WRONG - Your middleware runs after the retry middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 600,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

# RIGHT - Your middleware runs before retry
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 540,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

3. Blocking Operations in Middlewares

# WRONG - Blocking call
def process_request(self, request, spider):
    time.sleep(5)  # Blocks entire reactor!
    return None

# RIGHT - Use Scrapy's delay settings
# In settings.py:
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True

4. Not Handling Edge Cases

# WRONG - Will crash if user_agent isn't set
def process_request(self, request, spider):
    ua = request.headers['User-Agent']  # KeyError!

# RIGHT - Always check
def process_request(self, request, spider):
    ua = request.headers.get('User-Agent', b'')

Testing Your Middlewares

Here's a simple way to test if your middlewares are working:

# test_spider.py
import json

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test_middleware'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        # This endpoint echoes back the headers you sent
        data = json.loads(response.text)
        self.logger.info(f"Headers sent: {data}")

        # Check if your custom headers made it through
        user_agent = data.get('headers', {}).get('User-Agent', '')
        self.logger.info(f"User-Agent: {user_agent}")

Run it with:

scrapy crawl test_middleware -L INFO

Production-Ready Middleware Setup

Here's what a real production setup looks like:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable defaults
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

    # Custom middlewares
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.RealisticHeadersMiddleware': 410,
    'myproject.middlewares.AdaptiveDelayMiddleware': 500,
    'myproject.middlewares.SmartRetryMiddleware': 550,

    # Keep these enabled
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# General scraping settings
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Retry settings
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Enable autothrottle (works with adaptive delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

Final Thoughts

Middlewares are powerful, but they come with responsibility. Here's my rule of thumb:

Always ask yourself:

  • Am I respecting the site's robots.txt?
  • Am I using reasonable delays?
  • Would my scraper cause problems if 100 people used it?

If the answer to any of these is "no," slow down and rethink your approach.

Scrapy middlewares are tools, and like any tool, they can be used well or poorly. Use them to build resilient scrapers that respect the sites they access.


What's next? Try building a middleware that:

  • Rotates proxies (if you have access to a proxy service)
  • Handles cookie management for logged-in sessions
  • Implements a circuit breaker pattern (stops requests after X failures; a rough sketch follows below)
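Here's a rough starting point for that last one. It's a sketch built on two assumptions: that dropping requests via IgnoreRequest is acceptable for your crawl, and that CIRCUIT_BREAKER_MAX_FAILURES is a setting name you define yourself:

# middlewares.py
from scrapy.exceptions import IgnoreRequest

class CircuitBreakerMiddleware:
    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    @classmethod
    def from_crawler(cls, crawler):
        # CIRCUIT_BREAKER_MAX_FAILURES is a hypothetical custom setting
        return cls(crawler.settings.getint('CIRCUIT_BREAKER_MAX_FAILURES', 5))

    def process_request(self, request, spider):
        if self.failures >= self.max_failures:
            spider.logger.warning(f'Circuit open, dropping {request.url}')
            raise IgnoreRequest('Circuit breaker open')
        return None

    def process_response(self, request, response, spider):
        if response.status == 429 or response.status >= 500:
            self.failures += 1
        else:
            self.failures = 0  # any healthy response closes the circuit again
        return response

    def process_exception(self, request, exception, spider):
        self.failures += 1
        return None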

Drop a comment if you want to see any of these in detail!
