Muhammad Ikramullah Khan

Scrapy Middlewares: A Practical Guide for Beginners (With Real-World Examples)

If you've been using Scrapy for a while, you've probably heard about middlewares. Maybe you've even used a few. But if you're like most beginners, they still feel a bit mysterious—like this powerful feature hiding in the shadows of your scraper.

Here's the thing: middlewares are what separate scrapers that get blocked after 10 requests from ones that can run for hours without issues. They're the difference between a fragile script and a production-ready crawler.

In this guide, I'll walk you through Scrapy middlewares with practical, ethical examples. We'll cover what the documentation doesn't always explain clearly, and I'll show you patterns that actually work in the real world.

What Are Middlewares, Really?

Think of middlewares as checkpoints in your scraping pipeline. Every request and response flows through them, and you can intercept, modify, or even block them at these checkpoints.

There are two main types:

Downloader Middlewares — These sit between the Scrapy engine and the downloader. They handle:

  • Modifying requests before they're sent (adding headers, rotating proxies)
  • Processing responses before they reach your spider
  • Handling errors and retries

Spider Middlewares — These sit between the engine and your spider. They handle:

  • Processing responses before your spider's parse method
  • Handling items and requests generated by your spider
  • Managing exceptions from your spider

For most scraping tasks, you'll spend 90% of your time with downloader middlewares. That's where the magic happens.
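To make that concrete, here's a do-nothing downloader middleware sketch showing the three hooks Scrapy calls; the class name is just a placeholder:

# A no-op downloader middleware: every hook passes things through unchanged.
class NoOpDownloaderMiddleware:
    def process_request(self, request, spider):
        # Return None to continue, a Response to short-circuit the download,
        # or a new Request to reschedule
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (continue) or a Request (re-schedule)
        return response

    def process_exception(self, request, exception, spider):
        # Return None to let other middlewares and the default error handling deal with it
        return None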

When You Actually Need Middlewares

Before we dive into code, let's talk about when you should use middlewares:

You DON'T need middlewares if:

  • You're scraping a small, static site once
  • The site has no bot protection
  • You're just practicing or learning

You DO need middlewares if:

  • You're getting 403/429 errors
  • The site tracks user agents or IPs
  • You need to respect rate limits properly
  • You're building a production scraper
  • You want your scraper to recover from failures

Setting Up Your First Middleware

Let's start with something simple: rotating user agents ethically. This is probably the most common use case.

The Problem

When you make requests with Scrapy's default user agent, it literally says "Scrapy" in the string. Many sites will block this immediately:

User-Agent: Scrapy/2.11.0 (+https://scrapy.org)

The Solution

Here's a middleware that rotates through realistic user agents:

# middlewares.py
import random
from scrapy import signals

class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
        ]

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        spider.logger.info(f'User-Agent rotation enabled with {len(self.user_agents)} agents')

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agents)
        request.headers['User-Agent'] = user_agent
        spider.logger.debug(f'Using User-Agent: {user_agent[:50]}...')

Enabling It

Add this to your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # Disable default
    'myproject.middlewares.RandomUserAgentMiddleware': 400,  # Enable custom
}

What the Documentation Doesn't Tell You

1. The number (400) matters a lot

The number is the middleware's position in the chain: lower numbers sit closer to the engine, higher numbers closer to the downloader. Scrapy calls process_request() in ascending order and process_response() in descending order, and the built-in middlewares are spread across roughly 100-900 (RetryMiddleware sits at 550, for example).

If you're modifying outgoing requests (like adding headers), a value around 400-500 works well. If you're processing responses, position yourself relative to the built-ins you care about: a higher number (800-900) sees the response earlier, fresh from the downloader, while a lower number sees it only after the other middlewares have had their turn.

2. Always disable the middleware you're replacing

Notice how we set UserAgentMiddleware to None? If you don't, both middlewares stay in the chain: at best they do redundant work, at worst you're left debugging which one actually set the header.

3. The from_crawler method is optional but useful

You don't need it, but it gives you access to:

  • Settings: crawler.settings.get('MY_SETTING')
  • Stats: crawler.stats
  • Signals: To hook into spider lifecycle events
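Here's a minimal sketch of what that wiring can look like. MY_CUSTOM_DELAY and the my_middleware/requests stats key are hypothetical names used only for illustration:

# middlewares.py
from scrapy import signals

class ConfigurableMiddleware:
    def __init__(self, delay, stats):
        self.delay = delay
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(
            crawler.settings.getfloat('MY_CUSTOM_DELAY', 1.0),  # read a custom setting
            crawler.stats,                                       # access the stats collector
        )
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        spider.logger.info('Requests seen: %s', self.stats.get_value('my_middleware/requests', 0))

    def process_request(self, request, spider):
        self.stats.inc_value('my_middleware/requests')
        return None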

Smart Retry Logic (Beyond What's Built-In)

Scrapy has a retry middleware, but it's pretty basic. Here's how to make it smarter.

The Problem with Default Retries

The built-in retry middleware retries on:

  • Specific HTTP codes (500, 502, 503, 504, etc.)
  • Network errors
  • Timeouts

But what if the site returns 200 with "Access Denied" in the body? Or what if you need exponential backoff? The default middleware can't handle these.

Smart Retry Middleware

# middlewares.py
import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class SmartRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        self.max_retry_times = settings.getint('RETRY_TIMES', 3)
        # Patterns that indicate we should retry even on 200
        self.retry_patterns = [
            b'Access Denied',
            b'blocked',
            b'captcha',
            b'rate limit',
        ]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_response(self, request, response, spider):
        # First, handle normal retry logic
        if request.meta.get('dont_retry', False):
            return response

        # Check if response looks like it should be retried
        if response.status == 200:
            # Check for patterns in response body
            for pattern in self.retry_patterns:
                if pattern in response.body:
                    spider.logger.warning(
                        f'Retry pattern "{pattern.decode()}" found in {response.url}'
                    )
                    return self._retry_with_backoff(request, 'blocked_content', spider) or response

        # Let parent class handle status code retries
        return super().process_response(request, response, spider)

    def _retry_with_backoff(self, request, reason, spider):
        retry_times = request.meta.get('retry_times', 0) + 1

        if retry_times <= self.max_retry_times:
            # Exponential backoff: 2^retry_times seconds
            delay = 2 ** retry_times
            spider.logger.info(
                f'Retrying {request.url} (attempt {retry_times}/{self.max_retry_times}) '
                f'after {delay}s delay. Reason: {reason}'
            )

            # Sleep before retry (not ideal but works for simple cases)
            # For production, use Scrapy's download delay settings instead
            time.sleep(delay)

            new_request = request.copy()
            new_request.meta['retry_times'] = retry_times
            new_request.dont_filter = True
            new_request.priority = request.priority + self.priority_adjust  # set by RetryMiddleware from RETRY_PRIORITY_ADJUST

            return new_request
        else:
            spider.logger.error(
                f'Gave up retrying {request.url} (failed {retry_times} times): {reason}'
            )
            return None

Enable It

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.SmartRetryMiddleware': 550,
}

# Configure retry behavior
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

Key Improvements

1. Content-based retry detection
Not all failures return error codes. Sometimes you get a 200 with "blocked" in the HTML.

2. Exponential backoff
Instead of retrying immediately, we wait progressively longer (2s, 4s, 8s).

3. Better logging
You can actually see what's happening and why retries occur.

Respecting Rate Limits (The Right Way)

Here's something the documentation barely covers: how to actually respect rate limits properly.

Dynamic Delay Middleware

# middlewares.py
import time

class AdaptiveDelayMiddleware:
    def __init__(self, initial_delay=1.0, max_delay=10.0):
        self.delay = initial_delay
        self.max_delay = max_delay
        self.last_request_time = None
        self.consecutive_errors = 0

    @classmethod
    def from_crawler(cls, crawler):
        initial = crawler.settings.getfloat('ADAPTIVE_DELAY_INITIAL', 1.0)
        maximum = crawler.settings.getfloat('ADAPTIVE_DELAY_MAX', 10.0)
        return cls(initial, maximum)

    def process_request(self, request, spider):
        # Skip if this is a retry
        if request.meta.get('retry_times', 0) > 0:
            return None

        # Apply delay. Note: time.sleep blocks the whole reactor, which is fine
        # for low-concurrency crawls but not for high-throughput production runs.
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)

        self.last_request_time = time.time()
        return None

    def process_response(self, request, response, spider):
        # Decrease delay on success
        if response.status == 200:
            self.consecutive_errors = 0
            if self.delay > 0.5:
                self.delay *= 0.95  # Gradually speed up
                spider.logger.debug(f'Decreased delay to {self.delay:.2f}s')

        # Increase delay on rate limit
        elif response.status == 429:
            self.consecutive_errors += 1
            self.delay = min(self.delay * 2, self.max_delay)
            spider.logger.warning(
                f'Rate limited! Increased delay to {self.delay:.2f}s'
            )

        return response
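Enabling It

To wire it up, add it to settings.py. The 500 slot matches the production setup later in this post, and the two setting names are the ones the middleware reads in from_crawler (the values shown are its defaults):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AdaptiveDelayMiddleware': 500,
}

ADAPTIVE_DELAY_INITIAL = 1.0   # starting delay in seconds
ADAPTIVE_DELAY_MAX = 10.0      # upper bound the delay can grow to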

What Makes This Better

1. It adapts automatically
Starts conservative, speeds up when things are going well, slows down when you hit rate limits.

2. It's respectful
You're not hammering the server. You're being a good citizen of the internet.

3. It improves efficiency
Instead of using a fixed slow delay, it finds the sweet spot between speed and not getting blocked.

Request Fingerprinting (Advanced but Useful)

Here's something most tutorials skip: making your requests look more realistic by adding headers that real browsers send.

Realistic Headers Middleware

# middlewares.py
import random

class RealisticHeadersMiddleware:
    def __init__(self):
        self.languages = [
            'en-US,en;q=0.9',
            'en-GB,en;q=0.9',
            'en-US,en;q=0.9,es;q=0.8',
        ]

        self.encodings = [
            'gzip, deflate, br',
            'gzip, deflate',
        ]

    def process_request(self, request, spider):
        # These headers make requests look more like real browsers
        request.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        request.headers['Accept-Language'] = random.choice(self.languages)
        request.headers['Accept-Encoding'] = random.choice(self.encodings)
        request.headers['DNT'] = '1'  # Do Not Track
        request.headers['Connection'] = 'keep-alive'
        request.headers['Upgrade-Insecure-Requests'] = '1'

        # Add referer for non-start URLs
        if request.meta.get('depth', 0) > 0:
            # Set referer to previous page (if available)
            referer = request.meta.get('referer')
            if referer:
                request.headers['Referer'] = referer

Why This Matters

Websites don't just look at your User-Agent. They check:

  • Whether you send Accept headers
  • If your Accept-Language makes sense
  • Whether you're sending Connection: keep-alive
  • If you have a Referer header on non-initial requests

Missing these headers is a red flag that screams "I'm a bot!"

Debugging Middlewares

This is crucial but rarely covered: how do you debug when middlewares aren't working?

Debug Logging Middleware

# middlewares.py
class DebugMiddleware:
    def process_request(self, request, spider):
        spider.logger.info(f"""
=== REQUEST ===
URL: {request.url}
Method: {request.method}
Headers: {dict(request.headers)}
Meta: {request.meta}
Priority: {request.priority}
================
        """)

    def process_response(self, request, response, spider):
        spider.logger.info(f"""
=== RESPONSE ===
URL: {response.url}
Status: {response.status}
Headers: {dict(response.headers)}
Body length: {len(response.body)}
================
        """)
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.error(f"""
=== EXCEPTION ===
URL: {request.url}
Exception: {exception}
Type: {type(exception).__name__}
================
        """)

Pro tip: Enable this middleware during development and disable it in production. Its order number decides what it sees: a low number (like 1) logs requests as they leave the engine, before other middlewares touch them, while a high number (like 900) logs them right before they reach the downloader, with every modification applied.
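For example, to inspect what your other middlewares are actually doing to each request, you could register it close to the downloader (900 here is an illustrative slot, not a requirement):

# settings.py (development only)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DebugMiddleware': 900,  # logs requests after the other middlewares have modified them
}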

Common Mistakes to Avoid

After helping dozens of people with Scrapy middlewares, here are the mistakes I see repeatedly:

1. Not Returning the Response

process_request is forgiving: returning None (explicitly or implicitly) just means "continue processing". process_response is not: it must return a Response or a Request, and if you forget the return, Scrapy raises an error and the response never reaches your spider.

# WRONG - process_response must return a Response or Request
def process_response(self, request, response, spider):
    spider.logger.info(f'Got {response.status} from {response.url}')
    # Missing return!

# RIGHT
def process_response(self, request, response, spider):
    spider.logger.info(f'Got {response.status} from {response.url}')
    return response  # Pass the response along the chain

2. Wrong Priority Order

# WRONG - Your middleware runs after the retry middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 600,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

# RIGHT - Your middleware runs before retry
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 540,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

3. Blocking Operations in Middlewares

# WRONG - Blocking call
def process_request(self, request, spider):
    time.sleep(5)  # Blocks entire reactor!
    return None

# RIGHT - Use Scrapy's delay settings
# In settings.py:
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True

4. Not Handling Edge Cases

# WRONG - Will crash if user_agent isn't set
def process_request(self, request, spider):
    ua = request.headers['User-Agent']  # KeyError!

# RIGHT - Always check
def process_request(self, request, spider):
    ua = request.headers.get('User-Agent', b'')

Testing Your Middlewares

Here's a simple way to test if your middlewares are working:

# test_spider.py
import json

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test_middleware'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        # This endpoint echoes back the headers you sent
        data = json.loads(response.text)
        self.logger.info(f"Headers sent: {data}")

        # Check if your custom headers made it through
        user_agent = data.get('headers', {}).get('User-Agent', '')
        self.logger.info(f"User-Agent: {user_agent}")

Run it with:

scrapy crawl test_middleware -L INFO

Production-Ready Middleware Setup

Here's what a real production setup looks like:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable defaults
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,

    # Custom middlewares
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.RealisticHeadersMiddleware': 410,
    'myproject.middlewares.AdaptiveDelayMiddleware': 500,
    'myproject.middlewares.SmartRetryMiddleware': 550,

    # Keep these enabled
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# General scraping settings
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Retry settings
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Enable autothrottle (works with adaptive delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

Final Thoughts

Middlewares are powerful, but they come with responsibility. Here's my rule of thumb:

Always ask yourself:

  • Am I respecting the site's robots.txt?
  • Am I using reasonable delays?
  • Would my scraper cause problems if 100 people used it?

If the answer to any of these is "no," slow down and rethink your approach.

Scrapy middlewares are tools, and like any tool, they can be used well or poorly. Use them to build resilient scrapers that respect the sites they access.


What's next? Try building a middleware that:

  • Rotates proxies (if you have access to a proxy service)
  • Handles cookie management for logged-in sessions
  • Implements a circuit breaker pattern (stops requests after X failures; a rough sketch follows below)
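Here's a rough starting point for that last one. It's a sketch built on two assumptions: that dropping requests via IgnoreRequest is acceptable for your crawl, and that CIRCUIT_BREAKER_MAX_FAILURES is a setting name you define yourself:

# middlewares.py
from scrapy.exceptions import IgnoreRequest

class CircuitBreakerMiddleware:
    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    @classmethod
    def from_crawler(cls, crawler):
        # CIRCUIT_BREAKER_MAX_FAILURES is a hypothetical custom setting
        return cls(crawler.settings.getint('CIRCUIT_BREAKER_MAX_FAILURES', 5))

    def process_request(self, request, spider):
        if self.failures >= self.max_failures:
            spider.logger.warning(f'Circuit open, dropping {request.url}')
            raise IgnoreRequest('Circuit breaker open')
        return None

    def process_response(self, request, response, spider):
        if response.status == 429 or response.status >= 500:
            self.failures += 1
        else:
            self.failures = 0  # any healthy response closes the circuit again
        return response

    def process_exception(self, request, exception, spider):
        self.failures += 1
        return None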

Drop a comment if you want to see any of these in detail!
