Muhammad Ikramullah Khan

Scrapy Response Handling: The Complete Beginner's Guide (Why Your Spider Ignores 404s)

When I first started scraping, I hit a confusing problem. My spider would visit a page, I could see the request in the logs, but my parse() method never got called. No data. No errors. Just... nothing.

After hours of debugging, I discovered the truth: the page was returning a 404. And Scrapy, by default, silently drops anything that isn't a 200 response.

This behavior makes sense once you understand it, but nobody explains it clearly to beginners. Let me fix that right now.


The Big Secret: Scrapy Only Handles 200 Responses

Here's what the documentation doesn't emphasize enough:

By default, Scrapy only passes responses with status codes between 200 and 299 to your spider callbacks (the built-in HttpError spider middleware filters out the rest).

Everything else is filtered out or handled for you before it ever reaches your code:

  • 301/302 redirects? Followed automatically by the redirect middleware; your callback only ever sees the final page, never the redirect itself.
  • 404 not found? Dropped.
  • 403 forbidden? Dropped.
  • 500 server error? Retried a couple of times by the retry middleware, then dropped.

Your parse() method never sees them. They disappear into the void.
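
The drop isn't completely invisible, by the way. At the default INFO log level, the HttpError spider middleware notes each ignored response with a line roughly like this (exact wording can vary a little between Scrapy versions):

[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://example.com/missing>: HTTP status code is not handled or not allowed

If you see that message and wonder why parse() never ran, this is why.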

Why Does Scrapy Do This?

Think about it. Most of the time, you only care about successful responses. If a page returns 404, there's nothing to scrape. If it's a 500 error, the server is broken.

Scrapy assumes you don't want to waste time processing error pages. It's protecting you from bad data.

But sometimes you DO want to handle these responses. Maybe you want to:

  • Log which pages are missing (404s)
  • Handle redirects manually (301, 302)
  • Detect when you're being blocked (403)
  • Retry server errors (500, 503)

That's where handle_httpstatus_list comes in.


Understanding HTTP Status Codes (Quick Refresh)

Before we dive in, let's quickly review status codes:

2xx: Success

  • 200: OK, everything worked
  • 201: Created (used in APIs)

3xx: Redirection

  • 301: Moved Permanently
  • 302: Found (temporary redirect)
  • 304: Not Modified (cached)

4xx: Client Error

  • 400: Bad Request
  • 401: Unauthorized
  • 403: Forbidden
  • 404: Not Found
  • 429: Too Many Requests (rate limited!)

5xx: Server Error

  • 500: Internal Server Error
  • 502: Bad Gateway
  • 503: Service Unavailable
  • 504: Gateway Timeout

By default, Scrapy only hands 2xx responses to your callbacks; 3xx redirects are followed for you by the redirect middleware, and everything else is filtered out.


Handling Specific Status Codes (handle_httpstatus_list)

Method 1: Spider-Level (All Requests)

Add this to your spider class:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # Handle 404 and 500 responses
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        if response.status == 404:
            self.logger.warning(f'Page not found: {response.url}')
            # Do something with 404s
        elif response.status == 500:
            self.logger.error(f'Server error: {response.url}')
            # Do something with 500s
        else:
            # Normal 200 response
            yield {'url': response.url, 'title': response.css('h1::text').get()}

Now your spider receives 404 and 500 responses. You can check response.status and handle them appropriately.

Method 2: Per-Request (Specific URLs)

Sometimes you only want to handle certain codes for specific requests:

def parse(self, response):
    # This request will handle 404s
    yield scrapy.Request(
        'https://example.com/might-not-exist',
        callback=self.parse_page,
        meta={'handle_httpstatus_list': [404]}
    )

def parse_page(self, response):
    if response.status == 404:
        self.logger.info("Page doesn't exist, that's ok")
    else:
        # Process normal response
        yield {'data': response.css('.content::text').get()}

The meta={'handle_httpstatus_list': [404]} tells Scrapy to pass 404s to the callback for just this request.

Method 3: Settings (Project-Wide)

Set it in settings.py:

# settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 500]

Now ALL spiders in your project handle these codes by default.
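
There's also a blunter project-wide switch if you genuinely want every status code to reach your callbacks (use it with the same care as handle_httpstatus_all below):

# settings.py
HTTPERROR_ALLOW_ALL = True  # disable the HttpError filtering entirely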


Handling ALL Status Codes (handle_httpstatus_all)

Sometimes you want to handle every possible status code:

def parse(self, response):
    yield scrapy.Request(
        'https://example.com/anything',
        callback=self.parse_any,
        meta={'handle_httpstatus_all': True}
    )

def parse_any(self, response):
    self.logger.info(f'Got status {response.status} from {response.url}')

    if 200 <= response.status < 300:
        # Success: delegate to a success handler defined elsewhere in your spider
        yield from self.parse_success(response)
    elif 300 <= response.status < 400:
        # Redirect (the Location header comes back as bytes, so decode it for logging)
        self.logger.info(f'Redirect to: {response.headers.get("Location", b"").decode()}')
    elif 400 <= response.status < 500:
        # Client error
        self.logger.warning(f'Client error: {response.status}')
    elif 500 <= response.status < 600:
        # Server error
        self.logger.error(f'Server error: {response.status}')

Warning: Use handle_httpstatus_all carefully. You'll get EVERYTHING, including redirects that Scrapy normally handles automatically.
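
A related per-request knob: if what you actually want is just to stop Scrapy from following redirects for one request, the dont_redirect meta key is the more targeted tool. A small sketch (parse_redirect is just whatever callback you want to receive the 3xx; combining it with handle_httpstatus_list keeps the HttpError middleware from dropping it):

yield scrapy.Request(
    url,
    callback=self.parse_redirect,
    meta={
        'dont_redirect': True,                 # redirect middleware leaves the 3xx alone
        'handle_httpstatus_list': [301, 302],  # HttpError middleware lets it through
    }
)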


Real Example: Handling Missing Pages

Let's say you're scraping product pages, but some products have been deleted (404):

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com']

    handle_httpstatus_list = [404]

    def parse(self, response):
        # Get product links
        for link in response.css('.product a::attr(href)').getall():
            yield response.follow(link, callback=self.parse_product)

    def parse_product(self, response):
        if response.status == 404:
            # Product doesn't exist anymore
            yield {
                'url': response.url,
                'status': 'deleted',
                'found': False
            }
        else:
            # Product exists, scrape it
            yield {
                'url': response.url,
                'status': 'active',
                'found': True,
                'name': response.css('h1::text').get(),
                'price': response.css('.price::text').get()
            }

Now you can track which products have been deleted!
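
To try it out, run the spider with a feed export so both the 'active' and 'deleted' items land in one file. This assumes the spider lives inside a Scrapy project; -O (overwrite) exists since Scrapy 2.0, older versions only have -o (append):

scrapy crawl products -O products.json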


Real Example: Detecting Rate Limiting

Websites often return 429 (Too Many Requests) when you're scraping too fast:

import scrapy

class RateLimitSpider(scrapy.Spider):
    name = 'ratelimit'
    start_urls = ['https://example.com']

    handle_httpstatus_list = [429]

    custom_settings = {
        'DOWNLOAD_DELAY': 1  # Start with a 1 second delay
    }

    def parse(self, response):
        if response.status == 429:
            # We're being rate limited!
            self.logger.warning('Rate limited! Slowing down...')

            # Drop concurrency to 1 (this reaches into Scrapy internals, but works)
            self.crawler.engine.downloader.total_concurrency = 1

            # The server may tell us how long to back off
            retry_after = int(response.headers.get('Retry-After', 60))
            self.logger.info(f'Server asked us to wait {retry_after} seconds')
            # (we only log it here; see the AutoThrottle note below for real backoff)

            # Re-queue the request. Note: yield, not return; parse is a
            # generator, so a returned value would be silently ignored.
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                priority=10  # High priority
            )
            return

        # Normal processing
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
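
A gentler alternative to hand-rolling the slowdown is Scrapy's built-in AutoThrottle extension, which adjusts the delay automatically based on how the server is responding. A minimal settings sketch:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # ceiling when the server keeps pushing back
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server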

Real Example: Handling Redirects

By default, Scrapy follows redirects automatically. But sometimes you want to handle them manually:

import scrapy

class RedirectSpider(scrapy.Spider):
    name = 'redirects'
    start_urls = ['https://example.com/old-page']

    # Handle redirect status codes
    handle_httpstatus_list = [301, 302]

    custom_settings = {
        'REDIRECT_ENABLED': False  # Disable automatic redirect following
    }

    def parse(self, response):
        if response.status in [301, 302]:
            # Manual redirect handling
            new_url = response.headers.get('Location').decode('utf-8')

            self.logger.info(f'Redirect: {response.url} -> {new_url}')

            # Track the redirect
            yield {
                'original_url': response.url,
                'redirect_type': response.status,
                'new_url': new_url
            }

            # Follow manually if needed
            yield response.follow(new_url, callback=self.parse_final)
        else:
            # Normal page
            yield {'url': response.url, 'title': response.css('title::text').get()}

    def parse_final(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'is_final': True
        }

The Critical Difference: View Page Source vs Inspect Element

This is huge and trips up almost every beginner. Let me explain.

What You See in Inspect Element

When you right-click a page and choose "Inspect Element" or "Inspect," you see the DOM (Document Object Model). This is the HTML AFTER:

  • JavaScript has run
  • Content has loaded dynamically
  • AJAX requests have completed
  • React/Vue/Angular has rendered
  • Infinite scroll has loaded more items

This is NOT what Scrapy sees.

What Scrapy Actually Sees

Scrapy downloads the raw HTML. It doesn't run JavaScript. It doesn't wait for AJAX. It sees the page BEFORE any JavaScript execution.

To see what Scrapy sees, you need to view the page source.

How to View Page Source

Method 1: Right-Click Menu

  1. Right-click on the page
  2. Choose "View Page Source" (NOT "Inspect")
  3. This opens a new tab with raw HTML

Method 2: Keyboard Shortcut

  • Windows/Linux: Ctrl + U
  • Mac: Cmd + Option + U

Method 3: URL Bar

  • Add view-source: before the URL
  • Example: view-source:https://example.com
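
There's also a terminal option: scrapy fetch downloads a URL with Scrapy's own downloader and prints the raw body, which is the most honest answer to "what does Scrapy see?":

scrapy fetch --nolog "https://example.com" > page.html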

The Problem This Solves

Here's a real scenario:

You inspect a product page and see:

<div class="price">$29.99</div>

You write this selector:

response.css('.price::text').get()

But it returns None. Why?

You view page source and discover:

<div class="price"></div>
<script>
    // Price loads via JavaScript
    loadPrice();
</script>

The price isn't in the HTML! It's loaded by JavaScript. Scrapy can't see it because Scrapy doesn't run JavaScript.

Real Example: JavaScript-Loaded Content

Let's say you're scraping a product list. Inspect Element shows:

<div class="products">
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product">Product 3</div>
</div>

But when you view page source, you see:

<div class="products">
    <!-- Products loaded by JavaScript -->
</div>
<script src="loadProducts.js"></script>

Your selector won't work! The products aren't in the HTML Scrapy downloads.

Solutions:

  1. Use Scrapy-Playwright or Scrapy-Selenium (renders JavaScript)
  2. Find the API endpoint the JavaScript calls
  3. Extract data from <script> tags if it's embedded there (see the sketch below)
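
Option 3 is often the quickest win, because many sites ship their data as embedded JSON even when the visible HTML is built by JavaScript. Here's a minimal sketch, assuming a hypothetical page that embeds something like <script>window.__DATA__ = {...};</script> with a products array inside (the __DATA__ marker, the products key, and the field names are all made up for illustration):

import json
import re

def parse(self, response):
    # Grab the <script> block that contains the embedded JSON (hypothetical marker)
    script = response.xpath('//script[contains(., "__DATA__")]/text()').get()
    if not script:
        return

    # Pull out the JSON object assigned to window.__DATA__
    match = re.search(r'__DATA__\s*=\s*(\{.*?\});', script, re.DOTALL)
    if not match:
        return

    data = json.loads(match.group(1))
    for product in data.get('products', []):
        yield {
            'name': product.get('name'),
            'price': product.get('price'),
        }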

Testing What Scrapy Sees

Use Scrapy shell to see exactly what Scrapy downloads:

scrapy shell "https://example.com"

Then check:

# See the raw HTML
>>> print(response.text)

# Try your selectors
>>> response.css('.price::text').get()

If your selector returns None but works in the browser, the content is JavaScript-loaded.
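
One more shell trick: view(response) opens the HTML Scrapy just downloaded in your local browser, which makes JavaScript-shaped holes very obvious.

# Open Scrapy's downloaded HTML in a browser tab
>>> view(response)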


Complete Example: Production-Ready Response Handling

Here's a spider that handles responses properly:

import time

import scrapy

class RobustSpider(scrapy.Spider):
    name = 'robust'
    start_urls = ['https://example.com/products']

    # Handle various status codes
    handle_httpstatus_list = [404, 403, 429, 500, 503]

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 8,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408]
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'success': 0,
            'not_found': 0,
            'forbidden': 0,
            'rate_limited': 0,
            'server_error': 0
        }

    def parse(self, response):
        # Check status before processing. handle_error is a generator, so
        # delegate with "yield from" (a plain return value would be silently
        # ignored by Scrapy).
        if response.status != 200:
            yield from self.handle_error(response)
            return

        # Normal processing
        self.stats['success'] += 1

        for product in response.css('.product'):
            href = product.css('a::attr(href)').get()
            if href:
                yield response.follow(
                    href,
                    callback=self.parse_product,
                    errback=self.handle_failure
                )

    def parse_product(self, response):
        # Check status (same yield from / return pattern as parse)
        if response.status != 200:
            yield from self.handle_error(response)
            return

        # Scrape product
        yield {
            'url': response.url,
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'status': 'active'
        }

    def handle_error(self, response):
        """Handle non-200 responses"""

        if response.status == 404:
            self.stats['not_found'] += 1
            self.logger.warning(f'404 Not Found: {response.url}')
            yield {
                'url': response.url,
                'status': 'deleted',
                'error': '404'
            }

        elif response.status == 403:
            self.stats['forbidden'] += 1
            self.logger.error(f'403 Forbidden: {response.url}')
            # Might be blocked, slow down.
            # Note: time.sleep() blocks the whole Twisted reactor, so every
            # in-flight request pauses too; crude, but that is the intent here.
            self.crawler.engine.pause()
            time.sleep(10)
            self.crawler.engine.unpause()

        elif response.status == 429:
            self.stats['rate_limited'] += 1
            self.logger.warning(f'429 Rate Limited: {response.url}')
            # Re-queue the request so it gets retried later
            yield scrapy.Request(
                response.url,
                callback=self.parse_product,
                dont_filter=True,
                priority=0
            )

        elif response.status >= 500:
            self.stats['server_error'] += 1
            self.logger.error(f'{response.status} Server Error: {response.url}')
            # Retry middleware will handle this

    def handle_failure(self, failure):
        """Handle request failures (network errors, etc.)"""
        self.logger.error(f'Request failed: {failure}')

    def closed(self, reason):
        """Log statistics when spider closes"""
        self.logger.info('='*60)
        self.logger.info('SPIDER STATISTICS')
        self.logger.info(f'Success: {self.stats["success"]}')
        self.logger.info(f'Not Found (404): {self.stats["not_found"]}')
        self.logger.info(f'Forbidden (403): {self.stats["forbidden"]}')
        self.logger.info(f'Rate Limited (429): {self.stats["rate_limited"]}')
        self.logger.info(f'Server Errors (5xx): {self.stats["server_error"]}')
        self.logger.info('='*60)

This spider:

  • Handles multiple error codes
  • Tracks statistics (see the note on Scrapy's built-in stats collector below)
  • Slows down when blocked
  • Re-queues rate-limited requests
  • Logs everything properly

Common Mistakes

Mistake #1: Not Checking Status

# WRONG (assumes all responses are 200)
def parse(self, response):
    yield {'title': response.css('h1::text').get()}

# RIGHT (checks status first)
def parse(self, response):
    if response.status == 200:
        yield {'title': response.css('h1::text').get()}
    else:
        self.logger.warning(f'Got status {response.status}')

Mistake #2: Using Inspect Element Instead of View Source

# You see this in Inspect Element
response.css('.dynamically-loaded::text').get()

# Returns None because content isn't in page source!
# Always check view-source: first

Mistake #3: Forgetting to Add Status Code to List

handle_httpstatus_list = [404]

def parse(self, response):
    # This never runs for 500 errors!
    if response.status == 500:
        self.logger.error('Server error')

If you want to handle 500s, add them to the list!
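
The fix is simply to list every status you intend to branch on:

handle_httpstatus_list = [404, 500]  # now both 404s and 500s reach parse()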


Quick Reference

Spider-Level Handling

class MySpider(scrapy.Spider):
    handle_httpstatus_list = [404, 403, 500]

Per-Request Handling

yield scrapy.Request(
    url,
    meta={'handle_httpstatus_list': [404]}
)

Handle All Codes

yield scrapy.Request(
    url,
    meta={'handle_httpstatus_all': True}
)

Settings

# settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 500]

Check Status

def parse(self, response):
    if response.status == 200:
        ...  # success
    elif response.status == 404:
        ...  # not found
    elif response.status >= 500:
        ...  # server error

View Page Source

  • Right-click → "View Page Source"
  • Ctrl+U (Windows/Linux) or Cmd+Option+U (Mac)
  • view-source:https://example.com

Summary

Key takeaways:

  • Scrapy only handles 200-299 responses by default
  • Use handle_httpstatus_list to handle specific codes
  • Use handle_httpstatus_all to handle everything
  • Always check response.status before processing
  • View Page Source, not Inspect Element (critical!)
  • Page source shows what Scrapy sees
  • Inspect Element shows what the browser renders

Start checking status codes in your spiders. View page source before writing selectors. Your scraping life will get much easier.

Happy scraping! 🕷️
