Muhammad Ikramullah Khan

Scrapy Response Handling: The Complete Beginner's Guide (Why Your Spider Ignores 404s)

When I first started scraping, I hit a confusing problem. My spider would visit a page, I could see the request in the logs, but my parse() method never got called. No data. No errors. Just... nothing.

After hours of debugging, I discovered the truth: the page was returning a 404. And Scrapy, by default, silently drops anything that isn't a 200 response.

This behavior makes sense once you understand it, but nobody explains it clearly to beginners. Let me fix that right now.


The Big Secret: Scrapy Only Handles 200 Responses

Here's what the documentation doesn't emphasize enough:

By default, Scrapy only passes responses with status codes between 200 and 299 to your spider callbacks (the built-in HttpError spider middleware filters out the rest).

Everything else is filtered out or handled for you before it ever reaches your code:

  • 301/302 redirects? Followed automatically by the redirect middleware; your callback only ever sees the final page, never the redirect itself.
  • 404 not found? Dropped.
  • 403 forbidden? Dropped.
  • 500 server error? Retried a couple of times by the retry middleware, then dropped.

Your parse() method never sees them. They disappear into the void.
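
The drop isn't completely invisible, by the way. At the default INFO log level, the HttpError spider middleware notes each ignored response with a line roughly like this (exact wording can vary a little between Scrapy versions):

[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://example.com/missing>: HTTP status code is not handled or not allowed

If you see that message and wonder why parse() never ran, this is why.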

Why Does Scrapy Do This?

Think about it. Most of the time, you only care about successful responses. If a page returns 404, there's nothing to scrape. If it's a 500 error, the server is broken.

Scrapy assumes you don't want to waste time processing error pages. It's protecting you from bad data.

But sometimes you DO want to handle these responses. Maybe you want to:

  • Log which pages are missing (404s)
  • Handle redirects manually (301, 302)
  • Detect when you're being blocked (403)
  • Retry server errors (500, 503)

That's where handle_httpstatus_list comes in.


Understanding HTTP Status Codes (Quick Refresh)

Before we dive in, let's quickly review status codes:

2xx: Success

  • 200: OK, everything worked
  • 201: Created (used in APIs)

3xx: Redirection

  • 301: Moved Permanently
  • 302: Found (temporary redirect)
  • 304: Not Modified (cached)

4xx: Client Error

  • 400: Bad Request
  • 401: Unauthorized
  • 403: Forbidden
  • 404: Not Found
  • 429: Too Many Requests (rate limited!)

5xx: Server Error

  • 500: Internal Server Error
  • 502: Bad Gateway
  • 503: Service Unavailable
  • 504: Gateway Timeout

By default, Scrapy only hands 2xx responses to your callbacks; 3xx redirects are followed for you by the redirect middleware, and everything else is filtered out.


Handling Specific Status Codes (handle_httpstatus_list)

Method 1: Spider-Level (All Requests)

Add this to your spider class:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # Handle 404 and 500 responses
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        if response.status == 404:
            self.logger.warning(f'Page not found: {response.url}')
            # Do something with 404s
        elif response.status == 500:
            self.logger.error(f'Server error: {response.url}')
            # Do something with 500s
        else:
            # Normal 200 response
            yield {'url': response.url, 'title': response.css('h1::text').get()}

Now your spider receives 404 and 500 responses. You can check response.status and handle them appropriately.

Method 2: Per-Request (Specific URLs)

Sometimes you only want to handle certain codes for specific requests:

def parse(self, response):
    # This request will handle 404s
    yield scrapy.Request(
        'https://example.com/might-not-exist',
        callback=self.parse_page,
        meta={'handle_httpstatus_list': [404]}
    )

def parse_page(self, response):
    if response.status == 404:
        self.logger.info("Page doesn't exist, that's ok")
    else:
        # Process normal response
        yield {'data': response.css('.content::text').get()}

The meta={'handle_httpstatus_list': [404]} tells Scrapy to pass 404s to the callback for just this request.

Method 3: Settings (Project-Wide)

Set it in settings.py:

# settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 500]

Now ALL spiders in your project handle these codes by default.
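
There's also a blunter project-wide switch if you genuinely want every status code to reach your callbacks (use it with the same care as handle_httpstatus_all below):

# settings.py
HTTPERROR_ALLOW_ALL = True  # disable the HttpError filtering entirely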


Handling ALL Status Codes (handle_httpstatus_all)

Sometimes you want to handle every possible status code:

def parse(self, response):
    yield scrapy.Request(
        'https://example.com/anything',
        callback=self.parse_any,
        meta={'handle_httpstatus_all': True}
    )

def parse_any(self, response):
    self.logger.info(f'Got status {response.status} from {response.url}')

    if 200 <= response.status < 300:
        # Success: delegate to a success handler defined elsewhere in your spider
        yield from self.parse_success(response)
    elif 300 <= response.status < 400:
        # Redirect (the Location header comes back as bytes, so decode it for logging)
        self.logger.info(f'Redirect to: {response.headers.get("Location", b"").decode()}')
    elif 400 <= response.status < 500:
        # Client error
        self.logger.warning(f'Client error: {response.status}')
    elif 500 <= response.status < 600:
        # Server error
        self.logger.error(f'Server error: {response.status}')

Warning: Use handle_httpstatus_all carefully. You'll get EVERYTHING, including redirects that Scrapy normally handles automatically.
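
A related per-request knob: if what you actually want is just to stop Scrapy from following redirects for one request, the dont_redirect meta key is the more targeted tool. A small sketch (parse_redirect is just whatever callback you want to receive the 3xx; combining it with handle_httpstatus_list keeps the HttpError middleware from dropping it):

yield scrapy.Request(
    url,
    callback=self.parse_redirect,
    meta={
        'dont_redirect': True,                 # redirect middleware leaves the 3xx alone
        'handle_httpstatus_list': [301, 302],  # HttpError middleware lets it through
    }
)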


Real Example: Handling Missing Pages

Let's say you're scraping product pages, but some products have been deleted (404):

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com']

    handle_httpstatus_list = [404]

    def parse(self, response):
        # Get product links
        for link in response.css('.product a::attr(href)').getall():
            yield response.follow(link, callback=self.parse_product)

    def parse_product(self, response):
        if response.status == 404:
            # Product doesn't exist anymore
            yield {
                'url': response.url,
                'status': 'deleted',
                'found': False
            }
        else:
            # Product exists, scrape it
            yield {
                'url': response.url,
                'status': 'active',
                'found': True,
                'name': response.css('h1::text').get(),
                'price': response.css('.price::text').get()
            }

Now you can track which products have been deleted!
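
To try it out, run the spider with a feed export so both the 'active' and 'deleted' items land in one file. This assumes the spider lives inside a Scrapy project; -O (overwrite) exists since Scrapy 2.0, older versions only have -o (append):

scrapy crawl products -O products.json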


Real Example: Detecting Rate Limiting

Websites often return 429 (Too Many Requests) when you're scraping too fast:

import scrapy

class RateLimitSpider(scrapy.Spider):
    name = 'ratelimit'
    start_urls = ['https://example.com']

    handle_httpstatus_list = [429]

    custom_settings = {
        'DOWNLOAD_DELAY': 1  # Start with a 1 second delay
    }

    def parse(self, response):
        if response.status == 429:
            # We're being rate limited!
            self.logger.warning('Rate limited! Slowing down...')

            # Drop concurrency to 1 (this reaches into Scrapy internals, but works)
            self.crawler.engine.downloader.total_concurrency = 1

            # The server may tell us how long to back off
            retry_after = int(response.headers.get('Retry-After', 60))
            self.logger.info(f'Server asked us to wait {retry_after} seconds')
            # (we only log it here; see the AutoThrottle note below for real backoff)

            # Re-queue the request. Note: yield, not return; parse is a
            # generator, so a returned value would be silently ignored.
            yield scrapy.Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                priority=10  # High priority
            )
            return

        # Normal processing
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
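
A gentler alternative to hand-rolling the slowdown is Scrapy's built-in AutoThrottle extension, which adjusts the delay automatically based on how the server is responding. A minimal settings sketch:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # ceiling when the server keeps pushing back
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server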

Real Example: Handling Redirects

By default, Scrapy follows redirects automatically. But sometimes you want to handle them manually:

import scrapy

class RedirectSpider(scrapy.Spider):
    name = 'redirects'
    start_urls = ['https://example.com/old-page']

    # Handle redirect status codes
    handle_httpstatus_list = [301, 302]

    custom_settings = {
        'REDIRECT_ENABLED': False  # Disable automatic redirect following
    }

    def parse(self, response):
        if response.status in [301, 302]:
            # Manual redirect handling
            new_url = response.headers.get('Location').decode('utf-8')

            self.logger.info(f'Redirect: {response.url} -> {new_url}')

            # Track the redirect
            yield {
                'original_url': response.url,
                'redirect_type': response.status,
                'new_url': new_url
            }

            # Follow manually if needed
            yield response.follow(new_url, callback=self.parse_final)
        else:
            # Normal page
            yield {'url': response.url, 'title': response.css('title::text').get()}

    def parse_final(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'is_final': True
        }

The Critical Difference: View Page Source vs Inspect Element

This is huge and trips up almost every beginner. Let me explain.

What You See in Inspect Element

When you right-click a page and choose "Inspect Element" or "Inspect," you see the DOM (Document Object Model). This is the HTML AFTER:

  • JavaScript has run
  • Content has loaded dynamically
  • AJAX requests have completed
  • React/Vue/Angular has rendered
  • Infinite scroll has loaded more items

This is NOT what Scrapy sees.

What Scrapy Actually Sees

Scrapy downloads the raw HTML. It doesn't run JavaScript. It doesn't wait for AJAX. It sees the page BEFORE any JavaScript execution.

To see what Scrapy sees, you need to view the page source.

How to View Page Source

Method 1: Right-Click Menu

  1. Right-click on the page
  2. Choose "View Page Source" (NOT "Inspect")
  3. This opens a new tab with raw HTML

Method 2: Keyboard Shortcut

  • Windows/Linux: Ctrl + U
  • Mac: Cmd + Option + U

Method 3: URL Bar

  • Add view-source: before the URL
  • Example: view-source:https://example.com
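
There's also a terminal option: scrapy fetch downloads a URL with Scrapy's own downloader and prints the raw body, which is the most honest answer to "what does Scrapy see?":

scrapy fetch --nolog "https://example.com" > page.html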

The Problem This Solves

Here's a real scenario:

You inspect a product page and see:

<div class="price">$29.99</div>

You write this selector:

response.css('.price::text').get()

But it returns None. Why?

You view page source and discover:

<div class="price"></div>
<script>
    // Price loads via JavaScript
    loadPrice();
</script>

The price isn't in the HTML! It's loaded by JavaScript. Scrapy can't see it because Scrapy doesn't run JavaScript.

Real Example: JavaScript-Loaded Content

Let's say you're scraping a product list. Inspect Element shows:

<div class="products">
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product">Product 3</div>
</div>

But when you view page source, you see:

<div class="products">
    <!-- Products loaded by JavaScript -->
</div>
<script src="loadProducts.js"></script>

Your selector won't work! The products aren't in the HTML Scrapy downloads.

Solutions:

  1. Use Scrapy-Playwright or Scrapy-Selenium (renders JavaScript)
  2. Find the API endpoint the JavaScript calls
  3. Extract data from <script> tags if it's embedded there (see the sketch below)
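
Option 3 is often the quickest win, because many sites ship their data as embedded JSON even when the visible HTML is built by JavaScript. Here's a minimal sketch, assuming a hypothetical page that embeds something like <script>window.__DATA__ = {...};</script> with a products array inside (the __DATA__ marker, the products key, and the field names are all made up for illustration):

import json
import re

def parse(self, response):
    # Grab the <script> block that contains the embedded JSON (hypothetical marker)
    script = response.xpath('//script[contains(., "__DATA__")]/text()').get()
    if not script:
        return

    # Pull out the JSON object assigned to window.__DATA__
    match = re.search(r'__DATA__\s*=\s*(\{.*?\});', script, re.DOTALL)
    if not match:
        return

    data = json.loads(match.group(1))
    for product in data.get('products', []):
        yield {
            'name': product.get('name'),
            'price': product.get('price'),
        }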

Testing What Scrapy Sees

Use Scrapy shell to see exactly what Scrapy downloads:

scrapy shell "https://example.com"

Then check:

# See the raw HTML
>>> print(response.text)

# Try your selectors
>>> response.css('.price::text').get()

If your selector returns None but works in the browser, the content is JavaScript-loaded.
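
One more shell trick: view(response) opens the HTML Scrapy just downloaded in your local browser, which makes JavaScript-shaped holes very obvious.

# Open Scrapy's downloaded HTML in a browser tab
>>> view(response)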


Complete Example: Production-Ready Response Handling

Here's a spider that handles responses properly:

import time

import scrapy

class RobustSpider(scrapy.Spider):
    name = 'robust'
    start_urls = ['https://example.com/products']

    # Handle various status codes
    handle_httpstatus_list = [404, 403, 429, 500, 503]

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 8,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408]
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'success': 0,
            'not_found': 0,
            'forbidden': 0,
            'rate_limited': 0,
            'server_error': 0
        }

    def parse(self, response):
        # Check status before processing. handle_error is a generator, so
        # delegate with "yield from" (a plain return value would be silently
        # ignored by Scrapy).
        if response.status != 200:
            yield from self.handle_error(response)
            return

        # Normal processing
        self.stats['success'] += 1

        for product in response.css('.product'):
            href = product.css('a::attr(href)').get()
            if href:
                yield response.follow(
                    href,
                    callback=self.parse_product,
                    errback=self.handle_failure
                )

    def parse_product(self, response):
        # Check status (same yield from / return pattern as parse)
        if response.status != 200:
            yield from self.handle_error(response)
            return

        # Scrape product
        yield {
            'url': response.url,
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'status': 'active'
        }

    def handle_error(self, response):
        """Handle non-200 responses"""

        if response.status == 404:
            self.stats['not_found'] += 1
            self.logger.warning(f'404 Not Found: {response.url}')
            yield {
                'url': response.url,
                'status': 'deleted',
                'error': '404'
            }

        elif response.status == 403:
            self.stats['forbidden'] += 1
            self.logger.error(f'403 Forbidden: {response.url}')
            # Might be blocked, slow down.
            # Note: time.sleep() blocks the whole Twisted reactor, so every
            # in-flight request pauses too; crude, but that is the intent here.
            self.crawler.engine.pause()
            time.sleep(10)
            self.crawler.engine.unpause()

        elif response.status == 429:
            self.stats['rate_limited'] += 1
            self.logger.warning(f'429 Rate Limited: {response.url}')
            # Re-queue the request so it gets retried later
            yield scrapy.Request(
                response.url,
                callback=self.parse_product,
                dont_filter=True,
                priority=0
            )

        elif response.status >= 500:
            self.stats['server_error'] += 1
            self.logger.error(f'{response.status} Server Error: {response.url}')
            # Retry middleware will handle this

    def handle_failure(self, failure):
        """Handle request failures (network errors, etc.)"""
        self.logger.error(f'Request failed: {failure}')

    def closed(self, reason):
        """Log statistics when spider closes"""
        self.logger.info('='*60)
        self.logger.info('SPIDER STATISTICS')
        self.logger.info(f'Success: {self.stats["success"]}')
        self.logger.info(f'Not Found (404): {self.stats["not_found"]}')
        self.logger.info(f'Forbidden (403): {self.stats["forbidden"]}')
        self.logger.info(f'Rate Limited (429): {self.stats["rate_limited"]}')
        self.logger.info(f'Server Errors (5xx): {self.stats["server_error"]}')
        self.logger.info('='*60)

This spider:

  • Handles multiple error codes
  • Tracks statistics (see the note on Scrapy's built-in stats collector below)
  • Slows down when blocked
  • Re-queues rate-limited requests
  • Logs everything properly

Common Mistakes

Mistake #1: Not Checking Status

# WRONG (assumes all responses are 200)
def parse(self, response):
    yield {'title': response.css('h1::text').get()}

# RIGHT (checks status first)
def parse(self, response):
    if response.status == 200:
        yield {'title': response.css('h1::text').get()}
    else:
        self.logger.warning(f'Got status {response.status}')

Mistake #2: Using Inspect Element Instead of View Source

# You see this in Inspect Element
response.css('.dynamically-loaded::text').get()

# Returns None because content isn't in page source!
# Always check view-source: first

Mistake #3: Forgetting to Add Status Code to List

handle_httpstatus_list = [404]

def parse(self, response):
    # This never runs for 500 errors!
    if response.status == 500:
        self.logger.error('Server error')

If you want to handle 500s, add them to the list!
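
The fix is simply to list every status you intend to branch on:

handle_httpstatus_list = [404, 500]  # now both 404s and 500s reach parse()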


Quick Reference

Spider-Level Handling

class MySpider(scrapy.Spider):
    handle_httpstatus_list = [404, 403, 500]

Per-Request Handling

yield scrapy.Request(
    url,
    meta={'handle_httpstatus_list': [404]}
)

Handle All Codes

yield scrapy.Request(
    url,
    meta={'handle_httpstatus_all': True}
)

Settings

# settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 500]

Check Status

def parse(self, response):
    if response.status == 200:
        ...  # success
    elif response.status == 404:
        ...  # not found
    elif response.status >= 500:
        ...  # server error

View Page Source

  • Right-click → "View Page Source"
  • Ctrl+U (Windows/Linux) or Cmd+Option+U (Mac)
  • view-source:https://example.com

Summary

Key takeaways:

  • Scrapy only handles 200-299 responses by default
  • Use handle_httpstatus_list to handle specific codes
  • Use handle_httpstatus_all to handle everything
  • Always check response.status before processing
  • View Page Source, not Inspect Element (critical!)
  • Page source shows what Scrapy sees
  • Inspect Element shows what the browser renders

Start checking status codes in your spiders. View page source before writing selectors. Your scraping life will get much easier.

Happy scraping! 🕷️
