Muhammad Ikramullah Khan

Scrapy Error Handling & Retry Logic: When Things Go Wrong

The first time my spider crashed, I lost 3 hours of scraping. I had scraped 30,000 pages, then hit a bad URL and everything stopped.

No resume. No save point. Just lost data.

I learned the hard way: errors WILL happen. Networks fail. Servers crash. URLs break. Your job is to handle it gracefully.

Let me show you how to make bulletproof spiders that survive anything.


The Types of Errors You'll Face

1. Network Errors

  • Connection timeout
  • DNS failure
  • Connection refused
  • Socket errors

2. HTTP Errors

  • 404 Not Found
  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 429 Too Many Requests

3. Parsing Errors

  • Invalid HTML
  • Missing elements
  • Unexpected structure
  • Encoding issues

4. Pipeline Errors

  • Database connection lost
  • Disk full
  • Permission denied
  • Invalid data

5. Spider Errors

  • Your code has bugs
  • Memory issues
  • Exceptions in callbacks
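
Here's a quick sketch of where each of these error types gets handled in a Scrapy project. The hooks (errback, response.status, try/except in callbacks, process_item in a pipeline) are standard Scrapy; the selectors and the output file name are placeholders. The rest of this post digs into each hook in turn.

import json

import scrapy
from scrapy.exceptions import DropItem


class ErrorMapSpider(scrapy.Spider):
    name = 'error_map'

    def start_requests(self):
        # 1. Network errors (timeout, DNS, connection refused) -> errback
        yield scrapy.Request(
            'https://example.com',
            callback=self.parse,
            errback=self.on_network_error,
        )

    def parse(self, response):
        # 2. HTTP errors -> check response.status
        if response.status != 200:
            self.logger.warning(f'HTTP {response.status} on {response.url}')
            return

        # 3. Parsing errors (and bugs in your own code) -> try/except + logging
        try:
            yield {'title': response.css('h1::text').get(default='').strip()}
        except Exception as exc:
            self.logger.error(f'Parse failed on {response.url}: {exc}')

    def on_network_error(self, failure):
        self.logger.error(f'Network error for {failure.request.url}: {failure.value}')


class StoragePipeline:
    # 4. Pipeline errors (disk, database, bad data) -> handle in process_item
    def process_item(self, item, spider):
        try:
            with open('items.jsonl', 'a', encoding='utf-8') as f:
                f.write(json.dumps(item) + '\n')
        except OSError as exc:
            raise DropItem(f'Storage failed: {exc}')
        return item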

Built-In Retry Middleware

Scrapy automatically retries failed requests!

Default Behavior

# settings.py (these are defaults)

RETRY_ENABLED = True
RETRY_TIMES = 2  # Retry up to 2 times (3 attempts total)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

If a request returns one of these codes, Scrapy automatically retries it.

What the Docs Don't Tell You

Retries are not delayed by default:

The built-in RetryMiddleware simply re-queues the failed request; there is no exponential backoff. A retried request goes back through the scheduler like any other and respects your normal DOWNLOAD_DELAY / AutoThrottle settings. If you want real backoff, you have to add it yourself (see the manual retry section below).

Total attempts = RETRY_TIMES + 1:

RETRY_TIMES = 2
# Means: 1 original + 2 retries = 3 total attempts

404s don't retry by default:

That's intentional. If a page returns 404, it probably doesn't exist. No point retrying.
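
Want to know how often retries actually kick in? The RetryMiddleware records stats under the retry/ prefix (retry/count, retry/max_reached, retry/reason_count/...), which show up in the stats dump at the end of the crawl. A small sketch reading them from the spider when it closes:

import scrapy


class RetryStatsSpider(scrapy.Spider):
    name = 'retry_stats'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        stats = self.crawler.stats
        # Stat keys recorded by the built-in RetryMiddleware
        self.logger.info(f"Retries attempted: {stats.get_value('retry/count', 0)}")
        self.logger.info(f"Given up after max retries: {stats.get_value('retry/max_reached', 0)}")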


Customizing Retry Behavior

Add More Status Codes

# settings.py

RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]
# Added 403 (sometimes temporary blocks)

Increase Retry Attempts

RETRY_TIMES = 5  # Retry up to 5 times
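
Retry behavior can also be tuned per request through request.meta. The dont_retry and max_retry_times keys below are the ones the built-in RetryMiddleware checks:

import scrapy


class RetryMetaSpider(scrapy.Spider):
    name = 'retry_meta'

    def start_requests(self):
        # Skip retries entirely for a request that's allowed to fail
        yield scrapy.Request(
            'https://example.com/optional-page',
            callback=self.parse,
            meta={'dont_retry': True},
        )

        # Give an important request more attempts than the global RETRY_TIMES
        yield scrapy.Request(
            'https://example.com/critical-page',
            callback=self.parse,
            meta={'max_retry_times': 10},
        )

    def parse(self, response):
        yield {'url': response.url}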

Retry Priority

Retried requests get a priority adjustment. The default of -1 means lower priority in Scrapy (higher numbers run sooner), so retries are processed after the requests already waiting in the queue:

RETRY_PRIORITY_ADJUST = -1  # Default: deprioritize retries
# Set a positive value to handle retries sooner

Handling Network Errors (Connection Failures)

Network errors don't return HTTP codes. They're caught differently.

Using errback

import scrapy
from twisted.internet.error import TimeoutError, DNSLookupError, ConnectionRefusedError


class RobustSpider(scrapy.Spider):
    name = 'robust'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            callback=self.parse,
            errback=self.handle_error  # Called on network errors
        )

    def parse(self, response):
        yield {'url': response.url}

    def handle_error(self, failure):
        # Log the error
        self.logger.error(f'Request failed: {failure.value}')
        self.logger.error(f'URL: {failure.request.url}')

        # Check error type
        if failure.check(TimeoutError):
            self.logger.error('Timeout error')
        elif failure.check(DNSLookupError):
            self.logger.error('DNS lookup failed')
        elif failure.check(ConnectionRefusedError):
            self.logger.error('Connection refused')

What Errors Can You Catch?

Common error types:

from twisted.internet.error import (
    TimeoutError,
    DNSLookupError,
    ConnectionRefusedError,
    ConnectionLost,
    TCPTimedOutError
)

def handle_error(self, failure):
    if failure.check(TimeoutError, TCPTimedOutError):
        self.logger.error('Timeout!')
    elif failure.check(DNSLookupError):
        self.logger.error('DNS failed!')
    elif failure.check(ConnectionRefusedError):
        self.logger.error('Connection refused!')
    elif failure.check(ConnectionLost):
        self.logger.error('Connection lost!')
    else:
        self.logger.error(f'Unknown error: {failure.value}')

Retry Manually in errback

Sometimes you want custom retry logic:

def handle_error(self, failure):
    request = failure.request

    # Get current retry count
    retry_count = request.meta.get('retry_count', 0)
    max_retries = 3

    if retry_count < max_retries:
        # Retry with increased count
        self.logger.warning(f'Retrying {request.url} (attempt {retry_count + 1})')

        # Copy the request, bypass the duplicate filter, and deprioritize it
        # so pending requests run first. Don't call time.sleep() here --
        # that blocks the whole crawler, not just this request.
        retry_request = request.replace(
            dont_filter=True,
            priority=request.priority - 1,
        )
        retry_request.meta['retry_count'] = retry_count + 1

        yield retry_request
    else:
        self.logger.error(f'Gave up on {request.url} after {max_retries} retries')
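
If you're on Scrapy 2.5 or newer, you don't have to build the retry request by hand: the retry middleware module exposes a get_retry_request() helper that copies the request, tracks the attempt count, logs, and updates the retry stats for you. A sketch of the same errback using it:

from scrapy.downloadermiddlewares.retry import get_retry_request


def handle_error(self, failure):
    retry_request = get_retry_request(
        failure.request,
        spider=self,
        reason=str(failure.value),
    )
    if retry_request:
        yield retry_request
    else:
        # get_retry_request() returns None once max retries are exhausted
        self.logger.error(f'Gave up on {failure.request.url}')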

Handling HTTP Errors

Check status codes in parse(). One catch: by default, Scrapy's HttpErrorMiddleware filters out non-2xx responses before they ever reach your callback, so to handle them yourself you need to let them through (with the handle_httpstatus_list request meta key, or the HTTPERROR_ALLOWED_CODES setting):

def parse(self, response):
    if response.status != 200:
        self.logger.warning(f'Got status {response.status} from {response.url}')

        if response.status == 404:
            # Page doesn't exist
            yield {'url': response.url, 'status': 'not_found'}
            return

        elif response.status == 403:
            # Forbidden (might be blocked)
            self.logger.error('Got 403, might be blocked!')
            # Back off: pause the engine and sleep. Note this halts the
            # entire crawler for 60 seconds, which is exactly the point here.
            self.crawler.engine.pause()
            import time
            time.sleep(60)
            self.crawler.engine.unpause()
            return

        elif response.status >= 500:
            # Server error. The retry middleware has already retried this
            # request; if it reaches parse(), the retries are used up.
            self.logger.error('Server error, retries exhausted')
            return

    # Normal processing
    yield {
        'url': response.url,
        'title': response.css('h1::text').get()
    }
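
As mentioned above, these branches only run if the response actually reaches parse(). A sketch of letting specific status codes through with the handle_httpstatus_list meta key (HTTPERROR_ALLOWED_CODES in settings does the same thing globally):

def start_requests(self):
    for url in ['https://example.com/page1', 'https://example.com/page2']:
        yield scrapy.Request(
            url,
            callback=self.parse,
            # Pass these statuses to parse() instead of letting
            # HttpErrorMiddleware drop them
            meta={'handle_httpstatus_list': [403, 404, 500, 502, 503]},
        )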

Handling Parsing Errors

Protect against missing elements:

def parse(self, response):
    try:
        # Risky parsing
        name = response.css('h1.product-name::text').get()
        price = response.css('span.price::text').get()

        if not name:
            raise ValueError('Product name not found')

        if not price:
            raise ValueError('Price not found')

        yield {
            'name': name,
            'price': float(price.replace('$', ''))
        }

    except Exception as e:
        self.logger.error(f'Parsing error on {response.url}: {e}')
        # Save problematic URL for later review
        yield {
            'url': response.url,
            'error': str(e),
            'status': 'parse_failed'
        }

Use .get() with Defaults

# Risky: .get() returns None when nothing matches,
# so .strip() raises AttributeError
name = response.css('h1::text').get()
item['name'] = name.strip()

# Safe: provide a default and a fallback value
name = response.css('h1::text').get(default='')
item['name'] = name.strip() if name else 'Unknown'
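
If you clean up fields in several places, it can be worth pulling the logic into small helpers. These function names (get_text, parse_price) are my own, not Scrapy APIs:

def get_text(response, css, default='Unknown'):
    """Extract and strip text for a CSS selector, with a fallback."""
    value = response.css(css).get()
    return value.strip() if value else default


def parse_price(raw, default=0.0):
    """Turn '$1,299.00' into 1299.0, falling back on bad input."""
    try:
        return float(raw.replace('$', '').replace(',', '').strip())
    except (AttributeError, ValueError):
        return default


# Inside parse():
# item['name'] = get_text(response, 'h1.product-name::text')
# item['price'] = parse_price(response.css('span.price::text').get())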

Handling Pipeline Errors

Pipelines can fail too. Handle gracefully:

import sqlite3

from scrapy.exceptions import DropItem

class SafePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('products.db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Risky operation (database insert)
            self.cursor.execute(
                'INSERT INTO products (name, price) VALUES (?, ?)',
                (item['name'], item['price'])
            )
            self.conn.commit()

        except sqlite3.IntegrityError:
            # Duplicate item (unique constraint violated)
            spider.logger.warning(f'Duplicate item: {item["name"]}')
            raise DropItem('Duplicate')

        except sqlite3.OperationalError as e:
            # Database locked or connection lost
            spider.logger.error(f'Database error: {e}')

            # Try to reconnect
            try:
                self.conn.close()
                self.conn = sqlite3.connect('products.db')
                self.cursor = self.conn.cursor()

                # Retry insert
                self.cursor.execute(
                    'INSERT INTO products (name, price) VALUES (?, ?)',
                    (item['name'], item['price'])
                )
                self.conn.commit()

            except Exception as e2:
                spider.logger.error(f'Retry failed: {e2}')
                raise DropItem(f'Database error: {e}')

        return item

Saving Progress (Resume After Crash)

Don't lose work when spider crashes!

Enable Job Directory

# settings.py
JOBDIR = 'crawls/my_spider'

Now run:

scrapy crawl myspider

If it crashes:

scrapy crawl myspider  # Resumes from where it stopped!

What gets saved:

  • Request queue (unvisited URLs)
  • Duplicate filter (visited URLs)
  • Spider state

What doesn't get saved:

  • Scraped items (save those yourself!)
  • Spider instance variables
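
Two related tips. You can set JOBDIR from the command line instead of settings.py, which makes it easy to use one job directory per crawl:

scrapy crawl myspider -s JOBDIR=crawls/myspider-run1

And the "spider state" that does get saved is the self.state dict (provided by Scrapy's SpiderState extension when JOBDIR is set), so counters you keep there survive a restart:

def parse(self, response):
    # self.state is persisted in JOBDIR between runs
    self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
    yield {'url': response.url}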

Manual State Saving

import pickle
import scrapy

class StatefulSpider(scrapy.Spider):
    name = 'stateful'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.scraped_urls = self.load_state()

    def load_state(self):
        try:
            with open('spider_state.pkl', 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return set()

    def save_state(self):
        with open('spider_state.pkl', 'wb') as f:
            pickle.dump(self.scraped_urls, f)

    def parse(self, response):
        self.scraped_urls.add(response.url)

        # Save state every 100 URLs
        if len(self.scraped_urls) % 100 == 0:
            self.save_state()

        yield {'url': response.url}

    def closed(self, reason):
        # Save final state
        self.save_state()

Graceful Shutdown

Handle interruption (Ctrl+C) gracefully. Scrapy already does this out of the box: one Ctrl+C triggers a graceful shutdown (in-flight requests finish), and a second Ctrl+C forces an immediate stop. A custom handler like the one below replaces that default and is only worth it if you need extra bookkeeping on shutdown:

import signal
import scrapy

class GracefulSpider(scrapy.Spider):
    name = 'graceful'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replaces Scrapy's own SIGINT handler
        signal.signal(signal.SIGINT, self.handle_shutdown)
        self.shutting_down = False

    def handle_shutdown(self, signum, frame):
        self.logger.warning('Shutdown signal received, finishing current requests...')
        self.shutting_down = True
        self.crawler.engine.close_spider(self, 'shutdown')

    def parse(self, response):
        if self.shutting_down:
            return  # Don't start new work

        yield {'url': response.url}
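
Another option for stopping cleanly is the built-in CloseSpider extension: it closes the spider gracefully (finishing in-flight requests) once a limit is reached. These are the standard settings:

# settings.py -- stop gracefully when a limit is hit
CLOSESPIDER_TIMEOUT = 3600      # after 1 hour
CLOSESPIDER_ITEMCOUNT = 10000   # or after 10,000 items
CLOSESPIDER_PAGECOUNT = 50000   # or after 50,000 responses
CLOSESPIDER_ERRORCOUNT = 100    # or after 100 errors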

Complete Error-Proof Spider

Here's a production-ready spider with full error handling:

import scrapy
from twisted.internet.error import TimeoutError, DNSLookupError, ConnectionRefusedError

class BulletproofSpider(scrapy.Spider):
    name = 'bulletproof'

    custom_settings = {
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403],
        'HTTPERROR_ALLOW_ALL': True,  # Let non-2xx responses reach parse()
        'JOBDIR': 'crawls/bulletproof',  # Enable resume
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'success': 0,
            'parse_errors': 0,
            'network_errors': 0,
            'http_errors': 0
        }

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']

        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                errback=self.handle_error,
                dont_filter=False
            )

    def parse(self, response):
        # Check HTTP status
        if response.status != 200:
            self.stats['http_errors'] += 1
            self.logger.warning(f'HTTP {response.status}: {response.url}')

            yield {
                'url': response.url,
                'status': response.status,
                'error_type': 'http_error'
            }
            return

        # Try parsing
        try:
            title = response.css('h1::text').get(default='')
            price = response.css('.price::text').get(default='')

            if not title:
                raise ValueError('Title not found')

            # Clean price
            if price:
                price = float(price.replace('$', '').replace(',', ''))
            else:
                price = 0.0

            self.stats['success'] += 1

            yield {
                'url': response.url,
                'title': title.strip(),
                'price': price,
                'status': 'success'
            }

        except Exception as e:
            self.stats['parse_errors'] += 1
            self.logger.error(f'Parse error on {response.url}: {e}', exc_info=True)

            yield {
                'url': response.url,
                'error': str(e),
                'error_type': 'parse_error'
            }

    def handle_error(self, failure):
        self.stats['network_errors'] += 1

        self.logger.error(f'Request failed: {failure.value}')
        self.logger.error(f'URL: {failure.request.url}')

        # Determine error type
        error_type = 'unknown'
        if failure.check(TimeoutError):
            error_type = 'timeout'
        elif failure.check(DNSLookupError):
            error_type = 'dns_error'
        elif failure.check(ConnectionRefusedError):
            error_type = 'connection_refused'

        yield {
            'url': failure.request.url,
            'error': str(failure.value),
            'error_type': error_type
        }

    def closed(self, reason):
        # Log final statistics
        self.logger.info('='*60)
        self.logger.info('SPIDER STATISTICS')
        self.logger.info(f'Successful: {self.stats["success"]}')
        self.logger.info(f'Parse errors: {self.stats["parse_errors"]}')
        self.logger.info(f'Network errors: {self.stats["network_errors"]}')
        self.logger.info(f'HTTP errors: {self.stats["http_errors"]}')
        self.logger.info(f'Close reason: {reason}')
        self.logger.info('='*60)
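
To run it, remember that JOBDIR doesn't save items, so export a feed from the command line (-o appends, -O overwrites):

scrapy crawl bulletproof -o results.jsonl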

Common Mistakes

Mistake #1: Not Using errback

# BAD (network errors unhandled)
yield scrapy.Request(url, callback=self.parse)

# GOOD
yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

Mistake #2: Not Checking Status Codes

# BAD (assumes all responses are 200)
def parse(self, response):
    title = response.css('h1::text').get()
    yield {'title': title}

# GOOD
def parse(self, response):
    if response.status != 200:
        self.logger.error(f'Got status {response.status}')
        return

    title = response.css('h1::text').get()
    yield {'title': title}

Mistake #3: Crashing on Missing Elements

# BAD (crashes if element missing)
price = response.css('.price::text').get()
yield {'price': float(price)}

# GOOD (handles missing elements)
price = response.css('.price::text').get(default='0')
try:
    price = float(price.replace('$', ''))
except ValueError:
    price = 0.0
yield {'price': price}

Quick Reference

Enable Retries

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 429, 403]

Add errback

yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

Handle Errors

def handle_error(self, failure):
    self.logger.error(f'Failed: {failure.value}')
    self.logger.error(f'URL: {failure.request.url}')

Enable Resume

JOBDIR = 'crawls/myspider'

Summary

Errors will happen:

  • Network failures
  • Server errors
  • Parsing issues
  • Pipeline problems

Handle them gracefully:

  • Use retry middleware
  • Add errback for network errors
  • Check status codes
  • Protect against missing elements
  • Handle pipeline exceptions

Save your work:

  • Enable JOBDIR for resume
  • Save state periodically
  • Log errors for review

Best practices:

  • Always use errback
  • Check response.status
  • Use .get(default='')
  • Try/except around risky code
  • Log errors with exc_info=True

Remember:

  • Retries are automatic for HTTP errors
  • errback catches network errors
  • JOBDIR enables resume
  • Logs help debug issues

Build bulletproof spiders from day one. Handle errors early, and you'll save hours of debugging later!

Happy scraping! 🕷️
