The first time my spider crashed, I lost 3 hours of scraping. I had scraped 30,000 pages, then hit a bad URL and everything stopped.
No resume. No save point. Just lost data.
I learned the hard way: errors WILL happen. Networks fail. Servers crash. URLs break. Your job is to handle it gracefully.
Let me show you how to make bulletproof spiders that survive anything.
The Types of Errors You'll Face
1. Network Errors
- Connection timeout
- DNS failure
- Connection refused
- Socket errors
2. HTTP Errors
- 404 Not Found
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable
- 429 Too Many Requests
3. Parsing Errors
- Invalid HTML
- Missing elements
- Unexpected structure
- Encoding issues
4. Pipeline Errors
- Database connection lost
- Disk full
- Permission denied
- Invalid data
5. Spider Errors
- Your code has bugs
- Memory issues
- Exceptions in callbacks
Built-In Retry Middleware
Scrapy automatically retries failed requests!
Default Behavior
# settings.py (these are defaults)
RETRY_ENABLED = True
RETRY_TIMES = 2 # Retry up to 2 times (3 attempts total)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
If a request returns one of these codes, Scrapy automatically retries it.
What the Docs Don't Tell You
Retries don't add their own delay:
The built-in RetryMiddleware has no exponential backoff. A retried request simply goes back into the scheduler (deprioritized by default, see below) and is paced by whatever DOWNLOAD_DELAY or AutoThrottle settings you already have. If you want real spacing between attempts, you configure that yourself; see the sketch below.
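A minimal settings sketch for pacing retries this way. The setting names are Scrapy's own; the values are just examples to tune for your target:
# settings.py -- example values
DOWNLOAD_DELAY = 1                 # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True        # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 30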
Total attempts = RETRY_TIMES + 1:
RETRY_TIMES = 2
# Means: 1 original + 2 retries = 3 total attempts
404s don't retry by default:
That's intentional. If a page returns 404, it probably doesn't exist. No point retrying.
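If you're certain some 404s are transient on your target (a judgment call, e.g. an origin behind a flaky CDN), you can opt in explicitly; a minimal sketch:
# settings.py -- opt-in only: also retry 404s
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 404]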
Customizing Retry Behavior
Add More Status Codes
# settings.py
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]
# Added 403 (sometimes temporary blocks)
Increase Retry Attempts
RETRY_TIMES = 5 # Retry up to 5 times
Retry Priority
Retried requests are scheduled with lower priority by default, so fresh requests go first:
RETRY_PRIORITY_ADJUST = -1  # default; negative = lower priority (the retry waits its turn)
# Set a positive value if you want retries to jump ahead of the queue instead
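You can also tune retries per request instead of globally. The dont_retry and max_retry_times keys in request.meta are understood by the built-in RetryMiddleware; a sketch inside a spider's start_requests() (the URLs are hypothetical):
def start_requests(self):
    # Give one request a bigger retry budget than the global RETRY_TIMES
    yield scrapy.Request(
        'https://example.com/flaky-page',   # hypothetical URL
        callback=self.parse,
        meta={'max_retry_times': 6},
    )
    # And never retry this one, even if it returns a retryable status
    yield scrapy.Request(
        'https://example.com/best-effort',  # hypothetical URL
        callback=self.parse,
        meta={'dont_retry': True},
    )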
Handling Network Errors (Connection Failures)
Network errors don't return HTTP codes. They're caught differently.
Using errback
import scrapy
from twisted.internet.error import TimeoutError, DNSLookupError, ConnectionRefusedError

class RobustSpider(scrapy.Spider):
name = 'robust'
def start_requests(self):
yield scrapy.Request(
'https://example.com',
callback=self.parse,
errback=self.handle_error # Called on network errors
)
def parse(self, response):
yield {'url': response.url}
def handle_error(self, failure):
# Log the error
self.logger.error(f'Request failed: {failure.value}')
self.logger.error(f'URL: {failure.request.url}')
# Check error type
if failure.check(TimeoutError):
self.logger.error('Timeout error')
elif failure.check(DNSLookupError):
self.logger.error('DNS lookup failed')
elif failure.check(ConnectionRefusedError):
self.logger.error('Connection refused')
What Errors Can You Catch?
Common error types:
from twisted.internet.error import (
TimeoutError,
DNSLookupError,
ConnectionRefusedError,
ConnectionLost,
TCPTimedOutError
)
def handle_error(self, failure):
if failure.check(TimeoutError, TCPTimedOutError):
self.logger.error('Timeout!')
elif failure.check(DNSLookupError):
self.logger.error('DNS failed!')
elif failure.check(ConnectionRefusedError):
self.logger.error('Connection refused!')
elif failure.check(ConnectionLost):
self.logger.error('Connection lost!')
else:
self.logger.error(f'Unknown error: {failure.value}')
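One more failure type worth checking for: when HttpErrorMiddleware rejects a non-200 response, that rejection also lands in your errback as an HttpError, and the original response is still attached to it. A short sketch of an errback that handles this case:
from scrapy.spidermiddlewares.httperror import HttpError

def handle_error(self, failure):
    if failure.check(HttpError):
        # HttpErrorMiddleware attaches the rejected response to the failure
        response = failure.value.response
        self.logger.error(f'HttpError {response.status} on {response.url}')
    else:
        self.logger.error(f'Request failed: {failure.value}')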
Retry Manually in errback
Sometimes you want custom retry logic:
def handle_error(self, failure):
    request = failure.request
    # Get the current retry count from request.meta
    retry_count = request.meta.get('retry_count', 0)
    max_retries = 3
    if retry_count < max_retries:
        self.logger.warning(f'Retrying {request.url} (attempt {retry_count + 1})')
        # dont_filter=True lets the duplicate URL through the dupefilter;
        # lowering the priority pushes the retry behind fresh requests
        retry_request = request.replace(dont_filter=True, priority=request.priority - 1)
        retry_request.meta['retry_count'] = retry_count + 1
        # Don't call time.sleep() here: it blocks the entire Twisted reactor.
        # Let DOWNLOAD_DELAY / AutoThrottle pace the retry instead.
        yield retry_request
    else:
        self.logger.error(f'Gave up on {request.url} after {max_retries} retries')
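On Scrapy 2.5+ there's a helper that does this bookkeeping for you: scrapy.downloadermiddlewares.retry.get_retry_request(). It honours RETRY_TIMES (or max_retry_times in meta) and updates the retry stats. A sketch of using it in an errback:
from scrapy.downloadermiddlewares.retry import get_retry_request

def handle_error(self, failure):
    # Returns a fresh copy of the request, or None once the retry budget is spent
    new_request = get_retry_request(
        failure.request,
        spider=self,
        reason='network error',
    )
    if new_request:
        yield new_request
    else:
        self.logger.error(f'Gave up on {failure.request.url}')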
Handling HTTP Errors
Check status codes in parse(). One catch: by default Scrapy's HttpErrorMiddleware drops non-2xx responses before they ever reach your callback, so you have to allow the codes you want to handle, either per request with handle_httpstatus_list or globally with the HTTPERROR_ALLOWED_CODES setting (a per-request sketch follows the example):
def parse(self, response):
    if response.status != 200:
        self.logger.warning(f'Got status {response.status} from {response.url}')
        if response.status == 404:
            # Page doesn't exist
            yield {'url': response.url, 'status': 'not_found'}
            return
        elif response.status == 403:
            # Forbidden (might be blocked)
            self.logger.error('Got 403, might be blocked!')
            # Don't time.sleep() here: it freezes the whole Twisted reactor,
            # not just this request. Slow the crawl down with DOWNLOAD_DELAY /
            # AutoThrottle and investigate instead.
            return
        elif response.status >= 500:
            # Server error (retry handled automatically by RetryMiddleware)
            self.logger.error('Server error, will retry')
            return
    # Normal processing
    yield {
        'url': response.url,
        'title': response.css('h1::text').get()
    }
Handling Parsing Errors
Protect against missing elements:
def parse(self, response):
try:
# Risky parsing
name = response.css('h1.product-name::text').get()
price = response.css('span.price::text').get()
if not name:
raise ValueError('Product name not found')
if not price:
raise ValueError('Price not found')
yield {
'name': name,
'price': float(price.replace('$', ''))
}
except Exception as e:
self.logger.error(f'Parsing error on {response.url}: {e}')
# Save problematic URL for later review
yield {
'url': response.url,
'error': str(e),
'status': 'parse_failed'
}
Use .get() with Defaults
# Instead of this (works, but .get() returns None when the element is missing, so dropping the check means .strip() crashes)
name = response.css('h1::text').get()
if name:
item['name'] = name.strip()
# Do this (safe)
name = response.css('h1::text').get(default='')
item['name'] = name.strip() if name else 'Unknown'
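If you find yourself repeating the same guard for every field, a tiny helper keeps parse() readable. This is just a convenience sketch, and the safe_css name is mine, not a Scrapy API:
def safe_css(response, selector, default='Unknown'):
    # Extract the first match, strip whitespace, or fall back to a default
    value = response.css(selector).get()
    return value.strip() if value else default

# item['name'] = safe_css(response, 'h1.product-name::text')
# item['price'] = safe_css(response, 'span.price::text', default='0')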
Handling Pipeline Errors
Pipelines can fail too. Handle gracefully:
import sqlite3

from scrapy.exceptions import DropItem

class SafePipeline:
    def open_spider(self, spider):
        # Open the SQLite connection once per spider run
        self.conn = sqlite3.connect('products.db')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
try:
# Risky operation (database insert)
self.cursor.execute(
'INSERT INTO products (name, price) VALUES (?, ?)',
(item['name'], item['price'])
)
self.conn.commit()
except sqlite3.IntegrityError:
# Duplicate item (unique constraint violated)
spider.logger.warning(f'Duplicate item: {item["name"]}')
raise DropItem('Duplicate')
except sqlite3.OperationalError as e:
# Database locked or connection lost
spider.logger.error(f'Database error: {e}')
# Try to reconnect
try:
self.conn.close()
self.conn = sqlite3.connect('products.db')
self.cursor = self.conn.cursor()
# Retry insert
self.cursor.execute(
'INSERT INTO products (name, price) VALUES (?, ?)',
(item['name'], item['price'])
)
self.conn.commit()
except Exception as e2:
spider.logger.error(f'Retry failed: {e2}')
raise DropItem(f'Database error: {e}')
return item
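Since the spiders above yield special items carrying an 'error' field, a small companion pipeline can divert those into their own file for later review instead of mixing them into your clean data. A sketch; the failed_items.jl filename and the class name are just examples:
import json

from scrapy.exceptions import DropItem

class FailedItemsPipeline:
    def open_spider(self, spider):
        self.failed_file = open('failed_items.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.failed_file.close()

    def process_item(self, item, spider):
        if item.get('error') or item.get('status') == 'parse_failed':
            # One JSON object per line, easy to re-crawl later
            self.failed_file.write(json.dumps(dict(item)) + '\n')
            raise DropItem('Failed item diverted to failed_items.jl')
        return item

Remember to register it in ITEM_PIPELINES alongside your main pipeline.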
Saving Progress (Resume After Crash)
Don't lose work when spider crashes!
Enable Job Directory
# settings.py
JOBDIR = 'crawls/my_spider'
Now run:
scrapy crawl myspider
If it crashes:
scrapy crawl myspider # Resumes from where it stopped!
What gets saved:
- Request queue (unvisited URLs)
- Duplicate filter (visited URLs)
- Spider state (the spider.state dict; see the sketch after this list)
What doesn't get saved:
- Scraped items (save those yourself!)
- Spider instance variables
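The "spider state" that does get saved is the spider.state dict: with JOBDIR enabled, Scrapy pickles self.state between runs, so counters kept there survive a restart, unlike ordinary instance attributes. A minimal sketch (the spider name is just for illustration):
import scrapy

class ResumableSpider(scrapy.Spider):
    name = 'resumable'

    def parse(self, response):
        # self.state is persisted to JOBDIR and restored on the next run;
        # plain attributes like self.counter are not
        self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
        yield {'url': response.url}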
Manual State Saving
import pickle
class StatefulSpider(scrapy.Spider):
name = 'stateful'
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.scraped_urls = self.load_state()
def load_state(self):
try:
with open('spider_state.pkl', 'rb') as f:
return pickle.load(f)
except FileNotFoundError:
return set()
def save_state(self):
with open('spider_state.pkl', 'wb') as f:
pickle.dump(self.scraped_urls, f)
def parse(self, response):
self.scraped_urls.add(response.url)
# Save state every 100 URLs
if len(self.scraped_urls) % 100 == 0:
self.save_state()
yield {'url': response.url}
def closed(self, reason):
# Save final state
self.save_state()
Graceful Shutdown
Handle interruption (Ctrl+C) gracefully. Note that Scrapy already does this out of the box: the first Ctrl+C starts a graceful shutdown (in-flight requests finish), and a second Ctrl+C forces an immediate stop. Only install your own handler like this if you need extra behaviour on top, because it replaces Scrapy's default one:
import signal
class GracefulSpider(scrapy.Spider):
name = 'graceful'
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
signal.signal(signal.SIGINT, self.handle_shutdown)
self.shutting_down = False
def handle_shutdown(self, signum, frame):
self.logger.warning('Shutdown signal received, finishing current requests...')
self.shutting_down = True
self.crawler.engine.close_spider(self, 'shutdown')
def parse(self, response):
if self.shutting_down:
return # Don't start new work
yield {'url': response.url}
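Related to shutting down cleanly: the built-in CloseSpider extension can stop the crawl for you when error counts or runtime cross a threshold, which often replaces hand-rolled shutdown logic. The setting names are Scrapy's; the values here are just examples:
# settings.py -- example thresholds
CLOSESPIDER_TIMEOUT = 3600      # stop after an hour
CLOSESPIDER_ERRORCOUNT = 50     # stop after 50 errors
CLOSESPIDER_PAGECOUNT = 100000  # stop after 100k responses
CLOSESPIDER_ITEMCOUNT = 50000   # stop after 50k items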
Complete Error-Proof Spider
Here's a production-ready spider with full error handling:
import scrapy
from twisted.internet.error import TimeoutError, DNSLookupError, ConnectionRefusedError
class BulletproofSpider(scrapy.Spider):
name = 'bulletproof'
custom_settings = {
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 5,
    'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403],
    'HTTPERROR_ALLOWED_CODES': [404, 403],  # let these statuses reach parse() instead of being dropped
    'JOBDIR': 'crawls/bulletproof',  # enable resume
}
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.stats = {
'success': 0,
'parse_errors': 0,
'network_errors': 0,
'http_errors': 0
}
def start_requests(self):
urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
yield scrapy.Request(
url,
callback=self.parse,
errback=self.handle_error,
dont_filter=False
)
def parse(self, response):
# Check HTTP status
if response.status != 200:
self.stats['http_errors'] += 1
self.logger.warning(f'HTTP {response.status}: {response.url}')
yield {
'url': response.url,
'status': response.status,
'error_type': 'http_error'
}
return
# Try parsing
try:
title = response.css('h1::text').get(default='')
price = response.css('.price::text').get(default='')
if not title:
raise ValueError('Title not found')
# Clean price
if price:
price = float(price.replace('$', '').replace(',', ''))
else:
price = 0.0
self.stats['success'] += 1
yield {
'url': response.url,
'title': title.strip(),
'price': price,
'status': 'success'
}
except Exception as e:
self.stats['parse_errors'] += 1
self.logger.error(f'Parse error on {response.url}: {e}', exc_info=True)
yield {
'url': response.url,
'error': str(e),
'error_type': 'parse_error'
}
def handle_error(self, failure):
self.stats['network_errors'] += 1
self.logger.error(f'Request failed: {failure.value}')
self.logger.error(f'URL: {failure.request.url}')
# Determine error type
error_type = 'unknown'
if failure.check(TimeoutError):
error_type = 'timeout'
elif failure.check(DNSLookupError):
error_type = 'dns_error'
elif failure.check(ConnectionRefusedError):
error_type = 'connection_refused'
yield {
'url': failure.request.url,
'error': str(failure.value),
'error_type': error_type
}
def closed(self, reason):
# Log final statistics
self.logger.info('='*60)
self.logger.info('SPIDER STATISTICS')
self.logger.info(f'Successful: {self.stats["success"]}')
self.logger.info(f'Parse errors: {self.stats["parse_errors"]}')
self.logger.info(f'Network errors: {self.stats["network_errors"]}')
self.logger.info(f'HTTP errors: {self.stats["http_errors"]}')
self.logger.info(f'Close reason: {reason}')
self.logger.info('='*60)
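Instead of a hand-rolled stats dict, you can also lean on Scrapy's built-in stats collector, which is logged automatically when the crawl ends. A sketch of the same counters with it (the counter names are my own):
# Inside any spider method, e.g. parse() or handle_error():
self.crawler.stats.inc_value('bulletproof/success')         # created on first use, incremented after
self.crawler.stats.inc_value('bulletproof/network_errors')
# Everything collected this way shows up in the "Dumping Scrapy stats"
# block that Scrapy logs when the spider closes.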
Common Mistakes
Mistake #1: Not Using errback
# BAD (network errors unhandled)
yield scrapy.Request(url, callback=self.parse)
# GOOD
yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)
Mistake #2: Not Checking Status Codes
# BAD (assumes all responses are 200)
def parse(self, response):
title = response.css('h1::text').get()
yield {'title': title}
# GOOD
def parse(self, response):
if response.status != 200:
self.logger.error(f'Got status {response.status}')
return
title = response.css('h1::text').get()
yield {'title': title}
Mistake #3: Crashing on Missing Elements
# BAD (crashes if element missing)
price = response.css('.price::text').get()
yield {'price': float(price)}
# GOOD (handles missing elements)
price = response.css('.price::text').get(default='0')
try:
price = float(price.replace('$', ''))
except ValueError:
price = 0.0
yield {'price': price}
Quick Reference
Enable Retries
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 429, 403]
Add errback
yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)
Handle Errors
def handle_error(self, failure):
self.logger.error(f'Failed: {failure.value}')
self.logger.error(f'URL: {failure.request.url}')
Enable Resume
JOBDIR = 'crawls/myspider'
Summary
Errors will happen:
- Network failures
- Server errors
- Parsing issues
- Pipeline problems
Handle them gracefully:
- Use retry middleware
- Add errback for network errors
- Check status codes
- Protect against missing elements
- Handle pipeline exceptions
Save your work:
- Enable JOBDIR for resume
- Save state periodically
- Log errors for review
Best practices:
- Always use errback
- Check response.status
- Use .get(default='')
- Try/except around risky code
- Log errors with exc_info=True
Remember:
- Retries are automatic for HTTP errors
- errback catches network errors
- JOBDIR enables resume
- Logs help debug issues
Build bulletproof spiders from day one. Handle errors early, and you'll save hours of debugging later!
Happy scraping! 🕷️