I once spent four hours debugging a spider that returned empty results. Four hours! I checked my selectors a hundred times. They looked perfect in the browser.
Finally, I used Scrapy shell and discovered the problem in 2 minutes: the website was serving different HTML to bots than to browsers.
Proper debugging tools turn hours of frustration into minutes of problem-solving. Let me show you every debugging technique that actually works.
The Problem: Why Debugging Scrapy Is Hard
Regular Python debugging:
- Run code
- See error
- Fix it
- Done
Scrapy debugging:
- Asynchronous execution
- Network issues
- HTML parsing
- JavaScript rendering
- Multiple components (spider, middlewares, pipelines)
- Hard to reproduce issues
You need better tools.
Tool 1: Scrapy Shell (Your Best Friend)
Scrapy shell is an interactive console for testing selectors and requests.
Basic Usage
scrapy shell "https://example.com"
Now you can test selectors interactively:
>>> response.css('h1::text').get()
'Welcome to Example.com'
>>> response.css('.product-name::text').getall()
['Product 1', 'Product 2', 'Product 3']
>>> len(response.css('.product'))
10
What the Docs Don't Tell You
You can test your spider's parse method:
>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>>
>>> # Test parse method
>>> items = list(spider.parse(response))
>>> len(items)
50
>>> items[0]
{'name': 'Product 1', 'price': 29.99}
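One more shell shortcut worth knowing: when you launch scrapy shell from inside your project, it also exposes a spider object bound to the spider that matches the URL (run shelp() to list everything available), so you can often skip the import above. A quick illustration, assuming the same MySpider as before:
>>> spider
<MySpider 'myspider' at 0x...>
>>> items = list(spider.parse(response))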
You can make new requests:
>>> new_request = scrapy.Request('https://example.com/page2')
>>> fetch(new_request)  # fetch() returns nothing; it replaces the shell's `response` object
>>> response.css('h1::text').get()
'Page 2'
You can test with different user agents:
scrapy shell -s USER_AGENT="Mozilla/5.0 iPhone" "https://example.com"
You can inspect response body:
>>> print(response.text[:500]) # First 500 characters
>>>
>>> # Save to file for inspection
>>> with open('response.html', 'w') as f:
...     f.write(response.text)
Tool 2: Scrapy Parse Command
Test your spider without running it fully.
Basic Usage
scrapy parse --spider=myspider https://example.com
Shows what your spider would extract from that URL.
Advanced Options
# Dump the scraped items to a file
scrapy parse --spider=myspider --output=items.json https://example.com
# Use specific callback
scrapy parse --spider=myspider --callback=parse_product https://example.com/product/123
# Follow links (depth)
scrapy parse --spider=myspider --depth=2 https://example.com
# Show requests and items
scrapy parse --spider=myspider --verbose https://example.com
What the Docs Don't Tell You
Test with custom settings:
scrapy parse --spider=myspider -s DOWNLOAD_DELAY=0 -s LOG_LEVEL=DEBUG https://example.com
Save output for comparison:
scrapy parse --spider=myspider https://example.com > output1.txt
# Make changes to spider
scrapy parse --spider=myspider https://example.com > output2.txt
diff output1.txt output2.txt
Tool 3: Logging (Debug Like a Pro)
Strategic logging shows exactly what's happening.
Basic Logging
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        self.logger.info(f'Processing: {response.url}')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            name = product.css('h2::text').get()
            self.logger.debug(f'Scraping: {name}')

            if not name:
                self.logger.warning(f'Product missing name at {response.url}')
                continue

            yield {'name': name}
Log Levels
self.logger.debug('Detailed debugging info') # Only in DEBUG mode
self.logger.info('General information') # Normal operation
self.logger.warning('Something unexpected') # Potential issues
self.logger.error('Something broke') # Errors that don't stop spider
self.logger.critical('Everything is on fire') # Critical failures
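You control which of these actually show up (and where they go) with Scrapy's logging settings. A minimal sketch using the LOG_LEVEL and LOG_FILE settings:
# settings.py
LOG_LEVEL = 'DEBUG'      # show everything, including self.logger.debug() messages
LOG_FILE = 'spider.log'  # write the log to a file instead of the console
You can also override these per run, e.g. scrapy crawl myspider -s LOG_LEVEL=INFO.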
Advanced Logging Tricks
Log selector results:
products = response.css('.product')
self.logger.info(f'Selector ".product" found {len(products)} elements')

if not products:
    self.logger.error(f'No products found! URL: {response.url}')
    self.logger.error(f'Response status: {response.status}')
    self.logger.error(f'Response length: {len(response.body)} bytes')

    # Log first 500 chars of response
    self.logger.debug(f'Response preview: {response.text[:500]}')
Log with exception details:
try:
    price = float(product.css('.price::text').get())
except Exception as e:
    self.logger.error(f'Price parsing failed: {e}', exc_info=True)
    # exc_info=True adds the full stack trace
Conditional logging:
def parse(self, response):
    products = response.css('.product')

    # Only log if something's wrong
    if len(products) == 0:
        self.logger.error(f'Zero products found at {response.url}')

        # Save problematic page
        with open(f'error_{response.url.split("/")[-1]}.html', 'w') as f:
            f.write(response.text)
Tool 4: Scrapy Check (Built-in Tests)
Run built-in validation on your spider:
scrapy check myspider
It runs the contracts defined in your spider callbacks' docstrings and checks for:
- Wrong item/request counts (from @returns)
- Missing fields in scraped items (from @scrapes)
- Exceptions raised while the callback parses the contract's test URL
Add Contracts to Your Spider
class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        """
        @url https://example.com
        @returns items 10 20
        @scrapes name price
        """
        # This contract says:
        # - When parsing example.com
        # - Should return between 10-20 items
        # - Each item should have 'name' and 'price' fields
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
Run contracts:
scrapy check myspider
Tool 5: View Response in Browser
See exactly what Scrapy downloaded:
def parse(self, response):
    # Open the downloaded response in your default browser
    from scrapy.utils.response import open_in_browser
    open_in_browser(response)

    # Continue parsing
    yield {'url': response.url}
When to use this:
- Suspecting JavaScript content
- Checking what HTML Scrapy actually sees
- Comparing browser view vs Scrapy view
Tool 6: Save Response for Offline Testing
Save problematic pages for debugging:
def parse(self, response):
    # Save response to file
    filename = f'debug_{response.url.split("/")[-1]}.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.logger.info(f'Saved response to {filename}')
Then test offline:
scrapy shell file:///path/to/debug_page.html
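You can also wrap the saved file in an HtmlResponse and feed it straight to your callback from a plain Python script or unit test, with no network involved. A minimal sketch, assuming the saved debug_page.html and the MySpider import path used in the earlier examples:
from pathlib import Path

from scrapy.http import HtmlResponse

from myproject.spiders.myspider import MySpider  # hypothetical project path from the examples above

# Rebuild a response object from the saved HTML
body = Path('debug_page.html').read_bytes()
response = HtmlResponse(url='https://example.com', body=body, encoding='utf-8')

# Run the callback offline
spider = MySpider()
items = list(spider.parse(response))
print(f'{len(items)} items extracted offline')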
Tool 7: Breakpoints (IPython Debugger)
Add breakpoints in your spider:
def parse(self, response):
    products = response.css('.product')

    # Add breakpoint
    import ipdb; ipdb.set_trace()
    # Or: import pdb; pdb.set_trace()

    for product in products:
        yield {'name': product.css('h2::text').get()}
When spider hits breakpoint, you get interactive shell:
ipdb> len(products)
10
ipdb> products[0].css('h2::text').get()
'Product Name'
ipdb> c # continue execution
Install IPython debugger:
pip install ipdb
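On Python 3.7+ the built-in breakpoint() call works too: by default it starts pdb, and setting the PYTHONBREAKPOINT environment variable to ipdb.set_trace routes it to ipdb instead. A small sketch:
def parse(self, response):
    products = response.css('.product')
    breakpoint()  # pdb by default; honours the PYTHONBREAKPOINT environment variable
    for product in products:
        yield {'name': product.css('h2::text').get()}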
Tool 8: Scrapy Stats (Built-in Metrics)
Scrapy tracks everything automatically:
def closed(self, reason):
    stats = self.crawler.stats.get_stats()

    self.logger.info('=' * 60)
    self.logger.info('SPIDER STATISTICS')
    self.logger.info(f'Items scraped: {stats.get("item_scraped_count", 0)}')
    self.logger.info(f'Items dropped: {stats.get("item_dropped_count", 0)}')
    self.logger.info(f'Requests made: {stats.get("downloader/request_count", 0)}')
    self.logger.info(f'Response 200: {stats.get("downloader/response_status_count/200", 0)}')
    self.logger.info(f'Response 404: {stats.get("downloader/response_status_count/404", 0)}')
    self.logger.info(f'Response 500: {stats.get("downloader/response_status_count/500", 0)}')
    self.logger.info('=' * 60)
Useful stats:
- item_scraped_count - Items yielded
- item_dropped_count - Items dropped by pipelines
- downloader/request_count - Total requests
- downloader/response_status_count/XXX - Responses by status code
- response_received_count - Responses received
- scheduler/enqueued - Requests queued
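The same stats collector accepts custom counters, which is handy for counting specific problems without grepping logs. A small sketch, using a made-up custom/missing_name key:
def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        if not name:
            # Custom counter; appears in the final stats dump and in get_stats()
            self.crawler.stats.inc_value('custom/missing_name')
            continue
        yield {'name': name}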
Tool 9: Custom Debugging Middleware
Add middleware to log everything:
# middlewares.py
class DebugMiddleware:
    def process_request(self, request, spider):
        spider.logger.debug(f'[REQUEST] {request.method} {request.url}')
        spider.logger.debug(f'[HEADERS] {dict(request.headers)}')
        return None

    def process_response(self, request, response, spider):
        spider.logger.debug(f'[RESPONSE] {response.status} {response.url}')
        spider.logger.debug(f'[LENGTH] {len(response.body)} bytes')
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.error(f'[EXCEPTION] {request.url}: {exception}')
        return None
Enable it:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DebugMiddleware': 100,
}
Tool 10: Compare Scraped Data
Save scraped data and compare between runs:
# First run
scrapy crawl myspider -o run1.json
# Make changes
# Second run
scrapy crawl myspider -o run2.json
# Compare
diff run1.json run2.json
Or use Python:
import json
with open('run1.json') as f:
    data1 = json.load(f)
with open('run2.json') as f:
    data2 = json.load(f)
# Compare counts
print(f'Run 1: {len(data1)} items')
print(f'Run 2: {len(data2)} items')
# Find differences
urls1 = {item['url'] for item in data1}
urls2 = {item['url'] for item in data2}
missing = urls1 - urls2
print(f'Missing in run 2: {missing}')
new = urls2 - urls1
print(f'New in run 2: {new}')
Common Debugging Scenarios
Scenario 1: Selector Returns None
Problem: response.css('.product::text').get() returns None
Debug:
scrapy shell "https://example.com"
>>> response.css('.product')
[] # Empty! Selector is wrong
>>> # Check what's actually there
>>> response.css('*').getall()[:10] # First 10 elements
>>> # View page source
>>> print(response.text[:1000])
>>> # Or open in browser
>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)
Common causes:
- Typo in selector
- JavaScript-loaded content (see the quick check below)
- Wrong HTML structure
- Case sensitivity
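A quick way to rule JavaScript in or out is to check whether text you can see in the browser exists anywhere in the raw HTML Scrapy received. A shell sketch (the outputs shown are illustrative):
>>> 'Product 1' in response.text
False   # the text isn't in the raw HTML, so it's probably rendered by JavaScript
>>> response.xpath('//script[contains(text(), "Product")]').getall()[:1]
['<script>window.__DATA__ = {"products": [...]}</script>']   # or the data is embedded as JSON in a <script> tag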
Scenario 2: Spider Returns No Items
Problem: Spider runs but yields nothing
Debug:
def parse(self, response):
    self.logger.info(f'Parse called for: {response.url}')

    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    if not products:
        self.logger.error('No products found!')
        self.logger.error(f'Response status: {response.status}')
        self.logger.error(f'Response length: {len(response.text)}')

        # Save for inspection
        with open('debug.html', 'w') as f:
            f.write(response.text)

    for product in products:
        item = {
            'name': product.css('h2::text').get()
        }
        self.logger.info(f'Yielding: {item}')
        yield item
Scenario 3: Pipeline Drops All Items
Problem: Items scraped but not in output
Debug:
# pipelines.py
from scrapy.exceptions import DropItem

class DebugPipeline:
    def process_item(self, item, spider):
        spider.logger.info(f'Pipeline received: {item}')

        # Check if item is being dropped
        if not item.get('name'):
            spider.logger.warning('Item missing name, dropping')
            raise DropItem('Missing name')

        spider.logger.info(f'Pipeline passed: {item}')
        return item
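Don't forget to enable it. A sketch, assuming the pipeline lives in myproject/pipelines.py and using a low order number so it runs before your other pipelines:
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.DebugPipeline': 100,
}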
Check stats:
def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    scraped = stats.get('item_scraped_count', 0)
    dropped = stats.get('item_dropped_count', 0)
    self.logger.info(f'Scraped: {scraped}, Dropped: {dropped}')

    if dropped > scraped:
        self.logger.error('More items dropped than scraped!')
Scenario 4: Slow Spider
Problem: Spider is too slow
Debug:
from datetime import datetime

import scrapy

class TimedSpider(scrapy.Spider):
    name = 'timed'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.request_count = 0
        self.parse_times = []

    def parse(self, response):
        parse_start = datetime.now()

        # Your parsing logic
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        parse_duration = (datetime.now() - parse_start).total_seconds()
        self.parse_times.append(parse_duration)
        self.request_count += 1

        if self.request_count % 100 == 0:
            avg_parse_time = sum(self.parse_times) / len(self.parse_times)
            self.logger.info(f'Average parse time: {avg_parse_time:.3f}s')

    def closed(self, reason):
        total_time = (datetime.now() - self.start_time).total_seconds()
        self.logger.info(f'Total time: {total_time:.1f}s')
        self.logger.info(f'Requests: {self.request_count}')
        self.logger.info(f'Speed: {self.request_count/total_time:.1f} req/s')
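Before writing custom timing code, check the throughput numbers Scrapy already logs: the built-in LogStats extension periodically prints pages/min and items/min, and you can tighten the interval. A minimal settings sketch:
# settings.py
LOGSTATS_INTERVAL = 30.0  # log crawl/scrape rates every 30 seconds (default is 60)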
Debugging Checklist
When spider doesn't work:
1. Check response in shell:
scrapy shell "https://example.com"
2. Test your selectors:
>>> response.css('.product').getall()
3. Check response status:
>>> response.status
200
4. View what Scrapy sees:
>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)
5. Test parse method:
>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>> items = list(spider.parse(response))
>>> len(items)
6. Check logs:
scrapy crawl myspider --loglevel=DEBUG
7. Save problematic response:
with open('debug.html', 'w') as f:
    f.write(response.text)
Quick Reference
Scrapy Shell
scrapy shell "https://example.com"
scrapy shell -s USER_AGENT="iPhone" "https://example.com"
scrapy shell file:///path/to/page.html
Scrapy Parse
scrapy parse --spider=myspider https://example.com
scrapy parse --spider=myspider --callback=parse_product URL
scrapy parse --spider=myspider --depth=2 URL
Logging
self.logger.debug('Debug info')
self.logger.info('Information')
self.logger.warning('Warning')
self.logger.error('Error', exc_info=True) # Include traceback
Breakpoint
import ipdb; ipdb.set_trace()
Save Response
from scrapy.utils.response import open_in_browser
open_in_browser(response)
with open('debug.html', 'w') as f:
    f.write(response.text)
Summary
Essential debugging tools:
- Scrapy shell - Test selectors interactively
- Scrapy parse - Test spider without full run
- Logging - Strategic info/debug messages
- Breakpoints - Pause execution and inspect
- Save responses - Debug offline
When selector returns None:
- Test in scrapy shell
- Check if JavaScript-loaded
- View page source (Ctrl+U)
- Try different selectors
When spider yields nothing:
- Add logging at each step
- Check parse() is being called
- Verify selectors find elements
- Check pipeline isn't dropping items
For slow spiders:
- Log timing information
- Profile with cProfile
- Check network vs parsing time
- Optimize bottlenecks
Remember:
- Test selectors in shell first
- Log strategically, not excessively
- Save problematic pages for offline testing
- Use breakpoints for complex issues
Start debugging with Scrapy shell. It solves 80% of problems in minutes!
Happy scraping! 🕷️