Muhammad Ikramullah Khan

Scrapy Debugging Techniques: Find Bugs Fast (Stop Wasting Hours)

I once spent 4 hours debugging a spider that returned empty results. Four hours! I checked my selectors a hundred times. They looked perfect in the browser.

Finally, I used Scrapy shell and discovered the problem in 2 minutes: the website was serving different HTML to bots than to browsers.

Proper debugging tools turn hours of frustration into minutes of problem-solving. Let me show you every debugging technique that actually works.


The Problem: Why Debugging Scrapy Is Hard

Regular Python debugging:

  • Run code
  • See error
  • Fix it
  • Done

Scrapy debugging:

  • Asynchronous execution
  • Network issues
  • HTML parsing
  • JavaScript rendering
  • Multiple components (spider, middlewares, pipelines)
  • Hard to reproduce issues

You need better tools.


Tool 1: Scrapy Shell (Your Best Friend)

Scrapy shell is an interactive console for testing selectors and requests.

Basic Usage

scrapy shell "https://example.com"

Now you can test selectors interactively:

>>> response.css('h1::text').get()
'Welcome to Example.com'

>>> response.css('.product-name::text').getall()
['Product 1', 'Product 2', 'Product 3']

>>> len(response.css('.product'))
10

What the Docs Don't Tell You

You can test your spider's parse method:

>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>> 
>>> # Test parse method
>>> items = list(spider.parse(response))
>>> len(items)
50
>>> items[0]
{'name': 'Product 1', 'price': 29.99}

You can make new requests:

>>> fetch('https://example.com/page2')  # fetch() replaces the shell's `request` and `response` objects
>>> response.css('h1::text').get()
'Page 2'

You can test with different user agents:

scrapy shell -s USER_AGENT="Mozilla/5.0 iPhone" "https://example.com"

You can inspect response body:

>>> print(response.text[:500])  # First 500 characters
>>> 
>>> # Save to file for inspection
>>> with open('response.html', 'w') as f:
...     f.write(response.text)

Tool 2: Scrapy Parse Command

Test your spider without running it fully.

Basic Usage

scrapy parse --spider=myspider https://example.com

Shows what your spider would extract from that URL.

Advanced Options

# Save the extracted items to a file
scrapy parse --spider=myspider --output=items.json https://example.com

# Use specific callback
scrapy parse --spider=myspider --callback=parse_product https://example.com/product/123

# Follow links (depth)
scrapy parse --spider=myspider --depth=2 https://example.com

# Show requests and items
scrapy parse --spider=myspider --verbose https://example.com

What the Docs Don't Tell You

Test with custom settings:

scrapy parse --spider=myspider -s DOWNLOAD_DELAY=0 -s LOG_LEVEL=DEBUG https://example.com

Save output for comparison:

scrapy parse --spider=myspider https://example.com > output1.txt
# Make changes to spider
scrapy parse --spider=myspider https://example.com > output2.txt
diff output1.txt output2.txt

Tool 3: Logging (Debug Like a Pro)

Strategic logging shows exactly what's happening.

Basic Logging

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        self.logger.info(f'Processing: {response.url}')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            name = product.css('h2::text').get()
            self.logger.debug(f'Scraping: {name}')

            if not name:
                self.logger.warning(f'Product missing name at {response.url}')
                continue

            yield {'name': name}
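
Log verbosity and destination are controlled by settings. A minimal example:

# settings.py
LOG_LEVEL = 'DEBUG'      # 'DEBUG' is the default; use 'INFO' or 'WARNING' for quieter runs
LOG_FILE = 'spider.log'  # optional: write the log to a file you can grep afterwards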

Log Levels

self.logger.debug('Detailed debugging info')     # Shown only when LOG_LEVEL is DEBUG
self.logger.info('General information')          # Normal operation
self.logger.warning('Something unexpected')      # Potential issues
self.logger.error('Something broke')             # Errors that don't stop the spider
self.logger.critical('Everything is on fire')    # Critical failures

Advanced Logging Tricks

Log selector results:

products = response.css('.product')
self.logger.info(f'Selector ".product" found {len(products)} elements')

if not products:
    self.logger.error(f'No products found! URL: {response.url}')
    self.logger.error(f'Response status: {response.status}')
    self.logger.error(f'Response length: {len(response.text)} bytes')

    # Log first 500 chars of response
    self.logger.debug(f'Response preview: {response.text[:500]}')

Log with exception details:

try:
    price = float(product.css('.price::text').get())
except Exception as e:
    self.logger.error(f'Price parsing failed: {e}', exc_info=True)
    # exc_info=True adds full stack trace

Conditional logging:

def parse(self, response):
    products = response.css('.product')

    # Only log if something's wrong
    if len(products) == 0:
        self.logger.error(f'Zero products found at {response.url}')
        # Save problematic page
        with open(f'error_{response.url.split("/")[-1]}.html', 'w') as f:
            f.write(response.text)

Tool 4: Scrapy Check (Built-in Tests)

Run built-in validation on your spider:

scrapy check myspider

It runs the contracts defined in your callbacks' docstrings and reports:

  • Item or request counts outside the declared range (@returns)
  • Items missing the fields declared with @scrapes
  • Callbacks that raise exceptions during the check

Add Contracts to Your Spider

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        """
        @url https://example.com
        @returns items 10 20
        @scrapes name price
        """
        # This contract says:
        # - When parsing example.com
        # - Should return between 10-20 items
        # - Each item should have 'name' and 'price' fields

        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Run contracts:

scrapy check myspider

Tool 5: View Response in Browser

See exactly what Scrapy downloaded:

def parse(self, response):
    # Open response in browser
    from scrapy.utils.response import open_in_browser
    open_in_browser(response)

    # Continue parsing
    yield {'url': response.url}

When to use this:

  • Suspecting JavaScript content
  • Checking what HTML Scrapy actually sees
  • Comparing browser view vs Scrapy view
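
Inside Scrapy shell the same helper is available as the view() shortcut:

>>> view(response)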

Tool 6: Save Response for Offline Testing

Save problematic pages for debugging:

def parse(self, response):
    # Save response to file
    filename = f'debug_{response.url.split("/")[-1]}.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

    self.logger.info(f'Saved response to {filename}')

Then test offline:

scrapy shell file:///path/to/debug_page.html
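
You can also rebuild a response object from the saved file and feed it straight to your parse method, which makes a quick offline regression test. A minimal sketch (the file name and spider import are placeholders for your own):

from scrapy.http import HtmlResponse
from myproject.spiders.myspider import MySpider

# Rebuild a response from the saved HTML; the URL is only used for resolving relative links
with open('debug_page.html', 'rb') as f:
    response = HtmlResponse(url='https://example.com', body=f.read(), encoding='utf-8')

spider = MySpider()
items = list(spider.parse(response))
print(f'{len(items)} items extracted offline')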

Tool 7: Breakpoints (IPython Debugger)

Add breakpoints in your spider:

def parse(self, response):
    products = response.css('.product')

    # Add breakpoint
    import ipdb; ipdb.set_trace()
    # Or: import pdb; pdb.set_trace()

    for product in products:
        yield {'name': product.css('h2::text').get()}

When the spider hits the breakpoint, you get an interactive shell:

ipdb> len(products)
10
ipdb> products[0].css('h2::text').get()
'Product Name'
ipdb> c  # continue execution

Install IPython debugger:

pip install ipdb
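
On Python 3.7+ you can use the built-in breakpoint() call instead and pick the debugger with an environment variable:

def parse(self, response):
    breakpoint()  # defaults to pdb; PYTHONBREAKPOINT selects a different debugger

    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

Run with ipdb:

PYTHONBREAKPOINT=ipdb.set_trace scrapy crawl myspider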

Tool 8: Scrapy Stats (Built-in Metrics)

Scrapy tracks everything automatically:

def closed(self, reason):
    stats = self.crawler.stats.get_stats()

    self.logger.info('='*60)
    self.logger.info('SPIDER STATISTICS')
    self.logger.info(f'Items scraped: {stats.get("item_scraped_count", 0)}')
    self.logger.info(f'Items dropped: {stats.get("item_dropped_count", 0)}')
    self.logger.info(f'Requests made: {stats.get("downloader/request_count", 0)}')
    self.logger.info(f'Response 200: {stats.get("downloader/response_status_count/200", 0)}')
    self.logger.info(f'Response 404: {stats.get("downloader/response_status_count/404", 0)}')
    self.logger.info(f'Response 500: {stats.get("downloader/response_status_count/500", 0)}')
    self.logger.info('='*60)

Useful stats:

  • item_scraped_count - Items yielded
  • item_dropped_count - Items dropped by pipelines
  • downloader/request_count - Total requests
  • downloader/response_status_count/XXX - Responses by status code
  • response_received_count - Responses received
  • scheduler/enqueued - Requests queued
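
You can record your own counters through the same stats collector. A small sketch (the 'custom/...' stat names are just examples):

def parse(self, response):
    products = response.css('.product')

    if not products:
        # Shows up in the final stats dump next to Scrapy's built-in counters
        self.crawler.stats.inc_value('custom/pages_without_products')

    self.crawler.stats.max_value('custom/max_products_per_page', len(products))

    for product in products:
        yield {'name': product.css('h2::text').get()}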

Tool 9: Custom Debugging Middleware

Add middleware to log everything:

# middlewares.py

class DebugMiddleware:
    def process_request(self, request, spider):
        spider.logger.debug(f'[REQUEST] {request.method} {request.url}')
        spider.logger.debug(f'[HEADERS] {dict(request.headers)}')
        return None

    def process_response(self, request, response, spider):
        spider.logger.debug(f'[RESPONSE] {response.status} {response.url}')
        spider.logger.debug(f'[LENGTH] {len(response.body)} bytes')
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.error(f'[EXCEPTION] {request.url}: {exception}')
        return None

Enable it:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DebugMiddleware': 100,
}
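
The middleware is chatty, so one possible pattern is to enable it only when an environment variable is set (the SCRAPY_DEBUG name here is just an example):

# settings.py
import os

if os.environ.get('SCRAPY_DEBUG'):
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.DebugMiddleware': 100,
    }

Then run SCRAPY_DEBUG=1 scrapy crawl myspider when you need the extra output.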

Tool 10: Compare Scraped Data

Save scraped data and compare between runs:

# First run
scrapy crawl myspider -o run1.json

# Make changes

# Second run
scrapy crawl myspider -o run2.json

# Compare
diff run1.json run2.json

Or use Python:

import json

with open('run1.json') as f:
    data1 = json.load(f)

with open('run2.json') as f:
    data2 = json.load(f)

# Compare counts
print(f'Run 1: {len(data1)} items')
print(f'Run 2: {len(data2)} items')

# Find differences
urls1 = {item['url'] for item in data1}
urls2 = {item['url'] for item in data2}

missing = urls1 - urls2
print(f'Missing in run 2: {missing}')

new = urls2 - urls1
print(f'New in run 2: {new}')

Common Debugging Scenarios

Scenario 1: Selector Returns None

Problem: response.css('.product::text').get() returns None

Debug:

scrapy shell "https://example.com"
>>> response.css('.product')
[]  # Empty! Selector is wrong

>>> # Check which class names actually exist on the page
>>> response.css('[class]::attr(class)').getall()[:20]

>>> # View page source
>>> print(response.text[:1000])

>>> # Or open in browser
>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)

Common causes:

  • Typo in selector
  • JavaScript-loaded content (see the quick check below)
  • Wrong HTML structure
  • Case sensitivity
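
A quick way to confirm the JavaScript case: check whether text you can see in the browser ever arrived in the raw HTML (assuming 'Product 1' is visible in your browser):

>>> 'Product 1' in response.text
False  # the text never reached the HTML, so it is rendered client-side by JavaScript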

Scenario 2: Spider Returns No Items

Problem: Spider runs but yields nothing

Debug:

def parse(self, response):
    self.logger.info(f'Parse called for: {response.url}')

    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    if not products:
        self.logger.error('No products found!')
        self.logger.error(f'Response status: {response.status}')
        self.logger.error(f'Response length: {len(response.text)}')

        # Save for inspection
        with open('debug.html', 'w') as f:
            f.write(response.text)

    for product in products:
        item = {
            'name': product.css('h2::text').get()
        }
        self.logger.info(f'Yielding: {item}')
        yield item

Scenario 3: Pipeline Drops All Items

Problem: Items scraped but not in output

Debug:

# pipelines.py
from scrapy.exceptions import DropItem

class DebugPipeline:
    def process_item(self, item, spider):
        spider.logger.info(f'Pipeline received: {item}')

        # Check if item is being dropped
        if not item.get('name'):
            spider.logger.warning('Item missing name, dropping')
            raise DropItem('Missing name')

        spider.logger.info(f'Pipeline passed: {item}')
        return item
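
Give the debug pipeline a low priority number so it runs before your real pipelines and sees items exactly as the spider yielded them (MyRealPipeline is a placeholder for your own):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.DebugPipeline': 100,
    'myproject.pipelines.MyRealPipeline': 300,
}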

Check stats:

def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    scraped = stats.get('item_scraped_count', 0)
    dropped = stats.get('item_dropped_count', 0)

    self.logger.info(f'Scraped: {scraped}, Dropped: {dropped}')

    if dropped > scraped:
        self.logger.error('More items dropped than scraped!')

Scenario 4: Slow Spider

Problem: Spider is too slow

Debug:

from datetime import datetime

class TimedSpider(scrapy.Spider):
    name = 'timed'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.request_count = 0
        self.parse_times = []

    def parse(self, response):
        parse_start = datetime.now()

        # Your parsing logic
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        parse_duration = (datetime.now() - parse_start).total_seconds()
        self.parse_times.append(parse_duration)

        self.request_count += 1

        if self.request_count % 100 == 0:
            avg_parse_time = sum(self.parse_times) / len(self.parse_times)
            self.logger.info(f'Average parse time: {avg_parse_time:.3f}s')

    def closed(self, reason):
        total_time = (datetime.now() - self.start_time).total_seconds()
        self.logger.info(f'Total time: {total_time:.1f}s')
        self.logger.info(f'Requests: {self.request_count}')
        self.logger.info(f'Speed: {self.request_count/total_time:.1f} req/s')
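
Scrapy's built-in LogStats extension already logs pages and items per minute; lowering its interval gives you more frequent throughput samples while debugging:

# settings.py
LOGSTATS_INTERVAL = 10.0  # log "Crawled N pages (at N pages/min) ..." every 10 seconds instead of 60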

Debugging Checklist

When spider doesn't work:

1. Check response in shell:

scrapy shell "https://example.com"

2. Test your selectors:

>>> response.css('.product').getall()

3. Check response status:

>>> response.status
200

4. View what Scrapy sees:

>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)

5. Test parse method:

>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>> items = list(spider.parse(response))
>>> len(items)

6. Check logs:

scrapy crawl myspider --loglevel=DEBUG

7. Save problematic response:

with open('debug.html', 'w') as f:
    f.write(response.text)

Quick Reference

Scrapy Shell

scrapy shell "https://example.com"
scrapy shell -s USER_AGENT="iPhone" "https://example.com"
scrapy shell file:///path/to/page.html

Scrapy Parse

scrapy parse --spider=myspider https://example.com
scrapy parse --spider=myspider --callback=parse_product URL
scrapy parse --spider=myspider --depth=2 URL

Logging

self.logger.debug('Debug info')
self.logger.info('Information')
self.logger.warning('Warning')
self.logger.error('Error', exc_info=True)  # Include traceback

Breakpoint

import ipdb; ipdb.set_trace()

Save Response

from scrapy.utils.response import open_in_browser
open_in_browser(response)

with open('debug.html', 'w') as f:
    f.write(response.text)

Summary

Essential debugging tools:

  1. Scrapy shell - Test selectors interactively
  2. Scrapy parse - Test spider without full run
  3. Logging - Strategic info/debug messages
  4. Breakpoints - Pause execution and inspect
  5. Save responses - Debug offline

When selector returns None:

  • Test in scrapy shell
  • Check if JavaScript-loaded
  • View page source (Ctrl+U)
  • Try different selectors

When spider yields nothing:

  • Add logging at each step
  • Check parse() is being called
  • Verify selectors find elements
  • Check pipeline isn't dropping items

For slow spiders:

  • Log timing information
  • Profile with cProfile (see the sketch below)
  • Check network vs parsing time
  • Optimize bottlenecks
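
One rough way to profile with cProfile (a sketch; assumes Python 3.7+ for cProfile's -m flag, and caps the crawl so the profile stays small):

# Profile a short crawl, then print the 20 most expensive calls
python -m cProfile -o spider.prof -m scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=50
python -c "import pstats; pstats.Stats('spider.prof').sort_stats('cumulative').print_stats(20)"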

Remember:

  • Test selectors in shell first
  • Log strategically, not excessively
  • Save problematic pages for offline testing
  • Use breakpoints for complex issues

Start debugging with Scrapy shell. It solves 80% of problems in minutes!

Happy scraping! 🕷️
