I once spent four hours debugging a spider that returned empty results. Four hours! I checked my selectors a hundred times. They looked perfect in the browser.
Finally, I used Scrapy shell and discovered the problem in 2 minutes: the website was serving different HTML to bots than to browsers.
Proper debugging tools turn hours of frustration into minutes of problem-solving. Let me show you every debugging technique that actually works.
The Problem: Why Debugging Scrapy Is Hard
Regular Python debugging:
- Run code
- See error
- Fix it
- Done
Scrapy debugging:
- Asynchronous execution
- Network issues
- HTML parsing
- JavaScript rendering
- Multiple components (spider, middlewares, pipelines)
- Hard to reproduce issues
You need better tools.
Tool 1: Scrapy Shell (Your Best Friend)
Scrapy shell is an interactive console for testing selectors and requests.
Basic Usage
scrapy shell "https://example.com"
Now you can test selectors interactively:
>>> response.css('h1::text').get()
'Welcome to Example.com'
>>> response.css('.product-name::text').getall()
['Product 1', 'Product 2', 'Product 3']
>>> len(response.css('.product'))
10
What the Docs Don't Tell You
You can test your spider's parse method:
>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>>
>>> # Test parse method
>>> items = list(spider.parse(response))
>>> len(items)
50
>>> items[0]
{'name': 'Product 1', 'price': 29.99}
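One more shell shortcut worth knowing: when you launch scrapy shell from inside your project, it also exposes a spider object bound to the spider that matches the URL (run shelp() to list everything available), so you can often skip the import above. A quick illustration, assuming the same MySpider as before:
>>> spider
<MySpider 'myspider' at 0x...>
>>> items = list(spider.parse(response))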
You can make new requests:
>>> new_request = scrapy.Request('https://example.com/page2')
>>> fetch(new_request)  # fetch() returns nothing; it replaces the shell's `response` object
>>> response.css('h1::text').get()
'Page 2'
You can test with different user agents:
scrapy shell -s USER_AGENT="Mozilla/5.0 iPhone" "https://example.com"
You can inspect response body:
>>> print(response.text[:500]) # First 500 characters
>>>
>>> # Save to file for inspection
>>> with open('response.html', 'w') as f:
...     f.write(response.text)
Tool 2: Scrapy Parse Command
Test your spider without running it fully.
Basic Usage
scrapy parse --spider=myspider https://example.com
Shows what your spider would extract from that URL.
Advanced Options
# Dump the scraped items to a file
scrapy parse --spider=myspider --output=items.json https://example.com
# Use specific callback
scrapy parse --spider=myspider --callback=parse_product https://example.com/product/123
# Follow links (depth)
scrapy parse --spider=myspider --depth=2 https://example.com
# Show requests and items
scrapy parse --spider=myspider --verbose https://example.com
What the Docs Don't Tell You
Test with custom settings:
scrapy parse --spider=myspider -s DOWNLOAD_DELAY=0 -s LOG_LEVEL=DEBUG https://example.com
Save output for comparison:
scrapy parse --spider=myspider https://example.com > output1.txt
# Make changes to spider
scrapy parse --spider=myspider https://example.com > output2.txt
diff output1.txt output2.txt
Tool 3: Logging (Debug Like a Pro)
Strategic logging shows exactly what's happening.
Basic Logging
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        self.logger.info(f'Processing: {response.url}')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            name = product.css('h2::text').get()
            self.logger.debug(f'Scraping: {name}')

            if not name:
                self.logger.warning(f'Product missing name at {response.url}')
                continue

            yield {'name': name}
Log Levels
self.logger.debug('Detailed debugging info') # Only in DEBUG mode
self.logger.info('General information') # Normal operation
self.logger.warning('Something unexpected') # Potential issues
self.logger.error('Something broke') # Errors that don't stop spider
self.logger.critical('Everything is on fire') # Critical failures
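You control which of these actually show up (and where they go) with Scrapy's logging settings. A minimal sketch using the LOG_LEVEL and LOG_FILE settings:
# settings.py
LOG_LEVEL = 'DEBUG'      # show everything, including self.logger.debug() messages
LOG_FILE = 'spider.log'  # write the log to a file instead of the console
You can also override these per run, e.g. scrapy crawl myspider -s LOG_LEVEL=INFO.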
Advanced Logging Tricks
Log selector results:
products = response.css('.product')
self.logger.info(f'Selector ".product" found {len(products)} elements')

if not products:
    self.logger.error(f'No products found! URL: {response.url}')
    self.logger.error(f'Response status: {response.status}')
    self.logger.error(f'Response length: {len(response.body)} bytes')

    # Log first 500 chars of response
    self.logger.debug(f'Response preview: {response.text[:500]}')
Log with exception details:
try:
    price = float(product.css('.price::text').get())
except Exception as e:
    self.logger.error(f'Price parsing failed: {e}', exc_info=True)
    # exc_info=True adds the full stack trace
Conditional logging:
def parse(self, response):
    products = response.css('.product')

    # Only log if something's wrong
    if len(products) == 0:
        self.logger.error(f'Zero products found at {response.url}')

        # Save problematic page
        with open(f'error_{response.url.split("/")[-1]}.html', 'w') as f:
            f.write(response.text)
Tool 4: Scrapy Check (Built-in Tests)
Run built-in validation on your spider:
scrapy check myspider
It runs the contracts defined in your spider callbacks' docstrings and checks for:
- Wrong item/request counts (from @returns)
- Missing fields in scraped items (from @scrapes)
- Exceptions raised while the callback parses the contract's test URL
Add Contracts to Your Spider
class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        """
        @url https://example.com
        @returns items 10 20
        @scrapes name price
        """
        # This contract says:
        # - When parsing example.com
        # - Should return between 10-20 items
        # - Each item should have 'name' and 'price' fields
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
Run contracts:
scrapy check myspider
Tool 5: View Response in Browser
See exactly what Scrapy downloaded:
def parse(self, response):
    # Open the downloaded response in your default browser
    from scrapy.utils.response import open_in_browser
    open_in_browser(response)

    # Continue parsing
    yield {'url': response.url}
When to use this:
- Suspecting JavaScript content
- Checking what HTML Scrapy actually sees
- Comparing browser view vs Scrapy view
Tool 6: Save Response for Offline Testing
Save problematic pages for debugging:
def parse(self, response):
    # Save response to file
    filename = f'debug_{response.url.split("/")[-1]}.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.logger.info(f'Saved response to {filename}')
Then test offline:
scrapy shell file:///path/to/debug_page.html
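You can also wrap the saved file in an HtmlResponse and feed it straight to your callback from a plain Python script or unit test, with no network involved. A minimal sketch, assuming the saved debug_page.html and the MySpider import path used in the earlier examples:
from pathlib import Path

from scrapy.http import HtmlResponse

from myproject.spiders.myspider import MySpider  # hypothetical project path from the examples above

# Rebuild a response object from the saved HTML
body = Path('debug_page.html').read_bytes()
response = HtmlResponse(url='https://example.com', body=body, encoding='utf-8')

# Run the callback offline
spider = MySpider()
items = list(spider.parse(response))
print(f'{len(items)} items extracted offline')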
Tool 7: Breakpoints (IPython Debugger)
Add breakpoints in your spider:
def parse(self, response):
    products = response.css('.product')

    # Add breakpoint
    import ipdb; ipdb.set_trace()
    # Or: import pdb; pdb.set_trace()

    for product in products:
        yield {'name': product.css('h2::text').get()}
When spider hits breakpoint, you get interactive shell:
ipdb> len(products)
10
ipdb> products[0].css('h2::text').get()
'Product Name'
ipdb> c # continue execution
Install IPython debugger:
pip install ipdb
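On Python 3.7+ the built-in breakpoint() call works too: by default it starts pdb, and setting the PYTHONBREAKPOINT environment variable to ipdb.set_trace routes it to ipdb instead. A small sketch:
def parse(self, response):
    products = response.css('.product')
    breakpoint()  # pdb by default; honours the PYTHONBREAKPOINT environment variable
    for product in products:
        yield {'name': product.css('h2::text').get()}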
Tool 8: Scrapy Stats (Built-in Metrics)
Scrapy tracks everything automatically:
def closed(self, reason):
    stats = self.crawler.stats.get_stats()

    self.logger.info('=' * 60)
    self.logger.info('SPIDER STATISTICS')
    self.logger.info(f'Items scraped: {stats.get("item_scraped_count", 0)}')
    self.logger.info(f'Items dropped: {stats.get("item_dropped_count", 0)}')
    self.logger.info(f'Requests made: {stats.get("downloader/request_count", 0)}')
    self.logger.info(f'Response 200: {stats.get("downloader/response_status_count/200", 0)}')
    self.logger.info(f'Response 404: {stats.get("downloader/response_status_count/404", 0)}')
    self.logger.info(f'Response 500: {stats.get("downloader/response_status_count/500", 0)}')
    self.logger.info('=' * 60)
Useful stats:
- item_scraped_count - Items yielded
- item_dropped_count - Items dropped by pipelines
- downloader/request_count - Total requests
- downloader/response_status_count/XXX - Responses by status code
- response_received_count - Responses received
- scheduler/enqueued - Requests queued
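The same stats collector accepts custom counters, which is handy for counting specific problems without grepping logs. A small sketch, using a made-up custom/missing_name key:
def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        if not name:
            # Custom counter; appears in the final stats dump and in get_stats()
            self.crawler.stats.inc_value('custom/missing_name')
            continue
        yield {'name': name}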
Tool 9: Custom Debugging Middleware
Add middleware to log everything:
# middlewares.py
class DebugMiddleware:
    def process_request(self, request, spider):
        spider.logger.debug(f'[REQUEST] {request.method} {request.url}')
        spider.logger.debug(f'[HEADERS] {dict(request.headers)}')
        return None

    def process_response(self, request, response, spider):
        spider.logger.debug(f'[RESPONSE] {response.status} {response.url}')
        spider.logger.debug(f'[LENGTH] {len(response.body)} bytes')
        return response

    def process_exception(self, request, exception, spider):
        spider.logger.error(f'[EXCEPTION] {request.url}: {exception}')
        return None
Enable it:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DebugMiddleware': 100,
}
Tool 10: Compare Scraped Data
Save scraped data and compare between runs:
# First run
scrapy crawl myspider -o run1.json
# Make changes
# Second run
scrapy crawl myspider -o run2.json
# Compare
diff run1.json run2.json
Or use Python:
import json
with open('run1.json') as f:
    data1 = json.load(f)
with open('run2.json') as f:
    data2 = json.load(f)
# Compare counts
print(f'Run 1: {len(data1)} items')
print(f'Run 2: {len(data2)} items')
# Find differences
urls1 = {item['url'] for item in data1}
urls2 = {item['url'] for item in data2}
missing = urls1 - urls2
print(f'Missing in run 2: {missing}')
new = urls2 - urls1
print(f'New in run 2: {new}')
Common Debugging Scenarios
Scenario 1: Selector Returns None
Problem: response.css('.product::text').get() returns None
Debug:
scrapy shell "https://example.com"
>>> response.css('.product')
[] # Empty! Selector is wrong
>>> # Check what's actually there
>>> response.css('*').getall()[:10] # First 10 elements
>>> # View page source
>>> print(response.text[:1000])
>>> # Or open in browser
>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)
Common causes:
- Typo in selector
- JavaScript-loaded content (see the quick check below)
- Wrong HTML structure
- Case sensitivity
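A quick way to rule JavaScript in or out is to check whether text you can see in the browser exists anywhere in the raw HTML Scrapy received. A shell sketch (the outputs shown are illustrative):
>>> 'Product 1' in response.text
False   # the text isn't in the raw HTML, so it's probably rendered by JavaScript
>>> response.xpath('//script[contains(text(), "Product")]').getall()[:1]
['<script>window.__DATA__ = {"products": [...]}</script>']   # or the data is embedded as JSON in a <script> tag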
Scenario 2: Spider Returns No Items
Problem: Spider runs but yields nothing
Debug:
def parse(self, response):
    self.logger.info(f'Parse called for: {response.url}')

    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    if not products:
        self.logger.error('No products found!')
        self.logger.error(f'Response status: {response.status}')
        self.logger.error(f'Response length: {len(response.text)}')

        # Save for inspection
        with open('debug.html', 'w') as f:
            f.write(response.text)

    for product in products:
        item = {
            'name': product.css('h2::text').get()
        }
        self.logger.info(f'Yielding: {item}')
        yield item
Scenario 3: Pipeline Drops All Items
Problem: Items scraped but not in output
Debug:
# pipelines.py
from scrapy.exceptions import DropItem

class DebugPipeline:
    def process_item(self, item, spider):
        spider.logger.info(f'Pipeline received: {item}')

        # Check if item is being dropped
        if not item.get('name'):
            spider.logger.warning('Item missing name, dropping')
            raise DropItem('Missing name')

        spider.logger.info(f'Pipeline passed: {item}')
        return item
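Don't forget to enable it. A sketch, assuming the pipeline lives in myproject/pipelines.py and using a low order number so it runs before your other pipelines:
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.DebugPipeline': 100,
}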
Check stats:
def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    scraped = stats.get('item_scraped_count', 0)
    dropped = stats.get('item_dropped_count', 0)
    self.logger.info(f'Scraped: {scraped}, Dropped: {dropped}')

    if dropped > scraped:
        self.logger.error('More items dropped than scraped!')
Scenario 4: Slow Spider
Problem: Spider is too slow
Debug:
from datetime import datetime

import scrapy

class TimedSpider(scrapy.Spider):
    name = 'timed'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.request_count = 0
        self.parse_times = []

    def parse(self, response):
        parse_start = datetime.now()

        # Your parsing logic
        for product in response.css('.product'):
            yield {'name': product.css('h2::text').get()}

        parse_duration = (datetime.now() - parse_start).total_seconds()
        self.parse_times.append(parse_duration)
        self.request_count += 1

        if self.request_count % 100 == 0:
            avg_parse_time = sum(self.parse_times) / len(self.parse_times)
            self.logger.info(f'Average parse time: {avg_parse_time:.3f}s')

    def closed(self, reason):
        total_time = (datetime.now() - self.start_time).total_seconds()
        self.logger.info(f'Total time: {total_time:.1f}s')
        self.logger.info(f'Requests: {self.request_count}')
        self.logger.info(f'Speed: {self.request_count/total_time:.1f} req/s')
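Before writing custom timing code, check the throughput numbers Scrapy already logs: the built-in LogStats extension periodically prints pages/min and items/min, and you can tighten the interval. A minimal settings sketch:
# settings.py
LOGSTATS_INTERVAL = 30.0  # log crawl/scrape rates every 30 seconds (default is 60)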
Debugging Checklist
When spider doesn't work:
1. Check response in shell:
scrapy shell "https://example.com"
2. Test your selectors:
>>> response.css('.product').getall()
3. Check response status:
>>> response.status
200
4. View what Scrapy sees:
>>> from scrapy.utils.response import open_in_browser
>>> open_in_browser(response)
5. Test parse method:
>>> from myproject.spiders.myspider import MySpider
>>> spider = MySpider()
>>> items = list(spider.parse(response))
>>> len(items)
6. Check logs:
scrapy crawl myspider --loglevel=DEBUG
7. Save problematic response:
with open('debug.html', 'w') as f:
    f.write(response.text)
Quick Reference
Scrapy Shell
scrapy shell "https://example.com"
scrapy shell -s USER_AGENT="iPhone" "https://example.com"
scrapy shell file:///path/to/page.html
Scrapy Parse
scrapy parse --spider=myspider https://example.com
scrapy parse --spider=myspider --callback=parse_product URL
scrapy parse --spider=myspider --depth=2 URL
Logging
self.logger.debug('Debug info')
self.logger.info('Information')
self.logger.warning('Warning')
self.logger.error('Error', exc_info=True) # Include traceback
Breakpoint
import ipdb; ipdb.set_trace()
Save Response
from scrapy.utils.response import open_in_browser
open_in_browser(response)
with open('debug.html', 'w') as f:
    f.write(response.text)
Summary
Essential debugging tools:
- Scrapy shell - Test selectors interactively
- Scrapy parse - Test spider without full run
- Logging - Strategic info/debug messages
- Breakpoints - Pause execution and inspect
- Save responses - Debug offline
When selector returns None:
- Test in scrapy shell
- Check if JavaScript-loaded
- View page source (Ctrl+U)
- Try different selectors
When spider yields nothing:
- Add logging at each step
- Check parse() is being called
- Verify selectors find elements
- Check pipeline isn't dropping items
For slow spiders:
- Log timing information
- Profile with cProfile
- Check network vs parsing time
- Optimize bottlenecks
Remember:
- Test selectors in shell first
- Log strategically, not excessively
- Save problematic pages for offline testing
- Use breakpoints for complex issues
Start debugging with Scrapy shell. It solves 80% of problems in minutes!
Happy scraping! 🕷️