Muhammad Ikramullah Khan

Scrapy Logging: The Complete Beginner's Guide (Debug Like a Pro)

When I first started using Scrapy, my spiders would just... stop working. No error messages. No clues. Just silence.

I'd stare at my terminal thinking "Did it even run? Where did it fail? Why isn't anything happening?"

Then I learned about logging. Suddenly, my spiders started talking to me. They'd tell me exactly what they were doing, where they were going, what they found, and when things went wrong.

Logging transformed me from a confused beginner to someone who could actually debug problems. Let me show you how to make your spiders communicate clearly.


What Is Logging, Really?

Think of logging like a diary your spider keeps. As it runs, it writes down everything it does:

  • "I'm starting up now"
  • "I'm visiting this URL"
  • "I found 20 products"
  • "Uh oh, I got a 404 error"
  • "I'm done, here are my stats"

Without logging, your spider runs in complete silence. You have no idea what's happening inside.

With logging, you see everything. Every step. Every decision. Every problem.


The Five Log Levels (From Loud to Quiet)

Python has five log levels. Think of them like volume settings:

DEBUG (Loudest)

Everything. Every tiny detail. Use this when you're hunting bugs.

self.logger.debug('Checking if this element exists')

INFO

Important milestones. "I started," "I found data," "I finished."

self.logger.info('Successfully scraped 50 products')

WARNING

Something weird happened, but the spider keeps running.

self.logger.warning('Product has no price, skipping')

ERROR

Something broke, but the spider continues with other pages.

self.logger.error('Failed to parse product page')

CRITICAL (Quietest)

Everything is on fire. The spider can't continue.

self.logger.critical('Database connection lost, cannot save data')

By default, Scrapy's LOG_LEVEL is DEBUG, so you see everything, including Scrapy's own chatty internals. Raise it to INFO or WARNING when you want less noise; the next sections show how.
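
To see what that threshold does, here is a tiny standalone sketch using plain Python logging (Scrapy's logging is built on the same module). Messages below the configured level are simply dropped:

import logging

# Show INFO and above; DEBUG falls below the threshold
logging.basicConfig(level=logging.INFO)
log = logging.getLogger('demo')

log.debug('Hidden: below the threshold')
log.info('Shown')
log.warning('Shown too')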


Your First Log Messages

Every spider has a built-in logger. Just use self.logger:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.logger.info('Started scraping')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            name = product.css('h2::text').get()

            if name:
                self.logger.debug(f'Scraping product: {name}')
                yield {'name': name}
            else:
                self.logger.warning('Product missing name, skipped')

Run it:

scrapy crawl myspider

You'll see your log messages mixed with Scrapy's own messages.
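
The output will look roughly like this (timestamps, versions, and exact messages will differ on your machine):

2024-12-24 10:30:15 [scrapy.core.engine] INFO: Spider opened
2024-12-24 10:30:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
2024-12-24 10:30:16 [myspider] INFO: Started scraping
2024-12-24 10:30:16 [myspider] INFO: Found 20 products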


Controlling What You See

Change Log Level from Command Line

Only show warnings and errors:

scrapy crawl myspider --loglevel=WARNING

Show everything, including debug messages:

scrapy crawl myspider --loglevel=DEBUG

Change Log Level in Settings

Edit settings.py:

# Only show important stuff
LOG_LEVEL = 'INFO'

# Or show everything for debugging
LOG_LEVEL = 'DEBUG'

# Or only show problems
LOG_LEVEL = 'WARNING'
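
You can also override the level for a single spider with custom_settings, leaving the project-wide setting alone (QuietSpider here is just a placeholder name):

import scrapy

class QuietSpider(scrapy.Spider):
    name = 'quietspider'
    # Applies only to this spider's crawls
    custom_settings = {'LOG_LEVEL': 'WARNING'}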

Saving Logs to a File

Console output disappears when you close the terminal. Save logs to a file instead:

From Command Line

scrapy crawl myspider --logfile=spider.log

From Settings

# settings.py
LOG_FILE = 'spider.log'

Now all your logs go to spider.log instead of the console. Perfect for production scrapers that run for hours.
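
While a long crawl is running (or after it finishes), ordinary command-line tools are enough to keep an eye on the file:

# Follow the log live
tail -f spider.log

# Count errors, or pull out just the warnings
grep -c ERROR spider.log
grep WARNING spider.log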

Append or Overwrite the Log File

Scrapy appends to an existing log file across runs by default (LOG_FILE_APPEND is True). If you'd rather start with a fresh file on every run, turn appending off:

# settings.py
LOG_FILE = 'spider.log'
LOG_FILE_APPEND = False  # Start a fresh log file on each run

Real-World Example: A Logging Spider

Let's build a spider that logs everything important:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.products_scraped = 0
        self.products_failed = 0

    def parse(self, response):
        self.logger.info(f'Parsing page: {response.url}')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products on this page')

        if not products:
            self.logger.warning('No products found on this page!')

        for product in products:
            try:
                item = self.parse_product(product)
                self.products_scraped += 1
                self.logger.debug(f'Scraped: {item["name"]}')
                yield item
            except Exception as e:
                self.products_failed += 1
                self.logger.error(f'Failed to parse product: {e}')

        # Follow pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            self.logger.info(f'Following next page: {next_page}')
            yield response.follow(next_page, callback=self.parse)
        else:
            self.logger.info('No more pages to scrape')

    def parse_product(self, product):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        if not name:
            raise ValueError('Product has no name')

        if not price:
            self.logger.warning(f'Product {name} has no price')
            price = 'N/A'

        return {
            'name': name.strip(),
            'price': price
        }

    def closed(self, reason):
        # Called when spider finishes
        self.logger.info('=' * 50)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Total products scraped: {self.products_scraped}')
        self.logger.info(f'Total failures: {self.products_failed}')
        self.logger.info(f'Reason: {reason}')
        self.logger.info('=' * 50)

Run it with INFO level:

scrapy crawl products --loglevel=INFO --logfile=products.log

Your log file will have a complete record of what happened.
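
An excerpt might look something like this (the URLs, counts, and timestamps are made up for illustration):

2024-12-24 10:30:15 [products] INFO: Parsing page: https://example.com/products
2024-12-24 10:30:15 [products] INFO: Found 20 products on this page
2024-12-24 10:30:16 [products] INFO: Following next page: /products?page=2
2024-12-24 10:31:02 [products] INFO: SPIDER FINISHED
2024-12-24 10:31:02 [products] INFO: Total products scraped: 387
2024-12-24 10:31:02 [products] INFO: Total failures: 3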


Advanced: Custom Log Formatting

Scrapy's default logs look like this:

2024-12-24 10:30:15 [myspider] INFO: Scraped product: Widget

You can customize the format:

# settings.py
LOG_FORMAT = '%(levelname)s: %(message)s'

Now it looks like:

INFO: Scraped product: Widget

More Formatting Options

# Show date, time, level, and message
LOG_FORMAT = '%(asctime)s [%(levelname)s] %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

Output:

2024-12-24 10:30:15 [INFO] Scraped product: Widget

Show Spider Name

LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

Output:

2024-12-24 10:30:15 [myspider] INFO: Scraped product: Widget
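
These placeholders are the standard ones from Python's logging module (Scrapy hands LOG_FORMAT to a regular logging.Formatter), so any LogRecord attribute works. For example, adding the module and line number that produced each message:

# settings.py
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s (%(module)s:%(lineno)d): %(message)s'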

Logging in Pipelines

Pipelines don't get a self.logger attribute the way spiders do. You can use the spider.logger that's passed into process_item, or create a module-level logger of your own:

# pipelines.py
import logging

from scrapy.exceptions import DropItem

class MyPipeline:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.items_processed = 0
        self.items_dropped = 0

    def process_item(self, item, spider):
        self.logger.debug(f'Processing item: {item}')

        if not item.get('price'):
            self.items_dropped += 1
            self.logger.warning(f'Dropping item with no price: {item.get("name")}')
            raise DropItem('Missing price')

        # Clean the price
        price = item['price'].replace('$', '').replace(',', '')
        try:
            item['price'] = float(price)
            self.items_processed += 1
            self.logger.info(f'Processed item: {item["name"]} at ${item["price"]}')
        except ValueError:
            self.items_dropped += 1
            self.logger.error(f'Invalid price format: {item["price"]}')
            raise DropItem('Invalid price')

        return item

    def close_spider(self, spider):
        self.logger.info(f'Pipeline stats: {self.items_processed} processed, {self.items_dropped} dropped')
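
One thing to remember: the pipeline (and its logging) only runs if it's enabled in settings.py. The module path below assumes a project package called myproject; adjust it to match yours:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}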

Silencing Noisy Logs

Scrapy logs A LOT. Sometimes too much. Here's how to quiet specific parts:

Hide Specific Log Categories

# settings.py
import logging

# Reduce chattiness of some components
logging.getLogger('scrapy.core.engine').setLevel(logging.WARNING)
logging.getLogger('scrapy.downloadermiddlewares').setLevel(logging.WARNING)

Hide HTTP Error Logs

When scraping, you'll often hit 404s and 500s, and every one of them adds extra lines to the log. To silence the HttpError middleware's messages about them:

# In your spider's __init__ (needs import logging at the top of the file)
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    logging.getLogger('scrapy.spidermiddlewares.httperror').setLevel(logging.ERROR)

Show Only Your Spider's Logs

# settings.py
import logging

LOG_LEVEL = 'INFO'

# Quiet Scrapy's own loggers; your spider logs under its own name, so it's unaffected
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now you'll only see your spider's log messages.


Debugging with Logs

Track Request Flow

def start_requests(self):
    for url in self.start_urls:
        self.logger.info(f'Requesting: {url}')
        yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    self.logger.info(f'Received response from: {response.url}')
    self.logger.info(f'Status code: {response.status}')
    self.logger.debug(f'Response length: {len(response.body)} bytes')

Log Selector Results

def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    if not products:
        self.logger.warning('No products found! Selector might be wrong.')
        self.logger.debug(f'Page HTML: {response.text[:500]}')  # First 500 chars

Log Exceptions Properly

def parse_product(self, product):
    try:
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        return {'name': name, 'price': price}

    except Exception as e:
        self.logger.error(f'Error parsing product: {e}', exc_info=True)
        # exc_info=True adds the full stack trace

With exc_info=True, you get the full error traceback in your logs. Super helpful for debugging.
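
There's also a shortcut when you're already inside an except block: logger.exception() logs at ERROR level and attaches the traceback automatically, which is equivalent to error(..., exc_info=True). A minimal sketch:

def parse_product(self, product):
    try:
        name = product.css('h2::text').get()
        return {'name': name}
    except Exception:
        # Same effect as self.logger.error('...', exc_info=True)
        self.logger.exception('Error parsing product')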


Tips Nobody Tells You

Tip #1: Log Data Quality Issues

def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        # Log data quality
        if not name:
            self.logger.warning(f'Missing name at {response.url}')
        if not price:
            self.logger.warning(f'Missing price for {name}')

        yield {'name': name, 'price': price}

This helps you spot issues with the website's data, not just your selectors.

Tip #2: Log Progress for Long Scrapes

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.page_count = 0

def parse(self, response):
    self.page_count += 1

    if self.page_count % 10 == 0:
        self.logger.info(f'Progress: Scraped {self.page_count} pages so far')

For spiders that run for hours, this shows you're still making progress.
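
If you'd rather not manage counters by hand, Scrapy's built-in stats collector (available as self.crawler.stats once the spider is attached to a crawler) can track them for you, and any custom keys show up in the stats dump logged at the end of the crawl:

def parse(self, response):
    # Custom keys appear alongside Scrapy's own stats in the final dump
    self.crawler.stats.inc_value('custom/pages_parsed')
    self.crawler.stats.inc_value('custom/products_found', count=len(response.css('.product')))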

Tip #3: Log What You're NOT Scraping

def parse(self, response):
    for product in response.css('.product'):
        if product.css('.out-of-stock'):
            self.logger.debug(f'Skipping out-of-stock product: {product.css("h2::text").get()}')
            continue  # Skip this product

        yield self.parse_product(product)

Knowing what you skipped helps debug incomplete data.

Tip #4: Different Logs for Different Spiders

Running multiple spiders? Keep their logs separate:

scrapy crawl spider1 --logfile=spider1.log
scrapy crawl spider2 --logfile=spider2.log

Or in code:

# settings.py (for spider1)
# Assumes BOT_NAME is defined earlier in settings.py
LOG_FILE = f'{BOT_NAME}_spider1.log'
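
Or give each spider its own file from inside the spider class, using custom_settings (the same mechanism the production example later in this post uses):

import scrapy

class Spider1(scrapy.Spider):
    name = 'spider1'
    # Each spider writes to its own log file
    custom_settings = {'LOG_FILE': 'spider1.log'}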

Common Mistakes

Mistake #1: Using print() Instead of Logging

# WRONG
def parse(self, response):
    print('Found products')  # Don't do this!

# RIGHT
def parse(self, response):
    self.logger.info('Found products')

print() doesn't respect log levels or file output. Always use the logger.

Mistake #2: Logging Too Much in Production

# WRONG (logs every single item!)
def parse(self, response):
    for product in response.css('.product'):
        self.logger.info(f'Scraping: {product.css("h2::text").get()}')
        yield {...}

# RIGHT (log summaries)
def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Scraping {len(products)} products from page')

    for product in products:
        yield {...}

In production with thousands of items, logging each one creates massive log files.

Mistake #3: Not Logging Failures

# WRONG (fails silently)
def parse_product(self, product):
    name = product.css('h2::text').get()
    return {'name': name}

# RIGHT (logs the issue)
def parse_product(self, product):
    name = product.css('h2::text').get()

    if not name:
        self.logger.error('Failed to extract name')

    return {'name': name}

Production Spider with Complete Logging

Here's a complete, production-ready spider with proper logging:

import scrapy
import logging

class ProductionSpider(scrapy.Spider):
    name = 'production'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'production_spider.log',
        'LOG_FILE_APPEND': True,
        'LOG_FORMAT': '%(asctime)s [%(name)s] %(levelname)s: %(message)s',
        'LOG_DATEFORMAT': '%Y-%m-%d %H:%M:%S'
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'pages_scraped': 0,
            'items_scraped': 0,
            'items_failed': 0,
            'errors': 0
        }

    def start_requests(self):
        self.logger.info('='*60)
        self.logger.info('SPIDER STARTED')
        self.logger.info(f'Starting URLs: {self.start_urls}')
        self.logger.info('='*60)

        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        self.stats['pages_scraped'] += 1
        self.logger.info(f'Processing page {self.stats["pages_scraped"]}: {response.url}')

        products = response.css('.product')
        if not products:
            self.logger.warning(f'No products found on {response.url}')
            return

        for product in products:
            try:
                item = self.parse_product(product, response)
                self.stats['items_scraped'] += 1
                yield item
            except Exception as e:
                self.stats['items_failed'] += 1
                self.logger.error(f'Failed to parse product: {e}', exc_info=True)

        # Log progress every 5 pages
        if self.stats['pages_scraped'] % 5 == 0:
            self.log_progress()

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse, errback=self.handle_error)

    def parse_product(self, product, response):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        if not name:
            raise ValueError('Missing product name')

        if not price:
            self.logger.warning(f'Product {name} has no price')

        return {
            'name': name.strip() if name else None,
            'price': price.strip() if price else None,
            'url': response.urljoin(product.css('a::attr(href)').get())
        }

    def handle_error(self, failure):
        self.stats['errors'] += 1
        self.logger.error(f'Request failed: {failure.value}')
        self.logger.error(f'URL: {failure.request.url}')

    def log_progress(self):
        self.logger.info('-'*60)
        self.logger.info('PROGRESS REPORT')
        self.logger.info(f'Pages scraped: {self.stats["pages_scraped"]}')
        self.logger.info(f'Items scraped: {self.stats["items_scraped"]}')
        self.logger.info(f'Failed items: {self.stats["items_failed"]}')
        self.logger.info(f'Errors: {self.stats["errors"]}')
        self.logger.info('-'*60)

    def closed(self, reason):
        self.logger.info('='*60)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Reason: {reason}')
        self.log_progress()
        self.logger.info('='*60)

This spider logs:

  • When it starts and ends
  • Progress every 5 pages
  • All errors with full details
  • Final statistics

Perfect for production!


Quick Reference

Log Levels (In Order)

self.logger.debug('Detailed debugging info')
self.logger.info('General information')
self.logger.warning('Something unexpected happened')
self.logger.error('Something broke')
self.logger.critical('Everything is on fire')

Settings

# settings.py

# Set minimum log level
LOG_LEVEL = 'INFO'  # or DEBUG, WARNING, ERROR, CRITICAL

# Save to file
LOG_FILE = 'spider.log'

# Append instead of overwrite
LOG_FILE_APPEND = True

# Custom format
LOG_FORMAT = '%(asctime)s [%(levelname)s] %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

# Disable logging completely
LOG_ENABLED = False

Command Line

# Change log level
scrapy crawl myspider --loglevel=DEBUG

# Save to file
scrapy crawl myspider --logfile=spider.log

# Multiple options
scrapy crawl myspider --loglevel=INFO --logfile=spider.log

Summary

Logging is your spider's voice. It tells you:

  • What it's doing right now
  • What it found
  • What went wrong
  • How long things took

Key takeaways:

  • Use self.logger in spiders
  • Five levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  • Save logs to files for production
  • Log errors with exc_info=True for stack traces
  • Log progress for long-running scrapers
  • Use appropriate log levels (don't log everything as INFO!)

Start adding logs to your spiders today. When something breaks (and it will), you'll have the information you need to fix it quickly.

Happy scraping! 🕷️
