When I first started using Scrapy, my spiders would just... stop working. No error messages. No clues. Just silence.
I'd stare at my terminal thinking "Did it even run? Where did it fail? Why isn't anything happening?"
Then I learned about logging. Suddenly, my spiders started talking to me. They'd tell me exactly what they were doing, where they were going, what they found, and when things went wrong.
Logging transformed me from a confused beginner to someone who could actually debug problems. Let me show you how to make your spiders communicate clearly.
What Is Logging, Really?
Think of logging like a diary your spider keeps. As it runs, it writes down everything it does:
- "I'm starting up now"
- "I'm visiting this URL"
- "I found 20 products"
- "Uh oh, I got a 404 error"
- "I'm done, here are my stats"
Without logging, your spider runs in complete silence. You have no idea what's happening inside.
With logging, you see everything. Every step. Every decision. Every problem.
The Five Log Levels (From Loud to Quiet)
Python has five log levels. Think of them like volume settings:
DEBUG (Loudest)
Everything. Every tiny detail. Use this when you're hunting bugs.
self.logger.debug('Checking if this element exists')
INFO
Important milestones. "I started," "I found data," "I finished."
self.logger.info('Successfully scraped 50 products')
WARNING
Something weird happened, but the spider keeps running.
self.logger.warning('Product has no price, skipping')
ERROR
Something broke, but the spider continues with other pages.
self.logger.error('Failed to parse product page')
CRITICAL (Quietest)
Everything is on fire. The spider can't continue.
self.logger.critical('Database connection lost, cannot save data')
Scrapy's default LOG_LEVEL is DEBUG, so out of the box you see everything, DEBUG messages included. Raise the level to INFO or higher when that gets too noisy; the next sections show how.
Your First Log Messages
Every spider has a built-in logger. Just use self.logger:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.logger.info('Started scraping')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            name = product.css('h2::text').get()
            if name:
                self.logger.debug(f'Scraping product: {name}')
                yield {'name': name}
            else:
                self.logger.warning('Product missing name, skipped')
Run it:
scrapy crawl myspider
You'll see your log messages mixed with Scrapy's own messages.
Controlling What You See
Change Log Level from Command Line
Only show warnings and errors:
scrapy crawl myspider --loglevel=WARNING
Show everything, including debug messages:
scrapy crawl myspider --loglevel=DEBUG
Change Log Level in Settings
Edit settings.py:
# Only show important stuff
LOG_LEVEL = 'INFO'
# Or show everything for debugging
LOG_LEVEL = 'DEBUG'
# Or only show problems
LOG_LEVEL = 'WARNING'
Saving Logs to a File
Console output disappears when you close the terminal. Save logs to a file instead:
From Command Line
scrapy crawl myspider --logfile=spider.log
From Settings
# settings.py
LOG_FILE = 'spider.log'
Now all your logs save to spider.log. Perfect for production scrapers that run for hours.
Append or Overwrite the Log File
By default, Scrapy appends to the log file on each run (LOG_FILE_APPEND defaults to True), so old logs pile up in one file. If you want a fresh file every run, turn appending off:
# settings.py
LOG_FILE = 'spider.log'
LOG_FILE_APPEND = False  # Overwrite instead of appending
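If you'd rather keep one file per day instead of one ever-growing file, you can build the filename dynamically in settings.py. A small sketch (the naming scheme here is just an example, not a Scrapy feature):
# settings.py
from datetime import datetime

# e.g. spider_2024-12-24.log; appending (the default) keeps same-day runs together
LOG_FILE = f'spider_{datetime.now():%Y-%m-%d}.log'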
Real-World Example: A Logging Spider
Let's build a spider that logs everything important:
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.products_scraped = 0
        self.products_failed = 0

    def parse(self, response):
        self.logger.info(f'Parsing page: {response.url}')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products on this page')

        if not products:
            self.logger.warning('No products found on this page!')

        for product in products:
            try:
                item = self.parse_product(product)
                self.products_scraped += 1
                self.logger.debug(f'Scraped: {item["name"]}')
                yield item
            except Exception as e:
                self.products_failed += 1
                self.logger.error(f'Failed to parse product: {e}')

        # Follow pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            self.logger.info(f'Following next page: {next_page}')
            yield response.follow(next_page, callback=self.parse)
        else:
            self.logger.info('No more pages to scrape')

    def parse_product(self, product):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        if not name:
            raise ValueError('Product has no name')

        if not price:
            self.logger.warning(f'Product {name} has no price')
            price = 'N/A'

        return {
            'name': name.strip(),
            'price': price
        }

    def closed(self, reason):
        # Called when the spider finishes
        self.logger.info('=' * 50)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Total products scraped: {self.products_scraped}')
        self.logger.info(f'Total failures: {self.products_failed}')
        self.logger.info(f'Reason: {reason}')
        self.logger.info('=' * 50)
Run it with INFO level:
scrapy crawl products --loglevel=INFO --logfile=products.log
Your log file will have a complete record of what happened.
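Once a run finishes, you rarely want to reread the whole file. Here's a quick sketch of a helper script (assuming the log file is named products.log, as in the command above, and uses Scrapy's default "LEVEL: message" format) that counts how many warnings and errors the run produced:
# check_log.py - rough count of problems in a finished run's log
from collections import Counter

counts = Counter()
with open('products.log', encoding='utf-8') as log:
    for line in log:
        for level in ('WARNING', 'ERROR', 'CRITICAL'):
            if f'{level}:' in line:
                counts[level] += 1

print(dict(counts))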
Advanced: Custom Log Formatting
Scrapy's default logs look like this:
2024-12-24 10:30:15 [myspider] INFO: Scraped product: Widget
You can customize the format:
# settings.py
LOG_FORMAT = '%(levelname)s: %(message)s'
Now it looks like:
INFO: Scraped product: Widget
More Formatting Options
# Show date, time, level, and message
LOG_FORMAT = '%(asctime)s [%(levelname)s] %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
Output:
2024-12-24 10:30:15 [INFO] Scraped product: Widget
Show Spider Name
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
Output:
2024-12-24 10:30:15 [myspider] INFO: Scraped product: Widget
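The format string accepts any standard Python LogRecord attribute, not just the ones above. For example, %(filename)s and %(lineno)d add the source file and line that produced each message:
# settings.py
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s (%(filename)s:%(lineno)d)'
The file and line point at wherever in your code the log call was made, which makes it easy to jump straight to the right spot.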
Logging in Pipelines
Pipelines don't have self.logger like spiders do. Create your own:
# pipelines.py
import logging

from scrapy.exceptions import DropItem

class MyPipeline:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.items_processed = 0
        self.items_dropped = 0

    def process_item(self, item, spider):
        self.logger.debug(f'Processing item: {item}')

        if not item.get('price'):
            self.items_dropped += 1
            self.logger.warning(f'Dropping item with no price: {item.get("name")}')
            raise DropItem('Missing price')

        # Clean the price
        price = item['price'].replace('$', '').replace(',', '')
        try:
            item['price'] = float(price)
            self.items_processed += 1
            self.logger.info(f'Processed item: {item["name"]} at ${item["price"]}')
        except ValueError:
            self.items_dropped += 1
            self.logger.error(f'Invalid price format: {item["price"]}')
            raise DropItem('Invalid price')

        return item

    def close_spider(self, spider):
        self.logger.info(f'Pipeline stats: {self.items_processed} processed, {self.items_dropped} dropped')
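One thing that's easy to forget: the pipeline only runs (and therefore only logs) if it's enabled. Assuming your project package is called myproject (adjust the path to match yours):
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # lower numbers run earlier
}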
Silencing Noisy Logs
Scrapy logs A LOT. Sometimes too much. Here's how to quiet specific parts:
Hide Specific Log Categories
# settings.py
import logging
# Reduce chattiness of some components
logging.getLogger('scrapy.core.engine').setLevel(logging.WARNING)
logging.getLogger('scrapy.downloadermiddlewares').setLevel(logging.WARNING)
Hide HTTP Error Logs
When scraping, you'll often hit 404s or 500s. These create lots of WARNING logs. To hide them:
# At the top of your spider file
import logging

# In your spider's __init__
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    logging.getLogger('scrapy.spidermiddlewares.httperror').setLevel(logging.ERROR)
Show Only Your Spider's Logs
# settings.py
LOG_LEVEL = 'INFO'

# In your spider module, after the imports and before the spider class:
import logging

logging.getLogger('scrapy').setLevel(logging.WARNING)  # Quiet Scrapy's own loggers
Now you'll only see your spider's log messages.
Debugging with Logs
Track Request Flow
def start_requests(self):
    for url in self.start_urls:
        self.logger.info(f'Requesting: {url}')
        yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    self.logger.info(f'Received response from: {response.url}')
    self.logger.info(f'Status code: {response.status}')
    self.logger.debug(f'Response length: {len(response.body)} bytes')
Log Selector Results
def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    if not products:
        self.logger.warning('No products found! Selector might be wrong.')
        self.logger.debug(f'Page HTML: {response.text[:500]}')  # First 500 chars
Log Exceptions Properly
def parse_product(self, product):
    try:
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()
        return {'name': name, 'price': price}
    except Exception as e:
        self.logger.error(f'Error parsing product: {e}', exc_info=True)
        # exc_info=True adds the full stack trace
With exc_info=True, you get the full error traceback in your logs. Super helpful for debugging.
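A shorthand worth knowing: inside an except block, self.logger.exception(...) does the same thing as logging at ERROR level with exc_info=True. A minimal sketch:
def parse_product(self, product):
    try:
        return {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get(),
        }
    except Exception:
        # Equivalent to self.logger.error(..., exc_info=True)
        self.logger.exception('Error parsing product')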
Tips Nobody Tells You
Tip #1: Log Data Quality Issues
def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        # Log data quality
        if not name:
            self.logger.warning(f'Missing name at {response.url}')
        if not price:
            self.logger.warning(f'Missing price for {name}')

        yield {'name': name, 'price': price}
This helps you spot issues with the website's data, not just your selectors.
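If the same issue shows up on thousands of items, one warning per item gets noisy fast. An alternative sketch: count the issues and log a single summary when the spider closes (the counter names here are just illustrative):
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.quality_issues = {'missing_name': 0, 'missing_price': 0}

def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()
        if not name:
            self.quality_issues['missing_name'] += 1
        if not price:
            self.quality_issues['missing_price'] += 1
        yield {'name': name, 'price': price}

def closed(self, reason):
    self.logger.info(f'Data quality issues: {self.quality_issues}')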
Tip #2: Log Progress for Long Scrapes
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.page_count = 0

def parse(self, response):
    self.page_count += 1

    if self.page_count % 10 == 0:
        self.logger.info(f'Progress: Scraped {self.page_count} pages so far')
For spiders that run for hours, this shows you're still making progress.
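Scrapy also keeps its own stats collector, which gets printed in the final stats dump when the spider closes. You can feed your own counters into it instead of (or alongside) log messages; a small sketch with an illustrative key name:
def parse(self, response):
    # Shows up in the final stats dump under 'custom/pages_scraped'
    self.crawler.stats.inc_value('custom/pages_scraped')
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}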
Tip #3: Log What You're NOT Scraping
def parse(self, response):
    for product in response.css('.product'):
        if product.css('.out-of-stock'):
            self.logger.debug(f'Skipping out-of-stock product: {product.css("h2::text").get()}')
            continue  # Skip this product

        yield self.parse_product(product)
Knowing what you skipped helps debug incomplete data.
Tip #4: Different Logs for Different Spiders
Running multiple spiders? Keep their logs separate:
scrapy crawl spider1 --logfile=spider1.log
scrapy crawl spider2 --logfile=spider2.log
Or in settings:
# settings.py (for spider1) - BOT_NAME is the variable defined near the top of this file
LOG_FILE = f'{BOT_NAME}_spider1.log'
Common Mistakes
Mistake #1: Using print() Instead of Logging
# WRONG
def parse(self, response):
    print('Found products')  # Don't do this!

# RIGHT
def parse(self, response):
    self.logger.info('Found products')
print() doesn't respect log levels or file output. Always use the logger.
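If you're stuck with legacy print() calls you can't remove yet, Scrapy's LOG_STDOUT setting at least redirects standard output into the log so those messages aren't lost:
# settings.py
LOG_STDOUT = True  # print() output is redirected into the log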
Mistake #2: Logging Too Much in Production
# WRONG (logs every single item!)
def parse(self, response):
    for product in response.css('.product'):
        self.logger.info(f'Scraping: {product.css("h2::text").get()}')
        yield {...}

# RIGHT (log summaries)
def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Scraping {len(products)} products from page')
    for product in products:
        yield {...}
In production with thousands of items, logging each one creates massive log files.
Mistake #3: Not Logging Failures
# WRONG (fails silently)
def parse_product(self, product):
    name = product.css('h2::text').get()
    return {'name': name}

# RIGHT (logs the issue)
def parse_product(self, product):
    name = product.css('h2::text').get()
    if not name:
        self.logger.error('Failed to extract name')
    return {'name': name}
Production Spider with Complete Logging
Here's a complete, production-ready spider with proper logging:
import scrapy
import logging

class ProductionSpider(scrapy.Spider):
    name = 'production'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'production_spider.log',
        'LOG_FILE_APPEND': True,
        'LOG_FORMAT': '%(asctime)s [%(name)s] %(levelname)s: %(message)s',
        'LOG_DATEFORMAT': '%Y-%m-%d %H:%M:%S'
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'pages_scraped': 0,
            'items_scraped': 0,
            'items_failed': 0,
            'errors': 0
        }

    def start_requests(self):
        self.logger.info('=' * 60)
        self.logger.info('SPIDER STARTED')
        self.logger.info(f'Starting URLs: {self.start_urls}')
        self.logger.info('=' * 60)

        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        self.stats['pages_scraped'] += 1
        self.logger.info(f'Processing page {self.stats["pages_scraped"]}: {response.url}')

        products = response.css('.product')
        if not products:
            self.logger.warning(f'No products found on {response.url}')
            return

        for product in products:
            try:
                item = self.parse_product(product, response)
                self.stats['items_scraped'] += 1
                yield item
            except Exception as e:
                self.stats['items_failed'] += 1
                self.logger.error(f'Failed to parse product: {e}', exc_info=True)

        # Log progress every 5 pages
        if self.stats['pages_scraped'] % 5 == 0:
            self.log_progress()

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse, errback=self.handle_error)

    def parse_product(self, product, response):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        if not name:
            raise ValueError('Missing product name')
        if not price:
            self.logger.warning(f'Product {name} has no price')

        return {
            'name': name.strip() if name else None,
            'price': price.strip() if price else None,
            'url': response.urljoin(product.css('a::attr(href)').get())
        }

    def handle_error(self, failure):
        self.stats['errors'] += 1
        self.logger.error(f'Request failed: {failure.value}')
        self.logger.error(f'URL: {failure.request.url}')

    def log_progress(self):
        self.logger.info('-' * 60)
        self.logger.info('PROGRESS REPORT')
        self.logger.info(f'Pages scraped: {self.stats["pages_scraped"]}')
        self.logger.info(f'Items scraped: {self.stats["items_scraped"]}')
        self.logger.info(f'Failed items: {self.stats["items_failed"]}')
        self.logger.info(f'Errors: {self.stats["errors"]}')
        self.logger.info('-' * 60)

    def closed(self, reason):
        self.logger.info('=' * 60)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Reason: {reason}')
        self.log_progress()
        self.logger.info('=' * 60)
This spider logs:
- When it starts and ends
- Progress every 5 pages
- All errors with full details
- Final statistics
Perfect for production!
Quick Reference
Log Levels (In Order)
self.logger.debug('Detailed debugging info')
self.logger.info('General information')
self.logger.warning('Something unexpected happened')
self.logger.error('Something broke')
self.logger.critical('Everything is on fire')
Settings
# settings.py
# Set minimum log level
LOG_LEVEL = 'INFO' # or DEBUG, WARNING, ERROR, CRITICAL
# Save to file
LOG_FILE = 'spider.log'
# Append instead of overwrite
LOG_FILE_APPEND = True
# Custom format
LOG_FORMAT = '%(asctime)s [%(levelname)s] %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
# Disable logging completely
LOG_ENABLED = False
Command Line
# Change log level
scrapy crawl myspider --loglevel=DEBUG
# Save to file
scrapy crawl myspider --logfile=spider.log
# Multiple options
scrapy crawl myspider --loglevel=INFO --logfile=spider.log
Summary
Logging is your spider's voice. It tells you:
- What it's doing right now
- What it found
- What went wrong
- How long things took
Key takeaways:
- Use self.logger in spiders
- Five levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Save logs to files for production
- Log errors with exc_info=True for stack traces
- Log progress for long-running scrapers
- Use appropriate log levels (don't log everything as INFO!)
Start adding logs to your spiders today. When something breaks (and it will), you'll have the information you need to fix it quickly.
Happy scraping! 🕷️