When I first started using Scrapy, my spiders would just... stop working. No error messages. No clues. Just silence.
I'd stare at my terminal thinking "Did it even run? Where did it fail? Why isn't anything happening?"
Then I learned about logging. Suddenly, my spiders started talking to me. They'd tell me exactly what they were doing, where they were going, what they found, and when things went wrong.
Logging transformed me from a confused beginner to someone who could actually debug problems. Let me show you how to make your spiders communicate clearly.
What Is Logging, Really?
Think of logging like a diary your spider keeps. As it runs, it writes down everything it does:
- "I'm starting up now"
- "I'm visiting this URL"
- "I found 20 products"
- "Uh oh, I got a 404 error"
- "I'm done, here are my stats"
Without logging, your spider runs in complete silence. You have no idea what's happening inside.
With logging, you see everything. Every step. Every decision. Every problem.
The Five Log Levels (From Loud to Quiet)
Python has five log levels. Think of them like volume settings:
DEBUG (Loudest)
Everything. Every tiny detail. Use this when you're hunting bugs.
self.logger.debug('Checking if this element exists')
INFO
Important milestones. "I started," "I found data," "I finished."
self.logger.info('Successfully scraped 50 products')
WARNING
Something weird happened, but the spider keeps running.
self.logger.warning('Product has no price, skipping')
ERROR
Something broke, but the spider continues with other pages.
self.logger.error('Failed to parse product page')
CRITICAL (Quietest)
Everything is on fire. The spider can't continue.
self.logger.critical('Database connection lost, cannot save data')
Scrapy's default LOG_LEVEL is DEBUG, so out of the box you see everything, DEBUG messages included. Raise the level to INFO or higher when that gets too noisy; the next sections show how.
Your First Log Messages
Every spider has a built-in logger. Just use self.logger:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.logger.info('Started scraping')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            name = product.css('h2::text').get()
            if name:
                self.logger.debug(f'Scraping product: {name}')
                yield {'name': name}
            else:
                self.logger.warning('Product missing name, skipped')
Run it:
scrapy crawl myspider
You'll see your log messages mixed with Scrapy's own messages.
Controlling What You See
Change Log Level from Command Line
Only show warnings and errors:
scrapy crawl myspider --loglevel=WARNING
Show everything, including debug messages:
scrapy crawl myspider --loglevel=DEBUG
Change Log Level in Settings
Edit settings.py:
# Only show important stuff
LOG_LEVEL = 'INFO'
# Or show everything for debugging
LOG_LEVEL = 'DEBUG'
# Or only show problems
LOG_LEVEL = 'WARNING'
Saving Logs to a File
Console output disappears when you close the terminal. Save logs to a file instead:
From Command Line
scrapy crawl myspider --logfile=spider.log
From Settings
# settings.py
LOG_FILE = 'spider.log'
Now all your logs save to spider.log. Perfect for production scrapers that run for hours.
Append or Overwrite the Log File
By default, Scrapy appends to the log file on each run (LOG_FILE_APPEND defaults to True), so old logs pile up in one file. If you want a fresh file every run, turn appending off:
# settings.py
LOG_FILE = 'spider.log'
LOG_FILE_APPEND = False  # Overwrite instead of appending
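If you'd rather keep one file per day instead of one ever-growing file, you can build the filename dynamically in settings.py. A small sketch (the naming scheme here is just an example, not a Scrapy feature):
# settings.py
from datetime import datetime

# e.g. spider_2024-12-24.log; appending (the default) keeps same-day runs together
LOG_FILE = f'spider_{datetime.now():%Y-%m-%d}.log'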
Real-World Example: A Logging Spider
Let's build a spider that logs everything important:
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.products_scraped = 0
        self.products_failed = 0

    def parse(self, response):
        self.logger.info(f'Parsing page: {response.url}')

        products = response.css('.product')
        self.logger.info(f'Found {len(products)} products on this page')

        if not products:
            self.logger.warning('No products found on this page!')

        for product in products:
            try:
                item = self.parse_product(product)
                self.products_scraped += 1
                self.logger.debug(f'Scraped: {item["name"]}')
                yield item
            except Exception as e:
                self.products_failed += 1
                self.logger.error(f'Failed to parse product: {e}')

        # Follow pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            self.logger.info(f'Following next page: {next_page}')
            yield response.follow(next_page, callback=self.parse)
        else:
            self.logger.info('No more pages to scrape')

    def parse_product(self, product):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        if not name:
            raise ValueError('Product has no name')

        if not price:
            self.logger.warning(f'Product {name} has no price')
            price = 'N/A'

        return {
            'name': name.strip(),
            'price': price
        }

    def closed(self, reason):
        # Called when the spider finishes
        self.logger.info('=' * 50)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Total products scraped: {self.products_scraped}')
        self.logger.info(f'Total failures: {self.products_failed}')
        self.logger.info(f'Reason: {reason}')
        self.logger.info('=' * 50)
Run it with INFO level:
scrapy crawl products --loglevel=INFO --logfile=products.log
Your log file will have a complete record of what happened.
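Once a run finishes, you rarely want to reread the whole file. Here's a quick sketch of a helper script (assuming the log file is named products.log, as in the command above, and uses Scrapy's default "LEVEL: message" format) that counts how many warnings and errors the run produced:
# check_log.py - rough count of problems in a finished run's log
from collections import Counter

counts = Counter()
with open('products.log', encoding='utf-8') as log:
    for line in log:
        for level in ('WARNING', 'ERROR', 'CRITICAL'):
            if f'{level}:' in line:
                counts[level] += 1

print(dict(counts))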
Advanced: Custom Log Formatting
Scrapy's default logs look like this:
2024-12-24 10:30:15 [myspider] INFO: Scraped product: Widget
You can customize the format:
# settings.py
LOG_FORMAT = '%(levelname)s: %(message)s'
Now it looks like:
INFO: Scraped product: Widget
More Formatting Options
# Show date, time, level, and message
LOG_FORMAT = '%(asctime)s [%(levelname)s] %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
Output:
2024-12-24 10:30:15 [INFO] Scraped product: Widget
Show Spider Name
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
Output:
2024-12-24 10:30:15 [myspider] INFO: Scraped product: Widget
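The format string accepts any standard Python LogRecord attribute, not just the ones above. For example, %(filename)s and %(lineno)d add the source file and line that produced each message:
# settings.py
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s (%(filename)s:%(lineno)d)'
The file and line point at wherever in your code the log call was made, which makes it easy to jump straight to the right spot.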
Logging in Pipelines
Pipelines don't have self.logger like spiders do. Create your own:
# pipelines.py
import logging

from scrapy.exceptions import DropItem

class MyPipeline:
    def __init__(self):
        self.logger = logging.getLogger(__name__)
        self.items_processed = 0
        self.items_dropped = 0

    def process_item(self, item, spider):
        self.logger.debug(f'Processing item: {item}')

        if not item.get('price'):
            self.items_dropped += 1
            self.logger.warning(f'Dropping item with no price: {item.get("name")}')
            raise DropItem('Missing price')

        # Clean the price
        price = item['price'].replace('$', '').replace(',', '')
        try:
            item['price'] = float(price)
            self.items_processed += 1
            self.logger.info(f'Processed item: {item["name"]} at ${item["price"]}')
        except ValueError:
            self.items_dropped += 1
            self.logger.error(f'Invalid price format: {item["price"]}')
            raise DropItem('Invalid price')

        return item

    def close_spider(self, spider):
        self.logger.info(f'Pipeline stats: {self.items_processed} processed, {self.items_dropped} dropped')
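One thing that's easy to forget: the pipeline only runs (and therefore only logs) if it's enabled. Assuming your project package is called myproject (adjust the path to match yours):
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # lower numbers run earlier
}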
Silencing Noisy Logs
Scrapy logs A LOT. Sometimes too much. Here's how to quiet specific parts:
Hide Specific Log Categories
# settings.py
import logging
# Reduce chattiness of some components
logging.getLogger('scrapy.core.engine').setLevel(logging.WARNING)
logging.getLogger('scrapy.downloadermiddlewares').setLevel(logging.WARNING)
Hide HTTP Error Logs
When scraping, you'll often hit 404s or 500s. These create lots of WARNING logs. To hide them:
# At the top of your spider file
import logging

# In your spider's __init__
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    logging.getLogger('scrapy.spidermiddlewares.httperror').setLevel(logging.ERROR)
Show Only Your Spider's Logs
# settings.py
LOG_LEVEL = 'INFO'

# In your spider module, after the imports and before the spider class:
import logging

logging.getLogger('scrapy').setLevel(logging.WARNING)  # Quiet Scrapy's own loggers
Now you'll only see your spider's log messages.
Debugging with Logs
Track Request Flow
def start_requests(self):
    for url in self.start_urls:
        self.logger.info(f'Requesting: {url}')
        yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    self.logger.info(f'Received response from: {response.url}')
    self.logger.info(f'Status code: {response.status}')
    self.logger.debug(f'Response length: {len(response.body)} bytes')
Log Selector Results
def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products')

    if not products:
        self.logger.warning('No products found! Selector might be wrong.')
        self.logger.debug(f'Page HTML: {response.text[:500]}')  # First 500 chars
Log Exceptions Properly
def parse_product(self, product):
    try:
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()
        return {'name': name, 'price': price}
    except Exception as e:
        self.logger.error(f'Error parsing product: {e}', exc_info=True)
        # exc_info=True adds the full stack trace
With exc_info=True, you get the full error traceback in your logs. Super helpful for debugging.
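A shorthand worth knowing: inside an except block, self.logger.exception(...) does the same thing as logging at ERROR level with exc_info=True. A minimal sketch:
def parse_product(self, product):
    try:
        return {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get(),
        }
    except Exception:
        # Equivalent to self.logger.error(..., exc_info=True)
        self.logger.exception('Error parsing product')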
Tips Nobody Tells You
Tip #1: Log Data Quality Issues
def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        # Log data quality
        if not name:
            self.logger.warning(f'Missing name at {response.url}')
        if not price:
            self.logger.warning(f'Missing price for {name}')

        yield {'name': name, 'price': price}
This helps you spot issues with the website's data, not just your selectors.
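If the same issue shows up on thousands of items, one warning per item gets noisy fast. An alternative sketch: count the issues and log a single summary when the spider closes (the counter names here are just illustrative):
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.quality_issues = {'missing_name': 0, 'missing_price': 0}

def parse(self, response):
    for product in response.css('.product'):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()
        if not name:
            self.quality_issues['missing_name'] += 1
        if not price:
            self.quality_issues['missing_price'] += 1
        yield {'name': name, 'price': price}

def closed(self, reason):
    self.logger.info(f'Data quality issues: {self.quality_issues}')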
Tip #2: Log Progress for Long Scrapes
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.page_count = 0

def parse(self, response):
    self.page_count += 1

    if self.page_count % 10 == 0:
        self.logger.info(f'Progress: Scraped {self.page_count} pages so far')
For spiders that run for hours, this shows you're still making progress.
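Scrapy also keeps its own stats collector, which gets printed in the final stats dump when the spider closes. You can feed your own counters into it instead of (or alongside) log messages; a small sketch with an illustrative key name:
def parse(self, response):
    # Shows up in the final stats dump under 'custom/pages_scraped'
    self.crawler.stats.inc_value('custom/pages_scraped')
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}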
Tip #3: Log What You're NOT Scraping
def parse(self, response):
    for product in response.css('.product'):
        if product.css('.out-of-stock'):
            self.logger.debug(f'Skipping out-of-stock product: {product.css("h2::text").get()}')
            continue  # Skip this product

        yield self.parse_product(product)
Knowing what you skipped helps debug incomplete data.
Tip #4: Different Logs for Different Spiders
Running multiple spiders? Keep their logs separate:
scrapy crawl spider1 --logfile=spider1.log
scrapy crawl spider2 --logfile=spider2.log
Or in settings:
# settings.py (for spider1) - BOT_NAME is the variable defined near the top of this file
LOG_FILE = f'{BOT_NAME}_spider1.log'
Common Mistakes
Mistake #1: Using print() Instead of Logging
# WRONG
def parse(self, response):
    print('Found products')  # Don't do this!

# RIGHT
def parse(self, response):
    self.logger.info('Found products')
print() doesn't respect log levels or file output. Always use the logger.
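If you're stuck with legacy print() calls you can't remove yet, Scrapy's LOG_STDOUT setting at least redirects standard output into the log so those messages aren't lost:
# settings.py
LOG_STDOUT = True  # print() output is redirected into the log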
Mistake #2: Logging Too Much in Production
# WRONG (logs every single item!)
def parse(self, response):
    for product in response.css('.product'):
        self.logger.info(f'Scraping: {product.css("h2::text").get()}')
        yield {...}

# RIGHT (log summaries)
def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Scraping {len(products)} products from page')
    for product in products:
        yield {...}
In production with thousands of items, logging each one creates massive log files.
Mistake #3: Not Logging Failures
# WRONG (fails silently)
def parse_product(self, product):
    name = product.css('h2::text').get()
    return {'name': name}

# RIGHT (logs the issue)
def parse_product(self, product):
    name = product.css('h2::text').get()
    if not name:
        self.logger.error('Failed to extract name')
    return {'name': name}
Production Spider with Complete Logging
Here's a complete, production-ready spider with proper logging:
import scrapy
import logging

class ProductionSpider(scrapy.Spider):
    name = 'production'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'production_spider.log',
        'LOG_FILE_APPEND': True,
        'LOG_FORMAT': '%(asctime)s [%(name)s] %(levelname)s: %(message)s',
        'LOG_DATEFORMAT': '%Y-%m-%d %H:%M:%S'
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'pages_scraped': 0,
            'items_scraped': 0,
            'items_failed': 0,
            'errors': 0
        }

    def start_requests(self):
        self.logger.info('=' * 60)
        self.logger.info('SPIDER STARTED')
        self.logger.info(f'Starting URLs: {self.start_urls}')
        self.logger.info('=' * 60)

        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        self.stats['pages_scraped'] += 1
        self.logger.info(f'Processing page {self.stats["pages_scraped"]}: {response.url}')

        products = response.css('.product')
        if not products:
            self.logger.warning(f'No products found on {response.url}')
            return

        for product in products:
            try:
                item = self.parse_product(product, response)
                self.stats['items_scraped'] += 1
                yield item
            except Exception as e:
                self.stats['items_failed'] += 1
                self.logger.error(f'Failed to parse product: {e}', exc_info=True)

        # Log progress every 5 pages
        if self.stats['pages_scraped'] % 5 == 0:
            self.log_progress()

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse, errback=self.handle_error)

    def parse_product(self, product, response):
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        if not name:
            raise ValueError('Missing product name')
        if not price:
            self.logger.warning(f'Product {name} has no price')

        return {
            'name': name.strip() if name else None,
            'price': price.strip() if price else None,
            'url': response.urljoin(product.css('a::attr(href)').get())
        }

    def handle_error(self, failure):
        self.stats['errors'] += 1
        self.logger.error(f'Request failed: {failure.value}')
        self.logger.error(f'URL: {failure.request.url}')

    def log_progress(self):
        self.logger.info('-' * 60)
        self.logger.info('PROGRESS REPORT')
        self.logger.info(f'Pages scraped: {self.stats["pages_scraped"]}')
        self.logger.info(f'Items scraped: {self.stats["items_scraped"]}')
        self.logger.info(f'Failed items: {self.stats["items_failed"]}')
        self.logger.info(f'Errors: {self.stats["errors"]}')
        self.logger.info('-' * 60)

    def closed(self, reason):
        self.logger.info('=' * 60)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Reason: {reason}')
        self.log_progress()
        self.logger.info('=' * 60)
This spider logs:
- When it starts and ends
- Progress every 5 pages
- All errors with full details
- Final statistics
Perfect for production!
Quick Reference
Log Levels (In Order)
self.logger.debug('Detailed debugging info')
self.logger.info('General information')
self.logger.warning('Something unexpected happened')
self.logger.error('Something broke')
self.logger.critical('Everything is on fire')
Settings
# settings.py
# Set minimum log level
LOG_LEVEL = 'INFO' # or DEBUG, WARNING, ERROR, CRITICAL
# Save to file
LOG_FILE = 'spider.log'
# Append instead of overwrite
LOG_FILE_APPEND = True
# Custom format
LOG_FORMAT = '%(asctime)s [%(levelname)s] %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
# Disable logging completely
LOG_ENABLED = False
Command Line
# Change log level
scrapy crawl myspider --loglevel=DEBUG
# Save to file
scrapy crawl myspider --logfile=spider.log
# Multiple options
scrapy crawl myspider --loglevel=INFO --logfile=spider.log
Summary
Logging is your spider's voice. It tells you:
- What it's doing right now
- What it found
- What went wrong
- How long things took
Key takeaways:
- Use self.logger in spiders
- Five levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Save logs to files for production
- Log errors with exc_info=True for stack traces
- Log progress for long-running scrapers
- Use appropriate log levels (don't log everything as INFO!)
Start adding logs to your spiders today. When something breaks (and it will), you'll have the information you need to fix it quickly.
Happy scraping! 🕷️