My spider ran for 6 hours on the server. It scraped 50,000 products successfully. Or did it? I had no idea.
The console output was gone. No logs saved. No way to check what happened. Did it hit errors? Which pages failed? How many items were actually scraped? There was no way to tell.
That's when I learned that logs aren't optional. They're essential. Let me show you how to save, manage, and organize Scrapy logs properly.
The Problem: Console Logs Disappear
What Happens by Default
When you run Scrapy:
scrapy crawl myspider
Logs appear in console:
2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
2024-01-15 10:30:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com>
2024-01-15 10:30:24 [myspider] INFO: Scraped item: {'name': 'Product 1'}
But when the spider finishes (or the terminal session closes):
- The console output scrolls away
- The logs are gone for good
- There's no way to review what happened
On a server with nohup:
- Logs go to nohup.out
- File grows huge
- Mixed with other output
- Hard to search
You need:
- Logs saved to specific files
- Organized by spider and date
- Easy to find and search
- Automatic cleanup
Basic Log File Configuration
Method 1: Command Line
Simplest way:
scrapy crawl myspider --logfile=spider.log
All logs go to spider.log.
Method 2: Settings (Better)
In settings.py:
# Enable log file
LOG_FILE = 'logs/spider.log'
# Set log level
LOG_LEVEL = 'INFO' # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
Now every spider run saves to that file.
Understanding Log Levels
DEBUG (Scrapy's default, most verbose)
- Every request and response
- Selector queries
- Middleware operations
- Use for development only
INFO (Recommended)
- Spider opened/closed
- Items scraped
- Requests made
- Good for production
WARNING
- Potential issues
- Retries
- Ignored requests
ERROR
- Actual errors
- Failed requests
- Pipeline errors
CRITICAL
- Severe errors only
- Spider crashes
My recommendation: Use INFO for production, DEBUG for development.
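To see the levels in action, here is a minimal sketch of a spider that logs at several of them (the spider name, URL, and messages are made up for illustration). With LOG_LEVEL = 'INFO', the debug line never reaches the log file; switch to DEBUG during development and it appears:
import scrapy

class LevelDemoSpider(scrapy.Spider):
    name = 'level_demo'  # hypothetical example spider
    start_urls = ['https://example.com']

    def parse(self, response):
        # DEBUG: only written when LOG_LEVEL = 'DEBUG'
        self.logger.debug(f'Response headers: {response.headers}')
        # INFO: normal progress messages
        self.logger.info(f'Parsing {response.url}')
        title = response.css('title::text').get()
        if title is None:
            # WARNING: unexpected but not fatal
            self.logger.warning(f'No <title> found on {response.url}')
        yield {'url': response.url, 'title': title}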
Timestamped Log Files
Don't overwrite logs! Use timestamps:
Dynamic Log File Name
# settings.py
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
Creates files like:
logs/spider_20240115_103022.log
logs/spider_20240115_143155.log
logs/spider_20240115_180420.log
Each run gets its own log file!
Per-Spider Log Files
Different log for each spider:
# In spider file
from datetime import datetime

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        # Evaluated once, at class definition time
        'LOG_FILE': f'logs/{name}_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
    }
Or in settings with spider name:
# settings.py
from datetime import datetime

def get_log_file(spider_name):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f'logs/{spider_name}/{timestamp}.log'

# Scrapy won't call this helper automatically; import it from your spiders
# (or inline the same f-string, as in the next snippet) when building custom_settings.
LOG_FILE = 'logs/default.log'  # Fallback for spiders that don't override it
Then in spider:
from datetime import datetime

import scrapy

class MySpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'LOG_FILE': f'logs/products/{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
    }
Organizing Log Directory Structure
Good Structure
logs/
├── products/
│   ├── 20240115_103022.log
│   ├── 20240115_143155.log
│   └── 20240116_091234.log
├── categories/
│   ├── 20240115_110045.log
│   └── 20240115_160312.log
└── reviews/
    ├── 20240115_120034.log
    └── 20240115_180520.log
Benefits:
- Easy to find logs by spider (the snippet after this list shows one way to grab the newest one)
- Clear organization
- Can delete per-spider logs
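Because each spider gets its own folder, a few lines of Python are enough to locate the newest log for any spider. This is a hypothetical helper of my own (not part of Scrapy), assuming the logs/<spider>/ layout shown above:
# find_latest_log.py (hypothetical helper)
from pathlib import Path

def latest_log(spider_name, base_dir='logs'):
    """Return the most recently modified .log file for a spider, or None."""
    log_dir = Path(base_dir) / spider_name
    logs = sorted(log_dir.glob('*.log'), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None

if __name__ == '__main__':
    print(latest_log('products'))
The same idea powers the monitoring scripts later in this post, which do it with ls -t | head.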
Create Log Directory Automatically
# settings.py
from datetime import datetime
import os
# Ensure log directory exists
LOG_DIR = 'logs'
os.makedirs(LOG_DIR, exist_ok=True)
LOG_FILE = f'{LOG_DIR}/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
Or do it from the spider class itself. One catch: custom_settings is read from the spider class before the spider is instantiated, so assigning it inside __init__ has no effect. Overriding the update_settings classmethod is a reliable alternative:
import os
from datetime import datetime

import scrapy

class MySpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        # Create the per-spider log directory
        log_dir = f'logs/{cls.name}'
        os.makedirs(log_dir, exist_ok=True)
        # Timestamped log file for this run
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        settings.set('LOG_FILE', f'{log_dir}/{timestamp}.log', priority='spider')
Log Rotation (Prevent Huge Files)
Problem: Logs Get Too Big
Long-running spiders create massive logs:
- 100 MB log files
- Hard to open
- Slow to search
- Fill up disk
Solution 1: Size-Based Rotation
Split log when it reaches size limit:
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
# Size-based rotation needs a small custom extension (defined below)
EXTENSIONS = {
    'myproject.extensions.RotatingLogExtension': 500
}
Create extension:
# extensions.py
import logging
from logging.handlers import RotatingFileHandler

class RotatingLogExtension:
    def __init__(self, log_file, max_bytes, backup_count):
        self.log_file = log_file
        self.max_bytes = max_bytes
        self.backup_count = backup_count

    @classmethod
    def from_crawler(cls, crawler):
        log_file = crawler.settings.get('LOG_FILE')
        max_bytes = crawler.settings.getint('LOG_MAX_BYTES', 10 * 1024 * 1024)  # 10 MB
        backup_count = crawler.settings.getint('LOG_BACKUP_COUNT', 5)
        ext = cls(log_file, max_bytes, backup_count)
        # Set up the rotating handler with the same format Scrapy uses
        handler = RotatingFileHandler(
            log_file,
            maxBytes=max_bytes,
            backupCount=backup_count
        )
        handler.setFormatter(logging.Formatter(
            crawler.settings.get('LOG_FORMAT'),
            crawler.settings.get('LOG_DATEFORMAT')
        ))
        # Attach to the root logger so Scrapy's records go through it
        logger = logging.getLogger()
        logger.addHandler(handler)
        return ext
Settings:
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_MAX_BYTES = 10 * 1024 * 1024 # 10 MB
LOG_BACKUP_COUNT = 5 # Keep 5 old logs
EXTENSIONS = {
    'myproject.extensions.RotatingLogExtension': 500
}
Creates files:
logs/spider.log # Current
logs/spider.log.1 # Previous
logs/spider.log.2 # Older
logs/spider.log.3
logs/spider.log.4
logs/spider.log.5
Solution 2: Time-Based (New File Each Day)
Better for long-running spiders:
# extensions.py
import logging
from logging.handlers import TimedRotatingFileHandler

class DailyRotatingLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        log_file = crawler.settings.get('LOG_FILE', 'logs/spider.log')
        # Rotate at midnight, keep 7 days
        handler = TimedRotatingFileHandler(
            log_file,
            when='midnight',
            interval=1,
            backupCount=7
        )
        formatter = logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        handler.setFormatter(formatter)
        logger = logging.getLogger()
        logger.addHandler(handler)
        return cls()
Creates files:
logs/spider.log # Today
logs/spider.log.2024-01-14 # Yesterday
logs/spider.log.2024-01-13
logs/spider.log.2024-01-12
Viewing Logs on Server
Method 1: tail -f (Live Viewing)
Watch logs in real-time:
tail -f logs/spider.log
See new lines as they're written. Press Ctrl+C to stop.
Method 2: tail with Line Count
Show last 100 lines:
tail -n 100 logs/spider.log
Method 3: less (Browse Entire File)
less logs/spider.log
Navigation:
- Arrow keys to move
- Space bar for next page
- / to search
- q to quit
Method 4: grep (Search Logs)
Find errors:
grep "ERROR" logs/spider.log
Find specific item:
grep "Product ABC" logs/spider.log
Count errors:
grep -c "ERROR" logs/spider.log
Show errors with context (3 lines before/after):
grep -A 3 -B 3 "ERROR" logs/spider.log
Method 5: Multiple grep (Complex Searches)
Find errors in specific spider:
grep "myspider" logs/spider.log | grep "ERROR"
Find scraped items with specific field:
grep "Scraped item" logs/spider.log | grep "price"
Log Analysis Scripts
Count Scraped Items
#!/bin/bash
# count_items.sh
# Note: "Scraped from" and "Crawled" lines are logged at DEBUG level by default,
# so those counts only work on logs written with LOG_LEVEL = 'DEBUG'.
LOG_FILE="$1"
if [ -z "$LOG_FILE" ]; then
    echo "Usage: ./count_items.sh <log_file>"
    exit 1
fi
ITEMS=$(grep -c "Scraped from" "$LOG_FILE")
DROPPED=$(grep -c "Dropped" "$LOG_FILE")
ERRORS=$(grep -c "ERROR" "$LOG_FILE")
echo "Statistics for: $LOG_FILE"
echo "================================"
echo "Items scraped: $ITEMS"
echo "Items dropped: $DROPPED"
echo "Errors: $ERRORS"
Usage:
./count_items.sh logs/spider.log
Output:
Statistics for: logs/spider.log
================================
Items scraped: 5430
Items dropped: 12
Errors: 3
Extract Error Messages
#!/bin/bash
# extract_errors.sh
LOG_FILE="$1"
echo "Errors in $LOG_FILE:"
echo "================================"
grep "ERROR" "$LOG_FILE" | while read -r line; do
    echo "$line"
    echo ""
done
Generate Summary Report
#!/usr/bin/env python3
# analyze_log.py
import sys
import re
from collections import Counter

def analyze_log(log_file):
    with open(log_file, 'r') as f:
        lines = f.readlines()

    stats = {
        'total_lines': len(lines),
        'items_scraped': 0,
        'items_dropped': 0,
        'requests': 0,
        'errors': 0,
        'warnings': 0
    }
    error_types = Counter()

    for line in lines:
        if 'Scraped from' in line:
            stats['items_scraped'] += 1
        elif 'Dropped' in line:
            stats['items_dropped'] += 1
        elif 'Crawled' in line:
            stats['requests'] += 1
        elif 'ERROR' in line:
            stats['errors'] += 1
            # Extract error type
            match = re.search(r'ERROR: (.+?):', line)
            if match:
                error_types[match.group(1)] += 1
        elif 'WARNING' in line:
            stats['warnings'] += 1

    print(f"Log Analysis: {log_file}")
    print("=" * 60)
    print(f"Total lines: {stats['total_lines']}")
    print(f"Items scraped: {stats['items_scraped']}")
    print(f"Items dropped: {stats['items_dropped']}")
    print(f"Requests: {stats['requests']}")
    print(f"Errors: {stats['errors']}")
    print(f"Warnings: {stats['warnings']}")

    if error_types:
        print("\nError types:")
        for error, count in error_types.most_common():
            print(f"  {error}: {count}")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python analyze_log.py <log_file>")
        sys.exit(1)
    analyze_log(sys.argv[1])
Usage:
python analyze_log.py logs/spider.log
Cleaning Old Logs
Manual Cleanup
Delete logs older than 7 days:
find logs/ -name "*.log" -mtime +7 -delete
Delete all but last 10 log files:
ls -t logs/*.log | tail -n +11 | xargs rm -f
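If you would rather not rely on find (for example on a Windows server), the same cleanup is easy in Python. This is a sketch of my own, assuming the logs/ layout used throughout this post; cleanup_logs.py is not a Scrapy tool:
# cleanup_logs.py (hypothetical helper): delete .log files older than N days
import time
from pathlib import Path

def cleanup_logs(log_dir='logs', days_to_keep=7):
    cutoff = time.time() - days_to_keep * 24 * 60 * 60
    deleted = 0
    for log_file in Path(log_dir).rglob('*.log'):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            deleted += 1
    print(f'Deleted {deleted} log files older than {days_to_keep} days')

if __name__ == '__main__':
    cleanup_logs()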
Automatic Cleanup Script
#!/bin/bash
# cleanup_logs.sh
LOG_DIR="logs"
DAYS_TO_KEEP=7
echo "Cleaning logs older than $DAYS_TO_KEEP days in $LOG_DIR"
# Count files before
BEFORE=$(find "$LOG_DIR" -name "*.log" | wc -l)
# Delete old files
find "$LOG_DIR" -name "*.log" -mtime +$DAYS_TO_KEEP -delete
# Count files after
AFTER=$(find "$LOG_DIR" -name "*.log" | wc -l)
DELETED=$((BEFORE - AFTER))
echo "Deleted $DELETED log files"
echo "Remaining: $AFTER log files"
Cron Job for Automatic Cleanup
Run cleanup daily at 2 AM:
crontab -e
Add:
0 2 * * * /home/user/scrapy_project/cleanup_logs.sh >> /home/user/scrapy_project/logs/cleanup.log 2>&1
Separate Error Logs
Save errors to separate file:
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
# The extra error-only handler is added by a custom extension
EXTENSIONS = {
    'myproject.extensions.ErrorLogExtension': 500
}
Create extension:
# extensions.py
import logging

class ErrorLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        # Create error file handler
        error_handler = logging.FileHandler('logs/errors.log')
        error_handler.setLevel(logging.ERROR)
        formatter = logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        error_handler.setFormatter(formatter)
        # Add to root logger
        logger = logging.getLogger()
        logger.addHandler(error_handler)
        return cls()
Now you have:
- logs/spider.log: all logs
- logs/errors.log: only errors
Log Format Customization
Change Log Format
# settings.py
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
Default format:
2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
Custom format:
LOG_FORMAT = '[%(levelname)s] %(name)s: %(message)s'
Output:
[INFO] scrapy.core.engine: Spider opened
Add Custom Fields
# settings.py
import logging
class CustomFormatter(logging.Formatter):
def format(self, record):
record.spider_name = 'myspider' # Add custom field
return super().format(record)
LOG_FORMAT = '%(asctime)s [%(spider_name)s] %(levelname)s: %(message)s'
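Defining the formatter is only half the job: something has to attach it to a logging handler, because Scrapy won't pick it up from settings.py on its own. Here is one possible wiring, sketched as a small extension in the same style as the others in this post (SpiderNameFormatter, SpiderNameLogExtension, and the _tagged.log filename are my own names, not a Scrapy API); register it in EXTENSIONS like the other extensions:
# extensions.py (sketch, not a built-in Scrapy feature)
import logging

class SpiderNameFormatter(logging.Formatter):
    def __init__(self, fmt, spider_name):
        super().__init__(fmt)
        self.spider_name = spider_name

    def format(self, record):
        # Inject the custom field before the record is formatted
        record.spider_name = self.spider_name
        return super().format(record)

class SpiderNameLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        spider_name = crawler.spidercls.name
        # Assumes the logs/ directory already exists (see earlier sections)
        handler = logging.FileHandler(f'logs/{spider_name}_tagged.log')
        handler.setFormatter(SpiderNameFormatter(
            '%(asctime)s [%(spider_name)s] %(levelname)s: %(message)s',
            spider_name
        ))
        logging.getLogger().addHandler(handler)
        return cls()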
Complete Production Setup
Directory Structure
scrapy_project/
├── logs/
│   ├── products/
│   ├── categories/
│   └── reviews/
├── scripts/
│   ├── analyze_log.py
│   ├── cleanup_logs.sh
│   └── monitor_logs.sh
├── myproject/
│   ├── settings.py
│   ├── extensions.py
│   └── spiders/
└── scrapy.cfg
settings.py
import os
from datetime import datetime
# Ensure log directory exists
LOG_BASE_DIR = 'logs'
os.makedirs(LOG_BASE_DIR, exist_ok=True)
# Default log file (per spider will override)
LOG_FILE = f'{LOG_BASE_DIR}/default_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
# Log level
LOG_LEVEL = 'INFO'
# Log format
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
# LOG_STDOUT = True would redirect print() output (stdout/stderr) into the log;
# leave it at the default False to keep normal console behaviour
LOG_STDOUT = False
# Extensions
EXTENSIONS = {
    'myproject.extensions.ErrorLogExtension': 500,
}
Spider with Custom Logging
import os
from datetime import datetime

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def update_settings(cls, settings):
        # custom_settings/update_settings are applied before the spider is
        # instantiated, which is why the log file is configured here, not in __init__
        super().update_settings(settings)
        log_dir = f'logs/{cls.name}'
        os.makedirs(log_dir, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        settings.set('LOG_FILE', f'{log_dir}/{timestamp}.log', priority='spider')
        settings.set('LOG_LEVEL', 'INFO', priority='spider')

    def start_requests(self):
        self.logger.info(f'Logging to: {self.settings.get("LOG_FILE")}')
        urls = ['https://example.com/products']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
            self.logger.info(f'Scraped: {item["name"]}')
            yield item

    def closed(self, reason):
        stats = self.crawler.stats.get_stats()
        self.logger.info('=' * 60)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Items scraped: {stats.get("item_scraped_count", 0)}')
        self.logger.info(f'Items dropped: {stats.get("item_dropped_count", 0)}')
        self.logger.info(f'Requests: {stats.get("downloader/request_count", 0)}')
        self.logger.info(f'Errors: {stats.get("log_count/ERROR", 0)}')
        self.logger.info('=' * 60)
Monitoring Logs in Real-Time
Simple Monitor Script
#!/bin/bash
# monitor_logs.sh
LOG_DIR="logs"
SPIDER_NAME="${1:-products}"
LOG_FILE=$(ls -t $LOG_DIR/$SPIDER_NAME/*.log 2>/dev/null | head -n 1)
if [ -z "$LOG_FILE" ]; then
    echo "No log files found for spider: $SPIDER_NAME"
    exit 1
fi
echo "Monitoring: $LOG_FILE"
echo "Press Ctrl+C to stop"
echo ""
tail -f "$LOG_FILE"
Usage:
./monitor_logs.sh products
Monitor with Filtering
#!/bin/bash
# monitor_errors.sh
LOG_DIR="logs"
SPIDER_NAME="${1:-products}"
LOG_FILE=$(ls -t $LOG_DIR/$SPIDER_NAME/*.log 2>/dev/null | head -n 1)
if [ -z "$LOG_FILE" ]; then
    echo "No log files found for spider: $SPIDER_NAME"
    exit 1
fi
echo "Monitoring ERRORS in: $LOG_FILE"
echo "Press Ctrl+C to stop"
echo ""
tail -f "$LOG_FILE" | grep --line-buffered "ERROR"
Common Mistakes
Mistake #1: Not Creating Log Directory
# BAD
LOG_FILE = 'logs/spider.log'
# Fails if logs/ doesn't exist
# GOOD
import os
os.makedirs('logs', exist_ok=True)
LOG_FILE = 'logs/spider.log'
Mistake #2: Same Log File for All Runs
# BAD
LOG_FILE = 'spider.log'
# Every run writes into the same file, so runs pile up and get mixed together
# GOOD
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
Mistake #3: Too Verbose Logging
# BAD (for production)
LOG_LEVEL = 'DEBUG'
# Creates massive log files
# GOOD
LOG_LEVEL = 'INFO' # Production
LOG_LEVEL = 'DEBUG' # Development only
Mistake #4: No Log Cleanup
Logs fill up disk space! Always implement cleanup.
# Check disk space
df -h
# Clean old logs
find logs/ -name "*.log" -mtime +7 -delete
Quick Reference
Basic Configuration
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
Timestamped Logs
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
View Logs
tail -f logs/spider.log # Live view
tail -n 100 logs/spider.log # Last 100 lines
grep "ERROR" logs/spider.log # Find errors
less logs/spider.log # Browse entire file
Clean Old Logs
find logs/ -name "*.log" -mtime +7 -delete
Summary
Why log files matter:
- Review past runs
- Debug production issues
- Track statistics
- Monitor spider health
Basic setup:
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
Best practices:
- Use timestamped filenames
- Organize by spider name
- Implement log rotation
- Clean old logs regularly
- Use INFO level in production
- Create log directory automatically
Viewing logs:
- tail -f for live monitoring
- grep for searching
- less for browsing
- Write analysis scripts for summaries
Remember:
- Always save logs to files
- Don't let logs fill disk
- Clean old logs
- Use appropriate log level
- Separate logs per spider
Log files are your time machine. They let you see exactly what happened, even days later. Set them up properly from the start!
Happy scraping! 🕷️