Muhammad Ikramullah Khan

Scrapy Log Files: Save, Rotate, and Organize Your Crawler Logs

My spider ran for 6 hours on the server. It scraped 50,000 products successfully. Or did it? I had no idea.

The console output was gone. No logs saved. No way to check what happened. Did it hit errors? Which pages failed? How many items were actually scraped?

That's when I learned that logs aren't optional. They're essential. Let me show you how to save, manage, and organize Scrapy logs properly.


The Problem: Console Logs Disappear

What Happens by Default

When you run Scrapy:

scrapy crawl myspider

Logs appear in console:

2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
2024-01-15 10:30:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com>
2024-01-15 10:30:24 [myspider] INFO: Scraped item: {'name': 'Product 1'}

But once you close the terminal (or the SSH session drops):

  • The scrollback is gone
  • Logs gone forever
  • No way to review what happened

On a server with nohup:

  • Logs go to nohup.out
  • File grows huge
  • Mixed with other output
  • Hard to search

You need:

  • Logs saved to specific files
  • Organized by spider and date
  • Easy to find and search
  • Automatic cleanup

Basic Log File Configuration

Method 1: Command Line

Simplest way:

scrapy crawl myspider --logfile=spider.log

All logs go to spider.log.

Method 2: Settings (Better)

In settings.py:

# Enable log file
LOG_FILE = 'logs/spider.log'

# Set log level
LOG_LEVEL = 'INFO'  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL

Now every spider run saves to that file.

Understanding Log Levels

DEBUG (Most verbose; Scrapy's default)

  • Every request crawled and every item scraped
  • Retries, redirects, filtered duplicates
  • Use for development only

INFO (Recommended)

  • Spider opened/closed
  • Periodic progress counts ("Crawled X pages ... scraped Y items")
  • Final stats dump
  • Good for production

WARNING

  • Potential issues
  • Dropped items
  • Worth reviewing, but the crawl continues

ERROR

  • Actual errors
  • Failed requests
  • Pipeline errors

CRITICAL

  • Severe errors only
  • Spider crashes

My recommendation: Use INFO for production, DEBUG for development.
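
To make the levels concrete, here's a minimal sketch of a spider callback logging through self.logger at each level. The URL and CSS selectors are just placeholders:

import scrapy

class LevelDemoSpider(scrapy.Spider):
    name = 'leveldemo'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        # DEBUG: noisy detail you only want while developing
        self.logger.debug('Parsing %s (%d bytes)', response.url, len(response.body))

        title = response.css('title::text').get()
        if title:
            # INFO: normal progress worth keeping in production logs
            self.logger.info('Scraped title: %s', title)
            yield {'title': title}
        else:
            # WARNING: unexpected, but the crawl can continue
            self.logger.warning('No <title> found on %s', response.url)

        price_text = response.css('.price::text').get()  # placeholder selector
        try:
            float(price_text)
        except (TypeError, ValueError):
            # ERROR: data you expected to parse but could not
            self.logger.error('Could not parse a price on %s', response.url)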


Timestamped Log Files

Don't send every run to the same file (Scrapy appends to LOG_FILE by default, so runs blur together). Use timestamps:

Dynamic Log File Name

# settings.py
from datetime import datetime

LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'

Creates files like:

logs/spider_20240115_103022.log
logs/spider_20240115_143155.log
logs/spider_20240115_180420.log

Each run gets its own log file!

Per-Spider Log Files

Different log for each spider:

# In spider file
import scrapy
from datetime import datetime

class MySpider(scrapy.Spider):
    name = 'myspider'

    # name is visible inside the class body, so this becomes
    # logs/myspider_<timestamp>.log (make sure logs/ exists)
    custom_settings = {
        'LOG_FILE': f'logs/{name}_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
    }

Or keep a helper in settings and call it from each spider:

# settings.py
from datetime import datetime

def get_log_file(spider_name):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f'logs/{spider_name}/{timestamp}.log'

# Scrapy won't call this helper by itself; import it into each spider
# and use it in custom_settings (see below)
LOG_FILE = 'logs/default.log'  # Fallback for spiders that don't override it

Then in spider:

import scrapy
from myproject.settings import get_log_file

class MySpider(scrapy.Spider):
    name = 'products'

    custom_settings = {
        'LOG_FILE': get_log_file('products')  # logs/products/ must exist (see below)
    }

Organizing Log Directory Structure

Good Structure

logs/
├── products/
│   ├── 20240115_103022.log
│   ├── 20240115_143155.log
│   └── 20240116_091234.log
├── categories/
│   ├── 20240115_110045.log
│   └── 20240115_160312.log
└── reviews/
    ├── 20240115_120034.log
    └── 20240115_180520.log

Benefits:

  • Easy to find logs by spider
  • Clear organization
  • Can delete per-spider logs

Create Log Directory Automatically

# settings.py
from datetime import datetime
import os

# Ensure log directory exists
LOG_DIR = 'logs'
os.makedirs(LOG_DIR, exist_ok=True)

LOG_FILE = f'{LOG_DIR}/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'

Or in spider:

import os
import scrapy
from datetime import datetime

# Create the per-spider log directory before the class definition:
# custom_settings is read by Scrapy before the spider is instantiated,
# so setting it inside __init__ would have no effect
LOG_DIR = 'logs/products'
os.makedirs(LOG_DIR, exist_ok=True)


class MySpider(scrapy.Spider):
    name = 'products'

    custom_settings = {
        'LOG_FILE': f'{LOG_DIR}/{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
    }

Log Rotation (Prevent Huge Files)

Problem: Logs Get Too Big

Long-running spiders create massive logs:

  • 100 MB log files
  • Hard to open
  • Slow to search
  • Fill up disk

Solution 1: Size-Based Rotation

Split log when it reaches size limit:

# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'

# Rotation isn't built into Scrapy's settings, so it needs a custom extension
EXTENSIONS = {
    'myproject.extensions.RotatingLogExtension': 500
}

Create extension:

# extensions.py
import logging
from logging.handlers import RotatingFileHandler

class RotatingLogExtension:
    def __init__(self, log_file, max_bytes, backup_count):
        self.log_file = log_file
        self.max_bytes = max_bytes
        self.backup_count = backup_count

    @classmethod
    def from_crawler(cls, crawler):
        log_file = crawler.settings.get('LOG_FILE')
        max_bytes = crawler.settings.getint('LOG_MAX_BYTES', 10 * 1024 * 1024)  # 10 MB
        backup_count = crawler.settings.getint('LOG_BACKUP_COUNT', 5)

        ext = cls(log_file, max_bytes, backup_count)

        # Set up the rotating handler
        handler = RotatingFileHandler(
            log_file,
            maxBytes=max_bytes,
            backupCount=backup_count
        )
        handler.setFormatter(logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        ))

        # Attach it to the root logger so every Scrapy logger feeds into it.
        # Caveat: with LOG_FILE set, Scrapy also installs its own plain
        # FileHandler for the same path, so expect some duplicate lines there.
        logger = logging.getLogger()
        logger.addHandler(handler)

        return ext

Settings:

# settings.py
LOG_FILE = 'logs/spider.log'
LOG_MAX_BYTES = 10 * 1024 * 1024  # 10 MB
LOG_BACKUP_COUNT = 5  # Keep 5 old logs

EXTENSIONS = {
    'myproject.extensions.RotatingLogExtension': 500
}

Creates files:

logs/spider.log         # Current
logs/spider.log.1       # Previous
logs/spider.log.2       # Older
logs/spider.log.3
logs/spider.log.4
logs/spider.log.5

Solution 2: Time-Based (New File Each Day)

Better for long-running spiders:

# extensions.py
import logging
from logging.handlers import TimedRotatingFileHandler

class DailyRotatingLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        log_file = crawler.settings.get('LOG_FILE', 'logs/spider.log')

        # Rotate at midnight, keep 7 days
        handler = TimedRotatingFileHandler(
            log_file,
            when='midnight',
            interval=1,
            backupCount=7
        )

        formatter = logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        handler.setFormatter(formatter)

        logger = logging.getLogger()
        logger.addHandler(handler)

        return cls()
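
Register it like the size-based extension (the module path assumes the extensions.py layout above). Leaving LOG_FILE unset here is deliberate: the extension then falls back to its own default path and Scrapy doesn't attach a second, non-rotating file handler to the same file:

# settings.py
LOG_LEVEL = 'INFO'

# LOG_FILE intentionally not set: the extension owns the rotating file
# handler and writes to logs/spider.log (make sure logs/ exists)
EXTENSIONS = {
    'myproject.extensions.DailyRotatingLogExtension': 500,
}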

Creates files:

logs/spider.log                    # Today
logs/spider.log.2024-01-14         # Yesterday
logs/spider.log.2024-01-13
logs/spider.log.2024-01-12

Viewing Logs on Server

Method 1: tail -f (Live Viewing)

Watch logs in real-time:

tail -f logs/spider.log

See new lines as they're written. Press Ctrl+C to stop.

Method 2: tail with Line Count

Show last 100 lines:

tail -n 100 logs/spider.log

Method 3: less (Browse Entire File)

less logs/spider.log

Navigation:

  • Arrow keys to move
  • Space bar for next page
  • / to search
  • q to quit

Method 4: grep (Search Logs)

Find errors:

grep "ERROR" logs/spider.log

Find specific item:

grep "Product ABC" logs/spider.log

Count errors:

grep -c "ERROR" logs/spider.log

Show errors with context (3 lines before/after):

grep -A 3 -B 3 "ERROR" logs/spider.log

Method 5: Multiple grep (Complex Searches)

Find errors in specific spider:

grep "myspider" logs/spider.log | grep "ERROR"

Find scraped items with specific field:

grep "Scraped item" logs/spider.log | grep "price"

Log Analysis Scripts

Count Scraped Items

#!/bin/bash
# count_items.sh
# Note: Scrapy logs its "Scraped from ..." lines at DEBUG level, so the
# item count only works on logs written with LOG_LEVEL = 'DEBUG'.

LOG_FILE="$1"

if [ -z "$LOG_FILE" ]; then
    echo "Usage: ./count_items.sh <log_file>"
    exit 1
fi

ITEMS=$(grep -c "Scraped from" "$LOG_FILE")
DROPPED=$(grep -c "Dropped" "$LOG_FILE")
ERRORS=$(grep -c "ERROR" "$LOG_FILE")

echo "Statistics for: $LOG_FILE"
echo "================================"
echo "Items scraped: $ITEMS"
echo "Items dropped: $DROPPED"
echo "Errors: $ERRORS"

Usage:

./count_items.sh logs/spider.log

Output:

Statistics for: logs/spider.log
================================
Items scraped: 5430
Items dropped: 12
Errors: 3

Extract Error Messages

#!/bin/bash
# extract_errors.sh

LOG_FILE="$1"

echo "Errors in $LOG_FILE:"
echo "================================"

grep "ERROR" "$LOG_FILE" | while read line; do
    echo "$line"
    echo ""
done

Generate Summary Report

#!/usr/bin/env python3
# analyze_log.py

import sys
import re
from collections import Counter

def analyze_log(log_file):
    with open(log_file, 'r') as f:
        lines = f.readlines()

    stats = {
        'total_lines': len(lines),
        'items_scraped': 0,
        'items_dropped': 0,
        'requests': 0,
        'errors': 0,
        'warnings': 0
    }

    error_types = Counter()

    for line in lines:
        if 'Scraped from' in line:
            stats['items_scraped'] += 1
        elif 'Dropped' in line:
            stats['items_dropped'] += 1
        elif 'Crawled' in line:
            stats['requests'] += 1
        elif 'ERROR' in line:
            stats['errors'] += 1
            # Extract error type
            match = re.search(r'ERROR: (.+?):', line)
            if match:
                error_types[match.group(1)] += 1
        elif 'WARNING' in line:
            stats['warnings'] += 1

    print(f"Log Analysis: {log_file}")
    print("="*60)
    print(f"Total lines: {stats['total_lines']}")
    print(f"Items scraped: {stats['items_scraped']}")
    print(f"Items dropped: {stats['items_dropped']}")
    print(f"Requests: {stats['requests']}")
    print(f"Errors: {stats['errors']}")
    print(f"Warnings: {stats['warnings']}")

    if error_types:
        print("\nError types:")
        for error, count in error_types.most_common():
            print(f"  {error}: {count}")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python analyze_log.py <log_file>")
        sys.exit(1)

    analyze_log(sys.argv[1])

Usage:

python analyze_log.py logs/spider.log

Cleaning Old Logs

Manual Cleanup

Delete logs older than 7 days:

find logs/ -name "*.log" -mtime +7 -delete

Delete all but last 10 log files:

ls -t logs/*.log | tail -n +11 | xargs rm -f

Automatic Cleanup Script

#!/bin/bash
# cleanup_logs.sh

LOG_DIR="logs"
DAYS_TO_KEEP=7

echo "Cleaning logs older than $DAYS_TO_KEEP days in $LOG_DIR"

# Count files before
BEFORE=$(find "$LOG_DIR" -name "*.log" | wc -l)

# Delete old files
find "$LOG_DIR" -name "*.log" -mtime +$DAYS_TO_KEEP -delete

# Count files after
AFTER=$(find "$LOG_DIR" -name "*.log" | wc -l)

DELETED=$((BEFORE - AFTER))

echo "Deleted $DELETED log files"
echo "Remaining: $AFTER log files"

Cron Job for Automatic Cleanup

Run cleanup daily at 2 AM:

crontab -e

Add:

0 2 * * * /home/user/scrapy_project/cleanup_logs.sh >> /home/user/scrapy_project/logs/cleanup.log 2>&1
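
If cron isn't available (for example on Windows), a small Python script can do the same pruning. This is a sketch assuming the logs/ layout used above:

#!/usr/bin/env python3
# cleanup_logs.py -- delete .log files older than DAYS_TO_KEEP
import time
from pathlib import Path

LOG_DIR = Path('logs')
DAYS_TO_KEEP = 7
cutoff = time.time() - DAYS_TO_KEEP * 24 * 60 * 60

deleted = 0
for log_file in LOG_DIR.rglob('*.log'):
    if log_file.stat().st_mtime < cutoff:
        log_file.unlink()
        deleted += 1

print(f'Deleted {deleted} log files older than {DAYS_TO_KEEP} days')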

Separate Error Logs

Save errors to separate file:

# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'

# Custom logging setup
EXTENSIONS = {
    'myproject.extensions.ErrorLogExtension': 500
}

Create extension:

# extensions.py
import logging

class ErrorLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        # Create error file handler
        error_handler = logging.FileHandler('logs/errors.log')
        error_handler.setLevel(logging.ERROR)

        formatter = logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        error_handler.setFormatter(formatter)

        # Add to root logger
        logger = logging.getLogger()
        logger.addHandler(error_handler)

        return cls()

Now you have:

  • logs/spider.log - All logs
  • logs/errors.log - Only errors
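
A quick way to sanity-check the split, from inside any spider callback (only records at ERROR and above reach the second file):

    def parse(self, response):
        self.logger.info('This line goes to spider.log only')
        self.logger.error('This line goes to spider.log AND errors.log')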

Log Format Customization

Change Log Format

# settings.py
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

Default format:

2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened

Custom format:

LOG_FORMAT = '[%(levelname)s] %(name)s: %(message)s'

Output:

[INFO] scrapy.core.engine: Spider opened

Add Custom Fields

The format string can reference extra fields, but something has to put them on each log record first. A plain Formatter subclass sitting in settings.py is never picked up by Scrapy; the standard way is a logging.Filter attached to the handler that uses the field:

# extensions.py
import logging

class SpiderNameFilter(logging.Filter):
    def __init__(self, spider_name):
        super().__init__()
        self.spider_name = spider_name

    def filter(self, record):
        record.spider_name = self.spider_name  # add the custom field
        return True

# Attach it (plus a format using the field) to a handler you create yourself,
# for example in an extension like ErrorLogExtension above:
#   handler.addFilter(SpiderNameFilter('myspider'))
#   handler.setFormatter(logging.Formatter(
#       '%(asctime)s [%(spider_name)s] %(levelname)s: %(message)s'))

Complete Production Setup

Directory Structure

scrapy_project/
├── logs/
│   ├── products/
│   ├── categories/
│   └── reviews/
├── scripts/
│   ├── analyze_log.py
│   ├── cleanup_logs.sh
│   └── monitor_logs.sh
├── myproject/
│   ├── settings.py
│   ├── extensions.py
│   └── spiders/
└── scrapy.cfg

settings.py

import os
from datetime import datetime

# Ensure log directory exists
LOG_BASE_DIR = 'logs'
os.makedirs(LOG_BASE_DIR, exist_ok=True)

# Default log file (per spider will override)
LOG_FILE = f'{LOG_BASE_DIR}/default_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'

# Log level
LOG_LEVEL = 'INFO'

# Log format
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

# Redirect the process's stdout/stderr (e.g. print statements) into the log if True
LOG_STDOUT = False

# Extensions
EXTENSIONS = {
    'myproject.extensions.ErrorLogExtension': 500,
}

Spider with Custom Logging

import scrapy
import os
from datetime import datetime

# Build the log path before the class definition: custom_settings must be a
# class attribute, because Scrapy reads it before the spider is instantiated,
# so setting it inside __init__ has no effect
LOG_DIR = 'logs/products'
os.makedirs(LOG_DIR, exist_ok=True)
LOG_FILE = f'{LOG_DIR}/{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'


class ProductSpider(scrapy.Spider):
    name = 'products'

    custom_settings = {
        'LOG_FILE': LOG_FILE,
        'LOG_LEVEL': 'INFO'
    }

    def start_requests(self):
        self.logger.info(f'Logging to: {self.settings.get("LOG_FILE")}')

        urls = ['https://example.com/products']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

            self.logger.info(f'Scraped: {item["name"]}')
            yield item

    def closed(self, reason):
        stats = self.crawler.stats.get_stats()

        self.logger.info('=' * 60)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Items scraped: {stats.get("item_scraped_count", 0)}')
        self.logger.info(f'Items dropped: {stats.get("item_dropped_count", 0)}')
        self.logger.info(f'Requests: {stats.get("downloader/request_count", 0)}')
        self.logger.info(f'Errors: {stats.get("log_count/ERROR", 0)}')
        self.logger.info('=' * 60)

Monitoring Logs in Real-Time

Simple Monitor Script

#!/bin/bash
# monitor_logs.sh

LOG_DIR="logs"
SPIDER_NAME="${1:-products}"
LOG_FILE=$(ls -t $LOG_DIR/$SPIDER_NAME/*.log 2>/dev/null | head -n 1)

if [ -z "$LOG_FILE" ]; then
    echo "No log files found for spider: $SPIDER_NAME"
    exit 1
fi

echo "Monitoring: $LOG_FILE"
echo "Press Ctrl+C to stop"
echo ""

tail -f "$LOG_FILE"

Usage:

./monitor_logs.sh products

Monitor with Filtering

#!/bin/bash
# monitor_errors.sh

LOG_DIR="logs"
SPIDER_NAME="${1:-products}"
LOG_FILE=$(ls -t $LOG_DIR/$SPIDER_NAME/*.log 2>/dev/null | head -n 1)

if [ -z "$LOG_FILE" ]; then
    echo "No log files found for spider: $SPIDER_NAME"
    exit 1
fi

echo "Monitoring ERRORS in: $LOG_FILE"
echo "Press Ctrl+C to stop"
echo ""

tail -f "$LOG_FILE" | grep --line-buffered "ERROR"

Common Mistakes

Mistake #1: Not Creating Log Directory

# BAD
LOG_FILE = 'logs/spider.log'
# Fails if logs/ doesn't exist

# GOOD
import os
os.makedirs('logs', exist_ok=True)
LOG_FILE = 'logs/spider.log'

Mistake #2: Same Log File for All Runs

# BAD
LOG_FILE = 'spider.log'
# Every run appends to the same file and the runs blur together

# GOOD
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'

Mistake #3: Too Verbose Logging

# BAD (for production)
LOG_LEVEL = 'DEBUG'
# Creates massive log files

# GOOD
LOG_LEVEL = 'INFO'  # production; switch to 'DEBUG' only while developing
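
One way to avoid editing settings by hand for every environment is to read the level from an environment variable (the variable name here is just an example):

# settings.py
import os

# Defaults to INFO; export SCRAPY_LOG_LEVEL=DEBUG while developing
LOG_LEVEL = os.environ.get('SCRAPY_LOG_LEVEL', 'INFO')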

Mistake #4: No Log Cleanup

Logs fill up disk space! Always implement cleanup.

# Check disk space
df -h

# Clean old logs
find logs/ -name "*.log" -mtime +7 -delete

Quick Reference

Basic Configuration

# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'

Timestamped Logs

from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'

View Logs

tail -f logs/spider.log          # Live view
tail -n 100 logs/spider.log      # Last 100 lines
grep "ERROR" logs/spider.log     # Find errors
less logs/spider.log             # Browse entire file

Clean Old Logs

find logs/ -name "*.log" -mtime +7 -delete

Summary

Why log files matter:

  • Review past runs
  • Debug production issues
  • Track statistics
  • Monitor spider health

Basic setup:

LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'

Best practices:

  • Use timestamped filenames
  • Organize by spider name
  • Implement log rotation
  • Clean old logs regularly
  • Use INFO level in production
  • Create log directory automatically

Viewing logs:

  • tail -f for live monitoring
  • grep for searching
  • less for browsing
  • Write analysis scripts

Remember:

  • Always save logs to files
  • Don't let logs fill disk
  • Clean old logs
  • Use appropriate log level
  • Separate logs per spider

Log files are your time machine. They let you see exactly what happened, even days later. Set them up properly from the start!

Happy scraping! 🕷️
