My spider ran for 6 hours on the server. It scraped 50,000 products successfully. Or did it? I had no idea.
The console output was gone. No logs saved. No way to check what happened. Did it hit errors? Which pages failed? How many items were actually scraped? There was no way to tell.
That's when I learned that logs aren't optional. They're essential. Let me show you how to save, manage, and organize Scrapy logs properly.
The Problem: Console Logs Disappear
What Happens by Default
When you run Scrapy:
scrapy crawl myspider
Logs appear in console:
2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
2024-01-15 10:30:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com>
2024-01-15 10:30:24 [myspider] INFO: Scraped item: {'name': 'Product 1'}
But when the spider finishes (or the terminal session closes):
- The console output scrolls away
- The logs are gone for good
- There's no way to review what happened
On a server with nohup:
- Logs go to nohup.out
- File grows huge
- Mixed with other output
- Hard to search
You need:
- Logs saved to specific files
- Organized by spider and date
- Easy to find and search
- Automatic cleanup
Basic Log File Configuration
Method 1: Command Line
Simplest way:
scrapy crawl myspider --logfile=spider.log
All logs go to spider.log.
Method 2: Settings (Better)
In settings.py:
# Enable log file
LOG_FILE = 'logs/spider.log'
# Set log level
LOG_LEVEL = 'INFO' # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
Now every spider run saves to that file.
Understanding Log Levels
DEBUG (Scrapy's default, most verbose)
- Every request and response
- Selector queries
- Middleware operations
- Use for development only
INFO (Recommended)
- Spider opened/closed
- Items scraped
- Requests made
- Good for production
WARNING
- Potential issues
- Retries
- Ignored requests
ERROR
- Actual errors
- Failed requests
- Pipeline errors
CRITICAL
- Severe errors only
- Spider crashes
My recommendation: Use INFO for production, DEBUG for development.
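To see the levels in action, here is a minimal sketch of a spider that logs at several of them (the spider name, URL, and messages are made up for illustration). With LOG_LEVEL = 'INFO', the debug line never reaches the log file; switch to DEBUG during development and it appears:
import scrapy

class LevelDemoSpider(scrapy.Spider):
    name = 'level_demo'  # hypothetical example spider
    start_urls = ['https://example.com']

    def parse(self, response):
        # DEBUG: only written when LOG_LEVEL = 'DEBUG'
        self.logger.debug(f'Response headers: {response.headers}')
        # INFO: normal progress messages
        self.logger.info(f'Parsing {response.url}')
        title = response.css('title::text').get()
        if title is None:
            # WARNING: unexpected but not fatal
            self.logger.warning(f'No <title> found on {response.url}')
        yield {'url': response.url, 'title': title}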
Timestamped Log Files
Don't overwrite logs! Use timestamps:
Dynamic Log File Name
# settings.py
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
Creates files like:
logs/spider_20240115_103022.log
logs/spider_20240115_143155.log
logs/spider_20240115_180420.log
Each run gets its own log file!
Per-Spider Log Files
Different log for each spider:
# In spider file
from datetime import datetime

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        # Evaluated once, at class definition time
        'LOG_FILE': f'logs/{name}_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
    }
Or in settings with spider name:
# settings.py
from datetime import datetime

def get_log_file(spider_name):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f'logs/{spider_name}/{timestamp}.log'

# Scrapy won't call this helper automatically; import it from your spiders
# (or inline the same f-string, as in the next snippet) when building custom_settings.
LOG_FILE = 'logs/default.log'  # Fallback for spiders that don't override it
Then in spider:
from datetime import datetime

import scrapy

class MySpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'LOG_FILE': f'logs/products/{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
    }
Organizing Log Directory Structure
Good Structure
logs/
├── products/
│   ├── 20240115_103022.log
│   ├── 20240115_143155.log
│   └── 20240116_091234.log
├── categories/
│   ├── 20240115_110045.log
│   └── 20240115_160312.log
└── reviews/
    ├── 20240115_120034.log
    └── 20240115_180520.log
Benefits:
- Easy to find logs by spider (the snippet after this list shows one way to grab the newest one)
- Clear organization
- Can delete per-spider logs
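Because each spider gets its own folder, a few lines of Python are enough to locate the newest log for any spider. This is a hypothetical helper of my own (not part of Scrapy), assuming the logs/<spider>/ layout shown above:
# find_latest_log.py (hypothetical helper)
from pathlib import Path

def latest_log(spider_name, base_dir='logs'):
    """Return the most recently modified .log file for a spider, or None."""
    log_dir = Path(base_dir) / spider_name
    logs = sorted(log_dir.glob('*.log'), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None

if __name__ == '__main__':
    print(latest_log('products'))
The same idea powers the monitoring scripts later in this post, which do it with ls -t | head.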
Create Log Directory Automatically
# settings.py
from datetime import datetime
import os
# Ensure log directory exists
LOG_DIR = 'logs'
os.makedirs(LOG_DIR, exist_ok=True)
LOG_FILE = f'{LOG_DIR}/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
Or do it from the spider class itself. One catch: custom_settings is read from the spider class before the spider is instantiated, so assigning it inside __init__ has no effect. Overriding the update_settings classmethod is a reliable alternative:
import os
from datetime import datetime

import scrapy

class MySpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        # Create the per-spider log directory
        log_dir = f'logs/{cls.name}'
        os.makedirs(log_dir, exist_ok=True)
        # Timestamped log file for this run
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        settings.set('LOG_FILE', f'{log_dir}/{timestamp}.log', priority='spider')
Log Rotation (Prevent Huge Files)
Problem: Logs Get Too Big
Long-running spiders create massive logs:
- 100 MB log files
- Hard to open
- Slow to search
- Fill up disk
Solution 1: Size-Based Rotation
Split log when it reaches size limit:
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
# Size-based rotation needs a small custom extension (defined below)
EXTENSIONS = {
    'myproject.extensions.RotatingLogExtension': 500
}
Create extension:
# extensions.py
import logging
from logging.handlers import RotatingFileHandler

class RotatingLogExtension:
    def __init__(self, log_file, max_bytes, backup_count):
        self.log_file = log_file
        self.max_bytes = max_bytes
        self.backup_count = backup_count

    @classmethod
    def from_crawler(cls, crawler):
        log_file = crawler.settings.get('LOG_FILE')
        max_bytes = crawler.settings.getint('LOG_MAX_BYTES', 10 * 1024 * 1024)  # 10 MB
        backup_count = crawler.settings.getint('LOG_BACKUP_COUNT', 5)
        ext = cls(log_file, max_bytes, backup_count)
        # Set up the rotating handler with the same format Scrapy uses
        handler = RotatingFileHandler(
            log_file,
            maxBytes=max_bytes,
            backupCount=backup_count
        )
        handler.setFormatter(logging.Formatter(
            crawler.settings.get('LOG_FORMAT'),
            crawler.settings.get('LOG_DATEFORMAT')
        ))
        # Attach to the root logger so Scrapy's records go through it
        logger = logging.getLogger()
        logger.addHandler(handler)
        return ext
Settings:
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_MAX_BYTES = 10 * 1024 * 1024 # 10 MB
LOG_BACKUP_COUNT = 5 # Keep 5 old logs
EXTENSIONS = {
    'myproject.extensions.RotatingLogExtension': 500
}
Creates files:
logs/spider.log # Current
logs/spider.log.1 # Previous
logs/spider.log.2 # Older
logs/spider.log.3
logs/spider.log.4
logs/spider.log.5
Solution 2: Time-Based (New File Each Day)
Better for long-running spiders:
# extensions.py
import logging
from logging.handlers import TimedRotatingFileHandler

class DailyRotatingLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        log_file = crawler.settings.get('LOG_FILE', 'logs/spider.log')
        # Rotate at midnight, keep 7 days
        handler = TimedRotatingFileHandler(
            log_file,
            when='midnight',
            interval=1,
            backupCount=7
        )
        formatter = logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        handler.setFormatter(formatter)
        logger = logging.getLogger()
        logger.addHandler(handler)
        return cls()
Creates files:
logs/spider.log # Today
logs/spider.log.2024-01-14 # Yesterday
logs/spider.log.2024-01-13
logs/spider.log.2024-01-12
Viewing Logs on Server
Method 1: tail -f (Live Viewing)
Watch logs in real-time:
tail -f logs/spider.log
See new lines as they're written. Press Ctrl+C to stop.
Method 2: tail with Line Count
Show last 100 lines:
tail -n 100 logs/spider.log
Method 3: less (Browse Entire File)
less logs/spider.log
Navigation:
- Arrow keys to move
- Space bar for next page
- / to search
- q to quit
Method 4: grep (Search Logs)
Find errors:
grep "ERROR" logs/spider.log
Find specific item:
grep "Product ABC" logs/spider.log
Count errors:
grep -c "ERROR" logs/spider.log
Show errors with context (3 lines before/after):
grep -A 3 -B 3 "ERROR" logs/spider.log
Method 5: Multiple grep (Complex Searches)
Find errors in specific spider:
grep "myspider" logs/spider.log | grep "ERROR"
Find scraped items with specific field:
grep "Scraped item" logs/spider.log | grep "price"
Log Analysis Scripts
Count Scraped Items
#!/bin/bash
# count_items.sh
# Note: "Scraped from" and "Crawled" lines are logged at DEBUG level by default,
# so those counts only work on logs written with LOG_LEVEL = 'DEBUG'.
LOG_FILE="$1"
if [ -z "$LOG_FILE" ]; then
    echo "Usage: ./count_items.sh <log_file>"
    exit 1
fi
ITEMS=$(grep -c "Scraped from" "$LOG_FILE")
DROPPED=$(grep -c "Dropped" "$LOG_FILE")
ERRORS=$(grep -c "ERROR" "$LOG_FILE")
echo "Statistics for: $LOG_FILE"
echo "================================"
echo "Items scraped: $ITEMS"
echo "Items dropped: $DROPPED"
echo "Errors: $ERRORS"
Usage:
./count_items.sh logs/spider.log
Output:
Statistics for: logs/spider.log
================================
Items scraped: 5430
Items dropped: 12
Errors: 3
Extract Error Messages
#!/bin/bash
# extract_errors.sh
LOG_FILE="$1"
echo "Errors in $LOG_FILE:"
echo "================================"
grep "ERROR" "$LOG_FILE" | while read -r line; do
    echo "$line"
    echo ""
done
Generate Summary Report
#!/usr/bin/env python3
# analyze_log.py
import sys
import re
from collections import Counter

def analyze_log(log_file):
    with open(log_file, 'r') as f:
        lines = f.readlines()

    stats = {
        'total_lines': len(lines),
        'items_scraped': 0,
        'items_dropped': 0,
        'requests': 0,
        'errors': 0,
        'warnings': 0
    }
    error_types = Counter()

    for line in lines:
        if 'Scraped from' in line:
            stats['items_scraped'] += 1
        elif 'Dropped' in line:
            stats['items_dropped'] += 1
        elif 'Crawled' in line:
            stats['requests'] += 1
        elif 'ERROR' in line:
            stats['errors'] += 1
            # Extract error type
            match = re.search(r'ERROR: (.+?):', line)
            if match:
                error_types[match.group(1)] += 1
        elif 'WARNING' in line:
            stats['warnings'] += 1

    print(f"Log Analysis: {log_file}")
    print("=" * 60)
    print(f"Total lines: {stats['total_lines']}")
    print(f"Items scraped: {stats['items_scraped']}")
    print(f"Items dropped: {stats['items_dropped']}")
    print(f"Requests: {stats['requests']}")
    print(f"Errors: {stats['errors']}")
    print(f"Warnings: {stats['warnings']}")

    if error_types:
        print("\nError types:")
        for error, count in error_types.most_common():
            print(f"  {error}: {count}")

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage: python analyze_log.py <log_file>")
        sys.exit(1)
    analyze_log(sys.argv[1])
Usage:
python analyze_log.py logs/spider.log
Cleaning Old Logs
Manual Cleanup
Delete logs older than 7 days:
find logs/ -name "*.log" -mtime +7 -delete
Delete all but last 10 log files:
ls -t logs/*.log | tail -n +11 | xargs rm -f
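If you would rather not rely on find (for example on a Windows server), the same cleanup is easy in Python. This is a sketch of my own, assuming the logs/ layout used throughout this post; cleanup_logs.py is not a Scrapy tool:
# cleanup_logs.py (hypothetical helper): delete .log files older than N days
import time
from pathlib import Path

def cleanup_logs(log_dir='logs', days_to_keep=7):
    cutoff = time.time() - days_to_keep * 24 * 60 * 60
    deleted = 0
    for log_file in Path(log_dir).rglob('*.log'):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            deleted += 1
    print(f'Deleted {deleted} log files older than {days_to_keep} days')

if __name__ == '__main__':
    cleanup_logs()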
Automatic Cleanup Script
#!/bin/bash
# cleanup_logs.sh
LOG_DIR="logs"
DAYS_TO_KEEP=7
echo "Cleaning logs older than $DAYS_TO_KEEP days in $LOG_DIR"
# Count files before
BEFORE=$(find "$LOG_DIR" -name "*.log" | wc -l)
# Delete old files
find "$LOG_DIR" -name "*.log" -mtime +$DAYS_TO_KEEP -delete
# Count files after
AFTER=$(find "$LOG_DIR" -name "*.log" | wc -l)
DELETED=$((BEFORE - AFTER))
echo "Deleted $DELETED log files"
echo "Remaining: $AFTER log files"
Cron Job for Automatic Cleanup
Run cleanup daily at 2 AM:
crontab -e
Add:
0 2 * * * /home/user/scrapy_project/cleanup_logs.sh >> /home/user/scrapy_project/logs/cleanup.log 2>&1
Separate Error Logs
Save errors to separate file:
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
# The extra error-only handler is added by a custom extension
EXTENSIONS = {
    'myproject.extensions.ErrorLogExtension': 500
}
Create extension:
# extensions.py
import logging

class ErrorLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        # Create error file handler
        error_handler = logging.FileHandler('logs/errors.log')
        error_handler.setLevel(logging.ERROR)
        formatter = logging.Formatter(
            '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
        )
        error_handler.setFormatter(formatter)
        # Add to root logger
        logger = logging.getLogger()
        logger.addHandler(error_handler)
        return cls()
Now you have:
- logs/spider.log: all logs
- logs/errors.log: only errors
Log Format Customization
Change Log Format
# settings.py
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
Default format:
2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
Custom format:
LOG_FORMAT = '[%(levelname)s] %(name)s: %(message)s'
Output:
[INFO] scrapy.core.engine: Spider opened
Add Custom Fields
# settings.py
import logging
class CustomFormatter(logging.Formatter):
def format(self, record):
record.spider_name = 'myspider' # Add custom field
return super().format(record)
LOG_FORMAT = '%(asctime)s [%(spider_name)s] %(levelname)s: %(message)s'
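Defining the formatter is only half the job: something has to attach it to a logging handler, because Scrapy won't pick it up from settings.py on its own. Here is one possible wiring, sketched as a small extension in the same style as the others in this post (SpiderNameFormatter, SpiderNameLogExtension, and the _tagged.log filename are my own names, not a Scrapy API); register it in EXTENSIONS like the other extensions:
# extensions.py (sketch, not a built-in Scrapy feature)
import logging

class SpiderNameFormatter(logging.Formatter):
    def __init__(self, fmt, spider_name):
        super().__init__(fmt)
        self.spider_name = spider_name

    def format(self, record):
        # Inject the custom field before the record is formatted
        record.spider_name = self.spider_name
        return super().format(record)

class SpiderNameLogExtension:
    @classmethod
    def from_crawler(cls, crawler):
        spider_name = crawler.spidercls.name
        # Assumes the logs/ directory already exists (see earlier sections)
        handler = logging.FileHandler(f'logs/{spider_name}_tagged.log')
        handler.setFormatter(SpiderNameFormatter(
            '%(asctime)s [%(spider_name)s] %(levelname)s: %(message)s',
            spider_name
        ))
        logging.getLogger().addHandler(handler)
        return cls()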
Complete Production Setup
Directory Structure
scrapy_project/
├── logs/
│   ├── products/
│   ├── categories/
│   └── reviews/
├── scripts/
│   ├── analyze_log.py
│   ├── cleanup_logs.sh
│   └── monitor_logs.sh
├── myproject/
│   ├── settings.py
│   ├── extensions.py
│   └── spiders/
└── scrapy.cfg
settings.py
import os
from datetime import datetime
# Ensure log directory exists
LOG_BASE_DIR = 'logs'
os.makedirs(LOG_BASE_DIR, exist_ok=True)
# Default log file (per spider will override)
LOG_FILE = f'{LOG_BASE_DIR}/default_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
# Log level
LOG_LEVEL = 'INFO'
# Log format
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
# LOG_STDOUT = True would redirect print() output (stdout/stderr) into the log;
# leave it at the default False to keep normal console behaviour
LOG_STDOUT = False
# Extensions
EXTENSIONS = {
    'myproject.extensions.ErrorLogExtension': 500,
}
Spider with Custom Logging
import os
from datetime import datetime

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def update_settings(cls, settings):
        # custom_settings/update_settings are applied before the spider is
        # instantiated, which is why the log file is configured here, not in __init__
        super().update_settings(settings)
        log_dir = f'logs/{cls.name}'
        os.makedirs(log_dir, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        settings.set('LOG_FILE', f'{log_dir}/{timestamp}.log', priority='spider')
        settings.set('LOG_LEVEL', 'INFO', priority='spider')

    def start_requests(self):
        self.logger.info(f'Logging to: {self.settings.get("LOG_FILE")}')
        urls = ['https://example.com/products']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
            self.logger.info(f'Scraped: {item["name"]}')
            yield item

    def closed(self, reason):
        stats = self.crawler.stats.get_stats()
        self.logger.info('=' * 60)
        self.logger.info('SPIDER FINISHED')
        self.logger.info(f'Items scraped: {stats.get("item_scraped_count", 0)}')
        self.logger.info(f'Items dropped: {stats.get("item_dropped_count", 0)}')
        self.logger.info(f'Requests: {stats.get("downloader/request_count", 0)}')
        self.logger.info(f'Errors: {stats.get("log_count/ERROR", 0)}')
        self.logger.info('=' * 60)
Monitoring Logs in Real-Time
Simple Monitor Script
#!/bin/bash
# monitor_logs.sh
LOG_DIR="logs"
SPIDER_NAME="${1:-products}"
LOG_FILE=$(ls -t $LOG_DIR/$SPIDER_NAME/*.log 2>/dev/null | head -n 1)
if [ -z "$LOG_FILE" ]; then
    echo "No log files found for spider: $SPIDER_NAME"
    exit 1
fi
echo "Monitoring: $LOG_FILE"
echo "Press Ctrl+C to stop"
echo ""
tail -f "$LOG_FILE"
Usage:
./monitor_logs.sh products
Monitor with Filtering
#!/bin/bash
# monitor_errors.sh
LOG_DIR="logs"
SPIDER_NAME="${1:-products}"
LOG_FILE=$(ls -t $LOG_DIR/$SPIDER_NAME/*.log 2>/dev/null | head -n 1)
if [ -z "$LOG_FILE" ]; then
    echo "No log files found for spider: $SPIDER_NAME"
    exit 1
fi
echo "Monitoring ERRORS in: $LOG_FILE"
echo "Press Ctrl+C to stop"
echo ""
tail -f "$LOG_FILE" | grep --line-buffered "ERROR"
Common Mistakes
Mistake #1: Not Creating Log Directory
# BAD
LOG_FILE = 'logs/spider.log'
# Fails if logs/ doesn't exist
# GOOD
import os
os.makedirs('logs', exist_ok=True)
LOG_FILE = 'logs/spider.log'
Mistake #2: Same Log File for All Runs
# BAD
LOG_FILE = 'spider.log'
# Every run writes into the same file, so runs pile up and get mixed together
# GOOD
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
Mistake #3: Too Verbose Logging
# BAD (for production)
LOG_LEVEL = 'DEBUG'
# Creates massive log files
# GOOD
LOG_LEVEL = 'INFO' # Production
LOG_LEVEL = 'DEBUG' # Development only
Mistake #4: No Log Cleanup
Logs fill up disk space! Always implement cleanup.
# Check disk space
df -h
# Clean old logs
find logs/ -name "*.log" -mtime +7 -delete
Quick Reference
Basic Configuration
# settings.py
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
Timestamped Logs
from datetime import datetime
LOG_FILE = f'logs/spider_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log'
View Logs
tail -f logs/spider.log # Live view
tail -n 100 logs/spider.log # Last 100 lines
grep "ERROR" logs/spider.log # Find errors
less logs/spider.log # Browse entire file
Clean Old Logs
find logs/ -name "*.log" -mtime +7 -delete
Summary
Why log files matter:
- Review past runs
- Debug production issues
- Track statistics
- Monitor spider health
Basic setup:
LOG_FILE = 'logs/spider.log'
LOG_LEVEL = 'INFO'
Best practices:
- Use timestamped filenames
- Organize by spider name
- Implement log rotation
- Clean old logs regularly
- Use INFO level in production
- Create log directory automatically
Viewing logs:
- tail -f for live monitoring
- grep for searching
- less for browsing
- Write analysis scripts for summaries
Remember:
- Always save logs to files
- Don't let logs fill disk
- Clean old logs
- Use appropriate log level
- Separate logs per spider
Log files are your time machine. They let you see exactly what happened, even days later. Set them up properly from the start!
Happy scraping! 🕷️