I deployed my first spider to a server, ran it, and closed my laptop to go home. When I checked the next morning, the spider had stopped after 5 minutes. All the data I expected? Gone.
The problem? When I closed my SSH connection, the spider died with it. I didn't know about nohup.
Once I learned to use nohup, my spiders ran for hours, days, even weeks without me babysitting them. Let me show you how to run Scrapy properly on servers.
The Problem: SSH Disconnection Kills Your Spider
What Happens Normally
You SSH into your server:
ssh user@server.com
Run your spider:
scrapy crawl myspider
Everything works! But then:
- You close your laptop
- Internet disconnects
- SSH session times out
- Your spider DIES
Why? When the SSH session ends, the shell sends a hangup signal (SIGHUP) to every process it started, and they exit with it.
What Is nohup?
nohup stands for "no hangup". It tells Linux to keep your process running even after you disconnect.
What nohup does:
- Ignores the hangup signal (SIGHUP)
- Keeps your process running after the SSH session disconnects
- Sends output to nohup.out if you don't redirect it yourself
- Paired with &, runs the command in the background
Think of it like: Setting your spider to "run independently" mode.
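Want to see it in action before trusting a real spider to it? A quick sketch, using sleep as a harmless stand-in process:
# Start a long-running dummy process with nohup
nohup sleep 600 &
# Log out, SSH back in, then check it's still alive
ps aux | grep "sleep 600"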
Basic nohup Usage
Simple Command
nohup scrapy crawl myspider &
Let me break this down:
- nohup - don't kill the process on disconnect
- scrapy crawl myspider - your spider command
- & - run it in the background
What happens:
- Spider starts
- You get your terminal back immediately
- Output goes to the nohup.out file
- You can disconnect safely
Check It's Running
ps aux | grep scrapy
You'll see something like:
user 12345 0.5 2.1 python scrapy crawl myspider
The number (12345) is the process ID (PID).
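A handy shortcut: pgrep prints just the PID (no grep noise), which is nicer inside scripts:
pgrep -f "scrapy crawl myspider"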
Better: Redirect Output to Custom File
Don't use the default nohup.out. Use meaningful names:
nohup scrapy crawl myspider > myspider.log 2>&1 &
Explanation:
- > myspider.log - send normal output to myspider.log
- 2>&1 - send errors to the same file (2 = stderr, 1 = stdout)
- & - run in the background
Now all output (logs, errors, everything) goes to myspider.log.
Watching Logs in Real-Time
Your spider is running, but how do you see what's happening?
tail -f (Follow logs)
tail -f myspider.log
Shows last lines and updates live. Press Ctrl+C to stop watching (spider keeps running).
tail with line limit
tail -n 100 -f myspider.log # Show last 100 lines
grep while following
tail -f myspider.log | grep "ERROR" # Only show errors
tail -f myspider.log | grep "Scraped" # Only show scraped items
Stopping Your Spider
Find Process ID
ps aux | grep scrapy
Output:
user 12345 0.5 2.1 python scrapy crawl myspider
PID is 12345.
Kill the Process
kill 12345
The spider shuts down gracefully: kill sends SIGTERM, and Scrapy finishes its in-flight requests and runs its closing hooks before exiting.
Force Kill (if needed)
kill -9 12345
Immediately terminates (use only if normal kill doesn't work).
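One caveat: kill stops the crawl, but Scrapy only remembers where it left off if you give it a job directory. If you want stop-and-resume behaviour, here's a sketch using Scrapy's JOBDIR setting (the directory name is just an example):
# JOBDIR persists the scheduler state, so rerunning the same command resumes the crawl
nohup scrapy crawl myspider -s JOBDIR=crawls/myspider-1 > myspider.log 2>&1 &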
Better Approach: Save PID Automatically
Save PID to File
nohup scrapy crawl myspider > myspider.log 2>&1 & echo $! > myspider.pid
Explanation:
- $! - the PID of the last background command
- > myspider.pid - save it to a file
Now you can easily stop it:
kill $(cat myspider.pid)
Or check if running:
ps -p $(cat myspider.pid)
Creating a Run Script
Don't type long commands. Create a script!
Create run_spider.sh
#!/bin/bash
SPIDER_NAME="myspider"
LOG_FILE="logs/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
PID_FILE="${SPIDER_NAME}.pid"
# Create logs directory if doesn't exist
mkdir -p logs
# Check if already running
if [ -f "$PID_FILE" ]; then
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "Spider is already running (PID: $PID)"
exit 1
fi
fi
# Run spider
echo "Starting spider: $SPIDER_NAME"
echo "Log file: $LOG_FILE"
nohup scrapy crawl $SPIDER_NAME > $LOG_FILE 2>&1 &
# Save PID
echo $! > $PID_FILE
echo "Spider started with PID: $(cat $PID_FILE)"
echo "Watch logs: tail -f $LOG_FILE"
Make Executable
chmod +x run_spider.sh
Run It
./run_spider.sh
Output:
Starting spider: myspider
Log file: logs/myspider_20240115_143022.log
Spider started with PID: 12345
Watch logs: tail -f logs/myspider_20240115_143022.log
Creating a Stop Script
Create stop_spider.sh
#!/bin/bash
SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"
# Check if PID file exists
if [ ! -f "$PID_FILE" ]; then
echo "Spider is not running (no PID file found)"
exit 1
fi
PID=$(cat "$PID_FILE")
# Check if process exists
if ! ps -p $PID > /dev/null 2>&1; then
echo "Spider is not running (process not found)"
rm -f "$PID_FILE"
exit 1
fi
# Stop spider
echo "Stopping spider (PID: $PID)..."
kill $PID
# Wait for process to stop
COUNTER=0
while ps -p $PID > /dev/null 2>&1; do
sleep 1
COUNTER=$((COUNTER + 1))
if [ $COUNTER -eq 30 ]; then
echo "Spider didn't stop gracefully, forcing..."
kill -9 $PID
break
fi
done
rm -f "$PID_FILE"
echo "Spider stopped"
Make Executable
chmod +x stop_spider.sh
Use It
./stop_spider.sh
Creating a Status Script
Create status_spider.sh
#!/bin/bash
SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"
if [ ! -f "$PID_FILE" ]; then
echo "Status: NOT RUNNING"
exit 0
fi
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "Status: RUNNING"
echo "PID: $PID"
echo "Started: $(ps -p $PID -o lstart=)"
echo "CPU: $(ps -p $PID -o %cpu=)%"
echo "Memory: $(ps -p $PID -o %mem=)%"
else
echo "Status: NOT RUNNING (stale PID file)"
rm -f "$PID_FILE"
fi
Make Executable
chmod +x status_spider.sh
Check Status
./status_spider.sh
Output:
Status: RUNNING
PID: 12345
Started: Mon Jan 15 14:30:22 2024
CPU: 12.5%
Memory: 2.8%
Running Multiple Spiders
Different Spiders
nohup scrapy crawl spider1 > spider1.log 2>&1 & echo $! > spider1.pid
nohup scrapy crawl spider2 > spider2.log 2>&1 & echo $! > spider2.pid
nohup scrapy crawl spider3 > spider3.log 2>&1 & echo $! > spider3.pid
Same Spider, Different Arguments
nohup scrapy crawl myspider -a category=electronics > spider_electronics.log 2>&1 &
nohup scrapy crawl myspider -a category=books > spider_books.log 2>&1 &
nohup scrapy crawl myspider -a category=clothing > spider_clothing.log 2>&1 &
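If those parallel runs also export data, give each one its own feed file too, or they will write over each other. A sketch (the -O overwrite flag needs Scrapy 2.4+; older versions only have -o, which appends):
mkdir -p data
nohup scrapy crawl myspider -a category=electronics -O data/electronics.json > spider_electronics.log 2>&1 &
nohup scrapy crawl myspider -a category=books -O data/books.json > spider_books.log 2>&1 &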
Advanced: Run Script with All Features
Complete run_spider.sh
#!/bin/bash
# Configuration
SPIDER_NAME="${1:-myspider}" # Use first argument or default
PROJECT_DIR="/home/user/scrapy_project"
LOG_DIR="$PROJECT_DIR/logs"
PID_DIR="$PROJECT_DIR/pids"
LOG_FILE="$LOG_DIR/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
PID_FILE="$PID_DIR/${SPIDER_NAME}.pid"
# Create directories
mkdir -p "$LOG_DIR" "$PID_DIR"
# Change to project directory
cd "$PROJECT_DIR" || exit 1
# Check if already running
if [ -f "$PID_FILE" ]; then
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "ERROR: Spider '$SPIDER_NAME' is already running (PID: $PID)"
exit 1
else
echo "Removing stale PID file..."
rm -f "$PID_FILE"
fi
fi
# Activate virtual environment if exists
if [ -d "venv" ]; then
source venv/bin/activate
fi
# Run spider
echo "Starting spider: $SPIDER_NAME"
echo "Log file: $LOG_FILE"
echo "PID file: $PID_FILE"
echo "----------------------------------------"
nohup scrapy crawl $SPIDER_NAME "${@:2}" > $LOG_FILE 2>&1 &
# Save PID
PID=$!
echo $PID > $PID_FILE
# Verify it started
sleep 2
if ps -p $PID > /dev/null 2>&1; then
echo "SUCCESS: Spider started (PID: $PID)"
echo ""
echo "Commands:"
echo " Watch logs: tail -f $LOG_FILE"
echo " Check status: ps -p $PID"
echo " Stop spider: kill $PID"
else
echo "ERROR: Spider failed to start"
echo "Check log file: $LOG_FILE"
rm -f "$PID_FILE"
exit 1
fi
Usage Examples
# Run default spider
./run_spider.sh
# Run specific spider
./run_spider.sh products
# Run with arguments
./run_spider.sh products -a category=electronics -a start_page=1
Monitoring Your Spider
Check CPU and Memory
ps aux | grep scrapy
Or more detailed:
top -p $(cat myspider.pid)
Count Scraped Items (if using JSON Lines)
wc -l output.jl # works because the JSON Lines (.jl) feed writes one item per line; a single .json array won't count this way
Watch Error Count
grep -c "ERROR" myspider.log
Watch Progress
tail -f myspider.log | grep "Crawled"
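Scrapy's LogStats extension also prints a progress summary about once a minute ("Crawled X pages ... scraped Y items ..."), which is a convenient line to follow. A sketch, using the LOGSTATS_INTERVAL setting to report more often:
# Follow the periodic progress summaries
tail -f myspider.log | grep "pages/min"
# Ask the spider to report every 15 seconds instead of the default 60
nohup scrapy crawl myspider -s LOGSTATS_INTERVAL=15 > myspider.log 2>&1 &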
Auto-Restart on Failure
Create restart_on_failure.sh
#!/bin/bash
SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"
mkdir -p logs
while true; do
  # Check if running
  if [ -f "$PID_FILE" ]; then
    PID=$(cat "$PID_FILE")
    if ps -p $PID > /dev/null 2>&1; then
      # Still running
      sleep 60 # Check every minute
      continue
    fi
  fi
  # Not running, start it with a fresh timestamped log
  LOG_FILE="logs/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
  echo "$(date): Spider not running, starting..."
  nohup scrapy crawl $SPIDER_NAME > "$LOG_FILE" 2>&1 &
  echo $! > "$PID_FILE"
  echo "$(date): Spider restarted (PID: $(cat $PID_FILE))"
  sleep 60
done
Run the monitor:
nohup ./restart_on_failure.sh > monitor.log 2>&1 &
Now your spider automatically restarts if it crashes!
Using screen (Alternative to nohup)
screen creates a virtual terminal that persists after disconnect.
Install screen
sudo apt install screen # Ubuntu/Debian
sudo yum install screen # CentOS/RHEL
Start screen session
screen -S myspider
Run spider
scrapy crawl myspider
Detach (keep running)
Press: Ctrl+A then D
List sessions
screen -ls
Output:
There is a screen on:
12345.myspider (Detached)
Reattach
screen -r myspider
Now you see your spider running!
Kill session
screen -X -S myspider quit
Using tmux (Another Alternative)
tmux is like screen but more modern.
Install tmux
sudo apt install tmux
Start session
tmux new -s myspider
Run spider
scrapy crawl myspider
Detach
Press: Ctrl+B then D
List sessions
tmux ls
Reattach
tmux attach -t myspider
Kill session
tmux kill-session -t myspider
Comparison: nohup vs screen vs tmux
nohup
Pros:
- Simple, no installation needed
- Works everywhere
- Good for fire-and-forget
Cons:
- Can't see live output (need tail -f)
- Can't interact with running process
- Basic features only
screen
Pros:
- See live output
- Interact with process
- Multiple windows
Cons:
- Needs installation
- Older, less maintained
- Learning curve
tmux
Pros:
- Modern, actively maintained
- Powerful features
- Better UI
- Scriptable
Cons:
- Needs installation
- More complex
- Overkill for simple tasks
My recommendation:
- Simple scrapers: nohup
- Development: tmux or screen
- Production: systemd service (covered in deployment blog)
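For a taste of what the production option looks like, here's a minimal systemd unit sketch (user, paths, and names are placeholders; the deployment post covers it properly):
# /etc/systemd/system/myspider.service (sketch)
[Unit]
Description=Scrapy spider: myspider
After=network.target

[Service]
User=user
WorkingDirectory=/home/user/scrapy_project
ExecStart=/home/user/scrapy_project/venv/bin/scrapy crawl myspider
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then load and start it with: sudo systemctl daemon-reload && sudo systemctl enable --now myspider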
Common Issues and Solutions
Issue 1: Spider Dies Immediately
Check logs:
cat myspider.log
Common causes:
- Python errors
- Import errors
- Settings errors
Issue 2: Can't Find PID File
Problem:
kill $(cat myspider.pid)
# cat: myspider.pid: No such file or directory
Solution:
ps aux | grep scrapy # Find PID manually
kill 12345 # Use actual PID
Issue 3: Log File Gets Too Large
Check size:
du -h myspider.log
Rotate logs:
# In your run script: rotate when the log exceeds ~100 MB (GNU stat; on macOS/BSD use stat -f%z)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt 100000000 ]; then
  mv "$LOG_FILE" "${LOG_FILE}.old"
fi
Issue 4: Multiple Instances Running
Find all:
ps aux | grep "scrapy crawl myspider"
Kill all:
pkill -f "scrapy crawl myspider"
Best Practices
1. Always Use Timestamped Logs
LOG_FILE="logs/myspider_$(date +%Y%m%d_%H%M%S).log"
This way you have history of all runs.
2. Monitor Disk Space
df -h # Check available space
Spider logs can fill disk!
3. Set Up Log Rotation
# Keep only last 10 log files
ls -t logs/myspider_*.log | tail -n +11 | xargs rm -f
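If the server already runs logrotate (most Linux distros do), a small drop-in config is the more standard way to do this. A sketch (path and limits are examples):
# /etc/logrotate.d/myspider (sketch)
/home/user/scrapy_project/logs/*.log {
  size 100M
  rotate 10
  compress
  missingok
  notifempty
}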
4. Use Process Monitoring
Create cron job to check spider health:
# Check every 5 minutes
*/5 * * * * /home/user/check_spider_health.sh
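The health-check script itself isn't shown above; a minimal sketch, reusing the PID-file convention from earlier (paths are examples):
#!/bin/bash
# check_spider_health.sh (sketch): restart the spider if its PID file is missing or stale
PID_FILE="/home/user/scrapy_project/pids/myspider.pid"
if [ ! -f "$PID_FILE" ] || ! ps -p "$(cat "$PID_FILE")" > /dev/null 2>&1; then
  echo "$(date): spider down, restarting" >> /home/user/scrapy_project/logs/health.log
  /home/user/scrapy_project/scripts/run_spider.sh myspider
fi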
5. Send Notifications
When spider finishes:
# At end of spider
echo "Spider finished" | mail -s "Spider Status" you@example.com
Complete Production Example
Full Setup
# Directory structure
/home/user/scrapy_project/
├── scrapy.cfg
├── myproject/
│ ├── spiders/
│ └── ...
├── logs/
├── pids/
└── scripts/
├── run_spider.sh
├── stop_spider.sh
├── status_spider.sh
└── monitor.sh
Production run_spider.sh
#!/bin/bash
set -e # Exit on error
SPIDER_NAME="${1:-myspider}"
PROJECT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Directories
LOG_DIR="$PROJECT_DIR/logs"
PID_DIR="$PROJECT_DIR/pids"
DATA_DIR="$PROJECT_DIR/data"
# Files
LOG_FILE="$LOG_DIR/${SPIDER_NAME}_${TIMESTAMP}.log"
PID_FILE="$PID_DIR/${SPIDER_NAME}.pid"
DATA_FILE="$DATA_DIR/${SPIDER_NAME}_${TIMESTAMP}.json"
# Create directories
mkdir -p "$LOG_DIR" "$PID_DIR" "$DATA_DIR"
# Change to project directory
cd "$PROJECT_DIR"
# Check if already running
if [ -f "$PID_FILE" ]; then
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "ERROR: Spider already running (PID: $PID)"
exit 1
fi
rm -f "$PID_FILE"
fi
# Activate virtual environment
if [ -d "venv" ]; then
source venv/bin/activate
fi
# Start spider
echo "$(date): Starting spider $SPIDER_NAME"
nohup scrapy crawl $SPIDER_NAME \
-o $DATA_FILE \
"${@:2}" \
> $LOG_FILE 2>&1 &
PID=$!
echo $PID > $PID_FILE
# Verify started
sleep 2
if ps -p $PID > /dev/null 2>&1; then
echo "SUCCESS: Spider started"
echo "PID: $PID"
echo "Log: $LOG_FILE"
echo "Data: $DATA_FILE"
else
echo "ERROR: Spider failed to start"
cat $LOG_FILE
rm -f "$PID_FILE"
exit 1
fi
Summary
nohup basics:
nohup scrapy crawl myspider > spider.log 2>&1 &
Save PID:
nohup scrapy crawl myspider > spider.log 2>&1 & echo $! > spider.pid
Watch logs:
tail -f spider.log
Stop spider:
kill $(cat spider.pid)
Alternatives:
- screen - Interactive sessions
- tmux - Modern terminal multiplexer
- systemd - Production service management
Best practices:
- Use run scripts
- Save PIDs
- Timestamped logs
- Monitor resources
- Auto-restart on failure
Remember:
- Always redirect output (> spider.log 2>&1)
- Always run in background (&)
- Always save PID for later
- Watch logs with tail -f
- Use screen/tmux for development
nohup is simple, reliable, and works everywhere. Perfect for running Scrapy spiders on servers!
Happy scraping! 🕷️