Running Scrapy with nohup: Keep Your Spiders Running 24/7

I deployed my first spider to a server, ran it, and closed my laptop to go home. When I checked the next morning, the spider had stopped after 5 minutes. All the data I expected? Gone.

The problem? When I closed my SSH connection, the spider died with it. I didn't know about nohup.

Once I learned to use nohup, my spiders ran for hours, days, even weeks without me babysitting them. Let me show you how to run Scrapy properly on servers.


The Problem: SSH Disconnection Kills Your Spider

What Happens Normally

You SSH into your server:

ssh user@server.com

Run your spider:

scrapy crawl myspider

Everything works! But then:

  • You close your laptop
  • Internet disconnects
  • SSH session times out
  • Your spider DIES

Why? When the SSH session ends, the system sends a hangup signal (SIGHUP) to every process started in it, and by default those processes terminate.
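
You can see this for yourself with any long-running command, not just Scrapy. A quick demonstration:

sleep 600 &                   # start a harmless long-running job
# now close the terminal window (drop the connection; a clean `exit` can behave differently)
# reconnect, then look for it:
ps aux | grep "sleep 600"     # the job is gone, killed by SIGHUP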


What Is nohup?

nohup stands for "no hangup". It tells Linux to keep your process running even after you disconnect.

What nohup does:

  • Ignores hangup signals (SIGHUP)
  • Keeps your process running after SSH disconnects
  • Sends output to a default nohup.out file if you don't redirect it yourself

Note that nohup by itself doesn't put anything in the background; the trailing & does that, as you'll see next.

Think of it like: Setting your spider to "run independently" mode.


Basic nohup Usage

Simple Command

nohup scrapy crawl myspider &

Let me break this down:

  • nohup - Don't kill on disconnect
  • scrapy crawl myspider - Your spider command
  • & - Run in background

What happens:

  1. Spider starts
  2. You get your terminal back immediately
  3. Output goes to nohup.out file
  4. You can disconnect safely

Check It's Running

ps aux | grep scrapy

You'll see something like:

user  12345  0.5  2.1  python scrapy crawl myspider

The number (12345) is the process ID (PID).


Better: Redirect Output to a Custom File

Don't use the default nohup.out. Use meaningful names:

nohup scrapy crawl myspider > myspider.log 2>&1 &

Explanation:

  • > myspider.log - Send normal output to myspider.log
  • 2>&1 - Send errors to same file (2=stderr, 1=stdout)
  • & - Run in background

Now all output (logs, errors, everything) goes to myspider.log.
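
If you'd rather keep errors in their own file, redirect the two streams separately:

nohup scrapy crawl myspider > myspider.log 2> myspider.err &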


Watching Logs in Real-Time

Your spider is running, but how do you see what's happening?

tail -f (Follow logs)

tail -f myspider.log

Shows last lines and updates live. Press Ctrl+C to stop watching (spider keeps running).

tail with line limit

tail -n 100 -f myspider.log  # Show last 100 lines

grep while following

tail -f myspider.log | grep "ERROR"  # Only show errors
tail -f myspider.log | grep "Scraped"  # Only show scraped items
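
One buffering gotcha: GNU grep only line-buffers when it writes straight to a terminal. If you add another stage after it, pass --line-buffered so matches appear immediately:

tail -f myspider.log | grep --line-buffered "ERROR" | tee errors_live.log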

Stopping Your Spider

Find Process ID

ps aux | grep scrapy

Output:

user  12345  0.5  2.1  python scrapy crawl myspider

PID is 12345.

Kill the Process

kill 12345

Scrapy catches the signal and stops gracefully (finishes in-flight requests and runs its shutdown handlers).

Force Kill (if needed)

kill -9 12345

Immediately terminates (use only if normal kill doesn't work).
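
Before reaching for kill -9, it's worth knowing that Scrapy handles shutdown signals itself: the first SIGTERM (like a first Ctrl+C) starts a graceful shutdown, and a second signal while it's still winding down forces an immediate, unclean stop:

kill 12345    # first signal: graceful shutdown begins
kill 12345    # second signal, if it's taking too long: force immediate stop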


Better Approach: Save PID Automatically

Save PID to File

nohup scrapy crawl myspider > myspider.log 2>&1 & echo $! > myspider.pid

Explanation:

  • $! - PID of last background command
  • > myspider.pid - Save to file

Now you can easily stop it:

kill $(cat myspider.pid)

Or check if running:

ps -p $(cat myspider.pid)
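
Another handy check is kill -0, which sends no signal at all; its exit status simply tells you whether the process exists:

kill -0 "$(cat myspider.pid)" 2>/dev/null && echo "running" || echo "stopped"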

Creating a Run Script

Don't type long commands. Create a script!

Create run_spider.sh

#!/bin/bash

SPIDER_NAME="myspider"
LOG_FILE="logs/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
PID_FILE="${SPIDER_NAME}.pid"

# Create logs directory if it doesn't exist
mkdir -p logs

# Check if already running
if [ -f "$PID_FILE" ]; then
    PID=$(cat "$PID_FILE")
    if ps -p $PID > /dev/null 2>&1; then
        echo "Spider is already running (PID: $PID)"
        exit 1
    fi
fi

# Run spider
echo "Starting spider: $SPIDER_NAME"
echo "Log file: $LOG_FILE"

nohup scrapy crawl "$SPIDER_NAME" > "$LOG_FILE" 2>&1 &

# Save PID
echo $! > "$PID_FILE"

echo "Spider started with PID: $(cat $PID_FILE)"
echo "Watch logs: tail -f $LOG_FILE"

Make Executable

chmod +x run_spider.sh

Run It

./run_spider.sh

Output:

Starting spider: myspider
Log file: logs/myspider_20240115_143022.log
Spider started with PID: 12345
Watch logs: tail -f logs/myspider_20240115_143022.log

Creating a Stop Script

Create stop_spider.sh

#!/bin/bash

SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"

# Check if PID file exists
if [ ! -f "$PID_FILE" ]; then
    echo "Spider is not running (no PID file found)"
    exit 1
fi

PID=$(cat "$PID_FILE")

# Check if process exists
if ! ps -p $PID > /dev/null 2>&1; then
    echo "Spider is not running (process not found)"
    rm -f "$PID_FILE"
    exit 1
fi

# Stop spider
echo "Stopping spider (PID: $PID)..."
kill $PID

# Wait for process to stop
COUNTER=0
while ps -p $PID > /dev/null 2>&1; do
    sleep 1
    COUNTER=$((COUNTER + 1))

    if [ $COUNTER -eq 30 ]; then
        echo "Spider didn't stop gracefully, forcing..."
        kill -9 $PID
        break
    fi
done

rm -f "$PID_FILE"
echo "Spider stopped"

Make Executable

chmod +x stop_spider.sh

Use It

./stop_spider.sh

Creating a Status Script

Create status_spider.sh

#!/bin/bash

SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"

if [ ! -f "$PID_FILE" ]; then
    echo "Status: NOT RUNNING"
    exit 0
fi

PID=$(cat "$PID_FILE")

if ps -p $PID > /dev/null 2>&1; then
    echo "Status: RUNNING"
    echo "PID: $PID"
    echo "Started: $(ps -p $PID -o lstart=)"
    echo "CPU: $(ps -p $PID -o %cpu=)%"
    echo "Memory: $(ps -p $PID -o %mem=)%"
else
    echo "Status: NOT RUNNING (stale PID file)"
    rm -f "$PID_FILE"
fi

Make Executable

chmod +x status_spider.sh

Check Status

./status_spider.sh

Output:

Status: RUNNING
PID: 12345
Started: Mon Jan 15 14:30:22 2024
CPU: 12.5%
Memory: 2.8%

Running Multiple Spiders

Different Spiders

nohup scrapy crawl spider1 > spider1.log 2>&1 & echo $! > spider1.pid
nohup scrapy crawl spider2 > spider2.log 2>&1 & echo $! > spider2.pid
nohup scrapy crawl spider3 > spider3.log 2>&1 & echo $! > spider3.pid

Same Spider, Different Arguments

nohup scrapy crawl myspider -a category=electronics > spider_electronics.log 2>&1 &
nohup scrapy crawl myspider -a category=books > spider_books.log 2>&1 &
nohup scrapy crawl myspider -a category=clothing > spider_clothing.log 2>&1 &
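
If you run the same spider over several argument values, a small loop keeps this tidy (the category names here are just placeholders):

for category in electronics books clothing; do
    nohup scrapy crawl myspider -a category=$category \
        > "spider_${category}.log" 2>&1 &
    echo $! > "spider_${category}.pid"
done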

Advanced: Run Script with All Features

Complete run_spider.sh

#!/bin/bash

# Configuration
SPIDER_NAME="${1:-myspider}"  # Use first argument or default
PROJECT_DIR="/home/user/scrapy_project"
LOG_DIR="$PROJECT_DIR/logs"
PID_DIR="$PROJECT_DIR/pids"
LOG_FILE="$LOG_DIR/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
PID_FILE="$PID_DIR/${SPIDER_NAME}.pid"

# Create directories
mkdir -p "$LOG_DIR" "$PID_DIR"

# Change to project directory
cd "$PROJECT_DIR" || exit 1

# Check if already running
if [ -f "$PID_FILE" ]; then
    PID=$(cat "$PID_FILE")
    if ps -p $PID > /dev/null 2>&1; then
        echo "ERROR: Spider '$SPIDER_NAME' is already running (PID: $PID)"
        exit 1
    else
        echo "Removing stale PID file..."
        rm -f "$PID_FILE"
    fi
fi

# Activate virtual environment if exists
if [ -d "venv" ]; then
    source venv/bin/activate
fi

# Run spider
echo "Starting spider: $SPIDER_NAME"
echo "Log file: $LOG_FILE"
echo "PID file: $PID_FILE"
echo "----------------------------------------"

nohup scrapy crawl "$SPIDER_NAME" "${@:2}" > "$LOG_FILE" 2>&1 &

# Save PID
PID=$!
echo $PID > "$PID_FILE"

# Verify it started
sleep 2
if ps -p $PID > /dev/null 2>&1; then
    echo "SUCCESS: Spider started (PID: $PID)"
    echo ""
    echo "Commands:"
    echo "  Watch logs: tail -f $LOG_FILE"
    echo "  Check status: ps -p $PID"
    echo "  Stop spider: kill $PID"
else
    echo "ERROR: Spider failed to start"
    echo "Check log file: $LOG_FILE"
    rm -f "$PID_FILE"
    exit 1
fi

Usage Examples

# Run default spider
./run_spider.sh

# Run specific spider
./run_spider.sh products

# Run with arguments
./run_spider.sh products -a category=electronics -a start_page=1

Monitoring Your Spider

Check CPU and Memory

ps aux | grep scrapy

Or more detailed:

top -p $(cat myspider.pid)

Count Scraped Items (if using JSON Lines)

wc -l output.jsonl

This count is only accurate with the JSON Lines format (-o output.jsonl), where each item is its own line. A plain JSON array (-o output.json) is one multi-line document, so wc -l won't match the item count.

Watch Error Count

grep -c "ERROR" myspider.log

Watch Progress

tail -f myspider.log | grep "Crawled"
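
Scrapy's LogStats extension (enabled by default) logs a throughput line once per minute at INFO level, so grepping for it gives a quick snapshot:

grep "pages/min" myspider.log | tail -n 1
# e.g. Crawled 1523 pages (at 120 pages/min), scraped 1489 items (at 118 items/min)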

Auto-Restart on Failure

Create restart_on_failure.sh

#!/bin/bash

SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"

mkdir -p logs

while true; do
    # Check if running
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        if ps -p "$PID" > /dev/null 2>&1; then
            # Still running
            sleep 60  # Check every minute
            continue
        fi
    fi

    # Not running, start it
    echo "$(date): Spider not running, starting..."

    # Fresh log file per restart, so earlier logs aren't overwritten
    LOG_FILE="logs/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"

    nohup scrapy crawl "$SPIDER_NAME" > "$LOG_FILE" 2>&1 &
    echo $! > "$PID_FILE"

    echo "$(date): Spider restarted (PID: $(cat "$PID_FILE"))"

    sleep 60
done

Run the monitor:

nohup ./restart_on_failure.sh > monitor.log 2>&1 &

Now your spider automatically restarts if it crashes! One caveat: if the spider is failing at startup, this loop will relaunch it every minute forever, so check the newest log before leaving it unattended.


Using screen (Alternative to nohup)

screen creates a virtual terminal that persists after disconnect.

Install screen

sudo apt install screen  # Ubuntu/Debian
sudo yum install screen  # CentOS/RHEL

Start screen session

screen -S myspider

Run spider

scrapy crawl myspider

Detach (keep running)

Press: Ctrl+A then D

List sessions

screen -ls

Output:

There is a screen on:
    12345.myspider  (Detached)

Reattach

screen -r myspider

Now you see your spider running!

Kill session

screen -X -S myspider quit

Using tmux (Another Alternative)

tmux is like screen but more modern.

Install tmux

sudo apt install tmux

Start session

tmux new -s myspider

Run spider

scrapy crawl myspider

Detach

Press: Ctrl+B then D

List sessions

tmux ls

Reattach

tmux attach -t myspider

Kill session

tmux kill-session -t myspider

Comparison: nohup vs screen vs tmux

nohup

Pros:

  • Simple, no installation needed
  • Works everywhere
  • Good for fire-and-forget

Cons:

  • Can't see live output (need tail -f)
  • Can't interact with running process
  • Basic features only

screen

Pros:

  • See live output
  • Interact with process
  • Multiple windows

Cons:

  • Needs installation
  • Older design, fewer modern features
  • Learning curve

tmux

Pros:

  • Modern, actively maintained
  • Powerful features
  • Better UI
  • Scriptable

Cons:

  • Needs installation
  • More complex
  • Overkill for simple tasks

My recommendation:

  • Simple scrapers: nohup
  • Development: tmux or screen
  • Production: systemd service (covered in deployment blog)

Common Issues and Solutions

Issue 1: Spider Dies Immediately

Check logs:

cat myspider.log

Common causes:

  • Python errors
  • Import errors
  • Settings errors

Issue 2: Can't Find PID File

Problem:

kill $(cat myspider.pid)
# cat: myspider.pid: No such file or directory

Solution:

ps aux | grep scrapy  # Find PID manually
kill 12345  # Use actual PID
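
pgrep does the same lookup in one step (-f matches against the full command line):

pgrep -f "scrapy crawl myspider"           # print matching PIDs
kill $(pgrep -f "scrapy crawl myspider")   # stop them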

Issue 3: Log File Gets Too Large

Check size:

du -h myspider.log

Rotate logs:

# In your run script (GNU/Linux stat; use stat -f%z on macOS/BSD)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt 100000000 ]; then
    mv "$LOG_FILE" "${LOG_FILE}.old"
fi

Issue 4: Multiple Instances Running

Find all:

ps aux | grep "scrapy crawl myspider"

Kill all:

pkill -f "scrapy crawl myspider"

Best Practices

1. Always Use Timestamped Logs

LOG_FILE="logs/myspider_$(date +%Y%m%d_%H%M%S).log"

This way you have history of all runs.

2. Monitor Disk Space

df -h  # Check available space

Spider logs can fill the disk!

3. Set Up Log Rotation

# Keep only last 10 log files
ls -t logs/myspider_*.log | tail -n +11 | xargs rm -f
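
On most Linux servers you can hand this job to logrotate instead. A minimal config (assuming the log path used in this post) saved as /etc/logrotate.d/scrapy:

/home/user/scrapy_project/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}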

4. Use Process Monitoring

Create cron job to check spider health:

# Check every 5 minutes
*/5 * * * * /home/user/check_spider_health.sh
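
The cron line references a check_spider_health.sh that isn't shown here; a minimal sketch, assuming the directory layout from the production example later in this post:

#!/bin/bash
# Minimal health check: if the PID file is missing or stale, restart the spider
PROJECT_DIR="/home/user/scrapy_project"
PID_FILE="$PROJECT_DIR/pids/myspider.pid"

if [ ! -f "$PID_FILE" ] || ! ps -p "$(cat "$PID_FILE")" > /dev/null 2>&1; then
    echo "$(date): spider down, restarting" >> "$PROJECT_DIR/logs/health.log"
    "$PROJECT_DIR/scripts/run_spider.sh" myspider
fi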

5. Send Notifications

When spider finishes:

# At end of spider
echo "Spider finished" | mail -s "Spider Status" you@example.com
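
To make that fire automatically for a backgrounded run, wrap the crawl and the notification in one shell (this assumes mail is set up on the server; a curl to a webhook works the same way):

nohup bash -c '
    scrapy crawl myspider
    echo "Spider finished with exit code $?" | mail -s "Spider Status" you@example.com
' > myspider.log 2>&1 &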

Complete Production Example

Full Setup

# Directory structure
/home/user/scrapy_project/
├── scrapy.cfg
├── myproject/
│   ├── spiders/
│   └── ...
├── logs/
├── pids/
└── scripts/
    ├── run_spider.sh
    ├── stop_spider.sh
    ├── status_spider.sh
    └── monitor.sh

Production run_spider.sh

#!/bin/bash
set -e  # Exit on error

SPIDER_NAME="${1:-myspider}"
PROJECT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Directories
LOG_DIR="$PROJECT_DIR/logs"
PID_DIR="$PROJECT_DIR/pids"
DATA_DIR="$PROJECT_DIR/data"

# Files
LOG_FILE="$LOG_DIR/${SPIDER_NAME}_${TIMESTAMP}.log"
PID_FILE="$PID_DIR/${SPIDER_NAME}.pid"
DATA_FILE="$DATA_DIR/${SPIDER_NAME}_${TIMESTAMP}.json"

# Create directories
mkdir -p "$LOG_DIR" "$PID_DIR" "$DATA_DIR"

# Change to project directory
cd "$PROJECT_DIR"

# Check if already running
if [ -f "$PID_FILE" ]; then
    PID=$(cat "$PID_FILE")
    if ps -p $PID > /dev/null 2>&1; then
        echo "ERROR: Spider already running (PID: $PID)"
        exit 1
    fi
    rm -f "$PID_FILE"
fi

# Activate virtual environment
if [ -d "venv" ]; then
    source venv/bin/activate
fi

# Start spider
echo "$(date): Starting spider $SPIDER_NAME"

nohup scrapy crawl "$SPIDER_NAME" \
    -o "$DATA_FILE" \
    "${@:2}" \
    > "$LOG_FILE" 2>&1 &

PID=$!
echo $PID > $PID_FILE

# Verify started
sleep 2
if ps -p $PID > /dev/null 2>&1; then
    echo "SUCCESS: Spider started"
    echo "PID: $PID"
    echo "Log: $LOG_FILE"
    echo "Data: $DATA_FILE"
else
    echo "ERROR: Spider failed to start"
    cat $LOG_FILE
    rm -f "$PID_FILE"
    exit 1
fi

Summary

nohup basics:

nohup scrapy crawl myspider > spider.log 2>&1 &

Save PID:

nohup scrapy crawl myspider > spider.log 2>&1 & echo $! > spider.pid

Watch logs:

tail -f spider.log

Stop spider:

kill $(cat spider.pid)

Alternatives:

  • screen - Interactive sessions
  • tmux - Modern terminal multiplexer
  • systemd - Production service management

Best practices:

  • Use run scripts
  • Save PIDs
  • Timestamped logs
  • Monitor resources
  • Auto-restart on failure

Remember:

  • Always redirect output (> log.log 2>&1)
  • Always run in background (&)
  • Always save PID for later
  • Watch logs with tail -f
  • Use screen/tmux for development

nohup is simple, reliable, and works everywhere. Perfect for running Scrapy spiders on servers!

Happy scraping! 🕷️
