I deployed my first spider to a server, ran it, and closed my laptop to go home. When I checked the next morning, the spider had stopped after 5 minutes. All the data I expected? Gone.
The problem? When I closed my SSH connection, the spider died with it. I didn't know about nohup.
Once I learned to use nohup, my spiders ran for hours, days, even weeks without me babysitting them. Let me show you how to run Scrapy properly on servers.
The Problem: SSH Disconnection Kills Your Spider
What Happens Normally
You SSH into your server:
ssh user@server.com
Run your spider:
scrapy crawl myspider
Everything works! But then:
- You close your laptop
- Internet disconnects
- SSH session times out
- Your spider DIES
Why? When the SSH session ends, the shell sends a hangup signal (SIGHUP) to every process it started, and they exit with it.
What Is nohup?
nohup stands for "no hangup". It tells Linux to keep your process running even after you disconnect.
What nohup does:
- Ignores the hangup signal (SIGHUP)
- Keeps your process running after the SSH session disconnects
- Sends output to nohup.out if you don't redirect it yourself
- Paired with &, runs the command in the background
Think of it like: Setting your spider to "run independently" mode.
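Want to see it in action before trusting a real spider to it? A quick sketch, using sleep as a harmless stand-in process:
# Start a long-running dummy process with nohup
nohup sleep 600 &
# Log out, SSH back in, then check it's still alive
ps aux | grep "sleep 600"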
Basic nohup Usage
Simple Command
nohup scrapy crawl myspider &
Let me break this down:
- nohup - don't kill the process on disconnect
- scrapy crawl myspider - your spider command
- & - run it in the background
What happens:
- Spider starts
- You get your terminal back immediately
- Output goes to the nohup.out file
- You can disconnect safely
Check It's Running
ps aux | grep scrapy
You'll see something like:
user 12345 0.5 2.1 python scrapy crawl myspider
The number (12345) is the process ID (PID).
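A handy shortcut: pgrep prints just the PID (no grep noise), which is nicer inside scripts:
pgrep -f "scrapy crawl myspider"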
Better: Redirect Output to Custom File
Don't use the default nohup.out. Use meaningful names:
nohup scrapy crawl myspider > myspider.log 2>&1 &
Explanation:
- > myspider.log - send normal output to myspider.log
- 2>&1 - send errors to the same file (2 = stderr, 1 = stdout)
- & - run in the background
Now all output (logs, errors, everything) goes to myspider.log.
Watching Logs in Real-Time
Your spider is running, but how do you see what's happening?
tail -f (Follow logs)
tail -f myspider.log
Shows last lines and updates live. Press Ctrl+C to stop watching (spider keeps running).
tail with line limit
tail -n 100 -f myspider.log # Show last 100 lines
grep while following
tail -f myspider.log | grep "ERROR" # Only show errors
tail -f myspider.log | grep "Scraped" # Only show scraped items
Stopping Your Spider
Find Process ID
ps aux | grep scrapy
Output:
user 12345 0.5 2.1 python scrapy crawl myspider
PID is 12345.
Kill the Process
kill 12345
The spider shuts down gracefully: kill sends SIGTERM, and Scrapy finishes its in-flight requests and runs its closing hooks before exiting.
Force Kill (if needed)
kill -9 12345
Immediately terminates (use only if normal kill doesn't work).
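One caveat: kill stops the crawl, but Scrapy only remembers where it left off if you give it a job directory. If you want stop-and-resume behaviour, here's a sketch using Scrapy's JOBDIR setting (the directory name is just an example):
# JOBDIR persists the scheduler state, so rerunning the same command resumes the crawl
nohup scrapy crawl myspider -s JOBDIR=crawls/myspider-1 > myspider.log 2>&1 &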
Better Approach: Save PID Automatically
Save PID to File
nohup scrapy crawl myspider > myspider.log 2>&1 & echo $! > myspider.pid
Explanation:
- $! - the PID of the last background command
- > myspider.pid - save it to a file
Now you can easily stop it:
kill $(cat myspider.pid)
Or check if running:
ps -p $(cat myspider.pid)
Creating a Run Script
Don't type long commands. Create a script!
Create run_spider.sh
#!/bin/bash
SPIDER_NAME="myspider"
LOG_FILE="logs/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
PID_FILE="${SPIDER_NAME}.pid"
# Create logs directory if doesn't exist
mkdir -p logs
# Check if already running
if [ -f "$PID_FILE" ]; then
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "Spider is already running (PID: $PID)"
exit 1
fi
fi
# Run spider
echo "Starting spider: $SPIDER_NAME"
echo "Log file: $LOG_FILE"
nohup scrapy crawl $SPIDER_NAME > $LOG_FILE 2>&1 &
# Save PID
echo $! > $PID_FILE
echo "Spider started with PID: $(cat $PID_FILE)"
echo "Watch logs: tail -f $LOG_FILE"
Make Executable
chmod +x run_spider.sh
Run It
./run_spider.sh
Output:
Starting spider: myspider
Log file: logs/myspider_20240115_143022.log
Spider started with PID: 12345
Watch logs: tail -f logs/myspider_20240115_143022.log
Creating a Stop Script
Create stop_spider.sh
#!/bin/bash
SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"
# Check if PID file exists
if [ ! -f "$PID_FILE" ]; then
echo "Spider is not running (no PID file found)"
exit 1
fi
PID=$(cat "$PID_FILE")
# Check if process exists
if ! ps -p $PID > /dev/null 2>&1; then
echo "Spider is not running (process not found)"
rm -f "$PID_FILE"
exit 1
fi
# Stop spider
echo "Stopping spider (PID: $PID)..."
kill $PID
# Wait for process to stop
COUNTER=0
while ps -p $PID > /dev/null 2>&1; do
sleep 1
COUNTER=$((COUNTER + 1))
if [ $COUNTER -eq 30 ]; then
echo "Spider didn't stop gracefully, forcing..."
kill -9 $PID
break
fi
done
rm -f "$PID_FILE"
echo "Spider stopped"
Make Executable
chmod +x stop_spider.sh
Use It
./stop_spider.sh
Creating a Status Script
Create status_spider.sh
#!/bin/bash
SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"
if [ ! -f "$PID_FILE" ]; then
echo "Status: NOT RUNNING"
exit 0
fi
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "Status: RUNNING"
echo "PID: $PID"
echo "Started: $(ps -p $PID -o lstart=)"
echo "CPU: $(ps -p $PID -o %cpu=)%"
echo "Memory: $(ps -p $PID -o %mem=)%"
else
echo "Status: NOT RUNNING (stale PID file)"
rm -f "$PID_FILE"
fi
Make Executable
chmod +x status_spider.sh
Check Status
./status_spider.sh
Output:
Status: RUNNING
PID: 12345
Started: Mon Jan 15 14:30:22 2024
CPU: 12.5%
Memory: 2.8%
Running Multiple Spiders
Different Spiders
nohup scrapy crawl spider1 > spider1.log 2>&1 & echo $! > spider1.pid
nohup scrapy crawl spider2 > spider2.log 2>&1 & echo $! > spider2.pid
nohup scrapy crawl spider3 > spider3.log 2>&1 & echo $! > spider3.pid
Same Spider, Different Arguments
nohup scrapy crawl myspider -a category=electronics > spider_electronics.log 2>&1 &
nohup scrapy crawl myspider -a category=books > spider_books.log 2>&1 &
nohup scrapy crawl myspider -a category=clothing > spider_clothing.log 2>&1 &
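If those parallel runs also export data, give each one its own feed file too, or they will write over each other. A sketch (the -O overwrite flag needs Scrapy 2.4+; older versions only have -o, which appends):
mkdir -p data
nohup scrapy crawl myspider -a category=electronics -O data/electronics.json > spider_electronics.log 2>&1 &
nohup scrapy crawl myspider -a category=books -O data/books.json > spider_books.log 2>&1 &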
Advanced: Run Script with All Features
Complete run_spider.sh
#!/bin/bash
# Configuration
SPIDER_NAME="${1:-myspider}" # Use first argument or default
PROJECT_DIR="/home/user/scrapy_project"
LOG_DIR="$PROJECT_DIR/logs"
PID_DIR="$PROJECT_DIR/pids"
LOG_FILE="$LOG_DIR/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
PID_FILE="$PID_DIR/${SPIDER_NAME}.pid"
# Create directories
mkdir -p "$LOG_DIR" "$PID_DIR"
# Change to project directory
cd "$PROJECT_DIR" || exit 1
# Check if already running
if [ -f "$PID_FILE" ]; then
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "ERROR: Spider '$SPIDER_NAME' is already running (PID: $PID)"
exit 1
else
echo "Removing stale PID file..."
rm -f "$PID_FILE"
fi
fi
# Activate virtual environment if exists
if [ -d "venv" ]; then
source venv/bin/activate
fi
# Run spider
echo "Starting spider: $SPIDER_NAME"
echo "Log file: $LOG_FILE"
echo "PID file: $PID_FILE"
echo "----------------------------------------"
nohup scrapy crawl $SPIDER_NAME "${@:2}" > $LOG_FILE 2>&1 &
# Save PID
PID=$!
echo $PID > $PID_FILE
# Verify it started
sleep 2
if ps -p $PID > /dev/null 2>&1; then
echo "SUCCESS: Spider started (PID: $PID)"
echo ""
echo "Commands:"
echo " Watch logs: tail -f $LOG_FILE"
echo " Check status: ps -p $PID"
echo " Stop spider: kill $PID"
else
echo "ERROR: Spider failed to start"
echo "Check log file: $LOG_FILE"
rm -f "$PID_FILE"
exit 1
fi
Usage Examples
# Run default spider
./run_spider.sh
# Run specific spider
./run_spider.sh products
# Run with arguments
./run_spider.sh products -a category=electronics -a start_page=1
Monitoring Your Spider
Check CPU and Memory
ps aux | grep scrapy
Or more detailed:
top -p $(cat myspider.pid)
Count Scraped Items (if using JSON Lines)
wc -l output.jl # works because the JSON Lines (.jl) feed writes one item per line; a single .json array won't count this way
Watch Error Count
grep -c "ERROR" myspider.log
Watch Progress
tail -f myspider.log | grep "Crawled"
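Scrapy's LogStats extension also prints a progress summary about once a minute ("Crawled X pages ... scraped Y items ..."), which is a convenient line to follow. A sketch, using the LOGSTATS_INTERVAL setting to report more often:
# Follow the periodic progress summaries
tail -f myspider.log | grep "pages/min"
# Ask the spider to report every 15 seconds instead of the default 60
nohup scrapy crawl myspider -s LOGSTATS_INTERVAL=15 > myspider.log 2>&1 &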
Auto-Restart on Failure
Create restart_on_failure.sh
#!/bin/bash
SPIDER_NAME="myspider"
PID_FILE="${SPIDER_NAME}.pid"
mkdir -p logs
while true; do
  # Check if running
  if [ -f "$PID_FILE" ]; then
    PID=$(cat "$PID_FILE")
    if ps -p $PID > /dev/null 2>&1; then
      # Still running
      sleep 60 # Check every minute
      continue
    fi
  fi
  # Not running, start it with a fresh timestamped log
  LOG_FILE="logs/${SPIDER_NAME}_$(date +%Y%m%d_%H%M%S).log"
  echo "$(date): Spider not running, starting..."
  nohup scrapy crawl $SPIDER_NAME > "$LOG_FILE" 2>&1 &
  echo $! > "$PID_FILE"
  echo "$(date): Spider restarted (PID: $(cat $PID_FILE))"
  sleep 60
done
Run the monitor:
nohup ./restart_on_failure.sh > monitor.log 2>&1 &
Now your spider automatically restarts if it crashes!
Using screen (Alternative to nohup)
screen creates a virtual terminal that persists after disconnect.
Install screen
sudo apt install screen # Ubuntu/Debian
sudo yum install screen # CentOS/RHEL
Start screen session
screen -S myspider
Run spider
scrapy crawl myspider
Detach (keep running)
Press: Ctrl+A then D
List sessions
screen -ls
Output:
There is a screen on:
12345.myspider (Detached)
Reattach
screen -r myspider
Now you see your spider running!
Kill session
screen -X -S myspider quit
Using tmux (Another Alternative)
tmux is like screen but more modern.
Install tmux
sudo apt install tmux
Start session
tmux new -s myspider
Run spider
scrapy crawl myspider
Detach
Press: Ctrl+B then D
List sessions
tmux ls
Reattach
tmux attach -t myspider
Kill session
tmux kill-session -t myspider
Comparison: nohup vs screen vs tmux
nohup
Pros:
- Simple, no installation needed
- Works everywhere
- Good for fire-and-forget
Cons:
- Can't see live output (need tail -f)
- Can't interact with running process
- Basic features only
screen
Pros:
- See live output
- Interact with process
- Multiple windows
Cons:
- Needs installation
- Older, less maintained
- Learning curve
tmux
Pros:
- Modern, actively maintained
- Powerful features
- Better UI
- Scriptable
Cons:
- Needs installation
- More complex
- Overkill for simple tasks
My recommendation:
- Simple scrapers: nohup
- Development: tmux or screen
- Production: systemd service (covered in deployment blog)
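For a taste of what the production option looks like, here's a minimal systemd unit sketch (user, paths, and names are placeholders; the deployment post covers it properly):
# /etc/systemd/system/myspider.service (sketch)
[Unit]
Description=Scrapy spider: myspider
After=network.target

[Service]
User=user
WorkingDirectory=/home/user/scrapy_project
ExecStart=/home/user/scrapy_project/venv/bin/scrapy crawl myspider
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then load and start it with: sudo systemctl daemon-reload && sudo systemctl enable --now myspider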
Common Issues and Solutions
Issue 1: Spider Dies Immediately
Check logs:
cat myspider.log
Common causes:
- Python errors
- Import errors
- Settings errors
Issue 2: Can't Find PID File
Problem:
kill $(cat myspider.pid)
# cat: myspider.pid: No such file or directory
Solution:
ps aux | grep scrapy # Find PID manually
kill 12345 # Use actual PID
Issue 3: Log File Gets Too Large
Check size:
du -h myspider.log
Rotate logs:
# In your run script: rotate when the log exceeds ~100 MB (GNU stat; on macOS/BSD use stat -f%z)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt 100000000 ]; then
  mv "$LOG_FILE" "${LOG_FILE}.old"
fi
Issue 4: Multiple Instances Running
Find all:
ps aux | grep "scrapy crawl myspider"
Kill all:
pkill -f "scrapy crawl myspider"
Best Practices
1. Always Use Timestamped Logs
LOG_FILE="logs/myspider_$(date +%Y%m%d_%H%M%S).log"
This way you have history of all runs.
2. Monitor Disk Space
df -h # Check available space
Spider logs can fill disk!
3. Set Up Log Rotation
# Keep only last 10 log files
ls -t logs/myspider_*.log | tail -n +11 | xargs rm -f
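If the server already runs logrotate (most Linux distros do), a small drop-in config is the more standard way to do this. A sketch (path and limits are examples):
# /etc/logrotate.d/myspider (sketch)
/home/user/scrapy_project/logs/*.log {
  size 100M
  rotate 10
  compress
  missingok
  notifempty
}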
4. Use Process Monitoring
Create cron job to check spider health:
# Check every 5 minutes
*/5 * * * * /home/user/check_spider_health.sh
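The health-check script itself isn't shown above; a minimal sketch, reusing the PID-file convention from earlier (paths are examples):
#!/bin/bash
# check_spider_health.sh (sketch): restart the spider if its PID file is missing or stale
PID_FILE="/home/user/scrapy_project/pids/myspider.pid"
if [ ! -f "$PID_FILE" ] || ! ps -p "$(cat "$PID_FILE")" > /dev/null 2>&1; then
  echo "$(date): spider down, restarting" >> /home/user/scrapy_project/logs/health.log
  /home/user/scrapy_project/scripts/run_spider.sh myspider
fi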
5. Send Notifications
When spider finishes:
# At end of spider
echo "Spider finished" | mail -s "Spider Status" you@example.com
Complete Production Example
Full Setup
# Directory structure
/home/user/scrapy_project/
├── scrapy.cfg
├── myproject/
│ ├── spiders/
│ └── ...
├── logs/
├── pids/
└── scripts/
├── run_spider.sh
├── stop_spider.sh
├── status_spider.sh
└── monitor.sh
Production run_spider.sh
#!/bin/bash
set -e # Exit on error
SPIDER_NAME="${1:-myspider}"
PROJECT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Directories
LOG_DIR="$PROJECT_DIR/logs"
PID_DIR="$PROJECT_DIR/pids"
DATA_DIR="$PROJECT_DIR/data"
# Files
LOG_FILE="$LOG_DIR/${SPIDER_NAME}_${TIMESTAMP}.log"
PID_FILE="$PID_DIR/${SPIDER_NAME}.pid"
DATA_FILE="$DATA_DIR/${SPIDER_NAME}_${TIMESTAMP}.json"
# Create directories
mkdir -p "$LOG_DIR" "$PID_DIR" "$DATA_DIR"
# Change to project directory
cd "$PROJECT_DIR"
# Check if already running
if [ -f "$PID_FILE" ]; then
PID=$(cat "$PID_FILE")
if ps -p $PID > /dev/null 2>&1; then
echo "ERROR: Spider already running (PID: $PID)"
exit 1
fi
rm -f "$PID_FILE"
fi
# Activate virtual environment
if [ -d "venv" ]; then
source venv/bin/activate
fi
# Start spider
echo "$(date): Starting spider $SPIDER_NAME"
nohup scrapy crawl $SPIDER_NAME \
-o $DATA_FILE \
"${@:2}" \
> $LOG_FILE 2>&1 &
PID=$!
echo $PID > $PID_FILE
# Verify started
sleep 2
if ps -p $PID > /dev/null 2>&1; then
echo "SUCCESS: Spider started"
echo "PID: $PID"
echo "Log: $LOG_FILE"
echo "Data: $DATA_FILE"
else
echo "ERROR: Spider failed to start"
cat $LOG_FILE
rm -f "$PID_FILE"
exit 1
fi
Summary
nohup basics:
nohup scrapy crawl myspider > spider.log 2>&1 &
Save PID:
nohup scrapy crawl myspider > spider.log 2>&1 & echo $! > spider.pid
Watch logs:
tail -f spider.log
Stop spider:
kill $(cat spider.pid)
Alternatives:
- screen - Interactive sessions
- tmux - Modern terminal multiplexer
- systemd - Production service management
Best practices:
- Use run scripts
- Save PIDs
- Timestamped logs
- Monitor resources
- Auto-restart on failure
Remember:
- Always redirect output (> spider.log 2>&1)
- Always run in background (&)
- Always save PID for later
- Watch logs with tail -f
- Use screen/tmux for development
nohup is simple, reliable, and works everywhere. Perfect for running Scrapy spiders on servers!
Happy scraping! 🕷️