I spent 3 hours setting up Scrapy on a new server once. Installing Python, dependencies, libraries, system packages. Then it worked on my machine but broke on the server because of version differences.
The next time, I used Docker. Setup took 5 minutes. It worked identically on my laptop, the server, and my teammate's Windows machine.
Docker eliminates "works on my machine" problems forever. Let me show you how to Dockerize your Scrapy project properly.
What Is Docker and Why Use It?
Docker packages your entire application (code + dependencies + environment) into a container.
Benefits for Scrapy:
- Same environment everywhere (laptop, server, cloud)
- No dependency hell
- Easy deployment
- Isolation from host system
- Version control for entire environment
Analogy: Shipping containers for code. Works the same wherever you ship it.
Installing Docker
Linux
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
Mac/Windows
Download Docker Desktop from docker.com
Verify installation:
docker --version
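If the daemon is running and your user is in the docker group (after the usermod above you may need to log out and back in, or run newgrp docker), a quick smoke test:
docker run hello-world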
Creating a Dockerfile
This defines your container.
Basic Dockerfile
Create Dockerfile in your project root:
# Use official Python image
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Copy requirements first (for caching)
COPY requirements.txt .
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy project files
COPY . .
# Run spider
CMD ["scrapy", "crawl", "myspider"]
Build Image
docker build -t myspider .
Run Container
docker run myspider
That's it! Your spider runs in a container.
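To actually get the scraped items back out of the container, mount a host folder and write the feed there. A minimal sketch, assuming the spider is named myspider and Scrapy 2.4+ (for the -O overwrite flag):
mkdir -p output
docker run --rm -v "$PWD/output:/app/output" myspider scrapy crawl myspider -O output/items.json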
Better Dockerfile (Production Ready)
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
libxml2-dev \
libxslt1-dev \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN useradd -m -u 1000 scrapy
# Set working directory
WORKDIR /app
# Copy and install requirements as root
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy project files
COPY . .
# Change ownership
RUN chown -R scrapy:scrapy /app
# Switch to non-root user
USER scrapy
# Set environment variables
ENV PYTHONUNBUFFERED=1
# Run spider
CMD ["scrapy", "crawl", "myspider"]
What this adds:
- System dependencies (for lxml, etc.)
- Non-root user (security)
- Unbuffered output (see logs immediately)
- Proper cleanup (smaller image)
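Building and running it works exactly like before; tagging the image makes it easy to tell variants apart (the tag name is just an example):
docker build -t myspider:prod .
docker run --rm myspider:prod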
Docker Compose for Complete Setup
Docker Compose manages multiple containers.
docker-compose.yml
version: '3.8'

services:
  spider:
    build: .
    container_name: myspider
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/scrapy
      - LOG_LEVEL=INFO
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: scrapy_db
    environment:
      - POSTGRES_DB=scrapy
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

volumes:
  postgres_data:
Run with Compose
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f spider
# Stop
docker-compose down
# Rebuild and restart
docker-compose up -d --build
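Two more commands I reach for with this setup (they assume the service and database names from the compose file above):
# Run a one-off crawl without touching the long-running spider service
docker-compose run --rm spider scrapy crawl myspider
# Poke at the database from inside the db container
docker-compose exec db psql -U postgres -d scrapy -c '\dt'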
Passing Arguments to Spider
Method 1: Environment Variables
# Dockerfile
CMD ["scrapy", "crawl", "myspider"]
# docker-compose.yml
services:
  spider:
    environment:
      - START_PAGE=1
      - END_PAGE=100
# spider.py
import os
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_page = int(os.getenv('START_PAGE', 1))
        self.end_page = int(os.getenv('END_PAGE', 10))
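With plain docker run, the same variables can be overridden per run (values are just examples):
docker run --rm -e START_PAGE=1 -e END_PAGE=50 myspider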
Method 2: Command Override
docker run myspider scrapy crawl myspider -a start=1 -a end=100
Or with compose:
services:
  spider:
    command: scrapy crawl myspider -a start=1 -a end=100
Persistent Data with Volumes
Data written inside a container is lost when the container is removed. Use volumes!
Named Volumes
services:
  spider:
    volumes:
      - scraped_data:/app/output
      - spider_logs:/app/logs

volumes:
  scraped_data:
  spider_logs:
Data persists even after container removal.
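To see what Docker created, or to pull data out of a named volume, something like this works (note that Compose prefixes volume names with the project name, e.g. myproject_scraped_data):
docker volume ls
docker volume inspect scraped_data    # use the prefixed name if the volume was created by Compose
# Copy the volume's contents to the host via a throwaway Alpine container
docker run --rm -v scraped_data:/data -v "$PWD:/backup" alpine tar czf /backup/scraped_data.tar.gz -C /data .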
Bind Mounts (Local Folders)
services:
  spider:
    volumes:
      - ./output:/app/output   # Maps local ./output to container /app/output
      - ./logs:/app/logs
Now files appear in your local folders!
Multi-Stage Builds (Smaller Images)
Reduce image size by splitting build and runtime:
# Build stage
FROM python:3.11-slim as builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y gcc g++ libxml2-dev libxslt1-dev
# Install Python packages
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Runtime stage
FROM python:3.11-slim
# Install only runtime dependencies
RUN apt-get update && apt-get install -y libxml2 libxslt1.1 && rm -rf /var/lib/apt/lists/*
# Copy installed packages from builder
COPY --from=builder /root/.local /root/.local
# Add to PATH
ENV PATH=/root/.local/bin:$PATH
WORKDIR /app
COPY . .
CMD ["scrapy", "crawl", "myspider"]
Result: Much smaller final image!
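To see the difference, build the multi-stage version under a separate tag and compare sizes (tags are examples):
docker build -t myspider:multistage .
docker image ls myspider    # compare SIZE against the single-stage build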
Scheduling with Docker
Method 1: Cron in Container
FROM python:3.11-slim
# Install cron
RUN apt-get update && apt-get install -y cron
# Copy crontab into /etc/cron.d (entries there must include a user field, see below)
COPY crontab /etc/cron.d/spider-cron
RUN chmod 0644 /etc/cron.d/spider-cron
WORKDIR /app
COPY . .
# Start cron
CMD ["cron", "-f"]
# crontab file (the extra "root" field names the user, required for files in /etc/cron.d)
0 2 * * * root cd /app && scrapy crawl myspider >> /var/log/cron.log 2>&1
Method 2: External Cron
# Host machine crontab (--rm removes the finished container afterwards)
0 2 * * * docker run --rm myspider
Method 3: Docker Compose with Restart Policy
services:
  spider:
    build: .
    restart: "no"   # Run once and stop
Then use host cron:
0 2 * * * cd /path/to/project && docker-compose up
Logging in Docker
View Container Logs
# Follow logs
docker logs -f myspider
# Last 100 lines
docker logs --tail 100 myspider
# With timestamps
docker logs -t myspider
Persist Logs to Host
services:
  spider:
    volumes:
      - ./logs:/app/logs
Now logs save to local ./logs/ folder.
Log Rotation
services:
  spider:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
Limits log size to prevent filling disk.
Environment-Specific Builds
Development Dockerfile
# Dockerfile.dev
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install dev tools
RUN pip install ipdb pytest
COPY . .
# Enable debug logging
ENV LOG_LEVEL=DEBUG
CMD ["scrapy", "crawl", "myspider"]
Production Dockerfile
# Dockerfile.prod
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Production logging
ENV LOG_LEVEL=INFO
# Run as non-root
USER nobody
CMD ["scrapy", "crawl", "myspider"]
Build Specific Version
# Development
docker build -f Dockerfile.dev -t myspider:dev .
# Production
docker build -f Dockerfile.prod -t myspider:prod .
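For day-to-day development, I usually bind-mount the source into the dev image and work from a shell, so code changes don't need a rebuild (paths and tags follow the examples above; bash ships with the Debian-slim base):
docker run --rm -it -v "$PWD:/app" myspider:dev bash
# inside the container:
scrapy crawl myspider -s LOG_LEVEL=DEBUG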
Scrapy + Selenium + Docker
For JavaScript-heavy sites:
FROM python:3.11-slim
# Install Chrome and ChromeDriver
RUN apt-get update && apt-get install -y \
wget \
gnupg \
unzip \
&& wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
&& apt-get update \
&& apt-get install -y google-chrome-stable \
&& rm -rf /var/lib/apt/lists/*
# Install ChromeDriver
# NOTE: the ChromeDriver version must match the installed Chrome version.
# This legacy URL only serves drivers up to Chrome 114; for Chrome 115+ use the
# "Chrome for Testing" download endpoints instead.
RUN wget -q "https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_linux64.zip" \
&& unzip chromedriver_linux64.zip \
&& mv chromedriver /usr/local/bin/ \
&& chmod +x /usr/local/bin/chromedriver \
&& rm chromedriver_linux64.zip
WORKDIR /app
COPY requirements.txt .
RUN pip install scrapy scrapy-selenium
COPY . .
CMD ["scrapy", "crawl", "myspider"]
Scrapy + Playwright + Docker
Modern JavaScript rendering:
FROM python:3.11-slim
# Install dependencies for Playwright
RUN apt-get update && apt-get install -y \
libnss3 \
libatk-bridge2.0-0 \
libdrm2 \
libxkbcommon0 \
libgbm1 \
libasound2 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install scrapy scrapy-playwright
RUN playwright install chromium
RUN playwright install-deps chromium
COPY . .
CMD ["scrapy", "crawl", "myspider"]
Docker Networking
Connect Multiple Spiders
services:
  spider1:
    build: .
    command: scrapy crawl spider1
    networks:
      - scrapy-network

  spider2:
    build: .
    command: scrapy crawl spider2
    networks:
      - scrapy-network

  db:
    image: postgres:15
    networks:
      - scrapy-network

networks:
  scrapy-network:
    driver: bridge
All services on scrapy-network can reach each other, using their service names as hostnames.
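A quick way to verify the name resolution while the containers are still running (getent ships with the Debian-based Python image, pg_isready with the postgres image):
docker-compose exec spider1 getent hosts db     # resolves the db service's container IP
docker-compose exec db pg_isready -U postgres   # confirms Postgres is accepting connections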
Health Checks
Monitor container health:
FROM python:3.11-slim
WORKDIR /app
COPY . .
# Add health check script
COPY healthcheck.py .
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD python healthcheck.py || exit 1
CMD ["scrapy", "crawl", "myspider"]
# healthcheck.py
import os
import sys
import time

OUTPUT_FILE = '/app/data/output.json'

# Unhealthy if the output file is missing
if not os.path.exists(OUTPUT_FILE):
    sys.exit(1)

# Unhealthy if the output file is more than 1 hour old
age = time.time() - os.path.getmtime(OUTPUT_FILE)
if age > 3600:
    sys.exit(1)

sys.exit(0)  # Healthy
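Once a health check is defined, Docker reports the status; two ways to read it (replace myspider with your container name or ID from docker ps):
docker ps                                                      # STATUS column shows (healthy) / (unhealthy)
docker inspect --format='{{.State.Health.Status}}' myspider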
Optimizing Docker Images
Use .dockerignore
Create .dockerignore:
__pycache__/
*.pyc
*.pyo
*.pyd
.git/
.gitignore
.venv/
venv/
*.log
.env
.scrapy/
Excludes unnecessary files from image.
Layer Caching
# Good (requirements cached separately)
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# Bad (re-installs on any code change)
COPY . .
RUN pip install -r requirements.txt
Use Slim Images
# Large (1 GB)
FROM python:3.11
# Smaller (150 MB)
FROM python:3.11-slim
# Smallest (~50 MB), but Alpine uses musl instead of glibc,
# so C extensions like lxml may need extra build steps
FROM python:3.11-alpine
Complete Production Example
Here's everything together:
Dockerfile:
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
gcc g++ libxml2-dev libxslt1-dev \
&& rm -rf /var/lib/apt/lists/*
RUN useradd -m -u 1000 scrapy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN chown -R scrapy:scrapy /app
USER scrapy
ENV PYTHONUNBUFFERED=1
HEALTHCHECK --interval=60s --timeout=5s \
CMD test -f /app/logs/spider.log || exit 1
CMD ["scrapy", "crawl", "myspider"]
docker-compose.yml:
version: '3.8'

services:
  spider:
    build: .
    container_name: production_spider
    environment:
      - DATABASE_URL=postgresql://postgres:${DB_PASSWORD}@db:5432/scrapy
      - LOG_LEVEL=INFO
      - USER_AGENT=MyBot/1.0
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
    depends_on:
      - db
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    networks:
      - scrapy_network

  db:
    image: postgres:15
    container_name: scrapy_postgres
    environment:
      - POSTGRES_DB=scrapy
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - scrapy_network
    restart: unless-stopped

volumes:
  postgres_data:

networks:
  scrapy_network:
    driver: bridge
.env:
DB_PASSWORD=your_secure_password
Deploy:
docker-compose up -d
docker-compose logs -f spider
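Two commands that are handy once this stack is up (the -T flag stops docker-compose from allocating a TTY so the SQL dump isn't mangled; names match the compose file above):
docker-compose ps                                                     # check both services are running/healthy
docker-compose exec -T db pg_dump -U postgres scrapy > backup.sql     # ad-hoc database backup on the host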
Troubleshooting Docker Issues
Container Exits Immediately
Check logs:
docker logs myspider
Common causes:
- Command syntax error
- Missing dependencies
- Python import errors
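A quick way to dig deeper is to open a shell in the image and run Scrapy by hand (assumes the myspider image from earlier; bash ships with the Debian-slim base):
docker run --rm -it myspider bash
# inside the container:
scrapy list                              # fails loudly if the project or a spider doesn't import
scrapy crawl myspider -s LOG_LEVEL=DEBUG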
Can't Connect to Database
Check network:
docker network ls
docker inspect scrapy_network
Use service name (not localhost):
# WRONG
DATABASE_URL = 'postgresql://localhost:5432/db'
# RIGHT (in Docker)
DATABASE_URL = 'postgresql://db:5432/db'
Permission Denied
Run as correct user:
USER scrapy
Or fix ownership:
RUN chown -R scrapy:scrapy /app
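If the error involves bind-mounted host folders (./data, ./logs), the mismatch is usually between the host directory's owner and the container user's UID (1000 in the production Dockerfile above); a quick fix on the host, assuming those paths:
ls -ln ./data ./logs                      # shows numeric UIDs on the host
sudo chown -R 1000:1000 ./data ./logs     # match the container user created with -u 1000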
Summary
Why Docker:
- Consistent environment everywhere
- Easy deployment
- Isolated from host
- Version controlled environment
Basic workflow:
- Create a Dockerfile
- Build the image: docker build -t myspider .
- Run the container: docker run myspider
Production setup:
- Multi-stage builds (smaller images)
- Non-root user (security)
- Volume mounts (persist data)
- Docker Compose (manage services)
- Health checks (monitor status)
Best practices:
- Use .dockerignore
- Layer caching (requirements first)
- Slim base images
- Environment variables for config
- Volume mounts for data/logs
Remember:
- Service names for networking (not localhost)
- Volumes for persistent data
- Logs to stdout or volume
- Health checks for monitoring
Start with a basic Dockerfile, then add Docker Compose once you need databases or multiple services!
Happy scraping! 🕷️