Muhammad Ikramullah Khan

Scrapy with Docker: Deploy Anywhere in Minutes

I once spent three hours setting up Scrapy on a new server: installing Python, dependencies, libraries, and system packages. Then it worked on my machine but broke on the server because of version differences.

The next time, I used Docker. Setup took 5 minutes. It worked identically on my laptop, the server, and my teammate's Windows machine.

Docker eliminates "works on my machine" problems forever. Let me show you how to Dockerize your Scrapy project properly.


What Is Docker and Why Use It?

Docker packages your entire application (code + dependencies + environment) into a container.

Benefits for Scrapy:

  • Same environment everywhere (laptop, server, cloud)
  • No dependency hell
  • Easy deployment
  • Isolation from host system
  • Version control for entire environment

Analogy: Shipping containers for code. Works the same wherever you ship it.


Installing Docker

Linux

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER

Mac/Windows

Download Docker Desktop from docker.com

Verify installation:

docker --version
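
If you want a fuller smoke test, Docker's own hello-world image confirms that the daemon can pull and run containers:

docker run hello-world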

Creating a Dockerfile

This defines your container.

Basic Dockerfile

Create Dockerfile in your project root:

# Use official Python image
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Copy requirements first (for caching)
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

# Run spider
CMD ["scrapy", "crawl", "myspider"]
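
This Dockerfile assumes a requirements.txt sits next to it. As a minimal sketch (pin the versions your project actually uses; anything beyond Scrapy itself depends on your pipelines):

# requirements.txt (example)
scrapy>=2.11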

Build Image

docker build -t myspider .

Run Container

docker run myspider

That's it! Your spider runs in a container.


Better Dockerfile (Production Ready)

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libxml2-dev \
    libxslt1-dev \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN useradd -m -u 1000 scrapy

# Set working directory
WORKDIR /app

# Copy and install requirements as root
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

# Change ownership
RUN chown -R scrapy:scrapy /app

# Switch to non-root user
USER scrapy

# Set environment variables
ENV PYTHONUNBUFFERED=1

# Run spider
CMD ["scrapy", "crawl", "myspider"]

What this adds:

  • System dependencies (for lxml, etc.)
  • Non-root user (security)
  • Unbuffered output (see logs immediately)
  • Proper cleanup (smaller image)

Docker Compose for Complete Setup

Docker Compose manages multiple containers.

docker-compose.yml

version: '3.8'

services:
  spider:
    build: .
    container_name: myspider
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/scrapy
      - LOG_LEVEL=INFO
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: scrapy_db
    environment:
      - POSTGRES_DB=scrapy
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

volumes:
  postgres_data:

Run with Compose

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f spider

# Stop
docker-compose down

# Rebuild and restart
docker-compose up -d --build

Passing Arguments to Spider

Method 1: Environment Variables

# Dockerfile
CMD ["scrapy", "crawl", "myspider"]
# docker-compose.yml
services:
  spider:
    environment:
      - START_PAGE=1
      - END_PAGE=100
# spider.py
import os

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_page = int(os.getenv('START_PAGE', 1))
        self.end_page = int(os.getenv('END_PAGE', 10))
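
Without Compose, you can pass the same variables directly with docker run; the -e flag sets environment variables inside the container:

docker run -e START_PAGE=1 -e END_PAGE=100 myspider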

Method 2: Command Override

docker run myspider scrapy crawl myspider -a start=1 -a end=100

Or with compose:

services:
  spider:
    command: scrapy crawl myspider -a start=1 -a end=100

Persistent Data with Volumes

Data written inside a container is lost when the container is removed. Use volumes!

Named Volumes

services:
  spider:
    volumes:
      - scraped_data:/app/output
      - spider_logs:/app/logs

volumes:
  scraped_data:
  spider_logs:

Data persists even after container removal.

Bind Mounts (Local Folders)

services:
  spider:
    volumes:
      - ./output:/app/output  # Maps local ./output to container /app/output
      - ./logs:/app/logs

Now files appear in your local folders!
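
The same bind mounts work without Compose. The -v flag maps a host path to a container path, and host paths must be absolute, hence $(pwd):

docker run -v "$(pwd)/output:/app/output" -v "$(pwd)/logs:/app/logs" myspider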


Multi-Stage Builds (Smaller Images)

Reduce image size by splitting build and runtime:

# Build stage
FROM python:3.11-slim as builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y gcc g++ libxml2-dev libxslt1-dev

# Install Python packages
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.11-slim

# Install only runtime dependencies
RUN apt-get update && apt-get install -y libxml2 libxslt1.1 && rm -rf /var/lib/apt/lists/*

# Copy installed packages from builder
COPY --from=builder /root/.local /root/.local

# Add to PATH
ENV PATH=/root/.local/bin:$PATH

WORKDIR /app
COPY . .

CMD ["scrapy", "crawl", "myspider"]

Result: Much smaller final image!
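
To verify the saving, build both variants and compare them with docker images (the Dockerfile.multistage filename is just an assumption for keeping the two versions side by side):

docker build -t myspider:single .
docker build -f Dockerfile.multistage -t myspider:multi .
docker images myspider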


Scheduling with Docker

Method 1: Cron in Container

FROM python:3.11-slim

# Install cron
RUN apt-get update && apt-get install -y cron

# Copy crontab
COPY crontab /etc/cron.d/spider-cron
RUN chmod 0644 /etc/cron.d/spider-cron
RUN crontab /etc/cron.d/spider-cron

WORKDIR /app
COPY . .

# Start cron
CMD ["cron", "-f"]
# crontab file (note: cron requires the file to end with a newline)
0 2 * * * cd /app && scrapy crawl myspider >> /var/log/cron.log 2>&1

Method 2: External Cron

# Host machine crontab
0 2 * * * docker run myspider
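
One caveat with host cron: each run leaves a stopped container behind unless you add --rm, and you won't see errors unless the output goes somewhere. A more robust crontab entry might look like:

0 2 * * * docker run --rm myspider >> /var/log/spider-cron.log 2>&1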

Method 3: Docker Compose with Restart Policy

services:
  spider:
    build: .
    restart: "no"  # Run once and stop

Then use host cron:

0 2 * * * cd /path/to/project && docker-compose up

Logging in Docker

View Container Logs

# Follow logs
docker logs -f myspider

# Last 100 lines
docker logs --tail 100 myspider

# With timestamps
docker logs -t myspider

Persist Logs to Host

services:
  spider:
    volumes:
      - ./logs:/app/logs

Now logs save to local ./logs/ folder.

Centralized Logging

services:
  spider:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

The json-file driver with rotation options caps local log size so containers don't fill the disk; for truly centralized logging you would swap in a remote driver such as syslog, gelf, or fluentd.


Environment-Specific Builds

Development Dockerfile

# Dockerfile.dev
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

# Install dev tools
RUN pip install ipdb pytest

COPY . .

# Enable debug logging
ENV LOG_LEVEL=DEBUG

CMD ["scrapy", "crawl", "myspider"]

Production Dockerfile

# Dockerfile.prod
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Production logging
ENV LOG_LEVEL=INFO

# Run as non-root
USER nobody

CMD ["scrapy", "crawl", "myspider"]

Build Specific Version

# Development
docker build -f Dockerfile.dev -t myspider:dev .

# Production
docker build -f Dockerfile.prod -t myspider:prod .
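
During development it can help to bind-mount your project over /app so code changes on the host are picked up without rebuilding (a sketch; adjust paths to your layout):

docker run --rm -it -v "$(pwd):/app" myspider:dev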

Scrapy + Selenium + Docker

For JavaScript-heavy sites:

FROM python:3.11-slim

# Install Chrome and ChromeDriver
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    unzip \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*

# Install ChromeDriver (the driver version must match the installed Chrome version;
# Chrome 115+ ships drivers via the "Chrome for Testing" downloads instead of this URL)
RUN wget -q "https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_linux64.zip" \
    && unzip chromedriver_linux64.zip \
    && mv chromedriver /usr/local/bin/ \
    && chmod +x /usr/local/bin/chromedriver

WORKDIR /app

# requirements.txt should list scrapy and scrapy-selenium
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["scrapy", "crawl", "myspider"]
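
The Dockerfile is only half of the setup: scrapy-selenium also needs a few Scrapy settings, and inside a container Chrome generally has to run headless with the sandbox disabled. A sketch of the relevant settings.py entries (paths and flags may differ in your image):

# settings.py (sketch for scrapy-selenium in Docker)
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/chromedriver'
SELENIUM_DRIVER_ARGUMENTS = ['--headless', '--no-sandbox', '--disable-dev-shm-usage']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}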

Scrapy + Playwright + Docker

Modern JavaScript rendering:

FROM python:3.11-slim

# Install dependencies for Playwright
RUN apt-get update && apt-get install -y \
    libnss3 \
    libatk-bridge2.0-0 \
    libdrm2 \
    libxkbcommon0 \
    libgbm1 \
    libasound2 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# requirements.txt should list scrapy and scrapy-playwright
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Download the Chromium build Playwright expects (install-deps covers any remaining OS packages)
RUN playwright install chromium
RUN playwright install-deps chromium

COPY . .

CMD ["scrapy", "crawl", "myspider"]
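
As with Selenium, the container alone isn't enough: scrapy-playwright is wired in through Scrapy settings. Per the scrapy-playwright docs, you register its download handlers and switch to the asyncio reactor:

# settings.py (scrapy-playwright)
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'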

Docker Networking

Connect Multiple Spiders

services:
  spider1:
    build: .
    command: scrapy crawl spider1
    networks:
      - scrapy-network

  spider2:
    build: .
    command: scrapy crawl spider2
    networks:
      - scrapy-network

  db:
    image: postgres:15
    networks:
      - scrapy-network

networks:
  scrapy-network:
    driver: bridge

All services can communicate on scrapy-network.


Health Checks

Monitor container health:

FROM python:3.11-slim

WORKDIR /app
COPY . .

# Add health check script
COPY healthcheck.py .

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD python healthcheck.py || exit 1

CMD ["scrapy", "crawl", "myspider"]
# healthcheck.py
import os
import sys
import time

OUTPUT_FILE = '/app/data/output.json'

# Unhealthy if the output file is missing
if not os.path.exists(OUTPUT_FILE):
    sys.exit(1)

# Unhealthy if the output file is more than 1 hour old
age = time.time() - os.path.getmtime(OUTPUT_FILE)
if age > 3600:
    sys.exit(1)

sys.exit(0)  # Healthy
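
Once the HEALTHCHECK is in place, you can ask Docker for a container's current status:

docker inspect --format='{{.State.Health.Status}}' myspider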

Optimizing Docker Images

Use .dockerignore

Create .dockerignore:

__pycache__/
*.pyc
*.pyo
*.pyd
.git/
.gitignore
.venv/
venv/
*.log
.env
.scrapy/

Excludes unnecessary files from image.

Layer Caching

# Good (requirements cached separately)
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

# Bad (re-installs on any code change)
COPY . .
RUN pip install -r requirements.txt

Use Slim Images

# Large (~1 GB)
FROM python:3.11

# Smaller (~150 MB)
FROM python:3.11-slim

# Smallest (~50 MB), but musl-based: some tools are missing and packages like lxml need extra build steps
FROM python:3.11-alpine

Complete Production Example

Here's everything together:

Dockerfile:

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    gcc g++ libxml2-dev libxslt1-dev \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 scrapy

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chown -R scrapy:scrapy /app

USER scrapy

ENV PYTHONUNBUFFERED=1

HEALTHCHECK --interval=60s --timeout=5s \
  CMD test -f /app/logs/spider.log || exit 1

CMD ["scrapy", "crawl", "myspider"]

docker-compose.yml:

version: '3.8'

services:
  spider:
    build: .
    container_name: production_spider
    environment:
      - DATABASE_URL=postgresql://postgres:${DB_PASSWORD}@db:5432/scrapy
      - LOG_LEVEL=INFO
      - USER_AGENT=MyBot/1.0
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
    depends_on:
      - db
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    networks:
      - scrapy_network

  db:
    image: postgres:15
    container_name: scrapy_postgres
    environment:
      - POSTGRES_DB=scrapy
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - scrapy_network
    restart: unless-stopped

volumes:
  postgres_data:

networks:
  scrapy_network:
    driver: bridge

.env:

DB_PASSWORD=your_secure_password

Deploy:

docker-compose up -d
docker-compose logs -f spider

Troubleshooting Docker Issues

Container Exits Immediately

Check logs:

docker logs myspider

Common causes:

  • Command syntax error
  • Missing dependencies
  • Python import errors

Can't Connect to Database

Check network:

docker network ls
docker inspect scrapy_network

Use service name (not localhost):

# WRONG
DATABASE_URL = 'postgresql://localhost:5432/db'

# RIGHT (in Docker)
DATABASE_URL = 'postgresql://db:5432/db'
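
Another common culprit: depends_on only waits for the db container to start, not for Postgres to accept connections. With a recent Docker Compose you can add a health check to the db service and make the spider wait for it (a sketch using Postgres's pg_isready):

services:
  spider:
    depends_on:
      db:
        condition: service_healthy

  db:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5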

Permission Denied

Run as correct user:

USER scrapy

Or fix ownership:

RUN chown -R scrapy:scrapy /app

Summary

Why Docker:

  • Consistent environment everywhere
  • Easy deployment
  • Isolated from host
  • Version controlled environment

Basic workflow:

  1. Create Dockerfile
  2. Build image: docker build -t myspider .
  3. Run container: docker run myspider

Production setup:

  • Multi-stage builds (smaller images)
  • Non-root user (security)
  • Volume mounts (persist data)
  • Docker Compose (manage services)
  • Health checks (monitor status)

Best practices:

  • Use .dockerignore
  • Layer caching (requirements first)
  • Slim base images
  • Environment variables for config
  • Volume mounts for data/logs

Remember:

  • Service names for networking (not localhost)
  • Volumes for persistent data
  • Logs to stdout or volume
  • Health checks for monitoring

Start with a basic Dockerfile, then add Docker Compose once you need databases or multiple services!

Happy scraping! 🕷️
