
Alex Spinov

The Perfect Docker Setup for Web Scraping (I Spent Months Getting This Right)

I've dockerized 20+ scraping projects. Every time, I hit the same problems:

  • Playwright browsers bloating the image to 2GB+
  • Chrome crashing with 'out of memory' in containers
  • Different behavior between local and production
  • Slow builds when changing one line of code

Here's the Dockerfile I now use for every project. It took months of pain to get right.


The Dockerfile

# Stage 1: Dependencies (cached layer)
FROM python:3.12-slim AS deps

WORKDIR /app

# System deps for Playwright/Chrome
RUN apt-get update && apt-get install -y --no-install-recommends \
    libnss3 libatk1.0-0 libatk-bridge2.0-0 libdrm2 \
    libxkbcommon0 libxcomposite1 libxdamage1 libxrandr2 \
    libgbm1 libasound2 libpango-1.0-0 libcairo2 \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install ONLY Chromium (not all browsers), into a fixed path the
# non-root user can read. The system libraries are already installed
# above, so --with-deps is unnecessary here.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
RUN playwright install chromium

# Stage 2: App
FROM deps AS app

WORKDIR /app
COPY . .

# Non-root user (important for security); make the browser readable to it
RUN useradd -m scraper \
    && chmod -R a+rX /ms-playwright
USER scraper

CMD ["python", "scrape.py"]

Image size: ~650MB (vs 2.1GB with the default Playwright image)
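For reference, a minimal scrape.py that the Dockerfile's CMD would run might look like this. This is a sketch: the URL, the RUN_SCRAPER guard, and the flag choices are my assumptions, not from the original setup.

```python
import os


def chromium_launch_args() -> list[str]:
    # Flags that behave well in containers: --no-sandbox because Chrome's
    # sandbox usually can't start under Docker's default seccomp profile,
    # and --disable-dev-shm-usage as a safety net if shm_size wasn't raised
    return ["--no-sandbox", "--disable-dev-shm-usage"]


def main() -> None:
    from playwright.sync_api import sync_playwright  # deferred import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=chromium_launch_args())
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()


# Guarded so that importing this module (e.g. in tests) never launches a
# browser; set RUN_SCRAPER=1 in the container environment
if __name__ == "__main__" and os.environ.get("RUN_SCRAPER"):
    main()
```

One more tip: pin your Playwright version in requirements.txt so the browser downloaded at build time matches the library version at run time.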


Why This Works

1. Multi-Stage Build

The deps stage is cached. When you change your Python code, Docker only rebuilds the app stage — 5 seconds instead of 5 minutes.

2. Chromium Only

playwright install chromium installs just Chromium (~350MB). The default playwright install pulls Chromium + Firefox + WebKit (~1.2GB); you almost never need all three. The --with-deps flag would additionally apt-install Chromium's system libraries, which this Dockerfile already lists explicitly.

3. Slim Base Image

python:3.12-slim is 130MB vs python:3.12 at 1GB. The manual apt-get installs only the exact libraries Chromium needs.

4. Non-Root User

Running Chrome as root in Docker works but is a security risk: a compromised browser process would have root inside the container. The dedicated scraper user limits the blast radius.


The docker-compose.yml

services:
  scraper:
    build: .
    restart: unless-stopped
    environment:
      - PYTHONUNBUFFERED=1
    deploy:
      resources:
        limits:
          memory: 1G    # Prevent Chrome from eating all RAM
          cpus: '1.0'
    volumes:
      - ./data:/app/data   # Persist scraped data
    shm_size: '256m'       # CRITICAL: Chrome needs this

The shm_size Trick

Chrome in Docker crashes with "out of memory" even when the container has 4GB of RAM. The culprit is /dev/shm: Docker caps it at 64MB by default, and Chrome needs far more shared memory than that.

Three fixes:

# Option 1: Increase shm_size
shm_size: '256m'

# Option 2: Disable shm usage in Chrome
# In your Python code:
browser = playwright.chromium.launch(
    args=['--disable-dev-shm-usage']
)

# Option 3: Mount /dev/shm
volumes:
  - /dev/shm:/dev/shm

I use Option 1 — cleanest solution.
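To confirm the fix actually landed, you can check the tmpfs capacity from inside the running container. A quick sketch (not from the original post), using only the standard library:

```python
import os


def shm_size_mb(path: str = "/dev/shm") -> float:
    """Capacity of a mount in megabytes, via statvfs."""
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize / (1024 * 1024)


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        size = shm_size_mb()
        print(f"/dev/shm: {size:.0f} MB")
        if size < 256:
            print("warning: below 256 MB, Chrome may crash under load")
```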


Cron Scheduling

For daily scrapes, add a cron service:

  scheduler:
    image: mcuadros/ofelia:latest
    command: daemon --docker
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    labels:
      # ofelia cron expressions have a leading seconds field
      ofelia.job-run.scrape.schedule: "0 0 8 * * *"
      # image name is an assumption; compose builds it as <project>-<service>
      ofelia.job-run.scrape.image: "myproject-scraper"

(A job-local job would run its command inside the ofelia container itself, which has no Docker CLI; job-run launches a fresh container from your scraper image instead.)

Or simply use crontab on the host (swap /path/to/project for your compose project directory):

0 8 * * * cd /path/to/project && docker compose run --rm scraper >> /var/log/scraper.log 2>&1

Common Mistakes I Made (So You Don't Have To)

1. Installing chromium-browser via apt

# BAD — version mismatch with Playwright
RUN apt-get install -y chromium-browser

# GOOD — Playwright downloads the exact version it needs
RUN playwright install chromium

2. Missing --no-cache-dir

# BAD — pip cache bloats the image by 200MB+
RUN pip install -r requirements.txt

# GOOD
RUN pip install --no-cache-dir -r requirements.txt

3. COPY . . Before requirements

# BAD — any code change invalidates pip cache
COPY . .
RUN pip install -r requirements.txt

# GOOD — requirements layer is cached
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

4. Running as Root

# BAD — security risk
CMD ["python", "scrape.py"]

# GOOD
RUN useradd -m scraper
USER scraper
CMD ["python", "scrape.py"]

Performance Numbers

Metric                     Before    After
Image size                 2.1 GB    650 MB
Build time (cold)          8 min     3 min
Build time (code change)   8 min     5 sec
Memory usage               1.5 GB    800 MB
Chrome crashes             ~2/day    0

Full starter template: python-web-scraping-starter

More scraping tools: awesome-web-scraping-2026

What does your Docker setup for scraping look like? Any tricks I'm missing? 👇


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
