
Alex Spinov

The Perfect Docker Setup for Web Scraping (I Spent Months Getting This Right)

I've dockerized 20+ scraping projects. Every time, I hit the same problems:

  • Playwright browsers bloating the image to 2GB+
  • Chrome crashing with 'out of memory' in containers
  • Different behavior between local and production
  • Slow builds when changing one line of code

Here's the Dockerfile I now use for every project. It took months of pain to get right.


The Dockerfile

# Stage 1: Dependencies (cached layer)
FROM python:3.12-slim AS deps

WORKDIR /app

# System deps for Playwright/Chrome
RUN apt-get update && apt-get install -y --no-install-recommends \
    libnss3 libatk1.0-0 libatk-bridge2.0-0 libdrm2 \
    libxkbcommon0 libxcomposite1 libxdamage1 libxrandr2 \
    libgbm1 libasound2 libpango-1.0-0 libcairo2 \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install ONLY Chromium (not all browsers), into a fixed path the
# non-root user can read. The system libraries are already installed
# above, so --with-deps is unnecessary here.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
RUN playwright install chromium

# Stage 2: App
FROM deps AS app

WORKDIR /app
COPY . .

# Non-root user (important for security); make the browser readable to it
RUN useradd -m scraper \
    && chmod -R a+rX /ms-playwright
USER scraper

CMD ["python", "scrape.py"]

Image size: ~650MB (vs 2.1GB with the default Playwright image)
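For reference, a minimal scrape.py that the Dockerfile's CMD would run might look like this. This is a sketch: the URL, the RUN_SCRAPER guard, and the flag choices are my assumptions, not from the original setup.

```python
import os


def chromium_launch_args() -> list[str]:
    # Flags that behave well in containers: --no-sandbox because Chrome's
    # sandbox usually can't start under Docker's default seccomp profile,
    # and --disable-dev-shm-usage as a safety net if shm_size wasn't raised
    return ["--no-sandbox", "--disable-dev-shm-usage"]


def main() -> None:
    from playwright.sync_api import sync_playwright  # deferred import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=chromium_launch_args())
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()


# Guarded so that importing this module (e.g. in tests) never launches a
# browser; set RUN_SCRAPER=1 in the container environment
if __name__ == "__main__" and os.environ.get("RUN_SCRAPER"):
    main()
```

One more tip: pin your Playwright version in requirements.txt so the browser downloaded at build time matches the library version at run time.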


Why This Works

1. Multi-Stage Build

The deps stage is cached. When you change your Python code, Docker only rebuilds the app stage — 5 seconds instead of 5 minutes.

2. Chromium Only

playwright install chromium installs just Chromium (~350MB). The default playwright install pulls Chromium + Firefox + WebKit (~1.2GB); you almost never need all three. The --with-deps flag would additionally apt-install Chromium's system libraries, which this Dockerfile already lists explicitly.

3. Slim Base Image

python:3.12-slim is 130MB vs python:3.12 at 1GB. The manual apt-get installs only the exact libraries Chromium needs.

4. Non-Root User

Running Chrome as root in Docker works but is a security risk: a compromised browser process would have root inside the container. The dedicated scraper user limits the blast radius.


The docker-compose.yml

services:
  scraper:
    build: .
    restart: unless-stopped
    environment:
      - PYTHONUNBUFFERED=1
    deploy:
      resources:
        limits:
          memory: 1G    # Prevent Chrome from eating all RAM
          cpus: '1.0'
    volumes:
      - ./data:/app/data   # Persist scraped data
    shm_size: '256m'       # CRITICAL: Chrome needs this

The shm_size Trick

Chrome in Docker crashes with "out of memory" even when the container has 4GB of RAM. The culprit is /dev/shm: Docker caps it at 64MB by default, and Chrome needs far more shared memory than that.

Three fixes:

# Option 1: Increase shm_size
shm_size: '256m'

# Option 2: Disable shm usage in Chrome
# In your Python code:
browser = playwright.chromium.launch(
    args=['--disable-dev-shm-usage']
)

# Option 3: Mount /dev/shm
volumes:
  - /dev/shm:/dev/shm

I use Option 1 — cleanest solution.
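To confirm the fix actually landed, you can check the tmpfs capacity from inside the running container. A quick sketch (not from the original post), using only the standard library:

```python
import os


def shm_size_mb(path: str = "/dev/shm") -> float:
    """Capacity of a mount in megabytes, via statvfs."""
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize / (1024 * 1024)


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        size = shm_size_mb()
        print(f"/dev/shm: {size:.0f} MB")
        if size < 256:
            print("warning: below 256 MB, Chrome may crash under load")
```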


Cron Scheduling

For daily scrapes, add a cron service:

  scheduler:
    image: mcuadros/ofelia:latest
    command: daemon --docker
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    labels:
      # ofelia cron expressions have a leading seconds field
      ofelia.job-run.scrape.schedule: "0 0 8 * * *"
      # image name is an assumption; compose builds it as <project>-<service>
      ofelia.job-run.scrape.image: "myproject-scraper"

(A job-local job would run its command inside the ofelia container itself, which has no Docker CLI; job-run launches a fresh container from your scraper image instead.)

Or simply use crontab on the host (swap /path/to/project for your compose project directory):

0 8 * * * cd /path/to/project && docker compose run --rm scraper >> /var/log/scraper.log 2>&1

Common Mistakes I Made (So You Don't Have To)

1. Installing chromium-browser via apt

# BAD — version mismatch with Playwright
RUN apt-get install -y chromium-browser

# GOOD — Playwright downloads the exact version it needs
RUN playwright install chromium

2. Missing --no-cache-dir

# BAD — pip cache bloats the image by 200MB+
RUN pip install -r requirements.txt

# GOOD
RUN pip install --no-cache-dir -r requirements.txt

3. COPY . . Before requirements

# BAD — any code change invalidates pip cache
COPY . .
RUN pip install -r requirements.txt

# GOOD — requirements layer is cached
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

4. Running as Root

# BAD — security risk
CMD ["python", "scrape.py"]

# GOOD
RUN useradd -m scraper
USER scraper
CMD ["python", "scrape.py"]

Performance Numbers

Metric                     Before    After
Image size                 2.1 GB    650 MB
Build time (cold)          8 min     3 min
Build time (code change)   8 min     5 sec
Memory usage               1.5 GB    800 MB
Chrome crashes             ~2/day    0

Full starter template: python-web-scraping-starter

More scraping tools: awesome-web-scraping-2026

What does your Docker setup for scraping look like? Any tricks I'm missing? 👇


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
