Muhammad Ikramullah Khan

Running Scrapy with Docker: The Complete Beginner's Guide

I tried to run my Scrapy project on a friend's computer. It didn't work. "Works on my machine," I said. He rolled his eyes.

The problem? Different Python version. Different packages. Different operating system. Everything was different.

Then I learned Docker. I packaged my entire project in a container. It worked on my machine, his machine, and every server we tried. Always the same. Always perfect.

Let me explain Docker in the simplest way possible, without any confusing technical terms.


What is Docker? (The Simple Explanation)

Imagine you want to ship a fragile item to someone. You could just throw it in a box and hope it arrives safely. Or you could pack it carefully with bubble wrap, foam, and a solid container.

Docker is like that protective container for your code.

It packages:

  • Your Scrapy project
  • Python (the right version)
  • All your packages (scrapy, requests, etc.)
  • Everything needed to run

All in one neat package called a "container."

The magic: This container works EXACTLY the same on:

  • Your laptop
  • Your friend's computer
  • Any server
  • Windows, Mac, or Linux

No more "it works on my machine" problems!


Why Use Docker with Scrapy?

Problem 1: Different Computers, Different Problems

Without Docker:

Your laptop: Python 3.11, Scrapy 2.11 → Works!
Server: Python 3.8, Scrapy 2.8 → Breaks!

With Docker:

Your laptop: Docker container → Works!
Server: Same Docker container → Works!

Problem 2: Installing Everything is Hard

Without Docker:

  1. Install Python
  2. Install pip
  3. Install scrapy
  4. Install other packages
  5. Hope nothing breaks

With Docker:

  1. Run one command
  2. Done!

Problem 3: Cleanup is Messy

Without Docker:

  • Packages installed globally
  • Hard to remove completely
  • Leaves traces everywhere

With Docker:

  • Delete container
  • Everything gone
  • Clean!

Installing Docker

On Windows

  1. Download Docker Desktop from docker.com
  2. Install it (double-click, next, next, finish)
  3. Restart your computer
  4. Done!

On Mac

  1. Download Docker Desktop from docker.com
  2. Drag to Applications folder
  3. Open it
  4. Done!

On Linux (Ubuntu)

sudo apt update
sudo apt install docker.io
sudo systemctl start docker
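If you want Docker to start automatically on every boot, enable the service as well:

sudo systemctl enable docker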

Verify Installation

Open terminal (or command prompt) and type:

docker --version

You should see something like:

Docker version 24.0.7

If you see this, Docker is installed!
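You can also run Docker's built-in test image to confirm containers actually start:

docker run hello-world

If you see a "Hello from Docker!" message, everything is working.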


Understanding Docker Basics (Super Simple)

Before we start, understand these three terms:

Image: A snapshot of your packaged environment (like a blueprint)

Container: The actual running thing (like a house built from the blueprint)

Dockerfile: The written instructions Docker follows to create the image (like a recipe)

That's it. Just three concepts.
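Here's how those concepts map onto the two commands you'll use most (a quick preview; the image name is just an example):

# Dockerfile → image: build reads the Dockerfile and produces an image
docker build -t my-scraper .

# Image → container: run starts a fresh container from that image
docker run my-scraper

One image can spawn many containers, just like one blueprint can build many houses.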


Your First Scrapy Docker Container

Let's create the simplest possible Docker setup for Scrapy.

Step 1: Create Your Project Folder

mkdir my_scraper
cd my_scraper

Step 2: Create Your Spider

Create a file called spider.py:

import scrapy

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }

Step 3: Create Dockerfile

Create a file called Dockerfile (exactly that name, no extension):

FROM python:3.11

WORKDIR /app

RUN pip install scrapy

COPY spider.py .

CMD ["scrapy", "runspider", "spider.py"]

What this means (line by line):

  • FROM python:3.11 → Start from the official Python 3.11 image
  • WORKDIR /app → Work inside a folder called /app
  • RUN pip install scrapy → Install Scrapy
  • COPY spider.py . → Copy your spider file into /app
  • CMD ["scrapy", "runspider", "spider.py"] → Run the spider (runspider runs a standalone spider file; scrapy crawl, used later, is for full projects)
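One optional refinement: pin the Scrapy version so a rebuild next year installs exactly what you tested with. A sketch (the version number below is just an example; use whichever one you developed against):

FROM python:3.11

WORKDIR /app

# Pinning the version keeps rebuilds reproducible
RUN pip install scrapy==2.11.2

COPY spider.py .

CMD ["scrapy", "runspider", "spider.py"]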

Step 4: Build Your Docker Image

docker build -t my-scraper .

This takes a minute. Docker is downloading Python and installing Scrapy.

What this does:

  • docker build → Create an image
  • -t my-scraper → Name it "my-scraper"
  • . → Use current folder

Step 5: Run Your Spider

docker run my-scraper

That's it! Your spider runs inside Docker!

You'll see output like:

2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
...
{'text': 'The world as we have created it...', 'author': 'Albert Einstein'}

Understanding What Just Happened

Let's break down what Docker did:

  1. Created an isolated environment (like a mini computer inside your computer)
  2. Installed Python 3.11 (just for this environment)
  3. Installed Scrapy (just for this environment)
  4. Copied your spider (into this environment)
  5. Ran your spider (inside this environment)

All without touching your actual computer!
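One small tip: every docker run leaves a stopped container behind. Adding --rm removes it automatically when the spider finishes:

docker run --rm my-scraper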


Saving Scraped Data

Right now, the scraped data stays inside the container and disappears when the container is removed. Let's save it to your computer instead.

Create Output Folder

mkdir output

Update Dockerfile

FROM python:3.11

WORKDIR /app

RUN pip install scrapy

COPY spider.py .

CMD ["scrapy", "runspider", "spider.py", "-o", "/output/data.json"]

Notice the new part: -o /output/data.json

Run with Volume

docker run -v $(pwd)/output:/output my-scraper

What this does:

  • -v → mount (share) a folder between your computer and the container
  • $(pwd)/output → your computer's output folder
  • :/output → the container's /output folder

Now check your output folder:

cat output/data.json

Your data is there!
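One detail worth knowing: -o appends to the file, so running the container twice produces invalid JSON. Since Scrapy 2.1 there's a capital -O that overwrites instead. You can pass it by overriding the CMD at run time:

docker run -v $(pwd)/output:/output my-scraper scrapy runspider spider.py -O /output/data.json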


Real Project with Multiple Files

Let's do a real Scrapy project.

Your Project Structure

my_scraper/
├── Dockerfile
├── scrapy.cfg
├── myproject/
│   ├── __init__.py
│   ├── settings.py
│   ├── items.py
│   └── spiders/
│       ├── __init__.py
│       └── quotes.py
└── output/

Create Scrapy Project

scrapy startproject myproject .
scrapy genspider quotes quotes.toscrape.com

(You can run genspider from the project root; Scrapy places the new spider in myproject/spiders automatically.)

Simple Dockerfile for Full Project

FROM python:3.11-slim

WORKDIR /app

# Install Scrapy
RUN pip install scrapy

# Copy entire project
COPY . .

# Run spider
CMD ["scrapy", "crawl", "quotes", "-o", "/output/quotes.json"]
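As your project gains dependencies, a common refinement is listing them in a requirements.txt and copying that file before the rest of the code, so Docker caches the install step. A sketch, assuming you create a requirements.txt (containing at least scrapy) next to the Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Copy only the dependency list first; this layer stays cached
# until requirements.txt itself changes
COPY requirements.txt .
RUN pip install -r requirements.txt

# Code-only changes now skip the pip install step on rebuild
COPY . .

CMD ["scrapy", "crawl", "quotes", "-o", "/output/quotes.json"]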

Build and Run

# Build
docker build -t quotes-scraper .

# Run
docker run -v $(pwd)/output:/output quotes-scraper

Done! Check output/quotes.json for your data.


Making It Even Easier (docker-compose)

Instead of long commands, use docker-compose.

Install docker-compose

Usually comes with Docker Desktop (newer versions bundle it as "docker compose", with a space). Check:

docker-compose --version

Create docker-compose.yml

In your project folder, create docker-compose.yml:

version: '3.8'

services:
  scraper:
    build: .
    volumes:
      - ./output:/output

That's it! Super simple.

Run with docker-compose

docker-compose up

Much easier than the long docker run command!
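If you want to run a different spider or pass extra flags without touching the Dockerfile, compose lets you override the command too. A sketch, reusing the quotes spider from earlier:

version: '3.8'

services:
  scraper:
    build: .
    command: ["scrapy", "crawl", "quotes", "-O", "/output/quotes.json"]
    volumes:
      - ./output:/output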

Stop Everything

docker-compose down

Common Beginner Commands

See All Images

docker images

Output:

REPOSITORY      TAG       IMAGE ID       SIZE
quotes-scraper  latest    abc123def456   180MB
my-scraper      latest    def789ghi012   1.02GB

(my-scraper is much bigger because it was built from the full python:3.11 image; quotes-scraper uses the slim variant.)

See Running Containers

docker ps

See All Containers (including stopped)

docker ps -a

Delete an Image

docker rmi quotes-scraper

Delete a Container

docker rm container_name

Delete Everything (Clean Start)

docker system prune -a

Warning: This deletes all stopped containers and every image not currently in use!


Troubleshooting for Beginners

Problem: "docker: command not found"

Solution: Docker not installed or not running.

  • Open Docker Desktop
  • Wait for it to start (icon in taskbar)
  • Try again

Problem: "permission denied"

On Linux:

sudo docker run my-scraper

Or add yourself to docker group:

sudo usermod -aG docker $USER

Then logout and login again.
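If you don't want to log out, you can open a new shell with the group already applied:

newgrp docker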

Problem: "Cannot connect to Docker daemon"

Solution: Start Docker Desktop.

Problem: Build is slow

Why: Docker is downloading Python and packages. First time is always slow.

Good news: Second time is FAST because Docker caches everything.

Problem: Container exits immediately

Check the logs (find the container name with docker ps -a first):

docker logs container_name

Or start a shell inside the image to investigate (the /bin/bash at the end overrides the CMD):

docker run -it my-scraper /bin/bash

Real Beginner Example (Step by Step)

Let's do everything from scratch together.

Step 1: Create Project Folder

mkdir scrapy_docker_tutorial
cd scrapy_docker_tutorial

Step 2: Create Simple Spider

Create simple_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }

        # Follow next page
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Step 3: Create Dockerfile

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

RUN pip install scrapy

COPY simple_spider.py .

CMD ["scrapy", "runspider", "simple_spider.py", "-o", "/output/quotes.json"]

Step 4: Create Output Folder

mkdir output

Step 5: Build Docker Image

docker build -t quotes-scraper .

Wait for it to finish (first time takes 2-3 minutes).

Step 6: Run Spider

docker run -v $(pwd)/output:/output quotes-scraper

On Windows (Command Prompt):

docker run -v %cd%/output:/output quotes-scraper

On Windows (PowerShell):

docker run -v ${PWD}/output:/output quotes-scraper

Step 7: Check Results

cat output/quotes.json

On Windows:

type output\quotes.json

You should see all the scraped quotes!

Congratulations! You just ran Scrapy in Docker!


What to Do Next

Share Your Project

Send someone your project folder. They can run:

docker build -t scraper .
docker run scraper

Works on their computer. Guaranteed. (If they want the scraped data saved, they add the same -v flag from earlier.)

Run on a Server

Copy your project to a server:

scp -r my_scraper user@server.com:/home/user/

On server:

cd my_scraper
docker build -t scraper .
docker run scraper

Same code. Same result. Always.


Simple Best Practices

1. Use .dockerignore

Create .dockerignore file:

__pycache__/
*.pyc
.git/
output/
*.log

This tells Docker to ignore these files (makes build faster).

2. Use Specific Python Version

FROM python:3.11-slim

Not just python:latest (version might change).

3. Name Your Images Well

docker build -t project-name:version .

Example:

docker build -t quotes-scraper:v1 .

4. Clean Up Old Images

docker image prune

Removes unused images (saves disk space).
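Stopped containers pile up the same way; prune those too:

docker container prune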


Quick Command Reference

Build image:

docker build -t my-scraper .

Run container:

docker run my-scraper

Run with output folder:

docker run -v $(pwd)/output:/output my-scraper

See running containers:

docker ps

Stop container:

docker stop container_name

Delete image:

docker rmi my-scraper

Clean everything:

docker system prune -a

When to Use Docker

Use Docker when:

  • Sharing project with others
  • Deploying to server
  • Want consistent environment
  • Multiple projects on same computer
  • Want easy cleanup

Don't need Docker when:

  • Just learning Scrapy
  • Running on your own computer only
  • Simple one-time scrape
  • Already comfortable with virtual environments

Docker is awesome, but it's not always necessary. Start simple!


Summary

What is Docker?
A way to package your Scrapy project so it runs the same everywhere.

Basic steps:

  1. Create Dockerfile
  2. Build image: docker build -t name .
  3. Run container: docker run name

To save data:

docker run -v $(pwd)/output:/output name

Remember:

  • Dockerfile = instructions
  • Image = blueprint
  • Container = running copy
  • Volume = shared folder

That's all you need to know to get started!

Docker seems complicated, but you really only need a few commands. Start with the simple examples above, and you'll be dockerizing your Scrapy projects in no time.

The best part? Once you learn it, you'll never go back. No more "works on my machine" excuses!

Happy scraping! 🕷️
