Muhammad Ikramullah Khan

Running Scrapy with Docker: The Complete Beginner's Guide

I tried to run my Scrapy project on a friend's computer. It didn't work. "Works on my machine," I said. He rolled his eyes.

The problem? Different Python version. Different packages. Different operating system. Everything was different.

Then I learned Docker. I packaged my entire project in a container. It worked on my machine, his machine, and every server we tried. Always the same. Always perfect.

Let me explain Docker in the simplest way possible, without any confusing technical terms.


What is Docker? (The Simple Explanation)

Imagine you want to ship a fragile item to someone. You could just throw it in a box and hope it arrives safely. Or you could pack it carefully with bubble wrap, foam, and a solid container.

Docker is like that protective container for your code.

It packages:

  • Your Scrapy project
  • Python (the right version)
  • All your packages (scrapy, requests, etc.)
  • Everything needed to run

All in one neat package called a "container."

The magic: This container works EXACTLY the same on:

  • Your laptop
  • Your friend's computer
  • Any server
  • Windows, Mac, or Linux

No more "it works on my machine" problems!


Why Use Docker with Scrapy?

Problem 1: Different Computers, Different Problems

Without Docker:

Your laptop: Python 3.11, Scrapy 2.11 → Works!
Server: Python 3.8, Scrapy 2.8 → Breaks!

With Docker:

Your laptop: Docker container → Works!
Server: Same Docker container → Works!

Problem 2: Installing Everything is Hard

Without Docker:

  1. Install Python
  2. Install pip
  3. Install scrapy
  4. Install other packages
  5. Hope nothing breaks

With Docker:

  1. Run one command
  2. Done!

Problem 3: Cleanup is Messy

Without Docker:

  • Packages installed globally
  • Hard to remove completely
  • Leaves traces everywhere

With Docker:

  • Delete container
  • Everything gone
  • Clean!

Installing Docker

On Windows

  1. Download Docker Desktop from docker.com
  2. Install it (double-click, next, next, finish)
  3. Restart your computer
  4. Done!

On Mac

  1. Download Docker Desktop from docker.com
  2. Drag to Applications folder
  3. Open it
  4. Done!

On Linux (Ubuntu)

sudo apt update
sudo apt install docker.io
sudo systemctl start docker
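If you want Docker to start automatically on every boot, enable the service as well:

sudo systemctl enable docker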

Verify Installation

Open terminal (or command prompt) and type:

docker --version

You should see something like:

Docker version 24.0.7

If you see this, Docker is installed!
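You can also run Docker's built-in test image to confirm containers actually start:

docker run hello-world

If you see a "Hello from Docker!" message, everything is working.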


Understanding Docker Basics (Super Simple)

Before we start, understand these three terms:

Image: A snapshot of your packaged environment (like a blueprint)

Container: The actual running thing (like a house built from the blueprint)

Dockerfile: The written instructions Docker follows to create the image (like a recipe)

That's it. Just three concepts.
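Here's how those concepts map onto the two commands you'll use most (a quick preview; the image name is just an example):

# Dockerfile → image: build reads the Dockerfile and produces an image
docker build -t my-scraper .

# Image → container: run starts a fresh container from that image
docker run my-scraper

One image can spawn many containers, just like one blueprint can build many houses.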


Your First Scrapy Docker Container

Let's create the simplest possible Docker setup for Scrapy.

Step 1: Create Your Project Folder

mkdir my_scraper
cd my_scraper

Step 2: Create Your Spider

Create a file called spider.py:

import scrapy

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }

Step 3: Create Dockerfile

Create a file called Dockerfile (exactly that name, no extension):

FROM python:3.11

WORKDIR /app

RUN pip install scrapy

COPY spider.py .

CMD ["scrapy", "runspider", "spider.py"]

What this means (line by line):

  • FROM python:3.11 → Start from the official Python 3.11 image
  • WORKDIR /app → Work inside a folder called /app
  • RUN pip install scrapy → Install Scrapy
  • COPY spider.py . → Copy your spider file into /app
  • CMD ["scrapy", "runspider", "spider.py"] → Run the spider (runspider runs a standalone spider file; scrapy crawl, used later, is for full projects)
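One optional refinement: pin the Scrapy version so a rebuild next year installs exactly what you tested with. A sketch (the version number below is just an example; use whichever one you developed against):

FROM python:3.11

WORKDIR /app

# Pinning the version keeps rebuilds reproducible
RUN pip install scrapy==2.11.2

COPY spider.py .

CMD ["scrapy", "runspider", "spider.py"]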

Step 4: Build Your Docker Image

docker build -t my-scraper .

This takes a minute. Docker is downloading Python and installing Scrapy.

What this does:

  • docker build → Create an image
  • -t my-scraper → Name it "my-scraper"
  • . → Use current folder

Step 5: Run Your Spider

docker run my-scraper

That's it! Your spider runs inside Docker!

You'll see output like:

2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
...
{'text': 'The world as we have created it...', 'author': 'Albert Einstein'}

Understanding What Just Happened

Let's break down what Docker did:

  1. Created an isolated environment (like a mini computer inside your computer)
  2. Installed Python 3.11 (just for this environment)
  3. Installed Scrapy (just for this environment)
  4. Copied your spider (into this environment)
  5. Ran your spider (inside this environment)

All without touching your actual computer!
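One small tip: every docker run leaves a stopped container behind. Adding --rm removes it automatically when the spider finishes:

docker run --rm my-scraper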


Saving Scraped Data

Right now, the scraped data stays inside the container and disappears when the container is removed. Let's save it to your computer instead.

Create Output Folder

mkdir output

Update Dockerfile

FROM python:3.11

WORKDIR /app

RUN pip install scrapy

COPY spider.py .

CMD ["scrapy", "runspider", "spider.py", "-o", "/output/data.json"]

Notice the new part: -o /output/data.json

Run with Volume

docker run -v $(pwd)/output:/output my-scraper

What this does:

  • -v → mount (share) a folder between your computer and the container
  • $(pwd)/output → your computer's output folder
  • :/output → the container's /output folder

Now check your output folder:

cat output/data.json

Your data is there!
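One detail worth knowing: -o appends to the file, so running the container twice produces invalid JSON. Since Scrapy 2.1 there's a capital -O that overwrites instead. You can pass it by overriding the CMD at run time:

docker run -v $(pwd)/output:/output my-scraper scrapy runspider spider.py -O /output/data.json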


Real Project with Multiple Files

Let's do a real Scrapy project.

Your Project Structure

my_scraper/
├── Dockerfile
├── scrapy.cfg
├── myproject/
│   ├── __init__.py
│   ├── settings.py
│   ├── items.py
│   └── spiders/
│       ├── __init__.py
│       └── quotes.py
└── output/

Create Scrapy Project

scrapy startproject myproject .
scrapy genspider quotes quotes.toscrape.com

(You can run genspider from the project root; Scrapy places the new spider in myproject/spiders automatically.)

Simple Dockerfile for Full Project

FROM python:3.11-slim

WORKDIR /app

# Install Scrapy
RUN pip install scrapy

# Copy entire project
COPY . .

# Run spider
CMD ["scrapy", "crawl", "quotes", "-o", "/output/quotes.json"]
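As your project gains dependencies, a common refinement is listing them in a requirements.txt and copying that file before the rest of the code, so Docker caches the install step. A sketch, assuming you create a requirements.txt (containing at least scrapy) next to the Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Copy only the dependency list first; this layer stays cached
# until requirements.txt itself changes
COPY requirements.txt .
RUN pip install -r requirements.txt

# Code-only changes now skip the pip install step on rebuild
COPY . .

CMD ["scrapy", "crawl", "quotes", "-o", "/output/quotes.json"]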

Build and Run

# Build
docker build -t quotes-scraper .

# Run
docker run -v $(pwd)/output:/output quotes-scraper

Done! Check output/quotes.json for your data.


Making It Even Easier (docker-compose)

Instead of long commands, use docker-compose.

Install docker-compose

Usually comes with Docker Desktop (newer versions bundle it as "docker compose", with a space). Check:

docker-compose --version

Create docker-compose.yml

In your project folder, create docker-compose.yml:

version: '3.8'

services:
  scraper:
    build: .
    volumes:
      - ./output:/output

That's it! Super simple.

Run with docker-compose

docker-compose up

Much easier than the long docker run command!
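If you want to run a different spider or pass extra flags without touching the Dockerfile, compose lets you override the command too. A sketch, reusing the quotes spider from earlier:

version: '3.8'

services:
  scraper:
    build: .
    command: ["scrapy", "crawl", "quotes", "-O", "/output/quotes.json"]
    volumes:
      - ./output:/output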

Stop Everything

docker-compose down

Common Beginner Commands

See All Images

docker images

Output:

REPOSITORY      TAG       IMAGE ID       SIZE
quotes-scraper  latest    abc123def456   180MB
my-scraper      latest    def789ghi012   1.02GB

(my-scraper is much bigger because it was built from the full python:3.11 image; quotes-scraper uses the slim variant.)

See Running Containers

docker ps

See All Containers (including stopped)

docker ps -a

Delete an Image

docker rmi quotes-scraper

Delete a Container

docker rm container_name

Delete Everything (Clean Start)

docker system prune -a

Warning: This deletes all stopped containers and every image not currently in use!


Troubleshooting for Beginners

Problem: "docker: command not found"

Solution: Docker not installed or not running.

  • Open Docker Desktop
  • Wait for it to start (icon in taskbar)
  • Try again

Problem: "permission denied"

On Linux:

sudo docker run my-scraper

Or add yourself to docker group:

sudo usermod -aG docker $USER

Then logout and login again.
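If you don't want to log out, you can open a new shell with the group already applied:

newgrp docker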

Problem: "Cannot connect to Docker daemon"

Solution: Start Docker Desktop.

Problem: Build is slow

Why: Docker is downloading Python and packages. First time is always slow.

Good news: Second time is FAST because Docker caches everything.

Problem: Container exits immediately

Check the logs (find the container name with docker ps -a first):

docker logs container_name

Or start a shell inside the image to investigate (the /bin/bash at the end overrides the CMD):

docker run -it my-scraper /bin/bash

Real Beginner Example (Step by Step)

Let's do everything from scratch together.

Step 1: Create Project Folder

mkdir scrapy_docker_tutorial
cd scrapy_docker_tutorial

Step 2: Create Simple Spider

Create simple_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }

        # Follow next page
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Step 3: Create Dockerfile

Create Dockerfile:

FROM python:3.11-slim

WORKDIR /app

RUN pip install scrapy

COPY simple_spider.py .

CMD ["scrapy", "runspider", "simple_spider.py", "-o", "/output/quotes.json"]

Step 4: Create Output Folder

mkdir output

Step 5: Build Docker Image

docker build -t quotes-scraper .

Wait for it to finish (first time takes 2-3 minutes).

Step 6: Run Spider

docker run -v $(pwd)/output:/output quotes-scraper

On Windows (Command Prompt):

docker run -v %cd%/output:/output quotes-scraper

On Windows (PowerShell):

docker run -v ${PWD}/output:/output quotes-scraper

Step 7: Check Results

cat output/quotes.json

On Windows:

type output\quotes.json

You should see all the scraped quotes!

Congratulations! You just ran Scrapy in Docker!


What to Do Next

Share Your Project

Send someone your project folder. They can run:

docker build -t scraper .
docker run scraper

Works on their computer. Guaranteed. (If they want the scraped data saved, they add the same -v flag from earlier.)

Run on a Server

Copy your project to a server:

scp -r my_scraper user@server.com:/home/user/

On server:

cd my_scraper
docker build -t scraper .
docker run scraper

Same code. Same result. Always.


Simple Best Practices

1. Use .dockerignore

Create .dockerignore file:

__pycache__/
*.pyc
.git/
output/
*.log

This tells Docker to ignore these files (makes build faster).

2. Use Specific Python Version

FROM python:3.11-slim

Not just python:latest (version might change).

3. Name Your Images Well

docker build -t project-name:version .

Example:

docker build -t quotes-scraper:v1 .

4. Clean Up Old Images

docker image prune

Removes unused images (saves disk space).
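Stopped containers pile up the same way; prune those too:

docker container prune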


Quick Command Reference

Build image:

docker build -t my-scraper .

Run container:

docker run my-scraper

Run with output folder:

docker run -v $(pwd)/output:/output my-scraper

See running containers:

docker ps

Stop container:

docker stop container_name

Delete image:

docker rmi my-scraper

Clean everything:

docker system prune -a

When to Use Docker

Use Docker when:

  • Sharing project with others
  • Deploying to server
  • Want consistent environment
  • Multiple projects on same computer
  • Want easy cleanup

Don't need Docker when:

  • Just learning Scrapy
  • Running on your own computer only
  • Simple one-time scrape
  • Already comfortable with virtual environments

Docker is awesome, but it's not always necessary. Start simple!


Summary

What is Docker?
A way to package your Scrapy project so it runs the same everywhere.

Basic steps:

  1. Create Dockerfile
  2. Build image: docker build -t name .
  3. Run container: docker run name

To save data:

docker run -v $(pwd)/output:/output name

Remember:

  • Dockerfile = instructions
  • Image = blueprint
  • Container = running copy
  • Volume = shared folder

That's all you need to know to get started!

Docker seems complicated, but you really only need a few commands. Start with the simple examples above, and you'll be dockerizing your Scrapy projects in no time.

The best part? Once you learn it, you'll never go back. No more "works on my machine" excuses!

Happy scraping! 🕷️
