I tried to run my Scrapy project on a friend's computer. It didn't work. "Works on my machine," I said. He rolled his eyes.
The problem? Different Python version. Different packages. Different operating system. Everything was different.
Then I learned Docker. I packaged my entire project in a container. It worked on my machine, his machine, and every server we tried. Always the same. Always perfect.
Let me explain Docker in the simplest way possible, without any confusing technical terms.
What is Docker? (The Simple Explanation)
Imagine you want to ship a fragile item to someone. You could just throw it in a box and hope it arrives safely. Or you could pack it carefully with bubble wrap, foam, and a solid container.
Docker is like that protective container for your code.
It packages:
- Your Scrapy project
- Python (the right version)
- All your packages (scrapy, requests, etc.)
- Everything needed to run
All in one neat package called a "container."
The magic: This container works EXACTLY the same on:
- Your laptop
- Your friend's computer
- Any server
- Windows, Mac, or Linux
No more "it works on my machine" problems!
Why Use Docker with Scrapy?
Problem 1: Different Computers, Different Problems
Without Docker:
Your laptop: Python 3.11, Scrapy 2.11 → Works!
Server: Python 3.8, Scrapy 2.8 → Breaks!
With Docker:
Your laptop: Docker container → Works!
Server: Same Docker container → Works!
Problem 2: Installing Everything is Hard
Without Docker:
- Install Python
- Install pip
- Install scrapy
- Install other packages
- Hope nothing breaks
With Docker:
- Run one command
- Done!
Problem 3: Cleanup is Messy
Without Docker:
- Packages installed globally
- Hard to remove completely
- Leaves traces everywhere
With Docker:
- Delete container
- Everything gone
- Clean!
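Concretely, that cleanup is just two commands (both appear again in the command reference later). A quick sketch, assuming your container and image are both named scraper:
docker rm scraper    # remove the stopped container
docker rmi scraper   # remove the image it was built from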
Installing Docker
On Windows
- Download Docker Desktop from docker.com
- Install it (double-click, next, next, finish)
- Restart your computer
- Done!
On Mac
- Download Docker Desktop from docker.com
- Drag to Applications folder
- Open it
- Done!
On Linux (Ubuntu)
sudo apt update
sudo apt install docker.io
sudo systemctl start docker
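Two optional extras that many Ubuntu setups add (adjust for your distro): make Docker start on boot, and run the official hello-world test image to confirm the daemon can actually run containers.
sudo systemctl enable docker    # start Docker automatically at boot
sudo docker run hello-world     # downloads a tiny test image and prints a welcome message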
Verify Installation
Open terminal (or command prompt) and type:
docker --version
You should see something like:
Docker version 24.0.7
If you see this, Docker is installed!
Understanding Docker Basics (Super Simple)
Before we start, understand these three terms:
Image: A recipe for your container (like a blueprint)
Container: The actual running thing (like a house built from the blueprint)
Dockerfile: The instructions to create the image (like building plans)
That's it. Just three concepts.
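In command terms, the mapping looks like this (these are exactly the commands you'll run in a moment):
# Dockerfile = a text file you write (the instructions)
docker build -t my-scraper .    # Dockerfile -> image (the blueprint)
docker run my-scraper           # image -> container (the running thing)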
Your First Scrapy Docker Container
Let's create the simplest possible Docker setup for Scrapy.
Step 1: Create Your Project Folder
mkdir my_scraper
cd my_scraper
Step 2: Create Your Spider
Create a file called spider.py:
import scrapy
class SimpleSpider(scrapy.Spider):
name = 'simple'
start_urls = ['http://quotes.toscrape.com/']
def parse(self, response):
for quote in response.css('.quote'):
yield {
'text': quote.css('.text::text').get(),
'author': quote.css('.author::text').get(),
}
Step 3: Create Dockerfile
Create a file called Dockerfile (exactly that name, no extension):
FROM python:3.11
WORKDIR /app
RUN pip install scrapy
COPY spider.py .
CMD ["scrapy", "runspider", "spider.py"]
What this means (line by line):
- FROM python:3.11 → Start with Python 3.11
- WORKDIR /app → Work in folder called /app
- RUN pip install scrapy → Install Scrapy
- COPY spider.py . → Copy your spider file
- CMD ["scrapy", "runspider", "spider.py"] → Run the spider
Step 4: Build Your Docker Image
docker build -t my-scraper .
This takes a minute. Docker is downloading Python and installing Scrapy.
What this does:
- docker build → Create an image
- -t my-scraper → Name it "my-scraper"
- . → Use current folder
Step 5: Run Your Spider
docker run my-scraper
That's it! Your spider runs inside Docker!
You'll see output like:
2024-01-15 10:30:22 [scrapy.core.engine] INFO: Spider opened
...
{'text': 'The world as we have created it...', 'author': 'Albert Einstein'}
Understanding What Just Happened
Let's break down what Docker did:
- Created an isolated environment (like a mini computer inside your computer)
- Installed Python 3.11 (just for this environment)
- Installed Scrapy (just for this environment)
- Copied your spider (into this environment)
- Ran your spider (inside this environment)
All without touching your actual computer!
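Want proof of the isolation? Compare the Python inside the container with whatever is (or isn't) on your machine. Putting a command after the image name overrides the container's normal startup command for that one run:
python --version                          # your computer's Python, if you even have one
docker run my-scraper python --version    # the container's own Python 3.11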
Saving Scraped Data
Right now, the scraped data is only printed to the terminal and disappears when the run ends. Let's save it to a file on your computer.
Create Output Folder
mkdir output
Update Dockerfile
FROM python:3.11
WORKDIR /app
RUN pip install scrapy
COPY spider.py .
CMD ["scrapy", "runspider", "spider.py", "-o", "/output/data.json"]
Notice the new part: -o /output/data.json
Run with Volume
docker run -v $(pwd)/output:/output my-scraper
What this does:
- -v → Connect folders
- $(pwd)/output → Your computer's output folder
- :/output → Container's output folder
On Windows, replace $(pwd) with %cd% (Command Prompt) or ${PWD} (PowerShell); the full Windows commands appear in the step-by-step example later.
Now check your output folder:
cat output/data.json
Your data is there!
Real Project with Multiple Files
Let's do a real Scrapy project.
Your Project Structure
my_scraper/
├── Dockerfile
├── scrapy.cfg
├── myproject/
│ ├── __init__.py
│ ├── settings.py
│ ├── items.py
│ └── spiders/
│ ├── __init__.py
│ └── quotes.py
└── output/
Create Scrapy Project
scrapy startproject myproject .
cd myproject/spiders
scrapy genspider quotes quotes.toscrape.com
cd ../..
Simple Dockerfile for Full Project
FROM python:3.11-slim
WORKDIR /app
# Install Scrapy
RUN pip install scrapy
# Copy entire project
COPY . .
# Run spider
CMD ["scrapy", "crawl", "quotes", "-o", "/output/quotes.json"]
Build and Run
# Build
docker build -t quotes-scraper .
# Run
docker run -v $(pwd)/output:/output quotes-scraper
Done! Check output/quotes.json for your data.
Making It Even Easier (docker-compose)
Instead of long commands, use docker-compose.
Install docker-compose
Usually comes with Docker Desktop. Check:
docker-compose --version
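If that command isn't found, newer Docker releases ship Compose as a plugin instead of a separate tool; try the two-word form (and if that's what you have, use docker compose up/down with a space in the commands below):
docker compose version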
Create docker-compose.yml
In your project folder, create docker-compose.yml:
version: '3.8'
services:
scraper:
build: .
volumes:
- ./output:/output
That's it! Super simple.
Run with docker-compose
docker-compose up
Much easier than the long docker run command!
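One flag worth knowing: if you've changed your spider or Dockerfile since the last build, rebuild and run in a single step:
docker-compose up --build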
Stop Everything
docker-compose down
Common Beginner Commands
See All Images
docker images
Output:
REPOSITORY TAG IMAGE ID SIZE
quotes-scraper latest abc123def456 180MB
my-scraper latest def789ghi012 190MB
See Running Containers
docker ps
See All Containers (including stopped)
docker ps -a
Delete an Image
docker rmi quotes-scraper
Delete a Container
docker rm container_name
Delete Everything (Clean Start)
docker system prune -a
Warning: This deletes all stopped containers and every image not used by a running container!
Troubleshooting for Beginners
Problem: "docker: command not found"
Solution: Docker is not installed, or it has not finished starting.
- Open Docker Desktop
- Wait for it to finish starting (whale icon in the taskbar or menu bar)
- Open a new terminal and try again
Problem: "permission denied"
On Linux:
sudo docker run my-scraper
Or add yourself to docker group:
sudo usermod -aG docker $USER
Then log out and log back in so the group change takes effect.
Problem: "Cannot connect to Docker daemon"
Solution: Start Docker Desktop.
Problem: Build is slow
Why: Docker is downloading the Python base image and installing Scrapy from scratch. The first build is always slow.
Good news: The second build is FAST, because Docker caches each Dockerfile step (layer). As long as the lines above your COPY don't change, those cached layers are reused and only your code gets copied again.
Problem: Container exits immediately
Check logs:
docker logs container_name
Or run in interactive mode:
docker run -it my-scraper /bin/bash
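Inside that interactive shell you can run the spider by hand and read the full traceback, which is usually the quickest way to spot the problem (this assumes the first Dockerfile from this post, where spider.py was copied into /app):
# you are now inside the container, in /app
scrapy runspider spider.py   # run it manually and watch for the error
exit                         # leave the container when you're done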
Real Beginner Example (Step by Step)
Let's do everything from scratch together.
Step 1: Create Project Folder
mkdir scrapy_docker_tutorial
cd scrapy_docker_tutorial
Step 2: Create Simple Spider
Create simple_spider.py:
import scrapy
class QuotesSpider(scrapy.Spider):
name = 'quotes'
start_urls = ['http://quotes.toscrape.com/']
def parse(self, response):
for quote in response.css('.quote'):
yield {
'text': quote.css('.text::text').get(),
'author': quote.css('.author::text').get(),
}
# Follow next page
next_page = response.css('.next a::attr(href)').get()
if next_page:
yield response.follow(next_page, self.parse)
Step 3: Create Dockerfile
Create Dockerfile:
FROM python:3.11-slim
WORKDIR /app
RUN pip install scrapy
COPY simple_spider.py .
CMD ["scrapy", "runspider", "simple_spider.py", "-o", "/output/quotes.json"]
Step 4: Create Output Folder
mkdir output
Step 5: Build Docker Image
docker build -t quotes-scraper .
Wait for it to finish (first time takes 2-3 minutes).
Step 6: Run Spider
docker run -v $(pwd)/output:/output quotes-scraper
On Windows (Command Prompt):
docker run -v %cd%/output:/output quotes-scraper
On Windows (PowerShell):
docker run -v ${PWD}/output:/output quotes-scraper
Step 7: Check Results
cat output/quotes.json
On Windows:
type output\quotes.json
You should see all the scraped quotes!
Congratulations! You just ran Scrapy in Docker!
What to Do Next
Share Your Project
Send someone your project folder. They can run:
docker build -t scraper .
docker run scraper
Works on their computer, guaranteed (as long as they have Docker installed).
Run on a Server
Copy your project to a server:
scp -r my_scraper user@server.com:/home/user/
On server:
cd my_scraper
docker build -t scraper .
docker run scraper
Same code. Same result. Always.
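On a server you'll often want the scraper to run in the background while you do other things. A rough sketch, assuming your Dockerfile writes its output to /output like the earlier examples (here "scraper" is both the image name and the container name):
docker run -d --name scraper -v "$(pwd)/output:/output" scraper
docker logs -f scraper    # follow the spider's log output
docker stop scraper       # stop it early if you need to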
Simple Best Practices
1. Use .dockerignore
Create .dockerignore file:
__pycache__/
*.pyc
.git/
output/
*.log
This tells Docker to ignore these files (makes build faster).
2. Use Specific Python Version
FROM python:3.11-slim
Not python:latest (the version behind that tag changes over time and can silently break your build).
3. Name Your Images Well
docker build -t project-name:version .
Example:
docker build -t quotes-scraper:v1 .
4. Clean Up Old Images
docker image prune
Removes unused images (saves disk space).
Quick Command Reference
Build image:
docker build -t my-scraper .
Run container:
docker run my-scraper
Run with output folder:
docker run -v $(pwd)/output:/output my-scraper
See running containers:
docker ps
Stop container:
docker stop container_name
Delete image:
docker rmi my-scraper
Clean everything:
docker system prune -a
When to Use Docker
Use Docker when:
- Sharing project with others
- Deploying to server
- Want consistent environment
- Multiple projects on same computer
- Want easy cleanup
Don't need Docker when:
- Just learning Scrapy
- Running on your own computer only
- Simple one-time scrape
- Already comfortable with virtual environments
Docker is awesome, but it's not always necessary. Start simple!
Summary
What is Docker?
A way to package your Scrapy project so it runs the same everywhere.
Basic steps:
- Create Dockerfile
- Build image: docker build -t name .
- Run container: docker run name
To save data:
docker run -v $(pwd)/output:/output name
Remember:
- Dockerfile = instructions
- Image = blueprint
- Container = running copy
- Volume = shared folder
That's all you need to know to get started!
Docker seems complicated, but you really only need a few commands. Start with the simple examples above, and you'll be dockerizing your Scrapy projects in no time.
The best part? Once you learn it, you'll never go back. No more "works on my machine" excuses!
Happy scraping! 🕷️