A person scraped a website with 1 million products. Their single laptop took 3 days to finish. The website was updated daily, so the data was already outdated by the time the crawl completed.
Then the person learned about distributed crawling. They ran the same spider across 10 computers, and it completed the task in just 7 hours, delivering fresh data every day.
Let me explain what distributed crawling is and when you actually need it, in the simplest way possible.
What is Distributed Crawling? (Super Simple Explanation)
Imagine you need to paint 100 houses.
Single crawler (normal Scrapy):
You → Paint house 1 → Paint house 2 → Paint house 3... → Paint house 100
(Takes 100 days, one house per day)
Distributed crawling:
You → Paint houses 1-10 (10 days)
Friend 1 → Paint houses 11-20 (10 days)
Friend 2 → Paint houses 21-30 (10 days)
...
Friend 9 → Paint houses 91-100 (10 days)
All working at the same time!
(Takes 10 days total, 10x faster!)
In web scraping:
- Instead of one computer scraping all URLs
- You have multiple computers
- Each scrapes different URLs
- All working at the same time
- Much faster!
How Does Distributed Crawling Work?
The Simple Version
Normal Scrapy (Single Machine):
Spider → Queue → Download → Parse → Save
(One machine does everything)
Distributed Scrapy:
Computer 1 → Spider ──┐
Computer 2 → Spider ──┤
                      ├── Shared Queue
Computer 3 → Spider ──┤
Computer 4 → Spider ──┘
All computers share the same queue. They take URLs from the queue and scrape them.
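The core idea fits in a few lines. Here is a minimal, Scrapy-free sketch of the loop every worker machine runs; the Redis location and the queue name crawl:queue are placeholders for illustration (scrapy-redis, covered below, does this for you):
# worker.py - conceptual sketch only, not the scrapy-redis implementation
import redis
import requests

r = redis.Redis(host='localhost', port=6379)

while True:
    # Atomically take the next URL from the shared queue (wait up to 30 seconds)
    item = r.blpop('crawl:queue', timeout=30)
    if item is None:
        break  # queue stayed empty, assume the crawl is finished

    url = item[1].decode('utf-8')

    # "Scrape" the page; real code would parse it and push new links back into the queue
    response = requests.get(url, timeout=10)
    print(f"{url} -> {response.status_code}, {len(response.text)} bytes")
Every extra machine running this same loop against the same Redis server automatically shares the work, because blpop hands each URL to exactly one worker.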
What You Need
1. Shared Queue
A place where all computers can see which URLs need scraping.
Common options:
- Redis (most popular)
- RabbitMQ
- MongoDB
2. Multiple Computers/Servers
- Your laptop
- Cloud servers (AWS, DigitalOcean, etc.)
- Multiple processes on one machine
3. Shared Storage (optional)
Where to save scraped data so all spiders can access it.
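One simple way to get shared storage with the Scrapy-Redis setup shown later in this post: scrapy-redis ships an optional item pipeline that pushes every scraped item into a Redis list, so all machines write to one central place. A minimal configuration sketch (the priority value 300 is arbitrary):
# settings.py (optional: collect items centrally in Redis instead of local files)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
# With default settings, items are serialized to JSON and pushed into a
# Redis list named '<spider name>:items'.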
When Do You NEED Distributed Crawling?
You Need It When:
1. Too Many URLs
1 million URLs
Single spider: 10 URLs per minute
Time needed: 1,000,000 ÷ 10 = 100,000 minutes = 69 days!
With 10 spiders: 69 ÷ 10 = 7 days
With 50 spiders: 69 ÷ 50 = 1.4 days
2. Time Constraints
- Need fresh data daily
- Website updates every hour
- Single spider takes too long
3. Large Scale Operations
- Scraping multiple websites
- Monitoring thousands of pages
- Commercial scraping business
4. Geographic Distribution
- Need to scrape from different countries
- Bypass geo-restrictions
- Reduce latency
When You DON'T Need Distributed Crawling
You Don't Need It When:
1. Small Websites
Website has: 1,000 pages
Single spider can finish: 2 hours
Why complicate things?
2. Learning/Testing
- Just learning Scrapy
- Testing your spider
- Personal projects
3. Low Frequency Scraping
- Scraping once a month
- No time pressure
- Small data needs
4. Simple Projects
- Blog scraping
- Research projects
- One-time data collection
Rule of thumb:
- Fewer than 10,000 URLs? → Single spider
- 10,000 to 100,000 URLs? → Consider distributed if the data is time-sensitive
- More than 100,000 URLs? → Distributed usually pays off
- More than 1,000,000 URLs? → Definitely distributed
Distributed Crawling Options
Option 1: Scrapy-Redis (Most Popular)
What it is:
Extension that makes Scrapy use Redis for the queue.
Pros:
- Easy to set up
- Battle-tested
- Large community
- Good documentation
Cons:
- Requires Redis
- Learning curve
Cost:
- Free (open source)
- Redis hosting: $0-$10/month
When to use:
- Most common choice
- Good for 99% of cases
- Large scale scraping
Option 2: Scrapy Cluster
What it is:
Complete distributed system with monitoring.
Pros:
- Built-in monitoring
- Job scheduling
- REST API
- Production-ready
Cons:
- More complex setup
- Overkill for simple projects
Cost:
- Free (open source)
- Infrastructure costs
When to use:
- Professional operations
- Need monitoring
- Multiple teams
Option 3: Cloud Solutions
Scrapy Cloud (by Zyte):
- Managed Scrapy hosting
- No setup needed
- Pay per use
AWS/GCP/Azure:
- Run spiders on cloud
- Scale automatically
- Full control
When to use:
- Don't want to manage servers
- Need reliability
- Have budget
Option 4: Simple Multi-Process
What it is:
Run multiple spiders on one machine.
Pros:
- No setup needed
- Simple
- Works immediately
Cons:
- Limited to one machine
- Not truly distributed
When to use:
- Testing distributed concepts
- Medium-sized scraping
- Limited budget
Setting Up Simple Distributed Crawling
Let's start with the easiest option: Scrapy-Redis.
Step 1: Install Redis
On Ubuntu/Linux:
sudo apt update
sudo apt install redis-server
sudo systemctl start redis
On Mac:
brew install redis
brew services start redis
On Windows:
Download from: https://redis.io/download
Or use Docker (easiest):
docker run -d -p 6379:6379 redis
Step 2: Install Scrapy-Redis
pip install scrapy-redis
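Before writing any spider code, it is worth confirming that Python can actually reach Redis (the redis client library is installed as a dependency of scrapy-redis). A tiny check, assuming Redis runs locally on the default port:
# check_redis.py
import redis

r = redis.Redis(host='localhost', port=6379)
print(r.ping())  # prints True if the connection works, raises ConnectionError otherwise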
Step 3: Create Distributed Spider
# myspider.py
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = 'distributed'

    # Don't use start_urls, use redis_key instead
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # Your scraping logic
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get(),
            }

        # Follow links (they go into the shared queue)
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
Step 4: Configure Settings
# settings.py
# Enable Scrapy-Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Don't clean up the Redis queue when the spider closes
SCHEDULER_PERSIST = True
# Redis connection
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Optional: Redis password
# REDIS_PARAMS = {'password': 'your-password'}
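If Redis lives on another machine or needs a password, scrapy-redis also accepts a single connection URL instead of separate host/port settings. The address and password below are placeholders:
# settings.py (alternative to REDIS_HOST / REDIS_PORT)
REDIS_URL = 'redis://:your-password@192.168.1.50:6379/0'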
Step 5: Add URLs to Redis
# Add starting URLs to Redis
redis-cli lpush myspider:start_urls "https://example.com/page1"
redis-cli lpush myspider:start_urls "https://example.com/page2"
redis-cli lpush myspider:start_urls "https://example.com/page3"
Or with Python:
import redis

r = redis.Redis(host='localhost', port=6379)

urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
]

for url in urls:
    r.lpush('myspider:start_urls', url)
Step 6: Run Multiple Spiders
On Computer 1:
scrapy crawl distributed
On Computer 2:
scrapy crawl distributed
On Computer 3:
scrapy crawl distributed
All three spiders share the same Redis queue!
Each one:
- Takes URLs from Redis
- Scrapes them
- Adds new URLs to Redis
- Other spiders see new URLs
- No duplicate scraping
Magic!
Complete Working Example
Let's build a real distributed spider step by step.
The Scenario
Scrape 10,000 product pages from an e-commerce site using 3 computers.
Project Structure
distributed_scraper/
├── scrapy.cfg
├── settings.py
├── spiders/
│ └── products.py
└── items.py
Spider Code
# spiders/products.py
from scrapy_redis.spiders import RedisSpider

class ProductSpider(RedisSpider):
    name = 'products'
    redis_key = 'products:start_urls'

    def parse(self, response):
        """Parse a product listing page."""
        # Extract products and go to each product detail page
        for product in response.css('.product'):
            detail_url = product.css('a::attr(href)').get()
            if detail_url:
                yield response.follow(detail_url, self.parse_product)

        # Follow pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        """Parse an individual product page."""
        yield {
            'url': response.url,
            'name': response.css('h1.product-name::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
            'in_stock': response.css('.stock::text').get(),
        }
Settings
# settings.py
BOT_NAME = 'distributed_scraper'
# Scrapy-Redis settings
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
# Redis connection
REDIS_HOST = 'localhost' # Change to your Redis server IP
REDIS_PORT = 6379
# Be polite
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 16
# Output (each machine writes its own local copy of this file)
FEEDS = {
    'products.jsonl': {
        'format': 'jsonlines',   # JSON Lines can be appended safely; plain JSON cannot
        'overwrite': False,      # Append to the file instead of overwriting it
    }
}
Adding Starting URLs
# add_urls.py
import redis
r = redis.Redis(host='localhost', port=6379)
# Clear old URLs
r.delete('products:start_urls')
# Add starting URL
starting_url = 'https://example.com/products?page=1'
r.lpush('products:start_urls', starting_url)
print(f"Added {starting_url} to queue")
print(f"Queue size: {r.llen('products:start_urls')}")
Running the Spider
On Computer 1 (or Process 1):
scrapy crawl products
On Computer 2 (or Process 2):
scrapy crawl products
On Computer 3 (or Process 3):
scrapy crawl products
All three will:
- Share the same queue
- Split the work automatically
- Skip URLs another spider has already handled (no duplicate scraping)
- Write items to their own local output file (for one combined dataset, use shared storage such as the optional Redis items pipeline from earlier, then collect the items as shown below)
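If you enabled the optional RedisPipeline from the "Shared Storage" section, every machine's items land in one Redis list, and a small collector script on any machine can drain them into a single file. A sketch, assuming scrapy-redis's default 'products:items' key:
# collect_items.py - drain centrally stored items into one JSON Lines file
import redis

r = redis.Redis(host='localhost', port=6379)

with open('all_products.jsonl', 'a', encoding='utf-8') as f:
    while True:
        item = r.lpop('products:items')  # each entry is a JSON-serialized item
        if item is None:
            break  # list is empty for now
        f.write(item.decode('utf-8') + '\n')

print(f"Items still waiting in Redis: {r.llen('products:items')}")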
Monitoring Distributed Crawling
Check Queue Size
import redis
r = redis.Redis(host='localhost', port=6379)
# Check how many URLs left
queue_size = r.llen('products:start_urls')
print(f"URLs remaining: {queue_size}")
Simple Monitoring Script
# monitor.py
import redis
import time

r = redis.Redis(host='localhost', port=6379)

while True:
    # Get queue size
    queue_size = r.llen('products:start_urls')
    print(f"URLs in queue: {queue_size}")

    if queue_size == 0:
        print("Queue empty! Scraping might be done.")

    time.sleep(10)  # Check every 10 seconds
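Queue size alone does not tell you how much work is done. With scrapy-redis's default key layout (an assumption worth checking against your own settings), the duplicate filter stores every scheduled request fingerprint in a Redis set, so its size works as a rough progress counter:
# progress.py - rough progress check using scrapy-redis's default key names
import redis

r = redis.Redis(host='localhost', port=6379)

pending = r.llen('products:start_urls')   # start URLs not yet picked up
seen = r.scard('products:dupefilter')     # request fingerprints scheduled so far
print(f"Pending start URLs: {pending}, requests seen so far: {seen}")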
Common Problems and Solutions
Problem 1: Spiders Scrape Same URLs
Cause: Duplicate filter not working.
Solution:
# settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
This prevents duplicate scraping.
Problem 2: Queue Never Empties
Cause: Spider keeps finding new URLs.
Solution: Add domain restrictions
class ProductSpider(RedisSpider):
    name = 'products'
    allowed_domains = ['example.com']  # Only follow links on this domain
Problem 3: Can't Connect to Redis
Error:
redis.exceptions.ConnectionError
Solutions:
- Check Redis is running: redis-cli ping
- Check the REDIS_HOST address in settings
- Check that the firewall allows port 6379
Problem 4: Too Slow
Even with multiple spiders?
Solutions:
- Increase CONCURRENT_REQUESTS
- Use faster proxies
- Reduce DOWNLOAD_DELAY
- Add more spiders
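Most of these knobs are ordinary Scrapy settings. A cautious starting point to experiment with, not a recommendation for any particular site:
# settings.py - throughput tuning (adjust for your target site and stay polite)
CONCURRENT_REQUESTS = 32             # more parallel requests per spider process
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # but cap the pressure on any single domain
DOWNLOAD_DELAY = 0.5                 # smaller delay = faster, less polite

# Let Scrapy adapt the request rate to how the site responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0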
Simple Multi-Process Alternative
Don't want to set up Redis? Use multiple processes on one machine:
Method 1: Run Multiple Processes
# Terminal 1
scrapy crawl myspider -a start=0 -a end=1000
# Terminal 2
scrapy crawl myspider -a start=1000 -a end=2000
# Terminal 3
scrapy crawl myspider -a start=2000 -a end=3000
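For this to work, the spider has to accept the start and end arguments and turn them into its own slice of URLs. A minimal sketch; the page-number URL pattern is made up for illustration:
# myspider.py - splits the work by page ranges passed via -a start=... -a end=...
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, start=0, end=1000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each process gets its own, non-overlapping range of pages
        self.start_urls = [
            f'https://example.com/products?page={n}'
            for n in range(int(start), int(end))
        ]

    def parse(self, response):
        for product in response.css('.product'):
            yield {'name': product.css('.name::text').get()}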
Method 2: Python Script
# run_distributed.py
import subprocess

# Number of processes
num_processes = 4

# Total URLs, split evenly across processes
total_urls = 10000
urls_per_process = total_urls // num_processes

processes = []
for i in range(num_processes):
    start = i * urls_per_process
    end = start + urls_per_process
    cmd = f'scrapy crawl myspider -a start={start} -a end={end}'
    process = subprocess.Popen(cmd, shell=True)
    processes.append(process)
    print(f"Started process {i+1}: URLs {start}-{end}")

# Wait for all to finish
for process in processes:
    process.wait()

print("All processes finished!")
This is simpler but limited to one machine.
Cost Comparison
Single Spider (Baseline)
Cost:
- 1 server: $5/month
- Time: 10 days
Total: $5/month
Distributed (5 Spiders)
Cost:
- 5 servers: $25/month
- Redis: $5/month
- Time: 2 days
Total: $30/month
Worth it?
- If time is valuable: YES
- If just learning: NO
When to Upgrade to Distributed
Start with single spider. Upgrade when:
Sign 1: Takes Too Long
Single spider: 5 days
Need data: Daily
Problem: Can't keep up!
Sign 2: Missing Time Windows
Website updates: Every 6 hours
Scraping takes: 12 hours
Problem: Always behind!
Sign 3: Growing Scale
Started: 1,000 pages
Now: 100,000 pages
Single spider: Can't handle it
Sign 4: Business Need
Making money from data
Time = money
Faster scraping = more profit
Best Practices
1. Start Small
Test with 2-3 spiders first:
# Don't start with 100 spiders!
# Start with 2-3, see if it works
2. Monitor Everything
# Log everything
self.logger.info(f'Scraped {response.url}')
3. Use Realistic Delays
# Even distributed, be polite
DOWNLOAD_DELAY = 1
4. Test Locally First
# Test on your laptop before deploying
# Make sure spider works correctly
5. Plan Your URLs
# Know how many URLs you have
# Calculate: URLs ÷ spiders = time
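That back-of-the-envelope calculation is easy to script. A tiny helper, using the same example numbers as earlier in this post:
# estimate.py - rough crawl-time estimate
def crawl_hours(total_urls, urls_per_minute_per_spider, spiders):
    minutes = total_urls / (urls_per_minute_per_spider * spiders)
    return minutes / 60

# 1,000,000 URLs at 10 URLs per minute per spider
print(f"1 spider:   {crawl_hours(1_000_000, 10, 1):.0f} hours")   # ~1667 hours (about 69 days)
print(f"10 spiders: {crawl_hours(1_000_000, 10, 10):.0f} hours")  # ~167 hours (about 7 days)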
Quick Decision Guide
Should I use distributed crawling?
How many URLs?
├─ Less than 10,000
│ └─ Use single spider
│
├─ 10,000 - 100,000
│ └─ Maybe distributed (if time-sensitive)
│
└─ More than 100,000
└─ Definitely distributed
What option should I choose?
Budget?
├─ Low budget
│ └─ Scrapy-Redis (self-hosted)
│
├─ Medium budget
│ └─ Multiple small servers
│
└─ High budget
└─ Managed solutions (Scrapy Cloud)
Summary
What is distributed crawling?
Running multiple spiders at the same time, sharing the same queue.
When to use it:
- More than 100,000 URLs
- Time-sensitive data
- Commercial operations
- Large scale scraping
When NOT to use it:
- Less than 10,000 URLs
- Learning/testing
- No time pressure
- Simple projects
Best option for beginners:
Scrapy-Redis with 2-3 spiders
Setup:
- Install Redis
- Install scrapy-redis
- Configure settings
- Add URLs to Redis
- Run multiple spiders
Key settings:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = 'localhost'
Remember:
- Start simple (single spider)
- Upgrade when needed
- Test before scaling
- Monitor everything
- Be polite even when distributed
Distributed crawling is powerful but not always necessary. Start simple, scale when you need to!
Happy scraping! 🕷️