A person scraped a website with 1 million products. Their single laptop took 3 days to finish. The website was updated daily, so the data was already outdated by the time the crawl completed.
Then the person learned about distributed crawling. They ran the same spider across 10 computers, and it completed the task in just 7 hours, delivering fresh data every day.
Let me explain what distributed crawling is and when you actually need it, in the simplest way possible.
What is Distributed Crawling? (Super Simple Explanation)
Imagine you need to paint 100 houses.
Single crawler (normal Scrapy):
You → Paint house 1 → Paint house 2 → Paint house 3... → Paint house 100
(Takes 100 days, one house per day)
Distributed crawling:
You → Paint houses 1-10 (10 days)
Friend 1 → Paint houses 11-20 (10 days)
Friend 2 → Paint houses 21-30 (10 days)
...
Friend 9 → Paint houses 91-100 (10 days)
All working at the same time!
(Takes 10 days total, 10x faster!)
In web scraping:
- Instead of one computer scraping all URLs
- You have multiple computers
- Each scrapes different URLs
- All working at the same time
- Much faster!
How Does Distributed Crawling Work?
The Simple Version
Normal Scrapy (Single Machine):
Spider → Queue → Download → Parse → Save
(One machine does everything)
Distributed Scrapy:
Computer 1 → Spider ──┐
Computer 2 → Spider ──┤
                      ├── Shared Queue
Computer 3 → Spider ──┤
Computer 4 → Spider ──┘
All computers share the same queue. They take URLs from the queue and scrape them.
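The core idea fits in a few lines. Here is a minimal, Scrapy-free sketch of the loop every worker machine runs; the Redis location and the queue name crawl:queue are placeholders for illustration (scrapy-redis, covered below, does this for you):
# worker.py - conceptual sketch only, not the scrapy-redis implementation
import redis
import requests

r = redis.Redis(host='localhost', port=6379)

while True:
    # Atomically take the next URL from the shared queue (wait up to 30 seconds)
    item = r.blpop('crawl:queue', timeout=30)
    if item is None:
        break  # queue stayed empty, assume the crawl is finished

    url = item[1].decode('utf-8')

    # "Scrape" the page; real code would parse it and push new links back into the queue
    response = requests.get(url, timeout=10)
    print(f"{url} -> {response.status_code}, {len(response.text)} bytes")
Every extra machine running this same loop against the same Redis server automatically shares the work, because blpop hands each URL to exactly one worker.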
What You Need
1. Shared Queue
A place where all computers can see which URLs need scraping.
Common options:
- Redis (most popular)
- RabbitMQ
- MongoDB
2. Multiple Computers/Servers
- Your laptop
- Cloud servers (AWS, DigitalOcean, etc.)
- Multiple processes on one machine
3. Shared Storage (optional)
Where to save scraped data so all spiders can access it.
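One simple way to get shared storage with the Scrapy-Redis setup shown later in this post: scrapy-redis ships an optional item pipeline that pushes every scraped item into a Redis list, so all machines write to one central place. A minimal configuration sketch (the priority value 300 is arbitrary):
# settings.py (optional: collect items centrally in Redis instead of local files)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
# With default settings, items are serialized to JSON and pushed into a
# Redis list named '<spider name>:items'.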
When Do You NEED Distributed Crawling?
You Need It When:
1. Too Many URLs
1 million URLs
Single spider: 10 URLs per minute
Time needed: 1,000,000 ÷ 10 = 100,000 minutes = 69 days!
With 10 spiders: 69 ÷ 10 = 7 days
With 50 spiders: 69 ÷ 50 = 1.4 days
2. Time Constraints
- Need fresh data daily
- Website updates every hour
- Single spider takes too long
3. Large Scale Operations
- Scraping multiple websites
- Monitoring thousands of pages
- Commercial scraping business
4. Geographic Distribution
- Need to scrape from different countries
- Bypass geo-restrictions
- Reduce latency
When You DON'T Need Distributed Crawling
You Don't Need It When:
1. Small Websites
Website has: 1,000 pages
Single spider can finish: 2 hours
Why complicate things?
2. Learning/Testing
- Just learning Scrapy
- Testing your spider
- Personal projects
3. Low Frequency Scraping
- Scraping once a month
- No time pressure
- Small data needs
4. Simple Projects
- Blog scraping
- Research projects
- One-time data collection
Rule of thumb:
- Fewer than 10,000 URLs? → Single spider
- 10,000 to 100,000 URLs? → Consider distributed if the data is time-sensitive
- More than 100,000 URLs? → Distributed usually pays off
- More than 1,000,000 URLs? → Definitely distributed
Distributed Crawling Options
Option 1: Scrapy-Redis (Most Popular)
What it is:
Extension that makes Scrapy use Redis for the queue.
Pros:
- Easy to set up
- Battle-tested
- Large community
- Good documentation
Cons:
- Requires Redis
- Learning curve
Cost:
- Free (open source)
- Redis hosting: $0-$10/month
When to use:
- Most common choice
- Good for 99% of cases
- Large scale scraping
Option 2: Scrapy Cluster
What it is:
Complete distributed system with monitoring.
Pros:
- Built-in monitoring
- Job scheduling
- REST API
- Production-ready
Cons:
- More complex setup
- Overkill for simple projects
Cost:
- Free (open source)
- Infrastructure costs
When to use:
- Professional operations
- Need monitoring
- Multiple teams
Option 3: Cloud Solutions
Scrapy Cloud (by Zyte):
- Managed Scrapy hosting
- No setup needed
- Pay per use
AWS/GCP/Azure:
- Run spiders on cloud
- Scale automatically
- Full control
When to use:
- Don't want to manage servers
- Need reliability
- Have budget
Option 4: Simple Multi-Process
What it is:
Run multiple spiders on one machine.
Pros:
- No setup needed
- Simple
- Works immediately
Cons:
- Limited to one machine
- Not truly distributed
When to use:
- Testing distributed concepts
- Medium-sized scraping
- Limited budget
Setting Up Simple Distributed Crawling
Let's start with the easiest option: Scrapy-Redis.
Step 1: Install Redis
On Ubuntu/Linux:
sudo apt update
sudo apt install redis-server
sudo systemctl start redis
On Mac:
brew install redis
brew services start redis
On Windows:
Download from: https://redis.io/download
Or use Docker (easiest):
docker run -d -p 6379:6379 redis
Step 2: Install Scrapy-Redis
pip install scrapy-redis
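Before writing any spider code, it is worth confirming that Python can actually reach Redis (the redis client library is installed as a dependency of scrapy-redis). A tiny check, assuming Redis runs locally on the default port:
# check_redis.py
import redis

r = redis.Redis(host='localhost', port=6379)
print(r.ping())  # prints True if the connection works, raises ConnectionError otherwise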
Step 3: Create Distributed Spider
# myspider.py
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = 'distributed'

    # Don't use start_urls, use redis_key instead
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # Your scraping logic
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get(),
            }

        # Follow links (they go into the shared queue)
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
Step 4: Configure Settings
# settings.py
# Enable Scrapy-Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Don't clean up the Redis queue when the spider closes
SCHEDULER_PERSIST = True
# Redis connection
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Optional: Redis password
# REDIS_PARAMS = {'password': 'your-password'}
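If Redis lives on another machine or needs a password, scrapy-redis also accepts a single connection URL instead of separate host/port settings. The address and password below are placeholders:
# settings.py (alternative to REDIS_HOST / REDIS_PORT)
REDIS_URL = 'redis://:your-password@192.168.1.50:6379/0'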
Step 5: Add URLs to Redis
# Add starting URLs to Redis
redis-cli lpush myspider:start_urls "https://example.com/page1"
redis-cli lpush myspider:start_urls "https://example.com/page2"
redis-cli lpush myspider:start_urls "https://example.com/page3"
Or with Python:
import redis

r = redis.Redis(host='localhost', port=6379)

urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
]

for url in urls:
    r.lpush('myspider:start_urls', url)
Step 6: Run Multiple Spiders
On Computer 1:
scrapy crawl distributed
On Computer 2:
scrapy crawl distributed
On Computer 3:
scrapy crawl distributed
All three spiders share the same Redis queue!
Each one:
- Takes URLs from Redis
- Scrapes them
- Adds new URLs to Redis
- Other spiders see new URLs
- No duplicate scraping
Magic!
Complete Working Example
Let's build a real distributed spider step by step.
The Scenario
Scrape 10,000 product pages from an e-commerce site using 3 computers.
Project Structure
distributed_scraper/
├── scrapy.cfg
├── settings.py
├── spiders/
│ └── products.py
└── items.py
Spider Code
# spiders/products.py
from scrapy_redis.spiders import RedisSpider

class ProductSpider(RedisSpider):
    name = 'products'
    redis_key = 'products:start_urls'

    def parse(self, response):
        """Parse a product listing page."""
        # Extract products and go to each product detail page
        for product in response.css('.product'):
            detail_url = product.css('a::attr(href)').get()
            if detail_url:
                yield response.follow(detail_url, self.parse_product)

        # Follow pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        """Parse an individual product page."""
        yield {
            'url': response.url,
            'name': response.css('h1.product-name::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
            'in_stock': response.css('.stock::text').get(),
        }
Settings
# settings.py
BOT_NAME = 'distributed_scraper'
# Scrapy-Redis settings
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
# Redis connection
REDIS_HOST = 'localhost' # Change to your Redis server IP
REDIS_PORT = 6379
# Be polite
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 16
# Output (each machine writes its own local copy of this file)
FEEDS = {
    'products.jsonl': {
        'format': 'jsonlines',   # JSON Lines can be appended safely; plain JSON cannot
        'overwrite': False,      # Append to the file instead of overwriting it
    }
}
Adding Starting URLs
# add_urls.py
import redis
r = redis.Redis(host='localhost', port=6379)
# Clear old URLs
r.delete('products:start_urls')
# Add starting URL
starting_url = 'https://example.com/products?page=1'
r.lpush('products:start_urls', starting_url)
print(f"Added {starting_url} to queue")
print(f"Queue size: {r.llen('products:start_urls')}")
Running the Spider
On Computer 1 (or Process 1):
scrapy crawl products
On Computer 2 (or Process 2):
scrapy crawl products
On Computer 3 (or Process 3):
scrapy crawl products
All three will:
- Share the same queue
- Split the work automatically
- Skip URLs another spider has already handled (no duplicate scraping)
- Write items to their own local output file (for one combined dataset, use shared storage such as the optional Redis items pipeline from earlier, then collect the items as shown below)
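If you enabled the optional RedisPipeline from the "Shared Storage" section, every machine's items land in one Redis list, and a small collector script on any machine can drain them into a single file. A sketch, assuming scrapy-redis's default 'products:items' key:
# collect_items.py - drain centrally stored items into one JSON Lines file
import redis

r = redis.Redis(host='localhost', port=6379)

with open('all_products.jsonl', 'a', encoding='utf-8') as f:
    while True:
        item = r.lpop('products:items')  # each entry is a JSON-serialized item
        if item is None:
            break  # list is empty for now
        f.write(item.decode('utf-8') + '\n')

print(f"Items still waiting in Redis: {r.llen('products:items')}")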
Monitoring Distributed Crawling
Check Queue Size
import redis
r = redis.Redis(host='localhost', port=6379)
# Check how many URLs left
queue_size = r.llen('products:start_urls')
print(f"URLs remaining: {queue_size}")
Simple Monitoring Script
# monitor.py
import redis
import time

r = redis.Redis(host='localhost', port=6379)

while True:
    # Get queue size
    queue_size = r.llen('products:start_urls')
    print(f"URLs in queue: {queue_size}")

    if queue_size == 0:
        print("Queue empty! Scraping might be done.")

    time.sleep(10)  # Check every 10 seconds
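Queue size alone does not tell you how much work is done. With scrapy-redis's default key layout (an assumption worth checking against your own settings), the duplicate filter stores every scheduled request fingerprint in a Redis set, so its size works as a rough progress counter:
# progress.py - rough progress check using scrapy-redis's default key names
import redis

r = redis.Redis(host='localhost', port=6379)

pending = r.llen('products:start_urls')   # start URLs not yet picked up
seen = r.scard('products:dupefilter')     # request fingerprints scheduled so far
print(f"Pending start URLs: {pending}, requests seen so far: {seen}")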
Common Problems and Solutions
Problem 1: Spiders Scrape Same URLs
Cause: Duplicate filter not working.
Solution:
# settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
This prevents duplicate scraping.
Problem 2: Queue Never Empties
Cause: Spider keeps finding new URLs.
Solution: Add domain restrictions
class ProductSpider(RedisSpider):
    name = 'products'
    allowed_domains = ['example.com']  # Only follow links on this domain
Problem 3: Can't Connect to Redis
Error:
redis.exceptions.ConnectionError
Solutions:
- Check Redis is running: redis-cli ping
- Check the REDIS_HOST address in settings
- Check that the firewall allows port 6379
Problem 4: Too Slow
Even with multiple spiders?
Solutions:
- Increase CONCURRENT_REQUESTS
- Use faster proxies
- Reduce DOWNLOAD_DELAY
- Add more spiders
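Most of these knobs are ordinary Scrapy settings. A cautious starting point to experiment with, not a recommendation for any particular site:
# settings.py - throughput tuning (adjust for your target site and stay polite)
CONCURRENT_REQUESTS = 32             # more parallel requests per spider process
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # but cap the pressure on any single domain
DOWNLOAD_DELAY = 0.5                 # smaller delay = faster, less polite

# Let Scrapy adapt the request rate to how the site responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0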
Simple Multi-Process Alternative
Don't want to set up Redis? Use multiple processes on one machine:
Method 1: Run Multiple Processes
# Terminal 1
scrapy crawl myspider -a start=0 -a end=1000
# Terminal 2
scrapy crawl myspider -a start=1000 -a end=2000
# Terminal 3
scrapy crawl myspider -a start=2000 -a end=3000
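For this to work, the spider has to accept the start and end arguments and turn them into its own slice of URLs. A minimal sketch; the page-number URL pattern is made up for illustration:
# myspider.py - splits the work by page ranges passed via -a start=... -a end=...
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, start=0, end=1000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each process gets its own, non-overlapping range of pages
        self.start_urls = [
            f'https://example.com/products?page={n}'
            for n in range(int(start), int(end))
        ]

    def parse(self, response):
        for product in response.css('.product'):
            yield {'name': product.css('.name::text').get()}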
Method 2: Python Script
# run_distributed.py
import subprocess

# Number of processes
num_processes = 4

# Total URLs, split evenly across processes
total_urls = 10000
urls_per_process = total_urls // num_processes

processes = []
for i in range(num_processes):
    start = i * urls_per_process
    end = start + urls_per_process
    cmd = f'scrapy crawl myspider -a start={start} -a end={end}'
    process = subprocess.Popen(cmd, shell=True)
    processes.append(process)
    print(f"Started process {i+1}: URLs {start}-{end}")

# Wait for all to finish
for process in processes:
    process.wait()

print("All processes finished!")
This is simpler but limited to one machine.
Cost Comparison
Single Spider (Baseline)
Cost:
- 1 server: $5/month
- Time: 10 days
Total: $5/month
Distributed (5 Spiders)
Cost:
- 5 servers: $25/month
- Redis: $5/month
- Time: 2 days
Total: $30/month
Worth it?
- If time is valuable: YES
- If just learning: NO
When to Upgrade to Distributed
Start with single spider. Upgrade when:
Sign 1: Takes Too Long
Single spider: 5 days
Need data: Daily
Problem: Can't keep up!
Sign 2: Missing Time Windows
Website updates: Every 6 hours
Scraping takes: 12 hours
Problem: Always behind!
Sign 3: Growing Scale
Started: 1,000 pages
Now: 100,000 pages
Single spider: Can't handle it
Sign 4: Business Need
Making money from data
Time = money
Faster scraping = more profit
Best Practices
1. Start Small
Test with 2-3 spiders first:
# Don't start with 100 spiders!
# Start with 2-3, see if it works
2. Monitor Everything
# Log everything
self.logger.info(f'Scraped {response.url}')
3. Use Realistic Delays
# Even distributed, be polite
DOWNLOAD_DELAY = 1
4. Test Locally First
# Test on your laptop before deploying
# Make sure spider works correctly
5. Plan Your URLs
# Know how many URLs you have
# Calculate: URLs ÷ spiders = time
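That back-of-the-envelope calculation is easy to script. A tiny helper, using the same example numbers as earlier in this post:
# estimate.py - rough crawl-time estimate
def crawl_hours(total_urls, urls_per_minute_per_spider, spiders):
    minutes = total_urls / (urls_per_minute_per_spider * spiders)
    return minutes / 60

# 1,000,000 URLs at 10 URLs per minute per spider
print(f"1 spider:   {crawl_hours(1_000_000, 10, 1):.0f} hours")   # ~1667 hours (about 69 days)
print(f"10 spiders: {crawl_hours(1_000_000, 10, 10):.0f} hours")  # ~167 hours (about 7 days)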
Quick Decision Guide
Should I use distributed crawling?
How many URLs?
├─ Less than 10,000
│ └─ Use single spider
│
├─ 10,000 - 100,000
│ └─ Maybe distributed (if time-sensitive)
│
└─ More than 100,000
└─ Definitely distributed
What option should I choose?
Budget?
├─ Low budget
│ └─ Scrapy-Redis (self-hosted)
│
├─ Medium budget
│ └─ Multiple small servers
│
└─ High budget
└─ Managed solutions (Scrapy Cloud)
Summary
What is distributed crawling?
Running multiple spiders at the same time, sharing the same queue.
When to use it:
- More than 100,000 URLs
- Time-sensitive data
- Commercial operations
- Large scale scraping
When NOT to use it:
- Less than 10,000 URLs
- Learning/testing
- No time pressure
- Simple projects
Best option for beginners:
Scrapy-Redis with 2-3 spiders
Setup:
- Install Redis
- Install scrapy-redis
- Configure settings
- Add URLs to Redis
- Run multiple spiders
Key settings:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = 'localhost'
Remember:
- Start simple (single spider)
- Upgrade when needed
- Test before scaling
- Monitor everything
- Be polite even when distributed
Distributed crawling is powerful but not always necessary. Start simple, scale when you need to!
Happy scraping! 🕷️