Muhammad Ikramullah Khan

Distributed Crawling: The Beginner's Guide

A person scraped a website with 1 million products. Their single laptop took 3 days to finish. The website was updated daily, so the data was already outdated by the time the crawl completed.

Then the person learned about distributed crawling. They ran the same spider across 10 computers, and it completed the task in just 7 hours, delivering fresh data every day.

Let me explain what distributed crawling is and when you actually need it, in the simplest way possible.


What is Distributed Crawling? (Super Simple Explanation)

Imagine you need to paint 100 houses.

Single crawler (normal Scrapy):

You → Paint house 1 → Paint house 2 → Paint house 3... → Paint house 100
(Takes 100 days, one house per day)

Distributed crawling:

You        → Paint houses 1-10   (10 days)
Friend 1   → Paint houses 11-20  (10 days)
Friend 2   → Paint houses 21-30  (10 days)
...
Friend 9   → Paint houses 91-100 (10 days)

All working at the same time!
(Takes 10 days total, 10x faster!)

In web scraping:

  • Instead of one computer scraping all URLs
  • You have multiple computers
  • Each scrapes different URLs
  • All working at the same time
  • Much faster!

How Does Distributed Crawling Work?

The Simple Version

Normal Scrapy (Single Machine):

Spider → Queue → Download → Parse → Save
(One machine does everything)

Distributed Scrapy:

Computer 1 → Spider ──┐
Computer 2 → Spider ──┤
Computer 3 → Spider ──┼── Shared Queue
Computer 4 → Spider ──┘
(every spider pulls URLs from the queue and pushes new ones back in)

All computers share the same queue. They take URLs from the queue and scrape them.
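
Here is the core idea stripped of Scrapy entirely: a minimal sketch (assuming a local Redis server and a made-up key name, urls_to_scrape) where every worker, on any machine, runs the same loop against one shared list.

# shared_queue_demo.py - the shared-queue idea without Scrapy (illustrative only)
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Any machine can add work to the shared list
r.lpush('urls_to_scrape', 'https://example.com/page1')

# Every worker runs this same loop
while True:
    url = r.rpop('urls_to_scrape')  # atomically take the next URL
    if url is None:
        break  # nothing left in the shared queue
    print(f'This worker would now fetch and parse {url}')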

What You Need

1. Shared Queue
A place where all computers can see which URLs need scraping.

Common options:

  • Redis (most popular)
  • RabbitMQ
  • MongoDB

2. Multiple Computers/Servers

  • Your laptop
  • Cloud servers (AWS, DigitalOcean, etc.)
  • Multiple processes on one machine

3. Shared Storage (optional)
Where to save scraped data so all spiders can access it.


When Do You NEED Distributed Crawling?

You Need It When:

1. Too Many URLs

1 million URLs
Single spider: 10 URLs per minute
Time needed: 1,000,000 ÷ 10 = 100,000 minutes = 69 days!

With 10 spiders: 69 ÷ 10 = 7 days
With 50 spiders: 69 ÷ 50 = 1.4 days
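
If you want to run the same arithmetic for your own numbers, here's a tiny helper (plain Python, nothing Scrapy-specific):

def crawl_days(total_urls, urls_per_minute_per_spider, spiders=1):
    """Rough crawl time in days, ignoring overhead, retries, and rate limits."""
    minutes = total_urls / (urls_per_minute_per_spider * spiders)
    return minutes / (60 * 24)

print(round(crawl_days(1_000_000, 10), 1))      # 69.4 days with one spider
print(round(crawl_days(1_000_000, 10, 10), 1))  # 6.9 days with ten spiders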

2. Time Constraints

  • Need fresh data daily
  • Website updates every hour
  • Single spider takes too long

3. Large Scale Operations

  • Scraping multiple websites
  • Monitoring thousands of pages
  • Commercial scraping business

4. Geographic Distribution

  • Need to scrape from different countries
  • Bypass geo-restrictions
  • Reduce latency

When You DON'T Need Distributed Crawling

You Don't Need It When:

1. Small Websites

Website has: 1,000 pages
Single spider can finish: 2 hours
Why complicate things?

2. Learning/Testing

  • Just learning Scrapy
  • Testing your spider
  • Personal projects

3. Low Frequency Scraping

  • Scraping once a month
  • No time pressure
  • Small data needs

4. Simple Projects

  • Blog scraping
  • Research projects
  • One-time data collection

Rule of thumb:

  • Fewer than 10,000 URLs? → Single spider
  • 10,000 to 100,000 URLs? → Consider distributed if the data is time-sensitive
  • More than 100,000 URLs? → Definitely distributed

Distributed Crawling Options

Option 1: Scrapy-Redis (Most Popular)

What it is:
Extension that makes Scrapy use Redis for the queue.

Pros:

  • Easy to set up
  • Battle-tested
  • Large community
  • Good documentation

Cons:

  • Requires Redis
  • Learning curve

Cost:

  • Free (open source)
  • Redis hosting: $0-$10/month

When to use:

  • Most common choice
  • Good for 99% of cases
  • Large scale scraping

Option 2: Scrapy Cluster

What it is:
Complete distributed system with monitoring.

Pros:

  • Built-in monitoring
  • Job scheduling
  • REST API
  • Production-ready

Cons:

  • More complex setup
  • Overkill for simple projects

Cost:

  • Free (open source)
  • Infrastructure costs

When to use:

  • Professional operations
  • Need monitoring
  • Multiple teams

Option 3: Cloud Solutions

Scrapy Cloud (by Zyte):

  • Managed Scrapy hosting
  • No setup needed
  • Pay per use

AWS/GCP/Azure:

  • Run spiders on cloud
  • Scale automatically
  • Full control

When to use:

  • Don't want to manage servers
  • Need reliability
  • Have budget

Option 4: Simple Multi-Process

What it is:
Run multiple spiders on one machine.

Pros:

  • No setup needed
  • Simple
  • Works immediately

Cons:

  • Limited to one machine
  • Not truly distributed

When to use:

  • Testing distributed concepts
  • Medium-sized scraping
  • Limited budget

Setting Up Simple Distributed Crawling

Let's start with the easiest option: Scrapy-Redis.

Step 1: Install Redis

On Ubuntu/Linux:

sudo apt update
sudo apt install redis-server
sudo systemctl start redis

On Mac:

brew install redis
brew services start redis

On Windows:
There is no official native Windows build of Redis; use WSL or the Docker option below (see https://redis.io/download for details).

Or use Docker (easiest):

docker run -d -p 6379:6379 redis

Step 2: Install Scrapy-Redis

pip install scrapy-redis
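
Before going further, it's worth confirming you can actually reach Redis from Python (scrapy-redis pulls in the redis client):

import redis

r = redis.Redis(host='localhost', port=6379)
print(r.ping())  # True means Redis is reachable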

Step 3: Create Distributed Spider

# myspider.py
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = 'distributed'

    # Don't use start_urls, use redis_key instead
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # Your scraping logic
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow links (goes to shared queue)
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)

Step 4: Configure Settings

# settings.py

# Enable Scrapy-Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Don't cleanup Redis queue on spider close
SCHEDULER_PERSIST = True

# Redis connection
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Optional: Redis password
# REDIS_PARAMS = {'password': 'your-password'}

Step 5: Add URLs to Redis

# Add starting URLs to Redis
redis-cli lpush myspider:start_urls "https://example.com/page1"
redis-cli lpush myspider:start_urls "https://example.com/page2"
redis-cli lpush myspider:start_urls "https://example.com/page3"

Or with Python:

import redis

r = redis.Redis(host='localhost', port=6379)

urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
]

for url in urls:
    r.lpush('myspider:start_urls', url)

Step 6: Run Multiple Spiders

On Computer 1:

scrapy crawl distributed

On Computer 2:

scrapy crawl distributed

On Computer 3:

scrapy crawl distributed

All three spiders share the same Redis queue!

Each one:

  • Takes URLs from Redis
  • Scrapes them
  • Adds new URLs to Redis
  • Other spiders see new URLs
  • No duplicate scraping

Magic!


Complete Working Example

Let's build a real distributed spider step by step.

The Scenario

Scrape 10,000 product pages from an e-commerce site using 3 computers.

Project Structure

distributed_scraper/
├── scrapy.cfg
├── settings.py
├── spiders/
│   └── products.py
└── items.py

Spider Code

# spiders/products.py
from scrapy_redis.spiders import RedisSpider

class ProductSpider(RedisSpider):
    name = 'products'
    redis_key = 'products:start_urls'

    def parse(self, response):
        """Parse product listing page"""

        # Extract products
        for product in response.css('.product'):
            # Go to product detail page
            detail_url = product.css('a::attr(href)').get()
            if detail_url:
                yield response.follow(detail_url, self.parse_product)

        # Follow pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        """Parse individual product page"""

        yield {
            'url': response.url,
            'name': response.css('h1.product-name::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
            'in_stock': response.css('.stock::text').get(),
        }

Settings

# settings.py

BOT_NAME = 'distributed_scraper'

# Scrapy-Redis settings
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True

# Redis connection
REDIS_HOST = 'localhost'  # Change to your Redis server IP
REDIS_PORT = 6379

# Be polite
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 16

# Output (JSON Lines can be appended to safely; appending to a plain .json feed breaks the file)
FEEDS = {
    'products.jsonl': {
        'format': 'jsonlines',
        'overwrite': False,  # Append, don't overwrite
    }
}
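
One thing to keep in mind: the FEEDS export above writes a local file on whichever machine the spider runs, so with three computers you end up with three separate files. If you'd rather collect everything in one place, scrapy-redis also ships an item pipeline that pushes scraped items into a Redis list. A sketch of enabling it (300 is just an ordinary pipeline priority):

# settings.py (optional)
# Push every scraped item into Redis so all machines feed one collection point
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}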

Adding Starting URLs

# add_urls.py
import redis

r = redis.Redis(host='localhost', port=6379)

# Clear old URLs
r.delete('products:start_urls')

# Add starting URL
starting_url = 'https://example.com/products?page=1'
r.lpush('products:start_urls', starting_url)

print(f"Added {starting_url} to queue")
print(f"Queue size: {r.llen('products:start_urls')}")

Running the Spider

On Computer 1 (or Process 1):

scrapy crawl products

On Computer 2 (or Process 2):

scrapy crawl products

On Computer 3 (or Process 3):

scrapy crawl products

All three will:

  • Share the same queue
  • Split the work automatically
  • No duplicate scraping
  • Write items to their own local products.jsonl (or into Redis, if you enabled the item pipeline shown earlier)
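
If you enabled the Redis item pipeline from the settings section, here is a small sketch for pulling the collected items back out afterwards (assuming the default key name, which follows the <spider>:items pattern):

# collect_items.py
import json
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

raw_items = r.lrange('products:items', 0, -1)   # everything pushed by any spider
items = [json.loads(raw) for raw in raw_items]

print(f'Collected {len(items)} items from all machines')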

Monitoring Distributed Crawling

Check Queue Size

import redis

r = redis.Redis(host='localhost', port=6379)

# Check how many URLs left
queue_size = r.llen('products:start_urls')
print(f"URLs remaining: {queue_size}")

Simple Monitoring Script

# monitor.py
import redis
import time

r = redis.Redis(host='localhost', port=6379)

while True:
    # Get queue size
    queue_size = r.llen('products:start_urls')

    print(f"Seed URLs remaining: {queue_size}")

    if queue_size == 0:
        print("Seed list is empty (the spiders may still be working, see the note below)")

    time.sleep(10)  # Check every 10 seconds
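
One caveat with the script above: products:start_urls only holds the seed URLs, so it empties almost as soon as the spiders start. The requests discovered during the crawl sit in the scheduler's own Redis key, which with stock scrapy-redis settings is a sorted set named after the spider (here products:requests). A hedged addition to the monitor:

# Pending requests in the scheduler queue (key name assumes default scrapy-redis settings)
pending = r.zcard('products:requests')
print(f"Requests waiting in the scheduler: {pending}")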

Common Problems and Solutions

Problem 1: Spiders Scrape Same URLs

Cause: Duplicate filter not working.

Solution:

# settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

This prevents duplicate scraping.
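
To sanity-check the filter, note that the fingerprints of requests that have already been seen are stored in a Redis set; with default scrapy-redis settings the key is named after the spider (here products:dupefilter):

import redis

r = redis.Redis(host='localhost', port=6379)
print(r.scard('products:dupefilter'))  # number of unique requests seen so far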

Problem 2: Queue Never Empties

Cause: Spider keeps finding new URLs.

Solution: Add domain restrictions

class ProductSpider(RedisSpider):
    name = 'products'
    allowed_domains = ['example.com']  # Only this domain

Problem 3: Can't Connect to Redis

Error:

redis.exceptions.ConnectionError

Solutions:

  1. Check Redis is running: redis-cli ping
  2. Check IP address in settings
  3. Check firewall allows port 6379

Problem 4: Too Slow

Even with multiple spiders?

Solutions:

  • Increase CONCURRENT_REQUESTS
  • Use faster proxies
  • Reduce DOWNLOAD_DELAY
  • Add more spiders
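
For example, a few knobs to experiment with in settings.py (these values are starting points that assume the target site tolerates them, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 32            # more parallel requests per spider process
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.5                # smaller delay is faster, but less polite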

Simple Multi-Process Alternative

Don't want to set up Redis? Use multiple processes on one machine:

Method 1: Run Multiple Processes

# Terminal 1
scrapy crawl myspider -a start=0 -a end=1000

# Terminal 2
scrapy crawl myspider -a start=1000 -a end=2000

# Terminal 3
scrapy crawl myspider -a start=2000 -a end=3000
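
For those commands to work, the spider has to accept the -a start / -a end arguments and build its own slice of URLs from them. A minimal sketch, assuming the URLs can be generated from an index (the ?page= pattern is just an example):

# myspider.py (sketch)
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, start=0, end=1000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a arguments always arrive as strings
        self.start_index = int(start)
        self.end_index = int(end)

    def start_requests(self):
        for i in range(self.start_index, self.end_index):
            yield scrapy.Request(f'https://example.com/products?page={i}')

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}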

Method 2: Python Script

# run_distributed.py
import subprocess

# Number of processes
num_processes = 4

# Total URLs
total_urls = 10000
urls_per_process = total_urls // num_processes

processes = []

for i in range(num_processes):
    start = i * urls_per_process
    end = start + urls_per_process

    cmd = f'scrapy crawl myspider -a start={start} -a end={end}'

    process = subprocess.Popen(cmd, shell=True)
    processes.append(process)

    print(f"Started process {i+1}: URLs {start}-{end}")

# Wait for all to finish
for process in processes:
    process.wait()

print("All processes finished!")

This is simpler but limited to one machine.


Cost Comparison

Single Spider (Baseline)

Cost:

  • 1 server: $5/month
  • Time: 10 days

Total: $5/month

Distributed (5 Spiders)

Cost:

  • 5 servers: $25/month
  • Redis: $5/month
  • Time: 2 days

Total: $30/month

Worth it?

  • If time is valuable: YES
  • If just learning: NO

When to Upgrade to Distributed

Start with single spider. Upgrade when:

Sign 1: Takes Too Long

Single spider: 5 days
Need data: Daily
Problem: Can't keep up!

Sign 2: Missing Time Windows

Website updates: Every 6 hours
Scraping takes: 12 hours
Problem: Always behind!

Sign 3: Growing Scale

Started: 1,000 pages
Now: 100,000 pages
Single spider: Can't handle it

Sign 4: Business Need

Making money from data
Time = money
Faster scraping = more profit

Best Practices

1. Start Small

Test with 2-3 spiders first:

# Don't start with 100 spiders!
# Start with 2-3, see if it works

2. Monitor Everything

# Log everything
self.logger.info(f'Scraped {response.url}')

3. Use Realistic Delays

# Even distributed, be polite
DOWNLOAD_DELAY = 1
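
Scrapy's AutoThrottle extension can also adjust the delay for you based on how quickly the server responds; a hedged example:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average concurrent requests per remote site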

4. Test Locally First

# Test on your laptop before deploying
# Make sure spider works correctly

5. Plan Your URLs

# Know how many URLs you have
# Calculate: URLs ÷ spiders = time

Quick Decision Guide

Should I use distributed crawling?

How many URLs?
├─ Less than 10,000
│   └─ Use single spider
│
├─ 10,000 - 100,000
│   └─ Maybe distributed (if time-sensitive)
│
└─ More than 100,000
    └─ Definitely distributed

What option should I choose?

Budget?
├─ Low budget
│   └─ Scrapy-Redis (self-hosted)
│
├─ Medium budget
│   └─ Multiple small servers
│
└─ High budget
    └─ Managed solutions (Scrapy Cloud)

Summary

What is distributed crawling?
Running multiple spiders at the same time, sharing the same queue.

When to use it:

  • More than 100,000 URLs
  • Time-sensitive data
  • Commercial operations
  • Large scale scraping

When NOT to use it:

  • Less than 10,000 URLs
  • Learning/testing
  • No time pressure
  • Simple projects

Best option for beginners:
Scrapy-Redis with 2-3 spiders

Setup:

  1. Install Redis
  2. Install scrapy-redis
  3. Configure settings
  4. Add URLs to Redis
  5. Run multiple spiders

Key settings:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = 'localhost'

Remember:

  • Start simple (single spider)
  • Upgrade when needed
  • Test before scaling
  • Monitor everything
  • Be polite even when distributed

Distributed crawling is powerful but not always necessary. Start simple, scale when you need to!

Happy scraping! 🕷️
