My first spider took 6 hours to scrape 50,000 pages. I thought that was just how long it took.
Then I learned about optimization. Same spider, same website, now takes 30 minutes. That's 12x faster!
The difference? Understanding bottlenecks and fixing them. Let me show you how to make your spiders blazing fast.
The Big Picture: Where Time Is Spent
When Scrapy scrapes, time goes to:
1. Network (70-90%)
   - Downloading pages
   - Waiting for responses
   - DNS lookups
2. Parsing (5-15%)
   - Running selectors
   - Extracting data
   - Processing items
3. Processing (5-15%)
   - Running pipelines
   - Saving to database
   - Validating data
Key insight: Network is usually the bottleneck. Optimize that first!
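Before tuning anything, it helps to confirm what your crawl is actually doing. Here's a minimal sketch (spider name and URL are placeholders) that logs a few of Scrapy's built-in counters when the spider closes, so you know what you're optimizing:

```python
import scrapy


class StatsAwareSpider(scrapy.Spider):
    """Sketch: log a few standard Scrapy stats at the end of a crawl."""
    name = 'stats_aware'                      # placeholder name
    start_urls = ['https://example.com']      # placeholder URL

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        stats = self.crawler.stats.get_stats()
        self.logger.info('Requests sent:      %s', stats.get('downloader/request_count'))
        self.logger.info('Responses received: %s', stats.get('response_received_count'))
        self.logger.info('Bytes downloaded:   %s', stats.get('downloader/response_bytes'))
        self.logger.info('Items scraped:      %s', stats.get('item_scraped_count'))
```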
Optimization 1: Increase Concurrency
By default, Scrapy runs 16 concurrent requests. Increase it:
# settings.py
# From default
CONCURRENT_REQUESTS = 16
# To faster
CONCURRENT_REQUESTS = 32 # or 64, or even 128
Speed improvement: 2-4x
Also Increase Per-Domain Concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 16 # From 8
What the Docs Don't Tell You
More isn't always better:
- Your network might be the limit
- Target server might block you
- Your CPU might max out
Find your limit:
Start at 16, double it, test speed. Keep doubling until speed stops improving.
Test with:
time scrapy crawl myspider
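One way to run that test without waiting for full crawls, assuming your spider is called myspider: cap each run at a fixed page count and time it. `CLOSESPIDER_PAGECOUNT` and `-s` overrides are standard Scrapy features; the numbers are just examples.

```bash
# Rough benchmark: crawl the same number of pages at each concurrency level.
# CLOSESPIDER_PAGECOUNT stops the crawl early; -s overrides any setting per run.
for c in 16 32 64 128; do
  echo "CONCURRENT_REQUESTS=$c"
  time scrapy crawl myspider -s CONCURRENT_REQUESTS=$c -s CLOSESPIDER_PAGECOUNT=500
done
```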
Optimization 2: Reduce Download Timeout
Default timeout is 180 seconds. That's way too long!
# settings.py
# From default
DOWNLOAD_TIMEOUT = 180 # 3 minutes!
# To faster
DOWNLOAD_TIMEOUT = 30 # 30 seconds
If a page takes 30+ seconds to respond, it usually means one of these:
- The site is blocking you
- The server is overloaded
- The page is broken
Don't wait 3 minutes for it!
Speed improvement: Saves time on slow/dead pages
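If you want to know which URLs you gave up on, attach an errback and log timeouts, so "faster" doesn't quietly become "incomplete". A small sketch (spider name, URL, and method names are just examples; the exception classes are the ones Scrapy's own errback example uses):

```python
from twisted.internet.error import TCPTimedOutError, TimeoutError

import scrapy


class TimeoutAwareSpider(scrapy.Spider):
    name = 'timeout_aware'                     # placeholder name
    custom_settings = {'DOWNLOAD_TIMEOUT': 30}
    start_urls = ['https://example.com']       # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url}

    def on_error(self, failure):
        # Record the URLs that timed out so you can decide whether to revisit them
        if failure.check(TimeoutError, TCPTimedOutError):
            self.logger.warning('Timed out: %s', failure.request.url)
```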
Optimization 3: Disable Cookies (When Not Needed)
Cookie processing takes time. If you don't need cookies:
COOKIES_ENABLED = False
Speed improvement: 5-10%
Warning: Only disable if:
- You don't need session handling
- You don't need to stay logged in
- The site doesn't require cookies
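If only part of your crawl needs cookies, you can also keep them enabled globally and skip cookie handling per request with the documented `dont_merge_cookies` meta key. A minimal sketch (spider name and URLs are placeholders):

```python
import scrapy


class MixedCookiesSpider(scrapy.Spider):
    name = 'mixed_cookies'   # placeholder name

    def start_requests(self):
        # Pages that need the session keep normal cookie handling
        yield scrapy.Request('https://example.com/account', callback=self.parse)
        # Pages that don't need session state skip the cookie work
        yield scrapy.Request(
            'https://example.com/catalog',
            callback=self.parse,
            meta={'dont_merge_cookies': True},
        )

    def parse(self, response):
        yield {'url': response.url}
```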
Optimization 4: Disable Redirects (When Safe)
Following redirects takes extra requests:
REDIRECT_ENABLED = False
Speed improvement: 10-20% (if site uses many redirects)
Warning: Only disable if:
- You know the exact URLs
- No redirects are expected
- You're scraping an API
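As with cookies, you can keep redirects on globally and opt out per request with the `dont_redirect` meta key; the redirect middleware then passes the 3xx response through to your callback. A minimal sketch (spider name and URL are placeholders):

```python
import scrapy


class NoRedirectSpider(scrapy.Spider):
    name = 'no_redirect'   # placeholder name

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/known-final-url',   # placeholder URL
            callback=self.parse,
            # Don't follow redirects for this request; handle_httpstatus_list
            # keeps the 3xx response from being filtered before your callback
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        )

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}
```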
Optimization 5: Disable Retry Middleware (Advanced)
Retrying failed requests takes time:
RETRY_ENABLED = False
Speed improvement: 5-15% (if many failures)
Warning: Only disable if:
- You're okay with missing some pages
- You'll re-run the spider anyway
- Speed matters more than completeness
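A softer option than switching retries off: keep them, but make them cheaper by lowering the retry count and trimming which status codes get retried. A settings sketch (values are examples):

```python
# settings.py — keep retries, just make them cheaper
RETRY_ENABLED = True
RETRY_TIMES = 1                           # default is 2 extra attempts per request
RETRY_HTTP_CODES = [500, 502, 503, 504]   # trimmed from the longer default list
```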
Optimization 6: Use DNS Cache
DNS lookups are slow. Cache them:
DNSCACHE_ENABLED = True # Already default, but verify
Also reduce the DNS timeout so dead hosts fail fast:
DNS_TIMEOUT = 10 # From 60
Speed improvement: 5-10%
Optimization 7: Optimize Your Selectors
Slow selectors slow down everything.
Use CSS Over XPath (Usually)
# Verbose XPath
response.xpath('//div[@class="product"]/span[@class="name"]/text()').get()
# Equivalent CSS (shorter and easier to get right)
response.css('div.product span.name::text').get()
Under the hood, Scrapy compiles CSS selectors to XPath, so the raw speed difference between the two is small. The real payoff of CSS is readability and fewer selector mistakes; the next two techniques matter more for speed.
Cache Selector Results
# Slow (the same selector query runs twice on every page)
def parse(self, response):
    self.logger.info('Found %d products', len(response.css('.product')))
    for product in response.css('.product'):  # evaluated again
        name = product.css('.name::text').get()
        price = product.css('.price::text').get()
        description = product.css('.description::text').get()

# Fast (run the query once, reuse the result)
def parse(self, response):
    products = response.css('.product')  # cache this
    self.logger.info('Found %d products', len(products))
    for product in products:
        name = product.css('.name::text').get()
        price = product.css('.price::text').get()
        description = product.css('.description::text').get()
Use More Specific Selectors
# Slow (searches entire page)
response.css('span::text').getall()
# Fast (narrows search)
response.css('.product-list span.price::text').getall()
Optimization 8: Minimize Pipeline Work
Heavy pipeline processing slows everything down.
Bad Pipeline
import requests


class SlowPipeline:
    def process_item(self, item, spider):
        # Slow: API call for each item
        enriched_data = requests.get(
            f'https://api.example.com/enrich?q={item["name"]}')
        item['enriched'] = enriched_data.json()

        # Slow: database call for each item
        self.cursor.execute('INSERT INTO items VALUES (...)')
        self.conn.commit()  # Commit each item!
        return item
Fast Pipeline
import sqlite3  # SQLite used as an example; swap in your own database driver


class FastPipeline:
    def __init__(self):
        self.items_buffer = []
        self.buffer_size = 100

    def open_spider(self, spider):
        # Open one connection for the whole crawl
        self.conn = sqlite3.connect('items.db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Buffer items
        self.items_buffer.append(item)
        # Batch insert when buffer is full
        if len(self.items_buffer) >= self.buffer_size:
            self.flush_buffer()
        return item

    def flush_buffer(self):
        # Batch insert (much faster!)
        values = [(item['name'], item['price']) for item in self.items_buffer]
        self.cursor.executemany('INSERT INTO items VALUES (?, ?)', values)
        self.conn.commit()
        self.items_buffer = []

    def close_spider(self, spider):
        # Insert remaining items, then close the connection
        self.flush_buffer()
        self.conn.close()
Speed improvement: 5-50x for database operations!
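Remember to register the pipeline in settings.py; the module path below is a placeholder for your own project layout:

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.FastPipeline': 300,  # placeholder module path
}
```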
Optimization 9: Use Async Pipelines
For I/O heavy pipelines (API calls, database), use async:
import aiohttp

# Scrapy can run coroutine process_item() methods, but using aiohttp inside them
# typically requires the asyncio reactor in settings.py:
# TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'


class AsyncPipeline:
    async def process_item(self, item, spider):
        # One session per item keeps the example simple; reusing a single
        # session across the crawl is cheaper in practice.
        async with aiohttp.ClientSession() as session:
            async with session.get(
                f'https://api.example.com/data?id={item["id"]}'
            ) as response:
                item['extra'] = await response.json()
        return item
Speed improvement: 2-10x for I/O operations
Optimization 10: Scrape APIs Instead of HTML
If the site has an API, use it!
import json

# Slow: scraping HTML
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get(),
        }

# Fast: scraping the API's JSON directly
def parse(self, response):
    data = json.loads(response.text)  # or response.json() on Scrapy 2.2+
    for product in data['products']:
        yield {
            'name': product['name'],
            'price': product['price'],
        }
Speed improvement: 10-100x
APIs are:
- Faster to download (smaller)
- Faster to parse (no HTML)
- More reliable
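In practice you find the API by watching the network tab in your browser's dev tools, then page through it directly. A minimal sketch, assuming a hypothetical `/api/products?page=N` endpoint that returns `products` and `has_next` fields:

```python
import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_products'   # placeholder name

    def start_requests(self):
        # Hypothetical endpoint and parameters; adapt to the real API
        yield scrapy.Request('https://example.com/api/products?page=1',
                             cb_kwargs={'page': 1})

    def parse(self, response, page):
        data = response.json()  # Scrapy 2.2+; otherwise json.loads(response.text)
        for product in data['products']:
            yield {'name': product['name'], 'price': product['price']}
        if data.get('has_next'):
            next_page = page + 1
            yield scrapy.Request(
                f'https://example.com/api/products?page={next_page}',
                cb_kwargs={'page': next_page},
            )
```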
Optimization 11: Use HTTP/2
HTTP/2 is faster than HTTP/1.1:
# Install the HTTP/2 dependencies (Scrapy uses Twisted's HTTP/2 support)
pip install 'Twisted[http2]'
# Enable in settings.py
DOWNLOAD_HANDLERS = {
    'https': 'scrapy.core.downloader.handlers.http2.H2DownloadHandler',
}
Speed improvement: 10-30% (especially with high latency)
Note: Scrapy's HTTP/2 handler has documented limitations (for example, it doesn't work with proxies), so check the Scrapy docs before enabling it.
Optimization 12: Disable Logging in Production
Logging to console is slow:
# Development
LOG_LEVEL = 'DEBUG'
# Production
LOG_LEVEL = 'WARNING' # or ERROR
LOG_FILE = 'spider.log' # Log to file, not console
Speed improvement: 5-10%
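You can also leave DEBUG in your project settings for development and quiet individual runs from the command line, since `-s` overrides any setting for that run:

```bash
scrapy crawl myspider -s LOG_LEVEL=WARNING -s LOG_FILE=spider.log
```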
Optimization 13: Use the Memory Queue
Scrapy keeps its request queue in memory by default. It only switches to disk queues when you enable persistent (pause/resume) crawls with JOBDIR:
# Make sure this is NOT set unless you need pause/resume support
# JOBDIR = 'crawls/myjob'  # This switches the scheduler to disk queues
Disk queues make crawls resumable after a crash or shutdown, but they add serialization and disk I/O on every request.
Speed improvement: 10-20% (only if you were using JOBDIR)
Optimization 14: Reduce Item Overhead
Items have overhead. For simple scraping, use dicts:
# Slower (Item objects have overhead)
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    yield item

# Faster (plain dicts)
def parse(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get(),
    }
Speed improvement: 5-10%
Trade-off: you lose the defined fields and structure that Items give you.
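If you want some structure without the Item boilerplate, Scrapy 2.2+ also accepts plain dataclass objects as items, which is a reasonable middle ground. A minimal sketch (class and field names are just examples):

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: str


def parse(self, response):
    yield Product(
        name=response.css('h1::text').get(),
        price=response.css('.price::text').get(),
    )
```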
Optimization 15: Profile Your Spider
Find actual bottlenecks:
# Profile the spider with cProfile (built into Python; nothing to install)
python -m cProfile -o profile.stats $(which scrapy) crawl myspider
# Analyze the results interactively
python -m pstats profile.stats
# then, at the pstats prompt:
#   sort cumulative
#   stats 20
(For coroutine-heavy spiders, yappi can give a clearer picture than cProfile: pip install yappi.)
Shows which functions take the most time.
Real-World Optimization Example
Let's optimize a slow spider:
Before (Slow)
class SlowSpider(scrapy.Spider):
    name = 'slow'
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,  # Default
        'DOWNLOAD_TIMEOUT': 180,
        'COOKIES_ENABLED': True,
        'RETRY_ENABLED': True,
        'LOG_LEVEL': 'DEBUG',
    }

    def parse(self, response):
        # Inefficient selectors
        for product in response.xpath('//div[@class="product"]'):
            item = ProductItem()
            item['name'] = product.xpath('.//h2/text()').get()
            item['price'] = product.xpath('.//span[@class="price"]/text()').get()
            yield item

# Slow pipeline
class SlowPipeline:
    def process_item(self, item, spider):
        # Single insert and commit per item (slow!)
        self.cursor.execute('INSERT INTO products VALUES (?, ?)',
                            (item['name'], item['price']))
        self.conn.commit()
        return item
Speed: 50,000 pages in 6 hours
After (Fast)
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'CONCURRENT_REQUESTS': 64,             # Increased
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'DOWNLOAD_TIMEOUT': 30,                # Reduced
        'COOKIES_ENABLED': False,              # Disabled (not needed)
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 1,                      # Reduced (default is 2)
        'LOG_LEVEL': 'INFO',                   # Less verbose
        'LOG_FILE': 'spider.log',              # File instead of console
    }

    def parse(self, response):
        # Efficient CSS selectors
        for product in response.css('.product'):
            yield {  # Dict instead of Item
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

# Fast pipeline with batching
class FastPipeline:
    def __init__(self):
        self.items = []
        self.batch_size = 100

    # self.conn / self.cursor are created in open_spider, as in Optimization 8

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        # Batch insert (much faster!)
        values = [(item['name'], item['price']) for item in self.items]
        self.cursor.executemany('INSERT INTO products VALUES (?, ?)', values)
        self.conn.commit()
        self.items = []

    def close_spider(self, spider):
        self.flush()
Speed: 50,000 pages in 30 minutes
Result: 12x faster!
Measuring Performance
Always measure before and after:
from datetime import datetime

import scrapy


class MeasuredSpider(scrapy.Spider):
    name = 'measured'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.page_count = 0

    def parse(self, response):
        self.page_count += 1

        # Log speed every 1000 pages
        if self.page_count % 1000 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            speed = self.page_count / elapsed
            self.logger.info(
                f'Scraped {self.page_count} pages in {elapsed:.1f}s '
                f'({speed:.1f} pages/sec)'
            )

        yield {'url': response.url}
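Scrapy's built-in LogStats extension already logs throughput for you, with lines like "Crawled N pages (at N pages/min)" at a fixed interval, so for a quick read you can simply tune that interval instead:

```python
# settings.py
LOGSTATS_INTERVAL = 30.0   # log crawl/item rates every 30 seconds (default: 60)
```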
When NOT to Optimize
Don't over-optimize:
Skip optimization if:
- Spider runs once
- Total time < 5 minutes
- You're still developing
- Site is very slow (bottleneck is server, not you)
Optimize when:
- Spider runs regularly
- Total time > 30 minutes
- Scraping large sites (100k+ pages)
- Time is critical
Quick Wins Checklist
Apply these for immediate speed boost:
- [ ] Increase CONCURRENT_REQUESTS to 32-64
- [ ] Reduce DOWNLOAD_TIMEOUT to 30
- [ ] Disable COOKIES_ENABLED if not needed
- [ ] Use CSS selectors instead of XPath
- [ ] Batch database operations
- [ ] Set LOG_LEVEL to INFO or WARNING
- [ ] Look for APIs instead of scraping HTML
These 7 changes can give you 2-10x speedup!
Summary
Network optimization (biggest impact):
- Increase concurrency
- Reduce timeouts
- Disable unnecessary features (cookies, redirects)
- Use HTTP/2
Parsing optimization:
- Use CSS over XPath
- Cache selector results
- Use more specific selectors
- Use dicts instead of Items
Pipeline optimization:
- Batch database operations
- Use async for I/O
- Minimize per-item processing
General tips:
- Profile to find real bottlenecks
- Measure before and after
- Start with quick wins
- APIs are usually far faster than scraping HTML
Remember:
- Network is usually the bottleneck
- Optimize network first
- Batch database operations
- More concurrency = faster (up to a point)
Start with the quick wins checklist. That alone can often give you a 2-10x speedup in a few minutes!
Happy scraping! 🕷️