Muhammad Ikramullah Khan

Scrapy Performance Optimization: Make Your Spider 10x Faster

My first spider took 6 hours to scrape 50,000 pages. I thought that was just how long it took.

Then I learned about optimization. Same spider, same website, now takes 30 minutes. That's 12x faster!

The difference? Understanding bottlenecks and fixing them. Let me show you how to make your spiders blazing fast.


The Big Picture: Where Time Is Spent

When Scrapy scrapes, time goes to:

1. Network (70-90%)

  • Downloading pages
  • Waiting for responses
  • DNS lookups

2. Parsing (5-15%)

  • Running selectors
  • Extracting data
  • Processing items

3. Processing (5-15%)

  • Running pipelines
  • Saving to database
  • Validating data

Key insight: Network is usually the bottleneck. Optimize that first!


Optimization 1: Increase Concurrency

By default, Scrapy runs 16 concurrent requests. Increase it:

# settings.py

# From default
CONCURRENT_REQUESTS = 16

# To faster
CONCURRENT_REQUESTS = 32  # or 64, or even 128

Speed improvement: 2-4x

Also Increase Per-Domain Concurrency

CONCURRENT_REQUESTS_PER_DOMAIN = 16  # From 8

What the Docs Don't Tell You

More isn't always better:

  • Your network might be the limit
  • Target server might block you
  • Your CPU might max out

Find your limit:

Start at 16, double it, test speed. Keep doubling until speed stops improving.

Test with:

time scrapy crawl myspider
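
To sweep values without editing settings.py each time, you can also override the setting straight from the command line with Scrapy's -s flag (myspider is a placeholder spider name):

time scrapy crawl myspider -s CONCURRENT_REQUESTS=32
time scrapy crawl myspider -s CONCURRENT_REQUESTS=64
time scrapy crawl myspider -s CONCURRENT_REQUESTS=128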

Optimization 2: Reduce Download Timeout

Default timeout is 180 seconds. That's way too long!

# settings.py

# From default
DOWNLOAD_TIMEOUT = 180  # 3 minutes!

# To faster
DOWNLOAD_TIMEOUT = 30  # 30 seconds

If a page takes 30+ seconds to respond, one of these is usually true:

  • The site is blocking you
  • The server is overloaded
  • The page is broken

Don't wait 3 minutes for it!

Speed improvement: Saves time on slow/dead pages
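
If you also want to see which URLs hit the timeout instead of losing them silently, you can attach an errback to your requests. A minimal sketch (the spider skeleton and URL are placeholders):

from twisted.internet.error import TimeoutError, TCPTimedOutError
import scrapy

class TimeoutAwareSpider(scrapy.Spider):
    name = 'timeout_aware'
    start_urls = ['https://example.com/']
    custom_settings = {'DOWNLOAD_TIMEOUT': 30}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url}

    def on_error(self, failure):
        # Record requests that timed out so they can be inspected or re-queued later
        if failure.check(TimeoutError, TCPTimedOutError):
            self.logger.warning(f'Timed out: {failure.request.url}')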


Optimization 3: Disable Cookies (When Not Needed)

Cookie processing takes time. If you don't need cookies:

COOKIES_ENABLED = False

Speed improvement: 5-10%

Warning: Only disable if:

  • You don't need session handling
  • You don't need to stay logged in
  • The site doesn't require cookies
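
If only a handful of requests need cookie-free handling, you can also leave COOKIES_ENABLED on and opt out per request with the dont_merge_cookies meta key:

# Skip cookie handling for this request only
yield scrapy.Request(url, callback=self.parse, meta={'dont_merge_cookies': True})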

Optimization 4: Disable Redirects (When Safe)

Following redirects takes extra requests:

REDIRECT_ENABLED = False

Speed improvement: 10-20% (if site uses many redirects)

Warning: Only disable if:

  • You know the exact URLs
  • No redirects are expected
  • You're scraping an API
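
A safer middle ground is to keep redirects enabled globally and opt out per request with the dont_redirect meta key, handling the 3xx response yourself:

# Don't follow redirects for this request; receive the 301/302 response directly
yield scrapy.Request(
    url,
    callback=self.parse,
    meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
)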

Optimization 5: Disable Retry Middleware (Advanced)

Retrying failed requests takes time:

RETRY_ENABLED = False

Speed improvement: 5-15% (if many failures)

Warning: Only disable if:

  • You're okay with missing some pages
  • You'll re-run the spider anyway
  • Speed matters more than completeness
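
A gentler option is to keep retries on but make them cheaper: fewer attempts, and only for the status codes you care about. A sketch for settings.py:

# settings.py
RETRY_TIMES = 1  # Default is 2; retry each failed request at most once
RETRY_HTTP_CODES = [500, 502, 503, 504]  # Trim the default list if speed matters more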

Optimization 6: Use DNS Cache

DNS lookups are slow. Cache them:

DNSCACHE_ENABLED = True  # Already default, but verify

Also reduce the DNS timeout:

DNS_TIMEOUT = 10  # From 60

Speed improvement: 5-10%


Optimization 7: Optimize Your Selectors

Slow selectors slow down everything.

Use CSS Over XPath (Usually)

# XPath
response.xpath('//div[@class="product"]/span[@class="name"]/text()').get()

# CSS equivalent
response.css('div.product span.name::text').get()

Scrapy compiles CSS selectors to XPath under the hood, so the raw speed difference between the two is usually small; CSS mostly wins on readability. The bigger parsing gains come from caching results and narrowing the search, covered next.

Cache Selector Results

# Slow (each selector searches the whole page again)
def parse(self, response):
    names = response.css('.product .name::text').getall()
    prices = response.css('.product .price::text').getall()
    descriptions = response.css('.product .description::text').getall()

# Fast (select each product once, then run cheap sub-queries inside it)
def parse(self, response):
    products = response.css('.product')  # Runs over the page once
    for product in products:
        name = product.css('.name::text').get()
        price = product.css('.price::text').get()
        description = product.css('.description::text').get()

Use More Specific Selectors

# Slow (searches entire page)
response.css('span::text').getall()

# Fast (narrows search)
response.css('.product-list span.price::text').getall()

Optimization 8: Minimize Pipeline Work

Heavy pipeline processing slows everything down.

Bad Pipeline

import requests  # Blocking HTTP client: every call here stalls the whole pipeline

class SlowPipeline:
    def process_item(self, item, spider):
        # Slow: a blocking API call for every single item
        enriched_data = requests.get(f'https://api.example.com/enrich?q={item["name"]}')
        item['enriched'] = enriched_data.json()

        # Slow: one INSERT and one commit per item
        # (assumes self.cursor / self.conn were opened in open_spider)
        self.cursor.execute('INSERT INTO items VALUES (...)')
        self.conn.commit()

        return item

Fast Pipeline

import sqlite3

class FastPipeline:
    def __init__(self):
        self.items_buffer = []
        self.buffer_size = 100

    def open_spider(self, spider):
        # Example storage: a local SQLite file; replace with your own database
        self.conn = sqlite3.connect('items.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT)')

    def process_item(self, item, spider):
        # Buffer items instead of writing them one by one
        self.items_buffer.append(item)

        # Batch insert when buffer is full
        if len(self.items_buffer) >= self.buffer_size:
            self.flush_buffer()

        return item

    def flush_buffer(self):
        # One executemany + one commit per batch (much faster than per-item commits)
        values = [(item['name'], item['price']) for item in self.items_buffer]
        self.cursor.executemany('INSERT INTO items VALUES (?, ?)', values)
        self.conn.commit()
        self.items_buffer = []

    def close_spider(self, spider):
        # Insert whatever is left in the buffer, then close the connection
        self.flush_buffer()
        self.conn.close()

Speed improvement: 5-50x for database operations!
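
Remember that a pipeline only runs if it's registered in settings.py; the module path below is a placeholder for wherever FastPipeline lives in your project:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.FastPipeline': 300,
}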


Optimization 9: Use Async Pipelines

For I/O heavy pipelines (API calls, database), use async:

import aiohttp

# Note: awaiting asyncio code (like aiohttp) from Scrapy requires the asyncio reactor:
# TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
class AsyncPipeline:
    async def process_item(self, item, spider):
        # Hypothetical enrichment API; reusing one session across items would be faster still
        async with aiohttp.ClientSession() as session:
            async with session.get(f'https://api.example.com/data?id={item["id"]}') as response:
                item['extra'] = await response.json()

        return item

Speed improvement: 2-10x for I/O operations


Optimization 10: Scrape APIs Instead of HTML

If the site has an API, use it!

# Slow: Scraping HTML
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }

# Fast: consuming a JSON API
import json

def parse(self, response):
    data = json.loads(response.text)  # newer Scrapy versions also offer response.json()
    for product in data['products']:
        yield {
            'name': product['name'],
            'price': product['price']
        }

Speed improvement: 10-100x

APIs are:

  • Faster to download (smaller)
  • Faster to parse (no HTML)
  • More reliable
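
In practice this usually means pointing start_requests at the JSON endpoint the site's own frontend calls. The endpoint below is hypothetical; find the real one in your browser's network tab:

import json
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api'

    def start_requests(self):
        # Hypothetical paginated JSON endpoint
        for page in range(1, 11):
            url = f'https://example.com/api/products?page={page}'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        for product in data['products']:
            yield {'name': product['name'], 'price': product['price']}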

Optimization 11: Use HTTP/2

HTTP/2 is faster than HTTP/1.1:

# Install HTTP/2 support for Twisted (Scrapy's network layer)
pip install 'Twisted[http2]'

# Enable in settings.py
DOWNLOAD_HANDLERS = {
    'https': 'scrapy.core.downloader.handlers.http2.H2DownloadHandler',
}

Speed improvement: 10-30% (especially with high latency)


Optimization 12: Disable Logging in Production

Logging to console is slow:

# Development
LOG_LEVEL = 'DEBUG'

# Production
LOG_LEVEL = 'WARNING'  # or ERROR
LOG_FILE = 'spider.log'  # Log to file, not console

Speed improvement: 5-10%


Optimization 13: Use Memory Queue

Scrapy already keeps its request queue in memory by default; disk queues only kick in when JOBDIR is set (to support pausing and resuming a crawl). For reference, these are the stock scheduler settings:

SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

So the optimization is simply to avoid forcing disk queues when you don't need them:

# Make sure this is NOT set unless you need pause/resume support
# JOBDIR = 'crawls/myjob'  # Setting it switches the scheduler to disk queues

Speed improvement: 10-20%


Optimization 14: Reduce Item Overhead

Items have overhead. For simple scraping, use dicts:

# Slower (Item objects have overhead)
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    yield item

# Faster (plain dicts)
def parse(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get()
    }

Speed improvement: 5-10%

Trade-off: Lose Item validation and field definitions.


Optimization 15: Profile Your Spider

Find actual bottlenecks:

# cProfile ships with Python (yappi is an optional alternative: pip install yappi)
# Profile the spider
python -m cProfile -o profile.stats -m scrapy crawl myspider

# Analyze in the interactive pstats browser
python -m pstats profile.stats
sort cumulative
stats 20

Shows which functions take the most time.
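
Scrapy also ships with a built-in benchmark command that crawls a local dummy site as fast as your machine allows, which is handy for separating hardware limits from target-site limits:

scrapy bench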


Real-World Optimization Example

Let's optimize a slow spider:

Before (Slow)

class SlowSpider(scrapy.Spider):
    name = 'slow'

    custom_settings = {
        'CONCURRENT_REQUESTS': 16,  # Default
        'DOWNLOAD_TIMEOUT': 180,
        'COOKIES_ENABLED': True,
        'RETRY_ENABLED': True,
        'LOG_LEVEL': 'DEBUG'
    }

    def parse(self, response):
        # Inefficient selectors
        for product in response.xpath('//div[@class="product"]'):
            item = ProductItem()
            item['name'] = product.xpath('.//h2/text()').get()
            item['price'] = product.xpath('.//span[@class="price"]/text()').get()
            yield item

# Slow pipeline
class SlowPipeline:
    def process_item(self, item, spider):
        # Single insert (slow!)
        self.cursor.execute('INSERT INTO products VALUES (?, ?)', 
                          (item['name'], item['price']))
        self.conn.commit()
        return item

Speed: 50,000 pages in 6 hours

After (Fast)

class FastSpider(scrapy.Spider):
    name = 'fast'

    custom_settings = {
        'CONCURRENT_REQUESTS': 64,  # Increased
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'DOWNLOAD_TIMEOUT': 30,  # Reduced
        'COOKIES_ENABLED': False,  # Disabled (not needed)
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 1,  # Reduced from the default of 2
        'LOG_LEVEL': 'INFO',  # Less verbose
        'LOG_FILE': 'spider.log'  # File instead of console
    }

    def parse(self, response):
        # Efficient CSS selectors
        for product in response.css('.product'):
            yield {  # Dict instead of Item
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

# Fast pipeline with batching (DB connection/cursor opened in open_spider,
# as in the FastPipeline example earlier)
class FastPipeline:
    def __init__(self):
        self.items = []
        self.batch_size = 100

    def process_item(self, item, spider):
        self.items.append(item)

        if len(self.items) >= self.batch_size:
            self.flush()

        return item

    def flush(self):
        # Batch insert (much faster!)
        values = [(item['name'], item['price']) for item in self.items]
        self.cursor.executemany('INSERT INTO products VALUES (?, ?)', values)
        self.conn.commit()
        self.items = []

    def close_spider(self, spider):
        self.flush()

Speed: 50,000 pages in 30 minutes

Result: 12x faster!


Measuring Performance

Always measure before and after:

from datetime import datetime

class MeasuredSpider(scrapy.Spider):
    name = 'measured'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.page_count = 0

    def parse(self, response):
        self.page_count += 1

        # Log speed every 1000 pages
        if self.page_count % 1000 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            speed = self.page_count / elapsed

            self.logger.info(
                f'Scraped {self.page_count} pages in {elapsed:.1f}s '
                f'({speed:.1f} pages/sec)'
            )

        yield {'url': response.url}
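
If you'd rather not write this yourself, Scrapy's built-in LogStats extension already logs crawled pages/min and scraped items/min at a fixed interval, which you can tune:

# settings.py
LOGSTATS_INTERVAL = 60.0  # Seconds between stats log lines (default is 60)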

When NOT to Optimize

Don't over-optimize:

Skip optimization if:

  • Spider runs once
  • Total time < 5 minutes
  • You're still developing
  • Site is very slow (bottleneck is server, not you)

Optimize when:

  • Spider runs regularly
  • Total time > 30 minutes
  • Scraping large sites (100k+ pages)
  • Time is critical

Quick Wins Checklist

Apply these for immediate speed boost:

  • [ ] Increase CONCURRENT_REQUESTS to 32-64
  • [ ] Reduce DOWNLOAD_TIMEOUT to 30
  • [ ] Disable COOKIES_ENABLED if not needed
  • [ ] Use CSS selectors instead of XPath
  • [ ] Batch database operations
  • [ ] Set LOG_LEVEL to INFO or WARNING
  • [ ] Look for APIs instead of scraping HTML

These 7 changes can give you 2-10x speedup!


Summary

Network optimization (biggest impact):

  • Increase concurrency
  • Reduce timeouts
  • Disable unnecessary features (cookies, redirects)
  • Use HTTP/2

Parsing optimization:

  • Use CSS over XPath
  • Cache selector results
  • Use more specific selectors
  • Use dicts instead of Items

Pipeline optimization:

  • Batch database operations
  • Use async for I/O
  • Minimize per-item processing

General tips:

  • Profile to find real bottlenecks
  • Measure before and after
  • Start with quick wins
  • APIs are almost always faster than HTML

Remember:

  • Network is usually the bottleneck
  • Optimize network first
  • Batch database operations
  • More concurrency = faster (up to a point)

Start with the quick wins checklist. That alone can give you 5-10x speedup in 5 minutes!

Happy scraping! 🕷️
