My first spider took 6 hours to scrape 50,000 pages. I thought that was just how long it took.
Then I learned about optimization. Same spider, same website, now takes 30 minutes. That's 12x faster!
The difference? Understanding bottlenecks and fixing them. Let me show you how to make your spiders blazing fast.
The Big Picture: Where Time Is Spent
When Scrapy scrapes, time goes to:
1. Network (70-90%)
   - Downloading pages
   - Waiting for responses
   - DNS lookups
2. Parsing (5-15%)
   - Running selectors
   - Extracting data
   - Processing items
3. Processing (5-15%)
   - Running pipelines
   - Saving to database
   - Validating data
Key insight: Network is usually the bottleneck. Optimize that first!
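Before tuning anything, it helps to confirm what your crawl is actually doing. Here's a minimal sketch (spider name and URL are placeholders) that logs a few of Scrapy's built-in counters when the spider closes, so you know what you're optimizing:

```python
import scrapy


class StatsAwareSpider(scrapy.Spider):
    """Sketch: log a few standard Scrapy stats at the end of a crawl."""
    name = 'stats_aware'                      # placeholder name
    start_urls = ['https://example.com']      # placeholder URL

    def parse(self, response):
        yield {'url': response.url}

    def closed(self, reason):
        stats = self.crawler.stats.get_stats()
        self.logger.info('Requests sent:      %s', stats.get('downloader/request_count'))
        self.logger.info('Responses received: %s', stats.get('response_received_count'))
        self.logger.info('Bytes downloaded:   %s', stats.get('downloader/response_bytes'))
        self.logger.info('Items scraped:      %s', stats.get('item_scraped_count'))
```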
Optimization 1: Increase Concurrency
By default, Scrapy runs 16 concurrent requests. Increase it:
# settings.py
# From default
CONCURRENT_REQUESTS = 16
# To faster
CONCURRENT_REQUESTS = 32 # or 64, or even 128
Speed improvement: 2-4x
Also Increase Per-Domain Concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 16 # From 8
What the Docs Don't Tell You
More isn't always better:
- Your network might be the limit
- Target server might block you
- Your CPU might max out
Find your limit:
Start at 16, double it, test speed. Keep doubling until speed stops improving.
Test with:
time scrapy crawl myspider
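One way to run that test without waiting for full crawls, assuming your spider is called myspider: cap each run at a fixed page count and time it. `CLOSESPIDER_PAGECOUNT` and `-s` overrides are standard Scrapy features; the numbers are just examples.

```bash
# Rough benchmark: crawl the same number of pages at each concurrency level.
# CLOSESPIDER_PAGECOUNT stops the crawl early; -s overrides any setting per run.
for c in 16 32 64 128; do
  echo "CONCURRENT_REQUESTS=$c"
  time scrapy crawl myspider -s CONCURRENT_REQUESTS=$c -s CLOSESPIDER_PAGECOUNT=500
done
```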
Optimization 2: Reduce Download Timeout
Default timeout is 180 seconds. That's way too long!
# settings.py
# From default
DOWNLOAD_TIMEOUT = 180 # 3 minutes!
# To faster
DOWNLOAD_TIMEOUT = 30 # 30 seconds
If a page takes 30+ seconds to respond, it usually means one of these:
- The site is blocking you
- The server is overloaded
- The page is broken
Don't wait 3 minutes for it!
Speed improvement: Saves time on slow/dead pages
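If you want to know which URLs you gave up on, attach an errback and log timeouts, so "faster" doesn't quietly become "incomplete". A small sketch (spider name, URL, and method names are just examples; the exception classes are the ones Scrapy's own errback example uses):

```python
from twisted.internet.error import TCPTimedOutError, TimeoutError

import scrapy


class TimeoutAwareSpider(scrapy.Spider):
    name = 'timeout_aware'                     # placeholder name
    custom_settings = {'DOWNLOAD_TIMEOUT': 30}
    start_urls = ['https://example.com']       # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {'url': response.url}

    def on_error(self, failure):
        # Record the URLs that timed out so you can decide whether to revisit them
        if failure.check(TimeoutError, TCPTimedOutError):
            self.logger.warning('Timed out: %s', failure.request.url)
```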
Optimization 3: Disable Cookies (When Not Needed)
Cookie processing takes time. If you don't need cookies:
COOKIES_ENABLED = False
Speed improvement: 5-10%
Warning: Only disable if:
- You don't need session handling
- You don't need to stay logged in
- The site doesn't require cookies
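If only part of your crawl needs cookies, you can also keep them enabled globally and skip cookie handling per request with the documented `dont_merge_cookies` meta key. A minimal sketch (spider name and URLs are placeholders):

```python
import scrapy


class MixedCookiesSpider(scrapy.Spider):
    name = 'mixed_cookies'   # placeholder name

    def start_requests(self):
        # Pages that need the session keep normal cookie handling
        yield scrapy.Request('https://example.com/account', callback=self.parse)
        # Pages that don't need session state skip the cookie work
        yield scrapy.Request(
            'https://example.com/catalog',
            callback=self.parse,
            meta={'dont_merge_cookies': True},
        )

    def parse(self, response):
        yield {'url': response.url}
```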
Optimization 4: Disable Redirects (When Safe)
Following redirects takes extra requests:
REDIRECT_ENABLED = False
Speed improvement: 10-20% (if site uses many redirects)
Warning: Only disable if:
- You know the exact URLs
- No redirects are expected
- You're scraping an API
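As with cookies, you can keep redirects on globally and opt out per request with the `dont_redirect` meta key; the redirect middleware then passes the 3xx response through to your callback. A minimal sketch (spider name and URL are placeholders):

```python
import scrapy


class NoRedirectSpider(scrapy.Spider):
    name = 'no_redirect'   # placeholder name

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/known-final-url',   # placeholder URL
            callback=self.parse,
            # Don't follow redirects for this request; handle_httpstatus_list
            # keeps the 3xx response from being filtered before your callback
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        )

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}
```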
Optimization 5: Disable Retry Middleware (Advanced)
Retrying failed requests takes time:
RETRY_ENABLED = False
Speed improvement: 5-15% (if many failures)
Warning: Only disable if:
- You're okay with missing some pages
- You'll re-run the spider anyway
- Speed matters more than completeness
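A softer option than switching retries off: keep them, but make them cheaper by lowering the retry count and trimming which status codes get retried. A settings sketch (values are examples):

```python
# settings.py — keep retries, just make them cheaper
RETRY_ENABLED = True
RETRY_TIMES = 1                           # default is 2 extra attempts per request
RETRY_HTTP_CODES = [500, 502, 503, 504]   # trimmed from the longer default list
```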
Optimization 6: Use DNS Cache
DNS lookups are slow. Cache them:
DNSCACHE_ENABLED = True # Already default, but verify
Also reduce the DNS timeout so dead hosts fail fast:
DNS_TIMEOUT = 10 # From 60
Speed improvement: 5-10%
Optimization 7: Optimize Your Selectors
Slow selectors slow down everything.
Use CSS Over XPath (Usually)
# Verbose XPath
response.xpath('//div[@class="product"]/span[@class="name"]/text()').get()
# Equivalent CSS (shorter and easier to get right)
response.css('div.product span.name::text').get()
Under the hood, Scrapy compiles CSS selectors to XPath, so the raw speed difference between the two is small. The real payoff of CSS is readability and fewer selector mistakes; the next two techniques matter more for speed.
Cache Selector Results
# Slow (the same selector query runs twice on every page)
def parse(self, response):
    self.logger.info('Found %d products', len(response.css('.product')))
    for product in response.css('.product'):  # evaluated again
        name = product.css('.name::text').get()
        price = product.css('.price::text').get()
        description = product.css('.description::text').get()

# Fast (run the query once, reuse the result)
def parse(self, response):
    products = response.css('.product')  # cache this
    self.logger.info('Found %d products', len(products))
    for product in products:
        name = product.css('.name::text').get()
        price = product.css('.price::text').get()
        description = product.css('.description::text').get()
Use More Specific Selectors
# Slow (searches entire page)
response.css('span::text').getall()
# Fast (narrows search)
response.css('.product-list span.price::text').getall()
Optimization 8: Minimize Pipeline Work
Heavy pipeline processing slows everything down.
Bad Pipeline
import requests


class SlowPipeline:
    def process_item(self, item, spider):
        # Slow: API call for each item
        enriched_data = requests.get(
            f'https://api.example.com/enrich?q={item["name"]}')
        item['enriched'] = enriched_data.json()

        # Slow: database call for each item
        self.cursor.execute('INSERT INTO items VALUES (...)')
        self.conn.commit()  # Commit each item!
        return item
Fast Pipeline
import sqlite3  # SQLite used as an example; swap in your own database driver


class FastPipeline:
    def __init__(self):
        self.items_buffer = []
        self.buffer_size = 100

    def open_spider(self, spider):
        # Open one connection for the whole crawl
        self.conn = sqlite3.connect('items.db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Buffer items
        self.items_buffer.append(item)
        # Batch insert when buffer is full
        if len(self.items_buffer) >= self.buffer_size:
            self.flush_buffer()
        return item

    def flush_buffer(self):
        # Batch insert (much faster!)
        values = [(item['name'], item['price']) for item in self.items_buffer]
        self.cursor.executemany('INSERT INTO items VALUES (?, ?)', values)
        self.conn.commit()
        self.items_buffer = []

    def close_spider(self, spider):
        # Insert remaining items, then close the connection
        self.flush_buffer()
        self.conn.close()
Speed improvement: 5-50x for database operations!
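Remember to register the pipeline in settings.py; the module path below is a placeholder for your own project layout:

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.FastPipeline': 300,  # placeholder module path
}
```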
Optimization 9: Use Async Pipelines
For I/O heavy pipelines (API calls, database), use async:
import aiohttp

# Scrapy can run coroutine process_item() methods, but using aiohttp inside them
# typically requires the asyncio reactor in settings.py:
# TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'


class AsyncPipeline:
    async def process_item(self, item, spider):
        # One session per item keeps the example simple; reusing a single
        # session across the crawl is cheaper in practice.
        async with aiohttp.ClientSession() as session:
            async with session.get(
                f'https://api.example.com/data?id={item["id"]}'
            ) as response:
                item['extra'] = await response.json()
        return item
Speed improvement: 2-10x for I/O operations
Optimization 10: Scrape APIs Instead of HTML
If the site has an API, use it!
import json

# Slow: scraping HTML
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get(),
        }

# Fast: scraping the API's JSON directly
def parse(self, response):
    data = json.loads(response.text)  # or response.json() on Scrapy 2.2+
    for product in data['products']:
        yield {
            'name': product['name'],
            'price': product['price'],
        }
Speed improvement: 10-100x
APIs are:
- Faster to download (smaller)
- Faster to parse (no HTML)
- More reliable
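In practice you find the API by watching the network tab in your browser's dev tools, then page through it directly. A minimal sketch, assuming a hypothetical `/api/products?page=N` endpoint that returns `products` and `has_next` fields:

```python
import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_products'   # placeholder name

    def start_requests(self):
        # Hypothetical endpoint and parameters; adapt to the real API
        yield scrapy.Request('https://example.com/api/products?page=1',
                             cb_kwargs={'page': 1})

    def parse(self, response, page):
        data = response.json()  # Scrapy 2.2+; otherwise json.loads(response.text)
        for product in data['products']:
            yield {'name': product['name'], 'price': product['price']}
        if data.get('has_next'):
            next_page = page + 1
            yield scrapy.Request(
                f'https://example.com/api/products?page={next_page}',
                cb_kwargs={'page': next_page},
            )
```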
Optimization 11: Use HTTP/2
HTTP/2 is faster than HTTP/1.1:
# Install the HTTP/2 dependencies (Scrapy uses Twisted's HTTP/2 support)
pip install 'Twisted[http2]'
# Enable in settings.py
DOWNLOAD_HANDLERS = {
    'https': 'scrapy.core.downloader.handlers.http2.H2DownloadHandler',
}
Speed improvement: 10-30% (especially with high latency)
Note: Scrapy's HTTP/2 handler has documented limitations (for example, it doesn't work with proxies), so check the Scrapy docs before enabling it.
Optimization 12: Disable Logging in Production
Logging to console is slow:
# Development
LOG_LEVEL = 'DEBUG'
# Production
LOG_LEVEL = 'WARNING' # or ERROR
LOG_FILE = 'spider.log' # Log to file, not console
Speed improvement: 5-10%
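You can also leave DEBUG in your project settings for development and quiet individual runs from the command line, since `-s` overrides any setting for that run:

```bash
scrapy crawl myspider -s LOG_LEVEL=WARNING -s LOG_FILE=spider.log
```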
Optimization 13: Use the Memory Queue
Scrapy keeps its request queue in memory by default. It only switches to disk queues when you enable persistent (pause/resume) crawls with JOBDIR:
# Make sure this is NOT set unless you need pause/resume support
# JOBDIR = 'crawls/myjob'  # This switches the scheduler to disk queues
Disk queues make crawls resumable after a crash or shutdown, but they add serialization and disk I/O on every request.
Speed improvement: 10-20% (only if you were using JOBDIR)
Optimization 14: Reduce Item Overhead
Items have overhead. For simple scraping, use dicts:
# Slower (Item objects have overhead)
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    yield item

# Faster (plain dicts)
def parse(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get(),
    }
Speed improvement: 5-10%
Trade-off: you lose the defined fields and structure that Items give you.
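If you want some structure without the Item boilerplate, Scrapy 2.2+ also accepts plain dataclass objects as items, which is a reasonable middle ground. A minimal sketch (class and field names are just examples):

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: str


def parse(self, response):
    yield Product(
        name=response.css('h1::text').get(),
        price=response.css('.price::text').get(),
    )
```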
Optimization 15: Profile Your Spider
Find actual bottlenecks:
# Profile the spider with cProfile (built into Python; nothing to install)
python -m cProfile -o profile.stats $(which scrapy) crawl myspider
# Analyze the results interactively
python -m pstats profile.stats
# then, at the pstats prompt:
#   sort cumulative
#   stats 20
(For coroutine-heavy spiders, yappi can give a clearer picture than cProfile: pip install yappi.)
Shows which functions take the most time.
Real-World Optimization Example
Let's optimize a slow spider:
Before (Slow)
class SlowSpider(scrapy.Spider):
    name = 'slow'
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,  # Default
        'DOWNLOAD_TIMEOUT': 180,
        'COOKIES_ENABLED': True,
        'RETRY_ENABLED': True,
        'LOG_LEVEL': 'DEBUG',
    }

    def parse(self, response):
        # Inefficient selectors
        for product in response.xpath('//div[@class="product"]'):
            item = ProductItem()
            item['name'] = product.xpath('.//h2/text()').get()
            item['price'] = product.xpath('.//span[@class="price"]/text()').get()
            yield item

# Slow pipeline
class SlowPipeline:
    def process_item(self, item, spider):
        # Single insert and commit per item (slow!)
        self.cursor.execute('INSERT INTO products VALUES (?, ?)',
                            (item['name'], item['price']))
        self.conn.commit()
        return item
Speed: 50,000 pages in 6 hours
After (Fast)
class FastSpider(scrapy.Spider):
    name = 'fast'
    custom_settings = {
        'CONCURRENT_REQUESTS': 64,             # Increased
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'DOWNLOAD_TIMEOUT': 30,                # Reduced
        'COOKIES_ENABLED': False,              # Disabled (not needed)
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 1,                      # Reduced (default is 2)
        'LOG_LEVEL': 'INFO',                   # Less verbose
        'LOG_FILE': 'spider.log',              # File instead of console
    }

    def parse(self, response):
        # Efficient CSS selectors
        for product in response.css('.product'):
            yield {  # Dict instead of Item
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
            }

# Fast pipeline with batching
class FastPipeline:
    def __init__(self):
        self.items = []
        self.batch_size = 100

    # self.conn / self.cursor are created in open_spider, as in Optimization 8

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        # Batch insert (much faster!)
        values = [(item['name'], item['price']) for item in self.items]
        self.cursor.executemany('INSERT INTO products VALUES (?, ?)', values)
        self.conn.commit()
        self.items = []

    def close_spider(self, spider):
        self.flush()
Speed: 50,000 pages in 30 minutes
Result: 12x faster!
Measuring Performance
Always measure before and after:
from datetime import datetime

import scrapy


class MeasuredSpider(scrapy.Spider):
    name = 'measured'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_time = datetime.now()
        self.page_count = 0

    def parse(self, response):
        self.page_count += 1

        # Log speed every 1000 pages
        if self.page_count % 1000 == 0:
            elapsed = (datetime.now() - self.start_time).total_seconds()
            speed = self.page_count / elapsed
            self.logger.info(
                f'Scraped {self.page_count} pages in {elapsed:.1f}s '
                f'({speed:.1f} pages/sec)'
            )

        yield {'url': response.url}
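Scrapy's built-in LogStats extension already logs throughput for you, with lines like "Crawled N pages (at N pages/min)" at a fixed interval, so for a quick read you can simply tune that interval instead:

```python
# settings.py
LOGSTATS_INTERVAL = 30.0   # log crawl/item rates every 30 seconds (default: 60)
```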
When NOT to Optimize
Don't over-optimize:
Skip optimization if:
- Spider runs once
- Total time < 5 minutes
- You're still developing
- Site is very slow (bottleneck is server, not you)
Optimize when:
- Spider runs regularly
- Total time > 30 minutes
- Scraping large sites (100k+ pages)
- Time is critical
Quick Wins Checklist
Apply these for immediate speed boost:
- [ ] Increase CONCURRENT_REQUESTS to 32-64
- [ ] Reduce DOWNLOAD_TIMEOUT to 30
- [ ] Disable COOKIES_ENABLED if not needed
- [ ] Use CSS selectors instead of XPath
- [ ] Batch database operations
- [ ] Set LOG_LEVEL to INFO or WARNING
- [ ] Look for APIs instead of scraping HTML
These 7 changes can give you 2-10x speedup!
Summary
Network optimization (biggest impact):
- Increase concurrency
- Reduce timeouts
- Disable unnecessary features (cookies, redirects)
- Use HTTP/2
Parsing optimization:
- Use CSS over XPath
- Cache selector results
- Use more specific selectors
- Use dicts instead of Items
Pipeline optimization:
- Batch database operations
- Use async for I/O
- Minimize per-item processing
General tips:
- Profile to find real bottlenecks
- Measure before and after
- Start with quick wins
- APIs are usually far faster than scraping HTML
Remember:
- Network is usually the bottleneck
- Optimize network first
- Batch database operations
- More concurrency = faster (up to a point)
Start with the quick wins checklist. That alone can often give you a 2-10x speedup in a few minutes!
Happy scraping! 🕷️