When I first started building spiders, I'd test them by running them over and over. Each time I tweaked a selector, I'd run the spider again. Hit the website again. Download the same pages again.
After my 50th run in one day, I realized something: I was being a terrible internet citizen. I was hammering some poor website with hundreds of requests just because I couldn't write a CSS selector properly.
Then I discovered Scrapy's HTTP cache. Game changer. Now when I test my spiders, they fetch pages once and reuse the cached responses. Faster testing. No guilt. No getting blocked.
Let me show you how to use caching properly.
What Is HTTP Cache?
Think of HTTP cache like a photocopy machine for webpages.
Without cache:
- Run spider → Downloads page
- Fix selector → Run again → Downloads same page again
- Fix another thing → Run again → Downloads same page AGAIN
You're downloading the exact same page multiple times. Wasteful. Slow. Annoying to the website.
With cache:
- Run spider → Downloads page → Saves a copy
- Fix selector → Run again → Uses saved copy (no download!)
- Fix another thing → Run again → Still using saved copy
You download once, test infinite times. The website only sees one request.
Enabling Cache (The One-Liner)
Add this to your settings.py:
HTTPCACHE_ENABLED = True
That's it. Seriously. Scrapy now caches everything.
Run your spider:
scrapy crawl myspider
The first run downloads pages normally. Check your project folder and you'll see a new .scrapy/httpcache/myspider/ directory. That's where cached pages live.
Run it again:
scrapy crawl myspider
This time? Lightning fast. No actual HTTP requests. Everything comes from cache.
How Cache Works (The Simple Explanation)
When you enable cache, here's what happens:
- First request: Spider asks for a URL
- Cache checks: "Do I have this page already?"
- Cache miss: Nope, don't have it
- Download: Fetch from website
- Store: Save response to cache
- Return: Give response to spider
Next time you request the same URL:
- Request: Spider asks for same URL
- Cache checks: "Do I have this page?"
- Cache hit: Yes! Found it
- Return: Give cached response (no download!)
Simple. Fast. Efficient.
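If you like seeing the idea in code, here's a minimal sketch of that cache-or-fetch loop. It's not Scrapy's actual implementation (the real HttpCacheMiddleware stores responses on disk, keyed by request fingerprint), just the concept in a few lines of Python:

from urllib.request import urlopen

cache = {}  # Scrapy keeps this on disk, not in a dict

def fetch(url):
    if url in cache:
        # Cache hit: return the saved copy, no network request
        return cache[url]
    # Cache miss: download the page, save a copy, return it
    body = urlopen(url).read()
    cache[url] = body
    return body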
Basic Cache Settings
How Long to Keep Cache
By default, cache never expires. Keep it forever. But you can set an expiration:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours in seconds
After 24 hours, cached pages get re-downloaded.
When to use expiration:
- Scraping news sites (content changes daily)
- Product prices (change frequently)
- Any dynamic content
When NOT to use expiration:
- Development (you want pages to stay cached)
- Scraping static content
- Historical data that doesn't change
Where to Store Cache
By default, cache goes in .scrapy/httpcache/. Change it:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'my_custom_cache'
Now cache goes in my_custom_cache/ instead (relative paths like this are resolved inside your project's .scrapy/ data directory).
Ignore Certain Status Codes
Don't cache errors:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
404s and 500s won't be cached. Makes sense. You don't want to cache broken pages.
Cache Policies (Two Flavors)
Scrapy has two cache policies: Dummy and RFC2616.
DummyPolicy (The Simple One)
This is the default. It caches EVERYTHING. No questions asked.
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
How it works:
- Every request gets cached
- Never checks if cache is fresh
- Never revalidates
- Perfect for development
Use when:
- Testing your spider
- Offline development
- You want to "replay" scrapes exactly
RFC2616Policy (The Smart One)
This follows HTTP caching rules. It respects Cache-Control headers from websites.
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
How it works:
- Checks HTTP headers
- Respects max-age directives
- Revalidates when needed
- Acts like a real browser cache
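For example, a response that arrives with the header below is treated as fresh for five minutes; RFC2616Policy serves the cached copy until then and only re-downloads afterwards (the header is just an illustration of what a site might send):

Cache-Control: max-age=300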
Use when:
- Running production scrapers
- Want to respect website caching rules
- Need up-to-date data
- Being a good internet citizen
Real Example: Development vs Production
Development Setup (Cache Everything)
# settings.py
# Development: cache everything forever
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_DIR = '.dev_cache'
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
Perfect for testing. Download once, test forever.
Production Setup (Smart Caching)
# settings.py
# Production: respect HTTP caching rules
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_DIR = '.prod_cache'
HTTPCACHE_EXPIRATION_SECS = 3600 # 1 hour
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
Respects website rules. Updates when needed.
Practical Workflow
Here's how I actually use cache when building spiders:
Step 1: Enable Cache for Development
# settings.py
HTTPCACHE_ENABLED = True
Step 2: First Run (Populate Cache)
scrapy crawl myspider
This downloads all pages and caches them.
Step 3: Develop with Cache
Now I can run my spider hundreds of times without hitting the website:
# Run it again
scrapy crawl myspider
# Fix selector
# Run again
scrapy crawl myspider
# Fix another thing
# Run again
scrapy crawl myspider
All instant. All from cache.
Step 4: Clear Cache When Needed
When the website's structure changes or I need fresh data:
rm -rf .scrapy/httpcache/
Then run again to re-populate cache with fresh pages.
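If you only want to throw away one spider's cache (remember, each spider gets its own subfolder), delete just that folder instead:

rm -rf .scrapy/httpcache/myspider/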
Per-Request Cache Control
You can disable cache for specific requests:
def parse(self, response):
    # This request won't be cached
    yield scrapy.Request(
        'https://example.com/dynamic',
        callback=self.parse_dynamic,
        meta={'dont_cache': True}
    )
Useful when some pages need to be fresh but others can be cached.
Advanced: Storage Backends
Scrapy has two storage backends: Filesystem (default) and DBM.
Filesystem (Default)
Stores each response as a separate file:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pros:
- Easy to inspect (just open files!)
- Works everywhere
- Simple
Cons:
- Many small files
- Slower with thousands of pages
- Takes more disk space
DBM (Database)
Stores responses in a database:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
Pros:
- Faster with lots of pages
- Fewer files
- More efficient
Cons:
- Harder to inspect
- Database-specific issues
- More complex
For most projects, stick with Filesystem. It's simpler.
Debugging with Cache
See What's Cached
ls -R .scrapy/httpcache/
You'll see folders for each request. Inside each folder:
- request_body (the request that was made)
- request_headers (headers sent)
- response_body (HTML received)
- response_headers (response headers)
- meta (metadata)
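Want to see the actual HTML you cached? Print any response_body file. Note that with HTTPCACHE_GZIP enabled the files are gzip-compressed, so swap cat for zcat:

find .scrapy/httpcache/myspider -name response_body | head -1 | xargs cat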
Check If Request Was Cached
Scrapy logs cache hits:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None) ['cached']
See that ['cached'] at the end? That's a cache hit.
Without cache:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
No ['cached']. Fresh download.
Common Gotchas
Gotcha #1: Cache Survives Spider Runs
Cache persists between runs. If you want fresh data, you need to manually clear it:
rm -rf .scrapy/httpcache/
Or set an expiration time.
Gotcha #2: Different Spiders Share Cache
If you have multiple spiders in one project, they share the cache directory. Each spider gets its own subfolder though:
.scrapy/httpcache/
    spider1/
    spider2/
    spider3/
Gotcha #3: POST Requests Aren't Cached by Default
Only GET requests are cached. POST requests (like form submissions) bypass cache:
# This won't be cached
yield scrapy.FormRequest(
    'https://example.com/search',
    formdata={'query': 'test'}
)
This is by design. POST requests usually aren't idempotent (they change things).
Gotcha #4: Redirects Are Cached Too
If a URL redirects, the redirect is cached. You won't see the redirect happen again:
https://example.com → https://www.example.com
The first run follows the redirect and caches the final page. Subsequent runs just return the cached final page.
Real-World Scenarios
Scenario 1: Testing Selectors
You're building a spider and constantly testing CSS selectors:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
Run once to populate cache. Then test selectors all day without hitting the website.
Scenario 2: Scraping Historical Data
You're scraping historical data that never changes (like old articles):
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_EXPIRATION_SECS = 0 # Keep forever
Once you scrape it, it's cached forever. Perfect for historical archives.
Scenario 3: Production Scraper
You're running a production scraper that needs fresh data:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_EXPIRATION_SECS = 1800 # 30 minutes
Respects HTTP rules. Re-fetches after 30 minutes. Balanced approach.
Scenario 4: Offline Development
You're on a plane with no internet but want to work on your spider:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # Ignore requests that aren't in the cache
Your spider only uses cached pages. Anything that isn't already in the cache gets skipped instead of downloaded, so nothing ever tries to hit the network.
Tips Nobody Tells You
Tip #1: Use Cache for CI/CD
In continuous integration, you don't want to hit real websites. Use cache:
# settings.py for CI/CD
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # Uncached requests are ignored, never downloaded
Pre-populate cache in your repo. Tests run against cached pages. Fast. Reliable.
Tip #2: Share Cache Between Developers
Commit the cache folder to version control:
git add .scrapy/httpcache/
git commit -m "Add test cache"
Now everyone on your team uses the same cached pages for testing. Consistent results.
Tip #3: Different Cache for Different Environments
# settings.py
import os
HTTPCACHE_ENABLED = True
if os.getenv('ENV') == 'production':
    HTTPCACHE_DIR = '.prod_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
else:
    HTTPCACHE_DIR = '.dev_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
Separate cache for dev and prod. Best of both worlds.
Tip #4: Compress Cache to Save Space
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True # Compress cached responses
Saves tons of disk space. Especially with large pages.
Complete Example Spider
Here's a production-ready spider with smart caching:
# spider.py
import scrapy

class SmartCacheSpider(scrapy.Spider):
    name = 'smartcache'
    start_urls = ['https://example.com/products']

    custom_settings = {
        # Enable cache
        'HTTPCACHE_ENABLED': True,
        # Use smart policy
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
        # Cache for 1 hour
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        # Don't cache errors
        'HTTPCACHE_IGNORE_HTTP_CODES': [404, 500, 502, 503],
        # Compress to save space
        'HTTPCACHE_GZIP': True,
        # Custom cache directory
        'HTTPCACHE_DIR': '.product_cache'
    }

    def parse(self, response):
        # Log whether this was cached
        if 'cached' in response.flags:
            self.logger.info(f'Using cached version of {response.url}')
        else:
            self.logger.info(f'Downloaded fresh: {response.url}')

        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # For some pages, force fresh download
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            # Don't cache pagination (always get fresh)
            yield response.follow(
                next_page,
                callback=self.parse,
                meta={'dont_cache': True}
            )
This spider:
- Caches product pages for 1 hour
- Respects HTTP caching rules
- Always downloads pagination pages fresh
- Logs cache hits/misses
- Compresses cache to save space
When NOT to Use Cache
Cache isn't always the answer:
Don't cache when:
- Scraping real-time data (stock prices, sports scores)
- The website explicitly says not to cache (Cache-Control: no-store)
- You need the absolute latest data every time
- Running in production and disk space is limited
Do cache when:
- Developing and testing
- Scraping static/historical content
- Want to reduce server load
- Need consistent test data
- Working offline
Quick Reference
Basic Setup
# settings.py
# Enable cache (simplest)
HTTPCACHE_ENABLED = True
# Set expiration (seconds)
HTTPCACHE_EXPIRATION_SECS = 3600
# Set directory
HTTPCACHE_DIR = '.my_cache'
# Choose policy
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy' # or RFC2616Policy
# Ignore certain codes
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500]
# Compress cache
HTTPCACHE_GZIP = True
Per-Request Control
# Don't cache this request
yield scrapy.Request(
    url,
    meta={'dont_cache': True}
)
Check if Cached
def parse(self, response):
    if 'cached' in response.flags:
        print('From cache!')
    else:
        print('Fresh download!')
Clear Cache
rm -rf .scrapy/httpcache/
Summary
HTTP cache is your best friend during development. It:
- Speeds up testing dramatically
- Reduces load on websites
- Lets you work offline
- Makes tests consistent
- Saves bandwidth
Key takeaways:
- Enable with HTTPCACHE_ENABLED = True
- Use DummyPolicy for development
- Use RFC2616Policy for production
- Clear cache when you need fresh data
- Use dont_cache meta for specific requests
- Check logs for ['cached'] to see cache hits
Start using cache in your next project. Your spider will run faster, and website admins will thank you for not hammering their servers.
Happy scraping! 🕷️