When I first started building spiders, I'd test them by running them over and over. Each time I tweaked a selector, I'd run the spider again. Hit the website again. Download the same pages again.
After my 50th run in one day, I realized something: I was being a terrible internet citizen. I was hammering some poor website with hundreds of requests just because I couldn't write a CSS selector properly.
Then I discovered Scrapy's HTTP cache. Game changer. Now when I test my spiders, they fetch pages once and reuse the cached responses. Faster testing. No guilt. No getting blocked.
Let me show you how to use caching properly.
What Is HTTP Cache?
Think of HTTP cache like a photocopy machine for webpages.
Without cache:
- Run spider → Downloads page
- Fix selector → Run again → Downloads same page again
- Fix another thing → Run again → Downloads same page AGAIN
You're downloading the exact same page multiple times. Wasteful. Slow. Annoying to the website.
With cache:
- Run spider → Downloads page → Saves a copy
- Fix selector → Run again → Uses saved copy (no download!)
- Fix another thing → Run again → Still using saved copy
You download once, test infinite times. The website only sees one request.
Enabling Cache (The One-Liner)
Add this to your settings.py:
HTTPCACHE_ENABLED = True
That's it. Seriously. Scrapy now caches everything.
Run your spider:
scrapy crawl myspider
The first run downloads pages normally. Check your project folder and you'll see a new .scrapy/httpcache/myspider/ directory. That's where cached pages live.
Run it again:
scrapy crawl myspider
This time? Lightning fast. No actual HTTP requests. Everything comes from cache.
How Cache Works (The Simple Explanation)
When you enable cache, here's what happens:
- First request: Spider asks for a URL
- Cache checks: "Do I have this page already?"
- Cache miss: Nope, don't have it
- Download: Fetch from website
- Store: Save response to cache
- Return: Give response to spider
Next time you request the same URL:
- Request: Spider asks for same URL
- Cache checks: "Do I have this page?"
- Cache hit: Yes! Found it
- Return: Give cached response (no download!)
Simple. Fast. Efficient.
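If you like seeing the idea in code, here's a minimal sketch of that cache-or-fetch loop. It's not Scrapy's actual implementation (the real HttpCacheMiddleware stores responses on disk, keyed by request fingerprint), just the concept in a few lines of Python:

from urllib.request import urlopen

cache = {}  # Scrapy keeps this on disk, not in a dict

def fetch(url):
    if url in cache:
        # Cache hit: return the saved copy, no network request
        return cache[url]
    # Cache miss: download the page, save a copy, return it
    body = urlopen(url).read()
    cache[url] = body
    return body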
Basic Cache Settings
How Long to Keep Cache
By default, cache never expires. Keep it forever. But you can set an expiration:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours in seconds
After 24 hours, cached pages get re-downloaded.
When to use expiration:
- Scraping news sites (content changes daily)
- Product prices (change frequently)
- Any dynamic content
When NOT to use expiration:
- Development (you want pages to stay cached)
- Scraping static content
- Historical data that doesn't change
Where to Store Cache
By default, cache goes in .scrapy/httpcache/. Change it:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'my_custom_cache'
Now cache goes in my_custom_cache/ instead (relative paths like this are resolved inside your project's .scrapy/ data directory).
Ignore Certain Status Codes
Don't cache errors:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
404s and 500s won't be cached. Makes sense. You don't want to cache broken pages.
Cache Policies (Two Flavors)
Scrapy has two cache policies: Dummy and RFC2616.
DummyPolicy (The Simple One)
This is the default. It caches EVERYTHING. No questions asked.
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
How it works:
- Every request gets cached
- Never checks if cache is fresh
- Never revalidates
- Perfect for development
Use when:
- Testing your spider
- Offline development
- You want to "replay" scrapes exactly
RFC2616Policy (The Smart One)
This follows HTTP caching rules. It respects Cache-Control headers from websites.
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
How it works:
- Checks HTTP headers
- Respects max-age directives
- Revalidates when needed
- Acts like a real browser cache
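For example, a response that arrives with the header below is treated as fresh for five minutes; RFC2616Policy serves the cached copy until then and only re-downloads afterwards (the header is just an illustration of what a site might send):

Cache-Control: max-age=300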
Use when:
- Running production scrapers
- Want to respect website caching rules
- Need up-to-date data
- Being a good internet citizen
Real Example: Development vs Production
Development Setup (Cache Everything)
# settings.py
# Development: cache everything forever
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_DIR = '.dev_cache'
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
Perfect for testing. Download once, test forever.
Production Setup (Smart Caching)
# settings.py
# Production: respect HTTP caching rules
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_DIR = '.prod_cache'
HTTPCACHE_EXPIRATION_SECS = 3600 # 1 hour
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
Respects website rules. Updates when needed.
Practical Workflow
Here's how I actually use cache when building spiders:
Step 1: Enable Cache for Development
# settings.py
HTTPCACHE_ENABLED = True
Step 2: First Run (Populate Cache)
scrapy crawl myspider
This downloads all pages and caches them.
Step 3: Develop with Cache
Now I can run my spider hundreds of times without hitting the website:
# Run it again
scrapy crawl myspider
# Fix selector
# Run again
scrapy crawl myspider
# Fix another thing
# Run again
scrapy crawl myspider
All instant. All from cache.
Step 4: Clear Cache When Needed
When the website's structure changes or I need fresh data:
rm -rf .scrapy/httpcache/
Then run again to re-populate cache with fresh pages.
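If you only want to throw away one spider's cache (remember, each spider gets its own subfolder), delete just that folder instead:

rm -rf .scrapy/httpcache/myspider/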
Per-Request Cache Control
You can disable cache for specific requests:
def parse(self, response):
    # This request won't be cached
    yield scrapy.Request(
        'https://example.com/dynamic',
        callback=self.parse_dynamic,
        meta={'dont_cache': True}
    )
Useful when some pages need to be fresh but others can be cached.
Advanced: Storage Backends
Scrapy has two storage backends: Filesystem (default) and DBM.
Filesystem (Default)
Stores each response as a separate file:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pros:
- Easy to inspect (just open files!)
- Works everywhere
- Simple
Cons:
- Many small files
- Slower with thousands of pages
- Takes more disk space
DBM (Database)
Stores responses in a database:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
Pros:
- Faster with lots of pages
- Fewer files
- More efficient
Cons:
- Harder to inspect
- Database-specific issues
- More complex
For most projects, stick with Filesystem. It's simpler.
Debugging with Cache
See What's Cached
ls -R .scrapy/httpcache/
You'll see folders for each request. Inside each folder:
- request_body (the request that was made)
- request_headers (headers sent)
- response_body (HTML received)
- response_headers (response headers)
- meta (metadata)
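Want to see the actual HTML you cached? Print any response_body file. Note that with HTTPCACHE_GZIP enabled the files are gzip-compressed, so swap cat for zcat:

find .scrapy/httpcache/myspider -name response_body | head -1 | xargs cat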
Check If Request Was Cached
Scrapy logs cache hits:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None) ['cached']
See that ['cached'] at the end? That's a cache hit.
Without cache:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
No ['cached']. Fresh download.
Common Gotchas
Gotcha #1: Cache Survives Spider Runs
Cache persists between runs. If you want fresh data, you need to manually clear it:
rm -rf .scrapy/httpcache/
Or set an expiration time.
Gotcha #2: Different Spiders Share Cache
If you have multiple spiders in one project, they share the cache directory. Each spider gets its own subfolder though:
.scrapy/httpcache/
    spider1/
    spider2/
    spider3/
Gotcha #3: POST Requests Aren't Cached by Default
Only GET requests are cached. POST requests (like form submissions) bypass cache:
# This won't be cached
yield scrapy.FormRequest(
    'https://example.com/search',
    formdata={'query': 'test'}
)
This is by design. POST requests usually aren't idempotent (they change things).
Gotcha #4: Redirects Are Cached Too
If a URL redirects, the redirect is cached. You won't see the redirect happen again:
https://example.com → https://www.example.com
The first run follows the redirect and caches the final page. Subsequent runs just return the cached final page.
Real-World Scenarios
Scenario 1: Testing Selectors
You're building a spider and constantly testing CSS selectors:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
Run once to populate cache. Then test selectors all day without hitting the website.
Scenario 2: Scraping Historical Data
You're scraping historical data that never changes (like old articles):
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_EXPIRATION_SECS = 0 # Keep forever
Once you scrape it, it's cached forever. Perfect for historical archives.
Scenario 3: Production Scraper
You're running a production scraper that needs fresh data:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_EXPIRATION_SECS = 1800 # 30 minutes
Respects HTTP rules. Re-fetches after 30 minutes. Balanced approach.
Scenario 4: Offline Development
You're on a plane with no internet but want to work on your spider:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # Ignore requests that aren't in the cache
Your spider only uses cached pages. Anything that isn't already in the cache gets skipped instead of downloaded, so nothing ever tries to hit the network.
Tips Nobody Tells You
Tip #1: Use Cache for CI/CD
In continuous integration, you don't want to hit real websites. Use cache:
# settings.py for CI/CD
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # Uncached requests are ignored, never downloaded
Pre-populate cache in your repo. Tests run against cached pages. Fast. Reliable.
Tip #2: Share Cache Between Developers
Commit the cache folder to version control:
git add .scrapy/httpcache/
git commit -m "Add test cache"
Now everyone on your team uses the same cached pages for testing. Consistent results.
Tip #3: Different Cache for Different Environments
# settings.py
import os
HTTPCACHE_ENABLED = True
if os.getenv('ENV') == 'production':
    HTTPCACHE_DIR = '.prod_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
else:
    HTTPCACHE_DIR = '.dev_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
Separate cache for dev and prod. Best of both worlds.
Tip #4: Compress Cache to Save Space
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True # Compress cached responses
Saves tons of disk space. Especially with large pages.
Complete Example Spider
Here's a production-ready spider with smart caching:
# spider.py
import scrapy

class SmartCacheSpider(scrapy.Spider):
    name = 'smartcache'
    start_urls = ['https://example.com/products']

    custom_settings = {
        # Enable cache
        'HTTPCACHE_ENABLED': True,
        # Use smart policy
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
        # Cache for 1 hour
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        # Don't cache errors
        'HTTPCACHE_IGNORE_HTTP_CODES': [404, 500, 502, 503],
        # Compress to save space
        'HTTPCACHE_GZIP': True,
        # Custom cache directory
        'HTTPCACHE_DIR': '.product_cache'
    }

    def parse(self, response):
        # Log whether this was cached
        if 'cached' in response.flags:
            self.logger.info(f'Using cached version of {response.url}')
        else:
            self.logger.info(f'Downloaded fresh: {response.url}')

        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # For some pages, force fresh download
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            # Don't cache pagination (always get fresh)
            yield response.follow(
                next_page,
                callback=self.parse,
                meta={'dont_cache': True}
            )
This spider:
- Caches product pages for 1 hour
- Respects HTTP caching rules
- Always downloads pagination pages fresh
- Logs cache hits/misses
- Compresses cache to save space
When NOT to Use Cache
Cache isn't always the answer:
Don't cache when:
- Scraping real-time data (stock prices, sports scores)
- The website explicitly says not to cache (Cache-Control: no-store)
- You need the absolute latest data every time
- Running in production and disk space is limited
Do cache when:
- Developing and testing
- Scraping static/historical content
- Want to reduce server load
- Need consistent test data
- Working offline
Quick Reference
Basic Setup
# settings.py
# Enable cache (simplest)
HTTPCACHE_ENABLED = True
# Set expiration (seconds)
HTTPCACHE_EXPIRATION_SECS = 3600
# Set directory
HTTPCACHE_DIR = '.my_cache'
# Choose policy
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy' # or RFC2616Policy
# Ignore certain codes
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500]
# Compress cache
HTTPCACHE_GZIP = True
Per-Request Control
# Don't cache this request
yield scrapy.Request(
    url,
    meta={'dont_cache': True}
)
Check if Cached
def parse(self, response):
    if 'cached' in response.flags:
        print('From cache!')
    else:
        print('Fresh download!')
Clear Cache
rm -rf .scrapy/httpcache/
Summary
HTTP cache is your best friend during development. It:
- Speeds up testing dramatically
- Reduces load on websites
- Lets you work offline
- Makes tests consistent
- Saves bandwidth
Key takeaways:
- Enable with HTTPCACHE_ENABLED = True
- Use DummyPolicy for development
- Use RFC2616Policy for production
- Clear cache when you need fresh data
- Use dont_cache meta for specific requests
- Check logs for ['cached'] to see cache hits
Start using cache in your next project. Your spider will run faster, and website admins will thank you for not hammering their servers.
Happy scraping! 🕷️