Muhammad Ikramullah Khan

Scrapy HTTP Cache: The Complete Beginner's Guide (Stop Hammering Websites)

When I first started building spiders, I'd test them by running them over and over. Each time I tweaked a selector, I'd run the spider again. Hit the website again. Download the same pages again.

After my 50th run in one day, I realized something: I was being a terrible internet citizen. I was hammering some poor website with hundreds of requests just because I couldn't write a CSS selector properly.

Then I discovered Scrapy's HTTP cache. Game changer. Now when I test my spiders, they fetch pages once and reuse the cached responses. Faster testing. No guilt. No getting blocked.

Let me show you how to use caching properly.


What Is HTTP Cache?

Think of HTTP cache like a photocopy machine for webpages.

Without cache:

  • Run spider → Downloads page
  • Fix selector → Run again → Downloads same page again
  • Fix another thing → Run again → Downloads same page AGAIN

You're downloading the exact same page multiple times. Wasteful. Slow. Annoying to the website.

With cache:

  • Run spider → Downloads page → Saves a copy
  • Fix selector → Run again → Uses saved copy (no download!)
  • Fix another thing → Run again → Still using saved copy

You download once, test as many times as you like. The website only sees one request per page.


Enabling Cache (The One-Liner)

Add this to your settings.py:

HTTPCACHE_ENABLED = True

That's it. Seriously. Scrapy now caches everything.

Run your spider:

scrapy crawl myspider

The first run downloads pages normally. Check your project folder and you'll see a new .scrapy/httpcache/myspider/ directory. That's where cached pages live.

Run it again:

scrapy crawl myspider

This time? Lightning fast. No actual HTTP requests. Everything comes from cache.


How Cache Works (The Simple Explanation)

When you enable cache, here's what happens:

  1. First request: Spider asks for a URL
  2. Cache checks: "Do I have this page already?"
  3. Cache miss: Nope, don't have it
  4. Download: Fetch from website
  5. Store: Save response to cache
  6. Return: Give response to spider

Next time you request the same URL:

  1. Request: Spider asks for same URL
  2. Cache checks: "Do I have this page?"
  3. Cache hit: Yes! Found it
  4. Return: Give cached response (no download!)

Simple. Fast. Efficient.
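
If you like seeing that flow as code, here's a minimal, self-contained sketch. It is not Scrapy's implementation (the real logic lives in scrapy.downloadermiddlewares.httpcache); it just mirrors the hit/miss steps above with a plain dict and urllib:

import urllib.request

cache = {}  # url -> response body

def fetch(url):
    if url in cache:                  # cache check
        print(f'cache hit: {url}')    # found it: no download
        return cache[url]
    print(f'cache miss: {url}')       # don't have it yet
    body = urllib.request.urlopen(url).read()  # fetch from the website
    cache[url] = body                 # store for next time
    return body

page = fetch('https://example.com')        # first call downloads
page_again = fetch('https://example.com')  # second call comes from cache

Scrapy does the same dance for every request, with the dict replaced by a storage backend and a policy deciding what counts as fresh.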


Basic Cache Settings

How Long to Keep Cache

By default, the cache never expires; cached pages are kept forever. But you can set an expiration:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # 24 hours in seconds

After 24 hours, cached pages get re-downloaded.
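
(Under the hood, each cache entry stores a download timestamp; on lookup, entries older than HTTPCACHE_EXPIRATION_SECS count as cache misses, so the page is re-fetched and the stale copy overwritten.)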

When to use expiration:

  • Scraping news sites (content changes daily)
  • Product prices (change frequently)
  • Any dynamic content

When NOT to use expiration:

  • Development (you want pages to stay cached)
  • Scraping static content
  • Historical data that doesn't change

Where to Store Cache

By default, cache goes in .scrapy/httpcache/. Change it:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'my_custom_cache'

Now cache goes in my_custom_cache/ instead.

Ignore Certain Status Codes

Don't cache errors:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]

404s and 500s won't be cached. Makes sense. You don't want to cache broken pages.


Cache Policies (Two Flavors)

Scrapy has two cache policies: Dummy and RFC2616.

DummyPolicy (The Simple One)

This is the default. It caches EVERYTHING. No questions asked.

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'

How it works:

  • Every request gets cached
  • Never checks if cache is fresh
  • Never revalidates
  • Perfect for development

Use when:

  • Testing your spider
  • Offline development
  • You want to "replay" scrapes exactly

RFC2616Policy (The Smart One)

This follows HTTP caching rules. It respects Cache-Control headers from websites.

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'

How it works:

  • Checks HTTP headers
  • Respects max-age directives
  • Revalidates when needed
  • Acts like a real browser cache
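
For example, a response that arrives with Cache-Control: max-age=300 is treated as fresh for five minutes, while one marked Cache-Control: no-store is never written to the cache at all.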

Use when:

  • Running production scrapers
  • Want to respect website caching rules
  • Need up-to-date data
  • Being a good internet citizen

Real Example: Development vs Production

Development Setup (Cache Everything)

# settings.py

# Development: cache everything forever
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_DIR = '.dev_cache'
HTTPCACHE_EXPIRATION_SECS = 0  # Never expire

Perfect for testing. Download once, test forever.

Production Setup (Smart Caching)

# settings.py

# Production: respect HTTP caching rules
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_DIR = '.prod_cache'
HTTPCACHE_EXPIRATION_SECS = 3600  # 1 hour
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]

Respects website rules. Updates when needed.


Practical Workflow

Here's how I actually use cache when building spiders:

Step 1: Enable Cache for Development

# settings.py
HTTPCACHE_ENABLED = True

Step 2: First Run (Populate Cache)

scrapy crawl myspider

This downloads all pages and caches them.

Step 3: Develop with Cache

Now I can run my spider hundreds of times without hitting the website:

# Run it again
scrapy crawl myspider

# Fix selector
# Run again
scrapy crawl myspider

# Fix another thing
# Run again
scrapy crawl myspider

All instant. All from cache.

Step 4: Clear Cache When Needed

When the website's structure changes or I need fresh data:

rm -rf .scrapy/httpcache/

Then run again to re-populate cache with fresh pages.
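
If you only want to reset one spider's cache, delete just its subfolder (each spider caches under its own name):

rm -rf .scrapy/httpcache/myspider/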


Per-Request Cache Control

You can disable cache for specific requests:

def parse(self, response):
    # This request won't be cached
    yield scrapy.Request(
        'https://example.com/dynamic',
        callback=self.parse_dynamic,
        meta={'dont_cache': True}
    )

Useful when some pages need to be fresh but others can be cached. The dont_cache flag works in both directions: the request is never answered from the cache, and its response is never stored in it.


Advanced: Storage Backends

Scrapy has two storage backends: Filesystem (default) and DBM.

Filesystem (Default)

Stores each response as a set of plain files in its own folder:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Pros:

  • Easy to inspect (just open files!)
  • Works everywhere
  • Simple

Cons:

  • Many small files
  • Slower with thousands of pages
  • Takes more disk space

DBM (Database)

Stores responses in a DBM database file:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'

Pros:

  • Faster with lots of pages
  • Fewer files
  • More efficient

Cons:

  • Harder to inspect
  • Behavior depends on the underlying dbm module
  • More complex
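
If you do pick DBM, the HTTPCACHE_DBM_MODULE setting chooses which Python dbm module backs the storage:

# settings.py
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
HTTPCACHE_DBM_MODULE = 'dbm'  # the default; any dbm-compatible module works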

For most projects, stick with Filesystem. It's simpler.


Debugging with Cache

See What's Cached

ls -R .scrapy/httpcache/

You'll see a folder per cached request (named by request fingerprint). Inside each folder:

  • request_body (the request that was made)
  • request_headers (headers sent)
  • response_body (HTML received)
  • response_headers (response headers)
  • meta (metadata)
  • pickled_meta (the same metadata, pickled for faster loading)
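
If you want to peek at a cached page outside Scrapy, you can read response_body straight from disk. A minimal sketch, assuming the default filesystem backend and a spider named myspider (if you enable HTTPCACHE_GZIP, the files are gzip-compressed first):

from pathlib import Path

# Layout: .scrapy/httpcache/<spider>/<first two hex chars of fingerprint>/<fingerprint>/
cache_root = Path('.scrapy/httpcache/myspider')
for body_file in cache_root.glob('*/*/response_body'):
    html = body_file.read_bytes()
    print(body_file.parent.name, len(html), 'bytes')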

Check If Request Was Cached

Scrapy logs cache hits:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None) ['cached']

See that ['cached'] at the end? That's a cache hit.

Without cache:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)

No ['cached']. Fresh download.
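
You can also check the stats summary Scrapy dumps at the end of a crawl; with caching enabled it includes counters like httpcache/hit, httpcache/miss, and httpcache/store.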


Common Gotchas

Gotcha #1: Cache Survives Spider Runs

Cache persists between runs. If you want fresh data, you need to manually clear it:

rm -rf .scrapy/httpcache/

Or set an expiration time.

Gotcha #2: Spiders Don't Share Cache Entries

If you have multiple spiders in one project, they share the cache directory, but each spider gets its own subfolder, so one spider never reuses another's cached pages:

.scrapy/httpcache/
    spider1/
    spider2/
    spider3/

Gotcha #3: POST Requests Are Cached Too (Under DummyPolicy)

The cache key is Scrapy's request fingerprint, which includes the HTTP method and body. So with the default DummyPolicy, repeating an identical POST request (like a form submission) replays the cached response:

# Under DummyPolicy, running this exact request twice
# returns the cached response the second time
yield scrapy.FormRequest(
    'https://example.com/search',
    formdata={'query': 'test'}
)

POST requests usually aren't idempotent (they change things on the server), so if one must always reach the site, opt it out with meta={'dont_cache': True}. The stricter RFC2616Policy generally won't cache responses that lack explicit freshness or validator headers, regardless of method.

Gotcha #4: Redirects Are Cached Too

If a URL redirects, the redirect response is cached along with the final page:

https://example.com → https://www.example.com

The first run follows the redirect and caches both the 3xx response and the final page. Subsequent runs replay the whole redirect chain from cache, without touching the network.


Real-World Scenarios

Scenario 1: Testing Selectors

You're building a spider and constantly testing CSS selectors:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # Never expire

Run once to populate cache. Then test selectors all day without hitting the website.

Scenario 2: Scraping Historical Data

You're scraping historical data that never changes (like old articles):

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_EXPIRATION_SECS = 0  # Keep forever

Once you scrape it, it's cached forever. Perfect for historical archives.

Scenario 3: Production Scraper

You're running a production scraper that needs fresh data:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_EXPIRATION_SECS = 1800  # 30 minutes

Respects HTTP rules. Re-fetches after 30 minutes. Balanced approach.

Scenario 4: Offline Development

You're on a plane with no internet but want to work on your spider:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True  # Don't download pages missing from the cache

Your spider only uses the cache. If a page isn't cached, the request is dropped (Scrapy raises IgnoreRequest) instead of going out to the network.


Tips Nobody Tells You

Tip #1: Use Cache for CI/CD

In continuous integration, you don't want to hit real websites. Use cache:

# settings.py for CI/CD
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True  # Uncached requests are dropped, never downloaded

Pre-populate the cache in your repo. Tests run against cached pages. Fast. Reliable.

Tip #2: Share Cache Between Developers

Commit the cache folder to version control:

git add .scrapy/httpcache/
git commit -m "Add test cache"

Now everyone on your team uses the same cached pages for testing. Consistent results.
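
One caveat: if your project's .gitignore excludes the .scrapy/ directory, you'll need git add -f to force the cache folder into version control.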

Tip #3: Different Cache for Different Environments

# settings.py
import os

HTTPCACHE_ENABLED = True

if os.getenv('ENV') == 'production':
    HTTPCACHE_DIR = '.prod_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
else:
    HTTPCACHE_DIR = '.dev_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'

Separate cache for dev and prod. Best of both worlds.

Tip #4: Compress Cache to Save Space

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True  # Compress cached responses

Saves a lot of disk space, especially with large pages. Note that this setting applies to the filesystem backend, and the cached files become gzip-compressed on disk, so decompress them before inspecting manually.


Complete Example Spider

Here's a production-ready spider with smart caching:

# spider.py
import scrapy

class SmartCacheSpider(scrapy.Spider):
    name = 'smartcache'
    start_urls = ['https://example.com/products']

    custom_settings = {
        # Enable cache
        'HTTPCACHE_ENABLED': True,

        # Use smart policy
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',

        # Cache for 1 hour
        'HTTPCACHE_EXPIRATION_SECS': 3600,

        # Don't cache errors
        'HTTPCACHE_IGNORE_HTTP_CODES': [404, 500, 502, 503],

        # Compress to save space
        'HTTPCACHE_GZIP': True,

        # Custom cache directory
        'HTTPCACHE_DIR': '.product_cache'
    }

    def parse(self, response):
        # Log whether this was cached
        if 'cached' in response.flags:
            self.logger.info(f'Using cached version of {response.url}')
        else:
            self.logger.info(f'Downloaded fresh: {response.url}')

        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

        # For some pages, force fresh download
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            # Don't cache pagination (always get fresh)
            yield response.follow(
                next_page,
                callback=self.parse,
                meta={'dont_cache': True}
            )

This spider:

  • Caches product pages for 1 hour
  • Respects HTTP caching rules
  • Always downloads pagination pages fresh
  • Logs cache hits/misses
  • Compresses cache to save space

When NOT to Use Cache

Cache isn't always the answer:

Don't cache when:

  • Scraping real-time data (stock prices, sports scores)
  • The website explicitly says not to cache (Cache-Control: no-store; RFC2616Policy honors this automatically, DummyPolicy does not)
  • You need the absolute latest data every time
  • Running in production and disk space is limited

Do cache when:

  • Developing and testing
  • Scraping static/historical content
  • Want to reduce server load
  • Need consistent test data
  • Working offline

Quick Reference

Basic Setup

# settings.py

# Enable cache (simplest)
HTTPCACHE_ENABLED = True

# Set expiration (seconds)
HTTPCACHE_EXPIRATION_SECS = 3600

# Set directory
HTTPCACHE_DIR = '.my_cache'

# Choose policy
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'  # or RFC2616Policy

# Ignore certain codes
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500]

# Compress cache
HTTPCACHE_GZIP = True

Per-Request Control

# Don't cache this request
yield scrapy.Request(
    url,
    meta={'dont_cache': True}
)

Check if Cached

def parse(self, response):
    if 'cached' in response.flags:
        print('From cache!')
    else:
        print('Fresh download!')

Clear Cache

rm -rf .scrapy/httpcache/

Summary

HTTP cache is your best friend during development. It:

  • Speeds up testing dramatically
  • Reduces load on websites
  • Lets you work offline
  • Makes tests consistent
  • Saves bandwidth

Key takeaways:

  • Enable with HTTPCACHE_ENABLED = True
  • Use DummyPolicy for development
  • Use RFC2616Policy for production
  • Clear cache when you need fresh data
  • Use dont_cache meta for specific requests
  • Check logs for ['cached'] to see cache hits

Start using cache in your next project. Your spider will run faster, and website admins will thank you for not hammering their servers.

Happy scraping! 🕷️
