Muhammad Ikramullah Khan

Scrapy Cookie Handling: Master Sessions Like a Pro

I once built a spider that scraped perfectly for 5 pages, then randomly failed on the 6th. Sometimes it worked, sometimes it didn't. I was going crazy.

Turns out the website was setting a session cookie on page 1 that expired after 5 pages. My spider didn't handle cookies properly, so page 6 always failed.

Once I understood cookie handling, the spider became 100% reliable. Let me show you everything about cookies in Scrapy.


Understanding Cookies

What are cookies?

  • Small pieces of data stored by the browser
  • Sent with every request to the same domain
  • Used for sessions, preferences, tracking

Why sites use cookies:

  • Remember who you are (session)
  • Track your activity
  • Store preferences
  • Anti-bot protection

Why you need to handle them:

  • Sites expect cookies
  • Sessions won't work without them
  • You'll look like a bot

Scrapy's Default Cookie Handling

Good news: Scrapy handles cookies automatically!

# settings.py
COOKIES_ENABLED = True  # This is the default

What Scrapy does automatically:

  • Stores cookies from responses
  • Sends cookies with requests
  • Maintains separate cookie jar per spider
  • Handles cookie expiration

You usually don't need to do anything!
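
For example, here's a minimal sketch (the example.com URLs are placeholders) where the session cookie set on the first page flows to the second page with zero cookie code:

import scrapy

class AutoCookieSpider(scrapy.Spider):
    name = 'auto_cookie'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Any Set-Cookie headers from this response are now in the jar;
        # the follow-up request below sends them back automatically
        yield response.follow('/page2', callback=self.parse_page2)

    def parse_page2(self, response):
        # Arrives with the session cookie attached -- no manual handling
        yield {'url': response.url}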


Checking If Cookies Are Working

View Cookies in Response

def parse(self, response):
    # Log cookies received (header values are bytes, so decode for readability)
    set_cookies = response.headers.getlist('Set-Cookie')
    for cookie in set_cookies:
        self.logger.info(f'Received cookie: {cookie.decode()}')

    yield {'url': response.url}

Check Cookies Being Sent

def parse(self, response):
    # Log cookies sent with the request (None if no Cookie header was set)
    request_cookies = response.request.headers.get('Cookie')
    if request_cookies:
        self.logger.info(f'Sent cookies: {request_cookies.decode()}')

    yield {'url': response.url}

Disabling Cookies (When Needed)

Sometimes you want to disable cookies:

# settings.py
COOKIES_ENABLED = False

When to disable:

  • Scraping public pages (no session needed)
  • Want to avoid tracking
  • Testing how site behaves without cookies
  • Slight performance boost

When NOT to disable:

  • Site requires login
  • Site uses sessions
  • Site tracks state across pages
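
To disable cookies for just one spider while the rest of the project keeps the default, Scrapy's standard custom_settings does it (the spider name is illustrative):

import scrapy

class PublicPagesSpider(scrapy.Spider):
    name = 'public_pages'
    # Overrides the project-wide setting for this spider only
    custom_settings = {'COOKIES_ENABLED': False}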

Setting Initial Cookies

Provide cookies from the start:

Method 1: In start_requests

import scrapy

class CookieSpider(scrapy.Spider):
    name = 'cookie'

    def start_requests(self):
        cookies = {
            'session_id': 'abc123',
            'user_token': 'xyz789',
            'preferences': 'dark_mode'
        }

        yield scrapy.Request(
            'https://example.com',
            cookies=cookies,
            callback=self.parse
        )

Method 2: Per Request

def parse(self, response):
    cookies = {
        'page_state': 'viewed',
        'timestamp': '12345'
    }

    yield scrapy.Request(
        'https://example.com/next',
        cookies=cookies,
        callback=self.parse_next
    )

Method 3: From Browser

Get cookies from your browser and use them:

Chrome:

  1. Log in to the site
  2. F12 → Application → Cookies
  3. Copy the cookie names and values

def start_requests(self):
    # Cookies copied from Chrome
    cookies = {
        'sessionid': 'abc123def456',
        'csrftoken': 'xyz789',
        '_ga': 'GA1.2.123456789.1234567890'
    }

    yield scrapy.Request(
        'https://example.com/dashboard',
        cookies=cookies,
        callback=self.parse
    )

Cookie Jar Per Spider

Each spider gets its own cookie jar:

class Spider1(scrapy.Spider):
    name = 'spider1'
    # Has its own cookie jar

class Spider2(scrapy.Spider):
    name = 'spider2'
    # Different cookie jar

Cookies from spider1 don't affect spider2.
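
Within a single spider, you can also keep several independent sessions by using the cookiejar key in Request.meta (a documented Scrapy feature; the URLs are placeholders):

import scrapy

class MultiSessionSpider(scrapy.Spider):
    name = 'multi_session'

    def start_requests(self):
        # Each distinct 'cookiejar' value gets its own independent
        # session inside this one spider
        for session_id, url in enumerate(['https://example.com/a',
                                          'https://example.com/b']):
            yield scrapy.Request(url,
                                 meta={'cookiejar': session_id},
                                 callback=self.parse)

    def parse(self, response):
        # The jar is not carried over automatically: pass it along so
        # follow-up requests reuse the same session
        yield response.follow('/next',
                              meta={'cookiejar': response.meta['cookiejar']},
                              callback=self.parse_next)

    def parse_next(self, response):
        yield {'url': response.url, 'session': response.meta['cookiejar']}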


Persisting Cookies Between Runs

Save cookies to a file and reuse them:

import pickle
import scrapy

class PersistentCookieSpider(scrapy.Spider):
    name = 'persistent'
    cookie_file = 'cookies.pkl'

    def start_requests(self):
        # Load saved cookies
        cookies = self.load_cookies()

        if cookies:
            self.logger.info('Using saved cookies')
            yield scrapy.Request(
                'https://example.com/dashboard',
                cookies=cookies,
                callback=self.parse
            )
        else:
            self.logger.info('No saved cookies, starting fresh')
            yield scrapy.Request(
                'https://example.com',
                callback=self.parse
            )

    def parse(self, response):
        # Save cookies for the next run (skip if this response set none,
        # so we don't overwrite a previously saved session with nothing)
        cookies = self.extract_cookies(response)
        if cookies:
            self.save_cookies(cookies)

        yield {'url': response.url}

    def extract_cookies(self, response):
        cookies = {}
        for header in response.headers.getlist('Set-Cookie'):
            cookie_str = header.decode()
            name_value = cookie_str.split(';')[0]
            if '=' in name_value:
                name, value = name_value.split('=', 1)
                cookies[name] = value
        return cookies

    def save_cookies(self, cookies):
        with open(self.cookie_file, 'wb') as f:
            pickle.dump(cookies, f)
        self.logger.info(f'Saved {len(cookies)} cookies')

    def load_cookies(self):
        try:
            with open(self.cookie_file, 'rb') as f:
                cookies = pickle.load(f)
            self.logger.info(f'Loaded {len(cookies)} cookies')
            return cookies
        except FileNotFoundError:
            return None

Cookie Debugging Middleware

See all cookie activity:

# middlewares.py
class CookieDebugMiddleware:
    def process_request(self, request, spider):
        # Log cookies being sent
        cookie_header = request.headers.get('Cookie')
        if cookie_header:
            spider.logger.debug(f'[COOKIES OUT] {request.url}')
            spider.logger.debug(f'  {cookie_header.decode()}')
        return None

    def process_response(self, request, response, spider):
        # Log cookies being received
        set_cookies = response.headers.getlist('Set-Cookie')
        if set_cookies:
            spider.logger.debug(f'[COOKIES IN] {response.url}')
            for cookie in set_cookies:
                spider.logger.debug(f'  {cookie.decode()}')
        return response

Enable it:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CookieDebugMiddleware': 900,
}
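
Scrapy also ships a built-in switch for exactly this: with COOKIES_DEBUG enabled, the stock CookiesMiddleware logs every cookie sent and received, so you may not need custom middleware at all:

# settings.py
COOKIES_DEBUG = True  # built-in: logs Cookie and Set-Cookie headers per request/response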

Handling Cookie Expiration

Cookies expire. Handle it:

def parse(self, response):
    # Check if session expired
    if 'login' in response.url or 'session expired' in response.text.lower():
        self.logger.warning('Session expired!')

        # Re-authenticate
        yield scrapy.Request(
            'https://example.com/login',
            callback=self.login,
            dont_filter=True
        )
        return

    # Normal processing
    yield {'data': response.css('.content::text').get()}
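
The login callback referenced above isn't shown; a minimal sketch might look like this (the form field names are hypothetical and must match the site's real login form):

def login(self, response):
    # from_response picks up hidden fields (e.g. a CSRF token) from the form
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'your_user', 'password': 'your_pass'},
        callback=self.parse,
    )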

Domain-Specific Cookies

Cookies are domain-specific by default:

# example.com cookies
cookies_example = {'session': 'abc'}
yield scrapy.Request('https://example.com', cookies=cookies_example)

# different-site.com cookies
cookies_other = {'session': 'xyz'}
yield scrapy.Request('https://different-site.com', cookies=cookies_other)

Scrapy keeps them separate automatically.


Cookie Pool (Advanced)

Rotate through multiple accounts:

import scrapy

class CookiePoolSpider(scrapy.Spider):
    name = 'pool'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Pool of cookie sets (different accounts)
        self.cookie_pool = [
            {'session': 'account1_session', 'user_id': '1'},
            {'session': 'account2_session', 'user_id': '2'},
            {'session': 'account3_session', 'user_id': '3'},
        ]

        self.current_cookie_index = 0

    def start_requests(self):
        for i in range(100):
            # Rotate cookies -- capture the index before get_next_cookies()
            # advances it, so meta records the account actually used
            index = self.current_cookie_index
            cookies = self.get_next_cookies()

            yield scrapy.Request(
                f'https://example.com/page{i}',
                cookies=cookies,
                callback=self.parse,
                meta={'cookie_index': index}
            )

    def get_next_cookies(self):
        cookies = self.cookie_pool[self.current_cookie_index]
        self.current_cookie_index = (self.current_cookie_index + 1) % len(self.cookie_pool)
        return cookies

    def parse(self, response):
        cookie_index = response.meta['cookie_index']
        self.logger.info(f'Using account {cookie_index + 1}')

        yield {'url': response.url, 'account': cookie_index + 1}

Use case: Distribute load across multiple accounts to avoid rate limits.


Ignoring Cookies for Specific Requests

Sometimes you want one request without cookies:

def parse(self, response):
    # This request sends cookies (normal)
    yield scrapy.Request('https://example.com/page1', callback=self.parse_page)

    # This request bypasses the cookie jar: no stored cookies are sent,
    # and cookies set by the response are not saved
    yield scrapy.Request(
        'https://example.com/page2',
        callback=self.parse_page,
        meta={'dont_merge_cookies': True}
    )

Cookie Middleware Priority

Cookie middleware runs at priority 700:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}

Position your custom middleware relative to 700 depending on what it needs (see the example below):

  • Lower than 700: its process_request runs before CookiesMiddleware, so it can still change request.cookies before they're merged into the jar
  • Higher than 700: its process_request runs after CookiesMiddleware, so it sees the final Cookie header (this is why the debug middleware above uses 900)
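
For example (CookiePrepMiddleware is a hypothetical name for illustration; the debug middleware is the one defined earlier):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # process_request runs in ascending priority order, so 650 fires before
    # CookiesMiddleware (700): request.cookies can still be changed here
    'myproject.middlewares.CookiePrepMiddleware': 650,
    # 900 fires after 700, so the final Cookie header is visible here
    'myproject.middlewares.CookieDebugMiddleware': 900,
}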

Third-Party Cookie Libraries

Using http.cookiejar

import scrapy
from http.cookiejar import MozillaCookieJar

class CookieJarSpider(scrapy.Spider):
    name = 'jar'

    def start_requests(self):
        # Load cookies from Netscape/Mozilla format file
        jar = MozillaCookieJar('cookies.txt')
        jar.load(ignore_discard=True, ignore_expires=True)

        # Convert to dict
        cookies = {cookie.name: cookie.value for cookie in jar}

        yield scrapy.Request(
            'https://example.com',
            cookies=cookies,
            callback=self.parse
        )
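
Going the other way, you can write a plain cookie dict back out in Netscape format so browsers and other tools can load it. A sketch (save_cookies_txt is a helper name introduced here; the defaults suit simple session cookies):

from http.cookiejar import Cookie, MozillaCookieJar

def save_cookies_txt(cookies, domain, filename='cookies.txt'):
    # Rebuild http.cookiejar.Cookie objects from a name -> value dict.
    # The constructor is verbose; most fields can be defaulted.
    jar = MozillaCookieJar(filename)
    for name, value in cookies.items():
        jar.set_cookie(Cookie(
            version=0, name=name, value=value,
            port=None, port_specified=False,
            domain=domain, domain_specified=True, domain_initial_dot=False,
            path='/', path_specified=True,
            secure=False, expires=None, discard=True,
            comment=None, comment_url=None, rest={},
        ))
    jar.save(ignore_discard=True, ignore_expires=True)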

Common Cookie Issues

Issue #1: Cookies Not Being Sent

Problem: Site expects cookies but doesn't get them

Debug:

def parse(self, response):
    sent_cookies = response.request.headers.get('Cookie')
    if not sent_cookies:
        self.logger.error('No cookies were sent!')
    else:
        self.logger.info(f'Sent: {sent_cookies}')

Solution: Make sure COOKIES_ENABLED = True

Issue #2: Cookies Not Being Stored

Problem: Scrapy receives cookies but doesn't save them

Debug:

def parse(self, response):
    received_cookies = response.headers.getlist('Set-Cookie')
    if received_cookies:
        self.logger.info(f'Received {len(received_cookies)} cookies')
        for cookie in received_cookies:
            self.logger.info(f'  {cookie}')

Solution: Check cookie middleware is enabled
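
You can confirm this from the "Enabled downloader middlewares" list Scrapy logs at startup: CookiesMiddleware should appear in it. It's on by default, so the usual culprit is a project setting that turned it off:

# settings.py -- both conditions must hold for cookies to be stored
COOKIES_ENABLED = True  # the middleware is skipped entirely when this is False

# and nothing should map the middleware to None, which disables it:
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
# }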

Issue #3: Wrong Domain for Cookies

Problem: Cookies for example.com being sent to other-site.com

This shouldn't happen: Scrapy's cookie jar applies each cookie's domain rules automatically.

If it does, the usual cause is setting cookies outside the jar, e.g. writing the Cookie header by hand instead of using the cookies parameter:

# AVOID (a hand-written header bypasses the jar's domain scoping)
yield scrapy.Request(url, headers={'Cookie': 'session=abc'})

# PREFER (the cookies parameter goes through the jar, which scopes
# cookies to the request's domain)
yield scrapy.Request(url, cookies=cookies)

(Note: Request.meta['cookiejar'] is a legitimate, documented way to keep multiple sessions in one spider, as sketched in the Cookie Jar section above; it doesn't leak cookies across domains.)

Real-World Example: E-Commerce with Cart

Shopping cart tracking with cookies:

import scrapy

class ShoppingSpider(scrapy.Spider):
    name = 'shopping'

    def start_requests(self):
        # Visit homepage (gets session cookie)
        yield scrapy.Request(
            'https://shop.example.com',
            callback=self.parse_home
        )

    def parse_home(self, response):
        # Session cookie now stored automatically
        self.logger.info('Session established')

        # Browse category (uses session cookie)
        yield scrapy.Request(
            'https://shop.example.com/electronics',
            callback=self.parse_category
        )

    def parse_category(self, response):
        # Session cookie sent automatically

        for product in response.css('.product'):
            product_url = product.css('a::attr(href)').get()

            # Each request uses same session
            yield response.follow(
                product_url,
                callback=self.parse_product
            )

    def parse_product(self, response):
        # Session cookie still active

        # Add to cart
        add_to_cart_url = 'https://shop.example.com/cart/add'

        # formdata values must be strings, and .get() can return None,
        # so fall back to an empty string
        product_id = response.css('.product-id::text').get() or ''

        yield scrapy.FormRequest(
            add_to_cart_url,
            formdata={
                'product_id': product_id,
                'quantity': '1'
            },
            callback=self.after_add_to_cart
        )

    def after_add_to_cart(self, response):
        # Cart now has item (tracked by session cookie)

        # View cart
        yield scrapy.Request(
            'https://shop.example.com/cart',
            callback=self.parse_cart
        )

    def parse_cart(self, response):
        # Extract cart items
        items = response.css('.cart-item')

        for item in items:
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
                'quantity': item.css('.quantity::text').get()
            }

All cookie handling is automatic!


Cookie Security Best Practices

Don't Log Sensitive Cookies

def parse(self, response):
    # BAD (logs sensitive data)
    self.logger.info(f'Cookies: {response.headers.getlist("Set-Cookie")}')

    # GOOD (log only cookie names)
    cookie_names = [c.decode().split('=')[0] for c in response.headers.getlist('Set-Cookie')]
    self.logger.info(f'Cookie names: {cookie_names}')

Store Cookies Securely

import json
import os

import scrapy
from cryptography.fernet import Fernet

class SecureCookieSpider(scrapy.Spider):
    name = 'secure'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The key must persist across runs, or previously saved cookies can
        # never be decrypted. Generate it once with Fernet.generate_key() and
        # supply it out-of-band (the env var name here is just an example).
        self.key = os.environ['COOKIE_KEY'].encode()
        self.cipher = Fernet(self.key)

    def save_cookies(self, cookies):
        # Encrypt before saving
        cookie_json = json.dumps(cookies)
        encrypted = self.cipher.encrypt(cookie_json.encode())

        with open('cookies.enc', 'wb') as f:
            f.write(encrypted)

    def load_cookies(self):
        try:
            with open('cookies.enc', 'rb') as f:
                encrypted = f.read()

            # Decrypt
            decrypted = self.cipher.decrypt(encrypted)
            return json.loads(decrypted)
        except FileNotFoundError:
            return None

Quick Reference

Enable/Disable Cookies

# settings.py
COOKIES_ENABLED = True   # Enable (default)
COOKIES_ENABLED = False  # Disable

Set Cookies

# In start_requests
cookies = {'session': 'abc123'}
yield scrapy.Request(url, cookies=cookies)

# In parse
yield scrapy.Request(url, cookies={'key': 'value'})

Debug Cookies

# Received
response.headers.getlist('Set-Cookie')

# Sent
response.request.headers.get('Cookie')

Save/Load Cookies

import pickle

# Save
with open('cookies.pkl', 'wb') as f:
    pickle.dump(cookies, f)

# Load
with open('cookies.pkl', 'rb') as f:
    cookies = pickle.load(f)

Summary

Scrapy handles cookies automatically:

  • Stores cookies from Set-Cookie headers
  • Sends cookies with requests
  • Maintains per-spider cookie jar
  • Handles expiration

When you need manual control:

  • Set initial cookies (login sessions)
  • Persist cookies between runs
  • Rotate cookie pools
  • Debug cookie issues

Best practices:

  • Leave COOKIES_ENABLED = True (default)
  • Use cookies parameter for initial cookies
  • Save cookies for session persistence
  • Don't log sensitive cookie values
  • Handle cookie expiration

Remember:

  • Cookies are per-domain automatically
  • Each spider has separate cookie jar
  • Scrapy handles cookie paths and domains
  • Session persistence requires manual save/load

In most cases, Scrapy's automatic cookie handling just works. Only intervene when you need session persistence or multiple accounts!

Happy scraping! 🕷️
