I once built a spider that scraped perfectly for 5 pages, then randomly failed on the 6th. Sometimes it worked, sometimes it didn't. I was going crazy.
Turns out the website was setting a session cookie on page 1 that expired after 5 pages. My spider didn't handle cookies properly, so page 6 always failed.
Once I understood cookie handling, the spider became 100% reliable. Let me show you everything about cookies in Scrapy.
Understanding Cookies
What are cookies?
- Small pieces of data stored by the browser (or any HTTP client)
- Sent with every request to the same domain
- Used for sessions, preferences, and tracking
Why sites use cookies:
- Remember who you are (session)
- Track your activity
- Store preferences
- Anti-bot protection
Why you need to handle them:
- Sites expect cookies
- Sessions won't work without them
- You'll look like a bot
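Concretely, a cookie is just a pair of HTTP headers: the server sets it with Set-Cookie, and the client echoes it back with Cookie on every later request to the same domain. A minimal sketch of the exchange (hypothetical values):

# First response from the server
Set-Cookie: sessionid=abc123; Path=/; HttpOnly

# Every later request your client makes to that domain
Cookie: sessionid=abc123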
Scrapy's Default Cookie Handling
Good news: Scrapy handles cookies automatically!
# settings.py
COOKIES_ENABLED = True # This is the default
What Scrapy does automatically:
- Stores cookies from responses
- Sends cookies with requests
- Maintains separate cookie jar per spider
- Handles cookie expiration
You usually don't need to do anything!
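To see how little you have to write, here is a minimal sketch (URLs and form fields are placeholders): the spider never touches a cookie, yet the session survives across every request, because the CookiesMiddleware stores the Set-Cookie from the login response and attaches it to everything that follows.

import scrapy


class AutoCookieSpider(scrapy.Spider):
    name = 'auto_cookie'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Submit the login form; the session cookie in the reply
        # is stored in this spider's cookie jar automatically
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        # The stored session cookie is sent automatically from here on
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse_dashboard
        )

    def parse_dashboard(self, response):
        yield {'title': response.css('title::text').get()}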
Checking If Cookies Are Working
View Cookies in Response
def parse(self, response):
    # Log cookies received
    set_cookies = response.headers.getlist('Set-Cookie')
    for cookie in set_cookies:
        self.logger.info(f'Received cookie: {cookie}')
    yield {'url': response.url}
Check Cookies Being Sent
def parse(self, response):
    # Log cookies sent with the request
    request_cookies = response.request.headers.get('Cookie')
    self.logger.info(f'Sent cookies: {request_cookies}')
    yield {'url': response.url}
Disabling Cookies (When Needed)
Sometimes you want to disable cookies:
# settings.py
COOKIES_ENABLED = False
When to disable:
- Scraping public pages (no session needed)
- Want to avoid tracking
- Testing how site behaves without cookies
- Slight performance boost
When NOT to disable:
- Site requires login
- Site uses sessions
- Site tracks state across pages
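If only some spiders should skip cookies, you can also disable them per spider instead of project-wide. custom_settings is a standard Scrapy attribute; the spider below is just a sketch with a placeholder URL.

import scrapy


class NoCookieSpider(scrapy.Spider):
    name = 'no_cookie'
    start_urls = ['https://example.com/public-page']

    # Overrides the project-wide COOKIES_ENABLED for this spider only
    custom_settings = {'COOKIES_ENABLED': False}

    def parse(self, response):
        yield {'url': response.url}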
Setting Initial Cookies
Provide cookies from the start:
Method 1: In start_requests
class CookieSpider(scrapy.Spider):
    name = 'cookie'

    def start_requests(self):
        cookies = {
            'session_id': 'abc123',
            'user_token': 'xyz789',
            'preferences': 'dark_mode'
        }
        yield scrapy.Request(
            'https://example.com',
            cookies=cookies,
            callback=self.parse
        )
Method 2: Per Request
def parse(self, response):
    cookies = {
        'page_state': 'viewed',
        'timestamp': '12345'
    }
    yield scrapy.Request(
        'https://example.com/next',
        cookies=cookies,
        callback=self.parse_next
    )
Method 3: From Browser
Get cookies from your browser and use them:
Chrome:
- Login to site
- F12 → Application → Cookies
- Copy cookie names and values
def start_requests(self):
    # Cookies copied from Chrome
    cookies = {
        'sessionid': 'abc123def456',
        'csrftoken': 'xyz789',
        '_ga': 'GA1.2.123456789.1234567890'
    }
    yield scrapy.Request(
        'https://example.com/dashboard',
        cookies=cookies,
        callback=self.parse
    )
Cookie Jar Per Spider
Each spider gets its own cookie jar:
class Spider1(scrapy.Spider):
    name = 'spider1'
    # Has its own cookie jar


class Spider2(scrapy.Spider):
    name = 'spider2'
    # Different cookie jar
Cookies from spider1 don't affect spider2.
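You can also keep several independent sessions inside one spider by tagging requests with the cookiejar meta key - a standard CookiesMiddleware feature. Each distinct value gets its own jar, but the key is not sticky, so you have to pass it along yourself. A short sketch with placeholder URLs:

def start_requests(self):
    # One independent cookie jar per session id
    for session_id, url in enumerate(['https://example.com/a', 'https://example.com/b']):
        yield scrapy.Request(url, meta={'cookiejar': session_id}, callback=self.parse)

def parse(self, response):
    # Pass the cookiejar key on explicitly, or the follow-up request
    # falls back to the default jar
    yield scrapy.Request(
        'https://example.com/next',
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_next
    )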
Persisting Cookies Between Runs
Save cookies to file and reuse:
import pickle

import scrapy


class PersistentCookieSpider(scrapy.Spider):
    name = 'persistent'
    cookie_file = 'cookies.pkl'

    def start_requests(self):
        # Load saved cookies
        cookies = self.load_cookies()
        if cookies:
            self.logger.info('Using saved cookies')
            yield scrapy.Request(
                'https://example.com/dashboard',
                cookies=cookies,
                callback=self.parse
            )
        else:
            self.logger.info('No saved cookies, starting fresh')
            yield scrapy.Request(
                'https://example.com',
                callback=self.parse
            )

    def parse(self, response):
        # Save cookies for the next run
        cookies = self.extract_cookies(response)
        self.save_cookies(cookies)
        yield {'url': response.url}

    def extract_cookies(self, response):
        cookies = {}
        for header in response.headers.getlist('Set-Cookie'):
            cookie_str = header.decode()
            name_value = cookie_str.split(';')[0]
            if '=' in name_value:
                name, value = name_value.split('=', 1)
                cookies[name] = value
        return cookies

    def save_cookies(self, cookies):
        with open(self.cookie_file, 'wb') as f:
            pickle.dump(cookies, f)
        self.logger.info(f'Saved {len(cookies)} cookies')

    def load_cookies(self):
        try:
            with open(self.cookie_file, 'rb') as f:
                cookies = pickle.load(f)
            self.logger.info(f'Loaded {len(cookies)} cookies')
            return cookies
        except FileNotFoundError:
            return None
Cookie Debugging Middleware
See all cookie activity:
# middlewares.py
class CookieDebugMiddleware:
    def process_request(self, request, spider):
        # Log cookies being sent
        cookie_header = request.headers.get('Cookie')
        if cookie_header:
            spider.logger.debug(f'[COOKIES OUT] {request.url}')
            spider.logger.debug(f'  {cookie_header.decode()}')
        return None

    def process_response(self, request, response, spider):
        # Log cookies being received
        set_cookies = response.headers.getlist('Set-Cookie')
        if set_cookies:
            spider.logger.debug(f'[COOKIES IN] {response.url}')
            for cookie in set_cookies:
                spider.logger.debug(f'  {cookie.decode()}')
        return response
Enable it:
# settings.py
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CookieDebugMiddleware': 900,
}
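Scrapy also has a built-in switch that makes the stock CookiesMiddleware log every Cookie and Set-Cookie header at DEBUG level - often enough on its own, without a custom middleware:

# settings.py
COOKIES_DEBUG = True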
Handling Cookie Expiration
Cookies expire. Handle it:
def parse(self, response):
    # Check if the session expired
    if 'login' in response.url or 'session expired' in response.text.lower():
        self.logger.warning('Session expired!')
        # Re-authenticate
        yield scrapy.Request(
            'https://example.com/login',
            callback=self.login,
            dont_filter=True
        )
        return

    # Normal processing
    yield {'data': response.css('.content::text').get()}
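The self.login callback referenced above isn't shown; here is a minimal sketch of what it could look like, assuming a standard HTML login form (field names, selectors, and URLs are placeholders):

def login(self, response):
    # Re-submit credentials; the fresh session cookie is stored automatically
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'username': 'user', 'password': 'pass'},
        callback=self.after_login
    )

def after_login(self, response):
    # Retry the protected page that originally failed
    yield scrapy.Request(
        'https://example.com/protected',
        callback=self.parse,
        dont_filter=True
    )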
Domain-Specific Cookies
Cookies are domain-specific by default:
# example.com cookies
cookies_example = {'session': 'abc'}
yield scrapy.Request('https://example.com', cookies=cookies_example)
# different-site.com cookies
cookies_other = {'session': 'xyz'}
yield scrapy.Request('https://different-site.com', cookies=cookies_other)
Scrapy keeps them separate automatically.
Cookie Pool (Advanced)
Rotate through multiple accounts:
class CookiePoolSpider(scrapy.Spider):
    name = 'pool'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Pool of cookie sets (different accounts)
        self.cookie_pool = [
            {'session': 'account1_session', 'user_id': '1'},
            {'session': 'account2_session', 'user_id': '2'},
            {'session': 'account3_session', 'user_id': '3'},
        ]
        self.current_cookie_index = 0

    def start_requests(self):
        for i in range(100):
            # Rotate cookies (capture the index before it advances)
            cookie_index = self.current_cookie_index
            cookies = self.get_next_cookies()
            yield scrapy.Request(
                f'https://example.com/page{i}',
                cookies=cookies,
                callback=self.parse,
                meta={'cookie_index': cookie_index}
            )

    def get_next_cookies(self):
        cookies = self.cookie_pool[self.current_cookie_index]
        self.current_cookie_index = (self.current_cookie_index + 1) % len(self.cookie_pool)
        return cookies

    def parse(self, response):
        cookie_index = response.meta['cookie_index']
        self.logger.info(f'Using account {cookie_index + 1}')
        yield {'url': response.url, 'account': cookie_index + 1}
Use case: distribute load across multiple accounts to avoid rate limits. Note that all of these requests still share one in-memory cookie jar, so cookies set by responses can bleed between accounts; for strict isolation, give each account its own jar with the cookiejar meta key shown earlier.
Ignoring Cookies for Specific Requests
Sometimes you want one request without cookies:
def parse(self, response):
    # This request sends and stores cookies as usual
    yield scrapy.Request('https://example.com/page1', callback=self.parse_page)

    # This request is skipped by the cookie middleware entirely:
    # no stored cookies are sent, and its Set-Cookie responses are not saved
    yield scrapy.Request(
        'https://example.com/page2',
        callback=self.parse_page,
        meta={'dont_merge_cookies': True}
    )
Cookie Middleware Priority
Cookie middleware runs at priority 700:
# settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}
Your custom middleware's priority relative to 700 determines what it sees:
- Lower than 700: its process_request runs before CookiesMiddleware, so it can adjust a request's cookies before they are written into the Cookie header
- Higher than 700: its process_request runs after CookiesMiddleware and sees the final Cookie header (which is why the debug middleware above is registered at 900)
Third-Party Cookie Libraries
Using http.cookiejar
from http.cookiejar import MozillaCookieJar

import scrapy


class CookieJarSpider(scrapy.Spider):
    name = 'jar'

    def start_requests(self):
        # Load cookies from a Netscape/Mozilla format file
        jar = MozillaCookieJar('cookies.txt')
        jar.load(ignore_discard=True, ignore_expires=True)

        # Convert to a dict Scrapy understands
        cookies = {cookie.name: cookie.value for cookie in jar}

        yield scrapy.Request(
            'https://example.com',
            cookies=cookies,
            callback=self.parse
        )
Common Cookie Issues
Issue #1: Cookies Not Being Sent
Problem: Site expects cookies but doesn't get them
Debug:
def parse(self, response):
    sent_cookies = response.request.headers.get('Cookie')
    if not sent_cookies:
        self.logger.error('No cookies were sent!')
    else:
        self.logger.info(f'Sent: {sent_cookies}')
Solution: Make sure COOKIES_ENABLED = True
Issue #2: Cookies Not Being Stored
Problem: Scrapy receives cookies but doesn't save them
Debug:
def parse(self, response):
    received_cookies = response.headers.getlist('Set-Cookie')
    if received_cookies:
        self.logger.info(f'Received {len(received_cookies)} cookies')
        for cookie in received_cookies:
            self.logger.info(f'  {cookie}')
Solution: Check cookie middleware is enabled
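A quick way to check is to scan your settings for anything that switches the middleware off (the commented-out line below is the kind of entry that silently breaks cookie storage when set to None):

# settings.py
COOKIES_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,  # <- must NOT be disabled like this
}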
Issue #3: Wrong Domain for Cookies
Problem: Cookies for example.com being sent to other-site.com
This shouldn't happen - Scrapy handles domains automatically.
If it does, check how you are setting cookies manually - the cookies argument is the safe route:
# RISKY (a hand-written Cookie header is sent exactly as you wrote it,
# with no domain scoping, and can clash with the cookie middleware)
yield scrapy.Request(url, headers={'Cookie': 'session=abc'})

# RIGHT (these go through the cookie jar and are scoped to the target domain)
yield scrapy.Request(url, cookies=cookies)
Real-World Example: E-Commerce with Cart
Shopping cart tracking with cookies:
class ShoppingSpider(scrapy.Spider):
    name = 'shopping'

    def start_requests(self):
        # Visit the homepage (gets a session cookie)
        yield scrapy.Request(
            'https://shop.example.com',
            callback=self.parse_home
        )

    def parse_home(self, response):
        # The session cookie is now stored automatically
        self.logger.info('Session established')
        # Browse a category (uses the session cookie)
        yield scrapy.Request(
            'https://shop.example.com/electronics',
            callback=self.parse_category
        )

    def parse_category(self, response):
        # Session cookie sent automatically
        for product in response.css('.product'):
            product_url = product.css('a::attr(href)').get()
            # Each request uses the same session
            yield response.follow(
                product_url,
                callback=self.parse_product
            )

    def parse_product(self, response):
        # Session cookie still active; add the product to the cart
        add_to_cart_url = 'https://shop.example.com/cart/add'
        yield scrapy.FormRequest(
            add_to_cart_url,
            formdata={
                'product_id': response.css('.product-id::text').get(),
                'quantity': '1'
            },
            callback=self.after_add_to_cart
        )

    def after_add_to_cart(self, response):
        # The cart now has an item (tracked by the session cookie); view it
        yield scrapy.Request(
            'https://shop.example.com/cart',
            callback=self.parse_cart
        )

    def parse_cart(self, response):
        # Extract cart items
        for item in response.css('.cart-item'):
            yield {
                'name': item.css('.name::text').get(),
                'price': item.css('.price::text').get(),
                'quantity': item.css('.quantity::text').get()
            }
All cookie handling is automatic!
Cookie Security Best Practices
Don't Log Sensitive Cookies
def parse(self, response):
    # BAD (logs sensitive values)
    self.logger.info(f'Cookies: {response.headers.getlist("Set-Cookie")}')

    # GOOD (log only cookie names)
    cookie_names = [c.decode().split('=')[0] for c in response.headers.getlist('Set-Cookie')]
    self.logger.info(f'Cookie names: {cookie_names}')
Store Cookies Securely
import json

import scrapy
from cryptography.fernet import Fernet


class SecureCookieSpider(scrapy.Spider):
    name = 'secure'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # NOTE: in practice, load a persistent key from a secure place;
        # a key generated per run cannot decrypt files from earlier runs
        self.key = Fernet.generate_key()
        self.cipher = Fernet(self.key)

    def save_cookies(self, cookies):
        # Encrypt before saving
        cookie_json = json.dumps(cookies)
        encrypted = self.cipher.encrypt(cookie_json.encode())
        with open('cookies.enc', 'wb') as f:
            f.write(encrypted)

    def load_cookies(self):
        try:
            with open('cookies.enc', 'rb') as f:
                encrypted = f.read()
            # Decrypt
            decrypted = self.cipher.decrypt(encrypted)
            return json.loads(decrypted)
        except FileNotFoundError:
            return None
Quick Reference
Enable/Disable Cookies
# settings.py
COOKIES_ENABLED = True # Enable (default)
COOKIES_ENABLED = False # Disable
Set Cookies
# In start_requests
cookies = {'session': 'abc123'}
yield scrapy.Request(url, cookies=cookies)
# In parse
yield scrapy.Request(url, cookies={'key': 'value'})
Debug Cookies
# Received
response.headers.getlist('Set-Cookie')
# Sent
response.request.headers.get('Cookie')
Save/Load Cookies
import pickle

# Save
with open('cookies.pkl', 'wb') as f:
    pickle.dump(cookies, f)

# Load
with open('cookies.pkl', 'rb') as f:
    cookies = pickle.load(f)
Summary
Scrapy handles cookies automatically:
- Stores cookies from Set-Cookie headers
- Sends cookies with requests
- Maintains per-spider cookie jar
- Handles expiration
When you need manual control:
- Set initial cookies (login sessions)
- Persist cookies between runs
- Rotate cookie pools
- Debug cookie issues
Best practices:
- Leave COOKIES_ENABLED = True (default)
- Use cookies parameter for initial cookies
- Save cookies for session persistence
- Don't log sensitive cookie values
- Handle cookie expiration
Remember:
- Cookies are per-domain automatically
- Each spider has separate cookie jar
- Scrapy handles cookie paths and domains
- Session persistence requires manual save/load
In most cases, Scrapy's automatic cookie handling just works. Only intervene when you need session persistence or multiple accounts!
Happy scraping! 🕷️