The first time I needed to scrape a site behind a login, I was stuck for days. I could see the data when logged in through my browser, but my spider just got redirected to the login page.
I tried copying cookies manually. Didn't work. I tried storing session data. Still failed. I was about to give up.
Then I learned how authentication actually works. Suddenly, logging in with Scrapy became easy. Let me show you every authentication method that works.
Understanding Web Authentication
Before we code, understand how login works:
1. You visit the login page
   - Server sends you a form
   - Form has a hidden CSRF token
2. You submit credentials
   - Username + password + CSRF token
   - POST request to the server
3. Server validates
   - Checks username/password
   - Creates a session
4. Server sends a session cookie
   - Cookie stored in the browser
   - Cookie sent with every request
5. You access protected pages
   - Cookie proves you're logged in
   - Server shows private data
Your spider needs to replicate steps 1-5.
Method 1: FormRequest (Simple Forms)
For basic username/password forms.
Basic Example
```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Submit the login form
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login succeeded
        if 'logout' in response.text:
            self.logger.info('Login successful!')
            # Now scrape protected pages
            yield scrapy.Request(
                'https://example.com/dashboard',
                callback=self.parse_dashboard
            )
        else:
            self.logger.error('Login failed!')

    def parse_dashboard(self, response):
        # Scrape protected content
        yield {
            'data': response.css('.private-data::text').get()
        }
```
What from_response Does
FormRequest.from_response() is magic. It:
- Finds the form automatically
- Extracts hidden fields (CSRF tokens)
- Fills in your credentials
- Submits the form
You don't manually handle CSRF tokens!
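If the page contains more than one form, you can also tell from_response() exactly which one to fill in before falling back to a fully manual request. A short sketch using its form-selection arguments (the CSS selector and form name below are placeholders for your target site):

```python
import scrapy

def parse(self, response):
    # Point from_response() at the right form instead of the first one it finds
    return scrapy.FormRequest.from_response(
        response,
        formcss='form#login-form',   # select the form by CSS selector
        # formname='login',          # ...or by its name attribute
        # formnumber=1,              # ...or by its position on the page
        formdata={
            'username': 'your_username',
            'password': 'your_password'
        },
        callback=self.after_login
    )
```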
Method 2: Manual FormRequest (When Auto-Detect Fails)
Sometimes from_response() picks the wrong form. Do it manually:
```python
def parse(self, response):
    # Manually create the FormRequest, including the CSRF token yourself
    return scrapy.FormRequest(
        url='https://example.com/login',
        formdata={
            'username': 'your_username',
            'password': 'your_password',
            'csrf_token': response.css('input[name="csrf_token"]::attr(value)').get()
        },
        callback=self.after_login
    )
```
Extract CSRF Token Manually
Different sites hide CSRF tokens differently:
```python
# Hidden input field
csrf = response.css('input[name="csrf_token"]::attr(value)').get()

# Meta tag
csrf = response.css('meta[name="csrf-token"]::attr(content)').get()

# In a JavaScript variable
import re
csrf = re.search(r'csrfToken = "([^"]+)"', response.text).group(1)

# In a cookie (header values are bytes, so decode before splitting)
set_cookie = response.headers.getlist('Set-Cookie')[0].decode()
csrf = set_cookie.split('csrf=')[1].split(';')[0]
```
Method 3: Start Requests (Login Before Scraping)
Log in before the spider starts crawling:
```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    login_url = 'https://example.com/login'

    def start_requests(self):
        # Start by logging in
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Submit the login form
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Verify login
        if 'logout' not in response.text:
            self.logger.error('Login failed!')
            return

        # Login successful, start scraping
        yield scrapy.Request('https://example.com/page1', callback=self.parse)
        yield scrapy.Request('https://example.com/page2', callback=self.parse)

    def parse(self, response):
        # Scrape authenticated content
        yield {'data': response.css('.content::text').get()}
```
Method 4: Cookie-Based Authentication
Some sites just need cookies, no form submission.
Pass Cookies Directly
```python
import scrapy

class CookieSpider(scrapy.Spider):
    name = 'cookie'

    def start_requests(self):
        cookies = {
            'session_id': 'abc123',
            'user_token': 'xyz789'
        }
        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}
```
Get Cookies from Browser
Chrome:
- Log in to the site in Chrome
- Press F12 to open DevTools
- Go to the Application tab → Cookies
- Copy the cookie values

Or load them from a cookies.txt export (Netscape format):
```python
import scrapy
from http.cookiejar import MozillaCookieJar

class CookieFileSpider(scrapy.Spider):
    name = 'cookiefile'

    def start_requests(self):
        # Load cookies from a cookies.txt file
        jar = MozillaCookieJar('cookies.txt')
        jar.load()
        cookies = {cookie.name: cookie.value for cookie in jar}
        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}
```
Method 5: Headers-Based Authentication (API Tokens)
For sites using Bearer tokens or API keys:
```python
import scrapy

class TokenSpider(scrapy.Spider):
    name = 'token'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your_access_token_here'
        }
        yield scrapy.Request(
            'https://api.example.com/data',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        data = response.json()
        yield data
```
Common Header Patterns
```python
# Bearer token
'Authorization': 'Bearer abc123xyz'

# Basic auth (username:password, base64 encoded)
'Authorization': 'Basic dXNlcjpwYXNz'

# API key
'X-API-Key': 'your_api_key'

# Custom auth header
'X-Auth-Token': 'your_token'
```
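For Basic auth, you don't have to hard-code the encoded value; you can build it with the standard library. A minimal sketch with placeholder credentials:

```python
import base64

username = 'user'
password = 'pass'

# Basic auth is just "username:password", base64 encoded
token = base64.b64encode(f'{username}:{password}'.encode()).decode()
headers = {'Authorization': f'Basic {token}'}  # 'user:pass' -> 'Basic dXNlcjpwYXNz'
```

Scrapy also ships an HttpAuthMiddleware for this case: set http_user and http_pass attributes on your spider (plus http_auth_domain on recent versions) and the Authorization header is added for you.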
Method 6: OAuth Authentication
For sites using OAuth (Google, Facebook login):
```python
import scrapy

class OAuthSpider(scrapy.Spider):
    name = 'oauth'

    def __init__(self, access_token=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.access_token = access_token

    def start_requests(self):
        if not self.access_token:
            self.logger.error('No access token provided!')
            return

        headers = {
            'Authorization': f'Bearer {self.access_token}'
        }
        yield scrapy.Request(
            'https://api.example.com/me',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        yield response.json()
```
Run it with the token:

```bash
scrapy crawl oauth -a access_token="your_oauth_token"
```
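If you'd rather keep the token off the command line (it ends up in your shell history), one option is to fall back to an environment variable in __init__. A small sketch; OAUTH_TOKEN is just an assumed variable name:

```python
import os

def __init__(self, access_token=None, *args, **kwargs):
    super().__init__(*args, **kwargs)
    # Use -a access_token=... if given, otherwise read the OAUTH_TOKEN env var
    self.access_token = access_token or os.environ.get('OAUTH_TOKEN')
```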
Method 7: Session Persistence (Multiple Spiders)
Share authentication across multiple spider runs:
```python
import os
import pickle

import scrapy

class SessionSpider(scrapy.Spider):
    name = 'session'

    def start_requests(self):
        # Try to load a saved session
        cookies = self.load_cookies()
        if cookies:
            self.logger.info('Using saved session')
            yield scrapy.Request(
                'https://example.com/dashboard',
                cookies=cookies,
                callback=self.parse,
                errback=self.session_expired
            )
        else:
            self.logger.info('No saved session, logging in')
            yield scrapy.Request(
                'https://example.com/login',
                callback=self.login
            )

    def login(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Save cookies for the next run
        cookies = {}
        for cookie in response.headers.getlist('Set-Cookie'):
            name, value = cookie.decode().split(';')[0].split('=', 1)
            cookies[name] = value
        self.save_cookies(cookies)

        # Continue scraping
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse
        )

    def session_expired(self, failure):
        self.logger.warning('Session expired, logging in again')
        # Delete saved cookies
        self.delete_cookies()
        # Log in again
        yield scrapy.Request(
            'https://example.com/login',
            callback=self.login
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}

    def save_cookies(self, cookies):
        with open('session.pkl', 'wb') as f:
            pickle.dump(cookies, f)

    def load_cookies(self):
        try:
            with open('session.pkl', 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def delete_cookies(self):
        try:
            os.remove('session.pkl')
        except FileNotFoundError:
            pass
```
Verifying Login Success
Always verify login worked:
```python
def after_login(self, response):
    # Method 1: Check for a logout link
    if 'logout' in response.text or '/logout' in response.text:
        self.logger.info('Login successful (found logout link)')
        return True

    # Method 2: Check for the username
    username = response.css('.username::text').get()
    if username:
        self.logger.info(f'Login successful (logged in as {username})')
        return True

    # Method 3: Check for the login form (it shouldn't be there after login)
    login_form = response.css('form.login')
    if not login_form:
        self.logger.info('Login successful (no login form)')
        return True

    # Method 4: Check the URL (redirected to the dashboard?)
    if 'dashboard' in response.url or 'profile' in response.url:
        self.logger.info('Login successful (redirected to dashboard)')
        return True

    # Login failed
    self.logger.error('Login failed!')
    self.logger.error(f'Response URL: {response.url}')
    self.logger.error(f'Response status: {response.status}')

    # Save the response for debugging
    with open('login_failed.html', 'w') as f:
        f.write(response.text)
    return False
```
Handling 2FA (Two-Factor Authentication)
2FA is tricky. Options:
Option 1: Backup Codes
Some sites give backup codes. Use those:
```python
formdata={
    'username': 'your_username',
    'password': 'your_password',
    'backup_code': 'your_backup_code'
}
```
Option 2: App-Specific Passwords
Some sites (Google, GitHub) let you generate app passwords:
```python
formdata={
    'username': 'your_username',
    'password': 'your_app_specific_password'
}
```
Option 3: Disable 2FA for Bot Account
Create a separate account without 2FA for scraping (if the site's terms allow it).
Option 4: Use Selenium/Playwright
For sites requiring interactive 2FA:
```python
# Log in manually once with Selenium
# Save the session cookies
# Reuse those cookies in Scrapy
```
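Here's a minimal sketch of that hand-off, assuming you've already logged in interactively and dumped Selenium's driver.get_cookies() (a list of dicts with name and value keys) to a JSON file; the file name is an assumption:

```python
import json

import scrapy

class BrowserCookieSpider(scrapy.Spider):
    name = 'browser_cookies'

    def start_requests(self):
        # cookies.json: saved output of driver.get_cookies()
        # after a manual login that included the 2FA step
        with open('cookies.json') as f:
            exported = json.load(f)
        cookies = {c['name']: c['value'] for c in exported}
        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}
```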
Complete Real-World Example
Here's a production-ready authenticated spider:
```python
import time

import scrapy
from scrapy.exceptions import CloseSpider

class ProductionAuthSpider(scrapy.Spider):
    name = 'auth'
    login_url = 'https://example.com/login'

    def __init__(self, username=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not username or not password:
            raise ValueError('Username and password required')
        self.username = username
        self.password = password
        self.login_attempts = 0
        self.max_login_attempts = 3

    def start_requests(self):
        self.logger.info('Starting authentication...')
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        self.login_attempts += 1
        if self.login_attempts > self.max_login_attempts:
            raise CloseSpider('Max login attempts exceeded')

        self.logger.info(f'Login attempt {self.login_attempts}')

        # Extract CSRF token
        csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()
        if not csrf_token:
            self.logger.error('CSRF token not found')
            raise CloseSpider('Cannot extract CSRF token')

        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': self.username,
                'password': self.password,
                'csrf_token': csrf_token
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Verify login
        if not self.is_logged_in(response):
            self.logger.error('Login failed!')

            # Check for error messages
            error = response.css('.error-message::text').get()
            if error:
                self.logger.error(f'Error: {error}')

            # Save the failed response for debugging
            with open('login_failed.html', 'w') as f:
                f.write(response.text)

            # Retry login
            if self.login_attempts < self.max_login_attempts:
                time.sleep(2)  # Wait before retrying (blocks the crawler briefly)
                # dont_filter=True so the dupefilter doesn't drop the repeated login URL
                yield scrapy.Request(self.login_url, callback=self.login, dont_filter=True)
            else:
                raise CloseSpider('Login failed after maximum attempts')
            return

        self.logger.info('Login successful!')

        # Start scraping authenticated pages
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse_dashboard
        )

    def is_logged_in(self, response):
        # Multiple checks
        has_logout = 'logout' in response.text.lower()
        has_username = response.css('.user-profile').get() is not None
        no_login_form = not response.css('form.login-form').get()
        correct_url = 'dashboard' in response.url or 'profile' in response.url
        return has_logout or has_username or (no_login_form and correct_url)

    def parse_dashboard(self, response):
        # Check we're still authenticated
        if not self.is_logged_in(response):
            self.logger.warning('Session expired, re-authenticating')
            yield scrapy.Request(self.login_url, callback=self.login, dont_filter=True)
            return

        # Scrape authenticated content
        for item in response.css('.item'):
            yield {
                'name': item.css('.name::text').get(),
                'data': item.css('.data::text').get()
            }

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_dashboard)
```
Run it:

```bash
scrapy crawl auth -a username="myuser" -a password="mypass"
```
Common Authentication Issues
Issue #1: Login Succeeds but Next Request Fails
Problem: Cookies not being sent
Solution: Ensure cookies are enabled:
```python
# settings.py
COOKIES_ENABLED = True  # True by default; make sure nothing has turned it off
```
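To see exactly which cookies Scrapy stores and sends on each request, you can also turn on cookie debugging while you investigate:

```python
# settings.py
COOKIES_DEBUG = True  # logs every Cookie sent and Set-Cookie received
```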
Issue #2: CSRF Token Validation Failed
Problem: Token not being sent or wrong token
Solution: Extract and send correct token:
```python
# Check what field name the site uses
csrf = response.css('input[name="csrf_token"]::attr(value)').get()
csrf = response.css('input[name="_token"]::attr(value)').get()
csrf = response.css('input[name="authenticity_token"]::attr(value)').get()
```
Issue #3: Redirected to Login After First Page
Problem: Session not persisting
Solution: Check cookie handling:
```python
def after_login(self, response):
    # Log the cookies received
    cookies = response.headers.getlist('Set-Cookie')
    self.logger.info(f'Received cookies: {cookies}')
```
Summary
Authentication methods:
- FormRequest.from_response() - Automatic form handling
- Manual FormRequest - When auto-detect fails
- Cookies - Direct cookie passing
- Headers - Bearer tokens, API keys
- Session persistence - Save/load cookies
Best practices:
- Always verify login succeeded
- Handle CSRF tokens properly
- Save cookies for session persistence
- Add retry logic for login failures
- Log authentication steps
- Handle session expiration
Common patterns:
- Login in start_requests()
- Use FormRequest.from_response()
- Verify with logout link or username
- Continue to protected pages
Remember:
- Cookies enabled by default (good!)
- CSRF tokens in hidden fields
- Session cookies sent automatically
- Verify login before scraping
Start with FormRequest.from_response(). It handles 90% of login forms automatically!
Happy scraping! 🕷️