Muhammad Ikramullah Khan

Scrapy Authentication & Login Forms: Scrape Behind the Login Wall

The first time I needed to scrape a site behind a login, I was stuck for days. I could see the data when logged in through my browser, but my spider just got redirected to the login page.

I tried copying cookies manually. Didn't work. I tried storing session data. Still failed. I was about to give up.

Then I learned how authentication actually works. Suddenly, logging in with Scrapy became easy. Let me show you every authentication method that works.


Understanding Web Authentication

Before we code, understand how login works:

1. You visit login page

  • Server sends you a form
  • Form has hidden CSRF token

2. You submit credentials

  • Username + password + CSRF token
  • POST request to server

3. Server validates

  • Checks username/password
  • Creates session

4. Server sends session cookie

  • Cookie stored in browser
  • Cookie sent with every request

5. You access protected pages

  • Cookie proves you're logged in
  • Server shows private data

Your spider needs to replicate the client side of this flow: fetch the form, submit the credentials (plus any hidden tokens), and keep sending the session cookie with every later request.
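
To make the mechanics concrete, here's the same flow outside of Scrapy using the requests library (a minimal sketch; the URLs and field names are placeholders):

import requests

session = requests.Session()  # keeps cookies across requests, like a browser does

# Step 1: fetch the login page (any cookies it sets are stored on the session)
login_page = session.get('https://example.com/login')

# Step 2: submit credentials (a real site usually needs the CSRF token too)
session.post('https://example.com/login', data={
    'username': 'your_username',
    'password': 'your_password'
})

# Steps 3-4 happen on the server; the session cookie now lives in session.cookies

# Step 5: the cookie is sent automatically with every later request
dashboard = session.get('https://example.com/dashboard')
print(dashboard.status_code)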


Method 1: FormRequest (Simple Forms)

For basic username/password forms.

Basic Example

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Submit login form
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login succeeded
        if 'logout' in response.text:
            self.logger.info('Login successful!')

            # Now scrape protected pages
            yield scrapy.Request('https://example.com/dashboard', 
                               callback=self.parse_dashboard)
        else:
            self.logger.error('Login failed!')

    def parse_dashboard(self, response):
        # Scrape protected content
        yield {
            'data': response.css('.private-data::text').get()
        }

What from_response Does

FormRequest.from_response() is magic. It:

  • Finds the form automatically
  • Extracts hidden fields (CSRF tokens)
  • Fills in your credentials
  • Submits the form

You don't have to handle CSRF tokens manually!
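
Roughly, it does the equivalent of this under the hood (a simplified sketch, not Scrapy's actual implementation):

import scrapy

def build_login_request(response, username, password):
    formdata = {}

    # Collect every hidden input in the form (CSRF tokens, nonces, etc.)
    for hidden in response.css('form input[type="hidden"]'):
        name = hidden.attrib.get('name')
        if name:
            formdata[name] = hidden.attrib.get('value', '')

    # Overlay your credentials on top of the hidden fields
    formdata['username'] = username
    formdata['password'] = password

    # Submit to the form's action URL (or the current page if there isn't one)
    action = response.css('form::attr(action)').get() or response.url
    return scrapy.FormRequest(response.urljoin(action), formdata=formdata)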


Method 2: Manual FormRequest (When Auto-Detect Fails)

Sometimes from_response() picks the wrong form. Do it manually:

def parse(self, response):
    # Manually create FormRequest
    return scrapy.FormRequest(
        url='https://example.com/login',
        formdata={
            'username': 'your_username',
            'password': 'your_password',
            'csrf_token': response.css('input[name="csrf_token"]::attr(value)').get()
        },
        callback=self.after_login
    )

Extract CSRF Token Manually

Different sites hide CSRF tokens differently:

# Hidden input field
csrf = response.css('input[name="csrf_token"]::attr(value)').get()

# Meta tag
csrf = response.css('meta[name="csrf-token"]::attr(content)').get()

# In JavaScript variable
import re
csrf = re.search(r'csrfToken = "([^"]+)"', response.text).group(1)

# In a cookie (Set-Cookie header values are bytes, so decode first)
csrf = response.headers.getlist('Set-Cookie')[0].decode().split('csrf=')[1].split(';')[0]

Method 3: Start Requests (Login Before Scraping)

Log in before the spider starts crawling:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    login_url = 'https://example.com/login'

    def start_requests(self):
        # Start by logging in
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Submit login form
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Verify login
        if 'logout' not in response.text:
            self.logger.error('Login failed!')
            return

        # Login successful, start scraping
        yield scrapy.Request('https://example.com/page1', callback=self.parse)
        yield scrapy.Request('https://example.com/page2', callback=self.parse)

    def parse(self, response):
        # Scrape authenticated content
        yield {'data': response.css('.content::text').get()}

Method 4: Cookie-Based Authentication

Some sites just need cookies, no form submission.

Pass Cookies Directly

import scrapy

class CookieSpider(scrapy.Spider):
    name = 'cookie'

    def start_requests(self):
        cookies = {
            'session_id': 'abc123',
            'user_token': 'xyz789'
        }

        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}

Get Cookies from Browser

Chrome:

  1. Log in to the site in Chrome
  2. Press F12 (DevTools)
  3. Application tab → Cookies
  4. Copy cookie values
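
If you copied the whole Cookie request header instead of individual values, a tiny helper (plain Python, not part of Scrapy) can turn it into the dict Scrapy expects:

def parse_cookie_header(header):
    # Split "name1=value1; name2=value2" into a {name: value} dict
    cookies = {}
    for pair in header.split('; '):
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

cookies = parse_cookie_header('session_id=abc123; user_token=xyz789')
# {'session_id': 'abc123', 'user_token': 'xyz789'}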

Using cookies.txt format:

import scrapy
from http.cookiejar import MozillaCookieJar

class CookieFileSpider(scrapy.Spider):
    name = 'cookiefile'

    def start_requests(self):
        # Load cookies from file
        jar = MozillaCookieJar('cookies.txt')
        jar.load()

        cookies = {cookie.name: cookie.value for cookie in jar}

        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

Method 5: Headers-Based Authentication (API Tokens)

For sites using Bearer tokens or API keys:

import scrapy

class TokenSpider(scrapy.Spider):
    name = 'token'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your_access_token_here'
        }

        yield scrapy.Request(
            'https://api.example.com/data',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        data = response.json()
        yield data

Common Header Patterns

# Bearer token
'Authorization': 'Bearer abc123xyz'

# Basic auth (username:password base64 encoded)
'Authorization': 'Basic dXNlcjpwYXNz'

# API key
'X-API-Key': 'your_api_key'

# Custom auth header
'X-Auth-Token': 'your_token'
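
For Basic auth specifically, you can build the header value yourself (a small sketch with placeholder credentials):

import base64

credentials = base64.b64encode(b'your_username:your_password').decode()
headers = {'Authorization': f'Basic {credentials}'}

# Scrapy's built-in HttpAuthMiddleware can also handle Basic auth for you
# if you set http_user and http_pass attributes on the spider.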

Method 6: OAuth Authentication

For sites using OAuth (Google, Facebook login):

import scrapy

class OAuthSpider(scrapy.Spider):
    name = 'oauth'

    def __init__(self, access_token=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.access_token = access_token

    def start_requests(self):
        if not self.access_token:
            self.logger.error('No access token provided!')
            return

        headers = {
            'Authorization': f'Bearer {self.access_token}'
        }

        yield scrapy.Request(
            'https://api.example.com/me',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        yield response.json()

Run with token:

scrapy crawl oauth -a access_token="your_oauth_token"

Method 7: Session Persistence (Multiple Spiders)

Share authentication across multiple spider runs:

import os
import pickle
import scrapy

class SessionSpider(scrapy.Spider):
    name = 'session'

    def start_requests(self):
        # Try to load saved session
        cookies = self.load_cookies()

        if cookies:
            self.logger.info('Using saved session')
            yield scrapy.Request(
                'https://example.com/dashboard',
                cookies=cookies,
                callback=self.parse,
                # errback only fires on request errors (e.g. timeouts or non-2xx responses);
                # a soft redirect back to the login page still needs a check in parse()
                errback=self.session_expired
            )
        else:
            self.logger.info('No saved session, logging in')
            yield scrapy.Request(
                'https://example.com/login',
                callback=self.login
            )

    def login(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Save cookies for next run
        cookies = {}
        for cookie in response.headers.getlist('Set-Cookie'):
            name, value = cookie.decode().split(';')[0].split('=', 1)
            cookies[name] = value

        self.save_cookies(cookies)

        # Continue scraping
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse
        )

    def session_expired(self, failure):
        self.logger.warning('Session expired, logging in again')
        # Delete saved cookies
        self.delete_cookies()
        # Login again
        yield scrapy.Request(
            'https://example.com/login',
            callback=self.login
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}

    def save_cookies(self, cookies):
        with open('session.pkl', 'wb') as f:
            pickle.dump(cookies, f)

    def load_cookies(self):
        try:
            with open('session.pkl', 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def delete_cookies(self):
        try:
            os.remove('session.pkl')
        except FileNotFoundError:
            pass

Verifying Login Success

Always verify login worked:

def after_login(self, response):
    # Method 1: Check for logout link
    if 'logout' in response.text or '/logout' in response.text:
        self.logger.info('Login successful (found logout link)')
        return True

    # Method 2: Check for username
    username = response.css('.username::text').get()
    if username:
        self.logger.info(f'Login successful (logged in as {username})')
        return True

    # Method 3: Check for login form (shouldn't be there after login)
    login_form = response.css('form.login')
    if not login_form:
        self.logger.info('Login successful (no login form)')
        return True

    # Method 4: Check URL (redirected to dashboard?)
    if 'dashboard' in response.url or 'profile' in response.url:
        self.logger.info('Login successful (redirected to dashboard)')
        return True

    # Login failed
    self.logger.error('Login failed!')
    self.logger.error(f'Response URL: {response.url}')
    self.logger.error(f'Response status: {response.status}')

    # Save response for debugging
    with open('login_failed.html', 'w') as f:
        f.write(response.text)

    return False

Handling 2FA (Two-Factor Authentication)

2FA is tricky. Options:

Option 1: Backup Codes

Some sites give backup codes. Use those:

formdata={
    'username': 'your_username',
    'password': 'your_password',
    'backup_code': 'your_backup_code'
}

Option 2: App-Specific Passwords

Some sites (Google, GitHub) let you generate app passwords:

formdata={
    'username': 'your_username',
    'password': 'your_app_specific_password'
}

Option 3: Disable 2FA for Bot Account

Create a separate account without 2FA for scraping (if the site allows it).

Option 4: Use Selenium/Playwright

For sites requiring interactive 2FA:

# Login manually once with Selenium
# Save cookies
# Use cookies in Scrapy
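
Here's a rough sketch of that workflow, assuming Selenium and a Chrome driver are installed (URLs and file names are placeholders):

import json
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/login')

# Complete the login (including the 2FA prompt) manually in the browser window
input('Press Enter once you are logged in...')

# Save the browser's cookies to a file Scrapy can read
with open('browser_cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)

driver.quit()

Then load them in your spider:

import json
import scrapy

class BrowserCookieSpider(scrapy.Spider):
    name = 'browser_cookies'

    def start_requests(self):
        with open('browser_cookies.json') as f:
            browser_cookies = json.load(f)

        # Selenium stores cookies as a list of dicts; Scrapy wants name -> value
        cookies = {c['name']: c['value'] for c in browser_cookies}

        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}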

Complete Real-World Example

Here's a production-ready authenticated spider:

import scrapy
from scrapy.exceptions import CloseSpider

class ProductionAuthSpider(scrapy.Spider):
    name = 'auth'

    login_url = 'https://example.com/login'

    def __init__(self, username=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)

        if not username or not password:
            # CloseSpider only works inside callbacks, so fail fast with a plain error here
            raise ValueError('Username and password required')

        self.username = username
        self.password = password
        self.login_attempts = 0
        self.max_login_attempts = 3

    def start_requests(self):
        self.logger.info('Starting authentication...')
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        self.login_attempts += 1

        if self.login_attempts > self.max_login_attempts:
            raise CloseSpider('Max login attempts exceeded')

        self.logger.info(f'Login attempt {self.login_attempts}')

        # Extract CSRF token
        csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()

        if not csrf_token:
            self.logger.error('CSRF token not found')
            raise CloseSpider('Cannot extract CSRF token')

        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': self.username,
                'password': self.password,
                'csrf_token': csrf_token
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Verify login
        if not self.is_logged_in(response):
            self.logger.error('Login failed!')

            # Check for error messages
            error = response.css('.error-message::text').get()
            if error:
                self.logger.error(f'Error: {error}')

            # Save failed response
            with open('login_failed.html', 'w') as f:
                f.write(response.text)

            # Retry login
            if self.login_attempts < self.max_login_attempts:
                import time
                time.sleep(2)  # crude wait before retry (briefly blocks the reactor)
                # dont_filter=True so the dupefilter doesn't drop the repeated login URL
                yield scrapy.Request(self.login_url, callback=self.login, dont_filter=True)
            else:
                raise CloseSpider('Login failed after maximum attempts')
            return

        self.logger.info('Login successful!')

        # Start scraping authenticated pages
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse_dashboard
        )

    def is_logged_in(self, response):
        # Multiple checks
        has_logout = 'logout' in response.text.lower()
        has_username = response.css('.user-profile').get() is not None
        no_login_form = not response.css('form.login-form').get()
        correct_url = 'dashboard' in response.url or 'profile' in response.url

        return has_logout or has_username or (no_login_form and correct_url)

    def parse_dashboard(self, response):
        # Check if still authenticated
        if not self.is_logged_in(response):
            self.logger.warning('Session expired, re-authenticating')
            yield scrapy.Request(self.login_url, callback=self.login, dont_filter=True)
            return

        # Scrape authenticated content
        for item in response.css('.item'):
            yield {
                'name': item.css('.name::text').get(),
                'data': item.css('.data::text').get()
            }

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_dashboard)

Run it:

scrapy crawl auth -a username="myuser" -a password="mypass"

Common Authentication Issues

Issue #1: Login Succeeds but Next Request Fails

Problem: Cookies not being sent

Solution: Ensure cookies are enabled:

# settings.py
COOKIES_ENABLED = True  # True by default; make sure nothing has disabled it
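
If you still suspect cookies aren't being sent, Scrapy can log all cookie traffic:

# settings.py
COOKIES_DEBUG = True  # log every cookie sent and received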

Issue #2: CSRF Token Validation Failed

Problem: Token not being sent or wrong token

Solution: Extract and send correct token:

# Check what field name the site uses
csrf = response.css('input[name="csrf_token"]::attr(value)').get()
csrf = response.css('input[name="_token"]::attr(value)').get()
csrf = response.css('input[name="authenticity_token"]::attr(value)').get()

Issue #3: Redirected to Login After First Page

Problem: Session not persisting

Solution: Check cookie handling:

def after_login(self, response):
    # Log cookies received
    cookies = response.headers.getlist('Set-Cookie')
    self.logger.info(f'Received cookies: {cookies}')

Summary

Authentication methods:

  1. FormRequest.from_response() - Automatic form handling
  2. Manual FormRequest - When auto-detect fails
  3. Cookies - Direct cookie passing
  4. Headers - Bearer tokens, API keys
  5. Session persistence - Save/load cookies

Best practices:

  • Always verify login succeeded
  • Handle CSRF tokens properly
  • Save cookies for session persistence
  • Add retry logic for login failures
  • Log authentication steps
  • Handle session expiration

Common patterns:

  • Login in start_requests()
  • Use FormRequest.from_response()
  • Verify with logout link or username
  • Continue to protected pages

Remember:

  • Cookies enabled by default (good!)
  • CSRF tokens in hidden fields
  • Session cookies sent automatically
  • Verify login before scraping

Start with FormRequest.from_response(). It handles 90% of login forms automatically!

Happy scraping! 🕷️
