Muhammad Ikramullah Khan

Scrapy Authentication & Login Forms: Scrape Behind the Login Wall

The first time I needed to scrape a site behind a login, I was stuck for days. I could see the data when logged in through my browser, but my spider just got redirected to the login page.

I tried copying cookies manually. Didn't work. I tried storing session data. Still failed. I was about to give up.

Then I learned how authentication actually works. Suddenly, logging in with Scrapy became easy. Let me show you every authentication method that works.


Understanding Web Authentication

Before we code, understand how login works:

1. You visit login page

  • Server sends you a form
  • Form has hidden CSRF token

2. You submit credentials

  • Username + password + CSRF token
  • POST request to server

3. Server validates

  • Checks username/password
  • Creates session

4. Server sends session cookie

  • Cookie stored in browser
  • Cookie sent with every request

5. You access protected pages

  • Cookie proves you're logged in
  • Server shows private data

Your spider needs to replicate the client side of this flow: fetch the form, submit the credentials (plus any hidden tokens), and keep sending the session cookie with every later request.
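
To make the mechanics concrete, here's the same flow outside of Scrapy using the requests library (a minimal sketch; the URLs and field names are placeholders):

import requests

session = requests.Session()  # keeps cookies across requests, like a browser does

# Step 1: fetch the login page (any cookies it sets are stored on the session)
login_page = session.get('https://example.com/login')

# Step 2: submit credentials (a real site usually needs the CSRF token too)
session.post('https://example.com/login', data={
    'username': 'your_username',
    'password': 'your_password'
})

# Steps 3-4 happen on the server; the session cookie now lives in session.cookies

# Step 5: the cookie is sent automatically with every later request
dashboard = session.get('https://example.com/dashboard')
print(dashboard.status_code)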


Method 1: FormRequest (Simple Forms)

For basic username/password forms.

Basic Example

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Submit login form
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Check if login succeeded
        if 'logout' in response.text:
            self.logger.info('Login successful!')

            # Now scrape protected pages
            yield scrapy.Request('https://example.com/dashboard', 
                               callback=self.parse_dashboard)
        else:
            self.logger.error('Login failed!')

    def parse_dashboard(self, response):
        # Scrape protected content
        yield {
            'data': response.css('.private-data::text').get()
        }

What from_response Does

FormRequest.from_response() is magic. It:

  • Finds the form automatically
  • Extracts hidden fields (CSRF tokens)
  • Fills in your credentials
  • Submits the form

You don't have to handle CSRF tokens manually!
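
Roughly, it does the equivalent of this under the hood (a simplified sketch, not Scrapy's actual implementation):

import scrapy

def build_login_request(response, username, password):
    formdata = {}

    # Collect every hidden input in the form (CSRF tokens, nonces, etc.)
    for hidden in response.css('form input[type="hidden"]'):
        name = hidden.attrib.get('name')
        if name:
            formdata[name] = hidden.attrib.get('value', '')

    # Overlay your credentials on top of the hidden fields
    formdata['username'] = username
    formdata['password'] = password

    # Submit to the form's action URL (or the current page if there isn't one)
    action = response.css('form::attr(action)').get() or response.url
    return scrapy.FormRequest(response.urljoin(action), formdata=formdata)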


Method 2: Manual FormRequest (When Auto-Detect Fails)

Sometimes from_response() picks the wrong form. Do it manually:

def parse(self, response):
    # Manually create FormRequest
    return scrapy.FormRequest(
        url='https://example.com/login',
        formdata={
            'username': 'your_username',
            'password': 'your_password',
            'csrf_token': response.css('input[name="csrf_token"]::attr(value)').get()
        },
        callback=self.after_login
    )

Extract CSRF Token Manually

Different sites hide CSRF tokens differently:

# Hidden input field
csrf = response.css('input[name="csrf_token"]::attr(value)').get()

# Meta tag
csrf = response.css('meta[name="csrf-token"]::attr(content)').get()

# In JavaScript variable
import re
csrf = re.search(r'csrfToken = "([^"]+)"', response.text).group(1)

# In a cookie (Set-Cookie header values are bytes, so decode first)
csrf = response.headers.getlist('Set-Cookie')[0].decode().split('csrf=')[1].split(';')[0]

Method 3: Start Requests (Login Before Scraping)

Log in before the spider starts crawling:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    login_url = 'https://example.com/login'

    def start_requests(self):
        # Start by logging in
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Submit login form
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Verify login
        if 'logout' not in response.text:
            self.logger.error('Login failed!')
            return

        # Login successful, start scraping
        yield scrapy.Request('https://example.com/page1', callback=self.parse)
        yield scrapy.Request('https://example.com/page2', callback=self.parse)

    def parse(self, response):
        # Scrape authenticated content
        yield {'data': response.css('.content::text').get()}

Method 4: Cookie-Based Authentication

Some sites just need cookies, no form submission.

Pass Cookies Directly

import scrapy

class CookieSpider(scrapy.Spider):
    name = 'cookie'

    def start_requests(self):
        cookies = {
            'session_id': 'abc123',
            'user_token': 'xyz789'
        }

        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}

Get Cookies from Browser

Chrome:

  1. Log in to the site in Chrome
  2. Press F12 (DevTools)
  3. Application tab → Cookies
  4. Copy cookie values
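
If you copied the whole Cookie request header instead of individual values, a tiny helper (plain Python, not part of Scrapy) can turn it into the dict Scrapy expects:

def parse_cookie_header(header):
    # Split "name1=value1; name2=value2" into a {name: value} dict
    cookies = {}
    for pair in header.split('; '):
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

cookies = parse_cookie_header('session_id=abc123; user_token=xyz789')
# {'session_id': 'abc123', 'user_token': 'xyz789'}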

Using cookies.txt format:

import scrapy
from http.cookiejar import MozillaCookieJar

class CookieFileSpider(scrapy.Spider):
    name = 'cookiefile'

    def start_requests(self):
        # Load cookies from file
        jar = MozillaCookieJar('cookies.txt')
        jar.load()

        cookies = {cookie.name: cookie.value for cookie in jar}

        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

Method 5: Headers-Based Authentication (API Tokens)

For sites using Bearer tokens or API keys:

import scrapy

class TokenSpider(scrapy.Spider):
    name = 'token'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your_access_token_here'
        }

        yield scrapy.Request(
            'https://api.example.com/data',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        data = response.json()
        yield data

Common Header Patterns

# Bearer token
'Authorization': 'Bearer abc123xyz'

# Basic auth (username:password base64 encoded)
'Authorization': 'Basic dXNlcjpwYXNz'

# API key
'X-API-Key': 'your_api_key'

# Custom auth header
'X-Auth-Token': 'your_token'
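
For Basic auth specifically, you can build the header value yourself (a small sketch with placeholder credentials):

import base64

credentials = base64.b64encode(b'your_username:your_password').decode()
headers = {'Authorization': f'Basic {credentials}'}

# Scrapy's built-in HttpAuthMiddleware can also handle Basic auth for you
# if you set http_user and http_pass attributes on the spider.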

Method 6: OAuth Authentication

For sites using OAuth (Google, Facebook login):

import scrapy

class OAuthSpider(scrapy.Spider):
    name = 'oauth'

    def __init__(self, access_token=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.access_token = access_token

    def start_requests(self):
        if not self.access_token:
            self.logger.error('No access token provided!')
            return

        headers = {
            'Authorization': f'Bearer {self.access_token}'
        }

        yield scrapy.Request(
            'https://api.example.com/me',
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        yield response.json()

Run with token:

scrapy crawl oauth -a access_token="your_oauth_token"

Method 7: Session Persistence (Multiple Spiders)

Share authentication across multiple spider runs:

import os
import pickle
import scrapy

class SessionSpider(scrapy.Spider):
    name = 'session'

    def start_requests(self):
        # Try to load saved session
        cookies = self.load_cookies()

        if cookies:
            self.logger.info('Using saved session')
            yield scrapy.Request(
                'https://example.com/dashboard',
                cookies=cookies,
                callback=self.parse,
                # errback only fires on request errors (e.g. timeouts or non-2xx responses);
                # a soft redirect back to the login page still needs a check in parse()
                errback=self.session_expired
            )
        else:
            self.logger.info('No saved session, logging in')
            yield scrapy.Request(
                'https://example.com/login',
                callback=self.login
            )

    def login(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'your_username',
                'password': 'your_password'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Save cookies for next run
        cookies = {}
        for cookie in response.headers.getlist('Set-Cookie'):
            name, value = cookie.decode().split(';')[0].split('=', 1)
            cookies[name] = value

        self.save_cookies(cookies)

        # Continue scraping
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse
        )

    def session_expired(self, failure):
        self.logger.warning('Session expired, logging in again')
        # Delete saved cookies
        self.delete_cookies()
        # Login again
        yield scrapy.Request(
            'https://example.com/login',
            callback=self.login
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}

    def save_cookies(self, cookies):
        with open('session.pkl', 'wb') as f:
            pickle.dump(cookies, f)

    def load_cookies(self):
        try:
            with open('session.pkl', 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def delete_cookies(self):
        try:
            os.remove('session.pkl')
        except FileNotFoundError:
            pass

Verifying Login Success

Always verify login worked:

def after_login(self, response):
    # Method 1: Check for logout link
    if 'logout' in response.text or '/logout' in response.text:
        self.logger.info('Login successful (found logout link)')
        return True

    # Method 2: Check for username
    username = response.css('.username::text').get()
    if username:
        self.logger.info(f'Login successful (logged in as {username})')
        return True

    # Method 3: Check for login form (shouldn't be there after login)
    login_form = response.css('form.login')
    if not login_form:
        self.logger.info('Login successful (no login form)')
        return True

    # Method 4: Check URL (redirected to dashboard?)
    if 'dashboard' in response.url or 'profile' in response.url:
        self.logger.info('Login successful (redirected to dashboard)')
        return True

    # Login failed
    self.logger.error('Login failed!')
    self.logger.error(f'Response URL: {response.url}')
    self.logger.error(f'Response status: {response.status}')

    # Save response for debugging
    with open('login_failed.html', 'w') as f:
        f.write(response.text)

    return False

Handling 2FA (Two-Factor Authentication)

2FA is tricky. Options:

Option 1: Backup Codes

Some sites give backup codes. Use those:

formdata={
    'username': 'your_username',
    'password': 'your_password',
    'backup_code': 'your_backup_code'
}

Option 2: App-Specific Passwords

Some sites (Google, GitHub) let you generate app passwords:

formdata={
    'username': 'your_username',
    'password': 'your_app_specific_password'
}

Option 3: Disable 2FA for Bot Account

Create a separate account without 2FA for scraping (if the site allows it).

Option 4: Use Selenium/Playwright

For sites requiring interactive 2FA:

# Login manually once with Selenium
# Save cookies
# Use cookies in Scrapy
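
Here's a rough sketch of that workflow, assuming Selenium and a Chrome driver are installed (URLs and file names are placeholders):

import json
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/login')

# Complete the login (including the 2FA prompt) manually in the browser window
input('Press Enter once you are logged in...')

# Save the browser's cookies to a file Scrapy can read
with open('browser_cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)

driver.quit()

Then load them in your spider:

import json
import scrapy

class BrowserCookieSpider(scrapy.Spider):
    name = 'browser_cookies'

    def start_requests(self):
        with open('browser_cookies.json') as f:
            browser_cookies = json.load(f)

        # Selenium stores cookies as a list of dicts; Scrapy wants name -> value
        cookies = {c['name']: c['value'] for c in browser_cookies}

        yield scrapy.Request(
            'https://example.com/dashboard',
            cookies=cookies,
            callback=self.parse
        )

    def parse(self, response):
        yield {'data': response.css('.content::text').get()}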

Complete Real-World Example

Here's a production-ready authenticated spider:

import scrapy
from scrapy.exceptions import CloseSpider

class ProductionAuthSpider(scrapy.Spider):
    name = 'auth'

    login_url = 'https://example.com/login'

    def __init__(self, username=None, password=None, *args, **kwargs):
        super().__init__(*args, **kwargs)

        if not username or not password:
            # CloseSpider only works inside callbacks, so fail fast with a plain error here
            raise ValueError('Username and password required')

        self.username = username
        self.password = password
        self.login_attempts = 0
        self.max_login_attempts = 3

    def start_requests(self):
        self.logger.info('Starting authentication...')
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        self.login_attempts += 1

        if self.login_attempts > self.max_login_attempts:
            raise CloseSpider('Max login attempts exceeded')

        self.logger.info(f'Login attempt {self.login_attempts}')

        # Extract CSRF token
        csrf_token = response.css('input[name="csrf_token"]::attr(value)').get()

        if not csrf_token:
            self.logger.error('CSRF token not found')
            raise CloseSpider('Cannot extract CSRF token')

        return scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': self.username,
                'password': self.password,
                'csrf_token': csrf_token
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Verify login
        if not self.is_logged_in(response):
            self.logger.error('Login failed!')

            # Check for error messages
            error = response.css('.error-message::text').get()
            if error:
                self.logger.error(f'Error: {error}')

            # Save failed response
            with open('login_failed.html', 'w') as f:
                f.write(response.text)

            # Retry login
            if self.login_attempts < self.max_login_attempts:
                import time
                time.sleep(2)  # crude wait before retry (briefly blocks the reactor)
                # dont_filter=True so the dupefilter doesn't drop the repeated login URL
                yield scrapy.Request(self.login_url, callback=self.login, dont_filter=True)
            else:
                raise CloseSpider('Login failed after maximum attempts')
            return

        self.logger.info('Login successful!')

        # Start scraping authenticated pages
        yield scrapy.Request(
            'https://example.com/dashboard',
            callback=self.parse_dashboard
        )

    def is_logged_in(self, response):
        # Multiple checks
        has_logout = 'logout' in response.text.lower()
        has_username = response.css('.user-profile').get() is not None
        no_login_form = not response.css('form.login-form').get()
        correct_url = 'dashboard' in response.url or 'profile' in response.url

        return has_logout or has_username or (no_login_form and correct_url)

    def parse_dashboard(self, response):
        # Check if still authenticated
        if not self.is_logged_in(response):
            self.logger.warning('Session expired, re-authenticating')
            yield scrapy.Request(self.login_url, callback=self.login, dont_filter=True)
            return

        # Scrape authenticated content
        for item in response.css('.item'):
            yield {
                'name': item.css('.name::text').get(),
                'data': item.css('.data::text').get()
            }

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_dashboard)

Run it:

scrapy crawl auth -a username="myuser" -a password="mypass"

Common Authentication Issues

Issue #1: Login Succeeds but Next Request Fails

Problem: Cookies not being sent

Solution: Ensure cookies are enabled:

# settings.py
COOKIES_ENABLED = True  # True by default; make sure nothing has disabled it
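
If you still suspect cookies aren't being sent, Scrapy can log all cookie traffic:

# settings.py
COOKIES_DEBUG = True  # log every cookie sent and received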

Issue #2: CSRF Token Validation Failed

Problem: Token not being sent or wrong token

Solution: Extract and send correct token:

# Check what field name the site uses
csrf = response.css('input[name="csrf_token"]::attr(value)').get()
csrf = response.css('input[name="_token"]::attr(value)').get()
csrf = response.css('input[name="authenticity_token"]::attr(value)').get()

Issue #3: Redirected to Login After First Page

Problem: Session not persisting

Solution: Check cookie handling:

def after_login(self, response):
    # Log cookies received
    cookies = response.headers.getlist('Set-Cookie')
    self.logger.info(f'Received cookies: {cookies}')

Summary

Authentication methods:

  1. FormRequest.from_response() - Automatic form handling
  2. Manual FormRequest - When auto-detect fails
  3. Cookies - Direct cookie passing
  4. Headers - Bearer tokens, API keys
  5. Session persistence - Save/load cookies

Best practices:

  • Always verify login succeeded
  • Handle CSRF tokens properly
  • Save cookies for session persistence
  • Add retry logic for login failures
  • Log authentication steps
  • Handle session expiration

Common patterns:

  • Login in start_requests()
  • Use FormRequest.from_response()
  • Verify with logout link or username
  • Continue to protected pages

Remember:

  • Cookies enabled by default (good!)
  • CSRF tokens in hidden fields
  • Session cookies sent automatically
  • Verify login before scraping

Start with FormRequest.from_response(). It handles 90% of login forms automatically!

Happy scraping! 🕷️
