Many valuable datasets live behind login walls — job boards, business directories, analytics dashboards, and member-only content. Scraping authenticated pages requires managing sessions, cookies, and tokens properly.
In this guide, I'll show you how to handle authentication for web scraping in Python, ethically and effectively.
Important: Legal and Ethical Considerations
Before scraping behind login walls, ensure you:
- Have a legitimate account — never use stolen credentials
- Have the right to access the data — check the platform's ToS
- Are collecting your own data or data you have authorization to access
- Respect rate limits — authenticated sessions are easier to track
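On that last point, the simplest throttle is a fixed pause between requests. A minimal sketch (the two-second default is an illustrative guess, not a universal rule; check each site's documented limits):

```python
import time

def polite_get(session, url, delay_seconds=2.0):
    """Fetch a URL, then pause so authenticated requests stay well spaced."""
    response = session.get(url)
    time.sleep(delay_seconds)  # tune per site; authenticated traffic is tracked per account
    return response
```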
Method 1: Session-Based Authentication (Form Login)
Most websites use form-based login with session cookies:
```python
import requests
from bs4 import BeautifulSoup

def login_with_session(login_url, username, password):
    session = requests.Session()

    # Step 1: Get the login page (for CSRF tokens)
    login_page = session.get(login_url)
    soup = BeautifulSoup(login_page.text, "html.parser")

    # Step 2: Extract CSRF token if present
    csrf_token = None
    csrf_input = soup.select_one("input[name='csrf_token']")
    if csrf_input:
        csrf_token = csrf_input["value"]

    # Step 3: Submit login form
    login_data = {
        "username": username,
        "password": password,
    }
    if csrf_token:
        login_data["csrf_token"] = csrf_token
    response = session.post(login_url, data=login_data)

    # Step 4: Verify login succeeded (the "dashboard" redirect check is site-specific)
    if response.status_code == 200 and "dashboard" in response.url:
        print("Login successful!")
        return session
    else:
        print("Login failed")
        return None

# Usage
session = login_with_session(
    "https://example.com/login",
    "your_username",
    "your_password",
)
if session:
    # Now use the authenticated session for all requests
    protected_page = session.get("https://example.com/dashboard/data")
    print(protected_page.text[:500])
```
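Real login forms rarely use these exact field names. Before building the login data dict, it helps to dump every input the form actually sends. A quick sketch, assuming the login form is the first `<form>` on the page:

```python
import requests
from bs4 import BeautifulSoup

def inspect_login_form(login_url):
    """Print the input fields a login form expects (names vary per site)."""
    page = requests.get(login_url)
    soup = BeautifulSoup(page.text, "html.parser")
    form = soup.find("form")  # assumption: the login form is the first form on the page
    for field in form.find_all("input"):
        print(field.get("name"), "->", field.get("type"), repr(field.get("value")))
```

Hidden inputs that show up here (CSRF tokens, return URLs) usually need to be echoed back in your POST data.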
Method 2: API Token Authentication
Many modern apps use JWT or API tokens:
```python
import requests

def login_with_api_token(api_url, email, password):
    # Step 1: Authenticate and get token
    auth_response = requests.post(f"{api_url}/auth/login", json={
        "email": email,
        "password": password,
    })
    if auth_response.status_code != 200:
        print(f"Auth failed: {auth_response.status_code}")
        return None

    token = auth_response.json()["access_token"]

    # Step 2: Create session with token in headers
    session = requests.Session()
    session.headers.update({
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })
    return session

def refresh_token_if_needed(session, refresh_url, refresh_token):
    """Handle token expiration."""
    response = requests.post(refresh_url, json={
        "refresh_token": refresh_token,
    })
    if response.status_code == 200:
        new_token = response.json()["access_token"]
        session.headers["Authorization"] = f"Bearer {new_token}"
    return session
```
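Tying the two together, a small wrapper can retry a request once after refreshing the token on a 401. The /auth/refresh path and the refresh token value below are placeholders; match them to the API you're actually working with:

```python
def get_with_refresh(session, url, refresh_url, refresh_token):
    """GET a URL, refreshing the access token once if the server returns 401."""
    response = session.get(url)
    if response.status_code == 401:
        session = refresh_token_if_needed(session, refresh_url, refresh_token)
        response = session.get(url)
    return response

# Hypothetical usage; endpoint paths depend on the target API
session = login_with_api_token("https://api.example.com", "you@example.com", "secret")
if session:
    data = get_with_refresh(
        session,
        "https://api.example.com/v1/records",
        "https://api.example.com/auth/refresh",
        "your_refresh_token",  # placeholder: usually returned by the login endpoint
    )
```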
Method 3: Cookie-Based Authentication
Sometimes you need to extract cookies from a browser session:
```python
import requests
import json
from pathlib import Path

def create_session_from_cookies(cookies_dict):
    """Create a requests session from exported cookies."""
    session = requests.Session()
    for name, value in cookies_dict.items():
        session.cookies.set(name, value)
    return session

# Save and load cookies for reuse
def save_cookies(session, filepath="cookies.json"):
    cookies = {c.name: c.value for c in session.cookies}
    Path(filepath).write_text(json.dumps(cookies))

def load_cookies(filepath="cookies.json"):
    if Path(filepath).exists():
        cookies = json.loads(Path(filepath).read_text())
        return create_session_from_cookies(cookies)
    return None
```
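A typical workflow ties these helpers to Method 1's form login: try saved cookies first, probe whether they still work, and only log in again when they don't. The /account probe URL is an assumption; any page that requires login will do:

```python
def get_session(login_url, username, password):
    """Reuse saved cookies when possible; otherwise log in and save fresh ones."""
    session = load_cookies()
    if session:
        check = session.get("https://example.com/account")  # any login-only page
        if check.status_code == 200 and "login" not in check.url:
            return session  # saved cookies are still valid
    session = login_with_session(login_url, username, password)  # from Method 1
    if session:
        save_cookies(session)
    return session
```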
Method 4: Browser-Based Login with Playwright
For complex login flows (2FA, CAPTCHAs, OAuth):
```python
import requests
from playwright.sync_api import sync_playwright

def browser_login(login_url, username, password):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # Visible for CAPTCHA
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)

        # Fill login form
        page.fill("input[name='username']", username)
        page.fill("input[name='password']", password)
        page.click("button[type='submit']")

        # Wait for login to complete
        page.wait_for_url("**/dashboard**", timeout=30000)

        # Extract cookies for use with requests
        cookies = context.cookies()
        browser.close()
        return cookies

def cookies_to_session(playwright_cookies):
    """Convert Playwright cookies to a requests session."""
    session = requests.Session()
    for cookie in playwright_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie["domain"],
            path=cookie.get("path", "/"),
        )
    return session
```
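Putting the two together: log in once through a real browser, then hand the cookies to requests for the fast part of the scrape (URLs and credentials below are placeholders):

```python
cookies = browser_login("https://example.com/login", "your_username", "your_password")
session = cookies_to_session(cookies)

# Subsequent requests are plain HTTP, much faster than driving a browser
page = session.get("https://example.com/dashboard/data")
print(page.status_code)
```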
Persisting Sessions Across Runs
Save browser state to avoid re-logging in:
```python
from playwright.sync_api import sync_playwright
from pathlib import Path

def get_persistent_session(login_url, storage_path="auth_state.json"):
    # Start Playwright manually so the returned context outlives this function;
    # a `with sync_playwright()` block would shut the browser down on return.
    p = sync_playwright().start()

    if Path(storage_path).exists():
        # Reuse saved session
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(storage_state=storage_path)
        page = context.new_page()
        page.goto(login_url.replace("/login", "/dashboard"))
        if "login" not in page.url:  # Session still valid
            return context
        browser.close()  # Saved state expired; fall through to a fresh login

    # Fresh login needed
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto(login_url)

    # Perform login...
    page.fill("#username", "your_user")
    page.fill("#password", "your_pass")
    page.click("#login-btn")
    page.wait_for_url("**/dashboard**")

    # Save session state
    context.storage_state(path=storage_path)
    return context
```
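Usage is then a single call; on later runs the saved auth_state.json skips the login entirely:

```python
context = get_persistent_session("https://example.com/login")
page = context.new_page()
page.goto("https://example.com/dashboard/data")
print(page.title())
```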
Handling Session Expiration
```python
import time

class AuthenticatedScraper:
    def __init__(self, login_url, credentials):
        self.login_url = login_url
        self.credentials = credentials
        self.session = None
        self.login_time = 0
        self.session_lifetime = 3600  # Re-login every hour

    def ensure_logged_in(self):
        if not self.session or (time.time() - self.login_time > self.session_lifetime):
            # login_with_session is the form-login helper from Method 1
            self.session = login_with_session(
                self.login_url,
                self.credentials["username"],
                self.credentials["password"],
            )
            self.login_time = time.time()
        return self.session

    def get(self, url):
        session = self.ensure_logged_in()
        response = session.get(url)

        # Check if session expired mid-scrape
        if response.status_code == 401 or "login" in response.url:
            self.session = None  # Force re-login
            session = self.ensure_logged_in()
            response = session.get(url)
        return response

# Usage
scraper = AuthenticatedScraper(
    "https://example.com/login",
    {"username": "user", "password": "pass"},
)
data = scraper.get("https://example.com/api/protected-data")
print(data.json())
```
Best Practices
- Reuse sessions — don't login for every request
- Save cookies to disk — persist sessions across script runs
- Handle expiration gracefully — detect 401s and re-authenticate
- Use environment variables for credentials — never hardcode them (see the sketch after this list)
- Rate limit authenticated requests — sites track logged-in users more closely
- Log out when done — clean up your sessions
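For the credentials point in particular, a minimal pattern is to read them from environment variables so nothing secret lands in your code or repository (SCRAPER_USER and SCRAPER_PASS are example variable names):

```python
import os

# Fail fast with a KeyError if the variables aren't set
username = os.environ["SCRAPER_USER"]
password = os.environ["SCRAPER_PASS"]

scraper = AuthenticatedScraper(
    "https://example.com/login",
    {"username": username, "password": password},
)
```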
Scaling Authenticated Scraping
For large-scale authenticated scraping, you'll need reliable proxy infrastructure — rotating IPs mid-session can get your account flagged or logged out. ThorData provides sticky residential proxies that maintain a consistent IP address throughout your session, preventing authentication disruptions.
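How you wire a proxy in depends on the provider, but with requests it usually comes down to setting session.proxies. The host, port, and credential format below are placeholders, not an actual provider endpoint:

```python
import requests

session = requests.Session()
proxy_url = "http://username:password@proxy.example.com:8000"  # placeholder endpoint
session.proxies.update({"http": proxy_url, "https": proxy_url})

# With a sticky session, every request now exits from the same IP,
# so the site sees one consistent, logged-in visitor
response = session.get("https://example.com/dashboard/data")
```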
Conclusion
Authenticated scraping adds complexity but opens up access to valuable datasets. Start with form-based login for simple sites, use API tokens for modern apps, and fall back to Playwright for complex auth flows. Always persist your sessions and handle expiration gracefully.
Happy scraping!