DEV Community

agenthustler

Posted on • Edited on

How to Scrape Behind Login Walls: Session Management in Python

Many valuable datasets live behind login walls — job boards, business directories, analytics dashboards, and member-only content. Scraping authenticated pages requires managing sessions, cookies, and tokens properly.

In this guide, I'll show you how to handle authentication for web scraping in Python, ethically and effectively.

Important: Legal and Ethical Considerations

Before scraping behind login walls, ensure you:

  • Have a legitimate account — never use stolen credentials
  • Have the right to access the data — check the platform's ToS
  • Are collecting your own data or data you have authorization to access
  • Respect rate limits — authenticated sessions are easier to track

Method 1: Session-Based Authentication (Form Login)

Most websites use form-based login: you POST your credentials to a login endpoint, and the server responds with a session cookie that identifies you on subsequent requests. A `requests.Session` stores and resends those cookies automatically.
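A minimal sketch of the pattern, assuming the form posts `username` and `password` fields (inspect your target's login form HTML for the real field names; the helper name `login_with_session` is my own, chosen to match the call in the `AuthenticatedScraper` class later in this post):

```python
import requests

def login_with_session(login_url, username, password):
    """POST form credentials and return a session holding the auth cookie."""
    session = requests.Session()
    # Field names here are assumptions -- match them to the site's login form.
    resp = session.post(login_url, data={
        "username": username,
        "password": password,
    })
    resp.raise_for_status()
    return session  # session.cookies now carries the login cookie
```

Some sites also embed a CSRF token in the login form; in that case, GET the login page first and include the hidden token field in your POST data.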

Method 2: API Token Authentication

Many modern apps authenticate with JWTs or API tokens instead of session cookies: you obtain a token once (from a token endpoint or your account settings) and attach it to every request, usually in an `Authorization: Bearer` header.
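A hedged sketch of bearer-token auth (the header scheme is the common convention, but some APIs use a custom header name, so check the API docs):

```python
import requests

def make_token_session(token):
    """Return a session that sends the bearer token on every request."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {token}"
    return session

# Usage: the token typically comes from a login endpoint or your account page.
api = make_token_session("my-api-token")  # hypothetical token value
```

Setting the header on the session once means every `api.get(...)` or `api.post(...)` is authenticated without repeating yourself.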

Method 3: Cookie-Based Authentication

Sometimes you need to extract cookies from a browser session:

import requests
import json
from pathlib import Path

def create_session_from_cookies(cookies_dict):
    """Create a requests session from exported cookies."""
    session = requests.Session()
    for name, value in cookies_dict.items():
        session.cookies.set(name, value)
    return session

# Save and load cookies for reuse
def save_cookies(session, filepath="cookies.json"):
    cookies = {c.name: c.value for c in session.cookies}
    Path(filepath).write_text(json.dumps(cookies))

def load_cookies(filepath="cookies.json"):
    if Path(filepath).exists():
        cookies = json.loads(Path(filepath).read_text())
        return create_session_from_cookies(cookies)
    return None

Method 4: Browser-Based Login with Playwright

For complex login flows (2FA, CAPTCHAs, OAuth), drive a real browser with Playwright: log in once, completing any interactive steps yourself, and capture the resulting authenticated state for your scraper.
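A minimal sketch using Playwright's sync API. The selectors (`#username`, `#password`, `button[type=submit]`) are placeholders you will need to adapt to the real page, and `headless=False` keeps the browser visible so you can complete 2FA or CAPTCHA steps by hand:

```python
def login_with_playwright(login_url, username, password, state_path="state.json"):
    """Log in through a real browser and save the authenticated state."""
    # Imported inside the function so the rest of the script runs
    # even where Playwright isn't installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Visible browser: you can solve 2FA/CAPTCHAs manually if they appear.
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(login_url)
        # Placeholder selectors -- inspect the real login form.
        page.fill("#username", username)
        page.fill("#password", password)
        page.click("button[type=submit]")
        page.wait_for_load_state("networkidle")
        # Persist cookies + localStorage for later runs.
        page.context.storage_state(path=state_path)
        browser.close()
```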

Persisting Sessions Across Runs

Save the browser's storage state (cookies plus localStorage) to disk so later runs start already logged in instead of repeating the login flow.
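With Playwright this is built in: pass the saved state file to `new_context`. The `state.json` path is an assumption and should match wherever you saved the state at login time:

```python
def open_authenticated_page(url, state_path="state.json"):
    """Open a page in a browser context that is already logged in."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # storage_state restores cookies and localStorage, skipping the login.
        context = browser.new_context(storage_state=state_path)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```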

Handling Session Expiration

import time

class AuthenticatedScraper:
    def __init__(self, login_url, credentials):
        self.login_url = login_url
        self.credentials = credentials
        self.session = None
        self.login_time = 0
        self.session_lifetime = 3600  # Re-login every hour

    def ensure_logged_in(self):
        if not self.session or (time.time() - self.login_time > self.session_lifetime):
            # login_with_session is the form-login helper from Method 1
            self.session = login_with_session(
                self.login_url,
                self.credentials["username"],
                self.credentials["password"]
            )
            self.login_time = time.time()
        return self.session

    def get(self, url):
        session = self.ensure_logged_in()
        response = session.get(url)

        # Check if session expired mid-scrape
        if response.status_code == 401 or "login" in response.url:
            self.session = None  # Force re-login
            session = self.ensure_logged_in()
            response = session.get(url)

        return response

# Usage
scraper = AuthenticatedScraper(
    "https://example.com/login",
    {"username": "user", "password": "pass"}
)

data = scraper.get("https://example.com/api/protected-data")
print(data.json())

Best Practices

  1. Reuse sessions — don't log in for every request
  2. Save cookies to disk — persist sessions across script runs
  3. Handle expiration gracefully — detect 401s and re-authenticate
  4. Use environment variables for credentials — never hardcode them
  5. Rate limit authenticated requests — sites track logged-in users more closely
  6. Log out when done — clean up your sessions
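Points 4 and 5 in practice might look like this small sketch (the variable names `SCRAPER_USER`/`SCRAPER_PASS` and the 2-second delay are arbitrary choices, not a standard):

```python
import os
import time

# Never hardcode credentials; pull them from the environment instead.
credentials = {
    "username": os.environ.get("SCRAPER_USER", ""),
    "password": os.environ.get("SCRAPER_PASS", ""),
}

def polite_get(session, url, delay=2.0):
    """Rate-limit authenticated requests with a fixed pause before each one."""
    time.sleep(delay)
    return session.get(url)
```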

Scaling Authenticated Scraping

For large-scale authenticated scraping, you'll need reliable proxy rotation to prevent your sessions from being flagged. ThorData provides sticky residential proxies that maintain consistent IP addresses throughout your session, preventing authentication disruptions.
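Wiring a sticky-session proxy into `requests` might look like this (the proxy URL is a hypothetical example; providers typically encode a session ID in the proxy username to pin you to one exit IP, so check your provider's docs for the real format):

```python
import requests

# Hypothetical sticky-session proxy URL: the session ID in the username
# keeps the same exit IP across requests, so the login cookie stays valid.
PROXY = "http://user-session-abc123:password@proxy.example.com:8000"

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}
# session.get(...) now routes every request through the same exit IP.
```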

Conclusion

Authenticated scraping adds complexity but opens up access to valuable datasets. Start with form-based login for simple sites, use API tokens for modern apps, and fall back to Playwright for complex auth flows. Always persist your sessions and handle expiration gracefully.

Happy scraping!
