Many valuable datasets live behind login walls — job boards, business directories, analytics dashboards, and member-only content. Scraping authenticated pages requires managing sessions, cookies, and tokens properly.
In this guide, I'll show you how to handle authentication for web scraping in Python, ethically and effectively.
Important: Legal and Ethical Considerations
Before scraping behind login walls, ensure you:
- Have a legitimate account — never use stolen credentials
- Have the right to access the data — check the platform's ToS
- Are collecting your own data or data you have authorization to access
- Respect rate limits — authenticated sessions are easier to track
Method 1: Session-Based Authentication (Form Login)
Most websites use form-based login with session cookies:
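Here's a minimal sketch with `requests`, assuming the form posts `username` and `password` fields back to the login URL. Real sites often add a hidden CSRF token you'd need to parse out of the login page first, so inspect the actual form in your browser's dev tools:

```python
import requests

def login_with_session(login_url, username, password):
    """Log in via a form POST and return an authenticated session.

    The field names ("username"/"password") are assumptions -- check
    the real form's input names in your browser's dev tools. Some
    sites also require a hidden CSRF token from the login page.
    """
    session = requests.Session()
    # Fetch the login page first so any pre-login cookies get set
    session.get(login_url)
    response = session.post(login_url, data={
        "username": username,
        "password": password,
    })
    response.raise_for_status()
    # The session object now carries the auth cookies automatically
    return session
```

Every subsequent `session.get(...)` call sends the login cookies, so you only authenticate once.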
Method 2: API Token Authentication
Many modern apps use JWT or API tokens:
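A sketch of token-based auth, assuming a JSON login endpoint that returns an `access_token` field; the endpoint path and field names vary per API, so check the login request in your browser's network tab:

```python
import requests

def login_for_token(auth_url, username, password):
    """Exchange credentials for a bearer token.

    The JSON body and the "access_token" response field are
    assumptions -- adapt them to the target API.
    """
    response = requests.post(auth_url, json={
        "username": username,
        "password": password,
    })
    response.raise_for_status()
    return response.json()["access_token"]

def make_token_session(token):
    """Return a session that sends the token on every request."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {token}"
    return session
```

Setting the header on the session (rather than per request) keeps the token in one place; note that JWTs expire, so expect to refresh or re-login periodically.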
Method 3: Cookie-Based Authentication
Sometimes you need to extract cookies from a browser session:
```python
import requests
import json
from pathlib import Path

def create_session_from_cookies(cookies_dict):
    """Create a requests session from exported cookies."""
    session = requests.Session()
    for name, value in cookies_dict.items():
        session.cookies.set(name, value)
    return session

# Save and load cookies for reuse
def save_cookies(session, filepath="cookies.json"):
    cookies = {c.name: c.value for c in session.cookies}
    Path(filepath).write_text(json.dumps(cookies))

def load_cookies(filepath="cookies.json"):
    if Path(filepath).exists():
        cookies = json.loads(Path(filepath).read_text())
        return create_session_from_cookies(cookies)
    return None
```
Method 4: Browser-Based Login with Playwright
For complex login flows (2FA, CAPTCHAs, OAuth):
Persisting Sessions Across Runs
Save browser state to avoid re-logging in:
Handling Session Expiration
Sessions don't last forever. Wrap your requests so an expired session triggers an automatic re-login:

```python
import time

class AuthenticatedScraper:
    def __init__(self, login_url, credentials):
        self.login_url = login_url
        self.credentials = credentials
        self.session = None
        self.login_time = 0
        self.session_lifetime = 3600  # Re-login every hour

    def ensure_logged_in(self):
        # login_with_session() is the form-login helper from Method 1
        if not self.session or (time.time() - self.login_time > self.session_lifetime):
            self.session = login_with_session(
                self.login_url,
                self.credentials["username"],
                self.credentials["password"],
            )
            self.login_time = time.time()
        return self.session

    def get(self, url):
        session = self.ensure_logged_in()
        response = session.get(url)
        # Check if session expired mid-scrape (401, or a redirect
        # back to the login page)
        if response.status_code == 401 or "login" in response.url:
            self.session = None  # Force re-login
            session = self.ensure_logged_in()
            response = session.get(url)
        return response

# Usage
scraper = AuthenticatedScraper(
    "https://example.com/login",
    {"username": "user", "password": "pass"},
)
data = scraper.get("https://example.com/api/protected-data")
print(data.json())
```
Best Practices
- Reuse sessions — don't login for every request
- Save cookies to disk — persist sessions across script runs
- Handle expiration gracefully — detect 401s and re-authenticate
- Use environment variables for credentials — never hardcode them
- Rate limit authenticated requests — sites track logged-in users more closely
- Log out when done — clean up your sessions
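For the credentials point, a minimal sketch (the variable names are illustrative):

```python
import os

def load_credentials():
    """Read login credentials from environment variables.

    The variable names are examples -- use whatever fits your setup.
    Fail fast if they're missing so the scraper never silently runs
    without credentials.
    """
    username = os.environ.get("SCRAPER_USERNAME")
    password = os.environ.get("SCRAPER_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set SCRAPER_USERNAME and SCRAPER_PASSWORD")
    return {"username": username, "password": password}
```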
Scaling Authenticated Scraping
For large-scale authenticated scraping, you'll need proxies that don't break your sessions: rotating to a new IP mid-session is a common trigger for forced re-authentication and bot detection. ThorData provides sticky residential proxies that keep the same IP address for the lifetime of a session, preventing authentication disruptions.
Conclusion
Authenticated scraping adds complexity but opens up access to valuable datasets. Start with form-based login for simple sites, use API tokens for modern apps, and fall back to Playwright for complex auth flows. Always persist your sessions and handle expiration gracefully.
Happy scraping!