Web scraping and GDPR compliance seem like opposites. One collects data at scale. The other limits what data you can collect. But here is the thing: they are not mutually exclusive.
I have been building scrapers professionally for 4 years. Here is what I have learned about making them GDPR-compliant.
The Core Problem
Most scrapers are built to collect everything. GDPR says you can only collect what you need, for a specific purpose, with legal basis.
This creates 3 practical constraints:
- Data minimization: only scrape fields you actually use
- Purpose limitation: know WHY you are scraping before you build
- Legal basis: you need one of the six Article 6 grounds to process personal data
Step 1: Define Your Legal Basis First
Before writing a single line of Playwright code, answer this: what is your legal basis under GDPR Article 6?
For most B2B scraping:
- Legitimate interest (Art. 6(1)(f)): valid for publicly posted professional data
- Contract performance: if user asked you to fetch their own data
- Public interest: academic/journalism use cases
If you cannot answer this question, stop. Build the legal basis first.
# Document your legal basis in code
SCRAPER_CONFIG = {
    "legal_basis": "legitimate_interest",
    "purpose": "B2B lead enrichment from public company websites",
    "data_categories": ["business_email", "job_title", "company_name"],
    "excludes": ["personal_emails", "home_addresses", "private_profiles"],
    "retention_days": 90,
}
Step 2: Scope Your Playwright Scraper Correctly
GDPR requires data minimization. Only collect what you need.
Bad approach (collect everything):
from playwright.sync_api import sync_playwright

def scrape_profile(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Scrapes the entire page DOM
        return page.content()  # GDPR problem: stores everything
GDPR-compliant approach (selective collection):
from playwright.sync_api import sync_playwright
from datetime import datetime

def scrape_business_profile(url: str, config: dict) -> dict:
    """Only collect fields listed in config["data_categories"]."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Set an honest, identifiable user agent
        page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0; +https://yourdomain.com/bot)"
        })
        page.goto(url, wait_until="networkidle")
        # Only extract what you need
        result = {}
        if "job_title" in config["data_categories"]:
            result["job_title"] = page.locator(".job-title").first.text_content()
        if "company_name" in config["data_categories"]:
            result["company_name"] = page.locator(".company").first.text_content()
        # Never collect photos, personal emails, or home city unless needed
        result["scraped_at"] = datetime.utcnow().isoformat()
        result["source_url"] = url
        result["legal_basis"] = config["legal_basis"]
        browser.close()
        return result
Step 3: Respect robots.txt (It Protects Your Legal Basis)
Legitimate interest requires balancing your interests against the rights of data subjects and minimizing your impact on them. robots.txt is not a GDPR requirement in itself, but ignoring a site's stated crawling preferences undermines that balancing test if a regulator ever examines your processing.
import urllib.robotparser
from urllib.parse import urlparse

def can_scrape(url: str) -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return False  # fail closed if robots.txt is unreachable
    return rp.can_fetch("*", url)

# Always check before scraping
if not can_scrape(target_url):
    print(f"Skipping {target_url} - robots.txt disallows")
Step 4: Add Rate Limiting (Proportionality Requirement)
GDPR requires proportionality. Hammering a site with 100 req/sec is not proportional to legitimate interest.
import asyncio
import random
from playwright.async_api import async_playwright

async def scrape_with_rate_limit(urls: list, delay_seconds: float = 2.0):
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        for url in urls:
            if not can_scrape(url):
                continue
            page = await browser.new_page()
            try:
                await page.goto(url, wait_until="networkidle", timeout=30000)
                # extract_minimal_data: your field-selective extractor from Step 2
                data = await extract_minimal_data(page)
                results.append(data)
            except Exception as e:
                print(f"Error scraping {url}: {e}")
            finally:
                await page.close()
            # Human-like delay (GDPR proportionality)
            await asyncio.sleep(delay_seconds + random.uniform(0, 1))
        await browser.close()
    return results
Step 5: Implement Data Retention Limits
GDPR Article 5(1)(e): data must not be kept longer than necessary.
import json
import sqlite3
from datetime import datetime, timedelta

def store_with_retention(data: dict, db_path: str, retention_days: int = 90):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # Store with an expiry timestamp
    expires_at = (datetime.utcnow() + timedelta(days=retention_days)).isoformat()
    cursor.execute("""
        INSERT INTO scraped_data (url, data_json, scraped_at, expires_at, legal_basis)
        VALUES (?, ?, ?, ?, ?)
    """, (
        data["source_url"],
        json.dumps(data),  # store JSON, not a Python repr, so the data stays queryable
        data["scraped_at"],
        expires_at,
        data["legal_basis"],
    ))
    conn.commit()
    conn.close()
def purge_expired_data(db_path: str):
    """Run this daily via cron to comply with retention limits."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # Compare against a Python-generated ISO timestamp so the format matches
    # expires_at (SQLite's datetime() uses a space separator, not a "T")
    cursor.execute("""
        DELETE FROM scraped_data
        WHERE expires_at < ?
    """, (datetime.utcnow().isoformat(),))
    deleted = cursor.rowcount
    conn.commit()
    conn.close()
    print(f"Purged {deleted} expired records")
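The storage and purge functions above assume a scraped_data table that is never created. A minimal schema sketch, with column names inferred from the INSERT statement (adjust types and indexes for your own pipeline):

```python
import sqlite3

# Hypothetical schema matching the INSERT in store_with_retention
SCHEMA = """
CREATE TABLE IF NOT EXISTS scraped_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    data_json TEXT NOT NULL,
    scraped_at TEXT NOT NULL,
    expires_at TEXT NOT NULL,
    legal_basis TEXT NOT NULL
)
"""

def init_db(db_path: str) -> None:
    """Create the scraped_data table if it does not exist yet."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    conn.commit()
    conn.close()
```

Run init_db once at deploy time, before the first store_with_retention call.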
Step 6: Handle Data Subject Rights
GDPR gives people rights: access, erasure, portability. If you scrape personal data, you must handle these.
def handle_erasure_request(email: str, db_path: str):
    """GDPR Art. 17 - Right to erasure."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # Find and delete all data about this person
    cursor.execute("""
        DELETE FROM scraped_data
        WHERE data_json LIKE ?
    """, (f"%{email}%",))
    deleted = cursor.rowcount
    conn.commit()
    conn.close()
    # Log the erasure (required for an audit trail)
    log_gdpr_action("erasure", email, records_deleted=deleted)
    print(f"Erased {deleted} records for {email}")
import json
from datetime import datetime

def log_gdpr_action(action: str, subject: str, **kwargs):
    """Maintain an audit log for GDPR compliance."""
    entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "action": action,
        "subject_identifier": subject,
        **kwargs,
    }
    with open("gdpr_audit.log", "a") as f:
        f.write(json.dumps(entry) + "\n")
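Erasure is only one of the rights mentioned above. A sketch of an access request handler (Art. 15), assuming the same scraped_data table; the LIKE match on data_json mirrors the erasure example and is a simplification, so index subject identifiers properly in production:

```python
import sqlite3

def handle_access_request(email: str, db_path: str) -> list:
    """GDPR Art. 15 - Right of access: return every stored record
    that mentions the data subject's identifier."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT url, data_json, scraped_at, legal_basis
        FROM scraped_data
        WHERE data_json LIKE ?
        """,
        (f"%{email}%",),
    ).fetchall()
    conn.close()
    # Log this via log_gdpr_action("access", email, ...) for the audit trail
    return [
        {"source_url": u, "data": d, "scraped_at": s, "legal_basis": b}
        for u, d, s, b in rows
    ]
```

The returned list can be serialized to JSON and handed to the data subject, which also covers the portability requirement for this simple schema.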
The GDPR-Compliant Playwright Checklist
Before deploying any scraper:
- [ ] Legal basis documented (not assumed)
- [ ] Data minimization: only required fields collected
- [ ] robots.txt respected
- [ ] Rate limiting implemented (not crawling at full speed)
- [ ] Data retention limits set and enforced by code
- [ ] Erasure request handling in place
- [ ] Audit log of GDPR actions maintained
- [ ] No sensitive categories (health, ethnicity, religion) scraped
- [ ] If collecting EU resident data: registered with your Data Protection Authority where required
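Part of this checklist can be enforced in code. A pre-flight validator sketch that checks a config dict like SCRAPER_CONFIG before the scraper runs (the key names and category list are assumptions matching the config from Step 1):

```python
REQUIRED_KEYS = {"legal_basis", "purpose", "data_categories", "retention_days"}

# GDPR Art. 9 special categories (non-exhaustive, illustrative names)
SENSITIVE_CATEGORIES = {
    "health", "ethnicity", "religion",
    "political_opinions", "sexual_orientation", "biometrics",
}

ARTICLE_6_BASES = {
    "consent", "contract", "legal_obligation",
    "vital_interests", "public_task", "legitimate_interest",
}

def validate_scraper_config(config: dict) -> list:
    """Return a list of compliance problems; an empty list means the
    config passes these basic pre-flight checks."""
    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing config keys: {sorted(missing)}")
    if config.get("legal_basis") not in ARTICLE_6_BASES:
        problems.append("legal_basis is not one of the six Article 6 bases")
    sensitive = SENSITIVE_CATEGORIES & set(config.get("data_categories", []))
    if sensitive:
        problems.append(f"sensitive categories requested: {sorted(sensitive)}")
    if config.get("retention_days", 0) <= 0:
        problems.append("retention_days must be a positive number")
    return problems
```

Call it at startup and refuse to launch the browser if the list is non-empty; that makes "legal basis first" a hard constraint rather than a convention.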
What About LinkedIn, Facebook, and Other Restricted Sites?
These platforms explicitly prohibit scraping in their ToS. Under GDPR, ToS violations can affect your legitimate interest claim.
For these platforms: use official APIs or data providers who have proper data licensing agreements.
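One way to make that rule hard to bypass is a domain guard in front of the scraper. A sketch, with a hypothetical blocklist you would adjust to the platforms whose terms apply to you:

```python
from urllib.parse import urlparse

# Hypothetical blocklist - adjust to the platforms whose ToS you must honor
TOS_RESTRICTED_DOMAINS = {"linkedin.com", "facebook.com", "instagram.com"}

def is_tos_restricted(url: str) -> bool:
    """True if the URL's host is (or is a subdomain of) a ToS-restricted platform."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return any(
        host == domain or host.endswith("." + domain)
        for domain in TOS_RESTRICTED_DOMAINS
    )
```

Checking this alongside can_scrape() keeps the restriction in one auditable place instead of relying on whoever assembles the URL list.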
Bottom Line
GDPR-compliant scraping is possible. It just requires planning before coding.
The biggest mistake I see: people build the scraper first, then try to bolt on compliance. That does not work. Legal basis first, then architecture.
Need ready-to-use GDPR-compliant scraper templates? The Apify Scrapers Bundle includes 12 production-grade scrapers built with legal basis documentation and data minimization built in.
Get the bundle for €29 → https://vhubster3.gumroad.com/l/fjmtqn