Vhub Systems
How to Build a GDPR-Compliant Web Scraper With Playwright in 2026

Web scraping and GDPR compliance seem like opposites. One collects data at scale. The other limits what data you can collect. But here is the thing: they are not mutually exclusive.

I have been building scrapers professionally for 4 years. Here is what I have learned about making them GDPR-compliant.

The Core Problem

Most scrapers are built to collect everything. GDPR says you can only collect what you need, for a specific purpose, with legal basis.

This creates 3 practical constraints:

  1. Data minimization: only scrape fields you actually use
  2. Purpose limitation: know WHY you are scraping before you build
  3. Legal basis: you need one of the six Article 6 grounds to process personal data

Step 1: Define Your Legal Basis First

Before writing a single line of Playwright code, answer this: what is your legal basis under GDPR Article 6?

For most B2B scraping:

  • Legitimate interest (Art. 6(1)(f)): can cover publicly posted professional data, subject to a balancing test
  • Contract performance: if user asked you to fetch their own data
  • Public interest: academic/journalism use cases

If you cannot answer this question, stop. Build the legal basis first.

# Document your legal basis in code
SCRAPER_CONFIG = {
    "legal_basis": "legitimate_interest",
    "purpose": "B2B lead enrichment from public company websites",
    "data_categories": ["business_email", "job_title", "company_name"],
    "excludes": ["personal_emails", "home_addresses", "private_profiles"],
    "retention_days": 90
}
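Once the config exists, it is cheap to sanity-check it programmatically before any scraping starts. A minimal sketch; the field names mirror SCRAPER_CONFIG above, and the six basis labels are my shorthand for the Article 6 grounds, not official identifiers:

```python
# A minimal config validator. Field names mirror SCRAPER_CONFIG above;
# the six basis labels are shorthand for the Article 6 grounds, not
# official identifiers.
VALID_LEGAL_BASES = {
    "consent", "contract", "legal_obligation",
    "vital_interests", "public_task", "legitimate_interest",
}

def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    if config.get("legal_basis") not in VALID_LEGAL_BASES:
        problems.append("legal_basis must be one of the six Article 6 grounds")
    if not config.get("purpose"):
        problems.append("purpose must be documented before scraping starts")
    if not config.get("data_categories"):
        problems.append("data_categories cannot be empty (data minimization)")
    if not isinstance(config.get("retention_days"), int) or config["retention_days"] <= 0:
        problems.append("retention_days must be a positive integer")
    return problems
```

Run it at startup and refuse to scrape if the list comes back non-empty; an undocumented config is exactly the "bolt on compliance later" trap described at the end of this post.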

Step 2: Scope Your Playwright Scraper Correctly

GDPR requires data minimization. Only collect what you need.

Bad approach (collect everything):

from playwright.sync_api import sync_playwright

def scrape_profile(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Scraping entire page DOM
        return page.content()  # GDPR problem: stores everything

GDPR-compliant approach (selective collection):

from playwright.sync_api import sync_playwright
from datetime import datetime, timezone

def scrape_business_profile(url: str, config: dict) -> dict:
    """Only collect fields listed in config["data_categories"]."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Identify your bot honestly and link to a page explaining it
        page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0; +https://yourdomain.com/bot)"
        })

        page.goto(url, wait_until="networkidle")

        # Only extract what you need
        result = {}

        if "job_title" in config["data_categories"]:
            result["job_title"] = page.locator(".job-title").first.text_content()

        if "company_name" in config["data_categories"]:
            result["company_name"] = page.locator(".company").first.text_content()

        # Never collect photos, personal emails, or home addresses
        # unless they are in data_categories with a documented purpose

        result["scraped_at"] = datetime.now(timezone.utc).isoformat()
        result["source_url"] = url
        result["legal_basis"] = config["legal_basis"]

        browser.close()
        return result

Step 3: Respect robots.txt (It Strengthens Your Legal Position)

robots.txt is not mentioned anywhere in GDPR itself. But legitimate interest requires balancing your interest against the impact on data subjects, and ignoring a site's stated crawling rules weighs against you in that balancing test.

import urllib.robotparser
from urllib.parse import urlparse

def can_scrape(url: str) -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch("*", url)

# Always check before scraping
if not can_scrape(target_url):
    print(f"Skipping {target_url} — robots.txt disallows")

Step 4: Add Rate Limiting (Proportionality Requirement)

GDPR requires proportionality. Hammering a site with 100 req/sec is not proportional to legitimate interest.

import asyncio
import random
from playwright.async_api import async_playwright

async def scrape_with_rate_limit(urls: list, delay_seconds: float = 2.0):
    results = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        for url in urls:
            if not can_scrape(url):
                continue

            page = await browser.new_page()

            try:
                await page.goto(url, wait_until="networkidle", timeout=30000)
                data = await extract_minimal_data(page)
                results.append(data)
            except Exception as e:
                print(f"Error scraping {url}: {e}")
            finally:
                await page.close()

            # Human-like delay (GDPR proportionality)
            await asyncio.sleep(delay_seconds + random.uniform(0, 1))

        await browser.close()

    return results
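The `extract_minimal_data` helper called above is left undefined. One possible async shape, mirroring the selective extraction from Step 2; the CSS selectors are assumptions and should be whatever your documented data_categories actually require:

```python
from datetime import datetime, timezone

# One possible shape for the extract_minimal_data helper used in
# scrape_with_rate_limit above. The CSS selectors are assumptions
# and should mirror whatever Step 2 collects for your target site.
async def extract_minimal_data(page) -> dict:
    result = {}

    # Only read the specific elements you need, never the full DOM
    title = page.locator(".job-title").first
    if await title.count():
        result["job_title"] = await title.text_content()

    company = page.locator(".company").first
    if await company.count():
        result["company_name"] = await company.text_content()

    result["scraped_at"] = datetime.now(timezone.utc).isoformat()
    return result
```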

Step 5: Implement Data Retention Limits

GDPR Article 5(1)(e): data must not be kept longer than necessary.

import json
import sqlite3
from datetime import datetime, timedelta, timezone

def store_with_retention(data: dict, db_path: str, retention_days: int = 90):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Store with an explicit expiry timestamp, in the same text format
    # SQLite's datetime('now') emits, so string comparison is valid
    expires_at = (datetime.now(timezone.utc) + timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")

    cursor.execute("""
        INSERT INTO scraped_data (url, data_json, scraped_at, expires_at, legal_basis)
        VALUES (?, ?, ?, ?, ?)
    """, (
        data["source_url"],
        json.dumps(data),  # real JSON, not str(dict), so it stays queryable
        data["scraped_at"],
        expires_at,
        data["legal_basis"]
    ))

    conn.commit()
    conn.close()

def purge_expired_data(db_path: str):
    """Run this daily via cron to comply with retention limits"""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    deleted = cursor.execute("""
        DELETE FROM scraped_data
        WHERE expires_at < datetime('now')
    """).rowcount

    conn.commit()
    conn.close()

    print(f"Purged {deleted} expired records")
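The INSERT above assumes a scraped_data table already exists. A possible schema; the column names match the code, but the id column and the index are my additions:

```python
import sqlite3

def init_db(db_path: str):
    """Create the scraped_data table used by store_with_retention and purge_expired_data."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scraped_data (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL,
            data_json TEXT NOT NULL,
            scraped_at TEXT NOT NULL,
            expires_at TEXT NOT NULL,   -- UTC expiry; the daily purge compares this to the current time
            legal_basis TEXT NOT NULL
        )
    """)
    # An index on expires_at keeps the daily purge cheap
    conn.execute("CREATE INDEX IF NOT EXISTS idx_expires ON scraped_data (expires_at)")
    conn.commit()
    conn.close()
```

Storing legal_basis on every row looks redundant, but it means each record carries its own justification even after the config changes.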

Step 6: Handle Data Subject Rights

GDPR gives data subjects enforceable rights: access (Art. 15), erasure (Art. 17), and portability (Art. 20). If you scrape personal data, you must be able to handle these requests.

def handle_erasure_request(email: str, db_path: str):
    """GDPR Art. 17 - Right to erasure"""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Find and delete all data about this person
    deleted = cursor.execute("""
        DELETE FROM scraped_data
        WHERE data_json LIKE ?
    """, (f"%{email}%",)).rowcount

    conn.commit()
    conn.close()

    # Log the erasure (required for audit trail)
    log_gdpr_action("erasure", email, records_deleted=deleted)

    print(f"Erased {deleted} records for {email}")

def log_gdpr_action(action: str, subject: str, **kwargs):
    """Maintain an append-only audit log for GDPR compliance"""
    import json
    from datetime import datetime, timezone

    with open("gdpr_audit.log", "a") as f:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "subject_identifier": subject,
            **kwargs
        }
        f.write(json.dumps(entry) + "\n")
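Erasure is only one of the rights. Access and portability requests need the same treatment: find everything stored about the person and hand it over in a machine-readable form. A sketch using the same table and the same LIKE-based lookup as handle_erasure_request; the export format is an assumption:

```python
import sqlite3

def handle_access_request(email: str, db_path: str) -> list:
    """GDPR Art. 15 (access) / Art. 20 (portability): export everything
    stored about a person in a machine-readable form."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("""
        SELECT url, data_json, scraped_at, expires_at, legal_basis
        FROM scraped_data
        WHERE data_json LIKE ?
    """, (f"%{email}%",)).fetchall()
    conn.close()

    return [
        {
            "source_url": url,
            "data": data_json,
            "scraped_at": scraped_at,
            "expires_at": expires_at,
            "legal_basis": legal_basis,
        }
        for url, data_json, scraped_at, expires_at, legal_basis in rows
    ]
```

As with erasure, a production version should also record the export through log_gdpr_action for the audit trail.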

The GDPR-Compliant Playwright Checklist

Before deploying any scraper:

  • [ ] Legal basis documented (not assumed)
  • [ ] Data minimization: only required fields collected
  • [ ] robots.txt respected
  • [ ] Rate limiting implemented (not crawling at full speed)
  • [ ] Data retention limits set and enforced by code
  • [ ] Erasure request handling in place
  • [ ] Audit log of GDPR actions maintained
  • [ ] No sensitive categories (health, ethnicity, religion) scraped
  • [ ] If collecting EU residents' data: processing registered with your Data Protection Authority where required

What About LinkedIn, Facebook, and Other Restricted Sites?

These platforms explicitly prohibit scraping in their Terms of Service. A ToS violation is not automatically a GDPR breach, but it weakens a legitimate interest claim and adds separate contractual and platform enforcement risk.

For these platforms: use official APIs or data providers who have proper data licensing agreements.

Bottom Line

GDPR-compliant scraping is possible. It just requires planning before coding.

The biggest mistake I see: people build the scraper first, then try to bolt on compliance. That does not work. Legal basis first, then architecture.


Need ready-to-use GDPR-compliant scraper templates? The Apify Scrapers Bundle includes 12 production-grade scrapers built with legal basis documentation and data minimization built in.

Get the bundle for €29 → https://vhubster3.gumroad.com/l/fjmtqn
