DEV Community

Vhub Systems
How to Build a GDPR-Compliant Web Scraping Pipeline in 2026 (Step by Step)

Most web scraping guides stop at "here's how to extract the data." This one covers the part that gets companies fined: what to do with the data after you have it.

Here's a complete architecture for a GDPR-compliant B2B data pipeline — from collection to deletion.

What "GDPR Compliant" Actually Requires for Scraped Data

Before writing any code, understand what you're complying with:

  1. Legal basis (Article 6) — You need a documented reason to process personal data. For B2B scraping, usually "legitimate interest" (Article 6(1)(f)).
  2. Data minimisation (Article 5(1)(c)) — Collect only what you need for the stated purpose.
  3. Storage limitation (Article 5(1)(e)) — Delete data when it's no longer needed.
  4. Article 14 notification (for indirect collection) — When personal data isn't collected from the data subject directly, you must inform them that you hold it, unless an Article 14(5) exemption applies.
  5. Data subject rights — Be able to delete or export any individual's data on request.

The pipeline below satisfies all five.

Architecture Overview

[Scraper] → [Staging DB with TTL] → [Enrichment] → [Active DB]
                                                        ↓
                                               [Deletion scheduler]
                                                        ↓
                                               [Audit log]

Three environments:

  • Staging: raw scraped data, short TTL (7 days), never used for outreach
  • Active: cleaned, enriched, consented records used for campaigns
  • Archive: anonymized aggregate data for reporting (no personal data)
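
The three tiers above can be centralised in one policy table so no code hardcodes a TTL. A hypothetical sketch — the field names and `outreach_permitted` helper are illustrative, not part of any standard:

```python
# Hypothetical central policy for the three tiers described above.
RETENTION_POLICY = {
    "staging": {"ttl_days": 7,    "outreach_allowed": False, "holds_personal_data": True},
    "active":  {"ttl_days": 90,   "outreach_allowed": True,  "holds_personal_data": True},
    "archive": {"ttl_days": None, "outreach_allowed": False, "holds_personal_data": False},
}

def outreach_permitted(tier: str) -> bool:
    """Only the active tier may feed campaigns."""
    return RETENTION_POLICY[tier]["outreach_allowed"]
```

Keeping this in one place means a change to retention periods is a one-line diff, not a hunt through scraper code.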

Step 1: Define Your Legal Basis Before Scraping

Document this before collecting anything:

# config/gdpr_config.py
GDPR_CONFIG = {
    "legal_basis": "legitimate_interest",
    "purpose": "B2B outreach to companies matching ICP criteria",
    "data_categories": ["business_email", "name", "job_title", "company"],
    "retention_days": 90,
    "lia_completed": "2026-04-01",  # Legitimate Interest Assessment date
    "controller": "Your Company Ltd",
    "dpo_contact": "privacy@yourcompany.com"
}

This config becomes part of your Record of Processing Activities (ROPA) — required under GDPR Article 30.
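
One way to make the config pull double duty is to render ROPA entries from it directly. A sketch, assuming a `ropa_entry` helper you maintain alongside the config — the output field names are illustrative, not a mandated Article 30 schema:

```python
import json

# The processing config from Step 1
GDPR_CONFIG = {
    "legal_basis": "legitimate_interest",
    "purpose": "B2B outreach to companies matching ICP criteria",
    "data_categories": ["business_email", "name", "job_title", "company"],
    "retention_days": 90,
    "lia_completed": "2026-04-01",
    "controller": "Your Company Ltd",
    "dpo_contact": "privacy@yourcompany.com",
}

def ropa_entry(config: dict) -> dict:
    """Render the processing config as one ROPA record (illustrative field names)."""
    return {
        "processing_activity": config["purpose"],
        "legal_basis": config["legal_basis"],
        "data_categories": config["data_categories"],
        "retention": f"{config['retention_days']} days",
        "controller": config["controller"],
        "dpo_contact": config["dpo_contact"],
    }

entry = ropa_entry(GDPR_CONFIG)
print(json.dumps(entry, indent=2))
```

Generating the record from code keeps the ROPA in sync with what the pipeline actually does.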

Step 2: Scraping With Consent Flags

When scraping, classify each record by jurisdiction and whether you found opt-out signals:

import tldextract

# Heuristic: EU companies on .com (or other EU ccTLDs) need separate handling
EU_TLDS = {'de', 'fr', 'nl', 'be', 'at', 'es', 'it', 'pl', 'se', 'dk', 'fi', 'ie'}

def classify_record(email, company_url):
    domain = tldextract.extract(company_url).suffix
    is_eu = domain in EU_TLDS

    return {
        'is_eu_subject': is_eu,
        'requires_art14': is_eu,  # Article 14 notice required for EU indirect collection
        'retention_days': 90 if is_eu else 180,
        'can_use_for_outreach': True,  # Unless opt-out signal found
    }

def check_for_optout(website_text):
    """Check if site has explicit scraping/marketing opt-out."""
    optout_signals = [
        'do not sell my personal information',
        'opt out of marketing',
        'no marketing contact',
        'nicht für marketing',
        'pas de démarchage',
    ]
    text_lower = website_text.lower()
    return any(signal in text_lower for signal in optout_signals)
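
Tying the two checks together before anything enters staging might look like this — a stdlib-only sketch in which `build_staging_record` is a hypothetical glue helper and the suffix parse is deliberately naive (tldextract handles multi-part suffixes like .co.uk properly):

```python
from urllib.parse import urlparse

EU_TLDS = {'de', 'fr', 'nl', 'be', 'at', 'es', 'it', 'pl', 'se', 'dk', 'fi', 'ie'}
OPTOUT_SIGNALS = ['do not sell my personal information', 'opt out of marketing']

def build_staging_record(email: str, company_url: str, page_text: str) -> dict:
    """Classify jurisdiction, then apply any opt-out signal found on the page."""
    # Naive suffix parse: last label of the hostname (tldextract is more robust)
    suffix = (urlparse(company_url).hostname or "").rsplit(".", 1)[-1]
    is_eu = suffix in EU_TLDS
    has_optout = any(s in page_text.lower() for s in OPTOUT_SIGNALS)
    return {
        'email': email,
        'is_eu_subject': is_eu,
        'requires_art14': is_eu,
        'retention_days': 90 if is_eu else 180,
        'has_optout_signal': has_optout,
        'can_use_for_outreach': not has_optout,
    }

rec = build_staging_record("anna@example.de", "https://example.de/team",
                           "Impressum — Opt out of marketing emails here.")
```

Note how the opt-out check actually flips `can_use_for_outreach`, rather than leaving it hardcoded to `True`.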

Step 3: Staging Database With TTL

Every scraped record enters staging first. The staging table enforces deletion automatically:

CREATE TABLE staging_contacts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email TEXT NOT NULL,
    name TEXT,
    job_title TEXT,
    company TEXT,
    company_url TEXT,
    is_eu_subject BOOLEAN,
    requires_art14 BOOLEAN,
    art14_notified_at TIMESTAMP,
    scraped_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP NOT NULL,  -- set at insert time
    data_source TEXT,
    scraper_version TEXT,
    status TEXT DEFAULT 'pending'  -- pending, approved, rejected, promoted
);

-- Automatic deletion via pg_cron (PostgreSQL extension)
SELECT cron.schedule('delete-expired-staging', '0 * * * *', 
    'DELETE FROM staging_contacts WHERE expires_at < NOW()');

Insert with explicit TTL:

import datetime

def insert_staging(conn, record):
    expires_at = datetime.datetime.utcnow() + datetime.timedelta(
        days=record['retention_days']
    )
    conn.execute("""
        INSERT INTO staging_contacts
        (email, name, job_title, company, company_url, is_eu_subject,
         requires_art14, scraped_at, expires_at, data_source)
        VALUES (%s, %s, %s, %s, %s, %s, %s, NOW(), %s, %s)
    """, (
        record['email'], record['name'], record['job_title'],
        record['company'], record['company_url'],
        record['is_eu_subject'], record['requires_art14'],
        expires_at, record['source']
    ))
    conn.commit()
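
If pg_cron isn't available on your Postgres instance, the same storage limitation can be enforced by an application-side sweep. A minimal sketch, demonstrated against an in-memory SQLite table so it runs standalone — on Postgres you'd compare against `NOW()` and use the schema above:

```python
import datetime
import sqlite3

def purge_expired(conn) -> int:
    """Delete rows whose expires_at is in the past; returns rows removed."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    cur = conn.execute("DELETE FROM staging_contacts WHERE expires_at < ?", (now,))
    conn.commit()
    return cur.rowcount

# In-memory demo: one expired row, one live row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_contacts (email TEXT, expires_at TEXT)")
past = (datetime.datetime.now(datetime.timezone.utc)
        - datetime.timedelta(days=1)).isoformat()
future = (datetime.datetime.now(datetime.timezone.utc)
          + datetime.timedelta(days=7)).isoformat()
conn.execute("INSERT INTO staging_contacts VALUES ('old@x.com', ?)", (past,))
conn.execute("INSERT INTO staging_contacts VALUES ('new@x.com', ?)", (future,))
removed = purge_expired(conn)
```

Run it hourly from any scheduler; the important property is that expiry is enforced somewhere unattended, not left to manual cleanup.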

Step 4: Enrichment and Quality Gating

Before promoting a record to the active database:

def qualify_for_active(staging_record):
    checks = {
        'email_verified': verify_email(staging_record['email']),
        'no_optout': not staging_record.get('has_optout_signal'),
        'not_expired': staging_record['expires_at'] > datetime.datetime.utcnow(),
        'art14_ok': (
            not staging_record['requires_art14'] or 
            staging_record.get('art14_notified_at') is not None
        ),
    }
    return all(checks.values()), checks

def promote_to_active(conn, staging_id):
    qualified, checks = qualify_for_active(get_staging(conn, staging_id))
    if not qualified:
        log_rejection(conn, staging_id, checks)
        return False

    # Assumes active_contacts mirrors the staging schema column-for-column
    conn.execute("""
        INSERT INTO active_contacts SELECT * FROM staging_contacts WHERE id = %s
    """, (staging_id,))
    conn.execute("""
        UPDATE staging_contacts SET status = 'promoted' WHERE id = %s
    """, (staging_id,))
    return True
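
The gate can be exercised in isolation by stubbing the email verifier — a sketch with `verify_email` injected as a parameter (the real pipeline would call a verification service):

```python
import datetime

def qualify_for_active(rec: dict, verify_email=lambda e: True):
    """Return (qualified, per-check results) for a staging record."""
    checks = {
        'email_verified': verify_email(rec['email']),
        'no_optout': not rec.get('has_optout_signal'),
        'not_expired': rec['expires_at'] > datetime.datetime.now(datetime.timezone.utc),
        'art14_ok': (not rec['requires_art14']
                     or rec.get('art14_notified_at') is not None),
    }
    return all(checks.values()), checks

rec = {
    'email': 'anna@example.de',
    'has_optout_signal': False,
    'expires_at': datetime.datetime.now(datetime.timezone.utc)
                  + datetime.timedelta(days=30),
    'requires_art14': True,
    'art14_notified_at': None,  # notice not yet sent
}
qualified, checks = qualify_for_active(rec)
# An EU record stays blocked until the Article 14 notice is recorded
```

Returning the per-check dict (not just a boolean) is what makes `log_rejection` useful: you can see exactly which gate failed.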

Step 5: Article 14 Notification (EU Records)

For EU subjects, GDPR Article 14 requires notifying individuals that you've collected their data. This must happen within "a reasonable period" — at the latest one month after collection (Article 14(3)(a)).

Simple implementation using your outreach tool:

ART14_TEMPLATE = """
Hi {name},

In accordance with GDPR Article 14, I'm writing to inform you that 
I've collected your business contact details from publicly available 
sources (LinkedIn, your company website) for the purpose of 
professional outreach.

Data held: name, business email, job title, company
Legal basis: Legitimate interest (Article 6(1)(f))
Retention: 90 days

You can request deletion at any time by replying to this email.

{sender_name}
"""

def send_art14_notice(record, sender):
    send_email(
        to=record['email'],
        subject=f"Information about your personal data — {record['company']}",
        body=ART14_TEMPLATE.format(**record, sender_name=sender)
    )
    update_art14_timestamp(record['id'])

This doubles as your first outreach touchpoint — a thoughtful, GDPR-compliant introduction.
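
A scheduled job then works through unnotified EU records before their deadlines pass — a pure-Python sketch of the selection logic, with `art14_deadline` and `pending_notices` as hypothetical helpers:

```python
import datetime

def art14_deadline(record: dict) -> datetime.datetime:
    """Latest permissible notice date: one month after collection (Art. 14(3)(a))."""
    return record['scraped_at'] + datetime.timedelta(days=30)

def pending_notices(records: list) -> list:
    """Unnotified EU records, most urgent deadline first."""
    due = [r for r in records
           if r['requires_art14'] and not r.get('art14_notified_at')]
    return sorted(due, key=art14_deadline)

now = datetime.datetime.now(datetime.timezone.utc)
records = [
    {'email': 'a@x.de', 'requires_art14': True, 'art14_notified_at': None,
     'scraped_at': now - datetime.timedelta(days=20)},   # 10 days left
    {'email': 'b@x.de', 'requires_art14': True, 'art14_notified_at': now,
     'scraped_at': now - datetime.timedelta(days=5)},    # already notified
    {'email': 'c@x.com', 'requires_art14': False, 'art14_notified_at': None,
     'scraped_at': now - datetime.timedelta(days=25)},   # not an EU subject
]
due = pending_notices(records)
```

In production this would be a `SELECT ... WHERE requires_art14 AND art14_notified_at IS NULL ORDER BY scraped_at` against the staging table; the in-memory version just makes the rule explicit.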

Step 6: Data Subject Rights Endpoint

Build a simple API endpoint to handle deletion and export requests (in production, verify the requester's identity before acting — Article 12(6) permits asking for additional information to confirm it):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/privacy/request', methods=['POST'])
def handle_dsr():
    email = request.json.get('email')
    request_type = request.json.get('type')  # 'delete' or 'export'

    if request_type == 'delete':
        count = delete_all_records(email)
        log_dsr(email, 'delete', count)
        return jsonify({'status': 'completed', 'records_deleted': count})

    if request_type == 'export':
        records = export_records(email)
        log_dsr(email, 'export', len(records))
        return jsonify({'status': 'completed', 'data': records})

    return jsonify({'status': 'error', 'message': 'unknown request type'}), 400

def delete_all_records(email):
    # Delete from every table holding personal data, then log the deletion
    count = 0
    for table in ('staging_contacts', 'active_contacts'):  # fixed list, safe to interpolate
        result = db.execute(f"DELETE FROM {table} WHERE email = %s", (email,))
        count += result.rowcount

    # Suppression list prevents re-scraped copies of a deleted contact
    db.execute(
        "INSERT INTO suppression_list (email, reason, deleted_at) "
        "VALUES (%s, 'dsr_request', NOW())",
        (email,),
    )
    return count

Response time requirement: GDPR Article 12 requires responding to DSRs within one month.
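
Deletion alone isn't enough: the next scraping run could simply re-collect the same address. The suppression list closes that loop by gating every staging insert — a minimal in-memory sketch (in production the set is the suppression_list table written by the deletion handler):

```python
# In-memory stand-in for the suppression_list table
SUPPRESSION = set()

def record_dsr_delete(email: str) -> None:
    """After a deletion request, remember the address so re-scrapes don't resurrect it."""
    SUPPRESSION.add(email.lower())

def allowed_to_insert(email: str) -> bool:
    """Call before every staging insert; case-insensitive match."""
    return email.lower() not in SUPPRESSION

record_dsr_delete("Jane@Example.com")
```

Lowercasing on both write and read matters — a deleted `Jane@Example.com` must also block `jane@example.com`.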

Step 7: Audit Logging

Every access and deletion should be logged for compliance evidence:

CREATE TABLE gdpr_audit_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type TEXT,  -- 'collect', 'enrich', 'outreach', 'delete', 'export', 'dsr'
    record_id UUID,
    email_hash TEXT,  -- SHA-256 of email, not the email itself
    actor TEXT,       -- which system or person performed the action
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

Log email hashes, not raw emails — you can prove compliance without holding an additional copy of personal data in the audit log.
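
A small helper can enforce the hash-only rule at write time — a sketch, assuming a hypothetical `audit_event` builder whose output maps onto the columns above:

```python
import datetime
import hashlib
import json

def audit_event(event_type: str, email: str, actor: str, metadata: dict = None) -> dict:
    """Build one gdpr_audit_log row; the email is hashed, never stored in clear."""
    return {
        "event_type": event_type,
        "email_hash": hashlib.sha256(email.lower().encode()).hexdigest(),
        "actor": actor,
        "metadata": json.dumps(metadata or {}),
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

row = audit_event("delete", "jane@example.de", "dsr_endpoint", {"tables": 2})
```

Because the same normalisation (lowercase, SHA-256) is used everywhere, you can later prove "we deleted this person's data on this date" by hashing the address in their DSR and matching it against the log.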

What This Gets You

  • Documented legal basis and purpose
  • Automatic expiry for all scraped records
  • EU subject identification and Art. 14 compliance
  • Email verification before active use
  • Data subject rights fulfillment in under 24 hours
  • Audit trail for DPA investigations

This architecture doesn't eliminate GDPR risk — nothing does. But it demonstrates good faith compliance and structured data governance, which is what regulators look for.


Scraping Tools Built for Compliant Pipelines

The 35 actors in this bundle are designed for structured output that integrates cleanly with pipelines like the one above — typed schemas, consistent field names, no junk data.

Apify Scrapers Bundle — €29 — instant download, one-time price.

Includes B2B contact extractor, LinkedIn scraper, and 33 more. All PAY_PER_EVENT: €0.002–€0.005 per result.
