DEV Community

Vhub Systems
GDPR Data Subject Rights and Web Scraping: What You're Required to Handle

If you scrape personal data and store it, GDPR gives individuals specific rights over that data. Ignoring these rights is where most scraping operations get into regulatory trouble — not the act of scraping itself.

Here's what GDPR Articles 15-22 actually require, translated into what scraper operators need to build.

The Eight Rights (Simplified)

GDPR Articles 15-22 define eight rights for EU data subjects:

  1. Right of access — Anyone can request a copy of all personal data you hold about them
  2. Right to rectification — If your data is wrong, they can require you to correct it
  3. Right to erasure ("right to be forgotten") — They can require deletion under specific conditions
  4. Right to restriction — They can pause processing while disputing accuracy
  5. Right to data portability — You must provide their data in machine-readable format
  6. Right to object — They can object to processing based on legitimate interests
  7. Rights related to automated decision-making — Applies to automated profiling
  8. Right to withdraw consent — If consent is your lawful basis, it must be withdrawable

For most scraping operations, rights 1, 3, and 6 are the most operationally significant.
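Operationally, the eight rights above translate into a small set of request handlers. A hypothetical dispatch table is sketched below; the handler names are illustrative, not a standard API (some of them are fleshed out later in this post):

```python
# Hypothetical mapping from incoming rights-request types to handler names.
# Handler names are illustrative; only some are implemented in this article.
RIGHTS_HANDLERS = {
    "access": "handle_sar",                    # Art. 15
    "rectification": "handle_rectification",   # Art. 16
    "erasure": "handle_erasure_request",       # Art. 17
    "restriction": "handle_restriction",       # Art. 18
    "portability": "handle_portability",       # Art. 20
    "objection": "handle_objection",           # Art. 21
}

def route_request(request_type: str) -> str:
    """Return the handler name for an incoming rights request."""
    try:
        return RIGHTS_HANDLERS[request_type.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown rights request type: {request_type!r}")
```

Even if you never automate the handlers themselves, having a single intake point that classifies requests this way makes the one-month deadline much easier to track.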

Right of Access (Article 15) — The Hidden Complexity

What it requires: If someone submits a Subject Access Request (SAR), you have one month to respond with all personal data you hold about them, plus information about how it's used, where it came from, and who it's shared with. Article 12(3) allows a two-month extension for complex requests, provided you notify the requester within the first month.

The scraping problem: When you scrape at scale, you may have thousands of records with no direct relationship to the individuals in them. Someone might email asking "what data do you have about me?" and you'd need to search your entire dataset by email/name/identifier to find their records.

Minimum viable compliance:

At minimum, you need a searchable index. If using PostgreSQL:

```sql
CREATE INDEX idx_email_search ON scraped_data (email);
CREATE INDEX idx_name_search ON scraped_data (LOWER(full_name));
```

And a function to respond to SARs (an asyncpg-style `db` pool is assumed throughout):

```python
async def handle_sar(requester_email: str, requester_name: str) -> dict:
    results = await db.fetch(
        """
        SELECT * FROM scraped_data
        WHERE email = $1 OR LOWER(full_name) LIKE LOWER($2)
        """,
        requester_email, f"%{requester_name}%"
    )
    return {
        "data_held": [dict(r) for r in results],
        "source": "Public web scraping",
        "purpose": "Lead generation",
        "retention": "12 months",
        "recipients": "Internal sales team only"
    }
```

Practical note: For small operations (<10,000 records), responding manually to SARs is feasible. At scale, you need automated tooling.

Right to Erasure / "Right to Be Forgotten" (Article 17)

When it applies to scrapers: If your lawful basis is legitimate interests (most common for scrapers), and the person objects, erasure is required unless you have compelling legitimate grounds that override their interests.

Implementation:

```python
async def handle_erasure_request(
    requester_email: str,
    requester_name: str
) -> dict:
    # Step 1: Find all matching records
    records = await db.fetch(
        "SELECT id FROM scraped_data WHERE email = $1 OR LOWER(full_name) LIKE LOWER($2)",
        requester_email, f"%{requester_name}%"
    )

    ids_to_delete = [r['id'] for r in records]

    if not ids_to_delete:
        return {"status": "no_records_found", "deleted": 0}

    # Step 2: Delete from main table
    await db.execute(
        "DELETE FROM scraped_data WHERE id = ANY($1)",
        ids_to_delete
    )

    # Step 3: Add to suppression list (prevent re-scraping)
    await db.execute(
        "INSERT INTO erasure_suppressions (email, name, erased_at) VALUES ($1, $2, NOW())",
        requester_email, requester_name
    )

    # Step 4: Log the erasure for compliance records
    await db.execute(
        """INSERT INTO compliance_log (action, subject_identifier, records_affected, timestamp)
           VALUES ('erasure', $1, $2, NOW())""",
        requester_email, len(ids_to_delete)
    )

    return {"status": "erased", "deleted": len(ids_to_delete)}
```

Critical: the suppression list prevents you from re-scraping the same person. Without it, you'd comply with an erasure request today and re-add the same data tomorrow on your next crawl.
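The suppression check belongs at ingest time, before a scraped record ever reaches your main table. A minimal sketch, assuming the same asyncpg-style `db` pool and the `erasure_suppressions` table from the erasure handler above (`should_store` and `normalize_email` are hypothetical names):

```python
def normalize_email(email: str) -> str:
    """Lowercase and strip whitespace so suppression matching is exact."""
    return email.strip().lower()

async def should_store(db, record: dict) -> bool:
    """Return False if this record's subject previously requested erasure."""
    email = normalize_email(record.get("email") or "")
    if not email:
        # Nothing to match on; store it and rely on retention limits.
        return True
    suppressed = await db.fetchval(
        "SELECT EXISTS(SELECT 1 FROM erasure_suppressions WHERE LOWER(email) = $1)",
        email,
    )
    return not suppressed
```

Call `should_store` in your pipeline's write path and silently drop suppressed records; re-adding an erased subject is itself a violation, not just an embarrassment.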

Right to Object (Article 21)

What it means for scrapers: If your lawful basis is legitimate interests (Article 6(1)(f)), data subjects can object to processing at any time. You must stop processing unless you can demonstrate compelling legitimate grounds that override their interests.

In practice: treat an objection like an erasure request unless you have a specific, documented reason why your interest is more compelling than theirs. Most scraping operations don't have grounds compelling enough to override an objection.

Building Compliant Infrastructure from Day One

The mistake most developers make: building the scraping pipeline, scaling it to 100k records, and then realizing they have no way to handle rights requests.

Minimum compliance checklist:

□ SAR search capability (by email + name at minimum)
□ Erasure mechanism that actually deletes (not just soft-delete)
□ Suppression list to prevent re-scraping erased subjects
□ Compliance log of all rights requests handled
□ Data retention schedule and automated cleanup
□ Point of contact email in your privacy notice
□ Privacy notice accessible at a public URL
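The suppression list and compliance log from the checklist need only two small tables. A minimal sketch of a migration helper, using table and column names consistent with the earlier examples (the function name and exact column types are assumptions):

```python
async def ensure_compliance_tables(db) -> None:
    """Create the suppression list and compliance log if they don't exist."""
    await db.execute("""
        CREATE TABLE IF NOT EXISTS erasure_suppressions (
            email     TEXT NOT NULL,
            name      TEXT,
            erased_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
        )
    """)
    await db.execute("""
        CREATE TABLE IF NOT EXISTS compliance_log (
            action             TEXT NOT NULL,
            subject_identifier TEXT NOT NULL,
            records_affected   INTEGER NOT NULL DEFAULT 0,
            timestamp          TIMESTAMPTZ NOT NULL DEFAULT NOW()
        )
    """)
```

Run it once at startup; `CREATE TABLE IF NOT EXISTS` makes it safe to call repeatedly.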

Data retention automation — the most commonly skipped requirement:

```python
# Run weekly: delete records older than the retention period.
# (PostgreSQL doesn't allow aggregates in a RETURNING clause,
#  so the count comes from a data-modifying CTE.)
async def cleanup_expired_records() -> int:
    deleted = await db.fetchval(
        """
        WITH removed AS (
            DELETE FROM scraped_data
            WHERE scraped_at < NOW() - INTERVAL '12 months'
            RETURNING id
        )
        SELECT COUNT(*) FROM removed
        """
    )
    return deleted
```

Practical Timeline for Rights Requests

| Right | Response deadline | Extension possible? |
| --- | --- | --- |
| Access (SAR) | One month | Yes, +2 months with notice |
| Erasure | One month | Yes, +2 months with notice |
| Rectification | One month | Yes, +2 months with notice |
| Portability | One month | Yes, +2 months with notice |
| Objection | Acknowledge without undue delay | — |

All of these deadlines come from Article 12(3): one month, extendable by two further months for complex or numerous requests, with notice to the requester. (The 72-hour deadline sometimes cited alongside these applies to breach notification under Article 33, not to rights requests.)

For solo operators and small teams: log all incoming privacy requests to a dedicated email/folder immediately. Missing the one-month window is a straightforward regulatory violation.
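Deadline tracking is trivial to automate. A hypothetical helper (Article 12(3) counts in calendar months; the 30-day approximation below is slightly conservative for most months):

```python
from datetime import date, timedelta

def rights_request_deadlines(received: date) -> dict:
    """Approximate response deadlines for a rights request received on `received`.

    GDPR Art. 12(3): respond within one month, extendable by two further
    months for complex requests. 30/90 days is used as a safe approximation.
    """
    return {
        "respond_by": received + timedelta(days=30),
        "latest_with_extension": received + timedelta(days=90),
    }
```

Stamp every incoming request with these two dates when you file it, and review anything approaching `respond_by` weekly.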

How Enforcement Actually Works

GDPR enforcement for scrapers typically starts with:

  1. A data subject complaint to their national DPA (Data Protection Authority)
  2. The DPA contacts you for a response
  3. You either demonstrate compliance or face investigation

The timeline from complaint to fine is typically 6-24 months for smaller operators. Fines for scraper operators have ranged from €2,000 for minor violations to €8.7M for large-scale commercial data processing without proper basis.

The cheapest path: build the compliance infrastructure before you have records to worry about. A deletion endpoint and a suppression list are a weekend's work. A DPA investigation is months of legal fees.


Related: Scraping Tools That Respect Data Minimisation

If you're building GDPR-aware scraping pipelines, choosing actors that let you select specific output fields (rather than scraping everything by default) makes data minimisation easier.

The Apify Scrapers Bundle (€29) includes 35 actors with configurable output schemas — collect emails, phones, social links, or all three, depending on your documented purpose.
