DEV Community

Vhub Systems
GDPR Data Subject Rights and Web Scraping: What You're Required to Handle

If you scrape personal data and store it, GDPR gives individuals specific rights over that data. Ignoring these rights is where most scraping operations get into regulatory trouble — not the act of scraping itself.

Here's what GDPR Articles 15-22 actually require, translated into what scraper operators need to build.

The Eight Rights (Simplified)

GDPR Articles 15-22 define eight rights for EU data subjects:

  1. Right of access — Anyone can request a copy of all personal data you hold about them
  2. Right to rectification — If your data is wrong, they can require you to correct it
  3. Right to erasure ("right to be forgotten") — They can require deletion under specific conditions
  4. Right to restriction — They can pause processing while disputing accuracy
  5. Right to data portability — You must provide their data in machine-readable format
  6. Right to object — They can object to processing based on legitimate interests
  7. Rights related to automated decision-making — Applies to automated profiling
  8. Right to withdraw consent — If consent is your lawful basis, it must be withdrawable

For most scraping operations, rights 1, 3, and 6 are the most operationally significant.
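Operationally, the eight rights above translate into a small set of request handlers. A hypothetical dispatch table is sketched below; the handler names are illustrative, not a standard API (some of them are fleshed out later in this post):

```python
# Hypothetical mapping from incoming rights-request types to handler names.
# Handler names are illustrative; only some are implemented in this article.
RIGHTS_HANDLERS = {
    "access": "handle_sar",                    # Art. 15
    "rectification": "handle_rectification",   # Art. 16
    "erasure": "handle_erasure_request",       # Art. 17
    "restriction": "handle_restriction",       # Art. 18
    "portability": "handle_portability",       # Art. 20
    "objection": "handle_objection",           # Art. 21
}

def route_request(request_type: str) -> str:
    """Return the handler name for an incoming rights request."""
    try:
        return RIGHTS_HANDLERS[request_type.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown rights request type: {request_type!r}")
```

Even if you never automate the handlers themselves, having a single intake point that classifies requests this way makes the one-month deadline much easier to track.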

Right of Access (Article 15) — The Hidden Complexity

What it requires: If someone submits a Subject Access Request (SAR), you have one month to respond with all personal data you hold about them, plus information about how it's used, where it came from, and who it's shared with. Article 12(3) allows a two-month extension for complex requests, provided you notify the requester within the first month.

The scraping problem: When you scrape at scale, you may have thousands of records with no direct relationship to the individuals in them. Someone might email asking "what data do you have about me?" and you'd need to search your entire dataset by email/name/identifier to find their records.

Minimum viable compliance:

At minimum, you need a searchable index. If using PostgreSQL:

```sql
CREATE INDEX idx_email_search ON scraped_data (email);
CREATE INDEX idx_name_search ON scraped_data (LOWER(full_name));
```

And a function to respond to SARs (an asyncpg-style `db` pool is assumed throughout):

```python
async def handle_sar(requester_email: str, requester_name: str) -> dict:
    results = await db.fetch(
        """
        SELECT * FROM scraped_data
        WHERE email = $1 OR LOWER(full_name) LIKE LOWER($2)
        """,
        requester_email, f"%{requester_name}%"
    )
    return {
        "data_held": [dict(r) for r in results],
        "source": "Public web scraping",
        "purpose": "Lead generation",
        "retention": "12 months",
        "recipients": "Internal sales team only"
    }
```

Practical note: For small operations (<10,000 records), responding manually to SARs is feasible. At scale, you need automated tooling.

Right to Erasure / "Right to Be Forgotten" (Article 17)

When it applies to scrapers: If your lawful basis is legitimate interests (most common for scrapers), and the person objects, erasure is required unless you have compelling legitimate grounds that override their interests.

Implementation:

```python
async def handle_erasure_request(
    requester_email: str,
    requester_name: str
) -> dict:
    # Step 1: Find all matching records
    records = await db.fetch(
        "SELECT id FROM scraped_data WHERE email = $1 OR LOWER(full_name) LIKE LOWER($2)",
        requester_email, f"%{requester_name}%"
    )

    ids_to_delete = [r['id'] for r in records]

    if not ids_to_delete:
        return {"status": "no_records_found", "deleted": 0}

    # Step 2: Delete from main table
    await db.execute(
        "DELETE FROM scraped_data WHERE id = ANY($1)",
        ids_to_delete
    )

    # Step 3: Add to suppression list (prevent re-scraping)
    await db.execute(
        "INSERT INTO erasure_suppressions (email, name, erased_at) VALUES ($1, $2, NOW())",
        requester_email, requester_name
    )

    # Step 4: Log the erasure for compliance records
    await db.execute(
        """INSERT INTO compliance_log (action, subject_identifier, records_affected, timestamp)
           VALUES ('erasure', $1, $2, NOW())""",
        requester_email, len(ids_to_delete)
    )

    return {"status": "erased", "deleted": len(ids_to_delete)}
```

Critical: the suppression list prevents you from re-scraping the same person. Without it, you'd comply with an erasure request today and re-add the same data tomorrow on your next crawl.
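The suppression check belongs at ingest time, before a scraped record ever reaches your main table. A minimal sketch, assuming the same asyncpg-style `db` pool and the `erasure_suppressions` table from the erasure handler above (`should_store` and `normalize_email` are hypothetical names):

```python
def normalize_email(email: str) -> str:
    """Lowercase and strip whitespace so suppression matching is exact."""
    return email.strip().lower()

async def should_store(db, record: dict) -> bool:
    """Return False if this record's subject previously requested erasure."""
    email = normalize_email(record.get("email") or "")
    if not email:
        # Nothing to match on; store it and rely on retention limits.
        return True
    suppressed = await db.fetchval(
        "SELECT EXISTS(SELECT 1 FROM erasure_suppressions WHERE LOWER(email) = $1)",
        email,
    )
    return not suppressed
```

Call `should_store` in your pipeline's write path and silently drop suppressed records; re-adding an erased subject is itself a violation, not just an embarrassment.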

Right to Object (Article 21)

What it means for scrapers: If your lawful basis is legitimate interests (Article 6(1)(f)), data subjects can object to processing at any time. You must stop processing unless you can demonstrate compelling legitimate grounds that override their interests.

In practice: treat an objection like an erasure request unless you have a specific, documented reason why your interest is more compelling than theirs. Most scraping operations don't have grounds compelling enough to override an objection.

Building Compliant Infrastructure from Day One

The mistake most developers make: building the scraping pipeline, scaling it to 100k records, and then realizing they have no way to handle rights requests.

Minimum compliance checklist:

□ SAR search capability (by email + name at minimum)
□ Erasure mechanism that actually deletes (not just soft-delete)
□ Suppression list to prevent re-scraping erased subjects
□ Compliance log of all rights requests handled
□ Data retention schedule and automated cleanup
□ Point of contact email in your privacy notice
□ Privacy notice accessible at a public URL
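The suppression list and compliance log from the checklist need only two small tables. A minimal sketch of a migration helper, using table and column names consistent with the earlier examples (the function name and exact column types are assumptions):

```python
async def ensure_compliance_tables(db) -> None:
    """Create the suppression list and compliance log if they don't exist."""
    await db.execute("""
        CREATE TABLE IF NOT EXISTS erasure_suppressions (
            email     TEXT NOT NULL,
            name      TEXT,
            erased_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
        )
    """)
    await db.execute("""
        CREATE TABLE IF NOT EXISTS compliance_log (
            action             TEXT NOT NULL,
            subject_identifier TEXT NOT NULL,
            records_affected   INTEGER NOT NULL DEFAULT 0,
            timestamp          TIMESTAMPTZ NOT NULL DEFAULT NOW()
        )
    """)
```

Run it once at startup; `CREATE TABLE IF NOT EXISTS` makes it safe to call repeatedly.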

Data retention automation — the most commonly skipped requirement:

```python
# Run weekly: delete records older than the retention period.
# (PostgreSQL doesn't allow aggregates in a RETURNING clause,
#  so the count comes from a data-modifying CTE.)
async def cleanup_expired_records() -> int:
    deleted = await db.fetchval(
        """
        WITH removed AS (
            DELETE FROM scraped_data
            WHERE scraped_at < NOW() - INTERVAL '12 months'
            RETURNING id
        )
        SELECT COUNT(*) FROM removed
        """
    )
    return deleted
```

Practical Timeline for Rights Requests

| Right | Response deadline | Extension possible? |
| --- | --- | --- |
| Access (SAR) | One month | Yes, +2 months with notice |
| Erasure | One month | Yes, +2 months with notice |
| Rectification | One month | Yes, +2 months with notice |
| Portability | One month | Yes, +2 months with notice |
| Objection | Acknowledge without undue delay | — |

All of these deadlines come from Article 12(3): one month, extendable by two further months for complex or numerous requests, with notice to the requester. (The 72-hour deadline sometimes cited alongside these applies to breach notification under Article 33, not to rights requests.)

For solo operators and small teams: log all incoming privacy requests to a dedicated email/folder immediately. Missing the one-month window is a straightforward regulatory violation.
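Deadline tracking is trivial to automate. A hypothetical helper (Article 12(3) counts in calendar months; the 30-day approximation below is slightly conservative for most months):

```python
from datetime import date, timedelta

def rights_request_deadlines(received: date) -> dict:
    """Approximate response deadlines for a rights request received on `received`.

    GDPR Art. 12(3): respond within one month, extendable by two further
    months for complex requests. 30/90 days is used as a safe approximation.
    """
    return {
        "respond_by": received + timedelta(days=30),
        "latest_with_extension": received + timedelta(days=90),
    }
```

Stamp every incoming request with these two dates when you file it, and review anything approaching `respond_by` weekly.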

How Enforcement Actually Works

GDPR enforcement for scrapers typically starts with:

  1. A data subject complaint to their national DPA (Data Protection Authority)
  2. The DPA contacts you for a response
  3. You either demonstrate compliance or face investigation

The timeline from complaint to fine is typically 6-24 months for smaller operators. Fines for scraper operators have ranged from €2,000 for minor violations to €8.7M for large-scale commercial data processing without proper basis.

The cheapest path: build the compliance infrastructure before you have records to worry about. A deletion endpoint and a suppression list are a weekend's work. A DPA investigation is months of legal fees.


Related: Scraping Tools That Respect Data Minimisation

If you're building GDPR-aware scraping pipelines, choosing actors that let you select specific output fields (rather than scraping everything by default) makes data minimisation easier.

The Apify Scrapers Bundle (€29) includes 35 actors with configurable output schemas — collect emails, phones, social links, or all three, depending on your documented purpose.
