Vhub Systems

Posted on Apr 3

GDPR Data Retention for Scraped Data: How Long Can You Keep It?

#gdpr #webscraping #security #tutorial

You scraped 50,000 B2B contact records six months ago. You haven't used most of them. Are you in violation of GDPR?

Probably yes. Here's why, and what to do about it.

The Problem With Scraped Data and GDPR Storage Limits

GDPR Article 5(1)(e) — the storage limitation principle — states that personal data must be kept "no longer than is necessary for the purposes for which the personal data are processed."

For scraped B2B contact data, this creates a specific problem: most scraping workflows collect data in bulk before a specific use case is defined. You scrape LinkedIn profiles or business emails to "have them available." That's not a defined purpose, and it doesn't justify indefinite storage.

The ICO (the UK's data protection authority) and CNIL (its French counterpart) have both enforced this principle against companies that kept personal data in databases with no active processing justification. Fines for retention violations have ranged from €10,000 to €250,000.

What Counts as Personal Data in Scraped Records?

For B2B scraping, the threshold is clear. Any record that could identify a natural person — even in a business context — falls under GDPR if the person is in the EU:

Always personal data: names, direct email addresses (john.doe@company.com), LinkedIn profile URLs, phone numbers
Contextually personal: job titles combined with company names (can identify a specific person)
Usually not personal: generic company emails (info@company.com), company phone numbers, firmographic data

If your scraped record contains a named individual in any EU country, GDPR applies.
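As a rough illustration, the three categories above can be turned into a first-pass triage function. This is a sketch under assumptions: the field names (`name`, `linkedin_url`, `direct_phone`, `email`, `job_title`, `company`) and the list of role-based mailbox names are hypothetical, and borderline records still need human review.

```python
# Hypothetical set of role-based mailbox names; extend it for your own data.
GENERIC_LOCAL_PARTS = {"info", "sales", "support", "contact", "hello", "office", "admin"}


def classify_record(record: dict) -> str:
    """Return 'personal', 'contextual', or 'non_personal' for a scraped record."""
    # Names, profile URLs, and direct phone numbers always identify a person.
    if record.get("name") or record.get("linkedin_url") or record.get("direct_phone"):
        return "personal"

    # An email is personal unless its local part is a generic role mailbox.
    email = (record.get("email") or "").strip().lower()
    if email:
        local_part = email.split("@", 1)[0]
        if local_part not in GENERIC_LOCAL_PARTS:
            return "personal"

    # A job title plus a company name can single out one individual.
    if record.get("job_title") and record.get("company"):
        return "contextual"

    return "non_personal"
```

Treating "contextual" records the same as "personal" ones is the cautious default when you are unsure.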

The Storage Retention Rules by Use Case

GDPR doesn't set a universal retention period. You set it based on your purpose. The key is documenting the purpose and enforcing the limit.

Cold outreach campaigns

Recommended retention: 90 days from collection, or until used in a campaign sequence, whichever is shorter
After campaign ends: delete or anonymize unless there's an active relationship
The "active relationship" exception: if a prospect responds and engages, you have legitimate interest to retain their record as a contact

Market research and competitive analysis

Purpose ends when the research report is complete
Retain for: the duration of the research project + reasonable audit period (typically 6-12 months)
Don't: retain raw personal data indefinitely "in case we need it again"

CRM enrichment

Retain: as long as the customer relationship is active
Delete or anonymize: within 60-90 days of the relationship ending (churn, no further contact)
Document: when each record was collected and what enrichment source was used

Lead generation pipelines

Maximum: 12 months for uncontacted leads (EU authorities have used this as a benchmark)
After contact with no response: delete within 30-90 days depending on jurisdiction
Germany (BDSG, the strictest regime): some authorities suggest a 6-month maximum for uncontacted leads
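One way to make these periods enforceable is to encode them once and refuse to store anything collected without a documented purpose. The sketch below covers only the two use cases with a fixed period (90 days for cold outreach, 12 months for uncontacted leads). Market research and CRM enrichment depend on a project or relationship end date, so they are left out. The names `RETENTION_BY_PURPOSE` and `compute_expiry` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed retention periods taken from the figures above.
RETENTION_BY_PURPOSE = {
    "cold_outreach": timedelta(days=90),
    "lead_generation": timedelta(days=365),
}


def compute_expiry(purpose: str, collected_at: datetime) -> datetime:
    """Return the date after which a record collected for `purpose` must go."""
    try:
        return collected_at + RETENTION_BY_PURPOSE[purpose]
    except KeyError:
        # No documented purpose means no justification to store the record.
        raise ValueError(f"No documented retention period for purpose {purpose!r}") from None


if __name__ == "__main__":
    collected = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(compute_expiry("cold_outreach", collected).date())
```

Raising on an unknown purpose is deliberate: "have them available" never makes it into the table, so it never makes it into the database.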

How to Build a Retention Schedule Into Your Scraping Pipeline

The simplest compliant approach: add a collected_at timestamp and a retain_until field to every record at collection time.

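import datetime

def scrape_contact(url):
    # extract_contact_data() is your own scraper function (not shown here)
    data = extract_contact_data(url)
    # Take one timezone-aware UTC timestamp so both fields share the same base
    collected_at = datetime.datetime.now(datetime.timezone.utc)
    data['collected_at'] = collected_at.isoformat()
    # Set retention: 90 days for cold outreach
    data['retain_until'] = (collected_at + datetime.timedelta(days=90)).isoformat()
    return data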

Then run a nightly deletion job:

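import logging

def purge_expired_records(db):
    # db: a sqlite3-style connection. ISO 8601 UTC strings sort chronologically,
    # so comparing retain_until against "now" as text is safe.
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    deleted = db.execute(
        # Match NULL status explicitly; "status != ..." alone skips those rows
        "DELETE FROM contacts WHERE retain_until < ? "
        "AND (status IS NULL OR status != 'active_relationship')",
        (now,)
    )
    db.commit()  # persist the deletion
    logging.info("Purged %d expired contact records", deleted.rowcount)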

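This pattern (collect with an expiry date, enforce with scheduled deletion) puts the storage limitation principle of Article 5(1)(e) into practice, provided the retention period itself is justified by your documented purpose. The purge log also gives you audit evidence if you're questioned.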
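If you want the audit evidence to outlive your log rotation, one option is to record each purge run in the database itself. The following is a minimal sketch using Python's built-in `sqlite3`. The `purge_audit` table, the simplified `contacts` schema, and the function name are assumptions for illustration, and the audit row stores only a timestamp and a count, never the deleted personal data.

```python
import sqlite3
from datetime import datetime, timezone

# Simplified, hypothetical schema: one data table and one audit table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS contacts (email TEXT, retain_until TEXT, status TEXT);
CREATE TABLE IF NOT EXISTS purge_audit (ran_at TEXT, rows_deleted INTEGER);
"""


def purge_with_audit(conn: sqlite3.Connection, now: datetime = None) -> int:
    """Delete expired contacts and record the run in purge_audit."""
    now = now or datetime.now(timezone.utc)
    cutoff = now.isoformat()
    # "with conn" commits both statements together, or rolls both back on error.
    with conn:
        cur = conn.execute(
            "DELETE FROM contacts WHERE retain_until < ? "
            "AND (status IS NULL OR status != 'active_relationship')",
            (cutoff,),
        )
        conn.execute(
            "INSERT INTO purge_audit (ran_at, rows_deleted) VALUES (?, ?)",
            (cutoff, cur.rowcount),
        )
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        ("old@example.com", "2024-06-01T00:00:00+00:00", None),
    )
    print(purge_with_audit(conn), "record(s) purged")
```

Because the deletion and the audit row share one transaction, the audit table cannot claim a purge that did not happen.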

The Legitimate Interest Loophole (and Its Limits)

GDPR Article 6(1)(f) allows processing for "legitimate interests pursued by the controller" as a legal basis. Many scraping operations use this to justify indefinite retention.

It doesn't work that way. Relying on legitimate interest means passing a three-part test:

Purpose test: Is there a genuine business interest?
Necessity test: Is processing personal data necessary for that interest?
Balancing test: Does the individual's privacy interest override your business interest?

Critically: legitimate interest applies to the initial processing (scraping), not to permanent retention. You still need to delete data when it's no longer necessary for the specific purpose you identified in your LIA (Legitimate Interest Assessment).
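To keep the LIA and the retention period from drifting apart, some teams store the assessment as structured data next to the pipeline configuration. The sketch below is one hypothetical shape for such a record. The class and field names are assumptions, not a regulator-approved template, and the completeness check only verifies that each part of the test was written down, not that the reasoning holds up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LegitimateInterestAssessment:
    """A written record of the three-part test plus the resulting retention period."""
    purpose: str         # Purpose test: the specific business interest
    necessity: str       # Necessity test: why personal data is needed for it
    balancing: str       # Balancing test: why the interest is not overridden
    assessed_on: date
    retention_days: int  # Retention period that follows from the purpose

    def is_complete(self) -> bool:
        """True only if every part is filled in and a finite retention period is set."""
        answers = (self.purpose, self.necessity, self.balancing)
        return all(a.strip() for a in answers) and self.retention_days > 0


if __name__ == "__main__":
    lia = LegitimateInterestAssessment(
        purpose="Cold outreach for a single product launch campaign",
        necessity="",  # "We might need it later" leaves this blank
        balancing="Business contact details only, opt-out in first email",
        assessed_on=date(2024, 4, 3),
        retention_days=90,
    )
    print(lia.is_complete())
```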

"We might need it later" fails the necessity test.

Jurisdiction-Specific Rules

Germany (BDSG + GDPR): Most restrictive in the EU. Federal courts have found that scraping business contact data without direct notice to data subjects violates GDPR Article 14. Recommended: notify scraped contacts via email within 30 days, or don't retain the data.

Netherlands (AP): Has penalized companies for retaining scraped data beyond the stated purpose period. Retention above 12 months without active use has triggered investigations.

France (CNIL): Publishes specific guidance on B2B data retention. For marketing: 3 years maximum from last contact. For prospects who never responded: 1 year from collection.

UK (ICO, post-Brexit): Uses the UK GDPR, which largely mirrors the EU rules. ICO guidance on direct marketing data: delete uncontacted records after 12 months.

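Outside the EU: GDPR's territorial scope (Article 3) follows the data subject, not your office. If you offer goods or services to people in the EU or monitor their behavior, EU GDPR applies regardless of where you operate.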
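When a pipeline covers several countries, one simple approach is to cap the purpose-based retention period by the strictest figure for the data subject's jurisdiction. The sketch below uses only the uncontacted-lead figures quoted in this article. The country codes, the 182-day approximation of six months, and the function name are assumptions, and the table is not a substitute for checking current guidance.

```python
from datetime import timedelta

# Caps for uncontacted leads, using the figures quoted in this article.
UNCONTACTED_LEAD_CAP = {
    "DE": timedelta(days=182),  # roughly 6 months
    "NL": timedelta(days=365),
    "FR": timedelta(days=365),
    "GB": timedelta(days=365),
}

# Fallback: the 12-month benchmark for uncontacted leads.
DEFAULT_CAP = timedelta(days=365)


def effective_retention(purpose_period: timedelta, country_code: str) -> timedelta:
    """Return the shorter of the purpose-based period and the jurisdiction cap."""
    cap = UNCONTACTED_LEAD_CAP.get(country_code.upper(), DEFAULT_CAP)
    return min(purpose_period, cap)


if __name__ == "__main__":
    # A 12-month lead-generation period is cut to about 6 months for Germany.
    print(effective_retention(timedelta(days=365), "DE").days)
```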

Practical Compliance Checklist

Before deploying any B2B scraping pipeline:

[ ] Define the specific purpose in writing (not "general marketing")
[ ] Set a concrete retention period tied to the purpose
[ ] Add collected_at and retain_until to your data schema
[ ] Build an automated deletion job (run it at least weekly)
[ ] Document your Legitimate Interest Assessment if using Article 6(1)(f)
[ ] Consider Article 14 notification obligations for your target jurisdictions
[ ] Test your deletion pipeline — don't assume it works
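For the last item, a throwaway in-memory database is enough to check the three cases that matter: an expired record is deleted, an unexpired record survives, and an expired record with an active relationship is kept. This is a sketch only. The `purge_expired` function here is a stand-in for your real deletion job, and the table layout is a simplified assumption.

```python
import sqlite3


def purge_expired(conn: sqlite3.Connection, now_iso: str) -> int:
    """Stand-in for the deletion job: removes expired, non-exempt rows."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM contacts WHERE retain_until < ? "
            "AND (status IS NULL OR status != 'active_relationship')",
            (now_iso,),
        )
    return cur.rowcount


def run_purge_check():
    """Seed three representative rows, purge, and report what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contacts (email TEXT, retain_until TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        [
            ("expired@example.com", "2024-01-01T00:00:00+00:00", None),
            ("fresh@example.com", "2099-01-01T00:00:00+00:00", None),
            ("customer@example.com", "2024-01-01T00:00:00+00:00", "active_relationship"),
        ],
    )
    deleted = purge_expired(conn, "2025-01-01T00:00:00+00:00")
    remaining = sorted(row[0] for row in conn.execute("SELECT email FROM contacts"))
    return deleted, remaining


if __name__ == "__main__":
    deleted, remaining = run_purge_check()
    assert deleted == 1
    assert remaining == ["customer@example.com", "fresh@example.com"]
    print("purge check passed")
```

Note the first row has a NULL status: that is the case a plain `status != 'active_relationship'` filter silently skips, so it is worth keeping in the test data.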

Scraping at Scale Without GDPR Headaches

Building compliant pipelines is easier when your scraping infrastructure handles the hard parts. The Apify actors in this bundle are built with rate limiting, robots.txt compliance, and output schemas designed for GDPR-friendly downstream processing.

Apify Scrapers Bundle — €29 — 35 production-ready actors, instant download.

Includes documentation on recommended data retention setups for each use case.

DEV Community