Vhub Systems

Posted on Apr 3

GDPR Article 6 and Web Scraping: The Legal Basis Checklist Most Developers Skip

#webscraping #gdpr #security

Your web scraper is probably GDPR non-compliant. Not because you're collecting illegal data — but because you haven't documented why you're allowed to collect it.

GDPR Article 6 requires a lawful basis for processing personal data. Scraping publicly available data still counts as processing if it includes personal data (names, emails, job titles, profile photos).

Here's the checklist I use before deploying any scraper that touches personal data.

The 6 Lawful Bases (and Which Apply to Scraping)

1. Consent — Almost Never Applies to Scraping

You'd need the data subject to explicitly opt in to you collecting their data. Since you're scraping without their knowledge, this basis almost never applies.

Exception: If you're scraping your own customers' data from a platform they authorized you to access.

2. Contract — Narrow Application

Applies when processing is necessary to fulfill a contract with the data subject directly.

Scraping use case: Enriching contact data for someone who signed up to your service and agreed to data enrichment in your terms.

3. Legal Obligation — Rare

You're legally required to process this data. Rarely applies to scraping scenarios.

4. Vital Interests — Almost Never

Processing necessary to protect someone's life. Not applicable to typical commercial scraping.

5. Public Task — Government/Research Only

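Applies when processing is necessary for a task carried out in the public interest or in the exercise of official authority, and that task has a basis in EU or member state law. In practice this means public authorities and some research organizations.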

Scraping use case: Academic researchers scraping public records for non-commercial analysis. Must be documented.

6. Legitimate Interests — The One That Actually Applies

This is the lawful basis that covers most B2B scraping. Article 6(1)(f) allows processing when:

You have a legitimate interest (commercial, security, fraud prevention, etc.)
The processing is necessary for that interest
The interest is not overridden by the data subject's rights

The Legitimate Interests Assessment (LIA)

You must complete a 3-part test and document it:

Part 1 — Purpose test: What is your specific interest?

✅ "We scrape LinkedIn job titles to verify B2B prospect data accuracy before outbound sales contact"
❌ "We collect data" (too vague)

Part 2 — Necessity test: Is scraping the minimum required?

✅ "No API provides this data in real-time; manual lookup at scale is not feasible"
❌ "We prefer scraping to paying for a data provider" (not a necessity argument)

Part 3 — Balancing test: Do the individual's rights override your interest?

✅ "Data is limited to professional context (job title, company), not personal life. Data subjects would reasonably expect professional data to be used for B2B purposes."
❌ Using home addresses or family relationships for commercial purposes
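
To keep the three answers next to the scraper code, I find it helps to store them as a small record. The sketch below is one possible shape, not a standard format: the class name, the field names, the example date, and the five-word threshold are all my own assumptions. The check only catches placeholder answers like "We collect data"; it cannot judge whether the reasoning would hold up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LegitimateInterestsAssessment:
    """Hypothetical record of the 3-part test; field names are illustrative."""
    purpose: str        # Part 1: the specific interest
    necessity: str      # Part 2: why scraping is the minimum required
    balancing: str      # Part 3: why individuals' rights do not override it
    completed_on: date

    def missing_parts(self, min_words: int = 5) -> list[str]:
        """Return the parts whose answer is too short to be a real justification.

        The word-count threshold is an arbitrary assumption.
        """
        answers = {
            "purpose": self.purpose,
            "necessity": self.necessity,
            "balancing": self.balancing,
        }
        return [name for name, text in answers.items()
                if len(text.split()) < min_words]


lia = LegitimateInterestsAssessment(
    purpose="We collect data",
    necessity="No API provides this data in real-time; manual lookup at scale is not feasible",
    balancing="Data is limited to professional context (job title, company), not personal life",
    completed_on=date(2025, 1, 15),  # example date only
)
print(lia.missing_parts())  # ['purpose']
```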

What to Document Before You Deploy

Create a one-page document covering:

1. Data types collected (exact fields)
2. Source(s) of data
3. Lawful basis (usually Legitimate Interests)
4. LIA completed: [date]
5. Retention period (GDPR sets no fixed number, only "no longer than necessary"; my default is 90 days for contact data)
6. Deletion mechanism (how you delete when asked)
7. Privacy notice URL (must mention you collect this data)
8. DPO notified: [yes/no]

This document is your audit shield. If a data protection authority (DPA) investigates, you show them this.
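
The one-pager can also live as structured data in the scraper's repository, so gaps are visible before deployment. This is a minimal sketch under my own assumptions: the class, its field names, and the example URLs are hypothetical and simply mirror the eight items listed above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ScraperComplianceRecord:
    """Hypothetical one-page record; fields mirror the eight items above."""
    data_fields: list[str]
    sources: list[str]
    lawful_basis: str
    lia_completed: Optional[str]    # ISO date string, None if not done yet
    retention_days: Optional[int]
    deletion_mechanism: str
    privacy_notice_url: str
    dpo_notified: bool


def unfilled(record: ScraperComplianceRecord) -> list[str]:
    """List items still empty. dpo_notified is skipped: False is a valid answer."""
    return [key for key, value in asdict(record).items()
            if key != "dpo_notified" and value in (None, "", [])]


record = ScraperComplianceRecord(
    data_fields=["job_title", "company"],
    sources=["https://example.com/team"],
    lawful_basis="Art. 6(1)(f) Legitimate Interests",
    lia_completed=None,
    retention_days=90,
    deletion_mechanism="",
    privacy_notice_url="https://example.com/privacy",
    dpo_notified=False,
)
print(unfilled(record))  # ['lia_completed', 'deletion_mechanism']
print(json.dumps(asdict(record), indent=2))  # the document itself, as JSON
```

Failing a CI job when `unfilled()` returns anything is one way to make "document before you deploy" hard to skip.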

The 3 Things That Make Scraping Higher Risk

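Special category data: health, political opinions, religion, sexual orientation. GDPR Article 9 applies on top of Article 6. Do not scrape this without explicit consent or another Article 9(2) condition.
Non-EU entities scraping EU residents: GDPR is not limited to companies incorporated in the EU. Under Article 3(2) it applies when you offer goods or services to people in the EU or monitor their behaviour, so a US company scraping EU LinkedIn profiles to sell to or profile those people is in scope.
Automated decision-making: if your scraper feeds data into algorithms that make decisions about individuals (credit scoring, job screening), Article 22 applies when those decisions are solely automated and have legal or similarly significant effects. Extra obligations follow.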
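
Two of these risks can be screened for mechanically before a scraper ships. The sketch below checks planned field names against a keyword list and takes a flag for automated decisions. The keyword set and the function name are my assumptions, and a name match is only a prompt for human review, not a legal definition of Article 9 data.

```python
# Keyword hints are an assumption for illustration, not a legal definition
# of Article 9 special category data.
SPECIAL_CATEGORY_HINTS = {
    "health", "medical", "political", "religion", "religious",
    "sexual", "orientation", "ethnicity", "biometric",
}


def risk_flags(field_names, feeds_automated_decisions=False):
    """Return human-readable warnings for a planned scraper."""
    flags = []
    for name in field_names:
        # Split "health_conditions" or "health-conditions" into word tokens.
        tokens = set(name.lower().replace("-", "_").split("_"))
        if tokens & SPECIAL_CATEGORY_HINTS:
            flags.append(f"'{name}' may be special category data (Art. 9)")
    if feeds_automated_decisions:
        flags.append("output feeds automated decisions (check Art. 22)")
    return flags


for warning in risk_flags(
    ["job_title", "health_conditions", "political_affiliation"],
    feeds_automated_decisions=True,
):
    print(warning)
```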

Practical Checklist

[ ] Identified all personal data fields being collected
[ ] Confirmed lawful basis (usually Art. 6(1)(f) Legitimate Interests)
[ ] Completed and documented LIA
[ ] Privacy notice updated to mention indirect data collection
[ ] Retention limits set (default: 90 days for contact data)
[ ] Deletion endpoint exists (respond to erasure requests without undue delay, and within one month at the latest)
[ ] Data minimization applied (only collect what you need)
[ ] Cross-border transfers assessed (SCCs if sending EU data to non-EU processors)
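
Three of these items (minimization, retention, deletion) translate directly into code. Here is a minimal in-memory sketch, assuming records are plain dicts and erasure requests are matched by email; the allowlist, the function names, and the storage model are all hypothetical, and a real pipeline would apply the same logic against its database.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_FIELDS = {"name", "job_title", "company", "email"}  # assumed allowlist
RETENTION = timedelta(days=90)  # my default, not a figure GDPR mandates


def minimize(raw: dict) -> dict:
    """Keep only allowlisted fields and stamp the collection time (UTC)."""
    kept = {k: v for k, v in raw.items() if k in ALLOWED_FIELDS}
    kept["collected_at"] = datetime.now(timezone.utc)
    return kept


def purge_expired(records: list[dict], now: datetime) -> list[dict]:
    """Drop records older than the retention window."""
    return [r for r in records if now - r["collected_at"] <= RETENTION]


def erase_subject(records: list[dict], email: str) -> list[dict]:
    """Remove every record matching an erasure request (matched by email here)."""
    target = email.lower()
    return [r for r in records if (r.get("email") or "").lower() != target]
```

Running `purge_expired` on a schedule and calling `erase_subject` from the deletion endpoint covers the retention and erasure boxes; `minimize` belongs at the point where scraped data first enters storage.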

When You Use Apify

If you're using cloud-hosted scrapers like Apify, check:

Apify processes data on your behalf = they're a data processor under GDPR
You need a data processing agreement with Apify, as Article 28 requires for any processor (available in their terms)
Data residency: Apify servers are in the US — you need SCCs or equivalent for EU personal data

The Apify Scrapers Bundle includes a GDPR compliance note for each actor covering data fields, retention defaults, and recommended handling.

The Quick Version

Scraping personal data = GDPR applies regardless of source
Legitimate Interests (Art. 6(1)(f)) is your most likely lawful basis
Document the 3-part LIA test before deploying
Don't scrape special category data without consent
Respond to erasure requests without undue delay, and within one month at the latest

The companies that get investigated aren't necessarily the ones collecting the most data — they're the ones who can't explain why they're collecting it.

Building scrapers at scale? The Apify Scrapers Bundle ($29) includes 30 pre-built actors — each documented with the data fields collected, making your GDPR inventory easier to complete.

Related Tools

contact-info-scraper
