Developers building web scrapers face a question that sounds simple but isn't: if a website's robots.txt says Disallow: /, does scraping that site violate GDPR?
The short answer: robots.txt and GDPR are separate legal frameworks, and confusing them causes real legal risk. Here's how they actually interact.
## What robots.txt Actually Is
robots.txt is a technical convention — not a legal instrument. It's a text file at the root of a domain that tells crawlers which paths they should avoid. No law mandates that scrapers obey it.
Three things robots.txt cannot do:
- Create a binding legal obligation (it's not a contract)
- Invoke GDPR protections on its own
- Override your legitimate interest in accessing publicly available data
What robots.txt can do: inform Computer Fraud and Abuse Act (CFAA) claims in the US. If a site explicitly disallows access and you scrape anyway, that instruction might support a "without authorization" argument — though US courts have been skeptical of CFAA claims over publicly accessible data (hiQ v. LinkedIn being the landmark case).
In the EU, robots.txt has no direct GDPR connection. GDPR governs the processing of personal data, not the technical method of access.
## Where GDPR Actually Applies to Scraping
GDPR kicks in when you scrape personal data — any information relating to an identified or identifiable natural person. That includes:
- Names and email addresses
- Phone numbers
- Social media profiles linked to real identities
- IP addresses (under some interpretations)
- Employment information tied to individuals
If your scraper collects any of this, GDPR applies regardless of whether the site had a robots.txt restriction.
The relevant GDPR questions for scrapers:
1. What's your lawful basis?
Article 6 requires one of six legal bases. For most scraping use cases, you're relying on "legitimate interests" (Article 6(1)(f)) — but this requires a three-part assessment: a legitimate purpose, necessity of the processing, and a balancing of your interest against the individual's rights.
2. Have you addressed data minimisation?
Article 5(1)(c): collect only what's necessary. If you're scraping for lead generation, collecting home addresses you'll never use fails this test.
3. Do you have a retention policy?
Scraped personal data can't be stored indefinitely. You need a documented retention period and deletion mechanism.
4. Can data subjects exercise their rights?
If someone asks you to delete their data, you need a process to find and remove it. This is operationally complex at scale.
## The Actual Legal Risks for Scrapers
Roughly in order of how often they arise in practice:
1. Terms of Service violation (civil, not criminal)
Most sites prohibit scraping in their ToS. This creates breach of contract exposure, not criminal liability. The legal risk here is civil lawsuits from the platform, not GDPR enforcement.
2. CFAA claims (US only)
The Computer Fraud and Abuse Act creates criminal and civil liability for accessing computers "without authorization." Post-hiQ, this risk is lower for public data, but not zero.
3. GDPR enforcement (EU personal data)
Regulators can fine up to €20M or 4% of global annual turnover, whichever is higher, for serious violations. Enforcement has focused on large platforms and data brokers, but smaller operators aren't immune.
4. Database rights (EU only)
The Database Directive protects substantial investment in compiling databases. If you systematically extract a substantial part of a database, you may infringe database rights independent of GDPR.
## A Practical Compliance Framework
For scrapers collecting personal data:
Step 1: Identify what personal data you're collecting
Before building: list every data field. Flag any that can identify individuals. These fields trigger GDPR obligations.
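A minimal sketch of that audit, assuming hypothetical field names (the `PERSONAL_DATA_FIELDS` set is an illustration — you'd maintain your own list per project):

```python
# Fields that can identify an individual -- each one triggers GDPR duties.
# Hypothetical examples; maintain your own list for your schema.
PERSONAL_DATA_FIELDS = {"name", "email", "phone", "profile_url", "ip_address"}

def audit_schema(fields: list[str]) -> dict[str, list[str]]:
    """Split a scraper's output schema into personal and non-personal fields."""
    personal = [f for f in fields if f in PERSONAL_DATA_FIELDS]
    other = [f for f in fields if f not in PERSONAL_DATA_FIELDS]
    return {"personal": personal, "other": other}
```

Running this against your planned data model before you write the scraper makes the GDPR surface explicit: every field in the `personal` bucket needs a lawful basis, a retention period, and a deletion path.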
Step 2: Document your legitimate interest
Write a one-paragraph legitimate interest assessment (LIA): state your purpose, explain why it is legitimate, and explain why it does not override the individual's rights and freedoms.
Step 3: Honor opt-outs and deletion requests
Build a deletion pipeline before you start collecting at scale. Being unable to find and delete a specific person's data is a common source of regulatory complaints.
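The operational core of that pipeline is indexing records by a stable subject identifier so a deletion request can actually find everything. A toy in-memory sketch, assuming email is the identifier (a hypothetical choice — use whatever key your data has):

```python
import hashlib

def subject_key(email: str) -> str:
    """Stable lookup key for a data subject: a normalized email hash."""
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

class ScrapeStore:
    """Toy in-memory store; a real build would index a database the same way."""

    def __init__(self) -> None:
        self._by_subject: dict[str, list[dict]] = {}

    def add(self, record: dict) -> None:
        # Index at write time -- retrofitting this onto millions of
        # unindexed records is where deletion requests become painful.
        self._by_subject.setdefault(subject_key(record["email"]), []).append(record)

    def delete_subject(self, email: str) -> int:
        """Service an erasure request; returns the number of records removed."""
        removed = self._by_subject.pop(subject_key(email), [])
        return len(removed)
```

The normalization step matters: without it, `Ann@Example.com` and `ann@example.com` land under different keys and a deletion request silently misses half the records.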
Step 4: Collect the minimum viable personal data
Can your use case work with anonymized or aggregated data? If yes, collect that. If you need individual-level data, document why.
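Minimisation is easiest to enforce as a whitelist projection at ingestion time: fields your documented purpose doesn't need never enter storage. A sketch, with a hypothetical `NEEDED_FIELDS` set standing in for whatever your LIA justifies:

```python
# Only the fields the documented purpose actually needs (hypothetical).
NEEDED_FIELDS = {"company", "job_title", "city"}

def minimise(record: dict) -> dict:
    """Project a raw scraped record down to the whitelisted fields.

    Anything not explicitly needed (e.g. a home address picked up
    incidentally) is dropped before storage.
    """
    return {k: v for k, v in record.items() if k in NEEDED_FIELDS}
```

A whitelist is safer than a blacklist here: when the source site adds a new field, a blacklist silently collects it, while a whitelist silently drops it.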
Step 5: Check robots.txt as a signal of the site owner's intent
Even if robots.txt has no legal force under GDPR, a restrictive robots.txt tells you the site owner explicitly objects to crawling. This is relevant to your legitimate interest assessment — a strong objection weakens your legitimate interest claim.
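The check itself is a few lines with the standard library's `urllib.robotparser`. This sketch parses robots.txt text directly; in production you'd fetch the file from `https://<domain>/robots.txt` first:

```python
import urllib.robotparser

def crawl_permitted(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url.

    A False here isn't a legal verdict -- it's an explicit objection
    to record in your legitimate interest assessment.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Logging the result of this check per target site gives you a paper trail showing you considered the owner's stated position before scraping.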
## The robots.txt + GDPR Intersection
Here's where they do intersect: if you scrape a site that explicitly prohibits crawling, and you collect personal data from it, you face two simultaneous problems:
- ToS violation: Creates civil exposure regardless of data type
- GDPR processing without lawful basis: The site owner's explicit objection undermines your legitimate interest claim
The combination is worse than either separately. A German court in 2023 held that scraping professional profiles from a platform in violation of its ToS also constituted unlawful processing of personal data, because the objection to crawling was relevant to the legitimate interest balancing test.
The safe zone: scraping public data from sites that permit crawling, for clearly legitimate purposes, with proper data minimisation.
## What This Means for Your Scraper Build
If you're building a scraper for business use:
- Check robots.txt — not because it's legally binding, but because it tells you the site's position on crawling
- Identify which fields contain personal data before building the data model
- Write a legitimate interest assessment before going to production
- Build deletion capability from day one, not as an afterthought
The developers who run into GDPR trouble aren't usually the ones who ignored robots.txt on purpose. They're the ones who collected personal data without thinking about GDPR at all.
## Related: Production Web Scraping Tools
If you're building compliant scrapers and need production-ready infrastructure, I maintain 35 Apify actors covering contact info, SERP, social media, e-commerce, and more — all with configurable output schemas that let you collect only what you need.
Apify Scrapers Bundle — €29 — one-time download, all 35 actors with workflow guides.