Every week, someone asks me whether web scraping is legal. The honest answer is that it depends on what you scrape, how you scrape it, and where you are.
This is not a legal guide. I am not a lawyer. This is a technical practitioner's map of the landscape after running scraping infrastructure for European clients for three years, building automated research pipelines, and dealing with the compliance questions that arise.
What follows is a framework for thinking about scraping legality, a set of operational rules that keep you out of trouble, and the specific stack I use to stay compliant.
The Three Legal Frameworks You Need to Understand
United States: The CFAA Standard
In 2019, the Ninth Circuit decided hiQ Labs v. LinkedIn, and reaffirmed its holding in 2022 after a Supreme Court remand. hiQ scraped publicly available LinkedIn profiles. LinkedIn sued under the Computer Fraud and Abuse Act (CFAA). The court held that scraping publicly available data — data visible without authentication — does not constitute "unauthorized access" under the CFAA.
The boundary is clear: if data requires a login, scraping it without authorization is a CFAA risk. If data is public, scraping is generally legal.
Practical rule for US operations: scrape what you could see in an incognito browser window. Do not bypass authentication. Do not circumvent technical barriers like CAPTCHA at scale.
European Union: Database Rights + GDPR
The EU has two layers of protection.
The Database Directive (96/9/EC) grants a sui generis right to creators who have made a "substantial investment" in collecting, verifying, or presenting data. If you scrape a competitor's curated database and replicate it, you may violate this right. The threshold is case-specific.
GDPR (2016/679) protects personal data. Even if data is publicly visible — like social media profiles — it retains GDPR protection. You need a lawful basis to process it. For scraping, this usually means a "legitimate interest assessment" that weighs your commercial purpose against the data subject's rights.
Practical rule for EU operations: do not scrape personal data without a documented legitimate interest assessment. And even then, minimize what you collect.
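What does "documented" mean in practice? A record you can produce on request. Here is a minimal sketch of the fields such an assessment record might carry (this structure is my illustration, not a legal template):

```python
# Hypothetical legitimate interest assessment record; field names are illustrative
LIA_RECORD = {
    "purpose": "aggregate salary statistics for market research",
    "lawful_basis": "legitimate interest, GDPR Art. 6(1)(f)",
    "data_categories": ["job title", "salary range", "region"],
    "personal_data_collected": False,
    "balancing_test": "aggregate statistics only; no identifiable individuals stored",
    "assessed_on": "2024-05-01",
    "reviewed_by": "data protection officer",
}
```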
Germany: The Strictest Endpoint
Germany applies EU law and adds its own layer: BGB § 823 (tortious interference with business operations). German courts have held that systematic scraping that degrades server performance can constitute tort liability. The threshold is lower than in the US.
Practical rule for German operations: respect robots.txt, limit request rates, and avoid scraping German competitors for commercial replication.
A Decision Matrix for Scraping
| Scenario | US Legal Risk | EU Legal Risk | German Legal Risk | Practical Advice |
|---|---|---|---|---|
| Public government datasets | Very Low | Very Low | Very Low | ✅ Safe |
| Public e-commerce listings | Low | Low | Low-Medium | ✅ Safe with rate limits |
| Public news articles | Very Low | Low | Low | ✅ Safe |
| Public academic papers | Very Low | Low | Low | ✅ Safe |
| Login-protected pricing | High | High | High | ❌ Do not scrape |
| Login-protected user profiles | High | Very High | Very High | ❌ Do not scrape |
| CAPTCHA-protected sites | High | Medium-High | Medium-High | ⚠️ Manual only, not systematic |
| robots.txt disallowed paths | Low-Medium | Medium | Medium-High | ⚠️ Respect robots.txt |
| Personal data (names, emails) | Medium | Very High | Very High | ❌ Do not collect |
| Systematic server overload | Low-Medium | Medium | High | ❌ Rate limit aggressively |
| Competitor database replication | Medium | Medium-High | High | ⚠️ Consult legal counsel |
Operational Rules That Keep You Out of Trouble
Rule 1: Scrape Only What You Need
The most common mistake is over-collection. You want pricing data, so you scrape the entire page, including user reviews, related products, and other user-generated content. Now you hold personal data you never needed.
Define your data model before you write your first selector:
```python
# BAD: Scrape everything
SELECTOR = "*"

# GOOD: Define exactly what you need
SCHEMA = {
    "product_name": "css:.product-title h1",
    "price": "css:.price-current",
    "currency": "css:.price-currency",
    # Explicitly NOT including: reviews, user names, social profiles
}
```
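Applying the schema is then a mechanical loop over the declared selectors, so nothing outside it can reach storage. A minimal sketch using the parsel library (the `css:` prefix convention and the `extract_record` helper are my illustration):

```python
from parsel import Selector  # pip install parsel

def extract_record(html: str, schema: dict) -> dict:
    """Extract only the fields declared in the schema."""
    sel = Selector(text=html)
    record = {}
    for field, rule in schema.items():
        css = rule.removeprefix("css:")
        # ::text is parsel's pseudo-element for a node's text content
        record[field] = (sel.css(f"{css}::text").get() or "").strip()
    return record
```

Anything not named in `SCHEMA` never enters the record, which operationalizes the rule at the code level instead of relying on developer restraint.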
Rule 2: Respect robots.txt
This is not optional. Every jurisdiction that has addressed the question treats robots.txt as a meaningful signal of the site owner's intent.
```python
from urllib.robotparser import RobotFileParser

def fetch_if_allowed(base_url, target_path, agent="MyBot/1.0"):
    rp = RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()  # Download and parse robots.txt
    if not rp.can_fetch(agent, target_path):
        return None  # Skip this URL
    ...  # Proceed with the request
```
If you ignore robots.txt systematically, courts can infer bad faith. That matters in civil liability cases.
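One operational note: re-fetching robots.txt on every request adds needless load, which undercuts the politeness you are trying to demonstrate. A sketch of per-domain caching with the standard library (the cache size is an arbitrary assumption):

```python
from functools import lru_cache
from urllib.robotparser import RobotFileParser

@lru_cache(maxsize=256)
def robots_for(base_url: str) -> RobotFileParser:
    # Parsed once per domain, then served from memory
    rp = RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp
```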
Rule 3: Rate Limit as a Policy, Not an Afterthought
Implement request delays at the infrastructure level, not the script level:
```python
import time
from collections import defaultdict
import requests

# Infrastructure-level rate limiting: (max requests, period in seconds)
RATE_LIMITS = {
    "example.com": (1, 1.0),     # 1 request per second
    "slowsite.de": (1, 5.0),     # 1 request per 5 seconds
    "newssource.io": (10, 1.0),  # 10 requests per second (generous)
}

_last_request = defaultdict(float)  # domain -> time of last request

def polite_request(domain, url):
    limit, period = RATE_LIMITS.get(domain, (1, 1.0))
    min_interval = period / limit
    wait = _last_request[domain] + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)  # Enforce the rate limit
    _last_request[domain] = time.monotonic()
    return requests.get(url)  # Make the request and return the result
```
If your scraping crashes a server, that is evidence of negligence. If you rate limit and the server still struggles, that is evidence of good faith.
Rule 4: Identify Yourself Properly
Your User-Agent should identify you and provide contact information:
```python
HEADERS = {
    "User-Agent": "GrahamMirandaBot/1.0 (+https://grahammiranda.com/bot-policy)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US",
}
```
A proper User-Agent with contact information demonstrates good faith. Obfuscation does the opposite.
Rule 5: No Authentication Circumvention
This is the bright line. Do not:
- Steal or forge session cookies
- Reverse-engineer authentication endpoints
- Use leaked credentials
- Exploit URL parameter vulnerabilities to access protected content
The hiQ case is clear: public data is generally scrapable. Protected data is not. There is no gray zone here.
Rule 6: Build Compliance Logging
If you are ever questioned about your scraping practices, you need evidence:
```json
{
  "timestamp": "2024-05-09T14:32:01Z",
  "url": "https://example.com/products/widget",
  "domain": "example.com",
  "robots_txt_allowed": true,
  "rate_limit_compliant": true,
  "user_agent": "GrahamMirandaBot/1.0",
  "personal_data_detected": false,
  "data_fields_extracted": ["product_name", "price", "currency"],
  "http_status": 200,
  "response_time_ms": 342
}
```
Maintain these logs for the duration of your data retention policy (typically 3-7 years).
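A sketch of emitting one such entry with the standard library; the append-only JSON-lines file is my assumption, and any audit store works:

```python
import json
from datetime import datetime, timezone

def log_request(entry: dict, path: str = "compliance.jsonl") -> None:
    entry["timestamp"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # One JSON object per line: trivially greppable, trivially shippable
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```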
The Technical Stack for Compliant Scraping
| Layer | Tool | Purpose |
|---|---|---|
| Request orchestration | `playwright` (Firefox) | JavaScript-rendered content, real browser fingerprint |
| Proxy rotation | Residential proxies | IP diversity without bot detection triggers |
| Rate limiting | Custom middleware | Policy enforcement at infrastructure level |
| robots.txt compliance | `urllib.robotparser` | Automatic path validation |
| Data validation | `pydantic` | Schema enforcement, type checking |
| PII detection | Custom rules + ML | Automatic flagging of personal data |
| Logging | Structured JSON | Audit trail for compliance |
| Storage | PostgreSQL + S3 | Structured storage with access controls |
The key insight: you need infrastructure-level enforcement, not script-level discipline. A single developer forgetting to add a delay can expose your entire operation.
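As an example of what that enforcement looks like at the validation layer, here is a sketch with pydantic v2 (the `ProductRecord` model is illustrative, not our production schema):

```python
from pydantic import BaseModel, Field, ValidationError

class ProductRecord(BaseModel):
    product_name: str = Field(min_length=1)
    price: float = Field(gt=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")  # ISO 4217 code

def validate_or_drop(raw: dict) -> ProductRecord | None:
    try:
        return ProductRecord.model_validate(raw)
    except ValidationError:
        return None  # Reject malformed records rather than storing them
```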
What I Do When Scraping Gets Blocked
Every serious scraping operation hits blocks. Here is the escalation ladder:
Level 1: Rotate proxies. If one IP gets rate-limited, switch to another. We rotate through a pool of 50 residential IPs.
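A minimal round-robin sketch, assuming a requests-style client (the pool entries are placeholders):

```python
from itertools import cycle
import requests

PROXY_POOL = cycle([
    "http://user:pass@proxy-1.example:8080",
    "http://user:pass@proxy-2.example:8080",
    # ... the rest of the residential pool
])

def fetch_with_rotation(url):
    proxy = next(PROXY_POOL)  # Round-robin: each request uses the next IP
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```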
Level 2: Fingerprint variation. Vary User-Agent, accept-language, viewport size, and TLS fingerprint. But keep it plausible. Do not cycle through obviously fake headers.
Level 3: CAPTCHA solving (human-in-the-loop). For edge cases where a CAPTCHA appears, we send a notification to a human who solves it. We do not automate CAPTCHA solving at scale — that is a legal and ethical line we do not cross.
Level 4: Accept defeat. Some sites are simply not scrapable under compliant conditions. We maintain a blacklist of sites that require authentication, deploy aggressive bot detection, or explicitly prohibit scraping. We do not attempt to bypass these.
The hard truth: if your business model depends on scraping a site that does not want to be scraped, your business model has a problem.
The GDPR-Specific Problem
For EU operations, scraping adds a specific GDPR complexity: Article 5(1)(c), the data minimization principle. This means:
- You must only collect personal data that is directly necessary for your purpose
- You must document your lawful basis before collection
- You must assess whether the data subject's interests override your legitimate interests
- You must implement technical measures to minimize data collection
Example: scraping a public job board for salary data does not require collecting applicant names or email addresses. If your scraper captures them anyway, you violate data minimization.
Our approach: every scraper includes a PII detection layer that automatically redacts names, emails, phone numbers, and physical addresses before storage.
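A minimal sketch of the regex half of such a layer; names and physical addresses need NER or curated rules, which this deliberately does not attempt:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    # Redact before storage so raw PII never reaches disk
    text = EMAIL.sub("[EMAIL REDACTED]", text)
    text = PHONE.sub("[PHONE REDACTED]", text)
    return text
```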
What Happens If You Get It Wrong
Cease and desist: The most common first step. A lawyer sends a letter demanding you stop scraping. Cost to comply: zero. Cost to ignore: potentially high.
IP blocking: The site blocks your proxies. You rotate. They block again. Eventually your proxy provider terminates your account.
CFAA lawsuit (US): Rare, but potentially catastrophic. Civil damages and legal fees can reach six figures.
GDPR complaint (EU): Triggered by a data subject or regulator. The maximum fine is 4% of global annual turnover or €20 million, whichever is higher.
Tort lawsuit (Germany): Based on server overload or business interference. Damages are measured by the plaintiff's actual losses; German law does not award punitive damages, but injunctions and litigation costs add up quickly.
The worst outcome is not the legal penalty. It is the reputational damage. Nobody wants to be the company that built its competitive advantage on non-compliant scraping.
The Bottom Line
Scraping is a powerful tool. It is also a legal minefield. The difference between responsible and irresponsible scraping is not technical sophistication. It is policy discipline.
- Define what you need before you collect it
- Respect robots.txt
- Rate limit aggressively
- Identify yourself
- Do not bypass authentication
- Log everything
- Minimize personal data
- Accept that some sites are off-limits
If you follow these rules, you will rarely face legal trouble. If you do not, you are one angry site operator away from a very expensive problem.
I am the founder of Graham Miranda UG, a Berlin-based company building privacy-first web intelligence tools. We operate scraping infrastructure that processes millions of pages per month under a compliance-first policy. The architecture described above is what we ship in asearchz.online.