Every week, someone asks me whether web scraping is legal. The honest answer is that it depends on what you scrape, how you scrape it, and where you are.
This is not a legal guide. I am not a lawyer. This is a technical practitioner's map of the landscape after running scraping infrastructure for European clients for three years, building automated research pipelines, and dealing with the compliance questions that arise.
What follows is a framework for thinking about scraping legality, a set of operational rules that keep you out of trouble, and the specific stack I use to stay compliant.
The Three Legal Frameworks You Need to Understand
United States: The CFAA Standard
In 2019, the Ninth Circuit decided hiQ Labs v. LinkedIn, and reaffirmed its holding in 2022 after a Supreme Court remand. hiQ scraped publicly available LinkedIn profiles. LinkedIn sued under the Computer Fraud and Abuse Act (CFAA). The court held that scraping publicly available data — data visible without authentication — does not constitute "unauthorized access" under the CFAA.
The boundary is clear: if data requires a login, scraping it without authorization is a CFAA risk. If data is public, scraping is generally legal.
Practical rule for US operations: scrape what you could see in an incognito browser window. Do not bypass authentication. Do not circumvent technical barriers like CAPTCHA at scale.
European Union: Database Rights + GDPR
The EU has two layers of protection.
The Database Directive (96/9/EC) grants a sui generis right to creators who have made a "substantial investment" in collecting, verifying, or presenting data. If you scrape a competitor's curated database and replicate it, you may violate this right. The threshold is case-specific.
GDPR (2016/679) protects personal data. Even if data is publicly visible — like social media profiles — it retains GDPR protection. You need a lawful basis to process it. For scraping, this usually means a "legitimate interest assessment" that weighs your commercial purpose against the data subject's rights.
Practical rule for EU operations: do not scrape personal data without a documented legitimate interest assessment. And even then, minimize what you collect.
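What does "documented" mean in practice? A record you can produce on request. Here is a minimal sketch of the fields such an assessment record might carry (this structure is my illustration, not a legal template):

```python
# Hypothetical legitimate interest assessment record; field names are illustrative
LIA_RECORD = {
    "purpose": "aggregate salary statistics for market research",
    "lawful_basis": "legitimate interest, GDPR Art. 6(1)(f)",
    "data_categories": ["job title", "salary range", "region"],
    "personal_data_collected": False,
    "balancing_test": "aggregate statistics only; no identifiable individuals stored",
    "assessed_on": "2024-05-01",
    "reviewed_by": "data protection officer",
}
```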
Germany: The Strictest Endpoint
Germany applies EU law and adds its own layer: BGB § 823 (tortious interference with business operations). German courts have held that systematic scraping that degrades server performance can constitute tort liability. The threshold is lower than in the US.
Practical rule for German operations: respect robots.txt, limit request rates, and avoid scraping German competitors for commercial replication.
A Decision Matrix for Scraping
| Scenario | US Legal Risk | EU Legal Risk | German Legal Risk | Practical Advice |
|---|---|---|---|---|
| Public government datasets | Very Low | Very Low | Very Low | ✅ Safe |
| Public e-commerce listings | Low | Low | Low-Medium | ✅ Safe with rate limits |
| Public news articles | Very Low | Low | Low | ✅ Safe |
| Public academic papers | Very Low | Low | Low | ✅ Safe |
| Login-protected pricing | High | High | High | ❌ Do not scrape |
| Login-protected user profiles | High | Very High | Very High | ❌ Do not scrape |
| CAPTCHA-protected sites | High | Medium-High | Medium-High | ⚠️ Manual only, not systematic |
| robots.txt disallowed paths | Low-Medium | Medium | Medium-High | ⚠️ Respect robots.txt |
| Personal data (names, emails) | Medium | Very High | Very High | ❌ Do not collect |
| Systematic server overload | Low-Medium | Medium | High | ❌ Rate limit aggressively |
| Competitor database replication | Medium | Medium-High | High | ⚠️ Consult legal counsel |
Operational Rules That Keep You Out of Trouble
Rule 1: Scrape Only What You Need
The most common mistake is over-collection. You want pricing data, so you scrape the entire page, including user reviews, related products, and other user-generated content. Now you hold personal data you never needed.
Define your data model before you write your first selector:
```python
# BAD: Scrape everything
SELECTOR = "*"

# GOOD: Define exactly what you need
SCHEMA = {
    "product_name": "css:.product-title h1",
    "price": "css:.price-current",
    "currency": "css:.price-currency",
    # Explicitly NOT including: reviews, user names, social profiles
}
```
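Applying the schema is then a mechanical loop over the declared selectors, so nothing outside it can reach storage. A minimal sketch using the parsel library (the `css:` prefix convention and the `extract_record` helper are my illustration):

```python
from parsel import Selector  # pip install parsel

def extract_record(html: str, schema: dict) -> dict:
    """Extract only the fields declared in the schema."""
    sel = Selector(text=html)
    record = {}
    for field, rule in schema.items():
        css = rule.removeprefix("css:")
        # ::text is parsel's pseudo-element for a node's text content
        record[field] = (sel.css(f"{css}::text").get() or "").strip()
    return record
```

Anything not named in `SCHEMA` never enters the record, which operationalizes the rule at the code level instead of relying on developer restraint.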
Rule 2: Respect robots.txt
This is not optional. Every jurisdiction that has addressed the question treats robots.txt as a meaningful signal of the site owner's intent.
```python
from urllib.robotparser import RobotFileParser

def fetch_if_allowed(base_url, target_path, agent="MyBot/1.0"):
    rp = RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()  # Download and parse robots.txt
    if not rp.can_fetch(agent, target_path):
        return None  # Skip this URL
    ...  # Proceed with the request
```
If you ignore robots.txt systematically, courts can infer bad faith. That matters in civil liability cases.
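One operational note: re-fetching robots.txt on every request adds needless load, which undercuts the politeness you are trying to demonstrate. A sketch of per-domain caching with the standard library (the cache size is an arbitrary assumption):

```python
from functools import lru_cache
from urllib.robotparser import RobotFileParser

@lru_cache(maxsize=256)
def robots_for(base_url: str) -> RobotFileParser:
    # Parsed once per domain, then served from memory
    rp = RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp
```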
Rule 3: Rate Limit as a Policy, Not an Afterthought
Implement request delays at the infrastructure level, not the script level:
```python
import time
from collections import defaultdict
import requests

# Infrastructure-level rate limiting: (max requests, period in seconds)
RATE_LIMITS = {
    "example.com": (1, 1.0),     # 1 request per second
    "slowsite.de": (1, 5.0),     # 1 request per 5 seconds
    "newssource.io": (10, 1.0),  # 10 requests per second (generous)
}

_last_request = defaultdict(float)  # domain -> time of last request

def polite_request(domain, url):
    limit, period = RATE_LIMITS.get(domain, (1, 1.0))
    min_interval = period / limit
    wait = _last_request[domain] + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)  # Enforce the rate limit
    _last_request[domain] = time.monotonic()
    return requests.get(url)  # Make the request and return the result
```
If your scraping crashes a server, that is evidence of negligence. If you rate limit and the server still struggles, that is evidence of good faith.
Rule 4: Identify Yourself Properly
Your User-Agent should identify you and provide contact information:
```python
HEADERS = {
    "User-Agent": "GrahamMirandaBot/1.0 (+https://grahammiranda.com/bot-policy)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US",
}
```
A proper User-Agent with contact information demonstrates good faith. Obfuscation does the opposite.
Rule 5: No Authentication Circumvention
This is the bright line. Do not:
- Steal or forge session cookies
- Reverse-engineer authentication endpoints
- Use leaked credentials
- Exploit URL parameter vulnerabilities to access protected content
The hiQ case is clear: public data is generally scrapable. Protected data is not. There is no gray zone here.
Rule 6: Build Compliance Logging
If you are ever questioned about your scraping practices, you need evidence:
```json
{
  "timestamp": "2024-05-09T14:32:01Z",
  "url": "https://example.com/products/widget",
  "domain": "example.com",
  "robots_txt_allowed": true,
  "rate_limit_compliant": true,
  "user_agent": "GrahamMirandaBot/1.0",
  "personal_data_detected": false,
  "data_fields_extracted": ["product_name", "price", "currency"],
  "http_status": 200,
  "response_time_ms": 342
}
```
Maintain these logs for the duration of your data retention policy (typically 3-7 years).
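A sketch of emitting one such entry with the standard library; the append-only JSON-lines file is my assumption, and any audit store works:

```python
import json
from datetime import datetime, timezone

def log_request(entry: dict, path: str = "compliance.jsonl") -> None:
    entry["timestamp"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # One JSON object per line: trivially greppable, trivially shippable
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```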
The Technical Stack for Compliant Scraping
| Layer | Tool | Purpose |
|---|---|---|
| Request orchestration | `playwright` (Firefox) | JavaScript-rendered content, real browser fingerprint |
| Proxy rotation | Residential proxies | IP diversity without bot detection triggers |
| Rate limiting | Custom middleware | Policy enforcement at infrastructure level |
| robots.txt compliance | `urllib.robotparser` | Automatic path validation |
| Data validation | `pydantic` | Schema enforcement, type checking |
| PII detection | Custom rules + ML | Automatic flagging of personal data |
| Logging | Structured JSON | Audit trail for compliance |
| Storage | PostgreSQL + S3 | Structured storage with access controls |
The key insight: you need infrastructure-level enforcement, not script-level discipline. A single developer forgetting to add a delay can expose your entire operation.
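As an example of what that enforcement looks like at the validation layer, here is a sketch with pydantic v2 (the `ProductRecord` model is illustrative, not our production schema):

```python
from pydantic import BaseModel, Field, ValidationError

class ProductRecord(BaseModel):
    product_name: str = Field(min_length=1)
    price: float = Field(gt=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")  # ISO 4217 code

def validate_or_drop(raw: dict) -> ProductRecord | None:
    try:
        return ProductRecord.model_validate(raw)
    except ValidationError:
        return None  # Reject malformed records rather than storing them
```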
What I Do When Scraping Gets Blocked
Every serious scraping operation hits blocks. Here is the escalation ladder:
Level 1: Rotate proxies. If one IP gets rate-limited, switch to another. We rotate through a pool of 50 residential IPs.
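A minimal round-robin sketch, assuming a requests-style client (the pool entries are placeholders):

```python
from itertools import cycle
import requests

PROXY_POOL = cycle([
    "http://user:pass@proxy-1.example:8080",
    "http://user:pass@proxy-2.example:8080",
    # ... the rest of the residential pool
])

def fetch_with_rotation(url):
    proxy = next(PROXY_POOL)  # Round-robin: each request uses the next IP
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```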
Level 2: Fingerprint variation. Vary User-Agent, accept-language, viewport size, and TLS fingerprint. But keep it plausible. Do not cycle through obviously fake headers.
Level 3: CAPTCHA solving (human-in-the-loop). For edge cases where a CAPTCHA appears, we send a notification to a human who solves it. We do not automate CAPTCHA solving at scale — that is a legal and ethical line we do not cross.
Level 4: Accept defeat. Some sites are simply not scrapable under compliant conditions. We maintain a blacklist of sites that require authentication, deploy aggressive bot detection, or explicitly prohibit scraping. We do not attempt to bypass these.
The hard truth: if your business model depends on scraping a site that does not want to be scraped, your business model has a problem.
The GDPR-Specific Problem
For EU operations, scraping adds a specific GDPR complexity: Article 5(1)(c), the data minimization principle. This means:
- You must only collect personal data that is directly necessary for your purpose
- You must document your lawful basis before collection
- You must assess whether the data subject's interests override your legitimate interests
- You must implement technical measures to minimize data collection
Example: scraping a public job board for salary data does not require collecting applicant names or email addresses. If your scraper captures them anyway, you violate data minimization.
Our approach: every scraper includes a PII detection layer that automatically redacts names, emails, phone numbers, and physical addresses before storage.
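A minimal sketch of the regex half of such a layer; names and physical addresses need NER or curated rules, which this deliberately does not attempt:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    # Redact before storage so raw PII never reaches disk
    text = EMAIL.sub("[EMAIL REDACTED]", text)
    text = PHONE.sub("[PHONE REDACTED]", text)
    return text
```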
What Happens If You Get It Wrong
Cease and desist: The most common first step. A lawyer sends a letter demanding you stop scraping. Cost to comply: zero. Cost to ignore: potentially high.
IP blocking: The site blocks your proxies. You rotate. They block again. Eventually your proxy provider terminates your account.
CFAA lawsuit (US): Rare, but potentially catastrophic. Civil damages and legal fees can reach six figures.
GDPR complaint (EU): Triggered by a data subject or regulator. The maximum fine is 4% of global annual turnover or €20 million, whichever is higher.
Tort lawsuit (Germany): Based on server overload or business interference. Damages are measured by the plaintiff's actual losses; German law does not award punitive damages, but injunctions and litigation costs add up quickly.
The worst outcome is not the legal penalty. It is the reputational damage. Nobody wants to be the company that built its competitive advantage on non-compliant scraping.
The Bottom Line
Scraping is a powerful tool. It is also a legal minefield. The difference between responsible and irresponsible scraping is not technical sophistication. It is policy discipline.
- Define what you need before you collect it
- Respect robots.txt
- Rate limit aggressively
- Identify yourself
- Do not bypass authentication
- Log everything
- Minimize personal data
- Accept that some sites are off-limits
If you follow these rules, you will rarely face legal trouble. If you do not, you are one angry site operator away from a very expensive problem.
I am the founder of Graham Miranda UG, a Berlin-based company building privacy-first web intelligence tools. We operate scraping infrastructure that processes millions of pages per month under a compliance-first policy. The architecture described above is what we ship in asearchz.online.