James

Posted on May 13

DSGVO-Compliant Web Scraping: What German Businesses Need to Know

#dsgvo #webscraping #gdpr #germany

Web scraping sits in a legal gray zone that most German businesses don't fully understand. Do it wrong, and you're looking at DSGVO fines of up to €20 million or 4% of worldwide annual turnover, whichever is higher. Do it right, and it's a legitimate competitive intelligence tool.

I've advised clients on both sides of this line. Here's the practical framework we use at Graham Miranda UG.

The Legal Foundation

Web scraping isn't explicitly regulated in German law. The relevant frameworks are:

Law	Relevance
DSGVO (GDPR)	Personal data handling, consent, legitimate interest
BGB § 823	Tort liability for property damage (including data)
UWG	Unfair competition (systematic copying of business data)
StGB § 202a	Data espionage (unauthorized access gained by circumventing access controls)
TMG § 5 (since May 2024: DDG § 5)	Provider identification (Impressum) duties for site operators

The good news: scraping publicly available, non-personal data is generally legal if done respectfully. The bad news: "publicly available" and "non-personal" are where most companies slip up.

The Three Tests

Before any scraping project, we run three tests:

Test 1: Is the Data Publicly Available?

Publicly available means:

No login required
No paywall or subscription gate
No terms of service explicitly prohibiting scraping
robots.txt doesn't disallow the relevant paths

Red flags:

Login-wall data (even free registration)
API endpoints marked "internal"
Data behind CAPTCHA (designed to prevent automated access)
Pages with "noindex" or "nofollow" directives
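
Some of these red flags can be checked mechanically before a page enters the pipeline. The following is a minimal sketch, assuming you already hold the HTTP status code, the response headers as a plain dict, and the HTML body. The helper name `access_red_flags` and the wording of the flags are illustrative assumptions, and the check covers only the technical signals, not terms of service.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def access_red_flags(status, headers, html):
    """Return the red flags found in one response (hypothetical helper)."""
    flags = []
    if status in (401, 403):
        flags.append("access denied (login or block)")

    # Directives can arrive in the HTML or in the X-Robots-Tag header.
    # Assumption: `headers` is a plain dict with this exact key spelling.
    parser = _RobotsMetaParser()
    parser.feed(html)
    header_value = headers.get("X-Robots-Tag", "").lower()
    header_directives = {p.strip() for p in header_value.split(",") if p.strip()}

    found = parser.directives | header_directives
    for directive in ("noindex", "nofollow"):
        if directive in found:
            flags.append(f"{directive} directive")
    return flags
```

An empty list means none of these particular signals fired; it does not by itself establish that the page is fair game.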

Test 2: Is the Data Personal?

Personal data = anything that identifies a natural person. Common scraping mistakes:

Data Type	Personal?	Risk Level
Product prices	No	Low
Company descriptions	No	Low
Forum usernames (pseudonymous)	Maybe	Medium
LinkedIn profiles	Yes	High
Customer reviews with real names	Yes	High
Email addresses	Yes	High

Rule: If in doubt, strip it. We run PII detection at ingestion using regex + NLP. Better to lose data than gain a DSGVO complaint.
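
As an illustration of the regex half of that rule, here is a rough sketch that masks email addresses and German-style phone numbers in free text. The patterns and the `strip_pii` name are assumptions made for this example, not our production rules; they will produce both false positives (for example, numeric codes starting with 0) and misses, which is why a second, NLP-based pass for names is still needed.

```python
import re

# Simple email pattern; deliberately broad rather than RFC-exact.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Rough pattern for German-style phone numbers (+49 / 0049 / leading 0).
# The lookbehind keeps it from matching in the middle of a longer number.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+49|0049|0)[\s/-]?\d{2,5}[\s/-]?\d{3,}(?:[\s-]?\d{2,})*"
)


def strip_pii(text):
    """Replace emails and phone numbers with placeholders (hypothetical helper)."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(strip_pii("Mail max@example.de or call +49 30 1234567"))
```

Masking with a placeholder rather than deleting keeps the surrounding sentence readable, which helps when review text is retained.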

Test 3: Is the Method Respectful?

"Respectful" has a technical definition:

Rate limiting: ≤ 1 request/second for most sites; slower for smaller sites
robots.txt compliance: Honor Disallow directives
User-Agent identification: Identify your scraper and provide contact info
No circumvention: Don't bypass CAPTCHAs, don't forge headers, don't use residential proxies to mask commercial scraping
Data minimization: Collect only what you need, store only what you use
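
The rate-limiting rule is easiest to enforce per host, so that one slow target does not throttle the whole crawl. A minimal sketch, assuming a single-threaded scraper; the `HostRateLimiter` class is a hypothetical helper, and the clock and sleep functions are injectable only to make it testable.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was released

    def wait(self, url):
        """Block until `url`'s host may be contacted again; return the delay."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

Call `limiter.wait(url)` immediately before each request. For smaller sites, create the limiter with a larger `min_interval`, for example 2.0 for one request every two seconds.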

Legitimate Interest Assessment (Art. 6 DSGVO)

For data that might be personal, Art. 6(1)(f) "legitimate interest" is your friend. The assessment has three parts:

Purpose: What business interest does this serve? (Price monitoring, market research, compliance tracking)
Necessity: Is scraping the least intrusive way to get this data?
Balancing: Do the data subjects' interests outweigh yours?

We document this in a one-page LI assessment for every project. It takes about 10 minutes and gives you a written record of your reasoning to show a supervisory authority if the processing is ever challenged.
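
If you prefer to keep the assessment next to the scraper's configuration rather than in a separate document, the three parts map onto a small record. This is only a sketch of one possible structure; the `LIAssessment` class and its field names are assumptions for illustration, and a filled-in record is no substitute for actually doing the balancing.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LIAssessment:
    """One legitimate-interest assessment per scraping project (hypothetical)."""

    project: str
    purpose: str    # which business interest the scraping serves
    necessity: str  # why no less intrusive source is available
    balancing: str  # why the data subjects' interests do not prevail
    assessed_on: date = field(default_factory=date.today)

    def is_complete(self):
        """True only if all three parts of the test have been written down."""
        return all(part.strip() for part in (self.purpose, self.necessity, self.balancing))
```

A pipeline could refuse to start a job whose assessment is missing or incomplete.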

Technical Safeguards

At Collection

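import time
from urllib.robotparser import RobotFileParser

BOT_NAME = "GrahamMirandaBot"
urls = ["https://target.com/products/1"]  # placeholder: the pages you plan to fetch

# 1. robots.txt check (load once per host, match rules against our own bot name)
rp = RobotFileParser()
rp.set_url("https://target.com/robots.txt")
rp.read()

# 2. Rate limiting
TIME_BETWEEN_REQUESTS = 1.0  # seconds

# 3. Respectful headers (identify the bot and link to a policy/contact page)
headers = {
    "User-Agent": f"{BOT_NAME}/1.0 (+https://grahammiranda.com/scraping-policy)",
    "Accept": "text/html",
    "Accept-Language": "de-DE,de;q=0.9",
}

for url in urls:
    if not rp.can_fetch(BOT_NAME, url):
        continue  # disallowed by robots.txt: skip this URL
    # ... send the request with `headers` here ...
    time.sleep(TIME_BETWEEN_REQUESTS)  # pause before the next request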

At Storage

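Hash identifiers: Don't store real names; store keyed SHA-256 hashes for correlation. A plain hash of a name can be reversed by hashing a list of candidate names, and hashed identifiers still count as pseudonymized personal data under the DSGVO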
Field-level redaction: Strip email, phone, address fields at ingestion
Time-limited retention: Auto-delete after 90 days unless legally required
Access logging: Who queried what data, when
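
The hashing and retention bullets can be sketched in a few lines. This example assumes a keyed hash (HMAC-SHA-256) with a secret stored outside the dataset, and timezone-aware UTC timestamps; the function names and the normalization step are illustrative assumptions.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)


def pseudonymize(identifier, secret_key):
    """Return a keyed SHA-256 digest usable for correlating records."""
    # Normalize so "Max Mustermann" and " max mustermann" map to the same value.
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def is_expired(collected_at, now=None):
    """True if a record is older than the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - collected_at > RETENTION
```

A scheduled job can then delete every record for which `is_expired` returns True, unless a legal hold applies. Rotating or destroying the key also breaks the link between old digests and new data.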

At Processing

Aggregate-only analytics: Never report on individual records
Anonymization before export: K-anonymity (k≥5) for any shared datasets
Audit trail: Every data transformation logged
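
The k-anonymity bullet can be checked before any export: group the rows by their quasi-identifiers and look for groups smaller than k. A minimal sketch, assuming rows are dicts; which columns count as quasi-identifiers is a judgment call, and the column names in the usage note are made up.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k=5):
    """Return the quasi-identifier combinations shared by fewer than k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, count in groups.items() if count < k]
```

For example, `small_groups(rows, ["plz", "age_band"])` lists every postcode and age-band combination that would need to be generalized or suppressed. An empty result means the dataset satisfies k-anonymity for those columns, which is a necessary check rather than proof of anonymization.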

Case Study: Price Monitoring for a Retailer

Client: Mid-sized electronics retailer (Düsseldorf)
Goal: Track competitor prices for 50,000 SKUs
Challenge: Competitor sites include customer reviews with real names

Our approach:

Scraped product pages at 0.5 req/sec
Extracted price, availability, product specs only
Customer review sections were parsed but reviewer names stripped
Review text was retained (anonymized, no PII)
robots.txt respected; no login-required pages accessed
LI assessment documented: "legitimate business interest in competitive pricing, minimal privacy impact due to data minimization"

Result: The client has run this for 18 months with zero legal issues and a 23% margin improvement through dynamic pricing.
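
The extraction step in the approach above amounts to an allowlist: keep named fields and drop everything else, instead of trying to list what to remove. The sketch below shows the idea. The record layout and field names are hypothetical, since the client's actual schema is not shown here.

```python
# Hypothetical field names; only these survive ingestion.
ALLOWED_PRODUCT_FIELDS = {"sku", "price", "availability", "specs"}
ALLOWED_REVIEW_FIELDS = {"rating", "text"}


def minimize(record):
    """Keep only allowlisted product fields and review fields."""
    product = {k: v for k, v in record.items() if k in ALLOWED_PRODUCT_FIELDS}
    product["reviews"] = [
        {k: v for k, v in review.items() if k in ALLOWED_REVIEW_FIELDS}
        for review in record.get("reviews", [])
    ]
    return product
```

An allowlist fails safe: if a competitor adds a new field containing personal data, it is dropped by default. Review text can still contain names or contact details, so it should also pass through PII detection.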

When NOT to Scrape

We turn down projects in these categories:

Social media profiles with PII: LinkedIn, Xing, Facebook — too much personal data, too much legal risk
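Health/financial data: Health data is a special category under DSGVO Art. 9 and in practice requires explicit consent. Financial data is not listed in Art. 9, but we treat it with the same caution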
Government databases with access controls: § 202a StGB territory
Competitor's internal systems: Obviously illegal
Children's data: Strict DSGVO protections, zero tolerance

The Future: AI-Act + Scraping

The EU AI Act (in force since August 2024, with most obligations applying from August 2026) adds new requirements for AI systems trained on scraped data:

Transparency: Document data sources
Copyright respect: Respect opt-outs for AI training
Risk classification: High-risk AI requires extra scrutiny

For scraping-as-training-data companies, this means:

Source attribution for every training example
robots.txt "AI" extensions (emerging standard)
Clear terms of use on your own site about AI training
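
Until a dedicated standard settles, many site operators express an AI-training opt-out through ordinary robots.txt rules aimed at specific crawler names. The sketch below checks such a rule with Python's standard `urllib.robotparser`. The robots.txt content and both bot names (`ExampleTrainingBot`, `PriceMonitorBot`) are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented example: the site blocks one training crawler entirely
# and keeps /private/ off limits for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_for(agent, url, robots_txt=ROBOTS_TXT):
    """Check whether `agent` may fetch `url` under the given robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)


print(allowed_for("ExampleTrainingBot", "https://example.com/products/1"))
print(allowed_for("PriceMonitorBot", "https://example.com/products/1"))
```

The practical consequence is that a crawler collecting training data should check robots.txt under the name it actually announces, not under a generic one.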

Bottom Line

Web scraping is generally legal in Germany if you:

Only scrape public, non-personal data
Respect robots.txt and rate limits
Document your legitimate interest
Minimize data collection and retention
Never circumvent access controls

It's not a free-for-all. But it's also not forbidden. The middle ground — respectful, documented, minimal scraping — is where competitive advantage lives.

Resources

Our scraping policy: grahammiranda.com/scraping-policy
DSGVO checklist: Available on request
Tools: asearchz.online — privacy-first search for compliance research

Graham Miranda is the founder of Graham Miranda UG (Berlin, HRB 36794), specializing in web intelligence, automation, and DSGVO-compliant data infrastructure.
