DEV Community

James

DSGVO-Compliant Web Scraping: What German Businesses Need to Know

Web scraping sits in a legal gray zone that most German businesses don't fully understand. Do it wrong, and you're looking at DSGVO fines of up to €20 million or 4% of global annual turnover, whichever is higher. Do it right, and it's a legitimate competitive intelligence tool.

I've advised clients on both sides of this line. Here's the practical framework we use at Graham Miranda UG.


The Legal Foundation

Web scraping isn't explicitly regulated in German law. The relevant frameworks are:

| Law | Relevance |
|---|---|
| DSGVO (GDPR) | Personal data handling, consent, legitimate interest |
| BGB § 823 | Tort liability for damage to property and other protected rights (including data) |
| UWG | Unfair competition (systematic copying of business data) |
| StGB § 202a | Data espionage (unauthorized access to data protected by access controls) |
| TMG § 5 | Provider identification (imprint) duties for telemedia; replaced by DDG § 5 in 2024 |

The good news: scraping publicly available, non-personal data is generally legal if done respectfully. The bad news: "publicly available" and "non-personal" are where most companies slip up.


The Three Tests

Before any scraping project, we run three tests:

Test 1: Is the Data Publicly Available?

Publicly available means:

  • No login required
  • No paywall or subscription gate
  • No terms of service explicitly prohibiting scraping
  • robots.txt doesn't disallow the relevant paths

Red flags:

  • Login-wall data (even free registration)
  • API endpoints marked "internal"
  • Data behind CAPTCHA (designed to prevent automated access)
  • Pages with "noindex" or "nofollow" directives
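
The "noindex"/"nofollow" red flag can be checked mechanically. Below is a minimal sketch, assuming the response headers and HTML have already been fetched; the function names `robots_directives` and `is_red_flag` are hypothetical, and bot-specific header forms such as `googlebot: noindex` are deliberately not handled.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def robots_directives(headers, html):
    """Return robots directives found in the X-Robots-Tag header and meta tags."""
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found.update(part.strip().lower() for part in value.split(","))
    parser = _RobotsMetaParser()
    parser.feed(html)
    found |= parser.directives
    found.discard("")
    return found


def is_red_flag(headers, html):
    """True if the page asks robots not to index it or not to follow its links."""
    return bool(robots_directives(headers, html) & {"noindex", "nofollow", "none"})
```

A crawler could call `is_red_flag(response_headers, page_html)` right after fetching a page and drop the page before anything is stored.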

Test 2: Is the Data Personal?

Personal data is any information relating to an identified or identifiable natural person (Art. 4(1) DSGVO). Typical data types and their risk:

| Data Type | Personal? | Risk Level |
|---|---|---|
| Product prices | No | Low |
| Company descriptions | No | Low |
| Forum usernames (pseudonymous) | Maybe | Medium |
| LinkedIn profiles | Yes | High |
| Customer reviews with real names | Yes | High |
| Email addresses | Yes | High |

Rule: If in doubt, strip it. We run PII detection at ingestion using regex + NLP. Better to lose data than gain a DSGVO complaint.
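
The regex half of that ingestion step might look like the sketch below. The patterns and the `redact_pii` name are illustrative assumptions, not the production rules, and the NLP pass for names is left out. The phone pattern is intentionally loose, so it will also catch some long numeric IDs, which fits the "if in doubt, strip it" rule.

```python
import re

# Simple email pattern; good enough for stripping, not for validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Loose phone pattern: optional "+", then at least 8 characters made of digits,
# spaces, slashes, dashes, or parentheses, starting and ending with a digit.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s()/-]{6,}\d(?!\w)")


def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Running redaction before the record is written means the raw values never reach storage.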

Test 3: Is the Method Respectful?

"Respectful" has a technical definition:

  • Rate limiting: ≤ 1 request/second for most sites; slower for smaller sites
  • robots.txt compliance: Honor Disallow directives
  • User-Agent identification: Identify your scraper and provide contact info
  • No circumvention: Don't bypass CAPTCHAs, don't forge headers, don't use residential proxies to mask commercial scraping
  • Data minimization: Collect only what you need, store only what you use
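
The rate-limiting rule above, including "slower for smaller sites", can be sketched as a per-host throttle. `HostThrottle` and its parameters are hypothetical names for illustration; the clock and sleep functions are injectable only so the class can be tested without real waiting.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, default_delay=1.0, per_host=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay  # seconds between requests
        self.per_host = per_host or {}      # host -> longer delay for small sites
        self._clock = clock
        self._sleep = sleep
        self._last = {}                     # host -> time of the last request

    def wait(self, url):
        """Sleep until the host's delay has elapsed; return the seconds slept."""
        host = urlsplit(url).hostname or ""
        delay = self.per_host.get(host, self.default_delay)
        now = self._clock()
        last = self._last.get(host)
        pause = 0.0 if last is None else max(0.0, last + delay - now)
        if pause > 0:
            self._sleep(pause)
        self._last[host] = now + pause
        return pause
```

Calling `throttle.wait(url)` immediately before each request keeps every host at or below its configured rate, even when URLs from several sites are interleaved.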

Legitimate Interest Assessment (Art. 6 DSGVO)

For data that might be personal, Art. 6(1)(f) "legitimate interest" is your friend. The assessment has three parts:

  1. Purpose: What business interest does this serve? (Price monitoring, market research, compliance tracking)
  2. Necessity: Is scraping the least intrusive way to get this data?
  3. Balancing: Do the data subjects' interests outweigh yours?

We document this in a one-page LI assessment for every project. It takes about 10 minutes and gives you a written record to point to if the processing is ever challenged.
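
One way to keep that record machine-readable is a small structured type with one field per part of the test. This is only a sketch: the `LIAssessment` class and its field names are assumptions for illustration, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class LIAssessment:
    """One legitimate-interest assessment, one field per part of the test."""
    project: str
    purpose: str    # what business interest the scraping serves
    necessity: str  # why scraping is the least intrusive option
    balancing: str  # why data subjects' interests do not override it
    assessed_on: str = field(default_factory=lambda: date.today().isoformat())

    def is_complete(self):
        """True only if all three parts have been filled in."""
        return all(part.strip() for part in (self.purpose, self.necessity, self.balancing))

    def to_json(self):
        """Serialize the assessment for archiving alongside the project."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)
```

A scraping job could refuse to start unless `is_complete()` returns true for its assessment.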


Technical Safeguards

At Collection

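# 1. robots.txt check
from urllib.robotparser import RobotFileParser

USER_AGENT = "GrahamMirandaBot/1.0 (+https://grahammiranda.com/scraping-policy)"
url = "https://target.com/some/page"  # placeholder: the page you intend to fetch

rp = RobotFileParser()
rp.set_url("https://target.com/robots.txt")
rp.read()  # downloads and parses robots.txt
# Check with our own User-Agent so bot-specific rules apply, not just "*"
if not rp.can_fetch(USER_AGENT, url):
    print(f"Skipping disallowed URL: {url}")

# 2. Rate limiting
import time
TIME_BETWEEN_REQUESTS = 1.0  # seconds
time.sleep(TIME_BETWEEN_REQUESTS)  # pause before each subsequent request

# 3. Respectful headers
headers = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html",
    "Accept-Language": "de-DE,de;q=0.9",
}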

At Storage

  • Hash identifiers: Don't store real names; store SHA-256 hashes for correlation (hashing is pseudonymization, not anonymization, so the hashes still count as personal data)
  • Field-level redaction: Strip email, phone, address fields at ingestion
  • Time-limited retention: Auto-delete after 90 days unless legally required
  • Access logging: Who queried what data, when
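
The hashing and retention points can be sketched as follows. Note one deliberate swap: the sketch uses keyed HMAC-SHA-256 rather than plain SHA-256, on the assumption that a secret key is available, because unkeyed hashes of names can be reversed by hashing a list of likely names. The function names and the key handling are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # retention window from the list above


def pseudonymize(identifier, key):
    """Return a keyed SHA-256 digest of a normalized identifier.

    The same identifier and key always give the same digest, so records
    can still be correlated without storing the real name.
    """
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def is_expired(collected_at, now=None):
    """True if a record is older than the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - collected_at > RETENTION
```

A scheduled cleanup job could delete every record for which `is_expired(record_timestamp)` is true, with an exception list for data under a legal retention duty.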

At Processing

  • Aggregate-only analytics: Never report on individual records
  • Anonymization before export: K-anonymity (k≥5) for any shared datasets
  • Audit trail: Every data transformation logged
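
A k-anonymity check before export can be as small as counting how many rows share each combination of quasi-identifiers. This is a sketch under the assumption that rows are dictionaries and that the caller decides which columns count as quasi-identifiers; it suppresses small groups rather than generalizing values, which is the simplest of several possible strategies.

```python
from collections import Counter


def _group_sizes(rows, quasi_identifiers):
    """Count rows per combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group (0 for an empty dataset)."""
    return min(_group_sizes(rows, quasi_identifiers).values(), default=0)


def suppress_small_groups(rows, quasi_identifiers, k=5):
    """Drop every row whose quasi-identifier group has fewer than k members."""
    sizes = _group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] >= k
    ]
```

An export step could assert `k_anonymity(rows, columns) >= 5` on the suppressed dataset and abort if the check fails.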

Case Study: Price Monitoring for a Retailer

Client: Mid-sized electronics retailer (Düsseldorf)
Goal: Track competitor prices for 50,000 SKUs
Challenge: Competitor sites include customer reviews with real names

Our approach:

  1. Scraped product pages at 0.5 req/sec
  2. Extracted price, availability, product specs only
  3. Parsed customer review sections but stripped reviewer names
  4. Retained review text (anonymized, no PII)
  5. Respected robots.txt; accessed no login-required pages
  6. LI assessment documented: "legitimate business interest in competitive pricing, minimal privacy impact due to data minimization"

Result: The client has run this for 18 months with zero legal issues and a 23% margin improvement through dynamic pricing.


When NOT to Scrape

We turn down projects in these categories:

  1. Social media profiles with PII: LinkedIn, Xing, Facebook — too much personal data, too much legal risk
  2. Health/financial data: Health data is a special category under DSGVO Art. 9 and generally requires explicit consent; financial data is not listed in Art. 9 but carries comparable risk
  3. Government databases with access controls: § 202a StGB territory
  4. Competitor's internal systems: Obviously illegal
  5. Children's data: Strict DSGVO protections, zero tolerance

The Future: AI Act + Scraping

The EU AI Act (in force since August 2024, with most obligations applying from August 2026) adds new requirements for AI systems trained on scraped data:

  • Transparency: Document data sources
  • Copyright respect: Respect opt-outs for AI training
  • Risk classification: High-risk AI requires extra scrutiny

For scraping-as-training-data companies, this means:

  • Source attribution for every training example
  • robots.txt "AI" extensions (emerging standard)
  • Clear terms of use on your own site about AI training

Bottom Line

Web scraping is legal in Germany if you:

  1. Only scrape public, non-personal data
  2. Respect robots.txt and rate limits
  3. Document your legitimate interest
  4. Minimize data collection and retention
  5. Never circumvent access controls

It's not a free-for-all. But it's also not forbidden. The middle ground — respectful, documented, minimal scraping — is where competitive advantage lives.



Graham Miranda is the founder of Graham Miranda UG (Berlin, HRB 36794), specializing in web intelligence, automation, and DSGVO-compliant data infrastructure.
