DEV Community

agenthustler

Legal and Ethical Web Scraping in 2026: What You Can and Cannot Scrape

Web scraping sits in a gray area, but it's not the legal minefield some people make it out to be. Plenty of legitimate businesses scrape data every day — price comparison sites, search engines, academic researchers, journalists. The key is understanding where the boundaries are.

This guide breaks down the current legal landscape so you can scrape confidently and responsibly.

The Legal Foundation: Key Court Cases

hiQ Labs v. LinkedIn (2022)

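This is the landmark case for web scraping in the US. hiQ Labs scraped public LinkedIn profiles to provide workforce analytics. In 2022, on remand from the Supreme Court, the Ninth Circuit reaffirmed that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). The win was limited to the CFAA question: the district court later found that hiQ had breached LinkedIn's user agreement, and the case ended in a settlement.

What this means for you: Scraping public data that anyone can see without logging in is unlikely to violate the CFAA, but other claims, such as breach of contract, can still apply.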

Ryanair v. PR Aviation (EU, 2015)

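The Court of Justice of the European Union ruled that a database protected by neither copyright nor the sui generis database right falls outside the Database Directive. As a result, the Directive does not prevent the site owner from restricting use of that data by contract, so terms of use can still matter in the EU even when no database right applies.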

Meta v. Bright Data (2024)

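Meta sued Bright Data for scraping Instagram and Facebook. In January 2024, a federal court in California granted summary judgment for Bright Data on Meta's breach of contract claim, finding that Meta's terms did not bar logged-out scraping of publicly available data. Meta had not shown that Bright Data scraped while logged in, which reinforced the public vs. private distinction.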

The Public vs. Private Data Line

This is the single most important distinction in scraping law:

| Data Type | Examples | Generally Safe? |
| --- | --- | --- |
| Public data | Product prices, news headlines, business listings, public social posts | Yes |
| Login-required data | Private profiles, DMs, account dashboards | No |
| Rate-limited but public | APIs with auth keys for public data | Usually yes, respect limits |
| Personal data (EU) | Names, emails, photos of individuals | GDPR applies (see below) |

What's Generally Safe to Scrape

  • Product prices and availability — the backbone of price comparison sites
  • Public news articles and headlines — used by aggregators everywhere
  • Business listings and contact info — public directory data
  • Public government data — court records, legislation, public filings
  • Academic papers and citations — for research purposes
  • Public social media posts — posts made visible to everyone
  • Job listings — publicly posted positions
  • Weather data, stock prices, sports scores — factual data

What You Should Avoid

  • Data behind login walls — this crosses the authorization boundary
  • Copyrighted content in bulk — scraping for republishing verbatim
  • Personal data for profiling — especially in the EU (GDPR)
  • Data explicitly prohibited by court order
  • Scraping after receiving a cease-and-desist — consult a lawyer first

robots.txt: Guidelines, Not Law

The robots.txt file tells scrapers which parts of a site the owner prefers not to be crawled:

User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /products/
Crawl-delay: 10

Important nuance: robots.txt is a convention, not a legal contract. Ignoring it isn't illegal by itself. However:

  • Courts have considered robots.txt compliance as evidence of good faith
  • Ignoring it after being asked to stop could strengthen a case against you
  • It's considered good practice to respect it

Best practice: Respect robots.txt unless you have a specific legal reason not to (e.g., academic research into security vulnerabilities).
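
As a minimal sketch of how a scraper might check these rules before fetching, the standard library's `urllib.robotparser` can parse a robots.txt file and answer "may this agent fetch this URL?". The sample rules are the ones shown above; the agent name `MyScraperBot` and the `example.com` URLs are placeholders. Note that `Crawl-delay` is a non-standard directive that some crawlers ignore, although Python's parser does read it.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from above, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /products/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
agent = "MyScraperBot"  # placeholder agent name

print(rules.can_fetch(agent, "https://example.com/products/42"))    # True
print(rules.can_fetch(agent, "https://example.com/private/notes"))  # False
print(rules.crawl_delay(agent))                                     # 10
```

Against a live site, the same parser can load the file itself with `set_url("https://example.com/robots.txt")` followed by `read()`.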

Terms of Service: Are They Enforceable?

Many websites include "no scraping" clauses in their Terms of Service. The legal enforceability varies:

  • Browsewrap (ToS linked in footer, no click required): Generally weakly enforceable. Courts have found that users aren't bound by terms they never actively agreed to.
  • Clickwrap (must click "I agree"): More enforceable, especially if you created an account.

The hiQ v. LinkedIn ruling weakened the argument that ToS violations equal CFAA violations for public data. However, ToS can still support breach of contract claims in some jurisdictions.

GDPR and Personal Data (EU/EEA)

If you're scraping personal data of people in the EU, GDPR can apply even if you're located elsewhere, for example when you offer goods or services to them or monitor their behavior.

Key GDPR principles for scraping:

  1. Lawful basis — you need a legal reason to process personal data. "Legitimate interest" is the most common basis for scraping, but you must document your reasoning.
  2. Data minimization — only collect what you actually need.
  3. Storage limitation — don't keep personal data forever.
  4. Right to erasure — individuals can request their data be deleted.

Practical approach:

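  • If you're scraping product prices, business info, or other non-personal data: GDPR generally doesn't apply, unless the business info identifies an individual (for example, a sole trader or a named contact).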
  • If you're scraping names, emails, or photos of individuals: you need a documented lawful basis and must handle data subject requests.
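
One way to put data minimization and storage limitation into practice is to filter every scraped record against an allow-list of fields and stamp it with an expiry time. The sketch below is illustrative only: the field names, the `seller_email` example, and the 30-day retention window are hypothetical choices, not values GDPR prescribes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical allow-list: only the fields this project actually needs.
ALLOWED_FIELDS = {"product_name", "price", "currency", "url"}
# Hypothetical retention window; choose one that fits your documented purpose.
RETENTION = timedelta(days=30)


def minimize(record: dict, now: Optional[datetime] = None) -> dict:
    """Drop every field not on the allow-list and attach an expiry timestamp."""
    now = now or datetime.now(timezone.utc)
    kept = {key: value for key, value in record.items() if key in ALLOWED_FIELDS}
    kept["expires_at"] = (now + RETENTION).isoformat()
    return kept


def is_expired(record: dict, now: Optional[datetime] = None) -> bool:
    """Return True once a stored record has passed its retention window."""
    now = now or datetime.now(timezone.utc)
    return datetime.fromisoformat(record["expires_at"]) <= now


raw = {
    "product_name": "Desk lamp",
    "price": "24.99",
    "currency": "EUR",
    "url": "https://example.com/products/42",
    "seller_email": "someone@example.com",  # personal data the project does not need
}
stored = minimize(raw)
print(sorted(stored))
```

A periodic cleanup job could then delete every stored record for which `is_expired` returns True.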

US vs. EU: Key Differences

| Aspect | United States | European Union |
| --- | --- | --- |
| Public data scraping | Generally allowed (hiQ ruling) | Allowed, but GDPR applies to personal data |
| CFAA / Computer Misuse | Applies to unauthorized access | Similar laws in member states |
| Database rights | No equivalent federal law | Sui generis database right exists |
| Privacy | State- and sector-specific (e.g., CCPA in California) | Comprehensive (GDPR across the EU/EEA) |
| robots.txt | Not legally binding | Not legally binding |

Ethical Scraping: Best Practices

Legal and ethical aren't always the same thing. Here's how to be a good citizen:

1. Don't Overload Servers

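import random
import time

import requests

def polite_scrape(urls: list[str]) -> dict[str, str]:
    """Fetch each URL in turn, pausing between requests."""
    pages = {}
    for url in urls:
        # A timeout keeps one slow server from stalling the whole run
        response = requests.get(url, timeout=10)
        pages[url] = response.text
        # Random delay between 2-5 seconds
        time.sleep(random.uniform(2, 5))
    return pages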

Hammering a site with thousands of requests per second can cause downtime. That's not just rude — it could be considered a denial-of-service attack.

2. Identify Yourself

headers = {
    "User-Agent": "MyScraperBot/1.0 (contact@example.com)"
}

Using a descriptive User-Agent lets site owners contact you if there's an issue, rather than blocking you outright.

3. Cache and Don't Re-scrape Unnecessarily

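import hashlib
import os

import requests

def get_cached_or_scrape(url: str, cache_dir: str = "./cache") -> str:
    # MD5 is used here only to derive a filename, not for security
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = os.path.join(cache_dir, f"{cache_key}.html")

    # Serve from disk if this URL was fetched before
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # don't cache error pages
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text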

4. Respect Rate Limits

If a site returns 429 Too Many Requests, back off. Services like ScraperAPI handle rate limiting, IP rotation, and compliance features automatically — useful when you need reliability at scale.
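
If you handle this yourself, "backing off" could look like the sketch below: honor the server's `Retry-After` header when it gives a number of seconds, and otherwise wait exponentially longer (with jitter) on each retry. The function names, the retry count, and the delay bounds are illustrative assumptions. The `get` parameter is any callable that returns an object with `status_code` and `headers`, such as a wrapper around `requests.get`. This sketch ignores the HTTP-date form of `Retry-After` and falls back to exponential backoff in that case.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retrying; `attempt` is 0 for the first retry."""
    # Prefer the server's own hint when it is a plain number of seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    # Otherwise double the wait each time, capped, with jitter so that
    # many clients do not all retry at the same instant.
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)


def fetch_with_backoff(url: str, get: Callable, max_retries: int = 5,
                       sleep: Callable[[float], None] = time.sleep):
    """Call get(url), retrying on HTTP 429 up to max_retries times."""
    response = None
    for attempt in range(max_retries + 1):
        response = get(url)
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    # Still rate-limited after every retry: hand the 429 back to the caller.
    return response
```

With `requests`, a call might look like `fetch_with_backoff(url, lambda u: requests.get(u, headers=headers, timeout=10))`. Passing `get` and `sleep` in as parameters also makes the retry logic easy to test without touching the network.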

5. Don't Republish Copyrighted Content

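Scraping facts (prices, dates, statistics) is different from scraping creative content (articles, photos, reviews). Facts themselves are not copyrightable, so you can generally scrape and store them, although in the EU the database right can still restrict extracting a substantial part of a database. Republishing someone else's article verbatim is copyright infringement regardless of how you obtained it.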

A Simple Compliance Checklist

Before starting a scraping project, run through this:

  • [ ] Is the data publicly accessible (no login required)?
  • [ ] Am I scraping facts/data, not copyrighted creative works?
  • [ ] Have I checked robots.txt?
  • [ ] Am I rate-limiting my requests?
  • [ ] Am I avoiding personal data, or do I have a lawful basis (GDPR)?
  • [ ] Have I set a descriptive User-Agent?
  • [ ] Am I not circumventing any technical access controls?
  • [ ] Have I documented my purpose for scraping?

If you can check all of these, you're on much firmer ground.

Bottom Line

Web scraping is a legitimate tool used by businesses worldwide. The legal landscape has actually become more favorable for scrapers of public data over the past few years, thanks to cases like hiQ v. LinkedIn.

The rules are straightforward:

  • Public data is generally fair game in the US
  • Be polite — rate limit, identify yourself, don't overload servers
  • Be careful with personal data in the EU
  • Don't scrape behind login walls without authorization
  • Facts and data points are not copyrightable — creative expression is

When in doubt, consult a lawyer for your specific use case. But don't let fear stop you from building with publicly available data — that's what the internet is for.

Disclaimer: This article is for informational purposes only and does not constitute legal advice. Consult a qualified attorney for guidance on your specific situation.
