agenthustler

Posted on Mar 26

Legal and Ethical Web Scraping in 2026: What You Can and Cannot Scrape

#webdev #beginners #tutorial #discuss

Web scraping sits in a gray area, but it's not the legal minefield some people make it out to be. Plenty of legitimate businesses scrape data every day — price comparison sites, search engines, academic researchers, journalists. The key is understanding where the boundaries are.

This guide breaks down the current legal landscape so you can scrape confidently and responsibly.

The Legal Foundation: Key Court Cases

hiQ Labs v. LinkedIn (2022)

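This is the landmark case for web scraping in the US. The Ninth Circuit held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). hiQ Labs scraped public LinkedIn profiles to provide workforce analytics, and the court sided with hiQ on the CFAA question. The win was not total: later in 2022 the district court found that hiQ had breached LinkedIn's User Agreement, and the case ended in a settlement.

What this means for you: Scraping public data that anyone can see without logging in generally does not violate the CFAA, but contract claims based on a site's terms can still apply.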

Ryanair v. PR Aviation (EU, 2015)

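The Court of Justice of the European Union ruled that when a database is not protected by copyright or the sui generis database right, the Database Directive does not limit the contractual restrictions its owner can impose. In practice, scraping such a site is not a database right infringement, but the site's terms can still prohibit it by contract.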

Meta v. Bright Data (2024)

Meta sued Bright Data for scraping Facebook and Instagram. The court granted summary judgment to Bright Data, finding that Meta's terms did not bar scraping public data while logged out; the outcome turned on the lack of evidence that Bright Data had scraped behind a login. This reinforced the public vs. private distinction.

The Public vs. Private Data Line

This is the single most important distinction in scraping law:

Data Type	Examples	Generally Safe?
Public data	Product prices, news headlines, business listings, public social posts	Yes
Login-required data	Private profiles, DMs, account dashboards	No
Rate-limited but public	APIs with auth keys for public data	Usually yes, respect limits
Personal data (EU)	Names, emails, photos of individuals	GDPR applies — see below

What's Generally Safe to Scrape

Product prices and availability — the backbone of price comparison sites
Public news articles and headlines — used by aggregators everywhere
Business listings and contact info — public directory data
Public government data — court records, legislation, public filings
Academic papers and citations — for research purposes
Public social media posts — posts made visible to everyone
Job listings — publicly posted positions
Weather data, stock prices, sports scores — factual data

What You Should Avoid

Data behind login walls — this crosses the authorization boundary
Copyrighted content in bulk — scraping for republishing verbatim
Personal data for profiling — especially in the EU (GDPR)
Data explicitly prohibited by court order
Scraping after receiving a cease-and-desist — consult a lawyer first

robots.txt: Guidelines, Not Law

The robots.txt file tells scrapers which parts of a site the owner prefers not to be crawled:

User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /products/
Crawl-delay: 10

Important nuance: robots.txt is a convention, not a legal contract. Ignoring it isn't illegal by itself. However:

Courts have considered robots.txt compliance as evidence of good faith
Ignoring it after being asked to stop could strengthen a case against you
It's considered good practice to respect it

Best practice: Respect robots.txt unless you have a specific legal reason not to (e.g., academic research into security vulnerabilities).
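
As a rough sketch of what "respecting robots.txt" can look like in code, Python's standard-library urllib.robotparser can evaluate the sample rules above. The bot name and example.com URLs are placeholders, and the rules are inlined so the snippet runs without touching the network:

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, inlined so the sketch runs offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /products/
Crawl-delay: 10
"""

USER_AGENT = "MyScraperBot"  # hypothetical bot name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

for url in (
    "https://example.com/products/42",
    "https://example.com/private/report",
):
    # can_fetch() reports whether the rules allow this agent to fetch the URL
    print(url, rules.can_fetch(USER_AGENT, url))

# crawl_delay() returns the Crawl-delay value for this agent, or None if unset
print("Crawl delay:", rules.crawl_delay(USER_AGENT))
```

Against a live site you would call set_url() with the site's robots.txt address and then read() instead of parse(), and feed the crawl delay into your sleep between requests.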

Terms of Service: Are They Enforceable?

Many websites include "no scraping" clauses in their Terms of Service. The legal enforceability varies:

Browsewrap (ToS linked in footer, no click required): Generally weakly enforceable. Courts have found that users aren't bound by terms they never actively agreed to.
Clickwrap (must click "I agree"): More enforceable, especially if you created an account.

The hiQ v. LinkedIn ruling weakened the argument that ToS violations equal CFAA violations for public data. However, ToS can still support breach of contract claims in some jurisdictions.

GDPR and Personal Data (EU/EEA)

If you're scraping personal data of people in the EU, GDPR can apply even if you're located elsewhere, for example when you offer them goods or services or monitor their behavior.

Key GDPR principles for scraping:

Lawful basis — you need a legal reason to process personal data. "Legitimate interest" is the most common basis for scraping, but you must document your reasoning.
Data minimization — only collect what you actually need.
Storage limitation — don't keep personal data forever.
Right to erasure — individuals can request their data be deleted.
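
Three of these principles translate fairly directly into code. The sketch below is only an illustration: the field names, the 90-day retention period, and the subject_id key are assumptions made up for the example, not anything GDPR prescribes.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical allowlist: the only fields this imaginary project needs
ALLOWED_FIELDS = {"subject_id", "display_name", "scraped_at"}
RETENTION = timedelta(days=90)  # assumed retention period


def minimize(record: dict, allowed: set[str] = ALLOWED_FIELDS) -> dict:
    """Data minimization: keep only the fields on the allowlist."""
    return {key: value for key, value in record.items() if key in allowed}


def purge_expired(
    records: list[dict], now: datetime, retention: timedelta = RETENTION
) -> list[dict]:
    """Storage limitation: drop records older than the retention period."""
    return [r for r in records if now - r["scraped_at"] <= retention]


def erase_subject(records: list[dict], subject_id: str) -> list[dict]:
    """Right to erasure: remove every record tied to one individual."""
    return [r for r in records if r["subject_id"] != subject_id]


now = datetime.now(timezone.utc)
raw = {
    "subject_id": "u1",
    "display_name": "Ada",
    "email": "ada@example.com",  # not needed, so minimize() drops it
    "scraped_at": now - timedelta(days=10),
}
stored = [minimize(raw)]
stored = purge_expired(stored, now)
stored = erase_subject(stored, "u1")  # handle a deletion request
print(stored)
```

An allowlist is used rather than a blocklist so that a new field appearing on the scraped page is dropped by default instead of silently stored.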

Practical approach:

If you're scraping product prices, company-level business info, or other data that doesn't identify a person: GDPR generally doesn't apply.
If you're scraping names, emails, or photos of individuals: you need a documented lawful basis and must handle data subject requests.

US vs. EU: Key Differences

Aspect	United States	European Union
Public data scraping	Generally allowed (hiQ ruling)	Allowed, but GDPR applies to personal data
CFAA / Computer Misuse	Applies to unauthorized access	Similar laws in member states
Database rights	No equivalent federal law	Sui generis database right exists
Privacy	Sector- and state-specific (e.g., CCPA in California)	Comprehensive (GDPR across the EU/EEA)
robots.txt	Not legally binding	Not legally binding

Ethical Scraping: Best Practices

Legal and ethical aren't always the same thing. Here's how to be a good citizen:

1. Don't Overload Servers

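import time
import random
import requests

def polite_scrape(urls: list[str]) -> dict[str, str]:
    """Fetch each URL in turn, pausing between requests."""
    pages = {}
    for url in urls:
        # A timeout keeps one slow server from hanging the whole run
        response = requests.get(url, timeout=10)
        pages[url] = response.text
        # Random delay of 2-5 seconds so requests don't arrive in bursts
        time.sleep(random.uniform(2, 5))
    return pages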

Hammering a site with thousands of requests per second can cause downtime. That's not just rude — it could be considered a denial-of-service attack.

2. Identify Yourself

headers = {
    "User-Agent": "MyScraperBot/1.0 (contact@example.com)"
}

Using a descriptive User-Agent lets site owners contact you if there's an issue, rather than blocking you outright.

3. Cache and Don't Re-scrape Unnecessarily

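import hashlib
import os
import requests

def get_cached_or_scrape(url: str, cache_dir: str = "./cache") -> str:
    # MD5 only derives a filename from the URL here; it is not used for security
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = os.path.join(cache_dir, f"{cache_key}.html")

    # Serve from the local cache if this URL was fetched before
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # don't cache error pages
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text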

4. Respect Rate Limits

If a site returns 429 Too Many Requests, back off: wait before retrying, honor the Retry-After header when the server sends one, and lower your overall request rate. Services like ScraperAPI handle rate limiting and IP rotation for you, which is useful when you need reliability at scale.
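
Here's one way backing off on a 429 might look: a minimal sketch of exponential backoff with jitter that prefers the server's Retry-After value when it is given in seconds. The function names, the 2-second base, and the 60-second cap are illustrative choices, and `send` is any callable you supply that returns an object with `status_code` and `headers`, such as a thin wrapper around requests.get.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 2.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before the next retry (attempt counts from 0)."""
    if retry_after is not None:
        try:
            # Honor the server's hint when it is a number of seconds
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    # Exponential growth (2s, 4s, 8s, ...) capped, plus up to 1s of jitter
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)


def fetch_with_backoff(
    url: str,
    send: Callable,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send(url), retrying with increasing delays while it returns 429."""
    for attempt in range(max_attempts):
        response = send(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts: {url}")
```

With requests installed, a call could look like `fetch_with_backoff(url, send=lambda u: requests.get(u, timeout=10))`. Passing `send` and `sleep` in as parameters also makes the retry logic easy to test without hitting a real site.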

5. Don't Republish Copyrighted Content

Scraping facts (prices, dates, statistics) is different from scraping creative content (articles, photos, reviews). Facts themselves are not copyrightable, so you can generally scrape and store them. Republishing someone else's article verbatim is copyright infringement regardless of how you obtained it.

A Simple Compliance Checklist

Before starting a scraping project, run through this:

[ ] Is the data publicly accessible (no login required)?
[ ] Am I scraping facts/data, not copyrighted creative works?
[ ] Have I checked robots.txt?
[ ] Am I rate-limiting my requests?
[ ] Am I avoiding personal data, or do I have a lawful basis (GDPR)?
[ ] Have I set a descriptive User-Agent?
[ ] Am I avoiding circumvention of technical access controls?
[ ] Have I documented my purpose for scraping?

If you can check all of these, you're on solid ground.

Bottom Line

Web scraping is a legitimate tool used by businesses worldwide. The legal landscape has actually become more favorable for scrapers of public data over the past few years, thanks to cases like hiQ v. LinkedIn.

The rules are straightforward:

Public data is generally fair game in the US
Be polite — rate limit, identify yourself, don't overload servers
Be careful with personal data in the EU
Don't scrape behind login walls without authorization
Facts and data points are not copyrightable — creative expression is

When in doubt, consult a lawyer for your specific use case. But don't let fear stop you from building with publicly available data — that's what the internet is for.

Disclaimer: This article is for informational purposes only and does not constitute legal advice. Consult a qualified attorney for guidance on your specific situation.

DEV Community