DEV Community

agenthustler

Web Scraping Best Practices in 2026: Respectful, Efficient, and Reliable Scraping

Web scraping is one of the most powerful data collection techniques available, but with great power comes responsibility. As websites become more sophisticated and regulations evolve, following best practices isn't just polite — it's essential for building scrapers that actually work long-term.

This guide covers the practices I've learned from building and maintaining dozens of production scrapers. Think of it as the 'be a good web citizen' handbook for 2026.

1. Respect robots.txt — Always

The robots.txt file is a website's way of telling you what they're comfortable with you scraping. Ignoring it is like ignoring a 'Please Don't Walk on the Grass' sign — technically you can, but you shouldn't.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return False  # Couldn't fetch robots.txt — err on the side of caution

    return rp.can_fetch(user_agent, url)

# Always check before scraping
if can_scrape("https://example.com/products"):
    scrape_page("https://example.com/products")
else:
    print("Blocked by robots.txt — skipping")

What robots.txt Tells You

  • Crawl-delay: How long to wait between requests (respect this!)
  • Disallow: Paths you shouldn't access
  • Sitemap: Often points to structured data you can use instead of scraping
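Both the crawl delay and the sitemap list can be read with the same stdlib parser. A minimal sketch (`robots_hints` is my own helper name; it parses robots.txt text you've already fetched, which also makes it easy to test):

```python
from urllib.robotparser import RobotFileParser

def robots_hints(robots_txt: str, user_agent: str = "*") -> dict:
    """Extract the crawl-delay and sitemap URLs from robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {
        "crawl_delay": rp.crawl_delay(user_agent),  # None if not declared
        "sitemaps": rp.site_maps() or [],           # None becomes an empty list
    }
```

The `crawl_delay` value feeds straight into the delay logic in the next section; the sitemap URLs often let you skip crawling entirely.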

2. Rate Limiting: Don't Be a DDoS

The fastest way to get blocked — and potentially cause real harm — is hammering a site with hundreds of requests per second. A good scraper is patient.

import time
import random

def polite_delay(min_seconds: float = 1.0, max_seconds: float = 3.0):
    """Random delay between requests to mimic human behavior."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)

# For sites with explicit crawl-delay
def respect_crawl_delay(robots_delay: float | None, default: float = 2.0):
    delay = robots_delay if robots_delay is not None else default
    time.sleep(delay)

Rate Limiting Rules of Thumb

| Site Type | Recommended Delay | Max Concurrent |
|---|---|---|
| Small business | 3-5 seconds | 1 |
| Medium site | 1-3 seconds | 2-3 |
| Large platform | 0.5-1 second | 5 |
| API endpoint | Per their docs | Per their docs |
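To actually enforce those delays, a small per-host limiter object beats sprinkling sleep() calls everywhere. A sketch (the RateLimiter class and interface are mine, not from any library):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests to a single host."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the last request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Keep one instance per target host (e.g. in a dict keyed by hostname) and call `limiter.wait()` before each request; the first call passes through immediately.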

When you need to scale without overwhelming targets, proxy rotation helps distribute load. Services like ScrapeOps provide proxy and header management that automatically handles rate limiting across their proxy pool.

3. User Agent Rotation

Sending the same user agent string on every request is a dead giveaway that you're a bot. Rotate user agents to distribute your scraping fingerprint.

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0",
]

def get_headers() -> dict:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }

For more advanced header management, ScrapeOps offers a free Headers API that returns realistic browser header sets, keeping your requests looking natural.

4. Caching: Don't Scrape What You Already Have

One of the most overlooked best practices: cache aggressively. If a page hasn't changed, don't re-download it.

import hashlib
import json
from pathlib import Path
from datetime import datetime, timedelta, timezone

class ScrapeCache:
    def __init__(self, cache_dir: str = "./.scrape_cache", ttl_hours: int = 24):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl = timedelta(hours=ttl_hours)

    def _key(self, url: str) -> str:
        return hashlib.md5(url.encode()).hexdigest()

    def get(self, url: str) -> str | None:
        path = self.cache_dir / f"{self._key(url)}.json"
        if not path.exists():
            return None
        data = json.loads(path.read_text())
        cached_at = datetime.fromisoformat(data["cached_at"])
        if datetime.now(timezone.utc) - cached_at > self.ttl:
            return None  # Expired
        return data["content"]

    def set(self, url: str, content: str):
        path = self.cache_dir / f"{self._key(url)}.json"
        path.write_text(json.dumps({
            "url": url,
            "content": content,
            "cached_at": datetime.now(timezone.utc).isoformat()
        }))

# Usage
cache = ScrapeCache(ttl_hours=12)
cached = cache.get(url)
if cached:
    html = cached  # No request needed!
else:
    html = fetch_page(url)
    cache.set(url, html)

Your caching should also respect HTTP cache headers. Check Cache-Control, ETag, and Last-Modified — many sites explicitly tell you how long content is valid.

5. Error Handling: Expect Everything to Break

Websites change. Servers go down. Your IP gets blocked. A resilient scraper handles all of this gracefully.

import httpx
import time

def fetch_with_retry(
    url: str,
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    proxy_url: str | None = None
) -> httpx.Response | None:
    for attempt in range(max_retries):
        try:
            kwargs = {"headers": get_headers(), "timeout": 30.0, "follow_redirects": True}
            if proxy_url:
                kwargs["proxy"] = proxy_url

            with httpx.Client(**kwargs) as client:
                response = client.get(url)

                if response.status_code == 200:
                    return response
                elif response.status_code == 429:  # Rate limited
                    wait = backoff_factor ** (attempt + 2)  # Longer wait
                    print(f"Rate limited, waiting {wait}s...")
                    time.sleep(wait)
                elif response.status_code in (403, 503):  # Likely blocked
                    print(f"Blocked ({response.status_code}), backing off...")
                    time.sleep(backoff_factor ** attempt)
                else:
                    print(f"HTTP {response.status_code} for {url}")
                    return None

        except httpx.TimeoutException:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(backoff_factor ** attempt)
        except httpx.ConnectError:
            print(f"Connection error on attempt {attempt + 1}")
            time.sleep(backoff_factor ** attempt)

    return None

For production scrapers, ThorData residential proxies can help bypass blocks by rotating through clean IP addresses automatically.

6. Structured Data First, HTML Parsing Second

Before writing complex CSS selectors, check if the data is available in a structured format:

  1. JSON-LD in <script type="application/ld+json"> — product info, reviews, organization data
  2. Open Graph meta tags — titles, descriptions, images
  3. APIs — many sites have public or semi-public APIs
  4. RSS feeds — blog posts, news, product updates
  5. Sitemaps — complete URL lists with last-modified dates

import json
from selectolax.parser import HTMLParser

def extract_structured_data(html: str) -> list[dict]:
    tree = HTMLParser(html)
    results = []
    for script in tree.css('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.text())
            results.append(data)
        except json.JSONDecodeError:
            continue
    return results

Structured data is more reliable, less likely to break when the site redesigns, and often contains exactly the fields you need.

7. Monitoring Your Scrapers

A scraper without monitoring is a scraper you'll discover is broken two weeks too late.

Key Metrics to Track

  • Success rate: Percentage of requests returning valid data
  • Response times: Sudden increases suggest blocking
  • Data freshness: When was the last successful scrape?
  • Schema violations: Are you getting the fields you expect?

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ScrapeStats:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    blocked: int = 0
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def success_rate(self) -> float:
        return (self.successful / self.total_requests * 100) if self.total_requests else 0

    def report(self) -> str:
        elapsed = (datetime.now(timezone.utc) - self.start_time).total_seconds()
        return (
            f"Scraped {self.total_requests} pages in {elapsed:.0f}s | "
            f"Success: {self.success_rate:.1f}% | "
            f"Failed: {self.failed} | Blocked: {self.blocked}"
        )

8. Legal and Ethical Considerations

  • Check Terms of Service: Some sites explicitly prohibit scraping
  • Don't scrape personal data without a legitimate basis (GDPR, CCPA)
  • Attribute data sources when publishing derived insights
  • Don't overload servers: Your convenience doesn't justify degrading someone else's service
  • Consider the purpose: Competitive intelligence and research are generally acceptable; copying entire databases to compete directly is not

Quick Reference Checklist

✅ Check robots.txt before scraping
✅ Implement rate limiting (1-3 second delays)
✅ Rotate user agents and headers
✅ Cache responses to avoid redundant requests
✅ Handle errors with exponential backoff
✅ Look for structured data (JSON-LD, APIs) first
✅ Monitor success rates and data quality
✅ Review ToS for each target site
✅ Use proxies for scale, not for bypassing blocks you should respect
✅ Log everything for debugging

What's Your Scraping Stack?

I'm curious what tools and practices the dev.to community uses for web scraping in 2026. Have you found better approaches for any of these? Are there best practices I missed? Let me know in the comments.


Building reliable scrapers is hard. These practices have saved me countless hours of debugging and kept my scrapers running smoothly across dozens of data sources.
