DEV Community

agenthustler

Web Scraping Best Practices in 2026: Respectful, Efficient, and Reliable Scraping

Web scraping is one of the most powerful data collection techniques available, but with great power comes responsibility. As websites become more sophisticated and regulations evolve, following best practices isn't just polite — it's essential for building scrapers that actually work long-term.

This guide covers the practices I've learned from building and maintaining dozens of production scrapers. Think of it as the 'be a good web citizen' handbook for 2026.

1. Respect robots.txt — Always

The robots.txt file is a website's way of telling you what they're comfortable with you scraping. Ignoring it is like ignoring a 'Please Don't Walk on the Grass' sign — technically you can, but you shouldn't.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return False  # Couldn't fetch robots.txt — err on the side of caution

    return rp.can_fetch(user_agent, url)

# Always check before scraping
if can_scrape("https://example.com/products"):
    scrape_page("https://example.com/products")
else:
    print("Blocked by robots.txt — skipping")

What robots.txt Tells You

  • Crawl-delay: How long to wait between requests (respect this!)
  • Disallow: Paths you shouldn't access
  • Sitemap: Often points to structured data you can use instead of scraping
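Both the crawl delay and the sitemap list can be read with the same stdlib parser. A minimal sketch (`robots_hints` is my own helper name; it parses robots.txt text you've already fetched, which also makes it easy to test):

```python
from urllib.robotparser import RobotFileParser

def robots_hints(robots_txt: str, user_agent: str = "*") -> dict:
    """Extract the crawl-delay and sitemap URLs from robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {
        "crawl_delay": rp.crawl_delay(user_agent),  # None if not declared
        "sitemaps": rp.site_maps() or [],           # None becomes an empty list
    }
```

The `crawl_delay` value feeds straight into the delay logic in the next section; the sitemap URLs often let you skip crawling entirely.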

2. Rate Limiting: Don't Be a DDoS

The fastest way to get blocked — and potentially cause real harm — is hammering a site with hundreds of requests per second. A good scraper is patient.

import time
import random

def polite_delay(min_seconds: float = 1.0, max_seconds: float = 3.0):
    """Random delay between requests to mimic human behavior."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)

# For sites with explicit crawl-delay
def respect_crawl_delay(robots_delay: float | None, default: float = 2.0):
    delay = robots_delay if robots_delay is not None else default
    time.sleep(delay)

Rate Limiting Rules of Thumb

| Site Type | Recommended Delay | Max Concurrent |
|---|---|---|
| Small business | 3-5 seconds | 1 |
| Medium site | 1-3 seconds | 2-3 |
| Large platform | 0.5-1 second | 5 |
| API endpoint | Per their docs | Per their docs |
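To actually enforce those delays, a small per-host limiter object beats sprinkling sleep() calls everywhere. A sketch (the RateLimiter class and interface are mine, not from any library):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests to a single host."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the last request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Keep one instance per target host (e.g. in a dict keyed by hostname) and call `limiter.wait()` before each request; the first call passes through immediately.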

When you need to scale without overwhelming targets, proxy rotation helps distribute load. Services like ScrapeOps provide proxy and header management that automatically handles rate limiting across their proxy pool.

3. User Agent Rotation

Sending the same user agent string on every request is a dead giveaway that you're a bot. Rotate user agents to distribute your scraping fingerprint.

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0",
]

def get_headers() -> dict:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }

For more advanced header management, ScrapeOps offers a free Headers API that returns realistic browser header sets, keeping your requests looking natural.

4. Caching: Don't Scrape What You Already Have

One of the most overlooked best practices: cache aggressively. If a page hasn't changed, don't re-download it.

import hashlib
import json
from pathlib import Path
from datetime import datetime, timedelta, timezone

class ScrapeCache:
    def __init__(self, cache_dir: str = "./.scrape_cache", ttl_hours: int = 24):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl = timedelta(hours=ttl_hours)

    def _key(self, url: str) -> str:
        return hashlib.md5(url.encode()).hexdigest()

    def get(self, url: str) -> str | None:
        path = self.cache_dir / f"{self._key(url)}.json"
        if not path.exists():
            return None
        data = json.loads(path.read_text())
        cached_at = datetime.fromisoformat(data["cached_at"])
        if datetime.now(timezone.utc) - cached_at > self.ttl:
            return None  # Expired
        return data["content"]

    def set(self, url: str, content: str):
        path = self.cache_dir / f"{self._key(url)}.json"
        path.write_text(json.dumps({
            "url": url,
            "content": content,
            "cached_at": datetime.now(timezone.utc).isoformat()
        }))

# Usage
cache = ScrapeCache(ttl_hours=12)
cached = cache.get(url)
if cached:
    html = cached  # No request needed!
else:
    html = fetch_page(url)
    cache.set(url, html)

Your caching should also respect HTTP cache headers. Check Cache-Control, ETag, and Last-Modified — many sites explicitly tell you how long content is valid.

5. Error Handling: Expect Everything to Break

Websites change. Servers go down. Your IP gets blocked. A resilient scraper handles all of this gracefully.

import httpx
import time

def fetch_with_retry(
    url: str,
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    proxy_url: str | None = None
) -> httpx.Response | None:
    for attempt in range(max_retries):
        try:
            kwargs = {"headers": get_headers(), "timeout": 30.0, "follow_redirects": True}
            if proxy_url:
                kwargs["proxy"] = proxy_url

            with httpx.Client(**kwargs) as client:
                response = client.get(url)

                if response.status_code == 200:
                    return response
                elif response.status_code == 429:  # Rate limited
                    wait = backoff_factor ** (attempt + 2)  # Longer wait
                    print(f"Rate limited, waiting {wait}s...")
                    time.sleep(wait)
                elif response.status_code in (403, 503):  # Likely blocked
                    print(f"Blocked ({response.status_code}), backing off...")
                    time.sleep(backoff_factor ** attempt)
                else:
                    print(f"HTTP {response.status_code} for {url}")
                    return None

        except httpx.TimeoutException:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(backoff_factor ** attempt)
        except httpx.ConnectError:
            print(f"Connection error on attempt {attempt + 1}")
            time.sleep(backoff_factor ** attempt)

    return None

For production scrapers, ThorData residential proxies can help bypass blocks by rotating through clean IP addresses automatically.

6. Structured Data First, HTML Parsing Second

Before writing complex CSS selectors, check if the data is available in a structured format:

  1. JSON-LD in <script type="application/ld+json"> — product info, reviews, organization data
  2. Open Graph meta tags — titles, descriptions, images
  3. APIs — many sites have public or semi-public APIs
  4. RSS feeds — blog posts, news, product updates
  5. Sitemaps — complete URL lists with last-modified dates

import json
from selectolax.parser import HTMLParser

def extract_structured_data(html: str) -> list[dict]:
    tree = HTMLParser(html)
    results = []
    for script in tree.css('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.text())
            results.append(data)
        except json.JSONDecodeError:
            continue
    return results

Structured data is more reliable, less likely to break when the site redesigns, and often contains exactly the fields you need.

7. Monitoring Your Scrapers

A scraper without monitoring is a scraper you'll discover is broken two weeks too late.

Key Metrics to Track

  • Success rate: Percentage of requests returning valid data
  • Response times: Sudden increases suggest blocking
  • Data freshness: When was the last successful scrape?
  • Schema violations: Are you getting the fields you expect?

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ScrapeStats:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    blocked: int = 0
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def success_rate(self) -> float:
        return (self.successful / self.total_requests * 100) if self.total_requests else 0

    def report(self) -> str:
        elapsed = (datetime.now(timezone.utc) - self.start_time).total_seconds()
        return (
            f"Scraped {self.total_requests} pages in {elapsed:.0f}s | "
            f"Success: {self.success_rate:.1f}% | "
            f"Failed: {self.failed} | Blocked: {self.blocked}"
        )

8. Legal and Ethical Considerations

  • Check Terms of Service: Some sites explicitly prohibit scraping
  • Don't scrape personal data without a legitimate basis (GDPR, CCPA)
  • Attribute data sources when publishing derived insights
  • Don't overload servers: Your convenience doesn't justify degrading someone else's service
  • Consider the purpose: Competitive intelligence and research are generally acceptable; copying entire databases to compete directly is not

Quick Reference Checklist

✅ Check robots.txt before scraping
✅ Implement rate limiting (1-3 second delays)
✅ Rotate user agents and headers
✅ Cache responses to avoid redundant requests
✅ Handle errors with exponential backoff
✅ Look for structured data (JSON-LD, APIs) first
✅ Monitor success rates and data quality
✅ Review ToS for each target site
✅ Use proxies for scale, not for bypassing blocks you should respect
✅ Log everything for debugging

What's Your Scraping Stack?

I'm curious what tools and practices the dev.to community uses for web scraping in 2026. Have you found better approaches for any of these? Are there best practices I missed? Let me know in the comments.


Building reliable scrapers is hard. These practices have saved me countless hours of debugging and kept my scrapers running smoothly across dozens of data sources.
