You can build a scraper that's technically brilliant — stealth patches, CAPTCHA solving, proxy rotation. But if you ignore rate limits, blast servers with requests, and violate terms of service, you're not a developer. You're a problem.
Let's talk about responsible scraping: how to get the data you need without being a bad actor.
Why This Matters
Beyond ethics, there are practical reasons:
- Legal risk — lawsuits are real (hiQ v. LinkedIn, Clearview AI)
- IP bans — aggressive scraping gets you permanently blocked
- Server harm — you can accidentally DDoS small sites
- Reputation — your company's IP range gets blacklisted
- Data quality — rushed scraping produces worse data
Understanding robots.txt
Every website can publish a robots.txt file that specifies which paths scrapers should avoid:
# https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Disallow: /user/*/settings
Crawl-delay: 10
User-agent: Googlebot
Allow: /
Crawl-delay: 1
Parsing robots.txt in Python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
class RobotsChecker:
def __init__(self):
self._parsers: dict[str, RobotFileParser] = {}
def can_fetch(self, url: str, user_agent: str = "*") -> bool:
"""Check if a URL is allowed by robots.txt."""
domain = urlparse(url).netloc
if domain not in self._parsers:
parser = RobotFileParser()
parser.set_url(f"https://{domain}/robots.txt")
try:
parser.read()
            except Exception:
                # If robots.txt is unreadable, default to allowing the
                # fetch, but keep your own rate limits in place
                return True
self._parsers[domain] = parser
return self._parsers[domain].can_fetch(user_agent, url)
def get_crawl_delay(
self, domain: str, user_agent: str = "*"
) -> float | None:
"""Get the recommended crawl delay."""
if domain not in self._parsers:
self.can_fetch(f"https://{domain}/", user_agent)
parser = self._parsers.get(domain)
if parser:
            delay = parser.crawl_delay(user_agent)
            # `if delay` would discard a legitimate value of 0
            return float(delay) if delay is not None else None
return None
# Usage
checker = RobotsChecker()
urls = [
"https://example.com/products/123",
"https://example.com/admin/users",
"https://example.com/api/internal/debug",
]
for url in urls:
allowed = checker.can_fetch(url)
print(f"{'✓' if allowed else '✗'} {url}")
Output:
✓ https://example.com/products/123
✗ https://example.com/admin/users
✗ https://example.com/api/internal/debug
Should You Always Follow robots.txt?
robots.txt is advisory, not legally binding in most jurisdictions. But:
- Follow it for sites you have no business relationship with
- Respect Crawl-delay — it's telling you their server's capacity
- Document your decisions — if you choose to ignore specific rules, have a reason
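One lightweight way to document those decisions is a small policy table kept in your repo. Everything below (the domain names, the fields, the helper) is illustrative, not a standard:

```python
# Hypothetical per-domain policy record. Any robots.txt override is
# written down with a reason, so the decision is auditable later.
SCRAPE_POLICIES = {
    "partner-site.example": {
        "ignore_robots_paths": ["/api/export/"],
        "reason": "Written permission from partner, 2024-03-01",
        "max_requests_per_minute": 30,
    },
    "example.com": {
        "ignore_robots_paths": [],
        "reason": "No relationship; follow robots.txt fully",
        "max_requests_per_minute": 10,
    },
}

def allowed_override(domain: str, path: str) -> bool:
    """Return True only if an override is explicitly documented."""
    policy = SCRAPE_POLICIES.get(domain, {})
    return any(
        path.startswith(prefix)
        for prefix in policy.get("ignore_robots_paths", [])
    )
```

The point is less the code than the habit: no undocumented exceptions.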
Rate Limiting: Don't Be That Scraper
The Golden Rule
Your scraper should be indistinguishable from a human browsing the site. A human doesn't load 100 pages per second.
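Humans are also irregular: fixed 2.0-second gaps look robotic in server logs. A minimal sketch of randomized jitter around a base delay (the ±50% range is an arbitrary choice, not a rule):

```python
import asyncio
import random

async def human_like_wait(base_delay: float = 2.0) -> float:
    """Sleep for a randomized, human-like interval around base_delay.

    Uniform jitter of +/-50% is an illustrative choice; the returned
    value is the actual delay used, handy for logging.
    """
    delay = base_delay * random.uniform(0.5, 1.5)
    await asyncio.sleep(delay)
    return delay
```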
Implementing Respectful Rate Limiting
import asyncio
import time
from collections import defaultdict
class RespectfulRateLimiter:
"""Rate limiter that respects site capacity."""
def __init__(
self,
default_delay: float = 2.0,
max_concurrent: int = 3,
):
self.default_delay = default_delay
self.max_concurrent = max_concurrent
self._domain_semaphores: dict[str, asyncio.Semaphore] = {}
self._last_request: dict[str, float] = defaultdict(float)
self._lock = asyncio.Lock()
def _get_semaphore(self, domain: str) -> asyncio.Semaphore:
if domain not in self._domain_semaphores:
self._domain_semaphores[domain] = asyncio.Semaphore(
self.max_concurrent
)
return self._domain_semaphores[domain]
    async def acquire(self, domain: str, crawl_delay: float | None = None):
"""Wait for permission to make a request."""
sem = self._get_semaphore(domain)
await sem.acquire()
delay = crawl_delay or self.default_delay
async with self._lock:
elapsed = time.monotonic() - self._last_request[domain]
if elapsed < delay:
await asyncio.sleep(delay - elapsed)
self._last_request[domain] = time.monotonic()
def release(self, domain: str):
sem = self._get_semaphore(domain)
sem.release()
# Usage with robots.txt (assumes httpx is installed: pip install httpx)
import httpx
class ResponsibleScraper:
def __init__(self):
self.robots = RobotsChecker()
self.limiter = RespectfulRateLimiter(
default_delay=2.0,
max_concurrent=3,
)
async def fetch(self, url: str) -> str | None:
domain = urlparse(url).netloc
# Step 1: Check robots.txt
if not self.robots.can_fetch(url):
print(f"Blocked by robots.txt: {url}")
return None
# Step 2: Respect crawl delay
crawl_delay = self.robots.get_crawl_delay(domain)
# Step 3: Rate limit
await self.limiter.acquire(domain, crawl_delay)
try:
async with httpx.AsyncClient() as client:
resp = await client.get(url)
return resp.text
finally:
self.limiter.release(domain)
Adaptive Rate Limiting
Adjust your speed based on server response:
class AdaptiveRateLimiter:
"""Slow down when the server shows stress."""
def __init__(self, base_delay: float = 1.0):
self.base_delay = base_delay
self.current_delay = base_delay
self.max_delay = 30.0
self._consecutive_errors = 0
def record_response(
self, status_code: int, response_time: float
):
if status_code == 429:
# Rate limited — back off significantly
self.current_delay = min(
self.current_delay * 3,
self.max_delay
)
print(
f"Rate limited! Delay → {self.current_delay:.1f}s"
)
elif status_code >= 500:
# Server error — back off
self._consecutive_errors += 1
self.current_delay = min(
self.base_delay * (2 ** self._consecutive_errors),
self.max_delay
)
elif response_time > 5.0:
# Slow response — server is struggling
self.current_delay = min(
self.current_delay * 1.5,
self.max_delay
)
else:
# Good response — gradually speed up
self._consecutive_errors = 0
self.current_delay = max(
self.current_delay * 0.95,
self.base_delay
)
async def wait(self):
await asyncio.sleep(self.current_delay)
Identifying Your Scraper
Be transparent about who you are:
# Set a clear User-Agent that identifies your bot
HEADERS = {
"User-Agent": (
"MyCompanyScraper/1.0 "
"(+https://mycompany.com/bot; "
        "contact@mycompany.com)"
    ),
    "From": "contact@mycompany.com",
}

# This helps site owners:
# 1. Contact you if there's a problem
# 2. Whitelist you if they want to
# 3. Understand the traffic source
Handling CAPTCHAs Responsibly
When CAPTCHAs appear, they're a signal: the site wants to verify you're human. Options:
Option 1: Reduce Your Rate
async def handle_captcha_signal(scraper):
"""CAPTCHAs appearing = you're going too fast."""
# First, slow down
scraper.limiter.current_delay *= 2
print(f"CAPTCHAs detected — slowing to "
f"{scraper.limiter.current_delay:.1f}s/req")
# If CAPTCHAs persist, solve them
# But don't solve more than N per hour
if scraper.captcha_count_this_hour < 50:
token = await solver.solve(...)
scraper.captcha_count_this_hour += 1
return token
else:
print("Too many CAPTCHAs — stopping to avoid abuse")
return None
Option 2: Solve When Necessary
For legitimate use cases (price monitoring, research, testing), solving CAPTCHAs is reasonable:
from datetime import datetime
class CaptchaBudget:
"""Track and limit CAPTCHA solving costs."""
def __init__(
self,
daily_budget: float = 5.0, # USD
cost_per_solve: float = 0.001,
):
self.daily_budget = daily_budget
self.cost_per_solve = cost_per_solve
self.today_spent = 0.0
self.today_date = datetime.utcnow().date()
def can_solve(self) -> bool:
today = datetime.utcnow().date()
if today != self.today_date:
self.today_spent = 0.0
self.today_date = today
return self.today_spent + self.cost_per_solve <= self.daily_budget
def record_solve(self):
self.today_spent += self.cost_per_solve
    @property
    def remaining(self) -> float:
        return self.daily_budget - self.today_spent
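When picking the numbers, it's worth sanity-checking what a budget actually buys. A tiny helper that works in integer tenths of a cent to sidestep float drift (the name and the resolution choice are ours):

```python
def max_solves(daily_budget_usd: float, cost_per_solve_usd: float) -> int:
    """How many solves a daily budget covers at a given per-solve cost.

    Converts to integer tenths of a cent so repeated float addition
    can't silently shave off a solve.
    """
    budget_m = round(daily_budget_usd * 1000)   # $5.00  -> 5000
    cost_m = round(cost_per_solve_usd * 1000)   # $0.001 -> 1
    if cost_m <= 0:
        raise ValueError("cost_per_solve_usd is below $0.001 resolution")
    return budget_m // cost_m

print(max_solves(5.0, 0.001))  # 5000
```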
Caching: Don't Scrape What You Already Have
import hashlib
import json
from pathlib import Path
from datetime import datetime, timedelta
class ScrapeCache:
"""Cache scraped pages to avoid unnecessary requests."""
def __init__(
self,
cache_dir: str = ".cache",
ttl_hours: int = 24,
):
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(exist_ok=True)
self.ttl = timedelta(hours=ttl_hours)
self.hits = 0
self.misses = 0
def _cache_key(self, url: str) -> str:
return hashlib.md5(url.encode()).hexdigest()
def get(self, url: str) -> str | None:
key = self._cache_key(url)
cache_file = self.cache_dir / f"{key}.json"
if not cache_file.exists():
self.misses += 1
return None
data = json.loads(cache_file.read_text())
cached_at = datetime.fromisoformat(data["cached_at"])
if datetime.utcnow() - cached_at > self.ttl:
self.misses += 1
return None
self.hits += 1
return data["html"]
def set(self, url: str, html: str):
key = self._cache_key(url)
cache_file = self.cache_dir / f"{key}.json"
cache_file.write_text(json.dumps({
"url": url,
"html": html,
"cached_at": datetime.utcnow().isoformat(),
}))
    @property
    def hit_rate(self) -> str:
total = self.hits + self.misses
if total == 0:
return "N/A"
return f"{self.hits/total:.1%}"
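TTL caching combines well with HTTP revalidation: if you also store a response's ETag or Last-Modified value, you can send If-None-Match / If-Modified-Since on the next request, and a 304 Not Modified reply means the server skipped resending the body. A sketch of the header-building half; the metadata dict format here is an assumption, not a standard:

```python
def revalidation_headers(cached_meta: dict) -> dict:
    """Build conditional-request headers from stored response metadata.

    cached_meta is whatever you saved alongside the HTML, e.g.
    {"etag": '"abc123"', "last_modified": "Tue, 01 Oct 2024 07:28:00 GMT"}.
    Send these on the next request; a 304 means reuse the cached copy.
    """
    headers = {}
    if cached_meta.get("etag"):
        headers["If-None-Match"] = cached_meta["etag"]
    if cached_meta.get("last_modified"):
        headers["If-Modified-Since"] = cached_meta["last_modified"]
    return headers
```

A 304 still costs the server a request, but not the bandwidth or rendering of a full page.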
The Complete Responsible Scraper
class ResponsibleScraper:
    def __init__(self, config: dict | None = None):
config = config or {}
self.robots = RobotsChecker()
self.limiter = AdaptiveRateLimiter(
base_delay=config.get("base_delay", 2.0)
)
self.cache = ScrapeCache(
ttl_hours=config.get("cache_hours", 24)
)
self.captcha_budget = CaptchaBudget(
daily_budget=config.get("daily_captcha_budget", 5.0)
)
self.captcha_solver = CaptchaSolver(
api_base="https://www.passxapi.com"
)
self.stats = {
"fetched": 0,
"cached": 0,
"robots_blocked": 0,
"captchas_solved": 0,
"rate_limited": 0,
}
async def scrape(self, url: str) -> dict | None:
domain = urlparse(url).netloc
# 1. Check cache
cached = self.cache.get(url)
if cached:
self.stats["cached"] += 1
return {"url": url, "html": cached, "cached": True}
# 2. Check robots.txt
if not self.robots.can_fetch(url):
self.stats["robots_blocked"] += 1
return None
        # 3. Rate limit, honoring robots.txt Crawl-delay if it's larger
        crawl_delay = self.robots.get_crawl_delay(domain)
        if crawl_delay:
            self.limiter.current_delay = max(
                self.limiter.current_delay, crawl_delay
            )
        await self.limiter.wait()
# 4. Fetch
async with httpx.AsyncClient(
headers={
"User-Agent": (
"DataCollector/1.0 "
"(+https://mysite.com/bot)"
),
}
) as client:
start = time.monotonic()
resp = await client.get(url)
elapsed = time.monotonic() - start
# 5. Adapt rate based on response
self.limiter.record_response(
resp.status_code, elapsed
)
if resp.status_code == 429:
self.stats["rate_limited"] += 1
return None
html = resp.text
# 6. Handle CAPTCHA if present
captcha = detect_captcha(html)
if captcha:
if self.captcha_budget.can_solve():
token = await self.captcha_solver.solve(
captcha_type=captcha["type"],
sitekey=captcha["sitekey"],
url=url,
)
self.captcha_budget.record_solve()
self.stats["captchas_solved"] += 1
# Resubmit with token
resp = await client.post(
url, data={captcha["field"]: token}
)
html = resp.text
else:
print(
f"CAPTCHA budget exhausted "
f"(${self.captcha_budget.remaining:.2f} left)"
)
return None
# 7. Cache the result
self.cache.set(url, html)
self.stats["fetched"] += 1
return {"url": url, "html": html, "cached": False}
def print_stats(self):
print(f"Scraping stats: {self.stats}")
print(f"Cache hit rate: {self.cache.hit_rate}")
print(
f"CAPTCHA budget remaining: "
f"${self.captcha_budget.remaining:.2f}"
)
print(f"Current delay: {self.limiter.current_delay:.1f}s")
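Driving a scraper like this over a URL list is usually a bounded-concurrency loop. A self-contained sketch of the pattern; `fake_fetch` is a stub standing in for `scraper.scrape(url)`:

```python
import asyncio

async def crawl(urls, fetch, max_concurrent: int = 3) -> dict:
    """Fetch every URL with at most max_concurrent requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)
    results: dict = {}

    async def worker(url: str):
        async with sem:
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(u) for u in urls))
    return results

async def fake_fetch(url: str) -> str:
    """Stub standing in for scraper.scrape(url)."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

pages = asyncio.run(
    crawl([f"https://example.com/p/{i}" for i in range(5)], fake_fetch)
)
print(len(pages))  # 5
```

The per-domain semaphores and delays inside the scraper still apply; this outer bound just caps total concurrency so one run can't fan out indefinitely.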
Quick Checklist
Before running your scraper in production:
- [ ] robots.txt — Are you checking and respecting it?
- [ ] Rate limiting — Are you waiting between requests?
- [ ] User-Agent — Does it identify your bot and provide contact info?
- [ ] Caching — Are you avoiding re-scraping unchanged pages?
- [ ] Error handling — Do you back off on 429/5xx responses?
- [ ] CAPTCHA budget — Have you set a daily spending limit?
- [ ] Data storage — Are you only keeping data you actually need?
- [ ] Terms of Service — Have you read the site's ToS?
Key Takeaways
- robots.txt is your first check — respect it unless you have a documented reason not to
- Adaptive rate limiting is better than fixed delays — respond to server signals
- Cache aggressively — don't re-scrape what hasn't changed
- Budget your CAPTCHA solves — set daily limits and stick to them
- Identify yourself — a clear User-Agent helps everyone
- Slow is reliable — a scraper that runs for a week at 1 req/s beats one that gets banned in an hour
For handling CAPTCHAs within your budget, check out passxapi-python — at $0.001/solve, even a $5/day budget gives you 5,000 solves.
What's your approach to responsible scraping? Share your practices in the comments.