You can build a scraper that's technically brilliant — stealth patches, CAPTCHA solving, proxy rotation. But if you ignore rate limits, blast servers with requests, and violate terms of service, you're not a developer. You're a problem.
Let's talk about responsible scraping: how to get the data you need without being a bad actor.
Why This Matters
Beyond ethics, there are practical reasons:
- Legal risk — lawsuits are real (hiQ v. LinkedIn, Clearview AI)
- IP bans — aggressive scraping gets you permanently blocked
- Server harm — you can accidentally DDoS small sites
- Reputation — your company's IP range gets blacklisted
- Data quality — rushed scraping produces worse data
Understanding robots.txt
Every website can publish a robots.txt file that specifies which paths scrapers should avoid:
# https://example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Disallow: /user/*/settings
Crawl-delay: 10
User-agent: Googlebot
Allow: /
Crawl-delay: 1
Parsing robots.txt in Python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
class RobotsChecker:
def __init__(self):
self._parsers: dict[str, RobotFileParser] = {}
def can_fetch(self, url: str, user_agent: str = "*") -> bool:
"""Check if a URL is allowed by robots.txt."""
domain = urlparse(url).netloc
if domain not in self._parsers:
parser = RobotFileParser()
parser.set_url(f"https://{domain}/robots.txt")
try:
parser.read()
            except Exception:
                # If robots.txt is unreadable, default to allowing the
                # fetch, but keep your own rate limits in place
                return True
self._parsers[domain] = parser
return self._parsers[domain].can_fetch(user_agent, url)
def get_crawl_delay(
self, domain: str, user_agent: str = "*"
) -> float | None:
"""Get the recommended crawl delay."""
if domain not in self._parsers:
self.can_fetch(f"https://{domain}/", user_agent)
parser = self._parsers.get(domain)
if parser:
            delay = parser.crawl_delay(user_agent)
            # `if delay` would discard a legitimate value of 0
            return float(delay) if delay is not None else None
return None
# Usage
checker = RobotsChecker()
urls = [
"https://example.com/products/123",
"https://example.com/admin/users",
"https://example.com/api/internal/debug",
]
for url in urls:
allowed = checker.can_fetch(url)
print(f"{'✓' if allowed else '✗'} {url}")
Output:
✓ https://example.com/products/123
✗ https://example.com/admin/users
✗ https://example.com/api/internal/debug
Should You Always Follow robots.txt?
robots.txt is advisory, not legally binding in most jurisdictions. But:
- Follow it for sites you have no business relationship with
- Respect Crawl-delay — it's telling you their server's capacity
- Document your decisions — if you choose to ignore specific rules, have a reason
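One lightweight way to document those decisions is a small policy table kept in your repo. Everything below (the domain names, the fields, the helper) is illustrative, not a standard:

```python
# Hypothetical per-domain policy record. Any robots.txt override is
# written down with a reason, so the decision is auditable later.
SCRAPE_POLICIES = {
    "partner-site.example": {
        "ignore_robots_paths": ["/api/export/"],
        "reason": "Written permission from partner, 2024-03-01",
        "max_requests_per_minute": 30,
    },
    "example.com": {
        "ignore_robots_paths": [],
        "reason": "No relationship; follow robots.txt fully",
        "max_requests_per_minute": 10,
    },
}

def allowed_override(domain: str, path: str) -> bool:
    """Return True only if an override is explicitly documented."""
    policy = SCRAPE_POLICIES.get(domain, {})
    return any(
        path.startswith(prefix)
        for prefix in policy.get("ignore_robots_paths", [])
    )
```

The point is less the code than the habit: no undocumented exceptions.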
Rate Limiting: Don't Be That Scraper
The Golden Rule
Your scraper should be indistinguishable from a human browsing the site. A human doesn't load 100 pages per second.
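Humans are also irregular: fixed 2.0-second gaps look robotic in server logs. A minimal sketch of randomized jitter around a base delay (the ±50% range is an arbitrary choice, not a rule):

```python
import asyncio
import random

async def human_like_wait(base_delay: float = 2.0) -> float:
    """Sleep for a randomized, human-like interval around base_delay.

    Uniform jitter of +/-50% is an illustrative choice; the returned
    value is the actual delay used, handy for logging.
    """
    delay = base_delay * random.uniform(0.5, 1.5)
    await asyncio.sleep(delay)
    return delay
```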
Implementing Respectful Rate Limiting
import asyncio
import time
from collections import defaultdict
class RespectfulRateLimiter:
"""Rate limiter that respects site capacity."""
def __init__(
self,
default_delay: float = 2.0,
max_concurrent: int = 3,
):
self.default_delay = default_delay
self.max_concurrent = max_concurrent
self._domain_semaphores: dict[str, asyncio.Semaphore] = {}
self._last_request: dict[str, float] = defaultdict(float)
self._lock = asyncio.Lock()
def _get_semaphore(self, domain: str) -> asyncio.Semaphore:
if domain not in self._domain_semaphores:
self._domain_semaphores[domain] = asyncio.Semaphore(
self.max_concurrent
)
return self._domain_semaphores[domain]
    async def acquire(self, domain: str, crawl_delay: float | None = None):
"""Wait for permission to make a request."""
sem = self._get_semaphore(domain)
await sem.acquire()
delay = crawl_delay or self.default_delay
async with self._lock:
elapsed = time.monotonic() - self._last_request[domain]
if elapsed < delay:
await asyncio.sleep(delay - elapsed)
self._last_request[domain] = time.monotonic()
def release(self, domain: str):
sem = self._get_semaphore(domain)
sem.release()
# Usage with robots.txt (assumes httpx is installed: pip install httpx)
import httpx
class ResponsibleScraper:
def __init__(self):
self.robots = RobotsChecker()
self.limiter = RespectfulRateLimiter(
default_delay=2.0,
max_concurrent=3,
)
async def fetch(self, url: str) -> str | None:
domain = urlparse(url).netloc
# Step 1: Check robots.txt
if not self.robots.can_fetch(url):
print(f"Blocked by robots.txt: {url}")
return None
# Step 2: Respect crawl delay
crawl_delay = self.robots.get_crawl_delay(domain)
# Step 3: Rate limit
await self.limiter.acquire(domain, crawl_delay)
try:
async with httpx.AsyncClient() as client:
resp = await client.get(url)
return resp.text
finally:
self.limiter.release(domain)
Adaptive Rate Limiting
Adjust your speed based on server response:
class AdaptiveRateLimiter:
"""Slow down when the server shows stress."""
def __init__(self, base_delay: float = 1.0):
self.base_delay = base_delay
self.current_delay = base_delay
self.max_delay = 30.0
self._consecutive_errors = 0
def record_response(
self, status_code: int, response_time: float
):
if status_code == 429:
# Rate limited — back off significantly
self.current_delay = min(
self.current_delay * 3,
self.max_delay
)
print(
f"Rate limited! Delay → {self.current_delay:.1f}s"
)
elif status_code >= 500:
# Server error — back off
self._consecutive_errors += 1
self.current_delay = min(
self.base_delay * (2 ** self._consecutive_errors),
self.max_delay
)
elif response_time > 5.0:
# Slow response — server is struggling
self.current_delay = min(
self.current_delay * 1.5,
self.max_delay
)
else:
# Good response — gradually speed up
self._consecutive_errors = 0
self.current_delay = max(
self.current_delay * 0.95,
self.base_delay
)
async def wait(self):
await asyncio.sleep(self.current_delay)
Identifying Your Scraper
Be transparent about who you are:
# Set a clear User-Agent that identifies your bot
HEADERS = {
"User-Agent": (
"MyCompanyScraper/1.0 "
"(+https://mycompany.com/bot; "
        "contact@mycompany.com)"
    ),
    "From": "contact@mycompany.com",
}

# This helps site owners:
# 1. Contact you if there's a problem
# 2. Whitelist you if they want to
# 3. Understand the traffic source
Handling CAPTCHAs Responsibly
When CAPTCHAs appear, they're a signal: the site wants to verify you're human. Options:
Option 1: Reduce Your Rate
async def handle_captcha_signal(scraper):
"""CAPTCHAs appearing = you're going too fast."""
# First, slow down
scraper.limiter.current_delay *= 2
print(f"CAPTCHAs detected — slowing to "
f"{scraper.limiter.current_delay:.1f}s/req")
# If CAPTCHAs persist, solve them
# But don't solve more than N per hour
if scraper.captcha_count_this_hour < 50:
token = await solver.solve(...)
scraper.captcha_count_this_hour += 1
return token
else:
print("Too many CAPTCHAs — stopping to avoid abuse")
return None
Option 2: Solve When Necessary
For legitimate use cases (price monitoring, research, testing), solving CAPTCHAs is reasonable:
from datetime import datetime
class CaptchaBudget:
"""Track and limit CAPTCHA solving costs."""
def __init__(
self,
daily_budget: float = 5.0, # USD
cost_per_solve: float = 0.001,
):
self.daily_budget = daily_budget
self.cost_per_solve = cost_per_solve
self.today_spent = 0.0
self.today_date = datetime.utcnow().date()
def can_solve(self) -> bool:
today = datetime.utcnow().date()
if today != self.today_date:
self.today_spent = 0.0
self.today_date = today
return self.today_spent + self.cost_per_solve <= self.daily_budget
def record_solve(self):
self.today_spent += self.cost_per_solve
    @property
    def remaining(self) -> float:
        return self.daily_budget - self.today_spent
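When picking the numbers, it's worth sanity-checking what a budget actually buys. A tiny helper that works in integer tenths of a cent to sidestep float drift (the name and the resolution choice are ours):

```python
def max_solves(daily_budget_usd: float, cost_per_solve_usd: float) -> int:
    """How many solves a daily budget covers at a given per-solve cost.

    Converts to integer tenths of a cent so repeated float addition
    can't silently shave off a solve.
    """
    budget_m = round(daily_budget_usd * 1000)   # $5.00  -> 5000
    cost_m = round(cost_per_solve_usd * 1000)   # $0.001 -> 1
    if cost_m <= 0:
        raise ValueError("cost_per_solve_usd is below $0.001 resolution")
    return budget_m // cost_m

print(max_solves(5.0, 0.001))  # 5000
```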
Caching: Don't Scrape What You Already Have
import hashlib
import json
from pathlib import Path
from datetime import datetime, timedelta
class ScrapeCache:
"""Cache scraped pages to avoid unnecessary requests."""
def __init__(
self,
cache_dir: str = ".cache",
ttl_hours: int = 24,
):
self.cache_dir = Path(cache_dir)
self.cache_dir.mkdir(exist_ok=True)
self.ttl = timedelta(hours=ttl_hours)
self.hits = 0
self.misses = 0
def _cache_key(self, url: str) -> str:
return hashlib.md5(url.encode()).hexdigest()
def get(self, url: str) -> str | None:
key = self._cache_key(url)
cache_file = self.cache_dir / f"{key}.json"
if not cache_file.exists():
self.misses += 1
return None
data = json.loads(cache_file.read_text())
cached_at = datetime.fromisoformat(data["cached_at"])
if datetime.utcnow() - cached_at > self.ttl:
self.misses += 1
return None
self.hits += 1
return data["html"]
def set(self, url: str, html: str):
key = self._cache_key(url)
cache_file = self.cache_dir / f"{key}.json"
cache_file.write_text(json.dumps({
"url": url,
"html": html,
"cached_at": datetime.utcnow().isoformat(),
}))
    @property
    def hit_rate(self) -> str:
total = self.hits + self.misses
if total == 0:
return "N/A"
return f"{self.hits/total:.1%}"
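TTL caching combines well with HTTP revalidation: if you also store a response's ETag or Last-Modified value, you can send If-None-Match / If-Modified-Since on the next request, and a 304 Not Modified reply means the server skipped resending the body. A sketch of the header-building half; the metadata dict format here is an assumption, not a standard:

```python
def revalidation_headers(cached_meta: dict) -> dict:
    """Build conditional-request headers from stored response metadata.

    cached_meta is whatever you saved alongside the HTML, e.g.
    {"etag": '"abc123"', "last_modified": "Tue, 01 Oct 2024 07:28:00 GMT"}.
    Send these on the next request; a 304 means reuse the cached copy.
    """
    headers = {}
    if cached_meta.get("etag"):
        headers["If-None-Match"] = cached_meta["etag"]
    if cached_meta.get("last_modified"):
        headers["If-Modified-Since"] = cached_meta["last_modified"]
    return headers
```

A 304 still costs the server a request, but not the bandwidth or rendering of a full page.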
The Complete Responsible Scraper
class ResponsibleScraper:
    def __init__(self, config: dict | None = None):
config = config or {}
self.robots = RobotsChecker()
self.limiter = AdaptiveRateLimiter(
base_delay=config.get("base_delay", 2.0)
)
self.cache = ScrapeCache(
ttl_hours=config.get("cache_hours", 24)
)
self.captcha_budget = CaptchaBudget(
daily_budget=config.get("daily_captcha_budget", 5.0)
)
self.captcha_solver = CaptchaSolver(
api_base="https://www.passxapi.com"
)
self.stats = {
"fetched": 0,
"cached": 0,
"robots_blocked": 0,
"captchas_solved": 0,
"rate_limited": 0,
}
async def scrape(self, url: str) -> dict | None:
domain = urlparse(url).netloc
# 1. Check cache
cached = self.cache.get(url)
if cached:
self.stats["cached"] += 1
return {"url": url, "html": cached, "cached": True}
# 2. Check robots.txt
if not self.robots.can_fetch(url):
self.stats["robots_blocked"] += 1
return None
        # 3. Rate limit, honoring robots.txt Crawl-delay if it's larger
        crawl_delay = self.robots.get_crawl_delay(domain)
        if crawl_delay:
            self.limiter.current_delay = max(
                self.limiter.current_delay, crawl_delay
            )
        await self.limiter.wait()
# 4. Fetch
async with httpx.AsyncClient(
headers={
"User-Agent": (
"DataCollector/1.0 "
"(+https://mysite.com/bot)"
),
}
) as client:
start = time.monotonic()
resp = await client.get(url)
elapsed = time.monotonic() - start
# 5. Adapt rate based on response
self.limiter.record_response(
resp.status_code, elapsed
)
if resp.status_code == 429:
self.stats["rate_limited"] += 1
return None
html = resp.text
# 6. Handle CAPTCHA if present
captcha = detect_captcha(html)
if captcha:
if self.captcha_budget.can_solve():
token = await self.captcha_solver.solve(
captcha_type=captcha["type"],
sitekey=captcha["sitekey"],
url=url,
)
self.captcha_budget.record_solve()
self.stats["captchas_solved"] += 1
# Resubmit with token
resp = await client.post(
url, data={captcha["field"]: token}
)
html = resp.text
else:
print(
f"CAPTCHA budget exhausted "
f"(${self.captcha_budget.remaining:.2f} left)"
)
return None
# 7. Cache the result
self.cache.set(url, html)
self.stats["fetched"] += 1
return {"url": url, "html": html, "cached": False}
def print_stats(self):
print(f"Scraping stats: {self.stats}")
print(f"Cache hit rate: {self.cache.hit_rate}")
print(
f"CAPTCHA budget remaining: "
f"${self.captcha_budget.remaining:.2f}"
)
print(f"Current delay: {self.limiter.current_delay:.1f}s")
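Driving a scraper like this over a URL list is usually a bounded-concurrency loop. A self-contained sketch of the pattern; `fake_fetch` is a stub standing in for `scraper.scrape(url)`:

```python
import asyncio

async def crawl(urls, fetch, max_concurrent: int = 3) -> dict:
    """Fetch every URL with at most max_concurrent requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)
    results: dict = {}

    async def worker(url: str):
        async with sem:
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(u) for u in urls))
    return results

async def fake_fetch(url: str) -> str:
    """Stub standing in for scraper.scrape(url)."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

pages = asyncio.run(
    crawl([f"https://example.com/p/{i}" for i in range(5)], fake_fetch)
)
print(len(pages))  # 5
```

The per-domain semaphores and delays inside the scraper still apply; this outer bound just caps total concurrency so one run can't fan out indefinitely.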
Quick Checklist
Before running your scraper in production:
- [ ] robots.txt — Are you checking and respecting it?
- [ ] Rate limiting — Are you waiting between requests?
- [ ] User-Agent — Does it identify your bot and provide contact info?
- [ ] Caching — Are you avoiding re-scraping unchanged pages?
- [ ] Error handling — Do you back off on 429/5xx responses?
- [ ] CAPTCHA budget — Have you set a daily spending limit?
- [ ] Data storage — Are you only keeping data you actually need?
- [ ] Terms of Service — Have you read the site's ToS?
Key Takeaways
- robots.txt is your first check — respect it unless you have a documented reason not to
- Adaptive rate limiting is better than fixed delays — respond to server signals
- Cache aggressively — don't re-scrape what hasn't changed
- Budget your CAPTCHA solves — set daily limits and stick to them
- Identify yourself — a clear User-Agent helps everyone
- Slow is reliable — a scraper that runs for a week at 1 req/s beats one that gets banned in an hour
For handling CAPTCHAs within your budget, check out passxapi-python — at $0.001/solve, even a $5/day budget gives you 5,000 solves.
What's your approach to responsible scraping? Share your practices in the comments.