Web Scraping Guide
A practical guide to ethical, effective, and legally responsible web scraping.
1. Ethical Scraping Principles
Respect the Website
- Check robots.txt before scraping. It's the site's stated policy.
- Rate limit your requests. 1-2 requests per second is a reasonable default.
- Identify yourself with a descriptive User-Agent if possible.
- Don't scrape personal data, copyrighted content, or data behind authentication without permission.
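As a minimal sketch of the robots.txt check, the standard library's `urllib.robotparser` can parse a policy and answer per-URL questions (the `MyScraper/1.0` agent string and the helper name are placeholders; in production you would call `rp.read()` to fetch the live file):

```python
from urllib.robotparser import RobotFileParser

def build_robots_parser(robots_txt: str, robots_url: str) -> RobotFileParser:
    """Parse a robots.txt body. In production, rp.read() fetches it from robots_url."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.parse(robots_txt.splitlines())
    return rp

rp = build_robots_parser(
    "User-agent: *\nDisallow: /private/\n",
    "https://example.com/robots.txt",
)
rp.can_fetch("MyScraper/1.0", "https://example.com/private/page")  # disallowed
```

Call `can_fetch()` before every request, and cache the parsed policy per domain so you are not re-fetching robots.txt constantly.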
The Politeness Hierarchy
1. Use an official API if one exists — always prefer structured data sources.
2. Check for open datasets (data.gov, Kaggle, etc.) before scraping.
3. Contact the website and ask for data access.
4. Scrape responsibly as a last resort.
2. Rate Limiting Strategies
Token Bucket
The token bucket algorithm allows bursts while maintaining an average rate:
rate_limiter = RateLimiter(requests_per_second=2.0)
await rate_limiter.acquire() # Blocks until a token is available
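A minimal asyncio implementation of the `RateLimiter` sketched above could look like this (the `capacity` parameter and refill bookkeeping are assumptions for illustration, not a specific library's API):

```python
import asyncio
import time

class RateLimiter:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, requests_per_second: float, capacity: int = 5):
        self.rate = requests_per_second
        self.capacity = capacity
        self.tokens = float(capacity)      # start with a full bucket
        self.last = time.monotonic()

    async def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for one token to accumulate.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

Because refills accrue continuously, a scraper that was idle for a while can burst through its backlog before settling back to the average rate.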
Per-Domain Delays
Enforce minimum delays between requests to the same domain to avoid overloading a single server:
scheduler = Scheduler(domain_delay=1.0) # 1 second between same-domain requests
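One way to implement the per-domain delay above is a small asyncio helper keyed on hostname (the class and method names here are illustrative, not from any particular framework):

```python
import asyncio
import time
from urllib.parse import urlparse

class DomainScheduler:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, domain_delay: float = 1.0):
        self.delay = domain_delay
        self._next_ok: dict[str, float] = {}  # domain -> earliest next-request time

    async def wait(self, url: str) -> None:
        """Sleep if needed so same-domain requests stay `domain_delay` apart."""
        domain = urlparse(url).netloc
        now = time.monotonic()
        ready = self._next_ok.get(domain, now)
        if ready > now:
            await asyncio.sleep(ready - now)
        self._next_ok[domain] = max(ready, now) + self.delay
```

Requests to different domains proceed without waiting on each other, so a multi-domain crawl keeps its throughput while each individual server sees a gentle pace.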
Adaptive Rate Limiting
Monitor response codes and adjust dynamically:
- 429 Too Many Requests: Back off exponentially.
- 503 Service Unavailable: Pause and retry after the Retry-After header.
- 200 OK with slow response: Reduce request rate.
3. Anti-Bot Detection & Bypass
Common Detection Signals
| Signal | What It Checks |
|---|---|
| User-Agent | Known bot strings or missing headers |
| Request Rate | Unnaturally consistent timing |
| JavaScript | Whether the client executes JS |
| TLS Fingerprint | Bot-like TLS client hello patterns |
| Cookie Handling | Whether cookies are stored/sent |
| Mouse/Keyboard | Browser automation detection |
Mitigation Strategies
- Rotate user agents across a pool of real browser strings.
- Add random delays (jitter) between requests — avoid metronomic timing.
- Use headless browsers (Playwright) for JS-heavy sites.
- Rotate IPs via proxy services if legitimate and necessary.
- Handle cookies properly — maintain sessions as a real browser would.
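Two of the strategies above, user-agent rotation and jitter, fit in a few lines. The user-agent strings below are sample placeholders; in practice, keep the pool updated with current real browser strings:

```python
import random

# Illustrative pool; refresh these with current real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def next_headers() -> dict:
    """Pick a random user agent from the pool for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def jittered_delay(base: float = 1.0, jitter: float = 0.5) -> float:
    """Base delay plus uniform random jitter, so timing is never metronomic."""
    return base + random.uniform(0, jitter)
```

Sleeping for `jittered_delay()` between requests avoids the unnaturally consistent timing listed in the detection table.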
When to Stop
If a site actively blocks you despite reasonable measures, respect their wishes. Aggressive anti-bot bypass may violate the CFAA or equivalent laws.
4. Legal Considerations
Disclaimer: This is general information, not legal advice. Consult a lawyer for your specific jurisdiction.
Key Legal Frameworks
- CFAA (US): Prohibits unauthorized access to computer systems. The hiQ v. LinkedIn litigation suggested that scraping publicly accessible data likely does not violate the CFAA, though the case law is still evolving.
- GDPR (EU): Scraping personal data of EU residents requires a lawful basis.
- CCPA (California): Similar to GDPR for California residents.
- Copyright: Scraping copyrighted content for redistribution may infringe.
Safe Practices
- Only scrape publicly accessible pages.
- Respect robots.txt and Terms of Service.
- Don't circumvent access controls or authentication.
- Don't scrape personal data without a legitimate purpose.
- Store only the minimum data needed.
- Delete data when it's no longer needed.
5. Proxy Management
Types of Proxies
| Type | Use Case | Cost |
|---|---|---|
| Datacenter | High volume, low cost | $ |
| Residential | Harder to detect | $$$ |
| Mobile | Most trusted IPs | $$$$ |
| SOCKS5 | Protocol-agnostic tunneling | $$ |
Rotation Strategies
- Round-robin: Simple sequential rotation through the pool.
- Weighted: Assign more traffic to better-performing proxies.
- Sticky sessions: Same proxy per domain for session consistency.
- Failure-based: Remove proxies that fail consistently.
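Round-robin rotation and failure-based eviction combine naturally into one pool class. This is a sketch (the class name, eviction threshold, and proxy URLs are assumptions for illustration):

```python
class ProxyPool:
    """Round-robin proxy rotation with failure-based eviction."""

    def __init__(self, proxies: list[str], max_failures: int = 3):
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self.failures: dict[str, int] = {}
        self._i = 0

    def next(self) -> str:
        """Return the next proxy in sequential rotation."""
        if not self.proxies:
            raise RuntimeError("proxy pool exhausted")
        proxy = self.proxies[self._i % len(self.proxies)]
        self._i += 1
        return proxy

    def report_failure(self, proxy: str) -> None:
        """Evict a proxy once it fails `max_failures` times."""
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
        if self.failures[proxy] >= self.max_failures and proxy in self.proxies:
            self.proxies.remove(proxy)
```

Weighted and sticky-session strategies can be layered on the same structure by replacing the selection logic in `next()`.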
6. Data Quality
Validation Pipeline
Always validate scraped data before storage:
- Clean whitespace — normalize line breaks, tabs, extra spaces.
- Strip HTML — remove any residual markup in text fields.
- Validate required fields — drop records missing critical data.
- Deduplicate — content-hash based deduplication.
- Type coercion — parse prices, dates, numbers into correct types.
Monitoring
- Track success rate per domain and over time.
- Alert on schema changes (new fields, removed fields, type changes).
- Log error distributions to catch site changes early.
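Per-domain success tracking, the first monitoring point above, needs only a pair of counters (class and method names are illustrative):

```python
from collections import defaultdict

class DomainStats:
    """Track per-domain success rate to catch site changes early."""

    def __init__(self):
        self.ok = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, domain: str, success: bool) -> None:
        self.total[domain] += 1
        if success:
            self.ok[domain] += 1

    def success_rate(self, domain: str) -> float:
        """Fraction of successful requests; 1.0 for domains with no data yet."""
        if not self.total[domain]:
            return 1.0
        return self.ok[domain] / self.total[domain]
```

Alerting is then a threshold check, e.g. flag any domain whose rate drops below 0.9 over a rolling window.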
7. Architecture Patterns
Single-Site Scraper
For one-off or focused scraping tasks:
HttpClient → Parser → Pipeline → Storage
Multi-Site Crawl
For broad crawling across domains:
Scheduler → [Workers] → Middleware → Parser → Pipeline → Storage
    ↑                                   │
    └──────── discovered URLs ──────────┘
Hybrid (HTTP + Browser)
Use HTTP for most pages, fall back to Playwright for JS-rendered content:
try:
    html = await http_client.get(url)           # fast plain-HTTP path
except NeedsJavaScript:
    html = await browser.get_page_content(url)  # Playwright fallback
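Deciding when to raise that fallback is the tricky part. One heuristic, sketched here with assumed marker strings and an arbitrary word-count threshold, is to treat near-empty HTML shells or pages carrying known SPA payload markers as JS-rendered:

```python
import re

def needs_javascript(html: str) -> bool:
    """Heuristic: pages that ship an empty shell usually need a real browser.

    The marker strings and the 20-word threshold are illustrative guesses;
    tune them against the sites you actually scrape.
    """
    markers = ("__NEXT_DATA__", "window.__INITIAL_STATE__")
    visible_text = re.sub(r"<[^>]+>", " ", html)
    return any(m in html for m in markers) or len(visible_text.split()) < 20
```

The HTTP client can call this on each response and raise the `NeedsJavaScript` signal when it returns True, keeping the expensive browser path off the hot loop.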
8. Performance Tips
- Use async I/O (aiohttp) — don't block on network requests.
- Batch storage writes — buffer records and flush periodically.
- Reuse HTTP sessions — connection pooling reduces overhead.
- Parse selectively — don't parse the full DOM if you only need one element.
- Cache responses — avoid re-fetching pages during development.
- Profile your pipeline — parsing is often the bottleneck, not networking.
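The batched-writes tip above can be sketched as a small buffer that flushes through any callable sink (the class name and default batch size are illustrative):

```python
class BatchWriter:
    """Buffer records and flush in batches to cut per-write overhead."""

    def __init__(self, flush_fn, batch_size: int = 100):
        self.flush_fn = flush_fn        # e.g. a bulk INSERT or file append
        self.batch_size = batch_size
        self.buffer = []

    def add(self, record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write any buffered records; call once more at shutdown."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

Remember the final `flush()` on shutdown, or the last partial batch is silently lost.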