
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Web Scraping Framework: Web Scraping Guide


A practical guide to ethical, effective, and legally responsible web scraping.


1. Ethical Scraping Principles

Respect the Website

  • Check robots.txt before scraping. It's the site's stated policy.
  • Rate limit your requests. 1-2 requests per second is a reasonable default.
  • Identify yourself with a descriptive User-Agent if possible.
  • Don't scrape personal data, copyrighted content, or data behind authentication without permission.
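
The robots.txt check above can be done with Python's standard library alone. A minimal sketch using `urllib.robotparser` — the robots body, user agent, and URLs here are illustrative:

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a fetched robots.txt body against a URL for a given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots = """User-agent: *
Disallow: /private/
"""
can_fetch(robots, "MyScraper/1.0", "https://example.com/private/page")  # False
can_fetch(robots, "MyScraper/1.0", "https://example.com/public/page")   # True
```

In practice you would fetch `https://<domain>/robots.txt` once per domain and cache the parsed result.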

The Politeness Hierarchy

  1. Use an official API if one exists — always prefer structured data sources.
  2. Check for open datasets (data.gov, Kaggle, etc.) before scraping.
  3. Contact the website and ask for data access.
  4. Scrape responsibly as a last resort.

2. Rate Limiting Strategies

Token Bucket

The token bucket algorithm allows bursts while maintaining an average rate:

rate_limiter = RateLimiter(requests_per_second=2.0)
await rate_limiter.acquire()  # Blocks until a token is available
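
`RateLimiter` above is not a real library class — a minimal sketch of what it might look like, using `asyncio` and a monotonic clock:

```python
import asyncio
import time

class RateLimiter:
    """Token bucket: allows short bursts up to `burst` requests while
    averaging `requests_per_second` over time."""

    def __init__(self, requests_per_second: float, burst: int = 5):
        self.rate = requests_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill tokens for elapsed time, capped at bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

The `burst` parameter is the key difference from a fixed delay: a small burst is allowed immediately, then the average rate takes over.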

Per-Domain Delays

Enforce minimum delays between requests to the same domain to avoid overloading a single server:

scheduler = Scheduler(domain_delay=1.0)  # 1 second between same-domain requests
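
As with `RateLimiter`, the `Scheduler` here is a stand-in. A minimal per-domain delay tracker might look like:

```python
import asyncio
import time
from urllib.parse import urlparse

class Scheduler:
    """Records the last request time per domain and sleeps so that
    same-domain requests are at least `domain_delay` seconds apart."""

    def __init__(self, domain_delay: float = 1.0):
        self.domain_delay = domain_delay
        self.last_request: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        last = self.last_request.get(domain)
        if last is not None:
            remaining = self.domain_delay - (time.monotonic() - last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self.last_request[domain] = time.monotonic()
```

Requests to different domains proceed without waiting on each other, which is what makes per-domain delays compatible with high overall throughput.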

Adaptive Rate Limiting

Monitor response codes and adjust dynamically:

  • 429 Too Many Requests: Back off exponentially.
  • 503 Service Unavailable: Pause, then retry after the interval given in the Retry-After header.
  • 200 OK but slow responses: The server may be under load; reduce your request rate.
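
A sketch of that adaptive loop, assuming `fetch` is any coroutine returning an object with `.status` and `.headers` (an aiohttp response fits that shape):

```python
import asyncio
import random

async def fetch_with_backoff(fetch, url, max_retries=5):
    """Retry on 429/503, honouring Retry-After when present and otherwise
    backing off exponentially with jitter."""
    delay = 1.0
    for _ in range(max_retries):
        response = await fetch(url)
        if response.status in (429, 503):
            retry_after = response.headers.get("Retry-After")
            # Prefer the server's own hint; fall back to exponential backoff.
            wait = float(retry_after) if retry_after else delay + random.uniform(0, delay)
            await asyncio.sleep(wait)
            delay *= 2
            continue
        return response
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```

Note that Retry-After may also be an HTTP date rather than a number of seconds; handling that case is left out of this sketch.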

3. Anti-Bot Detection & Bypass

Common Detection Signals

Signal            What It Checks
------            --------------
User-Agent        Known bot strings or missing headers
Request Rate      Unnaturally consistent timing
JavaScript        Whether the client executes JS
TLS Fingerprint   Bot-like TLS client hello patterns
Cookie Handling   Whether cookies are stored/sent
Mouse/Keyboard    Browser automation detection

Mitigation Strategies

  • Rotate user agents across a pool of real browser strings.
  • Add random delays (jitter) between requests — avoid metronomic timing.
  • Use headless browsers (Playwright) for JS-heavy sites.
  • Rotate IPs via proxy services if legitimate and necessary.
  • Handle cookies properly — maintain sessions as a real browser would.
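
The first two mitigations can be sketched in a few lines; the User-Agent strings below are illustrative examples, and in practice the pool should be kept current with real browser releases:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def request_headers() -> dict:
    """Pick a random User-Agent from the pool for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def jittered_delay(base: float = 1.0, spread: float = 0.5) -> float:
    """A delay spread randomly around `base`, to avoid metronomic timing."""
    return base + random.uniform(-spread, spread)
```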

When to Stop

If a site actively blocks you despite reasonable measures, respect their wishes. Aggressive anti-bot bypass may violate the CFAA or equivalent laws.


4. Legal Considerations

Disclaimer: This is general information, not legal advice. Consult a lawyer for your specific jurisdiction.

Key Legal Frameworks

  • CFAA (US): Unauthorized access to computer systems. The hiQ v. LinkedIn ruling clarified that scraping public data may not violate CFAA.
  • GDPR (EU): Scraping personal data of EU residents requires a lawful basis.
  • CCPA (California): Similar to GDPR for California residents.
  • Copyright: Scraping copyrighted content for redistribution may infringe.

Safe Practices

  • Only scrape publicly accessible pages.
  • Respect robots.txt and Terms of Service.
  • Don't circumvent access controls or authentication.
  • Don't scrape personal data without a legitimate purpose.
  • Store only the minimum data needed.
  • Delete data when it's no longer needed.

5. Proxy Management

Types of Proxies

Type          Use Case                      Cost
----          --------                      ----
Datacenter    High volume, low cost         $
Residential   Harder to detect              $$$
Mobile        Most trusted IPs              $$$$
SOCKS5        Protocol-agnostic tunneling   $$

Rotation Strategies

  • Round-robin: Simple sequential rotation through the pool.
  • Weighted: Assign more traffic to better-performing proxies.
  • Sticky sessions: Same proxy per domain for session consistency.
  • Failure-based: Remove proxies that fail consistently.
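
A sketch combining two of these strategies — round-robin selection plus failure-based eviction. The class and its method names are illustrative, not from any particular library:

```python
class ProxyPool:
    """Round-robin rotation; a proxy that fails `max_failures` times
    in a row is dropped from the pool."""

    def __init__(self, proxies: list[str], max_failures: int = 3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._order = list(proxies)
        self._index = 0

    def next_proxy(self) -> str:
        if not self._order:
            raise RuntimeError("proxy pool exhausted")
        proxy = self._order[self._index % len(self._order)]
        self._index += 1
        return proxy

    def mark_failure(self, proxy: str) -> None:
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and proxy in self._order:
            self._order.remove(proxy)

    def mark_success(self, proxy: str) -> None:
        # A success resets the consecutive-failure count.
        self.failures[proxy] = 0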

6. Data Quality

Validation Pipeline

Always validate scraped data before storage:

  1. Clean whitespace — normalize line breaks, tabs, extra spaces.
  2. Strip HTML — remove any residual markup in text fields.
  3. Validate required fields — drop records missing critical data.
  4. Deduplicate — content-hash based deduplication.
  5. Type coercion — parse prices, dates, numbers into correct types.

Monitoring

  • Track success rate per domain and over time.
  • Alert on schema changes (new fields, removed fields, type changes).
  • Log error distributions to catch site changes early.

7. Architecture Patterns

Single-Site Scraper

For one-off or focused scraping tasks:

HttpClient → Parser → Pipeline → Storage

Multi-Site Crawl

For broad crawling across domains:

Scheduler → [Workers] → Middleware → Parser → Pipeline → Storage
     ↑                                             │
     └─────────── discovered URLs ─────────────────┘

Hybrid (HTTP + Browser)

Use HTTP for most pages, fall back to Playwright for JS-rendered content:

try:
    html = await http_client.get(url)
except NeedsJavaScript:
    html = await browser.get_page_content(url)
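
One way to make that fallback concrete is a heuristic check on the HTTP response. `http_client` and `browser` here are stand-ins for any objects exposing async `get` / `get_page_content` methods, and the marker strings and length cutoff are crude heuristics to tune per site:

```python
import asyncio

async def get_html(url, http_client, browser,
                   markers=("enable JavaScript", "<noscript")):
    """Fetch via plain HTTP first; fall back to a browser when the
    response looks like an empty JS shell."""
    html = await http_client.get(url)
    if len(html) < 500 or any(m in html for m in markers):
        html = await browser.get_page_content(url)
    return html
```

The HTTP path is typically an order of magnitude cheaper than a browser render, so even a fallback rate of 10–20% keeps the crawl fast.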

8. Performance Tips

  • Use async I/O (aiohttp) — don't block on network requests.
  • Batch storage writes — buffer records and flush periodically.
  • Reuse HTTP sessions — connection pooling reduces overhead.
  • Parse selectively — don't parse the full DOM if you only need one element.
  • Cache responses — avoid re-fetching pages during development.
  • Profile your pipeline — parsing is often the bottleneck, not networking.
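
The batching tip can be sketched as a small buffer; `write_batch` is a placeholder for any bulk-write callable (a bulk INSERT, a file append, an API call):

```python
class BufferedWriter:
    """Buffers records and flushes in batches to cut per-write overhead."""

    def __init__(self, write_batch, batch_size: int = 100):
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.buffer: list = []

    def add(self, record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []
```

Remember to call `flush()` on shutdown (or in a `finally:` block) so the last partial batch is not lost.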

This is 1 of 14 resources in the Python Developer Pro toolkit. Get the complete [Web Scraping Framework] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Python Developer Pro bundle (14 products) for $159 — save 30%.

Get the Complete Bundle →

