
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Web Scraping Framework: Web Scraping Guide


A practical guide to ethical, effective, and legally responsible web scraping.


1. Ethical Scraping Principles

Respect the Website

  • Check robots.txt before scraping. It's the site's stated policy.
  • Rate limit your requests. 1-2 requests per second is a reasonable default.
  • Identify yourself with a descriptive User-Agent if possible.
  • Don't scrape personal data, copyrighted content, or data behind authentication without permission.
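
The robots.txt check above can be done with Python's standard library alone. A minimal sketch using `urllib.robotparser` — the robots body, user agent, and URLs here are illustrative:

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a fetched robots.txt body against a URL for a given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots = """User-agent: *
Disallow: /private/
"""
can_fetch(robots, "MyScraper/1.0", "https://example.com/private/page")  # False
can_fetch(robots, "MyScraper/1.0", "https://example.com/public/page")   # True
```

In practice you would fetch `https://<domain>/robots.txt` once per domain and cache the parsed result.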

The Politeness Hierarchy

  1. Use an official API if one exists — always prefer structured data sources.
  2. Check for open datasets (data.gov, Kaggle, etc.) before scraping.
  3. Contact the website and ask for data access.
  4. Scrape responsibly as a last resort.

2. Rate Limiting Strategies

Token Bucket

The token bucket algorithm allows bursts while maintaining an average rate:

rate_limiter = RateLimiter(requests_per_second=2.0)
await rate_limiter.acquire()  # Blocks until a token is available
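
`RateLimiter` above is not a real library class — a minimal sketch of what it might look like, using `asyncio` and a monotonic clock:

```python
import asyncio
import time

class RateLimiter:
    """Token bucket: allows short bursts up to `burst` requests while
    averaging `requests_per_second` over time."""

    def __init__(self, requests_per_second: float, burst: int = 5):
        self.rate = requests_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # Refill tokens for elapsed time, capped at bucket capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

The `burst` parameter is the key difference from a fixed delay: a small burst is allowed immediately, then the average rate takes over.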

Per-Domain Delays

Enforce minimum delays between requests to the same domain to avoid overloading a single server:

scheduler = Scheduler(domain_delay=1.0)  # 1 second between same-domain requests
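
As with `RateLimiter`, the `Scheduler` here is a stand-in. A minimal per-domain delay tracker might look like:

```python
import asyncio
import time
from urllib.parse import urlparse

class Scheduler:
    """Records the last request time per domain and sleeps so that
    same-domain requests are at least `domain_delay` seconds apart."""

    def __init__(self, domain_delay: float = 1.0):
        self.domain_delay = domain_delay
        self.last_request: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        last = self.last_request.get(domain)
        if last is not None:
            remaining = self.domain_delay - (time.monotonic() - last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self.last_request[domain] = time.monotonic()
```

Requests to different domains proceed without waiting on each other, which is what makes per-domain delays compatible with high overall throughput.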

Adaptive Rate Limiting

Monitor response codes and adjust dynamically:

  • 429 Too Many Requests: Back off exponentially.
  • 503 Service Unavailable: Pause, then retry after the interval given in the Retry-After header.
  • 200 OK but slow responses: The server may be under load; reduce your request rate.
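
A sketch of that adaptive loop, assuming `fetch` is any coroutine returning an object with `.status` and `.headers` (an aiohttp response fits that shape):

```python
import asyncio
import random

async def fetch_with_backoff(fetch, url, max_retries=5):
    """Retry on 429/503, honouring Retry-After when present and otherwise
    backing off exponentially with jitter."""
    delay = 1.0
    for _ in range(max_retries):
        response = await fetch(url)
        if response.status in (429, 503):
            retry_after = response.headers.get("Retry-After")
            # Prefer the server's own hint; fall back to exponential backoff.
            wait = float(retry_after) if retry_after else delay + random.uniform(0, delay)
            await asyncio.sleep(wait)
            delay *= 2
            continue
        return response
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```

Note that Retry-After may also be an HTTP date rather than a number of seconds; handling that case is left out of this sketch.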

3. Anti-Bot Detection & Bypass

Common Detection Signals

Signal            What It Checks
------            --------------
User-Agent        Known bot strings or missing headers
Request Rate      Unnaturally consistent timing
JavaScript        Whether the client executes JS
TLS Fingerprint   Bot-like TLS client hello patterns
Cookie Handling   Whether cookies are stored/sent
Mouse/Keyboard    Browser automation detection

Mitigation Strategies

  • Rotate user agents across a pool of real browser strings.
  • Add random delays (jitter) between requests — avoid metronomic timing.
  • Use headless browsers (Playwright) for JS-heavy sites.
  • Rotate IPs via proxy services if legitimate and necessary.
  • Handle cookies properly — maintain sessions as a real browser would.
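
The first two mitigations can be sketched in a few lines; the User-Agent strings below are illustrative examples, and in practice the pool should be kept current with real browser releases:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def request_headers() -> dict:
    """Pick a random User-Agent from the pool for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def jittered_delay(base: float = 1.0, spread: float = 0.5) -> float:
    """A delay spread randomly around `base`, to avoid metronomic timing."""
    return base + random.uniform(-spread, spread)
```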

When to Stop

If a site actively blocks you despite reasonable measures, respect their wishes. Aggressive anti-bot bypass may violate the CFAA or equivalent laws.


4. Legal Considerations

Disclaimer: This is general information, not legal advice. Consult a lawyer for your specific jurisdiction.

Key Legal Frameworks

  • CFAA (US): Unauthorized access to computer systems. The hiQ v. LinkedIn ruling clarified that scraping public data may not violate CFAA.
  • GDPR (EU): Scraping personal data of EU residents requires a lawful basis.
  • CCPA (California): Similar to GDPR for California residents.
  • Copyright: Scraping copyrighted content for redistribution may infringe.

Safe Practices

  • Only scrape publicly accessible pages.
  • Respect robots.txt and Terms of Service.
  • Don't circumvent access controls or authentication.
  • Don't scrape personal data without a legitimate purpose.
  • Store only the minimum data needed.
  • Delete data when it's no longer needed.

5. Proxy Management

Types of Proxies

Type          Use Case                      Cost
----          --------                      ----
Datacenter    High volume, low cost         $
Residential   Harder to detect              $$$
Mobile        Most trusted IPs              $$$$
SOCKS5        Protocol-agnostic tunneling   $$

Rotation Strategies

  • Round-robin: Simple sequential rotation through the pool.
  • Weighted: Assign more traffic to better-performing proxies.
  • Sticky sessions: Same proxy per domain for session consistency.
  • Failure-based: Remove proxies that fail consistently.
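
A sketch combining two of these strategies — round-robin selection plus failure-based eviction. The class and its method names are illustrative, not from any particular library:

```python
class ProxyPool:
    """Round-robin rotation; a proxy that fails `max_failures` times
    in a row is dropped from the pool."""

    def __init__(self, proxies: list[str], max_failures: int = 3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._order = list(proxies)
        self._index = 0

    def next_proxy(self) -> str:
        if not self._order:
            raise RuntimeError("proxy pool exhausted")
        proxy = self._order[self._index % len(self._order)]
        self._index += 1
        return proxy

    def mark_failure(self, proxy: str) -> None:
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and proxy in self._order:
            self._order.remove(proxy)

    def mark_success(self, proxy: str) -> None:
        # A success resets the consecutive-failure count.
        self.failures[proxy] = 0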

6. Data Quality

Validation Pipeline

Always validate scraped data before storage:

  1. Clean whitespace — normalize line breaks, tabs, extra spaces.
  2. Strip HTML — remove any residual markup in text fields.
  3. Validate required fields — drop records missing critical data.
  4. Deduplicate — content-hash based deduplication.
  5. Type coercion — parse prices, dates, numbers into correct types.

Monitoring

  • Track success rate per domain and over time.
  • Alert on schema changes (new fields, removed fields, type changes).
  • Log error distributions to catch site changes early.

7. Architecture Patterns

Single-Site Scraper

For one-off or focused scraping tasks:

HttpClient → Parser → Pipeline → Storage

Multi-Site Crawl

For broad crawling across domains:

Scheduler → [Workers] → Middleware → Parser → Pipeline → Storage
     ↑                                             │
     └─────────── discovered URLs ─────────────────┘

Hybrid (HTTP + Browser)

Use HTTP for most pages, fall back to Playwright for JS-rendered content:

try:
    html = await http_client.get(url)
except NeedsJavaScript:
    html = await browser.get_page_content(url)
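
One way to make that fallback concrete is a heuristic check on the HTTP response. `http_client` and `browser` here are stand-ins for any objects exposing async `get` / `get_page_content` methods, and the marker strings and length cutoff are crude heuristics to tune per site:

```python
import asyncio

async def get_html(url, http_client, browser,
                   markers=("enable JavaScript", "<noscript")):
    """Fetch via plain HTTP first; fall back to a browser when the
    response looks like an empty JS shell."""
    html = await http_client.get(url)
    if len(html) < 500 or any(m in html for m in markers):
        html = await browser.get_page_content(url)
    return html
```

The HTTP path is typically an order of magnitude cheaper than a browser render, so even a fallback rate of 10–20% keeps the crawl fast.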

8. Performance Tips

  • Use async I/O (aiohttp) — don't block on network requests.
  • Batch storage writes — buffer records and flush periodically.
  • Reuse HTTP sessions — connection pooling reduces overhead.
  • Parse selectively — don't parse the full DOM if you only need one element.
  • Cache responses — avoid re-fetching pages during development.
  • Profile your pipeline — parsing is often the bottleneck, not networking.
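
The batching tip can be sketched as a small buffer; `write_batch` is a placeholder for any bulk-write callable (a bulk INSERT, a file append, an API call):

```python
class BufferedWriter:
    """Buffers records and flushes in batches to cut per-write overhead."""

    def __init__(self, write_batch, batch_size: int = 100):
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.buffer: list = []

    def add(self, record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []
```

Remember to call `flush()` on shutdown (or in a `finally:` block) so the last partial batch is not lost.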

This is 1 of 14 resources in the Python Developer Pro toolkit. Get the complete [Web Scraping Framework] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Python Developer Pro bundle (14 products) for $159 — save 30%.

Get the Complete Bundle →

