I Tried Scraping 1M Pages in 24 Hours — Here’s What Actually Broke

I didn’t expect parsing to be the problem.

Or JavaScript rendering.
Or even rate limits.

What actually broke first was… everything around the scraper.

The goal

  • Target: ~1,000,000 pages
  • Time: 24 hours
  • Stack: Python + async requests
  • Setup: distributed across multiple workers
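For context, each worker was basically an asyncio loop firing requests as fast as the semaphore allowed. A minimal sketch of that baseline (assuming aiohttp for the "async requests" part; the concurrency number is illustrative, not our real config):

```python
import asyncio
import aiohttp

CONCURRENCY = 200  # illustrative per-worker limit, not the real value

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str):
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            return url, resp.status, await resp.text()

async def crawl(urls: list[str]):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# each worker runs something like: asyncio.run(crawl(url_batch))
```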

Sounds straightforward, right?

It wasn’t.

Problem #1: Throughput collapsed after ~50K requests

At the beginning, everything looked healthy:

  • low latency
  • stable success rate
  • fast throughput

Then suddenly:

  • response times doubled
  • success rate dropped
  • retries started stacking

No code changes. No deploys.

Just… degradation.

What caused it?

Not rate limits.

IP-level throttling.

Instead of blocking requests outright, the target site started:

  • slowing down responses
  • returning partial data
  • occasionally serving fallback pages

No errors. Just worse performance.

Problem #2: Data inconsistency across workers

Different workers started returning:

  • different product prices
  • different rankings
  • sometimes missing fields

Same endpoint. Same parser.

Root cause?

Requests were coming from:

  • different IP regions
  • mixed IP reputations

Which triggered:

  • geo-based content variation
  • bot-detection fallback responses

At scale, this turns your dataset into a patchwork of realities.

Problem #3: Retry logic made things worse

Our retry strategy was simple:

retry on failure (timeout / non-200)
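In code, that rule boiled down to something like this (a simplified sketch, not the exact worker code):

```python
import asyncio
import aiohttp

# Simplified sketch of the original rule: any timeout or non-200 gets
# retried immediately, with no cap and no change of IP route.
async def fetch_with_naive_retry(session: aiohttp.ClientSession, url: str) -> str:
    while True:
        try:
            async with session.get(url) as resp:
                if resp.status == 200:
                    return await resp.text()  # counted as success, even if degraded
        except (asyncio.TimeoutError, aiohttp.ClientError):
            pass  # swallow the error and immediately try again from the same pool
```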

But here’s the issue:

  • many “successful” responses were actually degraded
  • retries reused similar IP patterns
  • traffic looked even more suspicious over time

Result:

higher load → worse data → more retries → even worse data

A perfect negative loop.

What actually worked (after multiple iterations)

1. Treat IP rotation as part of system design

Not as a patch.

We moved to:

  • per-request IP rotation
  • region-aware routing
  • controlled session reuse (only when needed)
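In practice, the rotation layer is just a thin wrapper around the request call. A sketch of the idea (the proxy URLs are placeholders; any pool or provider endpoint plugs in the same way):

```python
import itertools
import aiohttp

# Placeholder proxy endpoints -- in practice these come from the pool/provider.
PROXIES = itertools.cycle([
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
])

async def fetch_rotated(session: aiohttp.ClientSession, url: str,
                        sticky_proxy: str | None = None) -> str:
    # Controlled session reuse: pin a proxy only when the flow needs it
    # (logins, pagination); otherwise rotate on every request.
    proxy = sticky_proxy or next(PROXIES)
    async with session.get(url, proxy=proxy) as resp:
        return await resp.text()
```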

This alone stabilized:

  • response time
  • success rate
  • data consistency

2. Align IP geography with target data

Instead of random distribution:

  • US pages → US IPs
  • EU pages → EU IPs
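The routing itself can be as simple as a lookup table keyed by target domain. A sketch (domains and pool names are made up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical mapping from target domain to a proxy pool whose exit IPs
# sit in the same region as the content we want to see.
REGION_POOLS = {
    "shop.example.com": "us-pool",
    "shop.example.de": "eu-pool",
    "shop.example.fr": "eu-pool",
}

def pool_for(url: str) -> str:
    host = urlparse(url).hostname or ""
    return REGION_POOLS.get(host, "default-pool")
```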

This reduced:

  • content mismatch
  • localization errors
  • inconsistent datasets

3. Add “data validation”, not just “request validation”

We stopped trusting 200 OK.

We added checks like:

  • required fields present
  • price within expected range
  • layout consistency

If data failed validation → treated as failure → retried differently
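The checks themselves were plain heuristics. Something along these lines (field names and price bounds are illustrative, not our real schema):

```python
# Illustrative record-level checks; field names and bounds are examples.
# Assumes the record has already been parsed (price is a number).
REQUIRED_FIELDS = ("title", "price", "url")

def looks_valid(record: dict) -> bool:
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False  # missing fields usually mean a fallback page
    if not (0.5 <= record["price"] <= 10_000):
        return False  # price outside the expected range
    return True       # layout checks (e.g. expected selectors present) slot in here too
```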

4. Reduce retry aggression

Instead of:

  • immediate retries

We switched to:

  • delayed retries
  • different IP pools
  • capped retry counts

This prevented feedback loops.
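Putting those three changes together, the calmer retry path looked roughly like this (the cap, delays, and helper signatures are illustrative; `fetch` and `validate` stand in for whatever the worker actually uses):

```python
import asyncio
import random

MAX_RETRIES = 3  # illustrative cap

async def fetch_with_backoff(fetch, validate, url, pools):
    # fetch(url, pool) and validate(record) are the worker's own helpers;
    # pools is a list of proxy pools to alternate between on each attempt.
    for attempt in range(MAX_RETRIES + 1):
        pool = pools[attempt % len(pools)]        # different IP pool each attempt
        record = await fetch(url, pool)
        if record is not None and validate(record):
            return record
        # delayed retry: exponential backoff plus jitter instead of hammering
        await asyncio.sleep((2 ** attempt) + random.random())
    return None  # capped -- give up instead of feeding the loop
```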

5. Use a more realistic IP layer

At this scale, IP quality became a bottleneck.

Datacenter IPs were fast — but:

  • easier to detect
  • more likely to get degraded responses

Switching to residential traffic improved:

  • consistency
  • success rate
  • data reliability

In our case, using a provider like Rapidproxy helped smooth out:

  • IP distribution
  • geographic targeting
  • long-running job stability

Not dramatically faster — but much more stable, which mattered more.

Final numbers (after fixes)

  • Success rate: +27%
  • Retry volume: -42%
  • Data consistency issues: significantly reduced
  • Total completion time: ~18% faster

Not because we optimized code.

Because we fixed the system around the code.

What I’d do differently from day one

If I had to do this again:

  • design IP strategy first
  • validate data, not just responses
  • assume degradation, not failure
  • monitor consistency, not just success rate

Final thought

At small scale, scraping is about code.

At large scale, scraping is about behavior.

And the systems that survive are the ones that look the least like bots.
