I Tried Scraping 1M Pages in 24 Hours — Here’s What Actually Broke

I didn’t expect parsing to be the problem.

Or JavaScript rendering.
Or even rate limits.

What actually broke first was… everything around the scraper.

The goal

  • Target: ~1,000,000 pages
  • Time: 24 hours
  • Stack: Python + async requests
  • Setup: distributed across multiple workers
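For context, each worker was basically an asyncio loop firing requests as fast as the semaphore allowed. A minimal sketch of that baseline (assuming aiohttp for the "async requests" part; the concurrency number is illustrative, not our real config):

```python
import asyncio
import aiohttp

CONCURRENCY = 200  # illustrative per-worker limit, not the real value

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str):
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            return url, resp.status, await resp.text()

async def crawl(urls: list[str]):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# each worker runs something like: asyncio.run(crawl(url_batch))
```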

Sounds straightforward, right?

It wasn’t.

Problem #1: Throughput collapsed after ~50K requests

At the beginning, everything looked healthy:

  • low latency
  • stable success rate
  • fast throughput

Then suddenly:

  • response times doubled
  • success rate dropped
  • retries started stacking

No code changes. No deploys.

Just… degradation.

What caused it?

Not rate limits.

IP-level throttling.

Instead of blocking requests outright, the target site started:

  • slowing down responses
  • returning partial data
  • occasionally serving fallback pages

No errors. Just worse performance.

Problem #2: Data inconsistency across workers

Different workers started returning:

  • different product prices
  • different rankings
  • sometimes missing fields

Same endpoint. Same parser.

Root cause?

Requests were coming from:

  • different IP regions
  • mixed IP reputations

Which triggered:

  • geo-based content variation
  • bot-detection fallback responses

At scale, this turns your dataset into a patchwork of realities.

Problem #3: Retry logic made things worse

Our retry strategy was simple:

retry on failure (timeout / non-200)
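In code, that rule boiled down to something like this (a simplified sketch, not the exact worker code):

```python
import asyncio
import aiohttp

# Simplified sketch of the original rule: any timeout or non-200 gets
# retried immediately, with no cap and no change of IP route.
async def fetch_with_naive_retry(session: aiohttp.ClientSession, url: str) -> str:
    while True:
        try:
            async with session.get(url) as resp:
                if resp.status == 200:
                    return await resp.text()  # counted as success, even if degraded
        except (asyncio.TimeoutError, aiohttp.ClientError):
            pass  # swallow the error and immediately try again from the same pool
```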

But here’s the issue:

  • many “successful” responses were actually degraded
  • retries reused similar IP patterns
  • traffic looked even more suspicious over time

Result:

higher load → worse data → more retries → even worse data

A perfect negative loop.

What actually worked (after multiple iterations)

1. Treat IP rotation as part of system design

Not as a patch.

We moved to:

  • per-request IP rotation
  • region-aware routing
  • controlled session reuse (only when needed)
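In practice, the rotation layer is just a thin wrapper around the request call. A sketch of the idea (the proxy URLs are placeholders; any pool or provider endpoint plugs in the same way):

```python
import itertools
import aiohttp

# Placeholder proxy endpoints -- in practice these come from the pool/provider.
PROXIES = itertools.cycle([
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
])

async def fetch_rotated(session: aiohttp.ClientSession, url: str,
                        sticky_proxy: str | None = None) -> str:
    # Controlled session reuse: pin a proxy only when the flow needs it
    # (logins, pagination); otherwise rotate on every request.
    proxy = sticky_proxy or next(PROXIES)
    async with session.get(url, proxy=proxy) as resp:
        return await resp.text()
```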

This alone stabilized:

  • response time
  • success rate
  • data consistency

2. Align IP geography with target data

Instead of random distribution:

  • US pages → US IPs
  • EU pages → EU IPs
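The routing itself can be as simple as a lookup table keyed by target domain. A sketch (domains and pool names are made up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical mapping from target domain to a proxy pool whose exit IPs
# sit in the same region as the content we want to see.
REGION_POOLS = {
    "shop.example.com": "us-pool",
    "shop.example.de": "eu-pool",
    "shop.example.fr": "eu-pool",
}

def pool_for(url: str) -> str:
    host = urlparse(url).hostname or ""
    return REGION_POOLS.get(host, "default-pool")
```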

This reduced:

  • content mismatch
  • localization errors
  • inconsistent datasets

3. Add “data validation”, not just “request validation”

We stopped trusting 200 OK.

We added checks like:

  • required fields present
  • price within expected range
  • layout consistency

If data failed validation → treated as failure → retried differently
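The checks themselves were plain heuristics. Something along these lines (field names and price bounds are illustrative, not our real schema):

```python
# Illustrative record-level checks; field names and bounds are examples.
# Assumes the record has already been parsed (price is a number).
REQUIRED_FIELDS = ("title", "price", "url")

def looks_valid(record: dict) -> bool:
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False  # missing fields usually mean a fallback page
    if not (0.5 <= record["price"] <= 10_000):
        return False  # price outside the expected range
    return True       # layout checks (e.g. expected selectors present) slot in here too
```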

4. Reduce retry aggression

Instead of:

  • immediate retries

We switched to:

  • delayed retries
  • different IP pools
  • capped retry counts

This prevented feedback loops.
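Putting those three changes together, the calmer retry path looked roughly like this (the cap, delays, and helper signatures are illustrative; `fetch` and `validate` stand in for whatever the worker actually uses):

```python
import asyncio
import random

MAX_RETRIES = 3  # illustrative cap

async def fetch_with_backoff(fetch, validate, url, pools):
    # fetch(url, pool) and validate(record) are the worker's own helpers;
    # pools is a list of proxy pools to alternate between on each attempt.
    for attempt in range(MAX_RETRIES + 1):
        pool = pools[attempt % len(pools)]        # different IP pool each attempt
        record = await fetch(url, pool)
        if record is not None and validate(record):
            return record
        # delayed retry: exponential backoff plus jitter instead of hammering
        await asyncio.sleep((2 ** attempt) + random.random())
    return None  # capped -- give up instead of feeding the loop
```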

5. Use a more realistic IP layer

At this scale, IP quality became a bottleneck.

Datacenter IPs were fast — but:

  • easier to detect
  • more likely to get degraded responses

Switching to residential traffic improved:

  • consistency
  • success rate
  • data reliability

In our case, using a provider like Rapidproxy helped smooth out:

  • IP distribution
  • geographic targeting
  • long-running job stability

Not dramatically faster — but much more stable, which mattered more.

Final numbers (after fixes)

  • Success rate: +27%
  • Retry volume: -42%
  • Data consistency issues: significantly reduced
  • Total completion time: ~18% faster

Not because we optimized code.

Because we fixed the system around the code.

What I’d do differently from day one

If I had to do this again:

  • design IP strategy first
  • validate data, not just responses
  • assume degradation, not failure
  • monitor consistency, not just success rate

Final thought

At small scale, scraping is about code.

At large scale, scraping is about behavior.

And the systems that survive are the ones that look the least like bots.
