What makes a data collection system reliable?
A reliable data collection system can handle failures, avoid detection, and continue running without interruptions. This typically involves retry logic, proxy rotation, request delays, and proper headers.
If you’ve already implemented proxy rotation, you’ve solved one part of the problem. If not, this guide on how to rotate proxies in Python for reliable data collection walks through the basics of setting up proxy rotation in a real workflow.
But in real-world scenarios, that’s not enough.
You’ll still run into:
- Random request failures
- Rate limits
- CAPTCHAs
- Inconsistent responses
To make your system reliable, you need to combine multiple techniques.
Why do scraping systems fail?
Scraping systems fail because websites detect patterns such as repeated IP usage, missing headers, and high request frequency.
Common causes include:
- Sending too many requests too quickly
- Using the same IP repeatedly
- Missing or unrealistic headers
- No retry handling
Even with proxies, your system will break if you don’t handle these properly.
How do you build a resilient request function?
You build a resilient request function by combining retries, proxy rotation, and error handling.
Here’s a simple example:
```python
import requests
import random
import time

proxy_list = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    "http://user:pass@ip3:port",
]

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def get_proxy():
    return random.choice(proxy_list)

def get_headers():
    return {
        "User-Agent": random.choice(user_agents)
    }

def fetch(url, retries=3):
    for attempt in range(retries):
        proxy = get_proxy()
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                headers=get_headers(),
                timeout=10,
            )
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass  # network or proxy error: retry with a different proxy
        time.sleep(random.uniform(1, 3))  # random delay between attempts
    return None
```
This setup:
- Rotates proxies
- Rotates headers
- Retries failed requests
- Adds delays
Why are headers important?
Headers are important because websites use them to identify real users.
Without realistic headers, your requests are easy to identify as automated.
At minimum, you should include:
- User-Agent
- Accept-Language
- Accept
Example:
```python
def get_headers():
    return {
        "User-Agent": random.choice(user_agents),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    }
```
How does proxy rotation improve reliability?
Proxy rotation improves reliability by distributing requests across multiple IP addresses, reducing the chance of detection and blocking.
Instead of hitting a server from one IP repeatedly, you spread requests across many.
If you're evaluating different options, many developers compare rotating residential proxies based on success rate, IP pool size, and geographic coverage.
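Random choice can hit the same IP twice in a row; cycling through the pool in order guarantees each proxy carries an equal share of traffic. A minimal sketch (the proxy URLs are placeholders for your own endpoints):

```python
import itertools

# Placeholder endpoints: substitute your own proxy credentials.
proxy_list = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    "http://user:pass@ip3:port",
]

# itertools.cycle walks the pool in order and wraps around forever,
# so every IP is used evenly instead of at random.
proxy_cycle = itertools.cycle(proxy_list)

def next_proxy():
    return next(proxy_cycle)
```

Pass `next_proxy()` into the `proxies=` argument of each request in place of `random.choice`.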
How do you handle rate limiting?
You handle rate limiting by slowing down requests and adding randomness.
Simple techniques:

**Add delays**

```python
time.sleep(random.uniform(1, 3))
```

**Avoid patterns.** Don't send requests at fixed intervals.

**Reduce concurrency.** Too many parallel requests means a higher detection risk.
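A flat random delay works, but repeated failures are often handled better with exponential backoff plus jitter, so waits grow with each failed attempt and never land at fixed intervals. A minimal sketch (the `backoff_delay` helper name and defaults are illustrative, not from a library):

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    # Exponential backoff: the window doubles each attempt (1s, 2s, 4s, ...),
    # capped at `cap` seconds so waits never grow unbounded.
    window = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point in the window to avoid fixed intervals.
    return random.uniform(0, window)
```

Call `time.sleep(backoff_delay(attempt))` before each retry instead of a flat delay.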
How do you detect blocked responses?
You should check for:
- HTTP 403 / 429 status codes
- CAPTCHA pages
- Empty or unexpected responses
Example:
```python
if response.status_code in (403, 429):
    return None
```
You can also check content for known block patterns.
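Status codes alone miss "soft" blocks, where the site returns 200 with a CAPTCHA page or an empty body. A sketch of a combined check (the `is_blocked` helper and marker strings are illustrative; real block pages vary by site):

```python
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")

def is_blocked(response):
    # Hard blocks: explicit forbidden / rate-limit status codes.
    if response.status_code in (403, 429):
        return True
    # Soft blocks: an empty body where content was expected.
    body = response.text.strip()
    if not body:
        return True
    # Soft blocks: known block-page phrases (extend per target site).
    return any(marker in body.lower() for marker in BLOCK_MARKERS)
```

This works with any object exposing `status_code` and `text`, such as a `requests.Response`.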
How do you scale this system?
To scale, you need:
- Larger proxy pools
- Queue systems to distribute work across workers
- Parallel workers
- Logging and monitoring
At scale, your system becomes more about architecture than code.
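Before reaching for a full queue system, Python's standard library already provides a worker pool for parallel fetching. A minimal sketch (the URLs are placeholders, and `fetch` here just echoes its input to keep the example self-contained; in practice you would plug in the resilient `fetch` from earlier):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real request so the sketch runs on its own.
    return f"fetched {url}"

urls = [f"http://example.com/page/{i}" for i in range(5)]

# A small pool of workers drains the URL list in parallel.
# Keep max_workers modest: more concurrency raises detection risk.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch, urls))  # preserves input order
```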
FAQs
Do I always need proxies for data collection?
Not always. For small-scale tasks, you may not need them. But for large-scale or repeated requests, proxies become necessary.
What’s the biggest mistake beginners make?
Not adding retry logic. One failure can break your entire pipeline if not handled properly.
How many retries should I use?
Typically 2–5 retries. More than that can slow down your system.
Are residential proxies always better?
They are harder to detect, but also more expensive. The best choice depends on your use case.
Final Thoughts
Building a reliable data collection system isn't about one trick; it's about combining multiple techniques.
Proxy rotation, retries, headers, and delays all work together.
If you only use one, your system will eventually fail.
If you combine them properly, you get a system that’s:
- Stable
- Scalable
- Harder to block