Web scraping is an essential technique for data gathering, but it often runs into the obstacle of IP bans, especially when targeting popular or protected sites. As a security researcher, I developed an approach using React and open source tools to mitigate IP bans effectively.
Understanding the Challenge
Many websites deploy IP-based rate limiting or other security measures that block an address once its request volume crosses a threshold. Scraping from a single IP therefore triggers repeated bans, disrupting data collection efforts.
Strategy Overview
Our goal was to distribute requests across multiple IP addresses dynamically, so that traffic appears to originate from many different users. A React frontend triggers and monitors scraping jobs, while the backend routes the actual requests through a rotating proxy network. Open source tools like ProxyBroker and Tor handle IP discovery and rotation; commercial services such as ScraperAPI can fill the same role if you prefer a managed option.
Building the Solution
Setting up a Proxy Network
We chose Tor for its open source nature and ease of integration: once running, it exposes a local SOCKS5 proxy (127.0.0.1:9050 by default) that routes traffic through the Tor network. Alongside it, ProxyBroker discovers and verifies a pool of public proxies.
# Install Tor and ProxyBroker
sudo apt-get install tor
pip install proxybroker
ProxyBroker, meanwhile, scans public sources for available proxies and verifies that they work:
import asyncio
from proxybroker import Broker

async def show(proxies):
    # Broker puts None on the queue when the search is finished.
    while True:
        proxy = await proxies.get()
        if proxy is None:
            break
        print(proxy)

proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(broker.find(types=['HTTP', 'HTTPS'], limit=10), show(proxies))
asyncio.get_event_loop().run_until_complete(tasks)
This script builds a pool of verified proxies that can be rotated during scraping.
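Tor itself can serve as one of the rotating exits. Below is a minimal sketch, assuming Tor's defaults (SOCKS on 127.0.0.1:9050, ControlPort 9051 enabled in torrc) plus the requests[socks] and stem packages; the helper names are ours, not part of any library:
# Route requests through Tor's local SOCKS proxy and rotate circuits.
import requests
from stem import Signal
from stem.control import Controller

TOR_PROXIES = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

def fetch_via_tor(url):
    # socks5h resolves DNS through Tor as well.
    return requests.get(url, proxies=TOR_PROXIES, timeout=10)

def new_tor_identity():
    # Ask Tor for a new circuit; later requests may exit from a new IP.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()  # assumes cookie auth; pass password= otherwise
        controller.signal(Signal.NEWNYM)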
Integrating with React
React itself does not handle HTTP requests directly; instead, it orchestrates requests to our backend API, which manages proxy rotation.
// React component to trigger scraping requests
import React, { useState } from 'react';

function ScrapeTrigger() {
  const [status, setStatus] = useState('Idle');

  const startScraping = async () => {
    setStatus('In Progress');
    try {
      const response = await fetch('/api/start-scrape');
      if (response.ok) {
        setStatus('Completed');
      } else {
        throw new Error('Error in scraping');
      }
    } catch (err) {
      setStatus('Failed');
    }
  };

  return (
    <div>
      <button onClick={startScraping}>Start Scraping</button>
      <p>Status: {status}</p>
    </div>
  );
}

export default ScrapeTrigger;
This component communicates with our backend to initiate proxy-rotated scraping.
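The backend route itself is left implicit above, so here is a minimal sketch of what /api/start-scrape could look like; we assume Flask, and run_scrape_job is a hypothetical wrapper around the proxy-rotated fetcher shown in the next section:
# Minimal Flask sketch of the endpoint the React component calls.
from flask import Flask, jsonify

app = Flask(__name__)

def run_scrape_job():
    # Hypothetical placeholder: iterate over target URLs and call the
    # proxy-rotated fetch_data() defined in the next section.
    pass

@app.route('/api/start-scrape')
def start_scrape():
    try:
        run_scrape_job()
        return jsonify({'status': 'completed'})
    except Exception as exc:
        return jsonify({'status': 'failed', 'error': str(exc)}), 500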
Backend Proxy Rotation
The backend, powered by Node.js or Python, picks a different proxy from the pool for each request, spreading traffic across many IPs:
import requests
import random

# Pool of proxies (e.g. gathered by ProxyBroker). requests picks the
# entry matching the target URL's scheme, so map both http and https.
proxies = [
    {'http': 'http://proxy1:port', 'https': 'http://proxy1:port'},
    {'http': 'http://proxy2:port', 'https': 'http://proxy2:port'},
    # More proxies
]

def get_next_proxy():
    return random.choice(proxies)

def fetch_data(url):
    proxy = get_next_proxy()
    response = requests.get(url, proxies=proxy, timeout=10)
    return response.text
This method distributes requests and reduces the likelihood of IP bans.
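One extension we found useful, sketched here rather than taken from the snippet above, is retrying failures with a fresh proxy and evicting proxies that error out (this reuses proxies and get_next_proxy from the previous block):
def fetch_with_retries(url, max_attempts=3):
    # Try up to max_attempts proxies, dropping any that fail outright.
    for _ in range(max_attempts):
        proxy = get_next_proxy()
        try:
            response = requests.get(url, proxies=proxy, timeout=10)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            if proxy in proxies:
                proxies.remove(proxy)  # evict the dead proxy from the pool
    raise RuntimeError('all attempts failed for ' + url)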
Additional Tips
- Implement request throttling to mimic human-like behavior (see the sketch after this list).
- Use headless browsers like Puppeteer with proxy rotation for complex sites.
- Monitor proxy health and update the pool regularly.
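A minimal throttling sketch; the delay bounds are illustrative assumptions, and polite_fetch simply wraps the fetch_data helper defined earlier:
import random
import time

def polite_fetch(url, min_delay=2.0, max_delay=6.0):
    # Sleep for a random, human-like interval before each request.
    time.sleep(random.uniform(min_delay, max_delay))
    return fetch_data(url)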
Final Thoughts
Combining React with robust backend proxy management creates a scalable, resilient scraping system that minimizes bans. Always ensure your scraping respects robots.txt and legal considerations.
This approach balances open source flexibility and technical sophistication, providing a durable solution against IP-based security measures.