Mohammad Waseem
Mitigating IP Bans During Web Scraping with Open Source API Strategies

Web scraping often runs into the challenge of IP bans, especially when targeting sites with strict anti-scraping policies. Traditional methods like rotating proxies can be costly and unreliable. A more sustainable approach involves designing a resilient API layer that acts as an intermediary, intelligently managing requests and masking the scraper's footprint.

In this post, we'll explore how to leverage open-source tools to develop a robust API that helps avoid IP bans, with an emphasis on integrating rate limiting, request caching, and user-agent rotation.

Understanding the Challenge

Many websites monitor incoming traffic patterns and ban IPs that exhibit suspicious behavior. Repeated requests from a single IP, high request rates, or known scraper signatures trigger bans. Instead of relying solely on proxy pooling, we can build an API-based system that mimics human-like browsing and manages request patterns.

Designing the API Infrastructure

We'll implement an API server using Python with FastAPI, combined with a request manager that features IP rotation, rate limiting, and user-agent randomization. The idea is to centralize all scraping requests through this API, which controls the request flow and mitigates detection risks.

Step 1: Setting Up the API Server

from fastapi import FastAPI, HTTPException
import httpx
import random

app = FastAPI()

# List of user agents for rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
    # Add more user agents
]

# Proxy list for IP rotation (optional; leave empty to send requests directly)
PROXIES = ["http://proxy1:port", "http://proxy2:port"]

@app.get("/scrape")
def scrape(target_url: str):
    # Rotate the user agent so successive requests don't share one signature
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # Rotate the outbound IP if proxies are configured
    # (httpx >= 0.26 accepts a single proxy URL via the `proxy` argument)
    proxy = random.choice(PROXIES) if PROXIES else None
    try:
        response = httpx.get(target_url, headers=headers, proxy=proxy, timeout=10)
        response.raise_for_status()
        return response.text
    except httpx.HTTPError as e:
        # Report upstream failures as 502 so clients can tell them apart from API bugs
        raise HTTPException(status_code=502, detail=str(e))

This endpoint lets a client submit a target URL; the server fetches it with a randomly selected user-agent and, if any proxies are configured, a randomly selected proxy, so successive requests don't present a single, repetitive fingerprint.
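
To see the flow end to end, here is a minimal client call against the API (assuming it is running locally on port 8000, e.g. via uvicorn; the target URL is only a placeholder):

import httpx

# Ask the scraping API to fetch a page on our behalf
resp = httpx.get(
    "http://localhost:8000/scrape",
    params={"target_url": "https://example.com"},
    timeout=30,
)
print(resp.status_code)
print(resp.text[:200])  # first 200 characters of the response body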

Step 2: Implementing Rate Limiting and Caching

To avoid triggering anti-bot measures, throttle how quickly the API accepts scraping requests. SlowAPI makes it straightforward to apply a default limit to every route:

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

# Key each client by its remote address and allow at most 10 requests per minute
limiter = Limiter(key_func=get_remote_address, default_limits=["10/minute"])
app.state.limiter = limiter

# Return 429 responses when the limit is exceeded, and apply the default
# limit to every route via SlowAPI's middleware
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)

Additionally, to optimize requests and avoid making redundant calls, implement caching with an in-memory store or Redis:

import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

# This version supersedes the Step 1 handler for /scrape; keep only one of
# the two registered in a single application.
@app.get("/scrape")
def scrape_with_cache(target_url: str):
    # Serve from cache when this URL was fetched within the last 5 minutes
    cached_response = redis_client.get(target_url)
    if cached_response:
        return cached_response.decode('utf-8')
    # Not cached: fetch the page with a rotated user agent
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = httpx.get(target_url, headers=headers, timeout=10)
    response.raise_for_status()  # don't cache error responses
    redis_client.setex(target_url, 300, response.text)  # cache for 5 minutes
    return response.text

With these pieces in place, incoming requests are throttled and repeat fetches of the same URL are served from Redis instead of hitting the target site again.
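
Putting the techniques together, a consolidated handler can combine caching, user-agent rotation, and optional proxy rotation in one place. The sketch below builds on the snippets above (same USER_AGENTS, PROXIES, and redis_client objects) and uses a separate /fetch route name purely to avoid clashing with /scrape:

@app.get("/fetch")
def fetch(target_url: str):
    # 1. Serve from cache when possible
    cached = redis_client.get(target_url)
    if cached:
        return cached.decode('utf-8')

    # 2. Otherwise fetch with a rotated user agent and, if configured, a rotated proxy
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES) if PROXIES else None
    try:
        response = httpx.get(target_url, headers=headers, proxy=proxy, timeout=10)
        response.raise_for_status()
    except httpx.HTTPError as e:
        raise HTTPException(status_code=502, detail=str(e))

    # 3. Cache the successful response for 5 minutes before returning it
    redis_client.setex(target_url, 300, response.text)
    return response.text

The SlowAPI default limit from Step 2 still applies here, since the middleware covers every route.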

Step 3: Deploying and Monitoring

Deploy the API service in a controlled environment using Docker or cloud services. Add monitoring tools like Prometheus and Grafana to track request metrics.
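
For the Prometheus side, one common option is the prometheus-fastapi-instrumentator package (an assumption here, not something the rest of the setup depends on); it records per-route request metrics and exposes them on a /metrics endpoint that Prometheus can scrape and Grafana can chart:

from prometheus_fastapi_instrumentator import Instrumentator

# Instrument all routes and expose a /metrics endpoint for Prometheus to scrape
Instrumentator().instrument(app).expose(app)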

Conclusion

By developing an API layer with open-source frameworks and libraries, you effectively distribute your request load, mimic human browsing behaviors, and reduce the risk of IP bans. Combining request rotation, rate limiting, and caching creates a resilient scraping architecture capable of sustainable data extraction. This method not only protects your IPs but also provides a scalable pattern adaptable to various scraping workloads.

References:

  • OpenAPI and FastAPI documentation
  • Redis for caching
  • SlowAPI for rate limiting

