When web scraping, a proxy can be blacklisted if the target website detects suspicious activity. Detecting blacklisting early, and taking steps to avoid it, keeps access uninterrupted and data collection on schedule.
Use Case: Preventing IP Blacklisting While Scraping E-commerce Prices
An e-commerce intelligence firm scrapes competitor pricing data daily. Their proxies risk being blacklisted due to frequent requests. By monitoring for blacklists and rotating proxies, they maintain seamless data collection.
How to Detect if a Proxy is Blacklisted
1. Check HTTP Response Codes
Certain HTTP status codes indicate blacklisting:
- 403 Forbidden – The IP is blocked from accessing the site.
- 429 Too Many Requests – The site has rate-limited the IP.
- 503 Service Unavailable – Temporary or permanent block due to bot detection.
Example: Checking HTTP Status Codes
import requests

# Placeholder proxy endpoint -- replace "proxy-provider.com:port" with your provider's details
proxy = {"http": "http://proxy-provider.com:port", "https": "http://proxy-provider.com:port"}
url = "https://example.com"
response = requests.get(url, proxies=proxy, timeout=10)

# 403, 429, and 503 are the blacklisting signals listed above
if response.status_code in (403, 429, 503):
    print(f"Possible blacklisting: HTTP {response.status_code}")
2. Monitor for CAPTCHA Challenges
If a website consistently serves CAPTCHA challenges, the proxy is likely flagged.
Example: Detecting CAPTCHA
from bs4 import BeautifulSoup

# 'response' is the requests.Response object from the previous example
soup = BeautifulSoup(response.text, "html.parser")
if soup.find("div", {"class": "captcha"}):  # the CAPTCHA markup varies by site
    print("CAPTCHA detected. Proxy may be blacklisted.")
3. Use an IP Blacklist Checker
Check if your proxy IP is blacklisted using services like:
- Spamhaus
- IPVoid
- WhatIsMyIP
Example: Using an API to Check Blacklists
Some services offer APIs to check if an IP is blacklisted:
import requests

# Endpoint format and authentication vary by provider -- consult the service's docs
api_url = "https://api.blacklistchecker.com/check?ip=your_proxy_ip"
response = requests.get(api_url, timeout=10)
print(response.json())
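Blocklists such as Spamhaus can also be queried directly over DNS using the standard DNSBL convention: reverse the IP's octets, prepend them to the blocklist zone, and check whether the name resolves. Below is a minimal sketch using only the standard library (zen.spamhaus.org is one such zone; note that Spamhaus may not answer lookups routed through large public DNS resolvers):
import socket

def is_dnsbl_listed(ip, zone="zen.spamhaus.org"):
    # DNSBL convention: 1.2.3.4 is checked as 4.3.2.1.zen.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)  # resolves only if the IP is listed
        return True
    except socket.gaierror:
        return False  # NXDOMAIN: the IP is not on this blocklist

print(is_dnsbl_listed("203.0.113.10"))  # replace with your proxy's IP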
How to Avoid Proxy Blacklisting
1. Rotate Proxies Automatically
Rotating through a pool of proxies spreads requests across many IPs, making any single IP far less likely to be flagged.
Example: Rotating Proxies in Python
import random
import requests

url = "https://example.com"
proxies = [
    "http://proxy1:port",
    "http://proxy2:port",
    "http://proxy3:port",
]

# Choose one proxy per request and use it for both HTTP and HTTPS,
# so the two schemes don't end up on different IPs
chosen = random.choice(proxies)
proxy = {"http": chosen, "https": chosen}
response = requests.get(url, proxies=proxy, timeout=10)
2. Use Residential or Mobile Proxies
Residential and mobile proxies route traffic through IP addresses assigned by consumer ISPs and mobile carriers, so they are far harder to distinguish from real visitors than datacenter proxies, whose IP ranges are often blacklisted wholesale. Most providers expose their pool through a single gateway, as in the sketch below.
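A minimal sketch of using such a gateway, where the hostname, port, and credentials are placeholders for whatever your provider issues:
import requests

# Hypothetical gateway -- substitute your provider's host, port, and credentials
username = "your_username"
password = "your_password"
gateway = f"http://{username}:{password}@residential-gateway.example.com:8000"

proxy = {"http": gateway, "https": gateway}
response = requests.get("https://example.com", proxies=proxy, timeout=10)
print(response.status_code)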
3. Implement User-Agent and Header Spoofing
Setting realistic request headers, and randomizing them between requests, helps avoid detection. The first example below spoofs a single browser User-Agent; the sketch after it rotates through a small pool.
Example: Spoofing User-Agent
headers = {
    # A realistic desktop browser User-Agent string
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
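To randomize rather than reuse a single string, keep a pool of realistic User-Agents and pick one per request. A short sketch (the strings below are examples; use current browser versions for real scraping):
import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

# Pick a fresh User-Agent for every request
headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://example.com", headers=headers, timeout=10)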
4. Introduce Random Delays Between Requests
Adding random delays prevents triggering rate limits.
import time
import random

# Pause 1-5 seconds before the next request
time.sleep(random.uniform(1, 5))
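In practice the delay belongs between successive requests, for example inside the scraping loop (the URLs below are placeholders):
import time
import random
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(random.uniform(1, 5))  # randomized pause between requests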
5. Use CAPTCHA-Solving Services
If a site presents CAPTCHAs, integrating a solver like 2Captcha or Anti-Captcha can help.
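As a rough sketch of what that integration looks like, here is reCAPTCHA solving with 2Captcha's Python client (pip install 2captcha-python); the API key and site parameters are placeholders, and the returned token must still be submitted in whatever form or request the target site expects:
from twocaptcha import TwoCaptcha  # pip install 2captcha-python

solver = TwoCaptcha("YOUR_2CAPTCHA_API_KEY")  # placeholder API key

# sitekey is the site's public reCAPTCHA key, visible in the page source
result = solver.recaptcha(
    sitekey="SITE_RECAPTCHA_KEY",
    url="https://example.com/page-with-captcha",
)
print(result["code"])  # the solved token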
Conclusion
Detecting and avoiding proxy blacklists is crucial for effective web scraping. By monitoring HTTP responses, using blacklist checkers, and implementing proxy rotation, scrapers can maintain uninterrupted access.
For an automated and AI-powered solution, consider Mrscraper, which manages proxy rotation, evasion techniques, and CAPTCHA-solving for seamless scraping.