How to Circumvent IP Banning During Web Scraping Using TypeScript on a Zero Budget
Web scraping can be a powerful tool for data collection, but IP bans often bring it to a halt. The challenge becomes even harder on a strict budget that rules out paid proxies and third-party services. As a Lead QA Engineer and seasoned developer, I will share effective, budget-friendly techniques for avoiding IP bans while scraping websites, using just TypeScript and open-source tools.
Understanding the Challenge
Most websites implement anti-scraping measures such as IP-based rate limiting, banning suspicious IPs, or detecting unusual request patterns. When your IP gets banned, your scraper stops working, or worse, you risk legal trouble. The goal is to make the scraper mimic human-like behavior and spread requests across multiple IPs to avoid detection.
Strategy Overview
- Rotate proxies: gather IPs from free or community-maintained proxy lists.
- Randomize the originating IP address by routing each request through a different proxy or network path.
- Pace requests with randomized delays and varied request patterns.
- Monitor responses and adapt when the target signals rate limiting or blocking (a minimal sketch follows this list).
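To make the last point concrete, here is a minimal sketch of reacting to response feedback. The helper name, the badProxies set, and the 30-second fallback delay are my own assumptions; only the status codes and the Retry-After header come from standard HTTP behavior.
import { Response } from 'node-fetch';

// Inspect a response and adapt: 429 means "slow down", 403 often means the
// proxy (or IP) has been blocked and should be retired.
async function handleResponse(response: Response, proxy: string, badProxies: Set<string>): Promise<void> {
  if (response.status === 429) {
    // Honor Retry-After when the server sends it; otherwise back off for 30 seconds.
    const retryAfterSeconds = Number(response.headers.get('retry-after')) || 30;
    await new Promise(resolve => setTimeout(resolve, retryAfterSeconds * 1000));
  } else if (response.status === 403) {
    // Treat this proxy as burned and stop routing requests through it.
    badProxies.add(proxy);
  }
}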
Implementing IP Rotation with Free Resources in TypeScript
1. Leveraging Free Proxy Lists
Start by sourcing free proxy lists. Sites and services such as the ProxyScrape API used below publish periodically updated proxies. Download and parse them to use in your requests.
import fetch from 'node-fetch';

// Fetch a free proxy list (plain-text response, one host:port entry per line)
async function getProxies(): Promise<string[]> {
  const response = await fetch('https://api.proxyscrape.com/?request=getproxies&proxytype=http&timeout=10000&limit=50');
  const text = await response.text();
  // Trim entries to strip stray carriage returns and drop empty lines
  return text.split('\n').map(p => p.trim()).filter(p => p !== '');
}
2. Randomizing Requests
Create a function to select a random proxy and set up HTTP requests to route through it.
// https-proxy-agent routes node-fetch requests through an HTTP(S) proxy
// (v7+ uses the named export shown here; older versions export the class as default)
import { HttpsProxyAgent } from 'https-proxy-agent';

// Randomly pick a proxy ("host:port" string) from the list
function getRandomProxy(proxies: string[]): string {
  const index = Math.floor(Math.random() * proxies.length);
  return proxies[index];
}

// Example request routed through a proxy
async function fetchViaProxy(url: string, proxy: string): Promise<any> {
  const agent = new HttpsProxyAgent(`http://${proxy}`);
  const response = await fetch(url, { agent });
  return response.json();
}
3. Mimicking Human-Like Behavior
Add randomized delays, varied request headers, and pattern randomness.
// Pause execution for the given number of milliseconds
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Random delay of roughly 2-5 seconds between requests
function getRandomDelay(): number {
  return Math.floor(Math.random() * 3000) + 2000;
}
// Usage loop: rotate proxies and pause a random interval between requests
async function scrape(urls: string[]): Promise<void> {
  const proxies = await getProxies();
  for (const url of urls) {
    const proxy = getRandomProxy(proxies);
    try {
      await fetchViaProxy(url, proxy);
      console.log(`Successfully fetched ${url} via proxy ${proxy}`);
    } catch (err) {
      console.error(`Error fetching ${url} with proxy ${proxy}:`, err);
    }
    await sleep(getRandomDelay()); // Wait randomly between requests
  }
}
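For completeness, here is how the loop above might be invoked; the URLs are placeholders for whatever pages you actually need to collect.
// Example run with placeholder URLs
scrape([
  'https://example.com/page/1',
  'https://example.com/page/2',
]).catch(err => console.error('Scrape run failed:', err));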
Additional Tips
- Rotate User Agents: Keep a list of common browser user agent strings and randomly pick one for each request (see the first sketch after this list).
- Use Tor or a VPN: If available, run your scraper through a Tor circuit so the exit IP changes periodically; Tor is free, and some VPNs offer free tiers (see the second sketch after this list).
- Respect Robots.txt and Rate Limits: Be courteous to avoid being outright banned.
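A user agent pool can be as simple as the sketch below. The strings are examples of common desktop browsers; substitute whatever set suits your targets, and send the header from fetchViaProxy.
// A small pool of common desktop user agent strings (example values)
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

function getRandomUserAgent(): string {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Inside fetchViaProxy, send the header along with the proxy agent:
// const response = await fetch(url, { agent, headers: { 'User-Agent': getRandomUserAgent() } });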
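For the Tor route, here is a minimal sketch assuming a locally running tor daemon (its SOCKS listener defaults to port 9050) and the socks-proxy-agent package; the function name is my own.
import fetch from 'node-fetch';
// socks-proxy-agent (v6+ named export) lets node-fetch speak SOCKS5 to the local Tor daemon
import { SocksProxyAgent } from 'socks-proxy-agent';

// "socks5h" resolves DNS through the proxy, keeping lookups inside the Tor circuit
const torAgent = new SocksProxyAgent('socks5h://127.0.0.1:9050');

async function fetchViaTor(url: string): Promise<string> {
  const response = await fetch(url, { agent: torAgent });
  return response.text();
}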
Final Remarks
While free solutions are less reliable than paid proxies, combining multiple methods—proxy rotation, request pacing, user agent randomization, and pattern variability—can significantly reduce the likelihood of IP bans. Always monitor responses, adapt your tactics over time, and prioritize ethical scraping practices.
Keep in mind that scraping should comply with legal restrictions and website terms of service. These techniques aim to help legitimate data collection efforts avoid unnecessary bans, not to circumvent security measures maliciously.
This approach leverages publicly available tools and carefully paced request patterns to emulate human browsing habits and distribute requests across many addresses, minimizing the risk of IP bans without spending anything.