Mohammad Waseem

Caught by IP Ban? Node.js Scraping Secrets

The Problem

When you hit a website’s API or DOM too many times, you may trigger rate‑limiting logic. Most sites employ a combination of request‑header fingerprinting, IP geolocation checks, and CAPTCHA challenges. A single IP that sends dozens of consecutive GET requests with identical user‑agent strings and no natural pause will usually be flagged. The server then either blocks the IP for a cooldown period or permanently bans it. For automated QA scripts that need to pull demo data or monitor front‑end performance, this can cause test flakiness and downtime.

Technically, the ban is enforced by a set of rules on the proxy or reverse‑proxy layer (NGINX, CloudFront, or a commercial WAF). The rules evaluate the following attributes:

| Attribute | Typical threshold | Result |
| --- | --- | --- |
| Request rate | > 10 req/min | 429 Too Many Requests |
| Header variance | Overly repetitive UA or no Referer | Temp ban |
| Geo‑location | Same city > 1 h | Permanent ban |
| Bearer token misuse | Reused token without rotation | 401 Unauthorized |

If you rely on a single IP across a 24‑hour test cycle, you’re bound to hit these thresholds.
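
To make the first row concrete, here is a rough sketch of what a "10 requests per minute per IP" rule can look like when enforced in application code with Express and the express-rate-limit middleware. Real sites usually implement the same idea at the NGINX or WAF layer, so treat this as an illustration of the rule, not anyone's actual config.

   // Illustration only: a per-IP rule similar to the first row of the table,
   // expressed with the express-rate-limit middleware.
   const express = require('express');
   const rateLimit = require('express-rate-limit');

   const app = express();

   app.use('/api/', rateLimit({
     windowMs: 60 * 1000,  // 1-minute window
     max: 10,              // the 11th request from the same IP gets a 429
   }));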

Why it matters

QA teams automate “the boring parts”: data seeding, regression checks, end‑to‑end screenshots. Scraping a site’s public API or fixture pages is simpler than building a fully fledged mock server. But a hard IP block forces tests to stall, wait out the cooldown, or fail outright, breaking continuous integration pipelines. The result? Late feedback, undiagnosed bugs, and wasted hours. Preventing IP bans keeps your test harness reliable, reduces maintenance overhead, and frees the sprint for real feature work.

The Solution

Here’s a step‑by‑step recipe you can drop into a Node.js script or a Jest test suite.

  1. Set up a rotating proxy pool
   // http-proxy-agent v5+ exports the class as a named export
   const { HttpProxyAgent } = require('http-proxy-agent');

   const proxies = [
     'http://proxy1.example.com:3128',
     'http://proxy2.example.com:3128',
     'http://proxy3.example.com:3128',
   ];

   let idx = 0;
   // Round-robin over the pool: every call returns an agent for the next proxy
   function nextAgent() {
     const url = proxies[idx % proxies.length];
     idx++;
     return new HttpProxyAgent(url);
   }

By cycling through n proxies you spread the request rate across multiple IPs.
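
One practical note: in the CI job in step 5 the pool arrives through a PROXY_POOL secret rather than being hard-coded. A small sketch of wiring that up, assuming the secret is a comma-separated list of proxy URLs (the format is my assumption; the workflow doesn't define it):

   // Assumption: PROXY_POOL looks like "http://p1:3128,http://p2:3128,http://p3:3128"
   const proxies = (process.env.PROXY_POOL || '')
     .split(',')
     .map(p => p.trim())
     .filter(Boolean);

   if (proxies.length === 0) {
     throw new Error('PROXY_POOL is empty; the rotating pool needs at least one proxy URL');
   }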

  2. Throttle request bursts
   const Bottleneck = require('bottleneck');
   // node-fetch v2 (CommonJS) honours the `agent` option; Node's built-in fetch does not
   const fetch = require('node-fetch');

   const limiter = new Bottleneck({
     maxConcurrent: 5,         // 5 parallel requests
     minTime: 200,             // 200 ms between starts
   });

   async function fetchWithLimit(url) {
     return limiter.schedule(() => fetch(url, { agent: nextAgent() }));
   }

Limiting concurrent calls flattens the burst profile, but the per‑IP rate only stays under a threshold if minTime is tuned to the pool size and the target’s limits; the sketch below shows the arithmetic.
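
A quick sanity check on the numbers, assuming the "> 10 req/min" per‑IP threshold from the table above. The helper below is hypothetical and only does the arithmetic:

   // Hypothetical sizing helper: spacing (in ms) between request starts so that
   // each proxy in the pool stays under `perIpPerMinute` requests per minute.
   function minTimeForBudget(poolSize, perIpPerMinute) {
     const totalPerMinute = poolSize * perIpPerMinute; // whole pool's budget
     return Math.ceil(60000 / totalPerMinute);         // ms between starts
   }

   // 3 proxies * 10 req/min each = 30 req/min total, i.e. one request every 2000 ms.
   // With minTime: 200 you would be roughly 10x over that budget, so tune accordingly.
   const safeMinTime = minTimeForBudget(3, 10); // 2000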

  3. Randomize request headers
   const userAgents = [
     'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
     'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 ...',
   ];

   function randomUA() {
     return userAgents[Math.floor(Math.random() * userAgents.length)];
   }

   async function fetchWithHeaders(url) {
     return fetch(url, {
       agent: nextAgent(),
       headers: {
         'User-Agent': randomUA(),
         'Accept-Language': 'en-US,en;q=0.9',
         'Referer': 'https://www.example.com',
       },
     });
   }

Varying UA strings and including a realistic Referer makes the traffic look less like a bot to automated detectors.
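
One gap worth closing: fetchWithHeaders above goes straight to fetch and skips the limiter from step 2. A small glue sketch (my own combination, not from the post) that routes the header-randomized, proxy-rotated request through Bottleneck so every call gets all three protections:

   // Combine step 2 (throttling) with steps 1 and 3 (rotated agent, randomized headers)
   async function politeFetch(url) {
     return limiter.schedule(() =>
       fetch(url, {
         agent: nextAgent(),
         headers: {
           'User-Agent': randomUA(),
           'Accept-Language': 'en-US,en;q=0.9',
           'Referer': 'https://www.example.com',
         },
       })
     );
   }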

  4. Implement exponential back‑off
   async function resilientFetch(url, retries = 3) {
     for (let i = 0; i < retries; i++) {
       const resp = await fetchWithHeaders(url);
       // 429/503 mean "back off": wait 1 s, 2 s, 4 s ... before retrying
       if (resp.status === 429 || resp.status === 503) {
         await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
         continue;
       }
       return resp;
     }
     throw new Error(`Failed after ${retries} attempts`);
   }

If the provider still responds with 429, a gradual back‑off lowers the request rate instead of hammering the endpoint with immediate retries.
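
Because the recipe is meant to drop into a Jest suite as well, here is a minimal usage sketch; the module path, target URL, and timeout are placeholders of mine, not values from the post:

   // scrape.test.js -- hypothetical Jest usage of the helpers above
   const { resilientFetch } = require('./scrape'); // assumes the helpers live in scrape.js

   test('fixture page stays reachable without tripping rate limits', async () => {
     const resp = await resilientFetch('https://www.example.com/fixtures/demo.json');
     expect(resp.status).toBe(200);
   }, 30000); // generous timeout to leave room for back-off retries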

  5. Embed health checks in CI
   # GitHub Actions job (nested under `jobs:` in your workflow file)
   test_scrape:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
       - uses: actions/setup-node@v3
         with:
           node-version: '18'
       - run: npm ci
       - run: node src/scrapeTest.js
         env:
           PROXY_POOL: ${{ secrets.PROXY_POOL }}

Add the script to your pipeline so you get a fresh ban‑free run every commit.
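
The workflow calls src/scrapeTest.js, which the post doesn’t show; one plausible shape for it (everything here, including the SCRAPE_TARGET variable, is my assumption) is a tiny health check that exits non‑zero so the CI step fails loudly on a ban:

   // src/scrapeTest.js -- hypothetical CI health check using the helpers above
   const { resilientFetch } = require('./scrape'); // assumes the helpers live in src/scrape.js
   const target = process.env.SCRAPE_TARGET || 'https://www.example.com/fixtures/demo.json';

   (async () => {
     const resp = await resilientFetch(target);
     if (resp.status !== 200) {
       console.error(`Scrape health check failed with status ${resp.status}`);
       process.exit(1); // non-zero exit marks the CI step as failed
     }
     console.log('Scrape health check passed');
   })();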

Quick Checklist

  • [x] Proxy pool (≥ 3 IPs)
  • [x] Per‑IP throttling (≤ 10 req/min)
  • [x] Randomized UA & Referer
  • [x] Exponential back‑off for 429/503
  • [x] CI integration

With these layers you create a “soft‑ban shield” that keeps your QA automation humming while obeying the site’s traffic policies. Happy scraping!

🛠️ The Tool I Use

For generating clean test data and disposable emails for these workflows, I personally use [TempoMail USA](https://tempomailusa.com). It’s fast, has an API-like feel, and keeps my production data clean.
