Caught by IP Ban? Node.js Scraping Secrets
The Problem
When you hit a website’s API or pages too many times, you can trigger its rate‑limiting logic. Most sites employ a combination of request‑header fingerprinting, IP geolocation checks, and CAPTCHA challenges. A single IP that sends dozens of consecutive GET requests with an identical user‑agent string and no natural pauses will usually be flagged; the server then either blocks the IP for a cooldown period or bans it permanently. For automated QA scripts that pull demo data or monitor front‑end performance, this means flaky tests and downtime.
Technically, the ban is enforced by a set of rules at the proxy or reverse‑proxy layer (NGINX, CloudFront, or a commercial WAF). The rules typically evaluate attributes like these:
| Attribute | Typical threshold | Result |
|---|---|---|
| Request rate | > 10 req/min | 429 Too Many Requests |
| Header variance | Repetitive UA or missing Referer | Temporary ban |
| Geo‑location | Same city for > 1 h | Permanent ban |
| Bearer token misuse | Reused token without rotation | 401 Unauthorized |
If you rely on a single IP across a 24‑hour test cycle, you’re bound to hit these thresholds.
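To make the request‑rate row concrete, here is a minimal sketch of the kind of per‑IP counter such a rule implements, written as a plain Node HTTP server. The 10 req/min window mirrors the table above; real WAF rules are vendor‑specific and far more elaborate.

```js
const http = require('http');

const WINDOW_MS = 60 * 1000; // sliding 1-minute window
const MAX_REQS = 10;         // mirrors the "> 10 req/min" threshold above
const hits = new Map();      // ip -> recent request timestamps (toy: never evicted)

const server = http.createServer((req, res) => {
  const ip = req.socket.remoteAddress;
  const now = Date.now();
  // Keep only timestamps inside the window, then record this request.
  const recent = (hits.get(ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);

  if (recent.length > MAX_REQS) {
    res.writeHead(429, { 'Retry-After': '60' });
    return res.end('Too Many Requests');
  }
  res.writeHead(200);
  res.end('ok');
});

server.listen(8080);
```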
Why it matters
QA teams automate “the boring parts”: data seeding, regression checks, end‑to‑end screenshots. Scraping a site’s public API or fixture pages is simpler than building a fully‑fledged mock server. But a hard IP block forces tests to stall, time out, or fail outright, breaking continuous‑integration pipelines. The result? Late feedback, undiagnosed bugs, and wasted hours. Preventing IP bans keeps your test harness reliable, reduces maintenance overhead, and keeps the sprint queue filled with real feature work.
The Solution
Here’s a step‑by‑step recipe you can drop into a Node.js script or a Jest test suite.
- Set up a rotating proxy pool
```js
// npm install http-proxy-agent  (v5+ exports the class by name)
// For https:// targets you'd use the sibling https-proxy-agent package.
const { HttpProxyAgent } = require('http-proxy-agent');

const proxies = [
  'http://proxy1.example.com:3128',
  'http://proxy2.example.com:3128',
  'http://proxy3.example.com:3128',
];
let idx = 0;

// Round-robin over the pool: each call returns an agent for the next proxy.
function nextAgent() {
  const url = proxies[idx % proxies.length];
  idx++;
  return new HttpProxyAgent(url);
}
```
By cycling through n proxies you spread the request rate across multiple IPs.
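As a quick sanity check, the sketch below makes one request per proxy. It assumes `node-fetch` v2, which accepts the `agent` option (Node 18’s built‑in fetch ignores it), and uses httpbin.org only as a convenient echo endpoint.

```js
// npm install node-fetch@2  (v2 is CommonJS and accepts the `agent` option)
const fetch = require('node-fetch');

async function smokeTestPool() {
  for (let i = 0; i < proxies.length; i++) {
    // httpbin echoes the caller's IP, so each line should show a different proxy.
    const resp = await fetch('http://httpbin.org/ip', { agent: nextAgent() });
    console.log(await resp.json());
  }
}

smokeTestPool().catch(console.error);
```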
- Throttle request bursts
```js
// npm install bottleneck
const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({
  maxConcurrent: 5, // at most 5 requests in flight at once
  minTime: 200,     // at least 200 ms between request starts
});

// `fetch` is node-fetch here (see above); Node's built-in fetch ignores `agent`.
async function fetchWithLimit(url) {
  return limiter.schedule(() => fetch(url, { agent: nextAgent() }));
}
```
Limiting concurrent calls keeps the per‑IP request curve below most thresholds.
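For example, a batch of fixture URLs (placeholders here) can be fanned out through the limiter without bursting past the 5‑concurrent / 200 ms pacing:

```js
// Usage sketch: schedule a batch through the limiter. Bottleneck enforces
// the pacing, so Promise.all won't burst past the configured limits.
// Plain http:// targets, since the pool uses HttpProxyAgent.
const urls = [
  'http://example.com/fixtures/1',
  'http://example.com/fixtures/2',
  'http://example.com/fixtures/3',
];

Promise.all(urls.map(url => fetchWithLimit(url)))
  .then(responses => console.log(responses.map(r => r.status)))
  .catch(console.error);
```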
- Randomize request headers
```js
// A small pool of realistic User-Agent strings to rotate through.
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 ...',
];

function randomUA() {
  return userAgents[Math.floor(Math.random() * userAgents.length)];
}

async function fetchWithHeaders(url) {
  return fetch(url, {
    agent: nextAgent(),
    headers: {
      'User-Agent': randomUA(),
      'Accept-Language': 'en-US,en;q=0.9',
      'Referer': 'https://www.example.com',
    },
  });
}
```
Varying the User‑Agent and sending a realistic Referer makes your traffic harder for automated bot detectors to fingerprint.
- Implement exponential back‑off
```js
// Retries on 429/503 with waits of 1 s, 2 s, 4 s, ... between attempts.
async function resilientFetch(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const resp = await fetchWithHeaders(url);
    if (resp.status === 429 || resp.status === 503) {
      await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
      continue;
    }
    return resp;
  }
  throw new Error(`Failed after ${retries} attempts`);
}
```
If the server still responds with 429, backing off gradually lowers your request rate instead of hammering the endpoint with immediate retries.
- Embed health checks in CI
```yaml
# GitHub Actions workflow: run the scrape smoke test on every push.
on: [push]

jobs:
  test_scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - run: npm ci
      - run: node src/scrapeTest.js
        env:
          PROXY_POOL: ${{ secrets.PROXY_POOL }}
```
Add the script to your pipeline so you get a fresh ban‑free run every commit.
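The workflow invokes `src/scrapeTest.js`, which isn’t shown above; here is one hypothetical minimal version that reads `PROXY_POOL` as a comma‑separated list of proxy URLs (an assumption about the secret’s format) and fails the job on a non‑OK response:

```js
// src/scrapeTest.js (hypothetical): smoke test run by the CI job.
// Assumes PROXY_POOL is a comma-separated list of proxy URLs.
const fetch = require('node-fetch');
const { HttpProxyAgent } = require('http-proxy-agent');

const proxies = (process.env.PROXY_POOL || '').split(',').filter(Boolean);
if (proxies.length === 0) {
  console.error('PROXY_POOL is empty');
  process.exit(1);
}

(async () => {
  // Placeholder target; swap in the fixture page your suite depends on.
  const resp = await fetch('http://example.com/health', {
    agent: new HttpProxyAgent(proxies[0]),
  });
  if (!resp.ok) throw new Error(`Health check failed: ${resp.status}`);
  console.log('Scrape health check passed');
})().catch(err => {
  console.error(err);
  process.exit(1);
});
```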
Quick Checklist
- [x] Proxy pool (≥ 3 IPs)
- [x] Per‑IP throttling (≤ 10 req/min)
- [x] Randomized UA & Referer
- [x] Exponential back‑off for 429/503
- [x] CI integration
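Since the recipe is meant to drop into a Jest suite as well, here is a minimal usage sketch; the `./scrapeHelpers` module and the fixture URL are hypothetical stand‑ins for wherever you keep the helpers above:

```js
// scrape.test.js: hypothetical Jest usage of resilientFetch from above.
const { resilientFetch } = require('./scrapeHelpers'); // assumed module layout

test('fixture page is reachable without tripping rate limits', async () => {
  const resp = await resilientFetch('http://example.com/fixtures/1');
  expect(resp.status).toBe(200);
}, 30000); // generous timeout to leave room for back-off retries
```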
With these layers you create a “soft‑ban shield” that keeps your QA automation humming while obeying the site’s traffic policies. Happy scraping!
🛠️ The Tool I Use
For generating clean test data and disposable emails for these workflows, I personally use [TempoMail USA](https://tempomailusa.com). It’s fast, has an API-like feel, and keeps my production data clean.