Caught by IP Ban? Node.js Scraping Secrets
The Problem
When you hit a website’s API or pages too many times, you can trigger its rate‑limiting logic. Most sites employ a combination of request‑header fingerprinting, IP geolocation checks, and CAPTCHA challenges. A single IP that sends dozens of consecutive GET requests with an identical user‑agent string and no natural pauses will usually be flagged; the server then either blocks the IP for a cooldown period or bans it permanently. For automated QA scripts that pull demo data or monitor front‑end performance, this means flaky tests and downtime.
Technically, the ban is enforced by a set of rules at the proxy or reverse‑proxy layer (NGINX, CloudFront, or a commercial WAF). The rules typically evaluate attributes like these:
| Attribute | Typical threshold | Result |
|---|---|---|
| Request rate | > 10 req/min | 429 Too Many Requests |
| Header variance | Repetitive UA or missing Referer | Temporary ban |
| Geo‑location | Same city for > 1 h | Permanent ban |
| Bearer token misuse | Reused token without rotation | 401 Unauthorized |
If you rely on a single IP across a 24‑hour test cycle, you’re bound to hit these thresholds.
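To make the request‑rate row concrete, here is a minimal sketch of the kind of per‑IP counter such a rule implements, written as a plain Node HTTP server. The 10 req/min window mirrors the table above; real WAF rules are vendor‑specific and far more elaborate.

```js
const http = require('http');

const WINDOW_MS = 60 * 1000; // sliding 1-minute window
const MAX_REQS = 10;         // mirrors the "> 10 req/min" threshold above
const hits = new Map();      // ip -> recent request timestamps (toy: never evicted)

const server = http.createServer((req, res) => {
  const ip = req.socket.remoteAddress;
  const now = Date.now();
  // Keep only timestamps inside the window, then record this request.
  const recent = (hits.get(ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);

  if (recent.length > MAX_REQS) {
    res.writeHead(429, { 'Retry-After': '60' });
    return res.end('Too Many Requests');
  }
  res.writeHead(200);
  res.end('ok');
});

server.listen(8080);
```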
Why it matters
QA teams automate “the boring parts”: data seeding, regression checks, end‑to‑end screenshots. Scraping a site’s public API or fixture pages is simpler than building a fully‑fledged mock server. But a hard IP block forces tests to stall, time out, or fail outright, breaking continuous‑integration pipelines. The result? Late feedback, undiagnosed bugs, and wasted hours. Preventing IP bans keeps your test harness reliable, reduces maintenance overhead, and keeps the sprint queue filled with real feature work.
The Solution
Here’s a step‑by‑step recipe you can drop into a Node.js script or a Jest test suite.
- Set up a rotating proxy pool
```js
// npm install http-proxy-agent  (v5+ exports the class by name)
// For https:// targets you'd use the sibling https-proxy-agent package.
const { HttpProxyAgent } = require('http-proxy-agent');

const proxies = [
  'http://proxy1.example.com:3128',
  'http://proxy2.example.com:3128',
  'http://proxy3.example.com:3128',
];
let idx = 0;

// Round-robin over the pool: each call returns an agent for the next proxy.
function nextAgent() {
  const url = proxies[idx % proxies.length];
  idx++;
  return new HttpProxyAgent(url);
}
```
By cycling through n proxies you spread the request rate across multiple IPs.
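As a quick sanity check, the sketch below makes one request per proxy. It assumes `node-fetch` v2, which accepts the `agent` option (Node 18’s built‑in fetch ignores it), and uses httpbin.org only as a convenient echo endpoint.

```js
// npm install node-fetch@2  (v2 is CommonJS and accepts the `agent` option)
const fetch = require('node-fetch');

async function smokeTestPool() {
  for (let i = 0; i < proxies.length; i++) {
    // httpbin echoes the caller's IP, so each line should show a different proxy.
    const resp = await fetch('http://httpbin.org/ip', { agent: nextAgent() });
    console.log(await resp.json());
  }
}

smokeTestPool().catch(console.error);
```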
- Throttle request bursts
```js
// npm install bottleneck
const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({
  maxConcurrent: 5, // at most 5 requests in flight at once
  minTime: 200,     // at least 200 ms between request starts
});

// `fetch` is node-fetch here (see above); Node's built-in fetch ignores `agent`.
async function fetchWithLimit(url) {
  return limiter.schedule(() => fetch(url, { agent: nextAgent() }));
}
```
Limiting concurrent calls keeps the per‑IP request curve below most thresholds.
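For example, a batch of fixture URLs (placeholders here) can be fanned out through the limiter without bursting past the 5‑concurrent / 200 ms pacing:

```js
// Usage sketch: schedule a batch through the limiter. Bottleneck enforces
// the pacing, so Promise.all won't burst past the configured limits.
// Plain http:// targets, since the pool uses HttpProxyAgent.
const urls = [
  'http://example.com/fixtures/1',
  'http://example.com/fixtures/2',
  'http://example.com/fixtures/3',
];

Promise.all(urls.map(url => fetchWithLimit(url)))
  .then(responses => console.log(responses.map(r => r.status)))
  .catch(console.error);
```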
- Randomize request headers
```js
// A small pool of realistic User-Agent strings to rotate through.
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 ...',
];

function randomUA() {
  return userAgents[Math.floor(Math.random() * userAgents.length)];
}

async function fetchWithHeaders(url) {
  return fetch(url, {
    agent: nextAgent(),
    headers: {
      'User-Agent': randomUA(),
      'Accept-Language': 'en-US,en;q=0.9',
      'Referer': 'https://www.example.com',
    },
  });
}
```
Varying the User‑Agent and sending a realistic Referer makes your traffic harder for automated bot detectors to fingerprint.
- Implement exponential back‑off
```js
// Retries on 429/503 with waits of 1 s, 2 s, 4 s, ... between attempts.
async function resilientFetch(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const resp = await fetchWithHeaders(url);
    if (resp.status === 429 || resp.status === 503) {
      await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
      continue;
    }
    return resp;
  }
  throw new Error(`Failed after ${retries} attempts`);
}
```
If the server still responds with 429, backing off gradually lowers your request rate instead of hammering the endpoint with immediate retries.
- Embed health checks in CI
```yaml
# GitHub Actions workflow: run the scrape smoke test on every push.
on: [push]

jobs:
  test_scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - run: npm ci
      - run: node src/scrapeTest.js
        env:
          PROXY_POOL: ${{ secrets.PROXY_POOL }}
```
Add the script to your pipeline so you get a fresh ban‑free run every commit.
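The workflow invokes `src/scrapeTest.js`, which isn’t shown above; here is one hypothetical minimal version that reads `PROXY_POOL` as a comma‑separated list of proxy URLs (an assumption about the secret’s format) and fails the job on a non‑OK response:

```js
// src/scrapeTest.js (hypothetical): smoke test run by the CI job.
// Assumes PROXY_POOL is a comma-separated list of proxy URLs.
const fetch = require('node-fetch');
const { HttpProxyAgent } = require('http-proxy-agent');

const proxies = (process.env.PROXY_POOL || '').split(',').filter(Boolean);
if (proxies.length === 0) {
  console.error('PROXY_POOL is empty');
  process.exit(1);
}

(async () => {
  // Placeholder target; swap in the fixture page your suite depends on.
  const resp = await fetch('http://example.com/health', {
    agent: new HttpProxyAgent(proxies[0]),
  });
  if (!resp.ok) throw new Error(`Health check failed: ${resp.status}`);
  console.log('Scrape health check passed');
})().catch(err => {
  console.error(err);
  process.exit(1);
});
```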
Quick Checklist
- [x] Proxy pool (≥ 3 IPs)
- [x] Per‑IP throttling (≤ 10 req/min)
- [x] Randomized UA & Referer
- [x] Exponential back‑off for 429/503
- [x] CI integration
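Since the recipe is meant to drop into a Jest suite as well, here is a minimal usage sketch; the `./scrapeHelpers` module and the fixture URL are hypothetical stand‑ins for wherever you keep the helpers above:

```js
// scrape.test.js: hypothetical Jest usage of resilientFetch from above.
const { resilientFetch } = require('./scrapeHelpers'); // assumed module layout

test('fixture page is reachable without tripping rate limits', async () => {
  const resp = await resilientFetch('http://example.com/fixtures/1');
  expect(resp.status).toBe(200);
}, 30000); // generous timeout to leave room for back-off retries
```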
With these layers you create a “soft‑ban shield” that keeps your QA automation humming while obeying the site’s traffic policies. Happy scraping!
🛠️ The Tool I Use
For generating clean test data and disposable emails for these workflows, I personally use [TempoMail USA](https://tempomailusa.com). It’s fast, has an API-like feel, and keeps my production data clean.