Charles

Posted on Jun 8 • Edited on Jun 26

Why Your Web Scraper Gets Blocked: A Production Architecture That Survives

#scraping #node #architecture #proxies

Most web scrapers do not fail because the CSS selector was wrong.

They fail because the system was designed as a script:

Fetch a page
Parse the HTML
Retry if it fails
Hope tomorrow looks like today

That works for demos. It breaks in production.

If you are scraping protected sites, blocking is not a random runtime event. It is a design constraint. You need an architecture that assumes failure, measures data quality, and adapts before the whole pipeline goes dark.

This is the architecture pattern I use now for production scraping systems.

The real reason scrapers get blocked

Most teams focus on the visible error:

403 Forbidden
429 Too Many Requests
CAPTCHA page
Cloudflare / Akamai / PerimeterX challenge
Empty HTML shell from a JavaScript app

But those are symptoms. The actual causes usually sit one layer deeper.

1. Your traffic identity looks wrong

A datacenter IP sending repeated requests to product pages, search pages, or profile pages does not look like normal human traffic.

Even if your headers are perfect, the IP reputation may already be enough to classify the request as automation.

Common signals include:

IP ASN / hosting provider
request velocity
repeated URL patterns
missing browser behavior
inconsistent geo / language / timezone hints
known proxy ranges

2. Your scraper has no pacing model

A human does not open 500 product pages in 90 seconds.

Many scrapers use a simple concurrency value like this:

const concurrency = 20;

That is easy to reason about, but it is not how websites experience your traffic. The site sees request bursts, repeated paths, and abnormal timing.

A production scraper needs pacing rules per domain, per route type, and sometimes per account/session.
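One way to express pacing per domain and per route type is a small rule table plus a slot reservation function. This is a minimal sketch, not a definitive implementation: the `pacingRules` table, the route names, the numbers, and the `resolvePacing` and `reserveSlot` helpers are all assumptions made for illustration.

```javascript
// Hypothetical pacing table: domain -> route type -> rule.
// The domain, route names, and numbers are illustrative assumptions.
const pacingRules = {
  'example.com': {
    default: { concurrency: 2, minDelayMs: 3000, maxDelayMs: 9000 },
    search: { concurrency: 1, minDelayMs: 8000, maxDelayMs: 20000 },
  },
};

// Conservative fallback for domains that have no explicit rule.
const fallbackRule = { concurrency: 1, minDelayMs: 10000, maxDelayMs: 30000 };

// Route-specific settings override the domain default.
function resolvePacing(domain, routeType) {
  const domainRules = pacingRules[domain];
  if (!domainRules) return fallbackRule;
  return { ...domainRules.default, ...(domainRules[routeType] || {}) };
}

// Earliest time (ms since epoch) each domain/route pair may be requested again.
const nextAllowedAt = new Map();

// Returns how long the caller should wait in ms, and reserves the following slot
// with a randomized gap so requests do not arrive on a fixed beat.
function reserveSlot(domain, routeType, now = Date.now()) {
  const rule = resolvePacing(domain, routeType);
  const key = `${domain}:${routeType}`;
  const earliest = Math.max(now, nextAllowedAt.get(key) ?? 0);
  const gap = rule.minDelayMs + Math.random() * (rule.maxDelayMs - rule.minDelayMs);
  nextAllowedAt.set(key, earliest + gap);
  return earliest - now;
}
```

A worker would call `reserveSlot(job.domain, job.type)` and sleep for the returned number of milliseconds before fetching. The concurrency value still exists, but it is only one field in the rule rather than the whole pacing model.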

3. Your parser trusts successful responses

A 200 OK response can still be useless.

It might be:

a login page
a CAPTCHA page
a soft-block message
an empty JavaScript shell
a localized version missing the fields you expect
a page template that changed yesterday

If your pipeline treats every 200 as valid data, your database will slowly fill with garbage.

4. Your system cannot explain failures

A scraper that fails loudly is annoying.

A scraper that fails silently is expensive.

When the business asks, "Why did yesterday's pricing data drop by 40%?", you need more than "the script ran successfully." You need request logs, response samples, validation scores, and failure categories.
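To make that concrete, here is a minimal sketch of turning per-attempt logs into failure categories. The log shape, the outcome labels, and the `summarizeOutcomes` helper are assumptions for illustration; in a real system the rows would come from your log store.

```javascript
// Illustrative attempt log. The fields and outcome labels are assumptions.
const attempts = [
  { url: 'https://example.com/product/1', status: 200, outcome: 'valid' },
  { url: 'https://example.com/product/2', status: 200, outcome: 'valid' },
  { url: 'https://example.com/product/3', status: 200, outcome: 'valid' },
  { url: 'https://example.com/product/4', status: 200, outcome: 'captcha' },
  { url: 'https://example.com/product/5', status: 200, outcome: 'captcha' },
  { url: 'https://example.com/product/6', status: 200, outcome: 'validation_failed' },
];

// Count attempts per outcome so a drop in valid records can be traced to a cause.
function summarizeOutcomes(rows) {
  const counts = {};
  for (const row of rows) {
    counts[row.outcome] = (counts[row.outcome] || 0) + 1;
  }
  return Object.entries(counts)
    .map(([outcome, count]) => ({ outcome, count, share: count / rows.length }))
    .sort((a, b) => b.count - a.count);
}

console.log(summarizeOutcomes(attempts));
```

Note that every row in this sample has status 200. Grouping by HTTP status alone would report a perfect day, while grouping by outcome shows half the attempts produced nothing usable.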

The production architecture

Here is the high-level design:

URL Queue
   │
   ▼
Scheduler / Rate Policy
   │
   ▼
Fetch Layer ──► Residential proxy / browser rendering / retries
   │
   ▼
Raw Response Store
   │
   ▼
Parser
   │
   ▼
Validation Layer
   │
   ├── valid data ──► Database / API / export
   │
   └── invalid data ──► Review queue + alerting
   │
   ▼
Monitoring Dashboard

The important idea is separation of concerns.

Your parser should not manage proxies. Your retry logic should not decide whether a product is valid. Your database should not be the first place you discover a block page.

Each layer has one job.
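One way to keep the layers apart in code is to inject each of them into a thin pipeline function. This is a sketch under assumptions: `processJob` and every name on the `layers` object are hypothetical, and the order follows the diagram above (store the raw response before anything tries to interpret it).

```javascript
// One pass through the pipeline for a single job. Every layer is injected,
// so this function knows nothing about proxies, selectors, or databases.
// All layer names are hypothetical.
async function processJob(job, layers) {
  const response = await layers.fetchPage(job.url);

  // Store first, so even blocked pages leave evidence behind.
  const rawPageId = await layers.storeRaw(job, response);

  // Classify before parsing, so block pages never reach the parser.
  const block = layers.detectBlock(response);
  if (block) return { outcome: 'blocked', block, rawPageId };

  const record = layers.parse(response);
  const validation = layers.validate(record);

  if (!validation.valid) {
    await layers.sendToReview({ job, record, validation, rawPageId });
    return { outcome: 'needs_review', rawPageId };
  }

  await layers.save(record);
  return { outcome: 'saved', rawPageId };
}
```

Because the layers are plain functions, any one of them can be swapped for a stub in tests or replaced in production without touching the others.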

Layer 1: URL queue

Do not throw thousands of URLs directly at a worker.

Use a queue with metadata:

const job = {
  url: 'https://example.com/product/123',
  domain: 'example.com',
  type: 'product',
  priority: 'normal',
  lastScrapedAt: null,
  attempts: 0,
};

That metadata allows you to make better decisions:

search pages can be paced more slowly than product pages
failed URLs can be retried later instead of immediately
high-value URLs can get better proxy/rendering options
stale URLs can be refreshed without scraping everything

A durable queue also makes the system restartable. If a worker crashes mid-job, the job goes back to the queue instead of being lost.
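The restart behavior usually comes from leases: a worker borrows a job for a limited time and must acknowledge it when done. Here is a minimal in-memory sketch of that idea. The `LeaseQueue` class and its method names are hypothetical, and a production system would back this with a durable store rather than a `Map`.

```javascript
// Minimal in-memory queue with leases, to illustrate the idea only.
// Class and method names are hypothetical.
class LeaseQueue {
  constructor(leaseMs = 60000) {
    this.leaseMs = leaseMs;
    this.jobs = new Map(); // url -> { job, leasedUntil }
  }

  // Keyed by URL, so enqueueing the same URL twice does not duplicate work.
  enqueue(job) {
    if (!this.jobs.has(job.url)) {
      this.jobs.set(job.url, { job, leasedUntil: 0 });
    }
  }

  // Hand out the next job whose lease is free or has expired.
  lease(now = Date.now()) {
    for (const entry of this.jobs.values()) {
      if (entry.leasedUntil <= now) {
        entry.leasedUntil = now + this.leaseMs;
        return entry.job;
      }
    }
    return null;
  }

  // Only an explicit ack removes the job. A crashed worker never acks,
  // so the job becomes available again once its lease expires.
  ack(job) {
    this.jobs.delete(job.url);
  }
}
```

The lease duration is a trade-off: too short and slow pages get fetched twice, too long and a crashed worker delays the retry.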

Layer 2: scheduler and pacing policy

This layer decides when a URL should be fetched.

A simple version:

const policy = {
  'example.com': {
    concurrency: 2,
    minDelayMs: 3000,
    maxDelayMs: 9000,
    maxRetries: 3,
  },
};

function randomDelay(min, max) {
  return min + Math.floor(Math.random() * (max - min));
}

A better version also tracks recent failures:

function shouldSlowDown(stats) {
  return stats.rateLimitRate > 0.05 || stats.captchaRate > 0.02;
}

function nextDelay(baseDelay, stats) {
  if (shouldSlowDown(stats)) return baseDelay * 4;
  if (stats.successRate > 0.95) return baseDelay;
  return baseDelay * 2;
}

The scheduler should adapt to what the site is telling you.

If the block rate rises, slow down. If validation starts failing, sample and inspect. Do not keep hammering the site with the same pattern.
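The `stats` object read by `shouldSlowDown` and `nextDelay` has to come from somewhere. A minimal sketch, assuming you record one outcome label per fetch and keep a sliding window per domain; the `RollingStats` class, the window size, and the labels `'ok'`, `'rate_limit'`, and `'captcha'` are assumptions for illustration.

```javascript
// Keeps the last `size` fetch outcomes for one domain and derives the rates
// the scheduler reads. Outcome labels are assumptions.
class RollingStats {
  constructor(size = 200) {
    this.size = size;
    this.outcomes = [];
  }

  // Record one outcome and drop the oldest once the window is full.
  record(outcome) {
    this.outcomes.push(outcome);
    if (this.outcomes.length > this.size) this.outcomes.shift();
  }

  // Share of the window that matches the given outcome label.
  rate(outcome) {
    if (this.outcomes.length === 0) return 0;
    const hits = this.outcomes.filter((o) => o === outcome).length;
    return hits / this.outcomes.length;
  }

  snapshot() {
    return {
      successRate: this.rate('ok'),
      rateLimitRate: this.rate('rate_limit'),
      captchaRate: this.rate('captcha'),
    };
  }
}
```

With an empty window, `successRate` is 0, so a scheduler using `nextDelay` would start at double the base delay and speed up only after it has seen enough successes. That cautious start is a side effect of this sketch, and arguably a reasonable default for a new domain.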

Layer 3: fetch layer

The fetch layer is where identity and rendering are handled.

For simple static pages, an HTTP fetch may be enough. For JavaScript-heavy pages, you may need browser rendering. For protected sites, you may need residential proxies.

The key is to hide that complexity behind one interface:

async function fetchPage(url, options = {}) {
  const result = await xcrawl.scrapeMarkdown(url, {
    render: options.render ?? false,
    country: options.country,
  });

  return {
    status: result.status,
    finalUrl: result.finalUrl,
    markdown: result.data?.markdown || '',
    html: result.data?.html || '',
    fetchedAt: new Date().toISOString(),
  };
}

Your business logic should not care whether the request used a proxy, a rendered browser, or a fallback route. It should receive a normalized response.

Layer 4: block detection

Before parsing, classify the response.

function detectBlock(response) {
  const text = `${response.markdown}\n${response.html}`.toLowerCase();

  if (response.status === 429) return 'rate_limit';
  if (response.status === 403) return 'forbidden';
  if (text.includes('captcha')) return 'captcha';
  if (text.includes('access denied')) return 'access_denied';
  if (text.includes('enable javascript')) return 'needs_rendering';
  if (response.markdown.length < 300) return 'too_little_content';

  return null;
}

This does not need to be perfect. It just needs to be good enough to route failures correctly.

For example:

async function fetchWithFallback(url) {
  let response = await fetchPage(url, { render: false });
  let block = detectBlock(response);

  if (block === 'needs_rendering' || block === 'too_little_content') {
    response = await fetchPage(url, { render: true });
    block = detectBlock(response);
  }

  if (block) {
    return { ok: false, block, response };
  }

  return { ok: true, response };
}

This is much better than blindly retrying the same blocked request three times.

Layer 5: raw response storage

Always store the raw response before parsing.

await db.raw_pages.insert({
  url,
  status: response.status,
  finalUrl: response.finalUrl,
  markdown: response.markdown,
  html: response.html,
  fetchedAt: response.fetchedAt,
});

This feels boring until the day the target site changes its layout.

If you stored raw pages, you can fix the parser and reprocess yesterday's data without hitting the site again. If you only stored extracted fields, the data is gone.

Raw storage also gives you evidence when debugging:

Was the page blocked?
Did the selector fail?
Did the price format change?
Did the site serve a different layout by region?
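Reprocessing can be as simple as running the fixed parser over the stored rows. A minimal sketch, assuming the raw pages have already been read into an array; `reprocess` and `parsePriceV2` are hypothetical names, and the "USD" scenario is an invented example of a format change.

```javascript
// Re-run a parser over stored raw pages without touching the target site.
// `rawPages` stands in for rows read from the raw page store.
function reprocess(rawPages, parse) {
  const results = { recovered: [], stillFailing: [] };
  for (const page of rawPages) {
    const record = parse(page);
    if (record) {
      results.recovered.push({ url: page.url, fetchedAt: page.fetchedAt, ...record });
    } else {
      results.stillFailing.push(page.url);
    }
  }
  return results;
}

// Hypothetical fixed parser: the old one only understood "$",
// this one also accepts a "USD" prefix.
function parsePriceV2(page) {
  const match = page.markdown.match(/(?:\$|USD)\s?([0-9,]+(?:\.\d{2})?)/i);
  return match ? { price: Number(match[1].replace(/,/g, '')) } : null;
}
```

The `stillFailing` list is the useful part: those are the pages worth opening by hand to see whether they were blocked or whether the layout changed again.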

Layer 6: parser

The parser should be defensive.

Bad parser:

const price = $('.price').text();

Better parser:

function extractPrice(markdown) {
  const patterns = [
    /\$\s?([0-9,]+(?:\.\d{2})?)/,
    /USD\s?([0-9,]+(?:\.\d{2})?)/i,
    /price[^0-9]{0,20}([0-9,]+(?:\.\d{2})?)/i,
  ];

  for (const pattern of patterns) {
    const match = markdown.match(pattern);
    if (match) {
      return Number(match[1].replace(/,/g, ''));
    }
  }

  return null;
}

Returning null is often better than throwing. The validation layer can decide whether missing data is acceptable.

Layer 7: validation

Validation is where you protect the database.

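function validateProduct(product) {
  // Each check is an independent yes/no signal about the record.
  const checks = {
    hasTitle: Boolean(product.title && product.title.length > 8),
    hasPrice: typeof product.price === 'number' && product.price > 0,
    hasUrl: Boolean(product.url && product.url.startsWith('https://')),
    // A missing price is already penalized by hasPrice, so it passes here.
    priceLooksReasonable: product.price == null || product.price < 10000,
  };

  // Confidence is the share of checks that passed.
  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;

  return {
    // With four checks, a 0.75 threshold tolerates exactly one failed check.
    valid: confidence >= 0.75,
    confidence,
    checks,
  };
}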

The database should only receive records that pass validation.

Everything else should go to a review queue:

if (!validation.valid) {
  await db.review_queue.insert({ url, product, validation, rawPageId });
  return;
}

await db.products.upsert(product);

This one layer prevents a huge class of silent data quality problems.

Layer 8: monitoring

Track the metrics that actually matter:

Effective scrape rate = valid records / total attempts

Also track:

success rate by domain
block rate by domain
validation failure rate
average response time
retry count per URL
rendering usage rate
cost per valid record

Requests per second is not the goal. Valid, fresh, usable data is the goal.

A slower scraper with a 96% valid-data rate is usually better than a fast scraper that silently returns garbage 40% of the time.
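The two headline numbers are simple ratios. A minimal sketch, where the `scrapeMetrics` helper, its field names, and the sample volumes and costs are assumptions chosen to mirror the 96% versus 40%-garbage comparison above:

```javascript
// Derive the two headline metrics from raw counters.
// Field names and sample numbers are illustrative assumptions.
function scrapeMetrics({ attempts, validRecords, totalCostUsd }) {
  return {
    effectiveScrapeRate: attempts > 0 ? validRecords / attempts : 0,
    // null when nothing valid was produced, since the cost is then undefined.
    costPerValidRecord: validRecords > 0 ? totalCostUsd / validRecords : null,
  };
}

const fast = scrapeMetrics({ attempts: 10000, validRecords: 6000, totalCostUsd: 50 });
const slow = scrapeMetrics({ attempts: 5000, validRecords: 4800, totalCostUsd: 30 });

console.log({ fast, slow });
```

In this made-up comparison the slow scraper sends half as many requests, yet each valid record costs less, because fewer attempts are wasted on garbage.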

A practical retry strategy

Retries should be based on failure type.

async function handleFailure(job, failure) {
  switch (failure.block) {
    case 'rate_limit':
      return requeue(job, { delayMs: 30 * 60 * 1000 });

    case 'captcha':
    case 'access_denied':
      return requeue(job, { delayMs: 2 * 60 * 60 * 1000, useNewIdentity: true });

    case 'needs_rendering':
      return requeue(job, { delayMs: 0, render: true });

    case 'too_little_content':
      return requeue(job, { delayMs: 10 * 60 * 1000, render: true });

    default:
      return markFailed(job, failure);
  }
}

Blind retries create more suspicious traffic. Typed retries reduce damage.
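Typed retries still need a budget, otherwise a permanently blocked URL cycles forever. Here is a sketch of what a `requeue` helper could decide before writing anything back to the queue. `planRequeue`, the `notBefore` and `fetchOptions` fields, and the default of three retries are assumptions for illustration.

```javascript
// Hypothetical requeue planner: applies the delay, carries fetch options
// forward, and gives up once the attempt budget is spent.
function planRequeue(job, options, maxRetries = 3, now = Date.now()) {
  const attempts = job.attempts + 1;
  if (attempts > maxRetries) {
    return { action: 'fail', job: { ...job, attempts } };
  }

  // Everything except the delay is treated as a fetch option (render, useNewIdentity).
  const { delayMs = 0, ...fetchOptions } = options;

  return {
    action: 'requeue',
    job: {
      ...job,
      attempts,
      notBefore: now + delayMs, // the scheduler should not lease the job before this time
      fetchOptions: { ...job.fetchOptions, ...fetchOptions },
    },
  };
}
```

Keeping this as a pure function that returns a plan makes the retry policy easy to unit test, separately from whatever queue actually stores the job.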

Production checklist

Before calling a scraper production-ready, I want these boxes checked:

[ ] per-domain pacing rules
[ ] proxy / identity strategy
[ ] rendering fallback for JavaScript pages
[ ] block detection
[ ] raw response storage
[ ] defensive parser
[ ] validation score
[ ] review queue for low-confidence records
[ ] dashboard / alerts
[ ] cost per valid record tracked

If any of these are missing, the system might still work, but you will not know why it fails when it fails.

Final thought

Scraping is not just parsing HTML. Production scraping is distributed systems work with adversarial constraints.

The winning architecture is not the one that sends the most requests. It is the one that turns the highest percentage of attempts into valid, explainable, reusable data.

If your scraper keeps getting blocked, do not start by adding another retry loop.

Start by asking: which layer of the architecture is missing?
