DEV Community

Charles

Why Your Web Scraper Gets Blocked: A Production Architecture That Survives

Most web scrapers do not fail because the CSS selector was wrong.

They fail because the system was designed as a script:

  1. Fetch a page
  2. Parse the HTML
  3. Retry if it fails
  4. Hope tomorrow looks like today

That works for demos. It breaks in production.

If you are scraping protected sites, blocking is not a random runtime event. It is a design constraint. You need an architecture that assumes failure, measures data quality, and adapts before the whole pipeline goes dark.

This is the architecture pattern I use now for production scraping systems.

The real reason scrapers get blocked

Most teams focus on the visible error:

  • 403 Forbidden
  • 429 Too Many Requests
  • CAPTCHA page
  • Cloudflare / Akamai / PerimeterX challenge
  • Empty HTML shell from a JavaScript app

But those are symptoms. The actual causes usually sit one layer deeper.

1. Your traffic identity looks wrong

A datacenter IP sending repeated requests to product pages, search pages, or profile pages does not look like normal human traffic.

Even if your headers are perfect, the IP reputation may already be enough to classify the request as automation.

Common signals include:

  • IP ASN / hosting provider
  • request velocity
  • repeated URL patterns
  • missing browser behavior
  • inconsistent geo / language / timezone hints
  • known proxy ranges

2. Your scraper has no pacing model

A human does not open 500 product pages in 90 seconds.

Many scrapers use a simple concurrency value like this:

const concurrency = 20;

That is easy to reason about, but it is not how websites experience your traffic. The site sees request bursts, repeated paths, and abnormal timing.

A production scraper needs pacing rules per domain, per route type, and sometimes per account/session.
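
As a rough illustration of pacing per domain and route type, the sketch below keeps a table of minimum gaps keyed by `domain:type` and reports how long a job should wait. The names (`pacingRules`, `createPacer`, `waitMsFor`, `markFetched`) and the numbers are assumptions made up for this example, not recommendations for any particular site.

```javascript
// Hypothetical pacing table: minimum gap between requests, keyed by "domain:routeType".
const pacingRules = {
  'example.com:product': { minGapMs: 3000 },
  'example.com:search': { minGapMs: 10000 },
};
const defaultRule = { minGapMs: 5000 };

function createPacer(rules, fallback) {
  const lastFetchAt = new Map(); // key -> timestamp (ms) of the last request

  return {
    // How long the caller should wait before fetching this domain/route.
    waitMsFor(domain, type, now = Date.now()) {
      const key = `${domain}:${type}`;
      const rule = rules[key] ?? fallback;
      const last = lastFetchAt.get(key);
      if (last === undefined) return 0; // nothing fetched yet for this key
      return Math.max(0, last + rule.minGapMs - now);
    },
    // Record that a request was just sent for this domain/route.
    markFetched(domain, type, now = Date.now()) {
      lastFetchAt.set(`${domain}:${type}`, now);
    },
  };
}
```

A worker would call `waitMsFor` before each fetch and `markFetched` right after sending it. Keying by route type lets search pages and product pages on the same domain run at different speeds.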

3. Your parser trusts successful responses

A 200 OK response can still be useless.

It might be:

  • a login page
  • a CAPTCHA page
  • a soft-block message
  • an empty JavaScript shell
  • a localized version missing the fields you expect
  • a page template that changed yesterday

If your pipeline treats every 200 as valid data, your database will slowly fill with garbage.

4. Your system cannot explain failures

A scraper that fails loudly is annoying.

A scraper that fails silently is expensive.

When the business asks, "Why did yesterday's pricing data drop by 40%?", you need more than "the script ran successfully." You need request logs, response samples, validation scores, and failure categories.
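
To make that concrete, here is a minimal sketch of grouping logged attempts by failure category so a drop in data can be traced to a cause. The log shape (`url`, `outcome`) and the outcome labels are assumptions for illustration only.

```javascript
// Counts non-valid attempts per category. Assumes each logged attempt carries
// an `outcome` string such as 'valid', 'captcha', or 'validation_failed'.
function summarizeFailures(attempts) {
  const byCategory = {};
  for (const attempt of attempts) {
    if (attempt.outcome === 'valid') continue;
    byCategory[attempt.outcome] = (byCategory[attempt.outcome] ?? 0) + 1;
  }
  const failed = Object.values(byCategory).reduce((sum, n) => sum + n, 0);
  return { total: attempts.length, failed, byCategory };
}

// Example log entries (made-up data).
const sampleAttempts = [
  { url: 'https://example.com/product/1', outcome: 'valid' },
  { url: 'https://example.com/product/2', outcome: 'captcha' },
  { url: 'https://example.com/product/3', outcome: 'captcha' },
  { url: 'https://example.com/product/4', outcome: 'validation_failed' },
];

console.log(summarizeFailures(sampleAttempts));
```

With a summary like this, "the data dropped" becomes "captcha responses doubled on one domain", which is something you can act on.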

The production architecture

Here is the high-level design:

URL Queue
   │
   ▼
Scheduler / Rate Policy
   │
   ▼
Fetch Layer ──► Residential proxy / browser rendering / retries
   │
   ▼
Raw Response Store
   │
   ▼
Parser
   │
   ▼
Validation Layer
   │
   ├── valid data ──► Database / API / export
   │
   └── invalid data ──► Review queue + alerting
   │
   ▼
Monitoring Dashboard

The important idea is separation of concerns.

Your parser should not manage proxies. Your retry logic should not decide whether a product is valid. Your database should not be the first place you discover a block page.

Each layer has one job.

Layer 1: URL queue

Do not throw thousands of URLs directly at a worker.

Use a queue with metadata:

const job = {
  url: 'https://example.com/product/123',
  domain: 'example.com',
  type: 'product',
  priority: 'normal',
  lastScrapedAt: null,
  attempts: 0,
};

That metadata allows you to make better decisions:

  • search pages can be paced more slowly than product pages
  • failed URLs can be retried later instead of immediately
  • high-value URLs can get better proxy/rendering options
  • stale URLs can be refreshed without scraping everything

A queue also makes the system restartable. If a worker crashes, the job is not lost.
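
One common way to get that restartability is a lease: a worker borrows a job for a limited time, and the job becomes available again if the worker never confirms completion. The sketch below is in-memory only and uses hypothetical names (`createQueue`, `lease`, `ack`); a real system would presumably keep the same state in a durable store.

```javascript
// Minimal in-memory sketch of a lease-based job queue.
function createQueue(leaseMs = 60000) {
  const jobs = new Map(); // id -> { job, leasedUntil }
  let nextId = 1;

  return {
    enqueue(job) {
      const id = nextId++;
      jobs.set(id, { job, leasedUntil: 0 });
      return id;
    },
    // Hand out the first job whose lease is absent or expired.
    lease(now = Date.now()) {
      for (const [id, entry] of jobs) {
        if (entry.leasedUntil <= now) {
          entry.leasedUntil = now + leaseMs;
          return { id, job: entry.job };
        }
      }
      return null;
    },
    // Remove the job only after the worker confirms it finished.
    ack(id) {
      return jobs.delete(id);
    },
    size() {
      return jobs.size;
    },
  };
}
```

If a worker crashes after `lease` and never calls `ack`, the lease expires and the next call to `lease` hands the same job to another worker.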

Layer 2: scheduler and pacing policy

This layer decides when a URL should be fetched.

A simple version:

const policy = {
  'example.com': {
    concurrency: 2,
    minDelayMs: 3000,
    maxDelayMs: 9000,
    maxRetries: 3,
  },
};

function randomDelay(min, max) {
  return min + Math.floor(Math.random() * (max - min));
}

A better version also tracks recent failures:

function shouldSlowDown(stats) {
  return stats.rateLimitRate > 0.05 || stats.captchaRate > 0.02;
}

function nextDelay(baseDelay, stats) {
  if (shouldSlowDown(stats)) return baseDelay * 4;
  if (stats.successRate > 0.95) return baseDelay;
  return baseDelay * 2;
}

The scheduler should adapt to what the site is telling you.

If the block rate rises, slow down. If validation starts failing, sample and inspect. Do not keep hammering the site with the same pattern.
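
The `stats` object passed to `shouldSlowDown` and `nextDelay` has to come from somewhere. One possible source, sketched here under assumed names, is a rolling window of recent outcomes per domain. The outcome labels (`'ok'`, `'rate_limit'`, `'captcha'`) and the default window size are illustrative assumptions.

```javascript
// Keeps the last `size` outcomes and derives the rates the scheduler reads.
function createRollingStats(size = 200) {
  const outcomes = []; // most recent outcome labels, oldest first

  // Share of the current window that has the given label.
  const share = (label) =>
    outcomes.filter((o) => o === label).length / outcomes.length;

  return {
    record(outcome) {
      outcomes.push(outcome);
      if (outcomes.length > size) outcomes.shift(); // drop the oldest entry
    },
    snapshot() {
      // With no samples yet, report a healthy state so the scheduler
      // does not slow down before it has any evidence.
      if (outcomes.length === 0) {
        return { sampleSize: 0, successRate: 1, rateLimitRate: 0, captchaRate: 0 };
      }
      return {
        sampleSize: outcomes.length,
        successRate: share('ok'),
        rateLimitRate: share('rate_limit'),
        captchaRate: share('captcha'),
      };
    },
  };
}
```

A worker would call `record` after every fetch and pass `snapshot()` into `nextDelay`. A window based on recent requests reacts faster than a daily average, at the cost of being noisier on low-volume domains.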

Layer 3: fetch layer

The fetch layer is where identity and rendering are handled.

For simple static pages, an HTTP fetch may be enough. For JavaScript-heavy pages, you may need browser rendering. For protected sites, you may need residential proxies.

The key is to hide that complexity behind one interface:

async function fetchPage(url, options = {}) {
  const result = await xcrawl.scrapeMarkdown(url, {
    render: options.render ?? false,
    country: options.country,
  });

  return {
    status: result.status,
    finalUrl: result.finalUrl,
    markdown: result.data?.markdown || '',
    html: result.data?.html || '',
    fetchedAt: new Date().toISOString(),
  };
}

Your business logic should not care whether the request used a proxy, a rendered browser, or a fallback route. It should receive a normalized response.

Layer 4: block detection

Before parsing, classify the response.

function detectBlock(response) {
  const text = `${response.markdown}\n${response.html}`.toLowerCase();

  if (response.status === 429) return 'rate_limit';
  if (response.status === 403) return 'forbidden';
  if (text.includes('captcha')) return 'captcha';
  if (text.includes('access denied')) return 'access_denied';
  if (text.includes('enable javascript')) return 'needs_rendering';
  if (response.markdown.length < 300) return 'too_little_content';

  return null;
}

This does not need to be perfect. It just needs to be good enough to route failures correctly.

For example:

async function fetchWithFallback(url) {
  let response = await fetchPage(url, { render: false });
  let block = detectBlock(response);

  if (block === 'needs_rendering' || block === 'too_little_content') {
    response = await fetchPage(url, { render: true });
    block = detectBlock(response);
  }

  if (block) {
    return { ok: false, block, response };
  }

  return { ok: true, response };
}

This is much better than blindly retrying the same blocked request three times.

Layer 5: raw response storage

Always store the raw response before parsing.

await db.raw_pages.insert({
  url,
  status: response.status,
  finalUrl: response.finalUrl,
  markdown: response.markdown,
  html: response.html,
  fetchedAt: response.fetchedAt,
});

This feels boring until the day the target site changes its layout.

If you stored raw pages, you can fix the parser and reprocess yesterday's data without hitting the site again. If you only stored extracted fields, the data is gone.

Raw storage also gives you evidence when debugging:

  • Was the page blocked?
  • Did the selector fail?
  • Did the price format change?
  • Did the site serve a different layout by region?

Layer 6: parser

The parser should be defensive.

Bad parser:

const price = $('.price').text();

Better parser:

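function extractPrice(markdown) {
  // Ordered from most to least specific. Each capture group must start with a
  // digit, so a stray "$," cannot match and be parsed as a price of 0.
  const patterns = [
    /\$\s?([0-9][0-9,]*(?:\.\d{2})?)/,
    /USD\s?([0-9][0-9,]*(?:\.\d{2})?)/i,
    /price[^0-9]{0,20}([0-9][0-9,]*(?:\.\d{2})?)/i,
  ];

  for (const pattern of patterns) {
    const match = markdown.match(pattern);
    if (match) {
      // Strip thousands separators before converting to a number.
      return Number(match[1].replace(/,/g, ''));
    }
  }

  // No pattern matched: let the validation layer decide what to do.
  return null;
}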

Returning null is often better than throwing. The validation layer can decide whether missing data is acceptable.

Layer 7: validation

Validation is where you protect the database.

function validateProduct(product) {
  // Each check is an independent pass/fail signal.
  const checks = {
    hasTitle: Boolean(product.title && product.title.length > 8),
    hasPrice: typeof product.price === 'number' && product.price > 0,
    hasUrl: Boolean(product.url && product.url.startsWith('https://')),
    priceLooksReasonable: product.price == null || product.price < 10000,
  };

  // Confidence is the share of checks that passed.
  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;

  return {
    // With four checks, a 0.75 threshold tolerates any single failed check,
    // including a missing price. Raise it if a field is mandatory.
    valid: confidence >= 0.75,
    confidence,
    checks,
  };
}

The database should only receive records that pass validation.

Everything else should go to a review queue:

if (!validation.valid) {
  await db.review_queue.insert({ url, product, validation, rawPageId });
  return;
}

await db.products.upsert(product);

This one layer prevents a huge class of silent data quality problems.

Layer 8: monitoring

Track the metrics that actually matter:

Effective scrape rate = valid records / total attempts

Also track:

  • success rate by domain
  • block rate by domain
  • validation failure rate
  • average response time
  • retry count per URL
  • rendering usage rate
  • cost per valid record

Requests per second is not the goal. Valid, fresh, usable data is the goal.

A slower scraper with a 96% valid-data rate is usually better than a fast scraper that silently returns garbage 40% of the time.
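
These numbers fall out of a few counters. The sketch below shows one way to derive them; the counter names (`attempts`, `validRecords`, `blocked`, `totalCostUsd`) are assumptions, and how cost is measured depends on your proxy and rendering pricing.

```javascript
// Derives headline metrics from raw counters for one domain or one run.
function scrapeMetrics({ attempts, validRecords, blocked, totalCostUsd }) {
  return {
    // valid records / total attempts, as defined above
    effectiveScrapeRate: attempts === 0 ? 0 : validRecords / attempts,
    blockRate: attempts === 0 ? 0 : blocked / attempts,
    // null when there is nothing valid to divide the cost by
    costPerValidRecord: validRecords === 0 ? null : totalCostUsd / validRecords,
  };
}

// Example with made-up counters.
console.log(
  scrapeMetrics({ attempts: 1000, validRecords: 960, blocked: 25, totalCostUsd: 48 })
);
```

Dividing cost by valid records rather than by requests makes expensive retries and rendering show up in the number that matters.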

A practical retry strategy

Retries should be based on failure type.

async function handleFailure(job, failure) {
  switch (failure.block) {
    case 'rate_limit':
      return requeue(job, { delayMs: 30 * 60 * 1000 });

    case 'captcha':
    case 'access_denied':
      return requeue(job, { delayMs: 2 * 60 * 60 * 1000, useNewIdentity: true });

    case 'needs_rendering':
      return requeue(job, { delayMs: 0, render: true });

    case 'too_little_content':
      return requeue(job, { delayMs: 10 * 60 * 1000, render: true });

    default:
      return markFailed(job, failure);
  }
}

Blind retries create more suspicious traffic. Typed retries reduce damage.
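
The `requeue` and `markFailed` helpers above are left undefined. As one possible shape for the requeue side, the sketch below computes the next state of a job and gives up once a retry cap is reached. `planRequeue`, the `notBefore` field, and the default cap of 3 are assumptions for illustration.

```javascript
// Returns what should happen to a failed job instead of touching a real queue,
// so the result can be applied to any queue backend.
function planRequeue(job, options, maxRetries = 3, now = Date.now()) {
  const { delayMs = 0, render = false, useNewIdentity = false } = options;
  const attempts = job.attempts + 1; // count the attempt that just failed

  if (attempts > maxRetries) {
    return { action: 'fail', job: { ...job, attempts } };
  }

  return {
    action: 'requeue',
    job: {
      ...job,
      attempts,
      render,
      useNewIdentity,
      notBefore: now + delayMs, // earliest time the scheduler may pick it up
    },
  };
}
```

Capping attempts matters here because typed retries with long delays can otherwise keep a permanently blocked URL cycling through the queue indefinitely.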

Production checklist

Before calling a scraper production-ready, I want these boxes checked:

  • [ ] per-domain pacing rules
  • [ ] proxy / identity strategy
  • [ ] rendering fallback for JavaScript pages
  • [ ] block detection
  • [ ] raw response storage
  • [ ] defensive parser
  • [ ] validation score
  • [ ] review queue for low-confidence records
  • [ ] dashboard / alerts
  • [ ] cost per valid record tracked

If any of these are missing, the system might still work, but you will not know why it fails when it fails.

Final thought

Scraping is not just parsing HTML. Production scraping is distributed systems work with adversarial constraints.

The winning architecture is not the one that sends the most requests. It is the one that turns the highest percentage of attempts into valid, explainable, reusable data.

If your scraper keeps getting blocked, do not start by adding another retry loop.

Start by asking: which layer of the architecture is missing?
