What actually happens when your scraping pipeline hits 100,000 pages

Most scraping projects start small. A few hundred pages, a quick script, maybe an LLM call to parse the HTML. It works fine. Then the scope grows.

At 100,000 pages per month, the decisions you made at 1,000 pages start costing real money and real engineering time. This post walks through what that scaling curve actually looks like, and where the hidden costs appear.

The token problem nobody budgets for

When you pass a full HTML page to an LLM for extraction, you are not passing a few paragraphs. Real pages across job listings, ecommerce, real estate, and review sites average 572,739 tokens per page, measured after DOM rendering and whitespace cleanup.

At that size, here is what a single page costs across common models:

Model	Cost per page (full HTML)
GPT-5 nano	$0.029
GPT-4o-mini	$0.086
GPT-5 mini	$0.145
Claude Haiku 4.5	$0.584
GPT-4o	$1.330
Claude Sonnet 4.6	$1.733

At 100,000 pages per month, GPT-5 nano costs roughly $2,900. GPT-4o-mini costs $8,600. Claude Sonnet 4.6 costs $173,300.
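
The monthly figures are just the per-page cost multiplied by volume. A minimal sketch of that arithmetic, using the per-page costs from the table above (the function name monthly_cost is illustrative, not part of any tool mentioned here):

```python
# Per-page costs in USD for full HTML, copied from the table above.
COST_PER_PAGE = {
    "GPT-5 nano": 0.029,
    "GPT-4o-mini": 0.086,
    "GPT-5 mini": 0.145,
    "Claude Haiku 4.5": 0.584,
    "GPT-4o": 1.330,
    "Claude Sonnet 4.6": 1.733,
}


def monthly_cost(cost_per_page, pages_per_month):
    """Return the monthly spend in USD, rounded to the nearest cent."""
    return round(cost_per_page * pages_per_month, 2)


if __name__ == "__main__":
    # Print the monthly bill for every model at 100,000 pages.
    for model, cost in COST_PER_PAGE.items():
        print(f"{model:<18} ${monthly_cost(cost, 100_000):>12,.2f}")
```

Changing the second argument shows how quickly the spread between models grows as volume rises.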

You can strip the HTML first to reduce tokens. Stripped pages average around 38,965 tokens. That brings GPT-5 nano down to $0.0024 per page, or $240 for 100,000 pages. But stripping HTML requires preprocessing work, and you risk removing markup that contains the data you actually need. There is no clean middle ground.
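
To make the stripping risk concrete, here is a minimal sketch using only Python's standard-library html.parser. The sample markup and the helper names (TextOnly, strip_html) are made up for illustration; real pipelines often use heavier tooling. Note how the price, which lives in an attribute, disappears along with the tags.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded here, which is where data can get lost.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the text content of an HTML string, joined with single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical product card: the price exists only as a data attribute.
sample = (
    '<div class="card" data-price="19.99">'
    "<style>.card{color:red}</style>"
    "<h2>Blue Mug</h2><span>In stock</span></div>"
)
stripped = strip_html(sample)
print(stripped)
```

The stripped text is far smaller than the markup, but the price is gone, so an LLM reading it can only omit the field or guess.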

Flat cost regardless of page size

Minexa.ai is a Chrome extension-based scraper training tool that connects to an API for batch extraction. You train a scraper once by selecting an HTML container on a page in the extension. Minexa identifies all data points inside that container automatically. The result is a scraper_id you reference in every subsequent API call.

The extraction cost is credit-based, not token-based. A page costs the same whether it is 10,000 tokens or 600,000 tokens. At 100,000 pages per month, the Startup plan at $60 covers up to 120,000 pages. The Business plan at $500 covers up to 2,000,000.
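
A short sketch of what flat pricing means in practice, using only the two plans quoted above (any other plans are omitted, and the helper names are illustrative):

```python
# (name, monthly price in USD, pages covered), as quoted in the text above.
PLANS = [
    ("Startup", 60, 120_000),
    ("Business", 500, 2_000_000),
]


def cheapest_plan(pages_per_month):
    """Return (name, price) of the cheapest listed plan covering the volume, or None."""
    fitting = [(price, name) for name, price, limit in PLANS if pages_per_month <= limit]
    if not fitting:
        return None
    price, name = min(fitting)
    return name, price


def effective_cost_per_page(pages_per_month):
    """Return the plan price divided by the pages actually scraped, or None."""
    plan = cheapest_plan(pages_per_month)
    if plan is None or pages_per_month <= 0:
        return None
    return plan[1] / pages_per_month


if __name__ == "__main__":
    print(cheapest_plan(100_000))            # the Startup plan covers this volume
    print(effective_cost_per_page(100_000))  # USD per page, independent of page size
```

Page size never enters the calculation, which is the point: the only inputs are the plan and the page count.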

What the API request looks like

After training a scraper in the extension, you click 'API Request' to get pre-generated Python code. The core request structure looks like this:

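import requests

url = 'https://api.minexa.ai/data/'
api_key = 'YOUR_API_KEY'

data = {
    'batches': [{
        # ID of the scraper trained in the Chrome extension
        'scraper_id': 6241,
        # top_N notation, or a list of explicit column names
        'columns': ['top_30'],
        'urls': [
            'https://example.com/listing/1',
            'https://example.com/listing/2'
        ],
        # Crawl settings, as pre-generated by the extension
        'scraping': {
            'js_render': True,
            'timeout': 30,
            'js_code': [
                {'wait_time': 2},
                {'page_init': True},
                {'wait_time': 4}
            ],
            'proxy': 'verified',
            'retry': 3
        }
    }],
    # Number of URLs processed in parallel, up to your plan limit
    'threads': 8
}

headers = {'Content-Type': 'application/json', 'api-key': api_key}
response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # surface HTTP errors before parsing the body
print(response.json())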

The columns parameter accepts top_N notation to return the top N ranked fields automatically, or explicit column names if you want specific fields only. The threads parameter controls how many URLs are processed in parallel, up to your plan limit. Up to 50,000 URLs can be submitted in a single batch request.
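
If a URL list is longer than the 50,000 limit, it has to be split across requests. The sketch below builds the request bodies without sending them. It assumes the limit applies per request with a single batch each, and the helper names (chunked, build_payloads) are illustrative rather than part of the Minexa API.

```python
MAX_URLS_PER_REQUEST = 50_000  # limit quoted in the text above


def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payloads(scraper_id, urls, columns=("top_30",), threads=8, scraping=None):
    """Return one request body per chunk of at most MAX_URLS_PER_REQUEST URLs."""
    payloads = []
    for chunk in chunked(list(urls), MAX_URLS_PER_REQUEST):
        batch = {
            "scraper_id": scraper_id,
            "columns": list(columns),
            "urls": chunk,
        }
        if scraping is not None:
            # Pass through the same 'scraping' block the extension generates.
            batch["scraping"] = scraping
        payloads.append({"batches": [batch], "threads": threads})
    return payloads


if __name__ == "__main__":
    urls = [f"https://example.com/listing/{i}" for i in range(120_000)]
    for body in build_payloads(6241, urls):
        print(len(body["batches"][0]["urls"]))
```

Each returned body can then be posted the same way as the request shown earlier.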

If you already have HTML stored on S3 or CloudFront, you can pass file_urls pointing to those files and set js_render: false. This is the cheapest scraping configuration since no live crawling is needed.
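
A rough sketch of what such a batch might look like. The text only names file_urls and js_render, so the placement of file_urls inside the batch (in place of urls) is an assumption, and the bucket URLs are hypothetical. Check the generated code from the extension or the API reference before relying on this shape.

```python
def stored_html_batch(scraper_id, file_urls, columns=("top_30",)):
    """Build a batch that points at pre-fetched HTML files instead of live pages."""
    return {
        "scraper_id": scraper_id,
        "columns": list(columns),
        # Assumption: 'file_urls' takes the place of 'urls' inside the batch.
        "file_urls": list(file_urls),
        # No live crawling, so JavaScript rendering is switched off.
        "scraping": {"js_render": False},
    }


# Hypothetical storage locations, for illustration only.
batch = stored_html_batch(
    6241,
    [
        "https://my-bucket.s3.amazonaws.com/pages/listing-1.html",
        "https://my-bucket.s3.amazonaws.com/pages/listing-2.html",
    ],
)
data = {"batches": [batch], "threads": 8}
```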

What happens when a site redesigns

This is where most scraping pipelines require the most maintenance. With selector-based scrapers, a layout change can silently return wrong values for weeks before anyone notices.

With Minexa, a structural mismatch produces an explicit error or null values rather than a plausible-looking wrong value. When that happens, you open the updated page in the extension, select the new container, and create a new scraper. This takes 2 to 5 minutes, the same as the original setup.

The only required code change is updating the scraper_id in your request body and checking whether any column names you rely on have shifted.
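
One way to run that column check after switching scraper_id is sketched below. It assumes you have already parsed the API response into a list of row dictionaries; the actual response shape is not described here, so the rows and the function name check_rows are illustrative.

```python
def check_rows(rows, expected_columns):
    """Return (missing, all_null): columns absent from every row, and columns
    present but null in every row."""
    seen = set()
    for row in rows:
        seen.update(row)
    missing = [col for col in expected_columns if col not in seen]
    all_null = [
        col for col in expected_columns
        if col in seen and all(row.get(col) is None for row in rows)
    ]
    return missing, all_null


if __name__ == "__main__":
    # Hypothetical rows from a small test batch run against the new scraper.
    rows = [
        {"title": "Listing A", "price": None},
        {"title": "Listing B", "price": None},
    ]
    missing, all_null = check_rows(rows, ["title", "price", "rating"])
    print("missing:", missing)
    print("all null:", all_null)
```

Running this on a small test batch before switching production traffic flags renamed columns (missing) and columns that stopped matching the page (all null).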

Try it yourself: Install the Minexa Chrome extension and train your first scraper in one session.

The reliability cost that does not show up in token pricing

LLM extraction pipelines at scale require validation logic. Field mapping errors, swapped values between visually similar fields, and fabricated defaults when a value is missing all produce rows that look correct but are not.
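
As an illustration of the kind of validation this implies, the sketch below flags any extracted value that does not appear verbatim in the page text, which is a cheap way to catch fabricated defaults. The record, page text, and function name are hypothetical, and the check does not catch swapped values, since both values are genuinely on the page.

```python
def unsupported_fields(record, page_text):
    """Return the names of fields whose non-null value is not found in the page text."""
    return [
        field
        for field, value in record.items()
        if value is not None and str(value) not in page_text
    ]


if __name__ == "__main__":
    # Hypothetical page text and LLM output for one product page.
    page_text = "Blue Mug - $19.99 - In stock"
    record = {"title": "Blue Mug", "price": "$19.99", "rating": "4.5", "sku": None}
    print(unsupported_fields(record, page_text))
```

Here the rating would be flagged because no such value exists on the page, while the null sku is left alone.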

Minexa binds each column to a specific DOM element identified during training. The same field always maps to the same element. If the element is absent, the output is null. No value is invented.

At 100,000 pages, even a 1% error rate means 1,000 rows requiring manual review or retry logic. That indirect cost does not appear in any token pricing table.

Practical takeaway

If your extraction volume is below roughly 10,000 pages per month and you are already using stripped HTML, the cheapest LLMs are competitive on price. Above that threshold, or any time you are working with full HTML, the cost gap widens sharply and does not close.

The engineering overhead also compounds at scale. Prompt maintenance, schema drift, validation pipelines, and retry logic all grow with volume. A scraper trained once in the Minexa extension and called via API does not require any of that.

Full API documentation is at minexa.stoplight.io/docs/minexa if you want to explore the complete parameter reference before starting.