Minexa cost breakdown: when LLM-based extraction stops making financial sense

Most developers building data pipelines start with an LLM. It feels like the obvious move: pass the HTML, describe what you want, get structured JSON back. It works fine at low volume. Then the bills arrive.

This article breaks down exactly where the math stops working, using real token counts from real pages across six content categories.

The token problem nobody talks about upfront

When you pass a page to an LLM for extraction, you are paying for every token in that HTML. The average full HTML page, with whitespace cleaned but otherwise intact, runs around 572,000 tokens. That is not a worst-case number. That is the average across job listings, ecommerce pages, property results, review pages, search results, and hotel booking pages.

You have two options:

Strip the HTML down to DOM tags and text only, which brings the average to roughly 39,000 tokens. Cheaper, but you risk removing markup that contains the data you need, and it requires preprocessing logic you now have to maintain.
Pass full HTML and pay for every token. Safe and low-maintenance, but the cost per page becomes significant fast.

There is no clean middle ground. A context cap saves money but silently truncates pages mid-content with no error signal.
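
To make the stripping trade-off concrete, here is a minimal sketch of the "DOM tags and text only" option using Python's standard library. The list of skipped tags and the four-characters-per-token estimate are assumptions for illustration, not measured values; a real tokenizer will give different counts.

```python
from html.parser import HTMLParser

# Assumption: subtrees under these tags never hold data you want. Adjust per site.
SKIP_TAGS = {"script", "style", "noscript", "svg", "template"}


class TagTextStripper(HTMLParser):
    """Keeps bare tags and visible text; drops attributes, comments, skipped subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are discarded here

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def strip_html(html):
    parser = TagTextStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def estimate_tokens(text, chars_per_token=4.0):
    # Rough heuristic only (assumption); use the model's tokenizer for billing math.
    return int(len(text) / chars_per_token)


sample = ('<div class="card" data-price="19.99"><script>var x=1;</script>'
          '<p>Blue mug</p><!-- promo --></div>')
stripped = strip_html(sample)
print(stripped)
print(estimate_tokens(sample), "->", estimate_tokens(stripped))
```

Note that the price in this sample lives in a data-price attribute, so the stripped output loses it. That is the risk described above: the preprocessing is cheap to write and easy to get subtly wrong.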

What the numbers actually look like

Here is the cost to process 120,000 pages per month using stripped HTML (39k tokens/page):

Model or plan	Monthly cost (USD)
GPT-5 nano	$285
GPT-4o-mini	$773
GPT-5 mini	$1,410
Claude Haiku 4.5	$5,280
Claude Sonnet 4.6	$15,820
Minexa Startup	$60

Switch to full HTML and the input grows by roughly 15x (572k versus 39k tokens per page), pushing every LLM figure above up by an order of magnitude. GPT-5 nano goes from $285 to $3,480, about 12x. Claude Sonnet 4.6 goes from $15,820 to $207,980, about 13x. Minexa stays at $60 because its pricing is per page, not per token.
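
The per-page economics fall straight out of the table. The sketch below only rearranges the figures quoted in this article; the "implied rate" is a blended number covering input and output tokens together, so it should not be read as any provider's published price.

```python
PAGES_PER_MONTH = 120_000
STRIPPED_TOKENS_PER_PAGE = 39_000
FULL_TOKENS_PER_PAGE = 572_000

# Monthly costs in USD for stripped HTML, copied from the table in this article.
MONTHLY_COST_STRIPPED = {
    "GPT-5 nano": 285,
    "GPT-4o-mini": 773,
    "GPT-5 mini": 1_410,
    "Claude Haiku 4.5": 5_280,
    "Claude Sonnet 4.6": 15_820,
    "Minexa Startup": 60,
}


def cost_per_page(monthly_cost, pages=PAGES_PER_MONTH):
    """Effective USD per page at the quoted volume."""
    return monthly_cost / pages


def implied_rate_per_million_tokens(monthly_cost, pages=PAGES_PER_MONTH,
                                    tokens_per_page=STRIPPED_TOKENS_PER_PAGE):
    """Blended USD per million page tokens implied by a monthly bill."""
    total_tokens = pages * tokens_per_page
    return monthly_cost / (total_tokens / 1_000_000)


for name, cost in MONTHLY_COST_STRIPPED.items():
    print(f"{name}: ${cost_per_page(cost):.6f} per page")

# Ratio of full-HTML input to stripped input, per page.
input_ratio = FULL_TOKENS_PER_PAGE / STRIPPED_TOKENS_PER_PAGE
print(f"Full HTML carries {input_ratio:.2f}x the tokens of stripped HTML")
```

Expressed per page, the cheapest LLM in the table works out to a little under a quarter of a cent, against a twentieth of a cent on the Startup plan.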

The only range where LLMs are competitive

Below roughly 10,000 pages per month on stripped HTML, the cheapest nano-class models (GPT-5 nano, GPT-4.1 nano, Gemini Flash Lite) land between $24 and $43. Minexa Personal is a flat $15 per month for 10,000 credits, so even here Minexa is cheaper. But the gap is small enough that tooling familiarity might reasonably win the decision.

Beyond that threshold, the gap widens fast and does not close.

The costs that do not show up in token pricing

Token cost is only part of the picture. LLM extraction pipelines carry indirect costs that compound at volume:

Validation overhead. LLMs can return plausible but wrong values without any error signal. A job listing with salary, equity, and bonus in similar formats might come back with equity assigned to the salary field. At 100,000 pages, an error rate of just a few percent translates to thousands of rows requiring downstream validation logic or manual review.

Retry logic. Inconsistent JSON field naming across responses, occasional fabricated values, and schema drift all require retry and normalization code. That is engineering time spent on infrastructure rather than the actual product.

Prompt maintenance. When a target site updates its layout, LLM prompts may need adjustment to keep extraction accurate. Nothing signals that need, so it often goes unnoticed until data quality has already degraded.

Minexa's DOM-based extraction returns null when a field is absent and raises an explicit error when a page does not match the trained scraper. There is no silent failure mode.
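
For a sense of what the validation and normalization layer looks like on the LLM side, here is a minimal sketch. The field names, the alias table, and the "value must appear in the page text" rule are all hypothetical choices for a job-listing example, not part of any real API.

```python
import re

# Hypothetical canonical fields and the key variants an LLM might emit for them.
FIELD_ALIASES = {
    "salary": {"salary", "base_salary", "pay", "compensation"},
    "equity": {"equity", "stock", "equity_grant"},
    "bonus": {"bonus", "annual_bonus"},
}


def normalize_keys(record):
    """Maps drifting key names onto canonical fields; unknown keys are dropped."""
    lookup = {alias: field
              for field, aliases in FIELD_ALIASES.items()
              for alias in aliases}
    normalized = {field: None for field in FIELD_ALIASES}
    for key, value in record.items():
        slug = re.sub(r"[^a-z0-9]+", "_", key.lower()).strip("_")
        canonical = lookup.get(slug)
        if canonical is not None:
            normalized[canonical] = value
    return normalized


def value_is_grounded(value, source_text):
    """A value that never appears in the page text is suspect (possibly fabricated)."""
    return value is None or str(value) in source_text


def fields_needing_review(record, source_text):
    clean = normalize_keys(record)
    return [field for field, value in clean.items()
            if not value_is_grounded(value, source_text)]


page_text = "Salary: $120,000. Equity: 0.5%. Bonus: $10,000."
llm_output = {"Base Salary": "$120,000", "stock": "0.5%", "annual-bonus": "$15,000"}
print(normalize_keys(llm_output))
print(fields_needing_review(llm_output, page_text))
```

A check like this flags the fabricated bonus, but it cannot catch the swapped-field case: an equity figure placed in the salary field still appears in the page text. Closing that gap takes per-field format rules or manual review, which is where the overhead accumulates.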

Three real scenarios

Ecommerce price monitoring (~80,000 pages/month). Using GPT-5 nano on stripped HTML with a 20% retry overhead: approximately $230/month. After training a single Minexa scraper on the product page structure: $60/month on the Startup plan. Price fields are pulled from the DOM element bound during training, and a page that no longer matches the scraper raises an explicit error instead of returning a wrong value.

Real estate listings (~200,000 pages/month, full HTML). GPT-5 mini on full HTML: over $29,000/month. Minexa Business plan: $500/month. The LLM had been occasionally swapping asking price and last sale price; Minexa eliminated the issue by binding each column to its specific DOM element.

Lead generation from directories (~50,000 pages/month). Mistral Small 2 on stripped HTML: approximately $485/month. Minexa Startup: $60/month. Inconsistent JSON field naming across LLM responses required a normalization step that was eliminated entirely after switching.
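
The ecommerce figure can be reproduced from the 120,000-page table by scaling linearly with volume and adding the retry overhead. The sketch below assumes cost is proportional to page count, which ignores any volume discounts or caching; the scenario figures are the ones quoted above.

```python
def scaled_cost(baseline_cost, baseline_pages, pages, retry_overhead=0.0):
    """Scales a known monthly cost linearly with page volume, then adds retries."""
    return baseline_cost * (pages / baseline_pages) * (1 + retry_overhead)


# GPT-5 nano on stripped HTML: $285 for 120,000 pages, rescaled to 80,000 pages
# with 20% of pages processed a second time.
ecommerce_llm = scaled_cost(285, 120_000, 80_000, retry_overhead=0.20)
print(f"Ecommerce LLM estimate: ${ecommerce_llm:.0f}/month")

# (LLM monthly cost, Minexa monthly cost) in USD as quoted in the scenarios.
# The real estate LLM figure is a lower bound ("over $29,000").
SCENARIOS = {
    "Ecommerce price monitoring": (230, 60),
    "Real estate listings": (29_000, 500),
    "Lead generation from directories": (485, 60),
}

for name, (llm_cost, minexa_cost) in SCENARIOS.items():
    print(f"{name}: LLM costs about {llm_cost / minexa_cost:.1f}x the Minexa plan")
```

The estimate comes out at $228, which matches the "approximately $230" quoted for the first scenario.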

How Minexa's pricing works

Minexa charges per page extracted, not per token. Page size does not affect credit consumption. A 600,000-token page consumes the same credits as a 20,000-token page.

The three plans:

Personal ($15/month): 10,000 credits, 3 threads
Startup ($60/month): 120,000 credits, 10 threads
Business ($500/month): 2,000,000 credits, 100 threads

Note: credit consumption can be higher for pages requiring JavaScript rendering or aggressive anti-bot handling. The baseline figures above apply to standard pages.
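
Choosing a plan is a lookup against the credit limits. The sketch below assumes one credit per standard page, and the credits_per_page parameter is a hypothetical knob for modelling the heavier pages mentioned in the note; the actual multipliers are not stated here.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    name: str
    monthly_usd: int
    credits: int
    threads: int


# Plan figures as listed in this article.
PLANS = [
    Plan("Personal", 15, 10_000, 3),
    Plan("Startup", 60, 120_000, 10),
    Plan("Business", 500, 2_000_000, 100),
]


def cheapest_plan(pages_per_month: int, credits_per_page: float = 1.0) -> Optional[Plan]:
    """Returns the lowest-priced plan whose credits cover the volume, or None."""
    needed = pages_per_month * credits_per_page
    fitting = [plan for plan in PLANS if plan.credits >= needed]
    return min(fitting, key=lambda plan: plan.monthly_usd) if fitting else None


for volume in (10_000, 80_000, 200_000):
    plan = cheapest_plan(volume)
    print(f"{volume:>7} pages/month -> {plan.name} (${plan.monthly_usd})")

# Hypothetical: if heavier pages consumed 2 credits each, 80,000 pages would
# exceed the Startup allowance and push the workload to Business.
print(cheapest_plan(80_000, credits_per_page=2.0).name)
```

Running the three scenario volumes through it returns the same plans quoted earlier: Startup for 80,000 and 50,000 pages, Business for 200,000.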

Getting started

Training a scraper takes 2 to 5 minutes via the Minexa Chrome extension. You hover over the HTML container holding your target data, confirm the selection, and Minexa generates a reusable scraper with a stable scraper_id. From there, extraction runs through the API with a straightforward POST request:

{
  "batches": [{
    "scraper_id": 6241,
    "columns": ["top_30"],
    "urls": ["https://example.com/listing/1"],
    "scraping": {"js_render": true, "proxy": "verified"}
  }],
  "threads": 5
}

The extension generates this code for you after training. You update the URLs and run it.
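
If you would rather assemble the request by hand, the body above maps onto a few lines of Python. This is a sketch only: the endpoint URL, the API key, and the bearer-token header are placeholders invented for illustration, so take the real values from the API documentation or from the code the extension generates.

```python
import json
import urllib.request

API_URL = "https://api.example.com/extract"  # placeholder, not the real endpoint
API_KEY = "YOUR_API_KEY"                     # placeholder credential


def build_payload(scraper_id, urls, columns, threads=5):
    """Builds the request body shown above for a single batch."""
    return {
        "batches": [{
            "scraper_id": scraper_id,
            "columns": columns,
            "urls": urls,
            "scraping": {"js_render": True, "proxy": "verified"},
        }],
        "threads": threads,
    }


def build_request(payload):
    """Wraps the payload in a JSON POST request (auth header is an assumption)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


payload = build_payload(6241, ["https://example.com/listing/1"], ["top_30"])
request = build_request(payload)
print(request.get_method(), request.full_url)

# Sending is left out because the endpoint above is a placeholder:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.load(response)
```

Keeping payload construction separate from sending makes it easy to swap in a new list of URLs per batch without touching the request plumbing.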

If you are currently running an LLM extraction pipeline above 10,000 pages per month, the full API documentation has everything needed to evaluate a migration.