A pattern has become surprisingly common in backend and data engineering work: someone needs structured data from a website, reaches for an LLM API, feeds it raw HTML, and calls it done. It works well enough at small scale. Then the bill arrives, or the extracted fields start drifting, and the whole thing needs rethinking.
Using an LLM to parse HTML is not wrong by default. But it is often the wrong tool chosen for the wrong reason — convenience at prototype stage, not fitness for production.
What actually goes wrong with LLM extraction at scale
The issues are not dramatic. They are quiet and cumulative.
A product page shows a sale price and a crossed-out original price. Both look like prices to a language model. Depending on surrounding text, token context, and model temperature, the model may return either one under sale_price. It will not tell you it is uncertain. You find out during a data audit three weeks later.
A clinical trials page has four date fields. An LLM assigns the wrong date to a label in roughly one row per hundred, sometimes more, because the values are structurally similar and the model picks based on proximity rather than DOM position. At 50,000 pages, that is hundreds of silently wrong rows.
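To put a number on that, here is the arithmetic as a quick sketch. The one-in-a-hundred rate is treated as an assumed floor for illustration, not a measured figure.

```python
# Illustrative arithmetic only: the error rate is an assumed lower bound.
pages = 50_000
rows_per_error = 100  # "once per hundred rows", taken as the floor

expected_wrong_rows = pages // rows_per_error
print(expected_wrong_rows)  # 500
```

None of those 500 rows would carry a flag, which is why they tend to surface only in an audit.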
This is not a criticism of LLMs as a category. It is a description of what probabilistic text generation does when applied to a task that requires structural precision.
The alternative: train once, extract deterministically
The Minexa.ai API takes a different approach. You train a scraper once using the browser extension — point at the HTML container holding your data, confirm the selection, and Minexa generates a reusable scraper with a stable scraper_id. That scraper is backed by consolidated DOM selectors, not prompts. Every field maps to a specific element. Running the same scraper on the same page always produces identical JSON.
Once trained, you call the API with your URLs:
data = {
    "batches": [{
        "scraper_id": 4821,
        "columns": ["top_30"],
        "urls": ["https://example.com/listing/9981"],
        "scraping": {
            "js_render": True,
            "proxy": "verified",
            "timeout": 30,
            "js_code": [
                {"wait_time": 2},
                {"page_init": True},
                {"wait_time": 4}
            ],
            "retry": 3
        }
    }],
    "threads": 5
}
The columns parameter accepts either a named list like ["price", "availability", "brand"] or a top_n shorthand. "top_30" returns the 30 columns that Minexa's relevance algorithm ranks highest. The ranking is deterministic, so top_30 always maps to the same 30 fields, which makes it safe for production without locking in a schema upfront.
Up to 50,000 URLs can go into a single batch request. The threads value controls parallel processing up to your plan's limit.
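If your URL list is longer than that limit, one way to handle it on the client side is to split the list and build one request body per slice. The sketch below is a minimal illustration, not Minexa's own client code: chunk and build_payloads are hypothetical helper names, and it assumes the 50,000 limit applies to each request body. Sending the request is left out because the endpoint and authentication are not covered in this article.

```python
from typing import Dict, Iterator, List

MAX_URLS_PER_REQUEST = 50_000  # batch limit quoted in the text


def chunk(urls: List[str], size: int = MAX_URLS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_payloads(scraper_id: int, urls: List[str], columns: List[str],
                   threads: int = 5) -> List[Dict]:
    """Build one request body per slice, mirroring the example request."""
    payloads = []
    for part in chunk(urls):
        payloads.append({
            "batches": [{
                "scraper_id": scraper_id,
                "columns": columns,
                "urls": part,
                # Scraping options copied from the example request.
                "scraping": {
                    "js_render": True,
                    "proxy": "verified",
                    "timeout": 30,
                    "js_code": [{"wait_time": 2}, {"page_init": True}, {"wait_time": 4}],
                    "retry": 3,
                },
            }],
            "threads": threads,
        })
    return payloads


urls = [f"https://example.com/listing/{i}" for i in range(120_000)]
payloads = build_payloads(4821, urls, ["top_30"])
sizes = [len(p["batches"][0]["urls"]) for p in payloads]
print(len(payloads), sizes)  # 3 [50000, 50000, 20000]
```

Passing a named list such as ["price", "availability", "brand"] instead of ["top_30"] works the same way, since columns is forwarded untouched.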
When a field is missing, you get null — not a guess
Minexa is designed to fail loudly. If a page structure changes and a trained selector no longer finds its target, the affected field returns null or an explicit error. It never borrows a value from a nearby element or invents a plausible substitute.
This contrasts directly with LLM behavior, where a missing price might come back as $0.00 or a missing date might be filled from another date field on the page — both plausible, both wrong, neither flagged.
If you submit a URL with the wrong scraper_id (a page type the scraper was not trained on), Minexa returns an error indicating the mismatch rather than attempting extraction on mismatched structure.
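Because a broken selector shows up as null, a site change can be caught with a simple per-column null rate. The sketch below assumes the results have already been parsed into a list of dicts, one per page. That shape, the null_rates helper, and the 50% threshold are all assumptions for illustration and are not taken from the Minexa docs.

```python
from collections import Counter
from typing import Dict, List


def null_rates(rows: List[dict], columns: List[str]) -> Dict[str, float]:
    """Return, per column, the fraction of rows where it is missing or None."""
    if not rows:
        return {col: 0.0 for col in columns}
    missing = Counter()
    for row in rows:
        for col in columns:
            if row.get(col) is None:
                missing[col] += 1
    return {col: missing[col] / len(rows) for col in columns}


# Hypothetical extracted rows, one dict per page.
rows = [
    {"price": "19.99", "availability": "in stock"},
    {"price": None, "availability": "in stock"},
    {"price": None, "availability": None},
    {"price": "4.50", "availability": "in stock"},
]

rates = null_rates(rows, ["price", "availability"])
suspect = [col for col, rate in rates.items() if rate >= 0.5]  # arbitrary threshold
print(rates)    # {'price': 0.5, 'availability': 0.25}
print(suspect)  # ['price']
```

A column that suddenly crosses the threshold is a hint that the page structure changed and the scraper may need retraining.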
Handling pre-scraped HTML
If you already have HTML stored (on CloudFront, S3, or anywhere publicly accessible), you can skip live crawling entirely using file_urls:
{
    "scraping": {"js_render": false, "proxy": "verified"},
    "file_urls": [
        "https://your-cdn.cloudfront.net/page-1.html"
    ],
    "urls": [
        "https://original-site.com/listing/1"
    ]
}
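file_urls and urls are matched one-to-one by position: each entry in file_urls is the stored HTML for the entry at the same index in urls. Minexa reads from the stored HTML and maps output back to the original URL. This is the lowest-credit configuration since no rendering is needed.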
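Since the two lists have to stay aligned, it can help to keep each original URL and its stored copy together as a pair and only split them when building the request. This is a small sketch under that assumption. build_file_batch is a hypothetical helper, the second URL pair is made up, and the result contains only the keys shown in the fragment above.

```python
from typing import Dict, List, Tuple


def build_file_batch(pairs: List[Tuple[str, str]]) -> Dict:
    """Build the fragment from (original_url, stored_html_url) pairs.

    Deriving both lists from the same pairs keeps them index-aligned.
    """
    return {
        "scraping": {"js_render": False, "proxy": "verified"},
        "file_urls": [stored for _, stored in pairs],
        "urls": [original for original, _ in pairs],
    }


pairs = [
    ("https://original-site.com/listing/1", "https://your-cdn.cloudfront.net/page-1.html"),
    ("https://original-site.com/listing/2", "https://your-cdn.cloudfront.net/page-2.html"),
]
batch = build_file_batch(pairs)
print(len(batch["urls"]) == len(batch["file_urls"]))  # True
```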
The cost reality at scale
LLM APIs charge per token. A full DOM-rendered HTML page averages around 572,000 tokens. At that size, GPT-4o-mini costs roughly $0.086 per page, or $10,320 for 120,000 pages per month. Minexa's Startup plan handles the same volume for $60. Even with HTML stripped down to 38,965 tokens per page, GPT-4o-mini costs $773 for 120,000 pages versus $60 on Minexa.
Minexa's cost does not change based on page size. A 600K-token page costs the same credit as a 10K-token page.
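The token-side arithmetic can be reproduced in a few lines. This sketch assumes an input price of $0.15 per million tokens for GPT-4o-mini and counts input tokens only, so its totals land slightly below the figures quoted above, which presumably also include output tokens or rounding. The price constant is an assumption, so check current pricing before relying on it.

```python
INPUT_PRICE_PER_MILLION = 0.15  # USD per 1M input tokens, assumed rate
PAGES_PER_MONTH = 120_000
MINEXA_STARTUP_PLAN = 60  # USD per month, as quoted in the text


def monthly_input_cost(tokens_per_page: int,
                       pages: int = PAGES_PER_MONTH,
                       price_per_million: float = INPUT_PRICE_PER_MILLION) -> float:
    """Input-token cost in USD for one month of pages (output tokens ignored)."""
    return tokens_per_page * pages * price_per_million / 1_000_000


full_html = monthly_input_cost(572_000)
stripped_html = monthly_input_cost(38_965)

print(f"full HTML:     ${full_html:,.0f} per month")
print(f"stripped HTML: ${stripped_html:,.0f} per month")
print(f"flat plan:     ${MINEXA_STARTUP_PLAN} per month")
```

Under these assumptions the full-HTML case comes to roughly $10,300 per month and the stripped case to roughly $700, both well above the flat $60.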
Start extracting structured data from any site in under 10 minutes: minexa.ai
Nested data and what to do with it
When extracted content is structurally nested, Minexa returns a list of objects with metadata:
{"locations": [
    {"tag": "span", "type": "text", "value": "Berlin"},
    {"tag": "span", "type": "text", "value": "Vienna"}
]}
In most cases you only need the values:
values = [item["value"] for item in row["locations"]]
The tag and type metadata are there when you need to filter by element type, which is useful for pages with long mixed content such as article bodies.
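As a sketch of that kind of filtering, the helper below keeps only the values that come from a given tag. The "a" entry in the sample row is made up to show something being filtered out, and values_for_tag is a hypothetical name.

```python
from typing import List


def values_for_tag(items: List[dict], tag: str) -> List[str]:
    """Return the values of items whose "tag" matches, in original order."""
    return [item["value"] for item in items if item.get("tag") == tag]


# Sample row in the nested shape shown above, plus one hypothetical link item.
row = {"locations": [
    {"tag": "span", "type": "text", "value": "Berlin"},
    {"tag": "a", "type": "text", "value": "See all offices"},
    {"tag": "span", "type": "text", "value": "Vienna"},
]}

cities = values_for_tag(row["locations"], "span")
print(cities)  # ['Berlin', 'Vienna']
```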
What retraining looks like
When a site redesigns and the existing scraper starts returning nulls or errors, you open the updated page in the extension, select the new container, and create a new scraper. This takes the same 2-5 minutes as the original training. The only required code change is updating scraper_id in your request body and checking whether any column names you depend on have changed.
Retraining creates a new scraper from scratch. Column names may differ because they are generated fresh. Minexa does not attempt to preserve labels from the previous scraper.
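Since labels are not carried over, it is worth checking a sample row from the new scraper against the column names your pipeline depends on before switching scraper_id. The sketch below is one minimal way to do that. The required column set and the sample row are hypothetical, and it assumes each result row is a flat dict keyed by column name.

```python
from typing import List, Set

# Columns the downstream pipeline reads (hypothetical example set).
REQUIRED_COLUMNS = {"price", "availability", "brand"}


def missing_columns(sample_row: dict, required: Set[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names absent from a sample result row."""
    return sorted(required - set(sample_row))


# Hypothetical row from the retrained scraper, where one label changed.
new_row = {"price": "19.99", "stock_status": "in stock", "brand": "Acme"}

print(missing_columns(new_row))  # ['availability']
```

An empty list means the new scraper can be swapped in by changing scraper_id alone; anything else points at the column mappings that need updating.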
Read the full API docs and explore scraping scenarios: minexa.stoplight.io/docs/minexa
The actual takeaway
LLM extraction is a reasonable starting point for low-volume, exploratory work where output variance is acceptable. Once you are running tens of thousands of pages per month, or once field accuracy matters for downstream use, the tradeoffs shift significantly: token costs compound, silent errors accumulate, and validation logic adds engineering overhead that was never in the original estimate.
Deterministic DOM-based extraction does not solve every problem, but for structured data at scale from consistent page types, it is the more predictable and cost-stable path.
Install the Minexa Chrome extension, train a scraper on your target page, and pull the auto-generated Python code directly from the extension: chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh
