
Minexa.ai


Nested data, null fields, and the quiet failures nobody talks about in web scraping

Most scraping bugs are not crashes. They are wrong values that look right.

A price field returns a number. The number is plausible. It passes your type check. It lands in your database. Three weeks later someone notices the sale price and the original price have been swapped on roughly 8% of records. No error was raised. No log entry flagged it. The pipeline just quietly extracted the wrong element.

This is the failure mode that actually costs time, and it shows up in three distinct places.

The selector drift problem

CSS selectors and XPath expressions are written against a snapshot of a page. When a site updates its layout, the selector either breaks visibly (returns nothing) or drifts silently (matches a different element that happens to exist at the same path). The second case is worse. A selector targeting .price-now that starts matching .price-was after a redesign will not throw an exception. It will just return the wrong number, consistently, at scale.

Traditional scraping gives you no structural guarantee. You write the selector, you hope the site does not change, and you build monitoring on top to catch drift after the fact.
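One way to build that after-the-fact monitoring is a cross-field invariant check. The sketch below is only an illustration: the field names `sale_price` and `original_price` and the helper `swap_rate` are hypothetical, and it assumes a sale price should never exceed the original price.

```python
def swap_rate(rows, low_key="sale_price", high_key="original_price"):
    """Fraction of comparable rows where the 'low' field exceeds the 'high' field.

    Field names are hypothetical; rows with a missing value are skipped.
    """
    comparable = [
        r for r in rows
        if r.get(low_key) is not None and r.get(high_key) is not None
    ]
    if not comparable:
        return 0.0
    swapped = sum(1 for r in comparable if r[low_key] > r[high_key])
    return swapped / len(comparable)


batch = [
    {"sale_price": 80.0, "original_price": 100.0},
    {"sale_price": 120.0, "original_price": 90.0},  # likely swapped
    {"sale_price": 45.0, "original_price": None},   # not comparable, skipped
    {"sale_price": 30.0, "original_price": 30.0},
]
rate = swap_rate(batch)  # share of comparable rows violating the invariant
```

Tracking this rate per crawl, and alerting when it jumps, would catch a swap like the one described above without inspecting individual rows.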

The LLM ambiguity problem

LLM-based extraction has a different failure signature. On pages with multiple visually similar fields, like a job listing with a salary range, an equity range, and a bonus figure, the model picks based on proximity and pattern rather than structural position. It is usually right. At 100,000 pages, 'usually right' is not enough: even a 2% miss rate means 2,000 incorrectly attributed rows, with no error signal attached.

The hallucination that is hardest to catch is not fabrication. It is field swapping: the correct value extracted into the wrong column. Schema conformance failures are also common. If a price is unavailable, some models return 0 or a nearby value rather than null. Both pass downstream validation.
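To see why a returned 0 slips through, compare a plain type check with a check that separates "missing" from "present but implausible". This is a minimal sketch; `classify_price` and its labels are hypothetical, and it assumes a real price is always greater than zero.

```python
def classify_price(value):
    """Label a raw extracted price. Labels and rules are illustrative assumptions."""
    if value is None:
        return "missing"      # honest null: the field was absent
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "invalid"      # wrong type, e.g. a string or a boolean
    if value <= 0:
        return "suspicious"   # numeric, but possibly a stand-in for null
    return "ok"


# A naive type check accepts 0 as a valid price.
naive_ok = isinstance(0, (int, float))

labels = [classify_price(v) for v in (19.99, 0, None, "19.99")]
```

A check like this can only flag the zero case. It cannot detect a nearby value substituted for a missing one, which is why that failure is harder to catch downstream.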

The nested data problem

This one is underappreciated. Many pages contain fields that are not flat strings. A clinical trials page might have four separate date fields rendered as sibling span elements. A property listing might have address components spread across multiple tags.

When Minexa extracts nested content, it returns a list of objects rather than a collapsed string:

{
  "study_locations": [
    {"tag": "span", "type": "text", "value": "Akishima"},
    {"tag": "span", "type": "text", "value": "Atsugi"}
  ]
}

To get the values in Python:

values = [item["value"] for item in row["study_locations"]]

The tag and type metadata on each object lets you filter when multiple object types are present. This is a real tradeoff: nested fields require more handling than flat columns, but the structure is explicit and accurate rather than collapsed and potentially wrong.
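Filtering on that metadata might look like the sketch below. The helper `extract_values` is hypothetical, and the third object (an `a` tag with type `link`) is invented for illustration; only the `tag`, `type`, and `value` keys come from the output shown above.

```python
def extract_values(items, tag=None, type_=None):
    """Return the 'value' of each nested object matching the optional filters."""
    items = items or []  # treat a null field as an empty list
    return [
        item["value"]
        for item in items
        if (tag is None or item.get("tag") == tag)
        and (type_ is None or item.get("type") == type_)
    ]


row = {
    "study_locations": [
        {"tag": "span", "type": "text", "value": "Akishima"},
        {"tag": "span", "type": "text", "value": "Atsugi"},
        # Hypothetical extra object, not taken from real output
        {"tag": "a", "type": "link", "value": "https://example.com/sites"},
    ]
}

span_values = extract_values(row["study_locations"], tag="span")
all_values = extract_values(row["study_locations"])
```

Treating a null field as an empty list keeps the caller from branching on absent fields, while the original null stays available in `row` if you need to tell "absent" from "present but empty".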

Check the Minexa API docs for the full parameter reference

How Minexa handles this structurally

Minexa is a deterministic, DOM-based extraction platform. You train a scraper once using the Chrome extension by selecting the HTML container that holds your target data block. Minexa locks onto that container and discovers all data points within it automatically.

Each column is bound to a specific DOM element via a consolidated selector chosen for structural stability. Running the same scraper on the same page always returns identical JSON. No temperature variance, no prompt sensitivity.

When a field is absent from the HTML, the output is null. Never a fabricated default.

When a URL is submitted with the wrong scraper_id, the API returns an explicit mismatch error rather than attempting extraction on the wrong structure.

A minimal API call looks like this:

{
  "batches": [{
    "scraper_id": 6241,
    "columns": ["top_30"],
    "urls": ["https://example.com/listing/99"],
    "scraping": {"js_render": true, "proxy": "verified"}
  }],
  "threads": 5
}

The columns parameter accepts either top_N for automatic ranked field selection or explicit column names generated during training. Both cost the same.
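If you assemble this request body in Python, a small builder keeps the structure in one place. This is a sketch under assumptions: `build_payload` is a hypothetical helper, the defaults simply mirror the example above, and no endpoint or authentication is shown because those details belong to the API docs.

```python
import json


def build_payload(scraper_id, urls, columns=("top_30",), threads=5,
                  js_render=True, proxy="verified"):
    """Build a request body shaped like the example above (hypothetical helper)."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one URL is required")
    return {
        "batches": [{
            "scraper_id": scraper_id,
            "columns": list(columns),  # top_N or explicit column names
            "urls": urls,
            "scraping": {"js_render": js_render, "proxy": proxy},
        }],
        "threads": threads,
    }


payload = build_payload(6241, ["https://example.com/listing/99"])
body = json.dumps(payload)  # JSON string ready to send as the request body
```

Keeping `scraper_id` as a single argument also means a retrained scraper needs a change in only one place.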


When a site redesigns and the scraper starts returning nulls or explicit errors, you retrain in 2 to 5 minutes via the extension. The only required code change is updating the scraper_id.

The fail-loudly design matters more than it sounds

Selector-based scrapers fail silently. LLMs fail silently. Both produce outputs that look valid and are not. Minexa is designed to surface structural problems as explicit errors rather than letting wrong data propagate.

This changes what your validation layer needs to do. Instead of checking whether extracted values are plausible, you can trust that a non-null value came from the correct DOM position. Your checks shift from 'does this look right' to 'is this field present'.
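A presence-oriented check can be as simple as tracking the null rate per field across a batch. The sketch below is illustrative only: the field names, the helper functions, and the 0.5 threshold are assumptions, not part of Minexa's API.

```python
def null_rates(rows, fields):
    """Fraction of rows in which each field is null or absent."""
    total = len(rows)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in rows if r.get(f) is None) / total
        for f in fields
    }


def fields_over_threshold(rows, fields, threshold=0.5):
    """Fields whose null rate exceeds the threshold (0.5 is an arbitrary choice)."""
    rates = null_rates(rows, fields)
    return sorted(f for f, rate in rates.items() if rate > threshold)


rows = [
    {"title": "A", "price": None},
    {"title": "B", "price": None},
    {"title": "C", "price": 12.5},
    {"title": "D", "price": None},
]
rates = null_rates(rows, ["title", "price"])
flagged = fields_over_threshold(rows, ["title", "price"])
```

A sudden rise in one field's null rate is the kind of signal that would suggest the site changed and the scraper needs retraining.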

At scale, that is a meaningful reduction in downstream cleanup work.

Start with the extension, get your first dataset in under 10 minutes
