What actually happens when Minexa extracts data from a page

If you have looked at Minexa before and wondered what is actually going on under the hood when it extracts data, this article walks through the internal mechanics in plain terms. Not the marketing pitch. The actual behavior.

What does 'container locking' mean?

Before collecting any values, Minexa identifies and locks onto the exact section of a page that holds the target data. This prevents it from accidentally pulling values from visually similar but unrelated sections, like a sidebar showing related products with their own prices, or a footer with duplicate navigation text.

The user selects a parent HTML element in the browser extension, not individual fields. Minexa works from that container outward, discovering columns within it rather than scanning the whole page.

How does it pick which selector to use for each field?

For each discovered data point, Minexa evaluates multiple candidate selectors and ranks them by structural stability and content regularity across pages. The final selector is the one that consistently targets the correct element even when minor layout differences exist between pages on the same site.

This is why a scraper trained on one product page works correctly across thousands of structurally similar product pages without modification.
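To make the ranking idea concrete, here is a toy sketch. It is not Minexa's actual algorithm, whose signals are not public. The `rank_selectors` function, the sample selectors, and the `is_price` check are all hypothetical. The sketch scores each candidate by how often it yields a plausible value across several sample pages.

```python
# Toy illustration only: Minexa's real ranking signals are not public.
# `pages` maps a page name to {candidate selector: matched text or None}.
def rank_selectors(pages, looks_valid):
    candidates = {sel for matches in pages.values() for sel in matches}
    scores = {}
    for sel in candidates:
        hits = sum(
            1
            for matches in pages.values()
            if matches.get(sel) is not None and looks_valid(matches[sel])
        )
        # Fraction of pages where this selector produced a valid value
        scores[sel] = hits / len(pages)
    # Highest score first; ties broken alphabetically so the order is stable
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical matches for two candidate selectors on three product pages
sample_pages = {
    "page_a": {"div.price > span": "$19.99", "span:nth-child(3)": "$19.99"},
    "page_b": {"div.price > span": "$5.00", "span:nth-child(3)": "In stock"},
    "page_c": {"div.price > span": "$42.10", "span:nth-child(3)": None},
}


def is_price(text):
    return text.startswith("$")


ranking = rank_selectors(sample_pages, is_price)
best_selector = ranking[0][0]
print(best_selector)
```

The positional selector happens to work on one page but breaks on the others, so the structural one ranks first.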

What is the top_n columns parameter?

When you make an API request, you do not have to know the column names upfront. You can pass "top_30", "top_10", or any other top_N value, and Minexa returns that many columns ranked by its relevance algorithm.

{
  "batches": [{
    "scraper_id": 4712,
    "columns": ["top_30"],
    "urls": ["https://example.com/product/99"],
    "scraping": {"js_render": true, "proxy": "verified"}
  }],
  "threads": 5
}

The ranking is deterministic, so "top_30" always maps to the same set of columns in the same order. Once you know which columns matter, you can switch to named fields like ["price", "availability", "brand"] with no other changes needed.
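A minimal sketch of that switch in Python, assuming the request body shape from the example above. `build_request` is a hypothetical helper, not part of any Minexa client. Only the columns list differs between the discovery call and the production call.

```python
import json


def build_request(scraper_id, urls, columns, threads=5):
    """Build a request body shaped like the JSON example in this article."""
    return {
        "batches": [{
            "scraper_id": scraper_id,
            "columns": list(columns),
            "urls": list(urls),
            "scraping": {"js_render": True, "proxy": "verified"},
        }],
        "threads": threads,
    }


urls = ["https://example.com/product/99"]

# Discovery phase: let the relevance ranking pick the columns
discovery = build_request(4712, urls, ["top_30"])

# Production phase: same request, named fields instead of top_N
production = build_request(4712, urls, ["price", "availability", "brand"])

print(json.dumps(production, indent=2))
```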

What does nested data look like in the output?

When extracted content maps to multiple elements, Minexa returns a list of objects instead of a flat string. Each object includes four fields: tag, type, attribute, and value.

{"study_locations": [
  {"tag": "span", "type": "text", "value": "Austin", "attribute": null},
  {"tag": "span", "type": "text", "value": "Denver", "attribute": null}
]}

In most cases you only need the value. In Python:

locations = [item["value"] for item in row["study_locations"]]

For deeply structured content like article body text, the tag and attribute metadata let you filter and reconstruct the original content precisely. This does require some extra handling compared to flat columns, which is a real tradeoff worth knowing about before you start.
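As a rough sketch of that extra handling, the helper below filters a nested column by tag and type. The object shape follows the example above. The sample data is made up, and the "link" type and "href" attribute are assumptions about what non-text items might look like.

```python
def values_where(items, tag=None, type_=None):
    """Return the values of items matching the given tag and/or type."""
    return [
        item["value"]
        for item in items
        if (tag is None or item["tag"] == tag)
        and (type_ is None or item["type"] == type_)
    ]


# Hypothetical nested column for an article body
article_body = [
    {"tag": "h2", "type": "text", "value": "Background", "attribute": None},
    {"tag": "p", "type": "text", "value": "First paragraph.", "attribute": None},
    {"tag": "a", "type": "link", "value": "https://example.com/ref", "attribute": "href"},
    {"tag": "p", "type": "text", "value": "Second paragraph.", "attribute": None},
]

# Keep only paragraph text, in document order
paragraphs = values_where(article_body, tag="p")
body_text = "\n\n".join(paragraphs)
print(body_text)
```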

Is extraction deterministic?

Yes. Running the same scraper on the same page always produces identical JSON output as long as the underlying HTML has not changed. This differs from LLM-based extraction where outputs can vary between runs due to temperature settings, prompt drift, or model updates.

For testing and validation pipelines, this matters a lot. You can run the same page twice and diff the outputs with confidence.
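One way to do that diff, sketched under the assumption that each run gives you a dict of column name to value. `fingerprint` and `changed_fields` are hypothetical helper names. Hashing a canonical JSON form makes the comparison independent of key order.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key differs from an explicit None


def fingerprint(result):
    """Hash a canonical JSON encoding of one extraction result."""
    canonical = json.dumps(
        result, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_fields(old, new):
    """List the column names whose values differ between two runs."""
    keys = set(old) | set(new)
    return sorted(
        k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING)
    )


# Made-up results: run_2 repeats run_1, run_3 reflects a changed page
run_1 = {"price": "19.99", "brand": "Acme"}
run_2 = {"brand": "Acme", "price": "19.99"}
run_3 = {"price": "21.49", "brand": "Acme"}

print(fingerprint(run_1) == fingerprint(run_2))
print(changed_fields(run_1, run_3))
```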

What happens when something goes wrong?

Minexa is designed to fail loudly. If a page structure changes and the scraper no longer matches the HTML, affected fields return null or the scraper raises an explicit error. It never silently returns a wrong value.

If you accidentally submit a URL with the wrong scraper_id (for example, a category page when the scraper was trained on a detail page), Minexa returns an error flagging the mismatch rather than attempting extraction on mismatched content.

This is meaningfully different from selector-based scrapers that can quietly match the wrong element, and from LLM pipelines that may return a plausible-looking but fabricated value with no error signal at all.
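On the consuming side, you can treat null fields as a hard signal rather than letting them flow into downstream data. The sketch below assumes each result row is a dict keyed by column name. The helper names and sample rows are hypothetical.

```python
def find_null_fields(row, required):
    """Return the required columns that are missing or null in one row."""
    return [name for name in required if row.get(name) is None]


def check_rows(rows, required):
    """Map row index to its list of null required columns (problem rows only)."""
    problems = {}
    for index, row in enumerate(rows):
        missing = find_null_fields(row, required)
        if missing:
            problems[index] = missing
    return problems


# Made-up rows: the second one simulates a page whose structure changed
rows = [
    {"price": "19.99", "availability": "In stock"},
    {"price": None, "availability": "In stock"},
]

problems = check_rows(rows, required=["price", "availability"])
if problems:
    print(f"Null required fields by row: {problems}")
```

A check like this gives you a single place to decide whether to retry, alert, or retrain.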

What if the site redesigns?

When a site changes its layout substantially, the scraper will start returning errors or null values. That is the signal to retrain. You open an affected page in the extension, select the updated container, and create a new scraper. It takes the same 2 to 5 minutes as the original setup.

Retraining creates a new scraper with a new scraper_id. Column names may also change, since the scraper is generated fresh. The only required code changes are swapping the scraper_id in your request body and confirming that the column names you rely on still exist, renaming them in your code where they do not.
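A quick way to verify column names after retraining is to compare the names your code relies on against one row returned by the new scraper. This is a sketch with made-up data. `missing_columns` is a hypothetical helper, and the renamed "stock_status" column is invented for illustration.

```python
def missing_columns(relied_on, new_row):
    """Return relied-on column names that the new scraper's output lacks."""
    return sorted(set(relied_on) - set(new_row))


# Columns the existing pipeline reads
relied_on = ["price", "availability", "brand"]

# Hypothetical first row from the retrained scraper
new_row = {"price": "19.99", "stock_status": "In stock", "brand": "Acme"}

gone = missing_columns(relied_on, new_row)
if gone:
    print(f"Update these column names before switching scraper_id: {gone}")
```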

Can you skip live crawling if you already have the HTML?

Yes. If you already have pages saved as HTML files and reachable at a public URL (served through AWS CloudFront, for example), you can pass them via file_urls and set js_render to false. This is the cheapest scraping configuration since no live fetching or rendering is needed.

{
  "scraping": {"js_render": false, "proxy": "verified"},
  "file_urls": [
    "https://9343.cloudfront.net/html-page-1.html"
  ],
  "urls": [
    "https://original-site.com/page-1"
  ]
}

The urls field here holds the original source URLs so extracted data can be mapped back to the real page it came from. The two arrays are 1-to-1 by index.
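Because the two arrays are matched by index, it is safer to build them from a single list of pairs than to maintain them separately. A minimal sketch follows. `build_offline_fragment` is a hypothetical helper that produces only the fragment shown above, not a full request body.

```python
def build_offline_fragment(pairs):
    """Build file_urls/urls arrays from (stored_html_url, original_url) pairs."""
    file_urls = [stored for stored, _ in pairs]
    urls = [original for _, original in pairs]
    return {
        "scraping": {"js_render": False, "proxy": "verified"},
        "file_urls": file_urls,
        "urls": urls,
    }


pairs = [
    ("https://9343.cloudfront.net/html-page-1.html", "https://original-site.com/page-1"),
]

fragment = build_offline_fragment(pairs)

# Each stored file lines up with its source URL at the same index
for stored, original in zip(fragment["file_urls"], fragment["urls"]):
    print(f"{stored} -> {original}")
```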

Where to go from here

The Minexa Chrome extension is the fastest way to train your first scraper and get the pre-generated Python code. The extension also has a drop-down of ready-made scraping scenarios you can copy directly, which saves time compared to reading through the full API docs.

Full API reference is at minexa.stoplight.io/docs/minexa if you want to go deeper on request parameters and scraping configurations.

DEV Community
