DEV Community

Minexa.ai

10 things developers get wrong when building web scraping pipelines

Building a scraper that works once is easy. Building one that works reliably across thousands of pages, survives site updates, and produces clean structured output is a different problem entirely. Here are 10 mistakes that slow teams down.


1. Writing CSS selectors by hand for every site

Selectors break. A class rename, a layout tweak, a CMS update — and your pipeline silently starts returning empty strings or wrong values. The maintenance cost compounds fast across multiple sites. Tools like the Minexa.ai API skip this entirely: you point at the HTML container holding your data, and field discovery happens automatically.


2. Using an LLM to parse HTML at scale

LLMs work fine for a few hundred pages. At 50,000+ pages per month, token costs become the dominant expense. A full HTML page averages around 572,000 tokens. At that size, even a cheap model like GPT-4.1 nano costs about $0.058 per page, which comes to $2,900 per month for 50,000 pages. Minexa charges per page, not per token, so page size has zero effect on cost.
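
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the token count, page volume, and per-page cost are the figures quoted in this post, while the input price of $0.10 per million tokens is an assumption that you should replace with your provider's current rate.

```python
def input_cost_per_page(tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of sending one page to a model, in USD."""
    return tokens_per_page * usd_per_million_tokens / 1_000_000


def monthly_cost(pages_per_month: int, usd_per_page: float) -> float:
    """Total monthly cost for a given per-page cost, in USD."""
    return pages_per_month * usd_per_page


# Figures quoted in this post.
TOKENS_PER_PAGE = 572_000
PAGES_PER_MONTH = 50_000
QUOTED_USD_PER_PAGE = 0.058

# Assumption: an input price of $0.10 per million tokens.
ASSUMED_USD_PER_MILLION = 0.10

per_page = input_cost_per_page(TOKENS_PER_PAGE, ASSUMED_USD_PER_MILLION)
total = monthly_cost(PAGES_PER_MONTH, QUOTED_USD_PER_PAGE)
print(f"input cost per page: ${per_page:.4f}")
print(f"monthly cost at quoted rate: ${total:,.0f}")
```

Because the cost scales linearly with both page size and page count, trimming the HTML before sending it to a model lowers the bill in direct proportion.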


3. Trusting LLM output without validation

LLMs do not fail loudly. A missing price field might come back as 0, as a fabricated default, or as a value borrowed from a nearby element. A date can be assigned to the wrong label with no error signal. DOM-based extraction returns null when a value is absent and never an invented substitute. That difference matters at scale, when you cannot manually review every row.
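
If you do keep an LLM in the loop, a cheap sanity check on every row catches the most obvious cases. The sketch below is an illustration only: the function name, the field names, and the rule that a non-positive number is suspicious are all assumptions, not part of any library or of the Minexa API.

```python
from typing import Any


def find_suspicious_fields(row: dict[str, Any], required: list[str]) -> list[str]:
    """Return human-readable problems found in one extracted row."""
    problems = []
    for field in required:
        value = row.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
        elif isinstance(value, (int, float)) and not isinstance(value, bool) and value <= 0:
            # A zero or negative number is a common stand-in for "not found".
            problems.append(f"{field}: non-positive value {value!r}")
    return problems


# Hypothetical rows, as a model might return them.
rows = [
    {"title": "Desk lamp", "price": 24.99},
    {"title": "Floor lamp", "price": 0},   # possible fabricated default
    {"title": "", "price": None},          # nothing extracted
]

for index, row in enumerate(rows):
    for problem in find_suspicious_fields(row, required=["title", "price"]):
        print(f"row {index}: {problem}")
```

A check like this cannot detect a plausible but wrong value, such as a price borrowed from a neighbouring product. It only narrows down which rows need a second look.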

Figure: Minexa deterministic extraction vs LLM hallucination


4. Not handling nested data properly

When extracted content spans multiple child elements, a flat string is not always what comes back. Minexa returns a list of objects for nested fields:

{"locations": [{"tag": "span", "type": "text", "value": "Berlin"}, {"tag": "span", "type": "text", "value": "Paris"}]}

In Python, getting the values is one line:

values = [item["value"] for item in data["locations"]]

Skipping this handling step causes silent data loss in pipelines that expect flat strings.
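
One way to avoid that loss is to normalize every field before it reaches the rest of the pipeline. The helper below is a minimal sketch that assumes a field is either absent (null), a flat string, or a list of objects shaped like the example above. The function name and the "title" and "phone" keys are made up for illustration.

```python
from typing import Any


def field_values(field: Any) -> list[str]:
    """Normalize a field to a list of strings, whether it is flat or nested."""
    if field is None:
        return []
    if isinstance(field, str):
        return [field]
    # Nested shape: a list of {"tag", "type", "value"} objects.
    return [item["value"] for item in field if item.get("value") is not None]


data = {
    "title": "Office locations",
    "locations": [
        {"tag": "span", "type": "text", "value": "Berlin"},
        {"tag": "span", "type": "text", "value": "Paris"},
    ],
    "phone": None,
}

# Join nested values so downstream code can keep treating fields as strings.
flat = {name: ", ".join(field_values(value)) for name, value in data.items()}
print(flat)
```

Joining with a comma is only one option. If the individual values matter downstream, keep the list returned by `field_values` instead.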


5. Rebuilding scrapers from scratch after every site redesign

Site layouts change. If your scraper is a pile of selectors, a redesign means rewriting it. With a trained scraper model, retraining takes 2 to 5 minutes: open the updated page, select the new container, generate a new scraper_id. The only required code change is updating that ID in your request body.


6. Ignoring the difference between list pages and detail pages

These are structurally different extraction problems. A search results page has repeated item blocks. A product or listing page has one deep content block. Mixing them up in a single scraper produces inconsistent columns. Minexa handles both modes explicitly — 'list + detail' for paginated results and 'detail only' for individual pages like site.com/listing/8821.


7. Re-fetching HTML you already have

If you have already crawled and stored HTML files, there is no reason to re-fetch them for extraction. The Minexa API supports a file_urls parameter that points directly to stored HTML. Set js_render to false and you use the cheapest possible credit configuration since no live crawling happens.

{
  "scraping": {"js_render": false, "proxy": "verified"},
  "file_urls": ["https://your-cdn.cloudfront.net/page-1.html"],
  "urls": ["https://original-site.com/page-1"]
}
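
With more than a handful of stored pages, it helps to generate this fragment rather than write it by hand. The sketch below assumes that each entry in file_urls is the stored copy of the entry at the same position in urls, which is what the example above suggests but does not state. The function name is hypothetical.

```python
import json


def build_stored_html_request(pages: dict[str, str]) -> dict:
    """Build a request fragment from {original_url: stored_file_url} pairs."""
    original_urls = list(pages.keys())
    return {
        # No live crawl is needed, so JavaScript rendering is switched off.
        "scraping": {"js_render": False, "proxy": "verified"},
        # Assumption: file_urls[i] is the stored copy of urls[i].
        "file_urls": [pages[url] for url in original_urls],
        "urls": original_urls,
    }


pages = {
    "https://original-site.com/page-1": "https://your-cdn.cloudfront.net/page-1.html",
    "https://original-site.com/page-2": "https://your-cdn.cloudfront.net/page-2.html",
}
body = build_stored_html_request(pages)
print(json.dumps(body, indent=2))
```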

8. Assuming one scraping config fits every site

A static blog and a JavaScript-heavy SPA need completely different fetch strategies. Skipping JS rendering on a React page returns an empty shell. Using residential proxies on a public static site wastes credits. Minexa exposes js_render, provider (service1, service2, service3), proxy, timeout, and bypass so you can match the config to the site. Start with service1 and escalate only when needed.
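
The "start cheap, escalate only when needed" rule can be sketched as a loop over configurations. Everything here is illustrative: the order of the steps, the idea that these settings sit together in one object, and the `fetch` callable are assumptions, and `fake_fetch` is a stand-in rather than a real API call.

```python
from typing import Callable, Optional

# Assumption: configurations ordered from cheapest to most expensive.
ESCALATION_STEPS = [
    {"js_render": False, "provider": "service1"},
    {"js_render": True, "provider": "service1"},
    {"js_render": True, "provider": "service2"},
    {"js_render": True, "provider": "service3"},
]


def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, dict], Optional[str]],
) -> tuple[Optional[str], Optional[dict]]:
    """Try each config in order and return the first non-empty result."""
    for config in ESCALATION_STEPS:
        html = fetch(url, config)
        if html:  # None or "" counts as a failed attempt
            return html, config
    return None, None


def fake_fetch(url: str, config: dict) -> Optional[str]:
    """Stand-in for a real request: the page only loads with JS rendering."""
    return "<html>content</html>" if config["js_render"] else ""


html, used = fetch_with_escalation("https://site.com/detail/1", fake_fetch)
print(used)
```

Once you know which configuration a site needs, store it per site so later runs skip the failed cheaper attempts.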

Figure: Minexa API credit consumption by scraping mode

Start extracting structured data without writing selectors: minexa.ai


9. Building a scheduling layer before you need it

The Minexa API does not manage scheduling internally when called programmatically. That is intentional — you control the trigger. A simple cron job that builds your URL list and calls POST https://api.minexa.ai/data/ is all you need. Overengineering a scheduling system before validating your extraction pipeline is wasted effort.

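import requests

headers = {"Content-Type": "application/json", "api-key": "your_key"}
data = {
    "batches": [{
        "scraper_id": 6214,
        "columns": ["top_30"],
        "urls": ["https://site.com/detail/1", "https://site.com/detail/2"],
        "scraping": {"js_render": True, "proxy": "verified", "retry": 3},
    }],
    "threads": 5,
}

# Send the batch to the endpoint named above and fail loudly on HTTP errors.
response = requests.post("https://api.minexa.ai/data/", headers=headers, json=data, timeout=60)
response.raise_for_status()
print(response.status_code)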

10. Defining a rigid schema before exploring the data

Most scraping projects start with 'I want the price and the title' and end up needing eight more fields. Defining a fixed schema upfront means multiple retraining cycles. Use columns: ["top_20"] first to see what Minexa surfaces automatically, then narrow down to named columns once you know what is actually on the page. Both approaches cost the same — there is no penalty for exploring.
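
The two passes differ only in the columns list. The sketch below assumes the batch shape used in the request example for mistake 9. The helper name is hypothetical, and "title" and "price" are placeholder column names that may not match what the exploration pass returns for your pages.

```python
def build_batch(scraper_id: int, urls: list[str], columns: list[str]) -> dict:
    """Build one batch entry with the given columns."""
    return {"scraper_id": scraper_id, "columns": columns, "urls": urls}


urls = ["https://site.com/detail/1", "https://site.com/detail/2"]

# Pass 1: explore. Ask for the top 20 discovered fields.
explore = build_batch(6214, urls, ["top_20"])

# Pass 2: narrow. "title" and "price" are hypothetical column names; use the
# names that actually appear in the exploration results.
narrow = build_batch(6214, urls, ["title", "price"])

print(explore["columns"], narrow["columns"])
```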


The patterns above account for most of the debugging time, pipeline failures, and unexpected costs in scraping projects. Addressing them early — especially the LLM cost trap and the silent failure problem — saves significant rework later.

Read the full Minexa API docs to see request structure, scraping parameters, and ready-to-run Python code.
