
When your web extraction tool should fail loudly instead of returning pretty lies

A web extraction API has one job that sounds boring until it fails:

return the data that exists, or admit that it could not get it.

That second half matters more than most people want to admit.

When you put an LLM at the end of a scraping pipeline, you get a nasty failure mode. The fetch fails, the page is blocked, the PDF text is empty, or the site returns a CAPTCHA page, and the model still tries to be helpful. Helpful, in this case, means inventing plausible JSON.

That is worse than a 500.

A 500 tells your pipeline to retry, route, alert, or skip. Fabricated JSON quietly poisons whatever comes next.

The pattern we ended up using

For Haunt API, the extraction path is deliberately boring before it is clever:

  1. fetch the page directly
  2. fall back through stronger fetch/render paths when needed
  3. inspect what actually came back
  4. only ask the model to extract when there is real page content
  5. return a structured failure when the page is inaccessible or clearly a verification wall

The key part is step 3. Do not treat “HTTP 200” as “we got the page”. A lot of sites return a successful status code for:

  • login walls
  • consent walls
  • CAPTCHA pages
  • JavaScript shells with no meaningful content
  • PDF wrappers with empty text
  • soft-block pages that look like normal HTML to a naive parser

If you pass that straight to an LLM and ask for product names, prices, company details, or whatever else, you are inviting fiction.
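
In code, the shape of that flow looks something like this. It is a sketch, not Haunt's internals: the helper names are made up, render_page is a stub for whatever heavier path you use, and the content check is deliberately naive (a fuller version shows up in the guardrails section):

import requests


def text_looks_real(text: str) -> bool:
    # Deliberately naive placeholder; see the guardrails section for a fuller check
    return len(text.strip()) >= 200


def render_page(url: str) -> str | None:
    # Stub for a stronger fetch/render path: headless browser, proxies, etc.
    return None


def get_page_text(url: str) -> dict:
    # Step 1: cheap direct fetch first
    try:
        resp = requests.get(url, timeout=15)
        # Step 3: inspect what actually came back; a 200 status alone proves nothing
        if resp.ok and text_looks_real(resp.text):
            return {"success": True, "text": resp.text}
    except requests.RequestException:
        pass

    # Step 2: escalate to a heavier path only when the cheap one fails
    rendered = render_page(url)
    if rendered and text_looks_real(rendered):
        return {"success": True, "text": rendered}

    # Step 5: a structured failure instead of handing junk to the model
    return {"success": False, "error_type": "empty_content", "url": url}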

What a good failure looks like

A good extraction failure should be boring and machine-readable:

{
  "success": false,
  "error_type": "captcha_required",
  "message": "The target page requires human verification before extraction.",
  "url": "https://example.com/product"
}

Or:

{
  "success": false,
  "error_type": "empty_content",
  "message": "The page was reachable, but no extractable content was found.",
  "url": "https://example.com/report.pdf"
}

That gives the caller something useful (see the dispatch sketch after the list):

  • retry later
  • ask for credentials
  • use a different source
  • mark the record unresolved
  • escalate to a human
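
That branching can be boring code too. A minimal caller-side sketch; the step names are placeholders for whatever your pipeline actually does:

def next_step(result: dict) -> str:
    # Decide what to do next from a structured extraction result
    if result.get("success"):
        return "store"

    routing = {
        "captcha_required": "escalate_to_human",
        "empty_content": "retry_later",
        "login_required": "ask_for_credentials",
    }
    # Unknown error types fall through to manual review
    return routing.get(result.get("error_type", ""), "mark_unresolved")

next_step({"success": False, "error_type": "captcha_required", "url": "https://example.com/product"})
# -> "escalate_to_human"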

What it should not do is return:

{
  "company_name": "Example Holdings Ltd",
  "revenue": "$12.4M",
  "employees": 87
}

when none of that appeared on the page.

Tiny haunted spreadsheet, now with investor-grade hallucinations. Lovely.

Simple guardrails you can add

If you are building this yourself, add checks before extraction:

def has_meaningful_content(text: str) -> bool:
    """Cheap pre-extraction gate: reject empty, tiny, or wall-like pages."""
    # Too little text to be a real content page
    if not text or len(text.strip()) < 200:
        return False

    # Phrases that usually mean a block page, not the content you asked for
    bad_markers = [
        "verify you are human",
        "checking your browser",
        "captcha",
        "enable javascript",
        "access denied",
        "login required",
    ]

    lowered = text.lower()
    return not any(marker in lowered for marker in bad_markers)

That is not enough on its own, but it catches a surprising amount of garbage before the model gets a chance to decorate it.

Also make the model answer from evidence, not vibes:

Extract only fields that are explicitly present in the provided page content.
If a field is missing, return null.
If the content is not the requested page, return an extraction_error object.
Do not infer, guess, or fill gaps from general knowledge.

Then validate the output. If every field is mysteriously perfect after a weak fetch, be suspicious. The machine is smiling too much.
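
A crude but effective version of that suspicion is a grounding check: every extracted string should literally appear somewhere in the page text. Reformatted numbers and dates will trip it, so treat it as a sketch to adapt, not a drop-in:

def grounded(extracted: dict, page_text: str) -> bool:
    # Fabricated values tend not to appear anywhere in the source text
    lowered = page_text.lower()
    for value in extracted.values():
        if value is None:
            continue  # null is the honest answer for a missing field
        if isinstance(value, str) and value.lower() not in lowered:
            return False
    return True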

Where MCP makes this sharper

Agent workflows make this problem worse because the output is not always going straight to a human. Claude, Cursor, or another agent may call a tool, receive JSON, and continue planning from it.

Bad extraction becomes bad reasoning.

So for MCP tools, I think the contract should be stricter (a sketch follows the list):

  • return structured JSON when extraction is grounded
  • return a structured error when it is not
  • expose the failure reason clearly enough for the agent to choose the next step
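
Pulling the earlier sketches together, that contract is only a few lines. llm_extract stands in for whatever grounded model call you use; everything else reuses the helpers above:

def extract_tool(url: str, prompt: str) -> dict:
    # MCP-style tool body: structured data when grounded, structured error when not
    page = get_page_text(url)
    if not page["success"]:
        return page  # already machine-readable; the agent can branch on error_type

    data = llm_extract(page["text"], prompt)  # hypothetical model call
    if not grounded(data, page["text"]):
        # The model answered, but the page does not back it up
        return {"success": False, "error_type": "ungrounded_output", "url": url}

    return {"success": True, "data": data}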

That is what we are building toward with Haunt API: known URL in, natural-language prompt in, structured JSON out, but no pretending when the page cannot actually be read.

If you are building agents that depend on web data, this is the boring line that matters:

no data is better than fake data.

Haunt API: https://hauntapi.com

Python SDK:

pip install hauntapi

MCP server:

npx -y @hauntapi/mcp-server
