
When your web extraction tool should fail loudly instead of returning pretty lies

A web extraction API has one job that sounds boring until it fails:

return the data that exists, or admit that it could not get it.

That second half matters more than most people want to admit.

When you put an LLM at the end of a scraping pipeline, you get a nasty failure mode. The fetch fails, the page is blocked, the PDF text is empty, or the site returns a CAPTCHA page, and the model still tries to be helpful. Helpful, in this case, means inventing plausible JSON.

That is worse than a 500.

A 500 tells your pipeline to retry, route, alert, or skip. Fabricated JSON quietly poisons whatever comes next.

The pattern we ended up using

For Haunt API, the extraction path is deliberately boring before it is clever:

  1. fetch the page directly
  2. fall back through stronger fetch/render paths when needed
  3. inspect what actually came back
  4. only ask the model to extract when there is real page content
  5. return a structured failure when the page is inaccessible or clearly a verification wall

The key part is step 3. Do not treat “HTTP 200” as “we got the page”. A lot of sites return a successful status code for:

  • login walls
  • consent walls
  • CAPTCHA pages
  • JavaScript shells with no meaningful content
  • PDF wrappers with empty text
  • soft-block pages that look like normal HTML to a naive parser

If you pass that straight to an LLM and ask for product names, prices, company details, or whatever else, you are inviting fiction.
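
In code, the shape of that flow looks something like this. It is a sketch, not Haunt's internals: the helper names are made up, render_page is a stub for whatever heavier path you use, and the content check is deliberately naive (a fuller version shows up in the guardrails section):

import requests


def text_looks_real(text: str) -> bool:
    # Deliberately naive placeholder; see the guardrails section for a fuller check
    return len(text.strip()) >= 200


def render_page(url: str) -> str | None:
    # Stub for a stronger fetch/render path: headless browser, proxies, etc.
    return None


def get_page_text(url: str) -> dict:
    # Step 1: cheap direct fetch first
    try:
        resp = requests.get(url, timeout=15)
        # Step 3: inspect what actually came back; a 200 status alone proves nothing
        if resp.ok and text_looks_real(resp.text):
            return {"success": True, "text": resp.text}
    except requests.RequestException:
        pass

    # Step 2: escalate to a heavier path only when the cheap one fails
    rendered = render_page(url)
    if rendered and text_looks_real(rendered):
        return {"success": True, "text": rendered}

    # Step 5: a structured failure instead of handing junk to the model
    return {"success": False, "error_type": "empty_content", "url": url}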

What a good failure looks like

A good extraction failure should be boring and machine-readable:

{
  "success": false,
  "error_type": "captcha_required",
  "message": "The target page requires human verification before extraction.",
  "url": "https://example.com/product"
}

Or:

{
  "success": false,
  "error_type": "empty_content",
  "message": "The page was reachable, but no extractable content was found.",
  "url": "https://example.com/report.pdf"
}

That gives the caller something useful (see the dispatch sketch after the list):

  • retry later
  • ask for credentials
  • use a different source
  • mark the record unresolved
  • escalate to a human
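
That branching can be boring code too. A minimal caller-side sketch; the step names are placeholders for whatever your pipeline actually does:

def next_step(result: dict) -> str:
    # Decide what to do next from a structured extraction result
    if result.get("success"):
        return "store"

    routing = {
        "captcha_required": "escalate_to_human",
        "empty_content": "retry_later",
        "login_required": "ask_for_credentials",
    }
    # Unknown error types fall through to manual review
    return routing.get(result.get("error_type", ""), "mark_unresolved")

next_step({"success": False, "error_type": "captcha_required", "url": "https://example.com/product"})
# -> "escalate_to_human"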

What it should not do is return:

{
  "company_name": "Example Holdings Ltd",
  "revenue": "$12.4M",
  "employees": 87
}

when none of that appeared on the page.

Tiny haunted spreadsheet, now with investor-grade hallucinations. Lovely.

Simple guardrails you can add

If you are building this yourself, add checks before extraction:

def has_meaningful_content(text: str) -> bool:
    """Cheap pre-extraction gate: reject empty, tiny, or wall-like pages."""
    # Too little text to be a real content page
    if not text or len(text.strip()) < 200:
        return False

    # Phrases that usually mean a block page, not the content you asked for
    bad_markers = [
        "verify you are human",
        "checking your browser",
        "captcha",
        "enable javascript",
        "access denied",
        "login required",
    ]

    lowered = text.lower()
    return not any(marker in lowered for marker in bad_markers)

That is not enough on its own, but it catches a surprising amount of garbage before the model gets a chance to decorate it.

Also make the model answer from evidence, not vibes:

Extract only fields that are explicitly present in the provided page content.
If a field is missing, return null.
If the content is not the requested page, return an extraction_error object.
Do not infer, guess, or fill gaps from general knowledge.

Then validate the output. If every field is mysteriously perfect after a weak fetch, be suspicious. The machine is smiling too much.
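
A crude but effective version of that suspicion is a grounding check: every extracted string should literally appear somewhere in the page text. Reformatted numbers and dates will trip it, so treat it as a sketch to adapt, not a drop-in:

def grounded(extracted: dict, page_text: str) -> bool:
    # Fabricated values tend not to appear anywhere in the source text
    lowered = page_text.lower()
    for value in extracted.values():
        if value is None:
            continue  # null is the honest answer for a missing field
        if isinstance(value, str) and value.lower() not in lowered:
            return False
    return True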

Where MCP makes this sharper

Agent workflows make this problem worse because the output is not always going straight to a human. Claude, Cursor, or another agent may call a tool, receive JSON, and continue planning from it.

Bad extraction becomes bad reasoning.

So for MCP tools, I think the contract should be stricter (a sketch follows the list):

  • return structured JSON when extraction is grounded
  • return a structured error when it is not
  • expose the failure reason clearly enough for the agent to choose the next step
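
Pulling the earlier sketches together, that contract is only a few lines. llm_extract stands in for whatever grounded model call you use; everything else reuses the helpers above:

def extract_tool(url: str, prompt: str) -> dict:
    # MCP-style tool body: structured data when grounded, structured error when not
    page = get_page_text(url)
    if not page["success"]:
        return page  # already machine-readable; the agent can branch on error_type

    data = llm_extract(page["text"], prompt)  # hypothetical model call
    if not grounded(data, page["text"]):
        # The model answered, but the page does not back it up
        return {"success": False, "error_type": "ungrounded_output", "url": url}

    return {"success": True, "data": data}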

That is what we are building toward with Haunt API: known URL in, natural-language prompt in, structured JSON out, but no pretending when the page cannot actually be read.

If you are building agents that depend on web data, this is the boring line that matters:

no data is better than fake data.

Haunt API: https://hauntapi.com

Python SDK:

pip install hauntapi

MCP server:

npx -y @hauntapi/mcp-server
