The dangerous part of web extraction is not the error.
The dangerous part is a clean JSON response that looks correct and is not.
If an AI agent uses that output, the mistake does not stay inside a scraper. It moves into a report, a lead list, a price alert, a CRM update, or an automated decision.
So before you trust any web extraction API in an agent workflow, test whether it fails honestly.
Here is the checklist I use.
1. Test a real page, not a demo page
Demo pages are usually polite little museum exhibits.
Use pages that behave like the actual internet:
- a JavaScript-heavy product page
- a search results page
- a page with missing fields
- a login-gated page
- a blocked or rate-limited page
- a thin page that has layout but little useful content
If the tool only looks good on clean static pages, you have not learned much.
2. Check whether the API separates fetch success from extraction success
A page can return HTTP 200 and still be useless.
Common examples:
- the final page is a login screen
- the visible content is a bot challenge
- the useful data loads after hydration and never appeared in the fetch
- the page contains a generic country selector instead of the product
- the source has navigation text but no actual item data
The extractor should not treat these as success just because the request completed.
A better response says something like:
{
  "status": "failed",
  "failure_type": "login_required",
  "extracted": null,
  "next_step": "Use a public URL, authorised access, or sample content."
}
That is boring.
Boring is good here.
Boring means the agent can stop safely.
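Here is a minimal sketch of that separation in Python. Everything in it is an assumption for illustration: the `FetchResult` shape, the marker phrases, the 200-character threshold, and the `fetch_failed` and `ready_for_extraction` labels are hypothetical, and a real classifier would need far more signal than substring matching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    status_code: int
    final_url: str      # URL after redirects
    visible_text: str   # text a user would actually see


# Hypothetical marker phrases, for illustration only.
CHALLENGE_MARKERS = ("verify you are human", "captcha")
LOGIN_MARKERS = ("sign in to continue", "log in to continue")


def detect_failure(fetch: FetchResult, min_chars: int = 200) -> Optional[str]:
    """Return a failure type, or None if the page looks worth extracting."""
    text = fetch.visible_text.strip().lower()
    if fetch.status_code != 200:
        return "fetch_failed"
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "captcha_required"
    if "/login" in fetch.final_url or any(m in text for m in LOGIN_MARKERS):
        return "login_required"
    if len(text) < min_chars:
        return "thin_public_content"
    return None


def build_response(fetch: FetchResult) -> dict:
    """Gate extraction on page quality, not on the HTTP status alone."""
    failure = detect_failure(fetch)
    if failure is not None:
        return {"status": "failed", "failure_type": failure, "extracted": None}
    # Extraction would run here; this sketch only covers the gate before it.
    return {"status": "ready_for_extraction", "failure_type": None, "extracted": None}
```

The point of the design is that an HTTP 200 only gets you past the first check. The page still has to look like something worth extracting before any fields are returned.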
3. Ask for fields that are not on the page
This is the easiest lie detector.
Ask the extractor for something impossible:
Extract the product name, price, stock status, warranty length, CEO name, office address, and refund policy.
If the page only contains the product name and price, the API should report the other fields as missing or low confidence.
It should not politely invent them because the schema asked nicely.
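This test is easy to automate. The sketch below assumes the extractor returns a flat dict of field names to values and that you already know which fields are not on the page; the function name and the sample response are made up for illustration.

```python
def invented_fields(extracted: dict, known_absent: set[str]) -> list[str]:
    """Return fields known to be absent from the page that still came back with a value."""
    return sorted(
        name for name in known_absent
        if extracted.get(name) not in (None, "", [])
    )


# Hypothetical response for a page that only shows a product name and a price.
response = {
    "product_name": "Trail Lamp",
    "price": "19.99",
    "warranty_length": None,
    "ceo_name": "Jane Doe",   # not on the page, so this one is invented
    "office_address": "",
}

absent = {"stock_status", "warranty_length", "ceo_name", "office_address", "refund_policy"}
suspicious = invented_fields(response, absent)
```

Any name in `suspicious` is a field the extractor filled in without evidence. An empty list is what you want to see.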
4. Inspect blocked-page behaviour
A blocked page should not become a fake product.
A login wall should not become a company profile.
A CAPTCHA page should not become a price list.
The tool should classify the failure. Useful failure classes include:
- login_required
- access_denied
- captcha_required
- rate_limited
- thin_public_content
- not_found
- timeout
The exact names matter less than the behaviour: the caller should know what happened.
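On the caller's side, a closed set of failure classes makes the agent's decision small and explicit. This is a sketch, not a recommendation for any particular API: the enum mirrors the names listed above, while the retry policy (retry only rate limits and timeouts, stop on everything else, including unknown values) is my own assumption.

```python
from enum import Enum


class FailureType(str, Enum):
    LOGIN_REQUIRED = "login_required"
    ACCESS_DENIED = "access_denied"
    CAPTCHA_REQUIRED = "captcha_required"
    RATE_LIMITED = "rate_limited"
    THIN_PUBLIC_CONTENT = "thin_public_content"
    NOT_FOUND = "not_found"
    TIMEOUT = "timeout"


# Assumed policy: only transient failures are worth retrying.
RETRYABLE = {FailureType.RATE_LIMITED, FailureType.TIMEOUT}


def next_action(failure_type: str) -> str:
    """Map a failure class to what the agent should do next."""
    try:
        failure = FailureType(failure_type)
    except ValueError:
        return "stop"  # unknown failure class: do not guess
    return "retry_later" if failure in RETRYABLE else "stop"
```

For example, `next_action("rate_limited")` returns `"retry_later"`, while `next_action("captcha_required")` and any unrecognised string both return `"stop"`.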
5. Look for evidence, not just output
For agent workflows, I want some trace of why the tool believed the result.
That can be simple:
- final URL type after redirects (product page, login page, challenge page)
- whether visible content was found
- whether structured data was found
- missing fields
- confidence
- response mode used
You do not need a forensic novel. You need enough context to avoid trusting a hallucinated row in a spreadsheet.
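With that evidence in the response, the agent can gate on it before writing anything downstream. The field names here (`evidence`, `visible_content_found`, `missing_fields`, `confidence`) and the 0.8 threshold are hypothetical; adapt them to whatever your extractor actually returns.

```python
def should_trust(result: dict, required: list[str], min_confidence: float = 0.8) -> bool:
    """Accept a result only if its evidence supports the fields the caller needs."""
    if result.get("status") != "ok":
        return False
    evidence = result.get("evidence", {})
    if not evidence.get("visible_content_found"):
        return False
    # Reject if any field the caller depends on was reported missing.
    if set(required) & set(evidence.get("missing_fields", [])):
        return False
    return evidence.get("confidence", 0.0) >= min_confidence


# Hypothetical result: price was found, stock status was not.
result = {
    "status": "ok",
    "extracted": {"product_name": "Trail Lamp", "price": "19.99"},
    "evidence": {
        "visible_content_found": True,
        "structured_data_found": False,
        "missing_fields": ["stock_status"],
        "confidence": 0.9,
    },
}
```

Here a price alert that needs only `price` can proceed, but a stock monitor that needs `stock_status` should not. A result with no evidence at all is rejected by default.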
6. Test the no-key path
If the tool has an MCP server or agent integration, the first run matters.
A good no-key or demo path should answer:
- what the tool does
- how to run one safe demo
- where to get a key
- what command to copy next
- what failures mean
If the missing-key error is a bare 401, users will bounce, and agents will struggle to recover without extra instructions.
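You can turn those five questions into a quick check against whatever the tool returns when no key is set. The payload keys below are hypothetical, chosen only to line up with the list above; no particular tool is known to use them.

```python
# Hypothetical payload keys, one per question a no-key response should answer.
REQUIRED_GUIDANCE = {
    "description": "what the tool does",
    "demo_command": "how to run one safe demo",
    "key_url": "where to get a key",
    "next_command": "what command to copy next",
    "failure_help": "what failures mean",
}


def unanswered_questions(error_payload: dict) -> list[str]:
    """List the first-run questions a missing-key response leaves unanswered."""
    return [
        question
        for key, question in REQUIRED_GUIDANCE.items()
        if not error_payload.get(key)
    ]


bare_error = {"error": "401"}
gaps = unanswered_questions(bare_error)
```

A bare 401 leaves all five questions open, which is exactly the situation where both people and agents give up.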
7. Treat honest failure as a product feature
This is the bit people get wrong.
They think users only want successful JSON.
Users want useful truth.
If the page is blocked, say it is blocked.
If the data is missing, say it is missing.
If the extraction is low confidence, say that.
A failed extraction that explains itself is better than a successful-looking lie.
That is the principle behind Haunt API, which is what I am building.
Disclosure: Haunt is my project. It takes a known public URL and a plain-English prompt, then returns structured JSON when the page provides enough evidence. It also has an MCP server for agent workflows.
Try the demo:
Docs:
MCP package:
npx -y @hauntapi/mcp-server
REST demo:
curl -X POST https://hauntapi.com/v1/demo/extract
If your agent is going to act on web data, the extraction layer has to be honest.
Otherwise you have not built automation.
You have built a very confident rumour machine.