Zee

Posted on Jun 5

How to test whether your web extraction API is lying to your agent

#agents #api #testing #webscraping

The dangerous part of web extraction is not the error.

The dangerous part is a clean JSON response that looks correct and is not.

If an AI agent uses that output, the mistake does not stay inside a scraper. It moves into a report, a lead list, a price alert, a CRM update, or an automated decision.

So before you trust any web extraction API in an agent workflow, test whether it fails honestly.

Here is the checklist I use.

1. Test a real page, not a demo page

Demo pages are usually polite little museum exhibits.

Use pages that behave like the actual internet:

a JavaScript-heavy product page
a search results page
a page with missing fields
a login-gated page
a blocked or rate-limited page
a thin page that has layout but little useful content

If the tool only looks good on clean static pages, you have not learned much.

2. Check whether the API separates fetch success from extraction success

A page can return HTTP 200 and still be useless.

Common examples:

the final page is a login screen
the visible content is a bot challenge
the useful data loads after hydration and never appears in the fetched HTML
the page contains a generic country selector instead of the product
the source has navigation text but no actual item data

The extractor should not treat these as success just because the request completed.

A better response says something like:

{
  "status": "failed",
  "failure_type": "login_required",
  "extracted": null,
  "next_step": "Use a public URL, authorised access, or sample content."
}

That is boring.

Boring is good here.

Boring means the agent can stop safely.
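On the agent side, that boring response is easy to act on. Here is a minimal sketch of a gate that only hands data to the agent when both the fetch and the extraction succeeded. The `status`, `failure_type`, and `extracted` fields mirror the example above; the `"ok"` success value and the `usable_payload` name are my assumptions, not any particular API's contract.

```python
from typing import Optional


def usable_payload(http_status: int, body: dict) -> Optional[dict]:
    """Return extracted data only when fetch AND extraction both succeeded."""
    # A completed request is necessary but not sufficient.
    if http_status != 200:
        return None
    # Assumption: the API reports "ok" on success and "failed" otherwise.
    if body.get("status") != "ok":
        return None
    extracted = body.get("extracted")
    # A null or empty payload is a failure, never an "empty success".
    if not isinstance(extracted, dict) or not extracted:
        return None
    return extracted


# Illustrative responses (hand-written, not real API output).
login_wall = {
    "status": "failed",
    "failure_type": "login_required",
    "extracted": None,
    "next_step": "Use a public URL, authorised access, or sample content.",
}
product = {"status": "ok", "extracted": {"name": "Desk lamp", "price": "29.00"}}

print(usable_payload(200, login_wall))  # HTTP 200, but nothing usable
print(usable_payload(200, product))
```

The point of returning `None` instead of raising is that the agent gets one simple question to answer: is there a payload or not.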

3. Ask for fields that are not on the page

This is the easiest lie detector.

Ask the extractor for fields the page cannot possibly supply:

Extract the product name, price, stock status, warranty length, CEO name, office address, and refund policy.

If the page only contains the product name and price, the API should report the other fields as missing or low-confidence.

It should not politely invent them because the schema asked nicely.
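This test is easy to automate once you know which fields the page does not contain. A rough sketch, assuming the extractor returns a flat dict and marks missing fields as null or empty; `invented_fields` and the sample data are made up for illustration.

```python
def invented_fields(extracted: dict, absent_fields: set) -> list:
    """Return fields that came back with a value although the page lacks them."""
    empty = (None, "", [], {})
    return sorted(
        name for name in absent_fields
        if extracted.get(name) not in empty
    )


# Fields I know are NOT on the test page.
absent = {"stock_status", "warranty_length", "ceo_name",
          "office_address", "refund_policy"}

# Illustrative outputs (hand-written, not real API responses).
honest = {"product_name": "Desk lamp", "price": "29.00",
          "warranty_length": None, "ceo_name": None}
dishonest = {"product_name": "Desk lamp", "price": "29.00",
             "warranty_length": "2 years", "ceo_name": "Jane Doe"}

print(invented_fields(honest, absent))
print(invented_fields(dishonest, absent))
```

Any non-empty list here is a field the tool made up.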

4. Inspect blocked-page behaviour

A blocked page should not become a fake product.

A login wall should not become a company profile.

A CAPTCHA page should not become a price list.

The tool should classify the failure. Useful failure classes include:

login_required
access_denied
captcha_required
rate_limited
thin_public_content
not_found
timeout

The exact names matter less than the behaviour: the caller should know what happened.
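Once failures are classified, the caller can map each class to a next step. The sketch below uses the class names from the list above; the three actions (`retry_later`, `ask_human`, `skip_url`) and the grouping are a hypothetical policy of mine, so adjust them to your workflow.

```python
from enum import Enum


class FailureType(str, Enum):
    LOGIN_REQUIRED = "login_required"
    ACCESS_DENIED = "access_denied"
    CAPTCHA_REQUIRED = "captcha_required"
    RATE_LIMITED = "rate_limited"
    THIN_PUBLIC_CONTENT = "thin_public_content"
    NOT_FOUND = "not_found"
    TIMEOUT = "timeout"


# Hypothetical policy: which failures are worth retrying or skipping.
RETRY_LATER = {FailureType.RATE_LIMITED, FailureType.TIMEOUT}
SKIP_URL = {FailureType.NOT_FOUND, FailureType.THIN_PUBLIC_CONTENT}


def next_action(failure_type: str) -> str:
    """Map a failure class to what the agent should do next."""
    try:
        failure = FailureType(failure_type)
    except ValueError:
        # Unknown failure class: do not guess, hand it to a person.
        return "ask_human"
    if failure in RETRY_LATER:
        return "retry_later"
    if failure in SKIP_URL:
        return "skip_url"
    # Login walls, denials, and CAPTCHAs need access the agent does not have.
    return "ask_human"


print(next_action("rate_limited"))
print(next_action("login_required"))
```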

5. Look for evidence, not just output

For agent workflows, I want some trace of why the tool believed the result.

That can be simple:

the type of page at the final URL, after redirects
whether visible content was found
whether structured data was found
missing fields
confidence
response mode used

You do not need a forensic novel. You need enough context to avoid trusting a hallucinated row in a spreadsheet.
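If the tool gives you no evidence at all, you can still run a crude check yourself: does each extracted string actually appear in the page text? The sketch below assumes you have the visible page text on hand. It is a plain substring match, so a tool that reformats values (prices, dates) will trigger false alarms; treat a hit as a reason to look, not as proof.

```python
import re


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_values(extracted: dict, page_text: str) -> list:
    """Return field names whose string values never appear in the page text."""
    haystack = _normalise(page_text)
    return sorted(
        name for name, value in extracted.items()
        if isinstance(value, str) and value.strip()
        and _normalise(value) not in haystack
    )


# Illustrative data (hand-written).
page_text = "Desk Lamp\n  Price: $29.00  \nFree delivery"
extracted = {"name": "Desk lamp", "price": "$29.00", "warranty": "2 years"}

print(unsupported_values(extracted, page_text))
```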

6. Test the no-key path

If the tool has an MCP server or agent integration, the first run matters.

A good no-key or demo path should answer:

what the tool does
how to run one safe demo
where to get a key
what command to copy next
what failures mean

If the missing-key error is a bare 401, users will bounce, and agents will struggle to recover from it without extra instructions.
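You can turn that list into a check on the error body itself. This is only a sketch: the field names (`summary`, `demo_command`, `key_url`, `next_command`, `failure_help`) are hypothetical stand-ins for the five questions above, and the URL is a placeholder, not a real endpoint.

```python
# Hypothetical field names, one per question a first run should answer.
EXPECTED_HINTS = ("summary", "demo_command", "key_url",
                  "next_command", "failure_help")


def missing_hints(error_body: dict) -> list:
    """Return the first-run questions the error body leaves unanswered."""
    return [key for key in EXPECTED_HINTS if not error_body.get(key)]


# Illustrative error bodies (hand-written).
bare = {"error": 401}
helpful = {
    "error": "missing_api_key",
    "summary": "Extracts structured JSON from a public URL.",
    "demo_command": "run the keyless demo first",
    "key_url": "https://example.com/keys",  # placeholder URL
    "next_command": "set the API key, then retry the same call",
    "failure_help": "failure_type tells you why an extraction stopped",
}

print(missing_hints(bare))
print(missing_hints(helpful))
```

An agent that reads the second body has something to do next. An agent that reads the first one has a number.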

7. Treat honest failure as a product feature

This is the bit people get wrong.

They think users only want successful JSON.

Users want useful truth.

If the page is blocked, say it is blocked.

If the data is missing, say it is missing.

If the extraction is low confidence, say that.

A failed extraction that explains itself is better than a successful-looking lie.

That is the principle behind Haunt API, which is what I am building.

Disclosure: Haunt is my project. It takes a known public URL and a plain-English prompt, then returns structured JSON when the page provides enough evidence. It also has an MCP server for agent workflows.

Try the demo:

https://hauntapi.com/demo

Docs:

https://hauntapi.com/docs

MCP package:

npx -y @hauntapi/mcp-server

REST demo:

curl -X POST https://hauntapi.com/v1/demo/extract

If your agent is going to act on web data, the extraction layer has to be honest.

Otherwise you have not built automation.

You have built a very confident rumour machine.

DEV Community