Your scraping API shouldn't charge you for failed requests

#ai #agents #api #webdev

I have used most of the web scraping APIs. They are good at the happy path: give them a reachable page, get back HTML or JSON. But almost all of them share a quietly expensive habit. They bill you for the request even when the page was blocked and they handed you garbage.

In the era of human-run scrapers, you could shrug that off. A person looked at the output, saw it was junk, and moved on. In the era of agents, it is a real problem, and it is worth thinking about why.

Failure is not the exception, it is the workload

If you point any extraction tool at the open web at scale, a meaningful slice of requests will fail: CAPTCHAs, login walls, rate limits, geo blocks, JavaScript shells that never settle, pages that are just thin. That is not an edge case. It is a constant percentage of every job you run.

So the failure path is not a corner of the product. For an agent running unattended, it is most of the interesting behavior. And there are two ways a tool can handle it.

The dishonest way: return a 200 with empty or partially fabricated fields, and bill the request. The caller cannot easily tell a real result from a fake one. A human might. An agent will not. It will take the fabricated fields and reason on top of them, confidently, forever.

The honest way: return an explicit, structured failure that says what went wrong, and charge nothing for it. The caller, human or agent, can branch on it.

Two design decisions that follow from this

1. Do not bill failed extractions. This sounds like a pricing gimmick. It is actually an incentive-alignment decision. If a provider charges for failures, they have no reason to detect them well. Their interest is to return something and move on. If they do not charge for failures, detecting failure accurately is suddenly in their own interest too. Pricing shapes behavior, including the vendor's.

It also makes your retry logic sane. If every blocked page and every transient timeout costs money, your agent's retry loop is a slow quota leak you will not notice until the bill arrives. When failures are free, you can retry sensibly without watching credits evaporate.

2. Return machine-readable diagnostics, not a string. Free-text errors are fine for humans and useless for agents. The agent needs a stable field to switch on. Something like:

{
  "success": false,
  "diagnostics": {
    "reason": "login_required",
    "retryable": false,
    "suggested_action": "Page needs an authenticated session. Use credentials or a different source."
  }
}

With a stable reason enum, an agent can do the obvious right things: retry on timeout, give up on captcha, switch source on access_denied, escalate on login_required. No string parsing, no guessing.

"But it is not the cheapest"

Honesty cuts both ways, so: a no-bill-on-failure model is not automatically the cheapest option. If your job is to pull raw HTML at the lowest possible cost per gigabyte, a bare proxy or a budget HTML endpoint will beat it. That is a real use case and those tools are fine for it.

The honest-failure model wins for a different job: structured extraction feeding an agent or a pipeline, where a wrong-but-confident result is more expensive than a clean failure. If a fabricated price or a hallucinated contact makes it into your product, the cost of that is not measured in credits.

How to evaluate any provider on this

Next time you trial a scraping or extraction API, do not just test the happy path. Point it at three pages you know are blocked: a CAPTCHA wall, a login-only page, and a heavy single-page app. Then look at two things:

Did it charge you for those three?
Could a program tell, from the response alone, that they failed and why?

If the answers are "yes, it charged me" and "no, not really," your agents are going to have a bad time, and so is your bill.

I built Haunt around exactly these two decisions, because I kept hitting this problem in my own agent projects. But the point stands whatever you use: in the agent era, how a tool fails is more important than how it succeeds. Pick tools that fail honestly, and price that failure at zero.