DEV Community: Zee

Your scraping API shouldn't charge you for failed requests

Zee — Tue, 23 Jun 2026 08:32:38 +0000

I have used most of the web scraping APIs. They are good at the happy path: give them a reachable page, get back HTML or JSON. But almost all of them share a quietly expensive habit. They bill you for the request even when the page was blocked and they handed you garbage.

In the era of human-run scrapers, you could shrug that off. A person looked at the output, saw it was junk, and moved on. In the era of agents, it is a real problem, and it is worth thinking about why.

Failure is not the exception, it is the workload

If you point any extraction tool at the open web at scale, a meaningful slice of requests will fail: CAPTCHAs, login walls, rate limits, geo blocks, JavaScript shells that never settle, pages that are just thin. That is not an edge case. It is a constant percentage of every job you run.

So the failure path is not a corner of the product. For an agent running unattended, it is most of the interesting behavior. And there are two ways a tool can handle it.

The dishonest way: return a 200 with empty or partially fabricated fields, and bill the request. The caller cannot easily tell a real result from a fake one. A human might. An agent will not. It will take the fabricated fields and reason on top of them, confidently, forever.

The honest way: return an explicit, structured failure that says what went wrong, and charge nothing for it. The caller, human or agent, can branch on it.

Two design decisions that follow from this

1. Do not bill failed extractions. This sounds like a pricing gimmick. It is actually an incentive-alignment decision. If a provider charges for failures, they have no reason to detect them well. Their interest is to return something and move on. If they do not charge for failures, detecting failure accurately is suddenly in their own interest too. Pricing shapes behavior, including the vendor's.

It also makes your retry logic sane. If every blocked page and every transient timeout costs money, your agent's retry loop is a slow quota leak you will not notice until the bill arrives. When failures are free, you can retry sensibly without watching credits evaporate.
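To put rough numbers on that leak, here is a back-of-envelope sketch. It assumes each attempt succeeds independently with the same probability and that you retry until one succeeds, which only models transient failures (a hard login wall never succeeds). The function name and the example price are mine, not any provider's.

```python
def cost_per_success(price: float, success_rate: float, bill_failures: bool) -> float:
    """Expected spend to get one successful result when retrying until success.

    Assumes independent attempts with a fixed success rate (an illustrative
    simplification, not a model of any real provider).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Retrying until success takes 1 / success_rate attempts on average.
    expected_attempts = 1 / success_rate
    # If failures are free, only the single successful attempt is billed.
    billed_attempts = expected_attempts if bill_failures else 1.0
    return price * billed_attempts


# Hypothetical price of 0.01 per request and an 80% success rate.
billed = cost_per_success(0.01, 0.8, bill_failures=True)
free = cost_per_success(0.01, 0.8, bill_failures=False)
print(f"billed failures: {billed:.4f}, free failures: {free:.4f}")
```

At an 80% success rate the billed model costs 25% more per useful result, and the gap widens quickly as the success rate drops.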

2. Return machine-readable diagnostics, not a string. Free-text errors are fine for humans and useless for agents. The agent needs a stable field to switch on. Something like:

{
  "success": false,
  "diagnostics": {
    "reason": "login_required",
    "retryable": false,
    "suggested_action": "Page needs an authenticated session. Use credentials or a different source."
  }
}

With a stable reason enum, an agent can do the obvious right things: retry on timeout, give up on captcha, switch source on access_denied, escalate on login_required. No string parsing, no guessing.
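A minimal sketch of what that branching could look like on the caller's side. The response shape follows the JSON above; the `Action` names, the reason strings other than `login_required`, and the fallback rule for unknown reasons are my assumptions for illustration.

```python
from enum import Enum


class Action(Enum):
    USE_DATA = "use_data"
    RETRY = "retry"
    SWITCH_SOURCE = "switch_source"
    ESCALATE = "escalate"
    GIVE_UP = "give_up"


# Assumed reason strings; a real provider's enum may differ.
REASON_TO_ACTION = {
    "timeout": Action.RETRY,
    "captcha": Action.GIVE_UP,
    "access_denied": Action.SWITCH_SOURCE,
    "login_required": Action.ESCALATE,
}


def next_action(response: dict) -> Action:
    """Pick the agent's next step from a structured extraction response."""
    if response.get("success") is True:
        return Action.USE_DATA
    diagnostics = response.get("diagnostics") or {}
    reason = diagnostics.get("reason")
    if reason in REASON_TO_ACTION:
        return REASON_TO_ACTION[reason]
    # Unknown reason: trust the retryable flag, otherwise hand off to a human.
    return Action.RETRY if diagnostics.get("retryable") else Action.ESCALATE


failure = {
    "success": False,
    "diagnostics": {"reason": "login_required", "retryable": False},
}
print(next_action(failure))  # Action.ESCALATE
```

The useful property is the last branch: a reason the agent has never seen still resolves to a safe action instead of being treated as data.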

"But it is not the cheapest"

Honesty cuts both ways, so: a no-bill-on-failure model is not automatically the cheapest option. If your job is to pull raw HTML at the lowest possible cost per gigabyte, a bare proxy or a budget HTML endpoint will beat it. That is a real use case and those tools are fine for it.

The honest-failure model wins for a different job: structured extraction feeding an agent or a pipeline, where a wrong-but-confident result is more expensive than a clean failure. If a fabricated price or a hallucinated contact makes it into your product, the cost of that is not measured in credits.

How to evaluate any provider on this

Next time you trial a scraping or extraction API, do not just test the happy path. Point it at three pages you know are blocked: a CAPTCHA wall, a login-only page, and a heavy single-page app. Then look at two things:

Did it charge you for those three?
Could a program tell, from the response alone, that they failed and why?

If the answers are "yes, it charged me" and "no, not really," your agents are going to have a bad time, and so is your bill.

I built Haunt around exactly these two decisions, because I kept hitting this problem in my own agent projects. But the point stands whatever you use: in the agent era, how a tool fails is more important than how it succeeds. Pick tools that fail honestly, and price that failure at zero.

How to test whether your web extraction API is lying to your agent

Zee — Fri, 05 Jun 2026 12:40:07 +0000

The dangerous part of web extraction is not the error.

The dangerous part is a clean JSON response that looks correct and is not.

If an AI agent uses that output, the mistake does not stay inside a scraper. It moves into a report, a lead list, a price alert, a CRM update, or an automated decision.

So before you trust any web extraction API in an agent workflow, test whether it fails honestly.

Here is the checklist I use.

1. Test a real page, not a demo page

Demo pages are usually polite little museum exhibits.

Use pages that behave like the actual internet:

a JavaScript-heavy product page
a search results page
a page with missing fields
a login-gated page
a blocked or rate-limited page
a thin page that has layout but little useful content

If the tool only looks good on clean static pages, you have not learned much.

2. Check whether the API separates fetch success from extraction success

A page can return HTTP 200 and still be useless.

Common examples:

the final page is a login screen
the visible content is a bot challenge
the useful data loads after hydration and never appeared in the fetch
the page contains a generic country selector instead of the product
the source has navigation text but no actual item data

The extractor should not treat these as success just because the request completed.

A better response says something like:

{
  "status": "failed",
  "failure_type": "login_required",
  "extracted": null,
  "next_step": "Use a public URL, authorised access, or sample content."
}

That is boring.

Boring is good here.

Boring means the agent can stop safely.

3. Ask for fields that are not on the page

This is the easiest lie detector.

Ask the extractor for something impossible:

Extract the product name, price, stock status, warranty length, CEO name, office address, and refund policy.

If the page only contains product name and price, the API should say the other fields are missing or low confidence.

It should not politely invent them because the schema asked nicely.
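You can automate a crude version of this test. The sketch below flags any non-null field whose value does not appear literally in the page text. It is deliberately blunt: a value the extractor normalised (say, "in stock" derived from "Available now") will also be flagged, so treat hits as "go and look", not as proof. The function name is mine.

```python
def ungrounded_fields(extracted: dict, page_text: str) -> list[str]:
    """Return the fields whose values cannot be found verbatim in the page text."""
    # Normalise case and whitespace on both sides before comparing.
    haystack = " ".join(page_text.lower().split())
    suspect = []
    for field, value in extracted.items():
        if value is None:
            continue  # an honest "missing" is exactly what we want
        needle = " ".join(str(value).lower().split())
        if needle not in haystack:
            suspect.append(field)
    return suspect


page = "Acme Kettle   Price: $29.99   Add to basket"
result = {
    "name": "Acme Kettle",
    "price": "$29.99",
    "warranty": "2 years",  # not on the page
    "ceo_name": None,       # honestly reported as missing
}
print(ungrounded_fields(result, page))  # ['warranty']
```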

4. Inspect blocked-page behaviour

A blocked page should not become a fake product.

A login wall should not become a company profile.

A CAPTCHA page should not become a price list.

The tool should classify the failure. Useful failure classes include:

login_required
access_denied
captcha_required
rate_limited
thin_public_content
not_found
timeout

The exact names matter less than the behaviour: the caller should know what happened.
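As a rough illustration of the classification step, here is a sketch that maps a status code and the visible text to one of those classes. The marker phrases and the 200-character threshold are assumptions I picked for the example, and `timeout` is left out because it would be raised by the fetch layer before any text exists to inspect.

```python
from typing import Optional

# Assumed marker phrases; real challenge pages vary a lot.
TEXT_MARKERS = [
    ("captcha_required", ("verify you are human", "captcha")),
    ("login_required", ("sign in to continue", "log in to continue", "login required")),
    ("access_denied", ("access denied",)),
]


def classify_failure(status_code: int, visible_text: str, min_chars: int = 200) -> Optional[str]:
    """Return a failure class, or None when the page looks usable."""
    if status_code == 404:
        return "not_found"
    if status_code == 429:
        return "rate_limited"
    # Check the text before 401/403: challenge pages often arrive as a 403 or a 200.
    lowered = visible_text.lower()
    for failure_type, markers in TEXT_MARKERS:
        if any(marker in lowered for marker in markers):
            return failure_type
    if status_code == 401:
        return "login_required"
    if status_code == 403:
        return "access_denied"
    if len(visible_text.strip()) < min_chars:
        return "thin_public_content"
    return None


print(classify_failure(200, "Please verify you are human to continue"))
```

Note that a 200 status gets no special treatment here, which is the whole point: the verdict comes from what the page says, not from whether the request completed.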

5. Look for evidence, not just output

For agent workflows, I want some trace of why the tool believed the result.

That can be simple:

final URL type
whether visible content was found
whether structured data was found
missing fields
confidence
response mode used

You do not need a forensic novel. You need enough context to avoid trusting a hallucinated row in a spreadsheet.

6. Test the no-key path

If the tool has an MCP server or agent integration, the first run matters.

A good no-key or demo path should answer:

what the tool does
how to run one safe demo
where to get a key
what command to copy next
what failures mean

If the missing-key error just says 401, users will bounce. Agents will also be bad at recovering from that without extra instructions.

7. Treat honest failure as a product feature

This is the bit people get wrong.

They think users only want successful JSON.

Users want useful truth.

If the page is blocked, say it is blocked.

If the data is missing, say it is missing.

If the extraction is low confidence, say that.

A failed extraction that explains itself is better than a successful-looking lie.

That is the principle behind Haunt API, which is what I am building.

Disclosure: Haunt is my project. It takes a known public URL and a plain-English prompt, then returns structured JSON when the page provides enough evidence. It also has an MCP server for agent workflows.

Try the demo:

https://hauntapi.com/demo

Docs:

https://hauntapi.com/docs

MCP package:

npx -y @hauntapi/mcp-server

REST demo:

curl -X POST https://hauntapi.com/v1/demo/extract

If your agent is going to act on web data, the extraction layer has to be honest.

Otherwise you have not built automation.

You have built a very confident rumour machine.

Stop pretending your scraper worked: honest JSON for AI agents

Zee — Mon, 01 Jun 2026 18:39:57 +0000

Most scraper demos lie by accident.

They show the happy path: one URL, one clean page, one neat JSON object. Then the first real user tries a marketplace search page, a login wall, a JavaScript shell, a rate-limited product page, or a site that serves different HTML to every fetch path.

The response still comes back as JSON, so everyone relaxes.

That is the trap. A JSON response is not the same thing as a useful extraction.

The failure mode agents hate

AI agents do not just need scraped text. They need to know what happened.

Bad extraction output looks like this:

{
  "title": "Example product",
  "price": "$29.99",
  "availability": "in stock"
}

That looks fine until you inspect the source and discover the page was a login prompt, a bot challenge, or a thin JavaScript shell. The extractor filled the schema because the schema was requested. Helpful. Like a smoke alarm that hums a little song while the kitchen burns.

Better extraction output separates the data from the confidence and the failure class:

{
  "status": "failed",
  "failure_type": "login_required",
  "confidence": 0.94,
  "extracted": null,
  "evidence": {
    "final_url_type": "restricted_page",
    "visible_content": "login prompt",
    "structured_data_found": false
  },
  "next_step": "Use an authorised source, public item URL, feed, API, or sample HTML."
}

That is less flashy. It is also much more useful.

The useful contract is not “scrape anything”

“Scrape anything” is usually a warning label wearing lipstick.

For agent workflows, the better contract is:

Return structured data when the page provides enough evidence.
Return a specific honest failure when it does not.
Preserve enough metadata for the caller to decide what to do next.
Never invent fields just because a prompt asked nicely.

This matters for ecommerce, lead enrichment, price monitoring, competitor tracking, procurement, and internal research agents. If the agent cannot tell the difference between “product unavailable”, “page blocked”, “login required”, and “the parser guessed”, it will make bad decisions with a straight face.

What I mean by honest failure

An honest extraction system should classify common failures explicitly:

login_required: public fetch reached a sign-in wall.
captcha_required: the target presented a challenge.
access_denied: the target refused access.
thin_public_content: the visible public page does not contain enough useful data.
not_found: the page genuinely appears missing.
timeout: the target or render path did not finish within budget.
unsupported_source: the input is outside the allowed fetch policy.

That last one matters. Some sources need permission, an account, a feed, a partnership, or a customer-provided export. Pretending otherwise is how “automation” turns into reputation damage.

Why this is an MCP problem too

MCP makes it easier for agents to call tools. Good.

It also makes it easier for agents to call bad tools confidently. Less good.

If an MCP tool says “extract product data from this page”, the caller needs more than a blob of text. It needs a result shape that tells the agent whether the answer is safe to use.

A decent MCP extraction response should expose:

mode used, such as static HTML, browser render, deterministic parser, or LLM-assisted extraction,
confidence,
item counts,
failure type when blocked,
whether the result came from visible page evidence,
a bounded next step.

That gives the agent a decision boundary. Without it, the agent just spreads the lie downstream, but faster.
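Here is one way an agent-side wrapper might turn those fields into a yes/no decision. The key names follow the JSON examples earlier in this post where they exist (`status`, `failure_type`, `confidence`, `extracted`); the `"ok"` status value, the `from_visible_evidence` key, and the 0.8 threshold are assumptions for illustration.

```python
def safe_to_use(result: dict, min_confidence: float = 0.8) -> bool:
    """Decide whether an extraction result is safe to act on."""
    if result.get("status") != "ok":  # assumed success value
        return False
    if result.get("failure_type"):
        return False
    if not result.get("from_visible_evidence", False):  # hypothetical key
        return False
    confidence = result.get("confidence")
    # Reject missing or non-numeric confidence (bool is an int subclass, so exclude it).
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return confidence >= min_confidence and result.get("extracted") is not None


good = {
    "status": "ok",
    "confidence": 0.91,
    "from_visible_evidence": True,
    "extracted": {"title": "Example product"},
}
blocked = {"status": "failed", "failure_type": "login_required", "extracted": None}
print(safe_to_use(good), safe_to_use(blocked))  # True False
```

Everything defaults to "no": a missing field is treated the same as a failing one.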

A practical pattern

For production extraction, I like this rough flow:

Fetch the page through the cheapest safe path.
Classify obvious blocks before asking an LLM anything.
Try deterministic parsing for known structures: JSON-LD, tables, product cards, metadata, feeds.
Use browser rendering only when the page actually needs it.
Ask an LLM to structure evidence, not to hallucinate missing evidence.
Attach confidence and failure metadata.
Make quota and billing count successful useful work, not random provider attempts.
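For the deterministic parsing step, JSON-LD is usually the cheapest win. A minimal sketch using only the standard library; the class and function names are mine, and it silently skips blocks that are not valid JSON rather than trying to repair them.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: no evidence, so record nothing


def extract_json_ld(html: str) -> list:
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items


sample = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Kettle", "offers": {"price": "29.99"}}'
    "</script></head><body></body></html>"
)
print(extract_json_ld(sample))
```

If this returns the fields you need, there is nothing left for the LLM to guess at, which is the reason to try it first.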

The boring bits are the product.

Anyone can make a demo that extracts one page. The hard part is making the system fail in a way the caller can trust.

Example: agent-readable discovery

This also affects how tools get discovered.

If you run an agent-readable directory or service registry, a vague card that says “web scraper API” is not enough. Agents need to know:

what input the service accepts,
what output shape it returns,
what it refuses to do,
what authentication is required,
what a first safe test call looks like,
what failure classes mean.

That is why I like service cards over logo walls. A human can infer a lot from branding. Agents need contracts.

A small demo path

I am building this into Haunt API, a web extraction API with an MCP surface, and listing it through OpenInvoke, an agent-readable service directory.

Useful starting points:

Haunt MCP web extraction use case: https://hauntapi.com/use-cases/mcp-server-for-web-scraping?utm_source=devto&utm_medium=article&utm_campaign=seven_day_push_2026_06_day1_devto&utm_content=honest_json_agents
Haunt docs and demo path: https://hauntapi.com/docs?utm_source=devto&utm_medium=article&utm_campaign=seven_day_push_2026_06_day1_devto&utm_content=docs_demo
OpenInvoke agent-readable directory idea: https://openinvoke.com/agent-readable-api-directory/?utm_source=devto&utm_medium=article&utm_campaign=seven_day_push_2026_06_day1_devto&utm_content=service_cards

The pitch is deliberately not “bypass every website”. That is not the game.

The better game is: extract what is legitimately available, say when it is not, and give agents a result they can reason about without pretending blocked pages are products.

That is less magical. It is also the version that does not poison your workflow.

When your web extraction tool should fail loudly instead of returning pretty lies

Zee — Tue, 12 May 2026 19:25:03 +0000

A web extraction API has one job that sounds boring until it fails:

return the data that exists, or admit that it could not get it.

That second half matters more than most people want to admit.

When you put an LLM at the end of a scraping pipeline, you get a nasty failure mode. The fetch fails, the page is blocked, the PDF text is empty, or the site returns a CAPTCHA page, and the model still tries to be helpful. Helpful, in this case, means inventing plausible JSON.

That is worse than a 500.

A 500 tells your pipeline to retry, route, alert, or skip. Fabricated JSON quietly poisons whatever comes next.

The pattern we ended up using

For Haunt API, the extraction path is deliberately boring before it is clever:

1. fetch the page directly
2. fall back through stronger fetch/render paths when needed
3. inspect what actually came back
4. only ask the model to extract when there is real page content
5. return a structured failure when the page is inaccessible or clearly a verification wall

The key part is step 3. Do not treat “HTTP 200” as “we got the page”. A lot of sites return a successful status code for:

login walls
consent walls
CAPTCHA pages
JavaScript shells with no meaningful content
PDF wrappers with empty text
soft-block pages that look like normal HTML to a naive parser

If you pass that straight to an LLM and ask for product names, prices, company details, or whatever else, you are inviting fiction.

What a good failure looks like

A good extraction failure should be boring and machine-readable:

{
  "success": false,
  "error_type": "captcha_required",
  "message": "The target page requires human verification before extraction.",
  "url": "https://example.com/product"
}

Or:

{
  "success": false,
  "error_type": "empty_content",
  "message": "The page was reachable, but no extractable content was found.",
  "url": "https://example.com/report.pdf"
}

That gives the caller something useful:

retry later
ask for credentials
use a different source
mark the record unresolved
escalate to a human

What it should not do is return:

{
  "company_name": "Example Holdings Ltd",
  "revenue": "$12.4M",
  "employees": 87
}

when none of that appeared on the page.

Tiny haunted spreadsheet, now with investor-grade hallucinations. Lovely.

Simple guardrails you can add

If you are building this yourself, add checks before extraction:

def has_meaningful_content(text: str) -> bool:
    if not text or len(text.strip()) < 200:
        return False

    bad_markers = [
        "verify you are human",
        "checking your browser",
        "captcha",
        "enable javascript",
        "access denied",
        "login required",
    ]

    lowered = text.lower()
    return not any(marker in lowered for marker in bad_markers)

That is not enough on its own, but it catches a surprising amount of garbage before the model gets a chance to decorate it.

Also make the model answer from evidence, not vibes:

Extract only fields that are explicitly present in the provided page content.
If a field is missing, return null.
If the content is not the requested page, return an extraction_error object.
Do not infer, guess, or fill gaps from general knowledge.

Then validate the output. If every field is mysteriously perfect after a weak fetch, be suspicious. The machine is smiling too much.
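That suspicion is easy to encode. A rough sketch, assuming you still have the fetched page text next to the model's output; the function name and the 200-character threshold are my choices for the example.

```python
def looks_too_good(page_text: str, extracted: dict, min_chars: int = 200) -> bool:
    """Flag results where a weak fetch still produced a fully populated schema."""
    weak_fetch = len(page_text.strip()) < min_chars
    # "Perfect" here means at least one field was requested and none came back null.
    all_filled = bool(extracted) and all(value is not None for value in extracted.values())
    return weak_fetch and all_filled


thin_page = "Please enable cookies."
confident_output = {"company_name": "Example Holdings Ltd", "revenue": "$12.4M", "employees": 87}
print(looks_too_good(thin_page, confident_output))  # True
```

A hit does not prove fabrication, but it is a cheap reason to drop the row or route it to review.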

Where MCP makes this sharper

Agent workflows make this problem worse because the output is not always going straight to a human. Claude, Cursor, or another agent may call a tool, receive JSON, and continue planning from it.

Bad extraction becomes bad reasoning.

So for MCP tools, I think the contract should be stricter:

return structured JSON when extraction is grounded
return a structured error when it is not
expose the failure reason clearly enough for the agent to choose the next step

That is what we are building toward with Haunt API: known URL in, natural-language prompt in, structured JSON out, but no pretending when the page cannot actually be read.

If you are building agents that depend on web data, this is the boring line that matters:

no data is better than fake data.

Haunt API: https://hauntapi.com

Python SDK:

pip install hauntapi

MCP server:

npx -y @hauntapi/mcp-server

The One Lesson I Learned Building a Web Extraction API in 2026

Zee — Fri, 08 May 2026 04:16:13 +0000

I spent the last few months building a web extraction API. Here's what surprised me most: developers don't need another scraper. They need extraction that stops breaking.

Every web scraping thread I read has the same arc:

Write a BeautifulSoup/Scrapy scraper
It works for two weeks
The target site changes one div
Scraper breaks at 2am
Dev swears, rewrites selectors
Repeat

The alternative everyone reaches for next: "I'll use Playwright. No, I'll use Puppeteer. No, a headless browser with proxy rotation. No..."

But here's the thing most people miss: the problem isn't fetching. It's parsing.

The extraction-first approach

At Haunt API (which I built), we flipped the model. Instead of fetch-then-parse, the user describes what they want in plain English: "Extract product name, price, and stock status from this page."

The AI reads the page like a human would — it understands context, not CSS selectors. When the site changes layout next week, the extraction still works because the prompt targets meaning, not markup.

What matters in 2026

Cloudflare bypass is table stakes now. If your extraction service can't handle Cloudflare-protected sites, it's a hobby project.
Structured JSON output matters more than markdown. LLMs consume JSON; humans debug with it.
Failed extractions shouldn't cost anything. You shouldn't pay for "the page loaded but I couldn't find what you asked for."
Natural language prompts > CSS selectors. Site maintainers change divs. They don't change meaning.

A practical example

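import requests

resp = requests.post(
    "https://hauntapi.com/v1/extract",
    headers={"X-API-Key": "your_key_here"},
    json={
        "url": "https://books.toscrape.com",
        "prompt": "Extract all book titles and their prices"
    },
    timeout=60,  # without a timeout, requests can wait forever on a stalled connection
)
resp.raise_for_status()  # surface HTTP errors instead of parsing an error body

print(resp.json()["data"])
# => [{"title": "A Light in the Attic", "price": "£51.77"}, ...]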
# => [{"title": "A Light in the Attic", "price": "£51.77"}, ...]

That's one HTTP call. No selectors. No Playwright. No parsing.

The real lesson

Building the tool taught me that the web extraction market in 2026 is consolidating around two poles: platforms (Apify, with thousands of pre-built scrapers and scheduling) and extraction APIs (tools that focus on making one extraction call reliable).

If you're building a product that needs web data, pick the right pole. If you need one-off reliable extraction of specific data points, an extraction-first API will save you more time than another headless browser setup.

Disclosure: I built Haunt API. Free tier is 100 requests/month if you want to try it: https://hauntapi.com

I’m looking for ugly URLs that break normal scrapers

Zee — Fri, 01 May 2026 21:24:30 +0000

Most scraper demos use friendly pages.

A blog post.
A docs page.
A fake ecommerce product.
Something clean enough that BeautifulSoup could probably manage it after a coffee.

That is not where web extraction gets annoying.

The annoying cases are the ugly ones:

JavaScript-rendered pages
pages with no stable CSS selectors
pages where the useful data is mixed into layout sludge
Cloudflare / bot-wall weirdness
vendor pages where the table changes every week
docs pages where the answer is spread across several sections
pages that look simple in a browser but return nonsense to curl

Those are the URLs I actually care about.

The useful test

The test is not:

“Can this tool scrape example.com?”

The test is:

“Can I send it a real page and ask for the specific thing I need, without writing a custom parser?”

Example:

curl -X POST https://hauntapi.com/v1/extract \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/some-awful-page",
    "prompt": "Extract product names, prices, availability, and the source URL as JSON"
  }'

That is the shape I built Haunt API around:

URL in.
Natural-language extraction prompt in.
Structured JSON out.

No selector map.
No one-off parser.
No “the site changed one div class and now everything is dead” ritual sacrifice.

What I want to test next

I’m collecting awkward public URLs that normal scrapers struggle with.

Not private data.
Not login-only pages.
Not anything illegal or creepy.

Just the normal developer pain pile:

public product pages
public directories
public docs
public event listings
public price pages
public content pages with messy markup

If you have one of those “this page should be easy but somehow isn’t” URLs, send it over.

I’ll try to turn it into clean JSON or Markdown and share what worked / what failed.

The live docs are here:

https://hauntapi.com/docs

And the hard-URL proof flow is here:

https://hauntapi.com/services

I’m mainly interested in the failures. Friendly demos are cheap. Broken real pages are where the bodies are buried.

Your SaaS cancellation page is where retention goes to die

Zee — Fri, 01 May 2026 21:21:43 +0000

Most SaaS teams treat churn like a dashboard problem.

They connect Stripe, stare at monthly churn, maybe add a chart, then wonder why nothing changes.

That is post-mortem work.

The customer has already left. The money is already gone. The dashboard is just reading the gravestone.

The useful moment is earlier: the cancellation page.

That is the one place where the customer is still present, still logged in, still telling you they are about to leave, and still possibly recoverable.

Here is the simple teardown I use when looking at a SaaS cancellation flow.

1. Do you know why they are leaving?

If the page only has a red "cancel subscription" button, you are throwing away the most useful data in the business.

At minimum, ask for one reason:

too expensive
missing feature
not using it enough
switched to another tool
temporary pause
support/product issue
other

Do not make it a 20-field survey. That is not research, that is punishment.

One click is enough.

2. Does the save offer match the reason?

This is where most flows go stupid.

If someone says "too expensive", offer a discount or downgrade.

If someone says "not using it enough", offer a pause or reminder.

If someone says "missing feature", show the closest workaround or ask if they want to be told when it ships.

If someone says "temporary pause", do not beg. Give them a clean pause option.

A generic "20% off if you stay" offer is better than nothing, but it is still lazy.
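The matching itself can be a lookup table. A minimal sketch: the reason keys and offer labels are placeholder strings I made up to mirror the cases above, and reasons with no tailored offer return `None` so the customer goes straight to a clean cancel.

```python
from typing import Optional

# Hypothetical reason keys and offer labels mirroring the cases described above.
SAVE_OFFERS = {
    "too_expensive": "discount_or_downgrade",
    "not_using_enough": "pause_or_reminder",
    "missing_feature": "workaround_or_notify_on_ship",
    "temporary_pause": "clean_pause",
}


def save_offer(reason: str, fallback: Optional[str] = None) -> Optional[str]:
    """Return the offer that matches the stated reason, or the fallback."""
    return SAVE_OFFERS.get(reason, fallback)


print(save_offer("too_expensive"))                 # discount_or_downgrade
print(save_offer("switched_to_another_tool"))      # None: no offer, just let them cancel
print(save_offer("other", fallback="generic_20_percent_off"))
```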

3. Are you saving the subscription or just annoying them?

Dark pattern cancellation flows might reduce churn for five minutes and increase hatred forever.

Do not hide the cancel button.
Do not add five fake confirmation screens.
Do not make them email support.
Do not trap them.

A good save flow is clear:

"You can cancel now, but here is the one relevant option that might fit better."

That is retention. Not hostage-taking.

4. Are failed payments mixed up with voluntary churn?

These are different problems.

A failed card is not the same as someone choosing to leave.

Failed payment recovery needs dunning, retries, backup payment methods, and clear billing emails.

Voluntary churn needs reason capture, matching offers, and product feedback loops.

If your churn dashboard lumps them together, your action plan will be mud.

5. Can you see what happens after the save attempt?

Track the basics:

cancellation started
reason selected
offer shown
offer accepted
cancellation completed
saved revenue

If you cannot see these steps, you cannot improve the flow. You are guessing in expensive darkness.
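If you log one event per step, the summary is a few lines of code. A sketch under assumed names: each event is a dict with a `step` key, and accepted offers carry an `mrr` value. Neither is a real Stripe or SaveMyChurn schema.

```python
from collections import Counter

STEPS = [
    "cancellation_started",
    "reason_selected",
    "offer_shown",
    "offer_accepted",
    "cancellation_completed",
]


def funnel_summary(events: list[dict]) -> dict:
    """Count each step and derive accept rate, save rate, and saved revenue."""
    counts = Counter(event["step"] for event in events)
    started = counts["cancellation_started"]
    shown = counts["offer_shown"]
    accepted = counts["offer_accepted"]
    saved = sum(e.get("mrr", 0) for e in events if e["step"] == "offer_accepted")
    return {
        "counts": {step: counts[step] for step in STEPS},
        "offer_accept_rate": accepted / shown if shown else 0.0,
        "save_rate": accepted / started if started else 0.0,
        "saved_revenue": saved,
    }


events = [
    {"step": "cancellation_started"}, {"step": "reason_selected"},
    {"step": "offer_shown"}, {"step": "offer_accepted", "mrr": 49.0},
    {"step": "cancellation_started"}, {"step": "reason_selected"},
    {"step": "offer_shown"}, {"step": "cancellation_completed"},
]
print(funnel_summary(events))
```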

Tiny useful audit

Look at your cancellation page and ask:

What reason would a customer give here?
What offer would they see next?
Would that offer actually match the reason?
Would I personally find this flow fair?
Can I measure whether it saved anything?

If the answer is mostly "no", the fix is probably not another dashboard.

It is a better cancellation moment.

I built SaveMyChurn around this exact idea: catch the customer while they are still in the cancellation flow, ask why they are leaving, and show the right recovery offer instead of just reporting churn after the fact.

If you want to sanity-check your own flow, the low-friction page is here:

https://savemychurn.com/cancellation-audit

No Stripe key needed for the first look. Just use it as a teardown lens before you start handing tools access to billing data.

And if you do nothing else, add the one-question reason step. Boring, cheap, and annoyingly effective.

Most SaaS churn dashboards are post-mortems

Zee — Fri, 01 May 2026 06:51:24 +0000

If your churn dashboard only tells you that someone left, it is not a recovery system. It is a gravestone with charts.

The useful question is not just “what is our churn rate?”

It is:

who is likely to cancel?
why are they cancelling?
what save path should they see before the hard exit?
what failed payments are quietly sitting in Stripe?
what is a 5% retention improvement worth in actual MRR?
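That last question is plain arithmetic once you pin down what "5%" means. In this sketch I assume it means monthly churn drops by five percentage points, and I ignore new sales and expansion so only the retention effect shows. The function name and the example figures are hypothetical.

```python
def retained_mrr_gain(mrr: float, monthly_churn: float, points: float = 0.05, months: int = 1) -> float:
    """Extra MRR still on the books after `months` if monthly churn drops by `points`.

    Assumes constant churn and no new revenue, purely for illustration.
    """
    improved_churn = monthly_churn - points
    if not 0 <= improved_churn <= monthly_churn <= 1:
        raise ValueError("need 0 <= monthly_churn - points <= monthly_churn <= 1")
    baseline = mrr * (1 - monthly_churn) ** months
    improved = mrr * (1 - improved_churn) ** months
    return improved - baseline


# Hypothetical business: 10,000 MRR with 8% monthly churn.
print(round(retained_mrr_gain(10_000, 0.08), 2))             # after one month
print(round(retained_mrr_gain(10_000, 0.08, months=12), 2))  # after a year
```

The one-month figure is simply five percent of MRR; the twelve-month figure is larger because every retained customer keeps paying in the following months.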

A lot of small SaaS teams already have the raw ingredients:

Stripe subscriptions
cancellation reasons, if they ask for them
plan and price data
retry events
customer usage signals

But the cancellation flow is usually written like a legal form:

Are you sure you want to cancel?

That is not retention. That is a trapdoor.

A better cancellation flow should branch.

If the reason is price, offer a downgrade or pause.

If the reason is temporary budget, offer a timed pause.

If the reason is missing functionality, capture the feature gap and trigger follow-up.

If the problem is failed payment, do not treat it like voluntary churn.

None of this requires a giant customer success department. It needs a simple loop:

capture the reason
match the reason to a recovery path
measure recovered revenue, not vanity clicks

I built SaveMyChurn around that idea: connect Stripe, detect churn and failed-payment leaks, and trigger personalised retention offers.

There is a free cancellation audit here if you want to see the rough shape before connecting anything:

https://savemychurn.com/cancellation-audit

And a churn calculator if you just want to see what a few retention points are worth:

https://savemychurn.com/churn-rate-calculator

The short version: if churn is only a number on your dashboard, you are already too late.

We Built a Custom Playwright Rendering Pipeline for Our MCP Server

Zee — Fri, 24 Apr 2026 07:10:26 +0000

We Built a Custom Playwright Rendering Pipeline for Our MCP Server — Here's What We Learned

At Haunt API, we build web extraction tools for AI agents. Our MCP server lets Claude and other AI assistants extract structured data from any URL. Simple enough on paper — fetch a page, parse the HTML, return JSON.

The problem? Half the internet doesn't want to be fetched.

The Problem With "Just Use Playwright"

Most web scraping tutorials go something like this:

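import asyncio
from playwright.async_api import async_playwright

async def fetch_html(url: str) -> str:
    # "async with" only works inside a coroutine, so wrap it in one
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
    return html

html = asyncio.run(fetch_html("https://example.com"))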

And that works! For a demo. For a product that real users depend on, it falls apart fast:

Sites detect headless browsers and serve captchas or empty pages
SPA pages need time to render — how long do you wait? 2 seconds? 5? 10?
You're burning resources loading images, fonts, and CSS when you only need text
Every render costs the same — no caching, no intelligence

We went through all of these. Here's how we solved each one.

Lesson 1: Don't Use One Tool For Everything

Our pipeline has three tiers, and most requests never hit Playwright:

Direct HTTP — Works for ~80% of the web. Fast, cheap, no browser needed.
FlareSolverr — Handles Cloudflare challenges and basic JS rendering.
Playwright — Full browser rendering for JS-heavy SPAs that return empty skeletons.

The key insight: we detect skeleton pages — HTML that has a <div id="root"></div> but no actual content — and only spin up the browser when we need to. Most pages don't need it.

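import re

def strip_tags(html: str) -> str:
    """Stand-in for a helper the post does not show: crude regex tag stripping."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())

def is_skeleton_html(html: str) -> bool:
    """Detect if HTML is an unrendered JS skeleton."""
    if len(html) < 500:
        return True

    # Strip scripts/styles and check for visible text
    text = strip_tags(html)
    if len(text) < 100:
        return True

    # Common SPA markers
    skeleton_markers = [
        '<div id="root"></div>',
        '<div id="__next"></div>',
        'You need to enable JavaScript',
    ]
    return any(marker in html for marker in skeleton_markers)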
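To show how the tiers and a usability check fit together, here is a simplified sketch of the fallback loop. It is not our production code: the fetchers are plain callables you supply (stand-ins for direct HTTP, FlareSolverr, and Playwright), and the result shape and names are assumptions for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str,
    tiers: Iterable[Tuple[str, Fetcher]],
    is_usable: Callable[[str], bool],
) -> dict:
    """Try each tier in order and stop at the first one that returns usable HTML."""
    attempts = []
    for name, fetch in tiers:
        attempts.append(name)
        try:
            html = fetch(url)
        except Exception:
            # Sketch only: a real pipeline would log and classify the error here.
            continue
        if html and is_usable(html):
            return {"success": True, "tier": name, "html": html, "attempts": attempts}
    # No tier produced usable content: report a structured failure, not a guess.
    return {"success": False, "error_type": "empty_content", "attempts": attempts}


# Stand-in fetchers: the cheap tier returns a skeleton, the browser tier returns content.
tiers = [
    ("direct_http", lambda url: '<div id="root"></div>'),
    ("browser", lambda url: "<main>" + "Real product text. " * 20 + "</main>"),
]
result = fetch_with_fallback("https://example.com", tiers, is_usable=lambda html: len(html) > 100)
print(result["tier"], result["attempts"])  # browser ['direct_http', 'browser']
```

Because the tiers are ordered cheapest first, the expensive browser only runs when the cheaper paths have demonstrably failed.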

Lesson 2: Smart Wait Strategies Beat Fixed Timers

The worst thing about browser automation is the waiting. time.sleep(5) is either too short (page hasn't loaded) or too long (wasting time on pages that loaded instantly).

We built three concurrent wait strategies. First one to trigger wins:

Content Stability — Poll the page's visible text every 200ms. If it hasn't changed for 1 second, the content has loaded.

Network Idle — Wait for no new network requests for 500ms. Good for pages that make API calls after initial load.

Meaningful Content — Wait until the page has at least 500 characters of visible text. Catches pages that load something but aren't done yet.

import asyncio

async def wait_for_content(page, timeout=10):
    """Smart wait: detect when content has actually loaded."""
    # asyncio.wait() needs Task objects; bare coroutines are rejected on Python 3.11+
    tasks = [
        asyncio.create_task(wait_for_content_stability(page)),
        asyncio.create_task(wait_for_network_idle(page)),
        asyncio.create_task(wait_for_meaningful_content(page)),
    ]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel the strategies that lost the race
    for t in pending:
        t.cancel()
    return done.pop().result() if done else {"strategy": "timeout"}

This cut our average render time from 6 seconds to under 3.
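
To make the content-stability strategy concrete, here is a minimal sketch. It is an assumption about the shape of such a check, not the real smart_wait.py: unlike the wait_for_content_stability(page) call above, it takes a get_text coroutine function (for Playwright that could be lambda: page.inner_text("body")) so it can run without a browser. It has no timeout of its own and relies on the caller to cancel it.

```python
import asyncio
import time


async def wait_for_content_stability(get_text, poll_interval=0.2, stable_for=1.0):
    """Poll get_text() until its result has not changed for `stable_for` seconds.

    get_text is any zero-argument coroutine function returning visible text.
    """
    last_text = await get_text()
    last_change = time.monotonic()
    while time.monotonic() - last_change < stable_for:
        await asyncio.sleep(poll_interval)
        text = await get_text()
        if text != last_text:
            # Content is still changing: restart the stability window
            last_text, last_change = text, time.monotonic()
    return {"strategy": "content_stability", "chars": len(last_text)}
```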

Lesson 3: Fingerprint Rotation Matters

Headless Chromium has tells. Sites check for them. If every request comes from the same user agent with the same viewport on the same timezone, you get blocked.

We rotate fingerprints per site: the same site always sees a consistent browser (so cookies and sessions work), but different sites see different browsers. That means keying on the hostname rather than the full URL:

import hashlib
from urllib.parse import urlparse

FINGERPRINTS = [
    {"ua": "Chrome/120.0 Windows", "viewport": [1920, 1080], "locale": "en-US"},
    {"ua": "Chrome/119.0 macOS", "viewport": [1440, 900], "locale": "en-GB"},
    {"ua": "Chrome/120.0 Linux", "viewport": [1366, 768], "locale": "en-US"},
    # ... 10 total variants
]

def get_fingerprint(url: str) -> dict:
    """Deterministic per-site fingerprint selection."""
    # Hash the hostname so every page on a site maps to the same fingerprint
    host = urlparse(url).hostname or url
    idx = int(hashlib.md5(host.encode()).hexdigest(), 16) % len(FINGERPRINTS)
    return FINGERPRINTS[idx]

Lesson 4: Block What You Don't Need

When you're extracting text data, images and fonts are dead weight. We block them at the network level:

BLOCKED_RESOURCES = {
    "image", "font", "media", "texttrack", "object",
    "beacon", "csp_report", "eventsource",
}

BLOCKED_DOMAINS = {
    "google-analytics.com", "facebook.net", "doubleclick.net",
    "hotjar.com", "mixpanel.com", "segment.io",
    # ... 20+ tracking domains
}

async def route_handler(route):
    if route.request.resource_type in BLOCKED_RESOURCES:
        await route.abort()
    elif any(d in route.request.url for d in BLOCKED_DOMAINS):
        await route.abort()
    else:
        await route.continue_()

This cuts the total page payload by 40-60% on most pages, which means faster renders and less RAM.
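
One caveat with matching blocked domains by substring is that a tracker name appearing in a query string or inside a longer hostname also matches. A stricter check compares against the parsed hostname instead. This is a sketch under that assumption; is_blocked_host and the shortened domain set are illustrative, not part of the original stealth.py:

```python
from urllib.parse import urlparse

# Shortened, illustrative list
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net", "hotjar.com"}


def is_blocked_host(url: str, blocked=BLOCKED_DOMAINS) -> bool:
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)
```

In a route handler, this could replace the substring test: call is_blocked_host(route.request.url) and abort the route when it returns True.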

Lesson 5: Cache Renders, Not Requests

If two users extract data from the same URL within 5 minutes, the page probably hasn't changed. We cache the rendered HTML with a TTL:

import time
from collections import OrderedDict

class RenderCache:
    def __init__(self, max_size=50, default_ttl=300):
        self.cache = OrderedDict()
        self.max_size = max_size
        self.default_ttl = default_ttl

    def get(self, url):
        if url in self.cache:
            entry = self.cache[url]
            if time.time() - entry["cached_at"] < entry["ttl"]:
                self.cache.move_to_end(url)  # mark as most recently used
                return entry
            del self.cache[url]  # entry has expired
        return None

Cache hits skip the browser entirely and return in effectively 0ms. For an API that charges per request, this saves users money and makes responses near-instant.
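
The snippet above only shows the read path. A self-contained sketch of how the write path and LRU eviction could fit together follows; the TTLCache name, the put method, and the injectable clock are assumptions made for illustration, not the actual cache.py:

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries also expire after a TTL (in seconds)."""

    def __init__(self, max_size=50, default_ttl=300, clock=time.time):
        self.entries = OrderedDict()
        self.max_size = max_size
        self.default_ttl = default_ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping

    def put(self, url, html, ttl=None):
        self.entries[url] = {
            "html": html,
            "cached_at": self.clock(),
            "ttl": self.default_ttl if ttl is None else ttl,
        }
        self.entries.move_to_end(url)         # newest entry is most recently used
        while len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the least recently used

    def get(self, url):
        entry = self.entries.get(url)
        if entry is None:
            return None
        if self.clock() - entry["cached_at"] >= entry["ttl"]:
            del self.entries[url]             # expired
            return None
        self.entries.move_to_end(url)         # a hit refreshes recency
        return entry["html"]
```

With max_size=2, putting a third URL evicts whichever of the first two was read least recently, and any entry older than its TTL reads as a miss.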

The Architecture

Final structure — 6 modules, each with a single job:

playwright-service/
├── server.py          # FastAPI orchestration, browser lifecycle
├── fingerprint.py     # UA/viewport/locale rotation
├── smart_wait.py      # Content stability + network idle detection
├── site_detect.py     # Static vs SPA classification
├── cache.py           # LRU render cache with TTL
└── stealth.py         # Resource blocking + headless detection evasion

Each module is ~100 lines. Easy to test, easy to modify, easy to explain to new contributors.

What We Learned

Don't reach for the browser first. Most pages are server-rendered. Direct HTTP is 10x faster and 100x cheaper.
Wait smarter, not longer. Detecting when content has actually loaded saves seconds per request.
Be a moving target. Rotating fingerprints and blocking trackers keeps you under the radar.
Cache aggressively. Web pages don't change every second. A 5-minute render cache saves users money and makes your API feel fast.
Build modules, not monoliths. Each piece of the pipeline has its own concerns. Keep them separate.

The Playwright browser engine is the oven. Everything around it — the routing, the waiting, the caching, the stealth — is the recipe. That's where the actual engineering lives.

We're Haunt API — web extraction built for AI agents. If you're building with Claude, Cursor, or any AI assistant, our MCP server gives your agent the ability to extract data from any URL in one line.

We Built a Custom Playwright Rendering Pipeline for Our MCP Server — Here is What We Learned

Zee — Mon, 20 Apr 2026 19:23:07 +0000

The problem? Half the internet doesn't want to be fetched.

The Problem With "Just Use Playwright"

Most web scraping tutorials go something like this:

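import asyncio
from playwright.async_api import async_playwright

async def fetch_html(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html

html = asyncio.run(fetch_html("https://example.com"))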

And that works! For a demo. For a product that real users depend on, it falls apart fast:

Sites detect headless browsers and serve captchas or empty pages
SPA pages need time to render — how long do you wait? 2 seconds? 5? 10?
You are burning resources loading images, fonts, and CSS when you only need text
Every render costs the same — no caching, no intelligence

We went through all of these. Here is how we solved each one.

Lesson 1: Do Not Use One Tool For Everything

Our pipeline has three tiers, and most requests never hit Playwright:

Direct HTTP — Works for approximately 80% of the web. Fast, cheap, no browser needed.
FlareSolverr — Handles Cloudflare challenges and basic JS rendering.
Playwright — Full browser rendering for JS-heavy SPAs that return empty skeletons.

The key insight: we detect skeleton pages — HTML that has an empty root div but no actual content — and only spin up the browser when we need to.
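
A rough sketch of the tiering idea, assuming each tier is a plain function that returns HTML or None. The names fetch_with_fallback and looks_rendered, the tier labels, and the 500-character threshold are illustrative assumptions, not the pipeline's real code:

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]  # returns HTML, or None on failure


def looks_rendered(html: Optional[str], min_chars: int = 500) -> bool:
    """Crude stand-in for skeleton detection: enough markup and no empty root div."""
    if not html or len(html) < min_chars:
        return False
    return '<div id="root"></div>' not in html


def fetch_with_fallback(url: str, tiers: list[tuple[str, Fetcher]]) -> dict:
    """Try each tier in order (cheapest first) and stop at the first usable result."""
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception:
            html = None  # any tier error means "escalate to the next tier"
        if looks_rendered(html):
            return {"ok": True, "tier": name, "html": html}
    # Every tier failed: report that explicitly instead of returning junk
    return {"ok": False, "tier": None, "html": None}
```

Called with tiers ordered as direct HTTP, FlareSolverr, then Playwright, the browser only runs when both cheaper tiers return nothing usable.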

Lesson 2: Smart Wait Strategies Beat Fixed Timers

The worst thing about browser automation is the waiting. A fixed sleep is either too short or too long. We built three concurrent wait strategies — first one to trigger wins:

Content Stability — Poll visible text every 200ms. If unchanged for 1 second, done.
Network Idle — Wait for no new requests for 500ms.
Meaningful Content — Wait until 500+ chars of visible text exist.

This cut our average render time from 6 seconds to under 3.

Lesson 3: Fingerprint Rotation Matters

Headless Chromium has tells. We rotate fingerprints per site: the same site sees a consistent browser, different sites see different browsers. There are 10 variants, with different viewports across Windows, macOS, and Linux UAs.

Lesson 4: Block What You Do Not Need

When extracting text data, images and fonts are dead weight. We block them at the network level, along with 20+ tracking domains. This cuts the total page payload by 40-60%.

Lesson 5: Cache Renders, Not Requests

If two users extract data from the same URL within 5 minutes, the page probably has not changed. Cache hits skip the browser entirely and return in effectively 0ms.

The Architecture

Six modules, each with a single job:

server.py — FastAPI orchestration, browser lifecycle
fingerprint.py — UA/viewport/locale rotation
smart_wait.py — Content stability + network idle detection
site_detect.py — Static vs SPA classification
cache.py — LRU render cache with TTL
stealth.py — Resource blocking + headless detection evasion

Each module is approximately 100 lines. Easy to test, easy to modify.

What We Learned

Do not reach for the browser first. Most pages are server-rendered.
Wait smarter, not longer.
Be a moving target with fingerprint rotation.
Cache aggressively.
Build modules, not monoliths.

The Playwright browser engine is the oven. Everything around it — the routing, the waiting, the caching, the stealth — is the recipe. That is where the actual engineering lives.

We are Haunt API — web extraction built for AI agents. If you are building with Claude, Cursor, or any AI assistant, our MCP server gives your agent the ability to extract data from any URL.

I Built an AI That Talks People Out of Cancelling Their Subscriptions

Zee — Mon, 20 Apr 2026 15:35:05 +0000

Here's the thing about churn: by the time someone clicks "Cancel Subscription", they've already decided. Your generic "Would you like 20% off?" popup is too late and too weak.

I spent the last month building SaveMyChurn — an AI-powered churn recovery tool for Stripe SaaS founders. This is how it works, what I learned building it, and why I think most cancellation flows are doing it wrong.

The problem

I was looking at my own Stripe dashboard one day and noticed something: the cancellation flow was the most ignored piece of the entire subscription experience. People pour weeks into onboarding, feature development, marketing — and then the cancel button just... ends things. No conversation. No understanding of why.

For bootstrapped SaaS founders running £5K-50K MRR, every subscription matters. Losing 5% of your customers a month isn't a statistic — it's the difference between growing and dying.

The existing tools didn't fit. Churnkey starts at $250/month — that's a significant chunk of revenue when you're small. The cheaper options are just form builders with a discount code at the end. Nobody was actually talking to the customer.

What I built

SaveMyChurn does three things:

1. Listens to Stripe in real time

When a customer hits cancel, Stripe fires a customer.subscription.deleted webhook. SaveMyChurn catches it instantly, pulls the subscription metadata, payment history, and plan details, and builds a profile of who's leaving and why.

import stripe
from fastapi import APIRouter, Request

router = APIRouter()

# The webhook handler: this is where it starts
@router.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    # Verify the signature; webhook_secret is the endpoint's signing secret from config
    event = stripe.Webhook.construct_event(
        payload, request.headers["stripe-signature"], webhook_secret
    )

    if event["type"] == "customer.subscription.deleted":
        subscription = event["data"]["object"]
        # Build subscriber profile from Stripe data
        profile = await build_subscriber_profile(subscription)
        # Generate AI retention strategy
        strategy = await generate_retention_strategy(profile)
        # Send personalised recovery email
        await send_retention_email(profile, strategy)

    # Acknowledge every event so Stripe does not keep retrying it
    return {"received": True}

2. Generates a unique retention strategy per subscriber

This is the part I'm most proud of. Instead of a static "here's 20% off" flow, an AI strategist analyses the subscriber's behaviour — how long they've been a customer, what plan they're on, their payment history, any support tickets — and creates a genuinely personalised retention offer.

Someone cancelling after 2 months gets a different approach than someone who's been around for a year. Someone on a basic plan gets a different offer than someone on enterprise. The AI adjusts tone, offer type, discount level, and follow-up timing based on the full context.

3. Follows up automatically

One email rarely saves a cancellation. SaveMyChurn runs a multi-step sequence — initial offer, follow-up with adjusted terms, final value reminder — spaced over a few days. Each step is informed by whether they opened the previous email, clicked anything, or went silent.

The tech stack

Keeping it simple and cheap:

FastAPI backend — async Python, handles webhooks fast
MongoDB for subscriber profiles and strategy storage
Redis for caching and rate limiting
LLM via API for strategy generation — the AI strategist
Resend for transactional emails
Docker on a single VPS — the whole thing runs on one machine

The LLM cost per strategy generation is under a penny. When your competitor charges $250/month, that's a ridiculous margin.

The pricing model (and why it matters)

I went with a commission model. Monthly fee + a percentage of recovered revenue. The idea is simple: if I don't save you money, I don't make money.

This was a deliberate choice. Flat-fee tools have an incentive to get you signed up and keep you paying, regardless of results. Commission pricing means I'm motivated to actually recover subscriptions, not just ship a dashboard.

For founders at the £5K-50K MRR stage, this aligns incentives in a way that $250/month flat fees don't.

What I learned

Webhook reliability is everything. If you miss a customer.subscription.deleted event, you miss the entire recovery window. I ended up implementing retry queues and idempotency keys before anything else.
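
To illustrate the idempotency part, here is a minimal in-memory sketch that deduplicates on the event's id field. SeenEvents and handle_event are hypothetical names, and a real deployment would more likely keep the IDs in a shared store such as Redis; this is not SaveMyChurn's actual implementation:

```python
import time


class SeenEvents:
    """Remembers processed event IDs for a while so redeliveries can be skipped."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self._seen = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def contains(self, event_id: str) -> bool:
        seen_at = self._seen.get(event_id)
        return seen_at is not None and self._clock() - seen_at < self._ttl

    def add(self, event_id: str) -> None:
        now = self._clock()
        # Drop expired IDs so the table does not grow without bound
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        self._seen[event_id] = now


def handle_event(event: dict, seen: SeenEvents, process) -> str:
    """Run process(event) at most once per successfully handled event ID."""
    if seen.contains(event["id"]):
        return "duplicate"
    process(event)         # if this raises, the ID is not recorded, so a retry can run
    seen.add(event["id"])  # only mark the event done after processing succeeded
    return "processed"
```

Recording the ID only after process succeeds is deliberate: a failed attempt stays eligible for the retry queue, while a redelivered event that already succeeded is skipped.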

AI strategy > rules engine. I initially built a simple rule-based system (if cancel reason = "price" → offer discount). It was okay. The AI strategist that replaced it generates strategies I wouldn't have thought of — bundling features differently, offering plan downgrades instead of discounts, timing follow-ups based on engagement patterns.

One email is never enough. The first recovery email has maybe a 15-20% open rate. The follow-up catches another chunk. The third one gets the people who were "going to get around to it." Multi-step sequences doubled recovery rates compared to single emails.

Where it's at

SaveMyChurn is live and in production. It works end-to-end: Stripe webhook → AI strategy → personalised email sequence → dashboard showing what was saved.

If you're a bootstrapped SaaS founder on Stripe watching subscriptions slip away, give it a look. There's a free trial — no credit card required.

Your AI Agent Can't Scrape That Page. Here's How to Fix It.

Zee — Mon, 20 Apr 2026 15:16:08 +0000

You built an AI agent that needs real-time web data. Product prices, news articles, competitor info — whatever it is, you need clean HTML or JSON from a URL.

So you fire off a requests.get() and... 403 Forbidden. Cloudflare says no.

Or you get a page, but it's empty — the content loads via JavaScript after the page renders, and your HTTP client never sees it.

Sound familiar? Let's break down what's happening and how to actually solve it.

Why Your Scraping Fails

1. JavaScript Rendering

Many modern sites are SPAs. The HTML you get from a raw HTTP request is a shell, and the actual content is loaded by JavaScript after the page mounts. requests, axios, fetch: none of them execute JS.

2. Cloudflare and Bot Detection

Cloudflare fingerprints your connection:

TLS fingerprint (does your HTTP client look like a browser?)
HTTP/2 fingerprint
Browser behavior (mouse movements, JS execution patterns)
IP reputation

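Regular HTTP clients fail the fingerprint and behavior checks out of the box, and requests sent from a datacenter IP often fail the reputation check too.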

3. Complex Layouts

Even when you get the HTML, extracting structured data from it is painful. You write brittle CSS selectors that break on every layout change.

The Solutions (From Worst to Best)

Selenium/Playwright Headless Browsers

They work... sometimes. But Cloudflare detects headless Chrome. You'll spend more time maintaining anti-detection patches than building your actual product.

Rotating Proxies + Custom Headers

Expensive, slow, and fragile. You're playing whack-a-mole with detection rules.

Use an API That Handles Everything

This is where tools like Haunt API come in. It's a web extraction API built specifically for AI agents:

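import requests

resp = requests.post(
    "https://hauntapi.com/v1/extract",
    headers={"x-api-key": "your-key"},
    json={
        "url": "https://example.com/product/123",
        "prompt": "Get the product name, price, and availability"
    },
    timeout=60,  # illustrative value; without a timeout, requests can wait forever
)
resp.raise_for_status()  # surface HTTP errors instead of parsing an error body

print(resp.json()["data"])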
# {
#   "product_name": "Wireless Headphones Pro",
#   "price": "$79.99",
#   "availability": "In Stock"
# }

That's it. One API call. Cloudflare bypassed, JavaScript rendered, structured data extracted.

How It Works Under the Hood

Smart fetching — tries direct HTTP first, falls back to headless browser with anti-fingerprinting for Cloudflare-protected sites
JavaScript executes — SPA content becomes available
AI extracts the data you described in your natural language prompt
Clean JSON returned to your application

MCP Server for Claude and Cursor

If you're building with AI agents, Haunt also has an MCP server:

{
  "mcpServers": {
    "haunt": {
      "command": "npx",
      "args": ["@hauntapi/mcp-server"],
      "env": {
        "HAUNT_API_KEY": "your-key"
      }
    }
  }
}

Add that to your Claude Desktop or Cursor config and your AI agent can extract data from any website natively. Zero code.

REST API (No SDK Needed)

curl -X POST https://hauntapi.com/v1/extract \
  -H "x-api-key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "prompt": "Get the top 5 stories with titles, points, and URLs"
  }'

Free Tier

100 extractions/month for free. No credit card required. Perfect for prototyping your AI agent before scaling up.

Paid plans start at £19/mo for 1,000 requests with authenticated scraping and priority support.

When to Use What

Approach	Cost	Reliability	Setup Time
Raw requests	Free	Low (30%)	5 min
Selenium + proxies	$$$	Medium (60%)	Hours
Haunt API	Free tier	High (95%+)	5 min

TL;DR

If your AI agent needs web data and you're tired of fighting bot detection, try Haunt API. It handles Cloudflare, JavaScript rendering, and data extraction in a single API call.

Free to start, built for AI agents and RAG pipelines.

Disclosure: I built Haunt API because I was tired of writing the same scraping infrastructure for every project.