DEV Community

Zee

Your AI Agent Can't Scrape That Page. Here's How to Fix It.


You built an AI agent that needs real-time web data. Product prices, news articles, competitor info — whatever it is, you need clean HTML or JSON from a URL.

So you fire off a requests.get() and... 403 Forbidden. Cloudflare says no.

Or you get a page, but it's empty — the content loads via JavaScript after the page renders, and your HTTP client never sees it.

Sound familiar? Let's break down what's happening and how to actually solve it.

Why Your Scraping Fails

1. JavaScript Rendering

Many modern sites are single-page applications (SPAs). The HTML you get from a raw HTTP request is just a shell — the actual content is loaded by JavaScript after the page mounts. requests, axios, and fetch don't execute JS, so they never see that content.
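To see the problem concretely, compare what a raw HTTP client receives with what a browser eventually renders. A minimal sketch (the HTML below is a made-up example of a typical SPA shell, not any real site's response):

```python
# What a raw HTTP client receives from a typical SPA: an empty shell.
# The data your agent wants only exists after client-side JavaScript runs.
spa_shell = """
<!doctype html>
<html>
  <head><title>Products</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""

# The content simply isn't in the response body:
print("price" in spa_shell)                  # False
print('<div id="root"></div>' in spa_shell)  # True -- just the JS mount point
```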

2. Cloudflare and Bot Detection

Cloudflare fingerprints your connection:

  • TLS fingerprint (does your HTTP client look like a browser?)
  • HTTP/2 fingerprint
  • Browser behavior (mouse movements, JS execution patterns)
  • IP reputation

Regular HTTP clients fail all of these checks.
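You can at least recognize these blocks in your responses. A small heuristic sketch — the marker strings are common Cloudflare challenge signatures, not an exhaustive or official list:

```python
def looks_like_cloudflare_block(status_code: int, body: str) -> bool:
    """Heuristic: does this response look like a Cloudflare challenge page?"""
    markers = (
        "cf-chl",                            # challenge script token
        "Checking your browser",             # interstitial page text
        "Attention Required! | Cloudflare",  # block page title
    )
    return status_code in (403, 503) and any(m in body for m in markers)

# A typical blocked response vs. a normal one:
blocked = "<title>Attention Required! | Cloudflare</title>"
print(looks_like_cloudflare_block(403, blocked))        # True
print(looks_like_cloudflare_block(200, "<h1>OK</h1>"))  # False
```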

3. Complex Layouts

Even when you get the HTML, extracting structured data from it is painful. You write brittle CSS selectors that break on every layout change.
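For example, an extractor written against today's markup silently breaks after a redesign. The HTML snippets below are made up for illustration:

```python
import re

# Extraction tied to the exact markup structure:
old_html = '<span class="price">$79.99</span>'
new_html = '<div class="product-price"><span>$79.99</span></div>'  # after a redesign

pattern = re.compile(r'<span class="price">([^<]+)</span>')

print(pattern.search(old_html).group(1))  # $79.99
print(pattern.search(new_html))           # None -- same data, broken extractor
```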

The Solutions (From Worst to Best)

Selenium/Playwright Headless Browsers

They work... sometimes. But Cloudflare detects headless Chrome. You'll spend more time maintaining anti-detection patches than building your actual product.
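One reason they get caught: detection scripts probe browser properties that headless automation historically leaks, like navigator.webdriver. A simplified simulation of that kind of check — real anti-bot scripts inspect many more signals than this:

```python
# Simplified model of a client-side automation check. navigator.webdriver
# is the classic giveaway; an empty plugin list is another weak signal.
def detects_automation(navigator: dict) -> bool:
    return bool(navigator.get("webdriver")) or not navigator.get("plugins")

headless_chrome = {"webdriver": True, "plugins": []}
real_browser = {"webdriver": False, "plugins": ["PDF Viewer", "Chrome PDF Plugin"]}

print(detects_automation(headless_chrome))  # True  -- flagged as a bot
print(detects_automation(real_browser))     # False
```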

Rotating Proxies + Custom Headers

Expensive, slow, and fragile. You're playing whack-a-mole with detection rules.
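The basic pattern looks deceptively simple. A sketch with placeholder proxy endpoints (not real proxies); everything here — the pool, the headers you'd attach, the retry logic — needs constant upkeep as detection rules change:

```python
import itertools

# Naive round-robin proxy rotation (placeholder endpoints):
proxy_pool = itertools.cycle([
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
    "http://proxy3.example:8080",
])

def next_proxy() -> str:
    return next(proxy_pool)

print(next_proxy())  # http://proxy1.example:8080
print(next_proxy())  # http://proxy2.example:8080
```

You'd still have to pair this with spoofed headers and reputable IPs, and the rules you're dodging change faster than you can patch.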

Use an API That Handles Everything

This is where tools like Haunt API come in. It's a web extraction API built specifically for AI agents:

import requests

resp = requests.post(
    "https://hauntapi.com/v1/extract",
    headers={"x-api-key": "your-key"},
    json={
        "url": "https://example.com/product/123",
        "prompt": "Get the product name, price, and availability"
    }
)

print(resp.json()["data"])
# {
#   "product_name": "Wireless Headphones Pro",
#   "price": "$79.99",
#   "availability": "In Stock"
# }

That's it. One API call. Cloudflare bypassed, JavaScript rendered, structured data extracted.

How It Works Under the Hood

  1. Smart fetching — tries direct HTTP first, falls back to a headless browser with anti-fingerprinting for Cloudflare-protected sites
  2. JavaScript executes — SPA content becomes available
  3. AI extracts the data you described in your natural language prompt
  4. Clean JSON returned to your application
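The fallback idea in step 1 can be sketched like this — not Haunt's actual implementation, just the shape of it, with stub fetchers standing in for real HTTP and browser clients:

```python
def is_spa_shell(html: str) -> bool:
    # Heuristic: an empty mount point suggests content is rendered client-side.
    return '<div id="root"></div>' in html

def smart_fetch(url, direct_fetch, browser_fetch):
    """Try cheap direct HTTP first; escalate to a headless browser if needed."""
    html = direct_fetch(url)
    if is_spa_shell(html):
        html = browser_fetch(url)  # slower, but executes JavaScript
    return html

# Demo with stub fetchers:
shell = '<html><body><div id="root"></div></body></html>'
rendered = '<html><body><h1>Wireless Headphones Pro</h1></body></html>'
result = smart_fetch("https://example.com", lambda u: shell, lambda u: rendered)
print(result)  # the rendered HTML, because the direct fetch returned a shell
```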

MCP Server for Claude and Cursor

If you're building with AI agents, Haunt also has an MCP server:

{
  "mcpServers": {
    "haunt": {
      "command": "npx",
      "args": ["@hauntapi/mcp-server"],
      "env": {
        "HAUNT_API_KEY": "your-key"
      }
    }
  }
}

Add that to your Claude Desktop or Cursor config and your AI agent can extract data from any website natively. Zero code.

REST API (No SDK Needed)

curl -X POST https://hauntapi.com/v1/extract \
  -H "x-api-key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "prompt": "Get the top 5 stories with titles, points, and URLs"
  }'

Free Tier

100 extractions/month for free. No credit card required. Perfect for prototyping your AI agent before scaling up.

Paid plans start at £19/mo for 1,000 requests with authenticated scraping and priority support.

When to Use What

Approach            Cost        Reliability    Setup Time
Raw requests        Free        Low (30%)      5 min
Selenium + proxies  $$$         Medium (60%)   Hours
Haunt API           Free tier   High (95%+)    5 min

TL;DR

If your AI agent needs web data and you're tired of fighting bot detection, try Haunt API. It handles Cloudflare, JavaScript rendering, and data extraction in a single API call.

Free to start, built for AI agents and RAG pipelines.


Disclosure: I built Haunt API because I was tired of writing the same scraping infrastructure for every project.
