DEV Community

Zee

Your AI Agent Can't Scrape That Page. Here's How to Fix It.


You built an AI agent that needs real-time web data. Product prices, news articles, competitor info — whatever it is, you need clean HTML or JSON from a URL.

So you fire off a requests.get() and... 403 Forbidden. Cloudflare says no.

Or you get a page, but it's empty — the content loads via JavaScript after the page renders, and your HTTP client never sees it.

Sound familiar? Let's break down what's happening and how to actually solve it.

Why Your Scraping Fails

1. JavaScript Rendering

Many modern sites are single-page applications (SPAs). The HTML you get from a raw HTTP request is just a shell — the actual content is loaded by JavaScript after the page mounts. requests, axios, and fetch don't execute JS, so they never see that content.
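To see the problem concretely, compare what a raw HTTP client receives with what a browser eventually renders. A minimal sketch (the HTML below is a made-up example of a typical SPA shell, not any real site's response):

```python
# What a raw HTTP client receives from a typical SPA: an empty shell.
# The data your agent wants only exists after client-side JavaScript runs.
spa_shell = """
<!doctype html>
<html>
  <head><title>Products</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""

# The content simply isn't in the response body:
print("price" in spa_shell)                  # False
print('<div id="root"></div>' in spa_shell)  # True -- just the JS mount point
```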

2. Cloudflare and Bot Detection

Cloudflare fingerprints your connection:

  • TLS fingerprint (does your HTTP client look like a browser?)
  • HTTP/2 fingerprint
  • Browser behavior (mouse movements, JS execution patterns)
  • IP reputation

Regular HTTP clients fail all of these checks.
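You can at least recognize these blocks in your responses. A small heuristic sketch — the marker strings are common Cloudflare challenge signatures, not an exhaustive or official list:

```python
def looks_like_cloudflare_block(status_code: int, body: str) -> bool:
    """Heuristic: does this response look like a Cloudflare challenge page?"""
    markers = (
        "cf-chl",                            # challenge script token
        "Checking your browser",             # interstitial page text
        "Attention Required! | Cloudflare",  # block page title
    )
    return status_code in (403, 503) and any(m in body for m in markers)

# A typical blocked response vs. a normal one:
blocked = "<title>Attention Required! | Cloudflare</title>"
print(looks_like_cloudflare_block(403, blocked))        # True
print(looks_like_cloudflare_block(200, "<h1>OK</h1>"))  # False
```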

3. Complex Layouts

Even when you get the HTML, extracting structured data from it is painful. You write brittle CSS selectors that break on every layout change.
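For example, an extractor written against today's markup silently breaks after a redesign. The HTML snippets below are made up for illustration:

```python
import re

# Extraction tied to the exact markup structure:
old_html = '<span class="price">$79.99</span>'
new_html = '<div class="product-price"><span>$79.99</span></div>'  # after a redesign

pattern = re.compile(r'<span class="price">([^<]+)</span>')

print(pattern.search(old_html).group(1))  # $79.99
print(pattern.search(new_html))           # None -- same data, broken extractor
```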

The Solutions (From Worst to Best)

Selenium/Playwright Headless Browsers

They work... sometimes. But Cloudflare detects headless Chrome. You'll spend more time maintaining anti-detection patches than building your actual product.
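One reason they get caught: detection scripts probe browser properties that headless automation historically leaks, like navigator.webdriver. A simplified simulation of that kind of check — real anti-bot scripts inspect many more signals than this:

```python
# Simplified model of a client-side automation check. navigator.webdriver
# is the classic giveaway; an empty plugin list is another weak signal.
def detects_automation(navigator: dict) -> bool:
    return bool(navigator.get("webdriver")) or not navigator.get("plugins")

headless_chrome = {"webdriver": True, "plugins": []}
real_browser = {"webdriver": False, "plugins": ["PDF Viewer", "Chrome PDF Plugin"]}

print(detects_automation(headless_chrome))  # True  -- flagged as a bot
print(detects_automation(real_browser))     # False
```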

Rotating Proxies + Custom Headers

Expensive, slow, and fragile. You're playing whack-a-mole with detection rules.
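The basic pattern looks deceptively simple. A sketch with placeholder proxy endpoints (not real proxies); everything here — the pool, the headers you'd attach, the retry logic — needs constant upkeep as detection rules change:

```python
import itertools

# Naive round-robin proxy rotation (placeholder endpoints):
proxy_pool = itertools.cycle([
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
    "http://proxy3.example:8080",
])

def next_proxy() -> str:
    return next(proxy_pool)

print(next_proxy())  # http://proxy1.example:8080
print(next_proxy())  # http://proxy2.example:8080
```

You'd still have to pair this with spoofed headers and reputable IPs, and the rules you're dodging change faster than you can patch.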

Use an API That Handles Everything

This is where tools like Haunt API come in. It's a web extraction API built specifically for AI agents:

import requests

resp = requests.post(
    "https://hauntapi.com/v1/extract",
    headers={"x-api-key": "your-key"},
    json={
        "url": "https://example.com/product/123",
        "prompt": "Get the product name, price, and availability"
    }
)

print(resp.json()["data"])
# {
#   "product_name": "Wireless Headphones Pro",
#   "price": "$79.99",
#   "availability": "In Stock"
# }

That's it. One API call. Cloudflare bypassed, JavaScript rendered, structured data extracted.

How It Works Under the Hood

  1. Smart fetching — tries direct HTTP first, falls back to a headless browser with anti-fingerprinting for Cloudflare-protected sites
  2. JavaScript executes — SPA content becomes available
  3. AI extracts the data you described in your natural language prompt
  4. Clean JSON returned to your application
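The fallback idea in step 1 can be sketched like this — not Haunt's actual implementation, just the shape of it, with stub fetchers standing in for real HTTP and browser clients:

```python
def is_spa_shell(html: str) -> bool:
    # Heuristic: an empty mount point suggests content is rendered client-side.
    return '<div id="root"></div>' in html

def smart_fetch(url, direct_fetch, browser_fetch):
    """Try cheap direct HTTP first; escalate to a headless browser if needed."""
    html = direct_fetch(url)
    if is_spa_shell(html):
        html = browser_fetch(url)  # slower, but executes JavaScript
    return html

# Demo with stub fetchers:
shell = '<html><body><div id="root"></div></body></html>'
rendered = '<html><body><h1>Wireless Headphones Pro</h1></body></html>'
result = smart_fetch("https://example.com", lambda u: shell, lambda u: rendered)
print(result)  # the rendered HTML, because the direct fetch returned a shell
```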

MCP Server for Claude and Cursor

If you're building with AI agents, Haunt also has an MCP server:

{
  "mcpServers": {
    "haunt": {
      "command": "npx",
      "args": ["@hauntapi/mcp-server"],
      "env": {
        "HAUNT_API_KEY": "your-key"
      }
    }
  }
}

Add that to your Claude Desktop or Cursor config and your AI agent can extract data from any website natively. Zero code.

REST API (No SDK Needed)

curl -X POST https://hauntapi.com/v1/extract \
  -H "x-api-key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "prompt": "Get the top 5 stories with titles, points, and URLs"
  }'

Free Tier

100 extractions/month for free. No credit card required. Perfect for prototyping your AI agent before scaling up.

Paid plans start at £19/mo for 1,000 requests with authenticated scraping and priority support.

When to Use What

Approach            Cost        Reliability    Setup Time
Raw requests        Free        Low (30%)      5 min
Selenium + proxies  $$$         Medium (60%)   Hours
Haunt API           Free tier   High (95%+)    5 min

TL;DR

If your AI agent needs web data and you're tired of fighting bot detection, try Haunt API. It handles Cloudflare, JavaScript rendering, and data extraction in a single API call.

Free to start, built for AI agents and RAG pipelines.


Disclosure: I built Haunt API because I was tired of writing the same scraping infrastructure for every project.
