DEV Community

Zee
I’m looking for ugly URLs that break normal scrapers

Most scraper demos use friendly pages.

A blog post.
A docs page.
A fake ecommerce product.
Something clean enough that BeautifulSoup could probably manage it after a coffee.

That is not where web extraction gets annoying.

The annoying cases are the ugly ones:

  • JavaScript-rendered pages
  • pages with no stable CSS selectors
  • pages where the useful data is mixed into layout sludge
  • Cloudflare challenges and other bot-wall weirdness
  • vendor pages where the table changes every week
  • docs pages where the answer is spread across several sections
  • pages that look simple in a browser but return nonsense to curl
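One cheap way to spot the "looks fine in a browser, returns nonsense to curl" case is to measure how much visible text the raw HTML actually contains. The sketch below is my own illustration, not part of any tool mentioned here: the function names and the 200-character threshold are assumptions, and it only uses the Python standard library.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside non-rendered elements."""

    # Elements whose text never shows up in the rendered page body.
    SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a plain HTTP client would 'see' in the raw HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests a JavaScript-rendered shell.

    The threshold is an arbitrary assumption; tune it for your own pages.
    """
    return len(visible_text(html)) < min_chars
```

If this returns True for a page that is obviously full of content in a browser, the data is probably being rendered client-side, and a plain fetch-and-parse scraper will not get far.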

Those are the URLs I actually care about.

The useful test

The test is not:

“Can this tool scrape example.com?”

The test is:

“Can I send it a real page and ask for the specific thing I need, without writing a custom parser?”

Example:

curl -X POST https://hauntapi.com/v1/extract \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/some-awful-page",
    "prompt": "Extract product names, prices, availability, and the source URL as JSON"
  }'

That is the shape I built Haunt API around:

URL in.
Natural-language extraction prompt in.
Structured JSON out.

No selector map.
No one-off parser.
No “the site changed one div class and now everything is dead” ritual sacrifice.
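For anyone who would rather call it from a script than from a terminal, here is a rough Python equivalent of the curl request above, using only the standard library. The endpoint, header names, and body fields are copied from that curl example; the function names, the timeout, and the assumption that the response body parses as JSON are mine. The response schema is not described in this post, so the sketch returns whatever JSON comes back without assuming any fields.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this post.
EXTRACT_ENDPOINT = "https://hauntapi.com/v1/extract"


def build_extract_request(url: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    body = json.dumps({"url": url, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def extract(url: str, prompt: str, api_key: str, timeout: float = 60.0):
    """Send the request and parse the response as JSON.

    Assumes the response body is JSON; the timeout value is arbitrary.
    """
    request = build_extract_request(url, prompt, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `extract("https://example.com/some-awful-page", "Extract product names, prices, availability, and the source URL as JSON", "YOUR_KEY")`. Splitting the request builder from the network call keeps the payload easy to inspect before anything is sent.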

What I want to test next

I’m collecting awkward public URLs that normal scrapers struggle with.

Not private data.
Not login-only pages.
Not anything illegal or creepy.

Just the normal developer pain pile:

  • public product pages
  • public directories
  • public docs
  • public event listings
  • public price pages
  • public content pages with messy markup

If you have one of those “this page should be easy but somehow isn’t” URLs, send it over.

I’ll try to turn it into clean JSON or Markdown and share what worked / what failed.

The live docs are here:

https://hauntapi.com/docs

And the hard-URL proof flow is here:

https://hauntapi.com/services

I’m mainly interested in the failures. Friendly demos are cheap. Broken real pages are where the bodies are buried.
