
Mohit Prateek

Web Scraping API for structured data from any website - including authenticated and JS-heavy pages.

We’ve built a developer-facing scraping API at Anakin.

The workflow is: you send a URL (or a batch of URLs) plus a few parameters, and you get back normalized output - raw HTML, clean Markdown, or structured JSON. The platform handles JavaScript rendering (headless browser execution), proxy routing, retries, anti-bot handling, and authenticated sessions when needed.

The execution model (request → execution → extraction → response)

In practice, most scraping systems end up becoming a pile of components:

  • HTTP fetch + headless browser execution
  • Proxy pools with geo routing logic
  • Retries, backoff and fallbacks
  • Extraction and normalization pipelines
  • Session and authentication handling

We packaged these into a single API surface: one request in, one normalized response out.

At a high level:

  1. Router decides the execution path (simple fetch vs browser render, proxy pool selection, wait conditions).
  2. Execution layer performs the request (HTTP client or isolated Chromium instance).
  3. Stability layer applies retry and fallback logic (proxy rotation, browser reconfiguration, timing adjustments).
  4. Extraction layer returns normalized output (HTML or Markdown) and optionally applies schema-driven JSON extraction.
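The stability layer's retry-and-fallback behavior can be sketched roughly like this (a minimal sketch: `fetch_once` is a hypothetical stand-in for one execution attempt, and the real layer also rotates proxies and reconfigures the browser between tries):

```shell
# Minimal retry-with-exponential-backoff sketch. fetch_once stands in for
# one execution attempt (plain HTTP fetch or browser render); on success
# the result is handed to the extraction layer.
retry_fetch() {
  url="$1"
  max_attempts="${2:-3}"
  delay="${3:-1}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if fetch_once "$url"; then
      return 0                      # success: hand off to extraction
    fi
    sleep "$delay"
    delay=$((delay * 2))            # double the wait between attempts
    attempt=$((attempt + 1))
  done
  return 1                          # all attempts exhausted: report failure
}
```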

What the Tool Includes (Try It as You Read)

1) URL Scraper (single + batch)

Purpose: Fetch and normalize a single page (or many).

Step 1: Run a simple fetch (no browser). [Get an API key from https://tinyurl.com/bdehkj2z]

# Replace your_api_key and the target URL with your own values
curl -X POST https://api.anakin.io/v1/url-scraper \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "country": "us",
    "useBrowser": false,
    "generateJson": false
  }'

Output: raw HTML, Markdown, or structured JSON.

Step 2: Get the results.

# Replace your_api_key with your key and job_abc123xyz with the job ID from step 1
curl -X GET https://api.anakin.io/v1/url-scraper/job_abc123xyz \
  -H "X-API-Key: your_api_key"

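Since results are fetched by job ID, a small client-side polling loop is handy. A minimal sketch, assuming a hypothetical `get_job` wrapper around the GET request above that prints the job's status, and assuming `completed` as the terminal status value (not a documented string):

```shell
# Poll a job until it reports completion or we give up.
# get_job is a hypothetical wrapper around the GET request shown above
# that prints the job's status field; "completed" is an assumed value.
wait_for_job() {
  job_id="$1"
  max_polls="${2:-20}"
  interval="${3:-2}"
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    status=$(get_job "$job_id")
    if [ "$status" = "completed" ]; then
      echo "job $job_id finished"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "job $job_id timed out" >&2
  return 1
}
```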

If you want to run it on a list of URLs in a single batch request - Documentation here

2) Search API (query → links + extracted content)

This takes a query and returns:

  • a ranked list of results
  • extracted content for those pages (not just URLs)

The point is to avoid building a second pipeline:
“search → fetch each URL → render/extract → normalize”.
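For comparison, a single Search API request might look roughly like this. The `/v1/search` path and the payload fields below are assumptions, not documented values; consult the Search API docs for the real contract.

```shell
# Illustrative request shape only: the endpoint path and field names
# are assumptions. The call itself would follow the same curl pattern
# as the URL Scraper above, e.g.:
#   curl -X POST https://api.anakin.io/v1/search \
#     -H "X-API-Key: $API_KEY" \
#     -H "Content-Type: application/json" \
#     -d "$SEARCH_PAYLOAD"
API_KEY="your_api_key"
SEARCH_PAYLOAD='{"query": "managed postgres providers", "country": "us"}'
echo "$SEARCH_PAYLOAD"
```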

Search API - Documentation Here

3) Agentic Search (multi-step research pipeline)

This is a multi-stage workflow that:

  • searches
  • selects relevant pages
  • extracts content
  • synthesizes
  • returns structured output
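The stages above can be sketched as one orchestration function. This is purely illustrative: every function here is a hypothetical stand-in for one stage, and the API runs the whole flow server-side in a single call.

```shell
# Hypothetical orchestration of the agentic pipeline. Each function is
# a stand-in for one stage, not a real endpoint.
research() {
  query="$1"
  pages=$(search_web "$query")            # stage 1: search
  relevant=$(select_pages "$pages")       # stage 2: select relevant pages
  content=$(extract_content "$relevant")  # stage 3: extract content
  synthesize "$content"                   # stage 4: synthesize structured output
}
```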

Agentic Search API - Documentation Here

4) Browser Sessions (authenticated scraping via session IDs)

Users authenticate once inside an isolated, secure browser environment. Encrypted session cookies and storage are persisted server-side, enabling scraping of dashboards, gated portals, and login-protected views without re-running the login flow on every request. Credentials are never stored.
Subsequent requests reference a session_id:
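For instance, a sketch of what an authenticated request might look like. Attaching the session via a `session_id` field in the URL Scraper payload is my assumption; check the Browser Session docs for the exact request shape.

```shell
# Sketch only: passing "session_id" inside the URL Scraper payload is an
# assumed request shape. The authenticated request would then follow the
# earlier curl pattern:
#   curl -X POST https://api.anakin.io/v1/url-scraper \
#     -H "X-API-Key: your_api_key" \
#     -H "Content-Type: application/json" \
#     -d "$SESSION_PAYLOAD"
SESSION_PAYLOAD='{"url": "https://portal.example.com/dashboard", "session_id": "sess_abc123", "useBrowser": true}'
echo "$SESSION_PAYLOAD"
```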

Check it out at: https://tinyurl.com/bdehkj2z

Browser Session

Practical use cases we’ve seen

  • Powering AI assistants with real-time web content (clean Markdown for retrieval/grounding)
  • Enhancing sales and GTM enrichment with public website signals
  • Extracting structured product/pricing data for monitoring workflows
  • Scraping authenticated dashboards and member portals
  • Automating multi-source research pipelines (search → extract → synthesize)
  • Embedding web extraction into internal tools and developer workflows

If you’ve worked on scraping systems in production, we’d value your feedback.

Try it on the kinds of pages that are usually annoying - dynamic rendering, geo-specific behavior, authenticated flows, or unstable markup - and let us know where it falls short.
