
Mohit Prateek

Web Scraping API for structured data from any website - including authenticated and JS-heavy pages.

We’ve built a developer-facing scraping API at Anakin.

The workflow is: you send a URL (or a batch of URLs) plus a few parameters, and you get back normalized output - raw HTML, clean Markdown, or structured JSON. The platform handles JavaScript rendering (headless browser execution), proxy routing, retries, anti-bot handling, and authenticated sessions when needed.

The execution model (request → execution → extraction → response)

In practice, most scraping systems end up becoming a pile of components:

  • HTTP fetch + headless browser execution
  • Proxy pools with geo routing logic
  • Retries, backoff and fallbacks
  • Extraction and normalization pipelines
  • Session and authentication handling

We packaged these into a single API surface: one request in, one normalized response out.

At a high level:

  1. Router decides the execution path (simple fetch vs browser render, proxy pool selection, wait conditions).
  2. Execution layer performs the request (HTTP client or isolated Chromium instance).
  3. Stability layer applies retry and fallback logic (proxy rotation, browser reconfiguration, timing adjustments).
  4. Extraction layer returns normalized output (HTML or Markdown) and optionally applies schema-driven JSON extraction.
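The stability layer's retry-and-fallback behavior can be sketched roughly like this (a minimal sketch: `fetch_once` is a hypothetical stand-in for one execution attempt, and the real layer also rotates proxies and reconfigures the browser between tries):

```shell
# Minimal retry-with-exponential-backoff sketch. fetch_once stands in for
# one execution attempt (plain HTTP fetch or browser render); on success
# the result is handed to the extraction layer.
retry_fetch() {
  url="$1"
  max_attempts="${2:-3}"
  delay="${3:-1}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if fetch_once "$url"; then
      return 0                      # success: hand off to extraction
    fi
    sleep "$delay"
    delay=$((delay * 2))            # double the wait between attempts
    attempt=$((attempt + 1))
  done
  return 1                          # all attempts exhausted: report failure
}
```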

What the Tool Includes (Try It as You Read)

1) URL Scraper (single + batch)

Purpose: Fetch and normalize a single page (or many).

Step 1: Run a simple fetch (no browser). [Get an API key from https://tinyurl.com/bdehkj2z]

# Replace your_api_key and the target URL with your own values
curl -X POST https://api.anakin.io/v1/url-scraper \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "country": "us",
    "useBrowser": false,
    "generateJson": false
  }'

Output: raw HTML, Markdown, or structured JSON.

Step 2: Get the results.

# Replace your_api_key with your key and job_abc123xyz with the job ID from step 1
curl -X GET https://api.anakin.io/v1/url-scraper/job_abc123xyz \
  -H "X-API-Key: your_api_key"

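Since results are fetched by job ID, a small client-side polling loop is handy. A minimal sketch, assuming a hypothetical `get_job` wrapper around the GET request above that prints the job's status, and assuming `completed` as the terminal status value (not a documented string):

```shell
# Poll a job until it reports completion or we give up.
# get_job is a hypothetical wrapper around the GET request shown above
# that prints the job's status field; "completed" is an assumed value.
wait_for_job() {
  job_id="$1"
  max_polls="${2:-20}"
  interval="${3:-2}"
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    status=$(get_job "$job_id")
    if [ "$status" = "completed" ]; then
      echo "job $job_id finished"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "job $job_id timed out" >&2
  return 1
}
```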

If you want to run it on a list of URLs in a single batch request - Documentation here

2) Search API (query → links + extracted content)

This takes a query and returns:

  • a ranked list of results
  • extracted content for those pages (not just URLs)

The point is to avoid building a second pipeline:
“search → fetch each URL → render/extract → normalize”.
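For comparison, a single Search API request might look roughly like this. The `/v1/search` path and the payload fields below are assumptions, not documented values; consult the Search API docs for the real contract.

```shell
# Illustrative request shape only: the endpoint path and field names
# are assumptions. The call itself would follow the same curl pattern
# as the URL Scraper above, e.g.:
#   curl -X POST https://api.anakin.io/v1/search \
#     -H "X-API-Key: $API_KEY" \
#     -H "Content-Type: application/json" \
#     -d "$SEARCH_PAYLOAD"
API_KEY="your_api_key"
SEARCH_PAYLOAD='{"query": "managed postgres providers", "country": "us"}'
echo "$SEARCH_PAYLOAD"
```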

Search API - Documentation Here

3) Agentic Search (multi-step research pipeline)

This is a multi-stage workflow that:

  • searches
  • selects relevant pages
  • extracts content
  • synthesizes
  • returns structured output
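The stages above can be sketched as one orchestration function. This is purely illustrative: every function here is a hypothetical stand-in for one stage, and the API runs the whole flow server-side in a single call.

```shell
# Hypothetical orchestration of the agentic pipeline. Each function is
# a stand-in for one stage, not a real endpoint.
research() {
  query="$1"
  pages=$(search_web "$query")            # stage 1: search
  relevant=$(select_pages "$pages")       # stage 2: select relevant pages
  content=$(extract_content "$relevant")  # stage 3: extract content
  synthesize "$content"                   # stage 4: synthesize structured output
}
```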

Agentic Search API - Documentation Here

4) Browser Sessions (authenticated scraping via session IDs)

Users authenticate once inside an isolated, secure browser environment. Encrypted session cookies and storage are persisted server-side, enabling scraping of dashboards, gated portals, and login-protected views without re-running the login flow on every request. Credentials are never stored.
Subsequent requests reference a session_id:
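For instance, a sketch of what an authenticated request might look like. Attaching the session via a `session_id` field in the URL Scraper payload is my assumption; check the Browser Session docs for the exact request shape.

```shell
# Sketch only: passing "session_id" inside the URL Scraper payload is an
# assumed request shape. The authenticated request would then follow the
# earlier curl pattern:
#   curl -X POST https://api.anakin.io/v1/url-scraper \
#     -H "X-API-Key: your_api_key" \
#     -H "Content-Type: application/json" \
#     -d "$SESSION_PAYLOAD"
SESSION_PAYLOAD='{"url": "https://portal.example.com/dashboard", "session_id": "sess_abc123", "useBrowser": true}'
echo "$SESSION_PAYLOAD"
```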

Check it out at: https://tinyurl.com/bdehkj2z

Browser Session

Practical use cases we’ve seen

  • Powering AI assistants with real-time web content (clean Markdown for retrieval/grounding)
  • Enhancing sales and GTM enrichment with public website signals
  • Extracting structured product/pricing data for monitoring workflows
  • Scraping authenticated dashboards and member portals
  • Automating multi-source research pipelines (search → extract → synthesize)
  • Embedding web extraction into internal tools and developer workflows

If you’ve worked on scraping systems in production, we’d value your feedback.

Try it on the kinds of pages that are usually annoying - dynamic rendering, geo-specific behavior, authenticated flows, or unstable markup - and let us know where it falls short.
