slreport

Web Scraping API That Returns Content + SEO Audit in One Call

Building an AI agent that needs to read the web is a deceptively annoying problem. Pick a web scraping API for clean content. Then realize you also need to know whether the pages being read are well-structured, whether their links are broken, whether they actually render properly with JavaScript. Wire in a separate SEO tool. Now there are two API keys, two billing accounts, and two response schemas — all to answer: "what is on this page and is it any good?"

hintrix collapses that into one API call.


What hintrix does

One POST request to /v1/scrape returns:

  • Clean Markdown (or HTML or plain text) — boilerplate stripped, ready for an LLM
  • A full GEO/SEO audit — 80+ checks: title tags, canonical URLs, hreflang, robots.txt rules, E-E-A-T signals, Schema.org validation
  • PageSpeed scores from Google's PageSpeed API
  • Link health — every internal and external link, with status codes
  • Content diffs — what changed since the last scrape of this URL (stored 7 days)
  • Optional screenshots — full-page PNG, base64-encoded (+1 credit, not stored)

Full browser rendering is included by default at no extra charge.

Here's the simplest request:

curl -X POST https://hintrix.com/v1/scrape \
  -H "X-API-Key: hx_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
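
The same call works with the built-in fetch in Node 18+. A sketch mirroring the curl example above (the helper name is ours, not part of any SDK):

```typescript
// Build the POST request shown in the curl example: endpoint, API key
// header, and a JSON body carrying the target URL.
function buildScrapeRequest(url: string, apiKey: string) {
  return {
    endpoint: "https://hintrix.com/v1/scrape",
    init: {
      method: "POST",
      headers: {
        "X-API-Key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url }),
    },
  };
}

// Sending it is then just:
// const { endpoint, init } = buildScrapeRequest("https://example.com", process.env.HINTRIX_API_KEY!);
// const page = await (await fetch(endpoint, init)).json();
```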

Response (trimmed):

{
  "agent": "glance",
  "url": "https://example.com/",
  "status_code": 200,
  "response_time_ms": 1847,
  "js_rendered": true,
  "content": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "word_count": 19
  },
  "metadata": {
    "title": "Example Domain",
    "description": "",
    "language": "en"
  },
  "links": [
    {
      "href": "https://iana.org/domains/example",
      "text": "Learn more",
      "type": "external",
      "nofollow": false
    }
  ],
  "credits_used": 1
}

That's 1 credit. Add "mode": ["content", "audit"] and the full SEO/GEO report is appended for 1 more credit. Total: $0.004 for a JS-rendered page with a complete technical audit.
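
The arithmetic behind that $0.004 figure, as a quick sketch (the helper and the per-mode credit table are illustrative, priced at the Small-pack rate of $0.002 per credit):

```typescript
// Credits consumed per requested mode: "content" is 1 credit,
// adding "audit" costs 1 more.
const CREDITS_PER_MODE: Record<string, number> = { content: 1, audit: 1 };

// Cost of one scrape = total credits x price per credit.
function scrapeCost(modes: string[], pricePerCredit = 0.002): number {
  const credits = modes.reduce((n, m) => n + (CREDITS_PER_MODE[m] ?? 0), 0);
  return credits * pricePerCredit; // ["content", "audit"] comes to $0.004
}
```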


What the GEO audit actually looks like

Add the audit mode:

curl -X POST https://hintrix.com/v1/scrape \
  -H "X-API-Key: hx_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://httpbin.org/html",
    "mode": ["content", "audit"]
  }'

The audit block in the response:

{
  "audit": {
    "geo_score": 46,
    "tech_score": 63,
    "sections": {
      "ai_accessibility": { "score": 75, "passed": 6, "total": 8 },
      "content_readiness": { "score": 40, "passed": 4, "total": 10 },
      "entity_eeat": { "score": 17, "passed": 1, "total": 6 },
      "structured_data": { "score": 43, "passed": 3, "total": 7 },
      "technical": { "score": 64, "passed": 7, "total": 11 }
    },
    "issues": [
      {
        "severity": "warning",
        "message": "No Open Graph meta tags found. AI crawlers and social platforms cannot generate previews.",
        "fix": "Add og:title, og:description, og:image to <head>"
      },
      {
        "severity": "warning",
        "message": "No structured data (JSON-LD, Microdata) found.",
        "fix": "Add Organization or WebPage schema"
      }
    ],
    "passes": [
      { "message": "Canonical URL is set correctly" },
      { "message": "robots.txt permits crawling" }
    ],
    "assets": {
      "llms_txt": "# httpbin.org\n> ...\n\n## Links\n- ..."
    }
  }
}

A geo_score of 46 out of 100 means this page has significant issues for AI discoverability. That diagnostic arrives in the same response as the content — no second round-trip, no second API.
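
In a pipeline, that audit block can gate which sources an agent trusts. One way to consume it (the type names and the 60-point threshold are our own choices, not part of the API):

```typescript
interface AuditIssue {
  severity: string;
  message: string;
  fix: string;
}

interface Audit {
  geo_score: number;
  issues: AuditIssue[];
}

// Trust a source only if it clears a minimum GEO score, and collect
// the suggested fixes for anything flagged as a warning or error.
function triageSource(audit: Audit, minScore = 60) {
  const trusted = audit.geo_score >= minScore;
  const fixes = audit.issues
    .filter((i) => i.severity === "warning" || i.severity === "error")
    .map((i) => i.fix);
  return { trusted, fixes };
}
```

With the response above, `triageSource` would reject the page (46 < 60) and hand back the two suggested fixes for reporting.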

This is the core use case for AI agent pipelines: not just "here is what the page says" but "here is how trustworthy and well-structured this source is."


Node.js SDK

npm install hintrix

Zero dependencies. TypeScript-native. Works in Node 18+. Available on npmjs.com/package/hintrix.

Basic scrape:

import { Hintrix } from 'hintrix';

const client = new Hintrix(process.env.HINTRIX_API_KEY);

const page = await client.scrape('https://example.com');

console.log(page.content.markdown);  // clean markdown
console.log(page.metadata.title);    // "Example Domain"
console.log(page.links);             // all links with types

Crawl an entire site and collect all pages:

This is where the Node.js SDK becomes genuinely useful for AI pipelines. The crawlAndCollect method starts an async crawl job, polls for completion, and returns every page when done.

const { job, pages } = await client.crawlAndCollect('https://docs.example.com', {
  max_pages: 50,
  max_depth: 3,
  mode: ['content', 'audit'],
  check_links: true,
  onProgress: (p) => {
    process.stdout.write(`\r${p.pages_crawled} pages crawled...`);
  },
});

console.log(`\nDone. ${pages.length} pages collected.`);

// Feed them to your LLM
for (const page of pages) {
  const embedding = await embed(page.content_markdown);
  await vectorStore.upsert({ url: page.url, embedding });
}

The SDK handles retries (exponential backoff on 429s and 5xx), polling, pagination of results, and cancellation via AbortController.
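
The SDK's internals aren't shown here, but exponential backoff is a standard pattern. A generic sketch under the same assumptions (doubling the delay on each failed attempt):

```typescript
// Retry an async operation with exponential backoff: wait
// baseDelayMs, then 2x, then 4x, ... between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts exhausted
}
```

A production version would retry only on retryable failures (429s and 5xx) and honor any `Retry-After` hint from the server.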

Structured content extraction with a CSS schema:

const result = await client.extract('https://example.com', {
  title: 'h1',
  description: 'meta[name=description]',
  links: 'a[href]',
});

console.log(result.data);
// { title: 'Example Domain', description: '...', links: '...' }

Content extraction (/v1/extract) costs 2 credits per page.


How hintrix compares

Let's be transparent — different tools suit different jobs.

| Feature | hintrix | Firecrawl | Jina Reader | ScrapingBee |
| --- | --- | --- | --- | --- |
| Clean markdown output | Yes | Yes | Yes | No (raw HTML) |
| Full browser rendering | Free, default | Yes | Limited | +5x credits |
| GEO / SEO audit | Yes (80+ checks) | No | No | No |
| Content + audit in one call | Yes | No | No | No |
| Content diffing | Yes (7 days) | No | No | No |
| Link health check | Yes | No | No | No |
| Multi-page crawl | Yes | Yes | No | No |
| Proxy / anti-bot bypass | No | Basic | No | Yes |
| Credits expire | 30 days (extended on purchase) | Monthly | N/A (tokens) | Monthly |
| Pricing model | Credit packs | Subscription | Token blocks | Subscription |
| Free tier | 500 + 500 tweet bonus | 500 | 10M tokens | 1,000 |
| Auto-stop on errors | Yes (30% threshold) | No | N/A | No |
| Credit refund on failures | Yes (first per URL/24h) | No | N/A | No |

Where hintrix wins: Content and technical context in one call, no subscription required, credits valid 30 days and extended on any purchase, automatic refunds on infrastructure failures.

Where competitors win: Firecrawl is open-source with a large community. ScrapingBee has mature proxy infrastructure and handles Cloudflare-protected sites. Jina's free tier is enormous. If anti-bot bypass is the primary requirement, hintrix is not the right tool.


Pricing

500 credits on signup, no card required. Tweet about hintrix and get 500 more — that's 1,000 credits to evaluate with, completely free.

When more credits are needed, buy a pack once — not a subscription:

| Pack | Credits | Price | Per credit |
| --- | --- | --- | --- |
| Small | 2,500 | $5 | $0.002 |
| Medium | 7,500 | $12 | $0.0016 |
| Large | 20,000 | $29 | $0.00145 |

The math: 2,500 pages scraped with full browser rendering = $5. With audit enabled on every page: 2,500 pages × 2 credits = 5,000 credits = two $5 packs = $10 total. That's $0.004 per page, JS rendering plus full GEO/SEO audit included.
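
That paragraph's arithmetic, as a small worked example (the helper is illustrative and assumes credits are bought as $5 Small packs of 2,500):

```typescript
// Worked pricing: pages x credits-per-page, covered by $5 packs
// of 2,500 credits each.
function costForPages(pages: number, creditsPerPage: number) {
  const credits = pages * creditsPerPage;
  const packs = Math.ceil(credits / 2500);
  return { credits, packs, totalUsd: packs * 5 };
}

// costForPages(2500, 2): 5,000 credits, 2 packs, $10 total,
// which works out to roughly $0.004 per page.
```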

Credits are valid for 30 days, and every new purchase resets the clock for all your credits. Active users effectively never lose credits.


Engineering notes

Browser rendering at scale is a resource problem, not a code problem. The optimization work is about lifecycle management: when to reuse browser sessions, when to terminate them, how to cap concurrency.

Credits are refunded on failures caused by hintrix infrastructure. If a browser session crashes, the request doesn't get charged. The refund policy is fair but not exploitable: the first failure on a given URL is always refunded, but repeated failures on the same URL within 24 hours are charged normally. This prevents abuse while keeping the system honest.

Auto-stop on crawls was non-negotiable. The crawler halts automatically when more than 30% of pages return errors, with a specific message explaining the cause: "Target site is rate-limiting requests (HTTP 429). 10/28 pages failed." Partial results are returned alongside the explanation.
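
The threshold rule itself is simple. A sketch (our own illustration of the stated 30% rule, not hintrix's code):

```typescript
// Auto-stop rule: halt the crawl once the observed error rate
// among crawled pages exceeds the threshold (default 30%).
function shouldAutoStop(failed: number, crawled: number, threshold = 0.3): boolean {
  return crawled > 0 && failed / crawled > threshold;
}

// The example in the message above: 10 of 28 pages failed,
// roughly 35.7%, so the crawl halts.
```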

Rate limiting feedback needs to be useful, not just a 429. The SDK surfaces RateLimitError with a retryAfter field. Crawls that hit rate limits pause rather than fail outright.

The GEO audit data is genuinely hard to get right. The audit checks things like: are AI crawlers allowed in robots.txt (not just Googlebot), do hreflang tags use valid locale codes, is the canonical URL self-referential, are E-E-A-T signals present (author markup, organization schema, external citations). Several checks require the page to be fully rendered first. Some took weeks to get reliable.
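
One of those checks, whether AI crawlers are allowed in robots.txt, can be sketched in simplified form. This is our own illustration, not hintrix's implementation; a real parser must also handle wildcards and longest-match precedence per RFC 9309, and the agent list is an assumption:

```typescript
// A few well-known AI crawler user-agents (illustrative list).
const AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

// Return the AI agents that a robots.txt fully blocks ("Disallow: /").
// Consecutive User-agent lines form one group sharing the rules below them.
function blockedAiAgents(robotsTxt: string): string[] {
  const blocked = new Set<string>();
  let group: string[] = [];
  let inAgentLines = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      if (!inAgentLines) group = []; // start of a new group
      group.push(value);
      inAgentLines = true;
    } else {
      inAgentLines = false;
      if (key === "disallow" && value === "/") {
        group.forEach((a) => blocked.add(a));
      }
    }
  }
  return AI_AGENTS.filter((a) => blocked.has(a));
}
```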


Limitations

No proxy rotation, no anti-bot bypass. Cloudflare-protected sites will often block requests. Social media platforms are blocked on hintrix's side as well — deliberately. This is a positioning choice, not an oversight.

Smaller free tier than Jina. 1,000 credits (500 + tweet bonus) is modest compared to 10M tokens.

Node.js SDK only, for now. A Python SDK is next.

Beta. Payment processing is launching soon — all signups receive free beta credits. Pricing above reflects the intended general availability pricing.

No self-hosted option. Firecrawl is open-source and can be self-hosted. hintrix is managed-only.


Get started

Sign up at hintrix.com — no credit card, 500 credits immediately. Full API reference at hintrix.com/docs.

npm install hintrix

If you're building AI agents that need to read and reason about web content — particularly pipelines that need to evaluate source quality alongside content — hintrix is worth a look. The GEO audit in particular is something not found in other web scraping APIs, and feedback on whether it's useful in real pipelines is genuinely welcome.
