slreport

Web Scraping API That Returns Content + SEO Audit in One Call

Building an AI agent that needs to read the web is a deceptively annoying problem. Pick a web scraping API for clean content. Then realize you also need to know whether the pages being read are well-structured, whether their links are broken, whether they actually render properly with JavaScript. Wire in a separate SEO tool. Now there are two API keys, two billing accounts, and two response schemas — all to answer: "what is on this page and is it any good?"

hintrix collapses that into one API call.


What hintrix does

One POST request to /v1/scrape returns:

  • Clean Markdown (or HTML or plain text) — boilerplate stripped, ready for an LLM
  • A full GEO/SEO audit — 80+ checks: title tags, canonical URLs, hreflang, robots.txt rules, E-E-A-T signals, Schema.org validation
  • PageSpeed scores from Google's PageSpeed API
  • Link health — every internal and external link, with status codes
  • Content diffs — what changed since the last scrape of this URL (stored 7 days)
  • Optional screenshots — full-page PNG, base64-encoded (+1 credit, not stored)

Full browser rendering is included by default at no extra charge.

Here's the simplest request:

curl -X POST https://hintrix.com/v1/scrape \
  -H "X-API-Key: hx_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'
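
The same call works with the built-in fetch in Node 18+. A sketch mirroring the curl example above (the helper name is ours, not part of any SDK):

```typescript
// Build the POST request shown in the curl example: endpoint, API key
// header, and a JSON body carrying the target URL.
function buildScrapeRequest(url: string, apiKey: string) {
  return {
    endpoint: "https://hintrix.com/v1/scrape",
    init: {
      method: "POST",
      headers: {
        "X-API-Key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url }),
    },
  };
}

// Sending it is then just:
// const { endpoint, init } = buildScrapeRequest("https://example.com", process.env.HINTRIX_API_KEY!);
// const page = await (await fetch(endpoint, init)).json();
```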

Response (trimmed):

{
  "agent": "glance",
  "url": "https://example.com/",
  "status_code": 200,
  "response_time_ms": 1847,
  "js_rendered": true,
  "content": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "word_count": 19
  },
  "metadata": {
    "title": "Example Domain",
    "description": "",
    "language": "en"
  },
  "links": [
    {
      "href": "https://iana.org/domains/example",
      "text": "Learn more",
      "type": "external",
      "nofollow": false
    }
  ],
  "credits_used": 1
}

That's 1 credit. Add "mode": ["content", "audit"] and the full SEO/GEO report is appended for 1 more credit. Total: $0.004 for a JS-rendered page with a complete technical audit.
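
The arithmetic behind that $0.004 figure, as a quick sketch (the helper and the per-mode credit table are illustrative, priced at the Small-pack rate of $0.002 per credit):

```typescript
// Credits consumed per requested mode: "content" is 1 credit,
// adding "audit" costs 1 more.
const CREDITS_PER_MODE: Record<string, number> = { content: 1, audit: 1 };

// Cost of one scrape = total credits x price per credit.
function scrapeCost(modes: string[], pricePerCredit = 0.002): number {
  const credits = modes.reduce((n, m) => n + (CREDITS_PER_MODE[m] ?? 0), 0);
  return credits * pricePerCredit; // ["content", "audit"] comes to $0.004
}
```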


What the GEO audit actually looks like

Add the audit mode:

curl -X POST https://hintrix.com/v1/scrape \
  -H "X-API-Key: hx_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://httpbin.org/html",
    "mode": ["content", "audit"]
  }'

The audit block in the response:

{
  "audit": {
    "geo_score": 46,
    "tech_score": 63,
    "sections": {
      "ai_accessibility": { "score": 75, "passed": 6, "total": 8 },
      "content_readiness": { "score": 40, "passed": 4, "total": 10 },
      "entity_eeat": { "score": 17, "passed": 1, "total": 6 },
      "structured_data": { "score": 43, "passed": 3, "total": 7 },
      "technical": { "score": 64, "passed": 7, "total": 11 }
    },
    "issues": [
      {
        "severity": "warning",
        "message": "No Open Graph meta tags found. AI crawlers and social platforms cannot generate previews.",
        "fix": "Add og:title, og:description, og:image to <head>"
      },
      {
        "severity": "warning",
        "message": "No structured data (JSON-LD, Microdata) found.",
        "fix": "Add Organization or WebPage schema"
      }
    ],
    "passes": [
      { "message": "Canonical URL is set correctly" },
      { "message": "robots.txt permits crawling" }
    ],
    "assets": {
      "llms_txt": "# httpbin.org\n> ...\n\n## Links\n- ..."
    }
  }
}

A geo_score of 46 out of 100 means this page has significant issues for AI discoverability. That diagnostic arrives in the same response as the content — no second round-trip, no second API.
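
In a pipeline, that audit block can gate which sources an agent trusts. One way to consume it (the type names and the 60-point threshold are our own choices, not part of the API):

```typescript
interface AuditIssue {
  severity: string;
  message: string;
  fix: string;
}

interface Audit {
  geo_score: number;
  issues: AuditIssue[];
}

// Trust a source only if it clears a minimum GEO score, and collect
// the suggested fixes for anything flagged as a warning or error.
function triageSource(audit: Audit, minScore = 60) {
  const trusted = audit.geo_score >= minScore;
  const fixes = audit.issues
    .filter((i) => i.severity === "warning" || i.severity === "error")
    .map((i) => i.fix);
  return { trusted, fixes };
}
```

With the response above, `triageSource` would reject the page (46 < 60) and hand back the two suggested fixes for reporting.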

This is the core use case for AI agent pipelines: not just "here is what the page says" but "here is how trustworthy and well-structured this source is."


Node.js SDK

npm install hintrix

Zero dependencies. TypeScript-native. Works in Node 18+. Available on npmjs.com/package/hintrix.

Basic scrape:

import { Hintrix } from 'hintrix';

const client = new Hintrix(process.env.HINTRIX_API_KEY);

const page = await client.scrape('https://example.com');

console.log(page.content.markdown);  // clean markdown
console.log(page.metadata.title);    // "Example Domain"
console.log(page.links);             // all links with types

Crawl an entire site and collect all pages:

This is where the Node.js SDK becomes genuinely useful for AI pipelines. The crawlAndCollect method starts an async crawl job, polls for completion, and returns every page when done.

const { job, pages } = await client.crawlAndCollect('https://docs.example.com', {
  max_pages: 50,
  max_depth: 3,
  mode: ['content', 'audit'],
  check_links: true,
  onProgress: (p) => {
    process.stdout.write(`\r${p.pages_crawled} pages crawled...`);
  },
});

console.log(`\nDone. ${pages.length} pages collected.`);

// Feed them to your LLM
for (const page of pages) {
  const embedding = await embed(page.content_markdown);
  await vectorStore.upsert({ url: page.url, embedding });
}

The SDK handles retries (exponential backoff on 429s and 5xx), polling, pagination of results, and cancellation via AbortController.
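
The SDK's internals aren't shown here, but exponential backoff is a standard pattern. A generic sketch under the same assumptions (doubling the delay on each failed attempt):

```typescript
// Retry an async operation with exponential backoff: wait
// baseDelayMs, then 2x, then 4x, ... between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts exhausted
}
```

A production version would retry only on retryable failures (429s and 5xx) and honor any `Retry-After` hint from the server.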

Structured content extraction with a CSS schema:

const result = await client.extract('https://example.com', {
  title: 'h1',
  description: 'meta[name=description]',
  links: 'a[href]',
});

console.log(result.data);
// { title: 'Example Domain', description: '...', links: '...' }

Content extraction (/v1/extract) costs 2 credits per page.


How hintrix compares

Let's be transparent — different tools suit different jobs.

| Feature | hintrix | Firecrawl | Jina Reader | ScrapingBee |
| --- | --- | --- | --- | --- |
| Clean markdown output | Yes | Yes | Yes | No (raw HTML) |
| Full browser rendering | Free, default | Yes | Limited | +5x credits |
| GEO / SEO audit | Yes (80+ checks) | No | No | No |
| Content + audit in one call | Yes | No | No | No |
| Content diffing | Yes (7 days) | No | No | No |
| Link health check | Yes | No | No | No |
| Multi-page crawl | Yes | Yes | No | No |
| Proxy / anti-bot bypass | No | Basic | No | Yes |
| Credits expire | 30 days (extended on purchase) | Monthly | N/A (tokens) | Monthly |
| Pricing model | Credit packs | Subscription | Token blocks | Subscription |
| Free tier | 500 + 500 tweet bonus | 500 | 10M tokens | 1,000 |
| Auto-stop on errors | Yes (30% threshold) | No | N/A | No |
| Credit refund on failures | Yes (first per URL/24h) | No | N/A | No |

Where hintrix wins: Content and technical context in one call, no subscription required, credits valid 30 days and extended on any purchase, automatic refunds on infrastructure failures.

Where competitors win: Firecrawl is open-source with a large community. ScrapingBee has mature proxy infrastructure and handles Cloudflare-protected sites. Jina's free tier is enormous. If anti-bot bypass is the primary requirement, hintrix is not the right tool.


Pricing

500 credits on signup, no card required. Tweet about hintrix and get 500 more — that's 1,000 credits to evaluate with, completely free.

When more credits are needed, buy a pack once — not a subscription:

| Pack | Credits | Price | Per credit |
| --- | --- | --- | --- |
| Small | 2,500 | $5 | $0.002 |
| Medium | 7,500 | $12 | $0.0016 |
| Large | 20,000 | $29 | $0.00145 |

The math: 2,500 pages scraped with full browser rendering = $5. With audit enabled on every page: 2,500 pages × 2 credits = 5,000 credits = two $5 packs = $10 total. That's $0.004 per page, JS rendering plus full GEO/SEO audit included.
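
That paragraph's arithmetic, as a small worked example (the helper is illustrative and assumes credits are bought as $5 Small packs of 2,500):

```typescript
// Worked pricing: pages x credits-per-page, covered by $5 packs
// of 2,500 credits each.
function costForPages(pages: number, creditsPerPage: number) {
  const credits = pages * creditsPerPage;
  const packs = Math.ceil(credits / 2500);
  return { credits, packs, totalUsd: packs * 5 };
}

// costForPages(2500, 2): 5,000 credits, 2 packs, $10 total,
// which works out to roughly $0.004 per page.
```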

Credits are valid for 30 days, and every new purchase resets the clock for all your credits. Active users effectively never lose credits.


Engineering notes

Browser rendering at scale is a resource problem, not a code problem. The optimization work is about lifecycle management: when to reuse browser sessions, when to terminate them, how to cap concurrency.

Credits are refunded on failures caused by hintrix infrastructure. If a browser session crashes, the request doesn't get charged. The refund policy is fair but not exploitable: the first failure on a given URL is always refunded, but repeated failures on the same URL within 24 hours are charged normally. This prevents abuse while keeping the system honest.

Auto-stop on crawls was non-negotiable. The crawler halts automatically when more than 30% of pages return errors, with a specific message explaining the cause: "Target site is rate-limiting requests (HTTP 429). 10/28 pages failed." Partial results are returned alongside the explanation.
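
The threshold rule itself is simple. A sketch (our own illustration of the stated 30% rule, not hintrix's code):

```typescript
// Auto-stop rule: halt the crawl once the observed error rate
// among crawled pages exceeds the threshold (default 30%).
function shouldAutoStop(failed: number, crawled: number, threshold = 0.3): boolean {
  return crawled > 0 && failed / crawled > threshold;
}

// The example in the message above: 10 of 28 pages failed,
// roughly 35.7%, so the crawl halts.
```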

Rate limiting feedback needs to be useful, not just a 429. The SDK surfaces RateLimitError with a retryAfter field. Crawls that hit rate limits pause rather than fail outright.

The GEO audit data is genuinely hard to get right. The audit checks things like: are AI crawlers allowed in robots.txt (not just Googlebot), do hreflang tags use valid locale codes, is the canonical URL self-referential, are E-E-A-T signals present (author markup, organization schema, external citations). Several checks require the page to be fully rendered first. Some took weeks to get reliable.
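
One of those checks, whether AI crawlers are allowed in robots.txt, can be sketched in simplified form. This is our own illustration, not hintrix's implementation; a real parser must also handle wildcards and longest-match precedence per RFC 9309, and the agent list is an assumption:

```typescript
// A few well-known AI crawler user-agents (illustrative list).
const AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

// Return the AI agents that a robots.txt fully blocks ("Disallow: /").
// Consecutive User-agent lines form one group sharing the rules below them.
function blockedAiAgents(robotsTxt: string): string[] {
  const blocked = new Set<string>();
  let group: string[] = [];
  let inAgentLines = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      if (!inAgentLines) group = []; // start of a new group
      group.push(value);
      inAgentLines = true;
    } else {
      inAgentLines = false;
      if (key === "disallow" && value === "/") {
        group.forEach((a) => blocked.add(a));
      }
    }
  }
  return AI_AGENTS.filter((a) => blocked.has(a));
}
```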


Limitations

No proxy rotation, no anti-bot bypass. Cloudflare-protected sites will often block requests. Social media platforms are blocked on hintrix's side as well — deliberately. This is a positioning choice, not an oversight.

Smaller free tier than Jina. 1,000 credits (500 + tweet bonus) is modest compared to 10M tokens.

Node.js SDK only, for now. A Python SDK is next.

Beta. Payment processing is launching soon — all signups receive free beta credits. Pricing above reflects the intended general availability pricing.

No self-hosted option. Firecrawl is open-source and can be self-hosted. hintrix is managed-only.


Get started

Sign up at hintrix.com — no credit card, 500 credits immediately. Full API reference at hintrix.com/docs.

npm install hintrix

If you're building AI agents that need to read and reason about web content — particularly pipelines that need to evaluate source quality alongside content — hintrix is worth a look. The GEO audit in particular is something not found in other web scraping APIs, and feedback on whether it's useful in real pipelines is genuinely welcome.
