Boehner

Stop Scraping Pages by Hand — One API Call Returns Everything You Need


I used to have this gross four-step process every time I needed to understand what a webpage was doing:

  1. Screenshot it
  2. curl the HTML and pipe it through a parser
  3. Fire up Puppeteer to extract structured data
  4. Manually look up the tech stack

Four round trips. Four scripts to maintain. Four things that break when a site updates its layout.

Then I added a single /v1/analyze endpoint to SnapAPI and collapsed all four steps into one.

Here's what a single call returns now:

{
  "page_type": "landing_page",
  "cta": "Start for free",
  "navigation": ["Docs", "Pricing", "Changelog", "Sign In"],
  "buttons": ["Start for free", "View docs", "See pricing"],
  "forms": [{ "action": "/signup", "fields": ["email"] }],
  "headings": {
    "h1": ["The Screenshot API that Developers Actually Use"],
    "h2": ["One line of code", "No Puppeteer", "Free tier included"]
  },
  "links": { "internal": 14, "external": 3, "total": 17 },
  "word_count": 847,
  "load_time_ms": 1243,
  "technologies": ["Cloudflare", "Google Analytics", "Stripe"],
  "screenshot": "<base64 PNG>"
}

One HTTP GET. One response. Everything about the page — structure, intent, and a visual.


Why This Matters

If you're building any of these things, you've probably felt the pain:

  • Competitive intelligence tools — you want to know if a competitor changed their CTA or added a new pricing tier
  • SEO auditing scripts — you need word counts, heading structure, and link counts at scale
  • AI agents — your agent needs to understand a page before acting on it, not just see a blob of HTML
  • Lead enrichment pipelines — you're building profiles of companies and need to know what tech stack they're running

The traditional approach is to either scrape HTML (and fight bot detection), run your own Puppeteer cluster (and babysit it), or stitch together 3–4 different APIs (expensive, fragile).

The analyze endpoint is a single call that does all of that in under 2 seconds.


The Code

Node.js

const res = await fetch(
  "https://snapapi.tech/v1/analyze?" + new URLSearchParams({
    url: "https://stripe.com",
    api_key: "YOUR_KEY",
    screenshot: "true"
  })
);

const data = await res.json();

console.log("Page type:", data.page_type);
console.log("Primary CTA:", data.cta);
console.log("Tech stack:", data.technologies);
console.log("Word count:", data.word_count);

Python

import requests

response = requests.get("https://snapapi.tech/v1/analyze", params={
    "url": "https://stripe.com",
    "api_key": "YOUR_KEY",
    "screenshot": "true"
})

data = response.json()
print(f"Page type: {data['page_type']}")
print(f"Primary CTA: {data['cta']}")
print(f"Technologies: {', '.join(data['technologies'])}")
print(f"Word count: {data['word_count']}")

curl

curl "https://snapapi.tech/v1/analyze?url=https://stripe.com&api_key=YOUR_KEY" | jq .
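jq also makes it easy to slice the response without writing a script. A quick sketch, using a saved sample that mirrors the response shown at the top of the post (swap in the live curl output):

```shell
# Save a sample response locally (in practice: curl ... -o response.json)
cat > response.json <<'EOF'
{"page_type":"landing_page","cta":"Start for free","technologies":["Cloudflare","Google Analytics","Stripe"]}
EOF

jq -r '.cta' response.json            # just the primary CTA
jq -r '.technologies[]' response.json # one technology per line
jq '{page_type, cta}' response.json   # a trimmed-down object
```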

Real Use Case: Competitor Monitoring in 20 Lines

Here's a script that watches 10 competitor homepages and alerts you when their CTA or tech stack changes:

// Node 18+ ships a global fetch; on older versions: npm install node-fetch
const fetch = require("node-fetch");
const fs = require("fs");

const API_KEY = process.env.SNAPAPI_KEY;
const COMPETITORS = [
  "https://stripe.com",
  "https://clerk.dev",
  "https://vercel.com",
  // ... add yours
];

const STATE_FILE = "./competitor-state.json";
const previous = fs.existsSync(STATE_FILE) 
  ? JSON.parse(fs.readFileSync(STATE_FILE)) 
  : {};

async function analyze(url) {
  const res = await fetch(
    `https://snapapi.tech/v1/analyze?url=${encodeURIComponent(url)}&api_key=${API_KEY}`
  );
  return res.json();
}

async function run() {
  const current = {};

  for (const url of COMPETITORS) {
    const data = await analyze(url);
    const key = url;
    current[key] = {
      cta: data.cta,
      technologies: data.technologies,
      page_type: data.page_type,
      word_count: data.word_count,
    };

    if (previous[key]) {
      const prev = previous[key];
      if (prev.cta !== data.cta) {
        console.log(`🔔 CTA changed on ${url}: "${prev.cta}" → "${data.cta}"`);
      }
      const addedTech = data.technologies.filter(t => !prev.technologies.includes(t));
      if (addedTech.length) {
        console.log(`🔔 New tech detected on ${url}: ${addedTech.join(", ")}`);
      }
    }
  }

  fs.writeFileSync(STATE_FILE, JSON.stringify(current, null, 2));
  console.log("✅ Done. Checked", COMPETITORS.length, "competitors.");
}

run().catch(console.error);

Run this on a cron and you'll know the moment a competitor A/B tests a new headline or switches payment providers.
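For reference, a hypothetical crontab entry for an hourly run (the script path, log path, and key value are placeholders, not anything SnapAPI prescribes):

```shell
# Run the monitor hourly; append output to a log.
# All paths and the env var value are placeholders — adjust to your setup.
0 * * * * SNAPAPI_KEY=your_key /usr/bin/node /opt/monitor/competitor-watch.js >> /var/log/competitor-watch.log 2>&1
```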


Real Use Case: AI Agent Page Understanding

If you're building an AI agent that needs to browse the web, raw HTML is a terrible input — it's noisy, enormous, and the model spends tokens on nav bars and cookie banners.

The analyze endpoint solves this by pre-extracting the structure:

async function getPageContext(url) {
  const res = await fetch(
    `https://snapapi.tech/v1/analyze?url=${encodeURIComponent(url)}&api_key=${API_KEY}`
  );
  const data = await res.json();

  // Return a compact summary for the LLM
  return {
    summary: `This is a ${data.page_type}. Primary CTA: "${data.cta}". ` +
             `Main heading: "${data.headings?.h1?.[0]}". ` +
             `Word count: ${data.word_count}. ` +
             `Technologies: ${data.technologies?.join(", ")}.`,
    screenshot: data.screenshot, // base64 for vision models
  };
}

Feed that summary + screenshot to GPT-4o or Claude and you get much better responses than dumping raw HTML.
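To make that concrete, here's a sketch of wrapping the summary and screenshot in a vision-model message. The content shape follows OpenAI's Chat Completions image-input format; that part is an assumption for illustration, not anything SnapAPI-specific, so adapt it to your provider's SDK:

```javascript
// Sketch: shape the analyze output into a vision-model message.
// The content structure follows OpenAI's image-input format (an
// assumption here); Anthropic and others use a slightly different shape.
function buildVisionMessage(ctx) {
  return {
    role: "user",
    content: [
      { type: "text", text: `Describe this page and suggest an action.\n${ctx.summary}` },
      { type: "image_url", image_url: { url: `data:image/png;base64,${ctx.screenshot}` } },
    ],
  };
}

// Example with placeholder data (real values come from getPageContext):
const msg = buildVisionMessage({
  summary: 'This is a landing_page. Primary CTA: "Start for free".',
  screenshot: "iVBORw0KGgo", // truncated base64 PNG placeholder
});
console.log(msg.content[1].image_url.url.slice(0, 22)); // "data:image/png;base64,"
```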


Real Use Case: SEO Audit at Scale

const pages = [
  "https://yoursite.com",
  "https://yoursite.com/pricing",
  "https://yoursite.com/blog",
  "https://yoursite.com/about",
];

const results = await Promise.all(
  pages.map(async url => {
    const res = await fetch(
      `https://snapapi.tech/v1/analyze?url=${encodeURIComponent(url)}&api_key=${API_KEY}`
    );
    const data = await res.json();
    return {
      url,
      h1_count: data.headings?.h1?.length ?? 0,
      word_count: data.word_count,
      has_form: data.forms.length > 0,
      internal_links: data.links?.internal ?? 0,
      cta: data.cta,
    };
  })
);

// Print a quick audit table
console.table(results);

Output:

┌─────────────────────────────────┬──────────┬────────────┬──────────┬────────────────┬──────────────────┐
│ url                             │ h1_count │ word_count │ has_form │ internal_links │ cta              │
├─────────────────────────────────┼──────────┼────────────┼──────────┼────────────────┼──────────────────┤
│ https://yoursite.com            │ 1        │ 847        │ true     │ 14             │ Start for free   │
│ https://yoursite.com/pricing    │ 1        │ 412        │ false    │ 8              │ Get started      │
│ https://yoursite.com/blog       │ 0        │ 2341       │ false    │ 23             │                  │
│ https://yoursite.com/about      │ 1        │ 631        │ false    │ 11             │ Contact us       │
└─────────────────────────────────┴──────────┴────────────┴──────────┴────────────────┴──────────────────┘

Missing H1 on the blog? No CTA on the about page? This surfaces in seconds.
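A few extra lines turn that table into a pass/fail check. A sketch (the thresholds are arbitrary; tune them to your site):

```javascript
// Sketch: flag common SEO problems from the audit rows built above.
function auditIssues(rows) {
  const issues = [];
  for (const page of rows) {
    if (page.h1_count !== 1) issues.push(`${page.url}: expected exactly one H1, found ${page.h1_count}`);
    if (!page.cta) issues.push(`${page.url}: no primary CTA detected`);
    if (page.word_count < 300) issues.push(`${page.url}: thin content (${page.word_count} words)`);
  }
  return issues;
}

// Using the blog row from the table above:
const issues = auditIssues([
  { url: "https://yoursite.com/blog", h1_count: 0, word_count: 2341, cta: "" },
]);
issues.forEach(i => console.log("⚠️", i));
```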


Batch Mode

If you need to analyze 10+ pages, use the batch endpoint to parallelize everything:

const res = await fetch("https://snapapi.tech/v1/batch", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": API_KEY
  },
  body: JSON.stringify({
    endpoint: "analyze",
    urls: [
      "https://stripe.com",
      "https://paddle.com",
      "https://lemonsqueezy.com",
    ]
  })
});

const results = await res.json();
// results is an array of analyze responses, one per URL

10 URLs → ~3–4 seconds → structured intelligence on all of them.
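Since the batch response is one analyze object per URL, pairing results back to their source URLs is a one-liner. A sketch, assuming results arrive in the same order as the request's urls array:

```javascript
// Sketch: zip batch results back onto the URLs that produced them.
// Assumes results are ordered the same as the request's `urls` array.
function summarizeBatch(urls, results) {
  return urls.map((url, i) => ({
    url,
    page_type: results[i]?.page_type ?? "unknown",
    cta: results[i]?.cta ?? null,
  }));
}

const summary = summarizeBatch(
  ["https://stripe.com", "https://paddle.com"],
  [{ page_type: "landing_page", cta: "Start now" }, { page_type: "pricing", cta: null }]
);
console.log(summary);
```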


What's Returned (Full Schema)

| Field | Type | Description |
| --- | --- | --- |
| page_type | string | landing_page, blog, pricing, docs, ecommerce, etc. |
| cta | string | The primary call-to-action button text |
| navigation | string[] | Top nav link labels |
| buttons | string[] | All button text on the page |
| forms | object[] | Form action, method, and field names |
| headings | object | H1–H6 arrays |
| links | object | internal, external, total counts |
| word_count | number | Visible word count (not raw HTML) |
| load_time_ms | number | Time to interactive |
| technologies | string[] | Detected libraries, CDNs, analytics, payment providers |
| screenshot | string | Base64 PNG (when screenshot=true) |
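Not every field is guaranteed on every page (a blog post may have no forms, and screenshot only appears when requested), so it's worth normalizing a response before you rely on it. A defensive sketch:

```javascript
// Sketch: normalize an analyze response so downstream code can assume
// every field exists, even when the API omits optional ones.
function normalize(data) {
  return {
    page_type: data.page_type ?? "unknown",
    cta: data.cta ?? null,
    navigation: data.navigation ?? [],
    buttons: data.buttons ?? [],
    forms: data.forms ?? [],
    headings: data.headings ?? {},
    links: { internal: 0, external: 0, total: 0, ...(data.links ?? {}) },
    word_count: data.word_count ?? 0,
    technologies: data.technologies ?? [],
  };
}

const safe = normalize({ page_type: "blog" });
console.log(safe.technologies); // []
```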

Free Tier

SnapAPI has a free tier — 100 calls/month, no card required. Grab a key at snapapi.tech and try it against any page you want.

If you're auditing a lot of pages or building something that runs continuously, the paid tiers start at $9/month.


Wrapping Up

The analyze endpoint is what happens when you stop making developers stitch together a scraper + a parser + a Puppeteer script + a Wappalyzer clone — and just collapse it all into a single API call.

One request. One response. Everything you need to understand a webpage programmatically.

Try the live demo →
Read the docs →
