DEV Community

CDCSaaS

Turn any company website into structured B2B data (one API call)

If you've ever needed to go from a company's website to clean, structured data — its name, sector, a short description, social links, a contact email, and the technologies it runs on — you know the options aren't great:

  • Build your own scraper. Brittle, and every site is different. You'll spend more time maintaining selectors than using the data.
  • Pay a heavyweight data provider. Expensive, and the data is often a stale snapshot from months ago.
  • Paste HTML into an LLM and pray. Sometimes you get valid JSON. Sometimes you get a hallucinated CEO email.

I kept hitting this wall while working with lists of company domains, so I built a small API that does one thing well: send a company URL, get back clean JSON.

The two rules that shaped it

1. It reads the live site at request time. Not a database snapshot from last quarter. If a company rebranded yesterday, you get today's version.

2. It never guesses. This was the hardest constraint to enforce with an LLM in the pipeline. Missing fields come back as null — never invented. If there's no contact email on the site, you get "email": null, not a plausible-looking fake you'd import straight into your CRM.
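
One way to enforce a "never guess" rule is a grounding check: accept a model-extracted value only if it appears verbatim in the page text, and fall back to null otherwise. The sketch below is an illustration under that assumption, not the API's actual implementation. The function name `grounded_email` and the simplified email regex are hypothetical.

```python
import re
from typing import Optional

# Deliberately simple email pattern for illustration; not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def grounded_email(candidate: Optional[str], page_text: str) -> Optional[str]:
    """Return the candidate only if it is well-formed AND present in the page.

    Anything the model proposes that cannot be found in the source text
    is treated as a guess and replaced with None (JSON null).
    """
    if not candidate:
        return None
    candidate = candidate.strip().lower()
    if not EMAIL_RE.fullmatch(candidate):
        return None
    # Collect every email that literally occurs in the page, case-insensitively.
    found_on_page = {match.lower() for match in EMAIL_RE.findall(page_text)}
    return candidate if candidate in found_on_page else None
```

With this check, a plausible but invented address such as `ceo@example.com` comes back as `None` when the page only lists `hello@example.com`.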

What a call looks like

curl --request GET \
  --url 'https://ai-live-company-enrichment-tech-detector.p.rapidapi.com/v1/enrich?url=https%3A%2F%2Fstripe.com' \
  --header 'x-rapidapi-host: ai-live-company-enrichment-tech-detector.p.rapidapi.com' \
  --header 'x-rapidapi-key: YOUR_KEY'

And the response:

{
  "url": "https://stripe.com",
  "cached": false,
  "data": {
    "company_name": "Stripe, Inc.",
    "sector": "Financial Technology / Payments",
    "description": "Stripe is a financial infrastructure platform for businesses...",
    "social_links": {
      "linkedin": "https://www.linkedin.com/company/stripe",
      "twitter": "https://twitter.com/stripe",
      "github": "https://github.com/stripe"
    },
    "contact_email": null,
    "tech_stack": ["React", "Next.js", "Cloudflare", "..."]
  }
}
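
For anyone calling this from Python rather than curl, here is a minimal standard-library sketch of the same request. The host, path, and header names come from the curl example above; the helper names (`build_request`, `summarize`, `enrich`) and the 30-second timeout are my own assumptions. The point is to treat `null` fields as a normal case instead of an error.

```python
import json
import urllib.parse
import urllib.request

API_HOST = "ai-live-company-enrichment-tech-detector.p.rapidapi.com"


def build_request(company_url: str, api_key: str) -> urllib.request.Request:
    """Build the GET request from the curl example, URL-encoding the target."""
    query = urllib.parse.urlencode({"url": company_url})
    return urllib.request.Request(
        f"https://{API_HOST}/v1/enrich?{query}",
        headers={"x-rapidapi-host": API_HOST, "x-rapidapi-key": api_key},
        method="GET",
    )


def summarize(payload: dict) -> dict:
    """Pull out a few fields, keeping missing values as None or empty."""
    data = payload.get("data") or {}
    return {
        "name": data.get("company_name"),
        "email": data.get("contact_email"),  # None when the site lists no email
        "tech": data.get("tech_stack") or [],
        "from_cache": bool(payload.get("cached")),
    }


def enrich(company_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Perform the network call and return the summarized record."""
    request = build_request(company_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize(json.load(response))
```

Calling `enrich("https://stripe.com", "YOUR_KEY")` should return a dict in which `email` is `None` whenever the response carries `"contact_email": null`.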

How it works under the hood

A few design decisions, for the curious:

  • Two-pass tech detection. A fast pattern-matching pass first (think Wappalyzer-style fingerprints), then an LLM enrichment pass only for what patterns can't catch. Cheaper and faster than going full-LLM on everything.
  • Hard content trimming before the LLM. Page content is capped before any model call. This keeps latency and cost predictable instead of exploding on heavy JS-rendered sites.
  • Caching with a 14-day TTL. Repeat lookups on the same domain return in ~200 ms instead of re-scraping. The cached field in the response tells you which path you hit.
  • Strict schema validation. Every response is validated against a strict schema (Pydantic v2) before it leaves the API. Either the JSON conforms, or you get a proper error — never half-broken output.
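
The first three decisions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's code: the fingerprint patterns, the 20,000-character cap, the `llm_pass` callback, and the in-memory `TTLCache` are all hypothetical stand-ins. Only the 14-day TTL and the `cached` flag come from the post.

```python
import re
import time
from typing import Callable, Dict, List, Optional, Tuple

MAX_CHARS = 20_000                      # hypothetical content cap
CACHE_TTL_SECONDS = 14 * 24 * 60 * 60   # 14 days, as described above

# Hypothetical fingerprints; a real list would be far longer.
FINGERPRINTS: Dict[str, "re.Pattern[str]"] = {
    "Next.js": re.compile(r"/_next/static/|__NEXT_DATA__"),
    "Shopify": re.compile(r"cdn\.shopify\.com"),
    "WordPress": re.compile(r"/wp-content/"),
}


def trim(html: str, limit: int = MAX_CHARS) -> str:
    """Cap the content so model cost and latency stay bounded."""
    return html[:limit]


def pattern_pass(html: str) -> List[str]:
    """Pass 1: cheap regex fingerprints, no model involved."""
    return sorted(name for name, rx in FINGERPRINTS.items() if rx.search(html))


def detect_tech(
    html: str,
    llm_pass: Optional[Callable[[str, List[str]], List[str]]] = None,
) -> List[str]:
    """Run pass 1, then let an optional model pass add what patterns missed."""
    content = trim(html)
    found = pattern_pass(content)
    if llm_pass is not None:
        extra = llm_pass(content, found)
        found += [tech for tech in extra if tech not in found]
    return found


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._store[key]  # expired: force a fresh scrape
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (self._clock(), value)


def lookup(domain: str, cache: TTLCache, scrape: Callable[[str], dict]) -> dict:
    """Return cached data when fresh, otherwise scrape and store it."""
    hit = cache.get(domain)
    if hit is not None:
        return {"cached": True, "data": hit}
    data = scrape(domain)
    cache.set(domain, data)
    return {"cached": False, "data": data}
```

Passing the already-detected list into `llm_pass` is one way to keep the model focused on what the patterns missed. A production cache would more likely live in Redis or a database than in process memory.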

Use cases I built it for

  • Lead enrichment: turn a list of prospect domains into CRM-ready records.
  • Tech-based targeting: filter prospects by their stack ("show me companies running Shopify").
  • Data hygiene: verify and refresh company records against the live web instead of stale databases.
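
As a small illustration of the tech-based targeting case, here is a sketch that filters already-enriched responses by stack. It assumes records shaped like the sample response above; the helper names are hypothetical.

```python
from typing import Iterable, List


def uses_tech(record: dict, tech: str) -> bool:
    """True if `tech` appears in the record's tech_stack (case-insensitive)."""
    stack = (record.get("data") or {}).get("tech_stack") or []
    return tech.casefold() in {item.casefold() for item in stack}


def filter_by_tech(records: Iterable[dict], tech: str) -> List[str]:
    """Return the URLs of all records whose stack includes `tech`."""
    return [record["url"] for record in records if uses_tech(record, tech)]
```

Records with a missing or null `tech_stack` simply never match, so incomplete enrichments don't need special handling.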

Try it

There's a free tier (100 requests/month), enough to test it against your own data:

👉 AI Live Company Enrichment & Tech Detector on RapidAPI

I'd genuinely love feedback from other builders — on the positioning, the pricing, and especially: what field would you want it to extract next? Drop a comment below.
