Crunchbase Killed Its Free API. Here's How to Rebuild It (2026)
In 2023, Crunchbase quietly deprecated its free Basic API. A generation of indie hackers, academic researchers, and early-stage founders piping structured company data into dashboards and Jupyter notebooks suddenly had nowhere to go. The paid replacement starts at roughly $500/month — a rounding error for a BD team, a non-starter for a solo developer or a graduate student building a thesis dataset.
The good news: Crunchbase is mostly aggregating public information anyway. Company domains, tech stacks, team-size signals, infrastructure footprints — it's all sitting in public DNS, GitHub, certificate transparency logs, and registrar records. Nobody compiles it into one neat JSON endpoint for free, but the raw material is all still there.
Stitch together the right eight sources and you can reconstruct roughly 60–70% of what Crunchbase Basic used to offer. You won't get funding rounds or cap tables — those still require paid data. You will get domain, founding-era proxy, tech stack, CDN, hosting provider, subdomain footprint, GitHub activity, email infrastructure, brand assets, and a reasonable team-size estimate. This post walks through each of the eight sources, the parallel-fanout architecture, a worked VC screening example, and the gotchas that will bite you at scale.
The Data Gap: What Crunchbase Basic Used to Give You
Before the shutdown, a free Crunchbase Basic call returned: company name, domain, logo URL, description, employee_count_range, funding_rounds with amounts and investors, industry tags, location, and founded_year.
Roughly half of that is derivable from public sources. The other half — funding data, precise employee counts, curated industry taxonomy — requires Crunchbase, PitchBook, or a human research team.
What you can reconstruct for free:
- Domain, logo, tagline — company site, favicon, Open Graph tags
- Founding-era proxy — WHOIS domain creation date (imperfect; domains are sometimes registered years before or after incorporation)
- Tech stack — HTTP headers, robots.txt hints, npm org scopes
- Hosting + CDN — DNS A records and HTTP headers (CF-Ray, X-Amz-Cf-Id, Via)
- Team-size proxy — GitHub org member + repo count, plus subdomain footprint
- Security posture — SSL cert issuer (Let's Encrypt vs. DigiCert vs. internal CA)
- SaaS stack — DNS TXT records (SPF/DMARC verification tokens leak vendor relationships)
What you basically cannot derive: exact employee count, funding rounds, investors, cap table, ARR/MRR, board composition. For those, pay Crunchbase/PitchBook or do journalism.
The Eight Free Data Sources
Let's go through each source in order, including what you get, what it costs (in API calls and ethical risk), and where it breaks.
1. WHOIS — The Domain Registry Record
WHOIS is the oldest piece of internet infrastructure still relevant for company-data work. Every one of the ~362M registered domains (Verisign DNIB 2024) has a WHOIS record. Query whois stripe.com and you get registrar (MarkMonitor), creation date (2009-09-10), expiration, and nameservers.
What's useful:
- Creation date as a founding-era proxy. Stripe's domain was registered in 2009; the company was founded in 2010. Close enough for most analysis.
- Registrar choice. MarkMonitor, CSC, or Com Laude skew enterprise with IP-protection budgets. GoDaddy, Namecheap, Porkbun skew indie.
- Registrant organization — when it isn't redacted. Post-GDPR, most registrars hide this behind "Whois Privacy," but pre-2018 .com registrations still expose org names surprisingly often.
- Nameservers — ns-cloud-a1.googledomains.com means GCP DNS. pdns*.ultradns.net means enterprise DNS. ns1.cloudflare.com means Cloudflare.
Gotcha: WHOIS rate limits are aggressive and per-TLD. ccTLDs (.io, .ai, .de) have their own registry endpoints with tighter limits. Use whoisxmlapi.com free tier (500/month) if you need volume.
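If you're scripting this step yourself, here is a minimal sketch using the python-whois package (one of several WHOIS libraries; field availability varies widely by TLD and registrar, so treat every field as optional):

# pip install python-whois
import whois

record = whois.whois("stripe.com")

# creation_date may come back as a single datetime or a list, depending on the registry
created = record.creation_date
if isinstance(created, list):
    created = created[0]

print("registrar:  ", record.registrar)       # MarkMonitor/CSC skew enterprise
print("created:    ", created)                # founding-era proxy, not incorporation date
print("nameservers:", record.name_servers)    # hints at the DNS provider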
2. DNS — The Email and Infrastructure Fingerprint
DNS is where a huge amount of company intelligence lives, and it's all free to query. For any domain, pull:
- A records — where the apex points. Is it a Cloudflare IP (104.21.x.x, 172.67.x.x)? A direct EC2 IP? A Fastly anycast range? This tells you about scale and sophistication.
- MX records — the email provider. aspmx.l.google.com means Google Workspace (usually 1–500 employees, startup-heavy). *.mail.protection.outlook.com means Microsoft 365 (enterprise-skewed). Self-hosted MX on the company's own mail server means either a security company or a 90s holdout.
- NS records — authoritative nameservers. Paired with A records, confirms the DNS provider.
- TXT records — the real gold mine. SPF records (v=spf1 include:_spf.google.com include:mailgun.org ~all) reveal every email-sending service they use: Mailgun, SendGrid, Postmark, Mailchimp, Customer.io. DMARC records signal security maturity. And the weird ones — google-site-verification=..., atlassian-domain-verification=..., stripe-verification=..., intercom-site-verification=..., facebook-domain-verification=... — each one is a confirmed SaaS relationship.
A single dig TXT stripe.com can tell you Stripe uses Google for email, has strict DMARC, and verifies with Atlassian, Segment, and a dozen other vendors. That's a SaaS-stack leak that would cost $300/month from BuiltWith's paid tier.
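The same lookup in code, a sketch using dnspython rather than dig (the provider heuristics below are illustrative, not exhaustive):

# pip install dnspython
import dns.resolver

domain = "stripe.com"

mx_hosts = sorted(r.exchange.to_text().lower() for r in dns.resolver.resolve(domain, "MX"))
txt_records = [b"".join(r.strings).decode() for r in dns.resolver.resolve(domain, "TXT")]

# crude email-provider detection from MX hostnames
if any("google.com" in h for h in mx_hosts):
    provider = "Google Workspace"
elif any("protection.outlook.com" in h for h in mx_hosts):
    provider = "Microsoft 365"
else:
    provider = "other / self-hosted"

# every *verification* TXT token is a confirmed SaaS relationship
vendors = [t.split("=", 1)[0] for t in txt_records if "verification" in t.lower()]

print(provider)
print(sorted(vendors))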
3. SSL Certificates via crt.sh — The Subdomain Leak
Certificate Transparency, introduced in 2013 and now enforced by major browsers, requires every publicly trusted TLS cert to be logged. The combined CT logs indexed by crt.sh contain roughly 10B certificates as of 2026.
For any domain, crt.sh?q=%25.stripe.com&output=json returns every cert ever issued for a subdomain. This leaks:
- Subdomain list — api.stripe.com, dashboard.stripe.com, files.stripe.com, checkout.stripe.com, and dozens more. Internal and staging hosts sometimes leak when someone accidentally requests a public cert for them.
- Cert issuer — Let's Encrypt signals startup/small team (free, automated). DigiCert, Sectigo, GlobalSign signal mid-to-large with procurement budgets. Internal CAs signal enterprise with a dedicated PKI team.
- Cert cadence — aggressive 90-day rotation signals modern DevOps; 2-year certs signal legacy ops.
Subdomain count alone is a useful size proxy. A 5-person startup has 3–5 subdomains. A Series B SaaS has 20–40. A Stripe-scale company has hundreds.
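A sketch of that crt.sh query (the JSON endpoint is unofficial and aggressively rate-limited, so cache results):

import requests

domain = "stripe.com"
resp = requests.get("https://crt.sh/",
                    params={"q": f"%.{domain}", "output": "json"},
                    timeout=60)
certs = resp.json()

# name_value can hold several SANs separated by newlines; dedupe and drop wildcards
subdomains = sorted({
    name.strip().lower()
    for cert in certs
    for name in cert["name_value"].split("\n")
    if name and "*" not in name
})
issuers = {cert["issuer_name"] for cert in certs}

print(len(subdomains), "subdomains,", len(issuers), "distinct issuers")
print(subdomains[:10])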
4. GitHub Org API — The Engineering-Team Signal
GitHub's REST API is free and generous: 5,000 requests/hour authenticated. GET /orgs/{org} returns public member count, repo count, creation date, description, and location. GET /orgs/{org}/repos paginates through public repos.
What you learn:
- Public repo count — 5 repos means a small shop. 100+ repos means a real engineering org. Many companies put serious work in private repos, so this is a floor.
- Primary languages — a top three of Python/Go/TypeScript signals a modern stack; Java/Ruby at the top hints at a legacy one.
- Member count — public org members. Most companies hide members by default, so this drastically undercounts. Stripe has probably 300+ engineers on GitHub but shows maybe 40 public members.
- Activity level — commit frequency, open-issue count, star count on flagship projects. A dead org with 500 repos and no recent commits is very different from a live one.
Note: the company-data-aggregator actor currently returns first-page repos only; a full-org walk with pagination is on the roadmap. For ~95% of companies, first page (30 repos) is enough signal.
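If you query GitHub directly, the two calls look like this (a sketch; YOUR_GITHUB_TOKEN is a placeholder, and the language ranking uses only the first page of repos):

import requests

ORG = "stripe"
headers = {"Authorization": "Bearer YOUR_GITHUB_TOKEN"}   # optional, but unlocks 5,000 req/hour

org = requests.get(f"https://api.github.com/orgs/{ORG}", headers=headers, timeout=10).json()
repos = requests.get(f"https://api.github.com/orgs/{ORG}/repos",
                     headers=headers, params={"per_page": 100, "sort": "pushed"},
                     timeout=10).json()

languages = [r["language"] for r in repos if r.get("language")]
top_langs = sorted(set(languages), key=languages.count, reverse=True)[:3]

print(org["public_repos"], "public repos; org created", org["created_at"])
print("top languages (first page only):", top_langs)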
5. Tech Headers — The Lightweight BuiltWith
Make a single HEAD or GET request to the company's homepage and inspect the response headers. You'll see:
- Server — Apache, nginx, Caddy, IIS, LiteSpeed, or a proxy like Cloudflare/CloudFront that masks the origin.
- X-Powered-By — PHP, Express, ASP.NET, Next.js. Modern shops often strip this for security; its absence is itself a signal.
- Via — usually 1.1 varnish, 1.1 google, or similar; reveals intermediate proxies.
- CF-Ray — Cloudflare presence and data center (e.g., 8a3f2c...-SJC = the San Jose POP).
- X-Amz-Cf-Id — CloudFront.
- X-Served-By — Fastly.
- X-Akamai-* — Akamai.
- X-Vercel-* — Vercel.
- X-Github-Request-Id — GitHub Pages.
From five headers you can usually identify the CDN, the front-end host, and sometimes the framework (Next.js, Remix, SvelteKit, Nuxt all leak through X-Nextjs-Prerender, X-Nuxt-Renderer, or distinctive Link prefetch patterns).
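A sketch of this check; the header-to-CDN mapping below covers only the common cases named above:

import requests

resp = requests.get("https://stripe.com", timeout=10, allow_redirects=True)
h = {k.lower(): v for k, v in resp.headers.items()}

if "cf-ray" in h:
    cdn = "Cloudflare"
elif "x-amz-cf-id" in h:
    cdn = "CloudFront"
elif "x-served-by" in h:
    cdn = "Fastly"
elif any(k.startswith("x-vercel") for k in h):
    cdn = "Vercel"
else:
    cdn = "unknown / origin-direct"

print("server:", h.get("server"), "| cdn:", cdn, "| powered-by:", h.get("x-powered-by"))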
6. robots.txt + sitemap.xml — The Internal-Map Leak
Every well-behaved website publishes /robots.txt and usually /sitemap.xml. These are free to fetch and contain remarkable amounts of internal structure.
robots.txt tells you what the company doesn't want indexed, which is often what's interesting:
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /internal/
Disallow: /beta/
Disallow: /_next/
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml
Just from that, you know they have an admin panel, a dashboard, a beta product, internal tooling paths, a Next.js frontend, and a staging environment. A lot of architectural intel for one HTTP GET.
sitemap.xml gives you size:
- 50-URL sitemap: tiny marketing site
- 500-URL sitemap: real product with docs and blog
- 5,000-URL sitemap: content-heavy SaaS with docs + changelogs + customer stories
- 50,000+ URLs split across sitemap index files: scale (think Notion, Webflow, marketplace-style)
Sitemap <lastmod> dates also hint at content velocity — is this a live, actively-maintained site, or abandoned?
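A sketch of the robots.txt + sitemap pull; the URL count is only a rough size proxy, and sitemap index files need one more level of fetching:

import re
import requests

domain = "stripe.com"
robots = requests.get(f"https://{domain}/robots.txt", timeout=10).text

disallowed = [line.split(":", 1)[1].strip() for line in robots.splitlines()
              if line.lower().startswith("disallow:")]
sitemaps = [line.split(":", 1)[1].strip() for line in robots.splitlines()
            if line.lower().startswith("sitemap:")]

url_count = 0
if sitemaps:
    xml = requests.get(sitemaps[0], timeout=10).text
    url_count = len(re.findall(r"<loc>", xml))   # counts child sitemaps if this is an index file

print(len(disallowed), "disallowed paths;", url_count, "entries in the first sitemap")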
7. npm Organization Scope — The Internal-Library Fingerprint
npm has 3M+ public packages and a scoping mechanism (@stripe/, @vercel/, @shopify/) that lets companies publish under an org namespace. Most serious JavaScript-shipping companies claim their scope.
Querying registry.npmjs.org/-/v1/search?text=scope:stripe gives you every package published under the scope. What this reveals:
- Which products they ship. @stripe/stripe-js is the browser SDK; @stripe/react-stripe-js is the React wrapper. Each package is a confirmed product surface area.
- Which internal libraries they've extracted. When a company open-sources @acme/ui-primitives, they're signaling internal practices and often recruiting (the README usually links to careers).
- Download volume as a crude user-base proxy. @stripe/stripe-js gets 8M+ weekly downloads — a real distribution footprint.
- Publication frequency. Monthly releases across many packages = active engineering. Last publish 2 years ago = maintenance mode.
Gotcha: some companies publish under personal names rather than org scopes (early-stage founders who never migrated). You'll miss those unless you also search packages where repository.url contains the company's GitHub org.
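A sketch of the scope search against the public npm registry search endpoint:

import requests

scope = "stripe"
resp = requests.get("https://registry.npmjs.org/-/v1/search",
                    params={"text": f"scope:{scope}", "size": 250}, timeout=10)
results = resp.json()

for obj in results["objects"][:10]:
    pkg = obj["package"]
    print(f'{pkg["name"]:40} last published {pkg["date"]}')
print(results["total"], "packages under the scope")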
8. Favicon + Open Graph Images — The Brand Fingerprint
Finally, the branding layer. Fetch the homepage HTML and parse:
- <link rel="icon" href="..."> — the favicon. High-res with multiple sizes = the company cares about brand. A 16x16 Rails/Next placeholder = they don't.
- <meta property="og:image"> — the social-share card, usually containing logo + tagline. Perfect for a visual directory.
- <meta property="og:title"> and <meta property="og:description"> — official tagline and description, usually higher-quality than a scraped H1.
Fallback: Google's favicon service at https://www.google.com/s2/favicons?domain=stripe.com&sz=128 returns an icon for almost any indexed domain.
OG images are gold for company directories — pre-designed 1200x630, include logo and tagline, explicitly intended for third-party consumption. No ethical concerns.
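A sketch of the brand-layer scrape, assuming beautifulsoup4 is available; it falls back to Google's favicon service when no icon link is present:

# pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup

domain = "stripe.com"
html = requests.get(f"https://{domain}", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

def og(prop):
    tag = soup.find("meta", property=f"og:{prop}")
    return tag.get("content") if tag else None

icon = soup.find("link", rel=lambda r: r and "icon" in r)
favicon = icon.get("href") if icon else f"https://www.google.com/s2/favicons?domain={domain}&sz=128"

print("title:   ", og("title"))
print("desc:    ", og("description"))
print("og:image:", og("image"))
print("favicon: ", favicon)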
Grounding Numbers
To put this in perspective: Crunchbase has ~11M companies (most with shallow data). GitHub has ~60M public repos across ~100M users (2024 Octoverse). WHOIS covers ~362M registered domains (Verisign DNIB 2024) — essentially every company with a web presence. npm has 3M+ public packages. CT logs indexed by crt.sh hold ~10B certificates.
You are not dealing with scarcity — you're dealing with an aggregation-and-normalization problem. Every relevant company is in these sources somewhere; the hard part is hitting them in parallel, merging results, and surviving rate limits.
Architecture: Parallel Fanout
Here's the architecture for pulling all eight sources per domain without taking forever:
[Domain list]
|
v
[company-data-aggregator]
|
+-> WHOIS -> registrar, age, registrant org
+-> DNS -> MX stack, verification tokens
+-> SSL/crt.sh -> subdomains, cert issuer
+-> GitHub org -> repos, languages, size proxy
+-> tech headers -> CDN, hosting, server
+-> robots/sitemap -> scale + internal paths
+-> npm -> internal packages
+-> favicon/og -> brand assets
|
v
[merged JSON profile per domain]
The company-data-aggregator actor on Apify hits all eight in parallel per domain via asyncio.gather, with per-source error isolation (if WHOIS rate-limits you, DNS and GitHub still succeed), a configurable per-source timeout (default 10s), and a unified merged JSON output. For a list of 100 domains, the whole job finishes in under 2 minutes on Apify's standard compute.
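If you'd rather wire the fanout yourself, the core pattern is small. A sketch follows; fetch_whois, fetch_dns, and friends are hypothetical stand-ins for whatever per-source coroutines you write:

import asyncio

async def profile_domain(domain, sources, timeout_s=10):
    # sources is a dict of {name: async callable(domain)}; each is guarded independently
    async def guarded(name, coro):
        try:
            return name, await asyncio.wait_for(coro, timeout=timeout_s)
        except Exception as exc:          # one failing source never takes down the rest
            return name, {"error": str(exc)}

    pairs = await asyncio.gather(*(guarded(name, fn(domain)) for name, fn in sources.items()))
    return {"domain": domain, **dict(pairs)}

# usage (hypothetical per-source functions):
# asyncio.run(profile_domain("stripe.com", {"whois": fetch_whois, "dns": fetch_dns}))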
Code Example
from apify_client import ApifyClient

client = ApifyClient("APIFY_TOKEN")  # your Apify API token

targets = ["stripe.com", "vercel.com", "fly.io", "railway.app", "render.com"]

# Run the aggregator across all eight sources with a 10s per-source timeout
run = client.actor("nexgendata/company-data-aggregator").call(run_input={
    "domains": targets,
    "sources": ["whois", "dns", "ssl", "github", "tech_headers", "robots", "npm", "favicon"],
    "timeout_per_source_s": 10,
})

# Each dataset item is one merged per-domain profile
for d in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"\n{d['domain']}: GH repos={d.get('github', {}).get('repo_count')}, "
          f"MX={d.get('dns', {}).get('mx_provider')}, "
          f"CDN={d.get('tech_headers', {}).get('cdn')}")
Expected output for the five targets above:
stripe.com: GH repos=250, MX=Google Workspace, CDN=Cloudflare
vercel.com: GH repos=180, MX=Google Workspace, CDN=Vercel
fly.io: GH repos=120, MX=Google Workspace, CDN=Fastly
railway.app: GH repos=45, MX=Google Workspace, CDN=Cloudflare
render.com: GH repos=60, MX=Google Workspace, CDN=Cloudflare
At a glance: all five are modern infrastructure companies on Google Workspace, three on Cloudflare, one on Vercel's own CDN, one on Fastly. Stripe has 5x the public GitHub presence of Railway — consistent with their respective company sizes.
Worked Example: VC Associate Screens 100 AI-Infra Startups
A VC associate at a seed-stage fund is screening 100 "AI infra" startups pulled from a Twitter thread, a Substack, and a conference attendee list. She needs to narrow to ~10 worth a deeper call. Manually visiting 100 sites is a full day.
Instead, she feeds the list to the aggregator and gets back, per company:
- Tech stack sophistication: AWS vs GCP vs self-hosted, Kubernetes usage (via api.k8s.* patterns), Vercel (prototype-y) vs bare EC2 (more production-grade for infra plays).
- Team-size proxy: GitHub member count + subdomain count. 5 subdomains and 3 GitHub members = probably 1–3 people. 30+ subdomains and 20+ members = 15+ people.
- Age: WHOIS creation dates. Founded 2024 is a different conversation than founded 2019 with no traction.
- Subdomain footprint: a production-maturity proxy. An "enterprise-ready" AI-infra startup with just www. and app. is pre-product. One with api., docs., status., console., plus regional endpoints, has real infrastructure.
- SaaS stack: TXT records reveal Segment, Intercom, Linear, Notion usage — commercial intent vs. pure research.
From 100 companies she narrows to 12 based on: founded 2022–2024, at least 10 GitHub public members OR 15+ subdomains, hosted on real cloud infra (not Wix/Webflow), and not already funded past seed (checked against Crunchbase Pro for the 12 finalists only — paying for 12 lookups is fine, 100 wasn't).
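In code, that screen is a one-pass filter over the merged profiles. A sketch follows; the field names here (created_year, member_count, subdomain_count, host) are assumptions about the output schema, not guaranteed keys, so adjust them to whatever your aggregator actually returns:

def passes_screen(profile):
    # hypothetical field names; map these to your real output schema
    founded = profile.get("whois", {}).get("created_year", 0)
    members = profile.get("github", {}).get("member_count", 0)
    subdomains = profile.get("ssl", {}).get("subdomain_count", 0)
    host = profile.get("tech_headers", {}).get("host", "")

    return (2022 <= founded <= 2024
            and (members >= 10 or subdomains >= 15)
            and host not in ("Wix", "Webflow"))

shortlist = [p["domain"] for p in profiles if passes_screen(p)]   # `profiles` = dataset items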
Total cost: ~$5 of Apify credits + 12 Crunchbase Pro lookups (~$40). Total time: 45 minutes instead of a day.
Augmenting With Paid Sources Where Free Isn't Enough
Free sources cover 60–70%. The rest requires paying. The honest hierarchy:
- OpenCorporates (free tier, ~500 calls/month): incorporation records, legal entity names, officer lists. Genuinely useful for due diligence.
- SEC EDGAR (fully free): 10-K, 10-Q, 8-K, S-1 filings for US public companies. Unbeatable for public-company financials.
- Clearbit (RIP): acquired by HubSpot in 2023, rebranded as HubSpot Insights, free tier deprecated. Not a replacement anymore.
- PeopleDataLabs / Proxycurl: LinkedIn-adjacent. Contact info, titles, seniority. Paid, cheaper per-record than Apollo.
- Crunchbase Pro / PitchBook: for funding data you cannot derive elsewhere. Use surgically on your shortlist.
- BuiltWith Pro: tech stack at scale without maintaining your own header parsing.
The play: free aggregation screens the top of the funnel, paid data enriches the shortlist. That's how you replicate Crunchbase Basic workflows without $500/month.
OSINT Ethics and Legal Limitations
All eight sources above are public data. That doesn't make bulk aggregation unconditionally fine.
- GDPR Article 14 covers "information from sources other than the data subject." Aggregating EU-company data with personal info about employees (e.g., GitHub names) technically requires notice. Enforcement tends to target bulk sellers, not internal tools.
- CCPA has similar provisions, plus specific rules around data brokers.
- State-level data-broker laws (Vermont, Oregon, Texas, Delaware as of 2025) require registration if you re-sell aggregated personal data.
- Crawl politeness: honor robots.txt, respect Crawl-Delay, use a User-Agent that identifies your bot with a contact email.
Good practice: use aggregated data for internal research, don't re-sell it as a "contact list," delete on request, and don't merge it with contact-level data (emails, phone numbers) without explicit consent pathways.
Gotchas
Things that will bite you at scale:
- CDN-masked origins. Behind Cloudflare, you see Cloudflare, not the real origin. Workarounds: historical DNS records (securitytrails, viewdns.info), or MX records which usually sit on the real infra.
- WHOIS privacy. GoDaddy redacts by default on all new .com registrations post-2018. You'll see "Redacted for Privacy" a lot.
- GitHub orgs that hide members. Most enterprises have private membership. Member count will be null or drastically underreported. Repo count is more reliable.
- npm under personal names. Founders publish early libraries under personal handles and never migrate. Search by repository.url pointing at the company's GitHub org to catch these.
- DNS resolver leakage. Querying 8.8.8.8 for 10,000 targets tells Google exactly who you're researching. For sensitive OSINT, use DoH/DoT through a resolver you control.
- Rate limits. crt.sh 429s around 10 req/sec. GitHub: 5,000/hour authenticated. WHOIS varies wildly per TLD. The aggregator handles backoff per source; if you roll your own, plan for it (a minimal backoff sketch follows this list).
- CT noise. crt.sh returns every cert including expired, pre-cert, and duplicates. Dedupe by SAN, not cert serial.
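Here is that backoff sketch, a minimal helper only (crt.sh and WHOIS are the usual offenders; a production version would also cap total wait time):

import time
import requests

def get_with_backoff(url, max_tries=5, **kwargs):
    for attempt in range(max_tries):
        resp = requests.get(url, **kwargs)
        if resp.status_code != 429:
            return resp
        # honor Retry-After when it's a plain number of seconds, otherwise back off exponentially
        retry_after = resp.headers.get("Retry-After", "")
        wait = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        time.sleep(wait)
    return resp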
FAQ
Is this legal?
Aggregating public WHOIS, DNS, CT logs, GitHub, npm, and HTTP headers is legal everywhere we're aware of. Re-selling aggregated data combined with personal info crosses into data-broker regulation. Stay on the internal-research side and you're fine.
How often does the data update?
DNS propagates in minutes. WHOIS updates weekly at registrars. CT logs are near-real-time. GitHub and npm are real-time. Re-run the aggregator weekly for fresh data.
Can I use this for lead generation?
For targeting and qualification: yes (this is what BuiltWith/HG Insights are for). For cold-email lists with personal info: no, not without a separate consent-compliant enrichment step.
How does accuracy compare to Crunchbase?
Domain, tech stack, hosting, CDN, subdomain count, GitHub metrics: more accurate (Crunchbase doesn't even track most of this). Funding, employee counts, investors: drastically less accurate — Crunchbase wins and it's not close.
What if the company uses Cloudflare — do I get real IPs?
No. Cloudflare hides the origin. Infer it from historical DNS (viewdns.info, securitytrails free tier) or non-proxied subdomains like mail and staging.
How do I combine this with Apollo/Hunter for contact enrichment?
Run the aggregator first for company-level profiles. Pass qualified domains to Apollo/Hunter for person-level enrichment. Much cheaper than running Apollo on your full list.
Can I run it on 10,000 domains at once?
Yes. At 10 concurrent with 10s per-source timeouts, 10,000 domains take ~3 hours and about $15–25 in Apify credits.
Where's the rate-limit bottleneck?
Usually crt.sh and WHOIS. GitHub is generous if authenticated. DNS has no practical limit with a good resolver. The aggregator backs off per-source independently.
Conclusion
Crunchbase killing its free Basic API hurt, but it wasn't fatal. Roughly two-thirds of what it provided is reconstructable from public WHOIS, DNS, CT logs, GitHub, npm, HTTP headers, robots/sitemap, and favicon/OG metadata. The rest — funding rounds, exact headcount, investor lists — still requires paid data, but you can now pay for it surgically on a pre-qualified shortlist instead of blanket-licensing it for your entire pipeline.
If you want the ready-made version rather than wiring up eight scrapers yourself, try the company-data-aggregator on Apify — it handles parallel fanout, per-source error isolation, rate-limit backoff, and unified JSON output across all eight sources. Pay per run, not per month.