Crunchbase holds the most comprehensive database of startup and venture capital data on the web. Company profiles, funding histories, investor portfolios, acquisitions — it's the de facto source for business intelligence in the startup ecosystem.
But scraping Crunchbase in 2026 is genuinely challenging. This guide covers the technical landscape: what protections you're facing, what data is available, and the realistic approaches that work.
## The Technical Challenge: Cloudflare
Crunchbase sits behind Cloudflare's Bot Management. This isn't basic CAPTCHA protection — it's JavaScript challenge loops, TLS fingerprinting, and behavioral analysis. Here's what this means in practice:
- Datacenter IPs are blocked within 1-5 requests.
- Basic HTTP clients (requests, httpx, urllib) get 403s immediately.
- Headless browsers without proper fingerprinting get detected.
- Residential proxies are required for any sustained scraping.
This isn't solvable with clever headers or cookie manipulation. Cloudflare's detection is sophisticated enough that the only reliable approach runs through residential IP addresses.
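You can see the block in practice by inspecting what comes back: instead of the profile, a plain HTTP client gets served a challenge page. Here is a minimal sketch of a helper that recognizes such a response — the status codes, headers, and markers are typical of Cloudflare deployments in general, not Crunchbase-specific guarantees:

```python
def is_cloudflare_block(status: int, headers: dict, body: str) -> bool:
    """Heuristically detect a Cloudflare challenge or block response."""
    # Challenges usually arrive as 403 (blocked) or 503 (JS challenge)
    if status not in (403, 503):
        return False
    # Cloudflare tags responses with a cf-ray header and "server: cloudflare"
    lowered_keys = {k.lower() for k in headers}
    has_cf_headers = (
        "cf-ray" in lowered_keys
        or headers.get("Server", headers.get("server", "")).lower() == "cloudflare"
    )
    # Challenge pages carry telltale markers in the HTML
    markers = ("just a moment", "cf-chl", "challenge-platform")
    has_marker = any(m in body.lower() for m in markers)
    return has_cf_headers or has_marker
```

Wiring this into your retry logic lets you distinguish a Cloudflare block (rotate IP, back off) from an ordinary HTTP error (retry as usual).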
## What Data Is Available
Despite the protection, Crunchbase pages are data-rich once you get past Cloudflare:
### JSON-LD Structured Data
Company pages include Schema.org Organization markup:
```json
{
  "@type": "Organization",
  "name": "OpenAI",
  "url": "https://openai.com",
  "description": "AI research and deployment company",
  "foundingDate": "2015-12-11",
  "numberOfEmployees": {"@type": "QuantitativeValue", "value": 3700}
}
```
### Embedded React State
Crunchbase is a React application. The initial page load includes a `__NEXT_DATA__` (or similar) hydration payload with structured company data:
```json
{
  "props": {
    "pageProps": {
      "entity": {
        "properties": {
          "identifier": {"value": "openai"},
          "short_description": "...",
          "funding_total": {"value": 11000000000, "currency": "USD"},
          "last_funding_type": "secondary_market",
          "num_employees_enum": "c_01001_05000"
        }
      }
    }
  }
}
```
This hydration data is often more complete than what's visible on the page.
### Autocomplete API
Crunchbase's search autocomplete endpoint is less aggressively protected than the main site:
```python
import httpx

resp = httpx.get(
    "https://www.crunchbase.com/v4/data/autocompletes",
    params={
        "query": "artificial intelligence",
        "collection_ids": "organizations",
        "limit": 25,
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "X-Cb-Client-App-Instance-Id": "your-uuid-here",
    },
)

# Returns basic company info: name, short_description, identifier
print(resp.json())
```
Note: This endpoint returns limited data (name, description, identifier) and may require a valid session cookie. It's useful for discovery but not for full company profiles.
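For discovery, you mostly want to flatten the response into names and profile URLs. A sketch — the payload shape (`entities` containing an `identifier` object) is an assumption based on Crunchbase's v4 API conventions, not a documented contract:

```python
def parse_autocomplete(payload: dict) -> list[dict]:
    """Flatten an autocomplete response into name/permalink/URL records.

    The 'entities' -> 'identifier' shape is assumed, not documented.
    """
    results = []
    for entity in payload.get("entities", []):
        ident = entity.get("identifier", {})
        permalink = ident.get("permalink")
        results.append({
            "name": ident.get("value"),
            "permalink": permalink,
            "url": f"https://www.crunchbase.com/organization/{permalink}",
        })
    return results
```

The resulting URLs feed directly into the full-profile scraper described next.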
## Realistic Approach: Browser Automation + Residential Proxy
The most reliable DIY approach uses Playwright with residential proxies:
```python
import asyncio
import json

from playwright.async_api import async_playwright


async def scrape_crunchbase_company(url: str, proxy: dict) -> dict:
    """Scrape a Crunchbase company page using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy=proxy,  # {"server": "http://proxy:port", "username": "...", "password": "..."}
            headless=True,
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()

        # Navigate and wait for data to load
        await page.goto(url, wait_until="networkidle")
        await page.wait_for_timeout(3000)  # extra wait for dynamic content

        # Extract JSON-LD
        ld_data = await page.evaluate("""
            () => {
                const script = document.querySelector('script[type="application/ld+json"]');
                return script ? JSON.parse(script.textContent) : null;
            }
        """)

        # Extract visible data points
        company_data = await page.evaluate("""
            () => {
                const getText = (sel) => {
                    const el = document.querySelector(sel);
                    return el ? el.textContent.trim() : null;
                };
                return {
                    name: getText('h1'),
                    description: getText('[data-test="description"]'),
                };
            }
        """)

        await browser.close()
        return {
            **company_data,
            "json_ld": ld_data,
            "url": url,
        }


# Usage with residential proxy
proxy = {
    "server": "http://residential-proxy.example.com:8080",
    "username": "your_user",
    "password": "your_pass",
}
result = asyncio.run(scrape_crunchbase_company(
    "https://www.crunchbase.com/organization/openai",
    proxy,
))
print(json.dumps(result, indent=2))
```
## Why DIY Crunchbase Scraping Is Hard
The code above works for a single page. Scaling it to hundreds or thousands of companies introduces:
- Proxy rotation: You need to rotate residential IPs to avoid per-IP rate limits.
- Session management: Cloudflare tracks sessions. You need fresh browser contexts.
- Error handling: Cloudflare challenges, timeouts, partial loads, and blocked requests all need retry logic.
- Cost: Residential proxy bandwidth at $5-15/GB adds up when each page load is 2-5MB.
- Maintenance: Crunchbase updates their page structure and Cloudflare tunes their rules. Your selectors break.
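The first three bullets can be sketched as a thin layer around the scraper above — the pool contents and backoff constants here are illustrative, not tuned values:

```python
import itertools
import random


class ProxyPool:
    """Round-robin rotation over a set of residential proxy endpoints."""

    def __init__(self, proxies: list[dict]):
        self._cycle = itertools.cycle(proxies)

    def next(self) -> dict:
        """Return the next proxy config for a fresh browser context."""
        return next(self._cycle)


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter, for retrying challenged requests."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```

Each retry would draw a new proxy from the pool, open a fresh browser context (so Cloudflare sees no carried-over session), and sleep `backoff_delay(attempt)` seconds before trying again.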
For these reasons, most people who scrape Crunchbase at scale use a managed solution.
## Using an Apify Actor
The CryptoSignals Crunchbase Scraper handles the infrastructure complexity:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/crunchbase-scraper").call(run_input={
    "urls": [
        "https://www.crunchbase.com/organization/stripe",
        "https://www.crunchbase.com/organization/figma",
    ],
    "scrapeType": "companies",
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    funding = item.get("funding_total", "N/A")
    employees = item.get("employee_count", "N/A")
    print(f"{item['name']} | Funding: {funding} | Employees: {employees}")
```
Important: You must configure a residential proxy. The actor will not work reliably with datacenter proxies, because Cloudflare blocks them.
## Use Cases
### Lead Generation Pipeline
Scrape companies by sector, filter by funding stage and employee count, enrich with contact data from other sources. Common pipeline:
1. Search Crunchbase for "fintech" companies with Series A-B funding
2. Extract company profiles and key people
3. Cross-reference with LinkedIn for decision-maker contacts
4. Load into CRM
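The stage-and-headcount filtering can happen client-side on the scraped records. A sketch — the field names follow the hydration payload shown earlier and are assumptions for any other data source:

```python
def filter_leads(
    companies: list[dict],
    stages: frozenset = frozenset({"series_a", "series_b"}),
    min_employees: int = 50,
) -> list[dict]:
    """Keep companies at the target funding stages with enough headcount."""
    return [
        c for c in companies
        if c.get("last_funding_type") in stages
        and (c.get("employee_count") or 0) >= min_employees
    ]

companies = [
    {"name": "A", "last_funding_type": "series_a", "employee_count": 120},
    {"name": "B", "last_funding_type": "seed", "employee_count": 10},
]
print([c["name"] for c in filter_leads(companies)])  # ['A']
```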
### Investor Portfolio Analysis
Track a VC's investment patterns: sectors, stages, check sizes, co-investors. Useful for founders targeting specific investors.
### Market Sizing
Count companies in a specific sector by geography and funding stage. Answer questions like: "How many AI startups in Europe raised Series B+ in 2025?"
## Cost Breakdown
Realistic monthly costs for Crunchbase scraping at scale:
| Component | Cost |
|---|---|
| Apify actor subscription | $4.99/mo |
| Residential proxy (Apify add-on) | $10-20/mo (usage-based) |
| Apify platform compute | $5-10/mo (usage-based) |
| Total | ~$20-35/mo |
Compare this to Crunchbase Pro API at $99/mo, and scraping makes financial sense for many use cases — with the trade-off of lower reliability (80-95% vs 99%+).
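Bandwidth is the line item that scales with volume. Using mid-range figures from above ($10/GB, 3.5 MB per page load — both assumptions within the stated ranges), a quick per-page estimate:

```python
def cost_per_page(mb_per_page: float, usd_per_gb: float) -> float:
    """Residential-proxy bandwidth cost for one page load, in USD."""
    return mb_per_page / 1024 * usd_per_gb

# Mid-range assumptions: 3.5 MB pages at $10/GB
print(round(cost_per_page(3.5, 10.0), 4))        # 0.0342  (~3.4 cents/page)
print(round(cost_per_page(3.5, 10.0) * 500, 2))  # 17.09   (~500 pages/month)
```

At roughly 500 page loads a month, this lands inside the $10-20 proxy line in the table; scrape an order of magnitude more and bandwidth, not actor fees, dominates the bill.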
Crunchbase scraping in 2026 is doable but requires residential proxies and realistic expectations. If you need 100% reliability, use the official API. If you need cost-effective bulk data and can tolerate occasional failures, a scraper with good proxy infrastructure will serve you well.
Try the CryptoSignals Crunchbase Scraper — available on the Apify Store. Residential proxy required for reliable results.