Crunchbase is the go-to database for startup and venture capital data. With information on 2M+ companies, funding rounds, acquisitions, and key people, it's invaluable for investors, recruiters, market researchers, and sales teams.
But getting that data at scale? That's where things get tricky. Crunchbase has aggressively locked down access over the years, making scraping increasingly difficult. The free Basic API was deprecated, the paid API starts at $49/month, and the website is heavily protected against automated access.
In this guide, I'll show you every viable method for extracting Crunchbase data in 2026 — what works, what doesn't, and how to avoid getting blocked.
Understanding Crunchbase's Data Structure
Before scraping, it helps to understand what Crunchbase actually stores:
- Organizations: Companies, investors, schools — each with a profile, description, funding history, and team members
- People: Founders, executives, investors — linked to their organizations
- Funding Rounds: Series A, B, C, etc. — with amounts, dates, and participating investors
- Acquisitions: Who bought whom, for how much, and when
- Events: Conferences, competitions, and industry events
- Categories and Industries: Hierarchical taxonomy for classification
Each entity has a unique permalink (URL slug) that serves as its identifier.
Method 1: Crunchbase's Official API
Crunchbase offers a tiered API. Here's what you need to know:
API Tiers in 2026
| Tier | Price | Rate Limit | Features |
|---|---|---|---|
| Basic (deprecated) | Free | N/A | No longer available |
| Starter | $49/mo | 200 req/min | Search, org profiles |
| Pro | $99/mo | 1,000 req/min | Full data, bulk export |
| Enterprise | Custom | Custom | Everything + support |
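Whichever tier you're on, it's worth pacing requests client-side rather than reacting to 429 responses after the fact. A minimal sketch of a pacer sized for the Starter tier's 200 req/min quota (the class is illustrative, not part of any Crunchbase SDK):

```python
import time

class RequestPacer:
    """Client-side pacing to stay under a requests-per-minute quota."""

    def __init__(self, max_per_minute: int = 200):
        # Minimum gap between consecutive requests, in seconds
        self.min_interval = 60.0 / max_per_minute
        self.last_request = 0.0

    def wait(self):
        """Sleep just long enough to respect the quota, then record the send time."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

pacer = RequestPacer(max_per_minute=200)
# Call pacer.wait() immediately before each API request
```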
Working with the Crunchbase API
import requests
import time

CRUNCHBASE_API_KEY = "your_api_key_here"
BASE_URL = "https://api.crunchbase.com/api/v4"

HEADERS = {
    "X-cb-user-key": CRUNCHBASE_API_KEY,
    "Content-Type": "application/json"
}

def search_organizations(query: str, limit: int = 25) -> list:
    """Search for organizations on Crunchbase."""
    url = f"{BASE_URL}/autocompletes"
    params = {
        "query": query,
        "collection_ids": "organizations",
        "limit": limit
    }
    response = requests.get(url, headers=HEADERS, params=params)
    if response.status_code == 200:
        data = response.json()
        return data.get("entities", [])
    elif response.status_code == 401:
        print("Invalid API key")
        return []
    elif response.status_code == 429:
        print("Rate limited - waiting 60s")
        time.sleep(60)
        return search_organizations(query, limit)
    else:
        print(f"Error {response.status_code}: {response.text}")
        return []

results = search_organizations("artificial intelligence")
for org in results:
    props = org.get("properties", {})
    print(f"{props.get('identifier', {}).get('value')}: {props.get('short_description')}")
Fetching Detailed Organization Data
def get_organization(permalink: str):
    """Fetch detailed organization profile."""
    url = f"{BASE_URL}/entities/organizations/{permalink}"
    params = {
        # GET parameters should be comma-separated strings; passing a raw
        # list would make requests encode them as repeated parameters
        "field_ids": ",".join([
            "identifier", "short_description", "description",
            "founded_on", "num_employees_enum", "website_url",
            "linkedin", "twitter", "location_identifiers",
            "categories", "category_groups", "funding_total",
            "num_funding_rounds", "last_funding_type",
            "investor_identifiers", "revenue_range"
        ]),
        "card_ids": ",".join([
            "founders", "raised_funding_rounds", "acquiree_acquisitions"
        ])
    }
    response = requests.get(url, headers=HEADERS, params=params)
    if response.status_code == 200:
        return response.json()
    elif response.status_code == 404:
        print(f"Organization '{permalink}' not found")
        return None
    else:
        print(f"Error: {response.status_code}")
        return None

openai_data = get_organization("openai")
if openai_data:
    props = openai_data.get("properties", {})
    print(f"Founded: {props.get('founded_on')}")
    print(f"Total Funding: {props.get('funding_total', {}).get('value_usd')}")
    print(f"Employees: {props.get('num_employees_enum')}")
Searching with Filters
The search endpoint is where the real power lies. You can build complex queries to find exactly the startups you need:
def search_funded_startups(
    location: str = None,
    min_funding: int = None,
    categories: list = None,
    founded_after: str = None,
    limit: int = 50
) -> list:
    """Search for startups with specific criteria."""
    url = f"{BASE_URL}/searches/organizations"
    field_ids = [
        "identifier", "short_description", "location_identifiers",
        "categories", "funding_total", "founded_on",
        "num_employees_enum", "last_funding_type",
        "num_funding_rounds"
    ]
    query_conditions = []
    if location:
        query_conditions.append({
            "type": "predicate",
            "field_id": "location_identifiers",
            "operator_id": "includes",
            "values": [location]
        })
    if min_funding:
        query_conditions.append({
            "type": "predicate",
            "field_id": "funding_total",
            "operator_id": "gte",
            "values": [{"value": min_funding, "currency": "usd"}]
        })
    if founded_after:
        query_conditions.append({
            "type": "predicate",
            "field_id": "founded_on",
            "operator_id": "gte",
            "values": [founded_after]
        })
    if categories:
        query_conditions.append({
            "type": "predicate",
            "field_id": "categories",
            "operator_id": "includes",
            "values": categories
        })
    payload = {
        "field_ids": field_ids,
        "order": [{"field_id": "funding_total", "sort": "desc"}],
        "query": query_conditions,
        "limit": limit
    }
    response = requests.post(url, headers=HEADERS, json=payload)
    if response.status_code == 200:
        return response.json().get("entities", [])
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return []

# Find well-funded AI startups in San Francisco
ai_startups = search_funded_startups(
    location="san-francisco",
    min_funding=10_000_000,
    categories=["artificial-intelligence"],
    founded_after="2023-01-01",
    limit=100
)
for startup in ai_startups:
    props = startup.get("properties", {})
    name = props.get("identifier", {}).get("value", "Unknown")
    funding = props.get("funding_total", {}).get("value_usd", 0)
    print(f"{name}: ${funding:,.0f}")
Method 2: Web Scraping Crunchbase
When the API is too expensive or doesn't expose what you need, web scraping is the alternative. But Crunchbase uses aggressive anti-bot measures including:
- Cloudflare protection with JavaScript challenges
- Fingerprinting and behavioral analysis
- Heavy client-side rendering (React SPA)
- Rate limiting by IP and session
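Because of these defenses, a plain `requests.get()` usually returns a challenge page rather than the profile you asked for. Before parsing anything, it's worth a cheap sanity check; the markers below are a guess based on typical Cloudflare challenge pages and may change over time:

```python
def looks_like_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for a Cloudflare challenge instead of real content.

    The status codes and text markers are assumptions drawn from common
    Cloudflare behavior, not guarantees about Crunchbase specifically.
    """
    markers = ("Just a moment", "Checking your browser", "cf-challenge")
    return status_code in (403, 503) or any(m in body for m in markers)
```

If this returns `True`, don't bother parsing the HTML; switch to a real browser (as in the Playwright example below) or back off and retry later.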
Using Playwright for JavaScript-Rendered Content
import asyncio
import json
from playwright.async_api import async_playwright

async def scrape_crunchbase_org(permalink: str) -> dict:
    """Scrape a Crunchbase organization page using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/121.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1920, "height": 1080}
        )
        page = await context.new_page()
        url = f"https://www.crunchbase.com/organization/{permalink}"
        await page.goto(url, wait_until="networkidle")
        await page.wait_for_timeout(3000)

        data = {}
        # Company name
        name_el = await page.query_selector("h1")
        if name_el:
            data["name"] = await name_el.inner_text()
        # Description
        desc_el = await page.query_selector("[class*='description']")
        if desc_el:
            data["description"] = await desc_el.inner_text()
        # Key facts from the sidebar
        fields = await page.query_selector_all("[class*='field-row']")
        for field in fields:
            label_el = await field.query_selector("[class*='label']")
            value_el = await field.query_selector("[class*='value']")
            if label_el and value_el:
                label = await label_el.inner_text()
                value = await value_el.inner_text()
                data[label.strip().lower().replace(" ", "_")] = value.strip()
        # Funding rounds
        funding_rows = await page.query_selector_all(
            "table[class*='funding'] tbody tr"
        )
        data["funding_rounds"] = []
        for row in funding_rows:
            cells = await row.query_selector_all("td")
            if len(cells) >= 4:
                round_data = {
                    "date": await cells[0].inner_text(),
                    "round_type": await cells[1].inner_text(),
                    "amount": await cells[2].inner_text(),
                    "investors": await cells[3].inner_text(),
                }
                data["funding_rounds"].append(round_data)
        await browser.close()
        return data

result = asyncio.run(scrape_crunchbase_org("stripe"))
print(json.dumps(result, indent=2))
Critical Anti-Detection Measures
Crunchbase is one of the hardest sites to scrape. Here's what you need to evade detection:
async def create_stealth_context(playwright):
    """Create a browser context that avoids detection."""
    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-sandbox",
        ]
    )
    context = await browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/121.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Remove automation indicators before any page script runs
    await context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        window.chrome = {
            runtime: {},
            loadTimes: function() {},
            csi: function() {},
            app: {}
        };
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({ state: Notification.permission }) :
                originalQuery(parameters)
        );
    """)
    return browser, context
Handling Pagination for Search Results
import random

async def scrape_search_results(query: str, max_pages: int = 5) -> list:
    """Scrape Crunchbase search results with pagination."""
    all_results = []
    async with async_playwright() as p:
        browser, context = await create_stealth_context(p)
        page = await context.new_page()
        for page_num in range(1, max_pages + 1):
            url = (
                f"https://www.crunchbase.com/discover/organization.companies"
                f"?page={page_num}"
            )
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_timeout(5000)
            cards = await page.query_selector_all("[class*='result-row']")
            for card in cards:
                name_el = await card.query_selector("a[class*='company-name']")
                if name_el:
                    name = await name_el.inner_text()
                    href = await name_el.get_attribute("href")
                    all_results.append({
                        "name": name.strip(),
                        "url": f"https://www.crunchbase.com{href}"
                    })
            print(f"Page {page_num}: found {len(cards)} results")
            # Random delay between pages
            await page.wait_for_timeout(random.randint(5000, 10000))
        await browser.close()
    return all_results
Method 3: Alternative Data Sources
Sometimes the best way to get Crunchbase data is not to scrape Crunchbase at all. Several alternatives exist:
Open Datasets
- Crunchbase's own data downloads: Pro plan includes CSV exports
- Kaggle: Search for "crunchbase" — there are several community-maintained datasets
- Papers With Code: Some academic datasets include Crunchbase snapshots
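Schemas vary between these datasets, but most flatten organizations into one CSV row each. A small sketch of ranking companies by funding from such a file; the column names `name` and `funding_total_usd` are assumptions — verify them against the dataset you actually download:

```python
import csv
from io import StringIO

def top_funded(csv_text: str, n: int = 5) -> list:
    """Return the n most-funded companies from a Crunchbase-style CSV.

    Assumes 'name' and 'funding_total_usd' columns; real Kaggle uploads
    may use different headers, so adjust accordingly.
    """
    rows = []
    for row in csv.DictReader(StringIO(csv_text)):
        try:
            row["funding_total_usd"] = float(row.get("funding_total_usd") or 0)
        except ValueError:
            # Non-numeric entries (e.g. "undisclosed") count as zero
            row["funding_total_usd"] = 0.0
        rows.append(row)
    return sorted(rows, key=lambda r: r["funding_total_usd"], reverse=True)[:n]

sample = "name,funding_total_usd\nAcme,5000000\nBeta,12000000\n"
print(top_funded(sample, n=1))
```

In practice you'd read the file with `open(path)` instead of `StringIO`; the parsing logic is the same.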
Using Pre-Built Scrapers
Building and maintaining your own Crunchbase scraper is a significant time investment. The anti-bot measures change frequently, and your scraper can break at any time.
Apify's Crunchbase Scraper is a maintained, ready-to-use solution that handles all the anti-detection complexity. You specify what companies or search criteria you want, and it returns structured JSON data. It's particularly useful for one-off research projects or regular data pulls where you don't want to maintain scraping infrastructure.
Building a Funding Round Tracker
Here's a practical example that uses the API to track funding activity:
import requests
import json
import csv
import time
from datetime import datetime, timedelta

class CrunchbaseTracker:
    """Track startup funding rounds from Crunchbase."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.crunchbase.com/api/v4"
        self.headers = {
            "X-cb-user-key": api_key,
            "Content-Type": "application/json"
        }
        self.request_count = 0

    def _request(self, method: str, endpoint: str,
                 params: dict = None, payload: dict = None):
        """Make rate-limited API request."""
        self.request_count += 1
        url = f"{self.base_url}{endpoint}"
        if method == "GET":
            resp = requests.get(url, headers=self.headers, params=params)
        else:
            resp = requests.post(url, headers=self.headers, json=payload)
        if resp.status_code == 200:
            return resp.json()
        elif resp.status_code == 429:
            print(f"Rate limited after {self.request_count} requests. Waiting...")
            time.sleep(60)
            return self._request(method, endpoint, params, payload)
        else:
            print(f"API Error {resp.status_code}: {resp.text[:200]}")
            return None

    def get_recent_funding(self, days: int = 7,
                           min_amount: int = 1_000_000) -> list:
        """Get funding rounds from the last N days."""
        since_date = (
            datetime.now() - timedelta(days=days)
        ).strftime("%Y-%m-%d")
        payload = {
            "field_ids": [
                "identifier", "funded_organization_identifier",
                "money_raised", "announced_on", "investment_type",
                "investor_identifiers", "num_investors"
            ],
            "query": [
                {
                    "type": "predicate",
                    "field_id": "announced_on",
                    "operator_id": "gte",
                    "values": [since_date]
                },
                {
                    "type": "predicate",
                    "field_id": "money_raised",
                    "operator_id": "gte",
                    "values": [{"value": min_amount, "currency": "usd"}]
                }
            ],
            "order": [{"field_id": "announced_on", "sort": "desc"}],
            "limit": 100
        }
        result = self._request("POST", "/searches/funding_rounds",
                               payload=payload)
        if not result:
            return []
        rounds = []
        for entity in result.get("entities", []):
            props = entity.get("properties", {})
            rounds.append({
                "company": props.get(
                    "funded_organization_identifier", {}
                ).get("value", "Unknown"),
                "amount_usd": props.get(
                    "money_raised", {}
                ).get("value_usd", 0),
                "round_type": props.get("investment_type", "Unknown"),
                "date": props.get("announced_on", "Unknown"),
                "num_investors": props.get("num_investors", 0)
            })
        return rounds

    def get_top_investors(self, category: str = None,
                          limit: int = 20) -> list:
        """Get most active investors."""
        payload = {
            "field_ids": [
                "identifier", "short_description",
                "num_investments_total", "num_exits",
                "location_identifiers"
            ],
            "order": [
                {"field_id": "num_investments_total", "sort": "desc"}
            ],
            "limit": limit
        }
        if category:
            payload["query"] = [{
                "type": "predicate",
                "field_id": "investor_type",
                "operator_id": "includes",
                "values": [category]
            }]
        result = self._request("POST", "/searches/organizations",
                               payload=payload)
        if not result:
            return []
        investors = []
        for entity in result.get("entities", []):
            props = entity.get("properties", {})
            investors.append({
                "name": props.get(
                    "identifier", {}
                ).get("value", "Unknown"),
                "total_investments": props.get(
                    "num_investments_total", 0
                ),
                "exits": props.get("num_exits", 0),
                "description": props.get("short_description", "")
            })
        return investors

    def export_to_csv(self, data: list, filename: str):
        """Export results to CSV."""
        if not data:
            print("No data to export")
            return
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        print(f"Exported {len(data)} records to {filename}")

# Usage Example
tracker = CrunchbaseTracker("your_api_key_here")

# Get this week's big funding rounds
recent = tracker.get_recent_funding(days=7, min_amount=5_000_000)
print(f"\nBig funding rounds this week: {len(recent)}")
for r in recent[:10]:
    print(f"  {r['company']}: ${r['amount_usd']:,.0f} ({r['round_type']})")
tracker.export_to_csv(recent, "weekly_funding.csv")
Handling Common Challenges
Challenge 1: Cloudflare Protection
Crunchbase uses Cloudflare's anti-bot system. When you hit a challenge page, you need to wait for it to resolve:
async def handle_cloudflare(page):
    """Wait for Cloudflare challenge to resolve."""
    max_wait = 30
    waited = 0
    while waited < max_wait:
        title = await page.title()
        if "Just a moment" in title or "Checking" in title:
            await page.wait_for_timeout(2000)
            waited += 2
        else:
            return True
    return False
Challenge 2: Login Walls
Some Crunchbase data requires a logged-in session. Here's how to handle authentication:
async def login_crunchbase(page, email: str, password: str):
    """Log into Crunchbase."""
    await page.goto("https://www.crunchbase.com/login")
    await page.wait_for_timeout(3000)
    await page.fill("input[name='email']", email)
    await page.fill("input[name='password']", password)
    await page.click("button[type='submit']")
    await page.wait_for_timeout(5000)
    cookies = await page.context.cookies()
    logged_in = any(c["name"] == "cb_session" for c in cookies)
    return logged_in
Challenge 3: Data Completeness
Crunchbase data is community-contributed and often incomplete. Always validate what you get:
def validate_company_data(company: dict) -> dict | None:
    """Clean and validate scraped company data."""
    cleaned = {}
    cleaned["name"] = company.get("name", "").strip()
    if not cleaned["name"]:
        return None
    # Funding (normalize to USD integer)
    funding_raw = company.get("funding_total", "")
    if isinstance(funding_raw, str):
        value = funding_raw.replace("$", "").replace(",", "").strip()
        # Apply the suffix as a multiplier so "1.5M" becomes 1500000;
        # a plain string replace would turn it into the garbled "1.5000000"
        multiplier = 1
        if value.endswith("B"):
            multiplier, value = 1_000_000_000, value[:-1]
        elif value.endswith("M"):
            multiplier, value = 1_000_000, value[:-1]
        try:
            cleaned["funding_usd"] = int(float(value) * multiplier)
        except ValueError:
            cleaned["funding_usd"] = None
    else:
        cleaned["funding_usd"] = funding_raw
    # Employee count (normalize ranges to a midpoint estimate)
    emp = company.get("num_employees", "")
    emp_ranges = {
        "1-10": 5, "11-50": 30, "51-100": 75,
        "101-250": 175, "251-500": 375, "501-1000": 750,
        "1001-5000": 3000, "5001-10000": 7500, "10001+": 15000
    }
    cleaned["employees_est"] = emp_ranges.get(emp, None)
    cleaned["founded"] = company.get("founded_on")
    return cleaned
Rate Limits and Best Practices Summary
| Method | Rate Limit | Cost | Reliability |
|---|---|---|---|
| API (Starter) | 200 req/min | $49/mo | High |
| API (Pro) | 1,000 req/min | $99/mo | High |
| Web Scraping | Self-managed | Free + proxy costs | Low-Medium |
| Apify Scraper | Managed | Pay per use | High |
| Kaggle datasets | N/A | Free | Medium (stale) |
Key Rules to Follow
- Start with the API if you can afford it — it's the most reliable path
- Use web scraping as a supplement, not a primary method
- Cache aggressively — company data doesn't change hourly
- Respect rate limits — getting banned means starting over
- Validate everything — Crunchbase data has gaps and inconsistencies
- Store raw responses — you'll want to reparse later as your needs evolve
- Keep your scraper updated — Crunchbase changes their frontend regularly
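The caching rule in particular pays for itself quickly: profiles rarely change within a day, so every cache hit is a request you didn't spend. A minimal in-memory TTL cache sketch (for production you'd likely want something persistent like SQLite or Redis):

```python
import time

class TTLCache:
    """Tiny in-memory cache so repeated lookups don't burn API quota."""

    def __init__(self, ttl_seconds: float = 86400):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; drop and treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

cache = TTLCache(ttl_seconds=86400)  # one day
# Check cache.get(permalink) before calling the API; cache.set() after
```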
Putting It All Together: Choosing Your Approach
Here's a decision tree for choosing the right method:
Do you need real-time, up-to-date data?
- Yes -> Use the official API ($49+/mo) or a managed scraper like Apify's Crunchbase Scraper
- No -> Check Kaggle for existing datasets first
How many companies do you need data on?
- Under 100 -> Manual collection or API autocomplete (might work on free tier)
- 100-10,000 -> API Starter plan or Apify scraper
- Over 10,000 -> API Pro plan with bulk export
How often do you need fresh data?
- One-time research -> Apify scraper or Kaggle dataset
- Weekly updates -> API with scheduled jobs
- Real-time monitoring -> API Pro with webhooks
What's your budget?
- $0 -> Kaggle datasets + limited web scraping
- Under $50/mo -> API Starter or Apify pay-per-use
- Over $100/mo -> API Pro with full access
Conclusion
Scraping Crunchbase in 2026 is harder than ever, but far from impossible. The official API, while paid, offers the most reliable access. Web scraping works but requires significant anti-detection effort. And tools like Apify's Crunchbase Scraper provide a middle ground — managed scraping without the maintenance headache.
Whatever method you choose, remember that the value isn't in the raw data — it's in what you do with it. A well-structured funding tracker, a competitive intelligence dashboard, or an investor CRM built on Crunchbase data can provide enormous value to your business.
Start small, validate your data pipeline, and scale up once you're confident in the quality. The startup ecosystem moves fast, and having reliable access to Crunchbase data gives you a real competitive edge.