Crunchbase is the go-to database for startup and venture capital data. With information on 2M+ companies, funding rounds, acquisitions, and key people, it's invaluable for investors, recruiters, market researchers, and sales teams.
But getting that data at scale? That's where things get tricky. Crunchbase has aggressively locked down access over the years, making scraping increasingly difficult. The free Basic API was deprecated, the paid API starts at $49/month, and the website is heavily protected against automated access.
In this guide, I'll show you every viable method for extracting Crunchbase data in 2026 — what works, what doesn't, and how to avoid getting blocked.
Understanding Crunchbase's Data Structure
Before scraping, it helps to understand what Crunchbase actually stores:
- Organizations: Companies, investors, schools — each with a profile, description, funding history, and team members
- People: Founders, executives, investors — linked to their organizations
- Funding Rounds: Series A, B, C, etc. — with amounts, dates, and participating investors
- Acquisitions: Who bought whom, for how much, and when
- Events: Conferences, competitions, and industry events
- Categories and Industries: Hierarchical taxonomy for classification
Each entity has a unique permalink (URL slug) that serves as its identifier.
Method 1: Crunchbase's Official API
Crunchbase offers a tiered API. Here's what you need to know:
API Tiers in 2026
| Tier | Price | Rate Limit | Features |
|---|---|---|---|
| Basic (deprecated) | Free | N/A | No longer available |
| Starter | $49/mo | 200 req/min | Search, org profiles |
| Pro | $99/mo | 1,000 req/min | Full data, bulk export |
| Enterprise | Custom | Custom | Everything + support |
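Whichever tier you're on, it's worth pacing requests client-side rather than reacting to 429 responses after the fact. A minimal sketch of a pacer sized for the Starter tier's 200 req/min quota (the class is illustrative, not part of any Crunchbase SDK):

```python
import time

class RequestPacer:
    """Client-side pacing to stay under a requests-per-minute quota."""

    def __init__(self, max_per_minute: int = 200):
        # Minimum gap between consecutive requests, in seconds
        self.min_interval = 60.0 / max_per_minute
        self.last_request = 0.0

    def wait(self):
        """Sleep just long enough to respect the quota, then record the send time."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

pacer = RequestPacer(max_per_minute=200)
# Call pacer.wait() immediately before each API request
```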
Working with the Crunchbase API
import requests
import time

CRUNCHBASE_API_KEY = "your_api_key_here"
BASE_URL = "https://api.crunchbase.com/api/v4"

HEADERS = {
    "X-cb-user-key": CRUNCHBASE_API_KEY,
    "Content-Type": "application/json"
}

def search_organizations(query: str, limit: int = 25) -> list:
    """Search for organizations on Crunchbase."""
    url = f"{BASE_URL}/autocompletes"
    params = {
        "query": query,
        "collection_ids": "organizations",
        "limit": limit
    }
    response = requests.get(url, headers=HEADERS, params=params)
    if response.status_code == 200:
        data = response.json()
        return data.get("entities", [])
    elif response.status_code == 401:
        print("Invalid API key")
        return []
    elif response.status_code == 429:
        print("Rate limited - waiting 60s")
        time.sleep(60)
        return search_organizations(query, limit)
    else:
        print(f"Error {response.status_code}: {response.text}")
        return []

results = search_organizations("artificial intelligence")
for org in results:
    props = org.get("properties", {})
    print(f"{props.get('identifier', {}).get('value')}: {props.get('short_description')}")
Fetching Detailed Organization Data
def get_organization(permalink: str):
    """Fetch detailed organization profile."""
    url = f"{BASE_URL}/entities/organizations/{permalink}"
    params = {
        # GET parameters should be comma-separated strings; passing a raw
        # list would make requests encode them as repeated parameters
        "field_ids": ",".join([
            "identifier", "short_description", "description",
            "founded_on", "num_employees_enum", "website_url",
            "linkedin", "twitter", "location_identifiers",
            "categories", "category_groups", "funding_total",
            "num_funding_rounds", "last_funding_type",
            "investor_identifiers", "revenue_range"
        ]),
        "card_ids": ",".join([
            "founders", "raised_funding_rounds", "acquiree_acquisitions"
        ])
    }
    response = requests.get(url, headers=HEADERS, params=params)
    if response.status_code == 200:
        return response.json()
    elif response.status_code == 404:
        print(f"Organization '{permalink}' not found")
        return None
    else:
        print(f"Error: {response.status_code}")
        return None

openai_data = get_organization("openai")
if openai_data:
    props = openai_data.get("properties", {})
    print(f"Founded: {props.get('founded_on')}")
    print(f"Total Funding: {props.get('funding_total', {}).get('value_usd')}")
    print(f"Employees: {props.get('num_employees_enum')}")
Searching with Filters
The search endpoint is where the real power lies. You can build complex queries to find exactly the startups you need:
def search_funded_startups(
    location: str = None,
    min_funding: int = None,
    categories: list = None,
    founded_after: str = None,
    limit: int = 50
) -> list:
    """Search for startups with specific criteria."""
    url = f"{BASE_URL}/searches/organizations"
    field_ids = [
        "identifier", "short_description", "location_identifiers",
        "categories", "funding_total", "founded_on",
        "num_employees_enum", "last_funding_type",
        "num_funding_rounds"
    ]
    query_conditions = []
    if location:
        query_conditions.append({
            "type": "predicate",
            "field_id": "location_identifiers",
            "operator_id": "includes",
            "values": [location]
        })
    if min_funding:
        query_conditions.append({
            "type": "predicate",
            "field_id": "funding_total",
            "operator_id": "gte",
            "values": [{"value": min_funding, "currency": "usd"}]
        })
    if founded_after:
        query_conditions.append({
            "type": "predicate",
            "field_id": "founded_on",
            "operator_id": "gte",
            "values": [founded_after]
        })
    if categories:
        query_conditions.append({
            "type": "predicate",
            "field_id": "categories",
            "operator_id": "includes",
            "values": categories
        })
    payload = {
        "field_ids": field_ids,
        "order": [{"field_id": "funding_total", "sort": "desc"}],
        "query": query_conditions,
        "limit": limit
    }
    response = requests.post(url, headers=HEADERS, json=payload)
    if response.status_code == 200:
        return response.json().get("entities", [])
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return []

# Find well-funded AI startups in San Francisco
ai_startups = search_funded_startups(
    location="san-francisco",
    min_funding=10_000_000,
    categories=["artificial-intelligence"],
    founded_after="2023-01-01",
    limit=100
)
for startup in ai_startups:
    props = startup.get("properties", {})
    name = props.get("identifier", {}).get("value", "Unknown")
    funding = props.get("funding_total", {}).get("value_usd", 0)
    print(f"{name}: ${funding:,.0f}")
Method 2: Web Scraping Crunchbase
When the API is too expensive or doesn't expose what you need, web scraping is the alternative. But Crunchbase uses aggressive anti-bot measures including:
- Cloudflare protection with JavaScript challenges
- Fingerprinting and behavioral analysis
- Heavy client-side rendering (React SPA)
- Rate limiting by IP and session
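Because of these defenses, a plain `requests.get()` usually returns a challenge page rather than the profile you asked for. Before parsing anything, it's worth a cheap sanity check; the markers below are a guess based on typical Cloudflare challenge pages and may change over time:

```python
def looks_like_challenge(status_code: int, body: str) -> bool:
    """Heuristic check for a Cloudflare challenge instead of real content.

    The status codes and text markers are assumptions drawn from common
    Cloudflare behavior, not guarantees about Crunchbase specifically.
    """
    markers = ("Just a moment", "Checking your browser", "cf-challenge")
    return status_code in (403, 503) or any(m in body for m in markers)
```

If this returns `True`, don't bother parsing the HTML; switch to a real browser (as in the Playwright example below) or back off and retry later.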
Using Playwright for JavaScript-Rendered Content
import asyncio
import json
from playwright.async_api import async_playwright

async def scrape_crunchbase_org(permalink: str) -> dict:
    """Scrape a Crunchbase organization page using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/121.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1920, "height": 1080}
        )
        page = await context.new_page()
        url = f"https://www.crunchbase.com/organization/{permalink}"
        await page.goto(url, wait_until="networkidle")
        await page.wait_for_timeout(3000)

        data = {}
        # Company name
        name_el = await page.query_selector("h1")
        if name_el:
            data["name"] = await name_el.inner_text()
        # Description
        desc_el = await page.query_selector("[class*='description']")
        if desc_el:
            data["description"] = await desc_el.inner_text()
        # Key facts from the sidebar
        fields = await page.query_selector_all("[class*='field-row']")
        for field in fields:
            label_el = await field.query_selector("[class*='label']")
            value_el = await field.query_selector("[class*='value']")
            if label_el and value_el:
                label = await label_el.inner_text()
                value = await value_el.inner_text()
                data[label.strip().lower().replace(" ", "_")] = value.strip()
        # Funding rounds
        funding_rows = await page.query_selector_all(
            "table[class*='funding'] tbody tr"
        )
        data["funding_rounds"] = []
        for row in funding_rows:
            cells = await row.query_selector_all("td")
            if len(cells) >= 4:
                round_data = {
                    "date": await cells[0].inner_text(),
                    "round_type": await cells[1].inner_text(),
                    "amount": await cells[2].inner_text(),
                    "investors": await cells[3].inner_text(),
                }
                data["funding_rounds"].append(round_data)
        await browser.close()
        return data

result = asyncio.run(scrape_crunchbase_org("stripe"))
print(json.dumps(result, indent=2))
Critical Anti-Detection Measures
Crunchbase is one of the hardest sites to scrape. Here's what you need to evade detection:
async def create_stealth_context(playwright):
    """Create a browser context that avoids detection."""
    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-sandbox",
        ]
    )
    context = await browser.new_context(
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/121.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Remove automation indicators before any page script runs
    await context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
        window.chrome = {
            runtime: {},
            loadTimes: function() {},
            csi: function() {},
            app: {}
        };
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({ state: Notification.permission }) :
                originalQuery(parameters)
        );
    """)
    return browser, context
Handling Pagination for Search Results
import random

async def scrape_search_results(query: str, max_pages: int = 5) -> list:
    """Scrape Crunchbase search results with pagination."""
    all_results = []
    async with async_playwright() as p:
        browser, context = await create_stealth_context(p)
        page = await context.new_page()
        for page_num in range(1, max_pages + 1):
            url = (
                f"https://www.crunchbase.com/discover/organization.companies"
                f"?page={page_num}"
            )
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_timeout(5000)
            cards = await page.query_selector_all("[class*='result-row']")
            for card in cards:
                name_el = await card.query_selector("a[class*='company-name']")
                if name_el:
                    name = await name_el.inner_text()
                    href = await name_el.get_attribute("href")
                    all_results.append({
                        "name": name.strip(),
                        "url": f"https://www.crunchbase.com{href}"
                    })
            print(f"Page {page_num}: found {len(cards)} results")
            # Random delay between pages
            await page.wait_for_timeout(random.randint(5000, 10000))
        await browser.close()
    return all_results
Method 3: Alternative Data Sources
Sometimes the best way to get Crunchbase data is not to scrape Crunchbase at all. Several alternatives exist:
Open Datasets
- Crunchbase's own data downloads: Pro plan includes CSV exports
- Kaggle: Search for "crunchbase" — there are several community-maintained datasets
- Papers With Code: Some academic datasets include Crunchbase snapshots
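Schemas vary between these datasets, but most flatten organizations into one CSV row each. A small sketch of ranking companies by funding from such a file; the column names `name` and `funding_total_usd` are assumptions — verify them against the dataset you actually download:

```python
import csv
from io import StringIO

def top_funded(csv_text: str, n: int = 5) -> list:
    """Return the n most-funded companies from a Crunchbase-style CSV.

    Assumes 'name' and 'funding_total_usd' columns; real Kaggle uploads
    may use different headers, so adjust accordingly.
    """
    rows = []
    for row in csv.DictReader(StringIO(csv_text)):
        try:
            row["funding_total_usd"] = float(row.get("funding_total_usd") or 0)
        except ValueError:
            # Non-numeric entries (e.g. "undisclosed") count as zero
            row["funding_total_usd"] = 0.0
        rows.append(row)
    return sorted(rows, key=lambda r: r["funding_total_usd"], reverse=True)[:n]

sample = "name,funding_total_usd\nAcme,5000000\nBeta,12000000\n"
print(top_funded(sample, n=1))
```

In practice you'd read the file with `open(path)` instead of `StringIO`; the parsing logic is the same.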
Using Pre-Built Scrapers
Building and maintaining your own Crunchbase scraper is a significant time investment. The anti-bot measures change frequently, and your scraper can break at any time.
Apify's Crunchbase Scraper is a maintained, ready-to-use solution that handles all the anti-detection complexity. You specify what companies or search criteria you want, and it returns structured JSON data. It's particularly useful for one-off research projects or regular data pulls where you don't want to maintain scraping infrastructure.
Building a Funding Round Tracker
Here's a practical example that uses the API to track funding activity:
import requests
import json
import csv
import time
from datetime import datetime, timedelta

class CrunchbaseTracker:
    """Track startup funding rounds from Crunchbase."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.crunchbase.com/api/v4"
        self.headers = {
            "X-cb-user-key": api_key,
            "Content-Type": "application/json"
        }
        self.request_count = 0

    def _request(self, method: str, endpoint: str,
                 params: dict = None, payload: dict = None):
        """Make rate-limited API request."""
        self.request_count += 1
        url = f"{self.base_url}{endpoint}"
        if method == "GET":
            resp = requests.get(url, headers=self.headers, params=params)
        else:
            resp = requests.post(url, headers=self.headers, json=payload)
        if resp.status_code == 200:
            return resp.json()
        elif resp.status_code == 429:
            print(f"Rate limited after {self.request_count} requests. Waiting...")
            time.sleep(60)
            return self._request(method, endpoint, params, payload)
        else:
            print(f"API Error {resp.status_code}: {resp.text[:200]}")
            return None

    def get_recent_funding(self, days: int = 7,
                           min_amount: int = 1_000_000) -> list:
        """Get funding rounds from the last N days."""
        since_date = (
            datetime.now() - timedelta(days=days)
        ).strftime("%Y-%m-%d")
        payload = {
            "field_ids": [
                "identifier", "funded_organization_identifier",
                "money_raised", "announced_on", "investment_type",
                "investor_identifiers", "num_investors"
            ],
            "query": [
                {
                    "type": "predicate",
                    "field_id": "announced_on",
                    "operator_id": "gte",
                    "values": [since_date]
                },
                {
                    "type": "predicate",
                    "field_id": "money_raised",
                    "operator_id": "gte",
                    "values": [{"value": min_amount, "currency": "usd"}]
                }
            ],
            "order": [{"field_id": "announced_on", "sort": "desc"}],
            "limit": 100
        }
        result = self._request("POST", "/searches/funding_rounds",
                               payload=payload)
        if not result:
            return []
        rounds = []
        for entity in result.get("entities", []):
            props = entity.get("properties", {})
            rounds.append({
                "company": props.get(
                    "funded_organization_identifier", {}
                ).get("value", "Unknown"),
                "amount_usd": props.get(
                    "money_raised", {}
                ).get("value_usd", 0),
                "round_type": props.get("investment_type", "Unknown"),
                "date": props.get("announced_on", "Unknown"),
                "num_investors": props.get("num_investors", 0)
            })
        return rounds

    def get_top_investors(self, category: str = None,
                          limit: int = 20) -> list:
        """Get most active investors."""
        payload = {
            "field_ids": [
                "identifier", "short_description",
                "num_investments_total", "num_exits",
                "location_identifiers"
            ],
            "order": [
                {"field_id": "num_investments_total", "sort": "desc"}
            ],
            "limit": limit
        }
        if category:
            payload["query"] = [{
                "type": "predicate",
                "field_id": "investor_type",
                "operator_id": "includes",
                "values": [category]
            }]
        result = self._request("POST", "/searches/organizations",
                               payload=payload)
        if not result:
            return []
        investors = []
        for entity in result.get("entities", []):
            props = entity.get("properties", {})
            investors.append({
                "name": props.get(
                    "identifier", {}
                ).get("value", "Unknown"),
                "total_investments": props.get(
                    "num_investments_total", 0
                ),
                "exits": props.get("num_exits", 0),
                "description": props.get("short_description", "")
            })
        return investors

    def export_to_csv(self, data: list, filename: str):
        """Export results to CSV."""
        if not data:
            print("No data to export")
            return
        with open(filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        print(f"Exported {len(data)} records to {filename}")

# Usage Example
tracker = CrunchbaseTracker("your_api_key_here")

# Get this week's big funding rounds
recent = tracker.get_recent_funding(days=7, min_amount=5_000_000)
print(f"\nBig funding rounds this week: {len(recent)}")
for r in recent[:10]:
    print(f"  {r['company']}: ${r['amount_usd']:,.0f} ({r['round_type']})")
tracker.export_to_csv(recent, "weekly_funding.csv")
Handling Common Challenges
Challenge 1: Cloudflare Protection
Crunchbase uses Cloudflare's anti-bot system. When you hit a challenge page, you need to wait for it to resolve:
async def handle_cloudflare(page):
    """Wait for Cloudflare challenge to resolve."""
    max_wait = 30
    waited = 0
    while waited < max_wait:
        title = await page.title()
        if "Just a moment" in title or "Checking" in title:
            await page.wait_for_timeout(2000)
            waited += 2
        else:
            return True
    return False
Challenge 2: Login Walls
Some Crunchbase data requires a logged-in session. Here's how to handle authentication:
async def login_crunchbase(page, email: str, password: str):
    """Log into Crunchbase."""
    await page.goto("https://www.crunchbase.com/login")
    await page.wait_for_timeout(3000)
    await page.fill("input[name='email']", email)
    await page.fill("input[name='password']", password)
    await page.click("button[type='submit']")
    await page.wait_for_timeout(5000)
    cookies = await page.context.cookies()
    logged_in = any(c["name"] == "cb_session" for c in cookies)
    return logged_in
Challenge 3: Data Completeness
Crunchbase data is community-contributed and often incomplete. Always validate what you get:
def validate_company_data(company: dict) -> dict | None:
    """Clean and validate scraped company data."""
    cleaned = {}
    cleaned["name"] = company.get("name", "").strip()
    if not cleaned["name"]:
        return None
    # Funding (normalize to USD integer)
    funding_raw = company.get("funding_total", "")
    if isinstance(funding_raw, str):
        value = funding_raw.replace("$", "").replace(",", "").strip()
        # Apply the suffix as a multiplier so "1.5M" becomes 1500000;
        # a plain string replace would turn it into the garbled "1.5000000"
        multiplier = 1
        if value.endswith("B"):
            multiplier, value = 1_000_000_000, value[:-1]
        elif value.endswith("M"):
            multiplier, value = 1_000_000, value[:-1]
        try:
            cleaned["funding_usd"] = int(float(value) * multiplier)
        except ValueError:
            cleaned["funding_usd"] = None
    else:
        cleaned["funding_usd"] = funding_raw
    # Employee count (normalize ranges to a midpoint estimate)
    emp = company.get("num_employees", "")
    emp_ranges = {
        "1-10": 5, "11-50": 30, "51-100": 75,
        "101-250": 175, "251-500": 375, "501-1000": 750,
        "1001-5000": 3000, "5001-10000": 7500, "10001+": 15000
    }
    cleaned["employees_est"] = emp_ranges.get(emp, None)
    cleaned["founded"] = company.get("founded_on")
    return cleaned
Rate Limits and Best Practices Summary
| Method | Rate Limit | Cost | Reliability |
|---|---|---|---|
| API (Starter) | 200 req/min | $49/mo | High |
| API (Pro) | 1,000 req/min | $99/mo | High |
| Web Scraping | Self-managed | Free + proxy costs | Low-Medium |
| Apify Scraper | Managed | Pay per use | High |
| Kaggle datasets | N/A | Free | Medium (stale) |
Key Rules to Follow
- Start with the API if you can afford it — it's the most reliable path
- Use web scraping as a supplement, not a primary method
- Cache aggressively — company data doesn't change hourly
- Respect rate limits — getting banned means starting over
- Validate everything — Crunchbase data has gaps and inconsistencies
- Store raw responses — you'll want to reparse later as your needs evolve
- Keep your scraper updated — Crunchbase changes their frontend regularly
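The caching rule in particular pays for itself quickly: profiles rarely change within a day, so every cache hit is a request you didn't spend. A minimal in-memory TTL cache sketch (for production you'd likely want something persistent like SQLite or Redis):

```python
import time

class TTLCache:
    """Tiny in-memory cache so repeated lookups don't burn API quota."""

    def __init__(self, ttl_seconds: float = 86400):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; drop and treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

cache = TTLCache(ttl_seconds=86400)  # one day
# Check cache.get(permalink) before calling the API; cache.set() after
```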
Putting It All Together: Choosing Your Approach
Here's a decision tree for choosing the right method:
Do you need real-time, up-to-date data?
- Yes -> Use the official API ($49+/mo) or a managed scraper like Apify's Crunchbase Scraper
- No -> Check Kaggle for existing datasets first
How many companies do you need data on?
- Under 100 -> Manual collection or API autocomplete (might work on free tier)
- 100-10,000 -> API Starter plan or Apify scraper
- Over 10,000 -> API Pro plan with bulk export
How often do you need fresh data?
- One-time research -> Apify scraper or Kaggle dataset
- Weekly updates -> API with scheduled jobs
- Real-time monitoring -> API Pro with webhooks
What's your budget?
- $0 -> Kaggle datasets + limited web scraping
- Under $50/mo -> API Starter or Apify pay-per-use
- Over $100/mo -> API Pro with full access
Conclusion
Scraping Crunchbase in 2026 is harder than ever, but far from impossible. The official API, while paid, offers the most reliable access. Web scraping works but requires significant anti-detection effort. And tools like Apify's Crunchbase Scraper provide a middle ground — managed scraping without the maintenance headache.
Whatever method you choose, remember that the value isn't in the raw data — it's in what you do with it. A well-structured funding tracker, a competitive intelligence dashboard, or an investor CRM built on Crunchbase data can provide enormous value to your business.
Start small, validate your data pipeline, and scale up once you're confident in the quality. The startup ecosystem moves fast, and having reliable access to Crunchbase data gives you a real competitive edge.