Devil Scrapes

Posted on Jun 4

Y Combinator Companies Scraper: full YC directory to JSON for $3/1K

#webscraping #python #apify #data

Quick answer: The Y Combinator company directory lists every YC-funded startup — but it has no bulk export and no public API. A Y Combinator companies scraper queries the Algolia search index that powers the directory and returns every matching company as a structured JSON row with name, batch, status, industries, team size, website, and YC profile URL. The Apify Actor below does it for $0.003 per company (~$3.00 per 1,000), with fingerprint rotation, proxy management, and retries handled for you.

Y Combinator has funded over 4,000 companies since 2005, and the directory at ycombinator.com/companies is the canonical public register of that portfolio — batch by batch, industry by industry, from the S05 cohort through whatever launched last Tuesday. For investors, recruiters, researchers, and competitive-intel teams, that register is a primary data source.

The catch: it's search-first, not download-first. You can browse 50 companies at a time or type a query, but there's no "export all W24 Developer Tools companies as CSV" button. If you want the dataset, you extract it. Here's what that takes, and how I wrapped it into a single API call.

What is Y Combinator's company directory? 🔎

Y Combinator is the seed accelerator behind Airbnb, Stripe, Dropbox, Coinbase, and several hundred other widely-used software businesses. It accepts companies in batches (historically two a year, more since 2025), designates each batch by season and year (W24 = Winter 2024, S23 = Summer 2023), funds the companies, and lists them in its public directory.

That directory is more than a marketing page. It's the ground truth for:

Which companies are currently Active, Acquired, Public, or Inactive
What industry and region each company operates in
The company's own website, team size, and YC-curated one-liner
A stable YC profile URL that persists even if the company's own domain changes

YC updates the directory continuously; the underlying Algolia index reflects near-real-time changes including newly-added companies and status changes.

Does the YC company directory have an API? 🔌

No. Y Combinator publishes no official public API for the company directory. The directory's search is powered by Algolia, and the search index (YCCompany_production) plus a scoped API key are embedded in the frontend JavaScript that loads the page. Those keys are public in the sense that any browser can read them, but they're not a stable, versioned, documented API — Algolia key rotation or index structure changes can break clients that depend on them directly. The Actor abstracts that surface so you don't have to track it.

What the data looks like

Every company comes back as one flat, typed row. The example below shows every field the Actor returns:

{
  "slug": "airbnb",
  "name": "Airbnb",
  "one_liner": "Book accommodations around the world.",
  "long_description": "Airbnb is a trusted community marketplace for people to list, discover, and book unique accommodations around the world.",
  "batch": "W09",
  "status": "Public",
  "industries": ["Consumer", "Travel, Leisure and Tourism"],
  "tags": ["travel", "marketplace", "hospitality"],
  "regions": ["United States of America", "San Francisco Bay Area"],
  "location": "San Francisco, CA, USA",
  "team_size": 6000,
  "website": "https://airbnb.com",
  "small_logo_url": "https://bookface-images.s3.amazonaws.com/small_logos/airbnb.png",
  "yc_url": "https://www.ycombinator.com/companies/airbnb",
  "yc_team_link": "https://www.ycombinator.com/companies/airbnb/founders",
  "scraped_at": "2026-05-31T09:00:00+00:00"
}

Sixteen fields, Pydantic-validated before they land in your dataset. The slug is stable and maps directly to the YC profile URL — a natural join key for incremental diffs.
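To make the "join key" point concrete, here is a minimal sketch of an incremental diff keyed on slug. It assumes you have two snapshots loaded as lists of dicts shaped like the row above; the function name `diff_by_slug` and the sample slugs `example-co` and `new-co` are made up for illustration.

```python
from typing import Iterable


def diff_by_slug(previous: Iterable[dict], current: Iterable[dict]) -> dict:
    """Compare two snapshots of company rows, keyed on `slug`."""
    prev = {row["slug"]: row for row in previous}
    curr = {row["slug"]: row for row in current}

    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    # A status change is a slug present in both snapshots with a different status.
    status_changed = sorted(
        slug
        for slug in curr.keys() & prev.keys()
        if curr[slug].get("status") != prev[slug].get("status")
    )
    return {"added": added, "removed": removed, "status_changed": status_changed}


# Hypothetical snapshots; only the fields the diff needs are shown.
yesterday = [
    {"slug": "airbnb", "status": "Public"},
    {"slug": "example-co", "status": "Active"},
]
today = [
    {"slug": "airbnb", "status": "Public"},
    {"slug": "example-co", "status": "Acquired"},
    {"slug": "new-co", "status": "Active"},
]

changes = diff_by_slug(yesterday, today)
print(changes)
```

Keying on slug rather than name or website means a rebrand or a domain change does not show up as one removal plus one addition.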

The naive approach (and why it falls apart)

The path a developer usually tries first:

Open browser DevTools on ycombinator.com/companies
Find the Algolia XHR call
Replicate it with requests.post() and paginate

This mostly works — until it doesn't. Three places it breaks in production:

1. TLS fingerprinting on Algolia's search endpoint. A plain Python requests or httpx session emits a TLS fingerprint nothing like a real browser's, which makes the traffic easy to flag. We impersonate Chrome, Firefox, and Safari TLS + HTTP/2 sessions via curl-cffi, so the handshake looks like the one a real browser sends when it loads the YC website.

2. Rate limiting and session management. A fast, naively-looped scraper can exhaust its Algolia request budget before finishing a large batch pull. We thread Apify residential proxies with rotating session IDs — when the target pushes back, we swap the exit IP and back off, rather than hammering until a ban arrives. We retry on 408 / 429 / 5xx with exponential backoff (up to 5 attempts per page), honouring Retry-After when the header is present.

3. Algolia key rotation. The directory's scoped search key gets rotated; a request that worked yesterday can start returning 403 tomorrow. Before each run we re-scrape the current key straight from the companies page and fall back to a known-good key if the page shape changes. We then page the index 100 hits at a time up to your maxResults cap, honouring Algolia's nbPages boundary so we stop cleanly instead of looping past the last page.
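The retry policy from point 2 can be sketched in a few lines. This is not the Actor's code, just an illustration under stated assumptions: `fetch` is a hypothetical callable that performs one request and returns `(status, headers, body)`, the base delay and cap are arbitrary, proxy and session rotation are left out, and only the numeric form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# (status code, response headers, body) as returned by the hypothetical `fetch`.
Response = Tuple[int, Mapping[str, str], bytes]


def is_retryable(status: int) -> bool:
    """408, 429 and any 5xx are worth retrying; everything else is final."""
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[str],
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour the server's hint when it is a plain number of seconds.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * 2 ** (attempt - 1))  # 1s, 2s, 4s, 8s, ...


def fetch_with_retries(fetch: Callable[[], Response], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `fetch` until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch()
        if not is_retryable(status) or attempt == max_attempts:
            return status, headers, body
        # Plain-dict lookup is case-sensitive; real HTTP clients normalise header names.
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise ValueError("max_attempts must be at least 1")


# Demo with a fake fetch: two failures, then success. Waits are recorded, not slept.
responses = iter([(429, {"Retry-After": "2"}, b""), (503, {}, b""), (200, {}, b"ok")])
waits = []
result = fetch_with_retries(lambda: next(responses), sleep=waits.append)
print(result[0], waits)
```

Passing `sleep` in as a parameter keeps the policy testable without real delays.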

None of that is boilerplate. It's the gap between a one-time script and a repeatable pipeline.
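The paging rule from point 3 (100 hits per page, stop at maxResults or at nbPages, whichever comes first) looks roughly like this. It is a sketch, not the Actor's implementation: `fetch_page` is a hypothetical callable that returns one Algolia-style response dict with `hits` and `nbPages` keys, and the demo swaps in a fake 250-row index so no network call is made.

```python
from typing import Callable, Dict, Iterator

PAGE_SIZE = 100  # hits requested per page, as described above


def iter_hits(fetch_page: Callable[[int], Dict], max_results: int) -> Iterator[Dict]:
    """Yield hits page by page until max_results or the last page is reached."""
    yielded = 0
    page = 0
    while yielded < max_results:
        payload = fetch_page(page)
        hits = payload.get("hits", [])
        for hit in hits:
            if yielded >= max_results:
                return
            yield hit
            yielded += 1
        page += 1
        # Stop at the boundary the response reports, or on an empty page.
        if not hits or page >= payload.get("nbPages", 0):
            return


# Fake index standing in for the real endpoint: 250 rows spread over 3 pages.
FAKE_ROWS = [{"slug": f"co-{i}"} for i in range(250)]
requested_pages = []


def fake_fetch_page(page: int) -> Dict:
    requested_pages.append(page)
    start = page * PAGE_SIZE
    return {"hits": FAKE_ROWS[start:start + PAGE_SIZE], "nbPages": 3}


all_hits = list(iter_hits(fake_fetch_page, max_results=2000))
pages_for_full_pull = list(requested_pages)
capped = list(iter_hits(fake_fetch_page, max_results=120))
print(len(all_hits), pages_for_full_pull, len(capped))
```

Checking `nbPages` after each page is what keeps the loop from requesting a fourth page that does not exist.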

The Actor ⚙️

The Actor is on the Apify Store: apify.com/DevilScrapes/y-combinator-companies-scraper.

Paste a batch or industry filter in the Apify Console and click Start, or call it programmatically:

from apify_client import ApifyClient

# Authenticate with your personal Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("DevilScrapes/y-combinator-companies-scraper").call(
    run_input={
        "batch": "W24",
        "industry": "Developer Tools",
        "maxResults": 500,
    }
)

# Iterate over the rows the run wrote to its default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["name"], item["website"], item["team_size"])

Key input parameters (all optional):

Field	What it does
`searchQuery`	Free-text search across company name + one-liner
`batch`	YC batch slug — `W24`, `S23`, `W09`, etc.
`industry`	Industry tag as YC uses it — `Developer Tools`, `Fintech`, `B2B`
`status`	`any` / `active` / `acquired` / `inactive` / `public`
`maxResults`	Cap on companies returned (default 50, max 2000)

Leave all filters blank to pull from the whole directory, up to your maxResults cap.

Use cases 💡

Five concrete patterns, not generic "startup research":

1. Batch-specific lead generation. Pull every W24 company with industry = "B2B", export as CSV, and feed it into your outbound sequence. Each row includes the company website and a YC-curated one-liner — enough context for a first pass before enrichment.

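2. Investor deal-flow monitoring. Schedule a nightly run and diff slug values against yesterday's run to surface newly added companies within hours of YC publishing them, a first-mover signal for pre-seed investors. An unfiltered run with maxResults = 2000 cannot cover a directory of 4,000+ companies, so scope each run by batch or industry when the diff needs to be complete.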

3. Competitive landscape mapping. Pull all companies in a vertical (Fintech, Healthcare, Climate) across all batches. Group by batch to see how YC's thesis in that space has evolved over time. The status field lets you separate active companies from acquired ones without a second lookup.

4. Talent and hiring research. Run with status = active, then filter the returned rows on regions (San Francisco Bay Area) and team_size < 20; region and team size are output fields, not Actor inputs. You get a list of small, active early-stage companies that might be hiring or might be acqui-hire candidates, with direct website links.

5. Academic and journalistic research. The YC dataset is a natural experiment in startup outcomes — thousands of companies, controlled entry criteria, longitudinal status data. The Actor gives you a clean, current snapshot without manual copy-paste.
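Patterns 3 and 4 both involve a little post-processing on the rows you get back, since region and team size are output fields rather than Actor inputs. A minimal sketch, assuming rows shaped like the Airbnb example earlier; the helper names and the four sample rows are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def small_bay_area_companies(rows: Iterable[dict], max_team_size: int = 20) -> List[dict]:
    """Keep Active rows in the San Francisco Bay Area with a known team size below the limit."""
    kept = []
    for row in rows:
        team_size = row.get("team_size")
        if (
            row.get("status") == "Active"
            and "San Francisco Bay Area" in (row.get("regions") or [])
            and team_size is not None  # null team sizes are skipped, not guessed
            and team_size < max_team_size
        ):
            kept.append(row)
    return kept


def count_by_batch(rows: Iterable[dict]) -> Counter:
    """Number of companies per batch, for the landscape view in pattern 3."""
    return Counter(row.get("batch") for row in rows)


# Hypothetical rows; real ones carry all sixteen fields.
rows = [
    {"slug": "a", "batch": "W24", "status": "Active",
     "regions": ["United States of America", "San Francisco Bay Area"], "team_size": 8},
    {"slug": "b", "batch": "W24", "status": "Active",
     "regions": ["United States of America"], "team_size": 5},
    {"slug": "c", "batch": "S23", "status": "Acquired",
     "regions": ["San Francisco Bay Area"], "team_size": 12},
    {"slug": "d", "batch": "S23", "status": "Active",
     "regions": ["San Francisco Bay Area"], "team_size": None},
]

matches = small_bay_area_companies(rows)
per_batch = count_by_batch(rows)
print([row["slug"] for row in matches], dict(per_batch))
```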

Pricing — exact numbers 💰

Pay-per-event. You pay for companies you receive, not companies you ask for.

Event	Price	What it covers
`actor-start`	$0.005	One-off warm-up charge per run
`result`	$0.003	Per company written to the dataset

Pull	Cost
100 companies	$0.31
500 companies	$1.51
1,000 companies	$3.01
2,000 companies (maximum per run)	$6.01

Apify's $5 free trial credit covers your first ~1,650 companies with no credit card required.
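If you want to check the arithmetic, the two event prices are all you need. The sketch below uses Decimal to avoid float rounding; the function names are mine, and it assumes everything is pulled in a single run (one actor-start charge).

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # one-off charge per run
PER_RESULT = Decimal("0.003")   # charge per company written to the dataset


def run_cost(companies: int) -> Decimal:
    """Exact cost in dollars of one run that returns `companies` rows."""
    return ACTOR_START + PER_RESULT * companies


def companies_within(budget: Decimal) -> int:
    """Largest single-run pull that fits inside `budget` dollars."""
    if budget < ACTOR_START:
        return 0
    return int((budget - ACTOR_START) // PER_RESULT)


for n in (100, 500, 1000, 2000):
    print(n, run_cost(n))

print(companies_within(Decimal("5")))
```

The exact figures come out at $0.305, $1.505, $3.005, and $6.005; the table rounds them up to the cent. A single run on the $5 credit works out to 1,665 companies, so the ~1,650 estimate leaves a little room for extra run starts.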

The technically interesting part

The YC directory uses Algolia's InstantSearch library on the frontend, and its batch labels are not what you'd guess. The UI shows W24, but the index stores Winter 2024 — so a literal batch:"W24" filter returns nothing. The Actor normalises your shorthand (W24 → Winter 2024, S23 → Summer 2023) before it builds the Algolia filters string, then paginates the raw query endpoint directly. No JavaScript execution, no headless browser overhead, no dependency on CSS selector stability. When YC redesigns their frontend, the Actor's query layer is unaffected as long as the Algolia index name and key shape haven't changed.
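Here is roughly what that normalisation step looks like. Treat it as a sketch under assumptions rather than the Actor's source: it handles only the W and S prefixes mentioned above, assumes every two-digit year is 20YY, and the facet name `industries` is a guess based on the output field, not something confirmed against the index.

```python
import re
from typing import Optional

SEASONS = {"W": "Winter", "S": "Summer"}  # only the two prefixes named in this post


def normalise_batch(batch: str) -> str:
    """Turn 'W24' into 'Winter 2024'; pass anything else through unchanged."""
    text = batch.strip()
    match = re.fullmatch(r"([A-Za-z])(\d{2})", text)
    if not match or match.group(1).upper() not in SEASONS:
        return text
    season = SEASONS[match.group(1).upper()]
    return f"{season} 20{match.group(2)}"  # assumes all batches fall in 20YY


def build_filters(batch: Optional[str] = None, industry: Optional[str] = None) -> str:
    """Build an Algolia `filters` string from the optional inputs."""
    clauses = []
    if batch:
        clauses.append(f'batch:"{normalise_batch(batch)}"')
    if industry:
        # `industries` as the facet attribute name is an assumption.
        clauses.append(f'industries:"{industry}"')
    return " AND ".join(clauses)


print(normalise_batch("W24"))
print(build_filters(batch="S23", industry="Developer Tools"))
```

Passing unknown input through unchanged means a caller who already types the long form still gets a valid filter.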

Limitations 🚧

Honest list, because surprises in production are worse than warnings up front:

No founder / team data. The Algolia index surfaces company-level metadata. Individual founder names, LinkedIn profiles, and bios live on separate YC profile pages and are not in scope for this Actor.
No revenue or funding figures. YC doesn't publish these in the directory. Post-YC funding rounds come from Crunchbase or PitchBook, not from this source.
Algolia ranks and caps very broad queries. With no batch or industry filter, a fully unfiltered pull can be cut short by the limit Algolia places on how deep a single query can paginate. Use filters to narrow the result set when completeness matters.
long_description is sparse. Many companies have a one_liner but no long_description. We surface null, never fabricate.
Status reflects YC's records, not real-time company state. A company marked Active may have quietly shut down. YC updates the directory but not instantly.

FAQ ❓

Is scraping the YC company directory legal?
The YC directory is publicly accessible with no authentication wall. This Actor reads only what any browser user can see, at a measured request pace, collecting only company-level metadata — no personal data. Always check your jurisdiction and use case.

Can I export to Google Sheets or a data warehouse?
Yes. Export JSON, CSV, or Excel from the Apify Console, or use the Apify API. To automate it, attach a webhook to the ACTOR.RUN.SUCCEEDED event and push rows into Make, Zapier, n8n, or a direct database connector.

Does the YC directory have an official API?
No. The directory runs on Algolia's hosted search. This Actor wraps that layer so you don't have to track key rotations or index changes yourself.

Why are some fields null?
YC doesn't fill every field for every company. Older batches have minimal metadata; newer companies have fuller profiles. We surface null, never fabricate.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/y-combinator-companies-scraper.

Free $5 trial credit, no credit card. Run it on batch = W24 and you'll have a structured list of the Winter 2024 cohort in your dataset within seconds. Got a use case I didn't cover, or a field you wish it returned? Drop a comment below. I build based on what people actually need.

Further reading:

Y Combinator company directory — the source
Algolia InstantSearch documentation — the search layer this Actor targets
Apify Actor documentation — how Actors work, pricing, and API reference

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
