Build a Daily Series A/B Funding Tracker from SEC Form D + TechCrunch (2026)
Crunchbase Pro is $588/month per seat. PitchBook starts higher than that. CB Insights doesn't even publish list pricing. If you're a VC associate, an SDR at a B2B startup, a financial journalist, or an analyst who needs to know which companies just raised money — and you don't already have a six-figure data budget — you've probably spent more time than you'd like to admit cobbling together a funding-round dashboard from RSS feeds and SEC searches.
The annoying part is that most of the underlying data is public. Every U.S. company raising money under Reg D files Form D with the SEC within 15 days of the first sale. TechCrunch, Axios Pro Rata, and Strictly VC publish funding-round summaries daily. Y Combinator publishes its full alumni roster. Stripe Atlas forms a fresh batch of Delaware C-corps every week.
Stitched together properly, those four sources will give you 80% of the coverage Crunchbase has — the day each round happens — for roughly $5 a month in compute costs.
This post walks through the architecture I built for a multi-strategy crossover fund's sourcing team. It started as an internal tool and has since been productized as the NexGenData Startup Funding Tracker actor. Whether you build it yourself or use the actor, the architecture is the same.
What Public Funding Data Actually Looks Like
The SEC's Form D filing is the single most underrated funding-tracking source. Any U.S. company raising money under Regulation D — which is virtually every priced equity round, from pre-seed safes to Series F — has to file within 15 days. The filing includes the issuer name, address, executive officers, total amount raised, total number of investors, and minimum investment per investor.
What it doesn't include: the round name (you have to infer Seed vs Series A from amount + investor count), the lead investor (Form D lists "Related Persons" but not lead status), or the post-money valuation (Reg D filings don't disclose valuation). For everything Form D doesn't give you, TechCrunch's funding-round articles fill in. For everything still missing after that, Y Combinator's alumni database is the canonical source for YC-batch and demo-day companies.
The clever piece is the join key. Form D filings publish in EDGAR with the issuer's exact legal name. TechCrunch articles use the company's marketing name. YC's database uses the brand name. The same company shows up as "FlexPort, Inc." in Form D, "Flexport" in TechCrunch, and "Flexport" in YC. A fuzzy-match join (Levenshtein distance + domain TLD normalization) reconciles them at >95% precision.
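As a sketch of that reconciliation (the helper names here are my own illustration, not part of the pipeline below), a stdlib-only edit distance plus a naive domain normalizer is enough to demonstrate the match:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def root_domain(url_or_domain: str) -> str:
    """Strip scheme, 'www.', and path; keep 'example.com'."""
    d = url_or_domain.lower().removeprefix("https://").removeprefix("http://")
    return d.removeprefix("www.").split("/")[0]

def same_company(name_a: str, name_b: str, dom_a: str = "", dom_b: str = "") -> bool:
    """Domain match is ground truth; fall back to fuzzy name match."""
    if dom_a and dom_b:
        return root_domain(dom_a) == root_domain(dom_b)
    return levenshtein(name_a.lower(), name_b.lower()) <= 2
```

Note that "FlexPort, Inc." vs "Flexport" only matches on the domain leg; the name leg needs the suffix-stripping normalizer covered in the joining-layer section.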
Source 1: SEC EDGAR Form D Filings
EDGAR's Form D feed is at https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&type=D&dateb=&owner=include&count=40. The HTML index page is paginated, but the underlying data is available as XML.
```python
import httpx
from datetime import date, timedelta

async def fetch_recent_form_d(days_back: int = 1) -> list[dict]:
    """Pull Form D filings from the last N days via EDGAR full-text search."""
    today = date.today()
    start = today - timedelta(days=days_back)
    url = "https://efts.sec.gov/LATEST/search-index"
    params = {
        "q": "",
        "dateRange": "custom",
        "startdt": start.isoformat(),
        "enddt": today.isoformat(),
        "forms": "D",
    }
    # SEC requires a descriptive User-Agent with contact info
    headers = {"User-Agent": "your-name your-email@domain.com"}
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        r = await client.get(url, params=params)
        r.raise_for_status()
        return r.json().get("hits", {}).get("hits", [])
```
A few critical notes. SEC EDGAR enforces a strict 10-requests-per-second rate limit and requires a User-Agent header that includes your name and email. They will rate-limit and eventually IP-ban abusive scrapers — be polite. The full-text search index lags the actual filing by 30-60 minutes; if you need real-time, use the daily form index at https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent.
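One simple way to stay safely under that limit is to serialize requests through a minimum-interval gate. This is a sketch (the `PoliteLimiter` class and its 8 req/s default are my own choices, leaving headroom below the SEC's 10 req/s ceiling):

```python
import asyncio
import time

class PoliteLimiter:
    """Caps request rate by enforcing a minimum interval between calls."""

    def __init__(self, max_per_second: float = 8.0):
        self.min_interval = 1.0 / max_per_second
        self._last = 0.0
        self._lock = asyncio.Lock()  # serializes concurrent callers

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = self._last + self.min_interval - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()
```

Call `await limiter.wait()` immediately before each `client.get(...)`; because the lock serializes waiters, this stays correct even when many fetch tasks run concurrently.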
Each filing returns a CIK (Central Index Key) — that's the SEC's unique company identifier. From the CIK you can pull the actual XML filing:
```python
async def fetch_filing_xml(cik: str, accession: str) -> str:
    """Fetch the primary XML document for a filing by CIK + accession number."""
    accession_clean = accession.replace("-", "")
    url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_clean}/primary_doc.xml"
    headers = {"User-Agent": "your-name your-email@domain.com"}  # required by SEC
    async with httpx.AsyncClient(headers=headers, timeout=30) as client:
        r = await client.get(url)
        r.raise_for_status()
        return r.text
```
The XML includes <offeringData> with totalOfferingAmount, totalAmountSold, totalNumberAlreadyInvested, and the related-persons list. Parse with lxml or xmltodict.
Source 2: TechCrunch Funding Articles
TechCrunch's funding tag (https://techcrunch.com/category/fundings-exits/) publishes 8-15 articles per business day. Each article reliably includes the company name, round size, lead investor, and (sometimes) valuation. The headline pattern is consistent enough that a lightweight regex catches 90% of mentions:
```python
import re

ROUND_PATTERN = re.compile(
    r"^(?P<company>[A-Z][A-Za-z0-9\s\-\.\']+?)\s+"
    r"(?:raises|secures|closes|nabs|lands|gets)\s+"
    r"\$(?P<amount>[\d\.]+[MB]?)"
    # round clause is optional so bare "Acme raises $50M" headlines still match
    r"(?:\s+(?:in\s+)?(?P<round>Series\s+[A-Z]|seed|pre-seed))?",
    re.IGNORECASE,
)

def parse_techcrunch_headline(title: str) -> dict | None:
    """Extract company, amount, and round name from a funding headline."""
    m = ROUND_PATTERN.search(title)
    if not m:
        return None
    return {
        "company": m.group("company").strip(),
        "amount_raw": m.group("amount"),
        "round": (m.group("round") or "unknown").strip(),
    }
```
For richer extraction (lead investor, valuation, board seats), parse the article body. The lead investor is almost always in the first paragraph after the headline; valuation, when disclosed, is in the second or third.
Source 3: Y Combinator Alumni Database
YC publishes its full alumni list at https://www.ycombinator.com/companies. The page is client-rendered, but the underlying Algolia search index is queryable directly with no auth:
```python
async def search_yc_companies(query: str = "") -> list[dict]:
    """Query YC's public Algolia index directly."""
    url = "https://45bwzj1sgc-dsn.algolia.net/1/indexes/YCCompany_production/query"
    payload = {"query": query, "hitsPerPage": 1000}
    headers = {
        "X-Algolia-Application-Id": "45BWZJ1SGC",
        "X-Algolia-API-Key": "Y2VkOWQyMTJlYjZkZjE3MDRkY2YyNjBmYmIzMjVhMzA1ZmRlYTQ4OTUyZjEyZjRiNzc0OWQ4MjRmMzVlYmUxN3RhZ0ZpbHRlcnM9JTViJTIyJTVEJmZpbHRlcnM9aXNIaXJpbmclM0F0cnVl",
    }
    async with httpx.AsyncClient(headers=headers) as client:
        r = await client.post(url, json=payload)
        r.raise_for_status()
        return r.json().get("hits", [])
```
(The API key shown is the public read-only key embedded in YC's frontend bundle — it's safe to use, but rotate periodically as YC occasionally re-keys it.)
YC hits include name, slug, batch, industry, team_size, description, website, and regions. Cross-reference with Form D and TechCrunch on company name (Levenshtein <= 2) plus website-domain match.
Source 4: Delaware Corporation Filings
This is the most underrated source for very-early-stage tracking. Every Delaware C-corp incorporates through the Division of Corporations. Stripe Atlas, the most common formation service for tech startups, batches roughly 200-400 incorporations per week. The DOC search at https://icis.corp.delaware.gov/Ecorp/EntitySearch/NameSearch.aspx returns entity name, file number, formation date, and registered agent.
A new Stripe-Atlas-formed C-corp with a brand-able name and a ".com" domain reserved within 24 hours of formation is, statistically, a future YC application. Sourcing teams who flag these patterns get a 2-3 week jump on Crunchbase.
The Joining Layer
Once you've ingested all four sources into a normalized schema (company_name, legal_name, domain, round_amount, round_type, source, source_url, event_date), the joining problem becomes the harder problem.
Naive name matching fails on edge cases like "Stripe, Inc." vs "Stripe" vs "Stripe Inc" vs "stripe.com." The reliable approach: build a normalized join key from (slugify(legal_name), root_domain). If you have a domain in any source, root-domain match is your ground truth. If you don't (Form D doesn't publish domains), fall back on slugified name + executive-officer name match.
```python
import re

def normalize_company_name(name: str) -> str:
    """Strip Inc/LLC/Corp suffixes (with or without punctuation), lowercase, slugify."""
    # handles "Stripe, Inc.", "Stripe Inc.", and "Stripe Inc" alike
    name = re.sub(r"[,\s]+(inc|llc|corp|ltd|co)\.?$", "", name.strip(),
                  flags=re.IGNORECASE)
    slug = "".join(c.lower() if c.isalnum() else "-" for c in name)
    return re.sub(r"-+", "-", slug).strip("-")
```
Putting It All Together
The full pipeline runs as a daily cron at 3am ET (after Form D's nightly batch posts):
- Fetch Form D filings since yesterday → ~80-150 records/day
- Scrape TechCrunch funding articles since yesterday → ~10-15 records/day
- Refresh YC active-batch alumni → only changes with new batches
- Fetch new Delaware C-corps since yesterday → ~50-100 records/day
- Normalize, dedupe by (slug, domain), join → ~150-200 unique companies/day
- Push to a daily-digest output
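The dedupe step above can be sketched as a key-merge (a simplified version; it assumes each record already carries normalized `slug` and `domain` fields from the joining layer, plus `source` and `round_amount`):

```python
def dedupe_records(records: list[dict]) -> list[dict]:
    """Collapse records sharing a (slug, domain) join key; merge their sources."""
    merged: dict[tuple[str, str], dict] = {}
    for rec in records:
        key = (rec.get("slug", ""), rec.get("domain", ""))
        if key in merged:
            existing = merged[key]
            existing["sources"] = sorted(set(existing["sources"]) | {rec["source"]})
            # prefer a known round amount over a missing one
            existing["round_amount"] = existing["round_amount"] or rec.get("round_amount")
        else:
            merged[key] = {**rec, "sources": [rec["source"]]}
    return list(merged.values())
```

A Form D record and a TechCrunch record for the same company collapse into one row carrying both source attributions, which is what the daily digest wants.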
That's a Series A/B/seed dashboard updated daily, costing roughly $4-6/month in compute, with coverage that overlaps Crunchbase's daily updates by ~85%.
The Faster Way: Use the Actor
If you'd rather not build this yourself, the NexGenData Startup Funding Tracker actor does exactly this pipeline, charging $0.01 per Form D / TechCrunch / YC record. A typical daily run returns 150-200 records for $1.50-$2.00.
The actor accepts filters for date range, minimum round amount, industry tags, and source mix. Output schema matches the join schema described above.
Feel free to clone the architecture or just point your data pipeline at the actor — either way, you've got a Crunchbase-tier funding pipeline at <1% of the cost.
If you find this useful, NexGenData has 195+ public actors covering similar buyer-intent domains: lead generation, SEC filings, YC alumni tracking, and dozens more. Browse the full catalog.