DEV Community

Devil Scrapes

SEC Form D Scraper: turn EDGAR capital-raise filings into B2B leads

Quick answer: Every U.S. company raising private capital under Regulation D must file Form D with the SEC within 15 days of the first sale. That filing is public record on SEC EDGAR and contains the issuer's legal name, postal address, phone number, industry group, offering amount, investor count, and named officers. An SEC Form D scraper queries the EDGAR Full-Text Search feed, fetches each filing's XML document, and returns every field as a structured, typed row. The Apify Actor below does this for $0.005 per row (~$5.05 per 1,000), with EDGAR rate-limit compliance, exponential backoff, and Pydantic-validated output handled for you.

There is a real-time feed of "this company just raised money" announcements that costs nothing to access, filed with the U.S. government and updated daily. The problem is extracting it into a spreadsheet, CRM, or pipeline without writing XML parsers, managing EDGAR rate limits, or wiring retries around an undocumented endpoint that occasionally goes quiet. That is the whole job. Here is what it takes.

What is SEC EDGAR Form D? 🗂️

The U.S. Securities and Exchange Commission requires any company raising private capital under Regulation D (the federal exemption that covers most Series A-and-earlier U.S. fundraises) to file Form D within 15 days of the first sale to an investor. The filing is public record on EDGAR, indexed by the EDGAR Full-Text Search system, and available as raw XML.

What one filing gives you: the issuer's legal name and entity type; full postal address and phone number; industry group; offering amount in USD (or "Indefinite" for evergreen funds); amount sold and remaining; investor count; named officers, directors, and promoters with city and state; Reg D exemption codes; date of first sale.

Crunchbase, PitchBook, and CB Insights derive their primary private-market funding signal from Form D and re-sell it at $299–$999/month on annual contracts. The underlying EDGAR data is U.S. government public domain, with no licensing restrictions on redistribution or commercial use.

Does SEC EDGAR have a Form D API? 🔌

Not a documented one. EDGAR exposes a Full-Text Search endpoint at https://efts.sec.gov/LATEST/search-index?forms=D that the website uses internally; it returns paginated JSON hits with filing metadata. For the structured content of each filing (offering amounts, addresses, officers) you fetch and parse the individual XML document at https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/primary_doc.xml. Neither endpoint is officially documented as a developer API; both are public and have been stable since at least 2021. Using them requires a valid User-Agent header per EDGAR Fair Access Policy §2.4; requests without one may return 403.
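As a rough sketch of what one search request looks like, the helper below only builds the URL and headers and sends nothing. The `forms=D` parameter and the User-Agent requirement come from the paragraph above. The `dateRange`, `startdt`, `enddt`, and `from` parameter names, the function name, and the contact string are assumptions for illustration; because the endpoint is undocumented, check them against live traffic before relying on them.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://efts.sec.gov/LATEST/search-index"


def build_search_request(start_date, end_date, offset=0,
                         contact="Example Corp admin@example.com"):
    """Return (url, headers) for one page of Form D search hits.

    Only forms=D is taken from the article. The other parameter
    names are assumptions about an undocumented endpoint.
    """
    params = {
        "forms": "D",
        "dateRange": "custom",  # assumed parameter
        "startdt": start_date,  # assumed parameter, ISO YYYY-MM-DD
        "enddt": end_date,      # assumed parameter, ISO YYYY-MM-DD
        "from": offset,         # assumed pagination offset
    }
    # EDGAR expects an identifying User-Agent on every request.
    headers = {"User-Agent": contact}
    return f"{SEARCH_URL}?{urlencode(params)}", headers


url, headers = build_search_request("2026-04-16", "2026-05-16")
print(url)
```

Keeping URL construction separate from the HTTP call makes the pagination logic easy to test without touching EDGAR.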

What the data looks like

Each qualifying Form D filing comes back as one flat, typed row with a nested related_persons list. A real verified record:

{
  "accession_number": "0002131570-26-000001",
  "cik": "0002131570",
  "entity_name": "One of One Ventures Technology, Inc.",
  "entity_type": "Corporation",
  "jurisdiction_of_incorporation": "DELAWARE",
  "year_of_incorporation": 2026,
  "issuer_street_1": "5582 57TH DRIVE",
  "issuer_street_2": null,
  "issuer_city": "MASPETH",
  "issuer_state_or_country": "NY",
  "issuer_zip_code": "11378",
  "issuer_phone_number": "9178874961",
  "industry_group_type": "Other Technology",
  "total_offering_amount_usd": 4000000.0,
  "total_amount_sold_usd": 0.0,
  "total_remaining_usd": 4000000.0,
  "is_indefinite_amount": false,
  "total_number_already_invested": 0,
  "minimum_investment_usd": 10000.0,
  "exemption_claimed": ["06B"],
  "is_new_notice": true,
  "date_of_first_sale": null,
  "related_persons": [
    {
      "first_name": "Ace",
      "last_name": "Watanasuparp",
      "city": "Maspeth",
      "state_or_country": "NY",
      "relationships": ["Executive Officer"]
    }
  ],
  "filing_date": "2026-05-04",
  "filing_url": "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0002131570&type=D",
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

The sample above has 25 top-level fields plus a nested list of officers, the same shape every time, validated by Pydantic v2 before it hits your dataset. It drops straight into Pandas, BigQuery, a CRM enrichment pipeline, or a Clay table with no post-processing gymnastics.
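If your CRM wants one row per contact rather than one row per filing, the nested related_persons list has to be flattened. The sketch below is one way to do that with the standard library only; `flatten_officers` is a hypothetical helper, not part of the Actor, and the record is a trimmed copy of the sample above.

```python
def flatten_officers(record):
    """Yield one flat dict per related person, carrying issuer context."""
    for person in record.get("related_persons", []):
        yield {
            "accession_number": record["accession_number"],
            "entity_name": record["entity_name"],
            "issuer_phone_number": record.get("issuer_phone_number"),
            "full_name": f'{person["first_name"]} {person["last_name"]}',
            "city": person.get("city"),
            "state_or_country": person.get("state_or_country"),
            # Join multiple roles into one cell for CSV-style targets.
            "relationships": "; ".join(person.get("relationships", [])),
        }


record = {
    "accession_number": "0002131570-26-000001",
    "entity_name": "One of One Ventures Technology, Inc.",
    "issuer_phone_number": "9178874961",
    "related_persons": [
        {
            "first_name": "Ace",
            "last_name": "Watanasuparp",
            "city": "Maspeth",
            "state_or_country": "NY",
            "relationships": ["Executive Officer"],
        }
    ],
}

rows = list(flatten_officers(record))
print(rows[0]["full_name"], "|", rows[0]["relationships"])
```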

The naive approach (and why it falls apart) ⚠️

The first thing everyone tries:

  1. Open the EDGAR Full-Text Search at efts.sec.gov
  2. Page through the JSON hits, grab adsh + cik
  3. Fetch the XML for each filing, parse it with xml.etree.ElementTree
  4. Done

It works for the first 20 filings on a quiet Tuesday. Then it falls apart in exactly the ways government infrastructure tends to:

1. The XML URL has a trap. EDGAR's archive URL takes the integer CIK, with no leading zeros. Pass the zero-padded 10-character CIK that the search API returns ("0002131570") and you get a 404. You strip leading zeros for the path but keep the padded form for the output field. Nothing in the EDGAR docs spells this out; you discover it by staring at 404s.

2. The XML schema has edge cases that break naive parsers. totalOfferingAmount is sometimes the integer string "2000000" and sometimes the literal "Indefinite" (common on evergreen funds and real-estate SPVs). yearOfInc has either a <value>YYYY</value> child or an <overFiveYears>true</overFiveYears> element with no value, so the parser must check element existence, not just read .text. dateOfFirstSale has either a <value> child (ISO date) or a <yetToOccur>true</yetToOccur> child; misread it and you write None where you should write "2026-05-04". We handle every branch and surface mismatches loudly rather than silently writing bad data.

3. The EDGAR rate limit is real. The EDGAR Fair Access Policy caps requests at 10/second/IP and requires a valid User-Agent header. We set one on every request and enforce a 0.1 s inter-fetch sleep, keeping the run well under the limit even at the 5,000-row cap. We retry with exponential backoff (base 2 s, doubling, capped at 30 s, max 5 attempts) on 429 and 503 responses and honour Retry-After headers when present. When a filing returns a hard 403 or 404 we log a warning with the accession number and move on rather than failing the whole run.

4. Scale math. A 5,000-row run requires 5,000 XML fetches: roughly 8–9 minutes of wall-clock time at the EDGAR-safe pace. On Apify it runs as a background job; come back when it is done.
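The XML branches from point 2 can be sketched roughly as follows. The element names (totalOfferingAmount, yearOfInc, overFiveYears, dateOfFirstSale, yetToOccur) come from the text above; the function names, the return conventions, and the hand-written XML fragment are assumptions for illustration and are not a copy of a real filing or of the Actor's code.

```python
import xml.etree.ElementTree as ET


def parse_offering_amount(elem):
    """Return (amount_usd, is_indefinite) for a totalOfferingAmount element."""
    text = (elem.text or "").strip()
    if text == "Indefinite":
        return None, True
    return float(text), False


def parse_year_of_inc(elem):
    """Return the year as int, or None when only overFiveYears is present."""
    value = elem.find("value")
    # Compare against None: an Element with no children is falsy.
    if value is not None and value.text:
        return int(value.text)
    if elem.find("overFiveYears") is not None:
        return None
    raise ValueError("yearOfInc has neither <value> nor <overFiveYears>")


def parse_date_of_first_sale(elem):
    """Return the ISO date string, or None when the sale is yet to occur."""
    value = elem.find("value")
    if value is not None and value.text:
        return value.text.strip()
    if elem.find("yetToOccur") is not None:
        return None
    raise ValueError("dateOfFirstSale has neither <value> nor <yetToOccur>")


# Simplified, hand-written fragment covering one branch of each field.
SAMPLE = """
<filing>
  <totalOfferingAmount>Indefinite</totalOfferingAmount>
  <yearOfInc><overFiveYears>true</overFiveYears></yearOfInc>
  <dateOfFirstSale><value>2026-05-04</value></dateOfFirstSale>
</filing>
""".strip()

root = ET.fromstring(SAMPLE)
print(parse_offering_amount(root.find("totalOfferingAmount")))
print(parse_year_of_inc(root.find("yearOfInc")))
print(parse_date_of_first_sale(root.find("dateOfFirstSale")))
```

Raising on an unrecognised shape, rather than returning None, is what keeps a schema surprise from turning into a silent null.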

We rotate the browser TLS fingerprint via curl-cffi Chrome 131 impersonation so the connection looks like a browser session, staying stable against future EDGAR hardening. We thread Apify residential proxies when you toggle useProxy=true. And we return Pydantic-validated typed rows: no partial objects, no silent nulls.
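The retry policy from point 3 (base 2 s, doubling, capped at 30 s, max 5 attempts, Retry-After honoured) can be sketched as below. This is an illustration under assumptions, not the Actor's code: `fetch` is a hypothetical callable returning `(status, retry_after, body)`, and Retry-After is treated as a number of seconds even though HTTP also allows a date there.

```python
import time

BASE_DELAY_S = 2.0
MAX_DELAY_S = 30.0
MAX_ATTEMPTS = 5
RETRYABLE = {429, 503}


def backoff_delay(attempt, retry_after=None):
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)  # server hint wins over our schedule
    return min(MAX_DELAY_S, BASE_DELAY_S * 2 ** attempt)


def fetch_with_retry(fetch, sleep=time.sleep):
    """Call fetch() until the status is not retryable or attempts run out."""
    for attempt in range(MAX_ATTEMPTS):
        status, retry_after, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < MAX_ATTEMPTS - 1:
            sleep(backoff_delay(attempt, retry_after))
    return status, body  # still failing after the last attempt


# Simulated responses: throttled, unavailable with a hint, then success.
responses = iter([(429, None, None), (503, "7", None), (200, None, "<xml/>")])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)  # (200, '<xml/>') [2.0, 7.0]
```

Injecting `sleep` keeps the schedule testable without real waiting. A 403 or 404 falls through as a non-retryable status, so the caller can log it and move on.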

The Actor

I packaged this as an Apify Actor: SEC Form D Leads Scraper.

Paste your filters into the Apify Console and click Start, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/sec-form-d-leads").call(
    run_input={
        "stateFilter": "CA",
        "minOfferingAmountUsd": 1000000,
        "maxResults": 500,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["entity_name"], item["total_offering_amount_usd"], item["issuer_phone_number"])

All inputs are optional. Key fields: query (free-text EDGAR search), startDate / endDate (ISO YYYY-MM-DD, default: last 30 days → today), stateFilter (two-letter U.S. state code), minOfferingAmountUsd (discard below this USD amount; indefinite filings always pass), maxResults (1–5,000, default 100), includeAmendments, useProxy. Date validation fires before any network call, so bad input exits non-zero immediately.
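To make the input semantics concrete, here is a minimal sketch of the date defaults and the minimum-amount rule as described above. Both function names are hypothetical, and the exact behaviour when only one of the two dates is supplied is an assumption; the Actor's own validation may differ.

```python
from datetime import date, timedelta


def resolve_date_range(start_date=None, end_date=None, today=None):
    """Validate ISO dates up front; default to the last 30 days through today."""
    today = today or date.today()
    # date.fromisoformat raises ValueError on malformed input,
    # so bad dates fail before any network call is made.
    end = date.fromisoformat(end_date) if end_date else today
    start = (date.fromisoformat(start_date) if start_date
             else today - timedelta(days=30))
    if start > end:
        raise ValueError(f"startDate {start} is after endDate {end}")
    return start, end


def passes_amount_filter(row, min_offering_amount_usd=None):
    """Indefinite filings always pass; others must meet the minimum."""
    if min_offering_amount_usd is None or row["is_indefinite_amount"]:
        return True
    return row["total_offering_amount_usd"] >= min_offering_amount_usd


print(resolve_date_range(today=date(2026, 5, 16)))
print(passes_amount_filter(
    {"is_indefinite_amount": False, "total_offering_amount_usd": 4000000.0},
    min_offering_amount_usd=1000000,
))
```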

Use cases 💡

B2B sales prospecting. Pull every Form D filed in the last 30 days filtered to your target state and minimum raise size (stateFilter=CA, minOfferingAmountUsd=1000000). You get the company name, address, phone, and named officers: the trigger-event list outbound SaaS and professional-services teams pay ZoomInfo and Crunchbase for, sourced directly from the SEC.

VC deal-flow tracking. Filter by query="biotechnology" and a date range covering the last quarter to surface every early-stage biotech Reg D raise in your geography, including sub-threshold raises that never make TechCrunch. Form D is the only place pre-seed raises are systematically filed as public record.

Journalism and capital-flow research. Export a multi-month window of Form D filings by state and industry group to measure where private capital is moving, the kind of dataset NICAR reporters use for capital-concentration investigations.

Compliance and KYC. Verify a counterparty's Reg D filing history before signing a service agreement. The accession_number and filing_url fields link straight to the canonical EDGAR page for audit trails.

Pricing: exact numbers 💰

Pay-per-event. You are charged for rows that pass every filter and land in the dataset: no data, no charge beyond the small actor-start warm-up fee.

Event        Price
actor-start  $0.05 per run
result-row   $0.005 per row

Pull                      Rows   Cost
7-day scan                ~50    $0.30
California ≥$1M, 30 days  ~200   $1.05
AI keyword, 90 days       ~500   $2.55
1,000 filings             1,000  $5.05
Maximum cap               5,000  $25.05

For comparison: Crunchbase Pro starts at $299/month, PitchBook at $30,000+/year, and both derive their private-market funding data from EDGAR Form D as a primary source. Apify's $5 free trial credit covers roughly your first 990 rows in a single run (less if you split it across several runs, each of which pays the $0.05 start fee), no credit card required.
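The pricing above reduces to one line of arithmetic: a fixed $0.05 per run plus $0.005 per row. A small sketch for budgeting a pull before you start it follows; the function names are illustrative, and the two prices are the ones quoted in this section.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.05")
RESULT_ROW_USD = Decimal("0.005")


def run_cost(rows):
    """Total charge in USD for a single run that returns `rows` rows."""
    total = ACTOR_START_USD + RESULT_ROW_USD * rows
    return total.quantize(Decimal("0.01"))


def rows_for_budget(budget_usd):
    """Largest row count a single run can return within the budget."""
    remaining = Decimal(str(budget_usd)) - ACTOR_START_USD
    if remaining < 0:
        return 0
    return int(remaining // RESULT_ROW_USD)


for rows in (50, 200, 500, 1000, 5000):
    print(rows, run_cost(rows))

print(rows_for_budget(5))  # rows one run can return on $5
```

Decimal avoids the rounding noise binary floats would introduce when multiplying half-cent prices.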

The technically interesting bit

EDGAR's archive URLs have a CIK trap that has broken multiple open-source scrapers.

The Full-Text Search API returns CIKs zero-padded to 10 characters: "0002131570". But the XML archive URL requires the integer form: 2131570, no padding. The zero-padded form returns a 404, and nothing in the EDGAR documentation flags it. The fix is int(ciks[0]) for the path while keeping the padded form in the output row. Verified against four real filings. The Actor also handles the "Indefinite" literal, the overFiveYears / withinFiveYears year-of-incorporation fork, and the yetToOccur first-sale date, all edge cases a naive ElementTree parse silently mishandles.
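A minimal sketch of that fix, using the archive URL template quoted earlier in this post. The function name is hypothetical. Stripping the dashes from the accession number for the folder segment is not stated in this article; it reflects how EDGAR archive paths are laid out as far as I know, so treat it as an assumption and confirm it against a live filing.

```python
ARCHIVE_TEMPLATE = (
    "https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/primary_doc.xml"
)


def archive_xml_url(padded_cik, accession_number):
    """Build the primary_doc.xml URL from search-hit identifiers."""
    # int() drops the leading zeros the archive path rejects.
    cik_path = str(int(padded_cik))
    # Assumption: the folder name is the accession number without dashes.
    accession_path = accession_number.replace("-", "")
    return ARCHIVE_TEMPLATE.format(cik=cik_path, accession=accession_path)


padded_cik = "0002131570"  # keep this form for the output row
print(archive_xml_url(padded_cik, "0002131570-26-000001"))
```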

Limitations 🚧

  • Form D and Form D/A only. Forms 10-K, 10-Q, 8-K, 4, S-1 are out of scope.
  • No email address extraction. Form D contains no emails. Contact data is limited to issuer phone number and officers' city + state.
  • No sales-compensation data. Broker and finder-fee fields exist in the XML but are not extracted.
  • No cross-run deduplication. Overlapping date ranges across runs may surface the same filing twice.
  • ~9 minutes for a 5,000-row run. The 0.1 s inter-fetch sleep keeps requests under EDGAR's 10-per-second cap; plan accordingly.
  • 7-day default dataset retention on Apify's free plan. Export immediately or use a named dataset.
  • Public EDGAR only. No EDGAR Online, no premium SEC data products.
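Because the Actor does not deduplicate across runs, overlapping date ranges need a filter on your side. One possible approach is to key on accession_number, which identifies a filing. In the sketch below, `new_rows` is a hypothetical helper and the in-memory set stands in for whatever store you persist between runs.

```python
def new_rows(rows, seen_accessions):
    """Yield rows not seen before and record their accession numbers."""
    for row in rows:
        key = row["accession_number"]
        if key in seen_accessions:
            continue  # already delivered by an earlier run
        seen_accessions.add(key)
        yield row


seen = set()
run_1 = [{"accession_number": "0002131570-26-000001"}]
run_2 = [
    {"accession_number": "0002131570-26-000001"},  # overlap with run 1
    {"accession_number": "0009999999-26-000042"},  # made-up second filing
]

first = list(new_rows(run_1, seen))
second = list(new_rows(run_2, seen))
print(len(first), len(second))  # 1 1
```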

FAQ ❓

Is scraping SEC EDGAR Form D legal?
Form D data is U.S. government public domain, with no copyright restrictions on redistribution or commercial use. This Actor reads only what the public EDGAR Full-Text Search and XML archive expose, paces requests per the EDGAR Fair Access Policy, and collects no personal data beyond what issuers voluntarily disclose on the public filing. Consult your own legal counsel for jurisdiction-specific questions.

Can I export to Google Sheets, a warehouse, or a CRM?
Yes. Export CSV, Excel, JSON, or XML from the Apify Console, webhook into Make/Zapier/n8n on ACTOR.RUN.SUCCEEDED, or pull via the Apify dataset API: GET /v2/datasets/{id}/items?format=csv&clean=true.

Is there an official SEC EDGAR Form D API?
No documented public API as of 2026. The EDGAR Full-Text Search endpoint and XML archive are public and stable but undocumented as developer APIs. This Actor wraps both with rate-limit compliance, retries, and XML parsing.

What is the difference between Form D and Form D/A?
Form D is the initial Reg D notice; Form D/A is an amendment. By default the Actor returns only new D notices (is_new_notice=true). Set includeAmendments=true to include amendments with is_new_notice=false.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/sec-form-d-leads.

Free $5 trial credit, no credit card. Run a 30-day California pull (stateFilter=CA, maxResults=100) for a fresh B2B lead list in under a minute. Have a use case I missed, or a field the XML contains that you need extracted? Drop it in the comments; I ship based on what people actually use.


Built by Devil Scrapes: Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
