Devil Scrapes

Posted on May 31

Equity Crowdfunding Leads: scrape 4,800+ Wefunder founders for $5/1K

#webscraping #python #apify #data

Quick answer: There is no unified API for Wefunder, Republic, or StartEngine. An equity crowdfunding leads scraper collects currently-raising and recently-funded campaign data — founder names, company taglines, raise progress, pre-money valuations — from all three platforms and returns them as one normalized dataset. The Apify Actor below does it for $0.005 per row (~$5.05 per 1,000), with the TLS fingerprinting, proxy rotation, and per-source parsing handled for you.

Wefunder alone lists 4,800+ currently-raising companies — founder names, taglines, raise totals, and pre-money valuations in one JSON payload. Republic has a trending carousel. StartEngine has an XML sitemap of 98 offering slugs. None has a download button; none shares a schema.

If you're a VC scout, an SDR targeting founders, or an analyst tracking what's raising in climate versus fintech, you're opening three browser tabs and copy-pasting. Here's what it takes to do that programmatically — and how I compressed it to one API call.

What is equity crowdfunding? 🔎

Equity crowdfunding under Regulation CF lets any US startup raise up to $5 million per year from the general public — not just accredited investors. The three dominant platforms are Wefunder (largest by volume), Republic (curated campaigns), and StartEngine (heavy on CPG and consumer brands).

Regulation CF requires every issuer to file a Form C with the SEC before opening a round, so every active campaign has a company name, founding team, financial disclosures, and valuation on public record. That's the dataset: comprehensive, legally disclosed, and, until this Actor, only accessible by visiting three separate sites with three separate UX patterns.

Does Wefunder have an API? 📡

No public API. As of 2026, none of Wefunder, Republic, or StartEngine publishes an official data API or bulk export. Wefunder's SPA calls an internal JSON endpoint (/-/companies/explore) returning full campaign payloads — but it's undocumented, inspects your TLS fingerprint, and sits behind Cloudflare. Republic's backend GraphQL at api.republic.com rejects unauthenticated POSTs from datacenter IPs. StartEngine's offering detail pages require clearing a JavaScript-gated challenge first.

This is exactly why a hosted Actor earns its keep over a three-line requests snippet.

What the data looks like

Each row is a flat, typed record. A real one — RISE Robotics on Wefunder as of 2026-05-16:

{
  "source": "wefunder",
  "campaign_slug": "riserobotics",
  "company_name": "RISE Robotics",
  "tagline": "Electrifying heavy machines",
  "industry": null,
  "location": "MA",
  "founders": ["Hiten Sonpal"],
  "website_url": null,
  "target_amount_usd": null,
  "raised_amount_usd": 17448682.0,
  "num_investors": 417,
  "valuation_usd": 62100000.0,
  "revenue_usd": null,
  "funding_stage": "raising",
  "campaign_url": "https://wefunder.com/riserobotics",
  "scraped_at": "2026-05-16T13:40:00.000Z"
}

Sixteen fields, Pydantic-validated before they hit your dataset. valuation_usd comes from Wefunder's terms.nb shorthand ("$62.1M"), parsed into a float automatically. Republic and StartEngine rows land with the same shape; monetary fields are null there because that data is client-rendered (v2 plan — more below).

The naive approach (and why it falls apart) 🔧

The obvious move: open DevTools, find the XHR, replay it with requests.get(). It breaks fast, for a different reason on each platform.

Wefunder. The /-/companies/explore endpoint checks your TLS ClientHello fingerprint before it answers. Python's stdlib ssl and httpx look nothing like a real browser — the JA3/JA4 fingerprint reads as a script, and you hit a Cloudflare challenge before the JSON loads. We run curl-cffi with impersonate="chrome131", which replays the full Chrome 131 TLS handshake, ALPN extension order, and HTTP/2 SETTINGS frame, so at the TLS layer the connection is a browser.

Republic. The republic.com/companies page is SPA-rendered; the SSR shell carries only a ~10-item carousel of trending campaign links, and the backend GraphQL at api.republic.com rejects unauthenticated POSTs from datacenter IPs. We thread Apify residential proxies on every request so the connection arrives from a residential exit.

StartEngine. Their explore page is fully client-rendered. sitemap-private-offerings.xml carries the active slug list (98 entries as of 2026-05-16) — the only unauthenticated surface; detail pages return a bot-challenge body to non-browser clients. v1 emits slug + company name from the sitemap; Camoufox full-render is planned for v2.
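
The sitemap step can be sketched with the standard library alone. This is an illustration rather than the Actor's source: the sample XML and its example.com URLs are invented, and treating the last path segment of each <loc> as the slug is an assumption about the real file's layout.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def slugs_from_sitemap(xml_text: str) -> list[str]:
    """Return the last path segment of every <loc> URL in a sitemap."""
    root = ET.fromstring(xml_text)
    slugs = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip().rstrip("/")
        if url:
            slugs.append(url.rsplit("/", 1)[-1])
    return slugs


# Invented sample; the real sitemap's URL layout may differ.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/offering/example-co</loc></url>
  <url><loc>https://example.com/offering/another-co/</loc></url>
</urlset>"""

slugs = slugs_from_sitemap(SAMPLE)
```

Deriving a company name from a slug is a separate, lossier step, which is part of why these rows stay sparse until pages are rendered in full.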

We retry with exponential backoff (base 2 s, doubling, capped at 30 s, max 5 attempts) and honour Retry-After. On 429 or 503 we rotate the proxy session ID — fresh exit IP, fresh cookie jar. Partial success surfaces as an explicit status message; we never return an empty dataset under a green status. One source failing does not kill the run; all three failing exits non-zero with a clear error.
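
As a rough sketch of that retry policy (not the Actor's actual code), the delay schedule fits in one small function. The function names are hypothetical, only the delta-seconds form of Retry-After is handled, and letting Retry-After override the 30 s cap is an assumption.

```python
from typing import Optional

BASE_DELAY_S = 2.0   # first wait
MAX_DELAY_S = 30.0   # cap on the computed backoff
MAX_ATTEMPTS = 5


def backoff_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour a numeric Retry-After header as-is (assumption: uncapped).
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)


def should_rotate_session(status_code: int) -> bool:
    """429 and 503 trigger a fresh proxy session (new exit IP, new cookies)."""
    return status_code in (429, 503)


schedule = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
```

The computed schedule doubles from 2 s and flattens once it would exceed 30 s; with five attempts only the first four delays are ever slept.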

The Actor 🛠️

Equity Crowdfunding Leads on the Apify Store.

Open it in the Apify Console and click Start, or call it with the apify-client Python SDK:

from apify_client import ApifyClient

# Authenticate with your API token from the Apify Console.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("DevilScrapes/equity-crowdfunding-leads").call(
    run_input={
        "sources": ["wefunder"],
        "maxPerSource": 200,
        "statusFilter": "active",
        "industryFilter": "fintech",
        "useProxy": True,
    }
)

# Stream the rows from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["company_name"], item["raised_amount_usd"], item["founders"])

The key input parameters:

sources — any combination of wefunder, republic, startengine, or empty (= all three). Default: all three.
maxPerSource — hard cap per platform, 1–500. Default: 50.
statusFilter — "active", "funded", or "all". Wefunder-native; Republic and StartEngine emit currently-listed slugs regardless.
industryFilter — optional case-insensitive substring matched against tagline or industry. Pass "climate" for climate-tech campaigns.
useProxy — default true. Wefunder and Republic fingerprint datacenter IPs and block plain exits; leave it on.
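
If you want to re-filter an existing dataset locally, the industryFilter rule described above is easy to approximate. This is a guess at the semantics from the description (case-insensitive substring against tagline or industry), not the Actor's code; matches_industry is a hypothetical helper.

```python
from typing import Optional


def matches_industry(row: dict, industry_filter: Optional[str]) -> bool:
    """True if the filter is empty or appears in the row's tagline or industry."""
    if not industry_filter:
        return True
    needle = industry_filter.casefold()
    candidates = (row.get("tagline"), row.get("industry"))
    # Null fields are skipped, so a row with industry=None can still match on tagline.
    return any(needle in text.casefold() for text in candidates if text)


row = {"tagline": "Electrifying heavy machines", "industry": None}
keep = matches_industry(row, "MACHINES")
```

Because the match is a plain substring test, short filters such as "ai" will also hit unrelated words; a stricter local filter may be worth writing if that matters.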

What you'd actually use this for 💡

Four scenarios from the README and spec:

VC scout pipeline. Schedule a weekly Wefunder-only run, pull all active campaigns, join on founders[], enrich with LinkedIn. A live feed of sub-Series-A founders without waiting for Crunchbase. Scope it with industryFilter: "robotics" for your thesis vertical.

SDR founder outreach. Founders in active crowdfunding campaigns are fundraising — and buying. Filter by statusFilter: "active" and industryFilter: "fintech", drop founders[] into Apollo or Clay, and reach them while they're in motion.

Crowdfunding analytics. Schedule daily runs, persist to BigQuery or S3, and track raised_amount_usd trajectories. Wefunder publishes pre-money valuations Crunchbase never sees — the valuation_usd distribution by sector is a clean dataset for a leaderboard or research report.

Form C deep dives. This Actor surfaces campaign_slug and campaign_url. sec-edgar-filings-scraper (sibling Actor) takes it from there — issuer CIK on EDGAR, Form C / Form C-AR PDFs, audited revenue, SAFE terms. Two Actors, one Reg CF pipeline.
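
For the SDR scenario, outreach tools generally want one row per person rather than one per company. Here is a minimal sketch that flattens founders[] into a CSV; the output column names are my own choice, not something Apollo or Clay requires.

```python
import csv
import io

FIELDS = ["founder", "company_name", "campaign_url"]


def founder_rows(items):
    """Yield one dict per founder; rows with no founders are skipped."""
    for item in items:
        for founder in item.get("founders") or []:
            yield {
                "founder": founder,
                "company_name": item.get("company_name"),
                "campaign_url": item.get("campaign_url"),
            }


def to_csv(items) -> str:
    """Serialise the flattened founder rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(founder_rows(items))
    return buffer.getvalue()


items = [
    {
        "company_name": "RISE Robotics",
        "founders": ["Hiten Sonpal"],
        "campaign_url": "https://wefunder.com/riserobotics",
    },
    {"company_name": "Sparse Row", "founders": None, "campaign_url": None},
]
csv_text = to_csv(items)
```

In practice, items would come from iterate_items() on the run's dataset instead of a hard-coded list.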

Pricing — exact numbers 💰

Pay-per-event. You pay for rows you receive, nothing for rows that don't come back.

Event	Price
Actor start (once per run)	$0.05
Per campaign row emitted	$0.005

Run size	Cost
50 rows (a single source at the default cap)	$0.30
150 rows (50/source × 3)	$0.80
1,000 rows	$5.05
5,000 rows	$25.05
10,000 rows	$50.05

For context, the nearest alternative — scraping Crunchbase via a third-party Apify Actor — typically runs around $30 per 1,000 rows, while covering fewer than 30% of Wefunder campaigns and zero Republic trending campaigns. This Actor is roughly 6× cheaper and sources from the campaigns directly, not from a derived database. Apify's $5 free trial credit covers your first ~990 rows with no credit card.
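
The pricing arithmetic is simple enough to check yourself. A small sketch using the two published prices; the function names are hypothetical, and Decimal is used so the cents come out exact.

```python
from decimal import Decimal

START_FEE = Decimal("0.05")   # charged once per run
ROW_FEE = Decimal("0.005")    # charged per emitted row


def run_cost(rows: int) -> Decimal:
    """Cost in USD of a single run that emits `rows` rows."""
    return START_FEE + ROW_FEE * rows


def rows_within_budget(budget: Decimal) -> int:
    """Largest row count a single run can emit without exceeding `budget`."""
    if budget < START_FEE:
        return 0
    return int((budget - START_FEE) // ROW_FEE)


cost_1k = run_cost(1000)
free_rows = rows_within_budget(Decimal("5"))
```

Note that the start fee applies per run, so splitting the same rows across many small runs costs slightly more than one large run.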

The part worth knowing before you build on this 🔍

Wefunder's internal /-/companies/explore endpoint is the same one the SPA calls on every page load — unauthenticated, returning full JSON payloads including pre-money valuation encoded as terms.nb dollar shorthand ("$62.1M", "$700K", "$1.2B"). This Actor parses that shorthand with multipliers K=1e3, M=1e6, B=1e9; malformed values emit null rather than crashing.
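
A parser along those lines might look like the sketch below. It is my reconstruction from the description, not the Actor's source: the regex, the function name, and the choice to accept unsuffixed values such as "$5000" are all assumptions.

```python
import re
from decimal import Decimal
from typing import Optional

MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9}
SHORTHAND = re.compile(r"^\$?\s*(\d+(?:\.\d+)?)\s*([KMB])?$", re.IGNORECASE)


def parse_usd_shorthand(text) -> Optional[float]:
    """Convert '$62.1M'-style strings to a float; return None if malformed."""
    if not isinstance(text, str):
        return None
    match = SHORTHAND.match(text.strip().replace(",", ""))
    if match is None:
        return None
    number, suffix = match.groups()
    multiplier = MULTIPLIERS[suffix.upper()] if suffix else 1
    # Multiply in Decimal first so the scaled value is exact before
    # the final conversion to float.
    return float(Decimal(number) * multiplier)


valuation = parse_usd_shorthand("$62.1M")
```

Returning None on anything unexpected keeps a single odd display string from failing the whole run, which matches the null-not-crash behaviour described above.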

The scraper doesn't infer valuations: it reads the exact payload the website reads and converts the display string to a typed float. The Pydantic v2 ResultRow model enforces the schema on every row before write, so type surprises are caught at write time, not at analysis time.

Limitations (the honest list) 🚧

Republic and StartEngine return sparse data in v1. Republic surfaces ~10 trending campaign slugs per run from the SSR shell; StartEngine emits slug + company name from the public sitemap. On both, raised amount, valuation, and investor count are client-rendered and stay null. For the richest rows, run Wefunder-only (sources: ["wefunder"]).
No historical archive. Every run is a fresh snapshot of currently-listed campaigns. Schedule runs and export to your own storage; Apify's default run-scoped storage is purged after 7 days on the free plan.
Status filter is Wefunder-native. funded and all only meaningfully change Wefunder results; Republic and StartEngine always emit their current listing surface regardless.
No investor identity data. Who invested and at what amount is private. This Actor emits only public-facing campaign metadata.
No SEC EDGAR Form C parsing. Revenue, expenses, share count, and SAFE terms from Form C filings are in scope for sec-edgar-filings-scraper, not this Actor.
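
Since every run is a snapshot, any trajectory tracking happens on your side. One possible shape, assuming (source, campaign_slug) uniquely identifies a campaign across runs, which the schema suggests but does not state outright:

```python
def raised_deltas(previous: list, current: list) -> dict:
    """Change in raised_amount_usd per campaign between two snapshots."""
    before = {
        (row["source"], row["campaign_slug"]): row.get("raised_amount_usd")
        for row in previous
    }
    deltas = {}
    for row in current:
        key = (row["source"], row["campaign_slug"])
        old, new = before.get(key), row.get("raised_amount_usd")
        # Skip campaigns that are new or have null amounts in either snapshot.
        if old is not None and new is not None:
            deltas[key] = new - old
    return deltas


# Invented numbers, for illustration only.
monday = [{"source": "wefunder", "campaign_slug": "example-co", "raised_amount_usd": 100000.0}]
tuesday = [
    {"source": "wefunder", "campaign_slug": "example-co", "raised_amount_usd": 125000.0},
    {"source": "republic", "campaign_slug": "other-co", "raised_amount_usd": None},
]
deltas = raised_deltas(monday, tuesday)
```

Republic and StartEngine rows drop out of the result in v1 because their monetary fields are null.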

FAQ

Is scraping Wefunder, Republic, and StartEngine legal?
All three host public-facing marketing pages built to attract investors. This Actor reads only what the public UI exposes: no authentication is bypassed, no private investor data is collected, and the request rate stays well below what a human browsing the site would generate. Form C filings are SEC-required public disclosures. Check your own jurisdiction and use case; nothing here is legal advice.

Does Wefunder, Republic, or StartEngine have an official API I should use instead?
No. As of 2026, none of the three offers a public data API or bulk export endpoint. Wefunder operates an internal JSON endpoint the SPA uses; Republic and StartEngine surface their data via their web UIs (or, for StartEngine, a sitemap).

Can I export the dataset to Google Sheets or a data warehouse?
Yes — export CSV, JSON, Excel, or XML from the Apify Console Export button after the run, webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it via the Apify API.

Why does the Actor cost less than Crunchbase scrapers?
Different source, lower extraction cost. Crunchbase scraping hits a richer, more heavily defended site with far more fields. This Actor targets three smaller platforms and returns a narrower, well-defined schema. The 6× difference reflects the actual engineering complexity.

Try it

Live on the Apify Store: apify.com/DevilScrapes/equity-crowdfunding-leads.

Free $5 trial credit, no credit card. Run the defaults and you'll have up to 150 equity-crowdfunding leads across all three platforms in a couple of minutes (fewer in practice, since Republic surfaces only ~10 slugs per run in v1). Need a fourth platform (NextSeed, MicroVentures), a field you wish was populated, or a parser that broke after a site restructure? Drop it in the comments. The devil's in the data; I ship based on what people actually find there.

Built by Devil Scrapes — Apify Actors for builders who want the data, not the drama. Pay-per-event, honest pricing, no junk fields. 😈