Devil Scrapes

Posted on Jun 3

SEC EDGAR Filings Scraper: export 10-K, 10-Q, 8-K to JSON for $2/1K

#webscraping #python #apify #data

Quick answer: The SEC EDGAR system (sec.gov) publishes every public-company filing — 10-K, 10-Q, 8-K, S-1, proxy statements — but has no bulk export and no official REST API for filings by ticker. An SEC EDGAR filings scraper resolves your ticker to a CIK, calls the EDGAR submissions endpoint, filters by form type, and returns one typed row per filing with 13 fields including the accession number and direct document URL. The Apify Actor below does it for $0.002 per filing (~$2.00 per 1,000), with proxy rotation and retries handled for you.

If you track public companies for a living (quant research, compliance, M&A, or just running an earnings calendar), you have already clicked around the SEC EDGAR filing search and wished the data came out as a spreadsheet, not a paginated HTML table. The EDGAR APIs exist and are free, but they are far from a clean programmatic surface: you resolve tickers to CIKs, hit the submissions endpoint, parse a deeply nested JSON blob, apply your own form-type filter, and assemble document URLs from a three-part key. This post shows what that takes, and how to get there with a single client.actor(...).call(...) call.

What is SEC EDGAR? 🔎

EDGAR — the Electronic Data Gathering, Analysis, and Retrieval system — is the SEC's mandatory electronic filing platform. Every public company reporting to the US Securities and Exchange Commission files there: 10-K annual reports, 10-Q quarterlies, 8-K current reports, S-1 IPO registrations, Form 4 insider ownership, DEF 14A proxy statements, and more. EDGAR has been public since the early 1990s and indexes millions of filings.

What it does not give you out of the box: a single endpoint that takes a ticker, a list of form types, and a date range and returns clean rows. Instead you get a suite of partially overlapping endpoints — submissions JSON, full-text search, CIK lookup — that you stitch together yourself.

Does SEC EDGAR have an API? 🔌

Yes, kind of. The SEC publishes several developer endpoints at data.sec.gov. They are loosely documented, and the SEC expects a descriptive User-Agent header on every request; without one the server throttles you. The key endpoint for filings by company is https://data.sec.gov/submissions/CIK{cik}.json (with the CIK zero-padded to 10 digits), which returns the ~1,000 most recent filings for a CIK as a nested JSON blob. There is no ticker-to-CIK resolution built in: you fetch https://www.sec.gov/files/company_tickers.json first, parse the whole company map (roughly 10,000 entries), and look up your ticker yourself, all before any filtering or URL assembly.
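As a minimal sketch of the URL step, the helper below zero-pads a CIK and builds the submissions URL. The function name submissions_url is my own, not part of any SEC or Apify library, and it makes no network call.

```python
from typing import Union


def submissions_url(cik: Union[int, str]) -> str:
    """Build the submissions endpoint URL for a CIK given as int or string."""
    text = str(cik).strip()
    # Accept bare integers ("320193") and already-padded strings ("0000320193").
    if not text.isdigit() or len(text.lstrip("0")) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    padded = text.lstrip("0").zfill(10)
    return f"https://data.sec.gov/submissions/CIK{padded}.json"


print(submissions_url(320193))
# https://data.sec.gov/submissions/CIK0000320193.json
```

Normalising once at the edge means the rest of a pipeline can treat the CIK as a fixed-width string.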

What the data looks like

Each filing comes back as one flat, typed row. A real example for Apple's annual report:

{
  "cik": "0000320193",
  "ticker": "AAPL",
  "company_name": "Apple Inc.",
  "form_type": "10-K",
  "accession_number": "0000320193-24-000123",
  "filing_date": "2024-11-01",
  "report_date": "2024-09-28",
  "acceptance_datetime": "2024-11-01T06:01:36.000Z",
  "primary_document": "aapl-20240928.htm",
  "primary_document_url": "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
  "filing_index_url": "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-K&dateb=&owner=include&count=40",
  "is_xbrl": true,
  "scraped_at": "2026-05-31T09:14:22+00:00"
}

Thirteen fields, Pydantic-validated before they land in your dataset. The primary_document_url goes straight to the HTML or PDF — no extra step to assemble the EDGAR archive path. is_xbrl tells you whether machine-readable financials are attached.

The naive approach (and why it falls apart) 🛠️

The EDGAR submissions API looks approachable from the outside. Here is where a first attempt usually stalls:

Step 1: Ticker resolution. EDGAR speaks CIKs, not tickers. To look up Apple you need 0000320193, not AAPL. The company_tickers.json file maps them, but it is a flat object with roughly 10,000 entries keyed by sequential integer, not by ticker. You iterate the whole thing and build an inverted index.
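A rough sketch of that inverted index, assuming each entry carries cik_str, ticker, and title keys (the layout of company_tickers.json as I understand it). The helper name build_ticker_index is hypothetical, and the sample holds only two entries instead of the downloaded file.

```python
def build_ticker_index(company_map: dict) -> dict:
    """Invert the integer-keyed map into {TICKER: zero-padded 10-digit CIK}."""
    index = {}
    for entry in company_map.values():
        index[entry["ticker"].upper()] = str(entry["cik_str"]).zfill(10)
    return index


# Two entries in the assumed shape; the real file is fetched from sec.gov.
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
}

index = build_ticker_index(sample)
print(index["AAPL"])  # 0000320193
```

Building the index once and reusing it across tickers avoids rescanning the whole map for every lookup.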

Step 2: Fingerprint-aware requests. The SEC requires a descriptive User-Agent with a contact address, or it rate-limits your client. A standard requests call with a blank UA gets throttled within a few calls. We set a proper User-Agent on every request and run curl-cffi with browser-impersonated TLS handshakes (rotating across Chrome 131, Chrome 124, Firefox 147, and Safari 180 profiles) so the connection looks like a real browser, not a Python client.
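For the User-Agent half of this step, here is a small standard-library sketch. It only prepares the request object and does not reproduce the curl-cffi TLS impersonation described above. The USER_AGENT value and the edgar_request name are placeholders of mine.

```python
import urllib.request

# Hypothetical identity string; substitute your own name and contact address.
USER_AGENT = "ExampleResearch/1.0 (contact: you@example.com)"


def edgar_request(url: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that identifies the caller."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT, "Accept": "application/json"},
    )


req = edgar_request("https://data.sec.gov/submissions/CIK0000320193.json")
# urllib stores header names capitalised, hence "User-agent" on lookup.
print(req.get_header("User-agent"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual fetch.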

Step 3: Response parsing. The submissions JSON is deeply nested. filings.recent contains parallel arrays — form, accessionNumber, filingDate, reportDate, primaryDocument, isXBRL — all aligned by index. Assembling one row means zipping fields by position, then building the archive URL from a three-part key (cik_int, accession_nodash, primary_doc). Get any transformation wrong and the URL 404s silently.
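The zipping and URL assembly can be sketched roughly as follows. The function name rows_from_recent is hypothetical, the output keys borrow the field names from the sample row earlier in this post, and the second filing in the sample is a made-up placeholder that only exercises the form filter.

```python
def rows_from_recent(cik, recent, form_types=None):
    """Zip the parallel arrays of filings.recent into one dict per filing."""
    cik_int = int(cik)  # the archive path uses the bare integer, not the padded CIK
    columns = zip(
        recent["form"], recent["accessionNumber"], recent["filingDate"],
        recent["reportDate"], recent["primaryDocument"], recent["isXBRL"],
    )
    rows = []
    for form, accession, filed, report, doc, is_xbrl in columns:
        if form_types and form not in form_types:
            continue
        accession_nodash = accession.replace("-", "")
        rows.append({
            "form_type": form,
            "accession_number": accession,
            "filing_date": filed,
            "report_date": report or None,  # empty string becomes None
            "primary_document_url": (
                "https://www.sec.gov/Archives/edgar/data/"
                f"{cik_int}/{accession_nodash}/{doc}"
            ),
            "is_xbrl": bool(is_xbrl),
        })
    return rows


recent = {
    "form": ["10-K", "8-K"],
    "accessionNumber": ["0000320193-24-000123", "0000320193-24-000000"],
    "filingDate": ["2024-11-01", "2024-10-31"],
    "reportDate": ["2024-09-28", ""],
    "primaryDocument": ["aapl-20240928.htm", "placeholder.htm"],
    "isXBRL": [1, 0],
}

rows = rows_from_recent("0000320193", recent, form_types={"10-K"})
print(rows[0]["primary_document_url"])
# https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm
```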

Step 4: Rate limits, retries, and CIK normalisation. The SEC's rate-limit guidance is 10 requests per second per IP. We pace requests, retry on 408/429/5xx with exponential backoff (up to 5 attempts), and honour Retry-After headers. We rotate residential proxies via Apify Proxy on any block — fresh session ID, fresh exit IP. CIKs are 10-digit zero-padded in some endpoints and bare integers in others; get the padding wrong and the submissions URL 404s. Partial success surfaces explicitly: we never return an empty dataset with a green status. The devil's in the details, and we absorbed them all.
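A simplified sketch of the retry policy, not the Actor's actual code: it decides whether a status is worth retrying and how long to wait, preferring a numeric Retry-After value when one is present. The base and cap values are assumptions, and jitter is left out to keep the example deterministic.

```python
MAX_ATTEMPTS = 5


def should_retry(status: int, attempt: int) -> bool:
    """Retry on 408, 429, and any 5xx while attempts remain."""
    retryable = status in (408, 429) or 500 <= status <= 599
    return retryable and attempt < MAX_ATTEMPTS


def backoff_delay(attempt: int, retry_after=None, base=1.0, cap=30.0) -> float:
    """Seconds to wait before the next attempt (attempt counts from 1)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * 2 ** (attempt - 1))


print([backoff_delay(n) for n in range(1, 6)])  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delay(2, retry_after="7"))        # 7.0
```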

None of that is impossible — but it is a solid afternoon of edge-case archaeology you would rather spend on the analysis, not the pipeline.

The Actor 🎯

I packaged the result as an Apify Actor: SEC EDGAR Filings Scraper. Paste a list of tickers in the Apify Console and click Start, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/sec-edgar-filings-scraper").call(
    run_input={
        "tickers": ["AAPL", "MSFT", "NVDA"],
        "formTypes": ["10-K", "10-Q"],
        "maxResultsPerCompany": 20,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["company_name"], item["form_type"], item["filing_date"])

Key input parameters:

Field	Default	Notes
`tickers`	`["AAPL", "MSFT"]`	Ticker symbols or 10-digit CIKs — mix freely
`formTypes`	`["10-K", "10-Q"]`	Filter to specific forms; leave empty for all
`maxResultsPerCompany`	`20`	Cap per company; EDGAR returns ~1,000 most recent
`userAgent`	Devil Scrapes default	SEC-compliant; replace with your own if preferred

Use cases

Earnings calendar automation. Schedule a daily run on your covered universe, filter for 10-Q and 10-K, diff against yesterday's dataset, and surface companies that filed since your last pull. No more manual EDGAR refreshes on earnings week.
Material-events monitoring. Pull every 8-K filed by your portfolio companies. The primary_document_url links straight to the filing body — pipe it into an LLM summarizer for a morning briefing on press releases and restatements.
M&A and activist-investor signals. Filter for S-4 (merger registration) and SC 13D (5%+ ownership disclosure). An activist fund's SC 13D is due within five business days of crossing the 5% threshold; once it is on EDGAR, this scraper surfaces it in your dataset within minutes.
Insider-activity tracking. Form 4 filings report officer and director stock transactions. A daily pull with formTypes: ["4"] on high-interest tickers gives you a continuous insider-activity log.
XBRL financial data pipeline. Filter for 10-K and 10-Q with is_xbrl: true, then feed primary_document_url into an XBRL parser or your warehouse. Every row already carries the accession number and document filename.
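The "diff against yesterday's dataset" idea from the earnings-calendar use case can be sketched with the accession number as the key. The helper name new_filings is mine, and the second accession number below is a placeholder, not a real filing.

```python
def new_filings(today: list, yesterday: list) -> list:
    """Return rows from today's pull whose accession number was not seen before."""
    seen = {row["accession_number"] for row in yesterday}
    return [row for row in today if row["accession_number"] not in seen]


yesterday = [
    {"ticker": "AAPL", "form_type": "10-K", "accession_number": "0000320193-24-000123"},
]
today = yesterday + [
    # Placeholder row standing in for a filing that appeared overnight.
    {"ticker": "AAPL", "form_type": "10-Q", "accession_number": "0000000000-00-000001"},
]

for row in new_filings(today, yesterday):
    print(row["ticker"], row["form_type"], row["accession_number"])
# AAPL 10-Q 0000000000-00-000001
```

Keying on the accession number rather than on ticker plus date keeps two filings made on the same day distinct.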

Pricing — exact numbers 💰

Pay-per-event. You pay for the filings you receive and nothing for the query itself. If a run returns no data, you pay only the small one-off actor-start fee.

Event	Price	What it covers
`actor-start`	$0.005	One-off warm-up per run
`result`	$0.002	Per filing written to the dataset

Pull	Cost
100 filings	$0.21
1,000 filings	$2.01
10,000 filings	$20.01
50,000 filings	$100.01

Apify's $5 free trial credit covers your first ~2,497 filings, no credit card required. The S&P 500 at the last 20 filings each (10,000 rows) costs $20.01.
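The arithmetic behind these figures, as a small sketch that assumes everything is pulled in a single run (one actor-start fee). The function names are my own. Decimal avoids float rounding noise on fractions of a cent.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # one-off fee per run
PER_RESULT = Decimal("0.002")   # per filing written to the dataset


def run_cost(filings: int) -> Decimal:
    """Exact cost in dollars of a single run that returns `filings` rows."""
    return ACTOR_START + PER_RESULT * filings


def filings_within(budget) -> int:
    """Largest number of filings a single run can return within `budget` dollars."""
    remaining = Decimal(str(budget)) - ACTOR_START
    return max(0, int(remaining // PER_RESULT))


print(run_cost(1000))     # 2.005
print(filings_within(5))  # 2497
```

The exact cost of 1,000 filings is $2.005, which the table above shows rounded up to the cent.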

The part other scrapers won't tell you

EDGAR's submissions endpoint caps at ~1,000 most recent filings per company. For deeper history (every 10-K since 1994, say) you hit the archive indexes under https://www.sec.gov/Archives/edgar/full-index/, which are organised by year and quarter and need different parsing. The maxResultsPerCompany ceiling is therefore 1,000 for the submissions-based path, and we say so in the README. For filings older than that window, pair this Actor with a dedicated archive crawler.
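For orientation, a sketch of how the archive index URLs can be enumerated, assuming the year/QTRn directory layout under full-index and the master.idx filename. The helper name full_index_urls is hypothetical, and parsing the index files themselves is not shown.

```python
def full_index_urls(start_year: int, end_year: int, filename: str = "master.idx") -> list:
    """List one index URL per quarter for the inclusive year range."""
    base = "https://www.sec.gov/Archives/edgar/full-index"
    return [
        f"{base}/{year}/QTR{quarter}/{filename}"
        for year in range(start_year, end_year + 1)
        for quarter in range(1, 5)
    ]


urls = full_index_urls(1994, 1995)
print(len(urls))  # 8
print(urls[0])    # https://www.sec.gov/Archives/edgar/full-index/1994/QTR1/master.idx
```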

We also expose userAgent as a configurable input. The SEC asks automated clients to identify themselves with a contact address. Our default is DevilScrapes-EDGAR/1.0 (contact: apify.com/DevilScrapes); replace it at scale so the SEC can reach you.

Limitations

~1,000 most recent filings per company. The submissions endpoint returns the latest ~1,000. For older history, use EDGAR's archive index directly.
Ticker resolution covers operating companies only. SEC's company_tickers.json maps exchange-listed operating companies. Mutual funds and individual-filer CIKs require CIK input directly.
No full-text content retrieval. We return primary_document_url; downloading and parsing the HTML/PDF is out of scope. Pair with a generic HTML fetcher or PDF parser.
8-K filings have no report_date. Current-event reports have only a filing date; the field is null for event-driven forms — by design.
Rate limits apply at scale. SEC guidance is 10 requests/second. We pace within it, so very large runs (50+ tickers) take proportionally longer.

FAQ

Is scraping SEC EDGAR legal?
EDGAR is a public government database the SEC operates to fulfill its disclosure mandate. The SEC's EDGAR developer guide encourages programmatic access. This Actor uses only the published APIs, identifies itself with a proper User-Agent, and collects only filing metadata (no personal data). As always, consult your own legal counsel for your jurisdiction.

Does this work for international companies listed in the US?
Yes — foreign private issuers (FPIs) appear in EDGAR under form types like 20-F and 6-K. Pass their CIK directly, or use their US-listed ticker if it appears in the company_tickers.json map.

Can I export to Google Sheets or a data warehouse?
Yes. Export CSV/Excel/JSON from the Apify Console after a run, webhook the dataset on ACTOR.RUN.SUCCEEDED into Make/Zapier/n8n, or fetch via the Apify API. Each row is a flat JSON object, so no flattening is needed.
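Because the rows are flat, writing them out yourself takes only the standard library. A minimal sketch follows: the helper name rows_to_csv is mine, and the sample row reuses three fields from the Apple example earlier in this post.

```python
import csv
import io


def rows_to_csv(rows: list) -> str:
    """Serialise flat dict rows to CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"ticker": "AAPL", "form_type": "10-K", "filing_date": "2024-11-01"}]
text = rows_to_csv(rows)
print(text.splitlines()[0])  # ticker,form_type,filing_date
print(text.splitlines()[1])  # AAPL,10-K,2024-11-01
```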

Why does primary_document_url sometimes point to the EDGAR browse page?
Some older filings don't populate primaryDocument in the submissions JSON — there we fall back to the EDGAR browse page for that company, which is always valid. The filing_index_url always points to the filing-specific index page listing every document in the submission.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/sec-edgar-filings-scraper. Free $5 trial credit, no credit card. Drop in ["AAPL", "MSFT", "NVDA"], set formTypes: ["10-K"], and you have clean rows in your dataset in seconds. Found a form type that behaves oddly, or need a field I missed? Drop it in the comments — I ship based on what people actually need.

References:

SEC EDGAR developer documentation — the official programmatic access guide
SEC EDGAR full-text search — complementary endpoint for keyword search
Apify Actor documentation — how Apify Actors work, storage, and API

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community