Reverse-Engineering SEC EDGAR's Full-Text Search API (One Undocumented GET Request)

#api #python #tutorial #webdev

The official SEC EDGAR full-text search box is great if you're a human clicking around. It's useless if you want to pull 200 filings that mention "going concern" into a script.

So I opened the network tab, watched what the search page actually calls, and rebuilt the request myself. It turns out the entire thing runs on one undocumented GET request that returns clean Elasticsearch JSON. No API key, no signup, no OAuth dance. The SEC quietly shipped one of the better free financial-data APIs and never put a docs page on it.

Here's the exact request, the response fields nobody explains, and the gotchas that cost me an afternoon.

The endpoint and its real parameters

The page is a thin React front end. Every search fires a GET to https://efts.sec.gov/LATEST/search-index and gets back raw Elasticsearch JSON.

One trap before you copy anything: the path casing matters. /LATEST/ is uppercase; a lowercase /latest/ 404s.

The query parameters that actually do something:

q — the search term. Wrap a phrase in URL-encoded double quotes (%22climate+risk%22) for an exact match, or it tokenizes into an OR search.
forms — comma-separated filing types: 10-K, 8-K, SC 13D, etc. Leave it off to search everything.
startdt and enddt — date bounds in YYYY-MM-DD. Both required if you want a window.
from — pagination offset.
ciks — restrict to a specific company by its zero-padded CIK number.

A complete request:

curl -s \
  -A "your-app your-email@example.com" \
  "https://efts.sec.gov/LATEST/search-index?q=%22machine+learning%22&forms=8-K&startdt=2026-01-01&enddt=2026-06-01"

The User-Agent header is not optional. SEC's fair-access policy rejects requests with a generic or empty agent — you'll get a 403. Put your app name and a contact email in there. I learned this the hard way after my first ten curls returned nothing but an HTML block page.
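If you'd rather assemble the request in Python than hand-encode the quotes, here's a minimal sketch using only the standard library. The names build_search_url and make_request are mine, purely for illustration, and the User-Agent values are placeholders you should replace with your own. Nothing is sent over the network here; the code only builds the request object.

```python
from urllib.parse import urlencode
from urllib.request import Request

EFTS = "https://efts.sec.gov/LATEST/search-index"  # uppercase LATEST matters


def build_search_url(phrase, forms=None, startdt=None, enddt=None, offset=0):
    """Return the search URL for an exact-phrase query."""
    params = {"q": f'"{phrase}"'}  # literal quotes are encoded as %22
    if forms:
        params["forms"] = forms
    if startdt and enddt:  # send both bounds or neither
        params["startdt"] = startdt
        params["enddt"] = enddt
    if offset:
        params["from"] = offset
    return f"{EFTS}?{urlencode(params)}"


def make_request(url, app_name, contact_email):
    """Attach an identifying User-Agent; the request is built, not sent."""
    return Request(url, headers={"User-Agent": f"{app_name} {contact_email}"})


url = build_search_url("machine learning", forms="8-K",
                       startdt="2026-01-01", enddt="2026-06-01")
req = make_request(url, "your-app", "your-email@example.com")
print(req.full_url)
```

The printed URL matches the one in the curl example. Passing req to urllib.request.urlopen would send it.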

The two fields that unlock everything

The response is the Elasticsearch envelope, untouched. A single hit looks like this:

{
  "_id": "0001193125-26-032000:ionq-ex99_2.htm",
  "_source": {
    "ciks": ["0001824920"],
    "display_names": ["IonQ, Inc.  (IONQ)  (CIK 0001824920)"],
    "root_forms": ["8-K"],
    "form": "8-K",
    "file_date": "2026-01-30",
    "adsh": "0001193125-26-032000",
    "file_type": "EX-99.2"
  }
}

Two fields do all the work:

_id is {accession}:{filename}. Split on the colon and you can build a direct link to the document.
adsh is the accession number — the join key you feed into the rest of EDGAR's data and XBRL endpoints to pull the full filing.

Turning a hit into a clickable filing URL means stripping the dashes from the accession number for the folder path:

def filing_url(hit):
    adsh, fname = hit["_id"].split(":", 1)
    cik = int(hit["_source"]["ciks"][0])  # int() drops leading zeros
    folder = adsh.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{fname}"

A few _source fields are worth knowing, since nothing official documents them:

items — 8-K item codes. This is the fast filter for event-driven work: 2.02 is earnings, 5.02 is an exec change, 1.01 is a material agreement.
root_forms — use this, not form, when you want amendments grouped with originals (8-K/A rolls up under 8-K).
file_date vs period_ending — filing date vs the period the filing covers. For "what was disclosed today" you want file_date; for fundamentals you want period_ending.
display_names — a pre-formatted Name (TICKER) (CIK …) string. Regex the ticker out instead of doing a second lookup.
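Two of those fields are easy to put to work. The sketch below pulls tickers out of a display_names string and checks a hit for an 8-K item code. It rests on a few assumptions of mine: that items is a list of strings, that entities without a ticker simply omit the parenthesised ticker, and that multiple tickers (if they appear) are comma-separated inside one pair of parentheses. The helper names are hypothetical.

```python
import re

# Matches a trailing "(TICKER)  (CIK 0001234567)"; also accepts "(AAA, BBB)".
TICKER_RE = re.compile(
    r"\(([A-Z0-9.\-]+(?:,\s*[A-Z0-9.\-]+)*)\)\s*\(CIK \d+\)\s*$"
)


def tickers_from_display_name(display_name):
    """Return the tickers found in a display_names entry, or [] if none."""
    match = TICKER_RE.search(display_name)
    if not match:
        return []
    return [t.strip() for t in match.group(1).split(",")]


def has_item(hit, code):
    """True if the hit's _source lists the given 8-K item code."""
    return code in hit["_source"].get("items", [])


print(tickers_from_display_name("IonQ, Inc.  (IONQ)  (CIK 0001824920)"))
# ['IONQ']
```

Filtering a result list down to earnings releases then becomes a one-liner: [h for h in hits if has_item(h, "2.02")].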

There's also a free bonus: every response carries an aggregations block with form_filter, entity_filter, sic_filter, and biz_states_filter faceted counts — whether you asked for them or not. You can build a filings dashboard's sidebar without a single extra request.
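Reading those facets takes a few lines. This sketch assumes the block follows the standard Elasticsearch terms-aggregation shape, a buckets list of key and doc_count pairs, which I haven't seen spelled out anywhere official. The sample dict is made up to show the shape and is not real EDGAR output; facet_counts is a hypothetical helper name.

```python
def facet_counts(response, facet):
    """Return {key: doc_count} for one facet, or {} if it is absent."""
    buckets = response.get("aggregations", {}).get(facet, {}).get("buckets", [])
    return {b["key"]: b["doc_count"] for b in buckets}


# Made-up sample in the assumed shape, not real EDGAR output.
sample = {
    "aggregations": {
        "form_filter": {
            "buckets": [
                {"key": "8-K", "doc_count": 42},
                {"key": "10-K", "doc_count": 7},
            ]
        }
    }
}

print(facet_counts(sample, "form_filter"))  # {'8-K': 42, '10-K': 7}
```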

A scraper that actually paginates

Pagination is the one thing that trips people up. Each request returns at most 100 documents in hits.hits; there's no size parameter the backend honors past that. You walk the result set with from, step by 100, and watch hits.total.value for when to stop.

import time
import requests

EFTS = "https://efts.sec.gov/LATEST/search-index"
HEADERS = {"User-Agent": "orthogonal-research max@orthogonal.info"}

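def search_all(q, forms=None, startdt=None, enddt=None, max_results=1000):
    """Collect up to max_results hits, walking the result set 100 at a time."""
    results = []
    offset = 0
    while offset < max_results:
        params = {"q": q, "from": offset}
        if forms:
            params["forms"] = forms
        if startdt:
            params["startdt"] = startdt
        if enddt:
            params["enddt"] = enddt

        r = requests.get(EFTS, params=params, headers=HEADERS, timeout=15)
        r.raise_for_status()
        body = r.json()["hits"]
        hits = body["hits"]
        if not hits:
            break
        results.extend(hits)
        offset += 100
        # Stop once the offset reaches the reported total (hits.total.value).
        total = body.get("total", {}).get("value", max_results)
        if offset >= total:
            break
        time.sleep(0.15)  # stay under ~10 req/sec
    return results[:max_results]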

filings = search_all('"going concern"', forms="10-K",
                     startdt="2026-01-01", enddt="2026-06-01")
for f in filings:
    src = f["_source"]
    print(src["file_date"], src["form"], src["display_names"][0])

The time.sleep(0.15) keeps you under SEC's documented limit of ~10 requests/sec. Go faster and you'll get temporary IP blocks lasting about ten minutes. There's no X-RateLimit header to watch — the only signal is a sudden 403, so it's better to throttle up front than to detect and back off.
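A fixed sleep ignores how long each request took. If you want the spacing measured from one call to the next, a small throttle does it. This is a sketch of one way to do that, not something the SEC prescribes; the Throttle class and its injectable clock and sleep parameters are my own, there so the logic can be tested without real waiting.

```python
import time


class Throttle:
    """Space successive calls at least min_interval seconds apart."""

    def __init__(self, min_interval=0.15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block just long enough to honor the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.15)
throttle.wait()  # first call returns immediately
```

Call throttle.wait() right before each request in the pagination loop, in place of the fixed sleep.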

The gotchas that cost me time

Phrase vs token search. A bare q=climate+risk matches documents containing "climate" OR "risk" anywhere, which returned 40x more noise than I expected. The quoted form q=%22climate+risk%22 is the exact phrase, and it's what you almost always want.
The 10,000-result ceiling. Elasticsearch caps deep pagination. Once from passes 10,000 the endpoint errors out. If a query has more hits than that, narrow it with a tighter date range and stitch the windows together — there's no scroll cursor exposed.
Full-text only covers 2001 onward. The index starts in 2001. Older filings exist in EDGAR but won't show up here; for pre-2001 you're back to the structured submissions API.
It indexes exhibits, not just the main doc. A single 8-K can return several hits — one per attached exhibit. Dedupe on the accession number (adsh) if you only want one row per filing.
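Two of these gotchas have mechanical workarounds. The sketch below splits a long date range into smaller windows, so each query stays under the 10,000-result ceiling, and collapses exhibit-level hits to one row per filing. It assumes startdt and enddt are inclusive on both ends, which I haven't seen confirmed, so the windows are built not to overlap. The function names and the 30-day default are arbitrary choices of mine.

```python
from datetime import date, timedelta


def date_windows(startdt, enddt, days=30):
    """Yield (start, end) ISO date pairs covering the range without overlap."""
    start = date.fromisoformat(startdt)
    last = date.fromisoformat(enddt)
    while start <= last:
        end = min(start + timedelta(days=days - 1), last)
        yield start.isoformat(), end.isoformat()
        start = end + timedelta(days=1)


def dedupe_by_adsh(hits):
    """Keep only the first hit seen for each accession number."""
    seen = set()
    unique = []
    for hit in hits:
        adsh = hit["_source"]["adsh"]
        if adsh not in seen:
            seen.add(adsh)
            unique.append(hit)
    return unique


for window in date_windows("2026-01-01", "2026-01-10", days=4):
    print(window)
# ('2026-01-01', '2026-01-04')
# ('2026-01-05', '2026-01-08')
# ('2026-01-09', '2026-01-10')
```

Each (start, end) pair can be passed to search_all as startdt and enddt, with the combined results run through dedupe_by_adsh at the end. If a single window still hits the ceiling, shrink days and try again.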

Where this fits

I use this as the front door for a couple of projects: a script that flags new 8-K filings mentioning specific risk language, and an insider-buying alerter that cross-references full-text hits against Form 4 data. The full-text endpoint finds the filings; the structured EDGAR APIs pull the details.

I wrote up the full field-by-field decode of the _source envelope (every key in a real forms=8-K response) here if you want the complete reference.

The whole thing is one undocumented GET request returning clean JSON — no key, no cost. What other "human-only" search boxes are quietly sitting on a clean JSON API? I keep finding them in network tabs.