NexGenData

Posted on Jun 28 • Originally published at thenextgennexus.com

How to Download Clean Form 4, 13F, and 8-K SEC Filing Data in CSV

#ai #automation #finance #api

If you have ever tried to pull SEC filings as CSV straight from EDGAR, you already know the story: the data is public, but it is anything but ready to analyze. Form 4 insider transactions live in nested XML. 13F holdings sit inside information tables with their own schema. 8-K filings come as a header, an index, and a stack of exhibits. CIKs and tickers do not join cleanly. Field names drift between filers. And every step of the pipeline assumes you already know the quirks of the EDGAR data model.

This guide walks through what is actually inside Form 4 data, 13F holdings data, and 8-K material event data, why raw EDGAR is painful for everyday research workflows, and how to load a clean, normalized version of the three filings as CSV in Python or Excel in a couple of minutes. At the end, there is a free sample download and a link to the full Insider & Institutional Filings Data Pack if you want the full dataset.

Why SEC filings are useful research data

The SEC publishes one of the largest, longest-running, machine-readable corpora of corporate behavior in the world. For analysts, journalists, developers, students, and independent researchers, three filings show up over and over again:

Form 4 — every time a director, officer, or 10% beneficial owner buys, sells, is granted, exercises, or otherwise transacts in their own company's securities, they have to file a Form 4 within two business days. That gives you a near-real-time record of insider activity at every U.S. public company.
13F-HR — institutional investment managers that exercise investment discretion over at least $100 million in Section 13(f) securities file a 13F every quarter listing their long U.S. equity positions. It is the standard window into what large funds hold.
8-K — public companies file an 8-K within four business days of a "material event": earnings releases, executive changes, M&A activity, bankruptcy, delistings, and many more. Each filing flags one or more numbered items (2.02, 5.02, 1.01, and so on) that label what happened.

Used as a research dataset, these three filings answer practical questions: How many insider sales happened in technology last week? Which managers added to a specific CUSIP between two quarters? How many 8-Ks of item 5.02 (departure of officers) were filed by mid-cap companies in the last six months? Once the data is normalized, you can answer those questions in a couple of SQL lines or a few pandas cells.
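As a rough illustration of the second question, here is a minimal pandas sketch that compares positions in one CUSIP across two quarters. The inline rows, the placeholder CUSIP, and the helper name `managers_who_added` are made up for the example; only the column names follow the 13F table described later in this guide.

```python
import pandas as pd

# Hypothetical 13F rows for one placeholder CUSIP across two report periods.
holdings = pd.DataFrame({
    "manager_name": ["FUND A", "FUND B", "FUND A", "FUND B", "FUND C"],
    "cusip": ["000000AA1"] * 5,
    "report_period": pd.to_datetime(
        ["2025-09-30", "2025-09-30", "2025-12-31", "2025-12-31", "2025-12-31"]
    ),
    "shares": [1000, 500, 1500, 400, 250],
})


def managers_who_added(df, cusip, prev_period, curr_period):
    """Return the share increase for each manager whose position in `cusip` grew."""
    prev_ts, curr_ts = pd.Timestamp(prev_period), pd.Timestamp(curr_period)
    one = df[df["cusip"] == cusip]
    # One column per quarter; a manager absent in a quarter counts as 0 shares.
    wide = one.pivot_table(
        index="manager_name", columns="report_period",
        values="shares", aggfunc="sum", fill_value=0,
    ).reindex(columns=[prev_ts, curr_ts], fill_value=0)
    change = wide[curr_ts] - wide[prev_ts]
    return change[change > 0].sort_values(ascending=False)


added = managers_who_added(holdings, "000000AA1", "2025-09-30", "2025-12-31")
print(added)
```

A manager that appears only in the later quarter is treated as a new position, which is one reasonable reading of "added"; adjust the zero-fill if you want to separate new positions from increases.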

For more on the broader category, see our Financial Data Tools and Market Intelligence Tools hubs.

Why raw EDGAR is painful for everyday work

EDGAR is excellent as an archive. It is not designed as an analytics warehouse. A few of the friction points you hit the first time you try to roll your own pipeline:

Inconsistent parsing. Form 4 documents are XML, but field placement, optional sub-elements, and footnotes vary across filers and filing agents. A naive parser breaks on the first non-standard filing.
CIK / ticker join. CIKs are the canonical key inside EDGAR. Tickers are the canonical key outside it. The mapping changes (ticker changes, mergers, multiple share classes), and SEC's own company_tickers.json covers only currently-listed issuers.
13F item-by-item assembly. Each 13F filing references one or more information tables. You have to crawl the filing index, find the table XML, parse rows that include sole/shared/no voting authority breakdowns, and re-join everything back to the manager CIK.
8-K item parsing. A single 8-K can flag multiple items in its header. The "real" event labels (e.g. "Results of Operations and Financial Condition" for item 2.02) live in the SEC's instructions, not in the filing itself.
Dedup and amendments. Amendments (Form 4/A, 8-K/A, 13F-HR/A) overlap with originals, and "the same" transaction can appear in multiple filings.
Rate limits and headers. EDGAR requires a contact User-Agent and rate-limits hard. A few hundred parallel requests will get you throttled.
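If you do roll your own pipeline, the last point is the easiest one to get right early. The sketch below shows one way to pad CIKs and space out requests; it makes no network calls. The contact string, the `Throttle` class, and the 5-per-second default are illustrative assumptions, chosen to sit below the 10 requests per second that the SEC's fair-access guidance allows at the time of writing.

```python
import time

# Hypothetical contact string: replace with your own name and email address.
USER_AGENT = "Example Research admin@example.com"


def pad_cik(cik):
    """Format a CIK as the 10-digit, zero-padded string EDGAR uses."""
    return str(int(cik)).zfill(10)


def request_headers():
    """Headers to attach to every EDGAR request."""
    return {"User-Agent": USER_AGENT, "Accept-Encoding": "gzip, deflate"}


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


print(pad_cik(320193))
print(request_headers())
```

Call `Throttle.wait()` immediately before each request in whatever HTTP client you use. The clock and sleep functions are injectable only so the class can be tested without real delays.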

For one-off research it is doable. For ongoing analysis, every team eventually rebuilds the same parser. That is the niche this data pack fills: a normalized, ready-to-analyze version of the three filings, joined to a fresh CIK / ticker / exchange map, so you can skip the parsing and go straight to the question you actually care about.

What fields are included

The pack ships four data files. Each one is a flat table — one row per transaction, holding, or 8-K item.

Form 4 — insider transactions (`form4_insider_transactions.csv`)

filing_accession_number, filing_url, filing_date, period_of_report
issuer_cik, issuer_name, ticker
insider_cik, insider_name, insider_role, is_director, is_officer, is_ten_percent_owner
transaction_date, transaction_code, transaction_type_label (Purchase, Sale, Grant/Award, Option Exercise, Tax Withholding, Disposition, Gift, …)
security_title, shares, price_per_share, transaction_value
acquired_or_disposed (A / D), shares_owned_after, ownership_type (Direct / Indirect)
source_form_type
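To make the relationship between `transaction_code`, `transaction_type_label`, and `transaction_value` concrete, here is a small sketch. The code-to-label dictionary is an assumption built from the labels listed above, the "Other" fallback is hypothetical, and the value formula (shares times price) is inferred rather than confirmed; the pack's authoritative definitions are in its data dictionary.

```python
import pandas as pd

# Assumed mapping based on the labels listed above; not the pack's official table.
CODE_LABELS = {
    "P": "Purchase",
    "S": "Sale",
    "A": "Grant/Award",
    "M": "Option Exercise",
    "F": "Tax Withholding",
    "D": "Disposition",
    "G": "Gift",
}

# Made-up rows that reuse the Form 4 column names.
tx = pd.DataFrame({
    "transaction_code": ["S", "A", "F", "J"],
    "shares": [2500, 1000, 300, 10],
    "price_per_share": [187.42, 0.0, 187.42, 5.0],
})

# Codes missing from the dictionary fall back to a hypothetical "Other" label.
tx["transaction_type_label"] = tx["transaction_code"].map(CODE_LABELS).fillna("Other")

# Assumption: transaction_value is shares multiplied by price_per_share.
tx["transaction_value"] = (tx["shares"] * tx["price_per_share"]).round(2)

print(tx)
```

Grants typically carry a zero price, so their `transaction_value` is zero even when the share count is large, which is one reason to read share counts and dollar values separately.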

13F-HR — institutional holdings (`13f_institutional_holdings.csv`)

filing_accession_number, filing_url, filing_date, report_period
manager_cik, manager_name
issuer_name, ticker, cusip
shares, market_value, put_call
investment_discretion (SOLE / DFND / OTR), voting_authority_sole, voting_authority_shared, voting_authority_none
source_form_type
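One common use of these columns is turning `market_value` into a per-manager portfolio weight. The sketch below assumes made-up rows and a hypothetical `portfolio_weight` column; it is not part of the pack.

```python
import pandas as pd

# Made-up holdings that reuse the 13F column names from the list above.
f13 = pd.DataFrame({
    "manager_cik": ["0000000001", "0000000001", "0000000001",
                    "0000000002", "0000000002"],
    "cusip": ["000000AA1", "000000BB2", "000000CC3", "000000AA1", "000000BB2"],
    "market_value": [600.0, 300.0, 100.0, 50.0, 150.0],
})

# Total reported value per manager, broadcast back onto each holding row.
manager_total = f13.groupby("manager_cik")["market_value"].transform("sum")
f13["portfolio_weight"] = f13["market_value"] / manager_total

# Largest position per manager by weight.
top = f13.loc[f13.groupby("manager_cik")["portfolio_weight"].idxmax()]
print(top[["manager_cik", "cusip", "portfolio_weight"]])
```

Rows with a populated `put_call` value are option positions, so you may want to filter them out before computing weights on common stock.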

8-K — material events (`8k_material_events.csv`)

One row per (filing, item). A single 8-K with three items produces three rows.

filing_accession_number, filing_url, filing_date, event_date
company_cik, company_name, ticker
item_number (e.g. 2.02, 5.02), item_label, short_event_summary
source_form_type (8-K or 8-K/A)
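The one-row-per-(filing, item) shape can be sketched as follows. The filing-level input and the accession numbers are invented for illustration, and the label dictionary holds only three of the official item titles from the SEC's Form 8-K instructions; this is not the pack's actual build code.

```python
import pandas as pd

# A few official 8-K item titles; the full list is in the Form 8-K instructions.
ITEM_LABELS = {
    "1.01": "Entry into a Material Definitive Agreement",
    "2.02": "Results of Operations and Financial Condition",
    "9.01": "Financial Statements and Exhibits",
}

# Made-up filing-level records: one row per filing, items held as a list.
filings = pd.DataFrame({
    "filing_accession_number": ["EXAMPLE-0001", "EXAMPLE-0002"],
    "items": [["2.02", "9.01"], ["1.01"]],
})

# explode() repeats the filing columns once per list element.
rows = (
    filings.explode("items")
    .rename(columns={"items": "item_number"})
    .reset_index(drop=True)
)
rows["item_label"] = rows["item_number"].map(ITEM_LABELS)

print(rows)
```

Keeping `item_number` as text throughout avoids any ambiguity between the string "5.02" and the number 5.02 when filtering later.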

Company / ticker / CIK map (`company_ticker_cik_map.csv`)

cik (10-digit, zero-padded), ticker, company_name
exchange, entity_type, sic, sic_description

Every column, every transaction-code mapping, and every 8-K item label is documented in DATA_DICTIONARY.md that ships with the pack.

What the rows actually look like

The rows below are illustrative examples showing the column layout and field types. They are not real filings — verify any specific transaction against the original SEC document before using it in any decision.

Sample Form 4 row


    filing_accession_number, filing_date, issuer_cik, issuer_name, ticker, insider_name, insider_role, transaction_date, transaction_code, transaction_type_label, shares, price_per_share, transaction_value, acquired_or_disposed
    EXAMPLE-0001-26-000123, 2026-04-12, 0000320193, EXAMPLE COMPANY INC, EXMP, Doe Jane A., Officer (Chief Financial Officer), 2026-04-11, S, Sale, 2500, 187.42, 468550.00, D

Sample 13F-HR row


    filing_accession_number, filing_date, report_period, manager_cik, manager_name, issuer_name, cusip, shares, market_value, investment_discretion, voting_authority_sole
    EXAMPLE-0002-26-000456, 2026-02-14, 2025-12-31, 0000102909, EXAMPLE CAPITAL PARTNERS LLC, EXAMPLE HOLDINGS CORP, 30303M102, 412300, 89221150, SOLE, 412300

Sample 8-K row


    filing_accession_number, filing_date, event_date, company_cik, company_name, ticker, item_number, item_label
    EXAMPLE-0003-26-000789, 2026-03-04, 2026-03-03, 0000789019, EXAMPLE TECH INC, EXTC, 5.02, Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers
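Values such as zero-padded CIKs and item numbers look numeric, so it is worth seeing what pandas does with them by default. The sketch below reads a two-column excerpt shaped like the sample 8-K row from an in-memory string, once with default type inference and once with explicit string dtypes.

```python
import io

import pandas as pd

# Two illustrative rows using the 8-K column names; not real filings.
sample = io.StringIO(
    "company_cik,item_number\n"
    "0000789019,5.02\n"
    "0000789019,9.01\n"
)

# Default inference: the CIK becomes an integer and the item number a float.
naive = pd.read_csv(sample)

# Explicit string dtypes keep both values exactly as written in the file.
sample.seek(0)
typed = pd.read_csv(sample, dtype={"company_cik": str, "item_number": str})

print(naive.dtypes)
print(naive["company_cik"].iloc[0], "vs", typed["company_cik"].iloc[0])
print((typed["item_number"] == "5.02").sum(), "row(s) match the string '5.02'")
```

With default inference the leading zeros are gone, and string filters on `item_number` no longer behave as expected, so passing `dtype` for identifier-like columns is the safer habit.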

Loading the CSVs in Python (pandas)

The files are flat UTF-8 CSVs with a header row. You can load all four into pandas, join them on CIK, and get a working dataframe in under a minute:


    import pandas as pd

    form4 = pd.read_csv(
        "form4_insider_transactions.csv",
        parse_dates=["filing_date", "transaction_date", "period_of_report"],
        dtype={"issuer_cik": str, "insider_cik": str},
    )

    f13 = pd.read_csv(
        "13f_institutional_holdings.csv",
        parse_dates=["filing_date", "report_period"],
        dtype={"manager_cik": str, "cusip": str},
    )

    k8 = pd.read_csv(
        "8k_material_events.csv",
        parse_dates=["filing_date", "event_date"],
        # Read item_number as text so "5.02" is not parsed as the float 5.02
        dtype={"company_cik": str, "item_number": str},
    )

    cmap = pd.read_csv(
        "company_ticker_cik_map.csv",
        dtype={"cik": str, "sic": str},
    )

    # 1. Join Form 4 transactions to the exchange / SIC map
    form4_enriched = form4.merge(
        cmap.rename(columns={"cik": "issuer_cik"}),
        on="issuer_cik",
        how="left",
        suffixes=("", "_map"),
    )

    # 2. Most-active insider tickers in the window
    print(form4["ticker"].value_counts().head(10))

    # 3. 8-K Item 5.02 (officer changes) in the last 30 days
    recent_502 = k8[
        (k8["item_number"] == "5.02")
        & (k8["filing_date"] >= k8["filing_date"].max() - pd.Timedelta(days=30))
    ]
    print(recent_502.groupby("ticker").size().sort_values(ascending=False).head())

    # 4. Top managers by reported market value in the latest quarter
    latest_13f = f13[f13["report_period"] == f13["report_period"].max()]
    print(
        latest_13f.groupby("manager_name")["market_value"]
                  .sum().sort_values(ascending=False).head(10)
    )

All four files share a consistent date format (YYYY-MM-DD) and zero-padded 10-digit CIKs, so joins are straightforward. The pack also ships a .json twin of each table if you prefer to ingest from a document store.
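Before trusting a joined frame, it can help to confirm that the join neither multiplied rows nor silently missed issuers. The sketch below uses made-up rows and assumes, without confirmation from the pack, that the map may hold more than one row per CIK (for example, one per share class).

```python
import pandas as pd

# Made-up Form 4 rows and a map with two tickers under one CIK.
form4 = pd.DataFrame({
    "issuer_cik": ["0000000001", "0000000001", "0000000009"],
    "shares": [100, 200, 300],
})
cmap = pd.DataFrame({
    "cik": ["0000000001", "0000000001", "0000000002"],
    "ticker": ["EXA", "EXB", "EXC"],
    "exchange": ["NYSE", "NYSE", "Nasdaq"],
})

# Keep one row per CIK so the left join cannot duplicate transactions.
cmap_one = (
    cmap.drop_duplicates(subset="cik", keep="first")
    .rename(columns={"cik": "issuer_cik"})
)

# validate raises if the right side is not unique on the key;
# indicator adds a _merge column showing which rows found a match.
joined = form4.merge(
    cmap_one[["issuer_cik", "exchange"]],
    on="issuer_cik",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = joined.loc[joined["_merge"] == "left_only", "issuer_cik"].unique()
print(len(form4), len(joined), list(unmatched))
```

If the row count after the merge differs from the count before it, or the unmatched list is long, the key column is worth a closer look before running any aggregate.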

Worked example: a 30-day insider activity report

To make the dataset concrete, here is a short worked example. The goal is a one-page summary that answers four questions about insider activity in the last 30 days:

How many Form 4 transactions were filed in total?
What share of them were open-market purchases versus open-market sales versus grants and tax-withholding events?
Which 20 issuers had the most distinct insiders filing in the window?
How does that overlap with the list of issuers that filed an 8-K item 5.02 (officer changes) in the same window?


    import pandas as pd

    form4 = pd.read_csv("form4_insider_transactions.csv",
                        parse_dates=["filing_date", "transaction_date"],
                        dtype={"issuer_cik": str, "insider_cik": str})
    k8 = pd.read_csv("8k_material_events.csv",
                     parse_dates=["filing_date"],
                     # item_number as text so the "5.02" filter below matches
                     dtype={"company_cik": str, "item_number": str})

    window_end = form4["filing_date"].max()
    window_start = window_end - pd.Timedelta(days=30)
    f30 = form4[form4["filing_date"].between(window_start, window_end)]

    # 1. Total Form 4 transactions in window
    total = len(f30)

    # 2. Transaction-type mix
    mix = (f30["transaction_type_label"]
           .value_counts(normalize=True)
           .mul(100).round(1))

    # 3. Top 20 issuers by distinct insiders
    top_issuers = (f30.groupby("ticker")["insider_cik"]
                      .nunique().sort_values(ascending=False).head(20))

    # 4. Overlap with 8-K 5.02 in the same window
    k8_502 = k8[(k8["item_number"] == "5.02") &
                (k8["filing_date"].between(window_start, window_end))]
    overlap = set(top_issuers.index) & set(k8_502["ticker"].dropna())

    print(f"Total Form 4 transactions: {total}")
    print("Transaction mix (%):\n", mix, sep="")
    print("Top 20 by distinct insiders:\n", top_issuers, sep="")
    print("Issuers in both top-20 insiders AND 8-K 5.02:", overlap)

Every join in this script is just a column on the table — no XML parsing, no manual CIK padding, no per-filing edge cases. That is the entire reason the pack exists.

Working with the data in Excel

If you would rather stay in a spreadsheet, the pack includes sec_filings_data_pack.xlsx — a single workbook with one sheet per table, a frozen header row, and an autofilter on every column. Three things you can do immediately:

Filter by ticker. Open the Form 4 sheet, click the filter on the ticker column, and type a ticker into the search box to scope every insider transaction for that issuer.
Pull SIC and exchange via XLOOKUP. From any row in the Form 4 sheet, =XLOOKUP([@issuer_cik], CompanyMap[cik], CompanyMap[exchange]) attaches the listing exchange. Swap [exchange] for [sic_description] to label industries.
Pivot 8-K items. On the 8-K sheet, build a PivotTable with item_label as rows and ticker as columns to see which items each company is filing most often.
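If you move between Excel and Python, the PivotTable in the third tip has a close pandas equivalent. This is a minimal sketch on made-up rows that reuse the 8-K column names.

```python
import pandas as pd

# Made-up 8-K rows: one row per (filing, item).
k8 = pd.DataFrame({
    "ticker": ["EXTC", "EXTC", "EXMP", "EXTC"],
    "item_label": [
        "Other Events",
        "Financial Statements and Exhibits",
        "Other Events",
        "Other Events",
    ],
})

# Rows are item labels, columns are tickers, cells are row counts.
pivot = pd.crosstab(k8["item_label"], k8["ticker"])
print(pivot)
```

Combinations that never occur show up as 0 rather than blank, which differs slightly from Excel's default PivotTable display.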

Common gotchas when working with insider and institutional filings

Even with normalized data, a few well-known gotchas show up the first time you analyze SEC filings. They are not bugs in the data — they are properties of the filings themselves — so it helps to know about them up front:

Form 4 transaction codes are not all "buys" and "sells." Codes P and S are open-market purchases and sales, but A (grants), M (option exercises), F (tax withholding), G (gifts), and D (dispositions to the issuer) describe very different events. Always group by transaction_type_label before reading totals.
13F is long-only U.S. equity. 13Fs report long positions in 13F-eligible U.S. securities. Shorts, most fixed income, and many non-U.S. positions are not in scope. A "top holdings" view from 13F is not a full picture of a manager's book.
13F tickers are often blank. The 13F filing itself reports CUSIPs, not tickers. The ticker column in the 13F table is populated where a CUSIP-to-ticker join is available; otherwise it stays empty. Join through company_ticker_cik_map.csv when you need consistent symbology.
A single 8-K can carry several items. The pack expands one filing with three items into three rows, so totals by item are accurate, but counting rows overstates the number of filings. If you want one row per filing, collapse with groupby("filing_accession_number").
Amendments overlap with originals. source_form_type distinguishes 4 from 4/A, 8-K from 8-K/A, and 13F-HR from 13F-HR/A. An amendment is filed under its own accession number, so decide up front whether your analysis keeps original filings only, amendments only, or both.
Multiple share classes. Issuers with multiple classes (e.g. Class A vs Class B common) show up under separate security_title values on Form 4 and as separate CUSIPs on 13F.
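Several of these gotchas come down to one or two filter lines. The sketch below, on made-up rows, keeps original Form 4 filings only, restricts to open-market codes P and S, nets purchases against sales per ticker, and collapses an 8-K to one row per filing. Dropping amendments outright is a simplification chosen for the example, not a recommendation.

```python
import pandas as pd

# Made-up Form 4 rows that reuse the pack's column names.
form4 = pd.DataFrame({
    "ticker": ["EXMP", "EXMP", "EXMP", "EXTC"],
    "transaction_code": ["P", "S", "A", "S"],
    "transaction_value": [1000.0, 4000.0, 9000.0, 500.0],
    "source_form_type": ["4", "4", "4", "4/A"],
})

# Originals only, then open-market purchases and sales only.
originals = form4[form4["source_form_type"] == "4"]
open_market = originals[originals["transaction_code"].isin(["P", "S"])]

# Purchases count as positive dollars, sales as negative.
signed = open_market["transaction_value"].where(
    open_market["transaction_code"] == "P",
    -open_market["transaction_value"],
)
net_by_ticker = signed.groupby(open_market["ticker"]).sum()
print(net_by_ticker)

# Made-up 8-K rows: one filing that flags three items.
k8 = pd.DataFrame({
    "filing_accession_number": ["EXAMPLE-0003"] * 3,
    "item_number": ["2.02", "7.01", "9.01"],
})

# One row per filing, with the items joined into a single string.
per_filing = (
    k8.groupby("filing_accession_number")["item_number"]
    .agg(", ".join)
    .reset_index(name="items")
)
print(per_filing)
```

Note that the grant row (code A) has the largest dollar value in the toy data and is excluded from the net figure, which is exactly the distortion the first gotcha warns about.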

Try the free sample, or get the full pack

You can preview the schema and field layout without buying anything. The free sample page (sample preview activating with launch) shows 100-row excerpts of each file and the full data dictionary.

If the schema fits your workflow, the full Insider & Institutional Filings Data Pack (public launch this week — link activating shortly) is available two ways:

Snapshot — $49 one-time. All four files (CSV, JSON, and XLSX), the full data dictionary, source notes, and changelog. Use it for a one-off research project or to evaluate the format before subscribing.
Updates — $19/month. The snapshot plus a refreshed delivery on a regular cadence so your dataset keeps moving with EDGAR. Cancel any time.

The pack is aimed at retail quant hobbyists, independent investors, financial analysts, journalists, developers building market tools, small research shops, data science students, newsletter writers, and small family offices — anyone who wants the filings in a usable shape without spending a week on parsing.

Disclaimer

NexGenData provides structured public filing data sourced from SEC/EDGAR records. This dataset is provided for informational and research purposes only. It is not investment advice, financial advice, legal advice, trading advice, or a recommendation to buy, sell, or hold any security. Data may contain errors, omissions, delays, or parsing issues. Users should verify all material information against the original SEC filing before making any decision. NexGenData is not affiliated with, endorsed by, or sponsored by the U.S. Securities and Exchange Commission.

Source: U.S. Securities and Exchange Commission EDGAR public filings and related SEC data files. NexGenData is not affiliated with or endorsed by the SEC. Original filings should be consulted at SEC.gov for authoritative records.
