If you have ever tried to pull SEC filings as CSV straight from EDGAR, you already know the story: the data is public, but it is anything but ready to analyze. Form 4 insider transactions live in nested XML. 13F holdings sit inside information tables with their own schema. 8-K filings come as a header, an index, and a stack of exhibits. CIKs and tickers do not join cleanly. Field names drift between filers. And every step of the pipeline assumes you already know the quirks of the EDGAR data model.
This guide walks through what is actually inside Form 4 data, 13F holdings data, and 8-K material event data, why raw EDGAR is painful for everyday research workflows, and how to load a clean, normalized version of the three filings as CSV in Python or Excel in a couple of minutes. At the end, there is a free sample download and a link to the full Insider & Institutional Filings Data Pack if you want the full dataset.
Why SEC filings are useful research data
The SEC publishes one of the largest, longest-running, machine-readable corpora of corporate behavior in the world. For analysts, journalists, developers, students, and independent researchers, three filings show up over and over again:
- Form 4: every time a director, officer, or 10% beneficial owner buys, sells, is granted, exercises, or otherwise transacts in their own company's securities, they have to file a Form 4 within two business days. That gives you a near-real-time record of insider activity at every U.S. public company.
- 13F-HR: institutional investment managers that exercise investment discretion over at least $100 million in Section 13(f) securities file a 13F every quarter listing their long U.S. equity positions. It is the standard window into what large funds hold.
- 8-K: public companies file an 8-K within four business days of a "material event" such as an earnings release, executive change, M&A activity, bankruptcy, or delisting. Each filing flags one or more numbered items (2.02, 5.02, 1.01, and so on) that label what happened.
Used as a research dataset, these three filings answer practical questions: How many insider sales happened in technology last week? Which managers added to a specific CUSIP between two quarters? How many 8-Ks of item 5.02 (departure of officers) were filed by mid-cap companies in the last six months? Once the data is normalized, you can answer those questions in a couple of SQL lines or a few pandas cells.
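As a rough illustration of the second question, the sketch below finds managers that increased their share count in one CUSIP between two report periods. The rows are made-up data, the CUSIP is a placeholder, and `managers_who_added` is a hypothetical helper name; the column names follow the 13F table described later in this guide.

```python
import pandas as pd

# Made-up holdings for illustration only; not real filings.
holdings = pd.DataFrame({
    "manager_name": ["ALPHA FUND", "ALPHA FUND", "BETA FUND", "BETA FUND", "GAMMA FUND"],
    "cusip": ["000000AA1"] * 5,
    "report_period": ["2025-09-30", "2025-12-31", "2025-09-30", "2025-12-31", "2025-12-31"],
    "shares": [1000, 1500, 2000, 1800, 700],
})


def managers_who_added(df, cusip, prev_period, curr_period):
    """Return the share change per manager for one CUSIP, keeping only increases."""
    one = df[df["cusip"] == cusip]
    wide = one.pivot_table(index="manager_name", columns="report_period",
                           values="shares", aggfunc="sum", fill_value=0)
    # Assumption: a manager with no row in a period is treated as holding zero shares.
    wide = wide.reindex(columns=[prev_period, curr_period], fill_value=0)
    change = wide[curr_period] - wide[prev_period]
    return change[change > 0].sort_values(ascending=False)


added = managers_who_added(holdings, "000000AA1", "2025-09-30", "2025-12-31")
print(added)
```

Treating a missing row as a zero position is a simplification: a manager can also disappear from a quarter because it stopped filing, so check the filing history before reading a new row as a fresh purchase.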
For more on the broader category, see our Financial Data Tools and Market Intelligence Tools hubs.
Why raw EDGAR is painful for everyday work
EDGAR is excellent as an archive. It is not designed as an analytics warehouse. A few of the friction points you hit the first time you try to roll your own pipeline:
- Inconsistent parsing. Form 4 documents are XML, but field placement, optional sub-elements, and footnotes vary across filers and filing agents. A naive parser breaks on the first non-standard filing.
- CIK / ticker join. CIKs are the canonical key inside EDGAR. Tickers are the canonical key outside it. The mapping changes (ticker changes, mergers, multiple share classes), and SEC's own company_tickers.json covers only currently-listed issuers.
- 13F item-by-item assembly. Each 13F filing references one or more information tables. You have to crawl the filing index, find the table XML, parse rows that include sole/shared/no voting authority breakdowns, and re-join everything back to the manager CIK.
- 8-K item parsing. A single 8-K can flag multiple items in its header. The "real" event labels (e.g. "Results of Operations and Financial Condition" for item 2.02) live in the SEC's instructions, not in the filing itself.
- Dedup and amendments. Amendments (Form 4/A, 8-K/A, 13F-HR/A) overlap with originals, and "the same" transaction can appear in multiple filings.
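- Rate limits and headers. EDGAR expects a User-Agent header that identifies you with contact details, and its fair-access guidance caps automated traffic at 10 requests per second. A few hundred parallel requests will get you throttled or temporarily blocked.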
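Two of the friction points above, CIK formatting and request pacing, can be sketched in a few lines. This is a minimal illustration, not a full EDGAR client: `pad_cik` and `Throttle` are hypothetical helper names, the contact string is a placeholder, and no network request is made. The 10-requests-per-second ceiling is the figure SEC publishes at the time of writing, so verify it before relying on it.

```python
import time

# Placeholder contact string; replace with your own name and email address.
HEADERS = {"User-Agent": "Example Research admin@example.com"}


def pad_cik(cik):
    """Normalize a CIK given as int or string to the 10-digit zero-padded form."""
    return str(int(cik)).zfill(10)


class Throttle:
    """Space successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


throttle = Throttle(max_per_second=5)  # stay well under the published ceiling
for cik in [320193, "789019"]:
    throttle.wait()
    print(pad_cik(cik))  # a real client would send its HTTP request here
```

The clock and sleep functions are injectable so the pacing logic can be checked without real delays. A production pipeline would also need retries and backoff, which this sketch leaves out.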
For one-off research it is doable. For ongoing analysis, every team eventually rebuilds the same parser. That is the niche this data pack fills: a normalized, ready-to-analyze version of the three filings, joined to a fresh CIK / ticker / exchange map, so you can skip the parsing and go straight to the question you actually care about.
What fields are included
The pack ships four data files. Each one is a flat table — one row per transaction, holding, or 8-K item.
Form 4 — insider transactions (form4_insider_transactions.csv)
- filing_accession_number, filing_url, filing_date, period_of_report
- issuer_cik, issuer_name, ticker
- insider_cik, insider_name, insider_role, is_director, is_officer, is_ten_percent_owner
- transaction_date, transaction_code, transaction_type_label (Purchase, Sale, Grant/Award, Option Exercise, Tax Withholding, Disposition, Gift, …)
- security_title, shares, price_per_share, transaction_value
- acquired_or_disposed (A / D), shares_owned_after, ownership_type (Direct / Indirect)
- source_form_type
13F-HR — institutional holdings (13f_institutional_holdings.csv)
- filing_accession_number, filing_url, filing_date, report_period
- manager_cik, manager_name
- issuer_name, ticker, cusip
- shares, market_value, put_call
- investment_discretion (SOLE / DFND / OTR), voting_authority_sole, voting_authority_shared, voting_authority_none
- source_form_type
8-K — material events (8k_material_events.csv)
One row per (filing, item). A single 8-K with three items produces three rows.
- filing_accession_number, filing_url, filing_date, event_date
- company_cik, company_name, ticker
- item_number (e.g. 2.02, 5.02), item_label, short_event_summary
- source_form_type (8-K or 8-K/A)
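To make the one-row-per-(filing, item) layout concrete, here is a small sketch that counts items directly and then collapses the same rows to one row per filing. The accession numbers and tickers are made up, and only three of the columns listed above are used.

```python
import pandas as pd

# Made-up rows in the one-row-per-(filing, item) layout; not real filings.
k8 = pd.DataFrame({
    "filing_accession_number": ["EX-1", "EX-1", "EX-1", "EX-2"],
    "ticker": ["EXTC", "EXTC", "EXTC", "EXMP"],
    "item_number": ["2.02", "5.02", "9.01", "5.02"],  # kept as text, not floats
})

# Item-level view: each row is one flagged item, so a plain count is correct.
items_per_code = k8["item_number"].value_counts()

# Filing-level view: one row per accession number, with its items joined.
per_filing = (
    k8.groupby("filing_accession_number")
      .agg(
          ticker=("ticker", "first"),
          items=("item_number", lambda s: ", ".join(sorted(s))),
          n_items=("item_number", "count"),
      )
      .reset_index()
)
print(items_per_code)
print(per_filing)
```

Which view to use depends on the question: item counts belong on the long table, while "how many filings did this company make" belongs on the collapsed one.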
Company / ticker / CIK map (company_ticker_cik_map.csv)
- cik (10-digit, zero-padded), ticker, company_name
- exchange, entity_type, sic, sic_description
Every column, every transaction-code mapping, and every 8-K item label is documented in DATA_DICTIONARY.md that ships with the pack.
What the rows actually look like
The rows below are illustrative examples showing the column layout and field types. They are not real filings — verify any specific transaction against the original SEC document before using it in any decision.
Sample Form 4 row
filing_accession_number, filing_date, issuer_cik, issuer_name, ticker, insider_name, insider_role, transaction_date, transaction_code, transaction_type_label, shares, price_per_share, transaction_value, acquired_or_disposed
EXAMPLE-0001-26-000123, 2026-04-12, 0000320193, EXAMPLE COMPANY INC, EXMP, Doe Jane A., Officer (Chief Financial Officer), 2026-04-11, S, Sale, 2500, 187.42, 468550.00, D
Sample 13F-HR row
filing_accession_number, filing_date, report_period, manager_cik, manager_name, issuer_name, cusip, shares, market_value, investment_discretion, voting_authority_sole
EXAMPLE-0002-26-000456, 2026-02-14, 2025-12-31, 0000102909, EXAMPLE CAPITAL PARTNERS LLC, EXAMPLE HOLDINGS CORP, 30303M102, 412300, 89221150, SOLE, 412300
Sample 8-K row
filing_accession_number, filing_date, event_date, company_cik, company_name, ticker, item_number, item_label
EXAMPLE-0003-26-000789, 2026-03-04, 2026-03-03, 0000789019, EXAMPLE TECH INC, EXTC, 5.02, Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers
Loading the CSVs in Python (pandas)
The files are flat UTF-8 CSVs with a header row. You can load all four into pandas, join them on CIK, and get a working dataframe in under a minute:
import pandas as pd
form4 = pd.read_csv(
    "form4_insider_transactions.csv",
    parse_dates=["filing_date", "transaction_date", "period_of_report"],
    dtype={"issuer_cik": str, "insider_cik": str},
)
f13 = pd.read_csv(
    "13f_institutional_holdings.csv",
    parse_dates=["filing_date", "report_period"],
    dtype={"manager_cik": str, "cusip": str},
)
k8 = pd.read_csv(
    "8k_material_events.csv",
    parse_dates=["filing_date", "event_date"],
    # Read item_number as text; parsed as a float it would never equal "5.02"
    dtype={"company_cik": str, "item_number": str},
)
cmap = pd.read_csv(
    "company_ticker_cik_map.csv",
    dtype={"cik": str, "sic": str},
)
# 1. Join Form 4 transactions to the exchange / SIC map
form4_enriched = form4.merge(
    cmap.rename(columns={"cik": "issuer_cik"}),
    on="issuer_cik",
    how="left",
    suffixes=("", "_map"),
)
# 2. Most-active insider tickers in the window
print(form4["ticker"].value_counts().head(10))
# 3. 8-K Item 5.02 (officer changes) in the last 30 days
recent_502 = k8[
    (k8["item_number"] == "5.02")
    & (k8["filing_date"] >= k8["filing_date"].max() - pd.Timedelta(days=30))
]
print(recent_502.groupby("ticker").size().sort_values(ascending=False).head())
# 4. Top managers by reported market value in the latest quarter
#    (filter to the most recent report_period so quarters are not summed together)
latest_13f = f13[f13["report_period"] == f13["report_period"].max()]
print(
    latest_13f.groupby("manager_name")["market_value"]
    .sum().sort_values(ascending=False).head(10)
)
All four files share a consistent date format (YYYY-MM-DD) and zero-padded 10-digit CIKs, so joins are straightforward. The pack also ships a .json twin of each table if you prefer to ingest from a document store.
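The `dtype` arguments in the loading code are what keep the zero-padded keys intact. As a small demonstration, the sketch below reads two made-up rows in the company map layout with and without them; the CSV text is invented for illustration.

```python
import io

import pandas as pd

# Two made-up rows in the company map layout; not real data.
raw = "cik,ticker,sic\n0000001234,EXMP,0100\n0000056789,EXTC,7372\n"

as_numbers = pd.read_csv(io.StringIO(raw))
as_strings = pd.read_csv(io.StringIO(raw), dtype={"cik": str, "sic": str})

print(as_numbers["cik"].tolist())  # integers: the leading zeros are gone
print(as_strings["cik"].tolist())  # text: the 10-character keys survive

# If a CIK column was already read as integers, it can be re-padded.
repaired = as_numbers["cik"].astype(str).str.zfill(10)
print(repaired.tolist())
```

A merge between an integer CIK column and a string one does not line up, so it is worth checking the column dtypes on both sides before joining.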
Worked example: a 30-day insider activity report
To make the dataset concrete, here is a short worked example. The goal is a one-page summary that answers four questions about insider activity in the last 30 days:
- How many Form 4 transactions were filed in total?
- What share of them were open-market purchases versus open-market sales versus grants and tax-withholding events?
- Which 20 issuers had the most distinct insiders filing in the window?
- How does that overlap with the list of issuers that filed an 8-K item 5.02 (officer changes) in the same window?
import pandas as pd
form4 = pd.read_csv(
    "form4_insider_transactions.csv",
    parse_dates=["filing_date", "transaction_date"],
    dtype={"issuer_cik": str, "insider_cik": str},
)
k8 = pd.read_csv(
    "8k_material_events.csv",
    parse_dates=["filing_date"],
    # item_number must stay text, or the "5.02" comparison below never matches
    dtype={"company_cik": str, "item_number": str},
)
# Window: the 30 days ending at the latest filing date in the Form 4 file
window_end = form4["filing_date"].max()
window_start = window_end - pd.Timedelta(days=30)
f30 = form4[form4["filing_date"].between(window_start, window_end)]
# 1. Total Form 4 transactions in window
total = len(f30)
# 2. Transaction-type mix, as a percentage of rows
mix = (
    f30["transaction_type_label"]
    .value_counts(normalize=True)
    .mul(100).round(1)
)
# 3. Top 20 issuers by distinct insiders
top_issuers = (
    f30.groupby("ticker")["insider_cik"]
    .nunique().sort_values(ascending=False).head(20)
)
# 4. Overlap with 8-K 5.02 in the same window
k8_502 = k8[
    (k8["item_number"] == "5.02")
    & (k8["filing_date"].between(window_start, window_end))
]
overlap = set(top_issuers.index) & set(k8_502["ticker"].dropna())
print(f"Total Form 4 transactions: {total}")
print("Transaction mix (%):\n", mix, sep="")
print("Top 20 by distinct insiders:\n", top_issuers, sep="")
print("Issuers in both top-20 insiders AND 8-K 5.02:", overlap)
Every filter and grouping in this script works on a plain column: no XML parsing, no manual CIK padding, no per-filing edge cases. That is the entire reason the pack exists.
Working with the data in Excel
If you would rather stay in a spreadsheet, the pack includes sec_filings_data_pack.xlsx — a single workbook with one sheet per table, a frozen header row, and an autofilter on every column. Three things you can do immediately:
- Filter by ticker. Open the Form 4 sheet, click the filter on the ticker column, and type a ticker into the search box to scope every insider transaction for that issuer.
- Pull SIC and exchange via XLOOKUP. From any row in the Form 4 sheet, =XLOOKUP([@issuer_cik], CompanyMap[cik], CompanyMap[exchange]) attaches the listing exchange. Swap [exchange] for [sic_description] to label industries.
- Pivot 8-K items. On the 8-K sheet, build a PivotTable with item_label as rows and ticker as columns to see which items each company is filing most often.
Common gotchas when working with insider and institutional filings
Even with normalized data, a few well-known gotchas show up the first time you analyze SEC filings. They are not bugs in the data — they are properties of the filings themselves — so it helps to know about them up front:
- Form 4 transaction codes are not all "buys" and "sells." Codes P and S are open-market purchases and sales, but A (grants), M (option exercises), F (tax withholding), G (gifts), and D (dispositions to the issuer) describe very different events. Always group by transaction_type_label before reading totals.
- 13F is long-only U.S. equity. 13Fs report long positions in 13F-eligible U.S. securities. Shorts, most fixed income, and many non-U.S. positions are not in scope. A "top holdings" view from 13F is not a full picture of a manager's book.
- 13F tickers are often blank. The 13F filing itself reports CUSIPs, not tickers. The ticker column in the 13F table is populated where a CUSIP-to-ticker join is available; otherwise it stays empty. Join through company_ticker_cik_map.csv when you need consistent symbology.
- 8-K items can repeat in one filing. The pack expands one filing with three items into three rows, so totals by item are accurate, but if you want one row per filing you can collapse with groupby("filing_accession_number").
- Amendments overlap with originals. source_form_type distinguishes 4 from 4/A, 8-K from 8-K/A, and 13F-HR from 13F-HR/A. Decide up front whether your analysis uses the latest amendment per accession or the original filing only.
- Multiple share classes. Issuers with multiple classes (e.g. Class A vs Class B common) show up under separate security_title values on Form 4 and as separate CUSIPs on 13F.
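The first and fifth gotchas can be handled with two filters before any totals are computed. The sketch below uses made-up Form 4 rows and one possible policy (original filings only, open-market codes only); it is an illustration of the idea, not the only reasonable choice.

```python
import pandas as pd

# Made-up Form 4 rows; not real filings.
form4 = pd.DataFrame({
    "ticker": ["EXMP", "EXMP", "EXMP", "EXMP", "EXTC"],
    "transaction_code": ["P", "S", "A", "F", "S"],
    "transaction_value": [10000.0, 4000.0, 0.0, 2500.0, 7000.0],
    "source_form_type": ["4", "4", "4", "4", "4/A"],
})

# Policy assumption: keep original filings only and ignore amendments.
originals = form4[form4["source_form_type"] == "4"]

# Restrict to open-market trades so grants and tax withholding do not skew totals.
open_market = originals[originals["transaction_code"].isin(["P", "S"])].copy()

# Signed value: purchases count as positive, sales as negative.
sign = open_market["transaction_code"].map({"P": 1, "S": -1})
open_market["signed_value"] = open_market["transaction_value"] * sign

net_by_ticker = open_market.groupby("ticker")["signed_value"].sum()
print(net_by_ticker)
```

Dropping amendments outright is the simplest rule, but it also discards corrections. An alternative is to keep the amendment and drop the original it replaces, which needs a way to link the two filings.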
Try the free sample, or get the full pack
You can preview the schema and field layout without buying anything. The free sample page (sample preview activating with launch) shows 100-row excerpts of each file and the full data dictionary.
If the schema fits your workflow, the full Insider & Institutional Filings Data Pack (public launch this week — link activating shortly) is available two ways:
- Snapshot — $49 one-time. All four files (CSV, JSON, and XLSX), the full data dictionary, source notes, and changelog. Use it for a one-off research project or to evaluate the format before subscribing.
- Updates — $19/month. The snapshot plus a refreshed delivery on a regular cadence so your dataset keeps moving with EDGAR. Cancel any time.
The pack is aimed at retail quant hobbyists, independent investors, financial analysts, journalists, developers building market tools, small research shops, data science students, newsletter writers, and small family offices — anyone who wants the filings in a usable shape without spending a week on parsing.
Related reading
- Financial Data Tools — broader coverage of the financial-data tooling space
- Market Intelligence Tools — tools and datasets used by market-intelligence teams
- Insider & Institutional Filings Data Pack (public launch this week — link activating shortly) — the full pack page
- Sample preview (sample preview activating with launch) — schema and 100 rows from each file, free
Disclaimer
NexGenData provides structured public filing data sourced from SEC/EDGAR records. This dataset is provided for informational and research purposes only. It is not investment advice, financial advice, legal advice, trading advice, or a recommendation to buy, sell, or hold any security. Data may contain errors, omissions, delays, or parsing issues. Users should verify all material information against the original SEC filing before making any decision. NexGenData is not affiliated with, endorsed by, or sponsored by the U.S. Securities and Exchange Commission.
Source: U.S. Securities and Exchange Commission EDGAR public filings and related SEC data files. NexGenData is not affiliated with or endorsed by the SEC. Original filings should be consulted at SEC.gov for authoritative records.