chonito7919

Posted on Oct 3

I Built a Tool to Parse SEC Dividend Data (And Actually Shipped It)

#python #beginners #opensource #career

I've been working in the electrical trade for years, but I've been teaching myself software development on the side. A few months ago, I had an idea: build a free, open-source tool to extract dividend data from SEC filings. There are paid services that do this, but they're expensive, and I thought, "How hard could it be?"

Turns out, pretty hard.

The Problem I Tried to Solve

If you want historical dividend data for US stocks, you have a few options:

1. Pay $50-500/month for services like SimFin or Intrinio
2. Scrape Yahoo Finance (legally questionable for commercial use)
3. Manually look up each company's investor relations page
4. Parse SEC filings yourself

I chose option 4.

The Journey (Or: How I Failed Before I Succeeded)

First attempt: HTML scraping. I wrote code to download 8-K, 10-K, and 10-Q filings and used an LLM to extract dividend amounts from the messy HTML.

Result: Complete garbage. The LLM extracted $3,910 as a dividend for Coca-Cola (it was reading the wrong table column). Processing took 50-60 seconds per company. This approach was dead on arrival.

Second attempt: Use XBRL data instead. Companies tag the financial data in their filings using a standardized format called XBRL, and the SEC exposes that tagged data through free JSON APIs. No HTML scraping, no LLM guessing, just parsing structured data.

This worked much better. Processing time dropped to ~3 seconds per company, and the data was mostly accurate.
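
A rough sketch of what pulling this data can look like. This is not DivScout's actual code. It assumes the SEC "company facts" endpoint and the us-gaap tag CommonStockDividendsPerShareDeclared; the function names and output field names are made up for illustration, and other companies may need other tags.

```python
import json
import urllib.request

# Assumptions: the SEC company facts endpoint and one us-gaap dividend tag.
# The real project may use a different endpoint or a list of tags.
COMPANY_FACTS_URL = "https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"
DIVIDEND_TAG = "CommonStockDividendsPerShareDeclared"


def fetch_company_facts(cik, user_agent):
    """Download the company facts JSON for one CIK (this makes a network call)."""
    request = urllib.request.Request(
        COMPANY_FACTS_URL.format(cik=cik),
        headers={"User-Agent": user_agent},  # the SEC asks clients to identify themselves
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_dividend_facts(company_facts, tag=DIVIDEND_TAG):
    """Return the raw dividend facts for one tag, or [] if the tag is absent."""
    concept = company_facts.get("facts", {}).get("us-gaap", {}).get(tag, {})
    facts = []
    for unit, entries in concept.get("units", {}).items():
        for entry in entries:
            facts.append({
                "amount": entry.get("val"),
                "unit": unit,
                "start": entry.get("start"),
                "end": entry.get("end"),
                "fiscal_year": entry.get("fy"),
                "fiscal_period": entry.get("fp"),
                "form": entry.get("form"),
            })
    return facts
```

Keeping the download and the parsing in separate functions means the parsing can be tested against a saved JSON file without hitting the SEC's servers.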

The Real Challenge: Data Quality

Here's what I learned about parsing SEC data that nobody tells you:

Companies report annual totals alongside quarterly dividends. Target, for example, reports both its $1.10 quarterly dividend AND a $4.38 annual total (the sum of all 4 quarters) in the same filing, with the same XBRL tags. My parser had to figure out which was which.

Every company files differently. Some use one XBRL tag, others use a different tag. Some report fiscal quarters cleanly, others don't. There's no standard for "this is a sum" vs "this is a single payment."
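
One way to tell a single payment from a sum is to look at how long the reporting period is. The sketch below assumes each fact carries ISO-formatted start and end dates; the function names and day ranges are illustrative guesses, not DivScout's actual thresholds.

```python
from datetime import date


def period_days(start, end):
    """Length of a reporting period in days, from ISO date strings."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def classify_period(start, end):
    """Label a fact 'quarterly', 'annual', or 'unknown' by its duration.

    The day ranges are hypothetical and leave room for fiscal calendars
    that do not line up with calendar quarters.
    """
    if not start or not end:
        return "unknown"
    days = period_days(start, end)
    if 80 <= days <= 100:
        return "quarterly"
    if 350 <= days <= 380:
        return "annual"
    return "unknown"
```

Anything that lands in "unknown" (a six-month period, or a fact with a missing date) is a natural candidate for manual review rather than a guess.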

Perfect accuracy is impossible. After weeks of work, I got the parser to about 85-90% accuracy. The remaining 10-15% needs manual review. I had to accept this.

The Solution: Confidence Scoring

Instead of trying to achieve 100% accuracy, I built a confidence scoring system. Each dividend gets scored 0.0-1.0 based on:

Is the amount reasonable? (Too high = probably an annual total)
What's the period duration? (365 days = annual, not quarterly)
How does it compare to other dividends for this company? (4× the median = suspicious)
Is there proper metadata? (Missing fiscal quarter = lower confidence)

Anything scoring below 0.8 gets flagged for review.
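
The four checks above can be sketched as a single scoring function. Only the 0.8 review threshold comes from this article. The penalty sizes, the $10 per-share ceiling, and the function names are assumptions for illustration. The median multiplier is set to 3.5 rather than exactly 4 so that a sum of four slightly different payments still trips the check.

```python
from statistics import median

REVIEW_THRESHOLD = 0.8  # from the article: anything below this gets flagged


def confidence_score(amount, period_days, company_amounts, fiscal_period):
    """Score one dividend from 0.0 to 1.0. All penalty sizes are hypothetical."""
    score = 1.0

    # Check 1: is the amount reasonable? (assumed per-share ceiling)
    if amount <= 0 or amount > 10.0:
        score -= 0.3

    # Check 2: a period close to a full year suggests an annual total.
    if period_days is not None and period_days >= 350:
        score -= 0.4

    # Check 3: compare against the company's typical payment.
    typical = median(company_amounts) if company_amounts else None
    if typical and amount >= 3.5 * typical:
        score -= 0.3

    # Check 4: missing fiscal quarter metadata lowers confidence.
    if not fiscal_period:
        score -= 0.1

    return max(0.0, round(score, 2))


def needs_review(score):
    """True when a score falls below the review threshold."""
    return score < REVIEW_THRESHOLD
```

With amounts like the Target example (several payments around $1.10 plus a $4.38 entry covering a full year), the $4.38 entry loses points on both the period check and the median check and falls below 0.8, while the quarterly payments keep a score of 1.0.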

Test results:

Johnson & Johnson: 52 dividends, 100% confidence, zero manual review needed
Apple: 46 dividends, 96% average confidence, 2 flagged from 2012
Target: 65 dividends, 79% average confidence, 15 annual totals correctly flagged

The system works.

What I Learned

1. Perfect is the enemy of done. I spent weeks trying to get 100% accuracy. I was debugging edge cases for companies I'd never heard of. Eventually I realized: 85-90% automatic + flagging the rest is good enough. Ship it.

2. Data quality is never 100%. Even Bloomberg and FactSet have errors. The difference is they have teams of people verifying data. For a solo project with zero budget, confidence scoring + review workflow is the realistic solution.

3. Actually shipping something feels different from planning to ship something. I've started and abandoned dozens of projects. This is the first one I've actually pushed to GitHub with documentation, tests, and a proper license.

The Tech Stack

Language: Python 3.8+
Database: PostgreSQL (confidence scores, audit trails, review workflow)
Data Source: SEC EDGAR XBRL JSON APIs (official, free, no scraping)
License: Apache 2.0 (free to use, modify, even commercially)

The code is on GitHub: DivScout

What's Next?

Honestly? I don't know. The project is shipped. It works for what it does. Maybe I'll add more features. Maybe I'll use it to build something else. Maybe it just sits there as proof I can finish something.

For now, I'm just glad I shipped.

For other people learning to code while working full-time: You don't need a perfect project. You need a finished project. Even if it's 85% accurate. Even if only 5 people look at it. Even if it's not as good as the commercial alternatives.

Shipping something imperfect beats planning something perfect forever.

Top comments (1)

chonito7919 • Oct 4

Update: I realized I forgot to include the actual links to the project!

🔗 Live App: DivScout.app - The working dividend tracker with data for AAPL, JNJ, and TGT

🔗 Project Site: DivScout.com - Documentation, architecture overview, and technical details

The app and project site are running on Namecheap hosting with a PostgreSQL database on DigitalOcean. Both the parser and web interface code are open source on GitHub (links in the article). Thanks for reading!
