Parsing 94 district courts' worth of PDFs. It went about as well as you'd expect.
We started with a simple problem. My co-founder is a litigator. I am an engineer. He was spending hours every week digging through PDFs on federal court websites to find judge-specific rules, local filing procedures, and deadline information. The PDFs were inconsistently formatted. Some were searchable text. Some were scanned images from 1995. Some were updated quarterly. Some had not changed in five years.
He asked me if I could build something to aggregate it all. I said sure. How hard could it be?
Famous last words.
The Scope Problem
There are 94 U.S. federal district courts. Each has local rules. Many of those local rules span hundreds of pages. Then there are the individual judge practices, chamber-specific documents that live on each judge's personal page on the court website. Some judges have them. Some do not. Some courts organize them logically. Some bury them under four layers of navigation.
Our first attempt at scope was naive. We thought we would scrape the rules, parse the PDFs, and build a searchable index. We did not realize that "scrape the rules" meant writing 94 different scrapers, each handling a different CMS, different URL structure, different PDF format, and different update cadence.
The Southern District of New York alone has over two dozen active district judges. Each has an individual practices PDF. Some update them every few months. The URL changes when they update. The filename changes. Sometimes the PDF is replaced inline with no versioning. Tracking changes across that surface area is a full-time job.
PDF Parsing: The Real Enemy
PDF parsing is a solved problem in computer science the same way sailing is a solved problem. The theory is straightforward. The practice involves a lot of cursing and unexpected water.
We started with Python and PyPDF2. It worked on maybe 30 percent of documents. The rest were image-based PDFs with no text layer. For those we added OCR using Tesseract. That got us to about 60 percent. But OCR on legal documents is messy. Footnotes confuse it. Tables break it. Multi-column layouts destroy it.
Then we found a whole category of PDFs that were text-based but had broken encoding. The text extracted as gibberish. The visual render was fine. The underlying text was corrupted. These were usually documents that had been printed, scanned, OCR'd, and re-saved multiple times over a decade. Each step introduced more noise.
Our current pipeline uses a hybrid approach. For text-based PDFs, we use pdfplumber, which handles tables and layout better than PyPDF2. For image-based PDFs, we use Tesseract with layout analysis. For the broken encoding edge cases, we fall back to full-page OCR of the rendered image, which is slower but more reliable than trusting the text layer.
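For the curious, the dispatch logic boils down to something like this. It is a simplified sketch rather than our production code: the gibberish heuristic is illustrative, and the layout analysis step is omitted entirely.

```python
# Simplified hybrid dispatch: trust the text layer if it looks sane, otherwise OCR
# the rendered pages. The 0.7 threshold is an illustrative heuristic, not a tuned value.
import pdfplumber
import pytesseract
from pdf2image import convert_from_path  # requires poppler installed

def looks_like_gibberish(text: str) -> bool:
    """Crude check for broken text layers: too few alphanumeric characters."""
    if not text.strip():
        return True
    alnum = sum(c.isalnum() or c.isspace() for c in text)
    return alnum / len(text) < 0.7

def extract_text(path: str) -> str:
    # First try the embedded text layer.
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    if not looks_like_gibberish(text):
        return text
    # Fall back to full-page OCR of the rendered pages (image-based or corrupted PDFs).
    images = convert_from_path(path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```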
Even with all of that, we still manually review a percentage of documents. Some things are just faster for a human to verify.
Normalizing Judge Data
Court websites do not have APIs. They have HTML. Sometimes very old HTML.
We built a judge directory covering every active federal district judge. For each judge, we track name, court, appointment date, chamber location, and links to individual practices documents. This sounds simple. It is not.
Some courts list judges on a single page with clean semantic HTML. Others use a table layout from 2003. Others require clicking through to a bio page, then finding a "chambers" tab, then downloading a PDF from a link that has no predictable structure.
We ended up with a per-court configuration system. Each court gets a scraper definition file (an example is sketched below the list) that specifies:
- The URL pattern for the judges list page
- The CSS selector or XPath to extract judge names and profile links
- The pattern for individual practices document URLs
- The expected update frequency
- A human reviewer assignment
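Here is roughly what one of those definition files looks like. The field names and values are illustrative, not our actual schema:

```python
# Illustrative per-court scraper definition; the selector, URL pattern, and reviewer
# are made up for this example rather than copied from our real SDNY config.
SDNY_CONFIG = {
    "court_id": "nysd",
    "judges_list_url": "https://www.nysd.uscourts.gov/judges",   # judges list page
    "judge_link_selector": "table.judges a",                      # CSS selector for names and profile links
    "practices_url_pattern": r"individual[-_ ]practices.*\.pdf",  # regex matched against links on each profile
    "check_frequency": "daily",                                   # expected update cadence
    "reviewer": "alex",                                           # human reviewer assignment
}
```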
This lets us add new courts without rewriting the entire system. It also means we have 94 configuration files, each slightly different, each requiring maintenance when the court redesigns its website.
Which happens more often than you would think.
The Deadline Calculator
The most technically interesting part of the project is the deadline calculator. Federal court deadlines depend on:
- The triggering event (service, order entry, filing)
- The rule that governs the period (Rule 12, Rule 56, Rule 59, etc.)
- The type of days (calendar days, business days, court days)
- Federal holidays
- Local court closures
- Whether the last day falls on a weekend or holiday
We built a rules engine that encodes the Federal Rules of Civil Procedure and the Federal Rules of Appellate Procedure. The engine takes a trigger date, a rule reference, and a court identifier. It returns the calculated deadline with an explanation of each step.
The tricky part is local variations. Some districts have local rules that modify the default federal periods. Some judges have standing orders that affect scheduling. Some courts close for local events that are not federal holidays. We maintain a per-court overrides file that layers local rules on top of the federal defaults.
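Stripped way down, the layering looks something like this. The two periods shown are standard ones (21 days to answer under Rule 12(a)(1)(A)(i), 28 days for a Rule 59(b) motion); the override table and the closure check are placeholders for the per-court layer, and real Rule 6(a) counting has more cases than this sketch handles.

```python
# Minimal sketch of federal defaults plus per-court overrides plus Rule 6(a)-style rolling.
from datetime import date, timedelta

FEDERAL_PERIODS = {"FRCP 12(a)(1)(A)(i)": 21, "FRCP 59(b)": 28}   # days after the trigger
LOCAL_OVERRIDES: dict[tuple[str, str], int] = {}                   # e.g. {("courtid", "rule"): days}

def is_court_closed(d: date, court: str) -> bool:
    """Weekend check only; the real version consults the OPM calendar plus local closures."""
    return d.weekday() >= 5

def compute_deadline(trigger: date, rule: str, court: str) -> tuple[date, list[str]]:
    days = LOCAL_OVERRIDES.get((court, rule), FEDERAL_PERIODS[rule])
    steps = [f"{rule}: {days} days from {trigger.isoformat()}"]
    deadline = trigger + timedelta(days=days)          # count every day, include the last
    while is_court_closed(deadline, court):            # if the last day is closed, roll forward
        deadline += timedelta(days=1)
        steps.append(f"last day closed, rolled to {deadline.isoformat()}")
    return deadline, steps
```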
For holidays, we integrate the OPM federal holiday calendar. But we also track observed holidays separately from calendar holidays. When New Year's Day falls on a Saturday, the observed holiday is Friday. When it falls on Sunday, it is Monday. The calculator handles this correctly, which sounds trivial until you realize how many generic date libraries get it wrong.
Inauguration Day is its own special case. It is a federal holiday only in the Washington D.C. area. Our calculator knows that and applies it only to D.D.C. and D.C. Circuit deadlines. Everywhere else, January 20 is a normal day.
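In code, the observed-holiday shift and the Inauguration Day carve-out look roughly like this. The Saturday/Sunday shifts are the standard OPM convention; the court identifier is illustrative, only New Year's Day is shown, and Sunday inaugurations are a further edge case not handled here.

```python
# Observed vs. calendar holidays, plus the D.C.-only Inauguration Day case.
from datetime import date, timedelta

def observed(holiday: date) -> date:
    """Saturday holidays are observed the Friday before, Sunday holidays the Monday after."""
    if holiday.weekday() == 5:                 # Saturday
        return holiday - timedelta(days=1)
    if holiday.weekday() == 6:                 # Sunday
        return holiday + timedelta(days=1)
    return holiday

def court_holidays(year: int, court: str) -> set[date]:
    days = {observed(date(year, 1, 1)),        # New Year's Day
            observed(date(year + 1, 1, 1))}    # next New Year's can shift back into December
    # ...the real table adds every other OPM holiday plus local closures...
    if court == "dcd" and year % 4 == 1:       # Inauguration Day, D.C.-area courts only
        days.add(date(year, 1, 20))
    return {d for d in days if d.year == year}

# observed(date(2022, 1, 1)) -> 2021-12-31  (Saturday, observed the Friday before)
# observed(date(2023, 1, 1)) -> 2023-01-02  (Sunday, observed the Monday after)
```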
Architecture
The stack is intentionally boring. We use Node.js for the API, PostgreSQL for structured data, and Elasticsearch for full-text search across rules documents. The frontend is React. Nothing exotic.
Documents are stored in S3. Parsed text is indexed in Elasticsearch with metadata tags for court, judge, document type, and effective date. We version every document. When a judge updates their individual practices, we keep the old version and mark the new one as current. This lets us show change history and answer questions like "what did Judge Liman's practices say in 2023?"
The scrapers run on a schedule. Most courts are checked weekly. High-update courts, like SDNY, are checked daily. The scraper jobs are idempotent. If nothing changed, nothing changes. If a PDF was updated, it gets downloaded, parsed, indexed, and queued for human review if the diff is significant.
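The change detection is nothing fancy. Reduced to an in-memory sketch (the real job writes to Postgres, S3, and Elasticsearch, and the field names here are illustrative), it is a content hash plus a version flag:

```python
# Idempotent update step: a document is only re-parsed and re-indexed when its bytes changed,
# and every superseded version is kept for change history.
import hashlib

versions: dict[str, list[dict]] = {}   # doc_key -> version records, newest last

def upsert_document(doc_key: str, pdf_bytes: bytes, metadata: dict) -> bool:
    """Returns True if a new version was stored, False if the PDF was unchanged."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    history = versions.setdefault(doc_key, [])
    if history and history[-1]["sha256"] == digest:
        return False                              # unchanged: nothing to parse, index, or review
    for old in history:
        old["is_current"] = False                 # keep old versions, clear their current flag
    history.append({"sha256": digest, "is_current": True, **metadata})
    return True
```

Running the same job twice against an unchanged PDF is a no-op, which is what lets us schedule the high-update courts daily without churning the index.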
We use GitHub Actions for CI and deployment. The whole thing runs on AWS. Total monthly infrastructure cost is under $200. The biggest expense is the human review time, not the servers.
What We Got Wrong
Our first version tried to auto-detect document type from PDF content. That failed badly. A document titled "Individual Practices" might actually be a standing order. A "Local Rules" PDF might contain only criminal rules. We now require manual classification for new document types. It is slower. It is also correct.
We also underestimated the maintenance burden. Court websites change. Judges retire. New judges get appointed. PDFs get reorganized. We built a system that is 80 percent automated and 20 percent human curation. I thought we could get to 95 percent. I was wrong. The long tail of edge cases is endless.
Another mistake: we initially stored parsed text as plain strings. When we added full-text search, we realized we had destroyed paragraph structure, footnote references, and table data. We had to re-parse every document with layout preservation. That took three weekends.
Current Numbers
The database covers all 94 federal district courts. We track approximately 680 active district judges. We have parsed and indexed over 4,200 local rules and individual practices documents. The deadline calculator handles 40+ federal rule references with district-specific overrides.
Traffic is modest but growing. About 12,000 unique visitors per month, mostly litigators, paralegals, and law students. The calculator is the most-used feature. The judge directory gets heavy use from associates who just got case assignments and need to know their judge's quirks.
We have had zero venture capital. No paid marketing. Growth is entirely organic through legal forums, Reddit, and word of mouth in litigation departments.
Why We Keep Doing It
The honest answer is that we use it ourselves. My co-founder relies on it for his practice. I rely on it as a sanity check when he asks me whether a deadline calculation looks right. Building tools you actually use changes your priorities. You do not add features because they sound impressive. You add them because you needed them yesterday.
The federal court system is public. The rules are public. The judge practices are public. But they are scattered across 94 websites, in formats that resist aggregation. We are just trying to put them in one place, in a form that is actually usable.
It is not the kind of project that gets written up in TechCrunch. It will not scale to a billion users. But it solves a real problem for a specific group of people. And for a side project, that is enough.
If you are interested in the technical details or want to contribute, the project is at courtrules.app. We are not open source yet, though we have talked about it. The dataset might be useful to researchers, journalists, or other developers working in the legal space.
Feel free to reach out. Or just use the calculator. That is why we built it.