<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: masonericd</title>
    <description>The latest articles on DEV Community by masonericd (@masonericd).</description>
    <link>https://dev.to/masonericd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F983815%2Fa8b43694-1b7b-40b3-a2da-dc12026a9e8c.png</url>
      <title>DEV Community: masonericd</title>
      <link>https://dev.to/masonericd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/masonericd"/>
    <language>en</language>
    <item>
      <title>I extracted a government's entire budget from a 625-page PDF in 4 seconds</title>
      <dc:creator>masonericd</dc:creator>
      <pubDate>Fri, 19 Jun 2026 11:31:49 +0000</pubDate>
      <link>https://dev.to/masonericd/i-extracted-a-governments-entire-budget-from-a-625-page-pdf-in-4-seconds-23c</link>
      <guid>https://dev.to/masonericd/i-extracted-a-governments-entire-budget-from-a-625-page-pdf-in-4-seconds-23c</guid>
      <description>&lt;p&gt;Liberia's national budget is public. By law, it has to be. Every year the Ministry of Finance publishes it as a PDF on their website, hundreds of pages of line-item spending across 117 government ministries and agencies, six years of historical and projected figures per entity.&lt;br&gt;
Nobody reads it. Not because nobody cares, but because a 625-page PDF with no table of structured data is functionally a write-only document. A journalist trying to track one ministry's spending has to manually find the right page, copy numbers into a spreadsheet, and repeat that 117 times across multiple fiscal years. A World Bank analyst comparing procurement patterns across ministries does the same thing, slower, with less institutional memory of where things are buried.&lt;br&gt;
I built FiscalTrace to fix that. It's a small pipeline: extract, structure, compute variance, serve through an API. Five stages, no ML in the extraction step, nothing exotic. The interesting part isn't the architecture, it's what falls out the other end once government financial data actually becomes queryable.&lt;br&gt;
What it found, automatically, on the first run:&lt;br&gt;
The Ministry of Agriculture's budget jumped 262% between FY2023 actual spending and the FY2025 approved budget, from $3.7M to $13.4M. A program serving vulnerable communities had its budget cut 74% over the same window, from $7.8M to $2.0M.&lt;br&gt;
Neither of those numbers required me to know what I was looking for. They came out of a generic "flag anything that moved more than 20%" rule running against the structured table. That's the whole point, the system doesn't need a human to know where to look first.&lt;br&gt;
The extraction problem&lt;br&gt;
pdfplumber handles the actual PDF parsing, but the budget document isn't a clean table you can extract_table() your way through. It's six years of data per ministry, laid out across multiple pages, with section headers, footnotes, and inconsistent column spacing that breaks naive table detection about a third of the time.&lt;br&gt;
What worked was pattern matching against the structural regularities that do hold, every ministry entry follows the same line-item format regardless of which page it's on. I extract per-page, then stitch entities together by matching on the entity code, which is the only consistently formatted field across the entire document.&lt;br&gt;
python# simplified — the real extractor handles edge cases around&lt;/p&gt;

&lt;h1&gt;
  
  
  multi-line entity names and footnote markers
&lt;/h1&gt;

&lt;p&gt;def extract_entity_row(line, page_num):&lt;br&gt;
    match = ENTITY_ROW_PATTERN.match(line)&lt;br&gt;
    if not match:&lt;br&gt;
        return None&lt;br&gt;
    return {&lt;br&gt;
        "entity_code": match.group("code"),&lt;br&gt;
        "entity_name": match.group("name").strip(),&lt;br&gt;
        "fy23_actual": parse_currency(match.group("fy23")),&lt;br&gt;
        "fy24_budget": parse_currency(match.group("fy24")),&lt;br&gt;
        "fy25_budget": parse_currency(match.group("fy25")),&lt;br&gt;
        "source_page": page_num,&lt;br&gt;
    }&lt;br&gt;
Full extraction of all 117 entities across 6 fiscal years runs in under 4 seconds. The slow part was never the parsing, it was figuring out which regex actually generalizes across a document that wasn't designed to be machine-read.&lt;br&gt;
Variance, computed once, not on every request&lt;br&gt;
Once the data lands in Postgres, I compute budget growth percentage and execution rate at ingest time, not at query time. This matters more than it sounds like it should, anomaly detection on a 117-row table is cheap either way, but computing it once means the /api/v1/anomalies endpoint is just a WHERE abs(growth_pct) &amp;gt; threshold query against a precomputed column. No runtime aggregation, sub-millisecond response.&lt;br&gt;
sqlALTER TABLE spending_entities&lt;br&gt;
ADD COLUMN growth_pct NUMERIC GENERATED ALWAYS AS (&lt;br&gt;
  CASE WHEN fy23_actual &amp;gt; 0&lt;br&gt;
    THEN ROUND(((fy25_budget - fy23_actual) / fy23_actual) * 100, 1)&lt;br&gt;
    ELSE NULL&lt;br&gt;
  END&lt;br&gt;
) STORED;&lt;br&gt;
What's live right now&lt;br&gt;
Eight endpoints, fully public, no auth required except the bulk variance export:&lt;br&gt;
GET /api/v1/overview          national summary, total budget, entity count&lt;br&gt;
GET /api/v1/sectors           budget by sector, all 11&lt;br&gt;
GET /api/v1/entities          full list, sortable, filterable&lt;br&gt;
GET /api/v1/entities/{code}   single ministry detail&lt;br&gt;
GET /api/v1/search            full-text search by entity name&lt;br&gt;
GET /api/v1/anomalies         the interesting part&lt;br&gt;
bashcurl "&lt;a href="https://fiscaltrace.ericdiamason.tech/api/v1/anomalies?threshold_pct=20" rel="noopener noreferrer"&gt;https://fiscaltrace.ericdiamason.tech/api/v1/anomalies?threshold_pct=20&lt;/a&gt;"&lt;br&gt;
json{&lt;br&gt;
  "entity_name": "MINISTRY OF AGRICULTURE",&lt;br&gt;
  "fy2023_actual": 3704312,&lt;br&gt;
  "fy2025_budget": 13432776,&lt;br&gt;
  "change_pct": 262.6,&lt;br&gt;
  "alert_type": "large_increase"&lt;br&gt;
}&lt;br&gt;
What I'd do differently&lt;br&gt;
The extraction regex is brittle in a way I'm not fully comfortable with. It works on this specific document's formatting, and I haven't tested it against a different country's budget PDF yet, which is the actual interesting next step, does the same pattern-matching approach generalize, or does every government format their budget just differently enough to break it.&lt;br&gt;
The anomaly threshold (20%) is a number I picked, not one I derived. It's a reasonable starting heuristic, but it's not informed by any historical baseline of what "normal" year-over-year ministry budget volatility looks like in Liberia specifically.&lt;br&gt;
It's open source: github.com/ericdiamason/fiscaltrace&lt;/p&gt;

&lt;p&gt;Live: fiscaltrace.ericdiamason.tech&lt;br&gt;
Happy to talk through the extraction approach or the schema design in the comments.&lt;/p&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>fastapi</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
