agenthustler
Scraping Municipal Budget Data: City Finance Transparency


City budgets determine how billions of public dollars are spent, yet most budget data is locked in PDFs, outdated portals, or hard-to-navigate websites. Let's build a Python scraper that extracts municipal budget data and makes it accessible for analysis.

Why Municipal Budget Data Matters

Local government spending directly impacts communities — policing, education, infrastructure, public health. Journalists investigating budget priorities, researchers studying fiscal policy, and civic tech organizations all need structured budget data. Most cities don't make this easy.

Scraping Open Data Portals

Many cities use Socrata-powered open data portals with consistent APIs:

import requests
import pandas as pd

class SocrataPortalScraper:
    def __init__(self, app_token=None):
        # An app token raises Socrata's rate limits; anonymous access also works
        self.app_token = app_token
        self.headers = {}
        if app_token:
            self.headers["X-App-Token"] = app_token

    def search_datasets(self, domain, query="budget"):
        """Search a portal's catalog for finance-related datasets."""
        url = f"https://{domain}/api/catalog/v1"
        params = {"q": query, "categories": "Finance", "limit": 50}
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        datasets = []
        for result in response.json().get("results", []):
            resource = result.get("resource", {})
            datasets.append({
                "name": resource.get("name", ""),
                "id": resource.get("id", ""),
                "description": resource.get("description", "")[:200],
                "updated": resource.get("updatedAt", ""),
                "domain": domain
            })
        return datasets

    def fetch_dataset(self, domain, dataset_id, limit=50000):
        """Pull a dataset's rows via the SODA resource endpoint."""
        url = f"https://{domain}/resource/{dataset_id}.json"
        params = {"$limit": limit}
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        return pd.DataFrame(response.json())
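One quirk worth handling: the SODA API returns most fields as JSON strings, so amount columns arrive as text like "1,234.50" or "$5000". A small cleaning helper (the column names here are hypothetical) converts them before any math:

```python
import pandas as pd

def clean_money_columns(df, columns):
    """Convert string money columns like '1,234.50' or '$980' to floats."""
    for col in columns:
        if col in df.columns:
            df[col] = (
                df[col]
                .astype(str)
                .str.replace(r"[$,]", "", regex=True)  # strip currency symbols and commas
                .pipe(pd.to_numeric, errors="coerce")  # unparseable cells become NaN
            )
    return df

# Toy example with invented figures
df = pd.DataFrame({"department": ["Police", "Parks"], "amount": ["1,234.50", "$980"]})
df = clean_money_columns(df, ["amount"])
```

With `errors="coerce"`, footnote rows or blank cells turn into NaN instead of crashing the pipeline.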

Budget PDF Extraction

When data only exists in PDFs:

import subprocess
import re
import pandas as pd

class BudgetPDFExtractor:
    def extract_tables(self, pdf_path):
        try:
            import tabula
            tables = tabula.read_pdf(
                pdf_path, pages="all",
                multiple_tables=True,
                pandas_options={"header": None}
            )
            return tables
        except ImportError:
            return self._fallback_extraction(pdf_path)

    def _fallback_extraction(self, pdf_path):
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            capture_output=True, text=True
        )
        lines = result.stdout.split("\n")
        budget_items = []
        for line in lines:
            match = re.match(
                r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$', line
            )
            if match:
                budget_items.append({
                    "department": match.group(1).strip(),
                    "amount": float(match.group(2).replace(",", ""))
                })
        return pd.DataFrame(budget_items)
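To sanity-check the fallback parser, you can run the same department/amount pattern over a few sample lines of `pdftotext -layout` output (the figures below are invented):

```python
import re

# Same pattern as the fallback extractor: a 10-50 char label, then a dollar amount
LINE_RE = re.compile(r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$')

sample = [
    "  Police Department                 $412,384,118.00",
    "  Parks and Recreation                 38,201,554",
    "  TOTAL",  # no amount, should not match
]

items = []
for line in sample:
    m = LINE_RE.match(line)
    if m:
        items.append((m.group(1).strip(), float(m.group(2).replace(",", ""))))
```

Note the pattern requires a trailing amount, so section headers and total rows without figures are skipped automatically.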

State Controller Data

State controller and comptroller offices often aggregate municipal finance data, but their portals tend to be JavaScript-heavy. A rendering proxy such as ScraperAPI can fetch the fully rendered page before link extraction:

import requests
from bs4 import BeautifulSoup

SCRAPER_API_KEY = "YOUR_KEY"

class StateControllerScraper:
    def scrape_state_portal(self, state_url):
        response = requests.get(
            "http://api.scraperapi.com",
            params={"api_key": SCRAPER_API_KEY, "url": state_url, "render": "true"},
            timeout=60
        )
        soup = BeautifulSoup(response.text, "html.parser")
        data_links = []
        for link in soup.find_all("a", href=True):
            href = link["href"].lower()
            if any(ext in href for ext in [".csv", ".xlsx", ".json", ".xml"]):
                data_links.append({
                    "url": link["href"],
                    "text": link.get_text(strip=True),
                    "format": href.split(".")[-1]
                })
        return data_links
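The hrefs you collect this way are frequently relative paths, so resolve them against the page URL before downloading. A minimal sketch with `urllib.parse.urljoin` (the portal URL below is hypothetical):

```python
from urllib.parse import urljoin

def resolve_links(base_url, hrefs):
    """Resolve relative hrefs against the page URL so they can be downloaded."""
    return [urljoin(base_url, h) for h in hrefs]

resolved = resolve_links(
    "https://comptroller.example.gov/data/",  # hypothetical portal page
    ["budget_2024.csv", "/downloads/budget_2023.xlsx", "https://cdn.example.gov/b.json"],
)
```

`urljoin` leaves absolute URLs untouched, handles root-relative paths, and joins bare filenames onto the page's directory.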

Budget Comparison Analysis

import pandas as pd

def compare_city_budgets(budgets):
    comparisons = []
    for city, df in budgets.items():
        if "amount" in df.columns and "department" in df.columns:
            total = df["amount"].sum()
            if total == 0:
                continue  # avoid division by zero for empty budgets
            for _, row in df.iterrows():
                comparisons.append({
                    "city": city,
                    "department": row["department"],
                    "amount": row["amount"],
                    "pct_of_total": round(row["amount"] / total * 100, 2)
                })
    result = pd.DataFrame(comparisons)
    for dept in result["department"].unique():
        dept_data = result[result["department"] == dept]
        if len(dept_data) >= 3:
            mean_pct = dept_data["pct_of_total"].mean()
            std_pct = dept_data["pct_of_total"].std()
            outliers = dept_data[abs(dept_data["pct_of_total"] - mean_pct) > 2 * std_pct]
            for _, row in outliers.iterrows():
                print(f"OUTLIER: {row['city']} spends {row['pct_of_total']}% "
                      f"on {dept} (avg: {mean_pct:.1f}%)")
    return result
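To see the per-department share calculation in isolation, here it is on toy data for two hypothetical cities (all figures invented for illustration):

```python
import pandas as pd

budgets = {
    "Springfield": pd.DataFrame({
        "department": ["Police", "Parks"],
        "amount": [300.0, 100.0],
    }),
    "Shelbyville": pd.DataFrame({
        "department": ["Police", "Parks"],
        "amount": [150.0, 150.0],
    }),
}

rows = []
for city, df in budgets.items():
    total = df["amount"].sum()
    for _, r in df.iterrows():
        rows.append({
            "city": city,
            "department": r["department"],
            "pct_of_total": round(r["amount"] / total * 100, 2),
        })
shares = pd.DataFrame(rows)
```

Comparing percentages rather than raw dollars is what makes cities of different sizes comparable: Springfield puts 75% of its budget into Police versus Shelbyville's 50%, even though the absolute amounts differ.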

Scaling Across Municipalities

Scraping hundreds of municipal sites requires robust infrastructure. ScraperAPI handles JavaScript-heavy government portals. ThorData offers reliable proxy rotation for large-scale government site scraping. ScrapeOps tracks success rates across diverse site architectures.

Building a Transparency Dashboard

def generate_transparency_report(city_name, budget_df):
    return {
        "city": city_name,
        "total_budget": budget_df["amount"].sum(),
        "departments": len(budget_df),
        "top_5_spending": budget_df.nlargest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records"),
        "bottom_5_spending": budget_df.nsmallest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records")
    }
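A quick way to exercise the report shape is with a toy DataFrame (department names and figures invented); this inlines the same `nlargest`/`nsmallest` logic:

```python
import pandas as pd

budget_df = pd.DataFrame({
    "department": ["Police", "Fire", "Parks", "Libraries", "IT", "Housing"],
    "amount": [500.0, 300.0, 120.0, 80.0, 60.0, 40.0],
})

report = {
    "city": "Exampleville",  # hypothetical city
    "total_budget": budget_df["amount"].sum(),
    "departments": len(budget_df),
    "top_5_spending": budget_df.nlargest(5, "amount")[
        ["department", "amount"]
    ].to_dict("records"),
    "bottom_5_spending": budget_df.nsmallest(5, "amount")[
        ["department", "amount"]
    ].to_dict("records"),
}
```

`to_dict("records")` produces a list of plain dicts, which serializes cleanly to JSON for a dashboard front end.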

Ethical Scraping of Government Data

Government websites often have limited infrastructure. Rate-limit aggressively, cache responses, and consider contributing scraped data back to open data initiatives. The goal is transparency, not server overload.
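A minimal sketch of both habits: enforce a minimum interval between requests and memoize responses so repeated runs never re-hit the server. The fetch function here is a stand-in for `requests.get`:

```python
import time

class PoliteFetcher:
    """Rate-limits and caches calls to a fetch function (stand-in for requests.get)."""

    def __init__(self, fetch_fn, min_interval=2.0):
        self.fetch_fn = fetch_fn
        self.min_interval = min_interval  # seconds between real requests
        self.cache = {}
        self._last_request = 0.0

    def get(self, url):
        if url in self.cache:              # cached: no network hit at all
            return self.cache[url]
        wait = self.min_interval - (time.time() - self._last_request)
        if wait > 0:
            time.sleep(wait)               # back off so small servers aren't hammered
        self._last_request = time.time()
        self.cache[url] = self.fetch_fn(url)
        return self.cache[url]

# Demo with a stub fetcher that records how many real requests were made
calls = []
fetcher = PoliteFetcher(lambda url: calls.append(url) or f"body:{url}", min_interval=0.01)
fetcher.get("https://example.gov/budget.csv")
fetcher.get("https://example.gov/budget.csv")  # served from cache
```

For production use, persist the cache to disk so a crashed run can resume without refetching everything.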

Municipal budget scraping is civic tech at its most impactful. Every dollar of public spending should be easily trackable by the public that funds it.
