agenthustler
Scraping Municipal Budget Data: City Finance Transparency


City budgets determine how billions of public dollars are spent, yet most budget data is locked in PDFs, outdated portals, or hard-to-navigate websites. Let's build a Python scraper that extracts municipal budget data and makes it accessible for analysis.

Why Municipal Budget Data Matters

Local government spending directly impacts communities — policing, education, infrastructure, public health. Journalists investigating budget priorities, researchers studying fiscal policy, and civic tech organizations all need structured budget data. Most cities don't make this easy.

Scraping Open Data Portals

Many cities use Socrata-powered open data portals with consistent APIs:

import requests
import pandas as pd

class SocrataPortalScraper:
    def __init__(self, app_token=None):
        # An app token raises Socrata's rate limits; anonymous access also works
        self.app_token = app_token
        self.headers = {}
        if app_token:
            self.headers["X-App-Token"] = app_token

    def search_datasets(self, domain, query="budget"):
        """Search a portal's catalog for finance-related datasets."""
        url = f"https://{domain}/api/catalog/v1"
        params = {"q": query, "categories": "Finance", "limit": 50}
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        datasets = []
        for result in response.json().get("results", []):
            resource = result.get("resource", {})
            datasets.append({
                "name": resource.get("name", ""),
                "id": resource.get("id", ""),
                "description": resource.get("description", "")[:200],
                "updated": resource.get("updatedAt", ""),
                "domain": domain
            })
        return datasets

    def fetch_dataset(self, domain, dataset_id, limit=50000):
        """Pull a dataset's rows via the SODA resource endpoint."""
        url = f"https://{domain}/resource/{dataset_id}.json"
        params = {"$limit": limit}
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        return pd.DataFrame(response.json())
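One quirk worth handling: the SODA API returns most fields as JSON strings, so amount columns arrive as text like "1,234.50" or "$5000". A small cleaning helper (the column names here are hypothetical) converts them before any math:

```python
import pandas as pd

def clean_money_columns(df, columns):
    """Convert string money columns like '1,234.50' or '$980' to floats."""
    for col in columns:
        if col in df.columns:
            df[col] = (
                df[col]
                .astype(str)
                .str.replace(r"[$,]", "", regex=True)  # strip currency symbols and commas
                .pipe(pd.to_numeric, errors="coerce")  # unparseable cells become NaN
            )
    return df

# Toy example with invented figures
df = pd.DataFrame({"department": ["Police", "Parks"], "amount": ["1,234.50", "$980"]})
df = clean_money_columns(df, ["amount"])
```

With `errors="coerce"`, footnote rows or blank cells turn into NaN instead of crashing the pipeline.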

Budget PDF Extraction

When data only exists in PDFs:

import subprocess
import re
import pandas as pd

class BudgetPDFExtractor:
    def extract_tables(self, pdf_path):
        try:
            import tabula
            tables = tabula.read_pdf(
                pdf_path, pages="all",
                multiple_tables=True,
                pandas_options={"header": None}
            )
            return tables
        except ImportError:
            return self._fallback_extraction(pdf_path)

    def _fallback_extraction(self, pdf_path):
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            capture_output=True, text=True
        )
        lines = result.stdout.split("\n")
        budget_items = []
        for line in lines:
            match = re.match(
                r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$', line
            )
            if match:
                budget_items.append({
                    "department": match.group(1).strip(),
                    "amount": float(match.group(2).replace(",", ""))
                })
        return pd.DataFrame(budget_items)
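To sanity-check the fallback parser, you can run the same department/amount pattern over a few sample lines of `pdftotext -layout` output (the figures below are invented):

```python
import re

# Same pattern as the fallback extractor: a 10-50 char label, then a dollar amount
LINE_RE = re.compile(r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$')

sample = [
    "  Police Department                 $412,384,118.00",
    "  Parks and Recreation                 38,201,554",
    "  TOTAL",  # no amount, should not match
]

items = []
for line in sample:
    m = LINE_RE.match(line)
    if m:
        items.append((m.group(1).strip(), float(m.group(2).replace(",", ""))))
```

Note the pattern requires a trailing amount, so section headers and total rows without figures are skipped automatically.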

State Controller Data

State controller and comptroller offices often aggregate municipal finance data, but their portals tend to be JavaScript-heavy. A rendering proxy such as ScraperAPI can fetch the fully rendered page before link extraction:

import requests
from bs4 import BeautifulSoup

SCRAPER_API_KEY = "YOUR_KEY"

class StateControllerScraper:
    def scrape_state_portal(self, state_url):
        response = requests.get(
            "http://api.scraperapi.com",
            params={"api_key": SCRAPER_API_KEY, "url": state_url, "render": "true"},
            timeout=60
        )
        soup = BeautifulSoup(response.text, "html.parser")
        data_links = []
        for link in soup.find_all("a", href=True):
            href = link["href"].lower()
            if any(ext in href for ext in [".csv", ".xlsx", ".json", ".xml"]):
                data_links.append({
                    "url": link["href"],
                    "text": link.get_text(strip=True),
                    "format": href.split(".")[-1]
                })
        return data_links
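The hrefs you collect this way are frequently relative paths, so resolve them against the page URL before downloading. A minimal sketch with `urllib.parse.urljoin` (the portal URL below is hypothetical):

```python
from urllib.parse import urljoin

def resolve_links(base_url, hrefs):
    """Resolve relative hrefs against the page URL so they can be downloaded."""
    return [urljoin(base_url, h) for h in hrefs]

resolved = resolve_links(
    "https://comptroller.example.gov/data/",  # hypothetical portal page
    ["budget_2024.csv", "/downloads/budget_2023.xlsx", "https://cdn.example.gov/b.json"],
)
```

`urljoin` leaves absolute URLs untouched, handles root-relative paths, and joins bare filenames onto the page's directory.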

Budget Comparison Analysis

import pandas as pd

def compare_city_budgets(budgets):
    comparisons = []
    for city, df in budgets.items():
        if "amount" in df.columns and "department" in df.columns:
            total = df["amount"].sum()
            if total == 0:
                continue  # avoid division by zero for empty budgets
            for _, row in df.iterrows():
                comparisons.append({
                    "city": city,
                    "department": row["department"],
                    "amount": row["amount"],
                    "pct_of_total": round(row["amount"] / total * 100, 2)
                })
    result = pd.DataFrame(comparisons)
    for dept in result["department"].unique():
        dept_data = result[result["department"] == dept]
        if len(dept_data) >= 3:
            mean_pct = dept_data["pct_of_total"].mean()
            std_pct = dept_data["pct_of_total"].std()
            outliers = dept_data[abs(dept_data["pct_of_total"] - mean_pct) > 2 * std_pct]
            for _, row in outliers.iterrows():
                print(f"OUTLIER: {row['city']} spends {row['pct_of_total']}% "
                      f"on {dept} (avg: {mean_pct:.1f}%)")
    return result
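To see the per-department share calculation in isolation, here it is on toy data for two hypothetical cities (all figures invented for illustration):

```python
import pandas as pd

budgets = {
    "Springfield": pd.DataFrame({
        "department": ["Police", "Parks"],
        "amount": [300.0, 100.0],
    }),
    "Shelbyville": pd.DataFrame({
        "department": ["Police", "Parks"],
        "amount": [150.0, 150.0],
    }),
}

rows = []
for city, df in budgets.items():
    total = df["amount"].sum()
    for _, r in df.iterrows():
        rows.append({
            "city": city,
            "department": r["department"],
            "pct_of_total": round(r["amount"] / total * 100, 2),
        })
shares = pd.DataFrame(rows)
```

Comparing percentages rather than raw dollars is what makes cities of different sizes comparable: Springfield puts 75% of its budget into Police versus Shelbyville's 50%, even though the absolute amounts differ.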

Scaling Across Municipalities

Scraping hundreds of municipal sites requires robust infrastructure. ScraperAPI handles JavaScript-heavy government portals. ThorData offers reliable proxy rotation for large-scale government site scraping. ScrapeOps tracks success rates across diverse site architectures.

Building a Transparency Dashboard

def generate_transparency_report(city_name, budget_df):
    return {
        "city": city_name,
        "total_budget": budget_df["amount"].sum(),
        "departments": len(budget_df),
        "top_5_spending": budget_df.nlargest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records"),
        "bottom_5_spending": budget_df.nsmallest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records")
    }
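A quick way to exercise the report shape is with a toy DataFrame (department names and figures invented); this inlines the same `nlargest`/`nsmallest` logic:

```python
import pandas as pd

budget_df = pd.DataFrame({
    "department": ["Police", "Fire", "Parks", "Libraries", "IT", "Housing"],
    "amount": [500.0, 300.0, 120.0, 80.0, 60.0, 40.0],
})

report = {
    "city": "Exampleville",  # hypothetical city
    "total_budget": budget_df["amount"].sum(),
    "departments": len(budget_df),
    "top_5_spending": budget_df.nlargest(5, "amount")[
        ["department", "amount"]
    ].to_dict("records"),
    "bottom_5_spending": budget_df.nsmallest(5, "amount")[
        ["department", "amount"]
    ].to_dict("records"),
}
```

`to_dict("records")` produces a list of plain dicts, which serializes cleanly to JSON for a dashboard front end.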

Ethical Scraping of Government Data

Government websites often have limited infrastructure. Rate-limit aggressively, cache responses, and consider contributing scraped data back to open data initiatives. The goal is transparency, not server overload.
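A minimal sketch of both habits: enforce a minimum interval between requests and memoize responses so repeated runs never re-hit the server. The fetch function here is a stand-in for `requests.get`:

```python
import time

class PoliteFetcher:
    """Rate-limits and caches calls to a fetch function (stand-in for requests.get)."""

    def __init__(self, fetch_fn, min_interval=2.0):
        self.fetch_fn = fetch_fn
        self.min_interval = min_interval  # seconds between real requests
        self.cache = {}
        self._last_request = 0.0

    def get(self, url):
        if url in self.cache:              # cached: no network hit at all
            return self.cache[url]
        wait = self.min_interval - (time.time() - self._last_request)
        if wait > 0:
            time.sleep(wait)               # back off so small servers aren't hammered
        self._last_request = time.time()
        self.cache[url] = self.fetch_fn(url)
        return self.cache[url]

# Demo with a stub fetcher that records how many real requests were made
calls = []
fetcher = PoliteFetcher(lambda url: calls.append(url) or f"body:{url}", min_interval=0.01)
fetcher.get("https://example.gov/budget.csv")
fetcher.get("https://example.gov/budget.csv")  # served from cache
```

For production use, persist the cache to disk so a crashed run can resume without refetching everything.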

Municipal budget scraping is civic tech at its most impactful. Every dollar of public spending should be easily trackable by the public that funds it.
