Scraping Municipal Budget Data: City Finance Transparency
City budgets determine how billions of public dollars are spent, yet most budget data is locked in PDFs, outdated portals, or hard-to-navigate websites. Let's build a Python scraper that extracts municipal budget data and makes it accessible for analysis.
Why Municipal Budget Data Matters
Local government spending directly impacts communities — policing, education, infrastructure, public health. Journalists investigating budget priorities, researchers studying fiscal policy, and civic tech organizations all need structured budget data. Most cities don't make this easy.
Scraping Open Data Portals
Many cities use Socrata-powered open data portals with consistent APIs:
import requests
import pandas as pd


class SocrataPortalScraper:
    def __init__(self, app_token=None):
        # An app token is optional but raises Socrata's rate limits
        self.app_token = app_token
        self.headers = {}
        if app_token:
            self.headers["X-App-Token"] = app_token

    def search_datasets(self, domain, query="budget"):
        """Search a portal's catalog for finance datasets."""
        url = f"https://{domain}/api/catalog/v1"
        params = {"q": query, "categories": "Finance", "limit": 50}
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        datasets = []
        for result in response.json().get("results", []):
            resource = result.get("resource", {})
            datasets.append({
                "name": resource.get("name", ""),
                "id": resource.get("id", ""),
                "description": resource.get("description", "")[:200],
                "updated": resource.get("updatedAt", ""),
                "domain": domain
            })
        return datasets

    def fetch_dataset(self, domain, dataset_id, limit=50000):
        """Pull a dataset's rows as JSON via the SODA API."""
        url = f"https://{domain}/resource/{dataset_id}.json"
        params = {"$limit": limit}
        response = requests.get(url, params=params, headers=self.headers)
        response.raise_for_status()
        return pd.DataFrame(response.json())
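For datasets larger than one request allows, Socrata's SoQL also accepts an `$offset` parameter alongside `$limit`, so rows can be paged through until a short page signals the end. A minimal sketch of that loop — here `fetch_page` is a stand-in for the `requests.get` call above, so the paging logic can be shown without a live portal:

```python
import pandas as pd

def fetch_all_rows(fetch_page, page_size=1000):
    # fetch_page(limit, offset) returns a list of row dicts;
    # paging stops at the first short (or empty) page.
    rows, offset = [], 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return pd.DataFrame(rows)
```

In practice `fetch_page` would forward `{"$limit": limit, "$offset": offset}` to the `/resource/{dataset_id}.json` endpoint.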
Budget PDF Extraction
When data only exists in PDFs:
import subprocess
import re
import pandas as pd


class BudgetPDFExtractor:
    def extract_tables(self, pdf_path):
        """Prefer tabula-py's table detection when it's installed."""
        try:
            import tabula
            tables = tabula.read_pdf(
                pdf_path, pages="all",
                multiple_tables=True,
                pandas_options={"header": None}
            )
            return tables
        except ImportError:
            return self._fallback_extraction(pdf_path)

    def _fallback_extraction(self, pdf_path):
        """Fall back to pdftotext plus a line-item regex."""
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            capture_output=True, text=True
        )
        lines = result.stdout.split("\n")
        budget_items = []
        for line in lines:
            # Match "Department name    $1,234,567.00" style rows
            match = re.match(
                r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$', line
            )
            if match:
                budget_items.append({
                    "department": match.group(1).strip(),
                    "amount": float(match.group(2).replace(",", ""))
                })
        return pd.DataFrame(budget_items)
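Budget PDFs also mix number formats within a single document: dollar signs, thousands separators, and accounting-style parentheses for negatives. A small normalizer — a hypothetical helper, not part of the extractor above — keeps that cleanup in one place:

```python
import re

def parse_amount(raw):
    # Normalize a budget figure pulled from PDF text:
    # "(12,000)" is an accounting-style negative; returns
    # None when the cell contains no number at all.
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[^\d.]", "", text)
    if not digits:
        return None
    value = float(digits)
    return -value if negative else value
```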
State Controller Data
State controller and comptroller offices publish city-level financial reports, but their portals are often JavaScript-heavy, which is where a rendering proxy helps:
import os
import requests
from bs4 import BeautifulSoup

# Read the key from the environment rather than hardcoding it
SCRAPER_API_KEY = os.environ.get("SCRAPER_API_KEY", "YOUR_KEY")


class StateControllerScraper:
    def scrape_state_portal(self, state_url):
        """Render a portal page through ScraperAPI and harvest data links."""
        response = requests.get(
            "http://api.scraperapi.com",
            params={"api_key": SCRAPER_API_KEY, "url": state_url, "render": "true"},
            timeout=60
        )
        soup = BeautifulSoup(response.text, "html.parser")
        data_links = []
        for link in soup.find_all("a", href=True):
            href = link["href"].lower()
            if any(ext in href for ext in [".csv", ".xlsx", ".json", ".xml"]):
                data_links.append({
                    "url": link["href"],
                    "text": link.get_text(strip=True),
                    "format": href.split(".")[-1]
                })
        return data_links
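One catch: `link["href"]` is frequently relative (`/files/budget.csv`), so the stored URLs aren't downloadable as-is. Resolving them against the page URL with `urljoin` fixes that — a small helper, assuming the link dicts produced above:

```python
from urllib.parse import urljoin

def absolutize_links(base_url, data_links):
    # Rewrite each link's "url" field to an absolute URL,
    # leaving the other fields untouched
    return [dict(link, url=urljoin(base_url, link["url"])) for link in data_links]
```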
Budget Comparison Analysis
import pandas as pd


def compare_city_budgets(budgets):
    """budgets: dict mapping city name -> DataFrame with department/amount columns."""
    comparisons = []
    for city, df in budgets.items():
        if "amount" in df.columns and "department" in df.columns:
            total = df["amount"].sum()
            if total == 0:
                continue  # avoid division by zero on empty budgets
            for _, row in df.iterrows():
                comparisons.append({
                    "city": city,
                    "department": row["department"],
                    "amount": row["amount"],
                    "pct_of_total": round(row["amount"] / total * 100, 2)
                })
    result = pd.DataFrame(comparisons)
    # Flag departments where one city's share sits far from the cross-city mean
    for dept in result["department"].unique():
        dept_data = result[result["department"] == dept]
        if len(dept_data) >= 3:
            mean_pct = dept_data["pct_of_total"].mean()
            std_pct = dept_data["pct_of_total"].std()
            outliers = dept_data[abs(dept_data["pct_of_total"] - mean_pct) > 2 * std_pct]
            for _, row in outliers.iterrows():
                print(f"OUTLIER: {row['city']} spends {row['pct_of_total']}% "
                      f"on {dept} (avg: {mean_pct:.1f}%)")
    return result
Scaling Across Municipalities
Scraping hundreds of municipal sites requires robust infrastructure. ScraperAPI handles JavaScript-heavy government portals. ThorData offers reliable proxy rotation for large-scale government site scraping. ScrapeOps tracks success rates across diverse site architectures.
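At that scale, transient failures — timeouts, 503s, flaky CDNs — are routine, so each fetch should retry with exponential backoff rather than abort the whole run. A sketch of that wrapper, where `fetch` is any callable that raises on failure:

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    # Retry with exponential backoff: delays grow as
    # base_delay * 2**attempt; the last failure is re-raised.
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```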
Building a Transparency Dashboard
def generate_transparency_report(city_name, budget_df):
    return {
        "city": city_name,
        "total_budget": budget_df["amount"].sum(),
        "departments": len(budget_df),
        "top_5_spending": budget_df.nlargest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records"),
        "bottom_5_spending": budget_df.nsmallest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records")
    }
Ethical Scraping of Government Data
Government websites often have limited infrastructure. Rate-limit aggressively, cache responses, and consider contributing scraped data back to open data initiatives. The goal is transparency, not server overload.
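The caching advice can be as simple as keying saved responses by a hash of the URL, so re-runs during development never re-hit a city's server. A minimal on-disk sketch — the directory name and `fetch` callable are illustrative choices:

```python
import hashlib
import pathlib

class ResponseCache:
    # Cache fetched page bodies on disk, keyed by SHA-256 of the URL
    def __init__(self, cache_dir=".budget_cache"):
        self.dir = pathlib.Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def _path(self, url):
        return self.dir / (hashlib.sha256(url.encode()).hexdigest() + ".txt")

    def get_or_fetch(self, url, fetch):
        path = self._path(url)
        if path.exists():
            return path.read_text()  # cache hit: no network request
        body = fetch(url)
        path.write_text(body)
        return body
```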
Municipal budget scraping is civic tech at its most impactful. Every dollar of public spending should be easily trackable by the public that funds it.