Scraping Municipal Budget Data: City Finance Transparency
City budgets determine how billions of public dollars are spent, yet most budget data is locked in PDFs, outdated portals, or hard-to-navigate websites. Let's build a Python scraper that extracts municipal budget data and makes it accessible for analysis.
Why Municipal Budget Data Matters
Local government spending directly impacts communities — policing, education, infrastructure, public health. Journalists investigating budget priorities, researchers studying fiscal policy, and civic tech organizations all need structured budget data. Most cities don't make this easy.
Scraping Open Data Portals
Many cities publish their budgets through Socrata-powered open data portals, which expose a consistent JSON API (the Socrata Open Data API, or SODA) regardless of which city hosts the data. Once you know a city's portal domain and the dataset ID, retrieval is straightforward and well-documented.
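As a sketch, here is how a SODA endpoint can be paged through with nothing but the standard library. The domain and dataset ID below are placeholders; look up the real ones on each city's portal.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def socrata_url(domain, dataset_id, **params):
    """Build a SODA resource URL, e.g. for a budget dataset."""
    base = f"https://{domain}/resource/{dataset_id}.json"
    return f"{base}?{urlencode(params)}" if params else base

def fetch_all(domain, dataset_id, page_size=5000):
    """Page through an entire dataset using $limit/$offset."""
    rows, offset = [], 0
    while True:
        url = socrata_url(domain, dataset_id,
                          **{"$limit": page_size, "$offset": offset})
        with urlopen(url, timeout=30) as resp:
            batch = json.load(resp)
        rows.extend(batch)
        if len(batch) < page_size:  # short page means we reached the end
            return rows
        offset += page_size
```

Registering for an app token (free on Socrata) raises the throttling limits, but anonymous access works for modest volumes.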
Budget PDF Extraction
When data only exists in PDFs:
import re
import subprocess

import pandas as pd

class BudgetPDFExtractor:
    def extract_tables(self, pdf_path):
        """Prefer tabula-py's table detection; fall back to pdftotext."""
        try:
            import tabula

            tables = tabula.read_pdf(
                pdf_path,
                pages="all",
                multiple_tables=True,
                pandas_options={"header": None},
            )
            return tables
        except ImportError:
            return self._fallback_extraction(pdf_path)

    def _fallback_extraction(self, pdf_path):
        # pdftotext -layout preserves column alignment, so budget line
        # items come out roughly as "<department>    $<amount>"
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            capture_output=True, text=True
        )
        budget_items = []
        for line in result.stdout.split("\n"):
            match = re.match(
                r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$', line
            )
            if match:
                budget_items.append({
                    "department": match.group(1).strip(),
                    "amount": float(match.group(2).replace(",", "")),
                })
        return pd.DataFrame(budget_items)
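To see what the fallback regex actually captures, here it is run against a few made-up pdftotext-style lines; rows without a trailing number are skipped.

```python
import re

LINE_RE = re.compile(r'^\s*(.{10,50})\s+\$?([\d,]+(?:\.\d{2})?)\s*$')

sample_lines = [
    "Police Department            $412,338,589",    # dollar sign, commas
    "Parks and Recreation          98,112,004.50",  # cents, no dollar sign
    "TOTAL",                                        # no amount -> skipped
]

items = []
for line in sample_lines:
    m = LINE_RE.match(line)
    if m:
        items.append((m.group(1).strip(),
                      float(m.group(2).replace(",", ""))))
# items: [('Police Department', 412338589.0), ('Parks and Recreation', 98112004.5)]
```

The `{10,50}` bound on the label is a blunt heuristic; expect to tune it (and the amount pattern) per city.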
State Controller Data
Some states aggregate municipal finances centrally — California's State Controller, for example, publishes city-level revenue and spending figures — which can fill gaps where a city's own site has nothing usable. These portals typically offer bulk CSV downloads rather than APIs.
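As a sketch, assuming a CSV export (the column names below are invented — check the actual download), headers can be normalized to snake_case before the rows go any further:

```python
import csv
import io

def normalize_columns(cols):
    """'  Total Revenues ' -> 'total_revenues' for consistent keys."""
    return [c.strip().lower().replace(" ", "_") for c in cols]

def parse_controller_csv(text):
    """Parse a state-controller CSV export into a list of row dicts."""
    reader = csv.reader(io.StringIO(text))
    header = normalize_columns(next(reader))
    return [dict(zip(header, row)) for row in reader]
```

Normalizing headers up front matters here because different report years often rename or reorder columns.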
Budget Comparison Analysis
def compare_city_budgets(budgets):
    """budgets maps city name -> DataFrame with department/amount columns."""
    comparisons = []
    for city, df in budgets.items():
        if "amount" in df.columns and "department" in df.columns:
            total = df["amount"].sum()
            for _, row in df.iterrows():
                comparisons.append({
                    "city": city,
                    "department": row["department"],
                    "amount": row["amount"],
                    "pct_of_total": round(row["amount"] / total * 100, 2),
                })
    result = pd.DataFrame(comparisons)

    # Flag cities whose share of spending on a department sits more than
    # two standard deviations from the cross-city mean
    for dept in result["department"].unique():
        dept_data = result[result["department"] == dept]
        if len(dept_data) >= 3:
            mean_pct = dept_data["pct_of_total"].mean()
            std_pct = dept_data["pct_of_total"].std()
            outliers = dept_data[
                abs(dept_data["pct_of_total"] - mean_pct) > 2 * std_pct
            ]
            for _, row in outliers.iterrows():
                print(f"OUTLIER: {row['city']} spends {row['pct_of_total']}% "
                      f"on {dept} (avg: {mean_pct:.1f}%)")
    return result
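One caveat on the two-sigma rule: with only a handful of cities, the outlier itself inflates the sample standard deviation, so nothing gets flagged (with five cities the largest possible z-score is about 1.79, below the threshold by construction). A toy run with eight invented cities shows the check working once the sample is large enough:

```python
import pandas as pd

# Invented police-spending shares for eight cities; city H is extreme
df = pd.DataFrame({
    "city": list("ABCDEFGH"),
    "pct_of_total": [30.0, 31.0, 29.0, 30.0, 31.0, 29.0, 30.0, 70.0],
})

mean_pct = df["pct_of_total"].mean()
std_pct = df["pct_of_total"].std()  # pandas default ddof=1 (sample std)
outliers = df[(df["pct_of_total"] - mean_pct).abs() > 2 * std_pct]
# only city H is flagged
```

For small comparison sets, a median-based rule (e.g. median absolute deviation) is more robust than mean/std.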
Scaling Across Municipalities
Scraping hundreds of municipal sites requires robust infrastructure: a rendering service for JavaScript-heavy government portals (ScraperAPI is one option), rotating proxies for large-scale crawls (ThorData offers this), and monitoring of success rates across wildly different site architectures (ScrapeOps covers that niche). Whatever the tooling, the hard part is that every city is a slightly different scraping problem.
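One pattern that keeps this manageable is a per-city registry: each municipality is a config entry naming which extraction strategy applies, so adding a city is one line rather than a new script. A minimal sketch, with hypothetical cities:

```python
# Each record names the strategy and whatever that strategy needs
CITY_SOURCES = {
    "seattle":    {"strategy": "socrata", "domain": "data.seattle.gov"},
    "smallville": {"strategy": "pdf", "url": "https://example.gov/budget.pdf"},
}

def plan_scrapes(sources):
    """Group cities by strategy so each backend can batch its work."""
    plan = {}
    for city, cfg in sources.items():
        plan.setdefault(cfg["strategy"], []).append(city)
    return plan
```

A dispatch loop then hands each group to the matching extractor (SODA client, PDF extractor, and so on).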
Building a Transparency Dashboard
def generate_transparency_report(city_name, budget_df):
    """Summarize one city's budget DataFrame into a report dict."""
    return {
        "city": city_name,
        "total_budget": budget_df["amount"].sum(),
        "departments": len(budget_df),
        "top_5_spending": budget_df.nlargest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records"),
        "bottom_5_spending": budget_df.nsmallest(5, "amount")[
            ["department", "amount"]
        ].to_dict("records"),
    }
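Fed a toy six-department budget (all numbers invented), the report fields come out like this — the dict construction below repeats the same fields inline so it runs standalone:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Police", "Fire", "Parks", "Library", "Housing", "IT"],
    "amount": [500, 300, 120, 60, 220, 90],
})

report = {
    "city": "Exampleville",
    "total_budget": int(df["amount"].sum()),            # 1290
    "departments": len(df),
    "top_5_spending": df.nlargest(5, "amount")[
        ["department", "amount"]
    ].to_dict("records"),                                # Police first
    "bottom_5_spending": df.nsmallest(5, "amount")[
        ["department", "amount"]
    ].to_dict("records"),                                # Library first
}
```

The `to_dict("records")` output serializes cleanly to JSON, so the same structure can back an API endpoint or a static dashboard.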
Ethical Scraping of Government Data
Government websites often have limited infrastructure. Rate-limit aggressively, cache responses, and consider contributing scraped data back to open data initiatives. The goal is transparency, not server overload.
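A minimal sketch of that advice — a fetcher with an on-disk cache and a hard floor on request spacing. The contact address, cache directory, and five-second delay are placeholders to adjust:

```python
import hashlib
import time
from pathlib import Path
from urllib.request import Request, urlopen

CACHE_DIR = Path(".budget_cache")
MIN_DELAY = 5.0  # seconds between live requests; err on the generous side
_last_hit = 0.0

def polite_get(url):
    """Fetch url, serving repeats from disk and spacing out live requests."""
    global _last_hit
    CACHE_DIR.mkdir(exist_ok=True)
    cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():
        return cached.read_text()    # cache hit: no server contact at all
    wait = MIN_DELAY - (time.time() - _last_hit)
    if wait > 0:
        time.sleep(wait)             # enforce the minimum spacing
    req = Request(url, headers={
        "User-Agent": "budget-research (you@example.org)",  # identify yourself
    })
    with urlopen(req, timeout=30) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    _last_hit = time.time()
    cached.write_text(text)
    return text
```

An identifying User-Agent with contact details lets an overwhelmed webmaster email you instead of blocking you.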
Municipal budget scraping is civic tech at its most impactful. Every dollar of public spending should be easily trackable by the public that funds it.