Every City Has Permit Data Online. Almost None of It Has an API. Here's How I Scraped 200 Cities
Six months ago I was talking to a real estate investor who mentioned, almost in passing, that he pays $500/month for a permit data service. I asked what the data looked like. He showed me. I recognized the exact table structure — I'd scraped three city portals with that same layout the week before. That conversation sent me down a two-week rabbit hole that turned into one of the more frustrating and genuinely interesting scraping projects I've done.
The core irony: building permit data is public record. Cities are legally required to maintain it. But virtually none of them expose it via API, and every single one presents it differently. So a cottage industry of data aggregators exists to charge money for something taxpayers already own.
Here's what I learned building a unified scraper across 200 city permit portals.
The Normalization Problem Is Worse Than the Scraping Problem
I assumed the hard part would be dealing with JavaScript-heavy portals and anti-bot measures. Wrong. The hard part was that every city invented its own vocabulary.
What Denver calls permit_type, Austin calls work_class, and some ancient New Jersey portal calls CATEGORY_CD. "Issued" in one system is "Active" in another and "APPR" in a third. Contractor names live in a single field in some cities, split across contractor_first and contractor_last in others, and embedded inside a notes blob in a few truly cursed implementations.
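The fix is a per-field alias table that folds every source vocabulary into the canonical one. A minimal sketch of what that looks like for status values, using the examples above; the alias entries and the fallback behavior are illustrative, not an exhaustive mapping:

```python
# Illustrative alias table -- the real one is much larger and per-city.
STATUS_ALIASES = {
    "issued": "ISSUED",
    "active": "ISSUED",
    "appr": "ISSUED",
    "pending": "PENDING",
    "in review": "PENDING",
    "expired": "EXPIRED",
    "finaled": "FINALED",
    "final": "FINALED",
}

def normalize_status(raw: str) -> str:
    """Map a source-system status string onto the canonical vocabulary."""
    key = raw.strip().lower()
    # Fall back to the raw value (uppercased) so unmapped statuses stay
    # visible downstream instead of being silently dropped.
    return STATUS_ALIASES.get(key, raw.strip().upper())
```

The same pattern repeats for permit types and field names: one dict per city mapping its column headers onto the canonical schema keys.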
After mapping roughly 60 portals manually, I settled on a canonical schema that everything gets normalized into:
CANONICAL_PERMIT = {
    "permit_id": str,       # original ID from source system
    "permit_type": str,     # normalized: RESIDENTIAL, COMMERCIAL, ELECTRICAL, etc.
    "status": str,          # normalized: ISSUED, PENDING, EXPIRED, FINALED
    "issued_date": "YYYY-MM-DD",
    "expiry_date": "YYYY-MM-DD | None",
    "address": {
        "street": str,
        "city": str,
        "state": str,
        "zip": str,
        "full": str         # pre-joined for geocoding convenience
    },
    "parcel_id": "str | None",         # APN format varies wildly by county
    "contractor": {
        "name": "str | None",
        "license_num": "str | None"
    },
    "declared_value": "float | None",  # USD, not always present
    "description": str,
    "source_url": str,
    "scraped_at": "ISO8601 timestamp"
}
The parcel_id field alone cost me four days. Some counties use hyphens, some use spaces, some pad with zeros, some use a completely different segmentation scheme. I gave up on normalizing it and just store it as-is with the county FIPS code alongside it so downstream joins are at least possible.
The Tech Stack Breakdown (Honest Numbers)
After going through 200 portals, here's the actual breakdown of what each required:
- ~45% plain requests + BeautifulSoup — old ASP.NET WebForms, static HTML tables, basic pagination via query params. Fast, reliable, easy.
- ~35% Playwright required — Angular/React SPAs that load data via XHR after page render. Some had infinite scroll. A few used ag-Grid which is its own special nightmare.
- ~20% manual form submissions or semi-manual — portals that require you to POST a search form with a CAPTCHA, or that use multi-step wizards where state lives in hidden __VIEWSTATE fields. Some of these I semi-automated; a handful I just gave up on and flagged as "requires human session."
The ViewState ones are genuinely painful. ASP.NET WebForms serializes the entire page state into a hidden field that gets submitted back on every action. You have to parse it, preserve it, and replay it correctly or the server throws you back to step one. I lost probably six hours to one Connecticut portal before I figured out the ViewState was also being validated against a session cookie.
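The replay pattern itself is simple once you know it exists: GET the page, harvest the hidden state fields, echo them back in the POST on the same session. A sketch under those assumptions; the helper names are mine, and the field list is the usual WebForms trio:

```python
from bs4 import BeautifulSoup

# The standard WebForms hidden-state inputs; individual portals may add more.
HIDDEN_STATE_FIELDS = ("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")

def extract_hidden_state(html: str) -> dict:
    """Pull the WebForms hidden-state inputs out of a rendered page."""
    soup = BeautifulSoup(html, "html.parser")
    state = {}
    for name in HIDDEN_STATE_FIELDS:
        tag = soup.find("input", {"name": name})
        if tag is not None:
            state[name] = tag.get("value", "")
    return state

def replay_webforms_post(session, url, form_fields):
    """GET the page, then POST the search form with the state echoed back.

    The GET and POST must share one requests.Session -- some deployments
    (like that Connecticut portal) validate the ViewState against the
    session cookie, so a fresh session gets bounced back to step one.
    """
    page = session.get(url, timeout=15)
    payload = extract_hidden_state(page.text)
    payload.update(form_fields)  # the actual search-form fields
    return session.post(url, data=payload, timeout=15)
```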
A Real Parsing Example: Messy Table Structure
This is from a mid-sized Texas city portal (I'm leaving the name out because I don't want to draw attention to their infrastructure). The permit detail page uses a two-column layout table but with inconsistent colspan usage that breaks naive find_all('tr') iteration:
from bs4 import BeautifulSoup
import requests
import re

def parse_permit_detail(url: str, session: requests.Session) -> dict:
    resp = session.get(url, timeout=15)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Target the detail table — identified by a stable summary attribute
    table = soup.find("table", {"summary": re.compile("permit detail", re.I)})

    fields = {}
    rows = table.find_all("tr")
    for row in rows:
        cells = row.find_all(["td", "th"])
        # Skip rows that are purely layout spacers
        if len(cells) < 2:
            continue
        # Handle colspan=3 "section header" rows — they have no paired value
        if any(int(c.get("colspan", 1)) > 2 for c in cells):
            continue
        # Normal label/value pairs — label is always first cell
        label = cells[0].get_text(strip=True).rstrip(":")
        value = cells[1].get_text(strip=True)
        if label:
            fields[label] = value
    return fields
This is stripped down — the real version also handles the case where a single row has four cells (two label/value pairs side by side), which about a third of the rows use. You find that out only after wondering why you're getting half the fields you expect.
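Folding the four-cell case in is mostly a matter of walking the cell texts pairwise instead of assuming exactly one pair per row. A sketch, assuming labels and values still alternate left to right; the helper name and the pre-extracted-text signature are mine:

```python
def extract_pairs(cell_texts):
    """Yield (label, value) pairs from one row's cell texts.

    Handles both the normal 2-cell rows and the 4-cell rows where two
    label/value pairs sit side by side in a single <tr>. Call it with
    [c.get_text(strip=True) for c in cells] from the loop above.
    """
    for i in range(0, len(cell_texts) - 1, 2):
        label = cell_texts[i].rstrip(":")
        value = cell_texts[i + 1]
        if label:  # skip padding cells with empty labels
            yield (label, value)
```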
The Accela Discovery: 40 Cities, One Adapter
About three weeks in, I started noticing patterns. Same URL structure: /CitizenAccess/. Same table class names. Same pagination behavior. Same field names in the HTML.
Accela Automation is a SaaS platform that municipalities license for permitting workflows. It turns out roughly 40 of the 200 cities I was targeting run on Accela — including some reasonably large ones like Sacramento and Tempe.
Once I realized this, I stopped writing city-specific scrapers and wrote one Accela adapter. Accela has some Ajax endpoints that are consistent across deployments (though not documented, obviously). The base URL changes; almost everything else doesn't. That single adapter covers about 20% of my total city coverage and was probably the highest-leverage thing I did on this project.
The catch: Accela deployments aren't identical. Individual cities configure custom fields, enable or disable modules, and sometimes run different versions. I handle this with a config dict per city that specifies which optional fields to attempt and a few behavioral flags. But the core parsing logic is shared.
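Roughly the shape that config takes; everything here is illustrative (the base URLs are placeholders, and the flag names are my own, not Accela's):

```python
# Hypothetical per-city Accela config -- field names and flags are mine.
ACCELA_CITIES = {
    "city_a": {
        "base_url": "https://aca.city-a.example/CitizenAccess",  # placeholder
        "optional_fields": ["parcel_id", "declared_value"],
        "paginate_via_postback": True,
    },
    "city_b": {
        "base_url": "https://aca.city-b.example/CitizenAccess",  # placeholder
        "optional_fields": ["parcel_id"],
        "paginate_via_postback": False,
    },
}

def build_field_plan(city: str) -> list:
    """Core fields are always attempted; optional ones come from config,
    so a field missing from one deployment never breaks the shared parser."""
    core = ["permit_id", "permit_type", "status", "issued_date", "address"]
    return core + ACCELA_CITIES[city]["optional_fields"]
```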
If I'd known about Accela at the start, I would have targeted high-Accela-density regions first and built momentum faster.
What the Data Actually Looks Like at Scale
Running across 200 cities, a typical overnight batch pulls around 40,000–80,000 permit records depending on how many portals have recent activity. Raw runtime is about 4–6 hours with 8 parallel workers, though the Playwright-dependent portals are the bottleneck — they're roughly 6x slower than the requests-based ones.
Error rate I actually track: on any given run, about 7–9% of cities fail completely (portal is down, structure changed, got rate-limited). Another 12% return partial data. So real-world coverage on a given day is around 80% of target cities. Anyone who tells you scraping infrastructure is 99.9% reliable is selling you something.
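Tracking this is trivial but worth doing explicitly. A minimal sketch of the per-run summary, assuming each city's outcome is labeled "ok", "partial", or "failed" (my own labels, not from any framework):

```python
from collections import Counter

def coverage_report(run_results: dict) -> dict:
    """Summarize one batch run from a {city: outcome} mapping."""
    tally = Counter(run_results.values())
    total = len(run_results)
    full = tally["ok"]
    any_data = tally["ok"] + tally["partial"]
    return {
        "total": total,
        "failed": tally["failed"],
        "partial": tally["partial"],
        "full_pct": round(100 * full / total, 1) if total else 0.0,
        "any_data_pct": round(100 * any_data / total, 1) if total else 0.0,
    }
```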
The field with the worst fill rate is declared_value — only about 55% of permits across all cities include a dollar value. Contractors are present in about 78% of records. parcel_id comes in around 62%.
The Production Version
I eventually cleaned this up and published the multi-city version as an Apify actor: https://apify.com/lanky_quantifier/building-permits-scraper — currently covers the 50 highest-volume metros with scheduled runs and the normalized schema above. Useful if you need the data but don't want to maintain 50 individual scrapers that break every time a city redesigns their portal (which happens more than you'd think).
What I'd Do Differently
I'd start by cataloging which platform each target city uses before writing any scrapers. Accela, Tyler Technologies Energov, and CSS (Community Development Software) together probably cover 30–35% of US municipalities. Identifying platform clusters first would have saved me easily a week of redundant work.
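That cataloging step can be partly automated with cheap fingerprinting. A sketch of the idea; the markers are heuristics from portals I looked at (the /CitizenAccess/ path for Accela, ViewState for generic WebForms), so treat them as assumptions rather than a spec:

```python
def guess_platform(portal_url: str, html: str = "") -> str:
    """Rough platform fingerprint from the URL path and page markup.

    Heuristic markers only -- a real catalog needs manual confirmation.
    """
    url = portal_url.lower()
    body = html.lower()
    if "/citizenaccess" in url or "accela" in body:
        return "accela"
    if "energov" in url or "energov" in body:
        return "tyler-energov"
    if "__viewstate" in body:
        return "aspnet-webforms"  # WebForms for sure, vendor unknown
    return "unknown"
```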
I'd also build the normalization layer earlier. I was retrofitting canonical field mappings onto scrapers I'd already written, which meant touching everything twice.
The Question I'm Still Sitting With
The Accela adapter works, but it's fragile in ways I don't fully understand yet. Version differences between Accela deployments cause silent failures — fields that exist in one city's instance just aren't present in another's, and I don't always know why.
Has anyone else scraped Accela-based portals at scale? Specifically: have you found a way to reliably detect which Accela version a given deployment is running, or a more stable endpoint pattern than what's exposed in the citizen-facing UI? I've been poking at their API documentation (which is for licensed integrators, not the public) but haven't found a clean answer. Curious what others have run into.