Vhub Systems

NPPES NPI Registry Has a Free API — But It's Useless at Scale. Here's the Fix

Last year I got a request from a medical device company: they needed 500,000 active NPI records for a CRM they were building to target specific physician specialties. "Use the government API," their tech lead said. "It's free and public."

He wasn't wrong. The API exists. It's documented. It even works — right up until you try to do anything serious with it.

Here's what actually happened when I tried to pull half a million records from the NPPES NPI Registry, and how I eventually got it done.


The API Is Real, But the Limits Will Break You

The endpoint is https://npiregistry.cms.hhs.gov/api/ and it genuinely is free, no auth required. You can filter by taxonomy code, state, name, entity type — decent query parameters for a government system. The problem is the hard cap: 1,200 records per query (limit maxes out at 200 per request, skip at 1,000), with no cursor and no deeper offset.

CMS knows about this. Their documentation mentions the limit. What it doesn't mention is that if you need, say, cardiologists across all 50 states, you're already looking at a query space that won't fit in a single pull. You have to segment by taxonomy code, then by state, then sometimes by city — and pray that no combination returns more than 1,200 matching providers. Some do. When you hit that ceiling you get exactly 1,200 records and zero indication that you missed anyone.
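The segmentation described above boils down to building a (taxonomy, state) matrix and treating each pair as one bounded query. A minimal sketch, with deliberately tiny illustrative inputs (in practice this would be the ~200 relevant NUCC codes crossed with all states and territories):

```python
from itertools import product

# Illustrative inputs only — real runs use the full code and state lists.
TAXONOMY_CODES = ["207RC0000X", "207Q00000X"]  # e.g. cardiology, family medicine
STATES = ["CA", "NY", "TX"]

# Every (taxonomy, state) pair becomes one segment small enough
# to (hopefully) stay under the 1,200-record ceiling.
segments = list(product(TAXONOMY_CODES, STATES))
```

Each segment then gets its own pagination loop, and any segment that still hits the ceiling gets split further.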

My first attempt fetched about 80,000 records before I realized I was silently missing data. Fun discovery at 11pm.


The Bulk Download Is Worse Than You Think

Before going deep on the API approach, I tried the monthly bulk export at the CMS data dissemination page. It's an 8GB ZIP. Sounds straightforward.

It wasn't.

Problem one: the ZIP was corrupted on three separate downloads over two days. Not partially — Python's zipfile module threw BadZipFile on extraction every time. I tried unzip on the command line, got a slightly better error: invalid compressed data. Eventually got a clean copy on the fourth attempt, no explanation why.
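If you hit the same corruption, it's worth CRC-checking the archive before committing to a full extraction. A small sketch using only the standard library (zipfile.testzip walks every member and returns the first one that fails its CRC):

```python
import zipfile

def verify_zip(source):
    """CRC-check every member of a ZIP before trusting it.

    Accepts a path or file-like object. Returns (ok, first_bad_member):
    first_bad_member is the name of the first corrupt entry, or None.
    """
    try:
        with zipfile.ZipFile(source) as zf:
            bad = zf.testzip()  # None if every member passes its CRC check
            return bad is None, bad
    except zipfile.BadZipFile:
        # The central directory itself is unreadable — e.g. a truncated download.
        return False, None
```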

Problem two: the data is three weeks stale by the time the file posts. For a CRM targeting active physicians, that's meaningful — NPI deactivations, address changes, new providers — all missing. The medical device company specifically needed current practice addresses. Three-week-old addresses in healthcare means a non-trivial percentage of your mail and reps' visits go to the wrong location.

Problem three: field encoding bugs. The CSV uses a mix of encodings that Python's default utf-8 reader chokes on. Specifically, I hit provider names with special characters (common in physician names) that were inconsistently encoded — some rows in latin-1, some in utf-8, some just corrupted. About 2-3% of rows needed manual handling. At 8 million records, that's 160,000+ rows you have to decide what to do with.
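One way to survive the mixed encodings is to decode row by row, preferring UTF-8 and falling back to Latin-1 (which maps every possible byte, so it never raises) while flagging the fallback rows for manual review. A sketch, assuming the Latin-1 fallback described above:

```python
def decode_row(raw: bytes) -> tuple[str, bool]:
    """Decode one raw CSV row.

    Returns (text, clean): clean is False when UTF-8 failed and the
    Latin-1 fallback fired, so the row can be quarantined for review.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Latin-1 decodes any byte sequence; the result may still be wrong,
        # which is exactly why we flag it rather than trust it.
        return raw.decode("latin-1"), False
```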

After two days on the bulk file, I went back to the API.


How I Actually Pulled 500k Records

The fix was to segment the query space by taxonomy code and iterate through every combination that could theoretically exceed the 1,200 limit. The NUCC taxonomy system has over 800 codes. In practice, I focused on the ~200 codes relevant to the client and further split high-volume ones by state.

Here's the pagination loop I used per segment:

import requests
import time

BASE_URL = "https://npiregistry.cms.hhs.gov/api/"

def fetch_npi_segment(taxonomy_code, state=None, limit=200, max_records=1200):
    results = []
    skip = 0

    while skip < max_records:
        params = {
            "version": "2.1",
            "taxonomy_description": taxonomy_code,
            "enumeration_type": "NPI-1",  # individual providers
            "limit": limit,
            "skip": skip,
        }
        if state:
            params["state"] = state

        response = requests.get(BASE_URL, params=params, timeout=15)
        response.raise_for_status()
        data = response.json()

        batch = data.get("results", [])
        if not batch:
            break

        results.extend(batch)
        skip += len(batch)

        if len(batch) < limit:
            break

        time.sleep(0.1)  # be polite

    return results

The important detail is checking if len(batch) < limit: break — if you get a partial page, you're done. If you consistently get full pages until you hit skip=1200, you've hit the ceiling and need to split that segment further (usually by state, then by city for dense metros).
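The ceiling-then-split logic can be wrapped around the fetch function itself. A sketch (fetch_with_split is a hypothetical helper; the injected fetch callable has the same shape as fetch_npi_segment above, which makes the logic easy to test without hitting the API):

```python
def fetch_with_split(fetch, taxonomy, states, cap=1200):
    """Fetch one taxonomy segment; if it returns exactly the hard cap,
    assume truncation and re-fetch split by state.

    `fetch` is any callable shaped like fetch_npi_segment(taxonomy, state).
    """
    results = fetch(taxonomy, None)
    if len(results) < cap:
        return results  # a partial final page means the segment is complete

    split_results = []
    for st in states:
        # A state that ALSO comes back at the cap would need a further
        # split (by city), following the same pattern one level down.
        split_results.extend(fetch(taxonomy, st))
    return split_results
```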


Parallelizing 400+ Requests Without Getting Blocked

Running these sequentially would have taken the better part of a day. CMS doesn't publish rate limits, but empirically I found that hammering them gets you soft-blocked — responses slow to 10-15 seconds, then start timing out.

I settled on 8 workers with a 0.1s delay per request inside each worker. That worked through the ~400 (taxonomy, state) segments in about 2 hours for 500,000 records. Real number: 487,000 after deduplication (NPI is unique, but the same provider can appear across taxonomy segments).

from concurrent.futures import ThreadPoolExecutor, as_completed

segments = []  # list of (taxonomy_code, state) tuples — ~400 combinations
all_results = []

def safe_fetch(taxonomy, state):
    try:
        return fetch_npi_segment(taxonomy, state)
    except Exception as e:
        print(f"Failed {taxonomy}/{state}: {e}")
        return []

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {
        executor.submit(safe_fetch, tax, st): (tax, st)
        for tax, st in segments
    }
    for future in as_completed(futures):
        batch = future.result()
        all_results.extend(batch)
        print(f"Total so far: {len(all_results)}")

Eight workers was the sweet spot after testing 4, 8, 12, and 16. Twelve workers started producing timeout rates above 5%, which meant retry logic complexity I didn't want to deal with. Eight workers kept timeouts under 0.5%.


The Fields You Actually Get

The API response is richer than the bulk CSV in some ways. The fields I extracted for the CRM:

  • NPI — the 10-digit identifier, always present and clean
  • provider_name — constructed from basic.first_name + basic.last_name for NPI-1, basic.organization_name for NPI-2
  • taxonomy_code — from taxonomies[0].code, plus the description
  • practice_address — addresses array, filtered for address_purpose: "LOCATION" (not mailing)
  • phone — addresses[0].telephone_number, present maybe 85% of the time
  • credential — basic.credential, things like MD, DO, NP, PA — present in about 70% of records

The practice address handling deserves a note: providers can have multiple addresses. I always took the LOCATION type over MAILING when both existed, but about 12% of records only had a mailing address. For a CRM targeting physical offices, that 12% is basically dead data. We flagged them rather than dropping them.
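The prefer-LOCATION-but-flag-MAILING rule is a few lines of code. A sketch (practice_address is a hypothetical helper name; the field names follow the NPPES API response shape described above):

```python
def practice_address(record):
    """Pick the best address from an NPPES provider record.

    Returns (address, mailing_only): prefers the LOCATION address,
    falls back to MAILING with the flag set so the record can be
    marked rather than dropped.
    """
    addresses = record.get("addresses", [])
    locations = [a for a in addresses if a.get("address_purpose") == "LOCATION"]
    if locations:
        return locations[0], False
    mailings = [a for a in addresses if a.get("address_purpose") == "MAILING"]
    if mailings:
        return mailings[0], True
    return None, True  # no usable address at all
```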


What I Shipped

After cleaning up the deduplication logic, normalizing the credential field (you get everything from "M.D." to "MD" to "md" to blank), and validating NPI check digits, the final dataset was about 480,000 usable records delivered as a normalized CSV.
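The credential normalization is simple but worth showing, since the variation ("M.D.", "MD", "md", blank) is the whole problem. A minimal sketch of one plausible approach (strip dots and whitespace, uppercase, map empties to None):

```python
import re

def normalize_credential(raw):
    """Collapse credential variants like 'M.D.', ' md ', 'D.O.' to 'MD', 'DO'.

    Returns None for blank or missing values so downstream code can
    distinguish 'unknown' from a real credential.
    """
    if not raw:
        return None
    cleaned = re.sub(r"[.\s]", "", raw).upper()
    return cleaned or None
```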

Total engineering time: about 3 days including the two days I wasted on the bulk download. Runtime for the actual pull: ~2 hours.

I ended up publishing this as an Apify actor — https://apify.com/lanky_quantifier/healthcare-npi-scraper — for teams that need fresh NPI data without the engineering overhead. You put in taxonomy codes and states, it handles the segmentation and parallelization, outputs clean JSON or CSV.


What I'd Do Differently

A few things I'd change if I were starting over:

Pre-validate your segment coverage. I built the taxonomy/state matrix by hand. I should have first pulled counts-only requests (the API returns result_count before you paginate) to identify which segments exceed 1,200 and need city-level splitting. Would have saved two hours of debugging missing data.
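Once you have per-segment counts (from the counts-only probes described above), deciding what to split is a pure function, which makes it easy to test without touching the API. A sketch (plan_segments is a hypothetical helper; it takes a dict of probed counts):

```python
def plan_segments(counts, cap=1200):
    """Partition probed segments into fetch-as-is vs. needs-splitting.

    `counts` maps (taxonomy, state) -> probed match count. Segments over
    the cap must be split further (e.g. by city) before fetching.
    """
    fetchable, needs_split = [], []
    for segment, n in counts.items():
        (needs_split if n > cap else fetchable).append(segment)
    return fetchable, needs_split
```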

Store raw responses before parsing. I parsed on the fly and lost the raw JSON. When I found the credential normalization issue later, I had to re-pull about 60,000 records.

Add NPI check digit validation upfront. NPIs use a Luhn variant. About 0.3% of records from the API had invalid check digits — likely data entry errors on the provider side. Easier to flag and quarantine early.
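The NPI check digit is the standard Luhn algorithm applied after prefixing the 10-digit NPI with "80840" (the card-issuer prefix CMS assigns to health IDs). A sketch of that validation:

```python
def valid_npi(npi: str) -> bool:
    """Validate an NPI check digit.

    Prefix the 10-digit NPI with '80840' and run the standard Luhn
    check: double every second digit from the right, subtract 9 from
    doubles over 9, and require the sum to be divisible by 10.
    """
    if len(npi) != 10 or not npi.isdigit():
        return False
    digits = [int(d) for d in "80840" + npi]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```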


Over to You

I'm genuinely curious whether others have run into this with government health data APIs specifically. The NPPES situation — functional API, useless at scale, broken bulk export — feels like a pattern I keep hitting with CMS and ONC data sources. Have you found a cleaner path through the NPPES bulk data that I missed? Or hit similar walls with PECOS, the Provider Enrollment files, or the HCPCS datasets? I'd rather hear I was doing it wrong than accept that this is just the state of health data infrastructure.
