Vhub Systems

Posted on Apr 3

How to Scrape US Healthcare Provider Data: NPPES API, Hospital Directories, and Bulk Data

#python #webscraping #data #healthcare

The NPPES NPI Registry lists more than 9 million US healthcare providers. Many healthcare data vendors charge $500-5,000/month for access to this public dataset, but the data itself is free. Here's how to use it.

What data is available

NPPES (National Plan and Provider Enumeration System) is a public CMS database of every provider, individual or organization, that has been issued a National Provider Identifier (NPI). Each record includes:

Provider name, NPI number, credentials
Practice addresses (primary + secondary)
Specialties and taxonomy codes
Phone/fax numbers
License numbers by state
Organization/group affiliations

Method 1: NPPES Free API

CMS provides a free REST API that requires no authentication:

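import requests

API_URL = "https://npiregistry.cms.hhs.gov/api/"

def search_nppes(params: dict) -> list:
    """Run one NPPES search and return its list of result records."""
    # 200 is the largest page size the API allows
    defaults = {"version": "2.1", "limit": 200, "skip": 0}

    response = requests.get(API_URL, params={**defaults, **params}, timeout=30)
    if response.status_code != 200:
        return []
    # A payload without a "results" key (for example an error message) yields []
    return response.json().get("results", [])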

# Search family medicine providers in San Francisco
providers = search_nppes({
    "taxonomy_description": "Family Medicine",
    "state": "CA",
    "city": "San Francisco",
})

for p in providers[:3]:
    basic = p.get("basic", {})
    # "addresses" can be missing or empty, so fall back to an empty dict
    addr = (p.get("addresses") or [{}])[0]
    print(f"{basic.get('first_name')} {basic.get('last_name')}, {basic.get('credential')}")
    print(f"  NPI: {p.get('number')}")
    print(f"  Phone: {addr.get('telephone_number')}")
    print(f"  Address: {addr.get('address_1')}, {addr.get('city')}, {addr.get('state')}")
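
Each result is a nested record, so it often helps to flatten it before storing or comparing providers. The sketch below is one possible shape, not an official schema: the helper name `flatten_provider` and the output keys are made up for illustration, and it assumes the v2.1 response layout (`basic`, `addresses` with an `address_purpose`, `taxonomies` with a `primary` flag).

```python
def flatten_provider(record: dict) -> dict:
    """Reduce one NPPES API result to a flat dict of commonly used fields."""
    basic = record.get("basic") or {}
    addresses = record.get("addresses") or []
    taxonomies = record.get("taxonomies") or []

    # Prefer the practice location; fall back to the first address listed.
    location = next(
        (a for a in addresses if a.get("address_purpose") == "LOCATION"),
        addresses[0] if addresses else {},
    )
    # Prefer the taxonomy flagged as primary; fall back to the first one.
    primary = next(
        (t for t in taxonomies if t.get("primary")),
        taxonomies[0] if taxonomies else {},
    )
    # Organizations carry organization_name instead of first/last name.
    person = " ".join(
        part for part in (basic.get("first_name"), basic.get("last_name")) if part
    )
    return {
        "npi": record.get("number"),
        "name": basic.get("organization_name") or person or None,
        "credential": basic.get("credential"),
        "address": location.get("address_1"),
        "city": location.get("city"),
        "state": location.get("state"),
        "phone": location.get("telephone_number"),
        "taxonomy_code": primary.get("code"),
        "taxonomy_desc": primary.get("desc"),
    }
```

With the search above, `rows = [flatten_provider(p) for p in providers]` gives a list that is ready for a CSV writer or a DataFrame.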

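Paginate through the results 200 at a time. The API documentation caps skip at 1000, so a single query can return at most 1,200 records:

import time

def get_all_providers(search_params: dict) -> list:
    all_results = []
    skip = 0

    # skip values above 1000 are rejected, so stop paging there
    while skip <= 1000:
        batch = search_nppes({**search_params, "skip": skip})
        all_results.extend(batch)
        if len(batch) < 200:
            break  # a short or empty page means no more results
        skip += 200
        time.sleep(0.5)  # pause between requests

    return all_results

# Get cardiologists in Texas (capped at 1,200 by the API; narrow the query for full coverage)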
cardiologists = get_all_providers({
    "taxonomy_description": "Cardiovascular Disease",
    "state": "TX"
})
print(f"Found {len(cardiologists):,} cardiologists in Texas")
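
When a statewide search is larger than one query can return, a common workaround is to split it into narrower queries and merge the results. The sketch below splits by city and deduplicates on the NPI number. It assumes you supply the city list yourself, and `fetch_by_city` is a hypothetical helper; the `fetch` argument is any function that takes a parameter dict and returns a list of records, such as `get_all_providers` above.

```python
from typing import Callable, Iterable


def fetch_by_city(
    base_params: dict,
    cities: Iterable[str],
    fetch: Callable[[dict], list],
) -> list:
    """Run one search per city and merge the results, keeping each NPI once."""
    seen = {}
    for city in cities:
        for record in fetch({**base_params, "city": city}):
            npi = record.get("number")
            # A provider with several addresses can match more than one city.
            if npi is not None and npi not in seen:
                seen[npi] = record
    return list(seen.values())
```

For example, `fetch_by_city({"taxonomy_description": "Cardiovascular Disease", "state": "TX"}, ["Houston", "Dallas", "Austin"], get_all_providers)` issues three paginated searches. A single city can still exceed the per-query cap, in which case a finer split (for instance by postal code) or the bulk file is the safer route.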

Method 2: Bulk NPPES monthly file

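For nationwide analysis, download the full replacement file, which CMS regenerates monthly. The ZIP is roughly 1 GB and the CSV inside is several times larger once extracted, so load only the columns you need:

import pandas as pd

# Download from: https://download.cms.gov/nppes/NPI_Files.html
# The file has 300+ columns, so read only the ones used below
columns = ["NPI", "Entity Type Code", "NPI Deactivation Date",
           "Healthcare Provider Taxonomy Code_1"]
df = pd.read_csv("npidata_pfile.csv", dtype=str, usecols=columns)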

# Active individual providers only
active = df[
    (df["Entity Type Code"] == "1") &
    df["NPI Deactivation Date"].isna()
]

# Filter by taxonomy code (Family Medicine = 207Q*)
family_med = active[
    active["Healthcare Provider Taxonomy Code_1"].str.startswith("207Q", na=False)
]

print(f"Active family medicine providers: {len(family_med):,}")
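
If the file does not fit comfortably in memory, it can also be streamed row by row with the standard library. The sketch below applies the same two filters (individual providers, no deactivation date) but checks all 15 taxonomy columns instead of only the first, since a specialty may be listed as a secondary taxonomy. The function name `iter_active_individuals` is hypothetical, and the sample rows are made up to keep the example self-contained.

```python
import csv
import io
from typing import Iterator, TextIO

# The bulk file carries up to 15 taxonomy codes per provider.
TAXONOMY_COLUMNS = [f"Healthcare Provider Taxonomy Code_{i}" for i in range(1, 16)]


def iter_active_individuals(handle: TextIO, taxonomy_prefix: str) -> Iterator[dict]:
    """Yield rows for active individuals with any taxonomy code matching the prefix."""
    for row in csv.DictReader(handle):
        if row.get("Entity Type Code") != "1":
            continue  # skip organizations and blank records
        if row.get("NPI Deactivation Date"):
            continue  # skip rows that carry a deactivation date
        if any(
            (row.get(col) or "").startswith(taxonomy_prefix)
            for col in TAXONOMY_COLUMNS
        ):
            yield row


# Tiny in-memory sample with made-up rows (the real file has 300+ columns).
sample = io.StringIO(
    "NPI,Entity Type Code,NPI Deactivation Date,"
    "Healthcare Provider Taxonomy Code_1,Healthcare Provider Taxonomy Code_2\n"
    "1000000001,1,,207Q00000X,\n"
    "1000000002,1,,207R00000X,207QA0505X\n"
    "1000000003,2,,207Q00000X,\n"
    "1000000004,1,01/15/2020,207Q00000X,\n"
)
matches = [row["NPI"] for row in iter_active_individuals(sample, "207Q")]
print(matches)
```

Against the real download, pass a file opened with `open("npidata_pfile.csv", newline="", encoding="utf-8")` in place of the sample.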

Method 3: Hospital directory scraping

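Hospital websites publish provider directories but rarely offer an export. Playwright can render the page and extract the listings. The URL path and CSS selectors below are placeholders that differ on every site: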

from playwright.async_api import async_playwright
import asyncio

async def scrape_hospital_providers(hospital_url: str) -> list:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()

            # Placeholder path and selectors: inspect the target site and adjust
            await page.goto(f"{hospital_url}/find-a-doctor")
            await page.wait_for_selector(".provider-card", timeout=10000)

            # Build one object per provider card inside the browser context
            providers = await page.evaluate(
                "Array.from(document.querySelectorAll('.provider-card'))"
                ".map(el => ({"
                "  name: el.querySelector('.provider-name')?.innerText,"
                "  specialty: el.querySelector('.specialty')?.innerText,"
                "  accepting: el.querySelector('.accepting-patients')?.innerText"
                "}))"
            )
            return providers
        finally:
            # Close the browser even if the selector never appears
            await browser.close()

providers = asyncio.run(scrape_hospital_providers("https://hospital.example.com"))
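
Scraped text usually needs cleaning before it can be matched against NPPES records. The sketch below collapses whitespace and turns the free-text "accepting" label into a boolean. The helper `normalize_card` and the phrases it looks for are assumptions; real sites word this label in many different ways, so check the phrases against each target.

```python
import re
from typing import Optional


def _clean(value: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace; return None for missing or empty text."""
    if not value:
        return None
    text = re.sub(r"\s+", " ", value).strip()
    return text or None


def normalize_card(card: dict) -> dict:
    """Tidy one scraped provider card (keys: name, specialty, accepting)."""
    label = (_clean(card.get("accepting")) or "").lower()
    if not label:
        accepting = None  # the site did not say
    elif "not accepting" in label or "not currently" in label:
        accepting = False
    elif "accepting" in label:
        accepting = True
    else:
        accepting = None  # unrecognized wording; leave undecided
    return {
        "name": _clean(card.get("name")),
        "specialty": _clean(card.get("specialty")),
        "accepting_new_patients": accepting,
    }
```

Applied to the scraper output, `[normalize_card(c) for c in providers]` yields records with consistent keys and no stray line breaks.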

Method 4: Pre-built healthcare scraper

The Healthcare Provider Scraper on Apify handles NPPES pagination, hospital directory scraping, and data normalization automatically.

Sample output:

{
  "npi": "1234567890",
  "name": "Dr. Jane Smith, MD",
  "specialty": "Family Medicine",
  "address": "123 Main St, San Francisco, CA 94102",
  "phone": "415-555-0100",
  "acceptingNewPatients": true,
  "insuranceAccepted": ["Blue Cross", "Aetna", "Medicare"],
  "languages": ["English", "Spanish"]
}
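
Whatever the source, it is worth validating NPI values before joining datasets on them. An NPI is 10 digits, and the last digit is a Luhn check digit computed over the number with the prefix 80840 prepended. The sketch below implements that check; `is_valid_npi` is a hypothetical helper name. Placeholder values such as the `1234567890` in the sample output do not pass it.

```python
def is_valid_npi(npi: str) -> bool:
    """Return True if npi is 10 digits with a correct Luhn check digit."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    # The check digit is computed over the NPI with the 80840 prefix prepended.
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk right to left, doubling every second digit (Luhn algorithm).
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(is_valid_npi("1234567893"))  # True
print(is_valid_npi("1234567890"))  # False
```

This only confirms that a value is well formed. It does not show that the NPI has been issued or is still active.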

74+ production runs. Pay-per-result pricing.

Use cases

Healthcare staffing: Find providers by specialty + accepting status + location
Insurance network analysis: Map which providers accept which plans
Market research: Provider density by specialty + region
Referral network mapping: Build provider relationship graphs from NPI affiliation data
Sales prospecting: B2B outreach to medical practices by specialty

Important notes

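NPPES API limits: 200 results per request, and skip cannot exceed 1000, so one query returns at most 1,200 records. Split larger searches by city or postal code, or use the bulk file. There is no published rate limit, but add 500ms delays for bulk queries.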
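
One way to apply the delay without scattering sleep calls through the code is a small generator that spaces out whatever it iterates over. This is a sketch: `throttled` is a hypothetical helper, and the `clock` and `sleep` parameters exist only so the timing can be replaced in tests.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(
    items: Iterable,
    min_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator:
    """Yield items no faster than one per min_interval seconds."""
    last = None
    for item in items:
        if last is not None:
            # Sleep only for the part of the interval that has not yet elapsed.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

For example, `for skip in throttled(range(0, 1001, 200)):` walks the six allowed page offsets with at least half a second between requests.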

Hospital websites: page structure varies widely from site to site. Test against each target before building production scrapers.

For HIPAA-sensitive use cases: NPPES data is public. It contains providers' professional registration data, not patient data.

Apify Scrapers Bundle — $29 one-time

35+ scrapers including the Healthcare Provider Scraper. Instant download.

DEV Community