Josh Elberg

Posted on Jun 12 • Originally published at palavir.co

Finding Anomalies in Medicare Data with DuckDB and Python

#python #database #datascience #tutorial

Public healthcare datasets are some of the largest open data you can get your hands on. The CMS Medicare Part D "by Provider and Drug" file is tens of millions of rows: one row for every combination of prescriber and drug billed to Medicare in a year. The instinct is to stand up a warehouse, write a loader, and wait. You do not need any of that. With DuckDB you can query the whole file on a laptop, in seconds, from Python.

In this tutorial we build a peer-group outlier finder over that prescriber data. The idea: group providers into fair peer groups (same specialty, same drug), compute how far each provider's billing sits from the peer-group average in standard deviations, and rank the biggest deviations. The technique is general. Swap the dataset and the peer-group definition and the same code finds outliers in procedure billing, lab volumes, or any aggregate table you have.

One thing up front, because it matters: an outlier is a lead, not a verdict. Being far from the mean means "worth a human look," nothing more. We will come back to why.

Why DuckDB

DuckDB is an in-process analytical database. It runs inside your Python process the way SQLite does, so there is no server to install and no cluster to manage. It is columnar and vectorized, which is exactly the shape that aggregate queries over wide tables want, and it reads CSV and Parquet files directly off disk or over HTTPS without an import step. For a one-machine analysis of a large public file, that combination (zero infra, reads files in place, fast on aggregates) is hard to beat.

Prerequisites and getting the data

You need Python 3.9+ and one package:

pip install duckdb

That is the entire dependency list. DuckDB ships its own engine in the wheel.

The dataset is the Medicare Part D Prescribers - by Provider and Drug Public Use File, published free by CMS at:

https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider-and-drug

From that page, open the most recent release year and download the full CSV (look for the "Download" option for the complete dataset, not the 1,000-row API preview). The file is large; a recent year is roughly 25+ million rows and a few gigabytes uncompressed. Save it somewhere with room, for example ~/data/part_d_provider_drug.csv.

The columns we use, with names taken from the CMS data dictionary:

Column	Meaning
`Prscrbr_NPI`	National Provider Identifier (the prescriber's unique ID)
`Prscrbr_Last_Org_Name`	Prescriber last name or organization name
`Prscrbr_City`	Prescriber city
`Prscrbr_State_Abrvtn`	Prescriber state
`Prscrbr_Type`	Prescriber specialty (e.g. Family Practice, Cardiology)
`Brnd_Name`	Brand name of the drug
`Gnrc_Name`	Generic name of the drug
`Tot_Clms`	Total Part D claims (prescriptions plus refills)
`Tot_30day_Fills`	Standardized 30-day fill count
`Tot_Day_Suply`	Total days supply dispensed
`Tot_Drug_Cst`	Aggregate drug cost paid across all those claims
`Tot_Benes`	Unique beneficiaries with at least one claim

Column names vary slightly by release year (CMS occasionally adjusts casing or adds fields). Check the data dictionary linked on the dataset page against the header row of the file you actually downloaded before running the queries.

Load and profile the data

Point DuckDB at the CSV. You do not import it into a table to start; you query the file directly and let the auto-detector infer types.

import duckdb

CSV = "/home/you/data/part_d_provider_drug.csv"

con = duckdb.connect()  # in-memory database

# Sanity: how many rows, how many providers, how many drugs?
overview = con.execute(f"""
    SELECT
        count(*)                         AS rows,
        count(DISTINCT Prscrbr_NPI)      AS providers,
        count(DISTINCT Gnrc_Name)        AS generic_drugs,
        count(DISTINCT Prscrbr_Type)     AS specialties
    FROM read_csv_auto('{CSV}')
""").fetchdf()

print(overview)

read_csv_auto sniffs delimiters and types from a sample. On a multi-gigabyte file the scan still takes only seconds because DuckDB reads in parallel and only touches the columns it needs.

Two more checks before trusting anything. First, look at the schema the sniffer inferred:

print(con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{CSV}')").fetchdf())

Confirm the numeric columns (Tot_Clms, Tot_Drug_Cst, Tot_Day_Suply) came back as numbers, not VARCHAR. If one was misread, force it with the columns argument or cast it in the query. If type detection looks shaky on a huge file, widen the sniff sample: read_csv_auto('{CSV}', sample_size=200000).

Second, eyeball the suppression. CMS withholds small cells: rows with fewer than 11 claims are dropped, and Tot_Benes is blank when the beneficiary count is under 11. That is not corruption, it is policy, and it shapes your analysis.

print(con.execute(f"""
    SELECT
        min(Tot_Clms)  AS min_claims,
        count(*) FILTER (WHERE Tot_Benes IS NULL) AS benes_suppressed
    FROM read_csv_auto('{CSV}')
""").fetchdf())

You should see a minimum claim count of at least 11 and a chunk of suppressed beneficiary counts. Good. Now we know the data's floor.

For repeated work, materialize it once into a DuckDB table or a Parquet file so you stop re-parsing CSV on every query:

con.execute(f"""
    CREATE TABLE part_d AS
    SELECT * FROM read_csv_auto('{CSV}', sample_size=200000)
""")

Every query below can now read from part_d instead of the raw file, which is noticeably faster.

The technique: peer-group z-scores

Raw billing totals are useless for comparison. An oncologist will out-bill a pediatrician on expensive drugs every time, and that tells you nothing. The fix is to compare each provider only against a fair peer group, then measure the gap in standard deviations.

Define the peer group as same specialty plus same drug. Within each (Prscrbr_Type, Gnrc_Name) group, compute the mean and standard deviation of a metric, then express each provider's value as a z-score:

z = (provider_value - peer_group_mean) / peer_group_stddev

A z-score of 0 is exactly average for that peer group. A z-score of 4 means the provider sits four standard deviations above peers prescribing the same drug in the same specialty. The metric we will rank on is cost per claim, which normalizes away the fact that some providers simply have larger panels.

DuckDB's window functions compute the peer statistics in a single pass, no self-join:

WITH base AS (
    SELECT
        Prscrbr_NPI,
        Prscrbr_Last_Org_Name,
        Prscrbr_State_Abrvtn,
        Prscrbr_Type,
        Gnrc_Name,
        Tot_Clms,
        Tot_Drug_Cst,
        Tot_Drug_Cst / Tot_Clms AS cost_per_claim
    FROM part_d
    WHERE Tot_Clms >= 20            -- ignore tiny, noisy denominators
),
peer AS (
    SELECT
        *,
        avg(cost_per_claim)         OVER w AS peer_mean,
        stddev_samp(cost_per_claim) OVER w AS peer_sd,
        count(*)                    OVER w AS peer_n
    FROM base
    WINDOW w AS (PARTITION BY Prscrbr_Type, Gnrc_Name)
)
SELECT
    Prscrbr_NPI,
    Prscrbr_Last_Org_Name,
    Prscrbr_State_Abrvtn,
    Prscrbr_Type,
    Gnrc_Name,
    Tot_Clms,
    round(cost_per_claim, 2) AS cost_per_claim,
    round(peer_mean, 2)      AS peer_mean,
    round((cost_per_claim - peer_mean) / peer_sd, 2) AS z_score
FROM peer
WHERE peer_n >= 30               -- only trust groups with enough peers
  AND peer_sd > 0
  AND (cost_per_claim - peer_mean) / peer_sd >= 4
ORDER BY z_score DESC
LIMIT 100;

Three guardrails are doing real work here:

Tot_Clms >= 20 keeps a provider with two flukey claims from producing a wild ratio.
peer_n >= 30 refuses to score a group with too few peers, where a "standard deviation" is meaningless.
peer_sd > 0 avoids dividing by zero when every peer billed the identical amount.

stddev_samp is the sample standard deviation, which is what you want when your group is a sample of behavior rather than the entire universe.

Why this shape, and where it breaks

Peer-group z-scores are the right tool because they answer the only question that matters for a fair comparison: unusual relative to whom? Comparing within specialty-and-drug strips out the legitimate reasons one provider bills more than another. It is cheap (one window pass), interpretable (a z-score is a sentence: "four SDs above peers"), and it scales to tens of millions of rows without a join.

The limits are just as important:

An outlier is not fraud. A high z-score can mean a specialist running a complex-case referral practice, a rural sole provider with no real peers, a coding quirk, or genuine waste. The query produces leads, full stop.
Z-scores assume a roughly bell-shaped spread. Billing distributions are skewed with fat tails, so extreme values inflate the very mean and SD you compare against. For production work, consider a robust version using the median and median absolute deviation (MAD) instead of mean and SD. DuckDB has median() and quantile_cont() built in.
Suppressed small cells are invisible. Providers below the 11-claim floor are not in the file at all, so the peer mean is computed only over providers above it.

None of that makes the method wrong. It makes it a first filter that hands a short, ranked list to a human, which is exactly its job.

Make it reusable

Wrap it in a function that takes the peer-group definition and the metric, so you can re-aim it without rewriting SQL. Pass the grouping columns and the numeric expression in, get a ranked DataFrame back.

import duckdb

def find_outliers(
    con,
    table="part_d",
    peer_cols=("Prscrbr_Type", "Gnrc_Name"),
    metric="Tot_Drug_Cst / Tot_Clms",
    min_claims=20,
    min_peers=30,
    z_threshold=4.0,
    limit=100,
):
    """Rank providers whose `metric` is z_threshold+ SDs above their peer group."""
    partition = ", ".join(peer_cols)
    sql = f"""
        WITH base AS (
            SELECT
                Prscrbr_NPI,
                Prscrbr_Last_Org_Name,
                Prscrbr_State_Abrvtn,
                {partition},
                Tot_Clms,
                ({metric}) AS metric
            FROM {table}
            WHERE Tot_Clms >= {min_claims}
        ),
        peer AS (
            SELECT *,
                avg(metric)         OVER w AS peer_mean,
                stddev_samp(metric) OVER w AS peer_sd,
                count(*)            OVER w AS peer_n
            FROM base
            WINDOW w AS (PARTITION BY {partition})
        )
        SELECT
            Prscrbr_NPI, Prscrbr_Last_Org_Name, Prscrbr_State_Abrvtn,
            {partition},
            Tot_Clms,
            round(metric, 2)                          AS metric_value,
            round(peer_mean, 2)                       AS peer_mean,
            peer_n,
            round((metric - peer_mean) / peer_sd, 2)  AS z_score
        FROM peer
        WHERE peer_n >= {min_peers}
          AND peer_sd > 0
          AND (metric - peer_mean) / peer_sd >= {z_threshold}
        ORDER BY z_score DESC
        LIMIT {limit}
    """
    return con.execute(sql).fetchdf()


con = duckdb.connect()
con.execute("CREATE TABLE part_d AS SELECT * FROM read_csv_auto('/home/you/data/part_d_provider_drug.csv', sample_size=200000)")

# Cost-per-claim outliers within specialty + drug
leads = find_outliers(con)
print(leads.head(20))

# Re-aim at a different question: days-supply per claim, peer group = drug alone
long_scripts = find_outliers(
    con,
    peer_cols=("Gnrc_Name",),
    metric="Tot_Day_Suply / Tot_Clms",
)
print(long_scripts.head(20))

Because the grouping and metric are parameters, the same function answers "who bills the most per claim for their specialty" and "who writes unusually long days-supply for this drug" with one argument change.

Export the leads for review. DuckDB writes Parquet or CSV straight from a query, so register the DataFrame and COPY it out:

con.register("leads_view", leads)
con.execute("COPY leads_view TO '/home/you/data/outlier_leads.parquet' (FORMAT parquet)")
con.execute("COPY leads_view TO '/home/you/data/outlier_leads.csv'  (HEADER, DELIMITER ',')")

Parquet keeps the column types and compresses well for handing to a downstream notebook; CSV is for the human who wants to open it in a spreadsheet.

Honest caveats

A few things to keep straight before anyone treats this output as a finding:

The data lags. CMS publishes these files annually, roughly a year behind the claims. You are analyzing last year's behavior, not this morning's.
Small cells are suppressed. Claim counts under 11 are dropped and beneficiary counts under 11 are blanked. Your peer groups are built only from providers above that floor, which biases the comparison set.
Anomalies require human review. Every guardrail in the SQL exists to cut false positives, and even so the output is a ranked list of questions, not answers. A high z-score earns a closer look from someone who understands the clinical context. It is never, by itself, an accusation about any provider.

That framing is the whole point. The product is the method, a fast, fair, reusable way to surface what is unusual in a huge public file, not any claim about a specific person.

Josh Elberg is the founder of Palavir, where he builds large public-data pipelines that turn raw federal and state datasets into queryable intelligence. He works in DuckDB on a laptop more often than in any warehouse.