Season Mudbhary

Posted on Jun 30

I Built an Open Healthcare Data Dictionary with 100k+ ISO-11179 Terms — Here's What I Learned

#dataengineering #dbt #database #dictionary

If you've ever worked on a healthcare data team, you've lived this problem: one system calls it mem_id, another calls it member_number, a third calls it subscriber_key — and they all mean the exact same thing.

Multiply that across claims, clinical, provider, pharmacy, and finance domains, and you get a data warehouse where nobody fully trusts a join without checking three wikis and pinging someone on Slack first.

I got tired of solving this problem from scratch on every project, so I built a standardized healthcare data dictionary — and I'm open-sourcing a sample of it.

The Core Idea: ISO-11179

ISO-11179 (formally ISO/IEC 11179) is an international standard for metadata registries: a formal way to define data elements so that "Member Identifier" means the same thing, has the same data type, and follows the same naming convention no matter which system, team, or analyst touches it.

In practice, for a healthcare data warehouse, that looks like:

| Term | Standard Column | Data Type | Definition |
| --- | --- | --- | --- |
| Member Identifier | `mbr_id` | `VARCHAR(20)` | Unique identifier assigned to a health plan member |
| Rendering Provider NPI | `rndrng_prvdr_npi` | `VARCHAR(10)` | NPI of the provider who performed the service |
| Primary Diagnosis Code | `prim_diag_cd` | `VARCHAR(7)` | Principal ICD-10-CM diagnosis code for the claim |

Once these names and types are defined in one place, every dbt model, Snowflake schema, and BI dashboard downstream can reuse them instead of inventing its own.
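To make the idea concrete, here is a minimal Python sketch of how a standard entry could drive column renames. The three synonyms are the ones from the opening example; the `STANDARD_COLUMNS` structure and the `build_rename_map` helper are assumptions for illustration, not the format the real dictionary uses.

```python
# Hypothetical registry entry. The synonyms come from the opening example in
# this post; the dictionary layout is an assumption, not the glossary's format.
STANDARD_COLUMNS = {
    "mbr_id": {
        "term": "Member Identifier",
        "data_type": "VARCHAR(20)",
        "synonyms": {"mem_id", "member_number", "subscriber_key"},
    },
}


def build_rename_map(source_columns):
    """Return {source_name: standard_name} for every column with a known synonym."""
    lookup = {
        synonym: standard
        for standard, entry in STANDARD_COLUMNS.items()
        for synonym in entry["synonyms"]
    }
    # Match case-insensitively, but keep the source name as written.
    return {col: lookup[col.lower()] for col in source_columns if col.lower() in lookup}


rename_map = build_rename_map(["MEM_ID", "claim_dt", "subscriber_key"])
print(rename_map)
```

Columns with no known synonym (here `claim_dt`) are simply left out of the map, so they can be reviewed by hand rather than renamed by guesswork.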

What I Built

A healthcare data dictionary with 100,000+ terms spanning:

Claims
Clinical
Member / Enrollment
Provider
Pharmacy
Finance
Quality (HEDIS, risk adjustment)
Laboratory
Population health
Behavioral health
Supply chain
Compliance
Technology / operations

Every term has a consistent abbreviation convention, a defined data type, and a plain-English definition — plus links to related terms and example DDL.
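As a rough illustration of what a consistent abbreviation convention means, the sketch below turns a term into a column name with a word-level lookup. The `ABBREVIATIONS` table is inferred from the column names shown in this post and is an assumption; the actual rules behind the dictionary may differ.

```python
# Hypothetical abbreviation table, inferred from column names in this post
# (mbr_id, rndrng_prvdr_npi, prim_diag_cd, clm_nbr, paid_amt, hcc_cat_cd).
ABBREVIATIONS = {
    "member": "mbr",
    "identifier": "id",
    "rendering": "rndrng",
    "provider": "prvdr",
    "primary": "prim",
    "diagnosis": "diag",
    "code": "cd",
    "claim": "clm",
    "number": "nbr",
    "amount": "amt",
    "category": "cat",
}


def to_standard_column(term):
    """Lowercase each word, abbreviate it when a rule exists, join with underscores."""
    words = term.lower().split()
    return "_".join(ABBREVIATIONS.get(word, word) for word in words)


for term in ["Member Identifier", "Rendering Provider NPI", "Primary Diagnosis Code"]:
    print(term, "->", to_standard_column(term))
```

A pure word-by-word rule does not cover everything: "National Provider Identifier" would come out as `national_prvdr_id` rather than `npi_nbr`, so well-known acronyms would need explicit overrides on top of a table like this.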

Open Sample: 100 Terms on GitHub

I pulled 100 representative terms into a public repo so you can see exactly what this looks like without needing to sign up for anything:

github.com/mdatool/healthcare-data-glossary

It's available as both a browsable Markdown table and a CSV you can drop into a spreadsheet, a dbt seed, or whatever tool you're using.

A small sample:

Claim Number                 → clm_nbr       VARCHAR(20)
National Provider Identifier → npi_nbr       VARCHAR(10)
Primary Diagnosis Code       → prim_diag_cd  VARCHAR(7)
Paid Amount                  → paid_amt      DECIMAL(12,2)
HCC Category Code            → hcc_cat_cd    VARCHAR(10)
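One way to put rows like these to work is to render them straight into DDL. The sketch below is a hedged example: the five rows are copied from the sample above, while the table name `clm_hdr` and the `render_create_table` helper are made up for illustration.

```python
# Rows copied from the sample above: (term, standard column, data type).
SAMPLE_TERMS = [
    ("Claim Number", "clm_nbr", "VARCHAR(20)"),
    ("National Provider Identifier", "npi_nbr", "VARCHAR(10)"),
    ("Primary Diagnosis Code", "prim_diag_cd", "VARCHAR(7)"),
    ("Paid Amount", "paid_amt", "DECIMAL(12,2)"),
    ("HCC Category Code", "hcc_cat_cd", "VARCHAR(10)"),
]


def render_create_table(table_name, terms):
    """Build a CREATE TABLE statement with the term name as a trailing comment."""
    width = max(len(column) for _, column, _ in terms)
    lines = []
    for i, (term, column, data_type) in enumerate(terms):
        # Every column line except the last one ends with a comma.
        comma = "," if i < len(terms) - 1 else ""
        lines.append(f"    {column.ljust(width)} {data_type}{comma}  -- {term}")
    return f"CREATE TABLE {table_name} (\n" + "\n".join(lines) + "\n);"


# "clm_hdr" is a hypothetical table name.
ddl = render_create_table("clm_hdr", SAMPLE_TERMS)
print(ddl)
```

Keeping the term name as a comment next to each column means the generated DDL stays traceable back to the dictionary entry it came from.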

The Harder Problem: Generating This From an Existing Schema

Having a clean dictionary is great for new builds. But most of us aren't starting from scratch — we're staring at a legacy claims_hdr table someone built in 2014 with column names like mem_first, provider_npi, and diag_1, and zero documentation.

So I also built a tool that takes raw DDL and automatically generates:

A data dictionary — business definitions for every column
Standardized naming — maps your existing columns to ISO-11179 conventions
PHI/PII classification — flags HIPAA-sensitive fields automatically (SSN, DOB, names, etc.)
dbt schema YAML — tests and descriptions, ready to commit
Business glossary links — ties every column back to a canonical definition
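Every one of those outputs starts from the same first step: pulling column names and types out of raw DDL. Here is a minimal sketch of that step, assuming a simple one-column-per-line `CREATE TABLE`. The `LEGACY_DDL` text, its types, and the `extract_columns` helper are all hypothetical, and this is not how the tool itself is implemented. A regex like this will misread constraint lines such as `PRIMARY KEY (...)`, so anything beyond simple schemas would need a proper SQL parser.

```python
import re

# Hypothetical legacy DDL. The column names echo the examples in this post;
# the types and the paid_amount column are made up for illustration.
LEGACY_DDL = """
CREATE TABLE claims_hdr (
    mem_first     VARCHAR(50),
    provider_npi  VARCHAR(10) NOT NULL,
    diag_1        VARCHAR(7),
    paid_amount   DECIMAL(12,2)
);
"""

# One column per line: a name, then a type with optional (precision[,scale]).
COLUMN_PATTERN = re.compile(
    r"^\s*(?P<name>[A-Za-z_][A-Za-z0-9_]*)\s+(?P<type>[A-Za-z]+(?:\(\d+(?:,\s*\d+)?\))?)",
    re.MULTILINE,
)


def extract_columns(ddl):
    """Return [(column_name, data_type)] from the body of a simple CREATE TABLE."""
    # Keep only the text between the first "(" and the last ")".
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    return [(m.group("name"), m.group("type")) for m in COLUMN_PATTERN.finditer(body)]


for name, data_type in extract_columns(LEGACY_DDL):
    print(name, data_type)
```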

Here's a real, unedited example of what it outputs when you paste in a typical claims header schema:

github.com/mdatool/healthcare-metadata-sample

The repo includes the raw input DDL (claims-schema.sql) alongside the full generated report so you can see the before/after.

A snippet of the generated dbt YAML:

- name: mbr_ssn
  description: "Government-issued Social Security Number of the member."
  meta:
    phi: true
    phi_risk: high
  tests:
    - not_null

- name: rndrng_prvdr_npi
  description: "National Provider Identifier of the provider who performed the service."
  tests:
    - not_null
    - dbt_utils.expression_is_true:
        expression: "length(rndrng_prvdr_npi) = 10"

That PHI flag on the SSN field was generated automatically from the source column name (mem_ssn), and it's the part I find most useful day-to-day. Catching sensitive fields before they end up unmasked in a BI tool is a real, recurring problem in healthcare data work.
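For a sense of how name-based PHI flagging can work, here is a small sketch that matches column names against regular expressions. The `PHI_PATTERNS` table and the `classify_phi` helper are hypothetical and cover only a few field types; they are not the tool's actual rule set.

```python
import re

# Hypothetical name patterns for a few HIPAA-sensitive fields.
# Illustration only; a real rule set would be much larger.
PHI_PATTERNS = {
    "ssn": re.compile(r"(^|_)ssn($|_)"),
    "date_of_birth": re.compile(r"(^|_)(dob|birth_dt|birth_date)($|_)"),
    "name": re.compile(r"(^|_)(first|last|frst|lst)(_(nm|name))?($|_)"),
}


def classify_phi(column_name):
    """Return the first matching PHI category for a column name, or None."""
    normalized = column_name.lower()
    for category, pattern in PHI_PATTERNS.items():
        if pattern.search(normalized):
            return category
    return None


for column in ["mem_ssn", "mem_first", "MBR_DOB", "paid_amt", "last_updated_dt"]:
    print(column, "->", classify_phi(column))
```

Note that `last_updated_dt` is flagged as a name even though it is an audit timestamp. Broad patterns catch more real PHI but also produce false positives like this one, which is the conservative-versus-precise tradeoff in miniature.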

Try It Yourself

If you want to run this against your own schema, the tool is free to use:

mdatool.com/tools/metadata-generator

And the full 100,000+ term dictionary (searchable, with production DDL examples per term) is here:

mdatool.com/glossary

There's also a small set of other free tools I built alongside this — a DDL converter for moving schemas between Snowflake/BigQuery/Databricks/SQL Server, an NPI lookup against the CMS registry, and an ICD-10 search tool with data-engineering-specific context (column naming, dbt models, data quality checks) rather than just clinical lookup.

What I'd Love Feedback On

This is very much a living project. A few things I'm actively thinking about:

Whether the abbreviation conventions I've chosen match what other healthcare data teams actually use in practice (there's no universal standard, just strong conventions)
What additional domains or term categories would be most useful to prioritize next
Whether the PHI/PII classification logic should be more conservative (flag more) or more precise (flag less, but with higher confidence)

If you work in healthcare data and have opinions on any of this — or just want to poke holes in the approach — I'd genuinely like to hear it in the comments.

If this was useful, the GitHub repos are linked above and contributions/issues are welcome. No pitch beyond that — just sharing something I built to solve a problem I kept running into.

Top comments (1)

Season Mudbhary • Jun 30

Thanks for reading! A couple of things I didn't fit into the post but wanted to mention here:

The abbreviation conventions (mbr_id, prim_diag_cd, etc.) are based on patterns I've seen across multiple health plans and PBMs I've worked with, but I know there's no single universal standard. If your org uses different conventions, I'd love to know what diverges and why.
The PHI/PII detection currently works off column name patterns and a curated list of known sensitive healthcare fields. It's not doing anything fancy with data profiling yet; that's probably the next big improvement.
If anyone tries the metadata generator on their own schema and it gets something wrong (wrong PHI flag, weird naming suggestion, etc.), I genuinely want to hear about it. That kind of feedback is way more useful to me right now than praise 🙂

Happy to answer questions on the ISO-11179 side too, since that's a standard most data engineers have heard of but maybe haven't worked with directly.