If you've ever worked on a healthcare data team, you've lived this problem: one system calls it mem_id, another calls it member_number, a third calls it subscriber_key — and they all mean the exact same thing.
Multiply that across claims, clinical, provider, pharmacy, and finance domains, and you get a data warehouse where nobody fully trusts a join without checking three wikis and pinging someone on Slack first.
I got tired of solving this problem from scratch on every project, so I built a standardized healthcare data dictionary — and I'm open-sourcing a sample of it.
The Core Idea: ISO-11179
ISO-11179 (formally ISO/IEC 11179) is an international standard for metadata registries. It is basically a formal way to define data elements so that "Member Identifier" means the same thing, has the same data type, and follows the same naming convention no matter which system, team, or analyst is touching it.
In practice, for a healthcare data warehouse, that looks like:
| Term | Standard Column | Data Type | Definition |
|---|---|---|---|
| Member Identifier | mbr_id | VARCHAR(20) | Unique identifier assigned to a health plan member |
| Rendering Provider NPI | rndrng_prvdr_npi | VARCHAR(10) | NPI of the provider who performed the service |
| Primary Diagnosis Code | prim_diag_cd | VARCHAR(7) | Principal ICD-10-CM diagnosis code for the claim |
Standardize this once, and every dbt model, every Snowflake schema, and every BI dashboard downstream inherits consistency for free.
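To make "inherits consistency" concrete, here is a minimal Python sketch of the idea, not the code behind the dictionary itself. The `DataElement` class, the `render_create_table` helper, and the `claim_line` table name are all hypothetical; only the three entries are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """One dictionary entry: business term, standard column, type, definition."""
    term: str
    column: str
    data_type: str
    definition: str


# Entries copied from the table above.
ELEMENTS = [
    DataElement("Member Identifier", "mbr_id", "VARCHAR(20)",
                "Unique identifier assigned to a health plan member"),
    DataElement("Rendering Provider NPI", "rndrng_prvdr_npi", "VARCHAR(10)",
                "NPI of the provider who performed the service"),
    DataElement("Primary Diagnosis Code", "prim_diag_cd", "VARCHAR(7)",
                "Principal ICD-10-CM diagnosis code for the claim"),
]


def render_create_table(table_name: str, elements: list[DataElement]) -> str:
    """Render a CREATE TABLE statement whose columns come from the dictionary."""
    width = max(len(e.column) for e in elements)  # align the type column
    cols = ",\n".join(
        f"    {e.column.ljust(width)}  {e.data_type}" for e in elements
    )
    return f"CREATE TABLE {table_name} (\n{cols}\n);"


# The table name is a made-up example, not one from the dictionary.
ddl = render_create_table("claim_line", ELEMENTS)
print(ddl)
```

Because names and types live in one place, any table rendered this way picks up the same `mbr_id VARCHAR(20)` without anyone retyping it.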
What I Built
A healthcare data dictionary with 100,000+ terms spanning:
- Claims
- Clinical
- Member / Enrollment
- Provider
- Pharmacy
- Finance
- Quality (HEDIS, risk adjustment)
- Laboratory
- Population health
- Behavioral health
- Supply chain
- Compliance
- Technology / operations
Every term has a consistent abbreviation convention, a defined data type, and a plain-English definition — plus links to related terms and example DDL.
Open Sample: 100 Terms on GitHub
I pulled 100 representative terms into a public repo so you can see exactly what this looks like without needing to sign up for anything:
github.com/mdatool/healthcare-data-glossary
It's available as both a browsable Markdown table and a CSV you can drop into a spreadsheet, a dbt seed, or whatever tool you're using.
A small sample:
Claim Number → clm_nbr VARCHAR(20)
National Provider Identifier → npi_nbr VARCHAR(10)
Primary Diagnosis Code → prim_diag_cd VARCHAR(7)
Paid Amount → paid_amt DECIMAL(12,2)
HCC Category Code → hcc_cat_cd VARCHAR(10)
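If you want to use the CSV programmatically, something like the following Python sketch would work. The header names (`term`, `column_name`, `data_type`) are my assumption for illustration, so check them against the actual file; the rows are the five sample terms above, inlined so the snippet runs on its own.

```python
import csv
import io

# Inline stand-in for the repo's CSV. The header names are assumptions;
# adjust them to match the real file before pointing this at it.
SAMPLE_CSV = """term,column_name,data_type
Claim Number,clm_nbr,VARCHAR(20)
National Provider Identifier,npi_nbr,VARCHAR(10)
Primary Diagnosis Code,prim_diag_cd,VARCHAR(7)
Paid Amount,paid_amt,"DECIMAL(12,2)"
HCC Category Code,hcc_cat_cd,VARCHAR(10)
"""


def load_glossary(handle) -> dict[str, dict[str, str]]:
    """Index glossary rows by standard column name."""
    return {row["column_name"]: row for row in csv.DictReader(handle)}


glossary = load_glossary(io.StringIO(SAMPLE_CSV))


def undocumented(columns: list[str]) -> list[str]:
    """Return the columns that have no entry in the glossary."""
    return [c for c in columns if c not in glossary]


print(glossary["paid_amt"]["data_type"])
print(undocumented(["clm_nbr", "paid_amt", "diag_1"]))
```

Swapping `io.StringIO(SAMPLE_CSV)` for an open file handle is the only change needed to run it against a downloaded copy, assuming the headers line up.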
The Harder Problem: Generating This From an Existing Schema
Having a clean dictionary is great for new builds. But most of us aren't starting from scratch — we're staring at a legacy claims_hdr table someone built in 2014 with column names like mem_first, provider_npi, and diag_1, and zero documentation.
So I also built a tool that takes raw DDL and automatically generates:
- A data dictionary — business definitions for every column
- Standardized naming — maps your existing columns to ISO-11179 conventions
- PHI/PII classification — flags HIPAA-sensitive fields automatically (SSN, DOB, names, etc.)
- dbt schema YAML — tests and descriptions, ready to commit
- Business glossary links — ties every column back to a canonical definition
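The standardized-naming step in the list above can be pictured as a synonym lookup. This Python sketch is a simplified illustration, not how the tool actually works: `SYNONYMS`, `standardize`, and `rename_plan` are hypothetical names, and the only mappings included are the ones this post mentions (the `mbr_id` variants from the intro and `mem_ssn` to `mbr_ssn` from the dbt example).

```python
from typing import Optional

# Standard name -> legacy spellings seen in the wild. Only pairs mentioned
# in the post are listed; a real table would be far larger.
SYNONYMS = {
    "mbr_id": {"mem_id", "member_number", "subscriber_key"},
    "mbr_ssn": {"mem_ssn"},
}

# Invert the table once so each legacy name points at its standard name.
LEGACY_TO_STANDARD = {
    legacy: standard
    for standard, legacy_names in SYNONYMS.items()
    for legacy in legacy_names
}


def standardize(column: str) -> Optional[str]:
    """Return the standard name for a column, or None if it is unknown."""
    key = column.strip().lower()
    if key in SYNONYMS:  # already a standard name
        return key
    return LEGACY_TO_STANDARD.get(key)


def rename_plan(columns: list[str]) -> tuple[dict[str, str], list[str]]:
    """Split columns into a rename map and a list needing human review."""
    mapped: dict[str, str] = {}
    unmapped: list[str] = []
    for col in columns:
        standard = standardize(col)
        if standard is None:
            unmapped.append(col)
        else:
            mapped[col] = standard
    return mapped, unmapped


print(rename_plan(["mem_id", "subscriber_key", "diag_1"]))
```

Returning the unmapped columns separately is a deliberate choice in this sketch: a column like `diag_1` should go to a person for review rather than get a guessed name.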
Here's a real, unedited example of what it outputs when you paste in a typical claims header schema:
github.com/mdatool/healthcare-metadata-sample
The repo includes the raw input DDL (claims-schema.sql) alongside the full generated report so you can see the before/after.
A snippet of the generated dbt YAML:
- name: mbr_ssn
  description: "Government-issued Social Security Number of the member."
  meta:
    phi: true
    phi_risk: high
  tests:
    - not_null
- name: rndrng_prvdr_npi
  description: "National Provider Identifier of the provider who performed the service."
  tests:
    - not_null
    - dbt_utils.expression_is_true:
        expression: "length(rndrng_prvdr_npi) = 10"
That PHI flag on the SSN field, generated automatically from a generic column name (mem_ssn), is the part I find most useful day-to-day — catching sensitive fields before they accidentally end up unmasked in a BI tool is a real, recurring problem in healthcare data work.
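For a feel of how name-based PHI flagging can work, here is a rough Python sketch. The patterns and the `classify_phi` and `dbt_column` helpers are hypothetical and much cruder than a production rule set; the output only mirrors the `meta: phi: true` shape from the YAML above and does not attempt the `phi_risk` level.

```python
import re

# Hypothetical patterns. A token must be bounded by underscores or the ends
# of the name, so "ssn" matches mem_ssn but not a name like assn_cd.
PHI_PATTERNS = {
    "ssn": re.compile(r"(^|_)ssn(_|$)"),
    "date_of_birth": re.compile(r"(^|_)(dob|birth_dt|birth_date)(_|$)"),
    "name": re.compile(r"(^|_)(first|last|full)_(nm|name)(_|$)"),
}


def classify_phi(column: str) -> list[str]:
    """Return the PHI categories whose pattern matches the column name."""
    name = column.lower()
    return [label for label, pattern in PHI_PATTERNS.items()
            if pattern.search(name)]


def dbt_column(column: str, description: str) -> dict:
    """Build a dbt-style column entry, adding a PHI flag when a pattern hits."""
    entry: dict = {"name": column, "description": description}
    if classify_phi(column):
        entry["meta"] = {"phi": True}
    return entry


print(classify_phi("mem_ssn"))
print(classify_phi("paid_amt"))
print(dbt_column("mem_ssn", "Social Security Number of the member."))
```

Note that these particular patterns would miss a column like `mem_first`, since it lacks a `_nm` or `_name` suffix. That gap is exactly the kind of thing a name-only approach has to keep chasing.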
Try It Yourself
If you want to run this against your own schema, the tool is free to use:
mdatool.com/tools/metadata-generator
And the full 100,000+ term dictionary (searchable, with production DDL examples per term) is here:
There's also a small set of other free tools I built alongside this — a DDL converter for moving schemas between Snowflake/BigQuery/Databricks/SQL Server, an NPI lookup against the CMS registry, and an ICD-10 search tool with data-engineering-specific context (column naming, dbt models, data quality checks) rather than just clinical lookup.
What I'd Love Feedback On
This is very much a living project. A few things I'm actively thinking about:
- Whether the abbreviation conventions I've chosen match what other healthcare data teams actually use in practice (there's no universal standard, just strong conventions)
- What additional domains or term categories would be most useful to prioritize next
- Whether the PHI/PII classification logic should be more conservative (flag more) or more precise (flag less, but with higher confidence)
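To put the conservative-versus-precise question in numbers, here is a toy Python sketch. The column list and the "truly sensitive" labels are made up for illustration (not measurements from the tool), and both rule sets are deliberately simplistic.

```python
def precision_recall(flagged: set[str],
                     truly_sensitive: set[str]) -> tuple[float, float]:
    """Precision: share of flags that are right. Recall: share of PHI caught."""
    true_pos = len(flagged & truly_sensitive)
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / len(truly_sensitive) if truly_sensitive else 1.0
    return precision, recall


# Hand-labeled toy data, invented for this example.
columns = ["mem_ssn", "mem_first", "birth_dt", "first_fill_dt", "paid_amt"]
truly_sensitive = {"mem_ssn", "mem_first", "birth_dt"}

# Conservative rule: flag on loose substrings (catches more, over-flags).
conservative = {
    c for c in columns if any(k in c for k in ("ssn", "first", "birth"))
}

# Precise rule: flag only on exact, unambiguous tokens (misses more).
precise = {c for c in columns if set(c.split("_")) & {"ssn", "dob"}}

print("conservative:", precision_recall(conservative, truly_sensitive))
print("precise:", precision_recall(precise, truly_sensitive))
```

On this toy data the conservative rule catches every sensitive column but also flags `first_fill_dt`, while the precise rule never flags wrongly but misses two of the three sensitive columns.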
If you work in healthcare data and have opinions on any of this — or just want to poke holes in the approach — I'd genuinely like to hear it in the comments.
If this was useful, the GitHub repos are linked above and contributions/issues are welcome. No pitch beyond that — just sharing something I built to solve a problem I kept running into.
Top comments (1)
Thanks for reading! A couple of things I didn't fit into the post
but wanted to mention here:
The abbreviation conventions (mbr_id, prim_diag_cd, etc.) are
based on patterns I've seen across multiple health plans and
PBMs I've worked with — but I know there's no single universal
standard. If your org uses different conventions, I'd love to
know what diverges and why.
The PHI/PII detection currently works off column name patterns.
If anyone tries the metadata generator on their own schema and
it gets something wrong (wrong PHI flag, weird naming suggestion,
etc.) — genuinely want to hear about it. That kind of feedback is
way more useful to me right now than praise 🙂
Happy to answer questions on the ISO-11179 side too, since that's
a standard most data engineers have heard of but maybe haven't
worked with directly.