John Wakaba

Building a MedAdvantage RAF Engine with dbt & PostgreSQL (Step-by-Step Guide)

In this project, I built a mini Medicare Advantage Risk Adjustment Factor (RAF) engine using PostgreSQL, dbt Core, and synthetic healthcare data.

The goal was to simulate a real-world healthcare analytics pipeline that transforms raw claims data into member-level risk scores.

This article walks through the entire process step by step, from raw CSV files to a final analytics mart.


1. Project Overview

The project models how healthcare organizations calculate risk scores using:

  • Member demographic data
  • Medical claims with ICD-10 diagnosis codes
  • Pharmacy claims with NDC drug codes
  • Reference mapping tables that convert codes into HCC (Hierarchical Condition Category) and RxHCC (prescription drug HCC) categories

The final output is a table that shows:

  • Member ID
  • Service year
  • Gender and plan
  • Total HCC weight
  • Total RxHCC weight
  • Final risk score

This type of table is commonly used for actuarial analysis, reimbursement modeling, and population health analytics.
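As a rough sketch of that shape (the column names here are illustrative, not necessarily the exact ones used in the project), the final mart can be queried like this:

```sql
-- Illustrative shape of the final mart (hypothetical column names)
select
    member_id,
    service_year,
    gender,
    plan_id,
    total_hcc_weight,
    total_rxhcc_weight,
    risk_score
from analytics.member_risk_scores;
```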


2. Tools & Environment Setup

Before starting the project, I installed and configured:

  • PostgreSQL as the data warehouse
  • DBeaver as the database client
  • Python + virtual environment (venv)
  • dbt Core with the Postgres adapter

This setup allows a modern ELT (Extract → Load → Transform) workflow where data is first loaded into the warehouse and then transformed using dbt.


3. Database Design

I used one PostgreSQL database with two schemas:

  • med_project → holds raw data and reference tables
  • analytics → holds all dbt models (staging, core, and marts)

Raw tables stored in med_project:

  • Members
  • Medical claims
  • Pharmacy claims
  • ICD-to-HCC mapping table
  • NDC-to-RxHCC mapping table

dbt models stored in analytics:

  • Staging models
  • Core models
  • Mart models (final analytics tables)

This separation keeps raw data immutable and transformed data fully governed by dbt.
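In plain PostgreSQL, the split is nothing more than two schemas (dbt creates its own objects inside analytics; this sketch only shows the layout):

```sql
-- Two-schema layout: raw/reference data vs. dbt-managed models
create schema if not exists med_project;  -- raw tables + HCC/RxHCC reference tables
create schema if not exists analytics;    -- dbt staging, core, and mart models
```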


4. Loading the Raw Data

Synthetic CSV files were generated and loaded into PostgreSQL using DBeaver’s Import Tool.

Each CSV corresponded to one raw table:

  • Members
  • Medical claims
  • Pharmacy claims

All raw table columns were initially stored as TEXT. This prevented ingestion failures during loading and deferred all data type enforcement to dbt.
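As an example, a raw medical claims table can be declared with TEXT columns only (the table and column names below are assumptions for illustration):

```sql
-- All-TEXT raw table: nothing is rejected at load time;
-- type enforcement is deferred to the dbt staging layer
create table if not exists med_project.medical_claims (
    claim_id     text,
    member_id    text,
    claim_date   text,  -- U.S.-formatted date string, e.g. 03/15/2024
    icd10_code   text,
    claim_amount text
);
```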


5. Initializing the dbt Project

A new dbt project was created inside a Python virtual environment.

During initialization:

  • PostgreSQL was selected as the adapter
  • A profile was created in profiles.yml
  • The default target schema was set to analytics

The connection was then validated (dbt debug is the usual check) to confirm that dbt could reach PostgreSQL.


6. Registering Raw Tables as dbt Sources

To allow dbt to reference raw tables safely, a source configuration file was created.

This file tells dbt:

  • Which schema the raw tables are in
  • Which tables are considered authoritative raw sources
  • Which tables act as reference data (HCC & RxHCC mappings)

This enables consistent use of dbt sources and prevents hard-coding schema names inside models.
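Once the raw tables are declared, models reference them through dbt's source() macro instead of a hard-coded schema. A minimal sketch, assuming the source is named med_project after the schema:

```sql
-- stg_members.sql (sketch): reference the raw table via the source,
-- not via a hard-coded "med_project." prefix
select *
from {{ source('med_project', 'members') }}
```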


7. Building the Staging Layer

The staging layer is where all raw data is cleaned and standardized.

At this stage, I performed:

  • Explicit parsing of U.S.-formatted dates
  • Numeric type conversions for amounts and quantities
  • Trimming and upper-casing of ICD and NDC codes
  • Basic null handling

Three staging models were created:

  • stg_members
  • stg_medical_claims
  • stg_pharmacy_claims

These models ensure that all downstream data has:

  • Consistent data types
  • Clean formats
  • Reliable date values
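A sketch of what stg_medical_claims might look like, based on the steps above (the column names and the MM/DD/YYYY format are assumptions, not the project's exact schema):

```sql
-- stg_medical_claims.sql (sketch): cast the all-TEXT raw columns to real types
select
    trim(claim_id)                          as claim_id,
    trim(member_id)                         as member_id,
    to_date(claim_date, 'MM/DD/YYYY')       as claim_date,   -- explicit U.S. date parsing
    upper(trim(icd10_code))                 as icd10_code,   -- normalized diagnosis code
    nullif(trim(claim_amount), '')::numeric as claim_amount  -- blank strings become NULL
from {{ source('med_project', 'medical_claims') }}
```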

8. Building the Core Layer

The core layer represents analytics-ready entities.

Here I focused on:

  • Creating one clean row per member
  • Passing through only validated claim and pharmacy records
  • Preparing the data for aggregation

Core models included:

  • Clean member dimension
  • Medical claim fact table
  • Pharmacy claim fact table

This layer removes duplication and creates stable structures used by the marts.
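A common way to guarantee one clean row per member is a window-function dedup. Here is a hedged sketch (the model name dim_members and the gender / plan_id columns are placeholders based on the final mart's columns):

```sql
-- dim_members.sql (sketch): keep exactly one row per member
with ranked as (
    select
        *,
        -- any deterministic tiebreaker works; member_id is used here for simplicity
        row_number() over (partition by member_id order by member_id) as rn
    from {{ ref('stg_members') }}
)
select
    member_id,
    gender,
    plan_id
from ranked
where rn = 1
```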


9. Mapping Diagnoses to HCCs

Medical claims were mapped to HCC categories using the ICD-to-HCC reference table.

At this stage:

  • Each member’s diagnosis codes were expanded
  • Codes were normalized
  • Valid diagnoses were matched to HCC categories

The output produces one record per member per year per HCC category, with an associated risk weight.
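A sketch of that mapping step (the reference table name icd_to_hcc and its columns, as well as the fact model name fct_medical_claims, are assumptions):

```sql
-- member_hccs.sql (sketch): one row per member, service year, and HCC category
select
    c.member_id,
    extract(year from c.claim_date)::int as service_year,
    m.hcc_category,
    max(m.hcc_weight)                    as hcc_weight  -- count each HCC once per member-year
from {{ ref('fct_medical_claims') }} c
join {{ source('med_project', 'icd_to_hcc') }} m
  on c.icd10_code = m.icd10_code
group by 1, 2, 3
```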


10. Computing Final Member Risk Scores

The final step combined:

  • Aggregated HCC weights from medical claims
  • Aggregated RxHCC weights from pharmacy claims
  • Member demographic attributes

For each member and service year:

  • HCC weights were summed
  • RxHCC weights were summed
  • A base score was added
  • A final risk score was computed

This produced the final analytics mart: member_risk_scores.
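Roughly, the mart can be sketched like this (member_hccs and dim_members follow the sketches above, member_rxhccs is the pharmacy-side equivalent, and the 0.0 base score is a placeholder rather than the project's actual value):

```sql
-- member_risk_scores.sql (sketch): sum HCC and RxHCC weights, then add a base score
with hcc as (
    select member_id, service_year, sum(hcc_weight) as total_hcc_weight
    from {{ ref('member_hccs') }}
    group by 1, 2
),
rxhcc as (
    select member_id, service_year, sum(rxhcc_weight) as total_rxhcc_weight
    from {{ ref('member_rxhccs') }}
    group by 1, 2
),
combined as (
    select
        coalesce(h.member_id, r.member_id)       as member_id,
        coalesce(h.service_year, r.service_year) as service_year,
        coalesce(h.total_hcc_weight, 0)          as total_hcc_weight,
        coalesce(r.total_rxhcc_weight, 0)        as total_rxhcc_weight
    from hcc h
    full outer join rxhcc r
      on h.member_id = r.member_id
     and h.service_year = r.service_year
)
select
    m.member_id,
    c.service_year,
    m.gender,
    m.plan_id,
    c.total_hcc_weight,
    c.total_rxhcc_weight,
    0.0 + c.total_hcc_weight + c.total_rxhcc_weight as risk_score  -- 0.0 = placeholder base score
from {{ ref('dim_members') }} m
join combined c
  on m.member_id = c.member_id
```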


11. Results & Key Outcomes

Results

  • A fully functional end-to-end healthcare analytics pipeline
  • Clean transformation workflow from raw CSVs to final mart
  • A production-style member-level risk score table
  • Data ready for reporting tools like Power BI or Tableau

Key Takeaways

  • dbt enforces strong modeling discipline through layered architecture
  • Reference mapping tables are the backbone of healthcare risk analytics
  • Explicit type casting prevents silent data quality issues
  • Separating the raw, staging, core, and mart layers ensures scalability and auditability

12. Key Challenges Faced & Resolved

  • PostgreSQL CSV permission issues (server-side COPY needs filesystem access the database user may not have) → Resolved by using client-side imports
  • Cross-database reference errors in dbt → Fixed by aligning the dbt database configuration
  • U.S. date format parsing errors → Solved by explicitly controlling date parsing in staging
  • Semicolon syntax errors in dbt models → Resolved by removing trailing semicolons (dbt wraps model SQL in its own statements, so a trailing semicolon breaks the compiled query)
  • Source configuration mismatches → Fixed by correcting schema references

Final Thoughts

This project demonstrates how modern analytics engineering tools can simulate real Medicare Advantage risk modeling workflows using open-source technologies.
