John Wakaba

Building a MedAdvantage RAF Engine with dbt & PostgreSQL (Step-by-Step Guide)

In this project, I built a mini Medicare Advantage Risk Adjustment Factor (RAF) engine using PostgreSQL, dbt Core, and synthetic healthcare data.

The goal was to simulate a real-world healthcare analytics pipeline that transforms raw claims data into member-level risk scores.

This article walks through the entire process step by step, from raw CSV files to a final analytics mart.


1. Project Overview

The project models how healthcare organizations calculate risk scores using:

  • Member demographic data
  • Medical claims with ICD-10 diagnosis codes
  • Pharmacy claims with NDC drug codes
  • Reference mapping tables that convert codes into HCC (Hierarchical Condition Category) and RxHCC (prescription drug HCC) categories

The final output is a table that shows:

  • Member ID
  • Service year
  • Gender and plan
  • Total HCC weight
  • Total RxHCC weight
  • Final risk score

This type of table is commonly used for actuarial analysis, reimbursement modeling, and population health analytics.
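As a rough sketch of that shape (the column names here are illustrative, not necessarily the exact ones used in the project), the final mart can be queried like this:

```sql
-- Illustrative shape of the final mart (hypothetical column names)
select
    member_id,
    service_year,
    gender,
    plan_id,
    total_hcc_weight,
    total_rxhcc_weight,
    risk_score
from analytics.member_risk_scores;
```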


2. Tools & Environment Setup

Before starting the project, I installed and configured:

  • PostgreSQL as the data warehouse
  • DBeaver as the database client
  • Python + virtual environment (venv)
  • dbt Core with the Postgres adapter

This setup allows a modern ELT (Extract → Load → Transform) workflow where data is first loaded into the warehouse and then transformed using dbt.


3. Database Design

I used one PostgreSQL database with two schemas:

  • med_project → holds raw data and reference tables
  • analytics → holds all dbt models (staging, core, and marts)

Raw tables stored in med_project:

  • Members
  • Medical claims
  • Pharmacy claims
  • ICD-to-HCC mapping table
  • NDC-to-RxHCC mapping table

dbt models stored in analytics:

  • Staging models
  • Core models
  • Mart models (final analytics tables)

This separation keeps raw data immutable and transformed data fully governed by dbt.
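In plain PostgreSQL, the split is nothing more than two schemas (dbt creates its own objects inside analytics; this sketch only shows the layout):

```sql
-- Two-schema layout: raw/reference data vs. dbt-managed models
create schema if not exists med_project;  -- raw tables + HCC/RxHCC reference tables
create schema if not exists analytics;    -- dbt staging, core, and mart models
```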


4. Loading the Raw Data

Synthetic CSV files were generated and loaded into PostgreSQL using DBeaver’s Import Tool.

Each CSV corresponded to one raw table:

  • Members
  • Medical claims
  • Pharmacy claims

All raw table columns were initially stored as TEXT. This prevented ingestion failures during loading and deferred all data type enforcement to dbt.
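As an example, a raw medical claims table can be declared with TEXT columns only (the table and column names below are assumptions for illustration):

```sql
-- All-TEXT raw table: nothing is rejected at load time;
-- type enforcement is deferred to the dbt staging layer
create table if not exists med_project.medical_claims (
    claim_id     text,
    member_id    text,
    claim_date   text,  -- U.S.-formatted date string, e.g. 03/15/2024
    icd10_code   text,
    claim_amount text
);
```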


5. Initializing the dbt Project

A new dbt project was created inside a Python virtual environment.

During initialization:

  • PostgreSQL was selected as the adapter
  • A profile was created in profiles.yml
  • The default target schema was set to analytics

The connection was then validated (dbt debug is the usual check) to confirm that dbt could reach PostgreSQL.


6. Registering Raw Tables as dbt Sources

To allow dbt to reference raw tables safely, a source configuration file was created.

This file tells dbt:

  • Which schema the raw tables are in
  • Which tables are considered authoritative raw sources
  • Which tables act as reference data (HCC & RxHCC mappings)

This enables consistent use of dbt sources and prevents hard-coding schema names inside models.
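Once the raw tables are declared, models reference them through dbt's source() macro instead of a hard-coded schema. A minimal sketch, assuming the source is named med_project after the schema:

```sql
-- stg_members.sql (sketch): reference the raw table via the source,
-- not via a hard-coded "med_project." prefix
select *
from {{ source('med_project', 'members') }}
```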


7. Building the Staging Layer

The staging layer is where all raw data is cleaned and standardized.

At this stage, I performed:

  • Explicit parsing of U.S.-formatted dates
  • Numeric type conversions for amounts and quantities
  • Trimming and upper-casing of ICD and NDC codes
  • Basic null handling

Three staging models were created:

  • stg_members
  • stg_medical_claims
  • stg_pharmacy_claims

These models ensure that all downstream data has:

  • Consistent data types
  • Clean formats
  • Reliable date values
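A sketch of what stg_medical_claims might look like, based on the steps above (the column names and the MM/DD/YYYY format are assumptions, not the project's exact schema):

```sql
-- stg_medical_claims.sql (sketch): cast the all-TEXT raw columns to real types
select
    trim(claim_id)                          as claim_id,
    trim(member_id)                         as member_id,
    to_date(claim_date, 'MM/DD/YYYY')       as claim_date,   -- explicit U.S. date parsing
    upper(trim(icd10_code))                 as icd10_code,   -- normalized diagnosis code
    nullif(trim(claim_amount), '')::numeric as claim_amount  -- blank strings become NULL
from {{ source('med_project', 'medical_claims') }}
```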

8. Building the Core Layer

The core layer represents analytics-ready entities.

Here I focused on:

  • Creating one clean row per member
  • Passing through only validated claim and pharmacy records
  • Preparing the data for aggregation

Core models included:

  • Clean member dimension
  • Medical claim fact table
  • Pharmacy claim fact table

This layer removes duplication and creates stable structures used by the marts.
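A common way to guarantee one clean row per member is a window-function dedup. Here is a hedged sketch (the model name dim_members and the gender / plan_id columns are placeholders based on the final mart's columns):

```sql
-- dim_members.sql (sketch): keep exactly one row per member
with ranked as (
    select
        *,
        -- any deterministic tiebreaker works; member_id is used here for simplicity
        row_number() over (partition by member_id order by member_id) as rn
    from {{ ref('stg_members') }}
)
select
    member_id,
    gender,
    plan_id
from ranked
where rn = 1
```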


9. Mapping Diagnoses to HCCs

Medical claims were mapped to HCC categories using the ICD-to-HCC reference table.

At this stage:

  • Each member’s diagnosis codes were expanded
  • Codes were normalized
  • Valid diagnoses were matched to HCC categories

The output produces one record per member per year per HCC category, with an associated risk weight.
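A sketch of that mapping step (the reference table name icd_to_hcc and its columns, as well as the fact model name fct_medical_claims, are assumptions):

```sql
-- member_hccs.sql (sketch): one row per member, service year, and HCC category
select
    c.member_id,
    extract(year from c.claim_date)::int as service_year,
    m.hcc_category,
    max(m.hcc_weight)                    as hcc_weight  -- count each HCC once per member-year
from {{ ref('fct_medical_claims') }} c
join {{ source('med_project', 'icd_to_hcc') }} m
  on c.icd10_code = m.icd10_code
group by 1, 2, 3
```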


10. Computing Final Member Risk Scores

The final step combined:

  • Aggregated HCC weights from medical claims
  • Aggregated RxHCC weights from pharmacy claims
  • Member demographic attributes

For each member and service year:

  • HCC weights were summed
  • RxHCC weights were summed
  • A base score was added
  • A final risk score was computed

This produced the final analytics mart: member_risk_scores.
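Roughly, the mart can be sketched like this (member_hccs and dim_members follow the sketches above, member_rxhccs is the pharmacy-side equivalent, and the 0.0 base score is a placeholder rather than the project's actual value):

```sql
-- member_risk_scores.sql (sketch): sum HCC and RxHCC weights, then add a base score
with hcc as (
    select member_id, service_year, sum(hcc_weight) as total_hcc_weight
    from {{ ref('member_hccs') }}
    group by 1, 2
),
rxhcc as (
    select member_id, service_year, sum(rxhcc_weight) as total_rxhcc_weight
    from {{ ref('member_rxhccs') }}
    group by 1, 2
),
combined as (
    select
        coalesce(h.member_id, r.member_id)       as member_id,
        coalesce(h.service_year, r.service_year) as service_year,
        coalesce(h.total_hcc_weight, 0)          as total_hcc_weight,
        coalesce(r.total_rxhcc_weight, 0)        as total_rxhcc_weight
    from hcc h
    full outer join rxhcc r
      on h.member_id = r.member_id
     and h.service_year = r.service_year
)
select
    m.member_id,
    c.service_year,
    m.gender,
    m.plan_id,
    c.total_hcc_weight,
    c.total_rxhcc_weight,
    0.0 + c.total_hcc_weight + c.total_rxhcc_weight as risk_score  -- 0.0 = placeholder base score
from {{ ref('dim_members') }} m
join combined c
  on m.member_id = c.member_id
```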


11. Results & Key Outcomes

Results

  • A fully functional end-to-end healthcare analytics pipeline
  • Clean transformation workflow from raw CSVs to final mart
  • A production-style member-level risk score table
  • Data ready for reporting tools like Power BI or Tableau

Key Takeaways

  • dbt enforces strong modeling discipline through layered architecture
  • Reference mapping tables are the backbone of healthcare risk analytics
  • Explicit type casting prevents silent data quality issues
  • Separating the raw, staging, core, and mart layers ensures scalability and auditability

12. Key Challenges Faced & Resolved

  • PostgreSQL CSV permission issues (server-side COPY needs filesystem access the database user may not have) → Resolved by using client-side imports
  • Cross-database reference errors in dbt → Fixed by aligning the dbt database configuration
  • U.S. date format parsing errors → Solved by explicitly controlling date parsing in staging
  • Semicolon syntax errors in dbt models → Resolved by removing trailing semicolons (dbt wraps model SQL in its own statements, so a trailing semicolon breaks the compiled query)
  • Source configuration mismatches → Fixed by correcting schema references

Final Thoughts

This project demonstrates how modern analytics engineering tools can simulate real Medicare Advantage risk modeling workflows using open-source technologies.
