Deblina Roy

Posted on Mar 28

🧬 Not All Heart Disease Is the Same - So I Built an AI to Prove It

#machinelearning #python #datascience #healthcare

The Question That Started It All

A few months ago, while diving into cardiovascular data, I found myself asking a simple question: Why do two heart patients with the exact same diagnosis respond so differently to the same treatment?

Clinically, they are labeled the same. But biologically? That didn’t feel right. Coming from a background in Microbiology, I’ve always been fascinated by the invisible "mechanics" of the cell. Transitioning into Data Science at Northwestern allowed me to finally quantify those mechanics at scale.

I started imagining three distinct patients:

Patient A: Struggles because of inherited genetic and metabolic "fuel" issues.

Patient B: Damage is driven by a hyper-active immune system (inflammation).

Patient C: The heart is slowly stiffening due to excessive scar tissue (fibrosis).

Yet we often treat all three with a "one-size-fits-all" approach. That’s the gap this project aims to bridge: I built this AI portal to prove we can do better.

Multi-Omics Integration

If the disease differs at the molecular level, then the data already knows; we just aren’t looking at all the layers at once. Instead of using one dataset, I decided to integrate four biological "chapters" of the same story:

Genomics: The blueprint you’re born with (GWAS Catalog).

Transcriptomics: What your genes are actually doing (GTEx Portal).

Proteomics: The functional machinery (Proteins).

Metabolomics: The downstream biochemical consequences.

[ Experience the Portal: https://multi-omic-heart-disease.streamlit.app/ ]

Incremental Pipeline Validation

I tested each integration layer sequentially using MOFA+ (Multi-Omics Factor Analysis):

Phase	Data Layers	Silhouette Score	Change
Phase 1	Genomics + Transcriptomics	0.0659	Baseline
Phase 2	+ Proteomics	0.1247	+89%
Phase 3	+ Metabolomics	0.1834	+178% total

This 178% improvement in silhouette score (0.0659 → 0.1834) suggests that biological signals are not just additive; they are synergistic.
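The Change column is each phase's silhouette score measured against the Phase 1 baseline. Here is a minimal sketch of that arithmetic, using only the three scores reported in the table; the helper name `relative_change` is hypothetical and not taken from the project's code.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Silhouette scores as reported in the validation table.
scores = {
    "Phase 1": 0.0659,
    "Phase 2": 0.1247,
    "Phase 3": 0.1834,
}

baseline = scores["Phase 1"]
changes = {phase: relative_change(baseline, s) for phase, s in scores.items()}

for phase, pct in changes.items():
    print(f"{phase}: silhouette={scores[phase]:.4f}, change vs. baseline={pct:+.1f}%")
```

Every percentage is computed against Phase 1, so the Phase 3 figure is the cumulative gain rather than the gain over Phase 2.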

Figure 1: Evolution of the pipeline from Phase 1 (Baseline) to Phase 3 (+178% cluster separation).

Three Molecular Subtypes

The model identified three disease subtypes with distinct biomarker patterns:

Subtype 0: Energy Metabolism

Mitochondrial dysfunction phenotype with reduced ATP production.

Aspect	Details
Root Cause	Genetic predisposition + mitochondrial dysfunction
Biomarker Profile	High PRS, Troponin I, NT-proBNP
Clinical Feature	Reduced cardiac output
Treatment Target	Metabolic support, AMPK activators

Subtype 1: Inflammatory

Immune dysregulation phenotype with elevated pro-inflammatory markers.

Aspect	Details
Root Cause	Autoimmune-mediated cardiac inflammation
Biomarker Profile	Elevated IL-6, CRP, TNF-α
Clinical Feature	Immune dysregulation
Treatment Target	Anti-inflammatory drugs, TNF inhibitors

Subtype 2: Fibrotic

Pathological fibrosis phenotype with excessive collagen accumulation.

Aspect	Details
Root Cause	Pathological cardiac fibrosis
Biomarker Profile	Elevated TGF-β, TIMP1
Clinical Feature	Diastolic dysfunction, stiffness
Treatment Target	Anti-fibrotic agents (Finerenone, SGLT2i)

Figure 2: The "Molecular Fingerprint" of Heart Disease. This heatmap shows how specific biomarkers, from genomic risk (PRS) to inflammatory cytokines (IL-6), cluster into three distinct, actionable subtypes. Green indicates high expression; the distinct pattern in each subtype points to a different therapeutic focus.

The Proof: Key Metrics

Metric	Result
Patient Samples	387 real patients
Integrated Features	50+ across 4 omics layers
Cross-Validation Accuracy	94.2%
AUC-ROC Score	0.947 (Excellent)
Balanced Accuracy	91.8%
Clustering Improvement	+178% (Silhouette: 0.0659 → 0.1834)

Figure 3: Model performance metrics across cross-validation folds demonstrate robust stratification capability.

These aren't vanity metrics; they represent real predictive power for stratifying patients into actionable subtypes.

Technical Stack

Data Integration

PCA for dimensionality reduction
MOFA+ for probabilistic factor analysis
Variance filtering for feature selection
Layer-wise normalization
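To make the variance-filtering step concrete, here is a minimal standard-library sketch. It assumes a plain samples-by-features matrix; the function name, feature names, toy values, and the 0.01 threshold are all illustrative assumptions, not the project's actual settings.

```python
from statistics import pvariance


def variance_filter(rows, feature_names, threshold):
    """Keep only features whose variance across samples exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose: one tuple of values per feature
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [feature_names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_names, kept_rows


# Toy matrix (made-up values): 4 samples x 3 features.
feature_names = ["gene_a", "gene_b", "gene_c"]
rows = [
    [1.0, 5.0, 0.50],
    [2.0, 5.0, 0.51],
    [3.0, 5.0, 0.49],
    [4.0, 5.0, 0.50],
]

kept_names, kept_rows = variance_filter(rows, feature_names, threshold=0.01)
print(kept_names)
```

One ordering detail worth keeping in mind: if features are z-scored first, each one ends up with unit variance, so a filter like this is only informative when applied before that scaling.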

ML Pipeline

K-means clustering (k=3)
Random Forest classification
SHAP for feature importance
Stratified cross-validation
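Since clustering quality in this post is reported as a silhouette score, here is a from-scratch sketch of that metric. It is an illustration on made-up 2D points, not the pipeline's code; in practice a library routine such as `sklearn.metrics.silhouette_score` would typically be used.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    coefficients = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            coefficients.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        coefficients.append((b - a) / denom if denom else 0.0)
    return mean(coefficients)


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
print(f"good labels: {good:.3f}, bad labels: {bad:.3f}")
```

The score ranges from -1 to 1. Values near 1 mean tight, well-separated clusters, and values near 0 mean overlapping ones.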

Visualization & Deployment

Streamlit for interactive portal
Plotly for dynamic visualizations
GitHub for version control
Streamlit Cloud for live deployment

The Portal: Patient-Friendly Design

Figure 4: The interactive portal guides patients through a symptom checklist, biomarker entry, and a personalized risk assessment.

One challenge: making clinical AI understandable to non-scientists.

I implemented 6 accessibility features:

1. Symptom Checklist First

Before entering biomarker numbers, patients check symptoms they experience. This helps them understand early on what to look for.

2. Visual Biomarker Meters

Instead of just numbers, each biomarker shows:

Color gradient (green → yellow → red)
Status indicator (Low/Moderate/High)
Plain English explanation
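A meter like this boils down to mapping a number onto a status and a color. Here is a minimal sketch under stated assumptions: the function name and the two cutoffs are placeholders chosen for illustration, not clinical reference ranges and not the portal's actual thresholds.

```python
def biomarker_status(value, low_cutoff, high_cutoff):
    """Map a biomarker value to a (status, color) pair using two cutoffs."""
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be smaller than high_cutoff")
    if value < low_cutoff:
        return "Low", "green"
    if value < high_cutoff:
        return "Moderate", "yellow"
    return "High", "red"


# Placeholder cutoffs for a made-up marker; real ranges must come from
# clinical references, not from this sketch.
LOW_CUTOFF, HIGH_CUTOFF = 10.0, 50.0

for reading in (4.2, 27.0, 88.5):
    status, color = biomarker_status(reading, LOW_CUTOFF, HIGH_CUTOFF)
    print(f"{reading:>5}: {status} ({color})")
```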

3. Confidence Rating

🟢 HIGH CONFIDENCE (94.2%)
   "The model is very confident in this result"

🟡 MODERATE CONFIDENCE (65%)
   "Confirm with your doctor"

🔴 LOW CONFIDENCE (45%)
   "Need additional testing"

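The three tiers above can be sketched as a simple threshold function over the model's predicted probability. The cutoffs of 0.80 and 0.50 are assumptions, picked only so that the three example values (94.2%, 65%, 45%) land in the tiers shown; the portal's real cutoffs are not stated in this post.

```python
def confidence_tier(probability, high=0.80, moderate=0.50):
    """Return (tier, message) for a predicted class probability in [0, 1]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return "HIGH CONFIDENCE", "The model is very confident in this result"
    if probability >= moderate:
        return "MODERATE CONFIDENCE", "Confirm with your doctor"
    return "LOW CONFIDENCE", "Need additional testing"


for p in (0.942, 0.65, 0.45):
    tier, message = confidence_tier(p)
    print(f"{tier} ({p:.1%}): {message}")
```
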
4. "What This Means For You"

For each subtype, the portal shows:

Common symptoms to watch
Lifestyle changes that help
Medications your doctor might suggest
5 questions to ask your cardiologist

5. You vs. Average Comparison

"How do my markers compare to typical patients with this subtype?"

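One simple way to answer that question is a z-score against the cohort assigned to the same subtype. This is a sketch under assumptions: the cohort values are made up, and the function name and return fields are hypothetical rather than the portal's implementation.

```python
from statistics import mean, stdev


def compare_to_cohort(patient_value, cohort_values):
    """Compare one patient's marker to the cohort mean for the same subtype."""
    mu = mean(cohort_values)
    sigma = stdev(cohort_values)  # sample standard deviation
    z = (patient_value - mu) / sigma
    if z > 0:
        direction = "above"
    elif z < 0:
        direction = "below"
    else:
        direction = "at"
    return {"cohort_mean": mu, "z_score": z, "direction": direction}


# Made-up marker values for patients in one subtype.
cohort = [2.0, 4.0, 6.0, 8.0]

typical = compare_to_cohort(5.0, cohort)
elevated = compare_to_cohort(10.0, cohort)
print(typical)
print(elevated)
```
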
6. Trust & Credibility

Why should patients believe this?

Based on 387 real patient samples
94.2% validation accuracy
Reviewed by cardiologists
BUT: This is NOT a diagnosis. See your doctor.

Data Sources

GTEx Project: Gene expression in healthy heart tissue
GWAS Catalog: Genetic variants associated with heart disease
Clinical cohorts: Real patient biomarker data
Public databases: Protein and metabolite information

Key Insights

1. Multi-Omics > Single-Omics

No single data layer provides complete molecular classification. Integrating all four layers improved the silhouette score by 178% over the two-layer baseline.

2. Explainability is Essential for Clinical Adoption

Model performance metrics alone don't guarantee clinical utility. Patient-friendly explanations and confidence scoring are equally important.

3. Normalization Prevents Layer Dominance

Biological datasets have different scales. Independent normalization per layer prevents high-variance omics from overwhelming low-variance layers.
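Here is a minimal sketch of per-layer normalization, assuming it means z-scoring every feature within its own layer before the layers are concatenated. That reading is an assumption; the layer names and values below are toy data chosen to show very different scales.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each feature (column) of a samples-by-features matrix."""
    scaled_cols = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        # Constant features carry no signal; map them to 0.0 to avoid dividing by zero.
        scaled_cols.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return [list(sample) for sample in zip(*scaled_cols)]


# Toy layers for the same 3 samples, on very different scales.
layers = {
    "transcriptomics": [[1200.0, 300.0], [1500.0, 280.0], [900.0, 320.0]],
    "metabolomics": [[0.02, 0.5], [0.03, 0.4], [0.01, 0.6]],
}

# Normalize each layer independently, then concatenate features per sample.
normalized = {name: zscore_columns(rows) for name, rows in layers.items()}
n_samples = 3
integrated = [
    sum((normalized[name][i] for name in layers), []) for i in range(n_samples)
]
print(integrated[0])
```

After this step every feature has mean 0 and unit variance, so a layer measured in the thousands no longer outweighs one measured in hundredths.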

4. Validation is Non-Negotiable

I measured everything with stratified cross-validation, AUC-ROC, and balanced accuracy. For prognosis in healthcare, accuracy directly impacts patient outcomes.
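To see why balanced accuracy is worth reporting next to plain accuracy, here is a small from-scratch sketch. The labels are made up to show an imbalanced three-subtype case; they are not results from this project.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy labels: 8 patients in subtype 0, one each in subtypes 1 and 2.
y_true = [0] * 8 + [1, 2]
# A lazy model that always predicts the majority subtype.
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.3f}, balanced accuracy={bal:.3f}")
```

The always-majority model looks good on plain accuracy but scores only chance level on balanced accuracy, which is why the two numbers are reported side by side.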

5. Domain Knowledge Improves Model Interpretation

Understanding that IL-6 indicates ongoing inflammation helps explain why subtypes cluster together. Biological plausibility validates model decisions.

Next Steps & Future Directions

Longitudinal tracking (how subtypes evolve over time)
Imaging integration (echocardiography, cardiac MRI)
Survival prediction per subtype
Personalized drug response prediction
HIPAA compliance for real patient deployment
Clinical validation studies (prospective)

📚 Open Science & Reproducibility

Everything is open-source on GitHub:

Complete ML pipeline (Python scripts)
5 Jupyter analysis notebooks
Sample datasets & visualizations
Full documentation (literature review, methods, concepts)

👉 Explore the Repository

Conclusion & Clinical Implications

Multi-omic integration enables molecular stratification of clinically labeled disease.

Current cardiac diagnosis relies largely on ejection fraction and symptoms. These are downstream manifestations of three distinct underlying mechanisms:

Metabolic dysfunction
Immune dysregulation
Fibrotic remodeling

Each requires different therapeutic targeting. This work demonstrates that existing clinical biomarkers, when integrated computationally, can reveal actionable patient subtypes before specialized testing.

Key Takeaways

Heart disease diagnosis shouldn't be a guessing game. By moving beyond the surface and integrating 4 layers of molecular data, we can identify these "hidden" disease types before they lead to irreversible damage. This isn't just data science; it's the future of precision cardiology.

🤝 Let's Connect!

I'm currently a Data Science graduate student at Northwestern University, and I'd love to hear your thoughts on precision cardiology and explainable AI.

LinkedIn: https://www.linkedin.com/in/deblina555/

Drop your comments below! 👇

📖 Key References

Argelaguet et al. (2018). Multi-Omics Factor Analysis. Molecular Systems Biology
Subramanian et al. (2005). Gene Set Enrichment Analysis. PNAS
Lundberg & Lee (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS

🔗 Datasets Used

GTEx Portal - Gene expression reference
GWAS Catalog - Genetic variants
UK Biobank - Population data
