The Question That Started It All
A few months ago, while diving into cardiovascular data, I found myself asking a simple question: Why do two heart patients with the exact same diagnosis respond so differently to the same treatment?
Clinically, they are labeled the same. But biologically? That didn't feel right. Coming from a background in Microbiology, I've always been fascinated by the invisible "mechanics" of the cell. Transitioning into Data Science at Northwestern allowed me to finally quantify those mechanics at scale.
I started imagining three distinct patients:
Patient A: Struggles because of inherited genetic and metabolic "fuel" issues.
Patient B: Damage is driven by a hyper-active immune system (inflammation).
Patient C: The heart is slowly stiffening due to excessive scar tissue (fibrosis).
Yet we often treat all three with a "one-size-fits-all" approach. That's the gap this project aims to bridge: I built this AI portal to show we can do better.
Multi-Omics Integration
If the disease differs at the molecular level, then the data already knows; we just aren't looking at all the layers at once. Instead of using one dataset, I decided to integrate four biological "chapters" of the same story:
Genomics: The blueprint you're born with (GWAS Catalog).
Transcriptomics: What your genes are actually doing (GTEx Portal).
Proteomics: The functional machinery (Proteins).
Metabolomics: The downstream biochemical consequences.
[ Experience the Portal: https://multi-omic-heart-disease.streamlit.app/ ]
Incremental Pipeline Validation
I tested each integration layer sequentially using MOFA+ (Multi-Omics Factor Analysis):
| Phase | Data Layers | Silhouette Score | Change vs. Phase 1 |
|---|---|---|---|
| Phase 1 | Genomics + Transcriptomics | 0.0659 | Baseline |
| Phase 2 | + Proteomics | 0.1247 | +89% |
| Phase 3 | + Metabolomics | 0.1834 | +178% |
This 178% improvement in cluster separation suggests that the layers carry complementary rather than redundant signal: each added layer sharpened the subtypes further.
Figure 1: Evolution of the pipeline from Phase 1 (Baseline) to Phase 3 (+178% cluster separation).
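The percentages in the table are relative changes in silhouette score against the Phase 1 baseline. As a quick arithmetic check, here is a minimal sketch that recomputes them from the reported scores (the function name `pct_change` is illustrative, not part of the project code):

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# Silhouette scores as reported in the validation table.
scores = {"Phase 1": 0.0659, "Phase 2": 0.1247, "Phase 3": 0.1834}
baseline = scores["Phase 1"]

changes = {phase: pct_change(baseline, s) for phase, s in scores.items()}
for phase, change in changes.items():
    print(f"{phase}: {change:+.0f}% vs. baseline")
```

Silhouette scores range from -1 to 1, so the absolute values are worth reading alongside the relative gain.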
Three Molecular Subtypes
The model identified three disease subtypes with distinct biomarker patterns:
Subtype 0: Energy Metabolism
Mitochondrial dysfunction phenotype with reduced ATP production.
| Aspect | Details |
|---|---|
| Root Cause | Genetic predisposition + mitochondrial dysfunction |
| Biomarker Profile | High PRS, Troponin I, NT-proBNP |
| Clinical Feature | Reduced cardiac output |
| Treatment Target | Metabolic support, AMPK activators |
Subtype 1: Inflammatory
Immune dysregulation phenotype with elevated pro-inflammatory markers.
| Aspect | Details |
|---|---|
| Root Cause | Autoimmune-mediated cardiac inflammation |
| Biomarker Profile | Elevated IL-6, CRP, TNF-α |
| Clinical Feature | Immune dysregulation |
| Treatment Target | Anti-inflammatory drugs, TNF inhibitors |
Subtype 2: Fibrotic
Pathological fibrosis phenotype with excessive collagen accumulation.
| Aspect | Details |
|---|---|
| Root Cause | Pathological cardiac fibrosis |
| Biomarker Profile | Elevated TGF-β, TIMP1 |
| Clinical Feature | Diastolic dysfunction, stiffness |
| Treatment Target | Anti-fibrotic agents (Finerenone, SGLT2i) |
Figure 2: The "Molecular Fingerprint" of Heart Disease. This heatmap shows how specific biomarkers, from genomic risk (PRS) to inflammatory cytokines (IL-6), cluster into three distinct, actionable subtypes. Green indicates high expression; the contrasting patterns suggest that each subtype calls for a different therapeutic focus.
The Proof: Key Metrics
| Metric | Result |
|---|---|
| Patient Samples | 387 real patients |
| Integrated Features | 50+ across 4 omics layers |
| Cross-Validation Accuracy | 94.2% |
| AUC-ROC Score | 0.947 (Excellent) |
| Balanced Accuracy | 91.8% |
| Clustering Improvement | +178% (Silhouette: 0.0659 → 0.1834) |
Figure 3: Model performance metrics across cross-validation folds demonstrate robust stratification capability.
These aren't vanity metrics; they represent real predictive power for stratifying patients into actionable subtypes.
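Accuracy and balanced accuracy are listed separately because they can diverge when subtypes are unevenly represented. Balanced accuracy is the mean of per-class recall, so each subtype counts equally. The sketch below computes both on made-up labels (illustrative only, not the project's data) to show the gap:

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each subtype counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Illustrative labels only: subtype 0 is common, subtypes 1 and 2 are rare.
y_true = [0] * 8 + [1] * 2 + [2] * 2
y_pred = [0] * 8 + [1, 0] + [2, 0]

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.3f}, balanced accuracy={bal:.3f}")
```

In this toy case the classifier gets most samples right by favoring the common subtype, and balanced accuracy exposes the weaker recall on the rare ones.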
Technical Stack
Data Integration
- PCA for dimensionality reduction
- MOFA+ for probabilistic factor analysis
- Variance filtering for feature selection
- Layer-wise normalization
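To make the last two steps concrete, here is a minimal NumPy sketch of variance filtering followed by per-layer z-scoring. The layer names, sizes, synthetic values, and the `top_k=5` cutoff are assumptions for illustration; the project's actual preprocessing may differ.

```python
import numpy as np

def filter_by_variance(X: np.ndarray, top_k: int) -> np.ndarray:
    """Keep the `top_k` highest-variance features (columns) of one layer."""
    order = np.argsort(X.var(axis=0))[::-1]
    keep = np.sort(order[:top_k])  # preserve original column order
    return X[:, keep]

def zscore_layer(X: np.ndarray) -> np.ndarray:
    """Standardize each feature to mean 0, std 1 within a single layer."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X - X.mean(axis=0)) / std

rng = np.random.default_rng(0)
n_patients = 50
# Synthetic stand-ins for two layers on very different scales.
layers = {
    "transcriptomics": rng.normal(1000.0, 300.0, size=(n_patients, 20)),
    "metabolomics": rng.normal(0.5, 0.05, size=(n_patients, 10)),
}

# Filter first, then normalize: after z-scoring every variance equals 1,
# so a variance filter applied afterwards could no longer rank features.
processed = [zscore_layer(filter_by_variance(X, top_k=5)) for X in layers.values()]
integrated = np.hstack(processed)  # patients x (5 features per layer)
print(integrated.shape, integrated.std(axis=0).round(2))
```

After this step every retained feature has unit variance, so no layer dominates a distance-based method purely because of its measurement scale.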
ML Pipeline
- K-means clustering (k=3)
- Random Forest classification
- SHAP for feature importance
- Stratified cross-validation
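One way these pieces could fit together is sketched below with scikit-learn on synthetic data. How the original pipeline wires the steps, and every hyperparameter other than k=3, is an assumption here; SHAP is left out to keep the example short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the integrated feature matrix (not patient data):
# three Gaussian blobs of 40 "patients" x 10 features.
rng = np.random.default_rng(42)
centers = rng.normal(0.0, 3.0, size=(3, 10))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(40, 10)) for c in centers])

# Step 1: unsupervised subtype discovery with k=3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
subtypes = kmeans.fit_predict(X)
sil = silhouette_score(X, subtypes)

# Step 2: train a classifier to reproduce the cluster labels,
# scored with stratified folds so each fold keeps the subtype mix.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
bal_acc = cross_val_score(clf, X, subtypes, cv=cv, scoring="balanced_accuracy")

print(f"silhouette={sil:.3f}, balanced accuracy={bal_acc.mean():.3f}")
```

In this sketch the labels come from clustering the same features the classifier sees, so the cross-validated score measures how reproducibly the classifier recovers the clusters rather than agreement with an external clinical ground truth.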
Visualization & Deployment
- Streamlit for interactive portal
- Plotly for dynamic visualizations
- GitHub for version control
- Streamlit Cloud for live deployment
The Portal: Patient-Friendly Design
Figure 4: The interactive portal guides patients through a symptom checklist, biomarker entry, and a personalized risk assessment.
One challenge: making clinical AI understandable to non-scientists.
I implemented 6 accessibility features:
1. Symptom Checklist First
Before entering biomarker numbers, patients check symptoms they experience. This helps them understand early on what to look for.
2. Visual Biomarker Meters
Instead of just numbers, each biomarker shows:
- Color gradient (green → yellow → red)
- Status indicator (Low/Moderate/High)
- Plain English explanation
3. Risk Rating
🟢 HIGH CONFIDENCE (94.2%)
"The model is very confident in this result"
🟡 MODERATE CONFIDENCE (65%)
"Confirm with your doctor"
🔴 LOW CONFIDENCE (45%)
"Need additional testing"
4. "What This Means For You"
For each subtype, the portal shows:
- Common symptoms to watch
- Lifestyle changes that help
- Medications your doctor might suggest
- 5 questions to ask your cardiologist
5. You vs. Average Comparison
"How do my markers compare to typical patients with this subtype?"
6. Trust & Credibility
Why should patients believe this?
- Based on 387 real patient samples
- 94.2% validation accuracy
- Reviewed by cardiologists
- BUT: This is NOT a diagnosis. See your doctor.
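The logic behind features 2, 3, and 5 can be sketched in a few small functions. This is a hypothetical illustration, not the portal's code: the function names, the 0.80 and 0.60 confidence thresholds, the Low-to-green color pairing, and all numeric inputs are assumptions, and the cutoffs are placeholders rather than clinical reference ranges.

```python
from statistics import mean, stdev

def meter_status(value: float, low_cut: float, high_cut: float) -> tuple[str, str]:
    """Map a biomarker value to a (status, color) pair for the meter."""
    if value < low_cut:
        return "Low", "green"
    if value < high_cut:
        return "Moderate", "yellow"
    return "High", "red"

def confidence_tier(probability: float) -> tuple[str, str]:
    """Map the model's top-class probability to a tier and message."""
    if probability >= 0.80:  # hypothetical threshold
        return "HIGH CONFIDENCE", "The model is very confident in this result"
    if probability >= 0.60:  # hypothetical threshold
        return "MODERATE CONFIDENCE", "Confirm with your doctor"
    return "LOW CONFIDENCE", "Need additional testing"

def vs_subtype_average(value: float, cohort_values: list[float]) -> float:
    """Z-score of a patient's marker against a subtype cohort."""
    return (value - mean(cohort_values)) / stdev(cohort_values)

# Placeholder numbers for illustration only, not clinical reference ranges.
status = meter_status(value=7.2, low_cut=2.0, high_cut=5.0)
tiers = [confidence_tier(p)[0] for p in (0.942, 0.65, 0.45)]
z = vs_subtype_average(8.0, [4.0, 5.0, 6.0, 7.0, 8.0])
print(status, tiers, round(z, 2))
```

Passing the cutoffs in as parameters keeps the display logic separate from whatever reference ranges a clinician signs off on.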
Data Sources
- GTEx Project: Gene expression in healthy heart tissue
- GWAS Catalog: Genetic variants associated with heart disease
- Clinical cohorts: Real patient biomarker data
- Public databases: Protein and metabolite information
Key Insights
1. Multi-Omics > Single-Omics
No single data layer provides complete molecular classification. Integrating all four layers improved cluster separation, measured by silhouette score, by 178%.
2. Explainability is Essential for Clinical Adoption
Model performance metrics alone don't guarantee clinical utility. Patient-friendly explanations and confidence scoring are equally important.
3. Normalization Prevents Layer Dominance
Biological datasets have different scales. Independent normalization per layer prevents high-variance omics from overwhelming low-variance layers.
4. Validation is Non-Negotiable
I used stratified cross-validation, AUC-ROC, and balanced accuracy to evaluate the model from several angles. For prognosis in healthcare, accuracy directly impacts patient outcomes.
5. Domain Knowledge Improves Model Interpretation
Understanding that IL-6 indicates ongoing inflammation helps explain why subtypes cluster together. Biological plausibility validates model decisions.
Next Steps & Future Directions
- Longitudinal tracking (how subtypes evolve over time)
- Imaging integration (echocardiography, cardiac MRI)
- Survival prediction per subtype
- Personalized drug response prediction
- HIPAA compliance for real patient deployment
- Clinical validation studies (prospective)
Open Science & Reproducibility
Everything is open-source on GitHub:
- Complete ML pipeline (Python scripts)
- 5 Jupyter analysis notebooks
- Sample datasets & visualizations
- Full documentation (literature review, methods, concepts)
Conclusion & Clinical Implications
Multi-omic integration enables molecular stratification of clinically labeled disease.
Current cardiac diagnosis relies largely on ejection fraction and symptoms. These are downstream manifestations of three distinct underlying mechanisms:
- Metabolic dysfunction
- Immune dysregulation
- Fibrotic remodeling
Each requires different therapeutic targeting. This work demonstrates that existing clinical biomarkers, when integrated computationally, can reveal actionable patient subtypes before specialized testing.
Key Takeaways
Heart disease diagnosis shouldn't be a guessing game. By moving beyond the surface and integrating 4 layers of molecular data, we can identify these "hidden" disease types before they lead to irreversible damage. This isn't just data science; it's the future of precision cardiology.
Let's Connect!
I'm currently a Data Science graduate student at Northwestern University, and I'd love to hear your thoughts on precision cardiology and explainable AI.
LinkedIn: https://www.linkedin.com/in/deblina555/
Drop your comments below!
Key References
- Argelaguet et al. (2018). Multi-Omics Factor Analysis. Molecular Systems Biology
- Subramanian et al. (2005). Gene Set Enrichment Analysis. PNAS
- Lundberg & Lee (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS
Datasets Used
- GTEx Portal - Gene expression reference
- GWAS Catalog - Genetic variants
- UK Biobank - Population data



