DEV Community

Cristopher Delgado


Heart Disease Prediction

Media Profiles:
LinkedIn
GitHub


Overview:

As I continued my data science journey, it was only a matter of time before I reached Machine Learning. I am proud to say I have completed a third project, one that showcases my understanding of machine learning models. In this project, which I want to share with everyone, I used six different classification models:

  1. LogisticRegression()
  2. RandomForestClassifier()
  3. KNeighborsClassifier()
  4. AdaBoostClassifier()
  5. GradientBoostingClassifier()
  6. XGBClassifier()

With these six classification models, I created baseline models and their optimized versions. In the end, I chose a single model that aligns the best with the project's main objective.
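As a rough illustration of the baseline step, here is a minimal sketch that fits each classifier with near-default settings and compares recall. It is not the project's actual code: the data is synthetic (`make_classification` stands in for the heart disease dataset), the dictionary keys and `random_state` values are my own choices, and `XGBClassifier` is only added if the separate `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in data, NOT the heart disease dataset from the project.
X, y = make_classification(n_samples=400, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline models with (near-)default hyperparameters.
baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "adaboost": AdaBoostClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# XGBClassifier comes from the xgboost package, not scikit-learn.
try:
    from xgboost import XGBClassifier

    baselines["xgboost"] = XGBClassifier(random_state=42)
except ImportError:
    pass  # skip it if xgboost is not installed

# Fit every baseline and record its recall on the held-out test set.
baseline_recall = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    baseline_recall[name] = recall_score(y_test, model.predict(X_test))

for name, score in sorted(baseline_recall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} recall = {score:.3f}")
```

The scores printed here say nothing about the real project; they only show the shape of a baseline comparison.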

Business Problem:

The main goal of this project was to address diagnostics for cardiovascular diseases (CVDs). The stakeholder is a diagnostic medical device company that wants to integrate machine learning into new or existing medical devices to provide continuous monitoring and early detection of CVD-related symptoms.

Stakeholder: Diagnostic Medical Device Company

The stakeholder wants to determine which CVD-related features are the most important to monitor. With that information, they would like to develop medical devices for at-home patient use or clinical use, and potentially incorporate machine learning algorithms into their diagnostic devices to create software as a medical device.

Methodology

  1. Clean the data, which includes casting columns to the correct data types.
  2. Handle missing values.
  3. Explore the data and view correlations.
  4. Normalize continuous data so that all features are on the same scale.
  5. Create training and testing sets to train the classification models and validate their performance.
  6. Use recall as the scoring metric to optimize the models.
  7. Observe the important features from the top-performing model.
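The scaling, splitting, and recall-driven tuning steps above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the data is synthetic, `StandardScaler` is one possible choice of normalization, and the parameter grid is hypothetical. I put the scaler inside a `Pipeline` here so that it is fit only on the training folds during cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data in place of the cleaned heart disease dataset.
X, y = make_classification(n_samples=400, n_features=11, random_state=0)

# Step 5: hold out a test set for final validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 4: scaling lives inside the pipeline (StandardScaler is an assumed choice).
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", RandomForestClassifier(random_state=0)),
    ]
)

# Hypothetical, deliberately small grid; a real search would likely be wider.
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [3, None],
}

# Step 6: optimize for recall rather than accuracy.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=3)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated recall:", round(search.best_score_, 3))
print("held-out recall:", round(search.score(X_test, y_test), 3))
```

Because `scoring="recall"` is set, `search.score` also reports recall on the held-out set, which keeps the validation metric consistent with the tuning metric.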

I optimized for recall because I wanted to maximize the model's ability to correctly identify positive instances, which means minimizing false negatives. The worst-case scenario for this model would be to classify an individual as not having heart disease when in reality they did have heart disease. In short, I wanted the model to capture as many positive cases as possible.
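To make the metric concrete, here is a small worked example of recall, defined as TP / (TP + FN). The labels are made up for illustration (1 = heart disease, 0 = no heart disease) and do not come from the project's data.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: of all patients who truly have the disease, what fraction was caught?
recall = tp / (tp + fn)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print("recall (manual): ", recall)
print("recall (sklearn):", recall_score(y_true, y_pred))
```

In this toy case one of the four sick patients is missed, so recall is 3 / 4 = 0.75. The two false positives do not affect recall at all, which is why a recall-focused model can trade extra false alarms for fewer missed cases.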

Results:

Baseline Receiver Operating Characteristic Curve:

base_roc

Optimized Receiver Operating Characteristic Curve:
op_roc
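For readers who have not built one before, the points behind an ROC curve like the ones above can be computed along these lines. This is a minimal sketch on synthetic data with a single logistic regression model; the project's own plotting code is in the repository and may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data, not the heart disease dataset.
X, y = make_classification(n_samples=400, n_features=11, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC curves need scores, not hard labels: use the positive-class probability.
scores = model.predict_proba(X_test)[:, 1]

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print(f"{len(thresholds)} thresholds evaluated, AUC = {auc:.3f}")
```

Plotting `fpr` against `tpr` for each model on the same axes gives a comparison chart; the true positive rate on the y-axis is the same quantity as recall.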

The best-performing model was the Random Forest model, which produced the following top feature importances and confusion matrix:

feature_importance

confusion_matrix
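As a sketch of how outputs like these can be pulled from a fitted Random Forest, the snippet below ranks features by `feature_importances_` and builds a confusion matrix. The data is synthetic and the feature names are placeholders (`feature_0`, `feature_1`, ...), not the dataset's real columns; the `n_estimators` value is also just an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(11)]
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=4, random_state=2
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2
)

forest = RandomForestClassifier(n_estimators=200, random_state=2)
forest.fit(X_train, y_train)

# Impurity-based importances; they sum to 1 across all features.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:5]:
    print(f"{name:12s} {importance:.3f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, forest.predict(X_test))
print(cm)
```

Impurity-based importances are quick to compute but are only one way to rank features, so they are best read as a guide to which signals deserve a closer look.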

Based on the results of my analysis, I make the following recommendations:

  1. I recommend incorporating the Random Forest model into a diagnostic medical device as software for monitoring symptoms of cardiovascular disease.
  2. I recommend developing medical devices geared towards monitoring the slope of the S-T segment of the electrocardiogram, exercise-induced angina, and S-T segment depression.

Overall Experience

This was no easy task, and there were times when I definitely got 'writer's block' due to the complexity of the task and the number of models I needed to understand. Using these classifiers required a lot of research. Now that I understand them, I can proudly say that I can reproduce a project like this one using different features and targets.

My next step would be to look into image processing and deep learning so I can analyze medical imaging and create diagnostics using that knowledge. This project was just the stepping stone to reach my next goal.


To dig deeper into this project, please see my GitHub repository, which includes more analysis of the data and the data sources.

GitHub Repo
