Heart Disease Analysis by using Machine Learning.

Mohammad Sakib Mahmood

Heart disease refers to a group of conditions that affect the heart. Diseases under the heart disease umbrella include blood vessel diseases such as coronary artery disease, myocardial infarction, heart failure, heart rhythm problems (arrhythmias), and heart defects you’re born with (congenital heart defects), among others. Risk factors for heart disease include the following:

  1. Being overweight
  2. High blood pressure
  3. High cholesterol level
  4. Diabetes mellitus
  5. Being inactive

GitHub Code Link

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv('dataset.csv')
print(df.head())

# info() prints its summary directly and returns None, so no print() is needed
df.info()

print(df.describe())

Feature Selection

To see how the features relate to each other and to the target, compute the pairwise correlations in the data set and plot them as a heat map:

corrmat = df.corr()
plt.figure(figsize=(16, 16))
# Plot the correlation matrix as an annotated heat map
sns.heatmap(corrmat, annot=True, cmap="RdYlGn")
plt.show()
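
With many columns, a heat map can be hard to scan. As a minimal sketch, the same correlation matrix can be reduced to a list of features ranked by their absolute correlation with `target`. The helper name `rank_by_target_correlation` is hypothetical, and the toy DataFrame is made up only so the snippet runs on its own; with the real data, `toy_df` would be `df`.

```python
import pandas as pd


def rank_by_target_correlation(frame, target_col="target"):
    """Return features sorted by absolute Pearson correlation with the target."""
    corr_with_target = frame.corr()[target_col].drop(target_col)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.reindex(order)


# Toy data (made up for illustration only, not the heart disease data set)
toy_df = pd.DataFrame({
    "age":     [29, 41, 50, 58, 63, 67],
    "thalach": [190, 172, 160, 150, 140, 120],
    "chol":    [200, 250, 210, 240, 205, 245],
    "target":  [0, 0, 0, 1, 1, 1],
})

ranking = rank_by_target_correlation(toy_df)
print(ranking)
```

Correlation only captures linear, pairwise relationships, so a ranking like this is a rough screening aid rather than a complete feature selection method.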

It’s good practice to work with a data set whose target classes are of approximately equal size, so let’s check the class balance:

sns.set_style('whitegrid')
sns.countplot(x='target',data=df,palette='RdBu_r')
plt.show()
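
The same check can be done numerically. A minimal sketch, using a made-up label column so the snippet runs on its own (with the real data, `toy_target` would be `df['target']`):

```python
import pandas as pd

# Toy labels (made up for illustration); with the real data use df['target']
toy_target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="target")

counts = toy_target.value_counts()
proportions = toy_target.value_counts(normalize=True)
# Ratio of the smallest class to the largest; 1.0 means perfectly balanced
balance_ratio = counts.min() / counts.max()

print(counts)
print(proportions.round(3))
print(f"minority/majority ratio: {balance_ratio:.2f}")
```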

Data Processing

After exploring the data set, I observed that I need to convert the categorical variables into dummy variables and scale the continuous columns before training the machine learning models.

First, I’ll use the pandas get_dummies function to create dummy columns for the categorical variables, and then standardize the continuous columns with StandardScaler.

dataset = pd.get_dummies(df, columns = ['sex', 'cp', 
                                        'fbs','restecg', 
                                        'exang', 'slope', 
                                        'ca', 'thal'])
from sklearn.preprocessing import StandardScaler
# Standardize the continuous columns to zero mean and unit variance
standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])
dataset.head()
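
To make the two steps concrete, here is a small sketch on a made-up frame with one categorical column and one numeric column. The values are invented for illustration; only the column names echo the post.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (made up for illustration): one categorical and one numeric column
toy = pd.DataFrame({"cp": [0, 1, 2, 1], "chol": [200.0, 240.0, 220.0, 260.0]})

# Each distinct value of 'cp' becomes its own indicator column: cp_0, cp_1, cp_2
encoded = pd.get_dummies(toy, columns=["cp"])

# StandardScaler subtracts the column mean and divides by the population
# standard deviation, so the scaled column has mean 0 and unit variance
encoded[["chol"]] = StandardScaler().fit_transform(encoded[["chol"]])

print(encoded)
```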

y = dataset['target']
X = dataset.drop(['target'], axis = 1)

K Neighbors Classifier

from sklearn.model_selection import cross_val_score
# Mean 10-fold cross-validated accuracy for K = 1..20
knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn_classifier, X, y, cv=10)
    knn_scores.append(score.mean())
k_values = list(range(1, 21))
plt.plot(k_values, knn_scores, color='red')
# Label each point with its K and mean score, rounded for readability
for k in k_values:
    plt.text(k, knn_scores[k - 1], f'({k}, {knn_scores[k - 1]:.3f})')
plt.xticks(k_values)
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')
plt.show()
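
Rather than reading the best K off the plot, it can be picked from the scores directly. A minimal sketch: the data comes from scikit-learn's `make_classification`, a synthetic stand-in chosen only so the snippet runs on its own. With the real data, `X_demo` and `y_demo` would be `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); not the heart disease data set
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

k_values = list(range(1, 21))
mean_scores = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 10 stratified folds (the default for classifiers)
    mean_scores.append(cross_val_score(model, X_demo, y_demo, cv=10).mean())

# np.argmax returns the position of the first maximum, so ties go to the smaller K
best_index = int(np.argmax(mean_scores))
best_k = k_values[best_index]
print(f"best K = {best_k}, mean accuracy = {mean_scores[best_index]:.3f}")
```

Keep in mind that a score used to choose K is slightly optimistic as an estimate of performance on new data, since K was tuned on those same folds.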

knn_classifier = KNeighborsClassifier(n_neighbors = 12)
score=cross_val_score(knn_classifier,X,y,cv=10)
score.mean()
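
One caveat with this workflow: the scaler was fitted on every row before cross-validation, so each validation fold contributes to the scaling statistics. A sketch of one way to avoid that, assuming scikit-learn's `Pipeline` and `ColumnTransformer` (neither is used in the original code), is to refit the preprocessing inside each training fold. The data below is synthetic and the column subset is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption); column names mimic the post's data set
rng = np.random.default_rng(0)
n_rows = 120
demo = pd.DataFrame({
    "age": rng.integers(30, 75, n_rows),
    "chol": rng.integers(150, 350, n_rows),
    "cp": rng.integers(0, 4, n_rows),
})
demo_target = (demo["age"] + rng.normal(0, 10, n_rows) > 52).astype(int)

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age", "chol"]),
    # handle_unknown='ignore' covers categories missing from a training fold
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=12)),
])

# The scaler and encoder are refitted on the training part of every fold
scores = cross_val_score(pipeline, demo, demo_target, cv=10)
print(f"mean accuracy: {scores.mean():.3f}")
```

On a small data set the difference in score may be minor, but keeping preprocessing inside the pipeline makes the cross-validated estimate easier to trust.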

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
randomforest_classifier= RandomForestClassifier(n_estimators=10)
score=cross_val_score(randomforest_classifier,X,y,cv=10)
score.mean()
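
Because tree construction is randomized, the score above can change from run to run. A minimal sketch of making it reproducible by fixing `random_state` (the seed value 42 and the helper name `forest_cv_score` are assumptions, as is the synthetic data used so the snippet runs on its own):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); not the heart disease data set
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)


def forest_cv_score(seed):
    """Mean 10-fold accuracy of a 10-tree forest built with a fixed seed."""
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return cross_val_score(forest, X_demo, y_demo, cv=10).mean()


first_run = forest_cv_score(seed=42)
second_run = forest_cv_score(seed=42)
# Same seed, same folds, same trees: the two means are identical
print(first_run == second_run)
```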