Vishnu Ajit

Project - Supervised Learning with Python - Let's Use Logistic Regression to Predict the Chances of a Heart Attack

Excited to share my second tutorial, along with the Python notebook I made for experimenting with machine learning algorithms! This time we are exploring a project using LogisticRegression. It loads the dataset from a CSV file (obtained from Kaggle) and lets us predict the probability of a patient having a heart attack.

Concepts Used Include:

  • LogisticRegression
  • StandardScaler from sklearn.preprocessing
  • fit_transform()
  • train_test_split()
  • model.predict()
  • model.predict_proba()
  • classification_report()
  • roc_auc_score()

Why This Notebook:

The main goal of this notebook is to build a visual, hands-on understanding of Logistic Regression as a machine learning algorithm. Using the beauty of the Python programming language, we try to predict from a patient's hospital data whether they might have a heart attack in the future.

I've included a link to my notebook below to guide you through it.
The link to the notebook: https://github.com/ruforavishnu/Project_Machine_Learning/blob/master/project-supervised-learning-logistic-regression-heart-disease-prediction.ipynb

The link to the dataset: https://github.com/ruforavishnu/Project_Machine_Learning/blob/master/heart-disease-prediction.csv (dataset obtained from Kaggle)

Kaggle URL for the same dataset: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression

Whatโ€™s Next:

Over the next week, I'll be posting more of my notebooks covering other Machine Learning concepts, as recommended by this roadmap: https://www.kaggle.com/discussions/getting-started/554563 [Machine Learning Engineer Roadmap for 2025]
We'll especially be looking at Supervised Learning and Unsupervised Learning to get our feet wet before we begin to walk towards the shores of greater Artificial Intelligence.

Who's This For:

For anybody who loves Python and has been telling themselves, "I'm going to learn Machine Learning one day." This is Day 2 for them! Let's learn Machine Learning together :) Yesterday we looked at Linear Regression; today we are exploring Logistic Regression.

Feel free to explore the notebook and try out your own machine learning models!


Kaggle reference: https://www.kaggle.com/discussions/getting-started/554563 [Machine Learning Engineer Roadmap for 2025]

Now, let's begin coding, shall we? :)

Step 1.

Load the dataset from our CSV file:
import pandas as pd



data = pd.read_csv('heart-disease-prediction.csv')

print(data.head())
and we get the output
male  age  education  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  \
0     1   39        4.0              0         0.0     0.0                0   
1     0   46        2.0              0         0.0     0.0                0   
2     1   48        1.0              1        20.0     0.0                0   
3     0   61        3.0              1        30.0     0.0                0   
4     0   46        3.0              1        23.0     0.0                0   

   prevalentHyp  diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  \
0             0         0    195.0  106.0   70.0  26.97       80.0     77.0   
1             0         0    250.0  121.0   81.0  28.73       95.0     76.0   
2             0         0    245.0  127.5   80.0  25.34       75.0     70.0   
3             1         0    225.0  150.0   95.0  28.58       65.0    103.0   
4             0         0    285.0  130.0   84.0  23.10       85.0     85.0   

   TenYearCHD  
0           0  
1           0  
2           0  
3           1  
4           0

Step 2. Let's explore the data ourselves first

We try running data.info() on our dataset
print(data.info())
and we get the output as
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
None

Step 3. Now, what do we do with missing data?

What do we do with columns in our dataset that have missing values, and how do we find them?

print(data.isnull().sum())

and we get the output

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

Oh, so there are a few columns that have null or NaN values.
The fillna() method comes to our rescue.
data.fillna(data.mean(), inplace=True)
Hmmm, did that work? How do we check? Oh! Let's run data.isnull().sum() once again:
print(data.isnull().sum())
and we get the output
male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64
Yes, it worked!
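One small aside before we move on: filling with the mean is simple, but the mean is sensitive to outliers. A common alternative is the median. This is not what the notebook above does; it's just a sketch you can experiment with (note we re-read the CSV for a fresh copy, since we've already filled data in place):

# Alternative imputation (sketch): the median is more robust to outliers than the mean.
# This is NOT what we did above; re-read the CSV to get a fresh copy with the NaNs intact.
data_fresh = pd.read_csv('heart-disease-prediction.csv')
data_fresh.fillna(data_fresh.median(), inplace=True)
print(data_fresh.isnull().sum().sum())  # expected: 0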

Step 4. Now we need to preprocess the data, don't we?

How do we do that? Let's see. OK, so what kinds of columns do we have?
data.columns to the rescue:
data.columns
and we get the output
Index(['male', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
      dtype='object')
OK, that's a lot of columns! Kaggle has provided us with plenty of them. We don't need them all, do we?

Let's pick and choose:

['age', 'totChol','sysBP','diaBP', 'cigsPerDay','BMI','glucose']

Aha, now we build our two friends, the only ones who hold the keys to Logistic Regression. One is a DataFrame and the other is a Series.

Let's call them capital X and small y:

X = data[['age', 'totChol','sysBP','diaBP', 'cigsPerDay','BMI','glucose']]

y = data['TenYearCHD']
Hmmm, let's see what we have now:
X.head()
and we get the output
   age  totChol  sysBP  diaBP  cigsPerDay    BMI  glucose
0   39    195.0  106.0   70.0         0.0  26.97     77.0
1   46    250.0  121.0   81.0         0.0  28.73     76.0
2   48    245.0  127.5   80.0        20.0  25.34     70.0
3   61    225.0  150.0   95.0        30.0  28.58    103.0
4   46    285.0  130.0   84.0        23.0  23.10     85.0

Step 5: We need to normalize the data for better model performance

What is a standard scaler?

Simple explanation: a standard scaler is something that lets you compare two items that are currently on different scales by bringing both of them onto a similar scale, so they can be compared against each other.

For example: two friends are talking about how fast a Ferrari goes versus how fast a Porsche goes, but one person is using the m/s scale and the other is using the km/h scale. It's difficult to tell which is faster, right? So we convert both of them into either m/s or km/h, and the comparison becomes easy.

And here comes the StandardScaler to our rescue:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()

X = scaler.fit_transform(X)
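If you're curious what fit_transform() actually did, here's an optional sanity check: StandardScaler transforms every column to z = (x - mean) / std. We can verify that by hand on the 'age' column (the variable names below are just for illustration):

import numpy as np

# StandardScaler computes z = (x - mean) / std for each column, using the
# population standard deviation (ddof=0).
age = data['age'].to_numpy()
z_by_hand = (age - age.mean()) / age.std()  # NumPy's .std() defaults to ddof=0 too

# X[:, 0] is the scaled 'age' column, since 'age' was the first feature we picked
print(np.allclose(X[:, 0], z_by_hand))  # expected: True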

Step 6: Now we need to split the data we have into 2 segments.

The first segment is for training the machine learning model. The second segment is for testing the model we trained on the first segment, to really check whether it works.

Simple explanation: it's kind of like teaching a student from one textbook and then asking questions from another textbook, just to check whether the student really understood the concept or has simply memorized the whole thing.

And how do we do that? By using train_test_split():
from sklearn.model_selection import train_test_split



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Note that we used the test_size parameter to reserve 20% (0.2 means 20%) of the available data for testing. The remaining 80% is used as training data.

On successful completion of train_test_split() we get 4 variables:

X_train: the training features; 80% of the rows of X (remember capital X?)
X_test: the testing features; the remaining 20% of the rows of X
y_train: the training labels; the matching 80% of our Series y (remember small y?)
y_test: the testing labels; the matching 20% of y
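A quick optional check that the split really is 80/20. With 4,238 rows, we'd expect roughly 3,390 rows for training and 848 for testing (the 848 will show up again later as the "support" in our evaluation report):

# Verify the 80/20 split by checking shapes
print(X_train.shape, X_test.shape)  # (3390, 7) (848, 7)
print(y_train.shape, y_test.shape)  # (3390,) (848,)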

Step 7. Finally we arrive at our final milestone: training the LogisticRegression model

Let's train our model using LogisticRegression. (That is technical lingo for saying: let's use the power of machine learning, along with the beautiful Python programming language, to create an Artificial Intelligence model that can predict what we want it to predict.)
How do we do that? Oh, just three lines of code :)
from sklearn.linear_model import LogisticRegression


model = LogisticRegression()

model.fit(X_train, y_train)
You can sit back and pause for a moment. It's alright. That is it. Just 3 lines of Python code, and we have created an Artificial Intelligence model for ourselves. Ain't it a beauty?
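If you want an optional peek under the hood, the trained model exposes its learned weights through model.coef_ and model.intercept_. Because we standardized the features, a larger positive coefficient means that feature pushes the prediction more strongly towards class 1 (heart disease):

# Peek at the learned weights. Positive = pushes towards class 1 (heart disease).
features = ['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose']
for name, coef in zip(features, model.coef_[0]):
    print(f'{name:>10}: {coef:+.3f}')
print(f' intercept: {model.intercept_[0]:+.3f}')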

Step 8. Let's evaluate the machine learning model we just created

We save the model's predictions on the test set to a variable called y_pred:
y_pred = model.predict(X_test)
We need to evaluate our model.

We use two functions for that:

  1. classification_report()
  2. roc_auc_score()
Let's run them:
from sklearn.metrics import classification_report, roc_auc_score



print(classification_report(y_test, y_pred))

print('ROC-AUC-score:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
and we get the output as

              precision    recall  f1-score   support

           0       0.86      0.99      0.92       724
           1       0.55      0.05      0.09       124

    accuracy                           0.85       848
   macro avg       0.70      0.52      0.51       848
weighted avg       0.81      0.85      0.80       848

ROC-AUC-score: 0.695252628764926

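Notice something in that report: recall for class 1 is only 0.05, meaning the model catches very few of the actually at-risk patients. That's largely because the dataset is imbalanced (only 124 of the 848 test rows are positive) and predict() uses a fixed 0.5 probability threshold. One simple experiment you can try (a sketch, not a definitive fix; the 0.2 threshold is arbitrary) is to lower the threshold using predict_proba():

# predict() is equivalent to thresholding predict_proba() at 0.5.
# Lowering the threshold trades precision for recall on the rare class.
probs = model.predict_proba(X_test)[:, 1]    # probability of class 1
y_pred_lowered = (probs >= 0.2).astype(int)  # 0.2 is an arbitrary, illustrative threshold
print(classification_report(y_test, y_pred_lowered))

Another option worth reading about is scikit-learn's class_weight='balanced' parameter on LogisticRegression.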

Step 9. We are done. The project is over. Tada!

Let's test the machine learning model we created with a real patient's data, shall we?
patient2 = [[45, 210, 130, 85, 10, 25.1, 95]]

patient2_df = pd.DataFrame(patient2, columns=['age','totChol', 'sysBP','diaBP', 'cigsPerDay', 'BMI','glucose'])

patient2_scaled = scaler.transform(patient2_df)
We give the model our scaled data and store the result in a variable called prediction:
prediction = model.predict(patient2_scaled)



Finally, let's test it using our good old print() statement:
# 1 = Heart Disease, 0 = No Heart Disease
if prediction[0] == 1:
    print('The chances the patient might have a heart disease in the future is: True')
else:
    print('The chances the patient might have a heart disease in the future is: False')
and we get the output
The chances the patient might have a heart disease in the future is: True
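Since the whole point of this project is predicting chances, we can also ask the model for the probability itself instead of a plain True/False, using the predict_proba() method from our concepts list:

# Column 1 of predict_proba() is the probability of class 1 (heart disease)
prob = model.predict_proba(patient2_scaled)[0, 1]
print(f'Predicted probability of heart disease: {prob:.1%}')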

And it feels beautiful to know that we have completed learning one more machine learning concept, doesn't it? :)

Yes, it does.
Homework: now, here is data for a few more patients for you to check on your own.
patient3 = [[65, 250, 155, 100, 15, 32.0, 150]]
patient4 = [[55, 240, 140, 90, 10, 29.5, 110]]
patient5 = [[70, 300, 160, 105, 20, 34.0, 180]]


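And if you want to check all three at once, here's one way to do it (a sketch; it reuses the scaler and model we trained above):

# Score all three homework patients in one go
homework = patient3 + patient4 + patient5   # three rows of feature values
homework_df = pd.DataFrame(homework, columns=['age', 'totChol', 'sysBP', 'diaBP',
                                              'cigsPerDay', 'BMI', 'glucose'])
homework_scaled = scaler.transform(homework_df)
print(model.predict(homework_scaled))             # 1 = heart disease, 0 = no heart disease
print(model.predict_proba(homework_scaled)[:, 1]) # probability of class 1 for each patient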
Now go! Go, open Visual Studio Code and start coding. And don't forget to come back here tomorrow for our next project. Like somebody once said: you never know what the tide might bring in tomorrow.
