Vishnu Ajit

Project - Supervised Learning with Python - Let's Use Logistic Regression to Predict the Chances of a Heart Attack

Excited to share my second tutorial, along with the Python notebook I made for experimenting with machine learning algorithms! This time we are exploring a project using LogisticRegression. It loads the dataset from a CSV file (obtained from Kaggle) and lets us predict the probability of a patient having a heart attack.

Concepts Used Include:

  • LogisticRegression
  • StandardScaler from sklearn.preprocessing
  • fit_transform()
  • train_test_split()
  • model.predict()
  • model.predict_proba()
  • classification_report()
  • roc_auc_score()

Why This Notebook:

The main goal of this notebook is to build a visual, hands-on understanding of Logistic Regression as a machine learning algorithm. Using the beauty of the Python programming language, we try to predict from a patient's hospital data whether they might have a heart attack in the future.

I've included a link to my notebook below to guide you through it.
The link to the notebook: https://github.com/ruforavishnu/Project_Machine_Learning/blob/master/project-supervised-learning-logistic-regression-heart-disease-prediction.ipynb

The link to the dataset: https://github.com/ruforavishnu/Project_Machine_Learning/blob/master/heart-disease-prediction.csv (dataset obtained from Kaggle)

Kaggle URL for the same dataset: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression

Whatโ€™s Next:

Over the next week, I'll be posting more of my notebooks covering other Machine Learning concepts, as recommended by this roadmap: https://www.kaggle.com/discussions/getting-started/554563 [Machine Learning Engineer Roadmap for 2025]
We'll especially be looking at Supervised Learning and Unsupervised Learning to get our feet wet before we begin to walk towards the shores of greater Artificial Intelligence.

Who's This For:

For anybody who loves Python and has been telling themselves, "I'm going to learn Machine Learning one day." This is Day 2 for them! Let's learn Machine Learning together :) Yesterday we looked at Linear Regression; today we are exploring Logistic Regression.

Feel free to explore the notebook and try out your own machine learning models!


Kaggle reference: https://www.kaggle.com/discussions/getting-started/554563 [Machine Learning Engineer Roadmap for 2025]

Now, let's begin coding, shall we? :)

Step 1.

Load the dataset from our CSV file:
import pandas as pd



data = pd.read_csv('heart-disease-prediction.csv')

print(data.head())
and we get the output
male  age  education  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  \
0     1   39        4.0              0         0.0     0.0                0   
1     0   46        2.0              0         0.0     0.0                0   
2     1   48        1.0              1        20.0     0.0                0   
3     0   61        3.0              1        30.0     0.0                0   
4     0   46        3.0              1        23.0     0.0                0   

   prevalentHyp  diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  \
0             0         0    195.0  106.0   70.0  26.97       80.0     77.0   
1             0         0    250.0  121.0   81.0  28.73       95.0     76.0   
2             0         0    245.0  127.5   80.0  25.34       75.0     70.0   
3             1         0    225.0  150.0   95.0  28.58       65.0    103.0   
4             0         0    285.0  130.0   84.0  23.10       85.0     85.0   

   TenYearCHD  
0           0  
1           0  
2           0  
3           1  
4           0

Step 2. Let's explore the data ourselves first

We try running data.info() on our dataset
print(data.info())
and we get the output as
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             4238 non-null   int64  
 1   age              4238 non-null   int64  
 2   education        4133 non-null   float64
 3   currentSmoker    4238 non-null   int64  
 4   cigsPerDay       4209 non-null   float64
 5   BPMeds           4185 non-null   float64
 6   prevalentStroke  4238 non-null   int64  
 7   prevalentHyp     4238 non-null   int64  
 8   diabetes         4238 non-null   int64  
 9   totChol          4188 non-null   float64
 10  sysBP            4238 non-null   float64
 11  diaBP            4238 non-null   float64
 12  BMI              4219 non-null   float64
 13  heartRate        4237 non-null   float64
 14  glucose          3850 non-null   float64
 15  TenYearCHD       4238 non-null   int64  
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
None

Step 3. Now, what do we do with missing data?

What do we do with columns in our dataset that have missing values, and how do we find them?

print(data.isnull().sum())

and we get the output

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

Oh, so there are a few columns that have null or NaN values.
The fillna() method comes to our rescue.
data.fillna(data.mean(), inplace=True)
Hmmm, did that work? How do we check? Oh! Let's run data.isnull().sum() once again:
print(data.isnull().sum())
and we get the output
male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64
Yes, it worked!
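One small aside before we move on: filling with the mean is simple, but the mean is sensitive to outliers. A common alternative is the median. This is not what the notebook above does; it's just a sketch you can experiment with (note we re-read the CSV for a fresh copy, since we've already filled data in place):

# Alternative imputation (sketch): the median is more robust to outliers than the mean.
# This is NOT what we did above; re-read the CSV to get a fresh copy with the NaNs intact.
data_fresh = pd.read_csv('heart-disease-prediction.csv')
data_fresh.fillna(data_fresh.median(), inplace=True)
print(data_fresh.isnull().sum().sum())  # expected: 0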

Step 4. Now we need to preprocess the data, don't we?

How do we do that? Let's see. OK, so what kinds of columns do we have?
data.columns to the rescue:
data.columns
and we get the output
Index(['male', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
      dtype='object')
OK, that's a lot of columns! Kaggle has provided us with plenty of them. We don't need them all, do we?

Let's pick and choose:

['age', 'totChol','sysBP','diaBP', 'cigsPerDay','BMI','glucose']

Aha, now we build our two friends, the only ones who hold the keys to Logistic Regression. One is a DataFrame and the other is a Series.

Let's call them capital X and small y:

X = data[['age', 'totChol','sysBP','diaBP', 'cigsPerDay','BMI','glucose']]

y = data['TenYearCHD']
Hmmm, let's see what we have now:
X.head()
and we get the output
   age  totChol  sysBP  diaBP  cigsPerDay    BMI  glucose
0   39    195.0  106.0   70.0         0.0  26.97     77.0
1   46    250.0  121.0   81.0         0.0  28.73     76.0
2   48    245.0  127.5   80.0        20.0  25.34     70.0
3   61    225.0  150.0   95.0        30.0  28.58    103.0
4   46    285.0  130.0   84.0        23.0  23.10     85.0

Step 5: We need to normalize the data for better model performance

What is a standard scaler?

Simple explanation: a standard scaler is something that lets you compare two items that are currently on different scales by bringing both of them onto a similar scale, so they can be compared against each other.

For example: two friends are talking about how fast a Ferrari goes versus how fast a Porsche goes, but one person is using the m/s scale and the other is using the km/h scale. It's difficult to tell which is faster, right? So we convert both of them into either m/s or km/h, and the comparison becomes easy.

And here comes the StandardScaler to our rescue:
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()

X = scaler.fit_transform(X)
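If you're curious what fit_transform() actually did, here's an optional sanity check: StandardScaler transforms every column to z = (x - mean) / std. We can verify that by hand on the 'age' column (the variable names below are just for illustration):

import numpy as np

# StandardScaler computes z = (x - mean) / std for each column, using the
# population standard deviation (ddof=0).
age = data['age'].to_numpy()
z_by_hand = (age - age.mean()) / age.std()  # NumPy's .std() defaults to ddof=0 too

# X[:, 0] is the scaled 'age' column, since 'age' was the first feature we picked
print(np.allclose(X[:, 0], z_by_hand))  # expected: True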

Step 6: Now we need to split the data we have into 2 segments.

The first segment is for training the machine learning model. The second segment is for testing the model we trained on the first segment, to really check whether it works.

Simple explanation: it's kind of like teaching a student from one textbook and then asking questions from another textbook, just to check whether the student really understood the concept or has simply memorized the whole thing.

And how do we do that? By using train_test_split():
from sklearn.model_selection import train_test_split



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Note that we used the test_size parameter to reserve 20% (0.2 means 20%) of the available data for testing. The remaining 80% is used as training data.

On successful completion of train_test_split() we get 4 variables:

X_train: the training features; 80% of the rows of X (remember capital X?)
X_test: the testing features; the remaining 20% of the rows of X
y_train: the training labels; the matching 80% of our Series y (remember small y?)
y_test: the testing labels; the matching 20% of y
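A quick optional check that the split really is 80/20. With 4,238 rows, we'd expect roughly 3,390 rows for training and 848 for testing (the 848 will show up again later as the "support" in our evaluation report):

# Verify the 80/20 split by checking shapes
print(X_train.shape, X_test.shape)  # (3390, 7) (848, 7)
print(y_train.shape, y_test.shape)  # (3390,) (848,)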

Step 7. Finally we arrive at our final milestone: training the LogisticRegression model

Let's train our model using LogisticRegression. (That is technical lingo for saying: let's use the power of machine learning, along with the beautiful Python programming language, to create an Artificial Intelligence model that can predict what we want it to predict.)
How do we do that? Oh, just three lines of code :)
from sklearn.linear_model import LogisticRegression


model = LogisticRegression()

model.fit(X_train, y_train)
You can sit back and pause for a moment. It's alright. That is it. Just 3 lines of Python code, and we have created an Artificial Intelligence model for ourselves. Ain't it a beauty?
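If you want an optional peek under the hood, the trained model exposes its learned weights through model.coef_ and model.intercept_. Because we standardized the features, a larger positive coefficient means that feature pushes the prediction more strongly towards class 1 (heart disease):

# Peek at the learned weights. Positive = pushes towards class 1 (heart disease).
features = ['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose']
for name, coef in zip(features, model.coef_[0]):
    print(f'{name:>10}: {coef:+.3f}')
print(f' intercept: {model.intercept_[0]:+.3f}')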

Step 8. Let's evaluate the machine learning model we just created

We save the model's predictions on the test set to a variable called y_pred:
y_pred = model.predict(X_test)
We need to evaluate our model.

We use two functions for that:

  1. classification_report()
  2. roc_auc_score()
Let's run them:
from sklearn.metrics import classification_report, roc_auc_score



print(classification_report(y_test, y_pred))

print('ROC-AUC-score:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
and we get the output as

              precision    recall  f1-score   support

           0       0.86      0.99      0.92       724
           1       0.55      0.05      0.09       124

    accuracy                           0.85       848
   macro avg       0.70      0.52      0.51       848
weighted avg       0.81      0.85      0.80       848

ROC-AUC-score: 0.695252628764926

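Notice something in that report: recall for class 1 is only 0.05, meaning the model catches very few of the actually at-risk patients. That's largely because the dataset is imbalanced (only 124 of the 848 test rows are positive) and predict() uses a fixed 0.5 probability threshold. One simple experiment you can try (a sketch, not a definitive fix; the 0.2 threshold is arbitrary) is to lower the threshold using predict_proba():

# predict() is equivalent to thresholding predict_proba() at 0.5.
# Lowering the threshold trades precision for recall on the rare class.
probs = model.predict_proba(X_test)[:, 1]    # probability of class 1
y_pred_lowered = (probs >= 0.2).astype(int)  # 0.2 is an arbitrary, illustrative threshold
print(classification_report(y_test, y_pred_lowered))

Another option worth reading about is scikit-learn's class_weight='balanced' parameter on LogisticRegression.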

Step 9. We are done. The project is over. Tada!

Let's test the machine learning model we created with a real patient's data, shall we?
patient2 = [[45, 210, 130, 85, 10, 25.1, 95]]

patient2_df = pd.DataFrame(patient2, columns=['age','totChol', 'sysBP','diaBP', 'cigsPerDay', 'BMI','glucose'])

patient2_scaled = scaler.transform(patient2_df)
We give the model our scaled data and store the result in a variable called prediction:
prediction = model.predict(patient2_scaled)



Finally, let's test it using our good old print() statement:
# 1 = Heart Disease, 0 = No Heart Disease
if prediction[0] == 1:
    print('The chances the patient might have a heart disease in the future is: True')
else:
    print('The chances the patient might have a heart disease in the future is: False')
and we get the output
The chances the patient might have a heart disease in the future is: True
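Since the whole point of this project is predicting chances, we can also ask the model for the probability itself instead of a plain True/False, using the predict_proba() method from our concepts list:

# Column 1 of predict_proba() is the probability of class 1 (heart disease)
prob = model.predict_proba(patient2_scaled)[0, 1]
print(f'Predicted probability of heart disease: {prob:.1%}')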

And it feels beautiful to know that we have completed learning one more machine learning concept, doesn't it? :)

Yes, it does.
Homework: now, here is data for a few more patients for you to check on your own.
patient3 = [[65, 250, 155, 100, 15, 32.0, 150]]
patient4 = [[55, 240, 140, 90, 10, 29.5, 110]]
patient5 = [[70, 300, 160, 105, 20, 34.0, 180]]


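And if you want to check all three at once, here's one way to do it (a sketch; it reuses the scaler and model we trained above):

# Score all three homework patients in one go
homework = patient3 + patient4 + patient5   # three rows of feature values
homework_df = pd.DataFrame(homework, columns=['age', 'totChol', 'sysBP', 'diaBP',
                                              'cigsPerDay', 'BMI', 'glucose'])
homework_scaled = scaler.transform(homework_df)
print(model.predict(homework_scaled))             # 1 = heart disease, 0 = no heart disease
print(model.predict_proba(homework_scaled)[:, 1]) # probability of class 1 for each patient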
Now go! Go, open Visual Studio Code and start coding. And don't forget to come back here tomorrow for our next project. Like somebody once said: you never know what the tide might bring in tomorrow.
