DEV Community

Mr Codeslinger


Logistic Regression with Scikit-learn

We'll start with the questions that are probably on your mind right now.

What is Logistic Regression?

  • Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.
  • In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
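To make the "probability" part a bit more concrete, here is a minimal sketch of the sigmoid (logistic) function, which is what squashes a linear score into a value between 0 and 1. The scores below are made-up numbers for illustration, and the 0.5 cut-off is just the usual convention, not something specific to this dataset.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (weights . features + bias) for three samples.
scores = np.array([-2.0, 0.5, 3.0])

# Estimated probability that each sample belongs to class 1.
probabilities = sigmoid(scores)

# Conventional cut-off: probability of at least 0.5 means class 1, else class 0.
predicted_classes = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_classes)
```

Negative scores end up below 0.5 and positive scores above it, which is how a continuous score turns into a 0 or 1 label.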

It looks like this:


What you should take away from this image is that logistic regression classifies your data into 0 or 1.

If you've been following along with the series, just know that this one is special because today you're gonna do the Feature Extraction by yourself.

The question that's probably on your mind if you've not been following along with the series:

What is Feature Extraction?

  • Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing.

  • In other words, it is the act of selecting the useful features from a dataset and dropping the rest (strictly speaking, this part is known as feature selection).

Click here to download the dataset we're gonna be using today. Normally, once you click on the link it starts downloading, but as I said, this article is different. Since you're doing the Feature Extraction yourself, you'll have to know which features you're gonna select. This means that you'll have to study the attribute information yourself.


We're gonna make a model that would be able to predict if someone has heart disease or doesn't.

We're gonna start coding now

Importing the needed libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Load and view the dataset

df = pd.read_csv('heart.csv')
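For the "view" part, a quick sketch of how you might peek at the data once it's loaded. So that it runs without the file, it reads a tiny inline stand-in instead of heart.csv; the column names in it are assumptions for illustration, not the confirmed columns of the real dataset.

```python
import io
import pandas as pd

# Tiny inline stand-in so the sketch runs without the download; with the real
# file you would use df = pd.read_csv('heart.csv') instead.
# The column names below are assumptions, not the dataset's confirmed schema.
sample_csv = io.StringIO(
    "age,sex,chol,target\n"
    "63,1,233,1\n"
    "37,1,250,1\n"
    "67,0,286,0\n"
)
df = pd.read_csv(sample_csv)

print(df.head())          # the first rows (up to five)
print(df.shape)           # (number of rows, number of columns)
print(df.isnull().sum())  # count of missing values per column
```

Looking at the first rows and the missing-value counts is usually enough to see which columns you have to work with before picking features.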



Feature Extraction

This is where you do your research and check which features are important.
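If you're not sure where to start, one rough way (an assumption on my part, not the only way) is to check how strongly each column correlates with the label and then keep the columns you trust. The data and the column names `age`, `chol`, `max_hr` and `target` below are hypothetical stand-ins, so swap in the real columns after reading the attribute information.

```python
import pandas as pd

# Hypothetical stand-in data; the column names are assumptions.
df = pd.DataFrame({
    "age":    [63, 37, 41, 56, 57, 67, 62, 58],
    "chol":   [233, 250, 204, 236, 354, 286, 268, 230],
    "max_hr": [150, 187, 172, 178, 163, 108, 160, 120],
    "target": [1, 1, 1, 1, 1, 0, 0, 0],
})

# Absolute correlation of every candidate feature with the label,
# strongest first. This is only a rough guide, not a final verdict.
correlations = (
    df.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(correlations)

# Keep the columns you decided on (this particular choice is illustrative).
selected_columns = ["age", "max_hr"]
features = df[selected_columns]
labels = df["target"]
```

Correlation only captures linear relationships, so treat it as a hint and still read what each attribute actually means.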

Making the training and validation set

When a large amount of data is at hand, a set of samples can be set aside to evaluate the final model. The "training" data set is the general term for the samples used to create the model, while the "test" or "validation" data set is used to evaluate its performance.

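train_data, validation_data, train_labels, validation_labels = train_test_split(
    features, labels, train_size=0.8, test_size=0.2, random_state=1)  # features, labels and these values are placeholders; use your own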
  • train_size is the share of the samples (a fraction between 0 and 1, or an absolute count) that goes into the training set. test_size does the same for the validation set.
  • random_state seeds the shuffling, so you get exactly the same split every time the code is run.
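Here's a small sketch of what those two parameters do in practice. The arrays are hypothetical stand-in data, and the 0.8 / 0.2 split and the seed value are illustrative assumptions rather than the settings used for the heart dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with 2 features each.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# The 0.8 / 0.2 split and the seed value are illustrative assumptions.
train_data, validation_data, train_labels, validation_labels = train_test_split(
    features, labels, train_size=0.8, test_size=0.2, random_state=6
)
print(train_data.shape, validation_data.shape)

# Calling it again with the same random_state reproduces the same split.
train_again, _, _, _ = train_test_split(
    features, labels, train_size=0.8, test_size=0.2, random_state=6
)
print(np.array_equal(train_data, train_again))
```

With 10 samples, 8 land in the training set and 2 in the validation set, and the second call picks the very same rows because the seed is unchanged.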

Making a model

model = LogisticRegression()
model.fit(train_data, train_labels)



The score is not too bad, but it's not great either.
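In case you're wondering how to get a score in the first place: for a classifier, scikit-learn's `score` method returns the mean accuracy on the data you pass in. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption so it runs without heart.csv), so the numbers it prints say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the heart dataset).
features, labels = make_classification(
    n_samples=200, n_features=5, random_state=0
)
train_data, validation_data, train_labels, validation_labels = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(train_data, train_labels)

# score() returns the fraction of correctly classified samples.
train_score = model.score(train_data, train_labels)
validation_score = model.score(validation_data, validation_labels)
print(train_score, validation_score)
```

Comparing the two numbers is a quick sanity check: a training score far above the validation score usually hints at overfitting.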

Making predictions with your model

Now it's time to make a prediction, using the features that you've picked.
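A minimal sketch of what that could look like, assuming you ended up with two numeric features. The training rows, the two made-up feature columns and the "new people" below are all hypothetical, so replace them with the features you actually picked.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: two made-up features per person.
train_data = np.array([
    [29, 120], [35, 130], [41, 125],
    [58, 160], [63, 170], [67, 165],
])
train_labels = np.array([0, 0, 0, 1, 1, 1])

# max_iter is raised only to give the solver room on unscaled features.
model = LogisticRegression(max_iter=1000)
model.fit(train_data, train_labels)

# New samples must have the same features, in the same order, as the training data.
new_people = np.array([[33, 122], [65, 168]])

predictions = model.predict(new_people)          # class labels (0 or 1)
probabilities = model.predict_proba(new_people)  # one column per class

print(predictions)
print(probabilities)
```

`predict` gives you the 0 or 1 answer, while `predict_proba` shows how confident the model is, with each row adding up to 1.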




You can visit Kaggle to find more datasets that you can perform Logistic Regression on.

Check out my Twitter or Instagram.

Feel free to ask questions in the comments.

