pierce1798
Logistic Regression

Who hasn't wanted to be able to tell the future? While it's a power many people wish they had, we aren't quite there yet. That said, computers give us tools for making some pretty good predictions. We may never achieve 100% accuracy, but one algorithm that helps us estimate the outcome of an event is logistic regression. We can use it whenever we want to determine the likelihood of one outcome versus another: for example, whether or not a customer will churn from a service, or how probable it is that an organism belongs to one species or another given its features. The method even extends to more than two possible outcomes by comparing the probability of each outcome against the others; we just have to remember to weigh all of our options so that we find the best outcome for a specific case.
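As a concrete illustration of the species example, sklearn's LogisticRegression handles more than two classes out of the box. A minimal sketch using the built-in iris dataset (three species), which is not the data used later in this post:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # three species of iris
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one probability per class; the row sums to 1,
# and the predicted species is the class with the highest probability
probs = clf.predict_proba(X[:1])
print(probs)
```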
When we use logistic regression to compare two possible cases, we first need to understand how those cases are represented. With two outcomes, the classes are labeled one and zero, and the model outputs the chance that an observation belongs to class one as a value between zero and one: zero means no chance of happening, one means a one hundred percent chance, and the predicted class is whichever value the probability rounds to. So if there is a greater chance one case will not happen, the other is more likely.
Looking at a graph of the raw data, we would see that all of our points fall on either zero or one.
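Under the hood, logistic regression squeezes a linear combination of the features through the logistic (sigmoid) function, which is what produces a value between zero and one. A quick sketch of that function:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# values far below zero approach 0, values far above zero approach 1
print(sigmoid(-4))  # ~0.018
print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.982
```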
The process of building this model breaks down into a few easy steps. First you'll need the data set you want to make predictions on, and next you set an X and a y variable: X is a matrix of the feature columns, while y is just the target to be predicted. We can split our data into training and test sets so that we can score and evaluate how well the model performs on data it hasn't seen. Then we use the LogisticRegression classifier to fit a model to the training data, which we can use to predict results given new data.

Now we can walk step by step through creating a model.
First we'll load in our data. For this we're using a previously cleaned data frame of information about passengers on the Titanic.

import pandas as pd

df = pd.read_csv('./data/cleaned_titanic.csv')
df.head()

Next, we need to instantiate our estimator (LogisticRegression), set our feature columns to our X variable and the target we want to predict to our y, and then fit the model.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

logreg = LogisticRegression(random_state=42)
feature_cols = ['PassengerId','Pclass','Age','SibSp','Parch','Fare','youngin','male']
X = df[feature_cols]
y = df.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y)
logreg.fit(X_train, y_train)

So that we have one set of data to train on and another set to test with, we split the data using sklearn's train_test_split method.

Now, if we want a quick look at what we have so far, we can call the next lines to return arrays of predictions from the current model.

test = logreg.predict(X_test)      # predicted classes for the test set
trained = logreg.predict(X_train)  # predicted classes for the training set
test, trained
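For a single-number summary before digging into the full report, sklearn estimators also expose a score method that returns mean accuracy on the given data. A sketch on a synthetic stand-in for the Titanic features (hypothetical data, so the numbers won't match the post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic binary-classification data standing in for the Titanic set
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logreg = LogisticRegression(random_state=42).fit(X_train, y_train)

# mean accuracy on each split; a large gap suggests overfitting
print(logreg.score(X_train, y_train))
print(logreg.score(X_test, y_test))
```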

We can also use the next lines to print a report of our training and test scores, containing accuracy, precision, recall, and F1 score. With this we can see how our model performed when it came time to test it.

from sklearn.metrics import classification_report

y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)
print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

This is a very basic rundown of logistic regression and its uses, as well as how to build a lightweight model with it. There are many other things we can tweak and change to optimize our results and create a better-performing model, such as tuning hyperparameters and normalizing the data. We can also better visualize our results with a confusion matrix, which categorizes predictions as true positives and true negatives, as well as any false positives or false negatives. Doing this can help us understand where exactly to focus our optimization efforts and which scores we want to improve to best fit our needs.
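A confusion matrix takes only a couple of lines with sklearn. A sketch on hypothetical synthetic data standing in for the Survived target:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# hypothetical binary data in place of the Titanic set
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_train, y_train)
cm = confusion_matrix(y_test, logreg.predict(X_test))

# rows are actual classes, columns are predicted classes:
# [[true negatives,  false positives],
#  [false negatives, true positives]]
print(cm)
```

Comparing the off-diagonal counts tells us whether the model errs more toward false positives or false negatives, which is exactly the kind of detail the summary scores above can hide.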
There are many uses for logistic regression models, and although it may not always be the best model for a situation, it works well for predicting classifications. Just because we have a working model, though, doesn't mean we should accept it as our final model; it's always good to try different models to see which performs best, and with which parameters.
