Timothy Cummins

Intro to Logistic Regression with sklearn

What is Logistic Regression?

Logistic regression is a method used for binary classification problems. In other words, it classifies your data as 0 or 1. Some examples of cases where LR is used include detecting whether an email is spam, whether someone will vote, or even whether a patient is at risk of developing a disease. To do this, LR uses the Logistic Function, hence the name Logistic Regression.

Logistic Function

The Logistic function is an s-shaped curve with the equation 1/(1 + e^(-x)). As you can see below, it maps any real number to a value between 0 and 1, approaching but never actually reaching those limits.
(Plot of the s-shaped logistic curve)
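
To make that concrete, here is a minimal sketch of the logistic function using numpy (numpy isn't used elsewhere in this post, so take this purely as an illustration):

import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)) squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# roughly [0.0000454, 0.269, 0.5, 0.731, 0.9999546]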

Example

Download the dataset from Kaggle: https://www.kaggle.com/c/titanic/data?select=train.csv

Then import the needed libraries and load the dataset.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the training data (adjust the path to wherever you saved train.csv)
df = pd.read_csv('../../Downloads/train.csv')
df.head()

(Output of df.head(): the first five rows of the Titanic training data)

Next we are going to change the categorical data into indicator variables and get rid of the features that we don't need, such as the passengers' names.

x_feats = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Cabin', 'Embarked']
X = pd.get_dummies(df[x_feats], drop_first=True)
y = df['Survived']
X.head()

(Output of X.head(): the first five rows of the dummy-encoded features)
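
If you haven't seen get_dummies before, here is a tiny toy example (not part of the Titanic workflow, just an illustration) showing what drop_first=True does:

toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})
# Produces a single indicator column named 'Sex_male'; the 'female' column
# is dropped because it would be redundant (female is simply Sex_male == 0)
print(pd.get_dummies(toy, drop_first=True))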

Lastly, before we get to building our model, we are going to fill in any missing values and normalize our data. Normalizing helps our model converge faster, though it is not required for logistic regression.

# Fill missing values
X = X.fillna(value=0) 
for col in X.columns:
    # Subtract the minimum and divide by the range forcing a scale of 0 to 1 for each feature
    X[col] = (X[col] - min(X[col]))/ (max(X[col]) - min(X[col])) 
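
As a side note, sklearn ships a MinMaxScaler that performs the same min-max scaling as the loop above; here is a rough equivalent if you prefer that route (a sketch, not what the original walkthrough uses):

from sklearn.preprocessing import MinMaxScaler

# fit_transform returns a numpy array, so wrap it back into a DataFrame
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)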

Now we can actually split our data and fit our model.


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(fit_intercept=False, C=100, solver='liblinear')
model_log = logreg.fit(X_train, y_train)
model_log
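
If you want to peek under the hood, the fitted model exposes predict_proba, which returns the raw probabilities that predict() thresholds at 0.5. A quick sketch:

# Column 0 is P(Survived=0), column 1 is P(Survived=1)
probs = model_log.predict_proba(X_test[:5])
print(probs)
print(model_log.predict(X_test[:5]))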

Then finally we can check out how our model performed.

print('Training accuracy:', logreg.score(X_train, y_train))
print('Test accuracy:', logreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, logreg.predict(X_train)))
print('Test MSE:', mean_squared_error(y_test, logreg.predict(X_test)))

(Printed metrics: roughly 78% accuracy on the test set)
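
Since this is a classification problem, accuracy and a confusion matrix are arguably more natural metrics than MSE; here is a small sketch using sklearn's built-in helpers (not part of the original walkthrough):

from sklearn.metrics import accuracy_score, confusion_matrix

preds = logreg.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, preds))
print('Confusion matrix:')
print(confusion_matrix(y_test, preds))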

78% on the test set, not too shabby for predicting whether someone survived the Titanic or not. As you can see from the step where I created the model, I increased C from its default of 1 to 100. C is the inverse of the regularization strength, so increasing it makes it more likely that the parameters will grow in magnitude simply to adjust for small perturbations in the data. I would suggest playing with this value and seeing how it changes your predictions.
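
Here is one rough way to do that experiment, looping over a handful of arbitrary C values and comparing train and test scores:

for c in [0.01, 0.1, 1, 10, 100, 1000]:
    clf = LogisticRegression(fit_intercept=False, C=c, solver='liblinear')
    clf.fit(X_train, y_train)
    print(f'C={c}: train={clf.score(X_train, y_train):.3f}, test={clf.score(X_test, y_test):.3f}')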
