Logistic Regression Workflow Guide

If you are one of the many whose first logistic regression project comes through a boot camp program, you might feel overwhelmed about where to begin. This guide will hopefully serve as a beacon of light so that you spend less time figuring out your approach and more time practicing and improving your understanding of logistic regression modelling.

Introduction

I'll be using the project I worked on during my intensive bootcamp program: the Terry Stop dataset. This tutorial highlights the steps to take when searching for a predictive logistic regression model through multiple iterations.

Dataset

I used the Terry Stop dataset that is publicly available on the seattle.gov website. This data represents records of police-reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop. Each record contains the perceived demographics of the subject, as reported by the officer making the stop, and the officer's demographics as reported to the Seattle Police Department for employment purposes. Where available, data elements from the associated Computer Aided Dispatch (CAD) event (e.g. Call Type, Initial Call Type, Final Call Type) are included. I chose the 'Arrest Flag' column as my target and used the demographic, frisk, officer year-of-birth, and date columns as my features.
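To make the starting point concrete, here is a minimal loading sketch. The file name and exact column label are assumptions based on the seattle.gov export, so adjust them to match your copy:

```python
import pandas as pd

# Hypothetical file name -- download the CSV export from seattle.gov first.
df = pd.read_csv('Terry_Stops.csv')

# 'Arrest Flag' is the target; it pays to check the class balance early.
print(df['Arrest Flag'].value_counts(normalize=True))
```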

First Decision - Deciding which columns you will use in your modelling. You can mix and match and see which combinations work better than others. For my project, I decided to stick to one set of features, but you could create one data frame with demographics + precinct and another with demographics + date reported and compare them.
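As a rough sketch of that mix-and-match idea (the column names here are hypothetical, not the dataset's actual labels):

```python
# Hypothetical column names -- adjust to the labels in your copy of the data.
demographics = ['Subject Perceived Race', 'Subject Perceived Gender', 'Subject Age Group']

# Two candidate feature sets you could model separately and compare.
features_a = df[demographics + ['Precinct']]
features_b = df[demographics + ['Reported Date']]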

Second Decision - Deciding how you will clean your data. This will change depending on which columns you decide to use. Basic data cleaning includes removing NaN values and consolidating similar values into one value.
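For example, a minimal cleaning sketch with pandas might look like this (the column names and label values are placeholders, not the dataset's actual ones):

```python
# Drop rows with missing values in the columns you plan to model on.
df = df.dropna(subset=['Subject Perceived Gender', 'Precinct'])

# Consolidate near-duplicate labels into a single value.
df['Subject Perceived Gender'] = df['Subject Perceived Gender'].replace(
    {'Unable to Determine': 'Unknown', '-': 'Unknown'}
)
```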

Third Decision - Deciding which predictive models you will use. I used LogisticRegression, KNeighborsClassifier, RandomForestClassifier, and DecisionTreeClassifier models. I used pipelines to expedite the process and cut back on code.
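Here is a sketch of one such pipeline, combining one-hot encoding, SMOTE resampling, and a logistic regression; the feature list is hypothetical, and you would build one pipeline per model the same way:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical feature list -- swap in your own columns.
categorical = ['Subject Perceived Race', 'Subject Perceived Gender', 'Precinct']

# Only the listed columns are used; everything else is dropped by default.
preprocessor = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), categorical)]
)

# One pipeline keeps preprocessing, resampling, and fitting together,
# and ensures SMOTE is only applied to the training folds.
logreg_pipe = Pipeline([
    ('preprocess', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])
```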

Fourth Decision - If you use a grid search, deciding which parameters you will tune and how many options you will provide for each parameter of each model. This might be limited by your hardware capabilities.
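A small grid for the logistic regression pipeline above might look like this, assuming you have already split your data into X_train and y_train with train_test_split:

```python
from sklearn.model_selection import GridSearchCV

# Parameter names are prefixed with the pipeline step name ('clf' here).
param_grid = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__penalty': ['l2'],
}

# f1_macro works whether the target labels are strings or 0/1.
grid = GridSearchCV(logreg_pipe, param_grid, scoring='f1_macro', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```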

Fifth Decision - Deciding which model will be your final one. This depends on many factors about your dataset. In this project, there was a class imbalance in my target column, so I focused on precision, recall, and F1 score rather than accuracy.
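With an imbalanced target, classification_report gives you all three of those metrics per class at once. A sketch, assuming the fitted grid search from above and a held-out X_test/y_test:

```python
from sklearn.metrics import classification_report

# Accuracy alone is misleading on imbalanced data; compare precision,
# recall, and F1 for the minority (arrest) class instead.
y_pred = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))
```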

Sixth Decision - Deciding which visualizations you will use. Ideally you want visualizations that are easy to read, add to your story, and are catered to your stakeholders.

Libraries

Many of the libraries imported below should be familiar to you by now. If you are interested, you can look at the documentation to learn more about each one. I'll briefly go over the lesser-known ones that I used.

  • I used datetime and date to transform the 'date reported' column into a datetime64[ns] dtype and then created two new columns with the month and the year respectively.

  • I used mdates, cbook, mtick, and calendar to transform the month numbers into names when creating the visualizations of Terry stops by month and year (see the sketch after the imports below).

```python
import pandas as pd
import numpy as np
from datetime import datetime, date
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead.
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, \
accuracy_score, recall_score, precision_score, f1_score, log_loss, roc_auc_score, roc_curve
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import matplotlib.ticker as mtick
import calendar
```
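Putting those two bullets together, here is a rough sketch of the date handling and the month-name axis labels; the 'Reported Date' column name is an assumption, and this version uses calendar for the labels:

```python
import calendar
import pandas as pd
import matplotlib.pyplot as plt

# Parse the reported date and derive month/year columns.
df['Reported Date'] = pd.to_datetime(df['Reported Date'])
df['month'] = df['Reported Date'].dt.month
df['year'] = df['Reported Date'].dt.year

# Count stops per month and label the x-axis with month names.
counts = df.groupby('month').size()
fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xticks(counts.index)
ax.set_xticklabels([calendar.month_abbr[m] for m in counts.index])
ax.set_xlabel('Month')
ax.set_ylabel('Number of Terry stops')
plt.show()
```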

Code

Below is the link to the project repository on GitHub. You'll find markdown notes for almost every cell of code and an explanatory narrative of my process.

Link

Conclusion

Hopefully this guide serves as a basic outline for you as you begin your logistic regression project. Please feel free to reach out if you have any questions or comments!
