What is Logistic Regression?
Logistic Regression is a statistical model commonly used for binary classification tasks, such as spam email detection. Despite its name, it is not used for regression but instead predicts the probability that an instance belongs to a particular class (usually class 1, the positive class).
The model uses the sigmoid (or logistic) function, which outputs a probability between 0 and 1, and is defined as follows:
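σ(z) = 1 / (1 + e^(-z)), with z = θ^T x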
Here, θ is the vector of coefficients, and x is the feature vector (i.e., the input data for a single instance). The expression θ^T x is the dot product of these two vectors.
The sigmoid function maps any input to an output in the range (0, 1), representing the probability of the positive class. This probability is then compared to a threshold (commonly 0.5). If the probability is greater than or equal to the threshold, the predicted class is 1 (positive class); otherwise, it is 0 (negative class).
We can express the thresholding mechanism as follows:
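predicted class = 1 if σ(θ^T x) ≥ 0.5
predicted class = 0 if σ(θ^T x) < 0.5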
To understand this, consider the following graph of the sigmoid function:
From the graph, we observe that as x becomes more negative, the output y approaches 0; as x becomes more positive, y approaches 1; and the output is exactly 0.5 when x = 0.
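As a quick numeric check of this behaviour (a minimal illustrative snippet, using the same sigmoid definition implemented later in this post):
import numpy as np
def sigmoid(z):
    # standard logistic function
    return 1 / (1 + np.exp(-z))
# a strongly negative, a zero, and a strongly positive input
print(sigmoid(-5), sigmoid(0), sigmoid(5))  # ≈ 0.0067, 0.5, 0.9933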
Cost Function
A cost function is used to measure the error between the predicted output and the actual class labels, penalizing the model for incorrect predictions. This allows the model to adjust its weights (θ) to minimize this error during training.
The cost function for logistic regression on any input instance is given below:
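Cost(y_pred, y) = -y log(y_pred) - (1 - y) log(1 - y_pred)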
And the overall cost function is given below:
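J(θ) = -(1/m) Σ [ y log(y_pred) + (1 - y) log(1 - y_pred) ]
where the sum runs over all m training examples.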
This is the standard logistic loss function used for binary classification, where y_pred is the predicted probability of the positive class (class 1) and y is the actual class label (either 0 or 1).
To understand the function Cost(y_pred, y), let's break it down for the two cases, y = 0 and y = 1:
- Case y = 0: When the actual label is 0, the first part of the cost function, -y log(y_pred), becomes 0. The cost is then determined by the second part, -(1 - y) log(1 - y_pred), which simplifies to -log(1 - y_pred). This represents the penalty for incorrectly predicting a high probability for the positive class when the true label is 0.
- Case y = 1: When the actual label is 1, the second part of the cost function, -(1 - y) log(1 - y_pred), becomes 0. The cost is then determined by the first part, -y log(y_pred), which simplifies to -log(y_pred). This represents the penalty for incorrectly predicting a low probability for the positive class when the true label is 1, as the example below illustrates.
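For example (using the natural logarithm, as np.log does in the code later in this post): if the true label is y = 1 and the model predicts y_pred = 0.9, the cost is -log(0.9) ≈ 0.105, a small penalty; if the model instead predicts y_pred = 0.1, the cost is -log(0.1) ≈ 2.303, a much larger penalty for a confident wrong prediction.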
The goal during training is to minimize this cost function across all training examples, effectively finding the model parameters θ that yield the most accurate predictions.
Gradient Descent
Gradient Descent is the process by which the model searches for the values of θ that minimize the cost. The size of each step the model takes toward that minimum is controlled by the learning rate, which is typically a small value such as 0.01. The gradient of the logistic regression cost function is given below:
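∇J(θ) = (1/m) X^T (y_pred - y)
Here X is the matrix of training instances (one row per instance), y_pred is the vector of predicted probabilities, and y is the vector of actual labels.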
Once the gradient is calculated, the weights θ are updated as follows:
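θ := θ - α ∇J(θ)
where α is the learning rate.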
Loss Functions
A loss function measures how accurate the model is, i.e., the difference between the predicted output and the actual output. The two loss functions used in this post are Mean Squared Error and Mean Absolute Error.
Mean Squared Error (MSE)
Mean Squared Error (MSE) is the function that calculates the average of the squared differences between the predicted output and the actual output. It penalizes larger errors more than smaller ones. The formula for MSE is:
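MSE = (1/n) Σ (y - y_pred)²
where the sum runs over all n predictions.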
Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is the function that calculates the average of the absolute differences between the predicted output and the actual output. Unlike MSE, MAE treats all errors equally and does not penalize larger errors more than smaller ones. The formula for MAE is:
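MAE = (1/n) Σ |y - y_pred|
where the sum runs over all n predictions.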
Program
The dataset used in this post is the Titanic dataset, taken from Kaggle. The code to read and pre-process it is provided below:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("./Titanic Train Data.csv")
# drop the columns that are not used as features in this post
t = df.drop(columns=["Name", "Sex", "Fare", "Ticket", "SibSp", "Cabin", "PassengerId"])

# fixing empty age values: replace NaN ages with the computed average age
average_age = 0
number_of_rows = len(t)
for i in range(number_of_rows):
    if not np.isnan(t['Age'].iloc[i]):
        average_age += t['Age'].iloc[i]
average_age = average_age / number_of_rows
t['Age'].replace(np.nan, average_age, inplace=True)

# fixing empty embarked values by filling them with "S"
t['Embarked'].replace(np.nan, "S", inplace=True)

# categorising the embarked values as integers
t['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2}, inplace=True)

x = t[['Age', 'Pclass', 'Parch', 'Embarked']]
y = t['Survived']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
Logistic Regression Program from Scratch
from sklearn.metrics import mean_squared_error, mean_absolute_error

def sigmoid(z):
    # logistic function: maps any real input to (0, 1)
    return 1 / (1 + np.exp(-z))

def cost_function(x, y, theta):
    # average logistic loss over all m training examples
    m = len(y)
    z = np.dot(x, theta)
    y_pred = sigmoid(z)
    cost = (1 / m) * np.sum((-y * np.log(y_pred)) - ((1 - y) * np.log(1 - y_pred)))
    return cost

def train(x, y, learning_rate, epochs):
    m, n = x.shape
    theta = np.zeros(n)
    for i in range(epochs):
        z = np.dot(x, theta)
        y_pred = sigmoid(z)
        # gradient of the cost function with respect to theta
        gradient = (1 / m) * np.dot(x.T, (y_pred - y))
        theta -= learning_rate * gradient
        cost = cost_function(x, y, theta)
        if i % 100 == 0:
            print(f"Cost: {cost:.4f} | Epoch: {i}")
    # displaying only the feature weights (theta[0] is the bias term)
    print(f"Final weights: {theta[1:]}")
    return theta

def predict(x, theta):
    # return True (class 1) when the predicted probability is at least 0.5
    z = np.dot(x, theta)
    y_pred = sigmoid(z)
    return y_pred >= 0.5
Driver Code
# add a column of ones so that theta[0] acts as the bias term
x_train = np.hstack([np.ones((x_train.shape[0], 1)), x_train])
x_test = np.hstack([np.ones((x_test.shape[0], 1)), x_test])

theta = train(x_train, y_train, learning_rate=0.01, epochs=1000)
pred = predict(x_test, theta)

accuracy = np.mean(pred == y_test)
print(f"Accuracy: {accuracy:.4f}")

MSE = mean_squared_error(y_test, pred)
MAE = mean_absolute_error(y_test, pred)
print(f"MSE: {MSE}")
print(f"MAE: {MAE}")
Output
Cost: 0.7098 | Epoch: 0
Cost: 0.7498 | Epoch: 100
Cost: 0.7454 | Epoch: 200
Cost: 0.7394 | Epoch: 300
Cost: 0.7332 | Epoch: 400
Cost: 0.7272 | Epoch: 500
Cost: 0.7216 | Epoch: 600
Cost: 0.7164 | Epoch: 700
Cost: 0.7116 | Epoch: 800
Cost: 0.7071 | Epoch: 900
Final weights: [ 0.02662689 -0.44641909 0.22831224 0.27235578]
Accuracy: 0.6457
MSE: 0.3542600896860987
MAE: 0.3542600896860987
Note:
In the following lines:
x_train = np.hstack([np.ones((x_train.shape[0], 1)), x_train])
x_test = np.hstack([np.ones((x_test.shape[0], 1)), x_test])
A column of ones is added to the input feature matrices, introducing a bias term in the logistic regression model. The bias term is a parameter that allows the model to make predictions by shifting the output independently of the input features. This helps the model fit data that doesn't necessarily pass through the origin. The bias term is learned during training, allowing the model to adjust the decision boundary and make more accurate predictions.
The bias term is represented by θ_0, the first element of θ, which multiplies the constant column of ones.
Note that the Sklearn LogisticRegression model already handles this bias term (the intercept) internally, so when using the pre-built model you don't need to manually add it to the input features, as is done above.
Logistic Regression Program using Sklearn
Program
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# x_train and x_test here are the feature matrices without the manually added
# bias column; Sklearn's LogisticRegression fits the intercept on its own
log_model = LogisticRegression()
log_model.fit(x_train, y_train)

model_predictions = log_model.predict(x_test)
MSE = mean_squared_error(y_test, model_predictions)
MAE = mean_absolute_error(y_test, model_predictions)

print(f"Final Weights: {log_model.coef_}")
print(f"MSE: {MSE}")
print(f"MAE: {MAE}")
Output
Final Weights: [[-0.02804986 -1.02796635 0.20322083 0.40791547]]
MSE: 0.26905829596412556
MAE: 0.26905829596412556