Introduction
Contrary to its name, logistic regression is not a regression algorithm but a classification algorithm. In standard regression algorithms, the predicted value $y$ is a continuous value. However, in classification algorithms, the predicted value falls within the range $0 \le h_\theta(x) \le 1$. This is because we want to categorize by discrete values such as 0 or 1: if $h_\theta(x) \ge 0.5$, we predict $y = 1$; if $h_\theta(x) < 0.5$, we predict $y = 0$. In other words, we divide by a threshold value (0.5 in this case).
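As a minimal illustration of this thresholding step (the probabilities below are made up for the example):

```python
import numpy as np

# Hypothetical predicted probabilities for five samples
p = np.array([0.05, 0.42, 0.50, 0.73, 0.96])

# Apply the 0.5 threshold to obtain discrete class labels
labels = np.where(p >= 0.5, 1, 0)
print(labels)  # [0 0 1 1 1]
```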
Binary Logistic Regression
Binary logistic regression is used for binary classification tasks, where the objective is to categorize instances into one of two possible classes. These two classes are often represented as 0 and 1, corresponding to outcomes such as false/true, negative/positive, or fail/pass.
Activation Function
In order to categorize by discrete values, the Logistic Function, also known as the Sigmoid Function, is introduced:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The characteristic feature is that the function satisfies $0 < \sigma(z) < 1$ for all $z$ and $\sigma(0) = 0.5$.
import numpy as np
def sigmoid(z):
return 1 / (1 + np.exp(-z))
import matplotlib.pyplot as plt
x = np.arange(-5.0, 5.0, 0.1)
y = sigmoid(x)
plt.plot(x, y)
plt.grid()
plt.show()
Hypothesis Function
In logistic regression, the hypothesis function is a composite function that unites the hypothesis function of linear regression with the sigmoid function.
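Concretely, the hypothesis applies the sigmoid function to the linear combination $\theta^T x$ from linear regression (this is exactly what `self.activate(np.dot(X, theta))` computes in the implementation below):

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$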
Cost Function
Logistic regression cannot reuse the cost function of linear regression: composing it with the sigmoid makes the cost surface non-convex (wavy), so gradient descent can get stuck in local optima.
Revisit the cost function of linear regression:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Instead, we want the cost to be small as $h_\theta(x)$ approaches the true value of $y$ and to grow large as it moves away. This can be expressed using the log function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
This cost function is commonly known as the "cross-entropy loss".
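As a quick sanity check, here is a minimal NumPy sketch of this loss on a few made-up predictions (the labels and probabilities are hypothetical, purely for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # true labels (hypothetical)
h = np.array([0.9, 0.2, 0.6, 0.99])  # predicted probabilities (hypothetical)

# Cross-entropy loss averaged over the m samples
loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print(loss)  # about 0.21
```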
Optimization
The gradient descent method in logistic regression is basically the same as in linear regression; only the contents of the hypothesis function are different.
Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
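In vectorized NumPy, a single update step can be sketched as follows (assuming `X`, `y`, `theta`, and a learning rate `alpha` are already defined; the same formula appears in the `fit` method of the implementation below):

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """One vectorized gradient-descent update for logistic regression."""
    m = X.shape[0]
    h = 1 / (1 + np.exp(-X @ theta))   # sigmoid predictions for all samples
    grad = (1 / m) * X.T @ (h - y)     # gradient of the cross-entropy loss
    return theta - alpha * grad        # updated parameters
```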
Implementation
We will implement the binary logistic regression algorithm using Python. The following code is a simple implementation of binary logistic regression using the breast cancer dataset from the scikit-learn library.
To start, we will load the dataset and divide it into training and test sets.
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
dataset = datasets.load_breast_cancer()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))
samples: 569; features: 30
samples: 569; values: [0 1]
Next, we will standardize the dataset. Standardization (z-score normalization) rescales each feature by subtracting its mean and dividing by its standard deviation, so that features with large values do not dominate the model.
def standardize(X: pd.DataFrame) -> pd.DataFrame:
"""Standardize the dataset. (z-score normalization)
:param X: The dataset to be standardized.
:return: The standardized dataset.
"""
return (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X_train_std = standardize(X_train)
X_test_std = standardize(X_test)
Now, we will proceed to implement the binary logistic regression algorithm. To gain a deeper understanding of the implementation, please refer to the comments provided within the code.
class BinaryLogisticRegression:
def __init__(self, alpha: float = 0.01, eps: float = 1e-6) -> None:
self.alpha = alpha # Learning rate for gradient descent
self.eps = eps # Threshold of convergence
def fit(self, X: pd.DataFrame, y: pd.Series) -> "BinaryLogisticRegression":
"""Fit the model to the training dataset. Optimizing the parameters by gradient descent.
:param X: The training dataset.
:param y: The target.
:return: The trained model.
"""
self._m = X.shape[0] # The number of samples
num_features = X.shape[1] # The number of features
self._theta = np.zeros(num_features) # The parameters (weight)
self._error_values = [] # The output values of the cost function in each iteration
self._grad_values = [] # Gradient values in each iteration
self._iter_counter = 0 # The counter of iterations
        error = self.J(X, y)  # The initial output value of the cost function with the initial (zero) parameters
diff = 1.0 # The difference between the previous and the current output values of the cost function
# Repeat until convergence
while diff > self.eps:
# Update the parameters by gradient descent
grad = (1 / self._m) * np.dot(self.h(X, self._theta) - y, X) # Calculate the gradient using the formula
self._theta = self._theta - self.alpha * grad # Update the parameters
# Print the current status
_error = self.J(X, y) # Compute the error with the updated parameters
diff = abs(error - _error) # Compute the difference between the previous and the current error
error = _error # Update the error
self._error_values.append(error)
self._grad_values.append(grad.sum())
self._iter_counter += 1
print(f"[{self._iter_counter}] error: {error}, diff: {diff}, grad: {grad.sum()}")
print(f"Convergence in {self._iter_counter} iterations.")
return self
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict the target values.
:param X: The dataset to be predicted.
:return: The predicted target values.
"""
return np.where(self.h(X, self._theta) >= 0.5, 1, 0)
def activate(self, z: np.ndarray) -> np.ndarray:
"""Activation function (sigmoid/logistic function).
:param z: The output of the hypothesis function.
:return: The activated output. 0 <= activate(z) <= 1
"""
return 1 / (1 + np.exp(-z))
def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
"""Hypothesis function.
:param X: The dataset
:param theta: The parameters (weight)
:return: The activated output. 0 <= h(x, theta) <= 1
"""
return self.activate(np.dot(X, theta))
def J(self, X: pd.DataFrame, y: pd.Series) -> float:
"""Cost function (cross-entropy loss).
:param X: The dataset
:param y: The target
:return: The loss value.
"""
delta = 1e-7 # To avoid log(0)
return - (1 / self._m) * (
np.sum(y * np.log(self.h(X, self._theta) + delta) + (1 - y) * np.log(1 - self.h(X, self._theta) + delta))
)
Now that the model is prepared, we can go ahead and train it using the standardized training dataset while also visualizing the training process.
model = BinaryLogisticRegression()
model.fit(X_train_std, y_train)
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot(1, 2, 1)
ax.set_title("Cross-entropy Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._error_values, color="b")
ax = fig.add_subplot(1, 2, 2)
ax.set_title("Gradient")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._grad_values, color="r")
plt.show()
With each iteration, the cross-entropy loss decreases and the gradient approaches zero, which indicates that the model is converging.
Finally, we will evaluate the model using the standardized training and test datasets.
from sklearn.metrics import accuracy_score
y_train_pred = model.predict(X_train_std)
print(f"Acuracy score for train data: {accuracy_score(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test_std)
print(f"Acuracy score for test data: {accuracy_score(y_test, y_test_pred)}")
Acuracy score for train data: 0.9882629107981221
Acuracy score for test data: 0.958041958041958
Multiple Logistic Regression
In the previous section, we implemented a binary logistic regression model. In this section, we will implement a multiple (multinomial) logistic regression model, which can handle more than two classes.
Activation Function
The activation function of the multiple logistic regression model is the softmax function, which is defined as follows:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$
The softmax function takes a vector of real numbers and returns a vector of the same length, where each element is in the range (0, 1), and the sum of the elements is 1. This is useful for representing the probability distribution of the classes.
import numpy as np
def softmax(z):
z = z - np.max(z, axis=0) # Prevent overflow
return np.exp(z) / np.sum(np.exp(z))
We can visualize the softmax function using the following code:
import matplotlib.pyplot as plt
x = np.arange(-5.0, 5.0, 0.1)
y = softmax(x)
plt.plot(x, y)
plt.ylim(0, 0.1)
plt.grid()
plt.show()
Implementation
The multiple logistic regression model is similar to the binary logistic regression model, but it predicts a probability distribution over the classes using the softmax function and takes the class with the highest probability as the predicted class.
As a training dataset, we will use the Iris dataset, which contains 150 samples of three classes of iris flowers. The dataset has four features: sepal length, sepal width, petal length, and petal width.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
dataset = datasets.load_iris()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))
samples: 150; features: 4
samples: 150; values: [0 1 2]
We will use One-hot encoding to convert the target values to a binary matrix. The one-hot encoding is a representation of categorical variables as binary vectors. This method is commonly used to ensure that categorical variables do not imply any ordinal relationship, and each category is treated independently.
y_train_encoded = pd.get_dummies(y_train, dtype=int)
print(y_train.head(3))
y_train_encoded.head(3)
The encoded target values will be in a format that matches the predicted probabilities for each class: class 0 becomes `[1, 0, 0]`, class 1 becomes `[0, 1, 0]`, and class 2 becomes `[0, 0, 1]`.
After training the model, the predicted target, determined by selecting the class with the maximum probability, will be compared with the true target values. For instance, if the predicted probabilities are $[0.1, 0.7, 0.2]$, then the predicted class is $1$.
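A minimal sketch of this argmax step, using made-up probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities for two samples over three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])

# Pick the class with the highest probability for each sample
predicted = probs.argmax(axis=1)
print(predicted)  # [1 0]
```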
Next, as in the binary logistic regression model, we will implement the standardize function and apply it to the training and test datasets.
def standardize(X: pd.DataFrame) -> pd.DataFrame:
"""Standardize the dataset. (z-score normalization)
:param X: The dataset to be standardized.
:return: The standardized dataset.
"""
return (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X_train_std = standardize(X_train)
X_test_std = standardize(X_test)
We will now proceed to implement the multiple logistic regression model. It's important to note that the implementation is similar to the binary logistic regression model, but in this case, we will use the softmax function as the activation function. Additionally, we will determine the predicted class based on the class with the highest probability.
class MultipleLogisticRegression:
def __init__(self, alpha: float = 0.01, eps: float = 1e-6) -> None:
self.alpha = alpha # Learning rate for gradient descent
self.eps = eps # Threshold of convergence
    def fit(self, X: pd.DataFrame, y: pd.DataFrame) -> "MultipleLogisticRegression":
        """Fit the model to the training dataset. Optimizing the parameters by gradient descent.
        :param X: The training dataset.
        :param y: The one-hot encoded target.
        :return: The trained model.
        """
self._m = X.shape[0] # The number of samples
num_features = X.shape[1] # The number of features
num_targets = y.shape[1] # The number of targets
self._theta = np.zeros([num_targets, num_features]) # The parameters (weight)
self._error_values = [] # The output values of the cost function in each iteration
self._grad_values = [] # Gradient values in each iteration
self._iter_counter = 0 # The counter of iterations
        error = self.J(X, y)  # The initial output value of the cost function with the initial (zero) parameters
diff = np.ones(num_targets) # The difference between the previous and the current output values of the cost function
# Repeat until convergence
while diff.sum() > self.eps:
# Update the parameters by gradient descent
grad = (1 / self._m) * np.dot((self.h(X, self._theta) - y).T, X) # Calculate the gradient using the formula
self._theta = self._theta - self.alpha * grad # Update the parameters
# Print the current status
_error = self.J(X, y) # Compute the error with the updated parameters
diff = abs(error - _error) # Compute the difference between the previous and the current error
error = _error # Update the error
self._error_values.append(error.sum())
self._grad_values.append(grad.sum())
self._iter_counter += 1
print(f"[{self._iter_counter}] error: {error.sum()}, diff: {diff.sum()}, grad: {grad.sum()}")
print(f"Convergence in {self._iter_counter} iterations.")
return self
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict the target values.
:param X: The dataset to be predicted.
:return: The predicted target values.
"""
return self.h(X, self._theta).argmax(1)
def activate(self, z: np.ndarray) -> np.ndarray:
"""Activation function (sigmoid/logistic function).
:param z: The output of the hypothesis function.
:return: The activated output. 0 <= activate(z) <= 1
"""
        z = z - np.max(z, axis=1, keepdims=True)  # Prevent overflow, as in the standalone softmax above
        return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
"""Hypothesis function.
:param X: The dataset
:param theta: The parameters (weight)
:return: The activated output. 0 <= h(x, theta) <= 1
"""
return self.activate(np.dot(X, theta.T))
    def J(self, X: pd.DataFrame, y: pd.DataFrame) -> float:
        """Cost function (cross-entropy loss).
        :param X: The dataset
        :param y: The one-hot encoded target
        :return: The loss value.
        """
delta = 1e-7 # To avoid log(0)
return - (1 / self._m) * (
np.sum(y * np.log(self.h(X, self._theta) + delta) + (1 - y) * np.log(1 - self.h(X, self._theta) + delta))
)
Since the model is ready, we will move on to train the model using the standardized features and one-hot encoded targets and evaluate its performance.
model = MultipleLogisticRegression()
model.fit(X_train_std, y_train_encoded)
from sklearn.metrics import accuracy_score
y_train_pred = model.predict(X_train_std)
print(f"Acuracy score for train data: {accuracy_score(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test_std)
print(f"Acuracy score for test data: {accuracy_score(y_test, y_test_pred)}")
Acuracy score for train data: 0.875
Acuracy score for test data: 0.7105263157894737
The model demonstrates decent performance on the training dataset but exhibits poor performance on the test dataset, indicating a clear case of overfitting. To address this issue, regularization techniques can be employed to mitigate overfitting and enhance the model's generalization capabilities.
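One common choice is L2 (ridge) regularization, which adds a penalty term $\frac{\lambda}{2m} \sum_{j} \theta_j^2$ to the cost function and, correspondingly, a $\frac{\lambda}{m} \theta$ term to the gradient. Below is a minimal sketch of such a regularized gradient, assuming NumPy arrays shaped like the model above; `lambda_` is a hypothetical regularization-strength hyperparameter, not part of the implementation in this post:

```python
import numpy as np

def l2_regularized_gradient(X, y_onehot, theta, lambda_):
    """Cross-entropy gradient with an added L2 penalty term (sketch).

    theta has shape (num_classes, num_features), matching the model above;
    lambda_ is a hypothetical regularization strength (e.g. 0.1).
    """
    m = X.shape[0]
    z = np.dot(X, theta.T)
    z = z - z.max(axis=1, keepdims=True)                  # numerical stability
    h = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax probabilities
    grad = (1 / m) * np.dot((h - y_onehot).T, X)          # unregularized gradient
    return grad + (lambda_ / m) * theta                   # shrink the weights toward zero
```

Using this gradient in `fit` (and adding the matching $\frac{\lambda}{2m} \sum_j \theta_j^2$ term to `J`) shrinks the weights and typically narrows the gap between training and test accuracy.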