Introduction
Contrary to its name, logistic regression is not a regression algorithm but a classification algorithm. In standard regression algorithms, the predicted value $y$ is a continuous value. However, in classification algorithms, the predicted value falls within the range $0 \le h_\theta(x) \le 1$. This is because we want to categorize by discrete values such as 0 or 1: if $h_\theta(x) \ge 0.5$, we predict $y = 1$; if $h_\theta(x) < 0.5$, we predict $y = 0$. In other words, we divide by a threshold value (0.5 in this case).
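As a minimal illustration of this thresholding step (the probabilities below are made up for the example):

```python
import numpy as np

# Hypothetical predicted probabilities for five samples
p = np.array([0.05, 0.42, 0.50, 0.73, 0.96])

# Apply the 0.5 threshold to obtain discrete class labels
labels = np.where(p >= 0.5, 1, 0)
print(labels)  # [0 0 1 1 1]
```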
Binary Logistic Regression
Binary logistic regression is used for binary classification tasks, where the objective is to categorize instances into one of two possible classes. These two classes are often represented as 0 and 1, corresponding to outcomes such as false/true, negative/positive, or fail/pass.
Activation Function
In order to categorize by discrete values, the Logistic Function, also known as the Sigmoid Function, is introduced:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The characteristic feature is that the function satisfies $0 < \sigma(z) < 1$ for all $z$ and $\sigma(0) = 0.5$.
import numpy as np
def sigmoid(z):
return 1 / (1 + np.exp(-z))
import matplotlib.pyplot as plt
x = np.arange(-5.0, 5.0, 0.1)
y = sigmoid(x)
plt.plot(x, y)
plt.grid()
plt.show()
Hypothesis Function
In logistic regression, the hypothesis function is a composite function that unites the hypothesis function of linear regression with the sigmoid function.
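Concretely, the hypothesis applies the sigmoid function to the linear combination $\theta^T x$ from linear regression (this is exactly what `self.activate(np.dot(X, theta))` computes in the implementation below):

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$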
Cost Function
Logistic regression cannot reuse the cost function of linear regression: composing it with the sigmoid makes the cost surface non-convex (wavy), so gradient descent can get stuck in local optima.
Revisit the cost function of linear regression:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Instead, we want the cost to be small as $h_\theta(x)$ approaches the true value of $y$ and to grow large as it moves away. This can be expressed using the log function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
This cost function is commonly known as the "cross-entropy loss".
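As a quick sanity check, here is a minimal NumPy sketch of this loss on a few made-up predictions (the labels and probabilities are hypothetical, purely for illustration):

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # true labels (hypothetical)
h = np.array([0.9, 0.2, 0.6, 0.99])  # predicted probabilities (hypothetical)

# Cross-entropy loss averaged over the m samples
loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print(loss)  # about 0.21
```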
Optimization
The gradient descent method in logistic regression is basically the same as in linear regression; only the contents of the hypothesis function are different.
Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
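In vectorized NumPy, a single update step can be sketched as follows (assuming `X`, `y`, `theta`, and a learning rate `alpha` are already defined; the same formula appears in the `fit` method of the implementation below):

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """One vectorized gradient-descent update for logistic regression."""
    m = X.shape[0]
    h = 1 / (1 + np.exp(-X @ theta))   # sigmoid predictions for all samples
    grad = (1 / m) * X.T @ (h - y)     # gradient of the cross-entropy loss
    return theta - alpha * grad        # updated parameters
```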
Implementation
We will implement the binary logistic regression algorithm using Python. The following code is a simple implementation of binary logistic regression using the breast cancer dataset from the scikit-learn library.
To start, we will load the dataset and divide it into training and test sets.
pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
dataset = datasets.load_breast_cancer()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))
samples: 569; features: 30
samples: 569; values: [0 1]
Next, we will standardize the dataset. Standardization (z-score normalization) rescales each feature by subtracting its mean and dividing by its standard deviation, so that features with large values do not dominate the model.
def standardize(X: pd.DataFrame) -> pd.DataFrame:
"""Standardize the dataset. (z-score normalization)
:param X: The dataset to be standardized.
:return: The standardized dataset.
"""
return (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X_train_std = standardize(X_train)
X_test_std = standardize(X_test)
Now, we will proceed to implement the binary logistic regression algorithm. To gain a deeper understanding of the implementation, please refer to the comments provided within the code.
class BinaryLogisticRegression:
def __init__(self, alpha: float = 0.01, eps: float = 1e-6) -> None:
self.alpha = alpha # Learning rate for gradient descent
self.eps = eps # Threshold of convergence
def fit(self, X: pd.DataFrame, y: pd.Series) -> "BinaryLogisticRegression":
"""Fit the model to the training dataset. Optimizing the parameters by gradient descent.
:param X: The training dataset.
:param y: The target.
:return: The trained model.
"""
self._m = X.shape[0] # The number of samples
num_features = X.shape[1] # The number of features
self._theta = np.zeros(num_features) # The parameters (weight)
self._error_values = [] # The output values of the cost function in each iteration
self._grad_values = [] # Gradient values in each iteration
self._iter_counter = 0 # The counter of iterations
        error = self.J(X, y)  # The initial output value of the cost function with the initial (zero) parameters
diff = 1.0 # The difference between the previous and the current output values of the cost function
# Repeat until convergence
while diff > self.eps:
# Update the parameters by gradient descent
grad = (1 / self._m) * np.dot(self.h(X, self._theta) - y, X) # Calculate the gradient using the formula
self._theta = self._theta - self.alpha * grad # Update the parameters
# Print the current status
_error = self.J(X, y) # Compute the error with the updated parameters
diff = abs(error - _error) # Compute the difference between the previous and the current error
error = _error # Update the error
self._error_values.append(error)
self._grad_values.append(grad.sum())
self._iter_counter += 1
print(f"[{self._iter_counter}] error: {error}, diff: {diff}, grad: {grad.sum()}")
print(f"Convergence in {self._iter_counter} iterations.")
return self
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict the target values.
:param X: The dataset to be predicted.
:return: The predicted target values.
"""
return np.where(self.h(X, self._theta) >= 0.5, 1, 0)
def activate(self, z: np.ndarray) -> np.ndarray:
"""Activation function (sigmoid/logistic function).
:param z: The output of the hypothesis function.
:return: The activated output. 0 <= activate(z) <= 1
"""
return 1 / (1 + np.exp(-z))
def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
"""Hypothesis function.
:param X: The dataset
:param theta: The parameters (weight)
:return: The activated output. 0 <= h(x, theta) <= 1
"""
return self.activate(np.dot(X, theta))
def J(self, X: pd.DataFrame, y: pd.Series) -> float:
"""Cost function (cross-entropy loss).
:param X: The dataset
:param y: The target
:return: The loss value.
"""
delta = 1e-7 # To avoid log(0)
return - (1 / self._m) * (
np.sum(y * np.log(self.h(X, self._theta) + delta) + (1 - y) * np.log(1 - self.h(X, self._theta) + delta))
)
Now that the model is prepared, we can go ahead and train it using the standardized training dataset while also visualizing the training process.
model = BinaryLogisticRegression()
model.fit(X_train_std, y_train)
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(15, 5))
ax = fig.add_subplot(1, 2, 1)
ax.set_title("Cross-entropy Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._error_values, color="b")
ax = fig.add_subplot(1, 2, 2)
ax.set_title("Gradient")
ax.set_ylabel("Gradient")
ax.set_xlabel("Iteration")
ax.plot(np.arange(model._iter_counter), model._grad_values, color="r")
plt.show()
With each iteration, the cross-entropy loss decreases and the gradient approaches zero, which indicates that the model is converging.
Finally, we will evaluate the model using the standardized training and test datasets.
from sklearn.metrics import accuracy_score
y_train_pred = model.predict(X_train_std)
print(f"Acuracy score for train data: {accuracy_score(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test_std)
print(f"Acuracy score for test data: {accuracy_score(y_test, y_test_pred)}")
Acuracy score for train data: 0.9882629107981221
Acuracy score for test data: 0.958041958041958
Multiple Logistic Regression
In the previous section, we implemented a binary logistic regression model. In this section, we will implement a multiple (multinomial) logistic regression model, which can handle more than two classes.
Activation Function
The activation function of the multiple logistic regression model is the softmax function, which is defined as follows:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$
The softmax function takes a vector of real numbers and returns a vector of the same length, where each element is in the range (0, 1), and the sum of the elements is 1. This is useful for representing the probability distribution of the classes.
import numpy as np
def softmax(z):
z = z - np.max(z, axis=0) # Prevent overflow
return np.exp(z) / np.sum(np.exp(z))
We can visualize the softmax function using the following code:
import matplotlib.pyplot as plt
x = np.arange(-5.0, 5.0, 0.1)
y = softmax(x)
plt.plot(x, y)
plt.ylim(0, 0.1)
plt.grid()
plt.show()
Implementation
The multiple logistic regression model is similar to the binary logistic regression model, but it predicts a probability distribution over the classes using the softmax function and takes the class with the highest probability as the predicted class.
As a training dataset, we will use the Iris dataset, which contains 150 samples of three classes of iris flowers. The dataset has four features: sepal length, sepal width, petal length, and petal width.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
dataset = datasets.load_iris()
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.Series(data=dataset.target, name="target")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
print("samples: {}; features: {}".format(*X.shape))
print("samples: {}; values: {}".format(*y.shape, y.unique()))
samples: 150; features: 4
samples: 150; values: [0 1 2]
We will use One-hot encoding to convert the target values to a binary matrix. The one-hot encoding is a representation of categorical variables as binary vectors. This method is commonly used to ensure that categorical variables do not imply any ordinal relationship, and each category is treated independently.
y_train_encoded = pd.get_dummies(y_train, dtype=int)
print(y_train.head(3))
y_train_encoded.head(3)
The encoded target values will be in a format that matches the predicted probabilities for each class: class 0 becomes `[1, 0, 0]`, class 1 becomes `[0, 1, 0]`, and class 2 becomes `[0, 0, 1]`.
After training the model, the predicted target, determined by selecting the class with the maximum probability, will be compared with the true target values. For instance, if the predicted probabilities are $[0.1, 0.7, 0.2]$, then the predicted class is $1$.
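A minimal sketch of this argmax step, using made-up probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities for two samples over three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])

# Pick the class with the highest probability for each sample
predicted = probs.argmax(axis=1)
print(predicted)  # [1 0]
```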
Next, as in the binary logistic regression model, we will implement the standardize function and apply it to the training and test datasets.
def standardize(X: pd.DataFrame) -> pd.DataFrame:
"""Standardize the dataset. (z-score normalization)
:param X: The dataset to be standardized.
:return: The standardized dataset.
"""
return (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X_train_std = standardize(X_train)
X_test_std = standardize(X_test)
We will now proceed to implement the multiple logistic regression model. It's important to note that the implementation is similar to the binary logistic regression model, but in this case, we will use the softmax function as the activation function. Additionally, we will determine the predicted class based on the class with the highest probability.
class MultipleLogisticRegression:
def __init__(self, alpha: float = 0.01, eps: float = 1e-6) -> None:
self.alpha = alpha # Learning rate for gradient descent
self.eps = eps # Threshold of convergence
    def fit(self, X: pd.DataFrame, y: pd.DataFrame) -> "MultipleLogisticRegression":
        """Fit the model to the training dataset. Optimizing the parameters by gradient descent.
        :param X: The training dataset.
        :param y: The one-hot encoded target.
        :return: The trained model.
        """
self._m = X.shape[0] # The number of samples
num_features = X.shape[1] # The number of features
num_targets = y.shape[1] # The number of targets
self._theta = np.zeros([num_targets, num_features]) # The parameters (weight)
self._error_values = [] # The output values of the cost function in each iteration
self._grad_values = [] # Gradient values in each iteration
self._iter_counter = 0 # The counter of iterations
        error = self.J(X, y)  # The initial output value of the cost function with the initial (zero) parameters
diff = np.ones(num_targets) # The difference between the previous and the current output values of the cost function
# Repeat until convergence
while diff.sum() > self.eps:
# Update the parameters by gradient descent
grad = (1 / self._m) * np.dot((self.h(X, self._theta) - y).T, X) # Calculate the gradient using the formula
self._theta = self._theta - self.alpha * grad # Update the parameters
# Print the current status
_error = self.J(X, y) # Compute the error with the updated parameters
diff = abs(error - _error) # Compute the difference between the previous and the current error
error = _error # Update the error
self._error_values.append(error.sum())
self._grad_values.append(grad.sum())
self._iter_counter += 1
print(f"[{self._iter_counter}] error: {error.sum()}, diff: {diff.sum()}, grad: {grad.sum()}")
print(f"Convergence in {self._iter_counter} iterations.")
return self
def predict(self, X: pd.DataFrame) -> np.ndarray:
"""Predict the target values.
:param X: The dataset to be predicted.
:return: The predicted target values.
"""
return self.h(X, self._theta).argmax(1)
def activate(self, z: np.ndarray) -> np.ndarray:
"""Activation function (sigmoid/logistic function).
:param z: The output of the hypothesis function.
:return: The activated output. 0 <= activate(z) <= 1
"""
        z = z - np.max(z, axis=1, keepdims=True)  # Prevent overflow, as in the standalone softmax above
        return np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
def h(self, X: pd.DataFrame, theta: np.ndarray) -> np.ndarray:
"""Hypothesis function.
:param X: The dataset
:param theta: The parameters (weight)
:return: The activated output. 0 <= h(x, theta) <= 1
"""
return self.activate(np.dot(X, theta.T))
    def J(self, X: pd.DataFrame, y: pd.DataFrame) -> float:
        """Cost function (cross-entropy loss).
        :param X: The dataset
        :param y: The one-hot encoded target
        :return: The loss value.
        """
delta = 1e-7 # To avoid log(0)
return - (1 / self._m) * (
np.sum(y * np.log(self.h(X, self._theta) + delta) + (1 - y) * np.log(1 - self.h(X, self._theta) + delta))
)
Since the model is ready, we will move on to train the model using the standardized features and one-hot encoded targets and evaluate its performance.
model = MultipleLogisticRegression()
model.fit(X_train_std, y_train_encoded)
from sklearn.metrics import accuracy_score
y_train_pred = model.predict(X_train_std)
print(f"Acuracy score for train data: {accuracy_score(y_train, y_train_pred)}")
y_test_pred = model.predict(X_test_std)
print(f"Acuracy score for test data: {accuracy_score(y_test, y_test_pred)}")
Acuracy score for train data: 0.875
Acuracy score for test data: 0.7105263157894737
The model demonstrates decent performance on the training dataset but exhibits poor performance on the test dataset, indicating a clear case of overfitting. To address this issue, regularization techniques can be employed to mitigate overfitting and enhance the model's generalization capabilities.
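One common choice is L2 (ridge) regularization, which adds a penalty term $\frac{\lambda}{2m} \sum_{j} \theta_j^2$ to the cost function and, correspondingly, a $\frac{\lambda}{m} \theta$ term to the gradient. Below is a minimal sketch of such a regularized gradient, assuming NumPy arrays shaped like the model above; `lambda_` is a hypothetical regularization-strength hyperparameter, not part of the implementation in this post:

```python
import numpy as np

def l2_regularized_gradient(X, y_onehot, theta, lambda_):
    """Cross-entropy gradient with an added L2 penalty term (sketch).

    theta has shape (num_classes, num_features), matching the model above;
    lambda_ is a hypothetical regularization strength (e.g. 0.1).
    """
    m = X.shape[0]
    z = np.dot(X, theta.T)
    z = z - z.max(axis=1, keepdims=True)                  # numerical stability
    h = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax probabilities
    grad = (1 / m) * np.dot((h - y_onehot).T, X)          # unregularized gradient
    return grad + (lambda_ / m) * theta                   # shrink the weights toward zero
```

Using this gradient in `fit` (and adding the matching $\frac{\lambda}{2m} \sum_j \theta_j^2$ term to `J`) shrinks the weights and typically narrows the gap between training and test accuracy.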