DEV Community: Lance Galletti

GMM Clustering From Scratch

Lance Galletti — Thu, 18 May 2023 14:38:09 +0000

In this article you will learn how to implement the EM algorithm for solving GMM clustering from scratch.

Your friend, who works at Jurassic Park, needs to routinely record the weights of the various dinosaurs to monitor their health and make sure each one is in a normal range for its species. This time though, they forgot to label which weights correspond to which dino species, so they don’t know what range to compare each weight against… Luckily, they know how many different species are in the park, but they need your help to figure out which species a given weight is most likely associated with.

By the end of this article, you will be able to help your friend.

Breaking down the task

This is not an easy problem. For each weight in our sample, we need to report, for each species, the probability that the given weight comes from that species. Formally, we need to find the following conditional probability: $P (S_{j} ∣ X_{i})$

Where $S_{j}$ is the $j^{th}$ species and $X_{i}$ is a specific animal weight from the dataset your friend gave you. Computing this value is highly complex because:

Some dinosaurs are more common than others: for example, there are many more Stegosauruses than Raptors in the park. This means that a given data point, if we know nothing else about it, has a higher chance of being a Stegosaurus than a Raptor.
The weights of different species are distributed differently. For example, the weights of a Sauropod might follow a bell curve, symmetric around an average weight of about 100 tons. But the weights of Maiasaura might differ greatly between males and females, so we might observe more of a bimodal distribution (with peaks at the average weights of males and females).

Doing the math and applying Bayes Theorem reveals these probabilities:

$$P (S_{j} ∣ X_{i}) = \frac{P (X_{i} ∣ S_{j}) P (S_{j})}{P (X_{i})}$$

Where:

$P (S_{j})$ is the prior probability of seeing species $S_{j}$ (that probability would be higher for the Stegosauruses than the Raptors for example).
$P (X_{i} ∣ S_{j})$ is the PDF of species $S_{j}$ weights evaluated at weight $X_{i}$ (seeing a Sauropod that weighs 100 tons is way more likely than seeing a Raptor that weighs 100 tons)

What about $P (X_{i})$ ?

To compute $P (X_{i})$ , we need to understand the distribution of $X_{i}$ . Let’s work with a simple example where there are only two species in the park: the Stegosaurus and the Ankylosaurus. If we looked at the distribution of the weights of all the Stegosauruses, we would see a normal distribution around 4 metric tons, with a standard deviation of about .1 tons. Looking at the Ankylosaurus, we would observe a normal distribution around a mean of 5 tons with a standard deviation of about .5 tons.

To get the distribution of the weights of both we need to account for how likely it is to meet a Stegosaurus compared to an Ankylosaurus. Keeping things simple, we can assume the park has equal numbers of Stegosauruses and Ankylosauruses. So the probability that we would encounter one vs the other would be 50%.
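As a minimal sketch of this two-species case, the snippet below evaluates $P (X_{i})$ and then the posterior $P (S_{j} ∣ X_{i})$ for a single weight. The function name `species_posteriors` and the 4.3 ton query weight are illustrative choices, not part of your friend's data.

```python
from scipy.stats import norm

# (mean, standard deviation) of each species' weights in metric tons, as described above
species = {"stegosaurus": (4.0, 0.1), "ankylosaurus": (5.0, 0.5)}
# equal numbers of each species in the park
prior = {"stegosaurus": 0.5, "ankylosaurus": 0.5}

def species_posteriors(weight):
    # numerator of Bayes' theorem: P(X_i | S_j) * P(S_j) for each species
    joint = {s: prior[s] * norm.pdf(weight, mu, sd) for s, (mu, sd) in species.items()}
    # denominator: P(X_i) is the sum of the numerators over all species
    p_x = sum(joint.values())
    return {s: joint[s] / p_x for s in joint}

posteriors = species_posteriors(4.3)
print(posteriors)
```

A 4.3 ton animal is 3 standard deviations above the Stegosaurus mean but only 1.4 below the Ankylosaurus mean, so most of the probability goes to the Ankylosaurus.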

In general, from any number of species each with individual weight distributions, we can construct a combined distribution as above by simply specifying the proportion of each individual species we expect.

For example, we could have, in the park, 10% raptors, 25% Sauropods, 5% T-Rexs, 30% Stegosauruses, 15% Ankylosauruses, 15% Maiasaura. The individual weight distributions $P (X_{i} ∣ S_{j})$ could look like this:

To combine them, we factor in their relative proportion $P (S_{j})$ , as such

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

maiasaura = lambda x : .5 * norm.pdf(x, 7, .3) + .5 * norm.pdf(x, 8, .3)
stegosaurus = lambda x : norm.pdf(x, 4, .3)
ankylosaurus = lambda x : norm.pdf(x, 5, .5)
trex = lambda x : norm.pdf(x, 10, 1.5)
raptor = lambda x : norm.pdf(x, .7, .2)
sauropod = lambda x : norm.pdf(x, 20, 3)

x = np.arange(0, 30, .01)
plt.plot(x, .1 * raptor(x) + .25 * sauropod(x) + .05 * trex(x) + .3 * stegosaurus(x) + .15 * ankylosaurus(x) + .15 * maiasaura(x), color='blue')

Hence, for a given weight $X_{i}$ in the dataset, $P (X_{i})$ is computed by the weighted sum of the PDFs of each species’ weights as such:

$$P (X_{i}) = \sum_{j} P (S_{j}) P (X_{i} ∣ S_{j})$$

We say that $X_{i}$ follows a mixture distribution with a set number k of components. When every component has a Normal Distribution, we refer to that special case as a Gaussian Mixture Distribution.

Gaussian Mixture Model

Recall, our goal is to report back $P (S_{j} ∣ X_{i})$ for all weights and all species. So if there are k=10 species we would report back 10 probabilities per data point in the dataset. In order to compute $P (S_{j} ∣ X_{i})$ we need $P (X_{i} ∣ S_{j})$ which could be any distribution with any number of parameters… To simplify things, GMM assumes that the data follows a Gaussian Mixture Distribution where every $P (X_{i} ∣ S_{j})$ is a Normal Distribution.

With this assumption, what do we need to know in order to compute $P (S_{j} ∣ X_{i})$ ?

The relative proportions of each species in the park $P (S_{j})$
The parameters of each of the normal distributions $P (X_{i} ∣ S_{j})$ (which are $μ_{j}$ and $σ_{j}$ )

Maximum Likelihood Estimation

Suppose you are given a dataset of coin tosses and are asked to estimate the probability of Heads. How would you go about it? Let’s take the following sequence of coin tosses (which we can assume are independent):

H, T, T, H, T

Given the limited information, the best we can do is find the probability of Heads that maximizes the probability of having observed this particular sequence of coin tosses. Knowing that this coin can be modeled as a Bernoulli RV with probability p of Heads, we can formulate the probability of observing the above data as:

$$P (H, T, T, H, T) = \binom{5}{2} p^{2} (1 - p)^{3}$$

To find the value of p that maximizes the probability of observing the data we saw, we can take the derivative of the above wrt p, set it equal to zero and solve for p. The binomial coefficient is a constant that does not depend on p, so it has no effect on where the maximum lies.

Our estimate for p is then 2/5 which is the sample proportion of Heads in our dataset. And it’s the best we can do given the information we have. This approach is called the Maximum Likelihood Estimation approach where we estimate the parameters of the distribution that generated the dataset by finding the parameter values that maximize the probability of observing that dataset (i.e. assuming that the dataset is a sample that perfectly represents the distribution).
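The same estimate can be checked numerically. The sketch below evaluates the likelihood on a grid of candidate values of p and keeps the best one; the grid resolution and the names `likelihood` and `p_hat` are arbitrary choices made for illustration.

```python
import numpy as np

tosses = ["H", "T", "T", "H", "T"]
heads = tosses.count("H")
tails = tosses.count("T")

def likelihood(p):
    # probability of the observed tosses, up to a constant factor that does not depend on p
    return p ** heads * (1 - p) ** tails

# evaluate the likelihood on a fine grid of candidate probabilities
candidates = np.linspace(0, 1, 1001)
p_hat = candidates[np.argmax(likelihood(candidates))]
print(p_hat)
```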

Maximum Likelihood Estimation of Gaussian Mixture Distribution parameters

We can use the same approach to estimate the parameters of the Gaussian Mixture Distribution that generated the data. Recall:

$$P (X_{i}) = \sum_{j} P (S_{j}) P (X_{i} ∣ S_{j})$$

So the probability of seeing a dataset with N such values would be the product of those PDFs:

$$\prod_{i} P (X_{i}) = \prod_{i} \sum_{j} P (S_{j}) P (X_{i} ∣ S_{j})$$

Taking the derivative of a product is hard… To make things easier we can take the log of the above to transform the product into a sum (which won’t change the critical points):

$$\log \left( \prod_{i} P (X_{i}) \right) = \sum_{i} \log \left( \sum_{j} P (S_{j}) P (X_{i} ∣ S_{j}) \right)$$
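To make the log-likelihood concrete, here is a small sketch for one-dimensional data such as the dinosaur weights. The function name `gmm_log_likelihood`, the three sample weights and both parameter sets are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, props, means, stds):
    # column vector of data points so the PDFs broadcast to shape (N, k)
    x = np.asarray(x, dtype=float)[:, None]
    # P(X_i) = sum_j P(S_j) * P(X_i | S_j)
    p_x = (np.asarray(props) * norm.pdf(x, means, stds)).sum(axis=1)
    # log of the product over i is the sum of the logs
    return np.log(p_x).sum()

weights = [3.9, 4.1, 5.2]
ll_plausible = gmm_log_likelihood(weights, [.5, .5], [4, 5], [.1, .5])
ll_implausible = gmm_log_likelihood(weights, [.5, .5], [6, 7], [.1, .5])
print(ll_plausible, ll_implausible)
```

Parameters that explain the data well give a higher log-likelihood, which is why this quantity is a common way to monitor whether an iterative fit is still improving.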

Taking the derivative wrt all the parameters and setting it equal to zero we get the following estimates:

$$\hat{μ}_{j} = \frac{\sum_{i} P (S_{j} ∣ X_{i}) X_{i}}{\sum_{i} P (S_{j} ∣ X_{i})}$$

$$\hat{Σ}_{j} = \frac{\sum_{i} P (S_{j} ∣ X_{i}) (X_{i} - \hat{μ}_{j})^{T} (X_{i} - \hat{μ}_{j})}{\sum_{i} P (S_{j} ∣ X_{i})}$$

$$\hat{P} (S_{j}) = \frac{1}{N} \sum_{i} P (S_{j} ∣ X_{i})$$

Something is strange here… Recall the entire reason we need to compute these values is to report $P (S_{j} ∣ X_{i})$ ! But in order to compute these values we need to know $P (S_{j} ∣ X_{i})$ …

Expectation Maximization Algorithm

Since we need one value to compute the others and we need the others to compute the one, the EM Algorithm proposes the following approach:

1. Start with random $μ_{j}, Σ_{j}, P (S_{j})$
2. Compute $P (S_{j} ∣ X_{i})$ for all $X_{i}$ by using $μ_{j}, Σ_{j}, P (S_{j})$
3. Compute / Update $μ_{j}, Σ_{j}, P (S_{j})$ from $P (S_{j} ∣ X_{i})$
4. Repeat 2 & 3 until convergence
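Before the full implementation, the steps above can be sketched in one dimension on simulated dinosaur weights. This is a simplified illustration under a few assumptions: it starts from spread-out quantiles of the data rather than purely random values, runs a fixed number of iterations instead of testing for convergence, and the name `em_1d` and the simulated sample are invented for the example.

```python
import numpy as np
from scipy.stats import norm

def em_1d(x, k, iterations=100):
    x = np.asarray(x, dtype=float)
    # step 1: starting values (quantiles of the data, overall spread, equal proportions)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())
    prop = np.full(k, 1.0 / k)
    for _ in range(iterations):
        # step 2: P(S_j | X_i) for every point and component, shape (N, k)
        joint = prop * norm.pdf(x[:, None], mu, sigma)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # step 3: update the parameters from P(S_j | X_i)
        weight = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / weight
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / weight)
        prop = weight / len(x)
    return mu, sigma, prop, resp

# simulated weights: half Stegosaurus (4 tons, sd .1), half Ankylosaurus (5 tons, sd .5)
rng = np.random.default_rng(0)
weights = np.concatenate([rng.normal(4, .1, 500), rng.normal(5, .5, 500)])
mu, sigma, prop, resp = em_1d(weights, 2)
print(mu, sigma, prop)
```

With this much data the estimated means should land close to 4 and 5 tons, and each row of `resp` holds the per-species probabilities for one weight.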

Implementation

Generating the data

Let’s start by generating a dataset that follows a Gaussian Mixture Distribution:

from numpy import array, argmax
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from numpy.random import multivariate_normal as mvn_random
from scipy.stats import multivariate_normal
from numpy.random import uniform

class Component:
    def __init__(self, mixture_prop, mean, variance):
        self.mixture_prop = mixture_prop  # P(S_j)
        self.mean = mean                  # mean vector
        self.variance = variance          # covariance matrix

def generate_gmm_dataset(gmm_params, sample_size):
    def get_random_component(gmm_params):
        '''
            returns component with prob
            proportional to mixture_prop
        '''
        r = uniform()
        for c in gmm_params:
            r -= c.mixture_prop
            if r <= 0:
                return c
        return gmm_params[-1]  # guard against floating point round-off

    dataset = []
    for _ in range(sample_size):
        comp = get_random_component(gmm_params)
        # draw one point from the chosen component's multivariate normal
        dataset += [mvn_random(comp.mean, comp.variance)]
    return array(dataset)  # shape (sample_size, 2)

sample_size = 1000  # assumed value; the original does not specify one
gmm = [
    Component(.25, [-3, 3], [[1, 0], [0, 1]]),
    Component(.50, [0, 0], [[1, 0], [0, 1]]),
    Component(.25, [3, 3], [[1, 0], [0, 1]])
]
num_clusters = len(gmm)  # used below when fitting and plotting
data = generate_gmm_dataset(gmm, sample_size)

EM Algorithm

First we need to find reasonable initial values for the $μ_{j}, Σ_{j}, P (S_{j})$ which we can do by applying a clustering algorithm like Kmeans (which actually favors this type of globular cluster).

def gmm_init(k, dataset):
    # use the hard assignments from K-means to get starting parameters
    dataset = array(dataset)
    kmeans = KMeans(k, init='k-means++').fit(dataset)
    gmm_params = []

    for j in range(k):
        members = dataset[kmeans.labels_ == j]  # points assigned to cluster j
        p_cj = len(members) / len(dataset)      # proportion of points in cluster j
        mean_j = members.mean(axis=0)
        diffs = members - mean_j
        var_j = diffs.T @ diffs / len(members)  # covariance matrix of cluster j

        gmm_params.append(Component(p_cj, mean_j, var_j))

    return gmm_params

From the clusters generated by Kmeans, we can get the mean and variance of each cluster, as well as the proportion of points in that cluster, to get initial values for $μ_{j}, Σ_{j}, P (S_{j})$ .

Then we have two steps in the EM Algorithm:

def expectation_maximization(k, dataset, iterations):
    gmm_params = gmm_init(k, dataset)

    for _ in range(iterations):
        # expectation step
        probs = compute_probs(k, dataset, gmm_params)

        # maximization step
        gmm_params = compute_gmm(k, dataset, probs)

    return probs, gmm_params

Where the helper functions are defined as such:

def compute_gmm(k, dataset, probs):
    '''
        Compute P(C_j), mean_j, var_j
        Here mean_j is a vector and var_j is a matrix
    '''
    gmm_params = []

    for j in range(k):
        p_cj = sum([probs[i][j] for i in range(len(dataset))]) / len(dataset)
        mean_j = sum([probs[i][j] * dataset[i] for i in range(len(dataset))]) / sum([probs[i][j] for i in range(len(dataset))])
        var_j = sum([probs[i][j] * (dataset[i] - mean_j).reshape(-1, 1) * (dataset[i] - mean_j).reshape(1, -1) for i in range(len(dataset))]) / sum([probs[i][j] for i in range(len(dataset))])

        gmm_params.append(Component(p_cj, mean_j, var_j))

    return gmm_params


def compute_probs(k, dataset, gmm_params):
    '''
        For all x_i in dataset, compute P(C_j | X_i) = P(X_i | C_j)P(C_j) / P(X_i) for all C_j
        return the list of lists of all P(C_j | X_i) for all x_i in dataset.
    '''
    probs = []

    for i in range(len(dataset)):
        p_cj_xi = []
        for j in range(k):
            p_cj_xi += [gmm_params[j].mixture_prop * multivariate_normal.pdf(dataset[i], gmm_params[j].mean, gmm_params[j].variance)]
        # normalize by P(X_i) so the k probabilities for this point sum to 1
        p_cj_xi = array(p_cj_xi) / sum(p_cj_xi)
        probs.append(p_cj_xi)

    return probs

To draw the above plots, where the size of each data point reflects the probability of it being in its assigned cluster, you can do the following:

probs, gmm_p = expectation_maximization(num_clusters, data, 3)
labels = [argmax(array(p)) for p in probs] # create a hard assignment
size = 50 * array(probs).max(1) ** 2 # emphasizes the difference in probability

plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis', s=size)
plt.title('GMM with {} clusters and {} samples'.format(num_clusters, sample_size))
plt.show()
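One way to sanity-check a from-scratch result is to compare it against scikit-learn's `GaussianMixture`, which fits the same kind of model. The sketch below regenerates similar data with a fixed seed so that it runs on its own; the variable names and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# regenerate data like the dataset above: three unit-covariance Gaussians
rng = np.random.default_rng(0)
true_means = [[-3, 3], [0, 0], [3, 3]]
sizes = [250, 500, 250]
points = np.vstack([
    rng.multivariate_normal(m, np.eye(2), n) for m, n in zip(true_means, sizes)
])

reference = GaussianMixture(n_components=3, random_state=0).fit(points)
reference_probs = reference.predict_proba(points)  # P(S_j | X_i), one row per point

print(reference.weights_)  # estimated mixture proportions
print(reference.means_)    # estimated component means
```

The component order is arbitrary, so compare the estimated means and proportions after matching each component to its nearest counterpart.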

Conclusion

Now you can help your friend figure out which weight most likely corresponds to which dino species.

Acknowledgement

Thank you to Reshab Chhabra for their contributions.

Support Vector Machines From Scratch

Lance Galletti — Mon, 08 May 2023 15:24:17 +0000

In this article you will learn how to implement a simple algorithm for solving SVM from scratch.

Tldr; Support Vector Machines

The goal is to find the widest street that separates classes. The street is defined by 3 lines:

In this example we have two classes (blue = +1 and green = -1). The line in red is the decision boundary. To classify an unknown point u using the above SVM:

if $w^{T} u + b \geq 0$ then blue
if $w^{T} u + b < 0$ then green

The width of the street is

$$\text{width} = \frac{2}{∥ w ∥}$$

That means the width of the street is inversely proportional to the magnitude of w.

For example, dividing w and b by 4 makes the street four times wider.

So finding the widest street means we want to find a w that is as small as possible that forms a street that separates the classes.

Expanding / Retracting Rate

Notice that multiplying w and b by the same constant c doesn’t change the decision boundary but does change the width of the street. If:

$0 < c < 1$ the width will expand
$c > 1$ the width will retract
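A quick numerical check of this scaling behaviour, using arbitrary example values for w, b and the point u (the helper names `street_width` and `classify` are also just for illustration):

```python
import numpy as np

def street_width(w):
    # width = 2 / ||w||
    return 2 / np.linalg.norm(w)

def classify(w, b, u):
    # blue (+1) on or above the decision boundary, green (-1) below it
    return "blue" if np.dot(w, u) + b >= 0 else "green"

w, b = np.array([3.0, 4.0]), -2.0   # arbitrary example values
u = np.array([1.0, 0.5])
c = 0.25                            # 0 < c < 1, so the street should expand

print(street_width(w), street_width(c * w))       # 0.4 1.6
print(classify(w, b, u), classify(c * w, c * b, u))  # blue blue
```

Scaling by c = 0.25 leaves the predicted class unchanged while the street becomes four times wider.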

Assuming we have a linearly separable dataset, we would impose the constraint that no data points lie in the street. This means that if we find a line that separates our classes but some points lie in the street, we should make the street more narrow. And if we find a street that separates our classes and none of them are in the street, we should expand the street.

Ideally, some special points (called support vectors) would exactly lie on the edge of the street.

Perceptron Algorithm

If the current w and b result in misclassification of a random point x in our dataset, we can move the line toward x by some amount lr as follows:

$w_{new} = w_{old} + y_{i} * lr * x_{i}$ and $b_{new} = b_{old} + y_{i} * lr$
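To see why this update helps, the sketch below applies a single step to one misclassified point and compares $y (w^{T} x + b)$ before and after. The starting values and the helper name `perceptron_step` are arbitrary choices for illustration.

```python
import numpy as np

def perceptron_step(w, b, x, y, lr):
    # move the line toward x by an amount lr, in the direction of the label y
    return w + y * lr * x, b + y * lr

w, b = np.array([1.0, -1.0]), 0.0
x, y = np.array([1.0, 2.0]), 1   # w.x + b = -1, so this +1 point is misclassified
lr = 0.05

w_new, b_new = perceptron_step(w, b, x, y, lr)
before = y * (np.dot(w, x) + b)
after = y * (np.dot(w_new, x) + b_new)
print(before, after)
```

Each step raises $y (w^{T} x + b)$ by $lr (∥ x ∥^{2} + 1)$, so repeated updates on the same point eventually push it to the correct side of the line.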

If that point x is in the street we might want to retract the street. If that point is not in the street or was classified correctly we might want to expand it.

Putting it all together

We start with a random w and b (note: the initial values of w and b have a large impact on what line is learned through this process):

import numpy as np

# assumes X is an (n, 2) array of points and Y holds the matching labels (+1 / -1)
w = np.random.randn(2)  # random starting values
b = np.random.randn()

epochs = 100
lr = .05
expanding_rate = .99
retracting_rate = 1.01
for _ in range(epochs):
    # pick a point from X at random
    i = np.random.randint(0, len(X))
    x, y = X[i], Y[i]
    ypred = w[0] * x[0] + w[1] * x[1] + b
    if (ypred > 0 and y > 0) or (ypred < 0 and y < 0):
        # classified correctly
        if ypred < 1 and ypred > -1:
            # in the street / street is too wide:
            # move the line toward x, then scale w and b together to retract
            w = (w + x * y * lr) * retracting_rate
            b = (b + y * lr) * retracting_rate
        else:
            # street is too narrow
            w = w * expanding_rate
            b = b * expanding_rate
    else:
        # misclassified: move the line toward x, then expand
        w = (w + x * y * lr) * expanding_rate
        b = (b + y * lr) * expanding_rate

Circled in red are the points that were misclassified. Circled in yellow are the points that were correctly classified.

What if the data is not linearly separable

Looking at the problem statement again — we’re looking to maximize the width of the street subject to the constraint that points in our dataset cannot lie in the street. Mathematically we want to find:

$$\max \frac{2}{∥ w ∥} \iff \min ∥ w ∥ \iff \min \frac{1}{2} ∥ w ∥^{2}$$

subject to:

$$y_{i} (w \cdot x_{i} + b) - 1 \geq 0$$

(with equality for the support vectors, which lie exactly on the edge of the street)

Which equates to minimizing:

$$L = \frac{1}{2} ∥ w ∥^{2} - \sum_{i} α_{i} [y_{i} (w \cdot x_{i} + b) - 1]$$

Taking the derivative wrt w:

$$\frac{\partial L}{\partial w} = w - \sum_{i} α_{i} y_{i} x_{i} = 0$$

Which means:

$$w = \sum_{i} α_{i} y_{i} x_{i}$$

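Absorbing each label $y_{i}$ into its coefficient (so that $α_{i}$ now stands for $α_{i} y_{i}$ and carries the sign of the class), the decision boundary can be re-written as:

$$\sum_{i} α_{i} ⟨ x_{i}, x ⟩ + b = 0$$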
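The sketch below checks numerically that predicting with w and predicting with a sum of inner products give the same value. Here `signed_alpha` stands for the coefficient that multiplies each inner product (that is, $α_{i} y_{i}$); the points, labels and multipliers are random made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))           # six made-up training points
Y = np.array([1, 1, 1, -1, -1, -1])   # their labels
alpha = rng.uniform(0, 1, size=6)     # made-up multipliers
b = 0.3

signed_alpha = alpha * Y
# w = sum_i alpha_i * y_i * x_i
w = (signed_alpha[:, None] * X).sum(axis=0)

x_new = np.array([0.5, -1.2])
with_w = np.dot(w, x_new) + b
with_inner_products = sum(
    signed_alpha[i] * np.dot(X[i], x_new) for i in range(len(X))
) + b
print(with_w, with_inner_products)
```

Because the prediction only needs inner products with the training points, w never has to be formed explicitly.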

Updating w means updating α.

$α_{i, new} = α_{i, old} + y_{i} * lr$ and $b_{new} = b_{old} + y_{i} * lr$

So, we can re-write the above algorithm as such:

import numpy as np

# assumes X is an (n, d) array of points and Y holds the matching labels (+1 / -1)
alpha_i = np.zeros(len(X))  # one coefficient per training point (starting at zero is an assumption)
b = 0.0

epochs = 100
lr = .05
expanding_rate = .99
retracting_rate = 1.01
def predict(alpha_i, b, x):
    # sum_j alpha_j <x_j, x> + b
    wx = 0
    for j in range(len(X)):
        wx += alpha_i[j] * np.dot(X[j], x)
    return wx + b
for _ in range(epochs):
    # pick a point from X at random
    i = np.random.randint(0, len(X))
    x, y = X[i], Y[i]
    ypred = predict(alpha_i, b, x)
    if (ypred > 0 and y > 0) or (ypred < 0 and y < 0):
        # classified correctly
        if ypred < 1 and ypred > -1:
            # in the street / street is too wide
            alpha_i[i] += y * lr
            alpha_i = alpha_i * retracting_rate
            # scale b by the same factor as alpha_i so the boundary is preserved
            b = (b + y * lr) * retracting_rate
        else:
            # street is too narrow
            alpha_i = alpha_i * expanding_rate
            b *= expanding_rate
    else:
        # misclassified
        alpha_i[i] += y * lr
        alpha_i = alpha_i * expanding_rate
        b = (b + y * lr) * expanding_rate

Kernel Trick

To find a non-linear decision boundary, we can use the “kernel trick”. That is, instead of defining an explicit mapping to a new feature space where the dataset is linearly separable we need only define what the inner product in that transformed space is. This kernel function is what defines this inner product and can replace the dot product in the predict function:

$$\sum_{i} α_{i} K (x_{i}, x) + b = 0$$

For example we can use the polynomial kernel

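def polynomial(x_i, x_j, c, n):
    # polynomial kernel: (<x_i, x_j> + c) ** n
    return (np.dot(x_i, x_j) + c) ** n

# assumed example hyperparameters (constant term and degree); tune for your data
C, N = 1, 2

def predict(alpha_i, b, x):
    wx = 0
    for j in range(len(X)):
        wx += alpha_i[j] * polynomial(X[j], x, C, N)
    return wx + b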
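To see what the kernel is doing, the sketch below compares the polynomial kernel with c = 0 and n = 2 against an explicit mapping of 2-D points into a 3-D feature space. The mapping `feature_map` and the two example points are illustrative; the kernel gives the same inner product without ever building the mapped vectors.

```python
import numpy as np

def polynomial(x_i, x_j, c, n):
    return (np.dot(x_i, x_j) + c) ** n

def feature_map(x):
    # explicit mapping for 2-D inputs that matches the polynomial kernel with c=0, n=2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

kernel_value = polynomial(u, v, 0, 2)                      # (3 - 2) ** 2
explicit_value = np.dot(feature_map(u), feature_map(v))    # 9 - 12 + 4
print(kernel_value, explicit_value)
```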

Or using the RBF kernel
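A minimal sketch of what that could look like, assuming the common form $K (x_{i}, x_{j}) = e^{-γ ∥ x_{i} - x_{j} ∥^{2}}$. The parameter name `gamma`, the function names and the tiny two-point example are assumptions for illustration, and the training points are passed in explicitly so the block runs on its own.

```python
import numpy as np

def rbf(x_i, x_j, gamma):
    # similarity decays with the squared distance between the two points
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def predict_rbf(alpha_i, b, x, X, gamma):
    # same structure as predict above, with the kernel in place of the dot product
    wx = 0
    for j in range(len(X)):
        wx += alpha_i[j] * rbf(X[j], x, gamma)
    return wx + b

# tiny made-up example: one positive and one negative coefficient
X_demo = np.array([[0.0, 0.0], [2.0, 0.0]])
alpha_demo = np.array([1.0, -1.0])
score = predict_rbf(alpha_demo, 0.0, np.array([0.0, 0.0]), X_demo, gamma=1.0)
print(score)
```

A larger `gamma` makes each training point's influence more local, which allows a more flexible decision boundary.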

Conclusion

While other methods can solve this problem more effectively, I hope this simple algorithm helped you better understand SVMs!
