Pedro H Goncalves

Unsupervised Clustering with K-Means

In the past few weeks, I have been studying clustering and some of its models to apply in a project at the company I work for. When you study clustering, you quickly come across the centroid-based model, which leads you to K-Means, the most famous algorithm for this type of clustering. We will use K-Means for our activity.

As for how we will carry out the activity: we will use a dataset containing sales information from an unspecified company and cluster its customers based on some of their behaviors in the store. For this purpose, we will also use the RFM concept (recency, frequency, monetary), which is widely used by marketing teams.

It's important to remember that there are various other types of clustering and centroid-based clustering algorithms. This article specifically focuses on K-Means and a practical application of its algorithm.

In this article, we will not cover data transformation or data visualization. If you would like to give feedback on my code, you can visit the repository with the code used in this article, which includes some visualization and transformation steps not shown here.

K-Means

K-Means is an unsupervised algorithm, which means it does not require "labels" on the events, unlike supervised algorithms that need labels for training. Unsupervised algorithms are designed to learn from the data itself by autonomously identifying patterns (often not visible to the naked eye).

The goal of the algorithm is to generate K clusters (where K is defined by the scientist), minimizing the variance within each cluster and thereby increasing the similarity among the points assigned to the same cluster.

How it works

  1. The algorithm randomly places K centroids in the feature space.

  2. The distance from each point to each centroid is calculated, and the point is assigned to the centroid with the shortest distance (the distance is computed with the Euclidean distance formula).

  3. The position of each centroid is recalculated as the mean of the coordinates of the points assigned to it.

  4. Steps 2 and 3 are repeated until the centroid positions no longer change significantly or a certain number of iterations is reached.
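
To make these steps concrete, here is a minimal from-scratch sketch of the loop (names like kmeans_sketch are my own; the rest of the article uses scikit-learn's KMeans, which also adds a smarter initialization called k-means++):

import numpy as np

def kmeans_sketch(points: np.ndarray, k: int, max_iter: int = 100, tol: float = 1e-6):
    # Step 1: pick K random points as the initial centroids
    rng = np.random.default_rng(42)
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return labels, centroids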

Determining the ideal number of K

To determine the number K of clusters, we will use the elbow method, which is the most commonly used method for this task. We will also use the point-to-line distance calculation to refine the choice and better define our number of clusters.

The elbow method calculates the sum of squared distances between the points in a cluster and that cluster's centroid. Its goal is to minimize the total inertia (the variability of the points) of each cluster. The formula for this calculation is as follows:

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$

Where K is the number of clusters, C_k is the set of points assigned to cluster k, x is a point within that cluster, and μ_k is the centroid (the mean) of cluster k.
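
As a sanity check, the formula above is exactly what scikit-learn exposes as inertia_. A minimal sketch, assuming a fitted KMeans and a NumPy array of points (the toy data here is only for illustration):

import numpy as np
from sklearn.cluster import KMeans

points = np.random.default_rng(0).random((200, 2))  # toy data
kmeans = KMeans(n_clusters=3, n_init=10).fit(points)

# WCSS: sum of squared distances from each point to its cluster centroid
wcss = sum(
    np.sum((points[kmeans.labels_ == k] - center) ** 2)
    for k, center in enumerate(kmeans.cluster_centers_)
)
print(wcss, kmeans.inertia_)  # the two values match up to float precision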

The point-to-line distance calculation is the perpendicular distance of each point to a line defined by two points. It is used to find the number of clusters with the greatest homogeneity within each cluster and the greatest difference between clusters.

$$d = \frac{\lvert (y_1 - y_0)\,x - (x_1 - x_0)\,y + x_1 y_0 - y_1 x_0 \rvert}{\sqrt{(y_1 - y_0)^2 + (x_1 - x_0)^2}}$$

P0 = (x0, y0) is our first point and P1 = (x1, y1) is our last point on the elbow curve. Since we iterate over each candidate number of clusters, x and y are the coordinates of the point being evaluated (the number of clusters and its inertia).

We will start by defining two functions. calculateWcss iterates from 1 to 9 clusters (we don't want too many customer clusters in our dataset, and in practice this range is determined and tested with the data and business teams). It calculates the total inertia for each number of clusters and returns a list with the inertia for each cluster count.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def calculateWcss(data) -> list:
    wcss = []
    for k in range(1, 10):
        # Fit K-Means for each candidate number of clusters
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X=data)
        # Inertia is the within-cluster sum of squared distances (WCSS)
        wcss.append(kmeans.inertia_)
    return wcss

def plotFigure(quadraticSum: list, figsize: tuple):
    plt.figure(figsize=figsize)
    # Plot inertia against the number of clusters (1 to 9)
    plt.plot(range(1, 10), quadraticSum)
    plt.xlabel("Clusters")
    plt.ylabel("Inertia")
    plt.show()

dfRecencyModel = dfOrderCostumer[['recency']]  
quadraticSum = calculateWcss(dfRecencyModel)  
plotFigure(quadraticSum,(13,6))

Calling the calculateWcss function on the column of the dataset that represents the number of days since the last purchase, and plotting the result with the plotFigure function, we get the following:

[Figure: elbow curve of inertia for 1 to 9 clusters]

Interpreting this graph, we might think, "Well, the highest number of clusters is the best because it has the lowest inertia." That's not entirely incorrect, but not entirely correct either. As mentioned earlier, we don't want too many clusters, so we're looking for the point where the inertia stops decreasing drastically, always aiming for the fewest clusters.

Upon reevaluation, we could say that 2 and 3 are strong candidates. However, we will use the point-to-line distance calculation to settle on the number of clusters we will apply.

Let's define the distancePointLine function in code. For each candidate number of clusters, it calculates the perpendicular distance to the line through P0 and P1, which correspond to 1 and 9 clusters (the range defined in calculateWcss). It returns the number of clusters with the greatest perpendicular distance to that line, which is the elbow.

import math

def distancePointLine(wcss: list) -> int:
    # P0 and P1: the first and last points of the elbow curve (k = 1 and k = 9)
    x1, y1 = 1, wcss[0]
    x2, y2 = 9, wcss[-1]

    distance = []
    for i in range(len(wcss)):
        # The point for the current number of clusters (k = i + 1)
        x0 = i + 1
        y0 = wcss[i]
        # Perpendicular distance from (x0, y0) to the line through P0 and P1
        numerator = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        denominator = math.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
        distance.append(numerator / denominator)
    # The elbow: the cluster count with the greatest distance to the line
    return distance.index(max(distance)) + 1

Calling the function with the return of calculateWcss, we get 4 as the ideal number of clusters, and we will use it in the rest of the tasks.
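
For completeness, the call looks like this (idealK is a name I chose here):

idealK = distancePointLine(quadraticSum)
print(idealK)  # 4 on this dataset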

Clustering our dataset

In our dataset, we have information such as recency (which we used to determine the ideal number of clusters; it represents the number of days since the last purchase), frequency (the number of times a given customer has made purchases in our store), and monetary value (the amount the customer has spent in our store). Typically, people cluster using all the features (columns) together. However, we will perform separate clustering for each feature, specifically four clusters per feature.

Let's start by defining a function that takes the following parameters: a new column name for the cluster, the name of the feature used as the basis for clustering, the feature column separated from the DataFrame as a two-dimensional array, the DataFrame itself to which the clustering will be added, and whether the rating (the cluster a customer belongs to) should be in ascending or descending order.

We will use the cluster as the rating. Since the clusters go from 0 up to 3, the cluster rated 0 will represent the customers who have spent the least money or have been inactive the longest on the platform.

import pandas as pd
from pandas import DataFrame
from sklearn.cluster import KMeans

def orderCluster(clusterName:str, target_name:str, featureColumn:DataFrame, dfAppend:DataFrame, ascending:bool, nmrCluster:int = 4) -> DataFrame:
    kmeans = KMeans(n_clusters=nmrCluster)

    # Assign each row to one of the clusters
    dfUse = dfAppend
    dfUse[clusterName] = kmeans.fit_predict(featureColumn)

    # Rank the clusters by the mean of the target feature
    groupbyCluster = dfUse.groupby(clusterName)[target_name].mean().reset_index()
    groupbyCluster = groupbyCluster.sort_values(by=target_name, ascending=ascending).reset_index(drop=True)
    groupbyCluster['index'] = groupbyCluster.index
    groupbyCluster.drop(columns=[target_name], inplace=True)

    # Replace the raw K-Means labels with the ordered rating (0 = lowest)
    dfUsageMerged = pd.merge(dfUse, groupbyCluster, on=clusterName)
    dfUsageMerged.drop(columns=[clusterName], inplace=True)
    dfUsageMerged.rename(columns={"index": clusterName}, inplace=True)
    return dfUsageMerged

Now we will call the orderCluster function for each feature and merge the result back into dfMain (the DataFrame we obtained after reading the .csv file and applying some transformations), as sketched below.
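
The exact calls live in the repository; what follows is a plausible sketch, assuming the column names used next (recency, order_approved for frequency, agg_value for monetary value) and remembering that rating 0 must be the worst customers:

# Recency: fewer days since the last purchase is better, so sort descending
dfMain = orderCluster('recency_cluster', 'recency', dfMain[['recency']], dfMain, ascending=False)

# Frequency: more approved orders is better, so sort ascending
dfMain = orderCluster('frequency_cluster', 'order_approved', dfMain[['order_approved']], dfMain, ascending=True)

# Monetary value: higher total spending is better, so sort ascending
dfMain = orderCluster('revenue_cluster', 'agg_value', dfMain[['agg_value']], dfMain, ascending=True)

With the three cluster ratings in place, we can build the final DataFrame and score each customer: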

# Select the customer id plus each feature and its cluster rating
finalDataframe = dfMain[['id_unique_costumer','recency','recency_cluster','order_approved','frequency_cluster','agg_value','revenue_cluster']].copy()

# The overall score is the sum of the three cluster ratings (0 to 9)
finalDataframe['pontuation'] = finalDataframe['recency_cluster'] + finalDataframe['frequency_cluster'] + finalDataframe['revenue_cluster']

finalDataframe['segmentation'] = 'Inactive'

finalDataframe.loc[finalDataframe['pontuation'] > 1, 'segmentation'] = 'Business'
finalDataframe.loc[finalDataframe['pontuation'] > 3, 'segmentation'] = 'Master'
finalDataframe.loc[finalDataframe['pontuation'] > 5, 'segmentation'] = 'Premium'

We can then plot a graph to visualize the distribution of each segment, using the agg_value (amount of money spent) and recency (number of days since the last purchase) features.

Here's the function to plot the graph:

import seaborn as sns
import matplotlib.pyplot as plt

def plot_segment(x, y, data):
    sns.set(palette='muted', color_codes=True, style='whitegrid')
    # Color and size each point by its segment
    sns.scatterplot(x=x, y=y, hue='segmentation', data=data, size='segmentation',
                    sizes=(50, 150), size_order=['Premium','Master','Business','Inactive'])
    plt.show()

plot_segment('recency', 'agg_value', finalDataframe)

The plotted graph:

[Figure: scatter plot of recency vs. agg_value colored by segmentation]

With this graph, it becomes clear that our customers classified as Premium (obviously few) have spent higher amounts than the average and made more recent purchases, while the inactive ones have not spent much and haven't made purchases for some time. Based on this, our company can have more targeted communication by offering customized services to Premium customers and providing some type of discount coupon to encourage inactive customers to return and spend in our store.

Digging deeper into RFM

Let's further analyze our recency cluster with the following code.

finalDataframe.groupby('recency_cluster')['recency'].describe().reset_index()

[Table: describe() output of recency for each recency_cluster]

Now we know that customers belonging to cluster 0 have an average of 490 days since their last purchase, making them the cluster with the lowest recency.

The RFM concept creates more customer groups based on the clusters they belong to, taking into account the attributes of recency, frequency, and monetary value. For example, a customer who belongs to cluster 0 in recency, cluster 1 in frequency, and cluster 3 in monetary value means that they haven't made a recent purchase, have made a reasonable number of purchases in our store, and have spent a high amount of money. The RFM analysis would allocate this customer to a "dormant" or "hibernating" customer segment. We can implement this classification in our algorithm, but I present it here as a challenge. I recommend reading more about RFM and how to implement it in your business alongside unsupervised clustering.
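
As a starting point for that challenge, here is a hypothetical sketch of such a mapping (the segment names and rules below are my own illustration, not a standard; tune them with your business team):

def rfmSegment(recency: int, frequency: int, revenue: int) -> str:
    # Hypothetical rules combining the three cluster ratings (0 = worst, 3 = best)
    if recency == 0 and revenue >= 2:
        return 'Hibernating big spender'
    if recency >= 2 and frequency >= 2 and revenue >= 2:
        return 'Champion'
    if recency >= 2:
        return 'Recent customer'
    return 'At risk'

finalDataframe['rfm_segment'] = finalDataframe.apply(
    lambda row: rfmSegment(row['recency_cluster'], row['frequency_cluster'], row['revenue_cluster']),
    axis=1,
)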

Conclusion

In this article, we have learned how to determine the ideal number of clusters and how to cluster our customers based on the widely used RFM concept. If you would like to explore other models, data visualization, or data transformation, I suggest checking out my GitHub repository, where I frequently work on data engineering projects and related topics.

Thank you very much for reading.

Repository: https://github.com/pedrohgoncalvess/k-means-clustering

Top comments (2)

adriens: Very well explained, thanks a lot for sharing this storytelling <3

Pedro H Goncalves: thanks a lot for the feedback <3