
Maik Paixao

Customer segmentation using RFM and K-Means


Welcome! In this tutorial, we will walk through clustering bank customers based on their transaction behavior using the RFM (Recency, Frequency, Monetary) model and K-Means clustering in Python.

Prerequisites:

  • Python installed.
  • Familiarity with Python programming.
  • A basic understanding of clustering.

1. Install required libraries

Install the necessary packages for data processing and clustering.

pip install pandas scikit-learn

2. Import Libraries

Import the necessary modules.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

3. Load the bank data

The dataset has around 10,000 records. Each record represents a financial transaction carried out by a customer.

The main columns that stand out include client_id, transaction_date, and transaction_amount. The client_id is a unique identifier assigned to each client, ensuring data consistency and facilitating reference. Meanwhile, transaction_date records the timestamp of each transaction performed, serving as an essential marker for evaluating transaction patterns and behaviors over time.

The transaction_amount, on the other hand, is a numeric field that quantifies the monetary value associated with each transaction. This field has immense significance as it provides a direct window into understanding an individual's consumption habits, financial capabilities and, to some extent, economic stratum.
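
If you do not have the original CSV at hand, a small dataframe with the same three columns is enough to follow the rest of the tutorial in place of the read_csv call below. The values here are made up purely for illustration:

import pandas as pd

# Hypothetical sample with the same schema as banking_data.csv
data = pd.DataFrame({
    'client_id': [101, 101, 102, 103, 103],
    'transaction_date': pd.to_datetime([
        '2023-01-05', '2023-03-20', '2023-02-14', '2023-01-30', '2023-03-01'
    ]),
    'transaction_amount': [250.00, 80.50, 1200.00, 40.00, 310.75]
})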

We use the pandas read_csv() function to read the CSV file containing the transactions. Because the recency calculation in the next step works with dates, we also ask pandas to parse transaction_date as a datetime while loading.

# Load the transactions and parse the transaction dates as datetimes
data = pd.read_csv('banking_data.csv', parse_dates=['transaction_date'])
print(data.head())

4. Calculate RFM metrics

The acronym RFM stands for Recency, Frequency and Monetary Value, each representing a unique facet of a customer's transactional pattern.

Recency: This metric addresses the question of how long ago a customer engaged in a transaction. A shorter time since last transaction typically indicates a more active customer. In our dataset, this is calculated by subtracting the date of each transaction from the date of the last transaction in the dataset, resulting in the number of days elapsed.

Frequency: Representing the total number of transactions a customer has performed during a specific period, frequency offers information about how often customers interact with banking services. It is a direct measure of customer engagement and loyalty.

Monetary Value: This metric encapsulates the total amount a customer spent during a period. It is a reflection of customer value, with higher values indicating customers who bring more financial value.

# Recency: days between each transaction and the most recent transaction in the dataset
max_date = data['transaction_date'].max()
data['recency'] = (max_date - data['transaction_date']).dt.days

# Frequency & Monetary Value: aggregate per customer
rfm = data.groupby('client_id').agg(
    recency=('recency', 'min'),                 # days since the customer's last transaction
    frequency=('transaction_amount', 'count'),  # number of transactions
    monetary=('transaction_amount', 'sum')      # total amount spent
)

print(rfm.head())

5. Data preprocessing

Data preprocessing plays a key role in the analytical pipeline. Raw data often contains a wealth of valuable information, but it is also riddled with inconsistencies, discrepancies, and varying scales that, if not addressed, can skew the results of subsequent analysis.

In the context of our tutorial, which clusters bank customers using RFM metrics, preprocessing becomes especially crucial. Clustering algorithms like K-Means are sensitive to the scale of the data: different magnitudes between variables can disproportionately influence the algorithm, leading to misleading clusters.

To address this, the preprocessing phase employs StandardScaler from scikit-learn, a widely used Python library for data science. StandardScaler rescales each variable to have a mean of zero and a standard deviation of one. This ensures that all variables, whether Recency, Frequency or Monetary, contribute equally to the clustering process.

The transformed data, called rfm_scaled in our tutorial, represents standardized RFM values ready for clustering. In essence, the preprocessing phase acts as a bridge, converting raw, irregular data into a refined, standardized format, ensuring that the subsequent clustering algorithm works efficiently and provides accurate, interpretable clusters. This section highlights the adage, “Garbage in, garbage out,” emphasizing the importance of clean, standardized input data for quality results.

# Standardize each RFM column to zero mean and unit variance
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm)
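
As an optional sanity check, you can confirm that each standardized column now has (approximately) zero mean and unit standard deviation:

# Each column of the scaled matrix should be centered at 0 with a std of 1
print(rfm_scaled.mean(axis=0).round(6))
print(rfm_scaled.std(axis=0).round(6))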

6. Determine the number of clusters

This part of the tutorial addresses one of the most critical decisions in the clustering process: identifying the ideal number of clusters. Although clustering aims to segment data into distinct groups based on similarities, the number of these groups is not always evident in advance.

In our tutorial, the Elbow Method is the technique of choice for discerning this ideal number. This method involves plotting the sum of squared distances (often referred to as "inertia") for various cluster counts. As the number of clusters increases, inertia typically decreases, because each data point ends up closer to its nearest centroid. Beyond a certain point, however, adding more clusters no longer leads to a substantial decrease in inertia. This inflection point, resembling an "elbow" in the plotted curve, suggests an ideal number of clusters.

By employing the Elbow Method, our tutorial iteratively runs the K-Means algorithm over a range of cluster counts. By inspecting the resulting inertia values, analysts can discern the "elbow" and, with it, the suggested cluster count.

In summary, this section highlights the importance of selecting an appropriate cluster count, offering a systematic approach to making this critical decision, ensuring that the resulting clusters are meaningful, distinct and actionable.

# Fit K-Means for k = 1 to 10 and record the inertia for each k
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, n_init=10, random_state=0)
    km.fit(rfm_scaled)
    distortions.append(km.inertia_)
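
To actually see the elbow, a simple line plot of the inertia values works well. This sketch assumes matplotlib is installed (for example, via pip install matplotlib), since it is not among the packages installed in step 1:

import matplotlib.pyplot as plt

# Plot inertia against the number of clusters and look for the "elbow"
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()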

7. Apply K-Means clustering

The K-Means clustering algorithm works by partitioning data into distinct clusters. This is done by assigning each data point to the cluster whose centroid (or center) is closest. These centroids are iteratively recalculated until they stabilize, at which point the algorithm has converged on a clustering arrangement.

For our tutorial, the standardized RFM values derived in the Data Preprocessing phase serve as input. Leveraging Python's scikit-learn library, a KMeans object is instantiated with the number of clusters suggested by the elbow curve (three, in this example). It is then fitted with the fit_predict method on the scaled RFM values, and the resulting cluster labels are stored back in the original RFM dataframe, as in the following code:

# Cluster the standardized RFM values into three segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
rfm['cluster'] = kmeans.fit_predict(rfm_scaled)
print(rfm.head())
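
A quick way to check whether the segments are reasonably balanced is to count how many customers fell into each cluster:

# Number of customers assigned to each cluster
print(rfm['cluster'].value_counts().sort_index())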

8. Interpretation of Segments

Finally, we examine the centroids and characteristics of each generated cluster to understand the segments.

You can do this with the code below:

# Mean and standard deviation of each RFM metric, per cluster
cluster_summary = rfm.groupby('cluster').agg({
    'recency': ['mean', 'std'],
    'frequency': ['mean', 'std'],
    'monetary': ['mean', 'std']
})

print(cluster_summary)
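
Because K-Means was fitted on standardized values, the raw centroids are hard to read directly. One option, reusing the scaler and kmeans objects from the earlier steps, is to map the centroids back to the original RFM units:

# Convert centroids from standardized space back to the original RFM scale
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=['recency', 'frequency', 'monetary']
)
print(centroids)

Clusters with low recency, high frequency and high monetary value typically correspond to the most engaged, highest-value customers.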

In this tutorial, we segmented customers based on their transactional behavior using the RFM model and K-Means clustering. This segmentation enables targeted marketing strategies, personalized services, and better customer management.

Download dataset and Jupyter Notebook

Hi, I'm Maik. I hope you liked the article. If you have any questions or want to connect with me and access more content, follow my channels:

LinkedIn: https://www.linkedin.com/in/maikpaixao/
Twitter: https://twitter.com/maikpaixao
Youtube: https://www.youtube.com/@maikpaixao
Instagram: https://www.instagram.com/prof.maikpaixao/
Github: https://github.com/maikpaixao

Top comments (2)

Elida Maria

Can an algorithm other than K-Means be used with RFM?

Maik Paixao

I fixed the link, Elida. Please try accessing it.