Naman Srivastava

Posted on Oct 13

Mall Customer Segmentation using ML — A Step-by-Step Tutorial

#python #datascience #machinelearning #tutorial

🧠 Introduction

In this project, we segment customers using the well-known Mall Customers dataset. We'll apply KMeans clustering to group similar customers, train a Random Forest regressor to predict spending scores, and serve both models through a Streamlit app.

📊 Dataset Overview

We use the Mall_Customers.csv dataset from Kaggle. It contains information about 200 customers:

Column	Description
CustomerID	Unique identifier
Genre	Customer's gender (Male / Female)
Age	Customer's age in years
Annual Income (k$)	Annual income in thousands of dollars
Spending Score (1-100)	Score from 1 to 100 assigned by the mall
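
Before modeling, it can help to confirm that the file really has the columns listed above and to count missing values. The sketch below uses three made-up rows in the same layout rather than the real file, and it assumes the header spells the last column with a plain ASCII hyphen, `Spending Score (1-100)`. Adjust the `expected` list if your copy of the CSV differs.

```python
import io

import pandas as pd

# Made-up rows in the same layout as Mall_Customers.csv (illustrative only).
sample_csv = io.StringIO(
    "CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,25,30,45\n"
    "2,Female,40,80,70\n"
    "3,Female,33,55,\n"  # last value deliberately left empty
)
sample = pd.read_csv(sample_csv)
sample.columns = sample.columns.str.strip()

# Column names the rest of the tutorial relies on (assumed spelling).
expected = ["CustomerID", "Genre", "Age", "Annual Income (k$)", "Spending Score (1-100)"]
missing_columns = [col for col in expected if col not in sample.columns]
missing_values = sample.isna().sum()

print("Missing columns:", missing_columns)
print(missing_values[missing_values > 0])
```

Running the same two checks on the real DataFrame right after `pd.read_csv` catches header mismatches early, before they surface as a `KeyError` in a later step.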

⚙️ Step 1: Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import joblib

🧹 Step 2: Load and Preprocess Data

# Load dataset
df = pd.read_csv('Mall_Customers.csv')

# Clean columns
df.columns = df.columns.str.strip()

# Encode Gender
le = LabelEncoder()
df['Genre'] = le.fit_transform(df['Genre'])  # Female=0, Male=1

# Drop CustomerID
X = df.drop(['CustomerID'], axis=1)
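
The `Female=0, Male=1` comment follows from how `LabelEncoder` works: it sorts the distinct labels and numbers them in that order. A minimal sketch on a toy list (not the real dataset) shows how to read the mapping back from the fitted encoder instead of assuming it:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Genre' column.
toy_labels = ["Male", "Female", "Female", "Male"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(toy_labels)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}
```

The Streamlit app later hard-codes this same mapping, so it is worth checking `le.classes_` once after fitting on the real column.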

🧩 Step 3: Feature Scaling

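import os

scaler = StandardScaler()
# The Kaggle CSV header uses a plain ASCII hyphen in 'Spending Score (1-100)'
X_scaled = scaler.fit_transform(X[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']])

# Save scaler (create the Models directory first so joblib.dump does not fail)
os.makedirs('./Models', exist_ok=True)
joblib.dump(scaler, './Models/scaler.pkl')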
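
To make the scaling step concrete: `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks that against a manual NumPy calculation on a small made-up array, and shows that `inverse_transform` recovers the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two columns on very different scales.
values = np.array([[20.0, 15.0], [35.0, 60.0], [50.0, 120.0]])

demo_scaler = StandardScaler().fit(values)
scaled = demo_scaler.transform(values)

# Manual z-score; NumPy's std defaults to ddof=0, matching StandardScaler.
manual = (values - values.mean(axis=0)) / values.std(axis=0)
print(np.allclose(scaled, manual))

# inverse_transform maps scaled values back to the original units.
restored = demo_scaler.inverse_transform(scaled)
print(np.allclose(restored, values))
```

Scaling matters here because KMeans relies on Euclidean distance, so an unscaled income column would otherwise dominate the age and spending score columns.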

🌀 Step 4: KMeans Clustering

# Find optimal number of clusters using Elbow Method
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

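# Train KMeans with the k read off the elbow plot (5 here);
# n_init=10 keeps results consistent across scikit-learn versions
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Save the model (the file is named classifier.pkl, but it holds the KMeans clusterer)
joblib.dump(kmeans, './Models/classifier.pkl')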
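
The elbow in an inertia plot is sometimes ambiguous. One complementary check, which the original workflow does not use, is the silhouette score: it ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. The sketch below runs on synthetic blobs from `make_blobs` as a stand-in for `X_scaled`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled (not the mall data).
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)

# Silhouette needs at least 2 clusters, so start at k=2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```

If the silhouette peak and the elbow disagree on the real data, plotting the clusters for both candidate values of k is a reasonable way to decide.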

📈 Step 5: Visualize Clusters

plt.figure(figsize=(8, 6))
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()
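
A scatter plot shows where the clusters sit, but a numeric profile makes the segments easier to describe. The sketch below is one way to do that, using randomly generated data and 3 clusters for brevity: it converts the KMeans centroids back to original units with `inverse_transform` and computes per-cluster means and sizes. The column names mirror the tutorial's and are an assumption about your DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in data (not the mall dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 71, size=60),
    "Annual Income (k$)": rng.integers(15, 138, size=60),
    "Spending Score (1-100)": rng.integers(1, 101, size=60),
})
features = list(demo.columns)

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(demo[features])
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo["Cluster"] = demo_kmeans.fit_predict(scaled)

# Centroids live in scaled space; map them back to years, k$, and score points.
centers = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_), columns=features
)
# Per-cluster averages and sizes computed directly from the labeled rows.
profile = demo.groupby("Cluster")[features].mean()
sizes = demo["Cluster"].value_counts().sort_index()

print(centers.round(1))
print(profile.round(1))
print(sizes)
```

On the real data, the same `groupby` on `df['Cluster']` gives a table you can use to name each segment, for example by its average income and spending level.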

🧮 Step 6: Predicting Spending Score (Regression)

# Features and target
X_rf = df[['Genre', 'Age', 'Annual Income (k$)']]
y_rf = df['Spending Score (1-100)']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_rf, y_rf, test_size=0.2, random_state=42)

# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
preds = rf.predict(X_test)
print(f'R² Score: {r2_score(y_test, preds):.2f}')

# Save model
joblib.dump(rf, './Models/Spending_Score.pkl')
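
A single R² value from one 80/20 split can be noisy on 200 rows, and it is hard to judge without a reference point. The sketch below shows one way to add context: 5-fold cross-validation compared against a `DummyRegressor` that always predicts the mean. It uses random data with a target that has no real signal, so the numbers say nothing about the mall dataset; swap in `X_rf` and `y_rf` to evaluate the actual model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Random stand-in features: encoded gender, age, income.
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.integers(0, 2, size=200),
    rng.integers(18, 71, size=200),
    rng.integers(15, 138, size=200),
])
y_demo = rng.integers(1, 101, size=200)  # random target, unrelated to X_demo

baseline = DummyRegressor(strategy="mean")
forest = RandomForestRegressor(n_estimators=100, random_state=42)

# One R² score per fold for each model.
baseline_r2 = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="r2")
forest_r2 = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="r2")

print(f"Baseline R²: {baseline_r2.mean():.2f} (+/- {baseline_r2.std():.2f})")
print(f"Forest R²:   {forest_r2.mean():.2f} (+/- {forest_r2.std():.2f})")
```

A mean predictor never scores above 0 on R², so a forest score near or below 0 means the features carry little information about the target.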

🌐 Step 7: Streamlit App for Deployment

Create a new file named app.py:

import streamlit as st
import joblib
import numpy as np

# Load models
scaler = joblib.load('./Models/scaler.pkl')
kmeans = joblib.load('./Models/classifier.pkl')
rf = joblib.load('./Models/Spending_Score.pkl')

st.title('🛍️ Mall Customer Segmentation')

# Inputs
gender = st.selectbox('Gender', ['Female', 'Male'])
age = st.number_input('Age', 18, 70)
income = st.number_input('Annual Income (k$)', 10, 150)

# Encode gender
gender_encoded = 0 if gender == 'Female' else 1

# Predict Spending Score
predicted_score = rf.predict([[gender_encoded, age, income]])[0]

# Predict Cluster
scaled_features = scaler.transform([[age, income, predicted_score]])
cluster = kmeans.predict(scaled_features)[0]

st.subheader(f'💰 Predicted Spending Score: {predicted_score:.2f}')
st.subheader(f'📊 Predicted Cluster: {cluster}')
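
The prediction logic in the app can also live in a plain function, which makes it testable without starting Streamlit. The sketch below is one possible arrangement, not the author's code: `predict_segment` is a hypothetical helper, the stand-in models are fitted on random data, and the column names are assumed to match those used in training. Passing DataFrames with those names avoids scikit-learn's feature-name warning for models fitted on DataFrames.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Assumed to match the training columns from the earlier steps.
RF_COLS = ["Genre", "Age", "Annual Income (k$)"]
KM_COLS = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]


def predict_segment(gender, age, income, rf, scaler, kmeans):
    """Hypothetical helper: predict a spending score, then the cluster it falls in."""
    gender_encoded = 0 if gender == "Female" else 1
    rf_input = pd.DataFrame([[gender_encoded, age, income]], columns=RF_COLS)
    score = float(rf.predict(rf_input)[0])
    km_input = pd.DataFrame([[age, income, score]], columns=KM_COLS)
    cluster = int(kmeans.predict(scaler.transform(km_input))[0])
    return score, cluster


# Stand-in models fitted on random data, replacing the joblib.load calls.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Genre": rng.integers(0, 2, size=80),
    "Age": rng.integers(18, 71, size=80),
    "Annual Income (k$)": rng.integers(15, 138, size=80),
    "Spending Score (1-100)": rng.integers(1, 101, size=80),
})
demo_rf = RandomForestRegressor(n_estimators=10, random_state=42)
demo_rf.fit(demo[RF_COLS], demo["Spending Score (1-100)"])
demo_scaler = StandardScaler().fit(demo[KM_COLS])
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo_kmeans.fit(demo_scaler.transform(demo[KM_COLS]))

score, cluster = predict_segment("Female", 30, 60, demo_rf, demo_scaler, demo_kmeans)
print(f"Predicted score: {score:.2f}, cluster: {cluster}")
```

In app.py, the function would be called with the loaded models and the widget values. Note that the cluster here is based on a predicted spending score rather than an observed one, so any regression error carries over into the cluster assignment.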

🪄 Step 8: Run the Streamlit App

streamlit run app.py

Then open the provided localhost link to access your interactive dashboard.

🎯 Conclusion

You’ve built a complete machine learning project and served it through a Streamlit app!

What You Learned

How to preprocess and scale data
How to perform KMeans clustering for segmentation
How to train a Random Forest for regression
How to deploy with Streamlit for interactivity

💻 Author: Naman Srivastava
📅 Date: October 2025
🔗 GitHub: Mall Customer Segmentation Project
