
Part 5: Building Your Own AI - Exploring Unsupervised Learning and Clustering

Author: Trix Cyrus

Try my Waymap pentesting tool: Click Here
TrixSec GitHub: Click Here
TrixSec Telegram: Click Here


Unsupervised learning offers powerful techniques for extracting insights from unlabeled data, making it essential for discovering hidden patterns and relationships. In this article, we’ll focus on clustering algorithms such as K-Means and hierarchical clustering and introduce dimensionality reduction techniques like Principal Component Analysis (PCA). Real-world applications, such as customer segmentation and anomaly detection, will demonstrate the practical utility of these methods.


1. What Is Unsupervised Learning?

  • Definition: Learning patterns from data without pre-existing labels.
  • Objective: Group or structure data in meaningful ways, revealing intrinsic structures.
  • Applications:
    • Market segmentation.
    • Fraud detection.
    • Recommendation systems.

2. Clustering Algorithms

a. K-Means Clustering

  • How It Works:
    1. Select the number of clusters (k).
    2. Randomly initialize cluster centroids.
    3. Assign data points to the nearest centroid.
    4. Recalculate centroids based on assignments.
    5. Repeat until convergence.
  • Example Use Case: Grouping customers based on purchasing behavior.
  • Advantages: Simple, fast, scalable.
  • Limitations: Requires pre-defining k; sensitive to outliers. (One common way to choose k, the elbow method, is sketched after the code example below.)

Code Example:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42, cluster_std=1.0)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

# Visualize clusters
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels, palette='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', label='Centroids')
plt.legend()
plt.show()
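
Since K-Means requires choosing k up front, a common heuristic is the elbow method: fit the model for a range of k values and look for the point where inertia (the within-cluster sum of squares) stops dropping sharply. A minimal sketch, reusing the synthetic data from above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as the example above
X, _ = make_blobs(n_samples=300, centers=4, random_state=42, cluster_std=1.0)

# Fit K-Means for k = 1..9 and record the inertia of each fit
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# The bend (the "elbow") in the curve, here around k=4, suggests a reasonable k
plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Choosing k')
plt.show()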

b. Hierarchical Clustering

  • How It Works:
    • Creates a tree-like structure (dendrogram) to represent data groupings.
    • Two approaches:
      • Agglomerative: bottom-up, merging clusters.
      • Divisive: top-down, splitting clusters.
  • Example Use Case: Gene expression analysis in bioinformatics.
  • Advantages: No need to pre-define the number of clusters.
  • Limitations: Computationally expensive for large datasets.

Code Example:

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=150, centers=3, random_state=42, cluster_std=1.2)

# Apply hierarchical clustering
linked = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, truncate_mode='lastp', p=10, leaf_rotation=90, leaf_font_size=10)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
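
The dendrogram shows the merge structure, but you often also want flat cluster labels. A minimal sketch using SciPy's fcluster to cut the linkage computed above into a fixed number of clusters:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Same synthetic data and linkage as the example above
X, _ = make_blobs(n_samples=150, centers=3, random_state=42, cluster_std=1.2)
linked = linkage(X, method='ward')

# Cut the tree into 3 flat clusters
labels = fcluster(linked, t=3, criterion='maxclust')

# Visualize the resulting assignments
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Flat Clusters Cut from the Dendrogram')
plt.show()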

3. Dimensionality Reduction

a. Principal Component Analysis (PCA)

  • Purpose: Reduce the number of dimensions while retaining most of the data’s variability.
  • How It Works:
    • Identifies principal components (orthogonal vectors) capturing maximum variance.
    • Projects data onto these components.
  • Example Use Case: Visualizing high-dimensional data in 2D or 3D.
  • Advantages: Reduces noise and improves computational efficiency.
  • Limitations: May lose interpretability of original features.

Code Example:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot PCA results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.show()
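
To verify how much variability the 2D projection actually retains, inspect explained_variance_ratio_ after fitting; for Iris, the first two components capture most of the variance. A quick, self-contained check:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Fit PCA on the Iris features
X = load_iris().data
pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)

# Total variance retained by the 2D projection
print(pca.explained_variance_ratio_.sum())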

4. Real-World Applications

a. Customer Segmentation

  • Goal: Group customers based on behavior, demographics, or preferences.
  • Approach:
    • Use K-Means to cluster purchase data.
    • Visualize clusters for insights (see the sketch below).
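
A minimal sketch of this workflow. The purchase features here (annual spend and monthly visits) are hypothetical, randomly generated stand-ins for real customer data, and they are standardized first because K-Means is distance-based:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical purchase data: columns are [annual_spend, visits_per_month]
rng = np.random.default_rng(42)
customers = rng.normal(loc=[500, 4], scale=[200, 2], size=(200, 2))

# Standardize so spend (large values) doesn't dominate the distance metric
scaled = StandardScaler().fit_transform(customers)

# Cluster into 3 illustrative segments
segments = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)

# Inspect segments in the original units
plt.scatter(customers[:, 0], customers[:, 1], c=segments, cmap='viridis')
plt.xlabel('Annual spend')
plt.ylabel('Visits per month')
plt.title('Hypothetical Customer Segments')
plt.show()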

b. Anomaly Detection

  • Goal: Identify outliers or unusual patterns, such as fraudulent transactions.
  • Approach:
    • Use clustering to find normal data patterns.
    • Points far from cluster centroids are flagged as anomalies (sketched below).
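
A minimal sketch of the centroid-distance idea on synthetic data: fit K-Means, measure each point's distance to its assigned centroid, and flag the farthest points. The 2% cutoff is purely illustrative, not a recommended threshold:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "normal" data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42, cluster_std=1.0)

# Fit K-Means and compute each point's distance to its assigned centroid
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the 2% most distant points as anomalies (illustrative threshold)
threshold = np.percentile(distances, 98)
anomalies = distances > threshold

plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.5, label='Normal')
plt.scatter(X[anomalies, 0], X[anomalies, 1], c='red', label='Flagged anomaly')
plt.legend()
plt.title('Anomalies by Distance from Cluster Centroids')
plt.show()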

~Trixsec
