DBSCAN: Finding Cluster of any shape

#datastructures #machinelearning

Exploring how DBSCAN uses density, not distance, to find clusters of any shape from theory and code to real-world applications like anomaly detection and geospatial mapping.

Before diving into the code or real-world applications, it’s essential to understand what DBSCAN actually is and why it stands apart from traditional clustering techniques.

What is DBSCAN

At its core, DBSCAN short for Density-Based Spatial Clustering of Applications with Noise is a powerful algorithm that clusters data not based on shape or center points, but on the density of data points in a region. Let’s break down the key concepts that make DBSCAN unique.

Density-Based Clustering : Instead of grouping points based of distance from a center(like K-MEANS). This algorithm groups points based on how crowded a region is
No need for predefined number of clusters : Unlike k-means which needs to figure out the number of clusters. DBSCAN is able to get the different clusters based on density of points in a region

It is like traversing through all the points noting the number of times you encountered regions of high density

Identifies Noise and Outliers : DBSCAN is able to naturally identifies points that do not belong to any cluster like isolated data or anomalies . these are labelled as noise and hence a good tool for anomaly detection
It can find clusters of any shape : It is not limited to circular or spherical clusters. It is able to detect clusters of any shape like spirals and other complex irregular shapes
Core, Border, and Noise Points : DBSCAN classifies points into 3 types

Types of Points in DBSCAN

Core points : Have enough neighbors nearby (defined by ε and MinPts). ε is a defined distance used to determine the neighbors of a given point and MinPts is the minimum number of neighbor a core point must have to form a dense region
Border points : Close to a core point but not dense enough on their own.
Noise points : Points that are not close to any dense region.

Code implementation

1.using numpy

ALGORITHM According to this algorithm a cluster is a continious region of high density(contains certain number of points whose distant from core point is less that a certain number ε

-For each instance(point) counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε- neighborhood.
-If an instance has at least MinPts instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.
-All instances in the same neighborhood will be assigned to the same cluster. this neighborhood may contain points that are core points to other neighborhoods
-Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.

1.Import numpy

#import libraries
import numpy as np

2.create a DBSCAN Class(remember Object-Oriented Programming)

#class
class DBSCAN:
     #constructor function
    def __init__(self, eps, MinPts):
        self.eps = eps
        self.MinPts = MinPts
    #define the fit function

define methods for this class

1.fit function:

Initialize labels for all points to noise (-1).
Iterate over all points. If a point is already labeled, skip it.
Find neighbors within epsilon distance using the get_neighbors method.
If a point is not a core point (i.e., it has fewer than min_samples neighbors), label it as noise.
Otherwise, label the point as part of a new cluster and expand the cluster using the expand_cluster method.

def fit(self, data):
        # Initialize labels for all points to noise (-1)
        labels = np.full(data.shape[0], -1)#np.full creates an array of given shape and fill it with a certain value np.full(shape,value)
        cluster_id = 0
        for i in range(data.shape[0]):
            # If point is already labeled, skip it
            if labels[i] != -1:
                continue
            # Find neighbors within epsilon distance
            neighbors = self.get_neighbors(data, i, self.eps)
            # If point is not a core point, label it as noise
            if len(neighbors) < self.min_samples:
                labels[i] = -1
            else:
                # Label point as part of a new cluster
                labels = self.expand_cluster(data, labels, i, neighbors, cluster_id, self.eps, self.min_samples)
                cluster_id += 1
        return labels

2.get_neighbors method

Calculate the Euclidean distance between a point and all other points.
Return the indices of points within epsilon distance.
expand_cluster method:
- Label a point as part of a cluster.
- Iterate over all neighbors of the point. If a neighbor is noise or unlabeled, label it as part of the cluster.
- Find neighbors of each neighbor. If a neighbor is a core point, add its neighbors to the list.

def get_neighbors(self, data, index, eps):
        # Calculate Euclidean distance between point and all other points
        distances = np.linalg.norm(data - data[index], axis=1)
        # Return indices of points within epsilon distance
        return np.where(distances <= eps)[0]
    def expand_cluster(self, data, labels, index, neighbors, cluster_id, eps, min_samples):
        # Label point as part of the cluster
        labels[index] = cluster_id
        # Iterate over all neighbors
        for i in neighbors:
            # If neighbor is noise or unlabeled, label it as part of the cluster
            if labels[i] == -1:
                labels[i] = cluster_id
                # Find neighbors of the neighbor
                new_neighbors = self.get_neighbors(data, i, eps)
                # If neighbor is a core point, add its neighbors to the list
                if len(new_neighbors) >= min_samples:
                    neighbors = np.concatenate((neighbors, new_neighbors))
    return labels

use case

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

#Generate sample data
data, _ = make_moons(n_samples=200, noise=0.05)

#Create DBSCAN instance
dbscan = DBSCAN(eps=0.3, min_samples=10)

#Fit DBSCAN to data
labels = dbscan.fit(data)

#Plot clusters
plt.scatter(data[:, 0], data[:, 1], c=labels)
plt.show()

DEV Community

DBSCAN: Finding Cluster of any shape

What is DBSCAN

Types of Points in DBSCAN

Code implementation

1.using numpy

Top comments (0)