Machine Learning Text Clustering with DBSCAN

#productivity #webdev #tooling

Introduction

Unsupervised Machine Learning (UML) is a type of machine learning where the goal is to discover patterns or relationships in data without any prior information or supervision. This is in contrast to supervised machine learning, where the goal is to learn from labeled data to make predictions about new, unseen data.

One common task in UML is text clustering, which is the process of grouping similar documents or text data together. One algorithm that can be used for text clustering is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

DBSCAN is a density-based clustering algorithm that groups together points that are closely packed together (i.e., high density) and separates points that are far apart (i.e., low density). The algorithm works by defining a neighborhood around each data point, and if a sufficient number of data points fall within that neighborhood, the point is designated as a core point. If a point is not a core point, but is close to one, it is designated as a border point. Any point that is not a core point or a border point is considered noise.

The key parameters of the DBSCAN algorithm are the neighborhood radius (the "eps" parameter), the minimum number of points that must fall within that radius for a point to count as a core point (the "min_samples" parameter), and the distance metric used to measure the neighborhood around each point.
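The core, border, and noise definitions above can be sketched in a few lines of plain Python. This is only an illustration of the definitions, not how scikit-learn implements DBSCAN; the function name classify_points and the toy points are made up for the example, and each point is counted as part of its own neighborhood.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions."""
    # Neighborhood of a point: indices of all points within eps of it,
    # including the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbors]

    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but within eps of a core point
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

# Toy 2D data, invented for illustration
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.3, 0.0), (5.0, 5.0)]
kinds = classify_points(points, eps=0.25, min_samples=3)
print(kinds)  # ['core', 'core', 'core', 'border', 'noise']
```

The fourth point has only two points in its neighborhood, so it is not a core point, but it lies within eps of a core point and becomes a border point. The last point has no neighbors other than itself and is noise.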

Text clustering plays an important role in Natural Language Processing when one wants to sort texts into groups with similar themes, for example to prepare data for model training.

The following libraries were used in the process:

Matplotlib
scikit-learn
Pandas
NumPy

The process

Step 01: Importing modules and libraries

In this step, we import the necessary modules and libraries that will be used throughout the clustering process. The warnings module is imported to handle any potential warning messages, which are then filtered to be ignored. We also import the DBSCAN class from the sklearn.cluster module, which is the implementation of the DBSCAN algorithm. Additionally, we import the TfidfVectorizer class from sklearn.feature_extraction.text module, which will be used to convert the text data into numerical feature vectors. Furthermore, we import matplotlib.pyplot as plt to visualize the clusters, and import numpy as np and pandas as pd for data manipulation and analysis.

import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Step 02: Loading training data source

In this step, we load the training data from a CSV file named "data.csv". The data is read into a Pandas DataFrame called data, which allows us to easily work with and analyze the data. We extract the "comments" column from the DataFrame and assign it to the variable texts, as this will be the text data we will be clustering.

data = pd.read_csv("data.csv")
texts = data["comments"]
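Real comment columns often contain missing or empty entries, and missing values generally need to be removed before vectorization. The sketch below shows one possible cleanup, assuming a small in-memory CSV in place of data.csv; the id column, the sample rows, and the names sample_data and clean_texts are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv; row 2 has no comment
csv_text = "id,comments\n1,Great product\n2,\n3,Arrived late\n"
sample_data = pd.read_csv(io.StringIO(csv_text))

# Drop missing comments, cast to str, and strip surrounding whitespace
clean_texts = sample_data["comments"].dropna().astype(str).str.strip()
# Remove entries that are empty after stripping and renumber the rows
clean_texts = clean_texts[clean_texts != ""].reset_index(drop=True)

print(clean_texts.tolist())  # ['Great product', 'Arrived late']
```

Resetting the index keeps row positions aligned with the cluster labels produced later, since those labels are returned in the order the texts were passed in.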

Step 03: Vectorizing the training data

To apply the DBSCAN algorithm to text data, we need to convert the textual comments into numerical feature vectors. In this step, we create an instance of the TfidfVectorizer class called vectorizer. This class implements the TF-IDF (Term Frequency-Inverse Document Frequency) transformation, which is a commonly used technique for converting text data into numerical representations. We then use the fit_transform method of the vectorizer object to transform the texts data into a matrix of TF-IDF features. The resulting matrix is assigned to the variable vectors.

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

Step 04: Training the data on the DBSCAN algorithm

In this step, we create an instance of the DBSCAN class called dbscan. DBSCAN is a density-based clustering algorithm that groups together data points that are close to each other based on a distance metric (Euclidean by default in scikit-learn). We specify the parameters for the DBSCAN object: eps (the maximum distance between two samples for one to be considered in the neighborhood of the other) and min_samples (the number of samples, including the point itself, that must lie within a point's neighborhood for it to count as a core point). We then call the fit method on the dbscan object, passing in the vectors matrix as input. This performs the clustering and assigns a cluster label to each data point.

dbscan = DBSCAN(eps=1.0, min_samples=5)
dbscan.fit(vectors)
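The value of eps usually needs tuning for each dataset. One common heuristic is to compute, for every point, the distance to its k-th nearest neighbor (with k equal to min_samples), sort those distances, and look for the point where the curve bends sharply. The sketch below computes the sorted distances on a made-up corpus; the helper name k_distances and the toy texts are assumptions for illustration, and the heuristic is a starting point, not a guarantee of good clusters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def k_distances(matrix, k):
    """Sorted distance from each row to its k-th nearest neighbor, counting the row itself."""
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(matrix)
    distances, _ = nn.kneighbors(matrix)
    # The last column holds the distance to the k-th nearest neighbor
    return np.sort(distances[:, -1])

# Made-up comments, used only for illustration
toy_texts = [
    "the delivery was fast",
    "fast delivery and great service",
    "delivery was fast and cheap",
    "the app keeps crashing",
    "app crashing after the update",
    "where is the nearest train station",
]
toy_vectors = TfidfVectorizer().fit_transform(toy_texts)

kd = k_distances(toy_vectors, k=3)
print(np.round(kd, 2))
```

Plotting kd with plt.plot(kd) gives the k-distance curve. Values of eps near the bend tend to separate points in dense regions from isolated ones.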

Step 05: Collecting data labels and coordinates for plotting

After training the DBSCAN algorithm, we can collect the cluster labels assigned to each data point and the corresponding coordinates of the data points. We assign the cluster labels to the variable cluster_labels using the labels_ attribute of the dbscan object. The labels_ attribute contains the cluster assignments for each data point, where noise points are assigned the label -1. We convert the vectors matrix to a dense NumPy array using the toarray method, and assign the result to the variable coords. This will be used for visualizing the clusters.

cluster_labels = dbscan.labels_
coords = vectors.toarray()

Step 06: Collecting clusters information

In this step, we gather some information about the clusters generated by the DBSCAN algorithm. We calculate the number of clusters by counting the distinct labels in the cluster_labels array and subtracting one if the noise label -1 is present, since -1 marks noise points rather than a cluster. The total number of noise points is calculated by counting the occurrences of -1 in the cluster_labels array using the np.sum function. Finally, we print the estimated number of clusters and the estimated number of noise points based on the clustering results.

# Count distinct cluster labels, excluding the noise label -1
no_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
# Count the points labeled as noise
no_noise = int(np.sum(cluster_labels == -1))

print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)
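Beyond the totals, it is often useful to see how many points landed in each cluster. A minimal sketch using only the standard library follows; the helper cluster_sizes and the example labels are hypothetical, and in the tutorial you would pass cluster_labels instead.

```python
from collections import Counter

def cluster_sizes(labels):
    """Return ({cluster_label: size}, noise_count) for DBSCAN-style labels."""
    counts = Counter(int(label) for label in labels)
    # -1 is the noise label, so it is reported separately
    noise = counts.pop(-1, 0)
    return dict(sorted(counts.items())), noise

# Hypothetical labels, for illustration only
example_labels = [0, 0, 1, -1, 0, 1, -1, 2]
sizes, noise = cluster_sizes(example_labels)
print(sizes)  # {0: 3, 1: 2, 2: 1}
print(noise)  # 2
```

A result with one very large cluster and few noise points can be a sign that eps is too large, while mostly noise can be a sign that eps is too small or min_samples too high.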

Step 07: Visualizing the clusters

To visually explore the clusters generated by the DBSCAN algorithm, we create a scatter plot. A scatter plot needs two coordinates per point, so we take the first two columns of coords as x and y. These columns correspond to only two vocabulary terms, so a dimensionality-reduction step would usually give a more informative picture. We assign colors to the points based on their cluster labels using a lambda function and the map function. Noise points (cluster label -1) are assigned the red color '#b40426', while points belonging to any cluster are assigned the blue color '#3b4cc0'. The scatter plot is created using the scatter function from matplotlib.pyplot, where the x and y coordinates are provided. The c parameter is set to the list of colors, and the marker parameter is set to "o" for circular markers. The plot is then displayed using plt.show().

# Use the first two columns of coords as plotting coordinates
x, y = coords[:, 0], coords[:, 1]
# Red for noise points (label -1), blue for points assigned to any cluster
colors = list(map(lambda label: '#b40426' if label == -1 else '#3b4cc0', cluster_labels))
plt.scatter(x, y, c=colors, marker="o", picker=True)
plt.show()
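The scatter plot needs two coordinates per document, while the TF-IDF matrix has one column per vocabulary term. One way to obtain x and y is to project the matrix down to two dimensions. The sketch below uses scikit-learn's TruncatedSVD for this, which is a technique swapped in here for illustration and not part of the original steps; the toy corpus and the name coords_2d are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up comments, used only for illustration
toy_texts = [
    "the delivery was fast",
    "fast delivery and great service",
    "delivery was fast and cheap",
    "the app keeps crashing",
    "app crashing after the update",
    "where is the nearest train station",
]
toy_vectors = TfidfVectorizer().fit_transform(toy_texts)

# Project the sparse TF-IDF matrix to 2 dimensions without densifying it
svd = TruncatedSVD(n_components=2, random_state=0)
coords_2d = svd.fit_transform(toy_vectors)

# One (x, y) pair per document, ready for plt.scatter(x, y, ...)
x, y = coords_2d[:, 0], coords_2d[:, 1]
print(coords_2d.shape)  # (6, 2)
```

TruncatedSVD accepts sparse input directly, so it avoids the memory cost of converting a large TF-IDF matrix to a dense array just for plotting. The projection is lossy, so points that overlap in the plot may still be far apart in the original space.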

Conclusion

By following the step-by-step process outlined in the code, users can effectively cluster text data and gain valuable insights into underlying patterns and structures. The code covers key aspects such as data loading, vectorization using TF-IDF, training the DBSCAN algorithm, collecting cluster labels and coordinates, and visualizing the resulting clusters.

DBSCAN identifies dense regions in the data, handles clusters of arbitrary shape, and labels sparse points as noise instead of forcing them into a cluster. The visualization makes the clusters easier to interpret and supports further analysis of the text data.

That's all, I hope this was helpful and happy coding.

Do you have a project 🚀 that you would like my help with? Email me 🤝😊: wilbertmisingo@gmail.com