<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Purity-Nyagweth</title>
    <description>The latest articles on DEV Community by Purity-Nyagweth (@puritye).</description>
    <link>https://dev.to/puritye</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F803716%2Fabd99e45-7da8-40a9-a166-e8b607bca2fc.jpeg</url>
      <title>DEV Community: Purity-Nyagweth</title>
      <link>https://dev.to/puritye</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/puritye"/>
    <language>en</language>
    <item>
      <title>Stemming vs Lemmatization - What is the difference?</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Thu, 10 Nov 2022 07:50:49 +0000</pubDate>
      <link>https://dev.to/puritye/stemming-vs-lemmatization-what-is-the-difference-213j</link>
      <guid>https://dev.to/puritye/stemming-vs-lemmatization-what-is-the-difference-213j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Stemming and lemmatization are techniques used in text processing. In Natural Language Processing (NLP), text processing is needed to normalize the text. The aim of text normalization is to reduce the amount of information that a machine has to handle, thus improving the efficiency of the machine learning process. &lt;br&gt;
Both stemming and lemmatization involve reducing the inflected forms of words to their root forms. Inflected forms of words are words that are derived from the root or base form of a word. For example, the words jumped, jumping and jumps are inflected forms of the root word jump. Likewise, creating, created and creates are inflected forms of the root word create, and so on.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Basic knowledge of python programming&lt;/li&gt;
&lt;li&gt;Python installed&lt;/li&gt;
&lt;li&gt;Natural Language Toolkit(nltk) package installed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What is the difference between stemming and lemmatization?
&lt;/h2&gt;

&lt;p&gt;The main difference between stemming and lemmatization is that stemming chops the suffixes off a word to reduce it to its root form, while lemmatization first takes the context of the word into consideration and uses that context to convert the word to its meaningful base form, known as the lemma.&lt;/p&gt;

&lt;p&gt;Below are examples of words that stemming and lemmatization have been performed on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stemming Examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word&lt;/strong&gt;      ---     &lt;strong&gt;Porter Stemmer&lt;/strong&gt;      &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;jumped          ---       jump
&lt;/li&gt;
&lt;li&gt;friends         ---      friend
&lt;/li&gt;
&lt;li&gt;football       ---       footbal
&lt;/li&gt;
&lt;li&gt;mysteries       ---      mysteri
&lt;/li&gt;
&lt;li&gt;created        ---       creat
&lt;/li&gt;
&lt;li&gt;took            ---      took &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lemmatization Examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word&lt;/strong&gt;        ---        &lt;strong&gt;Lemmatized word&lt;/strong&gt;          &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;jumped          ---   jump
&lt;/li&gt;
&lt;li&gt;friends         ---   friend
&lt;/li&gt;
&lt;li&gt;football     ---     football
&lt;/li&gt;
&lt;li&gt;mysteries    ---     mystery
&lt;/li&gt;
&lt;li&gt;created      ---      create
&lt;/li&gt;
&lt;li&gt;took         ---       take&lt;/li&gt;
&lt;/ul&gt;
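
&lt;p&gt;The contrast above can be sketched in a few lines of plain Python. This is a toy illustration only (not the real Porter stemmer or a real lemmatizer): the "stemmer" blindly chops a known suffix with no notion of context, while the "lemmatizer" looks words up in a small hand-made vocabulary, standing in for the dictionary and context a real lemmatizer uses.&lt;/p&gt;

```python
# Toy illustration only -- not the actual Porter stemmer or WordNet lemmatizer.
SUFFIXES = ("ies", "ed", "ing", "s")
LEMMAS = {"jumped": "jump", "took": "take", "mysteries": "mystery"}

def toy_stem(word):
    # Chop the first matching suffix, with no regard for meaning.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up in a vocabulary; fall back to the word itself.
    return LEMMAS.get(word, word)

print(toy_stem("jumped"), toy_lemmatize("jumped"))  # jump jump
print(toy_stem("took"), toy_lemmatize("took"))      # took take
```

&lt;p&gt;Notice that the suffix-chopper cannot do anything with the irregular form "took", while the vocabulary lookup maps it to "take"; this is exactly the gap that real lemmatization fills.&lt;/p&gt;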
&lt;h2&gt;
  
  
  How to carry out stemming
&lt;/h2&gt;

&lt;p&gt;The Natural Language Toolkit (nltk) package provides two well-known stemmers for the English language: PorterStemmer and LancasterStemmer. &lt;br&gt;
We are going to use PorterStemmer to carry out stemming.&lt;/p&gt;

&lt;p&gt;First let's import PorterStemmer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.stem import PorterStemmer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now create a list of words that we want to stem&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_list = ["jumped", "friendship", "friends", "swimming","creation","stability","writing",
             "realize","mystery","football", "mysteries", "created", "took"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will now stem every word in the list and then print the word with its stemmed version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stemmer = PorterStemmer()

for word in word_list:
    print((word,stemmer.stem(word)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
('jumped', 'jump')&lt;br&gt;
('friendship', 'friendship')&lt;br&gt;
('friends', 'friend')&lt;br&gt;
('swimming', 'swim')&lt;br&gt;
('creation', 'creation')&lt;br&gt;
('stability', 'stabil')&lt;br&gt;
('writing', 'write')&lt;br&gt;
('realize', 'realiz')&lt;br&gt;
('mystery', 'mysteri')&lt;br&gt;
('football', 'footbal')&lt;br&gt;
('mysteries', 'mysteri')&lt;br&gt;
('created', 'creat')&lt;br&gt;
('took', 'took')&lt;/p&gt;
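
&lt;p&gt;PorterStemmer is not the only option: as mentioned above, nltk also provides LancasterStemmer, which applies a harsher set of suffix rules and generally produces shorter stems. Here is a quick side-by-side on a few words from the list (exact stems may vary slightly between nltk versions):&lt;/p&gt;

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Compare the two stemmers on a few of the words from the list above.
for word in ["stability", "friendship", "created", "mysteries"]:
    print(word, "->", porter.stem(word), "|", lancaster.stem(word))
```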
&lt;h2&gt;
  
  
  How to carry out lemmatization
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, lemmatization, just like stemming, reduces a word to its root form. For lemmatization, however, we first need to tag the words with their part-of-speech (POS) tags before carrying out the lemmatization. For example, every word that is a verb will be given the verb (v) tag, words that are nouns will be given the noun (n) tag, and so on.&lt;/p&gt;

&lt;p&gt;Let's first import the libraries that we will be using&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a start, let's create a function for tagging the words. We will use nltk's pos_tag for the tagging, and later convert its tags to WordNet tags. (If the tagger data is missing, run nltk.download('averaged_perceptron_tagger') first.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def tag(doc):
    #POS tagging
    tagged_tokens = nltk.pos_tag(doc)
    return tagged_tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's create a function for converting the parts of speech(pos) tags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# function for converting tags
def pos_tag_wordnet(tagged_tokens):
    tag_map = {'j': wordnet.ADJ, 'v': wordnet.VERB, 'n': wordnet.NOUN, 'r': wordnet.ADV}
    new_tagged_tokens = [(word, tag_map.get(tag[0].lower(), wordnet.NOUN))
                            for word, tag in tagged_tokens]
    return new_tagged_tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now tag the words in the word list from before, then convert the tags and print the output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tag the words
tagged_tokens = tag(word_list)
# convert the tags
wordnet_tokens = pos_tag_wordnet(tagged_tokens)
print(wordnet_tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
[('jumped', 'v'), ('friendship', 'n'), ('friends', 'n'), ('swimming', 'v'), ('creation', 'n'), ('stability', 'n'), ('writing', 'v'), ('realize', 'v'), ('mystery', 'n'), ('football', 'n'), ('mysteries', 'n'), ('created', 'v'), ('took', 'v')]&lt;br&gt;
From the output, we can see we've got verbs(v) and nouns(n).&lt;/p&gt;

&lt;p&gt;Let's now lemmatize the tagged words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wnl = WordNetLemmatizer()

for word, tag in wordnet_tokens:
    print((word, wnl.lemmatize(word, tag)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Output&lt;/em&gt;&lt;br&gt;
('jumped', 'jump')&lt;br&gt;
('friendship', 'friendship')&lt;br&gt;
('friends', 'friend')&lt;br&gt;
('swimming', 'swim')&lt;br&gt;
('creation', 'creation')&lt;br&gt;
('stability', 'stability')&lt;br&gt;
('writing', 'write')&lt;br&gt;
('realize', 'realize')&lt;br&gt;
('mystery', 'mystery')&lt;br&gt;
('football', 'football')&lt;br&gt;
('mysteries', 'mystery')&lt;br&gt;
('created', 'create')&lt;br&gt;
('took', 'take')&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we've learned about stemming and lemmatization, what they are and their differences. Both stemming and lemmatization are good techniques for text processing and they each have pros and cons. &lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/dphi-official/nlp_essentials/blob/master/notebooks/01_Text_Wrangling_Examples.ipynb"&gt;dphi NLP-Essentials(Text wrangling)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/stemming-lemmatization-python"&gt;Datacamp(stemming-lemmatization-python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html"&gt;nlp.stanford.edu(stemming-and-lemmatization)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Audio Transcription with Python</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Wed, 31 Aug 2022 18:20:33 +0000</pubDate>
      <link>https://dev.to/puritye/audio-transcription-with-python-3jod</link>
      <guid>https://dev.to/puritye/audio-transcription-with-python-3jod</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Audio transcription is the process of converting speech in an audio or video file into text. Having a transcription for a video or an audio recording has benefits. Below are some of the benefits of audio transcription:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expanding the target audience. When a transcript gets translated into several languages, it opens up the content to a wider audience.&lt;/li&gt;
&lt;li&gt;Making the content more accessible. With a transcript, the content of an audio can be readily and accurately accessed, more so in cases where the audio quality has been compromised due to background distractions, low volume, regional accents and so on.&lt;/li&gt;
&lt;li&gt;Boosting the SEO. With transcription, the keywords used in the audio will now be in written form hence they can be recognized by search engines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article we are going to learn how to transcribe audio using Python. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Basic knowledge of python programming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assembly AI account&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting API token
&lt;/h2&gt;

&lt;p&gt;The first thing we will do is to get an API token from &lt;a href="https://www.assemblyai.com/"&gt;Assembly AI&lt;/a&gt;. &lt;br&gt;
Let's go to &lt;a href="https://www.assemblyai.com/"&gt;Assembly AI&lt;/a&gt; and create a free account.&lt;br&gt;
Once we have an account, we will sign in and then copy the API Key. &lt;br&gt;
The API Key is located at the right of the home page.&lt;/p&gt;
&lt;h2&gt;
  
  
  Creating config file for storing the key
&lt;/h2&gt;

&lt;p&gt;Now that we have an API Key, let's create a config file for storing the key.&lt;br&gt;
We will create a python file and name it 'api_key.py' (you can give it any name). Then create a variable and assign the API Key to the variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API_KEY = 'API Key from Assembly AI'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the config file, we will now create a main file (main.py) where we will write the code for transcribing the audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; 'api_key.py' and 'main.py' should be in the same directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Importing requests and API Key
&lt;/h2&gt;

&lt;p&gt;The first thing that we will do in the 'main.py' is to import requests and the API Key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from api_key import API_KEY 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Uploading Audio to Assembly AI
&lt;/h2&gt;

&lt;p&gt;Next, let's create a variable 'filename', then get the path of the audio that we want to transcribe and assign this path to 'filename'.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filename = 'audio path'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now create another variable 'upload_endpoint'.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;upload_endpoint = 'https://api.assemblyai.com/v2/upload'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's also create a variable 'headers' which will be used for authentication. We will use the API Key for authentication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;headers = {'authorization': API_KEY}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's create a function for reading the audio file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def read_file(filename, chunk_size=5242880):
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                break
            yield data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
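
&lt;p&gt;The 5242880-byte chunk size is 5 MB, so large audio files are streamed to the API in 5 MB pieces instead of being loaded into memory at once. The chunking behaviour is easy to check on a tiny file with a tiny chunk size (the file contents and the 4-byte chunks below are just for illustration):&lt;/p&gt;

```python
import os
import tempfile

def read_file(filename, chunk_size=5242880):
    # Yield the file contents chunk by chunk (5242880 bytes = 5 MB by default).
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                break
            yield data

# Illustration: a 10-byte file read in 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
chunks = list(read_file(tmp.name, chunk_size=4))
os.unlink(tmp.name)
print(chunks)  # [b'0123', b'4567', b'89']
```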



&lt;p&gt;Let's now do a post request to upload the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;upload_response = requests.post(upload_endpoint,
                        headers=headers,
                        data=read_file(filename))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can print the response to see what kind of response we get.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(upload_response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The output is a JSON object containing an upload url, which points to where the audio file is stored after being uploaded.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transcribing Audio
&lt;/h2&gt;

&lt;p&gt;Our next step now is to transcribe the uploaded audio.&lt;br&gt;
Let's create a variable 'transcript_endpoint' and assign the transcription endpoint to it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transcript_endpoint = "https://api.assemblyai.com/v2/transcript"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transcript endpoint is the same as the upload endpoint except that it ends with 'transcript' while the upload endpoint ends with 'upload'.&lt;/p&gt;

&lt;p&gt;Next, let's extract the audio url from the response we got from uploading the audio.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audio_url = upload_response.json()['upload_url']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now create a JSON payload (a dictionary) that contains the audio url.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json = { "audio_url": audio_url}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then submit the audio for transcription&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transcript_response = requests.post(transcript_endpoint, json=json, headers=headers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's print the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(transcript_response())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the response, truncated to just a few of its fields.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;{&lt;strong&gt;'id': 'ongvqhtbo7-ad52-4272-b695-d7c624b7c2b5'&lt;/strong&gt;, 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'queued'}&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The response we get is not the transcript itself. This is because, depending on the length of the audio, it may take a minute or two for the transcript to be ready. Instead, the response contains a bunch of information about the transcription job. &lt;/p&gt;

&lt;p&gt;Our main interest in the response is the 'id'; we will use it to ask AssemblyAI whether the transcription job is ready or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Polling
&lt;/h2&gt;

&lt;p&gt;Put simply, polling refers to repeatedly checking a resource to see what state it is in.&lt;/p&gt;

&lt;p&gt;We will now write the code for polling AssemblyAI. We will use this code to continuously check the status of the transcription job so as to know whether the transcription is ready or not.&lt;/p&gt;

&lt;p&gt;The first thing we will do is to get the 'id' from the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;job_id = transcript_response.json()['id']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After getting the job id, let's create a polling endpoint and then send a get request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;polling_endpoint = transcript_endpoint + '/' + job_id # creating a polling endpoint
polling_response = requests.get(polling_endpoint, headers=headers) # get request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's print the polling response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(polling_response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like the transcript response, this response also gives a bunch of information. &lt;br&gt;
Below is the response, truncated to just a few of its fields.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;{'id': 'onse9tjyxv-4164-439c-b19c-ee92ae95a7c1', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', &lt;strong&gt;'status': 'processing'&lt;/strong&gt;}&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For this response, we are interested in the status. If the transcription is not ready, the status will indicate 'processing', otherwise it will indicate 'completed'. &lt;br&gt;
From the response above, we can see that the status indicates 'processing' meaning that the transcription is not yet ready.&lt;/p&gt;

&lt;p&gt;Let's now create a while loop that will keep on polling AssemblyAI until the status indicates completed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while True:
    polling_response = requests.get(polling_endpoint, headers=headers)
    if polling_response.json()['status'] == 'processing':
        print('Still processing')
    elif polling_response.json()['status'] == 'error':
        print('error')
    elif polling_response.json()['status'] == 'completed':
        print('completed')
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the output from the while loop. As we can see, with the while loop we keep on polling until the status indicates 'completed'.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Still processing&lt;br&gt;
Still processing&lt;br&gt;
Still processing&lt;br&gt;
Still processing&lt;br&gt;
completed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the status indicating 'completed' let's now print the polling response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(polling_response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the response, truncated to a bit of the information.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;{'id': 'onsx3yoyc6-aaa9-472e-b8fa-e3cc10e0432f', 'language_model': 'assemblyai_default', 'acoustic_model': 'assemblyai_default', 'language_code': 'en_us', 'status': 'completed', &lt;strong&gt;'text': 'How is your processing with Python?'&lt;/strong&gt;, 'words': [{'text': 'How', 'start': 730, 'end': 822, 'confidence': 0.38754, 'speaker': None},&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our main focus from the response is 'text', this is because it contains the transcript. From the response we can see that our transcript is 'How is your processing with Python?'&lt;/p&gt;

&lt;p&gt;Now that we have the transcript, let's save it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Saving the transcript
&lt;/h2&gt;

&lt;p&gt;We will now write the transcript into a text file and thereafter print 'File successfully saved!' to confirm that the file has been saved.&lt;br&gt;
The text file will be saved into the working directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = polling_response.json()
with open('transcript.txt', 'w') as f:
    f.write(response['text'])
print('File successfully saved!')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
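
&lt;p&gt;For reuse, the polling steps above can be wrapped into a small helper. This is just a sketch built on the same endpoint we used earlier; the function names are our own, and a short sleep between polls avoids hammering the API:&lt;/p&gt;

```python
import time

import requests

TRANSCRIPT_ENDPOINT = 'https://api.assemblyai.com/v2/transcript'

def polling_endpoint_for(job_id):
    # The polling URL is simply the transcript endpoint plus the job id.
    return TRANSCRIPT_ENDPOINT + '/' + job_id

def wait_for_transcript(job_id, headers, interval=3):
    # Poll AssemblyAI until the job reaches a terminal state
    # ('completed' or 'error'), sleeping between requests.
    while True:
        response = requests.get(polling_endpoint_for(job_id), headers=headers).json()
        if response['status'] in ('completed', 'error'):
            return response
        time.sleep(interval)
```

&lt;p&gt;With this in place, the whole polling loop becomes a single call: &lt;code&gt;response = wait_for_transcript(job_id, headers)&lt;/code&gt;.&lt;/p&gt;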



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article we've learnt how to transcribe an audio file and looked at all the steps to follow when transcribing audio.&lt;br&gt;
Apart from &lt;a href="https://www.assemblyai.com/"&gt;AssemblyAI&lt;/a&gt;, there are also other platforms, such as &lt;a href="https://deepgram.com/why-deepgram/"&gt;Deepgram&lt;/a&gt;, that can be used for converting speech to text. &lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=rTWM5WuPhlQ&amp;amp;list=PLcWfeUsAys2nb0i79L_LqYVfwOWEYA4eD&amp;amp;index=3&amp;amp;t=1143s"&gt;How to Transcribe Audio files with python(AssemblyAI)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>nlp</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Plot an audio file using Matplotlib</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Wed, 06 Jul 2022 05:24:50 +0000</pubDate>
      <link>https://dev.to/puritye/how-to-plot-an-audio-file-using-matplotlib-pbb</link>
      <guid>https://dev.to/puritye/how-to-plot-an-audio-file-using-matplotlib-pbb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Plotting and visualizing an audio file is one of the most important processes in audio analysis. Audio analysis is the process of transforming, exploring, and interpreting audio signals recorded by digital devices so as to extract insights from the audio data.&lt;/p&gt;

&lt;p&gt;In this article, we are going to plot a waveform of an audio file with matplotlib.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python installed&lt;/li&gt;
&lt;li&gt;Numpy installed&lt;/li&gt;
&lt;li&gt;Matplotlib installed&lt;/li&gt;
&lt;li&gt;Background in data analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Importing module and libraries
&lt;/h2&gt;

&lt;p&gt;As a first step, let's import the modules and libraries that we will need.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import wave
import matplotlib.pyplot as plt 
import numpy as np


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We will use the wave module and numpy to preprocess the audio, and matplotlib to plot it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading the Audio file
&lt;/h2&gt;

&lt;p&gt;The audio file that we will use is a wave file.&lt;br&gt;
Let's load the wave file that we want to plot&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

obj = wave.open('audio_file.wav', 'rb')


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Getting Audio parameters
&lt;/h2&gt;

&lt;p&gt;Let's print out the audio parameters such as number of channels, sample width, etc.&lt;br&gt;
We will use &lt;code&gt;.getparams()&lt;/code&gt; method of the wave module&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

print('Parameters:', obj.getparams())


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;Parameters: _wave_params(nchannels=1, sampwidth=2, framerate=22050, nframes=81585, comptype='NONE', compname='not compressed')&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now let's get the parameters that we will need for plotting the audio.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sample frequency, this is the number of samples per second&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sample_freq = obj.getframerate()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Number of samples, this is the total number of samples or frames in the audio file&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

n_samples = obj.getnframes()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Signal wave, these are the raw audio frames (bytes) whose values encode the wave amplitude, which corresponds to the sound intensity.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

signal_wave = obj.readframes(-1)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Audio length, this is the duration of the audio.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

duration = n_samples/sample_freq


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
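
&lt;p&gt;Using the parameters printed earlier (nframes=81585, framerate=22050), we can sanity-check the duration by hand:&lt;/p&gt;

```python
n_samples = 81585    # nframes from getparams() above
sample_freq = 22050  # framerate from getparams() above

duration = n_samples / sample_freq
print(duration)  # 3.7 seconds of audio
```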
&lt;h2&gt;
  
  
  Creating numpy objects
&lt;/h2&gt;

&lt;p&gt;Let's create a numpy array from the signal_wave. Since the sample width is 2 bytes, we read the buffer as 16-bit integers. This will be plotted on the y-axis.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

signal_array = np.frombuffer(signal_wave, dtype=np.int16)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's create a numpy array for the time axis from the duration. This will be plotted on the x-axis.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

time = np.linspace(0, duration, num=n_samples)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Creating an audio plot
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

plt.figure(figsize=(15, 5))
plt.plot(time, signal_array)
plt.title('Audio Plot')
plt.ylabel(' signal wave')
plt.xlabel('time (s)')
plt.xlim(0, time) #limiting the x axis to the audio time
plt.show()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqieuv0iw5017dve5ixg.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqieuv0iw5017dve5ixg.PNG" alt="Output of the plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we've learnt how to plot the waveform of an audio file. Apart from the waveform, another plot we can get from an audio file is the frequency spectrum, which shows the frequency content of the signal rather than its amplitude over time. To learn how to plot a frequency spectrum, check out this &lt;a href="https://learnpython.com/blog/plot-waveform-in-python/" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=n2FKsPt83_A&amp;amp;list=PLcWfeUsAys2nb0i79L_LqYVfwOWEYA4eD&amp;amp;index=2" rel="noopener noreferrer"&gt;Audio Processing Basics(Assembly AI)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.altexsoft.com/blog/audio-analysis/" rel="noopener noreferrer"&gt;Audio Analysis(altexsoft)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>audioanalysis</category>
      <category>beginners</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Named Entity Recognition with Spacy</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Thu, 30 Jun 2022 18:50:01 +0000</pubDate>
      <link>https://dev.to/puritye/named-entity-recognition-with-spacy-5adn</link>
      <guid>https://dev.to/puritye/named-entity-recognition-with-spacy-5adn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Named Entity Recognition, also known as NER, is a Natural Language Processing (NLP) task that identifies and classifies named entities in a text. Named entities are real-world objects assigned a name. They include people's names, location names, works of art, organizations, days and dates, among many others. &lt;/p&gt;

&lt;p&gt;Named Entity Recognition is usually used for extracting key information to understand a text while performing tasks such as topic identification. It can also be used on its own simply to extract important information from a text.&lt;/p&gt;

&lt;p&gt;In this article, I am going to explain how to perform Named Entity Recognition using Spacy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Spacy installed&lt;/li&gt;
&lt;li&gt;Python installed&lt;/li&gt;
&lt;li&gt;Basic knowledge of python programming&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Spacy?
&lt;/h2&gt;

&lt;p&gt;Spacy is an open-source library used for performing various NLP tasks.&lt;br&gt;
It has a built-in mechanism for identifying and classifying named entities.&lt;/p&gt;
&lt;h2&gt;
  
  
  NER using Spacy
&lt;/h2&gt;

&lt;p&gt;First, let's import the Spacy library&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import spacy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then load the "en_core_web_sm" model and assign it to a variable named nlp&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.load("en_core_web_sm")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a sample text which we will extract named entities from&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_text = "Over 200 youth from Kisumu County in Kenya, have today gotten a chance to take  part in a Golf programme by Safaricom held at Lolwe Grounds."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a Spacy document by passing the sample text into nlp()&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp(sample_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To extract the named entities from the document we will use '.ents'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(doc.ents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;(200, Kisumu County, Kenya, today, Safaricom, Lolwe Grounds)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's now print all the entities together with the category (label) they have been classified into.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for ent in doc.ents:
    print(ent, ent.label_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;200 CARDINAL&lt;br&gt;
Kisumu County GPE&lt;br&gt;
Kenya GPE&lt;br&gt;
today DATE&lt;br&gt;
Safaricom ORG&lt;br&gt;
Lolwe Grounds FAC&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The explain() method
&lt;/h2&gt;

&lt;p&gt;Spacy has an 'explain()' method that takes a label/category and returns an explanation of it.&lt;br&gt;
It is a handy way to get a quick definition of a label.&lt;/p&gt;

&lt;p&gt;Let's try it out with the labels we got&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spacy.explain("CARDINAL")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;Numerals that do not fall under another type&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spacy.explain("GPE")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;Countries, cities, states&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spacy.explain("DATE")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;Absolute or relative dates or periods&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spacy.explain("FAC")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;Buildings, airports, highways, bridges, etc.&lt;/em&gt;&lt;/p&gt;
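&lt;p&gt;Since 'explain()' only looks up Spacy's built-in glossary, it works without loading a model. A small sketch that prints a quick reference for the labels we saw:&lt;br&gt;
&lt;/p&gt;

```python
import spacy

# explain() reads Spacy's built-in glossary; no model needs to be loaded
for label in ["CARDINAL", "GPE", "DATE", "FAC", "ORG"]:
    print(label, "->", spacy.explain(label))
```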

&lt;h2&gt;
  
  
  Visualizing Named Entities using Displacy
&lt;/h2&gt;

&lt;p&gt;Displacy is Spacy's built-in visualizer. &lt;br&gt;
With the 'ent' style, it highlights the named entities directly in the text.&lt;/p&gt;

&lt;p&gt;Let's import Displacy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from spacy import displacy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we will create the visual&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;displacy.render(doc,style="ent",jupyter=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XYVtz_L2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tw3qakdckbnucw9yxcmg.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XYVtz_L2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tw3qakdckbnucw9yxcmg.PNG" alt="Displacy output" width="880" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Named Entity Recognition is one of the methods that can be used to gain insights from a text while carrying out NLP tasks. Named Entity Recognition has several use cases, such as recommendation systems, efficient search algorithms, and customer support.&lt;/p&gt;

&lt;p&gt;In this article, we looked at Named Entity Recognition using Spacy. But Spacy is not the only library that can be used for NER. Other open-source libraries that you can use are &lt;a href="https://www.nltk.org/"&gt;NLTK&lt;/a&gt; and &lt;a href="https://nlp.stanford.edu/software/CRF-NER.shtml"&gt;Stanford NER&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2021/06/part-10-step-by-step-guide-to-master-nlp-named-entity-recognition/#:~:text=Named%20Entity%20Recognition%20is%20one,classify%20them%20into%20predefined%20categories"&gt;NER (Analytics Vidhya)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://spacy.io/"&gt;Spacy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/#:~:text=NER%20using%20Spacy%3A,fast%20statistical%20entity%20recognition%20system."&gt;Spacy NER (Analytics Vidhya)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>nlp</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>BOXPLOTS IN A BRIEF</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Tue, 05 Apr 2022 09:52:07 +0000</pubDate>
      <link>https://dev.to/puritye/boxplots-in-a-brief-4h8c</link>
      <guid>https://dev.to/puritye/boxplots-in-a-brief-4h8c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A boxplot is a graph that is usually used for the descriptive analysis of numerical data. It provides information about the structure and distribution of the data based on measures of central tendency and measures of dispersion, including the median, quartiles, maximum, minimum, and symmetry. Boxplots can also be used to identify outliers in a dataset.&lt;/p&gt;

&lt;p&gt;Boxplots consist of a box and whiskers, and can be drawn either vertically or horizontally. When vertical, the bottom end of the box is the lower quartile and the top end is the upper quartile. The height of the box, which extends from the lower to the upper quartile, is the interquartile range; it contains the middle 50% of the data. The median is the horizontal line inside the box. Whiskers appear at the ends of the plot and mark the minimum and maximum observations of the data, excluding outliers (commonly, points more than 1.5 times the interquartile range beyond the box). The data observations that appear beyond the whiskers are the outliers.&lt;/p&gt;
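&lt;p&gt;The quantities a boxplot is built from can be computed directly. A quick sketch with NumPy on a made-up sample (the data values are illustrative only):&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

# Illustrative sample; 40 is an obvious outlier
data = np.array([5, 7, 8, 9, 10, 11, 12, 13, 14, 40])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # height of the box

# A common convention: whiskers reach the furthest points within
# 1.5 * IQR of the box; anything beyond is plotted as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(median, iqr, outliers)  # 10.5 4.5 [40]
```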

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jgxXzi31--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j4qz0kqr19qralp1s9mm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jgxXzi31--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j4qz0kqr19qralp1s9mm.PNG" alt="Image of boxplot" width="481" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the median is in the middle of the box, the distribution of the data is symmetric otherwise it is skewed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4RlQRDpX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cl6tdsc78kldgck3ufwm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4RlQRDpX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cl6tdsc78kldgck3ufwm.PNG" alt="Image shows symmetric and skewed distribution" width="561" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the median is closer to the lower quartile and the lower whisker is shorter, the distribution is right-skewed (positive skewness); when the median is closer to the upper quartile and the upper whisker is shorter, the distribution is left-skewed (negative skewness).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gvp1qPPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qmbyxvhclbbgl00fjkux.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gvp1qPPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qmbyxvhclbbgl00fjkux.PNG" alt="Image shows right and left skewed distribution" width="454" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to create a Boxplot
&lt;/h2&gt;

&lt;p&gt;To create our boxplot, we are going to use the &lt;a href="https://www.kaggle.com/hellbuoy/car-price-prediction"&gt;CarPrice_Assignment.csv&lt;/a&gt; Dataset from Kaggle.&lt;/p&gt;

&lt;p&gt;As a first step, we are going to import libraries that we will be using.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pandas as pd&lt;br&gt;
import seaborn as sns&lt;br&gt;
import matplotlib.pyplot as plt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the next step, we are going to read the dataset and view the first 5 rows&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df = pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')&lt;br&gt;
df.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mSj5LcQV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rm5jdd4ls3krihf165i1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mSj5LcQV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rm5jdd4ls3krihf165i1.PNG" alt="Image shows loaded dataset" width="880" height="120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we are going to create a boxplot for the feature ‘price’ to get a descriptive analysis of it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.set_style('whitegrid')&lt;br&gt;
sns.boxplot(y='price', data = df)&lt;br&gt;
plt.ylabel('price') #setting text for y axis&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uULJuCZf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2brqqry7grkk9c4qcxjx.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uULJuCZf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2brqqry7grkk9c4qcxjx.PNG" alt="Image shows boxplot of price column" width="454" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the boxplot, we can see that the median price of a car is around 10,000. The minimum price of a car is about 5,000 and the maximum (excluding outliers) is around 28,000 to 29,000. We can also note the outliers in the dataset and that the distribution is right-skewed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Boxplots as a way to compare several datasets
&lt;/h2&gt;

&lt;p&gt;Although boxplots are usually used for descriptive analysis, they can also be used to visualize correlations and associations between variables. They show the descriptive analysis of a dependent numerical feature against each unique value of an independent categorical feature.&lt;/p&gt;

&lt;p&gt;As an example, we are going to use the dataset from before. We are going to create a boxplot to show the descriptive analysis of the feature ‘price’ against each unique value of the feature ‘fuel type’.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.set_style("whitegrid")&lt;br&gt;
sns.boxplot(x='fueltype', y='price', data =df)&lt;br&gt;
plt.xlabel('fueltype') #Set text for the x axis&lt;br&gt;
plt.ylabel('price') #Set text for y axis&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CPGIFoe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ntfewlxfxjuxz2r1ldt.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CPGIFoe3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ntfewlxfxjuxz2r1ldt.PNG" alt="Image shows boxplot of gas and diesel" width="475" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As per the boxplots above, we can see that the prices for cars with fuel type diesel and fuel type gas differ, and that on average, the cars with fuel type diesel have a higher price compared to the cars with fuel type gas.&lt;/p&gt;
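&lt;p&gt;The same group comparison can be checked numerically with a groupby. A small sketch on made-up data (the column names mirror the dataset, but the values here are illustrative only):&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

# Illustrative values only; the real dataset has many more rows
df = pd.DataFrame({
    "fueltype": ["gas", "gas", "gas", "diesel", "diesel", "diesel"],
    "price": [8000, 10000, 12000, 13000, 15000, 17000],
})

# Median price per fuel type: the line inside each box of the boxplot
medians = df.groupby("fueltype")["price"].median()
print(medians)
```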

&lt;h2&gt;
  
  
  Creating a boxplot with nested grouping
&lt;/h2&gt;

&lt;p&gt;Still using the same dataset, we are now going to group both the cars with fuel type gas and fuel type diesel by the feature ‘door number’ and plot them against the price to see how their prices vary.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.set_style("whitegrid")&lt;br&gt;
sns.boxplot(x ='fueltype', y='price', hue='doornumber', data=df)&lt;br&gt;
plt.xlabel('fueltype') #Set text for the x axis&lt;br&gt;
plt.ylabel('price') #Set text for y axis&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r3PbHoGG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9j0m9cyfdyd0ha38du60.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r3PbHoGG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9j0m9cyfdyd0ha38du60.PNG" alt="Image shows boxplot with nested grouping" width="466" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the cars that use diesel, whether with two doors or four doors, have a higher price compared to cars that use gas. The average price of cars that use diesel and have two doors is the lowest as compared to all the others. With the cars that use gas, there is a slight variation in the average price between the cars with two doors and those with four doors, while for the cars that use diesel, there is a big variation in the average price between cars with two doors and those with four doors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a boxplot with nested grouping with some bins being empty
&lt;/h2&gt;

&lt;p&gt;For this, we are going to group both cars with fuel type gas and diesel by the feature ‘drivewheel’ and plot them against price to see how their prices vary.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sns.set_style("whitegrid")&lt;br&gt;
sns.boxplot(x ='fueltype', y='price', hue='drivewheel', data=df)&lt;br&gt;
plt.xlabel('fueltype') # Set text for the x axis&lt;br&gt;
plt.ylabel('price')# Set text for y axis&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e5_yQdNV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/by36slakdrvmcajqwxq3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e5_yQdNV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/by36slakdrvmcajqwxq3.PNG" alt="Image shows nested boxplots with some bins being empty" width="477" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that among the cars that use diesel, none is a four-wheel drive. For both the cars that use gas and diesel, the rear-wheel-drive cars have the highest prices and the front-wheel-drive cars have the lowest prices. Generally, the cars that use diesel have higher prices compared to those that use gas: the rear-wheel-drive cars that use diesel are priced higher than the rear-wheel-drive cars that use gas, and the same is seen with the front-wheel-drive cars.&lt;/p&gt;

&lt;h2&gt;
  
  
  In conclusion
&lt;/h2&gt;

&lt;p&gt;This was a brief explanation of boxplots: what they are, how to create them, how to interpret them, and how to use them to get insights from data. Hopefully, those who read this article will find it useful.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Aspect-based sentiment analysis of video reviews</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Thu, 31 Mar 2022 12:20:19 +0000</pubDate>
      <link>https://dev.to/puritye/aspect-based-sentiment-analysis-of-video-reviews-4cap</link>
      <guid>https://dev.to/puritye/aspect-based-sentiment-analysis-of-video-reviews-4cap</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;I've always loved participating in hackathons, mainly because they are a great way to showcase my work, improve my skills and network. I came across the Deepgram Hackathon on DEV a little late, a week after it was launched, and with the deadline approaching, the Innovative Ideas category was the best option for me to submit in time and not miss out. Another reason I chose the Innovative Ideas category is that I am not yet advanced in programming, so it would have been quite hard for me to come up with code that would successfully complete the 'Build' challenge. Also, it's my first time participating in a hackathon where I am to come up with an innovative idea, which is quite interesting and something I really wanted to try out.&lt;br&gt;
This is my first time encountering Deepgram, though I do have basic knowledge of speech recognition technology and have come across it on quite a number of occasions.&lt;/p&gt;

&lt;h3&gt;
  
  
  My Deepgram Use-Case
&lt;/h3&gt;

&lt;p&gt;My idea is to create a model that will be able to perform aspect-based sentiment analysis of video reviews. The model will extract the sentiment and classify the video review as either positive, negative or neutral. It will also extract the aspect and classify the video review into the category, feature or topic being talked about in the video. &lt;br&gt;
Customer reviews have been found to make a huge impact on whether a customer will purchase from a company or not, actually much more than the marketing and advertising the company does for its brand. Good, positive customer reviews are a great way for a company or business to gain customers' trust in its products and services. Negative reviews, on the other hand, are a good way for a company or business to understand its customers better and improve its products or services. Customer reviews can also be used by investors for finance and stock monitoring: investors can choose a company to invest in by looking at the sentiment around the company's products.&lt;br&gt;
Customer reviews have mostly been written text, but lately video reviews are becoming more popular with customers. As it has always been said, 'People are more likely to believe what they see than what they hear.' Most video reviews are done on products, where someone shares his or her experience using a product and gives a real-time demonstration of it.&lt;br&gt;
Because sentiment analysis is usually done on text, for this project Deepgram will help with their speech-to-text technology by transcribing the speech from the videos into text, after which aspect-based sentiment analysis will be done on the transcribed text.&lt;/p&gt;
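&lt;p&gt;To make the second stage of the pipeline concrete, here is a deliberately simplified, hypothetical sketch of aspect-based sentiment analysis on an already-transcribed review. The lexicons and function are made up for illustration; a real system would use a trained model rather than word lists:&lt;br&gt;
&lt;/p&gt;

```python
# Toy lexicons; purely illustrative, not a production approach
ASPECTS = {"battery": "battery life", "screen": "display", "camera": "camera"}
POSITIVE = {"great", "amazing", "good", "excellent"}
NEGATIVE = {"poor", "bad", "terrible", "weak"}

def aspect_sentiment(transcript):
    """Pair each aspect keyword with opinion words in the same sentence."""
    results = {}
    for sentence in transcript.lower().split("."):
        words = set(sentence.split())
        for keyword, aspect in ASPECTS.items():
            if keyword in words:
                if words & POSITIVE:
                    results[aspect] = "positive"
                elif words & NEGATIVE:
                    results[aspect] = "negative"
                else:
                    results[aspect] = "neutral"
    return results

print(aspect_sentiment("The camera is amazing. The battery is weak."))
# {'camera': 'positive', 'battery life': 'negative'}
```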

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffti6jg73wu1s26xcr78f.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffti6jg73wu1s26xcr78f.PNG" alt="Image shows the process of the project"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dive into Details
&lt;/h3&gt;

&lt;p&gt;The main challenge that will be solved by Deepgram is to get the speech or audio from the videos into text format so that aspect-based sentiment analysis can be done on the text.&lt;br&gt;
The main people who will benefit from this innovation are companies, businesses and investors. The main idea here is to automate the process of the sentiment analysis of video reviews. The benefits that the companies and investors will get from this innovation are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Saving time. Manually analyzing a large number of video reviews could take a lot of time; automation can do it in a fraction of that time.&lt;/li&gt;
&lt;li&gt;Having a more trustworthy analysis. When performing analysis, humans tend to rely on their own experiences and unconscious biases to derive meaning. Automated analysis can remove human bias by being consistent.&lt;/li&gt;
&lt;li&gt;Having a more powerful analysis. They will be able to perform analysis without limits on the size of the data to be analyzed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This innovation will make use of Deepgram's model feature, with the option set to video, because the audio will be sourced from videos. Since this innovation is all about text analysis, Deepgram's keywords, utterances and utterance split will be very vital. With the keyword feature enabled, the model will be able to intensify or suppress a keyword, thus enabling it to understand the context in the text. This is a huge plus, since the analysis will require a clear and well-understood context. &lt;/p&gt;

&lt;p&gt;What led me to this particular idea is that while searching for inspiration on the &lt;a href="https://developers.deepgram.com/events/dev-to-hackathon-2022/" rel="noopener noreferrer"&gt;Deepgram support page&lt;/a&gt;, I came across &lt;a href="https://developers.deepgram.com/blog/2022/01/live-transcription-badge-video/" rel="noopener noreferrer"&gt;this project&lt;/a&gt; by Kevin Lewis: a wearable screen that live-captions your voice to help people understand you while you are wearing a mask. Suddenly, it rang in my mind that there's actually a lot we can get from reading text: the sentiments being relayed, and the topic and context being discussed. And that is how I ended up with the idea of aspect-based sentiment analysis of video reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;By participating in this challenge, I have learned a lot about Deepgram, its features and how they can be used. It has also been really nice to go through the other participants' posts and learn about the many ways in which speech-to-text technology can be used.&lt;/p&gt;

</description>
      <category>hackwithdg</category>
      <category>nlp</category>
      <category>beginners</category>
    </item>
    <item>
      <title>K-MEANS CLUSTERING</title>
      <dc:creator>Purity-Nyagweth</dc:creator>
      <pubDate>Fri, 28 Jan 2022 09:29:12 +0000</pubDate>
      <link>https://dev.to/puritye/k-means-clustering-429i</link>
      <guid>https://dev.to/puritye/k-means-clustering-429i</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;What is K-means Clustering?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;K-means clustering is an unsupervised learning algorithm. It falls under the class of clustering algorithms, which find similarities in the data in order to group it into clusters.&lt;/p&gt;

&lt;p&gt;The K in K-means represents the number of clusters that the data points are to be grouped into, while the 'means' comes from the fact that after creating the clusters, K-means gets the mean of each cluster and uses it as the new centroid (the center of the cluster). &lt;/p&gt;

&lt;p&gt;The number of clusters (K) is usually predetermined. K-means clustering creates a predetermined number of clusters from an unlabeled multidimensional data. &lt;/p&gt;

&lt;p&gt;The following two assumptions are the basis of the K-means model:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cluster center is the arithmetic mean of all points belonging to the cluster.&lt;/li&gt;
&lt;li&gt;Each point is closer to its own cluster center than to other cluster centers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Steps for K-means clustering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guess random cluster centers&lt;/li&gt;
&lt;li&gt;Assign points to the nearest cluster center&lt;/li&gt;
&lt;li&gt;Get the mean of the clusters and take them as the new cluster centers &lt;/li&gt;
&lt;li&gt;Repeat the previous two steps until convergence (the same points are assigned to the same clusters in consecutive iterations), until the cluster centers no longer change, or until the maximum number of iterations is reached.&lt;/li&gt;
&lt;/ul&gt;
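&lt;p&gt;The steps above can be sketched from scratch with NumPy. This is a minimal illustration on made-up blob data; in practice one would use scikit-learn's KMeans:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: guess cluster centers by picking k random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centers are the means of the clusters
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Convergence: the centers stopped moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs make the result easy to check
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
```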

&lt;p&gt;&lt;strong&gt;How to choose the right K (number of clusters)&lt;/strong&gt;&lt;br&gt;
There are quite a number of methods used to choose the number of clusters for K-means, including the elbow method, the silhouette method, and the sum of squares method, among others. We are going to discuss the elbow method in detail.&lt;/p&gt;

&lt;p&gt;A major property of clusters is that the data points within a cluster should be similar, meaning that clustering algorithms should form clusters such that the intra-cluster variation (WCSS) is minimized. WCSS, which stands for within-cluster sum of squares, is the sum of the squared distances between each member of a cluster and its centroid.&lt;/p&gt;
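&lt;p&gt;As a tiny worked example of WCSS (the points, centroids, and assignments below are made up for illustration):&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

# Four points in two clusters, with their centroids
points = np.array([[1, 1], [2, 2], [9, 9], [10, 10]])
centers = np.array([[1.5, 1.5], [9.5, 9.5]])
labels = np.array([0, 0, 1, 1])

# WCSS: sum of squared distances from each point to its cluster centroid
wcss = sum(np.sum((points[labels == c] - centers[c]) ** 2) for c in range(2))
print(wcss)  # each point is 0.5 off in each axis: 4 * (0.25 + 0.25) = 2.0
```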

&lt;p&gt;In the elbow method, the WCSS at each number of clusters is calculated and graphed. One should choose the number of clusters such that adding another cluster does not reduce the WCSS much further. This will be the point where the slope changes from steep to shallow (an elbow).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The following are steps for performing the elbow method;&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run K-means clustering for different values of k. For example values of K ranging from 1 to 12&lt;/li&gt;
&lt;li&gt;For each value of k, calculate the WCSS&lt;/li&gt;
&lt;li&gt;Plot a graph of WCSS against the number of clusters k&lt;/li&gt;
&lt;li&gt;Spot the point where there is a change of slope from steep to shallow (an elbow). This will be the optimal number of clusters.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Implementing k-means with python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We are going to use the &lt;a href="https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python"&gt;mall_customers&lt;/a&gt; dataset from Kaggle.&lt;/p&gt;

&lt;p&gt;Snippet for loading the dataset and viewing the  first 5 rows of the data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# loading dataset
mall_customers = pd.read_csv("data path")
mall_customers.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R2tZx4Ry--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l36xszc8zfdd4tos3ka6.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R2tZx4Ry--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l36xszc8zfdd4tos3ka6.PNG" alt="First 5 rows of the data" width="509" height="179"&gt;&lt;/a&gt;&lt;br&gt;
The dataset has 5 columns. We are going to use the columns 'age' and 'spending score' to make the clusters. We want to group the ages according to their spending score.&lt;/p&gt;

&lt;p&gt;We will get the part of dataset that is needed&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = mall_customers[['Age', 'Spending Score (1-100)']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first, we will determine the number of clusters needed by using the elbow method. We will run K-means for different values of K in the range of 1 to 10, calculate each of their WCSS, and plot a graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Plot of WCSS against number of clusters')
plt.ylabel('WCSS')
plt.xlabel('Number of clusters')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kTQzHZR0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kjr7hdcdvyd0f20j7lck.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kTQzHZR0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kjr7hdcdvyd0f20j7lck.PNG" alt="Graph of WCSS against cluster number" width="641" height="416"&gt;&lt;/a&gt;&lt;br&gt;
From the graph above, 4 is our optimal number of clusters. &lt;/p&gt;

&lt;p&gt;We will now proceed to clustering the data into 4 clusters and visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#defining a k=4 kmeans cluster model
kmeans_4 = KMeans(n_clusters=4, random_state=0)

#fitting data into the model
assignments = kmeans_4.fit_predict(X)

#creating dictionary to assign cluster numbers to colours for visualization
col_dic = {0:'blue',1:'green',2:'orange',3:'magenta'}

#mapping cluster numbers to colours
assign_colour = [col_dic[x] for x in assignments]

#visualization
plt.scatter(X['Age'], X['Spending Score (1-100)'], color=assign_colour)
plt.ylabel('Spending score')
plt.xlabel('Age')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rigiQKcM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ftsv5tphufyc7pms6wde.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rigiQKcM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ftsv5tphufyc7pms6wde.PNG" alt="Clusters visualization" width="585" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In this article, we have discussed the K-means clustering algorithm: what it is, the steps it follows, and how to implement it in Python. Hope you enjoyed reading through and found the article helpful.&lt;/p&gt;

&lt;p&gt;Feel free to give feedback or comment if you have got any so that we all keep learning.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
