Thales Bruno

Posted on Jul 7, 2020 • Edited on Jul 20, 2020 • Originally published at thalesbr.uno

Categorical variables

#statistics #datascience #python #beginners

A categorical variable (sometimes called a nominal variable) is a variable that can take one of a limited number of possible values, called categories, which have no intrinsic ordering. It uses labels, names, or other descriptors (even numbers) to identify mutually exclusive categories or types of things.

Nationality is an example of a categorical variable, with values such as Brazilian, Canadian, and French. There is no ordering between these values: we cannot say that Brazilian is higher than Canadian, and there is no way to rank the categories from highest to lowest or from best to worst.

Other examples of categorical variables are Region (North, South, East, West), Blood Type (A, B, AB, O), and Smartphone Brand (Apple, Samsung, LG, Xiaomi).

However, if there is a clear order between the categories, we are dealing with an ordinal variable. It is very similar to a categorical variable and is often considered a special kind of one, sitting between categorical and quantitative variables. An example of an ordinal variable is Educational Level (Elementary school education, High school graduate, Some college, College graduate, Graduate degree).
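The difference between nominal and ordinal variables can be sketched in pandas, which lets a categorical be flagged as ordered or unordered. The snippet below is only an illustration with made-up sample values; the variable names are assumptions, not part of the dataset used later.

```python
import pandas as pd

# Nominal: the categories carry no order, so we mark them as unordered.
blood_type = pd.Categorical(
    ["A", "O", "AB", "O"],
    categories=["A", "B", "AB", "O"],
    ordered=False,
)

# Ordinal: the position of each label in `categories` defines its rank.
education = pd.Categorical(
    ["High school graduate", "Graduate degree", "Some college"],
    categories=[
        "Elementary school education",
        "High school graduate",
        "Some college",
        "College graduate",
        "Graduate degree",
    ],
    ordered=True,
)

# Lowest and highest levels are well defined for the ordinal variable.
print(education.min())
print(education.max())

# Asking for the "lowest" blood type is rejected with a TypeError.
try:
    blood_type.min()
except TypeError as err:
    print("Not defined for nominal data:", err)
```

Marking a variable as ordered only makes sense when the ranking is real; for nominal data such as blood type, leaving it unordered lets pandas reject meaningless comparisons.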

But in this article we are focusing on pure categorical or nominal variables, so let's check out what we can do with some categorical data.

Frequency distribution

Once we have a dataset with some categorical variables, the most common first step is to count the occurrences of each category across the whole dataset. This gives us a frequency distribution.
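Before turning to real data, the idea can be shown on a tiny, hypothetical sample using only the Python standard library. The blood types below are invented for illustration.

```python
from collections import Counter

# Hypothetical sample: the blood type of eight people.
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O"]

# Counter maps each category to the number of times it occurs.
frequency = Counter(blood_types)

# most_common() lists the categories from most to least frequent.
for category, count in frequency.most_common():
    print(f"{category}: {count}")
```

The resulting mapping from category to count is exactly a frequency distribution; pandas offers the same idea for whole columns.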

Let's take a look at some real data to demonstrate a frequency distribution. We will use the Kaggle Google Play Store Apps dataset from Lavanya Gupta. This dataset has more than 10,000 rows, each representing an app from the Google Play Store, and its features (columns) include the App name, Category, Rating, and others.

We will use pandas to handle the data. First, we import pandas and read the CSV file downloaded from Kaggle, loading only the Category column. Then, we use the unique method to show all values observed in our data. As we can see, the column holds 34 distinct values, such as Finance, Sports, and Weather, and there is no order between them (the Events category is not better or higher than the Shopping category, for instance). Note that one of the 34 values, '1.9', is clearly not an app category and most likely comes from a malformed row, which would leave 33 genuine categories.

import pandas as pd

df = pd.read_csv("./data/googleplaystore.csv", usecols=['Category'])
categories = df['Category'].unique()

print(f"{len(categories)} categories:")
print(categories)

34 categories:
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']
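The last value in the output, '1.9', does not look like a category label. One hedged way to locate such entries is to check each value against the pattern the genuine labels appear to follow. The snippet below uses a small stand-in DataFrame rather than the real file, and the pattern (upper-case letters and underscores) is an assumption drawn from the labels printed above.

```python
import pandas as pd

# Toy stand-in for the real column; "1.9" mimics the stray value above.
df = pd.DataFrame({"Category": ["GAME", "FAMILY", "1.9", "TOOLS", "GAME"]})

# Assumption: genuine labels consist only of upper-case letters and underscores.
looks_valid = df["Category"].str.fullmatch(r"[A-Z_]+")

# Rows that fail the check are worth inspecting by hand before dropping them.
suspicious = df[~looks_valid]
print(suspicious)

# Keep only the rows whose category passes the check.
clean = df[looks_valid]
print(clean["Category"].nunique(), "distinct categories after filtering")
```

Whether to drop or repair such a row depends on what the rest of the row contains, so looking at it first is usually safer than filtering blindly.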

Now that we know all the values the variable can take, let's count how many times each category occurs in our data using the value_counts method.

frequency = df['Category'].value_counts()

# frequency is a pandas Series, so we'll turn it into a DataFrame just for presentation purposes
frequency_dist = pd.DataFrame(frequency)
frequency_dist.columns = ['Frequency']
frequency_dist.index.name = 'Category'

# Using head(10) to show only the first 10 lines
frequency_dist.head(10)

	Frequency
Category
FAMILY	1972
GAME	1144
TOOLS	843
MEDICAL	463
BUSINESS	460
PRODUCTIVITY	424
PERSONALIZATION	392
COMMUNICATION	387
SPORTS	384
LIFESTYLE	382

We can see above that Family is the most common category, with 1,972 occurrences. Game and Tools are also common categories. On the other hand, there are few apps in the Beauty category.

Relative Frequency

We now know how many apps there are in each category. But what if we want to know what percentage of all apps are Medical apps? For that, we need the relative frequency of the category, which is its frequency divided by the total number of apps (that is, the sample size n).

Relative frequency of something = Frequency of something / n
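As a worked instance of this formula, take the Medical category, whose frequency of 463 comes from the table above. The total n = 10,841 used here is an assumption: the article does not state it directly, and it is inferred from the percentages reported in this article.

```python
# Frequency of Medical apps, taken from the frequency table above.
medical_frequency = 463

# Assumption: total number of rows, inferred from the article's percentages.
n = 10_841

# Relative frequency = frequency / n, a value between 0 and 1.
relative_frequency = medical_frequency / n

print(f"Proportion: {relative_frequency:.4f}")
print(f"Percentage: {relative_frequency * 100:.2f}%")
```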

Again, we will use pandas. A relative frequency always lies between 0 and 1, but here we multiply it by 100 to show the values as percentages. As you can see below, Medical apps represent approximately 4.27% of all apps in the Google Play Store according to our dataset.

frequency_dist['Relative Frequency (%)'] = frequency_dist['Frequency'] / frequency_dist['Frequency'].sum() * 100

# Using head(10) to show only the first 10 lines
frequency_dist.head(10)

	Frequency	Relative Frequency (%)
Category
FAMILY	1972	18.190204
GAME	1144	10.552532
TOOLS	843	7.776035
MEDICAL	463	4.270824
BUSINESS	460	4.243151
PRODUCTIVITY	424	3.911078
PERSONALIZATION	392	3.615903
COMMUNICATION	387	3.569781
SPORTS	384	3.542109
LIFESTYLE	382	3.523660
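As an alternative sketch, pandas can compute relative frequencies directly: value_counts accepts a normalize=True argument that returns proportions instead of counts. The example below uses a small invented Series rather than the real dataset.

```python
import pandas as pd

# Toy data standing in for the real Category column.
categories = pd.Series(
    ["GAME", "FAMILY", "FAMILY", "TOOLS", "FAMILY"], name="Category"
)

# normalize=True returns proportions (0 to 1) instead of raw counts.
relative = categories.value_counts(normalize=True)

# Multiply by 100 and round to present the values as percentages.
percent = (relative * 100).round(2)
print(percent)
```

Both approaches give the same numbers; dividing by the total by hand simply makes the formula more visible.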

Frequency Bar Chart

Finally, we will plot the frequency variable as a bar chart, which is a very common way to visualize categorical data.

import plotly.express as px

fig = px.bar(frequency)
fig.update_layout(title='Frequency Distribution of Google Play Store app categories',
                  xaxis_title='Category',
                  yaxis_title='Frequency')
fig.show()

In this article we have seen a bit about categorical (or nominal) variables, a very common data type in Statistics, Data Analysis, Machine Learning, and so on. This was just an introduction, and we may cover the topic in more depth in upcoming posts.

References

Wikipedia | Categorical variable
UCLA | What is the difference between categorical, ordinal and numerical variables?
Brandon Foltz | Statistics 101: Describing a Categorical Variable
web.ma.utexas.edu | Ordinal Variables