Dipti M

Posted on Aug 21

Handling Categorical Data in R: A Practical Guide

Categorical data is data that takes on a limited, fixed set of values. For example, instead of recording a person’s exact age as a number, we might categorize them as “Child”, “Adult”, or “Senior”. These categories are easier to interpret but come with their own structure and challenges.

Before diving into handling categorical data, it’s important to distinguish between its two main types:

Ordinal Data – Categories that have a natural order (e.g., small < medium < large).

Nominal Data – Categories without any inherent order (e.g., dumbbell, grippers, gloves).
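In R, this distinction maps onto ordered and unordered factors. Here is a minimal sketch with base R's factor(); the vectors `size` and `gear` are made-up illustrations, not data used elsewhere in this guide:

```r
# Hypothetical example vectors (not from the iris dataset)
size <- c("small", "large", "medium", "small")
gear <- c("gloves", "dumbbell", "grippers", "gloves")

# Ordinal: declare the level order explicitly and mark the factor as ordered
size_f <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

# Nominal: a plain factor; levels are sorted alphabetically but carry no order
gear_f <- factor(gear)

is.ordered(size_f)   # TRUE
is.ordered(gear_f)   # FALSE
size_f < "large"     # comparisons are only meaningful for ordered factors
```

Declaring the order up front matters later, because summaries and plots list the levels in the order the factor stores them.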

Both forms are common in analytics and machine learning, especially in classification problems, where the output itself is categorical (e.g., churn vs. not churn, profitable vs. not profitable).

In this guide, we’ll walk through how to transform, summarize, and analyze categorical data in R using functions from base R and popular packages.

Converting Numerical Data into Categories

Often, numerical data is converted into categories for easier interpretation. For example, instead of using raw Sepal.Length values from the iris dataset, we can group them into bins.

Using cut() and split()

# Load iris dataset
x <- iris

# Split Sepal.Length into 3 ranges
list1 <- split(x, cut(x$Sepal.Length, 3))
summary(list1)

Here, cut() divides the range of Sepal.Length into 3 equal-width intervals.
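To see where those intervals fall, you can inspect the factor levels that cut() returns. This is a short sketch using only base R; the names passed to `labels` are arbitrary choices for illustration, not part of the original example:

```r
# Bin Sepal.Length into 3 equal-width intervals
bins <- cut(iris$Sepal.Length, breaks = 3)

# The levels show the interval boundaries chosen by cut()
levels(bins)

# Optional: supply readable labels (these names are arbitrary)
bins_named <- cut(iris$Sepal.Length, breaks = 3,
                  labels = c("short", "medium", "long"))
table(bins_named)
```

Named labels are often easier to read in reports than the default interval notation.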

Using cut2() from Hmisc

library(Hmisc)

# Split into 3 groups with roughly equal counts
list2 <- split(x, cut2(x$Sepal.Length, g = 3))
summary(list2)

Difference:

cut() → equal-width ranges, so group sizes can be very uneven

cut2() → roughly equal number of values per group (tied values can keep the split from being exact)
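The difference shows up when you compare group sizes. The sketch below uses base R only: quantile-based breaks passed to cut() stand in for cut2(), so the boundaries may differ slightly from what Hmisc produces.

```r
# Equal-width bins: group sizes follow the shape of the distribution
equal_width <- cut(iris$Sepal.Length, breaks = 3)

# Quantile-based bins: a base R approximation of cut2(g = 3), not its exact algorithm
q_breaks <- quantile(iris$Sepal.Length, probs = seq(0, 1, length.out = 4))
equal_count <- cut(iris$Sepal.Length, breaks = q_breaks, include.lowest = TRUE)

table(equal_width)   # uneven counts
table(equal_count)   # more balanced counts; ties prevent an exact three-way split
```

Equal-width bins keep the scale easy to interpret, while quantile-style bins make sure no group ends up nearly empty.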

Adding Categories as New Columns

Instead of creating lists, we can add categories directly to the dataset:

x$class <- cut(x$Sepal.Length, 3)
x$class2 <- cut2(x$Sepal.Length, g = 3)

If you prefer numeric labels:

x$class <- as.numeric(x$class)

Now class takes values 1, 2, or 3.
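This conversion works because as.numeric() returns a factor's internal level codes. Below is a sketch of an equivalent route that skips the factor entirely, using the `labels = FALSE` argument of cut() from base R (the variable names are illustrative):

```r
# Ask cut() for integer bin codes directly instead of a factor
codes_direct <- cut(iris$Sepal.Length, breaks = 3, labels = FALSE)

# Same result as converting the factor afterwards
codes_via_factor <- as.numeric(cut(iris$Sepal.Length, breaks = 3))

identical(as.numeric(codes_direct), codes_via_factor)  # TRUE
sort(unique(codes_direct))                             # 1 2 3
```

Either way, the interval boundaries are no longer stored in the column, so keep a factor version if you need to report the ranges.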

Counting Category Sizes

Using table()

class_length <- table(x$class)
class_length

Output:

 1  2  3
59 71 20

To convert the table into a data frame:

class_length_df <- as.data.frame(class_length)
names(class_length_df)[1] <- "group"
class_length_df
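The renaming step can also be folded into the conversion itself. This is a sketch in base R, assuming the same three-bin split of Sepal.Length; `demo` is a throwaway copy of iris so the example stands alone. Naming the argument to table() sets the first column name, and `responseName` sets the name of the count column.

```r
# Throwaway copy with the numeric bin column recreated
demo <- iris
demo$class <- as.numeric(cut(demo$Sepal.Length, 3))

# Name the dimension inside table(), then name the count column on conversion
demo_counts <- as.data.frame(table(group = demo$class), responseName = "freq")
demo_counts
names(demo_counts)   # "group" "freq"
```

Note that the `group` column comes back as a factor, which is the default behavior when a table is converted to a data frame.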

Using count() from plyr (Cleaner)

library(plyr)

class_length2 <- count(x, "class")
class_length2

Output:

  class freq
1     1   59
2     2   71
3     3   20

✅ Advantage: count() directly returns a clean data frame, skipping the renaming hassle.

Comparing table() vs. count()

table() – quick summary, but it lists every possible combination of levels, including those with a count of 0.

count() – skips 0-count combinations, giving a cleaner output.

Example with two variables (class and class2):

# table()
two_way <- as.data.frame(table(x$class, x$class2))

# plyr::count()
two_way_count <- count(x, c("class", "class2"))

👉 count() omits zero-frequency rows, making the output easier to interpret.
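To see the difference without installing plyr, you can drop the zero rows from the table() result yourself. This base R sketch mimics what count() returns; it is not plyr's implementation, `demo2` is a throwaway copy of iris, and quantile-based breaks stand in for cut2().

```r
demo2 <- iris
demo2$class <- cut(demo2$Sepal.Length, 3, labels = FALSE)

# Stand-in for cut2(g = 3): quantile-based breaks (an assumption for this sketch)
q <- quantile(demo2$Sepal.Length, probs = seq(0, 1, length.out = 4))
demo2$class2 <- cut(demo2$Sepal.Length, breaks = q, include.lowest = TRUE)

# table() enumerates every class/class2 pair, including empty ones
two_way_all <- as.data.frame(table(class = demo2$class, class2 = demo2$class2))
nrow(two_way_all)                  # 9 rows: 3 x 3 combinations

# Keep only the combinations that actually occur
two_way_nonzero <- subset(two_way_all, Freq > 0)
nrow(two_way_nonzero)
```

Because both groupings are derived from the same variable, several combinations cannot occur at all, which is exactly where the zero rows come from.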

Cross-Tabulation

If you prefer cross-tabulated outputs:

cross_tab <- xtabs(~ class + class2, x)
cross_tab

The output is an xtabs object, which is also a table. The same formula interface extends to larger N-way tables:

threeway_cross_tab <- xtabs(~ class + class2 + Species, x)
threeway_cross_tab

Downside: readability decreases as dimensions grow.
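One way to keep a multi-way table readable while staying in table form is to flatten it. The sketch below uses ftable() from base R's stats package, which this guide does not otherwise cover; `demo3` is a throwaway copy of iris and the quantile-based breaks are a stand-in for cut2().

```r
demo3 <- iris
demo3$class <- cut(demo3$Sepal.Length, 3, labels = FALSE)

# Stand-in for cut2(g = 3): quantile-based breaks (an assumption for this sketch)
q3 <- quantile(demo3$Sepal.Length, probs = seq(0, 1, length.out = 4))
demo3$class2 <- cut(demo3$Sepal.Length, breaks = q3, include.lowest = TRUE)

# Three-way table, flattened so that Species runs across the columns
three_way <- xtabs(~ class + class2 + Species, data = demo3)
flat <- ftable(three_way, row.vars = c("class", "class2"))
flat
```

The flattened layout prints as a single two-dimensional grid instead of one sub-table per Species.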

Cleaner Alternative with count()

threeway_cross_tab_df <- count(x, c("class", "class2", "Species"))
threeway_cross_tab_df

This produces a data frame with one row per observed combination and only non-zero counts, which is much easier to filter, sort, or join than a multi-dimensional table.

Key Takeaways

Use cut() for equal-width bins, cut2() for groups of roughly equal size.

Add categories as new columns instead of splitting into lists.

table() is good for quick summaries, but its output needs renaming and includes zero-count combinations.

count() from plyr returns a ready-to-use data frame and skips zero-count combinations, which makes it a cleaner choice for categorical summaries.

For multi-way frequency tables, count() gives more concise results than xtabs() or table().
This post was first published by Perceptive Analytics.