Categorical data is data that takes on a limited, fixed set of values. For example, instead of recording a person’s exact age as a number, we might categorize them as “Child”, “Adult”, or “Senior”. These categories are easier to interpret but come with their own structure and challenges.
Before diving into handling categorical data, it’s important to distinguish between its two main types:
Ordinal Data – Categories that have a natural order (e.g., small < medium < large).
Nominal Data – Categories without any inherent order (e.g., dumbbell, grippers, gloves).
Both forms are common in analytics and machine learning, especially in classification problems, where the output itself is categorical (e.g., churn vs. not churn, profitable vs. not profitable).
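In R, both types are usually stored as factors. The following is a minimal sketch using the example categories above; the vectors `equipment` and `size` are made up for illustration.

```r
# Nominal: levels carry no order
equipment <- factor(c("gloves", "dumbbell", "grippers", "gloves"))

# Ordinal: list the levels from lowest to highest and set ordered = TRUE
size <- factor(c("medium", "small", "large", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

is.ordered(equipment)  # FALSE
is.ordered(size)       # TRUE

# Comparisons and max() are only meaningful for ordered factors
size < "large"
max(size)
```

Declaring the order explicitly matters because, without `levels`, R sorts the levels alphabetically, which would place "large" before "small".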
In this guide, we’ll walk through how to transform, summarize, and analyze categorical data in R using functions from base R and popular packages.
Converting Numerical Data into Categories
Often, numerical data is converted into categories for easier interpretation. For example, instead of using raw Sepal.Length values from the iris dataset, we can group them into bins.
Using cut() and split()
# Load the iris dataset
x <- iris
# Split Sepal.Length into 3 ranges
list1 <- split(x, cut(x$Sepal.Length, 3))
summary(list1)
Here, cut() divides the range of Sepal.Length into 3 equal-width intervals, and split() returns a list with one data frame per interval.
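To see which intervals cut() chose, or to replace the interval labels with readable names, something like the following should work. The label names "short", "medium", and "long" are hypothetical and not part of the original example.

```r
sl <- iris$Sepal.Length

# Default labels are the interval bounds chosen by cut()
bins <- cut(sl, breaks = 3)
levels(bins)

# Same three intervals, with descriptive (hypothetical) labels
bins_named <- cut(sl, breaks = 3, labels = c("short", "medium", "long"))
table(bins_named)
```

Because the labels are applied in interval order, the first label always corresponds to the lowest interval.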
Using cut2() from Hmisc
library(Hmisc)
# Split into 3 groups with roughly equal counts
list2 <- split(x, cut2(x$Sepal.Length, g = 3))
summary(list2)
Difference:
cut() → equal-width ranges (group sizes can differ a lot)
cut2() → roughly equal number of values per group (range widths can differ)
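The difference can be illustrated without Hmisc by passing quantile-based breaks to cut(). This is a base R approximation of the equal-count idea, not cut2()'s exact algorithm, so the group boundaries may differ slightly from what cut2() produces.

```r
sl <- iris$Sepal.Length

# Equal-width bins: same interval width, uneven counts
equal_width <- cut(sl, breaks = 3)

# Quantile bins: breaks at the 0%, 33%, 67% and 100% quantiles
q_breaks <- quantile(sl, probs = seq(0, 1, length.out = 4))
equal_count <- cut(sl, breaks = q_breaks, include.lowest = TRUE)

table(equal_width)
table(equal_count)
```

`include.lowest = TRUE` is needed here so that the minimum value falls inside the first interval instead of becoming NA.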
Adding Categories as New Columns
Instead of creating lists, we can add categories directly to the dataset:
x$class <- cut(x$Sepal.Length, 3)
x$class2 <- cut2(x$Sepal.Length, g = 3)
If you prefer numeric labels:
x$class <- as.numeric(x$class)
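Now class takes the values 1, 2, or 3, which are the integer codes of the three intervals; the interval labels themselves are no longer stored.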
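A short sketch of two related points: cut() can return the integer codes directly via `labels = FALSE`, and as.numeric() on a factor returns level codes, not the values printed in the labels. The factor `f` below is a made-up example.

```r
sl <- iris$Sepal.Length

# Two routes to the same integer bin codes
codes_a <- as.numeric(cut(sl, 3))
codes_b <- cut(sl, 3, labels = FALSE)
all(codes_a == codes_b)

# Caveat: as.numeric() on a factor gives level codes, not label values
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1
as.numeric(as.character(f))  # 10 20 10
```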
Counting Category Sizes
Using table()
class_length <- table(x$class)
class_length
Output:
1 2 3
59 71 20
To convert the result into a data frame:
class_length_df <- as.data.frame(class_length)
names(class_length_df)[1] <- "group"
class_length_df
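The renaming step can also be avoided in base R by naming the table dimension when it is created and setting `responseName` in as.data.frame(). This is one possible variant, and the column names "group" and "n" are arbitrary choices.

```r
bin_codes <- cut(iris$Sepal.Length, 3, labels = FALSE)

# Name the dimension in table() and the count column in as.data.frame()
counts_df <- as.data.frame(table(group = bin_codes), responseName = "n")
counts_df
str(counts_df)
```

Note that the `group` column comes back as a factor, so it may need converting if numeric codes are required downstream.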
Using count() from plyr (Cleaner)
library(plyr)
class_length2 <- count(x, "class")
class_length2
Output:
class freq
1 1 59
2 2 71
3 3 20
✅ Advantage: count() returns a data frame directly, so no conversion or renaming is needed.
Comparing table() vs. count()
table() – quick summary, but includes all possible combinations (even with 0 counts).
count() – skips 0-count combinations, giving a cleaner output.
Example with two variables (class and class2):
# table()
two_way <- as.data.frame(table(x$class, x$class2))
# plyr::count()
two_way_count <- count(x, c("class", "class2"))
👉 count() omits zero-frequency rows, making the output easier to interpret.
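The zero-count behaviour of table() is easy to see with the equal-width bins and Species. The sketch below uses base R only and mimics the count() result by filtering out the zero rows; it is not how plyr implements count().

```r
bins <- cut(iris$Sepal.Length, 3)

# table() crosses every bin with every species: 3 x 3 = 9 rows
full <- as.data.frame(table(bin = bins, Species = iris$Species))
nrow(full)

# Combinations that never occur still appear, with Freq == 0
full[full$Freq == 0, ]

# Dropping them gives a result shaped like the count() output
nonzero <- full[full$Freq > 0, ]
nonzero
```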
Cross-Tabulation
If you prefer cross-tabulated outputs:
cross_tab <- xtabs(~ class + class2, x)
cross_tab
The output is an object of class xtabs, which is also a table. The same formula interface extends to N-way tables:
threeway_cross_tab <- xtabs(~ class + class2 + Species, x)
threeway_cross_tab
Downside: readability decreases as dimensions grow.
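One base R option for keeping an N-way table readable is ftable(), which flattens it into a single two-dimensional layout. In this sketch the `class2` column is built from quantile breaks as a base R stand-in for the cut2() column, so its labels will not match the Hmisc output exactly.

```r
x <- iris
x$class <- cut(x$Sepal.Length, 3, labels = FALSE)

# Stand-in for cut2(): quantile-based breaks (an assumption for this sketch)
q <- quantile(x$Sepal.Length, probs = seq(0, 1, length.out = 4))
x$class2 <- cut(x$Sepal.Length, breaks = q, include.lowest = TRUE)

three_way <- xtabs(~ class + class2 + Species, data = x)

# Put class and class2 in the rows and Species in the columns
flat <- ftable(three_way, row.vars = c("class", "class2"))
flat
```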
Cleaner Alternative with count()
threeway_cross_tab_df <- count(x, c("class", "class2", "Species"))
threeway_cross_tab_df
This produces a data frame containing only the combinations that actually occur, which is much easier to filter, sort, and join than a multi-dimensional table.
Key Takeaways
Use cut() for equal-width bins and cut2() for groups of roughly equal size.
Add categories as new columns instead of splitting into lists.
table() is good for quick summaries, but requires cleanup.
count() from plyr returns a data frame directly and, for multi-way summaries, can be faster because it only counts combinations that occur in the data.
For multi-way frequency tables, count() provides concise results compared to xtabs() or table().
This post was first published by Perceptive Analytics.