Dipti Moryani

Posted on Jan 6

Working with Categorical Data in R

#webdev #ai #programming #javascript

Categorical data represents variables that take on a finite set of predefined values rather than continuous numeric ranges. For example, representing age as Child, Adult, or Senior instead of a numeric value is a classic use of categorical data.
Before working with categorical variables, it’s important to understand their types, how to create them, and how to analyze their distributions effectively in R.

Types of Categorical Data
Categorical variables fall into two broad categories:

Ordinal Data
Ordinal variables have a natural ordering.
Examples:
Small < Medium < Large
Low < Medium < High
Poor < Average < Good
The order carries meaning, though the distance between levels does not.

Nominal Data
Nominal variables have no inherent order.
Examples:
Product type: accessory, regular, premium
Sports equipment: dumbbell, gloves, grippers
Customer segment names
Any ordering would be arbitrary and analytically meaningless.
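In R, both kinds of categorical data are stored as factors. The sketch below is a minimal illustration with made-up values (the vectors product and size are hypothetical, not part of any dataset used later); it shows how an ordinal variable is declared with an explicit level order, while a nominal one is left unordered.

```r
# Nominal: levels have no order; R simply sorts the labels alphabetically.
product <- factor(c("regular", "premium", "accessory", "regular"))

# Ordinal: declare the order explicitly with levels = and ordered = TRUE.
size <- factor(c("Medium", "Small", "Large", "Small"),
               levels = c("Small", "Medium", "Large"),
               ordered = TRUE)

levels(product)   # alphabetical order, carries no meaning
levels(size)      # Small, Medium, Large, as declared
size < "Large"    # comparisons are meaningful only for ordered factors
```

Declaring the order up front means sorting, plotting, and comparisons follow Small < Medium < Large instead of alphabetical order.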

Why Convert Numerical Data to Categorical?
Analysts often transform numeric variables into categories to:
Simplify interpretation
Enable classification modeling
Improve business communication
Similarly, many business problems naturally produce categorical outcomes:
Will a customer churn? (Yes / No)
Will a user convert? (Buy / Not buy)
Is a transaction fraudulent? (Fraud / Legit)
Problems like these, where the outcome is a category rather than a number, are classification problems.

Creating Categorical Variables in R
R provides multiple ways to convert numeric variables into categories. We’ll use the built-in iris dataset for illustration.
# Work on a copy so the built-in iris data stays unchanged
x <- iris

The dataset contains 150 observations and five columns:
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species (a factor with three levels)

Using cut() to Create Categories
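Given a number of breaks, the cut() function divides the range of a numeric vector into that many equal-width intervals.
# Split the rows of x into a list of three data frames, one per interval
list1 <- split(x, cut(x$Sepal.Length, 3))
# Number of rows that fall into each interval
sapply(list1, nrow)

This splits the rows of x into three groups, one for each equal-width range of Sepal.Length.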
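To see which intervals cut() actually chose, it can help to look at the factor it returns before splitting anything. The following is a small sketch using only base R; the labels short, medium, and long are illustrative names chosen here, not something the original analysis defines.

```r
# Let cut() pick three equal-width bins and inspect the interval labels.
bins <- cut(iris$Sepal.Length, breaks = 3)
levels(bins)
table(bins)

# Same bins with readable labels (the names are illustrative only).
# ordered_result = TRUE marks the result as an ordinal factor.
size_class <- cut(iris$Sepal.Length, breaks = 3,
                  labels = c("short", "medium", "long"),
                  ordered_result = TRUE)
table(size_class)
```

Passing a vector of cut points to breaks instead of a single number gives full control over where the intervals start and end.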

Using cut2() from Hmisc
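The cut2() function from the Hmisc package, called with g = 3, creates three quantile groups with approximately equal numbers of observations.
library(Hmisc)  # install.packages("Hmisc") if it is not yet installed
list2 <- split(x, cut2(x$Sepal.Length, g = 3))
# Number of rows in each quantile group
sapply(list2, nrow)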

Key Difference
cut() → equal range widths
cut2() → balanced group sizes
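The contrast between the two strategies can be reproduced without Hmisc. The sketch below uses quantile() to build the break points, which is a base R stand-in for the idea behind cut2(g = 3) and is not guaranteed to produce exactly the same groups as cut2().

```r
sl <- iris$Sepal.Length

# Equal-width bins: each interval spans the same range of values.
equal_width <- cut(sl, breaks = 3)

# Quantile bins: breaks at the 0, 1/3, 2/3 and 1 quantiles,
# so each bin holds roughly the same number of rows.
q_breaks <- quantile(sl, probs = seq(0, 1, length.out = 4))
equal_count <- cut(sl, breaks = q_breaks, include.lowest = TRUE)

table(equal_width)   # counts vary a lot between bins
table(equal_count)   # counts are much closer to each other
```

include.lowest = TRUE keeps the minimum value in the first bin; without it the smallest observation would become NA.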

Adding Categorical Variables to the Dataset
Instead of creating lists, it’s often better to add the category labels directly to the dataset.
x$class <- cut(x$Sepal.Length, 3)
x$class2 <- cut2(x$Sepal.Length, g = 3)

If numeric class labels are preferred:
x$class <- as.numeric(x$class)

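Now the classes are coded as 1, 2, or 3, following the order of the factor levels. Keep in mind that most modeling functions treat such codes as a continuous number; if the model should treat the classes as categories, keep the factor instead.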
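Converting a factor with as.numeric() returns the position of each level, not any value encoded in the label. The sketch below shows one way to keep a lookup from code to interval, plus the common pitfall with number-like labels; the objects lookup and f are hypothetical names used only for this illustration.

```r
bins <- cut(iris$Sepal.Length, breaks = 3)

# as.numeric() on a factor returns the position of each level.
codes <- as.numeric(bins)

# Keep a lookup so the codes can be traced back to their intervals.
lookup <- data.frame(code = seq_along(levels(bins)),
                     interval = levels(bins))
lookup

# A factor whose labels look like numbers needs as.character() first.
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # level codes: 1 2 1
as.numeric(as.character(f))   # original values: 10 20 10
```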

Counting Observations in Each Category
Using table()
class_length <- table(x$class)
class_length

Output:
 1  2  3
59 71 20

This gives a quick summary, but the result is a table, not a data frame.

Converting Table Output to a Data Frame
class_length_df <- as.data.frame(class_length)
names(class_length_df)[1] <- "class"

This works, but as.data.frame() names the columns Var1 and Freq by default, and renaming them by position is easy to get wrong once a table has several dimensions.
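One way to avoid positional renaming in base R is to name things when the table is created. This is a sketch, not the approach the article goes on to recommend: naming the argument to table() names the dimension, and the responseName argument of as.data.frame() sets the count column. The column name freq is chosen here for illustration.

```r
bins <- as.numeric(cut(iris$Sepal.Length, breaks = 3))

# The named argument becomes the column name; responseName replaces "Freq".
class_counts <- as.data.frame(table(class = bins), responseName = "freq")
class_counts
names(class_counts)
```

Note that the class column comes back as a factor, because table() stores the values as labels.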

A Better Approach: count() from plyr
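The plyr package provides a cleaner solution. Note that plyr is retired in favor of dplyr, whose count() works along the same lines but takes unquoted column names and calls the count column n. The examples here use the plyr version; if both packages are loaded, call plyr::count() explicitly.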
library(plyr)
class_length2 <- count(x, "class")
class_length2

Output:
  class freq
1     1   59
2     2   71
3     3   20

Advantages of count():
Returns a data frame directly
Retains column names
Requires fewer steps
Easier to join back to original data
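The last point, joining counts back to the original rows, could look like the sketch below. It builds the counts with base R (table() and merge()) rather than plyr so that it runs without extra packages; the column name freq is chosen to mirror the count() output, and x_with_freq is a hypothetical name.

```r
x <- iris
x$class <- as.numeric(cut(x$Sepal.Length, breaks = 3))

# Counts per class as a data frame with explicit column names.
class_counts <- as.data.frame(table(class = x$class), responseName = "freq")
# table() stores the class values as factor labels; restore the numeric type.
class_counts$class <- as.numeric(as.character(class_counts$class))

# Attach each row's class frequency to the original data.
x_with_freq <- merge(x, class_counts, by = "class")
head(x_with_freq)
```

merge() sorts the result by the key column, so row order differs from the original data frame.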

Two-Way Frequency Tables
Using table()
two_way <- as.data.frame(table(x$class, x$class2))

This includes zero-frequency combinations, which can clutter results.

Using count()
two_way_count <- count(x, c("class", "class2"))

Key difference:
table() → all possible combinations
count() → only observed combinations (cleaner output)
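The difference is easy to verify in base R by filtering the table() output. In this sketch, class2 is built from quantile breaks as a stand-in for cut2(g = 3), so the groups are an assumption and may not match Hmisc exactly.

```r
x <- iris
x$class  <- cut(x$Sepal.Length, breaks = 3)
x$class2 <- cut(x$Sepal.Length,
                breaks = quantile(x$Sepal.Length,
                                  probs = seq(0, 1, length.out = 4)),
                include.lowest = TRUE)  # base R stand-in for cut2(g = 3)

# table() enumerates every pairing of levels, observed or not.
all_combos <- as.data.frame(table(class = x$class, class2 = x$class2))

# Dropping zero counts leaves only the pairings that occur in the data.
observed <- subset(all_combos, Freq > 0)

nrow(all_combos)  # 9 rows: every pairing of 3 x 3 levels
nrow(observed)    # fewer rows: only observed pairings
```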

N-Way Frequency Tables
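Called without variable names, count() tallies every distinct row of the data frame, so the result can never have more rows than the data itself.
full_counts <- count(x)

Using table(x) on a full data frame can be slow and memory-intensive because it allocates a cell for every possible combination of values across all columns, and most of those cells are empty.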
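A rough way to gauge the gap before running anything expensive is to compare the number of cells a full table would need with the number of distinct rows that actually occur. The sketch below does this for iris in base R; the variable names are hypothetical.

```r
# Cells table(iris) would allocate: product of distinct values per column.
distinct_per_col <- sapply(iris, function(col) length(unique(col)))
n_cells <- prod(distinct_per_col)

# Rows a count-style summary needs: one per distinct observed row.
n_observed <- nrow(unique(iris))

distinct_per_col
n_cells
n_observed
```

Even for a 150-row dataset the full table is orders of magnitude larger than the list of observed rows, because continuous columns contribute many distinct values each.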

Cross-Tabulation with xtabs()
If a matrix-style cross-tab is required:
cross_tab <- xtabs(~ class + class2, x)
cross_tab

This is useful for reporting, but the result is an xtabs object rather than a data frame:
class(cross_tab)

[1] "xtabs" "table"

Converting it with as.data.frame() produces the same long format as table(), zero-frequency rows included.

Three-Way Cross-Tabulation
threeway_cross_tab <- xtabs(~ class + class2 + Species, x)
threeway_cross_tab

As dimensions increase, readability decreases.
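When a matrix-style view of three or more dimensions is still wanted, base R's ftable() can flatten the cross-tab into a single two-dimensional layout. The sketch below is self-contained and uses a hypothetical width variable (binned Sepal.Width) in place of class2 so that it runs without Hmisc; the labels are illustrative.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3, labels = c("low", "mid", "high"))
x$width <- cut(x$Sepal.Width,  breaks = 2, labels = c("narrow", "wide"))

three_way <- xtabs(~ class + width + Species, data = x)

# ftable() lays the three dimensions out as one flat table:
# class and width label the rows, Species labels the columns.
flat <- ftable(three_way, row.vars = c("class", "width"))
flat
```

The flat layout still lists zero-frequency combinations, so it helps with readability but not with compactness.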

Clean Alternative for Multi-Way Counts
threeway_cross_tab_df <- count(x, c("class", "class2", "Species"))
threeway_cross_tab_df

This produces:
Compact output
No zero-frequency noise
Easy visualization and filtering
Better scalability for real-world datasets

Key Takeaways
Categorical data can be nominal or ordinal
cut() and cut2() help transform numeric data into categories
table() is useful for quick summaries
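count() (plyr) returns a data frame of observed combinations only, which keeps frequency analysis clean as the data grows
For multi-dimensional categorical analysis, count() output is usually easier to read, filter, and join than table() or xtabs() output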