Categorical data plays a crucial role in data analysis, machine learning, and business intelligence. Instead of working with raw numeric values, analysts often convert data into categories—such as Child, Adult, or Senior—to simplify interpretation and modeling. In classification problems, the output itself is categorical: whether a customer churns, whether a transaction is fraudulent, or whether a product is profitable.
In this article, we’ll explore how to generate frequency tables of categorical variables in R as data frames, starting from base R methods and moving toward modern, industry-standard approaches using the tidyverse and data.table ecosystems. While the examples remain grounded in fundamentals, the tools and style reflect current best practices.
Understanding Categorical Data
Categorical variables fall into two broad types:
Nominal: Categories without an inherent order
Examples: product type, payment method, species
Ordinal: Categories with a meaningful order
Examples: small < medium < large, low < medium < high
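In R, both types are stored as factors, and the difference comes down to whether the levels are declared as ordered. A minimal sketch, using made-up values rather than anything from the datasets below:

```r
# Hypothetical example values, chosen only for illustration
payment <- factor(c("card", "cash", "card", "transfer"))        # nominal
size    <- factor(c("small", "large", "medium", "small"),
                  levels = c("small", "medium", "large"),
                  ordered = TRUE)                               # ordinal

is.ordered(payment)   # FALSE
is.ordered(size)      # TRUE
size < "large"        # comparisons are meaningful only for ordered factors
table(size)           # counts are reported in the declared level order
```

Declaring the level order explicitly matters because R otherwise sorts levels alphabetically, which would put "large" before "medium".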
Many real-world analytics workflows involve converting numeric data into categories—for example, bucketing ages or transaction sizes—before analysis or modeling.
Preparing the Data: A Modern Setup
We’ll use the classic iris dataset, which remains useful for demonstrations. However, instead of using attach() (now discouraged because attached columns can silently mask, or be masked by, other objects on the search path), we’ll refer to columns explicitly through the data frame.
x <- iris
The dataset contains 150 observations with five variables:
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species
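A quick way to confirm this structure, and to see what explicit column access looks like without attach(), might be the following sketch:

```r
x <- iris                     # iris ships with R in the datasets package

dim(x)                        # 150 rows, 5 columns
sapply(x, class)              # four numeric columns and one factor (Species)

mean(x$Sepal.Length)          # explicit access with $
with(x, mean(Sepal.Length))   # with() scopes column names to a single expression
```

with() gives the brevity of attach() for one expression at a time, without leaving anything on the search path afterwards.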
Creating Categorical Variables from Numeric Data
Using cut() (Base R)
When breaks is a single number, cut() divides the range of a numeric variable into that many intervals of equal width.
x$class <- cut(x$Sepal.Length, breaks = 3)
This creates three groups based on sepal length range.
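To see which intervals were created, it can help to inspect the factor levels. The sketch below also shows the labels argument of cut(); the names "short", "medium", and "long" are illustrative and not part of the original example.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

levels(x$class)   # interval labels; "(a,b]" means a < value <= b
table(x$class)    # number of observations in each interval

# Optional: descriptive labels instead of interval notation (illustrative names)
x$class_named <- cut(x$Sepal.Length, breaks = 3,
                     labels = c("short", "medium", "long"))
table(x$class_named)
```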
Using cut2() (Hmisc)
When balanced group sizes are preferred, cut2() from the Hmisc package is useful.
library(Hmisc)
x$class2 <- cut2(x$Sepal.Length, g = 3)
cut() → equal-width intervals
cut2() → approximately equal counts per group
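If installing Hmisc is not an option, a similar split can be approximated in base R by passing quantiles to cut() as break points. This is a stand-in technique, not what cut2() does internally, and the column name class_q is an assumption made for this sketch.

```r
x <- iris

# Break points at the 0%, 33.3%, 66.7% and 100% quantiles of sepal length
q <- quantile(x$Sepal.Length, probs = seq(0, 1, length.out = 4))

# include.lowest = TRUE keeps the minimum value inside the first interval
x$class_q <- cut(x$Sepal.Length, breaks = q, include.lowest = TRUE)

table(x$class_q)  # group sizes are close to equal; tied values prevent an exact split
```

One caveat: cut() stops with an error if two quantiles coincide, which can happen with heavily tied data, so the break points may need to be de-duplicated first.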
Converting Categories to Numeric Labels (Optional)
In some modeling workflows, numeric class labels are easier to handle:
x$class_num <- as.numeric(x$class)
This maps each interval to 1, 2, or 3.
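This works because as.numeric() on a factor returns the internal level codes. That is the desired behavior here, but it is a common pitfall when the labels themselves look like numbers, as this small sketch with made-up values shows:

```r
# Factor whose labels look like numbers (hypothetical values)
f <- factor(c("10", "20", "20", "50"))

as.numeric(f)                 # 1 2 2 3     -> internal level codes
as.numeric(as.character(f))   # 10 20 20 50 -> the numbers the labels spell out
```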
Counting Frequencies: The Classic Approach
Using table()
class_length <- table(x$class_num)
class_length
Output:
1 2 3
59 71 20
This provides a quick summary, but the result is a table, not a data frame.
Converting to a Data Frame
class_length_df <- as.data.frame(class_length)
Result:
Var1 Freq
1 1 59
2 2 71
3 3 20
The drawback here is clear:
Column names are generic (Var1, Freq)
Renaming manually can be error-prone in large workflows
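Base R can avoid the generic names without a separate renaming step. The sketch below names the argument passed to table() and uses the responseName argument of as.data.frame(); the column name "n" is simply a choice made for this example.

```r
x <- iris
x$class_num <- as.numeric(cut(x$Sepal.Length, breaks = 3))

# A named argument to table() names the dimension;
# responseName sets the name of the count column.
class_length_df <- as.data.frame(table(class_num = x$class_num),
                                 responseName = "n")

names(class_length_df)   # "class_num" "n"
class_length_df
```

Note that the class_num column comes back as a factor, since table() treats its input as categorical.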
A Cleaner Solution: Counting as a Data Frame Directly
The plyr::count() Function (Historical Context)
Historically, the plyr package solved this problem neatly:
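# Call count() through plyr:: instead of attaching plyr,
# so its functions cannot clash with dplyr's later on
plyr::count(x, "class_num")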
Output:
class_num freq
1 1 59
2 2 71
3 3 20
This approach was popular for years and still works, but plyr is now retired (it receives only the changes needed to stay on CRAN) and has been superseded by dplyr.
Modern Industry Standard: dplyr::count()
Today, dplyr (part of the tidyverse) is a widely used choice for this task in professional R code.
library(dplyr)
x %>%
  count(class_num)
Output:
class_num n
1 1 59
2 2 71
3 3 20
Why dplyr::count() Is Preferred
Returns a data frame (or a tibble, when the input is a tibble)
Works seamlessly with pipelines
Handles multi-variable counts naturally
Actively maintained and industry-supported
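As a sketch of how count() fits into a longer pipeline, the example below sorts the groups by size, renames the count column, and adds each group's share of the total. The column names freq and share are choices made for this illustration.

```r
library(dplyr)

x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

result <- x %>%
  count(class, sort = TRUE, name = "freq") %>%   # largest group first, custom column name
  mutate(share = freq / sum(freq))               # proportion of all rows

result
```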
Multi-Way Frequency Tables
Two-Way Counts
Using base R:
as.data.frame(table(x$class, x$class2))
This includes zero-frequency combinations, which can clutter results, and the columns are again named generically (Var1, Var2, Freq).
Using dplyr:
x %>%
  count(class, class2)
By default, only combinations that actually occur in the data are returned, which keeps the result compact.
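Either behavior can be obtained with either tool. The sketch below filters out empty cells in base R and, conversely, asks dplyr to keep them with .drop = FALSE. Species stands in for class2 here so that the example does not depend on Hmisc.

```r
library(dplyr)

x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

# Base R: drop the empty cells after the fact
base_counts <- subset(as.data.frame(table(class = x$class, Species = x$Species)),
                      Freq > 0)

# dplyr: keep the empty cells on request (the grouping columns must be factors)
full_counts <- x %>% count(class, Species, .drop = FALSE)

nrow(base_counts)   # observed combinations only
nrow(full_counts)   # every combination of levels: 3 x 3 = 9
```

Keeping the zero rows is sometimes the right call, for instance when a report has to show every category even if nothing fell into it.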
Cross-Tabulated Views with xtabs()
When a matrix-style view is required:
cross_tab <- xtabs(~ class + class2, data = x)
cross_tab
This is useful for reporting, but the output is still a table object.
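An xtabs result can still be flattened when a data frame is needed, and because the formula carries the variable names, the columns come out properly labeled. In this sketch Species replaces class2 so that no extra packages are required.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

# Species is used as the second variable to keep the sketch base-R only
cross_tab <- xtabs(~ class + Species, data = x)

# One row per cell, named after the formula's variables, counts in "Freq"
cross_df <- as.data.frame(cross_tab)
names(cross_df)        # "class" "Species" "Freq"

addmargins(cross_tab)  # matrix view with row and column totals, handy for reports
```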
Three-Way and N-Way Counts
Base R (xtabs())
xtabs(~ class + class2 + Species, data = x)
While powerful, the output becomes increasingly difficult to interpret as dimensions grow.
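Within base R, ftable() offers a more compact printout of the same counts. The sketch below builds class2 from quantile breaks as a stand-in for the Hmisc::cut2() column, which is an assumption made only to keep the example free of extra packages.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

# Stand-in for the cut2() column: quantile-based breaks in base R
x$class2 <- cut(x$Sepal.Length,
                breaks = quantile(x$Sepal.Length, probs = seq(0, 1, length.out = 4)),
                include.lowest = TRUE)

tab3 <- xtabs(~ class + class2 + Species, data = x)
dim(tab3)   # a 3 x 3 x 3 array, printed as one slice per Species

# ftable() lays the same counts out as a single two-dimensional table
flat <- ftable(tab3, row.vars = c("class", "class2"))
flat
```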
Modern Tidy Output
x %>%
  count(class, class2, Species)
Result:
Flat, readable structure
Easy to visualize
Ready for plotting or exporting
This format is especially valuable in dashboards, BI tools, and machine learning pipelines.
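As a sketch of the export step, a flat count table can be written to CSV and read back unchanged in shape. The file is a temporary one created for this example, and Species is used in place of class2 to avoid the Hmisc dependency.

```r
library(dplyr)

x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

counts <- x %>% count(class, Species)

# Write the flat table to a temporary CSV file and read it back
out_file <- tempfile(fileext = ".csv")
write.csv(counts, out_file, row.names = FALSE)

reloaded <- read.csv(out_file)
head(reloaded)
```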
Performance Considerations
In real-world datasets with millions of rows:
table() allocates a cell for every possible combination of levels, including those with zero counts
count() computes only observed combinations
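The gap between the two can be checked directly, even on a small dataset. In the sketch below, width_class is a hypothetical extra variable added only to create a third dimension.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)
x$width_class <- cut(x$Sepal.Width, breaks = 3)   # hypothetical extra variable

tab <- table(x$class, x$width_class, x$Species)

length(tab)    # cells allocated: 3 * 3 * 3 = 27
sum(tab > 0)   # cells that actually occur in the data
```

The number of allocated cells is the product of the level counts, so it grows multiplicatively with each added variable, while the number of observed combinations can never exceed the number of rows.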
For high-performance workflows, many teams now rely on data.table:
library(data.table)
setDT(x)[, .N, by = .(class, class2, Species)]
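Note that setDT() converts x to a data.table in place rather than making a copy, and .N then counts the rows in each group. data.table is designed for speed and memory efficiency on large tables.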
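When the original data frame should stay untouched, as.data.table() makes a copy instead. A minimal sketch, again with Species in place of class2 and with the count column renamed to n purely to match dplyr's output:

```r
library(data.table)

x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

dt <- as.data.table(x)       # copy; x itself stays a plain data frame

# Count rows per group, then sort with the largest groups first
counts <- dt[, .N, by = .(class, Species)][order(-N)]
setnames(counts, "N", "n")   # rename by reference to match dplyr's column name

counts
```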
Key Takeaways
Categorical data is central to analytics and classification problems
Base R functions like table() are useful for quick summaries
Modern workflows favor dplyr::count() for clarity and scalability
Multi-way categorical summaries are easier to manage in flat data frames
Clean frequency tables integrate seamlessly into visualization, modeling, and reporting pipelines
Final Thoughts
While R continues to evolve, the core challenge remains the same: turning raw categorical data into meaningful summaries. The shift from base R and plyr toward tidyverse and data.table reflects broader industry trends—readability, performance, and reproducibility.
Understanding these tools ensures your code remains not only correct, but also future-proof and aligned with modern data science practices.