Anshuman

Posted on Jan 8

Working with Categorical Data in R: Creating Frequency Tables as Data Frames (Modern Approaches)

#ai #javascript #programming #tutorial

Categorical data plays a crucial role in data analysis, machine learning, and business intelligence. Instead of working with raw numeric values, analysts often convert data into categories—such as Child, Adult, or Senior—to simplify interpretation and modeling. In classification problems, the output itself is categorical: whether a customer churns, whether a transaction is fraudulent, or whether a product is profitable.

In this article, we’ll explore how to generate frequency tables of categorical variables in R as data frames, starting from base R methods and moving toward modern, industry-standard approaches using the tidyverse and data.table ecosystems. While the examples remain grounded in fundamentals, the tools and style reflect current best practices.

Understanding Categorical Data

Categorical variables fall into two broad types:

Nominal: Categories without an inherent order
Examples: product type, payment method, species

Ordinal: Categories with a meaningful order
Examples: small < medium < large, low < medium < high

Many real-world analytics workflows involve converting numeric data into categories—for example, bucketing ages or transaction sizes—before analysis or modeling.
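As a minimal sketch of the nominal/ordinal distinction, the snippet below builds one factor of each kind. The variable names and values (payment, size) are made up for illustration and are not part of the iris walkthrough.

```r
# Nominal: levels have no order, so only equality comparisons make sense.
payment <- factor(c("card", "cash", "card", "transfer"))
levels(payment)

# Ordinal: declare the order explicitly so comparisons respect it.
size <- factor(
  c("small", "large", "medium", "small"),
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)
size < "large"   # element-wise comparison against a level
table(size)      # counts appear in the declared order, not alphabetically
```

Declaring the levels up front also keeps frequency tables in a sensible order later on.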

Preparing the Data: A Modern Setup

We’ll use the classic iris dataset, which remains useful for demonstrations. Rather than calling attach(), which is discouraged because attached columns can mask, or be masked by, other objects with the same name, we’ll reference columns explicitly through the data frame.

x <- iris

The dataset contains 150 observations with five variables:

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

Species
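A quick way to confirm that structure, and one possible scoped alternative to attach(), is sketched below using only base R.

```r
x <- iris   # work on a copy; the built-in dataset itself is untouched

dim(x)            # 150 rows, 5 columns
sapply(x, class)  # four numeric measurements plus the Species factor

# with() evaluates an expression inside the data frame without
# adding its columns to the search path, unlike attach().
with(x, tapply(Sepal.Length, Species, mean))
```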

Creating Categorical Variables from Numeric Data

Using cut() (Base R)

When breaks is a single number, the cut() function divides the range of a numeric variable into that many intervals of equal width and returns a factor.

x$class <- cut(x$Sepal.Length, breaks = 3)

This creates three groups based on sepal length range.
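When the automatically generated interval labels are hard to read, explicit cut points and labels can be supplied instead. The sketch below is one way to do that; the column name class_named and the labels short, medium and long are chosen here for illustration.

```r
x <- iris

# Default behaviour with a single number: three equal-width intervals
# spanning the observed range; labels are generated automatically.
x$class <- cut(x$Sepal.Length, breaks = 3)
levels(x$class)

# Explicit cut points with readable labels (names chosen for this sketch).
# Intervals are right-closed by default: (4, 5.5], (5.5, 6.7], (6.7, 8].
x$class_named <- cut(
  x$Sepal.Length,
  breaks = c(4, 5.5, 6.7, 8),
  labels = c("short", "medium", "long")
)
table(x$class_named)
```

Any value outside the supplied breaks becomes NA, so the outer cut points should cover the full range of the data.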

Using cut2() (Hmisc)

When balanced group sizes are preferred, cut2() from the Hmisc package is useful.

library(Hmisc)
x$class2 <- cut2(x$Sepal.Length, g = 3)

cut() → equal-width intervals

cut2() → approximately equal counts per group
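If Hmisc is not available, a similar effect can be approximated in base R by cutting at quantiles. This is a stand-in for illustration, not a reimplementation of cut2(), and the column name class_q is hypothetical.

```r
x <- iris

# Tertile cut points: the minimum, the 1/3 and 2/3 quantiles, and the maximum.
q <- quantile(x$Sepal.Length, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value inside the first interval.
x$class_q <- cut(x$Sepal.Length, breaks = q, include.lowest = TRUE)

table(cut(x$Sepal.Length, breaks = 3))  # equal width: uneven counts
table(x$class_q)                        # quantile-based: more even counts
```

Ties at the cut points mean the groups are only approximately equal in size.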

Converting Categories to Numeric Labels (Optional)

In some modeling workflows, numeric class labels are easier to handle:

x$class_num <- as.numeric(x$class)

This maps each interval to its factor level code (1, 2, or 3, in interval order), not to any value taken from the interval itself.
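Because the conversion returns level codes, it can surprise when the factor labels themselves look like numbers. The small example below, using made-up values, shows the difference.

```r
# as.numeric() on a factor returns the internal level codes (1, 2, 3, ...),
# not the values shown in the labels.
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1  (level codes)
as.numeric(as.character(f))  # 10 20 10  (the labels parsed as numbers)

# For binned data the codes follow the interval order, and the labels
# remain available for reference.
bins <- cut(c(1, 5, 9), breaks = c(0, 3, 6, 10))
data.frame(interval = bins, code = as.integer(bins))
```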

Counting Frequencies: The Classic Approach

Using table()

class_length <- table(x$class_num)
class_length

Output:

 1  2  3
59 71 20

This provides a quick summary, but the result is a table, not a data frame.

Converting to a Data Frame

class_length_df <- as.data.frame(class_length)

Result:

  Var1 Freq
1    1   59
2    2   71
3    3   20

The drawback here is clear:

Column names are generic (Var1, Freq)

Renaming manually can be error-prone in large workflows
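Staying in base R, the generic names can be avoided at creation time rather than fixed afterwards. The sketch below names the table dimension and uses the responseName argument of as.data.frame(); the count column name n is just a choice made here.

```r
x <- iris
x$class_num <- as.integer(cut(x$Sepal.Length, breaks = 3))

# Naming the argument to table() names the dimension, and responseName
# sets the count column, so no renaming step is needed afterwards.
class_length_df <- as.data.frame(
  table(class_num = x$class_num),
  responseName = "n"
)
class_length_df

# The category column comes back as a factor; convert it if integers are needed.
class_length_df$class_num <- as.integer(as.character(class_length_df$class_num))
```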

A Cleaner Solution: Counting as a Data Frame Directly

The plyr::count() Function (Historical Context)

Historically, the plyr package solved this problem neatly:

library(plyr)
count(x, "class_num")

Output:

  class_num freq
1         1   59
2         2   71
3         3   20

This approach was popular for years and still works, but plyr has been retired by its maintainers and superseded by dplyr.

Modern Industry Standard: dplyr::count()

Today, dplyr (part of the tidyverse) is the preferred solution in most professional R environments.

library(dplyr)

x %>%
  count(class_num)

Output:

  class_num  n
1         1 59
2         2 71
3         3 20

Why dplyr::count() Is Preferred

Returns a data frame (or a tibble when given one)

Works seamlessly with pipelines

Handles multi-variable counts naturally

Actively maintained and industry-supported
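A short sketch of two count() arguments that often come up in practice, sort and name, is shown below. It assumes dplyr is installed; the column name freq is chosen here for illustration.

```r
library(dplyr)

x <- iris
x$class_num <- as.integer(cut(x$Sepal.Length, breaks = 3))

# sort = TRUE orders rows by descending count;
# name = "freq" replaces the default count column name n.
x %>%
  count(class_num, sort = TRUE, name = "freq")

# Calling with the namespace prefix avoids ambiguity if plyr is also loaded.
dplyr::count(x, Species, class_num)
```

The namespace prefix matters mainly in sessions where both plyr and dplyr are attached, since both export a function called count().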

Multi-Way Frequency Tables

Two-Way Counts

Using base R:

as.data.frame(table(x$class, x$class2))

This includes zero-frequency combinations, which can clutter results.

Using dplyr:

x %>%
  count(class, class2)

By default, only combinations that actually occur in the data are returned, which keeps the result compact.
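The difference is easy to see on a tiny example. The toy data frame below is invented for illustration and uses base R only; in dplyr, the .drop argument of count() controls whether empty factor combinations are kept.

```r
toy <- data.frame(
  size  = factor(c("S", "S", "M", "L")),
  color = factor(c("red", "blue", "red", "red"))
)

# table() builds the full 3 x 2 grid, so unobserved pairs appear with Freq 0.
all_cells <- as.data.frame(table(size = toy$size, color = toy$color))
all_cells

# Keeping only observed combinations gives the same rows a grouped count would.
observed <- all_cells[all_cells$Freq > 0, ]
observed
```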

Cross-Tabulated Views with xtabs()

When a matrix-style view is required:

cross_tab <- xtabs(~ class + class2, data = x)
cross_tab

This is useful for reporting, but the output is still a table object.
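If a data frame is needed after all, the table object can be converted in either long or wide form. The sketch below crosses class with Species rather than class2 so that it runs without Hmisc.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

cross_tab <- xtabs(~ Species + class, data = x)

# Long format: one row per cell, with the count in Freq.
long_df <- as.data.frame(cross_tab)

# Wide format: keeps the matrix layout but as a regular data frame.
wide_df <- as.data.frame.matrix(cross_tab)
```

The long form matches what count() produces (plus zero cells), while the wide form is closer to what a report or spreadsheet usually expects.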

Three-Way and N-Way Counts

Base R (xtabs())

xtabs(~ class + class2 + Species, data = x)

While powerful, the output becomes increasingly difficult to interpret as dimensions grow.

Modern Tidy Output

x %>%
  count(class, class2, Species)

Result:

Flat, readable structure

Easy to visualize

Ready for plotting or exporting

This format is especially valuable in dashboards, BI tools, and machine learning pipelines.
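To illustrate the exporting point, here is a minimal base R sketch that builds a flat count table and writes it to CSV. The temporary file path and the column name n are assumptions made for the example.

```r
x <- iris
x$class <- cut(x$Sepal.Length, breaks = 3)

# Flat two-way counts without extra packages; zero cells are dropped
# to mimic the shape of a grouped count.
counts <- as.data.frame(table(class = x$class, Species = x$Species),
                        responseName = "n")
counts <- counts[counts$n > 0, ]

# A flat table writes straight to CSV (a temporary file in this sketch).
out_file <- tempfile(fileext = ".csv")
write.csv(counts, out_file, row.names = FALSE)
read.csv(out_file)
```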

Performance Considerations

In real-world datasets with millions of rows:

table() computes all possible combinations, including zero counts

count() computes only observed combinations

For high-performance workflows, many teams now rely on data.table:

library(data.table)
# setDT() converts x to a data.table in place (by reference)
setDT(x)[, .N, by = .(class, class2, Species)]

data.table is designed for speed and low memory use on large tables, and like count() it returns only the combinations that occur in the data.
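When the original data frame should stay a plain data frame, as.data.table() can be used to work on a copy instead. The following is a sketch under that assumption, grouping by class and Species only so that it does not depend on Hmisc.

```r
library(data.table)

# as.data.table() makes a copy, so the original data frame keeps its class;
# setDT() would convert it in place by reference instead.
x  <- iris
dt <- as.data.table(x)
dt[, class := cut(Sepal.Length, breaks = 3)]

# .N is the row count within each group defined by `by`.
counts <- dt[, .N, by = .(class, Species)]
counts[order(-N)]

class(x)  # still "data.frame"
```

The copy costs memory, so on very large tables the in-place conversion may still be the better trade-off.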

Key Takeaways

Categorical data is central to analytics and classification problems

Base R functions like table() are useful for quick summaries

Modern workflows favor dplyr::count() for clarity and scalability

Multi-way categorical summaries are easier to manage in flat data frames

Clean frequency tables integrate seamlessly into visualization, modeling, and reporting pipelines

Final Thoughts

While R continues to evolve, the core challenge remains the same: turning raw categorical data into meaningful summaries. The shift from base R and plyr toward tidyverse and data.table reflects broader industry trends—readability, performance, and reproducibility.

Understanding these tools ensures your code remains not only correct, but also future-proof and aligned with modern data science practices.

