Advanced Imputation with R Packages


Data analysts and data scientists often encounter missing values in datasets, and if not handled properly, these missing values can mislead analysis, distort model performance, and produce inaccurate insights. Handling missing data is therefore a critical step in the data preparation process.
In this article, we’ll explore what missing values are, types of missingness, and techniques to impute missing values in R, including hands-on examples using the popular mice package.

What Are Missing Values?
Imagine you are collecting survey data where participants fill out personal details. For someone who is married, the marital status will be married and they may provide the names of their spouse and children. For unmarried respondents, these fields will naturally be left blank.
This is a genuine example of missing values, but missing data can also occur due to human error (forgetting to enter data), incorrect entries (like a negative age), or system errors during data collection.
Before handling missing data, it’s important to identify which type of missingness you are dealing with.

Types of Missing Values
Missing data is typically classified into three categories:
MCAR (Missing Completely At Random):
Missing values occur randomly with no relationship to any other variable. Example: A survey participant accidentally skips a question. MCAR is rare but easiest to handle because the missingness does not introduce bias.
MAR (Missing At Random):
Missing values depend on other observed variables but not on the missing variable itself. Example: Men are less likely to answer a survey question on mental health. The missingness can be predicted from other observed data, although the MAR assumption itself cannot be verified from the observed data alone. MAR values can often be imputed safely.
NMAR (Not Missing At Random):
Missing values are related to the value itself or hidden factors. Example: Missing spouse names could indicate unmarried participants or deliberate omission. NMAR requires careful handling, as ignoring these values may bias the analysis.
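The three mechanisms can be made concrete with a small simulation. The sketch below uses made-up variables (age and income) and arbitrary missingness probabilities; it is an illustration under those assumptions, not a diagnostic procedure.

```r
set.seed(42)
n <- 1000
age    <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)   # income rises with age (assumed)

# MCAR: every income value has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3

# MAR: the chance of a missing income depends only on age, which is observed
miss_mar <- runif(n) < plogis((age - 40) / 5)

# NMAR: the chance of a missing income depends on income itself
miss_nmar <- runif(n) < plogis((income - 40) / 5)

# Mean of the income values that remain observed under each mechanism
observed_means <- c(
  full = mean(income),
  MCAR = mean(income[!miss_mcar]),
  MAR  = mean(income[!miss_mar]),
  NMAR = mean(income[!miss_nmar])
)
print(round(observed_means, 2))
```

Under MCAR the observed mean should stay close to the full-data mean, while under MAR and NMAR it drifts downward because higher values are more likely to be missing. In the MAR case the drift can be corrected using age; in the NMAR case the information needed for the correction is itself missing.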

Strategies to Handle Missing Values

Dropping Missing Values
If the proportion of missing data is very small (e.g., <5%), you may choose to drop the rows that contain missing values:
clean_data <- na.omit(dataset)

However, dropping too many rows may lead to loss of valuable information, especially when the missing values are spread across many rows.
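Before dropping anything, it helps to measure how much data would be lost. The sketch below uses R's built-in airquality dataset as a stand-in for your own data; the 5% threshold is an assumption taken from the rule of thumb above.

```r
df <- airquality  # built-in dataset with NAs in Ozone and Solar.R

# Share of missing values per column
missing_share <- colMeans(is.na(df))
print(round(missing_share, 3))

# Share of rows that na.omit() would discard
rows_lost <- mean(!complete.cases(df))
print(round(rows_lost, 3))

# Drop rows only if the loss stays under the chosen threshold (assumed: 5%)
threshold <- 0.05
clean_data <- if (rows_lost < threshold) na.omit(df) else df
```

Here the row loss exceeds the threshold, so the data is left untouched and imputation becomes the more sensible option.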

Imputing Missing Values
Imputation involves filling missing values with plausible substitutes, preserving the structure and distribution of the data as far as possible.
Numeric Data: Use mean, median, or moving averages.
Categorical Data: Use mode (most frequent value) or prediction-based imputation.
Special Cases: Use placeholder values like -1 for age or Unknown for categorical fields, primarily for quick exploration.
Example:
# Mean imputation for numeric variable
dataset$age[is.na(dataset$age)] <- mean(dataset$age, na.rm = TRUE)

# Mode imputation for categorical variable
# (base R's mode() returns the storage type, not the most frequent value, so tabulate instead)
dataset$gender[is.na(dataset$gender)] <- names(which.max(table(dataset$gender)))
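The same idea can be wrapped in a small helper and checked on toy data. In the sketch below, stat_mode() is a hypothetical helper (R has no built-in statistical mode function) and the data frame is made up for illustration. It also records the standard deviation before and after imputation, since filling every gap with a single central value shrinks the spread.

```r
# Hypothetical helper: most frequent non-missing value of a vector
stat_mode <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Toy data, made up for illustration
dataset <- data.frame(
  age    = c(23, 35, NA, 41, 29, NA, 52),
  gender = c("F", "M", "F", NA, "F", "M", NA),
  stringsAsFactors = FALSE
)

sd_before <- sd(dataset$age, na.rm = TRUE)

# Median imputation for the numeric column, mode imputation for the categorical one
dataset$age[is.na(dataset$age)] <- median(dataset$age, na.rm = TRUE)
dataset$gender[is.na(dataset$gender)] <- stat_mode(dataset$gender)

sd_after <- sd(dataset$age)
print(c(sd_before = sd_before, sd_after = sd_after))
```

The drop in standard deviation is the main weakness of single-value imputation and one reason to prefer model-based methods when the share of missing values is not trivial.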

Advanced Imputation with R Packages
R provides several powerful packages for robust imputation, including:
Hmisc – General-purpose imputation.
missForest – Non-parametric imputation using Random Forest.
Amelia – Multiple imputation for time-series and cross-sectional data.
mice – Multivariate Imputation via Chained Equations (a widely used choice for MAR data).
We’ll focus on the mice package, which is widely used for MAR missing values and provides multiple imputed datasets for robust modeling.

Using the mice Package
The mice package imputes missing values by regressing each incomplete variable on other variables. Multiple datasets are created to capture the uncertainty of imputation.
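To make the chained-equations idea concrete, here is a deliberately simplified base R sketch. It is not how mice is implemented: it uses plain linear regression with added residual noise, ignores uncertainty in the regression coefficients, and produces a single completed dataset. The function name chained_impute and the simulated data are assumptions for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
dat <- data.frame(x1 = x1, x2 = x2)
dat$x1[sample(n, 30)] <- NA
dat$x2[sample(n, 30)] <- NA

chained_impute <- function(data, iterations = 10) {
  miss   <- is.na(data)   # remember which cells were missing originally
  filled <- data
  # Step 0: start from column means
  for (v in names(filled)) {
    filled[[v]][miss[, v]] <- mean(data[[v]], na.rm = TRUE)
  }
  for (it in seq_len(iterations)) {
    for (v in names(filled)) {
      if (!any(miss[, v])) next
      # Regress v on all other columns, using rows where v was observed
      fit <- lm(reformulate(setdiff(names(filled), v), response = v),
                data = filled[!miss[, v], , drop = FALSE])
      pred <- predict(fit, newdata = filled[miss[, v], , drop = FALSE])
      # Add residual noise so imputations do not all sit on the regression line
      filled[[v]][miss[, v]] <- pred + rnorm(length(pred), sd = sigma(fit))
    }
  }
  filled
}

completed <- chained_impute(dat)
print(colSums(is.na(completed)))
```

Running the function several times with different random draws would give several completed datasets, which is the role of the m argument in mice.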
Step 1: Load Packages and Data
library(mice)
library(VIM)
library(lattice)

data(nhanes) # NHANES dataset

NHANES contains 25 observations and 4 variables: age, bmi, hyp (hypertension), and chl (cholesterol).
Several variables have missing values: bmi, hyp, and chl.
Age is coded in bands (1, 2, 3) and is better treated as a factor:
nhanes$age <- as.factor(nhanes$age)

Step 2: Understand Missing Patterns
md.pattern(nhanes)

This function shows the pattern of missingness, including which variables are missing together.
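The same information can be cross-checked with base R, which is useful when the plot is hard to read. The sketch below assumes only that mice is installed (for the nhanes data); the pattern strings use 1 for observed and 0 for missing, mirroring the coding in the md.pattern() output.

```r
library(mice)  # provides the nhanes data

# Missing values per variable
na_per_variable <- colSums(is.na(nhanes))
print(na_per_variable)

# Number of fully observed rows
n_complete <- sum(complete.cases(nhanes))
print(n_complete)

# Each distinct missingness pattern and how many rows share it
# (columns in the order age, bmi, hyp, chl; 1 = observed, 0 = missing)
pattern <- apply(1 * !is.na(nhanes), 1, paste, collapse = "")
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```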
Visualizing Missing Data
nhanes_miss <- aggr(nhanes, col = mdc(1:2), numbers = TRUE, sortVars = TRUE,
                    labels = names(nhanes), cex.axis = 0.7, gap = 3,
                    ylab = c("Proportion of missingness", "Missingness Pattern"))

marginplot(nhanes[, c("chl", "bmi")], col = mdc(1:2), cex.numbers = 1.2, pch = 19)

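aggr() shows the proportion of missing values per variable and the combinations in which they occur.
marginplot() shows the distribution of each variable split by whether the other variable is missing. Similar distributions in both groups are consistent with MCAR, while clear differences suggest that the missingness depends on observed data.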
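The comparison that marginplot() draws in its margins can also be done numerically. The sketch below splits the observed bmi values by whether chl is missing in the same row; with only 25 rows in nhanes, any difference should be read as a rough hint, not a formal test.

```r
library(mice)  # provides the nhanes data

chl_missing <- is.na(nhanes$chl)
bmi <- nhanes$bmi

# Observed bmi values, split by whether chl is missing in the same row
bmi_chl_observed <- bmi[!chl_missing & !is.na(bmi)]
bmi_chl_missing  <- bmi[chl_missing & !is.na(bmi)]

print(summary(bmi_chl_observed))
print(summary(bmi_chl_missing))
```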

Step 3: Impute Missing Values
mice_imputes <- mice(nhanes, m=5, maxit=40)

m=5: Creates 5 imputed datasets.
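maxit=40: Number of iterations of the chained-equations algorithm (the default is 5).
Default imputation method: Predictive Mean Matching (PMM) for numeric variables. The method chosen for each variable can be inspected with: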
mice_imputes$method
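Because imputation involves random draws, results change from run to run unless a seed is fixed. The sketch below is a self-contained variant of the call above with a seed and a quieter log; the seed value and the smaller maxit are arbitrary choices for illustration.

```r
library(mice)

# seed makes the random draws repeatable; printFlag = FALSE silences the iteration log
imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# Imputation method per variable ("" means nothing to impute)
print(imp$method)

# Rows: variable being imputed; columns: predictors used (1 = used)
print(imp$predictorMatrix)

# Imputed bmi values: one row per missing cell, one column per imputed dataset
print(head(imp$imp$bmi))
```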

Step 4: Extract Imputed Dataset
Imputed_data <- complete(mice_imputes, 5) # Using the 5th imputed dataset
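Picking one completed dataset is convenient for exploration, but complete() can also return all of them at once. The sketch below stacks the original data and every imputed dataset in long format; the seed and maxit values are arbitrary.

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# A single completed dataset
first_completed <- complete(imp, 1)

# All datasets stacked; .imp = 0 is the original data, 1..5 are the imputations
all_long <- complete(imp, action = "long", include = TRUE)
print(table(all_long$.imp))
```

The long format is handy for grouped summaries or plots across imputations, while formal inference is better done through with() and pool().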

Step 5: Evaluate Imputation Quality
XY Plot:
xyplot(mice_imputes, bmi ~ chl | .imp, pch = 20, cex = 1.4)

Blue points = observed data
Red points = imputed data
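The red points should follow the same overall pattern as the blue points. Imputed values that fall far outside the observed cloud suggest a poorly specified imputation model.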
Density Plot:
densityplot(mice_imputes)

Compares the distribution of observed and imputed values.
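A quick numeric check can complement the plots. The sketch below compares summary statistics of observed and imputed bmi values and draws the trace plots that plot() produces for a mice result, which help judge convergence. It also confirms a property of PMM: imputed values are borrowed from observed donors, so they never fall outside the observed values. The seed and maxit values are arbitrary.

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

observed_bmi <- nhanes$bmi[!is.na(nhanes$bmi)]
imputed_bmi  <- unlist(imp$imp$bmi)   # all imputed bmi values across the 5 datasets

print(rbind(observed = summary(observed_bmi), imputed = summary(imputed_bmi)))

# With PMM, every imputed value is copied from an observed donor
print(all(imputed_bmi %in% observed_bmi))

# Trace plots of the mean and sd of imputed values per iteration;
# lines that mix freely without a trend suggest convergence
plot(imp)
```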

Step 6: Modeling with Multiple Imputed Datasets
mice fits the same model on every imputed dataset with with() and combines the results with pool():
lm_5_model <- with(mice_imputes, lm(chl ~ age + bmi + hyp))
combo_5_model <- pool(lm_5_model)
summary(combo_5_model)

pool() combines the model results across all imputed datasets using Rubin's rules, so the standard errors reflect the uncertainty introduced by imputation, which an analysis of a single imputed dataset would understate.
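To see what pooling does, Rubin's rules can be applied by hand and compared with the output of pool(). The sketch below uses a smaller model (chl ~ bmi) to keep the output short; the seed and maxit values are arbitrary, and the column names of the summary (estimate, std.error) are those returned by recent mice versions.

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 10, seed = 123, printFlag = FALSE)
fit <- with(imp, lm(chl ~ bmi))

# Per-imputation coefficient estimates and squared standard errors (terms x m)
est <- sapply(fit$analyses, coef)
se2 <- sapply(fit$analyses, function(f) diag(vcov(f)))
m   <- ncol(est)

pooled_est  <- rowMeans(est)        # pooled point estimate
within_var  <- rowMeans(se2)        # average sampling variance
between_var <- apply(est, 1, var)   # variance of the estimates across imputations
total_var   <- within_var + (1 + 1 / m) * between_var

manual <- data.frame(estimate = pooled_est, std.error = sqrt(total_var))
print(manual)

pooled <- summary(pool(fit))
print(pooled)
```

The between-imputation term is what a single imputed dataset cannot provide, which is why its standard errors tend to be too small.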

Summary
Missing values are a common challenge in data analysis.
They can be ignored, dropped, or imputed, depending on the type and amount of missingness.
R packages like mice, Hmisc, Amelia, and missForest offer advanced imputation methods.
The mice package is particularly powerful for MAR missing values, allowing multiple imputations and robust modeling.
Visualizing missingness using VIM helps determine the nature of missing values and guides proper imputation.
Proper handling of missing data supports more accurate, less biased, and more reliable models, which is a cornerstone of successful data science projects.
At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For over 20 years, we’ve partnered with more than 100 clients—from Fortune 500 companies to mid-sized firms—to solve complex data analytics challenges. Our services include Power BI consulting services and PowerBI consultants, turning data into strategic insight. We would love to talk to you. Do reach out to us.
