Initially published on May 29, 2025
Introduction
Alzheimer's disease, the most common form of dementia, affects millions worldwide. Early and accurate diagnosis is crucial for treatment and care planning. In this article, I explore how tree-based machine learning models can help classify dementia status using neuroimaging data from the OASIS dataset. What's particularly exciting is how the tidymodels framework in R provides an elegant, consistent interface for implementing and comparing these powerful algorithms.
The Dataset
The Open Access Series of Imaging Studies (OASIS) provides cross-sectional MRI data from 416 subjects aged 18-96, including 100 older adults clinically diagnosed with very mild to moderate Alzheimer's disease. Our classification target is dementia status (present/absent), with predictors including:
- Demographic variables (age, gender, education)
- MRI-derived measures (whole brain volume, estimated total intracranial volume)
- Clinical dementia rating (CDR)
library(dplyr)
library(janitor)

static_df <- read.csv("oasis_cross-sectional.csv") %>%
  clean_names() %>%
  select(-hand) %>%   # drop the handedness column
  mutate(
    # impute missing numeric values with the column mean
    across(where(~ is.numeric(.) && any(is.na(.))), ~ coalesce(., mean(., na.rm = TRUE))),
    cdr = as.factor(cdr),
    # dementia is present whenever the clinical dementia rating is above 0
    dementia = as.factor(ifelse(cdr == 0, 0, 1))
  )
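Before modelling, the cleaned data is split into training and test sets, with cross-validation folds for tuning. The code below is a minimal sketch of that step: the split proportion, seed, fold count, and object names (data_split, train_df, test_df, cv_folds) are illustrative assumptions rather than the exact values from the original analysis.
library(tidymodels)

set.seed(123)   # illustrative seed for reproducibility
# Stratify on the outcome so both sets keep a similar dementia prevalence
data_split <- initial_split(static_df, prop = 0.75, strata = dementia)
train_df <- training(data_split)
test_df <- testing(data_split)

# Five cross-validation folds for hyperparameter tuning
cv_folds <- vfold_cv(train_df, v = 5, strata = dementia)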
The tidymodels Workflow
The beauty of tidymodels lies in its consistent framework for model building:
- Data preprocessing with recipes
- Model specification with parsnip
- Tuning with tune and dials
- Evaluation with yardstick
This workflow remains identical across different model types - only the underlying algorithm changes.
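To make that concrete, here is a minimal sketch of a shared preprocessing recipe that the model specifications below can plug into. The column handling is an assumption about the cleaned data: the subject identifier is sidelined, cdr is dropped because the dementia label is derived from it, and any remaining bookkeeping columns are assumed to have been dealt with during cleaning.
# A single recipe reused by every workflow below
dementia_recipe <- recipe(dementia ~ ., data = train_df) %>%
  update_role(id, new_role = "id variable") %>%  # keep the subject ID out of the predictors
  step_rm(cdr)                                   # the label is derived from cdr, so exclude it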
Decision Trees: The Interpretable Foundation
Decision trees provide transparent classification rules that clinicians can understand. Our implementation:
dct_model <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune()
) %>%
  set_engine("rpart")
Key advantages:
- Visual representation of decision paths
- Automatic feature selection
- Handles mixed data types naturally
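Plugged into the shared recipe, the tuned tree can then be fit over the cross-validation folds. This is a sketch under the assumptions above; the grid size and metric choices are illustrative.
dct_wf <- workflow() %>%
  add_recipe(dementia_recipe) %>%
  add_model(dct_model)

# Tune cost_complexity and tree_depth over the folds
dct_tuned <- tune_grid(
  dct_wf,
  resamples = cv_folds,
  grid = 20,
  metrics = metric_set(roc_auc, accuracy)
)

show_best(dct_tuned, metric = "roc_auc")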
Random Forests: The Power of Ensembles
Random forests improve accuracy by aggregating many decorrelated trees:
rf_model <- rand_forest(
  mode = "classification",
  trees = tune(),
  mtry = tune()
) %>%
  set_engine("ranger")
Why they shine:
- Reduces overfitting through bagging
- Provides feature importance metrics
- Handles high-dimensional data well
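One practical detail: ranger only records feature importance if the engine is told to compute it. The sketch below shows how that might be wired up; the fixed trees/mtry values, impurity importance, and the vip package for plotting are illustrative assumptions, not the original code.
# Ask the ranger engine for impurity-based importance scores
rf_spec <- rand_forest(mode = "classification", trees = 500, mtry = 3) %>%
  set_engine("ranger", importance = "impurity")

rf_fit <- workflow() %>%
  add_recipe(dementia_recipe) %>%
  add_model(rf_spec) %>%
  fit(data = train_df)

# Plot the most influential predictors
vip::vip(extract_fit_engine(rf_fit))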
Gradient Boosted Machines: Sequential Improvement
GBMs iteratively improve by focusing on previous errors:
library(gbm)

gbm_model <- gbm.fit(
  x = select(train_df, -dementia),
  y = train_df$dementia,
  distribution = "multinomial",   # handles the factor outcome; "bernoulli" with a 0/1 numeric y is the binary-specific alternative
  n.trees = 5000,                 # many weak learners...
  shrinkage = 0.01                # ...each contributing a small step
)
Strengths:
- Often achieves state-of-the-art accuracy
- Flexible loss functions
- Handles class imbalance well
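Because a small shrinkage is paired with a large n.trees, it is worth checking how many boosting iterations are actually useful before predicting. Here is a minimal sketch using gbm's built-in out-of-bag estimate (which the gbm documentation notes tends to be conservative), assuming the train/test split from earlier:
# Estimate the optimal number of boosting iterations from the out-of-bag improvement
best_iter <- gbm.perf(gbm_model, method = "OOB")

# Use that iteration count when predicting class probabilities on held-out data
gbm_probs <- predict(gbm_model,
                     newdata = select(test_df, -dementia),
                     n.trees = best_iter,
                     type = "response")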
XGBoost: The Championship Algorithm
XGBoost adds regularization and efficient computation to GBM:
xg_model <- boost_tree(
  trees = tune(),
  learn_rate = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Why it's special:
- Parallel processing for speed
- Built-in cross-validation
- Regularization prevents overfitting
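With three hyperparameters marked tune(), the spec needs a tuning run and a final fit before evaluation. The sketch below reuses the recipe, folds, and split assumed earlier, adds dummy variables because xgboost expects numeric predictors, and produces the final_results object used in the evaluation code that follows; the grid size and metric choices are illustrative.
# xgboost needs numeric predictors, so extend the shared recipe with indicator columns
xg_wf <- workflow() %>%
  add_recipe(dementia_recipe %>% step_dummy(all_nominal_predictors())) %>%
  add_model(xg_model)

xg_tuned <- tune_grid(
  xg_wf,
  resamples = cv_folds,
  grid = 20,
  metrics = metric_set(roc_auc, accuracy)
)

# Lock in the best parameters, refit on the full training set, and predict on the test set
final_results <- xg_wf %>%
  finalize_workflow(select_best(xg_tuned, metric = "roc_auc")) %>%
  last_fit(data_split)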
Model Evaluation
All models can be evaluated consistently:
predictions <- final_results %>%
  collect_predictions()

# Confusion matrix of predicted vs. observed dementia status
dementia_cm <- conf_mat(predictions, truth = dementia, estimate = .pred_class)
autoplot(dementia_cm)
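The same predictions feed yardstick's other metrics. A short sketch; the .pred_1 column name follows from the 0/1 factor levels, and event_level = "second" treats dementia (level 1) as the event of interest.
# Overall accuracy from the hard class predictions
accuracy(predictions, truth = dementia, estimate = .pred_class)

# ROC AUC from the predicted probability of the dementia class
roc_auc(predictions, truth = dementia, .pred_1, event_level = "second")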
Why This Matters
Tree-based models offer several advantages for medical classification:
- Interpretability: Especially important in healthcare decisions
- Handling missing data: Robust to imperfect clinical datasets
- Nonlinear relationships: Capture complex biological interactions
- Automatic feature selection: Identify key diagnostic markers
The tidymodels framework makes it remarkably straightforward to implement, compare, and productionize these models while maintaining rigorous statistical practices.
Conclusion
From simple decision trees to sophisticated boosted ensembles, tree-based models provide a powerful toolkit for predictive modeling, in our case a straightforward classification of dementia status from demographic and MRI-derived data. The tidymodels ecosystem in R democratizes access to these techniques through its consistent, tidy interface. As neuroimaging datasets grow larger and more complex, these methods will become increasingly valuable in the quest to understand and diagnose dementia earlier and more accurately.
What excites me most is how accessible these advanced techniques have become: with relatively concise code, we can implement models that would have required specialized expertise just a few years ago. The intersection of statistical learning and healthcare remains as promising as ever!
RPubs: Tree Based Models
Full Code: R_Playlist/TBM at master · AkanimohOD19A/R_Playlist
Youtube: https://youtu.be/Ov0ExO8A-aU