<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Julia Silge</title>
    <description>The latest articles on DEV Community by Julia Silge (@juliasilge).</description>
    <link>https://dev.to/juliasilge</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F26527%2F05fc758d-6021-4bc7-b092-85b591d4d265.jpg</url>
      <title>DEV Community: Julia Silge</title>
      <link>https://dev.to/juliasilge</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/juliasilge"/>
    <language>en</language>
    <item>
      <title>Topic modeling for Spice Girls lyrics 🇬🇧👯‍♀️🎤</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 15 Dec 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/topic-modeling-for-spice-girls-lyrics-46aa</link>
      <guid>https://dev.to/juliasilge/topic-modeling-for-spice-girls-lyrics-46aa</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://www.youtube.com/juliasilge"&gt;screencasts&lt;/a&gt;, but instead of being about &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt;, this screencast focuses on unsupervised modeling for text, specifically topic modeling. Today’s screencast walks through how to build a &lt;a href="https://www.structuraltopicmodel.com/"&gt;structural topic model&lt;/a&gt; and then how to explore and understand it, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Spice Girls lyrics. 👯‍♀️&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2i0Cu8MMGRc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to “discover” topics in the &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-14/readme.md"&gt;lyrics of Spice Girls songs&lt;/a&gt;. Instead of a supervised or predictive model where our observations have labels, this is an unsupervised approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

lyrics &amp;lt;- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-14/lyrics.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How many albums and songs are there in this dataset?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lyrics %&amp;gt;% distinct(album_name)


## # A tibble: 3 × 1
## album_name
## &amp;lt;chr&amp;gt;     
## 1 Spice     
## 2 Spiceworld
## 3 Forever


lyrics %&amp;gt;% distinct(album_name, song_name)


## # A tibble: 31 × 2
## album_name song_name                 
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;                     
## 1 Spice "Wannabe"                 
## 2 Spice "Say You\x92ll Be There"  
## 3 Spice "2 Become 1"              
## 4 Spice "Love Thing"              
## 5 Spice "Last Time Lover"         
## 6 Spice "Mama"                    
## 7 Spice "Who Do You Think You Are"
## 8 Spice "Something Kinda Funny"   
## 9 Spice "Naked"                   
## 10 Spice "If U Can\x92t Dance"     
## # … with 21 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start by tokenizing this text and removing a small set of stop words (as well as fixing that punctuation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)

tidy_lyrics &amp;lt;-
  lyrics %&amp;gt;%
  mutate(song_name = str_replace_all(song_name, "\x92", "'")) %&amp;gt;%
  unnest_tokens(word, line) %&amp;gt;%
  anti_join(get_stopwords())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What are the most common words in these songs after removing stop words?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tidy_lyrics %&amp;gt;%
  count(word, sort = TRUE)


## # A tibble: 979 × 2
## word n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 get 153
## 2 love 137
## 3 know 124
## 4 time 106
## 5 wanna 102
## 6 never 101
## 7 oh 88
## 8 yeah 88
## 9 la 85
## 10 got 82
## # … with 969 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How about per song?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tidy_lyrics %&amp;gt;%
  count(song_name, word, sort = TRUE)


## # A tibble: 2,206 × 3
## song_name word n
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Saturday Night Divas get 91
## 2 Spice Up Your Life la 64
## 3 If U Can't Dance dance 60
## 4 Holler holler 48
## 5 Never Give Up on the Good Times never 47
## 6 Move Over generation 41
## 7 Saturday Night Divas deeper 41
## 8 Move Over yeah 39
## 9 Something Kinda Funny got 39
## 10 Never Give Up on the Good Times give 38
## # … with 2,196 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us an idea of how many counts per word we have per song, for our modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Train a topic model
&lt;/h2&gt;

&lt;p&gt;To train a topic model with the stm package, we need to create a sparse matrix from our tidy dataframe of tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lyrics_sparse &amp;lt;-
  tidy_lyrics %&amp;gt;%
  count(song_name, word) %&amp;gt;%
  cast_sparse(song_name, word, n)

dim(lyrics_sparse)


## [1] 31 979

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means there are 31 songs (i.e. documents) and 979 different tokens (i.e. terms or words) in our dataset for modeling.&lt;/p&gt;

&lt;p&gt;A topic model like this one models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each &lt;strong&gt;document&lt;/strong&gt; as a mixture of topics&lt;/li&gt;
&lt;li&gt;each &lt;strong&gt;topic&lt;/strong&gt; as a mixture of words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important parameter when training a topic model is &lt;code&gt;K&lt;/code&gt;, the number of topics. This is like &lt;code&gt;k&lt;/code&gt; in k-means in that it is a hyperparameter of the model and we must choose this value ahead of time. We could &lt;a href="https://juliasilge.com/blog/evaluating-stm/"&gt;try multiple different values&lt;/a&gt; to find the best value for &lt;code&gt;K&lt;/code&gt;, but this is a very small dataset so let’s just stick with &lt;code&gt;K = 4&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(stm)
set.seed(123)
topic_model &amp;lt;- stm(lyrics_sparse, K = 4, verbose = FALSE)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
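If we did want to compare several values of `K`, one approach (a hedged sketch of my own, not code from the original post; the candidate `K` values here are illustrative) is to fit a model for each candidate and inspect diagnostics such as exclusivity. `lyrics_sparse` is the sparse matrix created above.

```r
library(stm)
library(tidyverse)

# Sketch: fit a structural topic model for each candidate K.
# Assumes `lyrics_sparse` from above; the K values are illustrative.
set.seed(123)
many_models = tibble(K = c(3, 4, 6, 8)) %>%
  mutate(topic_model = map(K, ~ stm(lyrics_sparse, K = .x, verbose = FALSE)))

# One could then compare per-topic diagnostics, e.g. exclusivity(), per model.
many_models %>% mutate(exclusivity = map(topic_model, exclusivity))
```

With only 31 documents, these diagnostics are noisy, which is part of why sticking with a single small `K` is reasonable here.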



&lt;p&gt;To get a quick view of the results, we can use &lt;code&gt;summary()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(topic_model)


## A topic model with 4 topics, 31 documents and a 979 word dictionary.

## Topic 1 Top Words:
## Highest Prob: get, wanna, time, night, right, deeper, come 
## FREX: deeper, saturday, comin, get, lover, night, last 
## Lift: achieve, saying, tonight, another, anyway, blameless, breaking 
## Score: deeper, saturday, lover, get, wanna, night, comin 
## Topic 2 Top Words:
## Highest Prob: dance, yeah, generation, know, next, love, naked 
## FREX: next, naked, denying, foolin, nobody, wants, meant 
## Lift: admit, bein, check, d'ya, defeat, else, foolin 
## Score: next, naked, dance, generation, denying, foolin, nobody 
## Topic 3 Top Words:
## Highest Prob: got, holler, make, love, oh, something, play 
## FREX: holler, kinda, swing, funny, yay, use, trust 
## Lift: anyone, bottom, driving, fantasy, follow, hoo, long 
## Score: holler, swing, kinda, funny, yay, driving, loving 
## Topic 4 Top Words:
## Highest Prob: la, never, love, give, time, know, way 
## FREX: times, tried, swear, la, bring, promise, viva 
## Lift: able, certain, love's, rely, affection, shy, replace 
## Score: la, times, swear, shake, viva, chicas, front

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Explore topic model results
&lt;/h2&gt;

&lt;p&gt;To explore more deeply, we can &lt;code&gt;tidy()&lt;/code&gt; the topic model results to get a dataframe that we can compute on. There are two possible outputs for this topic model, the &lt;code&gt;"beta"&lt;/code&gt; matrix of topic-word probabilities and the &lt;code&gt;"gamma"&lt;/code&gt; matrix of document-topic probabilities. Let’s start with the first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_topics &amp;lt;- tidy(topic_model, matrix = "beta")
word_topics


## # A tibble: 3,916 × 3
## topic term beta
## &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 1 achieve 1.66e- 3
## 2 2 achieve 2.14e-21
## 3 3 achieve 1.75e-49
## 4 4 achieve 5.18e-36
## 5 1 baby 1.20e- 2
## 6 2 baby 1.44e- 2
## 7 3 baby 1.29e-15
## 8 4 baby 5.04e- 3
## 9 1 back 1.94e- 2
## 10 2 back 5.49e- 4
## # … with 3,906 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this is a tidy dataframe, we can manipulate it however we like, including making a visualization showing the highest probability words from each topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_topics %&amp;gt;%
  group_by(topic) %&amp;gt;%
  slice_max(beta, n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(topic = paste("Topic", topic)) %&amp;gt;%
  ggplot(aes(beta, reorder_within(term, beta, topic), fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(topic), scales = "free_y") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_reordered() +
  labs(x = expression(beta), y = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7kda_oKv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7kda_oKv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-11-1.png" alt="" width="880" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What about the other matrix? We also need to pass in the &lt;code&gt;document_names&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;song_topics &amp;lt;- tidy(topic_model,
  matrix = "gamma",
  document_names = rownames(lyrics_sparse)
)
song_topics


## # A tibble: 124 × 3
## document topic gamma
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;
## 1 2 Become 1 1 0.714   
## 2 Denying 1 0.00163 
## 3 Do It 1 0.996   
## 4 Get Down With Me 1 0.947   
## 5 Goodbye 1 0.00106 
## 6 Holler 1 0.00103 
## 7 If U Can't Dance 1 0.000942
## 8 If You Wanna Have Some Fun 1 0.00722 
## 9 Last Time Lover 1 0.998   
## 10 Let Love Lead the Way 1 0.00175 
## # … with 114 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember that each document (song) was modeled as a mixture of topics. How did that turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;song_topics %&amp;gt;%
  mutate(
    song_name = fct_reorder(document, gamma),
    topic = factor(topic)
  ) %&amp;gt;%
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(song_name), ncol = 4) +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = expression(gamma), y = "Topic")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ow64MUQ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ow64MUQ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-13-1.png" alt="" width="880" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The songs near the top of this plot are mostly one topic, while the songs near the bottom are more a mix.&lt;/p&gt;

&lt;p&gt;There is a TON more you can do with topic models. For example, we can take the trained topic model and, using some supplementary metadata on our documents, estimate regressions for the &lt;em&gt;proportion&lt;/em&gt; of each document about a topic with the metadata as the predictors. Here, let’s estimate regressions for our four topics with the album name as the predictor. This asks the question, “Do the topics in Spice Girls songs change across albums?”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effects &amp;lt;-
  estimateEffect(
    1:4 ~ album_name,
    topic_model,
    tidy_lyrics %&amp;gt;% distinct(song_name, album_name) %&amp;gt;% arrange(song_name)
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, to get a quick view of the results, we can use &lt;code&gt;summary()&lt;/code&gt;, but to dive deeper, we will want to use &lt;code&gt;tidy()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(effects)


## 
## Call:
## estimateEffect(formula = 1:4 ~ album_name, stmobj = topic_model, 
## metadata = tidy_lyrics %&amp;gt;% distinct(song_name, album_name) %&amp;gt;% 
## arrange(song_name))
## 
## 
## Topic 1:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 0.1787 0.1312 1.362 0.184
## album_nameSpice 0.1199 0.1892 0.634 0.531
## album_nameSpiceworld 0.1139 0.1862 0.612 0.546
## 
## 
## Topic 2:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 0.1444 0.1325 1.090 0.285
## album_nameSpice 0.1357 0.1879 0.722 0.476
## album_nameSpiceworld 0.1486 0.1846 0.805 0.427
## 
## 
## Topic 3:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)  
## (Intercept) 0.27150 0.12085 2.247 0.0327 *
## album_nameSpice 0.01954 0.16752 0.117 0.9080  
## album_nameSpiceworld -0.25776 0.16700 -1.543 0.1339  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 4:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)   
## (Intercept) 0.405559 0.140820 2.880 0.00754 **
## album_nameSpice -0.273207 0.202200 -1.351 0.18746   
## album_nameSpiceworld -0.007134 0.194246 -0.037 0.97096   
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


tidy(effects)


## # A tibble: 12 × 6
## topic term estimate std.error statistic p.value
## &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 1 (Intercept) 0.177 0.132 1.34 0.190  
## 2 1 album_nameSpice 0.120 0.189 0.633 0.532  
## 3 1 album_nameSpiceworld 0.115 0.188 0.608 0.548  
## 4 2 (Intercept) 0.145 0.133 1.09 0.283  
## 5 2 album_nameSpice 0.135 0.187 0.722 0.476  
## 6 2 album_nameSpiceworld 0.150 0.185 0.813 0.423  
## 7 3 (Intercept) 0.272 0.120 2.26 0.0316 
## 8 3 album_nameSpice 0.0167 0.167 0.100 0.921  
## 9 3 album_nameSpiceworld -0.259 0.166 -1.57 0.129  
## 10 4 (Intercept) 0.404 0.140 2.89 0.00739
## 11 4 album_nameSpice -0.273 0.196 -1.39 0.175  
## 12 4 album_nameSpiceworld -0.00502 0.193 -0.0260 0.979

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks like there is no statistical evidence of change in the lyrical content of the Spice Girls songs across these three albums!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Predicting viewership for Doctor Who episodes</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Sat, 27 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predicting-viewership-for-doctor-who-episodes-41eg</link>
      <guid>https://dev.to/juliasilge/predicting-viewership-for-doctor-who-episodes-41eg</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to handle workflow objects, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Doctor Who episodes. 💙&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/T8SSxIo-9Rg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict the UK viewership of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-11-23/readme.md"&gt;Doctor Who episodes&lt;/a&gt; (since the 2005 revival) from the episodes’ air date. How has the viewership of these episodes changed over time?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;episodes &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-11-23/episodes.csv") %&amp;gt;%
  filter(!is.na(uk_viewers))

episodes %&amp;gt;%
  ggplot(aes(first_aired, uk_viewers)) +
  geom_line(alpha = 0.8, size = 1.2, color = "midnightblue") +
  labs(x = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lEXNNToB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lEXNNToB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-2-1.png" alt="" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are quite spiky, with much higher viewer numbers for special episodes like season finales or Christmas episodes.&lt;/p&gt;

&lt;p&gt;I have only ever watched episodes of Doctor Who after they arrive on US streaming platforms, but I will say that I haven’t caught up on some of the latest seasons, much like many viewers in the UK.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a workflow
&lt;/h2&gt;

&lt;p&gt;In tidymodels, we typically recommend using a &lt;a href="https://www.tmwr.org/workflows.html"&gt;workflow&lt;/a&gt; in your modeling analyses, to make it easier to carry around preprocessing and modeling components in your code and to protect against errors. Let’s create some bootstrap resampling folds for these episodes, and then a workflow to predict viewership (in millions) from the air date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
folds &amp;lt;- bootstraps(episodes, times = 100, strata = uk_viewers)
folds


## # Bootstrap sampling using stratification 
## # A tibble: 100 × 2
## splits id          
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;       
## 1 &amp;lt;split [167/61]&amp;gt; Bootstrap001
## 2 &amp;lt;split [167/55]&amp;gt; Bootstrap002
## 3 &amp;lt;split [167/64]&amp;gt; Bootstrap003
## 4 &amp;lt;split [167/56]&amp;gt; Bootstrap004
## 5 &amp;lt;split [167/69]&amp;gt; Bootstrap005
## 6 &amp;lt;split [167/63]&amp;gt; Bootstrap006
## 7 &amp;lt;split [167/68]&amp;gt; Bootstrap007
## 8 &amp;lt;split [167/55]&amp;gt; Bootstrap008
## 9 &amp;lt;split [167/60]&amp;gt; Bootstrap009
## 10 &amp;lt;split [167/60]&amp;gt; Bootstrap010
## # … with 90 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to use &lt;code&gt;first_aired&lt;/code&gt; as our predictor, but let’s do some feature engineering here. Let’s create a date feature (just year here; if we had more data, maybe we could try week of the year or month), and also create a feature for a few holidays that are celebrated in the UK and have special Doctor Who episodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;who_rec &amp;lt;-
  recipe(uk_viewers ~ first_aired, data = episodes) %&amp;gt;%
  step_date(first_aired, features = "year") %&amp;gt;%
  step_holiday(first_aired,
    holidays = c("NewYearsDay", "ChristmasDay"),
    keep_original_cols = FALSE
  )

## not needed for modeling, but just to check how things are going:
prep(who_rec) %&amp;gt;% bake(new_data = NULL)


## # A tibble: 167 × 4
## uk_viewers first_aired_year first_aired_NewYearsDay first_aired_ChristmasDay
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 10.8 2005 0 0
## 2 7.97 2005 0 0
## 3 8.86 2005 0 0
## 4 7.63 2005 0 0
## 5 7.98 2005 0 0
## 6 8.63 2005 0 0
## 7 8.01 2005 0 0
## 8 8.06 2005 0 0
## 9 7.11 2005 0 0
## 10 6.86 2005 0 0
## # … with 157 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s combine this feature engineering recipe together with a model. We don’t have much data here, so let’s stick with a straightforward OLS linear model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;who_wf &amp;lt;- workflow(who_rec, linear_reg())
who_wf


## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_date()
## • step_holiday()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extract custom quantities from resampled workflows
&lt;/h2&gt;

&lt;p&gt;If you look at many of my tutorials or the documentation for tidymodels, you’ll see that we can fit our workflow to our resamples with code like &lt;code&gt;fit_resamples(who_wf, folds)&lt;/code&gt;. This can give us some useful results, but sometimes we want &lt;em&gt;more&lt;/em&gt;. Functions like &lt;code&gt;fit_resamples()&lt;/code&gt; and &lt;code&gt;tune_grid()&lt;/code&gt; don’t keep the fitted models they train, because those models exist only for evaluation or tuning; we usually throw them away. Sometimes we want to record something about those models beyond their performance; we can do that using a special &lt;code&gt;control_*()&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ctrl_extract &amp;lt;- control_resamples(extract = extract_fit_engine)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create &lt;code&gt;ctrl_extract&lt;/code&gt;, I used the &lt;a href="https://workflows.tidymodels.org/reference/extract-workflow.html"&gt;&lt;code&gt;extract_fit_engine()&lt;/code&gt;&lt;/a&gt; function, but you have total flexibility here and can supply your own function. Check out &lt;a href="https://www.tidymodels.org/learn/models/coefficients/"&gt;this tutorial&lt;/a&gt; for another way to supply a custom function here.&lt;/p&gt;
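As one hedged illustration of that flexibility (my own sketch, not code from the post), the `extract` argument can take an anonymous function of each resample's fitted workflow, for instance keeping only the tidied coefficients instead of the whole `lm` object:

```r
library(tidymodels)

# Sketch: a custom extract function. During fit_resamples(), the function
# receives each resample's fitted workflow; here we keep just the tidied
# model coefficients rather than the entire engine object.
ctrl_coefs = control_resamples(
  extract = function(fitted_wf) tidy(extract_fit_engine(fitted_wf))
)

# Used the same way as before, e.g.:
# fit_resamples(who_wf, folds, control = ctrl_coefs)
```

Extracting only what you need keeps the resampling results object smaller when you fit to many resamples.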

&lt;p&gt;With our &lt;code&gt;ctrl_extract&lt;/code&gt; ready to go, we can now fit to our resamples and keep the linear models for each resample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()
set.seed(234)
who_rs &amp;lt;- fit_resamples(who_wf, folds, control = ctrl_extract)
who_rs


## # Resampling results
## # Bootstrap sampling using stratification 
## # A tibble: 100 × 5
## splits id .metrics .notes .extracts    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;       
## 1 &amp;lt;split [167/61]&amp;gt; Bootstrap001 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 2 &amp;lt;split [167/55]&amp;gt; Bootstrap002 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 3 &amp;lt;split [167/64]&amp;gt; Bootstrap003 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 4 &amp;lt;split [167/56]&amp;gt; Bootstrap004 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 5 &amp;lt;split [167/69]&amp;gt; Bootstrap005 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 6 &amp;lt;split [167/63]&amp;gt; Bootstrap006 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 7 &amp;lt;split [167/68]&amp;gt; Bootstrap007 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 8 &amp;lt;split [167/55]&amp;gt; Bootstrap008 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 9 &amp;lt;split [167/60]&amp;gt; Bootstrap009 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 10 &amp;lt;split [167/60]&amp;gt; Bootstrap010 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## # … with 90 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we have each &lt;code&gt;lm&lt;/code&gt; object for each resample, we can &lt;code&gt;tidy()&lt;/code&gt; them to find the coefficients. We can do any kind of analysis we want on these bootstrapped coefficients, including making a visualization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;who_rs %&amp;gt;%
  select(id, .extracts) %&amp;gt;%
  unnest(.extracts) %&amp;gt;%
  mutate(coefs = map(.extracts, tidy)) %&amp;gt;%
  unnest(coefs) %&amp;gt;%
  filter(term != "(Intercept)") %&amp;gt;%
  ggplot(aes(estimate, fill = term)) +
  geom_histogram(alpha = 0.8, bins = 12, show.legend = FALSE) +
  facet_wrap(vars(term), scales = "free")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JjoktKKU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JjoktKKU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-8-1.png" alt="" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like episodes airing on Christmas Day have &lt;strong&gt;much&lt;/strong&gt; higher viewership, 2.5 to 3 million viewers higher than other days. Airing on New Year’s Day also looks like it is associated with more viewers, and we see evidence for a modest decrease in viewers over the years.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Predict giant pumpkin weights 🎃 with tidymodels</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Mon, 08 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predict-giant-pumpkin-weights-with-tidymodels-3o52</link>
      <guid>https://dev.to/juliasilge/predict-giant-pumpkin-weights-with-tidymodels-3o52</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/" rel="noopener noreferrer"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/" rel="noopener noreferrer"&gt;tidymodels&lt;/a&gt; packages. If you are a tidymodels user, either just starting out or someone who has used the packages a lot, we are interested in your feedback on &lt;a href="https://www.tidyverse.org/blog/2021/10/tidymodels-2022-survey/" rel="noopener noreferrer"&gt;our priorities for 2022&lt;/a&gt;. The survey we fielded last year turned out to be very helpful in making decisions, so we would so appreciate your input again!&lt;/p&gt;

&lt;p&gt;Today’s screencast is great for someone just starting out with &lt;a href="https://workflowsets.tidymodels.org/" rel="noopener noreferrer"&gt;workflowsets&lt;/a&gt;, the tidymodels package for handling multiple preprocessing/modeling combinations at once, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday" rel="noopener noreferrer"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on giant pumpkins from competitions. 🥧&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qNxJKke2rsE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict the weight of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-10-19/readme.md" rel="noopener noreferrer"&gt;giant pumpkins&lt;/a&gt; from other characteristics measured during a competition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

pumpkins_raw &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv")

pumpkins &amp;lt;-
  pumpkins_raw %&amp;gt;%
  separate(id, into = c("year", "type")) %&amp;gt;%
  mutate(across(c(year, weight_lbs, ott, place), parse_number)) %&amp;gt;%
  filter(type == "P") %&amp;gt;%
  select(weight_lbs, year, place, ott, gpc_site, country)

pumpkins


## # A tibble: 15,965 × 6
## weight_lbs year place ott gpc_site country    
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;      
## 1 2032 2013 1 475 Uesugi Farms Weigh-off United Sta…
## 2 1985 2013 2 453 Safeway World Championship Pumpkin … United Sta…
## 3 1894 2013 3 445 Safeway World Championship Pumpkin … United Sta…
## 4 1874. 2013 4 436 Elk Grove Giant Pumpkin Festival United Sta…
## 5 1813 2013 5 430 The Great Howard Dill Giant Pumpkin… Canada     
## 6 1791 2013 6 431 Elk Grove Giant Pumpkin Festival United Sta…
## 7 1784 2013 7 445 Uesugi Farms Weigh-off United Sta…
## 8 1784. 2013 8 434 Stillwater Harvestfest United Sta…
## 9 1780. 2013 9 422 Stillwater Harvestfest United Sta…
## 10 1766. 2013 10 425 Durham Fair Weigh-Off United Sta…
## # … with 15,955 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main relationship here is between the volume/size of the pumpkin (measured via “over-the-top inches”) and weight.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkins %&amp;gt;%
  filter(ott &amp;gt; 20, ott &amp;lt; 1e3) %&amp;gt;%
  ggplot(aes(ott, weight_lbs, color = place)) +
  geom_point(alpha = 0.2, size = 1.1) +
  labs(x = "over-the-top inches", y = "weight (lbs)") +
  scale_color_viridis_c()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Big, heavy pumpkins placed closer to winning at the competitions, naturally!&lt;/p&gt;

&lt;p&gt;Has there been any shift in this relationship over time?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkins %&amp;gt;%
  filter(ott &amp;gt; 20, ott &amp;lt; 1e3) %&amp;gt;%
  ggplot(aes(ott, weight_lbs)) +
  geom_point(alpha = 0.2, size = 1.1, color = "gray60") +
  geom_smooth(aes(color = factor(year)),
    method = lm, formula = y ~ splines::bs(x, 3),
    se = FALSE, size = 1.5, alpha = 0.6
  ) +
  labs(x = "over-the-top inches", y = "weight (lbs)", color = NULL) +
  scale_color_viridis_d()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hard to say, I think.&lt;/p&gt;

&lt;p&gt;Which countries produced more or less massive pumpkins?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkins %&amp;gt;%
  mutate(
    country = fct_lump(country, n = 10),
    country = fct_reorder(country, weight_lbs)
  ) %&amp;gt;%
  ggplot(aes(country, weight_lbs, color = country)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "weight (lbs)") +
  theme(legend.position = "none")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Build and fit a workflow set
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We’ll stratify by our outcome &lt;code&gt;weight_lbs&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
pumpkin_split &amp;lt;- pumpkins %&amp;gt;%
  filter(ott &amp;gt; 20, ott &amp;lt; 1e3) %&amp;gt;%
  initial_split(strata = weight_lbs)

pumpkin_train &amp;lt;- training(pumpkin_split)
pumpkin_test &amp;lt;- testing(pumpkin_split)

set.seed(234)
pumpkin_folds &amp;lt;- vfold_cv(pumpkin_train, strata = weight_lbs)
pumpkin_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [8954/996]&amp;gt; Fold01
## 2 &amp;lt;split [8954/996]&amp;gt; Fold02
## 3 &amp;lt;split [8954/996]&amp;gt; Fold03
## 4 &amp;lt;split [8954/996]&amp;gt; Fold04
## 5 &amp;lt;split [8954/996]&amp;gt; Fold05
## 6 &amp;lt;split [8954/996]&amp;gt; Fold06
## 7 &amp;lt;split [8955/995]&amp;gt; Fold07
## 8 &amp;lt;split [8956/994]&amp;gt; Fold08
## 9 &amp;lt;split [8957/993]&amp;gt; Fold09
## 10 &amp;lt;split [8958/992]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s create three data preprocessing recipes: one that only pools infrequently occurring factor levels, one that also creates indicator variables, and finally one that also creates spline terms for over-the-top inches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base_rec &amp;lt;-
  recipe(weight_lbs ~ ott + year + country + gpc_site,
    data = pumpkin_train
  ) %&amp;gt;%
  step_other(country, gpc_site, threshold = 0.02)

ind_rec &amp;lt;-
  base_rec %&amp;gt;%
  step_dummy(all_nominal_predictors())

spline_rec &amp;lt;-
  ind_rec %&amp;gt;%
  step_bs(ott)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let’s create three model specifications: a random forest model, a MARS model, and a linear model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf_spec &amp;lt;-
  rand_forest(trees = 1e3) %&amp;gt;%
  set_mode("regression") %&amp;gt;%
  set_engine("ranger")

mars_spec &amp;lt;-
  mars() %&amp;gt;%
  set_mode("regression") %&amp;gt;%
  set_engine("earth")

lm_spec &amp;lt;- linear_reg()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it’s time to put the preprocessing and models together in a &lt;code&gt;workflow_set()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkin_set &amp;lt;-
  workflow_set(
    list(base_rec, ind_rec, spline_rec),
    list(rf_spec, mars_spec, lm_spec),
    cross = FALSE
  )

pumpkin_set


## # A workflow set/tibble: 3 × 4
## wflow_id info option result    
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;    
## 1 recipe_1_rand_forest &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 2 recipe_2_mars &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 3 recipe_3_linear_reg &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use &lt;code&gt;cross = FALSE&lt;/code&gt; because we don’t want every combination of these components, only three options to try. Let’s fit these possible candidates to our resamples to see which one performs best.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()
set.seed(2021)

pumpkin_rs &amp;lt;-
  workflow_map(
    pumpkin_set,
    "fit_resamples",
    resamples = pumpkin_folds
  )

pumpkin_rs


## # A workflow set/tibble: 3 × 4
## wflow_id info option result   
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 recipe_1_rand_forest &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[1]&amp;gt; &amp;lt;rsmp[+]&amp;gt;
## 2 recipe_2_mars &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[1]&amp;gt; &amp;lt;rsmp[+]&amp;gt;
## 3 recipe_3_linear_reg &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[1]&amp;gt; &amp;lt;rsmp[+]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluate workflow set
&lt;/h2&gt;

&lt;p&gt;How did our three candidates do?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(pumpkin_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-11-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is not much difference between the three options, and if anything, our linear model with spline feature engineering maybe did better. This is nice because it’s a simpler model!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(pumpkin_rs)


## # A tibble: 6 × 9
## wflow_id .config preproc model .metric .estimator mean n std_err
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;
## 1 recipe_1_r… Preprocess… recipe rand_… rmse standard 86.1 10 1.10e+0
## 2 recipe_1_r… Preprocess… recipe rand_… rsq standard 0.969 10 9.97e-4
## 3 recipe_2_m… Preprocess… recipe mars rmse standard 83.8 10 1.92e+0
## 4 recipe_2_m… Preprocess… recipe mars rsq standard 0.969 10 1.67e-3
## 5 recipe_3_l… Preprocess… recipe linea… rmse standard 82.4 10 2.27e+0
## 6 recipe_3_l… Preprocess… recipe linea… rsq standard 0.970 10 1.97e-3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
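
&lt;p&gt;Another way to compare the candidates numerically is &lt;code&gt;rank_results()&lt;/code&gt; from workflowsets (loaded with tidymodels); this is a quick sketch, not part of the original analysis above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## rank all workflows in the set by cross-validated RMSE, best first
rank_results(pumpkin_rs, rank_metric = "rmse")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;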



&lt;p&gt;We can extract the workflow we want to use and fit it to our training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fit &amp;lt;-
  extract_workflow(pumpkin_rs, "recipe_3_linear_reg") %&amp;gt;%
  fit(pumpkin_train)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use an object like this to predict, for example on the test data with &lt;code&gt;predict(final_fit, pumpkin_test)&lt;/code&gt;, or we can examine the model parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tidy(final_fit) %&amp;gt;%
  arrange(-abs(estimate))


## # A tibble: 15 × 5
## term estimate std.error statistic p.value
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 (Intercept) -9731. 675. -14.4 1.30e- 46
## 2 ott_bs_3 2585. 25.6 101. 0        
## 3 ott_bs_2 450. 11.9 37.9 2.75e-293
## 4 ott_bs_1 -345. 36.3 -9.50 2.49e- 21
## 5 gpc_site_Ohio.Valley.Giant.Pumpkin.Gr… 21.1 7.80 2.70 6.89e- 3
## 6 country_United.States 11.9 5.66 2.11 3.53e- 2
## 7 gpc_site_Stillwater.Harvestfest 11.6 7.87 1.48 1.40e- 1
## 8 country_Germany -11.5 6.68 -1.71 8.64e- 2
## 9 country_other -10.7 6.33 -1.69 9.13e- 2
## 10 country_Canada 9.29 6.12 1.52 1.29e- 1
## 11 country_Italy 8.12 7.02 1.16 2.47e- 1
## 12 gpc_site_Elk.Grove.Giant.Pumpkin.Fest… -7.81 7.70 -1.01 3.10e- 1
## 13 year 4.89 0.334 14.6 5.03e- 48
## 14 gpc_site_Wiegemeisterschaft.Berlin.Br… 1.51 8.07 0.187 8.51e- 1
## 15 gpc_site_other 1.41 5.60 0.251 8.02e- 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spline terms are by far the most important, but we do see evidence of certain sites and countries being predictive of weight (either up or down) as well as a small trend of heavier pumpkins with year.&lt;/p&gt;
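
&lt;p&gt;As a quick sketch of prediction (assuming the yardstick package, loaded with tidymodels), we can use &lt;code&gt;augment()&lt;/code&gt; on the fitted workflow to add predictions to the test set and estimate RMSE there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## add a .pred column to the test set, then compute test-set RMSE
augment(final_fit, pumpkin_test) %&amp;gt;%
  rmse(truth = weight_lbs, estimate = .pred)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;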

&lt;p&gt;Don’t forget to take the &lt;a href="https://www.tidyverse.org/blog/2021/10/tidymodels-2022-survey/" rel="noopener noreferrer"&gt;tidymodels survey for 2022 priorities&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Spatial resampling for the #30DayMapChallenge 🗺  </title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Fri, 05 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/spatial-resampling-for-the-30daymapchallenge-1gdm</link>
      <guid>https://dev.to/juliasilge/spatial-resampling-for-the-30daymapchallenge-1gdm</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/" rel="noopener noreferrer"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/" rel="noopener noreferrer"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to use spatial resampling for evaluating a model, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday" rel="noopener noreferrer"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on geographic data. 🗾&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wVrcw_ek3a4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Geographic data is special when it comes to, well, basically everything! This includes modeling and especially &lt;em&gt;evaluating&lt;/em&gt; models. This week’s &lt;code&gt;#TidyTuesday&lt;/code&gt; is all about exploring &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-11-02/readme.md" rel="noopener noreferrer"&gt;spatial data&lt;/a&gt; for the &lt;a href="https://github.com/tjukanovt/30DayMapChallenge" rel="noopener noreferrer"&gt;&lt;code&gt;#30DayMapChallenge&lt;/code&gt;&lt;/a&gt; this month, and especially the spData and spDataLarge packages along with the book &lt;a href="https://geocompr.robinlovelace.net/" rel="noopener noreferrer"&gt;&lt;em&gt;Geocomputation with R&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s use the dataset of landslides (plus not-landslide locations) in Southern Ecuador.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data("lsl", package = "spDataLarge")
landslides &amp;lt;- as_tibble(lsl)
landslides


## # A tibble: 350 × 8
## x y lslpts slope cplan cprof elev log10_carea
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 715078. 9558647. FALSE 37.4 0.0205 0.00866 2477. 2.61
## 2 713748. 9558047. FALSE 41.7 -0.0241 0.00679 2486. 3.07
## 3 712508. 9558887. FALSE 20.0 0.0390 0.0147 2142. 2.29
## 4 713998. 9558187. FALSE 45.8 -0.00632 0.00435 2391. 3.83
## 5 714308. 9557307. FALSE 41.7 0.0423 -0.0202 2570. 2.70
## 6 713488. 9558117. FALSE 52.9 0.0323 0.00703 2418. 2.48
## 7 714948. 9558347. FALSE 51.9 0.0399 0.000791 2546. 3.15
## 8 714678. 9560357. FALSE 38.5 0.0164 0.0299 1932. 3.26
## 9 714368. 9560287. FALSE 24.1 -0.0188 -0.00956 2059. 3.20
## 10 712528. 9559217. FALSE 50.5 0.0142 0.0151 1973. 2.60
## # … with 340 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How are these landslides (plus not-landslides) distributed in this area?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 0.6, bins = 12) +
  geom_point(aes(color = lslpts), alpha = 0.7) +
  coord_fixed() +
  scale_fill_viridis_c() +
  scale_color_manual(values = c("gray90", "midnightblue")) +
  labs(fill = "Elevation", color = "Landslide?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create spatial resamples
&lt;/h2&gt;

&lt;p&gt;In tidymodels, one of the first steps we recommend thinking about is “spending your data budget.” When it comes to geographic data, points close to each other are often similar so we don’t want to randomly resample our observations. Instead, we want to use a resampling strategy that accounts for that autocorrelation. Let’s create both resamples that are appropriate to spatial data and resamples that might work for “regular,” non-spatial data but are not a good fit for geographic data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)
library(spatialsample)

set.seed(123)
good_folds &amp;lt;- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
good_folds


## # 5-fold spatial cross-validation 
## # A tibble: 5 × 2
## splits id   
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;
## 1 &amp;lt;split [306/44]&amp;gt; Fold1
## 2 &amp;lt;split [256/94]&amp;gt; Fold2
## 3 &amp;lt;split [251/99]&amp;gt; Fold3
## 4 &amp;lt;split [303/47]&amp;gt; Fold4
## 5 &amp;lt;split [284/66]&amp;gt; Fold5


set.seed(234)
bad_folds &amp;lt;- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds


## # 5-fold cross-validation using stratification 
## # A tibble: 5 × 2
## splits id   
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;
## 1 &amp;lt;split [280/70]&amp;gt; Fold1
## 2 &amp;lt;split [280/70]&amp;gt; Fold2
## 3 &amp;lt;split [280/70]&amp;gt; Fold3
## 4 &amp;lt;split [280/70]&amp;gt; Fold4
## 5 &amp;lt;split [280/70]&amp;gt; Fold5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://spatialsample.tidymodels.org/" rel="noopener noreferrer"&gt;spatialsample&lt;/a&gt; package currently provides one method for spatial resampling, and we are interested in hearing about what other methods we should support next.&lt;/p&gt;

&lt;p&gt;How do these resamples look? Let’s create a little helper function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot_splits &amp;lt;- function(split) {
  p &amp;lt;- bind_rows(
    analysis(split) %&amp;gt;%
      mutate(analysis = "Analysis"),
    assessment(split) %&amp;gt;%
      mutate(analysis = "Assessment")
  ) %&amp;gt;%
    ggplot(aes(x, y, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) +
    coord_fixed() +
    labs(color = NULL)
  print(p)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spatial resampling creates resamples where observations close to each other are kept together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;walk(good_folds$splits, plot_splits)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-6-.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-6-.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The regular resampling doesn’t do this; it just randomly resamples all observations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;walk(bad_folds$splits, plot_splits)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-7-.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-7-.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This second option is &lt;em&gt;not&lt;/em&gt; a good idea for geographic data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fit and evaluate model
&lt;/h2&gt;

&lt;p&gt;Let’s create a straightforward logistic regression model to predict whether a location saw a landslide based on the other characteristics like slope, elevation, amount of water flow, etc. We can estimate how well this &lt;em&gt;same&lt;/em&gt; model fits the data both with our regular folds and our special spatial resampling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glm_spec &amp;lt;- logistic_reg()
lsl_form &amp;lt;- lslpts ~ slope + cplan + cprof + elev + log10_carea

lsl_wf &amp;lt;- workflow(lsl_form, glm_spec)

doParallel::registerDoParallel()
set.seed(2021)
regular_rs &amp;lt;- fit_resamples(lsl_wf, bad_folds)
set.seed(2021)
spatial_rs &amp;lt;- fit_resamples(lsl_wf, good_folds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did our results turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(regular_rs)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.737 5 0.0173 Preprocessor1_Model1
## 2 roc_auc binary 0.808 5 0.0201 Preprocessor1_Model1


collect_metrics(spatial_rs)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.677 5 0.0708 Preprocessor1_Model1
## 2 roc_auc binary 0.782 5 0.00790 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we use the “regular” resampling, we get a more optimistic estimate of performance, which would fool us into thinking our model would perform better than it really could. The lower performance estimate using spatial resampling is more accurate because of the autocorrelation of this geographic data; observations near each other are more alike than observations far apart. With geographic data, it’s important to use an appropriate model evaluation strategy!&lt;/p&gt;
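
&lt;p&gt;To see this gap directly, here is a small sketch (assuming the objects above) that binds the two sets of metrics together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## compare the two resampling strategies side by side
bind_rows(
  collect_metrics(regular_rs) %&amp;gt;% mutate(resampling = "random"),
  collect_metrics(spatial_rs) %&amp;gt;% mutate(resampling = "spatial")
) %&amp;gt;%
  select(resampling, .metric, mean, std_err)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;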

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Multiclass predictive modeling for economics research papers 📑</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 29 Sep 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/multiclass-predictive-modeling-for-economics-research-papers-5f33</link>
      <guid>https://dev.to/juliasilge/multiclass-predictive-modeling-for-economics-research-papers-5f33</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to build, tune, and evaluate a multiclass predictive model with text features and lasso regularization, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on NBER working papers. 📑&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict the category of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-09-28/readme.md"&gt;National Bureau of Economic Research working papers&lt;/a&gt; from the titles and years of the papers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

papers &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/papers.csv")
programs &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/programs.csv")
paper_authors &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/paper_authors.csv")
paper_programs &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/paper_programs.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start by joining up these datasets to find the info we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papers_joined &amp;lt;-
  paper_programs %&amp;gt;%
  left_join(programs) %&amp;gt;%
  left_join(papers) %&amp;gt;%
  filter(!is.na(program_category)) %&amp;gt;%
  distinct(paper, program_category, year, title)

papers_joined %&amp;gt;%
  count(program_category)


## # A tibble: 3 × 2
## program_category n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Finance 4336
## 2 Macro/International 12012
## 3 Micro 18527

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The papers are in three categories (finance, microeconomics, and macroeconomics) so we’ll be training a multiclass predictive model, not a binary classification model as we often see or use.&lt;/p&gt;
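
&lt;p&gt;With imbalanced classes like these, it can help to keep the no-information baseline in mind: the accuracy from always guessing the largest class. A quick sketch from the counts above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## accuracy from always predicting the most common category
class_counts &amp;lt;- c(Finance = 4336, `Macro/International` = 12012, Micro = 18527)
max(class_counts) / sum(class_counts)
## about 0.53, the accuracy to beat

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;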

&lt;p&gt;Let’s create one exploratory plot before we move on to modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)
library(tidylo)

title_log_odds &amp;lt;-
  papers_joined %&amp;gt;%
  unnest_tokens(word, title) %&amp;gt;%
  filter(!is.na(program_category)) %&amp;gt;%
  count(program_category, word, sort = TRUE) %&amp;gt;%
  bind_log_odds(program_category, word, n)

title_log_odds %&amp;gt;%
  group_by(program_category) %&amp;gt;%
  slice_max(log_odds_weighted, n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(log_odds_weighted,
    fct_reorder(word, log_odds_weighted),
    fill = program_category
  )) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(program_category), scales = "free_y") +
  labs(x = "Log odds (weighted)", y = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nepb9c4f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nepb9c4f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These types of relationships between category and title words are what we want to use in our predictive model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build and tune a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We’ll stratify by our outcome &lt;code&gt;program_category&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
nber_split &amp;lt;- initial_split(papers_joined, strata = program_category)
nber_train &amp;lt;- training(nber_split)
nber_test &amp;lt;- testing(nber_split)

set.seed(234)
nber_folds &amp;lt;- vfold_cv(nber_train, strata = program_category)
nber_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [23539/2617]&amp;gt; Fold01
## 2 &amp;lt;split [23539/2617]&amp;gt; Fold02
## 3 &amp;lt;split [23540/2616]&amp;gt; Fold03
## 4 &amp;lt;split [23540/2616]&amp;gt; Fold04
## 5 &amp;lt;split [23540/2616]&amp;gt; Fold05
## 6 &amp;lt;split [23541/2615]&amp;gt; Fold06
## 7 &amp;lt;split [23541/2615]&amp;gt; Fold07
## 8 &amp;lt;split [23541/2615]&amp;gt; Fold08
## 9 &amp;lt;split [23541/2615]&amp;gt; Fold09
## 10 &amp;lt;split [23542/2614]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s set up our feature engineering. We will need to transform our text data into features useful for our model by tokenizing and computing (in this case) tf-idf. Let’s also downsample since our dataset is imbalanced, with many more of some of the categories than others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(themis)
library(textrecipes)

nber_rec &amp;lt;-
  recipe(program_category ~ year + title, data = nber_train) %&amp;gt;%
  step_tokenize(title) %&amp;gt;%
  step_tokenfilter(title, max_tokens = 200) %&amp;gt;%
  step_tfidf(title) %&amp;gt;%
  step_downsample(program_category)

nber_rec


## Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 2
## 
## Operations:
## 
## Tokenization for title
## Text filtering for title
## Term frequency-inverse document frequency with title
## Down-sampling based on program_category

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let’s create our model specification for a lasso model. We need to use a model specification that can handle multiclass data, in this case &lt;code&gt;multinom_reg()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multi_spec &amp;lt;-
  multinom_reg(penalty = tune(), mixture = 1) %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("glmnet")

multi_spec


## Multinomial Regression Model Specification (classification)
## 
## Main Arguments:
## penalty = tune()
## mixture = 1
## 
## Computational engine: glmnet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it’s time to put the preprocessing and model together in a &lt;code&gt;workflow()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nber_wf &amp;lt;- workflow(nber_rec, multi_spec)
nber_wf


## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_tokenize()
## • step_tokenfilter()
## • step_tfidf()
## • step_downsample()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Multinomial Regression Model Specification (classification)
## 
## Main Arguments:
## penalty = tune()
## mixture = 1
## 
## Computational engine: glmnet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the lasso regularization &lt;code&gt;penalty&lt;/code&gt; is a hyperparameter of the model (we can’t find the best value from fitting the model a single time), let’s tune over a grid of possible &lt;code&gt;penalty&lt;/code&gt; parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nber_grid &amp;lt;- grid_regular(penalty(range = c(-5, 0)), levels = 20)

doParallel::registerDoParallel()
set.seed(2021)
nber_rs &amp;lt;-
  tune_grid(
    nber_wf,
    nber_folds,
    grid = nber_grid
  )

nber_rs


## # Tuning results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 4
## splits id .metrics .notes          
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;          
## 1 &amp;lt;split [23539/2617]&amp;gt; Fold01 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 2 &amp;lt;split [23539/2617]&amp;gt; Fold02 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 3 &amp;lt;split [23540/2616]&amp;gt; Fold03 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 4 &amp;lt;split [23540/2616]&amp;gt; Fold04 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 5 &amp;lt;split [23540/2616]&amp;gt; Fold05 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 6 &amp;lt;split [23541/2615]&amp;gt; Fold06 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 7 &amp;lt;split [23541/2615]&amp;gt; Fold07 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 8 &amp;lt;split [23541/2615]&amp;gt; Fold08 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 9 &amp;lt;split [23541/2615]&amp;gt; Fold09 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 10 &amp;lt;split [23542/2614]&amp;gt; Fold10 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a pretty fast model to fit, since it is linear. How did it turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(nber_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A72rc3r---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A72rc3r---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(nber_rs)


## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config              
## &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1 0.00234 roc_auc hand_till 0.784 10 0.00249 Preprocessor1_Model10
## 2 0.00428 roc_auc hand_till 0.783 10 0.00244 Preprocessor1_Model11
## 3 0.00127 roc_auc hand_till 0.783 10 0.00251 Preprocessor1_Model09
## 4 0.000695 roc_auc hand_till 0.782 10 0.00253 Preprocessor1_Model08
## 5 0.000379 roc_auc hand_till 0.782 10 0.00254 Preprocessor1_Model07

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choose and evaluate a final model
&lt;/h2&gt;

&lt;p&gt;We could use the numerically best model with &lt;code&gt;select_best()&lt;/code&gt;, but with regularized models we often would rather choose a simpler model within some limits of performance. We can choose using the “one-standard-error rule” with &lt;code&gt;select_by_one_std_err()&lt;/code&gt; and then use &lt;code&gt;last_fit()&lt;/code&gt; to &lt;strong&gt;fit&lt;/strong&gt; one time to the training data and &lt;strong&gt;evaluate&lt;/strong&gt; one time on the testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_penalty &amp;lt;-
  nber_rs %&amp;gt;%
  select_by_one_std_err(metric = "roc_auc", desc(penalty))

final_penalty


## # A tibble: 1 × 9
## penalty .metric .estimator mean n std_err .config .best .bound
## &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.00428 roc_auc hand_till 0.783 10 0.00244 Preprocessor1_Mod… 0.784 0.781


final_rs &amp;lt;-
  nber_wf %&amp;gt;%
  finalize_workflow(final_penalty) %&amp;gt;%
  last_fit(nber_split)

final_rs


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [26156/8719]&amp;gt; train/test split &amp;lt;tibble … &amp;lt;tibbl… &amp;lt;tibble [8,… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did our final model perform on the testing data?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(final_rs)


## # A tibble: 2 × 4
## .metric .estimator .estimate .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy multiclass 0.609 Preprocessor1_Model1
## 2 roc_auc hand_till 0.779 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize the difference in performance across classes with a confusion matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(final_rs) %&amp;gt;%
  conf_mat(program_category, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dsFUuMBh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dsFUuMBh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also visualize the ROC curves for each class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(final_rs) %&amp;gt;%
  roc_curve(truth = program_category, .pred_Finance:.pred_Micro) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_abline(slope = 1, color = "gray50", lty = 2, alpha = 0.8) +
  geom_path(size = 1.5, alpha = 0.7) +
  labs(color = NULL) +
  coord_fixed()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v_N0wv7C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v_N0wv7C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like the finance and microeconomics papers were easier to identify than the macroeconomics papers.&lt;/p&gt;

&lt;p&gt;Finally, we can extract (and save, if we like) the fitted workflow from our results to use for predicting on new data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fitted &amp;lt;- extract_workflow(final_rs)
## can save this for prediction later with readr::write_rds()

predict(final_fitted, nber_test[111,], type = "prob")


## # A tibble: 1 × 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.104 0.531 0.365

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
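
&lt;p&gt;To make that saving suggestion concrete, here is a minimal sketch of how serializing and reloading the fitted workflow might look; the filename is hypothetical, just for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## hypothetical filename, for illustration only
readr::write_rds(final_fitted, "nber_fitted_wf.rds")

## later, perhaps in a new R session:
final_fitted &amp;lt;- readr::read_rds("nber_fitted_wf.rds")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;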



&lt;p&gt;We can even make up new paper titles and see how our model classifies them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict(final_fitted, tibble(year = 2021, title = "Pricing Models for Corporate Responsibility"), type = "prob")


## # A tibble: 1 × 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.598 0.158 0.244


predict(final_fitted, tibble(year = 2021, title = "Teacher Health and Medicaid Expansion"), type = "prob")


## # A tibble: 1 × 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.288 0.141 0.571

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Dimensionality reduction for Billboard Top 100 songs 🎶</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 15 Sep 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/dimensionality-reduction-for-billboard-top-100-songs-48g8</link>
      <guid>https://dev.to/juliasilge/dimensionality-reduction-for-billboard-top-100-songs-48g8</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast focuses only on data preprocessing, or feature engineering; let’s walk through how to use dimensionality reduction for song features sourced from Spotify (mostly audio), with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Billboard Top 100 songs. 🎵&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/kE7H1oQ2rY4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to use dimensionality reduction for features of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-09-14/readme.md"&gt;Billboard Top 100 songs&lt;/a&gt;, connecting data about where the songs were in the rankings with mostly audio features available from Spotify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

## billboard ranking data
billboard &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv")

## spotify feature data
audio_features &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start by finding the longest streak each song was on this chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_weeks &amp;lt;-
  billboard %&amp;gt;%
  group_by(song_id) %&amp;gt;%
  summarise(weeks_on_chart = max(weeks_on_chart), .groups = "drop")

max_weeks


## # A tibble: 29,389 × 2
## song_id weeks_on_chart
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 -twistin'-White Silver SandsBill Black's Combo 2
## 2 ¿Dònde Està Santa Claus? (Where Is Santa Claus?)Augie Rios 4
## 3 ......And Roses And RosesAndy Williams 7
## 4 ...And Then There Were DrumsSandy Nelson 4
## 5 ...Baby One More TimeBritney Spears 32
## 6 ...Ready For It?Taylor Swift 19
## 7 '03 Bonnie &amp;amp; ClydeJay-Z Featuring Beyonce Knowles 23
## 8 '65 Love AffairPaul Davis 20
## 9 '98 Thug ParadiseTragedy, Capone, Infinite 5
## 10 'Round We GoBig Sister 2
## # … with 29,379 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s join this with the Spotify audio features (where available). We don’t have Spotify features for all the songs, and it’s possible that there are systematic differences in songs that we could vs. could not get Spotify data for. Something to keep in mind!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billboard_joined &amp;lt;-
  audio_features %&amp;gt;%
  filter(!is.na(spotify_track_popularity)) %&amp;gt;%
  inner_join(max_weeks)

billboard_joined


## # A tibble: 24,395 × 23
## song_id performer song spotify_genre spotify_track_id spotify_track_pr…
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;            
## 1 ......An… Andy Will… .....… ['adult stand… 3tvqPPpXyIgKrm4… https://p.scdn.c…
## 2 ...And T… Sandy Nel… ...An… ['rock-and-ro… 1fHHq3qHU8wpRKH… &amp;lt;NA&amp;gt;             
## 3 ...Baby … Britney S… ...Ba… ['dance pop',… 3MjUtNVVq3C8Fn0… https://p.scdn.c…
## 4 ...Ready… Taylor Sw… ...Re… ['pop', 'post… 2yLa0QULdQr0qAI… &amp;lt;NA&amp;gt;             
## 5 '03 Bonn… Jay-Z Fea… '03 B… ['east coast … 5ljCWsDlSyJ41kw… &amp;lt;NA&amp;gt;             
## 6 '65 Love… Paul Davis '65 L… ['album rock'… 5nBp8F6tekSrnFg… https://p.scdn.c…
## 7 'til I C… Tammy Wyn… 'til … ['country', '… 0aJHZYjwbfTmeyU… https://p.scdn.c…
## 8 'Til My … Luther Va… 'Til … ['funk', 'mot… 2R97RZWUx4vAFbM… https://p.scdn.c…
## 9 'Til Sum… Keith Urb… 'Til … ['australian … 1CKmI1IQjVEVB3F… &amp;lt;NA&amp;gt;             
## 10 'Til You… After 7 'Til … ['funk', 'neo… 3kGMziz884MLV1o… &amp;lt;NA&amp;gt;             
## # … with 24,385 more rows, and 17 more variables:
## # spotify_track_duration_ms &amp;lt;dbl&amp;gt;, spotify_track_explicit &amp;lt;lgl&amp;gt;,
## # spotify_track_album &amp;lt;chr&amp;gt;, danceability &amp;lt;dbl&amp;gt;, energy &amp;lt;dbl&amp;gt;, key &amp;lt;dbl&amp;gt;,
## # loudness &amp;lt;dbl&amp;gt;, mode &amp;lt;dbl&amp;gt;, speechiness &amp;lt;dbl&amp;gt;, acousticness &amp;lt;dbl&amp;gt;,
## # instrumentalness &amp;lt;dbl&amp;gt;, liveness &amp;lt;dbl&amp;gt;, valence &amp;lt;dbl&amp;gt;, tempo &amp;lt;dbl&amp;gt;,
## # time_signature &amp;lt;dbl&amp;gt;, spotify_track_popularity &amp;lt;dbl&amp;gt;, weeks_on_chart &amp;lt;dbl&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some of the features we now have for each song are characteristics of the song like the time signature (3/4, 4/4, 5/4) and the tempo in BPM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billboard_joined %&amp;gt;%
  filter(tempo &amp;gt; 0, time_signature &amp;gt; 1) %&amp;gt;%
  ggplot(aes(tempo, fill = factor(time_signature))) +
  geom_histogram(alpha = 0.5, position = "identity") +
  labs(fill = "time signature")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aFokIrZW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aFokIrZW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-5-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pop songs like those on the Billboard chart are overwhelmingly in 4/4!&lt;/p&gt;

&lt;p&gt;There are other features available from Spotify as well, such as “danceability” and “loudness.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(corrr)

billboard_joined %&amp;gt;%
  select(danceability:weeks_on_chart) %&amp;gt;%
  na.omit() %&amp;gt;%
  correlate() %&amp;gt;%
  rearrange() %&amp;gt;%
  network_plot(colours = c("orange", "white", "midnightblue"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kxkWuK-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-6-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kxkWuK-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-6-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like only &lt;code&gt;spotify_track_popularity&lt;/code&gt; is really at all correlated with &lt;code&gt;weeks_on_chart&lt;/code&gt;. That popularity metric isn’t really an audio feature of the song per se, but it may be helpful to have such a feature as we learn more about how dimensionality reduction works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimensionality reduction
&lt;/h2&gt;

&lt;p&gt;In our book &lt;em&gt;Tidy Modeling with R&lt;/em&gt;, we recently published a chapter on &lt;a href="https://www.tmwr.org/dimensionality.html"&gt;dimensionality reduction&lt;/a&gt;. My post today walks through a briefer, more basic outline of some of the material from that chapter. Within the tidymodels framework, dimensionality reduction is a feature engineering or data preprocessing step, so we use &lt;a href="https://recipes.tidymodels.org/"&gt;recipes&lt;/a&gt; to implement this kind of analysis. I typically show how to use a data preprocessing recipe together with a model, but in this post, let’s focus just on recipes and how they work.&lt;/p&gt;

&lt;p&gt;Let’s still start by splitting our data into training and testing sets, so we can estimate (i.e. train) our preprocessing recipe on the training set and then apply that trained recipe to new data (our testing set).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
billboard_split &amp;lt;- billboard_joined %&amp;gt;%
  select(danceability:weeks_on_chart) %&amp;gt;%
  mutate(weeks_on_chart = log(weeks_on_chart)) %&amp;gt;%
  na.omit() %&amp;gt;%
  initial_split(strata = weeks_on_chart)

## how many observations in each set?
billboard_split


## &amp;lt;Analysis/Assess/Total&amp;gt;
## &amp;lt;18245/6084/24329&amp;gt;


billboard_train &amp;lt;- training(billboard_split)
billboard_test &amp;lt;- testing(billboard_split)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s make a basic starter recipe that we can work off of.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billboard_rec &amp;lt;-
  recipe(weeks_on_chart ~ ., data = billboard_train) %&amp;gt;%
  step_zv(all_numeric_predictors()) %&amp;gt;%
  step_normalize(all_numeric_predictors())

rec_trained &amp;lt;- prep(billboard_rec)
rec_trained


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 13
## 
## Training data contained 18245 data points and no missing data.
## 
## Operations:
## 
## Zero variance filter removed no terms [trained]
## Centering and scaling for danceability, energy, key, loudness, ... [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we &lt;code&gt;prep()&lt;/code&gt; the recipe, we use the training data to estimate the quantities we need to apply these steps to new data. So in this case, we would, for example, compute the mean and standard deviation from the training data in order to center and scale. The testing data will be centered and scaled with the mean and standard deviation from the training data.&lt;/p&gt;
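
&lt;p&gt;As a quick illustration of this idea, we can pull those stored statistics out of the trained recipe; &lt;code&gt;step_normalize()&lt;/code&gt; is the second step in our recipe, so its estimates live at &lt;code&gt;number = 2&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## means and standard deviations estimated from the training set
tidy(rec_trained, number = 2)

## baking new data reuses those stored statistics
## rather than re-estimating them from the new data
bake(rec_trained, new_data = billboard_test)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;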

&lt;p&gt;Next, let’s make a little helper function for us to extend this starter recipe. This function will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prep()&lt;/code&gt; the recipe (you can &lt;code&gt;prep()&lt;/code&gt; an already-prepped recipe, for example after you have added new steps)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bake()&lt;/code&gt; the recipe using our &lt;strong&gt;testing&lt;/strong&gt; data&lt;/li&gt;
&lt;li&gt;make a visualization of the results
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggforce)

plot_test_results &amp;lt;- function(recipe, dat = billboard_test) {
  recipe %&amp;gt;%
    prep() %&amp;gt;%
    bake(new_data = dat) %&amp;gt;%
    ggplot() +
    geom_autopoint(aes(color = weeks_on_chart), alpha = 0.4, size = 0.5) +
    geom_autodensity(alpha = .3) +
    facet_matrix(vars(-weeks_on_chart), layer.diag = 2) +
    scale_color_distiller(palette = "BuPu", direction = 1) +
    labs(color = "weeks (log)")
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PCA
&lt;/h3&gt;

&lt;p&gt;Let’s start with principal component analysis, one of the most straightforward dimensionality reduction approaches. It is linear, unsupervised, and makes new features that try to account for as much variation in the data as possible. Remember that our function estimates PCA components from our training data and then applies those to our testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pca(all_numeric_predictors(), num_comp = 4) %&amp;gt;%
  plot_test_results() +
  ggtitle("Principal Component Analysis")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ItpyqWPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ItpyqWPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-11-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks a bit underwhelming in terms of the components being connected to weeks on the chart, but there is a little bit of relationship.&lt;/p&gt;

&lt;p&gt;We &lt;a href="https://www.tmwr.org/recipes.html#tidy-a-recipe"&gt;can &lt;code&gt;tidy()&lt;/code&gt; recipes&lt;/a&gt;, either as a whole or for individual steps, and either before or after using &lt;code&gt;prep()&lt;/code&gt;. Let’s &lt;code&gt;tidy()&lt;/code&gt; this recipe to find the features that contribute the most to the PC components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pca(all_numeric_predictors(), num_comp = 4) %&amp;gt;%
  prep() %&amp;gt;%
  tidy(number = 3) %&amp;gt;%
  filter(component %in% paste0("PC", 1:4)) %&amp;gt;%
  group_by(component) %&amp;gt;%
  slice_max(abs(value), n = 5) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(abs(value), terms, fill = value &amp;gt; 0)) +
  geom_col(alpha = 0.8) +
  facet_wrap(vars(component), scales = "free_y") +
  labs(x = "Contribution to principal component", y = NULL, fill = "Positive?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wajsp321--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-12-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wajsp321--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-12-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve &lt;a href="https://juliasilge.com/blog/best-hip-hop/"&gt;implemented PCA for these features before&lt;/a&gt;. The results this time for a different sample of songs aren’t exactly the same but have some qualitative similarities; we see that the first component is mostly about loudness/energy vs. acoustic while the second is about valence, where high valence means more positive, cheerful, happy music.&lt;/p&gt;

&lt;h3&gt;
  
  
  PLS
&lt;/h3&gt;

&lt;p&gt;Partial least squares is a lot like PCA, but it is &lt;strong&gt;supervised&lt;/strong&gt;; it makes components that try to account for a lot of variation but are also related to the outcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pls(all_numeric_predictors(), outcome = "weeks_on_chart", num_comp = 4) %&amp;gt;%
  plot_test_results() +
  ggtitle("Partial Least Squares")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ahu2W7fj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ahu2W7fj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-13-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We do see a stronger relationship to weeks on the chart here, as we would hope, since PLS is supervised.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pls(all_numeric_predictors(), outcome = "weeks_on_chart", num_comp = 4) %&amp;gt;%
  prep() %&amp;gt;%
  tidy(number = 3) %&amp;gt;%
  filter(component %in% paste0("PLS", 1:4)) %&amp;gt;%
  group_by(component) %&amp;gt;%
  slice_max(abs(value), n = 5) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(abs(value), terms, fill = value &amp;gt; 0)) +
  geom_col(alpha = 0.8) +
  facet_wrap(vars(component), scales = "free_y") +
  labs(x = "Contribution to PLS component", y = NULL, fill = "Positive?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UtatbsVP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UtatbsVP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spotify popularity feature, which like we said before is not really an audio feature, is now a big contributor to the first couple of components.&lt;/p&gt;

&lt;h3&gt;
  
  
  UMAP
&lt;/h3&gt;

&lt;p&gt;Uniform manifold approximation and projection (UMAP) is another dimensionality reduction technique, but it works very differently from either PCA or PLS. It is not linear; instead, it starts by finding nearest neighbors for the observations in the high-dimensional space, builds a graph network from them, and then creates a new lower-dimensional space based on that graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(embed)

rec_trained %&amp;gt;%
  step_umap(all_numeric_predictors(), num_comp = 4) %&amp;gt;%
  plot_test_results() +
  ggtitle("UMAP")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yjaQIK2u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yjaQIK2u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;UMAP is very good at making little clusters in the new reduced space, but notice that in our case they aren’t very connected to weeks on the chart. UMAP results can seem very appealing, but it’s good to understand how arbitrary some of the structure we see here is, and, more generally, &lt;a href="https://twitter.com/lpachter/status/1431325969411821572"&gt;this algorithm’s limitations&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Fit and predict with tidymodels for bird baths in Australia 🇦🇺</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 01 Sep 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/fit-and-predict-with-tidymodels-for-bird-baths-in-australia-1fo0</link>
      <guid>https://dev.to/juliasilge/fit-and-predict-with-tidymodels-for-bird-baths-in-australia-1fo0</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast is good for folks who are newer to modeling or tidymodels; it focuses on how to use feature engineering together with a model algorithm and how to fit and predict, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on bird baths in Australia. 🐦&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/NXot3Q0QtGk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict whether we’ll see a bird at a &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-31/readme.md"&gt;bird bath in Australia&lt;/a&gt;, given info like what kind of bird we’re looking for and whether the bird bath is in an urban or rural location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

bird_baths &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-31/bird_baths.csv")

bird_baths %&amp;gt;%
  count(urban_rural)


## # A tibble: 3 × 2
## urban_rural n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Rural 49686
## 2 Urban 111202
## 3 &amp;lt;NA&amp;gt; 169

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that there are some summary rows in the dataset with &lt;code&gt;NA&lt;/code&gt; values for &lt;code&gt;urban_rural&lt;/code&gt;, &lt;code&gt;survey_year&lt;/code&gt;, etc. We can use that to choose some top bird types to focus on, instead of all the many bird types included in this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top_birds &amp;lt;-
  bird_baths %&amp;gt;%
  filter(is.na(urban_rural)) %&amp;gt;%
  arrange(-bird_count) %&amp;gt;%
  slice_max(bird_count, n = 15) %&amp;gt;%
  pull(bird_type)

top_birds


## [1] "Noisy Miner" "Australian Magpie" "Rainbow Lorikeet"  
## [4] "Red Wattlebird" "Superb Fairy-wren" "Magpie-lark"       
## [7] "Pied Currawong" "Crimson Rosella" "Eastern Spinebill" 
## [10] "Spotted Dove" "Lewin's Honeyeater" "Satin Bowerbird"   
## [13] "Crested Pigeon" "Grey Fantail" "Red-browed Finch"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How likely were the citizen scientists who collected this data to see birds of different types, in different locations?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_parsed &amp;lt;-
  bird_baths %&amp;gt;%
  filter(
    !is.na(urban_rural),
    bird_type %in% top_birds
  ) %&amp;gt;%
  group_by(urban_rural, bird_type) %&amp;gt;%
  summarise(bird_count = mean(bird_count), .groups = "drop")

p1 &amp;lt;-
  bird_parsed %&amp;gt;%
  ggplot(aes(bird_count, bird_type)) +
  geom_segment(
    data = bird_parsed %&amp;gt;%
      pivot_wider(
        names_from = urban_rural,
        values_from = bird_count
      ),
    aes(x = Rural, xend = Urban, y = bird_type, yend = bird_type),
    alpha = 0.7, color = "gray70", size = 1.5
  ) +
  geom_point(aes(color = urban_rural), size = 3) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Probability of seeing bird", y = NULL, color = NULL)

p1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4a5u87hF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4a5u87hF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Superb fairy-wrens are more rural, while noisy miners are more urban.&lt;/p&gt;

&lt;p&gt;Let’s build a model to predict this probability of seeing a bird using just these two predictors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_df &amp;lt;-
  bird_baths %&amp;gt;%
  filter(
    !is.na(urban_rural),
    bird_type %in% top_birds
  ) %&amp;gt;%
  mutate(bird_count = if_else(bird_count &amp;gt; 0, "bird", "no bird")) %&amp;gt;%
  mutate_if(is.character, as.factor)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Build a first model&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We are going to use a simple logistic regression model that is unlikely to overfit, but let’s still split our data into training and testing, and then create resampling folds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
bird_split &amp;lt;- initial_split(bird_df, strata = bird_count)
bird_train &amp;lt;- training(bird_split)
bird_test &amp;lt;- testing(bird_split)

set.seed(234)
bird_folds &amp;lt;- vfold_cv(bird_train, strata = bird_count)
bird_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [9637/1072]&amp;gt; Fold01
## 2 &amp;lt;split [9638/1071]&amp;gt; Fold02
## 3 &amp;lt;split [9638/1071]&amp;gt; Fold03
## 4 &amp;lt;split [9638/1071]&amp;gt; Fold04
## 5 &amp;lt;split [9638/1071]&amp;gt; Fold05
## 6 &amp;lt;split [9638/1071]&amp;gt; Fold06
## 7 &amp;lt;split [9638/1071]&amp;gt; Fold07
## 8 &amp;lt;split [9638/1071]&amp;gt; Fold08
## 9 &amp;lt;split [9639/1070]&amp;gt; Fold09
## 10 &amp;lt;split [9639/1070]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll make a couple of attempts at fitting models here, but they will all use straightforward logistic regression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glm_spec &amp;lt;- logistic_reg()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this first model, let’s set up our feature engineering recipe with our &lt;strong&gt;outcome&lt;/strong&gt; and two &lt;strong&gt;predictors&lt;/strong&gt;, and begin with only one preprocessing step to transform our nominal (factor or character, like &lt;code&gt;urban_rural&lt;/code&gt; and &lt;code&gt;bird_type&lt;/code&gt;) predictors to &lt;a href="https://www.tmwr.org/recipes.html#dummies"&gt;dummy or indicator variables&lt;/a&gt;. Then let’s put our preprocessing recipe together with our model specification in a workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_basic &amp;lt;-
  recipe(bird_count ~ urban_rural + bird_type, data = bird_train) %&amp;gt;%
  step_dummy(all_nominal_predictors())

wf_basic &amp;lt;- workflow(rec_basic, glm_spec)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could fit this one time to the training data, but to get better estimates of performance, let’s fit 10 times to our 10 resampling folds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()
ctrl_preds &amp;lt;- control_resamples(save_pred = TRUE)
rs_basic &amp;lt;- fit_resamples(wf_basic, bird_folds, control = ctrl_preds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did this turn out? If we look at some overall metrics, accuracy does not look so bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(rs_basic)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.822 10 0.0000762 Preprocessor1_Model1
## 2 roc_auc binary 0.601 10 0.00783 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
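&lt;p&gt;As a quick sanity check (a sketch, not part of the original analysis), we can look at how imbalanced the outcome is in the training data; a model that always predicts the majority class would score about this same accuracy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## proportion of each outcome class in the training set
bird_train %&amp;gt;%
  count(bird_count) %&amp;gt;%
  mutate(prop = n / sum(n))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;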



&lt;p&gt;This is because there were not many birds overall, though! The model is just saying “no bird” everywhere and getting good accuracy. The ROC curve, on the other hand, looks not so great.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;augment(rs_basic) %&amp;gt;%
  roc_curve(bird_count, .pred_bird) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MKQKfTXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MKQKfTXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-11-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Add interactions&lt;/h2&gt;

&lt;p&gt;We know from the plot we made during EDA that there are interactions between whether a bird bath is urban/rural and what kinds of birds we see there; we could model these interactions either with a model type that can handle it natively (like trees) or with explicit interaction terms like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_interact &amp;lt;-
  rec_basic %&amp;gt;%
  step_interact(~ starts_with("urban_rural"):starts_with("bird_type"))

wf_interact &amp;lt;- workflow(rec_interact, glm_spec)
rs_interact &amp;lt;- fit_resamples(wf_interact, bird_folds, control = ctrl_preds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did &lt;em&gt;this&lt;/em&gt; do, our same logistic regression model specification but now with interactions?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(rs_interact)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.822 10 0.0000762 Preprocessor1_Model1
## 2 roc_auc binary 0.669 10 0.00660 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The accuracy is about the same (since the model is always predicting “no bird”) but the probabilities look better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;augment(rs_interact) %&amp;gt;%
  roc_curve(bird_count, .pred_bird) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LgMrQmh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LgMrQmh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
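&lt;p&gt;We can also put the resampled ROC AUC estimates for the two models side by side. A sketch, assuming both sets of resampling results (&lt;code&gt;rs_basic&lt;/code&gt; and &lt;code&gt;rs_interact&lt;/code&gt;) are still in memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## compare ROC AUC with and without interaction terms
bind_rows(
  collect_metrics(rs_basic) %&amp;gt;% mutate(model = "basic"),
  collect_metrics(rs_interact) %&amp;gt;% mutate(model = "interactions")
) %&amp;gt;%
  filter(.metric == "roc_auc") %&amp;gt;%
  select(model, mean, std_err)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;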

&lt;h2&gt;Evaluate model on new data&lt;/h2&gt;

&lt;p&gt;Let’s stick with this model, logistic regression together with interactions between urban/rural and bird type. We can fit the model one time to the entire training set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_fit &amp;lt;- fit(wf_interact, bird_train)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now this trained model is ready to be applied to new data. For example, we can predict on the test set to get class probabilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict(bird_fit, bird_test, type = "prob")


## # A tibble: 3,571 × 2
## .pred_bird `.pred_no bird`
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.213 0.787
## 2 0.123 0.877
## 3 0.141 0.859
## 4 0.283 0.717
## 5 0.119 0.881
## 6 0.252 0.748
## 7 0.0380 0.962
## 8 0.123 0.877
## 9 0.129 0.871
## 10 0.119 0.881
## # … with 3,561 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In fact, we can predict on any kind of new data that has the right input variables. Let’s make some ourselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_bird_data &amp;lt;-
  tibble(bird_type = top_birds) %&amp;gt;%
  crossing(urban_rural = c("Urban", "Rural"))

new_bird_data


## # A tibble: 30 × 2
## bird_type urban_rural
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;      
## 1 Australian Magpie Rural      
## 2 Australian Magpie Urban      
## 3 Crested Pigeon Rural      
## 4 Crested Pigeon Urban      
## 5 Crimson Rosella Rural      
## 6 Crimson Rosella Urban      
## 7 Eastern Spinebill Rural      
## 8 Eastern Spinebill Urban      
## 9 Grey Fantail Rural      
## 10 Grey Fantail Urban      
## # … with 20 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use a &lt;a href="https://parsnip.tidymodels.org/reference/augment.html"&gt;helpful function like &lt;code&gt;augment()&lt;/code&gt;&lt;/a&gt; to take this new data and “augment” it with predicted probabilities and class predictions, and we can &lt;a href="https://parsnip.tidymodels.org/reference/predict.model_fit.html"&gt;use &lt;code&gt;predict()&lt;/code&gt; with specific &lt;code&gt;type&lt;/code&gt; arguments&lt;/a&gt; to return specialized predictions like confidence intervals. Let’s bind these together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_preds &amp;lt;-
  augment(bird_fit, new_bird_data) %&amp;gt;%
  bind_cols(
    predict(bird_fit, new_bird_data, type = "conf_int")
  )

bird_preds


## # A tibble: 30 × 9
## bird_type urban_rural .pred_class .pred_bird `.pred_no bird` .pred_lower_bird
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Australi… Rural no bird 0.245 0.755 0.193 
## 2 Australi… Urban no bird 0.287 0.713 0.249 
## 3 Crested … Rural no bird 0.0826 0.917 0.0526
## 4 Crested … Urban no bird 0.141 0.859 0.113 
## 5 Crimson … Rural no bird 0.215 0.785 0.166 
## 6 Crimson … Urban no bird 0.123 0.877 0.0969
## 7 Eastern … Rural no bird 0.283 0.717 0.227 
## 8 Eastern … Urban no bird 0.0973 0.903 0.0736
## 9 Grey Fan… Rural no bird 0.254 0.746 0.200 
## 10 Grey Fan… Urban no bird 0.0614 0.939 0.0435
## # … with 20 more rows, and 3 more variables: .pred_upper_bird &amp;lt;dbl&amp;gt;,
## # .pred_lower_no bird &amp;lt;dbl&amp;gt;, .pred_upper_no bird &amp;lt;dbl&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s visualize these predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p2 &amp;lt;-
  bird_preds %&amp;gt;%
  ggplot(aes(.pred_bird, bird_type, color = urban_rural)) +
  geom_errorbar(aes(
    xmin = .pred_lower_bird,
    xmax = .pred_upper_bird
  ),
  width = .2, size = 1.2, alpha = 0.5
  ) +
  geom_point(size = 2.5) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Predicted probability of seeing bird", y = NULL, color = NULL)

p2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X5JvQ-lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-19-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X5JvQ-lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-19-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actually, let’s put this together with our earlier plot!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(patchwork)

p1 + p2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jqPH5rgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-20-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jqPH5rgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-20-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Modeling human/computer interactions on Star Trek 🖖</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Tue, 24 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/modeling-human-computer-interactions-on-star-trek-4l6b</link>
      <guid>https://dev.to/juliasilge/modeling-human-computer-interactions-on-star-trek-4l6b</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast is on a more advanced topic, how to evaluate multiple combinations of feature engineering and modeling approaches via &lt;a href="https://workflowsets.tidymodels.org/"&gt;workflowsets&lt;/a&gt;, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Star Trek human/computer interactions.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_gVHRqz8GIE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;Explore data&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict which &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-17/readme.md"&gt;computer interactions from Star Trek&lt;/a&gt; were spoken by a person and which were spoken by the computer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
computer_raw &amp;lt;- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-17/computer.csv")

computer_raw %&amp;gt;%
  distinct(value_id, .keep_all = TRUE) %&amp;gt;%
  count(char_type)


## # A tibble: 2 × 2
## char_type n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Computer 178
## 2 Person 234

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which words are more likely to be spoken by a computer vs. by a person?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)
library(tidylo)

computer_counts &amp;lt;-
  computer_raw %&amp;gt;%
  distinct(value_id, .keep_all = TRUE) %&amp;gt;%
  unnest_tokens(word, interaction) %&amp;gt;%
  count(char_type, word, sort = TRUE)

computer_counts %&amp;gt;%
  bind_log_odds(char_type, word, n) %&amp;gt;%
  filter(n &amp;gt; 10) %&amp;gt;%
  group_by(char_type) %&amp;gt;%
  slice_max(log_odds_weighted, n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(log_odds_weighted,
    fct_reorder(word, log_odds_weighted),
    fill = char_type
  )) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(vars(char_type), scales = "free_y") +
  labs(y = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0bT84wFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0bT84wFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that stop words are among the words with highest weighted log odds; they are very informative in this situation.&lt;/p&gt;

&lt;h2&gt;Build and compare models&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” This is a &lt;em&gt;very&lt;/em&gt; small dataset so we won’t expect to see amazing results from our model, but it is fun and a nice way to demonstrate some of these concepts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)

comp_split &amp;lt;-
  computer_raw %&amp;gt;%
  distinct(value_id, .keep_all = TRUE) %&amp;gt;%
  select(char_type, interaction) %&amp;gt;%
  initial_split(prop = 0.8, strata = char_type)

comp_train &amp;lt;- training(comp_split)
comp_test &amp;lt;- testing(comp_split)

set.seed(234)
comp_folds &amp;lt;- bootstraps(comp_train, strata = char_type)
comp_folds


## # Bootstrap sampling using stratification 
## # A tibble: 25 × 2
## splits id         
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;      
## 1 &amp;lt;split [329/118]&amp;gt; Bootstrap01
## 2 &amp;lt;split [329/128]&amp;gt; Bootstrap02
## 3 &amp;lt;split [329/134]&amp;gt; Bootstrap03
## 4 &amp;lt;split [329/124]&amp;gt; Bootstrap04
## 5 &amp;lt;split [329/118]&amp;gt; Bootstrap05
## 6 &amp;lt;split [329/116]&amp;gt; Bootstrap06
## 7 &amp;lt;split [329/106]&amp;gt; Bootstrap07
## 8 &amp;lt;split [329/124]&amp;gt; Bootstrap08
## 9 &amp;lt;split [329/121]&amp;gt; Bootstrap09
## 10 &amp;lt;split [329/121]&amp;gt; Bootstrap10
## # … with 15 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it comes to feature engineering, we don’t know ahead of time if we should remove stop words, or center and scale the predictors, or balance the classes. Let’s create feature engineering recipes that do &lt;em&gt;all&lt;/em&gt; of these things so we can compare how they perform.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(textrecipes)
library(themis)

rec_all &amp;lt;-
  recipe(char_type ~ interaction, data = comp_train) %&amp;gt;%
  step_tokenize(interaction) %&amp;gt;%
  step_tokenfilter(interaction, max_tokens = 80) %&amp;gt;%
  step_tfidf(interaction)

rec_all_norm &amp;lt;-
  rec_all %&amp;gt;%
  step_normalize(all_predictors())

rec_all_smote &amp;lt;-
  rec_all_norm %&amp;gt;%
  step_smote(char_type)

## we can `prep()` just to check if it works
prep(rec_all_smote)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 1
## 
## Training data contained 329 data points and no missing data.
## 
## Operations:
## 
## Tokenization for interaction [trained]
## Text filtering for interaction [trained]
## Term frequency-inverse document frequency with interaction [trained]
## Centering and scaling for tfidf_interaction_a, ... [trained]
## SMOTE based on char_type [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s build the same set of recipes, this time removing stop words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_stop &amp;lt;-
  recipe(char_type ~ interaction, data = comp_train) %&amp;gt;%
  step_tokenize(interaction) %&amp;gt;%
  step_stopwords(interaction) %&amp;gt;%
  step_tokenfilter(interaction, max_tokens = 80) %&amp;gt;%
  step_tfidf(interaction)

rec_stop_norm &amp;lt;-
  rec_stop %&amp;gt;%
  step_normalize(all_predictors())

rec_stop_smote &amp;lt;-
  rec_stop_norm %&amp;gt;%
  step_smote(char_type)

## again, let's check it
prep(rec_stop_smote)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 1
## 
## Training data contained 329 data points and no missing data.
## 
## Operations:
## 
## Tokenization for interaction [trained]
## Stop word removal for interaction [trained]
## Text filtering for interaction [trained]
## Term frequency-inverse document frequency with interaction [trained]
## Centering and scaling for 80 items [trained]
## SMOTE based on char_type [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s try out two kinds of models that often work well for text data, a support vector machine and a naive Bayes model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(discrim)

nb_spec &amp;lt;-
  naive_Bayes() %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("naivebayes")

nb_spec


## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes


svm_spec &amp;lt;-
  svm_linear() %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("LiblineaR")

svm_spec


## Linear Support Vector Machine Specification (classification)
## 
## Computational engine: LiblineaR

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can put all these together in a &lt;a href="https://workflowsets.tidymodels.org/"&gt;workflowset&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comp_models &amp;lt;-
  workflow_set(
    preproc = list(
      all = rec_all,
      all_norm = rec_all_norm,
      all_smote = rec_all_smote,
      stop = rec_stop,
      stop_norm = rec_stop_norm,
      stop_smote = rec_stop_smote
    ),
    models = list(nb = nb_spec, svm = svm_spec),
    cross = TRUE
  )

comp_models


## # A workflow set/tibble: 12 × 4
## wflow_id info option result    
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;    
## 1 all_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 2 all_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 3 all_norm_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 4 all_norm_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 5 all_smote_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 6 all_smote_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 7 stop_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 8 stop_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 9 stop_norm_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 10 stop_norm_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 11 stop_smote_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 12 stop_smote_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of these models have any tuning parameters, so next let’s use &lt;code&gt;fit_resamples()&lt;/code&gt; to evaluate how each of these combinations of feature engineering recipes and model specifications performs, using our bootstrap resamples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(123)
doParallel::registerDoParallel()

computer_rs &amp;lt;-
  comp_models %&amp;gt;%
  workflow_map(
    "fit_resamples",
    resamples = comp_folds,
    metrics = metric_set(accuracy, sensitivity, specificity)
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can make a quick high-level visualization of these results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(computer_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ImBt4LDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ImBt4LDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of the SVMs did better than all of the naive Bayes models, at least as far as overall accuracy. We can also dig deeper and explore the results more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rank_results(computer_rs) %&amp;gt;%
  filter(.metric == "accuracy")


## # A tibble: 12 × 9
## wflow_id .config .metric mean std_err n preprocessor model rank
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 all_svm Preprocess… accuracy 0.679 0.00655 25 recipe svm_l… 1
## 2 all_norm_… Preprocess… accuracy 0.658 0.00756 25 recipe svm_l… 2
## 3 stop_svm Preprocess… accuracy 0.652 0.00700 25 recipe svm_l… 3
## 4 all_smote… Preprocess… accuracy 0.650 0.00611 25 recipe svm_l… 4
## 5 stop_norm… Preprocess… accuracy 0.646 0.00753 25 recipe svm_l… 5
## 6 stop_smot… Preprocess… accuracy 0.632 0.00914 25 recipe svm_l… 6
## 7 all_norm_… Preprocess… accuracy 0.589 0.00678 25 recipe naive… 7
## 8 all_smote… Preprocess… accuracy 0.575 0.0115 25 recipe naive… 8
## 9 stop_smot… Preprocess… accuracy 0.573 0.00971 25 recipe naive… 9
## 10 stop_norm… Preprocess… accuracy 0.571 0.00950 25 recipe naive… 10
## 11 all_nb Preprocess… accuracy 0.570 0.0102 25 recipe naive… 11
## 12 stop_nb Preprocess… accuracy 0.559 0.0120 25 recipe naive… 12

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some interesting things to note are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how balancing the classes via SMOTE does in fact change sensitivity and specificity the way we would expect&lt;/li&gt;
&lt;li&gt;that removing stop words looks like mostly a &lt;strong&gt;bad&lt;/strong&gt; idea!&lt;/li&gt;
&lt;/ul&gt;
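&lt;p&gt;To see those patterns in the numbers rather than the plot, we can rank the workflows by the other metrics as well. A sketch, assuming the same &lt;code&gt;computer_rs&lt;/code&gt; object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## compare sensitivity and specificity across all 12 workflows
rank_results(computer_rs) %&amp;gt;%
  filter(.metric != "accuracy") %&amp;gt;%
  select(wflow_id, .metric, mean)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;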

&lt;h2&gt;Train and evaluate final model&lt;/h2&gt;

&lt;p&gt;Let’s say that we want to keep overall accuracy high, so we pick &lt;code&gt;rec_all&lt;/code&gt; and &lt;code&gt;svm_spec&lt;/code&gt;. We can use &lt;code&gt;last_fit()&lt;/code&gt; to &lt;strong&gt;fit&lt;/strong&gt; one time to all the training data and &lt;strong&gt;evaluate&lt;/strong&gt; one time on the testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comp_wf &amp;lt;- workflow(rec_all, svm_spec)

comp_fitted &amp;lt;-
  last_fit(
    comp_wf,
    comp_split,
    metrics = metric_set(accuracy, sensitivity, specificity)
  )

comp_fitted


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [329/83]&amp;gt; train/test split &amp;lt;tibble [… &amp;lt;tibble … &amp;lt;tibble [83 … &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did that turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(comp_fitted)


## # A tibble: 3 × 4
## .metric .estimator .estimate .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.735 Preprocessor1_Model1
## 2 sens binary 0.611 Preprocessor1_Model1
## 3 spec binary 0.830 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also look at the predictions, and for example make a confusion matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(comp_fitted) %&amp;gt;%
  conf_mat(char_type, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Nre7XPu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Nre7XPu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was easier to identify people talking to computers than the other way around.&lt;/p&gt;

&lt;p&gt;Since this is a linear model, we can also look at the coefficients for words in the model, perhaps focusing on the terms with the largest effect sizes in each direction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract_workflow(comp_fitted) %&amp;gt;%
  tidy() %&amp;gt;%
  group_by(estimate &amp;gt; 0) %&amp;gt;%
  slice_max(abs(estimate), n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(term = str_remove(term, "tfidf_interaction_")) %&amp;gt;%
  ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate &amp;gt; 0)) +
  geom_col(alpha = 0.8) +
  scale_fill_discrete(labels = c("people", "computer")) +
  labs(y = NULL, fill = "More from...")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jKEzUHVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jKEzUHVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Predict housing prices 🏠 in Austin TX with xgboost</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Sun, 15 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predict-housing-prices-in-austin-tx-with-xgboost-54jb</link>
      <guid>https://dev.to/juliasilge/predict-housing-prices-in-austin-tx-with-xgboost-54jb</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. My screencasts lately have focused on xgboost as I have participated in &lt;a href="https://www.notion.so/SLICED-Show-c7bd26356e3a42279e2dfbafb0480073"&gt;SLICED&lt;/a&gt;, a competitive data science streaming show. This past week were the semifinals, where we competed to predict prices of homes in Austin, TX. 🏠 One of the more interesting available variables for this dataset was the text description of the real estate listings, so let’s walk through one way to incorporate text information with boosted tree modeling.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/1LEW8APSOJo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict &lt;a href="https://www.kaggle.com/c/sliced-s01e11-semifinals/"&gt;the price (binned) for homes in Austin, TX&lt;/a&gt; given features about the real estate listing. This is a multiclass classification challenge, where we needed to submit a probability for each home being in each &lt;code&gt;priceRange&lt;/code&gt; bin. The main data set provided is in a CSV file called &lt;code&gt;train.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
train_raw &amp;lt;- read_csv("train.csv")

train_raw %&amp;gt;%
  count(priceRange)


## # A tibble: 5 × 2
## priceRange n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 0-250000 1249
## 2 250000-350000 2356
## 3 350000-450000 2301
## 4 450000-650000 2275
## 5 650000+ 1819

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can watch &lt;a href="https://www.twitch.tv/videos/1114553508"&gt;this week’s full episode of SLICED&lt;/a&gt; to see lots of exploratory data analysis and visualization of this dataset, but let’s just make a few data visualizations for context in this blog post.&lt;/p&gt;

&lt;p&gt;How is price distributed across Austin?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;price_plot &amp;lt;-
  train_raw %&amp;gt;%
  mutate(priceRange = parse_number(priceRange)) %&amp;gt;%
  ggplot(aes(longitude, latitude, z = priceRange)) +
  stat_summary_hex(alpha = 0.8, bins = 50) +
  scale_fill_viridis_c() +
  labs(
    fill = "mean",
    title = "Price"
  )

price_plot

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pWqm1qcT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pWqm1qcT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s compare this distribution to some other variables available in the dataset. We can create a little plotting function &lt;a href="https://dplyr.tidyverse.org/articles/programming.html#indirection"&gt;using &lt;code&gt;{{}}&lt;/code&gt;&lt;/a&gt; to iterate through several variables quickly, and put the plots together with &lt;a href="https://patchwork.data-imaginist.com/"&gt;patchwork&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(patchwork)

plot_austin &amp;lt;- function(var, title) {
  train_raw %&amp;gt;%
    ggplot(aes(longitude, latitude, z = {{ var }})) +
    stat_summary_hex(alpha = 0.8, bins = 50) +
    scale_fill_viridis_c() +
    labs(
      fill = "mean",
      title = title
    )
}

(price_plot + plot_austin(avgSchoolRating, "School rating")) /
  (plot_austin(yearBuilt, "Year built") + plot_austin(log(lotSizeSqFt), "Lot size (log)"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gRoHjMsG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gRoHjMsG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the east/west gradients as well as the radial changes. I went to grad school in Austin and this all looks very familiar to me!&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding words related to price
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;description&lt;/code&gt; variable contains text from each real estate listing. We could try to use the text features directly in modeling, &lt;a href="https://smltar.com/"&gt;as described in our book&lt;/a&gt;, but I’ve found that often isn’t great for boosted tree models (which tend to work best overall in an environment like SLICED). Let’s walk through another option that may work better in some situations: use a separate analysis to identify important words and then create dummy variables indicating whether a given listing contains those words.&lt;/p&gt;

&lt;p&gt;Let’s start by tidying the &lt;code&gt;description&lt;/code&gt; text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)

austin_tidy &amp;lt;-
  train_raw %&amp;gt;%
  mutate(priceRange = parse_number(priceRange) + 100000) %&amp;gt;%
  unnest_tokens(word, description) %&amp;gt;%
  anti_join(get_stopwords())

austin_tidy %&amp;gt;%
  count(word, sort = TRUE)


## # A tibble: 17,944 × 2
## word n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 home 11620
## 2 kitchen 5721
## 3 room 5494
## 4 austin 4918
## 5 new 4772
## 6 large 4771
## 7 2 4585
## 8 bedrooms 4571
## 9 contains 4413
## 10 3 4386
## # … with 17,934 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s compute word frequencies per price range for the top 100 words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top_words &amp;lt;-
  austin_tidy %&amp;gt;%
  count(word, sort = TRUE) %&amp;gt;%
  filter(!word %in% as.character(1:5)) %&amp;gt;%
  slice_max(n, n = 100) %&amp;gt;%
  pull(word)

word_freqs &amp;lt;-
  austin_tidy %&amp;gt;%
  count(word, priceRange) %&amp;gt;%
  complete(word, priceRange, fill = list(n = 0)) %&amp;gt;%
  group_by(priceRange) %&amp;gt;%
  mutate(
    price_total = sum(n),
    proportion = n / price_total
  ) %&amp;gt;%
  ungroup() %&amp;gt;%
  filter(word %in% top_words)

word_freqs


## # A tibble: 500 × 5
## word priceRange n price_total proportion
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 access 100000 180 56290 0.00320
## 2 access 350000 365 114853 0.00318
## 3 access 450000 322 116678 0.00276
## 4 access 550000 294 125585 0.00234
## 5 access 750000 248 112073 0.00221
## 6 appliances 100000 209 56290 0.00371
## 7 appliances 350000 583 114853 0.00508
## 8 appliances 450000 576 116678 0.00494
## 9 appliances 550000 567 125585 0.00451
## 10 appliances 750000 391 112073 0.00349
## # … with 490 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s use modeling to find the words that are &lt;strong&gt;increasing&lt;/strong&gt; with price and those that are &lt;strong&gt;decreasing&lt;/strong&gt; with price.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_mods &amp;lt;-
  word_freqs %&amp;gt;%
  nest(data = c(priceRange, n, price_total, proportion)) %&amp;gt;%
  mutate(
    model = map(data, ~ glm(cbind(n, price_total) ~ priceRange, ., family = "binomial")),
    model = map(model, tidy)
  ) %&amp;gt;%
  unnest(model) %&amp;gt;%
  filter(term == "priceRange") %&amp;gt;%
  mutate(p.value = p.adjust(p.value)) %&amp;gt;%
  arrange(-estimate)

word_mods


## # A tibble: 100 × 7
## word data term estimate std.error statistic p.value
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 outdoor &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000325 1.85e-7 17.6 4.37e-67
## 2 custom &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000214 1.47e-7 14.6 3.98e-46
## 3 pool &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000159 1.22e-7 13.0 6.12e-37
## 4 office &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000150 1.46e-7 10.3 6.03e-23
## 5 suite &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000143 1.39e-7 10.3 4.03e-23
## 6 gorgeous &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000975 1.62e-7 6.02 1.19e- 7
## 7 w &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000920 9.05e-8 10.2 2.33e-22
## 8 windows &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000890 1.28e-7 6.95 2.81e-10
## 9 private &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000889 1.15e-7 7.70 1.08e-12
## 10 car &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000778 1.66e-7 4.69 1.52e- 4
## # … with 90 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s make something like a &lt;a href="https://en.wikipedia.org/wiki/Volcano_plot_%28statistics%29"&gt;volcano plot&lt;/a&gt; to see the relationship between p-value and effect size for these words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggrepel)

word_mods %&amp;gt;%
  ggplot(aes(estimate, p.value)) +
  geom_vline(xintercept = 0, lty = 2, alpha = 0.7, color = "gray50") +
  geom_point(color = "midnightblue", alpha = 0.8, size = 2.5) +
  scale_y_log10() +
  geom_text_repel(aes(label = word), family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q2peQP2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q2peQP2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-8-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Words like outdoor, custom, pool, suite, office &lt;strong&gt;increase&lt;/strong&gt; with price.&lt;/li&gt;
&lt;li&gt;Words like new, paint, carpet, great, tile, close, flooring &lt;strong&gt;decrease&lt;/strong&gt; with price.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the words that we’d like to try to detect and use in feature engineering for our xgboost model, rather than using all the text tokens as features individually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;higher_words &amp;lt;-
  word_mods %&amp;gt;%
  filter(p.value &amp;lt; 0.05) %&amp;gt;%
  slice_max(estimate, n = 12) %&amp;gt;%
  pull(word)

lower_words &amp;lt;-
  word_mods %&amp;gt;%
  filter(p.value &amp;lt; 0.05) %&amp;gt;%
  slice_max(-estimate, n = 12) %&amp;gt;%
  pull(word)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can look at these changes with price directly. For example, these are the words most associated with price decrease.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_freqs %&amp;gt;%
  filter(word %in% lower_words) %&amp;gt;%
  ggplot(aes(priceRange, proportion, color = word)) +
  geom_line(size = 2.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = NULL, y = "proportion of total words used for homes at that price") +
  theme_light(base_family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NCPEqkw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NCPEqkw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cheaper houses are described as “great” while expensive houses are not, and apparently you don’t need to mention the location (“close,” “minutes,” “location”) of more expensive houses.&lt;/p&gt;
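&lt;p&gt;For symmetry, we can make the same plot for the words most associated with price &lt;strong&gt;increase&lt;/strong&gt; just by swapping in &lt;code&gt;higher_words&lt;/code&gt;. This variation isn’t in the video, but it reuses the objects defined above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## not shown in the video: the same plot for the high-price words
word_freqs %&amp;gt;%
  filter(word %in% higher_words) %&amp;gt;%
  ggplot(aes(priceRange, proportion, color = word)) +
  geom_line(size = 2.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = NULL, y = "proportion of total words used for homes at that price")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
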

&lt;h2&gt;
  
  
  Build a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget,” as well as the metrics (this challenge was evaluated on multiclass log loss).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
austin_split &amp;lt;- train_raw %&amp;gt;%
  select(-city) %&amp;gt;%
  mutate(description = str_to_lower(description)) %&amp;gt;%
  initial_split(strata = priceRange)
austin_train &amp;lt;- training(austin_split)
austin_test &amp;lt;- testing(austin_split)
austin_metrics &amp;lt;- metric_set(accuracy, roc_auc, mn_log_loss)

set.seed(234)
austin_folds &amp;lt;- vfold_cv(austin_train, v = 5, strata = priceRange)
austin_folds


## # 5-fold cross-validation using stratification 
## # A tibble: 5 × 2
## splits id   
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;
## 1 &amp;lt;split [5996/1502]&amp;gt; Fold1
## 2 &amp;lt;split [5998/1500]&amp;gt; Fold2
## 3 &amp;lt;split [5999/1499]&amp;gt; Fold3
## 4 &amp;lt;split [5999/1499]&amp;gt; Fold4
## 5 &amp;lt;split [6000/1498]&amp;gt; Fold5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For feature engineering, let’s use basically everything in the dataset (aside from &lt;code&gt;city&lt;/code&gt;, which was not a very useful variable) and &lt;a href="https://recipes.tidymodels.org/reference/step_regex.html"&gt;create dummy or indicator variables using &lt;code&gt;step_regex()&lt;/code&gt;&lt;/a&gt;. The idea here is to detect the words associated with low or high price in each listing and create a yes/no variable indicating their presence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;higher_pat &amp;lt;- glue::glue_collapse(higher_words, sep = "|")
lower_pat &amp;lt;- glue::glue_collapse(lower_words, sep = "|")

austin_rec &amp;lt;-
  recipe(priceRange ~ ., data = austin_train) %&amp;gt;%
  update_role(uid, new_role = "uid") %&amp;gt;%
  step_regex(description, pattern = higher_pat, result = "high_price_words") %&amp;gt;%
  step_regex(description, pattern = lower_pat, result = "low_price_words") %&amp;gt;%
  step_rm(description) %&amp;gt;%
  step_novel(homeType) %&amp;gt;%
  step_unknown(homeType) %&amp;gt;%
  step_other(homeType, threshold = 0.02) %&amp;gt;%
  step_dummy(all_nominal_predictors()) %&amp;gt;%
  step_nzv(all_predictors())

austin_rec


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 13
## uid 1
## 
## Operations:
## 
## Regular expression dummy variable using `outdoor|custom|pool|office|suite|gorgeous|w|windows|private|car|high|full`
## Regular expression dummy variable using `carpet|paint|close|flooring|shopping|new|easy|minutes|tile|great|community|location`
## Delete terms description
## Novel factor level assignment for homeType
## Unknown factor level assignment for homeType
## Collapsing factor levels for homeType
## Dummy variables from all_nominal_predictors()
## Sparse, unbalanced variable filter on all_predictors()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s create a tunable xgboost model specification, tuning a lot of the important model hyperparameters, and combine it with our feature engineering recipe in a &lt;code&gt;workflow()&lt;/code&gt;. We can also create a custom &lt;code&gt;xgb_grid&lt;/code&gt; to specify which parameter values to try, like a not-too-small learning rate and avoiding tree stubs. I chose this parameter grid to get reasonable performance in a reasonable amount of tuning time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_spec &amp;lt;-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    mtry = tune(),
    sample_size = tune(),
    learn_rate = tune()
  ) %&amp;gt;%
  set_engine("xgboost") %&amp;gt;%
  set_mode("classification")

xgb_word_wf &amp;lt;- workflow(austin_rec, xgb_spec)

set.seed(123)
xgb_grid &amp;lt;-
  grid_max_entropy(
    tree_depth(c(5L, 10L)),
    min_n(c(10L, 40L)),
    mtry(c(5L, 10L)),
    sample_prop(c(0.5, 1.0)),
    learn_rate(c(-2, -1)),
    size = 20
  )

xgb_grid


## # A tibble: 20 × 5
## tree_depth min_n mtry sample_size learn_rate
## &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 7 33 8 0.768 0.0845
## 2 10 33 7 0.928 0.0784
## 3 5 21 6 0.626 0.0868
## 4 9 31 8 0.728 0.0162
## 5 8 35 5 0.666 0.0937
## 6 6 21 5 0.907 0.0105
## 7 6 27 6 0.982 0.0729
## 8 7 33 8 0.936 0.0102
## 9 7 15 5 0.559 0.0182
## 10 6 35 9 0.784 0.0347
## 11 9 39 9 0.737 0.0582
## 12 8 17 8 0.596 0.0818
## 13 9 21 7 0.601 0.0136
## 14 7 15 7 0.763 0.0197
## 15 6 12 10 0.800 0.0569
## 16 9 19 9 0.589 0.0138
## 17 10 14 5 0.829 0.0140
## 18 8 37 10 0.664 0.0202
## 19 5 11 5 0.514 0.0136
## 20 10 38 9 0.962 0.0150

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can tune across the grid of parameters and our resamples. Since we are trying quite a lot of hyperparameter combinations, let’s use &lt;a href="https://juliasilge.com/blog/baseball-racing/"&gt;racing&lt;/a&gt; to quit early on clearly bad hyperparameter combinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(finetune)
doParallel::registerDoParallel()

set.seed(234)
xgb_word_rs &amp;lt;-
  tune_race_anova(
    xgb_word_wf,
    austin_folds,
    grid = xgb_grid,
    metrics = metric_set(mn_log_loss),
    control = control_race(verbose_elim = TRUE)
  )

xgb_word_rs


## # Tuning results
## # 5-fold cross-validation using stratification 
## # A tibble: 5 × 5
## splits id .order .metrics .notes          
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;          
## 1 &amp;lt;split [5996/1502]&amp;gt; Fold1 3 &amp;lt;tibble [20 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 2 &amp;lt;split [5999/1499]&amp;gt; Fold3 1 &amp;lt;tibble [20 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 3 &amp;lt;split [6000/1498]&amp;gt; Fold5 2 &amp;lt;tibble [20 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 4 &amp;lt;split [5999/1499]&amp;gt; Fold4 4 &amp;lt;tibble [10 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 5 &amp;lt;split [5998/1500]&amp;gt; Fold2 5 &amp;lt;tibble [4 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That takes a little while but we did it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate results
&lt;/h2&gt;

&lt;p&gt;First off, how did the “race” go?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot_race(xgb_word_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u5KpWZPe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u5KpWZPe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can look at the top results manually as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(xgb_word_rs)


## # A tibble: 4 × 11
## mtry min_n tree_depth learn_rate sample_size .metric .estimator mean n
## &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;
## 1 5 14 10 0.0140 0.829 mn_log_l… multiclass 0.920 5
## 2 7 15 7 0.0197 0.763 mn_log_l… multiclass 0.921 5
## 3 5 15 7 0.0182 0.559 mn_log_l… multiclass 0.921 5
## 4 9 19 9 0.0138 0.589 mn_log_l… multiclass 0.923 5
## # … with 2 more variables: std_err &amp;lt;dbl&amp;gt;, .config &amp;lt;chr&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use &lt;code&gt;last_fit()&lt;/code&gt; to fit one final time to the &lt;strong&gt;training&lt;/strong&gt; data and evaluate one final time on the &lt;strong&gt;testing&lt;/strong&gt; data, with the numerically optimal result from &lt;code&gt;xgb_word_rs&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_last &amp;lt;-
  xgb_word_wf %&amp;gt;%
  finalize_workflow(select_best(xgb_word_rs, "mn_log_loss")) %&amp;gt;%
  last_fit(austin_split)

xgb_last


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [7498/2502]&amp;gt; train/test split &amp;lt;tibble … &amp;lt;tibble… &amp;lt;tibble [2,… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did this model perform on the testing data, which was not used in tuning or training?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  mn_log_loss(priceRange, `.pred_0-250000`:`.pred_650000+`)


## # A tibble: 1 × 3
## .metric .estimator .estimate
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 mn_log_loss multiclass 0.910

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This result is pretty good for a single (not ensembled) model and is a wee bit better than what I did during the SLICED competition. I had an R bomb right as I was finishing up tuning a model just like the one I am demonstrating here!&lt;/p&gt;

&lt;p&gt;How does this model perform across the different classes?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  conf_mat(priceRange, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F3D_pTAZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-19-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F3D_pTAZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-19-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also visualize this with an ROC curve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  roc_curve(priceRange, `.pred_0-250000`:`.pred_650000+`) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_path(alpha = 0.8, size = 1.2) +
  coord_equal() +
  labs(color = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5WDzJ6hI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-20-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5WDzJ6hI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-20-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that it is easier to identify the most expensive homes but more difficult to correctly classify the less expensive homes.&lt;/p&gt;
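&lt;p&gt;If we want numbers to go along with that observation, one option is to compute a one-vs-all ROC AUC for each price bin. This is a sketch that is not from the original post; it assumes the prediction columns follow the &lt;code&gt;.pred_&lt;/code&gt; naming used above, and treats each bin in turn as the “event” class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## a sketch, not in the video: one-vs-all ROC AUC per price bin
preds &amp;lt;- collect_predictions(xgb_last)

map_dfr(levels(preds$priceRange), function(lvl) {
  preds %&amp;gt;%
    mutate(truth = factor(ifelse(priceRange == lvl, "event", "other"),
                          levels = c("event", "other"))) %&amp;gt;%
    roc_auc(truth, !!sym(paste0(".pred_", lvl))) %&amp;gt;%
    mutate(class = lvl)
})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
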

&lt;p&gt;What features are most important for this xgboost model?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(vip)
extract_workflow(xgb_last) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vip(geom = "point", num_features = 15)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w23sUDKd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-21-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w23sUDKd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-21-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The spatial information in latitude/longitude is by far the most important. Notice that the model uses &lt;code&gt;low_price_words&lt;/code&gt; more than it uses, for example, whether there is a spa or whether it is a single-family home (as opposed to a townhome or condo). It looks like the model is trying to distinguish some of those lower-priced categories. The model does &lt;em&gt;not&lt;/em&gt; really use the &lt;code&gt;high_price_words&lt;/code&gt; variable, perhaps because it is already easy to find the expensive houses.&lt;/p&gt;
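&lt;p&gt;If you want the importance scores themselves rather than a plot, the vip package also provides &lt;code&gt;vi()&lt;/code&gt;, which returns them as a tibble. This is a quick variation on the code above, not from the original post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## not shown in the video: importance scores as a tibble instead of a plot
library(vip)
extract_workflow(xgb_last) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vi() %&amp;gt;%
  slice_max(Importance, n = 15)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
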

&lt;p&gt;The two finalists from SLICED go on to compete next Tuesday, which should be fun and interesting to watch! I have enjoyed the opportunity to participate this season.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Use racing methods to tune xgboost models and predict home runs ⚾️</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Tue, 10 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/use-racing-methods-to-tune-xgboost-models-and-predict-home-runs-301m</link>
      <guid>https://dev.to/juliasilge/use-racing-methods-to-tune-xgboost-models-and-predict-home-runs-301m</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/" rel="noopener noreferrer"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/" rel="noopener noreferrer"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. This week’s episode of &lt;a href="https://www.notion.so/SLICED-Show-c7bd26356e3a42279e2dfbafb0480073" rel="noopener noreferrer"&gt;SLICED&lt;/a&gt;, a competitive data science streaming show, had contestants compete to predict home runs in recent baseball games. Honestly I don’t know much about baseball ⚾ but the &lt;a href="https://github.com/tidymodels/finetune/" rel="noopener noreferrer"&gt;finetune&lt;/a&gt; package had a recent release and this challenge offers a good opportunity to show how to use racing methods for tuning.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_e0NFIaHY2c"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict &lt;a href="https://www.kaggle.com/c/sliced-s01e09-playoffs-1/" rel="noopener noreferrer"&gt;whether a batter’s hit results in a home run&lt;/a&gt; given features about the hit. The main data set provided is in a CSV file called &lt;code&gt;train.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
train_raw &amp;lt;- read_csv("train.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can watch &lt;a href="https://www.youtube.com/channel/UCCsy9G2d0Q7m_d8cOtDineQ" rel="noopener noreferrer"&gt;this week’s full episode of SLICED&lt;/a&gt; to see lots of exploratory data analysis and visualization of this dataset, but let’s just make a few plots to understand it better.&lt;/p&gt;

&lt;p&gt;How are home runs distributed in the physical space around home plate?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  ggplot(aes(plate_x, plate_z, z = is_home_run)) +
  stat_summary_hex(alpha = 0.8, bins = 10) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% home runs")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do launch speed and angle of the ball leaving the bat affect home run percentage?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  ggplot(aes(launch_angle, launch_speed, z = is_home_run)) +
  stat_summary_hex(alpha = 0.8, bins = 15) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% home runs")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does pacing, like the number of balls, strikes, or the inning, affect home runs?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  mutate(is_home_run = if_else(as.logical(is_home_run), "yes", "no")) %&amp;gt;%
  select(is_home_run, balls, strikes, inning) %&amp;gt;%
  pivot_longer(balls:inning) %&amp;gt;%
  mutate(name = fct_inorder(name)) %&amp;gt;%
  ggplot(aes(value, after_stat(density), fill = is_home_run)) +
  geom_histogram(alpha = 0.5, binwidth = 1, position = "identity") +
  facet_wrap(~name, scales = "free") +
  labs(fill = "Home run?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is certainly lots more to discover, but let’s move on to modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” I’m going to convert the 0s and 1s from the original dataset into a factor for classification modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
bb_split &amp;lt;- train_raw %&amp;gt;%
  mutate(
    is_home_run = if_else(as.logical(is_home_run), "HR", "no"),
    is_home_run = factor(is_home_run)
  ) %&amp;gt;%
  initial_split(strata = is_home_run)
bb_train &amp;lt;- training(bb_split)
bb_test &amp;lt;- testing(bb_split)

set.seed(234)
bb_folds &amp;lt;- vfold_cv(bb_train, strata = is_home_run)
bb_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [31214/3469]&amp;gt; Fold01
## 2 &amp;lt;split [31214/3469]&amp;gt; Fold02
## 3 &amp;lt;split [31214/3469]&amp;gt; Fold03
## 4 &amp;lt;split [31215/3468]&amp;gt; Fold04
## 5 &amp;lt;split [31215/3468]&amp;gt; Fold05
## 6 &amp;lt;split [31215/3468]&amp;gt; Fold06
## 7 &amp;lt;split [31215/3468]&amp;gt; Fold07
## 8 &amp;lt;split [31215/3468]&amp;gt; Fold08
## 9 &amp;lt;split [31215/3468]&amp;gt; Fold09
## 10 &amp;lt;split [31215/3468]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For feature engineering, let’s concentrate on the variables we already explored during EDA along with info about the pitch and handedness of players. There is some missing data, especially in the &lt;code&gt;launch_angle&lt;/code&gt; and &lt;code&gt;launch_speed&lt;/code&gt;, so let’s impute those values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bb_rec &amp;lt;-
  recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
    bb_type + bearing + pitch_mph +
    is_pitcher_lefty + is_batter_lefty +
    inning + balls + strikes + game_date,
  data = bb_train
  ) %&amp;gt;%
  step_date(game_date, features = c("week"), keep_original_cols = FALSE) %&amp;gt;%
  step_unknown(all_nominal_predictors()) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %&amp;gt;%
  step_impute_median(all_numeric_predictors(), -launch_angle, -launch_speed) %&amp;gt;%
  step_impute_linear(launch_angle, launch_speed,
    impute_with = imp_vars(plate_x, plate_z, pitch_mph)
  ) %&amp;gt;%
  step_nzv(all_predictors())

## we can `prep()` just to check that it works
prep(bb_rec)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 13
## 
## Training data contained 34683 data points and 15255 incomplete rows. 
## 
## Operations:
## 
## Date features from game_date [trained]
## Unknown factor level assignment for bb_type, bearing [trained]
## Dummy variables from bb_type, bearing [trained]
## Median Imputation for plate_x, plate_z, pitch_mph, ... [trained]
## Linear regression imputation for launch_angle, launch_speed [trained]
## Sparse, unbalanced variable filter removed bb_type_unknown, bearing_unknown [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s create a tunable xgboost model specification. In a competition like SLICED, we likely wouldn’t want to tune all these parameters because of time constraints, but instead tune only a few of the most important ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_spec &amp;lt;-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    mtry = tune(),
    learn_rate = 0.01
  ) %&amp;gt;%
  set_engine("xgboost") %&amp;gt;%
  set_mode("classification")

xgb_wf &amp;lt;- workflow(bb_rec, xgb_spec)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use racing to tune xgboost
&lt;/h2&gt;

&lt;p&gt;Now we &lt;a href="https://finetune.tidymodels.org/reference/tune_race_anova.html" rel="noopener noreferrer"&gt;can use &lt;code&gt;tune_race_anova()&lt;/code&gt; to eliminate&lt;/a&gt; parameter combinations that are not doing well. This particular SLICED episode was being evaluted on log loss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(finetune)
doParallel::registerDoParallel()

set.seed(345)
xgb_rs &amp;lt;- tune_race_anova(
  xgb_wf,
  resamples = bb_folds,
  grid = 15,
  metrics = metric_set(mn_log_loss),
  control = control_race(verbose_elim = TRUE)
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize how the possible parameter combinations we tried did during the “race.” Notice how we saved a TON of time by not evaluating the parameter combinations that were clearly doing poorly on all the resamples; we only kept going with the good parameter combinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot_race(xgb_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-10-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we can look at the top results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(xgb_rs)


## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config          
## &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;            
## 1 6 1536 11 mn_log_lo… binary 0.0981 10 0.00171 Preprocessor1_Mo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use &lt;code&gt;last_fit()&lt;/code&gt; to fit one final time to the &lt;strong&gt;training&lt;/strong&gt; data and evaluate one final time on the &lt;strong&gt;testing&lt;/strong&gt; data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_last &amp;lt;- xgb_wf %&amp;gt;%
  finalize_workflow(select_best(xgb_rs, "mn_log_loss")) %&amp;gt;%
  last_fit(bb_split)

xgb_last


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [34683… train/test… &amp;lt;tibble [2 … &amp;lt;tibble [0… &amp;lt;tibble [11,561… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can collect the predictions on the testing set and do whatever we want, like create an ROC curve, or in this case compute log loss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  mn_log_loss(is_home_run, .pred_HR)


## # A tibble: 1 × 3
## .metric .estimator .estimate
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 mn_log_loss binary 0.0975

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
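&lt;p&gt;As a sanity check (a sketch, not part of the original workflow), binary log loss is just the mean negative log of the probability assigned to the true class, so we can reproduce the &lt;code&gt;mn_log_loss()&lt;/code&gt; value by hand from the same predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## probability assigned to the observed class for each row,
## then the mean negative log of those probabilities
collect_predictions(xgb_last) %&amp;gt;%
  mutate(p_truth = if_else(is_home_run == "HR", .pred_HR, 1 - .pred_HR)) %&amp;gt;%
  summarise(log_loss = -mean(log(p_truth)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;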



&lt;p&gt;This is pretty good for a single model; the SLICED competitors who achieved better scores on this dataset all used ensemble models, I believe.&lt;/p&gt;

&lt;p&gt;We can also compute variable importance scores using the vip package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(vip)
extract_workflow(xgb_last) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vip(geom = "point", num_features = 15)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-14-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using racing methods is a great way to tune through lots of possible parameter options more quickly. Perhaps I’ll put it to the test next Tuesday, when I participate in the second and final episode of the SLICED playoffs!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Tune xgboost models with early stopping to predict shelter animal status 🐱🐶</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Sat, 07 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/tune-xgboost-models-with-early-stopping-to-predict-shelter-animal-status-48gf</link>
      <guid>https://dev.to/juliasilge/tune-xgboost-models-with-early-stopping-to-predict-shelter-animal-status-48gf</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. I participated in this week’s episode of the &lt;a href="https://www.notion.so/SLICED-Show-c7bd26356e3a42279e2dfbafb0480073"&gt;SLICED&lt;/a&gt; playoffs, a competitive data science streaming show, where we competed to predict the status of shelter animals. 🐱 I used xgboost’s early stopping feature as I competed, so let’s walk through how and when to try that out!&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/aXAafzOFyjk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict &lt;a href="https://www.kaggle.com/c/sliced-s01e10-playoffs-2/"&gt;the outcome for shelter animals&lt;/a&gt; (adoption, transfer, or no outcome) given features about the animal and event. The main data set provided is in a CSV file called &lt;code&gt;training.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
train_raw &amp;lt;- read_csv("train.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can watch &lt;a href="https://www.twitch.tv/videos/1107382565"&gt;this week’s full episode of SLICED&lt;/a&gt; to see lots of exploratory data analysis and visualization of this dataset, but let’s just make a few plots to understand it better.&lt;/p&gt;

&lt;p&gt;How are outcomes distributed for animals of different ages?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(lubridate)

train_raw %&amp;gt;%
  mutate(
    age_upon_outcome = as.period(as.Date(datetime) - date_of_birth),
    age_upon_outcome = time_length(age_upon_outcome, unit = "weeks")
  ) %&amp;gt;%
  ggplot(aes(age_upon_outcome, after_stat(density), fill = outcome_type)) +
  geom_histogram(bins = 15, alpha = 0.5, position = "identity") +
  labs(x = "Age in weeks", fill = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HxWvJ-K_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HxWvJ-K_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does adoption rate change with day of the week and week of the year?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  mutate(outcome_type = outcome_type == "adoption") %&amp;gt;%
  group_by(
    week = week(datetime),
    wday = wday(datetime)
  ) %&amp;gt;%
  summarise(outcome_type = mean(outcome_type)) %&amp;gt;%
  ggplot(aes(week, wday, fill = outcome_type)) +
  geom_tile(alpha = 0.8) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% adopted", x = "week of the year", y = "week day")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_KP9LVib--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_KP9LVib--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the difference on weekends vs. weekdays especially!&lt;/p&gt;

&lt;p&gt;There is certainly lots more to explore (including, for example, learning about the names of the animals, something I spent a good bit of time on during the competition), but let’s move on to modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget,” as well as the metrics (this challenge was evaluated on multiclass log loss).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
shelter_split &amp;lt;- train_raw %&amp;gt;%
  mutate(
    age_upon_outcome = as.period(as.Date(datetime) - date_of_birth),
    age_upon_outcome = time_length(age_upon_outcome, unit = "weeks")
  ) %&amp;gt;%
  initial_split(strata = outcome_type)

shelter_train &amp;lt;- training(shelter_split)
shelter_test &amp;lt;- testing(shelter_split)
shelter_metrics &amp;lt;- metric_set(accuracy, roc_auc, mn_log_loss)

set.seed(234)
shelter_folds &amp;lt;- vfold_cv(shelter_train, strata = outcome_type)
shelter_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [36724/4081]&amp;gt; Fold01
## 2 &amp;lt;split [36724/4081]&amp;gt; Fold02
## 3 &amp;lt;split [36724/4081]&amp;gt; Fold03
## 4 &amp;lt;split [36724/4081]&amp;gt; Fold04
## 5 &amp;lt;split [36724/4081]&amp;gt; Fold05
## 6 &amp;lt;split [36725/4080]&amp;gt; Fold06
## 7 &amp;lt;split [36725/4080]&amp;gt; Fold07
## 8 &amp;lt;split [36725/4080]&amp;gt; Fold08
## 9 &amp;lt;split [36725/4080]&amp;gt; Fold09
## 10 &amp;lt;split [36725/4080]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For feature engineering, let’s concentrate on just a handful of predictors, like when the event (adoption, transfer, or “no outcome”) was recorded and features of the animal itself like age, sex, type, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shelter_rec &amp;lt;- recipe(outcome_type ~ age_upon_outcome + animal_type +
  datetime + sex + spay_neuter,
data = shelter_train
) %&amp;gt;%
  step_date(datetime, features = c("year", "week", "dow"), keep_original_cols = FALSE) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %&amp;gt;%
  step_zv(all_predictors())

## we can `prep()` just to check that it works
prep(shelter_rec)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 5
## 
## Training data contained 40805 data points and no missing data.
## 
## Operations:
## 
## Date features from datetime [trained]
## Dummy variables from animal_type, sex, spay_neuter, datetime_dow [trained]
## Zero variance filter removed no terms [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s create a tunable xgboost model specification. This is where &lt;a href="https://en.wikipedia.org/wiki/Early_stopping"&gt;early stopping&lt;/a&gt; comes in; we will keep the number of trees as a constant (and not too terribly high), set &lt;code&gt;stop_iter&lt;/code&gt; (the early stopping parameter) to &lt;code&gt;tune()&lt;/code&gt;, and then tune a few other parameters. Notice that we need to hold back a &lt;code&gt;validation&lt;/code&gt; set (a proportion of each analysis set, actually) to use for deciding when to stop.&lt;/p&gt;

&lt;p&gt;We can also create a custom &lt;code&gt;stopping_grid&lt;/code&gt; to specify which parameter combinations we want to try out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stopping_spec &amp;lt;-
  boost_tree(
    trees = 500,
    mtry = tune(),
    learn_rate = tune(),
    stop_iter = tune()
  ) %&amp;gt;%
  set_engine("xgboost", validation = 0.2) %&amp;gt;%
  set_mode("classification")

stopping_grid &amp;lt;-
  grid_latin_hypercube(
    mtry(range = c(5L, 20L)), ## depends on number of columns in data
    learn_rate(range = c(-5, -1)), ## keep pretty big
    stop_iter(range = c(10L, 50L)), ## bigger than default
    size = 10
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can put these together in a workflow and tune across the grid of parameters and our resamples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;early_stop_wf &amp;lt;- workflow(shelter_rec, stopping_spec)

doParallel::registerDoParallel()
set.seed(345)
stopping_rs &amp;lt;- tune_grid(
  early_stop_wf,
  shelter_folds,
  grid = stopping_grid,
  metrics = shelter_metrics
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate results
&lt;/h2&gt;

&lt;p&gt;How did these results turn out? We can visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(stopping_rs) + theme_light(base_family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ILKZba29--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-9-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ILKZba29--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-9-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or we can look at the top results manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(stopping_rs, metric = "mn_log_loss")


## # A tibble: 5 × 9
## mtry learn_rate stop_iter .metric .estimator mean n std_err .config 
## &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   
## 1 12 0.0612 46 mn_log_loss multiclass 0.502 10 0.00319 Preproc…
## 2 18 0.0378 36 mn_log_loss multiclass 0.505 10 0.00279 Preproc…
## 3 7 0.00710 12 mn_log_loss multiclass 0.544 10 0.00246 Preproc…
## 4 9 0.00252 33 mn_log_loss multiclass 0.655 10 0.00145 Preproc…
## 5 11 0.00195 25 mn_log_loss multiclass 0.699 10 0.00122 Preproc…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use &lt;code&gt;last_fit()&lt;/code&gt; to fit one final time to the &lt;strong&gt;training&lt;/strong&gt; data and evaluate one final time on the &lt;strong&gt;testing&lt;/strong&gt; data, with the numerically optimal result from &lt;code&gt;stopping_rs&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stopping_fit &amp;lt;- early_stop_wf %&amp;gt;%
  finalize_workflow(select_best(stopping_rs, "mn_log_loss")) %&amp;gt;%
  last_fit(shelter_split)

stopping_fit


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [40805/13603]&amp;gt; train/test split &amp;lt;tibble … &amp;lt;tibb… &amp;lt;tibble [13… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did this model perform on the testing data, which was not used in tuning or training?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(stopping_fit)


## # A tibble: 2 × 4
## .metric .estimator .estimate .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy multiclass 0.807 Preprocessor1_Model1
## 2 roc_auc hand_till 0.877 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This result is pretty good for a single model; we would expect to do better by incorporating the &lt;code&gt;breed&lt;/code&gt; information, perhaps the presence/absence of a name, or moving to an ensemble model.&lt;/p&gt;

&lt;p&gt;What features are most important for this xgboost model?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(vip)

## use this fitted workflow `extract_workflow(stopping_fit)` to predict on new data
extract_workflow(stopping_fit) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vip(num_features = 15, geom = "point")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fFSA8k4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFSA8k4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-13-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Age, spay/neuter status, animal type, and seasonal information like week of the year or day of the week are important for this model.&lt;/p&gt;

&lt;p&gt;We can collect the predictions on the testing set and do whatever we want, like create an ROC curve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(stopping_fit) %&amp;gt;%
  roc_curve(outcome_type, .pred_adoption:.pred_transfer) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_path(alpha = 0.8, size = 1) +
  coord_equal() +
  labs(color = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c32MJNNi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c32MJNNi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also look at a confusion matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(stopping_fit) %&amp;gt;%
  conf_mat(outcome_type, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DNWEWXs7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DNWEWXs7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Early stopping is a great option when you have plenty of data and don’t want to overfit your boosted trees! I will be back on SLICED for the final four next Tuesday, and I plan to use early stopping again because it is a good fit for this kind of situation.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Predict which Scooby Doo monsters 👻 are REAL with a tuned decision tree model</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Tue, 13 Jul 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predict-which-scooby-doo-monsters-are-real-with-a-tuned-decision-tree-model-32e6</link>
      <guid>https://dev.to/juliasilge/predict-which-scooby-doo-monsters-are-real-with-a-tuned-decision-tree-model-32e6</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to train and evalute a random forest model, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Scooby Doo episodes. 👻&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2g6f-j3sHS4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict which &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-13/readme.md"&gt;Scooby Doo monsters&lt;/a&gt; are real and which are not, based on other characteristics of the episode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
scooby_raw &amp;lt;- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-13/scoobydoo.csv")

scooby_raw %&amp;gt;%
  filter(monster_amount &amp;gt; 0) %&amp;gt;%
  count(monster_real)


## # A tibble: 2 x 2
## monster_real n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 FALSE 404
## 2 TRUE 112

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most monsters are not real!&lt;/p&gt;

&lt;p&gt;How did the number of real vs. fake monsters change over the decades?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scooby_raw %&amp;gt;%
  filter(monster_amount &amp;gt; 0) %&amp;gt;%
  count(
    year_aired = 10 * ((lubridate::year(date_aired) + 1) %/% 10),
    monster_real
  ) %&amp;gt;%
  mutate(year_aired = factor(year_aired)) %&amp;gt;%
  ggplot(aes(year_aired, n, fill = monster_real)) +
  geom_col(position = position_dodge(preserve = "single"), alpha = 0.8) +
  labs(x = "Date aired", y = "Monsters per decade", fill = "Real monster?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1PWdtG3B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1PWdtG3B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How are these different episodes rated on IMDB?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scooby_raw %&amp;gt;%
  filter(monster_amount &amp;gt; 0) %&amp;gt;%
  mutate(imdb = parse_number(imdb)) %&amp;gt;%
  ggplot(aes(imdb, after_stat(density), fill = monster_real)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(x = "IMDB rating", y = "Density", fill = "Real monster?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n1MgZC-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n1MgZC-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like there are some meaningful relationships there that we can use for modeling, but they are not linear, so a decision tree may be a good fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build and tune a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We’re only going to use the &lt;em&gt;year&lt;/em&gt; each episode was aired and the episode &lt;em&gt;rating&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
scooby_split &amp;lt;- scooby_raw %&amp;gt;%
  mutate(
    imdb = parse_number(imdb),
    year_aired = lubridate::year(date_aired)
  ) %&amp;gt;%
  filter(monster_amount &amp;gt; 0, !is.na(imdb)) %&amp;gt;%
  mutate(
    monster_real = case_when(
      monster_real == "FALSE" ~ "fake",
      TRUE ~ "real"
    ),
    monster_real = factor(monster_real)
  ) %&amp;gt;%
  select(year_aired, imdb, monster_real, title) %&amp;gt;%
  initial_split(strata = monster_real)
scooby_train &amp;lt;- training(scooby_split)
scooby_test &amp;lt;- testing(scooby_split)

set.seed(234)
scooby_folds &amp;lt;- bootstraps(scooby_train, strata = monster_real)
scooby_folds


## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits            id
##    &amp;lt;list&amp;gt;            &amp;lt;chr&amp;gt;
##  1 &amp;lt;split [375/133]&amp;gt; Bootstrap01
##  2 &amp;lt;split [375/144]&amp;gt; Bootstrap02
##  3 &amp;lt;split [375/140]&amp;gt; Bootstrap03
##  4 &amp;lt;split [375/132]&amp;gt; Bootstrap04
##  5 &amp;lt;split [375/139]&amp;gt; Bootstrap05
##  6 &amp;lt;split [375/134]&amp;gt; Bootstrap06
##  7 &amp;lt;split [375/146]&amp;gt; Bootstrap07
##  8 &amp;lt;split [375/132]&amp;gt; Bootstrap08
##  9 &amp;lt;split [375/143]&amp;gt; Bootstrap09
## 10 &amp;lt;split [375/143]&amp;gt; Bootstrap10
## # … with 15 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s create our decision tree specification. It is tunable, so we can’t fit it to data right away; we haven’t yet said what the model parameters will be.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree_spec &amp;lt;-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("rpart")

tree_spec


## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
##   min_n = tune()
## 
## Computational engine: rpart

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s set up a grid of possible model parameters to try. With &lt;code&gt;levels = 4&lt;/code&gt; for three tunable parameters, &lt;code&gt;grid_regular()&lt;/code&gt; creates 4 × 4 × 4 = 64 candidate combinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree_grid &amp;lt;- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 4)
tree_grid


## # A tibble: 64 x 3
##    cost_complexity tree_depth min_n
##              &amp;lt;dbl&amp;gt;      &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1    0.0000000001          1     2
##  2    0.0000001             1     2
##  3    0.0001                1     2
##  4    0.1                   1     2
##  5    0.0000000001          5     2
##  6    0.0000001             5     2
##  7    0.0001                5     2
##  8    0.1                   5     2
##  9    0.0000000001         10     2
## 10    0.0000001            10     2
## # … with 54 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s fit each possible parameter combination to each resample. By putting non-default metrics into &lt;code&gt;metric_set()&lt;/code&gt;, we can specify which metrics are computed for each resample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()

set.seed(345)
tree_rs &amp;lt;-
  tune_grid(
    tree_spec,
    monster_real ~ year_aired + imdb,
    resamples = scooby_folds,
    grid = tree_grid,
    metrics = metric_set(accuracy, roc_auc, sensitivity, specificity)
  )

tree_rs


## # Tuning results
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 4
##    splits            id          .metrics           .notes
##    &amp;lt;list&amp;gt;            &amp;lt;chr&amp;gt;       &amp;lt;list&amp;gt;             &amp;lt;list&amp;gt;
##  1 &amp;lt;split [375/133]&amp;gt; Bootstrap01 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  2 &amp;lt;split [375/144]&amp;gt; Bootstrap02 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  3 &amp;lt;split [375/140]&amp;gt; Bootstrap03 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  4 &amp;lt;split [375/132]&amp;gt; Bootstrap04 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  5 &amp;lt;split [375/139]&amp;gt; Bootstrap05 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  6 &amp;lt;split [375/134]&amp;gt; Bootstrap06 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  7 &amp;lt;split [375/146]&amp;gt; Bootstrap07 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  8 &amp;lt;split [375/132]&amp;gt; Bootstrap08 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  9 &amp;lt;split [375/143]&amp;gt; Bootstrap09 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 10 &amp;lt;split [375/143]&amp;gt; Bootstrap10 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## # … with 15 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
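&lt;p&gt;If you want a quick summary of how each candidate did across all the resamples, &lt;code&gt;collect_metrics()&lt;/code&gt; aggregates the per-resample results into one row per parameter combination and metric (a quick sketch; the output includes the mean and standard error for each).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(tree_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;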



&lt;p&gt;All done!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate and understand our model
&lt;/h2&gt;

&lt;p&gt;Now that we have tuned our decision tree model, we can choose which set of model parameters we want to use. What are some of the best options?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(tree_rs)


## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric  .estimator  mean     n std_err
##             &amp;lt;dbl&amp;gt;      &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt;
## 1    0.0000000001         10     2 accuracy binary     0.872    25 0.00481
## 2    0.0000001            10     2 accuracy binary     0.872    25 0.00481
## 3    0.0001               10     2 accuracy binary     0.872    25 0.00481
## 4    0.0000000001         15     2 accuracy binary     0.871    25 0.00456
## 5    0.0000001            15     2 accuracy binary     0.871    25 0.00456
## # … with 1 more variable: .config &amp;lt;chr&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize all of the combinations we tried.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(tree_rs) + theme_light(base_family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ipoZ8ps5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ipoZ8ps5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we used &lt;code&gt;select_best()&lt;/code&gt;, we would pick the numerically best option. However, we might want a different option that meets some criterion relative to the best performance, like a simpler model that is within one standard error of the optimal results. We finalize our model just like we finalize a workflow, as shown in previous posts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;simpler_tree &amp;lt;- select_by_one_std_err(tree_rs,
  -cost_complexity,
  metric = "roc_auc"
)

final_tree &amp;lt;- finalize_model(tree_spec, simpler_tree)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can fit &lt;code&gt;final_tree&lt;/code&gt; to our training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fit &amp;lt;- fit(final_tree, monster_real ~ year_aired + imdb, scooby_train)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
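&lt;p&gt;If you’d like to peek at the fitted tree itself, one option is to pull out the underlying rpart object from the parsnip fit; depending on your parsnip version, &lt;code&gt;extract_fit_engine()&lt;/code&gt; (or &lt;code&gt;final_fit$fit&lt;/code&gt;) gets you there, and printing it shows the split rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fit %&amp;gt;% extract_fit_engine()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;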



&lt;p&gt;We also could use &lt;code&gt;last_fit()&lt;/code&gt; instead of &lt;code&gt;fit()&lt;/code&gt;, by passing it the &lt;strong&gt;split&lt;/strong&gt; rather than the training data. This will fit one time on the training data and evaluate one time on the testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_rs &amp;lt;- last_fit(final_tree, monster_real ~ year_aired + imdb, scooby_split)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the first time we have used the testing data in this whole analysis, and it lets us see how our model performs on data it has never seen before. A bit worse, unfortunately!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(final_rs)


## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 accuracy binary         0.857 Preprocessor1_Model1
## 2 roc_auc  binary         0.780 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
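&lt;p&gt;Beyond the overall metrics, we can look at the individual test-set predictions stored in the &lt;code&gt;last_fit()&lt;/code&gt; results and summarize them in a confusion matrix (a quick sketch using yardstick’s &lt;code&gt;conf_mat()&lt;/code&gt;; &lt;code&gt;.pred_class&lt;/code&gt; is the predicted class column that &lt;code&gt;collect_predictions()&lt;/code&gt; returns).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(final_rs) %&amp;gt;%
  conf_mat(monster_real, .pred_class)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;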



&lt;p&gt;Finally, we can use the &lt;a href="https://github.com/grantmcdermott/parttree"&gt;parttree&lt;/a&gt; package to visualize our decision tree results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(parttree)

scooby_train %&amp;gt;%
  ggplot(aes(imdb, year_aired)) +
  geom_parttree(data = final_fit, aes(fill = monster_real), alpha = 0.2) +
  geom_jitter(alpha = 0.7, width = 0.05, height = 0.2, aes(color = monster_real))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WNlPCoBr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WNlPCoBr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
  </channel>
</rss>
