<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Julia Silge</title>
    <description>The latest articles on DEV Community by Julia Silge (@juliasilge).</description>
    <link>https://dev.to/juliasilge</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F26527%2F05fc758d-6021-4bc7-b092-85b591d4d265.jpg</url>
      <title>DEV Community: Julia Silge</title>
      <link>https://dev.to/juliasilge</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/juliasilge"/>
    <language>en</language>
    <item>
      <title>Topic modeling for Spice Girls lyrics 🇬🇧👯‍♀️🎤</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 15 Dec 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/topic-modeling-for-spice-girls-lyrics-46aa</link>
      <guid>https://dev.to/juliasilge/topic-modeling-for-spice-girls-lyrics-46aa</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://www.youtube.com/juliasilge"&gt;screencasts&lt;/a&gt;, but instead of being about &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt;, this screencast focuses on unsupervised modeling for text, specifically topic modeling. Today’s screencast walks through how to build a &lt;a href="https://www.structuraltopicmodel.com/"&gt;structural topic model&lt;/a&gt; and then how to explore and understand it, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Spice Girls lyrics. 👯‍♀️&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2i0Cu8MMGRc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to “discover” topics in the &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-14/readme.md"&gt;lyrics of Spice Girls songs&lt;/a&gt;. Instead of a supervised or predictive model where our observations have labels, this is an unsupervised approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

lyrics &amp;lt;- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-14/lyrics.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How many albums and songs are there in this dataset?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lyrics %&amp;gt;% distinct(album_name)


## # A tibble: 3 × 1
## album_name
## &amp;lt;chr&amp;gt;     
## 1 Spice     
## 2 Spiceworld
## 3 Forever


lyrics %&amp;gt;% distinct(album_name, song_name)


## # A tibble: 31 × 2
## album_name song_name                 
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;                     
## 1 Spice "Wannabe"                 
## 2 Spice "Say You\x92ll Be There"  
## 3 Spice "2 Become 1"              
## 4 Spice "Love Thing"              
## 5 Spice "Last Time Lover"         
## 6 Spice "Mama"                    
## 7 Spice "Who Do You Think You Are"
## 8 Spice "Something Kinda Funny"   
## 9 Spice "Naked"                   
## 10 Spice "If U Can\x92t Dance"     
## # … with 21 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start by tokenizing this text and removing a small set of stop words (as well as fixing that punctuation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)

tidy_lyrics &amp;lt;-
  lyrics %&amp;gt;%
  mutate(song_name = str_replace_all(song_name, "\x92", "'")) %&amp;gt;%
  unnest_tokens(word, line) %&amp;gt;%
  anti_join(get_stopwords())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What are the most common words in these songs after removing stop words?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tidy_lyrics %&amp;gt;%
  count(word, sort = TRUE)


## # A tibble: 979 × 2
## word n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 get 153
## 2 love 137
## 3 know 124
## 4 time 106
## 5 wanna 102
## 6 never 101
## 7 oh 88
## 8 yeah 88
## 9 la 85
## 10 got 82
## # … with 969 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How about per song?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tidy_lyrics %&amp;gt;%
  count(song_name, word, sort = TRUE)


## # A tibble: 2,206 × 3
## song_name word n
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Saturday Night Divas get 91
## 2 Spice Up Your Life la 64
## 3 If U Can't Dance dance 60
## 4 Holler holler 48
## 5 Never Give Up on the Good Times never 47
## 6 Move Over generation 41
## 7 Saturday Night Divas deeper 41
## 8 Move Over yeah 39
## 9 Something Kinda Funny got 39
## 10 Never Give Up on the Good Times give 38
## # … with 2,196 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us an idea of how many counts per word we have per song, for our modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Train a topic model
&lt;/h2&gt;

&lt;p&gt;To train a topic model with the stm package, we need to create a sparse matrix from our tidy dataframe of tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lyrics_sparse &amp;lt;-
  tidy_lyrics %&amp;gt;%
  count(song_name, word) %&amp;gt;%
  cast_sparse(song_name, word, n)

dim(lyrics_sparse)


## [1] 31 979

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means there are 31 songs (i.e. documents) and 979 different tokens (i.e. terms or words) in our dataset for modeling.&lt;/p&gt;

&lt;p&gt;A topic model like this one models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;each &lt;strong&gt;document&lt;/strong&gt; as a mixture of topics&lt;/li&gt;
&lt;li&gt;each &lt;strong&gt;topic&lt;/strong&gt; as a mixture of words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important parameter when training a topic model is &lt;code&gt;K&lt;/code&gt;, the number of topics. This is like &lt;code&gt;k&lt;/code&gt; in k-means in that it is a hyperparameter of the model and we must choose this value ahead of time. We could &lt;a href="https://juliasilge.com/blog/evaluating-stm/"&gt;try multiple different values&lt;/a&gt; to find the best value for &lt;code&gt;K&lt;/code&gt;, but this is a very small dataset so let’s just stick with &lt;code&gt;K = 4&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(stm)
set.seed(123)
topic_model &amp;lt;- stm(lyrics_sparse, K = 4, verbose = FALSE)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
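If we did want to compare several values of `K`, one approach (a hedged sketch of my own, not code from the original post; the candidate `K` values here are illustrative) is to fit a model for each candidate and inspect diagnostics such as exclusivity. `lyrics_sparse` is the sparse matrix created above.

```r
library(stm)
library(tidyverse)

# Sketch: fit a structural topic model for each candidate K.
# Assumes `lyrics_sparse` from above; the K values are illustrative.
set.seed(123)
many_models = tibble(K = c(3, 4, 6, 8)) %>%
  mutate(topic_model = map(K, ~ stm(lyrics_sparse, K = .x, verbose = FALSE)))

# One could then compare per-topic diagnostics, e.g. exclusivity(), per model.
many_models %>% mutate(exclusivity = map(topic_model, exclusivity))
```

With only 31 documents, these diagnostics are noisy, which is part of why sticking with a single small `K` is reasonable here.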



&lt;p&gt;To get a quick view of the results, we can use &lt;code&gt;summary()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(topic_model)


## A topic model with 4 topics, 31 documents and a 979 word dictionary.

## Topic 1 Top Words:
## Highest Prob: get, wanna, time, night, right, deeper, come 
## FREX: deeper, saturday, comin, get, lover, night, last 
## Lift: achieve, saying, tonight, another, anyway, blameless, breaking 
## Score: deeper, saturday, lover, get, wanna, night, comin 
## Topic 2 Top Words:
## Highest Prob: dance, yeah, generation, know, next, love, naked 
## FREX: next, naked, denying, foolin, nobody, wants, meant 
## Lift: admit, bein, check, d'ya, defeat, else, foolin 
## Score: next, naked, dance, generation, denying, foolin, nobody 
## Topic 3 Top Words:
## Highest Prob: got, holler, make, love, oh, something, play 
## FREX: holler, kinda, swing, funny, yay, use, trust 
## Lift: anyone, bottom, driving, fantasy, follow, hoo, long 
## Score: holler, swing, kinda, funny, yay, driving, loving 
## Topic 4 Top Words:
## Highest Prob: la, never, love, give, time, know, way 
## FREX: times, tried, swear, la, bring, promise, viva 
## Lift: able, certain, love's, rely, affection, shy, replace 
## Score: la, times, swear, shake, viva, chicas, front

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Explore topic model results
&lt;/h2&gt;

&lt;p&gt;To explore more deeply, we can &lt;code&gt;tidy()&lt;/code&gt; the topic model results to get a dataframe that we can compute on. There are two possible outputs for this topic model, the &lt;code&gt;"beta"&lt;/code&gt; matrix of topic-word probabilities and the &lt;code&gt;"gamma"&lt;/code&gt; matrix of document-topic probabilities. Let’s start with the first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_topics &amp;lt;- tidy(topic_model, matrix = "beta")
word_topics


## # A tibble: 3,916 × 3
## topic term beta
## &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 1 achieve 1.66e- 3
## 2 2 achieve 2.14e-21
## 3 3 achieve 1.75e-49
## 4 4 achieve 5.18e-36
## 5 1 baby 1.20e- 2
## 6 2 baby 1.44e- 2
## 7 3 baby 1.29e-15
## 8 4 baby 5.04e- 3
## 9 1 back 1.94e- 2
## 10 2 back 5.49e- 4
## # … with 3,906 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this is a tidy dataframe, we can manipulate it however we like, including making a visualization showing the highest probability words from each topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_topics %&amp;gt;%
  group_by(topic) %&amp;gt;%
  slice_max(beta, n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(topic = paste("Topic", topic)) %&amp;gt;%
  ggplot(aes(beta, reorder_within(term, beta, topic), fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(topic), scales = "free_y") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_reordered() +
  labs(x = expression(beta), y = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7kda_oKv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7kda_oKv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-11-1.png" alt="" width="880" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What about the other matrix? We also need to pass in the &lt;code&gt;document_names&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;song_topics &amp;lt;- tidy(topic_model,
  matrix = "gamma",
  document_names = rownames(lyrics_sparse)
)
song_topics


## # A tibble: 124 × 3
## document topic gamma
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;
## 1 2 Become 1 1 0.714   
## 2 Denying 1 0.00163 
## 3 Do It 1 0.996   
## 4 Get Down With Me 1 0.947   
## 5 Goodbye 1 0.00106 
## 6 Holler 1 0.00103 
## 7 If U Can't Dance 1 0.000942
## 8 If You Wanna Have Some Fun 1 0.00722 
## 9 Last Time Lover 1 0.998   
## 10 Let Love Lead the Way 1 0.00175 
## # … with 114 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember that each document (song) was modeled as a mixture of topics. How did that turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;song_topics %&amp;gt;%
  mutate(
    song_name = fct_reorder(document, gamma),
    topic = factor(topic)
  ) %&amp;gt;%
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(song_name), ncol = 4) +
  scale_x_continuous(expand = c(0, 0)) +
  labs(x = expression(gamma), y = "Topic")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ow64MUQ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ow64MUQ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/spice-girls/index_files/figure-html/unnamed-chunk-13-1.png" alt="" width="880" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The songs near the top of this plot are mostly one topic, while the songs near the bottom are more a mix.&lt;/p&gt;

&lt;p&gt;There is a TON more you can do with topic models. For example, we can take the trained topic model and, using some supplementary metadata on our documents, estimate regressions for the &lt;em&gt;proportion&lt;/em&gt; of each document about a topic with the metadata as the predictors. Here, let’s estimate regressions for our four topics with the album name as the predictor. This asks the question, “Do the topics in Spice Girls songs change across albums?”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effects &amp;lt;-
  estimateEffect(
    1:4 ~ album_name,
    topic_model,
    tidy_lyrics %&amp;gt;% distinct(song_name, album_name) %&amp;gt;% arrange(song_name)
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, to get a quick view of the results, we can use &lt;code&gt;summary()&lt;/code&gt;, but to dive deeper, we will want to use &lt;code&gt;tidy()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;summary(effects)


## 
## Call:
## estimateEffect(formula = 1:4 ~ album_name, stmobj = topic_model, 
## metadata = tidy_lyrics %&amp;gt;% distinct(song_name, album_name) %&amp;gt;% 
## arrange(song_name))
## 
## 
## Topic 1:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 0.1787 0.1312 1.362 0.184
## album_nameSpice 0.1199 0.1892 0.634 0.531
## album_nameSpiceworld 0.1139 0.1862 0.612 0.546
## 
## 
## Topic 2:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)
## (Intercept) 0.1444 0.1325 1.090 0.285
## album_nameSpice 0.1357 0.1879 0.722 0.476
## album_nameSpiceworld 0.1486 0.1846 0.805 0.427
## 
## 
## Topic 3:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)  
## (Intercept) 0.27150 0.12085 2.247 0.0327 *
## album_nameSpice 0.01954 0.16752 0.117 0.9080  
## album_nameSpiceworld -0.25776 0.16700 -1.543 0.1339  
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Topic 4:
## 
## Coefficients:
## Estimate Std. Error t value Pr(&amp;gt;|t|)   
## (Intercept) 0.405559 0.140820 2.880 0.00754 **
## album_nameSpice -0.273207 0.202200 -1.351 0.18746   
## album_nameSpiceworld -0.007134 0.194246 -0.037 0.97096   
## ---
## Signif. codes: 0 ' ***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


tidy(effects)


## # A tibble: 12 × 6
## topic term estimate std.error statistic p.value
## &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 1 (Intercept) 0.177 0.132 1.34 0.190  
## 2 1 album_nameSpice 0.120 0.189 0.633 0.532  
## 3 1 album_nameSpiceworld 0.115 0.188 0.608 0.548  
## 4 2 (Intercept) 0.145 0.133 1.09 0.283  
## 5 2 album_nameSpice 0.135 0.187 0.722 0.476  
## 6 2 album_nameSpiceworld 0.150 0.185 0.813 0.423  
## 7 3 (Intercept) 0.272 0.120 2.26 0.0316 
## 8 3 album_nameSpice 0.0167 0.167 0.100 0.921  
## 9 3 album_nameSpiceworld -0.259 0.166 -1.57 0.129  
## 10 4 (Intercept) 0.404 0.140 2.89 0.00739
## 11 4 album_nameSpice -0.273 0.196 -1.39 0.175  
## 12 4 album_nameSpiceworld -0.00502 0.193 -0.0260 0.979

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks like there is no statistical evidence of change in the lyrical content of the Spice Girls songs across these three albums!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Predicting viewership for Doctor Who episodes</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Sat, 27 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predicting-viewership-for-doctor-who-episodes-41eg</link>
      <guid>https://dev.to/juliasilge/predicting-viewership-for-doctor-who-episodes-41eg</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to handle workflow objects, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Doctor Who episodes. 💙&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/T8SSxIo-9Rg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict the UK viewership of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-11-23/readme.md"&gt;Doctor Who episodes&lt;/a&gt; (since the 2005 revival) from the episodes’ air date. How has the viewership of these episodes changed over time?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;episodes &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-11-23/episodes.csv") %&amp;gt;%
  filter(!is.na(uk_viewers))

episodes %&amp;gt;%
  ggplot(aes(first_aired, uk_viewers)) +
  geom_line(alpha = 0.8, size = 1.2, color = "midnightblue") +
  labs(x = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lEXNNToB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lEXNNToB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-2-1.png" alt="" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are quite spiky, with much higher viewer numbers for special episodes like season finales or Christmas episodes.&lt;/p&gt;

&lt;p&gt;I have only ever watched episodes of Doctor Who after they arrive on US streaming platforms, but I will say that I haven’t caught up on some of the latest seasons, much like many viewers in the UK.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a workflow
&lt;/h2&gt;

&lt;p&gt;In tidymodels, we typically recommend using a &lt;a href="https://www.tmwr.org/workflows.html"&gt;workflow&lt;/a&gt; in your modeling analyses, to make it easier to carry around preprocessing and modeling components in your code and to protect against errors. Let’s create some bootstrap resampling folds for these episodes, and then a workflow to predict viewership (in millions) from the air date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
folds &amp;lt;- bootstraps(episodes, times = 100, strata = uk_viewers)
folds


## # Bootstrap sampling using stratification 
## # A tibble: 100 × 2
## splits id          
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;       
## 1 &amp;lt;split [167/61]&amp;gt; Bootstrap001
## 2 &amp;lt;split [167/55]&amp;gt; Bootstrap002
## 3 &amp;lt;split [167/64]&amp;gt; Bootstrap003
## 4 &amp;lt;split [167/56]&amp;gt; Bootstrap004
## 5 &amp;lt;split [167/69]&amp;gt; Bootstrap005
## 6 &amp;lt;split [167/63]&amp;gt; Bootstrap006
## 7 &amp;lt;split [167/68]&amp;gt; Bootstrap007
## 8 &amp;lt;split [167/55]&amp;gt; Bootstrap008
## 9 &amp;lt;split [167/60]&amp;gt; Bootstrap009
## 10 &amp;lt;split [167/60]&amp;gt; Bootstrap010
## # … with 90 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to use &lt;code&gt;first_aired&lt;/code&gt; as our predictor, but let’s do some feature engineering here. Let’s create a date feature (just year here; if we had more data, maybe we could try week of the year or month), and also create a feature for a few holidays that are celebrated in the UK and have special Doctor Who episodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;who_rec &amp;lt;-
  recipe(uk_viewers ~ first_aired, data = episodes) %&amp;gt;%
  step_date(first_aired, features = "year") %&amp;gt;%
  step_holiday(first_aired,
    holidays = c("NewYearsDay", "ChristmasDay"),
    keep_original_cols = FALSE
  )

## not needed for modeling, but just to check how things are going:
prep(who_rec) %&amp;gt;% bake(new_data = NULL)


## # A tibble: 167 × 4
## uk_viewers first_aired_year first_aired_NewYearsDay first_aired_ChristmasDay
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 10.8 2005 0 0
## 2 7.97 2005 0 0
## 3 8.86 2005 0 0
## 4 7.63 2005 0 0
## 5 7.98 2005 0 0
## 6 8.63 2005 0 0
## 7 8.01 2005 0 0
## 8 8.06 2005 0 0
## 9 7.11 2005 0 0
## 10 6.86 2005 0 0
## # … with 157 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s combine this feature engineering recipe together with a model. We don’t have much data here, so let’s stick with a straightforward OLS linear model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;who_wf &amp;lt;- workflow(who_rec, linear_reg())
who_wf


## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_date()
## • step_holiday()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extract custom quantities from resampled workflows
&lt;/h2&gt;

&lt;p&gt;If you look at many of my tutorials or the documentation for tidymodels, you’ll see that we can fit our workflow to our resamples with code like &lt;code&gt;fit_resamples(who_wf, folds)&lt;/code&gt;. This can give us some useful results, but sometimes we want &lt;em&gt;more&lt;/em&gt;. Functions like &lt;code&gt;fit_resamples()&lt;/code&gt; and &lt;code&gt;tune_grid()&lt;/code&gt; don’t keep the fitted models they train, because those models exist only for evaluation or tuning; we usually throw them away. Sometimes we want to record something about those models beyond their performance; we can do that using a special &lt;code&gt;control_*()&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ctrl_extract &amp;lt;- control_resamples(extract = extract_fit_engine)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create &lt;code&gt;ctrl_extract&lt;/code&gt;, I used the &lt;a href="https://workflows.tidymodels.org/reference/extract-workflow.html"&gt;&lt;code&gt;extract_fit_engine()&lt;/code&gt;&lt;/a&gt; function, but you have total flexibility here and can supply your own function. Check out &lt;a href="https://www.tidymodels.org/learn/models/coefficients/"&gt;this tutorial&lt;/a&gt; for another way to supply a custom function here.&lt;/p&gt;
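As one hedged illustration of that flexibility (my own sketch, not code from the post), the `extract` argument can take an anonymous function of each resample's fitted workflow, for instance keeping only the tidied coefficients instead of the whole `lm` object:

```r
library(tidymodels)

# Sketch: a custom extract function. During fit_resamples(), the function
# receives each resample's fitted workflow; here we keep just the tidied
# model coefficients rather than the entire engine object.
ctrl_coefs = control_resamples(
  extract = function(fitted_wf) tidy(extract_fit_engine(fitted_wf))
)

# Used the same way as before, e.g.:
# fit_resamples(who_wf, folds, control = ctrl_coefs)
```

Extracting only what you need keeps the resampling results object smaller when you fit to many resamples.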

&lt;p&gt;With our &lt;code&gt;ctrl_extract&lt;/code&gt; ready to go, we can now fit to our resamples and keep the linear models for each resample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()
set.seed(234)
who_rs &amp;lt;- fit_resamples(who_wf, folds, control = ctrl_extract)
who_rs


## # Resampling results
## # Bootstrap sampling using stratification 
## # A tibble: 100 × 5
## splits id .metrics .notes .extracts    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;       
## 1 &amp;lt;split [167/61]&amp;gt; Bootstrap001 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 2 &amp;lt;split [167/55]&amp;gt; Bootstrap002 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 3 &amp;lt;split [167/64]&amp;gt; Bootstrap003 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 4 &amp;lt;split [167/56]&amp;gt; Bootstrap004 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 5 &amp;lt;split [167/69]&amp;gt; Bootstrap005 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 6 &amp;lt;split [167/63]&amp;gt; Bootstrap006 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 7 &amp;lt;split [167/68]&amp;gt; Bootstrap007 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 8 &amp;lt;split [167/55]&amp;gt; Bootstrap008 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 9 &amp;lt;split [167/60]&amp;gt; Bootstrap009 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## 10 &amp;lt;split [167/60]&amp;gt; Bootstrap010 &amp;lt;tibble [2 × 4]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt; &amp;lt;tibble [1 ×…
## # … with 90 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we have each &lt;code&gt;lm&lt;/code&gt; object for each resample, we can &lt;code&gt;tidy()&lt;/code&gt; them to find the coefficients. We can do any kind of analysis we want on these bootstrapped coefficients, including making a visualization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;who_rs %&amp;gt;%
  select(id, .extracts) %&amp;gt;%
  unnest(.extracts) %&amp;gt;%
  mutate(coefs = map(.extracts, tidy)) %&amp;gt;%
  unnest(coefs) %&amp;gt;%
  filter(term != "(Intercept)") %&amp;gt;%
  ggplot(aes(estimate, fill = term)) +
  geom_histogram(alpha = 0.8, bins = 12, show.legend = FALSE) +
  facet_wrap(vars(term), scales = "free")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JjoktKKU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JjoktKKU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/doctor-who/index_files/figure-html/unnamed-chunk-8-1.png" alt="" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like episodes airing on Christmas Day have &lt;strong&gt;much&lt;/strong&gt; higher viewership, 2.5 to 3 million viewers higher than other days. Airing on New Year’s Day also looks like it is associated with more viewers, and we see evidence for a modest decrease in viewers over the years.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Predict giant pumpkin weights 🎃 with tidymodels</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Mon, 08 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predict-giant-pumpkin-weights-with-tidymodels-3o52</link>
      <guid>https://dev.to/juliasilge/predict-giant-pumpkin-weights-with-tidymodels-3o52</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/" rel="noopener noreferrer"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/" rel="noopener noreferrer"&gt;tidymodels&lt;/a&gt; packages. If you are a tidymodels user, either just starting out or someone who has used the packages a lot, we are interested in your feedback on &lt;a href="https://www.tidyverse.org/blog/2021/10/tidymodels-2022-survey/" rel="noopener noreferrer"&gt;our priorities for 2022&lt;/a&gt;. The survey we fielded last year turned out to be very helpful in making decisions, so we would so appreciate your input again!&lt;/p&gt;

&lt;p&gt;Today’s screencast is great for someone just starting out with &lt;a href="https://workflowsets.tidymodels.org/" rel="noopener noreferrer"&gt;workflowsets&lt;/a&gt;, the tidymodels package for handling multiple preprocessing/modeling combinations at once, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday" rel="noopener noreferrer"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on giant pumpkins from competitions. 🥧&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qNxJKke2rsE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict the weight of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-10-19/readme.md" rel="noopener noreferrer"&gt;giant pumpkins&lt;/a&gt; from other characteristics measured during a competition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

pumpkins_raw &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv")

pumpkins &amp;lt;-
  pumpkins_raw %&amp;gt;%
  separate(id, into = c("year", "type")) %&amp;gt;%
  mutate(across(c(year, weight_lbs, ott, place), parse_number)) %&amp;gt;%
  filter(type == "P") %&amp;gt;%
  select(weight_lbs, year, place, ott, gpc_site, country)

pumpkins


## # A tibble: 15,965 × 6
## weight_lbs year place ott gpc_site country    
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;      
## 1 2032 2013 1 475 Uesugi Farms Weigh-off United Sta…
## 2 1985 2013 2 453 Safeway World Championship Pumpkin … United Sta…
## 3 1894 2013 3 445 Safeway World Championship Pumpkin … United Sta…
## 4 1874. 2013 4 436 Elk Grove Giant Pumpkin Festival United Sta…
## 5 1813 2013 5 430 The Great Howard Dill Giant Pumpkin… Canada     
## 6 1791 2013 6 431 Elk Grove Giant Pumpkin Festival United Sta…
## 7 1784 2013 7 445 Uesugi Farms Weigh-off United Sta…
## 8 1784. 2013 8 434 Stillwater Harvestfest United Sta…
## 9 1780. 2013 9 422 Stillwater Harvestfest United Sta…
## 10 1766. 2013 10 425 Durham Fair Weigh-Off United Sta…
## # … with 15,955 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main relationship here is between the volume/size of the pumpkin (measured via “over-the-top inches”) and weight.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkins %&amp;gt;%
  filter(ott &amp;gt; 20, ott &amp;lt; 1e3) %&amp;gt;%
  ggplot(aes(ott, weight_lbs, color = place)) +
  geom_point(alpha = 0.2, size = 1.1) +
  labs(x = "over-the-top inches", y = "weight (lbs)") +
  scale_color_viridis_c()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Big, heavy pumpkins placed closer to winning at the competitions, naturally!&lt;/p&gt;

&lt;p&gt;Has there been any shift in this relationship over time?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkins %&amp;gt;%
  filter(ott &amp;gt; 20, ott &amp;lt; 1e3) %&amp;gt;%
  ggplot(aes(ott, weight_lbs)) +
  geom_point(alpha = 0.2, size = 1.1, color = "gray60") +
  geom_smooth(aes(color = factor(year)),
    method = lm, formula = y ~ splines::bs(x, 3),
    se = FALSE, size = 1.5, alpha = 0.6
  ) +
  labs(x = "over-the-top inches", y = "weight (lbs)", color = NULL) +
  scale_color_viridis_d()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hard to say, I think.&lt;/p&gt;

&lt;p&gt;Which countries produced more or less massive pumpkins?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkins %&amp;gt;%
  mutate(
    country = fct_lump(country, n = 10),
    country = fct_reorder(country, weight_lbs)
  ) %&amp;gt;%
  ggplot(aes(country, weight_lbs, color = country)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "weight (lbs)") +
  theme(legend.position = "none")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Build and fit a workflow set
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We’ll stratify by our outcome &lt;code&gt;weight_lbs&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
pumpkin_split &amp;lt;- pumpkins %&amp;gt;%
  filter(ott &amp;gt; 20, ott &amp;lt; 1e3) %&amp;gt;%
  initial_split(strata = weight_lbs)

pumpkin_train &amp;lt;- training(pumpkin_split)
pumpkin_test &amp;lt;- testing(pumpkin_split)

set.seed(234)
pumpkin_folds &amp;lt;- vfold_cv(pumpkin_train, strata = weight_lbs)
pumpkin_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [8954/996]&amp;gt; Fold01
## 2 &amp;lt;split [8954/996]&amp;gt; Fold02
## 3 &amp;lt;split [8954/996]&amp;gt; Fold03
## 4 &amp;lt;split [8954/996]&amp;gt; Fold04
## 5 &amp;lt;split [8954/996]&amp;gt; Fold05
## 6 &amp;lt;split [8954/996]&amp;gt; Fold06
## 7 &amp;lt;split [8955/995]&amp;gt; Fold07
## 8 &amp;lt;split [8956/994]&amp;gt; Fold08
## 9 &amp;lt;split [8957/993]&amp;gt; Fold09
## 10 &amp;lt;split [8958/992]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s create three data preprocessing recipes: one that only pools infrequently occurring factor levels, one that also creates indicator variables, and finally one that also creates spline terms for over-the-top inches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base_rec &amp;lt;-
  recipe(weight_lbs ~ ott + year + country + gpc_site,
    data = pumpkin_train
  ) %&amp;gt;%
  step_other(country, gpc_site, threshold = 0.02)

ind_rec &amp;lt;-
  base_rec %&amp;gt;%
  step_dummy(all_nominal_predictors())

spline_rec &amp;lt;-
  ind_rec %&amp;gt;%
  step_bs(ott)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let’s create three model specifications: a random forest model, a MARS model, and a linear model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf_spec &amp;lt;-
  rand_forest(trees = 1e3) %&amp;gt;%
  set_mode("regression") %&amp;gt;%
  set_engine("ranger")

mars_spec &amp;lt;-
  mars() %&amp;gt;%
  set_mode("regression") %&amp;gt;%
  set_engine("earth")

lm_spec &amp;lt;- linear_reg()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it’s time to put the preprocessing and models together in a &lt;code&gt;workflow_set()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pumpkin_set &amp;lt;-
  workflow_set(
    list(base_rec, ind_rec, spline_rec),
    list(rf_spec, mars_spec, lm_spec),
    cross = FALSE
  )

pumpkin_set


## # A workflow set/tibble: 3 × 4
## wflow_id info option result    
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;    
## 1 recipe_1_rand_forest &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 2 recipe_2_mars &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 3 recipe_3_linear_reg &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use &lt;code&gt;cross = FALSE&lt;/code&gt; because we don’t want every combination of these components, only three options to try. Let’s fit these possible candidates to our resamples to see which one performs best.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()
set.seed(2021)

pumpkin_rs &amp;lt;-
  workflow_map(
    pumpkin_set,
    "fit_resamples",
    resamples = pumpkin_folds
  )

pumpkin_rs


## # A workflow set/tibble: 3 × 4
## wflow_id info option result   
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 recipe_1_rand_forest &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[1]&amp;gt; &amp;lt;rsmp[+]&amp;gt;
## 2 recipe_2_mars &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[1]&amp;gt; &amp;lt;rsmp[+]&amp;gt;
## 3 recipe_3_linear_reg &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[1]&amp;gt; &amp;lt;rsmp[+]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluate workflow set
&lt;/h2&gt;

&lt;p&gt;How did our three candidates do?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(pumpkin_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fgiant-pumpkins%2Findex_files%2Ffigure-html%2Funnamed-chunk-11-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is not much difference between the three options, and if anything, our linear model with spline feature engineering maybe did better. This is nice because it’s a simpler model!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(pumpkin_rs)


## # A tibble: 6 × 9
## wflow_id .config preproc model .metric .estimator mean n std_err
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt;
## 1 recipe_1_r… Preprocess… recipe rand_… rmse standard 86.1 10 1.10e+0
## 2 recipe_1_r… Preprocess… recipe rand_… rsq standard 0.969 10 9.97e-4
## 3 recipe_2_m… Preprocess… recipe mars rmse standard 83.8 10 1.92e+0
## 4 recipe_2_m… Preprocess… recipe mars rsq standard 0.969 10 1.67e-3
## 5 recipe_3_l… Preprocess… recipe linea… rmse standard 82.4 10 2.27e+0
## 6 recipe_3_l… Preprocess… recipe linea… rsq standard 0.970 10 1.97e-3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
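
&lt;p&gt;Another way to compare the candidates numerically is &lt;code&gt;rank_results()&lt;/code&gt; from workflowsets (loaded with tidymodels); this is a quick sketch, not part of the original analysis above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## rank all workflows in the set by cross-validated RMSE, best first
rank_results(pumpkin_rs, rank_metric = "rmse")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;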



&lt;p&gt;We can extract the workflow we want to use and fit it to our training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fit &amp;lt;-
  extract_workflow(pumpkin_rs, "recipe_3_linear_reg") %&amp;gt;%
  fit(pumpkin_train)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use an object like this to predict, for example on the test data with &lt;code&gt;predict(final_fit, pumpkin_test)&lt;/code&gt;, or we can examine the model parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tidy(final_fit) %&amp;gt;%
  arrange(-abs(estimate))


## # A tibble: 15 × 5
## term estimate std.error statistic p.value
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 (Intercept) -9731. 675. -14.4 1.30e- 46
## 2 ott_bs_3 2585. 25.6 101. 0        
## 3 ott_bs_2 450. 11.9 37.9 2.75e-293
## 4 ott_bs_1 -345. 36.3 -9.50 2.49e- 21
## 5 gpc_site_Ohio.Valley.Giant.Pumpkin.Gr… 21.1 7.80 2.70 6.89e- 3
## 6 country_United.States 11.9 5.66 2.11 3.53e- 2
## 7 gpc_site_Stillwater.Harvestfest 11.6 7.87 1.48 1.40e- 1
## 8 country_Germany -11.5 6.68 -1.71 8.64e- 2
## 9 country_other -10.7 6.33 -1.69 9.13e- 2
## 10 country_Canada 9.29 6.12 1.52 1.29e- 1
## 11 country_Italy 8.12 7.02 1.16 2.47e- 1
## 12 gpc_site_Elk.Grove.Giant.Pumpkin.Fest… -7.81 7.70 -1.01 3.10e- 1
## 13 year 4.89 0.334 14.6 5.03e- 48
## 14 gpc_site_Wiegemeisterschaft.Berlin.Br… 1.51 8.07 0.187 8.51e- 1
## 15 gpc_site_other 1.41 5.60 0.251 8.02e- 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spline terms are by far the most important, but we do see evidence of certain sites and countries being predictive of weight (either up or down) as well as a small trend of heavier pumpkins with year.&lt;/p&gt;
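
&lt;p&gt;As a quick sketch of prediction (assuming the yardstick package, loaded with tidymodels), we can use &lt;code&gt;augment()&lt;/code&gt; on the fitted workflow to add predictions to the test set and estimate RMSE there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## add a .pred column to the test set, then compute test-set RMSE
augment(final_fit, pumpkin_test) %&amp;gt;%
  rmse(truth = weight_lbs, estimate = .pred)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;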

&lt;p&gt;Don’t forget to take the &lt;a href="https://www.tidyverse.org/blog/2021/10/tidymodels-2022-survey/" rel="noopener noreferrer"&gt;tidymodels survey for 2022 priorities&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Spatial resampling for the #30DayMapChallenge 🗺  </title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Fri, 05 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/spatial-resampling-for-the-30daymapchallenge-1gdm</link>
      <guid>https://dev.to/juliasilge/spatial-resampling-for-the-30daymapchallenge-1gdm</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/" rel="noopener noreferrer"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/" rel="noopener noreferrer"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to use spatial resampling for evaluating a model, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday" rel="noopener noreferrer"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on geographic data. 🗾&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wVrcw_ek3a4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Geographic data is special when it comes to, well, basically everything! This includes modeling and especially &lt;em&gt;evaluating&lt;/em&gt; models. This week’s &lt;code&gt;#TidyTuesday&lt;/code&gt; is all about exploring &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-11-02/readme.md" rel="noopener noreferrer"&gt;spatial data&lt;/a&gt; for the &lt;a href="https://github.com/tjukanovt/30DayMapChallenge" rel="noopener noreferrer"&gt;&lt;code&gt;#30DayMapChallenge&lt;/code&gt;&lt;/a&gt; this month, and especially the spData and spDataLarge packages along with the book &lt;a href="https://geocompr.robinlovelace.net/" rel="noopener noreferrer"&gt;&lt;em&gt;Geocomputation with R&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s use the dataset of landslides (plus not-landslide locations) in Southern Ecuador.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data("lsl", package = "spDataLarge")
landslides &amp;lt;- as_tibble(lsl)
landslides


## # A tibble: 350 × 8
## x y lslpts slope cplan cprof elev log10_carea
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 715078. 9558647. FALSE 37.4 0.0205 0.00866 2477. 2.61
## 2 713748. 9558047. FALSE 41.7 -0.0241 0.00679 2486. 3.07
## 3 712508. 9558887. FALSE 20.0 0.0390 0.0147 2142. 2.29
## 4 713998. 9558187. FALSE 45.8 -0.00632 0.00435 2391. 3.83
## 5 714308. 9557307. FALSE 41.7 0.0423 -0.0202 2570. 2.70
## 6 713488. 9558117. FALSE 52.9 0.0323 0.00703 2418. 2.48
## 7 714948. 9558347. FALSE 51.9 0.0399 0.000791 2546. 3.15
## 8 714678. 9560357. FALSE 38.5 0.0164 0.0299 1932. 3.26
## 9 714368. 9560287. FALSE 24.1 -0.0188 -0.00956 2059. 3.20
## 10 712528. 9559217. FALSE 50.5 0.0142 0.0151 1973. 2.60
## # … with 340 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How are these landslides (plus not-landslides) distributed in this area?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggplot(landslides, aes(x, y)) +
  stat_summary_hex(aes(z = elev), alpha = 0.6, bins = 12) +
  geom_point(aes(color = lslpts), alpha = 0.7) +
  coord_fixed() +
  scale_fill_viridis_c() +
  scale_color_manual(values = c("gray90", "midnightblue")) +
  labs(fill = "Elevation", color = "Landslide?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create spatial resamples
&lt;/h2&gt;

&lt;p&gt;In tidymodels, one of the first steps we recommend thinking about is “spending your data budget.” When it comes to geographic data, points close to each other are often similar so we don’t want to randomly resample our observations. Instead, we want to use a resampling strategy that accounts for that autocorrelation. Let’s create both resamples that are appropriate to spatial data and resamples that might work for “regular,” non-spatial data but are not a good fit for geographic data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)
library(spatialsample)

set.seed(123)
good_folds &amp;lt;- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)
good_folds


## # 5-fold spatial cross-validation 
## # A tibble: 5 × 2
## splits id   
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;
## 1 &amp;lt;split [306/44]&amp;gt; Fold1
## 2 &amp;lt;split [256/94]&amp;gt; Fold2
## 3 &amp;lt;split [251/99]&amp;gt; Fold3
## 4 &amp;lt;split [303/47]&amp;gt; Fold4
## 5 &amp;lt;split [284/66]&amp;gt; Fold5


set.seed(234)
bad_folds &amp;lt;- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds


## # 5-fold cross-validation using stratification 
## # A tibble: 5 × 2
## splits id   
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;
## 1 &amp;lt;split [280/70]&amp;gt; Fold1
## 2 &amp;lt;split [280/70]&amp;gt; Fold2
## 3 &amp;lt;split [280/70]&amp;gt; Fold3
## 4 &amp;lt;split [280/70]&amp;gt; Fold4
## 5 &amp;lt;split [280/70]&amp;gt; Fold5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://spatialsample.tidymodels.org/" rel="noopener noreferrer"&gt;spatialsample&lt;/a&gt; package currently provides one method for spatial resampling, and we are interested in hearing about what other methods we should support next.&lt;/p&gt;

&lt;p&gt;How do these resamples look? Let’s create a little helper function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot_splits &amp;lt;- function(split) {
  p &amp;lt;- bind_rows(
    analysis(split) %&amp;gt;%
      mutate(analysis = "Analysis"),
    assessment(split) %&amp;gt;%
      mutate(analysis = "Assessment")
  ) %&amp;gt;%
    ggplot(aes(x, y, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) +
    coord_fixed() +
    labs(color = NULL)
  print(p)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spatial resampling creates resamples where observations close to each other are kept together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;walk(good_folds$splits, plot_splits)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-6-.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-6-.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The regular resampling doesn’t do this; it just randomly resamples all observations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;walk(bad_folds$splits, plot_splits)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-7-.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fmap-challenge%2Findex_files%2Ffigure-html%2Funnamed-chunk-7-.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This second option is &lt;em&gt;not&lt;/em&gt; a good idea for geographic data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fit and evaluate model
&lt;/h2&gt;

&lt;p&gt;Let’s create a straightforward logistic regression model to predict whether a location saw a landslide based on the other characteristics like slope, elevation, amount of water flow, etc. We can estimate how well this &lt;em&gt;same&lt;/em&gt; model fits the data both with our regular folds and our special spatial resampling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glm_spec &amp;lt;- logistic_reg()
lsl_form &amp;lt;- lslpts ~ slope + cplan + cprof + elev + log10_carea

lsl_wf &amp;lt;- workflow(lsl_form, glm_spec)

doParallel::registerDoParallel()
set.seed(2021)
regular_rs &amp;lt;- fit_resamples(lsl_wf, bad_folds)
set.seed(2021)
spatial_rs &amp;lt;- fit_resamples(lsl_wf, good_folds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did our results turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(regular_rs)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.737 5 0.0173 Preprocessor1_Model1
## 2 roc_auc binary 0.808 5 0.0201 Preprocessor1_Model1


collect_metrics(spatial_rs)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.677 5 0.0708 Preprocessor1_Model1
## 2 roc_auc binary 0.782 5 0.00790 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we use the “regular” resampling, we get a more optimistic estimate of performance, which would fool us into thinking our model would perform better than it really could. The lower performance estimate using spatial resampling is more accurate because of the autocorrelation of this geographic data; observations near each other are more alike than observations far apart. With geographic data, it’s important to use an appropriate model evaluation strategy!&lt;/p&gt;
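
&lt;p&gt;To see this gap directly, here is a small sketch (assuming the objects above) that binds the two sets of metrics together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## compare the two resampling strategies side by side
bind_rows(
  collect_metrics(regular_rs) %&amp;gt;% mutate(resampling = "random"),
  collect_metrics(spatial_rs) %&amp;gt;% mutate(resampling = "spatial")
) %&amp;gt;%
  select(resampling, .metric, mean, std_err)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;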

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Multiclass predictive modeling for economics research papers 📑</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 29 Sep 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/multiclass-predictive-modeling-for-economics-research-papers-5f33</link>
      <guid>https://dev.to/juliasilge/multiclass-predictive-modeling-for-economics-research-papers-5f33</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to build, tune, and evaluate a multiclass predictive model with text features and lasso regularization, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on NBER working papers. 📑&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict the category of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-09-28/readme.md"&gt;National Bureau of Economic Research working papers&lt;/a&gt; from the titles and years of the papers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

papers &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/papers.csv")
programs &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/programs.csv")
paper_authors &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/paper_authors.csv")
paper_programs &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/paper_programs.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start by joining up these datasets to find the info we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;papers_joined &amp;lt;-
  paper_programs %&amp;gt;%
  left_join(programs) %&amp;gt;%
  left_join(papers) %&amp;gt;%
  filter(!is.na(program_category)) %&amp;gt;%
  distinct(paper, program_category, year, title)

papers_joined %&amp;gt;%
  count(program_category)


## # A tibble: 3 × 2
## program_category n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Finance 4336
## 2 Macro/International 12012
## 3 Micro 18527

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The papers are in three categories (finance, microeconomics, and macroeconomics) so we’ll be training a multiclass predictive model, not a binary classification model as we often see or use.&lt;/p&gt;
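
&lt;p&gt;With imbalanced classes like these, it can help to keep the no-information baseline in mind: the accuracy from always guessing the largest class. A quick sketch from the counts above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## accuracy from always predicting the most common category
class_counts &amp;lt;- c(Finance = 4336, `Macro/International` = 12012, Micro = 18527)
max(class_counts) / sum(class_counts)
## about 0.53, the accuracy to beat

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;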

&lt;p&gt;Let’s create one exploratory plot before we move on to modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)
library(tidylo)

title_log_odds &amp;lt;-
  papers_joined %&amp;gt;%
  unnest_tokens(word, title) %&amp;gt;%
  filter(!is.na(program_category)) %&amp;gt;%
  count(program_category, word, sort = TRUE) %&amp;gt;%
  bind_log_odds(program_category, word, n)

title_log_odds %&amp;gt;%
  group_by(program_category) %&amp;gt;%
  slice_max(log_odds_weighted, n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(log_odds_weighted,
    fct_reorder(word, log_odds_weighted),
    fill = program_category
  )) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(program_category), scales = "free_y") +
  labs(x = "Log odds (weighted)", y = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nepb9c4f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nepb9c4f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These types of relationships between category and title words are what we want to use in our predictive model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build and tune a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We’ll stratify by our outcome &lt;code&gt;program_category&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
nber_split &amp;lt;- initial_split(papers_joined, strata = program_category)
nber_train &amp;lt;- training(nber_split)
nber_test &amp;lt;- testing(nber_split)

set.seed(234)
nber_folds &amp;lt;- vfold_cv(nber_train, strata = program_category)
nber_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [23539/2617]&amp;gt; Fold01
## 2 &amp;lt;split [23539/2617]&amp;gt; Fold02
## 3 &amp;lt;split [23540/2616]&amp;gt; Fold03
## 4 &amp;lt;split [23540/2616]&amp;gt; Fold04
## 5 &amp;lt;split [23540/2616]&amp;gt; Fold05
## 6 &amp;lt;split [23541/2615]&amp;gt; Fold06
## 7 &amp;lt;split [23541/2615]&amp;gt; Fold07
## 8 &amp;lt;split [23541/2615]&amp;gt; Fold08
## 9 &amp;lt;split [23541/2615]&amp;gt; Fold09
## 10 &amp;lt;split [23542/2614]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s set up our feature engineering. We will need to transform our text data into features useful for our model by tokenizing and computing (in this case) tf-idf. Let’s also downsample since our dataset is imbalanced, with many more of some of the categories than others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(themis)
library(textrecipes)

nber_rec &amp;lt;-
  recipe(program_category ~ year + title, data = nber_train) %&amp;gt;%
  step_tokenize(title) %&amp;gt;%
  step_tokenfilter(title, max_tokens = 200) %&amp;gt;%
  step_tfidf(title) %&amp;gt;%
  step_downsample(program_category)

nber_rec


## Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 2
## 
## Operations:
## 
## Tokenization for title
## Text filtering for title
## Term frequency-inverse document frequency with title
## Down-sampling based on program_category

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let’s create our model specification for a lasso model. We need to use a model specification that can handle multiclass data, in this case &lt;code&gt;multinom_reg()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multi_spec &amp;lt;-
  multinom_reg(penalty = tune(), mixture = 1) %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("glmnet")

multi_spec


## Multinomial Regression Model Specification (classification)
## 
## Main Arguments:
## penalty = tune()
## mixture = 1
## 
## Computational engine: glmnet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it’s time to put the preprocessing and model together in a &lt;code&gt;workflow()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nber_wf &amp;lt;- workflow(nber_rec, multi_spec)
nber_wf


## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: multinom_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## 
## • step_tokenize()
## • step_tokenfilter()
## • step_tfidf()
## • step_downsample()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Multinomial Regression Model Specification (classification)
## 
## Main Arguments:
## penalty = tune()
## mixture = 1
## 
## Computational engine: glmnet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the lasso regularization &lt;code&gt;penalty&lt;/code&gt; is a hyperparameter of the model (we can’t find the best value from fitting the model a single time), let’s tune over a grid of possible &lt;code&gt;penalty&lt;/code&gt; parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nber_grid &amp;lt;- grid_regular(penalty(range = c(-5, 0)), levels = 20)

doParallel::registerDoParallel()
set.seed(2021)
nber_rs &amp;lt;-
  tune_grid(
    nber_wf,
    nber_folds,
    grid = nber_grid
  )

nber_rs


## # Tuning results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 4
## splits id .metrics .notes          
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;          
## 1 &amp;lt;split [23539/2617]&amp;gt; Fold01 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 2 &amp;lt;split [23539/2617]&amp;gt; Fold02 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 3 &amp;lt;split [23540/2616]&amp;gt; Fold03 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 4 &amp;lt;split [23540/2616]&amp;gt; Fold04 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 5 &amp;lt;split [23540/2616]&amp;gt; Fold05 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 6 &amp;lt;split [23541/2615]&amp;gt; Fold06 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 7 &amp;lt;split [23541/2615]&amp;gt; Fold07 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 8 &amp;lt;split [23541/2615]&amp;gt; Fold08 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 9 &amp;lt;split [23541/2615]&amp;gt; Fold09 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 10 &amp;lt;split [23542/2614]&amp;gt; Fold10 &amp;lt;tibble [40 × 5]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a pretty fast model to fit, since it is linear. How did it turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(nber_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A72rc3r---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A72rc3r---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(nber_rs)


## # A tibble: 5 × 7
## penalty .metric .estimator mean n std_err .config              
## &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                
## 1 0.00234 roc_auc hand_till 0.784 10 0.00249 Preprocessor1_Model10
## 2 0.00428 roc_auc hand_till 0.783 10 0.00244 Preprocessor1_Model11
## 3 0.00127 roc_auc hand_till 0.783 10 0.00251 Preprocessor1_Model09
## 4 0.000695 roc_auc hand_till 0.782 10 0.00253 Preprocessor1_Model08
## 5 0.000379 roc_auc hand_till 0.782 10 0.00254 Preprocessor1_Model07

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choose and evaluate a final model
&lt;/h2&gt;

&lt;p&gt;We could use the numerically best model with &lt;code&gt;select_best()&lt;/code&gt;, but with regularized models we often would rather choose a simpler model within some limits of performance. We can choose using the “one-standard-error rule” with &lt;code&gt;select_by_one_std_err()&lt;/code&gt; and then use &lt;code&gt;last_fit()&lt;/code&gt; to &lt;strong&gt;fit&lt;/strong&gt; one time to the training data and &lt;strong&gt;evaluate&lt;/strong&gt; one time on the testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_penalty &amp;lt;-
  nber_rs %&amp;gt;%
  select_by_one_std_err(metric = "roc_auc", desc(penalty))

final_penalty


## # A tibble: 1 × 9
## penalty .metric .estimator mean n std_err .config .best .bound
## &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.00428 roc_auc hand_till 0.783 10 0.00244 Preprocessor1_Mod… 0.784 0.781


final_rs &amp;lt;-
  nber_wf %&amp;gt;%
  finalize_workflow(final_penalty) %&amp;gt;%
  last_fit(nber_split)

final_rs


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [26156/8719]&amp;gt; train/test split &amp;lt;tibble … &amp;lt;tibbl… &amp;lt;tibble [8,… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did our final model perform on the testing data?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(final_rs)


## # A tibble: 2 × 4
## .metric .estimator .estimate .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy multiclass 0.609 Preprocessor1_Model1
## 2 roc_auc hand_till 0.779 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize the difference in performance across classes with a confusion matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(final_rs) %&amp;gt;%
  conf_mat(program_category, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dsFUuMBh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dsFUuMBh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also visualize the ROC curves for each class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(final_rs) %&amp;gt;%
  roc_curve(truth = program_category, .pred_Finance:.pred_Micro) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_abline(slope = 1, color = "gray50", lty = 2, alpha = 0.8) +
  geom_path(size = 1.5, alpha = 0.7) +
  labs(color = NULL) +
  coord_fixed()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v_N0wv7C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v_N0wv7C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/nber-papers/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like the finance and microeconomics papers were easier to identify than the macroeconomics papers.&lt;/p&gt;

&lt;p&gt;Finally, we can extract (and save, if we like) the fitted workflow from our results to use for predicting on new data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fitted &amp;lt;- extract_workflow(final_rs)
## can save this for prediction later with readr::write_rds()

predict(final_fitted, nber_test[111,], type = "prob")


## # A tibble: 1 × 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.104 0.531 0.365

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
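
&lt;p&gt;To make that saving suggestion concrete, here is a minimal sketch of how serializing and reloading the fitted workflow might look; the filename is hypothetical, just for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## hypothetical filename, for illustration only
readr::write_rds(final_fitted, "nber_fitted_wf.rds")

## later, perhaps in a new R session:
final_fitted &amp;lt;- readr::read_rds("nber_fitted_wf.rds")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;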



&lt;p&gt;We can even make up new paper titles and see how our model classifies them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict(final_fitted, tibble(year = 2021, title = "Pricing Models for Corporate Responsibility"), type = "prob")


## # A tibble: 1 × 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.598 0.158 0.244


predict(final_fitted, tibble(year = 2021, title = "Teacher Health and Medicaid Expansion"), type = "prob")


## # A tibble: 1 × 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.288 0.141 0.571

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Dimensionality reduction for Billboard Top 100 songs 🎶</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 15 Sep 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/dimensionality-reduction-for-billboard-top-100-songs-48g8</link>
      <guid>https://dev.to/juliasilge/dimensionality-reduction-for-billboard-top-100-songs-48g8</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast focuses only on data preprocessing, or feature engineering; let’s walk through how to use dimensionality reduction for song features sourced from Spotify (mostly audio), with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Billboard Top 100 songs. 🎵&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/kE7H1oQ2rY4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to use dimensionality reduction for features of &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-09-14/readme.md"&gt;Billboard Top 100 songs&lt;/a&gt;, connecting data about where the songs were in the rankings with mostly audio features available from Spotify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

## billboard ranking data
billboard &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv")

## spotify feature data
audio_features &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s start by finding the longest streak each song was on this chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_weeks &amp;lt;-
  billboard %&amp;gt;%
  group_by(song_id) %&amp;gt;%
  summarise(weeks_on_chart = max(weeks_on_chart), .groups = "drop")

max_weeks


## # A tibble: 29,389 × 2
## song_id weeks_on_chart
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 -twistin'-White Silver SandsBill Black's Combo 2
## 2 ¿Dònde Està Santa Claus? (Where Is Santa Claus?)Augie Rios 4
## 3 ......And Roses And RosesAndy Williams 7
## 4 ...And Then There Were DrumsSandy Nelson 4
## 5 ...Baby One More TimeBritney Spears 32
## 6 ...Ready For It?Taylor Swift 19
## 7 '03 Bonnie &amp;amp; ClydeJay-Z Featuring Beyonce Knowles 23
## 8 '65 Love AffairPaul Davis 20
## 9 '98 Thug ParadiseTragedy, Capone, Infinite 5
## 10 'Round We GoBig Sister 2
## # … with 29,379 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s join this with the Spotify audio features (where available). We don’t have Spotify features for all the songs, and it’s possible that there are systematic differences in songs that we could vs. could not get Spotify data for. Something to keep in mind!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billboard_joined &amp;lt;-
  audio_features %&amp;gt;%
  filter(!is.na(spotify_track_popularity)) %&amp;gt;%
  inner_join(max_weeks)

billboard_joined


## # A tibble: 24,395 × 23
## song_id performer song spotify_genre spotify_track_id spotify_track_pr…
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;            
## 1 ......An… Andy Will… .....… ['adult stand… 3tvqPPpXyIgKrm4… https://p.scdn.c…
## 2 ...And T… Sandy Nel… ...An… ['rock-and-ro… 1fHHq3qHU8wpRKH… &amp;lt;NA&amp;gt;             
## 3 ...Baby … Britney S… ...Ba… ['dance pop',… 3MjUtNVVq3C8Fn0… https://p.scdn.c…
## 4 ...Ready… Taylor Sw… ...Re… ['pop', 'post… 2yLa0QULdQr0qAI… &amp;lt;NA&amp;gt;             
## 5 '03 Bonn… Jay-Z Fea… '03 B… ['east coast … 5ljCWsDlSyJ41kw… &amp;lt;NA&amp;gt;             
## 6 '65 Love… Paul Davis '65 L… ['album rock'… 5nBp8F6tekSrnFg… https://p.scdn.c…
## 7 'til I C… Tammy Wyn… 'til … ['country', '… 0aJHZYjwbfTmeyU… https://p.scdn.c…
## 8 'Til My … Luther Va… 'Til … ['funk', 'mot… 2R97RZWUx4vAFbM… https://p.scdn.c…
## 9 'Til Sum… Keith Urb… 'Til … ['australian … 1CKmI1IQjVEVB3F… &amp;lt;NA&amp;gt;             
## 10 'Til You… After 7 'Til … ['funk', 'neo… 3kGMziz884MLV1o… &amp;lt;NA&amp;gt;             
## # … with 24,385 more rows, and 17 more variables:
## # spotify_track_duration_ms &amp;lt;dbl&amp;gt;, spotify_track_explicit &amp;lt;lgl&amp;gt;,
## # spotify_track_album &amp;lt;chr&amp;gt;, danceability &amp;lt;dbl&amp;gt;, energy &amp;lt;dbl&amp;gt;, key &amp;lt;dbl&amp;gt;,
## # loudness &amp;lt;dbl&amp;gt;, mode &amp;lt;dbl&amp;gt;, speechiness &amp;lt;dbl&amp;gt;, acousticness &amp;lt;dbl&amp;gt;,
## # instrumentalness &amp;lt;dbl&amp;gt;, liveness &amp;lt;dbl&amp;gt;, valence &amp;lt;dbl&amp;gt;, tempo &amp;lt;dbl&amp;gt;,
## # time_signature &amp;lt;dbl&amp;gt;, spotify_track_popularity &amp;lt;dbl&amp;gt;, weeks_on_chart &amp;lt;dbl&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some of the features we now have for each song are characteristics of the song like the time signature (3/4, 4/4, 5/4) and the tempo in BPM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billboard_joined %&amp;gt;%
  filter(tempo &amp;gt; 0, time_signature &amp;gt; 1) %&amp;gt;%
  ggplot(aes(tempo, fill = factor(time_signature))) +
  geom_histogram(alpha = 0.5, position = "identity") +
  labs(fill = "time signature")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aFokIrZW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aFokIrZW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-5-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pop songs like those on the Billboard chart are overwhelmingly in 4/4!&lt;/p&gt;

&lt;p&gt;There are other features available from Spotify as well, such as “danceability” and “loudness.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(corrr)

billboard_joined %&amp;gt;%
  select(danceability:weeks_on_chart) %&amp;gt;%
  na.omit() %&amp;gt;%
  correlate() %&amp;gt;%
  rearrange() %&amp;gt;%
  network_plot(colours = c("orange", "white", "midnightblue"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kxkWuK-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-6-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kxkWuK-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-6-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like only &lt;code&gt;spotify_track_popularity&lt;/code&gt; is really at all correlated with &lt;code&gt;weeks_on_chart&lt;/code&gt;. That popularity metric isn’t really an audio feature of the song per se, but it may be helpful to have such a feature as we learn more about how dimensionality reduction works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimensionality reduction
&lt;/h2&gt;

&lt;p&gt;In our book &lt;em&gt;Tidy Modeling with R&lt;/em&gt;, we recently published a chapter on &lt;a href="https://www.tmwr.org/dimensionality.html"&gt;dimensionality reduction&lt;/a&gt;. My post today walks through a briefer, more basic outline of some of the material from that chapter. Within the tidymodels framework, dimensionality reduction is a feature engineering or data preprocessing step, so we use &lt;a href="https://recipes.tidymodels.org/"&gt;recipes&lt;/a&gt; to implement this kind of analysis. I typically show how to use a data preprocessing recipe together with a model, but in this post, let’s focus just on recipes and how they work.&lt;/p&gt;

&lt;p&gt;Let’s still start by splitting our data into training and testing sets, so we can estimate (i.e. train) our preprocessing recipe on the training set and then apply that trained recipe to new data (our testing set).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
billboard_split &amp;lt;- billboard_joined %&amp;gt;%
  select(danceability:weeks_on_chart) %&amp;gt;%
  mutate(weeks_on_chart = log(weeks_on_chart)) %&amp;gt;%
  na.omit() %&amp;gt;%
  initial_split(strata = weeks_on_chart)

## how many observations in each set?
billboard_split


## &amp;lt;Analysis/Assess/Total&amp;gt;
## &amp;lt;18245/6084/24329&amp;gt;


billboard_train &amp;lt;- training(billboard_split)
billboard_test &amp;lt;- testing(billboard_split)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s make a basic starter recipe that we can work off of.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billboard_rec &amp;lt;-
  recipe(weeks_on_chart ~ ., data = billboard_train) %&amp;gt;%
  step_zv(all_numeric_predictors()) %&amp;gt;%
  step_normalize(all_numeric_predictors())

rec_trained &amp;lt;- prep(billboard_rec)
rec_trained


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 13
## 
## Training data contained 18245 data points and no missing data.
## 
## Operations:
## 
## Zero variance filter removed no terms [trained]
## Centering and scaling for danceability, energy, key, loudness, ... [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we &lt;code&gt;prep()&lt;/code&gt; the recipe, we use the training data to estimate the quantities we need to apply these steps to new data. So in this case, we would, for example, compute the mean and standard deviation from the training data in order to center and scale. The testing data will be centered and scaled with the mean and standard deviation from the training data.&lt;/p&gt;
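
&lt;p&gt;As a quick illustration of this idea, we can pull those stored statistics out of the trained recipe; &lt;code&gt;step_normalize()&lt;/code&gt; is the second step in our recipe, so its estimates live at &lt;code&gt;number = 2&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## means and standard deviations estimated from the training set
tidy(rec_trained, number = 2)

## baking new data reuses those stored statistics
## rather than re-estimating them from the new data
bake(rec_trained, new_data = billboard_test)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;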

&lt;p&gt;Next, let’s make a little helper function for us to extend this starter recipe. This function will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prep()&lt;/code&gt; the recipe (you can &lt;code&gt;prep()&lt;/code&gt; an already-prepped recipe, for example after you have added new steps)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bake()&lt;/code&gt; the recipe using our &lt;strong&gt;testing&lt;/strong&gt; data&lt;/li&gt;
&lt;li&gt;make a visualization of the results
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggforce)

plot_test_results &amp;lt;- function(recipe, dat = billboard_test) {
  recipe %&amp;gt;%
    prep() %&amp;gt;%
    bake(new_data = dat) %&amp;gt;%
    ggplot() +
    geom_autopoint(aes(color = weeks_on_chart), alpha = 0.4, size = 0.5) +
    geom_autodensity(alpha = .3) +
    facet_matrix(vars(-weeks_on_chart), layer.diag = 2) +
    scale_color_distiller(palette = "BuPu", direction = 1) +
    labs(color = "weeks (log)")
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PCA
&lt;/h3&gt;

&lt;p&gt;Let’s start with principal component analysis, one of the most straightforward dimensionality reduction approaches. It is linear, unsupervised, and makes new features that try to account for as much variation in the data as possible. Remember that our function estimates PCA components from our training data and then applies those to our testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pca(all_numeric_predictors(), num_comp = 4) %&amp;gt;%
  plot_test_results() +
  ggtitle("Principal Component Analysis")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ItpyqWPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ItpyqWPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-11-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks a bit underwhelming in terms of the components being connected to weeks on the chart, but there is a little bit of relationship.&lt;/p&gt;

&lt;p&gt;We &lt;a href="https://www.tmwr.org/recipes.html#tidy-a-recipe"&gt;can &lt;code&gt;tidy()&lt;/code&gt; recipes&lt;/a&gt;, either as a whole or for individual steps, and either before or after using &lt;code&gt;prep()&lt;/code&gt;. Let’s &lt;code&gt;tidy()&lt;/code&gt; this recipe to find the features that contribute the most to the PC components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pca(all_numeric_predictors(), num_comp = 4) %&amp;gt;%
  prep() %&amp;gt;%
  tidy(number = 3) %&amp;gt;%
  filter(component %in% paste0("PC", 1:4)) %&amp;gt;%
  group_by(component) %&amp;gt;%
  slice_max(abs(value), n = 5) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(abs(value), terms, fill = value &amp;gt; 0)) +
  geom_col(alpha = 0.8) +
  facet_wrap(vars(component), scales = "free_y") +
  labs(x = "Contribution to principal component", y = NULL, fill = "Positive?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wajsp321--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-12-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wajsp321--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-12-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve &lt;a href="https://juliasilge.com/blog/best-hip-hop/"&gt;implemented PCA for these features before&lt;/a&gt;. The results this time for a different sample of songs aren’t exactly the same but have some qualitative similarities; we see that the first component is mostly about loudness/energy vs. acoustic while the second is about valence, where high valence means more positive, cheerful, happy music.&lt;/p&gt;

&lt;h3&gt;
  
  
  PLS
&lt;/h3&gt;

&lt;p&gt;Partial least squares is a lot like PCA, but it is &lt;strong&gt;supervised&lt;/strong&gt;; it makes components that try to account for a lot of variation but are also related to the outcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pls(all_numeric_predictors(), outcome = "weeks_on_chart", num_comp = 4) %&amp;gt;%
  plot_test_results() +
  ggtitle("Partial Least Squares")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ahu2W7fj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ahu2W7fj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-13-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We do see a stronger relationship to weeks on the chart here, as we would hope, since PLS is supervised.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_trained %&amp;gt;%
  step_pls(all_numeric_predictors(), outcome = "weeks_on_chart", num_comp = 4) %&amp;gt;%
  prep() %&amp;gt;%
  tidy(number = 3) %&amp;gt;%
  filter(component %in% paste0("PLS", 1:4)) %&amp;gt;%
  group_by(component) %&amp;gt;%
  slice_max(abs(value), n = 5) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(abs(value), terms, fill = value &amp;gt; 0)) +
  geom_col(alpha = 0.8) +
  facet_wrap(vars(component), scales = "free_y") +
  labs(x = "Contribution to PLS component", y = NULL, fill = "Positive?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UtatbsVP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UtatbsVP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spotify popularity feature, which like we said before is not really an audio feature, is now a big contributor to the first couple of components.&lt;/p&gt;

&lt;h3&gt;
  
  
  UMAP
&lt;/h3&gt;

&lt;p&gt;Uniform manifold approximation and projection (UMAP) is another dimensionality reduction technique, but it works very differently from either PCA or PLS. It is not linear; instead, it starts by finding nearest neighbors for the observations in the high-dimensional space, builds a graph network from them, and then creates a new lower-dimensional space based on that graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(embed)

rec_trained %&amp;gt;%
  step_umap(all_numeric_predictors(), num_comp = 4) %&amp;gt;%
  plot_test_results() +
  ggtitle("UMAP")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yjaQIK2u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yjaQIK2u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/billboard-100/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;UMAP is very good at making little clusters in the new reduced space, but notice that in our case they aren’t very connected to weeks on the chart. UMAP results can seem very appealing, but it’s good to understand how arbitrary some of the structure we see here is, and, more generally, &lt;a href="https://twitter.com/lpachter/status/1431325969411821572"&gt;this algorithm’s limitations&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Fit and predict with tidymodels for bird baths in Australia 🇦🇺</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Wed, 01 Sep 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/fit-and-predict-with-tidymodels-for-bird-baths-in-australia-1fo0</link>
      <guid>https://dev.to/juliasilge/fit-and-predict-with-tidymodels-for-bird-baths-in-australia-1fo0</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast is good for folks who are newer to modeling or tidymodels; it focuses on how to use feature engineering together with a model algorithm and how to fit and predict, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on bird baths in Australia. 🐦&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/NXot3Q0QtGk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict whether we’ll see a bird at a &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-31/readme.md"&gt;bird bath in Australia&lt;/a&gt;, given info like what kind of bird we’re looking for and whether the bird bath is in an urban or rural location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)

bird_baths &amp;lt;- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-31/bird_baths.csv")

bird_baths %&amp;gt;%
  count(urban_rural)


## # A tibble: 3 × 2
## urban_rural n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Rural 49686
## 2 Urban 111202
## 3 &amp;lt;NA&amp;gt; 169

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that there are some summary rows in the dataset with &lt;code&gt;NA&lt;/code&gt; values for &lt;code&gt;urban_rural&lt;/code&gt;, &lt;code&gt;survey_year&lt;/code&gt;, etc. We can use that to choose some top bird types to focus on, instead of all the many bird types included in this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top_birds &amp;lt;-
  bird_baths %&amp;gt;%
  filter(is.na(urban_rural)) %&amp;gt;%
  arrange(-bird_count) %&amp;gt;%
  slice_max(bird_count, n = 15) %&amp;gt;%
  pull(bird_type)

top_birds


## [1] "Noisy Miner" "Australian Magpie" "Rainbow Lorikeet"  
## [4] "Red Wattlebird" "Superb Fairy-wren" "Magpie-lark"       
## [7] "Pied Currawong" "Crimson Rosella" "Eastern Spinebill" 
## [10] "Spotted Dove" "Lewin's Honeyeater" "Satin Bowerbird"   
## [13] "Crested Pigeon" "Grey Fantail" "Red-browed Finch"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How likely were the citizen scientists who collected this data to see birds of different types, in different locations?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_parsed &amp;lt;-
  bird_baths %&amp;gt;%
  filter(
    !is.na(urban_rural),
    bird_type %in% top_birds
  ) %&amp;gt;%
  group_by(urban_rural, bird_type) %&amp;gt;%
  summarise(bird_count = mean(bird_count), .groups = "drop")

p1 &amp;lt;-
  bird_parsed %&amp;gt;%
  ggplot(aes(bird_count, bird_type)) +
  geom_segment(
    data = bird_parsed %&amp;gt;%
      pivot_wider(
        names_from = urban_rural,
        values_from = bird_count
      ),
    aes(x = Rural, xend = Urban, y = bird_type, yend = bird_type),
    alpha = 0.7, color = "gray70", size = 1.5
  ) +
  geom_point(aes(color = urban_rural), size = 3) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Probability of seeing bird", y = NULL, color = NULL)

p1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4a5u87hF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4a5u87hF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Superb fairy-wrens are more rural, while noisy miners are more urban.&lt;/p&gt;

&lt;p&gt;Let’s build a model to predict this probability of seeing a bird using just these two predictors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_df &amp;lt;-
  bird_baths %&amp;gt;%
  filter(
    !is.na(urban_rural),
    bird_type %in% top_birds
  ) %&amp;gt;%
  mutate(bird_count = if_else(bird_count &amp;gt; 0, "bird", "no bird")) %&amp;gt;%
  mutate_if(is.character, as.factor)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Build a first model&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We are going to use a simple logistic regression model that is unlikely to overfit, but let’s still split our data into training and testing, and then create resampling folds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
bird_split &amp;lt;- initial_split(bird_df, strata = bird_count)
bird_train &amp;lt;- training(bird_split)
bird_test &amp;lt;- testing(bird_split)

set.seed(234)
bird_folds &amp;lt;- vfold_cv(bird_train, strata = bird_count)
bird_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [9637/1072]&amp;gt; Fold01
## 2 &amp;lt;split [9638/1071]&amp;gt; Fold02
## 3 &amp;lt;split [9638/1071]&amp;gt; Fold03
## 4 &amp;lt;split [9638/1071]&amp;gt; Fold04
## 5 &amp;lt;split [9638/1071]&amp;gt; Fold05
## 6 &amp;lt;split [9638/1071]&amp;gt; Fold06
## 7 &amp;lt;split [9638/1071]&amp;gt; Fold07
## 8 &amp;lt;split [9638/1071]&amp;gt; Fold08
## 9 &amp;lt;split [9639/1070]&amp;gt; Fold09
## 10 &amp;lt;split [9639/1070]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll make a couple of attempts at fitting models here, but they will all use straightforward logistic regression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;glm_spec &amp;lt;- logistic_reg()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this first model, let’s set up our feature engineering recipe with our &lt;strong&gt;outcome&lt;/strong&gt; and two &lt;strong&gt;predictors&lt;/strong&gt;, and begin with only one preprocessing step to transform our nominal (factor or character, like &lt;code&gt;urban_rural&lt;/code&gt; and &lt;code&gt;bird_type&lt;/code&gt;) predictors to &lt;a href="https://www.tmwr.org/recipes.html#dummies"&gt;dummy or indicator variables&lt;/a&gt;. Then let’s put our preprocessing recipe together with our model specification in a workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_basic &amp;lt;-
  recipe(bird_count ~ urban_rural + bird_type, data = bird_train) %&amp;gt;%
  step_dummy(all_nominal_predictors())

wf_basic &amp;lt;- workflow(rec_basic, glm_spec)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could fit this one time to the training data, but to get better estimates of performance, let’s fit 10 times to our 10 resampling folds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()
ctrl_preds &amp;lt;- control_resamples(save_pred = TRUE)
rs_basic &amp;lt;- fit_resamples(wf_basic, bird_folds, control = ctrl_preds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did this turn out? If we look at some overall metrics, accuracy does not look so bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(rs_basic)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.822 10 0.0000762 Preprocessor1_Model1
## 2 roc_auc binary 0.601 10 0.00783 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
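&lt;p&gt;As a quick sanity check (a sketch, not part of the original analysis), we can look at how imbalanced the outcome is in the training data; a model that always predicts the majority class would score about this same accuracy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## proportion of each outcome class in the training set
bird_train %&amp;gt;%
  count(bird_count) %&amp;gt;%
  mutate(prop = n / sum(n))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;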



&lt;p&gt;This is because there were not many birds overall, though! The model is just saying “no bird” everywhere and getting good accuracy. The ROC curve, on the other hand, looks not so great.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;augment(rs_basic) %&amp;gt;%
  roc_curve(bird_count, .pred_bird) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MKQKfTXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-11-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MKQKfTXv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-11-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Add interactions&lt;/h2&gt;

&lt;p&gt;We know from the plot we made during EDA that there are interactions between whether a bird bath is urban/rural and what kinds of birds we see there; we could model these interactions either with a model type that can handle it natively (like trees) or with explicit interaction terms like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_interact &amp;lt;-
  rec_basic %&amp;gt;%
  step_interact(~ starts_with("urban_rural"):starts_with("bird_type"))

wf_interact &amp;lt;- workflow(rec_interact, glm_spec)
rs_interact &amp;lt;- fit_resamples(wf_interact, bird_folds, control = ctrl_preds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did &lt;em&gt;this&lt;/em&gt; do, our same logistic regression model specification but now with interactions?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(rs_interact)


## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.822 10 0.0000762 Preprocessor1_Model1
## 2 roc_auc binary 0.669 10 0.00660 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The accuracy is about the same (since the model is always predicting “no bird”) but the probabilities look better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;augment(rs_interact) %&amp;gt;%
  roc_curve(bird_count, .pred_bird) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LgMrQmh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LgMrQmh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
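&lt;p&gt;We can also put the resampled ROC AUC estimates for the two models side by side. A sketch, assuming both sets of resampling results (&lt;code&gt;rs_basic&lt;/code&gt; and &lt;code&gt;rs_interact&lt;/code&gt;) are still in memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## compare ROC AUC with and without interaction terms
bind_rows(
  collect_metrics(rs_basic) %&amp;gt;% mutate(model = "basic"),
  collect_metrics(rs_interact) %&amp;gt;% mutate(model = "interactions")
) %&amp;gt;%
  filter(.metric == "roc_auc") %&amp;gt;%
  select(model, mean, std_err)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;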

&lt;h2&gt;Evaluate model on new data&lt;/h2&gt;

&lt;p&gt;Let’s stick with this model, logistic regression together with interactions between urban/rural and bird type. We can fit the model one time to the entire training set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_fit &amp;lt;- fit(wf_interact, bird_train)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now this trained model is ready to be applied to new data. For example, we can predict on the test set to get class probabilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predict(bird_fit, bird_test, type = "prob")


## # A tibble: 3,571 × 2
## .pred_bird `.pred_no bird`
## &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 0.213 0.787
## 2 0.123 0.877
## 3 0.141 0.859
## 4 0.283 0.717
## 5 0.119 0.881
## 6 0.252 0.748
## 7 0.0380 0.962
## 8 0.123 0.877
## 9 0.129 0.871
## 10 0.119 0.881
## # … with 3,561 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In fact, we can predict on any kind of new data that has the right input variables. Let’s make some ourselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_bird_data &amp;lt;-
  tibble(bird_type = top_birds) %&amp;gt;%
  crossing(urban_rural = c("Urban", "Rural"))

new_bird_data


## # A tibble: 30 × 2
## bird_type urban_rural
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;      
## 1 Australian Magpie Rural      
## 2 Australian Magpie Urban      
## 3 Crested Pigeon Rural      
## 4 Crested Pigeon Urban      
## 5 Crimson Rosella Rural      
## 6 Crimson Rosella Urban      
## 7 Eastern Spinebill Rural      
## 8 Eastern Spinebill Urban      
## 9 Grey Fantail Rural      
## 10 Grey Fantail Urban      
## # … with 20 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use a &lt;a href="https://parsnip.tidymodels.org/reference/augment.html"&gt;helpful function like &lt;code&gt;augment()&lt;/code&gt;&lt;/a&gt; to take this new data and “augment” it with predicted probabilities and class predictions, and we can &lt;a href="https://parsnip.tidymodels.org/reference/predict.model_fit.html"&gt;use &lt;code&gt;predict()&lt;/code&gt; with specific &lt;code&gt;type&lt;/code&gt; arguments&lt;/a&gt; to return specialized predictions like confidence intervals. Let’s bind these together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bird_preds &amp;lt;-
  augment(bird_fit, new_bird_data) %&amp;gt;%
  bind_cols(
    predict(bird_fit, new_bird_data, type = "conf_int")
  )

bird_preds


## # A tibble: 30 × 9
## bird_type urban_rural .pred_class .pred_bird `.pred_no bird` .pred_lower_bird
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Australi… Rural no bird 0.245 0.755 0.193 
## 2 Australi… Urban no bird 0.287 0.713 0.249 
## 3 Crested … Rural no bird 0.0826 0.917 0.0526
## 4 Crested … Urban no bird 0.141 0.859 0.113 
## 5 Crimson … Rural no bird 0.215 0.785 0.166 
## 6 Crimson … Urban no bird 0.123 0.877 0.0969
## 7 Eastern … Rural no bird 0.283 0.717 0.227 
## 8 Eastern … Urban no bird 0.0973 0.903 0.0736
## 9 Grey Fan… Rural no bird 0.254 0.746 0.200 
## 10 Grey Fan… Urban no bird 0.0614 0.939 0.0435
## # … with 20 more rows, and 3 more variables: .pred_upper_bird &amp;lt;dbl&amp;gt;,
## # .pred_lower_no bird &amp;lt;dbl&amp;gt;, .pred_upper_no bird &amp;lt;dbl&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s visualize these predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p2 &amp;lt;-
  bird_preds %&amp;gt;%
  ggplot(aes(.pred_bird, bird_type, color = urban_rural)) +
  geom_errorbar(aes(
    xmin = .pred_lower_bird,
    xmax = .pred_upper_bird
  ),
  width = .2, size = 1.2, alpha = 0.5
  ) +
  geom_point(size = 2.5) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Predicted probability of seeing bird", y = NULL, color = NULL)

p2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X5JvQ-lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-19-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X5JvQ-lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-19-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Actually, let’s put this together with our earlier plot!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(patchwork)

p1 + p2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jqPH5rgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-20-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jqPH5rgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/bird-baths/index_files/figure-html/unnamed-chunk-20-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>rstats</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Modeling human/computer interactions on Star Trek 🖖</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Tue, 24 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/modeling-human-computer-interactions-on-star-trek-4l6b</link>
      <guid>https://dev.to/juliasilge/modeling-human-computer-interactions-on-star-trek-4l6b</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast is on a more advanced topic, how to evaluate multiple combinations of feature engineering and modeling approaches via &lt;a href="https://workflowsets.tidymodels.org/"&gt;workflowsets&lt;/a&gt;, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Star Trek human/computer interactions.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_gVHRqz8GIE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;Explore data&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict which &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-08-17/readme.md"&gt;computer interactions from Star Trek&lt;/a&gt; were spoken by a person and which were spoken by the computer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
computer_raw &amp;lt;- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-17/computer.csv")

computer_raw %&amp;gt;%
  distinct(value_id, .keep_all = TRUE) %&amp;gt;%
  count(char_type)


## # A tibble: 2 × 2
## char_type n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 Computer 178
## 2 Person 234

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which words are more likely to be spoken by a computer vs. by a person?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)
library(tidylo)

computer_counts &amp;lt;-
  computer_raw %&amp;gt;%
  distinct(value_id, .keep_all = TRUE) %&amp;gt;%
  unnest_tokens(word, interaction) %&amp;gt;%
  count(char_type, word, sort = TRUE)

computer_counts %&amp;gt;%
  bind_log_odds(char_type, word, n) %&amp;gt;%
  filter(n &amp;gt; 10) %&amp;gt;%
  group_by(char_type) %&amp;gt;%
  slice_max(log_odds_weighted, n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(log_odds_weighted,
    fct_reorder(word, log_odds_weighted),
    fill = char_type
  )) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(vars(char_type), scales = "free_y") +
  labs(y = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0bT84wFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0bT84wFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that stop words are among the words with highest weighted log odds; they are very informative in this situation.&lt;/p&gt;

&lt;h2&gt;Build and compare models&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” This is a &lt;em&gt;very&lt;/em&gt; small dataset so we won’t expect to see amazing results from our model, but it is fun and a nice way to demonstrate some of these concepts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)

comp_split &amp;lt;-
  computer_raw %&amp;gt;%
  distinct(value_id, .keep_all = TRUE) %&amp;gt;%
  select(char_type, interaction) %&amp;gt;%
  initial_split(prop = 0.8, strata = char_type)

comp_train &amp;lt;- training(comp_split)
comp_test &amp;lt;- testing(comp_split)

set.seed(234)
comp_folds &amp;lt;- bootstraps(comp_train, strata = char_type)
comp_folds


## # Bootstrap sampling using stratification 
## # A tibble: 25 × 2
## splits id         
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;      
## 1 &amp;lt;split [329/118]&amp;gt; Bootstrap01
## 2 &amp;lt;split [329/128]&amp;gt; Bootstrap02
## 3 &amp;lt;split [329/134]&amp;gt; Bootstrap03
## 4 &amp;lt;split [329/124]&amp;gt; Bootstrap04
## 5 &amp;lt;split [329/118]&amp;gt; Bootstrap05
## 6 &amp;lt;split [329/116]&amp;gt; Bootstrap06
## 7 &amp;lt;split [329/106]&amp;gt; Bootstrap07
## 8 &amp;lt;split [329/124]&amp;gt; Bootstrap08
## 9 &amp;lt;split [329/121]&amp;gt; Bootstrap09
## 10 &amp;lt;split [329/121]&amp;gt; Bootstrap10
## # … with 15 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it comes to feature engineering, we don’t know ahead of time if we should remove stop words, or center and scale the predictors, or balance the classes. Let’s create feature engineering recipes that do &lt;em&gt;all&lt;/em&gt; of these things so we can compare how they perform.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(textrecipes)
library(themis)

rec_all &amp;lt;-
  recipe(char_type ~ interaction, data = comp_train) %&amp;gt;%
  step_tokenize(interaction) %&amp;gt;%
  step_tokenfilter(interaction, max_tokens = 80) %&amp;gt;%
  step_tfidf(interaction)

rec_all_norm &amp;lt;-
  rec_all %&amp;gt;%
  step_normalize(all_predictors())

rec_all_smote &amp;lt;-
  rec_all_norm %&amp;gt;%
  step_smote(char_type)

## we can `prep()` just to check if it works
prep(rec_all_smote)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 1
## 
## Training data contained 329 data points and no missing data.
## 
## Operations:
## 
## Tokenization for interaction [trained]
## Text filtering for interaction [trained]
## Term frequency-inverse document frequency with interaction [trained]
## Centering and scaling for tfidf_interaction_a, ... [trained]
## SMOTE based on char_type [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s build the same set of recipes, this time removing stop words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rec_stop &amp;lt;-
  recipe(char_type ~ interaction, data = comp_train) %&amp;gt;%
  step_tokenize(interaction) %&amp;gt;%
  step_stopwords(interaction) %&amp;gt;%
  step_tokenfilter(interaction, max_tokens = 80) %&amp;gt;%
  step_tfidf(interaction)

rec_stop_norm &amp;lt;-
  rec_stop %&amp;gt;%
  step_normalize(all_predictors())

rec_stop_smote &amp;lt;-
  rec_stop_norm %&amp;gt;%
  step_smote(char_type)

## again, let's check it
prep(rec_stop_smote)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 1
## 
## Training data contained 329 data points and no missing data.
## 
## Operations:
## 
## Tokenization for interaction [trained]
## Stop word removal for interaction [trained]
## Text filtering for interaction [trained]
## Term frequency-inverse document frequency with interaction [trained]
## Centering and scaling for 80 items [trained]
## SMOTE based on char_type [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s try out two kinds of models that often work well for text data, a support vector machine and a naive Bayes model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(discrim)

nb_spec &amp;lt;-
  naive_Bayes() %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("naivebayes")

nb_spec


## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes


svm_spec &amp;lt;-
  svm_linear() %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("LiblineaR")

svm_spec


## Linear Support Vector Machine Specification (classification)
## 
## Computational engine: LiblineaR

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can put all these together in a &lt;a href="https://workflowsets.tidymodels.org/"&gt;workflowset&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comp_models &amp;lt;-
  workflow_set(
    preproc = list(
      all = rec_all,
      all_norm = rec_all_norm,
      all_smote = rec_all_smote,
      stop = rec_stop,
      stop_norm = rec_stop_norm,
      stop_smote = rec_stop_smote
    ),
    models = list(nb = nb_spec, svm = svm_spec),
    cross = TRUE
  )

comp_models


## # A workflow set/tibble: 12 × 4
## wflow_id info option result    
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;    
## 1 all_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 2 all_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 3 all_norm_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 4 all_norm_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 5 all_smote_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 6 all_smote_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 7 stop_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 8 stop_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 9 stop_norm_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 10 stop_norm_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 11 stop_smote_nb &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;
## 12 stop_smote_svm &amp;lt;tibble [1 × 4]&amp;gt; &amp;lt;opts[0]&amp;gt; &amp;lt;list [0]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of these models have any tuning parameters, so next let’s use &lt;code&gt;fit_resamples()&lt;/code&gt; to evaluate how each of these combinations of feature engineering recipes and model specifications performs, using our bootstrap resamples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set.seed(123)
doParallel::registerDoParallel()

computer_rs &amp;lt;-
  comp_models %&amp;gt;%
  workflow_map(
    "fit_resamples",
    resamples = comp_folds,
    metrics = metric_set(accuracy, sensitivity, specificity)
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can make a quick high-level visualization of these results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(computer_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ImBt4LDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ImBt4LDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of the SVMs did better than all of the naive Bayes models, at least as far as overall accuracy. We can also dig deeper and explore the results more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rank_results(computer_rs) %&amp;gt;%
  filter(.metric == "accuracy")


## # A tibble: 12 × 9
## wflow_id .config .metric mean std_err n preprocessor model rank
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 all_svm Preprocess… accuracy 0.679 0.00655 25 recipe svm_l… 1
## 2 all_norm_… Preprocess… accuracy 0.658 0.00756 25 recipe svm_l… 2
## 3 stop_svm Preprocess… accuracy 0.652 0.00700 25 recipe svm_l… 3
## 4 all_smote… Preprocess… accuracy 0.650 0.00611 25 recipe svm_l… 4
## 5 stop_norm… Preprocess… accuracy 0.646 0.00753 25 recipe svm_l… 5
## 6 stop_smot… Preprocess… accuracy 0.632 0.00914 25 recipe svm_l… 6
## 7 all_norm_… Preprocess… accuracy 0.589 0.00678 25 recipe naive… 7
## 8 all_smote… Preprocess… accuracy 0.575 0.0115 25 recipe naive… 8
## 9 stop_smot… Preprocess… accuracy 0.573 0.00971 25 recipe naive… 9
## 10 stop_norm… Preprocess… accuracy 0.571 0.00950 25 recipe naive… 10
## 11 all_nb Preprocess… accuracy 0.570 0.0102 25 recipe naive… 11
## 12 stop_nb Preprocess… accuracy 0.559 0.0120 25 recipe naive… 12

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some interesting things to note are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how balancing the classes via SMOTE does in fact change sensitivity and specificity the way we would expect&lt;/li&gt;
&lt;li&gt;that removing stop words looks like mostly a &lt;strong&gt;bad&lt;/strong&gt; idea!&lt;/li&gt;
&lt;/ul&gt;
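&lt;p&gt;To see those patterns in the numbers rather than the plot, we can rank the workflows by the other metrics as well. A sketch, assuming the same &lt;code&gt;computer_rs&lt;/code&gt; object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## compare sensitivity and specificity across all 12 workflows
rank_results(computer_rs) %&amp;gt;%
  filter(.metric != "accuracy") %&amp;gt;%
  select(wflow_id, .metric, mean)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;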

&lt;h2&gt;Train and evaluate final model&lt;/h2&gt;

&lt;p&gt;Let’s say that we want to keep overall accuracy high, so we pick &lt;code&gt;rec_all&lt;/code&gt; and &lt;code&gt;svm_spec&lt;/code&gt;. We can use &lt;code&gt;last_fit()&lt;/code&gt; to &lt;strong&gt;fit&lt;/strong&gt; one time to all the training data and &lt;strong&gt;evaluate&lt;/strong&gt; one time on the testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comp_wf &amp;lt;- workflow(rec_all, svm_spec)

comp_fitted &amp;lt;-
  last_fit(
    comp_wf,
    comp_split,
    metrics = metric_set(accuracy, sensitivity, specificity)
  )

comp_fitted


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [329/83]&amp;gt; train/test split &amp;lt;tibble [… &amp;lt;tibble … &amp;lt;tibble [83 … &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did that turn out?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(comp_fitted)


## # A tibble: 3 × 4
## .metric .estimator .estimate .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy binary 0.735 Preprocessor1_Model1
## 2 sens binary 0.611 Preprocessor1_Model1
## 3 spec binary 0.830 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also look at the predictions, and for example make a confusion matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(comp_fitted) %&amp;gt;%
  conf_mat(char_type, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Nre7XPu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Nre7XPu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was easier to identify people talking to computers than the other way around.&lt;/p&gt;

&lt;p&gt;Since this is a linear model, we can also look at the coefficients for words in the model, perhaps focusing on the terms with the largest effect sizes in each direction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract_workflow(comp_fitted) %&amp;gt;%
  tidy() %&amp;gt;%
  group_by(estimate &amp;gt; 0) %&amp;gt;%
  slice_max(abs(estimate), n = 10) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(term = str_remove(term, "tfidf_interaction_")) %&amp;gt;%
  ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate &amp;gt; 0)) +
  geom_col(alpha = 0.8) +
  scale_fill_discrete(labels = c("people", "computer")) +
  labs(y = NULL, fill = "More from...")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jKEzUHVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jKEzUHVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/star-trek/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Predict housing prices 🏠 in Austin TX with xgboost</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Sun, 15 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predict-housing-prices-in-austin-tx-with-xgboost-54jb</link>
      <guid>https://dev.to/juliasilge/predict-housing-prices-in-austin-tx-with-xgboost-54jb</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. My screencasts lately have focused on xgboost as I have participated in &lt;a href="https://www.notion.so/SLICED-Show-c7bd26356e3a42279e2dfbafb0480073"&gt;SLICED&lt;/a&gt;, a competitive data science streaming show. This past week were the semifinals, where we competed to predict prices of homes in Austin, TX. 🏠 One of the more interesting available variables for this dataset was the text description of the real estate listings, so let’s walk through one way to incorporate text information with boosted tree modeling.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/1LEW8APSOJo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict &lt;a href="https://www.kaggle.com/c/sliced-s01e11-semifinals/"&gt;the price (binned) for homes in Austin, TX&lt;/a&gt; given features about the real estate listing. This is a multiclass classification challenge, where we needed to submit a probability for each home being in each &lt;code&gt;priceRange&lt;/code&gt; bin. The main data set provided is in a CSV file called &lt;code&gt;train.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
train_raw &amp;lt;- read_csv("train.csv")

train_raw %&amp;gt;%
  count(priceRange)


## # A tibble: 5 × 2
## priceRange n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 0-250000 1249
## 2 250000-350000 2356
## 3 350000-450000 2301
## 4 450000-650000 2275
## 5 650000+ 1819

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can watch &lt;a href="https://www.twitch.tv/videos/1114553508"&gt;this week’s full episode of SLICED&lt;/a&gt; to see lots of exploratory data analysis and visualization of this dataset, but let’s just make a few data visualizations for context in this blog post.&lt;/p&gt;

&lt;p&gt;How is price distributed across Austin?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;price_plot &amp;lt;-
  train_raw %&amp;gt;%
  mutate(priceRange = parse_number(priceRange)) %&amp;gt;%
  ggplot(aes(longitude, latitude, z = priceRange)) +
  stat_summary_hex(alpha = 0.8, bins = 50) +
  scale_fill_viridis_c() +
  labs(
    fill = "mean",
    title = "Price"
  )

price_plot

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pWqm1qcT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pWqm1qcT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s compare this distribution to some other variables available in the dataset. We can create a little plotting function &lt;a href="https://dplyr.tidyverse.org/articles/programming.html#indirection"&gt;using &lt;code&gt;{{}}&lt;/code&gt;&lt;/a&gt; to iterate through several variables quickly, and put the plots together with &lt;a href="https://patchwork.data-imaginist.com/"&gt;patchwork&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(patchwork)

plot_austin &amp;lt;- function(var, title) {
  train_raw %&amp;gt;%
    ggplot(aes(longitude, latitude, z = {{ var }})) +
    stat_summary_hex(alpha = 0.8, bins = 50) +
    scale_fill_viridis_c() +
    labs(
      fill = "mean",
      title = title
    )
}

(price_plot + plot_austin(avgSchoolRating, "School rating")) /
  (plot_austin(yearBuilt, "Year built") + plot_austin(log(lotSizeSqFt), "Lot size (log)"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gRoHjMsG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gRoHjMsG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the east/west gradients as well as the radial changes. I went to grad school in Austin and this all looks very familiar to me!&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding words related to price
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;description&lt;/code&gt; variable contains text from each real estate listing. We could try to use the text features directly in modeling, &lt;a href="https://smltar.com/"&gt;as described in our book&lt;/a&gt;, but I’ve found that often isn’t great for boosted tree models (which tend to work best overall in an environment like SLICED). Let’s walk through another option that may work better in some situations: use a separate analysis to identify important words and then create dummy variables indicating whether a given listing contains those words.&lt;/p&gt;

&lt;p&gt;Let’s start by tidying the &lt;code&gt;description&lt;/code&gt; text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidytext)

austin_tidy &amp;lt;-
  train_raw %&amp;gt;%
  mutate(priceRange = parse_number(priceRange) + 100000) %&amp;gt;%
  unnest_tokens(word, description) %&amp;gt;%
  anti_join(get_stopwords())

austin_tidy %&amp;gt;%
  count(word, sort = TRUE)


## # A tibble: 17,944 × 2
## word n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 home 11620
## 2 kitchen 5721
## 3 room 5494
## 4 austin 4918
## 5 new 4772
## 6 large 4771
## 7 2 4585
## 8 bedrooms 4571
## 9 contains 4413
## 10 3 4386
## # … with 17,934 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s compute word frequencies per price range for the top 100 words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;top_words &amp;lt;-
  austin_tidy %&amp;gt;%
  count(word, sort = TRUE) %&amp;gt;%
  filter(!word %in% as.character(1:5)) %&amp;gt;%
  slice_max(n, n = 100) %&amp;gt;%
  pull(word)

word_freqs &amp;lt;-
  austin_tidy %&amp;gt;%
  count(word, priceRange) %&amp;gt;%
  complete(word, priceRange, fill = list(n = 0)) %&amp;gt;%
  group_by(priceRange) %&amp;gt;%
  mutate(
    price_total = sum(n),
    proportion = n / price_total
  ) %&amp;gt;%
  ungroup() %&amp;gt;%
  filter(word %in% top_words)

word_freqs


## # A tibble: 500 × 5
## word priceRange n price_total proportion
## &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 access 100000 180 56290 0.00320
## 2 access 350000 365 114853 0.00318
## 3 access 450000 322 116678 0.00276
## 4 access 550000 294 125585 0.00234
## 5 access 750000 248 112073 0.00221
## 6 appliances 100000 209 56290 0.00371
## 7 appliances 350000 583 114853 0.00508
## 8 appliances 450000 576 116678 0.00494
## 9 appliances 550000 567 125585 0.00451
## 10 appliances 750000 391 112073 0.00349
## # … with 490 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s use modeling to find the words that are &lt;strong&gt;increasing&lt;/strong&gt; with price and those that are &lt;strong&gt;decreasing&lt;/strong&gt; with price.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_mods &amp;lt;-
  word_freqs %&amp;gt;%
  nest(data = c(priceRange, n, price_total, proportion)) %&amp;gt;%
  mutate(
    model = map(data, ~ glm(cbind(n, price_total) ~ priceRange, ., family = "binomial")),
    model = map(model, tidy)
  ) %&amp;gt;%
  unnest(model) %&amp;gt;%
  filter(term == "priceRange") %&amp;gt;%
  mutate(p.value = p.adjust(p.value)) %&amp;gt;%
  arrange(-estimate)

word_mods


## # A tibble: 100 × 7
## word data term estimate std.error statistic p.value
## &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 outdoor &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000325 1.85e-7 17.6 4.37e-67
## 2 custom &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000214 1.47e-7 14.6 3.98e-46
## 3 pool &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000159 1.22e-7 13.0 6.12e-37
## 4 office &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000150 1.46e-7 10.3 6.03e-23
## 5 suite &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.00000143 1.39e-7 10.3 4.03e-23
## 6 gorgeous &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000975 1.62e-7 6.02 1.19e- 7
## 7 w &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000920 9.05e-8 10.2 2.33e-22
## 8 windows &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000890 1.28e-7 6.95 2.81e-10
## 9 private &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000889 1.15e-7 7.70 1.08e-12
## 10 car &amp;lt;tibble [5 × 4]&amp;gt; priceRange 0.000000778 1.66e-7 4.69 1.52e- 4
## # … with 90 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s make something like a &lt;a href="https://en.wikipedia.org/wiki/Volcano_plot_%28statistics%29"&gt;volcano plot&lt;/a&gt; to see the relationship between p-value and effect size for these words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(ggrepel)

word_mods %&amp;gt;%
  ggplot(aes(estimate, p.value)) +
  geom_vline(xintercept = 0, lty = 2, alpha = 0.7, color = "gray50") +
  geom_point(color = "midnightblue", alpha = 0.8, size = 2.5) +
  scale_y_log10() +
  geom_text_repel(aes(label = word), family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q2peQP2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-8-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q2peQP2T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-8-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Words like outdoor, custom, pool, suite, office &lt;strong&gt;increase&lt;/strong&gt; with price.&lt;/li&gt;
&lt;li&gt;Words like new, paint, carpet, great, tile, close, flooring &lt;strong&gt;decrease&lt;/strong&gt; with price.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the words that we’d like to try to detect and use in feature engineering for our xgboost model, rather than using all the text tokens as features individually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;higher_words &amp;lt;-
  word_mods %&amp;gt;%
  filter(p.value &amp;lt; 0.05) %&amp;gt;%
  slice_max(estimate, n = 12) %&amp;gt;%
  pull(word)

lower_words &amp;lt;-
  word_mods %&amp;gt;%
  filter(p.value &amp;lt; 0.05) %&amp;gt;%
  slice_max(-estimate, n = 12) %&amp;gt;%
  pull(word)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can look at these changes with price directly. For example, these are the words most associated with price decrease.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_freqs %&amp;gt;%
  filter(word %in% lower_words) %&amp;gt;%
  ggplot(aes(priceRange, proportion, color = word)) +
  geom_line(size = 2.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = NULL, y = "proportion of total words used for homes at that price") +
  theme_light(base_family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NCPEqkw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NCPEqkw5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cheaper houses are described as “great” while expensive houses are not, and apparently you don’t need to mention the location (“close,” “minutes,” “location”) of more expensive houses.&lt;/p&gt;
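&lt;p&gt;For symmetry, we can make the same plot for the words most associated with price &lt;strong&gt;increase&lt;/strong&gt; just by swapping in &lt;code&gt;higher_words&lt;/code&gt;. This variation isn’t in the video, but it reuses the objects defined above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## not shown in the video: the same plot for the high-price words
word_freqs %&amp;gt;%
  filter(word %in% higher_words) %&amp;gt;%
  ggplot(aes(priceRange, proportion, color = word)) +
  geom_line(size = 2.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(word), scales = "free_y") +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  labs(x = NULL, y = "proportion of total words used for homes at that price")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
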

&lt;h2&gt;
  
  
  Build a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget,” as well as the metrics (this challenge was evaluated on multiclass log loss).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
austin_split &amp;lt;- train_raw %&amp;gt;%
  select(-city) %&amp;gt;%
  mutate(description = str_to_lower(description)) %&amp;gt;%
  initial_split(strata = priceRange)
austin_train &amp;lt;- training(austin_split)
austin_test &amp;lt;- testing(austin_split)
austin_metrics &amp;lt;- metric_set(accuracy, roc_auc, mn_log_loss)

set.seed(234)
austin_folds &amp;lt;- vfold_cv(austin_train, v = 5, strata = priceRange)
austin_folds


## # 5-fold cross-validation using stratification 
## # A tibble: 5 × 2
## splits id   
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt;
## 1 &amp;lt;split [5996/1502]&amp;gt; Fold1
## 2 &amp;lt;split [5998/1500]&amp;gt; Fold2
## 3 &amp;lt;split [5999/1499]&amp;gt; Fold3
## 4 &amp;lt;split [5999/1499]&amp;gt; Fold4
## 5 &amp;lt;split [6000/1498]&amp;gt; Fold5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For feature engineering, let’s use basically everything in the dataset (aside from &lt;code&gt;city&lt;/code&gt;, which was not a very useful variable) and &lt;a href="https://recipes.tidymodels.org/reference/step_regex.html"&gt;create dummy or indicator variables using &lt;code&gt;step_regex()&lt;/code&gt;&lt;/a&gt;. The idea here is to detect the words associated with low or high price in each listing and create a yes/no variable indicating their presence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;higher_pat &amp;lt;- glue::glue_collapse(higher_words, sep = "|")
lower_pat &amp;lt;- glue::glue_collapse(lower_words, sep = "|")

austin_rec &amp;lt;-
  recipe(priceRange ~ ., data = austin_train) %&amp;gt;%
  update_role(uid, new_role = "uid") %&amp;gt;%
  step_regex(description, pattern = higher_pat, result = "high_price_words") %&amp;gt;%
  step_regex(description, pattern = lower_pat, result = "low_price_words") %&amp;gt;%
  step_rm(description) %&amp;gt;%
  step_novel(homeType) %&amp;gt;%
  step_unknown(homeType) %&amp;gt;%
  step_other(homeType, threshold = 0.02) %&amp;gt;%
  step_dummy(all_nominal_predictors()) %&amp;gt;%
  step_nzv(all_predictors())

austin_rec


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 13
## uid 1
## 
## Operations:
## 
## Regular expression dummy variable using `outdoor|custom|pool|office|suite|gorgeous|w|windows|private|car|high|full`
## Regular expression dummy variable using `carpet|paint|close|flooring|shopping|new|easy|minutes|tile|great|community|location`
## Delete terms description
## Novel factor level assignment for homeType
## Unknown factor level assignment for homeType
## Collapsing factor levels for homeType
## Dummy variables from all_nominal_predictors()
## Sparse, unbalanced variable filter on all_predictors()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s create a tunable xgboost model specification, tuning a lot of the important model hyperparameters, and combine it with our feature engineering recipe in a &lt;code&gt;workflow()&lt;/code&gt;. We can also create a custom &lt;code&gt;xgb_grid&lt;/code&gt; to specify which parameter values to try, like a not-too-small learning rate and avoiding tree stubs. I chose this parameter grid to get reasonable performance in a reasonable amount of tuning time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_spec &amp;lt;-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    mtry = tune(),
    sample_size = tune(),
    learn_rate = tune()
  ) %&amp;gt;%
  set_engine("xgboost") %&amp;gt;%
  set_mode("classification")

xgb_word_wf &amp;lt;- workflow(austin_rec, xgb_spec)

set.seed(123)
xgb_grid &amp;lt;-
  grid_max_entropy(
    tree_depth(c(5L, 10L)),
    min_n(c(10L, 40L)),
    mtry(c(5L, 10L)),
    sample_prop(c(0.5, 1.0)),
    learn_rate(c(-2, -1)),
    size = 20
  )

xgb_grid


## # A tibble: 20 × 5
## tree_depth min_n mtry sample_size learn_rate
## &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 7 33 8 0.768 0.0845
## 2 10 33 7 0.928 0.0784
## 3 5 21 6 0.626 0.0868
## 4 9 31 8 0.728 0.0162
## 5 8 35 5 0.666 0.0937
## 6 6 21 5 0.907 0.0105
## 7 6 27 6 0.982 0.0729
## 8 7 33 8 0.936 0.0102
## 9 7 15 5 0.559 0.0182
## 10 6 35 9 0.784 0.0347
## 11 9 39 9 0.737 0.0582
## 12 8 17 8 0.596 0.0818
## 13 9 21 7 0.601 0.0136
## 14 7 15 7 0.763 0.0197
## 15 6 12 10 0.800 0.0569
## 16 9 19 9 0.589 0.0138
## 17 10 14 5 0.829 0.0140
## 18 8 37 10 0.664 0.0202
## 19 5 11 5 0.514 0.0136
## 20 10 38 9 0.962 0.0150

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can tune across the grid of parameters and our resamples. Since we are trying quite a lot of hyperparameter combinations, let’s use &lt;a href="https://juliasilge.com/blog/baseball-racing/"&gt;racing&lt;/a&gt; to quit early on clearly bad hyperparameter combinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(finetune)
doParallel::registerDoParallel()

set.seed(234)
xgb_word_rs &amp;lt;-
  tune_race_anova(
    xgb_word_wf,
    austin_folds,
    grid = xgb_grid,
    metrics = metric_set(mn_log_loss),
    control = control_race(verbose_elim = TRUE)
  )

xgb_word_rs


## # Tuning results
## # 5-fold cross-validation using stratification 
## # A tibble: 5 × 5
## splits id .order .metrics .notes          
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;          
## 1 &amp;lt;split [5996/1502]&amp;gt; Fold1 3 &amp;lt;tibble [20 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 2 &amp;lt;split [5999/1499]&amp;gt; Fold3 1 &amp;lt;tibble [20 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 3 &amp;lt;split [6000/1498]&amp;gt; Fold5 2 &amp;lt;tibble [20 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 4 &amp;lt;split [5999/1499]&amp;gt; Fold4 4 &amp;lt;tibble [10 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 5 &amp;lt;split [5998/1500]&amp;gt; Fold2 5 &amp;lt;tibble [4 × 9]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That takes a little while but we did it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate results
&lt;/h2&gt;

&lt;p&gt;First off, how did the “race” go?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot_race(xgb_word_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u5KpWZPe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u5KpWZPe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can look at the top results manually as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(xgb_word_rs)


## # A tibble: 4 × 11
## mtry min_n tree_depth learn_rate sample_size .metric .estimator mean n
## &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;
## 1 5 14 10 0.0140 0.829 mn_log_l… multiclass 0.920 5
## 2 7 15 7 0.0197 0.763 mn_log_l… multiclass 0.921 5
## 3 5 15 7 0.0182 0.559 mn_log_l… multiclass 0.921 5
## 4 9 19 9 0.0138 0.589 mn_log_l… multiclass 0.923 5
## # … with 2 more variables: std_err &amp;lt;dbl&amp;gt;, .config &amp;lt;chr&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use &lt;code&gt;last_fit()&lt;/code&gt; to fit one final time to the &lt;strong&gt;training&lt;/strong&gt; data and evaluate one final time on the &lt;strong&gt;testing&lt;/strong&gt; data, with the numerically optimal result from &lt;code&gt;xgb_word_rs&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_last &amp;lt;-
  xgb_word_wf %&amp;gt;%
  finalize_workflow(select_best(xgb_word_rs, "mn_log_loss")) %&amp;gt;%
  last_fit(austin_split)

xgb_last


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [7498/2502]&amp;gt; train/test split &amp;lt;tibble … &amp;lt;tibble… &amp;lt;tibble [2,… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did this model perform on the testing data, which was not used in tuning or training?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  mn_log_loss(priceRange, `.pred_0-250000`:`.pred_650000+`)


## # A tibble: 1 × 3
## .metric .estimator .estimate
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 mn_log_loss multiclass 0.910

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This result is pretty good for a single (not ensembled) model and is a wee bit better than what I did during the SLICED competition. I had an R bomb right as I was finishing up tuning a model just like the one I am demonstrating here!&lt;/p&gt;

&lt;p&gt;How does this model perform across the different classes?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  conf_mat(priceRange, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F3D_pTAZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-19-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F3D_pTAZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-19-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also visualize this with an ROC curve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  roc_curve(priceRange, `.pred_0-250000`:`.pred_650000+`) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_path(alpha = 0.8, size = 1.2) +
  coord_equal() +
  labs(color = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5WDzJ6hI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-20-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5WDzJ6hI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-20-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that it is easier to identify the most expensive homes but more difficult to correctly classify the less expensive homes.&lt;/p&gt;
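&lt;p&gt;If we want numbers to go along with that observation, one option is to compute a one-vs-all ROC AUC for each price bin. This is a sketch that is not from the original post; it assumes the prediction columns follow the &lt;code&gt;.pred_&lt;/code&gt; naming used above, and treats each bin in turn as the “event” class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## a sketch, not in the video: one-vs-all ROC AUC per price bin
preds &amp;lt;- collect_predictions(xgb_last)

map_dfr(levels(preds$priceRange), function(lvl) {
  preds %&amp;gt;%
    mutate(truth = factor(ifelse(priceRange == lvl, "event", "other"),
                          levels = c("event", "other"))) %&amp;gt;%
    roc_auc(truth, !!sym(paste0(".pred_", lvl))) %&amp;gt;%
    mutate(class = lvl)
})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
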

&lt;p&gt;What features are most important for this xgboost model?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(vip)
extract_workflow(xgb_last) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vip(geom = "point", num_features = 15)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w23sUDKd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-21-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w23sUDKd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/austin-housing/index_files/figure-html/unnamed-chunk-21-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The spatial information in latitude/longitude is by far the most important. Notice that the model uses &lt;code&gt;low_price_words&lt;/code&gt; more than it uses, for example, whether there is a spa or whether it is a single-family home (as opposed to a townhome or condo). It looks like the model is trying to distinguish some of those lower-priced categories. The model does &lt;em&gt;not&lt;/em&gt; really use the &lt;code&gt;high_price_words&lt;/code&gt; variable, perhaps because it is already easy to find the expensive houses.&lt;/p&gt;
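&lt;p&gt;If you want the importance scores themselves rather than a plot, the vip package also provides &lt;code&gt;vi()&lt;/code&gt;, which returns them as a tibble. This is a quick variation on the code above, not from the original post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## not shown in the video: importance scores as a tibble instead of a plot
library(vip)
extract_workflow(xgb_last) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vi() %&amp;gt;%
  slice_max(Importance, n = 15)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
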

&lt;p&gt;The two finalists from SLICED go on to compete next Tuesday, which should be fun and interesting to watch! I have enjoyed the opportunity to participate this season.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Use racing methods to tune xgboost models and predict home runs ⚾️</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Tue, 10 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/use-racing-methods-to-tune-xgboost-models-and-predict-home-runs-301m</link>
      <guid>https://dev.to/juliasilge/use-racing-methods-to-tune-xgboost-models-and-predict-home-runs-301m</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/" rel="noopener noreferrer"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/" rel="noopener noreferrer"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. This week’s episode of &lt;a href="https://www.notion.so/SLICED-Show-c7bd26356e3a42279e2dfbafb0480073" rel="noopener noreferrer"&gt;SLICED&lt;/a&gt;, a competitive data science streaming show, had contestants compete to predict home runs in recent baseball games. Honestly I don’t know much about baseball ⚾ but the &lt;a href="https://github.com/tidymodels/finetune/" rel="noopener noreferrer"&gt;finetune&lt;/a&gt; package had a recent release and this challenge offers a good opportunity to show how to use racing methods for tuning.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_e0NFIaHY2c"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict &lt;a href="https://www.kaggle.com/c/sliced-s01e09-playoffs-1/" rel="noopener noreferrer"&gt;whether a batter’s hit results in a home run&lt;/a&gt; given features about the hit. The main data set provided is in a CSV file called &lt;code&gt;train.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
train_raw &amp;lt;- read_csv("train.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can watch &lt;a href="https://www.youtube.com/channel/UCCsy9G2d0Q7m_d8cOtDineQ" rel="noopener noreferrer"&gt;this week’s full episode of SLICED&lt;/a&gt; to see lots of exploratory data analysis and visualization of this dataset, but let’s just make a few plots to understand it better.&lt;/p&gt;

&lt;p&gt;How are home runs distributed in the physical space around home plate?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  ggplot(aes(plate_x, plate_z, z = is_home_run)) +
  stat_summary_hex(alpha = 0.8, bins = 10) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% home runs")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-3-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do launch speed and angle of the ball leaving the bat affect home run percentage?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  ggplot(aes(launch_angle, launch_speed, z = is_home_run)) +
  stat_summary_hex(alpha = 0.8, bins = 15) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% home runs")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-4-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does pacing, like the number of balls, strikes, or the inning, affect home runs?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  mutate(is_home_run = if_else(as.logical(is_home_run), "yes", "no")) %&amp;gt;%
  select(is_home_run, balls, strikes, inning) %&amp;gt;%
  pivot_longer(balls:inning) %&amp;gt;%
  mutate(name = fct_inorder(name)) %&amp;gt;%
  ggplot(aes(value, after_stat(density), fill = is_home_run)) +
  geom_histogram(alpha = 0.5, binwidth = 1, position = "identity") +
  facet_wrap(~name, scales = "free") +
  labs(fill = "Home run?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-5-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is certainly lots more to discover, but let’s move on to modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” I’m going to convert the 0s and 1s from the original dataset into a factor for classification modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
bb_split &amp;lt;- train_raw %&amp;gt;%
  mutate(
    is_home_run = if_else(as.logical(is_home_run), "HR", "no"),
    is_home_run = factor(is_home_run)
  ) %&amp;gt;%
  initial_split(strata = is_home_run)
bb_train &amp;lt;- training(bb_split)
bb_test &amp;lt;- testing(bb_split)

set.seed(234)
bb_folds &amp;lt;- vfold_cv(bb_train, strata = is_home_run)
bb_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [31214/3469]&amp;gt; Fold01
## 2 &amp;lt;split [31214/3469]&amp;gt; Fold02
## 3 &amp;lt;split [31214/3469]&amp;gt; Fold03
## 4 &amp;lt;split [31215/3468]&amp;gt; Fold04
## 5 &amp;lt;split [31215/3468]&amp;gt; Fold05
## 6 &amp;lt;split [31215/3468]&amp;gt; Fold06
## 7 &amp;lt;split [31215/3468]&amp;gt; Fold07
## 8 &amp;lt;split [31215/3468]&amp;gt; Fold08
## 9 &amp;lt;split [31215/3468]&amp;gt; Fold09
## 10 &amp;lt;split [31215/3468]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For feature engineering, let’s concentrate on the variables we already explored during EDA along with info about the pitch and handedness of players. There is some missing data, especially in the &lt;code&gt;launch_angle&lt;/code&gt; and &lt;code&gt;launch_speed&lt;/code&gt;, so let’s impute those values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bb_rec &amp;lt;-
  recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
    bb_type + bearing + pitch_mph +
    is_pitcher_lefty + is_batter_lefty +
    inning + balls + strikes + game_date,
  data = bb_train
  ) %&amp;gt;%
  step_date(game_date, features = c("week"), keep_original_cols = FALSE) %&amp;gt;%
  step_unknown(all_nominal_predictors()) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %&amp;gt;%
  step_impute_median(all_numeric_predictors(), -launch_angle, -launch_speed) %&amp;gt;%
  step_impute_linear(launch_angle, launch_speed,
    impute_with = imp_vars(plate_x, plate_z, pitch_mph)
  ) %&amp;gt;%
  step_nzv(all_predictors())

## we can `prep()` just to check that it works
prep(bb_rec)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 13
## 
## Training data contained 34683 data points and 15255 incomplete rows. 
## 
## Operations:
## 
## Date features from game_date [trained]
## Unknown factor level assignment for bb_type, bearing [trained]
## Dummy variables from bb_type, bearing [trained]
## Median Imputation for plate_x, plate_z, pitch_mph, ... [trained]
## Linear regression imputation for launch_angle, launch_speed [trained]
## Sparse, unbalanced variable filter removed bb_type_unknown, bearing_unknown [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s create a tunable xgboost model specification. In a competition like SLICED, we likely wouldn’t want to tune all these parameters because of time constraints, but instead tune only a few of the most important ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_spec &amp;lt;-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    mtry = tune(),
    learn_rate = 0.01
  ) %&amp;gt;%
  set_engine("xgboost") %&amp;gt;%
  set_mode("classification")

xgb_wf &amp;lt;- workflow(bb_rec, xgb_spec)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use racing to tune xgboost
&lt;/h2&gt;

&lt;p&gt;Now we &lt;a href="https://finetune.tidymodels.org/reference/tune_race_anova.html" rel="noopener noreferrer"&gt;can use &lt;code&gt;tune_race_anova()&lt;/code&gt; to eliminate&lt;/a&gt; parameter combinations that are not doing well. This particular SLICED episode was being evaluted on log loss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(finetune)
doParallel::registerDoParallel()

set.seed(345)
xgb_rs &amp;lt;- tune_race_anova(
  xgb_wf,
  resamples = bb_folds,
  grid = 15,
  metrics = metric_set(mn_log_loss),
  control = control_race(verbose_elim = TRUE)
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize how the possible parameter combinations we tried did during the “race.” Notice how we saved a TON of time by not evaluating the parameter combinations that were clearly doing poorly on all the resamples; we only kept going with the good parameter combinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot_race(xgb_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-10-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we can look at the top results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(xgb_rs)


## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config          
## &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;            
## 1 6 1536 11 mn_log_lo… binary 0.0981 10 0.00171 Preprocessor1_Mo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use &lt;code&gt;last_fit()&lt;/code&gt; to fit one final time to the &lt;strong&gt;training&lt;/strong&gt; data and evaluate one final time on the &lt;strong&gt;testing&lt;/strong&gt; data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb_last &amp;lt;- xgb_wf %&amp;gt;%
  finalize_workflow(select_best(xgb_rs, "mn_log_loss")) %&amp;gt;%
  last_fit(bb_split)

xgb_last


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [34683… train/test… &amp;lt;tibble [2 … &amp;lt;tibble [0… &amp;lt;tibble [11,561… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can collect the predictions on the testing set and do whatever we want, like create an ROC curve, or in this case compute log loss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(xgb_last) %&amp;gt;%
  mn_log_loss(is_home_run, .pred_HR)


## # A tibble: 1 × 3
## .metric .estimator .estimate
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
## 1 mn_log_loss binary 0.0975

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
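&lt;p&gt;As a sanity check (a sketch, not part of the original workflow), binary log loss is just the mean negative log of the probability assigned to the true class, so we can reproduce the &lt;code&gt;mn_log_loss()&lt;/code&gt; value by hand from the same predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## probability assigned to the observed class for each row,
## then the mean negative log of those probabilities
collect_predictions(xgb_last) %&amp;gt;%
  mutate(p_truth = if_else(is_home_run == "HR", .pred_HR, 1 - .pred_HR)) %&amp;gt;%
  summarise(log_loss = -mean(log(p_truth)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;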



&lt;p&gt;This is pretty good for a single model; the SLICED competitors who achieved better scores on this dataset all used ensemble models, I believe.&lt;/p&gt;

&lt;p&gt;We can also compute variable importance scores using the vip package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(vip)
extract_workflow(xgb_last) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vip(geom = "point", num_features = 15)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fjuliasilge.com%2Fblog%2Fbaseball-racing%2Findex_files%2Ffigure-html%2Funnamed-chunk-14-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using racing methods is a great way to tune through lots of possible parameter options more quickly. Perhaps I’ll put it to the test next Tuesday, when I participate in the second and final episode of the SLICED playoffs!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Tune xgboost models with early stopping to predict shelter animal status 🐱🐶</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Sat, 07 Aug 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/tune-xgboost-models-with-early-stopping-to-predict-shelter-animal-status-48gf</link>
      <guid>https://dev.to/juliasilge/tune-xgboost-models-with-early-stopping-to-predict-shelter-animal-status-48gf</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. I participated in this week’s episode of the &lt;a href="https://www.notion.so/SLICED-Show-c7bd26356e3a42279e2dfbafb0480073"&gt;SLICED&lt;/a&gt; playoffs, a competitive data science streaming show, where we competed to predict the status of shelter animals. 🐱 I used xgboost’s early stopping feature as I competed, so let’s walk through how and when to try that out!&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/aXAafzOFyjk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict &lt;a href="https://www.kaggle.com/c/sliced-s01e10-playoffs-2/"&gt;the outcome for shelter animals&lt;/a&gt; (adoption, transfer, or no outcome) given features about the animal and event. The main data set provided is in a CSV file called &lt;code&gt;training.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
train_raw &amp;lt;- read_csv("train.csv")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can watch &lt;a href="https://www.twitch.tv/videos/1107382565"&gt;this week’s full episode of SLICED&lt;/a&gt; to see lots of exploratory data analysis and visualization of this dataset, but let’s just make a few plots to understand it better.&lt;/p&gt;

&lt;p&gt;How are outcomes distributed for animals of different ages?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(lubridate)

train_raw %&amp;gt;%
  mutate(
    age_upon_outcome = as.period(as.Date(datetime) - date_of_birth),
    age_upon_outcome = time_length(age_upon_outcome, unit = "weeks")
  ) %&amp;gt;%
  ggplot(aes(age_upon_outcome, after_stat(density), fill = outcome_type)) +
  geom_histogram(bins = 15, alpha = 0.5, position = "identity") +
  labs(x = "Age in weeks", fill = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HxWvJ-K_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HxWvJ-K_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does adoption rate change with day of the week and week of the year?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train_raw %&amp;gt;%
  mutate(outcome_type = outcome_type == "adoption") %&amp;gt;%
  group_by(
    week = week(datetime),
    wday = wday(datetime)
  ) %&amp;gt;%
  summarise(outcome_type = mean(outcome_type)) %&amp;gt;%
  ggplot(aes(week, wday, fill = outcome_type)) +
  geom_tile(alpha = 0.8) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% adopted", x = "week of the year", y = "week day")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_KP9LVib--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_KP9LVib--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the difference on weekends vs. weekdays especially!&lt;/p&gt;

&lt;p&gt;There is certainly lots more to explore (including, for example, learning about the names of the animals, something I spent a good bit of time on during the competition), but let’s move on to modeling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget,” as well as the metrics (this challenge was evaluated on multiclass log loss).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
shelter_split &amp;lt;- train_raw %&amp;gt;%
  mutate(
    age_upon_outcome = as.period(as.Date(datetime) - date_of_birth),
    age_upon_outcome = time_length(age_upon_outcome, unit = "weeks")
  ) %&amp;gt;%
  initial_split(strata = outcome_type)

shelter_train &amp;lt;- training(shelter_split)
shelter_test &amp;lt;- testing(shelter_split)
shelter_metrics &amp;lt;- metric_set(accuracy, roc_auc, mn_log_loss)

set.seed(234)
shelter_folds &amp;lt;- vfold_cv(shelter_train, strata = outcome_type)
shelter_folds


## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 2
## splits id    
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; 
## 1 &amp;lt;split [36724/4081]&amp;gt; Fold01
## 2 &amp;lt;split [36724/4081]&amp;gt; Fold02
## 3 &amp;lt;split [36724/4081]&amp;gt; Fold03
## 4 &amp;lt;split [36724/4081]&amp;gt; Fold04
## 5 &amp;lt;split [36724/4081]&amp;gt; Fold05
## 6 &amp;lt;split [36725/4080]&amp;gt; Fold06
## 7 &amp;lt;split [36725/4080]&amp;gt; Fold07
## 8 &amp;lt;split [36725/4080]&amp;gt; Fold08
## 9 &amp;lt;split [36725/4080]&amp;gt; Fold09
## 10 &amp;lt;split [36725/4080]&amp;gt; Fold10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For feature engineering, let’s concentrate on just a handful of predictors, like when the event (adoption, transfer, or “no outcome”) was recorded and features of the animal itself like age, sex, type, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shelter_rec &amp;lt;- recipe(outcome_type ~ age_upon_outcome + animal_type +
  datetime + sex + spay_neuter,
data = shelter_train
) %&amp;gt;%
  step_date(datetime, features = c("year", "week", "dow"), keep_original_cols = FALSE) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %&amp;gt;%
  step_zv(all_predictors())

## we can `prep()` just to check that it works
prep(shelter_rec)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 5
## 
## Training data contained 40805 data points and no missing data.
## 
## Operations:
## 
## Date features from datetime [trained]
## Dummy variables from animal_type, sex, spay_neuter, datetime_dow [trained]
## Zero variance filter removed no terms [trained]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s create a tunable xgboost model specification. This is where &lt;a href="https://en.wikipedia.org/wiki/Early_stopping"&gt;early stopping&lt;/a&gt; comes in; we will keep the number of trees as a constant (and not too terribly high), set &lt;code&gt;stop_iter&lt;/code&gt; (the early stopping parameter) to &lt;code&gt;tune()&lt;/code&gt;, and then tune a few other parameters. Notice that we need to hold back a &lt;code&gt;validation&lt;/code&gt; set (a proportion of each analysis set, actually) to use for deciding when to stop.&lt;/p&gt;

&lt;p&gt;We can also create a custom &lt;code&gt;stopping_grid&lt;/code&gt; to specify which parameter combinations we want to try out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stopping_spec &amp;lt;-
  boost_tree(
    trees = 500,
    mtry = tune(),
    learn_rate = tune(),
    stop_iter = tune()
  ) %&amp;gt;%
  set_engine("xgboost", validation = 0.2) %&amp;gt;%
  set_mode("classification")

stopping_grid &amp;lt;-
  grid_latin_hypercube(
    mtry(range = c(5L, 20L)), ## depends on number of columns in data
    learn_rate(range = c(-5, -1)), ## keep pretty big
    stop_iter(range = c(10L, 50L)), ## bigger than default
    size = 10
  )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can put these together in a workflow and tune across the grid of parameters and our resamples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;early_stop_wf &amp;lt;- workflow(shelter_rec, stopping_spec)

doParallel::registerDoParallel()
set.seed(345)
stopping_rs &amp;lt;- tune_grid(
  early_stop_wf,
  shelter_folds,
  grid = stopping_grid,
  metrics = shelter_metrics
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate results
&lt;/h2&gt;

&lt;p&gt;How did these results turn out? We can visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(stopping_rs) + theme_light(base_family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ILKZba29--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-9-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ILKZba29--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-9-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or we can look at the top results manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(stopping_rs, metric = "mn_log_loss")


## # A tibble: 5 × 9
## mtry learn_rate stop_iter .metric .estimator mean n std_err .config 
## &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;   
## 1 12 0.0612 46 mn_log_loss multiclass 0.502 10 0.00319 Preproc…
## 2 18 0.0378 36 mn_log_loss multiclass 0.505 10 0.00279 Preproc…
## 3 7 0.00710 12 mn_log_loss multiclass 0.544 10 0.00246 Preproc…
## 4 9 0.00252 33 mn_log_loss multiclass 0.655 10 0.00145 Preproc…
## 5 11 0.00195 25 mn_log_loss multiclass 0.699 10 0.00122 Preproc…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s use &lt;code&gt;last_fit()&lt;/code&gt; to fit one final time to the &lt;strong&gt;training&lt;/strong&gt; data and evaluate one final time on the &lt;strong&gt;testing&lt;/strong&gt; data, with the numerically optimal result from &lt;code&gt;stopping_rs&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stopping_fit &amp;lt;- early_stop_wf %&amp;gt;%
  finalize_workflow(select_best(stopping_rs, "mn_log_loss")) %&amp;gt;%
  last_fit(shelter_split)

stopping_fit


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## &amp;lt;list&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt; &amp;lt;list&amp;gt;   
## 1 &amp;lt;split [40805/13603]&amp;gt; train/test split &amp;lt;tibble … &amp;lt;tibb… &amp;lt;tibble [13… &amp;lt;workflo…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How did this model perform on the testing data, which was not used in tuning or training?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(stopping_fit)


## # A tibble: 2 × 4
## .metric .estimator .estimate .config             
## &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;               
## 1 accuracy multiclass 0.807 Preprocessor1_Model1
## 2 roc_auc hand_till 0.877 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This result is pretty good for a single model; we would expect to do better by incorporating the &lt;code&gt;breed&lt;/code&gt; information, perhaps the presence/absence of a name, or moving to an ensemble model.&lt;/p&gt;

&lt;p&gt;What features are most important for this xgboost model?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(vip)

## use this fitted workflow `extract_workflow(stopping_fit)` to predict on new data
extract_workflow(stopping_fit) %&amp;gt;%
  extract_fit_parsnip() %&amp;gt;%
  vip(num_features = 15, geom = "point")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fFSA8k4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-13-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFSA8k4c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-13-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Age, spay/neuter status, animal type, and seasonal information like week of the year or day of the week are important for this model.&lt;/p&gt;

&lt;p&gt;We can collect the predictions on the testing set and do whatever we want, like create an ROC curve.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(stopping_fit) %&amp;gt;%
  roc_curve(outcome_type, .pred_adoption:.pred_transfer) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity, color = .level)) +
  geom_abline(lty = 2, color = "gray80", size = 1.5) +
  geom_path(alpha = 0.8, size = 1) +
  coord_equal() +
  labs(color = NULL)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c32MJNNi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-14-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c32MJNNi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-14-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also look at a confusion matrix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(stopping_fit) %&amp;gt;%
  conf_mat(outcome_type, .pred_class) %&amp;gt;%
  autoplot()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DNWEWXs7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DNWEWXs7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/shelter-animals/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Early stopping is a great option when you have plenty of data and don’t want to overfit your boosted trees! I will be back on SLICED for the final four next Tuesday, and I plan to use early stopping again because it is a good fit for this kind of situation.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
    <item>
      <title>Predict which Scooby Doo monsters 👻 are REAL with a tuned decision tree model</title>
      <dc:creator>Julia Silge</dc:creator>
      <pubDate>Tue, 13 Jul 2021 00:00:00 +0000</pubDate>
      <link>https://dev.to/juliasilge/predict-which-scooby-doo-monsters-are-real-with-a-tuned-decision-tree-model-32e6</link>
      <guid>https://dev.to/juliasilge/predict-which-scooby-doo-monsters-are-real-with-a-tuned-decision-tree-model-32e6</guid>
      <description>&lt;p&gt;This is the latest in my series of &lt;a href="https://juliasilge.com/category/tidymodels/"&gt;screencasts&lt;/a&gt; demonstrating how to use the &lt;a href="https://www.tidymodels.org/"&gt;tidymodels&lt;/a&gt; packages, from just getting started to tuning more complex models. Today’s screencast walks through how to train and evalute a random forest model, with this week’s &lt;a href="https://github.com/rfordatascience/tidytuesday"&gt;&lt;code&gt;#TidyTuesday&lt;/code&gt; dataset&lt;/a&gt; on Scooby Doo episodes. 👻&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2g6f-j3sHS4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here is the code I used in the video, for those who prefer reading instead of or in addition to video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data
&lt;/h2&gt;

&lt;p&gt;Our modeling goal is to predict which &lt;a href="https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-13/readme.md"&gt;Scooby Doo monsters&lt;/a&gt; are real and which are not, based on other characteristics of the episode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidyverse)
scooby_raw &amp;lt;- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-13/scoobydoo.csv")

scooby_raw %&amp;gt;%
  filter(monster_amount &amp;gt; 0) %&amp;gt;%
  count(monster_real)


## # A tibble: 2 x 2
## monster_real n
## &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
## 1 FALSE 404
## 2 TRUE 112

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most monsters are not real!&lt;/p&gt;

&lt;p&gt;How did the number of real vs. fake monsters change over the decades?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scooby_raw %&amp;gt;%
  filter(monster_amount &amp;gt; 0) %&amp;gt;%
  count(
    year_aired = 10 * ((lubridate::year(date_aired) + 1) %/% 10),
    monster_real
  ) %&amp;gt;%
  mutate(year_aired = factor(year_aired)) %&amp;gt;%
  ggplot(aes(year_aired, n, fill = monster_real)) +
  geom_col(position = position_dodge(preserve = "single"), alpha = 0.8) +
  labs(x = "Date aired", y = "Monsters per decade", fill = "Real monster?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1PWdtG3B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-3-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1PWdtG3B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-3-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How are these different episodes rated on IMDB?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scooby_raw %&amp;gt;%
  filter(monster_amount &amp;gt; 0) %&amp;gt;%
  mutate(imdb = parse_number(imdb)) %&amp;gt;%
  ggplot(aes(imdb, after_stat(density), fill = monster_real)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(x = "IMDB rating", y = "Density", fill = "Real monster?")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n1MgZC-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-4-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n1MgZC-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-4-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like there are some meaningful relationships there that we can use for modeling, but they are not linear, so a decision tree may be a good fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build and tune a model
&lt;/h2&gt;

&lt;p&gt;Let’s start our modeling by setting up our “data budget.” We’re only going to use the &lt;em&gt;year&lt;/em&gt; each episode was aired and the episode &lt;em&gt;rating&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(tidymodels)

set.seed(123)
scooby_split &amp;lt;- scooby_raw %&amp;gt;%
  mutate(
    imdb = parse_number(imdb),
    year_aired = lubridate::year(date_aired)
  ) %&amp;gt;%
  filter(monster_amount &amp;gt; 0, !is.na(imdb)) %&amp;gt;%
  mutate(
    monster_real = case_when(
      monster_real == "FALSE" ~ "fake",
      TRUE ~ "real"
    ),
    monster_real = factor(monster_real)
  ) %&amp;gt;%
  select(year_aired, imdb, monster_real, title) %&amp;gt;%
  initial_split(strata = monster_real)
scooby_train &amp;lt;- training(scooby_split)
scooby_test &amp;lt;- testing(scooby_split)

set.seed(234)
scooby_folds &amp;lt;- bootstraps(scooby_train, strata = monster_real)
scooby_folds


## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits            id
##    &amp;lt;list&amp;gt;            &amp;lt;chr&amp;gt;
##  1 &amp;lt;split [375/133]&amp;gt; Bootstrap01
##  2 &amp;lt;split [375/144]&amp;gt; Bootstrap02
##  3 &amp;lt;split [375/140]&amp;gt; Bootstrap03
##  4 &amp;lt;split [375/132]&amp;gt; Bootstrap04
##  5 &amp;lt;split [375/139]&amp;gt; Bootstrap05
##  6 &amp;lt;split [375/134]&amp;gt; Bootstrap06
##  7 &amp;lt;split [375/146]&amp;gt; Bootstrap07
##  8 &amp;lt;split [375/132]&amp;gt; Bootstrap08
##  9 &amp;lt;split [375/143]&amp;gt; Bootstrap09
## 10 &amp;lt;split [375/143]&amp;gt; Bootstrap10
## # … with 15 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s create our decision tree specification. It is tunable, so we can’t fit it to data right away; we haven’t yet said what the model parameters will be.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree_spec &amp;lt;-
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune(),
    min_n = tune()
  ) %&amp;gt;%
  set_mode("classification") %&amp;gt;%
  set_engine("rpart")

tree_spec


## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
##   min_n = tune()
## 
## Computational engine: rpart

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s set up a grid of possible model parameters to try. With &lt;code&gt;levels = 4&lt;/code&gt; for three tunable parameters, &lt;code&gt;grid_regular()&lt;/code&gt; creates 4 × 4 × 4 = 64 candidate combinations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree_grid &amp;lt;- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 4)
tree_grid


## # A tibble: 64 x 3
##    cost_complexity tree_depth min_n
##              &amp;lt;dbl&amp;gt;      &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;
##  1    0.0000000001          1     2
##  2    0.0000001             1     2
##  3    0.0001                1     2
##  4    0.1                   1     2
##  5    0.0000000001          5     2
##  6    0.0000001             5     2
##  7    0.0001                5     2
##  8    0.1                   5     2
##  9    0.0000000001         10     2
## 10    0.0000001            10     2
## # … with 54 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s fit each possible parameter combination to each resample. By putting non-default metrics into &lt;code&gt;metric_set()&lt;/code&gt;, we can specify which metrics are computed for each resample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doParallel::registerDoParallel()

set.seed(345)
tree_rs &amp;lt;-
  tune_grid(
    tree_spec,
    monster_real ~ year_aired + imdb,
    resamples = scooby_folds,
    grid = tree_grid,
    metrics = metric_set(accuracy, roc_auc, sensitivity, specificity)
  )

tree_rs


## # Tuning results
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 4
##    splits            id          .metrics           .notes
##    &amp;lt;list&amp;gt;            &amp;lt;chr&amp;gt;       &amp;lt;list&amp;gt;             &amp;lt;list&amp;gt;
##  1 &amp;lt;split [375/133]&amp;gt; Bootstrap01 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  2 &amp;lt;split [375/144]&amp;gt; Bootstrap02 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  3 &amp;lt;split [375/140]&amp;gt; Bootstrap03 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  4 &amp;lt;split [375/132]&amp;gt; Bootstrap04 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  5 &amp;lt;split [375/139]&amp;gt; Bootstrap05 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  6 &amp;lt;split [375/134]&amp;gt; Bootstrap06 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  7 &amp;lt;split [375/146]&amp;gt; Bootstrap07 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  8 &amp;lt;split [375/132]&amp;gt; Bootstrap08 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
##  9 &amp;lt;split [375/143]&amp;gt; Bootstrap09 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## 10 &amp;lt;split [375/143]&amp;gt; Bootstrap10 &amp;lt;tibble [256 × 7]&amp;gt; &amp;lt;tibble [0 × 1]&amp;gt;
## # … with 15 more rows

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
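&lt;p&gt;If you want a quick summary of how each candidate did across all the resamples, &lt;code&gt;collect_metrics()&lt;/code&gt; aggregates the per-resample results into one row per parameter combination and metric (a quick sketch; the output includes the mean and standard error for each).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(tree_rs)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;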



&lt;p&gt;All done!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate and understand our model
&lt;/h2&gt;

&lt;p&gt;Now that we have tuned our decision tree model, we can choose which set of model parameters we want to use. What are some of the best options?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show_best(tree_rs)


## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric  .estimator  mean     n std_err
##             &amp;lt;dbl&amp;gt;      &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;   &amp;lt;dbl&amp;gt;
## 1    0.0000000001         10     2 accuracy binary     0.872    25 0.00481
## 2    0.0000001            10     2 accuracy binary     0.872    25 0.00481
## 3    0.0001               10     2 accuracy binary     0.872    25 0.00481
## 4    0.0000000001         15     2 accuracy binary     0.871    25 0.00456
## 5    0.0000001            15     2 accuracy binary     0.871    25 0.00456
## # … with 1 more variable: .config &amp;lt;chr&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize all of the combinations we tried.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;autoplot(tree_rs) + theme_light(base_family = "IBMPlexSans")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ipoZ8ps5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-10-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ipoZ8ps5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-10-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we used &lt;code&gt;select_best()&lt;/code&gt;, we would pick the numerically best option. However, we might want a different option that meets some criterion relative to the best performance, like a simpler model that is within one standard error of the optimal results. We finalize our model just like we finalize a workflow, as shown in previous posts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;simpler_tree &amp;lt;- select_by_one_std_err(tree_rs,
  -cost_complexity,
  metric = "roc_auc"
)

final_tree &amp;lt;- finalize_model(tree_spec, simpler_tree)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can fit &lt;code&gt;final_tree&lt;/code&gt; to our training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fit &amp;lt;- fit(final_tree, monster_real ~ year_aired + imdb, scooby_train)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
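&lt;p&gt;If you’d like to peek at the fitted tree itself, one option is to pull out the underlying rpart object from the parsnip fit; depending on your parsnip version, &lt;code&gt;extract_fit_engine()&lt;/code&gt; (or &lt;code&gt;final_fit$fit&lt;/code&gt;) gets you there, and printing it shows the split rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_fit %&amp;gt;% extract_fit_engine()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;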



&lt;p&gt;We also could use &lt;code&gt;last_fit()&lt;/code&gt; instead of &lt;code&gt;fit()&lt;/code&gt;, by passing it the &lt;strong&gt;split&lt;/strong&gt; rather than the training data. This will fit one time on the training data and evaluate one time on the testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_rs &amp;lt;- last_fit(final_tree, monster_real ~ year_aired + imdb, scooby_split)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the first time we have used the testing data in this whole analysis, and it lets us see how our model performs on data it has never seen before. A bit worse, unfortunately!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_metrics(final_rs)


## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 accuracy binary         0.857 Preprocessor1_Model1
## 2 roc_auc  binary         0.780 Preprocessor1_Model1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
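&lt;p&gt;Beyond the overall metrics, we can look at the individual test-set predictions stored in the &lt;code&gt;last_fit()&lt;/code&gt; results and summarize them in a confusion matrix (a quick sketch using yardstick’s &lt;code&gt;conf_mat()&lt;/code&gt;; &lt;code&gt;.pred_class&lt;/code&gt; is the predicted class column that &lt;code&gt;collect_predictions()&lt;/code&gt; returns).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;collect_predictions(final_rs) %&amp;gt;%
  conf_mat(monster_real, .pred_class)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;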



&lt;p&gt;Finally, we can use the &lt;a href="https://github.com/grantmcdermott/parttree"&gt;parttree&lt;/a&gt; package to visualize our decision tree results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;library(parttree)

scooby_train %&amp;gt;%
  ggplot(aes(imdb, year_aired)) +
  geom_parttree(data = final_fit, aes(fill = monster_real), alpha = 0.2) +
  geom_jitter(alpha = 0.7, width = 0.05, height = 0.2, aes(color = monster_real))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WNlPCoBr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-15-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WNlPCoBr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://juliasilge.com/blog/scooby-doo/index_files/figure-html/unnamed-chunk-15-1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>rstats</category>
    </item>
  </channel>
</rss>
