<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bijen Patel</title>
    <description>The latest articles on DEV Community by Bijen Patel (@bijenpatel).</description>
    <link>https://dev.to/bijenpatel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F507452%2F07844963-e8c9-4189-a180-fc4c56d90ba6.jpeg</url>
      <title>DEV Community: Bijen Patel</title>
      <link>https://dev.to/bijenpatel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bijenpatel"/>
    <language>en</language>
    <item>
      <title>ISLR Chapter 10 — Unsupervised Learning</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:47:04 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-10-unsupervised-learning-2gpe</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-10-unsupervised-learning-2gpe</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 10 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The statistical methods from the previous chapters focused on supervised learning. Again, supervised learning is where we have access to a set of predictors &lt;em&gt;X&lt;/em&gt;, and a response &lt;em&gt;Y&lt;/em&gt;. The goal is to predict &lt;em&gt;Y&lt;/em&gt; by using the predictors.&lt;/p&gt;

&lt;p&gt;In unsupervised learning, we have a set of features &lt;em&gt;X&lt;/em&gt;, but no response. The goal is not to predict anything. Instead, the goal is to learn information about the features, such as discovering subgroups or relationships.&lt;/p&gt;

&lt;p&gt;In this chapter, we will cover two common methods of unsupervised learning: principal components analysis (PCA) and clustering. PCA is useful for data visualization and data pre-processing before using supervised learning methods. Clustering methods are useful for discovering unknown subgroups or relationships within a dataset.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Challenge of Unsupervised Learning
&lt;/h1&gt;

&lt;p&gt;Unsupervised learning is more challenging than supervised learning because it is more subjective. There is no simple goal for the analysis, and it is hard to assess results because there is no “true” answer to check against. Unsupervised learning is usually performed as part of exploratory data analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  Principal Components Analysis
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Principal components analysis was previously covered as part of principal components regression from&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/linear-model-selection-regularization/" rel="noopener noreferrer"&gt;&lt;em&gt;chapter six&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. Therefore, this section will be nearly identical to the explanation from chapter six, but without the regression aspects.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we have a dataset with a large number of predictors, dimension reduction methods can be used to summarize the dataset with a smaller number of representative predictors (dimensions) that collectively explain most of the variability in the data. Each of the dimensions is some linear combination of all of the predictors. For example, if we had a dataset of 5 predictors, the first dimension (&lt;em&gt;Z&lt;/em&gt;₁) would be as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463860528%2FoP-_ays89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463860528%2FoP-_ays89.png" alt="dimension-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that we will always have fewer dimensions than the number of predictors.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;ϕ&lt;/em&gt; values for the dimensions are known as loading values. The values are subject to the constraint that the sum of the squared &lt;em&gt;ϕ&lt;/em&gt; values in a dimension must equal one. For our above example, this means that:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463871124%2FRVAr3QF9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463871124%2FRVAr3QF9b.png" alt="loading-values-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;ϕ&lt;/em&gt; values in a dimension make up a “loading vector” that defines a “direction” to explain the predictors. But how exactly do we come up with the &lt;em&gt;ϕ&lt;/em&gt; values? They are determined through principal components analysis.&lt;/p&gt;
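
&lt;p&gt;As a minimal sketch of how loadings turn into component scores (not part of the book’s labs; &lt;em&gt;X&lt;/em&gt; here is a hypothetical numeric predictor matrix), the first column of the rotation matrix returned by prcomp is the loading vector &lt;em&gt;ϕ&lt;/em&gt;₁, and multiplying the standardized data by it gives the first principal component scores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# X is a hypothetical numeric matrix (rows = observations, columns = predictors)
pca  = prcomp(X, scale=TRUE)   # standardize, then compute the principal components
phi1 = pca$rotation[, 1]       # loading vector for the first component, Z1
Z1   = scale(X) %*% phi1       # first principal component scores; matches pca$x[, 1]
sum(phi1^2)                    # the loadings satisfy the constraint (sums to 1)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;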

&lt;p&gt;Principal components analysis is the most popular dimension reduction method. The method involves determining different principal component directions of the data.&lt;/p&gt;

&lt;p&gt;The first principal component direction (&lt;em&gt;Z&lt;/em&gt;₁) defines the direction along which the data varies the most. In other words, it is a linear combination of all of the predictors, such that it explains most of the variance in the predictors.&lt;/p&gt;

&lt;p&gt;The second principal component direction (&lt;em&gt;Z&lt;/em&gt;₂) defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with the first principal component, &lt;em&gt;Z&lt;/em&gt;₁.&lt;/p&gt;

&lt;p&gt;The third principal component direction (&lt;em&gt;Z&lt;/em&gt;₃) defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with both of the previous principal components, &lt;em&gt;Z&lt;/em&gt;₁ and &lt;em&gt;Z&lt;/em&gt;₂.&lt;/p&gt;

&lt;p&gt;And so on and so forth for additional principal component directions.&lt;/p&gt;

&lt;p&gt;Dimension reduction is best explained with an example. Assume that we have a dataset of different baseball players, which consists of their statistics in 1986, their years in the league, and their salaries in the following year (1987).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463888483%2Fhl1KEAfJA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463888483%2Fhl1KEAfJA.png" alt="baseball-statistics-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We perform dimension reduction by transforming our 7 different predictors into a smaller number of principal components.&lt;/p&gt;

&lt;p&gt;It is important to note that prior to performing principal components analysis, each predictor should be standardized so that all of the predictors are on the same scale. Without standardization, predictors with high variance will play an outsized role in the final principal components obtained.&lt;/p&gt;

&lt;p&gt;Performing principal components analysis would result in the following &lt;em&gt;ϕ&lt;/em&gt; values for the first three principal components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463902253%2FxRszBzV4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463902253%2FxRszBzV4m.png" alt="baseball-pca-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A plot of the loading values of the first two principal components would look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463928441%2Fu7Atz-FHC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463928441%2Fu7Atz-FHC.png" alt="baseball-pca.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do we interpret these principal components?&lt;/p&gt;

&lt;p&gt;If we take a look at the first principal component, we can see that there is approximately an equal weight placed on each of the six baseball statistic predictors, and much less weight placed on the years that a player has been in the league. This means that the first principal component roughly corresponds to a player’s level of production.&lt;/p&gt;

&lt;p&gt;On the other hand, in the second principal component, we can see that there is a large weight placed on the number of years that a player has been in the league, and much less weight placed on the baseball statistics. This means that the second principal component roughly corresponds to how long a player has been in the league.&lt;/p&gt;

&lt;p&gt;In the third principal component, we can see that there is more weight placed on three specific baseball statistics: home runs, RBIs, and walks. This means that the third principal component roughly corresponds to a player’s batting ability.&lt;/p&gt;

&lt;p&gt;Performing principal components analysis also tells us the percent of variation in the data that is explained by each of the components. The first principal component from the baseball data explains 67% of the variation in the predictors. The second principal component explains 15%. The third principal component explains 9%. Therefore, together, these three principal components explain 91% of the variation in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463942689%2Fn8KfEdmjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463942689%2Fn8KfEdmjc.png" alt="pca-pve.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This helps explain the key idea of principal components analysis, which is that a small number of principal components are sufficient to explain most of the variability in the data. Through principal components analysis, we’ve reduced the dimensions of our dataset from seven to three.&lt;/p&gt;

&lt;p&gt;The number of principal components to use can be chosen by using the number of components that explain a large amount of the variation in the data. For example, in the baseball data, the first three principal components explain 91% of variation in the data, so using just the first three is a valid option. Aside from that, there is no objective way to decide, unless PCA is being used in the context of a supervised learning method, such as principal components regression. In that case, we can perform cross-validation to choose the number of components that results in the lowest test error.&lt;/p&gt;
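
&lt;p&gt;As a rough sketch (assuming a fitted prcomp object named pca — a hypothetical name; the R code at the end of this post uses USArrests_pca), the proportion of variance explained by each component and its cumulative sum can be computed from the component standard deviations. The cumulative curve is a simple way to see how many components are “enough”:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;pve = pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance explained by each component
cumsum(pve)                          # cumulative PVE; e.g. keep enough components to pass a threshold such as 0.90
plot(cumsum(pve), type="b", xlab="Principal Component", ylab="Cumulative PVE")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;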

&lt;p&gt;Principal components analysis can ultimately be used in regression, classification, and clustering methods.&lt;/p&gt;

&lt;h1&gt;
  
  
  Clustering Methods
&lt;/h1&gt;

&lt;p&gt;Clustering refers to a broad set of techniques for finding subgroups in a dataset. The two most common clustering methods are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; K-means Clustering&lt;/li&gt;
&lt;li&gt; Hierarchical Clustering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In K-means clustering, we are looking to partition our dataset into a pre-specified number of clusters.&lt;/p&gt;

&lt;p&gt;In hierarchical clustering, we don’t know in advance how many clusters we want. Instead, we end up with a tree-like visual representation of the observations, called a dendrogram, which allows us to see at once the clusterings obtained for each possible number of clusters, from 1 to &lt;em&gt;n&lt;/em&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  K-Means Clustering
&lt;/h1&gt;

&lt;p&gt;K-means clustering involves specifying a desired number of clusters &lt;em&gt;K&lt;/em&gt; and assigning each observation to exactly one of the clusters. The clusters are determined such that the total within-cluster variation, summed over all &lt;em&gt;K&lt;/em&gt; clusters, is minimized. Within-cluster variation is typically measured using squared Euclidean distance. The general process of performing K-means clustering is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Randomly assign each observation to one of the clusters. These will serve as the initial cluster assignments.&lt;/li&gt;
&lt;li&gt; For each of the clusters, compute the cluster centroid. The cluster centroid is the mean of the observations assigned to the cluster.&lt;/li&gt;
&lt;li&gt; Reassign each observation to the cluster whose centroid is the closest.&lt;/li&gt;
&lt;li&gt; Continue repeating steps 2 &amp;amp; 3 until the result no longer changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A visual of the process looks something like the following graphic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463962766%2FkMsbjWxcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463962766%2FkMsbjWxcz.png" alt="k-means-clustering.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, there is an issue with K-means clustering in that the final results depend on the initial random cluster assignment. Therefore, the algorithm should be run multiple times and the result that minimizes the total within-cluster variation should be chosen.&lt;/p&gt;
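
&lt;p&gt;To make the iterative procedure above concrete, here is a minimal, illustrative sketch of the algorithm. The function manual_kmeans is hypothetical and ignores edge cases such as empty clusters; in practice, use the kmeans function shown in the R code at the end of this post, with a reasonably large nstart value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# x is assumed to be a numeric matrix (rows = observations, columns = features)
manual_kmeans = function(x, K, max_iter=100) {
  clusters = sample(1:K, nrow(x), replace=TRUE)  # Step 1: random initial assignments
  for (i in 1:max_iter) {
    # Step 2: compute each cluster centroid (the mean of its assigned observations)
    centroids = sapply(1:K, function(k) colMeans(x[clusters == k, , drop=FALSE]))
    # Step 3: reassign each observation to the cluster with the closest centroid
    dists = sapply(1:K, function(k) colSums((t(x) - centroids[, k])^2))
    new_clusters = apply(dists, 1, which.min)
    if (all(new_clusters == clusters)) break     # Step 4: stop when assignments no longer change
    clusters = new_clusters
  }
  clusters
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;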

&lt;h1&gt;
  
  
  Hierarchical Clustering
&lt;/h1&gt;

&lt;p&gt;Hierarchical clustering does not require choosing the number of clusters in advance. Additionally, it results in a tree-based representation of the observations, known as a dendrogram.&lt;/p&gt;

&lt;p&gt;The most common type of hierarchical clustering is bottom-up (agglomerative) clustering. The bottom-up phrase refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.&lt;/p&gt;

&lt;p&gt;The following graphic is an example of hierarchical clustering in action:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463974876%2FtJUsimMsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463974876%2FtJUsimMsq.png" alt="hierarchical-clustering.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpreting a Dendrogram
&lt;/h2&gt;

&lt;p&gt;At the bottom of a dendrogram, each leaf represents one of the observations in the data. As we move up the tree, some leaves begin to fuse into branches; the observations that fuse are quite similar to each other. As we move higher up the tree, branches fuse with leaves or with other branches. The earlier that fusions occur, the more similar the groups of observations are to each other.&lt;/p&gt;

&lt;p&gt;When interpreting a dendrogram, be careful not to draw any conclusions about the similarity of observations based on their proximity along the horizontal axis. In other words, if two observations are next to each other on a dendrogram, it doesn’t mean that they are similar.&lt;/p&gt;

&lt;p&gt;Instead, we draw conclusions about the similarity of two observations based on their location on the vertical axis where branches containing both of those observations are first fused.&lt;/p&gt;

&lt;p&gt;In hierarchical clustering, the final clusters are identified by making a horizontal cut across the dendrogram. The distinct sets of observations below the cut are the clusters. The attractive aspect of hierarchical clustering is that a single dendrogram can be used to obtain multiple different clustering options. However, the choice of where to make the cut is often not clear. In practice, people usually look at the dendrogram and select a sensible number of clusters, based on the heights of the fusions and the number of clusters desired.&lt;/p&gt;
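
&lt;p&gt;As a small sketch (assuming hc is an object returned by hclust — a hypothetical name; the R code at the end of this post uses names like hclust_complete), the cutree function performs this cut, either at a specified number of clusters or at a specified height:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;cutree(hc, k=3)   # cut the dendrogram so that 3 clusters result
cutree(hc, h=5)   # alternatively, cut at height 5 on the vertical axis
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;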

&lt;p&gt;The term “hierarchical” refers to the fact that clusters obtained by cutting a dendrogram at some height are nested within clusters that are obtained by cutting the dendrogram at a greater height. Therefore, the assumption of a hierarchical structure might be unrealistic for some datasets, in which case K-means clustering would be better. An example of this would be if we had a dataset consisting of males and females from America, Japan, and France. The best division into two groups would be by gender. The best division into three groups would be by nationality. However, we cannot split one of the two gender clusters and end up with three nationality clusters. In other words, the true clusters are not nested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hierarchical Clustering Algorithm
&lt;/h2&gt;

&lt;p&gt;The hierarchical clustering dendrogram is obtained through the following process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; First, we define some dissimilarity measure for each pair of observations in the dataset and initially treat each observation as its own cluster.&lt;/li&gt;
&lt;li&gt; The two clusters most similar to each other are fused.&lt;/li&gt;
&lt;li&gt; The algorithm proceeds iteratively until all observations belong to one cluster and the dendrogram is complete. Along the way, the concept of dissimilarity is extended so that it can be defined between two clusters when one or both contain multiple observations. This notion is known as linkage, of which there are four types.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;See the hierarchical clustering graphic above for an example of this process.&lt;/p&gt;

&lt;p&gt;Complete linkage involves computing all pairwise dissimilarities between the observations in cluster A and cluster B, and recording the largest of the dissimilarities. Complete linkage is commonly used in practice.&lt;/p&gt;

&lt;p&gt;Single linkage is the opposite of complete linkage, where we record the smallest dissimilarity instead of the largest. Single linkage can result in extended, trailing clusters where single observations are fused one-at-a-time.&lt;/p&gt;

&lt;p&gt;Average linkage involves computing all pairwise dissimilarities between the observations in cluster A and cluster B, and recording the average of the dissimilarities. Average linkage is commonly used in practice.&lt;/p&gt;

&lt;p&gt;Centroid linkage involves determining the dissimilarity between the centroid for cluster A and the centroid for cluster B. Centroid linkage is often used in genomics. However, it suffers from the drawback of potential inversion, where two clusters are fused at a height below either of the individual clusters in the dendrogram.&lt;/p&gt;
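
&lt;p&gt;As a brief sketch (assuming d is a dissimilarity object produced by dist — a hypothetical name), the linkage is controlled by the method argument of hclust:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;hc_complete = hclust(d, method="complete")   # largest pairwise dissimilarity between clusters
hc_single   = hclust(d, method="single")     # smallest pairwise dissimilarity
hc_average  = hclust(d, method="average")    # average pairwise dissimilarity
hc_centroid = hclust(d, method="centroid")   # dissimilarity between cluster centroids
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;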

&lt;h2&gt;
  
  
  Choice of Initial Dissimilarity Measure
&lt;/h2&gt;

&lt;p&gt;As mentioned previously, the first step of the hierarchical clustering algorithm involves defining some initial pairwise dissimilarity measure for each pair of individual observations in the dataset. There are two options for this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Euclidean distance&lt;/li&gt;
&lt;li&gt; Correlation-based distance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Euclidean distance is simply the straight-line distance between two observations.&lt;/p&gt;

&lt;p&gt;Correlation-based distance considers two observations to be similar if their features are highly correlated, even if the observations are far apart in terms of Euclidean distance. In other words, correlation-based distance focuses on the shapes of observation profiles instead of their magnitude.&lt;/p&gt;

&lt;p&gt;The choice of the pairwise dissimilarity measure has a strong effect on the final dendrogram obtained. Therefore, the type of data being clustered and the problem being solved should determine the measure to use. For example, assume that we have data on shoppers and the number of times each shopper has bought each item. The goal is to cluster shoppers together to ultimately show different advertisements to different clusters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If we use Euclidean distance, shoppers who have bought very few items overall will be clustered together.&lt;/li&gt;
&lt;li&gt;  On the other hand, correlation-based distance would group together shoppers with similar preferences, regardless of shopping volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correlation-based distance would be the preferred choice for the problem being solved here.&lt;/p&gt;
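
&lt;p&gt;As a small sketch (shoppers is a hypothetical numeric matrix with one row per shopper and one column per item), the two dissimilarity measures can be constructed as follows, and either one can be passed to hclust:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;d_euclidean   = dist(shoppers)                   # straight-line distance between shoppers
d_correlation = as.dist(1 - cor(t(shoppers)))    # correlation-based distance between shoppers
hclust(d_correlation, method="complete")         # clusters on preference "shape" rather than volume
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;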

&lt;p&gt;Regardless of which dissimilarity measure is used, it is usually good practice to scale all variables to have a standard deviation of one. This ensures that each variable is given equal importance in the hierarchical clustering that is performed. However, there are cases where this is not desired.&lt;/p&gt;

&lt;h1&gt;
  
  
  Practical Issues in Clustering
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Decisions with Big Consequences
&lt;/h2&gt;

&lt;p&gt;In clustering, decisions have to be made that ultimately have big consequences on the final result obtained.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Should features be standardized to have a mean of zero and standard deviation of one?&lt;/li&gt;
&lt;li&gt;  In hierarchical clustering, what dissimilarity measure should be used? What type of linkage should be used? Where should the dendrogram be cut to obtain the clusters?&lt;/li&gt;
&lt;li&gt;  In K-means clustering, how many clusters should we look for in the data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, we try several different choices and look for the one that gives the most interpretable or useful solution. There simply is no “right” answer. Any solution that leads to something interesting should be considered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Validating Clusters
&lt;/h2&gt;

&lt;p&gt;Any time that clustering is performed on a dataset, we will obtain clusters. But what we really want to know is whether or not the obtained clusters represent true subgroups in our data. If we obtained a new and independent dataset, would we obtain the same clusters? This is a tough question to answer, and there isn’t a consensus on the best approach to answering it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Considerations in Clustering
&lt;/h2&gt;

&lt;p&gt;K-means and hierarchical clustering will assign every observation to some cluster, which may cause some clusters to be distorted by outliers that do not truly belong to any cluster.&lt;/p&gt;

&lt;p&gt;Additionally, clustering methods are not robust to data changes. For example, if we performed clustering on a full dataset, and then performed clustering on a subset of the data, we likely would not obtain similar clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpreting Clusters
&lt;/h2&gt;

&lt;p&gt;In practice, we should not report the results of clustering as absolute truth about a dataset. We should treat the results as a starting point for a hypothesis and further study, preferably on independent datasets.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/unsupervised-learning/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 10, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 10 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Principal Components Analysis (PCA)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gridExtra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For side-by-side ggplots&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e1071&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will use the USArrests dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Analyze the mean and variance of each variable&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The prcomp function is used to perform PCA&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We specify scale=TRUE to standardize the variables to have mean 0 and standard deviation 1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prcomp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The rotation component of the PCA object indicates the principal component loading vectors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Easily plot the first two principal components and the vectors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Setting scale=0 ensures that arrows are scaled to represent loadings&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;biplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The variance explained by each principal component is obtained by squaring the standard deviation component&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_var&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;USArrests_pca&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sdev&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The proportion of variance explained (PVE) of each component&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_pve&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_var&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_var&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plotting the PVE of each component&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_pve&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_pve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USArrests_pca_pve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Component&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PVE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we perform PCA on the NCI60 data, which contains gene expression data from 64 cancer lines&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;nci_pca&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prcomp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;NCI60_scaled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Alternatively, we could have scaled the data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create a data frame of the principal component scores and the cancer type labels&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nci_pca&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;NCI60&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the first two principal components in ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Similar cancer types have similar principal component scores&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# This indicates that similar cancer types have similar gene expression levels&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the first and third principal components in ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_x&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the PVE of each principal component&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;nci_pca_pve&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;nci_pca&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sdev&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nci_pca&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sdev&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;nci_pca_pve_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;PVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nci_pca_pve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nci_pca_pve_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Component&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PVE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Dropoff in PVE after the 7th component&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  K-Means Clustering
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# First, we will perform K-Means Clustering on a simulated dataset that truly has two clusters&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create the simulated data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="m"&gt;+3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="m"&gt;-4&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the simulated data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform K-Means Clustering by specifying 2 clusters&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The nstart argument runs K-Means with multiple initial cluster assignments&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Running with multiple initial cluster assignments is desirable to minimize within-cluster variance&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;kmeans_example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nstart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The tot.withinss component contains the within-cluster variance&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;kmeans_example&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;tot.withinss&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the data and the assigned clusters through the kmeans function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kmeans_example&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="m"&gt;+1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we perform K-Means clustering on the NCI60 data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;NCI60_kmeans&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nstart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# View the cluster assignments&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;NCI60_kmeans_clusters&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;NCI60_kmeans&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60_kmeans_clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;NCI60&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hierarchical Clustering
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We will continue using the simulated dataset to perform hierarchical clustering&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We perform hierarchical clustering with complete, average, and single linkage&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# For the dissimilarity measure, we will use Euclidean distance through the dist function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;hclust_complete&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hclust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;hclust_average&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hclust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;hclust_single&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hclust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"single"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the dendograms&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_complete&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Complete Linkage"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Average Linkage"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_single&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Single Linkage"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the cluster labels for different numbers of clusters using the cutree function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cutree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_complete&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cutree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_average&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;cutree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_single&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Scale the variables before performing hierarchical clustering&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x_scaled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;hclust_complete_scale&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hclust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_scaled&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_complete_scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use correlation-based distance for dissimilarity instead of Euclidean distance&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Since correlation-based distance can only be used when there are at least 3 features, we simulate new data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;y_corr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;as.dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;hclust_corr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hclust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hclust_corr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

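&lt;/span&gt;&lt;span class="c1"&gt;# Perform K-means clustering with K = 2 and 20 random initial assignments&lt;/span&gt;&lt;span class="w"&gt;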
&lt;/span&gt;&lt;span class="n"&gt;kmeans_example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nstart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The tot.withinss component contains the within-cluster variance&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;kmeans_example&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;tot.withinss&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the data and the assigned clusters through the kmeans function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kmeans_example&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="m"&gt;+1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we perform hierarchical clustering on the NCI60 data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# First, we standardize the variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;NCI60_scaled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform hierarchical clustering with complete linkage&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;NCI60_hclust&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hclust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60_scaled&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60_hclust&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Cut the tree to yield 4 clusters&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;NCI60_clusters&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cutree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60_hclust&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# View the cluster assignments&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NCI60_clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;NCI60&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 9 — Support Vector Machines</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:44:40 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-9-support-vector-machines-3ho9</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-9-support-vector-machines-3ho9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 9 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Support vector machines&lt;/em&gt; (SVMs) are often considered one of the best “out of the box” classifiers, though this is not to say that another classifier such as logistic regression couldn’t outperform an SVM.&lt;/p&gt;

&lt;p&gt;The SVM is a generalization of a simple classifier known as the &lt;em&gt;maximal margin classifier&lt;/em&gt;. The maximal margin classifier is simple and intuitive, but cannot be applied to most datasets because it requires classes to be perfectly separable by a boundary. Another classifier known as the &lt;em&gt;support vector classifier&lt;/em&gt; is an extension of the maximal margin classifier, which can be applied in a broader range of cases. The support vector machine is a further extension of the support vector classifier, which can accommodate non-linear class boundaries.&lt;/p&gt;

&lt;p&gt;SVMs are intended for the binary classification setting, in which there are only two classes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Maximal Margin Classifier
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What is a Hyperplane?
&lt;/h2&gt;

&lt;p&gt;In a &lt;em&gt;p&lt;/em&gt;-dimensional space, a hyperplane is a flat subspace of dimension &lt;em&gt;p&lt;/em&gt; - 1. For example, in a two-dimensional setting, a hyperplane is a flat one-dimensional subspace, which is also simply known as a line. A hyperplane in a &lt;em&gt;p&lt;/em&gt;-dimensional setting is defined by the following equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463155170%2FKMeAUFzj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463155170%2FKMeAUFzj9.png" alt="mmc-hyperplane-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any point &lt;em&gt;X&lt;/em&gt; in &lt;em&gt;p&lt;/em&gt;-dimensional space that satisfies the equation is a point that lies on the hyperplane. If some point &lt;em&gt;X&lt;/em&gt; results in a value greater than or less than 0 for the equation, then the point lies on one of the sides of the hyperplane.&lt;/p&gt;

&lt;p&gt;In other words, a hyperplane essentially divides a &lt;em&gt;p&lt;/em&gt;-dimensional space into two parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classification Using a Hyperplane
&lt;/h2&gt;

&lt;p&gt;Suppose that we have a training dataset with &lt;em&gt;p&lt;/em&gt; predictors and &lt;em&gt;n&lt;/em&gt; observations, where each observation belongs to one of two classes. Suppose that we also have a separate test dataset. Our goal is to develop a classifier based on the training data and use it to classify the test data. How can we do this using the concept of a separating hyperplane?&lt;/p&gt;

&lt;p&gt;Assume that it is possible to create a hyperplane that separates the training data perfectly according to their class labels. We could use this hyperplane as a natural classifier. An observation from the test dataset would be assigned to a class depending on which side of the hyperplane it is located on. The determination is made by plugging the test observation into the hyperplane equation: if the value is greater than 0, it is assigned to the class on that side, and if the value is less than 0, it is assigned to the other class.&lt;/p&gt;
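
&lt;p&gt;As a quick illustration (this is a small sketch, not code from the book’s labs), we can plug a couple of made-up test points into a hypothetical hyperplane equation and classify them by the sign of the result:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;beta0 = 1; beta1 = -2; beta2 = 0.5          # hypothetical hyperplane coefficients
x_test = rbind(c(1.5, 2.0), c(-0.5, 1.0))   # two made-up test observations

# Evaluate beta0 + beta1*x1 + beta2*x2 for each test observation
f = beta0 + beta1 * x_test[, 1] + beta2 * x_test[, 2]

# Assign a class based on the sign of the result
ifelse(f &amp;gt; 0, "class 1", "class 2")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;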

&lt;p&gt;However, when we can perfectly separate the classes, many possibilities exist for the hyperplane. The following chart shows an example where two hyperplanes separate the classes perfectly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463180848%2FklNPnwcH4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463180848%2FklNPnwcH4.png" alt="separating-hyperplane.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the maximal margin classifier helps determine the hyperplane to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maximal Margin Classifier
&lt;/h2&gt;

&lt;p&gt;The maximal margin classifier is a separating hyperplane that is farthest from the training observations.&lt;/p&gt;

&lt;p&gt;The method involves determining the perpendicular distance from each training observation to some hyperplane. The smallest such distance is known as the margin. The maximal margin classifier settles on the hyperplane for which the margin is largest. In other words, the chosen hyperplane is the one that has the farthest minimum distance to the training observations.&lt;/p&gt;
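
&lt;p&gt;To make this concrete, here is a small R sketch (not from the book’s labs) that computes the perpendicular distance from a few made-up points to a hypothetical hyperplane; the margin is the smallest of these distances:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;beta = c(-1, 2, 3)   # hypothetical coefficients (beta0, beta1, beta2)
pts = rbind(c(1.0, 0.5), c(0.2, 0.9), c(1.5, -0.2))

# Perpendicular distance from each point to the hyperplane beta0 + beta1*x1 + beta2*x2 = 0
dists = abs(beta[1] + pts %*% beta[2:3]) / sqrt(sum(beta[2:3]^2))

# The margin is the smallest of these distances; the maximal margin classifier
# chooses the hyperplane that makes this quantity as large as possible
min(dists)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;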

&lt;p&gt;The maximal margin classifier is often successful, but can lead to overfitting when we have a lot of predictors in our dataset.&lt;/p&gt;

&lt;p&gt;The points that end up supporting the maximal margin hyperplane are known as &lt;em&gt;support vectors&lt;/em&gt;. If these points are moved even slightly, the maximal margin hyperplane would move as well. The fact that the maximal margin hyperplane depends only on a small subset of observations is an important property that will also be discussed in the sections on support vector classifiers and support vector machines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463211623%2FnsBlxtvfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463211623%2FnsBlxtvfh.png" alt="support-vectors.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Construction of the Maximal Margin Classifier
&lt;/h2&gt;

&lt;p&gt;The maximal margin hyperplane is the solution to an optimization problem with three components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463233217%2FUDARbVizd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463233217%2FUDARbVizd.png" alt="mmc-construction.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;M&lt;/em&gt; is the margin of the hyperplane. The second component is a constraint that ensures that the perpendicular distance from any observation to the hyperplane is given by the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463285962%2FduoDQW5CS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463285962%2FduoDQW5CS.png" alt="perpendicular-distance-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third component guarantees that each observation will be on the correct side of the hyperplane, with some cushion &lt;em&gt;M&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-separable Case
&lt;/h2&gt;

&lt;p&gt;The maximal margin classifier is a natural way to perform classification, but only if a separating hyperplane exists. However, that is usually not the case in real-world datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463300003%2FWn0xQ6nrC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463300003%2FWn0xQ6nrC.png" alt="non-separable-data.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The concept of the separating hyperplane can be extended to develop a hyperplane that &lt;em&gt;almost&lt;/em&gt; separates the classes. This is done by using a &lt;em&gt;soft margin&lt;/em&gt;. The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.&lt;/p&gt;

&lt;h1&gt;
  
  
  Support Vector Classifier
&lt;/h1&gt;

&lt;p&gt;In most cases, we usually don’t have a perfectly separating hyperplane for our datasets. However, even if we did, there are cases where it wouldn’t be desirable. This is due to sensitivity issues from individual observations. For example, the addition of a single observation could result in a dramatic change in the maximal margin hyperplane.&lt;/p&gt;

&lt;p&gt;Therefore, it is usually a good idea to consider a hyperplane that does &lt;em&gt;not&lt;/em&gt; perfectly separate the classes. This provides two advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Greater robustness to individual observations&lt;/li&gt;
&lt;li&gt;  Better classification of most of the training observations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, it is usually worthwhile to misclassify a few training observations in order to do a better job of classifying the other observations. This is what the support vector classifier does. It allows observations to be on the wrong side of the margin, and even the wrong side of the hyperplane.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463319534%2FfQYoPL_lO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463319534%2FfQYoPL_lO.png" alt="svc.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Details of the Support Vector Classifier
&lt;/h2&gt;

&lt;p&gt;The support vector classifier classifies a test observation depending on which side of the hyperplane it lies on. The hyperplane is the solution to an optimization problem that is similar to the one for the maximal margin classifier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463336630%2FzxRkXcQuM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463336630%2FzxRkXcQuM.png" alt="svc-construction.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;M&lt;/em&gt; is the margin of the hyperplane. The second component is a constraint that ensures that the perpendicular distance from any observation to the hyperplane is given by the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463363562%2FCBUHWtSJ7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463363562%2FCBUHWtSJ7.png" alt="perpendicular-distance-equation-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ϵ&lt;/em&gt; is a &lt;em&gt;slack variable&lt;/em&gt; that allows observations to be on the wrong side of the margin or hyperplane. It tells us where the &lt;em&gt;i&lt;/em&gt;-&lt;em&gt;th&lt;/em&gt; observation is located, relative to the hyperplane and margin.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If &lt;em&gt;ϵ&lt;/em&gt; = 0, the observation is on the correct side of the margin&lt;/li&gt;
&lt;li&gt;  If &lt;em&gt;ϵ&lt;/em&gt; &amp;gt; 0, the observation is on the wrong side of the margin&lt;/li&gt;
&lt;li&gt;  If &lt;em&gt;ϵ&lt;/em&gt; &amp;gt; 1, the observation is on the wrong side of the hyperplane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;C&lt;/em&gt; is a nonnegative tuning parameter that bounds the sum of the &lt;em&gt;ϵ&lt;/em&gt; values. It determines the number and severity of violations to the margin and hyperplane that will be tolerated.&lt;/p&gt;

&lt;p&gt;In other words, &lt;em&gt;C&lt;/em&gt; is a budget for the amount that the margin can be violated by the &lt;em&gt;n&lt;/em&gt; observations. If &lt;em&gt;C&lt;/em&gt; = 0, then there is no budget, and the result would simply be the same as the maximal margin classifier (if a perfectly separating hyperplane exists). If &lt;em&gt;C&lt;/em&gt; &amp;gt; 0, no more than &lt;em&gt;C&lt;/em&gt; observations can be on the wrong side of the hyperplane because &lt;em&gt;ϵ&lt;/em&gt; &amp;gt; 1 in those cases, and the constraint from the fourth component doesn’t allow for it. As the budget increases, more violations to the margin are tolerated, and so the margin becomes wider.&lt;/p&gt;

&lt;p&gt;It should come as no surprise that &lt;em&gt;C&lt;/em&gt; is usually chosen through cross-validation. &lt;em&gt;C&lt;/em&gt; controls the bias-variance tradeoff. When &lt;em&gt;C&lt;/em&gt; is small, the classifier is highly fit to the data, resulting in high variance. When &lt;em&gt;C&lt;/em&gt; is large, the classifier may be too general and oversimplified for the data, resulting in high bias.&lt;/p&gt;

&lt;p&gt;In support vector classifiers, the support vectors for the hyperplane are a bit different than the ones from the maximal margin hyperplane. They are the observations that lie directly on the margin and the wrong side of the margin. The larger the value of &lt;em&gt;C&lt;/em&gt;, the more support vectors there will be.&lt;/p&gt;
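
&lt;p&gt;The e1071 package used in the R code at the end of this article exposes this tradeoff through its cost argument. Note that cost is a penalty on margin violations, so it behaves roughly as the inverse of the budget &lt;em&gt;C&lt;/em&gt; described above: a small cost tolerates more violations (wider margin, more support vectors), while a large cost tolerates fewer (narrower margin, fewer support vectors). A rough sketch on simulated data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(e1071)

set.seed(1)
x = matrix(rnorm(20*2), ncol=2)
y = c(rep(-1, 10), rep(1, 10))
x[y==1,] = x[y==1,] + 1
dat = data.frame(x=x, y=as.factor(y))

# A small cost (wide margin) should produce more support vectors
fit_wide = svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
length(fit_wide$index)

# A large cost (narrow margin) should produce fewer support vectors
fit_narrow = svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)
length(fit_narrow$index)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;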

&lt;p&gt;The fact that the support vector classifier is based only on a small subset of the training data means that it is robust to the behavior of observations far from the hyperplane. This is different from other classification methods such as linear discriminant analysis, where the mean of all observations within a class helps determine the boundary. However, support vector classifiers are similar to logistic regression in this respect, because logistic regression is also not very sensitive to observations far from the decision boundary.&lt;/p&gt;

&lt;h1&gt;
  
  
  Support Vector Machines
&lt;/h1&gt;

&lt;p&gt;First, we will discuss how a linear classifier can be converted into a non-linear classifier. Then, we’ll talk about support vector machines, which do this in an automatic way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classification with Non-linear Decision Boundaries
&lt;/h2&gt;

&lt;p&gt;The support vector classifier is a natural approach for classification in the binary class setting, if the boundary between the classes is linear. However, there are many cases in practice where we need a non-linear boundary.&lt;/p&gt;

&lt;p&gt;In chapter 7, we were able to extend linear regression to address non-linear relationships by enlarging the feature space by using higher-order polynomial functions, such as quadratic and cubic terms. Similarly, non-linear boundaries can be created through the use of higher-order polynomial functions. For example, we could fit a support vector classifier using each predictor and its squared term:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463389905%2Ff0k24KQJg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463389905%2Ff0k24KQJg.png" alt="squared-terms.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This would change the optimization problem to become the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463422323%2FxIQM0c17K.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463422323%2FxIQM0c17K.png" alt="svm-optimization-problem.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the problem with enlarging the feature space is that there are many ways to do so. We could use cubic or even higher-order polynomial functions. We could add interaction terms. Many possibilities exist, which could lead to inefficiency in computation. Support vector machines allow for enlarging the feature space in a way that leads to efficient computations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support Vector Machines
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;support vector machine&lt;/em&gt; is an extension of the support vector classifier that enlarges the feature space by using &lt;em&gt;kernels&lt;/em&gt;. Before we talk about kernels, let’s discuss the solution to the support vector classifier optimization problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution to Support Vector Classifier Optimization Problem
&lt;/h2&gt;

&lt;p&gt;The details of how the support vector classifier is computed are highly technical. However, it turns out that the solution involves only the inner products of the observations, rather than the observations themselves. The inner product of two vectors is defined as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463435645%2Fr85_ZsayG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463435645%2Fr85_ZsayG.png" alt="inner-product.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463447645%2FrdM70NOTv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463447645%2FrdM70NOTv.png" alt="inner-product-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The linear support vector classifier can be represented as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463464067%2FXe0C02SpZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463464067%2FXe0C02SpZ.png" alt="linear-svc-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are &lt;em&gt;n&lt;/em&gt; parameters &lt;em&gt;α&lt;/em&gt;ᵢ, one per training observation. The parameters are estimated using the inner products between all pairs of training observations.&lt;/p&gt;

&lt;p&gt;To evaluate the support vector classifier function &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;x&lt;/em&gt;), we compute the inner product between a new observation &lt;em&gt;x&lt;/em&gt; and each training observation &lt;em&gt;x&lt;/em&gt;ᵢ. However, the &lt;em&gt;α&lt;/em&gt;ᵢ parameters are nonzero only for the support vectors. In other words, if an observation is not a support vector, then its &lt;em&gt;α&lt;/em&gt;ᵢ is zero. If we represent &lt;em&gt;S&lt;/em&gt; as the collection of the support vectors, then the solution function can be rewritten as the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463483594%2FM4p0AKpuX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463483594%2FM4p0AKpuX.png" alt="linear-svc-equation-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernels
&lt;/h2&gt;

&lt;p&gt;A kernel is a function that quantifies the similarity of two observations, and is a generalization of the inner product. The kernel function used in the support vector classifier is simply&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463492900%2FweKzm056m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463492900%2FweKzm056m.png" alt="kernel-function.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is known as a linear kernel because the support vector classifier results in a linear boundary. The linear kernel determines the similarity of observations by using the Pearson correlation.&lt;/p&gt;

&lt;p&gt;Instead of using the linear kernel, we could use a polynomial kernel:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463503242%2F8jjKsZIcK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463503242%2F8jjKsZIcK.png" alt="polynomial-kernel.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using a nonlinear kernel results in a non-linear decision boundary. When the support vector classifier is combined with a nonlinear kernel, it results in the support vector machine. The support vector machine takes on the form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463519395%2F7eXKIj_26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463519395%2F7eXKIj_26.png" alt="svm-function.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463533442%2Frgt2v7Poz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463533442%2Frgt2v7Poz.png" alt="svm.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The polynomial kernel is just one example of a nonlinear kernel. Another common choice is the radial kernel, which takes the following form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463562456%2F1nTXRbfHf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463562456%2F1nTXRbfHf.png" alt="radial-kernel-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463573634%2F38dRg7b88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463573634%2F38dRg7b88.png" alt="svm-radial.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The advantage of using kernels is that the computations can be performed without explicitly working in the enlarged feature space. For example, with the polynomial kernel, we simply add one to the inner product of each pair of observations and &lt;em&gt;then&lt;/em&gt; raise the result to the power &lt;em&gt;d&lt;/em&gt;, rather than explicitly constructing all of the polynomial terms. This is known as the kernel trick.&lt;/p&gt;

&lt;h1&gt;
  
  
  SVMs with More than Two Classes
&lt;/h1&gt;

&lt;p&gt;The concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes. However, there are a few approaches for extending SVMs beyond the binary class setting. The most common approaches are &lt;em&gt;one-versus-one&lt;/em&gt; and &lt;em&gt;one-versus-all&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  One-Versus-One Classification
&lt;/h2&gt;

&lt;p&gt;A &lt;em&gt;one-versus-one&lt;/em&gt; or &lt;em&gt;all-pairs&lt;/em&gt; approach fits a separate SVM for each pair of classes. Each test observation is classified by every one of these SVMs, and we count the number of times that it is assigned to each class. The class to which it is assigned most often is the final predicted class.&lt;/p&gt;
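
&lt;p&gt;Here is a rough sketch of the voting scheme on the built-in iris data. Note that e1071’s svm function already handles multiclass problems with a one-versus-one approach internally; the loop below just makes the voting explicit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(e1071)

set.seed(1)
classes = levels(iris$Species)
class_pairs = combn(classes, 2, simplify = FALSE)   # all pairs of classes

votes = matrix(0, nrow = nrow(iris), ncol = length(classes),
               dimnames = list(NULL, classes))

for (pair in class_pairs) {
  # Fit a binary SVM on the two classes in this pair
  pair_data = iris[iris$Species %in% pair, ]
  pair_data$Species = factor(pair_data$Species)
  fit = svm(Species ~ ., data = pair_data, kernel = "linear", cost = 1)

  # Every observation gets a vote for whichever of the two classes it is assigned to
  pred = predict(fit, newdata = iris)
  for (cl in pair) {
    votes[, cl] = votes[, cl] + (pred == cl)
  }
}

# Assign each observation to the class with the most votes
ovo_pred = classes[max.col(votes)]
table(ovo_pred, iris$Species)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;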

&lt;h2&gt;
  
  
  One-Versus-All Classification
&lt;/h2&gt;

&lt;p&gt;The one-versus-all approach develops multiple SVMs, each of which compares one class to all of the other classes. Assume that each SVM resulted in the following parameters from comparing some class k to all of the others:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463599463%2F532yBfT3K.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463599463%2F532yBfT3K.png" alt="one-versus-all-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let x represent a test observation. The test observation is assigned to the class for which the following is the largest:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463611291%2Fk_E3Qla3N.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605463611291%2Fk_E3Qla3N.png" alt="one-versus-all-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SVMs vs Logistic Regression
&lt;/h1&gt;

&lt;p&gt;As previously mentioned, only the support vectors end up playing a role in the support vector classifier that is obtained. This is because the loss function is exactly zero for observations that are on the correct side of the margin.&lt;/p&gt;

&lt;p&gt;The loss function for logistic regression is not exactly zero anywhere. However, it is very small for observations that are far from the decision boundary.&lt;/p&gt;
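
&lt;p&gt;A quick numerical comparison of the two loss functions makes this concrete (a sketch, where yf stands for the product of the class label and the fitted value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;yf = seq(-2, 3, by = 0.5)   # product of the class label (+1 or -1) and the fitted value

hinge_loss    = pmax(0, 1 - yf)     # exactly zero once yf reaches 1 (correct side of the margin)
logistic_loss = log(1 + exp(-yf))   # never exactly zero, but very small far from the boundary

round(cbind(yf, hinge_loss, logistic_loss), 3)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;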

&lt;p&gt;Due to the similarities between their loss functions, support vector classifiers and logistic regression often give similar results. However, when the classes are well separated, support vector machines tend to perform better. In cases where there is more overlap, logistic regression tends to perform better. In any case, both should always be tested, and the method that performs best should be chosen.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/support-vector-machines/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 9, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 9 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Support Vector Classifiers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e1071&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will generate a random dataset of observations belonging to 2 classes&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;

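&lt;/span&gt;&lt;span class="c1"&gt;# Plot the observations, colored by class&lt;/span&gt;&lt;span class="w"&gt;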
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# To use SVM, the response must be encoded as a factor variable&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;as.factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a Support Vector Classifier with a cost of 10&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The scale argument is used to scale predictors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# In this example, we will not scale them&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;svmfit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"linear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the fit&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svmfit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine which observations are the support vectors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;svmfit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit an SVM with a smaller cost of 0.1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;svmfit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"linear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The e1071 library contains a tune function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The function performs cross-validation with different cost values&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;tune.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"linear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ranges&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Check the summary to see the error rates of the different models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The model with a cost of 0.1 has the lowest error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tune.out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Choose the best model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;bestmod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tune.out&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;best.model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bestmod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The predict function can be used to predict classes on a set of test observations&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;xtest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ytest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;xtest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ytest&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;xtest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ytest&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;testdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;xtest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;as.factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ytest&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ypred&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bestmod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;testdata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ypred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;truth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;testdata&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Support Vector Machines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now, we will fit a Support Vector Machine model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We can do this by simply using a non-linear kernel in the svm function&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Generate a dataset with a non-linear class boundary&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="m"&gt;+2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;101&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;101&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="m"&gt;-2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;as.factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Split the data into training and test sets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit an SVM with a radial kernel&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;svmfit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"radial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svmfit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,])&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform cross-validation using the tune function to test different choices for cost&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;tune.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"radial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="n"&gt;ranges&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tune.out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Cost of 1 and Gamma of 2 has the lowest error&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Test the model on the test dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tune.out&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;best.model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]))&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 8 — Tree-Based Methods</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:42:45 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-8-tree-based-methods-5556</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-8-tree-based-methods-5556</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 8 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tree-based methods for regression and classification involve segmenting the predictor space into a number of simple regions. To make a prediction for an observation, we simply use the mean or mode of the training observations in the region that it belongs to. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision tree methods.&lt;/p&gt;

&lt;p&gt;These methods are simple and useful for interpretation, but not competitive in terms of prediction accuracy when compared with the methods from the chapter on linear model selection and regularization. However, advanced methods such as bagging, random forests, and boosting can result in dramatic improvements at the expense of a loss in interpretation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Basics of Decision Trees
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Regression Trees
&lt;/h2&gt;

&lt;p&gt;Assume that we had a baseball salary dataset that consisted of the number of years that a player was in the league, the number of hits in the previous year, and each player’s log-transformed salary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462770846%2FfFSdpIvx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462770846%2FfFSdpIvx1.png" alt="baseball-statistics-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We might get a decision tree that looks like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462806596%2Fx_lZN6f2I.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462806596%2Fx_lZN6f2I.png" alt="basic-decision-tree.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interpretation of the tree might look as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Years is the most important factor in determining salary — players with less experience earn less money&lt;/li&gt;
&lt;li&gt;  If a player is experienced (more than 5 years in the league), then the number of hits plays a role in the salary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tree is probably an oversimplification of the relationship, but is easy to interpret and has a nice graphical representation.&lt;/p&gt;

&lt;p&gt;In general, the process of building a regression tree is a two-step procedure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Divide the predictor space (the set of possible values for all &lt;em&gt;X&lt;/em&gt;ⱼ) into &lt;em&gt;j&lt;/em&gt; distinct and non-overlapping regions (&lt;em&gt;R&lt;/em&gt;₁, &lt;em&gt;R&lt;/em&gt;₂, &lt;em&gt;R&lt;/em&gt;₃ … &lt;em&gt;R&lt;/em&gt;ⱼ)&lt;/li&gt;
&lt;li&gt; For every observation that falls into some region, we make the same prediction, which is the mean of the response values in the region&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The regions are constructed by dividing the predictor space into “box” shapes for simplicity and ease of interpretation. The goal is to find boxes that minimize the RSS given by the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462832153%2FayOFpA8RZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462832153%2FayOFpA8RZ.png" alt="rss-minimization.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, it is computationally infeasible to consider every partition of the feature space. Therefore, we take a greedy top-down approach known as recursive binary splitting. First, we consider all of the predictors and all of the possible cutpoints (&lt;em&gt;s&lt;/em&gt;) for each of the predictors and choose the predictor and cutpoint that result in the lowest RSS. Next, the process is repeated to further split the data, but this time we split one of the two previously identified regions, which results in 3 total regions. The process continues until some stopping criterion is reached.&lt;/p&gt;
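
&lt;p&gt;&lt;em&gt;As a rough sketch of recursive binary splitting (not from the textbook labs), the snippet below scans every candidate cutpoint of a single simulated predictor and keeps the one that minimizes the combined RSS of the two resulting regions. A real regression tree simply repeats this search over all predictors and all current regions.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Toy sketch: find the best single split of one simulated numeric predictor
set.seed(1)
x = runif(50, 0, 10)
y = ifelse(x &amp;gt; 5, 10, 2) + rnorm(50)

best_split = function(x, y) {
  xs = sort(unique(x))
  cutpoints = (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between observed values
  rss = sapply(cutpoints, function(s) {
    left  = y[x &amp;lt; s]
    right = y[x &amp;gt;= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cutpoints[which.min(rss)]                       # cutpoint with the lowest total RSS
}

best_split(x, y)   # should recover a cutpoint near 5
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;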

&lt;p&gt;However, the process may end up overfitting the data and result in a tree that is too complex. A smaller tree would result in lower variance and easier interpretation at the cost of a little more bias. Therefore, we should first grow a large complex tree, and then prune it back to obtain a subtree.&lt;/p&gt;

&lt;p&gt;This is done through a method known as cost complexity pruning, or weakest link pruning. The method allows us to consider a small set of subtrees instead of every possible subtree. This is done through the addition of a tuning parameter, &lt;em&gt;α&lt;/em&gt;. For each value of &lt;em&gt;α&lt;/em&gt;, there is a subtree &lt;em&gt;T&lt;/em&gt; such that the following is as small as possible:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462856097%2FlilkkxxVu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462856097%2FlilkkxxVu.png" alt="cost-complexity-pruning-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Where |&lt;em&gt;T&lt;/em&gt;| is the number of terminal nodes&lt;/li&gt;
&lt;li&gt;  Where &lt;em&gt;R&lt;/em&gt; is the rectangle corresponding to the &lt;em&gt;m&lt;/em&gt;-&lt;em&gt;th&lt;/em&gt; terminal node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;α&lt;/em&gt; tuning parameter controls a tradeoff between tree complexity and its fit to the data. As &lt;em&gt;α&lt;/em&gt; increases, there is a price to pay for having a tree with many terminal nodes, so the equation will typically be minimized for a smaller subtree. This is similar to the lasso regression method, which controlled the complexity of the linear model. As &lt;em&gt;α&lt;/em&gt; is increased from zero, the branches get pruned in a nested and predictable fashion, so getting the subtree as a function of &lt;em&gt;α&lt;/em&gt; is simple. The final value of &lt;em&gt;α&lt;/em&gt; is selected using cross-validation.&lt;/p&gt;
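
&lt;p&gt;&lt;em&gt;The toy sketch below (with made-up RSS values, purely for illustration) shows the tradeoff numerically: as &lt;em&gt;α&lt;/em&gt; grows, the penalized criterion RSS + &lt;em&gt;α&lt;/em&gt;|&lt;em&gt;T&lt;/em&gt;| is minimized by smaller and smaller subtrees.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Illustrative (made-up) training RSS values for subtrees of increasing size
subtrees = data.frame(size = c(2, 4, 6, 8, 10),
                      rss  = c(90, 60, 45, 40, 38))

# Larger alpha penalizes terminal nodes more, so a smaller subtree wins
for (alpha in c(0, 2, 5, 10)) {
  criterion = subtrees$rss + alpha * subtrees$size
  cat("alpha =", alpha, " best subtree size =", subtrees$size[which.min(criterion)], "\n")
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;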

&lt;h2&gt;
  
  
  Classification Trees
&lt;/h2&gt;

&lt;p&gt;Classification trees are similar to regression trees, but predict qualitative responses. In regression, the prediction for some observation is given by the mean response of the observations at some terminal node. In classification, the prediction is given by the most commonly occurring class at the terminal node.&lt;/p&gt;

&lt;p&gt;In classification, instead of using the RSS, we look at either the Gini index or the cross-entropy. Both are very similar measures, where small values indicate node purity, meaning that a node contains predominantly observations from a single class. Either of these measures can be used to build a tree and evaluate splits.&lt;/p&gt;
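
&lt;p&gt;&lt;em&gt;As a quick numerical illustration (not from the labs), both measures can be computed directly from a node’s class proportions: they are largest when the classes are evenly mixed and close to zero when the node is nearly pure.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Node impurity measures for a vector of class proportions p
gini    = function(p) sum(p * (1 - p))
entropy = function(p) -sum(p * log(p))

gini(c(0.5, 0.5));  entropy(c(0.5, 0.5))   # evenly mixed node: high impurity
gini(c(0.9, 0.1));  entropy(c(0.9, 0.1))   # nearly pure node: low impurity
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;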

&lt;p&gt;When it comes to pruning a tree, we could use the Gini index, cross-entropy, or the classification error rate. The classification error rate is the fraction of observations in a region that do not belong to the most common class. The classification error rate is typically preferable for pruning when prediction accuracy of the final tree is the goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qualitative Variables
&lt;/h2&gt;

&lt;p&gt;In a dataset, we may have a qualitative variable that takes on more than two values, such as an ethnicity variable.&lt;/p&gt;

&lt;p&gt;In a decision tree, a split on such a variable simply assigns some of its levels to one branch and the remaining levels to the other. In plots, the levels are typically abbreviated with letters or numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros &amp;amp; Cons of Trees
&lt;/h2&gt;

&lt;p&gt;The advantage of decision trees over other methods is their simplicity. Decision trees are very easy to explain to others, and closely mirror human decision-making. They can be displayed graphically and easily interpreted by non-experts.&lt;/p&gt;

&lt;p&gt;The disadvantage is that the ease of interpretation comes at the cost of prediction accuracy. Usually, other regression and classification methods are more accurate.&lt;/p&gt;

&lt;p&gt;However, advanced methods such as bagging, random forests, and boosting can greatly improve predictive performance at the cost of interpretability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Bagging, Random Forests, and Boosting
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Bagging
&lt;/h2&gt;

&lt;p&gt;Bagging is a general purpose method for reducing the variance of a statistical learning method. It is very useful for decision trees because they suffer from high variance.&lt;/p&gt;

&lt;p&gt;A natural way to reduce variance and increase prediction accuracy is to take many training sets from the population, build separate trees using each dataset, and average the predictions. However, it isn’t practical to do this because we usually do not have access to multiple different training datasets. Instead, we can use the bootstrap method to generate B different bootstrapped training datasets, create B trees, and average the predictions. In classification, instead of averaging the predictions, we take a majority vote, where the overall prediction is the most commonly occurring class across the B trees.&lt;/p&gt;
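
&lt;p&gt;&lt;em&gt;The sketch below bags regression trees by hand on a small simulated dataset (illustrative only), using the tree package that also appears in the lab code later in this post: grow one unpruned tree per bootstrap sample and average the B sets of predictions.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(tree)

# Small simulated regression dataset (illustrative only)
set.seed(1)
n  = 200
df = data.frame(x1 = runif(n), x2 = runif(n))
df$y = 3 * df$x1 + 2 * (df$x2 &amp;gt; 0.5) + rnorm(n, sd = 0.3)

# Bagging by hand: B bootstrap samples, one unpruned tree per sample
B = 100
preds = matrix(NA, nrow = n, ncol = B)
for (b in 1:B) {
  boot = sample(n, n, replace = TRUE)     # bootstrap sample of row indices
  fit  = tree(y ~ ., data = df[boot, ])   # grow a tree on the bootstrapped data
  preds[, b] = predict(fit, df)           # predict for every observation
}

bagged = rowMeans(preds)   # average the B predictions for each observation
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;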

&lt;p&gt;In bagging, the trees are grown large and are not pruned. Additionally, there is a straightforward way to estimate the test error of a bagged model without needing to perform cross-validation. Remember that bootstrapped datasets are created through resampling, which allows for observations to be repeated. As it turns out, on average, each bagged tree will make use of two-thirds of the total observations because of the resampling. The remaining one-third of the observations not used to fit a tree are referred to as the out-of-bag (OOB) observations. We predict the response for the &lt;em&gt;i&lt;/em&gt;-&lt;em&gt;th&lt;/em&gt; observation using each tree in which the &lt;em&gt;i&lt;/em&gt;-&lt;em&gt;th&lt;/em&gt; observation was OOB and average the predictions or take a majority vote. This is done for all of the observations in the original dataset, and the overall test error rate is then determined.&lt;/p&gt;
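
&lt;p&gt;&lt;em&gt;The two-thirds figure is easy to check with a quick simulation: count how many distinct observations end up in a bootstrap sample.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Fraction of distinct observations appearing in a bootstrap sample
set.seed(1)
n = 1000
frac_in_bag = replicate(500, length(unique(sample(n, n, replace = TRUE))) / n)
mean(frac_in_bag)   # roughly 0.63, so about one-third of observations are out-of-bag
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;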

&lt;p&gt;Bagging improves prediction accuracy at the expense of interpretability. However, we can obtain an overall summary of the importance of each predictor using RSS or the Gini index. This is done by recording the total amount that the RSS or Gini index decreases due to splits in a given predictor and averaging over all of the trees. The larger the total value, the more significant the predictor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Random Forests
&lt;/h2&gt;

&lt;p&gt;Random forests provide an improvement over bagging by decorrelating the trees. The random forest method is similar to bagging, where a large number of trees are built on bootstrapped datasets. However, when building the trees, each time a split is considered, a random sample of &lt;em&gt;m&lt;/em&gt; predictors is chosen as split candidates from the full set of &lt;em&gt;p&lt;/em&gt; predictors, and the split can only use one of the &lt;em&gt;m&lt;/em&gt; predictors. Additionally, a fresh set of m predictors is taken at each split. Usually, we use &lt;em&gt;m&lt;/em&gt; = sqrt(&lt;em&gt;p&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Why might only considering a subset of the predictors at each split be a good thing? Suppose that we have a dataset with one very strong predictor and some other moderately strong predictors. In the bagged trees, almost every tree will use the very strong predictor as the top split. All of the bagged trees will look very similar and be highly correlated. Averaging many highly correlated trees does not result in a large reduction to variance. Since random forests force each split to consider only a subset of the predictors, many splits will not even consider the strongest predictor. Basically, this decorrelates the trees and reduces variance.&lt;/p&gt;
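
&lt;p&gt;&lt;em&gt;A tiny sketch of the mechanical difference: at each split, only a fresh random subset of m candidate predictors is even eligible to be chosen (the predictor names below are placeholders).&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Each split in a random forest considers only m randomly chosen predictors
p = 13
m = round(sqrt(p))
predictors = paste0("X", 1:p)   # placeholder predictor names

sample(predictors, m)   # candidates considered at one split
sample(predictors, m)   # a fresh candidate set at the next split
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;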

&lt;h2&gt;
  
  
  Boosting
&lt;/h2&gt;

&lt;p&gt;Like bagging, boosting is a general approach that can be applied to many different statistical learning methods. Boosting involves building many trees, but each tree is grown sequentially. This means that each tree is grown using information from a previously grown tree. Additionally, boosting does not involve bootstrap sampling. Each tree is fit on a modified version of the original dataset. Boosting works as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462882276%2FRXdh8JBv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462882276%2FRXdh8JBv1.png" alt="boosting.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The boosting approach learns slowly. If we set the parameter &lt;em&gt;d&lt;/em&gt; to be small, we fit small trees to the residuals and slowly improve the function in areas where it does not perform well. The shrinkage parameter &lt;em&gt;λ&lt;/em&gt; slows the process down even further, allowing more and different shaped trees to attack the residuals.&lt;/p&gt;

&lt;p&gt;In boosting, unlike in bagging, the construction of each tree depends strongly on the trees already grown. Additionally, boosting could potentially overfit the data if the number of trees B is too large. Therefore, we use cross-validation to select B. Typically, &lt;em&gt;λ&lt;/em&gt; is between 0.01 and 0.001, and &lt;em&gt;d&lt;/em&gt;=1 is usually sufficient.&lt;/p&gt;
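
&lt;p&gt;&lt;em&gt;The following is a minimal hand-rolled sketch of boosting for regression, just to make the algorithm above concrete. It uses simulated data and the rpart package (chosen here only because its maxdepth argument makes it easy to fit very small trees to the residuals); the lab code below uses the gbm package instead.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(rpart)

# Small simulated dataset (illustrative only)
set.seed(1)
n  = 200
df = data.frame(x = runif(n, 0, 10))
df$y = sin(df$x) + rnorm(n, sd = 0.2)

B = 500; lambda = 0.01; d = 1   # number of trees, shrinkage, tree depth

f_hat = rep(0, n)   # current boosted prediction, starts at zero
r     = df$y        # residuals start as the raw response

for (b in 1:B) {
  # Fit a small tree (depth d) to the current residuals
  fit   = rpart(r ~ x, data = data.frame(x = df$x, r = r),
                control = rpart.control(maxdepth = d, cp = 0))
  step  = predict(fit, df)
  f_hat = f_hat + lambda * step   # add a shrunken version of the new tree
  r     = r - lambda * step       # update the residuals
}

mean((f_hat - df$y)^2)   # training MSE after B rounds of slow learning
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;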

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/tree-based-methods/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 8, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 8 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Classification Trees
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gridExtra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For side-by-side ggplots&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e1071&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For decision trees&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will work with the Carseats data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# It contains data on sales of carseats for 400 different store locations&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will build a tree that predicts whether or not a location had "high" sales&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We define "high" as sales exceeding 8000 units&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create a vector to indicate whether or not a location had "high" sales&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;High&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Sales&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Yes"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Merge the vector to the entire Carseats data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;High&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The tree function is used to fit a decision tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;High&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the summary function to see the predictors used, # of terminal nodes, and training error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Visualize the tree model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pretty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# pretty=0 shows category names instead of letters&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# To estimate test error, we fit a tree to training data and predict on test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;High&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Carseats_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"class"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use a confusion matrix to determine the accuracy of the tree model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;High&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Yes"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tree Pruning
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Will pruning the tree lead to improved accuracy?&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The cv.tree function performs cross-validation to determine the optimal tree complexity&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;FUN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prune.misclass&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# FUN = prune.misclass performs pruning through the misclassification error rate&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The object contains different terminal node values, their error rate, and cost complexity parameter&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create a dataframe of the values from the cv.tree function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="n"&gt;Alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the number of terminal nodes, and their corresponding errors and alpha parameters&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_alpha&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Alpha&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Show the plots side-by-side with the grid.arrange function from gridExtra package&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;grid.arrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_cv_alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# A tree with 9 terminal nodes results in the lowest error&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# This also corresponds to alpha value of 1.75&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Finally, prune the tree with prune.misclass function and specify 9 terminal nodes&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_pruned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prune.misclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the pruned tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_pruned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_pruned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pretty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the pruned tree to make predictions, and compare the accuracy to the non-pruned tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_pruned_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_pruned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"class"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_tree_pruned_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Carseats_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;High&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Yes"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# 77% accuracy for the pruned tree versus 71.5% for the non-pruned tree&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Pruning results in a better model&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Regression Trees
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We will work with the Boston data, which has data on median house values&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will build a tree that predicts median house values&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# First, create training and test datasets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a tree to the training data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_tree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# See the predictors used, number of terminal nodes, and error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pretty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform cross validation to determine optimal tree complexity&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create a dataframe of the values from cross-validation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the number of terminal nodes, and their corresponding errors and alpha parameters&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_alpha&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Alpha&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;grid.arrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_tree_cv_alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Cross-validation indicates that a tree with 8 terminal nodes is best&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# However, we could choose to use 7, as the error for 7 is essentially nearly the same as 8&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# This will result in a simpler tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_pruned_tree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prune.tree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the final pruned tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_pruned_tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_pruned_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pretty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the pruned tree to make predictions on the test data, and determine the test MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_pruned_tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Boston_tree_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bagging
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;randomForest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# The randomForest package is used for bagging and random forest&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Boston data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The randomForest function is used to perform both bagging and random forest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Bagging is a special case of random forest, where all predictors are used&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_bag&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;randomForest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mtry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the bagged tree to make predictions on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_bag_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_bag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the test MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Boston_bag_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Random Forest
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We continue working with the Boston data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# For random forest, we simply specify a smaller number of predictors in mtry&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Typically, m=(p/3) is used for regression&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Typically, m=sqrt(p) is used for classification&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_rf&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;randomForest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mtry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;13&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the random forest to make predictions on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_rf_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the test MSE and compare to the result from bagging&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Boston_rf_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Lower test MSE than result from bagging&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The importance function can be used to see the importance of each variable&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The first column indicates how much the error increases if the variable is excluded from the model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The second column indicates how much node purity decreases if the variable is excluded from the model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_rf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Quick plot of the data from the importance function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;varImpPlot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_rf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# lstat and rm are the most important variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Boosting
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gbm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For boosting&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Boston data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The gbm function is used to perform boosting&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# For regression problems, the distribution is set to gaussian&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# For classification problems, the distribution is set to bernoulli&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# n.trees is used to specify the number of trees&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# interaction.depth is used to limit the depth of each tree&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_boost&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gbm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gaussian"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n.trees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interaction.depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The summary function shows the relative influence statistics for each variable&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_boost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the marginal effect of variables on the response after integrating out the other variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_boost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"lstat"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_boost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"rm"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the boosted model to predict on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_boost_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_boost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;n.trees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interaction.depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the test MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Boston_boost_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Boosting results in a test MSE that is slightly better than random forest&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 7 — Moving Beyond Linearity</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:41:01 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-7-moving-beyond-linearity-52h4</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-7-moving-beyond-linearity-52h4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 7 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Linear models are advantageous when it comes to their interpretability. However, their capabilities are limited, especially in scenarios where the linear assumption is poor. Ridge, lasso, and principal components regression improve upon the least squares regression model by reducing the variance of the coefficient estimates. However, these models are still linear, and will perform poorly in nonlinear settings. We can move beyond linearity through methods such as polynomial regression, step functions, splines, local regression, and generalized additive models.&lt;/p&gt;

&lt;h1&gt;
  
  
  Polynomial Regression
&lt;/h1&gt;

&lt;p&gt;Polynomial regression &lt;span id="rmm"&gt;i&lt;/span&gt;s the simplest method of extending linear regression. It involves adding extra predictors, which are just the original predictors raised to some exponent. A polynomial regression model may look like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462389246%2Fja3GXmLql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462389246%2Fja3GXmLql.png" alt="polynomial-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A basic polynomial regression model was introduced in the linear regression section, where the relationship between automobile MPG and horsepower was modeled with quadratic regression.&lt;/p&gt;

&lt;p&gt;For large degree (&lt;em&gt;d&lt;/em&gt;) values, polynomial regression can produce extremely non-linear curves. However, it is unusual to use degrees greater than 3 or 4, because the resulting curve becomes overly flexible and can take on strange shapes, especially near the boundaries of the &lt;em&gt;X&lt;/em&gt; predictor.&lt;/p&gt;

&lt;p&gt;In polynomial regression, the individual coefficient values are not very important. Instead, we look at the overall fit of the model to assess the relationship between the predictor and response.&lt;/p&gt;

&lt;p&gt;Additionally, it is fairly easy to plot a 95% confidence interval for a polynomial regression model. Least squares returns variance estimates for each coefficient, as well as covariances between pairs of coefficients. From these, the variance of the fitted value at any given &lt;em&gt;X&lt;/em&gt; can be estimated, and its square root is the pointwise standard error. Plotting the fitted polynomial curve plus and minus twice the pointwise standard error at each &lt;em&gt;X&lt;/em&gt; value produces an approximate 95% confidence interval.&lt;/p&gt;
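
&lt;p&gt;As a minimal sketch of this in R (assuming the Wage data from the ISLR package and an illustrative degree-4 fit), the standard errors returned by the predict function can be turned directly into these bands. A fuller ggplot version appears in the R code section at the end of this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)

# Fit a degree-4 polynomial of age to predict wage
poly.fit = lm(wage ~ poly(age, 4), data=Wage)

# Fitted values and pointwise standard errors over a grid of ages
age.grid = seq(min(Wage$age), max(Wage$age))
poly.preds = predict(poly.fit, newdata=list(age=age.grid), se=TRUE)

# Approximate 95% confidence band: fit plus/minus two pointwise standard errors
upper.band = poly.preds$fit + 2*poly.preds$se.fit
lower.band = poly.preds$fit - 2*poly.preds$se.fit
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;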

&lt;h1&gt;
  
  
  Step Functions
&lt;/h1&gt;

&lt;p&gt;Polynomial regression imposes a global structure on the non-linear function of &lt;em&gt;X&lt;/em&gt;. But what if we don’t want all of our &lt;em&gt;X&lt;/em&gt; values to follow a global structure? Step functions can be used to break the range of &lt;em&gt;X&lt;/em&gt; into bins. Then, a different constant is fit to each bin. The bins are created through simple cutpoints as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462406799%2FOMWhvYamH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462406799%2FOMWhvYamH.png" alt="step-function-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462416120%2FxHoPzxvv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462416120%2FxHoPzxvv3.png" alt="step-function-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462421963%2FYEoxFeB5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462421963%2FYEoxFeB5e.png" alt="step-function-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462428961%2FU5B9E2FRc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462428961%2FU5B9E2FRc.png" alt="step-function-4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462435446%2FrRCI6UYJd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462435446%2FrRCI6UYJd.png" alt="step-function-5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I&lt;/em&gt; is an indicator function that returns a 1 if the condition is true, and 0 if it is false.&lt;/p&gt;

&lt;p&gt;After creating the cutpoints, least squares regression is used to fit a linear model that uses the cutpoints as predictors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462454287%2FUewArcbyZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462454287%2FUewArcbyZ.png" alt="step-function-6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that &lt;em&gt;C&lt;/em&gt;₀(&lt;em&gt;X&lt;/em&gt;) is excluded as a predictor because it is redundant with the intercept. Additionally, for a given value of &lt;em&gt;X&lt;/em&gt;, only one of the &lt;em&gt;C&lt;/em&gt; values can be non-zero.&lt;/p&gt;

&lt;p&gt;An example of a step function with cutpoints at 10 and 15 is shown in the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462465349%2FpXYIQlZWw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462465349%2FpXYIQlZWw.png" alt="step-function.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The step function is also known as a piecewise constant regression model.&lt;/p&gt;
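
&lt;p&gt;This is straightforward to express in R. Below is a small sketch on simulated data (the cutpoints at 10 and 15 mirror the example above), where the cut function creates the bins and lm fits a constant to each one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Simulated data, purely for illustration
set.seed(1)
x = runif(100, 0, 25)
y = sin(x/5) + rnorm(100, sd=0.3)

# cut() creates the bins; lm() then fits a separate constant to each bin
x.bins = cut(x, breaks=c(0, 10, 15, 25))
step.fit = lm(y ~ x.bins)
coef(summary(step.fit))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;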

&lt;h1&gt;
  
  
  Basis Functions
&lt;/h1&gt;

&lt;p&gt;Polynomial and piecewise constant models are special cases of a basis function approach. The basis function approach utilizes a family of functions or transformations that can be applied to a predictor. The model takes on the following form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462476989%2F2nsj387h4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462476989%2F2nsj387h4.png" alt="basis-function-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The basis functions (&lt;em&gt;b&lt;/em&gt;) are fixed and known prior to running the model. In polynomial and piecewise constant regression, the basis functions are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462485102%2FYXUPwmcab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462485102%2FYXUPwmcab.png" alt="basis-polynomial-piecewise-constant.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next few sections, we’ll discuss splines, which are very commonly used in the basis function approach.&lt;/p&gt;
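
&lt;p&gt;As a quick illustration with the Wage data, any fixed set of transformations of a predictor can serve as basis functions in lm. The particular choice below is arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)

# An arbitrary set of fixed basis functions of age: b1 = age, b2 = age^2, b3 = log(age)
basis.fit = lm(wage ~ age + I(age^2) + log(age), data=Wage)
coef(summary(basis.fit))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;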

&lt;h1&gt;
  
  
  Regression Splines
&lt;/h1&gt;

&lt;p&gt;Regression splines extend upon polynomial regression and piecewise-constant regression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Piecewise Polynomial Regression
&lt;/h2&gt;

&lt;p&gt;In polynomial regression, an entire polynomial function is fit over the entire range of &lt;em&gt;X&lt;/em&gt; values. Piecewise polynomial regression simply involves fitting separate polynomial functions over different ranges of &lt;em&gt;X&lt;/em&gt;. The points at which the functions change are called knots. Using more knots leads to a more flexible model. For example, a piecewise polynomial regression model with one knot takes on the following form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462496732%2FvSsOS-O2_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462496732%2FvSsOS-O2_.png" alt="piecewise-polynomial-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of a piecewise polynomial regression model with one knot at &lt;em&gt;X&lt;/em&gt; = 12 can be seen in the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462508430%2Fz-m6BG-0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462508430%2Fz-m6BG-0b.png" alt="piecewise-polynomial-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen from the chart above, piecewise polynomial models are discontinuous. While discontinuity can sometimes be desired, we usually want a continuous model. To solve this, we can introduce constraints, which result in continuous and smooth models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Constraints and Splines
&lt;/h2&gt;

&lt;p&gt;In order to produce a piecewise polynomial model that is continuous, we need to introduce a continuity constraint. In other words, the separate polynomial models must meet at the knots. Introducing the continuity constraint on the previous chart might result in something like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462522405%2FYyaucGfv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462522405%2FYyaucGfv6.png" alt="piecewise-polynomial-regression-continuous.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the point at which the functions join looks a bit unnatural. This can be addressed by introducing another constraint that results in a smooth join. Specifically, this constraint requires that the derivatives up to degree (&lt;em&gt;d&lt;/em&gt;-1) be continuous at each knot. For example, if we have a cubic polynomial (&lt;em&gt;d&lt;/em&gt;=3), then the first and second derivatives must be continuous at the knots. A model that is continuous and smooth is known as a spline. Introducing the smoothness constraint on the previous chart might result in the following spline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462533779%2F3_pvonryH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462533779%2F3_pvonryH.png" alt="piecewise-polynomial-regression-spline.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Splines have the potential to have high variance at the outer range of the predictor values, which causes wide confidence intervals at the outer boundaries. This can be addressed through a boundary constraint, which requires the spline function to be linear at the boundaries (before the first knot, and after the last knot). A spline with this additional constraint is called a natural spline.&lt;/p&gt;
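
&lt;p&gt;A minimal sketch with the splines package (assuming the Wage data and an illustrative set of knots) shows the difference between a cubic regression spline, fit with bs, and a natural spline, fit with ns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)
library(splines)

# Cubic regression spline for wage, with illustrative knots at ages 25, 40, and 60
bs.fit = lm(wage ~ bs(age, knots=c(25, 40, 60)), data=Wage)

# Natural spline with the same knots: additionally constrained to be linear
# beyond the boundary knots, which stabilizes the fit at the outer ages
ns.fit = lm(wage ~ ns(age, knots=c(25, 40, 60)), data=Wage)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;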

&lt;h2&gt;
  
  
  Choosing the Number and Locations of Knots
&lt;/h2&gt;

&lt;p&gt;There are several ways to choose how many knots to use and where to place them.&lt;/p&gt;

&lt;p&gt;One option is to place more knots in the regions where we feel that the function might vary the most, and to place fewer knots where it will be more stable.&lt;/p&gt;

&lt;p&gt;The first option could work well, but it is more common to simply place knots at uniform percentiles of the data. For example, if we choose to use three knots, then they would be placed at the 25th, 50th, and 75th percentiles of the data. But how do we know how many knots should be used? One option is to simply try out a different number of knots and analyze the fitted curves. Another more objective option is to use cross-validation to choose the number of knots that results in the lowest test error.&lt;/p&gt;

&lt;p&gt;Cross-validation can also be used to choose the knot locations themselves, and not just how many knots to use.&lt;/p&gt;
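
&lt;p&gt;A minimal sketch of the percentile-based default, assuming the Wage data and the bs function from the splines package (df=6 is an illustrative choice that yields three interior knots):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)
library(splines)

# Supplying df instead of explicit knot locations places the interior knots
# at uniform quantiles of age (here the 25th, 50th, and 75th percentiles)
spline.fit = lm(wage ~ bs(age, df=6), data=Wage)

# Inspect where the knots were placed
attr(bs(Wage$age, df=6), "knots")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;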

&lt;h2&gt;
  
  
  Polynomial Regression vs Regression Splines
&lt;/h2&gt;

&lt;p&gt;Regression splines often produce better results than polynomial regression models. To achieve a very flexible fit, polynomial regression must use a high-degree model, and high-degree models often produce wildly inaccurate predictions at certain &lt;em&gt;X&lt;/em&gt; values, particularly near the boundaries. Splines achieve flexibility by introducing knots and fitting separate low-degree functions, which ultimately results in more accurate and more stable predictions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Smoothing Splines
&lt;/h1&gt;

&lt;p&gt;Smoothing splines are an alternative approach to fitting a smooth curve over a dataset. Smoothing splines find a function &lt;em&gt;g&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;) that minimizes the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462548320%2FtpTPWrqMR.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462548320%2FtpTPWrqMR.png" alt="smoothing-spline-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above equation should look similar to the one that ridge regression and lasso regression aim to minimize. The first part is simply the residual sum of squares. The second part is a penalty term that encourages &lt;em&gt;g&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;) to be smooth. &lt;em&gt;g&lt;/em&gt;^&lt;em&gt;m&lt;/em&gt;(&lt;em&gt;t&lt;/em&gt;) is the &lt;em&gt;m&lt;/em&gt;-&lt;em&gt;th&lt;/em&gt; derivative of the function, which is a measure of the roughness of the function at some point. The variable &lt;em&gt;m&lt;/em&gt; is known as the penalty order, and takes on a value of &lt;em&gt;m&lt;/em&gt;=1 for a linear smoothing spline, &lt;em&gt;m&lt;/em&gt;=2 for a cubic smoothing spline, &lt;em&gt;m&lt;/em&gt;=3 for a quintic (fifth-order) smoothing spline, and &lt;em&gt;m&lt;/em&gt;=4 for a septic (seventh-order) smoothing spline. If the function is very wiggly at some point, then the &lt;em&gt;m&lt;/em&gt;-&lt;em&gt;th&lt;/em&gt; derivative will be large. Therefore, the integral is a measure of the total roughness of the entire function. &lt;em&gt;λ&lt;/em&gt; is a tuning parameter that ultimately controls the bias-variance trade-off of the smoothing spline. Larger values of &lt;em&gt;λ&lt;/em&gt; will result in an even smoother spline.&lt;/p&gt;

&lt;p&gt;Smoothing splines have some special properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Smoothing splines are piecewise cubic polynomials with knots at each of the unique &lt;em&gt;X&lt;/em&gt; values in the dataset.&lt;/li&gt;
&lt;li&gt; Smoothing splines are linear at the boundaries (before the first knot, and after the last knot).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The combination of the above two properties means that smoothing splines are natural splines. However, it isn’t the same as the natural spline you would get from a piecewise polynomial function with knots at each unique &lt;em&gt;X&lt;/em&gt; value and constraints. Instead, it is a shrunken version of that natural cubic spline because of the presence of the penalty term.&lt;/p&gt;

&lt;p&gt;So, how is the &lt;em&gt;λ&lt;/em&gt; parameter chosen? As it turns out, the most efficient method is through leave-one-out cross-validation (LOOCV). In fact, LOOCV essentially has the same computational cost as a single fit of a smoothing spline.&lt;/p&gt;
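
&lt;p&gt;A minimal sketch with the Wage data: smooth.spline with cv=TRUE selects &lt;em&gt;λ&lt;/em&gt; by leave-one-out cross-validation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)

# Smoothing spline of wage on age; cv=TRUE chooses lambda by leave-one-out cross-validation
smooth.fit = smooth.spline(Wage$age, Wage$wage, cv=TRUE)

smooth.fit$lambda  # the chosen smoothing parameter
smooth.fit$df      # the effective degrees of freedom it implies
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;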

&lt;h1&gt;
  
  
  Local Regression
&lt;/h1&gt;

&lt;p&gt;Local regression is yet another approach to fitting flexible non-linear functions. It involves determining a fit at some target point &lt;em&gt;X&lt;/em&gt;₀ using only the nearby training observations. The process of fitting a local regression model is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Gather the fraction (&lt;em&gt;s&lt;/em&gt; = &lt;em&gt;k&lt;/em&gt;/&lt;em&gt;n&lt;/em&gt;) of training points whose xᵢ values are closest to the target point &lt;em&gt;X&lt;/em&gt;₀&lt;/li&gt;
&lt;li&gt; Assign a weight &lt;em&gt;K&lt;/em&gt;ᵢ₀ = &lt;em&gt;K&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;ᵢ, &lt;em&gt;X&lt;/em&gt;₀) to each point in this neighborhood, so that the point furthest from X₀ has a weight of zero, and the closest has the highest weight. All but these K nearest neighbors get a weight of zero.&lt;/li&gt;
&lt;li&gt; Fit a weighted least squares regression of the &lt;em&gt;Y&lt;/em&gt;ᵢ on the &lt;em&gt;X&lt;/em&gt;ᵢ using the &lt;em&gt;K&lt;/em&gt;ᵢ₀ weights by finding &lt;em&gt;β&lt;/em&gt;₀ and &lt;em&gt;β&lt;/em&gt;₁ that minimize the following:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462571753%2FdApWKTvwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462571753%2FdApWKTvwy.png" alt="local-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many important choices to be made when performing local regression, such as how to define the weighting function &lt;em&gt;K&lt;/em&gt;, and whether to fit a linear, quadratic, or other type of regression model. However, the most important choice is the choice of the span (&lt;em&gt;s&lt;/em&gt;) in step one. The span controls the flexibility of the non-linear fit. The smaller the value of the span, the more local and wiggly it will be. As in most cases, cross-validation is a very useful method for determining the value of the span.&lt;/p&gt;
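
&lt;p&gt;A minimal sketch with the Wage data, using the loess function and two illustrative span values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)

# Local regression of wage on age with two different spans
# The smaller span uses fewer nearby points, giving a more local, wigglier fit
loess.fit.02 = loess(wage ~ age, span=0.2, data=Wage)
loess.fit.05 = loess(wage ~ age, span=0.5, data=Wage)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;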

&lt;p&gt;Local regression can be generalized in many different ways. For example, in a setting where we have multiple predictors, one useful generalization involves fitting a multiple linear regression model that is global in some variables, but local in another. This is known as a varying coefficient model. Additionally, local regression generalizes naturally when we want to fit models that are local in multiple predictors instead of just one. However, local regression can perform poorly if we’re trying to fit models that are local in more than 3 or 4 predictors because there will generally be very few training observations close to the target point &lt;em&gt;X&lt;/em&gt;₀. &lt;em&gt;K&lt;/em&gt;-Nearest Neighbors regression suffers from this same problem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Generalized Additive Models
&lt;/h1&gt;

&lt;p&gt;What if we wanted to flexibly predict &lt;em&gt;Y&lt;/em&gt; on the basis of several different &lt;em&gt;X&lt;/em&gt; predictors? Generalized additive models (GAMs) allow for non-linear functions of each predictor, while maintaining additivity. Additivity simply refers to the ability to add together each predictor’s contribution. GAMs can be used in either the regression or classification setting.&lt;/p&gt;

&lt;p&gt;Recall that a multiple linear regression takes on the following form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462599640%2FEjMP_1lc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462599640%2FEjMP_1lc7.png" alt="multiple-linear-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To allow for non-linear relationships between each predictor and the response, we can replace each &lt;em&gt;β&lt;/em&gt;ⱼ&lt;em&gt;X&lt;/em&gt;ⱼ linear component with a non-linear function &lt;em&gt;f&lt;/em&gt;ⱼ(&lt;em&gt;X&lt;/em&gt;ⱼ). This results in a GAM, which looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462616474%2FcyLOLhCAq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462616474%2FcyLOLhCAq.png" alt="GAM-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So far, we’ve discussed different methods for fitting functions to a single predictor, such as step functions, basis functions, and splines. The beauty of GAMs is that these methods can be used as building blocks for fitting an additive model. For example, assume that we wanted to predict salary and had year, age, and education as predictors. The GAM would look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462635737%2FJ-9nBxpMp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462635737%2FJ-9nBxpMp.png" alt="GAM-wage-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could fit natural splines for the year and age predictors, and a step function for the education predictor. The plots of the relationship between each of these predictors and the response may look like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462650453%2FDKBQk7mE0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462650453%2FDKBQk7mE0.png" alt="year-age-education.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fitting a GAM with smoothing splines is not simple because least squares cannot be used. However, there is an approach known as backfitting, which can be used to fit GAMs with smoothing splines. Backfitting fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed.&lt;/p&gt;
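
&lt;p&gt;A minimal sketch using the gam package, which fits GAMs with smoothing splines via backfitting (the Wage data and the degrees of freedom for each smooth term are illustrative choices):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)
library(gam)  # provides s() for smoothing spline terms and gam() for backfitting

# Smoothing splines of year (4 df) and age (5 df), plus education as a qualitative predictor
gam.fit = gam(wage ~ s(year, 4) + s(age, 5) + education, data=Wage)

# One panel per predictor, showing its fitted contribution with standard error bands
par(mfrow=c(1, 3))
plot(gam.fit, se=TRUE)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;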

&lt;h2&gt;
  
  
  Pros and Cons of GAMs
&lt;/h2&gt;

&lt;p&gt;The advantages of GAMs are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  GAMs allow us to fit non-linear functions for each predictor. This allows us to easily model non-linear relationships without having to manually try different transformations on each predictor.&lt;/li&gt;
&lt;li&gt;  The non-linear fits can potentially result in more accurate predictions.&lt;/li&gt;
&lt;li&gt;  Because GAMs are additive, we can still examine the effect of each predictor on the response, while holding all of the other predictors fixed. This means that GAMs maintain interpretability.&lt;/li&gt;
&lt;li&gt;  The smoothness of a function for a predictor can be summarized via degrees of freedom.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main disadvantage of GAMs is that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  GAMs are restricted to be additive, which means that important interactions between variables can be missed. However, interactions can be captured by manually adding interaction predictors of the form &lt;em&gt;X&lt;/em&gt;ⱼ×&lt;em&gt;X&lt;/em&gt;ₖ, or by including low-dimensional interaction functions of the form &lt;em&gt;f&lt;/em&gt;ⱼₖ(&lt;em&gt;X&lt;/em&gt;ⱼ, &lt;em&gt;X&lt;/em&gt;ₖ).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/moving-beyond-linearity/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 7, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 7 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Polynomial Regression and Step Functions
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Working with the Wage dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a fourth-degree polynomial model to predict wage with age&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Note that this fits an orthogonal polynomial&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# To fit a regular polynomial, we can use the raw=TRUE argument&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# This changes the coefficient estimates but not the fitted values or predictions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create a grid of values for age, which we use to make predictions with&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;agelims&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agelims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agelims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get predictions and standard error bands&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="m"&gt;+2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;se.fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="m"&gt;-2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;se.fit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the data, the model fit, and the standard error bands with ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Degree-4 Polynomial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Wage"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orange"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orange"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform ANOVA to compare models with different degrees&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The more complex model is compared to the simpler model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit.1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit.2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit.3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit.4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;anova&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Quadratic model is better than linear model (p-value: 0)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Cubic model is better than quadratic model (p-value: 0)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Quartic model is better than cubic model (p-value: 0.05)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Quintic polynomial is NOT better than quartic model (p-value: 0.37)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Note that we also could have used cross-validation to choose a model&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a polynomial logistic regression model to predict if someone makes more than $100,000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;logistic.fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Make predictions for all of the ages&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;logistic.preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logistic.fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;logistic.preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logistic.preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;logistic.preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a simple step function using the cut function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Cut automatically chooses cutpoints, but cutpoints can be set manually too&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;step.fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step.fit&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Splines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;splines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Wage dataset&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the basis function bs() to fit a piecewise polynomial spline&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We use pre-specified knots at ages 25, 40, 65&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# By default, a cubic spline is produced&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;bs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;knots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;65&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get predictions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get standard error bands&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="m"&gt;+2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;se.fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="m"&gt;-2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;se.fit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the data, the model fit, and the standard error bands with ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Piecewise Polynomial Spline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Wage"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orange"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;se.bands&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orange"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We could also use the df option in the bs() function to use knots at uniform quantiles&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# This will set the knots at the 25th, 50th, and 75th percentiles of the age data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;bs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We could fit a natural spline instead of using basis functions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;se&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age.grid&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Piecewise Polynomial Spline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Wage"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We can also use a smoothing spline&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;smooth.spline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Lastly, we can use local regression&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We will use a span value of 0.5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GAMs
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Wage dataset&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a GAM&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Use a smoothing spline for the year and age variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Year spline with 4 df, and age with 5 df&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;gam.m3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;education&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a GAM model without the year variable&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;gam.m1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;education&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a GAM model with a linear year variable instead of a spline&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;gam.m2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;education&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform ANOVA tests to compare the different GAM models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;anova&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gam.m1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam.m2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam.m3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"F"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The second GAM model is better than the first&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The third GAM model is not better than the second&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Make prediction with the GAM model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gam.m2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We can also use local regression in GAMs&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Use local regression for the age variable, with a span of 0.7&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;gam.lo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;education&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We could also use local regression to create an interaction term for the GAM&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;gam.lo.i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;education&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Lastly, we can also perform logistic regression with GAM&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;gam.lr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wage&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;education&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Wage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 6 — Linear Model Selection &amp; Regularization</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:39:16 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-6-linear-model-selection-regularization-40ph</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-6-linear-model-selection-regularization-40ph</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 6 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Linear regression can be improved by replacing plain least squares with some alternative fitting procedures. These alternative fitting procedures may yield better prediction accuracy and model interpretability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Alternative Methods
&lt;/h1&gt;

&lt;p&gt;There are three classes of alternative methods to least squares: subset selection, shrinkage, and dimension reduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subset Selection
&lt;/h2&gt;

&lt;p&gt;Subset selection involves identifying a subset of the p predictors that are believed to be related to the response, and fitting a least squares model to the reduced set of predictors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shrinkage
&lt;/h2&gt;

&lt;p&gt;Shrinkage involves fitting a model that contains all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. Shrinkage reduces variance, and some shrinkage methods also perform variable selection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimension Reduction
&lt;/h2&gt;

&lt;p&gt;Dimension reduction involves projecting all of the p predictors into an &lt;em&gt;M&lt;/em&gt;-dimensional subspace where &lt;em&gt;M&lt;/em&gt;&amp;lt;&lt;em&gt;p&lt;/em&gt;. M different “linear combinations” of all of the &lt;em&gt;p&lt;/em&gt; predictors are computed, and are used as predictors to fit a linear regression model through least squares.&lt;/p&gt;
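
&lt;p&gt;As a rough sketch of the idea, principal components can serve as the &lt;em&gt;M&lt;/em&gt; linear combinations (this is the approach behind principal components regression). The ISLR Credit dataset and the choice of &lt;em&gt;M&lt;/em&gt;=2 below are assumptions for illustration only.&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)

# Build M=2 linear combinations (principal components) of the numeric predictors
pcs = prcomp(Credit[, c("Income", "Limit", "Rating", "Cards", "Age")], scale.=TRUE)
Z = data.frame(pcs$x[, 1:2], Balance=Credit$Balance)

# Fit least squares on the M components instead of the original p predictors
pcr.fit = lm(Balance~PC1+PC2, data=Z)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;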

&lt;h2&gt;
  
  
  Improved Prediction Accuracy
&lt;/h2&gt;

&lt;p&gt;When the number of observations &lt;em&gt;n&lt;/em&gt; is not much greater than the number of predictors &lt;em&gt;p&lt;/em&gt;, least squares regression tends to have high variance. When the number of predictors &lt;em&gt;p&lt;/em&gt; is greater than the number of observations &lt;em&gt;n&lt;/em&gt;, there is no longer a unique least squares solution, so the method cannot be used at all.&lt;/p&gt;

&lt;p&gt;By shrinking or constraining the estimated coefficients through alternative methods, variance can be substantially reduced at a negligible increase to bias.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Model Interpretability
&lt;/h2&gt;

&lt;p&gt;Additionally, the least squares approach is highly unlikely to yield any coefficient estimates that are exactly zero. There are alternative approaches that automatically perform feature selection for excluding irrelevant variables from a linear regression model, thus resulting in a model that is easier to interpret.&lt;/p&gt;

&lt;h1&gt;
  
  
  Subset Selection
&lt;/h1&gt;

&lt;p&gt;There are two main types of subset selection methods: best subset selection and stepwise model selection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Subset Selection
&lt;/h2&gt;

&lt;p&gt;Best subset selection involves fitting a separate least squares regression model for each possible combination of the &lt;em&gt;p&lt;/em&gt; predictors. For example, assume we had a credit balance dataset that looked like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461654123%2F9stQwT3bt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461654123%2F9stQwT3bt.png" alt="best-subset-selection-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best subset selection process begins by fitting all models that contain only one predictor:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461683849%2Fej2o5e4ZP.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461683849%2Fej2o5e4ZP.png" alt="best-subset-0.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461692941%2FLu-u7YTB8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461692941%2FLu-u7YTB8.png" alt="best-subset-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461699247%2FseF654RhI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461699247%2FseF654RhI.png" alt="best-subset-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461705892%2FGLpwJnIUf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461705892%2FGLpwJnIUf.png" alt="best-subset-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461711627%2Fw_wopq_0S.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461711627%2Fw_wopq_0S.png" alt="best-subset-4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, all possible two-predictor models are fit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461733875%2FCFsE5syl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461733875%2FCFsE5syl9.png" alt="best-subset-5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461739206%2FOZ4c-9R9_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461739206%2FOZ4c-9R9_.png" alt="best-subset-6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461778968%2FF85enrll0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461778968%2FF85enrll0.png" alt="best-subset-7.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461784464%2F-JbTcuNOP.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461784464%2F-JbTcuNOP.png" alt="best-subset-8.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461791158%2FmLM6_pom2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461791158%2FmLM6_pom2.png" alt="best-subset-9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461813511%2FMk1xH30Mk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461813511%2FMk1xH30Mk.png" alt="best-subset-10.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461819597%2Fz-99YC4k8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461819597%2Fz-99YC4k8.png" alt="best-subset-11.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461826677%2FLPbWyIHbOd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461826677%2FLPbWyIHbOd.png" alt="best-subset-12.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461844903%2FB-6w_fN0R.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461844903%2FB-6w_fN0R.png" alt="best-subset-13.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461866973%2FnxAGb-Pdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461866973%2FnxAGb-Pdy.png" alt="best-subset-14.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process is continued for all possible three, four, and five predictor models.&lt;/p&gt;

&lt;p&gt;After all possible models are fit, the best one-, two-, three-, four-, and five-predictor models are chosen based on some criterion, such as the largest &lt;em&gt;R&lt;/em&gt;² value. For the credit balance dataset, these models may be as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461880614%2F2I17XvhnK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461880614%2F2I17XvhnK.png" alt="best-subset-15.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461889791%2F74wYcoFDpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461889791%2F74wYcoFDpr.png" alt="best-subset-16.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461898852%2FdzrtCBKMy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461898852%2FdzrtCBKMy.png" alt="best-subset-17.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461907065%2F5sIW5_Vm-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461907065%2F5sIW5_Vm-.png" alt="best-subset-18.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461915919%2F_I-0FqKGv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461915919%2F_I-0FqKGv.png" alt="best-subset-19.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, the overall best model is chosen from the remaining models. The best model is chosen through cross-validation or some other method that chooses the model with the lowest measure of test error. In this final step, a model cannot be chosen based on &lt;em&gt;R&lt;/em&gt;², because &lt;em&gt;R&lt;/em&gt;² always increases when more predictors are added. The plot of the K-Fold CV test errors for the five models looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461931003%2FJC4VOZsJD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461931003%2FJC4VOZsJD.png" alt="best-subset-selection.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model with four predictors has the lowest test error, and thus would be chosen as the best overall model.&lt;/p&gt;
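
&lt;p&gt;As a rough sketch of how this could be carried out in R, the regsubsets() function from the leaps package performs best subset selection. The ISLR Credit dataset and the five predictors below are assumptions standing in for the credit balance data shown above.&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(leaps)
library(ISLR)

# Exhaustively fit every combination of the chosen predictors
best.fit = regsubsets(Balance~Income+Limit+Rating+Cards+Age, data=Credit, nvmax=5)
best.summary = summary(best.fit)

# Which predictors appear in the best model of each size
best.summary$outmat

# Choose the final model size with an estimate of test error (BIC here)
which.min(best.summary$bic)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;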

&lt;h2&gt;
  
  
  Stepwise Selection
&lt;/h2&gt;

&lt;p&gt;There are three different types of stepwise selection: forward, backward, and hybrid. Compared to best subset selection, stepwise selection is computationally more feasible because it searches far fewer models. Additionally, searching fewer models helps avoid the issue of overfitting and high variance.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Forward Stepwise Selection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Forward stepwise selection begins with a model containing no predictors. Predictors are added to the model one at a time, based on the predictor that gives the greatest additional improvement to the fit of the model.&lt;/p&gt;

&lt;p&gt;After using forward stepwise selection to come up with the best one, two, three, four, and five predictor models, cross-validation is used to choose the best model.&lt;/p&gt;

&lt;p&gt;The disadvantage of forward stepwise selection is that it is not guaranteed to find the best possible model, since the final model depends on the first predictor added to the null model. The credit balance dataset is a good example of this phenomenon. Best subset selection and forward stepwise selection both choose the same one-, two-, and three-predictor models. However, they differ in the four-predictor model.&lt;/p&gt;

&lt;p&gt;Best subset selection replaces the “Rating” predictor with the “Cards” predictor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461990379%2FDBRg1dkLl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461990379%2FDBRg1dkLl.png" alt="forward-stepwise-0.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since forward stepwise selection begins by choosing “Rating” as the first predictor, the predictor is kept:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461998003%2FaRp2ftWQa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461998003%2FaRp2ftWQa9.png" alt="forward-stepwise-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Backward Stepwise Selection
&lt;/h2&gt;

&lt;p&gt;Backward stepwise selection begins with a model that contains all of the predictors. Predictors are removed from the model one at a time, based on the predictor that is least useful towards the fit of the model.&lt;/p&gt;

&lt;p&gt;After using backward stepwise selection to come up with the best one, two, three, and four predictor models, cross-validation is used to choose the best model.&lt;/p&gt;

&lt;p&gt;Backward stepwise selection has the same disadvantage as forward stepwise selection in that it is not guaranteed to find the best possible model.&lt;/p&gt;
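
&lt;p&gt;Backward stepwise selection follows the same pattern; the sketch below again assumes the Credit data and simply changes the method argument.&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Start from the full model and drop the least useful predictor at each step
bwd.fit = regsubsets(Balance~Income+Limit+Rating+Cards+Age, data=Credit,
                     nvmax=5, method="backward")
summary(bwd.fit)$outmat
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;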

&lt;h2&gt;
  
  
  Hybrid Stepwise Selection
&lt;/h2&gt;

&lt;p&gt;Hybrid stepwise selection is a combination of forward and backward selection. We begin with a null model that contains no predictors. Then, variables are added one by one, exactly as done in forward selection. However, if at any point the &lt;em&gt;p&lt;/em&gt;-value for some variable rises above a chosen threshold, then it is removed from the model, as done in backward selection.&lt;/p&gt;
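
&lt;p&gt;One way to approximate this in R is the built-in step() function, which considers both adding and dropping a term at each step. Note that step() uses AIC rather than a p-value threshold, so this is only a rough stand-in for the procedure described above; the Credit data is again an assumption.&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)

null.fit = lm(Balance~1, data=Credit)
full.fit = lm(Balance~Income+Limit+Rating+Cards+Age, data=Credit)

# Start from the null model; at each step, consider adding or removing a term
hybrid.fit = step(null.fit, scope=formula(full.fit), direction="both")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;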

&lt;h1&gt;
  
  
  Optimal Model Selection
&lt;/h1&gt;

&lt;p&gt;In order to select the best model with respect to the test error, the test error needs to be estimated through one of two methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; An indirect estimate through some kind of mathematical adjustment to the training error, which accounts for bias due to overfitting.&lt;/li&gt;
&lt;li&gt; A direct estimate through a method such as cross-validation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Indirect Estimation of Test Error
&lt;/h2&gt;

&lt;p&gt;There are various different methods of indirectly estimating the test error: &lt;em&gt;Cp&lt;/em&gt;, &lt;em&gt;AIC&lt;/em&gt;, &lt;em&gt;BIC&lt;/em&gt;, and adjusted-&lt;em&gt;R&lt;/em&gt;².&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cp&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cp&lt;/em&gt; is an estimate of test error that adds a penalty to the training error, based on the number of predictors and the estimate of the variance of error. The model with the lowest &lt;em&gt;Cp&lt;/em&gt; is chosen as the best model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;AIC&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Similar to &lt;em&gt;Cp&lt;/em&gt;, &lt;em&gt;AIC&lt;/em&gt; is another estimate of test error that adds a penalty to the training error, based on the number of predictors and the estimate of the variance of error. However, it is defined for a larger class of models fit by maximum likelihood. The model with the lowest &lt;em&gt;AIC&lt;/em&gt; is chosen as the best model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;BIC&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;BIC&lt;/em&gt; is an estimate of test error that is similar to &lt;em&gt;AIC&lt;/em&gt;, but places a heavier penalty on models with a large number of predictors. Therefore, &lt;em&gt;BIC&lt;/em&gt; results in the selection of models with fewer predictors. The model with the lowest &lt;em&gt;BIC&lt;/em&gt; is chosen as the best model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adjusted-&lt;em&gt;R&lt;/em&gt;²&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Adjusted-&lt;em&gt;R&lt;/em&gt;² is a measure that penalizes the &lt;em&gt;R&lt;/em&gt;² value for the inclusion of unnecessary predictors. The model with the highest adjusted-&lt;em&gt;R&lt;/em&gt;² is chosen as the best model.&lt;/p&gt;
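
&lt;p&gt;Assuming the regsubsets() summary from the best subset selection sketch earlier, each of these adjusted measures can be pulled from the fit and used to pick a model size.&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Model size with the lowest Cp, the lowest BIC, and the highest adjusted R^2
which.min(best.summary$cp)
which.min(best.summary$bic)
which.max(best.summary$adjr2)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;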

&lt;h2&gt;
  
  
  One-Standard-Error Rule
&lt;/h2&gt;

&lt;p&gt;The model with the absolute lowest test error doesn’t always have to be chosen. For example, what if we had a case where the models with anywhere from 3 to 10 predictors all had similar test errors?&lt;/p&gt;

&lt;p&gt;A model could be chosen based on the one-standard-error rule. This rule involves selecting the model with the least number of predictors for which the estimated test error is within one standard error of the model with the absolute minimum error. This is done for simplicity and ease of interpretability.&lt;/p&gt;
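
&lt;p&gt;A minimal sketch of the rule, using hypothetical vectors of cross-validation errors and standard errors for models with one through seven predictors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical CV errors and standard errors for models with 1-7 predictors
cv_err = c(120, 101, 95, 94, 93, 93, 94)
cv_se  = c(6, 5, 4, 4, 4, 4, 4)

best = which.min(cv_err)
threshold = cv_err[best] + cv_se[best]

# Smallest model whose estimated error is within one SE of the minimum
one_se_model = min(which(cv_err &amp;lt;= threshold))
one_se_model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;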

&lt;h1&gt;
  
  
  Shrinkage Methods
&lt;/h1&gt;

&lt;p&gt;Shrinkage methods involve fitting a model with all of the available predictors, and shrinking the coefficients towards zero. There are two main shrinkage methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Ridge Regression&lt;/li&gt;
&lt;li&gt; Lasso Regression&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Ridge Regression
&lt;/h2&gt;

&lt;p&gt;Recall that least squares regression minimizes RSS to estimate coefficients. The coefficients are unbiased, meaning that least squares doesn’t take variable significance into consideration when determining the coefficient values.&lt;/p&gt;

&lt;p&gt;Instead, ridge regression minimizes the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462017039%2FdngvGX0A8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462017039%2FdngvGX0A8.png" alt="ridge-regression-minimization.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second term is called a shrinkage penalty, and is the term that shrinks the coefficient estimates towards zero. &lt;em&gt;λ&lt;/em&gt; is a tuning parameter that controls the relative impact of the penalty term on the regression model. The coefficient estimates that come from ridge regression are biased, because the penalty deliberately shrinks them away from the least squares estimates and towards zero.&lt;/p&gt;

&lt;p&gt;Performing ridge regression with different values of &lt;em&gt;λ&lt;/em&gt; on the credit balance dataset would result in the following sets of standardized coefficients:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462029465%2FBqOMy_47u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462029465%2FBqOMy_47u.png" alt="ridge-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, different values of &lt;em&gt;λ&lt;/em&gt; will produce different sets of coefficient estimates. Therefore, choosing the proper &lt;em&gt;λ&lt;/em&gt; value is critical, and is usually chosen through cross-validation.&lt;/p&gt;

&lt;p&gt;Additionally, in ridge regression, coefficient estimates significantly depend on the scaling of predictors. Therefore, it is best practice to perform ridge regression after standardizing the predictors, so that they are on the same scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ridge Regression vs Least Squares Regression
&lt;/h2&gt;

&lt;p&gt;So, when is it appropriate to use ridge regression over least squares regression?&lt;/p&gt;

&lt;p&gt;The advantage of ridge regression is rooted in the bias-variance tradeoff. As &lt;em&gt;λ&lt;/em&gt; increases, the flexibility of the ridge regression model decreases, resulting in a decrease in variance, but an increase in bias.&lt;/p&gt;

&lt;p&gt;Therefore, ridge regression should be used when least squares regression will have high variance. This happens when the number of predictors is almost as large as the number of observations.&lt;/p&gt;

&lt;p&gt;If we’re unsure whether or not ridge regression would result in a better predictive model than least squares, we could simply run both methods and compare the test errors to each other. Unless ridge regression provides a significant advantage, least squares should be used for model simplicity and interpretability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lasso Regression
&lt;/h2&gt;

&lt;p&gt;The disadvantage of ridge regression is that all predictors are included in the final model. Even though the penalty term shrinks the coefficient estimates towards zero, it does not set them to exactly zero.&lt;/p&gt;

&lt;p&gt;This might not matter in terms of prediction accuracy, but it creates a challenge when it comes to model interpretation because of the large number of predictors.&lt;/p&gt;

&lt;p&gt;Lasso regression is an alternative to ridge regression, which overcomes this disadvantage. Lasso regression minimizes the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462047080%2FrglQyZBDS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462047080%2FrglQyZBDS.png" alt="lasso-regression-minimization.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to ridge regression, lasso regression will shrink coefficient estimates towards zero. However, the penalty is a bit different in that it forces some of the coefficient estimates to be exactly zero when the tuning parameter &lt;em&gt;λ&lt;/em&gt; is large enough. Again, choosing the appropriate &lt;em&gt;λ&lt;/em&gt; value is critical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462061552%2FZAuk8uC2E.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462061552%2FZAuk8uC2E.png" alt="lasso-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This effectively means that lasso regression performs variable selection, and makes it easier to interpret the final model.&lt;/p&gt;

&lt;p&gt;As in ridge regression, it is best practice to perform lasso regression after standardizing the predictors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ridge Regression vs Lasso Regression
&lt;/h2&gt;

&lt;p&gt;In general, lasso regression is expected to perform better than ridge regression when the response &lt;em&gt;Y&lt;/em&gt; is expected to be a function of only a few of the predictors.&lt;/p&gt;

&lt;p&gt;In general, ridge regression is expected to perform better than lasso regression when the response is expected to be a function of a large number of predictors.&lt;/p&gt;

&lt;p&gt;Cross-validation should be used to compare both methods and choose the best model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Selecting the Tuning Parameter λ
&lt;/h2&gt;

&lt;p&gt;As mentioned previously, choosing the proper value for the tuning parameter is crucial for coming up with the best model.&lt;/p&gt;

&lt;p&gt;Cross-validation is a simple method of choosing the appropriate &lt;em&gt;λ&lt;/em&gt; value. First, create a grid of different &lt;em&gt;λ&lt;/em&gt; values, and determine the cross-validation test error for each value. Finally, choose the value that resulted in the lowest error.&lt;/p&gt;
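
&lt;p&gt;A minimal sketch with glmnet, assuming &lt;em&gt;x&lt;/em&gt; is a predictor matrix (glmnet standardizes by default) and &lt;em&gt;y&lt;/em&gt; is the response vector; the grid below is just one reasonable choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(glmnet)

# Grid of candidate lambda values spanning several orders of magnitude
lambda_grid = 10^seq(4, -2, length=100)

set.seed(1)
ridge_cv = cv.glmnet(x, y, alpha=0, lambda=lambda_grid, nfolds=10)

ridge_cv$lambda.min   # grid value with the lowest cross-validation error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;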

&lt;h1&gt;
  
  
  Dimension Reduction Methods
&lt;/h1&gt;

&lt;p&gt;When we have a dataset with a large number of predictors, dimension reduction methods can be used to summarize the dataset with a smaller number of representative predictors (dimensions) that collectively explain most of the variability in the data. Each of the dimensions is some linear combination of all of the predictors. For example, if we had a dataset of 5 predictors, the first dimension &lt;em&gt;Z&lt;/em&gt;₁ would be as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462086438%2Fo6IXzKuh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462086438%2Fo6IXzKuh3.png" alt="dimension-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And a final regression model might look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462101477%2FwTlbH6yHn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462101477%2FwTlbH6yHn.png" alt="pcr-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that we will always have fewer dimensions than the number of predictors.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;ϕ&lt;/em&gt; values for the dimensions are known as loading values. The values are subject to the constraint that the sum of the squares of the &lt;em&gt;ϕ&lt;/em&gt; values in a dimension must equal one. For our above example, this means that:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462120188%2FZSn-i4qFe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462120188%2FZSn-i4qFe.png" alt="loading-value-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;ϕ&lt;/em&gt; values in a dimension make up a “loading vector” that defines a “direction” to explain the predictors. But how exactly do we come up with the &lt;em&gt;ϕ&lt;/em&gt; values? They are determined through either of the two most common dimension reduction methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Principal Components Analysis&lt;/li&gt;
&lt;li&gt; Partial Least Squares&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Principal Components Analysis
&lt;/h2&gt;

&lt;p&gt;Principal components analysis is the most popular dimension reduction method. The method involves determining different principal component directions of the data.&lt;/p&gt;

&lt;p&gt;The first principal component direction &lt;em&gt;Z&lt;/em&gt;₁ defines the direction along which the data varies the most. In other words, it is the linear combination of all of the predictors that captures the largest possible amount of variance in the predictors.&lt;/p&gt;

&lt;p&gt;The second principal component direction &lt;em&gt;Z&lt;/em&gt;₂ defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with the first principal component, &lt;em&gt;Z&lt;/em&gt;₁.&lt;/p&gt;

&lt;p&gt;The third principal component direction &lt;em&gt;Z&lt;/em&gt;₃ defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with both of the previous principal components, &lt;em&gt;Z&lt;/em&gt;₁ and &lt;em&gt;Z&lt;/em&gt;₂.&lt;/p&gt;

&lt;p&gt;And so on and so forth for additional principal component directions.&lt;/p&gt;

&lt;p&gt;Dimension reduction is best explained with an example. Assume that we have a dataset of different baseball players, which consists of their statistics in 1986, their years in the league, and their salaries in the following year (1987).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462172456%2Ft5mIAONXV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462172456%2Ft5mIAONXV.png" alt="baseball-statistics-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our goal is to perform principal components regression to come up with a model that predicts salaries.&lt;/p&gt;

&lt;p&gt;First, we need to perform dimension reduction by transforming our 7 different predictors into a smaller number of principal components to use for regression.&lt;/p&gt;

&lt;p&gt;It is important to note that prior to performing principal components analysis, each predictor should be standardized to ensure that all of the predictors are on the same scale. The absence of standardization will cause the predictors with high variance to play a larger role in the final principal components obtained.&lt;/p&gt;

&lt;p&gt;Performing principal components analysis would result in the following &lt;em&gt;ϕ&lt;/em&gt; values for the first three principal components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462188381%2FsD7scjs8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462188381%2FsD7scjs8y.png" alt="baseball-pca-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A plot of the loading values of the first two principal components would look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462200387%2FHY2ZdZM5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462200387%2FHY2ZdZM5z.png" alt="baseball-pca.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do we interpret these principal components?&lt;/p&gt;

&lt;p&gt;If we take a look at the first principal component, we can see that there is approximately an equal weight placed on each of the six baseball statistic predictors, and much less weight placed on the years that a player has been in the league. This means that the first principal component roughly corresponds to a player’s level of production.&lt;/p&gt;

&lt;p&gt;On the other hand, in the second principal component, we can see that there is a large weight placed on the number of years that a player has been in the league, and much less weight placed on the baseball statistics. This means that the second principal component roughly corresponds to how long a player has been in the league.&lt;/p&gt;

&lt;p&gt;In the third principal component, we can see that there is more weight placed on three specific baseball statistics: home runs, RBIs, and walks. This means that the third principal component roughly corresponds to a player’s batting ability.&lt;/p&gt;

&lt;p&gt;Performing principal components analysis also tells us the percent of variation in the data that is explained by each of the components. The first principal component from the baseball data explains 67% of the variation in the predictors. The second principal component explains 15%. The third principal component explains 9%. Therefore, together, these three principal components explain 91% of the variation in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462215969%2FnXJfY-92X.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605462215969%2FnXJfY-92X.png" alt="pca-pve.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This helps explain the key idea of principal components analysis, which is that a small number of principal components are sufficient to explain most of the variability in the data. Through principal components analysis, we’ve reduced the dimensions of our dataset from seven to three.&lt;/p&gt;
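
&lt;p&gt;A minimal prcomp sketch of these calculations, assuming &lt;em&gt;hitters_stats&lt;/em&gt; is a data frame containing only the seven numeric predictors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# scale.=TRUE standardizes each predictor before the analysis
pca = prcomp(hitters_stats, scale.=TRUE)

pca$rotation[, 1:3]       # loading values (phi) for the first three components
colSums(pca$rotation^2)   # each loading vector has squared loadings summing to one

# Proportion of variance explained by each principal component
pve = pca$sdev^2 / sum(pca$sdev^2)
pve
cumsum(pve)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;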

&lt;p&gt;The number of principal components to use in regression can be chosen in one of two ways. The first method involves simply using the number of components that explain a large amount of the variation in the data. For example, in the baseball data, the first three principal components explain 91% of variation in the data, so using just the first three is a valid option. The second method involves choosing the number of principal components that results in the regression model with the lowest test error. Typically, both methods should result in the same or similar final models with test errors that do not greatly differ from one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principal Components Regression vs Least Squares Regression
&lt;/h2&gt;

&lt;p&gt;Remember that in principal components analysis, the response variable is not used to determine the principal components. Therefore, in principal components regression, we are making the assumption that the directions in which the predictors show the most variation are also the directions that are associated with the response variable.&lt;/p&gt;

&lt;p&gt;The assumption is not guaranteed to be true, but it often turns out to be a reasonable assumption to make. When the assumption holds, principal components regression will result in a better model than least squares regression due to mitigation of overfitting.&lt;/p&gt;

&lt;p&gt;In general, principal components regression will perform well when the first few principal components are sufficient to capture most of the variation in the predictors, as well as the relationship with the response variable.&lt;/p&gt;

&lt;p&gt;However, principal components regression is not a feature selection method because each of the principal components is a linear combination of all of the original predictors in the dataset. If performing feature selection is important, then another method such as stepwise selection or lasso regression should be used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partial Least Squares Regression
&lt;/h2&gt;

&lt;p&gt;In principal components regression, the directions that best represent the predictors are identified in an unsupervised way since the response variable is not used to help determine the directions. Therefore, there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.&lt;/p&gt;

&lt;p&gt;Partial least squares regression is a supervised alternative to principal components regression. In other words, partial least squares regression attempts to find directions that help explain both the response and the predictors.&lt;/p&gt;

&lt;p&gt;Partial least squares works by first standardizing the predictors and response, and then determining the first direction by setting each &lt;em&gt;ϕ&lt;/em&gt; value equal to the coefficients from simple linear regression of the response onto each of the predictors. In doing so, partial least squares places the highest weight on the variables most strongly related to the response.&lt;/p&gt;

&lt;p&gt;To identify the second direction, each of the original variables is adjusted by regressing each variable onto the first direction and taking the residuals. These residuals represent the remaining information that isn’t explained by the first direction. The second direction is then determined by using the new residual data, in the same method that the first direction was determined based on the original data. This iterative approach is repeated to find multiple directions, which are then used to fit a linear model to predict the response.&lt;/p&gt;
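
&lt;p&gt;A rough sketch of the first two directions, assuming &lt;em&gt;X&lt;/em&gt; is a standardized predictor matrix and &lt;em&gt;y&lt;/em&gt; a standardized response vector; this is for intuition only, and the plsr function used later handles all of it automatically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Weights proportional to the simple regression coefficients of y on each predictor
phi1 = drop(t(X) %*% y)
z1   = drop(X %*% phi1)       # first partial least squares direction

# Deflate: replace each predictor with its residuals after regressing on z1
X_deflated = apply(X, 2, function(col) residuals(lm(col ~ z1)))

phi2 = drop(t(X_deflated) %*% y)
z2   = drop(X_deflated %*% phi2)   # second direction

# Fit the final linear model on the directions
pls_fit = lm(y ~ z1 + z2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;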

&lt;p&gt;The number of directions to use in partial least squares regression should be determined through cross-validation.&lt;/p&gt;

&lt;p&gt;However, partial least squares regression usually does not perform better than ridge regression or principal components regression. This is because supervised dimension reduction can reduce bias, but also has the potential to increase variance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Considerations in High Dimensions
&lt;/h1&gt;

&lt;p&gt;Today, it is common to be able to collect hundreds or even thousands of predictors for our data. The high dimensional setting refers to situations where the number of predictors exceeds the number of observations we have available. In the high dimensional setting, least squares cannot be used because it will result in coefficient estimates that are a perfect fit to the data, such that the residuals are zero. This is where subset selection, ridge regression, lasso, principal components regression, and partial least squares should be used instead.&lt;/p&gt;
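
&lt;p&gt;A small simulation illustrates the problem: with pure noise predictors and &lt;em&gt;p&lt;/em&gt; close to &lt;em&gt;n&lt;/em&gt;, least squares still fits the training data essentially perfectly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;set.seed(1)
n = 20
p = 19

X = matrix(rnorm(n * p), nrow=n)   # predictors that are pure noise
y = rnorm(n)                       # response unrelated to the predictors

fit = lm(y ~ X)
summary(fit)$r.squared             # equals 1 despite no true relationship
sum(residuals(fit)^2)              # residual sum of squares is essentially zero
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;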

&lt;p&gt;However, the interpretation of model results is a bit different. For example, assume that we are trying to predict blood pressure from hundreds of predictors. A forward stepwise selection model might indicate that 20 of the predictors lead to a good predictive model. It would actually be improper to conclude that these 20 predictors are the best to predict blood pressure. Since we have hundreds of predictors, a different dataset might actually result in a totally different predictive model. Therefore, we must indicate that we have identified one of the many possible models, and it must be further validated on independent datasets.&lt;/p&gt;

&lt;p&gt;Additionally, SSE, &lt;em&gt;p&lt;/em&gt;-values, &lt;em&gt;R&lt;/em&gt;², and other traditional measures of model fit should never be used in the high dimensional setting. Instead, report the results of the model on an independent test dataset, or cross-validation errors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/linear-model-selection-regularization/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 6, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 6 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Ridge Regression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dplyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glmnet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For ridge regression and lasso models&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will work with the Hitters dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# First, we create a dataframe of only the predictors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The model.matrix function will automatically transform qualitative variables into dummy variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;model.matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Salary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="p"&gt;)[,&lt;/span&gt;&lt;span class="m"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we create a vector of only the responses&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_responses&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Salary&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we further split the datasets into training and test subsets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Hitters_train_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_predictors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_test_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_predictors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Hitters_train_responses&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_responses&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_test_responses&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_responses&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we use cv.glmnet function to perform ridge regression with 10-fold cross-validation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# alpha = 0 specifies ridge regression&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge_cv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.glmnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_train_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_train_responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nfolds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The lambda.min component of the object contains the best lambda value to use&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;lambda_min&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;lambda.min&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Finally, we fit a ridge regression model to the training data with the minimum lambda value&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# And then determine the test error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glmnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_train_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_train_responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lambda_min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lambda_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Hitters_test_predictors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_test_responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# See the coefficients of the fitted ridge regression model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_ridge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lasso Regression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We continue working with the Hitters data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# To fit a lasso model instead of a ridge regression model, we use alpha = 1 instead of alpha = 0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The same methodology is used to determine a lambda value and the test MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform lasso with 10-fold cross-validation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_lasso_cv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.glmnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_train_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_train_responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nfolds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;lambda_min_lasso&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_lasso_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;lambda.min&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit the lasso model with the minimum lambda value&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# And then determine the test error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitter_lasso&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glmnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_train_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_train_responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lambda_min_lasso&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_lasso_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_lasso&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lambda_min_lasso&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;newx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Hitters_test_predictors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Hitters_lasso_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_test_responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# See the coefficients of the fitted lasso model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_lasso&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Principal Components Regression (PCR)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# The pls library is used for PCR&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Hitters data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# For PCR, we create full training and test datasets (predictors/responses are not separated)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The pcr function is used to perform PCR&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We set scale=TRUE to standardize the variables&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We set the validation type to cross-validation (the function performs 10-Fold CV)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_pcr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pcr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Salary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Hitters_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"CV"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The summary of the model shows the CV errors for models with different # of principal components&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# It also shows the percent of variance explained (PVE) for models with different # of components&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_pcr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# validationplot provides a quick visual of the # of principal components that result in lowest MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;validationplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_pcr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;val.type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"MSEP"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the PCR model with 7 different components to make predictions on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_pcr_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_pcr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the test MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Hitters_pcr_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Partial Least Squares Regression (PLS)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# The pls library is used for PLS regression&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Hitters data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The plsr function is used to perform PLS regression&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_pls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plsr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Salary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Hitters_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"CV"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The summary output of PLS has a very similar format to the output of PCR&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_pls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# validationplot provides a quick visual of the # of PLS components that result in lowest MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;validationplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_pls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;val.type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"MSEP"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the PLS model with 2 principal components to make predictions on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Hitters_pls_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Hitters_pls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ncomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the test MSE&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Hitters_pls_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hitters_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 5 — Resampling Methods</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:36:30 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-5-resampling-methods-2kie</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-5-resampling-methods-2kie</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 5 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Resampling methods involve repeatedly drawing samples from a training dataset and refitting a statistical model on each of the samples in order to obtain additional information about the fitted model.&lt;/p&gt;

&lt;p&gt;For example, to estimate the variability of a linear regression model, we can repeatedly draw different samples from the training data, fit a linear regression model to each new sample, and then examine the extent to which the fits differ.&lt;/p&gt;

&lt;p&gt;There are two resampling methods that are most common:&lt;/p&gt;

&lt;p&gt;Cross-validation can be used to estimate the test error of a specific statistical learning method, or to select the appropriate level of flexibility of a method.&lt;/p&gt;

&lt;p&gt;Bootstrap can be used to help provide the accuracy of parameter estimates, or of a statistical learning method.&lt;/p&gt;

&lt;h1&gt;
  
  
  Cross-Validation
&lt;/h1&gt;

&lt;p&gt;There is a difference between the training error and the test error. The test error is what we are much more interested in measuring. However, we do not always have a large test dataset available to estimate the test error.&lt;/p&gt;

&lt;p&gt;Many techniques exist to estimate the test error with the training data. These techniques fall into one of two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Applying a mathematical adjustment to the training error&lt;/li&gt;
&lt;li&gt; Holding out a subset of the training data from the fitting process and applying the model to the held-out data&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Validation Set Approach
&lt;/h1&gt;

&lt;p&gt;The Validation Set Approach is a simple strategy for estimating the test error.&lt;/p&gt;

&lt;p&gt;The available training data is randomly divided into a training set and test (hold-out) set. A model is fit to the training set, and is used to make predictions on the data in the test set. The error from the predictions on the test set serves as the estimate for the test error.&lt;/p&gt;

&lt;p&gt;In regression, the measure of error is usually the Mean Squared Error (MSE). In classification, the measure of error is usually the misclassification rate.&lt;/p&gt;

&lt;p&gt;To be sure about which model to fit, the Validation Set Approach process could be repeated multiple times to allow for differences in random data assignment between the training and test sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461361171%2F2ciImDxGU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461361171%2F2ciImDxGU.png" alt="validation-set-approach.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two drawbacks associated with the Validation Set Approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The validation estimate of the test error can be highly variable, depending on precisely which observations are randomly assigned to the training set and test set.&lt;/li&gt;
&lt;li&gt; Since only a subset of the observations are used to fit the model, the validation set error may overestimate the test error because statistical methods tend to perform worse when trained on fewer observations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cross-validation is a refinement to the Validation Set Approach that addresses these drawbacks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Leave-One-Out Cross-Validation (LOOCV)
&lt;/h1&gt;

&lt;p&gt;Leave-One-Out Cross-Validation (LOOCV) is closely related to the Validation Set Approach, but attempts to address the drawbacks.&lt;/p&gt;

&lt;p&gt;Similar to the Validation Set Approach, LOOCV involves splitting a full dataset into separate training and test sets. However, only a single observation is included in the test set, and all of the remaining observations are assigned to the training set.&lt;/p&gt;

&lt;p&gt;A model is fit to the training set, and a prediction is made for the single excluded observation in the test set. The test error is determined for the prediction. Then, the LOOCV procedure is repeated to individually exclude each observation from the full dataset. Finally, the LOOCV estimate for the test error is simply the average of all of the individual test errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461377233%2FbrWyuclB4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461377233%2FbrWyuclB4.png" alt="loocv.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LOOCV has two advantages over the Validation Set Approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; LOOCV is not highly variable because it will yield identical results each time it is performed.&lt;/li&gt;
&lt;li&gt; LOOCV has far less bias because almost the entire dataset is being used to fit a model. Therefore, LOOCV does not overestimate the test error as much as the Validation Set Approach.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, LOOCV has one potentially major disadvantage: it can be very computationally expensive, because the model must be fit &lt;em&gt;n&lt;/em&gt; separate times.&lt;/p&gt;

&lt;h1&gt;
  
  
  K-Fold Cross-Validation
&lt;/h1&gt;

&lt;p&gt;K-Fold Cross-Validation (K-Fold CV) is an alternative to LOOCV, and is typically the preferred method.&lt;/p&gt;

&lt;p&gt;It involves randomly dividing observations into &lt;em&gt;K&lt;/em&gt; groups (folds) of approximately the same size.&lt;/p&gt;

&lt;p&gt;The first fold is treated as the test set, and a model is fit to the remaining folds. The fitted model is used to make predictions on the observations in the first fold, and the test error is measured. The K-Fold CV procedure is repeated &lt;em&gt;K&lt;/em&gt; times so that a different fold is treated as the test set each time. Finally, the K-Fold CV estimate of the test error is simply the average of the &lt;em&gt;K&lt;/em&gt; test measures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461387685%2FumQOmXus4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461387685%2FumQOmXus4.png" alt="k-fold-cross-validation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LOOCV is basically a special case of K-Fold CV, in which &lt;em&gt;K&lt;/em&gt; is set equal to &lt;em&gt;n&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;K-Fold CV has two advantages over LOOCV:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; K-Fold CV is much less computationally expensive.&lt;/li&gt;
&lt;li&gt; K-Fold CV is advantageous when it comes to the Bias-Variance tradeoff.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The K-Fold CV test error estimates tend to have lower variance than the LOOCV estimates because the &lt;em&gt;K&lt;/em&gt; fitted models are trained on less-overlapping datasets, so their individual errors are less correlated with one another. Overall, K-Fold CV provides a more accurate measure of the test error.&lt;/p&gt;

&lt;p&gt;Typically, &lt;em&gt;K&lt;/em&gt; = 5 or &lt;em&gt;K&lt;/em&gt; = 10 is chosen for K-Fold CV because these have been shown to empirically yield test error estimates that suffer neither from excessively high bias nor variance.&lt;/p&gt;
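
&lt;p&gt;The same cv.glm function performs K-Fold CV when the &lt;em&gt;K&lt;/em&gt; argument is supplied, for example &lt;em&gt;K&lt;/em&gt; = 10:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR)
library(boot)

glm_fit = glm(mpg ~ horsepower, data=Auto)

set.seed(1)
kfold_error = cv.glm(Auto, glm_fit, K=10)$delta[1]   # 10-fold CV estimate of the test MSE
kfold_error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;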

&lt;h1&gt;
  
  
  Bootstrap
&lt;/h1&gt;

&lt;p&gt;The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with an estimator or statistical learning method.&lt;/p&gt;

&lt;p&gt;The point of the bootstrap can be illustrated with an example where we are trying to figure out the best investment allocation under a simple model.&lt;/p&gt;

&lt;p&gt;Assume that we want to invest $10,000 into two stocks that yield X and Y, respectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Stock A: X Return (random quantity)&lt;/li&gt;
&lt;li&gt;  Stock B: Y Return (random quantity)&lt;/li&gt;
&lt;li&gt;  Stock A and Stock B are not the same, and have different means and variances for their returns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will invest a fraction &lt;em&gt;α&lt;/em&gt; of the money in A, and the remaining 1-&lt;em&gt;α&lt;/em&gt; in B. Additionally, &lt;em&gt;α&lt;/em&gt; will be chosen to minimize the total risk (variance) of the investment, since there is variability associated with the returns. The equation to determine the &lt;em&gt;α&lt;/em&gt; value that minimizes risk is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461400637%2FFsLkGECtf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461400637%2FFsLkGECtf.png" alt="alpha-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The variances are unknown, but could be estimated through past data of returns for the investments. Plugging the variances into the equation would let us know the value for &lt;em&gt;α&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;However, since we only have one dataset, we can only determine one value for &lt;em&gt;α&lt;/em&gt;. But what if we wanted to estimate the accuracy of the estimate for &lt;em&gt;α&lt;/em&gt;? To estimate a variance for the &lt;em&gt;α&lt;/em&gt; value itself, we would need hundreds of datasets. However, we usually cannot simply generate new samples from the original population. The bootstrap approach allows us to &lt;em&gt;emulate&lt;/em&gt; the process of obtaining new sample datasets so that we can estimate variability without &lt;em&gt;actually&lt;/em&gt; generating new sample datasets. Instead of obtaining new independent datasets from the population, the bootstrap method obtains distinct datasets by repeatedly sampling observations from the original dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461425369%2Fs_H0G-IBE.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605461425369%2Fs_H0G-IBE.png" alt="bootstrap.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, assume that we have a dataset of 15 observations for the returns of A and B. The bootstrap method randomly selects 15 observations from the dataset to produce a new bootstrapped dataset. The selection is done with replacement, meaning that it is possible for the same observation to occur more than one time in the bootstrapped dataset. The bootstrapping procedure is repeated a large number of times (usually 1000) so that 1000 bootstrapped datasets are generated. These datasets would give us 1000 values for &lt;em&gt;α&lt;/em&gt;, which would help us ultimately determine the variance for &lt;em&gt;α&lt;/em&gt;.&lt;/p&gt;
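
&lt;p&gt;A rough sketch of this procedure in R might look like the following. It assumes the Portfolio dataset from the ISLR package, which contains 100 simulated pairs of returns X and Y, and uses the boot function from the boot library together with the &lt;em&gt;α&lt;/em&gt; equation above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # Assumed: provides the Portfolio dataset (100 simulated pairs of returns X and Y)
library(boot)

# Statistic function: estimate alpha from the observations selected by index
alpha_function = function(data, index){
    X = data$X[index]
    Y = data$Y[index]
    return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2*cov(X, Y)))
}

# Estimate of alpha from the original data
alpha_function(Portfolio, 1:100)

# Generate 1000 bootstrapped datasets and get 1000 estimates of alpha
# The output reports the original estimate, the bias, and the standard error of alpha
set.seed(1)
boot(Portfolio, alpha_function, R=1000)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;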

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/resampling-methods/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 5, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 5 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Validation Set Approach
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dplyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will work with the Auto dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The set.seed function allows reproducibility of results any time that random numbers are being generated&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# It takes an arbitrary integer value&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The Validation Set approach involves randomly splitting the data into two halves&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Determine how many rows of data we have&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;392&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;196&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Randomly choose which 196 rows to use for the training data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Create a training set&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Create a test set&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a simple linear model to the training data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We are attempting to predict a vehicle's Miles per Gallon with Horsepower as the predictor&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_lm_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the fitted model to predict on the held-out test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_lm_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine the Mean Squared Error (MSE) of the test data, which is a measure of the test error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predicted_mpg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_test_predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sq_error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;predicted_mpg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_lm_1_mse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sq_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a polynomial regression model instead, which seems to be a better choice&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_lm_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test_predictions_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_lm_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predicted_mpg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_test_predictions_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse_2&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sq_error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse_2&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse_2&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;predicted_mpg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_lm_2_mse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_test_mse_2&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;sq_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Running same models with different random training/test splits results in different errors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;392&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;196&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Repeat above models to see that MSE will be slightly different&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# This is because variance increases due to only using half the data to fit the model&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Leave-One-Out Cross Validation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For the cv.glm function&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We continue working with the Auto data&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We use the glm function instead of the lm function to fit a linear regression model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# This is because a glm object can be used with the cv.glm function from the boot library&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We fit the model to the entire Auto data because cv.glm will perform the LOOCV for us&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_glm_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform LOOCV and determine the test error&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_glm_1_cv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glmfit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_glm_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Perform LOOCV&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_glm_1_cv&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Test error is stored in the delta component&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# There are two values&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The first value is the standard CV estimate&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The second value is a bias-corrected CV estimate&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# What if we wanted to get LOOCV test error estimates for five different polynomial models?&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Easily fit five different polynomial models from degree 1 to 5 and compare their test errors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Create initial vector for the errors&amp;lt;/span&amp;gt;&amp;lt;span id="c3a1" class="de kv ii ef lp b db lt lu lv lw lx lr w ls"&amp;gt;# The for loop goes through all five degrees and stores their errors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;Auto_glm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glmfit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_glm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the different model degrees and their errors with ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Degree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors_plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Degree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Auto LOOCV Errors"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;subtitle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Polynomial Models (Degrees 1 to 5)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Degree"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"LOOCV Error"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_loocv_errors_plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  K-Fold Cross Validation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What if we wanted K-Fold test error estimates instead of LOOCV?&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We can easily repurpose the previous for loop for K-Fold CV&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We will perform 10-Fold CV on the Auto data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Create initial vector for the errors&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;Auto_glm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cv.glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glmfit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto_glm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Simply change the previous for loop by adding K=10&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot the different model degrees and their K-Fold errors with ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Degree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors_plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Degree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Auto 10-Fold Errors"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;subtitle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Polynomial Models (Degrees 1 to 5)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Degree"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10-Fold Error"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;geom_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto_tenfold_errors_plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bootstrap
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# The boot library is also for bootstrap&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Bootstrap can be used to assess the variability of coefficient estimates of a model&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# First, we create a simple function to get coefficient values from a linear regression model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef_function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;coefficients&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Test the function by fitting a linear model to the entire Auto dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;392&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# We specified data=Auto and index=1:392&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Now, we test the function with a dataset through the bootstrap resampling method&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Bootstrap creates a new dataset by randomly sampling from the original data WITH replacement&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;coef_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;392&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;392&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# So far, that was only one result from one bootstrap dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# What if we wanted to get 1000 results?&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Use the boot function to get 1000 bootstrap estimates for the coefficients and standard errors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;boot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;coef_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Compare results of the bootstrap estimates to the standard quadratic model on the original data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;horsepower&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;coef&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 4 — Classification</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:34:15 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-4-classification-5fo3</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-4-classification-5fo3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 4 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Overview of Classification
&lt;/h1&gt;

&lt;p&gt;Qualitative variables, such as gender, are known as categorical variables. Predicting qualitative responses is known as classification.&lt;/p&gt;

&lt;p&gt;Some real world examples of classification include determining whether or not a banking transaction is fraudulent, or determining whether or not an individual will default on credit card debt.&lt;/p&gt;

&lt;p&gt;The three most widely used classifiers, which are covered in this post, are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Logistic Regression&lt;/li&gt;
&lt;li&gt;  Linear Discriminant Analysis&lt;/li&gt;
&lt;li&gt;  K-Nearest Neighbors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also more advanced classifiers, which are covered later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Generalized Additive Models&lt;/li&gt;
&lt;li&gt;  Trees&lt;/li&gt;
&lt;li&gt;  Random Forests&lt;/li&gt;
&lt;li&gt;  Boosting&lt;/li&gt;
&lt;li&gt;  Support Vector Machines&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Logistic Regression
&lt;/h1&gt;

&lt;p&gt;Logistic regression models the probability that the response &lt;em&gt;Y&lt;/em&gt; belongs to a particular category.&lt;/p&gt;

&lt;p&gt;For example, assume that we have data on whether or not someone defaulted on their credit. The data includes one predictor for the credit balance that someone had.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460051797%2FtfP-kRqCb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460051797%2FtfP-kRqCb.png" alt="credit-balance-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logistic regression would model the probability of default, given credit balance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460118165%2FNipTt5yaF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460118165%2FNipTt5yaF.png" alt="probability-default.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, a probability threshold can be chosen for the classification.&lt;/p&gt;

&lt;p&gt;For example, if we choose a probability threshold of 50%, then we would indicate any observation with a probability of 50% or more as “default.”&lt;/p&gt;

&lt;p&gt;However, we could also choose a more conservative probability threshold, such as 10%. In this case, any observation with a probability of 10% or more would be indicated as “default.”&lt;/p&gt;
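
&lt;p&gt;A rough sketch of what this looks like in R, assuming the Default dataset from the ISLR package (with columns default, student, balance, and income):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # Assumed: provides the Default dataset (default, student, balance, income)

# Fit a logistic regression model of default on balance
Default_glm = glm(default ~ balance, data=Default, family=binomial)

# Predicted probabilities of default for each observation
Default_probs = predict(Default_glm, type="response")

# Classify with a 50% threshold, and with a more conservative 10% threshold
Default_pred_50 = ifelse(Default_probs &amp;gt;= 0.50, "Default", "No Default")
Default_pred_10 = ifelse(Default_probs &amp;gt;= 0.10, "Default", "No Default")

# Compare the predictions to the actual classes
table(Default_pred_50, Default$default)
table(Default_pred_10, Default$default)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;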

&lt;h2&gt;
  
  
  Logistic Model
&lt;/h2&gt;

&lt;p&gt;The logistic function is used to model the relationship between the probability (&lt;em&gt;Y&lt;/em&gt;) and some predictor (&lt;em&gt;X&lt;/em&gt;) because the function falls between 0 and 1 for all &lt;em&gt;X&lt;/em&gt; values. The logistic function has the form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460150732%2FIQuTseXQ8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460150732%2FIQuTseXQ8.png" alt="logistic-function-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logistic function always produces an S-shaped curve.&lt;/p&gt;

&lt;p&gt;Additionally, the logistic function can also be rewritten as a logit function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460168611%2FfziCm0YcX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460168611%2FfziCm0YcX.png" alt="logit-function.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logistic regression model for Credit Default data may look like the chart below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460205397%2FPkgr4AW0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460205397%2FPkgr4AW0k.png" alt="logistic-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpretation of Coefficients
&lt;/h2&gt;

&lt;p&gt;The logit equation can be interpreted as follows: a one-unit increase in &lt;em&gt;X&lt;/em&gt; changes the log-odds, or logit (the left side of the equation), by &lt;em&gt;β&lt;/em&gt;₁.&lt;/p&gt;

&lt;p&gt;The fraction inside the log() is known as the odds. In the context of the Credit Default data, the odds would indicate the ratio of the probability of defaulting versus the probability of not defaulting. For example, on average, 1 in 5 people with an odds of 1/4 will default. On the other hand, on average, 9 out of 10 people with an odds of 9/1 will default.&lt;/p&gt;

&lt;p&gt;So, alternatively, the logit function can also be interpreted as a one-unit increase in &lt;em&gt;X&lt;/em&gt; multiplying the odds by &lt;em&gt;e&lt;/em&gt;^&lt;em&gt;β&lt;/em&gt;₁.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;β&lt;/em&gt;₁ cannot be interpreted as a specific change in value for the probability. The only conclusion that can be made is that if the coefficient is positive, then an increase in &lt;em&gt;X&lt;/em&gt; will increase the probability, whereas if the coefficient is negative, then an increase in &lt;em&gt;X&lt;/em&gt; will decrease the probability.&lt;/p&gt;
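
&lt;p&gt;For example, in R (again assuming the Default dataset from the ISLR package), the fitted coefficients are on the log-odds scale, and exponentiating the balance coefficient gives the factor by which the odds of default are multiplied for each one-unit increase in balance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # Assumed: provides the Default dataset

Default_glm = glm(default ~ balance, data=Default, family=binomial)

coef(Default_glm)                 # Coefficients on the log-odds (logit) scale
exp(coef(Default_glm)["balance"]) # Factor by which the odds are multiplied per one-unit increase in balance
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;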

&lt;h2&gt;
  
  
  Maximum Likelihood
&lt;/h2&gt;

&lt;p&gt;The coefficients are estimated through the maximum likelihood method.&lt;/p&gt;

&lt;p&gt;In the context of the Credit Default data, the maximum likelihood estimate essentially attempts to find &lt;em&gt;β&lt;/em&gt;₀ and &lt;em&gt;β&lt;/em&gt;₁ such that plugging these estimates into the logistic function results in a number close to 1 for individuals who defaulted, and a number close to 0 for individuals who did not default.&lt;/p&gt;

&lt;p&gt;In other words, maximum likelihood chooses coefficients such that the predicted probability of each observation in the data corresponds as closely as possible to the actual observed status.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hypothesis Testing
&lt;/h2&gt;

&lt;p&gt;So, how do we determine whether or not there truly is a relationship between the probability of a class and some predictor?&lt;/p&gt;

&lt;p&gt;Similar to the linear regression setting, we conduct a hypothesis test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460222458%2FB72z4xhD6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460222458%2FB72z4xhD6.png" alt="hypothesis-test.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;z&lt;/em&gt;-statistic and &lt;em&gt;p&lt;/em&gt;-value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In logistic regression, we have a &lt;em&gt;z&lt;/em&gt;-statistic instead of the &lt;em&gt;t&lt;/em&gt;-statistic that we had in linear regression. However, they are essentially the same.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;z&lt;/em&gt;-statistic measures the number of standard deviations that &lt;em&gt;β&lt;/em&gt;₁ is away from 0.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;z&lt;/em&gt;-statistic allows us to determine the &lt;em&gt;p&lt;/em&gt;-value, which ultimately helps determine whether or not the coefficient is non-zero.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;p&lt;/em&gt;-value indicates how likely it is that we would observe an association between the probability of a class and some predictor &lt;em&gt;X&lt;/em&gt; simply by random chance, as opposed to there being a true relationship between them.&lt;/p&gt;

&lt;p&gt;Typically, we want &lt;em&gt;p&lt;/em&gt;-values less than 5% or 1% to reject the null hypothesis. In other words, rejecting the null hypothesis means that we are declaring that some relationship exists.&lt;/p&gt;
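
&lt;p&gt;In R, the &lt;em&gt;z&lt;/em&gt;-statistics and &lt;em&gt;p&lt;/em&gt;-values can be read directly from the coefficient table of a fitted logistic regression model. A short sketch, assuming the Default dataset from the ISLR package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # Assumed: provides the Default dataset

Default_glm = glm(default ~ balance, data=Default, family=binomial)

# The coefficient table reports the estimate, standard error, z-statistic, and p-value
summary(Default_glm)$coefficients
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;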

&lt;h1&gt;
  
  
  Multiple Logistic Regression
&lt;/h1&gt;

&lt;p&gt;What if our dataset had multiple predictors? For example, let’s expand our Credit Default dataset to include two additional predictors: student status and income.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460331055%2FcPehfc0Ql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460331055%2FcPehfc0Ql.png" alt="multiple-logistic-regression-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic Model
&lt;/h2&gt;

&lt;p&gt;Similar to how the simple linear regression model was extended to multiple linear regression, the logistic regression model is extended in a related fashion:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460370491%2FtwVl2BvYI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460370491%2FtwVl2BvYI.png" alt="multiple-logistic-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpretation of Coefficients
&lt;/h2&gt;

&lt;p&gt;The interpretation of the coefficients remains nearly the same. However, when interpreting one of the coefficients, we have to indicate that the values of the other predictors remain fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic Regression for Multiple Response Classes
&lt;/h2&gt;

&lt;p&gt;What if we had to classify observations in more than two classes? In the Credit Default data, we only had two classes: Default and No Default.&lt;/p&gt;

&lt;p&gt;For example, assume that we had a medical dataset and had to classify medical conditions as either a stroke, drug overdose, or seizure.&lt;/p&gt;

&lt;p&gt;The two-class logistic regression model has multiple-class extensions, but they are not used often in practice.&lt;/p&gt;

&lt;p&gt;Discriminant analysis is the popular approach for multiple-class classification.&lt;/p&gt;

&lt;h1&gt;
  
  
  Linear Discriminant Analysis
&lt;/h1&gt;

&lt;p&gt;Linear discriminant analysis is an alternative approach to classification that models the distributions of the different predictors separately in each of the response classes (&lt;em&gt;Y&lt;/em&gt;), and then uses Bayes’ theorem to flip these around into estimates of the probability that an observation belongs to each response class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear Discriminant Analysis vs Logistic Regression
&lt;/h2&gt;

&lt;p&gt;When classes are well separated in a dataset, logistic regression parameter estimates are unstable, whereas linear discriminant analysis does not have this problem.&lt;/p&gt;

&lt;p&gt;If the number of observations in a dataset is small, and the distribution of the predictors is approximately normal, then linear discriminant analysis will typically be more stable.&lt;/p&gt;

&lt;p&gt;Additionally, as mentioned previously, linear discriminant analysis is the popular approach for scenarios in which we have more than two classes in the response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bayes’ Theorem for Classification
&lt;/h2&gt;

&lt;p&gt;Assume that we have a qualitative response variable (&lt;em&gt;Y&lt;/em&gt;) that can take on &lt;em&gt;K&lt;/em&gt; distinct class values.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;π&lt;/em&gt; represents the prior probability that a randomly chosen observation comes from the &lt;em&gt;k&lt;/em&gt;th class.&lt;/p&gt;

&lt;p&gt;Bayes’ theorem states that:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460391563%2F1BeyQAwHN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460391563%2F1BeyQAwHN.png" alt="bayes-theorem.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the posterior probability that an observation belongs to some class &lt;em&gt;K&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The Bayes’ classifier has the lowest possible error rate out of all classifiers because it classifies an observation to the class for which the &lt;em&gt;Pr&lt;/em&gt;(&lt;em&gt;Y&lt;/em&gt;=&lt;em&gt;k&lt;/em&gt;|&lt;em&gt;X&lt;/em&gt;=&lt;em&gt;x&lt;/em&gt;) is largest.&lt;/p&gt;

&lt;p&gt;Estimating &lt;em&gt;π&lt;/em&gt; is easy if we have random sample data from a population. We simply determine the fraction of observations that belong to some class &lt;em&gt;K&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;However, estimating &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;) is more challenging unless we assume simple density forms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear Discriminant Analysis for One Predictor
&lt;/h2&gt;

&lt;p&gt;Suppose that &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;) is a normal distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;μ&lt;/em&gt; represents the class-specific mean.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;σ&lt;/em&gt;² represents the class-specific variance. However, we further assume that all classes share the same variance.&lt;/p&gt;

&lt;p&gt;The Bayes’ classifier assigns observations to the class for which the following is largest:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460404956%2FSZm3kNAPt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460404956%2FSZm3kNAPt.png" alt="bayes-assignment-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above equation is obtained through some mathematical simplification of the Bayes probability formula.&lt;/p&gt;

&lt;p&gt;However, the Bayes’ classifier can only be determined if we know that &lt;em&gt;X&lt;/em&gt; is drawn from a normal distribution, and know all of the parameters involved, which does not happen in real situations. Additionally, even if we were sure that &lt;em&gt;X&lt;/em&gt; was drawn from a normal distribution, we would still have to estimate the parameters.&lt;/p&gt;

&lt;p&gt;Linear discriminant analysis approximates the Bayes’ classifier by using these estimates in the previous equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460423524%2FBJvsHmC8-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460423524%2FBJvsHmC8-.png" alt="lda-approx-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460433159%2FNGR_d1vpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460433159%2FNGR_d1vpb.png" alt="lda-approx-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460440502%2FUBHdk7mhv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460440502%2FUBHdk7mhv.png" alt="lda-approx-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;n&lt;/em&gt; — represents the total number of observations&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;n&lt;/em&gt;(&lt;em&gt;k&lt;/em&gt;) — represents the total number of observations in class &lt;em&gt;k&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;K&lt;/em&gt; — represents the total number of classes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Linear discriminant analysis assumes a normal distribution and common variance among the classes.&lt;/p&gt;
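
&lt;p&gt;To make these estimates concrete, below is a minimal R sketch that computes the one-predictor LDA quantities by hand on a small simulated dataset (the data and variable names are illustrative assumptions, not from the book) and then evaluates the estimated discriminant function for a new observation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;set.seed(1)

# Simulate one predictor for two classes with different means and a shared variance
x = c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 2, sd = 1))
y = factor(rep(c("A", "B"), each = 50))

# Plug-in estimates: prior probabilities, class means, and the pooled variance
n      = length(x)
pi_hat = table(y) / n
mu_hat = tapply(x, y, mean)
s2_hat = sum((x - mu_hat[as.character(y)])^2) / (n - 2)  # pooled over K = 2 classes

# Estimated linear discriminant function for a new observation x0
delta_hat = function(x0, k) {
  unname(x0 * mu_hat[k] / s2_hat - mu_hat[k]^2 / (2 * s2_hat) + log(pi_hat[k]))
}

# Assign x0 = 1.5 to the class with the largest discriminant value
x0 = 1.5
scores = sapply(c("A", "B"), function(k) delta_hat(x0, k))
scores
names(which.max(scores))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;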

&lt;h2&gt;
  
  
  Linear Discriminant Analysis for Multiple Predictors
&lt;/h2&gt;

&lt;p&gt;Linear discriminant analysis can be extended to allow for multiple predictors. In this scenario, we assume that the predictors come from a multivariate normal distribution with class-specific means and a common covariance matrix. Additionally, we assume that each predictor follows a one-dimensional normal distribution with some correlation between each pair of predictors.&lt;/p&gt;

&lt;p&gt;For example, linear discriminant analysis could be used on the Credit Default dataset with multiple predictors, such as credit card balance and student status.&lt;/p&gt;

&lt;h1&gt;
  
  
  Model Assessment
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Confusion Matrix
&lt;/h2&gt;

&lt;p&gt;Fitting a linear discriminant model to the full &lt;a href="https://rdrr.io/cran/ISLR/man/Default.html" rel="noopener noreferrer"&gt;Credit Default data&lt;/a&gt; in the ISLR R package results in an overall training error rate of 2.75%. This may seem low at first, but there are a couple of key things to keep in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Training error rates will usually be lower than test error rates.&lt;/li&gt;
&lt;li&gt; The Credit Default dataset is skewed. Only 3.33% of people in the data defaulted. Therefore, a simple and useless classifier that always predicts that each individual will not default will have an error rate of 3.33%.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For these reasons, it is often of interest to look at a confusion matrix because binary classifiers can make two types of errors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Incorrectly assign an individual who defaults to the No Default category.&lt;/li&gt;
&lt;li&gt; Incorrectly assign an individual who does not default to the Default category.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A confusion matrix is a convenient way to display this information, and looks as follows for the linear discriminant model (50% probability threshold) fit to the full Credit Default data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460467298%2FG1Nos9phd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460467298%2FG1Nos9phd.png" alt="confusion-matrix-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The confusion matrix shows that out of the 333 individuals who defaulted, 252 were missed by linear discriminant analysis, which is a 75.7% error rate. This is known as a class-specific error rate.&lt;/p&gt;

&lt;p&gt;Linear discriminant analysis does a poor job of classifying customers who default because it is trying to approximate the Bayes’ classifier, which has the lowest &lt;strong&gt;total&lt;/strong&gt; error rate instead of class-specific error rate.&lt;/p&gt;

&lt;p&gt;The probability threshold for determining defaults could be lowered to improve the model. Lowering the probability threshold to 20% results in the following confusion matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460477542%2Fs6Kp1I8hS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460477542%2Fs6Kp1I8hS.png" alt="confusion-matrix-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, linear discriminant analysis correctly predicts 195 individuals who defaulted, out of the 333 total. This is an improvement over the previous error rate. However, 235 individuals who do not default are classified as defaulters, compared to only 23 previously. This is the tradeoff that results from lowering the probability threshold.&lt;/p&gt;

&lt;p&gt;The threshold value to use in a real situation should be based on domain and industry knowledge, such as information about the costs associated with defaults.&lt;/p&gt;
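
&lt;p&gt;As a rough sketch of how this analysis could be reproduced in R, using the lda function from MASS and the confusionMatrix function from caret (the same functions used in the code section below), the following fits LDA to the Default data and builds confusion matrices at the 50% and 20% thresholds. The choice of predictors here is an assumption, so the counts need not match the tables above exactly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(MASS)   # lda()
library(ISLR)   # Default dataset
library(caret)  # confusionMatrix()

# Fit LDA to the full Default data (predictor choice is an assumption)
default_lda = lda(default ~ balance + student, data = Default)

# Posterior probability of default for every observation
post = predict(default_lda, Default)$posterior[, "Yes"]

# Confusion matrix with the usual 50% threshold
pred_50 = factor(ifelse(post &amp;gt;= 0.50, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(pred_50, Default$default, positive = "Yes")

# Lowering the threshold to 20% catches more defaulters,
# at the cost of flagging more non-defaulters
pred_20 = factor(ifelse(post &amp;gt;= 0.20, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(pred_20, Default$default, positive = "Yes")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;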

&lt;h2&gt;
  
  
  ROC Curve
&lt;/h2&gt;

&lt;p&gt;The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.&lt;/p&gt;

&lt;p&gt;The overall performance of a classifier summarized over all possible thresholds is given by the area under the curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier. A classifier that performs no better than chance will have an AUC of 0.50.&lt;/p&gt;

&lt;p&gt;Below is an example of an ideal ROC curve (blue) versus an ROC curve that indicates that the model performs no better than chance (black).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460491736%2FyM4Q4KvlY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605460491736%2FyM4Q4KvlY.png" alt="roc-curve.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ROC curves are useful for comparing different classifiers because they account for all possible probability threshold values.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Y&lt;/em&gt; axis indicates the True Positive Rate, which is also known as the sensitivity. In the context of the Credit Default data, it represents the fraction of defaulters who are correctly classified.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;X&lt;/em&gt; axis indicates the False Positive Rate, which is also known as one minus the specificity. In the context of the Credit Default data, it represents the fraction of non-defaulters who are incorrectly classified.&lt;/p&gt;

&lt;p&gt;When comparing several classification models on the same dataset, the model whose ROC curve has a noticeably larger AUC is generally the one to prefer.&lt;/p&gt;
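
&lt;p&gt;One way to draw an ROC curve and compute the AUC in R is sketched below with the ROCR package (an assumption, since this package is not used elsewhere in the guide), applied to LDA posterior probabilities of default on the Default data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(MASS)   # lda()
library(ISLR)   # Default dataset
library(ROCR)   # prediction() and performance()

default_lda = lda(default ~ balance + student, data = Default)
post = predict(default_lda, Default)$posterior[, "Yes"]

# True positive rate (sensitivity) against false positive rate (1 - specificity)
roc_pred = prediction(post, Default$default)
roc_perf = performance(roc_pred, measure = "tpr", x.measure = "fpr")

plot(roc_perf, col = "blue")   # the ROC curve
abline(a = 0, b = 1, lty = 2)  # the "no better than chance" diagonal

# Area under the curve
performance(roc_pred, measure = "auc")@y.values[[1]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;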

&lt;h1&gt;
  
  
  Quadratic Discriminant Analysis
&lt;/h1&gt;

&lt;p&gt;Linear discriminant analysis assumes that observations within each class are drawn from a multivariate normal distribution with class-specific means and a common covariance matrix for all of the classes.&lt;/p&gt;

&lt;p&gt;Quadratic discriminant analysis assumes that each class has its own covariance matrix. In other words, quadratic discriminant analysis relaxes the assumption of the common covariance matrix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear vs Quadratic Discriminant Analysis
&lt;/h2&gt;

&lt;p&gt;Which method is better for classification? LDA or QDA? The answer lies in the bias-variance tradeoff.&lt;/p&gt;

&lt;p&gt;LDA is a less flexible classifier, meaning it has lower variance than QDA. However, if the assumption of the common covariance matrix is badly off, then LDA could suffer from high bias.&lt;/p&gt;

&lt;p&gt;In general, LDA tends to be a better classifier than QDA if there are relatively few observations in the training data because reducing variance is crucial in this case.&lt;/p&gt;

&lt;p&gt;In general, QDA is recommended over LDA if the training data is large, meaning that the variance of the classifier is not a major concern. QDA is also recommended over LDA if the assumption of the common covariance matrix is flawed.&lt;/p&gt;
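
&lt;p&gt;This tradeoff can be illustrated with a small simulation. The sketch below (toy data, purely illustrative) draws two classes from bivariate normal distributions with clearly different covariance matrices, so QDA’s extra flexibility should typically win on test error:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(MASS)  # mvrnorm(), lda(), qda()
set.seed(1)

# Class 1: weak positive correlation; Class 2: strong negative correlation
sigma_1 = matrix(c(1, 0.2, 0.2, 1), nrow = 2)
sigma_2 = matrix(c(1, -0.7, -0.7, 1), nrow = 2)

make_data = function(n) {
  x = rbind(mvrnorm(n, mu = c(0, 0), Sigma = sigma_1),
            mvrnorm(n, mu = c(1, 1), Sigma = sigma_2))
  data.frame(x1 = x[, 1], x2 = x[, 2],
             class = factor(rep(c("1", "2"), each = n)))
}

train = make_data(100)
test  = make_data(1000)

lda_fit = lda(class ~ x1 + x2, data = train)
qda_fit = qda(class ~ x1 + x2, data = train)

# Test error rates: QDA should usually win here because its class-specific
# covariance assumption matches how the data were generated
mean(predict(lda_fit, test)$class != test$class)
mean(predict(qda_fit, test)$class != test$class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;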

&lt;h1&gt;
  
  
  K-Nearest Neighbors
&lt;/h1&gt;

&lt;p&gt;K-Nearest Neighbors (KNN) is a popular nonparametric classifier method.&lt;/p&gt;

&lt;p&gt;Given a positive integer &lt;em&gt;K&lt;/em&gt; and some test observation, the KNN classifier identifies the &lt;em&gt;K&lt;/em&gt; points in the training data that are closest to the test observation. These closest &lt;em&gt;K&lt;/em&gt; points are represented by &lt;em&gt;N&lt;/em&gt;₀. Then, it estimates the conditional probability for a class as the fraction of points in &lt;em&gt;N&lt;/em&gt;₀ that belong to that class. Lastly, KNN applies Bayes’ rule and classifies the test observation to the class with the largest probability.&lt;/p&gt;

&lt;p&gt;However, the choice of the &lt;em&gt;K&lt;/em&gt; value is very important. Lower values of &lt;em&gt;K&lt;/em&gt; produce a more flexible classifier with low bias but high variance, whereas higher values produce a less flexible classifier with lower variance but more bias. Similar to the regression setting, a bias-variance tradeoff exists.&lt;/p&gt;
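
&lt;p&gt;To see the effect of &lt;em&gt;K&lt;/em&gt; concretely, the sketch below (toy simulated data, purely illustrative) compares test error rates for several values of &lt;em&gt;K&lt;/em&gt; using the knn function from the class library, which also appears in the code section below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(class)  # knn()
set.seed(1)

# Two overlapping classes with two predictors each
train_x = rbind(matrix(rnorm(200), ncol = 2),
                matrix(rnorm(200, mean = 1), ncol = 2))
test_x  = rbind(matrix(rnorm(2000), ncol = 2),
                matrix(rnorm(2000, mean = 1), ncol = 2))
train_y = factor(rep(c("A", "B"), each = 100))
test_y  = factor(rep(c("A", "B"), each = 1000))

# Test error rate across several K values
# Small K: flexible fit (low bias, high variance); large K: smoother fit (more bias)
sapply(c(1, 5, 15, 51), function(k) {
  pred = knn(train_x, test_x, train_y, k = k)
  mean(pred != test_y)
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;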

&lt;h1&gt;
  
  
  Comparison of Classification Methods
&lt;/h1&gt;

&lt;p&gt;There are four main classification methods: logistic regression, LDA, QDA, and KNN.&lt;/p&gt;

&lt;p&gt;Logistic regression and LDA are closely connected. They both produce linear decision boundaries. However, LDA may provide improvement over logistic regression when the assumption of the normal distribution with common covariance for classes holds. Additionally, LDA may be better when the classes are well separated. On the other hand, logistic regression outperforms LDA when the normal distribution assumption is not met.&lt;/p&gt;

&lt;p&gt;KNN is a completely non-parametric approach. There are no assumptions made about the shape of the decision boundary. KNN will outperform both logistic regression and LDA when the decision boundary is highly nonlinear. However, KNN does not indicate which predictors are important.&lt;/p&gt;

&lt;p&gt;QDA serves as a compromise between the nonparametric KNN method and the linear LDA and logistic regression methods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/classification/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 4, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 4 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Logistic Regression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For model functions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For datasets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For plotting&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dplyr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For easy data manipulation functions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For confusion matrix function&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e1071&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Requirement for caret library&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Working with the Stock Market dataset to predict direction of stock market movement (up/down)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a logistic regression model to the data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The direction of the stock market is the response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The lags (% change in market on previous day, 2 days ago, etc.) and trade volume are the predictors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"binomial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the contrasts function to determine which direction is indicated as "1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;contrasts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;##      Up&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## Down  0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## Up    1&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the names function to determine the objects in the logistic model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get confusion matrix to determine accuracy of the logistic model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_1&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fitted.values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"f484"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"de ke ii ef ly b db mc md me mf mg ma w mb"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Down"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"9340"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"de ke ii ef ly b db mc md me mf mg ma w mb"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_1&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_1&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Down"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span 
class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bf0d"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"de ke ii ef ly b db mc md me mf mg ma w mb"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_1&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Instead of fitting logistic model to the entire data, split the data into training/test sets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Year&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2004&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Year&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2005&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Volume&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"binomial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the model to make predictions on the test data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get confusion matrix to determine accuracy on the test dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"e401"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"de ke ii ef ly b db mc md me mf mg ma w mb"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Down"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"57f9"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"de ke ii ef ly b db mc md me mf mg ma w mb"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;factor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;levels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Down"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"5c57"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"de ke ii ef ly b db mc md me mf mg ma w mb"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a better logistic model that only considers the most important predictors (Lag1, Lag2)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_logistic_3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;glm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"binomial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Repeat the procedure to obtain a confusion matrix for the test data ...&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Linear Discriminant Analysis
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Continuing to work with the Stock Market dataset&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The lda function is used to fit a linear discriminant model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# View the prior probabilities, group means, and coefficients&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Plot of the linear discriminants of each observation in the training data, separated by class&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the model to make predictions on the test dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# View the confusion matrix to assess accuracy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Notice that a predicted probability &amp;gt;= 50% actually corresponds to "Down" in LDA&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_lda_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quadratic Discriminant Analysis
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Continue to work with the Stock Market dataset&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# The qda function is used to fit a quadratic discriminant model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# View the prior probabilities and group means&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the model to make predictions on the test dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# View the confusion matrix to assess accuracy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Up"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Notice that a predicted probability &amp;gt;= 50% actually corresponds to "Down" in QDA&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_qda_predictions&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  KNN
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Continue to work with the Stock Market dataset&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# The class library is used for KNN&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Before performing KNN, separate dataframes are made for the predictors and responses&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_train_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Lag1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Lag2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_train_response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_train&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Smarket_test_response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Direction&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform KNN with K=3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_knn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_train_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="n"&gt;Smarket_test_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="n"&gt;Smarket_train_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# See a confusion matrix to assess the accuracy of KNN&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Smarket_predictions_knn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Smarket_test_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Next, we will use KNN on the Caravan data to predict whether or not someone will purchase caravan insurance&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The data contains demographic data on individuals, and whether or not insurance was bought&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Before performing KNN, predictors should be scaled to have mean 0 and standard deviation 1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Caravan_scaled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Caravan&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;-86&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# We will designate the first 1000 rows as the test data and the remaining as training data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Create separate datasets for the predictors and responses&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Caravan_test_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Caravan_scaled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Caravan_train_predictors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Caravan_scaled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1001&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5822&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Caravan_test_response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Caravan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;86&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Caravan_train_response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Caravan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1001&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5822&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;86&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Perform KNN&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;set.seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Caravan_knn_predictions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Caravan_train_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="n"&gt;Caravan_test_predictors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="n"&gt;Caravan_train_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Assess accuracy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confusionMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Caravan_knn_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Caravan_test_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Yes"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# 26.67% accuracy is better than 6% from random guessing&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 3 — Linear Regression</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:32:05 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-3-linear-regression-mpc</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-3-linear-regression-mpc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 3 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Simple Linear Regression
&lt;/h1&gt;

&lt;p&gt;Simple linear regression assumes a linear relationship between the predictor (&lt;em&gt;X&lt;/em&gt;) and the response (&lt;em&gt;Y&lt;/em&gt;). A simple linear regression model takes the following form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605457886425%2FSU1VGMtau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605457886425%2FSU1VGMtau.png" alt="linear-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;y-hat&lt;/em&gt; — represents the predicted value&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;β&lt;/em&gt;₀ — represents a coefficient known as the intercept&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;β&lt;/em&gt;₁ — represents a coefficient known as the slope&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;X&lt;/em&gt; — represents the value of the predictor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, we could build a simple linear regression model from the following statistician salary dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458057641%2FozEycksWK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458057641%2FozEycksWK.png" alt="statistician-salary-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The simple linear regression model could be written as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458105495%2FOWHH7iFpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458105495%2FOWHH7iFpe.png" alt="statistician-salary-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Least Squares Criteria
&lt;/h2&gt;

&lt;p&gt;The best estimates for the coefficients (&lt;em&gt;β&lt;/em&gt;₀, &lt;em&gt;β&lt;/em&gt;₁) are obtained by finding the regression line that fits the training dataset points as closely as possible. This line can be obtained by minimizing the least squares criteria.&lt;/p&gt;

&lt;p&gt;What does it mean to minimize the least squares criteria? Let’s use the example of the regression model for the statistician salaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458105495%2FOWHH7iFpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458105495%2FOWHH7iFpe.png" alt="statistician-salary-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The difference between the actual salary value and the predicted salary is known as the residual (&lt;em&gt;e&lt;/em&gt;). The residual sum of squares (RSS) is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458149287%2Fph9k_4cI3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458149287%2Fph9k_4cI3.png" alt="rss-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The least squares criteria chooses the &lt;em&gt;β&lt;/em&gt; coefficient values that minimize the RSS.&lt;/p&gt;

&lt;p&gt;For our statistician salary dataset, the linear regression model determined through the least squares criteria is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458105495%2FOWHH7iFpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458105495%2FOWHH7iFpe.png" alt="statistician-salary-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;β&lt;/em&gt;₀ is $70,545&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;β&lt;/em&gt;₁ is $2,576&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This final regression model can be visualized by the orange line below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458274334%2F1hi9B-Svb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458274334%2F1hi9B-Svb.png" alt="statistican-salary-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;
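
&lt;p&gt;As a quick sketch of what this looks like in R, we can fit a least squares line with lm() and compute the RSS from the residuals. The salaries data frame below is made up for illustration and is not the dataset shown in the figures:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical salary data: years of experience and salary
salaries = data.frame(
  Experience = c(1, 2, 3, 4, 5, 6, 7, 8),
  Salary = c(73000, 75500, 78900, 80100, 83800, 86200, 88500, 91600)
)

# Fit the simple linear regression by least squares
salary_lm = lm(Salary ~ Experience, data=salaries)

coef(salary_lm) # Estimated intercept (beta_0) and slope (beta_1)
sum(residuals(salary_lm)^2) # The RSS that least squares minimizes
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;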

&lt;h2&gt;
  
  
  Interpretation of β Coefficients​
&lt;/h2&gt;

&lt;p&gt;How do we interpret the coefficients of a simple linear regression model in plain English? In general, we say that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If the predictor (&lt;em&gt;X&lt;/em&gt;) were 0, the prediction (&lt;em&gt;Y&lt;/em&gt;) would be &lt;em&gt;β&lt;/em&gt;₀, on average.&lt;/li&gt;
&lt;li&gt;  For every one increase in the predictor, the prediction changes by &lt;em&gt;β&lt;/em&gt;₁, on average.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using the example of the final statistician salary regression model, we would conclude that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If a statistician had 0 years of experience, he/she would have an entry-level salary of $70,545, on average.&lt;/li&gt;
&lt;li&gt;  For every one additional year of experience that a statistician has, his/her salary increases by $2,576, on average.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  True Population Regression vs Least Squares Regression
&lt;/h2&gt;

&lt;p&gt;The true population regression line represents the “true” relationship between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;. However, we never know the true relationship, so we use least squares regression to estimate it with the data that we have available.&lt;/p&gt;

&lt;p&gt;For example, assume that the true population regression line for statistician salaries was represented by the black line below. The least squares regression line, represented by the orange line, is close to the true population regression line, but not exactly the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458291203%2Fc_nOmTa9W.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458291203%2Fc_nOmTa9W.png" alt="statistician-salary-true-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, how do we estimate how accurate the least squares regression line is as an estimate of the true population regression line?&lt;/p&gt;

&lt;p&gt;We compute the standard error of the coefficients and determine the confidence interval.&lt;/p&gt;

&lt;p&gt;The standard error is a measure of the accuracy of an estimate. Knowing how to calculate standard errors by hand is not important, as programs like R compute them for us.&lt;/p&gt;

&lt;p&gt;Standard errors are used to compute confidence intervals, which provide an estimate of how accurate the least squares regression line is.&lt;/p&gt;

&lt;p&gt;The most commonly used confidence interval is the 95% confidence interval.&lt;/p&gt;

&lt;p&gt;The confidence interval is generally interpreted as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  There is a 95% probability that the interval contains the true population value of the coefficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, for the statistician salary regression model, the confidence intervals are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;β&lt;/em&gt;₀ = [67852,72281]&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;β&lt;/em&gt;₁ = [1989,3417]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the context of the statistician salaries, these confidence intervals are interpreted as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  In the absence of any years of experience, the salary of an entry-level statistician will fall between $67,852 and $72,281.&lt;/li&gt;
&lt;li&gt;  For each one additional year of experience, a statistician’s salary will increase between $1,989 and $3,417.&lt;/li&gt;
&lt;/ul&gt;
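
&lt;p&gt;In R, these intervals come from confint(). A minimal sketch, continuing the hypothetical salary_lm fit from the earlier code sketch (so its numbers will not match the article’s example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Standard errors of the coefficient estimates
summary(salary_lm)$coefficients[, "Std. Error"]

# 95% confidence intervals for beta_0 and beta_1, built from those standard errors
confint(salary_lm, level=0.95)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;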

&lt;h2&gt;
  
  
  Hypothesis Testing
&lt;/h2&gt;

&lt;p&gt;So, how do we determine whether or not there truly is a relationship between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;? In other words, how do we know that &lt;em&gt;X&lt;/em&gt; is actually a good predictor for &lt;em&gt;Y&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;We use the standard errors to perform hypothesis tests on the coefficients.&lt;/p&gt;

&lt;p&gt;The most common hypothesis test involves testing the null hypothesis versus the alternative hypothesis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Null Hypothesis: No relationship between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Alternative Hypothesis: Some relationship between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;t&lt;/em&gt;-statistic and &lt;em&gt;p&lt;/em&gt;-value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, how do we determine if &lt;em&gt;β&lt;/em&gt;₁ is non-zero? We use the estimated value of the coefficient and its standard error to determine the &lt;em&gt;t&lt;/em&gt;-statistic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458313160%2Fk2aTPTZEQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458313160%2Fk2aTPTZEQ.png" alt="t-statistic-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;t&lt;/em&gt;-statistic measures the number of standard deviations that &lt;em&gt;β&lt;/em&gt;₁ is away from 0.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;t&lt;/em&gt;-statistic allows us to determine something known as the &lt;em&gt;p&lt;/em&gt;-value, which ultimately helps determine whether or not the coefficient is non-zero.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;p&lt;/em&gt;-value indicates how likely it is to observe an association between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt; as strong as the one in our data purely by random chance, assuming there is no true relationship between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Typically, we want &lt;em&gt;p&lt;/em&gt;-values less than 5% or 1% to reject the null hypothesis. In other words, rejecting the null hypothesis means that we are declaring that some relationship exists between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;.&lt;/p&gt;
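
&lt;p&gt;In R, the coefficient table reported by summary() contains the t-statistic and p-value for each coefficient. A sketch, again using the hypothetical salary_lm fit from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Estimate, Std. Error, t value (Estimate / Std. Error), and Pr(&gt;|t|) (the p-value)
summary(salary_lm)$coefficients

# The slope's t-statistic computed by hand matches the table
est = summary(salary_lm)$coefficients["Experience", "Estimate"]
se = summary(salary_lm)$coefficients["Experience", "Std. Error"]
est / se
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;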

&lt;h2&gt;
  
  
  Assessing Model Accuracy
&lt;/h2&gt;

&lt;p&gt;There are two main measures of how well a model fits the data: RSE and &lt;em&gt;R&lt;/em&gt;².&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Residual Standard Error (RSE)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The RSE is a measure of the standard deviation of the random error term (&lt;em&gt;ϵ&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;In other words, it is the average amount that the actual response will deviate from the true regression line. It is a measure of the lack of fit of a model.&lt;/p&gt;

&lt;p&gt;The value of RSE and whether or not it is acceptable will depend on the context of the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R-Squared (&lt;em&gt;R&lt;/em&gt;²)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;R&lt;/em&gt;² measures the proportion of variability in &lt;em&gt;Y&lt;/em&gt; that can be explained by using &lt;em&gt;X&lt;/em&gt;. It is a proportion that is calculated as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458335914%2F6cJV4XND9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458335914%2F6cJV4XND9.png" alt="r-squared-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TSS is a measure of variability that is already inherent in the response variable before regression is performed.&lt;/p&gt;

&lt;p&gt;RSS is a measure of variability in the response variable after regression is performed.&lt;/p&gt;

&lt;p&gt;The final statistician salary regression model has an &lt;em&gt;R&lt;/em&gt;² of 0.90, meaning that 90% of the variability in the salaries of statisticians is explained by using years of experience as a predictor.&lt;/p&gt;
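
&lt;p&gt;Both measures are reported by summary() in R. A sketch that continues the hypothetical salary_lm fit from the earlier code (the 0.90 figure belongs to the article’s example, not this made-up data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;summary(salary_lm)$sigma # RSE: typical deviation of the response from the fitted line
summary(salary_lm)$r.squared # R-squared: proportion of variance in Salary explained

# R-squared by hand: 1 - RSS/TSS
rss = sum(residuals(salary_lm)^2)
tss = sum((salaries$Salary - mean(salaries$Salary))^2)
1 - rss / tss
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;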

&lt;h1&gt;
  
  
  Multiple Linear Regression
&lt;/h1&gt;

&lt;p&gt;Simple linear regression is useful for prediction if there is only one predictor. But what if we had multiple predictors? Multiple linear regression allows for multiple predictors, and takes the following form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458358050%2FEdQIyVYbS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458358050%2FEdQIyVYbS.png" alt="multiple-linear-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, let’s take the statistician salary dataset, add a new predictor for college GPA, and add 10 new data points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458377989%2F_c9Zzgd48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458377989%2F_c9Zzgd48.png" alt="statistician-salary-gpa-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The multiple linear regression model for the dataset would take the form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458416875%2F-B0b0QyRJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458416875%2F-B0b0QyRJ.png" alt="salary-multiple-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The multiple linear regression model would fit a plane to the dataset. The dataset is represented below as a 3D scatter plot with an &lt;em&gt;X&lt;/em&gt;, &lt;em&gt;Y&lt;/em&gt;, and &lt;em&gt;Z&lt;/em&gt; axis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458436752%2F2BVoKhaBM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458436752%2F2BVoKhaBM.png" alt="statistician-salary-multiple-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;
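
&lt;p&gt;In R, additional predictors are simply added to the lm() formula. A sketch that extends the hypothetical salaries data frame from the earlier code with a made-up GPA column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical GPA values for the same eight statisticians
salaries$GPA = c(3.1, 3.4, 3.0, 3.7, 3.5, 3.2, 3.8, 3.6)

# Multiple linear regression: one slope per predictor, fitting a plane to the data
salary_mlm = lm(Salary ~ Experience + GPA, data=salaries)
coef(salary_mlm)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;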

&lt;p&gt;In multiple linear regression, we’re interested in a few specific questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Is at least one of the predictors useful in predicting the response?&lt;/li&gt;
&lt;li&gt; Do all of the predictors help explain &lt;em&gt;Y&lt;/em&gt;, or only a few of them?&lt;/li&gt;
&lt;li&gt; How well does the model fit the data?&lt;/li&gt;
&lt;li&gt; How accurate is our prediction?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Least Squares Criteria
&lt;/h2&gt;

&lt;p&gt;Similar to simple linear regression, the coefficient estimates in multiple linear regression are chosen based on the same least squares approach that minimizes RSS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpretation of β Coefficients​
&lt;/h2&gt;

&lt;p&gt;The interpretation of the coefficients is also very similar to the simple linear regression setting, with one key difference (indicated in bold). In general, we say that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If all of the predictors were 0, the prediction (&lt;em&gt;Y&lt;/em&gt;) would be &lt;em&gt;β&lt;/em&gt;₀, on average.&lt;/li&gt;
&lt;li&gt;  For every one increase in some predictor &lt;em&gt;X&lt;/em&gt;ⱼ, the prediction changes by &lt;em&gt;β&lt;/em&gt;ⱼ, on average, &lt;strong&gt;holding all of the other predictors constant&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hypothesis Testing
&lt;/h2&gt;

&lt;p&gt;So, how do we determine whether or not there truly is a relationship between the &lt;em&gt;X&lt;/em&gt;s and &lt;em&gt;Y&lt;/em&gt;? In other words, how do we know that the &lt;em&gt;X&lt;/em&gt;s are actually good predictors for &lt;em&gt;Y&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Similar to the simple linear regression setting, we perform a hypothesis test. The null and alternative hypotheses are slightly different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Null Hypothesis: No relationship between the &lt;em&gt;X&lt;/em&gt;s and &lt;em&gt;Y&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Alternative Hypothesis: At least one predictor has a relationship to the response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;F&lt;/em&gt;-statistic and &lt;em&gt;p&lt;/em&gt;-value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, how do we determine if at least one &lt;em&gt;β&lt;/em&gt;ⱼ is non-zero? In simple regression, we determined the &lt;em&gt;t&lt;/em&gt;-statistic. In multiple regression, we determine the &lt;em&gt;F&lt;/em&gt;-statistic instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458458477%2FaIVg--phX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458458477%2FaIVg--phX.png" alt="f-statistic-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When there is no relationship between the response and predictors, we generally expect the &lt;em&gt;F&lt;/em&gt;-statistic to be close to 1.&lt;/p&gt;

&lt;p&gt;If there is a relationship, we generally expect the &lt;em&gt;F&lt;/em&gt;-statistic to be greater than 1.&lt;/p&gt;

&lt;p&gt;Similar to the &lt;em&gt;t&lt;/em&gt;-statistic, the &lt;em&gt;F&lt;/em&gt;-statistic also allows us to determine the &lt;em&gt;p&lt;/em&gt;-value, which ultimately helps decide whether or not a relationship exists.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;p&lt;/em&gt;-value is essentially interpreted in the same way that it is interpreted in simple regression.&lt;/p&gt;

&lt;p&gt;Typically, we want &lt;em&gt;p&lt;/em&gt;-values less than 5% or 1% to reject the null hypothesis. In other words, rejecting the null hypothesis means that we are declaring that some relationship exists between the &lt;em&gt;X&lt;/em&gt;s and &lt;em&gt;Y&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;t&lt;/em&gt;-statistics in Multiple Linear Regression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In multiple linear regression, we will receive outputs that indicate the &lt;em&gt;t&lt;/em&gt;-statistic and &lt;em&gt;p&lt;/em&gt;-values for each of the different coefficients.&lt;/p&gt;

&lt;p&gt;However, we have to use the overall &lt;em&gt;F&lt;/em&gt;-statistic instead of the individual coefficient &lt;em&gt;p&lt;/em&gt;-values. This is because when the number of predictors is large (e.g. p=100), about 5% of the coefficients will have low &lt;em&gt;p&lt;/em&gt;-values less than 5% just by chance. Therefore, in this scenario, choosing whether or not to reject the null hypothesis based on the individual &lt;em&gt;p&lt;/em&gt;-values would be flawed.&lt;/p&gt;
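
&lt;p&gt;In R, summary() reports the overall F-statistic alongside the per-coefficient t-statistics. A sketch using the hypothetical salary_mlm fit from the earlier code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Overall F-statistic: its value, numerator df, and denominator df
summary(salary_mlm)$fstatistic

# Per-coefficient t-statistics and p-values are also reported, but with many
# predictors some will look significant purely by chance, so check the F-test first
summary(salary_mlm)$coefficients
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;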

&lt;h2&gt;
  
  
  Determining Variable Significance
&lt;/h2&gt;

&lt;p&gt;So, after concluding that at least one predictor is related to the response, how do we determine which specific predictors are significant?&lt;/p&gt;

&lt;p&gt;This process is called variable selection, and there are three approaches: forward selection, backward selection, and mixed selection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward Selection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Assume that we had a dataset of credit card balance and 10 predictors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458499907%2FDHITbrr6W.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458499907%2FDHITbrr6W.png" alt="credit-balance-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Forward selection begins with a null model with no predictors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458611675%2FOM1WCUn_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458611675%2FOM1WCUn_0.png" alt="forward-selection-0.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, 10 different simple linear regression models are built for each of the predictors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458524534%2F-zl4sFjo_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458524534%2F-zl4sFjo_.png" alt="forward-selection-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458536066%2F88Mk8o0xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458536066%2F88Mk8o0xf.png" alt="forward-selection-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458548199%2F7e8hPaAAF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458548199%2F7e8hPaAAF.png" alt="forward-selection-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458558757%2FObw8xTA54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458558757%2FObw8xTA54.png" alt="forward-selection-4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The predictor that results in the lowest RSS is then added to the initial null model. Assume that the Limit variable is the variable that results in the lowest RSS. The forward selection model would then become:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458575085%2FGuH4LLX9I.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458575085%2FGuH4LLX9I.png" alt="forward-selection-5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, a second predictor is added to this new model, which will result in building 9 different multiple linear regression models for the remaining predictors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458588342%2F-vaVDD5mQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458588342%2F-vaVDD5mQ.png" alt="forward-selection-6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458596972%2FoZJLwTf2G.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458596972%2FoZJLwTf2G.png" alt="forward-selection-7.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458606491%2F-yCw_FIhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458606491%2F-yCw_FIhg.png" alt="forward-selection-8.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458617664%2FgSfMOt5at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458617664%2FgSfMOt5at.png" alt="forward-selection-9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second predictor that results in the lowest RSS is then added to the model. Assume that the model with the Income variable resulted in the lowest RSS. The forward selection model would then become:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458713057%2FI1-U-Oyua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458713057%2FI1-U-Oyua.png" alt="forward-selection-10.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This process of adding predictors is continued until some statistical stopping rule is satisfied.&lt;/p&gt;
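
&lt;p&gt;A rough sketch of a single forward selection step in R, using the Credit dataset from the ISLR package (a credit balance dataset similar to the one described above) and only a few of its candidate predictors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # For the Credit dataset (Balance plus several predictors)

# One forward selection step: fit a simple model for each candidate predictor
# and keep the one whose model has the lowest RSS
candidates = c("Income", "Limit", "Rating", "Age")

rss_per_model = sapply(candidates, function(p) {
  fit = lm(as.formula(paste("Balance ~", p)), data=Credit)
  sum(residuals(fit)^2)
})

names(which.min(rss_per_model)) # This predictor is added to the null model first
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;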

&lt;p&gt;&lt;strong&gt;Backward Selection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backward selection begins with a model with all predictors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458729161%2FfXHFsJ6l0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458729161%2FfXHFsJ6l0.png" alt="backward-selection.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, the variable with the largest &lt;em&gt;p&lt;/em&gt;-value is removed, and the new model is fit.&lt;/p&gt;

&lt;p&gt;Again, the variable with the largest &lt;em&gt;p&lt;/em&gt;-value is removed, and the new model is fit.&lt;/p&gt;

&lt;p&gt;This process is continued until some statistical stopping rule is satisfied, such as all variables in the model having low &lt;em&gt;p&lt;/em&gt;-values less than 5%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed Selection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mixed selection is a combination of forward and backward selection. We begin with a null model that contains no predictors. Then, variables are added one by one, exactly as done in forward selection. However, if at any point the &lt;em&gt;p&lt;/em&gt;-value for some variable rises above a chosen threshold, then it is removed from the model, as done in backward selection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Assessing Model Accuracy
&lt;/h2&gt;

&lt;p&gt;Similar to the simple regression setting, RSE and &lt;em&gt;R&lt;/em&gt;² are used to determine how well the model fits the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RSE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In multiple linear regression, RSE is calculated as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458746267%2Fu55r7M5dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458746267%2Fu55r7M5dp.png" alt="rse-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R-Squared (&lt;em&gt;R&lt;/em&gt;²)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;R&lt;/em&gt;² is interpreted in the same manner that it is interpreted in simple regression.&lt;/p&gt;

&lt;p&gt;However, in multiple linear regression, adding more predictors to the model will always result in an increase in &lt;em&gt;R&lt;/em&gt;².&lt;/p&gt;

&lt;p&gt;Therefore, it is important to look at the magnitude at which &lt;em&gt;R&lt;/em&gt;² changes when adding or removing a variable. A small change will generally indicate an insignificant variable, whereas a large change will generally indicate a significant variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Assessing Prediction Accuracy
&lt;/h2&gt;

&lt;p&gt;How do we estimate how accurate the actual predictions are? Confidence intervals and prediction intervals can help assess prediction accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidence Intervals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confidence intervals are determined from the &lt;em&gt;β&lt;/em&gt; coefficient estimates and the uncertainty in those estimates, as measured by their standard errors.&lt;/p&gt;

&lt;p&gt;This means that confidence intervals only account for reducible error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prediction Intervals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reducible error isn’t the only type of error that is present in regression modeling.&lt;/p&gt;

&lt;p&gt;Even if we knew the true values of the &lt;em&gt;β&lt;/em&gt; coefficients, we would not be able to predict the response variable perfectly because of random error &lt;em&gt;ϵ&lt;/em&gt; in the model, which is an irreducible error.&lt;/p&gt;

&lt;p&gt;Prediction intervals go a step further than confidence intervals by accounting for both reducible and irreducible error. This means that prediction intervals will always be wider than confidence intervals.&lt;/p&gt;

&lt;h1&gt;
  
  
  Other Considerations in Regression Modeling
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Qualitative Predictors
&lt;/h2&gt;

&lt;p&gt;It is possible to have qualitative predictors in regression models. For example, assume that we had a predictor that indicated gender.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458764132%2F4CVAgCmDn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458764132%2F4CVAgCmDn.png" alt="gender-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For qualitative variables with only two levels, we could simply create a “dummy” variable that takes on two values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458774065%2FSHCzqOI5E.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458774065%2FSHCzqOI5E.png" alt="gender-dummy.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’d simply use that logic to create a new column for the dummy variable in the data, and use that for regression purposes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458788741%2FbBaHP-tjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458788741%2FbBaHP-tjj.png" alt="gender-dummy-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what if the qualitative variable had more than two levels? For example, assume we had a predictor that indicated ethnicity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458803281%2FOmF2rGVlI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458803281%2FOmF2rGVlI.png" alt="ethnicity-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, a single dummy variable cannot represent all of the possible values. In this situation, we create multiple dummy variables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458820160%2F_Z4HrbZDs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458820160%2F_Z4HrbZDs.png" alt="ethnicity-dummy-0.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458828430%2Fl2brlpmK6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458828430%2Fl2brlpmK6.png" alt="ethnicity-dummy-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For qualitative variables with multiple levels, there will always be one fewer dummy variable than the total number of levels. In this example, we have three ethnicity levels, so we create two dummy variables.&lt;/p&gt;

&lt;p&gt;The new variables for regression purposes would be represented as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458850997%2FtsFf8y183.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458850997%2FtsFf8y183.png" alt="ethnicity-dummy-table.png"&gt;&lt;/a&gt;&lt;/p&gt;
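
&lt;p&gt;R does this coding automatically for factor variables. A small sketch with a hypothetical ethnicity factor, where model.matrix() shows the dummy columns that get created behind the scenes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical qualitative variable with three levels
ethnicity = factor(c("Asian", "Caucasian", "African American", "Asian", "Caucasian"))

# One fewer dummy column than the number of levels; the omitted level is the baseline
model.matrix(~ ethnicity)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;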

&lt;h2&gt;
  
  
  Extensions of the Linear Model
&lt;/h2&gt;

&lt;p&gt;Regression models provide interpretable results and work well, but make highly restrictive assumptions that are often violated in practice.&lt;/p&gt;

&lt;p&gt;Two of the most important assumptions state that the relationship between the predictors and response are additive and linear.&lt;/p&gt;

&lt;p&gt;The additive assumption means that the effect of changes in some predictor &lt;em&gt;X&lt;/em&gt;ⱼ on the response is independent of the values of the other predictors.&lt;/p&gt;

&lt;p&gt;The linear assumption states that the change in the response &lt;em&gt;Y&lt;/em&gt; due to a one-unit change in some predictor &lt;em&gt;X&lt;/em&gt;ⱼ is constant, regardless of the value of &lt;em&gt;X&lt;/em&gt;ⱼ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Removing the Additive Assumption
&lt;/h2&gt;

&lt;p&gt;Assume that we had an advertising dataset of money spent on TV ads, money spent on radio ads, and product sales.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458869223%2FqBqE63NJu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458869223%2FqBqE63NJu.png" alt="ad-spend-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The multiple linear regression model for the data would have the form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458922857%2FYuboe-zKW.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458922857%2FYuboe-zKW.png" alt="ad-multiple-regression-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model states that the effect on sales of a $1 increase in TV advertising spend is &lt;em&gt;β&lt;/em&gt;₁, on average, regardless of the amount of money spent on radio ads.&lt;/p&gt;

&lt;p&gt;However, it is possible that spending money on radio ads actually increases the effectiveness of TV ads, thus increasing sales further. This is known as an interaction effect or a synergy effect. The interaction effect can be taken into account by including an interaction term:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458888101%2FZLeP3Ygjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458888101%2FZLeP3Ygjx.png" alt="ad-multiple-regression-interaction-term-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interaction term relaxes the additive assumption. Now, every $1 increase in TV ad spend increases sales by &lt;em&gt;β&lt;/em&gt;₁ + &lt;em&gt;β&lt;/em&gt;₃(Radio).&lt;/p&gt;

&lt;p&gt;Sometimes it is possible for the interaction term to have a low &lt;em&gt;p&lt;/em&gt;-value, yet the main terms no longer have a low &lt;em&gt;p&lt;/em&gt;-value. The hierarchical principle states that if an interaction term is included, then the main terms should also be included, even if the &lt;em&gt;p&lt;/em&gt;-values for the main terms are not significant.&lt;/p&gt;

&lt;p&gt;Interaction terms are also possible for qualitative variables, as well as a combination of qualitative and quantitative variables.&lt;/p&gt;
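
&lt;p&gt;In R, an interaction term is written as TV:Radio (or TV * Radio, which also includes both main terms). A sketch on a small made-up advertising data frame, not the book’s Advertising dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical advertising data: TV and radio spend, and product sales
ads = data.frame(
  TV = c(230, 44, 17, 151, 180, 8, 57, 120),
  Radio = c(38, 39, 46, 41, 11, 2, 33, 19),
  Sales = c(22, 10, 9, 18, 13, 5, 12, 13)
)

# The TV:Radio row in the output is the interaction (synergy) coefficient beta_3
fit_interaction = lm(Sales ~ TV + Radio + TV:Radio, data=ads)
summary(fit_interaction)$coefficients
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;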

&lt;h2&gt;
  
  
  Non-Linear Relationships
&lt;/h2&gt;

&lt;p&gt;In some cases, the true relationship between the predictors and response may be non-linear. A simple way to extend the linear model is through polynomial regression.&lt;/p&gt;

&lt;p&gt;For example, for automobiles, there is a curved relationship between miles per gallon and horsepower.&lt;/p&gt;

&lt;p&gt;A quadratic model of the following form would be a great fit to the data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458946470%2FP0JrkkbWl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458946470%2FP0JrkkbWl.png" alt="mpg-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458956923%2F5m50Tgral.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458956923%2F5m50Tgral.png" alt="quadratic-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;
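
&lt;p&gt;In R, a quadratic term can be added with I() inside the lm() formula. A sketch using the Auto dataset from the ISLR package, which contains mpg and horsepower:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # For the Auto dataset (mpg and horsepower)

# Quadratic fit: mpg as a function of horsepower and horsepower squared
mpg_quadratic = lm(mpg ~ horsepower + I(horsepower^2), data=Auto)
summary(mpg_quadratic)$coefficients
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;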

&lt;h1&gt;
  
  
  Potential Problems with Linear Regression
&lt;/h1&gt;

&lt;p&gt;When fitting linear models, there are six potential problems that may occur: non-linearity of data, correlation of error terms, nonconstant variance of error terms, outliers, high leverage data points, and collinearity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-Linearity of Data
&lt;/h2&gt;

&lt;p&gt;Residual plots are a useful graphical tool for the identification of non-linearity.&lt;/p&gt;

&lt;p&gt;In simple linear regression, the residuals are plotted against the predictor.&lt;/p&gt;

&lt;p&gt;In multiple linear regression, the residuals are plotted against the predicted values.&lt;/p&gt;

&lt;p&gt;If there is some kind of pattern in the residual plot, then it is an indication of potential non-linearity.&lt;/p&gt;

&lt;p&gt;Non-linear transformations of the predictors, such as a log transformation, can be a simple way to address the issue.&lt;/p&gt;

&lt;p&gt;For example, take a look at the below residual graphs, which represent different types of fits for the automobile data mentioned previously. The graph on the left represents what the residuals look like if a simple linear model is fit to the data. Clearly, there is a curved pattern in the residual plot, indicating non-linearity. The graph on the right represents what the residuals look like if a quadratic model is fit to the data. Fitting a quadratic model seems to resolve the issue, as a pattern doesn’t exist in the plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458987345%2FhN-nJ89gj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458987345%2FhN-nJ89gj.png" alt="check-nonlinearity.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlation of Error Terms
&lt;/h2&gt;

&lt;p&gt;Proper linear models should have residual terms that are uncorrelated. This means that the sign (positive or negative) of some residual should provide no information about the sign of the next residual.&lt;/p&gt;

&lt;p&gt;If the error terms are correlated, we may have an unwarranted sense of confidence in the linear model.&lt;/p&gt;

&lt;p&gt;To determine if correlated errors exist, we plot residuals in order of observation number. If the errors are uncorrelated, there should not be a pattern. If the errors are correlated, then we may see tracking in the graph. Tracking is where adjacent residuals have similar signs.&lt;/p&gt;

&lt;p&gt;For example, take a look at the below graphs. The graph on the left represents a scenario in which residuals are not correlated. In other words, just because one residual is positive doesn’t seem to indicate that the next residual will be positive. The graph on the right represents a scenario in which the residuals are correlated. There are 15 residuals in a row that are all positive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458998974%2FeCpDGDkNZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605458998974%2FeCpDGDkNZ.png" alt="check-correlation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Nonconstant Variance of Error Terms
&lt;/h2&gt;

&lt;p&gt;Proper linear models should also have residual terms that have a constant variance. The standard errors, confidence intervals, and hypothesis tests associated with the model rely on this assumption.&lt;/p&gt;

&lt;p&gt;Nonconstant variance in errors is known as heteroscedasticity. It is identified as the presence of a funnel shape in the residual plot.&lt;/p&gt;

&lt;p&gt;One solution to nonconstant variance is to transform the response using a concave function, such as &lt;em&gt;log&lt;/em&gt;(&lt;em&gt;Y&lt;/em&gt;) or the square root of &lt;em&gt;Y&lt;/em&gt;. Another solution is to use weighted least squares instead of ordinary least squares.&lt;/p&gt;

&lt;p&gt;The graphs below represent the difference between constant and nonconstant variance. The residual plot on the right has a funnel shape, indicating nonconstant variance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459033749%2FbVNj3sGkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459033749%2FbVNj3sGkh.png" alt="check-variance.png"&gt;&lt;/a&gt;&lt;/p&gt;
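
&lt;p&gt;A small sketch of the log-transformation idea in R, using the Boston dataset from the MASS package (the same data used in the R code section at the end of this post):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(MASS) # For the Boston dataset

fit_raw = lm(medv ~ lstat, data=Boston)
fit_log = lm(log(medv) ~ lstat, data=Boston) # Concave transformation of the response

# Compare the spread of the residuals in the two residual plots
plot(fitted(fit_raw), residuals(fit_raw), main="Raw response")
plot(fitted(fit_log), residuals(fit_log), main="Log-transformed response")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;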

&lt;h2&gt;
  
  
  Outliers
&lt;/h2&gt;

&lt;p&gt;Depending on how many outliers are present and their magnitude, they could either have a minor or major impact on the fit of the linear model. However, even if the impact is small, they could cause other issues, such as impacting the confidence intervals, &lt;em&gt;p&lt;/em&gt;-values, and &lt;em&gt;R&lt;/em&gt;².&lt;/p&gt;

&lt;p&gt;Outliers are identified through various methods. The most common is studentized residuals, where each residual is divided by its estimated standard error. Studentized residuals greater than 3 in absolute value are possible outliers.&lt;/p&gt;

&lt;p&gt;These outliers can be removed from the data to come up with a better linear model. However, it is also possible that outliers indicate some kind of model deficiency, so caution should be taken before removing the outliers.&lt;/p&gt;

&lt;p&gt;The red data point below represents an example of an outlier that would greatly impact the slope of a linear regression model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459045563%2FrCzsnswYJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459045563%2FrCzsnswYJ.png" alt="check-outlier.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  High Leverage Data Points
&lt;/h2&gt;

&lt;p&gt;Observations with high leverage have an unusual value compared to the other observation values. For example, you might have a dataset of &lt;em&gt;X&lt;/em&gt; values between 0 and 10, and just one other data point with a value of 20. The value of 20 is a high leverage data point.&lt;/p&gt;

&lt;p&gt;High leverage is determined through the leverage statistic. The leverage statistic is always between 1/&lt;em&gt;n&lt;/em&gt; and 1.&lt;/p&gt;

&lt;p&gt;The average leverage is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459056039%2FnhlegBaJB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459056039%2FnhlegBaJB.png" alt="check-leverage.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the leverage statistic of a data point greatly exceeds the average leverage, then we have reason to suspect that it is a high leverage point.&lt;/p&gt;

&lt;p&gt;The red data point below represents an example of a high leverage data point that would impact the linear regression fit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1200%2F1%2AABAwOL0Z7PNLqZLRzIqXOQ.png" class="article-body-image-wrapper"&gt;&lt;img alt="Image for post" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1200%2F1%2AABAwOL0Z7PNLqZLRzIqXOQ.png"&gt;&lt;/a&gt;&lt;/p&gt;
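
&lt;p&gt;In R, the leverage statistic for every observation comes from hatvalues(). A sketch on the Boston dataset from the MASS package; the "twice the average leverage" cutoff below is only an illustrative rule of thumb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(MASS) # For the Boston dataset

fit = lm(medv ~ lstat, data=Boston)
hv = hatvalues(fit) # Leverage statistic for every observation

# Average leverage is (p + 1)/n; here p = 1 predictor
avg_leverage = (1 + 1) / nrow(Boston)
sum(hv &gt; 2 * avg_leverage) # How many points sit well above the average leverage
head(sort(hv, decreasing=TRUE)) # The largest leverage values
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;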

&lt;h2&gt;
  
  
  Collinearity
&lt;/h2&gt;

&lt;p&gt;Collinearity refers to the situation in which two or more predictor variables are closely related. Collinearity makes it difficult to separate out the individual effects of collinear variables on the response. It also reduces the accuracy of the estimates of the regression coefficients by causing the coefficient standard errors to grow, thus reducing the credibility of hypothesis testing.&lt;/p&gt;

&lt;p&gt;A simple way to detect collinearity is to look at the correlation matrix of the predictors.&lt;/p&gt;

&lt;p&gt;However, not all collinearity can be detected through the correlation matrix. It is possible for collinearity to exist between multiple variables instead of pairs of variables, which is known as multicollinearity.&lt;/p&gt;

&lt;p&gt;The better way to assess collinearity is through the Variance Inflation Factor (VIF). VIF is the ratio of the variance of a coefficient when fitting the full model, divided by the variance of the coefficient when fitting a model only on its own. The smallest possible VIF is 1. In general, a VIF that exceeds 5 or 10 may indicate a collinearity problem.&lt;/p&gt;

&lt;p&gt;One way to solve the issue of collinearity is to simply drop one of the predictors from the linear model. Another solution is to combine collinear variables into one variable.&lt;/p&gt;

&lt;p&gt;The chart below demonstrates an example of collinearity. As we know, an individual’s credit limit is directly related to their credit rating. A dataset that includes both of these predictors should only include one of them for regression purposes, to avoid the issue of collinearity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459076703%2FxX_UyMAgO.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1605459076703%2FxX_UyMAgO.png" alt="check-collinearity.png"&gt;&lt;/a&gt;&lt;/p&gt;
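
&lt;p&gt;As a rough sketch, the VIF can be computed by hand in R by regressing one predictor on the others. Here we use the Credit dataset from the ISLR package, where Limit and Rating are strongly related, and only a handful of its predictors:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(ISLR) # For the Credit dataset

# VIF for Limit: regress it on the other predictors and use 1 / (1 - R^2)
limit_on_others = lm(Limit ~ Rating + Income + Age, data=Credit)
vif_limit = 1 / (1 - summary(limit_on_others)$r.squared)
vif_limit # A value above roughly 5 or 10 signals a collinearity problem
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;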

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/linear-regression/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 3, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 3 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Simple Linear Regression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For model functions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ISLR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For datasets&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ggplot2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# For plotting&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Working with the Boston dataset to predict median house values&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Fit a simple linear regression model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Median house value is the response (Y)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Percentage of low income households in the neighborhood is the predictor&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# View the fitted simple linear regression model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# View all of the objects stored in the model, and get one of them, such as the coefficients&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;coefficients&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# 95% confidence interval of the coefficients&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use the model to predict house values for specific lstat values&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;lstat_predict&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;lstat_predict&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lstat_predict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat_predict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Predictions&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat_predict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat_predict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use ggplot to create a residual plot of the model&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_pred_resid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Prediction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fitted.values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Residual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;residuals&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Boston_resid_plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_pred_resid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Residual&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Boston Residual Plot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"House Value Prediction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Residual"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_resid_plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Use ggplot to create a studentized residual plot of the model (for outlier detection)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_pred_Rstudent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Prediction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;fitted.values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Rstudent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rstudent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Boston_Rstudent_plot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_pred_Rstudent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Rstudent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Boston Rstudent Plot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"House Value Prediction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Rstudent"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_Rstudent_plot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Determine leverage statistics for the lstat values (for high leverage detection)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_leverage&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hatvalues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Boston_leverage&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multiple Linear Regression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Multiple linear regression model with two predictors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Median house value is the response (Y)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Percentage of low income households in the neighborhood is the first predictor (X1)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Percentage of houses in the neighborhood built before 1940 is the second predictor (X2)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## Coefficients:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## (Intercept)        lstat          age  &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;##    33.22276     -1.03207      0.03454&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model with all predictors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Median house value is the response (Y)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Every variable in the Boston dataset is a predictor (X)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model with all predictors except specified (age)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model with an interaction term&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The colon ":" will include an (lstat)(age) interaction term&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;crim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# The asterisk "*" will include an (lstat)(age) interaction term&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# It will also include the terms by themselves, without having to specify separately&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model with nonlinear transformation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model with nonlinear transformation (easier method)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_7&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;poly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lstat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model with log transformation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;medv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rm&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Boston&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# ANOVA test to compare two different regression models&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;anova&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Null Hypothesis: Both models fit the data equally well&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# Alternative Hypothesis: The second model is superior to the first&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# 95% confidence interval for the coefficients&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;confint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Boston_lm_mult_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Linear Regression with Qualitative Variables
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The Carseats data has a qualitative variable for the quality of shelf location&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# It takes on one of three values: Bad, Medium, Good&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# R automatically generates dummy variables ShelveLocGood and ShelveLocMedium&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Multiple linear regression model to predict carseat sales&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Carseats_lm_mult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Sales&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Carseats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Carseats_lm_mult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## Coefficients:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;##                   Estimate Std. Error t value Pr(&amp;gt;|t|)    &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## ...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## ShelveLocGood    4.8501827  0.1531100  31.678  &amp;lt; 2e-16 ***&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## ShelveLocMedium  1.9567148  0.1261056  15.516  &amp;lt; 2e-16 ***&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## ...&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
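&lt;p&gt;For reference, the dummy coding that R generates for ShelveLoc can be inspected with the contrasts() function. A minimal sketch, assuming the ISLR package (which provides the Carseats data) is installed and loaded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Inspect how R encodes the qualitative ShelveLoc variable as dummy variables
library(ISLR)

contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;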



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 2 — What is Statistical Learning?</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:29:18 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-2-what-is-statistical-learning-1jdf</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-2-what-is-statistical-learning-1jdf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 2 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  What is Statistical Learning?
&lt;/h1&gt;

&lt;p&gt;Assume that we have an advertising dataset that consists of TV advertising spend, radio advertising spend, newspaper advertising spend, and product sales.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918535016%2F5G7hgV9Wf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918535016%2F5G7hgV9Wf.png" alt="advertising-spend-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could build a multiple linear regression model to predict sales by using the different types of advertising spend as predictors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Product sales would be the response variable (&lt;em&gt;Y&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;  The different advertising spends would be the predictors (&lt;em&gt;X&lt;/em&gt;₁, &lt;em&gt;X&lt;/em&gt;₂, &lt;em&gt;X&lt;/em&gt;₃)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model could be written in the following general form:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Y&lt;/em&gt; = &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;) + &lt;em&gt;e&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The symbol &lt;em&gt;f&lt;/em&gt; represents the systematic information that &lt;em&gt;X&lt;/em&gt; provides about &lt;em&gt;Y&lt;/em&gt;, and &lt;em&gt;e&lt;/em&gt; is a random error term that is independent of &lt;em&gt;X&lt;/em&gt; and has a mean of zero. Statistical learning refers to a set of approaches for estimating &lt;em&gt;f&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasons for Estimating &lt;em&gt;f&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;There ar&lt;span id="rmm"&gt;e&lt;/span&gt; two main reasons for estimating &lt;em&gt;f&lt;/em&gt;: prediction and inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prediction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prediction refers to any scenario in which we simply want an accurate estimate of the response variable &lt;em&gt;Y&lt;/em&gt; for given values of the predictors, treating the estimated &lt;em&gt;f&lt;/em&gt; as a black box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inference involves understanding the relationship between &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;, as opposed to predicting &lt;em&gt;Y&lt;/em&gt;. For example, we may ask the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Which predictors &lt;em&gt;X&lt;/em&gt; are associated with the response variable &lt;em&gt;Y&lt;/em&gt;?&lt;/li&gt;
&lt;li&gt;  What is the relationship between the predictor and response?&lt;/li&gt;
&lt;li&gt;  What type of model best explains the relationship?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Methods of Estimating f
&lt;/h2&gt;

&lt;p&gt;There are two main methods of estimating &lt;em&gt;f&lt;/em&gt;: parametric and nonparametric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parametric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Parametric methods involve the following two-step model-based approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Make an assumption about the functional form (linear, lognormal, etc.) of &lt;em&gt;f&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; After selecting a model, use training data to fit or train the model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918599476%2F9clmvi5Nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918599476%2F9clmvi5Nj.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nonparametric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, nonparametric methods do not make explicit assumptions about the functional form of &lt;em&gt;f&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918613897%2FFb8OAJy-K.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918613897%2FFb8OAJy-K.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead, they attempt to get as close to the data points as possible, without being too rough or too smooth. They have the potential to accurately fit a wider range of possible shapes for &lt;em&gt;f&lt;/em&gt;. Spline models are common examples of nonparametric methods.&lt;/p&gt;
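&lt;p&gt;For example, a smoothing spline can be fit in base R without assuming any functional form for &lt;em&gt;f&lt;/em&gt;. The following is only an illustrative sketch on simulated data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Simulate data with a nonlinear relationship
set.seed(1)
x = runif(100, 0, 10)
y = sin(x) + rnorm(100, sd=0.3)

# Fit a smoothing spline; no functional form is assumed for f
# The level of smoothness is chosen automatically by cross-validation
spline_fit = smooth.spline(x, y, cv=TRUE)
spline_fit$df # Effective degrees of freedom used by the fit
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;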

&lt;p&gt;&lt;strong&gt;Parametric vs Nonparametric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nonparametric methods have the potential to be more accurate due to being able to fit a wider range of shapes. However, they are more likely to suffer from the issue of overfitting to the data.&lt;/p&gt;

&lt;p&gt;Additionally, parametric methods are much more interpretable than nonparametric methods. For example, it is much easier to explain the results of a linear regression model than it is to explain the results of a spline model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Assessing Model Accuracy
&lt;/h1&gt;

&lt;p&gt;In statistical learning, we’re interested not only in the type of model to fit to our data, but also in how well the model fits the data.&lt;/p&gt;

&lt;p&gt;One way to determine how well a model fits is by comparing the predictions to the actual observed data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regression Accuracy
&lt;/h2&gt;

&lt;p&gt;In regression, the most commonly used measure of model accuracy is the mean squared error (MSE):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;n&lt;/em&gt; — represents the number of observations in the data&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;y&lt;/em&gt; — represents the actual response value in the data&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;y-hat&lt;/em&gt; — represents the predicted response value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918629921%2FPzcIspV4P.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918629921%2FPzcIspV4P.png" alt="mse-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bias-Variance Tradeoff
&lt;/h2&gt;

&lt;p&gt;However, we should usually be interested in the error of the test dataset instead of the training dataset. This is because we usually want the model that best predicts the future, and not the past. Additionally, more flexible models will reduce training error, but may not necessarily reduce test error. Therefore, the test error should be of higher concern.&lt;/p&gt;

&lt;p&gt;In general, to minimize the expected test error, a model that has low variance and low bias should be chosen.&lt;/p&gt;

&lt;p&gt;Bias refers to the error that is introduced by approximating a very complicated real-world problem with a much simpler model. More flexible models have less bias because they can capture more complex relationships.&lt;/p&gt;

&lt;p&gt;Variance refers to how much the estimated model would change if it were fit to a different training dataset. More flexible models have higher variance because they follow the training data more closely.&lt;/p&gt;

&lt;p&gt;The relationship among bias, variance, and the test error is known as the bias-variance tradeoff. The challenge lies in finding the method in which both bias and variance are low.&lt;/p&gt;
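&lt;p&gt;The tradeoff can be seen in a small simulation: as flexibility (polynomial degree) increases, the training MSE keeps falling, while the test MSE eventually starts to rise. This is only an illustrative sketch on simulated data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;set.seed(1)
sim = data.frame(x = runif(200, -2, 2))
sim$y = sim$x^3 - 2 * sim$x + rnorm(200)
train = sample(200, 100) # Half of the observations are used for training

# Fit polynomials of increasing flexibility and compare training vs test MSE
flex_mse = sapply(1:10, function(d) {
  fit = lm(y ~ poly(x, d), data=sim, subset=train)
  pred = predict(fit, sim)
  c(train_mse = mean((sim$y[train] - pred[train])^2),
    test_mse = mean((sim$y[-train] - pred[-train])^2))
})
round(flex_mse, 3)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;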

&lt;h2&gt;
  
  
  Classification Accuracy
&lt;/h2&gt;

&lt;p&gt;In classification, the most commonly used measure of model accuracy is the error rate (the proportion of misclassified observations):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918649400%2FIHrmAlSAk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918649400%2FIHrmAlSAk.png" alt="error-rate-equation.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, again, we should usually be interested in the error rate of the test dataset instead of the training dataset.&lt;/p&gt;
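&lt;p&gt;In either case, the error rate is simply the fraction of observations whose predicted class differs from the actual class. A tiny sketch with hypothetical labels:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical actual and predicted class labels
actual = c("STEM", "STEM", "Non-STEM", "STEM", "Non-STEM")
predicted = c("STEM", "Non-STEM", "Non-STEM", "STEM", "STEM")

# Error rate: proportion of misclassified observations
mean(actual != predicted)
## 0.4
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;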

&lt;p&gt;The test error rate is minimized, on average, by a very simple classifier known as the Bayes’ classifier. The Bayes’ classifier assigns each observation to its most likely class, given the true conditional class probabilities for its predictor values.&lt;/p&gt;

&lt;p&gt;For example, assume that we knew for a fact that 70% of all people who make more than $100,000 per year were STEM graduates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918661433%2FBWBdQjCIp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918661433%2FBWBdQjCIp.png" alt="salary-category-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a two-class setting like this one, the Bayes’ classifier predicts the first class if its probability is greater than 50%. Otherwise, it predicts the other class. The Bayes’ decision boundary is the set of points at which the probability is exactly 50%, and the classifier’s predictions are determined by this boundary.&lt;/p&gt;

&lt;p&gt;For example, if we were given a test dataset of just salary values, we’d simply assign any salaries greater than $100,000 as STEM graduates, and salary values less than $100,000 as non-STEM graduates.&lt;/p&gt;
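&lt;p&gt;With a known boundary like this, the classification rule is just a threshold. A minimal sketch with hypothetical salary values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;# Hypothetical salaries, classified using the $100,000 decision boundary
salaries = c(45000, 150000, 98000, 120000, 101000)
ifelse(salaries &amp;gt; 100000, "STEM Graduate", "Non-STEM Graduate")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;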

&lt;p&gt;The Bayes’ classifier produces the lowest possible test error rate, called the Bayes’ error rate.&lt;/p&gt;

&lt;p&gt;In theory, we would always like to predict classifications using the Bayes’ classifier. However, we do not always know the probability of a class, given some predictor value. We have to estimate this probability, and then classify the data based on the estimated probability.&lt;/p&gt;

&lt;h1&gt;
  
  
  K-Nearest Neighbors (KNN)
&lt;/h1&gt;

&lt;p&gt;The K-Nearest Neighbors (KNN) classifier is a popular method of estimating conditional probability. Given a positive integer &lt;em&gt;K&lt;/em&gt; and some test observation, the KNN classifier identifies the &lt;em&gt;K&lt;/em&gt; points in the training data that are closest to the test observation. These closest &lt;em&gt;K&lt;/em&gt; points are represented by &lt;em&gt;N&lt;/em&gt;₀. Then, it estimates the conditional probability for a class as the fraction of points in &lt;em&gt;N&lt;/em&gt;₀ that represent that specific class. Lastly, KNN will apply the Bayes’ rule and classify the test observation to the class with the largest probability.&lt;/p&gt;

&lt;p&gt;However, the choice of the &lt;em&gt;K&lt;/em&gt; value is very important. Lower values of &lt;em&gt;K&lt;/em&gt; produce more flexible classifiers with low bias but high variance, whereas higher values of &lt;em&gt;K&lt;/em&gt; produce less flexible classifiers with low variance but more bias. Similar to the regression setting, a bias-variance tradeoff exists.&lt;/p&gt;
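&lt;p&gt;A minimal KNN sketch using the knn() function from the class package (assumed to be installed), on simulated two-dimensional data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;library(class) # Provides the knn() function

set.seed(1)
# Simulated training data: two predictors and a two-class response
train_X = matrix(rnorm(100), ncol=2)
train_y = factor(ifelse(train_X[, 1] + train_X[, 2] &amp;gt; 0, "A", "B"))

# Simulated test observations to classify
test_X = matrix(rnorm(20), ncol=2)

# Classify each test observation based on its K=5 nearest training neighbors
knn(train_X, test_X, cl=train_y, k=5)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;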

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/what-is-statistical-learning/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com/&lt;/a&gt; &lt;em&gt;on August 2, 2020.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  ISLR Chapter 2 — R Code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Basic R Functions
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assign a vector to a new object named "data_vector"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data_vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# List of all objects&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Remove an object&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;rm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create two vectors of 50 random numbers&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# with mean of 0 and standard deviation of 1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;random_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;random_2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Create a dataframe from the vectors&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;# with columns labeled as X and Y&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get the dimensions (rows, columns) of a dataframe&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nf"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 50 2&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get the class types of columns in a dataframe&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;##         X         Y &lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## "numeric" "numeric"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Easily omit any NA values from a dataframe&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;na.omit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# See the first 5 rows of the dataframe&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;##            X          Y&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 1  1.3402318 -0.2318012&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 2 -1.8688186  1.0121503&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 3  2.9939211 -1.7843108&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 4 -0.9833264 -1.0518947&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 5 -1.2800747 -0.4674771&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get a specific column (column X) from the dataframe&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 1.3402318 -1.8688186 2.9939211 -0.9833264 -1.2800747&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Get the mean, variance, or standard deviation&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## -0.1205189&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 1.096008&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;## 1.046904&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Plotting with ggplot
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a scatter plot using the ggplot2 package&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;coord_cartesian&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xlim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;labs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Random X and Random Y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;subtitle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Random Numbers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Y Random"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"X Random"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;scale_x_continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;breaks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;scale_y_continuous&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;breaks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;-3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colour&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_frame&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# Visualize the ggplot&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plot_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>ISLR Chapter 1 — Introduction to Statistical Learning</title>
      <dc:creator>Bijen Patel</dc:creator>
      <pubDate>Sun, 08 Nov 2020 23:25:20 +0000</pubDate>
      <link>https://dev.to/bijenpatel/islr-chapter-1-introduction-to-statistical-learning-2337</link>
      <guid>https://dev.to/bijenpatel/islr-chapter-1-introduction-to-statistical-learning-2337</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a summary of chapter 1 of the &lt;em&gt;Introduction to Statistical Learning&lt;/em&gt; textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at &lt;a href="https://www.bijenpatel.com/guide/" rel="noopener noreferrer"&gt;my website&lt;/a&gt;, or &lt;a href="https://dev.to/bijenpatel/series/9611"&gt;here&lt;/a&gt; at DEV. Subscribe to stay up to date on my latest Data Science &amp;amp; Engineering guides!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Overview of Statistical Learning
&lt;/h1&gt;

&lt;p&gt;Statistical learning simply refers to the broad set of tools that are available for understanding data. There are two main types of statistical learning: supervised and unsupervised.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supervised Learning
&lt;/h2&gt;

&lt;p&gt;Supervised learning involves building statistical models to predict outputs (&lt;em&gt;Y&lt;/em&gt;) from inputs (&lt;em&gt;X&lt;/em&gt;). For example, assume that we have a salary dataset for statisticians. The dataset consists of the experience level and salary for 10 different statisticians.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918350275%2F11UlenfRo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918350275%2F11UlenfRo.png" alt="statistician-salary-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could build a simple linear regression model to predict the salary of statisticians by using experience level as a predictor. This is an example of supervised learning, where we have supervising outputs (salary values) that guide us in developing a statistical model to determine the relationship between experience level and salary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918368060%2FThX7WZ9u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918368060%2FThX7WZ9u8.png" alt="statistician-salary-regression.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In general, there are two main types of supervised learning: regression and classification.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Regression&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Predicting a quantitative output is known as a regression problem. For example, predicting someone’s salary is a regression problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Classification&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Predicting a qualitative output is known as a classification problem. For example, predicting whether a stock will go up or down is a classification problem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Unsupervised Learning
&lt;/h1&gt;

&lt;p&gt;Unsupervised learning involves building statistical models to determine relationships from inputs (&lt;em&gt;X&lt;/em&gt;). There are no supervising outputs. For example, assume that we have a customer dataset. The dataset consists of the annual salary and annual spend on Amazon for 10 different individuals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604873951166%2FZQcYFmRa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604873951166%2FZQcYFmRa2.png" alt="amazon-salary-spend-table.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could use a statistical clustering algorithm to group customers by their purchasing behavior. This is an example of unsupervised learning, where we &lt;strong&gt;do not&lt;/strong&gt; have supervising outputs that already inform us which customers are low spenders, average spenders, or high spenders. Instead, we have to come up with the determination ourselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918399263%2FdiwbeTAVx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1604918399263%2FdiwbeTAVx.png" alt="amazon-clustering.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In general, there are two main types of unsupervised learning: clustering and association.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clustering
&lt;/h2&gt;

&lt;p&gt;Determining groupings is known as a clustering problem. For example, grouping customers together based on purchasing behavior is a clustering problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Association
&lt;/h2&gt;

&lt;p&gt;Determining rules that describe large portions of a dataset is known as an association problem. For example, determining that people who buy &lt;em&gt;X&lt;/em&gt; also buy &lt;em&gt;Y&lt;/em&gt; is an association problem. A modern real-world example of this is Amazon’s “frequently bought together” product recommendations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.bijenpatel.com/guide/islr/introduction-to-statistical-learning/" rel="noopener noreferrer"&gt;https://www.bijenpatel.com&lt;/a&gt; &lt;em&gt;on August 1, 2020.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
