<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward Amor</title>
    <description>The latest articles on DEV Community by Edward Amor (@edwardamor).</description>
    <link>https://dev.to/edwardamor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F515425%2F52b97273-f0b7-4206-922f-5b4a4134d59d.jpeg</url>
      <title>DEV Community: Edward Amor</title>
      <link>https://dev.to/edwardamor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edwardamor"/>
    <language>en</language>
    <item>
      <title>Analysis of Hidden Technical Debt in Machine Learning Systems</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sun, 01 Nov 2020 05:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/analysis-of-hidden-technical-debt-in-machine-learning-systems-4hp6</link>
      <guid>https://dev.to/edwardamor/analysis-of-hidden-technical-debt-in-machine-learning-systems-4hp6</guid>
      <description>&lt;p&gt;&lt;em&gt;Hidden Technical Debt in Machine Learning Systems&lt;/em&gt; &lt;sup id="fnref:1"&gt;1&lt;/sup&gt; offers a very interesting high level overview of the numerous extra layers of &lt;em&gt;technical debt&lt;/em&gt; &lt;sup id="fnref:2"&gt;2&lt;/sup&gt; which exist in Machine Learning enabled systems. Unlike standard software systems, ML-enabled systems utilize external data instead of standard code and software logic, and contain a machine learning component. This replacement of standard software logic with data results in systems which are much harder to maintain in the long run if the proper precautions aren’t taken. Therefore, it’s imperative that every Data Scientist/Machine Learning Engineer be aware of the various debts that come with ML-enabled systems in order to prevent serious catastrophe in the future.&lt;/p&gt;

&lt;h2&gt;Model Complexity&lt;/h2&gt;

&lt;p&gt;Model complexity refers to the overall complex nature of ML-enabled systems: the process through which data flows in and out, and any intermediary stages in between. This complexity makes it practically impossible to make isolated changes in an ML-enabled system, since any change also alters the behavior of the ML component within. An interesting problem that arises as complexity in an ML system increases is the advent of undeclared consumers: other systems or parts of the development stack that silently consume outputs and/or intermediary files generated by the ML system. This poses a huge risk, since those components are now silently coupled to the system, and any change in the ML component affects them. The coupling can produce adverse outcomes that are tough to debug at best and cascading failures at worst.&lt;/p&gt;

&lt;h2&gt;Data Dependencies&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike regular software systems, ML-enabled systems are entirely dependent on external data. When the inputs to an ML system aren’t strictly maintained, the input data may change and lead to an adverse effect on the system. This includes any improvements to the input signals, since &lt;em&gt;changing anything changes everything&lt;/em&gt;. Additionally, over time underutilized data features, legacy features, bundled features, and/or correlated features can generate inefficiencies at best and faults at worst. It’s imperative that regular input validation checks are made, and exhaustive leave-one-out feature selection evaluations are run to eliminate underutilized features.&lt;/p&gt;

&lt;h2&gt;Feedback Loops&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;One of the key features of live ML systems is that they often end up influencing their own behavior if they update over time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ML systems are unique in that they require training, which requires data. Often, direct feedback loops arise over time, where an ML system directly influences the selection of its own future training data. Although this is a relatively tough problem, it’s exactly the kind data scientists love to research and solve. The more challenging issue is hidden feedback loops, in which two systems influence each other indirectly through the world. For example, consider two stock market prediction models from independent investment firms: any improvement (or, at worst, bug) in one may influence the bidding and buying behavior of the other.&lt;/p&gt;

&lt;h2&gt;Anti-Patterns&lt;/h2&gt;

&lt;p&gt;An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive. &lt;sup id="fnref:3"&gt;3&lt;/sup&gt; Within an ML-enabled system there are a few unique anti-patterns that hinder the maintainability of the system. Glue code, often written to get data into and out of a general-purpose ML package, accumulates into lots of supporting code that is costly in the long term. Pipeline jungles evolve organically over time as the result of incremental scrapes, joins, and sampling steps, often with intermediate outputs and files. Managing and testing these pipelines is costly and time-consuming; fortunately, many libraries (such as &lt;code&gt;scikit-learn&lt;/code&gt; with its &lt;code&gt;Pipeline&lt;/code&gt; class) now provide built-in pipeline abstractions that ease their management.&lt;/p&gt;
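&lt;p&gt;As a minimal sketch of that abstraction (my example, not from the paper), chaining preprocessing and modeling into a single &lt;code&gt;Pipeline&lt;/code&gt; object keeps the glue code in one declared place:&lt;/p&gt;

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the output of a real feature pipeline.
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# One object owns scaling + modeling, so there is no loose glue code
# shuttling intermediate arrays between hand-rolled steps.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
pipe.fit(X, y)
preds = pipe.predict(X)
```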

&lt;h2&gt;Other Areas of ML Debt&lt;/h2&gt;

&lt;p&gt;Lastly, there are many other areas of technical debt in the production of ML-enabled systems. Process management debt, which occurs in very mature systems that may have hundreds or thousands of models running simultaneously, involves managing and assigning resources across different business priorities. Reproducibility debt arises because real-world systems are hard to make strictly reproducible in the face of randomized algorithms, non-determinism in parallel learning, and interactions with the outside world. Finally, there is probably the most important type of debt: cultural debt. Cultural debt exists when there is a hard line between research and engineering, which is counterproductive in the long term. It’s therefore imperative to cultivate a culture that rewards simplicity, stability, and reproducibility.&lt;/p&gt;

&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The goal is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As the Data Science field continues to grow, it’s important that those within the community are aware of the issues involved in putting ML into production. Luckily, since generally 95% of any ML system isn’t actually ML, it works to our benefit to learn from the software engineering field and take advantage of its many decades of accumulated experience. The authors of &lt;em&gt;Hidden Technical Debt in Machine Learning Systems&lt;/em&gt; did an excellent job of expressing the additional layers of technical debt involved in ML systems, and the various solutions and measures that limit it. Some of the key ways they offer to pay down the debt are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using common APIs, which allows support infrastructure to be more reusable.&lt;/li&gt;
&lt;li&gt;Isolating models by serving ensembles to reduce interaction between the external world and models.&lt;/li&gt;
&lt;li&gt;Creating versioned copies of inputs, to prevent detriments to the system from changes in the input.&lt;/li&gt;
&lt;li&gt;Regularly running exhaustive leave-one-feature-out evaluations, to identify and remove unnecessary features.&lt;/li&gt;
&lt;li&gt;Testing of input signals, providing sanity checks which prevent corruption of models.&lt;/li&gt;
&lt;li&gt;Improving documentation.&lt;/li&gt;
&lt;/ol&gt;
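&lt;p&gt;For point 5, input-signal tests can be as simple as null and range checks run before training or scoring. Here is a small hypothetical sketch (the column name and bounds are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

def validate_inputs(df: pd.DataFrame) -> list:
    """Collect sanity-check failures for an input feature frame."""
    failures = []
    if df["temperature"].isna().any():
        failures.append("temperature contains missing values")
    if not df["temperature"].between(-30, 60).all():
        failures.append("temperature outside plausible range")
    return failures

# A clean frame passes; a corrupted signal is flagged before it
# silently degrades the model.
clean = pd.DataFrame({"temperature": [12.3, 15.0, 9.8]})
bad = pd.DataFrame({"temperature": [12.3, 999.0]})
```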

&lt;p&gt;Data Science has never been an isolated field, but it is more important than ever that as a community we pay attention to the long-term implications of our ML systems. Investing time and care from the beginning when building these systems results in better maintainability and room for future growth that would otherwise be burdensome.&lt;/p&gt;




&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf"&gt;https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Technical_debt"&gt;https://en.wikipedia.org/wiki/Technical_debt&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Anti-pattern"&gt;https://en.wikipedia.org/wiki/Anti-pattern&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Simple Time Series Forecasting with ML</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Wed, 16 Sep 2020 13:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/simple-time-series-forecasting-with-ml-2kda</link>
      <guid>https://dev.to/edwardamor/simple-time-series-forecasting-with-ml-2kda</guid>
      <description>&lt;p&gt;Time series forecasting is an interesting sub-topic within the field of machine learning, mainly due to the time component which adds to the complexity of making predictions. Over the past month I’ve grown quite fond of it, and one of the best things I’ve learned is that standard supervised machine learning algorithms can be applied to time series to make predictions. The process is quite similar to a standard ML process with the exception that you have to structure your data a specific way to maintain the temporal structure.&lt;/p&gt;

&lt;h2&gt;Environment Setup&lt;/h2&gt;

&lt;p&gt;For setting up your environment I recommend using anaconda; it’s more or less the de facto environment manager for data science. However, if you only have plain python on your system, that is more than enough as well. I’m also assuming you have a terminal available with a unix-like shell such as bash or git bash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;tsml-tutorial
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;tsml-tutorial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have anaconda available on your system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;conda create &lt;span class="nt"&gt;-n&lt;/span&gt; tsml jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
&lt;span class="nv"&gt;$ &lt;/span&gt;conda activate tsml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don’t have anaconda available on your system, but have python 3.3+ installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that you have an environment installed you can start following along by starting your local jupyter server and opening a fresh notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Data Extraction&lt;/h2&gt;

&lt;p&gt;For this little tutorial we’ll be using one of the most common univariate time series datasets, one you’ve probably already seen: daily minimum temperatures in Melbourne, Australia, 1981-1990. The data consists of, as you may have guessed, the daily minimum temperature over the course of 10 years in Melbourne. We’ll grab the data with pandas from a GitHub repository; you can find it at &lt;a href="https://github.com/jbrownlee/Datasets/blob/master/daily-min-temperatures.csv"&gt;https://github.com/jbrownlee/Datasets/blob/master/daily-min-temperatures.csv&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeSeriesSplit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# load our dataset
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output dataframe info
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 2 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 Date 3650 non-null object 
 1 Temp 3650 non-null float64
dtypes: float64(1), object(1)
memory usage: 57.2+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our dataframe consists of 2 columns, &lt;code&gt;Date&lt;/code&gt; and &lt;code&gt;Temp&lt;/code&gt;, with no missing values, and 3650 observations (365 per year). Our data is typed as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Date&lt;/code&gt; column as a string, which we’ll want to convert to a &lt;code&gt;DatetimeIndex&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Temp&lt;/code&gt; column as a float64.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# set Date as datetimeindex
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Data Exploration&lt;/h2&gt;

&lt;p&gt;Since this is a time series, we’d be remiss if we didn’t plot the data out fully. We’ll also want to inspect our data and see if there is autocorrelation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot full 10 years
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Daily minimum temperatures in Melbourne, Australia, 1981-1990"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Temperature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"30-Day Rolling Average"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"30-Day Rolling Std. Dev."&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WQiS16Lz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/ts-plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WQiS16Lz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/ts-plot.png" alt="Time Series Plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our plot of the temperature over the ten years shows that it oscillates, almost like a sinusoidal wave, and the rolling standard deviation shows that the variance does not grow as time progresses. This would definitely be an optimal dataset for a SARIMA model, but that isn’t what we’re here for.&lt;/p&gt;

&lt;h2&gt;Modeling&lt;/h2&gt;

&lt;p&gt;This is the crux of our tutorial: essentially we’ll be doing regression (albeit with a random forest regressor) to predict the temperature. To start, we’ll create some features, such as time lags and calendar features, to incorporate the temporal structure into our model. To make things easier as our data grows, we’ll also build a pipeline.&lt;/p&gt;

&lt;p&gt;Features to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time lags for the previous week&lt;/li&gt;
&lt;li&gt;Rolling 30-Day Temperature average&lt;/li&gt;
&lt;li&gt;Rolling 7-Day Temperature average&lt;/li&gt;
&lt;li&gt;Month of the year&lt;/li&gt;
&lt;li&gt;Week of the year&lt;/li&gt;
&lt;li&gt;Day of the week&lt;/li&gt;
&lt;li&gt;Next day’s temperature (what we are predicting)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create our features and new dataframe
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;f"t-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isocalendar&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"week_of_year"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isocalendar&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;week&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"7-Day Temp. Avg."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"30-Day Temp. Avg."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"t+1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;t&lt;/code&gt; is our current time step, and &lt;code&gt;t+1&lt;/code&gt; is the next day’s temperature, which we’ll be predicting. To make our model aware of time we’ve also created day-of-week, week-of-year, and month features, and included lag values and rolling averages. Next we’ll want to divide our data into training and testing sets so we can train and validate our model. However, since we’re working with time series data there is a strict order dependence: we can’t shuffle and split our data; we have to maintain the order.&lt;/p&gt;
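&lt;p&gt;The &lt;code&gt;TimeSeriesSplit&lt;/code&gt; we imported earlier encodes exactly this constraint: each fold trains on an initial segment and validates on the slice that immediately follows it. A minimal illustration on a toy array:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(10).reshape(10, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X_demo):
    # Every validation index comes strictly after every training index,
    # so the model never peeks into the future.
    assert train_idx.max() < test_idx.min()
```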

&lt;p&gt;We’ll split our data using a 70-30 split: the first 70% is our training data and the last 30% is our testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# split data up into training and testing set, preprocess
&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'t-7'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-6'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-5'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;'7-Day Temp. Avg.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'30-Day Temp. Avg.'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;col_trans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"categorical_cols"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"first"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"week_of_year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"numeric_cols"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"trans"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_trans&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"regression"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"t+1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"t+1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):]&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we can’t use standard cross validation on time series data, we’ll use the time series split class from sklearn, which is essentially the time series analogue of k-fold validation. The alternative would be to train our model on all the data and compare candidates using an information criterion; realistically, when doing any model selection you should use multiple metrics to select your model.&lt;br&gt;
&lt;/p&gt;
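&lt;p&gt;As a quick illustration (a sketch on toy data, not part of the original notebook), TimeSeriesSplit always trains on a prefix of the series and validates on the window that follows it, so no fold ever peeks at the future:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# ten sequential observations standing in for a time series
X_demo = np.arange(10).reshape(-1, 1)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    # each training window is an expanding prefix; each test window follows it
    print(train_idx, test_idx)
```

&lt;p&gt;Each successive fold extends the training prefix, mirroring how the model would be retrained as new observations arrive.&lt;/p&gt;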

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform cross validation on training data
&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TimeSeriesSplit&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"neg_root_mean_squared_error"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.5270646279116358
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we have the RMSE score from cross validation. It isn’t anything special, but it verifies that we can apply our standard ML toolset to a time series dataset. From our CV we can see our model is off by about 2.5 degrees on average.&lt;br&gt;
&lt;/p&gt;
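&lt;p&gt;For intuition (a toy sketch with hypothetical temperatures, not from the notebook), RMSE is just the square root of the average squared error, so a score of 2.5 means predictions miss the true value by roughly 2.5 degrees on a typical day:&lt;/p&gt;

```python
import numpy as np

# hypothetical observed vs predicted temperatures
y_true = np.array([60.0, 62.0, 61.0])
y_pred = np.array([58.0, 65.0, 61.0])

# root mean squared error, computed by hand
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```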

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and make predictions on testing data
&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# show the predictions
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1086&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Predictions on Hold Out Data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1086&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Observations"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Predictions"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N7j06U7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/modelpredictions.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N7j06U7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/modelpredictions.png" alt="Model Predictions"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
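&lt;p&gt;One fragility worth noting: the 1086 in the plotting code is the hard-coded length of the hold-out set. A sketch (on hypothetical data) of deriving the index from the split instead, so the plot stays correct if the split fraction ever changes:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# hypothetical series standing in for the temperature data
y_all = pd.Series(np.linspace(50.0, 60.0, 10))
split = int(len(y_all) * 0.7)
y_test_demo = y_all[split:]

# dummy predictions aligned to the hold-out index, no magic numbers needed
preds_demo = pd.Series(np.zeros(len(y_test_demo)), index=y_test_demo.index)
print(preds_demo.index.tolist())
```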

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output RMSE score on test data
&lt;/span&gt;&lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.3153790785018127
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the predictions made by our model, we aren’t going to be telling anyone the weather anytime soon. However, this is a prime example of how to apply standard Machine Learning algorithms to your time series.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Model Selection, Validation, and Hyperparameter Tuning</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sat, 01 Aug 2020 21:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/model-selection-validation-and-hyperparameter-tuning-156j</link>
      <guid>https://dev.to/edwardamor/model-selection-validation-and-hyperparameter-tuning-156j</guid>
      <description>&lt;p&gt;In practice, a majority of the time dedicated to any data science project (unless you’re lucky) is consumed by data cleaning and wrangling. However, once you’ve completed you’re data mining, cleaning, exploration, and feature engineering, generally the next step is to do some machine learning. The ML process is pretty standard regardless of the algorithm you choose, it’ll always require some model selection, model validation, and hyperparameter tuning. One of the easiest ways you can wrap your mind around the process is through trial and error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic Regression
&lt;/h2&gt;

&lt;p&gt;Logistic regression is a very well known supervised binary classification algorithm. Unlike linear regression, where you’re predicting a continuous value, logistic regression outputs a prediction of either a 0 or a 1, or a probability (there is also softmax regression, an extension of logistic regression to more than 2 classes). Although logistic regression isn’t as fancy as neural networks or natural language processing, it is still a significantly useful tool in any data scientist’s toolbelt. Just like other classification algorithms it can be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicting Bank Loan Worthiness&lt;/li&gt;
&lt;li&gt;Detecting Credit Card Fraud&lt;/li&gt;
&lt;li&gt;Detecting Email Spam&lt;/li&gt;
&lt;li&gt;And Many More …&lt;/li&gt;
&lt;/ul&gt;
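&lt;p&gt;Under the hood, the probability a logistic regression outputs comes from the logistic (sigmoid) function applied to a linear score; a minimal sketch:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# a score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))
# large positive scores approach 1, large negative scores approach 0
print(sigmoid(4.0), sigmoid(-4.0))
```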

&lt;p&gt;To get a grasp on how model selection, model validation, and hyperparameter tuning work, we’ll run through an example with a simple dataset. For the purposes of demonstration we’ll be using a dataset from the &lt;a href="https://archive.ics.uci.edu/ml/datasets.php"&gt;UCI Machine Learning Repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blood Transfusion Service Center Data Set
&lt;/h2&gt;

&lt;p&gt;The dataset we’ll be using can be found in the UCI Machine Learning Repository by &lt;a href="https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center"&gt;clicking here&lt;/a&gt;. Below is some information about the data for those who would like to know, it was taken directly from the UCI Machine Learning Repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary:
&lt;/h3&gt;

&lt;p&gt;To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes its blood transfusion service bus to one university in Hsin-Chu City to gather donated blood about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records included R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).&lt;/p&gt;

&lt;h3&gt;
  
  
  Attribute Information:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;R (Recency - months since last donation)&lt;/li&gt;
&lt;li&gt;F (Frequency - total number of donations)&lt;/li&gt;
&lt;li&gt;M (Monetary - total blood donated in c.c.)&lt;/li&gt;
&lt;li&gt;T (Time - months since first donation)&lt;/li&gt;
&lt;li&gt;A binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Source:
&lt;/h3&gt;

&lt;p&gt;Original Owner and Donor&lt;br&gt;&lt;br&gt;
Prof. I-Cheng Yeh&lt;br&gt;&lt;br&gt;
Department of Information Management&lt;br&gt;&lt;br&gt;
Chung-Hua University,&lt;br&gt;&lt;br&gt;
Hsin Chu, Taiwan 30067, R.O.C.&lt;br&gt;&lt;br&gt;
e-mail: icyeh ‘at’ chu.edu.tw&lt;br&gt;&lt;br&gt;
TEL:886-3-5186511&lt;/p&gt;
&lt;h3&gt;
  
  
  Citation Request:
&lt;/h3&gt;

&lt;p&gt;Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” &lt;em&gt;Expert Systems with Applications&lt;/em&gt;, 2008 (doi:10.1016/j.eswa.2008.07.018).&lt;/p&gt;
&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;It’s always a good habit, after extracting and cleaning your data, to perform some EDA to get a grasp of the main characteristics of the data you’ll be working with. It also provides you with visuals that immensely assist in the preprocessing steps later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plot_precision_recall_curve&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SMOTE&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;imblearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;font_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SEED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;890432&lt;/span&gt;


&lt;span class="c1"&gt;# load data
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# rename the columns for brevity
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 r 748 non-null int64
 1 f 748 non-null int64
 2 m 748 non-null int64
 3 t 748 non-null int64
 4 y 748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, the dataset is typed correctly and devoid of any null values, since a majority of the processing has already been done, leaving just EDA and modeling to us. Since this is a classification challenge, one of the best things to look at immediately is the class distribution of your output. This will give us insight into whether we should implement some downsampling/upsampling/hybrid-sampling to adjust for imbalance.&lt;br&gt;
&lt;/p&gt;
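&lt;p&gt;A quick numeric complement to the plot (using the class counts reported for this dataset, roughly 570 non-donors versus 178 donors) shows the imbalance directly:&lt;/p&gt;

```python
import pandas as pd

# class counts reported for the transfusion dataset: 570 non-donors, 178 donors
y_demo = pd.Series([0] * 570 + [1] * 178)
print(y_demo.value_counts(normalize=True).round(3))
```

&lt;p&gt;Roughly three in four donors fall in the negative class, which is why a resampling strategy like SMOTE is worth considering later.&lt;/p&gt;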

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot class distribution
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;countplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# adjust figure 
&lt;/span&gt;&lt;span class="n"&gt;ticks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Didn't donate in 2007 (0)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Did donate in 2007 (1)"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"# of individuals"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Class Distribution of Dependent Variable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BH83UZw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_16_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BH83UZw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_16_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is clearly a class imbalance, which we’ll have to account for later in our modeling; this highlights why it’s always important to visually explore your data. Next we should also determine whether there exists any correlation amongst our independent variables; if there is, we could implement some dimensionality reduction or remove some redundant variables. We’ll inspect this by plotting the pairwise relationships between the variables, and also viewing a heatmap of pairwise Pearson correlation values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# view the correlation between variables
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pairplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;vars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x4i2xd_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_18_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x4i2xd_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_18_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simply looking at the histogram of each variable, we see they’re all positively skewed, which is another thing we’ll have to adjust for before modeling by scaling our data. Along with that, the variables “f” and “m” appear to be positively correlated, forming basically a straight line. This makes sense, as from our attribute information we know that frequency is “the total number of donations” and monetary is “the total blood donated in c.c.”. We should therefore expect that as frequency increases, monetary will also increase.&lt;br&gt;
&lt;/p&gt;
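&lt;p&gt;One caveat, illustrated on synthetic skewed data (an aside, not from the original post): StandardScaler centers to mean 0 and rescales to unit variance, but it does not remove skew on its own; a log or power transform would be needed for that:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic positively skewed feature
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=5.0, size=(1000, 1))

scaled = StandardScaler().fit_transform(skewed)
# mean is ~0 and std is ~1 after scaling, but the shape stays skewed
print(scaled.mean().round(6), scaled.std().round(6))
```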

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# obtain the pairwise pearson correlation
&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# for removing upper diagonal of information
&lt;/span&gt;
&lt;span class="c1"&gt;# plot heatmap of pearson correlation
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Recency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Frequency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Monetary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Time"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Pairwise Correlation Heatmap"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U-HwwYZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_20_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U-HwwYZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_20_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our suspicions are confirmed: by the looks of it, our Frequency and Monetary variables are perfectly positively correlated with each other. We’ll remove one of these columns, as it would interfere with our modeling efforts down the line. It also appears that Frequency/Monetary are somewhat positively correlated with Time; my heuristic is to leave a variable in if its Pearson correlation is |x| &amp;lt; .75, so in this case we’ll simply stick to removing either frequency or monetary.&lt;br&gt;
&lt;/p&gt;
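&lt;p&gt;The perfect correlation is no accident: each donation is a fixed volume, so monetary is just frequency times a constant. A synthetic sketch of the same structure (assuming 250 c.c. per donation, as in the original data) shows why the Pearson correlation comes out as exactly 1:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
f = rng.integers(1, 50, size=100)

df_demo = pd.DataFrame({
    "f": f,        # frequency: number of donations
    "m": f * 250,  # monetary: a constant multiple of frequency
})

# a positive constant multiple yields a Pearson correlation of 1
print(df_demo.corr().loc["f", "m"].round(6))
```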

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# drop the monetary column
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ignore"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this is a classification task, another good thing to do is look at the distribution of values within each class. To get a quick summary of this information, we can use a box and whisker plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot categorical distribution of values
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;131&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;showmeans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Recency"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;132&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;showmeans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Frequency"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;133&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;showmeans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Time"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hjp5VQ8x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_24_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hjp5VQ8x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_24_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As noted before, we will be scaling our data to within the same range; it is quite noticeable that the ranges in the 3 plots above differ substantially. Beyond that, what stands out to me is the somewhat similar characteristics between the classes, particularly in the Time variable. In both the Recency and Frequency variables we see a decent amount of outliers, which could dampen our logit model’s ability to distinguish between the two classes. After generating a vanilla model we’ll assess its performance and decide whether we want to drop our outlier observations.&lt;/p&gt;
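If we do decide to drop outliers later, the standard 1.5×IQR rule (which matches what the box plot whiskers flag) is one way to find them. A minimal sketch, with made-up recency-like values for illustration:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

recency = pd.Series([2, 3, 4, 4, 5, 6, 7, 9, 11, 14, 16, 74])
mask = iqr_outlier_mask(recency)
print(recency[mask].tolist())  # only the extreme 74 is flagged
```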

&lt;p&gt;Last but certainly not least, we’ll look at the descriptive statistics of our variables. This is typically also helpful at the beginning of any EDA, as you should notice any suspicious facts about your data almost immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output descriptive statistics
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;r&lt;/th&gt;
&lt;th&gt;f&lt;/th&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;y&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;count&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mean&lt;/td&gt;
&lt;td&gt;9.506684&lt;/td&gt;
&lt;td&gt;5.514706&lt;/td&gt;
&lt;td&gt;34.282086&lt;/td&gt;
&lt;td&gt;0.237968&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;std&lt;/td&gt;
&lt;td&gt;8.095396&lt;/td&gt;
&lt;td&gt;5.839307&lt;/td&gt;
&lt;td&gt;24.376714&lt;/td&gt;
&lt;td&gt;0.426124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;min&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;td&gt;1.000000&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;2.750000&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;16.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;7.000000&lt;/td&gt;
&lt;td&gt;4.000000&lt;/td&gt;
&lt;td&gt;28.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;14.000000&lt;/td&gt;
&lt;td&gt;7.000000&lt;/td&gt;
&lt;td&gt;50.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;74.000000&lt;/td&gt;
&lt;td&gt;50.000000&lt;/td&gt;
&lt;td&gt;98.000000&lt;/td&gt;
&lt;td&gt;1.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nothing unusual here, as we’ve uncovered the majority of our information from our previous visualizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocessing
&lt;/h2&gt;

&lt;p&gt;This step is short: it just involves setting our data up for modeling by splitting it into training and testing sets and doing any additional scaling/manipulation. It’s typically best to use a pipeline for the scaling/manipulation of your data, as it reduces headaches down the line and provides a simple interface for modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create our data pipeline
&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;skfold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our pipeline is quite simple, but given a different dataset with different characteristics we’d have to use something else. One thing to note: we are doing nothing to account for class imbalance in our vanilla pipeline, but based on our assessment we will see whether we need to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# separate our data into training and testing sets
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the stratify keyword argument to make sure we separate a proportionate amount of each class into our training and testing sets. Since test_size is .3, roughly 70% of class 0 and class 1 respectively will be in the training set, and 30% will be in the test set.&lt;/p&gt;
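We can verify that stratification preserves the class balance in both splits. A quick sketch with a synthetic 0/1 target at the same roughly 24% positive rate as our data (the features and seed here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 748 rows, 178 positives (178/748 ≈ 0.238)
rng = np.random.default_rng(0)
X = rng.normal(size=(748, 3))
y = np.array([1] * 178 + [0] * 570)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
# both splits keep the full dataset's positive rate
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
print(len(y_tr))  # 523 training rows, matching the report's support
```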

&lt;h2&gt;
  
  
  Model Validation
&lt;/h2&gt;

&lt;p&gt;To validate the results of our training, we’ll use cross validation with stratified k-folds to get a general sense of our model’s performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform CV on our data and output f1 results
&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.2470'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our current model performs horribly. This may be due to a number of reasons; it could even be that the logistic regression algorithm just isn’t suited to this task. Either way, we should take this baseline score and try to improve on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and output precision recall curve
&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_precision_recall_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Precision-Recall Curve Baseline Logit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BBVNkeEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_39_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BBVNkeEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_39_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The precision-recall curve for our classifier shows it barely improves on a no-skill baseline (whose precision equals the positive-class prevalence, roughly .24): on the minority class, it is wrong more often than it is right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.79 0.98 0.88 399
           1 0.72 0.17 0.27 124

    accuracy 0.79 523
   macro avg 0.76 0.57 0.58 523
weighted avg 0.78 0.79 0.73 523
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
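As a quick sanity check, the minority-class f1 in the report is just the harmonic mean of that row’s precision and recall. Using the rounded values from the table, so the result only approximates the reported 0.27:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# minority-class precision and recall from the report above
print(round(f1(0.72, 0.17), 3))  # ≈ 0.275, in line with the reported 0.27
```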



&lt;p&gt;It appears we are being severely hurt by the imbalance in our classes, so next we’ll use synthetic oversampling and random undersampling to better our model. Although this isn’t hyperparameter tuning, it is important to tune your data just as much as you’d tune your model, as what you get out is only as good as what you put in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# make balanced pipeline
&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# completely balance the two classes
&lt;/span&gt;    &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our new pipeline integrates the imbalanced-learn library’s Synthetic Minority Oversampling Technique (SMOTE) to upsample our minority class. It does this by generating synthetic data points from our minority class data, yielding an updated dataset with a balanced class distribution. Since we’ll have a balanced dataset for training, we’ll be able to use an ROC curve and AUC to assess performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform CV on our data and output f1 results
&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.5153'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve already doubled our baseline f1 score, which is an excellent sign; we desperately needed to implement that upsampling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and output ROC curve, now that our pipeline is balancing our dataset
&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Precision-Recall Curve Balanced Logit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"--"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tkcAMdFm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_47_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tkcAMdFm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_47_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that our model performs alright. It isn’t anything special, but it does have some ability to distinguish between the two classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.90 0.63 0.74 399
           1 0.39 0.77 0.52 124

    accuracy 0.66 523
   macro avg 0.64 0.70 0.63 523
weighted avg 0.78 0.66 0.69 523
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Likewise, our classification report shows that the recall score for our minority class has dramatically gone up, by .60. This has come at the cost of some of the other numbers, though.&lt;/p&gt;

&lt;p&gt;Next we will be looking into some more model validation, and finally hyperparameter tuning to optimize our model some more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning
&lt;/h2&gt;

&lt;p&gt;Now that we have our data pipeline working how we want it to, we’ll look into tuning our logit classifier by altering some of its parameters (called hyperparameter tuning in the biz). The easiest way to run through a finite set of possible parameters is to use GridSearchCV; if you have a large set of parameters to search through, it is a lot better to use RandomizedSearchCV, which will only build the number of models you specify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create parameter grid and grid search
&lt;/span&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"logisticregression__penalty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"l1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"l2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;"logisticregression__C"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s"&gt;"logisticregression__fit_intercept"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;gs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.5153'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like our grid search yields no improvement. However, this is a good example of what not to do if you have a very large parameter space to search through. Instead, you should check out RandomizedSearchCV, which randomly samples a parameter set for each iteration.&lt;br&gt;
&lt;/p&gt;
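A randomized search over the same kind of pipeline might look like the sketch below. The synthetic data, the search budget of 20 candidates, and the log-uniform range for C are all illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in with roughly our data's class split
X, y = make_classification(n_samples=523, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.76, 0.24], random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=0),
)

# distributions are sampled, so only n_iter models get built
param_distributions = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": loguniform(1e-4, 1e2),
}
rs = RandomizedSearchCV(pipe, param_distributions, n_iter=20,
                        scoring="f1", random_state=0, n_jobs=-1)
rs.fit(X, y)
print(rs.best_params_)
```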

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform CV on our data and output f1 results
&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.5153'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and output ROC curve, now that our pipeline is balancing our dataset
&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Precision-Recall Curve Grid Search Balanced Logit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"--"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vssuMvdN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_57_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vssuMvdN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_57_0.png" alt="png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.90 0.63 0.74 399
           1 0.39 0.77 0.52 124

    accuracy 0.66 523
   macro avg 0.64 0.70 0.63 523
weighted avg 0.78 0.66 0.69 523
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test data predictions
&lt;/span&gt;&lt;span class="n"&gt;test_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.90 0.57 0.70 171
           1 0.37 0.80 0.50 54

    accuracy 0.62 225
   macro avg 0.63 0.68 0.60 225
weighted avg 0.77 0.62 0.65 225
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot confusion matrix
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wpMrCGs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_61_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wpMrCGs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_61_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the looks of it, we’ve carefully tuned a model that performs poorly to begin with. The lesson still stands, though: the process of validation and tuning remains very much the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When working on any machine learning problem, your data is invaluable; sometimes you have a lot and sometimes very little. Regardless of the volume, you must always ensure there is no data leakage, which at best leads to false confidence and at worst leaves you unaware that your models are terrible. Preventing leakage is often simple: split your data into two groups, one for training and another for testing. Do make sure, however, that there is no order dependence (or any other dependence) in your data before splitting.&lt;/p&gt;
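&lt;p&gt;A leakage-free split is a one-liner with scikit-learn. A minimal sketch (the dataset here is synthetic, sized to mirror the 523/225 split in the reports above, not the post’s actual data):&lt;/p&gt;

```python
# Hold out a test set before any modeling; stratify to preserve class balance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=748, weights=[0.76], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape[0], X_test.shape[0])  # 523 train / 225 test
```

&lt;p&gt;From here on, only &lt;code&gt;X_train&lt;/code&gt;/&lt;code&gt;y_train&lt;/code&gt; should ever touch your model until the final evaluation.&lt;/p&gt;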

&lt;p&gt;After splitting your data, it’s always good to use cross-validation: it lets you verify, using only your training data, that you’re heading in the right direction. You only need cross-validation on the training set, and your model should never touch the testing/hold-out data (don’t even look at it yourself) until you’re reasonably sure you have your final model. The problem that usually arises is that developers check whether their model is good on the testing data, and then optimize against that same testing data, which defeats the purpose. You want your model to generalize well, and that means it can’t see your testing data at all.&lt;/p&gt;

&lt;p&gt;Lastly, hyperparameter tuning: not everyone remembers exactly what each parameter does for every function call. That’s why documentation exists; check the docs for whichever algorithm you’re using and determine how to manipulate its hyperparameters. You can often identify good hyperparameter values during EDA, but when you can’t, tools like GridSearchCV or RandomizedSearchCV help you search a large parameter space.&lt;/p&gt;
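&lt;p&gt;As a rough sketch of that workflow (a synthetic dataset and a logistic-regression pipeline standing in for whatever model you’re tuning):&lt;/p&gt;

```python
# Minimal grid search sketch: tune C for a logistic regression with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

gs = GridSearchCV(pipe, param_grid, scoring="f1", cv=StratifiedKFold(5), n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 4))
```

&lt;p&gt;The grid here is tiny on purpose; with a large parameter space, RandomizedSearchCV with a fixed budget of iterations is usually the better starting point.&lt;/p&gt;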

</description>
    </item>
    <item>
      <title>The Bias Variance Tradeoff</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Wed, 17 Jun 2020 05:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/the-bias-variance-tradeoff-2a8f</link>
      <guid>https://dev.to/edwardamor/the-bias-variance-tradeoff-2a8f</guid>
      <description>&lt;p&gt;Foundational to any data science curriculum is the introduction of the terms bias and variance, and subsequently the trade-off that exists between the two. As machine learning continues to grow it is imperative that we understand these concepts, as they directly effect the predictions we make and the business value we can derive from our generated models. While machine learning may seem simple, one of the more difficult parts is optimizing your models but sometimes optimization can lead to over-fitting and if your model is too simple it may be under-fitting your data. The inevitable trade-off between these two aspects will greatly impact the validity of your model, and the predictions you make. But what is bias, what is variance, and what is this trade-off?&lt;/p&gt;

&lt;h2&gt;
  
  
  Bias
&lt;/h2&gt;

&lt;p&gt;When we speak of bias, we aren’t talking about the ordinary cognitive bias us humans are susceptible to. In machine learning, bias refers to the difference between our model’s predictions and the expected values (prediction - reality). A model with high bias consistently makes wrong predictions because it fails to capture the complexity of our data. &lt;strong&gt;A model with high bias is under-fitting&lt;/strong&gt; our data, and it does so consistently on the testing/validation data as well as during training.&lt;/p&gt;

&lt;p&gt;We can identify if our model has high bias if the following occur:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We tend to get high training errors.&lt;/li&gt;
&lt;li&gt;The validation error or test error will be similar to the training error.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can compensate for high bias by doing the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need to gather more input features, or generate new ones using feature engineering techniques.&lt;/li&gt;
&lt;li&gt;We can add polynomial features in order to increase the complexity.&lt;/li&gt;
&lt;li&gt;If we are using any regularization terms in our model, we can try to minimize them.&lt;/li&gt;
&lt;/ol&gt;
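&lt;p&gt;A small sketch of point 2, assuming scikit-learn: a plain line under-fits a quadratic trend, while the same linear model on polynomial features captures it (the data here is synthetic, for illustration only):&lt;/p&gt;

```python
# Underfitting fix sketch: a line can't capture a quadratic trend,
# but adding polynomial features lets a linear model fit it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # quadratic signal + noise

linear = LinearRegression().fit(X, y)                                    # high bias
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(round(linear.score(X, y), 3), round(poly.score(X, y), 3))
```

&lt;p&gt;The linear model’s R&amp;#178; stays near zero even on its own training data, the telltale high-bias symptom from the list above.&lt;/p&gt;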

&lt;h2&gt;
  
  
  Variance
&lt;/h2&gt;

&lt;p&gt;Similar to the statistical term, variance refers to the variability of our model’s predictions. A model with high variance does not generalize well; instead, it pays too much attention to our training data. What ends up happening is we get a model which performs very well during training, but shows very high error rates when introduced to testing/validation or any other unseen data. One way to think about it is like a travel route: if you took the route and mapped it onto a completely different area, it wouldn’t work, as it only fits the particular origin and destination it was made for. We aren’t making routes, but the concept still holds: we want models which generalize well to unseen data similar to the data used during training.&lt;/p&gt;

&lt;p&gt;We can identify whether the model has high variance if:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We tend to get a low training error.&lt;/li&gt;
&lt;li&gt;The validation error or test error will be very high.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can fix high variance by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gathering more training data, so that the model can learn more based on the patterns rather than the noise.&lt;/li&gt;
&lt;li&gt;We can even try to reduce the input features or do feature selection, reducing model complexity.&lt;/li&gt;
&lt;li&gt;If we are using any regularization terms in our model, we can try to maximize them.&lt;/li&gt;
&lt;/ol&gt;
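&lt;p&gt;A quick sketch of point 3 using ridge regression (scikit-learn, synthetic data): a larger penalty shrinks the coefficients, reducing the model’s sensitivity to the particular training sample:&lt;/p&gt;

```python
# Overfitting fix sketch: increasing the ridge penalty (alpha) shrinks the
# coefficients, trading a little bias for lower variance.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))                 # few samples, many features
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=30)

weak = Ridge(alpha=0.01).fit(X, y)            # barely regularized
strong = Ridge(alpha=100.0).fit(X, y)         # heavily regularized
print(round(float(np.linalg.norm(weak.coef_)), 2),
      round(float(np.linalg.norm(strong.coef_)), 2))
```

&lt;p&gt;The same idea applies to the &lt;code&gt;C&lt;/code&gt; parameter of a logistic regression (where smaller &lt;code&gt;C&lt;/code&gt; means stronger regularization).&lt;/p&gt;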

&lt;h2&gt;
  
  
  The Trade-Off
&lt;/h2&gt;

&lt;p&gt;Now that we know what bias and variance are, it is key to understand that reducing one tends to increase the other: a model with high bias will have low variance, and vice versa. With respect to bias and variance, a model’s expected error can be decomposed into the sum of three parts: the squared bias, the variance, and irreducible noise (&lt;code&gt;Error = Bias^2 + Variance + Noise&lt;/code&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;bias–variance decomposition&lt;/strong&gt; is a way of analyzing a learning algorithm’s &lt;a href="https://en.wikipedia.org/wiki/Expected_value"&gt;expected&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Generalization_error"&gt;generalization error&lt;/a&gt; with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the &lt;em&gt;irreducible error&lt;/em&gt;, resulting from noise in the problem itself.&lt;/p&gt;

&lt;p&gt;— Wikipedia &lt;sup id="fnref:1"&gt;1&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The trade-off, therefore, is finding the bias and variance levels that minimize our overall error. Using the steps listed previously for reducing either of the two, one has to iteratively improve on the generated models until arriving at one that is reasonably balanced between bias and variance. If we don’t balance these two terms, we’ll end up with a model that either under-fits or over-fits our data, which gives us no value when it comes to making predictions.&lt;/p&gt;
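&lt;p&gt;The decomposition can also be checked empirically. A small Monte Carlo sketch of my own (fitting polynomials to noisy sine data, not anything from the post): refitting on many resampled training sets, a degree-1 model shows high squared bias while a degree-10 model shows high variance:&lt;/p&gt;

```python
# Empirical bias-variance sketch: refit a simple and a complex model on many
# resampled training sets, then measure squared bias and variance of their
# predictions at a single fixed test point.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

x_test = 1.0

def predictions(degree, n_runs=300, n_train=30, noise=0.3):
    preds = []
    for _ in range(n_runs):
        x = rng.uniform(-3, 3, n_train)
        y = true_f(x) + rng.normal(scale=noise, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

results = {}
for degree in (1, 10):
    p = predictions(degree)
    results[degree] = ((p.mean() - true_f(x_test)) ** 2, p.var())
    print(degree, results[degree])  # (squared bias, variance)
```

&lt;p&gt;Neither term can be driven to zero here without inflating the other, which is exactly the trade-off described above.&lt;/p&gt;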




&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff"&gt;https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>And Now for Something Completely Different</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sat, 02 May 2020 13:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/and-now-for-something-completely-different-7l6</link>
      <guid>https://dev.to/edwardamor/and-now-for-something-completely-different-7l6</guid>
      <description>&lt;p&gt;On January 21st, 2020, I enrolled in Flatiron School’s Data Science Bootcamp with the intention of gaining and developing the foundational skills and techniques necessary to become a Data Scientist. At the time of writing, I’m about 4 months into the program and in retrospect, I believe my decision to enroll was one of the best choices I’ve made. Along with the opportunities that will be available to me when I finish, the passionate and intelligent peers that I get to collaborate with, and the breadth of exciting, new, and challenging material I’m learning, I am glad I made the decision to learn data science.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As my journey progresses, there is one question I keep getting, &lt;em&gt;why did I decide to learn data science?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PkXaADZF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/xps.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PkXaADZF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/xps.jpg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
            &lt;p&gt;Photo by &lt;a href="https://unsplash.com/@xps?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;XPS&lt;/a&gt; on &lt;a href="https://unsplash.com/t/technology?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

 &lt;/p&gt;

&lt;h2&gt;
  
  
  Passion for STEM
&lt;/h2&gt;

&lt;p&gt;As an interdisciplinary field, &lt;strong&gt;Data Science&lt;/strong&gt; incorporates a lot from many other fields like mathematics, statistics, computer science, and information science. I wouldn’t have made the decision to learn data science if it wasn’t for my passion and joy for programming coupled with my love for mathematics. Ultimately, &lt;strong&gt;it really brings me joy&lt;/strong&gt; to work on the projects and material at my Bootcamp, a feeling I can only compare to the addiction of staying up late to solve calculus equations. Except now instead of calculus equations, it’s iteratively designing regression models and extracting business insights from large datasets, and who can forget making beautiful visualizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adaptability
&lt;/h2&gt;

&lt;p&gt;One of the best parts of data science is that it’s so &lt;strong&gt;flexible and can be applied to practically any domain&lt;/strong&gt;. This high adaptability is clearly reflected in the tools data scientists use, the basic skills they require, and the knowledge they draw from. And with the rising amount of data collection by businesses large and small, data scientists fit in by analyzing that data and providing businesses with insights into the relationships within it. Moreover, since they’re generally skilled at every step of the data lifecycle, many organizations could benefit from having a specialist like a data scientist on staff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Return on Investment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v2O7xpa3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/pennyplant.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v2O7xpa3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/pennyplant.jpeg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
            &lt;p&gt;Photo by &lt;a href="https://unsplash.com/@micheile?utm_source=medium&amp;amp;utm_medium=referral"&gt;Micheile Henderson&lt;/a&gt; on &lt;a href="https://unsplash.com/t/technology?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

 &lt;/p&gt;

&lt;p&gt;At the end of the day &lt;strong&gt;I’m investing in myself and my future&lt;/strong&gt;, so why not take advantage of my youth, break a few eggs trying new things, and experiment with what works and what doesn’t? I don’t know with absolute certainty that my decision will put me in a better financial position in the long run, but I do know that I am getting a head start &lt;strong&gt;developing my skills and turning myself into an asset&lt;/strong&gt;. And since the role is in such high demand, &lt;strong&gt;data science is very lucrative&lt;/strong&gt;. Who wouldn’t want to do something they’re excited about while also making a living? It sounds like the perfect scenario.&lt;/p&gt;

&lt;p&gt;The journey ahead of me is a long and arduous one, but I know beyond all doubt that I will look back on my decision to learn Data Science as the best choice I made for my career. I can’t wait to see the amazing things I will accomplish. Until then I’ll keep working hard and sharpening my skills.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Virtual Environments with Python</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Tue, 17 Mar 2020 21:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/virtual-environments-with-python-4djc</link>
      <guid>https://dev.to/edwardamor/virtual-environments-with-python-4djc</guid>
      <description>&lt;p&gt;Similar to other programming languages (R, Ruby, Scala, JavaScript) Python comes with its own way of managing third party packages you choose to install for projects. And since Python 3.4, pip has been included by default in all binary installations of Python, allowing users to install packages from the Python Packaging Index (a public repository of open source licensed packages). However, there is one major shortcoming of the way packages are managed, and that is all packages get installed and retrieved from the same place. To the uninitiated this may not seem like an issue, however it is a disaster waiting to happen.&lt;/p&gt;

&lt;p&gt;Without going in-depth on the inner workings of package managers and dependency resolution, I’ll paint a simple picture. You’re a hobbyist developer and you enjoy scripting general tasks that are monotonous, and your new project involves downloading a bunch of images from a website. You’ve read through some repositories and figure you only need the package &lt;code&gt;foo&lt;/code&gt; to get the task done. You go to download &lt;code&gt;foo&lt;/code&gt; using pip but suddenly an error gets raised &lt;code&gt;ERROR: bar 1.0 has requirement requests==2.24.0, but you'll have requests 2.22.0 which is incompatible&lt;/code&gt;. It appears a package you previously installed &lt;code&gt;bar&lt;/code&gt; has a conflicting dependency with &lt;code&gt;foo&lt;/code&gt;, in this case they both require different versions of the &lt;code&gt;requests&lt;/code&gt; library. To solve this issue you could manually go through your package list and remove stuff, or you could use a virtual environment.&lt;/p&gt;

&lt;p&gt;A virtual environment is essentially an isolated sandbox, with its own instance of pip and an isolated set of packages (and their dependencies). This means that instead of downloading packages globally, each project can have its own isolated environment with no dependency conflicts with other projects. There is also no limit: every project you have can get its own unique sandbox of packages to work with. I imagine your next question is, how do I get started using them? Honestly, there are so many ways the options seem endless.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Standard Library
&lt;/h2&gt;

&lt;p&gt;The simplest way to start using virtual environments is to use the &lt;code&gt;venv&lt;/code&gt; module (&lt;a href="https://docs.python.org/3/library/venv.html"&gt;link here&lt;/a&gt;) that’s available in the standard library. Simply run the following command inside your project’s directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv env
$ source env/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Et voilà, you’ve officially created and activated your virtual environment. You’ll know it’s active because your prompt will change, and you can verify by running &lt;code&gt;pip list&lt;/code&gt;; you should see both &lt;code&gt;pip&lt;/code&gt; and &lt;code&gt;setuptools&lt;/code&gt; installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(env) $ pip list

Package Version
---------- -------
pip 20.1.1
setuptools 47.1.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll also have a new directory &lt;code&gt;env&lt;/code&gt; in your project; make sure not to commit it to your version control system. Instead, if you aren’t already, keep a requirements.txt in your project: a plain text file listing one required package per line. This allows you, and any collaborators, to recreate your environment simply by running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;.&lt;/p&gt;
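&lt;p&gt;A requirements.txt might look like the following (the package names and version pins here are purely illustrative):&lt;/p&gt;

```
requests==2.24.0
beautifulsoup4==4.9.1
pandas
```

&lt;p&gt;Entries without a version pin install the latest available release, while pinned entries make the environment reproducible.&lt;/p&gt;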

&lt;p&gt;The main disadvantage of this method of creating virtual environments is you have to maintain your requirements.txt file. Typically this means manually appending packages to the file &lt;em&gt;(if you want a human readable version)&lt;/em&gt;, or running &lt;code&gt;pip freeze &amp;gt; requirements.txt&lt;/code&gt; &lt;em&gt;(for a more explicit machine readable version)&lt;/em&gt; every time you install something new. &lt;strong&gt;Note that the output of the &lt;code&gt;pip freeze&lt;/code&gt; command will include the exact version number of each package you’ve installed along with their dependencies&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipenv
&lt;/h2&gt;

&lt;p&gt;An alternative to the standard library’s &lt;code&gt;venv&lt;/code&gt; module is &lt;code&gt;pipenv&lt;/code&gt;, from the same mind that created the popular &lt;code&gt;requests&lt;/code&gt; python library. As the &lt;code&gt;pipenv&lt;/code&gt; &lt;a href="https://github.com/pypa/pipenv"&gt;repository&lt;/a&gt; says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[Pipenv] automatically creates and manages a virtualenv for your projects, as well as adds/removes packages from your &lt;code&gt;Pipfile&lt;/code&gt; as you install/uninstall packages. It also generates the ever-important &lt;code&gt;Pipfile.lock&lt;/code&gt;, which is used to produce deterministic builds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is an amazing tool once you get tired of using &lt;code&gt;venv&lt;/code&gt;, and similar to other projects by &lt;a href="https://github.com/ken-reitz"&gt;Ken Reitz&lt;/a&gt; it is made for humans, and is immensely simple to use.&lt;/p&gt;

&lt;p&gt;To get started, the best way to use pipenv is to have a fresh install of python, although it is fine if you don’t. Simply run &lt;code&gt;pip install pipenv&lt;/code&gt; and you’re all set. Moving forward instead of using pip to install packages for your projects, you’ll use &lt;code&gt;pipenv install [insert package name]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One thing you’ll notice when using &lt;code&gt;pipenv&lt;/code&gt; is that instead of a requirements.txt, it generates a Pipfile and Pipfile.lock. Both of these files are important and should be committed to your version control system. The Pipfile contains information about your project’s dependencies, whereas the Pipfile.lock contains sha256 hashes of each downloaded package, allowing &lt;code&gt;pip&lt;/code&gt; to guarantee you’re installing exactly what you intend to. The result is a simple way to get deterministic environments (environments which are exactly the same), without any intervention from you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note that in order to run your python files using your virtual environment, you’ll need to activate it. With &lt;code&gt;pipenv&lt;/code&gt; it’s as simple as running &lt;code&gt;pipenv shell&lt;/code&gt; while inside your project’s directory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One disadvantage to using &lt;code&gt;pipenv&lt;/code&gt;, similar to &lt;code&gt;venv&lt;/code&gt;, is that you need to already have python and pip installed on your system, otherwise you won’t be able to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conda
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;conda&lt;/code&gt; is the de facto environment/package manager for python data scientists for a reason. It’s a platform-agnostic binary (python doesn’t need to already be on your system) which not only does package management but also lets you have different versions of python for different projects. You can download it by going to the &lt;a href="https://www.anaconda.com/"&gt;Anaconda website&lt;/a&gt; and selecting the installer for your platform. After you have it, you can use the graphical user interface &lt;code&gt;anaconda-navigator&lt;/code&gt; to access and manage your virtual environments. However, if you’re like me and live in your shell, then you’ll most likely want to use the &lt;code&gt;conda&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note there is wonderful &lt;a href="https://docs.conda.io/en/latest/"&gt;documentation&lt;/a&gt; on all the configuration you can perform, which I won’t go into, but I highly recommend you edit your condarc to your specification.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main parts of &lt;code&gt;conda&lt;/code&gt; that you should get acquainted with are creating environments, installing packages, and generating an environment.yml. The environment.yml is similar to both a requirements.txt and Pipfile.&lt;/p&gt;

&lt;p&gt;To create an environment, activate, and install a package to the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ conda create --name my-env # create a new environment
$ conda activate my-env # activate the environment
(my-env) $ conda install jupyter # install jupyter in the environment

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of the most important things is to create an environment.yml after installing packages, and commit it to version control. You can generate two types: &lt;code&gt;conda env export&lt;/code&gt; produces a fully pinned, deterministic version, which isn’t as useful when working across multiple platforms, while &lt;code&gt;conda env export --from-history&lt;/code&gt; produces a more generally useful version containing only the packages you explicitly requested. These files can also be written by hand if you ever need to.&lt;/p&gt;
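&lt;p&gt;An environment.yml in the second, hand-maintainable style might look like this (contents illustrative, matching the &lt;code&gt;my-env&lt;/code&gt;/&lt;code&gt;jupyter&lt;/code&gt; example above):&lt;/p&gt;

```yaml
name: my-env
channels:
  - defaults
dependencies:
  - python=3.8
  - jupyter
```

&lt;p&gt;Anyone can then recreate the environment with &lt;code&gt;conda env create -f environment.yml&lt;/code&gt;.&lt;/p&gt;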

&lt;p&gt;One of the biggest advantages of using &lt;code&gt;conda&lt;/code&gt; is that it also works for multiple programming languages, not just Python. A short list of the languages it is available for include R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pyenv &amp;amp; Pyenv-Virtualenv
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pyenv&lt;/code&gt; is a python environment manager written in shell scripts (available for *nix systems). As it says in the &lt;a href="https://github.com/pyenv/pyenv"&gt;project&lt;/a&gt; description, &lt;em&gt;“It’s simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.”&lt;/em&gt; Coupled with &lt;code&gt;pyenv-virtualenv&lt;/code&gt; you can create virtual environments for many versions of python, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 2&lt;/li&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;li&gt;activepython&lt;/li&gt;
&lt;li&gt;anaconda2&lt;/li&gt;
&lt;li&gt;anaconda3&lt;/li&gt;
&lt;li&gt;ironpython&lt;/li&gt;
&lt;li&gt;jython&lt;/li&gt;
&lt;li&gt;micropython&lt;/li&gt;
&lt;li&gt;miniconda3&lt;/li&gt;
&lt;li&gt;pypy&lt;/li&gt;
&lt;li&gt;pypy2.7&lt;/li&gt;
&lt;li&gt;pypy3.6&lt;/li&gt;
&lt;li&gt;pyston&lt;/li&gt;
&lt;li&gt;stackless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll notice that anaconda is available, and by using &lt;code&gt;pyenv&lt;/code&gt; you’re not limited to just regular python. &lt;code&gt;pyenv&lt;/code&gt; really shines on GNU/Linux because it alleviates the pressures of installing packages to your system version of python. Personally, I prefer to use &lt;code&gt;pyenv&lt;/code&gt; as it allows me to mix and match and play around with my python environments with no worry. And it’s super simple to install, and remove if you choose to abandon it.&lt;/p&gt;

&lt;p&gt;The simplest way to install it is to use the &lt;a href="https://github.com/pyenv/pyenv-installer"&gt;automatic installer&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl https://pyenv.run | bash
$ exec $SHELL

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once done, you’ll have the &lt;code&gt;pyenv&lt;/code&gt; command available, along with some additional plugins. Most importantly you’ll have &lt;code&gt;pyenv-virtualenv&lt;/code&gt;, which allows you to create virtual environments for the python versions you install.&lt;/p&gt;

&lt;p&gt;To get started creating virtual environments, you’ll first need to install a version of python, then use &lt;code&gt;pyenv virtualenv&lt;/code&gt; to create one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pyenv install 3.8.5 # version I want to install
$ pyenv virtualenv 3.8.5 my-env # create the environment
$ pyenv activate my-env # activate the environment

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nicest part of &lt;code&gt;pyenv-virtualenv&lt;/code&gt; is that you can set which python version to use both globally and per directory. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[me@host my-project] $ pyenv local my-env

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates a new file, &lt;code&gt;.python-version&lt;/code&gt;, in the project directory, and every time you enter or leave the directory the environment is automatically activated or deactivated. The best part is that it’s so easy.&lt;/p&gt;
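&lt;p&gt;There’s no magic in the file itself: &lt;code&gt;pyenv local&lt;/code&gt; simply writes the environment name into a plain-text &lt;code&gt;.python-version&lt;/code&gt; file, which pyenv’s shell hooks read whenever you change directories. A rough sketch of the mechanism, using a hypothetical &lt;code&gt;my-project&lt;/code&gt; directory and no pyenv installed:&lt;/p&gt;

```shell
# What `pyenv local my-env` does under the hood: write the env name
# into a .python-version file in the current directory.
mkdir -p my-project
cd my-project
echo "my-env" > .python-version   # equivalent to: pyenv local my-env
cat .python-version               # pyenv's shell hook reads this on cd
```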

&lt;p&gt;The one disadvantage of &lt;code&gt;pyenv&lt;/code&gt; and friends is that you’ll most likely need to read up on it to get comfortable with it. The learning curve isn’t steep, and it’s always best to know the inner workings of any tool you use. &lt;strong&gt;Note: on some distributions of GNU/Linux you will have to install additional build dependencies, which can be found in the ‘Common build problems’ section of the &lt;a href="https://github.com/pyenv/pyenv/wiki/Common-build-problems"&gt;wiki&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;My recommendation differs depending on your platform and use case; above all else, experiment with the options above. I’ve only covered some of the mainstream tools here, and plenty of more obscure options exist as well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pipenv&lt;/code&gt; for 99% of use cases. &lt;code&gt;conda&lt;/code&gt; if you’re into data science&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac OSX&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pyenv&lt;/code&gt; with &lt;code&gt;pyenv-virtualenv&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GNU/Linux&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pyenv&lt;/code&gt; with &lt;code&gt;pyenv-virtualenv&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If this was of any help to you, I hope you start taking environment management seriously moving forward. It’s a powerful practice that will spare you some of the more serious headaches that arise from skipping it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>High Level Overview of Quantile Quantile Plots</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sat, 01 Feb 2020 05:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/high-level-overview-of-quantile-quantile-plots-13ih</link>
      <guid>https://dev.to/edwardamor/high-level-overview-of-quantile-quantile-plots-13ih</guid>
      <description>&lt;p&gt;A part of any data analyst’s toolkit when working with one dimensional data, is the Quantile Quantile plot. Colloquially referred to as Q-Q plots, these visualizations are unique in that they’re mainly utilized when comparing samples and/or comparing distributions. Although they’re not intuitive, Q-Q plots are amazing tools, especially when assessing whether a sample fits a known distribution, like the Gaussian distribution.&lt;/p&gt;

&lt;p&gt;Q-Q plots work by plotting the quantiles of one distribution (the x-coordinates), typically a theoretical distribution, against the quantiles of another distribution (the y-coordinates), typically an observed dataset. If the two distributions being compared are similar, the resulting points will lie approximately on the line &lt;code&gt;y=x&lt;/code&gt;. There are some variations of the Q-Q plot, and each one tells you something different about the data being compared. Q-Q plots are also loosely open to interpretation: a good heuristic is that if the points generally lie close enough to the line &lt;code&gt;y=x&lt;/code&gt;, you’re golden. Even data randomly drawn from the Gaussian distribution won’t lie exactly on the line &lt;code&gt;y=x&lt;/code&gt;, so there is wiggle room.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2sDgU8kj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/normal-qq-plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2sDgU8kj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/normal-qq-plot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Q-Q plot shows the quantiles of 75 randomly drawn data points from the standard normal distribution, compared against the normal distribution. One might intuitively expect the points to lie perfectly on the line &lt;code&gt;y=x&lt;/code&gt;, but this isn’t the case, which is why we say Q-Q plots are loosely open to interpretation.&lt;/p&gt;

&lt;p&gt;One place where Q-Q plots are routinely applied is linear regression. In linear regression, certain assumptions have to be met for the fitted model to be considered valid and not misleading. One of those assumptions is that the residuals of the model are normally distributed. To verify this assumption has not been violated, we typically use a Q-Q plot to quickly compare the distribution of the residuals to the Gaussian distribution. If the residuals loosely fit the line &lt;code&gt;y=x&lt;/code&gt;, one can state that the assumption has not been violated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kl9C2XMq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/residuals-qq-plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kl9C2XMq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/residuals-qq-plot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Q-Q plot was generated from fitting a multivariate linear regression model; the residuals from the training data were then plotted against the standard normal distribution. One can see that this data doesn’t appear to be normal, due to the curvature of the points. This upward curvature denotes a positive skew in the residuals, meaning our model is over-predicting even on our training set.&lt;/p&gt;

&lt;p&gt;Just like any other graphical method for analyzing data, there are strengths and weaknesses to Q-Q plots. One has to know when best to use a Q-Q plot to receive the most benefit from it. In the case of Q-Q plots, they are immensely beneficial when comparing two distributions (theoretical or empirical), as they show how location (mean), scale (standard deviation), and skewness are similar or different in the two distributions. They’re also extremely beneficial when assessing the residuals of a regression model as shown previously.&lt;/p&gt;

&lt;p&gt;The biggest weakness of Q-Q plots, in my eyes, is the steep initial learning curve, but luckily the Internet offers a trove of information; one of the most helpful resources I found was a post on &lt;a href="https://stats.stackexchange.com/a/101290"&gt;StackExchange&lt;/a&gt;. Beyond that, the other major issue with Q-Q plots is that there is some room for interpretation in deciding whether your data lies close enough to the line &lt;code&gt;y=x&lt;/code&gt;. One person’s assessment will not always line up with another’s, but after some practice Q-Q plots provide an immense benefit when quickly assessing data.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
