<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: hoganbyun</title>
    <description>The latest articles on DEV Community by hoganbyun (@hoganbyun).</description>
    <link>https://dev.to/hoganbyun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F536741%2F3823c6d7-8919-4a01-9053-b8275a7263cf.jpg</url>
      <title>DEV Community: hoganbyun</title>
      <link>https://dev.to/hoganbyun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hoganbyun"/>
    <language>en</language>
    <item>
      <title>Classification Models in Scikit-learn</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Fri, 13 Aug 2021 16:40:49 +0000</pubDate>
      <link>https://dev.to/hoganbyun/classification-models-in-scikit-learn-1i33</link>
      <guid>https://dev.to/hoganbyun/classification-models-in-scikit-learn-1i33</guid>
      <description>&lt;p&gt;This post will walk you through some of the different classification models available to use in scikit-learn. &lt;/p&gt;

&lt;p&gt;First, it's important to go over what a classification algorithm is and how it is used. A classification algorithm takes in a training set of data to build a model that predicts and classifies new data into pre-determined categories. For example, a phone company may take in customer data pertaining to sales, location, etc. to determine whether certain customers are likely to stick with the company for the next calendar year. &lt;/p&gt;

&lt;h2&gt;
  
  
  K-Nearest Neighbors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it Does&lt;/strong&gt;&lt;br&gt;
K-Nearest Neighbors is a classification algorithm based on distances between points. Given a new point, KNN finds the &lt;em&gt;k&lt;/em&gt; nearest points in the training set, looks at their labels, and classifies the new point by the majority label among those neighbors. Look at the example below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cx485nrf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n7yrjbje4yl8adtxfuqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cx485nrf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n7yrjbje4yl8adtxfuqd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the green point is our starting point and we see the surrounding blue and red points. If &lt;em&gt;k&lt;/em&gt; = 3, we see the three nearest points are two reds and one blue. Thus, the algorithm will classify the green point as a red triangle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import KNeighborsClassifier
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate KNeighborsClassifier
&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fit the classifier
&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaled_data_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Predict on the test set
&lt;/span&gt;&lt;span class="n"&gt;test_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaled_data_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
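&lt;p&gt;The snippet above assumes data that has already been split and scaled (&lt;code&gt;scaled_data_train&lt;/code&gt;, etc.). As a self-contained sketch, using the bundled iris dataset purely for illustration, the full flow with scaling might look like this:&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so no single feature dominates the distance metric
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
preds = knn.predict(X_test_scaled)
print(preds[:5])
```

&lt;p&gt;Scaling matters for KNN in particular: since the algorithm is purely distance-based, an unscaled feature with a large range would dominate the neighbor search.&lt;/p&gt;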



&lt;h2&gt;
  
  
  Decision Trees
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it Does&lt;/strong&gt;&lt;br&gt;
A decision tree, quite simply, takes a starting point and makes multiple decisions that branch out to ultimately make a classification. See the example below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QoUhL_XP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ncq0yxpw97b77q7ekua5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QoUhL_XP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ncq0yxpw97b77q7ekua5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This particular example tries to determine what kind of contact lens a person should wear depending on different characteristics. These trees are built with a greedy search, which at each split chooses whichever test best separates the training data according to a chosen criterion (e.g. entropy/information gain or Gini impurity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt; 
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criterion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'entropy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_ohe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
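&lt;p&gt;&lt;code&gt;X_train_ohe&lt;/code&gt; above is assumed to be a one-hot encoded training set from earlier preprocessing. As a minimal self-contained sketch on the iris dataset, which also prints the learned split rules so you can see the greedy criterion at work:&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Keep the tree shallow so the printed rules stay readable
clf = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned decision rules as indented text
print(export_text(clf))
```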



&lt;h2&gt;
  
  
  Random Forest
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it Does&lt;/strong&gt;&lt;br&gt;
A random forest classifier is, quite simply, an ensemble of decision trees. Usually, bootstrapping is involved: subsets of the training data are sampled with replacement, so each decision tree is trained on slightly different data. Each tree classifies a point, and the random forest takes the majority vote as its final prediction. While this is better than a single decision tree because it is less likely to overfit, it takes more memory and is more computationally expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="c1"&gt;# n_estimators is how many trees in the forest
&lt;/span&gt;&lt;span class="n"&gt;forest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;forest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
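&lt;p&gt;The overfitting trade-off can be checked directly by cross-validating a single tree against a forest on the same data (a sketch on synthetic data; exact scores will vary with the dataset):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Average accuracy over 5 folds for each model; the forest
# typically generalizes better than the single tree
print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```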



&lt;h2&gt;
  
  
  How to Evaluate a Model
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Testing Accuracy for Decision Tree Classifier: {:.4}%"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What These Mean&lt;/strong&gt;&lt;br&gt;
The code above will give you something like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mFxAUpxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nmqlvsejf8h4inw8sk0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mFxAUpxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nmqlvsejf8h4inw8sk0i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's go over how to read this. At the top is a 2x2 confusion matrix. Reading left to right, top to bottom, the numbers represent true negatives, false positives, false negatives, and true positives. The other metrics measure the following: &lt;strong&gt;accuracy&lt;/strong&gt; is the proportion of all classifications that are correct, &lt;strong&gt;precision&lt;/strong&gt; is the proportion of predicted positives that are actually positive, &lt;strong&gt;recall&lt;/strong&gt; is the proportion of actual positives that were correctly classified, and &lt;strong&gt;f1-score&lt;/strong&gt; is the harmonic mean of precision and recall, balancing the two. All of these metrics can be used to judge how good your classification model is.&lt;/p&gt;
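&lt;p&gt;These definitions are easy to verify by hand on a tiny made-up example:&lt;/p&gt;

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 3 actual negatives, 5 actual positives
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 1]

# ravel() flattens the 2x2 matrix in tn, fp, fn, tp order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 2 1 1 4
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 4/5 = 0.8
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 4/5 = 0.8
```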

</description>
    </item>
    <item>
      <title>Continuous vs. Categorical: How to Treat These Variables in Multiple Linear Regression</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Fri, 06 Aug 2021 20:24:17 +0000</pubDate>
      <link>https://dev.to/hoganbyun/continuous-vs-categorical-how-to-treat-these-variables-in-multiple-linear-regression-1gh1</link>
      <guid>https://dev.to/hoganbyun/continuous-vs-categorical-how-to-treat-these-variables-in-multiple-linear-regression-1gh1</guid>
<description>&lt;p&gt;When attempting to make predictions using multiple linear regression, there are a few steps one must take before diving in; in particular, continuous and categorical variables each need to be prepped accordingly. In this blog post, I will show you some techniques to make your data valid and usable in multiple linear regression. &lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Variables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What They Are&lt;/strong&gt;&lt;br&gt;
Continuous variables measure quantities such as height or time that vary along a continuous scale and would not make sense to classify into discrete categories. One way to identify potential continuous variables is to look at a scatter plot of the data points. Usually, the data will be distributed in a cloud-like shape, unlike that of a categorical variable, which will be shown later. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X1hcabmV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9l0649e17dnrgtzh31yp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X1hcabmV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/9l0649e17dnrgtzh31yp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to prep them&lt;/strong&gt;&lt;br&gt;
Continuous variables are a lot easier to deal with than categorical variables because adjustments are not always needed (besides the initial data cleaning). However, some transformations, such as standardization and log transformation, may improve the model. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Standardization&lt;/em&gt;&lt;br&gt;
For each data point, you subtract the column mean and divide by the column standard deviation, which rescales the column to mean 0 and standard deviation 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# function to standardize values
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
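&lt;p&gt;Applied to a pandas Series, the result should have mean 0 and standard deviation 1; a quick check on some made-up heights:&lt;/p&gt;

```python
import pandas as pd

def standardize(col):
    return (col - col.mean()) / col.std()

heights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])
z = standardize(heights)
print(round(z.mean(), 10))  # 0.0
print(round(z.std(), 10))   # 1.0
```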



&lt;p&gt;&lt;em&gt;Log Transform&lt;/em&gt;&lt;br&gt;
This method takes standardization one step further. Before standardizing, you first take the log of each value, which can make skewed data more normally distributed. Afterwards, standardize each data point as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Code excerpt where 'cont_df' is assumed to be instantiated
&lt;/span&gt;&lt;span class="n"&gt;cont_log_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cont_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cont_log_std_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cont_log_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;standardize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Categorical Variables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What They Are&lt;/strong&gt;&lt;br&gt;
Categorical variables, as the name suggests, represent things that can be divided into groups or categories. For example, color or grade level could be considered categorical. One way to identify potential categorical variables is to look at the scatter plot of the data points. Usually, the data will be distributed in rod-like shapes, unlike the clouds of continuous variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BUbXmxHy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wqm7429r4bnkx40efmdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BUbXmxHy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wqm7429r4bnkx40efmdl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Prep Them&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;One-Hot Encoding&lt;/em&gt;&lt;br&gt;
This is required when feeding categorical variables into a linear regression. The idea is to create dummy variables that each represent a group. For example, if you had a variable of cities in California, you would need one dummy for each unique city in that column. Then, if a data point is associated with that city, its dummy gets a 1; if not, a 0. One thing to watch out for is the dummy variable trap: because of how dummy variables are created, any one dummy can be "predicted" by combining all the others (perfect multicollinearity), which is a problem for multiple linear regression. To combat this, drop the first dummy column, which eliminates the perfect multicollinearity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'LA'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SAC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'LA'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SAC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SD'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'LA'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# convert into series
&lt;/span&gt;&lt;span class="n"&gt;city_series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# convert into categories
&lt;/span&gt;&lt;span class="n"&gt;city_cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;city_series&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# get dummy variables
# remember to drop first column
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city_cat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop_first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
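&lt;p&gt;Running the snippet above on the hypothetical city list, &lt;code&gt;drop_first=True&lt;/code&gt; drops the first category (LA, alphabetically), leaving one indicator column per remaining city:&lt;/p&gt;

```python
import pandas as pd

city = ['LA', 'SD', 'SAC', 'SD', 'LA', 'SAC', 'SD', 'LA']
city_cat = pd.Series(city).astype('category')

# 'LA' is dropped, so a row of all zeros means LA
dummies = pd.get_dummies(city_cat, drop_first=True)
print(list(dummies.columns))  # ['SAC', 'SD']
print(dummies.shape)          # (8, 2)
```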



&lt;p&gt;&lt;em&gt;Binning&lt;/em&gt;&lt;br&gt;
Binning is a technique to cut down on the number of categorical dummy variables. Ideally, dummies should be kept to a minimum, and if possible, you should have fewer dummies than continuous variables. The idea is to create new, broader categories based on a criterion you choose. An example would be converting months into seasons (which cuts 12 dummies down to 4).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is a change unique to months and seasons. This
# method forms bins with pandas.cut(), which cuts a list
# of consecutive numbers at the given points. In order for
# winter to be represented by Dec-Feb (12, 1, 2), we need
# a way for 12 to smoothly connect to 1, that is, convert
# all 12's into 0's.
&lt;/span&gt;&lt;span class="n"&gt;bin_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bin_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# The cuts are made left-exclusive and right-inclusive.
# Eg. The first bin does not include -1, but includes 2.
&lt;/span&gt;&lt;span class="n"&gt;month_bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Apply bin labels
&lt;/span&gt;&lt;span class="n"&gt;season_bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bin_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;month_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Winter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Spring'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Summer'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Fall'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
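&lt;p&gt;As a self-contained version of the snippet above (&lt;code&gt;bin_df&lt;/code&gt; here is a made-up month column):&lt;/p&gt;

```python
import pandas as pd

bin_df = pd.DataFrame({'month': [1, 3, 6, 9, 12, 7, 11]})

# Map December (12) to 0 so winter months (12, 1, 2) share one bin
bin_df.loc[bin_df['month'] == 12, 'month'] = 0

# Left-exclusive, right-inclusive cuts: (-1, 2], (2, 5], (5, 8], (8, 11]
month_bins = [-1, 2, 5, 8, 11]
seasons = pd.cut(bin_df['month'], month_bins,
                 labels=['Winter', 'Spring', 'Summer', 'Fall'])
print(list(seasons))
```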



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To summarize: continuous variables can be standardized or log transformed. These steps may help the model but are not required; in fact, if they do not improve the fit, using them is not recommended. Categorical variables, on the other hand, require some adjustment before you can run a multiple linear regression: one-hot encoding them into dummy variables, and optionally reducing the number of dummies by binning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thank You
&lt;/h2&gt;

&lt;p&gt;Hopefully this rundown on some of the common steps to take when preparing data for multiple linear regression has been helpful. Thank you for reading!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Popular Data Science Plots and When to Use Them</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sun, 01 Aug 2021 00:14:43 +0000</pubDate>
      <link>https://dev.to/hoganbyun/popular-data-science-plots-and-when-to-use-them-1dbo</link>
      <guid>https://dev.to/hoganbyun/popular-data-science-plots-and-when-to-use-them-1dbo</guid>
<description>&lt;p&gt;When working in Data Science, being able to investigate and answer questions is only half of your responsibilities. No matter how well you can manipulate data and code difficult techniques, your findings are no good if you cannot communicate them clearly. Along the way, you will probably rely on &lt;strong&gt;matplotlib&lt;/strong&gt; for a lot of your plotting. In this blog post, I will show you some of the most common types of plots and the situations to use them in, along with some do's and don'ts that will keep your plots easy to understand. &lt;/p&gt;

&lt;h2&gt;
  
  
  Line Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
A line plot is probably the simplest plot out there. It plots points with x and y values on the chart and connects them with line segments. One situation where you might use a line plot is when visualizing time-series data, that is, displaying changes of some variable over time. Take a look at this example: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bg8j63ph--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jc3vxuvafqb9kz14uy1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bg8j63ph--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jc3vxuvafqb9kz14uy1a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can clearly see that this plot is measuring the speed (mph) of some object, say a car, over time (sec). From the information that the plot conveys, we can see that the car accelerated early and eventually started to decelerate later on. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Line Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Week"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Pounds Lost"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Client Pounds Lost During Training"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example will yield the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KMsP05Xk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/f9nmz3yviyt439ii2s8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KMsP05Xk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/f9nmz3yviyt439ii2s8e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bar Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
Bar plots, like line plots, may also be used to track changes over time. Yet, another use for bar plots is to visualize differences between groups. For example, here is a plot from my recent project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wtzxJ6GI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/go7dujgusek8f94ix9t9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wtzxJ6GI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/go7dujgusek8f94ix9t9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the x-axis represents budget tiers in increments of $1.5 million. The height of each bar is the average ROI of all movies in that budget tier. In this case, a bar plot is especially useful because it clearly shows that the $6 million budget tier yields the highest ROI, on average. Bar plots are also useful when comparing metrics across groups that aren't defined by numbers. An example would be comparing the number of award-winning movies from each movie studio. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Bar Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'ATL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'BOS'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'DAL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'MEM'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'SAC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'WAS'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;free_agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;free_agents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Team"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Free Agents Signed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Free Agents Signed in 2020"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example will yield the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---_XRU2b9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/shsolaumv9lyr93g1ly3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---_XRU2b9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/shsolaumv9lyr93g1ly3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Box (and Whisker) Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
The Box and Whisker plot is an ideal choice when you want to convey a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), maximum). Here, the median is the middle value of a sorted sample. For example, in the list [1,2,3,4,5], the median would be 3. If there is an even number of values, the median is the average of the two middle values. The first and third quartiles are the 25th and 75th percentiles, respectively. The Interquartile Range (IQR) is calculated as &lt;em&gt;Q3 - Q1&lt;/em&gt;, while the minimum and maximum whisker bounds are calculated as &lt;em&gt;Q1 - 1.5*IQR&lt;/em&gt; and &lt;em&gt;Q3 + 1.5*IQR&lt;/em&gt;, respectively; points beyond these bounds are drawn as outliers. These plots are especially useful for displaying how skewed a sample is and for highlighting outliers. Refer to the following example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eF5eUXh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/obst6cty0le3chmgd2m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eF5eUXh1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/obst6cty0le3chmgd2m9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The box encodes three values. The middle line is the median; here, it falls between 20 and 30. The right border of the box is Q3, while the left is Q1. The min and max are marked by the ends of the "whiskers" connected to the box. We also see one outlier: the 55-point game where the player shot extremely well. The code for this example is below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Box and Whisker Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vert&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Points"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Player A: Points Scored Per Game"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
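&lt;p&gt;The five-number summary behind a box plot can also be computed directly. The sketch below (not part of the original example) uses numpy on the same points data; note that numpy's default quartile interpolation may differ slightly from matplotlib's internals.&lt;/p&gt;

```python
# Compute the five-number summary that the box plot above visualizes.
import numpy as np

x = [22, 25, 15, 33, 31, 27, 18, 19, 22, 37, 55, 16, 24, 25, 26, 25]

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                     # Interquartile Range: Q3 - Q1
lower_fence = q1 - 1.5 * iqr      # points below this count as outliers
upper_fence = q3 + 1.5 * iqr      # points above this count as outliers

outliers = [v for v in x if v < lower_fence or v > upper_fence]
print(median, iqr, outliers)      # the 55-point game is the lone outlier
```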



&lt;h2&gt;
  
  
  Scatter Plot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When to Use One&lt;/em&gt;&lt;br&gt;
A scatter plot is used when you have numerical data that comes in pairs (e.g., Age vs. Running Speed). A scatter plot places each data point on an x-y plane, giving the viewer a good picture of how the data is distributed. Scatter plots are particularly useful when trying to discern whether two variables are related. Take a look at this example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BMwzsTu---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gxzfjke9tkxt13dy6mop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BMwzsTu---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gxzfjke9tkxt13dy6mop.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the scatter plot clearly shows that as age increases, max speed tends to decrease. Each point represents a different person who was timed. The code for this is shown below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Scatter Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;62&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;max_speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_speed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Max Speed (mph)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age vs. Max Speed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  BONUS: Regression Plot
&lt;/h2&gt;

&lt;p&gt;Lastly, the regression plot is an extension of the scatter plot. It takes the data points and calculates the line that best "fits" the sample, displaying a line cutting through the data that indicates the approximate slope, or "trend," of the sample. Regression lines are also associated with an r-value (between -1 and 1) that indicates how correlated two variables are: the closer the r-value is to -1 or 1, the stronger the correlation. For example,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DQ3Dvpzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uekapnzpqsq3ovwe4hhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DQ3Dvpzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uekapnzpqsq3ovwe4hhc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we used the same scatter plot as an example. You can see that there is now a line crossing through the data. This line gives us a good estimate of what speed to expect at a certain age. For example, judging from the line, we can approximate that a 40-year-old will reach a max speed of just under 15 mph. The code is written below. In this case, we used &lt;strong&gt;Seaborn&lt;/strong&gt; (a library built on top of &lt;strong&gt;matplotlib&lt;/strong&gt;) for its regression plot functionality.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to Code a Regression Plot&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;67&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;62&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;max_speed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_speed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Max Speed (mph)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age vs. Max Speed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
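&lt;p&gt;Seaborn's &lt;code&gt;regplot&lt;/code&gt; draws the fitted line but does not report the r-value itself. As a sketch (not part of the original post), you can compute r for the same age and max-speed data with numpy:&lt;/p&gt;

```python
# Pearson correlation coefficient for the age vs. max speed data above.
import numpy as np

age = [18,20,20,24,25,26,29,33,31,32,36,44,44,46,48,55,57,63,64,67,66,62]
max_speed = [19,18,22,16,19,21,17,16,19,16,14,16,13,13,11,12,9,10,8,7,7,8]

r = np.corrcoef(age, max_speed)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(round(r, 2))                      # strongly negative: speed falls with age
```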



&lt;h2&gt;
  
  
  Tips When Plotting
&lt;/h2&gt;

&lt;p&gt;Now that we have covered some commonly used plots in data science, let's go over a few tips to keep in mind. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid making your visualizations too "busy"; instead, highlight exactly the information you want the audience to see. In the case below, I've colored bars green, blue, or red depending on what information I want to convey, as opposed to showing the graph with every bar the same color.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kkJOacNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3obtmffw137lf2s5i5pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kkJOacNk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3obtmffw137lf2s5i5pm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid pie charts, as they are often hard to read when the slices are close in size. In these cases, bar charts are preferable. Below, the same data is shown in pie and bar format; note how much easier it is to tell which categories are larger and smaller in the bar graph.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dsYWQ0Jo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m08pjzrk7jms8tthz7or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dsYWQ0Jo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m08pjzrk7jms8tthz7or.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rupisG8W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ririrk7yoiayv5k6j84q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rupisG8W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ririrk7yoiayv5k6j84q.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure that your graphs are scaled properly, avoiding unnecessary white space where possible. Take a look at the two examples below and the difference that proper axis scaling makes. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BlkluPCR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x937hg03qe234pld40ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BlkluPCR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x937hg03qe234pld40ih.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x-PzeEiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/exoysh0yu3fejgao34ue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x-PzeEiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/exoysh0yu3fejgao34ue.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Now that you have had a rundown of some of the most commonly used plots in data science, along with some tips to make your graphs more digestible, you are ready to go out and turn your data into effective charts that show your findings!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Commands in SQL</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sun, 25 Jul 2021 02:03:53 +0000</pubDate>
      <link>https://dev.to/hoganbyun/commands-in-sql-570k</link>
      <guid>https://dev.to/hoganbyun/commands-in-sql-570k</guid>
      <description>&lt;p&gt;If you are planning to one day work with and manage data, chances are, you will eventually have to work with SQL (Structured Query Language). SQL is a language used to communicate with databases, most often used to update data or retrieve specific parts or groups in the data. If any task involves manipulating or creating a database, SQL will work well.&lt;/p&gt;

&lt;p&gt;In a previous post, I went over common SQL clauses, most specifically, those used to pull data from databases. In this post, I will go over some commands that are more relevant when creating or modifying a database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Commands
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CREATE DATABASE&lt;/strong&gt; - makes a new database&lt;br&gt;
&lt;strong&gt;CREATE TABLE&lt;/strong&gt; - makes a new table; databases are, in essence, collections of tables&lt;br&gt;
&lt;strong&gt;UPDATE&lt;/strong&gt; - modifies existing data&lt;br&gt;
&lt;strong&gt;INSERT INTO&lt;/strong&gt; - inserts new data&lt;br&gt;
&lt;strong&gt;DELETE&lt;/strong&gt; - deletes selected rows&lt;br&gt;
&lt;strong&gt;DROP&lt;/strong&gt; - deletes entire tables or databases&lt;br&gt;
&lt;strong&gt;SELECT&lt;/strong&gt; - indicates what you want to pull from the data (also covered in a previous post)&lt;/p&gt;

&lt;p&gt;Here are some important distinctions to note. Databases and tables are both collections of data, but a database is a collection of tables. DROP and DELETE differ in that DELETE removes specific rows, while DROP removes whole tables or databases. SELECT lets you pull specific features of the data but, unlike the other commands listed, does not modify it.&lt;/p&gt;
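&lt;p&gt;If you want to try these commands end to end, below is a quick sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module; the table and column names are invented for illustration.&lt;/p&gt;

```python
# Walk through CREATE TABLE, INSERT INTO, UPDATE, DELETE, and DROP
# on a throwaway in-memory SQLite database (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, orders INTEGER)")
cur.execute("INSERT INTO customers (name, orders) VALUES ('Ann', 3), ('Ben', 7)")
cur.execute("UPDATE customers SET orders = 8 WHERE name = 'Ben'")   # modify a row
cur.execute("DELETE FROM customers WHERE name = 'Ann'")             # remove one row

rows = cur.execute("SELECT name, orders FROM customers").fetchall()
print(rows)                           # [('Ben', 8)]

cur.execute("DROP TABLE customers")   # remove the whole table
conn.close()
```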

</description>
    </item>
    <item>
      <title>Convolutional Neural Networks for Image Classification</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sat, 17 Jul 2021 22:36:27 +0000</pubDate>
      <link>https://dev.to/hoganbyun/convolutional-neural-networks-for-image-classification-48hd</link>
      <guid>https://dev.to/hoganbyun/convolutional-neural-networks-for-image-classification-48hd</guid>
<description>&lt;p&gt;Convolutional Neural Networks (CNNs) are a type of neural network commonly used for image classification tasks because, unlike fully connected neural networks, they can efficiently handle the large number of pixels found in images. &lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes a Neural Network Convolutional?
&lt;/h2&gt;

&lt;p&gt;Convolutional networks are much better at identifying spatial patterns. The key difference is the convolution operation, which slides a filter (usually 3x3 or 5x5) across the original image, multiplying the filter against each 3x3 (or 5x5) block of pixels and summing the result. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--51BbFGJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l6nn3zghghrapgkvc5xv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--51BbFGJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l6nn3zghghrapgkvc5xv.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image, we see a filter is applied to each possible block to create a new matrix. &lt;/p&gt;
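&lt;p&gt;The sliding-window arithmetic can be sketched in a few lines of numpy. This toy example is not from the original post, and real CNN filters are learned during training rather than fixed:&lt;/p&gt;

```python
# Slide a 3x3 filter over a 5x5 image: multiply elementwise and sum at each stop.
import numpy as np

image = np.arange(25).reshape(5, 5)   # toy 5x5 "image"
kernel = np.ones((3, 3))              # toy 3x3 filter

out = np.zeros((3, 3))                # output side: 5 - 3 + 1 = 3
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out.shape)                      # (3, 3) -- smaller than the input
```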

&lt;h3&gt;
  
  
  Using Padding to Maintain Dimensions
&lt;/h3&gt;

&lt;p&gt;One thing to note is that applying the filter produces a smaller image. In the above example, you can see that a 5x5 image with a 3x3 filter results in a 3x3 output. In addition, pixels along the edges of the original image are covered by the filter fewer times than those close to the center. &lt;/p&gt;

&lt;p&gt;One way to solve both of these issues is padding: adding a border of extra, zero-valued pixels around the edges of the image. This returns an output of the same size as the original image and allows the edge pixels to be covered by the filter more often. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FBlyt1eK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctepmcxqowgkskankjqe.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FBlyt1eK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ctepmcxqowgkskankjqe.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stride
&lt;/h3&gt;

&lt;p&gt;Striding also affects the output image. The stride controls how many pixels the filter moves at each step, which affects the output size (a larger stride yields a smaller output). For example, both examples above use a stride of 1, while the example below uses a stride of 2.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mCKPCwT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4m9q0jcudfm0esqgqkth.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mCKPCwT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4m9q0jcudfm0esqgqkth.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pooling layer
&lt;/h3&gt;

&lt;p&gt;Everything above happens before the fully connected layers. The pooling layer comes last, downsizing the convolutional layers' output while "pooling" together the patterns those layers found. Max pooling, which keeps the largest value in each window, is the most common choice. &lt;/p&gt;
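&lt;p&gt;As a sketch (not from the original post), 2x2 max pooling keeps only the largest value in each 2x2 block of a feature map, halving each dimension:&lt;/p&gt;

```python
# Reduce each 2x2 block of a 4x4 feature map to its maximum value.
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 8, 5],
                        [1, 1, 3, 4]])

# Reshape into 2x2 blocks, then take the max within each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled.shape)                   # (2, 2) -- half of each dimension
```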

&lt;h2&gt;
  
  
  When to Bring in Fully Connected Layers
&lt;/h2&gt;

&lt;p&gt;After pooling, the remaining steps are essentially the same as building a regular fully connected neural network. Think of the convolutional layers as preliminary prep before the fully connected layers are applied.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Common SQL Clauses</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sat, 10 Jul 2021 21:06:59 +0000</pubDate>
      <link>https://dev.to/hoganbyun/common-sql-clauses-5a3e</link>
      <guid>https://dev.to/hoganbyun/common-sql-clauses-5a3e</guid>
<description>&lt;p&gt;When attempting to access and manipulate a database, SQL (Structured Query Language) is one of the top language options. SQL code revolves around queries: requests or commands directed at some part of the database. Once you get the hang of the query syntax, it can be very intuitive to use. Here are some of the more common SQL clauses. &lt;/p&gt;

&lt;h3&gt;
  
  
  SELECT and FROM
&lt;/h3&gt;

&lt;p&gt;Simply put, SELECT indicates what exactly you want to pull from the database. FROM points to the specific database that you are pulling from. For example, if you wanted to retrieve a customer ID from the "customers" table, it may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something to keep in mind is the asterisk (*). This character represents all columns in the data. In this case, if you wanted everything, and not just the customer ID, you would replace "customer_id" with "*" in the above example.&lt;/p&gt;

&lt;h3&gt;
  
  
  WHERE
&lt;/h3&gt;

&lt;p&gt;WHERE is a clause that lets the user filter the data on a specific condition. One thing to note is that WHERE can only be used on ungrouped, unaggregated data. In the next example, let's say we want the customer IDs of all customers who ordered more than 30 units.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_ID&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Order_Quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GROUP BY
&lt;/h3&gt;

&lt;p&gt;GROUP BY does exactly what its name indicates: it groups rows that share a value in a certain column. It is almost always paired with an aggregate function such as COUNT or SUM. For example, a user who wants to know how many customers live in each state would use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  HAVING
&lt;/h3&gt;

&lt;p&gt;HAVING works similarly to WHERE, the difference being that HAVING is used on aggregated data (most commonly after a GROUP BY). Below is an example of when HAVING would be used. Here, the user is searching for all states with more than 250 total orders. To do this, we GROUP BY State and sum the orders across every customer in that state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;State&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
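&lt;p&gt;Here is a runnable sketch of that HAVING query using Python's built-in sqlite3 module (the table contents are invented):&lt;/p&gt;

```python
import sqlite3

# Invented Customer rows: one (State, Orders) pair per customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (State TEXT, Orders INTEGER)")
conn.executemany("INSERT INTO Customer VALUES (?, ?)",
                 [("NY", 200), ("NY", 120), ("CA", 90), ("CA", 100)])

# States whose customers placed more than 250 orders in total.
rows = conn.execute("""
    SELECT State, SUM(Orders) AS total
    FROM Customer
    GROUP BY State
    HAVING total > 250
""").fetchall()
print(rows)  # [('NY', 320)]
```

&lt;p&gt;SQLite accepts the alias total inside HAVING; some other databases make you repeat SUM(Orders) there instead.&lt;/p&gt;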



&lt;h3&gt;
  
  
  ORDER BY
&lt;/h3&gt;

&lt;p&gt;ORDER BY sorts the data on a given column. One thing to note is that the default sorting is in ascending order, but the optional ASC/DESC keywords let the user control the direction. Here's an example of getting customer IDs after sorting customer last names in descending order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Customer&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above clauses are some of the most commonly used ones and are crucial to know when using SQL. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Semi-Supervised Learning</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Sat, 03 Jul 2021 17:11:27 +0000</pubDate>
      <link>https://dev.to/hoganbyun/semi-supervised-learning-2k5e</link>
      <guid>https://dev.to/hoganbyun/semi-supervised-learning-2k5e</guid>
      <description>&lt;p&gt;In machine learning, you will sometimes run into situations where semi-supervised learning is necessary. For example, say you want to use supervised learning to run a classification model on your data, but you have no labels. For the model to be built, you need labeled data to properly train it. Semi-supervised learning provides a pathway to do that, even when you start out with unlabeled data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semi-Supervised Learning (vs. Supervised/Unsupervised)
&lt;/h2&gt;

&lt;p&gt;The main difference between supervised and unsupervised learning is whether we know the output labels. Supervised learning, as the name suggests, needs labels in order to train a model, whereas unsupervised learning does not. For example, a classification model requires supervised learning because, before training, each data point must indicate which class it belongs to. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--imSkK6ii--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6uxjex42pwv5lp78fv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--imSkK6ii--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6uxjex42pwv5lp78fv8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Here, we need each data point to tell us whether it belongs to red (disease) or blue (healthy) in order for us to accurately produce a model to separate the two groups. &lt;/p&gt;

&lt;p&gt;For unsupervised learning, we can use the example of clustering.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kWFLdoVs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lo5m13zsowi8c7uwyprd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kWFLdoVs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lo5m13zsowi8c7uwyprd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike the previous example, we don't know which group each data point belongs to before we run the model. For example, in k-means clustering, the model will identify centroids (centers for each group) and assign other data points to the closest centroid. &lt;/p&gt;

&lt;p&gt;Semi-supervised learning can entail different methods. For example, I recently created a model to classify NBA players into specific playstyles. Obviously, a player's playstyle isn't objective; determining it is a matter of personal opinion, as is deciding how many playstyles even exist. &lt;/p&gt;

&lt;p&gt;One method that I used was to predetermine which playstyles "existed" in this model and, for a small subset of the players, I would determine a playstyle for them. What this gave me was labeled data that I could use in supervised learning. Once that model was created, I ran the rest of the unlabeled data through it to give me playstyle predictions for everyone else. Then, I used supervised learning again on the fully-labeled data. &lt;/p&gt;
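&lt;p&gt;That first workflow (hand-label a few points, train, predict the rest, retrain) can be sketched with a toy nearest-mean classifier; the numbers and playstyle labels below are invented:&lt;/p&gt;

```python
# Toy self-training: a 1-D nearest-mean classifier (invented data and labels).
labeled = {1.0: "guard", 2.0: "guard", 9.0: "center", 10.0: "center"}
unlabeled = [1.5, 8.5, 9.5]

def class_means(points_by_label):
    """Mean position of each class: the 'training' step of this toy model."""
    means = {}
    for label in set(points_by_label.values()):
        pts = [x for x, lab in points_by_label.items() if lab == label]
        means[label] = sum(pts) / len(pts)
    return means

def predict(x, means):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

# Step 1: "train" on the hand-labeled subset.
means = class_means(labeled)
# Step 2: predict labels for everyone else.
pseudo = {x: predict(x, means) for x in unlabeled}
# Step 3: retrain on the now fully-labeled data.
means = class_means({**labeled, **pseudo})
print(pseudo)  # {1.5: 'guard', 8.5: 'center', 9.5: 'center'}
```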

&lt;p&gt;Another method, which took less manual work, was to first use unsupervised learning to separate players into clusters. Then, using the average stats of each cluster, I could get a sense of what type of playstyle each cluster represented. Using the labels from the clusters, I was able to use supervised learning on the newly-labeled data.&lt;/p&gt;
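&lt;p&gt;The second, cluster-then-label workflow might look like this minimal 1-D k-means sketch (invented per-game scoring numbers; the playstyle names are just placeholders):&lt;/p&gt;

```python
# Toy cluster-then-label: 1-D k-means with k=2 (invented points-per-game stats).
data = [5.0, 6.0, 7.0, 24.0, 25.0, 26.0]
centroids = [5.0, 24.0]  # deterministic initialisation for the sketch

for _ in range(10):
    clusters = [[], []]
    for x in data:
        # Assign each point to its nearest centroid...
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # ...then move each centroid to the mean of its assigned points.
    centroids = [sum(c) / len(c) for c in clusters]

# Inspect each cluster's average to decide what label it deserves.
labels = ["scorer" if c > 15 else "role player" for c in centroids]
print(centroids, labels)  # [6.0, 25.0] ['role player', 'scorer']
```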

&lt;p&gt;So when you are faced with unlabeled data, but want to do a supervised learning task, don't be discouraged as there are available methods to work around it, such as semi-supervised learning.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SQL: WHERE vs. HAVING</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Thu, 24 Jun 2021 23:15:13 +0000</pubDate>
      <link>https://dev.to/hoganbyun/sql-where-vs-having-4ohp</link>
      <guid>https://dev.to/hoganbyun/sql-where-vs-having-4ohp</guid>
      <description>&lt;p&gt;In SQL, the &lt;strong&gt;HAVING&lt;/strong&gt; and &lt;strong&gt;WHERE&lt;/strong&gt; clauses have a similar function with a key difference. Both clauses allow a user to filter data with respect to a certain condition. The difference between the two has to do with when each is used. Basically, &lt;strong&gt;WHERE&lt;/strong&gt; can only be used on non-aggregated data (i.e. data that has not been aggregated by &lt;strong&gt;GROUP BY&lt;/strong&gt;). On the other hand, &lt;strong&gt;HAVING&lt;/strong&gt; is used after a &lt;strong&gt;GROUP BY&lt;/strong&gt;. Let's walk through some examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  WHERE
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, &lt;strong&gt;WHERE&lt;/strong&gt; is used on data before any aggregation is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;Employee_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Employee_ID&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Employee&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Employee_Age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above selects the names and ID's of all employees over the age of 30. Note that the &lt;strong&gt;WHERE&lt;/strong&gt; clause is used on non-aggregated data.&lt;/p&gt;

&lt;h2&gt;
  
  
  HAVING (and GROUP BY)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HAVING&lt;/strong&gt; is often used following a &lt;strong&gt;GROUP BY&lt;/strong&gt;, which aggregates data by a certain feature. In the following example, let's say you have game-by-game data for each player for the first five games of the season and, as a coach, you want to see which players on your team made more than 10 3-pointers over that stretch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;Player_Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;PM_Made&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;Player_Name&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we want each player's name and total number of 3-pointers made. The &lt;strong&gt;GROUP BY&lt;/strong&gt; clause is essential here because if we had used a simple &lt;strong&gt;WHERE&lt;/strong&gt; with no &lt;strong&gt;GROUP BY&lt;/strong&gt; or &lt;strong&gt;HAVING&lt;/strong&gt;, we would get 5 separate numbers for each player, representing how many 3-pointers that player made in each game. We want the total each player made, hence the &lt;strong&gt;GROUP BY&lt;/strong&gt;. The &lt;strong&gt;HAVING&lt;/strong&gt; then acts as a post-aggregation filter to get the desired range of total 3-pointers.&lt;/p&gt;
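&lt;p&gt;To see the difference concretely, here is a runnable version of the idea using Python's built-in sqlite3 module; the schema and stat lines are invented (the column is named Threes_Made here, since an identifier can't start with a digit without quoting):&lt;/p&gt;

```python
import sqlite3

# Invented game-by-game log: one row per player per game.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Team (Player_Name TEXT, Threes_Made INTEGER)")
conn.executemany("INSERT INTO Team VALUES (?, ?)", [
    ("Kim", 4), ("Kim", 3), ("Kim", 5),
    ("Lee", 1), ("Lee", 2), ("Lee", 0),
])

# WHERE here would filter single games; HAVING filters each player's total.
rows = conn.execute("""
    SELECT Player_Name, SUM(Threes_Made) AS total
    FROM Team
    GROUP BY Player_Name
    HAVING total > 10
""").fetchall()
print(rows)  # [('Kim', 12)]
```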

&lt;p&gt;While there can always be exceptions, a good rule of thumb is to treat &lt;strong&gt;WHERE&lt;/strong&gt; as the clause used on non-aggregated data, while treating &lt;strong&gt;HAVING&lt;/strong&gt; as the clause used on aggregated data (often in conjunction with &lt;strong&gt;GROUP BY&lt;/strong&gt;).&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Supervised Vs. Unsupervised Learning</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Tue, 15 Jun 2021 01:25:06 +0000</pubDate>
      <link>https://dev.to/hoganbyun/supervised-vs-unsupervised-learning-5efl</link>
      <guid>https://dev.to/hoganbyun/supervised-vs-unsupervised-learning-5efl</guid>
      <description>&lt;p&gt;When creating machine learning models, there are typically two paths to choose from: &lt;strong&gt;supervised&lt;/strong&gt; and &lt;strong&gt;unsupervised learning&lt;/strong&gt;. Simply, the difference between these two methods is whether we know the output labels. &lt;/p&gt;

&lt;p&gt;For example, let's say that we want to build a model that can identify pneumonia from chest x-rays. In this case, for each photo we feed into the model, we know beforehand whether the x-ray is of a pneumonia-positive person. Because we know the output labels of each input beforehand, we would use &lt;strong&gt;supervised learning&lt;/strong&gt;, which aims to measure a relationship between inputs and known outputs.&lt;/p&gt;

&lt;p&gt;Now, in a different example, let's say that we have data (average speed, total accidents, total tickets, etc.) on many drivers and we want to put these drivers into groups where they are most similar to each other. Here, we don't have initial output labels (e.g. good driver, bad driver) and have to infer what kinds of groups exist after they are made. In this case, we would use &lt;strong&gt;unsupervised learning&lt;/strong&gt;.&lt;br&gt;
Let's go into more detail for each approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supervised Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt; uses labeled data to train a model to classify inputs or predict outcomes more accurately. Because we are feeding labeled data into the model, we are able to test and improve the model validity by verifying how accurate a model is over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt; is usually applied to two types of problems: &lt;strong&gt;classification&lt;/strong&gt; and &lt;strong&gt;regression&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NngRF5bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyd2ukwmr2sl9nbs3wct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NngRF5bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyd2ukwmr2sl9nbs3wct.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Classification
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;classification&lt;/strong&gt; problem entails separating the data into pre-determined groups. For example, one could classify whether an animal is a cat or a dog based on size, weight, etc. Some common classification algorithms include support vector machines, random forest, and gradient boost.&lt;/p&gt;

&lt;h4&gt;
  
  
  Regression
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;regression&lt;/strong&gt; problem aims to identify the relationship between independent and dependent variables. For example, a project that predicts a store's ice cream sales from the number of flavors, hours open, etc. would use regression, which could be linear, nonlinear, or logistic, to name a few.&lt;/p&gt;
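&lt;p&gt;As a minimal illustration (with made-up numbers), a one-feature linear regression can even be fit in closed form:&lt;/p&gt;

```python
# Closed-form least squares for y = a*x + b (made-up data: x = number of
# flavors on offer, y = ice cream sales that day).
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # exactly y = 2x + 1, so the fit should recover a=2, b=1

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
# Slope: covariance of x and y divided by variance of x.
a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
b = mean_y - a * mean_x
print(a, b)  # 2.0 1.0
```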

&lt;h2&gt;
  
  
  Unsupervised Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Unsupervised learning&lt;/strong&gt; finds groups or patterns using unlabeled data. Because there aren't any labels, there is not a specific way to verify model validity like with &lt;strong&gt;supervised learning&lt;/strong&gt;. Common problems include &lt;strong&gt;clustering&lt;/strong&gt; and &lt;strong&gt;dimensionality reduction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6D96-drN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0cnrksbzm5nyzkdif356.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6D96-drN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0cnrksbzm5nyzkdif356.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Clustering
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;clustering&lt;/strong&gt; problem aims to separate the data into distinct groups by identifying patterns or similarities between data points. For example, an online retail store may want to separate its customers into different demographics. A common clustering algorithm is &lt;strong&gt;k-means clustering&lt;/strong&gt;, which splits the data into &lt;em&gt;k&lt;/em&gt; groups by assigning each point to its nearest cluster centroid and recomputing each centroid as the mean of its assigned points. &lt;/p&gt;
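&lt;p&gt;A bare-bones version of the k-means loop, on invented two-feature customer data (age, spend) with a fixed starting centroid pair, might look like:&lt;/p&gt;

```python
# Minimal k-means on 2-D points (hypothetical customer features: age, spend).
points = [(25, 10), (27, 12), (60, 80), (62, 85)]
centroids = [(25, 10), (60, 80)]  # fixed start so the sketch is deterministic

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

for _ in range(5):
    groups = [[], []]
    for p in points:
        # Assign each point to its nearest centroid...
        groups[min((0, 1), key=lambda i: dist2(p, centroids[i]))].append(p)
    # ...then recompute each centroid as the mean of its group.
    centroids = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]

print(centroids)  # [(26.0, 11.0), (61.0, 82.5)]
```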

&lt;h4&gt;
  
  
  Dimensionality Reduction
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Dimensionality reduction&lt;/strong&gt; is used when there are too many features in a given dataset. It will reduce the number of features in a dataset while keeping its integrity and is done before the modeling stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which One to Choose?
&lt;/h2&gt;

&lt;p&gt;Deciding whether to use supervised or unsupervised learning comes down to a few factors: whether your data is labeled, and what kind of modeling you are trying to accomplish.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tuning Neural Networks</title>
      <dc:creator>hoganbyun</dc:creator>
      <pubDate>Tue, 08 Jun 2021 23:47:27 +0000</pubDate>
      <link>https://dev.to/hoganbyun/tuning-neural-networks-33l</link>
      <guid>https://dev.to/hoganbyun/tuning-neural-networks-33l</guid>
      <description>&lt;p&gt;When modeling a neural network, you most likely won’t get satisfactory results immediately. Whether the model is underfitting or overfitting, there are always small tuning changes that can improve upon the initial model. For overfit models, the main techniques are regularization, normalization, and optimization. &lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with Overfitting
&lt;/h3&gt;

&lt;p&gt;Here is an example of what an overfit model would look like:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ncOpG-i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqel864mo44hlw0h87j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ncOpG-i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqel864mo44hlw0h87j0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that as the training accuracy increases, at a certain point the validation accuracy stagnates. This means that the model is getting so good at recognizing purely the training data that it fails to recognize general patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regularization&lt;/strong&gt;&lt;br&gt;
Regularization is often used when the initial model is overfit. In general, you have three types to choose from: l1, l2, and dropout.&lt;/p&gt;

&lt;p&gt;L1 and l2 regularization penalize weight matrices that grow too large by adding a penalty term to the loss used during the back propagation phase. An example of it being used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'relu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;kernel_regularizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;regularizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;l2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
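&lt;p&gt;Under the hood, the l2 term simply adds the squared weights, scaled by the chosen factor, onto the loss (l1 uses absolute values instead). A plain-Python illustration with made-up weights:&lt;/p&gt;

```python
# What regularizers.l2(0.005) contributes to the loss: 0.005 * sum of squared
# weights. The weight values here are made up, just to show the arithmetic.
weights = [0.5, -1.0, 2.0]
factor = 0.005
l2_penalty = factor * sum(w ** 2 for w in weights)       # 0.005 * 5.25
l1_penalty = factor * sum(abs(w) for w in weights)       # the l1 variant
print(l2_penalty, l1_penalty)  # 0.02625 0.0175
```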



&lt;p&gt;Dropout, on the other hand, randomly sets nodes in the network to 0 at a given rate. This is also an effective countermeasure to overfitting. The number within the Dropout function represents the rate at which nodes will be dropped. An example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
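&lt;p&gt;Conceptually, what that layer does at training time can be sketched in plain Python (made-up activation values; real frameworks also rescale the surviving activations so the expected sum is unchanged):&lt;/p&gt;

```python
import random

# What Dropout(.2) does conceptually at training time: each activation is
# zeroed with probability 0.2.
random.seed(1)  # seeded so the sketch is reproducible
activations = [0.3, 0.7, 1.2, 0.9, 0.5]
rate = 0.2
dropped = [a if random.random() >= rate else 0.0 for a in activations]
print(dropped)
```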



&lt;p&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;br&gt;
Another countermeasure to overfitting is to normalize the input data. The easiest approach is to scale the data to lie between 0 and 1, which can cut down training time and stabilize convergence. You can also normalize within layers, such as with a random normal weight initializer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'relu'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="n"&gt;kernel_initializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;initializers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RandomNormal&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
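&lt;p&gt;The 0-to-1 scaling mentioned above is just min-max normalization, which can be sketched as (toy values):&lt;/p&gt;

```python
# Min-max scaling: squeeze raw inputs into [0, 1] before training.
raw = [10.0, 20.0, 50.0, 30.0]  # made-up feature values
lo, hi = min(raw), max(raw)
scaled = [(x - lo) / (hi - lo) for x in raw]
print(scaled)  # [0.0, 0.25, 1.0, 0.5]
```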



&lt;p&gt;&lt;strong&gt;Optimization&lt;/strong&gt;&lt;br&gt;
Lastly, you could try different optimization functions. The three most used are probably Adam, SGD, and RMSprop. &lt;br&gt;
Adam (“Adaptive Moment Estimation”) is a popular default and tends to work well out of the box. &lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with Underfitting
&lt;/h3&gt;

&lt;p&gt;Underfit models would look like the opposite of the above graph, where training accuracy/loss fails to improve. There are a few ways to deal with this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Complexity&lt;/strong&gt;&lt;br&gt;
A likely reason that a model is underfit is that it is not complex enough; that is, it isn't able to identify abstract patterns. A way to fix this is to add complexity to the model by 1) adding more layers or 2) increasing the number of neurons per layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Time&lt;/strong&gt;&lt;br&gt;
Another reason that a model may be underfit is the training time. By giving a model more time and iterations to train, you give it more chances to converge to a more ideal solution. &lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;To summarize, overfit models require regularization, normalization, and optimization while underfit models require more complexity and training time. Neural networks are all about making small, incremental changes until you reach a good balance. These tips will ensure that you are moving in the correct direction when you inevitably find the need to tune a model.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
