<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nicolas Vallée</title>
    <description>The latest articles on DEV Community by Nicolas Vallée (@nicolasvallee).</description>
    <link>https://dev.to/nicolasvallee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F837908%2Fe2533b54-329c-4f73-9e69-931dcf6d42cd.jpeg</url>
      <title>DEV Community: Nicolas Vallée</title>
      <link>https://dev.to/nicolasvallee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nicolasvallee"/>
    <language>en</language>
    <item>
      <title>Using Transfer Learning and TensorFlow to Identify Dog Breeds from Images</title>
      <dc:creator>Nicolas Vallée</dc:creator>
      <pubDate>Wed, 30 Mar 2022 09:27:34 +0000</pubDate>
      <link>https://dev.to/nicolasvallee/using-transfer-learning-and-tensorflow-to-identify-dog-breeds-from-images-5b4b</link>
      <guid>https://dev.to/nicolasvallee/using-transfer-learning-and-tensorflow-to-identify-dog-breeds-from-images-5b4b</guid>
      <description>&lt;p&gt;&lt;strong&gt;What are we building here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this project, we're using Machine Learning to identify different breeds of dogs from images.&lt;/p&gt;

&lt;p&gt;To do this, we'll use data from the &lt;a href="https://www.kaggle.com/c/dog-breed-identification/overview"&gt;Kaggle dog breed identification competition&lt;/a&gt;. The dataset consists of 10,000+ labeled images of 120 different dog breeds.&lt;/p&gt;

&lt;p&gt;This type of problem is called &lt;strong&gt;multi-class image classification&lt;/strong&gt;. Multi-class because we're trying to classify multiple dog breeds. If, on the other hand, we wanted to classify &lt;em&gt;dogs&lt;/em&gt; versus &lt;em&gt;cats&lt;/em&gt;, it would be called &lt;strong&gt;binary classification&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I completed this project in March 2022 as part of the &lt;a href="https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/"&gt;Complete Machine Learning &amp;amp; Data Science Bootcamp&lt;/a&gt;, taught by Andrei Neagoie and Daniel Bourke. If you're looking for a beginner-friendly course teaching Data Science and Machine Learning from scratch, I highly recommend checking this one out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Follow this link: &lt;a href="https://github.com/nicolas-vallee/ztm-ml-projects/blob/main/dog_vision_project.ipynb"&gt;Dog Vision Project&lt;/a&gt; to see my completed notebook, which you can also open in Colab.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this an interesting topic?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-class image classification is both a common and an important problem in Machine Learning. It's the same kind of technology that Tesla uses for its self-driving cars, or that Airbnb uses to automatically add information to its listings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are we going about this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step in a deep learning problem is to get the data ready by turning it into numbers.&lt;/p&gt;

&lt;p&gt;We will go through the following workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get the data ready (download from Kaggle, store, import).&lt;/li&gt;
&lt;li&gt;Prepare the data (preprocessing, the 3 sets, X &amp;amp; y).&lt;/li&gt;
&lt;li&gt;Choose and fit a model (&lt;a href="https://www.tensorflow.org/hub"&gt;TensorFlow Hub&lt;/a&gt;, &lt;code&gt;tf.keras.applications&lt;/code&gt;, &lt;a href="https://www.tensorflow.org/tensorboard"&gt;TensorBoard&lt;/a&gt;, &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping"&gt;EarlyStopping&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Evaluate the model (making predictions and comparing them with the ground truth labels).&lt;/li&gt;
&lt;li&gt;Improve the model through experimentation (starting with 1000 images, making sure it works, then increasing the number of images).&lt;/li&gt;
&lt;li&gt;Save, share, and reload the model (once we're happy with the results).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the preprocessing of our data, we're going to use TensorFlow 2.x. We will turn our data into Tensors (arrays of numbers which can be run on GPUs) and then allow a machine learning model to find patterns in them.&lt;/p&gt;

&lt;p&gt;Our machine learning model will be a pretrained deep learning model from TensorFlow Hub.&lt;/p&gt;

&lt;p&gt;The process of using a pretrained model and adapting it to a specific problem is called &lt;strong&gt;transfer learning&lt;/strong&gt;. Rather than training our own model from scratch, which could be time consuming and expensive, we will leverage the patterns of another model which has already been trained to classify images.&lt;/p&gt;
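
&lt;p&gt;To make this more concrete, here's a minimal sketch of what transfer learning looks like in TensorFlow code. This is illustrative only: the TensorFlow Hub URL below points to an example feature extractor (MobileNetV2), not necessarily the model we'll end up choosing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal transfer learning sketch (illustrative, not our final model)
import tensorflow as tf
import tensorflow_hub as hub

# Example pretrained feature extractor from TensorFlow Hub (MobileNetV2)
MODEL_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/feature_vector/4"

model = tf.keras.Sequential([
  hub.KerasLayer(MODEL_URL, input_shape=(224, 224, 3)),  # pretrained patterns
  tf.keras.layers.Dense(120, activation="softmax")  # one output per dog breed
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;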

&lt;h2&gt;
  Getting our workspace ready
&lt;/h2&gt;

&lt;p&gt;Before we get started, we need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import TensorFlow 2.x&lt;/li&gt;
&lt;li&gt;Import TensorFlow Hub&lt;/li&gt;
&lt;li&gt;Make sure we're using a GPU
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_hub&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"TF version:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hub version:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check for GPU
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GPU"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"is available, we're good to go!"&lt;/span&gt; 
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_physical_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GPU"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
      &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s"&gt;"is not available. Change runtime type to GPU
 before proceeding."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TF version: 2.8.0
Hub version: 0.12.0
GPU is available, we're good to go!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What is a GPU and why do we need one? &lt;/p&gt;

&lt;p&gt;A GPU (graphics processing unit) is a computer chip that is much faster than a CPU at the kind of parallel numerical computations deep learning requires.&lt;/p&gt;

&lt;p&gt;By default, Colab runs on a virtual machine on Google's servers which doesn't have a GPU attached to it.&lt;/p&gt;

&lt;p&gt;We can fix this by changing the runtime type:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Runtime.&lt;/li&gt;
&lt;li&gt;Click "Change runtime type".&lt;/li&gt;
&lt;li&gt;Where it says "Hardware accelerator", choose "GPU".&lt;/li&gt;
&lt;li&gt;Click save.&lt;/li&gt;
&lt;li&gt;The runtime will be restarted to activate the new hardware, so we'll have to rerun the above cells.&lt;/li&gt;
&lt;li&gt;If the steps have worked, we should see a print out saying "GPU is available".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To see how much a GPU speeds up computing, Google Colab has &lt;a href="https://colab.research.google.com/notebooks/gpu.ipynb"&gt;a demonstration notebook available&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  Getting our data ready
&lt;/h2&gt;

&lt;p&gt;Getting our data ready to be used with a machine learning model is an important step.&lt;/p&gt;

&lt;p&gt;There are a few ways to do this. Many of them are detailed in the &lt;a href="https://colab.research.google.com/notebooks/io.ipynb"&gt;Google Colab notebook on I/O (input and output)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since the data we're using is hosted on Kaggle, we could use the &lt;a href="https://www.kaggle.com/docs/api"&gt;Kaggle API&lt;/a&gt;.&lt;/p&gt;
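
&lt;p&gt;For example, with a Kaggle API token (&lt;code&gt;kaggle.json&lt;/code&gt;) uploaded to the runtime, the download could look roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: download the competition data with the Kaggle API
# (assumes kaggle.json has already been uploaded to the runtime)
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c dog-breed-identification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;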

&lt;p&gt;Another method is to upload the data to our Google Drive, mount the drive in this notebook, and import the files.&lt;/p&gt;

&lt;h3&gt;
  Mounting Google Drive
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Running this cell will provide a token to link our drive to
# this notebook
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.colab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;
&lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/content/drive'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now see a "drive" folder available under the Files tab.&lt;/p&gt;

&lt;p&gt;This means we'll be able to access files in our Google Drive in this notebook.&lt;/p&gt;

&lt;p&gt;For this project, I've downloaded the &lt;a href="https://www.kaggle.com/c/dog-breed-identification/data"&gt;data from Kaggle&lt;/a&gt; and uploaded it to my Google Drive as a .zip file under the folder "ML/Dog Vision".&lt;/p&gt;

&lt;p&gt;To access it, we need to unzip it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Running the cell below for the first time could take a while (a couple of minutes is normal). Once we've run it and the unzipped data is in our Google Drive, we don't need to run it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use the '-d' parameter as the destination for where the
# files should go
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;unzip&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/dog-breed-identification.zip"&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
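
&lt;p&gt;To make the cell above safe to re-run, we could guard it with a quick existence check (a sketch; adjust the path to match your own Drive layout):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: only unzip if the training folder isn't already there
import os

if not os.path.exists("drive/MyDrive/ML/Dog Vision/train/"):
  !unzip "drive/MyDrive/ML/Dog Vision/dog-breed-identification.zip" -d "drive/MyDrive/ML/Dog Vision/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;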



&lt;h3&gt;
  Accessing the data
&lt;/h3&gt;

&lt;p&gt;Now that the data files are available on our Google Drive, we can start to check them out.&lt;/p&gt;

&lt;p&gt;Let's start with &lt;code&gt;labels.csv&lt;/code&gt;, which contains all of the image IDs and their associated dog breeds (our data and labels).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Checkout the labels of our data
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;labels_csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/labels.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                     id               breed
count                              10222               10222
unique                             10222                 120
top     000bec180eb18c7604dcecc8fe0dba07  scottish_deerhound
freq                                   1                 126
                                 id             breed
0  000bec180eb18c7604dcecc8fe0dba07       boston_bull
1  001513dfcb2ffafc82cccf4d8bbaba97             dingo
2  001cdf01b096e06d78e9e5112d419397          pekinese
3  00214f311d5d2247d5dfe4fe24b2303d          bluetick
4  0021f9ceb3235effd7fcde7f7538ed62  golden_retriever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at this, we can see there are 10,222 different IDs (meaning 10,222 different images) and 120 different breeds.&lt;/p&gt;

&lt;p&gt;Let's figure out how many images there are for each breed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How many images are there for each breed?
&lt;/span&gt;&lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"breed"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KWSNqNHF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/siy1u3loteegxhe3v3ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KWSNqNHF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/siy1u3loteegxhe3v3ob.png" alt="Barchart representing number of images for each category" width="880" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we draw a line across the middle of the graph, we see there are roughly 60 or more images for each dog breed.&lt;/p&gt;

&lt;p&gt;This is a good amount. For some of their vision products, Google recommends a minimum of 10 images per class to get started. And the more images available per class, the better chance a model has of figuring out the patterns that distinguish them.&lt;/p&gt;

&lt;p&gt;Let's check out one of the images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Loading an image file for the first time may take a while as it gets loaded into the runtime memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt; 
&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jbio8B1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/214twefx1v6hmszjvxqm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jbio8B1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/214twefx1v6hmszjvxqm.jpeg" alt="Image of a dog" width="500" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  Getting images and their labels
&lt;/h3&gt;

&lt;p&gt;Since we've got the image IDs and their labels in a DataFrame (&lt;code&gt;labels_csv&lt;/code&gt;), we'll use it to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A list of filepaths to training images&lt;/li&gt;
&lt;li&gt;An array of all labels&lt;/li&gt;
&lt;li&gt;An array of all unique labels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll only create a list of filepaths to images rather than loading them all into memory to begin with. Working with filepaths (strings) is far cheaper than working with the images themselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create pathnames from image ID's
&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/train/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".jpg"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# Check the first 10 filenames
&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['drive/MyDrive/ML/Dog Vision/train/000bec180eb18c7604dcecc8fe0dba07.jpg',
 'drive/MyDrive/ML/Dog Vision/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg',
 'drive/MyDrive/ML/Dog Vision/train/001cdf01b096e06d78e9e5112d419397.jpg',
 'drive/MyDrive/ML/Dog Vision/train/00214f311d5d2247d5dfe4fe24b2303d.jpg',
 'drive/MyDrive/ML/Dog Vision/train/0021f9ceb3235effd7fcde7f7538ed62.jpg',
 'drive/MyDrive/ML/Dog Vision/train/002211c81b498ef88e1b40b9abf84e1d.jpg',
 'drive/MyDrive/ML/Dog Vision/train/00290d3e1fdd27226ba27a8ce248ce85.jpg',
 'drive/MyDrive/ML/Dog Vision/train/002a283a315af96eaea0e28e7163b21b.jpg',
 'drive/MyDrive/ML/Dog Vision/train/003df8b8a8b05244b1d920bb6cf451f9.jpg',
 'drive/MyDrive/ML/Dog Vision/train/0042188c895a2f14ef64a918ed9c7b64.jpg']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got a list of all the filenames from the ID column of &lt;code&gt;labels_csv&lt;/code&gt;, we can compare its length to the number of files in our training data directory to see if they line up.&lt;/p&gt;

&lt;p&gt;If they do, great. If not, there may have been an issue when unzipping the data. To fix this, we might have to unzip the data again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check if number of filenames matches number of actual image files
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/train/"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Filenames match actual number of files."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Filenames do not match actual number of files, check the target directory."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filenames match actual number of files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's visualize an image directly from a filepath.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check an image directly from a filepath
&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KfVhib6t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qyjw0r2kswci6lg7fz6y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KfVhib6t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qyjw0r2kswci6lg7fz6y.jpeg" alt="Another image of a dog" width="610" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we've got our image filepaths together, let's get the labels.&lt;/p&gt;

&lt;p&gt;We'll take them from &lt;code&gt;labels_csv&lt;/code&gt; and turn them into a NumPy array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"breed"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# convert labels column to NumPy array
&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array(['boston_bull', 'dingo', 'pekinese', ..., 'airedale',
       'miniature_pinscher', 'chesapeake_bay_retriever'], dtype=object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's compare the number of labels to the number of filenames.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing data
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of labels matches number of filenames!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of labels does not match number of filenames, check data directories"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Number of labels matches number of filenames!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should have the same number of images and labels.&lt;/p&gt;

&lt;p&gt;Finally, since a machine learning model can't take strings as input, we'll have to convert our labels to numbers.&lt;/p&gt;

&lt;p&gt;To begin with, we'll find all of the unique dog breed names.&lt;/p&gt;

&lt;p&gt;Then, we'll go through the list of &lt;code&gt;labels&lt;/code&gt;, compare each one to the array of unique breeds, and create a boolean array indicating which breed is the real label (&lt;code&gt;True&lt;/code&gt;) and which ones aren't (&lt;code&gt;False&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the unique label values
&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The length of &lt;code&gt;unique_breeds&lt;/code&gt; should be 120, meaning we're working with images of 120 different breeds of dogs.&lt;/p&gt;

&lt;p&gt;Now we'll use &lt;code&gt;unique_breeds&lt;/code&gt; to turn our &lt;code&gt;labels&lt;/code&gt; array into an array of booleans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn every label into a boolean array
&lt;/span&gt;&lt;span class="n"&gt;boolean_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;unique_breeds&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;boolean_labels&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[array([False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False]),
 array([False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False])]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why do it like this?&lt;/p&gt;

&lt;p&gt;An important concept in machine learning is converting our data to numbers before passing it to a machine learning model.&lt;/p&gt;

&lt;p&gt;In this case, we've transformed a single dog breed name such as &lt;code&gt;boston_bull&lt;/code&gt; into a one-hot array.&lt;/p&gt;

&lt;p&gt;Let's see an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Turning a boolean array into integers
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# original label
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# index where label occurs
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boolean_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="c1"&gt;# index where label occurs in boolean array
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boolean_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# there will be a 1 where the sample label occurs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boston_bull
19
19
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got our labels in a numeric format and our image filepaths easily accessible (although they aren't numeric yet), let's split our data.&lt;/p&gt;

&lt;h3&gt;
  Creating our own validation set
&lt;/h3&gt;

&lt;p&gt;Since the dataset from Kaggle doesn't come with a validation set (a split of the data we can test our model on before making final predictions on the test set), let's make one.&lt;/p&gt;

&lt;p&gt;We could use Scikit-Learn's &lt;code&gt;train_test_split&lt;/code&gt; function or we could simply make manual splits of the data.&lt;/p&gt;

&lt;p&gt;For easier access later, let's save our filenames variable to &lt;code&gt;X&lt;/code&gt; (data) and our labels to &lt;code&gt;y&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup X and y variables
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boolean_labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we're working with 10,000+ images, it's a good idea to start with a portion of them to make sure things are working before training on them all.&lt;/p&gt;

&lt;p&gt;This is because computing with 10,000+ images could take a fairly long time. And our goal when working through machine learning projects is to reduce the time between experiments.&lt;/p&gt;

&lt;p&gt;Let's start experimenting with 1,000 images and increase it as we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Set number of images to use for experimenting
&lt;/span&gt;&lt;span class="n"&gt;NUM_IMAGES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's split our data into training and validation sets. We'll use an 80/20 split (80% training data, 20% validation data).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import train_test_split from Scikit-Learn
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# Split them into training and validation sets of total size NUM_IMAGES
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;NUM_IMAGES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                                  &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;NUM_IMAGES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                                  &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                  &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(800, 800, 200, 200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's look at the training data
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(['drive/MyDrive/ML/Dog Vision/train/00bee065dcec471f26394855c5c2f3de.jpg',
  'drive/MyDrive/ML/Dog Vision/train/0d2f9e12a2611d911d91a339074c8154.jpg'],
 [array([False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False,  True,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False]),
  array([False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False,  True, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False])])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Preprocessing images (turning images into Tensors)
&lt;/h3&gt;

&lt;p&gt;Our labels are in numeric format but our images are still just filepaths.&lt;/p&gt;

&lt;p&gt;Since we're using TensorFlow, our data has to be in the form of Tensors.&lt;/p&gt;

&lt;p&gt;A Tensor is a way to represent information in numbers. A Tensor can be thought of as a NumPy array with the special ability to be used on a GPU.&lt;/p&gt;

&lt;p&gt;Because of how TensorFlow stores information (in Tensors), it allows machine learning and deep learning models to be run on GPUs (generally faster at numerical computing).&lt;/p&gt;
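
&lt;p&gt;As a tiny illustration, any value we wrap in a Tensor is placed on the GPU automatically when one is available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# On a GPU runtime, new Tensors are placed on the GPU automatically
t = tf.constant([1.0, 2.0, 3.0])
print(t.device)  # ends with 'GPU:0' on a GPU runtime, 'CPU:0' otherwise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;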

&lt;p&gt;To preprocess our images into Tensors, we're going to write a function which does a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take an image filepath as input.&lt;/li&gt;
&lt;li&gt;Use TensorFlow to read the file and save it to a variable, &lt;code&gt;image&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Turn our &lt;code&gt;image&lt;/code&gt; (a jpeg file) into a Tensor.&lt;/li&gt;
&lt;li&gt;Normalize our image (convert color channel values from 0-255 to 0-1).&lt;/li&gt;
&lt;li&gt;Resize the &lt;code&gt;image&lt;/code&gt; to be of shape (224, 224).&lt;/li&gt;
&lt;li&gt;Return the modified &lt;code&gt;image&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A good place to read about this type of function is the &lt;a href="https://www.tensorflow.org/tutorials/load_data/images"&gt;TensorFlow documentation on loading images&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why is the shape (224, 224), which is (height, width)? This is to match the size of input our model takes. As we'll see later, our model will take as input an image which is (224, 224, 3).&lt;/p&gt;

&lt;p&gt;Here, 3 is the number of color channels per pixel: red, green, and blue.&lt;/p&gt;

&lt;p&gt;Let's make this a little more concrete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert image to NumPy array
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;imread&lt;/span&gt; 
&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# read in an image
&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(257, 350, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shape of the image is (257, 350, 3), which is (height, width, color channels).&lt;/p&gt;

&lt;p&gt;And we can easily convert it to a Tensor using &lt;code&gt;tf.constant()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn image into a Tensor
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;tf.Tensor: shape=(2, 350, 3), dtype=uint8, numpy=
array([[[ 89, 137,  87],
        [ 76, 124,  74],
        [ 63, 111,  59],
        ...,
        [ 76, 134,  86],
        [ 76, 134,  86],
        [ 76, 134,  86]],

       [[ 72, 119,  73],
        [ 67, 114,  68],
        [ 63, 111,  63],
        ...,
        [ 75, 131,  84],
        [ 74, 132,  84],
        [ 74, 131,  86]]], dtype=uint8)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's build that function to preprocess an image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define image size
&lt;/span&gt;&lt;span class="n"&gt;IMG_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;IMG_SIZE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Takes an image file path and an image size, and turns the image into a Tensor.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# Read in an image file
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Turn the jpeg image into numerical Tensor with 3 color channels (RGB)
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode_jpeg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Convert the color channel values from 0-255 to 0-1 values
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;convert_image_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Resize the image to our desired value (224, 224)
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
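
&lt;p&gt;As a quick sanity check, we can run one of our filepaths through the function and confirm the output shape and dtype:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sanity check: we should get back a (224, 224, 3) float32 Tensor
processed = process_image(filenames[42])
print(processed.shape, processed.dtype)  # expect (224, 224, 3) float32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;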



&lt;h3&gt;
  Creating data batches
&lt;/h3&gt;

&lt;p&gt;We'll now build a function to turn our data into batches (more specifically, a TensorFlow &lt;code&gt;BatchDataset&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;What's a batch?&lt;/p&gt;

&lt;p&gt;A batch (also called mini-batch) is a small portion of our data containing, for instance, 32 images and their labels. 32 is generally the default batch size. In deep learning, instead of finding patterns in an entire dataset at the same time, we often find them one batch at a time.&lt;/p&gt;

&lt;p&gt;Let's say we're dealing with 10,000+ images (which we are). Together, these files may take up more memory than our GPU has. Trying to compute on them all would result in an error.&lt;/p&gt;
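
&lt;p&gt;Rough numbers make the point: 10,000 images preprocessed to our eventual (224, 224, 3) shape and stored as 32-bit floats come to about 10,000 × 224 × 224 × 3 × 4 bytes, or roughly 6 GB, before we even account for the model itself.&lt;/p&gt;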

&lt;p&gt;Instead, it's more efficient to create smaller batches of our data and compute on one batch at a time.&lt;/p&gt;

&lt;p&gt;TensorFlow is very efficient when our data is in batches of (image, label) Tensors. So, we'll build a function to create these batches. We'll take advantage of the &lt;code&gt;process_image&lt;/code&gt; function at the same time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a simple function to return a tuple (image, label)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Takes an image file path name and the associated label, 
  processes the image, and returns a tuple of (image, label).
  """&lt;/span&gt;
  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
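
&lt;p&gt;Here's a quick demo of what it returns for a single pair, using the &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; variables from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Demo: turn one filepath and its boolean label into an (image, label) tuple
demo_image, demo_label = get_image_label(X[42], tf.constant(y[42]))
print(demo_image.shape, demo_label.shape)  # expect (224, 224, 3) and (120,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;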



&lt;p&gt;Now that we've got a simple function to turn our image filepath names and their associated labels into tuples, we'll create a function to make data batches.&lt;/p&gt;

&lt;p&gt;Because we'll be dealing with 3 different sets of data (training, validation, and test), we'll make sure the function can accommodate each set.&lt;/p&gt;

&lt;p&gt;We'll set a default batch size of 32 because &lt;a href="https://twitter.com/ylecun/status/989610208497360896?s=20"&gt;according to Yann LeCun&lt;/a&gt;, friends don't let friends train with batch sizes over 32.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the batch size
&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function to turn data into batches
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Creates batches of data out of image (X) and label (y) pairs.
  Shuffles the data if it's training data but doesn't shuffle if it's validation data.
  Also accepts test data as input (no labels).
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# If the data is a test dataset, we probably don't have labels
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating test data batches..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="c1"&gt;# only filepaths (no labels)
&lt;/span&gt;    &lt;span class="n"&gt;data_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_batch&lt;/span&gt;

  &lt;span class="c1"&gt;# If the data is a validation dataset, we don't need to shuffle it
&lt;/span&gt;  &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating validation data batches..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# filepaths
&lt;/span&gt;                                               &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="c1"&gt;# labels
&lt;/span&gt;    &lt;span class="n"&gt;data_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_batch&lt;/span&gt;

  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# If the data is a training dataset, we shuffle it
&lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating training data batches..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Turn filepaths and labels into Tensors
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                                               &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="c1"&gt;# Shuffling pathnames and labels before mapping image processor function is faster than shuffling images
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Create (image, label) tuples (this also turns the image path into a preprocessed image)
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Turn the data into batches   
&lt;/span&gt;    &lt;span class="n"&gt;data_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_batch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create training and validation data batches
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check the different attributes of our data batches
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;element_spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;element_spec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
  TensorSpec(shape=(None, 120), dtype=tf.bool, name=None)),
 (TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
  TensorSpec(shape=(None, 120), dtype=tf.bool, name=None)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've now got our data in batches; more specifically, it's in Tensor pairs of (images, labels), ready for use on a GPU.&lt;/p&gt;

&lt;p&gt;But batched data can be a hard concept to wrap your head around. Let's build a function that helps us visualize what's going on under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualizing data batches
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function for viewing images in a data batch
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_25_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Displays a plot of 25 images and their labels from a data batch.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# Setup the figure
&lt;/span&gt;  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="c1"&gt;# Loop through 25
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Create subplots (5 rows, 5 columns)
&lt;/span&gt;    &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Display an image
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# Add the image label as the title
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
    &lt;span class="c1"&gt;# Turn grid lines off
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"off"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make computation efficient, a batch is a tightly wound collection of Tensors.&lt;/p&gt;

&lt;p&gt;So, to view data in a batch, we've got to unwind it.&lt;/p&gt;

&lt;p&gt;We can do so by calling the &lt;code&gt;as_numpy_iterator()&lt;/code&gt; method on a data batch.&lt;/p&gt;

&lt;p&gt;This will turn our data batch into something we can iterate over.&lt;/p&gt;

&lt;p&gt;Passing that iterator to &lt;code&gt;next()&lt;/code&gt; returns its next item.&lt;/p&gt;

&lt;p&gt;In our case, &lt;code&gt;next()&lt;/code&gt; will return a batch of 32 (image, label) pairs.&lt;/p&gt;
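
&lt;p&gt;If the iterator pattern is new to you, here's the same idea on a plain Python list (a minimal sketch, unrelated to the notebook's data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The same iterator pattern on a plain Python list
it = iter([1, 2, 3])   # turn the list into an iterator
print(next(it))        # 1
print(next(it))        # 2 - each call returns the next item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;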

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Running the cell below and loading images may take a little while.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualize training images from the training data batch
&lt;/span&gt;&lt;span class="n"&gt;train_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;show_25_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sh62Nny5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vj6mg2umb6k4yiugqpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sh62Nny5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vj6mg2umb6k4yiugqpt.png" alt="Batch of training images" width="620" height="574"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now let's visualize our validation set
&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;show_25_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IFeGsIN3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xts0mqkugwk32355u1le.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IFeGsIN3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xts0mqkugwk32355u1le.png" alt="Batch of validation images" width="607" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating and training a Model
&lt;/h2&gt;

&lt;p&gt;We'll use an existing model from &lt;a href="https://tfhub.dev/"&gt;TensorFlow Hub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;TensorFlow Hub is a resource where we can find pretrained machine learning models for the problem we're working on.&lt;/p&gt;

&lt;p&gt;Using a pretrained machine learning model is often referred to as &lt;strong&gt;transfer learning&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why use a pretrained model?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Building a machine learning model and training it from scratch can be expensive and time consuming.&lt;/p&gt;

&lt;p&gt;Transfer learning helps solve these issues by taking what another model has already learned and applying it to our own problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How do we choose a model?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we know our problem is image classification (classifying different dog breeds), we can navigate the TensorFlow Hub page by our problem domain (image).&lt;/p&gt;

&lt;p&gt;We start by choosing the image problem domain, then filter down by subdomain (in our case, image classification).&lt;/p&gt;

&lt;p&gt;Doing this gives a list of different pretrained models we can apply to our task.&lt;/p&gt;

&lt;p&gt;For example, the mobilenet_v2_130_224 model takes images of shape 224 x 224 as input, and its page tells us it was trained for image classification.&lt;/p&gt;

&lt;p&gt;Let's try it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building a model
&lt;/h3&gt;

&lt;p&gt;Before we build a model, there are a few things we need to define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input shape (images, in the form of Tensors) to our model.&lt;/li&gt;
&lt;li&gt;The output shape (image labels, in the form of Tensors) of our model.&lt;/li&gt;
&lt;li&gt;The URL of the model we want to use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These things will be standard practice with whatever machine learning model we use. And because we're using TensorFlow, everything will be in the form of Tensors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup input shape to the model
&lt;/span&gt;&lt;span class="n"&gt;INPUT_SHAPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# batch, height, width, color channels
&lt;/span&gt;
&lt;span class="c1"&gt;# Setup output shape of the model
&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_SHAPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Setup model URL from TensorFlow Hub
&lt;/span&gt;&lt;span class="n"&gt;MODEL_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we've got the inputs, outputs, and model we're using ready to go. We can start to put them together.&lt;/p&gt;

&lt;p&gt;There are many ways of building a model in TensorFlow but one of the best ways to get started is to use the &lt;a href="https://www.tensorflow.org/guide/keras/overview"&gt;Keras API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Defining a deep learning model in Keras can be as straightforward as saying, "here are the layers of the model, the input shape, and the output shape, let's go!"&lt;/p&gt;

&lt;p&gt;Knowing this, let's create a function which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes the input shape, output shape, and the model we've chosen as parameters.&lt;/li&gt;
&lt;li&gt;Defines the layers in a Keras model in sequential fashion.&lt;/li&gt;
&lt;li&gt;Compiles the model (defines how it should be evaluated and improved).&lt;/li&gt;
&lt;li&gt;Builds the model (tells it what kind of input shape it'll be getting).&lt;/li&gt;
&lt;li&gt;Returns the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these steps can be found here: &lt;a href="https://www.tensorflow.org/guide/keras/sequential_model"&gt;https://www.tensorflow.org/guide/keras/sequential_model&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a function which builds a keras model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INPUT_SHAPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_SHAPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Building model with:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Setup the model layers
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KerasLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_url&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Layer 1 (input layer)
&lt;/span&gt;    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"softmax"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Layer 2 (output layer)
&lt;/span&gt;  &lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="c1"&gt;# Compile the model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CategoricalCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Build the model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's happening here?&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting up the model layers
&lt;/h4&gt;

&lt;p&gt;There are two ways to do this in Keras, the &lt;a href="https://www.tensorflow.org/guide/keras/functional"&gt;functional&lt;/a&gt; and &lt;a href="https://www.tensorflow.org/guide/keras/overview#build_a_simple_model"&gt;sequential&lt;/a&gt; API. We've used the sequential.&lt;/p&gt;

&lt;p&gt;Which one should we choose?&lt;/p&gt;

&lt;p&gt;The Keras documentation states that the functional API is the way to go for defining complex models, but the sequential API (a linear stack of layers) is perfectly fine for getting started, which is what we're doing.&lt;/p&gt;

&lt;p&gt;The first layer we use is the model from TensorFlow Hub (&lt;code&gt;hub.KerasLayer(MODEL_URL)&lt;/code&gt;). So our first layer is actually an entire model (many more layers). This &lt;strong&gt;input layer&lt;/strong&gt; takes in our images and finds patterns in them based on the patterns &lt;code&gt;mobilenet_v2_130_224&lt;/code&gt; has found.&lt;/p&gt;

&lt;p&gt;The next layer (&lt;code&gt;tf.keras.layers.Dense()&lt;/code&gt;) is the &lt;strong&gt;output layer&lt;/strong&gt; of our model. It brings all of the information discovered in the input layer together and outputs it in the shape we're after, 120 (the number of unique labels we have).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;activation="softmax"&lt;/code&gt; parameter tells the output layer to assign each of the 120 labels a probability value between 0 and 1. The higher the value, the more confident the model is that the input image should have that label. If we were working on a binary classification problem, we'd use &lt;code&gt;activation="sigmoid"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For more on which activation function to use, see the article &lt;a href="https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8"&gt;Which Loss and Activation Functions Should I Use?&lt;/a&gt;&lt;/p&gt;
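
&lt;p&gt;To make softmax less abstract, here's a minimal sketch with three made-up scores (not from our model) showing how it turns raw numbers into probabilities that sum to 1:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Three made-up raw scores (logits) for three hypothetical labels
logits = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(logits)
print(probs.numpy())        # ~[0.659 0.242 0.099] - the highest score gets the highest probability
print(probs.numpy().sum())  # ~1.0 - softmax outputs always sum to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;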

&lt;h4&gt;
  
  
  Compiling the model
&lt;/h4&gt;

&lt;p&gt;This one is best explained with a story.&lt;/p&gt;

&lt;p&gt;Let's say you're at the international hill descending championships, where you start standing on top of a hill with the goal of getting to the bottom. The catch is that you're blindfolded.&lt;/p&gt;

&lt;p&gt;Luckily, your friend Adam is standing at the bottom of the hill shouting instructions on how to get down.&lt;/p&gt;

&lt;p&gt;At the bottom of the hill, there's a judge evaluating how you're doing. They know where you need to end up so they compare how you're doing to where you're supposed to be. Their comparison is how you get scored.&lt;/p&gt;

&lt;p&gt;Transferring this to &lt;code&gt;model.compile()&lt;/code&gt; terminology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;loss&lt;/code&gt; - The height of the hill is the loss function, the model's goal is to minimize this, getting to 0 (the bottom of the hill) means the model is learning perfectly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;optimizer&lt;/code&gt; - Your friend Adam is the optimizer, he's the one telling you how to navigate the hill (lower the loss function) based on what you've done so far. His name is Adam because the &lt;a href="https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/"&gt;Adam optimizer&lt;/a&gt; performs well on most models. Other optimizers include RMSprop and Stochastic Gradient Descent.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metrics&lt;/code&gt; - This is the onlooker at the bottom of the hill rating your performance. Or in our case, reporting the accuracy of how well our model is predicting the correct image label.&lt;/li&gt;
&lt;/ul&gt;
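
&lt;p&gt;As an aside, Keras also accepts string identifiers for common losses, optimizers, and metrics. Here's a minimal sketch with a tiny stand-in model (&lt;code&gt;demo_model&lt;/code&gt; is hypothetical, just for illustration) showing the shorthand equivalent of the compile step in &lt;code&gt;create_model()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# A tiny stand-in model, just to demonstrate the string shortcuts
demo_model = tf.keras.Sequential([tf.keras.layers.Dense(120, activation="softmax")])

# Equivalent to passing the explicit objects used in create_model()
demo_model.compile(loss="categorical_crossentropy",
                   optimizer="adam",
                   metrics=["accuracy"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;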

&lt;h4&gt;
  
  
  Building the model
&lt;/h4&gt;

&lt;p&gt;We use &lt;code&gt;model.build()&lt;/code&gt; whenever we're using a layer from TensorFlow Hub to tell our model what input shape it can expect.&lt;/p&gt;

&lt;p&gt;In this case, the input shape is &lt;code&gt;[None, IMG_SIZE, IMG_SIZE, 3]&lt;/code&gt; or &lt;code&gt;[None, 224, 224, 3]&lt;/code&gt; or &lt;code&gt;[batch_size, img_height, img_width, color_channels]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Batch size is left as &lt;code&gt;None&lt;/code&gt; because it's inferred from the data we pass the model. In our case, it'll be 32, since that's what we've set up.&lt;/p&gt;

&lt;p&gt;Now that we've gone through each section of the function, let's use it to create a model.&lt;/p&gt;

&lt;p&gt;We can call &lt;code&gt;summary()&lt;/code&gt; on our model to get an idea of what our model looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building model with: https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 keras_layer (KerasLayer)    (None, 1001)              5432713   

 dense (Dense)               (None, 120)               120240    

=================================================================
Total params: 5,552,953
Trainable params: 120,240
Non-trainable params: 5,432,713
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-trainable parameters are the patterns learned by &lt;code&gt;mobilenet_v2_130_224&lt;/code&gt; and the trainable parameters are the ones in the dense layer we added.&lt;/p&gt;

&lt;p&gt;This means the bulk of the information in our model has already been learned, and we're going to take that and adapt it to our own problem.&lt;/p&gt;
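
&lt;p&gt;As a quick optional sanity check (a minimal sketch, assuming the &lt;code&gt;model&lt;/code&gt; we just built above), we can confirm which layers are trainable and that the &lt;code&gt;None&lt;/code&gt; batch dimension accepts whatever batch we pass in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Which layers will actually be updated during training?
for layer in model.layers:
  print(layer.name, "- trainable:", layer.trainable)
# keras_layer - trainable: False  (the frozen mobilenet_v2 patterns)
# dense - trainable: True         (our new output layer)

# The None batch dimension is filled in by whatever batch we pass
dummy_batch = tf.random.uniform([32, 224, 224, 3])  # a fake batch of 32 "images"
print(model(dummy_batch).shape)  # (32, 120)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;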

&lt;h3&gt;
  
  
  Creating callbacks
&lt;/h3&gt;

&lt;p&gt;We've got a model ready to go, but before we train it, we'll make some callbacks.&lt;/p&gt;

&lt;p&gt;Callbacks are helper functions a model can use during training to do things such as save a model's progress or stop training early if a model stops improving.&lt;/p&gt;

&lt;p&gt;The two callbacks we're going to add are a TensorBoard callback and an Early Stopping callback.&lt;/p&gt;

&lt;h4&gt;
  
  
  TensorBoard callback
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.tensorflow.org/tensorboard/get_started"&gt;TensorBoard&lt;/a&gt; helps provide a visual way to monitor the progress of our model during and after training.&lt;/p&gt;

&lt;p&gt;It can be used directly in a notebook to track the performance measures of a model such as loss and accuracy.&lt;/p&gt;

&lt;p&gt;To set up a TensorBoard callback and view TensorBoard in a notebook, we need to do 3 things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load the TensorBoard notebook extension.&lt;/li&gt;
&lt;li&gt;Create a TensorBoard callback which is able to save logs to a directory and pass it to our model's &lt;code&gt;fit()&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Visualize our model's training logs with the &lt;code&gt;%tensorboard&lt;/code&gt; magic function (we'll do this after model training).
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load TensorBoard notebook extension
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;load_ext&lt;/span&gt; &lt;span class="n"&gt;tensorboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function to build a TensorBoard callback
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_tensorboard_callback&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="c1"&gt;# Create a log directory for storing TensorBoard logs
&lt;/span&gt;  &lt;span class="n"&gt;logdir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/logs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="c1"&gt;# Make it so the logs get tracked whenever we run an experiment
&lt;/span&gt;                        &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'%Y%m%d-%H%M%S'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TensorBoard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logdir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Early stopping callback
&lt;/h4&gt;

&lt;p&gt;Early stopping helps prevent overfitting by stopping a model when a certain evaluation metric stops improving. If a model trains for too long, it can do so well at finding patterns in a certain dataset that it's not able to use those patterns on another dataset it hasn't seen before (the model doesn't generalize).&lt;/p&gt;

&lt;p&gt;It's basically like saying to our model, "keep finding patterns until the quality of those patterns starts to go down."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create early stopping callback
&lt;/span&gt;&lt;span class="n"&gt;early_stopping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"val_accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                  &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# stops after 3 rounds of no improvements
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
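
&lt;p&gt;An optional variant (not used in this notebook): &lt;code&gt;EarlyStopping&lt;/code&gt; can also roll the model back to the weights from its best epoch once training stops, via &lt;code&gt;restore_best_weights=True&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Optional variant: keep the weights from the best epoch, not the last one
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                                  patience=3,
                                                  restore_best_weights=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;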



&lt;h3&gt;
  
  
  Training a model (on subset of data)
&lt;/h3&gt;

&lt;p&gt;Our first model is only going to be trained on 1,000 images. Or rather, trained on 800 images and then validated on 200 images, meaning 1,000 images in total or about 10% of the total data.&lt;/p&gt;

&lt;p&gt;We do this to make sure everything is working. And if it is, we can step it up later and train on the entire training dataset.&lt;/p&gt;

&lt;p&gt;The final parameter we'll define before training is &lt;code&gt;NUM_EPOCHS&lt;/code&gt; (also known as number of epochs).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NUM_EPOCHS&lt;/code&gt; defines how many passes over the data we'd like our model to do. A pass is our model trying to find patterns in each dog image and seeing which patterns relate to each label.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;NUM_EPOCHS=1&lt;/code&gt;, the model will only look at the data once and will probably score badly, because it hasn't had a chance to correct itself. It would be like you competing in the international hill descent championships and your friend Adam only being able to give you a single instruction to get down the hill.&lt;/p&gt;

&lt;p&gt;What's a good value for &lt;code&gt;NUM_EPOCHS&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;This one is hard to say. 10 could be a good start, but so could 100. This is one of the reasons we created an early stopping callback. Having early stopping set up means that if we set &lt;code&gt;NUM_EPOCHS&lt;/code&gt; to 100 but our model stops improving after 22 epochs, training will stop automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a function that trains a model. The function will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a model using &lt;code&gt;create_model()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set up a TensorBoard callback using &lt;code&gt;create_tensorboard_callback()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Call the &lt;code&gt;fit()&lt;/code&gt; function on our model passing it the training data, validation data, number of epochs to train for (&lt;code&gt;NUM_EPOCHS&lt;/code&gt;) and the callbacks we'd like to use.&lt;/li&gt;
&lt;li&gt;Return the fitted model.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build a function to train and return the trained model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Trains a given model and returns the trained version.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# create a model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# Create new TensorBoard session every time we train a model
&lt;/span&gt;  &lt;span class="n"&gt;tensorboard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_tensorboard_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# Fit the model to the data passing it the callbacks we created
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;validation_freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# check validation metrics every epoch
&lt;/span&gt;            &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tensorboard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;early_stopping&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When training a model for the first time, the first epoch will take a while compared to the rest. This is because the model is getting set up and the data is being initialized. Using more data will generally take longer, which is why we've started with ~1,000 images. After the first epoch, subsequent epochs should take only a few seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fit the model to the data
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building model with: https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5
Epoch 1/100
25/25 [==============================] - 212s 8s/step - loss: 4.4983 - accuracy: 0.0913 - val_loss: 3.3317 - val_accuracy: 0.2500
Epoch 2/100
25/25 [==============================] - 5s 183ms/step - loss: 1.5892 - accuracy: 0.7038 - val_loss: 2.0969 - val_accuracy: 0.4950
Epoch 3/100
25/25 [==============================] - 5s 182ms/step - loss: 0.5362 - accuracy: 0.9525 - val_loss: 1.6433 - val_accuracy: 0.6000
Epoch 4/100
25/25 [==============================] - 5s 181ms/step - loss: 0.2430 - accuracy: 0.9875 - val_loss: 1.4789 - val_accuracy: 0.6250
Epoch 5/100
25/25 [==============================] - 5s 184ms/step - loss: 0.1421 - accuracy: 0.9975 - val_loss: 1.4015 - val_accuracy: 0.6450
Epoch 6/100
25/25 [==============================] - 4s 174ms/step - loss: 0.0988 - accuracy: 1.0000 - val_loss: 1.3711 - val_accuracy: 0.6450
Epoch 7/100
25/25 [==============================] - 5s 186ms/step - loss: 0.0745 - accuracy: 1.0000 - val_loss: 1.3289 - val_accuracy: 0.6500
Epoch 8/100
25/25 [==============================] - 4s 175ms/step - loss: 0.0589 - accuracy: 1.0000 - val_loss: 1.3003 - val_accuracy: 0.6600
Epoch 9/100
25/25 [==============================] - 7s 268ms/step - loss: 0.0485 - accuracy: 1.0000 - val_loss: 1.2854 - val_accuracy: 0.6700
Epoch 10/100
25/25 [==============================] - 6s 226ms/step - loss: 0.0409 - accuracy: 1.0000 - val_loss: 1.2699 - val_accuracy: 0.6700
Epoch 11/100
25/25 [==============================] - 4s 178ms/step - loss: 0.0353 - accuracy: 1.0000 - val_loss: 1.2561 - val_accuracy: 0.6650
Epoch 12/100
25/25 [==============================] - 5s 192ms/step - loss: 0.0307 - accuracy: 1.0000 - val_loss: 1.2476 - val_accuracy: 0.6700
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like our model is overfitting (getting far better results on the training set than on the validation set), so we should look into some ways to prevent overfitting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Overfitting to begin with is a good thing. It means our model is learning something.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking the TensorBoard logs
&lt;/h3&gt;

&lt;p&gt;The TensorBoard magic function (&lt;code&gt;%tensorboard&lt;/code&gt;) will access the logs directory we created earlier and visualize its content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;tensorboard&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;logdir&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;MyDrive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ML&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Dog&lt;/span&gt;\ &lt;span class="n"&gt;Vision&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thanks to our &lt;code&gt;early_stopping&lt;/code&gt; callback, the model stopped training after 12 epochs (in my case, yours might be slightly different). This is because the validation accuracy failed to improve for 3 epochs.&lt;/p&gt;

&lt;p&gt;But the good news is, we can definitely see that our model is learning something. The validation accuracy got to 67% in only a few minutes.&lt;/p&gt;

&lt;p&gt;This means, if we were to scale up the number of images, hopefully we'd see the accuracy increase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making and evaluating predictions using a trained model
&lt;/h2&gt;

&lt;p&gt;Before we scale up and train on more data, let's look at some other ways to evaluate our model. Although accuracy is a pretty good indicator of how our model is doing, it would be even better if we could see it in action.&lt;/p&gt;

&lt;p&gt;Making predictions with a trained model is as simple as calling &lt;code&gt;predict()&lt;/code&gt; on it and passing it data in the same format the model was trained on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on the validation data (not used to train on)
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# verbose shows us how long there is to go
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7/7 [==============================] - 2s 149ms/step
array([[6.22897118e-04, 1.67990889e-04, 8.79216474e-04, ...,
        7.09853193e-04, 5.02764196e-05, 1.65839761e-03],
       [2.67735624e-04, 2.39910398e-04, 1.09814322e-02, ...,
        9.64251813e-05, 7.48365244e-04, 7.03895203e-05],
       [1.96330846e-04, 1.39784577e-04, 4.39772011e-05, ...,
        2.66594667e-04, 2.05397533e-04, 2.26700475e-04],
       ...,
       [2.78241873e-06, 8.37749758e-05, 4.92490479e-04, ...,
        3.37603960e-05, 6.75940537e-04, 1.90498744e-04],
       [2.50874972e-03, 1.12199872e-04, 5.24179341e-05, ...,
        7.15124188e-05, 2.62187295e-05, 1.73446781e-03],
       [1.17617834e-04, 7.77364112e-05, 1.37082185e-04, ...,
        1.58314139e-03, 8.59645079e-04, 2.46757936e-05]], dtype=float32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check the shape of predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(200, 120)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Making predictions with our model returns an array with a different value for each label.&lt;/p&gt;

&lt;p&gt;In this case, making predictions on the validation data (200 images) returns an array (&lt;code&gt;predictions&lt;/code&gt;) of arrays, each containing 120 different values (one for each unique dog breed).&lt;/p&gt;

&lt;p&gt;These values are the probabilities, or likelihoods, the model assigns to a given image being each breed of dog. The higher the value, the more likely the model thinks the image is of that breed.&lt;/p&gt;

&lt;p&gt;Let's see how we'd convert an array of probabilities into an actual label.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# First prediction
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Max value (probability of prediction): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Sum: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Max index: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Predicted label: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[6.2289712e-04 1.6799089e-04 8.7921647e-04 1.8631003e-04 5.3724495e-04
 3.2465730e-04 2.2423903e-02 4.9657328e-04 3.0789597e-05 4.2850631e-03
 1.8981443e-04 1.0261026e-04 6.6824240e-04 2.1106268e-04 3.2271945e-04
 2.0372072e-04 2.2583949e-05 1.3980259e-01 3.2590338e-05 4.0915584e-05
 1.1816905e-03 2.7455544e-04 4.0372779e-05 6.4781279e-04 8.5824677e-06
 5.7637418e-04 4.2180789e-01 3.6451096e-05 2.1398282e-03 1.8662729e-04
 5.3795295e-05 1.1816223e-03 1.5667198e-03 1.0925717e-05 2.2503540e-05
 1.3354327e-02 3.7860959e-06 3.5932841e-04 1.2963214e-04 2.9137873e-04
 2.3260089e-03 3.3095494e-06 1.5097676e-05 4.4024091e-05 4.1719111e-05
 1.4688818e-04 5.0045412e-05 1.1059161e-04 7.7821646e-04 3.0642765e-04
 3.9216696e-04 1.4359584e-05 5.3910341e-04 6.1444713e-05 1.7364083e-04
 3.5276284e-05 1.3722615e-04 9.0744550e-04 3.9121576e-04 1.9728519e-02
 2.5874743e-04 9.5249598e-05 3.4022541e-04 1.5865565e-04 2.8779902e-04
 6.3628376e-02 7.0957394e-05 8.5773220e-04 1.4202471e-02 1.0215643e-04
 5.1950999e-02 1.8512135e-04 7.5176729e-05 2.5200758e-02 3.5105829e-04
 1.6164074e-04 8.3191908e-04 1.5876073e-02 3.7631966e-04 1.8149629e-02
 1.6471182e-04 1.5049442e-03 3.0149630e-04 5.1044985e-03 3.1844739e-04
 8.8904443e-04 3.7882995e-04 2.3105989e-04 1.8062179e-04 1.4020882e-03
 1.0508097e-03 2.2645481e-04 6.8475410e-06 2.5964212e-03 8.1505415e-05
 1.9787325e-04 9.7168679e-04 9.0247270e-04 4.6203140e-04 1.0710631e-04
 2.3471296e-02 9.6860625e-05 1.8919840e-02 5.8173094e-02 3.1562344e-05
 7.0314546e-04 2.0326689e-02 8.7760345e-05 2.7958115e-04 1.8794164e-02
 3.9155426e-04 6.9520087e-05 5.1070543e-05 2.0120994e-04 4.8968988e-04
 3.7422600e-05 3.5885598e-03 7.0985319e-04 5.0276420e-05 1.6583976e-03]
Max value (probability of prediction): 0.4218078851699829
Sum: 1.0
Max index: 26
Predicted label: cairn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having this information is great but it would be even better if we could compare a prediction to its true label and original image.&lt;/p&gt;

&lt;p&gt;To help us, let's first build a little function to convert prediction probabilities into predicted labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Prediction probabilities are also known as confidence levels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn prediction probabilities into their respective label
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Turns an array of prediction probabilities into a label.
  """&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# Get a predicted label based on an array of prediction probabilities
&lt;/span&gt;&lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;pred_label&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'cairn'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got our model's predictions in a usable form, let's do the same for the validation images and validation labels.&lt;/p&gt;

&lt;p&gt;The model hasn't trained on the validation data; during the &lt;code&gt;fit()&lt;/code&gt; call, it only used the validation data to evaluate itself. So we can use the validation images to visually compare our model's predictions with the validation labels.&lt;/p&gt;

&lt;p&gt;Since our validation data (&lt;code&gt;val_data&lt;/code&gt;) is in batch form, to get a list of validation images and labels, we'll have to unbatch it (using &lt;code&gt;unbatch()&lt;/code&gt;) and then turn it into an iterator using &lt;code&gt;as_numpy_iterator()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's make a small function to do so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a function to unbatch a batched dataset
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unbatchify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Takes a batched dataset of (image, label) Tensors and returns separate arrays
  of images and labels.
  """&lt;/span&gt;
  &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="c1"&gt;# Loop through unbatched data
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unbatch&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;

&lt;span class="c1"&gt;# Unbatchify the validation data
&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unbatchify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(array([[[0.29599646, 0.43284872, 0.3056691 ],
         [0.26635826, 0.32996926, 0.22846507],
         [0.31428418, 0.2770141 , 0.22934894],
         ...,
         [0.77614343, 0.82320225, 0.8101595 ],
         [0.81291157, 0.8285351 , 0.8406944 ],
         [0.8209297 , 0.8263737 , 0.8423668 ]],

        [[0.2344871 , 0.31603682, 0.19543913],
         [0.3414841 , 0.36560842, 0.27241898],
         [0.45016077, 0.40117094, 0.33964607],
         ...,
         [0.7663987 , 0.8134138 , 0.81350833],
         [0.7304248 , 0.75012016, 0.76590735],
         [0.74518913, 0.76002574, 0.7830809 ]],

        [[0.30157745, 0.3082587 , 0.21018331],
         [0.2905954 , 0.27066195, 0.18401104],
         [0.4138316 , 0.36170745, 0.2964005 ],
         ...,
         [0.79871625, 0.8418535 , 0.8606443 ],
         [0.7957738 , 0.82859945, 0.8605655 ],
         [0.75181633, 0.77904975, 0.8155256 ]],

        ...,

        [[0.9746779 , 0.9878955 , 0.9342279 ],
         [0.99153054, 0.99772066, 0.9427856 ],
         [0.98925114, 0.9792082 , 0.9137934 ],
         ...,
         [0.0987601 , 0.0987601 , 0.0987601 ],
         [0.05703771, 0.05703771, 0.05703771],
         [0.03600177, 0.03600177, 0.03600177]],

        [[0.98197854, 0.9820659 , 0.9379411 ],
         [0.9811992 , 0.97015417, 0.9125648 ],
         [0.9722316 , 0.93666023, 0.8697186 ],
         ...,
         [0.09682598, 0.09682598, 0.09682598],
         [0.07196062, 0.07196062, 0.07196062],
         [0.0361607 , 0.0361607 , 0.0361607 ]],

        [[0.97279435, 0.9545954 , 0.92389745],
         [0.963602  , 0.93199134, 0.88407487],
         [0.9627158 , 0.9125331 , 0.8460338 ],
         ...,
         [0.08394483, 0.08394483, 0.08394483],
         [0.0886985 , 0.0886985 , 0.0886985 ],
         [0.04514172, 0.04514172, 0.04514172]]], dtype=float32), 'cairn')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we've got ways to get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prediction labels&lt;/li&gt;
&lt;li&gt;Validation labels (truth labels)&lt;/li&gt;
&lt;li&gt;Validation images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make some functions to make these all a bit more visual.&lt;/p&gt;

&lt;p&gt;More specifically, we want to be able to view an image, its predicted label and its actual label (true label).&lt;/p&gt;

&lt;p&gt;The first function we'll create will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take an array of prediction probabilities, an array of truth labels, an array of images and an integer.&lt;/li&gt;
&lt;li&gt;Convert the prediction probabilities to a predicted label.&lt;/li&gt;
&lt;li&gt;Plot the predicted label, its predicted probability, the truth label and target image on a single plot.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_pred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  View the prediction, ground truth, and image for sample n.
  """&lt;/span&gt;
  &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# get the pred label
&lt;/span&gt;  &lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Plot image and remove ticks
&lt;/span&gt;  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;

  &lt;span class="c1"&gt;# Change the color of the title depending on whether the prediction is right or wrong
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"green"&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"red"&lt;/span&gt;

  &lt;span class="c1"&gt;# Change plot title
&lt;/span&gt;  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Predicted breed: {}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; Probability: {:2.0f}%&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; Actual breed: {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# View an example prediction, original image and truth label
&lt;/span&gt;&lt;span class="n"&gt;plot_pred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yuzw-QMm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xxbgikgyku3rh3mbso4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yuzw-QMm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xxbgikgyku3rh3mbso4s.png" alt="Prediction with true label" width="235" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Making functions to help visualize our model's results is really helpful in understanding how our model is doing.&lt;/p&gt;

&lt;p&gt;Since we're working with a multi-class problem, it would also be good to see what other guesses our model is making. More specifically, if our model predicts a certain label with 24% probability, what else did it predict?&lt;/p&gt;

&lt;p&gt;Let's build a function to demonstrate this. The function will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take as input a prediction probabilities array, a ground truth labels array and an integer.&lt;/li&gt;
&lt;li&gt;Find the predicted label using &lt;code&gt;get_pred_label()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find the top 10:

&lt;ul&gt;
&lt;li&gt;Prediction probabilities indexes&lt;/li&gt;
&lt;li&gt;Prediction probabilities values&lt;/li&gt;
&lt;li&gt;Prediction labels&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Plot the top 10 prediction probability values and labels, coloring the true label green.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_pred_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Plots the top 10 highest prediction confidences along with the truth label for sample n.
  """&lt;/span&gt;
  &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Get the predicted label
&lt;/span&gt;  &lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Find the top 10 prediction confidence indexes
&lt;/span&gt;  &lt;span class="n"&gt;top_10_pred_indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Find the top 10 pred confidence values
&lt;/span&gt;  &lt;span class="n"&gt;top_10_pred_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_indexes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Find the top 10 prediction labels
&lt;/span&gt;  &lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_indexes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Setup plot
&lt;/span&gt;  &lt;span class="n"&gt;top_plot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                     &lt;span class="n"&gt;top_10_pred_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"grey"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
             &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"vertical"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Change the color of true label
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;top_plot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;)].&lt;/span&gt;&lt;span class="n"&gt;set_color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"green"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plot_pred_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L92AjdaS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/luicooibactncamrtb7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L92AjdaS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/luicooibactncamrtb7r.png" alt="Bar chart of top 10 predictions" width="372" height="343"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's check a few predictions and their different values
&lt;/span&gt;&lt;span class="n"&gt;i_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;num_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;num_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;num_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plot_pred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;i_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plot_pred_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;i_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_pad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F-63RUW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3787bkspgpksy9zy9g2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F-63RUW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3787bkspgpksy9zy9g2z.png" alt="Confidence values for different images" width="880" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Saving and reloading a model
&lt;/h2&gt;

&lt;p&gt;After training a model, it's a good idea to save it. Saving a model means we can share it with colleagues, put it in an application and, more importantly, avoid the potentially expensive step of retraining it.&lt;/p&gt;

&lt;p&gt;An entire Keras model can be saved in the HDF5 (h5) format. So we'll make a function which takes a model as input and uses the &lt;code&gt;save()&lt;/code&gt; method to save it as an h5 file to a specified directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Saves a given model in a models directory and appends a suffix (str)
  for clarity and reuse.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# Create model directory with current time
&lt;/span&gt;  &lt;span class="n"&gt;modeldir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%Y%m%d-%H%M%s"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# note: lowercase %s appends Unix epoch seconds, hence the long timestamps in the saved filenames below&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modeldir&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".h5"&lt;/span&gt; &lt;span class="c1"&gt;# save format of model
&lt;/span&gt;  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Saving model to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got a saved model, we'd like to be able to load it. Let's create a function which takes a model path and uses the &lt;code&gt;tf.keras.models.load_model()&lt;/code&gt; function to load it back into the notebook.&lt;/p&gt;

&lt;p&gt;Because we're using a component from TensorFlow Hub (&lt;code&gt;hub.KerasLayer&lt;/code&gt;), we'll have to pass it to the &lt;code&gt;custom_objects&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Loads a saved model from a specified path.
  """&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Loading saved model from: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;custom_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"KerasLayer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KerasLayer&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save our model trained on 1000 images
&lt;/span&gt;&lt;span class="n"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"1000-images-mobilenetv2-Adam"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load our model trained on 1000 images
&lt;/span&gt;&lt;span class="n"&gt;model_1000_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'drive/MyDrive/ML/Dog Vision/models/20220325-04411648183299-1000-images-mobilenetv2-Adam.h5'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training a model (on the full dataset)
&lt;/h2&gt;

&lt;p&gt;Now that we know our model works on a subset of the data, we can move forward with training one on the full dataset.&lt;/p&gt;

&lt;p&gt;Above, we saved all of the training filepaths to &lt;code&gt;X&lt;/code&gt; and all of the training labels to &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We've got over 10,000 images and labels in our training set.&lt;/p&gt;

&lt;p&gt;Before we can train a model on these, we'll have to turn them into a data batch.&lt;/p&gt;

&lt;p&gt;We can use our &lt;code&gt;create_data_batches()&lt;/code&gt; function from above which also preprocesses our images for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn the full training data into a data batch
&lt;/span&gt;&lt;span class="n"&gt;full_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our data is now in data batches; all we need is a model.&lt;/p&gt;

&lt;p&gt;Let's use &lt;code&gt;create_model()&lt;/code&gt; to instantiate another model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Instantiate a new model for training on the full dataset
&lt;/span&gt;&lt;span class="n"&gt;full_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we've made a new model instance, &lt;code&gt;full_model&lt;/code&gt;, we'll need some callbacks too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create full model callbacks
&lt;/span&gt;
&lt;span class="c1"&gt;# TensorBoard callback
&lt;/span&gt;&lt;span class="n"&gt;full_model_tensorboard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_tensorboard_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Early stopping callback
# Note: No validation set when training on all the data, therefore can't monitor validation accuracy
&lt;/span&gt;&lt;span class="n"&gt;full_model_early_stopping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                             &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To monitor the model whilst it trains, we'll load TensorBoard (it should update every 30 seconds or so during training).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;tensorboard&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;logdir&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;My&lt;/span&gt;\ &lt;span class="n"&gt;Drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since running the cell below will cause the model to train on all of the data (10,000+ images), it may take a fairly long time to get started and finish. However, thanks to our &lt;code&gt;full_model_early_stopping&lt;/code&gt; callback, it'll stop training automatically if accuracy stops improving.&lt;/p&gt;

&lt;p&gt;The first epoch is always the longest, as the data gets loaded into memory. Once it's there, training speeds up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fit the full model to the full training data
&lt;/span&gt;&lt;span class="n"&gt;full_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;full_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;full_model_tensorboard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;full_model_early_stopping&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 1/100
320/320 [==============================] - 57s 163ms/step - loss: 1.3450 - accuracy: 0.6682
Epoch 2/100
320/320 [==============================] - 52s 162ms/step - loss: 0.3995 - accuracy: 0.8813
Epoch 3/100
320/320 [==============================] - 52s 163ms/step - loss: 0.2371 - accuracy: 0.9335
Epoch 4/100
320/320 [==============================] - 49s 152ms/step - loss: 0.1529 - accuracy: 0.9647
Epoch 5/100
320/320 [==============================] - 51s 159ms/step - loss: 0.1060 - accuracy: 0.9785
Epoch 6/100
320/320 [==============================] - 52s 163ms/step - loss: 0.0775 - accuracy: 0.9873
Epoch 7/100
320/320 [==============================] - 56s 175ms/step - loss: 0.0602 - accuracy: 0.9913
Epoch 8/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0476 - accuracy: 0.9943
Epoch 9/100
320/320 [==============================] - 57s 178ms/step - loss: 0.0369 - accuracy: 0.9966
Epoch 10/100
320/320 [==============================] - 58s 180ms/step - loss: 0.0311 - accuracy: 0.9971
Epoch 11/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0264 - accuracy: 0.9977
Epoch 12/100
320/320 [==============================] - 58s 182ms/step - loss: 0.0222 - accuracy: 0.9977
Epoch 13/100
320/320 [==============================] - 55s 171ms/step - loss: 0.0199 - accuracy: 0.9984
Epoch 14/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0172 - accuracy: 0.9987
Epoch 15/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0165 - accuracy: 0.9983
Epoch 16/100
320/320 [==============================] - 57s 179ms/step - loss: 0.0136 - accuracy: 0.9990
Epoch 17/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0153 - accuracy: 0.9983
Epoch 18/100
320/320 [==============================] - 57s 178ms/step - loss: 0.0148 - accuracy: 0.9979
Epoch 19/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0123 - accuracy: 0.9985
&amp;lt;keras.callbacks.History at 0x7fcb67ee7950&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Saving and reloading the full model
&lt;/h3&gt;

&lt;p&gt;Even on a GPU, our full model took a while to train. So it's a good idea to save it.&lt;/p&gt;

&lt;p&gt;We can do so using our &lt;code&gt;save_model()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; It may be a good idea to incorporate the &lt;code&gt;save_model()&lt;/code&gt; function into a &lt;code&gt;train_model()&lt;/code&gt; function. Or look into setting up a checkpoint callback.&lt;br&gt;
&lt;/p&gt;
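
&lt;p&gt;For reference, a checkpoint callback could look something like the minimal sketch below. This isn't from the original notebook, and the checkpoint path is a placeholder. &lt;code&gt;tf.keras.callbacks.ModelCheckpoint&lt;/code&gt; saves the model automatically during training, so a crashed runtime doesn't cost us the whole run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal checkpoint sketch (hypothetical path, not from the original notebook)
checkpoint_path = "drive/MyDrive/ML/Dog Vision/models/checkpoint.h5"
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                      monitor="accuracy",  # no validation set on the full data
                                                      save_best_only=True,
                                                      verbose=1)

# It would then be passed alongside the other callbacks, e.g.:
# full_model.fit(x=full_data, epochs=NUM_EPOCHS,
#                callbacks=[full_model_tensorboard, full_model_early_stopping, model_checkpoint])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;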

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save model to file
&lt;/span&gt;&lt;span class="n"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"all-images-Adam"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load in the full model
&lt;/span&gt;&lt;span class="n"&gt;loaded_full_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'drive/MyDrive/ML/Dog Vision/models/20220325-05281648186092-all-images-Adam.h5'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making predictions on the test dataset
&lt;/h3&gt;

&lt;p&gt;Since our model has been trained on images in the form of Tensor batches, to make predictions on the test data, we'll have to get it into the same format.&lt;/p&gt;

&lt;p&gt;We created &lt;code&gt;create_data_batches()&lt;/code&gt; earlier which can take a list of filenames as input and convert them into Tensor batches.&lt;/p&gt;

&lt;p&gt;To make predictions on the test data, we'll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the test image filenames.&lt;/li&gt;
&lt;li&gt;Convert the filenames into test data batches using &lt;code&gt;create_data_batches()&lt;/code&gt; and setting the &lt;code&gt;test_data&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt; (since there are no labels with the test images).&lt;/li&gt;
&lt;li&gt;Make a predictions array by passing the test data batches to the &lt;code&gt;predict()&lt;/code&gt; function.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load test image filenames (since we're using os.listdir(), these already have .jpg)
&lt;/span&gt;&lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/test/"&lt;/span&gt;
&lt;span class="n"&gt;test_filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_path&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;test_filenames&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['drive/MyDrive/ML/Dog Vision/test/e5f2204119380ce1a17fd09435c5012a.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e7ce78e874945f182a4f5149aa505b09.jpg',
 'drive/MyDrive/ML/Dog Vision/test/de6cc38e54a460dd34c53b74f022a8da.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e7b608110b0e29120d8740f37e85f3d0.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e66a91249a4979a86db48e5c64b81a88.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e17defebd1b8fc39e9c3c10df3c2e3de.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e3baf6b2914677edd2729db0f32e2620.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e08d42b2e6f2dbcf24c6bfee8b7d03bd.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e2b24cea9d0796ffad73cb24eab1a3f6.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e137b0cd96051765c349377725c4696d.jpg']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How many test images are there?
&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_filenames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10357
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create test data batch
&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_filenames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since there are 10,000+ test images, making predictions could take a while, even on a GPU. Be aware that running the cell below may take up to an hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on test data batch using the loaded full model
&lt;/span&gt;&lt;span class="n"&gt;test_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loaded_full_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;324/324 [==============================] - 1132s 3s/step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save predictions (NumPy array) to csv file
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;savetxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/preds_array.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load predictions (NumPy array) from csv file
&lt;/span&gt;&lt;span class="n"&gt;test_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loadtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/preds_array.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check out the test predictions
&lt;/span&gt;&lt;span class="n"&gt;test_predictions&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[2.77832507e-10, 3.47354963e-08, 1.59594504e-10, ...,
        3.36205460e-07, 1.01586806e-09, 1.35463404e-10],
       [1.06715709e-06, 9.80697884e-11, 1.83814009e-05, ...,
        5.81553738e-09, 1.35275069e-09, 9.25533868e-07],
       [8.79199422e-11, 7.13828712e-08, 3.49000935e-08, ...,
        5.00790861e-07, 3.85421224e-08, 9.47958489e-10],
       ...,
       [2.41431576e-13, 9.99975681e-01, 7.91099131e-11, ...,
        1.21032284e-09, 1.01821096e-09, 8.52753868e-09],
       [1.11224371e-14, 5.69826790e-11, 4.86594055e-12, ...,
        9.99895334e-01, 3.12582422e-08, 3.12060724e-11],
       [3.74029030e-09, 6.98669282e-08, 6.67546445e-08, ...,
        1.31928124e-09, 3.12934681e-05, 1.51918755e-07]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making predictions on custom images
&lt;/h3&gt;

&lt;p&gt;It's great being able to make predictions on a test dataset already provided for us.&lt;/p&gt;

&lt;p&gt;But how could we use our model on our own images?&lt;/p&gt;

&lt;p&gt;The premise remains the same: if we want to make predictions on our own custom images, we have to pass them to the model in the same format it was trained on.&lt;/p&gt;

&lt;p&gt;To do so, we'll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the filepaths of our own images.&lt;/li&gt;
&lt;li&gt;Turn the filepaths into data batches using &lt;code&gt;create_data_batches()&lt;/code&gt;. And since our custom images won't have labels, we set the &lt;code&gt;test_data&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pass the custom image data batch to our model's &lt;code&gt;predict()&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;Convert the prediction output probabilities to prediction labels.&lt;/li&gt;
&lt;li&gt;Compare the predicted labels to the custom images.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; To make predictions on custom images, I've uploaded pictures to a directory located at &lt;code&gt;drive/MyDrive/ML/Dog Vision/my-dogs/&lt;/code&gt; (as seen in the cell below).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get custom image filepaths
&lt;/span&gt;&lt;span class="n"&gt;custom_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/my-dogs/"&lt;/span&gt;
&lt;span class="n"&gt;custom_image_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;custom_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_path&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn custom image into batch (set to test data because there are no labels)
&lt;/span&gt;&lt;span class="n"&gt;custom_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_image_paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on the custom data
&lt;/span&gt;&lt;span class="n"&gt;custom_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loaded_full_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got some prediction arrays, let's convert them to labels and compare them with each image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get custom image prediction labels
&lt;/span&gt;&lt;span class="n"&gt;custom_pred_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_preds&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;span class="n"&gt;custom_pred_labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['boxer',
 'bull_mastiff',
 'american_staffordshire_terrier',
 'staffordshire_bullterrier',
 'maltese_dog',
 'labrador_retriever']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get custom images (our unbatchify() function won't work since there aren't labels)
&lt;/span&gt;&lt;span class="n"&gt;custom_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="c1"&gt;# Loop through unbatched data
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;custom_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unbatch&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="n"&gt;custom_images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check custom image predictions
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_pred_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CJrnSMe---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fqn2wsbfggtjhkmmfnwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CJrnSMe---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fqn2wsbfggtjhkmmfnwv.png" alt="Custom images with predicted label" width="489" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;We've just gone end-to-end on a multi-class image classification problem!&lt;/p&gt;

&lt;p&gt;This is the same style of problem self-driving cars have, except with different data.&lt;/p&gt;

&lt;p&gt;We've got plenty of options on where to go next.&lt;/p&gt;

&lt;p&gt;We could try to improve the full model we trained in this notebook in a few ways. Since our early experiment (using only 1,000 images) hinted at our model overfitting, one goal going forward would be to try and prevent it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://tfhub.dev/"&gt;Trying another model from TensorFlow Hub&lt;/a&gt; - Perhaps a different model would perform better on our dataset. One option would be to experiment with a different pretrained model from TensorFlow Hub or look into the &lt;code&gt;tf.keras.applications&lt;/code&gt; module.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://bair.berkeley.edu/blog/2019/06/07/data_aug/"&gt;Data augmentation&lt;/a&gt; - Take the training images and manipulate (crop, resize) or distort them (flip, rotate) to create even more training data for the model to learn from. Check out the &lt;a href="https://www.tensorflow.org/api_docs/python/tf/image"&gt;TensorFlow images&lt;/a&gt; documentation for a whole bunch of functions we can use on images. A great idea would be to try and replicate the techniques in &lt;a href="https://github.com/google/eng-edu/blob/master/ml/pc/exercises/image_classification_part2.ipynb"&gt;this example cat vs. dog image classification notebook&lt;/a&gt; for our dog breeds problem. See the sketch after this list for a starting point.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.tensorflow.org/hub/tf2_saved_model#fine-tuning"&gt;Fine-tuning&lt;/a&gt; - The model we used in this notebook was directly from TensorFlow Hub, we took what it had already learned from another dataset (ImageNet) and applied it to our own. Another option is to use what the model already knows and fine-tune this knowledge to our own dataset (pictures of dogs). This would mean all of the patterns within the model would be updated to be more specific to pictures of dogs rather than general images.&lt;/li&gt;
&lt;/ol&gt;
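
&lt;p&gt;As promised, here's a minimal data augmentation sketch. It's not from the original notebook: it assumes our images are float32 tensors scaled to [0, 1] (as produced by our preprocessing), and the &lt;code&gt;augment()&lt;/code&gt; function name is ours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

def augment(image, label):
  """Randomly flip and brighten an image to create extra training variety."""
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  image = tf.clip_by_value(image, 0.0, 1.0)  # keep pixel values in [0, 1]
  return image, label

# Applied to a tf.data.Dataset of (image, label) pairs, e.g.:
# augmented_data = full_data.map(augment)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For fine-tuning (option 3 above), a TensorFlow Hub layer can be created by passing &lt;code&gt;trainable=True&lt;/code&gt; to &lt;code&gt;hub.KerasLayer&lt;/code&gt;, so the pretrained weights get updated during training.&lt;/p&gt;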

&lt;p&gt;One of the best ways to find out more is to search for queries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How to improve a TensorFlow 2.x image classification model?"&lt;/li&gt;
&lt;li&gt;"TensorFlow 2.x image classification best practices"&lt;/li&gt;
&lt;li&gt;"Transfer learning for image classification with TensorFlow 2.x"&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Machine Learning: How to Predict the Sale Price of Bulldozers</title>
      <dc:creator>Nicolas Vallée</dc:creator>
      <pubDate>Wed, 30 Mar 2022 09:18:02 +0000</pubDate>
      <link>https://dev.to/nicolasvallee/machine-learning-how-to-predict-the-sale-price-of-bulldozers-bhf</link>
      <guid>https://dev.to/nicolasvallee/machine-learning-how-to-predict-the-sale-price-of-bulldozers-bhf</guid>
      <description>&lt;p&gt;In this tutorial, we're going to walk through an example Machine Learning project where the goal is to predict the sale price of bulldozers.&lt;/p&gt;

&lt;p&gt;This kind of problem is known as a &lt;strong&gt;regression problem&lt;/strong&gt; because we're trying to determine a price, which is a continuous variable.&lt;/p&gt;

&lt;p&gt;The data is from the &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/overview"&gt;Kaggle Bluebook for Bulldozers competition&lt;/a&gt;. We'll also use the same evaluation metric, root mean square log error, or RMSLE.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This exercise is a milestone project from the course &lt;a href="https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/"&gt;Complete Machine Learning &amp;amp; Data Science Bootcamp&lt;/a&gt;, which I completed in March 2022.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can find &lt;a href="https://github.com/nicolas-vallee/ztm-ml-projects/blob/main/bulldozer-price-regression.ipynb"&gt;the Jupyter notebook&lt;/a&gt; in my GitHub repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Let's consider how we're approaching this problem, and the different steps to solve it.&lt;/p&gt;

&lt;p&gt;We have a dataset provided to us. We'll approach the problem with the following 6-step machine learning modelling framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" alt="6 Step Machine Learning Modelling Framework" width="880" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll use pandas, Matplotlib, and NumPy for data analysis. And we'll use Scikit-Learn for machine learning and modelling tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" alt="Tools which can be used for each step of the machine learning modelling process" width="880" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After working through each step of this workflow, we'll have a trained machine learning model. And ideally, this trained model will be able to accurately predict the sale price of a bulldozer given different characteristics about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;The question that we're trying to answer is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given the characteristics of a particular bulldozer and data related to past sales of similar bulldozers, how well can we predict the future sale price of this bulldozer?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Data
&lt;/h2&gt;

&lt;p&gt;Let's have a look at the dataset from Kaggle. First, we can see that it's a time series problem: the dataset contains a time attribute (the date of each sale).&lt;/p&gt;

&lt;p&gt;In this case, we're working with historical sales data for bulldozers. The dataset includes attributes such as model type, size, sale date, and more.&lt;/p&gt;

&lt;p&gt;The 3 datasets available are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Train.csv&lt;/strong&gt; - Historical bulldozer sales data up to 2011 (close to 400,000 examples with 50+ different attributes, including &lt;code&gt;SalePrice&lt;/code&gt;, the &lt;strong&gt;target variable&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Valid.csv&lt;/strong&gt; - Historical bulldozer sales data from January 1st, 2012 to April 30th, 2012 (close to 12,000 examples with the same attributes as &lt;code&gt;Train.csv&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test.csv&lt;/strong&gt; - Historical bulldozer sales data from May 1st, 2012 to November 30th, 2012 (close to 12,000 examples, but missing the &lt;code&gt;SalePrice&lt;/code&gt; attribute, as this is what we aim to predict).&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. Evaluation
&lt;/h2&gt;

&lt;p&gt;For this competition, Kaggle has set the evaluation metric to the &lt;strong&gt;root mean squared log error (RMSLE)&lt;/strong&gt;. The goal will be to get this value as low as possible.&lt;/p&gt;

&lt;p&gt;To check how well our model is doing, we'll calculate the RMSLE and then compare our results to the &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/leaderboard"&gt;Kaggle leaderboard&lt;/a&gt;. &lt;/p&gt;




&lt;h2&gt;
  
  
  4. Features
&lt;/h2&gt;

&lt;p&gt;Features are the different parts of the data. During this step, we want to find out what we can about the data and become more familiar with it.&lt;/p&gt;

&lt;p&gt;A common way to do this is to create a &lt;strong&gt;data dictionary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For this dataset, Kaggle provides a data dictionary containing information about what each attribute of the dataset means. We can &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/download/Bnl6RAHA0enbg0UfAvGA%2Fversions%2FwBG4f35Q8mAbfkzwCeZn%2Ffiles%2FData%20Dictionary.xlsx"&gt;download this file&lt;/a&gt; directly from the Kaggle competition page (account required) or view it on Google Sheets.&lt;/p&gt;

&lt;p&gt;First, we'll import the dataset and start exploring it. And since we know the evaluation metric we're trying to minimise (RMSLE), our first goal is to build a baseline model and see how it compares against the competition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing the data and preparing it for modeling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import data analysis tools 
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've downloaded the data &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/data"&gt;from Kaggle&lt;/a&gt; and stored it under the file path &lt;code&gt;"./data/"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the training and validation set
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/TrainAndValid.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now start exploring our dataset with &lt;code&gt;df.info()&lt;/code&gt;. Let's use &lt;code&gt;matplotlib&lt;/code&gt; to visualize some of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WH3nF4sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fje2eg9tayxh6aanehtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WH3nF4sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fje2eg9tayxh6aanehtu.png" alt="Matplotlib scatter plot" width="417" height="248"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h4_GNu-f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jr5hk8gvz1pgskl58mtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h4_GNu-f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jr5hk8gvz1pgskl58mtm.png" alt="Matplotlib hist plot" width="408" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Parsing dates
&lt;/h3&gt;

&lt;p&gt;When working with time series data, it's a good practice to make sure dates are in the format of a &lt;a href="https://docs.python.org/3/library/datetime.html"&gt;datetime object&lt;/a&gt; (a Python data type which encodes specific information about dates).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/TrainAndValid.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we're using &lt;code&gt;parse_dates&lt;/code&gt;, calling &lt;code&gt;df.info()&lt;/code&gt; shows that the dtype of "saledate" is &lt;code&gt;datetime64[ns]&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f9IsWK5u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wde77649a37ydskr47wj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f9IsWK5u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wde77649a37ydskr47wj.png" alt="Matplotlib scatter plot" width="406" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To visualize the attributes of the DataFrame more easily, a trick is to transpose it with &lt;code&gt;df.head().T&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sort DataFrame by saledate
&lt;/h3&gt;

&lt;p&gt;It makes sense to sort our data by date because we're working on a time series problem where we predict future values given past examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sort DataFrame by date (in ascending order)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
Name: saledate, dtype: datetime64[ns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Make a copy of the original DataFrame
&lt;/h3&gt;

&lt;p&gt;Since we're going to manipulate the data, it's better to make a copy of the original DataFrame and perform our changes there. This way, we keep the original DataFrame intact in case we need it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make a copy of the original DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add datetime parameters for saledate column
&lt;/h3&gt;

&lt;p&gt;We're doing this to enrich our dataset with as much information as possible.&lt;/p&gt;

&lt;p&gt;Since we imported the data using &lt;code&gt;read_csv()&lt;/code&gt; and parsed the dates using &lt;code&gt;parse_dates=["saledate"]&lt;/code&gt;, we can now access the different datetime attributes of the saledate column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add datetime parameters for saledate
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleYear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleMonth"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDay"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofweek"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofyear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofyear&lt;/span&gt;

&lt;span class="c1"&gt;# Drop original saledate
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Modeling
&lt;/h2&gt;

&lt;p&gt;We've explored our dataset and enriched it with some datetime attributes. We could spend more time doing exploratory data analysis (EDA), finding out more about the data ourselves. But instead, we'll use a machine learning model to help us do EDA.&lt;/p&gt;

&lt;p&gt;After all, one of the objectives when starting a new machine learning project is to reduce the time between experiments. So, let's move on to the model.&lt;/p&gt;

&lt;p&gt;Following the &lt;a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html"&gt;Scikit-Learn machine learning map&lt;/a&gt;, we find that a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor"&gt;RandomForestRegressor()&lt;/a&gt; is a good candidate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, fitting the &lt;code&gt;RandomForestRegressor()&lt;/code&gt; model on our training data wouldn't work yet, because we've got missing values as well as attributes in a non-numerical format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing categories and different datatypes
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that many columns have the type &lt;code&gt;object&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing values
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Likewise, many columns have missing values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Convert strings to categories
&lt;/h3&gt;

&lt;p&gt;One way to turn our data into numbers is to convert the columns with the string datatype into a category datatype.&lt;/p&gt;

&lt;p&gt;To do this, we can use the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality"&gt;pandas types API&lt;/a&gt;, which allows us to interact with and manipulate the data types.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# These columns contain strings
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_string_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us a list of all the columns for which the data is in a string format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This will turn all of the string values into category values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_string_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_ordered&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling &lt;code&gt;df_tmp.info()&lt;/code&gt;, we see that the type &lt;code&gt;object&lt;/code&gt; has been converted to &lt;code&gt;category&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;All of our data is now categorical. That means we can now turn the categories into numbers. However, we still have to deal with missing values.&lt;/p&gt;
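
&lt;p&gt;Under the hood, pandas now backs each of these columns with integer codes. Here's a quick sketch of how to peek at them (assuming &lt;code&gt;state&lt;/code&gt; is one of the columns we just converted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The unique category labels for the column
df_tmp.state.cat.categories

# The integer code backing each row (-1 marks a missing value)
df_tmp.state.cat.codes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;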

&lt;p&gt;The data is now in a format we can work with, so let's save it to a file and reimport it so we can continue from here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Save processed data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save preprocessed data
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/train_tmp.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Import preprocessed data
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/train_tmp.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our preprocessed DataFrame has the columns we added to it, but it still has some missing values. We can see where values are missing with &lt;code&gt;df_tmp.isna().sum()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fill missing values
&lt;/h3&gt;

&lt;p&gt;Here are two things to know about machine learning models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All of our data has to be numerical&lt;/li&gt;
&lt;li&gt;There can't be any missing values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And as we've seen using &lt;code&gt;df_tmp.isna().sum()&lt;/code&gt;, our data still has plenty of missing values.&lt;/p&gt;

&lt;h4&gt;
  
  
  Filling numerical values
&lt;/h4&gt;

&lt;p&gt;First, we're going to fill the missing values in the numeric columns with the median of each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SalesID
SalePrice
MachineID
ModelID
datasource
auctioneerID
YearMade
MachineHoursCurrentMeter
saleYear
saleMonth
saleDay
saleDayofweek
saleDayofyear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for which numeric columns have null values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auctioneerID
MachineHoursCurrentMeter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fill numeric rows with the median
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="c1"&gt;# Add a binary column which tells if the data was missing our not
&lt;/span&gt;            &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Fill missing numeric values with median since it's more robust than the mean
&lt;/span&gt;            &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can easily fill all of the missing numeric values in our dataset with the median. However, a value may be missing for a good reason. That's why we added a binary column recording whether each value was missing, so that information isn't lost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check if there's any null values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good, we no longer have any missing values in the numeric columns.&lt;/p&gt;

&lt;h4&gt;
  
  
  Filling and turning categorical variables to numbers
&lt;/h4&gt;

&lt;p&gt;Now, let's fill the missing categorical values. At the same time, we'll turn the categories into numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn categorical variables into numbers
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Check columns which *aren't* numeric
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Add binary column to inidicate whether sample had missing value
&lt;/span&gt;        &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# We add the +1 because pandas encodes missing categories as -1
&lt;/span&gt;        &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of our data is now numeric, and there are no missing values.&lt;/p&gt;
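
&lt;p&gt;A couple of quick checks confirm this (a minimal sketch using plain pandas):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Should be 0: no missing values anywhere in the DataFrame
df_tmp.isna().sum().sum()

# Should be empty: no string (object) columns left
df_tmp.select_dtypes(include="object").columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;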

&lt;p&gt;We should be able to build a machine learning model! Let's instantiate our &lt;code&gt;RandomForestRegressor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fitting it on the full dataset would take a few minutes, which is too long for quick experimentation. So, what we'll do is work with a subset of the samples.&lt;/p&gt;
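
&lt;p&gt;For context, here's a sketch of the fit we want to avoid re-running on every experiment (this is the call that takes minutes on ~400,000 rows):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Naive approach: fit on every row (slow, so we'll subsample instead)
model = RandomForestRegressor(n_jobs=-1)
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp.SalePrice)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;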

&lt;h3&gt;
  
  
  Splitting data into training and validation sets
&lt;/h3&gt;

&lt;p&gt;According to the Kaggle data page, the validation set and test set are split according to dates. This makes sense since we're working on a time series problem.&lt;/p&gt;

&lt;p&gt;Knowing this, randomly splitting our data into train and test sets using something like &lt;code&gt;train_test_split()&lt;/code&gt; wouldn't work.&lt;/p&gt;

&lt;p&gt;Instead, we'll split our data into training, validation, and test sets according to the date of the sample.&lt;/p&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training = all samples up until 2011&lt;/li&gt;
&lt;li&gt;Valid = all samples from January 1, 2012 - April 30, 2012&lt;/li&gt;
&lt;li&gt;Test = all samples from May 1, 2012 - November 2012&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more on making good training, validation, and test sets, check out the post &lt;a href="https://www.fast.ai/2017/11/13/validation-sets/"&gt;How (and why) to create a good validation set&lt;/a&gt; by Rachel Thomas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split data into training and validation sets
&lt;/span&gt;&lt;span class="n"&gt;df_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saleYear&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2012&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saleYear&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2012&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(401125, 11573)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split data into X &amp;amp; y
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;
&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;df_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((401125, 102), (401125,), (11573, 102), (11573,))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building an evaluation function
&lt;/h3&gt;

&lt;p&gt;The evaluation metric used for this Kaggle competition is the root mean squared log error (RMSLE).&lt;/p&gt;

&lt;p&gt;It's important to understand the evaluation metric we're going for. The &lt;strong&gt;RMSLE&lt;/strong&gt; is the metric of choice when we care more about the consequences of being off by 10% than of being off by $10. That is, when we care more about ratios than absolute differences. The &lt;strong&gt;MAE&lt;/strong&gt; (mean absolute error), by contrast, measures exact differences.&lt;/p&gt;
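
&lt;p&gt;To make that concrete, here's a quick sketch with made-up numbers where every prediction is 10% too high:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

y_true = np.array([10, 100, 1000])
y_pred = y_true * 1.1  # every prediction is off by the same *ratio*

# RMSLE: each sample contributes roughly the same (same percentage error)
log_errors = np.log1p(y_pred) - np.log1p(y_true)
print(np.sqrt(np.mean(log_errors ** 2)))  # ~0.09

# MAE: dominated by the $100 error on the $1,000 sample
print(np.mean(np.abs(y_pred - y_true)))   # 37.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;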

&lt;p&gt;Since Scikit-Learn doesn't have a built-in function for RMSLE, we'll create our own.&lt;/p&gt;

&lt;p&gt;We can do this by taking the square root of Scikit-Learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error"&gt;mean_squared_log_error&lt;/a&gt; (MSLE). MSLE is the mean squared error (MSE) computed on log-transformed values, so taking its square root gives us the RMSLE.&lt;/p&gt;

&lt;p&gt;While we're at it, we'll also calculate the MAE and R^2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create evaluation function (the competition uses Root Mean Square Log Error)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_log_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rmsle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_squared_log_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create function to evaluate our model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Training MAE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Valid MAE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Training RMSLE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rmsle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Valid RMSLE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rmsle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Training R^2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Valid R^2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing our model on a subset of data
&lt;/h3&gt;

&lt;p&gt;Training a model on the entire dataset would take far too long to keep experimenting as fast as we want to.&lt;/p&gt;

&lt;p&gt;So, what we'll do is take a sample of the training set and tune the hyperparameters on that subset before training a larger model.&lt;/p&gt;

&lt;p&gt;If our experiments are taking longer than 10 seconds, we should try to speed things up by sampling less data or using a faster computer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;401125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's alter the number of samples each tree in the &lt;code&gt;RandomForestRegressor&lt;/code&gt; sees, using the &lt;code&gt;max_samples&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Change max samples in RandomForestRegressor
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;max_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;max_samples&lt;/code&gt; to 10,000 means each of the &lt;code&gt;n_estimators&lt;/code&gt; trees (100 by default) in our &lt;code&gt;RandomForestRegressor&lt;/code&gt; will only see 10,000 random samples from our DataFrame, instead of the entire 400,000.&lt;/p&gt;

&lt;p&gt;In other words, each tree will be looking at 40x fewer samples, which means we'll get faster computation speeds. Though, we should expect our results to worsen (because the model has fewer samples to learn patterns from).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="c1"&gt;# Cutting down the max number of samples each tree can see improves training time
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU times: user 43.5 s, sys: 1.39 s, total: 44.9 s
Wall time: 9.3 s
RandomForestRegressor(max_samples=10000, n_jobs=-1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Training MAE': 5557.476079825491,
 'Valid MAE': 7145.432947377517,
 'Training RMSLE': 0.2576354132701014,
 'Valid RMSLE': 0.2932065424618794,
 'Training R^2': 0.860725458431776,
 'Valid R^2': 0.8334321554498396}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hyperparameter tuning with RandomizedSearchCV
&lt;/h3&gt;

&lt;p&gt;We can increase &lt;code&gt;n_iter&lt;/code&gt; to try more combinations of hyperparameters, but in our case we'll try 20 and see where it gets us. We're trying to reduce the amount of time between experiments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;

&lt;span class="c1"&gt;# Different RandomForestClassifier hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"n_estimators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"max_depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_split"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_leaf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"max_features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sqrt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="s"&gt;"max_samples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;rs_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                              &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rs_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fitting 5 folds for each of 20 candidates, totalling 100 fits
CPU times: user 9min 43s, sys: 38.7 s, total: 10min 22s
Wall time: 10min 30s
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=20,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the best parameters from the RandomizedSearch 
&lt;/span&gt;&lt;span class="n"&gt;rs_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'n_estimators': 60,
 'min_samples_split': 6,
 'min_samples_leaf': 1,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': None}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the RandomizedSearch model
&lt;/span&gt;&lt;span class="n"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Training MAE': 5623.00659389663,
 'Valid MAE': 7221.500628014852,
 'Training RMSLE': 0.2600757896526271,
 'Valid RMSLE': 0.2938688696407002,
 'Training R^2': 0.857223776894714,
 'Valid R^2': 0.8302372473994826}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Train a model with the best parameters
&lt;/h3&gt;

&lt;p&gt;The course instructor, Daniel Bourke, tried 100 different combinations of hyperparameters (setting &lt;code&gt;n_iter&lt;/code&gt; to 100 in &lt;code&gt;RandomizedSearchCV&lt;/code&gt;) and found that the best results came from the ones we'll use below. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This kind of search with &lt;code&gt;n_iter = 100&lt;/code&gt; can take about 2 hours on a powerful laptop. The search above, with &lt;code&gt;n_iter = 20&lt;/code&gt; (20 candidates, 100 fits), took only about 10 minutes on my MacBook Air.&lt;/p&gt;

&lt;p&gt;We'll instantiate a new model with these discovered hyperparameters and reset the &lt;code&gt;max_samples&lt;/code&gt; back to its original value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="c1"&gt;# Most ideal hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;ideal_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;min_samples_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;max_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU times: user 10min 54s, sys: 5.69 s, total: 11min
Wall time: 1min 39s
RandomForestRegressor(max_features=0.5, min_samples_split=14, n_estimators=90,
                      n_jobs=-1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Training MAE': 2927.714226014933,
 'Valid MAE': 5893.170337485774,
 'Training RMSLE': 0.14329949046791518,
 'Valid RMSLE': 0.2436971060471662,
 'Training R^2': 0.9596781605243438,
 'Valid R^2': 0.884359696206594}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these hyperparameters, and using all of the samples (&lt;code&gt;max_samples=None&lt;/code&gt;), our model's performance improves: the validation RMSLE drops from about 0.294 to 0.244.&lt;/p&gt;

&lt;p&gt;We could make a faster model by altering some of the hyperparameters, particularly by lowering &lt;code&gt;n_estimators&lt;/code&gt;, since each additional estimator amounts to training another small model (a tree).&lt;/p&gt;

&lt;p&gt;However, lowering &lt;code&gt;n_estimators&lt;/code&gt; or altering other hyperparameters may lead to poorer results; it's a speed/accuracy trade-off, as sketched below.&lt;/p&gt;
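
&lt;p&gt;For illustration (this variant isn't part of the original notebook), a lighter model could reuse the tuned hyperparameters with fewer trees:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;%%time
# Hypothetical faster variant: a third of the trees, same other hyperparameters.
# Expect a shorter fit time, but likely slightly worse validation scores.
fast_model = RandomForestRegressor(n_estimators=30,
                                   min_samples_leaf=1,
                                   min_samples_split=14,
                                   max_features=0.5,
                                   n_jobs=-1)
fast_model.fit(X_train, y_train)
show_scores(fast_model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;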

&lt;h3&gt;
  
  
  Make predictions on test data
&lt;/h3&gt;

&lt;p&gt;We've got a trained model, so let's make predictions on the test data.&lt;/p&gt;

&lt;p&gt;So far, we've trained our model on data up to the end of 2011. The test data, however, covers sales from May 2012 to November 2012.&lt;/p&gt;

&lt;p&gt;We'll use the patterns our model has learned from the training data to predict the sale price of bulldozers with characteristics it has never seen before, but which we assume are similar to those found in the training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/Test.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, the test data isn't in the same format as our other data, so we have to fix it. Let's create a function to preprocess our data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Preprocessing the data
&lt;/h4&gt;

&lt;p&gt;Our model has been trained on data formatted in a certain way. So, to make predictions on the test data, we need to apply the same preprocessing steps we used on the training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Whatever we do to the training data, we have to do the same to the test data.&lt;/p&gt;

&lt;p&gt;Let's create a function for doing so (by copying the preprocessing steps we used above).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Add datetime parameters for saledate
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleYear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleMonth"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDay"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofweek"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofyear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofyear&lt;/span&gt;

    &lt;span class="c1"&gt;# Drop original saledate
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fill numeric rows with the median
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Turn categorical variables into numbers
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# We add the +1 because pandas encodes missing categories as -1
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;        

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This function could break if the test data has different missing values compared to the training data.&lt;/p&gt;

&lt;p&gt;Now that we've got a function for preprocessing data, let's preprocess the test dataset into the same format as our training dataset.&lt;/p&gt;
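
&lt;p&gt;A minimal sketch of that step (the exact cell isn't shown in this write-up, but it's just our new function applied to &lt;code&gt;df_test&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Preprocess the test data the same way as the training data
df_test = preprocess_data(df_test)
df_test.shape  # expect 101 columns after preprocessing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;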

&lt;p&gt;We can see that our test dataset (after preprocessing) has 101 columns, but our training dataset &lt;code&gt;X_train&lt;/code&gt; has 102 columns (after preprocessing). Let's find the difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We can find how the columns differ using sets
&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'auctioneerID_is_missing'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There weren't any missing &lt;code&gt;auctioneerID&lt;/code&gt; fields in the test dataset, so preprocessing never created an &lt;code&gt;auctioneerID_is_missing&lt;/code&gt; column for it.&lt;/p&gt;

&lt;p&gt;To fix this, we'll add a column to the test dataset called &lt;code&gt;auctioneerID_is_missing&lt;/code&gt; and fill it with &lt;code&gt;False&lt;/code&gt;, since none of the &lt;code&gt;auctioneerID&lt;/code&gt; fields are missing in the test dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Match test dataset columns to training dataset
&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"auctioneerID_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
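
&lt;p&gt;One extra safeguard worth adding (my own addition, not a step from the course): scikit-learn expects the test features in the same &lt;em&gt;order&lt;/em&gt; they were seen during fit, not just the same set of columns, so we can reorder them explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reorder the test columns to match the training columns exactly
df_test = df_test[X_train.columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;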



&lt;p&gt;Now the test dataset matches the training dataset and we should be able to make predictions on it using our trained model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on the test dataset using the best model
&lt;/span&gt;&lt;span class="n"&gt;test_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When looking at the &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation"&gt;Kaggle submission requirements&lt;/a&gt;, we see that to make a submission, the data must be in a certain format: a DataFrame containing the &lt;code&gt;SalesID&lt;/code&gt; and the predicted &lt;code&gt;SalePrice&lt;/code&gt; of the bulldozer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create DataFrame compatible with Kaggle submission requirements
&lt;/span&gt;&lt;span class="n"&gt;df_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalesID"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalesID"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_preds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Export to csv...
&lt;/span&gt;&lt;span class="n"&gt;df_preds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/predictions.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feature Importance
&lt;/h3&gt;

&lt;p&gt;By now, we've built a model which is able to make predictions. The people we share these predictions with might be curious to know what parts of the data led to these predictions.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;feature importance&lt;/strong&gt; comes in. Feature importance seeks to find out which attributes of the data were the most important in predicting the &lt;strong&gt;target variable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In our case: which bulldozer sale attributes carried the most weight in the model when predicting the sale price?&lt;/p&gt;

&lt;p&gt;Beware: the default feature importances for random forests are impurity-based and can be misleading (they tend to inflate the importance of high-cardinality features). A more robust alternative, permutation importance, is sketched after the plot below.&lt;/p&gt;

&lt;p&gt;To figure out which features were most important in a machine learning model, a good idea is to search something like "[MODEL NAME] feature importance".&lt;/p&gt;

&lt;p&gt;Doing this for our &lt;code&gt;RandomForestRegressor&lt;/code&gt; leads us to find the &lt;code&gt;feature_importances_&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Let's check it out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find feature importance of our best model
&lt;/span&gt;&lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_importances_&lt;/span&gt;

&lt;span class="c1"&gt;# Install Seaborn package in current environment (if you don't have it)
# import sys
# !conda install --yes --prefix {sys.prefix} seaborn
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="c1"&gt;# Helper function for plotting feature importance
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;importances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s"&gt;"feature_importance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;importances&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"feature_importance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"feature_importance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"h"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_importances_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LjMKMioq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00ax2xlq232eja85zbp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LjMKMioq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00ax2xlq232eja85zbp9.png" alt="Features" width="516" height="263"&gt;&lt;/a&gt;&lt;/p&gt;
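
&lt;p&gt;As a cross-check on the impurity-based importances above, here's a minimal sketch of permutation importance, assuming the same &lt;code&gt;X_valid&lt;/code&gt;/&lt;code&gt;y_valid&lt;/code&gt; split used by &lt;code&gt;show_scores()&lt;/code&gt; earlier (this step is my addition, not part of the original notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.inspection import permutation_importance

# Shuffle each feature on the validation set and measure how much the score drops
perm = permutation_importance(ideal_model, X_valid, y_valid,
                              n_repeats=5, random_state=42, n_jobs=-1)

# Reuse our helper to plot the permutation-based importances
plot_features(X_valid.columns, perm.importances_mean)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;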

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Machine Learning: Predicting Heart Disease From Patients' Medical Data</title>
      <dc:creator>Nicolas Vallée</dc:creator>
      <pubDate>Wed, 30 Mar 2022 08:04:41 +0000</pubDate>
      <link>https://dev.to/nicolasvallee/machine-learning-predicting-heart-disease-from-patients-medical-data-535n</link>
      <guid>https://dev.to/nicolasvallee/machine-learning-predicting-heart-disease-from-patients-medical-data-535n</guid>
      <description>&lt;p&gt;This tutorial introduces some fundamental Machine Learning and Data Science concepts by exploring the problem of heart disease &lt;strong&gt;classification&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is intended to be an end-to-end example of what a Data Science and Machine Learning &lt;strong&gt;proof of concept&lt;/strong&gt; looks like.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I completed this milestone project in March 2022, as part of the &lt;a href="https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/"&gt;Complete Machine Learning &amp;amp; Data Science Bootcamp&lt;/a&gt; taught by &lt;a href="https://www.udemy.com/user/daniel-bourke-52/"&gt;Daniel Bourke&lt;/a&gt; and &lt;a href="https://www.udemy.com/user/andrei-neagoie/"&gt;Andrei Neagoie&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also see &lt;a href="https://github.com/nicolas-vallee/ztm-ml-projects/blob/main/heart-disease-classification.ipynb"&gt;the final version of this notebook in my GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is classification?
&lt;/h3&gt;

&lt;p&gt;Classification involves deciding whether a sample belongs to one class or another. With two class options, this is called &lt;strong&gt;binary classification&lt;/strong&gt;. If there are multiple class options, we refer to the problem as &lt;strong&gt;multi-class classification&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we'll end up with
&lt;/h3&gt;

&lt;p&gt;Since we already have a dataset, we'll follow this 6-step Machine Learning modelling framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" alt="6 Step Machine Learning Modelling Framework" width="880" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More specifically, we'll look at the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory data analysis (EDA)&lt;/strong&gt; - the process of going through a dataset to find out more about it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model training&lt;/strong&gt; - create model(s) to predict a target variable based on other variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model evaluation&lt;/strong&gt; - evaluating a model's predictions using problem-specific evaluation metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model comparison&lt;/strong&gt; - comparing several different models to find the best one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model fine-tuning&lt;/strong&gt; - once we've found a good model, how can we improve it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature importance&lt;/strong&gt; - since we're predicting the presence of heart disease, are there some things which are more important for prediction?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-validation&lt;/strong&gt; - if we build a good model, can we be sure it will work on unseen data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting what we've found&lt;/strong&gt; - if we had to present our work, what would we show someone?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To work through these topics, we'll use pandas, Matplotlib, and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" alt="Tools which can be used for each step of the machine learning modelling process" width="880" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll work through each step, and by the end of the notebook we'll have a handful of models that can predict whether or not a person has heart disease from a number of parameters with considerable accuracy.&lt;/p&gt;

&lt;p&gt;We'll also be able to describe which parameters are more indicative than others; for example, sex may be more important than age.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;The problem we will explore is &lt;strong&gt;binary classification&lt;/strong&gt;, which means a sample can only belong to one of two classes.&lt;/p&gt;

&lt;p&gt;We're going to use a number of different &lt;strong&gt;features&lt;/strong&gt; about a person to predict whether or not they have heart disease.&lt;/p&gt;

&lt;p&gt;In a statement,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given clinical parameters about a patient, can we predict whether or not they have heart disease?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Data
&lt;/h2&gt;

&lt;p&gt;Here, we want to dive into the data that our problem definition is based on. This may involve sourcing the data, defining its different parameters, talking to experts about it, and finding out what we should expect.&lt;/p&gt;

&lt;p&gt;The original data comes from the &lt;a href="https://archive.ics.uci.edu/ml/datasets/heart+disease"&gt;Cleveland database&lt;/a&gt; from UCI Machine Learning Repository.&lt;/p&gt;

&lt;p&gt;However, we've downloaded it in a formatted way from &lt;a href="https://www.kaggle.com/cherngs/heart-disease-cleveland-uci"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The original database contains 76 attributes, but here only 14 attributes are used. &lt;strong&gt;Attributes&lt;/strong&gt; (also called &lt;strong&gt;features&lt;/strong&gt;) are the variables that we'll use to predict our &lt;strong&gt;target variable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Attributes and features are also referred to as &lt;strong&gt;independent variables&lt;/strong&gt;, and a target variable can be referred to as a &lt;strong&gt;dependent variable&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We use the independent variables to predict our dependent variable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, the independent variables are a patient's medical attributes and the dependent variable is whether or not they have heart disease.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Evaluation
&lt;/h2&gt;

&lt;p&gt;The evaluation metric is something we define at the start of a project.&lt;/p&gt;

&lt;p&gt;Since machine learning is very experimental, we might say something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept phase, we'll pursue this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is helpful because it provides a rough goal for a machine learning engineer or data scientist to work towards.&lt;/p&gt;

&lt;p&gt;However, due to the nature of experimentation, the evaluation metric may change over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Features
&lt;/h2&gt;

&lt;p&gt;Features are different parts of the data. During this step, we want to find out what we can about the data.&lt;/p&gt;

&lt;p&gt;One of the most common ways to do this is to create a &lt;strong&gt;data dictionary&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heart disease data dictionary
&lt;/h3&gt;

&lt;p&gt;A data dictionary describes the data we're dealing with. Not all datasets come with them so this is where we may have to do our research or ask a &lt;strong&gt;subject matter expert&lt;/strong&gt; (someone who knows about the data) for more information.&lt;/p&gt;

&lt;p&gt;The following are the features we'll use to predict our target variable (heart disease or no heart disease).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;age - age in years&lt;/li&gt;
&lt;li&gt;sex - (1 = male; 0 = female)&lt;/li&gt;
&lt;li&gt;cp - chest pain type

&lt;ul&gt;
&lt;li&gt;0: Typical angina: chest pain related to decreased blood supply to the heart&lt;/li&gt;
&lt;li&gt;1: Atypical angina: chest pain not related to heart&lt;/li&gt;
&lt;li&gt;2: Non-anginal pain: typically esophageal spasms (non heart related)&lt;/li&gt;
&lt;li&gt;3: Asymptomatic: chest pain not showing signs of disease&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;trestbps - resting blood pressure (in mmHg on admission to the hospital)

&lt;ul&gt;
&lt;li&gt;anything above 130-140 is typically cause for concern&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;chol - serum cholesterol in mg/dl

&lt;ul&gt;
&lt;li&gt;serum = LDL + HDL + .2 * triglycerides&lt;/li&gt;
&lt;li&gt;above 200 is cause for concern&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;fbs - (fasting blood sugar &amp;gt; 120 mg/dl) (1 = true; 0 = false)

&lt;ul&gt;
&lt;li&gt;above 126 mg/dL signals diabetes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;restecg - resting electrocardiographic results

&lt;ul&gt;
&lt;li&gt;0: Nothing to note&lt;/li&gt;
&lt;li&gt;1: ST-T Wave abnormality

&lt;ul&gt;
&lt;li&gt;can range from mild symptoms to severe problems&lt;/li&gt;
&lt;li&gt;signals non-normal heart beat&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;2: Possible or definite left ventricular hypertrophy

&lt;ul&gt;
&lt;li&gt;Enlarged heart's main pumping chamber&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;thalach - maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang - exercise induced angina (1 = yes; 0 = no)&lt;/li&gt;
&lt;li&gt;oldpeak - ST depression induced by exercise relative to rest

&lt;ul&gt;
&lt;li&gt;looks at stress of heart during exercise&lt;/li&gt;
&lt;li&gt;unhealthy heart will stress more&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;slope - the slope of the peak exercise ST segment

&lt;ul&gt;
&lt;li&gt;0: Upsloping: better heart rate with exercise (uncommon)&lt;/li&gt;
&lt;li&gt;1: Flatsloping: minimal change (typical healthy heart)&lt;/li&gt;
&lt;li&gt;2: Downsloping: signs of unhealthy heart&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ca - number of major vessels (0-3) colored by fluoroscopy

&lt;ul&gt;
&lt;li&gt;colored vessel means the doctor can see the blood passing through&lt;/li&gt;
&lt;li&gt;the more blood movement the better (no clots)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;thal - thallium stress test result

&lt;ul&gt;
&lt;li&gt;1,3: normal&lt;/li&gt;
&lt;li&gt;6: fixed defect: used to be defect but ok now&lt;/li&gt;
&lt;li&gt;7: reversible defect: no proper blood movement when exercising&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;target - have disease or not (1 = yes; 0 = no) (= the predicted attribute)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: No personally identifiable information (PII) can be found in the dataset.&lt;/p&gt;

&lt;p&gt;It's a good idea to save these descriptions to a Python dictionary or to an external file, so we can look them up later without coming back here. A sketch of this follows below.&lt;/p&gt;
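
&lt;p&gt;For instance, a minimal sketch (the dictionary name and abbreviated entries are my own, not from the course):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical data dictionary with a few of the features described above
data_dict = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (0-3)",
    "thalach": "maximum heart rate achieved",
    "target": "have disease or not (1 = yes; 0 = no)",
}

data_dict["thalach"]  # 'maximum heart rate achieved'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;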

&lt;h3&gt;
  
  
  Preparing the tools
&lt;/h3&gt;

&lt;p&gt;At the start of any project, it's common to see the required libraries imported in a big chunk, like we can see below.&lt;/p&gt;

&lt;p&gt;However, in practice, our projects may import libraries as we go. After we've spent a couple of hours working on our problem, we'll probably want to do some tidying up. This is where we may want to consolidate every library we've used at the top of our notebook (like in the cell below).&lt;/p&gt;

&lt;p&gt;The libraries we use will differ from project to project, but there are a few we'll likely take advantage of during almost every structured data project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; for data analysis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://numpy.org/"&gt;NumPy&lt;/a&gt; for numerical operations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://matplotlib.org/"&gt;Matplotlib&lt;/a&gt;/&lt;a href="https://seaborn.pydata.org/"&gt;seaborn&lt;/a&gt; for plotting or data visualization.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/"&gt;Scikit-Learn&lt;/a&gt; for machine learning modelling and evaluation.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Regular EDA and plotting libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="c1"&gt;# We want our plots to appear in the notebook
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt; 

&lt;span class="c1"&gt;## Models
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="c1"&gt;## Model evaluators
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading data
&lt;/h3&gt;

&lt;p&gt;There are many ways to store data. The typical way of storing tabular data (data similar to what you'd see in an Excel file) is the .csv format, which stands for comma-separated values.&lt;/p&gt;

&lt;p&gt;Pandas has a built-in function to read .csv files called &lt;code&gt;read_csv()&lt;/code&gt; which takes the file pathname of our .csv file. We'll likely use this one often.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'heart-disease.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 'DataFrame' shortened to 'df'
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="c1"&gt;# (rows, columns)
&lt;/span&gt;
&lt;span class="c1"&gt;# (303, 14)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data exploration (exploratory data analysis or EDA)
&lt;/h3&gt;

&lt;p&gt;Once we've imported a dataset, the next step is to explore it. There's no set way of doing this, but we should try to become more familiar with the dataset.&lt;/p&gt;

&lt;p&gt;This means comparing different columns to each other or to the target variable, and referring back to our data dictionary to remind ourselves of what the different columns mean.&lt;/p&gt;

&lt;p&gt;Our goal is to become a subject matter expert on the dataset we're working with. So, if someone asks us a question about it, we can give them an explanation, and when we start building models, we can sanity check them to make sure they're not performing too well (overfitting) or understand why they might be performing poorly (underfitting).&lt;/p&gt;

&lt;p&gt;Since EDA has no real set methodology, the following is a short check list we might want to walk through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What question(s) are we trying to solve (or prove wrong)?&lt;/li&gt;
&lt;li&gt;What kind of data do we have and how do we treat different types?&lt;/li&gt;
&lt;li&gt;What’s missing from the data and how do we deal with it?&lt;/li&gt;
&lt;li&gt;Where are the outliers and why should we care about them?&lt;/li&gt;
&lt;li&gt;How can we add, change, or remove features to get more out of our data?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One of the quickest and easiest ways to check our data is with the &lt;code&gt;head()&lt;/code&gt; function. Calling it on any dataframe returns the top 5 rows, and &lt;code&gt;tail()&lt;/code&gt; returns the bottom 5. We can also pass them a number, like &lt;code&gt;head(10)&lt;/code&gt;, to show the top 10 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's check the top 5 rows of our dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;age sex cp  trestbps    chol    fbs restecg thalach exang   oldpeak slope   ca  thal    target
0   63  1   3   145 233 1   0   150 0   2.3 0   0   1   1
1   37  1   2   130 250 0   1   187 0   3.5 0   0   2   1
2   41  0   1   130 204 0   0   172 0   1.4 2   0   2   1
3   56  1   1   120 236 0   1   178 0   0.8 2   0   2   1
4   57  0   0   120 354 0   1   163 1   0.6 2   0   2   1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# And the bottom 10
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;age sex cp  trestbps    chol    fbs restecg thalach exang   oldpeak slope   ca  thal    target
293 67  1   2   152 212 0   0   150 0   0.8 1   0   3   0
294 44  1   0   120 169 0   1   144 1   2.8 0   0   1   0
295 63  1   0   140 187 0   0   144 1   4.0 2   2   3   0
296 63  0   0   124 197 0   1   136 1   0.0 1   0   2   0
297 59  1   0   164 176 1   0   90  0   1.0 1   2   1   0
298 57  0   0   140 241 0   1   123 1   0.2 1   0   3   0
299 45  1   3   110 264 0   1   132 0   1.2 1   0   3   0
300 68  1   0   144 193 1   1   141 0   3.4 1   2   3   0
301 57  1   0   130 131 0   1   115 1   1.2 1   1   3   0
302 57  0   1   130 236 0   0   174 0   0.0 1   1   2   0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;value_counts()&lt;/code&gt; allows us to show how many times each of the values of a &lt;strong&gt;categorical&lt;/strong&gt; column appear.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's see how many positive (1) and negative (0) samples we have in our dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1    165
0    138
Name: target, dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since these two values are close to each other, our &lt;code&gt;target&lt;/code&gt; column can be considered &lt;strong&gt;balanced&lt;/strong&gt;. An &lt;strong&gt;unbalanced&lt;/strong&gt; target column, one where some classes have far more samples than others, can be harder to model than a balanced one. Ideally, all of our target classes would have the same number of samples.&lt;/p&gt;

&lt;p&gt;If we'd prefer these values as proportions, &lt;code&gt;value_counts()&lt;/code&gt; takes a &lt;code&gt;normalize&lt;/code&gt; parameter, which can be set to &lt;code&gt;True&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Normalized value counts
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1    0.544554
0    0.455446
Name: target, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can plot the target column value counts by calling the &lt;code&gt;plot()&lt;/code&gt; function and telling it what kind of plot we'd like, in this case, &lt;code&gt;bar&lt;/code&gt; is good.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot the value counts with a bar graph
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--64GvNZks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq3jed8s9e2su9rbjrvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--64GvNZks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq3jed8s9e2su9rbjrvu.png" alt="Bar chart" width="375" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.info()&lt;/code&gt; shows the number of missing values we have and what type of data we're working with.&lt;/p&gt;

&lt;p&gt;In our case, there are no missing values and all of our columns are numerical.&lt;/p&gt;

&lt;p&gt;Another way to get some quick insights on our dataframe is to use &lt;code&gt;df.describe()&lt;/code&gt;. &lt;code&gt;describe()&lt;/code&gt; shows a range of different metrics about our numerical columns such as mean, max, and standard deviation.&lt;/p&gt;
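
&lt;p&gt;A quick sketch of both calls (output omitted here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Check data types and missing values, then summary statistics
df.info()
df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;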

&lt;h3&gt;
  
  
  Heart disease frequency according to gender
&lt;/h3&gt;

&lt;p&gt;If we want to compare two columns, we can use the function &lt;code&gt;pd.crosstab(column_1, column_2)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is helpful when we want to gain an intuition about how our independent variables interact with our dependent variable.&lt;/p&gt;

&lt;p&gt;Let's compare our &lt;code&gt;target&lt;/code&gt; column with the &lt;code&gt;sex&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;In our data dictionary, for the target column, 1 = heart disease present, 0 = no heart disease. And for sex, 1 = male, 0 = female.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 207 males and 96 females in our study.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# compare target column with sex column
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sex 0   1
target      
0   24  114
1   72  93
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What can we infer from this? Let's make a simple heuristic.&lt;/p&gt;

&lt;p&gt;Since there are 96 women in the study and 72 of them have heart disease, we might infer, based on this one variable alone, that a female participant has about a 75% chance (72/96) of having heart disease.&lt;/p&gt;

&lt;p&gt;As for the males, there are 207 in total, and around half of them (93/207, or about 45%) indicate the presence of heart disease. So we might predict that a male participant has roughly a 50% chance of having heart disease.&lt;/p&gt;

&lt;p&gt;Averaging these two rough values, we might assume that, knowing nothing else about a person, there's about a 62.5% chance they have heart disease. (A weighted average by group size would give the overall rate of 165/303, or about 54%.)&lt;/p&gt;

&lt;p&gt;This can be our very simple baseline, and we'll try to beat it with machine learning.&lt;/p&gt;
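
&lt;p&gt;A quick sanity check of those rates, computed straight from the crosstab values above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-sex heart disease rates from the crosstab above
women_rate = 72 / (24 + 72)     # 0.75
men_rate = 93 / (114 + 93)      # ~0.45
overall_rate = 165 / 303        # ~0.54

women_rate, men_rate, overall_rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;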

&lt;h3&gt;
  
  
  Making our crosstab visual
&lt;/h3&gt;

&lt;p&gt;We can plot the crosstab by using the &lt;code&gt;plot()&lt;/code&gt; function and passing it a few parameters, such as &lt;code&gt;kind&lt;/code&gt; (the type of plot we want), &lt;code&gt;figsize=(width, height)&lt;/code&gt; (how big we want it to be), and &lt;code&gt;color=[color_1, color_2]&lt;/code&gt; (the different colors we'd like to use).&lt;/p&gt;

&lt;p&gt;Different metrics are best represented with different kinds of plots; in our case, a bar graph is great. We'll see more examples later, and with a bit of practice, we'll gain an intuition for which plot to use with different variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a plot
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add some attributes to it
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Heart disease frequency for sex"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0 = No disease, 1 = Disease"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Amount"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Female"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Male"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;# keeps the labels on the x-axis vertical
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y7CuCDGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ixhpzgs2oljdo8fjjjnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y7CuCDGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ixhpzgs2oljdo8fjjjnk.png" alt="Heart disease frequency for sex" width="612" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Age vs. max heart rate for heart disease
&lt;/h3&gt;

&lt;p&gt;Let's try combining a couple of independent variables, such as &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;thalach&lt;/code&gt; (maximum heart rate) and then compare them to our target variable.&lt;/p&gt;

&lt;p&gt;Because there are so many different values for &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;thalach&lt;/code&gt;, we'll use a scatter plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create another figure
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Start with positve examples
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thalach&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now for negative examples, we want them on the same plot, so we call plt again
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thalach&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add some helpful info
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Heart disease in function of Age and Max Heart Rate"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Max Heart Rate"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Disease"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"No Disease"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--65g-csg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/huc40sbn0ex8sbwuwxwp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--65g-csg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/huc40sbn0ex8sbwuwxwp.png" alt="Scatter plot" width="612" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can we infer from this?&lt;/p&gt;

&lt;p&gt;It seems the younger someone is, the higher their max heart rate (dots are higher on the left of the graph), and the older someone is, the more light blue ("No Disease") dots there are. But this may simply be because there are more dots altogether on the right side of the graph (older participants).&lt;/p&gt;

&lt;p&gt;Both of these are just observations, of course, but that's exactly the point of this stage: building an understanding of the data.&lt;/p&gt;

&lt;p&gt;Now, let's check the &lt;strong&gt;age distribution&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Histograms are a great way to check the distribution of a variable
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--565ipuMI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bb055bry8pj9sjzktoql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--565ipuMI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bb055bry8pj9sjzktoql.png" alt="Hist plot" width="382" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see a roughly &lt;strong&gt;normal distribution&lt;/strong&gt;, shifted slightly toward older ages, which is reflected in the scatter plot above.&lt;/p&gt;
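
&lt;p&gt;If we wanted to quantify that, pandas can compute the skewness of the column directly (a quick sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Skewness of the age column: 0 = symmetric, &gt; 0 = right tail, &lt; 0 = left tail
df.age.skew()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;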

&lt;p&gt;Let's keep going.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heart disease frequency per chest pain type
&lt;/h3&gt;

&lt;p&gt;Let's try another independent variable. This time, &lt;code&gt;cp&lt;/code&gt; (chest pain).&lt;/p&gt;

&lt;p&gt;We'll use the same process as we did before with &lt;code&gt;sex&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arget   0   1
cp      
0   104 39
1   9   41
2   18  69
3   7   16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a new crosstab and base plot
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                   &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add attributes to the plot to make it more readable
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Heart Disease Frequency per chest pain type"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Chest Pain Type"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Amount"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"No Disease"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Disease"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TMLGMQfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0rrlo1etrlheesumurec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TMLGMQfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0rrlo1etrlheesumurec.png" alt="Frequency chest pain" width="612" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can we infer from this?&lt;/p&gt;

&lt;p&gt;Let's check in our data dictionary what the different levels of chest pain are.&lt;/p&gt;

&lt;p&gt;cp - chest pain type&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0: Typical angina: chest pain related to decreased blood supply to the heart&lt;/li&gt;
&lt;li&gt;1: Atypical angina: chest pain not related to the heart&lt;/li&gt;
&lt;li&gt;2: Non-anginal pain: typically esophageal spasms (non heart related)&lt;/li&gt;
&lt;li&gt;3: Asymptomatic: chest pain not showing signs of disease&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's interesting that atypical angina (value of 1) is described as not being related to the heart, yet participants with it seem to have a higher ratio of heart disease than not.&lt;/p&gt;

&lt;p&gt;But what does "atypical angina" even mean?&lt;/p&gt;

&lt;p&gt;At this point, it's important to remember that if our data dictionary doesn't supply enough information, we may want to do further research on our values. This research may come in the form of asking a &lt;strong&gt;subject matter expert&lt;/strong&gt; (such as a cardiologist or the person who gave us the data) or Googling to find out more.&lt;/p&gt;

&lt;p&gt;According to PubMed, it seems &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2763472/"&gt;even some medical professionals are confused by the term&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today, 23 years later, “atypical chest pain” is still popular in medical circles. Its meaning, however, remains unclear. A few articles have the term in their title, but do not define or discuss it in their text. In other articles, the term refers to noncardiac causes of chest pain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although not conclusive, the graph above hints at how this confusion of definitions can end up reflected in the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlation between independent variables
&lt;/h3&gt;

&lt;p&gt;Finally, we'll compare all of the independent variables. This may give us an idea of which independent variables may or may not have an impact on our target variable.&lt;/p&gt;

&lt;p&gt;We can do this using &lt;code&gt;df.corr()&lt;/code&gt; which will create a &lt;strong&gt;correlation matrix&lt;/strong&gt; for us, in other words, a big table of numbers telling us how related each variable is to the others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the correlation between our independent variables
&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Let's make our correlation matrix look a little prettier
&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;".2f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"YlGnBu"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sIFBpFsX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbcxnk73twsn31opqsy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sIFBpFsX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbcxnk73twsn31opqsy4.png" alt="Correlation" width="801" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A larger positive value suggests a positive correlation (as one variable increases, so does the other), and a larger negative value suggests a negative correlation (as one variable increases, the other decreases).&lt;/p&gt;
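
&lt;p&gt;To zero in on the target specifically, we can sort its column of the matrix (a minimal sketch reusing the &lt;code&gt;corr_matrix&lt;/code&gt; from above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Correlation of each independent variable with the target, sorted
corr_matrix["target"].sort_values(ascending=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;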

&lt;h3&gt;
  
  
  Enough EDA, let's model!
&lt;/h3&gt;

&lt;p&gt;We've done exploratory data analysis (EDA) to start building an intuition about the dataset.&lt;/p&gt;

&lt;p&gt;What have we learned so far? Aside from our baseline estimate using sex, the rest of the variables seem fairly evenly distributed.&lt;/p&gt;

&lt;p&gt;So what we'll do next is &lt;em&gt;model driven EDA&lt;/em&gt;, meaning, we'll use machine learning models to drive our next questions.&lt;/p&gt;

&lt;p&gt;A few extra things to remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not every EDA will look the same; what we've seen here is an example of what we could do for a structured, tabular dataset.&lt;/li&gt;
&lt;li&gt;We don't necessarily have to make the same plots as we've done here; there are many more ways to visualize data.&lt;/li&gt;
&lt;li&gt;We want to quickly find:

&lt;ul&gt;
&lt;li&gt;Distributions (&lt;code&gt;df.column.hist()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Missing values (&lt;code&gt;df.info()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Outliers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's build some models.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Modeling
&lt;/h2&gt;

&lt;p&gt;We've explored the data, now we'll try to use Machine Learning to predict our target variable based on the 13 independent variables.&lt;/p&gt;

&lt;p&gt;What is the problem we're solving?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given clinical parameters about a patient, can we predict whether or not they have heart disease?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's what we'll be trying to answer.&lt;/p&gt;

&lt;p&gt;And remember our evaluation metric?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's what we'll be aiming for.&lt;/p&gt;

&lt;p&gt;But before we build a model, we have to get our dataset ready.&lt;/p&gt;

&lt;p&gt;Let's look at it again with &lt;code&gt;df.head()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We're trying to predict our target variable using all of the other variables. To do this, we'll split the target variable from the rest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Everything except target variable
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Target variable
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training and test split
&lt;/h3&gt;

&lt;p&gt;Now comes one of the most important concepts in Machine Learning, the &lt;strong&gt;training / test split&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is where we split our data into a training set and a test set.&lt;/p&gt;

&lt;p&gt;We use our training set to train our model and our test set to test it.&lt;/p&gt;

&lt;p&gt;The test set must remain separate from our training set.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why not use all the data to train a model?
&lt;/h4&gt;

&lt;p&gt;Let's say we wanted to take our model into the hospital and start using it on patients. How would we know how well our model performs on a new patient not included in the original full dataset we had?&lt;/p&gt;

&lt;p&gt;This is where the test set comes in. It's used to mimic taking our model to a real environment as much as possible.&lt;/p&gt;

&lt;p&gt;And that's why it's important to never let our model learn from the test set; it should only be evaluated on it.&lt;/p&gt;

&lt;p&gt;To split our data into a training and test set, we can use Scikit-Learn's &lt;code&gt;train_test_split()&lt;/code&gt; and feed it our independent and dependent variables (&lt;code&gt;X&lt;/code&gt; &amp;amp; &lt;code&gt;y&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Random seed for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split into train &amp;amp; test sets
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# percentage of data to use for test set
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;test_size&lt;/code&gt; parameter is used to tell the &lt;code&gt;train_test_split()&lt;/code&gt; function how much of our data we want in the test set.&lt;/p&gt;

&lt;p&gt;A rule of thumb is to use 80% of our data to train on and the other 20% to test on.&lt;/p&gt;

&lt;p&gt;For our problem, a train and test set are enough. But for other problems, we could also use a validation (train/validation/test) set or cross-validation (we'll see this later).&lt;/p&gt;

&lt;p&gt;But again, each problem will differ. The post, &lt;a href="https://www.fast.ai/2017/11/13/validation-sets/"&gt;How (and why) to create a good validation set&lt;/a&gt; by Rachel Thomas is a good place to learn more.&lt;/p&gt;

&lt;p&gt;Let's look at our training data: we can see we're using 242 samples to train on.&lt;/p&gt;

&lt;p&gt;Let's look at our test data: we've got 61 examples we'll test our model(s) on.&lt;/p&gt;
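
&lt;p&gt;A quick way to confirm those counts is to check the shapes (a one-line sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# 303 rows split 80/20 -&gt; 242 training samples and 61 test samples
X_train.shape, X_test.shape, y_train.shape, y_test.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;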

&lt;h3&gt;
  
  
  Model choices
&lt;/h3&gt;

&lt;p&gt;Now that we've got our data prepared, we can start to fit models. We'll be using the following and comparing their results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logistic Regression - &lt;code&gt;LogisticRegression()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors Classifier - &lt;code&gt;KNeighborsClassifier()&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Random Forest Classifier - &lt;code&gt;RandomForestClassifier()&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Why these?
&lt;/h4&gt;

&lt;p&gt;If we look at the &lt;a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html"&gt;Scikit-Learn algorithm cheat sheet&lt;/a&gt;, we can see that we're working on a classification problem and these are the algorithms that it suggests (plus a few more).&lt;/p&gt;

&lt;p&gt;"Wait, I don't see Logistic Regression, and why not use LinearSVC?"&lt;/p&gt;

&lt;p&gt;Good questions.&lt;/p&gt;

&lt;p&gt;It is confusing that Logistic Regression isn't listed as well, because it is in fact &lt;a href="https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression"&gt;a model for classification&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's pretend that we've tried LinearSVC, and that it doesn't work, so now we're following other options in the map.&lt;/p&gt;

&lt;p&gt;For now, knowing each of these algorithms inside and out is not essential.&lt;/p&gt;

&lt;p&gt;Machine Learning and Data Science is an iterative practice. These algorithms are tools in our toolbox.&lt;/p&gt;

&lt;p&gt;In the beginning, on our way to becoming a practitioner, it's more important to understand our problem (such as, classification versus regression) and then knowing what tools we can use to solve it.&lt;/p&gt;

&lt;p&gt;Since our dataset is relatively small, we can experiment to find which algorithm performs best.&lt;/p&gt;

&lt;p&gt;All of the algorithms in the Scikit-Learn library share the same interface: &lt;code&gt;model.fit(X_train, y_train)&lt;/code&gt; for training a model, and &lt;code&gt;model.score(X_test, y_test)&lt;/code&gt; for scoring it. &lt;code&gt;score()&lt;/code&gt; returns the ratio of correct predictions (1.0 = 100% correct).&lt;/p&gt;

&lt;p&gt;Since the algorithms we've chosen implement the same methods for fitting them to the data as well as evaluating them, let's put them in a dictionary and create a function which fits and scores them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Put models in a dictionary
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Logistic Regression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="s"&gt;"KNN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="s"&gt;"Random Forest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function to fit and score models
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit_and_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Fits and evaluates given machine learning models.
    models: a dict of different Scikit-Learn machine learning models
    X_train: training data
    X_test: testing data
    y_train: labels associated with training data
    y_test: labels associated with test data
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# Set random seed for reproducible results
&lt;/span&gt;    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Make a list to keep model scores
&lt;/span&gt;    &lt;span class="n"&gt;model_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="c1"&gt;#Loop through models
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Fit the model to the data
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Evaluate the model and append its score to model_scores
&lt;/span&gt;        &lt;span class="n"&gt;model_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_scores&lt;/span&gt; 

&lt;span class="n"&gt;model_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fit_and_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model_scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Logistic Regression': 0.8852459016393442,
 'KNN': 0.6885245901639344,
 'Random Forest': 0.8360655737704918}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that our models are fitted and scored, let's compare them visually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model comparison
&lt;/h3&gt;

&lt;p&gt;Since we've saved our models' scores to a dictionary, we can plot them by first converting them to a DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_compare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;model_compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EP8o58fY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mltq9eehs2xjstbwlk1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EP8o58fY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mltq9eehs2xjstbwlk1b.png" alt="Comparison" width="372" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can't easily tell from the graph, but looking at the dictionary, the &lt;code&gt;LogisticRegression()&lt;/code&gt; model performs best.&lt;/p&gt;

&lt;p&gt;We've found the best model so far. Now, let's put together a classification report to show the team, including a confusion matrix and the cross-validated precision, recall, and F1 scores. We'd also want to see which features are most important, and to look at a ROC curve.&lt;/p&gt;

&lt;p&gt;Let's briefly go through each before we see them in action.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt; - Each model we use has a series of dials we can turn to dictate how they perform. Changing these values may increase or decrease model performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature importance&lt;/strong&gt; - If there are a large number of features we're using to make predictions, do some have more importance than others? For example, for predicting heart disease, which is more important, sex or age?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confusion matrix&lt;/strong&gt; - Compares the predicted values with the true values in a tabular way. If the model is 100% correct, all values in the matrix fall on the diagonal from top left to bottom right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-validation&lt;/strong&gt; - Splits our dataset into multiple parts to train and test our model on each part, then evaluates performance as an average.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt; - Proportion of true positives over the total number of predicted positives (true positives plus false positives). Higher precision leads to fewer false positives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; - Proportion of true positives over the total number of true positives and false negatives. Higher recall leads to fewer false negatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 score&lt;/strong&gt; - Combines precision and recall into one metric (their harmonic mean). 1 is best, 0 is worst.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classification report&lt;/strong&gt; - Sklearn has a built-in function called &lt;code&gt;classification_report()&lt;/code&gt; which returns some of the main classification metrics such as precision, recall, and f1-score (see the toy sketch just after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROC Curve&lt;/strong&gt; - Receiver Operating Characteristic is a plot of true positive rate versus false positive rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Area Under Curve (AUC)&lt;/strong&gt; - The area underneath the ROC curve. A perfect model achieves a score of 1.0.&lt;/li&gt;
&lt;/ul&gt;
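
&lt;p&gt;As a toy illustration of precision, recall, and F1 before we compute them on our real models (the labels here are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy example with hypothetical labels: one false negative, one false positive
from sklearn.metrics import classification_report

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(classification_report(y_true, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;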

&lt;h3&gt;
  
  
  Hyperparameter tuning and cross-validation
&lt;/h3&gt;

&lt;p&gt;To cook our favourite dish, we know to set the oven to 180 degrees and turn the grill on. But when our roommate cooks their favourite dish, they use 200 degrees and the fan-forced mode. Same oven, different settings, different outcomes.&lt;/p&gt;

&lt;p&gt;The same can be done for machine learning algorithms. We can use the same algorithms but change the settings (hyperparameters) and get different results.&lt;/p&gt;

&lt;p&gt;But just like turning the oven up too high can burn our food, the same can happen for machine learning algorithms. We change the settings and it works so well that it overfits the data.&lt;/p&gt;

&lt;p&gt;We're looking for the Goldilocks model: one which does well on our dataset but also does well on unseen examples.&lt;/p&gt;

&lt;p&gt;To test different hyperparameters, we could use a validation set but since we don't have much data, we'll use cross-validation.&lt;/p&gt;

&lt;p&gt;The most common type of cross-validation is k-fold. It involves splitting our data into k folds, then training and testing a model so that each fold gets a turn as the test set. For example, let's say we have 5 folds (k = 5).&lt;/p&gt;
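
&lt;p&gt;In Scikit-Learn, 5-fold cross-validation looks something like this (a minimal sketch, assuming &lt;code&gt;cross_val_score&lt;/code&gt; is imported from &lt;code&gt;sklearn.model_selection&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Cross-validated accuracy: train/evaluate on 5 different splits, then average
from sklearn.model_selection import cross_val_score

cv_acc = cross_val_score(LogisticRegression(), X, y, cv=5)
cv_acc.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;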

&lt;p&gt;We'll be using this setup to tune the hyperparameters of some of our models and then evaluate them. We'll also get a few more metrics like precision, recall, F1-score, and ROC at the same time.&lt;/p&gt;

&lt;p&gt;Here's the game plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tune model hyperparameters, see which performs best&lt;/li&gt;
&lt;li&gt;Perform cross-validation&lt;/li&gt;
&lt;li&gt;Plot ROC curves&lt;/li&gt;
&lt;li&gt;Make a confusion matrix&lt;/li&gt;
&lt;li&gt;Get precision, recall, and F1-score metrics&lt;/li&gt;
&lt;li&gt;Find the most important model features&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Tune KNeighborsClassifier (K-Nearest Neighbors or KNN) by hand
&lt;/h3&gt;

&lt;p&gt;There's one main hyperparameter we can tune for the K-Nearest Neighbors (KNN) algorithm, and that is the number of neighbors. The default is 5 (&lt;code&gt;n_neighbors=5&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;What are neighbors?&lt;/p&gt;

&lt;p&gt;KNN works by assuming that dots which are close to each other belong to the same class. If &lt;code&gt;n_neighbors=5&lt;/code&gt;, then a dot is assigned the class shared by the majority of the 5 dots closest to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: We're leaving out some details here like what defines close or how distance is calculated.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For now, let's try a few different values of &lt;code&gt;n_neighbors&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a list of train scores
&lt;/span&gt;&lt;span class="n"&gt;train_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Create a list of test scores
&lt;/span&gt;&lt;span class="n"&gt;test_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Create a list of different values for n_neighbors
&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 1 to 20
&lt;/span&gt;
&lt;span class="c1"&gt;# Setup algorithm
&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Loop through different neighbors values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# set neighbors value
&lt;/span&gt;
    &lt;span class="c1"&gt;# Fit the algorithm
&lt;/span&gt;    &lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Update the training scores
&lt;/span&gt;    &lt;span class="n"&gt;train_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Update the test scores
&lt;/span&gt;    &lt;span class="n"&gt;test_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at KNN's train scores and test scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1.0,
 0.8099173553719008,
 0.7727272727272727,
 0.743801652892562,
 0.7603305785123967,
 0.7520661157024794,
 0.743801652892562,
 0.7231404958677686,
 0.71900826446281,
 0.6942148760330579,
 0.7272727272727273,
 0.6983471074380165,
 0.6900826446280992,
 0.6942148760330579,
 0.6859504132231405,
 0.6735537190082644,
 0.6859504132231405,
 0.6652892561983471,
 0.6818181818181818,
 0.6694214876033058]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0.6229508196721312,
 0.639344262295082,
 0.6557377049180327,
 0.6721311475409836,
 0.6885245901639344,
 0.7213114754098361,
 0.7049180327868853,
 0.6885245901639344,
 0.6885245901639344,
 0.7049180327868853,
 0.7540983606557377,
 0.7377049180327869,
 0.7377049180327869,
 0.7377049180327869,
 0.6885245901639344,
 0.7213114754098361,
 0.6885245901639344,
 0.6885245901639344,
 0.7049180327868853,
 0.6557377049180327]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are hard to understand so let's plot them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Train score"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Test score"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of neighbors"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Model score"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Maximum KNN score on the test data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v1UDVg5V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02j3hlqpriq8q8ma74r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v1UDVg5V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02j3hlqpriq8q8ma74r4.png" alt="Test scores" width="392" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the graph, &lt;code&gt;n_neighbors = 11&lt;/code&gt; seems best.&lt;/p&gt;
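
&lt;p&gt;We can also pull that value out programmatically rather than eyeballing the graph (a small sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The n_neighbors value with the highest test score
best_n = neighbors[test_scores.index(max(test_scores))]
best_n  # 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;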

&lt;p&gt;Even at its best, &lt;code&gt;KNN&lt;/code&gt;'s performance didn't get near what &lt;code&gt;LogisticRegression&lt;/code&gt; or the &lt;code&gt;RandomForestClassifier&lt;/code&gt; achieved.&lt;/p&gt;

&lt;p&gt;Because of this, we'll discard &lt;code&gt;KNN&lt;/code&gt; and focus on the other two.&lt;/p&gt;

&lt;p&gt;We've tuned &lt;code&gt;KNN&lt;/code&gt; by hand, but let's see how we can tune &lt;code&gt;LogisticRegression&lt;/code&gt; and &lt;code&gt;RandomForestClassifier&lt;/code&gt; using &lt;code&gt;RandomizedSearchCV&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of us trying different hyperparameters by hand, &lt;code&gt;RandomizedSearchCV&lt;/code&gt; tries a number of different combinations, evaluates them, and saves the best.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning models with &lt;code&gt;RandomizedSearchCV&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Reading the Scikit-Learn documentation for &lt;code&gt;LogisticRegression&lt;/code&gt;, we find there are a number of different hyperparameters we can tune.&lt;/p&gt;

&lt;p&gt;The same goes for &lt;code&gt;RandomForestClassifier&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's create a hyperparameter grid (a dictionary of different hyperparameters) for each and then test them out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Different LogisticRegression hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s"&gt;"solver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="c1"&gt;# Different RandomForestClassifier hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"n_estimators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"max_depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_split"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_leaf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's use &lt;code&gt;RandomizedSearchCV&lt;/code&gt; to tune our &lt;code&gt;LogisticRegression&lt;/code&gt; model.&lt;/p&gt;

&lt;p&gt;We'll pass it the different hyperparameters from &lt;code&gt;log_reg_grid&lt;/code&gt; as well as set &lt;code&gt;n_iter = 20&lt;/code&gt;. This means &lt;code&gt;RandomizedSearchCV&lt;/code&gt; will try 20 different combinations of hyperparameters from &lt;code&gt;log_reg_grid&lt;/code&gt; and keep the best one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup random seed
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Setup random hyperparameter search for LogisticRegression
&lt;/span&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                                &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit random hyperparameter search model
&lt;/span&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20,
                   param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
       4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
       2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
       1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
       5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),
                                        'solver': ['liblinear']},
                   verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'solver': 'liblinear', 'C': 0.23357214690901212}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8852459016393442
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've tuned &lt;code&gt;LogisticRegression&lt;/code&gt; using &lt;code&gt;RandomizedSearchCV&lt;/code&gt;, we'll do the same for &lt;code&gt;RandomForestClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup random seed
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Setup random hyperparameter search for RandomForestClassifier
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                           &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit random hyperparameter search model
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,
       660, 710, 760, 810, 860, 910, 960])},
                   verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the best hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'n_estimators': 210,
 'min_samples_split': 4,
 'min_samples_leaf': 19,
 'max_depth': 3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the randomized search RFC model
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8688524590163934
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tuning the hyperparameters gave a slight performance boost to both &lt;code&gt;RandomForestClassifier&lt;/code&gt; and &lt;code&gt;LogisticRegression&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is akin to tuning the settings on our oven and getting it to cook our favourite dish just right.&lt;/p&gt;

&lt;p&gt;But since &lt;code&gt;LogisticRegression&lt;/code&gt; is ahead, we'll try tuning it further with &lt;code&gt;GridSearchCV&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning a model with &lt;code&gt;GridSearchCV&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The difference between &lt;code&gt;RandomizedSearchCV&lt;/code&gt; and &lt;code&gt;GridSearchCV&lt;/code&gt; is that &lt;code&gt;RandomizedSearchCV&lt;/code&gt; randomly samples &lt;code&gt;n_iter&lt;/code&gt; combinations from a grid of hyperparameters, whereas &lt;code&gt;GridSearchCV&lt;/code&gt; will test every single possible combination.&lt;/p&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RandomizedSearchCV&lt;/code&gt; - tries &lt;code&gt;n_iter&lt;/code&gt; combinations of hyperparameters and saves the best.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GridSearchCV&lt;/code&gt; - tries every single combination of hyperparameters and saves the best.&lt;/li&gt;
&lt;/ul&gt;
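
&lt;p&gt;To make the difference concrete, here's a minimal sketch (using a hypothetical grid) that counts how many fits each approach would run. &lt;code&gt;ParameterGrid&lt;/code&gt; is a Scikit-Learn utility that enumerates every combination in a grid:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: count the hyperparameter combinations each search would try
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {"C": np.logspace(-4, 4, 30), "solver": ["liblinear"]}  # hypothetical grid

n_combos = len(ParameterGrid(grid))  # 30 values of C x 1 solver = 30 combinations
print(n_combos * 5)  # GridSearchCV with cv=5: 30 x 5 = 150 fits
print(20 * 5)        # RandomizedSearchCV with n_iter=20, cv=5: 100 fits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;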

&lt;p&gt;Let's see it in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Different LogisticRegression hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s"&gt;"solver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="c1"&gt;# Setup grid hyperparameter search for LogisticRegression
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                          &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit grid hyperparameter search model
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Check the best parameters
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'C': 0.20433597178569418, 'solver': 'liblinear'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the model
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8852459016393442
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, we get the same test score as before, since the finer grid settles on almost the same value of &lt;code&gt;C&lt;/code&gt; (0.204 here vs. 0.234 from the randomized search).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If there are a large number of hyperparameter combinations in our grid, &lt;code&gt;GridSearchCV&lt;/code&gt; may take a long time to try them all out. This is why it's a good idea to start with &lt;code&gt;RandomizedSearchCV&lt;/code&gt;, try a set number of combinations, and then use &lt;code&gt;GridSearchCV&lt;/code&gt; to refine them.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating a classification model, beyond accuracy
&lt;/h3&gt;

&lt;p&gt;Now that we've got a tuned model, let's get some of the metrics we discussed before.&lt;/p&gt;

&lt;p&gt;We want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ROC curve and AUC score - &lt;code&gt;plot_roc_curve()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Confusion matrix - &lt;code&gt;confusion_matrix()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Classification report - &lt;code&gt;classification_report()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Precision - &lt;code&gt;precision_score()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Recall - &lt;code&gt;recall_score()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;F1-score - &lt;code&gt;f1_score()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Luckily, Scikit-Learn has these all built-in.&lt;/p&gt;

&lt;p&gt;To access them, we'll have to use our model to make predictions on the test set. We can make predictions by calling &lt;code&gt;predict()&lt;/code&gt; on a trained model and passing it the data we'd like to predict on.&lt;/p&gt;

&lt;p&gt;We'll make predictions on the test data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on test data
&lt;/span&gt;&lt;span class="n"&gt;y_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see them.&lt;/p&gt;
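
&lt;p&gt;A quick way to do that (a sketch; the exact values depend on your train/test split) is to inspect the first few entries of the predictions array:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Peek at the first 10 predictions (an array of 0/1 class labels)
y_preds[:10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;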

&lt;p&gt;Since we've got our prediction values, we can find the metrics we want.&lt;/p&gt;

&lt;p&gt;Let's start with the ROC curve and AUC scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  ROC curve and AUC scores
&lt;/h3&gt;

&lt;p&gt;What's a ROC curve?&lt;/p&gt;

&lt;p&gt;It's a way of understanding how our model is performing by comparing the true positive rate to the false positive rate.&lt;/p&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scikit-Learn implements a function &lt;code&gt;plot_roc_curve&lt;/code&gt; which can help us create a ROC curve as well as calculate the area under the curve (AUC) metric.&lt;/p&gt;

&lt;p&gt;Reading the documentation on the &lt;code&gt;plot_roc_curve&lt;/code&gt; function, we can see it takes &lt;code&gt;(estimator, X, y)&lt;/code&gt; as inputs, where &lt;code&gt;estimator&lt;/code&gt; is a fitted machine learning model and &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are the data we'd like to test it on.&lt;/p&gt;

&lt;p&gt;In our case, we'll use the &lt;code&gt;GridSearchCV&lt;/code&gt; version of our &lt;code&gt;LogisticRegression&lt;/code&gt; estimator, &lt;code&gt;gs_log_reg&lt;/code&gt; as well as the test data, &lt;code&gt;X_test&lt;/code&gt; and &lt;code&gt;y_test&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot ROC curve and calculate AUC metric
&lt;/span&gt;&lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6romPj4G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv73mzne8fo9gtyy6cw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6romPj4G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv73mzne8fo9gtyy6cw4.png" alt="ROC" width="386" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our model does far better than random guessing, which would be a diagonal line from the bottom-left corner to the top-right corner (AUC = 0.5). But a perfect model would achieve an AUC score of 1.0, so there's still room for improvement.&lt;/p&gt;
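
&lt;p&gt;&lt;em&gt;Note: &lt;code&gt;plot_roc_curve&lt;/code&gt; was removed in newer versions of Scikit-Learn (1.2+). If you're on a recent version, the sketch below does the same thing with &lt;code&gt;RocCurveDisplay&lt;/code&gt;, and computes the AUC score directly with &lt;code&gt;roc_auc_score&lt;/code&gt;:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Equivalent ROC curve + AUC on Scikit-Learn 1.2+ (sketch)
from sklearn.metrics import RocCurveDisplay, roc_auc_score

RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);

# AUC from the predicted probabilities of the positive class
roc_auc_score(y_test, gs_log_reg.predict_proba(X_test)[:, 1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;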

&lt;p&gt;Let's move on to the next evaluation request, a confusion matrix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confusion matrix
&lt;/h3&gt;

&lt;p&gt;A confusion matrix is a visual way to show where our model made the right predictions and where it made the wrong predictions (or in other words, got confused).&lt;/p&gt;

&lt;p&gt;Scikit-Learn allows us to create a confusion matrix using &lt;code&gt;confusion_matrix()&lt;/code&gt; and passing it the true labels and predicted labels.&lt;/p&gt;
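
&lt;p&gt;Before making it pretty, we can print the raw array (a quick sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Raw confusion matrix: rows are true labels, columns are predicted labels
confusion_matrix(y_test, y_preds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;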

&lt;p&gt;Because Scikit-Learn's built-in confusion matrix is a bit bland, we probably want to make it visual. Let's create a function which uses Seaborn's &lt;code&gt;heatmap()&lt;/code&gt; for doing so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;font_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Increase font size
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_conf_mat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_preds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Plots a confusion matrix using Seaborn's heatmap().
    """&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Annotate the boxes
&lt;/span&gt;                     &lt;span class="n"&gt;cbar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Predicted label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"True label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_conf_mat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e5MD2c0_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4h4z6e68gwdnz02eoyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e5MD2c0_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4h4z6e68gwdnz02eoyr.png" alt="Confusion matrix" width="228" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the model gets confused (predicts the wrong label) roughly equally across both classes. In essence, there are 3 occasions where the model predicted 0 when it should have been 1 (false negatives) and 4 occasions where the model predicted 1 instead of 0 (false positives).&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification report
&lt;/h3&gt;

&lt;p&gt;We can make a classification report using &lt;code&gt;classification_report()&lt;/code&gt; and passing it the true labels as well as our model's predicted labels.&lt;/p&gt;

&lt;p&gt;A classification report will also give us information on the precision and recall of our model for each class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Show classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          precision    recall  f1-score   support

           0       0.89      0.86      0.88        29
           1       0.88      0.91      0.89        32

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.89        61

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's going on here?&lt;/p&gt;

&lt;p&gt;Let's refresh our memory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt; - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 score&lt;/strong&gt; - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support&lt;/strong&gt; - The number of samples each metric was calculated on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Macro avg&lt;/strong&gt; - Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weighted avg&lt;/strong&gt; - Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples).&lt;/li&gt;
&lt;/ul&gt;
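
&lt;p&gt;To tie these definitions back to the report, here's a sketch that recomputes the class 1 metrics by hand, assuming the counts read off the confusion matrix above (29 true positives, 4 false positives, 3 false negatives):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Recompute the class 1 row of the classification report by hand
tp, fp, fn = 29, 4, 3

precision = tp / (tp + fp)                          # 29/33 = 0.88
recall = tp / (tp + fn)                             # 29/32 = 0.91
f1 = 2 * precision * recall / (precision + recall)  # 0.89

print(precision, recall, f1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;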

&lt;p&gt;OK, now we've got a few deeper insights into our model. But these were all calculated using a single training and test set.&lt;/p&gt;

&lt;p&gt;What we'll do to make them more solid is calculate them using cross-validation.&lt;/p&gt;

&lt;p&gt;How?&lt;/p&gt;

&lt;p&gt;We'll take the best model along with the best hyperparameters and use &lt;code&gt;cross_val_score()&lt;/code&gt; along with various scoring parameter values.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cross_val_score()&lt;/code&gt; works by taking an estimator (machine learning model) along with data and labels. It then evaluates the machine learning model on the data and labels using cross-validation and a defined scoring parameter.&lt;/p&gt;

&lt;p&gt;Let's remind ourselves of the best hyperparameters and then see them in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Instantiate best model with best hyperparameters (found with GridSearchCV)
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.20433597178569418&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
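
&lt;p&gt;Alternatively (a small shortcut), &lt;code&gt;GridSearchCV&lt;/code&gt; keeps a refit copy of the best model in its &lt;code&gt;best_estimator_&lt;/code&gt; attribute, so we could also write:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same model without copying the hyperparameters by hand
clf = gs_log_reg.best_estimator_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;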



&lt;p&gt;Now that we've got an instantiated classifier, let's find some cross-validated metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validate accuracy score
&lt;/span&gt;&lt;span class="n"&gt;cv_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# 5-fold cross-validation
&lt;/span&gt;                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_acc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# since there are 5 metrics here, we'll take the average
&lt;/span&gt;&lt;span class="n"&gt;cv_acc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8479781420765027
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we'll do the same for other classification metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validated precision score
&lt;/span&gt;&lt;span class="n"&gt;cv_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"precision"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_precision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_precision&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8215873015873015
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validated recall score
&lt;/span&gt;&lt;span class="n"&gt;cv_recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"recall"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_recall&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_recall&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.9272727272727274
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validated F1 score
&lt;/span&gt;&lt;span class="n"&gt;cv_f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_f1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_f1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8705403543192143
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've got cross-validated metrics. Let's visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualizing cross-validated metrics
&lt;/span&gt;&lt;span class="n"&gt;cv_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"Accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="s"&gt;"Precision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="s"&gt;"Recall"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="s"&gt;"F1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_f1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                         &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;cv_metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Cross-Validated Classification Metrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F_g1ePNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9cwkvmerbc0qjzt177vj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F_g1ePNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9cwkvmerbc0qjzt177vj.png" alt="Metrics" width="381" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final thing to check off the list of our model evaluation techniques is feature importance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature importance
&lt;/h3&gt;

&lt;p&gt;Feature importance is another way of asking, "which features contribute most to the outcomes of the model?"&lt;/p&gt;

&lt;p&gt;Or for our problem, trying to predict heart disease using a patient's medical characteristics, which characteristics contribute most to a model predicting whether someone has heart disease or not?&lt;/p&gt;

&lt;p&gt;Unlike some of the other functions we've seen, feature importance isn't one-size-fits-all: because each model finds patterns in data slightly differently, each model also judges the importance of those patterns differently. This means for each model type, there's a slightly different way of finding which features were most important.&lt;/p&gt;

&lt;p&gt;We can usually find an example via the Scikit-Learn documentation or via searching for something like "[MODEL TYPE] feature importance", such as, "random forest feature importance".&lt;/p&gt;
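
&lt;p&gt;For example (a sketch, not something we ran above), tree-based models like &lt;code&gt;RandomForestClassifier&lt;/code&gt; expose a &lt;code&gt;feature_importances_&lt;/code&gt; attribute instead, assuming &lt;code&gt;X&lt;/code&gt; is a dataframe holding our feature columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Feature importance for the tuned random forest from earlier (sketch)
rf_importances = dict(zip(X.columns, rs_rf.best_estimator_.feature_importances_))
rf_importances
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;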

&lt;p&gt;Since we're using &lt;code&gt;LogisticRegression&lt;/code&gt;, we'll look at one way we can calculate feature importance for it.&lt;/p&gt;

&lt;p&gt;To do so, we'll use the &lt;code&gt;coef_&lt;/code&gt; attribute. Looking at the Scikit-Learn documentation for &lt;code&gt;LogisticRegression&lt;/code&gt;, the &lt;code&gt;coef_&lt;/code&gt; attribute is the coefficient of the features in the decision function.&lt;/p&gt;

&lt;p&gt;We can access the &lt;code&gt;coef_&lt;/code&gt; attribute after we've fit an instance of &lt;code&gt;LogisticRegression&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fit an instance of LogisticRegression (taken from above)
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Check coef_
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[ 0.00316728, -0.86044619,  0.6606706 , -0.01156993, -0.00166374,
         0.04386123,  0.31275813,  0.02459361, -0.60413061, -0.56862832,
         0.45051624, -0.63609879, -0.67663383]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at this, it might not make much sense. But these values are how much each feature contributes to how the model decides whether the patterns in a sample of a patient's health data lean more towards having heart disease or not.&lt;/p&gt;

&lt;p&gt;Even knowing this, in its current form, this &lt;code&gt;coef_&lt;/code&gt; array still doesn't mean much. But it will if we combine it with the columns (features) of our dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Match features to columns
&lt;/span&gt;&lt;span class="n"&gt;feature_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
&lt;span class="n"&gt;feature_dict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'age': 0.003167276981166473,
 'sex': -0.8604461876496617,
 'cp': 0.6606705956924419,
 'trestbps': -0.011569931456373254,
 'chol': -0.0016637425660326452,
 'fbs': 0.04386123481563001,
 'restecg': 0.3127581278180605,
 'thalach': 0.02459361121787892,
 'exang': -0.6041306062021752,
 'oldpeak': -0.5686283181242949,
 'slope': 0.4505162370067001,
 'ca': -0.6360987949046014,
 'thal': -0.6766338344936489}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualize feature importance
&lt;/span&gt;&lt;span class="n"&gt;feature_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Feature Importance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cv2Az7Ix--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq15reohukkyj6nvg45u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cv2Az7Ix--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq15reohukkyj6nvg45u.png" alt="Feature importance" width="391" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We notice some are negative and some are positive.&lt;/p&gt;

&lt;p&gt;The larger the absolute value (the bigger the bar), the more the feature contributes to the model's decision.&lt;/p&gt;

&lt;p&gt;If the value is negative, it means there's a negative correlation. And vice versa for positive values.&lt;/p&gt;

&lt;p&gt;For example, the sex attribute has a negative value of -0.860, which means as the value for sex increases, the target value decreases.&lt;/p&gt;

&lt;p&gt;We can see this by comparing the sex column to the target column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"sex"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;target    0    1
sex              
0        24   72
1       114   93
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that when sex is 0 (female), there are three times as many people with heart disease (target = 1) as without (72 vs. 24).&lt;/p&gt;

&lt;p&gt;And then as sex increases to 1 (male), the ratio drops to slightly below 1 to 1: 93 with heart disease vs. 114 without.&lt;/p&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;It means the model has found a pattern which reflects the data. Looking at these figures and this specific dataset, it seems if the patient is female, they're more likely to have heart disease.&lt;/p&gt;

&lt;p&gt;How about a positive correlation?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Contrast slope (positive coefficient) with target
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"slope"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;target    0    1
slope            
0        12    9
1        91   49
2        35  107
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking back at the data dictionary, we see &lt;code&gt;slope&lt;/code&gt; is the "slope of the peak exercise ST segment" where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0: Upsloping: better heart rate with exercise (uncommon)&lt;/li&gt;
&lt;li&gt;1: Flatsloping: minimal change (typical healthy heart)&lt;/li&gt;
&lt;li&gt;2: Downsloping: signs of an unhealthy heart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to the model, there's a positive coefficient of 0.451 for &lt;code&gt;slope&lt;/code&gt;, not as strong as &lt;code&gt;sex&lt;/code&gt;, but still more than 0.&lt;/p&gt;

&lt;p&gt;This positive correlation means our model is picking up the pattern that as &lt;code&gt;slope&lt;/code&gt; increases, so does the &lt;code&gt;target&lt;/code&gt; value.&lt;/p&gt;

&lt;p&gt;What can we do with this information?&lt;/p&gt;

&lt;p&gt;This is something we might want to talk to a subject matter expert about. They may be interested in seeing where the machine learning model is finding the most patterns (highest correlation) as well as where it's not (lowest correlation).&lt;/p&gt;

&lt;p&gt;Doing this has a few benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Finding out more&lt;/strong&gt; - If some of the correlations and feature importances are confusing, a subject matter expert may be able to shed some light on the situation and help us figure out more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redirecting efforts&lt;/strong&gt; - If some features offer far more value than others, this may change how we collect data for different problems. See point 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less but better&lt;/strong&gt; - Similar to above, if some features are offering far more value than others, we could reduce the number of features our model tries to find patterns in, as well as improve the ones which offer the most. This could potentially save on computation, by having the model find patterns across fewer features whilst still achieving the same performance levels.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  6. Experimentation
&lt;/h2&gt;

&lt;p&gt;We've completed all the metrics requested. We should be able to put together a great report containing a confusion matrix, a handful of cross-validated metrics such as precision, recall, and F1, as well as which features contribute most to the model making a decision.&lt;/p&gt;

&lt;p&gt;But after all this we might be wondering where step 6 in the framework is, experimentation.&lt;/p&gt;

&lt;p&gt;The whole thing is experimentation!&lt;/p&gt;

&lt;p&gt;From trying different models, to tuning them, to figuring out which hyperparameters were best.&lt;/p&gt;

&lt;p&gt;What we've worked through so far has been a series of experiments.&lt;/p&gt;

&lt;p&gt;And we could keep going. But of course, things can't go on forever.&lt;/p&gt;

&lt;p&gt;So by this stage, after trying a few different things, we'd ask ourselves: did we meet the evaluation metric?&lt;/p&gt;

&lt;p&gt;We defined one in step 3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this case, we didn't. The highest accuracy our model achieved was 88.5%, below the 95% target.&lt;/p&gt;

&lt;h4&gt;
  
  
  What's next?
&lt;/h4&gt;

&lt;p&gt;What happens when the evaluation metric doesn't get hit?&lt;/p&gt;

&lt;p&gt;Is everything we've done wasted?&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;It means we know what doesn't work. In this case, we know the current model we're using (a tuned version of LogisticRegression) along with our specific data set doesn't hit the target we set ourselves.&lt;/p&gt;

&lt;p&gt;This is where step 6 comes into its own.&lt;/p&gt;

&lt;p&gt;A good next step would be to discuss with our team or research on our own different options for going forward.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Could we collect more data?&lt;/li&gt;
&lt;li&gt;Could we try a better model? If we're working with structured data, we might want to look into &lt;a href="https://catboost.ai/"&gt;CatBoost&lt;/a&gt; or &lt;a href="https://xgboost.ai/"&gt;XGBoost&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Could we improve the current models (beyond what we've done so far)?&lt;/li&gt;
&lt;li&gt;If our model is good enough, how would we export it and share it with others? (Hint: check out &lt;a href="https://scikit-learn.org/stable/modules/model_persistence.html"&gt;Scikit-Learn's documentation on model persistence&lt;/a&gt;; a minimal sketch follows below)&lt;/li&gt;
&lt;/ul&gt;
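
&lt;p&gt;On that last point, saving a fitted model is a one-liner with &lt;code&gt;joblib&lt;/code&gt;, the approach Scikit-Learn's persistence docs describe (a sketch; the filename is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Save the tuned model to disk, then load it back and check it scores the same
from joblib import dump, load

dump(gs_log_reg, "gs_log_reg.joblib")
loaded_model = load("gs_log_reg.joblib")
loaded_model.score(X_test, y_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;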

&lt;p&gt;The key here is to remember that our biggest restriction will be time, which is why it's paramount to minimise the delay between experiments.&lt;/p&gt;

&lt;p&gt;The more we try, the more we figure out what doesn't work, and the more we'll start to get the hang of what does.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
