<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nicolas Vallée</title>
    <description>The latest articles on DEV Community by Nicolas Vallée (@nicolasvallee).</description>
    <link>https://dev.to/nicolasvallee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F837908%2Fe2533b54-329c-4f73-9e69-931dcf6d42cd.jpeg</url>
      <title>DEV Community: Nicolas Vallée</title>
      <link>https://dev.to/nicolasvallee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nicolasvallee"/>
    <language>en</language>
    <item>
      <title>Using Transfer Learning and TensorFlow to Identify Dog Breeds from Images</title>
      <dc:creator>Nicolas Vallée</dc:creator>
      <pubDate>Wed, 30 Mar 2022 09:27:34 +0000</pubDate>
      <link>https://dev.to/nicolasvallee/using-transfer-learning-and-tensorflow-to-identify-dog-breeds-from-images-5b4b</link>
      <guid>https://dev.to/nicolasvallee/using-transfer-learning-and-tensorflow-to-identify-dog-breeds-from-images-5b4b</guid>
      <description>&lt;p&gt;&lt;strong&gt;What are we building here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this project, we're using Machine Learning to identify different breeds of dogs from images.&lt;/p&gt;

&lt;p&gt;To do this, we'll use data from the &lt;a href="https://www.kaggle.com/c/dog-breed-identification/overview"&gt;Kaggle dog breed identification competition&lt;/a&gt;. The dataset consists of 10,000+ labeled images of 120 different dog breeds.&lt;/p&gt;

&lt;p&gt;This type of problem is called &lt;strong&gt;multi-class image classification&lt;/strong&gt;. Multi-class because we're trying to classify multiple dog breeds. If, on the other hand, we wanted to classify &lt;em&gt;dogs&lt;/em&gt; versus &lt;em&gt;cats&lt;/em&gt;, it would be called &lt;strong&gt;binary classification&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I completed this project in March 2022 as part of the &lt;a href="https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/"&gt;Complete Machine Learning &amp;amp; Data Science Bootcamp&lt;/a&gt;, taught by Andrei Neagoie and Daniel Bourke. If you're looking for a beginner-friendly course teaching Data Science and Machine Learning from scratch, I highly recommend checking this one out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Follow this link: &lt;a href="https://github.com/nicolas-vallee/ztm-ml-projects/blob/main/dog_vision_project.ipynb"&gt;Dog Vision Project&lt;/a&gt; to see my completed notebook, which you can also open in Colab.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this an interesting topic?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-class image classification is both a common and an important problem in Machine Learning. It's the same kind of technology that Tesla uses for its self-driving cars, or that Airbnb uses to automatically add information to its listings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are we going about this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step in a deep learning problem is to get the data ready by turning it into numbers.&lt;/p&gt;

&lt;p&gt;We will go through the following workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get the data ready (download from Kaggle, store, import).&lt;/li&gt;
&lt;li&gt;Prepare the data (preprocessing, the 3 sets, X &amp;amp; y).&lt;/li&gt;
&lt;li&gt;Choose and fit a model (&lt;a href="https://www.tensorflow.org/hub"&gt;TensorFlow Hub&lt;/a&gt;, &lt;code&gt;tf.keras.applications&lt;/code&gt;, &lt;a href="https://www.tensorflow.org/tensorboard"&gt;TensorBoard&lt;/a&gt;, &lt;a href="https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping"&gt;EarlyStopping&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Evaluate the model (making predictions and comparing them with the ground truth labels).&lt;/li&gt;
&lt;li&gt;Improve the model through experimentation (starting with 1000 images, making sure it works, then increasing the number of images).&lt;/li&gt;
&lt;li&gt;Save, share, and reload the model (once we're happy with the results).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the preprocessing of our data, we're going to use TensorFlow 2.x. We will turn our data into Tensors (arrays of numbers which can be run on GPUs) and then allow a machine learning model to find patterns in them.&lt;/p&gt;

&lt;p&gt;Our machine learning model will be a pretrained deep learning model from TensorFlow Hub.&lt;/p&gt;

&lt;p&gt;The process of using a pretrained model and adapting it to a specific problem is called &lt;strong&gt;transfer learning&lt;/strong&gt;. Rather than training our own model from scratch, which could be time consuming and expensive, we will leverage the patterns of another model which has already been trained to classify images.&lt;/p&gt;
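
&lt;p&gt;To make this more concrete, here's a minimal sketch of what transfer learning looks like in TensorFlow code. This is illustrative only: the TensorFlow Hub URL below points to an example feature extractor (MobileNetV2), not necessarily the model we'll end up choosing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal transfer learning sketch (illustrative, not our final model)
import tensorflow as tf
import tensorflow_hub as hub

# Example pretrained feature extractor from TensorFlow Hub (MobileNetV2)
MODEL_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/feature_vector/4"

model = tf.keras.Sequential([
  hub.KerasLayer(MODEL_URL, input_shape=(224, 224, 3)),  # pretrained patterns
  tf.keras.layers.Dense(120, activation="softmax")  # one output per dog breed
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;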

&lt;h2&gt;
  Getting our workspace ready
&lt;/h2&gt;

&lt;p&gt;Before we get started, we need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Import TensorFlow 2.x&lt;/li&gt;
&lt;li&gt;Import TensorFlow Hub&lt;/li&gt;
&lt;li&gt;Make sure we're using a GPU
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow_hub&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"TF version:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hub version:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check for GPU
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GPU"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"is available, we're good to go!"&lt;/span&gt; 
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_physical_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GPU"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
      &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s"&gt;"is not available. Change runtime type to GPU
 before proceeding."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TF version: 2.8.0
Hub version: 0.12.0
GPU is available, we're good to go!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What is a GPU and why do we need one? &lt;/p&gt;

&lt;p&gt;A GPU (graphics processing unit) is a computer chip that is much faster than a CPU at the kind of parallel numerical computations deep learning requires.&lt;/p&gt;

&lt;p&gt;By default, Colab runs on a virtual machine on Google's servers which doesn't have a GPU attached to it.&lt;/p&gt;

&lt;p&gt;We can fix this by changing the runtime type:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Runtime.&lt;/li&gt;
&lt;li&gt;Click "Change runtime type".&lt;/li&gt;
&lt;li&gt;Where it says "Hardware accelerator", choose "GPU".&lt;/li&gt;
&lt;li&gt;Click save.&lt;/li&gt;
&lt;li&gt;The runtime will be restarted to activate the new hardware, so we'll have to rerun the above cells.&lt;/li&gt;
&lt;li&gt;If the steps have worked, we should see a print out saying "GPU is available".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To see how much a GPU speeds up computing, Google Colab has &lt;a href="https://colab.research.google.com/notebooks/gpu.ipynb"&gt;a demonstration notebook available&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  Getting our data ready
&lt;/h2&gt;

&lt;p&gt;Getting our data ready to be used with a machine learning model is an important step.&lt;/p&gt;

&lt;p&gt;There are a few ways to do this. Many of them are detailed in the &lt;a href="https://colab.research.google.com/notebooks/io.ipynb"&gt;Google Colab notebook on I/O (input and output)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since the data we're using is hosted on Kaggle, we could use the &lt;a href="https://www.kaggle.com/docs/api"&gt;Kaggle API&lt;/a&gt;.&lt;/p&gt;
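
&lt;p&gt;For example, with a Kaggle API token (&lt;code&gt;kaggle.json&lt;/code&gt;) uploaded to the runtime, the download could look roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: download the competition data with the Kaggle API
# (assumes kaggle.json has already been uploaded to the runtime)
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c dog-breed-identification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;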

&lt;p&gt;Another method is to upload the data to our Google Drive, mount the drive in this notebook, and import the files.&lt;/p&gt;

&lt;h3&gt;
  Mounting Google Drive
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Running this cell will provide a token to link our drive to
# this notebook
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.colab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;
&lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/content/drive'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now see a "drive" folder available under the Files tab.&lt;/p&gt;

&lt;p&gt;This means we'll be able to access files in our Google Drive in this notebook.&lt;/p&gt;

&lt;p&gt;For this project, I've downloaded the &lt;a href="https://www.kaggle.com/c/dog-breed-identification/data"&gt;data from Kaggle&lt;/a&gt; and uploaded it to my Google Drive as a .zip file under the folder "ML/Dog Vision".&lt;/p&gt;

&lt;p&gt;To access it, we need to unzip it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Running the cell below for the first time could take a while (a couple of minutes is normal). Once we've run it and the unzipped data is in our Google Drive, we don't need to run it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use the '-d' parameter as the destination for where the
# files should go
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;unzip&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/dog-breed-identification.zip"&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
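
&lt;p&gt;To make the cell above safe to re-run, we could guard it with a quick existence check (a sketch; adjust the path to match your own Drive layout):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: only unzip if the training folder isn't already there
import os

if not os.path.exists("drive/MyDrive/ML/Dog Vision/train/"):
  !unzip "drive/MyDrive/ML/Dog Vision/dog-breed-identification.zip" -d "drive/MyDrive/ML/Dog Vision/"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;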



&lt;h3&gt;
  Accessing the data
&lt;/h3&gt;

&lt;p&gt;Now that the data files are available on our Google Drive, we can start to check them out.&lt;/p&gt;

&lt;p&gt;Let's start with &lt;code&gt;labels.csv&lt;/code&gt;, which contains all of the image IDs and their associated dog breeds (our data and labels).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Checkout the labels of our data
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;labels_csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/labels.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                     id               breed
count                              10222               10222
unique                             10222                 120
top     000bec180eb18c7604dcecc8fe0dba07  scottish_deerhound
freq                                   1                 126
                                 id             breed
0  000bec180eb18c7604dcecc8fe0dba07       boston_bull
1  001513dfcb2ffafc82cccf4d8bbaba97             dingo
2  001cdf01b096e06d78e9e5112d419397          pekinese
3  00214f311d5d2247d5dfe4fe24b2303d          bluetick
4  0021f9ceb3235effd7fcde7f7538ed62  golden_retriever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at this, we can see there are 10,222 different IDs (meaning 10,222 different images) and 120 different breeds.&lt;/p&gt;

&lt;p&gt;Let's figure out how many images there are for each breed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How many images are there for each breed?
&lt;/span&gt;&lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"breed"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KWSNqNHF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/siy1u3loteegxhe3v3ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KWSNqNHF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/siy1u3loteegxhe3v3ob.png" alt="Barchart representing number of images for each category" width="880" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we draw a line across the middle of the graph, we see there are roughly 60 or more images for each dog breed.&lt;/p&gt;

&lt;p&gt;This is a good amount. For some of their vision products, Google recommends a minimum of 10 images per class to get started. And the more images available per class, the better chance a model has of figuring out the patterns that distinguish them.&lt;/p&gt;

&lt;p&gt;Let's check out one of the images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Loading an image file for the first time may take a while as it gets loaded into the runtime memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt; 
&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jbio8B1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/214twefx1v6hmszjvxqm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jbio8B1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/214twefx1v6hmszjvxqm.jpeg" alt="Image of a dog" width="500" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  Getting images and their labels
&lt;/h3&gt;

&lt;p&gt;Since we've got the image IDs and their labels in a DataFrame (&lt;code&gt;labels_csv&lt;/code&gt;), we'll use it to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A list of filepaths to training images&lt;/li&gt;
&lt;li&gt;An array of all labels&lt;/li&gt;
&lt;li&gt;An array of all unique labels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll only create a list of filepaths to images rather than loading them all into memory to begin with. Working with filepaths (strings) is far cheaper than working with the images themselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create pathnames from image ID's
&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/train/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".jpg"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# Check the first 10 filenames
&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['drive/MyDrive/ML/Dog Vision/train/000bec180eb18c7604dcecc8fe0dba07.jpg',
 'drive/MyDrive/ML/Dog Vision/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg',
 'drive/MyDrive/ML/Dog Vision/train/001cdf01b096e06d78e9e5112d419397.jpg',
 'drive/MyDrive/ML/Dog Vision/train/00214f311d5d2247d5dfe4fe24b2303d.jpg',
 'drive/MyDrive/ML/Dog Vision/train/0021f9ceb3235effd7fcde7f7538ed62.jpg',
 'drive/MyDrive/ML/Dog Vision/train/002211c81b498ef88e1b40b9abf84e1d.jpg',
 'drive/MyDrive/ML/Dog Vision/train/00290d3e1fdd27226ba27a8ce248ce85.jpg',
 'drive/MyDrive/ML/Dog Vision/train/002a283a315af96eaea0e28e7163b21b.jpg',
 'drive/MyDrive/ML/Dog Vision/train/003df8b8a8b05244b1d920bb6cf451f9.jpg',
 'drive/MyDrive/ML/Dog Vision/train/0042188c895a2f14ef64a918ed9c7b64.jpg']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got a list of all the filenames from the ID column of &lt;code&gt;labels_csv&lt;/code&gt;, we can compare its length to the number of files in our training data directory to see if they line up.&lt;/p&gt;

&lt;p&gt;If they do, great. If not, there may have been an issue when unzipping the data. To fix this, we might have to unzip the data again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check if number of filenames matches number of actual image files
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/train/"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Filenames match actual number of files."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Filenames do not match actual number of files, check the target directory."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filenames match actual number of files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's visualize an image directly from a filepath.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check an image directly from a filepath
&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;9000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KfVhib6t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qyjw0r2kswci6lg7fz6y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KfVhib6t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qyjw0r2kswci6lg7fz6y.jpeg" alt="Another image of a dog" width="610" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we've got our image filepaths together, let's get the labels.&lt;/p&gt;

&lt;p&gt;We'll take them from &lt;code&gt;labels_csv&lt;/code&gt; and turn them into a NumPy array.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;labels_csv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"breed"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# convert labels column to NumPy array
&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array(['boston_bull', 'dingo', 'pekinese', ..., 'airedale',
       'miniature_pinscher', 'chesapeake_bay_retriever'], dtype=object)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's compare the number of labels to the number of filenames.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing data
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of labels matches number of filenames!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of labels does not match number of filenames, check data directories"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Number of labels matches number of filenames!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should have the same number of images and labels.&lt;/p&gt;

&lt;p&gt;Finally, since a machine learning model can't take strings as input, we'll have to convert our labels to numbers.&lt;/p&gt;

&lt;p&gt;To begin with, we'll find all of the unique dog breed names.&lt;/p&gt;

&lt;p&gt;Then, we'll go through the list of &lt;code&gt;labels&lt;/code&gt;, compare each one to the array of unique breeds, and create a boolean array indicating which breed is the real label (&lt;code&gt;True&lt;/code&gt;) and which ones aren't (&lt;code&gt;False&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the unique label values
&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The length of &lt;code&gt;unique_breeds&lt;/code&gt; should be 120, meaning we're working with images of 120 different breeds of dogs.&lt;/p&gt;

&lt;p&gt;Now we'll use &lt;code&gt;unique_breeds&lt;/code&gt; to turn our &lt;code&gt;labels&lt;/code&gt; array into an array of booleans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn every label into a boolean array
&lt;/span&gt;&lt;span class="n"&gt;boolean_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;unique_breeds&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;boolean_labels&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[array([False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False]),
 array([False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False])]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why do it like this?&lt;/p&gt;

&lt;p&gt;An important concept in machine learning is converting our data to numbers before passing it to a machine learning model.&lt;/p&gt;

&lt;p&gt;In this case, we've transformed a single dog breed name such as &lt;code&gt;boston_bull&lt;/code&gt; into a one-hot array.&lt;/p&gt;

&lt;p&gt;Let's see an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Turning a boolean array into integers
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# original label
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# index where label occurs
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boolean_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="c1"&gt;# index where label occurs in boolean array
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boolean_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# there will be a 1 where the sample label occurs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boston_bull
19
19
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got our labels in a numeric format and our image filepaths easily accessible (although they aren't numeric yet), let's split our data.&lt;/p&gt;

&lt;h3&gt;
  Creating our own validation set
&lt;/h3&gt;

&lt;p&gt;Since the dataset from Kaggle doesn't come with a validation set (a split of the data we can test our model on before making final predictions on the test set), let's make one.&lt;/p&gt;

&lt;p&gt;We could use Scikit-Learn's &lt;code&gt;train_test_split&lt;/code&gt; function or we could simply make manual splits of the data.&lt;/p&gt;

&lt;p&gt;For easier access later, let's save our filenames variable to &lt;code&gt;X&lt;/code&gt; (data) and our labels to &lt;code&gt;y&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup X and y variables
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filenames&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boolean_labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we're working with 10,000+ images, it's a good idea to start with a portion of them to make sure things are working before training on them all.&lt;/p&gt;

&lt;p&gt;This is because computing with 10,000+ images could take a fairly long time. And our goal when working through machine learning projects is to reduce the time between experiments.&lt;/p&gt;

&lt;p&gt;Let's start experimenting with 1,000 images and increase it as we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Set number of images to use for experimenting
&lt;/span&gt;&lt;span class="n"&gt;NUM_IMAGES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's split our data into training and validation sets. We'll use an 80/20 split (80% training data, 20% validation data).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import train_test_split from Scikit-Learn
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# Split them into training and validation sets of total size NUM_IMAGES
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;NUM_IMAGES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                                  &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;NUM_IMAGES&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                                  &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                  &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(800, 800, 200, 200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's look at the training data
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(['drive/MyDrive/ML/Dog Vision/train/00bee065dcec471f26394855c5c2f3de.jpg',
  'drive/MyDrive/ML/Dog Vision/train/0d2f9e12a2611d911d91a339074c8154.jpg'],
 [array([False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False,  True,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False]),
  array([False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False,  True, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False, False, False, False, False, False, False,
         False, False, False])])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  Preprocessing images (turning images into Tensors)
&lt;/h3&gt;

&lt;p&gt;Our labels are in numeric format but our images are still just filepaths.&lt;/p&gt;

&lt;p&gt;Since we're using TensorFlow, our data has to be in the form of Tensors.&lt;/p&gt;

&lt;p&gt;A Tensor is a way to represent information in numbers. A Tensor can be thought of as a NumPy array with the special ability to be used on a GPU.&lt;/p&gt;

&lt;p&gt;Because of how TensorFlow stores information (in Tensors), it allows machine learning and deep learning models to be run on GPUs (generally faster at numerical computing).&lt;/p&gt;
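
&lt;p&gt;As a tiny illustration, any value we wrap in a Tensor is placed on the GPU automatically when one is available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# On a GPU runtime, new Tensors are placed on the GPU automatically
t = tf.constant([1.0, 2.0, 3.0])
print(t.device)  # ends with 'GPU:0' on a GPU runtime, 'CPU:0' otherwise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;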

&lt;p&gt;To preprocess our images into Tensors, we're going to write a function which does a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take an image filepath as input.&lt;/li&gt;
&lt;li&gt;Use TensorFlow to read the file and save it to a variable, &lt;code&gt;image&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Turn our &lt;code&gt;image&lt;/code&gt; (a jpeg file) into a Tensor.&lt;/li&gt;
&lt;li&gt;Normalize our image (convert color channel values from 0-255 to 0-1).&lt;/li&gt;
&lt;li&gt;Resize the &lt;code&gt;image&lt;/code&gt; to be of shape (224, 224).&lt;/li&gt;
&lt;li&gt;Return the modified &lt;code&gt;image&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A good place to read about this type of function is the &lt;a href="https://www.tensorflow.org/tutorials/load_data/images"&gt;TensorFlow documentation on loading images&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why is the shape (224, 224), which is (height, width)? This is to match the size of input our model takes. As we'll see later, our model will take as input an image which is (224, 224, 3).&lt;/p&gt;

&lt;p&gt;Here, 3 is the number of color channels per pixel: red, green, and blue.&lt;/p&gt;

&lt;p&gt;Let's make this a little more concrete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert image to NumPy array
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;imread&lt;/span&gt; 
&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filenames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# read in an image
&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(257, 350, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shape of the image is (257, 350, 3), which is (height, width, color channels).&lt;/p&gt;

&lt;p&gt;And we can easily convert it to a Tensor using &lt;code&gt;tf.constant()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn image into a Tensor
&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;tf.Tensor: shape=(2, 350, 3), dtype=uint8, numpy=
array([[[ 89, 137,  87],
        [ 76, 124,  74],
        [ 63, 111,  59],
        ...,
        [ 76, 134,  86],
        [ 76, 134,  86],
        [ 76, 134,  86]],

       [[ 72, 119,  73],
        [ 67, 114,  68],
        [ 63, 111,  63],
        ...,
        [ 75, 131,  84],
        [ 74, 132,  84],
        [ 74, 131,  86]]], dtype=uint8)&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's build that function to preprocess an image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define image size
&lt;/span&gt;&lt;span class="n"&gt;IMG_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;IMG_SIZE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Takes an image file path and an image size, and turns the image into a Tensor.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# Read in an image file
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Turn the jpeg image into numerical Tensor with 3 color channels (RGB)
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode_jpeg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Convert the color channel values from 0-255 to 0-1 values
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;convert_image_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Resize the image to our desired value (224, 224)
&lt;/span&gt;  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
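
&lt;p&gt;As a quick sanity check, we can run one of our filepaths through the function and confirm the output shape and dtype:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sanity check: we should get back a (224, 224, 3) float32 Tensor
processed = process_image(filenames[42])
print(processed.shape, processed.dtype)  # expect (224, 224, 3) float32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;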



&lt;h3&gt;
  Creating data batches
&lt;/h3&gt;

&lt;p&gt;We'll now build a function to turn our data into batches (more specifically, a TensorFlow &lt;code&gt;BatchDataset&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;What's a batch?&lt;/p&gt;

&lt;p&gt;A batch (also called mini-batch) is a small portion of our data containing, for instance, 32 images and their labels. 32 is generally the default batch size. In deep learning, instead of finding patterns in an entire dataset at the same time, we often find them one batch at a time.&lt;/p&gt;

&lt;p&gt;Let's say we're dealing with 10,000+ images (which we are). Together, these files may take up more memory than our GPU has. Trying to compute on them all would result in an error.&lt;/p&gt;
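
&lt;p&gt;Rough numbers make the point: 10,000 images preprocessed to our eventual (224, 224, 3) shape and stored as 32-bit floats come to about 10,000 × 224 × 224 × 3 × 4 bytes, or roughly 6 GB, before we even account for the model itself.&lt;/p&gt;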

&lt;p&gt;Instead, it's more efficient to create smaller batches of our data and compute on one batch at a time.&lt;/p&gt;

&lt;p&gt;TensorFlow is very efficient when our data is in batches of (image, label) Tensors. So, we'll build a function to create these batches. We'll take advantage of the &lt;code&gt;process_image&lt;/code&gt; function at the same time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a simple function to return a tuple (image, label)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Takes an image file path name and the associated label, 
  processes the image, and returns a tuple of (image, label).
  """&lt;/span&gt;
  &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
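
&lt;p&gt;Here's a quick demo of what it returns for a single pair, using the &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; variables from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Demo: turn one filepath and its boolean label into an (image, label) tuple
demo_image, demo_label = get_image_label(X[42], tf.constant(y[42]))
print(demo_image.shape, demo_label.shape)  # expect (224, 224, 3) and (120,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;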



&lt;p&gt;Now that we've got a simple function to turn our image filepath names and their associated labels into tuples, we'll create a function to make data batches.&lt;/p&gt;

&lt;p&gt;Because we'll be dealing with 3 different sets of data (training, validation, and test), we'll make sure the function can accommodate each set.&lt;/p&gt;

&lt;p&gt;We'll set a default batch size of 32 because &lt;a href="https://twitter.com/ylecun/status/989610208497360896?s=20"&gt;according to Yann LeCun&lt;/a&gt;, friends don't let friends train with batch sizes over 32.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the batch size
&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function to turn data into batches
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Creates batches of data out of image (X) and label (y) pairs.
  Shuffles the data if it's training data but doesn't shuffle if it's validation data.
  Also accepts test data as input (no labels).
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# If the data is a test dataset, we probably don't have labels
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating test data batches..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="c1"&gt;# only filepaths (no labels)
&lt;/span&gt;    &lt;span class="n"&gt;data_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_batch&lt;/span&gt;

  &lt;span class="c1"&gt;# If the data is a validation dataset, we don't need to shuffle it
&lt;/span&gt;  &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating validation data batches..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# filepaths
&lt;/span&gt;                                               &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="c1"&gt;# labels
&lt;/span&gt;    &lt;span class="n"&gt;data_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_batch&lt;/span&gt;

  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# If the data is a training dataset, we shuffle it
&lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Creating training data batches..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Turn filepaths and labels into Tensors
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_tensor_slices&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                                               &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="c1"&gt;# Shuffling pathnames and labels before mapping image processor function is faster than shuffling images
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Create (image, label) tuples (this also turns the image path into a preprocessed image)
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_image_label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Turn the data into batches   
&lt;/span&gt;    &lt;span class="n"&gt;data_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_batch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create training and validation data batches
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check the different attributes of our data batches
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;element_spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;element_spec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
  TensorSpec(shape=(None, 120), dtype=tf.bool, name=None)),
 (TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None),
  TensorSpec(shape=(None, 120), dtype=tf.bool, name=None)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've now got our data in batches; more specifically, it's in Tensor pairs of (images, labels), ready for use on a GPU.&lt;/p&gt;

&lt;p&gt;But batched data can be a hard concept to wrap your head around. Let's build a function that helps us visualize what's going on under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualizing data batches
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function for viewing images in a data batch
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_25_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Displays a plot of 25 images and their labels from a data batch.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# Setup the figure
&lt;/span&gt;  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="c1"&gt;# Loop through 25
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Create subplots (5 rows, 5 columns)
&lt;/span&gt;    &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Display an image
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# Add the image label as the title
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
    &lt;span class="c1"&gt;# Turn grid lines off
&lt;/span&gt;    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"off"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make computation efficient, a batch is a tightly wound collection of Tensors.&lt;/p&gt;

&lt;p&gt;So, to view data in a batch, we've got to unwind it.&lt;/p&gt;

&lt;p&gt;We can do so by calling the &lt;code&gt;as_numpy_iterator()&lt;/code&gt; method on a data batch.&lt;/p&gt;

&lt;p&gt;This will turn our data batch into something we can iterate over.&lt;/p&gt;

&lt;p&gt;Passing that iterator to &lt;code&gt;next()&lt;/code&gt; returns its next item.&lt;/p&gt;

&lt;p&gt;In our case, &lt;code&gt;next()&lt;/code&gt; will return a batch of 32 (image, label) pairs.&lt;/p&gt;
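
&lt;p&gt;If the iterator pattern is new to you, here's the same idea on a plain Python list (a minimal sketch, unrelated to the notebook's data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The same iterator pattern on a plain Python list
it = iter([1, 2, 3])   # turn the list into an iterator
print(next(it))        # 1
print(next(it))        # 2 - each call returns the next item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;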

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Running the cell below and loading images may take a little while.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualize training images from the training data batch
&lt;/span&gt;&lt;span class="n"&gt;train_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;show_25_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sh62Nny5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vj6mg2umb6k4yiugqpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sh62Nny5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vj6mg2umb6k4yiugqpt.png" alt="Batch of training images" width="620" height="574"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now let's visualize our validation set
&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;show_25_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IFeGsIN3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xts0mqkugwk32355u1le.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IFeGsIN3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xts0mqkugwk32355u1le.png" alt="Batch of validation images" width="607" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating and training a Model
&lt;/h2&gt;

&lt;p&gt;We'll use an existing model from &lt;a href="https://tfhub.dev/"&gt;TensorFlow Hub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;TensorFlow Hub is a resource where we can find pretrained machine learning models for the problem we're working on.&lt;/p&gt;

&lt;p&gt;Using a pretrained machine learning model is often referred to as &lt;strong&gt;transfer learning&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Why use a pretrained model?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Building a machine learning model and training it from scratch can be expensive and time consuming.&lt;/p&gt;

&lt;p&gt;Transfer learning helps solve these issues by taking what another model has already learned and applying it to our own problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How do we choose a model?&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Since we know our problem is image classification (classifying different dog breeds), we can navigate the TensorFlow Hub page by our problem domain (image).&lt;/p&gt;

&lt;p&gt;We start by choosing the image problem domain, then filter down by subdomain (in our case, image classification).&lt;/p&gt;

&lt;p&gt;Doing this gives a list of different pretrained models we can apply to our task.&lt;/p&gt;

&lt;p&gt;For example, the mobilenet_v2_130_224 model takes images of shape 224 x 224 as input, and its page tells us it was trained for image classification.&lt;/p&gt;

&lt;p&gt;Let's try it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building a model
&lt;/h3&gt;

&lt;p&gt;Before we build a model, there are a few things we need to define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input shape (images, in the form of Tensors) to our model.&lt;/li&gt;
&lt;li&gt;The output shape (image labels, in the form of Tensors) of our model.&lt;/li&gt;
&lt;li&gt;The URL of the model we want to use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These things will be standard practice with whatever machine learning model we use. And because we're using TensorFlow, everything will be in the form of Tensors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup input shape to the model
&lt;/span&gt;&lt;span class="n"&gt;INPUT_SHAPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IMG_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# batch, height, width, color channels
&lt;/span&gt;
&lt;span class="c1"&gt;# Setup output shape of the model
&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_SHAPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Setup model URL from TensorFlow Hub
&lt;/span&gt;&lt;span class="n"&gt;MODEL_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we've got the inputs, outputs, and model we're using ready to go. We can start to put them together.&lt;/p&gt;

&lt;p&gt;There are many ways of building a model in TensorFlow but one of the best ways to get started is to use the &lt;a href="https://www.tensorflow.org/guide/keras/overview"&gt;Keras API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Defining a deep learning model in Keras can be as straightforward as saying, "here are the layers of the model, the input shape, and the output shape, let's go!"&lt;/p&gt;

&lt;p&gt;Knowing this, let's create a function which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes the input shape, output shape, and the model we've chosen as parameters.&lt;/li&gt;
&lt;li&gt;Defines the layers in a Keras model in sequential fashion.&lt;/li&gt;
&lt;li&gt;Compiles the model (defines how it should be evaluated and improved).&lt;/li&gt;
&lt;li&gt;Builds the model (tells it what kind of input shape it'll be getting).&lt;/li&gt;
&lt;li&gt;Returns the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these steps can be found here: &lt;a href="https://www.tensorflow.org/guide/keras/sequential_model"&gt;https://www.tensorflow.org/guide/keras/sequential_model&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a function which builds a keras model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INPUT_SHAPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_SHAPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Building model with:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Setup the model layers
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KerasLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_url&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Layer 1 (input layer)
&lt;/span&gt;    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"softmax"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Layer 2 (output layer)
&lt;/span&gt;  &lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="c1"&gt;# Compile the model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CategoricalCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Build the model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's happening here?&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting up the model layers
&lt;/h4&gt;

&lt;p&gt;There are two ways to do this in Keras, the &lt;a href="https://www.tensorflow.org/guide/keras/functional"&gt;functional&lt;/a&gt; and &lt;a href="https://www.tensorflow.org/guide/keras/overview#build_a_simple_model"&gt;sequential&lt;/a&gt; API. We've used the sequential.&lt;/p&gt;

&lt;p&gt;Which one should we choose?&lt;/p&gt;

&lt;p&gt;The Keras documentation states that the functional API is the way to go for defining complex models, but the sequential API (a linear stack of layers) is perfectly fine for getting started, which is what we're doing.&lt;/p&gt;

&lt;p&gt;The first layer we use is the model from TensorFlow Hub (&lt;code&gt;hub.KerasLayer(MODEL_URL)&lt;/code&gt;). So our first layer is actually an entire model (many more layers). This &lt;strong&gt;input layer&lt;/strong&gt; takes in our images and finds patterns in them based on the patterns &lt;code&gt;mobilenet_v2_130_224&lt;/code&gt; has found.&lt;/p&gt;

&lt;p&gt;The next layer (&lt;code&gt;tf.keras.layers.Dense()&lt;/code&gt;) is the &lt;strong&gt;output layer&lt;/strong&gt; of our model. It brings all of the information discovered in the input layer together and outputs it in the shape we're after, 120 (the number of unique labels we have).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;activation="softmax"&lt;/code&gt; parameter tells the output layer to assign each of the 120 labels a probability value between 0 and 1. The higher the value, the more confident the model is that the input image should have that label. If we were working on a binary classification problem, we'd use &lt;code&gt;activation="sigmoid"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For more on which activation function to use, see the article &lt;a href="https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8"&gt;Which Loss and Activation Functions Should I Use?&lt;/a&gt;&lt;/p&gt;
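
&lt;p&gt;To make softmax less abstract, here's a minimal sketch with three made-up scores (not from our model) showing how it turns raw numbers into probabilities that sum to 1:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Three made-up raw scores (logits) for three hypothetical labels
logits = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(logits)
print(probs.numpy())        # ~[0.659 0.242 0.099] - the highest score gets the highest probability
print(probs.numpy().sum())  # ~1.0 - softmax outputs always sum to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;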

&lt;h4&gt;
  
  
  Compiling the model
&lt;/h4&gt;

&lt;p&gt;This one is best explained with a story.&lt;/p&gt;

&lt;p&gt;Let's say you're at the international hill descending championships, where you start standing on top of a hill with the goal of getting to the bottom. The catch is that you're blindfolded.&lt;/p&gt;

&lt;p&gt;Luckily, your friend Adam is standing at the bottom of the hill shouting instructions on how to get down.&lt;/p&gt;

&lt;p&gt;At the bottom of the hill, there's a judge evaluating how you're doing. They know where you need to end up so they compare how you're doing to where you're supposed to be. Their comparison is how you get scored.&lt;/p&gt;

&lt;p&gt;Transferring this to &lt;code&gt;model.compile()&lt;/code&gt; terminology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;loss&lt;/code&gt; - The height of the hill is the loss function, the model's goal is to minimize this, getting to 0 (the bottom of the hill) means the model is learning perfectly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;optimizer&lt;/code&gt; - Your friend Adam is the optimizer, he's the one telling you how to navigate the hill (lower the loss function) based on what you've done so far. His name is Adam because the &lt;a href="https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/"&gt;Adam optimizer&lt;/a&gt; performs well on most models. Other optimizers include RMSprop and Stochastic Gradient Descent.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metrics&lt;/code&gt; - This is the onlooker at the bottom of the hill rating your performance. Or in our case, reporting the accuracy of how well our model is predicting the correct image label.&lt;/li&gt;
&lt;/ul&gt;
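
&lt;p&gt;As an aside, Keras also accepts string identifiers for common losses, optimizers, and metrics. Here's a minimal sketch with a tiny stand-in model (&lt;code&gt;demo_model&lt;/code&gt; is hypothetical, just for illustration) showing the shorthand equivalent of the compile step in &lt;code&gt;create_model()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# A tiny stand-in model, just to demonstrate the string shortcuts
demo_model = tf.keras.Sequential([tf.keras.layers.Dense(120, activation="softmax")])

# Equivalent to passing the explicit objects used in create_model()
demo_model.compile(loss="categorical_crossentropy",
                   optimizer="adam",
                   metrics=["accuracy"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;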

&lt;h4&gt;
  
  
  Building the model
&lt;/h4&gt;

&lt;p&gt;We use &lt;code&gt;model.build()&lt;/code&gt; whenever we're using a layer from TensorFlow Hub to tell our model what input shape it can expect.&lt;/p&gt;

&lt;p&gt;In this case, the input shape is &lt;code&gt;[None, IMG_SIZE, IMG_SIZE, 3]&lt;/code&gt; or &lt;code&gt;[None, 224, 224, 3]&lt;/code&gt; or &lt;code&gt;[batch_size, img_height, img_width, color_channels]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Batch size is left as &lt;code&gt;None&lt;/code&gt; because it's inferred from the data we pass the model. In our case, it'll be 32, since that's what we've set up.&lt;/p&gt;

&lt;p&gt;Now that we've gone through each section of the function, let's use it to create a model.&lt;/p&gt;

&lt;p&gt;We can call &lt;code&gt;summary()&lt;/code&gt; on our model to get an idea of what our model looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building model with: https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 keras_layer (KerasLayer)    (None, 1001)              5432713   

 dense (Dense)               (None, 120)               120240    

=================================================================
Total params: 5,552,953
Trainable params: 120,240
Non-trainable params: 5,432,713
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-trainable parameters are the patterns learned by &lt;code&gt;mobilenet_v2_130_224&lt;/code&gt; and the trainable parameters are the ones in the dense layer we added.&lt;/p&gt;

&lt;p&gt;This means the bulk of the information in our model has already been learned, and we're going to take that and adapt it to our own problem.&lt;/p&gt;
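
&lt;p&gt;As a quick optional sanity check (a minimal sketch, assuming the &lt;code&gt;model&lt;/code&gt; we just built above), we can confirm which layers are trainable and that the &lt;code&gt;None&lt;/code&gt; batch dimension accepts whatever batch we pass in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Which layers will actually be updated during training?
for layer in model.layers:
  print(layer.name, "- trainable:", layer.trainable)
# keras_layer - trainable: False  (the frozen mobilenet_v2 patterns)
# dense - trainable: True         (our new output layer)

# The None batch dimension is filled in by whatever batch we pass
dummy_batch = tf.random.uniform([32, 224, 224, 3])  # a fake batch of 32 "images"
print(model(dummy_batch).shape)  # (32, 120)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;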

&lt;h3&gt;
  
  
  Creating callbacks
&lt;/h3&gt;

&lt;p&gt;We've got a model ready to go, but before we train it, we'll make some callbacks.&lt;/p&gt;

&lt;p&gt;Callbacks are helper functions a model can use during training to do things such as save a model's progress or stop training early if a model stops improving.&lt;/p&gt;

&lt;p&gt;The two callbacks we're going to add are a TensorBoard callback and an Early Stopping callback.&lt;/p&gt;

&lt;h4&gt;
  
  
  TensorBoard callback
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.tensorflow.org/tensorboard/get_started"&gt;TensorBoard&lt;/a&gt; helps provide a visual way to monitor the progress of our model during and after training.&lt;/p&gt;

&lt;p&gt;It can be used directly in a notebook to track the performance measures of a model such as loss and accuracy.&lt;/p&gt;

&lt;p&gt;To set up a TensorBoard callback and view TensorBoard in a notebook, we need to do 3 things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load the TensorBoard notebook extension.&lt;/li&gt;
&lt;li&gt;Create a TensorBoard callback which is able to save logs to a directory and pass it to our model's &lt;code&gt;fit()&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Visualize our model's training logs with the &lt;code&gt;%tensorboard&lt;/code&gt; magic function (we'll do this after model training).
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load TensorBoard notebook extension
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;load_ext&lt;/span&gt; &lt;span class="n"&gt;tensorboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function to build a TensorBoard callback
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_tensorboard_callback&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="c1"&gt;# Create a log directory for storing TensorBoard logs
&lt;/span&gt;  &lt;span class="n"&gt;logdir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/logs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="c1"&gt;# Make it so the logs get tracked whenever we run an experiment
&lt;/span&gt;                        &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'%Y%m%d-%H%M%S'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TensorBoard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logdir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Early stopping callback
&lt;/h4&gt;

&lt;p&gt;Early stopping helps prevent overfitting by stopping a model when a certain evaluation metric stops improving. If a model trains for too long, it can do so well at finding patterns in a certain dataset that it's not able to use those patterns on another dataset it hasn't seen before (the model doesn't generalize).&lt;/p&gt;

&lt;p&gt;It's basically like saying to our model, "keep finding patterns until the quality of those patterns starts to go down."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create early stopping callback
&lt;/span&gt;&lt;span class="n"&gt;early_stopping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"val_accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                  &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# stops after 3 rounds of no improvements
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
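
&lt;p&gt;An optional variant (not used in this notebook): &lt;code&gt;EarlyStopping&lt;/code&gt; can also roll the model back to the weights from its best epoch once training stops, via &lt;code&gt;restore_best_weights=True&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

# Optional variant: keep the weights from the best epoch, not the last one
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                                  patience=3,
                                                  restore_best_weights=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;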



&lt;h3&gt;
  
  
  Training a model (on subset of data)
&lt;/h3&gt;

&lt;p&gt;Our first model is only going to be trained on 1,000 images. Or rather, trained on 800 images and then validated on 200 images, meaning 1,000 images in total or about 10% of the total data.&lt;/p&gt;

&lt;p&gt;We do this to make sure everything is working. And if it is, we can step it up later and train on the entire training dataset.&lt;/p&gt;

&lt;p&gt;The final parameter we'll define before training is &lt;code&gt;NUM_EPOCHS&lt;/code&gt; (also known as number of epochs).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NUM_EPOCHS&lt;/code&gt; defines how many passes over the data we'd like our model to do. A pass is our model trying to find patterns in each dog image and seeing which patterns relate to each label.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;NUM_EPOCHS=1&lt;/code&gt;, the model will only look at the data once and will probably score badly, because it hasn't had a chance to correct itself. It would be like you competing in the international hill descent championships and your friend Adam only being able to give you a single instruction to get down the hill.&lt;/p&gt;

&lt;p&gt;What's a good value for &lt;code&gt;NUM_EPOCHS&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;This one is hard to say. 10 could be a good start, but so could 100. This is one of the reasons we created an early stopping callback. Having early stopping set up means that if we set &lt;code&gt;NUM_EPOCHS&lt;/code&gt; to 100 but our model stops improving after 22 epochs, training will stop automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a function that trains a model. The function will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a model using &lt;code&gt;create_model()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set up a TensorBoard callback using &lt;code&gt;create_tensorboard_callback()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Call the &lt;code&gt;fit()&lt;/code&gt; function on our model passing it the training data, validation data, number of epochs to train for (&lt;code&gt;NUM_EPOCHS&lt;/code&gt;) and the callbacks we'd like to use.&lt;/li&gt;
&lt;li&gt;Return the fitted model.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build a function to train and return the trained model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Trains a given model and returns the trained version.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# create a model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# Create new TensorBoard session every time we train a model
&lt;/span&gt;  &lt;span class="n"&gt;tensorboard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_tensorboard_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# Fit the model to the data passing it the callbacks we created
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;validation_freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# check validation metrics every epoch
&lt;/span&gt;            &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tensorboard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;early_stopping&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When training a model for the first time, the first epoch will take a while compared to the rest. This is because the model is getting set up and the data is being initialized. Using more data will generally take longer, which is why we've started with ~1,000 images. After the first epoch, subsequent epochs should take only a few seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fit the model to the data
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building model with: https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5
Epoch 1/100
25/25 [==============================] - 212s 8s/step - loss: 4.4983 - accuracy: 0.0913 - val_loss: 3.3317 - val_accuracy: 0.2500
Epoch 2/100
25/25 [==============================] - 5s 183ms/step - loss: 1.5892 - accuracy: 0.7038 - val_loss: 2.0969 - val_accuracy: 0.4950
Epoch 3/100
25/25 [==============================] - 5s 182ms/step - loss: 0.5362 - accuracy: 0.9525 - val_loss: 1.6433 - val_accuracy: 0.6000
Epoch 4/100
25/25 [==============================] - 5s 181ms/step - loss: 0.2430 - accuracy: 0.9875 - val_loss: 1.4789 - val_accuracy: 0.6250
Epoch 5/100
25/25 [==============================] - 5s 184ms/step - loss: 0.1421 - accuracy: 0.9975 - val_loss: 1.4015 - val_accuracy: 0.6450
Epoch 6/100
25/25 [==============================] - 4s 174ms/step - loss: 0.0988 - accuracy: 1.0000 - val_loss: 1.3711 - val_accuracy: 0.6450
Epoch 7/100
25/25 [==============================] - 5s 186ms/step - loss: 0.0745 - accuracy: 1.0000 - val_loss: 1.3289 - val_accuracy: 0.6500
Epoch 8/100
25/25 [==============================] - 4s 175ms/step - loss: 0.0589 - accuracy: 1.0000 - val_loss: 1.3003 - val_accuracy: 0.6600
Epoch 9/100
25/25 [==============================] - 7s 268ms/step - loss: 0.0485 - accuracy: 1.0000 - val_loss: 1.2854 - val_accuracy: 0.6700
Epoch 10/100
25/25 [==============================] - 6s 226ms/step - loss: 0.0409 - accuracy: 1.0000 - val_loss: 1.2699 - val_accuracy: 0.6700
Epoch 11/100
25/25 [==============================] - 4s 178ms/step - loss: 0.0353 - accuracy: 1.0000 - val_loss: 1.2561 - val_accuracy: 0.6650
Epoch 12/100
25/25 [==============================] - 5s 192ms/step - loss: 0.0307 - accuracy: 1.0000 - val_loss: 1.2476 - val_accuracy: 0.6700
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like our model is overfitting (getting far better results on the training set than on the validation set), so we should look into some ways to prevent overfitting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Overfitting to begin with is a good thing. It means our model is learning something.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking the TensorBoard logs
&lt;/h3&gt;

&lt;p&gt;The TensorBoard magic function (&lt;code&gt;%tensorboard&lt;/code&gt;) will access the logs directory we created earlier and visualize its content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;tensorboard&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;logdir&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;MyDrive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ML&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Dog&lt;/span&gt;\ &lt;span class="n"&gt;Vision&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thanks to our &lt;code&gt;early_stopping&lt;/code&gt; callback, the model stopped training after 12 epochs (in my case, yours might be slightly different). This is because the validation accuracy failed to improve for 3 epochs.&lt;/p&gt;

&lt;p&gt;But the good news is, we can definitely see that our model is learning something. The validation accuracy got to 67% in only a few minutes.&lt;/p&gt;

&lt;p&gt;This means, if we were to scale up the number of images, hopefully we'd see the accuracy increase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making and evaluating predictions using a trained model
&lt;/h2&gt;

&lt;p&gt;Before we scale up and train on more data, let's look at some other ways to evaluate our model. Although accuracy is a pretty good indicator of how our model is doing, it would be even better if we could see it in action.&lt;/p&gt;

&lt;p&gt;Making predictions with a trained model is as simple as calling &lt;code&gt;predict()&lt;/code&gt; on it and passing it data in the same format the model was trained on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on the validation data (not used to train on)
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# verbose shows us how long there is to go
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7/7 [==============================] - 2s 149ms/step
array([[6.22897118e-04, 1.67990889e-04, 8.79216474e-04, ...,
        7.09853193e-04, 5.02764196e-05, 1.65839761e-03],
       [2.67735624e-04, 2.39910398e-04, 1.09814322e-02, ...,
        9.64251813e-05, 7.48365244e-04, 7.03895203e-05],
       [1.96330846e-04, 1.39784577e-04, 4.39772011e-05, ...,
        2.66594667e-04, 2.05397533e-04, 2.26700475e-04],
       ...,
       [2.78241873e-06, 8.37749758e-05, 4.92490479e-04, ...,
        3.37603960e-05, 6.75940537e-04, 1.90498744e-04],
       [2.50874972e-03, 1.12199872e-04, 5.24179341e-05, ...,
        7.15124188e-05, 2.62187295e-05, 1.73446781e-03],
       [1.17617834e-04, 7.77364112e-05, 1.37082185e-04, ...,
        1.58314139e-03, 8.59645079e-04, 2.46757936e-05]], dtype=float32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check the shape of predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(200, 120)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Making predictions with our model returns an array with a different value for each label.&lt;/p&gt;

&lt;p&gt;In this case, making predictions on the validation data (200 images) returns an array (&lt;code&gt;predictions&lt;/code&gt;) of arrays, each containing 120 different values (one for each unique dog breed).&lt;/p&gt;

&lt;p&gt;These values are the probabilities, or likelihoods, the model assigns to a given image being each breed of dog. The higher the value, the more likely the model thinks the image is of that breed.&lt;/p&gt;

&lt;p&gt;Let's see how we'd convert an array of probabilities into an actual label.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# First prediction
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Max value (probability of prediction): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Sum: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Max index: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Predicted label: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[6.2289712e-04 1.6799089e-04 8.7921647e-04 1.8631003e-04 5.3724495e-04
 3.2465730e-04 2.2423903e-02 4.9657328e-04 3.0789597e-05 4.2850631e-03
 1.8981443e-04 1.0261026e-04 6.6824240e-04 2.1106268e-04 3.2271945e-04
 2.0372072e-04 2.2583949e-05 1.3980259e-01 3.2590338e-05 4.0915584e-05
 1.1816905e-03 2.7455544e-04 4.0372779e-05 6.4781279e-04 8.5824677e-06
 5.7637418e-04 4.2180789e-01 3.6451096e-05 2.1398282e-03 1.8662729e-04
 5.3795295e-05 1.1816223e-03 1.5667198e-03 1.0925717e-05 2.2503540e-05
 1.3354327e-02 3.7860959e-06 3.5932841e-04 1.2963214e-04 2.9137873e-04
 2.3260089e-03 3.3095494e-06 1.5097676e-05 4.4024091e-05 4.1719111e-05
 1.4688818e-04 5.0045412e-05 1.1059161e-04 7.7821646e-04 3.0642765e-04
 3.9216696e-04 1.4359584e-05 5.3910341e-04 6.1444713e-05 1.7364083e-04
 3.5276284e-05 1.3722615e-04 9.0744550e-04 3.9121576e-04 1.9728519e-02
 2.5874743e-04 9.5249598e-05 3.4022541e-04 1.5865565e-04 2.8779902e-04
 6.3628376e-02 7.0957394e-05 8.5773220e-04 1.4202471e-02 1.0215643e-04
 5.1950999e-02 1.8512135e-04 7.5176729e-05 2.5200758e-02 3.5105829e-04
 1.6164074e-04 8.3191908e-04 1.5876073e-02 3.7631966e-04 1.8149629e-02
 1.6471182e-04 1.5049442e-03 3.0149630e-04 5.1044985e-03 3.1844739e-04
 8.8904443e-04 3.7882995e-04 2.3105989e-04 1.8062179e-04 1.4020882e-03
 1.0508097e-03 2.2645481e-04 6.8475410e-06 2.5964212e-03 8.1505415e-05
 1.9787325e-04 9.7168679e-04 9.0247270e-04 4.6203140e-04 1.0710631e-04
 2.3471296e-02 9.6860625e-05 1.8919840e-02 5.8173094e-02 3.1562344e-05
 7.0314546e-04 2.0326689e-02 8.7760345e-05 2.7958115e-04 1.8794164e-02
 3.9155426e-04 6.9520087e-05 5.1070543e-05 2.0120994e-04 4.8968988e-04
 3.7422600e-05 3.5885598e-03 7.0985319e-04 5.0276420e-05 1.6583976e-03]
Max value (probability of prediction): 0.4218078851699829
Sum: 1.0
Max index: 26
Predicted label: cairn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having this information is great but it would be even better if we could compare a prediction to its true label and original image.&lt;/p&gt;

&lt;p&gt;To help us, let's first build a little function to convert prediction probabilities into predicted labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Prediction probabilities are also known as confidence levels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn prediction probabilities into their respective label
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Turns an array of prediction probabilities into a label.
  """&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# Get a predicted label based on an array of prediction probabilities
&lt;/span&gt;&lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;pred_label&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'cairn'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got our model's predictions in a usable form, let's do the same for the validation images and validation labels.&lt;/p&gt;

&lt;p&gt;The model hasn't trained on the validation data; during the &lt;code&gt;fit()&lt;/code&gt; call, it only used the validation data to evaluate itself. So we can use the validation images to visually compare our model's predictions with the validation labels.&lt;/p&gt;

&lt;p&gt;Since our validation data (&lt;code&gt;val_data&lt;/code&gt;) is in batch form, to get a list of validation images and labels, we'll have to unbatch it (using &lt;code&gt;unbatch()&lt;/code&gt;) and then turn it into an iterator using &lt;code&gt;as_numpy_iterator()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's make a small function to do so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a function to unbatch a batched dataset
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unbatchify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Takes a batched dataset of (image, label) Tensors and returns separate arrays
  of images and labels.
  """&lt;/span&gt;
  &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="c1"&gt;# Loop through unbatched data
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unbatch&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;

&lt;span class="c1"&gt;# Unbatchify the validation data
&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unbatchify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(array([[[0.29599646, 0.43284872, 0.3056691 ],
         [0.26635826, 0.32996926, 0.22846507],
         [0.31428418, 0.2770141 , 0.22934894],
         ...,
         [0.77614343, 0.82320225, 0.8101595 ],
         [0.81291157, 0.8285351 , 0.8406944 ],
         [0.8209297 , 0.8263737 , 0.8423668 ]],

        [[0.2344871 , 0.31603682, 0.19543913],
         [0.3414841 , 0.36560842, 0.27241898],
         [0.45016077, 0.40117094, 0.33964607],
         ...,
         [0.7663987 , 0.8134138 , 0.81350833],
         [0.7304248 , 0.75012016, 0.76590735],
         [0.74518913, 0.76002574, 0.7830809 ]],

        [[0.30157745, 0.3082587 , 0.21018331],
         [0.2905954 , 0.27066195, 0.18401104],
         [0.4138316 , 0.36170745, 0.2964005 ],
         ...,
         [0.79871625, 0.8418535 , 0.8606443 ],
         [0.7957738 , 0.82859945, 0.8605655 ],
         [0.75181633, 0.77904975, 0.8155256 ]],

        ...,

        [[0.9746779 , 0.9878955 , 0.9342279 ],
         [0.99153054, 0.99772066, 0.9427856 ],
         [0.98925114, 0.9792082 , 0.9137934 ],
         ...,
         [0.0987601 , 0.0987601 , 0.0987601 ],
         [0.05703771, 0.05703771, 0.05703771],
         [0.03600177, 0.03600177, 0.03600177]],

        [[0.98197854, 0.9820659 , 0.9379411 ],
         [0.9811992 , 0.97015417, 0.9125648 ],
         [0.9722316 , 0.93666023, 0.8697186 ],
         ...,
         [0.09682598, 0.09682598, 0.09682598],
         [0.07196062, 0.07196062, 0.07196062],
         [0.0361607 , 0.0361607 , 0.0361607 ]],

        [[0.97279435, 0.9545954 , 0.92389745],
         [0.963602  , 0.93199134, 0.88407487],
         [0.9627158 , 0.9125331 , 0.8460338 ],
         ...,
         [0.08394483, 0.08394483, 0.08394483],
         [0.0886985 , 0.0886985 , 0.0886985 ],
         [0.04514172, 0.04514172, 0.04514172]]], dtype=float32), 'cairn')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we've got ways to get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prediction labels&lt;/li&gt;
&lt;li&gt;Validation labels (truth labels)&lt;/li&gt;
&lt;li&gt;Validation images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's make some functions to make these all a bit more visual.&lt;/p&gt;

&lt;p&gt;More specifically, we want to be able to view an image, its predicted label and its actual label (true label).&lt;/p&gt;

&lt;p&gt;The first function we'll create will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take an array of prediction probabilities, an array of truth labels, an array of images and an integer.&lt;/li&gt;
&lt;li&gt;Convert the prediction probabilities to a predicted label.&lt;/li&gt;
&lt;li&gt;Plot the predicted label, its predicted probability, the truth label and target image on a single plot.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_pred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  View the prediction, ground truth, and image for sample n.
  """&lt;/span&gt;
  &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# get the pred label
&lt;/span&gt;  &lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Plot image and remove ticks
&lt;/span&gt;  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;

  &lt;span class="c1"&gt;# Change the color of the title depending on whether the prediction is right or wrong
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"green"&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"red"&lt;/span&gt;

  &lt;span class="c1"&gt;# Change plot title
&lt;/span&gt;  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Predicted breed: {}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; Probability: {:2.0f}%&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; Actual breed: {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# View an example prediction, original image and truth label
&lt;/span&gt;&lt;span class="n"&gt;plot_pred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yuzw-QMm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xxbgikgyku3rh3mbso4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yuzw-QMm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xxbgikgyku3rh3mbso4s.png" alt="Prediction with true label" width="235" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Making functions to help visualize our model's results is really helpful in understanding how our model is doing.&lt;/p&gt;

&lt;p&gt;Since we're working with a multi-class problem, it would also be good to see what other guesses our model is making. More specifically, if our model predicts a certain label with 24% probability, what else did it predict?&lt;/p&gt;

&lt;p&gt;Let's build a function to demonstrate this. The function will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take as input a prediction probabilities array, a ground truth labels array and an integer.&lt;/li&gt;
&lt;li&gt;Find the predicted label using &lt;code&gt;get_pred_label()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Find the top 10:

&lt;ul&gt;
&lt;li&gt;Prediction probabilities indexes&lt;/li&gt;
&lt;li&gt;Prediction probabilities values&lt;/li&gt;
&lt;li&gt;Prediction labels&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Plot the top 10 prediction probability values and labels, coloring the true label green.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_pred_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Plots the top 10 highest prediction confidences along with the truth label for sample n.
  """&lt;/span&gt;
  &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Get the predicted label
&lt;/span&gt;  &lt;span class="n"&gt;pred_label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Find the top 10 prediction confidence indexes
&lt;/span&gt;  &lt;span class="n"&gt;top_10_pred_indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Find the top 10 pred confidence values
&lt;/span&gt;  &lt;span class="n"&gt;top_10_pred_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_prob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_indexes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Find the top 10 prediction labels
&lt;/span&gt;  &lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unique_breeds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_indexes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Setup plot
&lt;/span&gt;  &lt;span class="n"&gt;top_plot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                     &lt;span class="n"&gt;top_10_pred_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"grey"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
             &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"vertical"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Change the color of true label
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;top_plot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top_10_pred_labels&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;true_label&lt;/span&gt;&lt;span class="p"&gt;)].&lt;/span&gt;&lt;span class="n"&gt;set_color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"green"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plot_pred_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L92AjdaS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/luicooibactncamrtb7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L92AjdaS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/luicooibactncamrtb7r.png" alt="Bar chart of top 10 predictions" width="372" height="343"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's check a few predictions and their different values
&lt;/span&gt;&lt;span class="n"&gt;i_multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;num_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;num_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;num_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plot_pred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;i_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plot_pred_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;i_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h_pad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F-63RUW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3787bkspgpksy9zy9g2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F-63RUW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3787bkspgpksy9zy9g2z.png" alt="Confidence values for different images" width="880" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Saving and reloading a model
&lt;/h2&gt;

&lt;p&gt;After training a model, it's a good idea to save it. Saving a model means we can share it with colleagues, put it in an application and, more importantly, avoid the potentially expensive step of retraining it.&lt;/p&gt;

&lt;p&gt;An entire Keras model can be saved in the HDF5 (h5) format. So we'll make a function which takes a model as input and uses the &lt;code&gt;save()&lt;/code&gt; method to save it as an h5 file to a specified directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Saves a given model in a models directory and appends a suffix (str)
  for clarity and reuse.
  """&lt;/span&gt;
  &lt;span class="c1"&gt;# Create model directory with current time
&lt;/span&gt;  &lt;span class="n"&gt;modeldir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%Y%m%d-%H%M%s"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# note: lowercase %s appends Unix epoch seconds, hence the long timestamps in the saved filenames below&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modeldir&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".h5"&lt;/span&gt; &lt;span class="c1"&gt;# save format of model
&lt;/span&gt;  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Saving model to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got a saved model, we'd like to be able to load it. Let's create a function which takes a model path and uses the &lt;code&gt;tf.keras.models.load_model()&lt;/code&gt; function to load it back into the notebook.&lt;/p&gt;

&lt;p&gt;Because we're using a component from TensorFlow Hub (&lt;code&gt;hub.KerasLayer&lt;/code&gt;), we'll have to pass it to the &lt;code&gt;custom_objects&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="s"&gt;"""
  Loads a saved model from a specified path.
  """&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Loading saved model from: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;custom_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"KerasLayer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KerasLayer&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save our model trained on 1000 images
&lt;/span&gt;&lt;span class="n"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"1000-images-mobilenetv2-Adam"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load our model trained on 1000 images
&lt;/span&gt;&lt;span class="n"&gt;model_1000_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'drive/MyDrive/ML/Dog Vision/models/20220325-04411648183299-1000-images-mobilenetv2-Adam.h5'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training a model (on the full dataset)
&lt;/h2&gt;

&lt;p&gt;Now that we know our model works on a subset of the data, we can move forward with training one on the full dataset.&lt;/p&gt;

&lt;p&gt;Above, we saved all of the training filepaths to &lt;code&gt;X&lt;/code&gt; and all of the training labels to &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We've got over 10,000 images and labels in our training set.&lt;/p&gt;

&lt;p&gt;Before we can train a model on these, we'll have to turn them into a data batch.&lt;/p&gt;

&lt;p&gt;We can use our &lt;code&gt;create_data_batches()&lt;/code&gt; function from above which also preprocesses our images for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn the full training data into a data batch
&lt;/span&gt;&lt;span class="n"&gt;full_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our data is now in data batches; all we need is a model.&lt;/p&gt;

&lt;p&gt;Let's use &lt;code&gt;create_model()&lt;/code&gt; to instantiate another model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Instantiate a new model for training on the full dataset
&lt;/span&gt;&lt;span class="n"&gt;full_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we've made a new model instance, &lt;code&gt;full_model&lt;/code&gt;, we'll need some callbacks too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create full model callbacks
&lt;/span&gt;
&lt;span class="c1"&gt;# TensorBoard callback
&lt;/span&gt;&lt;span class="n"&gt;full_model_tensorboard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_tensorboard_callback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Early stopping callback
# Note: No validation set when training on all the data, therefore can't monitor validation accuracy
&lt;/span&gt;&lt;span class="n"&gt;full_model_early_stopping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EarlyStopping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                             &lt;span class="n"&gt;patience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To monitor the model whilst it trains, we'll load TensorBoard (it should update every 30 seconds or so during training).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;tensorboard&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;logdir&lt;/span&gt; &lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;My&lt;/span&gt;\ &lt;span class="n"&gt;Drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since running the cell below will cause the model to train on all of the data (10,000+ images), it may take a fairly long time to get started and finish. However, thanks to our &lt;code&gt;full_model_early_stopping&lt;/code&gt; callback, it'll stop training automatically if accuracy stops improving.&lt;/p&gt;

&lt;p&gt;The first epoch is always the longest, as the data gets loaded into memory. Once it's there, training speeds up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fit the full model to the full training data
&lt;/span&gt;&lt;span class="n"&gt;full_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;full_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;callbacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;full_model_tensorboard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;full_model_early_stopping&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Epoch 1/100
320/320 [==============================] - 57s 163ms/step - loss: 1.3450 - accuracy: 0.6682
Epoch 2/100
320/320 [==============================] - 52s 162ms/step - loss: 0.3995 - accuracy: 0.8813
Epoch 3/100
320/320 [==============================] - 52s 163ms/step - loss: 0.2371 - accuracy: 0.9335
Epoch 4/100
320/320 [==============================] - 49s 152ms/step - loss: 0.1529 - accuracy: 0.9647
Epoch 5/100
320/320 [==============================] - 51s 159ms/step - loss: 0.1060 - accuracy: 0.9785
Epoch 6/100
320/320 [==============================] - 52s 163ms/step - loss: 0.0775 - accuracy: 0.9873
Epoch 7/100
320/320 [==============================] - 56s 175ms/step - loss: 0.0602 - accuracy: 0.9913
Epoch 8/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0476 - accuracy: 0.9943
Epoch 9/100
320/320 [==============================] - 57s 178ms/step - loss: 0.0369 - accuracy: 0.9966
Epoch 10/100
320/320 [==============================] - 58s 180ms/step - loss: 0.0311 - accuracy: 0.9971
Epoch 11/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0264 - accuracy: 0.9977
Epoch 12/100
320/320 [==============================] - 58s 182ms/step - loss: 0.0222 - accuracy: 0.9977
Epoch 13/100
320/320 [==============================] - 55s 171ms/step - loss: 0.0199 - accuracy: 0.9984
Epoch 14/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0172 - accuracy: 0.9987
Epoch 15/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0165 - accuracy: 0.9983
Epoch 16/100
320/320 [==============================] - 57s 179ms/step - loss: 0.0136 - accuracy: 0.9990
Epoch 17/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0153 - accuracy: 0.9983
Epoch 18/100
320/320 [==============================] - 57s 178ms/step - loss: 0.0148 - accuracy: 0.9979
Epoch 19/100
320/320 [==============================] - 58s 181ms/step - loss: 0.0123 - accuracy: 0.9985
&amp;lt;keras.callbacks.History at 0x7fcb67ee7950&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Saving and reloading the full model
&lt;/h3&gt;

&lt;p&gt;Even on a GPU, our full model took a while to train. So it's a good idea to save it.&lt;/p&gt;

&lt;p&gt;We can do so using our &lt;code&gt;save_model()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; It may be a good idea to incorporate the &lt;code&gt;save_model()&lt;/code&gt; function into a &lt;code&gt;train_model()&lt;/code&gt; function. Or look into setting up a checkpoint callback.&lt;br&gt;
&lt;/p&gt;
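
&lt;p&gt;For reference, a checkpoint callback could look something like the minimal sketch below. This isn't from the original notebook, and the checkpoint path is a placeholder. &lt;code&gt;tf.keras.callbacks.ModelCheckpoint&lt;/code&gt; saves the model automatically during training, so a crashed runtime doesn't cost us the whole run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal checkpoint sketch (hypothetical path, not from the original notebook)
checkpoint_path = "drive/MyDrive/ML/Dog Vision/models/checkpoint.h5"
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                      monitor="accuracy",  # no validation set on the full data
                                                      save_best_only=True,
                                                      verbose=1)

# It would then be passed alongside the other callbacks, e.g.:
# full_model.fit(x=full_data, epochs=NUM_EPOCHS,
#                callbacks=[full_model_tensorboard, full_model_early_stopping, model_checkpoint])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;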

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save model to file
&lt;/span&gt;&lt;span class="n"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"all-images-Adam"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load in the full model
&lt;/span&gt;&lt;span class="n"&gt;loaded_full_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'drive/MyDrive/ML/Dog Vision/models/20220325-05281648186092-all-images-Adam.h5'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making predictions on the test dataset
&lt;/h3&gt;

&lt;p&gt;Since our model has been trained on images in the form of Tensor batches, to make predictions on the test data, we'll have to get it into the same format.&lt;/p&gt;

&lt;p&gt;We created &lt;code&gt;create_data_batches()&lt;/code&gt; earlier which can take a list of filenames as input and convert them into Tensor batches.&lt;/p&gt;

&lt;p&gt;To make predictions on the test data, we'll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the test image filenames.&lt;/li&gt;
&lt;li&gt;Convert the filenames into test data batches using &lt;code&gt;create_data_batches()&lt;/code&gt; and setting the &lt;code&gt;test_data&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt; (since there are no labels with the test images).&lt;/li&gt;
&lt;li&gt;Make a predictions array by passing the test data batches to the &lt;code&gt;predict()&lt;/code&gt; function.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load test image filenames (since we're using os.listdir(), these already have .jpg)
&lt;/span&gt;&lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/test/"&lt;/span&gt;
&lt;span class="n"&gt;test_filenames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_path&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;test_filenames&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['drive/MyDrive/ML/Dog Vision/test/e5f2204119380ce1a17fd09435c5012a.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e7ce78e874945f182a4f5149aa505b09.jpg',
 'drive/MyDrive/ML/Dog Vision/test/de6cc38e54a460dd34c53b74f022a8da.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e7b608110b0e29120d8740f37e85f3d0.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e66a91249a4979a86db48e5c64b81a88.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e17defebd1b8fc39e9c3c10df3c2e3de.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e3baf6b2914677edd2729db0f32e2620.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e08d42b2e6f2dbcf24c6bfee8b7d03bd.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e2b24cea9d0796ffad73cb24eab1a3f6.jpg',
 'drive/MyDrive/ML/Dog Vision/test/e137b0cd96051765c349377725c4696d.jpg']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# How many test images are there?
&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_filenames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10357
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create test data batch
&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_filenames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Since there are 10,000+ test images, making predictions could take a while, even on a GPU. Be aware that running the cell below may take up to an hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on test data batch using the loaded full model
&lt;/span&gt;&lt;span class="n"&gt;test_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loaded_full_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                             &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;324/324 [==============================] - 1132s 3s/step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save predictions (NumPy array) to csv file
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;savetxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/preds_array.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load predictions (NumPy array) from csv file
&lt;/span&gt;&lt;span class="n"&gt;test_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loadtxt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/preds_array.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check out the test predictions
&lt;/span&gt;&lt;span class="n"&gt;test_predictions&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[2.77832507e-10, 3.47354963e-08, 1.59594504e-10, ...,
        3.36205460e-07, 1.01586806e-09, 1.35463404e-10],
       [1.06715709e-06, 9.80697884e-11, 1.83814009e-05, ...,
        5.81553738e-09, 1.35275069e-09, 9.25533868e-07],
       [8.79199422e-11, 7.13828712e-08, 3.49000935e-08, ...,
        5.00790861e-07, 3.85421224e-08, 9.47958489e-10],
       ...,
       [2.41431576e-13, 9.99975681e-01, 7.91099131e-11, ...,
        1.21032284e-09, 1.01821096e-09, 8.52753868e-09],
       [1.11224371e-14, 5.69826790e-11, 4.86594055e-12, ...,
        9.99895334e-01, 3.12582422e-08, 3.12060724e-11],
       [3.74029030e-09, 6.98669282e-08, 6.67546445e-08, ...,
        1.31928124e-09, 3.12934681e-05, 1.51918755e-07]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Making predictions on custom images
&lt;/h3&gt;

&lt;p&gt;It's great being able to make predictions on a test dataset already provided for us.&lt;/p&gt;

&lt;p&gt;But how could we use our model on our own images?&lt;/p&gt;

&lt;p&gt;The premise remains the same: if we want to make predictions on our own custom images, we have to pass them to the model in the same format it was trained on.&lt;/p&gt;

&lt;p&gt;To do so, we'll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the filepaths of our own images.&lt;/li&gt;
&lt;li&gt;Turn the filepaths into data batches using &lt;code&gt;create_data_batches()&lt;/code&gt;. And since our custom images won't have labels, we set the &lt;code&gt;test_data&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pass the custom image data batch to our model's &lt;code&gt;predict()&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;Convert the prediction output probabilities to prediction labels.&lt;/li&gt;
&lt;li&gt;Compare the predicted labels to the custom images.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; To make predictions on custom images, I've uploaded pictures to a directory located at &lt;code&gt;drive/MyDrive/ML/Dog Vision/my-dogs/&lt;/code&gt; (as seen in the cell below).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get custom image filepaths
&lt;/span&gt;&lt;span class="n"&gt;custom_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"drive/MyDrive/ML/Dog Vision/my-dogs/"&lt;/span&gt;
&lt;span class="n"&gt;custom_image_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;custom_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fname&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_path&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn custom image into batch (set to test data because there are no labels)
&lt;/span&gt;&lt;span class="n"&gt;custom_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_data_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_image_paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on the custom data
&lt;/span&gt;&lt;span class="n"&gt;custom_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loaded_full_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've got some prediction arrays, let's convert them to labels and compare them with each image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get custom image prediction labels
&lt;/span&gt;&lt;span class="n"&gt;custom_pred_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_pred_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_preds&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;span class="n"&gt;custom_pred_labels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['boxer',
 'bull_mastiff',
 'american_staffordshire_terrier',
 'staffordshire_bullterrier',
 'maltese_dog',
 'labrador_retriever']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get custom images (our unbatchify() function won't work since there aren't labels)
&lt;/span&gt;&lt;span class="n"&gt;custom_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="c1"&gt;# Loop through unbatched data
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;custom_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unbatch&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;as_numpy_iterator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="n"&gt;custom_images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check custom image predictions
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom_pred_labels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CJrnSMe---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fqn2wsbfggtjhkmmfnwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CJrnSMe---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fqn2wsbfggtjhkmmfnwv.png" alt="Custom images with predicted label" width="489" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;We've just gone end-to-end on a multi-class image classification problem!&lt;/p&gt;

&lt;p&gt;This is the same style of problem self-driving cars have, except with different data.&lt;/p&gt;

&lt;p&gt;We've got plenty of options on where to go next.&lt;/p&gt;

&lt;p&gt;We could try to improve the full model we trained in this notebook in a few ways. Since our early experiment (using only 1,000 images) hinted at our model overfitting, one goal going forward would be to try and prevent it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://tfhub.dev/"&gt;Trying another model from TensorFlow Hub&lt;/a&gt; - Perhaps a different model would perform better on our dataset. One option would be to experiment with a different pretrained model from TensorFlow Hub or look into the &lt;code&gt;tf.keras.applications&lt;/code&gt; module.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://bair.berkeley.edu/blog/2019/06/07/data_aug/"&gt;Data augmentation&lt;/a&gt; - Take the training images and manipulate (crop, resize) or distort them (flip, rotate) to create even more training data for the model to learn from. Check out the &lt;a href="https://www.tensorflow.org/api_docs/python/tf/image"&gt;TensorFlow images&lt;/a&gt; documentation for a whole bunch of functions we can use on images. A great idea would be to try and replicate the techniques in &lt;a href="https://github.com/google/eng-edu/blob/master/ml/pc/exercises/image_classification_part2.ipynb"&gt;this example cat vs. dog image classification notebook&lt;/a&gt; for our dog breeds problem. See the sketch after this list for a starting point.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.tensorflow.org/hub/tf2_saved_model#fine-tuning"&gt;Fine-tuning&lt;/a&gt; - The model we used in this notebook was directly from TensorFlow Hub, we took what it had already learned from another dataset (ImageNet) and applied it to our own. Another option is to use what the model already knows and fine-tune this knowledge to our own dataset (pictures of dogs). This would mean all of the patterns within the model would be updated to be more specific to pictures of dogs rather than general images.&lt;/li&gt;
&lt;/ol&gt;
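
&lt;p&gt;As promised, here's a minimal data augmentation sketch. It's not from the original notebook: it assumes our images are float32 tensors scaled to [0, 1] (as produced by our preprocessing), and the &lt;code&gt;augment()&lt;/code&gt; function name is ours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tensorflow as tf

def augment(image, label):
  """Randomly flip and brighten an image to create extra training variety."""
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  image = tf.clip_by_value(image, 0.0, 1.0)  # keep pixel values in [0, 1]
  return image, label

# Applied to a tf.data.Dataset of (image, label) pairs, e.g.:
# augmented_data = full_data.map(augment)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For fine-tuning (option 3 above), a TensorFlow Hub layer can be created by passing &lt;code&gt;trainable=True&lt;/code&gt; to &lt;code&gt;hub.KerasLayer&lt;/code&gt;, so the pretrained weights get updated during training.&lt;/p&gt;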

&lt;p&gt;One of the best ways to find out more is to search for queries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"How to improve a TensorFlow 2.x image classification model?"&lt;/li&gt;
&lt;li&gt;"TensorFlow 2.x image classification best practices"&lt;/li&gt;
&lt;li&gt;"Transfer learning for image classification with TensorFlow 2.x"&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Machine Learning: How to Predict the Sale Price of Bulldozers</title>
      <dc:creator>Nicolas Vallée</dc:creator>
      <pubDate>Wed, 30 Mar 2022 09:18:02 +0000</pubDate>
      <link>https://dev.to/nicolasvallee/machine-learning-how-to-predict-the-sale-price-of-bulldozers-bhf</link>
      <guid>https://dev.to/nicolasvallee/machine-learning-how-to-predict-the-sale-price-of-bulldozers-bhf</guid>
      <description>&lt;p&gt;In this tutorial, we're going to walk through an example Machine Learning project where the goal is to predict the sale price of bulldozers.&lt;/p&gt;

&lt;p&gt;This kind of problem is known as a &lt;strong&gt;regression problem&lt;/strong&gt; because we're trying to determine a price, which is a continuous variable.&lt;/p&gt;

&lt;p&gt;The data is from the &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/overview"&gt;Kaggle Bluebook for Bulldozers competition&lt;/a&gt;. We'll also use the same evaluation metric, root mean square log error, or RMSLE.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This exercise is a milestone project from the course &lt;a href="https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/"&gt;Complete Machine Learning &amp;amp; Data Science Bootcamp&lt;/a&gt;, which I completed in March 2022.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can find &lt;a href="https://github.com/nicolas-vallee/ztm-ml-projects/blob/main/bulldozer-price-regression.ipynb"&gt;the Jupyter notebook&lt;/a&gt; in my GitHub repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;Let's consider how we're approaching this problem, and the different steps to solve it.&lt;/p&gt;

&lt;p&gt;We have a dataset provided to us. We'll approach the problem with the following 6-step machine learning modelling framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" alt="6 Step Machine Learning Modelling Framework" width="880" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll use pandas, Matplotlib, and NumPy for data analysis. And we'll use Scikit-Learn for machine learning and modelling tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" alt="Tools which can be used for each step of the machine learning modelling process" width="880" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After working through each step of this workflow, we'll have a trained machine learning model. And ideally, this trained model will be able to accurately predict the sale price of a bulldozer given different characteristics about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;The question that we're trying to answer is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given the characteristics of a particular bulldozer and data related to past sales of similar bulldozers, how well can we predict the future sale price of this bulldozer?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Data
&lt;/h2&gt;

&lt;p&gt;Let's have a look at the dataset from Kaggle. First, we can see that it's a time series problem: the dataset contains a time attribute (the date of each sale).&lt;/p&gt;

&lt;p&gt;In this case, we're working with historical sales data for bulldozers. The dataset includes attributes such as model type, size, sale date, and more.&lt;/p&gt;

&lt;p&gt;The 3 datasets available are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Train.csv&lt;/strong&gt; - Historical bulldozer sales data up to 2011 (close to 400,000 examples with 50+ different attributes, including &lt;code&gt;SalePrice&lt;/code&gt;, the &lt;strong&gt;target variable&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Valid.csv&lt;/strong&gt; - Historical bulldozer sales data from January 1st, 2012 to April 30th, 2012 (close to 12,000 examples with the same attributes as &lt;code&gt;Train.csv&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test.csv&lt;/strong&gt; - Historical bulldozer sales data from May 1st, 2012 to November 30th, 2012 (close to 12,000 examples, but missing the &lt;code&gt;SalePrice&lt;/code&gt; attribute, as this is what we aim to predict).&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. Evaluation
&lt;/h2&gt;

&lt;p&gt;For this competition, Kaggle has set the evaluation metric to the &lt;strong&gt;root mean squared log error (RMSLE)&lt;/strong&gt;. The goal will be to get this value as low as possible.&lt;/p&gt;

&lt;p&gt;To check how well our model is doing, we'll calculate the RMSLE and then compare our results to the &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/leaderboard"&gt;Kaggle leaderboard&lt;/a&gt;. &lt;/p&gt;




&lt;h2&gt;
  
  
  4. Features
&lt;/h2&gt;

&lt;p&gt;Features are the different parts of the data. During this step, we want to find out what we can about the data and become more familiar with it.&lt;/p&gt;

&lt;p&gt;A common way to do this is to create a &lt;strong&gt;data dictionary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For this dataset, Kaggle provides a data dictionary containing information about what each attribute of the dataset means. We can &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/download/Bnl6RAHA0enbg0UfAvGA%2Fversions%2FwBG4f35Q8mAbfkzwCeZn%2Ffiles%2FData%20Dictionary.xlsx"&gt;download this file&lt;/a&gt; directly from the Kaggle competition page (account required) or view it on Google Sheets.&lt;/p&gt;

&lt;p&gt;First, we'll import the dataset and start exploring it. And since we know the evaluation metric we're trying to minimise (RMSLE), our first goal is to build a baseline model and see how it compares against the competition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing the data and preparing it for modeling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import data analysis tools 
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've downloaded the data &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/data"&gt;from Kaggle&lt;/a&gt; and stored it under the file path &lt;code&gt;"./data/"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the training and validation set
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/TrainAndValid.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now start exploring our dataset with &lt;code&gt;df.info()&lt;/code&gt;. Let's use &lt;code&gt;matplotlib&lt;/code&gt; to visualize some of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WH3nF4sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fje2eg9tayxh6aanehtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WH3nF4sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fje2eg9tayxh6aanehtu.png" alt="Matplotlib scatter plot" width="417" height="248"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h4_GNu-f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jr5hk8gvz1pgskl58mtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h4_GNu-f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jr5hk8gvz1pgskl58mtm.png" alt="Matplotlib hist plot" width="408" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Parsing dates
&lt;/h3&gt;

&lt;p&gt;When working with time series data, it's a good practice to make sure dates are in the format of a &lt;a href="https://docs.python.org/3/library/datetime.html"&gt;datetime object&lt;/a&gt; (a Python data type which encodes specific information about dates).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/TrainAndValid.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we're using &lt;code&gt;parse_dates&lt;/code&gt;, calling &lt;code&gt;df.info()&lt;/code&gt; shows that the dtype of "saledate" is &lt;code&gt;datetime64[ns]&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f9IsWK5u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wde77649a37ydskr47wj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f9IsWK5u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wde77649a37ydskr47wj.png" alt="Matplotlib scatter plot" width="406" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To visualize the attributes of the DataFrame more easily, a trick is to transpose it with &lt;code&gt;df.head().T&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sort DataFrame by saledate
&lt;/h3&gt;

&lt;p&gt;It makes sense to sort our data by date because we're working on a time series problem where we predict future values given past examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sort DataFrame by date (in ascending order)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
Name: saledate, dtype: datetime64[ns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Make a copy of the original DataFrame
&lt;/h3&gt;

&lt;p&gt;Since we're going to manipulate the data, it's better to make a copy of the original DataFrame and perform our changes there. This way, we keep the original DataFrame intact in case we need it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make a copy of the original DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add datetime parameters for saledate column
&lt;/h3&gt;

&lt;p&gt;We're doing this to enrich our dataset with as much information as possible.&lt;/p&gt;

&lt;p&gt;Since we imported the data using &lt;code&gt;read_csv()&lt;/code&gt; and parsed the dates using &lt;code&gt;parse_dates=["saledate"]&lt;/code&gt;, we can now access the different datetime attributes of the saledate column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add datetime parameters for saledate
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleYear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleMonth"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDay"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofweek"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;
&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofyear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofyear&lt;/span&gt;

&lt;span class="c1"&gt;# Drop original saledate
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Modeling
&lt;/h2&gt;

&lt;p&gt;We've explored our dataset and enriched it with some datetime attributes. We could spend more time doing exploratory data analysis (EDA), finding out more about the data ourselves. But instead, we'll use a machine learning model to help us do EDA.&lt;/p&gt;

&lt;p&gt;After all, one of the objectives when starting a new machine learning project is to reduce the time between experiments. So, let's move on to the model.&lt;/p&gt;

&lt;p&gt;Following the &lt;a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html"&gt;Scikit-Learn machine learning map&lt;/a&gt;, we find that a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor"&gt;RandomForestRegressor()&lt;/a&gt; is a good candidate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, fitting the &lt;code&gt;RandomForestRegressor()&lt;/code&gt; model on our training data wouldn't work yet, because we've got missing values as well as attributes in a non-numerical format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing categories and different datatypes
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that many columns have the type &lt;code&gt;object&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing values
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Likewise, many columns have missing values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Convert strings to categories
&lt;/h3&gt;

&lt;p&gt;One way to turn our data into numbers is to convert the columns with the string datatype into a category datatype.&lt;/p&gt;

&lt;p&gt;To do this, we can use the &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#data-types-related-functionality"&gt;pandas types API&lt;/a&gt;, which allows us to interact with and manipulate the data types.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# These columns contain strings
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_string_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us a list of all the columns for which the data is in a string format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This will turn all of the string values into category values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_string_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_ordered&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling &lt;code&gt;df_tmp.info()&lt;/code&gt;, we see that the type &lt;code&gt;object&lt;/code&gt; has been converted to &lt;code&gt;category&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;All of our data is now categorical. That means we can now turn the categories into numbers. However, we still have to deal with missing values.&lt;/p&gt;
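
&lt;p&gt;Under the hood, pandas now backs each of these columns with integer codes. Here's a quick sketch of how to peek at them (assuming &lt;code&gt;state&lt;/code&gt; is one of the columns we just converted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The unique category labels for the column
df_tmp.state.cat.categories

# The integer code backing each row (-1 marks a missing value)
df_tmp.state.cat.codes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;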

&lt;p&gt;The data is now in a format we can work with, so let's save it to a file and reimport it so we can continue from here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Save processed data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Save preprocessed data
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/train_tmp.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Import preprocessed data
&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/train_tmp.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our preprocessed DataFrame has the columns we added to it, but it still has some missing values. We can see where values are missing with &lt;code&gt;df_tmp.isna().sum()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fill missing values
&lt;/h3&gt;

&lt;p&gt;Here are two things to know about machine learning models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All of our data has to be numerical&lt;/li&gt;
&lt;li&gt;There can't be any missing values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And as we've seen using &lt;code&gt;df_tmp.isna().sum()&lt;/code&gt;, our data still has plenty of missing values.&lt;/p&gt;

&lt;h4&gt;
  
  
  Filling numerical values
&lt;/h4&gt;

&lt;p&gt;First, we're going to fill the missing values in the numeric columns with the median of each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SalesID
SalePrice
MachineID
ModelID
datasource
auctioneerID
YearMade
MachineHoursCurrentMeter
saleYear
saleMonth
saleDay
saleDayofweek
saleDayofyear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for which numeric columns have null values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auctioneerID
MachineHoursCurrentMeter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fill numeric rows with the median
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="c1"&gt;# Add a binary column which tells if the data was missing our not
&lt;/span&gt;            &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Fill missing numeric values with median since it's more robust than the mean
&lt;/span&gt;            &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can easily fill all of the missing numeric values in our dataset with the median. However, a value may be missing for a good reason. That's why we added a binary column recording whether each value was missing, so that information isn't lost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check if there's any null values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good, we no longer have any missing values in the numeric columns.&lt;/p&gt;

&lt;h4&gt;
  
  
  Filling and turning categorical variables to numbers
&lt;/h4&gt;

&lt;p&gt;Now, let's fill the missing categorical values. At the same time, we'll turn the categories into numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Turn categorical variables into numbers
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Check columns which *aren't* numeric
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Add binary column to inidicate whether sample had missing value
&lt;/span&gt;        &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# We add the +1 because pandas encodes missing categories as -1
&lt;/span&gt;        &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of our data is now numeric, and there are no missing values.&lt;/p&gt;
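
&lt;p&gt;A couple of quick checks confirm this (a minimal sketch using plain pandas):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Should be 0: no missing values anywhere in the DataFrame
df_tmp.isna().sum().sum()

# Should be empty: no string (object) columns left
df_tmp.select_dtypes(include="object").columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;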

&lt;p&gt;We should be able to build a machine learning model! Let's instantiate our &lt;code&gt;RandomForestRegressor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fitting it on the full dataset would take a few minutes, which is too long for quick experimentation. So, what we'll do is work with a subset of the samples.&lt;/p&gt;
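
&lt;p&gt;For context, here's a sketch of the fit we want to avoid re-running on every experiment (this is the call that takes minutes on ~400,000 rows):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Naive approach: fit on every row (slow, so we'll subsample instead)
model = RandomForestRegressor(n_jobs=-1)
model.fit(df_tmp.drop("SalePrice", axis=1), df_tmp.SalePrice)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;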

&lt;h3&gt;
  
  
  Splitting data into training and validation sets
&lt;/h3&gt;

&lt;p&gt;According to the Kaggle data page, the validation set and test set are split according to dates. This makes sense since we're working on a time series problem.&lt;/p&gt;

&lt;p&gt;Knowing this, randomly splitting our data into train and test sets using something like &lt;code&gt;train_test_split()&lt;/code&gt; wouldn't work.&lt;/p&gt;

&lt;p&gt;Instead, we'll split our data into training, validation, and test sets according to the date of the sample.&lt;/p&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training = all samples up until 2011&lt;/li&gt;
&lt;li&gt;Valid = all samples from January 1, 2012 - April 30, 2012&lt;/li&gt;
&lt;li&gt;Test = all samples from May 1, 2012 - November 2012&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more on making good training, validation, and test sets, check out the post &lt;a href="https://www.fast.ai/2017/11/13/validation-sets/"&gt;How (and why) to create a good validation set&lt;/a&gt; by Rachel Thomas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split data into training and validation sets
&lt;/span&gt;&lt;span class="n"&gt;df_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saleYear&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2012&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saleYear&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2012&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(401125, 11573)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Split data into X &amp;amp; y
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;df_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;
&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;df_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((401125, 102), (401125,), (11573, 102), (11573,))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building an evaluation function
&lt;/h3&gt;

&lt;p&gt;The evaluation metric used for this Kaggle competition is the root mean squared log error (RMSLE).&lt;/p&gt;

&lt;p&gt;It's important to understand the evaluation metric we're going for. The &lt;strong&gt;RMSLE&lt;/strong&gt; is the metric of choice when we care more about the consequences of being off by 10% than of being off by $10. That is, when we care more about ratios than absolute differences. The &lt;strong&gt;MAE&lt;/strong&gt; (mean absolute error), by contrast, measures exact differences.&lt;/p&gt;
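
&lt;p&gt;To make that concrete, here's a quick sketch with made-up numbers where every prediction is 10% too high:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

y_true = np.array([10, 100, 1000])
y_pred = y_true * 1.1  # every prediction is off by the same *ratio*

# RMSLE: each sample contributes roughly the same (same percentage error)
log_errors = np.log1p(y_pred) - np.log1p(y_true)
print(np.sqrt(np.mean(log_errors ** 2)))  # ~0.09

# MAE: dominated by the $100 error on the $1,000 sample
print(np.mean(np.abs(y_pred - y_true)))   # 37.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;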

&lt;p&gt;Since Scikit-Learn doesn't have a built-in function for RMSLE, we'll create our own.&lt;/p&gt;

&lt;p&gt;We can do this by taking the square root of Scikit-Learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error"&gt;mean_squared_log_error&lt;/a&gt; (MSLE). MSLE is the mean squared error (MSE) computed on log-transformed values, so taking its square root gives us the RMSLE.&lt;/p&gt;

&lt;p&gt;While we're at it, we'll also calculate the MAE and R^2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create evaluation function (the competition uses Root Mean Square Log Error)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_log_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rmsle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_squared_log_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create function to evaluate our model
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Training MAE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Valid MAE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Training RMSLE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rmsle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Valid RMSLE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rmsle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Training R^2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
              &lt;span class="s"&gt;"Valid R^2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing our model on a subset of data
&lt;/h3&gt;

&lt;p&gt;Training a model on the entire dataset would take far too long to keep experimenting as fast as we want to.&lt;/p&gt;

&lt;p&gt;So, what we'll do is take a sample of the training set and tune the hyperparameters on that subset before training a larger model.&lt;/p&gt;

&lt;p&gt;If our experiments are taking longer than 10 seconds, we should try to speed things up by sampling less data or using a faster computer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;401125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's alter the number of samples each tree in the &lt;code&gt;RandomForestRegressor&lt;/code&gt; sees, using the &lt;code&gt;max_samples&lt;/code&gt; parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Change max samples in RandomForestRegressor
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;max_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;max_samples&lt;/code&gt; to 10,000 means each of the &lt;code&gt;n_estimators&lt;/code&gt; trees (100 by default) in our &lt;code&gt;RandomForestRegressor&lt;/code&gt; will only see 10,000 random samples from our DataFrame, instead of the entire 400,000.&lt;/p&gt;

&lt;p&gt;In other words, each tree will be looking at 40x fewer samples, which means we'll get faster computation speeds. Though, we should expect our results to worsen (because the model has fewer samples to learn patterns from).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="c1"&gt;# Cutting down the max number of samples each tree can see improves training time
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU times: user 43.5 s, sys: 1.39 s, total: 44.9 s
Wall time: 9.3 s
RandomForestRegressor(max_samples=10000, n_jobs=-1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Training MAE': 5557.476079825491,
 'Valid MAE': 7145.432947377517,
 'Training RMSLE': 0.2576354132701014,
 'Valid RMSLE': 0.2932065424618794,
 'Training R^2': 0.860725458431776,
 'Valid R^2': 0.8334321554498396}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hyperparameter tuning with RandomizedSearchCV
&lt;/h3&gt;

&lt;p&gt;We can increase &lt;code&gt;n_iter&lt;/code&gt; to try more combinations of hyperparameters, but in our case we'll try 20 and see where it gets us. We're trying to reduce the amount of time between experiments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;

&lt;span class="c1"&gt;# Different RandomForestClassifier hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"n_estimators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"max_depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_split"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_leaf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"max_features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sqrt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="s"&gt;"max_samples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;rs_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                              &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rs_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fitting 5 folds for each of 20 candidates, totalling 100 fits
CPU times: user 9min 43s, sys: 38.7 s, total: 10min 22s
Wall time: 10min 30s
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=20,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the best parameters from the RandomizedSearch 
&lt;/span&gt;&lt;span class="n"&gt;rs_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'n_estimators': 60,
 'min_samples_split': 6,
 'min_samples_leaf': 1,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': None}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the RandomizedSearch model
&lt;/span&gt;&lt;span class="n"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rs_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Training MAE': 5623.00659389663,
 'Valid MAE': 7221.500628014852,
 'Training RMSLE': 0.2600757896526271,
 'Valid RMSLE': 0.2938688696407002,
 'Training R^2': 0.857223776894714,
 'Valid R^2': 0.8302372473994826}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Train a model with the best parameters
&lt;/h3&gt;

&lt;p&gt;The course instructor, Daniel Bourke, tried 100 different combinations of hyperparameters (setting &lt;code&gt;n_iter&lt;/code&gt; to 100 in &lt;code&gt;RandomizedSearchCV&lt;/code&gt;) and found that the best results came from the ones we'll use below. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This kind of search with &lt;code&gt;n_iter = 100&lt;/code&gt; can take about 2 hours on a powerful laptop. The search above, with &lt;code&gt;n_iter = 20&lt;/code&gt; (20 candidates, 100 fits), took only about 10 minutes on my MacBook Air.&lt;/p&gt;

&lt;p&gt;We'll instantiate a new model with these discovered hyperparameters and reset the &lt;code&gt;max_samples&lt;/code&gt; back to its original value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="c1"&gt;# Most ideal hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;ideal_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;min_samples_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;max_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;max_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU times: user 10min 54s, sys: 5.69 s, total: 11min
Wall time: 1min 39s
RandomForestRegressor(max_features=0.5, min_samples_split=14, n_estimators=90,
                      n_jobs=-1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;show_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Training MAE': 2927.714226014933,
 'Valid MAE': 5893.170337485774,
 'Training RMSLE': 0.14329949046791518,
 'Valid RMSLE': 0.2436971060471662,
 'Training R^2': 0.9596781605243438,
 'Valid R^2': 0.884359696206594}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these hyperparameters, and using all of the samples (&lt;code&gt;max_samples=None&lt;/code&gt;), our model's performance improves: the validation RMSLE drops from about 0.294 to 0.244.&lt;/p&gt;

&lt;p&gt;We could make a faster model by altering some of the hyperparameters, particularly by lowering &lt;code&gt;n_estimators&lt;/code&gt;, since each additional estimator amounts to training another small model (a tree).&lt;/p&gt;

&lt;p&gt;However, lowering &lt;code&gt;n_estimators&lt;/code&gt; or altering other hyperparameters may lead to poorer results; it's a speed/accuracy trade-off, as sketched below.&lt;/p&gt;
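
&lt;p&gt;For illustration (this variant isn't part of the original notebook), a lighter model could reuse the tuned hyperparameters with fewer trees:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;%%time
# Hypothetical faster variant: a third of the trees, same other hyperparameters.
# Expect a shorter fit time, but likely slightly worse validation scores.
fast_model = RandomForestRegressor(n_estimators=30,
                                   min_samples_leaf=1,
                                   min_samples_split=14,
                                   max_features=0.5,
                                   n_jobs=-1)
fast_model.fit(X_train, y_train)
show_scores(fast_model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;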

&lt;h3&gt;
  
  
  Make predictions on test data
&lt;/h3&gt;

&lt;p&gt;We've got a trained model, so let's make predictions on the test data.&lt;/p&gt;

&lt;p&gt;So far, we've trained our model on data up to the end of 2011. The test data, however, covers sales from May 2012 to November 2012.&lt;/p&gt;

&lt;p&gt;We'll use the patterns our model has learned from the training data to predict the sale price of bulldozers with characteristics it has never seen before, but which we assume are similar to those found in the training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/Test.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, the test data isn't in the same format as our other data, so we have to fix it. Let's create a function to preprocess our data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Preprocessing the data
&lt;/h4&gt;

&lt;p&gt;Our model has been trained on data formatted in a certain way. So, to make predictions on the test data, we need to apply the same preprocessing steps we used on the training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Whatever we do to the training data, we have to do the same to the test data.&lt;/p&gt;

&lt;p&gt;Let's create a function for doing so (by copying the preprocessing steps we used above).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Add datetime parameters for saledate
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleYear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleMonth"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDay"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofweek"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofweek&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"saleDayofyear"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saledate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dayofyear&lt;/span&gt;

    &lt;span class="c1"&gt;# Drop original saledate
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"saledate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fill numeric rows with the median
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Turn categorical variables into numbers
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_numeric_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# We add the +1 because pandas encodes missing categories as -1
&lt;/span&gt;            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;        

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This function could break if the test data has different missing values compared to the training data.&lt;/p&gt;

&lt;p&gt;Now that we've got a function for preprocessing data, let's preprocess the test dataset into the same format as our training dataset.&lt;/p&gt;
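
&lt;p&gt;A minimal sketch of that step (the exact cell isn't shown in this write-up, but it's just our new function applied to &lt;code&gt;df_test&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Preprocess the test data the same way as the training data
df_test = preprocess_data(df_test)
df_test.shape  # expect 101 columns after preprocessing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;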

&lt;p&gt;We can see that our test dataset (after preprocessing) has 101 columns, but our training dataset &lt;code&gt;X_train&lt;/code&gt; has 102 columns (after preprocessing). Let's find the difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We can find how the columns differ using sets
&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'auctioneerID_is_missing'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There weren't any missing &lt;code&gt;auctioneerID&lt;/code&gt; fields in the test dataset, so preprocessing never created an &lt;code&gt;auctioneerID_is_missing&lt;/code&gt; column for it.&lt;/p&gt;

&lt;p&gt;To fix this, we'll add a column to the test dataset called &lt;code&gt;auctioneerID_is_missing&lt;/code&gt; and fill it with &lt;code&gt;False&lt;/code&gt;, since none of the &lt;code&gt;auctioneerID&lt;/code&gt; fields are missing in the test dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Match test dataset columns to training dataset
&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"auctioneerID_is_missing"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
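
&lt;p&gt;One extra safeguard worth adding (my own addition, not a step from the course): scikit-learn expects the test features in the same &lt;em&gt;order&lt;/em&gt; they were seen during fit, not just the same set of columns, so we can reorder them explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reorder the test columns to match the training columns exactly
df_test = df_test[X_train.columns]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;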



&lt;p&gt;Now the test dataset matches the training dataset and we should be able to make predictions on it using our trained model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on the test dataset using the best model
&lt;/span&gt;&lt;span class="n"&gt;test_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When looking at the &lt;a href="https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation"&gt;Kaggle submission requirements&lt;/a&gt;, we see that to make a submission, the data must be in a certain format: a DataFrame containing the &lt;code&gt;SalesID&lt;/code&gt; and the predicted &lt;code&gt;SalePrice&lt;/code&gt; of the bulldozer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create DataFrame compatible with Kaggle submission requirements
&lt;/span&gt;&lt;span class="n"&gt;df_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalesID"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalesID"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"SalePrice"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_preds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Export to csv...
&lt;/span&gt;&lt;span class="n"&gt;df_preds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./data/bluebook-for-bulldozers/predictions.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feature Importance
&lt;/h3&gt;

&lt;p&gt;By now, we've built a model which is able to make predictions. The people we share these predictions with might be curious to know what parts of the data led to these predictions.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;feature importance&lt;/strong&gt; comes in. Feature importance seeks to find out which attributes of the data were the most important in predicting the &lt;strong&gt;target variable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In our case: which bulldozer sale attributes carried the most weight in the model when predicting the sale price?&lt;/p&gt;

&lt;p&gt;Beware: the default feature importances for random forests are impurity-based and can be misleading (they tend to inflate the importance of high-cardinality features). A more robust alternative, permutation importance, is sketched after the plot below.&lt;/p&gt;

&lt;p&gt;To figure out which features were most important in a machine learning model, a good idea is to search something like "[MODEL NAME] feature importance".&lt;/p&gt;

&lt;p&gt;Doing this for our &lt;code&gt;RandomForestRegressor&lt;/code&gt; leads us to find the &lt;code&gt;feature_importances_&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Let's check it out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find feature importance of our best model
&lt;/span&gt;&lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_importances_&lt;/span&gt;

&lt;span class="c1"&gt;# Install Seaborn package in current environment (if you don't have it)
# import sys
# !conda install --yes --prefix {sys.prefix} seaborn
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="c1"&gt;# Helper function for plotting feature importance
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;importances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s"&gt;"feature_importance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;importances&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"feature_importance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"feature_importance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"h"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ideal_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_importances_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LjMKMioq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00ax2xlq232eja85zbp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LjMKMioq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00ax2xlq232eja85zbp9.png" alt="Features" width="516" height="263"&gt;&lt;/a&gt;&lt;/p&gt;
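
&lt;p&gt;As a cross-check on the impurity-based importances above, here's a minimal sketch of permutation importance, assuming the same &lt;code&gt;X_valid&lt;/code&gt;/&lt;code&gt;y_valid&lt;/code&gt; split used by &lt;code&gt;show_scores()&lt;/code&gt; earlier (this step is my addition, not part of the original notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.inspection import permutation_importance

# Shuffle each feature on the validation set and measure how much the score drops
perm = permutation_importance(ideal_model, X_valid, y_valid,
                              n_repeats=5, random_state=42, n_jobs=-1)

# Reuse our helper to plot the permutation-based importances
plot_features(X_valid.columns, perm.importances_mean)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;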

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Machine Learning: Predicting Heart Disease From Patients' Medical Data</title>
      <dc:creator>Nicolas Vallée</dc:creator>
      <pubDate>Wed, 30 Mar 2022 08:04:41 +0000</pubDate>
      <link>https://dev.to/nicolasvallee/machine-learning-predicting-heart-disease-from-patients-medical-data-535n</link>
      <guid>https://dev.to/nicolasvallee/machine-learning-predicting-heart-disease-from-patients-medical-data-535n</guid>
      <description>&lt;p&gt;This tutorial introduces some fundamental Machine Learning and Data Science concepts by exploring the problem of heart disease &lt;strong&gt;classification&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is intended to be an end-to-end example of what a Data Science and Machine Learning &lt;strong&gt;proof of concept&lt;/strong&gt; looks like.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I completed this milestone project in March 2022, as part of the &lt;a href="https://www.udemy.com/course/complete-machine-learning-and-data-science-zero-to-mastery/"&gt;Complete Machine Learning &amp;amp; Data Science Bootcamp&lt;/a&gt; taught by &lt;a href="https://www.udemy.com/user/daniel-bourke-52/"&gt;Daniel Bourke&lt;/a&gt; and &lt;a href="https://www.udemy.com/user/andrei-neagoie/"&gt;Andrei Neagoie&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also see &lt;a href="https://github.com/nicolas-vallee/ztm-ml-projects/blob/main/heart-disease-classification.ipynb"&gt;the final version of this notebook in my GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is classification?
&lt;/h3&gt;

&lt;p&gt;Classification involves deciding whether a sample belongs to one class or another. With two class options, this is called &lt;strong&gt;binary classification&lt;/strong&gt;. If there are multiple class options, we refer to the problem as &lt;strong&gt;multi-class classification&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we'll end up with
&lt;/h3&gt;

&lt;p&gt;Since we already have a dataset, we'll follow this 6-step Machine Learning modelling framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lk-c66gX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ecb5myrpy0winupv3g2g.png" alt="6 Step Machine Learning Modelling Framework" width="880" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More specifically, we'll look at the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory data analysis (EDA)&lt;/strong&gt; - the process of going through a dataset to find out more about it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model training&lt;/strong&gt; - create model(s) to predict a target variable based on other variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model evaluation&lt;/strong&gt; - evaluating a model's predictions using problem-specific evaluation metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model comparison&lt;/strong&gt; - comparing several different models to find the best one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model fine-tuning&lt;/strong&gt; - once we've found a good model, how can we improve it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature importance&lt;/strong&gt; - since we're predicting the presence of heart disease, are there some things which are more important for prediction?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-validation&lt;/strong&gt; - if we build a good model, can we be sure it will work on unseen data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting what we've found&lt;/strong&gt; - if we had to present our work, what would we show someone?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To work through these topics, we'll use pandas, Matplotlib, and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V3QxaMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q607xb1k00slguyi4d9e.png" alt="Tools which can be used for each step of the machine learning modelling process" width="880" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll work through each step, and by the end of the notebook we'll have a handful of models that can predict whether or not a person has heart disease from a number of parameters with considerable accuracy.&lt;/p&gt;

&lt;p&gt;We'll also be able to describe which parameters are more indicative than others; for example, sex may be more important than age.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem Definition
&lt;/h2&gt;

&lt;p&gt;The problem we will explore is &lt;strong&gt;binary classification&lt;/strong&gt;, which means a sample can only belong to one of two classes.&lt;/p&gt;

&lt;p&gt;We're going to use a number of different &lt;strong&gt;features&lt;/strong&gt; about a person to predict whether or not they have heart disease.&lt;/p&gt;

&lt;p&gt;In a statement,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given clinical parameters about a patient, can we predict whether or not they have heart disease?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Data
&lt;/h2&gt;

&lt;p&gt;Here, we want to dive into the data that our problem definition is based on. This may involve sourcing the data, defining its different parameters, talking to experts about it, and finding out what we should expect.&lt;/p&gt;

&lt;p&gt;The original data comes from the &lt;a href="https://archive.ics.uci.edu/ml/datasets/heart+disease"&gt;Cleveland database&lt;/a&gt; from UCI Machine Learning Repository.&lt;/p&gt;

&lt;p&gt;However, we've downloaded it in a formatted way from &lt;a href="https://www.kaggle.com/cherngs/heart-disease-cleveland-uci"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The original database contains 76 attributes, but here only 14 attributes are used. &lt;strong&gt;Attributes&lt;/strong&gt; (also called &lt;strong&gt;features&lt;/strong&gt;) are the variables that we'll use to predict our &lt;strong&gt;target variable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Attributes and features are also referred to as &lt;strong&gt;independent variables&lt;/strong&gt;, and a target variable can be referred to as a &lt;strong&gt;dependent variable&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We use the independent variables to predict our dependent variable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, the independent variables are a patient's medical attributes and the dependent variable is whether or not they have heart disease.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Evaluation
&lt;/h2&gt;

&lt;p&gt;The evaluation metric is something we define at the start of a project.&lt;/p&gt;

&lt;p&gt;Since machine learning is very experimental, we might say something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept phase, we'll pursue this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is helpful because it provides a rough goal for a machine learning engineer or data scientist to work towards.&lt;/p&gt;

&lt;p&gt;However, due to the nature of experimentation, the evaluation metric may change over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Features
&lt;/h2&gt;

&lt;p&gt;Features are different parts of the data. During this step, we want to find out what we can about the data.&lt;/p&gt;

&lt;p&gt;One of the most common ways to do this is to create a &lt;strong&gt;data dictionary&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heart disease data dictionary
&lt;/h3&gt;

&lt;p&gt;A data dictionary describes the data we're dealing with. Not all datasets come with them so this is where we may have to do our research or ask a &lt;strong&gt;subject matter expert&lt;/strong&gt; (someone who knows about the data) for more information.&lt;/p&gt;

&lt;p&gt;The following are the features we'll use to predict our target variable (heart disease or no heart disease).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;age - age in years&lt;/li&gt;
&lt;li&gt;sex - (1 = male; 0 = female)&lt;/li&gt;
&lt;li&gt;cp - chest pain type

&lt;ul&gt;
&lt;li&gt;0: Typical angina: chest pain related to decreased blood supply to the heart&lt;/li&gt;
&lt;li&gt;1: Atypical angina: chest pain not related to heart&lt;/li&gt;
&lt;li&gt;2: Non-anginal pain: typically esophageal spasms (non heart related)&lt;/li&gt;
&lt;li&gt;3: Asymptomatic: chest pain not showing signs of disease&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;trestbps - resting blood pressure (in mmHg on admission to the hospital)

&lt;ul&gt;
&lt;li&gt;anything above 130-140 is typically cause for concern&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;chol - serum cholesterol in mg/dl

&lt;ul&gt;
&lt;li&gt;serum = LDL + HDL + .2 * triglycerides&lt;/li&gt;
&lt;li&gt;above 200 is cause for concern&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;fbs - (fasting blood sugar &amp;gt; 120 mg/dl) (1 = true; 0 = false)

&lt;ul&gt;
&lt;li&gt;above 126 mg/dL signals diabetes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;restecg - resting electrocardiographic results

&lt;ul&gt;
&lt;li&gt;0: Nothing to note&lt;/li&gt;
&lt;li&gt;1: ST-T Wave abnormality

&lt;ul&gt;
&lt;li&gt;can range from mild symptoms to severe problems&lt;/li&gt;
&lt;li&gt;signals non-normal heart beat&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;2: Possible or definite left ventricular hypertrophy

&lt;ul&gt;
&lt;li&gt;Enlarged heart's main pumping chamber&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;thalach - maximum heart rate achieved&lt;/li&gt;
&lt;li&gt;exang - exercise induced angina (1 = yes; 0 = no)&lt;/li&gt;
&lt;li&gt;oldpeak - ST depression induced by exercise relative to rest

&lt;ul&gt;
&lt;li&gt;looks at stress of heart during exercise&lt;/li&gt;
&lt;li&gt;unhealthy heart will stress more&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;slope - the slope of the peak exercise ST segment

&lt;ul&gt;
&lt;li&gt;0: Upsloping: better heart rate with exercise (uncommon)&lt;/li&gt;
&lt;li&gt;1: Flatsloping: minimal change (typical healthy heart)&lt;/li&gt;
&lt;li&gt;2: Downsloping: signs of unhealthy heart&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ca - number of major vessels (0-3) colored by fluoroscopy

&lt;ul&gt;
&lt;li&gt;colored vessel means the doctor can see the blood passing through&lt;/li&gt;
&lt;li&gt;the more blood movement the better (no clots)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;thal - thallium stress test result

&lt;ul&gt;
&lt;li&gt;1,3: normal&lt;/li&gt;
&lt;li&gt;6: fixed defect: used to be defect but ok now&lt;/li&gt;
&lt;li&gt;7: reversible defect: no proper blood movement when exercising&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;target - have disease or not (1 = yes; 0 = no) (= the predicted attribute)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: No personally identifiable information (PII) can be found in the dataset.&lt;/p&gt;

&lt;p&gt;It's a good idea to save these descriptions to a Python dictionary or to an external file, so we can look them up later without coming back here. A sketch of this follows below.&lt;/p&gt;
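
&lt;p&gt;For instance, a minimal sketch (the dictionary name and abbreviated entries are my own, not from the course):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical data dictionary with a few of the features described above
data_dict = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (0-3)",
    "thalach": "maximum heart rate achieved",
    "target": "have disease or not (1 = yes; 0 = no)",
}

data_dict["thalach"]  # 'maximum heart rate achieved'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;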

&lt;h3&gt;
  
  
  Preparing the tools
&lt;/h3&gt;

&lt;p&gt;At the start of any project, it's common to see the required libraries imported in a big chunk, like we can see below.&lt;/p&gt;

&lt;p&gt;However, in practice, our projects may import libraries as we go. After we've spent a couple of hours working on our problem, we'll probably want to do some tidying up. This is where we may want to consolidate every library we've used at the top of our notebook (like in the cell below).&lt;/p&gt;

&lt;p&gt;The libraries we use will differ from project to project, but there are a few we'll likely take advantage of during almost every structured data project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; for data analysis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://numpy.org/"&gt;NumPy&lt;/a&gt; for numerical operations.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://matplotlib.org/"&gt;Matplotlib&lt;/a&gt;/&lt;a href="https://seaborn.pydata.org/"&gt;seaborn&lt;/a&gt; for plotting or data visualization.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scikit-learn.org/stable/"&gt;Scikit-Learn&lt;/a&gt; for machine learning modelling and evaluation.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Regular EDA and plotting libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="c1"&gt;# We want our plots to appear in the notebook
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt; 

&lt;span class="c1"&gt;## Models
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.neighbors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;

&lt;span class="c1"&gt;## Model evaluators
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading data
&lt;/h3&gt;

&lt;p&gt;There are many ways to store data. The typical way of storing tabular data (data similar to what you'd see in an Excel file) is the .csv format, which stands for comma-separated values.&lt;/p&gt;

&lt;p&gt;Pandas has a built-in function to read .csv files called &lt;code&gt;read_csv()&lt;/code&gt; which takes the file pathname of our .csv file. We'll likely use this one often.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'heart-disease.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 'DataFrame' shortened to 'df'
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="c1"&gt;# (rows, columns)
&lt;/span&gt;
&lt;span class="c1"&gt;# (303, 14)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data exploration (exploratory data analysis or EDA)
&lt;/h3&gt;

&lt;p&gt;Once we've imported a dataset, the next step is to explore it. There's no set way of doing this, but we should try to become more familiar with the dataset.&lt;/p&gt;

&lt;p&gt;This means comparing different columns to each other or to the target variable, and referring back to our data dictionary to remind ourselves of what the different columns mean.&lt;/p&gt;

&lt;p&gt;Our goal is to become a subject matter expert on the dataset we're working with. So, if someone asks us a question about it, we can give them an explanation, and when we start building models, we can sanity check them to make sure they're not performing too well (overfitting) or understand why they might be performing poorly (underfitting).&lt;/p&gt;

&lt;p&gt;Since EDA has no real set methodology, the following is a short check list we might want to walk through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What question(s) are we trying to solve (or prove wrong)?&lt;/li&gt;
&lt;li&gt;What kind of data do we have and how do we treat different types?&lt;/li&gt;
&lt;li&gt;What’s missing from the data and how do we deal with it?&lt;/li&gt;
&lt;li&gt;Where are the outliers and why should we care about them?&lt;/li&gt;
&lt;li&gt;How can we add, change, or remove features to get more out of our data?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One of the quickest and easiest ways to check our data is with the &lt;code&gt;head()&lt;/code&gt; function. Calling it on any dataframe returns the top 5 rows, and &lt;code&gt;tail()&lt;/code&gt; returns the bottom 5. We can also pass them a number, like &lt;code&gt;head(10)&lt;/code&gt;, to show the top 10 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's check the top 5 rows of our dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;age sex cp  trestbps    chol    fbs restecg thalach exang   oldpeak slope   ca  thal    target
0   63  1   3   145 233 1   0   150 0   2.3 0   0   1   1
1   37  1   2   130 250 0   1   187 0   3.5 0   0   2   1
2   41  0   1   130 204 0   0   172 0   1.4 2   0   2   1
3   56  1   1   120 236 0   1   178 0   0.8 2   0   2   1
4   57  0   0   120 354 0   1   163 1   0.6 2   0   2   1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# And the bottom 10
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;age sex cp  trestbps    chol    fbs restecg thalach exang   oldpeak slope   ca  thal    target
293 67  1   2   152 212 0   0   150 0   0.8 1   0   3   0
294 44  1   0   120 169 0   1   144 1   2.8 0   0   1   0
295 63  1   0   140 187 0   0   144 1   4.0 2   2   3   0
296 63  0   0   124 197 0   1   136 1   0.0 1   0   2   0
297 59  1   0   164 176 1   0   90  0   1.0 1   2   1   0
298 57  0   0   140 241 0   1   123 1   0.2 1   0   3   0
299 45  1   3   110 264 0   1   132 0   1.2 1   0   3   0
300 68  1   0   144 193 1   1   141 0   3.4 1   2   3   0
301 57  1   0   130 131 0   1   115 1   1.2 1   1   3   0
302 57  0   1   130 236 0   0   174 0   0.0 1   1   2   0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;value_counts()&lt;/code&gt; allows us to show how many times each of the values of a &lt;strong&gt;categorical&lt;/strong&gt; column appear.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's see how many positive (1) and negative (0) samples we have in our dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1    165
0    138
Name: target, dtype: int64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since these two values are close to each other, our &lt;code&gt;target&lt;/code&gt; column can be considered &lt;strong&gt;balanced&lt;/strong&gt;. An &lt;strong&gt;unbalanced&lt;/strong&gt; target column, one where some classes have far more samples than others, can be harder to model than a balanced one. Ideally, all of our target classes would have the same number of samples.&lt;/p&gt;

&lt;p&gt;If we'd prefer these values as proportions, &lt;code&gt;value_counts()&lt;/code&gt; takes a &lt;code&gt;normalize&lt;/code&gt; parameter, which can be set to &lt;code&gt;True&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Normalized value counts
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1    0.544554
0    0.455446
Name: target, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can plot the target column value counts by calling the &lt;code&gt;plot()&lt;/code&gt; function and telling it what kind of plot we'd like, in this case, &lt;code&gt;bar&lt;/code&gt; is good.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot the value counts with a bar graph
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--64GvNZks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq3jed8s9e2su9rbjrvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--64GvNZks--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq3jed8s9e2su9rbjrvu.png" alt="Bar chart" width="375" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.info()&lt;/code&gt; shows the number of missing values we have and what type of data we're working with.&lt;/p&gt;

&lt;p&gt;In our case, there are no missing values and all of our columns are numerical.&lt;/p&gt;

&lt;p&gt;Another way to get some quick insights on our dataframe is to use &lt;code&gt;df.describe()&lt;/code&gt;. &lt;code&gt;describe()&lt;/code&gt; shows a range of different metrics about our numerical columns such as mean, max, and standard deviation.&lt;/p&gt;
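
&lt;p&gt;A quick sketch of both calls (output omitted here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Check data types and missing values, then summary statistics
df.info()
df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;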

&lt;h3&gt;
  
  
  Heart disease frequency according to gender
&lt;/h3&gt;

&lt;p&gt;If we want to compare two columns, we can use the function &lt;code&gt;pd.crosstab(column_1, column_2)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is helpful when we want to gain an intuition about how our independent variables interact with our dependent variable.&lt;/p&gt;

&lt;p&gt;Let's compare our &lt;code&gt;target&lt;/code&gt; column with the &lt;code&gt;sex&lt;/code&gt; column.&lt;/p&gt;

&lt;p&gt;In our data dictionary, for the target column, 1 = heart disease present, 0 = no heart disease. And for sex, 1 = male, 0 = female.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 207 males and 96 females in our study.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# compare target column with sex column
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sex 0   1
target      
0   24  114
1   72  93
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What can we infer from this? Let's make a simple heuristic.&lt;/p&gt;

&lt;p&gt;Since there are 96 women in the study and 72 of them have heart disease, we might infer, based on this one variable alone, that a female participant has about a 75% chance (72/96) of having heart disease.&lt;/p&gt;

&lt;p&gt;As for the males, there are 207 in total, and around half of them (93/207, or about 45%) indicate the presence of heart disease. So we might predict that a male participant has roughly a 50% chance of having heart disease.&lt;/p&gt;

&lt;p&gt;Averaging these two rough values, we might assume that, knowing nothing else about a person, there's about a 62.5% chance they have heart disease. (A weighted average by group size would give the overall rate of 165/303, or about 54%.)&lt;/p&gt;

&lt;p&gt;This can be our very simple baseline, and we'll try to beat it with machine learning.&lt;/p&gt;
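
&lt;p&gt;A quick sanity check of those rates, computed straight from the crosstab values above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-sex heart disease rates from the crosstab above
women_rate = 72 / (24 + 72)     # 0.75
men_rate = 93 / (114 + 93)      # ~0.45
overall_rate = 165 / 303        # ~0.54

women_rate, men_rate, overall_rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;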

&lt;h3&gt;
  
  
  Making our crosstab visual
&lt;/h3&gt;

&lt;p&gt;We can plot the crosstab by using the &lt;code&gt;plot()&lt;/code&gt; function and passing it a few parameters, such as &lt;code&gt;kind&lt;/code&gt; (the type of plot we want), &lt;code&gt;figsize=(width, height)&lt;/code&gt; (how big we want it to be), and &lt;code&gt;color=[color_1, color_2]&lt;/code&gt; (the different colors we'd like to use).&lt;/p&gt;

&lt;p&gt;Different metrics are best represented with different kinds of plots; in our case, a bar graph is great. We'll see more examples later, and with a bit of practice, we'll gain an intuition for which plot to use with different variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a plot
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sex&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add some attributes to it
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Heart disease frequency for sex"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0 = No disease, 1 = Disease"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Amount"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Female"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Male"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;# keeps the labels on the x-axis vertical
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y7CuCDGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ixhpzgs2oljdo8fjjjnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y7CuCDGy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ixhpzgs2oljdo8fjjjnk.png" alt="Heart disease frequency for sex" width="612" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Age vs. max heart rate for heart disease
&lt;/h3&gt;

&lt;p&gt;Let's try combining a couple of independent variables, such as &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;thalach&lt;/code&gt; (maximum heart rate) and then compare them to our target variable.&lt;/p&gt;

&lt;p&gt;Because there are so many different values for &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;thalach&lt;/code&gt;, we'll use a scatter plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create another figure
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Start with positve examples
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thalach&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now for negative examples, we want them on the same plot, so we call plt again
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thalach&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add some helpful info
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Heart disease in function of Age and Max Heart Rate"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Age"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Max Heart Rate"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Disease"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"No Disease"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--65g-csg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/huc40sbn0ex8sbwuwxwp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--65g-csg4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/huc40sbn0ex8sbwuwxwp.png" alt="Scatter plot" width="612" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can we infer from this?&lt;/p&gt;

&lt;p&gt;It seems the younger someone is, the higher their max heart rate (dots are higher on the left of the graph), and the older someone is, the more light blue ("No Disease") dots there are. But this may simply be because there are more dots altogether on the right side of the graph (older participants).&lt;/p&gt;

&lt;p&gt;Both of these are just observations, of course, but that's exactly the point of this stage: building an understanding of the data.&lt;/p&gt;

&lt;p&gt;Now, let's check the &lt;strong&gt;age distribution&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Histograms are a great way to check the distribution of a variable
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--565ipuMI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bb055bry8pj9sjzktoql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--565ipuMI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bb055bry8pj9sjzktoql.png" alt="Hist plot" width="382" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see a roughly &lt;strong&gt;normal distribution&lt;/strong&gt;, shifted slightly toward older ages, which is reflected in the scatter plot above.&lt;/p&gt;
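
&lt;p&gt;If we wanted to quantify that, pandas can compute the skewness of the column directly (a quick sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Skewness of the age column: 0 = symmetric, &gt; 0 = right tail, &lt; 0 = left tail
df.age.skew()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;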

&lt;p&gt;Let's keep going.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heart disease frequency per chest pain type
&lt;/h3&gt;

&lt;p&gt;Let's try another independent variable. This time, &lt;code&gt;cp&lt;/code&gt; (chest pain).&lt;/p&gt;

&lt;p&gt;We'll use the same process as we did before with &lt;code&gt;sex&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arget   0   1
cp      
0   104 39
1   9   41
2   18  69
3   7   16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a new crosstab and base plot
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                   &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"lightblue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"salmon"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add attributes to the plot to make it more readable
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Heart Disease Frequency per chest pain type"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Chest Pain Type"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Amount"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"No Disease"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Disease"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TMLGMQfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0rrlo1etrlheesumurec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TMLGMQfG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0rrlo1etrlheesumurec.png" alt="Frequency chest pain" width="612" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can we infer from this?&lt;/p&gt;

&lt;p&gt;Let's check in our data dictionary what the different levels of chest pain are.&lt;/p&gt;

&lt;p&gt;cp - chest pain type&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0: Typical angina: chest pain related to decreased blood supply to the heart&lt;/li&gt;
&lt;li&gt;1: Atypical angina: chest pain not related to the heart&lt;/li&gt;
&lt;li&gt;2: Non-anginal pain: typically esophageal spasms (non heart related)&lt;/li&gt;
&lt;li&gt;3: Asymptomatic: chest pain not showing signs of disease&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's interesting that atypical angina (value of 1) is described as not being related to the heart, yet participants with it seem to have a higher ratio of heart disease than not.&lt;/p&gt;

&lt;p&gt;But what does "atypical angina" even mean?&lt;/p&gt;

&lt;p&gt;At this point, it's important to remember that if our data dictionary doesn't supply enough information, we may want to do further research on our values. This research may come in the form of asking a &lt;strong&gt;subject matter expert&lt;/strong&gt; (such as a cardiologist or the person who gave us the data) or Googling to find out more.&lt;/p&gt;

&lt;p&gt;According to PubMed, it seems &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2763472/"&gt;even some medical professionals are confused by the term&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today, 23 years later, “atypical chest pain” is still popular in medical circles. Its meaning, however, remains unclear. A few articles have the term in their title, but do not define or discuss it in their text. In other articles, the term refers to noncardiac causes of chest pain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although not conclusive, the graph above hints at how this confusion of definitions can end up reflected in the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlation between independent variables
&lt;/h3&gt;

&lt;p&gt;Finally, we'll compare all of the independent variables. This may give us an idea of which independent variables may or may not have an impact on our target variable.&lt;/p&gt;

&lt;p&gt;We can do this using &lt;code&gt;df.corr()&lt;/code&gt; which will create a &lt;strong&gt;correlation matrix&lt;/strong&gt; for us, in other words, a big table of numbers telling us how related each variable is to the others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the correlation between our independent variables
&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Let's make our correlation matrix look a little prettier
&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;".2f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"YlGnBu"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sIFBpFsX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbcxnk73twsn31opqsy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sIFBpFsX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbcxnk73twsn31opqsy4.png" alt="Correlation" width="801" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A larger positive value suggests a positive correlation (as one variable increases, so does the other), and a larger negative value suggests a negative correlation (as one variable increases, the other decreases).&lt;/p&gt;
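
&lt;p&gt;To zero in on the target specifically, we can sort its column of the matrix (a minimal sketch reusing the &lt;code&gt;corr_matrix&lt;/code&gt; from above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Correlation of each independent variable with the target, sorted
corr_matrix["target"].sort_values(ascending=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;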

&lt;h3&gt;
  
  
  Enough EDA, let's model!
&lt;/h3&gt;

&lt;p&gt;We've done exploratory data analysis (EDA) to start building an intuition about the dataset.&lt;/p&gt;

&lt;p&gt;What have we learned so far? Aside from our baseline estimate using sex, the rest of the variables seem fairly evenly distributed.&lt;/p&gt;

&lt;p&gt;So what we'll do next is &lt;em&gt;model driven EDA&lt;/em&gt;, meaning, we'll use machine learning models to drive our next questions.&lt;/p&gt;

&lt;p&gt;A few extra things to remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not every EDA will look the same; what we've seen here is an example of what we could do for a structured, tabular dataset.&lt;/li&gt;
&lt;li&gt;We don't necessarily have to make the same plots as we've done here; there are many more ways to visualize data.&lt;/li&gt;
&lt;li&gt;We want to quickly find:

&lt;ul&gt;
&lt;li&gt;Distributions (&lt;code&gt;df.column.hist()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Missing values (&lt;code&gt;df.info()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Outliers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's build some models.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Modeling
&lt;/h2&gt;

&lt;p&gt;We've explored the data, now we'll try to use Machine Learning to predict our target variable based on the 13 independent variables.&lt;/p&gt;

&lt;p&gt;What is the problem we're solving?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given clinical parameters about a patient, can we predict whether or not they have heart disease?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's what we'll be trying to answer.&lt;/p&gt;

&lt;p&gt;And remember our evaluation metric?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's what we'll be aiming for.&lt;/p&gt;

&lt;p&gt;But before we build a model, we have to get our dataset ready.&lt;/p&gt;

&lt;p&gt;Let's look at it again with &lt;code&gt;df.head()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We're trying to predict our target variable using all of the other variables. To do this, we'll split the target variable from the rest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Everything except target variable
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Target variable
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training and test split
&lt;/h3&gt;

&lt;p&gt;Now comes one of the most important concepts in Machine Learning, the &lt;strong&gt;training / test split&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is where we split our data into a training set and a test set.&lt;/p&gt;

&lt;p&gt;We use our training set to train our model and our test set to test it.&lt;/p&gt;

&lt;p&gt;The test set must remain separate from our training set.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why not use all the data to train a model?
&lt;/h4&gt;

&lt;p&gt;Let's say we wanted to take our model into the hospital and start using it on patients. How would we know how well our model performs on a new patient not included in the original full dataset we had?&lt;/p&gt;

&lt;p&gt;This is where the test set comes in. It's used to mimic taking our model to a real environment as much as possible.&lt;/p&gt;

&lt;p&gt;And that's why it's important to never let our model learn from the test set; it should only be evaluated on it.&lt;/p&gt;

&lt;p&gt;To split our data into a training and test set, we can use Scikit-Learn's &lt;code&gt;train_test_split()&lt;/code&gt; and feed it our independent and dependent variables (&lt;code&gt;X&lt;/code&gt; &amp;amp; &lt;code&gt;y&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Random seed for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split into train &amp;amp; test sets
&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# percentage of data to use for test set
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;test_size&lt;/code&gt; parameter is used to tell the &lt;code&gt;train_test_split()&lt;/code&gt; function how much of our data we want in the test set.&lt;/p&gt;

&lt;p&gt;A rule of thumb is to use 80% of our data to train on and the other 20% to test on.&lt;/p&gt;

&lt;p&gt;For our problem, a train and test set are enough. But for other problems, we could also use a validation (train/validation/test) set or cross-validation (we'll see this later).&lt;/p&gt;

&lt;p&gt;But again, each problem will differ. The post, &lt;a href="https://www.fast.ai/2017/11/13/validation-sets/"&gt;How (and why) to create a good validation set&lt;/a&gt; by Rachel Thomas is a good place to learn more.&lt;/p&gt;

&lt;p&gt;Let's look at our training data: we can see we're using 242 samples to train on.&lt;/p&gt;

&lt;p&gt;Let's look at our test data: we've got 61 examples we'll test our model(s) on.&lt;/p&gt;
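
&lt;p&gt;A quick way to confirm those counts is to check the shapes (a one-line sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# 303 rows split 80/20 -&gt; 242 training samples and 61 test samples
X_train.shape, X_test.shape, y_train.shape, y_test.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;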

&lt;h3&gt;
  
  
  Model choices
&lt;/h3&gt;

&lt;p&gt;Now that we've got our data prepared, we can start to fit models. We'll be using the following and comparing their results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logistic Regression - &lt;code&gt;LogisticRegression()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors Classifier - &lt;code&gt;KNeighborsClassifier()&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Random Forest Classifier - &lt;code&gt;RandomForestClassifier()&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Why these?
&lt;/h4&gt;

&lt;p&gt;If we look at the &lt;a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html"&gt;Scikit-Learn algorithm cheat sheet&lt;/a&gt;, we can see that we're working on a classification problem and these are the algorithms that it suggests (plus a few more).&lt;/p&gt;

&lt;p&gt;"Wait, I don't see Logistic Regression, and why not use LinearSVC?"&lt;/p&gt;

&lt;p&gt;Good questions.&lt;/p&gt;

&lt;p&gt;It is confusing that Logistic Regression isn't listed as well, because it is in fact &lt;a href="https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression"&gt;a model for classification&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's pretend that we've tried LinearSVC, and that it doesn't work, so now we're following other options in the map.&lt;/p&gt;

&lt;p&gt;For now, knowing each of these algorithms inside and out is not essential.&lt;/p&gt;

&lt;p&gt;Machine Learning and Data Science is an iterative practice. These algorithms are tools in our toolbox.&lt;/p&gt;

&lt;p&gt;In the beginning, on our way to becoming a practitioner, it's more important to understand our problem (such as, classification versus regression) and then knowing what tools we can use to solve it.&lt;/p&gt;

&lt;p&gt;Since our dataset is relatively small, we can experiment to find which algorithm performs best.&lt;/p&gt;

&lt;p&gt;All of the algorithms in the Scikit-Learn library share the same interface: &lt;code&gt;model.fit(X_train, y_train)&lt;/code&gt; for training a model, and &lt;code&gt;model.score(X_test, y_test)&lt;/code&gt; for scoring it. &lt;code&gt;score()&lt;/code&gt; returns the ratio of correct predictions (1.0 = 100% correct).&lt;/p&gt;

&lt;p&gt;Since the algorithms we've chosen implement the same methods for fitting them to the data as well as evaluating them, let's put them in a dictionary and create a function which fits and scores them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Put models in a dictionary
&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Logistic Regression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="s"&gt;"KNN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="s"&gt;"Random Forest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a function to fit and score models
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit_and_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Fits and evaluates given machine learning models.
    models: a dict of different Scikit-Learn machine learning models
    X_train: training data
    X_test: testing data
    y_train: labels associated with training data
    y_test: labels associated with test data
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# Set random seed for reproducible results
&lt;/span&gt;    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Make a list to keep model scores
&lt;/span&gt;    &lt;span class="n"&gt;model_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="c1"&gt;#Loop through models
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Fit the model to the data
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Evaluate the model and append its score to model_scores
&lt;/span&gt;        &lt;span class="n"&gt;model_scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_scores&lt;/span&gt; 

&lt;span class="n"&gt;model_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fit_and_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model_scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'Logistic Regression': 0.8852459016393442,
 'KNN': 0.6885245901639344,
 'Random Forest': 0.8360655737704918}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that our models are fitted and scored, let's compare them visually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model comparison
&lt;/h3&gt;

&lt;p&gt;Since we've saved our models' scores to a dictionary, we can plot them by first converting them to a DataFrame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_compare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;model_compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EP8o58fY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mltq9eehs2xjstbwlk1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EP8o58fY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mltq9eehs2xjstbwlk1b.png" alt="Comparison" width="372" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can't easily tell from the graph, but looking at the dictionary, the &lt;code&gt;LogisticRegression()&lt;/code&gt; model performs best.&lt;/p&gt;

&lt;p&gt;We've found the best model so far. Now, let's put together a classification report to show the team, including a confusion matrix and the cross-validated precision, recall, and F1 scores. We'd also want to see which features are most important, and to look at a ROC curve.&lt;/p&gt;

&lt;p&gt;Let's briefly go through each before we see them in action.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt; - Each model we use has a series of dials we can turn to dictate how they perform. Changing these values may increase or decrease model performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature importance&lt;/strong&gt; - If there are a large number of features we're using to make predictions, do some have more importance than others? For example, for predicting heart disease, which is more important, sex or age?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confusion matrix&lt;/strong&gt; - Compares the predicted values with the true values in a tabular way. If the model is 100% correct, all values in the matrix fall on the diagonal from top left to bottom right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-validation&lt;/strong&gt; - Splits our dataset into multiple parts to train and test our model on each part, then evaluates performance as an average.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt; - Proportion of true positives over the total number of predicted positives (true positives plus false positives). Higher precision leads to fewer false positives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; - Proportion of true positives over the total number of true positives and false negatives. Higher recall leads to fewer false negatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 score&lt;/strong&gt; - Combines precision and recall into one metric (their harmonic mean). 1 is best, 0 is worst.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classification report&lt;/strong&gt; - Sklearn has a built-in function called &lt;code&gt;classification_report()&lt;/code&gt; which returns some of the main classification metrics such as precision, recall, and f1-score (see the toy sketch just after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROC Curve&lt;/strong&gt; - Receiver Operating Characteristic is a plot of true positive rate versus false positive rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Area Under Curve (AUC)&lt;/strong&gt; - The area underneath the ROC curve. A perfect model achieves a score of 1.0.&lt;/li&gt;
&lt;/ul&gt;
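
&lt;p&gt;As a toy illustration of precision, recall, and F1 before we compute them on our real models (the labels here are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy example with hypothetical labels: one false negative, one false positive
from sklearn.metrics import classification_report

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(classification_report(y_true, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;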

&lt;h3&gt;
  
  
  Hyperparameter tuning and cross-validation
&lt;/h3&gt;

&lt;p&gt;To cook our favourite dish, we know to set the oven to 180 degrees and turn the grill on. But when our roommate cooks their favourite dish, they use 200 degrees and the fan-forced mode. Same oven, different settings, different outcomes.&lt;/p&gt;

&lt;p&gt;The same can be done for machine learning algorithms. We can use the same algorithms but change the settings (hyperparameters) and get different results.&lt;/p&gt;

&lt;p&gt;But just like turning the oven up too high can burn our food, the same can happen for machine learning algorithms. We change the settings and it works so well that it overfits the data.&lt;/p&gt;

&lt;p&gt;We're looking for the Goldilocks model: one which does well on our dataset but also does well on unseen examples.&lt;/p&gt;

&lt;p&gt;To test different hyperparameters, we could use a validation set but since we don't have much data, we'll use cross-validation.&lt;/p&gt;

&lt;p&gt;The most common type of cross-validation is k-fold. It involves splitting our data into k folds, then training and testing a model so that each fold gets a turn as the test set. For example, let's say we have 5 folds (k = 5).&lt;/p&gt;
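
&lt;p&gt;In Scikit-Learn, 5-fold cross-validation looks something like this (a minimal sketch, assuming &lt;code&gt;cross_val_score&lt;/code&gt; is imported from &lt;code&gt;sklearn.model_selection&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Cross-validated accuracy: train/evaluate on 5 different splits, then average
from sklearn.model_selection import cross_val_score

cv_acc = cross_val_score(LogisticRegression(), X, y, cv=5)
cv_acc.mean()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;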

&lt;p&gt;We'll be using this setup to tune the hyperparameters of some of our models and then evaluate them. We'll also get a few more metrics like precision, recall, F1-score, and ROC at the same time.&lt;/p&gt;

&lt;p&gt;Here's the game plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tune model hyperparameters, see which performs best&lt;/li&gt;
&lt;li&gt;Perform cross-validation&lt;/li&gt;
&lt;li&gt;Plot ROC curves&lt;/li&gt;
&lt;li&gt;Make a confusion matrix&lt;/li&gt;
&lt;li&gt;Get precision, recall, and F1-score metrics&lt;/li&gt;
&lt;li&gt;Find the most important model features&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Tune KNeighborsClassifier (K-Nearest Neighbors or KNN) by hand
&lt;/h3&gt;

&lt;p&gt;There's one main hyperparameter we can tune for the K-Nearest Neighbors (KNN) algorithm, and that is the number of neighbors. The default is 5 (&lt;code&gt;n_neighbors=5&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;What are neighbors?&lt;/p&gt;

&lt;p&gt;KNN works by assuming that dots which are close to each other belong to the same class. If &lt;code&gt;n_neighbors=5&lt;/code&gt;, then a dot is assigned the class shared by the majority of the 5 dots closest to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: We're leaving out some details here like what defines close or how distance is calculated.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For now, let's try a few different values of &lt;code&gt;n_neighbors&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a list of train scores
&lt;/span&gt;&lt;span class="n"&gt;train_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Create a list of test scores
&lt;/span&gt;&lt;span class="n"&gt;test_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Create a list of different values for n_neighbors
&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 1 to 20
&lt;/span&gt;
&lt;span class="c1"&gt;# Setup algorithm
&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KNeighborsClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Loop through different neighbors values
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# set neighbors value
&lt;/span&gt;
    &lt;span class="c1"&gt;# Fit the algorithm
&lt;/span&gt;    &lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Update the training scores
&lt;/span&gt;    &lt;span class="n"&gt;train_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Update the test scores
&lt;/span&gt;    &lt;span class="n"&gt;test_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at KNN's train scores and test scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1.0,
 0.8099173553719008,
 0.7727272727272727,
 0.743801652892562,
 0.7603305785123967,
 0.7520661157024794,
 0.743801652892562,
 0.7231404958677686,
 0.71900826446281,
 0.6942148760330579,
 0.7272727272727273,
 0.6983471074380165,
 0.6900826446280992,
 0.6942148760330579,
 0.6859504132231405,
 0.6735537190082644,
 0.6859504132231405,
 0.6652892561983471,
 0.6818181818181818,
 0.6694214876033058]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_scores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0.6229508196721312,
 0.639344262295082,
 0.6557377049180327,
 0.6721311475409836,
 0.6885245901639344,
 0.7213114754098361,
 0.7049180327868853,
 0.6885245901639344,
 0.6885245901639344,
 0.7049180327868853,
 0.7540983606557377,
 0.7377049180327869,
 0.7377049180327869,
 0.7377049180327869,
 0.6885245901639344,
 0.7213114754098361,
 0.6885245901639344,
 0.6885245901639344,
 0.7049180327868853,
 0.6557377049180327]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are hard to understand so let's plot them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Train score"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Test score"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Number of neighbors"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Model score"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Maximum KNN score on the test data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v1UDVg5V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02j3hlqpriq8q8ma74r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v1UDVg5V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02j3hlqpriq8q8ma74r4.png" alt="Test scores" width="392" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the graph, &lt;code&gt;n_neighbors = 11&lt;/code&gt; seems best.&lt;/p&gt;
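
&lt;p&gt;We can also pull that value out programmatically rather than eyeballing the graph (a small sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The n_neighbors value with the highest test score
best_n = neighbors[test_scores.index(max(test_scores))]
best_n  # 11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;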

&lt;p&gt;Even at its best, &lt;code&gt;KNN&lt;/code&gt;'s performance didn't get near what &lt;code&gt;LogisticRegression&lt;/code&gt; or the &lt;code&gt;RandomForestClassifier&lt;/code&gt; achieved.&lt;/p&gt;

&lt;p&gt;Because of this, we'll discard &lt;code&gt;KNN&lt;/code&gt; and focus on the other two.&lt;/p&gt;

&lt;p&gt;We've tuned &lt;code&gt;KNN&lt;/code&gt; by hand, but let's see how we can tune &lt;code&gt;LogisticRegression&lt;/code&gt; and &lt;code&gt;RandomForestClassifier&lt;/code&gt; using &lt;code&gt;RandomizedSearchCV&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of us trying different hyperparameters by hand, &lt;code&gt;RandomizedSearchCV&lt;/code&gt; tries a number of different combinations, evaluates them, and saves the best.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning models with &lt;code&gt;RandomizedSearchCV&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Reading the Scikit-Learn documentation for &lt;code&gt;LogisticRegression&lt;/code&gt;, we find there are a number of different hyperparameters we can tune.&lt;/p&gt;

&lt;p&gt;The same goes for &lt;code&gt;RandomForestClassifier&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's create a hyperparameter grid (a dictionary of different hyperparameters) for each and then test them out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Different LogisticRegression hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s"&gt;"solver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="c1"&gt;# Different RandomForestClassifier hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"n_estimators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"max_depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_split"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
           &lt;span class="s"&gt;"min_samples_leaf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's use &lt;code&gt;RandomizedSearchCV&lt;/code&gt; to tune our &lt;code&gt;LogisticRegression&lt;/code&gt; model.&lt;/p&gt;

&lt;p&gt;We'll pass it the different hyperparameters from &lt;code&gt;log_reg_grid&lt;/code&gt; as well as set &lt;code&gt;n_iter = 20&lt;/code&gt;. This means &lt;code&gt;RandomizedSearchCV&lt;/code&gt; will try 20 different combinations of hyperparameters from &lt;code&gt;log_reg_grid&lt;/code&gt; and keep the best one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup random seed
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Setup random hyperparameter search for LogisticRegression
&lt;/span&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                                &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit random hyperparameter search model
&lt;/span&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=LogisticRegression(), n_iter=20,
                   param_distributions={'C': array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
       4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
       2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
       1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
       5.45559478e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04]),
                                        'solver': ['liblinear']},
                   verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'solver': 'liblinear', 'C': 0.23357214690901212}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8852459016393442
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we've tuned &lt;code&gt;LogisticRegression&lt;/code&gt; using &lt;code&gt;RandomizedSearchCV&lt;/code&gt;, we'll do the same for &lt;code&gt;RandomForestClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setup random seed
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Setup random hyperparameter search for RandomForestClassifier
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomizedSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                           &lt;span class="n"&gt;param_distributions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rf_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;n_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit random hyperparameter search model
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([ 10,  60, 110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610,
       660, 710, 760, 810, 860, 910, 960])},
                   verbose=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find the best hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'n_estimators': 210,
 'min_samples_split': 4,
 'min_samples_leaf': 19,
 'max_depth': 3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the randomized search RFC model
&lt;/span&gt;&lt;span class="n"&gt;rs_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8688524590163934
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tuning the hyperparameters gave a slight performance boost to both &lt;code&gt;RandomForestClassifier&lt;/code&gt; and &lt;code&gt;LogisticRegression&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is akin to tuning the settings on our oven and getting it to cook our favourite dish just right.&lt;/p&gt;

&lt;p&gt;But since &lt;code&gt;LogisticRegression&lt;/code&gt; is ahead, we'll try tuning it further with &lt;code&gt;GridSearchCV&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning a model with &lt;code&gt;GridSearchCV&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The difference between &lt;code&gt;RandomizedSearchCV&lt;/code&gt; and &lt;code&gt;GridSearchCV&lt;/code&gt; is that &lt;code&gt;RandomizedSearchCV&lt;/code&gt; randomly samples &lt;code&gt;n_iter&lt;/code&gt; combinations from a grid of hyperparameters, whereas &lt;code&gt;GridSearchCV&lt;/code&gt; will test every single possible combination.&lt;/p&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RandomizedSearchCV&lt;/code&gt; - tries &lt;code&gt;n_iter&lt;/code&gt; combinations of hyperparameters and saves the best.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GridSearchCV&lt;/code&gt; - tries every single combination of hyperparameters and saves the best.&lt;/li&gt;
&lt;/ul&gt;
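
&lt;p&gt;To make the difference concrete, here's a minimal sketch (using a hypothetical grid) that counts how many fits each approach would run. &lt;code&gt;ParameterGrid&lt;/code&gt; is a Scikit-Learn utility that enumerates every combination in a grid:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: count the hyperparameter combinations each search would try
import numpy as np
from sklearn.model_selection import ParameterGrid

grid = {"C": np.logspace(-4, 4, 30), "solver": ["liblinear"]}  # hypothetical grid

n_combos = len(ParameterGrid(grid))  # 30 values of C x 1 solver = 30 combinations
print(n_combos * 5)  # GridSearchCV with cv=5: 30 x 5 = 150 fits
print(20 * 5)        # RandomizedSearchCV with n_iter=20, cv=5: 100 fits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;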

&lt;p&gt;Let's see it in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Different LogisticRegression hyperparameters
&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="s"&gt;"solver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="c1"&gt;# Setup grid hyperparameter search for LogisticRegression
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                          &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_reg_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fit grid hyperparameter search model
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Check the best parameters
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_params_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'C': 0.20433597178569418, 'solver': 'liblinear'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the model
&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8852459016393442
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, we get the same test score as before, since the finer grid settles on almost the same value of &lt;code&gt;C&lt;/code&gt; (0.204 here vs. 0.234 from the randomized search).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If there are a large number of hyperparameter combinations in our grid, &lt;code&gt;GridSearchCV&lt;/code&gt; may take a long time to try them all out. This is why it's a good idea to start with &lt;code&gt;RandomizedSearchCV&lt;/code&gt;, try a set number of combinations, and then use &lt;code&gt;GridSearchCV&lt;/code&gt; to refine them.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating a classification model, beyond accuracy
&lt;/h3&gt;

&lt;p&gt;Now that we've got a tuned model, let's get some of the metrics we discussed before.&lt;/p&gt;

&lt;p&gt;We want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ROC curve and AUC score - &lt;code&gt;plot_roc_curve()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Confusion matrix - &lt;code&gt;confusion_matrix()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Classification report - &lt;code&gt;classification_report()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Precision - &lt;code&gt;precision_score()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Recall - &lt;code&gt;recall_score()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;F1-score - &lt;code&gt;f1_score()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Luckily, Scikit-Learn has these all built-in.&lt;/p&gt;

&lt;p&gt;To access them, we'll have to use our model to make predictions on the test set. We can make predictions by calling &lt;code&gt;predict()&lt;/code&gt; on a trained model and passing it the data we'd like to predict on.&lt;/p&gt;

&lt;p&gt;We'll make predictions on the test data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions on test data
&lt;/span&gt;&lt;span class="n"&gt;y_preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see them.&lt;/p&gt;
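
&lt;p&gt;A quick way to do that (a sketch; the exact values depend on your train/test split) is to inspect the first few entries of the predictions array:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Peek at the first 10 predictions (an array of 0/1 class labels)
y_preds[:10]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;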

&lt;p&gt;Since we've got our prediction values, we can find the metrics we want.&lt;/p&gt;

&lt;p&gt;Let's start with the ROC curve and AUC scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  ROC curve and AUC scores
&lt;/h3&gt;

&lt;p&gt;What's a ROC curve?&lt;/p&gt;

&lt;p&gt;It's a way of understanding how our model is performing by comparing the true positive rate to the false positive rate.&lt;/p&gt;

&lt;p&gt;In our case:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scikit-Learn implements a function &lt;code&gt;plot_roc_curve&lt;/code&gt; which can help us create a ROC curve as well as calculate the area under the curve (AUC) metric.&lt;/p&gt;

&lt;p&gt;Reading the documentation on the &lt;code&gt;plot_roc_curve&lt;/code&gt; function, we can see it takes &lt;code&gt;(estimator, X, y)&lt;/code&gt; as inputs, where &lt;code&gt;estimator&lt;/code&gt; is a fitted machine learning model and &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are the data we'd like to test it on.&lt;/p&gt;

&lt;p&gt;In our case, we'll use the &lt;code&gt;GridSearchCV&lt;/code&gt; version of our &lt;code&gt;LogisticRegression&lt;/code&gt; estimator, &lt;code&gt;gs_log_reg&lt;/code&gt; as well as the test data, &lt;code&gt;X_test&lt;/code&gt; and &lt;code&gt;y_test&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot ROC curve and calculate AUC metric
&lt;/span&gt;&lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs_log_reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6romPj4G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv73mzne8fo9gtyy6cw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6romPj4G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv73mzne8fo9gtyy6cw4.png" alt="ROC" width="386" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our model does far better than random guessing, which would be a diagonal line from the bottom-left corner to the top-right corner (AUC = 0.5). But a perfect model would achieve an AUC score of 1.0, so there's still room for improvement.&lt;/p&gt;
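
&lt;p&gt;&lt;em&gt;Note: &lt;code&gt;plot_roc_curve&lt;/code&gt; was removed in newer versions of Scikit-Learn (1.2+). If you're on a recent version, the sketch below does the same thing with &lt;code&gt;RocCurveDisplay&lt;/code&gt;, and computes the AUC score directly with &lt;code&gt;roc_auc_score&lt;/code&gt;:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Equivalent ROC curve + AUC on Scikit-Learn 1.2+ (sketch)
from sklearn.metrics import RocCurveDisplay, roc_auc_score

RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);

# AUC from the predicted probabilities of the positive class
roc_auc_score(y_test, gs_log_reg.predict_proba(X_test)[:, 1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;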

&lt;p&gt;Let's move on to the next evaluation request, a confusion matrix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confusion matrix
&lt;/h3&gt;

&lt;p&gt;A confusion matrix is a visual way to show where our model made the right predictions and where it made the wrong predictions (or in other words, got confused).&lt;/p&gt;

&lt;p&gt;Scikit-Learn allows us to create a confusion matrix using &lt;code&gt;confusion_matrix()&lt;/code&gt; and passing it the true labels and predicted labels.&lt;/p&gt;
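
&lt;p&gt;Before making it pretty, we can print the raw array (a quick sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Raw confusion matrix: rows are true labels, columns are predicted labels
confusion_matrix(y_test, y_preds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;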

&lt;p&gt;Because Scikit-Learn's built-in confusion matrix is a bit bland, we probably want to make it visual. Let's create a function which uses Seaborn's &lt;code&gt;heatmap()&lt;/code&gt; for doing so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;font_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Increase font size
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_conf_mat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_preds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Plots a confusion matrix using Seaborn's heatmap().
    """&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Annotate the boxes
&lt;/span&gt;                     &lt;span class="n"&gt;cbar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Predicted label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"True label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_conf_mat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e5MD2c0_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4h4z6e68gwdnz02eoyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e5MD2c0_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4h4z6e68gwdnz02eoyr.png" alt="Confusion matrix" width="228" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the model gets confused (predicts the wrong label) roughly equally across both classes. In essence, there are 3 occasions where the model predicted 0 when it should have been 1 (false negatives) and 4 occasions where the model predicted 1 instead of 0 (false positives).&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification report
&lt;/h3&gt;

&lt;p&gt;We can make a classification report using &lt;code&gt;classification_report()&lt;/code&gt; and passing it the true labels as well as our model's predicted labels.&lt;/p&gt;

&lt;p&gt;A classification report will also give us information on the precision and recall of our model for each class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Show classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_preds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          precision    recall  f1-score   support

           0       0.89      0.86      0.88        29
           1       0.88      0.91      0.89        32

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.89        61

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's going on here?&lt;/p&gt;

&lt;p&gt;Let's refresh our memory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt; - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 score&lt;/strong&gt; - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support&lt;/strong&gt; - The number of samples each metric was calculated on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Macro avg&lt;/strong&gt; - Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weighted avg&lt;/strong&gt; - Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples).&lt;/li&gt;
&lt;/ul&gt;
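
&lt;p&gt;To tie these definitions back to the report, here's a sketch that recomputes the class 1 metrics by hand, assuming the counts read off the confusion matrix above (29 true positives, 4 false positives, 3 false negatives):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Recompute the class 1 row of the classification report by hand
tp, fp, fn = 29, 4, 3

precision = tp / (tp + fp)                          # 29/33 = 0.88
recall = tp / (tp + fn)                             # 29/32 = 0.91
f1 = 2 * precision * recall / (precision + recall)  # 0.89

print(precision, recall, f1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;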

&lt;p&gt;OK, now we've got a few deeper insights into our model. But these were all calculated using a single training and test set.&lt;/p&gt;

&lt;p&gt;What we'll do to make them more solid is calculate them using cross-validation.&lt;/p&gt;

&lt;p&gt;How?&lt;/p&gt;

&lt;p&gt;We'll take the best model along with the best hyperparameters and use &lt;code&gt;cross_val_score()&lt;/code&gt; along with various scoring parameter values.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cross_val_score()&lt;/code&gt; works by taking an estimator (machine learning model) along with data and labels. It then evaluates the machine learning model on the data and labels using cross-validation and a defined scoring parameter.&lt;/p&gt;

&lt;p&gt;Let's remind ourselves of the best hyperparameters and then see them in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Instantiate best model with best hyperparameters (found with GridSearchCV)
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.20433597178569418&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
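
&lt;p&gt;Alternatively (a small shortcut), &lt;code&gt;GridSearchCV&lt;/code&gt; keeps a refit copy of the best model in its &lt;code&gt;best_estimator_&lt;/code&gt; attribute, so we could also write:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same model without copying the hyperparameters by hand
clf = gs_log_reg.best_estimator_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;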



&lt;p&gt;Now that we've got an instantiated classifier, let's find some cross-validated metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validate accuracy score
&lt;/span&gt;&lt;span class="n"&gt;cv_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# 5-fold cross-validation
&lt;/span&gt;                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_acc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# since there are 5 metrics here, we'll take the average
&lt;/span&gt;&lt;span class="n"&gt;cv_acc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8479781420765027
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we'll do the same for other classification metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validated precision score
&lt;/span&gt;&lt;span class="n"&gt;cv_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"precision"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_precision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_precision&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8215873015873015
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validated recall score
&lt;/span&gt;&lt;span class="n"&gt;cv_recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"recall"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_recall&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_recall&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.9272727272727274
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross-validated F1 score
&lt;/span&gt;&lt;span class="n"&gt;cv_f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cv_f1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv_f1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.8705403543192143
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've got cross-validated metrics. Let's visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualizing cross-validated metrics
&lt;/span&gt;&lt;span class="n"&gt;cv_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"Accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="s"&gt;"Precision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="s"&gt;"Recall"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="s"&gt;"F1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cv_f1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                         &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;cv_metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Cross-Validated Classification Metrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F_g1ePNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9cwkvmerbc0qjzt177vj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F_g1ePNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9cwkvmerbc0qjzt177vj.png" alt="Metrics" width="381" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final thing to check off the list of our model evaluation techniques is feature importance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature importance
&lt;/h3&gt;

&lt;p&gt;Feature importance is another way of asking, "which features contribute most to the outcomes of the model?"&lt;/p&gt;

&lt;p&gt;Or for our problem, trying to predict heart disease using a patient's medical characteristics, which characteristics contribute most to a model predicting whether someone has heart disease or not?&lt;/p&gt;

&lt;p&gt;Unlike some of the other functions we've seen, feature importance isn't one-size-fits-all: because each model finds patterns in data slightly differently, each model also judges the importance of those patterns differently. This means for each model type, there's a slightly different way of finding which features were most important.&lt;/p&gt;

&lt;p&gt;We can usually find an example via the Scikit-Learn documentation or via searching for something like "[MODEL TYPE] feature importance", such as, "random forest feature importance".&lt;/p&gt;
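
&lt;p&gt;For example (a sketch, not something we ran above), tree-based models like &lt;code&gt;RandomForestClassifier&lt;/code&gt; expose a &lt;code&gt;feature_importances_&lt;/code&gt; attribute instead, assuming &lt;code&gt;X&lt;/code&gt; is a dataframe holding our feature columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Feature importance for the tuned random forest from earlier (sketch)
rf_importances = dict(zip(X.columns, rs_rf.best_estimator_.feature_importances_))
rf_importances
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;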

&lt;p&gt;Since we're using &lt;code&gt;LogisticRegression&lt;/code&gt;, we'll look at one way we can calculate feature importance for it.&lt;/p&gt;

&lt;p&gt;To do so, we'll use the &lt;code&gt;coef_&lt;/code&gt; attribute. Looking at the Scikit-Learn documentation for &lt;code&gt;LogisticRegression&lt;/code&gt;, the &lt;code&gt;coef_&lt;/code&gt; attribute is the coefficient of the features in the decision function.&lt;/p&gt;

&lt;p&gt;We can access the &lt;code&gt;coef_&lt;/code&gt; attribute after we've fit an instance of &lt;code&gt;LogisticRegression&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fit an instance of LogisticRegression (taken from above)
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Check coef_
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;array([[ 0.00316728, -0.86044619,  0.6606706 , -0.01156993, -0.00166374,
         0.04386123,  0.31275813,  0.02459361, -0.60413061, -0.56862832,
         0.45051624, -0.63609879, -0.67663383]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at this, it might not make much sense. But these values are how much each feature contributes to how the model decides whether the patterns in a sample of a patient's health data lean more towards having heart disease or not.&lt;/p&gt;

&lt;p&gt;Even knowing this, in its current form, this &lt;code&gt;coef_&lt;/code&gt; array still doesn't mean much. But it will if we combine it with the columns (features) of our dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Match features to columns
&lt;/span&gt;&lt;span class="n"&gt;feature_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])))&lt;/span&gt;
&lt;span class="n"&gt;feature_dict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'age': 0.003167276981166473,
 'sex': -0.8604461876496617,
 'cp': 0.6606705956924419,
 'trestbps': -0.011569931456373254,
 'chol': -0.0016637425660326452,
 'fbs': 0.04386123481563001,
 'restecg': 0.3127581278180605,
 'thalach': 0.02459361121787892,
 'exang': -0.6041306062021752,
 'oldpeak': -0.5686283181242949,
 'slope': 0.4505162370067001,
 'ca': -0.6360987949046014,
 'thal': -0.6766338344936489}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's visualize them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualize feature importance
&lt;/span&gt;&lt;span class="n"&gt;feature_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;feature_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Feature Importance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cv2Az7Ix--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq15reohukkyj6nvg45u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cv2Az7Ix--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq15reohukkyj6nvg45u.png" alt="Feature importance" width="391" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We notice some are negative and some are positive.&lt;/p&gt;

&lt;p&gt;The larger the absolute value (the bigger the bar), the more the feature contributes to the model's decision.&lt;/p&gt;

&lt;p&gt;If the value is negative, it means there's a negative correlation. And vice versa for positive values.&lt;/p&gt;

&lt;p&gt;For example, the sex attribute has a negative value of -0.860, which means as the value for sex increases, the target value decreases.&lt;/p&gt;

&lt;p&gt;We can see this by comparing the sex column to the target column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"sex"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;target    0    1
sex              
0        24   72
1       114   93
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that when sex is 0 (female), there are three times as many people with heart disease (target = 1) as without (72 vs. 24).&lt;/p&gt;

&lt;p&gt;And then as sex increases to 1 (male), the ratio drops to slightly below 1 to 1: 93 with heart disease vs. 114 without.&lt;/p&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;It means the model has found a pattern which reflects the data. Looking at these figures and this specific dataset, it seems if the patient is female, they're more likely to have heart disease.&lt;/p&gt;

&lt;p&gt;How about a positive correlation?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Contrast slope (positive coefficient) with target
&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"slope"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;target    0    1
slope            
0        12    9
1        91   49
2        35  107
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking back at the data dictionary, we see &lt;code&gt;slope&lt;/code&gt; is the "slope of the peak exercise ST segment" where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0: Upsloping: better heart rate with exercise (uncommon)&lt;/li&gt;
&lt;li&gt;1: Flatsloping: minimal change (typical healthy heart)&lt;/li&gt;
&lt;li&gt;2: Downsloping: signs of an unhealthy heart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to the model, there's a positive coefficient of 0.451 for &lt;code&gt;slope&lt;/code&gt;, not as strong as &lt;code&gt;sex&lt;/code&gt;, but still more than 0.&lt;/p&gt;

&lt;p&gt;This positive correlation means our model is picking up the pattern that as &lt;code&gt;slope&lt;/code&gt; increases, so does the &lt;code&gt;target&lt;/code&gt; value.&lt;/p&gt;

&lt;p&gt;What can we do with this information?&lt;/p&gt;

&lt;p&gt;This is something we might want to talk to a subject matter expert about. They may be interested in seeing where the machine learning model is finding the most patterns (highest correlation) as well as where it's not (lowest correlation).&lt;/p&gt;

&lt;p&gt;Doing this has a few benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Finding out more&lt;/strong&gt; - If some of the correlations and feature importances are confusing, a subject matter expert may be able to shed some light on the situation and help us figure out more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redirecting efforts&lt;/strong&gt; - If some features offer far more value than others, this may change how we collect data for different problems. See point 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less but better&lt;/strong&gt; - Similar to above, if some features are offering far more value than others, we could reduce the number of features our model tries to find patterns in, as well as improve the ones which offer the most. This could potentially save on computation, by having the model find patterns across fewer features whilst still achieving the same performance levels.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  6. Experimentation
&lt;/h2&gt;

&lt;p&gt;We've completed all the metrics requested. We should be able to put together a great report containing a confusion matrix, a handful of cross-validated metrics such as precision, recall, and F1, as well as which features contribute most to the model making a decision.&lt;/p&gt;

&lt;p&gt;But after all this we might be wondering where step 6 in the framework is, experimentation.&lt;/p&gt;

&lt;p&gt;The whole thing is experimentation!&lt;/p&gt;

&lt;p&gt;From trying different models, to tuning them, to figuring out which hyperparameters were best.&lt;/p&gt;

&lt;p&gt;What we've worked through so far has been a series of experiments.&lt;/p&gt;

&lt;p&gt;And we could keep going. But of course, things can't go on forever.&lt;/p&gt;

&lt;p&gt;So by this stage, after trying a few different things, we'd ask ourselves: did we meet the evaluation metric?&lt;/p&gt;

&lt;p&gt;We defined one in step 3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this case, we didn't. The highest accuracy our model achieved was 88.5%, below the 95% target.&lt;/p&gt;

&lt;h4&gt;
  
  
  What's next?
&lt;/h4&gt;

&lt;p&gt;What happens when the evaluation metric doesn't get hit?&lt;/p&gt;

&lt;p&gt;Is everything we've done wasted?&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;It means we know what doesn't work. In this case, we know the current model we're using (a tuned version of LogisticRegression) along with our specific data set doesn't hit the target we set ourselves.&lt;/p&gt;

&lt;p&gt;This is where step 6 comes into its own.&lt;/p&gt;

&lt;p&gt;A good next step would be to discuss with our team or research on our own different options for going forward.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Could we collect more data?&lt;/li&gt;
&lt;li&gt;Could we try a better model? If we're working with structured data, we might want to look into &lt;a href="https://catboost.ai/"&gt;CatBoost&lt;/a&gt; or &lt;a href="https://xgboost.ai/"&gt;XGBoost&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Could we improve the current models (beyond what we've done so far)?&lt;/li&gt;
&lt;li&gt;If our model is good enough, how would we export it and share it with others? (Hint: check out &lt;a href="https://scikit-learn.org/stable/modules/model_persistence.html"&gt;Scikit-Learn's documentation on model persistence&lt;/a&gt;; a minimal sketch follows below)&lt;/li&gt;
&lt;/ul&gt;
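
&lt;p&gt;On that last point, saving a fitted model is a one-liner with &lt;code&gt;joblib&lt;/code&gt;, the approach Scikit-Learn's persistence docs describe (a sketch; the filename is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Save the tuned model to disk, then load it back and check it scores the same
from joblib import dump, load

dump(gs_log_reg, "gs_log_reg.joblib")
loaded_model = load("gs_log_reg.joblib")
loaded_model.score(X_test, y_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;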

&lt;p&gt;The key here is to remember that our biggest restriction will be time, which is why it's paramount to minimise the delay between experiments.&lt;/p&gt;

&lt;p&gt;The more we try, the more we figure out what doesn't work, and the more we'll start to get the hang of what does.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
