
Suman Debnath for AWS


Amazon Redshift ML - Machine Learning in SQL Style (Part-2)


Continuing from where we left off in Part-1 of this tutorial series: there we briefly discussed Amazon Redshift, dove deep into Amazon Redshift ML, and learned how a database engineer/administrator can use Redshift ML to create, train, and deploy a machine learning model using familiar SQL commands.

Now we are going to look at some of the advanced functionality of Amazon Redshift ML that a Data Analyst or an expert Data Scientist can make use of. It offers more flexibility for specifying details such as which algorithm to use (for example, XGBoost), hyperparameters, preprocessors, and so on.

Exercise 2 (Data Analyst's perspective)

Dataset

In this problem, we are going to use the Steel Plates Faults Data Set from the UCI Machine Learning Repository. You can download the dataset from this GitHub Repo.

This dataset is related to the quality of steel plates. There are 27 independent variables (input features) comprising various attributes of a steel plate, and one dependent variable (class label) that can be one of 7 types. So the problem at hand is a multi-class classification problem, where we need to predict which of the 7 possible fault types a steel plate has.

So, the objective is to predict which fault the steel plate has (Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, or Other_Faults).

As we saw in Part-1, since our dataset is located in Amazon S3, we first need to load the data into a table. We can open DataGrip (or whichever SQL client you are using) and create the schema and the table. Once that is done, we can use the COPY command to load the training data from Amazon S3 (steel_fault_train.csv) into the Redshift cluster, into the table steel_plates_fault.

As always, we need to make sure that the column names of the table match the feature set in the CSV training dataset file.

img1
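For reference, the COPY looks roughly like the sketch below. The S3 bucket name and IAM role ARN are placeholders you would replace with your own values, and IGNOREHEADER assumes the CSV ships with a header row.

```sql
-- Load the training data from Amazon S3 into the steel_plates_fault table
COPY steel_plates_fault
FROM 's3://<your-bucket>/steel_fault_train.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
CSV
IGNOREHEADER 1;   -- drop this line if your file has no header row
```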

Similarly, we can load the test dataset (steel_fault_test.csv) into a separate table, steel_plates_fault_inference.

img2
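The test data is loaded the same way, again with placeholder bucket and role values:

```sql
-- Load the test data into the inference table
COPY steel_plates_fault_inference
FROM 's3://<your-bucket>/steel_fault_test.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
CSV
IGNOREHEADER 1;
```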

Training (Model Creation)

Now, as a data analyst, you may want to explicitly specify a few of the parameters, like the PROBLEM_TYPE and the OBJECTIVE function. When you provide this information while creating the model, Amazon SageMaker Autopilot uses the PROBLEM_TYPE and OBJECTIVE you specify, instead of trying every possibility.

For this problem, we are going to set PROBLEM_TYPE to multiclass_classification and OBJECTIVE to accuracy.

The PROBLEM_TYPE values we can specify are:

  • REGRESSION
  • BINARY_CLASSIFICATION
  • MULTICLASS_CLASSIFICATION

Similarly, the OBJECTIVE function can be one of:

  • MSE
  • Accuracy
  • F1
  • F1Macro
  • AUC

img3
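A CREATE MODEL statement with these options looks roughly like the sketch below. Only the training table and the inference function name func_model_steel_fault come from this exercise; the model name, target column (fault_class), IAM role, and S3 bucket are illustrative assumptions.

```sql
CREATE MODEL model_steel_fault              -- hypothetical model name
FROM steel_plates_fault                     -- training table loaded earlier
TARGET fault_class                          -- assumed name of the class label column
FUNCTION func_model_steel_fault             -- SQL function generated for inference
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
PROBLEM_TYPE MULTICLASS_CLASSIFICATION      -- skip Autopilot's problem-type detection
OBJECTIVE 'accuracy'                        -- metric Autopilot should optimize
SETTINGS (
  S3_BUCKET '<your-bucket>'                 -- bucket used for training artifacts
);
```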

As we learned in Part-1 of the tutorial, the CREATE MODEL command operates asynchronously and returns a response as soon as the training data has been exported to Amazon S3. The remaining steps of model training and compilation can take longer, so they continue to run in the background.

But we can always check the status of the training using the STV_ML_MODEL_INFO system table, and wait until the model_state becomes Model is Ready.

img4
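A simple way to poll the training status is a query along these lines (column selection shown is a minimal sketch against STV_ML_MODEL_INFO):

```sql
-- Training is finished when model_state shows 'Model is Ready'
SELECT model_name, model_state
FROM stv_ml_model_info;
```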

Now, let's look at the details of the model and see whether it used the PROBLEM_TYPE and OBJECTIVE function we specified when executing the CREATE MODEL command.

img5
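You can inspect this metadata (problem type, objective, model state, and eventually the validation score) with SHOW MODEL; the model name below matches the hypothetical one used in the sketch above:

```sql
SHOW MODEL model_steel_fault;
```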

Accuracy of the Model and Prediction/Inference

Lastly, let's check the accuracy of our model using the test data we have in the steel_plates_fault_inference table.

img6
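A sketch of such an accuracy check: compare the label in the inference table against the value returned by the model function. The label and feature column names below are placeholders; the function expects all 27 feature columns, in the same order used during training.

```sql
SELECT
    SUM(CASE WHEN fault_class = predicted THEN 1 ELSE 0 END)::FLOAT
      / COUNT(*) AS accuracy
FROM (
    SELECT fault_class,                                   -- assumed label column
           func_model_steel_fault(
               x_minimum, x_maximum /* , ...remaining feature columns... */
           ) AS predicted
    FROM steel_plates_fault_inference
) t;
```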

As we can see, the accuracy is around 77%, which is not all that great, but this is because we used a very small dataset to train the model, func_model_steel_fault.

And finally, let's make some predictions using this same model function.

img7
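A prediction is just the model function applied to the feature columns of whichever rows you like (again, placeholder column names):

```sql
SELECT func_model_steel_fault(
           x_minimum, x_maximum /* , ...remaining feature columns... */
       ) AS predicted_fault
FROM steel_plates_fault_inference
LIMIT 10;
```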

Lastly, let's take another example, this time from a Data Scientist's perspective, where we will make use of some more advanced options while executing the CREATE MODEL command.

Exercise 3 (Data Scientist's perspective)

The last two problems we worked on were classification problems (binary and multi-class); this time we will work on a regression problem and use some more advanced parameters while training the model (such as specifying the training algorithm, hyperparameters, etc.).

Dataset

In this problem, we are going to use the Abalone Data Set from the UCI Machine Learning Repository. You can download the dataset from this GitHub Repo.

In this problem we need to predict the age of an abalone from its physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task.

The dataset has a total of 7 input features and 1 target, which is nothing but the age.

So, first let's create the schema and the table. Once that is done, we can use the COPY command to load the training data from Amazon S3 (xgboost_abalone_train.csv) into the Redshift cluster, into the table abalone_xgb_train.

img8
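As before, a minimal sketch of the COPY (placeholder bucket and IAM role):

```sql
COPY abalone_xgb_train
FROM 's3://<your-bucket>/xgboost_abalone_train.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
CSV;   -- add IGNOREHEADER 1 if your file has a header row
```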

Similarly, we can load the test dataset (xgboost_abalone_test.csv) from Amazon S3 into a separate table, abalone_xgb_test.

img9
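And the same for the test table:

```sql
COPY abalone_xgb_test
FROM 's3://<your-bucket>/xgboost_abalone_test.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
CSV;
```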

Training (Model Creation)

As a data scientist, you may like to have more control over training the model, e.g. you may decide to provide more granular options, like MODEL_TYPE, OBJECTIVE, PREPROCESSORS, and HYPERPARAMETERS, while running the CREATE MODEL command.

As an advanced user, you may already know the model type that you want and the hyperparameters to use when training these models. You can use the CREATE MODEL command with AUTO OFF to turn off the automatic discovery of preprocessors and hyperparameters.

For this problem we are going to specify MODEL_TYPE as XGBOOST (Extreme Gradient Boosting), which we can use for both regression and classification problems. XGBoost is currently the only MODEL_TYPE supported when AUTO is set to OFF. We are also going to use the OBJECTIVE function reg:squarederror. You can specify hyperparameters as well. For more details, you may like to check the Amazon Redshift ML Developer Guide (CREATE MODEL section).

img10
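A sketch of an AUTO OFF CREATE MODEL for this exercise. The function name func_model_abalone_xgboost_regression is the one used later in this post; the model name, target column (age), IAM role, bucket, and the hyperparameter override are illustrative assumptions.

```sql
CREATE MODEL model_abalone_xgboost_regression      -- hypothetical model name
FROM abalone_xgb_train
TARGET age                                         -- assumed name of the target column
FUNCTION func_model_abalone_xgboost_regression
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
AUTO OFF                                           -- we pick algorithm and hyperparameters ourselves
MODEL_TYPE XGBOOST
OBJECTIVE 'reg:squarederror'
PREPROCESSORS 'none'                               -- no automatic preprocessing
HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100')   -- example hyperparameter override
SETTINGS (
  S3_BUCKET '<your-bucket>'
);
```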

Now, let's look at the details of the model, as we did before:

img11
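Again, SHOW MODEL prints the metadata, this time including the algorithm and hyperparameters we set explicitly (model name as assumed above):

```sql
SHOW MODEL model_abalone_xgboost_regression;
```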

Accuracy of the Model and Prediction/Inference

Now, let's check how accurate our model is, using the test data we have in the abalone_xgb_test table.

img12
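Since this is a regression model trained with a squared-error objective, a natural check is the mean squared error (and RMSE) on the test table; the target and feature column names below are placeholders:

```sql
SELECT
    AVG(POWER(age - predicted_age, 2))       AS mse,
    SQRT(AVG(POWER(age - predicted_age, 2))) AS rmse
FROM (
    SELECT age,                                              -- assumed target column
           func_model_abalone_xgboost_regression(
               length_val, diameter, height, whole_weight,
               shucked_weight, viscera_weight, shell_weight  -- assumed 7 feature columns
           ) AS predicted_age
    FROM abalone_xgb_test
) t;
```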

And finally, let's make some predictions using the same model function, func_model_abalone_xgboost_regression.

img13
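A prediction is again just a call to the model function (placeholder feature columns):

```sql
SELECT func_model_abalone_xgboost_regression(
           length_val, diameter, height, whole_weight,
           shucked_weight, viscera_weight, shell_weight
       ) AS predicted_age
FROM abalone_xgb_test
LIMIT 10;
```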

What next...

So, in this tutorial we learned about Amazon Redshift ML from an advanced user's perspective (a Data Analyst or a Data Scientist) and saw how we can create, train, and deploy an ML model using familiar SQL queries. If you are interested, you can go even further and explore the training jobs it initiates internally in the Amazon SageMaker console. Feel free to give it a try and share your feedback.

Resources
