AutoAI in Cloud Pak for Data automates the ETL (Extract, Transform, and Load) and feature engineering processes for relational data, saving data scientists months of manual data preparation time and achieving results comparable to those of top-performing data scientists.
The AutoAI graphical tool in Watson Studio automatically analyzes your data and generates candidate model pipelines customized for your predictive modeling problem. These model pipelines are created iteratively as AutoAI analyzes your dataset and discovers data transformations, algorithms, and parameter settings that work best for your problem setting. Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to your problem optimization objective.
Collect your input data in a CSV file or files. Where possible, AutoAI will transform the data and impute missing values.
- Your data source must contain a minimum of 100 records (rows).
- You can use the IBM Watson Studio Data Refinery tool to prepare and shape your data.
- Data can be a file added as connected data from a networked file system (NFS). Follow the instructions for adding a data connection of the type Mounted Volume. Choose the CSV file to add to the project so you can select it for training data.
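Before uploading, you can sanity-check a data source against these requirements yourself. The sketch below (using pandas, with a tiny inline CSV standing in for one of your files) checks the 100-row minimum and reports missing values that AutoAI would impute:

```python
import io
import pandas as pd

# Hypothetical sample standing in for one of your CSV data sources.
csv_data = io.StringIO(
    "Product number,Quantity,Date\n"
    "1110,27,12/01/2017\n"
    "1110,,13/01/2017\n"   # missing Quantity value; AutoAI can impute it
    "2110,14,14/01/2017\n"
)

df = pd.read_csv(csv_data)

# AutoAI requires at least 100 records (rows) of training data.
if len(df) < 100:
    print(f"Warning: only {len(df)} rows; AutoAI needs at least 100.")

# Report missing values per column (AutoAI imputes these where possible).
print(df.isna().sum())
```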
Using AutoAI, you can build and deploy a machine learning model with sophisticated training features and no coding. The tool does most of the work for you.
AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:
- Data pre-processing
- Automated model selection
- Automated feature engineering
- Hyperparameter optimization
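AutoAI's internals are proprietary, but the interplay of these stages can be sketched in plain NumPy: hold out part of the data, fit candidate models over a small hyperparameter grid, and rank them leaderboard-style by holdout error. The data and the ridge-regression candidate below are illustrative assumptions, not AutoAI's actual algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 3 numeric features, continuous target.
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 10% of the rows for evaluation, as AutoAI does by default.
split = int(len(X) * 0.9)
X_tr, X_ho, y_tr, y_ho = X[:split], X[split:], y[:split], y[split:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha*I) w = X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Crude hyperparameter optimization: grid-search the ridge penalty and
# rank candidates by holdout RMSE (lower is better), leaderboard-style.
leaderboard = []
for alpha in [0.01, 0.1, 1.0, 10.0]:
    w = fit_ridge(X_tr, y_tr, alpha)
    leaderboard.append((rmse(y_ho, X_ho @ w), alpha))

leaderboard.sort()
print(leaderboard)  # best (lowest-RMSE) candidate first
```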
In this Think Lab, you will see how to join several data sources and then build an AutoAI experiment from the joined data. The scenario explored in Part A of the Lab is for an outdoor company that wants to project sales for each product in multiple retail stores. You will learn how to join several data sources related to a fictional outdoor store named Go, then build an experiment that uses the joined data to train a machine learning model. You will then deploy the resulting model and use it to predict daily sales for each product Go sells.
- Create an IBM Cloud Lite Tier Account
- Create a Watson Studio Instance
- Provision Watson Machine Learning & Cloud Object Storage Instances
- Create a New Project
- Download the Go Sample Dataset from the Gallery
- Unzip the Go Sample Dataset's .zip File
- Add the Go Sample Datasets to the Project
In Tutorial A of this Think Lab, you will join the Go data sources, train a machine learning model on the joined data, deploy the resulting model, and use it to predict daily sales for each product Go sells.
Joining data also allows for a specialized set of feature transformations and advanced data aggregators. After building the pipelines, you can explore the factors that produced each pipeline.
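The kind of aggregate features a join enables can be illustrated in pandas. The sketch below (column names are assumptions based on the Go tables) rolls sales quantities up per product across all retailers and merges the statistics back onto each row, one flavor of what AutoAI's data aggregators produce automatically:

```python
import pandas as pd

# Hypothetical slice of the daily sales table (column names assumed).
daily_sales = pd.DataFrame({
    "Product number": [1110, 1110, 2110, 2110, 2110],
    "Retailer code": [1201, 1202, 1201, 1201, 1202],
    "Quantity": [27, 14, 8, 11, 5],
})

# One kind of aggregate feature a join enables: per-product statistics
# computed across all retailers, then merged back onto each sales row.
per_product = (daily_sales
               .groupby("Product number")["Quantity"]
               .agg(product_mean_qty="mean", product_total_qty="sum")
               .reset_index())

enriched = daily_sales.merge(per_product, on="Product number", how="left")
print(enriched)
```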
The data you will join contains the following information:
- Daily_sale: the GO company has many retailers selling its outdoor products. The daily sale table is a time series of sales records in which the DATE and QUANTITY columns indicate the sale date and sale quantity for each product in a retail store.
- Products: this table keeps product information, such as product types and names.
- Retailers: this table keeps retailer information, such as retailer names and addresses.
- Methods: this table keeps order methods, such as Via Telephone, Online, or Email.
- Go: the GO company is interested in using this data to predict its daily sales for every product in its retail stores. The prediction target column is QUANTITY in the go table, and the DATE column indicates the cutoff time at which the prediction should be made.
This tutorial presents the basic steps for joining data sets then training a machine learning model using AutoAI:
- Add and join the data
- Train the experiment
- Deploy the trained model
- Test the deployed model
- Create a New AutoAI Experiment
- Build the Data Join Schema
- Update the AutoAI Experiment Settings
- Run the AutoAI Experiment
- Explore the Holdout & Training Data Insights
- Deploy the Trained Model
- Score the Model
- View the Prediction Results
In the data join canvas you will create a left join that connects all of the data sources to the main source.
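The effect of that join schema can be sketched in pandas. A left join keeps every row of the main table and pulls in matching attributes from each secondary source; the table fragments and column names below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical fragments of the main table and two lookup tables.
go = pd.DataFrame({"Product number": [1110, 2110],
                   "Retailer code": [1201, 1202],
                   "Quantity": [27, 14]})
products = pd.DataFrame({"Product number": [1110, 2110],
                         "Product type": ["Tents", "Lanterns"]})
retailers = pd.DataFrame({"Retailer code": [1201, 1202],
                          "Retailer name": ["Alpine Outfitters", "Trail Gear"]})

# A left join keeps every row of the main table (go) and attaches
# matching attributes from each secondary source, mirroring the join
# schema drawn on the AutoAI data join canvas.
joined = (go
          .merge(products, on="Product number", how="left")
          .merge(retailers, on="Retailer code", how="left"))
print(joined)
```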
Choose Quantity as the column to predict.
AutoAI analyzes your data and determines that the Quantity column contains a wide range of numeric values, making this data suitable for a regression model. The default metric for a regression model is Root Mean Squared Error (RMSE).
- Based on analyzing a subset of the data set, AutoAI chooses a default model type: binary classification, multiclass classification, or regression. Binary is selected if the target column has two possible values, multiclass if it has a discrete set of 3 or more values, and regression if the target column is a continuous numeric variable. You can override this selection.
- AutoAI chooses a default metric for optimizing. For example, the default metric for a binary classification model is Accuracy.
- By default, ten percent of the training data is held out to test the performance of the model.
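The model-type selection described above can be approximated with a small heuristic. The function below is a rough sketch of the assumed logic, not AutoAI's actual implementation (the `discrete_max` threshold is an arbitrary illustrative choice):

```python
import pandas as pd

def default_problem_type(target: pd.Series, discrete_max: int = 10) -> str:
    """Rough sketch of AutoAI's default model-type heuristic (assumed logic):
    two distinct values -> binary classification; a small discrete set ->
    multiclass classification; continuous numeric -> regression."""
    n_unique = target.nunique(dropna=True)
    if n_unique == 2:
        return "binary classification"
    if pd.api.types.is_numeric_dtype(target) and n_unique > discrete_max:
        return "regression"
    return "multiclass classification"

print(default_problem_type(pd.Series([0, 1, 1, 0])))           # binary classification
print(default_problem_type(pd.Series(["a", "b", "c", "a"])))   # multiclass classification
print(default_problem_type(pd.Series(range(100), dtype=float)))  # regression
```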
In the main data table, go_1k.csv, choose Date as the Cutoff time column and enter dd/MM/yyyy as the date format. No data after the date in the cutoff column will be considered for training the pipelines. Note: the date format must exactly match the data or an error results.
In the data table go_daily_sales.csv, choose Date as a timestamp column so that AutoAI can enhance the feature set with time-series-related features. Enter dd/MM/yyyy as the date format. Note: the date format must exactly match the format in the data source or you will get an error running the experiment.
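You can verify your date format against the data before running the experiment. In pandas, the dd/MM/yyyy pattern corresponds to `%d/%m/%Y`, and a mismatched format fails the same way the experiment run would (the sample dates are hypothetical):

```python
import pandas as pd

dates = pd.Series(["12/01/2017", "31/03/2017"])  # hypothetical DATE values

# dd/MM/yyyy in the AutoAI settings corresponds to %d/%m/%Y in Python.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.year.tolist())  # [2017, 2017]

# A mismatched format fails, just as the experiment run would.
try:
    pd.to_datetime(dates, format="%m/%d/%Y")  # wrong: 31 is not a valid month
except ValueError as err:
    print("format mismatch:", err)
```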
After defining the experiment, you can allocate the resources for training the pipelines. Click Runtime to switch to the Runtime tab. Increase the number of executors to 10. Click Save settings to save the configuration changes.
To score the model, you create a batch job that will pass new data to the model for processing, then output the predictions to a file. Note: For this tutorial, you will submit the training files as the scoring files as a way to demonstrate the process and view results.
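Conceptually, a batch job reads the scoring file, scores every row at once, and writes the predictions out. The sketch below mimics that flow locally with a stub in place of the deployed model; the real job sends the rows to your Watson Machine Learning deployment instead, and all names and values here are illustrative assumptions:

```python
import io
import pandas as pd

# Hypothetical scoring input; in this tutorial it is the same data you
# trained on, reused to demonstrate the process.
scoring_input = io.StringIO(
    "Product number,Retailer code\n"
    "1110,1201\n"
    "2110,1202\n"
)

def predict_quantity(row: pd.Series) -> float:
    """Stand-in for the deployed AutoAI model; a real batch job would
    send these rows to the Watson Machine Learning deployment."""
    return 10.0 + (row["Product number"] % 7)  # toy placeholder score

batch = pd.read_csv(scoring_input)
# Score every row, then write the predictions to an output file.
batch["prediction"] = batch.apply(predict_quantity, axis=1)
out = io.StringIO()
batch.to_csv(out, index=False)
print(batch)
```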
In Tutorial B of this Think Lab, you will use IBM AutoAI to automate data analysis for a dataset collected from a fictional call center. The objective of the analysis is to gain more insight into factors that impact customer experience so that the company can improve customer service. The data consists of historical information about customer interaction with call agents, call type, customer wireless plans, and call type resolution.