This article provides a guide on using Amazon SageMaker to build, train, and deploy a machine learning model for predicting house prices using the Ames Housing dataset. It covers the key features of SageMaker, data preprocessing steps, model training, and deployment, and demonstrates how to test the deployed model. The guide also includes important steps to clean up resources to avoid unnecessary costs.
Overview of Amazon SageMaker
Amazon SageMaker is a fully managed service provided by AWS (Amazon Web Services) that enables developers and data scientists to build, train, and deploy machine learning models at scale. It simplifies the machine learning workflow by offering a suite of tools and services designed to handle various stages of the machine learning lifecycle, from data preparation to model deployment and monitoring.
Key Features
Integrated Development Environment:
SageMaker Studio: An integrated development environment (IDE) for machine learning that provides a web-based interface to build, train, and deploy models. It offers a collaborative environment with support for notebooks, debugging, and monitoring.
Data Preparation:
Data Wrangler: Simplifies data preparation and feature engineering with a visual interface that integrates with various data sources.
Feature Store: A repository to store, share, and manage features for machine learning models, ensuring consistency and reusability across projects.
Model Building:
Built-in Algorithms: Provides a collection of pre-built machine learning algorithms optimized for performance and scalability.
Custom Algorithms: Supports bringing your own algorithms and frameworks, including TensorFlow, PyTorch, and Scikit-learn.
Model Training:
Managed Training: Automatically provisions and manages the underlying infrastructure for training machine learning models.
Distributed Training: Supports distributed training for large datasets and complex models, reducing training time.
Automatic Model Tuning: Also known as hyperparameter optimization, it helps find the best version of a model by automatically adjusting hyperparameters.
Model Deployment:
Real-time Inference: Deploy models as scalable, secure, and high-performance endpoints for real-time predictions.
Batch Transform: Allows for batch processing of large datasets for inference.
Multi-Model Endpoints: Supports deploying multiple models on a single endpoint, optimizing resource utilization.
Model Monitoring and Management:
Model Monitor: Automatically monitors deployed models for data drift and performance degradation, triggering alerts and actions when necessary.
Pipelines: Enables the creation and management of end-to-end machine learning workflows, from data preparation to deployment and monitoring.
Step-by-Step guide
AWS Free Tier
First, let's talk about the AWS Free Tier for SageMaker. What interests us here are the "Studio notebooks, and notebook instances" and "Training" sections. Based on what they offer, we are going to use ml.t2.medium as the notebook instance type and ml.m5.large as the training instance type (check the availability of these types in the region where you will provision the resources; I am currently using the eu-west-3 region).
Storing datasets
First, create a basic S3 bucket that will store the raw dataset and, later, the formatted training and test datasets.
For this example, I will use the Ames Housing dataset, which is well known in machine learning for predictive modeling. It contains information about houses in Ames, Iowa, including features such as the size of the house, the year it was built, the type of roof, and the sale price. The goal is to predict the sale price of a house based on these features.
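If you prefer to script this step rather than use the console, here is a minimal boto3 sketch; the bucket name and region match the ones used later in this guide, so replace them with your own (bucket names must be globally unique):
import boto3

region = 'eu-west-3'                        # region used throughout this guide
bucket_name = 'dhoang-sagemaker-datasets'   # replace with your own, globally unique name

s3 = boto3.client('s3', region_name=region)

# Outside us-east-1, a LocationConstraint matching the region is required
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={'LocationConstraint': region}
)

# Upload the raw dataset downloaded locally as AmesHousing.csv
s3.upload_file('AmesHousing.csv', bucket_name, 'AmesHousing.csv')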
IAM Role
Next, create an IAM role with the AmazonSageMakerFullAccess policy as well as permission to read and write objects in the dataset S3 bucket.
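If you want to create it from code instead of the console, here is a minimal boto3 sketch; the role name is a hypothetical placeholder, and the inline policy only covers the dataset bucket:
import json
import boto3

iam = boto3.client('iam')
role_name = 'sagemaker-ames-housing-role'   # hypothetical name
bucket_name = 'dhoang-sagemaker-datasets'

# Trust policy letting SageMaker assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the managed SageMaker policy
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

# Inline policy granting read/write access to the dataset bucket
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{bucket_name}",
            f"arn:aws:s3:::{bucket_name}/*"
        ]
    }]
}
iam.put_role_policy(
    RoleName=role_name,
    PolicyName='dataset-bucket-access',
    PolicyDocument=json.dumps(s3_policy)
)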
Notebook instance
Create a SageMaker Notebook instance with the following parameters (an equivalent boto3 call is sketched after the list):
- type: ml.t2.medium
- attach the previously created IAM role
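For reference, here is a rough boto3 equivalent; the instance name is a hypothetical placeholder, and the role ARN is the one created in the previous step:
import boto3

sm = boto3.client('sagemaker', region_name='eu-west-3')

sm.create_notebook_instance(
    NotebookInstanceName='ames-housing-notebook',   # hypothetical name
    InstanceType='ml.t2.medium',                    # Free Tier eligible
    RoleArn='arn:aws:iam::123456789012:role/sagemaker-ames-housing-role'  # ARN of the role created above (placeholder account ID)
)

# Block until the instance is ready to open
sm.get_waiter('notebook_instance_in_service').wait(NotebookInstanceName='ames-housing-notebook')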
Load and explore dataset
Once your instance appears as "InService", you can click on "Open Jupyter" (it can take a few minutes for the instance to become ready to use).
This will open a new page with the Jupyter Notebook interface. Now create a new notebook using the conda_python3 kernel.
Dependencies and dataset loading
Add this code in the first cell and run it:
import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sklearn.model_selection import train_test_split
# Load Data from S3
s3 = boto3.client('s3')
bucket_name = 'dhoang-sagemaker-datasets'  # replace with your own bucket name
file_key = 'AmesHousing.csv'
obj = s3.get_object(Bucket=bucket_name, Key=file_key)
df = pd.read_csv(obj['Body'])
df
This imports all the libraries we need to access the S3 bucket, pre-process the dataset, and train and deploy our model, then displays the loaded DataFrame so you can check that the data was read correctly.
Pre-processing the dataset
Because the model cannot handle empty or non-numerical values, we need to pre-process the dataset. We also split the formatted dataset into a train part and a test part to validate the model. Note that the built-in linear-learner algorithm expects the target variable in the first column when training from CSV, so we move SalePrice to the front before saving the files.
Add this code and run it:
# Preprocess Data
print(df.isnull().sum())

# Fill missing numeric values with the column mean
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Fill missing categorical values with the most frequent value
categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

# Encode Categorical Features
df = pd.get_dummies(df)

# Ensure all boolean columns are converted to numeric
df = df.applymap(lambda x: 1 if x is True else 0 if x is False else x)

# linear-learner expects the target in the first column for CSV input,
# so move SalePrice to the front
df = df[['SalePrice'] + [col for col in df.columns if col != 'SalePrice']]

# Split the Data
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv('train.csv', index=False, header=False)
test.to_csv('test.csv', index=False, header=False)

# Upload the Processed Data to S3
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket_name).upload_file('train.csv', 'train.csv')
s3_resource.Bucket(bucket_name).upload_file('test.csv', 'test.csv')
Train and deploy the model
Now this is where we explore SageMaker's features. In our case, we use the built-in linear-learner algorithm from SageMaker, which lets us run a linear regression on our dataset.
Next, we define the instance type that will be used for the SageMaker Training Job, here ml.m5.large to benefit from the Free Tier.
Then, once trained, the model is deployed as a SageMaker Endpoint so we can use it afterwards.
Add this code and run it:
# Train Model
role = get_execution_role()
sess = sagemaker.Session()
output_location = 's3://{}/output'.format(bucket_name)
container = sagemaker.image_uris.retrieve("linear-learner", sess.boto_region_name, "1.0-1")

linear = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=output_location,
    sagemaker_session=sess
)

linear.set_hyperparameters(
    predictor_type='regressor',
    mini_batch_size=100
)

train_data = 's3://{}/train.csv'.format(bucket_name)
test_data = 's3://{}/test.csv'.format(bucket_name)

data_channels = {
    'train': sagemaker.inputs.TrainingInput(train_data, content_type='text/csv'),
    'validation': sagemaker.inputs.TrainingInput(test_data, content_type='text/csv')
}

linear.fit(inputs=data_channels)

# Deploy Model
predictor = linear.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
It takes a few minutes to execute. During this time, you can follow the Training Job status in the SageMaker console and the logs in your notebook. After the job finishes, the Model Endpoint deployment starts.
Wait until the Endpoint status appears as InService.
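If you prefer to wait programmatically instead of refreshing the console, a boto3 waiter can block until the endpoint is ready. A small sketch, assuming the predictor object from the previous cell:
import boto3

sm = boto3.client('sagemaker')

# Block until the endpoint reaches InService (raises if deployment fails)
sm.get_waiter('endpoint_in_service').wait(EndpointName=predictor.endpoint_name)

status = sm.describe_endpoint(EndpointName=predictor.endpoint_name)['EndpointStatus']
print(status)  # expected: InService
Note that linear.deploy() already waits for the endpoint by default, so this is mainly useful if you deploy asynchronously or check the endpoint from another session.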
Test the model
The following code uses the freshly created Endpoint to predict SalePrice on the test dataset. I will run it in the notebook, but you can actually run it from anywhere that has access to your endpoint:
# Test Model
test_data_no_target = test.drop(columns=['SalePrice'])
# Ensure all data is numeric
assert test_data_no_target.applymap(np.isreal).all().all(), "Test data contains non-numeric values"
# Convert test data to CSV string
csv_input = test_data_no_target.to_csv(header=False, index=False).strip()
# Initialize the predictor with correct serializers
predictor = sagemaker.predictor.Predictor(
endpoint_name=predictor.endpoint_name,
sagemaker_session=sagemaker.Session(),
serializer=CSVSerializer(),
deserializer=JSONDeserializer()
)
# Make predictions
predictions = predictor.predict(csv_input)
print(predictions)
You should get back a set of predictions. Accuracy is not the point here; the goal is simply to walk you through using SageMaker.
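If you want a rough quality check anyway, you can compare the returned scores with the actual SalePrice values from the test split. A small sketch, assuming the linear-learner response format (a JSON object with a list of predictions, each holding a score):
# Compare predicted scores with the actual SalePrice values
predicted = np.array([p['score'] for p in predictions['predictions']])
actual = test['SalePrice'].values

rmse = np.sqrt(np.mean((predicted - actual) ** 2))
mae = np.mean(np.abs(predicted - actual))
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")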
Cleaning up
To avoid unwanted costs and keep your account clean, do not forget to delete the following (a boto3 sketch follows the list):
- SageMaker Endpoint
- SageMaker Notebook instance
- S3 bucket
- IAM role
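Here is a hedged boto3 sketch of the same cleanup, reusing the hypothetical names from the earlier sketches; adjust them to your own resources:
import boto3

sm = boto3.client('sagemaker', region_name='eu-west-3')
iam = boto3.client('iam')
notebook_name = 'ames-housing-notebook'       # hypothetical name used above
bucket_name = 'dhoang-sagemaker-datasets'
role_name = 'sagemaker-ames-housing-role'     # hypothetical name used above

# From the notebook, delete the endpoint (and its configuration) created by linear.deploy()
predictor.delete_endpoint()

# The remaining steps are best run from outside the notebook (e.g. your laptop or CloudShell),
# since deleting the notebook instance stops the notebook itself

# Stop and delete the notebook instance
sm.stop_notebook_instance(NotebookInstanceName=notebook_name)
sm.get_waiter('notebook_instance_stopped').wait(NotebookInstanceName=notebook_name)
sm.delete_notebook_instance(NotebookInstanceName=notebook_name)

# Empty and delete the S3 bucket
bucket = boto3.resource('s3').Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()

# Detach and delete the role's policies, then delete the IAM role
iam.detach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)
iam.delete_role_policy(RoleName=role_name, PolicyName='dataset-bucket-access')
iam.delete_role(RoleName=role_name)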
Thanks for reading! I hope this helped you understand how to train a model with Amazon SageMaker, from a raw dataset to a ready-to-use endpoint. Don't hesitate to give me your feedback or suggestions.