Machine learning (ML) can be complex. We can break down this complexity to an extent by thinking of ML as a lifecycle, much like a baby growing into an infant, then an adolescent, and finally a mature adult.
This journey from an aim or goal to a functional ML solution, structured around the lifecycle, is an iterative process, complete with best practices to help our machine learning projects succeed.
💡 ML lifecycle includes:
Business goal identification
ML problem framing
Data processing (collection, preprocessing, feature engineering)
Model/Solution development (training, tuning, evaluation)
Model deployment (inference, prediction)
Model monitoring
Some of these steps, such as data processing, deployment, and monitoring, are iterative processes that you might cycle through as an ML engineer, often in the same machine learning project.
Let's shed some light on these phases of ML development.
✔️ Business goal identification
We should have a clear idea of the problem for which we are considering an ML solution, plus the measurable business value we aim to gain by solving that problem.
✔️ ML problem framing
In this phase, the business problem is framed as a machine learning problem: given a set of data or observations, what should be predicted (known as the label or target variable)?
Determining what to predict and which performance and error metrics to optimize is a key step in this phase.
✔️ Data processing is foundational to training an accurate ML model. Here we explore the specifics of collecting, preparing, and engineering features from raw data, followed by EDA (Exploratory Data Analysis). The main aim is to convert data into a usable format (a minimal code sketch follows this list).
Data Collection: Collected data should be relevant, diverse, and sufficient.
Data Cleaning: Address issues such as missing values, outliers and inconsistencies in the data.
Data Preprocessing: Standardize formats, scale values and encode categorical variables for consistency.
EDA: Uncover insights and understand the dataset's structure, surfacing patterns and trends using statistical and visual tools.
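As a minimal sketch of these steps in Python (the column names, values, and thresholds here are purely illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value, an outlier, and a categorical column
df = pd.DataFrame({
    "income": [45000, 52000, None, 61000, 1_000_000],
    "city": ["Austin", "Boston", "Austin", "Chicago", "Boston"],
})

# Data cleaning: fill the missing value and clip the extreme outlier
df["income"] = df["income"].fillna(df["income"].median())
df["income"] = df["income"].clip(upper=df["income"].quantile(0.95))

# Preprocessing: scale the numeric column and one-hot encode the categorical one
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

# Quick EDA: summary statistics (visual tools such as histograms would follow here)
print(df.describe())
```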
✔️ Model development includes model building, training, tuning and evaluation. Our aim is developing a model that performs a specific task or solves a particular problem.
✔️ Deployment covers putting trained models into production to make predictions and inferences; on AWS, several deployment strategies help with integrating models into production environments.
✔️ Finally, monitoring models after they are deployed into the production environment is an important task: robust monitoring systems facilitate early detection of deviations and support timely mitigation.
Well-Architected machine learning design principles to facilitate good design in the cloud
💡Well-Architected machine learning design principles, often guided by frameworks like the AWS Well-Architected Framework's Machine Learning Lens, aim to facilitate good design in the cloud by focusing on specific considerations for ML workloads. Few ML design principles are:
Assign ownership- Apply the right skills and the right number of resources along with accountability and empowerment to increase productivity.
Provide protection- Apply security controls to systems and services hosting model data, algorithms, computation, and endpoints. This ensures secure and uninterrupted operations.
Enable resiliency- Ensure fault tolerance and the recoverability of ML models through version control, traceability, and explainability.
Enable reusability- Use independent modular components that can be shared and reused. This helps enable reliability, improve productivity, and optimize cost.
Enable reproducibility- Use version control across components, such as infrastructure, data, models, and code. Track changes back to a point-in-time release. This approach enables model governance and audit standards.
Optimize resources- Perform trade-off analysis across available resources and configurations to achieve optimal outcome.
Reduce cost- Identify the potentials for reducing cost through automation or optimization, analyzing processes, resources, and operations.
Enable automation- Use technologies, such as pipelining, scripting, and continuous integration (CI), continuous delivery (CD), and continuous training (CT), to increase agility, improve performance, sustain resiliency, and reduce cost.
Enable continuous improvement- Evolve and improve the workload through continuous monitoring, analysis, and learning.
Minimize environmental impact- Establish sustainability goals and understand the impact of ML models. Use managed services and adopt efficient hardware and software and maximize their utilization.
How does the model work?
💡It focuses on two key components of the model: features and weights.
Features are identified parts of your dataset that are important in determining accurate outcomes.
Weights represent how important an associated feature is for determining the accuracy of that outcome. A higher likelihood of accuracy results in a higher weight.
For instance, suppose we want a recommendation system where the model predicts whether a particular model of car is worth buying. We can form a mathematical equation of the form (y = mx + c), say:
y = m1x1 + m2x2 + m3x3 + c
where x is the set of features (cost, make, type, etc.) and m is the weight we assign to each feature.
Say this model only recommends products whose total value is greater than one. It does a quick calculation over the features and weights, produces a final value of 1.05, and the product is therefore an acceptable recommendation.
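A minimal sketch of this weighted-sum idea, with made-up feature values and weights chosen so the score works out to the 1.05 mentioned above:

```python
# Hypothetical, already-normalized feature values for one car (x1, x2, x3)
features = {"cost": 0.7, "make": 0.9, "type": 0.5}

# Hypothetical weights (m1, m2, m3) for each feature, plus a bias term c
weights = {"cost": 0.5, "make": 0.5, "type": 0.3}
bias = 0.1

# y = m1*x1 + m2*x2 + m3*x3 + c
y = sum(weights[name] * value for name, value in features.items()) + bias
print(f"Score: {y:.2f}")  # 0.5*0.7 + 0.5*0.9 + 0.3*0.5 + 0.1 = 1.05

# Recommend only if the total score is greater than 1
print("Recommend" if y > 1 else "Do not recommend")
```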
Given data (labelled or unlabelled), the model is trained and initially makes predictions on test data. Once it is trained, we can use it on real data.
Developing ML Solutions with SageMaker Studio
✔️ Amazon SageMaker is a managed machine learning (ML) service used to build and train ML models and deploy them right away into a production-ready hosted environment.
✔️ Amazon Bedrock, by contrast, is designed for use cases where you want to build generative AI applications without investing heavily in custom model development.
SageMaker AI is a good choice for unique or specialized AI/ML needs that require custom-built models.
Let’s get started with our task for today.
Task
a) Sign into the AWS Management Console.
b) Create a SageMaker notebook instance, open conda_python3
c) Prepare the Dataset and Upload the data to S3
d) Train a model using SageMaker’s built-in XGBoost
e) Train a model using Scikit-Learn Random Forest
f) Deploy the model and make predictions.
g) Clean up the resources
h) Check data in S3
Solution
a) 👉 Sign into the AWS Management Console.
Sign in to the AWS Management Console and set the default AWS Region to US East (N. Virginia), us-east-1.
b) 👉 Create a SageMaker notebook instance, open conda_python3.
- Navigate to Amazon SageMaker AI by clicking on the Services menu at the top, then click on Amazon SageMaker AI. Then, click on **Notebooks** in the left panel.
- Click on the Create notebook Instance button.
- In the Notebook instance settings section, set:
Notebook instance name: Enter Instance-BuiltIn-Model
Notebook instance type: Select ml.t2.medium
IAM role (existing): AmazonSagemakeroRole-(RANDOMNUMBER)
- Leave the other options as default and click on the Create notebook instance button.
It will take up to 5 minutes for the notebook instance to be up and running. Wait until the Status changes to InService.
- Click on the Open JupyterLab button. You will be redirected to the run environment.
Select the conda_python3 kernel.
Name the notebook SageMaker_ML_Lab.ipynb.
c) 👉 Prepare the Dataset and Upload the data to S3
We will be using the California Housing dataset. It is publicly available for experimentation through scikit-learn's fetch_california_housing() function.
Copy the code below into a cell and run it:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker import get_execution_role
# Load California housing dataset
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.DataFrame(california.target, columns=["MedHouseVal"])
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Save to CSV (SageMaker expects CSV format)
train_data = pd.concat([y_train, X_train], axis=1)
test_data = pd.concat([y_test, X_test], axis=1)
train_data.to_csv("california_train.csv", index=False, header=False)
test_data.to_csv("california_test.csv", index=False, header=False)
Explanation of the above code:
1. Import the required libraries:
NumPy and pandas for numerical and tabular data handling.
scikit-learn for loading the dataset and splitting it into train and test sets.
sagemaker for interacting with Amazon SageMaker services.
2. Functions Used:
fetch_california_housing()- to load a real-world regression dataset with housing features (e.g., income, location).
train_test_split()- to divide the dataset so that part is used for training and the rest for testing.
get_execution_role()- to allow SageMaker to securely access AWS resources like S3 and training instances.
Note: The dataset contains features such as: MedInc, HouseAge, AveRooms, etc.
The target variable is the median house value (in $100,000s).
Here we:
Assign the features to a DataFrame X.
Assign the target values to a DataFrame y.
Use 80% of the data for training the model.
Use 20% of the data to evaluate model performance.
Set random_state=42 in train_test_split() to ensure reproducibility.
Place the label/target in the first column of the CSV file, as expected by SageMaker.
Remove headers from the CSV files because the XGBoost built-in algorithm does not expect them.
Later save the files locally to be uploaded to Amazon S3.
Upload the data to S3
Go to your S3 console and create your bucket. Copy its name and paste it into the code.
import boto3
import sagemaker
# Initialize S3 client
s3 = boto3.client('s3')
bucket_name = '<Your-S3-Bucket_name>' # Replace with your bucket name
prefix = 'california-housing'
# Upload to S3 using SageMaker session
sagemaker_session = sagemaker.Session()
train_path = sagemaker_session.upload_data(path='california_train.csv', bucket=bucket_name, key_prefix=prefix)
test_path = sagemaker_session.upload_data(path='california_test.csv', bucket=bucket_name, key_prefix=prefix)
print(f"Training data uploaded to: {train_path}")
print(f"Test data uploaded to: {test_path}")
Note:
SageMaker jobs read data directly from **Amazon S3**, not from your local machine.
upload_data() handles the transfer of local CSVs to S3.
prefix determines the folder path inside the S3 bucket.
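If you want to confirm the upload from the notebook itself (rather than the S3 console), a quick check with the boto3 client created above could look like this, assuming the same bucket_name and prefix variables are still defined:

```python
# List the objects under the california-housing/ prefix to verify the upload
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"], "bytes")
```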
d) 👉 Train a model using SageMaker’s built-in XGBoost
from sagemaker import image_uris
# Get the XGBoost container image URI (adjust region if needed);
# image_uris.retrieve replaces the deprecated get_image_uri helper in SageMaker SDK v2
container = image_uris.retrieve('xgboost', sagemaker_session.boto_region_name, version='1.0-1')
# Define SageMaker estimator
xgb = sagemaker.estimator.Estimator(
container,
get_execution_role(),
instance_count=1,
instance_type='ml.m5.large',
output_path=f's3://{bucket_name}/{prefix}/output',
sagemaker_session=sagemaker_session
)
# Set hyperparameters (for regression)
xgb.set_hyperparameters(
objective="reg:squarederror",
num_round=100,
max_depth=5,
eta=0.2,
gamma=4,
min_child_weight=6,
subsample=0.8
)
# Define data channels
train_input = sagemaker.TrainingInput(train_path, content_type='csv')
test_input = sagemaker.TrainingInput(test_path, content_type='csv')
# Train the model
xgb.fit({'train': train_input, 'validation': test_input})
SageMaker uses pre-built containers (Docker images) to run popular algorithms like **XGBoost**.
Fetch the URI of the XGBoost container image for your AWS region.
Use version '1.0-1' of the XGBoost container.
Use an Estimator, a SageMaker abstraction that manages training jobs.
Key parameters to set in the Estimator include:
container: the machine learning algorithm to use (XGBoost here).
instance_type: the compute resource (e.g., ml.m5.large which has 2 vCPUs and 8 GiB RAM).
output_path: the S3 location where the trained model will be saved.
get_execution_role(): to ensure the SageMaker job has permissions to access AWS services.
Tune the model for best results by specifying hyperparameters such as:
objective: set to regression.
eta: learning rate.
max_depth: controls how deep the decision trees can grow.
subsample: percentage of data used per tree.
These hyperparameters help control overfitting and underfitting.
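If you would rather have SageMaker search over these hyperparameters than fix them by hand, a sketch using the SDK's HyperparameterTuner could look like the following; it reuses the xgb estimator and data channels defined above, and the ranges and job counts are purely illustrative:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Search ranges for a few of the hyperparameters discussed above (illustrative values)
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.05, 0.5),
    "max_depth": IntegerParameter(3, 10),
    "subsample": ContinuousParameter(0.5, 1.0),
}

# Built-in XGBoost emits validation:rmse, which the tuner minimizes
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=4,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_input, "validation": test_input})
print(tuner.best_training_job())
```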
e) 👉 Train a model using Scikit-Learn Random Forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Train model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train.values.ravel())
# Evaluate
predictions = rf_model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")
Run a local model using Scikit-learn’s Random Forest algorithm.
Use this local model to benchmark performance against the XGBoost model trained on SageMaker.
Use mean_squared_error to compare and evaluate model accuracy.
Running this, mean_squared_error() reports an MSE of roughly 0.25. Note that MSE is an error measure, not an accuracy percentage; since the target is in units of $100,000, this corresponds to a typical prediction error of about $50,000.
f) 👉 Deploy the model and make predictions.
# Deploy
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
# Sample prediction
sample = X_test.iloc[0].values.tolist()
payload = ",".join(map(str, sample))
response = xgb_predictor.predict(payload, initial_args={'ContentType': 'text/csv'})
print(f"Predicted MedHouseVal: {float(response)}")
print(f"Actual MedHouseVal: {y_test.iloc[0].values[0]}")
Deploy the trained model to a real-time HTTPS endpoint in SageMaker.
This endpoint allows sending requests to get predictions.
Select one test row and format it as a CSV string (the format required by the endpoint).
Send a predict() request to the deployed model endpoint using the formatted test data.
Compare the model’s prediction with the actual value.
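To benchmark the deployed XGBoost model against the local Random Forest on the same footing, a sketch that scores a sample of the test set through the endpoint and computes its MSE could look like this (it reuses X_test, y_test, and xgb_predictor from above; the 100-row sample size is arbitrary):

```python
from sklearn.metrics import mean_squared_error

# Send a sample of the test set through the endpoint, one CSV row at a time
sample_size = 100
endpoint_preds = []
for _, row in X_test.head(sample_size).iterrows():
    payload = ",".join(map(str, row.values.tolist()))
    response = xgb_predictor.predict(payload, initial_args={'ContentType': 'text/csv'})
    endpoint_preds.append(float(response))

# Compare with the Random Forest MSE computed earlier
xgb_mse = mean_squared_error(y_test.head(sample_size), endpoint_preds)
print(f"XGBoost (endpoint) MSE on {sample_size} test rows: {xgb_mse:.3f}")
```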
g) 👉 Clean up the resources
Always delete SageMaker endpoints when you’re done using them to avoid ongoing charges.
Deleting the endpoint de-provisions the hosting instance and stops billing.
xgb_predictor.delete_endpoint()
h) 👉 Check data in S3 Bucket
- Go to your S3 bucket, and inside the objects, you will find a folder named california-housing/. Click on it.
- Next, click on the output folder, then select the sagemaker-xgboost folder, and again click on the output folder inside it.
- There, you will see the model.tar.gz file.
This file contains the trained model artifacts produced by the XGBoost training job.
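If you prefer to inspect the artifact from the notebook instead of the console, a sketch that downloads the archive and lists its contents could look like this (it assumes the s3 client, bucket_name, and prefix variables from earlier are still defined; the exact key depends on your training job name, so we search the output prefix):

```python
import tarfile

# Find the model artifact under the training output prefix
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=f"{prefix}/output")
model_key = next(obj["Key"] for obj in response["Contents"] if obj["Key"].endswith("model.tar.gz"))

# Download the archive locally and list what is inside it
s3.download_file(bucket_name, model_key, "model.tar.gz")
with tarfile.open("model.tar.gz") as tar:
    print(tar.getnames())
```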
Conclusion
● We created an Amazon SageMaker notebook instance.
● We successfully opened a notebook with the conda_python3 kernel.
● We then prepared the dataset and uploaded it to an S3 bucket.
● We successfully trained a model with the built-in XGBoost algorithm.
● We successfully trained a Scikit-Learn Random Forest model.
● We successfully deployed the XGBoost model and used it for prediction.
Finally, we cleaned up the resources to avoid incurring extra cost.
Hope you found this post informative,
Thanks,
Sumbul.