simon nungwa
Training Your First ML Model on Amazon SageMaker Using S3 Data

Machine Learning has never been more accessible — and with tools like Amazon SageMaker, you can go from raw data to a trained model in just a few steps. In this post, I’ll walk you through how I used Amazon SageMaker to train an ML model with a dataset I uploaded to an S3 bucket. Whether you’re a student, researcher, or builder working on a cool AI project, this guide is for you.


📦 Prerequisites

Before we begin, make sure you have:

  • An AWS account
  • Amazon SageMaker and S3 access
  • AWS IAM role with necessary permissions
  • A dataset ready to upload (CSV, JSON, Parquet, etc.)

🪣 Step 1: Upload Your Dataset to S3

Go to the S3 Console:

  1. Create a new bucket or use an existing one.
  2. Upload your dataset file.
  3. Make note of the S3 URI, e.g., s3://my-ml-bucket/datasets/my-data.csv.
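Once you have the URI, it helps to keep the bucket and key in one place so upload, training, and inference code all agree. Here's a small helper — my own convenience function, not part of any AWS SDK — that splits an s3:// URI using only the standard library:

```python
from urllib.parse import urlparse

def parse_s3_uri(uri: str):
    """Split an s3:// URI into (bucket, key) so the same path can be
    reused consistently across upload, training, and inference code."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = parse_s3_uri("s3://my-ml-bucket/datasets/my-data.csv")
# bucket == "my-ml-bucket", key == "datasets/my-data.csv"
```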

📌 Permissions Note:

Ensure that your SageMaker execution role has access to the S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-ml-bucket",
        "arn:aws:s3:::my-ml-bucket/*"
      ]
    }
  ]
}

(s3:ListBucket applies to the bucket ARN itself; the object actions apply to the objects under it.)

🧠 Step 2: Set Up SageMaker Notebook Instance

  1. Go to the SageMaker Console.
  2. Create a Notebook Instance.
  3. Attach the IAM role with S3 access.
  4. Once the instance is running, open Jupyter Notebook.

🧪 Step 3: Load and Explore the Data

Load the dataset with pandas inside a Jupyter notebook:

import pandas as pd

# pandas can read s3:// paths directly when the s3fs package is installed
s3_path = 's3://my-ml-bucket/datasets/my-data.csv'
df = pd.read_csv(s3_path)
df.head()

🧰 Step 4: Preprocess and Prepare for Training

Split the data into features and target, then hold out a test set:

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% for evaluation; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
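If your dataset mixes numeric and categorical columns, a small preprocessing pipeline keeps the cleanup in one place. A hedged sketch on a tiny synthetic frame (the column names `age`, `city`, and `target` are made up for illustration — substitute your own):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame standing in for a real dataset
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0],
    "city": ["NYC", "LA", "NYC", "SF"],
    "target": [0, 1, 0, 1],
})

# Impute missing numeric values and standardize; one-hot encode categoricals
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X_processed = preprocess.fit_transform(df.drop("target", axis=1))
# 4 rows; 1 scaled numeric column + 3 one-hot city columns
```

Fitting the transformer on the training split only (then applying it to the test split) avoids leaking test-set statistics into training.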

🛠️ Step 5: Use SageMaker Built-in Algorithms (Optional)

SageMaker provides prebuilt algorithms like XGBoost:

import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

role = get_execution_role()
session = sagemaker.Session()
bucket = 'my-ml-bucket'

# Upload training data to S3.
# Built-in XGBoost expects the label in the FIRST column and no header row.
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv('train.csv', index=False, header=False)
session.upload_data('train.csv', bucket=bucket, key_prefix='train')

# Set up the estimator
xgboost_container = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.7-1"
)
xgb = Estimator(
    xgboost_container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/output',
    sagemaker_session=session
)

xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100
)

# Start training
xgb.fit({'train': TrainingInput(f's3://{bucket}/train/train.csv', content_type='text/csv')})
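The CSV layout matters here: SageMaker's built-in XGBoost expects training CSVs with the label in the first column and no header row. A tiny self-contained illustration of what the file should look like:

```python
import pandas as pd

# Synthetic stand-ins for the real features and labels
X = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0]})
y = pd.Series([0, 1], name="target")

# Label first, no header — the layout built-in XGBoost expects
train = pd.concat([y, X], axis=1)
csv_text = train.to_csv(index=False, header=False)
print(csv_text)
# 0,1.0,3.0
# 1,2.0,4.0
```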

✅ Step 6: Deploy and Test

from sagemaker.serializers import CSVSerializer

predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=CSVSerializer()  # send test rows as CSV, matching the training format
)

# The endpoint returns raw bytes of comma-separated scores
result = predictor.predict(X_test.to_numpy())
predictions = [float(p) for p in result.decode('utf-8').strip().split(',')]

🔒 Clean Up

To avoid unnecessary charges:

predictor.delete_endpoint()

🚀 Wrapping Up

Using Amazon SageMaker with an S3-hosted dataset is a powerful, scalable way to train ML models without worrying about infrastructure. With just a few lines of code, you’re able to upload data, preprocess it, train a model, and deploy it into production.


💬 Let's Connect!

If you're building something with SageMaker or just getting into ML/AI, drop a comment below or reach out on Twitter/X (x.com/SimonNungwa) — I'd love to connect and collaborate!

