simon nungwa
Training Your First ML Model on Amazon SageMaker Using S3 Data

Machine Learning has never been more accessible — and with tools like Amazon SageMaker, you can go from raw data to a trained model in just a few steps. In this post, I’ll walk you through how I used Amazon SageMaker to train an ML model with a dataset I uploaded to an S3 bucket. Whether you’re a student, researcher, or builder working on a cool AI project, this guide is for you.


📦 Prerequisites

Before we begin, make sure you have:

  • An AWS account
  • Amazon SageMaker and S3 access
  • AWS IAM role with necessary permissions
  • A dataset ready to upload (CSV, JSON, Parquet, etc.)

🪣 Step 1: Upload Your Dataset to S3

Go to the S3 Console:

  1. Create a new bucket or use an existing one.
  2. Upload your dataset file.
  3. Make note of the S3 URI, e.g., s3://my-ml-bucket/datasets/my-data.csv.
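Once you have the URI, it helps to keep the bucket and key in one place so upload, training, and inference code all agree. Here's a small helper — my own convenience function, not part of any AWS SDK — that splits an s3:// URI using only the standard library:

```python
from urllib.parse import urlparse

def parse_s3_uri(uri: str):
    """Split an s3:// URI into (bucket, key) so the same path can be
    reused consistently across upload, training, and inference code."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = parse_s3_uri("s3://my-ml-bucket/datasets/my-data.csv")
# bucket == "my-ml-bucket", key == "datasets/my-data.csv"
```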

📌 Permissions Note:

Ensure that your SageMaker execution role has access to the S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-ml-bucket",
        "arn:aws:s3:::my-ml-bucket/*"
      ]
    }
  ]
}

(s3:ListBucket applies to the bucket ARN itself; the object actions apply to the objects under it.)

🧠 Step 2: Set Up SageMaker Notebook Instance

  1. Go to the SageMaker Console.
  2. Create a Notebook Instance.
  3. Attach the IAM role with S3 access.
  4. Once the instance is running, open Jupyter Notebook.

🧪 Step 3: Load and Explore the Data

Load the dataset with pandas inside a Jupyter notebook:

import pandas as pd

# pandas can read s3:// paths directly when the s3fs package is installed
s3_path = 's3://my-ml-bucket/datasets/my-data.csv'
df = pd.read_csv(s3_path)
df.head()

🧰 Step 4: Preprocess and Prepare for Training

Split the data into features and target, then hold out a test set:

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% for evaluation; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
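If your dataset mixes numeric and categorical columns, a small preprocessing pipeline keeps the cleanup in one place. A hedged sketch on a tiny synthetic frame (the column names `age`, `city`, and `target` are made up for illustration — substitute your own):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame standing in for a real dataset
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0],
    "city": ["NYC", "LA", "NYC", "SF"],
    "target": [0, 1, 0, 1],
})

# Impute missing numeric values and standardize; one-hot encode categoricals
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X_processed = preprocess.fit_transform(df.drop("target", axis=1))
# 4 rows; 1 scaled numeric column + 3 one-hot city columns
```

Fitting the transformer on the training split only (then applying it to the test split) avoids leaking test-set statistics into training.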

🛠️ Step 5: Use SageMaker Built-in Algorithms (Optional)

SageMaker provides prebuilt algorithms like XGBoost:

import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

role = get_execution_role()
session = sagemaker.Session()
bucket = 'my-ml-bucket'

# Upload training data to S3.
# Built-in XGBoost expects the label in the FIRST column and no header row.
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv('train.csv', index=False, header=False)
session.upload_data('train.csv', bucket=bucket, key_prefix='train')

# Set up the estimator
xgboost_container = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.7-1"
)
xgb = Estimator(
    xgboost_container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/output',
    sagemaker_session=session
)

xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100
)

# Start training
xgb.fit({'train': TrainingInput(f's3://{bucket}/train/train.csv', content_type='text/csv')})
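The CSV layout matters here: SageMaker's built-in XGBoost expects training CSVs with the label in the first column and no header row. A tiny self-contained illustration of what the file should look like:

```python
import pandas as pd

# Synthetic stand-ins for the real features and labels
X = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0]})
y = pd.Series([0, 1], name="target")

# Label first, no header — the layout built-in XGBoost expects
train = pd.concat([y, X], axis=1)
csv_text = train.to_csv(index=False, header=False)
print(csv_text)
# 0,1.0,3.0
# 1,2.0,4.0
```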

✅ Step 6: Deploy and Test

from sagemaker.serializers import CSVSerializer

predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    serializer=CSVSerializer()  # send test rows as CSV, matching the training format
)

# The endpoint returns raw bytes of comma-separated scores
result = predictor.predict(X_test.to_numpy())
predictions = [float(p) for p in result.decode('utf-8').strip().split(',')]

🔒 Clean Up

To avoid unnecessary charges:

predictor.delete_endpoint()

🚀 Wrapping Up

Using Amazon SageMaker with an S3-hosted dataset is a powerful, scalable way to train ML models without worrying about infrastructure. With just a few lines of code, you’re able to upload data, preprocess it, train a model, and deploy it into production.


💬 Let's Connect!

If you're building something with SageMaker or just getting into ML/AI, drop a comment below or reach out on Twitter/X (x.com/SimonNungwa) — I'd love to connect and collaborate!

