Machine Learning has never been more accessible — and with tools like Amazon SageMaker, you can go from raw data to a trained model in just a few steps. In this post, I’ll walk you through how I used Amazon SageMaker to train an ML model with a dataset I uploaded to an S3 bucket. Whether you’re a student, researcher, or builder working on a cool AI project, this guide is for you.
📦 Prerequisites
Before we begin, make sure you have:
- An AWS account
- Amazon SageMaker and S3 access
- AWS IAM role with necessary permissions
- A dataset ready to upload (CSV, JSON, Parquet, etc.)
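If you want to double-check that your credentials are set up before you start, here's a quick (optional) boto3 sanity check:

```python
import boto3

# Prints the AWS account and IAM identity your credentials resolve to
print(boto3.client('sts').get_caller_identity())
```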
🪣 Step 1: Upload Your Dataset to S3
Go to the S3 Console:
- Create a new bucket or use an existing one.
- Upload your dataset file.
- Make note of the S3 URI, e.g., `s3://my-ml-bucket/datasets/my-data.csv`.
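If you'd rather script the upload, here's a minimal boto3 sketch, assuming the bucket and key shown above:

```python
import boto3

s3 = boto3.client('s3')
# Upload the local file to the bucket under the datasets/ prefix
s3.upload_file('my-data.csv', 'my-ml-bucket', 'datasets/my-data.csv')
```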
📌 Permissions Note:
Ensure that your SageMaker execution role has access to the S3 bucket:
```json
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": "arn:aws:s3:::my-ml-bucket/*"
}
```
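Note that `s3:GetObject` on `my-ml-bucket/*` covers objects only. If your code also lists the bucket (for example, when pointing a job at an S3 prefix), you'll likely need an additional statement on the bucket itself:

```json
{
  "Effect": "Allow",
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::my-ml-bucket"
}
```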
🧠 Step 2: Set Up SageMaker Notebook Instance
- Go to the SageMaker Console.
- Create a Notebook Instance.
- Attach the IAM role with S3 access.
- Once the instance is running, open Jupyter Notebook.
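Once Jupyter is open, a one-liner confirms the instance picked up the right role:

```python
from sagemaker import get_execution_role

# Should print the ARN of the IAM role attached to this instance
print(get_execution_role())
```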
🧪 Step 3: Load and Explore the Data
Read the dataset with pandas inside a Jupyter notebook:

```python
import pandas as pd

# pandas reads s3:// paths directly via the s3fs package
# (pip install s3fs if it's not already available)
s3_path = 's3://my-ml-bucket/datasets/my-data.csv'
df = pd.read_csv(s3_path)
df.head()
```
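It's worth a couple of quick sanity checks before modeling. For example:

```python
# Column types, non-null counts, and summary statistics
df.info()
df.describe()

# Class balance of the label (assuming a column named 'target')
df['target'].value_counts(normalize=True)
```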
🧰 Step 4: Preprocess and Prepare for Training
Prepare your data as needed:
```python
from sklearn.model_selection import train_test_split

# Assumes the label column is named 'target'; adjust for your dataset
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
```
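What "as needed" means depends on the dataset. If you have missing values or categorical features, you'd handle them before the split above. A minimal, purely illustrative sketch:

```python
import pandas as pd

# Fill numeric gaps with column medians (one of several reasonable strategies)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# One-hot encode categorical columns so XGBoost can consume them
df = pd.get_dummies(df, drop_first=True)
```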
🛠️ Step 5: Use SageMaker Built-in Algorithms (Optional)
SageMaker provides built-in algorithms such as XGBoost:

```python
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

role = get_execution_role()
session = sagemaker.Session()
bucket = 'my-ml-bucket'

# Upload training data to S3. The built-in XGBoost algorithm expects
# CSV input with the label in the FIRST column and no header row.
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv('train.csv', index=False, header=False)
train_s3_uri = session.upload_data('train.csv', bucket=bucket, key_prefix='train')

# Set up the estimator (image_uris.retrieve requires an algorithm version)
xgboost_container = sagemaker.image_uris.retrieve(
    "xgboost", session.boto_region_name, version="1.5-1"
)
xgb = Estimator(
    xgboost_container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=f's3://{bucket}/output',
    sagemaker_session=session
)
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100
)

# Start training
xgb.fit({'train': TrainingInput(train_s3_uri, content_type='text/csv')})
```
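Optionally, you can also pass a validation channel so XGBoost reports evaluation metrics during training. A sketch, assuming you prepare the held-out split the same way as `train.csv`:

```python
# Same format as the training file: label first, no header
val_data = pd.concat([y_test, X_test], axis=1)
val_data.to_csv('validation.csv', index=False, header=False)
val_s3_uri = session.upload_data('validation.csv', bucket=bucket, key_prefix='validation')

xgb.fit({
    'train': TrainingInput(train_s3_uri, content_type='text/csv'),
    'validation': TrainingInput(val_s3_uri, content_type='text/csv'),
})
```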
✅ Step 6: Deploy and Test
```python
from sagemaker.serializers import CSVSerializer

# CSVSerializer sends rows in the format the XGBoost container expects
predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m5.large',
                       serializer=CSVSerializer())
# Make predictions
result = predictor.predict(X_test.to_numpy())
```
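With the default deserializer, `result` comes back as raw bytes: a comma-separated string of scores, one probability per row for `binary:logistic`. Parsing it might look like:

```python
# Decode the byte response into per-row probabilities, then threshold
scores = [float(s) for s in result.decode('utf-8').strip().split(',')]
preds = [1 if s > 0.5 else 0 for s in scores]
```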
🔒 Clean Up
To avoid unnecessary charges, delete the endpoint and model when you're done, and remember to stop or delete the notebook instance as well:

```python
predictor.delete_endpoint()
predictor.delete_model()
```
🚀 Wrapping Up
Using Amazon SageMaker with an S3-hosted dataset is a powerful, scalable way to train ML models without managing infrastructure yourself. With just a few lines of code, you can upload data, preprocess it, train a model, and deploy it behind a live endpoint.
💬 Let's Connect!
If you're building something with SageMaker or just getting into ML/AI, drop a comment below or reach out on X ([@SimonNungwa](https://x.com/SimonNungwa)). I'd love to connect and collaborate!