We’re kicking off the new year by diving into Amazon SageMaker, AWS’s fully managed machine learning service. If you haven’t yet, I recommend reading my previous series of articles on AWS SageMaker first.
This guide will walk you through:
- Setting up Amazon SageMaker
- Launching and using Jupyter Notebook (or JupyterLab) within SageMaker
- Importing a Kaggle Dataset directly for analysis
- Recognizing the limitations of this approach (especially around storage)
All set? Let’s begin!
1. Setting Up Amazon SageMaker
Before we jump into Jupyter Notebooks, make sure you have:
- An AWS Account (with a valid payment method and correct permissions)
- IAM (Identity and Access Management) permissions to create and run SageMaker resources
- Optionally, an S3 bucket if you plan to store large datasets
1.1. Create or Log In to Your AWS Account
- Head to aws.amazon.com to either sign up or log in.
- Once in the AWS Management Console, pick a region (e.g., us-east-1, us-west-2, etc.).
- All SageMaker resources (notebook instances, training jobs, models) will be created in that region.
1.2. Set Up IAM Permissions
- Go to the IAM console and create (or select) a user with permissions to access SageMaker.
- Attach the AmazonSageMakerFullAccess policy or an equivalent custom policy granting SageMaker-related actions.
- If you intend to pull in data from S3, ensure the user or role also has the necessary S3 permissions (e.g., AmazonS3ReadOnlyAccess or appropriate custom access).
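If you want to sanity-check that your credentials and permissions are wired up, here is a minimal sketch using boto3 (the region is a placeholder; use the one you picked above). If the call succeeds, the attached policy grants at least basic SageMaker read access:
import boto3
# Assumes AWS credentials are already configured (via `aws configure` or an IAM role).
sm = boto3.client("sagemaker", region_name="us-east-1")
# List existing notebook instances as a quick permissions check.
response = sm.list_notebook_instances(MaxResults=10)
for nb in response.get("NotebookInstances", []):
    print(nb["NotebookInstanceName"], nb["NotebookInstanceStatus"])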
1.3. (Optional) Configure VPC Settings
Some organizations need their resources in private subnets. If required:
- Choose a VPC, Subnet, and Security Group that have the right outbound/internet access to retrieve data from, say, Kaggle or S3.
- If you’re new or just experimenting, you can skip this and use default settings.
2. Getting Started With Jupyter Notebook (or JupyterLab) on SageMaker
2.1. Create a Notebook Instance
- Navigate to SageMaker
  - In the AWS console, type “SageMaker” in the search bar and select Amazon SageMaker.
- Notebook Instances
  - From the left panel, click Notebook → Notebook instances.
- Create a Notebook Instance
  - Click the Create notebook instance button.
  - Notebook instance name: for instance, my-first-sagemaker-notebook.
  - Instance type: pick something like ml.t2.medium for small-scale experiments.
  - IAM role: either create a new role or select an existing one that grants SageMaker the necessary permissions.
  - (Optional) Increase the Volume size if you anticipate working with large datasets.
  - Click Create notebook instance.
- Wait Until “InService”
  - You’ll see the status go from Pending → InService (takes a few minutes).
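If you prefer scripting this step instead of clicking through the console, here’s a rough boto3 sketch (the role ARN below is a placeholder for the SageMaker execution role you created earlier):
import boto3
sm = boto3.client("sagemaker")
# Placeholder ARN -- replace with your own SageMaker execution role.
role_arn = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"
sm.create_notebook_instance(
    NotebookInstanceName="my-first-sagemaker-notebook",
    InstanceType="ml.t2.medium",
    RoleArn=role_arn,
    VolumeSizeInGB=20,  # increase if you expect large datasets
)
# Block until the instance reaches InService (takes a few minutes).
waiter = sm.get_waiter("notebook_instance_in_service")
waiter.wait(NotebookInstanceName="my-first-sagemaker-notebook")
print("Notebook instance is InService")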
2.2. Open Jupyter or JupyterLab
- Once the instance is up, click Open Jupyter or Open JupyterLab.
- JupyterLab is a more modern interface with tabs, a file explorer, etc.
- Jupyter (classic) presents a more traditional notebook listing.
- Inside either environment, you can create new notebooks, upload existing .ipynb files, and explore data.
2.3. Create a New Notebook
- In JupyterLab:
  - Click File → New → Notebook.
  - Select a Python kernel (often conda_python3 or something similar).
- In Classic Jupyter:
  - Click New → Notebook: Python 3.
Example code in a new cell:
import sys
print("Hello SageMaker!")
print(f"Python version: {sys.version}")
Run the cell and watch your output appear below.
2.4. Stop or Delete the Notebook Instance
- If you’re not using the notebook, stop it to avoid ongoing charges.
- If you’re done completely, delete it so you don’t get billed for storage.
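Both actions can also be done programmatically; a minimal boto3 sketch (using the instance name from earlier) looks like this:
import boto3
sm = boto3.client("sagemaker")
# Stop the instance to pause compute charges (the EBS volume and its contents persist).
sm.stop_notebook_instance(NotebookInstanceName="my-first-sagemaker-notebook")
# Once it has stopped, delete it entirely to stop storage charges as well:
# sm.delete_notebook_instance(NotebookInstanceName="my-first-sagemaker-notebook")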
3. Importing a Kaggle Dataset into SageMaker
Kaggle is a treasure trove of datasets—but most of them need an API token for automated downloads. Below is how to configure and use the Kaggle CLI.
3.1. Get Your Kaggle API Token
- Log into Kaggle.
- Click on your profile picture (top right) → Account.
- Scroll to the API section and click Create New API Token.
- A file named kaggle.json containing your credentials (KAGGLE_USERNAME and KAGGLE_KEY) will be downloaded.
3.2. Upload kaggle.json to SageMaker
- In JupyterLab, go to the file browser panel (left side).
- Drag-and-drop your kaggle.json file or click Upload.
3.3. Configure Kaggle Credentials
Open a terminal in JupyterLab (or use a notebook cell with ! commands) and run:
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
The chmod 600 step ensures that only your user can read and write the file; the Kaggle CLI warns if the credentials file is readable by other users.
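Alternatively, if you’d rather not keep kaggle.json on disk at all, the Kaggle tooling also reads credentials from environment variables. A small sketch (the values are placeholders for your own username and key):
import os
# Set these in the notebook before running any kaggle commands or imports;
# the values below are placeholders.
os.environ["KAGGLE_USERNAME"] = "your-kaggle-username"
os.environ["KAGGLE_KEY"] = "your-kaggle-api-key"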
3.4. Install the Kaggle CLI
To avoid compatibility issues, it helps to upgrade your core Python build tools first:
!pip install --upgrade pip setuptools packaging wheel
Then install kaggle:
!pip install kaggle
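A quick way to confirm the CLI is installed and can see your credentials is to print its version and run a small search:
!kaggle --version
!kaggle datasets list -s iris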
3.5. Download a Dataset from Kaggle
Check the dataset’s URL, for example:
https://www.kaggle.com/datasets/uciml/iris
The relevant part is uciml/iris. Then run:
!kaggle datasets download -d uciml/iris
A file named something like iris.zip will be downloaded to your current working directory.
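If you prefer to stay in Python rather than shelling out, the kaggle package exposes the same functionality. A rough sketch (with unzip=True it also extracts the files, so you could skip the manual unzip in the next step):
from kaggle.api.kaggle_api_extended import KaggleApi
# Reads ~/.kaggle/kaggle.json or the KAGGLE_* environment variables set earlier.
api = KaggleApi()
api.authenticate()
# Download the dataset and extract it into a local folder.
api.dataset_download_files("uciml/iris", path="iris_data", unzip=True)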
3.6. Unzip and Explore
Unzip the dataset:
!unzip iris.zip -d iris_data
Check the contents:
!ls iris_data
Load it in Python:
import pandas as pd
df = pd.read_csv('iris_data/Iris.csv')
df.head()
That’s it! You’re ready to do data exploration, feature engineering, or model training using SageMaker’s compute resources.
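For example, a few quick exploration calls on the Iris data (the column name Species assumes the layout of the uciml/iris CSV, so adjust if yours differs):
# Basic shape and summary statistics
print(df.shape)
print(df.describe())
# Class balance, assuming the label column is named "Species"
print(df["Species"].value_counts())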
4. Limitations and Considerations
While SageMaker is powerful, a few things might trip you up:
- Storage Constraints
  - Notebook instances use EBS volumes. If you download very large datasets, you might run out of space.
  - Tip: Increase the volume size when creating the notebook instance, or store most of your data in Amazon S3 (see the sketch after this list).
- Instance Costs
  - You’re charged for compute time while the notebook is running.
  - Best Practice: Stop or delete your notebook instance if you’re not actively using it.
- Dependency Conflicts
  - Installing many libraries (e.g., TensorFlow, PyTorch, scikit-learn) can lead to version clashes.
  - Tip: Create separate Conda environments or pin library versions carefully.
- Kaggle API Throttling
  - Downloading multiple large datasets in a short time can trigger Kaggle’s rate limits.
  - Some datasets also require you to accept specific licenses before downloading.
- Networking
  - If running in a private VPC subnet, you’ll need a NAT Gateway or VPC endpoint to allow outbound traffic to Kaggle.
- Credential Security
  - Keep kaggle.json and AWS IAM credentials private. Never commit them to a public Git repository.
- Scaling and Training
  - Notebook instances don’t automatically scale out for large training jobs. For heavy lifting, create a SageMaker Training Job or Processing Job to leverage bigger or multiple instances.
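As noted under Storage Constraints above, bulky files are usually better kept in Amazon S3 than on the notebook’s EBS volume. Here’s a minimal boto3 sketch (the bucket name is a placeholder you would replace with your own, and reading CSVs straight from S3 with pandas requires the s3fs package):
import boto3
import pandas as pd
s3 = boto3.client("s3")
# Placeholder bucket name -- create the bucket first in the S3 console.
bucket = "my-sagemaker-datasets-bucket"
# Upload the local CSV so it no longer has to live on the notebook's EBS volume.
s3.upload_file("iris_data/Iris.csv", bucket, "iris/Iris.csv")
# Later, read it back directly from S3 (requires s3fs to be installed).
df = pd.read_csv(f"s3://{bucket}/iris/Iris.csv")
print(df.head())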
5. Summary
Happy New Year again! We’ve covered:
- How to set up Amazon SageMaker with an AWS account and IAM permissions.
- Spinning up a Jupyter Notebook instance in SageMaker, from creation to code execution.
- Configuring and using the Kaggle CLI to download datasets for real-world ML experiments.
- Potential limitations such as storage, costs, and dependency conflicts—and how to handle them.
By following these steps and best practices, you’ll be well on your way to building, training, and deploying sophisticated machine learning models in the cloud. So, grab your dataset of choice, fire up your SageMaker notebook, and explore the limitless possibilities this new year has to offer in the realm of data science!
Additional Resources
- AWS SageMaker Documentation
- Kaggle API Docs
- AWS Pricing for SageMaker
- Amazon S3
- SageMaker + Conda Environments
Cheers to a fantastic year of innovation and learning!
Disclaimer: ChatGPT was used to make the flow of this article better.