Mursal Furqan Kumbhar
Building My First ML Model Using Amazon SageMaker + Kaggle + Jupyter Notebook

We’re kicking off this fresh calendar year by diving into Amazon SageMaker, a fully managed machine learning service from AWS. But first, make sure you have read my previous series of articles on AWS SageMaker.

This guide will walk you through:

  1. Setting up Amazon SageMaker
  2. Launching and using Jupyter Notebook (or JupyterLab) within SageMaker
  3. Importing a Kaggle Dataset directly for analysis
  4. Recognizing the limitations of this approach (especially around storage)

All set? Let’s begin!


1. Setting Up Amazon SageMaker

Before we jump into Jupyter Notebooks, make sure you have:

  • An AWS Account (with a valid payment method and correct permissions)
  • IAM (Identity and Access Management) permissions to create and run SageMaker resources
  • Optionally, an S3 bucket if you plan to store large datasets

1.1. Create or Log In to Your AWS Account

  1. Head to aws.amazon.com to either sign up or log in.
  2. Once in the AWS Management Console, pick a region (e.g., us-east-1 or us-west-2). All SageMaker resources (notebook instances, training jobs, models) will be created in that region.

1.2. Set Up IAM Permissions

  1. Go to the IAM console and create (or select) a user with permissions to access SageMaker.
  2. Attach the AmazonSageMakerFullAccess policy or an equivalent custom policy granting SageMaker-related actions.
  3. If you intend to pull in data from S3, ensure the user or role also has the necessary S3 permissions (e.g., AmazonS3ReadOnlyAccess or appropriate custom access).
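If you prefer to script this step, here is a minimal boto3 sketch; it assumes an existing IAM user (the name sagemaker-user below is just a placeholder) and a machine where your AWS credentials are already configured:

import boto3

iam = boto3.client("iam")

# Attach the managed SageMaker policy to an existing IAM user (placeholder name)
iam.attach_user_policy(
    UserName="sagemaker-user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)

# Optional: read-only S3 access for pulling datasets
iam.attach_user_policy(
    UserName="sagemaker-user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)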

1.3. (Optional) Configure VPC Settings

Some organizations need their resources in private subnets. If required:

  1. Choose a VPC, Subnet, and Security Group that have the right outbound/internet access to retrieve data from, say, Kaggle or S3.
  2. If you’re new or just experimenting, you can skip this and use default settings.

2. Getting Started With Jupyter Notebook (or JupyterLab) on SageMaker

2.1. Create a Notebook Instance

  1. Navigate to SageMaker
    • In the AWS console, type “SageMaker” in the search bar and select Amazon SageMaker.
  2. Notebook Instances
    • From the left panel, click Notebook → Notebook instances.
  3. Create a Notebook Instance
    • Click the Create notebook instance button.
    • Notebook instance name: for instance, my-first-sagemaker-notebook.
    • Instance type: pick something like ml.t2.medium for small-scale experiments.
    • IAM role: either create a new role or select an existing one that grants SageMaker the necessary permissions.
    • (Optional) Increase the Volume size if you anticipate working with large datasets.
    • Click Create notebook instance.
  4. Wait Until “InService”
    • You’ll see the status go from Pending → InService (this takes a few minutes).
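If you would rather create the instance programmatically, here is a minimal boto3 sketch; the role ARN is a placeholder you would replace with your own SageMaker execution role:

import boto3

sm = boto3.client("sagemaker")

# Create a small notebook instance (name and role ARN are placeholders)
sm.create_notebook_instance(
    NotebookInstanceName="my-first-sagemaker-notebook",
    InstanceType="ml.t2.medium",
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    VolumeSizeInGB=20,  # increase this if you expect large datasets
)

# Check the status; keep polling until it reads "InService"
status = sm.describe_notebook_instance(
    NotebookInstanceName="my-first-sagemaker-notebook"
)["NotebookInstanceStatus"]
print(status)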

2.2. Open Jupyter or JupyterLab

  1. Once the instance is up, click Open Jupyter or Open JupyterLab.
    • JupyterLab is a more modern interface with tabs, a file explorer, etc.
    • Jupyter (classic) presents a more traditional notebook listing.
  2. Inside either environment, you can create new notebooks, upload existing .ipynb files, and explore data.

2.3. Create a New Notebook

  • In JupyterLab:

    1. Click File → New → Notebook.
    2. Select a Python kernel (often conda_python3 or something similar).
  • In Classic Jupyter:

    1. Click New → Notebook: Python 3.

Example code in a new cell:

import sys

print("Hello SageMaker!")
print(f"Python version: {sys.version}")

Run the cell and watch your output appear below.

2.4. Stop or Delete the Notebook Instance

  • If you’re not using the notebook, stop it to avoid ongoing charges.
  • If you’re done completely, delete it so you don’t get billed for storage.
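Both actions can also be done from code. A short boto3 sketch (note that an instance must be fully stopped before it can be deleted):

import boto3

sm = boto3.client("sagemaker")

# Stop the instance to pause compute charges (EBS storage is still billed)
sm.stop_notebook_instance(NotebookInstanceName="my-first-sagemaker-notebook")

# Once it reports "Stopped", delete it to end storage charges as well
sm.delete_notebook_instance(NotebookInstanceName="my-first-sagemaker-notebook")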

3. Importing a Kaggle Dataset into SageMaker

Kaggle is a treasure trove of datasets—but most of them need an API token for automated downloads. Below is how to configure and use the Kaggle CLI.

3.1. Get Your Kaggle API Token

  1. Log into Kaggle.
  2. Click on your profile picture (top right) → Account (on newer Kaggle layouts this lives under Settings).
  3. Scroll to the API section and click Create New API Token.
  4. A file named kaggle.json containing your credentials (KAGGLE_USERNAME and KAGGLE_KEY) will be downloaded.

3.2. Upload kaggle.json to SageMaker

  1. In JupyterLab, go to the file browser panel (left side).
  2. Drag-and-drop your kaggle.json file or click Upload.

3.3. Configure Kaggle Credentials

Open a terminal in JupyterLab (or use a notebook cell with ! commands) and run:

mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

The chmod 600 step ensures that only your user can read or write the file; the Kaggle CLI warns if the credentials file is readable by other users.
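As an alternative to the kaggle.json file, the Kaggle client can also pick up credentials from environment variables, which can be convenient inside a notebook cell. A sketch with placeholder values:

import os

# Placeholders: use the values from your downloaded kaggle.json
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"
os.environ["KAGGLE_KEY"] = "your_kaggle_api_key"

Avoid hard-coding real credentials in notebooks you plan to share or commit.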

3.4. Install the Kaggle CLI

To avoid compatibility issues, it helps to upgrade your core Python build tools first:

!pip install --upgrade pip setuptools packaging wheel

Then install kaggle:

!pip install kaggle

3.5. Download a Dataset from Kaggle

Check the dataset’s URL, for example:

https://www.kaggle.com/datasets/uciml/iris

The relevant part is uciml/iris. Then run:

!kaggle datasets download -d uciml/iris

A file named something like iris.zip will be downloaded to your current working directory.
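If you prefer staying in Python, the kaggle package exposes roughly the same functionality. A sketch, assuming your credentials are already configured as above:

import kaggle  # authenticates from ~/.kaggle/kaggle.json (or env vars) on import

# Download and unzip the dataset into iris_data/
kaggle.api.dataset_download_files("uciml/iris", path="iris_data", unzip=True)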

3.6. Unzip and Explore

Unzip the dataset:

!unzip iris.zip -d iris_data

Check the contents:

!ls iris_data

Load it in Python:

import pandas as pd

df = pd.read_csv('iris_data/Iris.csv')
df.head()

That’s it! You’re ready to do data exploration, feature engineering, or model training using SageMaker’s compute resources.
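To make good on the “first ML model” promise, here is a minimal scikit-learn sketch trained on the Iris data. The column names below are what the uciml/iris CSV typically contains; adjust them if your copy differs:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("iris_data/Iris.csv")

# Assumed column layout: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species
X = df.drop(columns=["Id", "Species"])
y = df["Species"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))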


4. Limitations and Considerations

While SageMaker is powerful, a few things might trip you up:

  1. Storage Constraints

    • Notebook instances use EBS volumes. If you download very large datasets, you might run out of space.
    • Tip: Increase the volume size when creating the notebook instance or store most of your data in Amazon S3 (see the upload sketch after this list).
  2. Instance Costs

    • You’re charged for the compute time while the notebook is running.
    • Best Practice: Stop or delete your notebook instance if you’re not actively using it.
  3. Dependency Conflicts

    • Installing many libraries (e.g., TensorFlow, PyTorch, scikit-learn) can lead to version clashes.
    • Tip: Create separate Conda environments or specify library versions carefully.
  4. Kaggle API Throttling

    • Downloading multiple large datasets in a short time can trigger Kaggle’s rate limits.
    • Some datasets require you to accept specific licenses.
  5. Networking

    • If running in a private VPC subnet, you’ll need a NAT Gateway or VPC endpoint to allow outbound traffic to Kaggle.
  6. Credential Security

    • Keep kaggle.json and AWS IAM credentials private. Never commit them to a public Git repository.
  7. Scaling and Training

    • Notebook instances don’t automatically scale out for large training jobs. For heavy-lifting, create a SageMaker Training Job or Processing Job to leverage bigger or multiple instances.
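Regarding the storage tip in item 1, here is a minimal boto3 sketch for pushing a local file to S3; the bucket name is a placeholder, and the notebook’s IAM role needs s3:PutObject on it:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key: upload the unzipped CSV so it outlives the notebook's EBS volume
s3.upload_file("iris_data/Iris.csv", "my-sagemaker-datasets", "iris/Iris.csv")

# Later, pull it back down with s3.download_file("my-sagemaker-datasets", "iris/Iris.csv", "Iris.csv")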

5. Summary

Happy New Year again! We’ve covered:

  1. How to set up Amazon SageMaker with an AWS account and IAM permissions.
  2. Spinning up a Jupyter Notebook instance in SageMaker, from creation to code execution.
  3. Configuring and using the Kaggle CLI to download datasets for real-world ML experiments.
  4. Potential limitations such as storage, costs, and dependency conflicts—and how to handle them.

By following these steps and best practices, you’ll be well on your way to building, training, and deploying sophisticated machine learning models in the cloud. So, grab your dataset of choice, fire up your SageMaker notebook, and explore the limitless possibilities this new year has to offer in the realm of data science!



Cheers to a fantastic year of innovation and learning!

Disclaimer: ChatGPT was used to make the flow of this article better.
