We’re kicking off the new year by diving into Amazon SageMaker, AWS’s fully managed machine learning service. If you haven’t yet, I recommend reading my previous series of articles on AWS SageMaker first.
This guide will walk you through:
- Setting up Amazon SageMaker
- Launching and using Jupyter Notebook (or JupyterLab) within SageMaker
- Importing a Kaggle Dataset directly for analysis
- Recognizing the limitations of this approach (especially around storage)
All set? Let’s begin!
1. Setting Up Amazon SageMaker
Before we jump into Jupyter Notebooks, make sure you have:
- An AWS Account (with a valid payment method and correct permissions)
- IAM (Identity and Access Management) permissions to create and run SageMaker resources
- Optionally, an S3 bucket if you plan to store large datasets
1.1. Create or Log In to Your AWS Account
- Head to aws.amazon.com to either sign up or log in.
- Once in the AWS Management Console, pick a region (e.g., us-east-1, us-west-2, etc.).
- All SageMaker resources (notebook instances, training jobs, models) will be created in that region.
1.2. Set Up IAM Permissions
- Go to the IAM console and create (or select) a user with permissions to access SageMaker.
- Attach the AmazonSageMakerFullAccess policy or an equivalent custom policy granting SageMaker-related actions.
- If you intend to pull in data from S3, ensure the user or role also has the necessary S3 permissions (e.g., AmazonS3ReadOnlyAccess or appropriate custom access).
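If you want to sanity-check that your credentials and permissions are wired up, here is a minimal sketch using boto3 (the region is a placeholder; use the one you picked above). If the call succeeds, the attached policy grants at least basic SageMaker read access:
import boto3
# Assumes AWS credentials are already configured (via `aws configure` or an IAM role).
sm = boto3.client("sagemaker", region_name="us-east-1")
# List existing notebook instances as a quick permissions check.
response = sm.list_notebook_instances(MaxResults=10)
for nb in response.get("NotebookInstances", []):
    print(nb["NotebookInstanceName"], nb["NotebookInstanceStatus"])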
1.3. (Optional) Configure VPC Settings
Some organizations need their resources in private subnets. If required:
- Choose a VPC, Subnet, and Security Group that have the right outbound/internet access to retrieve data from, say, Kaggle or S3.
- If you’re new or just experimenting, you can skip this and use default settings.
2. Getting Started With Jupyter Notebook (or JupyterLab) on SageMaker
2.1. Create a Notebook Instance
- Navigate to SageMaker
  - In the AWS console, type “SageMaker” in the search bar and select Amazon SageMaker.
- Notebook Instances
  - From the left panel, click Notebook → Notebook instances.
- Create a Notebook Instance
  - Click the Create notebook instance button.
  - Notebook instance name: for instance, my-first-sagemaker-notebook.
  - Instance type: pick something like ml.t2.medium for small-scale experiments.
  - IAM role: either create a new role or select an existing one that grants SageMaker the necessary permissions.
  - (Optional) Increase the Volume size if you anticipate working with large datasets.
  - Click Create notebook instance.
- Wait Until “InService”
  - You’ll see the status go from Pending → InService (takes a few minutes).
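If you prefer scripting this step instead of clicking through the console, here’s a rough boto3 sketch (the role ARN below is a placeholder for the SageMaker execution role you created earlier):
import boto3
sm = boto3.client("sagemaker")
# Placeholder ARN -- replace with your own SageMaker execution role.
role_arn = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"
sm.create_notebook_instance(
    NotebookInstanceName="my-first-sagemaker-notebook",
    InstanceType="ml.t2.medium",
    RoleArn=role_arn,
    VolumeSizeInGB=20,  # increase if you expect large datasets
)
# Block until the instance reaches InService (takes a few minutes).
waiter = sm.get_waiter("notebook_instance_in_service")
waiter.wait(NotebookInstanceName="my-first-sagemaker-notebook")
print("Notebook instance is InService")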
2.2. Open Jupyter or JupyterLab
- Once the instance is up, click Open Jupyter or Open JupyterLab.
- JupyterLab is a more modern interface with tabs, a file explorer, etc.
- Jupyter (classic) presents a more traditional notebook listing.
- Inside either environment, you can create new notebooks, upload existing .ipynb files, and explore data.
2.3. Create a New Notebook
- In JupyterLab:
  - Click File → New → Notebook.
  - Select a Python kernel (often conda_python3 or something similar).
- In Classic Jupyter:
  - Click New → Notebook: Python 3.
Example code in a new cell:
import sys
print("Hello SageMaker!")
print(f"Python version: {sys.version}")
Run the cell and watch your output appear below.
2.4. Stop or Delete the Notebook Instance
- If you’re not using the notebook, stop it to avoid ongoing charges.
- If you’re done completely, delete it so you don’t get billed for storage.
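Both actions can also be done programmatically; a minimal boto3 sketch (using the instance name from earlier) looks like this:
import boto3
sm = boto3.client("sagemaker")
# Stop the instance to pause compute charges (the EBS volume and its contents persist).
sm.stop_notebook_instance(NotebookInstanceName="my-first-sagemaker-notebook")
# Once it has stopped, delete it entirely to stop storage charges as well:
# sm.delete_notebook_instance(NotebookInstanceName="my-first-sagemaker-notebook")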
3. Importing a Kaggle Dataset into SageMaker
Kaggle is a treasure trove of datasets—but most of them need an API token for automated downloads. Below is how to configure and use the Kaggle CLI.
3.1. Get Your Kaggle API Token
- Log into Kaggle.
- Click on your profile picture (top right) → Account.
- Scroll to the API section and click Create New API Token.
- A file named kaggle.json containing your credentials (KAGGLE_USERNAME and KAGGLE_KEY) will be downloaded.
3.2. Upload kaggle.json to SageMaker
- In JupyterLab, go to the file browser panel (left side).
- Drag-and-drop your kaggle.json file or click Upload.
3.3. Configure Kaggle Credentials
Open a terminal in JupyterLab (or use a notebook cell with ! commands) and run:
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
The chmod 600 step ensures that only your user can read and write the file; the Kaggle CLI warns if the credentials file is readable by other users.
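Alternatively, if you’d rather not keep kaggle.json on disk at all, the Kaggle tooling also reads credentials from environment variables. A small sketch (the values are placeholders for your own username and key):
import os
# Set these in the notebook before running any kaggle commands or imports;
# the values below are placeholders.
os.environ["KAGGLE_USERNAME"] = "your-kaggle-username"
os.environ["KAGGLE_KEY"] = "your-kaggle-api-key"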
3.4. Install the Kaggle CLI
To avoid compatibility issues, it helps to upgrade your core Python build tools first:
!pip install --upgrade pip setuptools packaging wheel
Then install kaggle:
!pip install kaggle
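A quick way to confirm the CLI is installed and can see your credentials is to print its version and run a small search:
!kaggle --version
!kaggle datasets list -s iris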
3.5. Download a Dataset from Kaggle
Check the dataset’s URL, for example:
https://www.kaggle.com/datasets/uciml/iris
The relevant part is uciml/iris. Then run:
!kaggle datasets download -d uciml/iris
A file named something like iris.zip will be downloaded to your current working directory.
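If you prefer to stay in Python rather than shelling out, the kaggle package exposes the same functionality. A rough sketch (with unzip=True it also extracts the files, so you could skip the manual unzip in the next step):
from kaggle.api.kaggle_api_extended import KaggleApi
# Reads ~/.kaggle/kaggle.json or the KAGGLE_* environment variables set earlier.
api = KaggleApi()
api.authenticate()
# Download the dataset and extract it into a local folder.
api.dataset_download_files("uciml/iris", path="iris_data", unzip=True)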
3.6. Unzip and Explore
Unzip the dataset:
!unzip iris.zip -d iris_data
Check the contents:
!ls iris_data
Load it in Python:
import pandas as pd
df = pd.read_csv('iris_data/Iris.csv')
df.head()
That’s it! You’re ready to do data exploration, feature engineering, or model training using SageMaker’s compute resources.
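For example, a few quick exploration calls on the Iris data (the column name Species assumes the layout of the uciml/iris CSV, so adjust if yours differs):
# Basic shape and summary statistics
print(df.shape)
print(df.describe())
# Class balance, assuming the label column is named "Species"
print(df["Species"].value_counts())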
4. Limitations and Considerations
While SageMaker is powerful, a few things might trip you up:
- Storage Constraints
  - Notebook instances use EBS volumes. If you download very large datasets, you might run out of space.
  - Tip: Increase the volume size when creating the notebook instance, or store most of your data in Amazon S3 (see the sketch after this list).
- Instance Costs
  - You’re charged for compute time while the notebook is running.
  - Best Practice: Stop or delete your notebook instance if you’re not actively using it.
- Dependency Conflicts
  - Installing many libraries (e.g., TensorFlow, PyTorch, scikit-learn) can lead to version clashes.
  - Tip: Create separate Conda environments or pin library versions carefully.
- Kaggle API Throttling
  - Downloading multiple large datasets in a short time can trigger Kaggle’s rate limits.
  - Some datasets also require you to accept specific licenses before downloading.
- Networking
  - If running in a private VPC subnet, you’ll need a NAT Gateway or VPC endpoint to allow outbound traffic to Kaggle.
- Credential Security
  - Keep kaggle.json and AWS IAM credentials private. Never commit them to a public Git repository.
- Scaling and Training
  - Notebook instances don’t automatically scale out for large training jobs. For heavy lifting, create a SageMaker Training Job or Processing Job to leverage bigger or multiple instances.
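As noted under Storage Constraints above, bulky files are usually better kept in Amazon S3 than on the notebook’s EBS volume. Here’s a minimal boto3 sketch (the bucket name is a placeholder you would replace with your own, and reading CSVs straight from S3 with pandas requires the s3fs package):
import boto3
import pandas as pd
s3 = boto3.client("s3")
# Placeholder bucket name -- create the bucket first in the S3 console.
bucket = "my-sagemaker-datasets-bucket"
# Upload the local CSV so it no longer has to live on the notebook's EBS volume.
s3.upload_file("iris_data/Iris.csv", bucket, "iris/Iris.csv")
# Later, read it back directly from S3 (requires s3fs to be installed).
df = pd.read_csv(f"s3://{bucket}/iris/Iris.csv")
print(df.head())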
5. Summary
Happy New Year again! We’ve covered:
- How to set up Amazon SageMaker with an AWS account and IAM permissions.
- Spinning up a Jupyter Notebook instance in SageMaker, from creation to code execution.
- Configuring and using the Kaggle CLI to download datasets for real-world ML experiments.
- Potential limitations such as storage, costs, and dependency conflicts—and how to handle them.
By following these steps and best practices, you’ll be well on your way to building, training, and deploying sophisticated machine learning models in the cloud. So, grab your dataset of choice, fire up your SageMaker notebook, and explore the limitless possibilities this new year has to offer in the realm of data science!
Additional Resources
- AWS SageMaker Documentation
- Kaggle API Docs
- AWS Pricing for SageMaker
- Amazon S3
- SageMaker + Conda Environments
Cheers to a fantastic year of innovation and learning!
Disclaimer: ChatGPT was used to make the flow of this article better.