SageMaker Studio Administration Best Practices part-1

#aws #sagemaker #tutorial #cloud

In this series of articles, we discuss and explain the best practices for SageMaker Studio administration, drawn from the AWS whitepapers and documentation.

Introduction
When using SageMaker Studio as your ML platform, it's important to follow best practices for scaling and organization. Consider the following:

Choose an operating model and organize ML environments to meet your business goals.
Set up domain authentication for user identities and be aware of limitations.
Federate user identity and authorization for fine-grained access control and auditing.
Set up permissions and guardrails for various ML roles.
Plan your VPC network topology based on workload sensitivity, user numbers, and launched instances and jobs.
Monitor your platform's performance and resource usage to ensure that it meets your needs and identify any potential bottlenecks.
Regularly review and update your security and compliance controls to ensure that they are up-to-date and meet your organization's requirements.
Automate your ML pipeline as much as possible to reduce manual errors and improve the efficiency of your ML development and deployment process.
Continuously evaluate and update your ML models to ensure that they are performing well and that the data used for training and validation is up-to-date.
Collaborate with your team to share best practices, knowledge and expertise to improve the performance of your ML platform.

Recommended account structure
When setting up an operating model for your SageMaker Studio platform, it's important to follow best practices for organization and management. Here are a few recommendations:

Use AWS Control Tower for account setup, management and governance
Centralize identities with an Identity Provider and AWS IAM Identity Center and enable secure access to workloads
Isolate ML workloads across development, test, and production accounts
Stream logs to a log archive account for analysis and filtering
Use a centralized governance account for data access provisioning and auditing
Embed security and governance services in each account for security and compliance.

Model account structures for data science teams

1. Centralized: all data science activities are managed by one team or organization.

In this model, the ML platform team will be responsible for:

Providing shared services and tools for MLOps across data science teams
Managing shared accounts for ML workload development, testing, and production
Implementing governance policies for workload isolation
Ensuring adherence to common best practices for the platform.

2. Decentralized: data science activities are spread across different business functions or divisions.

In this model, each ML team is responsible for their own ML accounts and resources. However, it is recommended to use a centralized approach for monitoring and managing data governance for ease of audit management.

3. Federated: shared services are managed by a centralized team, while business units or product teams are managed by decentralized teams. This model is similar to a hub and spoke model, where each business unit has its own team, but they coordinate with the central team.

This model is similar to the centralized model, but with the added benefit of each data science/ML team having their own set of accounts for development, testing, and production. This allows for better isolation of resources and independent scaling for each team without affecting others.

ML platform multitenancy
Multitenancy is a way to manage multiple user groups within a single software instance. Each group, called a tenant, has its own set of privileges and access to the software.
In Machine Learning (ML) platforms like SageMaker Studio, multitenancy allows multiple teams to work within the same platform, but with separate access and resources.
It is possible to host multiple teams within one SageMaker Studio domain, but it's important to consider factors such as cost, security, and account limits.
A best practice is to have each team work within its own Studio domain, using separate accounts. This can be done with the help of tools like AWS Service Catalog, which allows self-service deployment of Studio resources in multiple accounts and regions.
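As a rough sketch of the one-domain-per-team idea, the snippet below builds one AWS Service Catalog `provision_product` request per team. The product ID, artifact ID, team names, and the `DomainName` provisioning parameter are all hypothetical; the real parameter keys depend on the template behind your own Service Catalog product.

```python
def build_provision_requests(product_id, artifact_id, teams):
    """Build one Service Catalog provision_product request per team.

    The "DomainName" parameter key is an assumption for illustration;
    use whatever parameters your Studio product template defines.
    """
    requests = []
    for team in teams:
        requests.append({
            "ProductId": product_id,
            "ProvisioningArtifactId": artifact_id,
            "ProvisionedProductName": f"studio-domain-{team}",
            "ProvisioningParameters": [
                {"Key": "DomainName", "Value": f"{team}-studio"},
            ],
        })
    return requests


# Hypothetical identifiers, for illustration only.
requests = build_provision_requests(
    product_id="prod-examplestudio",
    artifact_id="pa-examplev1",
    teams=["fraud", "forecasting"],
)
for request in requests:
    print(request["ProvisionedProductName"])

# With credentials for each team's account, a request could be sent with:
#   import boto3
#   boto3.client("servicecatalog").provision_product(**request)
```

Because each team lives in its own account, the call would normally be made with credentials for that account (or through a shared portfolio), not once from a single central account.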

Domain management

- Set up your domain for Identity and Access Management (IAM) federation.
Before you can use IAM federation for your Studio domain, you need to create an IAM federation user role (such as a platform administrator) in your IdP (Identity Provider). You can find more information on how to do this in the Identity Management section. To set up SageMaker Studio with IAM, refer to the guide "Onboard to Amazon SageMaker Domain Using IAM" for detailed instructions.

- Set up your domain for single sign-on (SSO) federation.
To use Single Sign-On (SSO) with SageMaker Studio, you must first enable AWS SSO (now AWS IAM Identity Center) in your AWS Organizations management account, in the same region where you plan to run SageMaker Studio. The process for setting up the domain is the same as for IAM federation, except that you select SSO in the authentication section.
For more information on how to set it up, refer to the guide "Onboard to Amazon SageMaker Domain Using IAM Identity Center".
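The two onboarding paths above differ mainly in the authentication mode passed when the domain is created. The sketch below builds the keyword arguments for the SageMaker `create_domain` API for either mode; the domain name, VPC, subnet, and role ARN are hypothetical placeholders, and the helper name is ours, not part of any AWS library.

```python
def build_create_domain_request(domain_name, auth_mode, vpc_id, subnet_ids,
                                execution_role_arn):
    """Build keyword arguments for the SageMaker create_domain API."""
    if auth_mode not in ("IAM", "SSO"):
        raise ValueError("auth_mode must be 'IAM' or 'SSO'")
    if not subnet_ids:
        raise ValueError("at least one subnet ID is required")
    return {
        "DomainName": domain_name,
        # The authentication mode is fixed once the domain exists.
        "AuthMode": auth_mode,
        "DefaultUserSettings": {"ExecutionRole": execution_role_arn},
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
    }


# Hypothetical values, for illustration only.
request = build_create_domain_request(
    domain_name="ml-team-a",
    auth_mode="SSO",
    vpc_id="vpc-0123456789abcdef0",
    subnet_ids=["subnet-0123456789abcdef0"],
    execution_role_arn="arn:aws:iam::111122223333:role/StudioExecutionRole",
)
print(request["AuthMode"])

# With credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("sagemaker").create_domain(**request)
```

Validating the mode up front is a deliberate choice here, since a domain created with the wrong mode has to be deleted and recreated.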

- Studio user profile.
A user profile is an entity that represents an individual user within a SageMaker Studio domain. It is created when a user is onboarded to SageMaker Studio and is used for sharing, reporting, and other user-related features. When an administrator invites a person by email or imports them from SSO, a user profile is created automatically. The user profile is the main way to reference a user, and it holds that user's settings. It's recommended to create one user profile for each physical user of the application. Each user profile has its own private home directory on Amazon EFS; there is no shared directory between users.
Each user profile also has its own dedicated compute resources, such as EC2 instances, to run notebooks. The resources allocated to one user are completely isolated from those allocated to another user, and resources allocated to users in one account are separate from those allocated to users in another account. Each user can run up to four applications within isolated Docker containers or images on the same instance type.
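A minimal sketch of creating one profile per physical user with the SageMaker `create_user_profile` API. It assumes a client object is passed in (for example `boto3.client("sagemaker")`); the domain ID and user names in the usage comment are hypothetical.

```python
def create_user_profiles(sagemaker_client, domain_id, profiles):
    """Create one Studio user profile per entry in `profiles`.

    `profiles` maps a profile name to an SSO user name, or to None
    when the domain uses IAM authentication.
    Returns a dict of profile name -> user profile ARN.
    """
    arns = {}
    for profile_name, sso_user_name in profiles.items():
        kwargs = {"DomainId": domain_id, "UserProfileName": profile_name}
        if sso_user_name is not None:
            # Link the profile to an existing SSO user.
            kwargs["SingleSignOnUserIdentifier"] = "UserName"
            kwargs["SingleSignOnUserValue"] = sso_user_name
        response = sagemaker_client.create_user_profile(**kwargs)
        arns[profile_name] = response["UserProfileArn"]
    return arns


# Example usage (hypothetical domain ID and users):
#   import boto3
#   client = boto3.client("sagemaker")
#   create_user_profiles(client, "d-exampledomain",
#                        {"alice": None, "bob": None})
```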

- Jupyter Server app.
When you start a Studio notebook for a user, either through a pre-signed URL or by logging in with AWS SSO, the Jupyter Server app is launched on an instance in a SageMaker service-managed VPC. Each user has their own dedicated Jupyter Server app. By default, the Jupyter Server app for SageMaker Studio notebooks runs on a dedicated ml.t3.medium instance, which is reserved as a "system" instance type. The compute for this instance is not charged to the customer.
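For the pre-signed URL path, a hedged sketch using the SageMaker `create_presigned_domain_url` API is shown below. The client is injected (for example `boto3.client("sagemaker")`), and the domain ID and profile name in the usage comment are hypothetical.

```python
def get_studio_url(sagemaker_client, domain_id, user_profile_name,
                   url_ttl_seconds=60):
    """Return a pre-signed URL that opens Studio for one user profile.

    url_ttl_seconds is how long the URL itself stays valid; the API
    accepts values between 5 and 300 seconds.
    """
    if not 5 <= url_ttl_seconds <= 300:
        raise ValueError("url_ttl_seconds must be between 5 and 300")
    response = sagemaker_client.create_presigned_domain_url(
        DomainId=domain_id,
        UserProfileName=user_profile_name,
        ExpiresInSeconds=url_ttl_seconds,
    )
    return response["AuthorizedUrl"]


# Example usage (hypothetical identifiers):
#   import boto3
#   url = get_studio_url(boto3.client("sagemaker"), "d-exampledomain", "alice")
```

Keeping the URL lifetime short limits how long a leaked link can be used.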

- Jupyter Kernel Gateway app.
A Kernel Gateway app hosts Jupyter notebook kernels, terminal sessions, and interactive consoles inside a SageMaker Studio image. Users can create it through the API or the SageMaker Studio interface, and it runs on a chosen instance type. Users can use built-in SageMaker Studio images that are preconfigured with popular data science and deep learning packages such as TensorFlow, Apache MXNet, and PyTorch.

Users can run up to four Kernel Gateway apps or images on the same physical instance, each one isolated by its container or image. To create more apps, you'll need to use a different instance type. Each user profile can have only one running instance of each instance type.

Users are billed for the time the instance is running. To save costs, users can shut down the instance when it is not in use. When a user shuts down and reopens a Kernel Gateway app from the SageMaker Studio interface, the app starts on a new instance, so installed packages are not persisted. Similarly, if a user changes the instance type on a notebook, the packages and session variables are lost. However, users can use features such as bring your own image and lifecycle scripts to bring their own packages to Studio and persist them through instance switches and new instance launches. For more information, refer to Shut down and Update Studio Apps.
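To illustrate the cost-saving advice, here is a minimal sketch that deletes a user's running Kernel Gateway apps with the SageMaker `list_apps` and `delete_app` APIs. The client is injected, and the identifiers in the usage comment are hypothetical. Deleting an app discards anything stored only on its instance, so treat this as an illustration, not a drop-in cleanup job.

```python
def shut_down_kernel_gateway_apps(sagemaker_client, domain_id,
                                  user_profile_name):
    """Delete all in-service KernelGateway apps for one user profile.

    Returns the names of the apps that were deleted.
    """
    deleted = []
    kwargs = {
        "DomainIdEquals": domain_id,
        "UserProfileNameEquals": user_profile_name,
    }
    while True:
        page = sagemaker_client.list_apps(**kwargs)
        for app in page.get("Apps", []):
            # Leave the JupyterServer app and already-stopped apps alone.
            if app["AppType"] == "KernelGateway" and app["Status"] == "InService":
                sagemaker_client.delete_app(
                    DomainId=domain_id,
                    UserProfileName=user_profile_name,
                    AppType=app["AppType"],
                    AppName=app["AppName"],
                )
                deleted.append(app["AppName"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return deleted


# Example usage (hypothetical identifiers):
#   import boto3
#   shut_down_kernel_gateway_apps(boto3.client("sagemaker"),
#                                 "d-exampledomain", "alice")
```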

- EFS volume.
When a domain is created, a single EFS volume is created for use by all the users within the domain. Each user profile receives a private home directory within the EFS volume for storing the user’s notebooks, GitHub repositories, and data files. Access to the folders is segregated by user, through filesystem permissions. SageMaker Studio creates a global unique user ID for each user profile, and applies it as a Portable Operating System Interface (POSIX) user/group ID for the user’s home directory on EFS, which prevents other users from accessing its data.

It's important to back up your EFS volume to another EFS volume or to Amazon S3, so that you can restore the SageMaker Studio domain after an accidental deletion. To recover, the administrator needs to list all user profiles and their associated EFS user IDs, delete all apps, user profiles, and the SageMaker Studio domain, create a new Studio domain, recreate the user profiles, and copy the files from the backup on EFS or Amazon S3.
You can use lifecycle configurations to back up data to and from S3 every time a user starts their app.
For more detailed instructions, refer to the appendix section Studio Domain Backup and Recovery in the AWS whitepaper.
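The first recovery step, listing user profiles with their EFS user IDs, could be sketched as below. It uses the SageMaker `list_user_profiles` and `describe_user_profile` APIs on an injected client; the domain ID in the usage comment is hypothetical.

```python
def list_profile_efs_uids(sagemaker_client, domain_id):
    """Map each user profile name in a domain to its EFS home directory UID."""
    uids = {}
    kwargs = {"DomainIdEquals": domain_id}
    while True:
        page = sagemaker_client.list_user_profiles(**kwargs)
        for profile in page.get("UserProfiles", []):
            name = profile["UserProfileName"]
            detail = sagemaker_client.describe_user_profile(
                DomainId=domain_id, UserProfileName=name
            )
            # HomeEfsFileSystemUid is the POSIX ID that owns the home directory.
            uids[name] = detail.get("HomeEfsFileSystemUid")
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return uids


# Example usage (hypothetical domain ID):
#   import boto3
#   print(list_profile_efs_uids(boto3.client("sagemaker"), "d-exampledomain"))
```

Saving this mapping alongside the backup makes it possible to match each backed-up home directory to the right user when the profiles are recreated.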

- EBS volume.
When you launch a SageMaker Studio Notebook instance, an EBS storage volume is also attached to it. It's used as the main storage for the container or image running on the instance. However, unlike EFS storage, which is persistent, the EBS volume attached to the container is temporary. That means that if you delete the app or image, the data stored locally on the EBS volume will be lost.

- Securing access to the pre-signed URL.

When a user opens a notebook link in SageMaker Studio, Studio validates the user's IAM policy to authorize access and generates a pre-signed URL for the user. However, because the Studio console runs on an internet domain, the generated pre-signed URL is visible in the browser session, which creates a risk of data theft if proper access controls are not in place.

To prevent this, Studio offers several methods to enforce access controls against pre-signed URL data theft:

Client IP validation using the IAM policy condition aws:sourceIp
Client VPC validation using the IAM policy condition aws:sourceVpc
Client VPC endpoint validation using the IAM policy condition aws:sourceVpce

When accessing Studio notebooks from the Studio console, the only available option is client IP validation with the IAM policy condition aws:sourceIp. However, many enterprises route browser traffic through products like Zscaler to ensure scale and compliance for internet access. These products generate their own source IP, which is not controlled by the enterprise customer, so it's impossible to use the aws:sourceIp condition in that setup.

To use client VPC endpoint validation with the IAM policy condition aws:sourceVpce, the pre-signed URL needs to be created in the same customer VPC where Studio is deployed, and accessed via a Studio VPC endpoint on the customer VPC. This can be done by using DNS forwarding rules in Zscaler and corporate DNS, then using an Amazon Route 53 inbound resolver to access the customer VPC endpoint.
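As an illustration of these conditions, the sketch below builds an IAM policy document that allows `sagemaker:CreatePresignedDomainUrl` only from a trusted VPC endpoint or a trusted IP range. The endpoint ID and CIDR are hypothetical, and the statement is a simplified example, not a complete Studio access policy.

```python
import json


def build_presigned_url_policy(vpc_endpoint_ids=None, source_ip_cidrs=None):
    """Build an IAM policy restricting where pre-signed URLs can be used.

    Pass exactly one of vpc_endpoint_ids or source_ip_cidrs.
    """
    if bool(vpc_endpoint_ids) == bool(source_ip_cidrs):
        raise ValueError("pass exactly one of vpc_endpoint_ids or source_ip_cidrs")
    if vpc_endpoint_ids:
        condition = {"StringEquals": {"aws:sourceVpce": list(vpc_endpoint_ids)}}
    else:
        condition = {"IpAddress": {"aws:sourceIp": list(source_ip_cidrs)}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowStudioUrlFromTrustedNetwork",
                "Effect": "Allow",
                "Action": ["sagemaker:CreatePresignedDomainUrl"],
                "Resource": "*",
                "Condition": condition,
            }
        ],
    }


# Hypothetical VPC endpoint ID, for illustration only.
policy = build_presigned_url_policy(vpc_endpoint_ids=["vpce-0123456789abcdef0"])
print(json.dumps(policy, indent=2))
```

In practice you would likely scope `Resource` down to specific user profile ARNs instead of `"*"`.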

- SageMaker domain quotas and limits.

Each AWS account can only have one domain per region in IAM mode and one domain per account in SSO mode.
SageMaker Studio domain SSO federation is only supported in the region where AWS SSO is set up, across member accounts of the AWS Organization.
Once created, the VPC and subnet configuration of a domain cannot be changed.
It is not possible to switch between IAM and SSO modes after creating a domain.
Each user can only launch four Kernel Gateway apps per instance type.
Each user can only launch one instance of each instance type.
There are limits on the resources consumed within a domain, such as the number of instances launched per instance type and the number of user profiles that can be created. Refer to the service quota page for a complete list of limits.
There is a hard limit of 1,000 user profiles per SageMaker Studio Domain.
Customers can request an increase to the default resource limits by submitting an enterprise support case with a business justification. Requests are subject to account-level guardrails.
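Before filing a request, it can help to check the current values. The sketch below lists SageMaker quotas whose name contains a search string, using the Service Quotas `list_service_quotas` API on an injected client. The search string in the usage comment is only an example; actual quota names should be checked on the service quota page.

```python
def find_sagemaker_quotas(quotas_client, name_contains):
    """Return (name, value, adjustable) for matching SageMaker quotas."""
    matches = []
    needle = name_contains.lower()
    kwargs = {"ServiceCode": "sagemaker"}
    while True:
        page = quotas_client.list_service_quotas(**kwargs)
        for quota in page.get("Quotas", []):
            if needle in quota["QuotaName"].lower():
                matches.append(
                    (quota["QuotaName"], quota["Value"], quota["Adjustable"])
                )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return matches


# Example usage (the search string is an assumption, not an official name):
#   import boto3
#   for row in find_sagemaker_quotas(boto3.client("service-quotas"), "studio"):
#       print(row)
```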