SageMaker Studio is the IDE for ML on AWS - notebooks, training, deployment, all in one place. Here's how to provision the entire domain with Terraform including VPC, IAM, user profiles, and security hardening.
In Series 1-3, we worked with managed AI services - Bedrock for models, Knowledge Bases for RAG, Agents for orchestration. Series 5 shifts to custom ML - training your own models, deploying them to endpoints, managing features, and building CI/CD pipelines.
It all starts with a SageMaker Studio Domain. The domain is the foundation for everything in SageMaker - it provides the IDE (JupyterLab, Code Editor), manages user profiles, attaches shared storage (EFS), and controls network access. Think of it as the workspace where your ML team lives. This post provisions the entire setup with Terraform.
## What a SageMaker Domain Contains
| Component | What It Does |
|---|---|
| Domain | Top-level resource: VPC config, auth mode, encryption |
| User Profile | Per-user config: execution role, instance types, apps |
| Shared Space | Team collaboration spaces with shared notebooks |
| EFS Volume | Persistent storage for notebooks and data files |
| Apps | JupyterLab, Code Editor, Canvas, Studio Classic |
One domain per team or environment. User profiles within the domain give each data scientist their own isolated workspace with shared access to the same data and models.
## Terraform: The Full Domain Setup
### VPC and Networking
SageMaker Studio requires a VPC. For production, use VPC-only mode with private subnets and VPC endpoints:
```hcl
# network/main.tf
resource "aws_vpc" "ml" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "${var.environment}-ml-vpc" }
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.ml.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "${var.environment}-ml-private-${count.index}" }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.ml.id

  tags = { Name = "${var.environment}-ml-private" }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

# S3 Gateway endpoint (required for SageMaker in VPC-only mode)
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.ml.id
  service_name    = "com.amazonaws.${var.region}.s3"
  route_table_ids = [aws_route_table.private.id]
}

# SageMaker API interface endpoint
resource "aws_vpc_endpoint" "sagemaker_api" {
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.sagemaker.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

# SageMaker Runtime interface endpoint
resource "aws_vpc_endpoint" "sagemaker_runtime" {
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.sagemaker.runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpce-"
  vpc_id      = aws_vpc.ml.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```
VPC-only mode blocks all internet access. You need VPC endpoints for every AWS service SageMaker uses: S3 (gateway), SageMaker API, SageMaker Runtime, and optionally STS, CloudWatch Logs, and ECR for training jobs.
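Those optional interface endpoints follow the same shape, so they can be stamped out with a single `for_each` loop. A sketch (the `optional_endpoints` local and resource name are illustrative, not from the repo above):

```hcl
# Illustrative: optional interface endpoints for VPC-only mode,
# generated from one list instead of four near-identical resources.
locals {
  optional_endpoints = toset(["sts", "logs", "ecr.api", "ecr.dkr"])
}

resource "aws_vpc_endpoint" "optional" {
  for_each            = local.optional_endpoints
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```

Adding an endpoint later is then a one-line change to the list rather than a new resource block.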
### IAM Execution Role
```hcl
# iam/main.tf
data "aws_caller_identity" "current" {}

resource "aws_iam_role" "sagemaker_execution" {
  name = "${var.environment}-sagemaker-execution"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Scoped permissions - not AmazonSageMakerFullAccess
resource "aws_iam_role_policy" "sagemaker_core" {
  name = "sagemaker-core"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          var.data_bucket_arn,
          "${var.data_bucket_arn}/*",
          var.artifacts_bucket_arn,
          "${var.artifacts_bucket_arn}/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "ecr:GetDownloadUrlForLayer"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"
      }
    ]
  })
}
```
Avoid AmazonSageMakerFullAccess in production. It grants broad permissions across S3, ECR, Lambda, and more. Scope your policies to the specific buckets, registries, and services your team needs.
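The trust policy can be tightened too. A sketch of a confused-deputy guard using the `aws:SourceAccount` condition key (the resource name here is hypothetical; it assumes the `data.aws_caller_identity.current` data source referenced above is declared):

```hcl
# Illustrative: restrict role assumption to SageMaker acting on behalf
# of this account, guarding against the confused-deputy problem.
resource "aws_iam_role" "sagemaker_execution_strict" {
  name = "${var.environment}-sagemaker-execution-strict"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
      Condition = {
        StringEquals = {
          "aws:SourceAccount" = data.aws_caller_identity.current.account_id
        }
      }
    }]
  })
}
```

Without the condition, any SageMaker resource in any account could in principle be configured to assume a role that trusts `sagemaker.amazonaws.com`.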
### KMS Key for Encryption
```hcl
# security/kms.tf
resource "aws_kms_key" "sagemaker" {
  description             = "SageMaker Studio EFS and EBS encryption"
  deletion_window_in_days = 7
  enable_key_rotation     = true
}

resource "aws_kms_alias" "sagemaker" {
  name          = "alias/${var.environment}-sagemaker"
  target_key_id = aws_kms_key.sagemaker.key_id
}
```
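With no explicit policy, the key falls back to the default key policy (account root has full access). If you want to be explicit about who can use the key, a sketch of a `policy` attribute to add inside the key resource (it assumes the execution role and `data.aws_caller_identity.current` from the IAM section):

```hcl
# Illustrative: explicit key policy -- root keeps admin, the SageMaker
# execution role can use the key for encrypt/decrypt.
policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Sid       = "RootAdmin"
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
      Action    = "kms:*"
      Resource  = "*"
    },
    {
      Sid       = "SageMakerUse"
      Effect    = "Allow"
      Principal = { AWS = aws_iam_role.sagemaker_execution.arn }
      Action    = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"]
      Resource  = "*"
    }
  ]
})
```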
### The Domain
```hcl
# sagemaker/domain.tf
resource "aws_sagemaker_domain" "this" {
  domain_name = "${var.environment}-ml-studio"
  auth_mode   = "IAM"
  vpc_id      = aws_vpc.ml.id
  subnet_ids  = aws_subnet.private[*].id

  # VPC-only mode blocks internet access
  app_network_access_type = var.network_access_type

  # Encrypt EFS and EBS volumes
  kms_key_id = aws_kms_key.sagemaker.arn

  default_user_settings {
    execution_role  = aws_iam_role.sagemaker_execution.arn
    security_groups = [aws_security_group.sagemaker_studio.id]

    jupyter_lab_app_settings {
      default_resource_spec {
        instance_type       = var.default_instance_type
        sagemaker_image_arn = var.jupyter_image_arn
      }
    }

    sharing_settings {
      notebook_output_option = "Allowed"
      s3_output_path         = "s3://${var.artifacts_bucket}/sharing"
    }
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```
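The domain references an `aws_security_group.sagemaker_studio` that isn't shown above. A minimal sketch of what it might contain (the exact rules are illustrative): Studio apps need to talk to each other for kernel traffic, reach the VPC endpoints over HTTPS, and mount the domain's EFS over NFS.

```hcl
# Illustrative sketch of the security group attached to Studio apps.
resource "aws_security_group" "sagemaker_studio" {
  name_prefix = "${var.environment}-sagemaker-studio-"
  vpc_id      = aws_vpc.ml.id

  # Inter-app traffic (e.g. JupyterLab <-> kernel gateway)
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  # HTTPS to the VPC endpoints
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }

  # NFS to the domain's EFS mount targets
  egress {
    from_port   = 2049
    to_port     = 2049
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```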
### User Profiles
```hcl
# sagemaker/users.tf
resource "aws_sagemaker_user_profile" "this" {
  for_each = var.user_profiles

  domain_id         = aws_sagemaker_domain.this.id
  user_profile_name = each.key

  user_settings {
    execution_role = coalesce(
      each.value.execution_role_arn,
      aws_iam_role.sagemaker_execution.arn
    )
  }

  tags = {
    Team        = each.value.team
    Environment = var.environment
  }
}
```
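The `user_profiles` variable this loop consumes might be declared like so, matching the shape used in the tfvars files (a sketch; `optional()` requires Terraform 1.3+):

```hcl
# Illustrative: the map shape consumed by for_each above.
variable "user_profiles" {
  type = map(object({
    team               = string
    execution_role_arn = optional(string)
  }))
  default = {}
}
```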
## Environment Configuration
```hcl
# environments/dev.tfvars
environment           = "dev"
network_access_type   = "PublicInternetOnly" # Simpler for dev
default_instance_type = "ml.t3.medium"

user_profiles = {
  "data-scientist-1" = {
    team               = "ml-team"
    execution_role_arn = null # Uses domain default
  }
}
```

```hcl
# environments/prod.tfvars
environment           = "prod"
network_access_type   = "VpcOnly" # Locked down
default_instance_type = "ml.t3.medium"

user_profiles = {
  "ds-lead" = {
    team               = "ml-team"
    execution_role_arn = null
  }
  "ds-engineer-1" = {
    team               = "ml-team"
    execution_role_arn = null
  }
  "ds-engineer-2" = {
    team               = "ml-team"
    execution_role_arn = null
  }
}
```
Dev uses PublicInternetOnly for easier setup - no VPC endpoints needed, pip install works out of the box. Prod uses VpcOnly for network isolation - all traffic stays within your VPC, requires VPC endpoints for AWS services.
## Security Hardening Checklist
| Control | Dev | Prod |
|---|---|---|
| Network access | PublicInternetOnly | VpcOnly |
| EFS encryption | KMS key | KMS key |
| IAM scope | Scoped to buckets | Least-privilege per team |
| Instance types | ml.t3.medium | Restricted via Service Quotas |
| Notebook sharing | Allowed (S3 output) | Allowed (encrypted S3) |
| Auto-shutdown | Optional | Lifecycle config enforced |
## Auto-Shutdown Lifecycle Config (Cost Control)
Idle notebooks burn money. Attach a lifecycle config to auto-shutdown:
```hcl
resource "aws_sagemaker_studio_lifecycle_config" "auto_shutdown" {
  studio_lifecycle_config_name = "${var.environment}-auto-shutdown"

  # "JupyterLab" targets the current Studio experience;
  # "JupyterServer" applies to Studio Classic.
  studio_lifecycle_config_app_type = "JupyterLab"
  studio_lifecycle_config_content  = base64encode(file("${path.module}/scripts/auto-shutdown.sh"))
}
```
The script checks for idle kernels and shuts the app down after a configurable timeout (typically 60-120 minutes of inactivity). To take effect, the config's ARN must be attached to the domain, via `lifecycle_config_arns` in the `jupyter_lab_app_settings` block.
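There's no single canonical `auto-shutdown.sh`; a minimal sketch of the idle check is below. It assumes the app's Jupyter server listens on `localhost:8888` without auth and that `jq` is installed - the endpoint, timeout, and shutdown call are all illustrative and would need adapting to your environment.

```shell
#!/usr/bin/env bash
# Hypothetical auto-shutdown sketch for a Studio lifecycle config.
IDLE_TIMEOUT_MINUTES="${IDLE_TIMEOUT_MINUTES:-90}"

# True when the given idle duration (minutes) meets or exceeds the timeout.
is_idle() {
  [ "$1" -ge "$IDLE_TIMEOUT_MINUTES" ]
}

# Minutes elapsed since an ISO-8601 timestamp (GNU date).
minutes_since() {
  local then_s now_s
  then_s=$(date -d "$1" +%s)
  now_s=$(date +%s)
  echo $(( (now_s - then_s) / 60 ))
}

# Only perform the live check when invoked with --run, so the helpers
# above can be exercised in isolation.
if [ "${1:-}" = "--run" ]; then
  # Most recent kernel activity across all sessions (Jupyter REST API).
  last=$(curl -s http://localhost:8888/api/sessions \
    | jq -r '.[].kernel.last_activity' | sort | tail -n 1)
  if [ -n "$last" ] && is_idle "$(minutes_since "$last")"; then
    # Idle past the threshold: shut the Jupyter server down. In Studio you
    # could instead stop the app, e.g. with `aws sagemaker delete-app`.
    curl -s -X POST http://localhost:8888/api/shutdown
  fi
fi
```

In practice you would run this from cron inside the app and add authentication to the API calls.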
## Gotchas and Tips
**One domain per region per account (soft limit).** You can request an increase, but the default is one. Use user profiles to separate team members within a single domain.

**EFS is created automatically.** The domain creates an EFS volume and a home directory per user profile. You don't manage this directly, but you should encrypt it with KMS.

**VPC endpoint costs add up.** Each interface VPC endpoint costs roughly $7/month plus data processing charges. In VPC-only mode, you may need 5-10 endpoints. Budget $50-100/month for networking in prod.

**Domain deletion is slow and complex.** Deleting a domain requires deleting all apps, user profiles, and spaces first. Terraform handles the ordering, but `terraform destroy` can take 15-20 minutes.

**Instance type quotas.** GPU instances (`ml.g5`, `ml.p4d`) require Service Quota increases. Request these early - approval can take days.
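Quota increases can themselves live in Terraform. A sketch (the `var.gpu_quota_code` placeholder stands in for the real `L-...` quota code, which you can look up with `aws service-quotas list-service-quotas --service-code sagemaker`):

```hcl
# Illustrative: request a SageMaker instance quota increase in code.
resource "aws_servicequotas_service_quota" "gpu_training" {
  service_code = "sagemaker"
  quota_code   = var.gpu_quota_code # placeholder for the real L-... code
  value        = 4
}
```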
## What's Next
This is Post 1 of the ML Pipelines & MLOps with Terraform series.
- Post 1: SageMaker Studio Domain (you are here)
- Post 2: SageMaker Endpoints - Deploy to Prod
- Post 3: SageMaker Feature Store
- Post 4: SageMaker Pipelines - CI/CD for ML
Your ML workspace is provisioned. VPC-isolated, KMS-encrypted, IAM-scoped, with user profiles for every team member. The foundation for everything that follows: model training, deployment, feature engineering, and ML pipelines.
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series!