SageMaker Studio is the IDE for ML on AWS - notebooks, training, deployment, all in one place. Here's how to provision the entire domain with Terraform including VPC, IAM, user profiles, and security hardening.
In Series 1-3, we worked with managed AI services - Bedrock for models, Knowledge Bases for RAG, Agents for orchestration. Series 5 shifts to custom ML - training your own models, deploying them to endpoints, managing features, and building CI/CD pipelines.
It all starts with a SageMaker Studio Domain. The domain is the foundation for everything in SageMaker - it provides the IDE (JupyterLab, Code Editor), manages user profiles, attaches shared storage (EFS), and controls network access. Think of it as the workspace where your ML team lives. This post provisions the entire setup with Terraform.
## What a SageMaker Domain Contains
| Component | What It Does |
|---|---|
| Domain | Top-level resource: VPC config, auth mode, encryption |
| User Profile | Per-user config: execution role, instance types, apps |
| Shared Space | Team collaboration spaces with shared notebooks |
| EFS Volume | Persistent storage for notebooks and data files |
| Apps | JupyterLab, Code Editor, Canvas, Studio Classic |
One domain per team or environment. User profiles within the domain give each data scientist their own isolated workspace with shared access to the same data and models.
## Terraform: The Full Domain Setup
### VPC and Networking
SageMaker Studio requires a VPC. For production, use VPC-only mode with private subnets and VPC endpoints:
```hcl
# network/main.tf
resource "aws_vpc" "ml" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "${var.environment}-ml-vpc" }
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.ml.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "${var.environment}-ml-private-${count.index}" }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.ml.id

  tags = { Name = "${var.environment}-ml-private" }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

# S3 Gateway endpoint (required for SageMaker in VPC-only mode)
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.ml.id
  service_name    = "com.amazonaws.${var.region}.s3"
  route_table_ids = [aws_route_table.private.id]
}

# SageMaker API interface endpoint
resource "aws_vpc_endpoint" "sagemaker_api" {
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.sagemaker.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

# SageMaker Runtime interface endpoint
resource "aws_vpc_endpoint" "sagemaker_runtime" {
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.sagemaker.runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpce-"
  vpc_id      = aws_vpc.ml.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```
VPC-only mode blocks all internet access. You need VPC endpoints for every AWS service SageMaker uses: S3 (gateway), SageMaker API, SageMaker Runtime, and optionally STS, CloudWatch Logs, and ECR for training jobs.
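Those optional interface endpoints follow the same shape, so they can be stamped out with a single `for_each` loop. A sketch (the `optional_endpoints` local and resource name are illustrative, not from the repo above):

```hcl
# Illustrative: optional interface endpoints for VPC-only mode,
# generated from one list instead of four near-identical resources.
locals {
  optional_endpoints = toset(["sts", "logs", "ecr.api", "ecr.dkr"])
}

resource "aws_vpc_endpoint" "optional" {
  for_each            = local.optional_endpoints
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```

Adding an endpoint later is then a one-line change to the list rather than a new resource block.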
### IAM Execution Role
```hcl
# iam/main.tf
data "aws_caller_identity" "current" {}

resource "aws_iam_role" "sagemaker_execution" {
  name = "${var.environment}-sagemaker-execution"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Scoped permissions - not AmazonSageMakerFullAccess
resource "aws_iam_role_policy" "sagemaker_core" {
  name = "sagemaker-core"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          var.data_bucket_arn,
          "${var.data_bucket_arn}/*",
          var.artifacts_bucket_arn,
          "${var.artifacts_bucket_arn}/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "ecr:GetDownloadUrlForLayer"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"
      }
    ]
  })
}
```
Avoid AmazonSageMakerFullAccess in production. It grants broad permissions across S3, ECR, Lambda, and more. Scope your policies to the specific buckets, registries, and services your team needs.
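The trust policy can be tightened too. A sketch of a confused-deputy guard using the `aws:SourceAccount` condition key (the resource name here is hypothetical; it assumes the `data.aws_caller_identity.current` data source referenced above is declared):

```hcl
# Illustrative: restrict role assumption to SageMaker acting on behalf
# of this account, guarding against the confused-deputy problem.
resource "aws_iam_role" "sagemaker_execution_strict" {
  name = "${var.environment}-sagemaker-execution-strict"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
      Condition = {
        StringEquals = {
          "aws:SourceAccount" = data.aws_caller_identity.current.account_id
        }
      }
    }]
  })
}
```

Without the condition, any SageMaker resource in any account could in principle be configured to assume a role that trusts `sagemaker.amazonaws.com`.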
### KMS Key for Encryption
```hcl
# security/kms.tf
resource "aws_kms_key" "sagemaker" {
  description             = "SageMaker Studio EFS and EBS encryption"
  deletion_window_in_days = 7
  enable_key_rotation     = true
}

resource "aws_kms_alias" "sagemaker" {
  name          = "alias/${var.environment}-sagemaker"
  target_key_id = aws_kms_key.sagemaker.key_id
}
```
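With no explicit policy, the key falls back to the default key policy (account root has full access). If you want to be explicit about who can use the key, a sketch of a `policy` attribute to add inside the key resource (it assumes the execution role and `data.aws_caller_identity.current` from the IAM section):

```hcl
# Illustrative: explicit key policy -- root keeps admin, the SageMaker
# execution role can use the key for encrypt/decrypt.
policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Sid       = "RootAdmin"
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
      Action    = "kms:*"
      Resource  = "*"
    },
    {
      Sid       = "SageMakerUse"
      Effect    = "Allow"
      Principal = { AWS = aws_iam_role.sagemaker_execution.arn }
      Action    = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"]
      Resource  = "*"
    }
  ]
})
```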
### The Domain
```hcl
# sagemaker/domain.tf
resource "aws_sagemaker_domain" "this" {
  domain_name = "${var.environment}-ml-studio"
  auth_mode   = "IAM"
  vpc_id      = aws_vpc.ml.id
  subnet_ids  = aws_subnet.private[*].id

  # VPC-only mode blocks internet access
  app_network_access_type = var.network_access_type

  # Encrypt EFS and EBS volumes
  kms_key_id = aws_kms_key.sagemaker.arn

  default_user_settings {
    execution_role  = aws_iam_role.sagemaker_execution.arn
    security_groups = [aws_security_group.sagemaker_studio.id]

    jupyter_lab_app_settings {
      default_resource_spec {
        instance_type       = var.default_instance_type
        sagemaker_image_arn = var.jupyter_image_arn
      }
    }

    sharing_settings {
      notebook_output_option = "Allowed"
      s3_output_path         = "s3://${var.artifacts_bucket}/sharing"
    }
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```
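The domain references an `aws_security_group.sagemaker_studio` that isn't shown above. A minimal sketch of what it might contain (the exact rules are illustrative): Studio apps need to talk to each other for kernel traffic, reach the VPC endpoints over HTTPS, and mount the domain's EFS over NFS.

```hcl
# Illustrative sketch of the security group attached to Studio apps.
resource "aws_security_group" "sagemaker_studio" {
  name_prefix = "${var.environment}-sagemaker-studio-"
  vpc_id      = aws_vpc.ml.id

  # Inter-app traffic (e.g. JupyterLab <-> kernel gateway)
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  # HTTPS to the VPC endpoints
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }

  # NFS to the domain's EFS mount targets
  egress {
    from_port   = 2049
    to_port     = 2049
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```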
### User Profiles
```hcl
# sagemaker/users.tf
resource "aws_sagemaker_user_profile" "this" {
  for_each = var.user_profiles

  domain_id         = aws_sagemaker_domain.this.id
  user_profile_name = each.key

  user_settings {
    execution_role = coalesce(
      each.value.execution_role_arn,
      aws_iam_role.sagemaker_execution.arn
    )
  }

  tags = {
    Team        = each.value.team
    Environment = var.environment
  }
}
```
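The `user_profiles` variable this loop consumes might be declared like so, matching the shape used in the tfvars files (a sketch; `optional()` requires Terraform 1.3+):

```hcl
# Illustrative: the map shape consumed by for_each above.
variable "user_profiles" {
  type = map(object({
    team               = string
    execution_role_arn = optional(string)
  }))
  default = {}
}
```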
## Environment Configuration
```hcl
# environments/dev.tfvars
environment           = "dev"
network_access_type   = "PublicInternetOnly" # Simpler for dev
default_instance_type = "ml.t3.medium"

user_profiles = {
  "data-scientist-1" = {
    team               = "ml-team"
    execution_role_arn = null # Uses domain default
  }
}
```

```hcl
# environments/prod.tfvars
environment           = "prod"
network_access_type   = "VpcOnly" # Locked down
default_instance_type = "ml.t3.medium"

user_profiles = {
  "ds-lead" = {
    team               = "ml-team"
    execution_role_arn = null
  }
  "ds-engineer-1" = {
    team               = "ml-team"
    execution_role_arn = null
  }
  "ds-engineer-2" = {
    team               = "ml-team"
    execution_role_arn = null
  }
}
```
Dev uses PublicInternetOnly for easier setup - no VPC endpoints needed, pip install works out of the box. Prod uses VpcOnly for network isolation - all traffic stays within your VPC, requires VPC endpoints for AWS services.
## Security Hardening Checklist
| Control | Dev | Prod |
|---|---|---|
| Network access | PublicInternetOnly | VpcOnly |
| EFS encryption | KMS key | KMS key |
| IAM scope | Scoped to buckets | Least-privilege per team |
| Instance types | ml.t3.medium | Restricted via Service Quotas |
| Notebook sharing | Allowed (S3 output) | Allowed (encrypted S3) |
| Auto-shutdown | Optional | Lifecycle config enforced |
## Auto-Shutdown Lifecycle Config (Cost Control)
Idle notebooks burn money. Attach a lifecycle config to auto-shutdown:
```hcl
resource "aws_sagemaker_studio_lifecycle_config" "auto_shutdown" {
  studio_lifecycle_config_name = "${var.environment}-auto-shutdown"

  # "JupyterLab" targets the current Studio experience;
  # "JupyterServer" applies to Studio Classic.
  studio_lifecycle_config_app_type = "JupyterLab"
  studio_lifecycle_config_content  = base64encode(file("${path.module}/scripts/auto-shutdown.sh"))
}
```
The script checks for idle kernels and shuts the app down after a configurable timeout (typically 60-120 minutes of inactivity). To take effect, the config's ARN must be attached to the domain, via `lifecycle_config_arns` in the `jupyter_lab_app_settings` block.
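There's no single canonical `auto-shutdown.sh`; a minimal sketch of the idle check is below. It assumes the app's Jupyter server listens on `localhost:8888` without auth and that `jq` is installed - the endpoint, timeout, and shutdown call are all illustrative and would need adapting to your environment.

```shell
#!/usr/bin/env bash
# Hypothetical auto-shutdown sketch for a Studio lifecycle config.
IDLE_TIMEOUT_MINUTES="${IDLE_TIMEOUT_MINUTES:-90}"

# True when the given idle duration (minutes) meets or exceeds the timeout.
is_idle() {
  [ "$1" -ge "$IDLE_TIMEOUT_MINUTES" ]
}

# Minutes elapsed since an ISO-8601 timestamp (GNU date).
minutes_since() {
  local then_s now_s
  then_s=$(date -d "$1" +%s)
  now_s=$(date +%s)
  echo $(( (now_s - then_s) / 60 ))
}

# Only perform the live check when invoked with --run, so the helpers
# above can be exercised in isolation.
if [ "${1:-}" = "--run" ]; then
  # Most recent kernel activity across all sessions (Jupyter REST API).
  last=$(curl -s http://localhost:8888/api/sessions \
    | jq -r '.[].kernel.last_activity' | sort | tail -n 1)
  if [ -n "$last" ] && is_idle "$(minutes_since "$last")"; then
    # Idle past the threshold: shut the Jupyter server down. In Studio you
    # could instead stop the app, e.g. with `aws sagemaker delete-app`.
    curl -s -X POST http://localhost:8888/api/shutdown
  fi
fi
```

In practice you would run this from cron inside the app and add authentication to the API calls.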
## Gotchas and Tips
**One domain per region per account (soft limit).** You can request an increase, but the default is one. Use user profiles to separate team members within a single domain.

**EFS is created automatically.** The domain creates an EFS volume and a home directory per user profile. You don't manage this directly, but you should encrypt it with KMS.

**VPC endpoint costs add up.** Each interface VPC endpoint costs roughly $7/month plus data processing charges. In VPC-only mode, you may need 5-10 endpoints. Budget $50-100/month for networking in prod.

**Domain deletion is slow and complex.** Deleting a domain requires deleting all apps, user profiles, and spaces first. Terraform handles the ordering, but `terraform destroy` can take 15-20 minutes.

**Instance type quotas.** GPU instances (`ml.g5`, `ml.p4d`) require Service Quota increases. Request these early - approval can take days.
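Quota increases can themselves live in Terraform. A sketch (the `var.gpu_quota_code` placeholder stands in for the real `L-...` quota code, which you can look up with `aws service-quotas list-service-quotas --service-code sagemaker`):

```hcl
# Illustrative: request a SageMaker instance quota increase in code.
resource "aws_servicequotas_service_quota" "gpu_training" {
  service_code = "sagemaker"
  quota_code   = var.gpu_quota_code # placeholder for the real L-... code
  value        = 4
}
```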
## What's Next
This is Post 1 of the ML Pipelines & MLOps with Terraform series.
- Post 1: SageMaker Studio Domain (you are here)
- Post 2: SageMaker Endpoints - Deploy to Prod
- Post 3: SageMaker Feature Store
- Post 4: SageMaker Pipelines - CI/CD for ML
Your ML workspace is provisioned. VPC-isolated, KMS-encrypted, IAM-scoped, with user profiles for every team member. The foundation for everything that follows: model training, deployment, feature engineering, and ML pipelines.
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series!