Suhas Mallesh

SageMaker Studio Domain with Terraform: Your ML Workspace on AWS 🔬

SageMaker Studio is the IDE for ML on AWS - notebooks, training, deployment, all in one place. Here's how to provision the entire domain with Terraform including VPC, IAM, user profiles, and security hardening.

In Series 1-3, we worked with managed AI services - Bedrock for models, Knowledge Bases for RAG, Agents for orchestration. Series 5 shifts to custom ML - training your own models, deploying them to endpoints, managing features, and building CI/CD pipelines.

It all starts with a SageMaker Studio Domain. The domain is the foundation for everything in SageMaker - it provides the IDE (JupyterLab, Code Editor), manages user profiles, attaches shared storage (EFS), and controls network access. Think of it as the workspace where your ML team lives. This post provisions the entire setup with Terraform. 🎯

🏗️ What a SageMaker Domain Contains

| Component | What It Does |
|---|---|
| Domain | Top-level resource: VPC config, auth mode, encryption |
| User Profile | Per-user config: execution role, instance types, apps |
| Shared Space | Team collaboration spaces with shared notebooks |
| EFS Volume | Persistent storage for notebooks and data files |
| Apps | JupyterLab, Code Editor, Canvas, Studio Classic |

One domain per team or environment. User profiles within the domain give each data scientist their own isolated workspace with shared access to the same data and models.

🔧 Terraform: The Full Domain Setup

VPC and Networking

SageMaker Studio requires a VPC. For production, use VPC-only mode with private subnets and VPC endpoints:

# network/main.tf

resource "aws_vpc" "ml" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "${var.environment}-ml-vpc" }
}

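# Availability zones used to spread the private subnets
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.ml.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = { Name = "${var.environment}-ml-private-${count.index}" }
}

# Private route table with no internet route; the S3 gateway endpoint adds its own route here
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.ml.id

  tags = { Name = "${var.environment}-ml-private" }
}

# Associate both private subnets so they actually use the route table above
resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

# S3 Gateway endpoint (required for SageMaker in VPC-only mode)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.ml.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}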

# SageMaker API interface endpoint
resource "aws_vpc_endpoint" "sagemaker_api" {
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.sagemaker.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

# SageMaker Runtime interface endpoint
resource "aws_vpc_endpoint" "sagemaker_runtime" {
  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.sagemaker.runtime"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "${var.environment}-vpce-"
  vpc_id      = aws_vpc.ml.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}

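VPC-only mode removes the SageMaker-managed internet route, so Studio traffic leaves only through your own subnets. With no NAT gateway, as in this setup, there is no internet access at all, and you need VPC endpoints for every AWS service SageMaker uses: S3 (gateway), SageMaker API, SageMaker Runtime, and optionally STS, CloudWatch Logs, and ECR for training jobs.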
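
The optional endpoints mentioned above can be added with one `for_each` rather than four near-identical resources. This is a minimal sketch, not part of the original setup: the local name `extra_interface_services` and the resource name `extra` are made up for illustration, and the list should be trimmed to the services your workloads actually call.

```hcl
locals {
  # Hypothetical list: STS, CloudWatch Logs, and the two ECR endpoints
  # (ecr.api for the registry API, ecr.dkr for image pulls).
  extra_interface_services = ["sts", "logs", "ecr.api", "ecr.dkr"]
}

resource "aws_vpc_endpoint" "extra" {
  for_each = toset(local.extra_interface_services)

  vpc_id              = aws_vpc.ml.id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  # Dots are replaced so the Name tag stays readable, e.g. "prod-ecr-api-vpce"
  tags = { Name = "${var.environment}-${replace(each.value, ".", "-")}-vpce" }
}
```

Because the set is keyed by service name, removing an entry later destroys only that endpoint and leaves the others untouched.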

IAM Execution Role

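# iam/main.tf

# Account ID lookup used to build the CloudWatch Logs ARN below
data "aws_caller_identity" "current" {}
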
resource "aws_iam_role" "sagemaker_execution" {
  name = "${var.environment}-sagemaker-execution"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Scoped permissions - not AmazonSageMakerFullAccess
resource "aws_iam_role_policy" "sagemaker_core" {
  name = "sagemaker-core"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          var.data_bucket_arn,
          "${var.data_bucket_arn}/*",
          var.artifacts_bucket_arn,
          "${var.artifacts_bucket_arn}/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "ecr:GetDownloadUrlForLayer"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*"
      }
    ]
  })
}

Avoid AmazonSageMakerFullAccess in production. It grants broad permissions across S3, ECR, Lambda, and more. Scope your policies to the specific buckets, registries, and services your team needs.
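
The scoped policy above covers S3, ECR, and CloudWatch Logs, but no `sagemaker:` actions, which a team launching jobs from Studio will also need. As a hedged sketch of how that scoping might continue, the policy below allows only training-job management and passing the execution role to SageMaker. The policy name `sagemaker-training` and the choice of actions are assumptions for illustration; adjust them to what your team actually runs.

```hcl
resource "aws_iam_role_policy" "sagemaker_training" {
  name = "sagemaker-training" # hypothetical policy name
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Limit job management to training jobs in this account and region
        Sid    = "ManageTrainingJobs"
        Effect = "Allow"
        Action = [
          "sagemaker:CreateTrainingJob",
          "sagemaker:DescribeTrainingJob",
          "sagemaker:StopTrainingJob"
        ]
        Resource = "arn:aws:sagemaker:${var.region}:${data.aws_caller_identity.current.account_id}:training-job/*"
      },
      {
        # Training jobs run under a role, so the caller must be allowed to pass it,
        # and only to the SageMaker service
        Sid      = "PassExecutionRoleToSageMaker"
        Effect   = "Allow"
        Action   = "iam:PassRole"
        Resource = aws_iam_role.sagemaker_execution.arn
        Condition = {
          StringEquals = { "iam:PassedToService" = "sagemaker.amazonaws.com" }
        }
      }
    ]
  })
}
```

Keeping this in a separate inline policy makes it easy to grant training permissions to some roles and not others.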

KMS Key for Encryption

# security/kms.tf

resource "aws_kms_key" "sagemaker" {
  description             = "SageMaker Studio EFS and EBS encryption"
  deletion_window_in_days = 7
  enable_key_rotation     = true
}

resource "aws_kms_alias" "sagemaker" {
  name          = "alias/${var.environment}-sagemaker"
  target_key_id = aws_kms_key.sagemaker.key_id
}

The Domain

# sagemaker/domain.tf

resource "aws_sagemaker_domain" "this" {
  domain_name = "${var.environment}-ml-studio"
  auth_mode   = "IAM"
  vpc_id      = aws_vpc.ml.id
  subnet_ids  = aws_subnet.private[*].id

  # VPC-only mode blocks internet access
  app_network_access_type = var.network_access_type

  # Encrypt EFS and EBS volumes
  kms_key_id = aws_kms_key.sagemaker.arn

  default_user_settings {
    execution_role = aws_iam_role.sagemaker_execution.arn

    security_groups = [aws_security_group.sagemaker_studio.id]

    jupyter_lab_app_settings {
      default_resource_spec {
        instance_type       = var.default_instance_type
        sagemaker_image_arn = var.jupyter_image_arn
      }
    }

    sharing_settings {
      notebook_output_option = "Allowed"
      s3_output_path         = "s3://${var.artifacts_bucket}/sharing"
    }
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
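
The domain references `aws_security_group.sagemaker_studio`, which is not shown in this post. A minimal sketch of what it could look like follows, assuming Studio apps only need to reach each other, the interface endpoints inside the VPC, and S3 through the gateway endpoint. The rule set is an assumption, not the author's configuration, so tighten or extend it for your own traffic patterns.

```hcl
resource "aws_security_group" "sagemaker_studio" {
  name_prefix = "${var.environment}-studio-"
  vpc_id      = aws_vpc.ml.id

  # Studio apps attached to this group can talk to each other
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  egress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  # HTTPS to the interface VPC endpoints, which live inside the VPC CIDR
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }

  # HTTPS to S3 via the gateway endpoint's managed prefix list
  egress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    prefix_list_ids = [aws_vpc_endpoint.s3.prefix_list_id]
  }
}
```

Security groups on the domain only take effect in `VpcOnly` mode, so this group is harmless in a dev environment that uses `PublicInternetOnly`.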

User Profiles

# sagemaker/users.tf

resource "aws_sagemaker_user_profile" "this" {
  for_each = var.user_profiles

  domain_id         = aws_sagemaker_domain.this.id
  user_profile_name = each.key

  user_settings {
    execution_role = coalesce(
      each.value.execution_role_arn,
      aws_iam_role.sagemaker_execution.arn
    )
  }

  tags = {
    Team        = each.value.team
    Environment = var.environment
  }
}

Environment Configuration

# environments/dev.tfvars
environment           = "dev"
network_access_type   = "PublicInternetOnly" # Simpler for dev
default_instance_type = "ml.t3.medium"

user_profiles = {
  "data-scientist-1" = {
    team               = "ml-team"
    execution_role_arn = null  # Uses domain default
  }
}

# environments/prod.tfvars
environment           = "prod"
network_access_type   = "VpcOnly" # Locked down
default_instance_type = "ml.t3.medium"

user_profiles = {
  "ds-lead" = {
    team               = "ml-team"
    execution_role_arn = null
  }
  "ds-engineer-1" = {
    team               = "ml-team"
    execution_role_arn = null
  }
  "ds-engineer-2" = {
    team               = "ml-team"
    execution_role_arn = null
  }
}

Dev uses PublicInternetOnly for easier setup - no VPC endpoints needed, and pip install works out of the box. Prod uses VpcOnly for network isolation - all traffic stays within your VPC, which requires VPC endpoints for the AWS services you call.
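
The tfvars files above imply variable declarations along the following lines. This is a sketch rather than the author's `variables.tf`: the descriptions, the validation rule, and the use of `optional()` (which needs Terraform 1.3 or later) are assumptions added for illustration.

```hcl
variable "environment" {
  type        = string
  description = "Environment name used as a prefix, e.g. dev or prod"
}

variable "network_access_type" {
  type        = string
  description = "Studio network mode passed to app_network_access_type"

  # Fail at plan time on a typo instead of at apply time in the AWS API
  validation {
    condition     = contains(["PublicInternetOnly", "VpcOnly"], var.network_access_type)
    error_message = "network_access_type must be \"PublicInternetOnly\" or \"VpcOnly\"."
  }
}

variable "default_instance_type" {
  type        = string
  description = "Default instance type for JupyterLab apps"
}

variable "user_profiles" {
  description = "Map of user profile name to per-user settings"
  type = map(object({
    team               = string
    execution_role_arn = optional(string) # null falls back to the domain role
  }))
  default = {}
}
```

With `optional(string)`, the `execution_role_arn = null` lines in the tfvars files can be dropped entirely; `coalesce` in the user profile resource still falls back to the domain execution role.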

🔧 Security Hardening Checklist

| Control | Dev | Prod |
|---|---|---|
| Network access | PublicInternetOnly | VpcOnly |
| EFS encryption | KMS key | KMS key |
| IAM scope | Scoped to buckets | Least-privilege per team |
| Instance types | ml.t3.medium | Restricted via Service Quotas |
| Notebook sharing | Allowed (S3 output) | Allowed (encrypted S3) |
| Auto-shutdown | Optional | Lifecycle config enforced |

🔧 Auto-Shutdown Lifecycle Config (Cost Control)

Idle notebooks burn money. Define a lifecycle config that runs an auto-shutdown script:

resource "aws_sagemaker_studio_lifecycle_config" "auto_shutdown" {
  studio_lifecycle_config_name     = "${var.environment}-auto-shutdown"
  studio_lifecycle_config_app_type = "JupyterLab" # must match the app settings block it is attached to
  studio_lifecycle_config_content  = base64encode(file("${path.module}/scripts/auto-shutdown.sh"))
}

The script checks for idle kernels and shuts down the Studio app after a configurable timeout (typically 60-120 minutes of inactivity).
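
Creating the lifecycle config does not attach it to anything; it still has to be referenced from the domain defaults or from a user profile. The sketch below attaches it to a single user profile and assumes the config was created with app type `JupyterLab`, since it is wired into `jupyter_lab_app_settings`. The resource name `with_auto_shutdown` and the profile name are hypothetical.

```hcl
resource "aws_sagemaker_user_profile" "with_auto_shutdown" {
  domain_id         = aws_sagemaker_domain.this.id
  user_profile_name = "ds-auto-shutdown-example" # hypothetical profile name

  user_settings {
    execution_role = aws_iam_role.sagemaker_execution.arn

    jupyter_lab_app_settings {
      # Makes the config available to this user's JupyterLab apps
      lifecycle_config_arns = [aws_sagemaker_studio_lifecycle_config.auto_shutdown.arn]

      default_resource_spec {
        instance_type = var.default_instance_type
        # Runs the config by default when the app starts
        lifecycle_config_arn = aws_sagemaker_studio_lifecycle_config.auto_shutdown.arn
      }
    }
  }
}
```

The same two arguments can go in the domain's `default_user_settings` block to apply the script to every profile that does not override it.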

⚠️ Gotchas and Tips

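Domains per region are capped by a quota. Studio originally allowed only one domain per region per account; AWS has since added support for multiple domains, subject to a Service Quotas limit. Check the current quota for your account before planning one domain per team, and use user profiles to separate team members within a single domain.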

EFS is created automatically. The domain creates an EFS volume and a home directory per user profile. You don't manage this directly, but you should encrypt it with KMS.

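VPC endpoint costs add up. Each interface VPC endpoint costs roughly $7/month per Availability Zone, plus data processing charges; the S3 gateway endpoint is free. In VPC-only mode, you may need 5-10 endpoints, and with two subnets in two zones each one is billed twice. Budget roughly $75-150/month for networking in prod with this layout.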
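
To make the endpoint estimate concrete, the arithmetic can live next to the infrastructure as locals. This is only a sketch: the hourly price, the endpoint count, and all local names are assumptions, so check current AWS PrivateLink pricing for your region before relying on the result.

```hcl
locals {
  # Assumed price per interface endpoint per AZ; varies by region
  endpoint_hourly_usd_per_az = 0.01
  hours_per_month            = 730

  # Two private subnets in two AZs, as in the network module above
  az_count = 2

  # Hypothetical count: SageMaker API, Runtime, STS, Logs, ECR API, ECR DKR
  interface_endpoint_count = 6

  monthly_endpoint_usd = (
    local.endpoint_hourly_usd_per_az *
    local.hours_per_month *
    local.az_count *
    local.interface_endpoint_count
  )
}

# Fixed hourly charges only; data processing is billed on top
output "estimated_monthly_endpoint_usd" {
  value = local.monthly_endpoint_usd
}
```

With these assumed inputs the fixed charge comes to a little under $90/month, which shows how quickly the per-AZ billing adds up.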

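Domain deletion is slow and complex. Deleting a domain requires deleting all apps, user profiles, and spaces first. Terraform handles the resources it manages, but apps and spaces that users created from the Studio UI must be removed separately, and terraform destroy can take 15-20 minutes.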

Instance type quotas. GPU instances (ml.g5, ml.p4d) require Service Quota increases. Request these early - approval can take days.

⏭️ What's Next

This is Post 1 of the ML Pipelines & MLOps with Terraform series.

  • Post 1: SageMaker Studio Domain (you are here) 🔬
  • Post 2: SageMaker Endpoints - Deploy to Prod
  • Post 3: SageMaker Feature Store
  • Post 4: SageMaker Pipelines - CI/CD for ML

Your ML workspace is provisioned. VPC-isolated, KMS-encrypted, IAM-scoped, with user profiles for every team member. The foundation for everything that follows: model training, deployment, feature engineering, and ML pipelines. πŸ”¬

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 💬
