Suhas Mallesh
SageMaker Endpoints: Deploy Your Model to Production with Terraform πŸš€

Training a model is half the battle. Serving it from an auto-scaling HTTPS endpoint with blue/green deployments and automatic rollback is the other half. Here's how to deploy SageMaker real-time endpoints with Terraform.

In the previous post, we set up the SageMaker Studio domain - the workspace where your team trains models. Now comes the production side: taking a trained model and deploying it to a scalable HTTPS endpoint that your applications can call for real-time predictions.

SageMaker endpoints involve three Terraform resources: a Model (what to serve), an Endpoint Configuration (how to serve it), and an Endpoint (the live HTTPS URL). Add autoscaling and deployment policies on top, and you have a production-grade inference system. 🎯

πŸ—οΈ The Three-Layer Architecture

Model (container image + model artifacts in S3)
    ↓
Endpoint Configuration (instance type, count, variants)
    ↓
Endpoint (HTTPS URL, auto-scaling, blue/green deployment)
| Resource | What It Defines |
|---|---|
| Model | Container image + S3 model artifacts + IAM role |
| Endpoint Config | Instance type, initial count, production variants |
| Endpoint | Live endpoint with deployment policy and autoscaling |

Separating these layers means you can update the model without touching the endpoint config, or change instance types without retraining.

πŸ”§ Terraform: Deploy to Production

Model Configuration

# variables.tf

variable "model_config" {
  description = "Model deployment configuration. Change to deploy new models."
  type = object({
    name           = string
    image_uri      = string  # ECR container image
    model_data_url = string  # S3 path to model.tar.gz
    instance_type  = string
    instance_count = number
  })
}

IAM Role for Model Execution

# inference/iam.tf

resource "aws_iam_role" "sagemaker_execution" {
  name = "${var.environment}-sagemaker-inference"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "model_access" {
  name = "model-s3-ecr-access"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "${var.model_bucket_arn}/*"
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "ecr:GetDownloadUrlForLayer"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "*"
      }
    ]
  })
}

The Model

# inference/model.tf

resource "aws_sagemaker_model" "this" {
  name               = "${var.environment}-${var.model_config.name}"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = var.model_config.image_uri
    model_data_url = var.model_config.model_data_url
    environment = {
      SAGEMAKER_PROGRAM = "inference.py"
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }
}

image is your serving container - either an AWS Deep Learning Container or your custom container from ECR. model_data_url points to model.tar.gz in S3 containing your trained weights and inference code.
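The exact archive layout depends on your serving container, but for many AWS Deep Learning Containers the convention is model artifacts at the archive root and inference code under code/, with SAGEMAKER_PROGRAM naming the entry script. A minimal packaging sketch (file names are placeholders):

```python
import tarfile
import tempfile
from pathlib import Path

def package_model(model_file: str, inference_script: str, out_path: str) -> None:
    """Bundle trained weights and inference code into the model.tar.gz
    layout many SageMaker DLCs expect:
        model.tar.gz
        |-- model.pth           (artifact at the archive root)
        `-- code/inference.py   (entry point named by SAGEMAKER_PROGRAM)
    """
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(model_file, arcname=Path(model_file).name)
        tar.add(inference_script, arcname=f"code/{Path(inference_script).name}")

# Demo with placeholder files
tmp = Path(tempfile.mkdtemp())
(tmp / "model.pth").write_bytes(b"fake-weights")
(tmp / "inference.py").write_text("def model_fn(model_dir): ...\n")
package_model(str(tmp / "model.pth"), str(tmp / "inference.py"),
              str(tmp / "model.tar.gz"))

with tarfile.open(tmp / "model.tar.gz") as tar:
    print(sorted(tar.getnames()))  # ['code/inference.py', 'model.pth']
```

Upload the resulting archive to S3 and point model_data_url at it.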

Endpoint Configuration

# inference/endpoint_config.tf

resource "aws_sagemaker_endpoint_configuration" "this" {
  # name_prefix (not a fixed name) lets create_before_destroy work:
  # the replacement config gets a fresh generated name instead of
  # colliding with the one it replaces
  name_prefix = "${var.environment}-${var.model_config.name}-"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.this.name
    initial_instance_count = var.model_config.instance_count
    instance_type          = var.model_config.instance_type
    initial_variant_weight = 1.0
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }

  lifecycle {
    create_before_destroy = true
  }
}

create_before_destroy = true is important. When you update the endpoint config (new model version, different instance type), Terraform creates the replacement config before deleting the old one, so the endpoint is never left pointing at a deleted config. Note that this only works when the new config's name doesn't collide with the old one, which is why name_prefix is safer here than a fixed name.

The Endpoint

# inference/endpoint.tf

resource "aws_sagemaker_endpoint" "this" {
  name                 = "${var.environment}-${var.model_config.name}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.this.name

  deployment_config {
    blue_green_update_policy {
      traffic_routing_configuration {
        type                     = "CANARY"
        canary_size {
          type  = "INSTANCE_COUNT"
          value = 1
        }
        wait_interval_in_seconds = 300
      }

      maximum_execution_timeout_in_seconds = 1800
      termination_wait_in_seconds          = 120
    }

    auto_rollback_configuration {
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.endpoint_errors.alarm_name
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }
}

Canary deployment: Routes traffic to 1 instance on the new fleet first, waits 5 minutes, then shifts remaining traffic. If the CloudWatch alarm fires during the canary phase, SageMaker automatically rolls back.
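To reason about how long a rollout blocks your pipeline, a rough lower bound can be sketched from the two policy timers (the helper is hypothetical; real rollouts also spend several minutes provisioning instances and passing health checks):

```python
def rollout_floor_seconds(wait_interval: int, termination_wait: int,
                          traffic_steps: int = 2) -> int:
    """Rough lower bound on a blue/green CANARY rollout's duration:
    each traffic step (the canary, then the remainder) bakes for
    roughly wait_interval seconds, then the old fleet lingers for
    termination_wait before teardown.
    """
    return traffic_steps * wait_interval + termination_wait

print(rollout_floor_seconds(300, 120))  # 720 -> at least ~12 minutes
```

maximum_execution_timeout_in_seconds (1800 above) caps the whole operation, so keep it comfortably above this floor.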

Autoscaling

# inference/autoscaling.tf

resource "aws_appautoscaling_target" "endpoint" {
  max_capacity       = var.autoscaling_max
  min_capacity       = var.model_config.instance_count
  resource_id        = "endpoint/${aws_sagemaker_endpoint.this.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "endpoint" {
  name               = "${var.environment}-endpoint-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.endpoint.resource_id
  scalable_dimension = aws_appautoscaling_target.endpoint.scalable_dimension
  service_namespace  = aws_appautoscaling_target.endpoint.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = var.target_invocations_per_instance

    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

SageMakerVariantInvocationsPerInstance scales based on requests per instance. Set the target to the maximum your instance can handle (determined by load testing). The policy adds instances when traffic exceeds the target and removes them when traffic drops.
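One unit detail worth internalizing: this metric counts invocations per instance per minute, so a requests-per-second number from a load test must be converted. A back-of-the-envelope sketch for turning load-test results into the two Terraform variables (helper name, headroom factor, and numbers are illustrative):

```python
import math

def scaling_plan(max_rps_per_instance: float, peak_rps: float,
                 headroom: float = 0.7) -> dict:
    """Derive autoscaling inputs from load-test results.

    SageMakerVariantInvocationsPerInstance is measured per MINUTE,
    so convert the per-second ceiling, then keep headroom (target
    ~70% of the measured maximum) so scale-out kicks in before
    instances saturate.
    """
    target_per_minute = round(max_rps_per_instance * 60 * headroom)
    instances_at_peak = math.ceil(peak_rps * 60 / target_per_minute)
    return {"target_invocations_per_instance": target_per_minute,
            "autoscaling_max": instances_at_peak}

print(scaling_plan(max_rps_per_instance=12, peak_rps=60))
# {'target_invocations_per_instance': 504, 'autoscaling_max': 8}
```

Feed the results into target_invocations_per_instance and autoscaling_max, and re-run the exercise whenever the model or instance type changes.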

Monitoring

# inference/monitoring.tf

resource "aws_cloudwatch_metric_alarm" "endpoint_errors" {
  alarm_name          = "${var.environment}-endpoint-5xx-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "Invocation5XXErrors"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "Endpoint returning 5xx errors"
  alarm_actions       = [var.sns_alert_topic_arn]

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.this.name
    VariantName  = "primary"
  }
}

resource "aws_cloudwatch_metric_alarm" "endpoint_latency" {
  alarm_name          = "${var.environment}-endpoint-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelLatency"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Average"
  threshold           = var.latency_threshold_ms * 1000  # ModelLatency is reported in microseconds
  alarm_description   = "Model inference latency too high"
  alarm_actions       = [var.sns_alert_topic_arn]

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.this.name
    VariantName  = "primary"
  }
}

The error alarm also feeds into the blue/green deployment's auto_rollback_configuration. If 5xx errors spike during a deployment, SageMaker automatically rolls back to the previous fleet.
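The unit conversion in the latency alarm is easy to get wrong, so it's worth spelling out (the helper name is illustrative):

```python
def latency_threshold_us(threshold_ms: int) -> int:
    # CloudWatch's SageMaker ModelLatency metric is reported in
    # MICROseconds, so a millisecond budget must be scaled by 1000
    # before it is used as an alarm threshold.
    return threshold_ms * 1000

print(latency_threshold_us(250))  # 250000 -> alarms at 250 ms average latency
```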

πŸ“ Environment Configuration

# environments/dev.tfvars
model_config = {
  name           = "fraud-detector-v2"
  image_uri      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-serving:latest"
  model_data_url = "s3://ml-models-dev/fraud-detector/v2/model.tar.gz"
  instance_type  = "ml.t2.medium"
  instance_count = 1
}
autoscaling_max = 2
target_invocations_per_instance = 100

# environments/prod.tfvars
model_config = {
  name           = "fraud-detector-v2"
  image_uri      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-serving:v2.1.0"
  model_data_url = "s3://ml-models-prod/fraud-detector/v2/model.tar.gz"
  instance_type  = "ml.c5.xlarge"
  instance_count = 2
}
autoscaling_max = 10
target_invocations_per_instance = 500

Deploying a new model version: Update model_data_url to the new S3 path, run terraform apply. The canary deployment creates a new fleet, validates it, and shifts traffic automatically.

πŸ§ͺ Invoke the Endpoint

import boto3
import json

client = boto3.client("sagemaker-runtime")

response = client.invoke_endpoint(
    EndpointName="prod-fraud-detector-v2",
    ContentType="application/json",
    Body=json.dumps({"features": [0.5, 1.2, 3.4, 0.8]})
)

prediction = json.loads(response["Body"].read())
print(prediction)

⚠️ Gotchas and Tips

Load test before setting autoscaling targets. The target_invocations_per_instance should be based on actual load testing, not guesswork. Use SageMaker Inference Recommender to find the optimal instance type and max throughput.

Endpoint creation takes 5-10 minutes. SageMaker provisions instances, downloads the container, loads the model, and runs health checks. Plan for this in your deployment pipeline.

Use create_before_destroy on endpoint configs. Without it, Terraform deletes the old config before creating the new one, causing downtime.

Pin container image tags. Use versioned tags (v2.1.0) instead of latest in production. This ensures reproducible deployments and meaningful rollbacks.

Serverless inference for low traffic. If your endpoint gets fewer than ~100 requests/hour, consider serverless_config on the endpoint configuration instead of instance-based. It scales to zero and charges per request.
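A sketch of what that could look like, reusing the model from earlier (a hypothetical alternative config, not part of the setup above; values are illustrative, and serverless memory must be one of 1024-6144 MB in 1 GB steps):

```hcl
# With serverless_config you omit instance_type / initial_instance_count;
# SageMaker scales to zero between requests and bills per invocation.
resource "aws_sagemaker_endpoint_configuration" "serverless" {
  name_prefix = "${var.environment}-${var.model_config.name}-sls-"

  production_variants {
    variant_name = "primary"
    model_name   = aws_sagemaker_model.this.name

    serverless_config {
      max_concurrency   = 5     # concurrent invocations before throttling
      memory_size_in_mb = 2048  # 1024-6144, in 1 GB increments
    }
  }
}
```

Note that instance-based autoscaling policies don't apply to serverless variants, so you'd drop the Application Auto Scaling resources for this endpoint.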

⏭️ What's Next

This is Post 2 of the ML Pipelines & MLOps with Terraform series.

  • Post 1: SageMaker Studio Domain πŸ”¬
  • Post 2: SageMaker Endpoints - Deploy to Prod (you are here) πŸš€
  • Post 3: SageMaker Feature Store
  • Post 4: SageMaker Pipelines - CI/CD for ML

Your model is in production. Real-time HTTPS endpoint, autoscaling based on traffic, canary deployments with automatic rollback, and monitoring that pages you before customers notice. From notebook to production, all in Terraform. πŸš€

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! πŸ’¬
