Training a model is half the battle. Deploying it to an auto-scaling endpoint with blue/green deployment and automatic rollback is the other half. Here's how to deploy SageMaker real-time endpoints with Terraform.
In the previous post, we set up the SageMaker Studio domain - the workspace where your team trains models. Now comes the production side: taking a trained model and deploying it to a scalable HTTPS endpoint that your applications can call for real-time predictions.
SageMaker endpoints involve three Terraform resources: a Model (what to serve), an Endpoint Configuration (how to serve it), and an Endpoint (the live HTTPS URL). Add autoscaling and deployment policies on top, and you have a production-grade inference system. 🎯
🏗️ The Three-Layer Architecture
```
Model (container image + model artifacts in S3)
        ↓
Endpoint Configuration (instance type, count, variants)
        ↓
Endpoint (HTTPS URL, auto-scaling, blue/green deployment)
```
| Resource | What It Defines |
|---|---|
| Model | Container image + S3 model artifacts + IAM role |
| Endpoint Config | Instance type, initial count, production variants |
| Endpoint | Live endpoint with deployment policy and autoscaling |
Separating these layers means you can update the model without touching the endpoint config, or change instance types without retraining.
🔧 Terraform: Deploy to Production
Model Configuration
```hcl
# variables.tf
variable "model_config" {
  description = "Model deployment configuration. Change to deploy new models."
  type = object({
    name           = string
    image_uri      = string # ECR container image
    model_data_url = string # S3 path to model.tar.gz
    instance_type  = string
    instance_count = number
  })
}
```
IAM Role for Model Execution
```hcl
# inference/iam.tf
resource "aws_iam_role" "sagemaker_execution" {
  name = "${var.environment}-sagemaker-inference"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "model_access" {
  name = "model-s3-ecr-access"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "${var.model_bucket_arn}/*"
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "ecr:GetDownloadUrlForLayer"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "*"
      }
    ]
  })
}
```
The Model
```hcl
# inference/model.tf
resource "aws_sagemaker_model" "this" {
  name               = "${var.environment}-${var.model_config.name}"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = var.model_config.image_uri
    model_data_url = var.model_config.model_data_url

    environment = {
      SAGEMAKER_PROGRAM = "inference.py"
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }
}
```
`image` is your serving container, either an AWS Deep Learning Container or your custom container from ECR. `model_data_url` points to a `model.tar.gz` in S3 containing your trained weights and inference code.
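To make the `SAGEMAKER_PROGRAM = "inference.py"` setting concrete, here's a minimal handler sketch. The four function names follow the SageMaker inference toolkit convention used by the framework containers; the stub model and its scoring logic are purely illustrative placeholders, not a real model format.

```python
import json

# Hypothetical stand-in for a real model; an actual model_fn would
# deserialize trained weights (e.g., joblib.load or torch.load).
class _StubModel:
    def predict(self, features):
        # Placeholder scoring: sum of the features as a fake score.
        return {"score": sum(features)}

def model_fn(model_dir):
    """Load the model from model_dir (the extracted model.tar.gz)."""
    return _StubModel()

def input_fn(request_body, content_type):
    """Parse the request payload into model input."""
    if content_type == "application/json":
        return json.loads(request_body)["features"]
    raise ValueError(f"Unsupported content type: {content_type}")

def predict_fn(model_input, model):
    """Run inference on the parsed input."""
    return model.predict(model_input)

def output_fn(prediction, accept):
    """Serialize the prediction for the HTTP response."""
    return json.dumps(prediction)
```

This file ships inside `model.tar.gz` alongside the weights, so a new model version can change both weights and inference logic in one artifact.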
Endpoint Configuration
```hcl
# inference/endpoint_config.tf
resource "aws_sagemaker_endpoint_configuration" "this" {
  name = "${var.environment}-${var.model_config.name}-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.this.name
    initial_instance_count = var.model_config.instance_count
    instance_type          = var.model_config.instance_type
    initial_variant_weight = 1.0
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }

  lifecycle {
    create_before_destroy = true
  }
}
```
`create_before_destroy = true` is important. When you update the endpoint config (new model version, different instance type), Terraform creates the new config before deleting the old one. This prevents downtime during updates.
The Endpoint
```hcl
# inference/endpoint.tf
resource "aws_sagemaker_endpoint" "this" {
  name                 = "${var.environment}-${var.model_config.name}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.this.name

  deployment_config {
    blue_green_update_policy {
      traffic_routing_configuration {
        type = "CANARY"
        canary_size {
          type  = "INSTANCE_COUNT"
          value = 1
        }
        wait_interval_in_seconds = 300
      }
      maximum_execution_timeout_in_seconds = 1800
      termination_wait_in_seconds          = 120
    }

    auto_rollback_configuration {
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.endpoint_errors.alarm_name
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }
}
```
Canary deployment: routes traffic to one instance on the new fleet first, waits five minutes (`wait_interval_in_seconds`), then shifts the remaining traffic. If the CloudWatch alarm fires during the canary phase, SageMaker automatically rolls back.
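If a single canary step is too coarse, the same `traffic_routing_configuration` block can shift traffic gradually instead. A hedged sketch, swapping `CANARY` for `LINEAR` routing; the 25% step size is an illustrative value, not a recommendation:

```hcl
# Alternative: shift traffic in fixed steps rather than one canary jump.
# Moves 25% of capacity to the new fleet every 5 minutes until it serves 100%.
traffic_routing_configuration {
  type = "LINEAR"
  linear_step_size {
    type  = "CAPACITY_PERCENT"
    value = 25
  }
  wait_interval_in_seconds = 300
}
```

Linear routing gives the rollback alarm more chances to fire before the new fleet takes full traffic, at the cost of a longer deployment window.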
Autoscaling
```hcl
# inference/autoscaling.tf
resource "aws_appautoscaling_target" "endpoint" {
  max_capacity       = var.autoscaling_max
  min_capacity       = var.model_config.instance_count
  resource_id        = "endpoint/${aws_sagemaker_endpoint.this.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "endpoint" {
  name               = "${var.environment}-endpoint-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.endpoint.resource_id
  scalable_dimension = aws_appautoscaling_target.endpoint.scalable_dimension
  service_namespace  = aws_appautoscaling_target.endpoint.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = var.target_invocations_per_instance

    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```
`SageMakerVariantInvocationsPerInstance` tracks how many invocations each instance receives per minute. Set the target below the maximum your instance can sustain (determined by load testing) so there's headroom while new instances come online. The policy adds instances when per-instance traffic exceeds the target and removes them when traffic drops.
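Since load tests usually report requests per second while the metric counts invocations per minute, the conversion is easy to get wrong. A small sketch, where `invocations_target` is a hypothetical helper and the 50% safety factor is an illustrative default:

```python
# Hypothetical helper: turn a load-tested per-instance throughput into a
# target_value for SageMakerVariantInvocationsPerInstance. The metric is
# invocations per MINUTE per instance, so requests/second are multiplied
# by 60; the safety factor leaves headroom while new instances boot.

def invocations_target(max_rps_per_instance: float, safety_factor: float = 0.5) -> int:
    """Return a target_value for the target-tracking scaling policy."""
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return int(max_rps_per_instance * 60 * safety_factor)

# An instance that sustains 10 req/s at acceptable latency, with 50% headroom:
print(invocations_target(10))  # 300 invocations/minute per instance
```

The result feeds straight into `target_invocations_per_instance` in the tfvars files below.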
Monitoring
```hcl
# inference/monitoring.tf
resource "aws_cloudwatch_metric_alarm" "endpoint_errors" {
  alarm_name          = "${var.environment}-endpoint-5xx-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "Invocation5XXErrors"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "Endpoint returning 5xx errors"
  alarm_actions       = [var.sns_alert_topic_arn]

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.this.name
    VariantName  = "primary"
  }
}

resource "aws_cloudwatch_metric_alarm" "endpoint_latency" {
  alarm_name          = "${var.environment}-endpoint-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelLatency"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Average"
  threshold           = var.latency_threshold_ms * 1000 # ModelLatency is reported in microseconds
  alarm_description   = "Model inference latency too high"
  alarm_actions       = [var.sns_alert_topic_arn]

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.this.name
    VariantName  = "primary"
  }
}
```
The error alarm also feeds into the blue/green deployment's `auto_rollback_configuration`. If 5xx errors spike during a deployment, SageMaker automatically rolls back to the previous fleet.
🌍 Environment Configuration
```hcl
# environments/dev.tfvars
model_config = {
  name           = "fraud-detector-v2"
  image_uri      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-serving:latest"
  model_data_url = "s3://ml-models-dev/fraud-detector/v2/model.tar.gz"
  instance_type  = "ml.t2.medium"
  instance_count = 1
}

autoscaling_max                 = 2
target_invocations_per_instance = 100
```

```hcl
# environments/prod.tfvars
model_config = {
  name           = "fraud-detector-v2"
  image_uri      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-serving:v2.1.0"
  model_data_url = "s3://ml-models-prod/fraud-detector/v2/model.tar.gz"
  instance_type  = "ml.c5.xlarge"
  instance_count = 2
}

autoscaling_max                 = 10
target_invocations_per_instance = 500
```
Deploying a new model version: update `model_data_url` to the new S3 path (and `image_uri` if the serving container changed), then run `terraform apply`. The canary deployment creates a new fleet, validates it, and shifts traffic automatically.
🧪 Invoke the Endpoint
```python
import boto3
import json

client = boto3.client("sagemaker-runtime")

response = client.invoke_endpoint(
    EndpointName="prod-fraud-detector-v2",
    ContentType="application/json",
    Body=json.dumps({"features": [0.5, 1.2, 3.4, 0.8]}),
)

prediction = json.loads(response["Body"].read())
print(prediction)
```
⚠️ Gotchas and Tips
Load test before setting autoscaling targets. The `target_invocations_per_instance` should be based on actual load testing, not guesswork. Use SageMaker Inference Recommender to find the optimal instance type and max throughput.
Endpoint creation takes 5-10 minutes. SageMaker provisions instances, downloads the container, loads the model, and runs health checks. Plan for this in your deployment pipeline.
Use `create_before_destroy` on endpoint configs. Without it, Terraform deletes the old config before creating the new one, causing downtime.
Pin container image tags. Use versioned tags (`v2.1.0`) instead of `latest` in production. This ensures reproducible deployments and meaningful rollbacks.
Serverless inference for low traffic. If your endpoint gets fewer than ~100 requests/hour, consider `serverless_config` on the endpoint configuration instead of instance-based. It scales to zero and charges per request.
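For reference, here's a hedged sketch of what that serverless variant looks like; the concurrency and memory values are illustrative and should be tuned to your model:

```hcl
# Sketch: serverless variant (illustrative values; tune per model).
# Replaces instance_type/initial_instance_count in production_variants.
production_variants {
  variant_name = "primary"
  model_name   = aws_sagemaker_model.this.name

  serverless_config {
    max_concurrency   = 5
    memory_size_in_mb = 2048
  }
}
```

Note that the instance-based autoscaling policy shown earlier doesn't apply here; serverless variants scale (including to zero) on their own.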
⏭️ What's Next
This is Post 2 of the ML Pipelines & MLOps with Terraform series.
- Post 1: SageMaker Studio Domain 🎬
- Post 2: SageMaker Endpoints - Deploy to Prod (you are here) 📍
- Post 3: SageMaker Feature Store
- Post 4: SageMaker Pipelines - CI/CD for ML
Your model is in production. Real-time HTTPS endpoint, autoscaling based on traffic, canary deployments with automatic rollback, and monitoring that pages you before customers notice. From notebook to production, all in Terraform. 🚀
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 🎬