Training a model is half the battle. Deploying it to an auto-scaling endpoint with blue/green deployment and automatic rollback is the other half. Here's how to deploy SageMaker real-time endpoints with Terraform.
In the previous post, we set up the SageMaker Studio domain - the workspace where your team trains models. Now comes the production side: taking a trained model and deploying it to a scalable HTTPS endpoint that your applications can call for real-time predictions.
SageMaker endpoints involve three Terraform resources: a Model (what to serve), an Endpoint Configuration (how to serve it), and an Endpoint (the live HTTPS URL). Add autoscaling and deployment policies on top, and you have a production-grade inference system. 🎯
🏗️ The Three-Layer Architecture
```
Model (container image + model artifacts in S3)
        ↓
Endpoint Configuration (instance type, count, variants)
        ↓
Endpoint (HTTPS URL, auto-scaling, blue/green deployment)
```
| Resource | What It Defines |
|---|---|
| Model | Container image + S3 model artifacts + IAM role |
| Endpoint Config | Instance type, initial count, production variants |
| Endpoint | Live endpoint with deployment policy and autoscaling |
Separating these layers means you can update the model without touching the endpoint config, or change instance types without retraining.
🔧 Terraform: Deploy to Production
Model Configuration
```hcl
# variables.tf
variable "model_config" {
  description = "Model deployment configuration. Change to deploy new models."
  type = object({
    name           = string
    image_uri      = string # ECR container image
    model_data_url = string # S3 path to model.tar.gz
    instance_type  = string
    instance_count = number
  })
}
```
IAM Role for Model Execution
```hcl
# inference/iam.tf
resource "aws_iam_role" "sagemaker_execution" {
  name = "${var.environment}-sagemaker-inference"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "model_access" {
  name = "model-s3-ecr-access"
  role = aws_iam_role.sagemaker_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = "${var.model_bucket_arn}/*"
      },
      {
        Effect = "Allow"
        Action = [
          "ecr:GetAuthorizationToken",
          "ecr:BatchGetImage",
          "ecr:GetDownloadUrlForLayer"
        ]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "*"
      }
    ]
  })
}
```
The Model
```hcl
# inference/model.tf
resource "aws_sagemaker_model" "this" {
  name               = "${var.environment}-${var.model_config.name}"
  execution_role_arn = aws_iam_role.sagemaker_execution.arn

  primary_container {
    image          = var.model_config.image_uri
    model_data_url = var.model_config.model_data_url

    environment = {
      SAGEMAKER_PROGRAM = "inference.py"
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }
}
```
`image` is your serving container, either an AWS Deep Learning Container or your custom container from ECR. `model_data_url` points to a `model.tar.gz` in S3 containing your trained weights and inference code.
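To make the `SAGEMAKER_PROGRAM = "inference.py"` setting concrete, here's a minimal handler sketch. The four function names follow the SageMaker inference toolkit convention used by the framework containers; the stub model and its scoring logic are purely illustrative placeholders, not a real model format.

```python
import json

# Hypothetical stand-in for a real model; an actual model_fn would
# deserialize trained weights (e.g., joblib.load or torch.load).
class _StubModel:
    def predict(self, features):
        # Placeholder scoring: sum of the features as a fake score.
        return {"score": sum(features)}

def model_fn(model_dir):
    """Load the model from model_dir (the extracted model.tar.gz)."""
    return _StubModel()

def input_fn(request_body, content_type):
    """Parse the request payload into model input."""
    if content_type == "application/json":
        return json.loads(request_body)["features"]
    raise ValueError(f"Unsupported content type: {content_type}")

def predict_fn(model_input, model):
    """Run inference on the parsed input."""
    return model.predict(model_input)

def output_fn(prediction, accept):
    """Serialize the prediction for the HTTP response."""
    return json.dumps(prediction)
```

This file ships inside `model.tar.gz` alongside the weights, so a new model version can change both weights and inference logic in one artifact.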
Endpoint Configuration
```hcl
# inference/endpoint_config.tf
resource "aws_sagemaker_endpoint_configuration" "this" {
  name = "${var.environment}-${var.model_config.name}-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.this.name
    initial_instance_count = var.model_config.instance_count
    instance_type          = var.model_config.instance_type
    initial_variant_weight = 1.0
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }

  lifecycle {
    create_before_destroy = true
  }
}
```
`create_before_destroy = true` is important. When you update the endpoint config (new model version, different instance type), Terraform creates the new config before deleting the old one. This prevents downtime during updates.
The Endpoint
```hcl
# inference/endpoint.tf
resource "aws_sagemaker_endpoint" "this" {
  name                 = "${var.environment}-${var.model_config.name}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.this.name

  deployment_config {
    blue_green_update_policy {
      traffic_routing_configuration {
        type = "CANARY"
        canary_size {
          type  = "INSTANCE_COUNT"
          value = 1
        }
        wait_interval_in_seconds = 300
      }
      maximum_execution_timeout_in_seconds = 1800
      termination_wait_in_seconds          = 120
    }

    auto_rollback_configuration {
      alarms {
        alarm_name = aws_cloudwatch_metric_alarm.endpoint_errors.alarm_name
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_config.name
  }
}
```
Canary deployment: routes traffic to one instance on the new fleet first, waits five minutes (`wait_interval_in_seconds`), then shifts the remaining traffic. If the CloudWatch alarm fires during the canary phase, SageMaker automatically rolls back.
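If a single canary step is too coarse, the same `traffic_routing_configuration` block can shift traffic gradually instead. A hedged sketch, swapping `CANARY` for `LINEAR` routing; the 25% step size is an illustrative value, not a recommendation:

```hcl
# Alternative: shift traffic in fixed steps rather than one canary jump.
# Moves 25% of capacity to the new fleet every 5 minutes until it serves 100%.
traffic_routing_configuration {
  type = "LINEAR"
  linear_step_size {
    type  = "CAPACITY_PERCENT"
    value = 25
  }
  wait_interval_in_seconds = 300
}
```

Linear routing gives the rollback alarm more chances to fire before the new fleet takes full traffic, at the cost of a longer deployment window.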
Autoscaling
```hcl
# inference/autoscaling.tf
resource "aws_appautoscaling_target" "endpoint" {
  max_capacity       = var.autoscaling_max
  min_capacity       = var.model_config.instance_count
  resource_id        = "endpoint/${aws_sagemaker_endpoint.this.name}/variant/primary"
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "endpoint" {
  name               = "${var.environment}-endpoint-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.endpoint.resource_id
  scalable_dimension = aws_appautoscaling_target.endpoint.scalable_dimension
  service_namespace  = aws_appautoscaling_target.endpoint.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = var.target_invocations_per_instance

    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }

    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```
`SageMakerVariantInvocationsPerInstance` tracks how many invocations each instance receives per minute. Set the target below the maximum your instance can sustain (determined by load testing) so there's headroom while new instances come online. The policy adds instances when per-instance traffic exceeds the target and removes them when traffic drops.
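Since load tests usually report requests per second while the metric counts invocations per minute, the conversion is easy to get wrong. A small sketch, where `invocations_target` is a hypothetical helper and the 50% safety factor is an illustrative default:

```python
# Hypothetical helper: turn a load-tested per-instance throughput into a
# target_value for SageMakerVariantInvocationsPerInstance. The metric is
# invocations per MINUTE per instance, so requests/second are multiplied
# by 60; the safety factor leaves headroom while new instances boot.

def invocations_target(max_rps_per_instance: float, safety_factor: float = 0.5) -> int:
    """Return a target_value for the target-tracking scaling policy."""
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return int(max_rps_per_instance * 60 * safety_factor)

# An instance that sustains 10 req/s at acceptable latency, with 50% headroom:
print(invocations_target(10))  # 300 invocations/minute per instance
```

The result feeds straight into `target_invocations_per_instance` in the tfvars files below.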
Monitoring
```hcl
# inference/monitoring.tf
resource "aws_cloudwatch_metric_alarm" "endpoint_errors" {
  alarm_name          = "${var.environment}-endpoint-5xx-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "Invocation5XXErrors"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "Endpoint returning 5xx errors"
  alarm_actions       = [var.sns_alert_topic_arn]

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.this.name
    VariantName  = "primary"
  }
}

resource "aws_cloudwatch_metric_alarm" "endpoint_latency" {
  alarm_name          = "${var.environment}-endpoint-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ModelLatency"
  namespace           = "AWS/SageMaker"
  period              = 60
  statistic           = "Average"
  threshold           = var.latency_threshold_ms * 1000 # ModelLatency is reported in microseconds
  alarm_description   = "Model inference latency too high"
  alarm_actions       = [var.sns_alert_topic_arn]

  dimensions = {
    EndpointName = aws_sagemaker_endpoint.this.name
    VariantName  = "primary"
  }
}
```
The error alarm also feeds into the blue/green deployment's `auto_rollback_configuration`. If 5xx errors spike during a deployment, SageMaker automatically rolls back to the previous fleet.
🌍 Environment Configuration
```hcl
# environments/dev.tfvars
model_config = {
  name           = "fraud-detector-v2"
  image_uri      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-serving:latest"
  model_data_url = "s3://ml-models-dev/fraud-detector/v2/model.tar.gz"
  instance_type  = "ml.t2.medium"
  instance_count = 1
}

autoscaling_max                 = 2
target_invocations_per_instance = 100
```

```hcl
# environments/prod.tfvars
model_config = {
  name           = "fraud-detector-v2"
  image_uri      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-serving:v2.1.0"
  model_data_url = "s3://ml-models-prod/fraud-detector/v2/model.tar.gz"
  instance_type  = "ml.c5.xlarge"
  instance_count = 2
}

autoscaling_max                 = 10
target_invocations_per_instance = 500
```
Deploying a new model version: update `model_data_url` to the new S3 path (and `image_uri` if the serving container changed), then run `terraform apply`. The canary deployment creates a new fleet, validates it, and shifts traffic automatically.
🧪 Invoke the Endpoint
```python
import boto3
import json

client = boto3.client("sagemaker-runtime")

response = client.invoke_endpoint(
    EndpointName="prod-fraud-detector-v2",
    ContentType="application/json",
    Body=json.dumps({"features": [0.5, 1.2, 3.4, 0.8]}),
)

prediction = json.loads(response["Body"].read())
print(prediction)
```
⚠️ Gotchas and Tips
Load test before setting autoscaling targets. The `target_invocations_per_instance` should be based on actual load testing, not guesswork. Use SageMaker Inference Recommender to find the optimal instance type and max throughput.
Endpoint creation takes 5-10 minutes. SageMaker provisions instances, downloads the container, loads the model, and runs health checks. Plan for this in your deployment pipeline.
Use `create_before_destroy` on endpoint configs. Without it, Terraform deletes the old config before creating the new one, causing downtime.
Pin container image tags. Use versioned tags (`v2.1.0`) instead of `latest` in production. This ensures reproducible deployments and meaningful rollbacks.
Serverless inference for low traffic. If your endpoint gets fewer than ~100 requests/hour, consider `serverless_config` on the endpoint configuration instead of instance-based. It scales to zero and charges per request.
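For reference, here's a hedged sketch of what that serverless variant looks like; the concurrency and memory values are illustrative and should be tuned to your model:

```hcl
# Sketch: serverless variant (illustrative values; tune per model).
# Replaces instance_type/initial_instance_count in production_variants.
production_variants {
  variant_name = "primary"
  model_name   = aws_sagemaker_model.this.name

  serverless_config {
    max_concurrency   = 5
    memory_size_in_mb = 2048
  }
}
```

Note that the instance-based autoscaling policy shown earlier doesn't apply here; serverless variants scale (including to zero) on their own.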
⏭️ What's Next
This is Post 2 of the ML Pipelines & MLOps with Terraform series.
- Post 1: SageMaker Studio Domain 🎬
- Post 2: SageMaker Endpoints - Deploy to Prod (you are here) 📍
- Post 3: SageMaker Feature Store
- Post 4: SageMaker Pipelines - CI/CD for ML
Your model is in production. Real-time HTTPS endpoint, autoscaling based on traffic, canary deployments with automatic rollback, and monitoring that pages you before customers notice. From notebook to production, all in Terraform. 🚀
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 🎬