Your model is trained. Now deploy it to a managed online endpoint with traffic splitting for canary rollouts, autoscaling, health probes, and data collection. Here's how to deploy Azure ML endpoints with Terraform using azapi.
In the previous post, we set up the Azure ML workspace - the hub for experiments, models, and compute. Now comes the production side: taking a trained model and deploying it to a managed online endpoint that your applications can call for real-time predictions.
Azure ML online endpoints involve two resources: an Endpoint (the stable HTTPS URL with auth and traffic routing) and one or more Deployments (the model + compute behind it). Since the azurerm provider doesn't have native online endpoint resources yet, we use azapi to provision them directly via the Azure API.
## The Two-Layer Architecture

```
Endpoint (stable HTTPS URL, auth mode, traffic split)
        ↓
Deployment(s) (model, instance type, scaling, probes)
```
| Resource | What It Defines |
|---|---|
| Endpoint | Stable URL, auth mode (key/AAD), traffic routing |
| Deployment | Model reference, instance type, scale settings, health probes |
The endpoint URL stays fixed. You deploy new model versions as new deployments, shift traffic gradually, and delete old deployments after validation.
## Terraform: Create the Online Endpoint

### The Endpoint
```hcl
# endpoint/main.tf
resource "azapi_resource" "online_endpoint" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2025-06-01"
  name      = "${var.environment}-${var.model_name}"
  parent_id = azurerm_machine_learning_workspace.this.id
  location  = var.location

  identity {
    type = "SystemAssigned"
  }

  body = {
    properties = {
      authMode            = var.auth_mode
      publicNetworkAccess = var.public_network_access
      description         = "Production endpoint for ${var.model_name}"
      traffic = {
        (var.deployment_name) = 100
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_name
  }
}
```
`authMode` controls how clients authenticate: `Key` for API key auth (simpler), `AADToken` for Azure AD auth (more secure, no key rotation needed). Use `AADToken` in production.
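The matching variable definition can guard against typos at plan time. A minimal sketch, assuming the variable name used in the config above (the `validation` block is an optional addition; `AMLToken` is the third mode the API accepts):

```hcl
# endpoint/variables.tf (sketch)
variable "auth_mode" {
  type        = string
  default     = "AADToken"
  description = "Endpoint auth mode: Key, AADToken, or AMLToken."

  validation {
    condition     = contains(["Key", "AADToken", "AMLToken"], var.auth_mode)
    error_message = "auth_mode must be one of: Key, AADToken, AMLToken."
  }
}
```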
### The Deployment
```hcl
# endpoint/deployment.tf
resource "azapi_resource" "deployment" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2025-06-01"
  name      = var.deployment_name
  parent_id = azapi_resource.online_endpoint.id
  location  = var.location

  identity {
    type = "SystemAssigned"
  }

  body = {
    kind = "Managed"
    properties = {
      model              = var.model_uri
      environmentId      = var.environment_id
      instanceType       = var.instance_type
      appInsightsEnabled = true

      scaleSettings = {
        scaleType    = "Default"
        minInstances = var.min_instances
        maxInstances = var.max_instances
      }

      livenessProbe = {
        initialDelay     = "PT10S"
        period           = "PT10S"
        timeout          = "PT2S"
        failureThreshold = 30
        successThreshold = 1
      }

      readinessProbe = {
        initialDelay     = "PT10S"
        period           = "PT10S"
        timeout          = "PT2S"
        failureThreshold = 30
        successThreshold = 1
      }

      requestSettings = {
        requestTimeout                   = "PT5S"
        maxConcurrentRequestsPerInstance = var.max_concurrent_requests
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_name
    Version     = var.model_version
  }
}
```
Key deployment properties:

- `model` references the registered model in the workspace (e.g., `azureml:fraud-detector:2`)
- `environmentId` references a curated or custom environment with your dependencies
- `scaleSettings` controls autoscaling from `minInstances` to `maxInstances`
- `livenessProbe` and `readinessProbe` configure health checks
- `requestSettings` controls timeout and concurrency per instance
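For context on what those two reference variables typically hold, here is an illustrative sketch. The names and versions are placeholders, and the exact registry path for a curated environment varies; check your workspace for the real values:

```hcl
# Illustrative reference formats only (placeholders, not real assets):
locals {
  # A model registered in this workspace, name:version
  model_uri = "azureml:fraud-detector:2"

  # A curated environment from the shared azureml registry
  environment_id = "azureml://registries/azureml/environments/sklearn-1.5/versions/1"
}
```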
## Traffic Splitting for Canary Deployments
Deploy a new model version alongside the existing one, then gradually shift traffic:
```hcl
# Endpoint with two deployments and traffic split
resource "azapi_resource" "online_endpoint" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2025-06-01"
  name      = "${var.environment}-${var.model_name}"
  parent_id = azurerm_machine_learning_workspace.this.id
  location  = var.location

  identity {
    type = "SystemAssigned"
  }

  body = {
    properties = {
      authMode = var.auth_mode
      traffic = {
        "blue"  = 90 # Current stable version
        "green" = 10 # New canary version
      }
    }
  }
}

# Blue deployment (current version)
resource "azapi_resource" "blue" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2025-06-01"
  name      = "blue"
  parent_id = azapi_resource.online_endpoint.id
  location  = var.location
  # ... deployment config for v1
}

# Green deployment (new version)
resource "azapi_resource" "green" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2025-06-01"
  name      = "green"
  parent_id = azapi_resource.online_endpoint.id
  location  = var.location
  # ... deployment config for v2
}
```
**Canary rollout workflow:** Deploy green with 10% traffic. Monitor error rates and latency. If healthy, update traffic to `"green" = 100`, then delete blue. If unhealthy, set `"blue" = 100` to roll back instantly.
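One way to drive those shifts without editing code is to lift the split into a variable. A sketch, assuming a `traffic` variable that replaces the hardcoded map in the endpoint body:

```hcl
variable "traffic" {
  type        = map(number)
  description = "Traffic split across deployments; values must sum to 100."
  default = {
    blue  = 90
    green = 10
  }
}

# In the endpoint body, replace the hardcoded map with:
#   traffic = var.traffic
#
# Then shift traffic from the command line without a code change:
#   terraform apply -var 'traffic={blue=0,green=100}'
```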
### Mirror Traffic for Shadow Testing
Test a new deployment without affecting users:
```hcl
body = {
  properties = {
    authMode = var.auth_mode
    traffic = {
      "blue" = 100 # All live traffic to blue
    }
    mirrorTraffic = {
      "green" = 10 # Copy 10% of requests to green (responses discarded)
    }
  }
}
```
Mirror traffic sends a copy of live requests to the green deployment, but responses are discarded. This lets you validate the new model's behavior under real traffic patterns without any user impact.
## Data Collection for Model Monitoring
Enable request/response logging to catch model drift:
```hcl
body = {
  kind = "Managed"
  properties = {
    model              = var.model_uri
    instanceType       = var.instance_type
    appInsightsEnabled = true

    dataCollector = {
      collections = {
        model_inputs = {
          dataCollectionMode = "Enabled"
          dataId             = var.data_asset_id
          samplingRate       = var.sampling_rate
        }
        model_outputs = {
          dataCollectionMode = "Enabled"
          dataId             = var.data_asset_id
          samplingRate       = var.sampling_rate
        }
      }
      requestLogging = {
        captureHeaders = ["Content-Type", "x-request-id"]
      }
      rollingRate = "Hour"
    }
  }
}
```
Data collection captures model inputs and outputs to a registered data asset. Use it for drift detection, fairness monitoring, and retraining triggers.
## Environment Configuration
```hcl
# environments/dev.tfvars
model_name              = "fraud-detector"
deployment_name         = "blue"
model_uri               = "azureml:fraud-detector:2"
instance_type           = "Standard_DS2_v2"
min_instances           = 1
max_instances           = 2
max_concurrent_requests = 5
auth_mode               = "Key"
public_network_access   = "Enabled"
```

```hcl
# environments/prod.tfvars
model_name              = "fraud-detector"
deployment_name         = "blue"
model_uri               = "azureml:fraud-detector:2"
instance_type           = "Standard_DS3_v2"
min_instances           = 2
max_instances           = 8
max_concurrent_requests = 10
auth_mode               = "AADToken"
public_network_access   = "Disabled"
```
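The tfvars files above are then selected per environment at apply time, e.g.:

```shell
terraform init
terraform plan  -var-file=environments/dev.tfvars
terraform apply -var-file=environments/dev.tfvars
```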
## Invoke the Endpoint
```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="...",
    workspace_name="...",
)

result = ml_client.online_endpoints.invoke(
    endpoint_name="prod-fraud-detector",
    deployment_name="blue",
    request_file="sample_request.json",
)
print(result)
```
Or via REST (for an `AADToken` endpoint, fetch a token for the ML inference audience first):

```shell
TOKEN=$(az account get-access-token --resource https://ml.azure.com --query accessToken -o tsv)

curl -X POST \
  "https://prod-fraud-detector.eastus.inference.ml.azure.com/score" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"data": [[0.5, 1.2, 3.4, 0.8]]}'
```
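The `sample_request.json` file passed to `invoke` above can be generated with a short script. The payload shape here is an assumption that matches the curl body; the real schema is whatever your scoring script (`score.py`) expects:

```python
import json

# Hypothetical feature vector; the actual schema is defined by your scoring script.
payload = {"data": [[0.5, 1.2, 3.4, 0.8]]}

# Write the request body that invoke() will send to the endpoint.
with open("sample_request.json", "w") as f:
    json.dump(payload, f)
```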
## Gotchas and Tips
**azapi is required.** The azurerm provider doesn't have native `azurerm_machine_learning_online_endpoint` or `azurerm_machine_learning_online_deployment` resources. Use `azapi_resource` with the `Microsoft.MachineLearningServices` API.

**Deployment creation takes 8-15 minutes.** Provisioning VMs, pulling containers, loading models, and running health probes takes time. Plan for this in CI/CD pipelines.

**Register models before deploying.** The `model` field references a registered model in the workspace model registry (format: `azureml:model-name:version`). Register models via the SDK or CLI before running `terraform apply`.

**Use AADToken in production.** API keys work but require rotation. AAD token auth integrates with managed identity, eliminating key management entirely.

**Scale to zero is not supported for managed endpoints.** Unlike serverless compute, managed online endpoints require at least one instance running. If you need scale-to-zero, consider serverless endpoints (currently in preview).

**Mirror traffic before canary.** Mirror first (responses discarded, no user impact) to validate the model handles real request shapes correctly. Then switch to traffic splitting for a live canary test.
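One more azapi-specific wrinkle: the first apply can fail if the endpoint's `traffic` map names a deployment that doesn't exist yet, since the API validates the target. A common workaround (a sketch, not the only pattern) is to create the endpoint with empty traffic and patch the split after the deployment exists, using `azapi_update_resource`:

```hcl
# Patch traffic only after the deployment exists, avoiding a
# "deployment not found" validation error on the first apply.
resource "azapi_update_resource" "traffic" {
  type        = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2025-06-01"
  resource_id = azapi_resource.online_endpoint.id

  body = {
    properties = {
      traffic = {
        (var.deployment_name) = 100
      }
    }
  }

  depends_on = [azapi_resource.deployment]
}
```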
## What's Next
This is Post 2 of the Azure ML Pipelines & MLOps with Terraform series.
- Post 1: Azure ML Workspace
- Post 2: Azure ML Online Endpoints (you are here)
- Post 3: Azure ML Feature Store
- Post 4: Azure ML Pipelines + Azure DevOps
Your model is in production: a managed online endpoint with autoscaling, blue/green traffic splitting, mirror traffic for shadow testing, and data collection for drift monitoring. From workspace to production, all in Terraform.

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series!