Your model is trained. Now deploy it to a managed online endpoint with traffic splitting for canary rollouts, autoscaling, health probes, and data collection. Here's how to deploy Azure ML endpoints with Terraform using azapi.
In the previous post, we set up the Azure ML workspace - the hub for experiments, models, and compute. Now comes the production side: taking a trained model and deploying it to a managed online endpoint that your applications can call for real-time predictions.
Azure ML online endpoints involve two resources: an Endpoint (the stable HTTPS URL with auth and traffic routing) and one or more Deployments (the model + compute behind it). Since the azurerm provider doesn't have native online endpoint resources yet, we use azapi to provision them directly via the Azure API.
## The Two-Layer Architecture

```
Endpoint (stable HTTPS URL, auth mode, traffic split)
        ↓
Deployment(s) (model, instance type, scaling, probes)
```
| Resource | What It Defines |
|---|---|
| Endpoint | Stable URL, auth mode (key/AAD), traffic routing |
| Deployment | Model reference, instance type, scale settings, health probes |
The endpoint URL stays fixed. You deploy new model versions as new deployments, shift traffic gradually, and delete old deployments after validation.
## Terraform: Create the Online Endpoint

### The Endpoint
```hcl
# endpoint/main.tf
resource "azapi_resource" "online_endpoint" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2025-06-01"
  name      = "${var.environment}-${var.model_name}"
  parent_id = azurerm_machine_learning_workspace.this.id
  location  = var.location

  identity {
    type = "SystemAssigned"
  }

  body = {
    properties = {
      authMode            = var.auth_mode
      publicNetworkAccess = var.public_network_access
      description         = "Production endpoint for ${var.model_name}"
      traffic = {
        (var.deployment_name) = 100
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_name
  }
}
```
`authMode` controls how clients authenticate: `Key` for API key auth (simpler), `AADToken` for Azure AD auth (more secure, no key rotation needed). Use `AADToken` in production.
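The matching variable definition can guard against typos at plan time. A minimal sketch, assuming the variable name used in the config above (the `validation` block is an optional addition; `AMLToken` is the third mode the API accepts):

```hcl
# endpoint/variables.tf (sketch)
variable "auth_mode" {
  type        = string
  default     = "AADToken"
  description = "Endpoint auth mode: Key, AADToken, or AMLToken."

  validation {
    condition     = contains(["Key", "AADToken", "AMLToken"], var.auth_mode)
    error_message = "auth_mode must be one of: Key, AADToken, AMLToken."
  }
}
```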
### The Deployment
```hcl
# endpoint/deployment.tf
resource "azapi_resource" "deployment" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2025-06-01"
  name      = var.deployment_name
  parent_id = azapi_resource.online_endpoint.id
  location  = var.location

  identity {
    type = "SystemAssigned"
  }

  body = {
    kind = "Managed"
    properties = {
      model              = var.model_uri
      environmentId      = var.environment_id
      instanceType       = var.instance_type
      appInsightsEnabled = true

      scaleSettings = {
        scaleType    = "Default"
        minInstances = var.min_instances
        maxInstances = var.max_instances
      }

      livenessProbe = {
        initialDelay     = "PT10S"
        period           = "PT10S"
        timeout          = "PT2S"
        failureThreshold = 30
        successThreshold = 1
      }

      readinessProbe = {
        initialDelay     = "PT10S"
        period           = "PT10S"
        timeout          = "PT2S"
        failureThreshold = 30
        successThreshold = 1
      }

      requestSettings = {
        requestTimeout                   = "PT5S"
        maxConcurrentRequestsPerInstance = var.max_concurrent_requests
      }
    }
  }

  tags = {
    Environment = var.environment
    Model       = var.model_name
    Version     = var.model_version
  }
}
```
Key deployment properties:

- `model` references the registered model in the workspace (e.g., `azureml:fraud-detector:2`)
- `environmentId` references a curated or custom environment with your dependencies
- `scaleSettings` controls autoscaling from `minInstances` to `maxInstances`
- `livenessProbe` and `readinessProbe` configure health checks
- `requestSettings` controls timeout and concurrency per instance
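For context on what those two reference variables typically hold, here is an illustrative sketch. The names and versions are placeholders, and the exact registry path for a curated environment varies; check your workspace for the real values:

```hcl
# Illustrative reference formats only (placeholders, not real assets):
locals {
  # A model registered in this workspace, name:version
  model_uri = "azureml:fraud-detector:2"

  # A curated environment from the shared azureml registry
  environment_id = "azureml://registries/azureml/environments/sklearn-1.5/versions/1"
}
```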
## Traffic Splitting for Canary Deployments
Deploy a new model version alongside the existing one, then gradually shift traffic:
```hcl
# Endpoint with two deployments and traffic split
resource "azapi_resource" "online_endpoint" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2025-06-01"
  name      = "${var.environment}-${var.model_name}"
  parent_id = azurerm_machine_learning_workspace.this.id
  location  = var.location

  identity {
    type = "SystemAssigned"
  }

  body = {
    properties = {
      authMode = var.auth_mode
      traffic = {
        "blue"  = 90 # Current stable version
        "green" = 10 # New canary version
      }
    }
  }
}

# Blue deployment (current version)
resource "azapi_resource" "blue" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2025-06-01"
  name      = "blue"
  parent_id = azapi_resource.online_endpoint.id
  location  = var.location
  # ... deployment config for v1
}

# Green deployment (new version)
resource "azapi_resource" "green" {
  type      = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments@2025-06-01"
  name      = "green"
  parent_id = azapi_resource.online_endpoint.id
  location  = var.location
  # ... deployment config for v2
}
```
**Canary rollout workflow:** Deploy green with 10% traffic. Monitor error rates and latency. If healthy, update traffic to `"green" = 100`, then delete blue. If unhealthy, set `"blue" = 100` to roll back instantly.
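One way to drive those shifts without editing code is to lift the split into a variable. A sketch, assuming a `traffic` variable that replaces the hardcoded map in the endpoint body:

```hcl
variable "traffic" {
  type        = map(number)
  description = "Traffic split across deployments; values must sum to 100."
  default = {
    blue  = 90
    green = 10
  }
}

# In the endpoint body, replace the hardcoded map with:
#   traffic = var.traffic
#
# Then shift traffic from the command line without a code change:
#   terraform apply -var 'traffic={blue=0,green=100}'
```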
### Mirror Traffic for Shadow Testing
Test a new deployment without affecting users:
```hcl
body = {
  properties = {
    authMode = var.auth_mode
    traffic = {
      "blue" = 100 # All live traffic to blue
    }
    mirrorTraffic = {
      "green" = 10 # Copy 10% of requests to green (responses discarded)
    }
  }
}
```
Mirror traffic sends a copy of live requests to the green deployment, but responses are discarded. This lets you validate the new model's behavior under real traffic patterns without any user impact.
## Data Collection for Model Monitoring
Enable request/response logging to catch model drift:
```hcl
body = {
  kind = "Managed"
  properties = {
    model              = var.model_uri
    instanceType       = var.instance_type
    appInsightsEnabled = true

    dataCollector = {
      collections = {
        model_inputs = {
          dataCollectionMode = "Enabled"
          dataId             = var.data_asset_id
          samplingRate       = var.sampling_rate
        }
        model_outputs = {
          dataCollectionMode = "Enabled"
          dataId             = var.data_asset_id
          samplingRate       = var.sampling_rate
        }
      }
      requestLogging = {
        captureHeaders = ["Content-Type", "x-request-id"]
      }
      rollingRate = "Hour"
    }
  }
}
```
Data collection captures model inputs and outputs to a registered data asset. Use it for drift detection, fairness monitoring, and retraining triggers.
## Environment Configuration
```hcl
# environments/dev.tfvars
model_name              = "fraud-detector"
deployment_name         = "blue"
model_uri               = "azureml:fraud-detector:2"
instance_type           = "Standard_DS2_v2"
min_instances           = 1
max_instances           = 2
max_concurrent_requests = 5
auth_mode               = "Key"
public_network_access   = "Enabled"
```

```hcl
# environments/prod.tfvars
model_name              = "fraud-detector"
deployment_name         = "blue"
model_uri               = "azureml:fraud-detector:2"
instance_type           = "Standard_DS3_v2"
min_instances           = 2
max_instances           = 8
max_concurrent_requests = 10
auth_mode               = "AADToken"
public_network_access   = "Disabled"
```
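The tfvars files above are then selected per environment at apply time, e.g.:

```shell
terraform init
terraform plan  -var-file=environments/dev.tfvars
terraform apply -var-file=environments/dev.tfvars
```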
## Invoke the Endpoint
```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="...",
    workspace_name="...",
)

result = ml_client.online_endpoints.invoke(
    endpoint_name="prod-fraud-detector",
    deployment_name="blue",
    request_file="sample_request.json",
)
print(result)
```
Or via REST (for an `AADToken` endpoint, fetch a token for the ML inference audience first):

```shell
TOKEN=$(az account get-access-token --resource https://ml.azure.com --query accessToken -o tsv)

curl -X POST \
  "https://prod-fraud-detector.eastus.inference.ml.azure.com/score" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"data": [[0.5, 1.2, 3.4, 0.8]]}'
```
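The `sample_request.json` file passed to `invoke` above can be generated with a short script. The payload shape here is an assumption that matches the curl body; the real schema is whatever your scoring script (`score.py`) expects:

```python
import json

# Hypothetical feature vector; the actual schema is defined by your scoring script.
payload = {"data": [[0.5, 1.2, 3.4, 0.8]]}

# Write the request body that invoke() will send to the endpoint.
with open("sample_request.json", "w") as f:
    json.dump(payload, f)
```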
## Gotchas and Tips
**azapi is required.** The azurerm provider doesn't have native `azurerm_machine_learning_online_endpoint` or `azurerm_machine_learning_online_deployment` resources. Use `azapi_resource` with the `Microsoft.MachineLearningServices` API.

**Deployment creation takes 8-15 minutes.** Provisioning VMs, pulling containers, loading models, and running health probes takes time. Plan for this in CI/CD pipelines.

**Register models before deploying.** The `model` field references a registered model in the workspace model registry (format: `azureml:model-name:version`). Register models via the SDK or CLI before running `terraform apply`.

**Use AADToken in production.** API keys work but require rotation. AAD token auth integrates with managed identity, eliminating key management entirely.

**Scale to zero is not supported for managed endpoints.** Unlike serverless compute, managed online endpoints require at least one instance running. If you need scale-to-zero, consider serverless endpoints (currently in preview).

**Mirror traffic before canary.** Mirror first (responses discarded, no user impact) to validate the model handles real request shapes correctly. Then switch to traffic splitting for a live canary test.
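One more azapi-specific wrinkle: the first apply can fail if the endpoint's `traffic` map names a deployment that doesn't exist yet, since the API validates the target. A common workaround (a sketch, not the only pattern) is to create the endpoint with empty traffic and patch the split after the deployment exists, using `azapi_update_resource`:

```hcl
# Patch traffic only after the deployment exists, avoiding a
# "deployment not found" validation error on the first apply.
resource "azapi_update_resource" "traffic" {
  type        = "Microsoft.MachineLearningServices/workspaces/onlineEndpoints@2025-06-01"
  resource_id = azapi_resource.online_endpoint.id

  body = {
    properties = {
      traffic = {
        (var.deployment_name) = 100
      }
    }
  }

  depends_on = [azapi_resource.deployment]
}
```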
## What's Next
This is Post 2 of the Azure ML Pipelines & MLOps with Terraform series.
- Post 1: Azure ML Workspace
- Post 2: Azure ML Online Endpoints (you are here)
- Post 3: Azure ML Feature Store
- Post 4: Azure ML Pipelines + Azure DevOps
Your model is in production: a managed online endpoint with autoscaling, blue/green traffic splitting, mirror traffic for shadow testing, and data collection for drift monitoring. From workspace to production, all in Terraform.

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series!