The Azure Machine Learning workspace is the hub for all ML activities - experiments, models, endpoints, pipelines. It requires four dependent services before it can exist. Here's how to provision the entire platform with Terraform, including compute instances and clusters.
In Series 1-3, we worked with managed AI services - AI Foundry for models, AI Search for RAG, Agent Service for orchestration. Series 5 shifts to custom ML - training your own models, deploying endpoints, managing features, and building CI/CD pipelines.
It starts with an Azure Machine Learning workspace. The workspace is the top-level resource for all ML activities: experiments, datasets, models, compute targets, endpoints, and pipelines live here. Unlike a simple resource, the workspace requires four dependent services before it can be created: Storage Account, Key Vault, Application Insights, and Container Registry. Terraform provisions the entire stack. 🎯
## 🏗️ Workspace Architecture
| Component | What It Does |
|---|---|
| Workspace | Central hub for ML experiments, models, and pipelines |
| Storage Account | Default datastore for datasets, model artifacts, logs |
| Key Vault | Secrets, connection strings, API keys |
| Application Insights | Experiment tracking, endpoint monitoring |
| Container Registry | Custom training images, model deployment containers |
| Compute Instance | Per-user notebook/IDE (JupyterLab, VS Code) |
| Compute Cluster | Auto-scaling training cluster (CPU or GPU) |
All four dependent services must exist before the workspace. Terraform handles the dependency ordering automatically.
## 🔧 Terraform: The Full Workspace Setup
### Dependent Services
```hcl
# ml/dependencies.tf

# Assumed supporting resources (not shown in the original post): a resource
# group, a random suffix for globally unique names, and the current client
# config used for the Key Vault tenant ID.
resource "azurerm_resource_group" "ml" {
  name     = "${var.environment}-ml-rg"
  location = var.location
}

resource "random_string" "suffix" {
  length  = 6
  special = false
  upper   = false
}

data "azurerm_client_config" "current" {}

resource "azurerm_storage_account" "ml" {
  name                            = "${var.environment}ml${random_string.suffix.result}"
  location                        = azurerm_resource_group.ml.location
  resource_group_name             = azurerm_resource_group.ml.name
  account_tier                    = "Standard"
  account_replication_type        = var.storage_replication
  min_tls_version                 = "TLS1_2"
  allow_nested_items_to_be_public = false
  tags                            = var.tags
}

resource "azurerm_key_vault" "ml" {
  name                      = "${var.environment}ml${random_string.suffix.result}kv"
  location                  = azurerm_resource_group.ml.location
  resource_group_name       = azurerm_resource_group.ml.name
  tenant_id                 = data.azurerm_client_config.current.tenant_id
  sku_name                  = "standard"
  purge_protection_enabled  = true
  enable_rbac_authorization = true
  tags                      = var.tags
}

resource "azurerm_application_insights" "ml" {
  name                = "${var.environment}-ml-insights"
  location            = azurerm_resource_group.ml.location
  resource_group_name = azurerm_resource_group.ml.name
  application_type    = "web"
  tags                = var.tags
}

resource "azurerm_container_registry" "ml" {
  name                = "${var.environment}ml${random_string.suffix.result}acr"
  location            = azurerm_resource_group.ml.location
  resource_group_name = azurerm_resource_group.ml.name
  sku                 = var.acr_sku
  admin_enabled       = false
  tags                = var.tags
}
```
Container Registry is optional but recommended. Without it, the workspace uses Azure-managed image building. With it, you control custom training images and model-serving containers. Set `admin_enabled = false` and use managed identity instead.
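With `admin_enabled = false`, anything that pulls images authenticates with Microsoft Entra ID instead of registry credentials. Azure ML generally configures access for resources attached to the workspace itself, but identities that pull images directly need an explicit grant. A hedged sketch for the training cluster's identity (resource names assumed from the blocks in this post):

```hcl
# Allow the training cluster's system-assigned identity to pull images
# from the registry without admin credentials.
resource "azurerm_role_assignment" "cluster_acr_pull" {
  scope                = azurerm_container_registry.ml.id
  role_definition_name = "AcrPull"
  principal_id         = azurerm_machine_learning_compute_cluster.training.identity[0].principal_id
}
```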
### The Workspace
```hcl
# ml/workspace.tf
resource "azurerm_machine_learning_workspace" "this" {
  name                          = "${var.environment}-ml-workspace"
  location                      = azurerm_resource_group.ml.location
  resource_group_name           = azurerm_resource_group.ml.name
  application_insights_id       = azurerm_application_insights.ml.id
  key_vault_id                  = azurerm_key_vault.ml.id
  storage_account_id            = azurerm_storage_account.ml.id
  container_registry_id         = azurerm_container_registry.ml.id
  public_network_access_enabled = var.public_network_access

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}
```
The workspace creates a system-assigned managed identity that accesses the dependent services. No keys or connection strings to manage.
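One wrinkle with `enable_rbac_authorization = true` on the Key Vault: vault access is governed by Azure RBAC rather than access policies, so the principal running Terraform needs a data-plane role to manage secrets during provisioning. A minimal sketch, assuming the `azurerm_client_config` data source from the dependencies file (the role choice is an assumption; scope it more tightly in production):

```hcl
# Grant the deploying principal data-plane rights on the RBAC-enabled vault.
# Without a role like this, workspace provisioning can fail when it tries
# to write its secrets into the vault.
resource "azurerm_role_assignment" "deployer_kv_admin" {
  scope                = azurerm_key_vault.ml.id
  role_definition_name = "Key Vault Administrator"
  principal_id         = data.azurerm_client_config.current.object_id
}
```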
### Compute Instance (Per-User IDE)
```hcl
# ml/compute_instance.tf
resource "azurerm_machine_learning_compute_instance" "this" {
  for_each = var.compute_instances

  name                          = each.key
  machine_learning_workspace_id = azurerm_machine_learning_workspace.this.id
  location                      = azurerm_resource_group.ml.location
  virtual_machine_size          = each.value.vm_size
  node_public_ip_enabled        = var.public_network_access

  identity {
    type = "SystemAssigned"
  }

  tags = merge(var.tags, {
    Team = each.value.team
    User = each.key
  })
}
```
Each data scientist gets their own compute instance with JupyterLab and VS Code access. Instances stop and start independently - you only pay when they're running.
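If each instance should be locked to its owner, the azurerm resource supports an `assign_to_user` block. A sketch of that variant - it assumes an extra `object_id` field in the `compute_instances` map, which is not part of the original post, and should be verified against your provider version:

```hcl
# Variant: bind each compute instance to a single user's Entra ID identity,
# so only that user can access it.
resource "azurerm_machine_learning_compute_instance" "personal" {
  for_each = var.compute_instances

  name                          = each.key
  machine_learning_workspace_id = azurerm_machine_learning_workspace.this.id
  virtual_machine_size          = each.value.vm_size

  assign_to_user {
    object_id = each.value.object_id # assumed extra field, one per user
    tenant_id = data.azurerm_client_config.current.tenant_id
  }
}
```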
### Compute Cluster (Training)
```hcl
# ml/compute_cluster.tf
resource "azurerm_machine_learning_compute_cluster" "training" {
  name                          = "${var.environment}-training"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.this.id
  location                      = azurerm_resource_group.ml.location
  vm_size                       = var.training_vm_size
  vm_priority                   = var.training_vm_priority

  identity {
    type = "SystemAssigned"
  }

  scale_settings {
    min_node_count                       = 0
    max_node_count                       = var.training_max_nodes
    scale_down_nodes_after_idle_duration = "PT${var.scale_down_minutes}M"
  }

  tags = var.tags
}
```
`min_node_count = 0` is the key cost control. The cluster scales to zero when no training jobs are running, so you pay nothing for idle compute. `scale_down_nodes_after_idle_duration` controls how quickly nodes are released after a job finishes.
`vm_priority = "LowPriority"` saves up to 80% on training costs. Low-priority VMs can be evicted, so use them for fault-tolerant training jobs with checkpointing.
## 🌍 Environment Configuration
```hcl
# environments/dev.tfvars
environment           = "dev"
public_network_access = true
storage_replication   = "LRS"
acr_sku               = "Basic"
training_vm_size      = "Standard_DS3_v2"
training_vm_priority  = "Dedicated"
training_max_nodes    = 2
scale_down_minutes    = 15

compute_instances = {
  "ds-dev-1" = {
    vm_size = "Standard_DS3_v2"
    team    = "ml-team"
  }
}
```
```hcl
# environments/prod.tfvars
environment           = "prod"
public_network_access = false
storage_replication   = "GRS"
acr_sku               = "Standard"
training_vm_size      = "Standard_NC6s_v3" # GPU
training_vm_priority  = "LowPriority"
training_max_nodes    = 8
scale_down_minutes    = 30

compute_instances = {
  "ds-lead" = {
    vm_size = "Standard_DS4_v2"
    team    = "ml-team"
  }
  "ds-engineer-1" = {
    vm_size = "Standard_DS3_v2"
    team    = "ml-team"
  }
  "ds-engineer-2" = {
    vm_size = "Standard_DS3_v2"
    team    = "ml-team"
  }
}
```
**Dev:** Public access, LRS storage, small dedicated cluster for quick iteration.
**Prod:** Private access, GRS storage, GPU cluster with low-priority VMs for cost-efficient training.
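These tfvars files assume matching input declarations. A minimal `variables.tf` sketch with types inferred from the values above (the `location` variable and its default are assumptions, since the tfvars don't set one):

```hcl
# ml/variables.tf - declarations inferred from the tfvars files
variable "environment"           { type = string }
variable "public_network_access" { type = bool }
variable "storage_replication"   { type = string }
variable "acr_sku"               { type = string }
variable "training_vm_size"      { type = string }
variable "training_vm_priority"  { type = string }
variable "training_max_nodes"    { type = number }
variable "scale_down_minutes"    { type = number }

variable "location" {
  type    = string
  default = "eastus" # assumption - not set in the tfvars above
}

variable "tags" {
  type    = map(string)
  default = {}
}

# Per-instance settings, keyed by instance name
variable "compute_instances" {
  type = map(object({
    vm_size = string
    team    = string
  }))
}

# Entra ID object IDs for role assignments, keyed by a friendly name
variable "data_scientist_principals" {
  type    = map(string)
  default = {}
}

variable "ml_engineer_principals" {
  type    = map(string)
  default = {}
}
```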
## 🔧 RBAC for Team Access
```hcl
# ml/rbac.tf

# Data scientists: manage and use compute without changing workspace settings
resource "azurerm_role_assignment" "ds_compute_operator" {
  for_each             = var.data_scientist_principals
  scope                = azurerm_machine_learning_workspace.this.id
  role_definition_name = "AzureML Compute Operator"
  principal_id         = each.value
}

# ML engineers: full workspace access
resource "azurerm_role_assignment" "mle_contributor" {
  for_each             = var.ml_engineer_principals
  scope                = azurerm_machine_learning_workspace.this.id
  role_definition_name = "Contributor"
  principal_id         = each.value
}
```
Use the built-in AzureML Compute Operator role for data scientists who need to manage and use compute but shouldn't modify workspace settings (the AzureML Data Scientist role is the broader option if they also need to submit jobs and register models). Use Contributor for ML engineers who manage the full lifecycle.
## 🔧 Security Hardening
| Control | Dev | Prod |
|---|---|---|
| Network access | Public | Private (managed VNet) |
| Storage | LRS, TLS 1.2 | GRS, TLS 1.2, no public blob |
| Key Vault | RBAC auth | RBAC auth, purge protection |
| Container Registry | Basic, no admin | Standard, managed identity |
| Compute | Public IP | No public IP, subnet attached |
| Identity | System-assigned | System-assigned + RBAC |
For production, set `public_network_access_enabled = false` on the workspace and use managed virtual network isolation. All compute instances and clusters then run inside a workspace-managed VNet with outbound rules controlled by the workspace.
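In recent azurerm provider versions this maps to the workspace's `managed_network` block. A sketch of the production shape - verify the block and isolation-mode values against your provider version:

```hcl
# Production workspace: no public access, workspace-managed VNet isolation
resource "azurerm_machine_learning_workspace" "this" {
  # ... same arguments as ml/workspace.tf ...
  public_network_access_enabled = false

  managed_network {
    # "AllowInternetOutbound" permits all outbound internet traffic;
    # "AllowOnlyApprovedOutbound" restricts egress to approved rules.
    isolation_mode = "AllowOnlyApprovedOutbound"
  }
}
```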
## ⚠️ Gotchas and Tips
**Globally unique names required.** Storage Account and ACR names must be globally unique. Use `random_string` suffixes to avoid conflicts across environments.
**Container Registry costs.** The Basic SKU runs roughly $5/month, Standard roughly $20/month. If you're not building custom training images yet, you can skip the ACR initially and add it later.
**Compute instance auto-stop.** Unlike clusters, compute instances don't auto-stop by default. Set up an idle shutdown schedule through the workspace settings or Azure Policy to prevent overnight charges.
**Workspace deletion is complex.** Deleting a workspace doesn't automatically delete the dependent resources (storage, Key Vault, ACR). Terraform tracks these relationships through resource references, so `terraform destroy` removes everything in the right order; deleting by hand means cleaning up each resource individually.
**SDK v1 is deprecated.** Azure ML SDK v1 reached end-of-support in March 2025. Use SDK v2 (`azure-ai-ml`) for all new development. Terraform provisions the infrastructure; SDK v2 handles experiments and models.
## ⏭️ What's Next
This is Post 1 of the Azure ML Pipelines & MLOps with Terraform series.
- Post 1: Azure ML Workspace (you are here) 🎬
- Post 2: Azure ML Online Endpoints - Deploy to Prod
- Post 3: Azure ML Feature Store
- Post 4: Azure ML Pipelines + Azure DevOps
Your ML platform is provisioned. Workspace, storage, key vault, ACR, compute instances for notebooks, auto-scaling clusters for training - all in Terraform. The foundation for model development, deployment, and production ML pipelines. 🎬
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 🎬