The Azure Machine Learning workspace is the hub for all ML activities - experiments, models, endpoints, pipelines. It requires four dependent services before it can exist. Here's how to provision the entire platform with Terraform, including compute instances and clusters.
In Series 1-3, we worked with managed AI services - AI Foundry for models, AI Search for RAG, Agent Service for orchestration. Series 5 shifts to custom ML - training your own models, deploying endpoints, managing features, and building CI/CD pipelines.
It starts with an Azure Machine Learning workspace. The workspace is the top-level resource for all ML activities: experiments, datasets, models, compute targets, endpoints, and pipelines live here. Unlike a simple resource, the workspace requires four dependent services before it can be created: Storage Account, Key Vault, Application Insights, and Container Registry. Terraform provisions the entire stack. 🎯
## 🏗️ Workspace Architecture
| Component | What It Does |
|---|---|
| Workspace | Central hub for ML experiments, models, and pipelines |
| Storage Account | Default datastore for datasets, model artifacts, logs |
| Key Vault | Secrets, connection strings, API keys |
| Application Insights | Experiment tracking, endpoint monitoring |
| Container Registry | Custom training images, model deployment containers |
| Compute Instance | Per-user notebook/IDE (JupyterLab, VS Code) |
| Compute Cluster | Auto-scaling training cluster (CPU or GPU) |
All four dependent services must exist before the workspace. Terraform handles the dependency ordering automatically.
## 🔧 Terraform: The Full Workspace Setup
### Dependent Services
```hcl
# ml/dependencies.tf

# Assumed supporting resources (not shown in the original post): a resource
# group, a random suffix for globally unique names, and the current client
# config used for the Key Vault tenant ID.
resource "azurerm_resource_group" "ml" {
  name     = "${var.environment}-ml-rg"
  location = var.location
}

resource "random_string" "suffix" {
  length  = 6
  special = false
  upper   = false
}

data "azurerm_client_config" "current" {}

resource "azurerm_storage_account" "ml" {
  name                            = "${var.environment}ml${random_string.suffix.result}"
  location                        = azurerm_resource_group.ml.location
  resource_group_name             = azurerm_resource_group.ml.name
  account_tier                    = "Standard"
  account_replication_type        = var.storage_replication
  min_tls_version                 = "TLS1_2"
  allow_nested_items_to_be_public = false
  tags                            = var.tags
}

resource "azurerm_key_vault" "ml" {
  name                      = "${var.environment}ml${random_string.suffix.result}kv"
  location                  = azurerm_resource_group.ml.location
  resource_group_name       = azurerm_resource_group.ml.name
  tenant_id                 = data.azurerm_client_config.current.tenant_id
  sku_name                  = "standard"
  purge_protection_enabled  = true
  enable_rbac_authorization = true
  tags                      = var.tags
}

resource "azurerm_application_insights" "ml" {
  name                = "${var.environment}-ml-insights"
  location            = azurerm_resource_group.ml.location
  resource_group_name = azurerm_resource_group.ml.name
  application_type    = "web"
  tags                = var.tags
}

resource "azurerm_container_registry" "ml" {
  name                = "${var.environment}ml${random_string.suffix.result}acr"
  location            = azurerm_resource_group.ml.location
  resource_group_name = azurerm_resource_group.ml.name
  sku                 = var.acr_sku
  admin_enabled       = false
  tags                = var.tags
}
```
Container Registry is optional but recommended. Without it, the workspace uses Azure-managed image building. With it, you control custom training images and model-serving containers. Set `admin_enabled = false` and use managed identity instead.
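With `admin_enabled = false`, anything that pulls images authenticates with Microsoft Entra ID instead of registry credentials. Azure ML generally configures access for resources attached to the workspace itself, but identities that pull images directly need an explicit grant. A hedged sketch for the training cluster's identity (resource names assumed from the blocks in this post):

```hcl
# Allow the training cluster's system-assigned identity to pull images
# from the registry without admin credentials.
resource "azurerm_role_assignment" "cluster_acr_pull" {
  scope                = azurerm_container_registry.ml.id
  role_definition_name = "AcrPull"
  principal_id         = azurerm_machine_learning_compute_cluster.training.identity[0].principal_id
}
```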
### The Workspace
```hcl
# ml/workspace.tf
resource "azurerm_machine_learning_workspace" "this" {
  name                          = "${var.environment}-ml-workspace"
  location                      = azurerm_resource_group.ml.location
  resource_group_name           = azurerm_resource_group.ml.name
  application_insights_id       = azurerm_application_insights.ml.id
  key_vault_id                  = azurerm_key_vault.ml.id
  storage_account_id            = azurerm_storage_account.ml.id
  container_registry_id         = azurerm_container_registry.ml.id
  public_network_access_enabled = var.public_network_access

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}
```
The workspace creates a system-assigned managed identity that accesses the dependent services. No keys or connection strings to manage.
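One wrinkle with `enable_rbac_authorization = true` on the Key Vault: vault access is governed by Azure RBAC rather than access policies, so the principal running Terraform needs a data-plane role to manage secrets during provisioning. A minimal sketch, assuming the `azurerm_client_config` data source from the dependencies file (the role choice is an assumption; scope it more tightly in production):

```hcl
# Grant the deploying principal data-plane rights on the RBAC-enabled vault.
# Without a role like this, workspace provisioning can fail when it tries
# to write its secrets into the vault.
resource "azurerm_role_assignment" "deployer_kv_admin" {
  scope                = azurerm_key_vault.ml.id
  role_definition_name = "Key Vault Administrator"
  principal_id         = data.azurerm_client_config.current.object_id
}
```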
### Compute Instance (Per-User IDE)
```hcl
# ml/compute_instance.tf
resource "azurerm_machine_learning_compute_instance" "this" {
  for_each = var.compute_instances

  name                          = each.key
  machine_learning_workspace_id = azurerm_machine_learning_workspace.this.id
  location                      = azurerm_resource_group.ml.location
  virtual_machine_size          = each.value.vm_size
  node_public_ip_enabled        = var.public_network_access

  identity {
    type = "SystemAssigned"
  }

  tags = merge(var.tags, {
    Team = each.value.team
    User = each.key
  })
}
```
Each data scientist gets their own compute instance with JupyterLab and VS Code access. Instances stop and start independently - you only pay when they're running.
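If each instance should be locked to its owner, the azurerm resource supports an `assign_to_user` block. A sketch of that variant - it assumes an extra `object_id` field in the `compute_instances` map, which is not part of the original post, and should be verified against your provider version:

```hcl
# Variant: bind each compute instance to a single user's Entra ID identity,
# so only that user can access it.
resource "azurerm_machine_learning_compute_instance" "personal" {
  for_each = var.compute_instances

  name                          = each.key
  machine_learning_workspace_id = azurerm_machine_learning_workspace.this.id
  virtual_machine_size          = each.value.vm_size

  assign_to_user {
    object_id = each.value.object_id # assumed extra field, one per user
    tenant_id = data.azurerm_client_config.current.tenant_id
  }
}
```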
### Compute Cluster (Training)
```hcl
# ml/compute_cluster.tf
resource "azurerm_machine_learning_compute_cluster" "training" {
  name                          = "${var.environment}-training"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.this.id
  location                      = azurerm_resource_group.ml.location
  vm_size                       = var.training_vm_size
  vm_priority                   = var.training_vm_priority

  identity {
    type = "SystemAssigned"
  }

  scale_settings {
    min_node_count                       = 0
    max_node_count                       = var.training_max_nodes
    scale_down_nodes_after_idle_duration = "PT${var.scale_down_minutes}M"
  }

  tags = var.tags
}
```
`min_node_count = 0` is the key cost control. The cluster scales to zero when no training jobs are running, so you pay nothing for idle compute. `scale_down_nodes_after_idle_duration` controls how quickly nodes are released after a job finishes.
`vm_priority = "LowPriority"` saves up to 80% on training costs. Low-priority VMs can be evicted, so use them for fault-tolerant training jobs with checkpointing.
## 🌍 Environment Configuration
```hcl
# environments/dev.tfvars
environment           = "dev"
public_network_access = true
storage_replication   = "LRS"
acr_sku               = "Basic"
training_vm_size      = "Standard_DS3_v2"
training_vm_priority  = "Dedicated"
training_max_nodes    = 2
scale_down_minutes    = 15

compute_instances = {
  "ds-dev-1" = {
    vm_size = "Standard_DS3_v2"
    team    = "ml-team"
  }
}
```
```hcl
# environments/prod.tfvars
environment           = "prod"
public_network_access = false
storage_replication   = "GRS"
acr_sku               = "Standard"
training_vm_size      = "Standard_NC6s_v3" # GPU
training_vm_priority  = "LowPriority"
training_max_nodes    = 8
scale_down_minutes    = 30

compute_instances = {
  "ds-lead" = {
    vm_size = "Standard_DS4_v2"
    team    = "ml-team"
  }
  "ds-engineer-1" = {
    vm_size = "Standard_DS3_v2"
    team    = "ml-team"
  }
  "ds-engineer-2" = {
    vm_size = "Standard_DS3_v2"
    team    = "ml-team"
  }
}
```
**Dev:** Public access, LRS storage, small dedicated cluster for quick iteration.
**Prod:** Private access, GRS storage, GPU cluster with low-priority VMs for cost-efficient training.
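These tfvars files assume matching input declarations. A minimal `variables.tf` sketch with types inferred from the values above (the `location` variable and its default are assumptions, since the tfvars don't set one):

```hcl
# ml/variables.tf - declarations inferred from the tfvars files
variable "environment"           { type = string }
variable "public_network_access" { type = bool }
variable "storage_replication"   { type = string }
variable "acr_sku"               { type = string }
variable "training_vm_size"      { type = string }
variable "training_vm_priority"  { type = string }
variable "training_max_nodes"    { type = number }
variable "scale_down_minutes"    { type = number }

variable "location" {
  type    = string
  default = "eastus" # assumption - not set in the tfvars above
}

variable "tags" {
  type    = map(string)
  default = {}
}

# Per-instance settings, keyed by instance name
variable "compute_instances" {
  type = map(object({
    vm_size = string
    team    = string
  }))
}

# Entra ID object IDs for role assignments, keyed by a friendly name
variable "data_scientist_principals" {
  type    = map(string)
  default = {}
}

variable "ml_engineer_principals" {
  type    = map(string)
  default = {}
}
```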
## 🔧 RBAC for Team Access
```hcl
# ml/rbac.tf

# Data scientists: manage and use compute without changing workspace settings
resource "azurerm_role_assignment" "ds_compute_operator" {
  for_each             = var.data_scientist_principals
  scope                = azurerm_machine_learning_workspace.this.id
  role_definition_name = "AzureML Compute Operator"
  principal_id         = each.value
}

# ML engineers: full workspace access
resource "azurerm_role_assignment" "mle_contributor" {
  for_each             = var.ml_engineer_principals
  scope                = azurerm_machine_learning_workspace.this.id
  role_definition_name = "Contributor"
  principal_id         = each.value
}
```
Use the built-in AzureML Compute Operator role for data scientists who need to manage and use compute but shouldn't modify workspace settings (the AzureML Data Scientist role is the broader option if they also need to submit jobs and register models). Use Contributor for ML engineers who manage the full lifecycle.
## 🔧 Security Hardening
| Control | Dev | Prod |
|---|---|---|
| Network access | Public | Private (managed VNet) |
| Storage | LRS, TLS 1.2 | GRS, TLS 1.2, no public blob |
| Key Vault | RBAC auth | RBAC auth, purge protection |
| Container Registry | Basic, no admin | Standard, managed identity |
| Compute | Public IP | No public IP, subnet attached |
| Identity | System-assigned | System-assigned + RBAC |
For production, set `public_network_access_enabled = false` on the workspace and use managed virtual network isolation. All compute instances and clusters then run inside a workspace-managed VNet with outbound rules controlled by the workspace.
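In recent azurerm provider versions this maps to the workspace's `managed_network` block. A sketch of the production shape - verify the block and isolation-mode values against your provider version:

```hcl
# Production workspace: no public access, workspace-managed VNet isolation
resource "azurerm_machine_learning_workspace" "this" {
  # ... same arguments as ml/workspace.tf ...
  public_network_access_enabled = false

  managed_network {
    # "AllowInternetOutbound" permits all outbound internet traffic;
    # "AllowOnlyApprovedOutbound" restricts egress to approved rules.
    isolation_mode = "AllowOnlyApprovedOutbound"
  }
}
```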
## ⚠️ Gotchas and Tips
**Globally unique names required.** Storage Account and ACR names must be globally unique. Use `random_string` suffixes to avoid conflicts across environments.
**Container Registry costs.** The Basic SKU runs roughly $5/month, Standard roughly $20/month. If you're not building custom training images yet, you can skip the ACR initially and add it later.
**Compute instance auto-stop.** Unlike clusters, compute instances don't auto-stop by default. Set up an idle shutdown schedule through the workspace settings or Azure Policy to prevent overnight charges.
**Workspace deletion is complex.** Deleting a workspace doesn't automatically delete the dependent resources (storage, Key Vault, ACR). Terraform tracks these relationships through resource references, so `terraform destroy` removes everything in the right order; deleting by hand means cleaning up each resource individually.
**SDK v1 is deprecated.** Azure ML SDK v1 reached end-of-support in March 2025. Use SDK v2 (`azure-ai-ml`) for all new development. Terraform provisions the infrastructure; SDK v2 handles experiments and models.
## ⏭️ What's Next
This is Post 1 of the Azure ML Pipelines & MLOps with Terraform series.
- Post 1: Azure ML Workspace (you are here) 🎬
- Post 2: Azure ML Online Endpoints - Deploy to Prod
- Post 3: Azure ML Feature Store
- Post 4: Azure ML Pipelines + Azure DevOps
Your ML platform is provisioned. Workspace, storage, key vault, ACR, compute instances for notebooks, auto-scaling clusters for training - all in Terraform. The foundation for model development, deployment, and production ML pipelines. 🎬
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 🎬