Azure Databricks is a powerful analytics platform built on Apache Spark, tailor-made for Azure. It fosters collaboration between data engineers, data scientists, and machine learning experts, facilitating their work on large-scale data and advanced analytics projects. On the other hand, Terraform, an open-source Infrastructure as Code (IaC) tool developed by HashiCorp, empowers users to define and provision infrastructure resources using a declarative configuration language.
In this guide, we'll delve into the seamless integration of these two technologies using the Databricks Terraform provider. This combination offers several compelling advantages and is the recommended approach for efficiently managing Databricks workspaces and their associated resources in Azure.
Setting Up Your Terraform Environment
Before we dive into the specifics, there are some prerequisites for successfully using Terraform and the Databricks Terraform provider:
Azure Account: Ensure you have an Azure account.
Azure Admin User: You need to be an account-level admin user in your Azure account.
Development Machine Setup: On your local development machine, you should have the Terraform CLI and Azure CLI installed and configured. Make sure you are signed in via the az login command with a user that has Contributor or Owner rights to your subscription.
Project Structure
Organize your project into a folder for your Terraform scripts; let's call it "Terraform-Databricks". We will create several configuration files to handle authentication and resource provisioning.
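A layout along these lines works well (versions.tf, providers.tf, and backend.tf are created in the next sections; the variables.tf, main.tf, and notebooks/ names are conventions I'm assuming here):
Terraform-Databricks/
├── versions.tf    # Terraform and provider version constraints
├── providers.tf   # azurerm and databricks provider blocks
├── backend.tf     # remote state configuration
├── variables.tf   # input variables (CIDR, flags, notebook and job settings)
├── main.tf        # resource group, Vnet, workspace, cluster, notebook, job
└── notebooks/     # local notebooks uploaded by databricks_notebook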
Version and Provider Configuration
In your Terraform project, create a versions.tf file to specify the Terraform version and required providers:
terraform {
  required_version = ">= 1.2, < 1.5"

  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "1.24.1"
    }
  }
}
Now, let's define the providers in a providers.tf file:
provider "azurerm" {
features {}
}
# Use Azure CLI authentication.
provider "databricks" {
# We'll revisit this section later
}
We'll return to the Databricks provider configuration shortly.
Backend Configuration
To store the Terraform state, create a backend.tf file. In this example, we're using an Azure Storage account to store the state:
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-non-prod-weu"
    storage_account_name = "stteranonprodweu"
    container_name       = "terraform-databricks"
    key                  = "terraform.tfstate"
    # Rather than defining it inline, the access key can also be sourced
    # from the ARM_ACCESS_KEY environment variable, which keeps it out of
    # version control.
    access_key           = "your_storage_account_access_key"
  }
}
With these initial configurations in place, we can proceed to set up Terraform for Azure.
Writing the Terraform Script to Build the Azure Databricks Infrastructure
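The following steps reference several input variables (var.cidr, var.no_public_ip, and so on) that are not declared anywhere else in this article, so let's start with a minimal variables.tf sketch; the defaults are assumptions you should adapt to your environment:
variable "cidr" {
  description = "Address space for the Vnet; subnets are carved out of it with cidrsubnet()."
  type        = string
  default     = "10.10.0.0/21"
}

variable "public_network_access_enabled" {
  description = "Whether the workspace front end is reachable from the public internet."
  type        = bool
  default     = true
}

variable "no_public_ip" {
  description = "Deploy cluster nodes without public IPs (secure cluster connectivity)."
  type        = bool
  default     = true
}

variable "notebook_subdirectory" {
  description = "Workspace folder (under the user's home) where notebooks are stored."
  type        = string
  default     = "Terraform"
}

variable "notebook_filename" {
  description = "File name of the notebook to upload from ./notebooks."
  type        = string
  default     = "demo.py"
}

variable "notebook_language" {
  description = "Language of the uploaded notebook, e.g. PYTHON."
  type        = string
  default     = "PYTHON"
}

variable "job_name" {
  description = "Name of the Databricks job that runs the notebook."
  type        = string
  default     = "job-analytics-test-weu"
}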
Step 1: Retrieve the Current Client Configuration and User in Azure
data "azurerm_client_config" "current" {
}
data "databricks_current_user" "me" {
depends_on = [azurerm_databricks_workspace.dbw]
}
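Both data sources are used later: the current Databricks user drives the notebook path and the job's email notifications. If you want to sanity-check what Terraform resolved, you could expose them as outputs, for example:
output "tenant_id" {
  value = data.azurerm_client_config.current.tenant_id
}

output "databricks_user" {
  value = data.databricks_current_user.me.user_name
}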
Step 2: Create a Resource Group
resource "azurerm_resource_group" "rg" {
name = "rg-analytics-test-weu"
location = "westeurope"
}
Step 3: Create a Virtual Network (Vnet) - Optional, but Important
Whether to create a Vnet depends on your specific use case. If you're exclusively using Azure Databricks and don't require outbound access or a high level of security, you can skip this step. However, if you need to interact with services outside of Azure, it's advisable to create a Vnet.
Consider the scenario where you want Azure Databricks to access MongoDB Atlas, which resides outside of Azure. MongoDB Atlas secures its infrastructure by allowing specific IPs in a whitelist. However, exposing Azure Databricks to the internet isn't an ideal solution. Instead, you can create a Vnet and set up peering or a private endpoint.
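The exact setup depends on the external service (MongoDB Atlas, for instance, has its own peering workflow documented on the Atlas side), but on the Azure side a peering to a network whose resource ID you already know is a single resource. A minimal sketch, assuming the Vnet created below and a hypothetical var.remote_vnet_id variable:
# Sketch only: peer the analytics Vnet with a remote network whose ID we already know.
resource "azurerm_virtual_network_peering" "to_remote" {
  name                         = "peer-analytics-test-weu-to-remote"
  resource_group_name          = azurerm_resource_group.rg.name
  virtual_network_name         = azurerm_virtual_network.vnet.name
  remote_virtual_network_id    = var.remote_vnet_id
  allow_virtual_network_access = true
}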
It's essential to note that you can't add a Vnet to an existing workspace. Once a workspace is created, its configurations are registered in the Control Plane and can't be modified.
Here, we'll create a Vnet:
resource "azurerm_virtual_network" "vnet" {
name = "vnet-analytics-test-weu"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
address_space = [var.cidr]
}
Within this Vnet, we'll create two subnets: a public subnet and a private subnet. We'll also implement a network security group (NSG) to manage security for the Vnet.
resource "azurerm_network_security_group" "nsg" {
name = "nsg-analytics-test-weu"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_subnet" "public" {
name = "subnet-public"
resource_group_name = azurerm_resource_group.rg.name
virtual_network_name = azurerm_virtual_network.vnet.name
address_prefixes = [cidrsubnet(var.cidr, 3, 0)]
delegation {
name = "databricks"
service_delegation {
name = "Microsoft.Databricks/workspaces"
actions = [
"Microsoft.Network/virtualNetworks/subnets/join/action",
"Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
"Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"]
}
}
}
resource "azurerm_subnet_network_security_group_association" "public" {
subnet_id = azurerm_subnet.public.id
network_security_group_id = azurerm_network_security_group.nsg.id
}
resource "azurerm_subnet" "private" {
name = "subnet-private"
resource_group_name = azurerm_resource_group.rg.name
virtual_network_name = azurerm_virtual_network.vnet.name
address_prefixes = [cidrsubnet(var.cidr, 3, 1)]
delegation {
name = "databricks"
service_delegation {
name = "Microsoft.Databricks/workspaces"
actions = [
"Microsoft.Network/virtualNetworks/subnets/join/action",
"Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
"Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"]
}
}
}
resource "azurerm_subnet_network_security_group_association" "private" {
subnet_id = azurerm_subnet.private.id
network_security_group_id = azurerm_network_security_group.nsg.id
}
Azure Databricks Workspace Setup
Now, let's focus on creating an Azure Databricks workspace. This workspace will be our central hub for running analytics tasks:
resource "azurerm_databricks_workspace" "dbw" {
name = "dbw-analytics-test-weu"
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
sku = "premium"
managed_resource_group_name = "rg-databricks-managed-weu"
public_network_access_enabled = var.public_network_access_enabled
custom_parameters {
no_public_ip = var.no_public_ip
virtual_network_id = azurerm_virtual_network.vnet.id
private_subnet_name = azurerm_subnet.private.name
public_subnet_name = azurerm_subnet.public.name
public_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.public.id
private_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.private.id
}
depends_on = [
azurerm_subnet_network_security_group_association.public,
azurerm_subnet_network_security_group_association.private
]
}
You can see that depends_on lists both subnet NSG associations. We need this; otherwise terraform destroy doesn't clean things up in the correct order.
This workspace creation includes some crucial parameters like SKU, location, and network configurations. We're ensuring that the workspace is integrated into your Azure environment.
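It can also be handy to expose the workspace ID and URL as outputs; the same attributes are reused later when we configure the Databricks provider. For example:
output "databricks_workspace_id" {
  value = azurerm_databricks_workspace.dbw.id
}

output "databricks_workspace_url" {
  value = "https://${azurerm_databricks_workspace.dbw.workspace_url}/"
}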
Next, we are going to create a Databricks cluster. Before creating the cluster, however, we have to define the node type, Spark version, and instance pool.
data "databricks_node_type" "dbr_node_type" {
local_disk = true
depends_on = [azurerm_databricks_workspace.dbw]
}
data "databricks_spark_version" "dbr_spark" {
long_term_support = true
depends_on = [azurerm_databricks_workspace.dbw]
}
resource "databricks_instance_pool" "dbr_instance_pool" {
instance_pool_name = "pool-analytics-test-weu"
min_idle_instances = 0
max_capacity = 10
node_type_id = data.databricks_node_type.dbr_node_type.id
idle_instance_autotermination_minutes = 10
azure_attributes {
availability = "ON_DEMAND_AZURE"
spot_bid_max_price = -1
}
disk_spec {
disk_type {
azure_disk_volume_type = "PREMIUM_LRS"
}
disk_size = 80
disk_count = 1
}
}
Azure Databricks Cluster Creation
Next, we'll create an Azure Databricks cluster, which will serve as the computational engine for our analytics tasks:
resource "databricks_cluster" "cluster" {
cluster_name = "dbc-analytics-test-weu"
spark_version = data.databricks_spark_version.dbr_spark.id
node_type_id = data.databricks_node_type.dbr_node_type.id
autotermination_minutes = 20
autoscale {
min_workers = 1
max_workers = 50
}
spark_conf = {
"spark.databricks.io.cache.enable" : true
}
depends_on = [azurerm_databricks_workspace.dbw]
# instance_pool_id = databricks_instance_pool.dbr_instance_pool.id
}
We have also declared that the cluster depends on the Databricks workspace. We can specify either the node type ID or the instance pool ID.
To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster’s request, it expands by allocating new instances from the instance provider. When an attached cluster changes its state to TERMINATED, the instances it used are returned to the pool and reused by a different cluster.
In this article, I use the node type ID instead of the instance pool ID; a pool-based variant is sketched below.
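For completeness, here is a minimal sketch of what the pool-based variant could look like; the resource name dbc_pooled is purely illustrative, and since instance_pool_id is set, the node type comes from the pool and node_type_id is omitted:
# Sketch only: a cluster that allocates its nodes from the instance pool
# created above instead of specifying a node type directly.
resource "databricks_cluster" "dbc_pooled" {
  cluster_name            = "dbc-analytics-test-weu-pooled"
  spark_version           = data.databricks_spark_version.dbr_spark.id
  instance_pool_id        = databricks_instance_pool.dbr_instance_pool.id
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 50
  }

  depends_on = [azurerm_databricks_workspace.dbw]
}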
In addition, we can create further resources such as notebooks and jobs, which can be defined as shown below:
resource "databricks_notebook" "nb" {
path = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
language = var.notebook_language
source = "./notebooks/${var.notebook_filename}"
}
resource "databricks_job" "job" {
name = var.job_name
existing_cluster_id = databricks_cluster.cluster.cluster_id
notebook_task {
notebook_path = databricks_notebook.nb.path
}
email_notifications {
on_success = [ data.databricks_current_user.me.user_name ]
on_failure = [ data.databricks_current_user.me.user_name ]
}
}
This is almost everything we need. The cluster configuration above includes details such as its name, Spark version, and autoscaling properties, and the job runs the uploaded notebook on that cluster.
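As written, the job runs only when triggered manually. If you want it to run periodically, the databricks_job resource also accepts a schedule block; here is a small sketch, where the cron expression, timezone, and the job_scheduled name are assumptions:
resource "databricks_job" "job_scheduled" {
  name                = "${var.job_name}-scheduled"
  existing_cluster_id = databricks_cluster.cluster.cluster_id

  notebook_task {
    notebook_path = databricks_notebook.nb.path
  }

  # Run every day at 06:00 UTC; adjust the Quartz cron expression as needed.
  schedule {
    quartz_cron_expression = "0 0 6 * * ?"
    timezone_id            = "UTC"
  }
}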
Databricks Provider Update
We need to revisit the Databricks provider configuration and update it with the necessary authentication information:
provider "databricks" {
azure_workspace_resource_id = azurerm_databricks_workspace.dbw.id
host = azurerm_databricks_workspace.dbw.workspace_url
}
This lets the Databricks provider authenticate against the newly created workspace using your Azure CLI credentials.
Initialize the working directory containing the *.tf files by running the terraform init command:
terraform init
Terraform downloads the specified providers and installs them in a hidden subdirectory of your current working directory, named .terraform. The terraform init command prints out which versions of the providers were installed. Terraform also creates a lock file named .terraform.lock.hcl, which specifies the exact provider versions used, so that you can control when you want to update the providers used for your project.
Check whether your project was configured correctly by running the terraform plan command. If there are any errors, fix them, and run the command again.
terraform plan
If everything looks good, apply the changes required to reach the desired state of the configuration by running the terraform apply command.
terraform apply -auto-approve
This guide covers the fundamental setup and provisioning steps for Azure Databricks using Terraform. In upcoming articles, we'll explore more advanced configurations and automation options to help you harness the full potential of this powerful analytics platform.
Stay tuned for more in-depth insights into managing Databricks workspaces and resources with Terraform!