Francisco Ruiz A


Connect to Azure Data Warehouse from Azure Databricks

Create a secret scope

A secret scope lets Azure Databricks read secrets from Azure Key Vault at runtime, so the details needed to connect to Azure Data Warehouse (username/password, etc.) never appear in your notebooks. In our example, the whole connection string will be stored as a secret.

Navigate to https://{region}.azuredatabricks.net/?o={object_id}#secrets/createScope

Grab the DNS Name and Resource ID from your Azure Key Vault's Properties pane and enter them in the form.
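
Alternatively, the scope can be created from the Databricks CLI instead of the UI. A minimal sketch, assuming the CLI is installed and authenticated; the resource ID and DNS name placeholders below stand in for the Key Vault properties grabbed above, and databricks_scope is the scope name used throughout this post:

# Hypothetical placeholder values; substitute your own Key Vault resource ID and DNS name
databricks secrets create-scope --scope databricks_scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault-name>" \
  --dns-name "https://<vault-name>.vault.azure.net/"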

Get the required secrets from Azure Key Vault and set the Spark configuration

# Pull connection details from the Key Vault-backed secret scope
blob_storage_url = dbutils.secrets.get(scope = "databricks_scope", key = "blobstorageurl")
blob_storage_key = dbutils.secrets.get(scope = "databricks_scope", key = "blobstoragekey")
adw_connection = dbutils.secrets.get(scope = "databricks_scope", key = "dwconnection")
blob_storage_temp_dir = dbutils.secrets.get(scope = "databricks_scope", key = "blobstoragetempdir")

# blobstorageurl holds the Spark config key, e.g. "fs.azure.account.key.<account>.blob.core.windows.net",
# so this line registers the storage account key with Spark
spark.conf.set(blob_storage_url, blob_storage_key)
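
Before wiring these into Spark, it can help to confirm the scope actually exposes the keys you expect. A quick check (secret values themselves are always redacted in notebook output):

# List the keys available in the scope; only key names are shown, never values
for secret in dbutils.secrets.list("databricks_scope"):
    print(secret.key)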

Spark driver to SQL DW

The Spark driver connects to SQL DW via JDBC using a username and password. We recommend using the connection string provided by the Azure portal, which enables Secure Sockets Layer (SSL) encryption for all data sent between the Spark driver and the SQL DW instance over the JDBC connection. To verify that SSL encryption is enabled, search for encrypt=true in the connection string.

To allow the Spark driver to reach SQL DW, we recommend setting Allow access to Azure services to ON on the firewall pane of the SQL DW server in the Azure portal. This setting allows communication from all Azure IP addresses and all Azure subnets, which lets Spark drivers reach the SQL DW instance.
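
For reference, the dwconnection secret stored earlier would hold a JDBC string of roughly this shape; every value below is a hypothetical placeholder, so copy the real string from the Azure portal rather than building it by hand:

# Hypothetical example of the JDBC connection string kept in the "dwconnection" secret;
# note encrypt=true, which confirms SSL is enabled
jdbc:sqlserver://<server-name>.database.windows.net:1433;database=<db-name>;user=<user>@<server-name>;password=<password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;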

For more info, see the Azure Databricks documentation for the SQL Data Warehouse connector.

Load data from a SQL DW query into a Spark DataFrame

# Run the query against SQL DW, staging data through blob storage
df = (spark
      .read
      .format("com.databricks.spark.sqldw")
      .option("url", adw_connection)                               # JDBC connection string from Key Vault
      .option("tempDir", blob_storage_temp_dir)                    # blob storage staging location for transfers
      .option("forwardSparkAzureStorageCredentials", "true")       # forward the storage key set above to SQL DW
      .option("query", "select count(*) as counter from my_table")
      .load())

df.show()
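
The same connector also writes back to SQL DW. A sketch under the same assumptions, where my_table_copy is a hypothetical target table:

# Write the DataFrame back to SQL DW through the same blob storage staging area
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", adw_connection)
   .option("tempDir", blob_storage_temp_dir)
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "my_table_copy")  # hypothetical target table
   .mode("overwrite")
   .save())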

That's it! Quick and easy.
