Ingest data into Databricks from Azure Data Explorer

Kenichiro Nakamura ・3 min read

In the previous article, I explained how to ingest data from blob storage into Azure Data Explorer (ADX).

In this article, I explain how you can query ADX data from Databricks.

I use the Azure Kusto Python SDK to query the data. You can find the sample code here.

Prerequisites

Add an Azure Databricks resource. The right tier depends on how you use the service; I created it as "standard tier" because I don't need any premium features this time.

Create cluster for Databricks

Once you have created the Databricks resource, add a cluster.

1. Go to the Azure portal and select the Databricks resource you just created.


2. Click "Launch Workspace".


3. Go to the Clusters menu and create a cluster with default settings. To save money, I disabled autoscaling and set the number of nodes to 1.

Create notebook

Finally, create a notebook to process the data.

1. Go to Workspace | Users | your account and create a Notebook.


2. Create the notebook with Python as its language.


3. Install the required pip packages. Run the cell to make sure they install successfully.

%pip install azure-kusto-data azure-kusto-ingest

4. Add a new cell and paste in the following code, which I took from the sample. Replace the following values to match your environment.

  • cluster: The ADX cluster address. You can confirm it in the ADX overview pane in the Azure portal; copy the URI value.
  • client_id: The service principal ID created in the previous article.
  • client_secret: The service principal secret created in the previous article.
  • authority_id: The tenant ID where the service principal was added. You can see the ID in the service principal overview pane.


"""A simple example how to use KustoClient."""

from datetime import timedelta

from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder, ClientRequestProperties

######################################################
##                        AUTH                      ##
######################################################

cluster = "https://<your adx name>.<region>.kusto.windows.net"
client_id = "<service principal id>"
client_secret = "<service principal secret>"
authority_id = "<tenant id>"

kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(cluster, client_id, client_secret, authority_id)

client = KustoClient(kcsb)

######################################################
##                       QUERY                      ##
######################################################

# once authenticated, usage is as following
db = "catlogsdb"
query = "cats | take 10"

response = client.execute(db, query)
dataframe = dataframe_from_result_table(response.primary_results[0])

print(dataframe)

If you want to query a different database or run a different query, change the QUERY section.
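
The query string is plain KQL, so you can also build it dynamically. Here is a minimal sketch; the `catlogsdb` database and `cats` table come from the article's sample, while the `Age` column and filter are hypothetical placeholders:

```python
# Build the KQL query string dynamically instead of hardcoding it.
# "catlogsdb" and "cats" are the sample names used in this article;
# the "Age" column and the filter are hypothetical placeholders.
db = "catlogsdb"
table = "cats"
limit = 10

query = f"{table} | take {limit}"
filtered_query = f"{table} | where Age > 2 | take {limit}"

# Pass either string to the client exactly as before:
#   response = client.execute(db, query)
```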

5. Run the notebook and confirm the result.
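
Since `dataframe_from_result_table` returns a pandas DataFrame, a natural next step in Databricks is to hand the result to Spark for further processing. A minimal sketch, using a stand-in DataFrame in place of the real query result (the column names are hypothetical):

```python
import pandas as pd

# Stand-in for the pandas DataFrame returned by dataframe_from_result_table;
# the column names here are hypothetical.
dataframe = pd.DataFrame({"Name": ["tama", "mike"], "Age": [3, 5]})

# Inside a Databricks notebook the `spark` session is predefined, so the
# result can be registered as a temp view for SQL / Spark processing:
#   spark_df = spark.createDataFrame(dataframe)
#   spark_df.createOrReplaceTempView("cats")

row_count = len(dataframe)
print(row_count)
```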

Alt Text

Authentication

Obviously, hardcoding secrets in a notebook without encryption is not a good idea.

There are several ways to secure them:

  • Use certificate
  • Use System Assigned Managed Service Identity (MSI)
  • Use User Assigned Managed Service Identity (MSI)
  • Username and password (also not secure)
  • Use device authentication
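
As a sketch, the Python SDK exposes matching factory methods on KustoConnectionStringBuilder for several of these flows. The method names are from azure-kusto-data; the cluster URL is a placeholder, and the import is guarded so the cell degrades gracefully where the SDK is not installed:

```python
# Alternative auth flows on KustoConnectionStringBuilder (azure-kusto-data).
try:
    from azure.kusto.data import KustoConnectionStringBuilder
except ImportError:  # SDK not installed yet -- run the %pip cell first
    KustoConnectionStringBuilder = None

cluster = "https://<your adx name>.<region>.kusto.windows.net"  # placeholder

if KustoConnectionStringBuilder is not None:
    # Device code flow: the SDK prints a code to enter on the login page.
    kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
    # System-assigned managed identity, when running on Azure compute:
    # kcsb = KustoConnectionStringBuilder.with_aad_managed_service_identity_authentication(cluster)
```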

You can also retrieve secrets from a secret scope if you use the Databricks Premium tier. See Secrets for more detail.
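
For example, a secret scope lookup can replace the hardcoded value. The scope name `adx` and key `client-secret` below are hypothetical; create them first with the Databricks CLI. Note that `dbutils` only exists inside a notebook, so this sketch falls back to a placeholder elsewhere:

```python
# Hypothetical scope/key names; create them with the Databricks CLI first:
#   databricks secrets create-scope --scope adx
#   databricks secrets put --scope adx --key client-secret

def get_client_secret():
    """Read the secret from the workspace scope; fall back to a
    placeholder when dbutils is unavailable (i.e. outside Databricks)."""
    try:
        return dbutils.secrets.get(scope="adx", key="client-secret")  # noqa: F821
    except NameError:
        return "<service principal secret>"

client_secret = get_client_secret()
```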

Summary

ADX is a popular data source for big data analysis. If you are not a Python person, no worries: Microsoft provides SDKs for other languages, too. See the Node.js, .NET, and Java SDKs.
