Ingest data into Databricks from Azure Data Explorer

Kenichiro Nakamura ・3 min read

In the previous article, I explained how to ingest data from blob storage into Azure Data Explorer (ADX).

In this article, I explain how you can query ADX data from Databricks.

I use the Azure Kusto Python SDK to query the data. You can find the sample code here.

Prerequisites

Add an Azure Databricks resource. The right tier depends on how you use the service; I created it as "standard tier" because I don't need any premium features this time.

Create cluster for Databricks

Once you have created the Databricks resource, add a cluster.

1. Go to the Azure portal and select the Databricks resource you just created.


2. Click "Launch Workspace".


3. Go to the Clusters menu and create a cluster with default settings. To save money, I disabled autoscaling and set the number of nodes to 1.

Create notebook

Finally, create a notebook to process the data.

1. Go to Workspace | Users | your account and create a Notebook.


2. Create the notebook with Python as its language.


3. Install the required pip packages. Run the cell to make sure they install successfully.

%pip install azure-kusto-data azure-kusto-ingest

4. Add a new cell and paste in the following code, which I took from the sample. Replace the following values to match your environment.

  • cluster: The ADX cluster address. You can confirm it in the ADX overview pane in the Azure portal; copy the URI value.
  • client_id: The service principal ID created in the previous article.
  • client_secret: The service principal secret created in the previous article.
  • authority_id: The tenant ID where the service principal was added. You can see the ID in the service principal overview pane.


"""A simple example how to use KustoClient."""

from datetime import timedelta

from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder, ClientRequestProperties

######################################################
##                        AUTH                      ##
######################################################

cluster = "https://<your adx name>.<region>.kusto.windows.net"
client_id = "<service principal id>"
client_secret = "<service principal secret>"
authority_id = "<tenant id>"

kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(cluster, client_id, client_secret, authority_id)

client = KustoClient(kcsb)

######################################################
##                       QUERY                      ##
######################################################

# once authenticated, usage is as following
db = "catlogsdb"
query = "cats | take 10"

response = client.execute(db, query)
dataframe = dataframe_from_result_table(response.primary_results[0])

print(dataframe)

If you want to query a different database or run a different query, change the QUERY section.
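
The query string is plain KQL, so you can also build it dynamically. Here is a minimal sketch; the `catlogsdb` database and `cats` table come from the article's sample, while the `Age` column and filter are hypothetical placeholders:

```python
# Build the KQL query string dynamically instead of hardcoding it.
# "catlogsdb" and "cats" are the sample names used in this article;
# the "Age" column and the filter are hypothetical placeholders.
db = "catlogsdb"
table = "cats"
limit = 10

query = f"{table} | take {limit}"
filtered_query = f"{table} | where Age > 2 | take {limit}"

# Pass either string to the client exactly as before:
#   response = client.execute(db, query)
```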

5. Run the notebook and confirm the result.
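
Since `dataframe_from_result_table` returns a pandas DataFrame, a natural next step in Databricks is to hand the result to Spark for further processing. A minimal sketch, using a stand-in DataFrame in place of the real query result (the column names are hypothetical):

```python
import pandas as pd

# Stand-in for the pandas DataFrame returned by dataframe_from_result_table;
# the column names here are hypothetical.
dataframe = pd.DataFrame({"Name": ["tama", "mike"], "Age": [3, 5]})

# Inside a Databricks notebook the `spark` session is predefined, so the
# result can be registered as a temp view for SQL / Spark processing:
#   spark_df = spark.createDataFrame(dataframe)
#   spark_df.createOrReplaceTempView("cats")

row_count = len(dataframe)
print(row_count)
```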

Alt Text

Authentication

Obviously, hardcoding secrets in a notebook without encryption is not a good idea.

There are several ways to secure them:

  • Use certificate
  • Use System Assigned Managed Service Identity (MSI)
  • Use User Assigned Managed Service Identity (MSI)
  • Username and password (also not secure)
  • Use device authentication
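
As a sketch, the Python SDK exposes matching factory methods on KustoConnectionStringBuilder for several of these flows. The method names are from azure-kusto-data; the cluster URL is a placeholder, and the import is guarded so the cell degrades gracefully where the SDK is not installed:

```python
# Alternative auth flows on KustoConnectionStringBuilder (azure-kusto-data).
try:
    from azure.kusto.data import KustoConnectionStringBuilder
except ImportError:  # SDK not installed yet -- run the %pip cell first
    KustoConnectionStringBuilder = None

cluster = "https://<your adx name>.<region>.kusto.windows.net"  # placeholder

if KustoConnectionStringBuilder is not None:
    # Device code flow: the SDK prints a code to enter on the login page.
    kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
    # System-assigned managed identity, when running on Azure compute:
    # kcsb = KustoConnectionStringBuilder.with_aad_managed_service_identity_authentication(cluster)
```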

You can also retrieve secrets from a secret scope if you use the Databricks Premium tier. See Secrets for more detail.
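
For example, a secret scope lookup can replace the hardcoded value. The scope name `adx` and key `client-secret` below are hypothetical; create them first with the Databricks CLI. Note that `dbutils` only exists inside a notebook, so this sketch falls back to a placeholder elsewhere:

```python
# Hypothetical scope/key names; create them with the Databricks CLI first:
#   databricks secrets create-scope --scope adx
#   databricks secrets put --scope adx --key client-secret

def get_client_secret():
    """Read the secret from the workspace scope; fall back to a
    placeholder when dbutils is unavailable (i.e. outside Databricks)."""
    try:
        return dbutils.secrets.get(scope="adx", key="client-secret")  # noqa: F821
    except NameError:
        return "<service principal secret>"

client_secret = get_client_secret()
```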

Summary

ADX is a popular data source for big data analysis. If you are not a Python person, no worries: Microsoft provides SDKs for other languages, too. See the Node.js, .NET, and Java SDKs.
