<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Itay Sagui</title>
    <description>The latest articles on DEV Community by Itay Sagui (@saguiitay).</description>
    <link>https://dev.to/saguiitay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F894656%2F801d2c23-e80e-466c-a5b1-0207539d1db1.png</url>
      <title>DEV Community: Itay Sagui</title>
      <link>https://dev.to/saguiitay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saguiitay"/>
    <language>en</language>
    <item>
      <title>Querying Azure Data Explorer using Power BI</title>
      <dc:creator>Itay Sagui</dc:creator>
      <pubDate>Tue, 19 Jul 2022 18:20:42 +0000</pubDate>
      <link>https://dev.to/saguiitay/querying-azure-data-explorer-using-power-bi-1jh1</link>
      <guid>https://dev.to/saguiitay/querying-azure-data-explorer-using-power-bi-1jh1</guid>
      <description>&lt;p&gt;I’ve been at Microsoft for several years now, working as part of the Azure Data Explorer (ADX) team. Although my focus is mainly behind the scenes working on the Control Plane of the service, one of my responsibilities is the Power BI Connector for Azure Data Explorer. The connector is a newer-style M-based connector. Its purpose is to allow you to easily query your ADX cluster, in either Import or Direct Query mode, and generates native Kusto Query Language (KQL) queries. The connector allows you to work mainly in Power BI if that’s your cup-of-tea, but also support more advanced scenarios, in case you’re an Azure Data Explorer / KQL kind of guy.&lt;/p&gt;

&lt;p&gt;In this post series I’ll use one of our demo Azure Data Explorer clusters – it contains a database named “Samples”, with a “StormEvents” table that lists storm events and their details. Although the table has only around 65K rows (ADX can handle billions of rows without any issue), it will serve well for what I have to show you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieve data from Azure Data Explorer
&lt;/h2&gt;

&lt;p&gt;Let’s start with something simple. I’ll create a new data source, and select “Azure Data Explorer (Kusto)” as my source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EK9Act9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3z26k55dxvimya0jk5ky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EK9Act9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3z26k55dxvimya0jk5ky.png" alt="" width="600" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, I’ll provide my cluster’s query URL. At this point, I could just press the “Connect” button and use the user interface to select a database and a table (or function – more on that in a future post). However, you usually get better results by providing the database and the table or query manually:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hrUFj7_s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3hoew3snq1jxzpqnmdli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hrUFj7_s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3hoew3snq1jxzpqnmdli.png" alt="" width="701" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that I’ve selected to load the data in Direct Query mode. Without going into too much detail, I’d usually use Import mode for dimension tables, and Direct Query mode for facts. Direct Query is also the recommended mode if you have big data, or if you need near-real-time data in your report.&lt;/p&gt;

&lt;p&gt;The M query generated looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("https://XXX.kusto.windows.net", "Samples", "StormEvents", [MaxRows=null, MaxSize=null, NoTruncate=null, AdditionalSetStatements=null])
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’ll discuss the meaning of the various settings in a future post. Suffice it to say that the cluster URL, database name, and table name I provided were used to invoke the AzureDataExplorer.Contents() method.&lt;/p&gt;

&lt;h2&gt;
  
  
  Displaying data in the Power BI report
&lt;/h2&gt;

&lt;p&gt;At this point, the data is ready to be loaded into my report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tdwHZ8f0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ic2pzi02z65yc3ce3qwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tdwHZ8f0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ic2pzi02z65yc3ce3qwy.png" alt="" width="235" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can now use the data to add some visualizations. I’ll start by adding a couple of filters, one for State and another for Source – both are dimension columns, which we’ll meet again when we discuss dimension tables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fS8-b9Gl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vicy5hoc1h2xcp7wqrj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fS8-b9Gl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vicy5hoc1h2xcp7wqrj1.png" alt="" width="628" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last, I’ll add a pie chart, showing how many events of each type we have:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FKUe4vZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fm0z6ay1dz5g9jz8803j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FKUe4vZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fm0z6ay1dz5g9jz8803j.png" alt="" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used the EventType column in the “Legend” area, and the EventId column for the Values area. Notice how EventId is aggregated to count the number of values. I could have used any column in this case.&lt;/p&gt;

&lt;p&gt;If you look at the queries being sent to your Azure Data Explorer cluster, you can see that Power BI, with the help of the ADX connector, generated something along the lines of the following queries (for the two filters and the pie chart):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| project ["State"]
| summarize by ["State"]
| order by ["State"] asc
| limit 101

StormEvents
| project ["Source"]
| summarize by ["Source"]
| order by ["Source"] asc
| limit 101

StormEvents
| summarize ["a0"]=countif(isnotnull(["EventId"])) by ["EventType"]
| limit 1000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the next post, I’ll discuss using a combination of Import and Direct Query mode, to work with dimension values, and improve performance of your reports.&lt;/p&gt;

</description>
      <category>azuredataexplorer</category>
      <category>adx</category>
      <category>powerbi</category>
    </item>
    <item>
      <title>Advanced options of the Azure Data Explorer Connector for Power BI</title>
      <dc:creator>Itay Sagui</dc:creator>
      <pubDate>Mon, 22 Mar 2021 22:08:57 +0000</pubDate>
      <link>https://dev.to/saguiitay/advanced-options-of-the-azure-data-explorer-connector-for-power-bi-5113</link>
      <guid>https://dev.to/saguiitay/advanced-options-of-the-azure-data-explorer-connector-for-power-bi-5113</guid>
      <description>&lt;p&gt;This is the fifth post in this series, discussing the usage of Power BI to visualize Azure Data Explorer data. In the previous post, we used the Dynamic M Parameters feature of Power BI, and used it to invoke Azure Data Explorer Functions, and use query operators that exist in ADX but are not available in Power BI, such as Percentiles and Regular Expressions. In this post, I’d like to discuss some of the advanced settings of the ADX Connector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Data Explorer Connector options
&lt;/h2&gt;

&lt;p&gt;As we saw in the previous posts, the Azure Data Explorer connector exposes the Contents method, which accepts 4 parameters – the cluster URL, the database name, a table name or query, and an M record object for options. These options allow us to change the behavior of the connector, and affect the generated queries.&lt;/p&gt;
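
&lt;p&gt;Before diving into each option, here’s a sketch of the full options record in a single call – the cluster URL, database, and table names below are placeholders, and every option is optional (null means “use the default”):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents(
        "https://mycluster.kusto.windows.net",  // cluster URL (placeholder)
        "MyDatabase",                           // database name (placeholder)
        "MyTable",                              // table name or KQL query
        [
            MaxRows = 1000000,                  // raise the row-count truncation limit
            MaxSize = 100*1024*1024,            // raise the data-size truncation limit
            NoTruncate = null,                  // or true, to remove the limits entirely
            AdditionalSetStatements = null,     // e.g. "set query_datascope=hotcache"
            CaseInsensitive = true,             // use the =~ operator instead of ==
            ForceUseContains = true,            // use 'contains' instead of 'has'
            Timeout = #duration(0,0,15,0)       // #duration(days, hours, minutes, seconds)
        ])
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;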

&lt;h3&gt;
  
  
  Query Limits – MaxRows, MaxSize and NoTruncate
&lt;/h3&gt;

&lt;p&gt;By default, Azure Data Explorer has some limits on the results of queries. Those limits are in place to prevent you from accidentally pulling a huge amount of data from your cluster, which might impact the cluster’s performance and your costs. Azure Data Explorer’s limits are 500K rows and 64MB of data. Anything more than that, and you’ll get a nice and friendly error message. It also makes common sense – you usually don’t have anything useful to do with 3M rows.&lt;/p&gt;

&lt;p&gt;That being said, there are a few scenarios where retrieving more than 500K rows makes sense – one such scenario is when you’re retrieving a large dimension table, as discussed in part 2 of this series.&lt;/p&gt;

&lt;p&gt;We can use the MaxRows setting to change the 500K row limit, and the MaxSize setting to change the data size limit. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table", [MaxRows=1000000, MaxSize=100*1024*1024])
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set truncationmaxrecords=1000000;
set truncationmaxsize=104857600;
table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also use the NoTruncate option to remove all limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table", [NoTruncate=true])
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Additional Set Statements
&lt;/h3&gt;

&lt;p&gt;The AdditionalSetStatements option allows you to add any set statement you’d like to your query. While you can, in theory, add set statements as part of the query itself, doing so can break the conversion of your M queries to KQL: whenever your M query ends up performing a join between 2 (or more) data sources (which is common when working with multiple Direct Query sources), your set statements land in the middle of the generated KQL query, which is invalid. The AdditionalSetStatements option was created for exactly this purpose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table", [AdditionalSetStatements="set query_datascope=hotcache"])
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strings handling – CaseInsensitive and ForceUseContains
&lt;/h3&gt;

&lt;p&gt;The next 2 options are CaseInsensitive and ForceUseContains. These two options exist to close the gap between Power BI and Azure Data Explorer behaviors. Let’s start with CaseInsensitive.&lt;/p&gt;

&lt;p&gt;By default, Power BI is case-insensitive when working with filters and slicers. This means that if you have a slicer that displays a list of states and you’d like to filter them, you can type “ALABAMA” or “alabama”, and both will give you the same result. However, when working with the Azure Data Explorer Connector, the generated queries use the == operator, which is case-sensitive. Enabling the CaseInsensitive option will configure the connector to use the case-insensitive =~ operator. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table"),
    #"Filtered Rows" = Table.SelectRows(Source, each [State] = "aLaBama")
in
    #"Filtered Rows"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Will result in the following query (which probably won’t return any rows):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table
| where State == 'aLaBama'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table", [CaseInsensitive=true]),
    #"Filtered Rows" = Table.SelectRows(Source, each [State] = "aLaBama")
in
    #"Filtered Rows"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Will result in the following query (which is probably what you’d want):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table
| where State =~ 'aLaBama'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moving on to the ForceUseContains option, which serves a similar purpose. By default, when filtering values in Power BI slicers, PBI supports substrings. ADX, on the other hand, defaults to matching full tokens using the “has” operator (which is much more efficient, since it can query the ADX indices directly). So if we want to support filtering for states that contain “cali”, we need this option, which configures the connector to use the (less efficient, but better user experience) “contains” operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table", [ForceUseContains=true]),
    Filtered = Table.SelectRows(Source, (r) =&amp;gt; Text.Contains(r[State], "cali"))
in
    Filtered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which gets translated to the following KQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table
| where State contains 'cali'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Timeout
&lt;/h3&gt;

&lt;p&gt;Last but not least, we have the Timeout option. As the name suggests, it allows you to control the query timeout on both the Power BI client side and the Azure Data Explorer server side. Usually, the default of 4 minutes is more than enough; however, certain cases might warrant more time – loading a large dimension table, or performing a very compute-intensive query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("cluster", "database", "table", [Timeout=#duration(0,0,15,0)]) // 15 minutes: #duration(days, hours, minutes, seconds)
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you can see, there are multiple options that allow you to control the behavior of the connector and shape the generated KQL queries to match your exact needs and the expected user behavior.&lt;/p&gt;

</description>
      <category>adx</category>
      <category>azuredataexplorer</category>
      <category>powerbi</category>
      <category>pbi</category>
    </item>
    <item>
      <title>Advanced usages of Azure Data Explorer with Power BI Dynamic Parameters</title>
      <dc:creator>Itay Sagui</dc:creator>
      <pubDate>Tue, 09 Mar 2021 18:46:39 +0000</pubDate>
      <link>https://dev.to/saguiitay/advanced-usages-of-azure-data-explorer-with-power-bi-dynamic-parameters-4417</link>
      <guid>https://dev.to/saguiitay/advanced-usages-of-azure-data-explorer-with-power-bi-dynamic-parameters-4417</guid>
      <description>&lt;p&gt;This is the fourth post in this series, discussing the usage of Power BI to visualize Azure Data Explorer data. In the previous post, we saw how we can use Power BI’s new dynamic parameters feature to generate queries based on the values selected in slicers. There are a few scenarios that are perfect fit for this feature, which I’ll cover in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Data Explorer features not available in Power BI
&lt;/h2&gt;

&lt;p&gt;There are some features, such as percentile calculations and regular expression handling, which are not available in Power BI. This means that if you’d like to use them in your reports, you’ll have to resort to KQL. However, if you don’t “inject” your filters early on in the query, your cluster will waste a lot of time and resources calculating data that you don’t need.&lt;/p&gt;

&lt;p&gt;Let’s continue with our StormEvents sample. Let’s say that we’d like to know the 50th, 95th and 99th percentile of total damage done by EventType. If I were to write such a query in Azure Data Explorer, I’d end up with something along the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| extend TotalDamage = DamageCrops + DamageProperty
| summarize percentiles(TotalDamage, 50, 95, 99) by EventType
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if I’m interested in the data for a specific State, I’d add the filter as close to the table name as possible – we want to filter early, in order to reduce computation effort:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| where State == 'ALABAMA'
| extend TotalDamage = DamageCrops + DamageProperty
| summarize percentiles(TotalDamage, 50, 95, 99) by EventType
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since percentiles are not supported in Power BI, my base query will be similar to the first KQL query above, and any filtering will be added at the end. However, the first query doesn’t return a State column, making it impossible to filter based on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// THIS DOES NOT WORK
StormEvents
| extend TotalDamage = DamageCrops + DamageProperty
| summarize percentiles(TotalDamage, 50, 95, 99) by EventType
| where State == 'ALABAMA'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where a dynamic M parameter is a perfect fit. I can use the same query-construction mechanism we used in the previous post, together with the dynamic parameter, to build our desired query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    StateFilter = if (SelectedStates is null or Text.Length(SelectedStates) = 0) then "" else " | where State == '" &amp;amp; SelectedStates &amp;amp; "' ",
    Query = Text.Combine({
        "StormEvents ",
        StateFilter,
        "| extend TotalDamage = DamageCrops + DamageProperty | summarize (P50, P95, P99)=percentiles(TotalDamage, 50, 95, 99) by EventType"
    }),
    Source = AzureDataExplorer.Contents("https://help.kusto.windows.net", "Samples", Query, [MaxRows=null, MaxSize=null, NoTruncate=null, AdditionalSetStatements=null])
in
    Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate queries such as the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents | where State == 'ALASKA' | extend TotalDamage = DamageCrops + DamageProperty | summarize (P50, P95, P99)=percentiles(TotalDamage, 50, 95, 99) by EventType
// The part below is generated by PowerBI for the visualization
| summarize ["a0"]=sum(["P99"]), ["a1"]=sum(["P50"]) by ["EventType"]
| limit 1000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you can see the final results in the animation below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NugKO18E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/srwwnr1q8axwqxx6v8jb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NugKO18E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/srwwnr1q8axwqxx6v8jb.gif" alt="" width="880" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Invoking Azure Data Explorer Functions with Dynamic M Parameters
&lt;/h2&gt;

&lt;p&gt;Another scenario where Dynamic M Parameters shine is when using Azure Data Explorer Functions. ADX Functions allow you to write complex queries, and reuse them in a simple manner. Since functions can accept parameters, we can pass a Dynamic M Parameter to them, thus greatly simplifying our dataset.&lt;/p&gt;

&lt;p&gt;Let’s assume I created the following ADX Function in my cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.create-or-alter function StormsReportedBySource(source:string) {
    StormEvents
    | where Source == source
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can now use this ADX Function from Power BI, along with my SelectedSources parameter like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    Source = AzureDataExplorer.Contents("help", "Samples", null, [MaxRows=null, MaxSize=null, NoTruncate=null, AdditionalSetStatements=null]),
    StormsReportedBySource = Source{[Name="StormsReportedBySource"]}[Data],
    Data = StormsReportedBySource(SelectedSources)
in
    Data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results in a dataset that can be used in a visualization:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EFzoC8ku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1enphii4zz2s55xffpl8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EFzoC8ku--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1enphii4zz2s55xffpl8.gif" alt="" width="734" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Using an Azure Data Explorer Function with a Dynamic M Parameter&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see – it makes for a very simple Power BI dataset query, which can hide a lot of complexity that exists in the Azure Data Explorer query. This allows you to do the heavy lifting in Azure Data Explorer, with all its advanced capabilities, and make sure you pass the correct filters so the query is as efficient as possible.&lt;/p&gt;

</description>
      <category>adx</category>
      <category>azuredataexplorer</category>
      <category>powerbi</category>
      <category>pbi</category>
    </item>
    <item>
      <title>Using Azure Data Explorer with Power BI Dynamic Parameters</title>
      <dc:creator>Itay Sagui</dc:creator>
      <pubDate>Thu, 25 Feb 2021 13:17:30 +0000</pubDate>
      <link>https://dev.to/saguiitay/using-azure-data-explorer-with-power-bi-dynamic-parameters-4i1g</link>
      <guid>https://dev.to/saguiitay/using-azure-data-explorer-with-power-bi-dynamic-parameters-4i1g</guid>
      <description>&lt;p&gt;This is the third post in this series, discussing the usage of Power BI to visualize Azure Data Explorer data. If you followed those posts, we now have a report that uses 2 dimension tables for States and Sources loaded in Import mode (to reduce the number of queries being sent to ADX), and a facts table with StormEvents data, using Direct query mode (to have fresh, up-to-date data). The next thing to investigate, is how to use the dimension table, along with the recently announced Dynamic Parameters feature of Power BI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring the Dynamic Parameters in Power BI
&lt;/h2&gt;

&lt;p&gt;First, let’s define SelectedStates and SelectedSources parameters in our model. We’ll define 2 parameters of type Text, mark them as required, and provide a default value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wDmjb3_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnm0ps20c17j4afwa5ab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wDmjb3_U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnm0ps20c17j4afwa5ab.png" alt="" width="601" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we need to bind each one of these parameters to the respective column in the dimension tables. This is done by selecting the columns in the model, and under the Advanced section, select the parameter in the “Bind to parameter” option:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XQN5pfT---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rautdt16ynwp2qezfu0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XQN5pfT---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rautdt16ynwp2qezfu0e.png" alt="" width="516" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What this binding means is that whenever we select a value in a slicer bound to the column, the Parameter’s value will be updated to the selected value. This will allow us to perform filtering, pass values to ADX functions, and so on. For this post, I’ll set Multi-select to false for the States, and to true for the Sources. This will allow me to demonstrate how to handle both a single value and multiple values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating the Azure Data Explorer queries
&lt;/h2&gt;

&lt;p&gt;Now that we expect the selected values to be bound to our Parameters, we need to use those parameters in the Azure Data Explorer data source. We can do this in multiple ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We can manually construct the ADX query we pass to the connector – I’ll use this method for the simpler State Parameter.&lt;/li&gt;
&lt;li&gt;We can use M to fold our query, saving us from having to manipulate strings – I’ll use this method for the multi-value Sources Parameter.&lt;/li&gt;
&lt;li&gt;In a future post, when I discuss using Azure Data Explorer functions in Power BI, we’ll also be able to pass the Parameter to an ADX function.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;NOTICE: Options #2 and #3 are much more secure, since they avoid “KQL Injection” issues – if you work with strings, it’s your responsibility to make sure that the values passed to the Parameters are encoded as needed. One way to do so is to use KQL’s query parameters statement, passing the parameters as default values.&lt;/strong&gt;&lt;/p&gt;
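
&lt;p&gt;To sketch what that last approach looks like (the parameter name below is just an illustration), KQL’s query parameters statement declares a typed parameter with a default value, and the engine treats the supplied value as data rather than as query text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// StateParam is a hypothetical parameter name; "ALABAMA" is its default value
declare query_parameters(StateParam:string = "ALABAMA");
StormEvents
| where State == StateParam
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;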

&lt;p&gt;For the State Parameter, I’ll first check if it’s not empty, and construct a KQL filter statement, which I’ll concatenate to my query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StateFilter = if (SelectedStates is null or Text.Length(SelectedStates) = 0) then "" else " | where State == '" &amp;amp; SelectedStates &amp;amp; "'",
Source = AzureDataExplorer.Contents("https://help.kusto.windows.net", "Samples", "StormEvents" &amp;amp; StateFilter, [MaxRows=null, MaxSize=null, NoTruncate=null, AdditionalSetStatements=null])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R-BP4UkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqeitzmf06dqksi7lrbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R-BP4UkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqeitzmf06dqksi7lrbp.png" alt="" width="880" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the multi-value Sources Parameter, we’ll just make sure that the value is of type List. If it is, we’ll use it to invoke M’s List.Contains() function. If not, we’ll assume it’s a single text value, and create a list from it first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SourcesList = if (SelectedSources is null) then { } else if (SelectedSources is text) then { SelectedSources } else SelectedSources,
Folded = if (List.Count(SourcesList) &amp;gt; 0) then 
        Table.SelectRows(Source, each List.Contains(SourcesList, [Source]))
    else
        Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our final M query should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let
    StateFilter = if (SelectedStates is null or Text.Length(SelectedStates) = 0) then "" else " | where State == '" &amp;amp; SelectedStates &amp;amp; "'",
    Source = AzureDataExplorer.Contents("https://help.kusto.windows.net", "Samples", "StormEvents" &amp;amp; StateFilter, [MaxRows=null, MaxSize=null, NoTruncate=null, AdditionalSetStatements=null]),
    SourcesList = if (SelectedSources is null) then { } else if (SelectedSources is text) then { SelectedSources } else SelectedSources,
    Folded = if (List.Count(SourcesList) &amp;gt; 0) then 
            Table.SelectRows(Source, each List.Contains(SourcesList, [Source]))
        else
            Source
in
    Folded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can check the KQL query that will be sent to our Azure Data Explorer cluster by selecting the “View Native Query” feature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cRMP9Hod--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0r1p1qjoxq548zthqy5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cRMP9Hod--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0r1p1qjoxq548zthqy5k.png" alt="" width="311" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you should see a KQL query similar to this, just as we expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents | where State == 'ALABAMA'
| where ["Source"] in ("Newspaper")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Updating the relationships in our Power BI Model
&lt;/h2&gt;

&lt;p&gt;Since we’re now responsible for “injecting” the selected values of slicers into our query, we need to remove any relationships we have between our dimensions tables and our facts tables. Otherwise, we’ll end up with conflicting and inconsistent filtering, which will result in empty result sets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing our data
&lt;/h2&gt;

&lt;p&gt;At this point, there’s nothing left to do – we can start using our visualizations just like we did before, and see the data refreshed as we make selections in the slicers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wnVEydqb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ncu7ri9xzwz10j5brwc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wnVEydqb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ncu7ri9xzwz10j5brwc.gif" alt="" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>adx</category>
      <category>azuredataexplorer</category>
      <category>powerbi</category>
      <category>pbi</category>
    </item>
    <item>
      <title>Using dimensions when querying Azure Data Explorer using Power BI</title>
      <dc:creator>Itay Sagui</dc:creator>
      <pubDate>Mon, 22 Feb 2021 12:40:15 +0000</pubDate>
      <link>https://dev.to/saguiitay/using-dimensions-when-querying-azure-data-explorer-using-power-bi-2bdb</link>
      <guid>https://dev.to/saguiitay/using-dimensions-when-querying-azure-data-explorer-using-power-bi-2bdb</guid>
      <description>&lt;p&gt;In the previous post I discussed querying Azure Data Explorer using Power BI. I’ve shown how we can use the Azure Data Explorer connector to retrieve data from our cluster in Direct Query mode, and add a couple of filters and a pie chart. One of the things I mentioned, is that the report generates multiple queries to our ADX cluster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One query to retrieve the list of States (for the filter visualization)&lt;/li&gt;
&lt;li&gt;One query to retrieve the list of Sources (for the filter visualization)&lt;/li&gt;
&lt;li&gt;One query for the actual data needed for the pie-chart – the aggregated number of events by EventType&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we have a lot of visualizations, this number will grow and will impact the performance of our report. So, what can we do to reduce the number of queries? Well, based on our data model, quite a lot. I’ll touch briefly on the different options we have for reducing the impact on our clusters (and thus our COGS), but for now let’s focus on the main issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating dimension tables for Azure Data Explorer in Power BI
&lt;/h2&gt;

&lt;p&gt;The number of States and the number of Sources is not expected to change (probably never, and certainly not with high frequency). There’s no reason to retrieve those values every time a user views our report, but that’s exactly what happens when our filters are based on a Direct Query data source. Instead, we want to import those dimensions once into our report (and perhaps refresh them periodically) and reuse them for all our users. Let’s give it a try.&lt;/p&gt;

&lt;p&gt;We’ll start with creating 2 new data sources using the Azure Data Explorer connector. Notice how this time, we provide a KQL query, instead of just a table name, and how we select the Import mode (instead of Direct Query):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ubBRAWZd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/madip9cwqtcb4spap4il.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ubBRAWZd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/madip9cwqtcb4spap4il.png" alt="" width="701" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll use the following queries. For retrieving the list of unique States:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| distinct State
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and for the unique list of Sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| distinct Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Relationships and Visualizations
&lt;/h2&gt;

&lt;p&gt;If we just try to update our visualizations to use the new data source, we’ll lose the cross-filter interactivity – the pie-chart won’t be updated based on our filter selections. We first need to create the relationships between our Dimension tables (States &amp;amp; Sources) and our original Facts table.&lt;/p&gt;

&lt;p&gt;We’ll create both relationships as many-to-many – remember, ADX does not have strong relationships between tables (although you can simulate them using the join operator). We could probably define a one-to-many relationship in this case, since the number of values in our dimension tables is rather low, but doing so has no benefit – the same queries will be generated either way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z2otYLOt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e9ipx0a2p2zewq23z23x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z2otYLOt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e9ipx0a2p2zewq23z23x.png" alt="Image description" width="555" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Eo8E3RyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30xv09wj3a9ipjy8lwuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Eo8E3RyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30xv09wj3a9ipjy8lwuq.png" alt="Image description" width="707" height="683"&gt;&lt;/a&gt;&lt;/p&gt;
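&lt;p&gt;As a side note, if you ever do need to simulate such a relationship directly in ADX, the join operator can do it. The query below is just an illustrative sketch – it assumes a hypothetical “Sources” dimension table that also has a “Source” column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| join kind=inner (Sources) on Source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;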

&lt;p&gt;We can now go back to our report, and select different values in our filters. Notice how the data in the pie-chart gets updated based on our filter selection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zn7d4h-r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ggbu1l7bmtc336j5gwgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zn7d4h-r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ggbu1l7bmtc336j5gwgy.png" alt="Image description" width="880" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you investigate the query generated by Power BI, you’ll see something along these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StormEvents
| where (["State"] == "CALIFORNIA") and (["Source"] in ("Broadcast Media","AWOS","ASOS","Amateur Radio","Airplane Pilot"))
| summarize ["a0"]=countif(isnotnull(["EventId"])) by ["EventType"]
| limit 1000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the ADX connector added a filter for the California state, as well as a multi-valued filter (using the “in” operator) for the selected Sources.&lt;/p&gt;

&lt;p&gt;There are a few things that are worth mentioning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can achieve the same behavior with Direct Query dimension tables, but the generated queries will be much less efficient, and will use the “join” KQL operator instead of simple filters.&lt;/li&gt;
&lt;li&gt;In the example above, I’ve used 2 separate dimension tables. This has the drawback that even when I select a State, I still see the full list of Sources – there’s no cross-filtering between these 2 filters.&lt;/li&gt;
&lt;li&gt;I’d also recommend enabling “Add an Apply button to each slicer to apply the changes when you’re ready” (from the “Query reduction” settings section) – Power BI tends to be quite chatty, which might impact your cluster.&lt;/li&gt;
&lt;li&gt;Lastly, when I created the dimension tables, I used a query to retrieve the list of unique States and Sources. If you have huge amounts of data, or if the calculation of the dimensions is more complex, that might be too slow. In such cases, I’d suggest looking into creating a Materialized View, or using an Update Policy to pre-create the data.&lt;/li&gt;
&lt;/ul&gt;
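&lt;p&gt;For example, a materialized view that keeps the distinct list of States up to date could look something like this (an illustrative sketch – the view name is my own, and since materialized views require an aggregation function, I use count() and simply ignore that column):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.create materialized-view States on table StormEvents
{
    StormEvents
    | summarize count() by State
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;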

</description>
      <category>adx</category>
      <category>azuredataexplorer</category>
      <category>powerbi</category>
      <category>pbi</category>
    </item>
  </channel>
</rss>
