Tatsuro Shibamura

Posted on Oct 2, 2020 • Edited on Mar 12, 2021

Query Acceleration for Data Lake Storage is lightweight Synapse Analytics

#azure #storage #datalake #synapse

The feature to query the contents of files stored in Azure Data Lake Storage (formerly known as "Quick Query") is now GA!

You can achieve similar functionality in Synapse Analytics SQL on-demand and OPENROWSET, but Query Acceleration is designed to be built into your application.

Query Acceleration for Azure Data Lake Storage is now generally available

The announcement was made during the "What's New in Azure Storage" session at Ignite 2020.

According to Microsoft, less than 20% of the data read is used for the actual analysis queries, so the system is designed to optimize the overall results by allowing you to filter close to the data.

The name of the feature is Query Acceleration for Azure Data Lake Storage, but it can also be used for GPv2 storage accounts with the Hierarchical namespace disabled, which makes it a little confusing.

In terms of the SDK, Blob Storage and Data Lake Storage have different names, but they have the same interface, so you shouldn't have any problems with them. To be honest, you can use ADLS with the Blob Storage SDK.

Feature and pricing

Although it looks similar to AWS Athena, Query Acceleration is easy to use because you don't need to create tables beforehand, you just need to execute a query.

It supports CSV and JSON data formats.

Azure Data Lake Storage query acceleration | Microsoft Docs

As for SQL, it has some compatibility with T-SQL, but there are many functions that are not available.

GROUP BY is one of the most common unsupported features.

Query acceleration SQL language reference | Microsoft Docs

It is a billing system that I am interested in, but the document says that it is billed for the scanned data and the data returned to the client.

Azure Data Lake Storage Gen2 pricing | Microsoft Azure

It is mentioned in the Pricing section, but I can't calculate it because I don't know the unit of measure.

Let's try to use Query Acceleration from the application. The usage is slightly different between CSV and JSON, so I try each format.

Query for CSV file

Install the C# SDK 12.6.0 or higher as published in NuGet. This is the supported version.

https://www.nuget.org/packages/Azure.Storage.Blobs/

At first I was expecting to be able to write the query on a per-container basis, but I had to run the query on a per-blob basis, which is a bit of a pain if you're filtering from a large number of files.

Honestly, it would have been much better if we could have used wildcards to specify Blobs, but since we're only doing this for a single file, we'll just use simple code to do it. Essentially, all I'm doing is setting up BlobQueryOptions and running QueryAsync. Other than that, it's the same code as usual.



var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net";

var blobServiceClient = new BlobServiceClient(connectionString);
var containerClient = blobServiceClient.GetBlobContainerClient("sampledata");

var blobClient = containerClient.GetBlockBlobClient("***.csv");

var options = new BlobQueryOptions
{
    InputTextConfiguration = new BlobQueryCsvTextOptions
    {
        HasHeaders = false
    }
};

var result = await blobClient.QueryAsync("SELECT * FROM BlobStorage WHERE _7 < 1000", options);

var content = await new StreamReader(result.Value.Content).ReadToEndAsync();

Console.WriteLine(content);

The InputTextConfiguration property is set to BlobQueryCsvTextOptions. This class has properties for delimiters and escape sequences, so you may want to change them as needed.

Setting HasHeaders property to true is useful because you can access CSV column names from SQL.

If you set HasHeaders property to false, you can't access the column by name, but you can access it by index like _1, _2, or _3, which is how I use it to write my filtering queries.

Query for JSON file

The next step is to try querying JSON.

Since each row of data will be represented as a single JSON, you may think that the log output from Azure Monitor is interesting to analyze.

Unfortunately, you can't use Query Acceleration on the log from Azure Monitor because Append Blob is used.

Because it was difficult to prepare the data, I used the access log which App Service outputs via Azure Monitor, which I copied again as Block Blobs.

Since there are about 300 files, I acquire the list of Blobs and execute the query sequentially.



var connectionString = "DefaultEndpointsProtocol=https;AccountName=***;AccountKey=***;EndpointSuffix=core.windows.net";

var blobServiceClient = new BlobServiceClient(connectionString);
var containerClient = blobServiceClient.GetBlobContainerClient("httplogs");

await foreach (var blobItem in containerClient.GetBlobsAsync(BlobTraits.Metadata))
{
    var blobClient = containerClient.GetBlockBlobClient(blobItem.Name);

    var options = new BlobQueryOptions
    {
        InputTextConfiguration = new BlobQueryJsonTextOptions(),
        OutputTextConfiguration = new BlobQueryJsonTextOptions()
    };

    var result = await blobClient.QueryAsync("SELECT CsMethod, UserAgent FROM BlobStorage[*].properties WHERE CsUriStem = '/'", options);

    var content = await new StreamReader(result.Value.Content).ReadToEndAsync();

    Console.Write(content);
}

When targeting JSON, if you don't explicitly set the InputTextConfiguration property to BlobQueryJsonTextOptions, it will try to read the text as a CSV and return an error.

Similarly, if you don't specify the OutputTextConfiguration property, it will be returned as a CSV file.

When you run this sample code, it will output an HTTP log filtered by the query.

I've often wanted to use Synapse Analytics' SQL on-demand to write queries against CSV files stored in Azure Data Lake Storage, but if you don't need complex queries and want to use them from your application, I think Query Acceleration can be quite useful.

Top comments (2)

SG • Nov 6 '21

I am getting "XML specified is not syntactically valid" error. I tried with both JSON and CSV code. Same issue. any idea why i am getting this, i used exactly the same code given in this article.
Azure.RequestFailedException: 'XML specified is not syntactically valid.
RequestId:fcd98dd6-201e-0017-1d48-d3f368000000
Time:2021-11-06T19:54:38.4445463Z
Status: 400 (XML specified is not syntactically valid.)
ErrorCode: InvalidXmlDocument

Paweł Bulwan • Dec 13 '21

I was having the same issue, both for CSV and JSON. I downgraded from Azure.Storage.Blobs 12.10.0 to 12.8.0 and it helped - queries work now. I suspect regression on the library side.