<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: swatiBabber</title>
    <description>The latest articles on DEV Community by swatiBabber (@swatibabber).</description>
    <link>https://dev.to/swatibabber</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F516926%2F4a1003c8-a345-4b4f-86c1-afcd3fe09a7b.png</url>
      <title>DEV Community: swatiBabber</title>
      <link>https://dev.to/swatibabber</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/swatibabber"/>
    <language>en</language>
    <item>
      <title>Performance tip for Cosmos DB collection migration using ADF</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Mon, 21 Jun 2021 11:32:15 +0000</pubDate>
      <link>https://dev.to/swatibabber/performance-tip-for-cosmos-db-collection-migration-using-adf-289p</link>
      <guid>https://dev.to/swatibabber/performance-tip-for-cosmos-db-collection-migration-using-adf-289p</guid>
      <description>&lt;p&gt;If your Cosmos DB migration pipeline is taking a long time, the &lt;strong&gt;first thing&lt;/strong&gt; to check is the Overview page of your Cosmos DB account for the throttling metric in the monitoring chart.&lt;/p&gt;

&lt;p&gt;The Overview page in the Azure portal for each Azure Cosmos database includes a brief view of the database usage, including its request and hourly billing usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xMqml6bR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k0gkp73el8dfmp2bisa8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xMqml6bR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k0gkp73el8dfmp2bisa8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you see multiple (more than one to five) HTTP 429 responses in the chart, it means the provisioned throughput is not enough for the amount of data you are migrating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-create containers with enough RUs&lt;/strong&gt;&lt;br&gt;
Although Azure Cosmos DB scales out storage automatically, it is not advisable to start from the smallest container size. Smaller containers have lower throughput availability, which means that the migration would take much longer to complete. Instead, it is useful to create the containers with the final data size and make sure that the migration workload is fully consuming the provisioned throughput.&lt;/p&gt;

&lt;p&gt;If the data size is estimated to be around 60 TB, a container of at least 2.4 million RUs is required to accommodate the entire dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Estimate the migration speed&lt;/strong&gt;&lt;br&gt;
Assuming that the migration workload can consume the entire provisioned throughput, that throughput provides an estimate of the migration speed. Continuing the previous example, 5 RUs are required to write a 1-KB document to an Azure Cosmos DB SQL API account.&lt;br&gt;
2.4 million RUs would allow a transfer of 480,000 documents per second (or 480 MB/s). This means that the complete migration of 60 TB will take 125,000 seconds, or about 34 hours.&lt;/p&gt;

&lt;p&gt;In case you want the migration to be completed within a day, you should increase the provisioned throughput to 5 million RUs.&lt;/p&gt;
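&lt;p&gt;Tying the numbers above together, here is a quick sketch of the estimate (a hedged illustration, not ADF code; it uses this post's example figures: 5 RU per 1-KB write, and decimal units where 1 TB = 1,000,000 MB):&lt;/p&gt;

```python
# Rough migration-time estimate using the example figures from this post.
provisioned_ru = 2_400_000   # RU/s provisioned on the target container
ru_per_write = 5             # assumed RU cost of writing one 1-KB document
doc_size_kb = 1

docs_per_sec = provisioned_ru // ru_per_write       # documents written per second
mb_per_sec = docs_per_sec * doc_size_kb // 1000     # throughput in MB/s
dataset_mb = 60 * 1_000_000                         # 60 TB, decimal units

seconds = dataset_mb // mb_per_sec
hours = seconds / 3600
print(docs_per_sec, mb_per_sec, seconds, round(hours, 1))
# prints: 480000 480 125000 34.7
```

&lt;p&gt;Re-running the same arithmetic with 5 million RU/s brings the estimate well under a day, which is the rationale for the suggestion above.&lt;/p&gt;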

&lt;p&gt;&lt;strong&gt;Secondly, determine if there is a hot partition&lt;/strong&gt;&lt;br&gt;
To verify whether there is a hot partition, navigate to Insights &amp;gt; Throughput &amp;gt; Normalized RU Consumption (%) By PartitionKeyRangeID, and filter to a specific database and container.&lt;/p&gt;

&lt;p&gt;Each PartitionKeyRangeId maps to one physical partition. If one PartitionKeyRangeId has significantly higher Normalized RU Consumption than the others (for example, one is consistently at 100% while the others are at 30% or less), this can be a sign of a hot partition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5NN_r1dP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ucm0fbn4p1gft97ulmps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5NN_r1dP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ucm0fbn4p1gft97ulmps.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see which logical partition keys are consuming the most RU/s, use Azure Diagnostic Logs. This sample query sums up the total request units consumed per second on each logical partition key.&lt;/p&gt;

&lt;p&gt;Kusto:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AzureDiagnostics&lt;br&gt;
| where TimeGenerated &amp;gt;= ago(24h)&lt;br&gt;
| where Category == "PartitionKeyRUConsumption"&lt;br&gt;
| where collectionName_s == "CollectionName"&lt;br&gt;
| where isnotempty(partitionKey_s)&lt;br&gt;
// Sum total request units consumed by logical partition key for each second&lt;br&gt;
| summarize sum(todouble(requestCharge_s)) by partitionKey_s, operationType_s, bin(TimeGenerated, 1s)&lt;br&gt;
| order by sum_requestCharge_s desc&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This sample output shows that in a particular second, the logical partition key with value "Contoso" consumed around 12,000 RU/s, while the logical partition key with value "Fabrikam" consumed less than 600 RU/s. If this pattern was consistent during the period when rate limiting occurred, it would indicate a hot partition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PnaezQVa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv9ml5hkzdisvqa9b4fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PnaezQVa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv9ml5hkzdisvqa9b4fe.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>adfinterviewquestions</category>
      <category>adf</category>
      <category>azure</category>
      <category>performance</category>
    </item>
    <item>
      <title>Array Rotation</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Tue, 20 Apr 2021 08:53:32 +0000</pubDate>
      <link>https://dev.to/swatibabber/array-rotation-1d2g</link>
      <guid>https://dev.to/swatibabber/array-rotation-1d2g</guid>
      <description>&lt;p&gt;Use &lt;strong&gt;array reversal&lt;/strong&gt; for array rotation.&lt;br&gt;
Use two pointers for the array reversal.&lt;/p&gt;

&lt;p&gt;public class Solution {&lt;br&gt;
    public void Rotate(int[] nums, int k) {&lt;br&gt;
        k = k % nums.Length;&lt;br&gt;
        reverse(nums, 0, nums.Length-1);&lt;br&gt;
        reverse(nums, 0, k-1);&lt;br&gt;
        reverse(nums, k, nums.Length-1);&lt;br&gt;
    }&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void reverse(int[] nums, int start , int end)
{
    while(start&amp;lt;end)
    {
        int temp=nums[start];
        nums[start]=nums[end];
        nums[end]=temp;
        start++;
        end--;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;}&lt;/p&gt;
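&lt;p&gt;For readers who prefer Python, the same reversal-based rotation can be sketched like this (equivalent logic to the C# above; a right rotation by k):&lt;/p&gt;

```python
def reverse(nums, start, end):
    # two-pointer swap, written as a count of symmetric swaps
    for i in range((end - start + 1) // 2):
        nums[start + i], nums[end - i] = nums[end - i], nums[start + i]

def rotate(nums, k):
    n = len(nums)
    k %= n                    # rotating by n is a no-op
    reverse(nums, 0, n - 1)   # reverse the whole array
    reverse(nums, 0, k - 1)   # reverse the first k elements
    reverse(nums, k, n - 1)   # reverse the remaining elements

nums = [1, 2, 3, 4, 5, 6, 7]
rotate(nums, 3)
print(nums)  # prints: [5, 6, 7, 1, 2, 3, 4]
```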

</description>
      <category>twopointerapproach</category>
      <category>reversearray</category>
    </item>
    <item>
      <title>Remove duplicates from Sorted Array</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Mon, 19 Apr 2021 16:41:20 +0000</pubDate>
      <link>https://dev.to/swatibabber/remove-duplicates-from-sorted-array-4l4h</link>
      <guid>https://dev.to/swatibabber/remove-duplicates-from-sorted-array-4l4h</guid>
      <description>&lt;p&gt;If the array is sorted, use the two pointer approach to find the positions of the non-duplicate items in the array.&lt;/p&gt;

&lt;p&gt;public class Solution {&lt;br&gt;
    public int RemoveDuplicates(int[] nums) &lt;br&gt;
    {&lt;br&gt;
         int len=0;&lt;br&gt;
         if(nums.Length==0)&lt;br&gt;
             { return 0; }&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    for(int i=0;i&amp;lt;nums.Length-1;i++)
     {
       if( nums[i]!=nums[i+1])
       {
           len=len+1;
           nums[len]=nums[i+1];
       }
     }
   return len+1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;}&lt;/p&gt;
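&lt;p&gt;A Python sketch of the same two-pointer idea (a read pointer scans the sorted array while a write pointer marks the last unique element kept):&lt;/p&gt;

```python
def remove_duplicates(nums):
    # in-place dedup of a sorted list; returns the count of unique elements
    if not nums:
        return 0
    write = 0  # index of the last unique element kept so far
    for read in range(1, len(nums)):
        if nums[read] != nums[write]:
            write += 1
            nums[write] = nums[read]
    return write + 1

nums = [1, 1, 2, 2, 3]
count = remove_duplicates(nums)
print(count, nums[:count])  # prints: 3 [1, 2, 3]
```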

</description>
      <category>twopointerapproach</category>
    </item>
    <item>
      <title>ADF-Mapping data flows performance and tuning</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Mon, 04 Jan 2021 22:01:54 +0000</pubDate>
      <link>https://dev.to/swatibabber/adf-mapping-data-flows-performance-and-tuning-51pp</link>
      <guid>https://dev.to/swatibabber/adf-mapping-data-flows-performance-and-tuning-51pp</guid>
      <description>&lt;p&gt;It is very important to understand the compute logic behind data flows to tune the performance of the data flow pipeline. Data flows utilize a Spark optimizer that reorders and runs your business logic in 'stages' to perform as quickly as possible.&lt;/p&gt;

&lt;p&gt;For each sink that your data flow writes to, the monitoring output lists the duration of each transformation stage, along with the time it takes to write data into the sink. &lt;strong&gt;The stage with the largest duration is likely the bottleneck of your data flow.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the stage that takes the longest contains a source, then you may want to look at further optimizing your read time.&lt;/li&gt;
&lt;li&gt;If a transformation is taking a long time, then you may need to repartition or increase the size of your integration runtime.&lt;/li&gt;
&lt;li&gt;If the sink processing time is large, you may need to scale up your database or verify you are not outputting to a single file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have identified the bottleneck of your data flow, use the optimization strategies below to improve performance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimize&lt;/strong&gt;:&lt;br&gt;
The Optimize tab contains settings to configure the partitioning scheme of the Spark cluster. This tab exists in every transformation of data flow and specifies whether you want to repartition the data after the transformation has completed. Adjusting the partitioning provides control over the distribution of your data across compute nodes and data locality optimizations that can have both positive and negative effects on your overall data flow performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logging level&lt;/strong&gt; :&lt;br&gt;
If you do not require every pipeline execution of your data flow activities to fully log all verbose telemetry logs, you can optionally set your logging level to "Basic" or "None".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimizing the Azure Integration Runtime&lt;/strong&gt;&lt;br&gt;
Data flows run on Spark clusters that are spun up at run-time. The configuration for the cluster used is defined in the integration runtime (IR) of the activity. There are three performance considerations to make when defining your integration runtime: cluster type, cluster size, and time to live.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Cluster type: General purpose, Memory optimized, and Compute optimized.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;General purpose clusters&lt;/strong&gt; are the default selection and will be ideal for most data flow workloads. These tend to be the best balance of performance and cost.&lt;/p&gt;

&lt;p&gt;If your data flow has many joins and lookups, you may want to use a &lt;strong&gt;memory optimized cluster&lt;/strong&gt;. They can store more data in memory and will minimize any out-of-memory errors you may get. Memory optimized have the highest price-point per core, but also tend to result in more successful pipelines. If you experience any out of memory errors when executing data flows, switch to a memory optimized Azure IR configuration.&lt;/p&gt;

&lt;p&gt;For simpler, non-memory intensive data transformations such as filtering data or adding derived columns, &lt;strong&gt;compute-optimized clusters&lt;/strong&gt; can be used at a cheaper price per core.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cluster size&lt;br&gt;
Data flows distribute the data processing over different nodes in a Spark cluster to perform operations in parallel. A Spark cluster with more cores increases the number of nodes in the compute environment. More nodes increase the processing power of the data flow. Increasing the size of the cluster is often an easy way to reduce the processing time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time to live&lt;br&gt;
By default, every data flow activity spins up a new cluster based upon the IR configuration. Cluster start-up takes a few minutes, and data processing can't start until it is complete. If your pipelines contain multiple sequential data flows, you can enable a time to live (TTL) value. Specifying a time to live keeps the cluster alive for a certain period after its execution completes. If a new job starts using the IR during the TTL window, it will reuse the existing cluster and start-up time will be greatly reduced. After the second job completes, the cluster will again stay alive for the TTL time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only one job can run on a single cluster at a time. If there is an available cluster, but two data flows start, only one will use the live cluster. The second job will spin up its own isolated cluster.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Optimize Source&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select proper partitioning depending on the source. &lt;/li&gt;
&lt;li&gt;For file-based sources, avoid using a ForEach activity, as every iteration will spin up a new Spark cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;5. &lt;strong&gt;Optimize Sink&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disable indexes during the load, if it is a SQL sink.&lt;/li&gt;
&lt;li&gt;Scale up the database, if it is a SQL sink.&lt;/li&gt;
&lt;li&gt;Enable staging for Synapse.&lt;/li&gt;
&lt;li&gt;Use the Spark-native Parquet format for file-based sinks.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>adfmappingdataflows</category>
      <category>adfinterview</category>
      <category>adfinterviewquestions</category>
      <category>adfperformance</category>
    </item>
    <item>
      <title>ADF-Mapping Data Flows Debug Mode</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Mon, 04 Jan 2021 21:43:26 +0000</pubDate>
      <link>https://dev.to/swatibabber/adf-mapping-data-flows-debug-mode-1mg9</link>
      <guid>https://dev.to/swatibabber/adf-mapping-data-flows-debug-mode-1mg9</guid>
      <description>&lt;p&gt;Azure Data Factory mapping data flow's debug mode allows you to interactively watch the data shape transform while you build and debug your data flows. &lt;/p&gt;

&lt;p&gt;1. The debug session can be used both in Data Flow design sessions as well as during pipeline debug execution of data flows.&lt;/p&gt;

&lt;p&gt;2. When Debug mode is on, you'll interactively build your data flow with an active Spark cluster. The session will close once you turn debug off in Azure Data Factory. &lt;strong&gt;You should be aware of the hourly charges incurred by Azure Databricks during the time that you have the debug session turned on&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;3. If you have parameters in your Data Flow or any of its referenced datasets, you can specify what values to use during debugging by selecting the &lt;strong&gt;Parameters tab in Debug Settings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;File sources only limit the rows that you see, not the rows being read.&lt;/strong&gt; For very large datasets, it is recommended that you take a small portion of the file and use it for your testing. You can select a temporary file in Debug Settings for each source that is a file dataset type.&lt;/p&gt;

&lt;p&gt;5. &lt;strong&gt;When running in Debug Mode in Data Flow, your data will not be written to the Sink transform.&lt;/strong&gt; A debug session is intended to serve as a test harness for your transformations.&lt;/p&gt;

&lt;p&gt;6. When unit testing Joins, Exists, or Lookup transformations, make sure that you use a small set of known data for your test. You can use the Debug Settings option above to set a temporary file to use for your testing. This is needed because when limiting or sampling rows from a large dataset, &lt;strong&gt;you cannot predict which rows and which keys will be read into the flow for testing.&lt;/strong&gt; The result is non-deterministic, meaning that your join conditions may fail.&lt;/p&gt;

&lt;p&gt;7. When executing a debug pipeline run with a data flow, you have two options for which compute to use: you can either use an existing debug cluster or spin up a new just-in-time cluster for your data flows.&lt;br&gt;
Using an existing debug session greatly reduces the data flow start-up time since the cluster is already running, but it is not recommended for complex or parallel workloads, as it may fail when multiple jobs run at once.&lt;/p&gt;

</description>
      <category>mappingdataflows</category>
      <category>debuggingadf</category>
      <category>adfinterview</category>
      <category>adfinterviewquestions</category>
    </item>
    <item>
      <title>How does Azure Data Factory work?</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Mon, 04 Jan 2021 14:28:22 +0000</pubDate>
      <link>https://dev.to/swatibabber/how-does-azure-data-factory-work-j5n</link>
      <guid>https://dev.to/swatibabber/how-does-azure-data-factory-work-j5n</guid>
      <description>&lt;p&gt;Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connect and Collect&lt;/strong&gt;: The first step in building an information production system is to connect to all the required sources of data and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next step is to move the data as needed to a &lt;strong&gt;centralized location&lt;/strong&gt; for subsequent processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. It's expensive and hard to integrate and maintain such systems. In addition, they often lack the enterprise-grade monitoring, alerting, and the controls that a fully managed service can offer.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Transform and enrich&lt;/strong&gt;: After data is present in a centralized data store in the cloud, process or transform the collected data by using ADF mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that execute on Spark without needing to understand Spark clusters or Spark programming.&lt;/p&gt;

&lt;p&gt;If you prefer to code transformations by hand, ADF supports external activities for executing your transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;CI/CD and publish&lt;/strong&gt;:&lt;br&gt;
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL Database, Azure CosmosDB, or whichever analytics engine your business users can point to from their business intelligence tools.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Monitor&lt;/strong&gt;: After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.&lt;/p&gt;

</description>
      <category>adfinterviewquestions</category>
      <category>azure</category>
      <category>adfinterview</category>
    </item>
    <item>
      <title>Why Azure Data Factory?</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Mon, 04 Jan 2021 11:59:57 +0000</pubDate>
      <link>https://dev.to/swatibabber/why-azure-data-factory-1p07</link>
      <guid>https://dev.to/swatibabber/why-azure-data-factory-1p07</guid>
      <description>&lt;p&gt;I will explain this with an example:&lt;br&gt;
&lt;strong&gt;Scenario&lt;/strong&gt;: A gaming company that collects petabytes of game logs that are produced by games in the cloud.&lt;br&gt;
&lt;strong&gt;Business Requirements&lt;/strong&gt;: The company wants to analyze these logs to gain insights into customer preferences, demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Requirements&lt;/strong&gt;:&lt;br&gt;
1. To analyze these logs, the company needs to use reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store.&lt;br&gt;
2. The company wants to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud data store.&lt;br&gt;
3. To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight).&lt;br&gt;
4. It wants to publish the transformed data into a cloud data warehouse such as Azure Synapse Analytics, to easily build reports on top of it.&lt;br&gt;
5. It wants to automate this workflow, and monitor and manage it on a daily schedule, also executing it when files land in a blob store container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution using ADF&lt;/strong&gt;:&lt;br&gt;
Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure DataBricks, and Azure SQL Database.&lt;/p&gt;

&lt;p&gt;Additionally, you can publish your transformed data to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.&lt;/p&gt;

&lt;p&gt;Check out next article in the series:&lt;br&gt;
&lt;a href="https://dev.to/swatibabber/how-does-azure-data-factory-work-j5n"&gt;https://dev.to/swatibabber/how-does-azure-data-factory-work-j5n&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C51fGRA7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qtibxgtfwlph7rxtcpa8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C51fGRA7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qtibxgtfwlph7rxtcpa8.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>adfinterviewtips</category>
      <category>azureinterview</category>
      <category>adf</category>
      <category>adftips</category>
    </item>
    <item>
      <title>Data Factory - Azure AD Authentication for SQL Database</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Tue, 24 Nov 2020 09:46:28 +0000</pubDate>
      <link>https://dev.to/swatibabber/data-factory-azure-ad-authentication-for-sql-database-4hml</link>
      <guid>https://dev.to/swatibabber/data-factory-azure-ad-authentication-for-sql-database-4hml</guid>
      <description>&lt;p&gt;To use Azure identity authentication, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provision an Azure Active Directory administrator for your &lt;br&gt;
server on the Azure portal&lt;/strong&gt;, if not already done : This is an &lt;br&gt;
important step as for adding Azure data factory's managed &lt;br&gt;
identity as a user in SQL, an Azure AD account having &lt;strong&gt;at &lt;br&gt;
least "ALTER ANY USER"&lt;/strong&gt; permission is required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the 2nd step is performed using a SQL-authenticated account, the SQL command will return an error.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create contained database users for the Azure Data Factory managed identity&lt;/strong&gt;. Connect to the database from or to which you want to copy data by using tools like SQL Server Management Studio, with an Azure AD identity that has at least ALTER ANY USER permission. Run the following T-SQL:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grant the Data Factory managed identity the needed permissions, as you normally do for SQL users and others. Run the following code:&lt;br&gt;
&lt;em&gt;ALTER ROLE [role name] ADD MEMBER [your Data Factory name]&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Refer to this &lt;a href="https://www.sqlshack.com/working-azure-active-directory-azure-sql-database/"&gt;link&lt;/a&gt; for a demo.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>datafactory</category>
      <category>azureactivedirectory</category>
      <category>sqldatabase</category>
    </item>
    <item>
      <title>Third party REST API(OAuth) call using Azure Data Factory-Web Activity</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Fri, 20 Nov 2020 08:44:01 +0000</pubDate>
      <link>https://dev.to/swatibabber/third-party-rest-api-call-from-azure-data-factory-web-activity-using-oauth2-5agi</link>
      <guid>https://dev.to/swatibabber/third-party-rest-api-call-from-azure-data-factory-web-activity-using-oauth2-5agi</guid>
      <description>&lt;p&gt;One of the common requirements in data flow pipelines is to retrieve data from a REST endpoint and copy it to a data store.&lt;/p&gt;

&lt;p&gt;When I started working on ADF, I found a confusing list of concepts available for implementing scenarios like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Linked Service using Anonymous/Basic authentication to connect 
to REST endpoint.&lt;/li&gt;
&lt;li&gt;Linked Service using MSI/AAD service principal to connect to 
REST endpoint.&lt;/li&gt;
&lt;li&gt;Web Activity to fetch a time-based OAuth (access) token using 
credentials in the HTTP POST body.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;first option&lt;/strong&gt; helps in cases where you are trying to access Azure or third-party REST APIs where no authorization (access token) is required. In this case, you can create a Linked Service using Anonymous or Basic authentication to access the data available at the endpoint.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;second option&lt;/strong&gt; lets you access only &lt;strong&gt;Azure APIs/services/endpoints&lt;/strong&gt;, by providing either the managed service identity or a service principal to authenticate and authorize via AAD.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;third option&lt;/strong&gt; is used when you want to access a &lt;strong&gt;third-party REST API&lt;/strong&gt; which requires authentication as well as authorization (OAuth). In this case the Linked Service approach does not work, and a Web activity in a pipeline is required to fetch the access token. This access token is then used in subsequent calls to the REST endpoint.&lt;/p&gt;

&lt;p&gt;Below are the steps to implement the third option (assuming the username and password are stored in Key Vault):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Web activity to fetch the username from AKV.&lt;/li&gt;
&lt;li&gt;Create another Web activity to fetch the password from AKV.&lt;/li&gt;
&lt;li&gt;Create another Web activity to do a POST call with a JSON request body consisting of the username/password. The output of this third activity gives the access token, which can be used in subsequent calls to the REST API.&lt;/li&gt;
&lt;/ol&gt;
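&lt;p&gt;Outside ADF, the shape of the requests these three activities produce looks roughly like this (a hedged sketch; the token URL and JSON field names are hypothetical placeholders, since every OAuth provider defines its own):&lt;/p&gt;

```python
import json
import urllib.request

# hypothetical token endpoint; in ADF this is the Web activity's URL setting
TOKEN_URL = "https://example.com/oauth/token"

def build_token_request(username, password):
    # mirrors the third Web activity: a POST with the credentials
    # (fetched from Key Vault) in the JSON request body
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_api_request(url, access_token):
    # subsequent REST calls carry the returned token as a Bearer header
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer " + access_token}
    )
```

&lt;p&gt;In the actual pipeline, the token is not built by hand like this; the output of the third Web activity is referenced in the subsequent activity's Authorization header.&lt;/p&gt;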

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Um0QY5V3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8231ock3aswm57fprwu3.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Um0QY5V3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8231ock3aswm57fprwu3.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Refer to ADF lab 3 below for a sample of this.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJ70wriM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Mmodarre"&gt;
        Mmodarre
      &lt;/a&gt; / &lt;a href="https://github.com/Mmodarre/AzureDataFactoryHOL"&gt;
        AzureDataFactoryHOL
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Azure Data Factory Hands On Lab - Step by Step - A Comprehensive Azure Data Factory and Mapping Data Flow step by step tutorial
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://raw.githubusercontent.com/Mmodarre/AzureDataFactoryHOL/master/.//media/image1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rHopso5n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Mmodarre/AzureDataFactoryHOL/master/./media/image1.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
Note: This is a work in progress and any feedback and collaboration is really appreciated. New exercises will be added soon.&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ELT with Azure Data Factory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mapping Data Flows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hands-on lab step-by-step&lt;/p&gt;

&lt;p&gt;Feb 2020&lt;/p&gt;

&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Mmodarre/AzureDataFactoryHOL"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;
 

</description>
      <category>oauth</category>
      <category>datafactory</category>
      <category>restapi</category>
      <category>managedidentity</category>
    </item>
    <item>
      <title>Why use Key Vault in ADF?</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Wed, 18 Nov 2020 19:58:03 +0000</pubDate>
      <link>https://dev.to/swatibabber/why-use-key-vault-in-adf-fdf</link>
      <guid>https://dev.to/swatibabber/why-use-key-vault-in-adf-fdf</guid>
      <description>&lt;p&gt;Azure Key Vault (AKV) can be used to store all credentials for services that ADF will connect to. This has multiple advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Security: sensitive information is stored in a credentials store that only the ADF service or administrators can read from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If credentials need to be rotated, the ADF Linked Service does not need to be modified.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When we migrate the ADF pipeline from Dev to Test to Production, no change is necessary.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>adf</category>
      <category>security</category>
      <category>privacy</category>
      <category>datafactory</category>
    </item>
    <item>
      <title>Customer data privacy in Azure Data Factory</title>
      <dc:creator>swatiBabber</dc:creator>
      <pubDate>Wed, 18 Nov 2020 18:56:07 +0000</pubDate>
      <link>https://dev.to/swatibabber/customer-data-privacy-in-azure-data-factory-4lhc</link>
      <guid>https://dev.to/swatibabber/customer-data-privacy-in-azure-data-factory-4lhc</guid>
      <description>&lt;p&gt;ADF has proven to be a go-to option for data movement solutions in Azure.&lt;br&gt;
One of the important things to keep in mind when creating an ADF instance is the &lt;strong&gt;location of the ADF and the Integration Runtime&lt;/strong&gt;. The IR location can be different from the ADF location.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Azure Data Factory Location&lt;/strong&gt;:
When you create a data factory, you need to specify its 
location. The Data Factory location is where the 
&lt;strong&gt;metadata&lt;/strong&gt; of the data factory is stored and 
where the triggering of the pipeline is initiated from. 
Metadata for the factory is stored only in the region of the 
customer’s choice and will not be stored in other regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration Runtime Location&lt;/strong&gt;:
This defines the location of the back-end compute, and 
essentially the location where the data movement, activity 
dispatching, and SSIS package execution are performed.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>azure</category>
      <category>privacy</category>
      <category>dataflow</category>
      <category>datafactory</category>
    </item>
  </channel>
</rss>
