The first time I used Azure Data Factory, I found its pricing rather confusing. The pricing policy mentioned words like Read/Write Operations, Monitoring Operations and Activity Runs that did not make sense to me as a beginner.
I was simply trying to decide whether it would be more efficient and cost-effective to use Data Factory rather than hand-coding a data pipeline. However, I ended up spending a good few hours going through the documentation just to get a rough cost estimate.
Below is a very basic overview of what Azure Data Factory V2 pricing looks like for data pipelines. This article should suffice for beginners who are trying out Data Factory pipelines for the first time. Please note that it is not a comprehensive overview; advanced users should refer to the official pricing documents.
The first thing you need to know about Data Factory is that not only are you charged for executing pipelines, but also for developing/debugging and monitoring them.
If you're looking to build data pipelines in Azure Data Factory, your cost will be split into two categories:
- Data Factory Operations
- Pipeline Orchestration and Execution
Data Factory Operations cover the cost associated with developing and debugging pipelines. There are two types of Data Factory Operations: Read/Write and Monitoring.
Every time you create, edit or delete a pipeline activity or a Data Factory entity such as a dataset, linked service, integration runtime or trigger, it counts as a Read/Write operation towards your Data Factory Operations cost. These are billed at $0.50 per 50,000 operations.
You can monitor each pipeline run and view the status for each individual activity. For each pipeline run, you can expect to retrieve one record for the pipeline and one record for each activity or trigger.
For instance, you would be charged for 3 Monitoring operations if you debug a pipeline containing 2 activities: one run record for the pipeline and one for each activity. Monitoring operations are charged at $0.25 per 50,000 run records retrieved.
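The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate: the function names are made up for illustration, and the sketch assumes the per-50,000 rates are applied proportionally, which the official pricing documents may round differently.

```python
# Rates quoted in this article (USD).
READ_WRITE_RATE = 0.50 / 50_000   # per Read/Write operation
MONITORING_RATE = 0.25 / 50_000   # per run record retrieved


def run_records_per_pipeline_run(activities: int, triggers: int = 0) -> int:
    """One record for the pipeline plus one per activity or trigger."""
    return 1 + activities + triggers


def operations_cost(read_write_ops: int, run_records: int) -> float:
    """Estimated Data Factory Operations cost in USD.

    Assumption: cost scales linearly with the number of operations.
    """
    return read_write_ops * READ_WRITE_RATE + run_records * MONITORING_RATE


if __name__ == "__main__":
    # Debugging a pipeline with 2 activities retrieves 3 run records.
    records = run_records_per_pipeline_run(activities=2)
    print(records, f"${operations_cost(read_write_ops=20, run_records=records):.6f}")
```

Even a few thousand operations a month come to a few cents at these rates.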
As you may have concluded by now, Data Factory Operations are very inexpensive and, for the most part, can be ignored for cost calculation purposes. The bulk of the cost comes from Pipeline Orchestration and Execution.
Pipeline Orchestration and Execution refers to the cost of provisioning resources to run a pipeline and its associated activities.
Data Factory provides users with the option of running pipelines on their own servers (Self-Hosted Integration Runtime) or using a serverless integration runtime provided by Azure (Azure Integration Runtime). Pricing is slightly different for the two options.
Please note that Integration runtime charges are prorated by the minute and rounded up.
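Orchestration is billed per activity run. Every time you run a pipeline on a self-hosted integration runtime, you are charged for every activity and trigger inside that pipeline that is executed, at a rate of $1.50 per 1,000 activity runs.
The Azure integration runtime is slightly cheaper at $1 per 1,000 activity runs.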
As an example, executing a pipeline with a trigger and two activities would be charged as 3 Activity runs.
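Here is a minimal sketch of the orchestration charge. It assumes the $1.50 rate applies to the self-hosted integration runtime and the $1 rate to the Azure integration runtime; the function names and the `runtime` keys are hypothetical, and proportional billing below 1,000 runs is also an assumption.

```python
# USD per 1,000 activity runs, as quoted in this article.
# Assumption: $1.50 is the self-hosted rate and $1.00 is the Azure IR rate.
ORCHESTRATION_RATE_PER_1000 = {
    "self_hosted": 1.50,
    "azure": 1.00,
}


def activity_runs_per_pipeline_run(activities: int, triggers: int = 0) -> int:
    """Each executed activity and each trigger counts as one activity run."""
    return activities + triggers


def orchestration_cost(activity_runs: int, runtime: str = "azure") -> float:
    """Estimated orchestration cost in USD for a number of activity runs."""
    return activity_runs / 1000 * ORCHESTRATION_RATE_PER_1000[runtime]


if __name__ == "__main__":
    # A pipeline with one trigger and two activities, run once a day for 30 days.
    runs = 30 * activity_runs_per_pipeline_run(activities=2, triggers=1)
    print(runs, f"${orchestration_cost(runs, 'azure'):.2f}")
```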
Execution is the cost of the compute resources required to run a pipeline. You can expect the bulk of the cost for Azure Data Factory pipelines to fall into this category.
Execution is further divided into three sub-categories:
- Data movement activities : This covers the cost of moving data across data stores in activities such as the copy data activity.
- Pipeline activities : Pipeline activities such as Lookup, Delete and schema operations during authoring (test connection, browse folder list and table list, get schema, and preview data).
- External activities : These are data transformation activities that execute in a computing environment such as Azure Databricks or Azure HDInsight. You can find a list of external activities here.
Each sub-category has separate pricing, as listed below:
Self-hosted integration runtime:
- Data movement : $0.10/hour
- Pipeline activities : $0.002/hour
- External activities : $0.0001/hour
Azure integration runtime:
- Data movement : $0.25/DIU-hour
- Pipeline activities : $0.005/hour
- External activities : $0.00025/hour
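The two rate lists can be captured in a small lookup table. This sketch assumes the first list is the self-hosted integration runtime and the second is the Azure integration runtime; the dictionary keys and the function name are illustrative only.

```python
# USD rates from the lists above.
# Assumption: first list = self-hosted IR, second list = Azure IR.
EXECUTION_RATES = {
    "self_hosted": {"data_movement": 0.10, "pipeline": 0.002, "external": 0.0001},
    "azure": {"data_movement": 0.25, "pipeline": 0.005, "external": 0.00025},
}


def execution_cost(runtime: str, category: str, hours: float, dius: int = 1) -> float:
    """Estimated execution cost in USD.

    Data movement on the Azure IR is billed per DIU-hour, so the hourly
    rate is multiplied by the number of DIUs. Everything else is per hour.
    """
    rate = EXECUTION_RATES[runtime][category]
    if runtime == "azure" and category == "data_movement":
        return rate * dius * hours
    return rate * hours


if __name__ == "__main__":
    print(f"${execution_cost('azure', 'data_movement', hours=1, dius=4):.2f}")
    print(f"${execution_cost('self_hosted', 'data_movement', hours=1):.2f}")
```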
You may have noticed that every sub-category is charged by the hour except Data movement which is charged in units of DIU-hour.
"A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory. Power is a combination of CPU, memory, and network resource allocation."
A DIU, therefore, is a measure of the compute resources available to your pipeline. The more compute you allocate, the higher the cost.
The copy activity is configured to use 4 DIUs by default but you can modify this to set the value between 2 and 256 depending on your performance requirements. You can find a detailed article on DIUs and how to optimize costs and performance for the copy activity here.
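To make the DIU-hour unit concrete, here is a rough sketch of what a single copy activity might cost on the Azure integration runtime. It uses the $0.25/DIU-hour rate, the 4-DIU default and the 2 to 256 range from this article, plus the per-minute proration mentioned earlier. The function names are hypothetical, and treating the duration as a single rounded-up block of minutes is an assumption.

```python
import math

DATA_MOVEMENT_RATE = 0.25  # USD per DIU-hour on the Azure IR (from this article)


def billable_hours(duration_seconds: float) -> float:
    """Round the duration up to whole minutes, then convert to hours."""
    return math.ceil(duration_seconds / 60) / 60


def copy_activity_cost(duration_seconds: float, dius: int = 4) -> float:
    """Estimated cost in USD of one copy activity run on the Azure IR."""
    if not 2 <= dius <= 256:
        raise ValueError("DIUs must be between 2 and 256")
    return dius * billable_hours(duration_seconds) * DATA_MOVEMENT_RATE


if __name__ == "__main__":
    # A copy that takes 10 minutes 30 seconds is billed as 11 minutes.
    print(f"${copy_activity_cost(630):.4f}")
    # The same copy with 16 DIUs costs four times as much per minute,
    # so it only pays off if the run gets correspondingly shorter.
    print(f"${copy_activity_cost(630, dius=16):.4f}")
```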
Azure also charges for inactive pipelines, so it's good to clean up after you're done.
A pipeline is considered inactive if it has no associated trigger or any runs within the month. An inactive pipeline is charged at $0.80 per month.
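As a quick illustration, the inactivity rule can be expressed as a simple check. The `PipelineUsage` record and its fields are invented for this sketch; in practice you would fill them in from whatever inventory of pipelines you keep.

```python
from dataclasses import dataclass

INACTIVE_PIPELINE_MONTHLY_RATE = 0.80  # USD per inactive pipeline per month


@dataclass
class PipelineUsage:
    """Hypothetical summary of one pipeline for a given month."""
    name: str
    has_trigger: bool
    runs_this_month: int


def is_inactive(pipeline: PipelineUsage) -> bool:
    """Inactive means no associated trigger and no runs within the month."""
    return not pipeline.has_trigger and pipeline.runs_this_month == 0


def inactive_pipeline_cost(pipelines: list[PipelineUsage]) -> float:
    """Estimated monthly charge in USD for all inactive pipelines."""
    inactive_count = sum(1 for p in pipelines if is_inactive(p))
    return inactive_count * INACTIVE_PIPELINE_MONTHLY_RATE


if __name__ == "__main__":
    pipelines = [
        PipelineUsage("daily_load", has_trigger=True, runs_this_month=30),
        PipelineUsage("old_experiment", has_trigger=False, runs_this_month=0),
        PipelineUsage("manual_backfill", has_trigger=False, runs_this_month=2),
    ]
    print(f"${inactive_pipeline_cost(pipelines):.2f}")
```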
The above information should allow you to make a fair cost estimate for running data pipelines in Azure Data Factory V2. However, please note that you may face additional charges, such as egress charges if you are copying data from an Azure database or compute charges if you choose to create custom activities. These charges vary from case to case and are outside the scope of this blog post. The examples provided by Azure could prove useful in this regard.