Understanding the compute logic behind data flows is essential to tuning the performance of a data flow pipeline. Data flows use a Spark optimizer that reorders and runs your business logic in 'stages' so that it executes as quickly as possible.
For each sink that your data flow writes to, the monitoring output lists the duration of each transformation stage, along with the time it takes to write data into the sink. The stage with the longest duration is likely the bottleneck of your data flow.
- If the transformation stage that takes the longest contains a source, then you may want to look at further optimizing your read time.
- If a transformation is taking a long time, then you may need to repartition or increase the size of your integration runtime.
- If the sink processing time is large, you may need to scale up your database or verify you are not outputting to a single file.
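The triage rules above can be sketched in a few lines of Python. This is only an illustration: the `Stage` record, the stage names, and the durations are hypothetical values entered by hand, not output from any monitoring API.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str          # hypothetical stage label
    kind: str          # "source", "transformation", or "sink"
    duration_s: float  # duration copied by hand from the monitoring view


# Advice taken from the three rules listed above.
ADVICE = {
    "source": "optimize read time",
    "transformation": "repartition or increase the integration runtime size",
    "sink": "scale up the database or check for single-file output",
}


def find_bottleneck(stages: List[Stage]) -> Stage:
    """Return the stage with the longest duration."""
    if not stages:
        raise ValueError("no stages supplied")
    return max(stages, key=lambda s: s.duration_s)


def advise(stages: List[Stage]) -> str:
    """Name the slowest stage and the matching tuning suggestion."""
    slowest = find_bottleneck(stages)
    return f"{slowest.name}: {ADVICE[slowest.kind]}"


stages = [
    Stage("read_orders", "source", 42.0),
    Stage("join_customers", "transformation", 310.5),
    Stage("write_sql", "sink", 95.2),
]
print(advise(stages))
# prints "join_customers: repartition or increase the integration runtime size"
```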
Once you have identified the bottleneck of your data flow, use the optimization strategies below to improve performance.
The Optimize tab contains settings to configure the partitioning scheme of the Spark cluster. This tab exists in every transformation of a data flow and specifies whether you want to repartition the data after the transformation has completed. Adjusting the partitioning gives you control over how your data is distributed across compute nodes and over data locality optimizations, which can have both positive and negative effects on overall data flow performance.
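To see why a partitioning choice can help or hurt, the following plain-Python sketch compares how evenly two common schemes, round-robin and hash-by-key, spread a skewed dataset. It is a simulation under assumed inputs, not data flow or Spark code: the key values, the partition count, and the `skew` measure are all chosen for illustration.

```python
import zlib
from typing import List, Sequence


def round_robin_sizes(keys: Sequence[str], n: int) -> List[int]:
    """Assign rows to partitions in turn, ignoring the key."""
    sizes = [0] * n
    for i, _ in enumerate(keys):
        sizes[i % n] += 1
    return sizes


def hash_sizes(keys: Sequence[str], n: int) -> List[int]:
    """Assign each row to a partition by a deterministic hash of its key."""
    sizes = [0] * n
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % n] += 1
    return sizes


def skew(sizes: List[int]) -> float:
    """Largest partition divided by the mean partition size (1.0 = even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical, heavily skewed key column: 90% of rows share one key.
keys = ["US"] * 900 + ["DE"] * 50 + ["JP"] * 50

print(round_robin_sizes(keys, 4))  # prints [250, 250, 250, 250]
print(skew(round_robin_sizes(keys, 4)))  # prints 1.0
print(skew(hash_sizes(keys, 4)))  # at least 3.6, since all "US" rows share a partition
```

Round-robin spreads the work evenly but scatters rows with the same key, while hashing keeps equal keys together at the cost of one oversized partition when the key is skewed. Which trade-off is better depends on the transformation that follows.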
Logging level
If you do not need every pipeline execution of your data flow activities to write full verbose telemetry logs, you can optionally set the logging level to "Basic" or "None".
Optimizing the Azure Integration Runtime
Data flows run on Spark clusters that are spun up at run-time. The configuration for the cluster used is defined in the integration runtime (IR) of the activity. There are three performance considerations to make when defining your integration runtime: cluster type, cluster size, and time to live.
- Cluster type: General purpose, Memory optimized, and Compute optimized.
General purpose clusters are the default selection and will be ideal for most data flow workloads. These tend to be the best balance of performance and cost.
If your data flow has many joins and lookups, you may want to use a memory optimized cluster. These clusters can store more data in memory and reduce the out-of-memory errors you may encounter. Memory optimized clusters have the highest price per core, but also tend to result in more successful pipelines. If you experience any out-of-memory errors when executing data flows, switch to a memory optimized Azure IR configuration.
For simpler, non-memory intensive data transformations such as filtering data or adding derived columns, compute-optimized clusters can be used at a cheaper price per core.
Cluster size
Data flows distribute the data processing over different nodes in a Spark cluster to perform operations in parallel. A Spark cluster with more cores increases the number of nodes in the compute environment. More nodes increase the processing power of the data flow. Increasing the size of the cluster is often an easy way to reduce the processing time.
Time to live
By default, every data flow activity spins up a new cluster based on the IR configuration. Cluster start-up takes a few minutes, and data processing can't start until it is complete. If your pipelines contain multiple sequential data flows, you can enable a time to live (TTL) value. Specifying a TTL keeps a cluster alive for a set period after its execution completes. If a new job starts using the IR during the TTL period, it reuses the existing cluster and start-up time is greatly reduced. After the second job completes, the cluster again stays alive for the TTL period.
Only one job can run on a single cluster at a time. If there is an available cluster, but two data flows start, only one will use the live cluster. The second job will spin up its own isolated cluster.
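The effect of TTL on sequential data flows can be estimated with simple arithmetic. The sketch below is a rough model, not a measurement: the 4-minute cold start and 0.5-minute warm start are assumed placeholder figures, and the idle gaps are hypothetical. It covers only jobs that run one after another on the same IR, since concurrent jobs each need their own cluster.

```python
from typing import Sequence


def startup_overhead_min(
    idle_gaps_min: Sequence[float],
    ttl_min: float,
    cold_start_min: float = 4.0,   # assumed; the text only says "a few minutes"
    warm_start_min: float = 0.5,   # assumed; the text only says "greatly reduced"
) -> float:
    """Total start-up minutes for sequential jobs sharing one IR.

    idle_gaps_min[i] is the idle time between the end of job i and the
    start of job i + 1. The first job always pays a cold start.
    """
    total = cold_start_min
    for gap in idle_gaps_min:
        # The cluster is reused only if the next job starts within the TTL.
        reused = ttl_min > 0 and gap <= ttl_min
        total += warm_start_min if reused else cold_start_min
    return total


# Four sequential jobs with hypothetical idle gaps of 1, 2, and 30 minutes.
gaps = [1.0, 2.0, 30.0]
print(startup_overhead_min(gaps, ttl_min=0))   # prints 16.0 (every job starts cold)
print(startup_overhead_min(gaps, ttl_min=10))  # prints 9.0 (two of three follow-ups reuse the cluster)
```

In this model, the 30-minute gap exceeds the 10-minute TTL, so the last job still pays a cold start. A longer TTL would avoid that, at the cost of keeping the cluster alive while idle.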
- Select the appropriate partitioning for the source.
- For file-based sources, avoid using a ForEach activity, as every iteration spins up a new Spark cluster.
- Disable indexes if the sink is a SQL database.
- Scale up the database if the sink is a SQL database.
- Enable staging for Synapse.
- Use the Spark-native Parquet format for file-based sinks.