Alec Dutcher

DP-203 Study Guide - Manage batches and pipelines

Study guide

Azure Batch

  • Azure Batch
    • Platform to run high-performance computing jobs in parallel at large scale
    • Manages cluster of machines and supports autoscaling
    • Allows you to install applications that can run as a job
    • Schedule and run jobs on cluster machines
    • Pay per minute for resources used
  • How it works
    • Pool = cluster of machines/nodes
    • Slot = set of resources used to execute a task
    • Define number of slots per node
      • Increase slots per node to run more tasks in parallel on the same nodes, improving throughput without increasing cost (see the sketch after this list)
    • Job assigns tasks to slots on nodes
    • Application is installed on each node to execute the tasks
    • Specify application packages at pool or task level
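
Concretely, the pool, node, and slot settings come together when the pool is created. Below is a minimal sketch using the azure-batch Python SDK (recent versions; older versions call `task_slots_per_node` `max_tasks_per_node`). The account name, key, URL, pool name, and image values are placeholders.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Account name, key, and URL come from the Keys section of the Batch portal.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com")

# 4 nodes x 2 slots per node = up to 8 tasks running in parallel.
# Raising task_slots_per_node adds parallelism without adding node cost.
pool = batchmodels.PoolAddParameter(
    id="demo-pool",
    vm_size="standard_d2s_v3",
    target_dedicated_nodes=4,
    task_slots_per_node=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoftwindowsserver",
            offer="windowsserver",
            sku="2022-datacenter",
            version="latest"),
        node_agent_sku_id="batch.node.windows amd64"),
)
client.pool.add(pool)
```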

Configure the batch size

  • In the portal (Batch account)
    • Choose Pools in the left-side panel
    • Add a new pool and name it
    • Define the OS image (publisher and sku)
    • Choose VM size (determines cores and memory)
    • Choose fixed or auto scale for nodes (see the autoscale sketch after this list)
      • If fixed, select number of nodes
    • Choose application packages and versions, uploading files if necessary
    • Use Mount configuration to mount storage file shares, specifying the account name and access key of the storage account
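
The portal's fixed/auto scale choice can also be changed in code. A minimal sketch that switches the pool from the earlier sketch to an autoscale formula; the formula itself is illustrative (size the pool to the recent average of pending tasks, capped at 10 dedicated nodes).

```python
from datetime import timedelta

# Enable autoscale on the existing pool; the formula is re-evaluated
# every 15 minutes by the Batch service.
formula = """
$tasks = avg($PendingTasks.GetSample(TimeInterval_Minute * 5, 50));
$TargetDedicatedNodes = min(10, $tasks);
$NodeDeallocationOption = taskcompletion;
"""
client.pool.enable_auto_scale(
    pool_id="demo-pool",
    auto_scale_formula=formula,
    auto_scale_evaluation_interval=timedelta(minutes=15),
)
```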

Trigger batches

  • In the portal (Batch)
    • Confirm that the pool is in steady state and the nodes are in idle state
    • Choose Jobs in the left-side panel and add a new job
    • Name the job and select the pool
    • Open the job and select Tasks in the left-side panel
    • Define name and description
    • Enter the command that will run on each machine in the command line box (see the first sketch after this list)
      • Reference installed packages with %AZ_BATCH_APP_PACKAGE_<name>#<version>%
      • Reference the path to the input file share with -i S:\<file_path> (where S: is the drive letter of the mounted share)
      • Reference the output path with S:\<file_path>
    • Submit task
  • In Azure Data Factory and Azure Synapse
    • To run a single task in ADF
      • Create linked service to Azure Batch
        • Need Batch account name, account endpoint, and primary access key from the Keys section in the Batch portal
        • Also need the name of the pool
      • Create pipeline to run Custom Batch activity
        • Select linked service under the Azure Batch option in the activity settings
      • Define command to execute utility
        • Enter in the Command box under Settings for the activity
    • To run multiple tasks in parallel
      • Get list of files using Get Metadata activity in the General option
        • Configure data set and linked service with Azure File Storage
        • Use the Field list to select Child items
      • Use a ForEach activity to iterate through the Child items
        • Use dynamic content in the Command to splice in each file's name (see the second sketch after this list)
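
The portal flow for a single task has a straightforward SDK equivalent. A minimal sketch with the azure-batch Python SDK (client and pool as in the earlier sketches; the ffmpeg package, file names, and S: drive letter are placeholder assumptions):

```python
import azure.batch.models as batchmodels

# Create a job on the pool, then submit a task to it.
client.job.add(batchmodels.JobAddParameter(
    id="demo-job",
    pool_info=batchmodels.PoolInformation(pool_id="demo-pool")))

# The command line runs on whichever node the task lands on.
# %AZ_BATCH_APP_PACKAGE_<name>#<version>% resolves to an installed
# application package; S: is assumed to be the mounted file share.
client.task.add(
    job_id="demo-job",
    task=batchmodels.TaskAddParameter(
        id="task-1",
        command_line=(r'cmd /c "%AZ_BATCH_APP_PACKAGE_ffmpeg#4.3%\ffmpeg.exe '
                      r'-i S:\input\video1.mp4 S:\output\video1.gif"'),
    ))
```

For the parallel fan-out pattern, the same pipeline can be defined in code. A rough sketch with the azure-mgmt-datafactory model classes (all names are placeholders, and property names can differ slightly between SDK versions):

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency, CustomActivity, DatasetReference, Expression,
    ForEachActivity, GetMetadataActivity, LinkedServiceReference,
)

# Get Metadata lists the files in the input file share dataset.
get_files = GetMetadataActivity(
    name="GetFileList",
    dataset=DatasetReference(reference_name="InputFileShare"),
    field_list=["childItems"],
)

# ForEach fans out over childItems; the Custom activity's command uses
# dynamic content to splice in each file's name.
loop = ForEachActivity(
    name="ForEachFile",
    items=Expression(value="@activity('GetFileList').output.childItems"),
    activities=[CustomActivity(
        name="ProcessOneFile",
        command=r"ffmpeg.exe -i S:\input\@{item().name} S:\output\@{item().name}.gif",
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureBatchLS"),
    )],
    depends_on=[ActivityDependency(
        activity="GetFileList", dependency_conditions=["Succeeded"])],
)
```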

Handle failed batch loads

  • Failure types
    • Infrastructure - pool and node errors
    • Application - job and task errors
  • Pool errors
    • Resizing failure - the pool is unable to provision a node within the resize timeout window (default is 15 mins)
    • Insufficient quota - the account has a limited core quota; if an allocation would exceed it, the resize fails (raise a support ticket to increase the quota)
    • Scaling failure - a formula determines autoscaling, and evaluation of the formula can fail (check the logs to find the issue)
  • Node issues
    • App package download failure - the node is set to unusable and needs to be re-imaged
    • Node OS updates - tasks can be interrupted by updates; auto-update can be disabled
    • Node in unusable state - even if the pool is ready, a node can be unusable (VM crash, firewall block, invalid app package) and needs to be re-imaged
    • Node disk is full
  • Rebooting and re-imaging can be done in the Batch portal under Pools, or programmatically (see the sketch after this list)
  • The Connect option in the portal allows you to use RDP/SSH to connect to the VM
    • Define user details
    • Set as Admin
    • Download the RDP file and enter the user credentials
    • This opens a Server Manager window where you can navigate the file system to check application package installations
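
These node checks can also be scripted. A minimal sketch with the azure-batch SDK (client and pool as in the earlier sketches) that finds unusable nodes and re-images them:

```python
import azure.batch.models as batchmodels

# Re-image any node the Batch service has marked unusable.
for node in client.compute_node.list("demo-pool"):
    if node.state == batchmodels.ComputeNodeState.unusable:
        print(f"Re-imaging {node.id}; errors: {node.errors}")
        client.compute_node.reimage(
            pool_id="demo-pool",
            node_id=node.id,
            node_reimage_option=batchmodels.ComputeNodeReimageOption.requeue,
        )
```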

Validate batch loads

  • Job errors
    • Timeout
      • Max wall clock time defines the maximum time a job is allowed to run, measured from when it was created
      • Default value is unlimited
      • If the max is reached, running tasks are killed
      • Increase the max wall clock value to prevent timeouts
    • Failure of job-related tasks
      • Each job can have job-level preparation and release tasks that run once per node
      • The job preparation task runs on a node before any of the job's tasks run there
      • Job release task runs on each node when job terminates
      • Failures can occur in these tasks
  • Task errors
    • Task waiting - dependency on another task
    • Task timeout - check max wall clock time
    • Missing app packages or resource files
    • Error in command defined in the task
    • Check stdout and stderr logs for details (see the sketch after this list)
  • In the Batch portal under node details, you can specify a container where log files are stored for future reference
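
Task validation can be automated the same way. A minimal sketch (client and job as in the earlier sketches) that finds failed tasks and pulls their stderr logs:

```python
import azure.batch.models as batchmodels

# List every task in the job and fetch stderr for the failed ones.
for task in client.task.list("demo-job"):
    info = task.execution_info
    if info and info.result == batchmodels.TaskExecutionResult.failure:
        chunks = client.file.get_from_task("demo-job", task.id, "stderr.txt")
        stderr = b"".join(chunks).decode()
        print(f"Task {task.id} failed (exit code {info.exit_code}):\n{stderr}")
```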

Configure batch retention

  • Retention time defines how long to keep task directory on node once task is complete
  • Configure at Job level or Task level
    • Retention time field in the advanced settings
    • Default is 7 days, unless the node is removed or the job is deleted (see the sketch after this list)
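
At the task level, retention maps to TaskConstraints in the SDK. A minimal sketch (client and job as in the earlier sketches):

```python
from datetime import timedelta
import azure.batch.models as batchmodels

# Keep this task's directory on the node for 1 day instead of the default 7.
client.task.add(
    job_id="demo-job",
    task=batchmodels.TaskAddParameter(
        id="task-2",
        command_line='cmd /c "echo hello"',
        constraints=batchmodels.TaskConstraints(
            retention_time=timedelta(days=1)),
    ))
```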

Manage data pipelines in Azure Data Factory or Azure Synapse Pipelines

  • Ways to run pipelines
    • Debug Run
      • Don't need to save changes
      • Directly run pipelines with draft changes
      • Manual, can't be scheduled
    • Trigger Run
      • Need to publish changes first
      • Only runs published version of pipeline
      • Can be manual or scheduled (see the sketch after this list)
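
A trigger run can also be kicked off programmatically. A minimal sketch with the azure-mgmt-datafactory SDK (subscription, resource group, factory, pipeline, and parameter names are placeholders); note that this runs the published pipeline, not draft changes:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>")

# Manually trigger a published pipeline, optionally passing parameters.
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-adf",
    pipeline_name="batch_fanout",
    parameters={"inputFolder": "raw/2024-01-01"},
)

# Poll the run afterwards: Queued / InProgress / Succeeded / Failed.
status = adf_client.pipeline_runs.get("my-rg", "my-adf", run.run_id)
print(run.run_id, status.status)
```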

Schedule data pipelines in Data Factory or Azure Synapse Pipelines

  • Trigger types
    • Scheduled - run on wall-clock schedule
    • Tumbling window - run at periodic intervals while maintaining state
    • Storage event - run pipeline when a file is uploaded to or deleted from a storage account
    • Custom event trigger - runs pipeline when event is raised by Azure Event Grid
  • Scheduled vs tumbling triggers
    • Scheduled
      • Only supports future-dated loads
      • Does not maintain state; fire-and-forget
    • Tumbling
      • Can run back-dated and future-dated loads
      • Maintains state (completed loads)
      • Passes start and end timestamps of window as parameters
      • Can be used to add dependencies between pipelines, allowing complex scenarios (see the sketch after this list)
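
A rough sketch of a tumbling window trigger using the azure-mgmt-datafactory models (adf_client as in the earlier sketch; names and the schedule are placeholders). The window start/end are passed to the pipeline as parameters, which is what enables back-dated loads and stateful reruns:

```python
from datetime import datetime
from azure.mgmt.datafactory.models import (
    PipelineReference, TriggerPipelineReference, TriggerResource,
    TumblingWindowTrigger,
)

# Hourly windows from Jan 1 onward; earlier windows are back-filled.
trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="batch_fanout"),
        parameters={
            "windowStart": "@trigger().outputs.windowStartTime",
            "windowEnd": "@trigger().outputs.windowEndTime",
        }),
    frequency="Hour",
    interval=1,
    start_time=datetime(2024, 1, 1),
    max_concurrency=4,  # how many windows may run in parallel
)
adf_client.triggers.create_or_update(
    "my-rg", "my-adf", "hourly_window", TriggerResource(properties=trigger))
# The trigger must still be started (begin_start in recent SDK versions).
adf_client.triggers.begin_start("my-rg", "my-adf", "hourly_window").result()
```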

Implement version control for pipeline artifacts

  • Authoring modes
    • Live mode (default)
      • Authoring directly against pipelines
      • No option to save draft changes
      • Need to publish to save valid changes
      • Need manually created ARM templates to deploy pipelines to other environments
    • Git Repo mode
      • Repo can be in ADO or GitHub
      • All artifacts can be stored in source control
      • Draft changes can be saved even if not valid
      • Autogenerates ARM templates for deployment in other environments
      • Enables DevOps features (PRs, reviews, collab)

Manage Spark jobs in a pipeline

  • Pipeline activities for Spark
    • Synapse - Spark notebook, Spark job
    • Databricks - notebook, Jar file, Python file (see the notebook activity sketch after this list)
    • HDInsight activities - Spark Jar/script
  • Monitoring Spark activities
    • Monitoring built in to ADF
    • Platform monitoring (Synapse, Databricks)
      • In ADF/Synapse, go to Monitor --> Apache Spark applications and select a specific run for details
    • Spark UI
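
As an example of the activity types above, a hedged sketch of a Databricks notebook activity using the azure-mgmt-datafactory models (the linked service name, notebook path, and parameter are placeholders):

```python
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference,
)

# Runs a Databricks notebook as one step of an ADF/Synapse pipeline.
spark_step = DatabricksNotebookActivity(
    name="TransformWithSpark",
    notebook_path="/Shared/transform_sales",
    base_parameters={"run_date": "@pipeline().parameters.runDate"},
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"),
)
```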
