Alec Dutcher

DP-203 Study Guide - Manage batches and pipelines

Study guide

Azure Batch

  • Azure Batch
    • Platform to run high-performance computing jobs in parallel at large scale
    • Manages cluster of machines and supports autoscaling
    • Allows you to install applications that can run as a job
    • Schedule and run jobs on cluster machines
    • Pay per minute for resources used
  • How it works
    • Pool = cluster of machines/nodes
    • Slot = set of resources used to execute a task
    • Define number of slots per node
      • Increase slots per node to run more tasks in parallel on the same nodes, improving throughput without increasing cost (see the sketch after this list)
    • Job assigns tasks to slots on nodes
    • Application is installed on each node to execute the tasks
    • Specify application packages at pool or task level
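
Concretely, the pool, node, and slot settings come together when the pool is created. Below is a minimal sketch using the azure-batch Python SDK (recent versions; older versions call `task_slots_per_node` `max_tasks_per_node`). The account name, key, URL, pool name, and image values are placeholders.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Account name, key, and URL come from the Keys section of the Batch portal.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com")

# 4 nodes x 2 slots per node = up to 8 tasks running in parallel.
# Raising task_slots_per_node adds parallelism without adding node cost.
pool = batchmodels.PoolAddParameter(
    id="demo-pool",
    vm_size="standard_d2s_v3",
    target_dedicated_nodes=4,
    task_slots_per_node=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoftwindowsserver",
            offer="windowsserver",
            sku="2022-datacenter",
            version="latest"),
        node_agent_sku_id="batch.node.windows amd64"),
)
client.pool.add(pool)
```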

Configure the batch size

  • In the portal (Batch account)
    • Choose Pools in the left-side panel
    • Add a new pool and name it
    • Define the OS image (publisher and sku)
    • Choose VM size (determines cores and memory)
    • Choose fixed or auto scale for nodes (see the autoscale sketch after this list)
      • If fixed, select number of nodes
    • Choose application packages and versions, uploading files if necessary
    • Use Mount configuration to mount storage file shares, specifying the account name and access key of the storage account
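
The portal's fixed/auto scale choice can also be changed in code. A minimal sketch that switches the pool from the earlier sketch to an autoscale formula; the formula itself is illustrative (size the pool to the recent average of pending tasks, capped at 10 dedicated nodes).

```python
from datetime import timedelta

# Enable autoscale on the existing pool; the formula is re-evaluated
# every 15 minutes by the Batch service.
formula = """
$tasks = avg($PendingTasks.GetSample(TimeInterval_Minute * 5, 50));
$TargetDedicatedNodes = min(10, $tasks);
$NodeDeallocationOption = taskcompletion;
"""
client.pool.enable_auto_scale(
    pool_id="demo-pool",
    auto_scale_formula=formula,
    auto_scale_evaluation_interval=timedelta(minutes=15),
)
```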

Trigger batches

  • In the portal (Batch)
    • Confirm that the pool is in steady state and the nodes are in idle state
    • Choose Jobs in the left-side panel and add a new job
    • Name the job and select the pool
    • Open the job and select Tasks in the left-side panel
    • Define name and description
    • Enter the command that will run on each machine in the command line box (see the first sketch after this list)
      • Reference installed packages with %AZ_BATCH_APP_PACKAGE_<name>#<version>%
      • Reference the path to the input file share with -i S:\<file_path> (where S: is the drive letter of the mounted share)
      • Reference the output path with S:\<file_path>
    • Submit task
  • In Azure Data Factory and Azure Synapse
    • To run a single task in ADF
      • Create linked service to Azure Batch
        • Need Batch account name, account endpoint, and primary access key from the Keys section in the Batch portal
        • Also need the name of the pool
      • Create pipeline to run Custom Batch activity
        • Select linked service under the Azure Batch option in the activity settings
      • Define command to execute utility
        • Enter in the Command box under Settings for the activity
    • To run multiple tasks in parallel
      • Get list of files using Get Metadata activity in the General option
        • Configure data set and linked service with Azure File Storage
        • Use the Field list to select Child items
      • Use a ForEach activity to iterate through the Child items
        • Use dynamic content in the Command to splice in each file's name (see the second sketch after this list)
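
The portal flow for a single task has a straightforward SDK equivalent. A minimal sketch with the azure-batch Python SDK (client and pool as in the earlier sketches; the ffmpeg package, file names, and S: drive letter are placeholder assumptions):

```python
import azure.batch.models as batchmodels

# Create a job on the pool, then submit a task to it.
client.job.add(batchmodels.JobAddParameter(
    id="demo-job",
    pool_info=batchmodels.PoolInformation(pool_id="demo-pool")))

# The command line runs on whichever node the task lands on.
# %AZ_BATCH_APP_PACKAGE_<name>#<version>% resolves to an installed
# application package; S: is assumed to be the mounted file share.
client.task.add(
    job_id="demo-job",
    task=batchmodels.TaskAddParameter(
        id="task-1",
        command_line=(r'cmd /c "%AZ_BATCH_APP_PACKAGE_ffmpeg#4.3%\ffmpeg.exe '
                      r'-i S:\input\video1.mp4 S:\output\video1.gif"'),
    ))
```

For the parallel fan-out pattern, the same pipeline can be defined in code. A rough sketch with the azure-mgmt-datafactory model classes (all names are placeholders, and property names can differ slightly between SDK versions):

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency, CustomActivity, DatasetReference, Expression,
    ForEachActivity, GetMetadataActivity, LinkedServiceReference,
)

# Get Metadata lists the files in the input file share dataset.
get_files = GetMetadataActivity(
    name="GetFileList",
    dataset=DatasetReference(reference_name="InputFileShare"),
    field_list=["childItems"],
)

# ForEach fans out over childItems; the Custom activity's command uses
# dynamic content to splice in each file's name.
loop = ForEachActivity(
    name="ForEachFile",
    items=Expression(value="@activity('GetFileList').output.childItems"),
    activities=[CustomActivity(
        name="ProcessOneFile",
        command=r"ffmpeg.exe -i S:\input\@{item().name} S:\output\@{item().name}.gif",
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureBatchLS"),
    )],
    depends_on=[ActivityDependency(
        activity="GetFileList", dependency_conditions=["Succeeded"])],
)
```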

Handle failed batch loads

  • Failure types
    • Infrastructure - pool and node errors
    • Application - job and task errors
  • Pool errors
    • Resizing failure - the pool is unable to provision a node within the resize timeout window (default is 15 mins)
    • Insufficient quota - the account has a limited core quota; if an allocation would exceed it, the resize fails (raise a support ticket to increase the quota)
    • Scaling failure - a formula determines autoscaling, and evaluation of the formula can fail (check the logs to find the issue)
  • Node issues
    • App package download failure - the node is set to unusable and needs to be re-imaged
    • Node OS updates - tasks can be interrupted by updates; auto-update can be disabled
    • Node in unusable state - even if the pool is ready, a node can be unusable (VM crash, firewall block, invalid app package) and needs to be re-imaged
    • Node disk is full
  • Rebooting and re-imaging can be done in the Batch portal under Pools, or programmatically (see the sketch after this list)
  • The Connect option in the portal allows you to use RDP/SSH to connect to the VM
    • Define user details
    • Set as Admin
    • Download the RDP file and enter the user credentials
    • This opens a Server Manager window where you can navigate the file system to check application package installations
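
These node checks can also be scripted. A minimal sketch with the azure-batch SDK (client and pool as in the earlier sketches) that finds unusable nodes and re-images them:

```python
import azure.batch.models as batchmodels

# Re-image any node the Batch service has marked unusable.
for node in client.compute_node.list("demo-pool"):
    if node.state == batchmodels.ComputeNodeState.unusable:
        print(f"Re-imaging {node.id}; errors: {node.errors}")
        client.compute_node.reimage(
            pool_id="demo-pool",
            node_id=node.id,
            node_reimage_option=batchmodels.ComputeNodeReimageOption.requeue,
        )
```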

Validate batch loads

  • Job errors
    • Timeout
      • Max wall clock time defines the maximum time a job is allowed to run, measured from when it was created
      • Default value is unlimited
      • If the max is reached, running tasks are killed
      • Increase the max wall clock value to prevent timeouts
    • Failure of job-related tasks
      • Each job can have job-level preparation and release tasks that run once per node
      • The job preparation task runs on a node before any of the job's tasks run there
      • Job release task runs on each node when job terminates
      • Failures can occur in these tasks
  • Task errors
    • Task waiting - dependency on another task
    • Task timeout - check max wall clock time
    • Missing app packages or resource files
    • Error in command defined in the task
    • Check stdout and stderr logs for details (see the sketch after this list)
  • In the Batch portal under node details, you can specify a container where log files are stored for future reference
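
Task validation can be automated the same way. A minimal sketch (client and job as in the earlier sketches) that finds failed tasks and pulls their stderr logs:

```python
import azure.batch.models as batchmodels

# List every task in the job and fetch stderr for the failed ones.
for task in client.task.list("demo-job"):
    info = task.execution_info
    if info and info.result == batchmodels.TaskExecutionResult.failure:
        chunks = client.file.get_from_task("demo-job", task.id, "stderr.txt")
        stderr = b"".join(chunks).decode()
        print(f"Task {task.id} failed (exit code {info.exit_code}):\n{stderr}")
```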

Configure batch retention

  • Retention time defines how long to keep task directory on node once task is complete
  • Configure at Job level or Task level
    • Retention time field in the advanced settings
    • Default is 7 days, unless the node is removed or the job is deleted (see the sketch after this list)
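
At the task level, retention maps to TaskConstraints in the SDK. A minimal sketch (client and job as in the earlier sketches):

```python
from datetime import timedelta
import azure.batch.models as batchmodels

# Keep this task's directory on the node for 1 day instead of the default 7.
client.task.add(
    job_id="demo-job",
    task=batchmodels.TaskAddParameter(
        id="task-2",
        command_line='cmd /c "echo hello"',
        constraints=batchmodels.TaskConstraints(
            retention_time=timedelta(days=1)),
    ))
```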

Manage data pipelines in Azure Data Factory or Azure Synapse Pipelines

  • Ways to run pipelines
    • Debug Run
      • Don't need to save changes
      • Directly run pipelines with draft changes
      • Manual, can't be scheduled
    • Trigger Run
      • Need to publish changes first
      • Only runs published version of pipeline
      • Can be manual or scheduled (see the sketch after this list)
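
A trigger run can also be kicked off programmatically. A minimal sketch with the azure-mgmt-datafactory SDK (subscription, resource group, factory, pipeline, and parameter names are placeholders); note that this runs the published pipeline, not draft changes:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), "<subscription-id>")

# Manually trigger a published pipeline, optionally passing parameters.
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-adf",
    pipeline_name="batch_fanout",
    parameters={"inputFolder": "raw/2024-01-01"},
)

# Poll the run afterwards: Queued / InProgress / Succeeded / Failed.
status = adf_client.pipeline_runs.get("my-rg", "my-adf", run.run_id)
print(run.run_id, status.status)
```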

Schedule data pipelines in Data Factory or Azure Synapse Pipelines

  • Trigger types
    • Scheduled - run on wall-clock schedule
    • Tumbling window - run at periodic intervals while maintaining state
    • Storage event - run pipeline when a file is uploaded to or deleted from a storage account
    • Custom event trigger - runs pipeline when event is raised by Azure Event Grid
  • Scheduled vs tumbling triggers
    • Scheduled
      • Only supports future-dated loads
      • Does not maintain state; fire-and-forget
    • Tumbling
      • Can run back-dated and future-dated loads
      • Maintains state (completed loads)
      • Passes start and end timestamps of window as parameters
      • Can be used to add dependencies between pipelines, allowing complex scenarios (see the sketch after this list)
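
A rough sketch of a tumbling window trigger using the azure-mgmt-datafactory models (adf_client as in the earlier sketch; names and the schedule are placeholders). The window start/end are passed to the pipeline as parameters, which is what enables back-dated loads and stateful reruns:

```python
from datetime import datetime
from azure.mgmt.datafactory.models import (
    PipelineReference, TriggerPipelineReference, TriggerResource,
    TumblingWindowTrigger,
)

# Hourly windows from Jan 1 onward; earlier windows are back-filled.
trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="batch_fanout"),
        parameters={
            "windowStart": "@trigger().outputs.windowStartTime",
            "windowEnd": "@trigger().outputs.windowEndTime",
        }),
    frequency="Hour",
    interval=1,
    start_time=datetime(2024, 1, 1),
    max_concurrency=4,  # how many windows may run in parallel
)
adf_client.triggers.create_or_update(
    "my-rg", "my-adf", "hourly_window", TriggerResource(properties=trigger))
# The trigger must still be started (begin_start in recent SDK versions).
adf_client.triggers.begin_start("my-rg", "my-adf", "hourly_window").result()
```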

Implement version control for pipeline artifacts

  • Authoring modes
    • Live mode (default)
      • Authoring directly against pipelines
      • No option to save draft changes
      • Need to publish to save valid changes
      • Need manually created ARM templates to deploy pipelines to other environments
    • Git Repo mode
      • Repo can be in ADO or GitHub
      • All artifacts can be stored in source control
      • Draft changes can be saved even if not valid
      • Autogenerates ARM templates for deployment in other environments
      • Enables DevOps features (PRs, reviews, collab)

Manage Spark jobs in a pipeline

  • Pipeline activities for Spark
    • Synapse - Spark notebook, Spark job
    • Databricks - notebook, Jar file, Python file (see the notebook activity sketch after this list)
    • HDInsight activities - Spark Jar/script
  • Monitoring Spark activities
    • Monitoring built in to ADF
    • Platform monitoring (Synapse, Databricks)
      • In ADF/Synapse, go to Monitor --> Apache Spark applications and select a specific run for details
    • Spark UI
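
As an example of the activity types above, a hedged sketch of a Databricks notebook activity using the azure-mgmt-datafactory models (the linked service name, notebook path, and parameter are placeholders):

```python
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity, LinkedServiceReference,
)

# Runs a Databricks notebook as one step of an ADF/Synapse pipeline.
spark_step = DatabricksNotebookActivity(
    name="TransformWithSpark",
    notebook_path="/Shared/transform_sales",
    base_parameters={"run_date": "@pipeline().parameters.runDate"},
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"),
)
```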
