DEV Community

Alec Dutcher
Alec Dutcher

Posted on • Updated on

DP-203 Study Guide - Optimize and troubleshoot data storage and data processing

Study guide

Compact small files

  • What does it meant to compact small files?
    • Combine a lot of small files into one file
    • Improves speed of read queries
    • Can be done from a Copy job in ADF/Synapse or incremental load
    • Also available in a Delta Lake feature
  • Using a Copy job
    • Source is the directory with all of the small files
    • Select using a wildcard (/directory/*)
    • Use the Copy behavior to merge the files
  • Using Delta Lake
OPTIMIZE delta.`/data/events`
OPTIMIZE delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/path-to-data
Enter fullscreen mode Exit fullscreen mode

Handle skew in data

  • Skew
    • An uneven distribution of data
    • Data skew can unbalance compute nodes, lowering performance
    • Avoid by balancing parallel processing with correct table distribution (hash or round-robin)
  • Detect skew in distributed table (database consistency check)
    • DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
  • Resolve data skew
    • Research
      • Monitor query impact
      • Weigh the cost of minimizing
    • Solution
      • Re-create table with a new distribution column set
      • CREATE TABLE AS SELECT (CTAS)

Handle data spill

  • Data spill is when compute engine is unable to hold data in memory and writes ("spills") data to disk
  • Impact is expensive disk reads/writes and longer execution times
  • Occurs when
    • Partition size is too big
    • Compute resource size is small
    • Data size during merges, unions, etc exceeds memory limit of the compute node
  • Identifying data spill
    • Synapse SQL - TempDB runs out of space and throws error (monitor with DMVs)
    • Spark - view task summary screen under spill column
  • Handling the spill
    • Increase compute capacity
    • Reduce partition size
    • Remove skews in data

Optimize resource management

  • Optimize Synapse SQL Pools
    • Pause when not in use
    • Use the right compute unit (DWU) for workload
    • Leverage Azure Functions to scale out workload
  • Optimize Spark
    • Select autoscale option in cluster setup
    • Select auto-terminate
    • Use spot instances
    • Right-size cluster nodes based on memory, CPU intensive, etc

Tune queries by using indexers

  • Types of indexes
    • Clustered columnstore index
      • Default in SQL pool table
      • Use for tables > 100 million rows
      • Good performance and data compression
    • Clustered index
      • Good for specific filter conditions
      • Use for tables between 100 and 100 million rows
    • Heap index
      • Use for staging tables
  • Maintain by rebuilding indexes when seeing performance degradation in existing indexes
  • Indexes in Spark Pool
    • Spark does not have an inbuilt index
    • Uses Hyperspace (or Hyperscale) - ability to create indexes on datasets (CSV, JSON, parquet)
    • Works via API
    • Criteria
      • Contains filter on predicates
      • Contains a join that requires heavy shuffles

Tune queries by using cache

  • Caching stores frequently accessed data in memory or disk for faster retrieval
  • Caching in Synapse SQL
    • Result set caching
      • Off by default
      • Enabled at database or session level
        • DB: ALTER DATABASE SET RESULT_SET_CACHING ON
        • Session: SET RESULT_SET_CACHING { ON | OFF }
      • Faster query performance
      • Max size of 1 TB per database
    • Requirements
      • User running the query has access to tables used in the query
      • Cached query and new query have to be an exact match
      • No changes to the table's data or schema where cache was generated from
  • Caching in Spark
    • RDD (resilient distributed dataset
    • DataFrame
    • DataSets
    • Cache methods
      • .persist()
      • .cache()
      • CACHE TABLE

Troubleshoot a failed Spark job

  • Debug the issue within the environment and within the job
  • Environment
    • Confirm the region the cluster is in is not down (status.azure.com)
    • Use HDInsight Ambari Dashboard to view cluster health
    • Are clusters using high CPU or memory?
  • Jobs
    • Driver logs
    • Task logs
    • Executor logs

Troubleshoot a failed pipeline run, including activities executed in external services

  • Use Output section of pipeline details to see job status
  • To the right of the failed message there are more error details
  • Examine the detailed error message for failed activities

Top comments (0)