DEV Community

Alec Dutcher
Alec Dutcher

Posted on • Edited on

DP-203 Study Guide - Optimize and troubleshoot data storage and data processing

Study guide

Compact small files

  • What does it meant to compact small files?
    • Combine a lot of small files into one file
    • Improves speed of read queries
    • Can be done from a Copy job in ADF/Synapse or incremental load
    • Also available in a Delta Lake feature
  • Using a Copy job
    • Source is the directory with all of the small files
    • Select using a wildcard (/directory/*)
    • Use the Copy behavior to merge the files
  • Using Delta Lake
OPTIMIZE delta.`/data/events`
OPTIMIZE delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/path-to-data
Enter fullscreen mode Exit fullscreen mode

Handle skew in data

  • Skew
    • An uneven distribution of data
    • Data skew can unbalance compute nodes, lowering performance
    • Avoid by balancing parallel processing with correct table distribution (hash or round-robin)
  • Detect skew in distributed table (database consistency check)
    • DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
  • Resolve data skew
    • Research
      • Monitor query impact
      • Weigh the cost of minimizing
    • Solution
      • Re-create table with a new distribution column set
      • CREATE TABLE AS SELECT (CTAS)

Handle data spill

  • Data spill is when compute engine is unable to hold data in memory and writes ("spills") data to disk
  • Impact is expensive disk reads/writes and longer execution times
  • Occurs when
    • Partition size is too big
    • Compute resource size is small
    • Data size during merges, unions, etc exceeds memory limit of the compute node
  • Identifying data spill
    • Synapse SQL - TempDB runs out of space and throws error (monitor with DMVs)
    • Spark - view task summary screen under spill column
  • Handling the spill
    • Increase compute capacity
    • Reduce partition size
    • Remove skews in data

Optimize resource management

  • Optimize Synapse SQL Pools
    • Pause when not in use
    • Use the right compute unit (DWU) for workload
    • Leverage Azure Functions to scale out workload
  • Optimize Spark
    • Select autoscale option in cluster setup
    • Select auto-terminate
    • Use spot instances
    • Right-size cluster nodes based on memory, CPU intensive, etc

Tune queries by using indexers

  • Types of indexes
    • Clustered columnstore index
      • Default in SQL pool table
      • Use for tables > 100 million rows
      • Good performance and data compression
    • Clustered index
      • Good for specific filter conditions
      • Use for tables between 100 and 100 million rows
    • Heap index
      • Use for staging tables
  • Maintain by rebuilding indexes when seeing performance degradation in existing indexes
  • Indexes in Spark Pool
    • Spark does not have an inbuilt index
    • Uses Hyperspace (or Hyperscale) - ability to create indexes on datasets (CSV, JSON, parquet)
    • Works via API
    • Criteria
      • Contains filter on predicates
      • Contains a join that requires heavy shuffles

Tune queries by using cache

  • Caching stores frequently accessed data in memory or disk for faster retrieval
  • Caching in Synapse SQL
    • Result set caching
      • Off by default
      • Enabled at database or session level
        • DB: ALTER DATABASE SET RESULT_SET_CACHING ON
        • Session: SET RESULT_SET_CACHING { ON | OFF }
      • Faster query performance
      • Max size of 1 TB per database
    • Requirements
      • User running the query has access to tables used in the query
      • Cached query and new query have to be an exact match
      • No changes to the table's data or schema where cache was generated from
  • Caching in Spark
    • RDD (resilient distributed dataset
    • DataFrame
    • DataSets
    • Cache methods
      • .persist()
      • .cache()
      • CACHE TABLE

Troubleshoot a failed Spark job

  • Debug the issue within the environment and within the job
  • Environment
    • Confirm the region the cluster is in is not down (status.azure.com)
    • Use HDInsight Ambari Dashboard to view cluster health
    • Are clusters using high CPU or memory?
  • Jobs
    • Driver logs
    • Task logs
    • Executor logs

Troubleshoot a failed pipeline run, including activities executed in external services

  • Use Output section of pipeline details to see job status
  • To the right of the failed message there are more error details
  • Examine the detailed error message for failed activities

AWS Q Developer image

Your AI Code Assistant

Implement features, document your code, or refactor your projects.
Built to handle large projects, Amazon Q Developer works alongside you from idea to production code.

Get started free in your IDE

Top comments (0)

AWS GenAI LIVE image

How is generative AI increasing efficiency?

Join AWS GenAI LIVE! to find out how gen AI is reshaping productivity, streamlining processes, and driving innovation.

Learn more

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay