DEV Community

Alec Dutcher


DP-203 Study Guide - Implement a partition strategy

Study guide

Data partitioning guidance

  • Horizontal (sharding) = each partition is a separate data store, but all partitions have the same schema (partitions have different rows)
  • Vertical = each partition holds a subset of the fields according to patterns of use (partitions have different columns)
  • Functional = data is aggregated according to how it is used by each bounded context (e.g., invoice data vs. product data)
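The horizontal/vertical distinction can be sketched with a toy table (all names and fields here are illustrative, not from any real system):

```python
# Toy illustration of horizontal vs. vertical partitioning of one table.
customers = [
    {"id": 1, "name": "Ana", "region": "EU", "notes": "long free text"},
    {"id": 2, "name": "Bo",  "region": "US", "notes": "more free text"},
]

# Horizontal (sharding): every partition keeps the full schema but holds
# a different subset of rows (here, split by region).
shards = {
    "EU": [r for r in customers if r["region"] == "EU"],
    "US": [r for r in customers if r["region"] == "US"],
}

# Vertical: every partition holds all rows but only a subset of columns,
# grouped by access pattern (frequently read fields vs. rarely read fields).
hot_fields = [{"id": r["id"], "name": r["name"]} for r in customers]
cold_fields = [{"id": r["id"], "notes": r["notes"]} for r in customers]
```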

Implement a partition strategy for files

  • Think through the problem, whiteboard it out
  • Parquet
    • Most common file type for big data
    • Column-based storage with nested data structures
    • Supports parallel processing queries
    • Row-group sections can be treated as partitions - multiple row-groups can be sent to different nodes
    • Break partitions apart based on column values, e.g. a date column that queries commonly filter on
  • Best practices
    • Make sure to include partition columns in table's schema definition
    • Group related records together
    • Don't use unnecessary columns
    • 512 MB to 1 GB is optimal partition size
    • Consider the query and how the data will be used
    • Consider the expected growth of the data
    • Consider how static the data is
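A minimal sketch of the on-disk layout that partitioning by a column value produces, assuming the Hive-style `column=value` directory convention commonly used when writing partitioned Parquet; the records, column names, and file name are made up:

```python
from collections import defaultdict

# Hypothetical records to be written as partitioned Parquet files.
# In practice a framework call (e.g. a Spark partitionBy) produces this
# layout; here we only sketch the directory scheme itself.
records = [
    {"order_id": 1, "order_date": "2024-01-01", "amount": 10.00},
    {"order_id": 2, "order_date": "2024-01-01", "amount": 25.50},
    {"order_id": 3, "order_date": "2024-01-02", "amount": 7.25},
]

def partition_path(record, partition_col):
    # Hive-style layout: <column>=<value>/part-0000.parquet
    return f"{partition_col}={record[partition_col]}/part-0000.parquet"

grouped = defaultdict(list)
for r in records:
    grouped[partition_path(r, "order_date")].append(r)

# Queries filtering on order_date can now skip whole directories.
for path, rows in sorted(grouped.items()):
    print(path, len(rows))
```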

Implement a partition strategy for analytical workloads

  • Distribution types
    • Round-robin
      • Characteristics
        • Rows are assigned to distributions randomly but evenly
        • Fast load performance, since row assignment requires no computation
        • Slower read performance, since related rows can land on different distributions, raising the potential for data movement
      • Best for:
        • No clear distribution key
        • No frequent joins
        • Uniform distribution is desired
        • Temporary staging table
        • Simple starting point
    • Hash
      • Characteristics
        • Distributed deterministically using hash function on a column
        • Distribution column can’t be changed later
        • Choose a column with many unique values, few or no nulls, and that is not a date column
      • Best for
        • Large tables (>2 GB)
        • Frequent inserts, updates, and deletes
    • Replicated
      • Characteristics
        • Full copy of table is replicated to every compute node
        • Requires extra storage and overhead for writes
        • Normally used in conjunction with other methods
      • Best for:
        • Small lookup or dimension tables joined with larger tables
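The round-robin vs. hash trade-off can be sketched in plain Python. The 60-distribution count matches dedicated SQL pools, but the hash function below is purely illustrative, not the one Synapse actually uses:

```python
import hashlib
from itertools import count

NUM_DISTRIBUTIONS = 60  # dedicated SQL pools spread every table over 60 distributions

def hash_distribution(key):
    # Deterministic: the same key always lands on the same distribution,
    # so joins and aggregations on this column avoid data movement.
    # (Illustrative hash, not Synapse's internal one.)
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_DISTRIBUTIONS

def round_robin_distribution(counter=count()):
    # Fast for loads (no computation per row), but the same key can land
    # anywhere, so reads that join on it may need data movement.
    return next(counter) % NUM_DISTRIBUTIONS

customer_keys = ["C100", "C101", "C100", "C102"]
print([hash_distribution(k) for k in customer_keys])        # the two C100 rows co-locate
print([round_robin_distribution() for _ in customer_keys])  # simply 0, 1, 2, 3
```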

Implement a partition strategy for streaming workloads

  • Azure Stream Analytics
    • Fully managed stream processing engine
    • Input layer: data from Blob storage, Event Hubs, or IoT Hub is ingested into ASA
    • Query layer: ASA performs query
    • Output layer: Results sent to Blob storage for downstream use
  • How transformation works in a stream
    • Data in stream is diverted to perform query
    • Query transformation results are re-introduced to stream for output
    • Transformation is done in near real time
  • Partitioning
    • Embarrassingly parallel job: equal input and output partitions, one instance of the query
    • Must align partition keys between inputs, query logic, and outputs
    • Jobs that aren't embarrassingly parallel can still be completed, but not as efficiently
      • Involves querying windows
  • In the Azure Portal
    • Query in the left-side options
    • Inputs - define query, can test and see results
    • Outputs - define and test output query
      • Here you can define the partition key: in compatibility level 1.1 and below, use the PARTITION BY clause in the query; in 1.2, set the partition key on the input
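The alignment requirement behind an embarrassingly parallel job can be sketched as a simple check — one partition key and equal partition counts from input through output. The config shape and helper below are hypothetical, not an Azure SDK API:

```python
# Hypothetical helper sketching the "embarrassingly parallel" conditions
# for a Stream Analytics job: partition keys align across input, query,
# and output, and input/output partition counts match, so each partition
# can be processed by its own independent query instance.
def is_embarrassingly_parallel(input_cfg, query_cfg, output_cfg):
    keys_align = (input_cfg["partition_key"]
                  == query_cfg["partition_by"]
                  == output_cfg["partition_key"])
    counts_align = input_cfg["partitions"] == output_cfg["partitions"]
    return keys_align and counts_align

job = {
    "input":  {"partition_key": "DeviceId", "partitions": 8},
    "query":  {"partition_by": "DeviceId"},
    "output": {"partition_key": "DeviceId", "partitions": 8},
}
print(is_embarrassingly_parallel(job["input"], job["query"], job["output"]))  # True
```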

Implement a partition strategy for Azure Synapse Analytics

  • Table partitions
    • Supported on all dedicated SQL pool table types (clustered columnstore, clustered index, heap)
    • Supported on all distribution types (hash, round-robin, replicated)
    • Why partition
      • Query performance
      • Load performance - Smaller amounts of data make incremental loading, updating, and deleting faster and easier
  • Clustered columnstore indexes
    • Standard for storing and querying large data warehouse fact tables
    • Rows are organized into row groups of up to 1,048,576 rows
    • Each row group is divided into column segments, one per column
    • The columnstore index is built from these column segments, and the data is compressed
    • Deltastore: leftover rows that don't fill a complete row group are held in row format until enough accumulate to compress
  • Law of 60
    • A distribution is the basic unit of storage and processing
    • Synapse divides each query into 60 smaller queries that run in parallel, one per distribution
    • Partitioning multiplies: 10 table partitions become 600 physical partitions (10 × 60)
    • For good columnstore compression, each partition needs at least ~1 million rows per distribution
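The law-of-60 arithmetic as a quick back-of-envelope check:

```python
# Back-of-envelope arithmetic for the law of 60.
DISTRIBUTIONS = 60             # fixed for every dedicated SQL pool table
ROWS_PER_ROWGROUP = 1_048_576  # max rows in one compressed columnstore row group

table_partitions = 10
physical_partitions = table_partitions * DISTRIBUTIONS
print(physical_partitions)  # 600

# To give every physical partition roughly one full row group:
min_rows = physical_partitions * ROWS_PER_ROWGROUP
print(f"{min_rows:,}")  # 629,145,600 rows before 10 partitions pay off
```

In other words, partitioning a columnstore table this finely only makes sense for tables in the hundreds of millions of rows.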
  • In the Azure Synapse Analytics portal
    • When writing CREATE TABLE statement, use WITH clause using CLUSTERED COLUMNSTORE INDEX
    • Define the distribution type and key
    • Choose partition key

Identify when partitioning is needed in Azure Data Lake Storage Gen2

  • Azure Blob Storage
    • General purpose, Block, and Page blob
    • Account --> Container --> Blob
  • Partition key identification
    • Azure Storage serves requests from a single partition faster than requests that span multiple partitions
    • Partitioning is used to improve read performance
    • Naming blobs correctly is critical
    • Blob storage uses a range-based partitioning scheme
    • Partition key is combo of Account + Container + Blob
    • Blob storage orders blobs lexically by name and groups adjacent names into ranges, so names with similar prefixes (such as timestamps) are co-located on the same partition
  • Best practices
    • Avoid slowly changing prefixes such as timestamps (yyyymmdd), which funnel all new writes to one partition
    • Name based upon likely queries
    • Avoid latency-causing partitioning (use blob sizes >256 KB, use hashing functions in names)
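A sketch of the naming best practice: a short hash prefix in the blob name spreads writes across range partitions instead of funneling them behind one slowly changing timestamp prefix. The paths and names here are illustrative:

```python
import hashlib

# Blob storage range-partitions on the lexical order of
# Account + Container + Blob names, so a timestamp-first name sends
# every new write to the same partition (a hotspot). Hashing a stable
# attribute into the prefix spreads names across the namespace.
def hot_name(device_id, ts):
    return f"logs/{ts}/{device_id}.json"  # all of today's writes share one prefix

def spread_name(device_id, ts):
    prefix = hashlib.md5(device_id.encode()).hexdigest()[:3]
    return f"logs/{prefix}/{ts}/{device_id}.json"  # prefix varies per device

print(hot_name("dev-42", "20240101"))  # logs/20240101/dev-42.json
print(spread_name("dev-42", "20240101"))
```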
