DEV Community

Alec Dutcher
Alec Dutcher

Posted on • Edited on

DP-203 Study Guide - Implement a partition strategy

Study guide

Data partitioning guidance

  • Horizontal (sharding) = each partition is a separate data store, but all partitions have the same schema (partitions have different rows)
  • Vertical = each partition holds a subset of the fields according to patterns of use (partitions have different columns)
  • Functional = data is aggregated according to how it is used by each bounded context (i.e. invoice data vs product data)

Implement a partition strategy for files

  • Think through the problem, whiteboard it out
  • Parquet
    • Most common file type for big data
    • Column-based storage with nested data structures
    • Supports parallel processing queries
    • Row-group sections can be treated as partitions - multiple row-groups can be sent to different nodes
    • Break partitions apart based on column values, i.e. query based on a date
  • Best practices
    • Make sure to include partition columns in table's schema definition
    • Group related records together
    • Don't use unnecessary columns
    • 512 MB to 1 GB is optimal partition size
    • Consider the query and how the data will be used
    • Consider the expected growth of the data
    • Consider how static the data is

Implement a partition strategy for analytical workloads

  • Distribution types
    • Round-robin
      • Characteristics
        • Distributed evenly in a random fashion
        • Even distribution across DBs
        • Assignment is random
        • Fast performance for loads as row assignment can be done quickly
        • Slower performance for reads as higher potential for data movement
      • Best for:
        • No clear distribution key
        • No frequent joins
        • Uniform distribution is desired
        • Temporary staging table
        • Simple starting point
    • Hash
      • Characteristics
        • Distributed deterministically using hash function on a column
        • Distribution column can’t be changed later
        • Choose one with unique values, few/no nulls, is not a date column
      • Best for
        • Large tables (>2 Gb)
        • Frequent inserts, updates, and deletes
    • Replicated
      • Characteristics
        • Full copy of table is replicated to every compute node
        • Requires extra storage and overhead for writes
        • Normally used in conjunction with other methods
      • Best for:
        • Small lookup or dimension tables joined with larger tables

Implement a partition strategy for streaming workloads

  • Azure Stream Analytics
    • Fully managed stream processing engine
    • Input layer (Blob storage, Event Hubs, IoT hubs) ingested into ASA
    • Query layer: ASA performs query
    • Output layer: Results sent to Blob storage for downstream use
  • How transformation works in a stream
    • Data in stream is diverted to perform query
    • Query transformation results are re-introduced to stream for output
    • Transformation is done in near real time
  • Partitioning
    • Embarrassingly parallel job: equal input and output partitions, one instance of the query
    • Must align partition keys between inputs, query logic, and outputs
    • Jobs that aren't embarrassingly parallel can still be completed, but not as efficiently
      • Involves querying windows
  • In the Azure Portal
    • Query in the left-side options
    • Inputs - define query, can test and see results
    • Outputs - define and test output query
      • Here you can define partition key with the PARTITION BY clause (in compatibility level 1.1 and below, in 1.2 define partition key in input)

Implement a partition strategy for Azure Synapse Analytics

  • Table partitions
    • Supported on all dedicated SQL pool types
      • Clustered columnstore, clustered index, heap
      • Supported on all distribution types (hash, round robin, etc)
    • Why partition
      • Query performance
      • Load performance - Smaller amounts of data make incremental loading, updating, and deleting faster and easier
  • Clustered columnstore indexes
    • Standard for storing and querying large data warehouse fact tables
    • Rows are organized into row groups containing 1,048,576 rows
    • Row groups organized into column segments
    • Index columnstore is built from column segments - data is compressed
    • Deltastore - leftover row group
  • Law of 60
    • A distribution is a basic unit of storage and processing
    • Synapse divides work into 60 smaller queries that run in parallel on a data distribution
    • This turns 10 partitions into 600
    • Each partition needs 1 million rows
  • In the Azure Synapse Analytics portal
    • When writing CREATE TABLE statement, use WITH clause using CLUSTERED COLUMNSTORE INDEX
    • Define the distribution type and key
    • Choose partition key

Identify when partitioning is needed in Azure Data Lake Storage Gen2

  • Azure Blob Storage
    • General purpose, Block, and Page blob
    • Account --> Container --> Blob
  • Partition key identification
    • Azure Storage serves single partitions faster than multiple partitions
    • Partitioning is used to improve read performance
    • Naming blobs correctly is critical
    • Blob storage uses a range-based partitioning scheme
    • Partition key is combo of Account + Container + Blob
    • Blob storage uses lexical ordering and timestamps which increases co-location on partitions
  • Best practices
    • Avoid slowly changing timestamps (yyyymmdd)
    • Name based upon likely queries
    • Avoid latency-causing partitioning (use blob size >256 Kb, use hashing functions)

AWS GenAI LIVE image

How is generative AI increasing efficiency?

Join AWS GenAI LIVE! to find out how gen AI is reshaping productivity, streamlining processes, and driving innovation.

Learn more

Top comments (0)

AWS GenAI LIVE image

How is generative AI increasing efficiency?

Join AWS GenAI LIVE! to find out how gen AI is reshaping productivity, streamlining processes, and driving innovation.

Learn more

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay