DP-203 Study Guide - Implement a partition strategy

Horizontal (sharding) = each partition is a separate data store, but all partitions have the same schema (partitions have different rows)
Vertical = each partition holds a subset of the fields according to patterns of use (partitions have different columns)
Functional = data is aggregated according to how it is used by each bounded context (i.e. invoice data vs product data)

Implement a partition strategy for files

Implement a partition strategy for analytical workloads

Implement a partition strategy for streaming workloads

Azure Stream Analytics
- Fully managed stream processing engine
- Input layer (Blob storage, Event Hubs, IoT hubs) ingested into ASA
- Query layer: ASA performs query
- Output layer: Results sent to Blob storage for downstream use
How transformation works in a stream
- Data in stream is diverted to perform query
- Query transformation results are re-introduced to stream for output
- Transformation is done in near real time
Partitioning
- Embarrassingly parallel job: equal input and output partitions, one instance of the query
- Must align partition keys between inputs, query logic, and outputs
- Jobs that aren't embarrassingly parallel can still be completed, but not as efficiently
  - Involves querying windows
In the Azure Portal
- Query in the left-side options
- Inputs - define query, can test and see results
- Outputs - define and test output query
  - Here you can define partition key with the PARTITION BY clause (in compatibility level 1.1 and below, in 1.2 define partition key in input)

Implement a partition strategy for Azure Synapse Analytics

Table partitions
- Supported on all dedicated SQL pool types
  - Clustered columnstore, clustered index, heap
  - Supported on all distribution types (hash, round robin, etc)
- Why partition
  - Query performance
  - Load performance - Smaller amounts of data make incremental loading, updating, and deleting faster and easier
Clustered columnstore indexes
- Standard for storing and querying large data warehouse fact tables
- Rows are organized into row groups containing 1,048,576 rows
- Row groups organized into column segments
- Index columnstore is built from column segments - data is compressed
- Deltastore - leftover row group
Law of 60
- A distribution is a basic unit of storage and processing
- Synapse divides work into 60 smaller queries that run in parallel on a data distribution
- This turns 10 partitions into 600
- Each partition needs 1 million rows
In the Azure Synapse Analytics portal
- When writing CREATE TABLE statement, use WITH clause using CLUSTERED COLUMNSTORE INDEX
- Define the distribution type and key
- Choose partition key

Identify when partitioning is needed in Azure Data Lake Storage Gen2

Azure Blob Storage
- General purpose, Block, and Page blob
- Account --> Container --> Blob
Partition key identification
- Azure Storage serves single partitions faster than multiple partitions
- Partitioning is used to improve read performance
- Naming blobs correctly is critical
- Blob storage uses a range-based partitioning scheme
- Partition key is combo of Account + Container + Blob
- Blob storage uses lexical ordering and timestamps which increases co-location on partitions
Best practices
- Avoid slowly changing timestamps (yyyymmdd)
- Name based upon likely queries
- Avoid latency-causing partitioning (use blob size >256 Kb, use hashing functions)

DEV Community