<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alec Dutcher</title>
    <description>The latest articles on DEV Community by Alec Dutcher (@aidutcher).</description>
    <link>https://dev.to/aidutcher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F765361%2F9f549866-4a25-45fb-8b7c-07575c08e90a.jpeg</url>
      <title>DEV Community: Alec Dutcher</title>
      <link>https://dev.to/aidutcher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aidutcher"/>
    <language>en</language>
    <item>
      <title>DP-203 Study Guide - Optimize and troubleshoot data storage and data processing</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:21:09 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-optimize-and-troubleshoot-data-storage-and-data-processing-2hbj</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-optimize-and-troubleshoot-data-storage-and-data-processing-2hbj</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compact small files&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does it mean to compact small files?

&lt;ul&gt;
&lt;li&gt;Combine many small files into a single file&lt;/li&gt;
&lt;li&gt;Improves the speed of read queries&lt;/li&gt;
&lt;li&gt;Can be done with a Copy job in ADF/Synapse or during an incremental load&lt;/li&gt;
&lt;li&gt;Also available as a Delta Lake feature&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Using a Copy job

&lt;ul&gt;
&lt;li&gt;Source is the directory with all of the small files&lt;/li&gt;
&lt;li&gt;Select using a &lt;strong&gt;wildcard (/directory/*)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;Copy behavior&lt;/strong&gt; to merge the files&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Using Delta Lake

&lt;ul&gt;
&lt;li&gt;Use &lt;a href="https://learn.microsoft.com/en-us/azure/databricks/delta/optimize"&gt;&lt;strong&gt;OPTIMIZE feature&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Done via query in Spark SQL:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPTIMIZE delta.`/data/events`
OPTIMIZE delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/path-to-data`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Handle skew in data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skew 

&lt;ul&gt;
&lt;li&gt;An uneven distribution of data&lt;/li&gt;
&lt;li&gt;Data skew can unbalance compute nodes, lowering performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid by balancing parallel processing with correct table distribution&lt;/strong&gt; (hash or round-robin)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect skew&lt;/strong&gt; in a distributed table (database consistency check)

&lt;ul&gt;
&lt;li&gt;DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Resolve data skew

&lt;ul&gt;
&lt;li&gt;Research

&lt;ul&gt;
&lt;li&gt;Monitor query impact&lt;/li&gt;
&lt;li&gt;Weigh the cost of minimizing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Solution

&lt;ul&gt;
&lt;li&gt;Re-create table with a new distribution column set&lt;/li&gt;
&lt;li&gt;CREATE TABLE AS SELECT &lt;strong&gt;(CTAS)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
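&lt;p&gt;As a sketch, re-creating a skewed table with CTAS might look like the following (table and column names are illustrative, not from any particular schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: rebuild the table on a better-distributed hash column
CREATE TABLE dbo.FactInternetSales_New
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactInternetSales;

-- Swap the new table in once the copy is verified
RENAME OBJECT dbo.FactInternetSales TO FactInternetSales_Old;
RENAME OBJECT dbo.FactInternetSales_New TO FactInternetSales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;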

&lt;p&gt;&lt;strong&gt;Handle data spill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data spill occurs when the compute engine is unable to hold data in memory and writes ("spills") it to disk&lt;/li&gt;
&lt;li&gt;The impact is expensive disk reads/writes and longer execution times&lt;/li&gt;
&lt;li&gt;Occurs when

&lt;ul&gt;
&lt;li&gt;Partition size is too big&lt;/li&gt;
&lt;li&gt;Compute resource size is too small&lt;/li&gt;
&lt;li&gt;Data size during merges, unions, etc. exceeds the memory limit of the compute node&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Identifying data spill

&lt;ul&gt;
&lt;li&gt;Synapse SQL - TempDB runs out of space and throws error (monitor with DMVs)&lt;/li&gt;
&lt;li&gt;Spark - view task summary screen under spill column&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Handling the spill

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increase compute&lt;/strong&gt; capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce partition&lt;/strong&gt; size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove skews&lt;/strong&gt; in data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimize resource management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimize Synapse SQL Pools

&lt;ul&gt;
&lt;li&gt;Pause when not in use&lt;/li&gt;
&lt;li&gt;Use the right compute unit (DWU) for workload&lt;/li&gt;
&lt;li&gt;Leverage Azure Functions to scale out workload&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Optimize Spark

&lt;ul&gt;
&lt;li&gt;Select autoscale option in cluster setup&lt;/li&gt;
&lt;li&gt;Select auto-terminate&lt;/li&gt;
&lt;li&gt;Use spot instances&lt;/li&gt;
&lt;li&gt;Right-size cluster nodes based on whether workloads are memory- or CPU-intensive&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tune queries by using indexers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of indexes

&lt;ul&gt;
&lt;li&gt;Clustered columnstore index

&lt;ul&gt;
&lt;li&gt;Default in SQL pool table&lt;/li&gt;
&lt;li&gt;Use for tables &lt;strong&gt;&amp;gt; 100 million rows&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Good performance and data compression&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Clustered index

&lt;ul&gt;
&lt;li&gt;Good for specific filter conditions&lt;/li&gt;
&lt;li&gt;Use for tables &lt;strong&gt;between 100 and 100 million rows&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Heap index

&lt;ul&gt;
&lt;li&gt;Use for &lt;strong&gt;staging tables&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Maintain by rebuilding indexes when seeing performance degradation in existing indexes&lt;/li&gt;
&lt;li&gt;Indexes in Spark Pool

&lt;ul&gt;
&lt;li&gt;Spark does not have an inbuilt index&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Hyperspace&lt;/strong&gt; - the ability to create indexes on datasets (CSV, JSON, Parquet)&lt;/li&gt;
&lt;li&gt;Works via API&lt;/li&gt;
&lt;li&gt;Criteria

&lt;ul&gt;
&lt;li&gt;Query contains filters on predicates&lt;/li&gt;
&lt;li&gt;Contains a join that requires heavy shuffles&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tune queries by using cache&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caching stores frequently accessed data in memory or disk for faster retrieval&lt;/li&gt;
&lt;li&gt;Caching in Synapse SQL

&lt;ul&gt;
&lt;li&gt;Result set caching

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Off by default&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enabled at database or session level&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DB: ALTER DATABASE SET RESULT_SET_CACHING ON&lt;/li&gt;
&lt;li&gt;Session: SET RESULT_SET_CACHING { ON | OFF }&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Faster query performance&lt;/li&gt;
&lt;li&gt;Max size of &lt;strong&gt;1 TB per database&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Requirements

&lt;ul&gt;
&lt;li&gt;User running the query has access to tables used in the query&lt;/li&gt;
&lt;li&gt;Cached query and new query have to be an &lt;strong&gt;exact match&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No changes to the table's data or schema where cache was generated from&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Caching in Spark

&lt;ul&gt;
&lt;li&gt;RDD (resilient distributed dataset)&lt;/li&gt;
&lt;li&gt;DataFrame&lt;/li&gt;
&lt;li&gt;DataSets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache methods&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;.persist()&lt;/li&gt;
&lt;li&gt;.cache()&lt;/li&gt;
&lt;li&gt;CACHE TABLE&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
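&lt;p&gt;For example, a minimal Spark SQL sketch of table-level caching (the table name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Cache a table in memory for repeated reads
CACHE TABLE events;

-- Subsequent scans are served from the cache
SELECT COUNT(*) FROM events;

-- Release the cache when finished
UNCACHE TABLE events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;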

&lt;p&gt;&lt;strong&gt;Troubleshoot a failed Spark job&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debug the issue within the environment and within the job&lt;/li&gt;
&lt;li&gt;Environment

&lt;ul&gt;
&lt;li&gt;Confirm the region the cluster is in is not down (status.azure.com)&lt;/li&gt;
&lt;li&gt;Use HDInsight Ambari Dashboard to view cluster health&lt;/li&gt;
&lt;li&gt;Are clusters using high CPU or memory?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Jobs

&lt;ul&gt;
&lt;li&gt;Driver logs&lt;/li&gt;
&lt;li&gt;Task logs&lt;/li&gt;
&lt;li&gt;Executor logs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Troubleshoot a failed pipeline run, including activities executed in external services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Output section of pipeline details to see job status&lt;/li&gt;
&lt;li&gt;To the right of the failed message there are more error details&lt;/li&gt;
&lt;li&gt;Examine the detailed error message for failed activities&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Monitor data storage and data processing</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:20:01 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-monitor-data-storage-and-data-processing-44no</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-monitor-data-storage-and-data-processing-44no</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement logging used by Azure Monitor&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Monitor key features&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; - resource utilization, response time, etc&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt; - leverage Azure Log Analytics to store and query logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerts&lt;/strong&gt; - set up alerts based on specific metrics or log data&lt;/li&gt;
&lt;li&gt;Service maps, analytics insights, workbooks, and more&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Analytics Workspace&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Central repository and analytics engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collects, stores, and analyzes log data&lt;/strong&gt; and other telemetry&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnostic settings&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Feature of Azure Monitor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control and route diagnostic data&lt;/strong&gt; from Azure resources to various destinations&lt;/li&gt;
&lt;li&gt;Source (metrics, resource logs, activity logs)&lt;/li&gt;
&lt;li&gt;Destination (Event Hubs, Log Analytics, Azure Storage)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure monitoring services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What can be monitored in Azure Monitor?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt; (containers, OS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Platform&lt;/strong&gt; (resources, subscription, tenant)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom sources&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Configure Monitor for Azure Resources

&lt;ul&gt;
&lt;li&gt;Monitoring section of any Azure resource&lt;/li&gt;
&lt;li&gt;Select "Metrics"&lt;/li&gt;
&lt;li&gt;Choose scope, metric, visual type, etc&lt;/li&gt;
&lt;li&gt;Save as Azure Monitor workbook&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor stream processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor Stream Analytics jobs via Azure Monitor in the portal, PowerShell, or .NET SDK&lt;/li&gt;
&lt;li&gt;In the portal, select "Metrics" under the Monitoring section&lt;/li&gt;
&lt;li&gt;Can save metrics to a dashboard or send to Azure Monitor workbook&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Measure performance of data movement&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the details of a pipeline, there are records of the tasks performed&lt;/li&gt;
&lt;li&gt;Click the eyeglasses symbol on a record to view details about the performance (duration, throughput, start/end time, etc)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor and update statistics about data across a system&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistics provide &lt;strong&gt;info about how data is distributed&lt;/strong&gt; in a table and help the database determine the most efficient way to run a query&lt;/li&gt;
&lt;li&gt;Important for

&lt;ul&gt;
&lt;li&gt;Query performance and optimization&lt;/li&gt;
&lt;li&gt;Optimal execution plans&lt;/li&gt;
&lt;li&gt;Index utilization&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Update statistics daily or after loading/transforming data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Enabled at &lt;strong&gt;database level&lt;/strong&gt; with ALTER DATABASE database_name SET AUTO_CREATE_STATISTICS ON&lt;/li&gt;
&lt;li&gt;Querying stats data:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Display Query Statistics information
sp_helpstats N'StatisticsTest', 'all'

- Display extra information
SELECT FROM sys.stats AS stat
CROSS APPLY sys.dm_db_stats_properties(stat.object_id, stat.stats_id) AS sp
WHERE stat.object_id = object_id('StatisticsTest');

- Display query details
SELECT * FROM dbo.StatisticsTest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
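&lt;p&gt;The "update statistics after loading" step above could be sketched like this (statistics and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: create statistics on a frequently filtered column
CREATE STATISTICS stat_CustomerKey ON dbo.StatisticsTest (CustomerKey);

-- Refresh all statistics on the table after a data load
UPDATE STATISTICS dbo.StatisticsTest;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;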



&lt;p&gt;&lt;strong&gt;Monitor data pipeline performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor section in ADF or Synapse Studio&lt;/li&gt;
&lt;li&gt;Displays pipeline runs&lt;/li&gt;
&lt;li&gt;Within a pipeline run you can view

&lt;ul&gt;
&lt;li&gt;Consumption&lt;/li&gt;
&lt;li&gt;Pipeline orchestration (activities performed)&lt;/li&gt;
&lt;li&gt;Data flow (activity inputs, outputs, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;View in list view or Gantt view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Measure query performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools to measure query performance

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-ver16"&gt;Query store&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Identifies performance differences when query plan changes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-sql/database/intelligent-insights-overview?view=azuresql"&gt;Intelligent Insights&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Uses AI to continuously monitor database usage to detect disruptive events&lt;/li&gt;
&lt;li&gt;Detection metrics

&lt;ul&gt;
&lt;li&gt;Query duration&lt;/li&gt;
&lt;li&gt;Timeout requests&lt;/li&gt;
&lt;li&gt;Excessive wait time&lt;/li&gt;
&lt;li&gt;Errored out requests&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/system-dynamic-management-views?view=sql-server-ver16"&gt;Dynamic Management Views (DMV)&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Monitor server health&lt;/li&gt;
&lt;li&gt;Diagnose problems&lt;/li&gt;
&lt;li&gt;Tune performance&lt;/li&gt;
&lt;li&gt;Available via SQL queries

&lt;ul&gt;
&lt;li&gt;sys.dm_pdw_exec_requests&lt;/li&gt;
&lt;li&gt;sys.dm_exec_requests&lt;/li&gt;
&lt;li&gt;sys.dm_pdw_request_steps&lt;/li&gt;
&lt;li&gt;sys.dm_exec_query_plan&lt;/li&gt;
&lt;li&gt;sys.dm_pdw_waits&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Schedule and monitor pipeline tests&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using a scheduled trigger

&lt;ul&gt;
&lt;li&gt;Add a new trigger&lt;/li&gt;
&lt;li&gt;Leave as Schedule type&lt;/li&gt;
&lt;li&gt;Choose start time and frequency&lt;/li&gt;
&lt;li&gt;Go to Manage section and view triggered runs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interpret Azure Monitor metrics and logs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Monitor Metrics

&lt;ul&gt;
&lt;li&gt;Collects numeric data from monitored resources and stores it in a time-series database&lt;/li&gt;
&lt;li&gt;Allows point-in-time descriptions of resources&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Resources that AMM pulls data from

&lt;ul&gt;
&lt;li&gt;Azure Resources

&lt;ul&gt;
&lt;li&gt;First party services&lt;/li&gt;
&lt;li&gt;Access to metrics is available by default&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Monitor Agent - Collects data from OS&lt;/li&gt;
&lt;li&gt;Application Insights - collects telemetry about specific application workloads&lt;/li&gt;
&lt;li&gt;REST API - get data in and out of AMM&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Azure Monitor Logs

&lt;ul&gt;
&lt;li&gt;Collect and organize logs and performance data from monitored resources&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-workspace-overview"&gt;Log Analytics workspaces&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Edit and run log queries&lt;/li&gt;
&lt;li&gt;Create alerts and workbooks&lt;/li&gt;
&lt;li&gt;Analyze logs with Kusto Query Language&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement a pipeline alert strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/monitor-metrics-alerts"&gt;Set up a pipeline alert in ADF&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Provides ability to combine data and business process&lt;/li&gt;
&lt;li&gt;Configured in the Monitor section under Alerts and Metrics&lt;/li&gt;
&lt;li&gt;Alert

&lt;ul&gt;
&lt;li&gt;Set target criteria&lt;/li&gt;
&lt;li&gt;Send out notification to an email or group&lt;/li&gt;
&lt;li&gt;Send notifications via text, push notification, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Implement data security</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:18:39 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-implement-data-security-24pa</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-implement-data-security-24pa</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement data masking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic data masking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Prevents unauthorized access by limiting exposure of sensitive data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure masking policies on database fields&lt;/strong&gt; to designate how much data to reveal to nonprivileged users&lt;/li&gt;
&lt;li&gt;Available in:

&lt;ul&gt;
&lt;li&gt;Azure SQL Database&lt;/li&gt;
&lt;li&gt;Azure SQL Managed Instance&lt;/li&gt;
&lt;li&gt;Azure Synapse Analytics Dedicated SQL pools&lt;/li&gt;
&lt;li&gt;SQL Server on Azure VMs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Masking policies

&lt;ul&gt;
&lt;li&gt;Created in the Security section in the portal or via T-SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Components of a policy&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;SQL users excluded from masking&lt;/li&gt;
&lt;li&gt;Masking rule&lt;/li&gt;
&lt;li&gt;Masking function&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Masking functions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default&lt;/strong&gt; - &lt;strong&gt;predefined, fully masks a field&lt;/strong&gt;, replaces values with XXXX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt; - &lt;strong&gt;replaces portion&lt;/strong&gt; of address&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credit card&lt;/strong&gt; - replaces everything but &lt;strong&gt;last four digits&lt;/strong&gt; with XXXX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom text&lt;/strong&gt; - replace with &lt;strong&gt;custom string&lt;/strong&gt; (consists of exposed &lt;strong&gt;prefix&lt;/strong&gt;, &lt;strong&gt;padding string&lt;/strong&gt;, and exposed &lt;strong&gt;suffix&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random number&lt;/strong&gt; - replaces values with &lt;strong&gt;randomly generated values&lt;/strong&gt; of the same data type and length (T-SQL function 'random(1,45)' where 1,45 are the low and high ends of the range)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
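&lt;p&gt;As a minimal sketch, a masking rule applied in T-SQL might look like this (table, column, and user names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: mask an email column for nonprivileged users
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Exclude a specific user from masking
GRANT UNMASK TO DataAnalyst;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;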

&lt;p&gt;&lt;strong&gt;Encrypt data at rest and in motion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uses a key&lt;/strong&gt; to encrypt and decrypt data&lt;/li&gt;
&lt;li&gt;Disguises the data through a process of &lt;strong&gt;symmetric encryption&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Encryption key is stored in a secure location such as Key Vault&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption at rest&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Encrypting data in a &lt;strong&gt;physical location&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Azure Storage uses &lt;strong&gt;managed keys&lt;/strong&gt; (customer or Microsoft managed)&lt;/li&gt;
&lt;li&gt;Azure SQL and Synapse SQL use &lt;strong&gt;transparent data encryption (TDE)&lt;/strong&gt; also with service- and customer-managed keys

&lt;ul&gt;
&lt;li&gt;TDE

&lt;ul&gt;
&lt;li&gt;Real-time &lt;strong&gt;I/O encryption and decryption&lt;/strong&gt; of the database, its backups, and transaction log files
&lt;/li&gt;
&lt;li&gt;Prevents malicious party from restoring the backup&lt;/li&gt;
&lt;li&gt;In the portal (Azure SQL) under the Security section, there is a TDE option with an on/off toggle&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always Encrypted&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Protects data in Azure SQL, Azure SQL Managed Instance, and SQL Server databases by encrypting it &lt;strong&gt;inside the client application&lt;/strong&gt; and never revealing the encryption key to the database engine&lt;/li&gt;
&lt;li&gt;Allows separation of people who view data and those who manage data&lt;/li&gt;
&lt;li&gt;Uses two types of keys

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column encryption keys (CEK)&lt;/strong&gt; - used to encrypt data in a column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column master keys (CMK)&lt;/strong&gt; - used to encrypt a CEK&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Supports two types of encryption

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Randomized encryption&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;less predictable&lt;/li&gt;
&lt;li&gt;more secure&lt;/li&gt;
&lt;li&gt;prevents searching, grouping, indexing, and joining on encrypted columns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic encryption&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;always generates the same encryption value for a given plain text value&lt;/li&gt;
&lt;li&gt;has opposite qualities of randomized&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption in motion&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Securing data moving from one network location to another&lt;/li&gt;
&lt;li&gt;Solution is &lt;strong&gt;transport layer security (TLS)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enabled by default in Azure Synapse SQL&lt;/li&gt;
&lt;li&gt;Can be enabled in Azure Storage via Settings and Configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement row-level and column-level security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row-level security&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Restricts records in a table based on user running query&lt;/li&gt;
&lt;li&gt;Not permission based, but &lt;strong&gt;predicate based&lt;/strong&gt; (rows are hidden based on whether predicate condition is true or false)&lt;/li&gt;
&lt;li&gt;Security policy defines users and predicates &lt;strong&gt;(inline table-valued functions)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Policies are created in T-SQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkba3gg9tkkgikegvj881.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkba3gg9tkkgikegvj881.png" alt="Image description" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt5w4wuskc60ncu5408w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt5w4wuskc60ncu5408w.png" alt="Image description" width="524" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RLS best practices&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create separate schema&lt;/strong&gt; for security predicate function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid recursion&lt;/strong&gt; in predicate functions to prevent performance degradation&lt;/li&gt;
&lt;li&gt;Components need to be dropped in a specific order if RLS is no longer used

&lt;ul&gt;
&lt;li&gt;1) Security policy&lt;/li&gt;
&lt;li&gt;2) Table&lt;/li&gt;
&lt;li&gt;3) Function&lt;/li&gt;
&lt;li&gt;4) Schemas&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid excessive table joins&lt;/strong&gt; in predicate function&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Column level security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Controls access to columns based on &lt;strong&gt;user context&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Configured using &lt;strong&gt;GRANT SELECT&lt;/strong&gt; statement and specifying columns and user&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
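&lt;p&gt;The GRANT SELECT approach above can be sketched as follows (table, column, and user names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: expose only non-sensitive columns to a user
GRANT SELECT ON dbo.Customers (CustomerId, FirstName, LastName) TO DataAnalyst;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;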

&lt;p&gt;&lt;strong&gt;Implement Azure role-based access control (RBAC)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authorization for Azure Data Lake is controlled via 

&lt;ul&gt;
&lt;li&gt;Shared key authorization&lt;/li&gt;
&lt;li&gt;Shared access signature (SAS) &lt;/li&gt;
&lt;li&gt;Role-based access control (RBAC)&lt;/li&gt;
&lt;li&gt;Access Control Lists (ACL)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;RBAC

&lt;ul&gt;
&lt;li&gt;Uses role assignment to apply permissions to security principals (users, groups, managed identities, etc)&lt;/li&gt;
&lt;li&gt;Can limit access to files, folders, containers, and accounts&lt;/li&gt;
&lt;li&gt;Roles

&lt;ul&gt;
&lt;li&gt;Storage blob data owner (full container access)&lt;/li&gt;
&lt;li&gt;Storage blob data contributor (read, write, delete)&lt;/li&gt;
&lt;li&gt;Storage blob data reader (read, list)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement POSIX-like access control lists (ACLs) for Data Lake Storage Gen2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACL

&lt;ul&gt;
&lt;li&gt;Holds rules that grant or deny access to certain environments&lt;/li&gt;
&lt;li&gt;RBAC is coarse-grained (rice) vs. ACLs, which are fine-grained (sugar)&lt;/li&gt;
&lt;li&gt;Roles are determined before ACL is applied (if user has RBAC the operation succeeds, if not it falls to ACL)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement a data retention policy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be set on &lt;strong&gt;Azure SQL (long-term retention)&lt;/strong&gt; or &lt;strong&gt;Azure Storage (Lifecycle Management)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term retention&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Automatically &lt;strong&gt;retain backups&lt;/strong&gt; in separate blob container for &lt;strong&gt;up to 10 years&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can be used to recover the database through the portal, CLI, or PowerShell&lt;/li&gt;
&lt;li&gt;Enabled by defining policy with four parameters

&lt;ul&gt;
&lt;li&gt;Weekly (W)&lt;/li&gt;
&lt;li&gt;Monthly (M)&lt;/li&gt;
&lt;li&gt;Yearly (Y)&lt;/li&gt;
&lt;li&gt;Week of the year (WeekofYear)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle Management&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Automated way to &lt;strong&gt;tier down files to cool and archive&lt;/strong&gt; based on modified date

&lt;ul&gt;
&lt;li&gt;Enabled by creating a policy with one or more rules&lt;/li&gt;
&lt;li&gt;Choose from number of days since blob was created, modified, or accessed (can enable access tracking)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
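&lt;p&gt;As a hedged sketch, a Lifecycle Management policy is defined as JSON rules along these lines (rule name and day thresholds are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "rules": [
    {
      "name": "tier-down-old-blobs",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 }
          }
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;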

&lt;p&gt;&lt;strong&gt;Implement secure endpoints (private and public)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Endpoint is an &lt;strong&gt;address exposed by a web app&lt;/strong&gt; to communicate with external entities&lt;/li&gt;
&lt;li&gt;Service Endpoint

&lt;ul&gt;
&lt;li&gt;Secure and direct access to Azure service/resource over the Azure network&lt;/li&gt;
&lt;li&gt;Firewall security feature&lt;/li&gt;
&lt;li&gt;Virtual network rule&lt;/li&gt;
&lt;li&gt;Allows for private IPs, but still uses a public address&lt;/li&gt;
&lt;li&gt;Works on Azure SQL, Synapse, and Storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Private link

&lt;ul&gt;
&lt;li&gt;Carries traffic privately so traffic between virtual network and Azure Service travels through the Microsoft Network&lt;/li&gt;
&lt;li&gt;Uses private address on VNet instead of public address like Service Endpoint&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement resource tokens in Azure Databricks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token is an authentication method that uses a personal access token (PAT) to connect via REST API&lt;/li&gt;
&lt;li&gt;PAT

&lt;ul&gt;
&lt;li&gt;Can be used instead of passwords&lt;/li&gt;
&lt;li&gt;Enabled by default&lt;/li&gt;
&lt;li&gt;Set expiration date or indefinite lifetime&lt;/li&gt;
&lt;li&gt;Disabled, monitored, and revoked by workspace admins&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Create the PAT in the Databricks portal, then use it when setting up the linked service in Azure Synapse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Load a DataFrame with sensitive information&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Done through &lt;strong&gt;encryption using Fernet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fernet&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symmetric authenticated cryptography (uses a secret key)&lt;/li&gt;
&lt;li&gt;from cryptography.fernet import Fernet&lt;/li&gt;
&lt;li&gt;encryptionKey = Fernet.generate_key()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create master key&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp71qrohkjmhsjz90hv9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp71qrohkjmhsjz90hv9m.png" alt="Image description" width="481" height="75"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create UDFs to encrypt/decrypt&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zbjaouyro2cw62ouca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zbjaouyro2cw62ouca.png" alt="Image description" width="578" height="244"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use UDFs to encrypt/decrypt&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn1ywoac8u2sqv4vzcqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn1ywoac8u2sqv4vzcqy.png" alt="Image description" width="601" height="296"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
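
&lt;p&gt;The screenshots above follow the standard Fernet pattern; a runnable sketch of the key generation and encrypt/decrypt functions (outside Spark, with the UDF registration omitted) looks like this:&lt;/p&gt;

```python
from cryptography.fernet import Fernet

# Generate the key once and keep it somewhere safe (e.g. a secret scope)
encryptionKey = Fernet.generate_key()

def encrypt_val(clear_text, key):
    # Function body you would register as the encryption UDF
    return Fernet(key).encrypt(clear_text.encode()).decode()

def decrypt_val(cipher_text, key):
    # Function body you would register as the decryption UDF
    return Fernet(key).decrypt(cipher_text.encode()).decode()

token = encrypt_val("123-45-6789", encryptionKey)   # illustrative value
roundtrip = decrypt_val(token, encryptionKey)
```

&lt;p&gt;In Databricks you would wrap these with pyspark.sql.functions.udf and apply them to the sensitive column with withColumn.&lt;/p&gt;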

&lt;p&gt;&lt;strong&gt;Write encrypted data to tables or Parquet files&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write encrypted data to a table

&lt;ul&gt;
&lt;li&gt;df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("Table")&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Write encrypted data to a parquet file

&lt;ul&gt;
&lt;li&gt;encrypted.write.mode("overwrite").parquet("container_address/file_path")&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manage sensitive information&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-sql/database/data-discovery-and-classification-overview?view=azuresql"&gt;Data discovery and classification&lt;/a&gt; - discovering, classifying, labeling, and reporting the sensitive data in your databases&lt;/li&gt;
&lt;li&gt;Capabilities

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Discovery and recommendations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labeling&lt;/strong&gt; - apply sensitive classification labels to columns using metadata attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query result-set sensitivity&lt;/strong&gt; - calculates the sensitivity of a query result in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; - view DB classification state in a dashboard&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Defender for Cloud

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-native&lt;/strong&gt; application protection platform (CNAPP)&lt;/li&gt;
&lt;li&gt;Set of security measures and practices to protect cloud-based apps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous monitoring, alerts, and threat mitigation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Separate services for Storage and SQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Defender for SQL

&lt;ul&gt;
&lt;li&gt;Discover and mitigate database vulnerabilities&lt;/li&gt;
&lt;li&gt;Alerts on anomalous activities&lt;/li&gt;
&lt;li&gt;Performs vulnerability assessments and Advanced Threat Protection&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Defender for Storage

&lt;ul&gt;
&lt;li&gt;Detects potential threats to storage accounts&lt;/li&gt;
&lt;li&gt;Prevents three major impacts

&lt;ul&gt;
&lt;li&gt;Malicious file uploads&lt;/li&gt;
&lt;li&gt;Sensitive data exfiltration&lt;/li&gt;
&lt;li&gt;Data corruption&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Includes

&lt;ul&gt;
&lt;li&gt;Activity monitoring&lt;/li&gt;
&lt;li&gt;Sensitive data threat detection&lt;/li&gt;
&lt;li&gt;Malware scanning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Manage batches and pipelines</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:13:20 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-manage-batches-and-pipelines-3e66</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-manage-batches-and-pipelines-3e66</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure Batch&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/batch/"&gt;Azure Batch&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Platform to run high-performance computing jobs in parallel at large scale&lt;/li&gt;
&lt;li&gt;Manages cluster of machines and supports autoscaling&lt;/li&gt;
&lt;li&gt;Allows you to &lt;strong&gt;install applications that can run as a job&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Schedule and run jobs on cluster machines&lt;/li&gt;
&lt;li&gt;Pay per minute for resources used&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it works&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Pool = cluster of machines/nodes&lt;/li&gt;
&lt;li&gt;Slot = set of resources used to execute a task&lt;/li&gt;
&lt;li&gt;Define number of slots per node

&lt;ul&gt;
&lt;li&gt;Increase slots per node to improve performance without increasing cost&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Job assigns tasks to slots on nodes&lt;/li&gt;
&lt;li&gt;Application is installed on each node to execute the tasks&lt;/li&gt;
&lt;li&gt;Specify &lt;strong&gt;application packages at pool or task level&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
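
&lt;p&gt;The pool/slot/task relationship can be modeled in a few lines of plain Python (the round-robin placement and the counts are illustrative, not how the Batch scheduler is actually implemented):&lt;/p&gt;

```python
from itertools import cycle

def assign_tasks(nodes, slots_per_node, tasks):
    """Spread tasks across every slot in the pool: a job assigns tasks
    to slots, and each node runs as many tasks as it has slots."""
    slots = [(n, s) for n in range(nodes) for s in range(slots_per_node)]
    assignment = {}
    for slot, task in zip(cycle(slots), tasks):
        assignment.setdefault(slot, []).append(task)
    return assignment

# 2 nodes x 2 slots per node: 4 tasks can run at once, the 5th queues behind t1
plan = assign_tasks(nodes=2, slots_per_node=2, tasks=["t1", "t2", "t3", "t4", "t5"])
```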

&lt;p&gt;&lt;strong&gt;Configure the batch size&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the portal (Batch account)

&lt;ul&gt;
&lt;li&gt;Choose Pools in the left-side panel&lt;/li&gt;
&lt;li&gt;Add a new pool and name it&lt;/li&gt;
&lt;li&gt;Define the OS image (publisher and sku)&lt;/li&gt;
&lt;li&gt;Choose VM size (determines cores and memory)&lt;/li&gt;
&lt;li&gt;Choose fixed or auto scale for nodes

&lt;ul&gt;
&lt;li&gt;If fixed, select number of nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Choose application packages and versions, uploading files if necessary&lt;/li&gt;
&lt;li&gt;Use Mount configuration to mount storage file shares, specifying the account name and access key of the storage account&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trigger batches&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In the portal (Batch)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Confirm that the &lt;strong&gt;pool is in steady state&lt;/strong&gt; and the &lt;strong&gt;nodes are in idle state&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Choose Jobs in the left-side panel and &lt;strong&gt;add a new job&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Name the job and &lt;strong&gt;select the pool&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Open the job and &lt;strong&gt;select Tasks&lt;/strong&gt; in the left-side panel&lt;/li&gt;
&lt;li&gt;Define name and description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enter the command&lt;/strong&gt; in the command line box that will run on each machine

&lt;ul&gt;
&lt;li&gt;Reference installed packages with %AZ_BATCH_APP_PACKAGE_#%&lt;/li&gt;
&lt;li&gt;Reference path to input fileshare with -i S:&amp;lt;file_path&amp;gt;&lt;/li&gt;
&lt;li&gt;Reference path to output with S:&amp;lt;file_path&amp;gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Submit task&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In Azure Data Factory and Azure Synapse&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;To run a single task in ADF&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create linked service to Azure Batch&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Need Batch account name, account endpoint, and primary access key from the Keys section in the Batch portal&lt;/li&gt;
&lt;li&gt;Also need the name of the pool&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create pipeline to run Custom Batch activity&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Select linked service under the Azure Batch option in the activity settings&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define command to execute utility&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enter in the Command box under Settings for the activity&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To run multiple tasks in parallel&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Get list of files using Get Metadata&lt;/strong&gt; activity in the General option

&lt;ul&gt;
&lt;li&gt;Configure data set and linked service with Azure File Storage&lt;/li&gt;
&lt;li&gt;Use the Field list to select Child items&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a ForEach activity to iterate&lt;/strong&gt; through the Child items

&lt;ul&gt;
&lt;li&gt;Use dynamic content in the Command to add the filename for each file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Handle failed batch loads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failure types

&lt;ul&gt;
&lt;li&gt;Infrastructure - pool and node errors&lt;/li&gt;
&lt;li&gt;Application - job and task errors&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pool errors&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resizing failure&lt;/strong&gt; - pool is unable to provision a node within the &lt;strong&gt;resize timeout window&lt;/strong&gt; (default is 15 mins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insufficient quota&lt;/strong&gt; - account has limited number of core quotas, and if allocation exceeds this number then it fails (&lt;strong&gt;raise support ticket to increase quota&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling failures&lt;/strong&gt; - formula is used to determine autoscaling, and formula evaluation can fail (&lt;strong&gt;check logs&lt;/strong&gt; to find issue)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node issues&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App package download failure&lt;/strong&gt; - node set to unusable, &lt;strong&gt;needs to be reimaged&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node OS updates&lt;/strong&gt; - tasks can be interrupted by updates, &lt;strong&gt;auto update can be disabled&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node in unusable state&lt;/strong&gt; - even if the pool is ready, a node can be in an unusable state (VM crash, firewall block, invalid app package) and &lt;strong&gt;needs to be re-imaged&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Node disk is full&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Rebooting and re-imaging can be done in the Batch portal under Pools&lt;/li&gt;
&lt;li&gt;The Connect option in portal allows you to use RDP/SSH to connect to the VM

&lt;ul&gt;
&lt;li&gt;Define user details&lt;/li&gt;
&lt;li&gt;Set as Admin&lt;/li&gt;
&lt;li&gt;Download RDP file and enter user credentials&lt;/li&gt;
&lt;li&gt;This opens a Server Manager window where you can navigate the file system to check application package installations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Validate batch loads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Job errors&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timeout&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max wall clock time defines max time allowed for job to run&lt;/strong&gt; from the time it was created&lt;/li&gt;
&lt;li&gt;Default value is unlimited&lt;/li&gt;
&lt;li&gt;If max is reached, running tasks are killed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Increase max wall clock value to prevent timeout&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure of job-related tasks&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Each job has job-related preparation tasks that run once for the job&lt;/li&gt;
&lt;li&gt;Job prep task runs on each node as soon as job is created&lt;/li&gt;
&lt;li&gt;Job release task runs on each node when job terminates&lt;/li&gt;
&lt;li&gt;Failures can occur in these tasks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task errors&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Task &lt;strong&gt;waiting&lt;/strong&gt; - dependency on another task&lt;/li&gt;
&lt;li&gt;Task &lt;strong&gt;timeout&lt;/strong&gt; - check max wall clock time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing app packages&lt;/strong&gt; or resource files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error in command&lt;/strong&gt; defined in the task&lt;/li&gt;
&lt;li&gt;Check stdout and stderr logs for details&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the Batch portal under node details, you can specify a container where log files are stored for future reference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure batch retention&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retention time defines &lt;strong&gt;how long to keep task directory&lt;/strong&gt; on node once task is complete&lt;/li&gt;
&lt;li&gt;Configure at &lt;strong&gt;Job level or Task level&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Retention time field in advanced settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default is 7 days&lt;/strong&gt; unless removed or deleted&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manage data pipelines in Azure Data Factory or Azure Synapse Pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ways to run pipelines

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debug Run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't need to save&lt;/strong&gt; changes&lt;/li&gt;
&lt;li&gt;Directly run pipelines with draft changes&lt;/li&gt;
&lt;li&gt;Manual, &lt;strong&gt;can't be scheduled&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger Run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need to publish&lt;/strong&gt; changes first&lt;/li&gt;
&lt;li&gt;Only runs published version of pipeline&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can be manual or scheduled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Schedule data pipelines in Data Factory or Azure Synapse Pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger types

&lt;ul&gt;
&lt;li&gt;Scheduled - run on wall-clock schedule&lt;/li&gt;
&lt;li&gt;Tumbling window - run at periodic intervals while maintaining state&lt;/li&gt;
&lt;li&gt;Storage event - run pipeline when a file is uploaded to or deleted from a storage account&lt;/li&gt;
&lt;li&gt;Custom event trigger - runs pipeline when event is raised by Azure Event Grid&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled vs tumbling triggers&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Only supports future-dated loads&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does not maintain state&lt;/strong&gt;, only fire and forget&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tumbling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Can run back-dated&lt;/strong&gt; and future-dated loads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintains state&lt;/strong&gt; (completed loads)&lt;/li&gt;
&lt;li&gt;Passes start and end timestamps of window as parameters&lt;/li&gt;
&lt;li&gt;Can be used to add dependency between pipelines, allowing complex scenarios&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement version control for pipeline artifacts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authoring modes

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live mode (default)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Authoring directly against pipelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No option to save draft changes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Need to publish to save valid changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need manually created ARM templates&lt;/strong&gt; to deploy pipelines to other environments&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git Repo mode&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Repo can be in ADO or GitHub&lt;/li&gt;
&lt;li&gt;All artifacts can be stored in source control&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Draft changes can be saved even if not valid&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autogenerates ARM templates&lt;/strong&gt; for deployment in other environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enables DevOps features&lt;/strong&gt; (PRs, reviews, collab)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manage Spark jobs in a pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline activities for Spark

&lt;ul&gt;
&lt;li&gt;Synapse - Spark notebook, Spark job&lt;/li&gt;
&lt;li&gt;Databricks - notebook, Jar file, Python file&lt;/li&gt;
&lt;li&gt;HDInsight activities - Spark Jar/script&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Monitoring Spark activities

&lt;ul&gt;
&lt;li&gt;Monitoring built in to ADF&lt;/li&gt;
&lt;li&gt;Platform monitoring (Synapse, Databricks)

&lt;ul&gt;
&lt;li&gt;In ADF/Synapse, go to Monitor --&amp;gt; Apache Spark applications and select a specific run for details&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Spark UI&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Develop a stream processing solution</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:11:24 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-develop-a-stream-processing-solution-3h26</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-develop-a-stream-processing-solution-3h26</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Azure services for stream processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is streaming data?

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unbounded data&lt;/strong&gt; (at no point do we have the whole dataset)&lt;/li&gt;
&lt;li&gt;Records can be added at any time&lt;/li&gt;
&lt;li&gt;Queries often over a subset of records called a window&lt;/li&gt;
&lt;li&gt;Used when real-time results are required&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Common use cases

&lt;ul&gt;
&lt;li&gt;Processing IoT data&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;li&gt;Monitoring social media sentiment&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Comparing streaming and traditional databases

&lt;ul&gt;
&lt;li&gt;In traditional queries, the user submits the query to the database engine which runs against the entire dataset and returns a result

&lt;ul&gt;
&lt;li&gt;Data is stored, query is not&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a streaming query, the user submits the query to the streaming engine which applies the query logic to every data point in the stream after that moment and updates the intermediate result

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query is stored, data is not&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Azure Event Hub

&lt;ul&gt;
&lt;li&gt;Stream ingestion service&lt;/li&gt;
&lt;li&gt;Stores and buffers data (producers and consumers can operate at their own speeds)&lt;/li&gt;
&lt;li&gt;Data in storage is persistent and partitioned&lt;/li&gt;
&lt;li&gt;Allows one or more other services to read from the data stream&lt;/li&gt;
&lt;li&gt;Competing consumers (duplicate instances of an application) can access and share the data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Stream Analytics

&lt;ul&gt;
&lt;li&gt;Stream processing service that moves and transforms data between different data inputs and outputs&lt;/li&gt;
&lt;li&gt;Uses a SQL-like language for querying (SELECT INTO output FROM input)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Databricks

&lt;ul&gt;
&lt;li&gt;Query data streams using Spark Structured streaming&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
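
&lt;p&gt;The "query is stored, data is not" idea can be shown with a tiny running aggregate: only the intermediate result survives, never the raw events:&lt;/p&gt;

```python
def streaming_average():
    """A stored 'query' that updates its intermediate result per event."""
    count, total = 0, 0.0
    def on_event(value):
        nonlocal count, total
        count += 1
        total += value
        return total / count   # updated result after each data point
    return on_event

update = streaming_average()
for reading in [10.0, 20.0, 30.0]:   # events arriving over time
    latest = update(reading)
```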

&lt;p&gt;&lt;strong&gt;Create a stream processing solution by using Stream Analytics and Azure Event Hubs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create Azure Event Hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In resource group, &lt;strong&gt;create new Event Hub resource&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Choose namespace name (globally unique)&lt;/li&gt;
&lt;li&gt;Choose pricing tier&lt;/li&gt;
&lt;li&gt;Choose partitions (must choose upfront)&lt;/li&gt;
&lt;li&gt;Create and deploy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Go to Event Hubs Namespace and &lt;strong&gt;create Event Hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Entities --&amp;gt; Event Hubs --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose name and partition count&lt;/li&gt;
&lt;li&gt;Review + create&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Navigate to Event Hub and &lt;strong&gt;add Shared Access Policy&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Settings --&amp;gt; Shared Access Policies --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose policy name and permission (manage, send, listen)&lt;/li&gt;
&lt;li&gt;Open policy and copy Primary Key&lt;/li&gt;
&lt;li&gt;Provide key to data source to allow it to send data to the Event Hub&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create Stream Analytics Job&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a new Stream Analytics resource&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Choose name, region, hosting environment, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Open resource and &lt;strong&gt;add input&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Job topology --&amp;gt; Inputs --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose Event Hub and provide alias&lt;/li&gt;
&lt;li&gt;Choose Event Hub connection details (can use "Select Event Hub from your subscriptions" to autofill)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose output(s)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Job topology --&amp;gt; Outputs --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose output service and alias&lt;/li&gt;
&lt;li&gt;Choose output connection details&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define query&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Job topology --&amp;gt; Query&lt;/li&gt;
&lt;li&gt;Write query with SELECT fields INTO output_alias FROM input_alias&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Once Stream Analytics job is running, check output to confirm data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Process data by using Spark structured streaming&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured Streaming is a stream processing engine built on top of Apache Spark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using Structured Streaming in Databricks&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install Azure Event Hub library&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Cluster info --&amp;gt; Libraries --&amp;gt; Install new --&amp;gt; Specify name and version&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create/import a notebook&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read data stream from event hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Create connectionString using &lt;strong&gt;Primary Key from Event Hub&lt;/strong&gt; and &lt;strong&gt;EntityPath&lt;/strong&gt; for the dataset&lt;/li&gt;
&lt;li&gt;Create a JSON object to store startingEventPosition&lt;/li&gt;
&lt;li&gt;Create a JSON object to store eventHubsConf (includes connectionString (be sure to encrypt), startingPosition, and setMaxEventsPerTrigger)&lt;/li&gt;
&lt;li&gt;Configure Spark parallelism&lt;/li&gt;
&lt;li&gt;Connect to event stream using spark.&lt;strong&gt;readStream.format("eventhubs")&lt;/strong&gt;.options(**eventHubsConf).load()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse and view the data stream&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;eventStreamDF.printSchema() shows properties for Event Hub entries&lt;/li&gt;
&lt;li&gt;Body property contains the data

&lt;ul&gt;
&lt;li&gt;bodyDF = eventStreamDF.&lt;strong&gt;select(col("body")&lt;/strong&gt;.cast("STRING"))&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Use pyspark.sql.types to define the schema with StructType

&lt;ul&gt;
&lt;li&gt;StructField("field_name", StringType(), False)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Parse the body based on the schema

&lt;ul&gt;
&lt;li&gt;parsedDF = bodyDF.select(from_json(col("body"), schema).alias("json"))&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Flatten the parsed json

&lt;ul&gt;
&lt;li&gt;flatDF = parsedDF.select(col("json.field_name").alias("field_alias"))&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write data stream to a delta table&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Write the stream

&lt;ul&gt;
&lt;li&gt;DF.writeStream.format("delta").option("checkpointLocation", "delta-checkpoints/location_name").start("/delta-tables/location_name")&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Create table with %sql

&lt;ul&gt;
&lt;li&gt;CREATE TABLE table USING DELTA LOCATION '/delta-tables/table'&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Query the delta table using SQL

&lt;ul&gt;
&lt;li&gt;SELECT * FROM table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
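
&lt;p&gt;The parse-and-flatten steps can be mimicked without a cluster; the field names below stand in for the placeholders used in the notes, and the dict of types plays the role of the StructType schema:&lt;/p&gt;

```python
import json

# Stand-in for StructType: field name to Python type (illustrative fields)
schema_fields = {"deviceId": str, "temperature": float}

def parse_body(body):
    """Rough equivalent of casting body to STRING, from_json, and
    flattening: keep only schema fields, cast to the declared types."""
    raw = json.loads(body)
    return {name: typ(raw[name]) for name, typ in schema_fields.items() if name in raw}

event = parse_body('{"deviceId": "sensor-1", "temperature": "21.5", "extra": 1}')
```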

&lt;p&gt;&lt;strong&gt;Create windowed aggregates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of window

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tumbling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed window duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contiguous windows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events belong to exactly one window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY name, TumblingWindow(second, 10)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hopping&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed window duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;New window starts at a set interval (e.g. a 10s window every 5s)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows can overlap&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events can belong to multiple windows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY HoppingWindow(second, 10, 5)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed window duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Windows are created when events enter or leave the window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows can overlap and do not have a fixed schedule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events can belong to multiple windows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY SlidingWindow(second, 10)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Window starts when a new event arrives&lt;/li&gt;
&lt;li&gt;Window extends to include new events until a specified amount of time passes with no new events&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window duration can vary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows do not overlap and do not have a fixed schedule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events belong to exactly one window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY SessionWindow(second, 5, 20)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Events that arrive at precisely the same time are windowed together&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows have no duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows do not overlap and do not repeat on a fixed schedule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events belong to exactly one window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY System.Timestamp()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
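
&lt;p&gt;The tumbling vs hopping distinction can be checked numerically: a timestamp maps to exactly one tumbling window but possibly several hopping windows (window sizes in seconds, illustrative):&lt;/p&gt;

```python
def tumbling_window(ts, size):
    """Contiguous fixed-size windows: each event lands in exactly one."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """A new fixed-size window starts every `hop` seconds, so windows
    overlap and one event can land in several."""
    first = ((ts - size) // hop + 1) * hop   # earliest window containing ts
    count = (ts - first) // hop + 1
    return [(s, s + size) for s in (first + k * hop for k in range(count)) if s >= 0]

# Event at t=12s: one 10s tumbling window, two 10s-every-5s hopping windows
one = tumbling_window(12, 10)
many = hopping_windows(12, 10, 5)
```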

&lt;p&gt;&lt;strong&gt;Handle schema drift&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema drift happens when the schema of incoming data changes over time (adding or removing columns, changing data types, etc.)&lt;/li&gt;
&lt;li&gt;Breaking vs non-breaking changes

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Breaking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Removing non-optional field&lt;/li&gt;
&lt;li&gt;Renaming a field&lt;/li&gt;
&lt;li&gt;Changing a field type to be more restrictive (float to int)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-breaking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Removing an optional field&lt;/li&gt;
&lt;li&gt;Adding a field&lt;/li&gt;
&lt;li&gt;Changing a field type to be less restrictive (int to float)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Limiting impact

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Select only necessary fields&lt;/strong&gt; as early in a query as possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input validation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Performed in the stream consumer&lt;/li&gt;
&lt;li&gt;One query selects and casts required fields and determines if record is valid, sending results to intermediate stream&lt;/li&gt;
&lt;li&gt;Second query processes valid records from the intermediate stream (main query)&lt;/li&gt;
&lt;li&gt;Third query processes invalid records from the intermediate stream&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Event Hub Schema Registry&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Store AVRO or JSON schema definitions in the Event Hub&lt;/li&gt;
&lt;li&gt;When event sender uses this schema, all events are validated against it&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
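
&lt;p&gt;The three-query validation pattern can be sketched as routing logic; the required-field contract here is illustrative:&lt;/p&gt;

```python
REQUIRED = {"id": int, "value": float}   # illustrative contract

def validate(record):
    """First-query logic: cast required fields and flag the record, so the
    main query and the error query can split the intermediate stream."""
    try:
        casted = {name: typ(record[name]) for name, typ in REQUIRED.items()}
        return {**casted, "valid": True}
    except (KeyError, TypeError, ValueError):
        return {**record, "valid": False}

stream = [{"id": "1", "value": "2.5"}, {"value": "oops"}]
intermediate = [validate(r) for r in stream]
valid = [r for r in intermediate if r["valid"]]        # main query input
invalid = [r for r in intermediate if not r["valid"]]  # error handling input
```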

&lt;p&gt;&lt;strong&gt;Process time series data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is time series data?

&lt;ul&gt;
&lt;li&gt;Sequence of data points ordered by time of occurrence&lt;/li&gt;
&lt;li&gt;Repeated measurements of the same source

&lt;ul&gt;
&lt;li&gt;At a fixed interval (constant load)&lt;/li&gt;
&lt;li&gt;When the value changes (varying load)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Queried over subsets of data defined by start and end time (window)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Defining time

&lt;ul&gt;
&lt;li&gt;Event time = time measurement is taken&lt;/li&gt;
&lt;li&gt;Processing time = time measurement is received by processing solution&lt;/li&gt;
&lt;li&gt;Difference can be up to minutes depending on latency&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;When processing in Stream Analytics

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Default is processing time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can override with TIMESTAMP BY&lt;/strong&gt; to use event time if available&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Temporal query windows

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp is evaluated against start and end time of windows&lt;/strong&gt; to determine which window it belongs to&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
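&lt;p&gt;As a sketch of the above, a Stream Analytics query can switch from processing time to event time with TIMESTAMP BY and then group events into temporal windows (the input/output aliases and the EventTime column are illustrative):&lt;/p&gt;

```sql
-- Use the event's own timestamp instead of arrival (processing) time
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [output-sink]
FROM [input-hub] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```

&lt;p&gt;Each event's timestamp is evaluated against the 5-minute window boundaries to decide which window it belongs to.&lt;/p&gt;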

&lt;p&gt;&lt;strong&gt;Process within one partition&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A partitioned database or stream:

&lt;ul&gt;
&lt;li&gt;Is a single logical database/stream&lt;/li&gt;
&lt;li&gt;Has multiple underlying storage/processing units&lt;/li&gt;
&lt;li&gt;Has virtually limitless scaling&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Event Hub and Stream Analytics can both be partitioned

&lt;ul&gt;
&lt;li&gt;Event Hubs are partitioned at creation&lt;/li&gt;
&lt;li&gt;Stream Analytics is partitioned in the query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Partitions and computing nodes

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A single node can process many partitions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A single partition can NOT be split over multiple nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;When an Event Hub or Stream Analytics Job scales, partitions are redistributed over nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpartitioned queries&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Cannot calculate results using data from only one partition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cannot leverage scale-out architecture&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Utilize at most 1 SU V2 (6 SU V1)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Query must be partitioned before processing to increase performance&lt;/li&gt;
&lt;li&gt;Partitioning a query can be done by grouping stream data by the desired partition key in a preceding query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
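&lt;p&gt;Repartitioning an otherwise unpartitioned query with a preceding step might look like this sketch (names are illustrative; an optional INTO n clause can set the partition count):&lt;/p&gt;

```sql
-- Step 1: repartition the stream by the desired key
WITH RepartitionedInput AS (
    SELECT *
    FROM [input-hub]
    PARTITION BY DeviceId
)
-- Step 2: the main query can now run per partition and scale out
SELECT DeviceId, AVG(Temperature) AS AvgTemp
INTO [output-sink]
FROM RepartitionedInput
GROUP BY DeviceId, TumblingWindow(minute, 1)
```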

&lt;p&gt;&lt;strong&gt;Process data across partitions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specifying the partition key

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event Hubs are always partitioned&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Round robin by default&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Partition key is called PartitionId&lt;/li&gt;
&lt;li&gt;Custom property can be specified to calculate a PartitionId&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Stream Analytics can be partitioned

&lt;ul&gt;
&lt;li&gt;Results in parallelizable queries&lt;/li&gt;
&lt;li&gt;Enables scale-out&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility level&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Property of a Stream Analytics Job&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PARTITION BY is used for level 1.1 or lower&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For level 1.2, specify the partition key on the input&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Partition key is specified differently for each type of input/output

&lt;ul&gt;
&lt;li&gt;For blob storage, the partition key is a part of the path&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarrassingly parallel&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Query can be processed completely in parallel&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All inputs, outputs, and queries are partitioned on the same key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the Stream Analytics portal, Job topology --&amp;gt; Query --&amp;gt; Job simulation (preview) can show whether a query is parallel and how it can be partitioned&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
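&lt;p&gt;For compatibility level 1.1 or lower, the partitioning is stated explicitly in the query; a minimal sketch (input/output names are illustrative):&lt;/p&gt;

```sql
-- Compatibility level 1.1 or lower: partition explicitly with PARTITION BY
SELECT PartitionId, COUNT(*) AS EventCount
INTO [partitioned-output]
FROM [partitioned-input] PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)
-- At level 1.2, drop PARTITION BY and set the partition key
-- on the input configuration instead
```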

&lt;p&gt;&lt;strong&gt;Configure checkpoints and watermarking during processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Event Hub and ingestion services don't track which records have been consumed; they just apply sequence numbers to the records&lt;/li&gt;
&lt;li&gt;When processing a data stream, &lt;strong&gt;processors take note of where they are in the stream (checkpoint)&lt;/strong&gt; so they can resume in case of interruption&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream stores sequence numbers per partition&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checkpoint used to resume a stream is called an offset&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream Analytics backs up the internal state regularly

&lt;ul&gt;
&lt;li&gt;Intermediate results are saved&lt;/li&gt;
&lt;li&gt;Checkpoint is saved&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Catching up from a restore can take some time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermark&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Internal marker indicating up to what point in time events are assumed to have been processed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Updated when a new event comes in or advances as time progresses in the real world&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Used to identify late events&lt;/li&gt;
&lt;li&gt;Used to detect opening and closing of a query window&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scale resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event Hub and Stream Analytics &lt;strong&gt;pricing is based on resources provisioned&lt;/strong&gt;, not necessarily used&lt;/li&gt;
&lt;li&gt;Provision as few resources as possible to save cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Azure Event Hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Measured in &lt;strong&gt;Throughput Units (TU)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1 TU provides

&lt;ul&gt;
&lt;li&gt;Ingress up to 1 MB per second or 1000 events per second&lt;/li&gt;
&lt;li&gt;Egress up to 2 MB per second or 4096 events per second&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable auto-inflate to prevent over-provisioning&lt;/strong&gt; (similar to auto-scaling)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Stream Analytics&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Measured in &lt;strong&gt;Streaming Units (SU)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;There are two versions, V1 and V2

&lt;ul&gt;
&lt;li&gt;1 V1 SU = ~1 MB/s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;6 V1 SU = ~1 V2 SU&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;SA jobs

&lt;ul&gt;
&lt;li&gt;Use fine-grained deployment units&lt;/li&gt;
&lt;li&gt;Run on shared hardware&lt;/li&gt;
&lt;li&gt;Limit scalability to &lt;strong&gt;minimum of 1/3 V2 SUs&lt;/strong&gt; and &lt;strong&gt;max of 66 V2 SUs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Support virtual network integration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/stream-analytics/cluster-overview"&gt;&lt;strong&gt;SA Clusters&lt;/strong&gt;&lt;/a&gt; 

&lt;ul&gt;
&lt;li&gt;Scale further and provide more isolation&lt;/li&gt;
&lt;li&gt;Fully isolated deployment&lt;/li&gt;
&lt;li&gt;Scalability has &lt;strong&gt;minimum of 12 V2 SUs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Supports virtual network integration&lt;/li&gt;
&lt;li&gt;Jobs can be moved in and out of a cluster&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Create tests for data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stream Analytics allows sampling data from an input by &lt;strong&gt;downloading a file&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It also &lt;strong&gt;samples data from an Event Hub automatically&lt;/strong&gt; when editing a query&lt;/li&gt;
&lt;li&gt;The Query section has a &lt;strong&gt;Test results option&lt;/strong&gt; as well&lt;/li&gt;
&lt;li&gt;The Query editor also has an option for uploading &lt;strong&gt;sample input data to test changes in the results&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimize pipelines for analytical or transactional purposes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams can be joined with a time-bound condition such as DATEDIFF(second, S1, S2) BETWEEN 0 AND 30, combined with equality on the chosen join properties&lt;/li&gt;
&lt;li&gt;This works best when the streams have the same partition key and partition counts

&lt;ul&gt;
&lt;li&gt;Repartition so that they are partitioned the same way using a preceding query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
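&lt;p&gt;A stream-to-stream join of this kind might be sketched as follows (stream names and columns are illustrative):&lt;/p&gt;

```sql
-- Join two streams on a shared key within a 30-second window
SELECT S1.DeviceId, S1.Reading, S2.Status
INTO [output-sink]
FROM [stream-one] S1
JOIN [stream-two] S2
    ON S1.DeviceId = S2.DeviceId
    AND DATEDIFF(second, S1, S2) BETWEEN 0 AND 30
```

&lt;p&gt;The join parallelizes best when both streams are partitioned on DeviceId with the same partition counts.&lt;/p&gt;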

&lt;p&gt;&lt;strong&gt;Handle late-arriving data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is data late?

&lt;ul&gt;
&lt;li&gt;Network delays, especially with IoT&lt;/li&gt;
&lt;li&gt;Pipeline congestion - ingestion load is higher than possible throughput&lt;/li&gt;
&lt;li&gt;Outages in gateway devices&lt;/li&gt;
&lt;li&gt;Producers that have specific windows of time for output&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late data tolerance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configured per Stream Analytics job&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consequences&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Late data is included&lt;/strong&gt; in results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window results are delayed&lt;/strong&gt; as the job has to wait for late data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy for data still late beyond the tolerance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop to just ignore&lt;/strong&gt; the record&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjust to update&lt;/strong&gt; the record timestamp (can introduce time skews)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal (Stream Analytics)

&lt;ul&gt;
&lt;li&gt;Settings --&amp;gt; Event ordering&lt;/li&gt;
&lt;li&gt;Can only be done when the job is not running&lt;/li&gt;
&lt;li&gt;Choose late arriving window, out of order settings, and whether to drop or adjust&lt;/li&gt;
&lt;li&gt;Restart job&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Handle interruptions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SLAs&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Microsoft SLA is &lt;strong&gt;99.9% or higher based on service tier&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;SLA for &lt;strong&gt;Stream Analytics&lt;/strong&gt; is that the job is running &lt;strong&gt;99.9%&lt;/strong&gt; of the time&lt;/li&gt;
&lt;li&gt;Catch-up time is the delay that follows a service interruption&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event replication pattern&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;If the SLAs aren't high enough, can increase reliability with ERP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate Event Hub and all downstream infrastructure&lt;/strong&gt;, processing all events in parallel in multiple regions&lt;/li&gt;
&lt;li&gt;Only works if 

&lt;ul&gt;
&lt;li&gt;Pipelines have independent failure conditions&lt;/li&gt;
&lt;li&gt;End application can correctly choose data source&lt;/li&gt;
&lt;li&gt;Event generator is not the bottleneck&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure exception handling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output data error handling policy&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Defines how Stream Analytics should proceed when it fails to write to an output&lt;/li&gt;
&lt;li&gt;Allows for two values

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop&lt;/strong&gt; - record will be ignored and never written to output (better for speed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry&lt;/strong&gt; - keep attempting to write until success or another error (better for correctness)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Configured in the portal in the Settings --&amp;gt; Error policy section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Upsert data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replay archived stream data&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Develop a batch processing solution</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:09:52 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-develop-a-batch-processing-solution-4ehi</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-develop-a-batch-processing-solution-4ehi</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Develop batch processing solutions by using Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services for each layer in a batch processing architecture

&lt;ul&gt;
&lt;li&gt;Ingestion: Data Factory&lt;/li&gt;
&lt;li&gt;Storage: Blob Storage, ADLS Gen2, Cosmos DB&lt;/li&gt;
&lt;li&gt;Processing: Databricks, HDInsight, Data Flows&lt;/li&gt;
&lt;li&gt;Serving: Azure SQL, Dedicated SQL, Analysis Services&lt;/li&gt;
&lt;li&gt;Orchestration: Data Factory (or Synapse)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Synapse Analytics

&lt;ul&gt;
&lt;li&gt;Group of multiple, well-integrated services&lt;/li&gt;
&lt;li&gt;Works across all layers of architecture&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use PolyBase to load data to a SQL pool&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dedicated SQL Pool&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview

&lt;ul&gt;
&lt;li&gt;Formerly known as Azure &lt;strong&gt;SQL Data Warehouse&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Available as standalone service and within Synapse&lt;/li&gt;
&lt;li&gt;Like a SQL Server Database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massively parallel processing&lt;/strong&gt; (MPP) architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elastically scale compute and storage&lt;/strong&gt; separately&lt;/li&gt;
&lt;li&gt;Pause or resume service to save cost&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Components

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Basic unit of storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fixed 60 distributions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Queries executed against each distribution in parallel&lt;/li&gt;
&lt;li&gt;Stored in Azure Storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control node&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;SQL Server endpoint&lt;/li&gt;
&lt;li&gt;Queries go to control node&lt;/li&gt;
&lt;li&gt;Only stores metadata&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Coordinates query execution with compute nodes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute nodes&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Execute queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Max 60 compute nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributions equally divided&lt;/strong&gt; among compute nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Movement Service (DMS)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coordinates movement of data between compute nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;For some queries (joins, group by) data needs to be co-located&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehousing Units&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DWU = CPU + memory + I/O&lt;/li&gt;
&lt;li&gt;Represents computational power&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can be increased or decreased to enable scaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Paid for per hour (&lt;strong&gt;lower to reduce costs&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Features&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Most regular SQL features are supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDL and DML&lt;/strong&gt; statements and &lt;strong&gt;Dynamic SQL&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic management views&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triggers and cross-database queries are not supported&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Constraints, identity columns, and relationships work differently than SQL Server&lt;/li&gt;
&lt;li&gt;Can be used in both the compute and serving layer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Polybase&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read and write data in external storage&lt;/strong&gt; using T-SQL&lt;/li&gt;
&lt;li&gt;Available in SQL Server and Synapse&lt;/li&gt;
&lt;li&gt;Supports delimited text, parquet, ORC, GZIP, and SNAPPY compressed files&lt;/li&gt;
&lt;li&gt;Control node passes storage location to compute nodes, which read the data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Components&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Database &lt;strong&gt;Scoped Credential&lt;/strong&gt; = &lt;strong&gt;access&lt;/strong&gt; storage account&lt;/li&gt;
&lt;li&gt;External &lt;strong&gt;Data Source&lt;/strong&gt; = define the storage &lt;strong&gt;location&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;External &lt;strong&gt;File Format&lt;/strong&gt; = &lt;strong&gt;format&lt;/strong&gt; of the file being read&lt;/li&gt;
&lt;li&gt;External &lt;strong&gt;Table&lt;/strong&gt; = &lt;strong&gt;metadata&lt;/strong&gt; of underlying file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
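&lt;p&gt;The four PolyBase components above map to four T-SQL statements; a minimal sketch, assuming a Parquet file in ADLS Gen2 (all names, paths, and columns are illustrative):&lt;/p&gt;

```sql
-- 1. Scoped credential: access to the storage account
CREATE DATABASE SCOPED CREDENTIAL StorageCred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

-- 2. External data source: where the data lives
CREATE EXTERNAL DATA SOURCE MyLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://container@account.dfs.core.windows.net',
      CREDENTIAL = StorageCred);

-- 3. External file format: how the files are encoded
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- 4. External table: metadata over the underlying files
CREATE EXTERNAL TABLE dbo.ExtSales (
    SaleId INT,
    Amount DECIMAL(10, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = MyLake,
      FILE_FORMAT = ParquetFormat);

-- Load into the pool with CTAS
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.ExtSales;
```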

&lt;p&gt;&lt;strong&gt;Implement Azure Synapse Link and query the replicated data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse Link&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-native implementation of HTAP&lt;/li&gt;
&lt;li&gt;Hybrid transactional and analytical processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directly query data in operational stores&lt;/strong&gt;, no ETL required&lt;/li&gt;
&lt;li&gt;Near real-time querying&lt;/li&gt;
&lt;li&gt;Supports Cosmos DB, Azure SQL, Dataverse&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Cosmos DB

&lt;ul&gt;
&lt;li&gt;Fully managed NoSQL platform&lt;/li&gt;
&lt;li&gt;Supports MongoDB, Table, Cassandra, and Gremlin&lt;/li&gt;
&lt;li&gt;Global distribution - data can be replicated to multiple regions&lt;/li&gt;
&lt;li&gt;Elastic scalability&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synapse Link for Cosmos DB&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transactional store is synced to analytical&lt;/strong&gt; store from which Synapse can read data&lt;/li&gt;
&lt;li&gt;No performance impact on the transactional store&lt;/li&gt;
&lt;li&gt;Analytical store &lt;strong&gt;auto-syncs every 2 mins&lt;/strong&gt; (max 5 mins)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Only accessible from Synapse&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Only charged for storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supports change data capture and time travel&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;In Cosmos DB account, see Azure Synapse Link under Integrations on the left-side panel&lt;/li&gt;
&lt;li&gt;Enable Synapse Link&lt;/li&gt;
&lt;li&gt;Create the container, setting Analytical Store to On&lt;/li&gt;
&lt;li&gt;To connect in Synapse Link, get primary account key from the Keys under Settings in the left-side panel&lt;/li&gt;
&lt;li&gt;In Synapse workspace, go to Data and setup linked service and data source for Cosmos DB&lt;/li&gt;
&lt;li&gt;Open a SQL script to query the data in Cosmos DB&lt;/li&gt;
&lt;li&gt;Create a credential with the primary key&lt;/li&gt;
&lt;li&gt;Use OPENROWSET to query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
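&lt;p&gt;The final query step above might look like this sketch, using the positional OPENROWSET form for the Cosmos DB analytical store (account, database, container, and key are illustrative placeholders):&lt;/p&gt;

```sql
-- Query the Cosmos DB analytical store from Synapse serverless SQL
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=myaccount;Database=mydb;Key=<primary-account-key>',
    MyContainer
) AS rows;
```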

&lt;p&gt;&lt;strong&gt;Create data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General steps

&lt;ul&gt;
&lt;li&gt;Configure firewall to allow IP address and Azure Services to connect to data sources and sinks&lt;/li&gt;
&lt;li&gt;Create an ADF/Synapse instance&lt;/li&gt;
&lt;li&gt;Create a linked service to the source data&lt;/li&gt;
&lt;li&gt;Create a new dataset from the data in the linked service&lt;/li&gt;
&lt;li&gt;Create a Data Flow

&lt;ul&gt;
&lt;li&gt;Select data source&lt;/li&gt;
&lt;li&gt;Choose transformation steps (join, group, conditional split, etc)&lt;/li&gt;
&lt;li&gt;Select sink&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Create a new Pipeline&lt;/li&gt;
&lt;li&gt;Choose a Copy activity and/or the Data Flow&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scale resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of scaling

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vertical&lt;/strong&gt; scaling (up/down) = add more resources to a machine to make it more powerful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal&lt;/strong&gt; scaling (in/out) = add more machines&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Azure SQL&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Supports both up and out depending on config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During up/down&lt;/strong&gt;, the following can be changed

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service tier&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DTU model: basic, standard, and premium&lt;/li&gt;
&lt;li&gt;vCore model: general purpose, hyperscale, business critical&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute tier&lt;/strong&gt; (vCore): provisioned or serverless&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; (CPU, RAM, storage, etc)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling up/down results in database restart&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;To scale out, can only add &lt;strong&gt;up to 4 read-only replicas&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the portal (Azure SQL database)

&lt;ul&gt;
&lt;li&gt;Go to Compute + storage&lt;/li&gt;
&lt;li&gt;Select an option in Service tier&lt;/li&gt;
&lt;li&gt;Choose Compute tier&lt;/li&gt;
&lt;li&gt;Use sliders to select vCores, DTUs, Read scale-out, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Dedicated SQL Pool&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Increase/decrease number of compute nodes and memory on each node&lt;/li&gt;
&lt;li&gt;Defined using DWUs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
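&lt;p&gt;Scaling a dedicated SQL pool can also be done in T-SQL by changing the service objective (pool name and DWU level are illustrative; run from the master database):&lt;/p&gt;

```sql
-- Scale the dedicated SQL pool to 200 cDWUs
ALTER DATABASE MySqlPool
MODIFY (SERVICE_OBJECTIVE = 'DW200c');
```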

&lt;p&gt;&lt;strong&gt;Create tests for data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing pipelines is different from testing applications because we're testing data instead of code&lt;/li&gt;
&lt;li&gt;Automated testing involves automating the process of validating if pipeline is providing expected output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types of tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Test individual units &lt;/li&gt;
&lt;li&gt;In data pipeline, &lt;strong&gt;run each activity individually&lt;/strong&gt; and validate result&lt;/li&gt;
&lt;li&gt;Hard to do in ADF&lt;/li&gt;
&lt;li&gt;Programmatically enable one activity at a time and disable others&lt;/li&gt;
&lt;li&gt;Generate and use fake data to test edge cases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Have pipeline generate actual output and compare to expected output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run complete pipeline&lt;/strong&gt;, not just individual activities&lt;/li&gt;
&lt;li&gt;Used to confirm that pipeline meets business requirements&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance and regression tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Regression tests ensure that change in one pipeline doesn't impact other pipelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run multiple dependent pipelines together&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Performance test to ensure pipeline meets SLAs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data quality tests

&lt;ul&gt;
&lt;li&gt;Verify if data meets quality standards&lt;/li&gt;
&lt;li&gt;Typically embedded as part of the pipeline&lt;/li&gt;
&lt;li&gt;Completeness&lt;/li&gt;
&lt;li&gt;Uniqueness&lt;/li&gt;
&lt;li&gt;Timeliness&lt;/li&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
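&lt;p&gt;Data quality checks like the uniqueness and completeness items above can be embedded as plain queries in the pipeline; a sketch (table and column names are illustrative):&lt;/p&gt;

```sql
-- Uniqueness: any rows returned means duplicated business keys
SELECT CustomerId, COUNT(*) AS Duplicates
FROM dbo.Customers
GROUP BY CustomerId
HAVING COUNT(*) > 1;

-- Completeness: count rows missing a required field
SELECT COUNT(*) AS MissingEmail
FROM dbo.Customers
WHERE Email IS NULL;
```

&lt;p&gt;A validation activity can fail the pipeline when either query returns a nonzero result.&lt;/p&gt;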

&lt;p&gt;&lt;strong&gt;Integrate Jupyter or Python notebooks into a data pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notebooks are typically used for Spark apps and development&lt;/li&gt;
&lt;li&gt;Notebooks are supported natively in services like Databricks and Synapse&lt;/li&gt;
&lt;li&gt;Basic steps for Synapse

&lt;ul&gt;
&lt;li&gt;Create Synapse Spark pool&lt;/li&gt;
&lt;li&gt;Create new notebook and define language&lt;/li&gt;
&lt;li&gt;Attach notebook to Spark pool&lt;/li&gt;
&lt;li&gt;Write code to read and process data&lt;/li&gt;
&lt;li&gt;Add parameters to notebook&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;To invoke notebook in ADF

&lt;ul&gt;
&lt;li&gt;Create linked service to Synapse (under compute, not storage)&lt;/li&gt;
&lt;li&gt;Make sure ADF has manage permissions for Synapse Spark and access to storage&lt;/li&gt;
&lt;li&gt;Create pipeline and add notebook activity&lt;/li&gt;
&lt;li&gt;Select notebook and parameters&lt;/li&gt;
&lt;li&gt;Run pipeline&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Mapping Data Flows in Azure Synapse pipelines and Azure Data Factory pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping Data Flows provides &lt;strong&gt;no-code ETL workflow&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can apply transformations to source data

&lt;ul&gt;
&lt;li&gt;Add/remove columns, rename, filter, join, aggregate&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translated into Spark code&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Automatically adds optimizations&lt;/li&gt;
&lt;li&gt;Can add user-defined optimizations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executes on a Spark cluster&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Called Data Flow Debug&lt;/li&gt;
&lt;li&gt;Can define cluster configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Pros and Cons

&lt;ul&gt;
&lt;li&gt;Pros

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster development&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UI based drag-and-drop approach&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast and scalable processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Cons

&lt;ul&gt;
&lt;li&gt;Less flexible since code can't be modified&lt;/li&gt;
&lt;li&gt;Can be complex for large workflows&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Upsert data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DML statements

&lt;ul&gt;
&lt;li&gt;Select, insert, update, delete&lt;/li&gt;
&lt;li&gt;Upsert is combo of update and insert - update if exists, insert if not&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Options to change data in Azure SQL

&lt;ul&gt;
&lt;li&gt;Using T-SQL (DML statements, merge command)&lt;/li&gt;
&lt;li&gt;Data Factory/Synapse pipelines (copy, data flow with Alter Row)&lt;/li&gt;
&lt;li&gt;Can upsert on files in Data Lake using Delta Lake&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Options to perform upsert&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T-SQL "merge" command&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Specify source with "USING"&lt;/li&gt;
&lt;li&gt;Specify join condition&lt;/li&gt;
&lt;li&gt;"WHEN MATCHED" = behavior for existing records&lt;/li&gt;
&lt;li&gt;"WHEN NOT MATCHED BY TARGET" = behavior for records not in target&lt;/li&gt;
&lt;li&gt;"WHEN NOT MATCHED BY SOURCE" = behavior for records not in source&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy activity&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Change write behavior in sink to upsert and define key columns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data flows&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;alter row&lt;/strong&gt; transformation&lt;/li&gt;
&lt;li&gt;Define alter row conditions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
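&lt;p&gt;The T-SQL MERGE pattern above might be sketched as follows (table and column names are illustrative):&lt;/p&gt;

```sql
MERGE dbo.Customers AS t                  -- target
USING staging.Customers AS s              -- source, via USING
    ON t.CustomerId = s.CustomerId        -- join condition
WHEN MATCHED THEN                         -- existing records: update
    UPDATE SET t.Name = s.Name, t.Email = s.Email
WHEN NOT MATCHED BY TARGET THEN           -- new records: insert
    INSERT (CustomerId, Name, Email)
    VALUES (s.CustomerId, s.Name, s.Email)
WHEN NOT MATCHED BY SOURCE THEN           -- records gone from source
    DELETE;
```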

&lt;p&gt;&lt;strong&gt;Revert data to a previous state in Azure storage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restorable entities

&lt;ul&gt;
&lt;li&gt;Individual file (blob) - can revert to previous version or undelete&lt;/li&gt;
&lt;li&gt;Container - container and files can be reverted or undeleted&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Restoring &lt;strong&gt;files&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use snapshot&lt;/strong&gt; (&lt;strong&gt;read-only&lt;/strong&gt; version of file from point in time)

&lt;ul&gt;
&lt;li&gt;Created manually by user or application&lt;/li&gt;
&lt;li&gt;Used to restore back to prior version&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable versioning&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;Enabled at &lt;strong&gt;storage account level&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Auto creates snapshots when file is updated&lt;/li&gt;
&lt;li&gt;Select and restore a specific version&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable soft delete&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enabled at &lt;strong&gt;storage account level&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deleted files can be restored for a certain number of days&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Restoring &lt;strong&gt;containers&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enable point-in-time restore&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Restores container to specific point in time&lt;/li&gt;
&lt;li&gt;Enabled at &lt;strong&gt;storage account level&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning, change feed, and soft delete must also be enabled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Enable soft delete

&lt;ul&gt;
&lt;li&gt;Enabled at storage account level&lt;/li&gt;
&lt;li&gt;Deleted containers can be restored for a certain number of days&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the storage account portal, these options are under Data management --&amp;gt; Data protection in the left-side panel&lt;/li&gt;
&lt;li&gt;File versions and snapshots can be viewed in blob properties by clicking on the file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Revert data to a previous state in Azure SQL and Dedicated SQL Pool&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL backup&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatically creates backups&lt;/strong&gt; based on SQL Server technology

&lt;ul&gt;
&lt;li&gt;Full backups every week&lt;/li&gt;
&lt;li&gt;Differential backups every 12 to 24 hours&lt;/li&gt;
&lt;li&gt;Transaction log backups every 10 mins&lt;/li&gt;
&lt;li&gt;Backups are stored in Azure Storage&lt;/li&gt;
&lt;li&gt;Redundancy is configurable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Point-in-time restore (auto)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Auto-created backup&lt;/li&gt;
&lt;li&gt;Kept for limited days (1 to 35, &lt;strong&gt;default is 7&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term retention (not auto)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Define policy to keep backups longer&lt;/li&gt;
&lt;li&gt;Configure weekly, monthly, yearly backups and keep &lt;strong&gt;up to 10 years&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL restore&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Restore using PITR or LTR&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;For PITR restore, the service identifies which backups to use&lt;/li&gt;
&lt;li&gt;For LTR, database can be restored in same or different region&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Restore deleted database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restore creates a new database&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use to update or replace existing database&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the Azure SQL Server portal

&lt;ul&gt;
&lt;li&gt;Data management --&amp;gt; Backups to view restore point details and retention policies&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedicated SQL backup and restore&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local backup&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Dedicated SQL automatically creates snapshots used as restore points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Up to 42&lt;/strong&gt; user-defined restore points can be created&lt;/li&gt;
&lt;li&gt;Restore points are &lt;strong&gt;retained for 7 days&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo backup&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Created every 24 hours and stored in a different region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only latest backup&lt;/strong&gt; is retained&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Restore database in any region using restore points

&lt;ul&gt;
&lt;li&gt;Restore creates a new database that updates or replaces existing one&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
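&lt;p&gt;The PITR chain selection above (latest full backup, then the latest differential after it, then every log backup up to the restore point) can be sketched in plain Python. The schedule and names below are illustrative only; Azure does this selection internally:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def pick_backups(backups, restore_point):
    """Pick the backup chain for a point-in-time restore.

    `backups` is a list of (kind, taken_at) tuples, kind being 'full',
    'diff', or 'log'. Illustrative sketch, not the Azure API.
    """
    # Latest full backup taken no later than the restore point.
    full = max(t for k, t in backups if k == 'full' and not t > restore_point)
    # Latest differential after that full, still no later than the restore point.
    diffs = [t for k, t in backups
             if k == 'diff' and t > full and not t > restore_point]
    diff = max(diffs) if diffs else None
    base = diff or full
    # Every transaction-log backup after the base, up to the restore point.
    logs = sorted(t for k, t in backups
                  if k == 'log' and t > base and not t > restore_point)
    return full, diff, logs

start = datetime(2023, 12, 3)  # hypothetical weekly full backup
backups = [('full', start), ('diff', start + timedelta(hours=12))]
backups += [('log', start + timedelta(hours=12, minutes=10 * i))
            for i in range(1, 7)]

full, diff, logs = pick_backups(backups, start + timedelta(hours=13))
```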

&lt;p&gt;&lt;strong&gt;Configure exception handling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a &lt;strong&gt;single activity&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try/catch block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;When one activity fails, a second activity runs that performs action based on failure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try/catch/proceed block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Last activity (proceed) runs whether the first activity succeeds or fails, because a skip path covers the case where the middle (catch) activity is skipped&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If/else block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;One path for success, different path for failure&lt;/li&gt;
&lt;li&gt;Pipeline succeeds if the first activity does and fails otherwise&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If/skip/else block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline succeeds whether first activity succeeds or fails because a failure causes a skip to other activities&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;multiple activities&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sequential run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Activities are sequential &lt;/li&gt;
&lt;li&gt;One or more activities are configured to run on failure or skip of previous activity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pipeline continues regardless of upstream failure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Some activities are parallel&lt;/li&gt;
&lt;li&gt;Downstream activity &lt;strong&gt;depends on success of all parallel activities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Further downstream activity can be configured to run after skip so pipeline continues even if parallel activities fail&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
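&lt;p&gt;The dependency conditions behind these blocks (run on success, failure, skip) can be sketched as a small evaluator. This is an illustrative model of ADF-style conditions, not the real engine:&lt;/p&gt;

```python
def run_pipeline(activities, deps, outcomes):
    """Evaluate ADF-style dependency conditions.

    `deps` maps an activity to a list of (upstream, allowed_conditions).
    `outcomes` pre-seeds what a leaf activity would do ('Succeeded' or
    'Failed'). A downstream activity runs only when every upstream
    finished with an allowed condition; otherwise it is 'Skipped'.
    """
    result = {}
    for act in activities:  # assumed to be in dependency order
        ok = all(result[u] in conds for u, conds in deps.get(act, []))
        result[act] = outcomes.get(act, 'Succeeded') if ok else 'Skipped'
    return result

# Try/catch/proceed: catch runs on failure; proceed runs on success OR skip.
deps = {
    'catch':   [('try', ['Failed'])],
    'proceed': [('catch', ['Succeeded', 'Skipped'])],
}
happy = run_pipeline(['try', 'catch', 'proceed'], deps, {'try': 'Succeeded'})
sad = run_pipeline(['try', 'catch', 'proceed'], deps, {'try': 'Failed'})
```

In the happy path the catch is skipped but proceed still runs; in the sad path the catch runs and proceed follows it.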

&lt;p&gt;&lt;strong&gt;Read from and write to a delta lake&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lake challenges

&lt;ul&gt;
&lt;li&gt;Data reliability issues

&lt;ul&gt;
&lt;li&gt;Corruption because of failures (no rollback)&lt;/li&gt;
&lt;li&gt;No data validation&lt;/li&gt;
&lt;li&gt;Consistency issues while reading data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;No updates/deletes/merges on files

&lt;ul&gt;
&lt;li&gt;Difficult to implement GDPR/CCPA compliance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data quality issues

&lt;ul&gt;
&lt;li&gt;Schema isn't verified before writing&lt;/li&gt;
&lt;li&gt;Cannot apply checks on data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Query performance issues&lt;/li&gt;
&lt;li&gt;Difficult to maintain historical versions of data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Open-source storage layer that &lt;strong&gt;brings reliability to data lakes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can be installed on-prem&lt;/li&gt;
&lt;li&gt;Available by default on many cloud platforms&lt;/li&gt;
&lt;li&gt;Provides database-like features on top of data lake

&lt;ul&gt;
&lt;li&gt;Create constraints, enforce schema, run DML statements, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provides ACID guarantees&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Works by storing a transaction log of all transactions performed on data (dataframe.write.format("delta"))

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log file is not created until after writing is done and is not created if there is a failure&lt;/strong&gt;, which helps ensure ACID guarantees&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Delta Lake availability

&lt;ul&gt;
&lt;li&gt;Can be downloaded and installed

&lt;ul&gt;
&lt;li&gt;On local machine&lt;/li&gt;
&lt;li&gt;On-prem Spark cluster&lt;/li&gt;
&lt;li&gt;Cloud platforms like Azure HDInsight&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Available by default in cloud platforms

&lt;ul&gt;
&lt;li&gt;Azure Databricks&lt;/li&gt;
&lt;li&gt;Azure Synapse Spark pools&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the portal (Databricks)

&lt;ul&gt;
&lt;li&gt;Use spark.conf.set to connect to storage&lt;/li&gt;
&lt;li&gt;Use dbutils.fs.ls to list files in storage path&lt;/li&gt;
&lt;li&gt;Define input and output folder paths, use input to read (spark.read.option(...).csv(...))&lt;/li&gt;
&lt;li&gt;To write to Delta Lake

&lt;ul&gt;
&lt;li&gt;Write in Delta format with output path DF.write.format("delta").save(outputPath + "filename.delta")&lt;/li&gt;
&lt;li&gt;Check output location in storage to confirm write&lt;/li&gt;
&lt;li&gt;Check the _delta_log folder to see metadata about the write&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;To read from Delta Lake

&lt;ul&gt;
&lt;li&gt;Use Spark SQL to create a database&lt;/li&gt;
&lt;li&gt;Create a table in the database using CREATE TABLE table_name USING DELTA LOCATION "delta_file_path/filename.delta"&lt;/li&gt;
&lt;li&gt;DESCRIBE HISTORY table_name can be used to audit the history of the Delta table&lt;/li&gt;
&lt;li&gt;Read different versions of data using SELECT * FROM table_name VERSION AS OF [version number], or SELECT * FROM table_name TIMESTAMP AS OF '[timestamp]'&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Can restore previous versions with RESTORE TABLE table_name TO VERSION AS OF [version number]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
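&lt;p&gt;The log-after-write behavior above can be sketched as a toy commit protocol. Real _delta_log entries carry far more metadata; this only illustrates why a missing log entry makes a failed write invisible to readers:&lt;/p&gt;

```python
import json
import os
import tempfile

def delta_style_write(table_dir, rows, version):
    """Write a data file, then record a commit in the log only after the
    write succeeds -- the property behind Delta's ACID guarantees.
    Toy sketch of the idea only, not the real Delta format.
    """
    os.makedirs(os.path.join(table_dir, '_delta_log'), exist_ok=True)
    data_path = os.path.join(table_dir, 'part-{:05d}.json'.format(version))
    with open(data_path, 'w') as f:
        json.dump(rows, f)  # if this raises, no log entry is ever created
    log_path = os.path.join(table_dir, '_delta_log',
                            '{:020d}.json'.format(version))
    with open(log_path, 'w') as f:  # the commit: readers only see logged files
        json.dump({'add': os.path.basename(data_path)}, f)

def committed_files(table_dir):
    """Readers trust the transaction log, not the directory listing."""
    log_dir = os.path.join(table_dir, '_delta_log')
    out = []
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            out.append(json.load(f)['add'])
    return out

tmp = tempfile.mkdtemp()
delta_style_write(tmp, [{'id': 1}], 0)
```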

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Ingest and transform data</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:03:32 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-ingest-and-transform-data-108a</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-ingest-and-transform-data-108a</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design and implement incremental loads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watermarks

&lt;ul&gt;
&lt;li&gt;Column in source table with last updated time stamp or incrementing key&lt;/li&gt;
&lt;li&gt;Marks the most recent update in the table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Delta loading

&lt;ul&gt;
&lt;li&gt;Essentially the same as incremental loading&lt;/li&gt;
&lt;li&gt;Only new or changed data is processed, whether loading, transforming, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 basic design options&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Delta loading using a &lt;strong&gt;watermark&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Delta loading from SQL DB using &lt;strong&gt;change tracking technology&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Loading new and changed files only &lt;strong&gt;using LastModifiedDate&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Loading new files only using a &lt;strong&gt;partitioned folder or file name&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Considerations

&lt;ul&gt;
&lt;li&gt;Volume and type of data&lt;/li&gt;
&lt;li&gt;Load on system&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Query to get old watermark&lt;/li&gt;
&lt;li&gt;Query to get new watermark&lt;/li&gt;
&lt;li&gt;Load data between watermarks&lt;/li&gt;
&lt;li&gt;Update watermark&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
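&lt;p&gt;The four watermark steps above can be sketched with an in-memory SQLite database standing in for the source and sink (table and column names are hypothetical):&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema: a source table with a last_modified watermark
# column, and a one-row table remembering the previous high-water mark.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE src (id INTEGER, val TEXT, last_modified TEXT);
    CREATE TABLE sink (id INTEGER, val TEXT, last_modified TEXT);
    CREATE TABLE watermark (mark TEXT);
    INSERT INTO watermark VALUES ('2023-12-01');
    INSERT INTO src VALUES (1, 'old', '2023-11-30'), (2, 'new', '2023-12-02');
""")

# Step 1: get the old watermark. Step 2: get the new watermark.
old_mark = con.execute('SELECT mark FROM watermark').fetchone()[0]
new_mark = con.execute('SELECT MAX(last_modified) FROM src').fetchone()[0]

# Step 3: load only the rows between the two watermarks.
con.execute(
    'INSERT INTO sink SELECT * FROM src '
    'WHERE last_modified > ? AND NOT last_modified > ?',
    (old_mark, new_mark))

# Step 4: persist the new watermark for the next run.
con.execute('UPDATE watermark SET mark = ?', (new_mark,))
loaded = con.execute('SELECT id FROM sink ORDER BY id').fetchall()
```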

&lt;p&gt;&lt;strong&gt;Transform data by using Apache Spark&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark

&lt;ul&gt;
&lt;li&gt;Can be used in Synapse, Databricks, and Data Factory&lt;/li&gt;
&lt;li&gt;Ecosystem

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Spark Core&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;Basic functionalities (task scheduling, memory management)&lt;/li&gt;
&lt;li&gt;Can be abstracted through APIs&lt;/li&gt;
&lt;li&gt;Can be done in R, Python, Scala, and Java&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL&lt;/strong&gt; - similar to standard SQL but allows queries on data in Spark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark Streaming&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MLlib&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GraphX&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;More about Spark architecture

&lt;ul&gt;
&lt;li&gt;Spark core: RDDs and languages&lt;/li&gt;
&lt;li&gt;Spark SQL engine: &lt;a href="https://www.databricks.com/glossary/catalyst-optimizer"&gt;Catalyst optimizer&lt;/a&gt;, &lt;a href="https://www.databricks.com/glossary/tungsten"&gt;Tungsten&lt;/a&gt; (memory/CPU mgmt)&lt;/li&gt;
&lt;li&gt;DataFrame/Dataset APIs&lt;/li&gt;
&lt;li&gt;Spark Graph, Spark ML, Spark Streaming, Spark SQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Synapse notebooks in the portal

&lt;ul&gt;
&lt;li&gt;Develop on the left-side panel&lt;/li&gt;
&lt;li&gt;Click +, then Notebook&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Must have Spark pool attached before running a notebook&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Go to Manage in left-side panel&lt;/li&gt;
&lt;li&gt;Analytics pools --&amp;gt; Apache Spark pools --&amp;gt; choose name and settings --&amp;gt; Review and create&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Write and execute code in cells like a typical notebook&lt;/li&gt;
&lt;li&gt;Click + --&amp;gt; Browse gallery --&amp;gt; Notebooks to see example notebooks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For the exam, know the basics of Synapse notebooks; Synapse architecture questions are more likely to be about keywords than fine details
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transform data by using Transact-SQL (T-SQL) in Azure Synapse Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transact-SQL

&lt;ul&gt;
&lt;li&gt;For querying data in a &lt;strong&gt;data lake&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;SQL serverless pools&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Query data without loading it into database storage&lt;/li&gt;
&lt;li&gt;Standard formats are CSV, JSON, and Parquet&lt;/li&gt;
&lt;li&gt;Useful for &lt;strong&gt;OLAP&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Develop --&amp;gt; New --&amp;gt; SQL Script&lt;/li&gt;
&lt;li&gt;[FROM] &lt;strong&gt;OPENROWSET&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use instead of defining a table&lt;/li&gt;
&lt;li&gt;Mimics the properties of a table, but uses data lake object as a source&lt;/li&gt;
&lt;li&gt;Choose file URL, format, and parser version if CSV&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common data ingestion pipelines

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Functions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Serverless compute&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Short run processing (only designed to run for short periods of time)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom component&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Low-scale parallel computing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heavy algorithms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Requires wrapping code into an executable (more complex)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Databricks&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark, designed for &lt;strong&gt;massive and complex data&lt;/strong&gt; transformations&lt;/li&gt;
&lt;li&gt;Expensive and complicated&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Data Factory&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Suitable for &lt;strong&gt;light transformation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can include above methods as activities&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Copy performance

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance"&gt;Performance chart&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Shows how long a copy will take based on amount of data and bandwidth&lt;/li&gt;
&lt;li&gt;Can help with assessing costs of running pipelines&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Most work is done in Author section in left-side panel&lt;/li&gt;
&lt;li&gt;Under Factory Resources there are pipelines, datasets, etc&lt;/li&gt;
&lt;li&gt;Linked services are not shown separately; they are lumped in with datasets&lt;/li&gt;
&lt;li&gt;Under Datasets, click + to add Dataset

&lt;ul&gt;
&lt;li&gt;Choose Service&lt;/li&gt;
&lt;li&gt;Name Dataset and select Linked service&lt;/li&gt;
&lt;li&gt;If you choose New service, input connection details, including subscription, server, database, authentication, etc&lt;/li&gt;
&lt;li&gt;Select dataset from the linked service (table name, file, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Under Datasets you can view and preview the dataset&lt;/li&gt;
&lt;li&gt;Click + to add a new pipeline

&lt;ul&gt;
&lt;li&gt;Select an activity, i.e. Copy data&lt;/li&gt;
&lt;li&gt;In the activity settings at the bottom, choose source, sink, copy behavior, and other settings&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Dataflows allow you to set up transformations within ADF

&lt;ul&gt;
&lt;li&gt;These dataflows can be included as activities in the pipeline&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences"&gt;Differences between ADF and Synapse&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;ADF has

&lt;ul&gt;
&lt;li&gt;Cross-region &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime"&gt;integration runtime&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Runtime sharing&lt;/li&gt;
&lt;li&gt;Power Query activity&lt;/li&gt;
&lt;li&gt;Global parameters&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Synapse has

&lt;ul&gt;
&lt;li&gt;Monitoring Spark jobs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Both have

&lt;ul&gt;
&lt;li&gt;Solution templates (ADF template gallery, Synapse knowledge center)&lt;/li&gt;
&lt;li&gt;GIT integration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ADF/Synapse portal differences

&lt;ul&gt;
&lt;li&gt;ADF has

&lt;ul&gt;
&lt;li&gt;Home&lt;/li&gt;
&lt;li&gt;Author - pipelines, datasets, data flows, Power Query, and templates&lt;/li&gt;
&lt;li&gt;Monitor - dashboards for pipeline/trigger runs, integration runtimes, data flow debug, alerts/metrics&lt;/li&gt;
&lt;li&gt;Manage&lt;/li&gt;
&lt;li&gt;Learning center&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Synapse has

&lt;ul&gt;
&lt;li&gt;Home&lt;/li&gt;
&lt;li&gt;Data - SQL/Lake database, external datasets, and integration datasets&lt;/li&gt;
&lt;li&gt;Develop - SQL scripts, notebooks, data flows&lt;/li&gt;
&lt;li&gt;Integrate - pipelines, Synapse Link connections&lt;/li&gt;
&lt;li&gt;Monitor - pools, requests, Spark, pipeline/trigger runs, integration runtimes, Link connections&lt;/li&gt;
&lt;li&gt;Manage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transform data by using Azure Stream Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Stream Analytics

&lt;ul&gt;
&lt;li&gt;Only for streaming solutions, not batch&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Input can be Blob Storage, Event Hubs, or IoT Hub&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;These input to the &lt;strong&gt;query layer where transformations happen&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Query &lt;strong&gt;outputs to Blob storage or Power BI&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Queries

&lt;ul&gt;
&lt;li&gt;SELECT * INTO output FROM input&lt;/li&gt;
&lt;li&gt;Choose specific columns, where clauses, aggregations, etc &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cleanse data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process overview

&lt;ul&gt;
&lt;li&gt;Investigate the data&lt;/li&gt;
&lt;li&gt;Perform cleaning steps (unique to data set)&lt;/li&gt;
&lt;li&gt;Evaluate the results

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validity&lt;/strong&gt; (does it match business rules?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Completeness&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt; (is there conflicting data?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniformity&lt;/strong&gt; (are data points using same units of measure?)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Common tools

&lt;ul&gt;
&lt;li&gt;ADF, Synapse (almost identical for this purpose)&lt;/li&gt;
&lt;li&gt;Azure Stream Analytics (can be harder to clean)&lt;/li&gt;
&lt;li&gt;Databricks (more complicated, but versatile and useful for massive data)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal (ADF)

&lt;ul&gt;
&lt;li&gt;Create a Data Flow, choose sources&lt;/li&gt;
&lt;li&gt;Preview data to see which fields can join data&lt;/li&gt;
&lt;li&gt;Consider how columns can be filtered or removed to provide value or remove extraneous data&lt;/li&gt;
&lt;li&gt;Once cleansing is done, choose sink&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Handle duplicate data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dedupe = eliminate unnecessary copies

&lt;ul&gt;
&lt;li&gt;Consider technology knowledge &lt;/li&gt;
&lt;li&gt;Consider complexity&lt;/li&gt;
&lt;li&gt;Consider accompanying solutions (SQL queries, ADF data flows, Spark, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Basic steps (in ADF)

&lt;ul&gt;
&lt;li&gt;Create data flow&lt;/li&gt;
&lt;li&gt;Choose source&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;script snippet&lt;/strong&gt; (scroll symbol in top right of editor, snippets can be found on &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/how-to-data-flow-dedupe-nulls-snippets?source=recommendations"&gt;Microsoft Learn&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Choose destination&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
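&lt;p&gt;A minimal sketch of the dedupe idea in plain Python, keeping the first row seen for each key (the ADF snippet achieves the equivalent in data flow script):&lt;/p&gt;

```python
def dedupe(rows, key_columns):
    """Keep only the first row seen for each key combination."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {'id': 1, 'name': 'Ada'},
    {'id': 1, 'name': 'Ada'},   # duplicate, dropped
    {'id': 2, 'name': 'Grace'},
]
unique = dedupe(rows, ['id'])
```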

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/stream-analytics-query/event-delivery-guarantees-azure-stream-analytics"&gt;&lt;strong&gt;Avoiding duplicate data by using Azure Stream Analytics Exactly Once Delivery&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle missing data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine impact of missing data, sometimes it won't be a big deal&lt;/li&gt;
&lt;li&gt;Options of handling missing data

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop rows&lt;/strong&gt; that have the missing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imputation&lt;/strong&gt; = assign an inferred value to the missing element&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include the rows&lt;/strong&gt; that are missing data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
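&lt;p&gt;Imputation, one of the options above, can be sketched as a mean fill (the column name is hypothetical):&lt;/p&gt;

```python
def impute_mean(rows, column):
    """Replace missing (None) values with the mean of the present values."""
    present = [r[column] for r in rows if r[column] is not None]
    mean = sum(present) / len(present)
    return [dict(r, **{column: r[column] if r[column] is not None else mean})
            for r in rows]

rows = [{'qty': 10}, {'qty': None}, {'qty': 20}]
filled = impute_mean(rows, 'qty')
```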

&lt;p&gt;&lt;strong&gt;Handle late-arriving data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Definitions

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event time&lt;/strong&gt; = when original event &lt;strong&gt;happened&lt;/strong&gt; (order is given to waiter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing time&lt;/strong&gt; = when event is &lt;strong&gt;observed&lt;/strong&gt; (waiter gives order to kitchen)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermark&lt;/strong&gt; = stamp identifying when event has been &lt;strong&gt;ingressed&lt;/strong&gt; into system&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Handle late arriving data by choosing a level of tolerance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Consequences of tolerance

&lt;ul&gt;
&lt;li&gt;Tolerance = window considered acceptable for late arrival&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critical events can be missed&lt;/strong&gt; without proper tolerance&lt;/li&gt;
&lt;li&gt;Delayed outputs can result in &lt;strong&gt;broken processes or bad reports&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
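&lt;p&gt;The tolerance policy can be sketched as a simple acceptance check on event time versus arrival time. Stream Analytics can also adjust late timestamps rather than drop events; this only shows the window test:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def accept(event_time, arrival_time, tolerance):
    """Accept an event only if it arrived within the late-arrival
    tolerance window of its event time.
    """
    return not arrival_time - event_time > tolerance

t0 = datetime(2023, 12, 5, 12, 0)
tol = timedelta(minutes=5)
on_time = accept(t0, t0 + timedelta(minutes=3), tol)   # within tolerance
too_late = accept(t0, t0 + timedelta(minutes=9), tol)  # outside tolerance
```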

&lt;p&gt;&lt;strong&gt;Split data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splitting data allows making paths to multiple sinks from the same source&lt;/li&gt;
&lt;li&gt;Conditional splits

&lt;ul&gt;
&lt;li&gt;Route data to different outputs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Available in ADF and Synapse&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Steps

&lt;ul&gt;
&lt;li&gt;Create data flow&lt;/li&gt;
&lt;li&gt;Use conditional split transformation&lt;/li&gt;
&lt;li&gt;Set split conditions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data flow scripts

&lt;ul&gt;
&lt;li&gt;Can use scripts to do the steps above&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
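&lt;p&gt;The conditional split steps above can be sketched in plain Python, routing each row to the first matching output (sink names and conditions are hypothetical):&lt;/p&gt;

```python
def conditional_split(rows, conditions):
    """Route each row to the first matching output; unmatched rows go
    to a default sink, like an ADF conditional split transformation.
    """
    sinks = {name: [] for name, _ in conditions}
    sinks['default'] = []
    for row in rows:
        for name, pred in conditions:
            if pred(row):
                sinks[name].append(row)
                break
        else:
            sinks['default'].append(row)
    return sinks

rows = [{'amount': 5}, {'amount': 50}, {'amount': 500}]
sinks = conditional_split(rows, [
    ('large', lambda r: r['amount'] > 100),
    ('medium', lambda r: r['amount'] > 10),
])
```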

&lt;p&gt;&lt;strong&gt;Shred JSON&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shredding JSON = &lt;strong&gt;extracting&lt;/strong&gt; data from a JSON file and &lt;strong&gt;transferring to a table&lt;/strong&gt; (aka parsing)&lt;/li&gt;
&lt;li&gt;Done in Synapse or ADF&lt;/li&gt;
&lt;li&gt;Once data is extracted it is &lt;strong&gt;persisted to a data store&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/sql/t-sql/functions/openjson-transact-sql?view=sql-server-ver16"&gt;OPENJSON function&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Table-valued function that parses JSON text&lt;/li&gt;
&lt;li&gt;Returns objects and properties as rows and columns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
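&lt;p&gt;A minimal sketch of shredding flat JSON into rows and columns, the effect OPENJSON produces in T-SQL (nested objects would need further flattening):&lt;/p&gt;

```python
import json

def shred(json_text):
    """Turn a JSON array of flat objects into column names and row tuples."""
    records = json.loads(json_text)
    columns = sorted({k for rec in records for k in rec})
    rows = [tuple(rec.get(c) for c in columns) for rec in records]
    return columns, rows

doc = '[{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]'
columns, rows = shred(doc)
```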

&lt;p&gt;&lt;strong&gt;Encode and decode data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UTF-8

&lt;ul&gt;
&lt;li&gt;Unicode Transformation Format, 8-bit&lt;/li&gt;
&lt;li&gt;The ASCII problem

&lt;ul&gt;
&lt;li&gt;Assigns a code to every character (256 possibilities)&lt;/li&gt;
&lt;li&gt;As programming expanded, number of available characters ran out&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UTF-8 provides more character possibilities&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Program must understand UTF-8 codes in order to decode information&lt;/li&gt;
&lt;li&gt;There are multiple encoding formats, so the &lt;strong&gt;source and sink must use the same encoding&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Done in ADF and Synapse copy activities&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Can choose encoding and compression properties &lt;strong&gt;in the Dataset properties&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
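&lt;p&gt;The source/sink encoding mismatch above can be demonstrated directly: decoding UTF-8 bytes with the wrong codec silently yields the wrong text:&lt;/p&gt;

```python
# The writer (source) and reader (sink) must agree on the encoding.
text = "Zoë café"
data = text.encode('utf-8')          # source writes UTF-8 bytes

roundtrip = data.decode('utf-8')     # matching codec recovers the text
mismatch = data.decode('latin-1')    # wrong codec: no error, garbled text
```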

&lt;p&gt;&lt;strong&gt;Configure error handling for a transformation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Options for error handling

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transaction commit&lt;/strong&gt;: choose whether to write data in a &lt;strong&gt;single transaction or in batches&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output rejected data&lt;/strong&gt;: log error rows in a CSV in Azure Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success on error&lt;/strong&gt;: mark the activity as successful even if errors occur&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In the portal (ADF)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In an activity's settings, &lt;strong&gt;fault tolerance&lt;/strong&gt; represents a form of success on error, continuing past incompatible data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable logging&lt;/strong&gt; to store files that show rejected rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable staging&lt;/strong&gt; allows for copying in batches&lt;/li&gt;
&lt;li&gt;On the right side of the activity there are buttons for "on success," "on failure," etc.

&lt;ul&gt;
&lt;li&gt;Connect these to other activities to choose how pipeline errors are handled&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a data flow database sink, there is an &lt;strong&gt;Errors tab&lt;/strong&gt; to configure error handling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Normalize and denormalize data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normalizing data = reorganizing tables to remove redundant data&lt;/li&gt;
&lt;li&gt;Denormalizing data = adding redundant data to one or more tables&lt;/li&gt;
&lt;li&gt;What and why

&lt;ul&gt;
&lt;li&gt;Normalizing

&lt;ul&gt;
&lt;li&gt;More tables&lt;/li&gt;
&lt;li&gt;Requires multiple joins&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improves accuracy and integrity&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Denormalizing

&lt;ul&gt;
&lt;li&gt;More space&lt;/li&gt;
&lt;li&gt;More difficult to maintain&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improves query performance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Star schema is not normalized&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake schema is normalized&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the portal (Synapse)

&lt;ul&gt;
&lt;li&gt;Inspect the data sources to determine normalization status and identify join columns&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;join transformation&lt;/strong&gt; to combine data sources &lt;strong&gt;for denormalization&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;conditional split&lt;/strong&gt; or a select transformation &lt;strong&gt;to normalize&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Transformations can also be done in the script editor&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
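&lt;p&gt;The join-transformation approach to denormalization can be sketched in plain Python, duplicating dimension columns onto each fact row (table and column names are hypothetical):&lt;/p&gt;

```python
def denormalize(orders, customers, key):
    """Join a fact table to a dimension table, copying the dimension's
    columns onto every matching fact row.
    """
    lookup = {c[key]: c for c in customers}
    return [dict(o, **lookup[o[key]]) for o in orders]

customers = [{'cust_id': 1, 'name': 'Ada'}]
orders = [{'order_id': 10, 'cust_id': 1}, {'order_id': 11, 'cust_id': 1}]
flat = denormalize(orders, customers, 'cust_id')
```

Note the trade-off described above: the customer name is now stored twice (more space) in exchange for join-free reads.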

&lt;p&gt;&lt;strong&gt;Perform data exploratory analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use summary statistics and visualizations to &lt;strong&gt;investigate patterns and anomalies in data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can be done with SQL, Python, or Kusto queries in Azure Data Explorer&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Design and implement the data exploration layer</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 17:58:40 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-design-and-implement-the-data-exploration-layer-40pl</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-design-and-implement-the-data-exploration-layer-40pl</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create and execute queries by using a compute solution that leverages SQL serverless and Spark cluster&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview"&gt;Azure SQL Serverless&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not SQL database - it is SQL compute&lt;/strong&gt; in Azure Synapse Analytics&lt;/li&gt;
&lt;li&gt;Serverless SQL pool

&lt;ul&gt;
&lt;li&gt;Built in to Synapse&lt;/li&gt;
&lt;li&gt;Always available&lt;/li&gt;
&lt;li&gt;Billed based on usage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data access

&lt;ul&gt;
&lt;li&gt;No data storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data accessed through ADL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPENROWSET&lt;/strong&gt; syntax to access data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Provisioned resources

&lt;ul&gt;
&lt;li&gt;Dedicated SQL pool&lt;/li&gt;
&lt;li&gt;Static number of servers&lt;/li&gt;
&lt;li&gt;User chooses runtime&lt;/li&gt;
&lt;li&gt;Defined cost per data warehouse unit (DWU)&lt;/li&gt;
&lt;li&gt;Data is stored in relational tables using columnar storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Used &lt;strong&gt;mainly for EDA&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Develop section on left-side panel&lt;/li&gt;
&lt;li&gt;Click + button and add SQL script&lt;/li&gt;
&lt;li&gt;Select tables from lake DB or SQL DB&lt;/li&gt;
&lt;li&gt;Choose SQL pool settings&lt;/li&gt;
&lt;li&gt;SELECT * FROM OPENROWSET( BULK '', FORMAT = 'parquet') AS [result] &lt;/li&gt;
&lt;li&gt;SELECT * FROM OPENROWSET( BULK '', FORMAT = 'CSV', Parser_Version = '2.0') AS [result] &lt;/li&gt;
&lt;li&gt;Can also go to Data in left-side panel and link storage account and containers - this can be used to auto-generate basic SELECT queries&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Spark clusters

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview"&gt;Apache Spark in Synapse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;In-memory cluster computing&lt;/li&gt;
&lt;li&gt;Synapse offers ease of use and creation&lt;/li&gt;
&lt;li&gt;Data access is interacting with &lt;strong&gt;Spark pools through notebooks&lt;/strong&gt; (similar to Databricks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databases and tables created in a Spark pool are replicated in a serverless SQL pool as read-only&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Under Develop in left-side panel&lt;/li&gt;
&lt;li&gt;Click + button and select or create a notebook&lt;/li&gt;
&lt;li&gt;Under Manage on left-side panel, create and run an Apache Spark pool&lt;/li&gt;
&lt;li&gt;Be sure to enable automatic pausing, Spark pools are expensive&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommend and implement Azure Synapse Analytics database templates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/overview-database-templates"&gt;Database Templates&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Speed up design process&lt;/li&gt;
&lt;li&gt;Create more thorough databases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/concepts-lake-database"&gt;&lt;strong&gt;Lake database in Synapse&lt;/strong&gt;&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes lack structure&lt;/li&gt;
&lt;li&gt;Databases can be too structured&lt;/li&gt;
&lt;li&gt;Lake database removes these downsides&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provides structured DB with meta info, stored in a data lake&lt;/strong&gt; (parquet, delta, CSV formats)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Powered by serverless Synapse compute&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Architecture process

&lt;ul&gt;
&lt;li&gt;Access Synapse Studio instance&lt;/li&gt;
&lt;li&gt;Create a Lake Database&lt;/li&gt;
&lt;li&gt;Add a Table&lt;/li&gt;
&lt;li&gt;Add Template&lt;/li&gt;
&lt;li&gt;Select relevant features&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Push new or updated data lineage to Microsoft Purview&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/purview/purview"&gt;Microsoft Purview&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified data governance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;On-prem, multi-cloud, SaaS&lt;/li&gt;
&lt;li&gt;4 &lt;strong&gt;pillars&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Data &lt;strong&gt;quality&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;stewardship&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;protection and compliance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;management&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lifecycle management&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Data &lt;strong&gt;catalog&lt;/strong&gt; - organized inventory of data assets&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;estate insights&lt;/strong&gt; - reports that help organizations understand the health and governance coverage of their data estate&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;sharing&lt;/strong&gt; - internally or across orgs&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;policy&lt;/strong&gt; - provision access to data at scale&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use cases&lt;/strong&gt; for Purview

&lt;ul&gt;
&lt;li&gt;Pull data from SQL DB and ADL and &lt;strong&gt;provide governance&lt;/strong&gt; across the org&lt;/li&gt;
&lt;li&gt;Financial services can show where critical data is stored to &lt;strong&gt;evaluate security risk&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Large, diverse orgs can &lt;strong&gt;enable data democratization&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Data lineage

&lt;ul&gt;
&lt;li&gt;Track data flow over time&lt;/li&gt;
&lt;li&gt;Origination --&amp;gt; Delta (data changes) --&amp;gt; Sink (output)&lt;/li&gt;
&lt;li&gt;Provides confidence in data&lt;/li&gt;
&lt;li&gt;Facilitates governance and impact analysis&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Lineage tab shows a flow chart with sources, processes, and targets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Browse and search metadata in Microsoft Purview Data Catalog&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Open the Microsoft Purview Governance Portal&lt;/li&gt;
&lt;li&gt;Data Catalog --&amp;gt; Browse --&amp;gt; By collection or source type&lt;/li&gt;
&lt;li&gt;Go to Data map in left-side panel to register data sources

&lt;ul&gt;
&lt;li&gt;Data map --&amp;gt; Data sources --&amp;gt; Register&lt;/li&gt;
&lt;li&gt;Need to do a new scan to establish lineage&lt;/li&gt;
&lt;li&gt;Requires access control to be configured to allow Purview to scan the data sources&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Implement a partition strategy</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 17:58:19 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-implement-a-partition-strategy-4fkj</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-implement-a-partition-strategy-4fkj</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/architecture/best-practices/data-partitioning"&gt;&lt;strong&gt;Data partitioning guidance&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal (sharding)&lt;/strong&gt; = each partition is a separate data store, but all partitions have the same schema (&lt;strong&gt;partitions have different rows&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical&lt;/strong&gt; = each partition holds a subset of the fields according to patterns of use (&lt;strong&gt;partitions have different columns&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional&lt;/strong&gt; = data is aggregated according to &lt;strong&gt;how it is used by each bounded context&lt;/strong&gt; (e.g. invoice data vs product data)&lt;/li&gt;
&lt;/ul&gt;
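&lt;p&gt;A minimal T-SQL sketch of the three approaches (all table, schema, and column names are hypothetical):&lt;/p&gt;

```sql
-- Horizontal (sharding): separate stores with the same schema, different rows
CREATE TABLE dbo.Orders_Shard1 (OrderId INT, CustomerId INT, Amount DECIMAL(18,2));
CREATE TABLE dbo.Orders_Shard2 (OrderId INT, CustomerId INT, Amount DECIMAL(18,2));

-- Vertical: frequently read columns split from rarely read ones
CREATE TABLE dbo.Customer_Core     (CustomerId INT PRIMARY KEY, Name NVARCHAR(100));
CREATE TABLE dbo.Customer_Extended (CustomerId INT PRIMARY KEY, Photo VARBINARY(MAX));

-- Functional: each bounded context keeps its own store
CREATE TABLE invoicing.Invoice (InvoiceId INT, CustomerId INT, Total DECIMAL(18,2));
CREATE TABLE catalog.Product   (ProductId INT, Name NVARCHAR(100));
```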

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for files&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Think through the problem, whiteboard it out&lt;/li&gt;
&lt;li&gt;Parquet

&lt;ul&gt;
&lt;li&gt;Most common file type for big data&lt;/li&gt;
&lt;li&gt;Column-based storage with nested data structures&lt;/li&gt;
&lt;li&gt;Supports parallel processing queries&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://parquet.apache.org/docs/concepts/"&gt;Row-group&lt;/a&gt; sections can be treated as partitions - multiple row-groups can be sent to different nodes&lt;/li&gt;
&lt;li&gt;Break partitions apart based on column values, e.g. querying by a date&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best practices

&lt;ul&gt;
&lt;li&gt;Make sure to include partition columns in table's schema definition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group related records&lt;/strong&gt; together&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don't use unnecessary columns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;512 MB to 1 GB&lt;/strong&gt; is optimal partition size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider the query&lt;/strong&gt; and how the data will be used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider the expected growth&lt;/strong&gt; of the data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider how static&lt;/strong&gt; the data is&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
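&lt;p&gt;One way these practices pay off is partition pruning: Synapse serverless SQL can skip Parquet files whose folder values don't match the query. This is a sketch with a hypothetical account, container, and year=/month= folder layout:&lt;/p&gt;

```sql
-- Only files matching the filepath() filters are read;
-- filepath(1) and filepath(2) return the first and second wildcard matches
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/sales/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS rows
WHERE rows.filepath(1) = '2023'
  AND rows.filepath(2) = '06';
```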

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for analytical workloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distribution types

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round-robin&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed evenly&lt;/strong&gt; in a random fashion&lt;/li&gt;
&lt;li&gt;Even distribution across DBs&lt;/li&gt;
&lt;li&gt;Assignment is random&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast performance for loads&lt;/strong&gt; as row assignment can be done quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slower performance for reads&lt;/strong&gt; as higher potential for data movement&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best for: 

&lt;ul&gt;
&lt;li&gt;No clear distribution key&lt;/li&gt;
&lt;li&gt;No frequent joins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniform distribution&lt;/strong&gt; is desired&lt;/li&gt;
&lt;li&gt;Temporary staging table&lt;/li&gt;
&lt;li&gt;Simple starting point&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics

&lt;ul&gt;
&lt;li&gt;Distributed deterministically using hash function on a column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution column can’t be changed&lt;/strong&gt; later&lt;/li&gt;
&lt;li&gt;Choose a column with many unique values, few or no nulls, and that is not a date column&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best for

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large tables (&amp;gt;2 GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Frequent inserts, updates, and deletes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replicated&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full copy of table&lt;/strong&gt; is replicated to every compute node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requires extra storage&lt;/strong&gt; and overhead for writes&lt;/li&gt;
&lt;li&gt;Normally used in conjunction with other methods&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best for:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small lookup or dimension tables&lt;/strong&gt; joined with larger tables&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
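&lt;p&gt;The three distribution types map to the WITH clause of CREATE TABLE in a dedicated SQL pool. A sketch with hypothetical table and column names:&lt;/p&gt;

```sql
-- Temporary staging table: no clear distribution key, so round-robin
CREATE TABLE dbo.StageSales
( SaleId INT, ProductId INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Large fact table: hash-distribute on a high-cardinality, rarely-null column
CREATE TABLE dbo.FactSales
( SaleId INT, ProductId INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(ProductId), CLUSTERED COLUMNSTORE INDEX );

-- Small dimension table joined to the fact table: replicate to every node
CREATE TABLE dbo.DimProduct
( ProductId INT, ProductName NVARCHAR(100) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (ProductId) );
```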

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for streaming workloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Stream Analytics

&lt;ul&gt;
&lt;li&gt;Fully managed stream processing engine&lt;/li&gt;
&lt;li&gt;Input layer (Blob storage, Event Hubs, IoT hubs) ingested into ASA&lt;/li&gt;
&lt;li&gt;Query layer: ASA performs query&lt;/li&gt;
&lt;li&gt;Output layer: Results sent to Blob storage for downstream use&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;How transformation works in a stream

&lt;ul&gt;
&lt;li&gt;Data in stream is diverted to perform query&lt;/li&gt;
&lt;li&gt;Query transformation results are re-introduced to stream for output&lt;/li&gt;
&lt;li&gt;Transformation is done in near real time&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization"&gt;Partitioning&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embarrassingly parallel job&lt;/strong&gt;: equal numbers of input and output partitions, with one instance of the query running per partition&lt;/li&gt;
&lt;li&gt;Must align partition keys between inputs, query logic, and outputs&lt;/li&gt;
&lt;li&gt;Jobs that aren't embarrassingly parallel can still be completed, but not as efficiently

&lt;ul&gt;
&lt;li&gt;Typically involves windowed queries that combine data across partitions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the Azure Portal

&lt;ul&gt;
&lt;li&gt;Query in the left-side options&lt;/li&gt;
&lt;li&gt;Inputs - define query, can test and see results&lt;/li&gt;
&lt;li&gt;Outputs - define and test output query

&lt;ul&gt;
&lt;li&gt;Here you can define the partition key with the PARTITION BY clause (at compatibility level 1.1 and below; at level 1.2, define the partition key on the input instead)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
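&lt;p&gt;An embarrassingly parallel job can be sketched in the Stream Analytics query language. The input, output, and column names below are hypothetical, and the PARTITION BY clause is only needed at compatibility level 1.1 and below:&lt;/p&gt;

```sql
-- One query instance per partition: input, query, and output
-- are all keyed on PartitionId
SELECT PartitionId, DeviceId, AVG(Temperature) AS AvgTemp
INTO BlobOutput
FROM EventHubInput TIMESTAMP BY EventTime
PARTITION BY PartitionId
GROUP BY PartitionId, DeviceId, TumblingWindow(minute, 5)
```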

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for Azure Synapse Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition"&gt;&lt;strong&gt;Table partitions&lt;/strong&gt;&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Supported on all dedicated SQL pool table types

&lt;ul&gt;
&lt;li&gt;Clustered columnstore, clustered index, heap&lt;/li&gt;
&lt;li&gt;Supported on all distribution types (hash, round robin, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Why partition

&lt;ul&gt;
&lt;li&gt;Query performance&lt;/li&gt;
&lt;li&gt;Load performance - Smaller amounts of data make incremental loading, updating, and deleting faster and easier&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#clustered-columnstore-indexes"&gt;Clustered columnstore indexes&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Standard for storing and querying large data warehouse fact tables&lt;/li&gt;
&lt;li&gt;Rows are organized into row groups of up to 1,048,576 rows&lt;/li&gt;
&lt;li&gt;Within each row group, values are stored per column as column segments&lt;/li&gt;
&lt;li&gt;The columnstore index is built from these column segments, and the data is compressed&lt;/li&gt;
&lt;li&gt;Deltastore - a rowstore that holds leftover rows until there are enough to compress into a row group&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Law of 60

&lt;ul&gt;
&lt;li&gt;A distribution is the basic unit of storage and processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synapse divides each query into 60 smaller queries&lt;/strong&gt; that run in parallel, one per data distribution&lt;/li&gt;
&lt;li&gt;This turns 10 table partitions into 600 physical partitions (10 x 60)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Each physical partition needs about 1 million rows&lt;/strong&gt; for efficient columnstore compression, so a 10-partition table should hold roughly 600 million rows&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the Azure Synapse Analytics portal

&lt;ul&gt;
&lt;li&gt;When writing CREATE TABLE statement, use WITH clause using CLUSTERED COLUMNSTORE INDEX&lt;/li&gt;
&lt;li&gt;Define the distribution type and key&lt;/li&gt;
&lt;li&gt;Choose partition key
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
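&lt;p&gt;Putting those portal steps together, a partitioned table in a dedicated SQL pool might look like this sketch (names and boundary values are hypothetical):&lt;/p&gt;

```sql
-- Hash-distributed fact table with a clustered columnstore index,
-- partitioned on a date key using range-right boundaries
CREATE TABLE dbo.FactOrders
(
    OrderDateKey INT NOT NULL,
    CustomerId   INT NOT NULL,
    Amount       DECIMAL(18,2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(CustomerId),
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES
                (20230101, 20230401, 20230701, 20231001) )
);
```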

&lt;p&gt;&lt;strong&gt;Identify when partitioning is needed in Azure Data Lake Storage Gen2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Blob Storage

&lt;ul&gt;
&lt;li&gt;General-purpose storage accounts support block, append, and page blobs&lt;/li&gt;
&lt;li&gt;Account --&amp;gt; Container --&amp;gt; Blob&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Partition key identification

&lt;ul&gt;
&lt;li&gt;Azure Storage serves data from a single partition faster than data spread across multiple partitions&lt;/li&gt;
&lt;li&gt;Partitioning is used to improve read performance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Naming blobs correctly is critical&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Blob storage uses a range-based partitioning scheme&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition key is combo of Account + Container + Blob&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Blob storage orders blobs lexically, so names that share a prefix (such as timestamps) land on the same partition&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best practices

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid slowly changing timestamp prefixes&lt;/strong&gt; (yyyymmdd), which funnel writes to a single partition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Name based upon likely queries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid latency-causing partitioning&lt;/strong&gt; (use blob size &amp;gt;256 KB, use hashing functions)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203: Data Engineering on Microsoft Azure - Study Guide</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 17:41:27 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63</link>
      <guid>https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63</guid>
      <description>&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/exams/dp-203/"&gt;Exam link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/resources/study-guides/dp-203#skills-measured-as-of-november-2-2023"&gt;Microsoft's official study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills Measured&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design and implement data storage (15-20%)&lt;/li&gt;
&lt;li&gt;Develop data processing (40-45%)&lt;/li&gt;
&lt;li&gt;Secure, monitor, and optimize data storage and data processing (30-35%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Design and implement data storage (15-20%)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-implement-a-partition-strategy-4fkj"&gt;Implement a partition strategy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-design-and-implement-the-data-exploration-layer-40pl"&gt;Design and implement the data exploration layer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Develop data processing (40-45%)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-ingest-and-transform-data-108a"&gt;Ingest and transform data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-develop-a-batch-processing-solution-4ehi"&gt;Develop a batch processing solution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-develop-a-stream-processing-solution-3h26"&gt;Develop a stream processing solution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-manage-batches-and-pipelines-3e66"&gt;Manage batches and pipelines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secure, monitor, and optimize data storage and data processing (30-35%)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-implement-data-security-24pa"&gt;Implement data security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-monitor-data-storage-and-data-processing-44no"&gt;Monitor data storage and data processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-optimize-and-troubleshoot-data-storage-and-data-processing-2hbj"&gt;Optimize and troubleshoot data storage and data processing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>AZ-400: Design and implement an authentication strategy</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Wed, 28 Dec 2022 00:43:50 +0000</pubDate>
      <link>https://dev.to/aidutcher/az-400-design-and-implement-an-authentication-strategy-3b7b</link>
      <guid>https://dev.to/aidutcher/az-400-design-and-implement-an-authentication-strategy-3b7b</guid>
      <description>&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/devops/integrate/get-started/authentication/authentication-guidance?view=azure-devops" rel="noopener noreferrer"&gt;Guidance for authentication&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different authentication types work best with different application types&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/devops/organizations/accounts/use-personal-access-tokens-to-authenticate?toc=%2Fazure%2Fdevops%2Fmarketplace-extensibility%2Ftoc.json&amp;amp;view=azure-devops&amp;amp;tabs=Windows" rel="noopener noreferrer"&gt;Personal Access Token (PAT)&lt;/a&gt; 

&lt;ul&gt;
&lt;li&gt;identifies you, your accessible organizations, and your scopes of access&lt;/li&gt;
&lt;li&gt;should be treated and used like a password&lt;/li&gt;
&lt;li&gt;also used to configure the cross-platform CLI&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://learn.microsoft.com/en-us/azure/devops/integrate/get-started/authentication/oauth?toc=%2Fazure%2Fdevops%2Fmarketplace-extensibility%2Ftoc.json&amp;amp;view=azure-devops" rel="noopener noreferrer"&gt;OAuth&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;useful for authenticating apps for REST API access&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Avoid &lt;a href="https://learn.microsoft.com/en-us/azure/devops/integrate/get-started/authentication/iis-basic-auth?toc=%2Fazure%2Fdevops%2Fmarketplace-extensibility%2Ftoc.json&amp;amp;view=azure-devops" rel="noopener noreferrer"&gt;IIS Basic Authentication&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;prevents use of PATs&lt;/li&gt;
&lt;li&gt;breaks Git, because it requires PATs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>productivity</category>
      <category>react</category>
      <category>kendoreact</category>
      <category>ai</category>
    </item>
    <item>
      <title>AZ-400: Configure release documentation</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Wed, 28 Dec 2022 00:24:56 +0000</pubDate>
      <link>https://dev.to/aidutcher/az-400-configure-release-documentation-3p7n</link>
      <guid>https://dev.to/aidutcher/az-400-configure-release-documentation-3p7n</guid>
      <description>&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/samples/azure-samples/azure-devops-release-notes/azure-devops-release-notes-generator/" rel="noopener noreferrer"&gt;Azure DevOps Release Notes Generator&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Release notes can be automatically generated upon a new release&lt;/li&gt;
&lt;li&gt;These notes can refer to work items and commits associated with the release&lt;/li&gt;
&lt;li&gt;They can be stored as markdown files in a dedicated storage account&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
    </item>
  </channel>
</rss>
