Alec Dutcher

DP-203 Study Guide - Ingest and transform data

Design and implement incremental loads

  • Watermarks
    • Column in source table with last updated time stamp or incrementing key
    • Marks the most recent update in the table
  • Delta loading
    • Essentially the same as incremental loading
    • Only new or changed data is processed, whether loading, transforming, etc
  • 4 basic design options
    • Delta loading using a watermark
    • Delta loading from SQL DB using change tracking technology
    • Loading new and changed files only using LastModifiedDate
    • Loading new files only using a partitioned folder or file name
  • Considerations
    • Volume and type of data
    • Load on system
    • Steps (sketched in T-SQL below)
    • Query to get old watermark
    • Query to get new watermark
    • Load data between watermarks
    • Update watermark
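
A minimal T-SQL sketch of the four steps above, assuming a hypothetical dbo.WatermarkTable that stores the last watermark and a dbo.SalesOrders source with a LastModifiedTime column (in ADF/Synapse pipelines these queries typically run as Lookup and Copy activities):

```sql
-- Watermark-based delta load (all table and column names are assumptions)

-- 1. Get the old watermark stored from the previous run
DECLARE @OldWatermark datetime2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SalesOrders');

-- 2. Get the new watermark from the source table
DECLARE @NewWatermark datetime2 =
    (SELECT MAX(LastModifiedTime) FROM dbo.SalesOrders);

-- 3. Load only the rows that changed between the two watermarks
SELECT *
FROM dbo.SalesOrders
WHERE LastModifiedTime >  @OldWatermark
  AND LastModifiedTime <= @NewWatermark;

-- 4. Persist the new watermark for the next run
UPDATE dbo.WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'SalesOrders';
```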

Transform data by using Apache Spark

  • Apache Spark
    • Can be used in Synapse, Databricks, and Data Factory
    • Ecosystem
      • Apache Spark Core
        • Basic functionalities (task scheduling, memory management)
        • Can be abstracted through APIs
        • APIs are available in R, Python, Scala, and Java
      • Spark SQL - similar to standard SQL but allows queries on data in Spark
      • Spark Streaming
      • MLlib
      • GraphX
  • More about Spark architecture
    • Spark core: RDDs and languages
    • Spark SQL engine: Catalyst optimizer, Tungsten (memory/CPU mgmt)
    • DataFrame/Dataset APIs
    • Spark Graph, Spark ML, Spark Streaming, Spark SQL
  • Azure Synapse notebooks in the portal
    • Develop on the left-side panel
    • Click +, then Notebook
    • Must have Spark pool attached before running a notebook
      • Go to Manage in left-side panel
      • Analytics pools --> Apache Spark pools --> choose name and settings --> Review and create
    • Write and execute code in cells as in a typical notebook (a sample Spark SQL cell is sketched after this section)
    • Click + --> Browse gallery --> Notebooks to see example notebooks
  • For the exam, know the basics of Synapse notebooks; architecture questions are more likely to focus on keywords than on fine detail
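
As one example of writing code in a notebook cell, a Synapse notebook can use the %%sql magic to run a Spark SQL transformation like the sketch below (the raw_sales table and its columns are assumptions):

```sql
%%sql
-- Aggregate a raw table into a daily summary with Spark SQL
-- (raw_sales and its columns are assumed for illustration)
CREATE OR REPLACE TEMP VIEW daily_sales AS
SELECT
    CAST(order_date AS DATE) AS order_day,
    COUNT(*)                 AS order_count,
    SUM(amount)              AS total_amount
FROM raw_sales
GROUP BY CAST(order_date AS DATE);
```

The same transformation could be written with the DataFrame API in PySpark or Scala; Spark SQL is shown here because it mirrors the Spark SQL component described above.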

Transform data by using Transact-SQL (T-SQL) in Azure Synapse Analytics

  • Transact-SQL
    • For querying data in a data lake
    • Uses serverless SQL pools
    • Query data without loading it into database storage
    • Standard formats are CSV, JSON, and Parquet
    • Useful for OLAP
  • In the portal
    • Develop --> New --> SQL Script
    • [FROM] OPENROWSET
      • Use instead of defining a table
      • Mimics the properties of a table, but uses data lake object as a source
      • Choose the file URL, format, and parser version (for CSV); see the example below
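
A minimal serverless SQL pool sketch of the OPENROWSET pattern, assuming a hypothetical storage account and Parquet folder path (for CSV you would use FORMAT = 'CSV' and set PARSER_VERSION):

```sql
-- Query files in the data lake directly, without loading them into a database
-- (the storage URL and folder path are assumptions)
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [sales];
```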

Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory

  • Common data ingestion pipelines
    • Azure Functions
      • Low latency
      • Serverless compute
      • Short run processing (only designed to run for short periods of time)
    • Custom component
      • Large-scale parallel computing
      • Heavy algorithms
      • Requires wrapping code into an executable (more complex)
    • Azure Databricks
      • Apache Spark, designed for massive and complex data transformations
      • Expensive and complicated
    • Azure Data Factory
      • Suitable for light transformation
      • Can include above methods as activities
  • Copy performance
    • Performance chart
      • Shows how long a copy will take based on amount of data and bandwidth
      • Can help with assessing costs of running pipelines
  • In the portal
    • Most work is done in Author section in left-side panel
    • Under Factory Resources there are pipelines, datasets, etc
    • Linked services are not shown separately here; they are configured through datasets (and under Manage)
    • Under Datasets, click + to add Dataset
      • Choose Service
      • Name Dataset and select Linked service
      • If you choose New service, input connection details, including subscription, server, database, authentication, etc
      • Select dataset from the linked service (table name, file, etc)
    • Under Datasets you can view and preview the dataset
    • Click + to add a new pipeline
      • Select an activity, i.e. Copy data
      • In the activity settings at the bottom, choose source, sink, copy behavior, and other settings
    • Dataflows allow you to set up transformations within ADF
      • These dataflows can be included as activities in the pipeline
  • Differences between ADF and Synapse
    • ADF has
      • Azure-SSIS integration runtime (lift and shift SSIS packages)
      • Power Query data wrangling
    • Synapse has
      • Monitoring Spark jobs
    • Both have
      • Solution templates (ADF template gallery, Synapse knowledge center)
      • Git integration
  • ADF/Synapse portal differences
    • ADF has
      • Home
      • Author - pipelines, datasets, data flows, Power Query, and templates
      • Monitor - dashboards for pipeline/trigger runs, integration runtimes, data flow debug, alerts/metrics
      • Manage
      • Learning center
    • Synapse has
      • Home
      • Data - SQL/Lake database, external datasets, and integration datasets
      • Develop - SQL scripts, notebooks, data flows
      • Integrate - pipelines, Synapse Link connections
      • Monitor - pools, requests, Spark, pipeline/trigger runs, integration runtimes, Link connections
      • Manage

Transform data by using Azure Stream Analytics

  • Azure Stream Analytics
    • Only for streaming solutions, not batch
    • Input can be Blob Storage, Event Hubs, or IoT Hub
    • These inputs feed the query layer, where transformations happen
    • The query outputs to a sink such as Blob Storage or Power BI
  • Queries
    • SELECT * INTO output FROM input
    • Choose specific columns, WHERE clauses, aggregations, etc (a windowed example follows below)
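
Building on the SELECT ... INTO ... FROM pattern above, a sketch of a Stream Analytics query that aggregates over a tumbling window (the input/output aliases, column names, and timestamp column are assumptions):

```sql
-- Aggregate events into 5-minute tumbling windows per device
-- (input/output names and columns are assumptions)
SELECT
    DeviceId,
    COUNT(*)         AS EventCount,
    AVG(Temperature) AS AvgTemperature
INTO [blob-output]
FROM [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```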

Cleanse data

  • Process overview
    • Investigate the data
    • Perform cleaning steps (unique to data set)
    • Evaluate the results
      • Validity (does it match business rules?)
      • Accuracy
      • Completeness
      • Consistency (is there conflicting data?)
      • Uniformity (are data points using same units of measure?)
  • Common tools
    • ADF, Synapse (almost identical for this purpose)
    • Azure Stream Analytics (cleaning streaming data can be harder)
    • Databricks (more complicated, but versatile and useful for massive data)
  • In the portal (ADF)
    • Create a Data Flow, choose sources
    • Preview data to see which fields can join data
    • Consider how columns can be filtered or removed to provide value or remove extraneous data
    • Once cleansing is done, choose sink

Handle duplicate data

  • Dedupe = eliminate unnecessary copies
    • Consider technology knowledge
    • Consider complexity
    • Consider accompanying solutions (SQL queries, ADF data flows, Spark, etc); a T-SQL approach is sketched below
  • Basic steps (in ADF)
    • Create data flow
    • Choose source
    • Choose script snippet (scroll symbol in top right of editor, snippets can be found on Microsoft Learn)
    • Choose destination
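
As an alternative to the data flow snippet, duplicates can also be removed in T-SQL with a window function; a minimal sketch, assuming a hypothetical staging.Orders table keyed by CustomerId and OrderId with a LastModifiedTime column:

```sql
-- Keep only the most recent copy of each (CustomerId, OrderId) pair
-- (table, key, and timestamp columns are assumptions)
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerId, OrderId
               ORDER BY LastModifiedTime DESC
           ) AS rn
    FROM staging.Orders
)
SELECT *
FROM Ranked
WHERE rn = 1;
```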

Avoiding duplicate data by using Azure Stream Analytics Exactly Once Delivery

Handle missing data

  • Determine impact of missing data, sometimes it won't be a big deal
  • Options for handling missing data (sketched in T-SQL below)
    • Drop rows that have the missing data
    • Imputation = assign an inferred value to the missing element
    • Include the rows that are missing data
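
A minimal T-SQL sketch of the drop and imputation options, assuming a hypothetical staging.Orders table:

```sql
-- Drop rows missing a required key; impute defaults for optional columns
-- (table and column names are assumptions)
SELECT
    OrderId,
    COALESCE(Quantity, 0)       AS Quantity,   -- impute a numeric default
    COALESCE(Region, 'Unknown') AS Region      -- impute a placeholder value
FROM staging.Orders
WHERE OrderId IS NOT NULL;                     -- drop rows missing the key
```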

Handle late-arriving data

  • Definitions
    • Event time = when original event happened (order is given to waiter)
    • Processing time = when event is observed (waiter gives order to kitchen)
    • Watermark = stamp identifying when event has been ingressed into system
  • Handle late arriving data by choosing a level of tolerance
  • Consequences of tolerance
    • Tolerance = window considered acceptable for late arrival
    • Critical events can be missed without proper tolerance
    • Delayed outputs can result in broken processes or bad reports

Split data

  • Splitting data allows making paths to multiple sinks from the same source
  • Conditional splits
    • Route data to different outputs
    • Available in ADF and Synapse
    • Steps
      • Create data flow
      • Use conditional split transformation
      • Set split conditions
    • Data flow scripts
      • Can use scripts to do the steps above (a T-SQL analogue of the split follows below)
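
The conditional split itself is an ADF/Synapse data flow transformation, but the idea can be illustrated with a T-SQL analogue that routes rows to different sinks by condition (table names and the Country condition are assumptions):

```sql
-- Route rows matching the condition to one sink...
INSERT INTO curated.DomesticOrders
SELECT * FROM staging.Orders
WHERE Country = 'US';

-- ...and all remaining rows to another
INSERT INTO curated.InternationalOrders
SELECT * FROM staging.Orders
WHERE Country <> 'US' OR Country IS NULL;
```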

Shred JSON

  • Shredding JSON = extracting data from a JSON file and transferring to a table (aka parsing)
  • Done in Synapse or ADF
  • Once data is extracted it is persisted to a data store
  • OPENJSON function
    • Table-valued function that parses JSON text
    • Returns objects and properties as rows and columns (see the sketch below)
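
A minimal T-SQL sketch of OPENJSON with an explicit schema (the JSON shape and column names are assumptions):

```sql
-- Shred a JSON array into rows and typed columns
DECLARE @json nvarchar(max) = N'[
  {"orderId": 1, "customer": "Contoso",  "total": 25.50},
  {"orderId": 2, "customer": "Fabrikam", "total": 10.00}
]';

SELECT orderId, customer, total
FROM OPENJSON(@json)
WITH (
    orderId  int           '$.orderId',
    customer nvarchar(100) '$.customer',
    total    decimal(10,2) '$.total'
);
```

The result set can then be inserted into a table to persist it, as described above.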

Encode and decode data

  • UTF-8
    • Unicode Transformation Format (8-bit)
    • The ASCII problem
      • Assigns a numeric code to every character (only 256 possibilities)
      • As computing expanded globally, the available character codes ran out
      • UTF-8 provides more character possibilities
  • Program must understand UTF-8 codes in order to decode information
  • There are multiple encoding formats, so the source and sink must use the same encoding
  • Done in ADF and Synapse copy activities
  • In the portal
    • Can choose encoding and compression properties in the Dataset properties

Configure error handling for a transformation

  • Options for error handling
    • Transaction commit: choose whether to write data in a single transaction or in batches
    • Output rejected data: log error rows in a CSV in Azure Storage
    • Success on error: mark the activity as successful even if errors occur
  • In the portal (ADF)
    • In an activity's settings, fault tolerance represents a form of success on error, continuing past incompatible data
    • Enable logging to store files that show rejected rows
    • Enable staging allows for copying in batches
    • On the right side of the activity there are buttons for "on success," "on failure," etc.
      • Connect these to other activities to choose how pipeline errors are handled
    • In a data flow database sink, there is an Errors tab to configure error handling

Normalize and denormalize data

  • Normalizing data = reorganizing to remove unstructured or redundant data
  • Denormalizing data = adding redundant data to one or more tables
  • What and why
    • Normalizing
      • More tables
      • Requires multiple joins
      • Improves accuracy and integrity
    • Denormalizing
      • More space
      • More difficult to maintain
      • Improves query performance
  • Star schema is not normalized
  • Snowflake schema is normalized
  • In the portal (Synapse)
    • Inspect the data sources to determine normalization status and identify join columns
    • Use a join transformation to combine data sources for denormalization (a T-SQL equivalent is sketched below)
    • Use a conditional split or a select transformation to normalize
    • Transformations can also be done in the script editor
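
The effect of the join transformation can also be expressed in T-SQL; a minimal sketch of denormalizing a fact table with one of its dimensions (table and column names are assumptions):

```sql
-- Produce one wide, denormalized result by joining fact and dimension tables
SELECT
    f.OrderId,
    f.OrderDate,
    f.Amount,
    c.CustomerName,
    c.Region
FROM fact.Sales AS f
JOIN dim.Customer AS c
    ON f.CustomerKey = c.CustomerKey;
```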

Perform data exploratory analysis

  • Use summary statistics and visualizations to investigate patterns and anomalies in data
  • Can be done with SQL, Python, or Kusto queries in Azure Data Explorer (example query below)
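
A minimal T-SQL sketch of the kind of summary statistics used for exploration, assuming a hypothetical curated.Sales table:

```sql
-- Profile row counts, ranges, and missing values in one pass
SELECT
    COUNT(*)                   AS row_count,
    COUNT(DISTINCT CustomerId) AS distinct_customers,
    MIN(Amount)                AS min_amount,
    MAX(Amount)                AS max_amount,
    AVG(Amount)                AS avg_amount,
    SUM(CASE WHEN Amount IS NULL THEN 1 ELSE 0 END) AS null_amounts
FROM curated.Sales;
```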
