Design and implement incremental loads
- Watermarks
- Column in the source table with a last-updated timestamp or an incrementing key
- Marks the most recent update in the table
- Delta loading
- Essentially the same as incremental loading
- Only new or changed data is processed, whether loading, transforming, etc
- 4 basic design options
- Delta loading using a watermark
- Delta loading from SQL DB using change tracking technology
- Loading new and changed files only using LastModifiedDate
- Loading new files only using a partitioned folder or file name
- Considerations
- Volume and type of data
- Load on system
- Steps
- Query to get old watermark
- Query to get new watermark
- Load data between watermarks
- Update watermark
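
A minimal T-SQL sketch of the watermark pattern above; the table and column names (dbo.WatermarkTable, dbo.Sales, LastModified) are hypothetical placeholders.

```sql
-- Watermark-based delta load (table/column names are hypothetical)

-- 1. Old watermark stored from the previous run
DECLARE @OldWatermark DATETIME2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Sales');

-- 2. New watermark = latest change in the source
DECLARE @NewWatermark DATETIME2 =
    (SELECT MAX(LastModified) FROM dbo.Sales);

-- 3. Load only the rows that changed between the two watermarks
SELECT *
FROM dbo.Sales
WHERE LastModified > @OldWatermark
  AND LastModified <= @NewWatermark;

-- 4. Persist the new watermark for the next run
UPDATE dbo.WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'Sales';
```

In ADF this pattern is typically spread across Lookup activities (steps 1-2), a Copy activity (step 3), and a Stored Procedure activity (step 4).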
Transform data by using Apache Spark
- Apache Spark
- Can be used in Synapse, Databricks, and Data Factory
- Ecosystem
- Apache Spark Core
- Basic functionalities (task scheduling, memory management)
- Can be abstracted through APIs
- APIs available in R, Python, Scala, and Java
- Spark SQL: run queries on data in Spark using syntax similar to standard SQL (see the sketch after this list)
- Spark Streaming
- MLlib
- GraphX
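
As a quick illustration of the Spark SQL point above, here is a query you might run in a Synapse notebook SQL cell; the demo.sales table is hypothetical.

```sql
-- Spark SQL over a hypothetical demo.sales table, e.g. in a notebook SQL cell
SELECT region,
       SUM(amount) AS total_sales
FROM demo.sales
GROUP BY region
ORDER BY total_sales DESC;
```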
- Apache Spark architecture
- Spark core: RDDs and languages
- Spark SQL engine: Catalyst optimizer, Tungsten (memory/CPU mgmt)
- DataFrame/Dataset APIs
- Spark Graph, Spark ML, Spark Streaming, Spark SQL
- Azure Synapse notebooks in the portal
- Go to Develop in the left-side panel
- Click +, then Notebook
- Must have a Spark pool attached before running a notebook
- Go to Manage in left-side panel
- Analytics pools --> Apache Spark pools --> choose name and settings --> Review and create
- Write and execute code in cells like a typical notebook
- Click + --> Browse gallery --> Notebooks to see example notebooks
- For the exam, know the basics of Synapse notebooks; Synapse architecture questions are more likely to test keywords than deep detail
Transform data by using Transact-SQL (T-SQL) in Azure Synapse Analytics
- Transact-SQL
- For querying data in a data lake
- Uses serverless SQL pools
- Query data without loading it into database storage
- Standard formats are CSV, JSON, and Parquet
- Useful for OLAP
- In the portal
- Develop --> New --> SQL Script
- [FROM] OPENROWSET
- Use instead of defining a table
- Mimics the properties of a table, but uses a data lake object as the source
- Choose the file URL, format, and parser version (for CSV)
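
A sketch of what an OPENROWSET query looks like from a serverless SQL pool; the storage URL and file paths are placeholders.

```sql
-- Query files in the data lake from a serverless SQL pool (storage URL is a placeholder)
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/data/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS sales;

-- CSV also needs a parser version (and usually header settings)
SELECT *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/data/customers.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
     ) AS customers;
```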
Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory
- Common technology options for ingestion and transformation pipelines
- Azure Functions
- Low latency
- Serverless compute
- Short-running processing (only designed to run for short periods)
- Custom component
- Large-scale parallel computing
- Heavy algorithms
- Requires wrapping code into an executable (more complex)
- Azure Databricks
- Apache Spark, designed for massive and complex data transformations
- Expensive and complicated
- Azure Data Factory
- Suitable for light transformation
- Can include above methods as activities
- Copy performance
- Performance chart
- Shows how long a copy will take based on amount of data and bandwidth
- Can help with assessing costs of running pipelines
- In the portal
- Most work is done in Author section in left-side panel
- Under Factory Resources there are pipelines, datasets, etc
- Linked services aren't listed here; they are managed under the Manage hub, and datasets reference them
- Under Datasets, click + to add Dataset
- Choose Service
- Name Dataset and select Linked service
- If you choose New service, input connection details, including subscription, server, database, authentication, etc
- Select dataset from the linked service (table name, file, etc)
- Under Datasets you can view and preview the dataset
- Click + to add a new pipeline
- Select an activity, e.g. Copy data
- In the activity settings at the bottom, choose source, sink, copy behavior, and other settings
- Data flows allow you to set up transformations within ADF
- These dataflows can be included as activities in the pipeline
- Differences between ADF and Synapse
- ADF has
- Cross-region integration runtime
- Runtime sharing
- Power Query activity
- Global parameters
- Synapse has
- Monitoring Spark jobs
- Both have
- Solution templates (ADF template gallery, Synapse knowledge center)
- Git integration
- ADF/Synapse portal differences
- ADF has
- Home
- Author - pipelines, datasets, data flows, Power Query, and templates
- Monitor - dashboards for pipeline/trigger runs, integration runtimes, data flow debug, alerts/metrics
- Manage
- Learning center
- Synapse has
- Home
- Data - SQL/Lake database, external datasets, and integration datasets
- Develop - SQL scripts, notebooks, data flows
- Integrate - pipelines, Synapse Link connections
- Monitor - pools, requests, Spark, pipeline/trigger runs, integration runtimes, Link connections
- Manage
Transform data by using Azure Stream Analytics
- Azure Stream Analytics
- Only for streaming solutions, not batch
- Input can be Blob Storage, Event Hubs, or IoT Hub
- These feed into the query layer, where transformations happen
- The query outputs to sinks such as Blob Storage or Power BI
- Queries
- SELECT * INTO output FROM input
- Select specific columns, add WHERE clauses, aggregations, etc (see the example below)
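
An illustrative Stream Analytics query building on the SELECT...INTO...FROM pattern above; the input/output aliases and the EventEnqueuedUtcTime column are hypothetical placeholders.

```sql
-- Aggregate events per device over 5-minute tumbling windows and route them to an output
-- (input/output aliases and column names are placeholders)
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEnd
INTO [powerbi-output]
FROM [iothub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```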
Cleanse data
- Process overview
- Investigate the data
- Perform cleaning steps (unique to data set)
- Evaluate the results
- Validity (does it match business rules?)
- Accuracy
- Completeness
- Consistency (is there conflicting data?)
- Uniformity (are data points using same units of measure?)
- Common tools
- ADF, Synapse (almost identical for this purpose)
- Azure Stream Analytics (can be harder to clean)
- Databricks (more complicated, but versatile and useful for massive data)
- In the portal (ADF)
- Create a Data Flow, choose sources
- Preview the data to see which fields can be used to join the sources
- Consider how columns can be filtered or removed to provide value or remove extraneous data
- Once cleansing is done, choose sink
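
To make the cleansing checklist above concrete, here is a small T-SQL sketch; the staging.Customers table and its columns are hypothetical.

```sql
-- Illustrative cleansing pass (staging.Customers and its columns are hypothetical)
SELECT
    CustomerId,
    LTRIM(RTRIM(Email))   AS Email,        -- uniformity: strip stray whitespace
    UPPER(CountryCode)    AS CountryCode,  -- consistency: one casing convention
    WeightLb * 0.453592   AS WeightKg      -- uniformity: convert to a single unit
FROM staging.Customers
WHERE Email IS NOT NULL                    -- completeness: drop rows missing a required field
  AND CountryCode IS NOT NULL;
```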
Handle duplicate data
- Dedupe = eliminate unnecessary copies
- Consider how well the team knows the technology
- Consider complexity
- Consider accompanying solutions (SQL queries, ADF data flows, Spark, etc)
- Basic steps (in ADF)
- Create data flow
- Choose source
- Choose script snippet (scroll symbol in top right of editor, snippets can be found on Microsoft Learn)
- Choose destination
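
Since SQL queries are listed above as one accompanying solution, here is a minimal T-SQL dedupe sketch using ROW_NUMBER; the table and key columns are hypothetical.

```sql
-- Dedupe with ROW_NUMBER (staging.Orders and its key columns are hypothetical):
-- keep the most recent copy of each business key and discard the rest
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY OrderId           -- columns that define a duplicate
               ORDER BY LastModified DESC     -- keep the newest copy
           ) AS rn
    FROM staging.Orders
)
SELECT *
FROM Ranked
WHERE rn = 1;
```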
Avoid duplicate data by using Azure Stream Analytics Exactly Once Delivery
Handle missing data
- Determine impact of missing data, sometimes it won't be a big deal
- Options of handling missing data
- Drop rows that have the missing data
- Imputation = assign an inferred value to the missing element
- Include the rows that are missing data
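
Quick T-SQL sketches of the three options above; staging.Readings and its columns are hypothetical.

```sql
-- 1. Drop rows that are missing a required value
SELECT *
FROM staging.Readings
WHERE Temperature IS NOT NULL;

-- 2. Imputation: substitute an inferred value (here, the device's average)
SELECT
    DeviceId,
    ReadingTime,
    COALESCE(Temperature, AVG(Temperature) OVER (PARTITION BY DeviceId)) AS Temperature
FROM staging.Readings;

-- 3. Keep the rows as-is and let downstream consumers handle the gaps
```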
Handle late-arriving data
- Definitions
- Event time = when original event happened (order is given to waiter)
- Processing time = when event is observed (waiter gives order to kitchen)
- Watermark = a stamp identifying when an event was ingested into the system
- Handle late arriving data by choosing a level of tolerance
- Consequences of tolerance
- Tolerance = window considered acceptable for late arrival
- Critical events can be missed without proper tolerance
- Delayed outputs can result in broken processes or bad reports
Split data
- Splitting data creates paths from the same source to multiple sinks
- Conditional splits
- Route data to different outputs
- Available in ADF and Synapse
- Steps
- Create data flow
- Use conditional split transformation
- Set split conditions
- Data flow scripts
- Can use scripts to do the steps above
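
The conditional split itself is configured in the data flow, but conceptually it routes rows the way this T-SQL sketch does; the tables are hypothetical.

```sql
-- Conceptual T-SQL equivalent of a conditional split (tables are hypothetical):
-- each condition routes matching rows to a different sink
INSERT INTO dbo.Orders_Domestic
SELECT * FROM staging.Orders WHERE CountryCode = 'US';

INSERT INTO dbo.Orders_International
SELECT * FROM staging.Orders WHERE CountryCode <> 'US';
```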
Shred JSON
- Shredding JSON = extracting data from a JSON file and loading it into a table (aka parsing)
- Done in Synapse or ADF
- Once data is extracted it is persisted to a data store
- OPENJSON function
- Table-valued function that parses JSON text
- Returns objects and properties as rows and columns
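
A minimal OPENJSON example with an explicit schema; the JSON sample and column names are made up.

```sql
-- Shred a JSON array into rows and columns (sample data is made up)
DECLARE @json NVARCHAR(MAX) = N'[
  { "orderId": 1, "customer": "Contoso",  "total": 42.50 },
  { "orderId": 2, "customer": "Fabrikam", "total": 17.90 }
]';

SELECT OrderId, Customer, Total
FROM OPENJSON(@json)
WITH (
    OrderId  INT           '$.orderId',
    Customer NVARCHAR(100) '$.customer',
    Total    DECIMAL(10,2) '$.total'
);
```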
Encode and decode data
- UTF-8
- Unicode Transformation Format, 8-bit
- The ASCII problem
- Assigns a code to every character (only 256 possibilities)
- As programming expanded, the number of available characters ran out
- UTF-8 provides many more character possibilities
- A program must understand UTF-8 codes in order to decode the information
- There are multiple encoding formats, so the source and sink must use the same encoding
- Done in ADF and Synapse copy activities
- In the portal
- Can choose encoding and compression properties in the Dataset properties
Configure error handling for a transformation
- Options for error handling
- Transaction commit: choose whether to write data in a single transaction or in batches
- Output rejected data: log error rows in a CSV in Azure Storage
- Success on error: mark the activity as successful even if errors occur
- In the portal (ADF)
- In an activity's settings, fault tolerance represents a form of success on error, continuing past incompatible data
- Enable logging to store files that show rejected rows
- Enable staging allows for copying in batches
- On the right side of the activity there are buttons for "on success," "on failure," etc.
- Connect these to other activities to choose how pipeline errors are handled
- In a data flow database sink, there is an Errors tab to configure error handling
Normalize and denormalize data
- Normalizing data = reorganizing to remove unstructured or redundant data
- Denormalizing data = adding redundant data to one or more tables
- What and why
- Normalizing
- More tables
- Requires multiple joins
- Improves accuracy and integrity
- Denormalizing
- More space
- More difficult to maintain
- Improves query performance
- Schema examples
- Star schema is denormalized
- Snowflake schema is normalized
- In the portal (Synapse)
- Inspect the data sources to determine normalization status and identify join columns
- Use a join transformation to combine data sources for denormalization (see the T-SQL sketch below)
- Use a conditional split or a select transformation to normalize
- Transformations can also be done in the script editor
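
The join transformation above does conceptually what this T-SQL sketch does; FactSales and DimCustomer are hypothetical star-schema tables.

```sql
-- Denormalization via a join: copy dimension attributes onto the fact rows
-- so downstream reads avoid the join
SELECT
    f.SaleId,
    f.SaleDate,
    f.Amount,
    c.CustomerName,   -- pulled in from the dimension
    c.Region
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS c
    ON c.CustomerKey = f.CustomerKey;
```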
Perform data exploratory analysis
- Use summary statistics and visualizations to investigate patterns and anomalies in data
- Can be done with SQL, Python, or Kusto queries in Azure Data Explorer