Alec Dutcher

DP-203 Study Guide - Ingest and transform data

Design and implement incremental loads

  • Watermarks
    • Column in source table with last updated time stamp or incrementing key
    • Marks the most recent update in the table
  • Delta loading
    • Essentially the same as incremental loading
    • Only new or changed data is processed, whether loading, transforming, etc
  • 4 basic design options
    • Delta loading using a watermark
    • Delta loading from SQL DB using change tracking technology
    • Loading new and changed files only using LastModifiedDate
    • Loading new files only using a partitioned folder or file name
  • Considerations
    • Volume and type of data
    • Load on system
    • Steps (sketched in T-SQL below)
    • Query to get old watermark
    • Query to get new watermark
    • Load data between watermarks
    • Update watermark
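
A minimal T-SQL sketch of the four steps above, assuming a hypothetical dbo.WatermarkTable that stores the last watermark and a dbo.SalesOrders source with a LastModifiedTime column (in ADF/Synapse pipelines these queries typically run as Lookup and Copy activities):

```sql
-- Watermark-based delta load (all table and column names are assumptions)

-- 1. Get the old watermark stored from the previous run
DECLARE @OldWatermark datetime2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SalesOrders');

-- 2. Get the new watermark from the source table
DECLARE @NewWatermark datetime2 =
    (SELECT MAX(LastModifiedTime) FROM dbo.SalesOrders);

-- 3. Load only the rows that changed between the two watermarks
SELECT *
FROM dbo.SalesOrders
WHERE LastModifiedTime >  @OldWatermark
  AND LastModifiedTime <= @NewWatermark;

-- 4. Persist the new watermark for the next run
UPDATE dbo.WatermarkTable
SET WatermarkValue = @NewWatermark
WHERE TableName = 'SalesOrders';
```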

Transform data by using Apache Spark

  • Apache Spark
    • Can be used in Synapse, Databricks, and Data Factory
    • Ecosystem
      • Apache Spark Core
        • Basic functionalities (task scheduling, memory management)
        • Can be abstracted through APIs
        • APIs are available in R, Python, Scala, and Java
      • Spark SQL - similar to standard SQL but allows queries on data in Spark
      • Spark Streaming
      • MLlib
      • GraphX
  • More about Spark architecture
    • Spark core: RDDs and languages
    • Spark SQL engine: Catalyst optimizer, Tungsten (memory/CPU mgmt)
    • DataFrame/Dataset APIs
    • Spark Graph, Spark ML, Spark Streaming, Spark SQL
  • Azure Synapse notebooks in the portal
    • Develop on the left-side panel
    • Click +, then Notebook
    • Must have Spark pool attached before running a notebook
      • Go to Manage in left-side panel
      • Analytics pools --> Apache Spark pools --> choose name and settings --> Review and create
    • Write and execute code in cells as in a typical notebook (a sample Spark SQL cell is sketched after this section)
    • Click + --> Browse gallery --> Notebooks to see example notebooks
  • For the exam, know the basics of Synapse notebooks; architecture questions are more likely to focus on keywords than on fine detail
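
As one example of writing code in a notebook cell, a Synapse notebook can use the %%sql magic to run a Spark SQL transformation like the sketch below (the raw_sales table and its columns are assumptions):

```sql
%%sql
-- Aggregate a raw table into a daily summary with Spark SQL
-- (raw_sales and its columns are assumed for illustration)
CREATE OR REPLACE TEMP VIEW daily_sales AS
SELECT
    CAST(order_date AS DATE) AS order_day,
    COUNT(*)                 AS order_count,
    SUM(amount)              AS total_amount
FROM raw_sales
GROUP BY CAST(order_date AS DATE);
```

The same transformation could be written with the DataFrame API in PySpark or Scala; Spark SQL is shown here because it mirrors the Spark SQL component described above.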

Transform data by using Transact-SQL (T-SQL) in Azure Synapse Analytics

  • Transact-SQL
    • For querying data in a data lake
    • Uses serverless SQL pools
    • Query data without loading it into database storage
    • Standard formats are CSV, JSON, and Parquet
    • Useful for OLAP
  • In the portal
    • Develop --> New --> SQL Script
    • [FROM] OPENROWSET
      • Use instead of defining a table
      • Mimics the properties of a table, but uses data lake object as a source
      • Choose the file URL, format, and parser version (for CSV); see the example below
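
A minimal serverless SQL pool sketch of the OPENROWSET pattern, assuming a hypothetical storage account and Parquet folder path (for CSV you would use FORMAT = 'CSV' and set PARSER_VERSION):

```sql
-- Query files in the data lake directly, without loading them into a database
-- (the storage URL and folder path are assumptions)
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [sales];
```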

Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory

  • Common data ingestion pipelines
    • Azure Functions
      • Low latency
      • Serverless compute
      • Short run processing (only designed to run for short periods of time)
    • Custom component
      • Large-scale parallel computing
      • Heavy algorithms
      • Requires wrapping code into an executable (more complex)
    • Azure Databricks
      • Apache Spark, designed for massive and complex data transformations
      • Expensive and complicated
    • Azure Data Factory
      • Suitable for light transformation
      • Can include above methods as activities
  • Copy performance
    • Performance chart
      • Shows how long a copy will take based on amount of data and bandwidth
      • Can help with assessing costs of running pipelines
  • In the portal
    • Most work is done in Author section in left-side panel
    • Under Factory Resources there are pipelines, datasets, etc
    • Linked services are not shown separately here; they are configured through datasets (and under Manage)
    • Under Datasets, click + to add Dataset
      • Choose Service
      • Name Dataset and select Linked service
      • If you choose New service, input connection details, including subscription, server, database, authentication, etc
      • Select dataset from the linked service (table name, file, etc)
    • Under Datasets you can view and preview the dataset
    • Click + to add a new pipeline
      • Select an activity, i.e. Copy data
      • In the activity settings at the bottom, choose source, sink, copy behavior, and other settings
    • Dataflows allow you to set up transformations within ADF
      • These dataflows can be included as activities in the pipeline
  • Differences between ADF and Synapse
    • ADF has
      • Azure-SSIS integration runtime (lift and shift SSIS packages)
      • Power Query data wrangling
    • Synapse has
      • Monitoring Spark jobs
    • Both have
      • Solution templates (ADF template gallery, Synapse knowledge center)
      • Git integration
  • ADF/Synapse portal differences
    • ADF has
      • Home
      • Author - pipelines, datasets, data flows, Power Query, and templates
      • Monitor - dashboards for pipeline/trigger runs, integration runtimes, data flow debug, alerts/metrics
      • Manage
      • Learning center
    • Synapse has
      • Home
      • Data - SQL/Lake database, external datasets, and integration datasets
      • Develop - SQL scripts, notebooks, data flows
      • Integrate - pipelines, Synapse Link connections
      • Monitor - pools, requests, Spark, pipeline/trigger runs, integration runtimes, Link connections
      • Manage

Transform data by using Azure Stream Analytics

  • Azure Stream Analytics
    • Only for streaming solutions, not batch
    • Input can be Blob Storage, Event Hubs, or IoT Hub
    • These inputs feed the query layer, where transformations happen
    • The query outputs to a sink such as Blob Storage or Power BI
  • Queries
    • SELECT * INTO output FROM input
    • Choose specific columns, WHERE clauses, aggregations, etc (a windowed example follows below)
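
Building on the SELECT ... INTO ... FROM pattern above, a sketch of a Stream Analytics query that aggregates over a tumbling window (the input/output aliases, column names, and timestamp column are assumptions):

```sql
-- Aggregate events into 5-minute tumbling windows per device
-- (input/output names and columns are assumptions)
SELECT
    DeviceId,
    COUNT(*)         AS EventCount,
    AVG(Temperature) AS AvgTemperature
INTO [blob-output]
FROM [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```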

Cleanse data

  • Process overview
    • Investigate the data
    • Perform cleaning steps (unique to data set)
    • Evaluate the results
      • Validity (does it match business rules?)
      • Accuracy
      • Completeness
      • Consistency (is there conflicting data?)
      • Uniformity (are data points using same units of measure?)
  • Common tools
    • ADF, Synapse (almost identical for this purpose)
    • Azure Stream Analytics (cleaning streaming data can be harder)
    • Databricks (more complicated, but versatile and useful for massive data)
  • In the portal (ADF)
    • Create a Data Flow, choose sources
    • Preview data to see which fields can join data
    • Consider how columns can be filtered or removed to provide value or remove extraneous data
    • Once cleansing is done, choose sink

Handle duplicate data

  • Dedupe = eliminate unnecessary copies
    • Consider technology knowledge
    • Consider complexity
    • Consider accompanying solutions (SQL queries, ADF data flows, Spark, etc); a T-SQL approach is sketched below
  • Basic steps (in ADF)
    • Create data flow
    • Choose source
    • Choose script snippet (scroll symbol in top right of editor, snippets can be found on Microsoft Learn)
    • Choose destination
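
As an alternative to the data flow snippet, duplicates can also be removed in T-SQL with a window function; a minimal sketch, assuming a hypothetical staging.Orders table keyed by CustomerId and OrderId with a LastModifiedTime column:

```sql
-- Keep only the most recent copy of each (CustomerId, OrderId) pair
-- (table, key, and timestamp columns are assumptions)
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerId, OrderId
               ORDER BY LastModifiedTime DESC
           ) AS rn
    FROM staging.Orders
)
SELECT *
FROM Ranked
WHERE rn = 1;
```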

Avoiding duplicate data by using Azure Stream Analytics Exactly Once Delivery

Handle missing data

  • Determine impact of missing data, sometimes it won't be a big deal
  • Options for handling missing data (sketched in T-SQL below)
    • Drop rows that have the missing data
    • Imputation = assign an inferred value to the missing element
    • Include the rows that are missing data
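
A minimal T-SQL sketch of the drop and imputation options, assuming a hypothetical staging.Orders table:

```sql
-- Drop rows missing a required key; impute defaults for optional columns
-- (table and column names are assumptions)
SELECT
    OrderId,
    COALESCE(Quantity, 0)       AS Quantity,   -- impute a numeric default
    COALESCE(Region, 'Unknown') AS Region      -- impute a placeholder value
FROM staging.Orders
WHERE OrderId IS NOT NULL;                     -- drop rows missing the key
```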

Handle late-arriving data

  • Definitions
    • Event time = when original event happened (order is given to waiter)
    • Processing time = when event is observed (waiter gives order to kitchen)
    • Watermark = stamp identifying when event has been ingressed into system
  • Handle late arriving data by choosing a level of tolerance
  • Consequences of tolerance
    • Tolerance = window considered acceptable for late arrival
    • Critical events can be missed without proper tolerance
    • Delayed outputs can result in broken processes or bad reports

Split data

  • Splitting data allows making paths to multiple sinks from the same source
  • Conditional splits
    • Route data to different outputs
    • Available in ADF and Synapse
    • Steps
      • Create data flow
      • Use conditional split transformation
      • Set split conditions
    • Data flow scripts
      • Can use scripts to do the steps above (a T-SQL analogue of the split follows below)
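
The conditional split itself is an ADF/Synapse data flow transformation, but the idea can be illustrated with a T-SQL analogue that routes rows to different sinks by condition (table names and the Country condition are assumptions):

```sql
-- Route rows matching the condition to one sink...
INSERT INTO curated.DomesticOrders
SELECT * FROM staging.Orders
WHERE Country = 'US';

-- ...and all remaining rows to another
INSERT INTO curated.InternationalOrders
SELECT * FROM staging.Orders
WHERE Country <> 'US' OR Country IS NULL;
```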

Shred JSON

  • Shredding JSON = extracting data from a JSON file and transferring to a table (aka parsing)
  • Done in Synapse or ADF
  • Once data is extracted it is persisted to a data store
  • OPENJSON function
    • Table-valued function that parses JSON text
    • Returns objects and properties as rows and columns (see the sketch below)
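
A minimal T-SQL sketch of OPENJSON with an explicit schema (the JSON shape and column names are assumptions):

```sql
-- Shred a JSON array into rows and typed columns
DECLARE @json nvarchar(max) = N'[
  {"orderId": 1, "customer": "Contoso",  "total": 25.50},
  {"orderId": 2, "customer": "Fabrikam", "total": 10.00}
]';

SELECT orderId, customer, total
FROM OPENJSON(@json)
WITH (
    orderId  int           '$.orderId',
    customer nvarchar(100) '$.customer',
    total    decimal(10,2) '$.total'
);
```

The result set can then be inserted into a table to persist it, as described above.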

Encode and decode data

  • UTF-8
    • Unicode Transformation Format (8-bit)
    • The ASCII problem
      • Assigns a numeric code to every character (only 256 possibilities)
      • As computing expanded globally, the available character codes ran out
      • UTF-8 provides more character possibilities
  • Program must understand UTF-8 codes in order to decode information
  • There are multiple encoding formats, so the source and sink must use the same encoding
  • Done in ADF and Synapse copy activities
  • In the portal
    • Can choose encoding and compression properties in the Dataset properties

Configure error handling for a transformation

  • Options for error handling
    • Transaction commit: choose whether to write data in a single transaction or in batches
    • Output rejected data: log error rows in a CSV in Azure Storage
    • Success on error: mark the activity as successful even if errors occur
  • In the portal (ADF)
    • In an activity's settings, fault tolerance represents a form of success on error, continuing past incompatible data
    • Enable logging to store files that show rejected rows
    • Enable staging allows for copying in batches
    • On the right side of the activity there are buttons for "on success," "on failure," etc.
      • Connect these to other activities to choose how pipeline errors are handled
    • In a data flow database sink, there is an Errors tab to configure error handling

Normalize and denormalize data

  • Normalizing data = reorganizing to remove unstructured or redundant data
  • Denormalizing data = adding redundant data to one or more tables
  • What and why
    • Normalizing
      • More tables
      • Requires multiple joins
      • Improves accuracy and integrity
    • Denormalizing
      • More space
      • More difficult to maintain
      • Improves query performance
  • Star schema is not normalized
  • Snowflake schema is normalized
  • In the portal (Synapse)
    • Inspect the data sources to determine normalization status and identify join columns
    • Use a join transformation to combine data sources for denormalization (a T-SQL equivalent is sketched below)
    • Use a conditional split or a select transformation to normalize
    • Transformations can also be done in the script editor
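
The effect of the join transformation can also be expressed in T-SQL; a minimal sketch of denormalizing a fact table with one of its dimensions (table and column names are assumptions):

```sql
-- Produce one wide, denormalized result by joining fact and dimension tables
SELECT
    f.OrderId,
    f.OrderDate,
    f.Amount,
    c.CustomerName,
    c.Region
FROM fact.Sales AS f
JOIN dim.Customer AS c
    ON f.CustomerKey = c.CustomerKey;
```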

Perform data exploratory analysis

  • Use summary statistics and visualizations to investigate patterns and anomalies in data
  • Can be done with SQL, Python, or Kusto queries in Azure Data Explorer (example query below)
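
A minimal T-SQL sketch of the kind of summary statistics used for exploration, assuming a hypothetical curated.Sales table:

```sql
-- Profile row counts, ranges, and missing values in one pass
SELECT
    COUNT(*)                   AS row_count,
    COUNT(DISTINCT CustomerId) AS distinct_customers,
    MIN(Amount)                AS min_amount,
    MAX(Amount)                AS max_amount,
    AVG(Amount)                AS avg_amount,
    SUM(CASE WHEN Amount IS NULL THEN 1 ELSE 0 END) AS null_amounts
FROM curated.Sales;
```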
