<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alec Dutcher</title>
    <description>The latest articles on DEV Community by Alec Dutcher (@aidutcher).</description>
    <link>https://dev.to/aidutcher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F765361%2F9f549866-4a25-45fb-8b7c-07575c08e90a.jpeg</url>
      <title>DEV Community: Alec Dutcher</title>
      <link>https://dev.to/aidutcher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aidutcher"/>
    <language>en</language>
    <item>
      <title>DP-203 Study Guide - Optimize and troubleshoot data storage and data processing</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:21:09 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-optimize-and-troubleshoot-data-storage-and-data-processing-2hbj</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-optimize-and-troubleshoot-data-storage-and-data-processing-2hbj</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compact small files&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does it mean to compact small files?

&lt;ul&gt;
&lt;li&gt;Combine many small files into a single file&lt;/li&gt;
&lt;li&gt;Improves the speed of read queries&lt;/li&gt;
&lt;li&gt;Can be done with a Copy job in ADF/Synapse or during an incremental load&lt;/li&gt;
&lt;li&gt;Also available as a Delta Lake feature&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Using a Copy job

&lt;ul&gt;
&lt;li&gt;Source is the directory with all of the small files&lt;/li&gt;
&lt;li&gt;Select using a &lt;strong&gt;wildcard (/directory/*)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;Copy behavior&lt;/strong&gt; to merge the files&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Using Delta Lake

&lt;ul&gt;
&lt;li&gt;Use &lt;a href="https://learn.microsoft.com/en-us/azure/databricks/delta/optimize"&gt;&lt;strong&gt;OPTIMIZE feature&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Done via query in Spark SQL:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPTIMIZE delta.`/data/events`
OPTIMIZE delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/path-to-data`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Handle skew in data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skew 

&lt;ul&gt;
&lt;li&gt;An uneven distribution of data&lt;/li&gt;
&lt;li&gt;Data skew can unbalance compute nodes, lowering performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid by balancing parallel processing with correct table distribution&lt;/strong&gt; (hash or round-robin)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect skew&lt;/strong&gt; in a distributed table (database consistency check)

&lt;ul&gt;
&lt;li&gt;DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Resolve data skew

&lt;ul&gt;
&lt;li&gt;Research

&lt;ul&gt;
&lt;li&gt;Monitor query impact&lt;/li&gt;
&lt;li&gt;Weigh the cost of minimizing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Solution

&lt;ul&gt;
&lt;li&gt;Re-create table with a new distribution column set&lt;/li&gt;
&lt;li&gt;CREATE TABLE AS SELECT &lt;strong&gt;(CTAS)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
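&lt;p&gt;As a sketch, re-creating a skewed table with CTAS might look like the following (table and column names are illustrative, not from any particular schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: rebuild the table on a better-distributed hash column
CREATE TABLE dbo.FactInternetSales_New
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactInternetSales;

-- Swap the new table in once the copy is verified
RENAME OBJECT dbo.FactInternetSales TO FactInternetSales_Old;
RENAME OBJECT dbo.FactInternetSales_New TO FactInternetSales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;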

&lt;p&gt;&lt;strong&gt;Handle data spill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data spill occurs when the compute engine is unable to hold data in memory and writes ("spills") it to disk&lt;/li&gt;
&lt;li&gt;The impact is expensive disk reads/writes and longer execution times&lt;/li&gt;
&lt;li&gt;Occurs when

&lt;ul&gt;
&lt;li&gt;Partition size is too big&lt;/li&gt;
&lt;li&gt;Compute resource size is too small&lt;/li&gt;
&lt;li&gt;Data size during merges, unions, etc. exceeds the memory limit of the compute node&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Identifying data spill

&lt;ul&gt;
&lt;li&gt;Synapse SQL - TempDB runs out of space and throws error (monitor with DMVs)&lt;/li&gt;
&lt;li&gt;Spark - view task summary screen under spill column&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Handling the spill

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increase compute&lt;/strong&gt; capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce partition&lt;/strong&gt; size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove skews&lt;/strong&gt; in data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimize resource management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimize Synapse SQL Pools

&lt;ul&gt;
&lt;li&gt;Pause when not in use&lt;/li&gt;
&lt;li&gt;Use the right compute unit (DWU) for workload&lt;/li&gt;
&lt;li&gt;Leverage Azure Functions to scale out workload&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Optimize Spark

&lt;ul&gt;
&lt;li&gt;Select autoscale option in cluster setup&lt;/li&gt;
&lt;li&gt;Select auto-terminate&lt;/li&gt;
&lt;li&gt;Use spot instances&lt;/li&gt;
&lt;li&gt;Right-size cluster nodes based on whether workloads are memory- or CPU-intensive&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tune queries by using indexers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of indexes

&lt;ul&gt;
&lt;li&gt;Clustered columnstore index

&lt;ul&gt;
&lt;li&gt;Default in SQL pool table&lt;/li&gt;
&lt;li&gt;Use for tables &lt;strong&gt;&amp;gt; 100 million rows&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Good performance and data compression&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Clustered index

&lt;ul&gt;
&lt;li&gt;Good for specific filter conditions&lt;/li&gt;
&lt;li&gt;Use for tables &lt;strong&gt;between 100 and 100 million rows&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Heap index

&lt;ul&gt;
&lt;li&gt;Use for &lt;strong&gt;staging tables&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Maintain by rebuilding indexes when seeing performance degradation in existing indexes&lt;/li&gt;
&lt;li&gt;Indexes in Spark Pool

&lt;ul&gt;
&lt;li&gt;Spark does not have an inbuilt index&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Hyperspace&lt;/strong&gt; - the ability to create indexes on datasets (CSV, JSON, Parquet)&lt;/li&gt;
&lt;li&gt;Works via API&lt;/li&gt;
&lt;li&gt;Criteria

&lt;ul&gt;
&lt;li&gt;Query contains filters on predicates&lt;/li&gt;
&lt;li&gt;Contains a join that requires heavy shuffles&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tune queries by using cache&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caching stores frequently accessed data in memory or disk for faster retrieval&lt;/li&gt;
&lt;li&gt;Caching in Synapse SQL

&lt;ul&gt;
&lt;li&gt;Result set caching

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Off by default&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enabled at database or session level&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DB: ALTER DATABASE SET RESULT_SET_CACHING ON&lt;/li&gt;
&lt;li&gt;Session: SET RESULT_SET_CACHING { ON | OFF }&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Faster query performance&lt;/li&gt;
&lt;li&gt;Max size of &lt;strong&gt;1 TB per database&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Requirements

&lt;ul&gt;
&lt;li&gt;User running the query has access to tables used in the query&lt;/li&gt;
&lt;li&gt;Cached query and new query have to be an &lt;strong&gt;exact match&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No changes to the table's data or schema where cache was generated from&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Caching in Spark

&lt;ul&gt;
&lt;li&gt;RDD (resilient distributed dataset)&lt;/li&gt;
&lt;li&gt;DataFrame&lt;/li&gt;
&lt;li&gt;DataSets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache methods&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;.persist()&lt;/li&gt;
&lt;li&gt;.cache()&lt;/li&gt;
&lt;li&gt;CACHE TABLE&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
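&lt;p&gt;For example, a minimal Spark SQL sketch of table-level caching (the table name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Cache a table in memory for repeated reads
CACHE TABLE events;

-- Subsequent scans are served from the cache
SELECT COUNT(*) FROM events;

-- Release the cache when finished
UNCACHE TABLE events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;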

&lt;p&gt;&lt;strong&gt;Troubleshoot a failed Spark job&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debug the issue within the environment and within the job&lt;/li&gt;
&lt;li&gt;Environment

&lt;ul&gt;
&lt;li&gt;Confirm the region the cluster is in is not down (status.azure.com)&lt;/li&gt;
&lt;li&gt;Use HDInsight Ambari Dashboard to view cluster health&lt;/li&gt;
&lt;li&gt;Are clusters using high CPU or memory?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Jobs

&lt;ul&gt;
&lt;li&gt;Driver logs&lt;/li&gt;
&lt;li&gt;Task logs&lt;/li&gt;
&lt;li&gt;Executor logs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Troubleshoot a failed pipeline run, including activities executed in external services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Output section of pipeline details to see job status&lt;/li&gt;
&lt;li&gt;To the right of the failed message there are more error details&lt;/li&gt;
&lt;li&gt;Examine the detailed error message for failed activities&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Monitor data storage and data processing</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:20:01 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-monitor-data-storage-and-data-processing-44no</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-monitor-data-storage-and-data-processing-44no</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement logging used by Azure Monitor&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Monitor key features&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; - resource utilization, response time, etc&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt; - leverage Azure Log Analytics to store and query logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerts&lt;/strong&gt; - set up alerts based on specific metrics or log data&lt;/li&gt;
&lt;li&gt;Service maps, analytics insights, workbooks, and more&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Analytics Workspace&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Central repository and analytics engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collects, stores, and analyzes log data&lt;/strong&gt; and other telemetry&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnostic settings&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Feature of Azure Monitor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control and route diagnostic data&lt;/strong&gt; from Azure resources to various destinations&lt;/li&gt;
&lt;li&gt;Source (metrics, resource logs, activity logs)&lt;/li&gt;
&lt;li&gt;Destination (Event Hubs, Log Analytics, Azure Storage)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure monitoring services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What can be monitored in Azure Monitor?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt; (containers, OS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Platform&lt;/strong&gt; (resources, subscription, tenant)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom sources&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Configure Monitor for Azure Resources

&lt;ul&gt;
&lt;li&gt;Monitoring section of any Azure resource&lt;/li&gt;
&lt;li&gt;Select "Metrics"&lt;/li&gt;
&lt;li&gt;Choose scope, metric, visual type, etc&lt;/li&gt;
&lt;li&gt;Save as Azure Monitor workbook&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor stream processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor Stream Analytics jobs via Azure Monitor in the portal, PowerShell, or .NET SDK&lt;/li&gt;
&lt;li&gt;In the portal, select "Metrics" under the Monitoring section&lt;/li&gt;
&lt;li&gt;Can save metrics to a dashboard or send to Azure Monitor workbook&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Measure performance of data movement&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the details of a pipeline, there are records of the tasks performed&lt;/li&gt;
&lt;li&gt;Click the eyeglasses symbol on a record to view details about the performance (duration, throughput, start/end time, etc)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor and update statistics about data across a system&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistics provide &lt;strong&gt;info about how data is distributed&lt;/strong&gt; in a table and help the database determine the most efficient way to run a query&lt;/li&gt;
&lt;li&gt;Important for

&lt;ul&gt;
&lt;li&gt;Query performance and optimization&lt;/li&gt;
&lt;li&gt;Optimal execution plans&lt;/li&gt;
&lt;li&gt;Index utilization&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Update statistics daily or after loading/transforming data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Enabled at &lt;strong&gt;database level&lt;/strong&gt; with ALTER DATABASE database_name SET AUTO_CREATE_STATISTICS ON&lt;/li&gt;
&lt;li&gt;Querying stats data:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Display Query Statistics information
sp_helpstats N'StatisticsTest', 'all'

- Display extra information
SELECT FROM sys.stats AS stat
CROSS APPLY sys.dm_db_stats_properties(stat.object_id, stat.stats_id) AS sp
WHERE stat.object_id = object_id('StatisticsTest');

- Display query details
SELECT * FROM dbo.StatisticsTest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
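&lt;p&gt;The "update statistics after loading" step above could be sketched like this (statistics and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: create statistics on a frequently filtered column
CREATE STATISTICS stat_CustomerKey ON dbo.StatisticsTest (CustomerKey);

-- Refresh all statistics on the table after a data load
UPDATE STATISTICS dbo.StatisticsTest;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;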



&lt;p&gt;&lt;strong&gt;Monitor data pipeline performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor section in ADF or Synapse Studio&lt;/li&gt;
&lt;li&gt;Displays pipeline runs&lt;/li&gt;
&lt;li&gt;Within a pipeline run you can view

&lt;ul&gt;
&lt;li&gt;Consumption&lt;/li&gt;
&lt;li&gt;Pipeline orchestration (activities performed)&lt;/li&gt;
&lt;li&gt;Data flow (activity inputs, outputs, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;View in list view or Gantt view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Measure query performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools to measure query performance

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-ver16"&gt;Query store&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Identifies performance differences when query plan changes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-sql/database/intelligent-insights-overview?view=azuresql"&gt;Intelligent Insights&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Uses AI to continuously monitor database usage to detect disruptive events&lt;/li&gt;
&lt;li&gt;Detection metrics

&lt;ul&gt;
&lt;li&gt;Query duration&lt;/li&gt;
&lt;li&gt;Timeout requests&lt;/li&gt;
&lt;li&gt;Excessive wait time&lt;/li&gt;
&lt;li&gt;Errored out requests&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/system-dynamic-management-views?view=sql-server-ver16"&gt;Dynamic Management Views (DMV)&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Monitor server health&lt;/li&gt;
&lt;li&gt;Diagnose problems&lt;/li&gt;
&lt;li&gt;Tune performance&lt;/li&gt;
&lt;li&gt;Available via SQL queries

&lt;ul&gt;
&lt;li&gt;sys.dm_pdw_exec_requests&lt;/li&gt;
&lt;li&gt;sys.dm_exec_requests&lt;/li&gt;
&lt;li&gt;sys.dm_pdw_request_steps&lt;/li&gt;
&lt;li&gt;sys.dm_exec_query_plan&lt;/li&gt;
&lt;li&gt;sys.dm_pdw_waits&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Schedule and monitor pipeline tests&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using a scheduled trigger

&lt;ul&gt;
&lt;li&gt;Add a new trigger&lt;/li&gt;
&lt;li&gt;Leave as Schedule type&lt;/li&gt;
&lt;li&gt;Choose start time and frequency&lt;/li&gt;
&lt;li&gt;Go to Manage section and view triggered runs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interpret Azure Monitor metrics and logs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Monitor Metrics

&lt;ul&gt;
&lt;li&gt;Collects numeric data from monitored resources and stores it in a time-series database&lt;/li&gt;
&lt;li&gt;Allows point-in-time descriptions of resources&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Resources that AMM pulls data from

&lt;ul&gt;
&lt;li&gt;Azure Resources

&lt;ul&gt;
&lt;li&gt;First party services&lt;/li&gt;
&lt;li&gt;Access to metrics is available by default&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Monitor Agent - Collects data from OS&lt;/li&gt;
&lt;li&gt;Application Insights - collects telemetry about specific application workloads&lt;/li&gt;
&lt;li&gt;REST API - get data in and out of AMM&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Azure Monitor Logs

&lt;ul&gt;
&lt;li&gt;Collect and organize logs and performance data from monitored resources&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-workspace-overview"&gt;Log Analytics workspaces&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Edit and run log queries&lt;/li&gt;
&lt;li&gt;Create alerts and workbooks&lt;/li&gt;
&lt;li&gt;Analyze logs with Kusto Query Language&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement a pipeline alert strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/monitor-metrics-alerts"&gt;Set up a pipeline alert in ADF&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Provides ability to combine data and business process&lt;/li&gt;
&lt;li&gt;Configured in the Monitor section under Alerts and Metrics&lt;/li&gt;
&lt;li&gt;Alert

&lt;ul&gt;
&lt;li&gt;Set target criteria&lt;/li&gt;
&lt;li&gt;Send out notification to an email or group&lt;/li&gt;
&lt;li&gt;Send notifications via text, push notification, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Implement data security</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:18:39 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-implement-data-security-24pa</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-implement-data-security-24pa</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement data masking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic data masking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Prevents unauthorized access by limiting exposure of sensitive data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure masking policies on database fields&lt;/strong&gt; to designate how much data to reveal to nonprivileged users&lt;/li&gt;
&lt;li&gt;Available in:

&lt;ul&gt;
&lt;li&gt;Azure SQL Database&lt;/li&gt;
&lt;li&gt;Azure SQL Managed Instance&lt;/li&gt;
&lt;li&gt;Azure Synapse Analytics Dedicated SQL pools&lt;/li&gt;
&lt;li&gt;SQL Server on Azure VMs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Masking policies

&lt;ul&gt;
&lt;li&gt;Created in the Security section in the portal or via T-SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Components of a policy&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;SQL users excluded from masking&lt;/li&gt;
&lt;li&gt;Masking rule&lt;/li&gt;
&lt;li&gt;Masking function&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Masking functions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default&lt;/strong&gt; - &lt;strong&gt;predefined, fully masks a field&lt;/strong&gt;, replaces values with XXXX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt; - &lt;strong&gt;replaces portion&lt;/strong&gt; of address&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credit card&lt;/strong&gt; - replaces everything but &lt;strong&gt;last four digits&lt;/strong&gt; with XXXX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom text&lt;/strong&gt; - replace with &lt;strong&gt;custom string&lt;/strong&gt; (consists of exposed &lt;strong&gt;prefix&lt;/strong&gt;, &lt;strong&gt;padding string&lt;/strong&gt;, and exposed &lt;strong&gt;suffix&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random number&lt;/strong&gt; - replaces values with &lt;strong&gt;randomly generated values&lt;/strong&gt; of the same data type and length (T-SQL function 'random(1,45)' where 1,45 are the low and high ends of the range)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
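&lt;p&gt;As a minimal sketch, a masking rule applied in T-SQL might look like this (table, column, and user names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: mask an email column for nonprivileged users
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Exclude a specific user from masking
GRANT UNMASK TO DataAnalyst;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;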

&lt;p&gt;&lt;strong&gt;Encrypt data at rest and in motion&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uses a key&lt;/strong&gt; to encrypt and decrypt data&lt;/li&gt;
&lt;li&gt;Disguises the data through a process of &lt;strong&gt;symmetric encryption&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Encryption key is stored in a secure location such as Key Vault&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption at rest&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Encrypting data in a &lt;strong&gt;physical location&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Azure Storage uses &lt;strong&gt;managed keys&lt;/strong&gt; (customer or Microsoft managed)&lt;/li&gt;
&lt;li&gt;Azure SQL and Synapse SQL use &lt;strong&gt;transparent data encryption (TDE)&lt;/strong&gt; also with service- and customer-managed keys

&lt;ul&gt;
&lt;li&gt;TDE

&lt;ul&gt;
&lt;li&gt;Real-time &lt;strong&gt;I/O encryption and decryption&lt;/strong&gt; of the database, its backups, and transaction log files
&lt;/li&gt;
&lt;li&gt;Prevents malicious party from restoring the backup&lt;/li&gt;
&lt;li&gt;In the portal (Azure SQL) under the Security section, there is a TDE option with an on/off toggle&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always Encrypted&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Protects data in Azure SQL, Azure SQL Managed Instance, and SQL Server databases by encrypting it &lt;strong&gt;inside the client application&lt;/strong&gt; and never revealing the encryption key to the database engine&lt;/li&gt;
&lt;li&gt;Allows separation of people who view data and those who manage data&lt;/li&gt;
&lt;li&gt;Uses two types of keys

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column encryption keys (CEK)&lt;/strong&gt; - used to encrypt data in a column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column master keys (CMK)&lt;/strong&gt; - used to encrypt a CEK&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Supports two types of encryption

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Randomized encryption&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;less predictable&lt;/li&gt;
&lt;li&gt;more secure&lt;/li&gt;
&lt;li&gt;prevents searching, grouping, indexing, and joining on encrypted columns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic encryption&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;always generates the same encryption value for a given plain text value&lt;/li&gt;
&lt;li&gt;has opposite qualities of randomized&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption in motion&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Securing data moving from one network location to another&lt;/li&gt;
&lt;li&gt;Solution is &lt;strong&gt;transport layer security (TLS)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enabled by default in Azure Synapse SQL&lt;/li&gt;
&lt;li&gt;Can be enabled in Azure Storage via Settings and Configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement row-level and column-level security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row-level security&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Restricts records in a table based on user running query&lt;/li&gt;
&lt;li&gt;Not permission based, but &lt;strong&gt;predicate based&lt;/strong&gt; (rows are hidden based on whether predicate condition is true or false)&lt;/li&gt;
&lt;li&gt;Security policy defines users and predicates &lt;strong&gt;(inline table-valued functions)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Policies are created in T-SQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkba3gg9tkkgikegvj881.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkba3gg9tkkgikegvj881.png" alt="Image description" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt5w4wuskc60ncu5408w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt5w4wuskc60ncu5408w.png" alt="Image description" width="524" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RLS best practices&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create separate schema&lt;/strong&gt; for security predicate function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid recursion&lt;/strong&gt; in predicate functions to prevent performance degradation&lt;/li&gt;
&lt;li&gt;Components need to be dropped in a specific order if RLS is no longer used

&lt;ul&gt;
&lt;li&gt;1) Security policy&lt;/li&gt;
&lt;li&gt;2) Table&lt;/li&gt;
&lt;li&gt;3) Function&lt;/li&gt;
&lt;li&gt;4) Schemas&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid excessive table joins&lt;/strong&gt; in predicate function&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Column level security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Controls access to columns based on &lt;strong&gt;user context&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Configured using &lt;strong&gt;GRANT SELECT&lt;/strong&gt; statement and specifying columns and user&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
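&lt;p&gt;The GRANT SELECT approach above can be sketched as follows (table, column, and user names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Illustrative: expose only non-sensitive columns to a user
GRANT SELECT ON dbo.Customers (CustomerId, FirstName, LastName) TO DataAnalyst;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;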

&lt;p&gt;&lt;strong&gt;Implement Azure role-based access control (RBAC)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authorization for Azure Data Lake is controlled via 

&lt;ul&gt;
&lt;li&gt;Shared key authorization&lt;/li&gt;
&lt;li&gt;Shared access signature (SAS) &lt;/li&gt;
&lt;li&gt;Role-based access control (RBAC)&lt;/li&gt;
&lt;li&gt;Access Control Lists (ACL)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;RBAC

&lt;ul&gt;
&lt;li&gt;Uses role assignment to apply permissions to security principals (users, groups, managed identities, etc)&lt;/li&gt;
&lt;li&gt;Can limit access to files, folders, containers, and accounts&lt;/li&gt;
&lt;li&gt;Roles

&lt;ul&gt;
&lt;li&gt;Storage blob data owner (full container access)&lt;/li&gt;
&lt;li&gt;Storage blob data contributor (read, write, delete)&lt;/li&gt;
&lt;li&gt;Storage blob data reader (read, list)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement POSIX-like access control lists (ACLs) for Data Lake Storage Gen2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACL

&lt;ul&gt;
&lt;li&gt;Holds rules that grant or deny access to certain environments&lt;/li&gt;
&lt;li&gt;RBAC is coarse-grained (rice) vs. ACLs, which are fine-grained (sugar)&lt;/li&gt;
&lt;li&gt;Roles are determined before ACL is applied (if user has RBAC the operation succeeds, if not it falls to ACL)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement a data retention policy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be set on &lt;strong&gt;Azure SQL (long-term retention)&lt;/strong&gt; or &lt;strong&gt;Azure Storage (Lifecycle Management)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term retention&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Automatically &lt;strong&gt;retain backups&lt;/strong&gt; in separate blob container for &lt;strong&gt;up to 10 years&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can be used to recover the database through the portal, CLI, or PowerShell&lt;/li&gt;
&lt;li&gt;Enabled by defining policy with four parameters

&lt;ul&gt;
&lt;li&gt;Weekly (W)&lt;/li&gt;
&lt;li&gt;Monthly (M)&lt;/li&gt;
&lt;li&gt;Yearly (Y)&lt;/li&gt;
&lt;li&gt;Week of the year (WeekofYear)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle Management&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Automated way to &lt;strong&gt;tier down files to cool and archive&lt;/strong&gt; based on modified date

&lt;ul&gt;
&lt;li&gt;Enabled by creating a policy with one or more rules&lt;/li&gt;
&lt;li&gt;Choose from number of days since blob was created, modified, or accessed (can enable access tracking)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
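&lt;p&gt;As a hedged sketch, a Lifecycle Management policy is defined as JSON rules along these lines (rule name and day thresholds are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "rules": [
    {
      "name": "tier-down-old-blobs",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 }
          }
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;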

&lt;p&gt;&lt;strong&gt;Implement secure endpoints (private and public)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Endpoint is an &lt;strong&gt;address exposed by a web app&lt;/strong&gt; to communicate with external entities&lt;/li&gt;
&lt;li&gt;Service Endpoint

&lt;ul&gt;
&lt;li&gt;Secure and direct access to Azure service/resource over the Azure network&lt;/li&gt;
&lt;li&gt;Firewall security feature&lt;/li&gt;
&lt;li&gt;Virtual network rule&lt;/li&gt;
&lt;li&gt;Allows for private IPs, but still uses a public address&lt;/li&gt;
&lt;li&gt;Works on Azure SQL, Synapse, and Storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Private link

&lt;ul&gt;
&lt;li&gt;Carries traffic privately so traffic between virtual network and Azure Service travels through the Microsoft Network&lt;/li&gt;
&lt;li&gt;Uses private address on VNet instead of public address like Service Endpoint&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement resource tokens in Azure Databricks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token is an authentication method that uses a personal access token (PAT) to connect via REST API&lt;/li&gt;
&lt;li&gt;PAT

&lt;ul&gt;
&lt;li&gt;Can be used instead of passwords&lt;/li&gt;
&lt;li&gt;Enabled by default&lt;/li&gt;
&lt;li&gt;Set expiration date or indefinite lifetime&lt;/li&gt;
&lt;li&gt;Disabled, monitored, and revoked by workspace admins&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Create the PAT in the Databricks portal, then use it when setting up the linked service in Azure Synapse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Load a DataFrame with sensitive information&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Done through &lt;strong&gt;encryption using Fernet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fernet&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symmetric authenticated cryptography (uses a secret key)&lt;/li&gt;
&lt;li&gt;from cryptography.fernet import Fernet&lt;/li&gt;
&lt;li&gt;encryptionKey = Fernet.generate_key()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create master key&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp71qrohkjmhsjz90hv9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp71qrohkjmhsjz90hv9m.png" alt="Image description" width="481" height="75"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create UDFs to encrypt/decrypt&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zbjaouyro2cw62ouca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zbjaouyro2cw62ouca.png" alt="Image description" width="578" height="244"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use UDFs to encrypt/decrypt&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn1ywoac8u2sqv4vzcqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn1ywoac8u2sqv4vzcqy.png" alt="Image description" width="601" height="296"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
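
&lt;p&gt;The screenshots above follow the standard Fernet pattern; a runnable sketch of the key generation and encrypt/decrypt functions (outside Spark, with the UDF registration omitted) looks like this:&lt;/p&gt;

```python
from cryptography.fernet import Fernet

# Generate the key once and keep it somewhere safe (e.g. a secret scope)
encryptionKey = Fernet.generate_key()

def encrypt_val(clear_text, key):
    # Function body you would register as the encryption UDF
    return Fernet(key).encrypt(clear_text.encode()).decode()

def decrypt_val(cipher_text, key):
    # Function body you would register as the decryption UDF
    return Fernet(key).decrypt(cipher_text.encode()).decode()

token = encrypt_val("123-45-6789", encryptionKey)   # illustrative value
roundtrip = decrypt_val(token, encryptionKey)
```

&lt;p&gt;In Databricks you would wrap these with pyspark.sql.functions.udf and apply them to the sensitive column with withColumn.&lt;/p&gt;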

&lt;p&gt;&lt;strong&gt;Write encrypted data to tables or Parquet files&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write encrypted data to a table

&lt;ul&gt;
&lt;li&gt;df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("Table")&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Write encrypted data to a parquet file

&lt;ul&gt;
&lt;li&gt;encrypted.write.mode("overwrite").parquet("container_address/file_path")&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manage sensitive information&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-sql/database/data-discovery-and-classification-overview?view=azuresql"&gt;Data discovery and classification&lt;/a&gt; - discovering, classifying, labeling, and reporting the sensitive data in your databases&lt;/li&gt;
&lt;li&gt;Capabilities

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Discovery and recommendations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labeling&lt;/strong&gt; - apply sensitive classification labels to columns using metadata attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query result-set sensitivity&lt;/strong&gt; - calculates the sensitivity of a query result in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; - view DB classification state in a dashboard&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Defender for Cloud

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-native&lt;/strong&gt; application protection platform (CNAPP)&lt;/li&gt;
&lt;li&gt;Set of security measures and practices to protect cloud-based apps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous monitoring, alerts, and threat mitigation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Separate services for Storage and SQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Defender for SQL

&lt;ul&gt;
&lt;li&gt;Discover and mitigate database vulnerabilities&lt;/li&gt;
&lt;li&gt;Alerts on anomalous activities&lt;/li&gt;
&lt;li&gt;Performs vulnerability assessments and Advanced Threat Protection&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Defender for Storage

&lt;ul&gt;
&lt;li&gt;Detects potential threats to storage accounts&lt;/li&gt;
&lt;li&gt;Prevents three major impacts

&lt;ul&gt;
&lt;li&gt;Malicious file uploads&lt;/li&gt;
&lt;li&gt;Sensitive data exfiltration&lt;/li&gt;
&lt;li&gt;Data corruption&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Includes

&lt;ul&gt;
&lt;li&gt;Activity monitoring&lt;/li&gt;
&lt;li&gt;Sensitive data threat detection&lt;/li&gt;
&lt;li&gt;Malware scanning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Manage batches and pipelines</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:13:20 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-manage-batches-and-pipelines-3e66</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-manage-batches-and-pipelines-3e66</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure Batch&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/batch/"&gt;Azure Batch&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Platform to run high-performance computing jobs in parallel at large scale&lt;/li&gt;
&lt;li&gt;Manages cluster of machines and supports autoscaling&lt;/li&gt;
&lt;li&gt;Allows you to &lt;strong&gt;install applications that can run as a job&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Schedule and run jobs on cluster machines&lt;/li&gt;
&lt;li&gt;Pay per minute for resources used&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it works&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Pool = cluster of machines/nodes&lt;/li&gt;
&lt;li&gt;Slot = set of resources used to execute a task&lt;/li&gt;
&lt;li&gt;Define number of slots per node

&lt;ul&gt;
&lt;li&gt;Increase slots per node to improve performance without increasing cost&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Job assigns tasks to slots on nodes&lt;/li&gt;
&lt;li&gt;Application is installed on each node to execute the tasks&lt;/li&gt;
&lt;li&gt;Specify &lt;strong&gt;application packages at pool or task level&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
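
&lt;p&gt;The pool/slot/task relationship can be modeled in a few lines of plain Python (the round-robin placement and the counts are illustrative, not how the Batch scheduler is actually implemented):&lt;/p&gt;

```python
from itertools import cycle

def assign_tasks(nodes, slots_per_node, tasks):
    """Spread tasks across every slot in the pool: a job assigns tasks
    to slots, and each node runs as many tasks as it has slots."""
    slots = [(n, s) for n in range(nodes) for s in range(slots_per_node)]
    assignment = {}
    for slot, task in zip(cycle(slots), tasks):
        assignment.setdefault(slot, []).append(task)
    return assignment

# 2 nodes x 2 slots per node: 4 tasks can run at once, the 5th queues behind t1
plan = assign_tasks(nodes=2, slots_per_node=2, tasks=["t1", "t2", "t3", "t4", "t5"])
```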

&lt;p&gt;&lt;strong&gt;Configure the batch size&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the portal (Batch account)

&lt;ul&gt;
&lt;li&gt;Choose Pools in the left-side panel&lt;/li&gt;
&lt;li&gt;Add a new pool and name it&lt;/li&gt;
&lt;li&gt;Define the OS image (publisher and sku)&lt;/li&gt;
&lt;li&gt;Choose VM size (determines cores and memory)&lt;/li&gt;
&lt;li&gt;Choose fixed or auto scale for nodes

&lt;ul&gt;
&lt;li&gt;If fixed, select number of nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Choose application packages and versions, uploading files if necessary&lt;/li&gt;
&lt;li&gt;Use Mount configuration to mount storage file shares, specifying the account name and access key of the storage account&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trigger batches&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In the portal (Batch)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Confirm that the &lt;strong&gt;pool is in steady state&lt;/strong&gt; and the &lt;strong&gt;nodes are in idle state&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Choose Jobs in the left-side panel and &lt;strong&gt;add a new job&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Name the job and &lt;strong&gt;select the pool&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Open the job and &lt;strong&gt;select Tasks&lt;/strong&gt; in the left-side panel&lt;/li&gt;
&lt;li&gt;Define name and description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enter the command&lt;/strong&gt; in the command line box that will run on each machine

&lt;ul&gt;
&lt;li&gt;Reference installed packages with %AZ_BATCH_APP_PACKAGE_#%&lt;/li&gt;
&lt;li&gt;Reference path to input fileshare with -i S:&amp;lt;file_path&amp;gt;&lt;/li&gt;
&lt;li&gt;Reference path to output with S:&amp;lt;file_path&amp;gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Submit task&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In Azure Data Factory and Azure Synapse&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;To run a single task in ADF&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create linked service to Azure Batch&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Need Batch account name, account endpoint, and primary access key from the Keys section in the Batch portal&lt;/li&gt;
&lt;li&gt;Also need the name of the pool&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create pipeline to run Custom Batch activity&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Select linked service under the Azure Batch option in the activity settings&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define command to execute utility&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enter in the Command box under Settings for the activity&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To run multiple tasks in parallel&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Get list of files using Get Metadata&lt;/strong&gt; activity in the General option

&lt;ul&gt;
&lt;li&gt;Configure data set and linked service with Azure File Storage&lt;/li&gt;
&lt;li&gt;Use the Field list to select Child items&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a ForEach activity to iterate&lt;/strong&gt; through the Child items

&lt;ul&gt;
&lt;li&gt;Use dynamic content in the Command to add the filename for each file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Handle failed batch loads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failure types

&lt;ul&gt;
&lt;li&gt;Infrastructure - pool and node errors&lt;/li&gt;
&lt;li&gt;Application - job and task errors&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pool errors&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resizing failure&lt;/strong&gt; - pool is unable to provision a node within the &lt;strong&gt;resize timeout window&lt;/strong&gt; (default is 15 mins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insufficient quota&lt;/strong&gt; - account has limited number of core quotas, and if allocation exceeds this number then it fails (&lt;strong&gt;raise support ticket to increase quota&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling failures&lt;/strong&gt; - formula is used to determine autoscaling, and formula evaluation can fail (&lt;strong&gt;check logs&lt;/strong&gt; to find issue)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node issues&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App package download failure&lt;/strong&gt; - node set to unusable, &lt;strong&gt;needs to be reimaged&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node OS updates&lt;/strong&gt; - tasks can be interrupted by updates, &lt;strong&gt;auto update can be disabled&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node in unusable state&lt;/strong&gt; - even if the pool is ready, a node can be in an unusable state (VM crash, firewall block, invalid app package) and &lt;strong&gt;needs to be re-imaged&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Node disk is full&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Rebooting and re-imaging can be done in the Batch portal under Pools&lt;/li&gt;
&lt;li&gt;The Connect option in portal allows you to use RDP/SSH to connect to the VM

&lt;ul&gt;
&lt;li&gt;Define user details&lt;/li&gt;
&lt;li&gt;Set as Admin&lt;/li&gt;
&lt;li&gt;Download RDP file and enter user credentials&lt;/li&gt;
&lt;li&gt;This opens a Server Manager window where you can navigate the file system to check application package installations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Validate batch loads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Job errors&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timeout&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Max wall clock time defines max time allowed for job to run&lt;/strong&gt; from the time it was created&lt;/li&gt;
&lt;li&gt;Default value is unlimited&lt;/li&gt;
&lt;li&gt;If max is reached, running tasks are killed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Increase max wall clock value to prevent timeout&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure of job-related tasks&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Each job has job-related preparation tasks that run once for the job&lt;/li&gt;
&lt;li&gt;Job prep task runs on each node as soon as job is created&lt;/li&gt;
&lt;li&gt;Job release task runs on each node when job terminates&lt;/li&gt;
&lt;li&gt;Failures can occur in these tasks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task errors&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Task &lt;strong&gt;waiting&lt;/strong&gt; - dependency on another task&lt;/li&gt;
&lt;li&gt;Task &lt;strong&gt;timeout&lt;/strong&gt; - check max wall clock time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing app packages&lt;/strong&gt; or resource files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error in command&lt;/strong&gt; defined in the task&lt;/li&gt;
&lt;li&gt;Check stdout and stderr logs for details&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the Batch portal under node details, you can specify a container where log files are stored for future reference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure batch retention&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retention time defines &lt;strong&gt;how long to keep task directory&lt;/strong&gt; on node once task is complete&lt;/li&gt;
&lt;li&gt;Configure at &lt;strong&gt;Job level or Task level&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Retention time field in advanced settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default is 7 days&lt;/strong&gt; unless removed or deleted&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manage data pipelines in Azure Data Factory or Azure Synapse Pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ways to run pipelines

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debug Run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't need to save&lt;/strong&gt; changes&lt;/li&gt;
&lt;li&gt;Directly run pipelines with draft changes&lt;/li&gt;
&lt;li&gt;Manual, &lt;strong&gt;can't be scheduled&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger Run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need to publish&lt;/strong&gt; changes first&lt;/li&gt;
&lt;li&gt;Only runs published version of pipeline&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can be manual or scheduled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Schedule data pipelines in Data Factory or Azure Synapse Pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger types

&lt;ul&gt;
&lt;li&gt;Scheduled - run on wall-clock schedule&lt;/li&gt;
&lt;li&gt;Tumbling window - run at periodic intervals while maintaining state&lt;/li&gt;
&lt;li&gt;Storage event - run pipeline when a file is uploaded to or deleted from a storage account&lt;/li&gt;
&lt;li&gt;Custom event trigger - runs pipeline when event is raised by Azure Event Grid&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled vs tumbling triggers&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Only supports future-dated loads&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does not maintain state&lt;/strong&gt;, only fire and forget&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tumbling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Can run back-dated&lt;/strong&gt; and future-dated loads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintains state&lt;/strong&gt; (completed loads)&lt;/li&gt;
&lt;li&gt;Passes start and end timestamps of window as parameters&lt;/li&gt;
&lt;li&gt;Can be used to add dependency between pipelines, allowing complex scenarios&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement version control for pipeline artifacts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authoring modes

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live mode (default)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Authoring directly against pipelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No option to save draft changes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Need to publish to save valid changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need manually created ARM templates&lt;/strong&gt; to deploy pipelines to other environments&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git Repo mode&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Repo can be in ADO or GitHub&lt;/li&gt;
&lt;li&gt;All artifacts can be stored in source control&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Draft changes can be saved even if not valid&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autogenerates ARM templates&lt;/strong&gt; for deployment in other environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enables DevOps features&lt;/strong&gt; (PRs, reviews, collab)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manage Spark jobs in a pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline activities for Spark

&lt;ul&gt;
&lt;li&gt;Synapse - Spark notebook, Spark job&lt;/li&gt;
&lt;li&gt;Databricks - notebook, Jar file, Python file&lt;/li&gt;
&lt;li&gt;HDInsight activities - Spark Jar/script&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Monitoring Spark activities

&lt;ul&gt;
&lt;li&gt;Monitoring built in to ADF&lt;/li&gt;
&lt;li&gt;Platform monitoring (Synapse, Databricks)

&lt;ul&gt;
&lt;li&gt;In ADF/Synapse, go to Monitor --&amp;gt; Apache Spark applications and select a specific run for details&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Spark UI&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Develop a stream processing solution</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:11:24 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-develop-a-stream-processing-solution-3h26</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-develop-a-stream-processing-solution-3h26</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identify Azure services for stream processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is streaming data?

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unbounded data&lt;/strong&gt; (at no point do we have the whole dataset)&lt;/li&gt;
&lt;li&gt;Records can be added at any time&lt;/li&gt;
&lt;li&gt;Queries often over a subset of records called a window&lt;/li&gt;
&lt;li&gt;Used when real-time results are required&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Common use cases

&lt;ul&gt;
&lt;li&gt;Processing IoT data&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;li&gt;Monitoring social media sentiment&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Comparing streaming and traditional databases

&lt;ul&gt;
&lt;li&gt;In traditional queries, the user submits the query to the database engine which runs against the entire dataset and returns a result

&lt;ul&gt;
&lt;li&gt;Data is stored, query is not&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a streaming query, the user submits the query to the streaming engine which applies the query logic to every data point in the stream after that moment and updates the intermediate result

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query is stored, data is not&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Azure Event Hub

&lt;ul&gt;
&lt;li&gt;Stream ingestion service&lt;/li&gt;
&lt;li&gt;Stores and buffers data (producers and consumers can operate at their own speeds)&lt;/li&gt;
&lt;li&gt;Data in storage is persistent and partitioned&lt;/li&gt;
&lt;li&gt;Allows one or more other services to read from the data stream&lt;/li&gt;
&lt;li&gt;Competing consumers (duplicate instances of an application) can access and share the data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Stream Analytics

&lt;ul&gt;
&lt;li&gt;Stream processing service that moves and transforms data between different data inputs and outputs&lt;/li&gt;
&lt;li&gt;Uses a SQL-like language for querying (SELECT INTO output FROM input)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Databricks

&lt;ul&gt;
&lt;li&gt;Query data streams using Spark Structured streaming&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
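
&lt;p&gt;The "query is stored, data is not" idea can be shown with a tiny running aggregate: only the intermediate result survives, never the raw events:&lt;/p&gt;

```python
def streaming_average():
    """A stored 'query' that updates its intermediate result per event."""
    count, total = 0, 0.0
    def on_event(value):
        nonlocal count, total
        count += 1
        total += value
        return total / count   # updated result after each data point
    return on_event

update = streaming_average()
for reading in [10.0, 20.0, 30.0]:   # events arriving over time
    latest = update(reading)
```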

&lt;p&gt;&lt;strong&gt;Create a stream processing solution by using Stream Analytics and Azure Event Hubs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create Azure Event Hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In resource group, &lt;strong&gt;create new Event Hub resource&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Choose namespace name (globally unique)&lt;/li&gt;
&lt;li&gt;Choose pricing tier&lt;/li&gt;
&lt;li&gt;Choose partitions (must choose upfront)&lt;/li&gt;
&lt;li&gt;Create and deploy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Go to Event Hubs Namespace and &lt;strong&gt;create Event Hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Entities --&amp;gt; Event Hubs --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose name and partition count&lt;/li&gt;
&lt;li&gt;Review + create&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Navigate to Event Hub and &lt;strong&gt;add Shared Access Policy&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Settings --&amp;gt; Shared Access Policies --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose policy name and permission (manage, send, listen)&lt;/li&gt;
&lt;li&gt;Open policy and copy Primary Key&lt;/li&gt;
&lt;li&gt;Provide key to data source to allow it to send data to the Event Hub&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create Stream Analytics Job&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a new Stream Analytics resource&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Choose name, region, hosting environment, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Open resource and &lt;strong&gt;add input&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Job topology --&amp;gt; Inputs --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose Event Hub and provide alias&lt;/li&gt;
&lt;li&gt;Choose Event Hub connection details (can use "Select Event Hub from your subscriptions" to autofill)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose output(s)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Job topology --&amp;gt; Outputs --&amp;gt; +&lt;/li&gt;
&lt;li&gt;Choose output service and alias&lt;/li&gt;
&lt;li&gt;Choose output connection details&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define query&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Job topology --&amp;gt; Query&lt;/li&gt;
&lt;li&gt;Write query with SELECT fields INTO output_alias FROM input_alias&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Once Stream Analytics job is running, check output to confirm data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Process data by using Spark structured streaming&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured Streaming is a stream processing engine built on top of Apache Spark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using Structured Streaming in Databricks&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install Azure Event Hub library&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Cluster info --&amp;gt; Libraries --&amp;gt; Install new --&amp;gt; Specify name and version&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create/import a notebook&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read data stream from event hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Create connectionString using &lt;strong&gt;Primary Key from Event Hub&lt;/strong&gt; and &lt;strong&gt;EntityPath&lt;/strong&gt; for the dataset&lt;/li&gt;
&lt;li&gt;Create a JSON object to store startingEventPosition&lt;/li&gt;
&lt;li&gt;Create a JSON object to store eventHubsConf (includes connectionString (be sure to encrypt), startingPosition, and setMaxEventsPerTrigger)&lt;/li&gt;
&lt;li&gt;Configure Spark parallelism&lt;/li&gt;
&lt;li&gt;Connect to event stream using spark.&lt;strong&gt;readStream.format("eventhubs")&lt;/strong&gt;.options(**eventHubsConf).load()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse and view the data stream&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;eventStreamDF.printSchema() shows properties for Event Hub entries&lt;/li&gt;
&lt;li&gt;Body property contains the data

&lt;ul&gt;
&lt;li&gt;bodyDF = eventStreamDF.&lt;strong&gt;select(col("body")&lt;/strong&gt;.cast("STRING"))&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Use pyspark.sql.types to define the schema with StructType

&lt;ul&gt;
&lt;li&gt;StructField("field_name", StringType(), False)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Parse the body based on the schema

&lt;ul&gt;
&lt;li&gt;parsedDF = bodyDF.select(from_json(col("body"), schema).alias("json"))&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Flatten the parsed json

&lt;ul&gt;
&lt;li&gt;flatDF = parsedDF.select(col("json.field_name").alias("field_alias"))&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write data stream to a delta table&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Write the stream

&lt;ul&gt;
&lt;li&gt;DF.writeStream.format("delta").option("checkpointLocation", "delta-checkpoints/location_name").start("/delta-tables/location_name")&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Create table with %sql

&lt;ul&gt;
&lt;li&gt;CREATE TABLE table USING DELTA LOCATION '/delta-tables/table'&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Query the delta table using SQL

&lt;ul&gt;
&lt;li&gt;SELECT * FROM table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
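
&lt;p&gt;The parse-and-flatten steps can be mimicked without a cluster; the field names below stand in for the placeholders used in the notes, and the dict of types plays the role of the StructType schema:&lt;/p&gt;

```python
import json

# Stand-in for StructType: field name to Python type (illustrative fields)
schema_fields = {"deviceId": str, "temperature": float}

def parse_body(body):
    """Rough equivalent of casting body to STRING, from_json, and
    flattening: keep only schema fields, cast to the declared types."""
    raw = json.loads(body)
    return {name: typ(raw[name]) for name, typ in schema_fields.items() if name in raw}

event = parse_body('{"deviceId": "sensor-1", "temperature": "21.5", "extra": 1}')
```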

&lt;p&gt;&lt;strong&gt;Create windowed aggregates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of window

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tumbling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed window duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contiguous windows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events belong to exactly one window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY name, TumblingWindow(second, 10)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hopping&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed window duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;New window starts at a set interval (e.g. a 10s window every 5s)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows can overlap&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events can belong to multiple windows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY HoppingWindow(second, 10, 5)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed window duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Windows are created when events enter or leave the window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows can overlap and do not have a fixed schedule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events can belong to multiple windows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY SlidingWindow(second, 10)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Window starts when a new event arrives&lt;/li&gt;
&lt;li&gt;Window extends to include new events until a specified amount of time passes with no new events&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window duration can vary&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows do not overlap and do not have a fixed schedule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events belong to exactly one window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY SessionWindow(second, 5, 20)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Events that arrive at precisely the same time are windowed together&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows have no duration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Windows do not overlap and do not repeat on a fixed schedule&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events belong to exactly one window&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Created with GROUP BY System.Timestamp()&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
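
&lt;p&gt;The tumbling vs hopping distinction can be checked numerically: a timestamp maps to exactly one tumbling window but possibly several hopping windows (window sizes in seconds, illustrative):&lt;/p&gt;

```python
def tumbling_window(ts, size):
    """Contiguous fixed-size windows: each event lands in exactly one."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """A new fixed-size window starts every `hop` seconds, so windows
    overlap and one event can land in several."""
    first = ((ts - size) // hop + 1) * hop   # earliest window containing ts
    count = (ts - first) // hop + 1
    return [(s, s + size) for s in (first + k * hop for k in range(count)) if s >= 0]

# Event at t=12s: one 10s tumbling window, two 10s-every-5s hopping windows
one = tumbling_window(12, 10)
many = hopping_windows(12, 10, 5)
```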

&lt;p&gt;&lt;strong&gt;Handle schema drift&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema drift happens when the schema of incoming data changes over time (adding or removing columns, changing data types, etc.)&lt;/li&gt;
&lt;li&gt;Breaking vs non-breaking changes

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Breaking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Removing non-optional field&lt;/li&gt;
&lt;li&gt;Renaming a field&lt;/li&gt;
&lt;li&gt;Changing a field type to be more restrictive (float to int)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-breaking&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Removing an optional field&lt;/li&gt;
&lt;li&gt;Adding a field&lt;/li&gt;
&lt;li&gt;Changing a field type to be less restrictive (int to float)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Limiting impact

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Select only necessary fields&lt;/strong&gt; as early in a query as possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input validation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Performed in the stream consumer&lt;/li&gt;
&lt;li&gt;One query selects and casts required fields and determines if record is valid, sending results to intermediate stream&lt;/li&gt;
&lt;li&gt;Second query processes valid records from the intermediate stream (main query)&lt;/li&gt;
&lt;li&gt;Third query processes invalid records from the intermediate stream&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Event Hub Schema Registry&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Store AVRO or JSON schema definitions in the Event Hub&lt;/li&gt;
&lt;li&gt;When event sender uses this schema, all events are validated against it&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
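
&lt;p&gt;The three-query validation pattern can be sketched as routing logic; the required-field contract here is illustrative:&lt;/p&gt;

```python
REQUIRED = {"id": int, "value": float}   # illustrative contract

def validate(record):
    """First-query logic: cast required fields and flag the record, so the
    main query and the error query can split the intermediate stream."""
    try:
        casted = {name: typ(record[name]) for name, typ in REQUIRED.items()}
        return {**casted, "valid": True}
    except (KeyError, TypeError, ValueError):
        return {**record, "valid": False}

stream = [{"id": "1", "value": "2.5"}, {"value": "oops"}]
intermediate = [validate(r) for r in stream]
valid = [r for r in intermediate if r["valid"]]        # main query input
invalid = [r for r in intermediate if not r["valid"]]  # error handling input
```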

&lt;p&gt;&lt;strong&gt;Process time series data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is time series data?

&lt;ul&gt;
&lt;li&gt;Sequence of data points ordered by time of occurrence&lt;/li&gt;
&lt;li&gt;Repeated measurements of the same source

&lt;ul&gt;
&lt;li&gt;At a fixed interval (constant load)&lt;/li&gt;
&lt;li&gt;When the value changes (varying load)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Queried over subsets of data defined by start and end time (window)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Defining time

&lt;ul&gt;
&lt;li&gt;Event time = time measurement is taken&lt;/li&gt;
&lt;li&gt;Processing time = time measurement is received by processing solution&lt;/li&gt;
&lt;li&gt;Difference can be up to minutes depending on latency&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;When processing in Stream Analytics

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Default is processing time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can override with TIMESTAMP BY&lt;/strong&gt; to use event time if available&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Temporal query windows

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp is evaluated against start and end time of windows&lt;/strong&gt; to determine which window it belongs to&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
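&lt;p&gt;As a sketch of the above, a Stream Analytics query can switch from processing time to event time with TIMESTAMP BY and then group events into temporal windows (the input/output aliases and the EventTime column are illustrative):&lt;/p&gt;

```sql
-- Use the event's own timestamp instead of arrival (processing) time
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [output-sink]
FROM [input-hub] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```

&lt;p&gt;Each event's timestamp is evaluated against the 5-minute window boundaries to decide which window it belongs to.&lt;/p&gt;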

&lt;p&gt;&lt;strong&gt;Process within one partition&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A partitioned database or stream:

&lt;ul&gt;
&lt;li&gt;Is a single logical database/stream&lt;/li&gt;
&lt;li&gt;Has multiple underlying storage/processing units&lt;/li&gt;
&lt;li&gt;Has virtually limitless scaling&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Event Hub and Stream Analytics can both be partitioned

&lt;ul&gt;
&lt;li&gt;Event Hubs are partitioned at creation&lt;/li&gt;
&lt;li&gt;Stream Analytics is partitioned in the query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Partitions and computing nodes

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A single node can process many partitions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A single partition can NOT be split over multiple nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;When an Event Hub or Stream Analytics Job scales, partitions are redistributed over nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpartitioned queries&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Cannot calculate results using data from only one partition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cannot leverage scale-out architecture&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Utilize at most 1 SU V2 (6 SU V1)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Query must be partitioned before processing to increase performance&lt;/li&gt;
&lt;li&gt;Partitioning a query can be done by grouping stream data by the desired partition key in a preceding query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
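&lt;p&gt;Repartitioning an otherwise unpartitioned query with a preceding step might look like this sketch (names are illustrative; an optional INTO n clause can set the partition count):&lt;/p&gt;

```sql
-- Step 1: repartition the stream by the desired key
WITH RepartitionedInput AS (
    SELECT *
    FROM [input-hub]
    PARTITION BY DeviceId
)
-- Step 2: the main query can now run per partition and scale out
SELECT DeviceId, AVG(Temperature) AS AvgTemp
INTO [output-sink]
FROM RepartitionedInput
GROUP BY DeviceId, TumblingWindow(minute, 1)
```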

&lt;p&gt;&lt;strong&gt;Process data across partitions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specifying the partition key

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event Hubs are always partitioned&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Round robin by default&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Partition key is called PartitionId&lt;/li&gt;
&lt;li&gt;Custom property can be specified to calculate a PartitionId&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Stream Analytics can be partitioned

&lt;ul&gt;
&lt;li&gt;Results in parallelizable queries&lt;/li&gt;
&lt;li&gt;Enables scale-out&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility level&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Property of a Stream Analytics Job&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PARTITION BY is used for level 1.1 or lower&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;For level 1.2, specify the partition key on the input&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Partition key is specified differently for each type of input/output

&lt;ul&gt;
&lt;li&gt;For blob storage, the partition key is a part of the path&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embarrassingly parallel&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Query can be processed completely in parallel&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All inputs, outputs, and queries are partitioned on the same key&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the Stream Analytics portal, Job topology --&amp;gt; Query --&amp;gt; Job simulation (preview) can show whether a query is parallel and how it can be partitioned&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
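&lt;p&gt;For compatibility level 1.1 or lower, the partitioning is stated explicitly in the query; a minimal sketch (input/output names are illustrative):&lt;/p&gt;

```sql
-- Compatibility level 1.1 or lower: partition explicitly with PARTITION BY
SELECT PartitionId, COUNT(*) AS EventCount
INTO [partitioned-output]
FROM [partitioned-input] PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)
-- At level 1.2, drop PARTITION BY and set the partition key
-- on the input configuration instead
```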

&lt;p&gt;&lt;strong&gt;Configure checkpoints and watermarking during processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Event Hub and ingestion services don't track which records have been consumed; they just apply sequence numbers to the records&lt;/li&gt;
&lt;li&gt;When processing a data stream, &lt;strong&gt;processors take note of where they are in the stream (checkpoint)&lt;/strong&gt; so they can resume in case of interruption&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream stores sequence numbers per partition&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checkpoint used to resume a stream is called an offset&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stream Analytics backs up the internal state regularly

&lt;ul&gt;
&lt;li&gt;Intermediate results are saved&lt;/li&gt;
&lt;li&gt;Checkpoint is saved&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Catching up from a restore can take some time&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermark&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Internal marker indicating up to what point in time events are assumed to have been processed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Updated when a new event comes in or advances as time progresses in the real world&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Used to identify late events&lt;/li&gt;
&lt;li&gt;Used to detect opening and closing of a query window&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scale resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event Hub and Stream Analytics &lt;strong&gt;pricing is based on resources provisioned&lt;/strong&gt;, not necessarily used&lt;/li&gt;
&lt;li&gt;Provision as few resources as possible to save cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Azure Event Hub&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Measured in &lt;strong&gt;Throughput Units (TU)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1 TU provides

&lt;ul&gt;
&lt;li&gt;Ingress up to 1 MB per second or 1000 events per second&lt;/li&gt;
&lt;li&gt;Egress up to 2 MB per second or 4096 events per second&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable auto-inflate to prevent over-provisioning&lt;/strong&gt; (similar to auto-scaling)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Stream Analytics&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Measured in &lt;strong&gt;Streaming Units (SU)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;There are two versions, V1 and V2

&lt;ul&gt;
&lt;li&gt;1 V1 SU = ~1 MB/s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;6 V1 SU = ~1 V2 SU&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;SA jobs

&lt;ul&gt;
&lt;li&gt;Use fine-grained deployment units&lt;/li&gt;
&lt;li&gt;Run on shared hardware&lt;/li&gt;
&lt;li&gt;Limit scalability to &lt;strong&gt;minimum of 1/3 V2 SUs&lt;/strong&gt; and &lt;strong&gt;max of 66 V2 SUs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Support virtual network integration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/stream-analytics/cluster-overview"&gt;&lt;strong&gt;SA Clusters&lt;/strong&gt;&lt;/a&gt; 

&lt;ul&gt;
&lt;li&gt;Scale further and provide more isolation&lt;/li&gt;
&lt;li&gt;Fully isolated deployment&lt;/li&gt;
&lt;li&gt;Scalability has &lt;strong&gt;minimum of 12 V2 SUs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Supports virtual network integration&lt;/li&gt;
&lt;li&gt;Jobs can be moved in and out of a cluster&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Create tests for data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stream Analytics allows sampling data from an input by &lt;strong&gt;downloading a file&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It also &lt;strong&gt;samples data from an Event Hub automatically&lt;/strong&gt; when editing a query&lt;/li&gt;
&lt;li&gt;The Query section has a &lt;strong&gt;Test results option&lt;/strong&gt; as well&lt;/li&gt;
&lt;li&gt;The Query editor also has an option for uploading &lt;strong&gt;sample input data to test changes in the results&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimize pipelines for analytical or transactional purposes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams can be joined with a time-bound condition such as DATEDIFF(second, S1, S2) BETWEEN 0 AND 30, combined with equality on the chosen join properties&lt;/li&gt;
&lt;li&gt;This works best when the streams have the same partition key and partition counts

&lt;ul&gt;
&lt;li&gt;Repartition so that they are partitioned the same way using a preceding query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
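&lt;p&gt;A stream-to-stream join of this kind might be sketched as follows (stream names and columns are illustrative):&lt;/p&gt;

```sql
-- Join two streams on a shared key within a 30-second window
SELECT S1.DeviceId, S1.Reading, S2.Status
INTO [output-sink]
FROM [stream-one] S1
JOIN [stream-two] S2
    ON S1.DeviceId = S2.DeviceId
    AND DATEDIFF(second, S1, S2) BETWEEN 0 AND 30
```

&lt;p&gt;The join parallelizes best when both streams are partitioned on DeviceId with the same partition counts.&lt;/p&gt;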

&lt;p&gt;&lt;strong&gt;Handle late-arriving data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why is data late?

&lt;ul&gt;
&lt;li&gt;Network delays, especially with IoT&lt;/li&gt;
&lt;li&gt;Pipeline congestion - ingestion load is higher than possible throughput&lt;/li&gt;
&lt;li&gt;Outages in gateway devices&lt;/li&gt;
&lt;li&gt;Producers that have specific windows of time for output&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late data tolerance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configured per Stream Analytics job&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consequences&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Late data is included&lt;/strong&gt; in results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window results are delayed&lt;/strong&gt; as the job has to wait for late data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy for data still late beyond the tolerance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop to just ignore&lt;/strong&gt; the record&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjust to update&lt;/strong&gt; the record timestamp (can introduce time skews)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal (Stream Analytics)

&lt;ul&gt;
&lt;li&gt;Settings --&amp;gt; Event ordering&lt;/li&gt;
&lt;li&gt;Can only be done when the job is not running&lt;/li&gt;
&lt;li&gt;Choose late arriving window, out of order settings, and whether to drop or adjust&lt;/li&gt;
&lt;li&gt;Restart job&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Handle interruptions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SLAs&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Microsoft SLA is &lt;strong&gt;99.9% or higher based on service tier&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;SLA for &lt;strong&gt;Stream Analytics&lt;/strong&gt; is that the job is running &lt;strong&gt;99.9%&lt;/strong&gt; of the time&lt;/li&gt;
&lt;li&gt;Catch-up time is the delay that follows a service interruption&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event replication pattern&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;If the SLAs aren't high enough, can increase reliability with ERP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate Event Hub and all downstream infrastructure&lt;/strong&gt;, processing all events in parallel in multiple regions&lt;/li&gt;
&lt;li&gt;Only works if 

&lt;ul&gt;
&lt;li&gt;Pipelines have independent failure conditions&lt;/li&gt;
&lt;li&gt;End application can correctly choose data source&lt;/li&gt;
&lt;li&gt;Event generator is not the bottleneck&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configure exception handling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output data error handling policy&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Defines how Stream Analytics should proceed when it fails to write to an output&lt;/li&gt;
&lt;li&gt;Allows for two values

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop&lt;/strong&gt; - record will be ignored and never written to output (better for speed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry&lt;/strong&gt; - keep attempting to write until success or another error (better for correctness)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Configured in the portal in the Settings --&amp;gt; Error policy section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Upsert data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replay archived stream data&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Develop a batch processing solution</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:09:52 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-develop-a-batch-processing-solution-4ehi</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-develop-a-batch-processing-solution-4ehi</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Develop batch processing solutions by using Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services for each layer in a batch processing architecture

&lt;ul&gt;
&lt;li&gt;Ingestion: Data Factory&lt;/li&gt;
&lt;li&gt;Storage: Blob Storage, ADLS Gen2, Cosmos DB&lt;/li&gt;
&lt;li&gt;Processing: Databricks, HDInsight, Data Flows&lt;/li&gt;
&lt;li&gt;Serving: Azure SQL, Dedicated SQL, Analysis Services&lt;/li&gt;
&lt;li&gt;Orchestration: Data Factory (or Synapse)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Synapse Analytics

&lt;ul&gt;
&lt;li&gt;Group of multiple, well-integrated services&lt;/li&gt;
&lt;li&gt;Works across all layers of architecture&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use PolyBase to load data to a SQL pool&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dedicated SQL Pool&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview

&lt;ul&gt;
&lt;li&gt;Formerly known as Azure &lt;strong&gt;SQL Data Warehouse&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Available as standalone service and within Synapse&lt;/li&gt;
&lt;li&gt;Like a SQL Server Database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massively parallel processing&lt;/strong&gt; (MPP) architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elastically scale compute and storage&lt;/strong&gt; separately&lt;/li&gt;
&lt;li&gt;Pause or resume service to save cost&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Components

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Basic unit of storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fixed 60 distributions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Queries executed against each distribution in parallel&lt;/li&gt;
&lt;li&gt;Stored in Azure Storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control node&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;SQL Server endpoint&lt;/li&gt;
&lt;li&gt;Queries go to control node&lt;/li&gt;
&lt;li&gt;Only stores metadata&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Coordinates query execution with compute nodes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute nodes&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Execute queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Max 60 compute nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributions equally divided&lt;/strong&gt; among compute nodes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Movement Service (DMS)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coordinates movement of data between compute nodes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;For some queries (joins, group by) data needs to be co-located&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehousing Units&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DWU = CPU + memory + I/O&lt;/li&gt;
&lt;li&gt;Represents computational power&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can be increased or decreased to enable scaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Paid for per hour (&lt;strong&gt;lower to reduce costs&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Features&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Most regular SQL features are supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDL and DML&lt;/strong&gt; statements and &lt;strong&gt;Dynamic SQL&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic management views&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triggers and cross-database queries are not supported&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Constraints, identity columns, and relationships work differently than SQL Server&lt;/li&gt;
&lt;li&gt;Can be used in both the compute and serving layer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Polybase&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read and write data in external storage&lt;/strong&gt; using T-SQL&lt;/li&gt;
&lt;li&gt;Available in SQL Server and Synapse&lt;/li&gt;
&lt;li&gt;Supports delimited text, parquet, ORC, GZIP, and SNAPPY compressed files&lt;/li&gt;
&lt;li&gt;Control node passes storage location to compute nodes, which read the data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Components&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Database &lt;strong&gt;Scoped Credential&lt;/strong&gt; = &lt;strong&gt;access&lt;/strong&gt; storage account&lt;/li&gt;
&lt;li&gt;External &lt;strong&gt;Data Source&lt;/strong&gt; = define the storage &lt;strong&gt;location&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;External &lt;strong&gt;File Format&lt;/strong&gt; = &lt;strong&gt;format&lt;/strong&gt; of the file being read&lt;/li&gt;
&lt;li&gt;External &lt;strong&gt;Table&lt;/strong&gt; = &lt;strong&gt;metadata&lt;/strong&gt; of underlying file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
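&lt;p&gt;The four PolyBase components above map to four T-SQL statements; a minimal sketch, assuming a Parquet file in ADLS Gen2 (all names, paths, and columns are illustrative):&lt;/p&gt;

```sql
-- 1. Scoped credential: access to the storage account
CREATE DATABASE SCOPED CREDENTIAL StorageCred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token>';

-- 2. External data source: where the data lives
CREATE EXTERNAL DATA SOURCE MyLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://container@account.dfs.core.windows.net',
      CREDENTIAL = StorageCred);

-- 3. External file format: how the files are encoded
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- 4. External table: metadata over the underlying files
CREATE EXTERNAL TABLE dbo.ExtSales (
    SaleId INT,
    Amount DECIMAL(10, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = MyLake,
      FILE_FORMAT = ParquetFormat);

-- Load into the pool with CTAS
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.ExtSales;
```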

&lt;p&gt;&lt;strong&gt;Implement Azure Synapse Link and query the replicated data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse Link&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-native implementation of HTAP&lt;/li&gt;
&lt;li&gt;Hybrid transactional and analytical processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directly query data in operational stores&lt;/strong&gt;, no ETL required&lt;/li&gt;
&lt;li&gt;Near real-time querying&lt;/li&gt;
&lt;li&gt;Supports Cosmos DB, Azure SQL, Dataverse&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Cosmos DB

&lt;ul&gt;
&lt;li&gt;Fully managed NoSQL platform&lt;/li&gt;
&lt;li&gt;Supports MongoDB, Table, Cassandra, and Gremlin&lt;/li&gt;
&lt;li&gt;Global distribution - data can be replicated to multiple regions&lt;/li&gt;
&lt;li&gt;Elastic scalability&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synapse Link for Cosmos DB&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transactional store is synced to analytical&lt;/strong&gt; store from which Synapse can read data&lt;/li&gt;
&lt;li&gt;No performance impact on the transactional store&lt;/li&gt;
&lt;li&gt;Analytical store &lt;strong&gt;auto-syncs every 2 mins&lt;/strong&gt; (max 5 mins)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Only accessible from Synapse&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Only charged for storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supports change data capture and time travel&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;In Cosmos DB account, see Azure Synapse Link under Integrations on the left-side panel&lt;/li&gt;
&lt;li&gt;Enable Synapse Link&lt;/li&gt;
&lt;li&gt;Create the container, setting Analytical Store to On&lt;/li&gt;
&lt;li&gt;To connect in Synapse Link, get primary account key from the Keys under Settings in the left-side panel&lt;/li&gt;
&lt;li&gt;In Synapse workspace, go to Data and setup linked service and data source for Cosmos DB&lt;/li&gt;
&lt;li&gt;Open a SQL script to query the data in Cosmos DB&lt;/li&gt;
&lt;li&gt;Create a credential with the primary key&lt;/li&gt;
&lt;li&gt;Use OPENROWSET to query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
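&lt;p&gt;The final query step above might look like this sketch, using the positional OPENROWSET form for the Cosmos DB analytical store (account, database, container, and key are illustrative placeholders):&lt;/p&gt;

```sql
-- Query the Cosmos DB analytical store from Synapse serverless SQL
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=myaccount;Database=mydb;Key=<primary-account-key>',
    MyContainer
) AS rows;
```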

&lt;p&gt;&lt;strong&gt;Create data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General steps

&lt;ul&gt;
&lt;li&gt;Configure firewall to allow IP address and Azure Services to connect to data sources and sinks&lt;/li&gt;
&lt;li&gt;Create an ADF/Synapse instance&lt;/li&gt;
&lt;li&gt;Create a linked service to the source data&lt;/li&gt;
&lt;li&gt;Create a new dataset from the data in the linked service&lt;/li&gt;
&lt;li&gt;Create a Data Flow

&lt;ul&gt;
&lt;li&gt;Select data source&lt;/li&gt;
&lt;li&gt;Choose transformation steps (join, group, conditional split, etc)&lt;/li&gt;
&lt;li&gt;Select sink&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Create a new Pipeline&lt;/li&gt;
&lt;li&gt;Choose a Copy activity and/or the Data Flow&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scale resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types of scaling

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vertical&lt;/strong&gt; scaling (up/down) = add more resources to a machine to make it more powerful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal&lt;/strong&gt; scaling (in/out) = add more machines&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Azure SQL&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Supports both up and out depending on config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During up/down&lt;/strong&gt;, the following can be changed

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service tier&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DTU model: basic, standard, and premium&lt;/li&gt;
&lt;li&gt;vCore model: general purpose, hyperscale, business critical&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute tier&lt;/strong&gt; (vCore): provisioned or serverless&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; (CPU, RAM, storage, etc)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling up/down results in database restart&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;To scale out, can only add &lt;strong&gt;up to 4 read-only replicas&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the portal (Azure SQL database)

&lt;ul&gt;
&lt;li&gt;Go to Compute + storage&lt;/li&gt;
&lt;li&gt;Select an option in Service tier&lt;/li&gt;
&lt;li&gt;Choose Compute tier&lt;/li&gt;
&lt;li&gt;Use sliders to select vCores, DTUs, Read scale-out, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Dedicated SQL Pool&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Increase/decrease number of compute nodes and memory on each node&lt;/li&gt;
&lt;li&gt;Defined using DWUs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
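&lt;p&gt;Scaling a dedicated SQL pool can also be done in T-SQL by changing the service objective (pool name and DWU level are illustrative; run from the master database):&lt;/p&gt;

```sql
-- Scale the dedicated SQL pool to 200 cDWUs
ALTER DATABASE MySqlPool
MODIFY (SERVICE_OBJECTIVE = 'DW200c');
```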

&lt;p&gt;&lt;strong&gt;Create tests for data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing pipelines is different from testing applications because we're testing data instead of code&lt;/li&gt;
&lt;li&gt;Automated testing involves automating the process of validating if pipeline is providing expected output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Types of tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Test individual units &lt;/li&gt;
&lt;li&gt;In data pipeline, &lt;strong&gt;run each activity individually&lt;/strong&gt; and validate result&lt;/li&gt;
&lt;li&gt;Hard to do in ADF&lt;/li&gt;
&lt;li&gt;Programmatically enable one activity at a time and disable others&lt;/li&gt;
&lt;li&gt;Generate and use fake data to test edge cases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Have pipeline generate actual output and compare to expected output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run complete pipeline&lt;/strong&gt;, not just individual activities&lt;/li&gt;
&lt;li&gt;Used to confirm that pipeline meets business requirements&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance and regression tests&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Regression tests ensure that change in one pipeline doesn't impact other pipelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run multiple dependent pipelines together&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Performance test to ensure pipeline meets SLAs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data quality tests

&lt;ul&gt;
&lt;li&gt;Verify if data meets quality standards&lt;/li&gt;
&lt;li&gt;Typically embedded as part of the pipeline&lt;/li&gt;
&lt;li&gt;Completeness&lt;/li&gt;
&lt;li&gt;Uniqueness&lt;/li&gt;
&lt;li&gt;Timeliness&lt;/li&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
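&lt;p&gt;Data quality checks like the uniqueness and completeness items above can be embedded as plain queries in the pipeline; a sketch (table and column names are illustrative):&lt;/p&gt;

```sql
-- Uniqueness: any rows returned means duplicated business keys
SELECT CustomerId, COUNT(*) AS Duplicates
FROM dbo.Customers
GROUP BY CustomerId
HAVING COUNT(*) > 1;

-- Completeness: count rows missing a required field
SELECT COUNT(*) AS MissingEmail
FROM dbo.Customers
WHERE Email IS NULL;
```

&lt;p&gt;A validation activity can fail the pipeline when either query returns a nonzero result.&lt;/p&gt;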

&lt;p&gt;&lt;strong&gt;Integrate Jupyter or Python notebooks into a data pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notebooks are typically used for Spark apps and development&lt;/li&gt;
&lt;li&gt;Notebooks are supported natively in services like Databricks and Synapse&lt;/li&gt;
&lt;li&gt;Basic steps for Synapse

&lt;ul&gt;
&lt;li&gt;Create Synapse Spark pool&lt;/li&gt;
&lt;li&gt;Create new notebook and define language&lt;/li&gt;
&lt;li&gt;Attach notebook to Spark pool&lt;/li&gt;
&lt;li&gt;Write code to read and process data&lt;/li&gt;
&lt;li&gt;Add parameters to notebook&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;To invoke notebook in ADF

&lt;ul&gt;
&lt;li&gt;Create linked service to Synapse (under compute, not storage)&lt;/li&gt;
&lt;li&gt;Make sure ADF has manage permissions for Synapse Spark and access to storage&lt;/li&gt;
&lt;li&gt;Create pipeline and add notebook activity&lt;/li&gt;
&lt;li&gt;Select notebook and parameters&lt;/li&gt;
&lt;li&gt;Run pipeline&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Mapping Data Flows in Azure Synapse pipelines and Azure Data Factory pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping Data Flows provides &lt;strong&gt;no-code ETL workflow&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can apply transformations to source data

&lt;ul&gt;
&lt;li&gt;Add/remove columns, rename, filter, join, aggregate&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translated into Spark code&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Automatically adds optimizations&lt;/li&gt;
&lt;li&gt;Can add user-defined optimizations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executes on a Spark cluster&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Called Data Flow Debug&lt;/li&gt;
&lt;li&gt;Can define cluster configuration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Pros and Cons

&lt;ul&gt;
&lt;li&gt;Pros

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster development&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UI based drag-and-drop approach&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast and scalable processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Cons

&lt;ul&gt;
&lt;li&gt;Less flexible since code can't be modified&lt;/li&gt;
&lt;li&gt;Can be complex for large workflows&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Upsert data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DML statements

&lt;ul&gt;
&lt;li&gt;Select, insert, update, delete&lt;/li&gt;
&lt;li&gt;Upsert is combo of update and insert - update if exists, insert if not&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Options to change data in Azure SQL

&lt;ul&gt;
&lt;li&gt;Using T-SQL (DML statements, merge command)&lt;/li&gt;
&lt;li&gt;Data Factory/Synapse pipelines (copy, data flow with Alter Row)&lt;/li&gt;
&lt;li&gt;Can upsert on files in Data Lake using Delta Lake&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Options to perform upsert&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T-SQL "merge" command&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Specify source with "USING"&lt;/li&gt;
&lt;li&gt;Specify join condition&lt;/li&gt;
&lt;li&gt;"WHEN MATCHED" = behavior for existing records&lt;/li&gt;
&lt;li&gt;"WHEN NOT MATCHED BY TARGET" = behavior for records not in target&lt;/li&gt;
&lt;li&gt;"WHEN NOT MATCHED BY SOURCE" = behavior for records not in source&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy activity&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Change write behavior in sink to upsert and define key columns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data flows&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;alter row&lt;/strong&gt; transformation&lt;/li&gt;
&lt;li&gt;Define alter row conditions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
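&lt;p&gt;The T-SQL MERGE pattern above might be sketched as follows (table and column names are illustrative):&lt;/p&gt;

```sql
MERGE dbo.Customers AS t                  -- target
USING staging.Customers AS s              -- source, via USING
    ON t.CustomerId = s.CustomerId        -- join condition
WHEN MATCHED THEN                         -- existing records: update
    UPDATE SET t.Name = s.Name, t.Email = s.Email
WHEN NOT MATCHED BY TARGET THEN           -- new records: insert
    INSERT (CustomerId, Name, Email)
    VALUES (s.CustomerId, s.Name, s.Email)
WHEN NOT MATCHED BY SOURCE THEN           -- records gone from source
    DELETE;
```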

&lt;p&gt;&lt;strong&gt;Revert data to a previous state in Azure storage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restorable entities

&lt;ul&gt;
&lt;li&gt;Individual file (blob) - can revert to previous version or undelete&lt;/li&gt;
&lt;li&gt;Container - container and files can be reverted or undeleted&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Restoring &lt;strong&gt;files&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use snapshot&lt;/strong&gt; (&lt;strong&gt;read-only&lt;/strong&gt; version of file from point in time)

&lt;ul&gt;
&lt;li&gt;Created manually by user or application&lt;/li&gt;
&lt;li&gt;Used to restore back to prior version&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable versioning&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;Enabled at &lt;strong&gt;storage account level&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Auto creates snapshots when file is updated&lt;/li&gt;
&lt;li&gt;Select and restore a specific version&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable soft delete&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enabled at &lt;strong&gt;storage account level&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deleted files can be restored for a certain number of days&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Restoring &lt;strong&gt;containers&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enable point-in-time restore&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Restores container to specific point in time&lt;/li&gt;
&lt;li&gt;Enabled at &lt;strong&gt;storage account level&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning, change feed, and soft delete must also be enabled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Enable soft delete

&lt;ul&gt;
&lt;li&gt;Enabled at storage account level&lt;/li&gt;
&lt;li&gt;Deleted containers can be restored for a certain number of days&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the storage account portal, these options are under Data management --&amp;gt; Data protection in the left-side panel&lt;/li&gt;
&lt;li&gt;File versions and snapshots can be viewed in blob properties by clicking on the file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Revert data to a previous state in Azure SQL and Dedicated SQL Pool&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL backup&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatically creates backups&lt;/strong&gt; based on SQL Server technology

&lt;ul&gt;
&lt;li&gt;Full backups every week&lt;/li&gt;
&lt;li&gt;Differential backups every 12 to 24 hours&lt;/li&gt;
&lt;li&gt;Transaction log backups every 10 mins&lt;/li&gt;
&lt;li&gt;Backups are stored in Azure Storage&lt;/li&gt;
&lt;li&gt;Redundancy is configurable&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Point-in-time restore (auto)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Auto-created backup&lt;/li&gt;
&lt;li&gt;Kept for limited days (1 to 35, &lt;strong&gt;default is 7&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term retention (not auto)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Define policy to keep backups longer&lt;/li&gt;
&lt;li&gt;Configure weekly, monthly, yearly backups and keep &lt;strong&gt;up to 10 years&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL restore&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Restore using PITR or LTR&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;For PITR restore, the service identifies which backups to use&lt;/li&gt;
&lt;li&gt;For LTR, database can be restored in same or different region&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Restore deleted database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restore creates a new database&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use to update or replace existing database&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the Azure SQL Server portal

&lt;ul&gt;
&lt;li&gt;Data management --&amp;gt; Backups to view restore point details and retention policies&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedicated SQL backup and restore&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local backup&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Dedicated SQL automatically creates snapshots used as restore points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Up to 42&lt;/strong&gt; user-defined restore points can be created&lt;/li&gt;
&lt;li&gt;Restore points are &lt;strong&gt;retained for 7 days&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo backup&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Created every 24 hours and stored in a different region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only latest backup&lt;/strong&gt; is retained&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Restore database in any region using restore points

&lt;ul&gt;
&lt;li&gt;Restore creates a new database that updates or replaces existing one&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
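&lt;p&gt;The PITR chain selection above (latest full backup, then the latest differential after it, then every log backup up to the restore point) can be sketched in plain Python. The schedule and names below are illustrative only; Azure does this selection internally:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def pick_backups(backups, restore_point):
    """Pick the backup chain for a point-in-time restore.

    `backups` is a list of (kind, taken_at) tuples, kind being 'full',
    'diff', or 'log'. Illustrative sketch, not the Azure API.
    """
    # Latest full backup taken no later than the restore point.
    full = max(t for k, t in backups if k == 'full' and not t > restore_point)
    # Latest differential after that full, still no later than the restore point.
    diffs = [t for k, t in backups
             if k == 'diff' and t > full and not t > restore_point]
    diff = max(diffs) if diffs else None
    base = diff or full
    # Every transaction-log backup after the base, up to the restore point.
    logs = sorted(t for k, t in backups
                  if k == 'log' and t > base and not t > restore_point)
    return full, diff, logs

start = datetime(2023, 12, 3)  # hypothetical weekly full backup
backups = [('full', start), ('diff', start + timedelta(hours=12))]
backups += [('log', start + timedelta(hours=12, minutes=10 * i))
            for i in range(1, 7)]

full, diff, logs = pick_backups(backups, start + timedelta(hours=13))
```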

&lt;p&gt;&lt;strong&gt;Configure exception handling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a &lt;strong&gt;single activity&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try/catch block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;When one activity fails, a second activity runs that performs action based on failure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try/catch/proceed block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Last activity (proceed) runs whether the first activity succeeds or fails, because a skip path covers the case where the middle (catch) activity is skipped&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If/else block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;One path for success, different path for failure&lt;/li&gt;
&lt;li&gt;Pipeline succeeds if the first activity does and fails otherwise&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If/skip/else block&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline succeeds whether first activity succeeds or fails because a failure causes a skip to other activities&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;multiple activities&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sequential run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Activities are sequential &lt;/li&gt;
&lt;li&gt;One or more activities are configured to run on failure or skip of previous activity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pipeline continues regardless of upstream failure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel run&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Some activities are parallel&lt;/li&gt;
&lt;li&gt;Downstream activity &lt;strong&gt;depends on success of all parallel activities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Further downstream activity can be configured to run after skip so pipeline continues even if parallel activities fail&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
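&lt;p&gt;The dependency conditions behind these blocks (run on success, failure, skip) can be sketched as a small evaluator. This is an illustrative model of ADF-style conditions, not the real engine:&lt;/p&gt;

```python
def run_pipeline(activities, deps, outcomes):
    """Evaluate ADF-style dependency conditions.

    `deps` maps an activity to a list of (upstream, allowed_conditions).
    `outcomes` pre-seeds what a leaf activity would do ('Succeeded' or
    'Failed'). A downstream activity runs only when every upstream
    finished with an allowed condition; otherwise it is 'Skipped'.
    """
    result = {}
    for act in activities:  # assumed to be in dependency order
        ok = all(result[u] in conds for u, conds in deps.get(act, []))
        result[act] = outcomes.get(act, 'Succeeded') if ok else 'Skipped'
    return result

# Try/catch/proceed: catch runs on failure; proceed runs on success OR skip.
deps = {
    'catch':   [('try', ['Failed'])],
    'proceed': [('catch', ['Succeeded', 'Skipped'])],
}
happy = run_pipeline(['try', 'catch', 'proceed'], deps, {'try': 'Succeeded'})
sad = run_pipeline(['try', 'catch', 'proceed'], deps, {'try': 'Failed'})
```

In the happy path the catch is skipped but proceed still runs; in the sad path the catch runs and proceed follows it.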

&lt;p&gt;&lt;strong&gt;Read from and write to a delta lake&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data lake challenges

&lt;ul&gt;
&lt;li&gt;Data reliability issues

&lt;ul&gt;
&lt;li&gt;Corruption because of failures (no rollback)&lt;/li&gt;
&lt;li&gt;No data validation&lt;/li&gt;
&lt;li&gt;Consistency issues while reading data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;No updates/deletes/merges on files

&lt;ul&gt;
&lt;li&gt;Difficult to implement GDPR/CCPA compliance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data quality issues

&lt;ul&gt;
&lt;li&gt;Schema isn't verified before writing&lt;/li&gt;
&lt;li&gt;Cannot apply checks on data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Query performance issues&lt;/li&gt;
&lt;li&gt;Difficult to maintain historical versions of data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Open-source storage layer that &lt;strong&gt;brings reliability to data lakes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can be installed on-prem&lt;/li&gt;
&lt;li&gt;Available by default on many cloud platforms&lt;/li&gt;
&lt;li&gt;Provides database-like features on top of data lake

&lt;ul&gt;
&lt;li&gt;Create constraints, enforce schema, run DML statements, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provides ACID guarantees&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Works by storing a transaction log of all transactions performed on data (dataframe.write.format("delta"))

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log file is not created until after writing is done and is not created if there is a failure&lt;/strong&gt;, which helps ensure ACID guarantees&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Delta Lake availability

&lt;ul&gt;
&lt;li&gt;Can be downloaded and installed

&lt;ul&gt;
&lt;li&gt;On local machine&lt;/li&gt;
&lt;li&gt;On-prem Spark cluster&lt;/li&gt;
&lt;li&gt;Cloud platforms like Azure HDInsight&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Available by default in cloud platforms

&lt;ul&gt;
&lt;li&gt;Azure Databricks&lt;/li&gt;
&lt;li&gt;Azure Synapse Spark pools&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the portal (Databricks)

&lt;ul&gt;
&lt;li&gt;Use spark.conf.set to connect to storage&lt;/li&gt;
&lt;li&gt;Use dbutils.fs.ls to list files in storage path&lt;/li&gt;
&lt;li&gt;Define input and output folder paths, use input to read (spark.read.option(...).csv(...))&lt;/li&gt;
&lt;li&gt;To write to Delta Lake

&lt;ul&gt;
&lt;li&gt;Write in Delta format with output path DF.write.format("delta").save(outputPath + "filename.delta")&lt;/li&gt;
&lt;li&gt;Check output location in storage to confirm write&lt;/li&gt;
&lt;li&gt;Check the _delta_log folder to see metadata about the write&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;To read from Delta Lake

&lt;ul&gt;
&lt;li&gt;Use Spark SQL to create a database&lt;/li&gt;
&lt;li&gt;Create a table in the database using CREATE TABLE table_name USING DELTA LOCATION "delta_file_path/filename.delta"&lt;/li&gt;
&lt;li&gt;DESCRIBE HISTORY table_name can be used to audit the history of the Delta table&lt;/li&gt;
&lt;li&gt;Read different versions of data using SELECT * FROM table_name VERSION AS OF [version number], or SELECT * FROM table_name TIMESTAMP AS OF '[timestamp]'&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Can restore previous versions with RESTORE TABLE table_name TO VERSION AS OF [version number]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
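&lt;p&gt;The log-after-write behavior above can be sketched as a toy commit protocol. Real _delta_log entries carry far more metadata; this only illustrates why a missing log entry makes a failed write invisible to readers:&lt;/p&gt;

```python
import json
import os
import tempfile

def delta_style_write(table_dir, rows, version):
    """Write a data file, then record a commit in the log only after the
    write succeeds -- the property behind Delta's ACID guarantees.
    Toy sketch of the idea only, not the real Delta format.
    """
    os.makedirs(os.path.join(table_dir, '_delta_log'), exist_ok=True)
    data_path = os.path.join(table_dir, 'part-{:05d}.json'.format(version))
    with open(data_path, 'w') as f:
        json.dump(rows, f)  # if this raises, no log entry is ever created
    log_path = os.path.join(table_dir, '_delta_log',
                            '{:020d}.json'.format(version))
    with open(log_path, 'w') as f:  # the commit: readers only see logged files
        json.dump({'add': os.path.basename(data_path)}, f)

def committed_files(table_dir):
    """Readers trust the transaction log, not the directory listing."""
    log_dir = os.path.join(table_dir, '_delta_log')
    out = []
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            out.append(json.load(f)['add'])
    return out

tmp = tempfile.mkdtemp()
delta_style_write(tmp, [{'id': 1}], 0)
```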

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Ingest and transform data</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 18:03:32 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-ingest-and-transform-data-108a</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-ingest-and-transform-data-108a</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design and implement incremental loads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watermarks

&lt;ul&gt;
&lt;li&gt;Column in source table with last updated time stamp or incrementing key&lt;/li&gt;
&lt;li&gt;Marks the most recent update in the table&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Delta loading

&lt;ul&gt;
&lt;li&gt;Essentially the same as incremental loading&lt;/li&gt;
&lt;li&gt;Only new or changed data is processed, whether loading, transforming, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 basic design options&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Delta loading using a &lt;strong&gt;watermark&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Delta loading from SQL DB using &lt;strong&gt;change tracking technology&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Loading new and changed files only &lt;strong&gt;using LastModifiedDate&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Loading new files only using a &lt;strong&gt;partitioned folder or file name&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Considerations

&lt;ul&gt;
&lt;li&gt;Volume and type of data&lt;/li&gt;
&lt;li&gt;Load on system&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steps&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Query to get old watermark&lt;/li&gt;
&lt;li&gt;Query to get new watermark&lt;/li&gt;
&lt;li&gt;Load data between watermarks&lt;/li&gt;
&lt;li&gt;Update watermark&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
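&lt;p&gt;The four watermark steps above can be sketched with an in-memory SQLite database standing in for the source and sink (table and column names are hypothetical):&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema: a source table with a last_modified watermark
# column, and a one-row table remembering the previous high-water mark.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE src (id INTEGER, val TEXT, last_modified TEXT);
    CREATE TABLE sink (id INTEGER, val TEXT, last_modified TEXT);
    CREATE TABLE watermark (mark TEXT);
    INSERT INTO watermark VALUES ('2023-12-01');
    INSERT INTO src VALUES (1, 'old', '2023-11-30'), (2, 'new', '2023-12-02');
""")

# Step 1: get the old watermark. Step 2: get the new watermark.
old_mark = con.execute('SELECT mark FROM watermark').fetchone()[0]
new_mark = con.execute('SELECT MAX(last_modified) FROM src').fetchone()[0]

# Step 3: load only the rows between the two watermarks.
con.execute(
    'INSERT INTO sink SELECT * FROM src '
    'WHERE last_modified > ? AND NOT last_modified > ?',
    (old_mark, new_mark))

# Step 4: persist the new watermark for the next run.
con.execute('UPDATE watermark SET mark = ?', (new_mark,))
loaded = con.execute('SELECT id FROM sink ORDER BY id').fetchall()
```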

&lt;p&gt;&lt;strong&gt;Transform data by using Apache Spark&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark

&lt;ul&gt;
&lt;li&gt;Can be used in Synapse, Databricks, and Data Factory&lt;/li&gt;
&lt;li&gt;Ecosystem

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Spark Core&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;Basic functionalities (task scheduling, memory management)&lt;/li&gt;
&lt;li&gt;Can be abstracted through APIs&lt;/li&gt;
&lt;li&gt;Can be done in R, Python, Scala, and Java&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL&lt;/strong&gt; - similar to standard SQL but allows queries on data in Spark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark Streaming&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MLlib&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GraphX&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;More about Spark architecture

&lt;ul&gt;
&lt;li&gt;Spark core: RDDs and languages&lt;/li&gt;
&lt;li&gt;Spark SQL engine: &lt;a href="https://www.databricks.com/glossary/catalyst-optimizer"&gt;Catalyst optimizer&lt;/a&gt;, &lt;a href="https://www.databricks.com/glossary/tungsten"&gt;Tungsten&lt;/a&gt; (memory/CPU mgmt)&lt;/li&gt;
&lt;li&gt;DataFrame/Dataset APIs&lt;/li&gt;
&lt;li&gt;Spark Graph, Spark ML, Spark Streaming, Spark SQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Azure Synapse notebooks in the portal

&lt;ul&gt;
&lt;li&gt;Develop on the left-side panel&lt;/li&gt;
&lt;li&gt;Click +, then Notebook&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Must have Spark pool attached before running a notebook&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Go to Manage in left-side panel&lt;/li&gt;
&lt;li&gt;Analytics pools --&amp;gt; Apache Spark pools --&amp;gt; choose name and settings --&amp;gt; Review and create&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Write and execute code in cells like a typical notebook&lt;/li&gt;
&lt;li&gt;Click + --&amp;gt; Browse gallery --&amp;gt; Notebooks to see example notebooks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For the exam, know the basics of Synapse notebooks; Synapse architecture questions are more likely to be about keywords than fine details
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transform data by using Transact-SQL (T-SQL) in Azure Synapse Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transact-SQL

&lt;ul&gt;
&lt;li&gt;For querying data in a &lt;strong&gt;data lake&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;SQL serverless pools&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Query data without loading it into database storage&lt;/li&gt;
&lt;li&gt;Standard formats are CSV, JSON, and Parquet&lt;/li&gt;
&lt;li&gt;Useful for &lt;strong&gt;OLAP&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Develop --&amp;gt; New --&amp;gt; SQL Script&lt;/li&gt;
&lt;li&gt;[FROM] &lt;strong&gt;OPENROWSET&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use instead of defining a table&lt;/li&gt;
&lt;li&gt;Mimics the properties of a table, but uses data lake object as a source&lt;/li&gt;
&lt;li&gt;Choose file URL, format, and parser version if CSV&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ingest and transform data by using Azure Synapse Pipelines or Azure Data Factory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common data ingestion pipelines

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Functions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Serverless compute&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Short run processing (only designed to run for short periods of time)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom component&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Low-scale parallel computing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heavy algorithms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Requires wrapping code into an executable (more complex)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Databricks&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark, designed for &lt;strong&gt;massive and complex data&lt;/strong&gt; transformations&lt;/li&gt;
&lt;li&gt;Expensive and complicated&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Data Factory&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Suitable for &lt;strong&gt;light transformation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can include above methods as activities&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Copy performance

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance"&gt;Performance chart&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Shows how long a copy will take based on amount of data and bandwidth&lt;/li&gt;
&lt;li&gt;Can help with assessing costs of running pipelines&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Most work is done in Author section in left-side panel&lt;/li&gt;
&lt;li&gt;Under Factory Resources there are pipelines, datasets, etc&lt;/li&gt;
&lt;li&gt;Linked services are not shown separately; they are lumped in with datasets&lt;/li&gt;
&lt;li&gt;Under Datasets, click + to add Dataset

&lt;ul&gt;
&lt;li&gt;Choose Service&lt;/li&gt;
&lt;li&gt;Name Dataset and select Linked service&lt;/li&gt;
&lt;li&gt;If you choose New service, input connection details, including subscription, server, database, authentication, etc&lt;/li&gt;
&lt;li&gt;Select dataset from the linked service (table name, file, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Under Datasets you can view and preview the dataset&lt;/li&gt;
&lt;li&gt;Click + to add a new pipeline

&lt;ul&gt;
&lt;li&gt;Select an activity, i.e. Copy data&lt;/li&gt;
&lt;li&gt;In the activity settings at the bottom, choose source, sink, copy behavior, and other settings&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Dataflows allow you to set up transformations within ADF

&lt;ul&gt;
&lt;li&gt;These dataflows can be included as activities in the pipeline&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences"&gt;Differences between ADF and Synapse&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;ADF has

&lt;ul&gt;
&lt;li&gt;Cross-region &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime"&gt;integration runtime&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Runtime sharing&lt;/li&gt;
&lt;li&gt;Power Query activity&lt;/li&gt;
&lt;li&gt;Global parameters&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Synapse has

&lt;ul&gt;
&lt;li&gt;Monitoring Spark jobs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Both have

&lt;ul&gt;
&lt;li&gt;Solution templates (ADF template gallery, Synapse knowledge center)&lt;/li&gt;
&lt;li&gt;GIT integration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ADF/Synapse portal differences

&lt;ul&gt;
&lt;li&gt;ADF has

&lt;ul&gt;
&lt;li&gt;Home&lt;/li&gt;
&lt;li&gt;Author - pipelines, datasets, data flows, Power Query, and templates&lt;/li&gt;
&lt;li&gt;Monitor - dashboards for pipeline/trigger runs, integration runtimes, data flow debug, alerts/metrics&lt;/li&gt;
&lt;li&gt;Manage&lt;/li&gt;
&lt;li&gt;Learning center&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Synapse has

&lt;ul&gt;
&lt;li&gt;Home&lt;/li&gt;
&lt;li&gt;Data - SQL/Lake database, external datasets, and integration datasets&lt;/li&gt;
&lt;li&gt;Develop - SQL scripts, notebooks, data flows&lt;/li&gt;
&lt;li&gt;Integrate - pipelines, Synapse Link connections&lt;/li&gt;
&lt;li&gt;Monitor - pools, requests, Spark, pipeline/trigger runs, integration runtimes, Link connections&lt;/li&gt;
&lt;li&gt;Manage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transform data by using Azure Stream Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Stream Analytics

&lt;ul&gt;
&lt;li&gt;Only for streaming solutions, not batch&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Input can be Blob Storage, Event Hubs, or IoT Hub&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;These input to the &lt;strong&gt;query layer where transformations happen&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Query &lt;strong&gt;outputs to Blob storage or Power BI&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Queries

&lt;ul&gt;
&lt;li&gt;SELECT * INTO output FROM input&lt;/li&gt;
&lt;li&gt;Choose specific columns, where clauses, aggregations, etc &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cleanse data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process overview

&lt;ul&gt;
&lt;li&gt;Investigate the data&lt;/li&gt;
&lt;li&gt;Perform cleaning steps (unique to data set)&lt;/li&gt;
&lt;li&gt;Evaluate the results

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validity&lt;/strong&gt; (does it match business rules?)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Completeness&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt; (is there conflicting data?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniformity&lt;/strong&gt; (are data points using same units of measure?)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Common tools

&lt;ul&gt;
&lt;li&gt;ADF, Synapse (almost identical for this purpose)&lt;/li&gt;
&lt;li&gt;Azure Stream Analytics (can be harder to clean)&lt;/li&gt;
&lt;li&gt;Databricks (more complicated, but versatile and useful for massive data)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal (ADF)

&lt;ul&gt;
&lt;li&gt;Create a Data Flow, choose sources&lt;/li&gt;
&lt;li&gt;Preview data to see which fields can join data&lt;/li&gt;
&lt;li&gt;Consider how columns can be filtered or removed to provide value or remove extraneous data&lt;/li&gt;
&lt;li&gt;Once cleansing is done, choose sink&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Handle duplicate data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dedupe = eliminate unnecessary copies

&lt;ul&gt;
&lt;li&gt;Consider technology knowledge &lt;/li&gt;
&lt;li&gt;Consider complexity&lt;/li&gt;
&lt;li&gt;Consider accompanying solutions (SQL queries, ADF data flows, Spark, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Basic steps (in ADF)

&lt;ul&gt;
&lt;li&gt;Create data flow&lt;/li&gt;
&lt;li&gt;Choose source&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;script snippet&lt;/strong&gt; (scroll symbol in top right of editor, snippets can be found on &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/how-to-data-flow-dedupe-nulls-snippets?source=recommendations"&gt;Microsoft Learn&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Choose destination&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
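&lt;p&gt;A minimal sketch of the dedupe idea in plain Python, keeping the first row seen for each key (the ADF snippet achieves the equivalent in data flow script):&lt;/p&gt;

```python
def dedupe(rows, key_columns):
    """Keep only the first row seen for each key combination."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {'id': 1, 'name': 'Ada'},
    {'id': 1, 'name': 'Ada'},   # duplicate, dropped
    {'id': 2, 'name': 'Grace'},
]
unique = dedupe(rows, ['id'])
```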

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/stream-analytics-query/event-delivery-guarantees-azure-stream-analytics"&gt;&lt;strong&gt;Avoiding duplicate data by using Azure Stream Analytics Exactly Once Delivery&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle missing data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determine impact of missing data, sometimes it won't be a big deal&lt;/li&gt;
&lt;li&gt;Options of handling missing data

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop rows&lt;/strong&gt; that have the missing data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imputation&lt;/strong&gt; = assign an inferred value to the missing element&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include the rows&lt;/strong&gt; that are missing data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
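&lt;p&gt;Imputation, one of the options above, can be sketched as a mean fill (the column name is hypothetical):&lt;/p&gt;

```python
def impute_mean(rows, column):
    """Replace missing (None) values with the mean of the present values."""
    present = [r[column] for r in rows if r[column] is not None]
    mean = sum(present) / len(present)
    return [dict(r, **{column: r[column] if r[column] is not None else mean})
            for r in rows]

rows = [{'qty': 10}, {'qty': None}, {'qty': 20}]
filled = impute_mean(rows, 'qty')
```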

&lt;p&gt;&lt;strong&gt;Handle late-arriving data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Definitions

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event time&lt;/strong&gt; = when original event &lt;strong&gt;happened&lt;/strong&gt; (order is given to waiter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing time&lt;/strong&gt; = when event is &lt;strong&gt;observed&lt;/strong&gt; (waiter gives order to kitchen)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermark&lt;/strong&gt; = stamp identifying when event has been &lt;strong&gt;ingressed&lt;/strong&gt; into system&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Handle late arriving data by choosing a level of tolerance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Consequences of tolerance

&lt;ul&gt;
&lt;li&gt;Tolerance = window considered acceptable for late arrival&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critical events can be missed&lt;/strong&gt; without proper tolerance&lt;/li&gt;
&lt;li&gt;Delayed outputs can result in &lt;strong&gt;broken processes or bad reports&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
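&lt;p&gt;The tolerance policy can be sketched as a simple acceptance check on event time versus arrival time. Stream Analytics can also adjust late timestamps rather than drop events; this only shows the window test:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def accept(event_time, arrival_time, tolerance):
    """Accept an event only if it arrived within the late-arrival
    tolerance window of its event time.
    """
    return not arrival_time - event_time > tolerance

t0 = datetime(2023, 12, 5, 12, 0)
tol = timedelta(minutes=5)
on_time = accept(t0, t0 + timedelta(minutes=3), tol)   # within tolerance
too_late = accept(t0, t0 + timedelta(minutes=9), tol)  # outside tolerance
```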

&lt;p&gt;&lt;strong&gt;Split data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splitting data allows making paths to multiple sinks from the same source&lt;/li&gt;
&lt;li&gt;Conditional splits

&lt;ul&gt;
&lt;li&gt;Route data to different outputs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Available in ADF and Synapse&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Steps

&lt;ul&gt;
&lt;li&gt;Create data flow&lt;/li&gt;
&lt;li&gt;Use conditional split transformation&lt;/li&gt;
&lt;li&gt;Set split conditions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data flow scripts

&lt;ul&gt;
&lt;li&gt;Can use scripts to do the steps above&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
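&lt;p&gt;The conditional split steps above can be sketched in plain Python, routing each row to the first matching output (sink names and conditions are hypothetical):&lt;/p&gt;

```python
def conditional_split(rows, conditions):
    """Route each row to the first matching output; unmatched rows go
    to a default sink, like an ADF conditional split transformation.
    """
    sinks = {name: [] for name, _ in conditions}
    sinks['default'] = []
    for row in rows:
        for name, pred in conditions:
            if pred(row):
                sinks[name].append(row)
                break
        else:
            sinks['default'].append(row)
    return sinks

rows = [{'amount': 5}, {'amount': 50}, {'amount': 500}]
sinks = conditional_split(rows, [
    ('large', lambda r: r['amount'] > 100),
    ('medium', lambda r: r['amount'] > 10),
])
```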

&lt;p&gt;&lt;strong&gt;Shred JSON&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shredding JSON = &lt;strong&gt;extracting&lt;/strong&gt; data from a JSON file and &lt;strong&gt;transferring to a table&lt;/strong&gt; (aka parsing)&lt;/li&gt;
&lt;li&gt;Done in Synapse or ADF&lt;/li&gt;
&lt;li&gt;Once data is extracted it is &lt;strong&gt;persisted to a data store&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/sql/t-sql/functions/openjson-transact-sql?view=sql-server-ver16"&gt;OPENJSON function&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Table-valued function that parses JSON text&lt;/li&gt;
&lt;li&gt;Returns objects and properties as rows and columns&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
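&lt;p&gt;A minimal sketch of shredding flat JSON into rows and columns, the effect OPENJSON produces in T-SQL (nested objects would need further flattening):&lt;/p&gt;

```python
import json

def shred(json_text):
    """Turn a JSON array of flat objects into column names and row tuples."""
    records = json.loads(json_text)
    columns = sorted({k for rec in records for k in rec})
    rows = [tuple(rec.get(c) for c in columns) for rec in records]
    return columns, rows

doc = '[{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]'
columns, rows = shred(doc)
```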

&lt;p&gt;&lt;strong&gt;Encode and decode data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UTF-8

&lt;ul&gt;
&lt;li&gt;Unicode Transformation Format, 8-bit&lt;/li&gt;
&lt;li&gt;The ASCII problem

&lt;ul&gt;
&lt;li&gt;Assigns a code to every character (256 possibilities)&lt;/li&gt;
&lt;li&gt;As programming expanded, number of available characters ran out&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UTF-8 provides more character possibilities&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Program must understand UTF-8 codes in order to decode information&lt;/li&gt;
&lt;li&gt;There are multiple encoding formats, so the &lt;strong&gt;source and sink must use the same encoding&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Done in ADF and Synapse copy activities&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Can choose encoding and compression properties &lt;strong&gt;in the Dataset properties&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
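&lt;p&gt;The source/sink encoding mismatch above can be demonstrated directly: decoding UTF-8 bytes with the wrong codec silently yields the wrong text:&lt;/p&gt;

```python
# The writer (source) and reader (sink) must agree on the encoding.
text = "Zoë café"
data = text.encode('utf-8')          # source writes UTF-8 bytes

roundtrip = data.decode('utf-8')     # matching codec recovers the text
mismatch = data.decode('latin-1')    # wrong codec: no error, garbled text
```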

&lt;p&gt;&lt;strong&gt;Configure error handling for a transformation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Options for error handling

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transaction commit&lt;/strong&gt;: choose whether to write data in a &lt;strong&gt;single transaction or in batches&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output rejected data&lt;/strong&gt;: log error rows in a CSV in Azure Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success on error&lt;/strong&gt;: mark the activity as successful even if errors occur&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In the portal (ADF)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In an activity's settings, &lt;strong&gt;fault tolerance&lt;/strong&gt; represents a form of success on error, continuing past incompatible data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable logging&lt;/strong&gt; to store files that show rejected rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable staging&lt;/strong&gt; allows for copying in batches&lt;/li&gt;
&lt;li&gt;On the right side of the activity there are buttons for "on success," "on failure," etc.

&lt;ul&gt;
&lt;li&gt;Connect these to other activities to choose how pipeline errors are handled&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In a data flow database sink, there is an &lt;strong&gt;Errors tab&lt;/strong&gt; to configure error handling&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Normalize and denormalize data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normalizing data = reorganizing tables to remove redundant data&lt;/li&gt;
&lt;li&gt;Denormalizing data = adding redundant data to one or more tables&lt;/li&gt;
&lt;li&gt;What and why

&lt;ul&gt;
&lt;li&gt;Normalizing

&lt;ul&gt;
&lt;li&gt;More tables&lt;/li&gt;
&lt;li&gt;Requires multiple joins&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improves accuracy and integrity&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Denormalizing

&lt;ul&gt;
&lt;li&gt;More space&lt;/li&gt;
&lt;li&gt;More difficult to maintain&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improves query performance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Star schema is not normalized&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake schema is normalized&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the portal (Synapse)

&lt;ul&gt;
&lt;li&gt;Inspect the data sources to determine normalization status and identify join columns&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;join transformation&lt;/strong&gt; to combine data sources &lt;strong&gt;for denormalization&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;conditional split&lt;/strong&gt; or a select transformation &lt;strong&gt;to normalize&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Transformations can also be done in the script editor&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
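&lt;p&gt;The join-transformation approach to denormalization can be sketched in plain Python, duplicating dimension columns onto each fact row (table and column names are hypothetical):&lt;/p&gt;

```python
def denormalize(orders, customers, key):
    """Join a fact table to a dimension table, copying the dimension's
    columns onto every matching fact row.
    """
    lookup = {c[key]: c for c in customers}
    return [dict(o, **lookup[o[key]]) for o in orders]

customers = [{'cust_id': 1, 'name': 'Ada'}]
orders = [{'order_id': 10, 'cust_id': 1}, {'order_id': 11, 'cust_id': 1}]
flat = denormalize(orders, customers, 'cust_id')
```

Note the trade-off described above: the customer name is now stored twice (more space) in exchange for join-free reads.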

&lt;p&gt;&lt;strong&gt;Perform data exploratory analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use summary statistics and visualizations to &lt;strong&gt;investigate patterns and anomalies in data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Can be done with SQL, Python, or Kusto queries in Azure Data Explorer&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Design and implement the data exploration layer</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 17:58:40 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-design-and-implement-the-data-exploration-layer-40pl</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-design-and-implement-the-data-exploration-layer-40pl</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create and execute queries by using a compute solution that leverages SQL serverless and Spark cluster&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview"&gt;Azure SQL Serverless&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not SQL database - it is SQL compute&lt;/strong&gt; in Azure Synapse Analytics&lt;/li&gt;
&lt;li&gt;Serverless SQL pool

&lt;ul&gt;
&lt;li&gt;Built in to Synapse&lt;/li&gt;
&lt;li&gt;Always available&lt;/li&gt;
&lt;li&gt;Billed based on usage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data access

&lt;ul&gt;
&lt;li&gt;No data storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data accessed through ADL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPENROWSET&lt;/strong&gt; syntax to access data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Provisioned resources

&lt;ul&gt;
&lt;li&gt;Dedicated SQL pool&lt;/li&gt;
&lt;li&gt;Static number of servers&lt;/li&gt;
&lt;li&gt;User chooses runtime&lt;/li&gt;
&lt;li&gt;Defined cost per data warehouse unit (DWU)&lt;/li&gt;
&lt;li&gt;Data is stored in relational tables using columnar storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Used &lt;strong&gt;mainly for EDA&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Develop section on left-side panel&lt;/li&gt;
&lt;li&gt;Click + button and add SQL script&lt;/li&gt;
&lt;li&gt;Select tables from lake DB or SQL DB&lt;/li&gt;
&lt;li&gt;Choose SQL pool settings&lt;/li&gt;
&lt;li&gt;SELECT * FROM OPENROWSET( BULK '', FORMAT = 'parquet') AS [result] &lt;/li&gt;
&lt;li&gt;SELECT * FROM OPENROWSET( BULK '', FORMAT = 'CSV', Parser_Version = '2.0') AS [result] &lt;/li&gt;
&lt;li&gt;Can also go to Data in left-side panel and link storage account and containers - this can be used to auto-generate basic SELECT queries&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Spark clusters

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview"&gt;Apache Spark in Synapse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;In-memory cluster computing&lt;/li&gt;
&lt;li&gt;Synapse offers ease of use and creation&lt;/li&gt;
&lt;li&gt;Data access is interacting with &lt;strong&gt;Spark pools through notebooks&lt;/strong&gt; (similar to Databricks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databases and tables created in a Spark pool are replicated in a serverless SQL pool as read-only&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Under Develop in left-side panel&lt;/li&gt;
&lt;li&gt;Click + button and select or create a notebook&lt;/li&gt;
&lt;li&gt;Under Manage on left-side panel, create and run an Apache Spark pool&lt;/li&gt;
&lt;li&gt;Be sure to enable automatic pausing, Spark pools are expensive&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommend and implement Azure Synapse Analytics database templates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/overview-database-templates"&gt;Database Templates&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Speed up design process&lt;/li&gt;
&lt;li&gt;Create more thorough databases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/database-designer/concepts-lake-database"&gt;&lt;strong&gt;Lake database in Synapse&lt;/strong&gt;&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Data lakes lack structure&lt;/li&gt;
&lt;li&gt;Databases can be too structured&lt;/li&gt;
&lt;li&gt;Lake database removes these downsides&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provides structured DB with meta info, stored in a data lake&lt;/strong&gt; (parquet, delta, CSV formats)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Powered by serverless Synapse compute&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Architecture process

&lt;ul&gt;
&lt;li&gt;Access Synapse Studio instance&lt;/li&gt;
&lt;li&gt;Create a Lake Database&lt;/li&gt;
&lt;li&gt;Add a Table&lt;/li&gt;
&lt;li&gt;Add Template&lt;/li&gt;
&lt;li&gt;Select relevant features&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Push new or updated data lineage to Microsoft Purview&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/purview/purview"&gt;Microsoft Purview&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified data governance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;On-prem, multi-cloud, SaaS&lt;/li&gt;
&lt;li&gt;4 &lt;strong&gt;pillars&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Data &lt;strong&gt;quality&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;stewardship&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;protection and compliance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;management&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lifecycle management&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Data &lt;strong&gt;catalog&lt;/strong&gt; - organized inventory of data assets&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;estate insights&lt;/strong&gt; - reports that help organizations understand the health and governance coverage of their data estate&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;sharing&lt;/strong&gt; - internally or across orgs&lt;/li&gt;
&lt;li&gt;Data &lt;strong&gt;policy&lt;/strong&gt; - provision access to data at scale&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use cases&lt;/strong&gt; for Purview

&lt;ul&gt;
&lt;li&gt;Pull data from SQL DB and ADL and &lt;strong&gt;provide governance&lt;/strong&gt; across the org&lt;/li&gt;
&lt;li&gt;Financial services can show where critical data is stored to &lt;strong&gt;evaluate security risk&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Large, diverse orgs can &lt;strong&gt;enable data democratization&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Data lineage

&lt;ul&gt;
&lt;li&gt;Track data flow over time&lt;/li&gt;
&lt;li&gt;Origination --&amp;gt; Delta (data changes) --&amp;gt; Sink (output)&lt;/li&gt;
&lt;li&gt;Provides confidence in data&lt;/li&gt;
&lt;li&gt;Facilitates governance and impact analysis&lt;/li&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Lineage tab shows a flow chart with sources, processes, and targets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Browse and search metadata in Microsoft Purview Data Catalog&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the portal

&lt;ul&gt;
&lt;li&gt;Open the Microsoft Purview Governance Portal&lt;/li&gt;
&lt;li&gt;Data Catalog --&amp;gt; Browse --&amp;gt; By collection or source type&lt;/li&gt;
&lt;li&gt;Go to Data map in left-side panel to register data sources

&lt;ul&gt;
&lt;li&gt;Data map --&amp;gt; Data sources --&amp;gt; Register&lt;/li&gt;
&lt;li&gt;Need to do a new scan to establish lineage&lt;/li&gt;
&lt;li&gt;Requires access control to be configured to allow Purview to scan the data sources&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203 Study Guide - Implement a partition strategy</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 17:58:19 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-study-guide-implement-a-partition-strategy-4fkj</link>
      <guid>https://dev.to/aidutcher/dp-203-study-guide-implement-a-partition-strategy-4fkj</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63"&gt;Study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/architecture/best-practices/data-partitioning"&gt;&lt;strong&gt;Data partitioning guidance&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal (sharding)&lt;/strong&gt; = each partition is a separate data store, but all partitions have the same schema (&lt;strong&gt;partitions have different rows&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical&lt;/strong&gt; = each partition holds a subset of the fields according to patterns of use (&lt;strong&gt;partitions have different columns&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional&lt;/strong&gt; = data is aggregated according to &lt;strong&gt;how it is used by each bounded context&lt;/strong&gt; (e.g. invoice data vs product data)&lt;/li&gt;
&lt;/ul&gt;
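&lt;p&gt;A minimal T-SQL sketch of the three approaches (all table, schema, and column names are hypothetical):&lt;/p&gt;

```sql
-- Horizontal (sharding): separate stores with the same schema, different rows
CREATE TABLE dbo.Orders_Shard1 (OrderId INT, CustomerId INT, Amount DECIMAL(18,2));
CREATE TABLE dbo.Orders_Shard2 (OrderId INT, CustomerId INT, Amount DECIMAL(18,2));

-- Vertical: frequently read columns split from rarely read ones
CREATE TABLE dbo.Customer_Core     (CustomerId INT PRIMARY KEY, Name NVARCHAR(100));
CREATE TABLE dbo.Customer_Extended (CustomerId INT PRIMARY KEY, Photo VARBINARY(MAX));

-- Functional: each bounded context keeps its own store
CREATE TABLE invoicing.Invoice (InvoiceId INT, CustomerId INT, Total DECIMAL(18,2));
CREATE TABLE catalog.Product   (ProductId INT, Name NVARCHAR(100));
```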

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for files&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Think through the problem, whiteboard it out&lt;/li&gt;
&lt;li&gt;Parquet

&lt;ul&gt;
&lt;li&gt;Most common file type for big data&lt;/li&gt;
&lt;li&gt;Column-based storage with nested data structures&lt;/li&gt;
&lt;li&gt;Supports parallel processing queries&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://parquet.apache.org/docs/concepts/"&gt;Row-group&lt;/a&gt; sections can be treated as partitions - multiple row-groups can be sent to different nodes&lt;/li&gt;
&lt;li&gt;Break partitions apart based on column values, e.g. querying by a date&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best practices

&lt;ul&gt;
&lt;li&gt;Make sure to include partition columns in table's schema definition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group related records&lt;/strong&gt; together&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Don't use unnecessary columns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;512 MB to 1 GB&lt;/strong&gt; is optimal partition size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider the query&lt;/strong&gt; and how the data will be used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider the expected growth&lt;/strong&gt; of the data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider how static&lt;/strong&gt; the data is&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
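&lt;p&gt;One way these practices pay off is partition pruning: Synapse serverless SQL can skip Parquet files whose folder values don't match the query. This is a sketch with a hypothetical account, container, and year=/month= folder layout:&lt;/p&gt;

```sql
-- Only files matching the filepath() filters are read;
-- filepath(1) and filepath(2) return the first and second wildcard matches
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://myaccount.dfs.core.windows.net/sales/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS rows
WHERE rows.filepath(1) = '2023'
  AND rows.filepath(2) = '06';
```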

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for analytical workloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distribution types

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round-robin&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed evenly&lt;/strong&gt; in a random fashion&lt;/li&gt;
&lt;li&gt;Even distribution across DBs&lt;/li&gt;
&lt;li&gt;Assignment is random&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast performance for loads&lt;/strong&gt; as row assignment can be done quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slower performance for reads&lt;/strong&gt; as higher potential for data movement&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best for: 

&lt;ul&gt;
&lt;li&gt;No clear distribution key&lt;/li&gt;
&lt;li&gt;No frequent joins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uniform distribution&lt;/strong&gt; is desired&lt;/li&gt;
&lt;li&gt;Temporary staging table&lt;/li&gt;
&lt;li&gt;Simple starting point&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics

&lt;ul&gt;
&lt;li&gt;Distributed deterministically using hash function on a column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution column can’t be changed&lt;/strong&gt; later&lt;/li&gt;
&lt;li&gt;Choose a column with many unique values, few or no nulls, and that is not a date column&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best for

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Large tables (&amp;gt;2 GB)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Frequent inserts, updates, and deletes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replicated&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Characteristics

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full copy of table&lt;/strong&gt; is replicated to every compute node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requires extra storage&lt;/strong&gt; and overhead for writes&lt;/li&gt;
&lt;li&gt;Normally used in conjunction with other methods&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best for:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small lookup or dimension tables&lt;/strong&gt; joined with larger tables&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
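&lt;p&gt;The three distribution types map to the WITH clause of CREATE TABLE in a dedicated SQL pool. A sketch with hypothetical table and column names:&lt;/p&gt;

```sql
-- Temporary staging table: no clear distribution key, so round-robin
CREATE TABLE dbo.StageSales
( SaleId INT, ProductId INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Large fact table: hash-distribute on a high-cardinality, rarely-null column
CREATE TABLE dbo.FactSales
( SaleId INT, ProductId INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(ProductId), CLUSTERED COLUMNSTORE INDEX );

-- Small dimension table joined to the fact table: replicate to every node
CREATE TABLE dbo.DimProduct
( ProductId INT, ProductName NVARCHAR(100) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (ProductId) );
```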

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for streaming workloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Stream Analytics

&lt;ul&gt;
&lt;li&gt;Fully managed stream processing engine&lt;/li&gt;
&lt;li&gt;Input layer (Blob storage, Event Hubs, IoT hubs) ingested into ASA&lt;/li&gt;
&lt;li&gt;Query layer: ASA performs query&lt;/li&gt;
&lt;li&gt;Output layer: Results sent to Blob storage for downstream use&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;How transformation works in a stream

&lt;ul&gt;
&lt;li&gt;Data in stream is diverted to perform query&lt;/li&gt;
&lt;li&gt;Query transformation results are re-introduced to stream for output&lt;/li&gt;
&lt;li&gt;Transformation is done in near real time&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization"&gt;Partitioning&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embarrassingly parallel job&lt;/strong&gt;: equal numbers of input and output partitions, with one instance of the query running per partition&lt;/li&gt;
&lt;li&gt;Must align partition keys between inputs, query logic, and outputs&lt;/li&gt;
&lt;li&gt;Jobs that aren't embarrassingly parallel can still be completed, but not as efficiently

&lt;ul&gt;
&lt;li&gt;Typically involves windowed queries that combine data across partitions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the Azure Portal

&lt;ul&gt;
&lt;li&gt;Query in the left-side options&lt;/li&gt;
&lt;li&gt;Inputs - define query, can test and see results&lt;/li&gt;
&lt;li&gt;Outputs - define and test output query

&lt;ul&gt;
&lt;li&gt;Here you can define the partition key with the PARTITION BY clause (at compatibility level 1.1 and below; at level 1.2, define the partition key on the input instead)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
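&lt;p&gt;An embarrassingly parallel job can be sketched in the Stream Analytics query language. The input, output, and column names below are hypothetical, and the PARTITION BY clause is only needed at compatibility level 1.1 and below:&lt;/p&gt;

```sql
-- One query instance per partition: input, query, and output
-- are all keyed on PartitionId
SELECT PartitionId, DeviceId, AVG(Temperature) AS AvgTemp
INTO BlobOutput
FROM EventHubInput TIMESTAMP BY EventTime
PARTITION BY PartitionId
GROUP BY PartitionId, DeviceId, TumblingWindow(minute, 5)
```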

&lt;p&gt;&lt;strong&gt;Implement a partition strategy for Azure Synapse Analytics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition"&gt;&lt;strong&gt;Table partitions&lt;/strong&gt;&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Supported on all dedicated SQL pool table types

&lt;ul&gt;
&lt;li&gt;Clustered columnstore, clustered index, heap&lt;/li&gt;
&lt;li&gt;Supported on all distribution types (hash, round robin, etc)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Why partition

&lt;ul&gt;
&lt;li&gt;Query performance&lt;/li&gt;
&lt;li&gt;Load performance - Smaller amounts of data make incremental loading, updating, and deleting faster and easier&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#clustered-columnstore-indexes"&gt;Clustered columnstore indexes&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Standard for storing and querying large data warehouse fact tables&lt;/li&gt;
&lt;li&gt;Rows are organized into row groups of up to 1,048,576 rows&lt;/li&gt;
&lt;li&gt;Within each row group, values are stored per column as column segments&lt;/li&gt;
&lt;li&gt;The columnstore index is built from these column segments, and the data is compressed&lt;/li&gt;
&lt;li&gt;Deltastore - a rowstore that holds leftover rows until there are enough to compress into a row group&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Law of 60

&lt;ul&gt;
&lt;li&gt;A distribution is the basic unit of storage and processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synapse divides each query into 60 smaller queries&lt;/strong&gt; that run in parallel, one per data distribution&lt;/li&gt;
&lt;li&gt;This turns 10 table partitions into 600 physical partitions (10 x 60)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Each physical partition needs about 1 million rows&lt;/strong&gt; for efficient columnstore compression, so a 10-partition table should hold roughly 600 million rows&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;In the Azure Synapse Analytics portal

&lt;ul&gt;
&lt;li&gt;When writing CREATE TABLE statement, use WITH clause using CLUSTERED COLUMNSTORE INDEX&lt;/li&gt;
&lt;li&gt;Define the distribution type and key&lt;/li&gt;
&lt;li&gt;Choose partition key
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
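&lt;p&gt;Putting those portal steps together, a partitioned table in a dedicated SQL pool might look like this sketch (names and boundary values are hypothetical):&lt;/p&gt;

```sql
-- Hash-distributed fact table with a clustered columnstore index,
-- partitioned on a date key using range-right boundaries
CREATE TABLE dbo.FactOrders
(
    OrderDateKey INT NOT NULL,
    CustomerId   INT NOT NULL,
    Amount       DECIMAL(18,2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(CustomerId),
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES
                (20230101, 20230401, 20230701, 20231001) )
);
```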

&lt;p&gt;&lt;strong&gt;Identify when partitioning is needed in Azure Data Lake Storage Gen2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Blob Storage

&lt;ul&gt;
&lt;li&gt;General-purpose storage accounts support block, append, and page blobs&lt;/li&gt;
&lt;li&gt;Account --&amp;gt; Container --&amp;gt; Blob&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Partition key identification

&lt;ul&gt;
&lt;li&gt;Azure Storage serves data from a single partition faster than data spread across multiple partitions&lt;/li&gt;
&lt;li&gt;Partitioning is used to improve read performance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Naming blobs correctly is critical&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Blob storage uses a range-based partitioning scheme&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition key is combo of Account + Container + Blob&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Blob storage orders blobs lexically, so names that share a prefix (such as timestamps) land on the same partition&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Best practices

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid slowly changing timestamp prefixes&lt;/strong&gt; (yyyymmdd), which funnel writes to a single partition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Name based upon likely queries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid latency-causing partitioning&lt;/strong&gt; (use blob size &amp;gt;256 KB, use hashing functions)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>DP-203: Data Engineering on Microsoft Azure - Study Guide</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Tue, 05 Dec 2023 17:41:27 +0000</pubDate>
      <link>https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63</link>
      <guid>https://dev.to/aidutcher/dp-203-data-engineering-on-microsoft-azure-study-guide-5h63</guid>
      <description>&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/exams/dp-203/"&gt;Exam link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/resources/study-guides/dp-203#skills-measured-as-of-november-2-2023"&gt;Microsoft's official study guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills Measured&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design and implement data storage (15-20%)&lt;/li&gt;
&lt;li&gt;Develop data processing (40-45%)&lt;/li&gt;
&lt;li&gt;Secure, monitor, and optimize data storage and data processing (30-35%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Design and implement data storage (15-20%)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-implement-a-partition-strategy-4fkj"&gt;Implement a partition strategy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-design-and-implement-the-data-exploration-layer-40pl"&gt;Design and implement the data exploration layer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Develop data processing (40-45%)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-ingest-and-transform-data-108a"&gt;Ingest and transform data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-develop-a-batch-processing-solution-4ehi"&gt;Develop a batch processing solution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-develop-a-stream-processing-solution-3h26"&gt;Develop a stream processing solution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-manage-batches-and-pipelines-3e66"&gt;Manage batches and pipelines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secure, monitor, and optimize data storage and data processing (30-35%)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-implement-data-security-24pa"&gt;Implement data security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-monitor-data-storage-and-data-processing-44no"&gt;Monitor data storage and data processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/aidutcher/dp-203-study-guide-optimize-and-troubleshoot-data-storage-and-data-processing-2hbj"&gt;Optimize and troubleshoot data storage and data processing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>AZ-400: Design and implement an authentication strategy</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Wed, 28 Dec 2022 00:43:50 +0000</pubDate>
      <link>https://dev.to/aidutcher/az-400-design-and-implement-an-authentication-strategy-3b7b</link>
      <guid>https://dev.to/aidutcher/az-400-design-and-implement-an-authentication-strategy-3b7b</guid>
      <description>&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/devops/integrate/get-started/authentication/authentication-guidance?view=azure-devops" rel="noopener noreferrer"&gt;Guidance for authentication&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different authentication types work best with different application types&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/devops/organizations/accounts/use-personal-access-tokens-to-authenticate?toc=%2Fazure%2Fdevops%2Fmarketplace-extensibility%2Ftoc.json&amp;amp;view=azure-devops&amp;amp;tabs=Windows" rel="noopener noreferrer"&gt;Personal Access Token (PAT)&lt;/a&gt; 

&lt;ul&gt;
&lt;li&gt;identifies you, your accessible organizations, and your scopes of access&lt;/li&gt;
&lt;li&gt;should be treated and used like a password&lt;/li&gt;
&lt;li&gt;also used to configure the cross-platform CLI&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://learn.microsoft.com/en-us/azure/devops/integrate/get-started/authentication/oauth?toc=%2Fazure%2Fdevops%2Fmarketplace-extensibility%2Ftoc.json&amp;amp;view=azure-devops" rel="noopener noreferrer"&gt;OAuth&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;useful for authenticating apps for REST API access&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Avoid &lt;a href="https://learn.microsoft.com/en-us/azure/devops/integrate/get-started/authentication/iis-basic-auth?toc=%2Fazure%2Fdevops%2Fmarketplace-extensibility%2Ftoc.json&amp;amp;view=azure-devops" rel="noopener noreferrer"&gt;IIS Basic Authentication&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;prevents use of PATs&lt;/li&gt;
&lt;li&gt;breaks Git, because it requires PATs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>productivity</category>
      <category>react</category>
      <category>kendoreact</category>
      <category>ai</category>
    </item>
    <item>
      <title>AZ-400: Configure release documentation</title>
      <dc:creator>Alec Dutcher</dc:creator>
      <pubDate>Wed, 28 Dec 2022 00:24:56 +0000</pubDate>
      <link>https://dev.to/aidutcher/az-400-configure-release-documentation-3p7n</link>
      <guid>https://dev.to/aidutcher/az-400-configure-release-documentation-3p7n</guid>
      <description>&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/samples/azure-samples/azure-devops-release-notes/azure-devops-release-notes-generator/" rel="noopener noreferrer"&gt;Azure DevOps Release Notes Generator&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Release notes can be automatically generated upon a new release&lt;/li&gt;
&lt;li&gt;These notes can refer to work items and commits associated with the release&lt;/li&gt;
&lt;li&gt;They can be stored as markdown files in a dedicated storage account&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>community</category>
    </item>
  </channel>
</rss>
