
πŸ“˜ DP-600 Complete Deep Dive Guide

1️⃣ Dataflows Gen2

Definition

Dataflows Gen2 is a low-code/no-code data integration and transformation tool in Microsoft Fabric that uses Power Query for ETL operations.

When and Why Used

  • When: You need to ingest data from multiple sources (CSV, databases, APIs) and transform it before loading into Lakehouse/Warehouse
  • Why: Provides visual, user-friendly interface for data engineers without deep coding skills; supports incremental refresh; integrates directly with OneLake

How It Works in Fabric

  • Uses Power Query M language behind the scenes
  • Connects to 100+ data sources
  • Applies transformations (filter, merge, aggregate)
  • Lands data into Lakehouse tables or files
  • Supports staging (intermediate storage) for complex transformations

Real-World Example

A retail company receives daily sales CSV files from 50 stores. Dataflows Gen2 ingests files, cleans missing values, standardizes date formats, and loads into Lakehouse Bronze layer.

Exam Appearance

  • "Which tool should you use for automated daily CSV ingestion?" β†’ Dataflows Gen2
  • "You need low-code transformation before loading Lakehouse" β†’ Dataflows Gen2

Differences from Similar Concepts

  • Dataflows Gen2 vs Gen1: Gen2 writes directly to OneLake, supports staging, better performance
  • Dataflows vs Pipelines: Dataflows = transformations; Pipelines = orchestration
  • Dataflows vs Notebooks: Dataflows = no-code; Notebooks = code-based (Spark)

2️⃣ Lakehouse

Definition

Lakehouse is a unified data architecture combining data lake flexibility (raw data, Spark) with data warehouse structure (SQL queries).

When and Why Used

  • When: You need to store raw, semi-structured, and curated data with Spark processing
  • Why: Provides single storage layer for all data types; supports Delta tables; enables Medallion architecture

How It Works in Fabric

  • Built on OneLake (Delta Lake format)
  • Stores files (Parquet, CSV, JSON) and tables (Delta)
  • Supports Spark notebooks, SQL endpoint, shortcuts
  • Automatically creates SQL analytics endpoint for querying

Real-World Example

An IoT company stores raw sensor JSON files in Lakehouse, transforms them with Spark notebooks into Delta tables, and allows analysts to query with SQL.
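
A minimal PySpark sketch of that flow, as it might run from a Fabric notebook attached to the Lakehouse (where `spark` is predefined); file paths and table names are hypothetical:

```python
# Bronze: read raw sensor JSON from the Lakehouse Files area
raw = spark.read.json("Files/raw/sensors/")

# Light cleanup before landing as a managed Delta table
curated = (
    raw.dropDuplicates(["device_id", "event_time"])
       .filter("reading IS NOT NULL")
)

# saveAsTable writes Delta format by default in Fabric
curated.write.mode("overwrite").saveAsTable("sensor_readings")

# The same table is then queryable through the SQL analytics endpoint;
# the equivalent Spark SQL query:
spark.sql("""
    SELECT device_id, AVG(reading) AS avg_reading
    FROM sensor_readings
    GROUP BY device_id
""").show()
```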

Exam Appearance

  • "Store raw + curated data with Spark flexibility" β†’ Lakehouse
  • "Need both file storage and table queries" β†’ Lakehouse

Differences from Similar Concepts

  • Lakehouse vs Warehouse: Lakehouse = flexible (files + tables); Warehouse = structured SQL only
  • Lakehouse vs Data Lake: Lakehouse adds Delta tables, ACID transactions, time travel

3️⃣ Warehouse

Definition

Warehouse is a fully managed, enterprise-grade data warehouse optimized for SQL workloads and relational analytics.

When and Why Used

  • When: Business users need structured, relational data with pure SQL queries
  • Why: Optimized for aggregations, joins, star schema; familiar to SQL developers; better performance for complex SQL

How It Works in Fabric

  • Stores data in columnar format (similar to Synapse)
  • Supports T-SQL queries, stored procedures, views
  • Integrates with Power BI for DirectQuery
  • Provides separation of compute and storage

Real-World Example

Finance team builds dimensional model (fact + dimension tables) in Warehouse for budgeting reports, queried by Power BI dashboards.
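
A minimal sketch of querying such a model from Python through the Warehouse's T-SQL endpoint; the connection string, server, database, and table names are assumptions for illustration, not values from your environment:

```python
import pyodbc

# Fabric Warehouses expose a T-SQL endpoint; authenticate with Entra ID
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-endpoint>.datawarehouse.fabric.microsoft.com;"  # hypothetical
    "Database=FinanceWH;"                                              # hypothetical
    "Authentication=ActiveDirectoryInteractive;"
)

# Star-schema aggregation: fact table joined to a dimension
cursor = conn.cursor()
cursor.execute("""
    SELECT d.FiscalYear, SUM(f.Amount) AS TotalBudget
    FROM FactBudget AS f
    JOIN DimDate AS d ON f.DateKey = d.DateKey
    GROUP BY d.FiscalYear
    ORDER BY d.FiscalYear;
""")
for fiscal_year, total_budget in cursor.fetchall():
    print(fiscal_year, total_budget)
```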

Exam Appearance

  • "Business users need structured SQL-only access" β†’ Warehouse
  • "Build star schema for reporting" β†’ Warehouse

Differences from Similar Concepts

  • Warehouse vs Lakehouse: Warehouse = SQL-focused, structured; Lakehouse = Spark-focused, flexible
  • Warehouse vs Traditional DW: Fabric Warehouse = cloud-native, integrated with OneLake

4️⃣ Delta Tables

Definition

Delta tables are tables stored in Delta Lake, an open-source storage format that adds ACID transactions, time travel, and update/delete capabilities on top of Parquet files.

When and Why Used

  • When: You need reliable updates/deletes, historical tracking, or data quality checks
  • Why: Parquet is immutable; Delta adds transaction logs enabling versioning and atomic operations

How It Works in Fabric

  • Stores data as Parquet files + transaction log (JSON)
  • Log tracks all changes (inserts, updates, deletes)
  • Supports VACUUM (cleanup old versions), OPTIMIZE (compaction)
  • Enables time travel: SELECT * FROM table VERSION AS OF 5

Real-World Example

Customer profile table updates addresses daily. Delta table maintains history with time travel for compliance audits (GDPR right to erasure tracking).
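
A minimal sketch of time travel and table maintenance from a Fabric notebook (`spark` predefined); the table name, version number, and predicate are hypothetical:

```python
from delta.tables import DeltaTable

# Point-in-time read for an audit: the table as it looked at version 5
snapshot = spark.sql("SELECT * FROM customer_profiles VERSION AS OF 5")
snapshot.show()

# Apply an erasure request as an ACID delete
DeltaTable.forName(spark, "customer_profiles").delete("customer_id = 'C-1042'")

# Maintenance: compact small files, then remove versions past retention
spark.sql("OPTIMIZE customer_profiles")
spark.sql("VACUUM customer_profiles RETAIN 168 HOURS")  # keep 7 days of history
```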

Exam Appearance

  • "Maintain historical versions with updates" β†’ Delta tables with time travel
  • "Need ACID transactions in Lakehouse" β†’ Delta tables

Differences from Similar Concepts

  • Delta vs Parquet: Delta = updatable, versioned; Parquet = immutable
  • Delta vs SCD-2: Delta = storage layer; SCD-2 = modeling pattern (Delta can implement SCD-2)

5️⃣ KQL (Kusto Query Language)

Definition

KQL is a query language optimized for fast, interactive analytics on large volumes of streaming and time-series data.

When and Why Used

  • When: Real-time analytics, log analysis, anomaly detection, telemetry monitoring
  • Why: Sub-second query performance on streaming data; built for time-series operations

How It Works in Fabric

  • Used with KQL Database (Eventhouse)
  • Reads from streaming sources (Event Hub, IoT Hub)
  • Supports aggregations, time-windowing, pattern matching
  • Syntax: TableName | where Timestamp > ago(1h) | summarize count() by Category

Real-World Example

Bank monitors credit card transactions in real-time using KQL to detect fraud patterns (multiple high-value transactions within 5 minutes).
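
A minimal sketch of that kind of check run from Python with the azure-kusto-data package; the Eventhouse query URI, database, table, and thresholds are hypothetical:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Eventhouses expose a Kusto-compatible query endpoint; sign in via Azure CLI
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<eventhouse>.kusto.fabric.microsoft.com"  # hypothetical URI
)
client = KustoClient(kcsb)

# Cards with several high-value transactions in the last 5 minutes
query = """
Transactions
| where Timestamp > ago(5m)
| summarize TxCount = count(), Total = sum(Amount) by CardId
| where TxCount > 3 and Total > 5000
"""
for row in client.execute("FraudDB", query).primary_results[0]:
    print(row["CardId"], row["TxCount"], row["Total"])
```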

Exam Appearance

  • "Real-time fraud detection with streaming data" β†’ KQL Database
  • "Sub-second query latency on telemetry" β†’ KQL

Differences from Similar Concepts

  • KQL vs T-SQL: KQL = streaming, time-series; T-SQL = relational, batch
  • KQL vs Spark SQL: KQL = real-time; Spark SQL = batch processing

6️⃣ Eventstream

Definition

Eventstream is a real-time data ingestion service for capturing, transforming, and routing streaming data in Fabric.

When and Why Used

  • When: Ingesting continuous data flows (IoT sensors, clickstreams, logs)
  • Why: No-code streaming pipeline; routes to multiple destinations; applies transformations on-the-fly

How It Works in Fabric

  • Connects to Event Hub, IoT Hub, sample data
  • Applies transformations (filter, aggregate)
  • Routes to KQL Database, Lakehouse, Reflex (alerts)
  • Visual designer interface

Real-World Example

Manufacturing plant streams machine sensor data via Eventstream, filters critical alerts, and writes to KQL Database for monitoring dashboards.

Exam Appearance

  • "Ingest real-time IoT data visually" β†’ Eventstream
  • "Route streaming data to multiple destinations" β†’ Eventstream

Differences from Similar Concepts

  • Eventstream vs Dataflows: Eventstream = streaming; Dataflows = batch
  • Eventstream vs Event Hub: Event Hub = Azure service; Eventstream = Fabric wrapper with transformations

7️⃣ Row-Level Security (RLS)

Definition

RLS is a security feature that filters data rows dynamically based on user roles or attributes.

When and Why Used

  • When: Different users should see different subsets of data (managers see their department only)
  • Why: Maintains single dataset while enforcing data access policies

How It Works in Fabric

  • Defined in Power BI semantic model using DAX filters
  • Example: [UserEmail] = USERPRINCIPALNAME(); the filtered column must contain values that match users' sign-in names
  • Applies filters automatically when users query data
  • Works with Direct Lake, Import, DirectQuery modes

Real-World Example

HR dataset shows all employees. RLS rule: [ManagerEmail] = USERPRINCIPALNAME() ensures managers only see rows for their direct reports.

Exam Appearance

  • "Managers see only their department data" β†’ Row-Level Security (RLS)
  • "Dynamic filtering based on user identity" β†’ RLS

Differences from Similar Concepts

  • RLS vs Column-Level Security: RLS = row filtering; CLS = column hiding
  • RLS vs Workspace Permissions: RLS = data-level; Workspace = object-level (entire dataset access)

8️⃣ Direct Lake Mode

Definition

Direct Lake is a Power BI connection mode that queries Delta tables in Lakehouse/Warehouse directly without importing data.

When and Why Used

  • When: You want near-real-time dashboards without data duplication
  • Why: Combines DirectQuery speed with Import performance; no data movement

How It Works in Fabric

  • Power BI reads Delta Parquet files directly from OneLake
  • Uses Fabric compute engine for queries
  • No data copied to Power BI semantic model
  • Automatic fallback to DirectQuery if needed

Real-World Example

Sales dashboard queries 10TB Lakehouse table using Direct Lake, refreshing insights every hour without importing data.

Exam Appearance

  • "Query Lakehouse without duplicating storage" β†’ Direct Lake Mode
  • "Near-real-time dashboards on Delta tables" β†’ Direct Lake

Differences from Similar Concepts

  • Direct Lake vs Import: Direct Lake = no copy, live data; Import = copied, scheduled refresh
  • Direct Lake vs DirectQuery: Direct Lake = faster, Fabric-optimized; DirectQuery = source database queried

9️⃣ Materialized Views

Definition

Materialized views are pre-computed, stored query results that improve performance for repeated queries.

When and Why Used

  • When: Complex aggregations run repeatedly; query performance is slow
  • Why: Stores results physically; refreshes periodically; drastically reduces compute

How It Works in Fabric

  • Created in Warehouse using T-SQL: CREATE MATERIALIZED VIEW AS SELECT...
  • Query optimizer automatically uses view when applicable
  • Refreshed manually or on schedule
  • Stored as physical tables

Real-World Example

Daily sales aggregation by region takes 10 minutes. Materialized view pre-computes results nightly, queries complete in seconds.

Exam Appearance

  • "Repeated aggregations are slow" β†’ Materialized Views
  • "Pre-compute results for performance" β†’ Materialized Views

Differences from Similar Concepts

  • Materialized Views vs Regular Views: Materialized = stored results; Regular = virtual query
  • Materialized Views vs Tables: Materialized = refreshed from the source query; Tables = static until explicitly reloaded

πŸ”Ÿ Fact Tables

Definition

Fact tables store quantitative business metrics (sales, revenue) in a dimensional model, typically in star schema.

When and Why Used

  • When: Building data warehouse dimensional models for analytics
  • Why: Central table for measurements; joins to dimension tables for context

How It Works in Fabric

  • Contains measures (Amount, Quantity) and foreign keys to dimensions
  • Optimized for aggregations (SUM, AVG, COUNT)
  • Typically large, millions/billions of rows
  • Forms center of star schema

Real-World Example

FactSales table stores TransactionID, Date, ProductID, CustomerID, Amount. Joins to DimProduct, DimCustomer for reporting.
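
A minimal Spark SQL sketch of that star-schema query from a Fabric notebook (`spark` predefined); the table and column names follow the example above and are hypothetical:

```python
# Aggregate the fact table with context from two dimensions
revenue = spark.sql("""
    SELECT p.Category, c.Country, SUM(f.Amount) AS Revenue
    FROM FactSales AS f
    JOIN DimProduct  AS p ON f.ProductID  = p.ProductID
    JOIN DimCustomer AS c ON f.CustomerID = c.CustomerID
    GROUP BY p.Category, c.Country
""")
revenue.show()
```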

Exam Appearance

  • "Central table storing business metrics" β†’ Fact Table
  • "Star schema center with measures" β†’ Fact Table

Differences from Similar Concepts

  • Fact vs Dimension: Fact = measures (what); Dimension = context (who, where, when)
  • Fact vs Transactional Table: Fact = modeled at a defined grain for analytics; Transactional = raw operational data

1️⃣1️⃣ Relational Cardinality

Definition

Cardinality defines the relationship type between tables: one-to-one (1:1), one-to-many (1:*), many-to-many (*:*).

When and Why Used

  • When: Building semantic models in Power BI; defining table relationships
  • Why: Determines how filters propagate; impacts query performance and correctness

How It Works in Fabric

  • 1:* (One-to-Many): Most common; dimension filters fact table
  • 1:1: Rare; two tables with matching primary keys
  • *:*: Avoided in star schema; requires bridge tables

Real-World Example

DimProduct (ProductID unique) β†’ FactSales (ProductID repeated) = 1:* relationship
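
A minimal PySpark sketch for validating the "one" side before modeling the relationship (`spark` predefined in a Fabric notebook; table and column names hypothetical); a dimension key that is not unique makes the relationship effectively *:*:

```python
from pyspark.sql import functions as F

# Count duplicate keys on the intended "one" side
dupes = (
    spark.table("DimProduct")
         .groupBy("ProductID")
         .count()
         .filter(F.col("count") > 1)
)

if dupes.head(1):
    print("ProductID is not unique: relationship would be many-to-many")
else:
    print("ProductID is unique: safe to model DimProduct 1:* FactSales")
```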

Exam Appearance

  • "Define relationship between DimCustomer and FactSales" β†’ 1:* (One-to-Many)
  • "Avoid performance issues with relationships" β†’ Avoid *:*

Differences from Similar Concepts

  • 1:* vs *:*: 1:* = efficient filtering; *:* = performance penalty, complex logic

1️⃣2️⃣ Capacity Planning

Definition

Capacity planning determines the Fabric capacity size (F2, F64, F128, etc.) based on workload requirements.

When and Why Used

  • When: Provisioning Fabric environment; experiencing performance issues
  • Why: Ensures adequate compute resources; optimizes cost vs. performance

How It Works in Fabric

  • Capacity = capacity units (CU) of compute allocated to workloads
  • Sizes: F2 (2 CU) β†’ F2048 (2048 CU)
  • Charged per hour; can pause/resume
  • Monitoring via Capacity Metrics app

Real-World Example

Data engineering team runs heavy Spark jobs. F64 capacity is too slow. Upgrade to F128 reduces job time from 2 hours to 30 minutes.

Exam Appearance

  • "What determines capacity size?" β†’ Workload intensity + budget
  • "Multiple capacities needed for..." β†’ Compliance, billing, workload segregation

Differences from Similar Concepts

  • Capacity vs Workspace: Capacity = compute resource; Workspace = logical container

1️⃣3️⃣ Data Ingestion

Definition

Data ingestion is the process of moving data from source systems into Fabric (Lakehouse, Warehouse, KQL Database).

When and Why Used

  • When: Initial data load; ongoing incremental updates; real-time streaming
  • Why: Centralizes data for analytics; enables transformations

How It Works in Fabric

Methods vary by source:

  • Cloud storage (ADLS, S3): Shortcuts (zero-copy)
  • Databases (Azure SQL, Snowflake): Mirroring
  • On-prem databases: Pipelines + On-prem gateway
  • Streaming (Event Hub): Eventstream
  • Files (CSV, Excel): Dataflows Gen2
  • Custom logic: Notebooks (Spark)

Real-World Example

Company ingests:

  • ADLS data via Shortcuts
  • On-prem SQL via Pipeline + Gateway
  • IoT streams via Eventstream
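
A minimal sketch of the notebook option from the methods list above: files exposed through an ADLS shortcut, read and landed as a Delta table (`spark` predefined in a Fabric notebook; folder and table names hypothetical):

```python
# A shortcut appears as a normal folder under the Lakehouse Files area
orders = (
    spark.read
         .option("header", "true")
         .csv("Files/adls_shortcut/orders/")
)

# Append the new files to a Bronze table; Delta makes the write atomic
orders.write.mode("append").saveAsTable("bronze_orders")
```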

Exam Appearance

  • "Ingest on-prem SQL data" β†’ Pipeline + On-prem gateway
  • "Zero-copy from ADLS" β†’ Shortcuts
  • "Real-time events" β†’ Eventstream

Differences from Similar Concepts

  • Shortcuts vs Mirroring: Shortcuts = reference, no copy; Mirroring = replicated, near-real-time
  • Dataflows vs Notebooks: Dataflows = no-code; Notebooks = code-based

1️⃣4️⃣ Data Gateways

Definition

Gateways are secure connectors enabling Fabric to access on-premises or private network data sources.

When and Why Used

  • When: Data sources behind firewall (on-prem SQL Server, file shares)
  • Why: Enables secure data ingestion without exposing sources to internet

How It Works in Fabric

Two types:

  1. On-Premises Data Gateway: Installed on local machine; creates outbound connection to Fabric
  2. VNet Data Gateway: Uses Azure VNet private endpoints; no on-prem installation needed

Real-World Example

Retail company has SQL Server on-prem. Installs On-Prem Gateway, configures Dataflow Gen2 to ingest sales data daily.

Exam Appearance

  • "Ingest from on-prem SQL Server" β†’ On-prem Data Gateway
  • "Secure ingestion via private endpoint" β†’ VNet Gateway

Differences from Similar Concepts

  • On-Prem vs VNet Gateway: On-Prem = local install; VNet = Azure-managed

1️⃣5️⃣ Medallion Architecture

Definition

Medallion (Bronze-Silver-Gold) is a data organization pattern separating raw, curated, and business-ready data layers.

When and Why Used

  • When: Building scalable data platforms with clear data quality stages
  • Why: Separates concerns; enables incremental refinement; supports diverse use cases

How It Works in Fabric

  • Bronze (Raw): Ingested data as-is; minimal transformations
  • Silver (Curated): Cleaned, validated, deduplicated; Delta tables
  • Gold (Business-Ready): Aggregated, modeled; star schema for consumption

Real-World Example

E-commerce platform:

  • Bronze: Raw JSON clickstream files
  • Silver: Parsed, deduplicated sessions (Delta)
  • Gold: Aggregated user behavior metrics (Warehouse)
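
A minimal PySpark sketch of the three layers for that clickstream scenario, run from a Fabric notebook (`spark` predefined); paths, columns, and table names are hypothetical:

```python
from pyspark.sql import functions as F

# Bronze: raw JSON landed as-is
bronze = spark.read.json("Files/bronze/clickstream/")

# Silver: parsed, deduplicated sessions stored as a Delta table
silver = (
    bronze.dropDuplicates(["session_id"])
          .withColumn("event_date", F.to_date("event_time"))
)
silver.write.mode("overwrite").saveAsTable("silver_sessions")

# Gold: aggregated behavior metrics ready for reporting
gold = silver.groupBy("event_date").agg(
    F.countDistinct("user_id").alias("daily_users"),
    F.count("*").alias("page_views"),
)
gold.write.mode("overwrite").saveAsTable("gold_user_behavior")
```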

Exam Appearance

  • "Separate raw and curated data layers" β†’ Medallion Architecture
  • "Bronze, Silver, Gold pattern" β†’ Medallion Architecture

Differences from Similar Concepts

  • Medallion vs ETL: Medallion = layered storage pattern; ETL = process pattern

🎯 Quick Reference: When to Use What

  • Low-code ingestion → Dataflows Gen2
  • Raw + curated data, Spark → Lakehouse
  • Structured SQL only → Warehouse
  • Historical tracking, updates → Delta Tables
  • Real-time streaming analytics → KQL + Eventhouse
  • Row-based security → RLS in Power BI
  • No data duplication → Direct Lake Mode
  • Repeated query optimization → Materialized Views
  • On-prem data ingestion → On-Prem Gateway
  • Layered data organization → Medallion Architecture
