1️⃣ Dataflows Gen2
Definition
Dataflows Gen2 is a low-code/no-code data integration and transformation tool in Microsoft Fabric that uses Power Query for ETL operations.
When and Why Used
- When: You need to ingest data from multiple sources (CSV, databases, APIs) and transform it before loading into Lakehouse/Warehouse
- Why: Provides visual, user-friendly interface for data engineers without deep coding skills; supports incremental refresh; integrates directly with OneLake
How It Works in Fabric
- Uses Power Query M language behind the scenes
- Connects to 100+ data sources
- Applies transformations (filter, merge, aggregate)
- Lands data into Lakehouse tables or files
- Supports staging (intermediate storage) for complex transformations
Real-World Example
A retail company receives daily sales CSV files from 50 stores. Dataflows Gen2 ingests files, cleans missing values, standardizes date formats, and loads into Lakehouse Bronze layer.
Exam Appearance
- "Which tool should you use for automated daily CSV ingestion?" β Dataflows Gen2
- "You need low-code transformation before loading Lakehouse" β Dataflows Gen2
Differences from Similar Concepts
- Dataflows Gen2 vs Gen1: Gen2 writes directly to OneLake, supports staging, better performance
- Dataflows vs Pipelines: Dataflows = transformations; Pipelines = orchestration
- Dataflows vs Notebooks: Dataflows = no-code; Notebooks = code-based (Spark)
2️⃣ Lakehouse
Definition
Lakehouse is a unified data architecture combining data lake flexibility (raw data, Spark) with data warehouse structure (SQL queries).
When and Why Used
- When: You need to store raw, semi-structured, and curated data with Spark processing
- Why: Provides single storage layer for all data types; supports Delta tables; enables Medallion architecture
How It Works in Fabric
- Built on OneLake (Delta Lake format)
- Stores files (Parquet, CSV, JSON) and tables (Delta)
- Supports Spark notebooks, SQL endpoint, shortcuts
- Automatically creates SQL analytics endpoint for querying
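As a minimal sketch of that flow (Spark SQL in a Fabric notebook attached to the Lakehouse; the table name and file path are hypothetical, not from the original text):

```sql
-- Register raw JSON files in the Lakehouse Files area as a managed Delta table
-- (path and table name are invented for illustration)
CREATE TABLE IF NOT EXISTS sensor_readings
USING DELTA
AS SELECT * FROM json.`Files/raw/sensors/`;

-- The auto-created SQL analytics endpoint can then serve queries like:
SELECT COUNT(*) AS readings FROM sensor_readings;
```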
Real-World Example
An IoT company stores raw sensor JSON files in Lakehouse, transforms them with Spark notebooks into Delta tables, and allows analysts to query with SQL.
Exam Appearance
- "Store raw + curated data with Spark flexibility" β Lakehouse
- "Need both file storage and table queries" β Lakehouse
Differences from Similar Concepts
- Lakehouse vs Warehouse: Lakehouse = flexible (files + tables); Warehouse = structured SQL only
- Lakehouse vs Data Lake: Lakehouse adds Delta tables, ACID transactions, time travel
3️⃣ Warehouse
Definition
Warehouse is a fully managed, enterprise-grade data warehouse optimized for SQL workloads and relational analytics.
When and Why Used
- When: Business users need structured, relational data with pure SQL queries
- Why: Optimized for aggregations, joins, star schema; familiar to SQL developers; better performance for complex SQL
How It Works in Fabric
- Stores data in columnar format (similar to Synapse)
- Supports T-SQL queries, stored procedures, views
- Integrates with Power BI for DirectQuery
- Provides separation of compute and storage
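To illustrate the T-SQL surface, here is a hedged sketch of a typical star-schema aggregation; the table and column names are invented and assume a fact/dimension pair like the example below:

```sql
-- Hypothetical star-schema query in a Fabric Warehouse;
-- table and column names are illustrative only
SELECT d.Region,
       SUM(f.Amount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d
  ON d.CustomerID = f.CustomerID
GROUP BY d.Region;
```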
Real-World Example
Finance team builds dimensional model (fact + dimension tables) in Warehouse for budgeting reports, queried by Power BI dashboards.
Exam Appearance
- "Business users need structured SQL-only access" β Warehouse
- "Build star schema for reporting" β Warehouse
Differences from Similar Concepts
- Warehouse vs Lakehouse: Warehouse = SQL-focused, structured; Lakehouse = Spark-focused, flexible
- Warehouse vs Traditional DW: Fabric Warehouse = cloud-native, integrated with OneLake
4️⃣ Delta Tables
Definition
Delta tables use an open-source storage format (Delta Lake) that adds ACID transactions, time travel, and update/delete capabilities to Parquet files.
When and Why Used
- When: You need reliable updates/deletes, historical tracking, or data quality checks
- Why: Parquet is immutable; Delta adds transaction logs enabling versioning and atomic operations
How It Works in Fabric
- Stores data as Parquet files + transaction log (JSON)
- Log tracks all changes (inserts, updates, deletes)
- Supports VACUUM (cleanup old versions), OPTIMIZE (compaction)
- Enables time travel:

```sql
SELECT * FROM table VERSION AS OF 5
```
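A slightly fuller sketch of the update-then-time-travel cycle (Spark SQL with Delta; the table name, values, and version number are hypothetical):

```sql
-- Update rows in place; Delta records the change in its transaction log
UPDATE customer_profiles
SET Address = '123 Main St'
WHERE CustomerID = 42;

-- List the table versions recorded in the log (Delta Lake SQL)
DESCRIBE HISTORY customer_profiles;

-- Time travel: read the table as it existed at an earlier version
SELECT * FROM customer_profiles VERSION AS OF 5;
```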
Real-World Example
Customer profile table updates addresses daily. Delta table maintains history with time travel for compliance audits (GDPR right to erasure tracking).
Exam Appearance
- "Maintain historical versions with updates" β Delta tables with time travel
- "Need ACID transactions in Lakehouse" β Delta tables
Differences from Similar Concepts
- Delta vs Parquet: Delta = updatable, versioned; Parquet = immutable
- Delta vs SCD-2: Delta = storage layer; SCD-2 = modeling pattern (Delta can implement SCD-2)
5️⃣ KQL (Kusto Query Language)
Definition
KQL is a query language optimized for fast, interactive analytics on large volumes of streaming and time-series data.
When and Why Used
- When: Real-time analytics, log analysis, anomaly detection, telemetry monitoring
- Why: Sub-second query performance on streaming data; built for time-series operations
How It Works in Fabric
- Used with KQL Database (Eventhouse)
- Reads from streaming sources (Event Hub, IoT Hub)
- Supports aggregations, time-windowing, pattern matching
- Syntax:

```kql
TableName
| where Timestamp > ago(1h)
| summarize count() by Category
```
Real-World Example
Bank monitors credit card transactions in real-time using KQL to detect fraud patterns (multiple high-value transactions within 5 minutes).
Exam Appearance
- "Real-time fraud detection with streaming data" β KQL Database
- "Sub-second query latency on telemetry" β KQL
Differences from Similar Concepts
- KQL vs T-SQL: KQL = streaming, time-series; T-SQL = relational, batch
- KQL vs Spark SQL: KQL = real-time; Spark SQL = batch processing
6️⃣ Eventstream
Definition
Eventstream is a real-time data ingestion service for capturing, transforming, and routing streaming data in Fabric.
When and Why Used
- When: Ingesting continuous data flows (IoT sensors, clickstreams, logs)
- Why: No-code streaming pipeline; routes to multiple destinations; applies transformations on-the-fly
How It Works in Fabric
- Connects to Event Hub, IoT Hub, sample data
- Applies transformations (filter, aggregate)
- Routes to KQL Database, Lakehouse, Reflex (alerts)
- Visual designer interface
Real-World Example
Manufacturing plant streams machine sensor data via Eventstream, filters critical alerts, and writes to KQL Database for monitoring dashboards.
Exam Appearance
- "Ingest real-time IoT data visually" β Eventstream
- "Route streaming data to multiple destinations" β Eventstream
Differences from Similar Concepts
- Eventstream vs Dataflows: Eventstream = streaming; Dataflows = batch
- Eventstream vs Event Hub: Event Hub = Azure service; Eventstream = Fabric wrapper with transformations
7️⃣ Row-Level Security (RLS)
Definition
RLS is a security feature that filters data rows dynamically based on user roles or attributes.
When and Why Used
- When: Different users should see different subsets of data (managers see their department only)
- Why: Maintains single dataset while enforcing data access policies
How It Works in Fabric
- Defined in Power BI semantic model using DAX filters
- Example:

```dax
[Region] = USERPRINCIPALNAME()
```

- Filters are applied automatically when users query the data
- Works with Direct Lake, Import, DirectQuery modes
Real-World Example
HR dataset shows all employees. RLS rule [ManagerID] = USERPRINCIPALNAME() (assuming ManagerID stores the manager's sign-in email) ensures managers see only their direct reports.
Exam Appearance
- "Managers see only their department data" β Row-Level Security (RLS)
- "Dynamic filtering based on user identity" β RLS
Differences from Similar Concepts
- RLS vs Column-Level Security: RLS = row filtering; CLS = column hiding
- RLS vs Workspace Permissions: RLS = data-level; Workspace = object-level (entire dataset access)
8️⃣ Direct Lake Mode
Definition
Direct Lake is a Power BI connection mode that queries Delta tables in Lakehouse/Warehouse directly without importing data.
When and Why Used
- When: You want near-real-time dashboards without data duplication
- Why: Combines DirectQuery speed with Import performance; no data movement
How It Works in Fabric
- Power BI reads Delta Parquet files directly from OneLake
- Uses Fabric compute engine for queries
- No data copied to Power BI semantic model
- Automatic fallback to DirectQuery if needed
Real-World Example
Sales dashboard queries 10TB Lakehouse table using Direct Lake, refreshing insights every hour without importing data.
Exam Appearance
- "Query Lakehouse without duplicating storage" β Direct Lake Mode
- "Near-real-time dashboards on Delta tables" β Direct Lake
Differences from Similar Concepts
- Direct Lake vs Import: Direct Lake = no copy, live data; Import = copied, scheduled refresh
- Direct Lake vs DirectQuery: Direct Lake = faster, Fabric-optimized; DirectQuery = source database queried
9️⃣ Materialized Views
Definition
Materialized views are pre-computed, stored query results that improve performance for repeated queries.
When and Why Used
- When: Complex aggregations run repeatedly; query performance is slow
- Why: Stores results physically; refreshes periodically; drastically reduces compute
How It Works in Fabric
- Created in Warehouse using T-SQL:

```sql
CREATE MATERIALIZED VIEW AS SELECT...
```

- Query optimizer automatically uses the view when applicable
- Refreshed manually or on schedule
- Stored as physical tables
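Following the syntax form above, a hypothetical example matching the scenario below (the view, table, and column names are invented for illustration):

```sql
-- Pre-compute the sales-by-region aggregate once,
-- so repeated dashboard queries read stored results
CREATE MATERIALIZED VIEW dbo.mvSalesByRegion
AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
GROUP BY Region;
```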
Real-World Example
Daily sales aggregation by region takes 10 minutes. Materialized view pre-computes results nightly, queries complete in seconds.
Exam Appearance
- "Repeated aggregations are slow" β Materialized Views
- "Pre-compute results for performance" β Materialized Views
Differences from Similar Concepts
- Materialized Views vs Regular Views: Materialized = stored results; Regular = virtual query
- Materialized Views vs Tables: Materialized = auto-updated from source; Tables = static until manually updated
🔟 Fact Tables
Definition
Fact tables store quantitative business metrics (sales, revenue) in a dimensional model, typically in star schema.
When and Why Used
- When: Building data warehouse dimensional models for analytics
- Why: Central table for measurements; joins to dimension tables for context
How It Works in Fabric
- Contains measures (Amount, Quantity) and foreign keys to dimensions
- Optimized for aggregations (SUM, AVG, COUNT)
- Typically large, millions/billions of rows
- Forms center of star schema
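A hypothetical DDL sketch of the fact table described in the example below (column names follow that example; the types are assumptions):

```sql
CREATE TABLE dbo.FactSales (
    TransactionID BIGINT NOT NULL,        -- business key for the transaction
    SaleDate      DATE   NOT NULL,        -- the Date column; often a key to DimDate
    ProductID     INT    NOT NULL,        -- foreign key to DimProduct
    CustomerID    INT    NOT NULL,        -- foreign key to DimCustomer
    Amount        DECIMAL(18,2) NOT NULL  -- additive measure for aggregation
);
```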
Real-World Example
FactSales table stores TransactionID, Date, ProductID, CustomerID, Amount. Joins to DimProduct, DimCustomer for reporting.
Exam Appearance
- "Central table storing business metrics" β Fact Table
- "Star schema center with measures" β Fact Table
Differences from Similar Concepts
- Fact vs Dimension: Fact = measures (what); Dimension = context (who, where, when)
- Fact vs Transactional Table: Fact = aggregated for analytics; Transactional = raw operational data
1️⃣1️⃣ Relational Cardinality
Definition
Cardinality defines the relationship type between tables: one-to-one (1:1), one-to-many (1:*), many-to-many (*:*).
When and Why Used
- When: Building semantic models in Power BI; defining table relationships
- Why: Determines how filters propagate; impacts query performance and correctness
How It Works in Fabric
- 1:* (One-to-Many): Most common; dimension filters fact table
- 1:1: Rare; two tables with matching primary keys
- *:*: Avoided in star schema; requires bridge tables
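For the many-to-many case, a bridge table breaks one *:* relationship into two ordinary 1:* relationships. A hypothetical sketch (table and column names are invented):

```sql
-- Each row pairs one customer with one account, so DimCustomer -> Bridge
-- and DimAccount -> Bridge are both plain 1:* relationships
CREATE TABLE dbo.BridgeCustomerAccount (
    CustomerID INT NOT NULL,  -- joins to DimCustomer (1:*)
    AccountID  INT NOT NULL   -- joins to DimAccount (1:*)
);
```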
Real-World Example
DimProduct (ProductID unique) → FactSales (ProductID repeated) = 1:* relationship
Exam Appearance
- "Define relationship between DimCustomer and FactSales" β 1:* (One-to-Many)
- "Avoid performance issues with relationships" β Avoid *:*
Differences from Similar Concepts
- 1:* vs *:*: 1:* = efficient filtering; *:* = performance penalty, complex logic
1️⃣2️⃣ Capacity Planning
Definition
Capacity planning determines the Fabric capacity size (F2, F64, F128, etc.) based on workload requirements.
When and Why Used
- When: Provisioning Fabric environment; experiencing performance issues
- Why: Ensures adequate compute resources; optimizes cost vs. performance
How It Works in Fabric
- Capacity = compute units (CU) allocated for workloads
- Sizes: F2 (2 CU) → F2048 (2048 CU)
- Charged per hour; can pause/resume
- Monitoring via Capacity Metrics app
Real-World Example
Data engineering team runs heavy Spark jobs. F64 capacity is too slow. Upgrade to F128 reduces job time from 2 hours to 30 minutes.
Exam Appearance
- "What determines capacity size?" β Workload intensity + budget
- "Multiple capacities needed for..." β Compliance, billing, workload segregation
Differences from Similar Concepts
- Capacity vs Workspace: Capacity = compute resource; Workspace = logical container
1️⃣3️⃣ Data Ingestion
Definition
Data ingestion is the process of moving data from source systems into Fabric (Lakehouse, Warehouse, KQL Database).
When and Why Used
- When: Initial data load; ongoing incremental updates; real-time streaming
- Why: Centralizes data for analytics; enables transformations
How It Works in Fabric
Methods vary by source:
- Cloud storage (ADLS, S3): Shortcuts (zero-copy)
- Databases (Azure SQL, Snowflake): Mirroring
- On-prem databases: Pipelines + On-prem gateway
- Streaming (Event Hub): Eventstream
- Files (CSV, Excel): Dataflows Gen2
- Custom logic: Notebooks (Spark)
Real-World Example
Company ingests:
- ADLS data via Shortcuts
- On-prem SQL via Pipeline + Gateway
- IoT streams via Eventstream
Exam Appearance
- "Ingest on-prem SQL data" β Pipeline + On-prem gateway
- "Zero-copy from ADLS" β Shortcuts
- "Real-time events" β Eventstream
Differences from Similar Concepts
- Shortcuts vs Mirroring: Shortcuts = reference, no copy; Mirroring = replicated, near-real-time
- Dataflows vs Notebooks: Dataflows = no-code; Notebooks = code-based
1️⃣4️⃣ Data Gateways
Definition
Gateways are secure connectors enabling Fabric to access on-premises or private network data sources.
When and Why Used
- When: Data sources behind firewall (on-prem SQL Server, file shares)
- Why: Enables secure data ingestion without exposing sources to internet
How It Works in Fabric
Two types:
- On-Premises Data Gateway: Installed on local machine; creates outbound connection to Fabric
- VNet Data Gateway: Uses Azure VNet private endpoints; no on-prem installation needed
Real-World Example
Retail company has SQL Server on-prem. Installs On-Prem Gateway, configures Dataflow Gen2 to ingest sales data daily.
Exam Appearance
- "Ingest from on-prem SQL Server" β On-prem Data Gateway
- "Secure ingestion via private endpoint" β VNet Gateway
Differences from Similar Concepts
- On-Prem vs VNet Gateway: On-Prem = local install; VNet = Azure-managed
1️⃣5️⃣ Medallion Architecture
Definition
Medallion (Bronze-Silver-Gold) is a data organization pattern separating raw, curated, and business-ready data layers.
When and Why Used
- When: Building scalable data platforms with clear data quality stages
- Why: Separates concerns; enables incremental refinement; supports diverse use cases
How It Works in Fabric
- Bronze (Raw): Ingested data as-is; minimal transformations
- Silver (Curated): Cleaned, validated, deduplicated; Delta tables
- Gold (Business-Ready): Aggregated, modeled; star schema for consumption
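A condensed Spark SQL sketch of the three layers (all table and path names are hypothetical, invented for illustration):

```sql
-- Bronze: land raw files as-is, minimal transformation
CREATE TABLE IF NOT EXISTS bronze_clicks
USING DELTA AS
SELECT * FROM json.`Files/raw/clickstream/`;

-- Silver: cleaned, typed, deduplicated Delta table
CREATE TABLE IF NOT EXISTS silver_sessions
USING DELTA AS
SELECT DISTINCT session_id,
       user_id,
       CAST(event_time AS TIMESTAMP) AS event_time
FROM bronze_clicks
WHERE session_id IS NOT NULL;

-- Gold: business-ready aggregate for consumption
CREATE TABLE IF NOT EXISTS gold_daily_active_users
USING DELTA AS
SELECT CAST(event_time AS DATE) AS event_date,
       COUNT(DISTINCT user_id) AS active_users
FROM silver_sessions
GROUP BY CAST(event_time AS DATE);
```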
Real-World Example
E-commerce platform:
- Bronze: Raw JSON clickstream files
- Silver: Parsed, deduplicated sessions (Delta)
- Gold: Aggregated user behavior metrics (Warehouse)
Exam Appearance
- "Separate raw and curated data layers" β Medallion Architecture
- "Bronze, Silver, Gold pattern" β Medallion Architecture
Differences from Similar Concepts
- Medallion vs ETL: Medallion = layered storage pattern; ETL = process pattern
🎯 Quick Reference: When to Use What
| Requirement | Solution |
|---|---|
| Low-code ingestion | Dataflows Gen2 |
| Raw + curated data, Spark | Lakehouse |
| Structured SQL only | Warehouse |
| Historical tracking, updates | Delta Tables |
| Real-time streaming analytics | KQL + Eventhouse |
| Row-based security | RLS in Power BI |
| No data duplication | Direct Lake Mode |
| Repeated query optimization | Materialized Views |
| On-prem data ingestion | On-Prem Gateway |
| Layered data organization | Medallion Architecture |