DEV Community: Paulet Wairagu

QN : Data warehouses in Fabric

Paulet Wairagu — Mon, 29 Jun 2026 09:22:23 +0000

Relational data warehouses are at the center of most enterprise business intelligence (BI) solutions. They provide a structured, SQL-based environment where organizations store, query, and analyze business data at scale.

Fabric provides a fully managed data warehouse with full transactional T-SQL capabilities, including the ability to create tables and insert, update, and delete data.

A data warehouse is a centralized, structured store designed and optimized for analytical queries and reporting.

Important steps when building a data warehouse:

Data ingestion - Moving data from source systems into the warehouse.
Data storage - Storing the data in a format optimized for analytics.
Data processing - Transforming the data into a format ready for consumption by analytical tools.
Data analysis and delivery - Analyzing the data to gain insights and delivering them to the business.

Data warehouses contain tables organized in a schema optimized for multidimensional modeling.

This organization, known as dimensional modeling, involves structuring tables into fact tables and dimension tables.

Fact tables contain the numerical data that you want to analyze. Fact tables typically have a large number of rows and are the primary source of data for analysis.
For example, a fact table might contain the total amount paid for sales orders that occurred on a specific date or at a particular store.

Dimension tables contain descriptive information about the data in the fact tables. Dimension tables typically have far fewer rows than fact tables and provide context for the data in them. For example, a dimension table might contain information about the customers who placed sales orders.

A dimension table contains a unique key column that uniquely identifies each row in the table. Dimension tables commonly carry two kinds of key:

A surrogate key is a unique identifier for each row in the dimension table. It's often an integer value that the database management system generates automatically when you insert a new row.
An alternate key is often a natural or business key that identifies a specific instance of an entity in the transactional source system, such as a product code or a customer ID.

Surrogate keys are specific to the data warehouse and help maintain consistency and accuracy.
Alternate keys are specific to the source system and help maintain traceability between the data warehouse and the source system.

Special types of dimensions provide additional context and enable more comprehensive data analysis.

Time dimensions provide information about the time period in which an event occurred. This table enables data analysts to aggregate data over temporal intervals. For example, a time dimension might include columns for the year, quarter, month, and day of a sales order.
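
A time dimension is usually generated rather than loaded from a source system. The sketch below builds one row per calendar day with the year, quarter, month, and day columns mentioned above. The function name, the column names, and the YYYYMMDD integer key are assumptions made for illustration, not a Fabric convention.

```python
from datetime import date, timedelta


def build_date_dimension(start: date, end: date) -> list[dict]:
    """Return one dimension row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Hypothetical key format: the date written as a YYYYMMDD integer.
            "date_key": current.year * 10000 + current.month * 100 + current.day,
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day": current.day,
        })
        current += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2026, 1, 1), date(2026, 12, 31))
```

A fact row that stores the same `date_key` can then be aggregated by year, quarter, or month with a single join.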

Slowly changing dimensions track changes to dimension attributes over time, like changes to a customer's address or a product's price. They're significant in a data warehouse because they enable you to analyze and understand changes to data over time. Slowly changing dimensions ensure that data stays up-to-date and accurate, which is important for making good business decisions.
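
One common way to track such changes is often called a "Type 2" slowly changing dimension: instead of overwriting a row, the current row is closed and a new row is added. The notes above don't name a specific technique, so treat this as one possible sketch. The column names (`valid_from`, `valid_to`) and the helper function are hypothetical.

```python
from datetime import date


def apply_address_change(rows, customer_id, new_address, change_date):
    """Close the customer's current row and append a new current row."""
    next_key = max((r["customer_key"] for r in rows), default=0) + 1
    for row in rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["address"] == new_address:
                return rows  # nothing changed, keep the current row open
            row["valid_to"] = change_date  # close the old version
    rows.append({
        "customer_key": next_key,      # new surrogate key for the new version
        "customer_id": customer_id,    # alternate key stays the same
        "address": new_address,
        "valid_from": change_date,
        "valid_to": None,              # None marks the current version
    })
    return rows


dim_customer = [{
    "customer_key": 1, "customer_id": "C-100", "address": "12 Elm St",
    "valid_from": date(2025, 1, 1), "valid_to": None,
}]
apply_address_change(dim_customer, "C-100", "9 Oak Ave", date(2026, 3, 1))
```

Both versions of the customer remain in the table, so a sale from 2025 can still be reported against the address that was valid at the time.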

In most transactional databases, data is normalized to reduce duplication. In a data warehouse, however, dimension data is generally denormalized to reduce the number of joins required to query the data.

Often, a data warehouse uses a star schema, in which a fact table relates directly to the dimension tables.

If there are many levels, or attributes shared by different entities, a snowflake schema might make more sense. In a snowflake schema, dimension tables are normalized further into related tables.
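
To make the star schema concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse. SQLite is not Fabric and its SQL dialect is not T-SQL, so only the shape of the tables matters here. All table names, column names, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key, generated on insert
    customer_id   TEXT NOT NULL,        -- alternate (business) key from the source
    customer_name TEXT
);
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,     -- surrogate key
    store_code TEXT NOT NULL,           -- alternate key
    city       TEXT
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    order_date   TEXT,
    amount_paid  REAL                   -- the numeric measure to analyze
);
""")
conn.executemany(
    "INSERT INTO dim_customer (customer_id, customer_name) VALUES (?, ?)",
    [("C-100", "Amina"), ("C-200", "Brian")],
)
conn.executemany(
    "INSERT INTO dim_store (store_code, city) VALUES (?, ?)",
    [("S-01", "Nairobi"), ("S-02", "Mombasa")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, "2026-06-01", 120.0), (2, 1, "2026-06-01", 80.0), (1, 2, "2026-06-02", 50.0)],
)

# One join from the fact table to a dimension is enough to slice the measure.
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.amount_paid)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
```

The fact table holds only keys and measures, while the descriptive columns live in the dimension tables that surround it.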

QN : Orchestration - data movement w/ Fabric

Paulet Wairagu — Wed, 24 Jun 2026 10:05:08 +0000

- A data pipeline is a sequence of activities that orchestrates an overall process, typically extracting, transforming, and loading data.

- Pipelines automate ETL processes. They use control flow activities to manage branching, looping, and other logic.

Graphical pipeline canvas: a UI for building pipelines with minimal or no coding.

ACTIVITIES
Executable tasks in a pipeline. The outcome of an activity can be success, failure, or completion.
- Data transformation activities: activities that encapsulate data transfer and transformation operations
- Copy Data: extracts data from a source and loads it into a destination
- Data Flow activities: apply transformations as data is being transferred
- Notebook activities: run Spark code
- Stored Procedure activities: run SQL code
- Delete Data activities: delete existing data

CONTROL FLOW ACTIVITIES
Activities that implement loops, conditional branching, manage variables and parameter values. These help implement complex pipeline logic

PARAMETERS
Pipelines can be parameterized so that specific values are supplied each time the pipeline runs. Using parameters increases the reusability and flexibility of pipelines.

PIPELINE RUNS
Each time a pipeline is executed, a data pipeline run is initiated. Runs can be started on demand or on a schedule.
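
The ideas above (activities, outcomes, parameters, and runs) can be modeled in a few lines of plain Python. This is a toy model to fix the vocabulary, not the Fabric pipeline API. The `Activity` class, `run_pipeline`, and the two sample activities are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Activity:
    name: str
    run: Callable[[dict], None]  # receives the shared run context


def run_pipeline(activities, parameters):
    """Execute activities in order and stop at the first failure."""
    context = {"params": dict(parameters)}
    outcomes = []
    for activity in activities:
        try:
            activity.run(context)
            outcomes.append((activity.name, "Succeeded"))
        except Exception:
            outcomes.append((activity.name, "Failed"))
            break  # simple control flow: later activities are skipped
    return outcomes, context


def copy_data(ctx):
    # Hypothetical source: the rows arrive as a pipeline parameter.
    ctx["staged"] = list(ctx["params"]["source_rows"])


def transform(ctx):
    ctx["loaded"] = [row.upper() for row in ctx["staged"]]


pipeline = [Activity("Copy data", copy_data), Activity("Transform", transform)]

# Each call is one pipeline run with its own parameter values.
outcomes, result = run_pipeline(pipeline, {"source_rows": ["a", "b"]})
failed_outcomes, _ = run_pipeline(pipeline, {"source_rows": [1]})
```

The same pipeline definition is reused for both runs; only the parameter values differ, which is the reusability that parameters are meant to provide.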

QN : Window Functions in stream analytics

Paulet Wairagu — Thu, 18 Jun 2026 12:10:40 +0000

In stream processing, data is aggregated over temporal windows, e.g. the average rainfall per hour.

Temporal Window functions:

Tumbling

These functions segment data into a contiguous series of fixed-size, non-overlapping time segments and operate against them.

For example, a tumbling window can find the maximum value in each one-minute window.
Windowing functions are applied using the GROUP BY clause.
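
A rough Python sketch of the one-minute maximum described above. It is not Stream Analytics query syntax; it only mimics the grouping logic. Events are assumed to be `(timestamp_in_seconds, value)` pairs, and the window boundaries are start-inclusive, which is an assumption of this sketch and may differ from the service's own boundary rules.

```python
def tumbling_max(events, window_seconds=60):
    """Return {window_start: maximum value} for fixed, non-overlapping windows."""
    result = {}
    for ts, value in events:
        # Every timestamp maps to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        result[window_start] = max(value, result.get(window_start, value))
    return result


readings = [(5, 20.1), (42, 22.4), (61, 19.8), (119, 21.0), (120, 18.5)]
max_per_minute = tumbling_max(readings)
```

Because the windows never overlap, each reading contributes to exactly one result.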

Hopping

Hopping window functions are like tumbling window functions, except that the windows can overlap.
They model scheduled overlapping windows, jumping forward in time by a fixed period.
Events can belong to more than one window.
Three parameters must be defined: time unit, window size, and hop size.
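
The overlap can be seen in a small sketch that counts events per hopping window. As before, this is plain Python rather than Stream Analytics syntax. Timestamps are assumed to be non-negative seconds, windows are assumed to start at multiples of the hop size, and boundaries are start-inclusive.

```python
def hopping_counts(timestamps, window_seconds, hop_seconds):
    """Return {window_start: event count} for overlapping windows."""
    counts = {}
    for ts in timestamps:
        # Earliest window start (a multiple of the hop) that still covers ts.
        first = ((ts - window_seconds) // hop_seconds + 1) * hop_seconds
        start = max(first, 0)  # ignore windows that would start before time 0
        while start <= ts:
            counts[start] = counts.get(start, 0) + 1
            start += hop_seconds
    return counts


# 10-second windows that hop forward every 5 seconds.
counts = hopping_counts([3, 12, 14], window_seconds=10, hop_seconds=5)
```

The events at 12 and 14 seconds are each counted twice, once in the window starting at 5 and once in the window starting at 10.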

Sliding

This function generates events only for points in time when the contents of the window change, that is, when an event enters or leaves the window.
This limits the number of windows that need to be considered.
Events can belong to more than one window.

Session

Session window functions cluster together events that arrive at similar times, filtering out periods with no data.
3 main parameters: timeout, maximum duration, and partitioning key (optional).
The first event starts a window.
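
A simplified sketch of session grouping with a timeout and a maximum duration. It leaves out the optional partitioning key, and the way it applies the maximum duration is a simplification that may not match the service exactly. Timestamps are assumed to be seconds.

```python
def session_windows(timestamps, timeout, max_duration):
    """Group timestamps into sessions separated by gaps longer than timeout."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions:
            current = sessions[-1]
            within_timeout = ts - current[-1] <= timeout
            within_max = ts - current[0] <= max_duration
            if within_timeout and within_max:
                current.append(ts)  # the event extends the open session
                continue
        sessions.append([ts])  # the first event, or a gap, starts a new session
    return sessions


sessions = session_windows([1, 3, 4, 20, 22, 50], timeout=5, max_duration=10)
capped = session_windows([0, 4, 8, 12], timeout=5, max_duration=10)
```

Quiet periods between sessions produce no output at all, which is the "filtering out periods with no data" behavior.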

Snapshot

Groups events that have identical timestamp values.
No dedicated window function is needed; grouping by System.Timestamp() in the GROUP BY clause is enough.
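
In plain Python terms, a snapshot grouping is just a group-by on the timestamp itself. The event shape `(timestamp, value)` is an assumption of this sketch.

```python
from collections import defaultdict


def snapshot_groups(events):
    """Return {timestamp: [values]} for events that share a timestamp."""
    groups = defaultdict(list)
    for ts, value in events:
        groups[ts].append(value)
    return dict(groups)


groups = snapshot_groups([(10, "a"), (10, "b"), (11, "c")])
```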

QN : Azure Stream Analytics

Paulet Wairagu — Thu, 18 Jun 2026 11:15:41 +0000

Azure Stream Analytics is a service for complex event processing and analysis of streaming data.

Stream Analytics is used to:

Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage blob container.
Process the data by using a query to select, project, and aggregate data values.
Write the results to an output, such as Azure Data Lake Storage Gen2, Azure SQL Database, Azure Cosmos DB, Azure Functions, Azure Event Hubs, Microsoft Power BI, or others.

A data stream consists of a perpetual series of data, typically related to specific point-in-time events, e.g. environmental measurements recorded by an internet-connected weather sensor.

Characteristics of stream processing solutions
Stream processing solutions typically exhibit the following characteristics:

The source data stream is unbounded - data is added to the stream perpetually.
Each data record in the stream includes temporal (time-based) data indicating when the event to which the record relates occurred (or was recorded).
Aggregation of streaming data is performed over temporal windows - for example, recording the number of social media posts per minute or the average rainfall per hour.
The results of streaming data processing can be used to support real-time (or near real-time) automation or visualization, or persisted in an analytical store to be combined with other data for historical analysis. Many solutions combine these approaches to support both real-time and historical analytics.

QN : stages for processing big data

Paulet Wairagu — Wed, 17 Jun 2026 10:04:48 +0000

Stage	Purpose	Common Azure/Microsoft Tools
Ingest	Collect data from source systems	Microsoft Fabric Pipelines, Azure Event Hubs, Azure Stream Analytics
Store	Save the data securely and scalably	Azure Data Lake Storage Gen2
Prep & Train	Clean data, transform data, build ML models	Azure Databricks, Microsoft Fabric, Azure Machine Learning
Model & Serve	Deliver insights to users	Microsoft Power BI, Microsoft Fabric

Ingest

Goal: Bring data into the data lake.

Data sources:

Files
Logs
Applications
IoT devices
Databases

Tools:

Batch ingestion → Fabric Pipelines
Real-time ingestion → Azure Event Hubs, Azure Stream Analytics, Fabric Real-Time Intelligence

Store

Goal: Store the ingested data.

Technology:

Azure Data Lake Storage Gen2

Benefits:

Secure
Scalable
Cost-effective
Supports analytics workloads

Prep & Train

Goal: Transform data and build machine learning models.

Activities:

Data cleaning
Data transformation
Feature engineering
Model training
Model scoring

Tools:

Azure Databricks
Microsoft Fabric
Azure Machine Learning

Model & Serve

Goal: Present insights to users.

Outputs:

Dashboards
Reports
Predictions
Analytics applications

Tools:

Microsoft Power BI
Microsoft Fabric

Exam Shortcut

Think of a data lake as a factory:

Raw Data → Ingest → Store → Prep & Train → Model & Serve → Business Insights

Example:

Sales transactions arrive → Ingest
Stored in ADLS Gen2 → Store
Cleaned and transformed in Databricks → Prep & Train
Visualized in Power BI → Model & Serve

QN : Azure Data Lake Store VS Azure Blob storage

Paulet Wairagu — Wed, 17 Jun 2026 09:22:38 +0000

Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage. The key difference is that Data Lake Gen2 uses a hierarchical namespace, allowing efficient folder-level operations and better performance for analytics workloads. Blob Storage uses a flat namespace and is ideal for general object storage such as backups, media files, and application data, while Data Lake Gen2 is designed for big data analytics, ETL processing, and data engineering workloads.

Comparison table:

Feature	Azure Blob Storage	Azure Data Lake Storage Gen2
Purpose	General-purpose object storage for unstructured data	Analytics-optimized storage for big data workloads
Namespace Structure	Flat namespace	Hierarchical namespace (folders and directories)
Folder Support	Virtual folders only (using "/" in blob names)	Real directories with metadata
Directory Operations	Multiple operations needed for rename/delete	Single atomic operation for rename/delete
Performance for Analytics	Good, but not optimized for analytics	Optimized for large-scale analytics workloads
Cost of Data Processing	Can be higher due to additional operations	Lower because directory-level operations are efficient
Data Organization	Less structured	Better organized through hierarchical directories
Access Protocols	HTTP/HTTPS	HTTP/HTTPS plus Data Lake APIs
Best Use Cases	Website assets, backups, archives, media files, documents	Data lakes, ETL pipelines, data engineering, Spark, analytics
Integration with Analytics Tools	Supported	Deep integration with analytics services such as Azure Synapse Analytics, Apache Spark, and Microsoft Fabric
Hierarchical Namespace Setting	Disabled	Enabled
Typical Users	Application developers, backup/storage teams	Data engineers, data analysts, data scientists
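
The "Directory Operations" row is easier to picture with a toy model. Below, a flat namespace is a dictionary keyed by full blob name, and a hierarchical namespace is a nested dictionary. Neither function calls a real Azure SDK; both are hypothetical and only count how many operations a folder rename needs.

```python
def rename_flat(blobs: dict, old_prefix: str, new_prefix: str) -> int:
    """Rename a virtual folder by renaming every blob that shares the prefix."""
    operations = 0
    for name in [n for n in blobs if n.startswith(old_prefix)]:
        blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
        operations += 1  # one operation per blob
    return operations


def rename_hierarchical(parent: dict, old_name: str, new_name: str) -> int:
    """Rename a real directory: one operation, however many files it holds."""
    parent[new_name] = parent.pop(old_name)
    return 1


flat = {"raw/2026/a.csv": b"", "raw/2026/b.csv": b"", "curated/c.csv": b""}
tree = {"raw": {"2026": {"a.csv": b"", "b.csv": b""}}, "curated": {"c.csv": b""}}

flat_ops = rename_flat(flat, "raw/", "landing/")
tree_ops = rename_hierarchical(tree, "raw", "landing")
```

With two files the difference is small, but the flat approach grows with the number of blobs under the prefix while the hierarchical rename stays at one operation.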

QN : Introduction to Azure Data Lake Storage Gen2

Paulet Wairagu — Wed, 17 Jun 2026 09:18:04 +0000

Data lake: a repository of data stored in its natural format, usually as blobs or files.
Azure Data Lake Storage is a comprehensive, massively scalable, secure, and cost-effective data lake solution for high-performance analytics, built into Azure.
ADLS is optimized for analytical workloads and supports high data volumes for both streaming and batch solutions.
ADLS exposes data as a hierarchical file system through API endpoints, making it accessible to modern compute technologies such as Azure Databricks.
ADLS uses a layered access control model:
- Azure role-based access control (Azure RBAC): coarse-grained read and write access
- Azure attribute-based access control (Azure ABAC): adds conditions to role assignments
- Access control lists (ACLs): file- and directory-level control. Permissions aren't automatically inherited from parent directories after a child item is created. However, you can configure default permissions on a parent directory, which are then applied to new child items at the time they're created.
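
The point about default permissions is subtle, so here is a toy model of it. The `Directory` class and the permission strings are hypothetical and do not use any Azure SDK. The sketch only shows that a child copies the parent's default ACL at creation time and is not affected by later changes to the parent.

```python
class Directory:
    def __init__(self, name, access_acl=None, default_acl=None):
        self.name = name
        self.access_acl = dict(access_acl or {})    # who can access this directory
        self.default_acl = dict(default_acl or {})  # template copied to new children
        self.children = {}

    def create_child(self, name):
        # The child receives a copy of the parent's default ACL once, at creation.
        child = Directory(name, access_acl=self.default_acl, default_acl=self.default_acl)
        self.children[name] = child
        return child


sales = Directory("sales", default_acl={"analysts": "r-x"})
before = sales.create_child("2025")
sales.default_acl["engineers"] = "rwx"  # change made after "2025" already exists
after = sales.create_child("2026")
```

The `2025` directory keeps the permissions it was created with, while `2026` picks up the newer default.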
Data processing requires fewer computational resources because data is stored in directories and subdirectories, like a file system.
Data redundancy: Data Lake Storage inherits all Azure Blob Storage replication models.
Locally redundant storage (LRS) keeps multiple copies within a single data center.
Zone-redundant storage (ZRS) replicates data across availability zones in the same region.
Geo-redundant storage (GRS) or read-access geo-redundant storage (RA-GRS) replicates data to a secondary region.
Geo-zone-redundant storage (GZRS or RA-GZRS) combines zone and geographic redundancy.

QN : Data Engineering on Azure

Paulet Wairagu — Tue, 16 Jun 2026 10:43:36 +0000

A data engineer is responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
An Azure data engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints.

Types of Data

Structured data: comes from table-based source systems, e.g. relational databases or CSV files. It is made up of rows and columns that are consistent throughout the file.

Semi-structured data: data such as JSON, which may require flattening before loading. The data has no fixed table structure.

Unstructured data: includes data stored as key-value pairs that doesn't follow standard relational models, as well as files such as PDFs, Word documents, and images.
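
"Flattening" semi-structured data means turning nested fields into plain columns. A minimal sketch, assuming nested JSON objects only (arrays are left as they are) and an underscore as the column-name separator. The sample record is invented.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Turn nested dictionaries into a single-level dictionary of columns."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))  # recurse into nested objects
        else:
            flat[column] = value
    return flat


raw = '{"order_id": 1, "customer": {"id": "C-100", "address": {"city": "Nairobi"}}}'
row = flatten(json.loads(raw))
```

The result has one value per column name, so it can be loaded into a table.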

Data Operations

Data Integration: establishing links between operational and analytical services and data sources, ensuring data is secure, reliable, and accessible.
Data Transformation: transforming operational data into a structure suitable for analysis, often through an ETL or ELT process. Here data is prepared for downstream processes.
Data Consolidation: combining extracted data into a consistent structure to support analytics and reporting.

Data Engineer uses common languages:

SQL
Python
KQL (Kusto Query Language): used for analyzing streaming and log data, for example in the Microsoft Fabric Real-Time Intelligence workload.
Others dependent on organization

Keywords:

Operational data: transactional data generated and stored by applications in relational or non-relational databases.

Analytical data: data optimized for analysis and reporting, often stored in a data warehouse.

Streaming data: perpetual data sources that generate data values in real time, often relating to specific events, e.g. IoT devices.

Data pipelines: used to orchestrate activities that transfer and transform data. They are the primary way to implement ETL/ELT.

Data lakes: a storage repository for native, raw data, optimized for scaling to massive volumes. Data comes from multiple sources and may be structured, semi-structured, or unstructured. Data is stored untransformed.

Data warehouse: a centralized repository of integrated data from one or more disparate sources. Data is optimized for analytical queries and stored in relational tables organized into a schema.

Lakehouses: combine the scalability of a data lake with the querying capabilities of a data warehouse. Data is stored in Delta Lake format, which supports ACID transactions and schema enforcement, and a lakehouse can hold both structured and unstructured data.

Apache Spark: a parallel processing framework that takes advantage of in-memory processing and distributed file storage.

The flow of data in an enterprise data analytics solution:

Operational data is generated by applications and devices and stored in Azure data storage services such as Azure SQL Database, Azure Cosmos DB, and Microsoft Dataverse. Streaming data is captured in event broker services such as Azure Event Hubs.
Operational data is captured, ingested, and consolidated into analytical stores, where it is modeled and visualized in reports and dashboards.
Core Microsoft technologies used to implement data engineering workloads include:
Microsoft Fabric, Azure Data Lake Storage Gen2, Azure Stream Analytics, Azure Data Factory, Azure Databricks
Microsoft Fabric is a unified, end-to-end SaaS platform that brings together data engineering tools.

QN : Ingest Data with Dataflows Gen2 in Microsoft Fabric

Paulet Wairagu — Tue, 09 Jun 2026 17:04:50 +0000

Dataflows are a type of cloud-based ETL (Extract, Transform, Load) tool for building and executing scalable data transformation processes.
Dataflows offer a wide variety of transformations, and can be run manually, on a refresh schedule, or as part of a data pipeline orchestration.
A dataflow holds all of the transformations in one place, which reduces data prep time. Its output can be loaded into a new table, included in a data pipeline, or used as a data source by data analysts.
Dataflows can be horizontally partitioned as well. Once you create a global dataflow, data analysts can use dataflows to create specialized semantic models for specific needs.
Dataflows allow you to promote reusable ETL logic that prevents the need to create more connections to your data source.
Benefits:

Extend data with consistent data, such as a standard date dimension table.
Allow self-service users access to a subset of data warehouse separately.
Optimize performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources.
Simplify data source complexity by only exposing dataflows to larger analyst groups.
Ensure consistency and quality of data by enabling users to clean and transform data before loading it to a destination.
Simplify data integration by providing a low-code interface that ingests data from various sources.

Limitations:

Dataflows aren't a replacement for a data warehouse.
Row-level security isn't supported.
A Fabric capacity workspace is required.

You can create a Dataflow Gen2 in the Data Factory workload, in a Power BI workspace, or directly in the lakehouse. Since our scenario is focused on data ingestion, let's look at the Data Factory workload experience. Dataflows Gen2 use Power Query Online to visualize transformations.

The combination of dataflows and pipelines is useful when you need to perform additional operations on the transformed data.

QN : Ingest and transform data in a lakehouse

Paulet Wairagu — Tue, 09 Jun 2026 09:28:33 +0000

A lakehouse has two storage areas: Tables and Files.
Tables
- Store structured data that can be queried with SQL
- Support schema definitions and ACID transactions
Files
- Store raw or semi-structured data (CSV, Parquet, JSON)
- No schema enforcement
- Flexible for data exploration
Schemas allow logical grouping of tables by business function or domain (sales, marketing, etc.).
A dbo schema is available by default once a schema-enabled lakehouse is created.
Schema-enabled lakehouses also support schema-level permissions and cross-workspace queries using the four-part namespace.
Lakehouse modes: Lakehouse Explorer and SQL analytics endpoint
- Lakehouse Explorer: lets you manage, create, update, and upload data. You can switch between tables in the lakehouse.
- SQL analytics endpoint: does not allow modifying the underlying data. You can query using T-SQL in read-only mode.
Loading data into lakehouse:

Upload data into files or folders in the explorer
Load into Delta tables (no code)
Transform using Power Query in Dataflow Gen2
Ingest programmatically using Apache Spark in notebooks
Use Copy Data in Data Factory pipelines to move data from different sources into the lakehouse

- Shortcuts allow you to reference external data, reducing copies. Access is managed by OneLake.

Schema shortcuts map an entire schema to a folder of Delta tables in another lakehouse.
SQL analytics endpoint provides read-only access to lakehouse tables using T-SQL queries.
SQL use cases: ad hoc queries, BI connections from Power BI or Azure Data Studio, data validation
You can use SQL views to store reusable query logic. Views are useful when you need to apply business rules, simplify complex joins, or provide curated data for downstream consumers.
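
As a small illustration of a view that stores reusable query logic, the sketch below uses Python's built-in `sqlite3` module. SQLite stands in for the SQL analytics endpoint only to show the idea; its dialect is not T-SQL. The table, the view, and the "completed orders only" business rule are all invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL, status TEXT);
INSERT INTO sales VALUES
    (1, 'East', 100.0, 'completed'),
    (2, 'East',  40.0, 'cancelled'),
    (3, 'West',  60.0, 'completed');

-- The view stores the business rule once: only completed orders count.
CREATE VIEW completed_sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE status = 'completed'
GROUP BY region;
""")

# Downstream consumers query the view without repeating the filter.
rows = conn.execute(
    "SELECT region, total_amount FROM completed_sales_by_region ORDER BY region"
).fetchall()
```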
You can use Spark SQL for SQL-like queries or PySpark for programmatic data manipulation in Notebooks.
Spark SQL works well for familiar SQL patterns. PySpark provides greater flexibility for complex transformations and integration with Python libraries.
Power BI is the business intelligence and reporting layer in Fabric. It serves as the consumption layer where business users access data through interactive reports and dashboards.
Power BI can connect to lakehouse data in two ways:
- Query the SQL analytics endpoint
- Create a semantic model

QN : Get started with lakehouses in Microsoft Fabric

Paulet Wairagu — Thu, 04 Jun 2026 17:08:23 +0000

A lakehouse is a unified platform that combines:
- The flexible and scalable storage of a data lake
- The ability to query and analyze data of a data warehouse
A lakehouse uses Apache Spark and SQL compute engines to process and analyze data at scale.
Traditional warehouses handle structured data well but struggle with semi-structured and unstructured data from app logs, IoT devices, and similar sources, which leads to data silos and complex integration efforts.
Data lakes offer flexibility and scalability but lack the structure and performance needed for business analytics.
Data warehouses have strong analytical capabilities but struggle with varied data formats and are costly to scale.
Lakehouse design:
- tables : Delta Lake tables that provide structured, queryable data
  - Support SQL queries through the SQL analytics endpoint
  - Enforce schemas and support ACID transactions
  - Can be accessed in Power BI for reporting
  - Benefit from automatic optimization and maintenance
- files : stores raw or semi-structured data files in their native format
  - Support any file format (CSV, JSON, Parquet, images, documents)
  - Provide flexibility for data exploration and processing
  - Can be staged before transformation into tables
  - Don't enforce schema or support direct SQL queries
Delta Lake is an open-source storage layer that brings reliability to data lakes.
Data is stored in Delta format in OneLake storage.
Delta Lake advantages
- ACID transactions: data stays consistent even with concurrent reads and writes
- Schema enforcement: validates the data against the table schema
- Time travel: the transaction log keeps a history of changes, so earlier versions of a table can be queried
- Updates and deletes: existing rows can be modified or removed
A Delta table consists of Parquet data files plus a transaction log.
This design supports both batch and streaming workloads.
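
The "data files plus transaction log" idea can be pictured with a toy model. This is not the real Delta Lake log format or API. `ToyDeltaTable` and its methods are hypothetical and only show how replaying a log of added and removed files yields any past version of a table.

```python
class ToyDeltaTable:
    """An append-only log of commits; each commit adds or removes data files."""

    def __init__(self):
        self.log = []  # one entry per committed version

    def commit(self, add=(), remove=()):
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # the new version number

    def files_at(self, version=None):
        """Replay the log up to the given version (latest when None)."""
        if version is None:
            version = len(self.log) - 1
        files = []
        for entry in self.log[: version + 1]:
            files = [f for f in files if f not in entry["remove"]]
            files.extend(entry["add"])
        return files


table = ToyDeltaTable()
v0 = table.commit(add=["part-000.parquet"])
v1 = table.commit(add=["part-001.parquet"])
# An update is modeled as removing the old file and adding a rewritten one.
v2 = table.commit(add=["part-002.parquet"], remove=["part-000.parquet"])
```

Reading `files_at(v0)` after the update still returns the original file, which is the intuition behind time travel.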
Lakehouse access :
- Workspace roles for collaborators who need access to all items in the workspace
- Item-level sharing to grant read-only access for specific needs, such as analytics or Power BI report development
- Row-level and column-level security on the SQL analytics endpoint, so you can restrict what specific users see when they query through SQL
- Schema-level permissions to control access by business domain
Well-organized lakehouse data becomes the foundation that intelligent experiences across Microsoft Fabric depend on.
The investment you make in organizing, naming, and structuring lakehouse data pays dividends beyond your immediate analytics needs. Good data engineering practices in the lakehouse create a reusable foundation for intelligent experiences across the platform.

QN : Introduction to end-to-end analytics using Microsoft Fabric

Paulet Wairagu — Thu, 04 Jun 2026 16:29:34 +0000

Quick Short notes series

Microsoft Fabric is an end-to-end analytics platform that provides a single, integrated environment where data professionals and the business collaborate on data projects. Built on a unified data lake called OneLake, Fabric brings together the tools you need across that entire lifecycle.
Fabric is a unified software-as-a-service (SaaS) platform where all data is stored in a single open format in OneLake. All analytics engines in the platform can access OneLake, ensuring scalability, cost-effectiveness, and accessibility from anywhere with an internet connection.
OneLake is Fabric's centralized data storage architecture that enables collaboration by eliminating the need to move or copy data between systems
OneLake is built on Azure Data Lake Storage Gen2 (ADLS Gen2) and supports various formats, including Delta, Parquet, CSV, and JSON
All compute engines in Fabric automatically store their data in OneLake, making it directly accessible without the need for movement or duplication.
For tabular data, the analytical engines in Fabric write data in delta-parquet format and all engines interact with the format seamlessly.
Shortcuts are references to files or storage locations within OneLake or external data sources, such as Azure Data Lake Storage, Amazon S3, or Dataverse. Shortcuts allow you to access existing data without copying it, ensuring data consistency and enabling Fabric to stay in sync with the source.
Workspaces serve as logical containers that help you organize and manage your data, reports, and other assets.
Each workspace has its own set of permissions, ensuring that only authorized users can view or modify its contents.
Workspaces allow you to manage compute resources and integrate with Git for version control. You can optimize performance and cost by configuring compute settings, while Git integration helps track changes, collaborate on code, and maintain a history of your work.
Fabric administration is centralized in the Admin portal.
In the admin portal you can manage groups and permissions, configure data sources and gateways, and monitor usage and performance. You can also access the Fabric admin APIs and SDKs in the admin portal, which can automate common tasks and integrate Fabric with other systems.
OneLake catalog helps you analyze, monitor, and maintain data governance. It provides guidance on sensitivity labels, item metadata, and data refresh status, offering insights into the governance status and actions for improvement.
Fabric increases collaboration between data professionals by removing data silos and the need for multiple systems.
In Workspace settings, you can configure:
- License type to use Fabric features.
- OneDrive access for the workspace.
- Azure Data Lake Gen2 Storage connection.
- Git integration for version control.
- Spark workload settings for performance optimization