Schemas and Data Modelling in Power BI

#database #dataengineering #bigdata

What is a Schema?

Schema refers to the logical structure of a database or data model that defines how tables are organized and related. In Power BI, schemas are used to optimize data storage, retrieval, and reporting.

Why Schema Matters in Power BI?

A schema in Power BI is crucial because it defines how data is structured, stored, and connected in a data model. A well-designed schema:

Enhances Performance — Optimized schemas improve query speed and report loading time.
Ensures Data Accuracy — Proper relationships prevent incorrect aggregations or duplications.
Simplifies Data Analysis — A clear schema makes it easier to create reports and dashboards.
Improves Scalability — A structured schema allows for easy expansion as data grows.
Optimizes DAX Calculations — Efficient schemas lead to better DAX performance and calculations.

Types of Schema

Star schema Star schema is a mature modeling approach widely adopted by relational data warehouses. It requires modelers to classify their model tables as either dimension or fact. It is a widely used data modeling approach in Power BI for optimizing performance and simplifying relationships. It consists of:

Fact Table (Central Table) — Stores transactional data (e.g., sales, revenue, quantity).
It store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, and more. A fact table contains dimension key columns that relate to dimension tables, and numeric measure columns. The dimension key columns determine the dimensionality of a fact table, while the dimension key values determine the granularity of a fact table.
For example, consider a fact table designed to store sale targets that has two dimension key columns Date and ProductKey. It's easy to understand that the table has two dimensions. The granularity, however, can't be determined without considering the dimension key values. In this example, consider that the values stored in the Date column are the first day of each month. In this case, the granularity is at month-product level.

Dimension Tables (Surrounding Tables) —
Describe business entities—the things you model. Entities can include products, people, places, and concepts including time itself. The most consistent table you'll find in a star schema is a date dimension table. A dimension table contains a key column (or columns) that acts as a unique identifier, and other columns. Other columns support filtering and grouping your data.
_
Contain descriptive attributes (e.g., Date, Product, Customer).
Fewer Joins — Uses one-to-many relationships, reducing complexity and improving query speed._

Normalization vs. denormalization

To understand some star schema concepts described in this article, it's important to know two terms: normalization and denormalization.

Normalization _ is the term used to describe data that's stored in a way that reduces repetitious data.

If, however, the sales table stores product details beyond the key, it's considered denormalized. In the following image, notice that the ProductKey and other product-related columns record the product.

*Measures
*
In star schema design, a measure is a fact table column that stores values to be summarized. In a Power BI semantic model, a measure has a different—but similar—definition. A model supports both explicit and implicit measures.
_Explicit measures are expressly created and they're based on a formula written in Data Analysis Expressions (DAX) that achieves summarization. Measure expressions often use DAX aggregation functions like SUM, MIN, MAX, AVERAGE, and others to produce a scalar value result at query time (values are never stored in the model). Measure expression can range from simple column aggregations to more sophisticated formulas that override filter context and/or relationship propagation. For more information, read about DAX Basics in Power BI Desktop.
Implicit measures are columns that can be summarized by a report visual or Q&A. They offer a convenience for you as a model developer, as in many instances you don't need to create (explicit) measures. For example, the Adventure Works reseller sales Sales Amount column can be summarized in numerous ways (sum, count, average, median, min, max, and others), without the need to create a measure for each possible aggregation type.

Surrogate keys is a unique identifier that you add to a table to support star schema modeling. By definition, it's not defined or stored in the source data. Commonly, surrogate keys are added to relational data warehouse dimension tables to provide a unique identifier for each dimension table row.

Power BI semantic model relationships are based on a single unique column in one table, which propagates filters to a single column in a different table. When a dimension table in your semantic model doesn't include a single unique column, you must add a unique identifier to become the "one" side of a relationship. In Power BI Desktop, you can achieve this requirement by adding a Power Query index column.

Advantages of Star Schema

Optimized for Performance — Fewer joins mean faster queries and better report speed.
Simplifies DAX Calculations — Flat structure makes it easier to create measures and aggregations.
Enhances Data Visualization — Works seamlessly with Power BI’s data model and relationships.
Reduces Complexity — Easier to design, manage, and scale compared to Snowflake or Galaxy schemas.

Snowflake schema is a set of normalized tables for a single business entity. For example, Adventure Works classifies products by category and subcategory. Products are assigned to subcategories, and subcategories are in turn assigned to categories. In the Adventure Works relational data warehouse, the product dimension is normalized and stored in three related tables: DimProductCategory, DimProductSubcategory, and DimProduct.

In Power BI Desktop, you can choose to mimic a snowflake dimension design (perhaps because your source data does) or combine the source tables to form a single, denormalized model table. Generally, the benefits of a single model table outweigh the benefits of multiple model tables. The most optimal decision can depend on the volumes of data and the usability requirements for the model.

Power BI loads more tables, which is less efficient from storage and performance perspectives. These tables must include columns to support model relationships, and it can result in a larger model size.
Longer relationship filter propagation chains need to be traversed, which might be less efficient than filters applied to a single table.
The Data pane presents more model tables to report authors, which can result in a less intuitive experience, especially when snowflake dimension tables contain only one or two columns.
It's not possible to create a hierarchy that comprises columns from more than one table.
When you choose to integrate into a single model table, you can also define a hierarchy that encompasses the highest and lowest grain of the dimension. Possibly, the storage of redundant denormalized data can result in increased model storage size, particularly for large dimension tables.

Advantages of Snowflake Schema

Less Data Redundancy — Normalized tables reduce duplication.
Better Data Integrity — Structured data ensures consistency.
Efficient for Large Datasets — Optimized for big data storage.
Easier Maintenance — Updates are more manageable.

Model relationships

A model relationship propagates filters applied on the column of one model table to a different model table. Filters will propagate so long as there's a relationship path to follow, which can involve propagation to multiple tables.

Relationship paths are deterministic, meaning that filters are always propagated in the same way and without random variation. Relationships can, however, be disabled, or have filter context modified by model calculations that use particular Data Analysis Expressions (DAX) functions

Data types of columns

The data type for both the "from" and "to" column of the relationship should be the same. Working with relationships defined on DateTime columns might not behave as expected. The engine that stores Power BI data, only uses DateTime data types; Date, Time, and Date/Time/Timezone data types are Power BI formatting constructs implemented on top. Any model-dependent objects will still appear as DateTime in the engine (such as relationships, groups, and so on). As such, if a user selects Date from the Modeling tab for such columns, they still don't register as being the same date, because the time portion of the data is still being considered by the engine.

Cardinality

Each model relationship is defined by a cardinality type. There are four cardinality type options, representing the data characteristics of the "from" and "to" related columns. The "one" side means the column contains unique values; the "many" side means the column can contain duplicate values.
If a data refresh operation attempts to load duplicate values into a "one" side column, the entire data refresh will fail.
The four options, together with their shorthand notations, are described in the following list:

One-to-many (1:*)
Many-to-one (*:1)
One-to-one (1:1)
Many-to-many (:)

One-to-many (and many-to-one) cardinality
The one-to-many and many-to-one cardinality options are essentially the same, and they're also the most common cardinality types.

When you configure a one-to-many or many-to-one relationship, choose the one that matches the order in which you related the columns. Consider how you would configure the relationship from the Product table to the Sales table by using the ProductID column found in each table. The cardinality type would be one-to-many, as the ProductID column in the Product table contains unique values. If you related the tables in the reverse direction, Sales to Product, then the cardinality would be many-to-one.

One-to-one cardinality
A one-to-one relationship means both columns contain unique values. This cardinality type isn't common, and it likely represents a suboptimal model design because of the storage of redundant data.

Many-to-many cardinality
A many-to-many relationship means both columns can contain duplicate values. This cardinality type is infrequently used. It's typically useful when designing complex model requirements. You can use it to relate many-to-many facts or to relate higher grain facts. For example, when sales target facts are stored at product category level and the product dimension table is stored at product level.

Cross filter direction
Single cross filter direction means "single direction", and Both means "both directions". A relationship that filters in both directions is commonly described as bi-directional.

For one-to-many relationships, the cross filter direction is always from the "one" side, and optionally from the "many" side (bi-directional). For one-to-one relationships, the cross filter direction is always from both tables. Lastly, for many-to-many relationships, cross filter direction can be from either one of the tables, or from both tables. Notice that when the cardinality type includes a "one" side, that filters will always propagate from that side.