Microsoft's power BI is one of the most common tools used by data analysts in data, exploration, cleaning, analysis, reporting and presentation.
Its a versatile tool with many advantages and one of its most powerful features is the model view.
In this view, you can define the data schema as well as the relationships between the various tables in the schema.
Definition of Terms
- Data Modelling - The process of identifying the relationships between different data tables and mapping them into schematic representations of the various tables and the relationships between them.
- Data Schema - A formal description of the relationships between various entities in a data warehouse or a database management system.
- Fact table - The main table in a data model that contains the analysts' subject matter.
- Dimensions table - Support tables in a data model that contain additional information regarding the records in the fact table.
DATA SCHEMAS
A data schema is the blueprint or structure of a database. It defines how data is organized, stored, and related using tables, fields, data types, and constraints. Schemas define essential data components such as;
- Tables / Entities -> the main object for storing data in relational databases. It holds records in rows and their attributes in columns.
- Fields / attributes -> the individual characteristics of each record in the table.
- Relationships -> how the different tables in the database relate to one another and the cardinality of said relationships.
- Constraints - the rules enforced on the data in the tables to ensure data integrity, accuracy and reliability.
Data schemas vary based on the use case.
From a developers perspective, there are 3 major types of schemas;
- Conceptual data schema - a high level outline of what the database system will contain. Its intended for business and stakeholder approval and focuses on business rules, major entities and their relationships.
- Logical data schema - It defines the schema objects like table names, field names and integrity constraints but without the physical implementation details / specific DBMS language.
- Physical data schema - The actual implementation of a database in a live database management system. it includes the partitioning information and indexing strategy but not the actual data.
For the most part, data analysts operate from a data warehouse / data mart. There are various schemas that they can implement to improve data querying, reporting integrity and consistency.
- Star Schema - employs one fact table surrounded by multiple denormalised dimensions tables. The fact table holds the analysts subject matter i.e. sales and the dimensions hold more details about the records in the facts table.
- Snowflake schema - This is an extension of the star schema. at the centre is one facts table surrounded by normalised dimensions tables. However, the dimension tables have sub-dimensions tables each creating its own mini star schema.
- Third Normal Form - This is an extreme case where there are a small number of facts tables surrounded by multiple dimensions tables with many to many relationships. its common in Online Transaction processing Systems.
The 3 schemas above are very common and are used by data analysts when modelling their data. Each model presents some advantages as well as shortcomings a follows;
Star Model
This model employs one fact tables surrounded by several dimensions tables.
Advantages;
- Faster Query Performance as there are fewer joins per query.
- Simplified Data Modelling as the structure is intuitive for developers.
- Reduced join complexity as the dimension tables are not normalized.
- Enhanced reporting efficiency as DAX operations run faster on faster queries.
- Improved usability resulting from its intuitive layout. Limitations
- Data redundancy on the denormalized dimensions tables.
- High maintenance cost as new data has to be denormalized into the simpler dimension tables.
Snowflake Model
This model employs one fact table surrounded by multiple dimensions tables each with their own sub-dimensions.
Advantages
- Reduced data redundancy as the dimensions table are normalized.
- Increased storage efficiency as data is not duplicated across tables.
- Is better suited for modelling complex real world operations such as supply chain data.
- Improved query performance for highly selective queries.
Limitations
- Hard to navigate due to its complex nature.
- Reduced query efficiency resulting from an increase in joins.
3 NF Model
This model is not commonly used due to its complexity and computation cost when running queries. However, in some niche OLTP applications its preferred for its data integrity advantages.
Data Models
In the real world, data exists in many forms and can be fetched from diverse sources.
Tabular data commonly exists as a collection of tables in a database management system, however, there are exceptions to this where some forms data can and exists as a single file. A good example is a google sheet extracted from google forms.
When working with such data its trivial to perform, exploratory analysis, cleaning and transformation.
On the other hand, working with multiple tables requires a data modelling approach. In power BI, the modelling approach is as follows;
- Identify the necessary tables from the source data.
- Load the selected tables into power BI.
- Perform exploratory analysis on the loaded data to identify relationships and cardinality.
- Perform data cleaning on the loaded data ensuring uniformity and consistency.
- Choose a favourable schema to model your data on (star is the most common)
- Define the relationships between your data tables via the model view.
- Test your model via slicers from the report view.
ADVANTAGES OF DATA MODELLING
- Simplifies data complexity by providing a visual representation of the entire dataset.
- A good data model reduces computation cost when querying the data.
- Ensures data integrity and consistency when reporting.
- Simplifies data organization and communication.\
- A good data model forms the basis of security and access control.
Top comments (0)