Lameck Odhiambo

Posted on Jun 22

Data Modeling, Joins, Relationships and Schemas

#data #database #datamodeling #dataengineering

Before data reaches its final used destination it needs to be organized in a structured way to enable easy retrieval performance and good storage - This is data modeling. Data modeling is the process of creating a blue print of how data is connected, stored and retrieved in a system. This enables you to create an organized structure for your tables.

Reasons for data modeling

Data consistency
Optimize performance of the queries
Scalability and maintenance
Optimize cost in storage

Layers of data modeling

1.Conceptual Data Model
The highest-level, business-focused view. It defines what data is being collected and how business concepts relate to one another that is subject, characteristics and relation. Invovles gathering information from stake holders.
Agile and Waterfall method of gathering requirements - Waterfall and Agile are two fundamentally different approaches to project management. Waterfall is a linear, step-by-step process where each phase must be completed before the next begins. Agile is an iterative, flexible approach that breaks projects into smaller cycles for continuous improvement and rapid delivery.

Focus:Business entities (e.g., Customers, Products, Orders) and their relationships.
Audience: Business stakeholders, domain experts, and product managers.
Details: Tech-agnostic; no attributes, data types, or system implementations are specified

2. Logical Data Model
The bridge between the business requirements and the technical solution. It defines structure by establishing facts (events) and dimensions (context).

Focus: Data attributes, primary/foreign keys, and specific data objects.
Audience: Data architects and business analysts.
Details:Technology-neutral; independent of the specific database management system (DBMS) being used

Here we come up with an ER(Entity)- Relation Diagram

3. Physical Data Model
The most technical and concrete layer. It dictates exactly how the data will be stored and structured in a specific database.

Focus: Table names, column specifications, data types, storage methods, and compression techniques.
Audience: Database administrators, developers, and data engineers.
Details: Highly specific to a chosen engine (e.g., PostgreSQL, Snowflake, BigQuery)

Types of Data modeling

1. OLTP (Online Transactional Processing)

The process of designing databases to handle high volumes of fast, real-time, day-to-day transactions (such as e-commerce checkouts or banking transfers). Its primary goal is to ensure data integrity, eliminate redundancy, and support rapid write, update, and delete operations.
This is the fast step taken before moving data to a datawarehouse from a database.

Core Principles of OLTP Modeling

Normalization (up to 3NF): Data is broken down into smaller, logical tables to eliminate duplication. For instance, a customer’s address will live in a single Addresses table rather than being repeated on every single order.
Entity-Relationship (ER) Design: Models are created by identifying distinct entities (e.g., Customers, Products, Orders) and establishing strict relationships (e.g., one-to-many, many-to-many) between them.
ACID Compliance: The model prioritizes atomicity, consistency, isolation, and durability so that complex, multi-step transactions either succeed entirely or roll back cleanly without data corruption

Best Practices

Implement Strong Constraints: Use Primary Keys (PK), Foreign Keys (FK), UNIQUE constraints, and NOT NULL rules at the database level to enforce strict data integrity.
Index Wisely: Index your Primary and Foreign Keys to speed up row retrieval, but avoid over-indexing, as this will slow down write-heavy transactions.
Choose the Right Technology: Utilize robust Relational Database Management Systems (RDBMS) like postgresql, oracle or MySQL.

Common Data Types in Data Modeling

Data types are generally divided into standard primitive types and advanced complex structures.

Numeric Types

Integer: Stores whole numbers without decimals (e.g., ID numbers or inventory counts).
Float / Real: Stores approximate numerical values with fractional decimals for scientific data.
Decimal / Numeric: Stores exact fixed-point decimals, making it ideal for financial amounts.

String and Text Types

CHAR: Holds fixed-length text character strings, padding shorter inputs with spaces.
VARCHAR: Holds variable-length text strings up to a specified maximum length.
TEXT / CLOB: Stores large blocks of character data, such as product descriptions or articles.

Date and Time TypesDATE: Records calendar dates consisting of the year, month, and day.

TIME: Captures precise hours, minutes, and seconds.
TIMESTAMP: Combines date and time to track real-time systemic events or logs.

Logical and Binary

TypesBoolean: Evaluates to true or false states to support logical checks.
BLOB: Keeps raw binary large objects, including uploaded imagery, video files, or document attachments.

Complex and Semi-Structured Types

Array: Groups a list of multiple values inside a single column field.
Struct / JSON: Embeds a nested key-value format block to represent flexible, semi-structured object details

Primary keys in a Database

In a database, a key is an attribute (column) or a collection of attributes used to uniquely identify rows within a table and establish relationships between multiple tables. Keys are foundational for enforcing data integrity, preventing duplication, and ensuring efficient data retrieval.

Why Database Keys Matter

Enforce Uniqueness: They stop identical duplicate rows from muddying your datasets.
Connect Data: They link related concepts (e.g., matching a CustomerID foreign key in an Orders table back to the master Customers profile).
Speed Up Searches: Database engines automatically build indexes around key fields, drastically accelerating query performance.

Relationships in a database

Database relationships are logical links established between two or more tables based on a common column. In a relational database management system (RDBMS) like MySQL or PostgreSQL, these connections dictate how records interact. They use Primary Keys (PK) and Foreign Keys (FK) to eliminate redundant data and maintain data integrity
Inorder to connect different entities in a database we need relationships, to configure relationships we need cardinality

Used Cases

One to Many
A one-to-many (1:N) relationship occurs when a single record in one table (the parent) links to multiple records in another table (the child), but each child record maps back to exactly one parent record. It is the most common pattern in database design because it minimizes redundant data and enforces clear hierarchies
e.g Customers and Orders, Departments and Employees

Many to Many
A many-to-many (M:N) relationship occurs in a database when multiple records in one table are associated with multiple records in another table. Relational database systems cannot link two tables directly in this manner because doing so violates database normalization principles, leading to severe data duplication and maintenance issues.

One to One
A one-to-one (1:1) database relationship occurs when a single record in Table A is linked to exactly one record in Table B, and vice versa. It means each row in either table has a maximum of one matching row on the opposite side. eg Person and Passport,Country and Capital City,Car and License Plate,Store User and Shopping Cart,Employee and Desk Assignment,App Account and Premium Subscription

Normalization

Database normalization is a systematic design process used to organize data in a relational database to minimize data redundancy and eliminate data modification anomalies. e.g 1NF, 2NF,3NF

After creating these relationships, the next process is the last layer that is the physical layer. Implementing the conceptual layer and logical layer by writing SQL scripts.

2. OLAP Data Modeling(Online Analytical Processing)

Source of data are the databases created using OLTP data modeling. Online Analytical Processing (OLAP) data modeling structures data for rapid querying and business intelligence. It organizes information into a multidimensional model

Database ---------> Bronze -----------> Silver ----------> Gold

Bronze

Exact replica of tables from database

Silver

Transformed data
Aggregations e.g One big table(OBT)

Gold

Dimension data model
Fact and Dimensions Tables

The most common ways to physically structure OLAP models are through specific schemas in a data warehouse

1. Star Schema
The most widely used and recognizable model.Structure: Consists of a central fact table surrounded by multiple dimension tables.

Fact Table: Contains the quantitative measurements (e.g., Sales Amount, Units Sold) and foreign keys mapping to the dimensions.

Dimension Tables: Highly denormalized tables containing descriptive attributes (e.g., Customer Name, Store Location, Product Category).

Benefit: Simplicity and extremely fast read times, as it requires fewer table joins to get analytical results.

2. Snowflake Schema
A refinement of the star schema.

Structure: Similar to the star schema, but the dimension tables are normalized, meaning they branch out into sub-dimension tables.

Example: A "Product" dimension might connect to a "Category" sub-dimension, which connects to a "Department" sub-dimension.

Benefit: Reduces data redundancy and takes up less storage space, though queries may require more complex joins

Dimensions tables do change with time, hence need for SCDs( Slowly Changing Dimensions)

Type 0 - No change
Type 1 - Upsert / Overwrite
Type 2 - Tracking history of changes
Type 3 - Adds new column

SCDS explained check here https://en.wikipedia.org/wiki/Slowly_changing_dimension

Joins in data modeling

Joins are heavily used in both OLTP and OLAP, but they are used for completely different reasons and perform differently in each system.
In data modeling, joins are operations used to combine rows from two or more tables horizontally into a single dataset, based on a related common key (such as an ID). They are fundamental for integrating normalized databases and bringing related data together for reporting and analysis.

OLTP (Online Transaction Processing)

How it’s used: Joins are necessary. OLTP systems process day-to-day business transactions (like an e-commerce checkout) and use highly normalized schemas.

The Goal: Data is split into many small tables (e.g., customers, orders, products) to prevent duplication and ensure fast, accurate data entry.

Impact: Queries join a few tables together, but they typically only touch a very small number of rows (e.g., a single customer's specific order), making these joins extremely fast.

OLAP (Online Analytical Processing)

How it’s used: Joins are typically used in relational data warehouses (using Star or Snowflake schemas) to connect a central Fact table to surrounding Dimension tables.

The Goal: OLAP is designed for complex, historical analysis scanning millions of rows. Because large-scale joins are computationally expensive, OLAP models use denormalization (duplicating some data) to keep joins to a minimum and boost query performance.

Impact: Queries involve multi-table joins and massive aggregations, which naturally take longer (seconds or minutes) but yield deep business insights.

The 4 Primary Types of Joins

The type of join you choose determines how unmatched data (rows that don't share a common key) is handled:

INNER JOIN: Returns only the rows where there is a matching value in both tables. If a record doesn't exist on both sides, it is excluded.

LEFT JOIN (Left Outer): Returns all rows from the left table, and the matching rows from the right table. If there is no match on the right side, the result will contain null for the right-hand columns.

RIGHT JOIN (Right Outer): Returns all rows from the right table, and the matching rows from the left table. If there is no match on the left side, the result will show null for the left-hand columns.

FULL JOIN (Full Outer): Returns all records when there is a match in either the left or right tables. If there is no match on either side, the result will contain null.

Point worth noting

Joins vs. Relationships Joins physically merge or combine datasets to create a new, static result set (commonly used in SQL queries or Power Query).Relationships establish an ongoing, logical connection between tables so the modeling tool (like Microsoft Power BI or Tableau) can calculate metrics across the tables dynamically.

Conclusion

The Master Blueprint of Data Success
Data modeling is not just about organizing tables. It is the secret blueprint that turns messy, raw numbers into powerful business insights. By mastering schemas, relationships, and joins, you build a solid foundation for any data project.The Star Schema serves as your ultimate map, keeping your data clean and organized. Relationships act as smart bridges, letting your tables talk to each other without creating clutter. Meanwhile, joins work like glue to merge data when you need a single, complete view.When these three tools work together, magic happens. Your reports run faster, your numbers stay accurate, and your business can grow without slowing down. In short, a great data model turns confusing data into clear, actionable answers.

DEV Community