<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shaban Ibrahim</title>
    <description>The latest articles on DEV Community by Shaban Ibrahim (@shabex).</description>
    <link>https://dev.to/shabex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3708629%2Fde01968e-ff12-49be-8946-3dbb0cdcece3.jpeg</url>
      <title>DEV Community: Shaban Ibrahim</title>
      <link>https://dev.to/shabex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shabex"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:46:18 +0000</pubDate>
      <link>https://dev.to/shabex/etl-vs-elt-which-one-should-you-use-and-why-4hoi</link>
      <guid>https://dev.to/shabex/etl-vs-elt-which-one-should-you-use-and-why-4hoi</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3p8ifi1im7tkbh1ly81.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3p8ifi1im7tkbh1ly81.webp" alt="article_head" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;The global big data analytics market is estimated to grow by 165.5% by 2032. With this rapid growth comes a rising demand for processing ever-larger volumes of data, and this is where the ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes become essential. Although the two sound similar, they are very different, and the differences can be confusing.&lt;/p&gt;

&lt;p&gt;ETL predates ELT by a long way. For decades, it has delivered value to businesses through meaningful insights. ELT, by contrast, is a product of the cloud revolution, enabling users to handle data at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL vs ELT
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is ETL (Extract, Transform, Load)
&lt;/h4&gt;

&lt;p&gt;As the acronym suggests, ETL stands for Extract, Transform and Load. It is a traditional data integration approach in which data from multiple sources is consolidated into a central system. These sources can include CRM platforms, e-commerce websites, helpdesk systems and more.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;How ETL Works&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Extract (E):&lt;/em&gt;&lt;/strong&gt; Data is pulled from various sources, e.g. databases, APIs and files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Transform (T):&lt;/em&gt;&lt;/strong&gt; Data cleaning and modification are performed before storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Load (L):&lt;/em&gt;&lt;/strong&gt; After transformation, the data is loaded into a database or warehouse for consumption.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
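&lt;p&gt;The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample records and the in-memory SQLite database are hypothetical stand-ins for a real source and warehouse.&lt;/p&gt;

```python
import sqlite3

# Extract: pull raw records from a source (a hypothetical sample standing
# in for a database, an API, or a file export)
raw_orders = [
    {"id": 1, "customer": " alice ", "amount": "120.50"},
    {"id": 2, "customer": "BOB", "amount": "80.00"},
]

# Transform: clean and standardise BEFORE loading, the defining ETL step
transformed = [
    (r["id"], r["customer"].strip().title(), float(r["amount"]))
    for r in raw_orders
]

# Load: only the refined data reaches the warehouse (SQLite here)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
```

&lt;p&gt;Because the cleaning happens on the way in, the warehouse never sees the raw, messy values.&lt;/p&gt;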

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn9wyrkblbxl6fft9901.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn9wyrkblbxl6fft9901.png" alt="image01" width="800" height="434"&gt;&lt;/a&gt;&lt;br&gt;
The diagram above illustrates how ETL works&lt;/p&gt;

&lt;h4&gt;
  
  
  What is ELT (Extract, Load, Transform)
&lt;/h4&gt;

&lt;p&gt;As the acronym suggests, ELT stands for Extract, Load and Transform. It is a modern variant of ETL in which data is first loaded into the warehouse and then transformed using powerful cloud engines.&lt;/p&gt;

&lt;p&gt;ELT is especially popular on cloud platforms such as Amazon Web Services, Microsoft Azure and Google Cloud. It is preferred for its ability to handle and process large volumes of data, its flexibility, and the speed with which pipelines can be developed compared to ETL.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;How ELT Works&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Extract (E):&lt;/em&gt;&lt;/strong&gt; Data is pulled from various sources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Load (L):&lt;/em&gt;&lt;/strong&gt; The raw data is loaded into the target system; sensitive data may be masked, encrypted or dropped on the way in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Transform (T):&lt;/em&gt;&lt;/strong&gt; Data cleaning and modification are performed using the target system's computing resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
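&lt;p&gt;The same toy data can illustrate the ELT ordering. The point of the sketch is that the raw data lands first, and the cleaning is expressed as SQL executed by the warehouse's own engine (an in-memory SQLite database stands in for a cloud warehouse here).&lt;/p&gt;

```python
import sqlite3

# Extract: pull raw records from a source (hypothetical sample data)
raw_orders = [
    (1, " alice ", "120.50"),
    (2, "BOB", "80.00"),
]

# Load: land the raw, untransformed data in the warehouse first
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: clean INSIDE the warehouse using its own SQL engine,
# which is where a cloud platform's compute power comes in
conn.execute("""
    CREATE TABLE orders AS
    SELECT id,
           TRIM(customer)       AS customer,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
```

&lt;p&gt;Keeping the raw table around is a common side benefit of ELT: the transformation can be rerun or revised later without re-extracting from the source.&lt;/p&gt;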

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4pktblnoznzu5c53z5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4pktblnoznzu5c53z5g.png" alt="Image2" width="800" height="434"&gt;&lt;/a&gt;&lt;br&gt;
The diagram above illustrates how ELT works&lt;/p&gt;

&lt;h3&gt;
  
  
  Differences between ETL and ELT
&lt;/h3&gt;

&lt;p&gt;One of the key and most noticeable differences between the two processes is the point at which data is transformed and how the warehouse retains the data. ETL transforms data outside the warehouse on a separate server, while ELT transforms data within the warehouse itself. It should also be clear that ETL does not move raw data into the warehouse, while ELT, on the other hand, does load raw data into the pipeline.&lt;/p&gt;

&lt;p&gt;ELT processes data faster than ETL because ETL requires a preliminary transformation before loading data into the warehouse, which makes it difficult to scale: as the size of the data grows, performance slows down. ELT, on the other hand, loads the data directly into the target warehouse, saving time and easing scalability, since transformation is done in parallel.&lt;/p&gt;

&lt;p&gt;Data ingestion in ETL is slowed down by transforming data on a separate server before loading. ELT, by contrast, delivers faster ingestion, since loading and transformation can happen simultaneously.&lt;/p&gt;

&lt;p&gt;When it comes to data variety, ELT has the edge: it offers superior processing of structured, semi-structured and unstructured data, whereas ETL is best suited to structured data only.&lt;/p&gt;

&lt;p&gt;ELT outdoes ETL in many aspects, such as speed, cost, privacy, maintenance, flexibility and volume, and it is a process that will only grow in importance as the data world continues to advance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Case for ETL
&lt;/h2&gt;

&lt;p&gt;In a world where data is constantly growing and evolving, ETL is an efficient way of handling it, because it solves key data management problems by ensuring data accuracy, consistency and availability, all of which are essential for decision-making.&lt;/p&gt;

&lt;p&gt;ETL enables &lt;strong&gt;&lt;em&gt;real-time data analysis for business insights&lt;/em&gt;&lt;/strong&gt;. Businesses need to make accurate decisions in a dynamic environment, and ETL ensures data is extracted, transformed and loaded as it is generated, allowing businesses to respond to market changes, optimize supply chains and track customer behaviour instantly.&lt;/p&gt;

&lt;p&gt;ETL has facilitated &lt;strong&gt;&lt;em&gt;migration of data from legacy systems to modern platforms&lt;/em&gt;&lt;/strong&gt;. The use of ETL has ensured safe migration of data from one system to another without losing data integrity and consistency.&lt;/p&gt;

&lt;p&gt;The ETL process can &lt;strong&gt;&lt;em&gt;integrate and transform customer data from multiple touchpoints&lt;/em&gt;&lt;/strong&gt;. For an e-commerce business, customer data is especially valuable when it comes to offering a personalized experience.&lt;/p&gt;

&lt;p&gt;The manufacturing sector is also a major user of the ETL process, especially for predictive maintenance to reduce downtime and prevent costly breakdowns. ETL &lt;strong&gt;&lt;em&gt;processes collect and transform data from IoT sensors and machinery to help predict when maintenance will be needed.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Data governance and compliance&lt;/em&gt;&lt;/strong&gt; is another area where ETL process can come in handy. Institutions and sectors that handle sensitive data, such as healthcare, finance or security sectors, must comply with strict regulatory requirements when it comes to data governance. Through ETL, data is transformed and loaded in compliance with the laid-out regulations, making ETL instrumental when it comes to the implementation of data governance policies and data security.&lt;/p&gt;

&lt;p&gt;We see that although ETL is a legacy process, it remains crucial for day-to-day data handling, for accurate decision-making, and for maintaining data integrity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Case for ELT
&lt;/h2&gt;

&lt;p&gt;ELT has emerged as an alternative in modern data architecture and is being adopted by many businesses since it offers much more compared to ETL.&lt;/p&gt;

&lt;p&gt;ELT pipelines let businesses extract customer data from all their channels, ad partners and marketing platforms, load the data into a cloud data warehouse and transform when needed. This helps when it comes to &lt;strong&gt;&lt;em&gt;building a unified customer profile&lt;/em&gt;&lt;/strong&gt;. Unified data enables faster, more profitable decisions, new revenue streams and stronger customer loyalty through personalized experiences.&lt;/p&gt;

&lt;p&gt;Banks, payment processors and other fintech companies use the ELT process to &lt;strong&gt;&lt;em&gt;detect fraud and assess risk in real time&lt;/em&gt;&lt;/strong&gt; across millions of transactions, helping to prevent scams and protect customers.&lt;/p&gt;

&lt;p&gt;Medium and large enterprises in retail and manufacturing use ELT to &lt;strong&gt;&lt;em&gt;optimize their supply chains and inventory levels across warehouses, stores and distribution channels&lt;/em&gt;&lt;/strong&gt;. This is done through the creation of an efficient supply chain control tower in cloud data platforms.&lt;/p&gt;

&lt;p&gt;The healthcare industry can use ELT to &lt;strong&gt;&lt;em&gt;securely combine structured and unstructured data at scale&lt;/em&gt;&lt;/strong&gt;. This will help the healthcare providers and hospital systems to improve patient outcomes and operational efficiency.&lt;/p&gt;

&lt;p&gt;ELT also handles &lt;strong&gt;&lt;em&gt;data migration and consolidation into cloud data warehouses or lakehouses&lt;/em&gt;&lt;/strong&gt;. This solves many business problems, especially when consolidating data from multiple sources and systems or upgrading to a more agile analytics environment.&lt;/p&gt;

&lt;p&gt;Enterprises across multiple industries use ELT-based consolidation of data to create a single source of truth. This results in a scalable, low-maintenance environment that is supportive of advanced analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  ETL Tools
&lt;/h3&gt;

&lt;p&gt;When it comes to data integration, having the right tools makes all the difference. Some of the ETL tools available include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Matillion&lt;/em&gt;&lt;/strong&gt; - A cloud-based data integration platform with AI capabilities, designed to simplify and accelerate ETL processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;IBM DataStage&lt;/em&gt;&lt;/strong&gt; - Designed to support data integration across multiple sources and targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Informatica PowerCenter&lt;/em&gt;&lt;/strong&gt; - An enterprise-grade ETL platform used by businesses for robust and efficient data integration across various sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Talend Data Fabric&lt;/em&gt;&lt;/strong&gt; - A tool that provides a range of data integration and management solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Astera Centerprise&lt;/em&gt;&lt;/strong&gt; - A tool that simplifies complex ETL processes with an intuitive, no-code approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ELT Tools
&lt;/h3&gt;

&lt;p&gt;Unlike ETL, ELT processes leverage the computational power of cloud data warehouses. Here are some of the ELT tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Azure Data Factory&lt;/em&gt;&lt;/strong&gt; - A cloud-based data integration service that automates the movement and transformation of data from various sources to destinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Google Cloud Dataflow&lt;/em&gt;&lt;/strong&gt; - A fully managed stream and batch data processing service that enables users to develop and execute data processing pipelines with ease.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;AWS Glue&lt;/em&gt;&lt;/strong&gt; - A serverless data integration service that can be used for analytics, machine learning and application development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Rivery&lt;/em&gt;&lt;/strong&gt; - A cloud-native ELT platform that automates data integration, transformation and orchestration without the need for infrastructure management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Airbyte&lt;/em&gt;&lt;/strong&gt; - An open-source data integration platform that simplifies the ELT process by providing pre-built connectors for seamless data movement across systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two approaches used to move and prepare data for analysis, but they differ mainly in when the transformation happens. In ETL, data is cleaned and structured before it is loaded into a storage system, making it ideal for environments where data quality and consistency are critical from the start. This approach is commonly used in traditional systems where storage and processing power are limited, and only refined data is needed for reporting.&lt;/p&gt;

&lt;p&gt;ELT, on the other hand, loads raw data directly into a data warehouse first and then performs transformations within that system. This method takes advantage of modern cloud platforms that offer strong processing capabilities, allowing teams to store large volumes of data and transform it as needed. ELT is more flexible and faster for data ingestion, making it well-suited for big data analytics, real-time dashboards, and data science work where raw data exploration is important.&lt;/p&gt;

&lt;p&gt;Choosing between ETL and ELT depends on your specific needs. If your priority is strict data quality, control, and working with structured systems, ETL is often the better choice. However, if you need scalability, speed, and flexibility—especially in cloud-based environments—ELT is usually more effective. In many modern setups, organizations combine both approaches to balance control with performance and adaptability.&lt;/p&gt;

</description>
      <category>etl</category>
      <category>elt</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>FROM SQL TO POWER BI FOR ANALYSIS</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Mon, 16 Mar 2026 04:51:17 +0000</pubDate>
      <link>https://dev.to/shabex/from-sql-to-power-bi-for-analysis-10ei</link>
      <guid>https://dev.to/shabex/from-sql-to-power-bi-for-analysis-10ei</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;As a data analyst, you are probably interacting with &lt;strong&gt;Microsoft Power BI&lt;/strong&gt; in your everyday operations, because it is one of the most powerful business intelligence and data visualisation tools ever developed. &lt;strong&gt;Power BI&lt;/strong&gt; is known for its ability to transform raw data and generate meaningful insights, reports and analytics through creating interactive dashboards for consumption by the end users of the data. Most companies and businesses rely on it to analyse trends, monitor performance and support data-driven decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL databases&lt;/strong&gt; play a critical role in storing and managing analytical data in modern organisations. Structured Query Language (SQL) databases are designed to organise large volumes of structured data into tables consisting of rows and columns. This structure makes it easier to store, retrieve, and manipulate data efficiently, which is essential for data analysis and business decision-making.&lt;/p&gt;

&lt;p&gt;SQL databases form the foundation of many modern data systems, including data warehouses and analytics platforms. Databases such as PostgreSQL, MySQL, and Microsoft SQL Server are widely used to store structured analytical data that can later be analysed using statistical tools, machine learning models, or reporting platforms. Power BI has the ability to connect directly to such databases to allow analysts to access real-time data for data visualisation and reporting.&lt;/p&gt;

&lt;h1&gt;
  
  
  Linking Power BI
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. To a Local PostgreSQL Database
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Open Power BI Desktop&lt;/strong&gt;&lt;br&gt;
Start by opening Power BI Desktop, as it is the main environment for dashboard and report creation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pulling your data from the database of your choice&lt;/strong&gt;&lt;br&gt;
Click "Get Data", then choose "More..." from the drop-down. In the window that opens, select "Database", choose "PostgreSQL database", and click "Connect".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rblbuefqp17a09k03xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rblbuefqp17a09k03xo.png" alt="Image 01" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy4ot5wybvv3esloxuc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy4ot5wybvv3esloxuc9.png" alt="Image 02" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqsxeceq2fb62dq3qrb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqsxeceq2fb62dq3qrb7.png" alt="Image 03" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Connecting to the PostgreSQL database&lt;/strong&gt; &lt;br&gt;
Here, you will be prompted to enter the database connection details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj2e88az4wq8h8ia9dfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj2e88az4wq8h8ia9dfq.png" alt="Image 04" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the dialogue box, enter the server name, which is &lt;strong&gt;&lt;em&gt;localhost:5432&lt;/em&gt;&lt;/strong&gt; (the default host and port for a local PostgreSQL installation), and the name of the database you would like to connect to. In my case, the database is &lt;strong&gt;&lt;em&gt;walmart_db&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p7qq7nt1i1htywrp9x1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p7qq7nt1i1htywrp9x1.png" alt="Image 05" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Configuring the connection to your database&lt;/strong&gt; &lt;br&gt;
You will be directed to another dialogue box used for configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncyahuy9ilz40bdr65cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncyahuy9ilz40bdr65cj.png" alt="Image 06" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since you are connecting from your local environment, the default connection details for PostgreSQL are&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;Host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
&lt;span class="py"&gt;Port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5432&lt;/span&gt;
&lt;span class="py"&gt;Username&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;span class="py"&gt;Database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;span class="py"&gt;Password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your_password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
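&lt;p&gt;Outside Power BI, the same default details can be assembled into a standard PostgreSQL connection URI, which many client tools accept. This is a small sketch using only Python's standard library; the password is a placeholder, as in the dialogue above.&lt;/p&gt;

```python
from urllib.parse import quote

# The default local-connection details shown above
host, port = "localhost", 5432
username, database = "postgres", "postgres"
password = "your_password"  # placeholder, replace with your own

# Standard libpq-style URI: postgresql://user:pass@host:port/dbname
# (quote() percent-encodes any special characters in the password)
uri = f"postgresql://{username}:{quote(password)}@{host}:{port}/{database}"
print(uri)
```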



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4zx63m9wbzpnyqyi4sc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4zx63m9wbzpnyqyi4sc.png" alt="Image 07" width="796" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Loading your data from your database into Power BI&lt;/strong&gt; &lt;br&gt;
After filling in the correct credentials and connecting successfully, a Navigator window will appear, displaying the tables available in PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrtmrbthcsolupa7t3li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrtmrbthcsolupa7t3li.png" alt="Image 08" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "Load" to import the selected tables into  Power BI&lt;br&gt;
In our case, we have loaded our table that has more than 9000 rows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtmw8m45iven4y15hsd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtmw8m45iven4y15hsd3.png" alt="Image 09" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  2. To a Cloud PostgreSQL Database
&lt;/h2&gt;

&lt;p&gt;Modern databases are increasingly hosted in the cloud rather than in traditional local environments, because the cloud offers flexibility, scalability and reliability that traditional systems cannot match. &lt;/p&gt;

&lt;p&gt;For organizations working with large analytical datasets, cloud databases provide a flexible and powerful foundation for data-driven decision-making, improving accessibility, cost efficiency, reliability, and integration with advanced analytics tools.&lt;/p&gt;

&lt;p&gt;One such cloud platform that provides fully managed open-source data infrastructure services is &lt;strong&gt;Aiven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Obtain Connection Details&lt;/strong&gt;&lt;br&gt;
On your &lt;strong&gt;Aiven&lt;/strong&gt; service page, after creating a PostgreSQL service, you can retrieve connection details such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Host&lt;/li&gt;
&lt;li&gt;Port&lt;/li&gt;
&lt;li&gt;Database Name&lt;/li&gt;
&lt;li&gt;Username&lt;/li&gt;
&lt;li&gt;Password&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These details will be used in Power BI for the connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36puorxsq5tzktwk4j96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36puorxsq5tzktwk4j96.png" alt="Image 10" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Download SSL Certificate&lt;/strong&gt;&lt;br&gt;
An SSL certificate (Secure Sockets Layer certificate) is a digital certificate that enables encrypted and secure communication between your application and database, protecting credentials and data during transmission.&lt;/p&gt;

&lt;p&gt;From the Aiven dashboard, locate the CA certificate and download it. Save it locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furvlt8as6jkgbfxmj7ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furvlt8as6jkgbfxmj7ww.png" alt="Image 11" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After downloading the CA certificate, create a folder on your computer, store the certificate there, and rename it from &lt;code&gt;ca.pem&lt;/code&gt; to &lt;code&gt;ca.crt&lt;/code&gt;. Power BI will automatically detect &lt;code&gt;ca.crt&lt;/code&gt;, enable SSL and verify the server certificate.&lt;/p&gt;
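&lt;p&gt;For reference, the same Aiven details can be expressed as a single libpq-style connection URI with SSL verification enabled, which is useful for checking the connection from other client tools. This is a sketch: the username below is an assumption (Aiven's usual default admin user) and the password is a placeholder.&lt;/p&gt;

```python
from urllib.parse import quote

# Illustrative values from this article's Aiven service
host = "pg-2f14f89-shabsibrah-9c1c.c.aivencloud.com"
port = 10116
database = "luxsales"
username = "avnadmin"       # assumption: Aiven's default admin user
password = "your_password"  # placeholder, copy yours from the Aiven page

# sslmode=verify-ca tells libpq clients to verify the server certificate
# against the downloaded CA file (renamed to ca.crt above)
uri = (
    f"postgresql://{username}:{quote(password)}@{host}:{port}/{database}"
    "?sslmode=verify-ca&sslrootcert=ca.crt"
)
print(uri)
```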

&lt;p&gt;&lt;strong&gt;3. Connect to Power BI&lt;/strong&gt;&lt;br&gt;
Click "Get Data", then choose "More..." from the drop-down. In the window that opens, select "Database", choose "PostgreSQL database", and click "Connect".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rblbuefqp17a09k03xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rblbuefqp17a09k03xo.png" alt="Image 01" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy4ot5wybvv3esloxuc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy4ot5wybvv3esloxuc9.png" alt="Image 02" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqsxeceq2fb62dq3qrb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqsxeceq2fb62dq3qrb7.png" alt="Image 03" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Connecting to the PostgreSQL database&lt;/strong&gt; &lt;br&gt;
Here, you will be prompted to enter the database connection details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj2e88az4wq8h8ia9dfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgj2e88az4wq8h8ia9dfq.png" alt="Image 04" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the dialogue box, enter the server name, which is &lt;strong&gt;&lt;em&gt;pg-2f14f89-shabsibrah-9c1c.c.aivencloud.com:10116&lt;/em&gt;&lt;/strong&gt; (the server name and port number from Aiven), and the name of the database you would like to connect to. In my case, the database is &lt;strong&gt;&lt;em&gt;luxsales&lt;/em&gt;&lt;/strong&gt;, created earlier via DBeaver. Then click &lt;em&gt;OK&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqb6j6fz05floir4ew4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqb6j6fz05floir4ew4f.png" alt="Image 12" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go back to Aiven, copy the username and password, fill them into the dialogue box, and click "Connect".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqij4xlpomqf5rcn7r6fn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqij4xlpomqf5rcn7r6fn.png" alt="Image 13" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynyor4vi8gl7y4vsdopx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynyor4vi8gl7y4vsdopx.png" alt="Image 14" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Loading your data from your database to Power BI&lt;/strong&gt; &lt;br&gt;
After filling in the correct credentials and authenticating, a Navigator window will appear once the connection succeeds, displaying all the tables available in that database. Select the tables you need and load them into Power BI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c2twdr9r00nrxq1d8j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c2twdr9r00nrxq1d8j4.png" alt="Image 15" width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Loaded Tables and Relationships between them&lt;/strong&gt;&lt;br&gt;
Once the tables have been loaded into Power BI, the next step is to carry out data modelling by detecting and establishing the relationships that exist between the loaded tables.&lt;/p&gt;

&lt;p&gt;This starts with understanding what each table communicates:&lt;/p&gt;

&lt;p&gt;Customer table --------&amp;gt; Gives information about the customers&lt;br&gt;
Products table --------&amp;gt; Gives information about the products&lt;br&gt;
Inventory table -------&amp;gt; Gives details on the stock levels&lt;br&gt;
Sales table -----------&amp;gt; The fact table; it records the sales transactions.&lt;/p&gt;

&lt;p&gt;Power BI automatically attempts to detect and establish the relationships between the tables, connecting the primary key of a dimension table to the corresponding foreign key in the fact table or in another dimension table.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sales.customer_id -&amp;gt; customers.customer_id&lt;/li&gt;
&lt;li&gt;sales.product_id -&amp;gt; products.product_id&lt;/li&gt;
&lt;li&gt;inventory.product_id -&amp;gt; products.product_id&lt;/li&gt;
&lt;/ul&gt;
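&lt;p&gt;Before trusting the auto-detected model, you can sanity-check a relationship in SQL. The query below is a small sketch using the table and column names listed above; it looks for sales rows whose customer_id has no match in the customers table, and it should return zero rows if the relationship is clean.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- sales rows with no matching customer (expect zero rows)
select s.customer_id
from sales s
left join customers c
on s.customer_id = c.customer_id
where c.customer_id is null;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;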

&lt;p&gt;The detected relationships can be seen in the Model view, where a diagram displays the final model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyuoq071n4v5mfw2w8fby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyuoq071n4v5mfw2w8fby.png" alt="Image 16" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data modelling is a powerful step: it ensures that data is combined correctly, that filters propagate properly across the tables, and ultimately that accurate reports are built on top of the model.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Power of Combining SQL with Power BI
&lt;/h2&gt;

&lt;p&gt;SQL has proven to be one of the most enduring analysis tools, and it comes in very handy when you are dealing with large datasets. It is well complemented by Power BI, which offers powerful data visualisation and transformation capabilities.&lt;/p&gt;

&lt;p&gt;With well-written queries, SQL can retrieve data efficiently from very large databases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sum_rev&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQL helps analysts filter datasets before loading them into Power BI. Instead of importing an entire database, analysts can use SQL &lt;code&gt;WHERE&lt;/code&gt; clauses to limit the data to a specific period, region, or category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;count_client&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Male'&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQL enables analysts to perform aggregations and calculations directly within the database. Using functions such as &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, and &lt;code&gt;GROUP BY&lt;/code&gt;, analysts can compute metrics like total sales, average product prices, or the number of customers per region before the data reaches Power BI.&lt;/p&gt;
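<p_placeholder></p_placeholder>&lt;p&gt;As an illustration, a pre-aggregation like the one below pushes the heavy lifting into PostgreSQL so that Power BI only receives the summarised rows. This is a sketch based on the sales table used earlier; the region column is assumed here purely for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- one summary row per region instead of every transaction
select region,
       count(customer_id) as customer_count,
       sum(amount) as total_sales,
       avg(amount) as avg_sale
from sales
group by region;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;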

&lt;p&gt;Additionally, SQL is extremely useful for preparing and transforming data. Analysts frequently need to join multiple tables, clean inconsistent values, or create intermediate datasets. SQL features such as &lt;code&gt;JOIN&lt;/code&gt;, &lt;code&gt;CASE&lt;/code&gt;, &lt;code&gt;CTE&lt;/code&gt; (Common Table Expressions), and subqueries allow analysts to shape the data into a structure that works well with Power BI’s data model.&lt;/p&gt;
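&lt;p&gt;The query below sketches how these features can work together: a CTE aggregates revenue per product, a JOIN brings in the product name, and a CASE expression labels each product. The table and column names follow the earlier examples, while the revenue threshold is invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- CTE: total revenue per product
with product_revenue as (
    select product_id, sum(amount) as revenue
    from sales
    group by product_id
)
-- JOIN adds the product name; CASE buckets the revenue
select p.product_name,
       r.revenue,
       case when r.revenue &amp;gt;= 10000 then 'High'
            else 'Low'
       end as revenue_band
from product_revenue r
join products p
on p.product_id = r.product_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;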

&lt;p&gt;Finally, SQL complements Power BI by enabling analysts to build reliable and scalable data pipelines. When data is properly prepared using SQL, the dashboards built in Power BI become simpler, faster, and easier to maintain.&lt;/p&gt;

&lt;p&gt;By combining the power of SQL for data extraction and transformation with visualization capabilities in Microsoft Power BI, analysts can build insightful dashboards that support better business decision-making.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Power BI is a powerful business intelligence platform for turning databases into interactive dashboards and reports. By connecting it to SQL databases such as PostgreSQL, analysts can access large volumes of data and analyse them efficiently.&lt;/p&gt;

</description>
      <category>microsoftpowerbi</category>
      <category>pgsql</category>
      <category>businessintelligence</category>
      <category>database</category>
    </item>
    <item>
      <title>UNDERSTANDING SQL JOIN AND WINDOW FUNCTIONS</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:06:32 +0000</pubDate>
      <link>https://dev.to/shabex/understanding-sql-join-and-window-functions-431d</link>
      <guid>https://dev.to/shabex/understanding-sql-join-and-window-functions-431d</guid>
      <description>&lt;h1&gt;
  
  
  INTRODUCTION
&lt;/h1&gt;

&lt;p&gt;SQL is one of the most powerful tools in the world of data and has stood the test of time. It is reliable and flexible enough to work with any relational data, and its many built-in functionalities make data manipulation easy to navigate.&lt;br&gt;
Among these functionalities are joins and window functions.&lt;/p&gt;
&lt;h2&gt;
  
  
  JOIN FUNCTION
&lt;/h2&gt;

&lt;p&gt;Joins come in handy when dealing with data spread across multiple tables. A join is used when you want to analyse data from different tables together, whether for data analysis or data engineering: it combines rows using related columns from the normalised tables, and it can be applied to two or more tables. There are different types of joins, including:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Inner Join&lt;/strong&gt;&lt;br&gt;
This is the most commonly used form of join. It combines two tables based on the rows that have matching values in both tables; rows whose values do not match in either table are dropped from the result.&lt;/p&gt;

&lt;p&gt;The syntax of an inner join looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;
&lt;span class="k"&gt;inner&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you have products and sales tables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above example, you can see that the two tables are related by the product_id column, which makes it easy to join them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FZaF77tpvsDLgQgWHqOdbzJVPDJ2EAmp4OSSmbk0sSFe4TcsS3jtxY7EmyfMQ3ta6WIs7bg_ZvKsUZOHkXETHRltRXC5YNi4brzpchn4HBzq4dThms2jAyU9E2KohfEL7j0fOevT5PeMdOdKEJj24C1Y" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FZaF77tpvsDLgQgWHqOdbzJVPDJ2EAmp4OSSmbk0sSFe4TcsS3jtxY7EmyfMQ3ta6WIs7bg_ZvKsUZOHkXETHRltRXC5YNi4brzpchn4HBzq4dThms2jAyU9E2KohfEL7j0fOevT5PeMdOdKEJj24C1Y" alt="inner join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Left Join&lt;/strong&gt;&lt;br&gt;
Also called a &lt;strong&gt;&lt;em&gt;Left Outer Join&lt;/em&gt;&lt;/strong&gt;, this one returns every row from the left table and only the matching rows from the right table. Left-table rows with no match in the right table get NULL values for the right table's columns.&lt;/p&gt;

&lt;p&gt;The syntax of a left join looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you have products and sales tables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;left&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above example, all the rows from the products table will be returned, along with the matching rows (by product_id) from the sales table. Products with no matching sale will have NULL values in the sales columns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2FI7BWNmU-rtwMozqKzbWBRgnk2sIv1a1FGElwOheS4ybu8o8erqvNR8Z57CsHndxMpdKlUq8jqaDqyUt7pR775-lSupnm_Cqe5nyncxH3eh0MTf3IA2cWz_8rnMWyDXFIrf-z_MM18zMVs_rQuiIigOw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh6.googleusercontent.com%2FI7BWNmU-rtwMozqKzbWBRgnk2sIv1a1FGElwOheS4ybu8o8erqvNR8Z57CsHndxMpdKlUq8jqaDqyUt7pR775-lSupnm_Cqe5nyncxH3eh0MTf3IA2cWz_8rnMWyDXFIrf-z_MM18zMVs_rQuiIigOw" alt="left Join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Right Join&lt;/strong&gt;&lt;br&gt;
Also called a &lt;strong&gt;&lt;em&gt;Right Outer Join&lt;/em&gt;&lt;/strong&gt;, this one returns every row from the right table and only the matching rows from the left table. Right-table rows with no match in the left table get NULL values for the left table's columns.&lt;/p&gt;

&lt;p&gt;The syntax of a right join looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;
&lt;span class="k"&gt;right&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you have products and sales tables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;right&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above example, all the rows from the sales table will be returned, along with the matching rows (by product_id) from the products table. Sales rows with no matching product will have NULL values in the product columns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FklmC4Kg_LYE9Ic7TO7Yw8IBJMHAx9GBeBm2RU2w3lUM_-TtHUfqYOXgDzoQMCGFLmBVhSeaauM9gUhQUwbt9IZvW7zy1TwLGGONoq8Kn0rPeZmR4txw2LfBctj2TNCPAYUrxe3IgWudpI5FOZ8EWrX-yMa-aVR12tNLX9e_4ggagP1lYGwxA1w-B" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FklmC4Kg_LYE9Ic7TO7Yw8IBJMHAx9GBeBm2RU2w3lUM_-TtHUfqYOXgDzoQMCGFLmBVhSeaauM9gUhQUwbt9IZvW7zy1TwLGGONoq8Kn0rPeZmR4txw2LfBctj2TNCPAYUrxe3IgWudpI5FOZ8EWrX-yMa-aVR12tNLX9e_4ggagP1lYGwxA1w-B" alt="right join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Full Outer Join&lt;/strong&gt;&lt;br&gt;
This combines the &lt;strong&gt;&lt;em&gt;Left outer join&lt;/em&gt;&lt;/strong&gt; and the &lt;strong&gt;&lt;em&gt;Right outer join&lt;/em&gt;&lt;/strong&gt;: it returns all rows from both tables, including the unmatched rows from each side, with NULL values filling the missing columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;
&lt;span class="k"&gt;full&lt;/span&gt; &lt;span class="k"&gt;outer&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you have products and sales tables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;full&lt;/span&gt; &lt;span class="k"&gt;outer&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe621krcwr3f1bvivzspj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe621krcwr3f1bvivzspj.png" alt="Full Outer Join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Cross Join&lt;/strong&gt;&lt;br&gt;
This one returns every row from the first table combined with every row from the second, i.e., a Cartesian product. Assuming you have 3 customers and 3 orders, the resulting table will have 9 rows.&lt;/p&gt;

&lt;p&gt;The syntax of a cross join looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt;
&lt;span class="k"&gt;cross&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you have products and sales tables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;cross&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fencrypted-tbn0.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcQp7N5YVLgNPJkKIKnZe1ip8KYvYepon-IDww%26s" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fencrypted-tbn0.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcQp7N5YVLgNPJkKIKnZe1ip8KYvYepon-IDww%26s" alt="cross join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Self Join&lt;/strong&gt;&lt;br&gt;
This is where a table is joined to itself, mostly to compare rows within the same table. There is no dedicated SELF JOIN keyword; instead, a regular join is written with table aliases so the same table can be treated as two separate tables in the query.&lt;/p&gt;

&lt;p&gt;The syntax of a self-join looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
&lt;span class="k"&gt;self&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;column&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A and B are aliases representing the same table1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Employees&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;
&lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;Employees&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;
&lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faouno30h4ei7pexyvaxp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faouno30h4ei7pexyvaxp.jpg" alt="self join"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Natural Join&lt;/strong&gt;&lt;br&gt;
This is a type of join that combines two tables on the columns that share the same name and compatible data types. It does not need an explicit &lt;em&gt;ON&lt;/em&gt; condition because it automatically detects the common columns and uses them to match rows.&lt;/p&gt;

&lt;p&gt;The syntax of a natural join looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;table1&lt;/span&gt; 
&lt;span class="k"&gt;natural&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;table2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you have products and sales tables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;natural&lt;/span&gt; &lt;span class="k"&gt;join&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxaza8sbqcesd806t07f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxaza8sbqcesd806t07f.png" alt="Natural Joins"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The downside of natural joins is that if the tables share multiple columns with the same name, SQL will join using all of them, which can produce unexpected results.&lt;/p&gt;
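&lt;p&gt;For example, if the products and sales tables both happened to contain a price column in addition to product_id (a hypothetical schema for illustration), a natural join would silently match on both columns and drop rows where the prices differ. Writing the join condition explicitly avoids the surprise:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- explicit join: matches only on product_id, never on price
select *
from products p
join sales s
on p.product_id = s.product_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;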

&lt;h2&gt;
  
  
  Window Functions
&lt;/h2&gt;

&lt;p&gt;These are functions that perform calculations across a set of rows related to the current row without collapsing them into a single result. Unlike aggregate functions such as &lt;em&gt;sum()&lt;/em&gt; or &lt;em&gt;avg()&lt;/em&gt; used with &lt;em&gt;group by&lt;/em&gt;, they return a result for each row.&lt;/p&gt;

&lt;p&gt;Window functions use the &lt;strong&gt;&lt;em&gt;over()&lt;/em&gt;&lt;/strong&gt; clause to define the window, i.e., the set of rows the calculation applies to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;emp_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_department_salary&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This calculates the average salary for each department while still returning a row for every employee.&lt;/p&gt;

&lt;p&gt;Among the common window functions are &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Row_Number()&lt;/strong&gt;&lt;br&gt;
This assigns a unique sequential number to each row, starting from 1, following the order specified in the &lt;strong&gt;&lt;em&gt;order by&lt;/em&gt;&lt;/strong&gt; clause.&lt;br&gt;
When &lt;strong&gt;&lt;em&gt;partition by&lt;/em&gt;&lt;/strong&gt; is used, the numbering restarts at 1 for each new partition.&lt;/p&gt;

&lt;p&gt;When using &lt;strong&gt;&lt;em&gt;row_number()&lt;/em&gt;&lt;/strong&gt; without resetting the row number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
      &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using &lt;strong&gt;&lt;em&gt;row_number()&lt;/em&gt;&lt;/strong&gt; while resetting the row number per partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
      &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Rank()&lt;/strong&gt;&lt;br&gt;
Assigns ranks to rows but leaves gaps when there are ties. For example, when ranking orders by price, if the first two orders have the same price they both rank 1, and the next order is ranked 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
      &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_rank&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Dense_Rank()&lt;/strong&gt;&lt;br&gt;
Unlike &lt;strong&gt;&lt;em&gt;Rank()&lt;/em&gt;&lt;/strong&gt;, this function leaves no gaps between ranks when there are ties: two tied rows both rank 1 and the next row ranks 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
      &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;dense_rank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_rank&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Lead()&lt;/strong&gt;&lt;br&gt;
Accesses values from the following rows in a table, which is useful for comparing the current period with the next one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
      &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;sale_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;lead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;next_day&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sale&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Lag()&lt;/strong&gt;&lt;br&gt;
Accesses values from the previous rows in a table, which is useful for comparing the current period with the previous one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;
      &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;sale_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;previous_day&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sale&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
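The comparison itself can be computed in the same query. As a sketch against the same sale table, subtracting the lag() value from the current row gives the day-over-day change:

```sql
select
      date,
      sale_amount,
      lag(sale_amount) over(order by date) as previous_amount,
      -- null for the first row, since it has no previous day
      sale_amount - lag(sale_amount) over(order by date) as change_vs_previous_day
from sale;
```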



&lt;p&gt;&lt;strong&gt;6. Ntile()&lt;/strong&gt;&lt;br&gt;
Divides a result set into a specified number of roughly equal groups, assigning each row a group number based on the order specified in the query. It is commonly used for buckets such as &lt;strong&gt;&lt;em&gt;quartiles, percentiles or rankings.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; 
     &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
     &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;ntile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;salary_group&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;SQL joins and window functions are powerful tools that enable analysts to work efficiently with complex datasets. Joins allow us to bring together data stored across multiple tables, while window functions provide deeper analytical capabilities without losing row-level detail. Mastering both techniques significantly improves the ability to analyse, transform, and derive insights from data.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgressql</category>
      <category>joins</category>
      <category>windowfunctions</category>
    </item>
    <item>
      <title>ANALYSTS TURNING MESSY DATA INTO ACTIONABLE INSIGHT</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Sun, 08 Feb 2026 07:29:56 +0000</pubDate>
      <link>https://dev.to/shabex/analysts-turning-messy-data-into-actionable-insight-3d99</link>
      <guid>https://dev.to/shabex/analysts-turning-messy-data-into-actionable-insight-3d99</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Have you ever come across a fancy, attractive and well-packaged dashboard in your company, in the media, on the internet or in a presentation? It might have been a sales dashboard, an attrition dashboard, a customer churn dashboard or a financial reporting dashboard. At first, you are curious to understand what the dashboard is trying to communicate to its users, and then you become more curious about how it was created and who created it.&lt;br&gt;
As a junior data analyst or an enthusiast in the world of data, you are naturally curious about what the life of an analyst and their daily routine look like. Well, at least 75% of a data analyst's work is cleaning messy data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Messy Data
&lt;/h2&gt;

&lt;p&gt;It is normal and perfectly okay for a data analyst to first deal with messy data before continuing with other analysis tasks. The data might be messy in that it is full of inconsistencies, duplicates or gaps, or spread across different sources that are not compatible. For example, you may be dealing with sales figures from different departments, regions using different naming conventions, dates stored as text, currencies that are not standardised, and a missing key field such as an ID. As an analyst, you need to address the messy data first to keep your analysis accurate and to avoid wrong, misleading insights.&lt;/p&gt;
&lt;h2&gt;
  
  
  What to do as an analyst
&lt;/h2&gt;

&lt;p&gt;As a data analyst, start by understanding the context of the data and the data itself: the sources, the users and the question you are expected to answer. Understanding these will guide you in cleaning your data for correct insights and accurate analysis. Note that not all inconsistencies should bother you; what matter most are those that affect your final analysis and its accuracy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Power Query
&lt;/h2&gt;

&lt;p&gt;Now you might be asking yourself how data cleaning is done and how cumbersome the process is. One tool you must have come across as an analyst is &lt;strong&gt;&lt;em&gt;Power Query&lt;/em&gt;&lt;/strong&gt;. Understanding Power Query should be mandatory for a data analyst, because that is where most of an analyst's time is spent. &lt;br&gt;
Power Query is so powerful because it is used for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1- Cleaning and Transforming Data&lt;/strong&gt;&lt;br&gt;
It is a very powerful tool for cleaning, transforming and combining data in a structured and repeatable way.&lt;br&gt;
Power Query is where tasks such as removing duplicates, fixing data types, handling missing values and standardising text fields are done. Here, you can also merge data from different sources.&lt;br&gt;
Tasks such as removing columns and adding new columns to a data set are also done here. And the good thing is that every step carried out within Power Query is recorded, both for reference and to easily track where a mistake was made during the cleaning and transformation stage.&lt;/p&gt;
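As a minimal sketch of what those recorded steps look like under the hood, Power Query writes M code such as the following (the file path and column names here are hypothetical):

```powerquery
// Each step below appears as an entry in the Applied Steps pane
let
    Source = Csv.Document(File.Contents("C:\data\sales.csv"), [Delimiter = ","]),
    PromotedHeaders = Table.PromoteHeaders(Source),
    RemovedDuplicates = Table.Distinct(PromotedHeaders, {"OrderID"}),
    FixedTypes = Table.TransformColumnTypes(RemovedDuplicates,
        {{"OrderDate", type date}, {"Amount", type number}}),
    StandardisedText = Table.TransformColumns(FixedTypes, {{"Region", Text.Proper}})
in
    StandardisedText
```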

&lt;p&gt;&lt;strong&gt;2- Data Modelling&lt;/strong&gt;&lt;br&gt;
 After the data has been cleaned, the data set proceeds to the next, very crucial stage of analysis: data modelling. Data modelling in &lt;strong&gt;&lt;em&gt;Power BI&lt;/em&gt;&lt;/strong&gt; is where relationships are defined, especially when working with more than one data table. The tables are organised in a way that clearly shows how the business operates. Here the analyst chooses between a star schema and a snowflake schema; for simplicity and efficiency, most analysts prefer the former. &lt;br&gt;
As an analyst, keep in mind that a well-defined data model makes analysis easier, faster and more reliable. So be keen on relationship direction, cardinality and granularity to ensure that the model answers the business question accurately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3- Using DAX&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;DAX (Data Analysis Expressions)&lt;/em&gt;&lt;/strong&gt; is a formula language used in Power BI and Power Pivot to create calculations on your data, kind of like Excel formulas, but on steroids. Analysts use it to turn structured data into meaningful metrics.&lt;/p&gt;

&lt;p&gt;DAX is very useful when calculating measures such as total sales, growth rates, rolling averages, period-over-period performance comparisons, etc.&lt;br&gt;
It also comes in handy when creating calculated tables and columns in your data model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DAX for Calculated Measures

Total Sales = SUM(Sales[Amount])

Total Sales LY =
CALCULATE(
    [Total Sales],
    SAMEPERIODLASTYEAR(DimDate[Date])
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DAX for Calculated Column

Profit Status =
IF(
    Sales[Profit] &amp;gt; 0,
    "Profit",
    "Loss"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DAX for Calculated Table

DimDate =
ADDCOLUMNS(
    CALENDAR (DATE(2018,1,1), DATE(2026,12,31)),
    "Year", YEAR([Date]),
    "Month", FORMAT([Date], "MMMM"),
    "Month No", MONTH([Date]),
    "YearMonth", FORMAT([Date], "YYYY-MM")
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DAX can be very powerful if the analyst understands the interplay between row context and filter context, which is key to accuracy and efficiency. DAX is essentially a translation of business needs into calculations that update dynamically and consistently without breaking across the entire report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4- Visualisation&lt;/strong&gt;&lt;br&gt;
After cleaning the data, modelling it and using DAX to make sense of it, you take the next step of the analysis journey: visualisation. Here you design your dashboard as a communication tool that answers specific questions, guided by the business need and the question you intend to answer.&lt;/p&gt;

&lt;p&gt;A well-built, effective dashboard highlights Key Performance Indicators (KPIs) and trends over time, and makes it easy to identify outliers. The analyst should understand which type of visualisation answers a specific question: for instance, line charts for trends, bar charts for comparisons, and tables for details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5- Generating Insights&lt;/strong&gt; &lt;br&gt;
Having gone through the entire process, the analyst goes beyond dashboards to explain what the data is saying, answer the questions, identify risks and opportunities, and make recommendations.&lt;/p&gt;

&lt;p&gt;Integrating dashboards into an analyst's daily workflows, team meetings, performance reviews and operational check-ins makes them more of an accountability tool rather than just a reporting tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through careful data cleaning and transformation, modelling, relevant and accurate DAX, and intentional dashboard design, the analyst transforms chaos into clarity, turning messy data into an insightful, well-organised dashboard. This helps inform the present and shape the future.&lt;br&gt;
When done purposefully, users see answers and direction rather than messy data, complex DAX formulas and technical models.&lt;/p&gt;

&lt;p&gt;That is how Power Query gives data analysts the power they enjoy. &lt;/p&gt;

</description>
      <category>dax</category>
      <category>powerbi</category>
      <category>dashboard</category>
    </item>
    <item>
      <title>DATA MODELLING IN POWER BI</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Sun, 01 Feb 2026 10:13:36 +0000</pubDate>
      <link>https://dev.to/shabex/data-modelling-in-power-bi-11h4</link>
      <guid>https://dev.to/shabex/data-modelling-in-power-bi-11h4</guid>
      <description>&lt;h2&gt;
  
  
  Data Modelling in Power BI
&lt;/h2&gt;

&lt;p&gt;For any successful solution when it comes to visualisation and analysis using &lt;strong&gt;&lt;em&gt;Power BI&lt;/em&gt;&lt;/strong&gt;, the user has to be well versed with &lt;em&gt;Data Modelling&lt;/em&gt; and understand how to navigate around it because that is the core foundation. The model determines a lot, especially when it comes to the outcome of the analysis in terms of performance, usability, DAX complexity and correctness of the results. &lt;/p&gt;

&lt;h2&gt;
  
  
  Facts and Dimensional Tables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fact Tables&lt;/strong&gt;&lt;br&gt;
These are the central tables of a model; they store measurable, quantitative data. They contain numeric columns, usually have the largest number of rows compared to the dimension tables, and hold foreign keys that connect to the dimension tables. Where no dimension tables exist and an analyst wants to do a detailed analysis, the fact table can be used to generate one or more dimension tables.&lt;/p&gt;

&lt;p&gt;Examples of fact tables are a sales table, a transactions table, and an inventory movements table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension Tables&lt;/strong&gt;&lt;br&gt;
These tables provide context and detailed descriptions for the fact tables; they can contain textual or categorical data. The primary keys of dimension tables are the ones that form foreign keys in the fact table. They have fewer rows compared to the fact tables, and most of their attributes are used for filtering, grouping and slicing.&lt;/p&gt;

&lt;p&gt;An example of a dimension table includes: a date table, a geographic table, and a products table.&lt;/p&gt;
&lt;h2&gt;
  
  
  Star Schema
&lt;/h2&gt;

&lt;p&gt;A star schema is a data modelling pattern where a fact table is connected to multiple dimension tables, and the dimension tables are &lt;em&gt;NOT&lt;/em&gt; connected to each other. A star schema is characterised by a single fact table, one-to-many relationships from the dimension tables to the fact table, and denormalised dimension tables.&lt;/p&gt;

&lt;p&gt;For example, a model might have FactSales, DimDate, DimCustomer, DimProduct, etc.&lt;/p&gt;

&lt;p&gt;The star schema is commonly used in Power BI because it makes the model easier to understand, keeps filter behaviour predictable, keeps DAX calculations simple and reliable, and reduces the risk of ambiguity.&lt;/p&gt;

&lt;p&gt;This is a recommended and preferred modelling approach.&lt;/p&gt;
&lt;h2&gt;
  
  
  Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;This is a modelling pattern that is a variation of the star schema, where dimension tables are normalised into multiple related tables.&lt;/p&gt;

&lt;p&gt;The snowflake schema is characterised by dimension tables that, apart from connecting to the fact table, also connect to other sub-dimension tables, making the structure hierarchical in nature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FactSales --&amp;gt; DimProducts
DimProducts --&amp;gt; DimProductsCategory
DimProductCategory --&amp;gt; DimProductGroup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snowflake schema is less preferred because its longer relationship chains make the model more complex, harder to understand and maintain, can degrade performance, and complicate DAX calculations as a result of indirect filter paths.&lt;/p&gt;

&lt;p&gt;Snowflake schema dimensions are usually flattened into a single denormalised dimension, i.e., a star schema, to improve performance and usability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relationship in Power BI
&lt;/h2&gt;

&lt;p&gt;A relationship connects tables so Power BI knows how data in one table relates to data in another. Without relationships, Power BI treats tables as unrelated. Power BI supports several relationship settings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Cardinality&lt;/strong&gt;&lt;br&gt;
It defines how rows match between tables.&lt;br&gt;
The options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One-to-Many (most common and recommended)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Many-to-One&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One-to-One&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Many-to-Many&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Active and Inactive Relationships&lt;/strong&gt;&lt;br&gt;
Only one active relationship can exist between two tables at a time, and an inactive relationship can be activated using the DAX function &lt;code&gt;USERELATIONSHIP&lt;/code&gt;.&lt;br&gt;
Excessive use of inactive relationships can increase the risk of errors and the complexity of the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Active → solid line
Inactive → dotted line
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
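For example, assuming an active relationship on Sales[OrderDate] and an inactive one on Sales[ShipDate] (hypothetical columns), a measure can switch to the inactive path like this:

```dax
Sales by Ship Date =
CALCULATE(
    [Total Sales],
    USERELATIONSHIP(Sales[ShipDate], DimDate[Date])
)
```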



&lt;p&gt;&lt;strong&gt;3. Cross Filter Direction&lt;/strong&gt;&lt;br&gt;
This one controls the flow of the filter&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Single (Recommended)&lt;br&gt;
Filters go from dimension → fact&lt;br&gt;
Example: Customers → Sales&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Both&lt;br&gt;
Filters flow both ways, and this can cause ambiguity&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Importance of a good Data Model
&lt;/h2&gt;

&lt;p&gt;A good data model goes a long way towards ensuring optimal performance and reliable reporting, especially where decision-making is involved. A good model will first of all reduce complexity: a star schema simplifies joins, keeping only the necessary tables leads to faster queries, and denormalisation improves compression. &lt;/p&gt;

&lt;p&gt;A good data model also goes a long way towards accurate reporting: clear relationships and dimensions ensure correct filter propagation and consistent slicing and grouping. This also helps reduce DAX complexity, yielding clean measures that are easy to read, maintain and understand.&lt;/p&gt;

&lt;p&gt;The downside of a poor model is slow visuals, timeouts and excessive resource utilisation, misleading calculations and inconsistent results across visuals.&lt;/p&gt;

&lt;p&gt;To achieve better and more accurate results, always stick to these principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer a star schema over a snowflake schema.&lt;/li&gt;
&lt;li&gt;Separate fact and dimension tables clearly.&lt;/li&gt;
&lt;li&gt;Use one-to-many relationships.&lt;/li&gt;
&lt;li&gt;Keep the filter direction single where possible.&lt;/li&gt;
&lt;li&gt;Flatten dimensions for reporting.&lt;/li&gt;
&lt;li&gt;Design the model before writing DAX.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it comes to Power BI, data modelling is a primary task since it forms the foundation of successful analytics. Being conversant with the schemas, properly defining fact and dimension tables, and properly establishing relationships go a long way towards the performance, accuracy and usability of the model. BI developers can build robust, scalable and reliable reports that deliver trustworthy insights if they stick to the principles mentioned above.&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>starschema</category>
      <category>snowflakeschema</category>
      <category>datamodelling</category>
    </item>
    <item>
      <title>LINUX FOR A ROOKIE DATA ENGINEERING STUDENT</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Sun, 25 Jan 2026 19:09:53 +0000</pubDate>
      <link>https://dev.to/shabex/linux-for-a-rookie-data-engineering-student-255h</link>
      <guid>https://dev.to/shabex/linux-for-a-rookie-data-engineering-student-255h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2xpttisw3l5fz2dqddd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2xpttisw3l5fz2dqddd.jpg" alt="Cover page" width="620" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As a student of &lt;strong&gt;&lt;em&gt;Data Engineering&lt;/em&gt;&lt;/strong&gt;, learning and understanding the fundamentals of &lt;strong&gt;&lt;em&gt;Linux&lt;/em&gt;&lt;/strong&gt; is a &lt;strong&gt;MUST&lt;/strong&gt;. As a matter of fact, for one to smoothly learn and grow in the field of &lt;strong&gt;Data Engineering&lt;/strong&gt; they have to be good at &lt;strong&gt;Linux&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Linux&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Linux&lt;/strong&gt; is an open-source operating system mostly used in servers, cloud platforms and data systems to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run applications and services.&lt;/li&gt;
&lt;li&gt;Process and manage large amounts of data&lt;/li&gt;
&lt;li&gt;Host websites and backend systems&lt;/li&gt;
&lt;li&gt;Automate tasks and workflows (using scripts and schedulers)&lt;/li&gt;
&lt;li&gt;Support cloud infrastructure (virtual machines, containers)&lt;/li&gt;
&lt;li&gt;Ensure stability, security, and high performance for systems that must run 24/7&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, &lt;em&gt;Linux&lt;/em&gt; provides a secure and reliable environment to efficiently and continuously run applications, data pipelines, and cloud services. &lt;/p&gt;
&lt;h2&gt;
  
  
  Why Linux For Data Engineers
&lt;/h2&gt;

&lt;p&gt;Most of the core daily operations of a &lt;strong&gt;Data Engineer (DE)&lt;/strong&gt; are carried out on &lt;em&gt;Linux&lt;/em&gt;, as most data systems run on it.&lt;br&gt;
These include operations such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Running Data Pipelines&lt;/strong&gt;&lt;br&gt;
ETL/ELT data pipelines are usually run on &lt;em&gt;Linux&lt;/em&gt; servers; this includes ingesting data from APIs, processing large files, transforming data using &lt;strong&gt;Python&lt;/strong&gt; or &lt;strong&gt;Spark&lt;/strong&gt;, and loading data into data warehouses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Automation and Scheduling&lt;/strong&gt;&lt;br&gt;
With &lt;em&gt;Linux&lt;/em&gt; tools such as &lt;code&gt;cron&lt;/code&gt;, you can schedule jobs and use &lt;code&gt;bash&lt;/code&gt; scripts to automate tasks, e.g. cleaning logs weekly, archiving data periodically and running scripts on the schedules that have been set.&lt;/p&gt;
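As an illustration, a crontab entry (opened with crontab -e) that runs a hypothetical log-cleaning script every Sunday at 02:00 looks like this:

```
# minute hour day-of-month month day-of-week   command
0 2 * * 0 /home/username/scripts/clean_logs.sh
```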

&lt;p&gt;&lt;strong&gt;3. Handling Big Data&lt;/strong&gt;&lt;br&gt;
To handle large data, you need frameworks that typically run on &lt;em&gt;Linux&lt;/em&gt;, such as &lt;strong&gt;Hadoop&lt;/strong&gt; for distributed storage and processing, &lt;strong&gt;Spark&lt;/strong&gt; for fast processing of large data, &lt;strong&gt;Kafka&lt;/strong&gt; for streaming data, and &lt;strong&gt;Airflow&lt;/strong&gt; for workflow orchestration, i.e. organizing, scheduling and managing multiple tasks so they run in the correct order, at the right time and with complete reliability from start to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Working with Cloud Infrastructure&lt;/strong&gt;&lt;br&gt;
Major cloud platforms such as &lt;strong&gt;Amazon Web Services (AWS)&lt;/strong&gt;, &lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt;, and &lt;strong&gt;Microsoft Azure&lt;/strong&gt; offer &lt;em&gt;Linux&lt;/em&gt;-based infrastructure such as &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Virtual Machines (&lt;strong&gt;VMs&lt;/strong&gt;) – Ubuntu, Red Hat, Debian&lt;/li&gt;
&lt;li&gt;Containers &amp;amp; Orchestration – Docker, Kubernetes&lt;/li&gt;
&lt;li&gt;Big Data Services – Hadoop, Spark, Kafka clusters&lt;/li&gt;
&lt;li&gt;Databases – MySQL, PostgreSQL, MongoDB, Cassandra&lt;/li&gt;
&lt;li&gt;Data Warehouses – BigQuery engines, Redshift nodes (Linux-based)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. File and Data Management&lt;/strong&gt;&lt;br&gt;
With &lt;em&gt;Linux&lt;/em&gt; you can effectively and efficiently handle large files and perform tasks such as moving massive datasets, compressing files, searching logs and streaming data. All of these tasks are done by executing commands such as &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;cd&lt;/code&gt;, &lt;code&gt;cp&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;mv&lt;/code&gt;, etc.&lt;/p&gt;
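A small sketch of these commands working together; the /tmp/demo_logs path and the log contents are made up for the demo:

```shell
# create a small demo log file
mkdir -p /tmp/demo_logs
printf 'INFO start\nERROR disk full\nINFO done\n' > /tmp/demo_logs/app.log

# search the log for error lines (prints: ERROR disk full)
grep 'ERROR' /tmp/demo_logs/app.log

# copy the log as a backup, then compress the backup
cp /tmp/demo_logs/app.log /tmp/demo_logs/app.log.bak
gzip -f /tmp/demo_logs/app.log.bak

# list the resulting files
ls /tmp/demo_logs
```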
&lt;h2&gt;
  
  
  Running Linux Terminal and Its Commands
&lt;/h2&gt;

&lt;p&gt;Running a Linux terminal means using text commands to control a Linux system, either locally or on a server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. On a Linux machine (Ubuntu, etc.)&lt;/strong&gt;&lt;br&gt;
Press &lt;code&gt;Ctrl&lt;/code&gt; + &lt;code&gt;Alt&lt;/code&gt; + &lt;code&gt;T&lt;/code&gt;&lt;br&gt;
Or search “Terminal” in applications&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. On Windows (Most common)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Option A:&lt;/strong&gt; Windows Subsystem for Linux (WSL) &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install WSL&lt;/li&gt;
&lt;li&gt;Open Ubuntu from the Start Menu&lt;/li&gt;
&lt;li&gt;This gives you a real Linux terminal inside Windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option B:&lt;/strong&gt; Git Bash&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install Git&lt;/li&gt;
&lt;li&gt;Open Git Bash&lt;/li&gt;
&lt;li&gt;Linux-like commands (not full Linux, but useful) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. On macOS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Terminal (Spotlight → Terminal)&lt;/li&gt;
&lt;li&gt;macOS is Unix-based, very similar to Linux&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. On a Remote Server (Cloud/Linux server)&lt;/strong&gt;&lt;br&gt;
Use SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh username@server_ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a Linux terminal on a remote machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Linux Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Accessing Server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To access a remote server, you will need the server username, the server IP address and the password for the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh username@server_ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19xhztwbwn47rts8slao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19xhztwbwn47rts8slao.png" alt="Image001" width="589" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update and upgrade the server if and when necessary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the version of the Ubuntu server you are using&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lsb_release -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0huzg3iif2fpa6vta322.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0huzg3iif2fpa6vta322.png" alt="Image 02" width="582" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To check your VM's disk usage and the storage remaining&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df -h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9j9gkbl9g3f5pgjd0za.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9j9gkbl9g3f5pgjd0za.png" alt="Image 03" width="598" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To list the files in your current directory on the server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5tdo50svmfqpg764wa3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5tdo50svmfqpg764wa3.png" alt="Image 04" width="593" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Red&lt;/strong&gt; - Zipped Files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blue&lt;/strong&gt; - Folders&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;White&lt;/strong&gt; - Files&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;To list all files on the server, including hidden ones&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ls -a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewbvjc6ufaljw61gjouf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewbvjc6ufaljw61gjouf.png" alt="Image 04A" width="599" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Print your current directory&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pwd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb3ysz1ny464c9an8c4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb3ysz1ny464c9an8c4s.png" alt="Image 05" width="588" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To add another user to your server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo adduser 'username'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x6nsbwh59zjkojl33g3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x6nsbwh59zjkojl33g3.png" alt="Image 06" width="588" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switching from the superuser (root) to a regular user on the server and changing directory&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;su 'username'
cd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmmpzvl8q0eli12t3m33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmmpzvl8q0eli12t3m33.png" alt="Image 07" width="582" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating Directories and Files and navigating between them&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir - Create a directory
touch - Create an empty file
cd 'mkdir' - To access or open your directory 
cd .. to move one step back from your current location
cd + space - To go back to the end of the path
cp - copy files
mv - To move/Rename files
rm - To delete a file
rm -r - To delete a folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
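&lt;p&gt;Putting these together, here is a short, runnable sequence (the directory and file names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir projects              # create a directory
cd projects                 # open it
touch notes.txt             # create an empty file
cp notes.txt backup.txt     # copy the file
mv backup.txt old_notes.txt # rename the copy
rm old_notes.txt            # delete the copy
cd ..                       # move one level back up
rm -r projects              # delete the folder and its contents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;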



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26y3bpns28dicq84o8fa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26y3bpns28dicq84o8fa.png" alt="Image 08" width="589" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxvi37m16q6ya0tiaot3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxvi37m16q6ya0tiaot3.png" alt="Image 09" width="602" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copying a file from the local machine to the server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp 'file_name' user_name@ip:path_to_the_serve_loaction_of_choice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrqjfwsknhj7ps5ijkab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrqjfwsknhj7ps5ijkab.png" alt="Image 10" width="584" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4suvrlw4ecy7kapws65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4suvrlw4ecy7kapws65.png" alt="Image 11" width="597" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copying a file from the server to the local machine&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scp username@remote_host:/remote/path/to/file /local/path/to/destination
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiwy9ch9hkfs8og8vvqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkiwy9ch9hkfs8og8vvqy.png" alt="Image 12" width="584" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91u2itfiu33mzu9z9pbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91u2itfiu33mzu9z9pbs.png" alt="Image 13" width="600" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copying a folder from the local machine to the server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scp -r /local/path/to/folder ibrahim@157.245.209.236:/home/ibrahim/

scp -r MyMusicFolder ibrahim@157.245.209.236:/home/ibrahim/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Copying a folder from the server to the local machine&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scp -r ibrahim@157.245.209.236:/home/ibrahim/MyMusicFolder /local/destination/

scp -r ibrahim@157.245.209.236:/home/ibrahim/MyMusicFolder ~/Downloads/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also rename the folder during transfer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scp -r ibrahim@157.245.209.236:/home/ibrahim/MyMusicFolder ~/Downloads/NewFolderName
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For large folders, consider adding &lt;code&gt;-C&lt;/code&gt; to compress data during transfer (faster over slow connections):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scp -r -C MyMusicFolder ibrahim@157.245.209.236:/home/ibrahim/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Copying files from the internet to your server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget 'link'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf088rtyiexnfat7iytu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf088rtyiexnfat7iytu.png" alt="Image 1" width="580" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing a line to a file on the server and reading it back&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo 'The line you wish to write' &amp;gt;&amp;gt; file_name

cat 'file name' - Read a file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
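&lt;p&gt;For example (the file name and text here are hypothetical), appending two lines and reading them back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo 'first line' &amp;gt;&amp;gt; notes.txt   # creates notes.txt if it does not exist
echo 'second line' &amp;gt;&amp;gt; notes.txt  # appends to the same file
cat notes.txt                    # prints both lines in order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;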



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd01x19d8oi8rydn8algr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd01x19d8oi8rydn8algr.png" alt="Image 2" width="585" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Editing Using Nano and Vim
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Nano&lt;/strong&gt; is a simple, beginner-friendly text editor you use directly in the Linux terminal. It comes in handy when editing files, writing scripts and viewing changes to files on the servers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano app.py - to open nano interface
Ctl + O - save
Ctrl + x -exit

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the file doesn't exist, Nano creates the file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41a9fianejcgbnbh1834.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41a9fianejcgbnbh1834.png" alt="Image 4" width="581" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vim&lt;/strong&gt; is a modal text editor used in the &lt;strong&gt;Linux&lt;/strong&gt; terminal and is widely used in servers, cloud machines, and containers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vim app.py - to open vim interface
i --&amp;gt; insert mode
Type your text
Esc --&amp;gt; back to Linux
:w --&amp;gt; Save
:q --&amp;gt; Quit
:wq --&amp;gt; save and quit
:q! --&amp;gt; quit without saving
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg96dk6zxu8ys027liee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffg96dk6zxu8ys027liee.png" alt="Image04" width="577" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>vim</category>
      <category>datascience</category>
    </item>
    <item>
      <title>HOW TO GIT IT</title>
      <dc:creator>Shaban Ibrahim</dc:creator>
      <pubDate>Sun, 18 Jan 2026 03:49:41 +0000</pubDate>
      <link>https://dev.to/shabex/how-to-git-it-38kp</link>
      <guid>https://dev.to/shabex/how-to-git-it-38kp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmywqh7j0ndmmy338a714.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmywqh7j0ndmmy338a714.jpg" alt="Git Logo" width="800" height="450"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h1&gt;
  
  
  UNDERSTANDING GIT AND GITHUB
&lt;/h1&gt;

&lt;p&gt;You remember the first day you tried your hand in tech, maybe in &lt;em&gt;software engineering, data science, web development or data engineering&lt;/em&gt;? Your trainer or technical mentor insisted that you have a &lt;strong&gt;GitHub account&lt;/strong&gt;, and they made it a mandatory requirement for you to have one as industry best practice. I am sure you were wondering what GitHub is and how it will be important in your tech journey.&lt;/p&gt;

&lt;p&gt;Many moons later, you came to realise that GitHub is actually like a treasure chest for any tech person, although before you got the grasp of it, you were confused and didn't understand how it worked. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt; is an online platform that hosts &lt;strong&gt;git repositories&lt;/strong&gt;, among many other platforms such as &lt;strong&gt;Gitlab&lt;/strong&gt; and &lt;strong&gt;Bitbucket&lt;/strong&gt;. It helps users to store, share and collaborate on code. Basically, the treasure chest of coders.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a GitHub Account
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/" rel="noopener noreferrer"&gt;To create &lt;em&gt;Github&lt;/em&gt; account,click here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Now what is &lt;strong&gt;&lt;em&gt;Git&lt;/em&gt;&lt;/strong&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Git&lt;/em&gt; is simply a time machine for your code. This means it is a tool that helps you track changes in your code over time, whereby you can easily save different versions of your work, go back to earlier versions if something breaks, and collaborate with others without overwriting each other’s work. This makes &lt;strong&gt;git&lt;/strong&gt; a &lt;strong&gt;version control system&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Git
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git-scm.com" rel="noopener noreferrer"&gt;To install &lt;em&gt;Git&lt;/em&gt;, you can click the link&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are some of the components of &lt;em&gt;Git&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;We need to understand some of the components of &lt;em&gt;Git&lt;/em&gt; and what they mean before understanding how to navigate inside or around it.&lt;br&gt;
Some of the main components of &lt;strong&gt;Git&lt;/strong&gt; are;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Repository (Repo)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Working Directory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Staging Area&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Commit&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  a) Repository
&lt;/h3&gt;

&lt;p&gt;This is a particular project in &lt;strong&gt;Git&lt;/strong&gt; that is being tracked. It &lt;em&gt;contains your files, change history and git configuration&lt;/em&gt;, all of which are contained in the &lt;code&gt;.git&lt;/code&gt; folder.&lt;br&gt;
For example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-project/
|-- app.py
|--README.md
|--.git/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;How to Create Your Repository on git&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a project folder&lt;br&gt;
&lt;code&gt;mkdir my-project&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigate into your project folder by&lt;br&gt;
&lt;code&gt;cd my-project&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Initialise &lt;em&gt;Git&lt;/em&gt; by &lt;br&gt;
&lt;code&gt;git init&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  b) Working Directory
&lt;/h3&gt;

&lt;p&gt;The working directory is the folder on your computer where you &lt;em&gt;write code, edit files, and delete or add files&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Git continuously checks this directory for changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You can track changes in Git with:&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;git status&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  c) Staging Area
&lt;/h3&gt;

&lt;p&gt;This is the area where you tell &lt;em&gt;Git&lt;/em&gt; which changes to include in the next save. Not all changes are saved automatically; you have to choose what to save.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;To stage a specific file:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;
 &lt;code&gt;git add app.py&lt;br&gt;
git add README.md&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;To stage everything:&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;git add .&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  d) Commit
&lt;/h2&gt;

&lt;p&gt;This is a snapshot of your code at a specific point in time, and commits are characterised by &lt;em&gt;unique id, message describing the change and a time stamp&lt;/em&gt;. Call them &lt;strong&gt;Checkpoints&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;To create a commit:&lt;/em&gt;&lt;/strong&gt;  &lt;code&gt;git commit -m "A message describing the change"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;To see past commits:&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;git log&lt;/code&gt;&lt;/p&gt;
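&lt;p&gt;Putting the pieces together, the whole local workflow can be sketched end to end (the project and file names are hypothetical, and this assumes &lt;code&gt;git&lt;/code&gt; is installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir my-project            # create a project folder
cd my-project
git init                    # start tracking the folder
touch app.py README.md      # create project files
git add .                   # stage everything
git commit -m "Initial commit"
git log --oneline           # view the commit history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;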

&lt;p&gt;Below are image demonstrations of how to operate Git&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8pujllnhrg9voiqrewe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8pujllnhrg9voiqrewe.png" alt="Image 1" width="662" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F136fd75t4ygidpo9re8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F136fd75t4ygidpo9re8m.png" alt="Image 2" width="616" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Git to GitHub Remotely
&lt;/h2&gt;

&lt;p&gt;After creating your repository on &lt;strong&gt;&lt;em&gt;Git&lt;/em&gt;&lt;/strong&gt; you will need to connect it remotely to your GitHub account.&lt;/p&gt;

&lt;p&gt;In your &lt;strong&gt;&lt;em&gt;GitHub account&lt;/em&gt;&lt;/strong&gt;, you will need to create and name a new repository.&lt;br&gt;
GitHub will then show you an SSH remote URL in the form:&lt;br&gt;
&lt;code&gt;git@github.com:Shabex/Data-Eng.git&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Copy the SSH URL and use the following syntax to link your local repository to GitHub:&lt;br&gt;
&lt;code&gt;git remote add origin git@github.com:Shabex/Data-Eng.git&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing Your Code
&lt;/h2&gt;

&lt;p&gt;This is how you send your local commits to the remote repository on &lt;strong&gt;GitHub&lt;/strong&gt;:&lt;br&gt;
&lt;code&gt;git push -u origin main&lt;/code&gt;&lt;br&gt;
After pushing, your commits become visible on &lt;strong&gt;GitHub&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pulling Your Code
&lt;/h2&gt;

&lt;p&gt;This is how you bring changes from the remote repository down to your local computer: &lt;br&gt;
&lt;code&gt;git pull origin main&lt;/code&gt;&lt;br&gt;
This comes in handy when you are working on a shared project, switching computers or updating your local copy of the files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4rj2a0cjnz0gf5dwaq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4rj2a0cjnz0gf5dwaq3.png" alt="Image 3" width="627" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uqg40bker507emx1027.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uqg40bker507emx1027.png" alt="Image 4" width="627" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>luxdev</category>
      <category>git</category>
      <category>onboarding</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
