<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chiiraq</title>
    <description>The latest articles on DEV Community by Chiiraq (@chiiraq).</description>
    <link>https://dev.to/chiiraq</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3708644%2F64c63394-fd92-4ea7-bdce-9b012e00916f.png</url>
      <title>DEV Community: Chiiraq</title>
      <link>https://dev.to/chiiraq</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chiiraq"/>
    <language>en</language>
    <item>
      <title>From Chaos to Dashboard: How Power BI Analysts Turn Data Disasters into Decisions That Actually Stick</title>
      <dc:creator>Chiiraq</dc:creator>
      <pubDate>Mon, 06 Apr 2026 18:17:20 +0000</pubDate>
      <link>https://dev.to/chiiraq/from-chaos-to-dashboard-how-power-bi-analysts-turn-data-disasters-into-decisions-that-actually-2cg7</link>
      <guid>https://dev.to/chiiraq/from-chaos-to-dashboard-how-power-bi-analysts-turn-data-disasters-into-decisions-that-actually-2cg7</guid>
      <description>&lt;h2&gt;
  
  
  INTRODUCTION
&lt;/h2&gt;

&lt;p&gt;As data practitioners, be it in engineering, science or whatever your cup of tea is, we often receive files that look like a dataset but act like a hostage situation. And here's the thing about being a data professional: there is no negotiating team coming; the hostage negotiator is you. The hostage is your KPIs. And the ransom? Your ability to turn this crime scene of a spreadsheet into something stakeholders can nod at during a Tuesday morning meeting. Power BI has proven to be a resourceful tool for translating this messy data into actionable insights through a couple of tools it makes available. The process can simply be broken down into three key stages that we will get into in a moment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Cleaning.&lt;/li&gt;
&lt;li&gt;Data Enrichment.&lt;/li&gt;
&lt;li&gt;Data Visualization.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Data Cleaning.
&lt;/h3&gt;

&lt;p&gt;This being the very first stage and a vital part of the process, the first step in this phase is to check the validity of the data you have been given. If you cannot trust the input data, then the outputs and visualizations will all be falsehoods decorated in fancy charting. The ideal tool for this task is &lt;strong&gt;Power Query&lt;/strong&gt;. So what is Power Query? It is commonly defined as a powerful ETL (Extract, Transform, Load) engine that cleans, shapes, and connects to data from hundreds of sources before loading it into the data model. Here is an image to help visualize it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohq367zih5cc8sxru87q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohq367zih5cc8sxru87q.png" alt="Power Query Image visualisation" width="800" height="187"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Some of the tasks that may occur here include, but are not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Removing duplicates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3xw6dwr3l1rlhvu6ujv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3xw6dwr3l1rlhvu6ujv.png" alt="removing duplicates from the dataset in a column." width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. Changing data types, e.g. converting money fields to their respective currencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj8ae0aipvy5o04stxxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj8ae0aipvy5o04stxxd.png" alt="Changing data types from the dataset in a column." width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. Replacing values, e.g. replacing missing fields with null values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh75o0cmmrb30flc5zyul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh75o0cmmrb30flc5zyul.png" alt="Replacing Values from the dataset in a column." width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4. Splitting columns, e.g. a combined name can be split into first name, middle name and surname.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq12983m9arw5i7yz1zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmq12983m9arw5i7yz1zf.png" alt="splitting Values from the dataset in a column." width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data Enrichment
&lt;/h3&gt;

&lt;p&gt;In this phase the cleaned data is transformed into the actual insights needed from the data through enrichment. That begs the question: what is data enrichment? Data enrichment is the process of enhancing your cleaned dataset by adding context, relationships and calculated meaning that the raw data alone couldn't provide. Cleaning makes the data trustworthy; enrichment makes it useful.&lt;br&gt;
The tool for this phase is Data Analysis Expressions (DAX), a library of functions and operators that can be combined to build formulas and expressions in Power BI. Some of the enrichment techniques include the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Building Relationships:
Connecting multiple tables together in Power BI's Model View so data can flow and interact correctly across your report.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhzyxd11675tuk2qegel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhzyxd11675tuk2qegel.png" alt="Defining relationships" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcp4zoyl2w39vuf4f39x2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcp4zoyl2w39vuf4f39x2.png" alt="Mapping the relationships created" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Creating a Date Table:&lt;br&gt;
Building a dedicated calendar table that unlocks time-based analysis like month-over-month comparisons, year-to-date totals, and rolling averages.&lt;br&gt;
This is done by adding a new table and then defining the custom dates for your table, e.g. &lt;code&gt;DateTable = CALENDAR(DATE(2023, 1, 1), DATE(2025, 12, 31))&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculated Columns (DAX):&lt;br&gt;
Adding new columns derived from existing data — categorizing, flagging, or combining fields to give your dataset more descriptive depth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Measures (DAX):&lt;br&gt;
Dynamic calculations that respond to filters and context — KPIs, aggregations, variances, and time intelligence functions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conditional Columns &amp;amp; Grouping (Power Query):&lt;br&gt;
Rule-based categorization and summarization applied before the data even enters the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Merging &amp;amp; Appending Queries:&lt;br&gt;
Joining or stacking tables in Power Query to consolidate data from multiple sources into a unified, analysis-ready structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3. Data Visualisation
&lt;/h3&gt;

&lt;p&gt;This is the phase where the data is organised into consumable metrics for the audience. This is achieved through the use of common graphics such as charts, plots, infographics and even animations, often organised into a dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkji6ahfg7846mi41z2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkji6ahfg7846mi41z2i.png" alt="Dashboard" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CONCLUSION
&lt;/h2&gt;

&lt;p&gt;What Power BI gives you is leverage. It turns a three-week manual nightmare into a repeatable, auditable, scalable process. It means the next time that file lands in your inbox, you're not panicking — you're executing. Until next time, keep your data clean and your terminal keen. Peace ma'dudes.&lt;/p&gt;

</description>
      <category>data</category>
      <category>powerfuldevs</category>
      <category>dataanalytics</category>
      <category>datascience</category>
    </item>
    <item>
      <title>stacy okoth</title>
      <dc:creator>Chiiraq</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:20:27 +0000</pubDate>
      <link>https://dev.to/chiiraq/stacy-okoth-327b</link>
      <guid>https://dev.to/chiiraq/stacy-okoth-327b</guid>
      <description></description>
    </item>
    <item>
      <title>Schema Design Patterns: Because Even Data Needs Good Architecture</title>
      <dc:creator>Chiiraq</dc:creator>
      <pubDate>Mon, 02 Feb 2026 04:00:03 +0000</pubDate>
      <link>https://dev.to/chiiraq/schema-design-patterns-because-even-data-needs-good-architecture-139n</link>
      <guid>https://dev.to/chiiraq/schema-design-patterns-because-even-data-needs-good-architecture-139n</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;INTRODUCTION&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the ever-growing world of data, data exists in absolute chaos: records scattered across multiple systems, each telling a slightly different version of the same story. Untouched, the data cannot organize itself into neat and logical structures; instead, it duplicates and contradicts itself, and yet, from this chaos, businesses need answers. Regardless of how terrifying it is, this is the natural state of data, i.e. raw, unstructured and exponentially growing. This is where the data experts come in to play, to bring order to this chaos. And their most powerful tool, you ask? The schema!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What exactly is data modelling??&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;According to Joe Reis and Matt Housley (no, they're not made-up dudes; they're the authors of Fundamentals of Data Engineering):&lt;br&gt;
"This is the process of creating a visual representation of either a whole information system or part of it to communicate connections between data points and structures."&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a Schema??&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A schema is the organization or structure of a database, defining how data is organized and how the relations among the data are associated. A well-designed schema is the foundation of query performance and data integrity in analytical systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is the importance of good data modelling??&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Good data modelling ensures:&lt;br&gt;
a) &lt;strong&gt;&lt;em&gt;Improved database performance&lt;/em&gt;&lt;/strong&gt;: Statistical research has shown that well-designed data models can improve report performance by up to 90%. Well, I wouldn't know about you, but to me that's an astronomical figure. Good models also make it easier to find new opportunities for optimization and are equally easier to diagnose.&lt;/p&gt;

&lt;p&gt;b) &lt;strong&gt;&lt;em&gt;Improved application quality&lt;/em&gt;&lt;/strong&gt;: Data modelling gives your organisation a clear vision for how data can fill your business needs.&lt;/p&gt;

&lt;p&gt;c) &lt;strong&gt;&lt;em&gt;Improved data quality&lt;/em&gt;:&lt;/strong&gt; The data modelling process establishes rules for monitoring data quality and identifies any redundancies or omissions, eliminating the hassle of cleaning large data sets.&lt;/p&gt;

&lt;p&gt;d) &lt;strong&gt;&lt;em&gt;Better documentation&lt;/em&gt;:&lt;/strong&gt; It enables consistent documentation, which simplifies database maintenance while simultaneously preserving operational efficiency.&lt;/p&gt;

&lt;p&gt;e) &lt;strong&gt;Saves time and money:&lt;/strong&gt; It empowers businesses to achieve quicker times to market by catching errors early.&lt;/p&gt;

&lt;p&gt;Note: these are just some of the perks that great data modelling provides; the scope goes on and on. I could continue listing them, but let's get to the juicy part of the steak, no?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;SCHEMAS&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In schemas, database tables will have a primary key or a foreign key, which acts as a unique identifier for individual entries in a table. These keys are used in SQL statements to join tables together, creating a unified view of information. Schema diagrams are particularly helpful in showing relationships between tables, and they enable analysts to understand which keys they should join on.&lt;/p&gt;

&lt;p&gt;While several schemas exist, we will primarily focus on the star schema and the snowflake schema. Why, you ask? The two represent the optimal design patterns for the vast majority of analytical workloads in relational database management systems and Power BI.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;star schema&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A star schema is a type of relational database schema composed of a single, central fact table surrounded by dimension tables. It can have any number of dimension tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuj6iwu4qfakvn9okj5n.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuj6iwu4qfakvn9okj5n.gif" alt="Star schema featuring a many to one relationship" width="500" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;snowflake schema&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The snowflake schema consists of one fact table connected to many dimension tables, which can in turn be connected to other dimension tables through many-to-one relationships.&lt;br&gt;
Tables in a snowflake schema are usually normalized to third normal form, and each dimension table represents exactly one level in a hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxi3hzie7un3afiir6c2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxi3hzie7un3afiir6c2.gif" alt="Snowflake schema" width="534" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;starflake schema&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A starflake schema is a combination of a star schema and a snowflake schema: a snowflake schema in which only some of the dimension tables have been normalized. Those dimensions are normalized to remove any redundancies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt7z6brmef60lbaioq1o.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt7z6brmef60lbaioq1o.gif" alt="Starflake schema" width="659" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TO NOTE:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Fact table:&lt;/em&gt; the central, primary table in a star schema that stores quantitative, numerical data.&lt;br&gt;
&lt;em&gt;Dimension table:&lt;/em&gt; a table that stores the descriptive, textual or contextual data about business entities.&lt;br&gt;
&lt;em&gt;Relationship:&lt;/em&gt; a logical link between two or more tables that share common data, primarily established using primary and foreign keys. The main types are one-to-one, one-to-many and many-to-many.&lt;/p&gt;

&lt;p&gt;Throughout this article we have covered the journey from absolute chaos to refined structures and schemas through good modelling, but real mastery comes from practice and from making design decisions. Every schema you design teaches you something. Every relationship you define deepens your understanding of how data flows through business processes. Until next time, keep your data clean and your terminal keen. Peace ma'dudes.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;citations:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;a) Reis, J., &amp;amp; Housley, M. (2022). Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O'Reilly Media, p. 156.&lt;br&gt;
b) Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, p. 39.&lt;/p&gt;

</description>
      <category>database</category>
      <category>schema</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Introduction to Linux for Data Engineers, Including Practical Use of Vi and Nano with Examples</title>
      <dc:creator>Chiiraq</dc:creator>
      <pubDate>Sun, 01 Feb 2026 19:16:51 +0000</pubDate>
      <link>https://dev.to/chiiraq/introduction-to-linux-for-data-engineers-including-practical-use-of-vi-and-nano-with-examples-5a75</link>
      <guid>https://dev.to/chiiraq/introduction-to-linux-for-data-engineers-including-practical-use-of-vi-and-nano-with-examples-5a75</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When most people hear the term Linux, the imagery that comes to mind quite often is a room full of tech-savvy geeks hunched over their keyboards, typing away in nerd-anese. But what if I told you that Linux is way more than just a playground for programmers? In the vast data cosmos, Linux is largely the backbone that supports the data-driven decisions powering hundreds of thousands of businesses today. Let's dive into breaking down this operating system into bite-sized chunks for any aspiring or beginning data engineers, shall we?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the importance of Linux for data engineers?
&lt;/h2&gt;

&lt;p&gt;Linux can be considered the soft white underbelly of data engineering, offering incredible performance and flexibility while at the same time having a steep learning curve: it is command-line dependent, forcing you to get hands-on with commands and functions, as opposed to the usual operating systems that simplify things for the user by relying on icons and point-and-click navigation. So what exactly are the perks of using Linux as a data engineer?&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;Performance&lt;/strong&gt;: To be suitable for data engineers, it is essential that the system can handle large volumes of data in record time.&lt;br&gt;
2. &lt;strong&gt;Compatibility&lt;/strong&gt;: A large number of data engineering tools and frameworks, such as Apache Hadoop, Spark, Flink and loads of other varieties, run natively on Linux, making it borderline unbeatable.&lt;br&gt;
3. &lt;strong&gt;Scalability&lt;/strong&gt;: Data is an ever-growing entity and thus demands that its environment be just as flexible and capable of adapting to increased workloads.&lt;br&gt;
4. &lt;strong&gt;Open Source&lt;/strong&gt;: Linux allows engineers to use and customize the system to their needs.&lt;br&gt;
5. &lt;strong&gt;Community Support&lt;/strong&gt;: When a technology has many users, you will find abundant learning materials, discussions and help forums readily available.&lt;/p&gt;

&lt;p&gt;Basic Linux commands form the foundation of a data engineer's career, and understanding them is essential for working with data systems. Some of these commands include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pwd - print working directory: shows the current working directory.&lt;/li&gt;
&lt;li&gt;ls - list: shows files and directories in the current directory.&lt;/li&gt;
&lt;li&gt;cd - change directory: used to navigate between directories.&lt;/li&gt;
&lt;li&gt;mkdir - make directory: used to create a new directory.&lt;/li&gt;
&lt;li&gt;rm - remove: used to delete files or directories.&lt;/li&gt;
&lt;li&gt;touch: creates an empty file in the current directory.&lt;/li&gt;
&lt;li&gt;cat: displays the contents of a file.&lt;/li&gt;
&lt;/ul&gt;
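&lt;p&gt;A quick hands-on sketch chaining the commands above together (run it in a scratch directory; demo_project and notes.txt are made-up names):&lt;/p&gt;

```shell
mkdir demo_project      # make directory: create a new directory
cd demo_project         # change directory: move into it
pwd                     # print working directory: the path now ends in /demo_project
touch notes.txt         # create an empty file
ls                      # list: shows notes.txt
cat notes.txt           # display the file's contents (empty for now, so nothing prints)
cd ..                   # step back up to the parent directory
```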

&lt;p&gt;The list above includes several commands that will recur in your day-to-day operations. Proficiency builds over time, and consistent exposure to these systems will enhance your ability to work efficiently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Text Editors
&lt;/h2&gt;

&lt;p&gt;Linux makes further use of text editors, and for today's article we shall briefly dive into two: &lt;strong&gt;&lt;em&gt;nano&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;vi&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Nano
&lt;/h4&gt;

&lt;p&gt;This is a straightforward, user-friendly command-line text editor designed for ease of use. It displays its commands at the bottom of the screen, making it more accessible for beginners, and it is commonly used for quick file edits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpycbd56mswu7fsyqj8tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpycbd56mswu7fsyqj8tk.png" alt="visual for the commands at the bottom at the screen:" width="800" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following commands are various use cases for the nano text editor:&lt;/p&gt;
&lt;h5&gt;
  
  
  Creating a file using nano:
&lt;/h5&gt;

&lt;p&gt;To create a file using nano, the command "nano" followed by the file name and its extension is used, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano testfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NB: .txt is the file extension and will vary according to the type of file being created/used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqt5zex0et8oely1znwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqt5zex0et8oely1znwt.png" alt="The image shows the created file under the given directory." width="712" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NOTE: Most of the commands in nano are executed using key combinations, typically involving the Control (CTRL) or Alternate (ALT) keys, rather than single-key commands. Some of the possible key combinations include, but are not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + Y&lt;/strong&gt; - to move up one page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + V&lt;/strong&gt; - to move down one page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + O&lt;/strong&gt; - to save a file, i.e. write out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + X&lt;/strong&gt; - to exit nano.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + R&lt;/strong&gt; - to insert/read another file into the current one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + K&lt;/strong&gt; - to cut marked text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTRL + U&lt;/strong&gt; - to paste previously cut/copied text.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Vi
&lt;/h4&gt;

&lt;p&gt;This is a powerful, modal text editor that comes pre-installed on virtually all Linux systems. It operates in different modes and is known for its efficiency once mastered. It uses keyboard commands rather than menus for all operations.&lt;/p&gt;

&lt;h5&gt;
  
  
  Creating a file using vi:
&lt;/h5&gt;

&lt;p&gt;The syntax for file creation carries over from nano, i.e.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vi testfile.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TO NOTE:&lt;/strong&gt;&lt;br&gt;
A) Vi is case-sensitive, so a lowercase letter and an uppercase letter may have different meanings within the editor.&lt;/p&gt;

&lt;p&gt;B) To open an existing file in vi, the same command format is used. It is important to ensure that the filename matches exactly the one used during creation; otherwise, vi will create a new file. Additionally, filenames should not contain spaces, as this may result in unintended file creation. Some of the commands for vi include, but are not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;i&lt;/strong&gt; - Insert before cursor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I&lt;/strong&gt; - Insert at beginning of line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a&lt;/strong&gt; - Append after cursor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt;- Append at end of line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;o&lt;/strong&gt; - Open new line below&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;O&lt;/strong&gt; - Open new line above&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;s&lt;/strong&gt; - Substitute character (delete char and enter insert mode)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S&lt;/strong&gt; - Substitute entire line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ESC&lt;/strong&gt; - to return to command mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;:wq&lt;/strong&gt; - to write (save) the file and quit vi, returning you to the shell.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through this article we have explored why Linux is an indispensable tool for data engineers and how the basic commands form the foundation of professional data work. While the command line may seem daunting at first, I am reminded that every expert was once a beginner, and the journey of a thousand miles starts with the first step! Until next time, keep your data clean and your terminal keen. Peace.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>linux</category>
      <category>vim</category>
      <category>nano</category>
    </item>
    <item>
      <title>GIT STARTED: A Noob's Guide To Version Control</title>
      <dc:creator>Chiiraq</dc:creator>
      <pubDate>Sun, 18 Jan 2026 10:04:46 +0000</pubDate>
      <link>https://dev.to/chiiraq/git-started-a-noobs-guide-to-version-control-2hf3</link>
      <guid>https://dev.to/chiiraq/git-started-a-noobs-guide-to-version-control-2hf3</guid>
      <description>&lt;h3&gt;
  
  
  INTRODUCTION
&lt;/h3&gt;

&lt;p&gt;We might all have been there, and if not, expect to be there. Where, you ask? While working on a project, we have all made changes to "perfectly working code". As the saying goes, 'if it works, don't touch it', but that is rarely the case. Code is always a balance of imperfect perfection: changes are made quite often, and it is never a certainty that the code will run as expected. This is where &lt;strong&gt;version control&lt;/strong&gt; comes in. The first question you'll have is: what is version control?&lt;/p&gt;

&lt;h4&gt;
  
  
  WHAT IS VERSION CONTROL AND WHAT IS ITS RELEVANCE?
&lt;/h4&gt;

&lt;p&gt;Version control can be defined as a system that tracks changes to files over time, allowing you to recall specific versions of a program much later on.&lt;/p&gt;

&lt;p&gt;It can be utilised for several different use cases including but not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracking accountability - shows who made what changes and when, making it easy to identify who introduced bugs or brilliant features.&lt;/li&gt;
&lt;li&gt;Enabling collaboration - allows multiple people to work on the same project simultaneously without overwriting each other's changes.&lt;/li&gt;
&lt;li&gt;Backup - acts as a complete audit trail showing how your project evolves over time, among many other use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are endless examples of version control tools and hosting platforms, such as Git, GitLab and Bitbucket, but in this article I will take a keen interest in Git and GitHub. More specifically, we will cover how to push and pull code.&lt;/p&gt;

&lt;h4&gt;
  
  
  HOW TO PUSH CODE TO GITHUB
&lt;/h4&gt;

&lt;p&gt;You are already in flow state and have made progress on the program you are creating. A repository has already been created, and all that remains is pushing your code to the platform so you can track your progress. How do you do that, exactly?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Git Bash and, in the terminal, use the following command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;git push - sends your commits to the remote repository.&lt;/li&gt;
&lt;li&gt;origin - the nickname for your GitHub repository.&lt;/li&gt;
&lt;li&gt;main - the branch you're pushing the code to.&lt;/li&gt;
&lt;/ul&gt;
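&lt;p&gt;If you want to rehearse the whole sequence offline, here is a hedged sketch in which a local bare repository stands in for GitHub. The repository and file names are made up, and the -b flag needs Git 2.28 or newer:&lt;/p&gt;

```shell
git init --bare -b main remote_repo.git      # local stand-in for the GitHub repo
git clone "$PWD/remote_repo.git" demo_repo   # your working copy; "origin" is set automatically
git -C demo_repo checkout -b main            # work on the main branch
touch demo_repo/notes.txt                    # create something worth pushing
git -C demo_repo add notes.txt               # stage the change
git -C demo_repo -c user.name=demo -c user.email=demo@example.com commit -m "add notes"
git -C demo_repo push origin main            # send the commit to origin's main branch
```

&lt;p&gt;Swap the local path for your GitHub URL, and the last three commands are exactly the add, commit and push you will run day to day.&lt;/p&gt;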

&lt;p&gt;Once you refresh your GitHub repository, the file containing your project will appear in the repository.&lt;/p&gt;

&lt;h4&gt;
  
  
  HOW TO PULL CODE FROM GITHUB
&lt;/h4&gt;

&lt;p&gt;Changes may have been made to the project and you might need to retrieve the code and review it. Similarly, you might be retrieving your own work to continue a project. To do this, use the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git pull origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the command is given, the following happens:&lt;br&gt;
Git fetches the changes from GitHub, and those changes are merged into your local code.&lt;/p&gt;
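&lt;p&gt;Here is a hedged, fully local sketch of that fetch-and-merge flow, with a bare repository standing in for GitHub and two clones playing you and a teammate (all names are made up; the -b flag needs Git 2.28+):&lt;/p&gt;

```shell
git init --bare -b main pull_remote.git        # local stand-in for the GitHub repo
git clone "$PWD/pull_remote.git" copy_a        # your copy
git -C copy_a checkout -b main
touch copy_a/first.txt
git -C copy_a add first.txt
git -C copy_a -c user.name=demo -c user.email=demo@example.com commit -m "first"
git -C copy_a push origin main
git clone "$PWD/pull_remote.git" copy_b        # a teammate's copy, already holding first.txt
touch copy_a/second.txt                        # more work lands in your copy...
git -C copy_a add second.txt
git -C copy_a -c user.name=demo -c user.email=demo@example.com commit -m "second"
git -C copy_a push origin main                 # ...and gets pushed to the remote
git -C copy_b pull origin main                 # the teammate fetches and merges the new commit
ls copy_b                                      # now lists first.txt and second.txt
```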

&lt;p&gt;If the above article didn't help much, here's a video you can use to better map your way around the terminal commands: &lt;a href="https://www.youtube.com/watch?v=yxvqLBHZfXk" rel="noopener noreferrer"&gt;hope this helps&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
