DEV Community: MJ-O

Foundational Concepts in Data Engineering Using an E-Commerce Platform Example

MJ-O — Mon, 01 Jun 2026 10:47:26 +0000

INTRODUCTION

Data engineering is the process of collecting, moving, transforming, storing, and managing data so that it can be used for reporting, analytics, machine learning, and decision-making. Almost every modern digital platform depends heavily on data, and behind these systems are data engineers who build pipelines and architectures that ensure data flows correctly and reliably.

To better understand these concepts, we will use a real-world example throughout this article: an e-commerce platform similar to Amazon, Jumia, Alibaba, or Shopify.

In an e-commerce platform:

Customers browse products
Orders are placed
Payments are processed
Products are shipped
Notifications are sent
Reports and dashboards are generated

Every click, payment, search, review, and order generates data. As the platform grows, the amount of data becomes extremely large and complex. Data engineering helps ensure this data can be processed efficiently and used effectively.

In this article, we will explore important foundational concepts in data engineering using this e-commerce platform example.

1. BATCH VS STREAMING INGESTION

Data ingestion is the process of collecting data from different sources and moving it into a storage or processing system.

There are two main ways data is ingested:

Batch ingestion
Streaming ingestion

Batch Ingestion

Batch ingestion processes data in groups after a certain period of time.

In an e-commerce platform, not all tasks require immediate processing. Some reports and operations are performed periodically.

Examples include:

Daily sales reports
Weekly inventory reports
Monthly revenue summaries
End-of-day analytics

Suppose the company wants to calculate the total sales made during the day. Instead of calculating every transaction individually in real time, the system may collect all transactions throughout the day and process them together at midnight.

This is batch ingestion because the data is processed in batches after a certain interval.
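
The end-of-day calculation described above can be sketched in a few lines of Python. This is only an illustration: the `transactions` list and the `daily_sales_totals` function are hypothetical names, and a real pipeline would read from a database or files rather than an in-memory list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transactions collected during the day: (timestamp, amount).
transactions = [
    ("2026-06-01T09:15:00", 250.0),
    ("2026-06-01T13:40:00", 100.0),
    ("2026-06-02T08:05:00", 75.5),
]

def daily_sales_totals(rows):
    """Sum transaction amounts per calendar day in one pass over the batch."""
    totals = defaultdict(float)
    for timestamp, amount in rows:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[day] += amount
    return dict(totals)

print(daily_sales_totals(transactions))
```

Nothing is computed until the whole batch is available, which is why results arrive late but the job itself stays simple.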

Advantages of Batch Ingestion

Easier to manage
Efficient for historical data processing
Lower infrastructure costs
Suitable for scheduled analytics

Limitations of Batch Ingestion

Delayed updates
Not suitable for real-time monitoring
Errors may only be detected later

Streaming Ingestion

Streaming ingestion processes data continuously as it is generated.

In an e-commerce system, many activities need immediate processing.

For example:

Customers should instantly receive order confirmations
Inventory should update immediately after purchases
Fraudulent transactions should be detected quickly
Delivery tracking should update in real time

Suppose a customer buys the last item in stock. The inventory system must update immediately so other customers do not purchase unavailable products.

This requires streaming ingestion.

Streaming systems continuously process incoming events as they happen.
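
As a rough sketch of the "last item in stock" scenario, the Python generator below handles purchase events one at a time and updates stock after each one. It is not how Kafka or Flink work internally; the event fields and the `process_purchases` name are assumptions made for illustration.

```python
def process_purchases(events, stock):
    """Handle purchase events one at a time, updating stock after each event."""
    for event in events:
        product, quantity = event["product"], event["quantity"]
        if stock.get(product, 0) >= quantity:
            stock[product] -= quantity
            yield (event["order_id"], "confirmed")
        else:
            yield (event["order_id"], "rejected: out of stock")

# Hypothetical stream: two customers try to buy the last pair of headphones.
stock = {"headphones": 1}
events = [
    {"order_id": 1, "product": "headphones", "quantity": 1},
    {"order_id": 2, "product": "headphones", "quantity": 1},
]
results = list(process_purchases(events, stock))
print(results)
```

Because the stock is updated as soon as the first event is handled, the second order is rejected instead of selling an unavailable product.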

Common technologies used include:

Apache Kafka
Spark Streaming
Apache Flink

Advantages of Streaming Ingestion

Real-time updates
Faster decision-making
Immediate notifications
Better customer experience

Limitations of Streaming Ingestion

More complex architecture
Higher processing requirements
More difficult to maintain

2. CHANGE DATA CAPTURE (CDC)

Change Data Capture (CDC) is a method used to track changes made to data in a database.

Instead of repeatedly copying the entire database, CDC captures only data that has changed.

This includes:

New records
Updated records
Deleted records

Example in an E-Commerce Platform

Suppose:

A customer changes their shipping address
A product price is updated
An order status changes from “Processing” to “Shipped”

Instead of copying the entire customer or orders table again, CDC captures only those changed records.

CDC is important because it:

Reduces unnecessary processing
Saves storage and bandwidth
Improves synchronization between systems
Supports near real-time analytics

Without CDC, systems would waste resources transferring unchanged data repeatedly.
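
The idea of capturing only changed records can be sketched by comparing two snapshots of an orders table. This is a simplification: production CDC tools usually read the database's transaction log instead of comparing snapshots, and the `capture_changes` function and sample data here are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and return only the changes."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

# Hypothetical snapshots of an orders table, keyed by order ID.
previous = {
    101: {"status": "Processing"},
    102: {"status": "Shipped"},
    103: {"status": "Processing"},
}
current = {
    101: {"status": "Shipped"},     # updated
    102: {"status": "Shipped"},     # unchanged, so not captured
    104: {"status": "Processing"},  # new record
}
changes = capture_changes(previous, current)
print(changes)
```

Only three change events are produced, and the unchanged order 102 is never transferred.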

3. IDEMPOTENCY

Idempotency means that performing the same operation multiple times produces the same final result as performing it once.

This concept is extremely important in distributed systems where failures and retries are common.

Example in an E-Commerce Platform

Suppose:

A customer pays for an order
The payment request times out
The system retries the payment

Without idempotency:

The customer may be charged multiple times

With idempotency:

The system recognizes the repeated request as the same transaction
The payment is processed only once

This is usually achieved using:

Unique transaction IDs
Request tracking systems

Idempotency helps:

Prevent duplicate payments
Improve reliability
Make retries safe
Protect customer trust

Large platforms depend heavily on idempotent systems to avoid costly transaction errors.
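
A minimal sketch of the unique-transaction-ID approach is shown below. The `PaymentProcessor` class and the key format are hypothetical; a real system would keep the keys in a durable store rather than in memory.

```python
class PaymentProcessor:
    """Toy payment handler that uses an idempotency key to make retries safe."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first attempt
        self.charges = []   # charges that were actually made

    def pay(self, idempotency_key, customer, amount):
        # A repeated key means this is a retry: return the stored result
        # instead of charging the customer again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer, amount))
        result = {"status": "paid", "customer": customer, "amount": amount}
        self._results[idempotency_key] = result
        return result

processor = PaymentProcessor()
first = processor.pay("order-101-payment", "James", 250)
retry = processor.pay("order-101-payment", "James", 250)  # retry after a timeout
print(len(processor.charges))
```

The retry returns the same result as the first call, and the customer is charged only once.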

4. OLTP VS OLAP

OLTP and OLAP are two different approaches used in database systems.

OLTP (Online Transaction Processing)

OLTP systems handle daily operational activities.

In an e-commerce platform, OLTP systems process:

Customer orders
Payments
Product updates
Cart additions
User logins

These systems require:

Fast response times
Real-time processing
High accuracy
Support for many users simultaneously

For example:

When a customer clicks “Buy Now,” the transaction must happen immediately

OLTP systems focus on:

Inserts
Updates
Short transactions

Examples of OLTP databases include:

MySQL
PostgreSQL
Oracle

OLAP (Online Analytical Processing)

OLAP systems are designed for analysis and reporting.

The company may use OLAP systems to analyze:

Best-selling products
Customer purchasing behavior
Revenue trends
Seasonal demand patterns
Delivery performance

OLAP systems focus on:

Complex analytical queries
Historical analysis
Large datasets
Aggregations

For example:

Management may want to compare monthly sales across different countries over several years

OLAP systems commonly use:

Data warehouses
Analytical databases
Columnar storage systems

Examples include:

Snowflake
BigQuery
Amazon Redshift

Simple Difference

OLTP runs the business
OLAP analyzes the business
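
The two workloads can be contrasted with Python's built-in `sqlite3` module. This is only a sketch: one in-memory SQLite database plays both roles here, whereas in practice the analytical query would usually run on a separate warehouse. The table and function names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
)

def place_order(country, amount):
    """OLTP-style work: one short transaction that inserts a single order."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO orders (country, amount) VALUES (?, ?)", (country, amount)
        )
    return cursor.lastrowid

def revenue_by_country():
    """OLAP-style work: an aggregation that scans every order."""
    query = "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
    return conn.execute(query).fetchall()

place_order("KE", 250.0)
place_order("KE", 100.0)
place_order("NG", 75.0)
print(revenue_by_country())
```

Each `place_order` call touches one row, while `revenue_by_country` reads the whole table to produce a summary.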

5. COLUMNAR VS ROW-BASED STORAGE

Databases can store data either by rows or by columns.

The storage format affects how efficiently queries run.

Row-Based Storage

In row-based storage, all information for one record is stored together.

Example:

Order ID	Customer	Amount
101	James	250

This works well in transactional systems where entire records are frequently inserted or updated.

Advantages

Faster inserts and updates
Good for transactional workloads
Efficient for OLTP systems

Limitations

Slower analytical queries
Less efficient for reporting

Columnar Storage

In columnar storage:

All values from the same column are stored together

For example:

All product prices together
All order dates together
All customer IDs together

This works well for analytics because reports often require only a few columns.

For example:

Calculating total revenue may only require the “Amount” column

Advantages

Faster analytical queries
Better data compression
Efficient for reporting and aggregations

Limitations

Slower updates
Less suitable for transactional systems
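
The difference between the two layouts can be sketched with plain Python structures. The first order comes from the example table above; the other two rows are made up for illustration.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"order_id": 101, "customer": "James", "amount": 250},
    {"order_id": 102, "customer": "Amina", "amount": 100},
    {"order_id": 103, "customer": "Li", "amount": 75},
]

# Columnar layout: all values of the same column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Total revenue from rows has to visit every full record...
total_from_rows = sum(row["amount"] for row in rows)
# ...while the columnar layout reads only the "amount" column.
total_from_columns = sum(columns["amount"])

print(columns["amount"], total_from_columns)
```

Both totals are the same; the columnar version simply never touches the customer names or order IDs.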

6. PARTITIONING

Partitioning means dividing large datasets into smaller sections.

As the e-commerce platform grows, storing all orders in one massive table becomes inefficient.

The platform may partition order data based on:

Date
Country
Product category
Customer region

For example:

January orders stored separately from February orders

If analysts only need January sales data, the system reads only the January partition instead of scanning the entire database.

Partitioning improves:

Query performance
Data management
Processing efficiency

Partitioning is especially important in systems handling millions or billions of records.
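
A small sketch of date-based partitioning follows. The order records and the `partition_by_month` helper are hypothetical, and a real system would store each partition as a separate file or table segment rather than a Python list.

```python
from collections import defaultdict

# Hypothetical order records.
orders = [
    {"order_id": 1, "order_date": "2026-01-15", "amount": 120},
    {"order_id": 2, "order_date": "2026-01-28", "amount": 80},
    {"order_id": 3, "order_date": "2026-02-03", "amount": 200},
]

def partition_by_month(rows):
    """Group orders into partitions keyed by year and month (YYYY-MM)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["order_date"][:7]].append(row)
    return dict(partitions)

partitions = partition_by_month(orders)

# A January report reads only the January partition.
january_sales = sum(row["amount"] for row in partitions["2026-01"])
print(january_sales)
```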

7. ETL VS ELT

ETL and ELT are approaches used to move and process data.

ETL (Extract, Transform, Load)

In ETL:

Data is extracted
Data is transformed
Data is loaded

The transformation happens before storage.

For example:

Product data from different suppliers may have inconsistent formats
Data engineers clean and standardize the data before loading it into a warehouse

ETL ensures:

Clean data enters the system
Better data quality
Easier reporting

ETL is commonly used in traditional data systems.

ELT (Extract, Load, Transform)

In ELT:

Data is extracted
Data is loaded immediately
Transformation happens later

For example:

Raw customer activity logs are stored first
Cleaning happens later during analysis

ELT provides:

More flexibility
Faster loading
Access to raw data

ELT is common in modern cloud systems such as:

Snowflake
BigQuery
Databricks

8. CAP THEOREM

CAP Theorem explains a limitation in distributed systems.

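It states that a distributed system cannot guarantee all three of the following properties at the same time; when a network partition occurs, it must give up either consistency or availability: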

Consistency
Availability
Partition Tolerance

Consistency

All users see the same data at the same time.

Example:

If a product goes out of stock, all customers should immediately see the updated inventory.

Availability

The system always responds to requests.

Even if some servers fail, customers should still access the platform.

Partition Tolerance

The system continues operating even during network failures.

Example in an E-Commerce Platform

Suppose there is a network issue between servers.

The platform may prioritize:

Keeping the website available
Continuing customer purchases

Even if inventory updates become slightly delayed temporarily.

Different systems choose different trade-offs depending on business priorities.

9. WINDOWING IN STREAMING

Streaming data arrives continuously and never ends, so calculations such as totals and averages cannot wait for all of the data to arrive.

Windowing divides streaming data into smaller time-based groups.

Examples:

Orders every 5 minutes
Active users every 10 minutes
Hourly revenue monitoring

Types of Windows

Tumbling Windows

Fixed non-overlapping windows.

Example:

Total sales calculated every 10 minutes

Sliding Windows

Windows overlap continuously.

Example:

Average purchases over the last 30 minutes recalculated every 5 minutes

Session Windows

Based on user activity.

Example:

A shopping session ends after inactivity

Windowing helps organize continuous streams into manageable sections.
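
As an illustration of tumbling windows, the sketch below assigns each event to a fixed 10-minute window by rounding its timestamp down to the start of the window. Timestamps are plain seconds and the event data is made up; streaming frameworks offer their own windowing APIs, which are not shown here.

```python
from collections import defaultdict

def tumbling_window_totals(events, window_seconds):
    """Sum amounts per fixed, non-overlapping window, keyed by window start."""
    totals = defaultdict(float)
    for timestamp, amount in events:
        window_start = timestamp - (timestamp % window_seconds)
        totals[window_start] += amount
    return dict(sorted(totals.items()))

# Hypothetical sales events: (seconds since the start of the stream, amount).
events = [(30, 20.0), (540, 5.0), (610, 12.5), (1199, 7.5), (1200, 40.0)]
totals = tumbling_window_totals(events, window_seconds=600)
print(totals)
```

Each event belongs to exactly one window, which is what distinguishes tumbling windows from sliding windows.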

10. DAGS AND WORKFLOW ORCHESTRATION

A DAG (Directed Acyclic Graph) organizes tasks in a workflow.

An e-commerce reporting pipeline may:

Extract order data
Clean the data
Store it in a warehouse
Generate dashboards
Send reports to management

Each task depends on previous tasks.

Tools like Apache Airflow automate these workflows.

Why DAGs Are Important

Automate workflows
Manage dependencies
Schedule tasks
Monitor failures
Improve reliability

Without orchestration tools, managing large pipelines manually becomes difficult.
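
The dependency idea behind a DAG can be sketched with Python's standard `graphlib` module (Python 3.9 or newer). This is not Airflow code; it only shows how a valid run order is derived from the dependencies, and the task names are shortened versions of the steps listed above.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "clean": {"extract"},
    "store": {"clean"},
    "dashboards": {"store"},
    "send_reports": {"dashboards"},
}

# static_order() yields the tasks in an order that respects every dependency.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

If the dependencies contained a loop, `graphlib` would raise a `CycleError`, which is the "acyclic" requirement in practice.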

11. RETRY LOGIC AND DEAD LETTER QUEUES

Distributed systems frequently experience failures.

Examples include:

Network timeouts
Payment API failures
Invalid records

Retry Logic

Retry logic allows systems to attempt failed operations again automatically.

Example:

A payment gateway temporarily fails
The system retries after a few seconds

This improves reliability and reduces manual intervention.

Dead Letter Queues (DLQs)

Sometimes records repeatedly fail processing.

Example:

An order record contains corrupted product information

Instead of crashing the entire pipeline:

The failed record is moved into a Dead Letter Queue

This helps:

Isolate problematic data
Prevent system failures
Improve debugging
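
Retry logic and a dead letter queue can be combined in a short sketch. Everything here is hypothetical (the `ProcessingError` class, the handler, and the sample records), and the waiting time between attempts that a real system would add is left out to keep the example short.

```python
class ProcessingError(Exception):
    """Raised when a record cannot be processed."""

def process_with_retries(records, handler, max_attempts=3):
    """Retry each record up to max_attempts, then move it to the DLQ."""
    processed, dead_letter_queue = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(record))
                break  # success: stop retrying this record
            except ProcessingError as error:
                if attempt == max_attempts:
                    dead_letter_queue.append({"record": record, "error": str(error)})
    return processed, dead_letter_queue

calls = {"count": 0}

def handle_order(record):
    calls["count"] += 1
    if record["product_id"] is None:
        raise ProcessingError("missing product_id")         # fails every time
    if calls["count"] == 1:
        raise ProcessingError("temporary gateway failure")  # fails only once
    return record["order_id"]

records = [
    {"order_id": 1, "product_id": "A1"},
    {"order_id": 2, "product_id": None},
]
processed, dead_letter_queue = process_with_retries(records, handle_order)
print(processed, dead_letter_queue)
```

Order 1 succeeds on its second attempt, while order 2 fails three times and ends up in the dead letter queue without stopping the pipeline.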

12. BACKFILLING AND REPROCESSING

Data pipelines sometimes fail or produce incorrect outputs.

Backfilling

Backfilling means running a pipeline again for past periods whose data is missing.

Example:

Sales records failed to load for several hours
Missing records need recovery

Reprocessing

Reprocessing means running pipelines again using updated logic.

Example:

The company changes how discounts are calculated
Historical reports must be recalculated

These processes help maintain accurate historical data.
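
A first step in backfilling is working out which periods are missing. The sketch below assumes hourly loads and a known set of hours that loaded successfully; the `missing_hours` helper and the dates are hypothetical.

```python
from datetime import datetime, timedelta

def missing_hours(start, end, loaded_hours):
    """Return each hourly slot in [start, end) that has no loaded data."""
    missing = []
    current = start
    while current < end:
        if current not in loaded_hours:
            missing.append(current)
        current += timedelta(hours=1)
    return missing

start = datetime(2026, 6, 1, 0)
end = datetime(2026, 6, 1, 6)
# Hypothetical situation: only hours 0, 1 and 5 loaded successfully.
loaded = {datetime(2026, 6, 1, hour) for hour in (0, 1, 5)}

to_backfill = missing_hours(start, end, loaded)
print([slot.hour for slot in to_backfill])
```

The pipeline would then be run once for each slot in `to_backfill`.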

13. DATA GOVERNANCE

Data governance refers to managing data properly and securely.

An e-commerce platform handles sensitive data such as:

Customer names
Addresses
Payment details
Purchase history

Good governance ensures:

Data quality
Security
Privacy protection
Controlled access

For example:

Only authorized employees should access customer payment information

Poor governance may lead to:

Data leaks
Compliance violations
Incorrect reporting

14. TIME TRAVEL AND DATA VERSIONING

Time travel allows systems to access older versions of data.

For example:

Recover deleted product information
Audit historical pricing changes
Compare previous inventory records

Data versioning tracks how datasets change over time.

These features are useful for:

Recovery
Auditing
Debugging
Historical analysis

Modern platforms such as Snowflake and Delta Lake support time travel functionality.

15. DISTRIBUTED PROCESSING CONCEPTS

Large e-commerce platforms generate huge amounts of data every second.

A single computer cannot efficiently process:

Millions of customer clicks
Product searches
Payment transactions
Delivery tracking data

Distributed processing solves this problem by spreading work across multiple machines.

Frameworks such as Apache Spark support distributed processing.

Important Distributed Processing Concepts

Parallel Processing

Multiple tasks run simultaneously.

Example:

Processing orders from different countries at the same time

Cluster

A group of computers working together.

Scalability

The ability to handle increasing workloads.

Example:

Adding more servers during Black Friday sales

Fault Tolerance

The system continues functioning even if one machine fails.

This is important because large online stores must remain available continuously.
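
Parallel processing can be illustrated on a single machine with Python's `concurrent.futures`. This is only an analogy: the worker threads below stand in for the machines of a cluster, and frameworks such as Apache Spark handle the distribution across real machines. The order data is made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical order amounts grouped by country code.
orders_by_country = {"KE": [120, 80], "NG": [200], "US": [50, 25, 25]}

def total_for_country(item):
    """Compute the total for one country; each call can run on its own worker."""
    country, amounts = item
    return country, sum(amounts)

# Each country's orders are handed to a separate worker.
with ThreadPoolExecutor(max_workers=3) as pool:
    totals = dict(pool.map(total_for_country, orders_by_country.items()))

print(totals)
```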

CONCLUSION

Modern data engineering systems rely on many foundational concepts to ensure data pipelines are scalable, reliable, efficient, and secure.

Using an e-commerce platform example makes it easier to understand how these concepts work in real-world systems used by millions of people daily. Concepts such as streaming ingestion, ETL, partitioning, orchestration, distributed processing, and governance all work together to support large-scale digital platforms.

As organizations continue generating massive amounts of data, understanding these foundational concepts becomes increasingly important for anyone interested in data engineering, analytics, or modern data systems.

Beginner Friendly Guide

MJ-O — Mon, 18 May 2026 15:34:03 +0000


Virtual Machines, Virtual Environments and Containers: Understanding the Differences

MJ-O — Mon, 18 May 2026 15:33:41 +0000

INTRODUCTION

When working in software development or data engineering, you will often hear terms like virtual machines, virtual environments and containers. They may sound similar, but they solve different problems.

All three are used to create some form of isolation, but they operate at different levels. Understanding how they differ helps you choose the right tool depending on what you are trying to achieve.

In this article, we will look at what each one is, how they work and the key differences between them.

1. WHAT IS A VIRTUAL MACHINE?

A virtual machine (VM) is a full computer system that runs inside another computer.
It includes:

Its own operating system (Linux, Windows, etc.)
Virtual hardware (CPU, memory, storage)
Applications running inside it

A virtual machine runs on software called a hypervisor, such as VirtualBox or VMware, which allows one computer to act like another separate computer inside your main system.
For example, you can run a Linux system inside a Windows laptop using a virtual machine. Even though it is inside your computer, it behaves like a completely separate machine.

Key idea:
A virtual machine is a complete system with its own operating system (OS).

2. WHAT IS A VIRTUAL ENVIRONMENT?

A virtual environment is mainly used in programming, especially in Python. It is used to manage and isolate dependencies for a specific project.

It does not include:

An operating system
Virtual hardware

Instead, it only isolates:

Libraries
Packages

For example, one project may require an older version of a library, while another project needs a newer version. A virtual environment allows both to exist without conflicts.

Key idea:
A virtual environment is a lightweight setup for managing project dependencies.
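
One way to see that a virtual environment only changes where Python looks for packages is to check the interpreter's prefixes. The sketch below assumes an environment created with Python's standard `venv` module; the function name is hypothetical.

```python
import sys

def in_virtual_environment():
    """Return True when Python is running inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment
    # folder, while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix

print("Running in a virtual environment:", in_virtual_environment())
print("Packages are installed under:", sys.prefix)
```

The operating system and hardware are the same in both cases; only the location of the project's libraries differs.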

3. WHAT ARE CONTAINERS?

Containers are a way of packaging an application together with everything it needs to run.

They include:

Application code
Libraries and dependencies
Runtime environment

Containers do not include a full operating system. Instead, they share the host system’s kernel, which makes them lightweight and fast.

The most common tool used for containers is Docker.
For example, you can package a web application into a container and run it on any machine without worrying about differences in setup.

Key idea:
A container is a portable package that runs the same everywhere.

4. KEY DIFFERENCES

The main difference between these three comes down to how much they isolate and what they include.

A virtual machine includes a full operating system and behaves like a separate computer
A virtual environment only manages project-level dependencies
A container packages an application and its environment without including a full OS

Because of this:

VMs are heavier and use more resources
Virtual environments are very lightweight
Containers are lightweight but more complete than virtual environments

5. WHEN TO USE EACH

Virtual Machines
When you need to run a different operating system
When strong isolation is required
Example: testing software on different OS platforms

Virtual Environments
When working on programming projects with different dependencies
Common in Python development
Example: managing different versions of libraries

Containers
When deploying applications
When you want consistency across environments
Example: running the same application on different servers

6. PRACTICAL EXAMPLE

Consider a data engineering project:

A virtual machine can be used to run a Linux server on a Windows machine
A virtual environment can be used to manage Python libraries like pandas or numpy
A container can be used to package the entire data pipeline and deploy it easily

Each tool plays a different role, even within the same project.

7. ADVANTAGES AND LIMITATIONS

Virtual Machines
Advantages

Strong isolation
Can run different operating systems
Suitable for testing environments

Limitations

Heavy and slower to start
Uses more system resources

Virtual Environments
Advantages

Lightweight and easy to use
Prevents dependency conflicts
Ideal for development

Limitations

Limited to programming dependencies
Does not isolate the full system

Containers
Advantages

Lightweight and fast
Portable across different systems
Consistent environment

Limitations

Less isolation compared to VMs
Requires understanding of tools like Docker

CONCLUSION

Virtual machines, virtual environments, and containers are all used to create isolation, but they operate at different levels.

A virtual machine provides a full system with its own operating system. A virtual environment focuses only on managing project dependencies. A container packages an application and its environment to ensure it runs consistently across different systems.

Understanding these differences helps in choosing the right tool for the right task. In most modern workflows, all three can be used together to build efficient and reliable systems.

Implementing Airflow DAGs: A Beginner-Friendly Guide

MJ-O — Wed, 29 Apr 2026 16:05:40 +0000

INTRODUCTION

In data engineering, many tasks need to run automatically, such as extracting data, cleaning it and loading it into a database. Doing this manually is not practical, especially when working with large or frequently updated datasets.

This is where Apache Airflow comes in. Airflow is a tool used to automate and manage workflows. It allows you to define tasks and control the order in which they run.

One of the key concepts in Airflow is a DAG (Directed Acyclic Graph). In simple terms, a DAG is a way of organizing tasks and defining how they depend on each other.

In this article, we will look at how Airflow DAGs are implemented, including operators, tasks, dependencies and scheduling.

1. WHAT IS A DAG IN AIRFLOW?
A DAG is a collection of tasks arranged in a specific order.

Directed → tasks have a direction (one task runs before another)
Acyclic → tasks do not loop back
Graph → tasks are connected

Each DAG represents a workflow.
For example, a simple data pipeline might look like:

Extract data
Transform data
Load data

Each of these steps becomes a task inside a DAG.

2. AIRFLOW OPERATORS
Operators are the building blocks of tasks in Airflow. They define what kind of work a task will perform.
Some common operators include:

BashOperator → runs bash commands
PythonOperator → runs Python functions
EmailOperator → sends emails
Example (BashOperator)

from airflow.providers.standard.operators.bash import BashOperator

# Run the shell command `date` as a single task
task1 = BashOperator(
    task_id='print_date',
    bash_command='date'
)

3. DEFINING TASKS IN AIRFLOW

In Airflow, a task is created by assigning an operator to a variable.
Each task must have:

A unique task_id
A defined operation (what it does)

Example

4. MULTIPLE TASKS AND ORDER
In most workflows, tasks do not run randomly. They follow a specific order.
Airflow allows you to define this order using dependencies.
Example

task1 >> task2
This means:
task1 runs first
task2 runs after

Example

You can also chain multiple tasks:

task1 >> task2 >> task3
This ensures a clear flow in your workflow.

5. UNDERSTANDING TASK DEPENDENCIES
Dependencies control how tasks are related.
There are two main ideas:

Upstream → tasks that run before
Downstream → tasks that run after

For example:

task1 >> task2

In the above, task1 is upstream and task2 is downstream. By default, if task1 fails, task2 will not run.

6. USING THE PYTHONOPERATOR
The PythonOperator is used when you want to run Python code.

from airflow.operators.python import PythonOperator

def greet():
    print("Hello from Python")

task = PythonOperator(
    task_id='python_task',
    python_callable=greet
)

This allows you to include custom logic in your workflow.

7. AIRFLOW SCHEDULING
Airflow allows you to run workflows automatically at specific times. This is done using a schedule, which is defined inside the DAG.

Example:
To run a DAG every 5 minutes, the schedule is defined inside the DAG using (timedelta is imported from Python's datetime module):
schedule = timedelta(minutes=5)

This means the DAG will run automatically every 5 minutes.
Other common schedules include:
@hourly → runs every hour
@daily → runs every day
@weekly → runs every week

You can also define schedules using a cron expression.
Example:

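from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='my_dag',
    start_date=datetime(2024, 1, 1),
    # cron fields: minute 0, hour 6, every day (older Airflow 2 releases
    # named this parameter schedule_interval)
    schedule='0 6 * * *'
)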

The above means the DAG runs every day at 6 AM.

Scheduling is important because it allows workflows to run automatically without manual intervention.

9. A Simple Airflow DAG Using PythonOperator
Below is a simple example of an Airflow DAG that runs two tasks. The first task prints a message, and the second task runs after it.
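
A minimal sketch of such a DAG is given here, assuming Airflow 2.4 or newer (where the `schedule` parameter is available) and the `airflow.operators.python` import path used earlier in this article. The DAG ID, task IDs, function names and messages are hypothetical. Airflow is imported inside `build_dag` only so that the two Python functions can also be run on their own.

```python
from datetime import datetime, timedelta

def print_message():
    """First task: print a message."""
    message = "Hello from the first task"
    print(message)
    return message

def print_done():
    """Second task: runs only after the first task has succeeded."""
    message = "The second task ran after the first"
    print(message)
    return message

def build_dag():
    # Imported here so the functions above can be tested without Airflow.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="simple_two_task_dag",    # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule=timedelta(minutes=5),   # run every 5 minutes
        catchup=False,                   # do not run for past intervals
    ) as dag:
        first = PythonOperator(task_id="print_message", python_callable=print_message)
        second = PythonOperator(task_id="print_done", python_callable=print_done)
        first >> second                  # first is upstream of second
    return dag
```

In an actual DAG file placed in the Airflow DAGs folder, `build_dag()` would need to be called at module level, for example `dag = build_dag()`, so that Airflow can discover the DAG.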

10. TROUBLESHOOTING DAGS
Sometimes DAGs do not run as expected.
Common issues include:

Tasks not linked correctly
Wrong schedule settings
Errors in Python functions
DAG not placed in the correct Airflow folder

To fix this:

Check logs in Airflow UI
Confirm dependencies are correct
Ensure all tasks have valid code
Ensure the DAG is in the correct Airflow folder

11. REAL-WORLD EXAMPLE
Consider a data pipeline for a retail company.
The workflow might be:

Extract sales data
Clean the data
Store it in a database
Send a report

In Airflow, this becomes:

extract >> transform >> load >> report

Each step is a task, and Airflow ensures they run in the correct order.

12. WHY AIRFLOW DAGS ARE IMPORTANT
Airflow DAGs help data engineers:

Automate workflows
Manage task dependencies
Schedule tasks efficiently
Monitor processes

CONCLUSION

Implementing Airflow DAGs is an important skill in data engineering. DAGs allow you to define workflows clearly, control how tasks run, and automate processes. By understanding operators, tasks, dependencies, and scheduling, you can build reliable data pipelines that run efficiently without manual intervention. As data systems grow more complex, tools like Airflow become essential for managing and scaling workflows.

ETL vs ELT: Which One Should You Use and Why?

MJ-O — Sun, 12 Apr 2026 19:28:28 +0000

1. INTRODUCTION

When working with data, one of the main tasks is moving data from different sources into a system where it can be stored and analyzed. This process is very important in data engineering and analytics, and it is usually done using either ETL or ELT.

ETL and ELT may sound similar, but they are not the same. The main difference is in how and when the data is processed. Understanding this difference helps you choose the right approach depending on the type of data, the system you are using, and what you want to achieve.

In this article, we will look at what ETL and ELT are, how they work, their differences and where each one is used in real-world situations.

2. WHAT IS ETL?
ETL stands for Extract, Transform, Load. It is the traditional method used to move and prepare data.

In ETL, data is first collected from different sources. After that, it is cleaned and transformed into the required format. Finally, the processed data is loaded into a database or data warehouse.

So the flow is:
Extract → Transform → Load

In this approach, the transformation happens before the data is stored. This means that only clean and structured data is saved in the system.

For example: in a hospital system, patient data may come from different departments such as the lab, pharmacy, and reception. Some records may be incomplete or duplicated. In ETL, this data is cleaned, standardized, and corrected before it is stored, ensuring doctors and staff work with accurate information.

3. WHAT IS ELT?
ELT stands for Extract, Load, Transform. It is a more modern approach, especially used in cloud-based systems.

In ELT, data is first extracted from sources and then loaded directly into the system without being cleaned. After that, the transformation is done inside the database or data warehouse.

So the flow is:
Extract → Load → Transform

In this case, raw data is stored first, and cleaning happens later.

For example: a social media platform collects large amounts of user activity such as likes, comments, and clicks. With ELT, all this raw data is stored first, and analysts later transform it depending on what they want to analyze.

This approach allows more flexibility because the original raw data is always available.

4. KEY DIFFERENCE BETWEEN ETL AND ELT
The main difference between ETL and ELT is when the data is transformed.

In ETL, data is cleaned before it is stored. In ELT, data is stored first and cleaned later.

This affects how fast data can be loaded, how flexible the system is, and how much processing power is needed. ETL focuses more on control and data quality, while ELT focuses more on speed and flexibility.

5. WHEN TO USE ETL
ETL is useful in situations where data needs to be clean and structured before it is stored.

One common use case is in systems where data quality is very important, such as banking or financial systems. In such cases, incorrect data can cause serious problems, so it must be cleaned before it is stored.

For example: in a banking system, transactions must be verified, duplicates removed, and errors corrected before they are saved. This ensures accurate balances and reliable financial reporting.

ETL is also useful in older systems that cannot handle large transformations efficiently. By processing data before loading, the system is not overloaded.

Another use case is when working with structured data that does not change often. ETL ensures consistency and makes reporting easier.

6. WHEN TO USE ELT
ELT is more common in modern systems, especially those using cloud platforms.

It is useful when working with large amounts of data because it allows fast loading without waiting for data to be cleaned first.

ELT is also useful for data analysis and exploration. Since raw data is stored, analysts can transform it in different ways depending on what they need.

For example: an e-commerce platform collects data such as product views, clicks, and purchases. With ELT, all this data is stored as it is, and later analysts can transform it to understand customer behavior or sales trends.

This makes ELT more flexible compared to ETL.

7. TOOLS USED IN ETL AND ELT
Different tools are used depending on the approach.

For ETL, tools are designed to clean and transform data before loading. Some commonly used tools include Informatica, Talend, and Microsoft SSIS. These are mostly used in traditional data systems.

For ELT, tools focus on loading data first and transforming it later. Examples include Fivetran for extracting and loading data, dbt for transforming data inside the warehouse, and Apache Airflow for orchestrating the steps. These tools are commonly used together with cloud platforms such as Snowflake, BigQuery, and Amazon Redshift.

8. PRACTICAL EXAMPLE
Consider a ride-hailing company that wants to analyze trip data.

If the company uses ETL, the data is first cleaned. This includes removing invalid trips, fixing missing values, and standardizing formats. After that, the clean data is stored in the system and used for reporting.

If the company uses ELT, all trip data is loaded immediately, even if it is incomplete or inconsistent. The cleaning and transformation are done later when analyzing things like peak hours or average trip distance.

Both approaches can work, but the choice depends on the system and the needs of the business.
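
The two flows can be sketched side by side in Python. The trip records and the `clean` function are hypothetical, and the cleaning is simplified: invalid or incomplete trips are dropped rather than repaired.

```python
# Hypothetical raw trip records from the ride-hailing app.
raw_trips = [
    {"trip_id": 1, "distance_km": "12.5"},
    {"trip_id": 2, "distance_km": None},  # missing value
    {"trip_id": 3, "distance_km": "-4"},  # invalid trip
]

def clean(trips):
    """Keep trips with a positive distance and convert the distance to a float."""
    cleaned = []
    for trip in trips:
        distance = trip["distance_km"]
        if distance is not None and float(distance) > 0:
            cleaned.append({"trip_id": trip["trip_id"], "distance_km": float(distance)})
    return cleaned

# ETL: transform first, then load only the cleaned rows.
etl_warehouse = clean(raw_trips)

# ELT: load the raw rows as they are, then transform later at analysis time.
elt_warehouse = list(raw_trips)
elt_view = clean(elt_warehouse)

print(len(etl_warehouse), len(elt_warehouse), len(elt_view))
```

Both flows end with the same cleaned data, but only the ELT warehouse still holds the two raw records that were filtered out.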

9. Advantages of ETL

Ensures data is cleaned before it is stored
Reduces errors early in the process
Works well in systems that require high data accuracy
Keeps the database organized with structured data

10. Limitations of ETL

Can be slower, especially with large datasets
Less flexible once data is already transformed and stored
Requires more effort before loading data

11. Advantages of ELT

Faster data loading since transformation happens later
More flexible because raw data is stored
Allows analysts to transform data in different ways when needed
Works well with large datasets and modern systems

12. Limitations of ELT

Raw data may contain errors or inconsistencies
Requires powerful systems to handle transformations
Data needs to be cleaned later before proper analysis

13. WHICH ONE SHOULD YOU USE?
Choosing between ETL and ELT depends on your situation.
If you need strict control over data quality and are working with systems that require clean data before storage, ETL is the better option.

If you are working with large datasets, modern cloud systems, or need flexibility in analysis, ELT is usually the better choice.

In many modern environments, ELT is becoming more common because of its speed and scalability. However, ETL is still important in cases where data accuracy is critical.

CONCLUSION

ETL and ELT are both important approaches used to move and process data. The main difference between them is the order in which data is transformed and loaded.

ETL focuses on cleaning data before storing it, while ELT focuses on storing data first and transforming it later. Each approach has its own advantages depending on the system and the type of data being used.

Understanding how both work helps in choosing the right approach and building better data pipelines. In the end, the goal is to ensure that data is reliable, accessible, and useful for making decisions.

CONNECTING POSTGRESQL TO POWER BI FOR DATA ANALYSIS

MJ-O — Wed, 18 Mar 2026 05:37:14 +0000

Introduction

Power BI is a business intelligence tool developed by Microsoft for extracting, organizing, and visualizing business data. It lets you connect to several data sources, transform data into meaningful insights, and present those insights through interactive dashboards and reports.
Data analysts and businesses use Power BI to bridge the gap between data and decision-making. Visualizations such as charts and dashboards make it easier to identify trends, monitor performance, and make decisions based on the data.
Companies connect Power BI to databases because databases store the large amounts of structured data needed for analysis. By connecting Power BI directly to these databases, organizations can access the latest information and automatically generate accurate reports for better decision-making.
SQL databases are important for analytical work because they store and organize large amounts of structured data efficiently, which makes that data easier to retrieve, manage, and analyze. Tools like Microsoft Power BI rely on SQL databases for reliable, organized, and well-structured data for reporting and analysis.

1. CONNECTING POWER BI TO POSTGRESQL DATABASE

Power BI helps businesses make sense of their data. It can surface trends and patterns and share insights so teams can make smarter decisions. To do this well, it connects to databases like PostgreSQL. These databases keep data organized and structured, so Power BI can easily pull in the information it needs for reports, dashboards, and analysis.

Connecting to a Local PostgreSQL Database
Step 1: Open Power BI Desktop
Open your Power BI Desktop application and click on 'Blank report'.

Step 2: Get Data
After it opens to a new window, click the 'Get data' dropdown on the Home ribbon and select 'More'.

Step 3: Search for PostgreSQL
Once the new window opens, type postgresql in the search bar and double-click the PostgreSQL result.

Step 4: Enter Server Details
In the new window, enter your machine's IP address in the 'Server' field (for a database on the same machine, localhost also works) and the database name in the 'Database' field, then click 'OK'.

Step 5: Authenticate Connection
You will then be prompted for your PostgreSQL database credentials. Enter the correct username and password and click 'Connect'.

Step 6: Load Tables
After successful authentication, a window displaying the databases and schemas within your PostgreSQL server opens.
Expand your database and select the tables needed by checking their respective boxes. A preview of the selected table's data will appear on the right. Finally, click 'Load' to bring the data into Power BI.

Connecting To A Cloud Postgresql Database
Most organizations host their databases in the cloud instead of locally, for example on Aiven, which provides managed PostgreSQL databases. Connecting Power BI to a cloud PostgreSQL database follows a similar process but requires additional connection details.
Step 1: Create Database on Aiven

Creating a PostgreSQL database on Aiven is easy and fully cloud-hosted. First, log in to the Aiven console, select Services on the right tab, click Create Service, and select PostgreSQL. Choose a cloud provider, region, and pricing plan, give your service a name, and click Create Service. Aiven will deploy the database automatically, ready for use.

Step 2: Download The SSL Certificate
Cloud databases often require secure encrypted connections. From the Aiven console, navigate to Connection Information and download the CA Certificate (SSL certificate).
Step 3: Connect Aiven To Power Bi
To connect Power BI to a cloud database on Aiven, you need the PostgreSQL Open Database Connectivity (ODBC) driver. ODBC lets applications communicate with databases in a standard way.
Download and install the driver: Get the PostgreSQL ODBC driver (psqlODBC) for your system from: https://www.postgresql.org/ftp/odbc/releases/REL-17_00_0008-mimalloc/.
Set up a data source: Open ODBC Data Source Administrator, go to System DSN, and click Add. Choose PostgreSQL Unicode(x64) and click Finish.
Enter connection details: A psqlODBC setup window will appear requesting connection details. Copy the connection details (server, database, username, password) from your running Aiven service, paste them into the window, and test the connection to make sure it works.
Step 4: Connect Power BI Using the ODBC Source
After setting up the ODBC data source, open Power BI Desktop. Click Get Data, choose ODBC, and click Connect. In the window that appears, select PostgreSQL ODBC and click 'OK'.
Next, enter the username and password from your Aiven service and click Connect. Power BI will retrieve the tables from your PostgreSQL database. Select the tables you want to use and click Load to bring the data into Power BI.

Note: The PostgreSQL ODBC driver uses the downloaded SSL certificate to encrypt the connection, keeping your data secure between Power BI and the cloud database.

2. UNDERSTANDING DATA MODELLING

Data modeling is basically how you connect your tables so Power BI knows how they relate. You do this by creating relationships between tables using common fields, called keys.

3. RELATIONSHIPS IN POWER BI

Relationships in Power BI are usually created in the Model view.
How to Create Relationships

Switch to the Model view in Power BI.
Drag a field from one table (like CustomerID in Sales) onto the matching field in another table (like CustomerID in Customers).
Power BI usually sets up a one-to-many relationship automatically.
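
To see why a relationship between Customers and Sales comes out as one-to-many, it helps to look at how often each key value occurs on each side. The sketch below is not Power BI's own logic, only an illustration of the idea; the function name and the sample CustomerID values are hypothetical.

```python
def relationship_type(left_keys, right_keys):
    """Classify a relationship by checking whether each key column is unique."""
    left_unique = len(set(left_keys)) == len(left_keys)
    right_unique = len(set(right_keys)) == len(right_keys)
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique:
        return "one-to-many"
    if right_unique:
        return "many-to-one"
    return "many-to-many"


# Hypothetical CustomerID values: unique in Customers, repeated in Sales.
customers_ids = [1, 2, 3]
sales_customer_ids = [1, 1, 2, 3, 3]
print(relationship_type(customers_ids, sales_customer_ids))
```

Each customer appears once in Customers but can appear on many rows in Sales, which is the pattern behind a one-to-many relationship.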

IMPORTANCE OF SQL SKILLS

Power BI is a powerful tool for creating charts, dashboards, and reports, but SQL is just as important for analysts. SQL helps one:

Pull only the data they need directly from the database, saving time and avoiding extra clutter
Filter and clean data before bringing it into Power BI, which speeds up reports
Summarize or aggregate data, like totals, averages, and counts, so dashboards are meaningful
Combine tables and organize datasets so everything links properly for accurate analysis
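
As a rough illustration of filtering and summarizing before the data reaches Power BI, the sketch below runs one query that keeps only recent orders and aggregates them per region. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (order_id INTEGER, region TEXT, order_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'East', '2025-12-30', 40.0),
        (2, 'East', '2026-01-03', 60.0),
        (3, 'West', '2026-01-04', 25.0),
        (4, 'West', '2026-01-09', 75.0);
""")

# Filter first (WHERE), then summarize (GROUP BY), so only a small
# result set would need to be loaded into the report.
summary = conn.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS total, AVG(amount) AS average
    FROM sales
    WHERE order_date >= '2026-01-01'
    GROUP BY region
    ORDER BY region
""").fetchall()

for row in summary:
    print(row)
```

The query returns one row per region instead of every order, which is the kind of reduction that keeps reports small and fast.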

CONCLUSION

Power BI makes analyzing and visualizing data easy, but SQL gives you control over the data behind the visuals. Knowing SQL helps you prepare datasets, make reports faster and create dashboards that are accurate and meaningful. Power BI and SQL allow analysts to turn raw data into real business insights.

Joins and Windows Functions in SQL

MJ-O — Sun, 01 Mar 2026 18:29:05 +0000

INTRODUCTION

Data in relational databases is usually stored in different tables. Joins allow one to combine data from multiple tables, whereas window functions allow calculations across related rows without collapsing the results into a single row.

1. JOINS
Joins are operations that allow one to combine rows from two or more tables based on a related column between them.

Types of Joins
- INNER JOIN
The most common type of join. It returns only the rows that have matching values in both tables based on a related column between them. It can also be written simply as JOIN.

- LEFT JOIN
Returns all rows from the left table (the first table selected) and the matching rows from the right table. If there is no match, NULL values are returned for the right table's columns.

- RIGHT JOIN
Returns all rows from the right table (the second table selected) and the matching rows from the left table. If there is no match, NULL values are returned for the left table's columns.

- FULL OUTER JOIN
Returns all rows from both tables. Non-matching rows from both tables will contain NULL values.
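
The difference between an inner and a left join is easiest to see on a small dataset. This is a minimal sketch, assuming Python's built-in sqlite3 module in place of PostgreSQL; the customers and orders tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian'), (3, 'Chloe');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 35.0);
""")

# INNER JOIN: only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# LEFT JOIN: every customer; order columns are NULL (None) when unmatched.
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Chloe has no orders, so she is missing from the inner join but appears in the left join with None for order_id. RIGHT JOIN and FULL OUTER JOIN follow the same pattern; they are left out of this sketch because older SQLite builds do not support them.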

2. WINDOW FUNCTIONS
Window functions are functions that perform operations across a set of rows that are related to the row the function is currently operating on.
Types of Window Functions

ROW_NUMBER() – assigns a unique number to each row
RANK() – gives tied rows the same rank and leaves gaps in the numbering after ties
DENSE_RANK() – gives tied rows the same rank without leaving gaps
LAG() – accesses previous row values
LEAD() – accesses next row values

The OVER() Clause
All window functions require the OVER() clause. This clause defines the window of rows the function should operate on. You can specify how rows are grouped with PARTITION BY and the order of rows with ORDER BY.
For example, if one wants to rank sales by each region, they use a ranking function with OVER(PARTITION BY region ORDER BY sales DESC). The OVER() clause is what makes these functions “window functions” instead of ordinary aggregates.
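
The ranking example above can be sketched end to end. The block below assumes Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer) rather than PostgreSQL, and the sales table, the rep column, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, sales INTEGER);
    INSERT INTO sales VALUES
        ('East', 'Ann', 300), ('East', 'Ben', 300), ('East', 'Cy', 100),
        ('West', 'Dee', 500), ('West', 'Eli', 200);
""")

# Each function restarts per region (PARTITION BY) and keeps every row.
rows = conn.execute("""
    SELECT region, rep, sales,
           RANK()       OVER (PARTITION BY region ORDER BY sales DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS dense_rnk,
           LAG(sales)   OVER (PARTITION BY region ORDER BY sales DESC, rep) AS prev_sales
    FROM sales
    ORDER BY region, sales DESC, rep
""").fetchall()

for row in rows:
    print(row)
```

In the East region, Ann and Ben tie at rank 1. RANK() then jumps to 3 for Cy, while DENSE_RANK() gives Cy 2. LAG() returns None for the first row of each region because there is no previous row inside that partition.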

CONCLUSION

Joins and window functions are key tools in SQL that make working with data easier. Joins let you combine information from different tables, so you can see how data relates. Window functions let you do calculations across rows without losing the details of each row. Learning how to use both makes your queries more powerful and helps you get better insights from your data.

Connecting PostgreSQL on a Linux (WSL) Server to DBeaver

MJ-O — Thu, 12 Feb 2026 07:58:41 +0000

INTRODUCTION

PostgreSQL is a relational database management system used to store and manage structured data. When installed inside Windows Subsystem for Linux (WSL), it runs in a Linux environment on a Windows machine. DBeaver is a graphical database management tool that supports PostgreSQL. Connecting DBeaver to PostgreSQL running in WSL allows users to manage the database using a visual interface instead of only terminal commands.

Prerequisites

PostgreSQL installed and running inside WSL
DBeaver installed on your Windows machine
Access credentials for the PostgreSQL user (e.g., postgres user and password)

STEP 1: Confirm Installation of PostgreSQL

To confirm that PostgreSQL is installed in WSL, enter the following command inside the WSL terminal:

psql --version

If installed, it displays the PostgreSQL version.

Then check if the service is running using the following command (if systemd is not enabled in your WSL distribution, use sudo service postgresql status instead):

sudo systemctl status postgresql

If it is not running, use the following command to start the service:

sudo systemctl start postgresql

STEP 2 : Create or Verify Database

After the postgres service is up and running, switch to the postgres account on your server:

sudo -i -u postgres

The following command will log you into the PostgreSQL prompt where you can interact with the database management system:

psql

Use the following command to show a list of databases available

\l

If your database is not in the list, create one using the following command:

CREATE DATABASE <database_name>;

NOTE: Replace the name with the required database name

STEP 3: Configure Authentication
By default, PostgreSQL on Ubuntu/WSL uses peer authentication, which works in the Linux terminal but does not work with external tools like DBeaver. Changing the method to md5 enables password-based authentication, allowing external tools to connect successfully.

To edit the authentication file, use the following commands (replace 16 with your installed PostgreSQL version if it differs):

cd /etc/postgresql/16/main

then:

sudo nano pg_hba.conf

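Find:
local all postgres peer
Change it to:
local all postgres md5
Also ensure these lines exist and are configured correctly:
host all all 127.0.0.1/32 md5
host all all ::1/128 md5
NOTE: Once the local line uses md5, psql asks for the postgres password on local logins too, so set that password (Step 4) before restarting if you have not set one yet.
Save and restart PostgreSQL: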

sudo service postgresql restart
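
After editing, it can be useful to double-check which host rules actually allow password logins. The sketch below is a hypothetical helper, not part of PostgreSQL: it parses pg_hba.conf text and assumes each host rule uses the five-field CIDR form (type, database, user, address, method). The sample text mirrors the lines from this step.

```python
SAMPLE_PG_HBA = """\
# TYPE  DATABASE  USER      ADDRESS        METHOD
local   all       postgres                 md5
host    all       all       127.0.0.1/32   md5
host    all       all       ::1/128        md5
"""


def password_host_rules(pg_hba_text):
    """Return (address, method) pairs for 'host' rules that use password auth."""
    rules = []
    for raw_line in pg_hba_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        # Assumes the CIDR form: host  database  user  address  method
        if (
            fields[0] == "host"
            and len(fields) >= 5
            and fields[4] in ("md5", "scram-sha-256")
        ):
            rules.append((fields[3], fields[4]))
    return rules


rules = password_host_rules(SAMPLE_PG_HBA)
print(rules)
```

To check a real file, the text could be read from /etc/postgresql/16/main/pg_hba.conf (with sudo) and passed to the same function.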

STEP 4: Set a Password for the PostgreSQL User

Log into the postgres account on your server and open the psql prompt:

sudo -i -u postgres

psql

Set the password:

ALTER USER postgres WITH PASSWORD 'yourpassword';

Exit the psql prompt:

\q

STEP 5: Configure PostgreSQL to Listen on All Interfaces
By default, PostgreSQL listens only on localhost.
In WSL 2, this may prevent DBeaver (running in Windows) from connecting properly because WSL operates on a separate virtual network interface.

Edit the configuration file:

sudo nano /etc/postgresql/*/main/postgresql.conf

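Find:
listen_addresses = 'localhost'
Change it to:
listen_addresses = '*'
If the line starts with #, remove the # so the setting takes effect.
Save and restart PostgreSQL:

sudo service postgresql restart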

STEP 6: Obtain the WSL IP Address

In WSL, run:

ip addr show eth0

Find:

inet 172.xx.xx.xx/xx

NOTE: This will be the value used as the host in DBeaver.
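
If you would rather pull the address out programmatically than read it by eye, a small parser is enough. This is a sketch only: the function name is hypothetical and the sample output below is made up to resemble what ip addr show eth0 prints; the real address on your machine will differ.

```python
import re

# Made-up sample resembling `ip addr show eth0` output.
SAMPLE_OUTPUT = """\
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:00:00:01 brd ff:ff:ff:ff:ff:ff
    inet 172.20.144.5/20 brd 172.20.159.255 scope global eth0
       valid_lft forever preferred_lft forever
"""


def extract_ipv4(ip_addr_output):
    """Return the first IPv4 address on an 'inet' line, or None if absent."""
    match = re.search(
        r"^\s*inet (\d{1,3}(?:\.\d{1,3}){3})/\d+", ip_addr_output, re.MULTILINE
    )
    return match.group(1) if match else None


wsl_ip = extract_ipv4(SAMPLE_OUTPUT)
print(wsl_ip)
```

The pattern matches only inet lines, so IPv6 entries (which start with inet6) are ignored.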

STEP 7: Connect Using DBeaver

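Open DBeaver and create a new PostgreSQL connection.
Fill in the details of your connection: the host (the WSL IP address from Step 6), the port (5432 by default), the database name, the username, and the password.

SSL tab: Set SSL mode to Disable
Click Test Connection.

If configured correctly, the connection should succeed. Click Finish.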
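
If Test Connection fails, it helps to first rule out a plain network problem. The sketch below only checks whether a TCP connection to the PostgreSQL port can be opened; it does not authenticate or speak the PostgreSQL protocol. The function name is hypothetical, and 5432 is assumed because it is PostgreSQL's default port.

```python
import socket


def can_reach(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (run from Windows, using the WSL address from Step 6):
# can_reach("172.xx.xx.xx")
```

A False result points at the listen address, the IP address, or a firewall, while a True result followed by a DBeaver error suggests an authentication issue instead.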

CONCLUSION

Connecting PostgreSQL running inside WSL to DBeaver allows users to manage their database using a graphical interface instead of relying only on terminal commands. By configuring authentication, setting a password, adjusting the listen address, and using the correct WSL IP address, the connection can be established successfully. This setup makes it easier to run queries, manage tables, and work with databases more efficiently.

How Analysts Use Power BI to Transform Data into Action

MJ-O — Mon, 09 Feb 2026 04:28:11 +0000

Introduction

When working with Power BI, analysts rarely get clean data that is ready to use. Most of the time, the data is messy, comes from different sources and needs a lot of preparation before it can be analysed. Power BI helps analysts clean this data, create calculations and build dashboards that help people understand what is happening and what actions to take.

WORKING WITH MESSY DATA
Data usually comes from sources such as Excel files, databases, or online systems. Most datasets contain errors such as missing values, duplicates, or incorrect data types. Using such data directly can lead to inaccurate and inconsistent results. In Power BI, analysts use Power Query to clean and prepare data. This includes removing duplicates, correcting data types, renaming columns, and filtering out blanks or unnecessary data. Cleaning the data first ensures that the analysis is based on correct and consistent information. Before building reports, analysts first focus on understanding and preparing the data.

DATA CLEANING USING POWER QUERY
Power Query is used in Power BI to clean and transform data. It allows analysts to connect to data sources and make changes before the data is loaded into the model.
Common tasks done in Power Query include:

Removing duplicate or empty rows
Changing incorrect data types
Renaming columns to make them easier to understand
Combining data from multiple sources

This step ensures the data is organised and ready for analysis.

MODELLING DATA FOR ANALYSIS
Once the data is cleaned, it needs to be organized properly. Data modelling in Power BI involves arranging data into fact and dimension tables and creating relationships between them. A good data model helps Power BI understand how tables connect and how filters should work across reports. When the model is simple and well structured, reports load faster and calculations give correct results.

USING DAX TO ADD MEANING
After the data is cleaned, analysts use DAX to create calculations. DAX is used to calculate totals, averages, and other values that help answer business questions. DAX measures change depending on filters such as date, category, or location. This makes it easier to compare performance and identify trends in the data. Without DAX, Power BI only displays raw data. With DAX, the data becomes meaningful.

For example, DAX can be used to calculate total sales, profit margins or classify data into performance categories. These calculations help turn raw data into information that answers real business questions.

DASHBOARDS FOR DECISION-MAKING
Dashboards are used to present insights in a clear and simple way. A good dashboard focuses on important information and avoids unnecessary visuals. Key values are usually shown first, followed by trends and more detailed information. Filters are added to allow users to explore the data further. In Power BI, analysts use charts, tables, KPIs, and filters to show important trends and patterns. A well-designed dashboard allows users to quickly understand what is happening in the data and take action without needing to analyze the raw data themselves. The goal is to make the report easy to understand, even for someone seeing it for the first time.

FROM INSIGHTS TO ACTION
The purpose of analysis is not just to view data, but to support decisions. When dashboards clearly show patterns and performance, users can take action based on the insights provided. Power BI helps bridge the gap between data and decision-making.

CONCLUSION

Power BI helps analysts transform messy data into meaningful insights. By cleaning data, modelling it correctly, applying DAX calculations, and presenting results through dashboards, analysts are able to support accurate reporting and informed decisions.

[Boost]

MJ-O — Mon, 09 Feb 2026 04:01:52 +0000

Boosted post: The Junior Developer is Extinct (And we are creating a disaster), by NorthernDev (Feb 5). Tags: #discuss #career #ai #future

How to Install Python on Linux

MJ-O — Thu, 05 Feb 2026 14:48:35 +0000

Introduction
Python is an essential programming language for data engineers and most developers. Combined with Linux, it provides a powerful, stable and flexible environment with an extensive ecosystem of specialized tools. This article guides a user running a Linux distribution on Windows using WSL.

1. Prerequisites

Root Access: Sudo privileges to install software.
Terminal Access: Familiarity with the command line.
Internet Connection: Active internet access for downloading packages.
Disk Space: At least 200MB available.
Command-Line Basics: Understanding of simple terminal commands.

2. Understanding Python Versions
Python has two major versions:
- Python 2: a legacy version that no longer receives updates or security patches.
- Python 3: the actively maintained version with improved performance and features, recommended for all modern projects.

Before installation, check whether a version of Python is already installed using the following commands:
- For Python 2:
python2 --version
- For Python 3:
python3 --version
Note: If Python is installed, the terminal will display its version number. If not, the terminal will return a "command not found" error, indicating that Python needs to be installed.

3. Installing Python

STEP 1: Update and Upgrade Packages

The following command ensures the package lists from the repositories are up to date:
sudo apt update
The following command upgrades the existing packages to their latest versions to ensure compatibility:
sudo apt upgrade

STEP 2: Install Desired Python Version
-For example, to install Python 3, type the following command in the WSL terminal:
sudo apt install python3

To verify the installation, enter the following command:
python3 --version
If the installation was successful, "Python" followed by the installed version number should appear.

STEP 3: Install Python Package Manager

A package manager is a tool that automatically installs, updates, removes, and manages software and its dependencies so everything works together safely.
Python's package manager is known as pip.
The following command installs pip for managing Python packages and dependencies:
sudo apt install python3-pip
Verify the pip installation using the following command:
pip3 --version
If successful, the pip version is displayed.

4. Using Virtual Environments
On Linux systems (including WSL), Python is often used by the operating system itself. Installing packages globally using pip can cause conflicts or break system tools.
To avoid this, Python provides virtual environments, which isolate project dependencies.
STEP 1: Install venv (if not installed)
-The following command provides tools needed to create isolated Python environments
sudo apt install python3-venv
STEP 2: Create a Virtual Environment
-The following command creates a dedicated environment for project-specific packages.
python3 -m venv venv
STEP 3: Activate the Virtual Environment
-The following command ensures Python and pip commands run inside the isolated environment.
source venv/bin/activate
NOTE: To deactivate the virtual environment, run:
deactivate
-Deactivating returns the terminal to the system Python environment.
STEP 4: Upgrade pip Inside the Virtual Environment
-The following command safely updates pip without affecting system Python.
pip install --upgrade pip
STEP 5: Install Packages
-The following command installs required libraries only for the current project.
pip install pandas numpy
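
To confirm that Python is really running inside the virtual environment before installing anything, you can compare two standard attributes of the sys module. This is a minimal sketch that assumes the environment was created with venv, as in the steps above; the function name is just for illustration.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

Run it once with the environment activated and once after deactivate; the second line should change from True to False.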

Conclusion

Python is a powerful and essential tool for data engineers, and installing it correctly on a Linux environment using WSL is crucial for stability and scalability. By using apt for system-level installations and virtual environments for project-specific packages, developers can maintain clean, reliable, and professional development environments.

A Beginner’s Guide to Installing Linux on Windows Using WSL

MJ-O — Wed, 04 Feb 2026 20:26:14 +0000

1. What is WSL?
The Windows Subsystem for Linux (WSL) allows developers on the Windows operating system to install Linux distributions (such as Ubuntu, OpenSUSE, Kali, Debian, Arch Linux, etc.) while still accessing the power of a Windows machine.

2. Prerequisites
Your machine must be running Windows 10 version 2004 or higher (Build 19041 or higher) or Windows 11.
To confirm, open Command Prompt and type ver. The output shows your Windows version and build number.

IMPORTANT
In PowerShell, running as admin, type the following command:
optionalfeatures.exe
and check the following boxes if not checked:

Windows Subsystem for Linux
Virtual Machine Platform
Hyper-V (if available)

Select OK and restart your machine to apply changes.

3. Install WSL Command
Press Start, type powershell and select "Run as administrator".
Enter the WSL install command below, then reboot your machine:
wsl --install

4. Install desired Linux distribution

Type the command below to see a list of the available Linux distributions (distros):
wsl --list --online
Select the desired distribution and type the command below, replacing Ubuntu with the NAME of the chosen distribution (e.g. Ubuntu-24.04), then press Enter:
wsl --install -d Ubuntu
Restart your machine

5. Launching UBUNTU
After the restart, on the start menu, search for Ubuntu and click it to launch.
Create a default user account by creating a username and password

CONCLUSION
Windows Subsystem for Linux (WSL) allows users to run a Linux distribution directly on a Windows machine without dual-booting or setting up a separate virtual machine. After installation, users can work in a full Linux environment to practice commands, install development tools, and run applications, making WSL a powerful and convenient setup for development and learning.