<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ronny Mwenda</title>
    <description>The latest articles on DEV Community by Ronny Mwenda (@ronny_mwenda_8b2e8cfe1fa9).</description>
    <link>https://dev.to/ronny_mwenda_8b2e8cfe1fa9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3265767%2Fd2967f02-5be7-4675-803a-f1122b36cb99.jpg</url>
      <title>DEV Community: Ronny Mwenda</title>
      <link>https://dev.to/ronny_mwenda_8b2e8cfe1fa9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ronny_mwenda_8b2e8cfe1fa9"/>
    <language>en</language>
    <item>
      <title>15 major concepts of Data Engineering.</title>
      <dc:creator>Ronny Mwenda</dc:creator>
      <pubDate>Mon, 11 Aug 2025 23:47:49 +0000</pubDate>
      <link>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/15-major-concepts-of-data-engineering-3aik</link>
      <guid>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/15-major-concepts-of-data-engineering-3aik</guid>
      <description>&lt;p&gt;Data Engineering can be defined as designing, building, and maintaining infrastructure that allows organizations to collect, store, process, and analyze large volumes of data. Data engineering can be subdivided into 15 major concepts, namely;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch vs Streaming Ingestion&lt;/li&gt;
&lt;li&gt;Change Data Capture (CDC)&lt;/li&gt;
&lt;li&gt;Idempotency&lt;/li&gt;
&lt;li&gt;OLTP vs OLAP&lt;/li&gt;
&lt;li&gt;Columnar vs Row-based Storage&lt;/li&gt;
&lt;li&gt;Partitioning&lt;/li&gt;
&lt;li&gt;ETL vs ELT&lt;/li&gt;
&lt;li&gt;CAP Theorem&lt;/li&gt;
&lt;li&gt;Windowing in Streaming&lt;/li&gt;
&lt;li&gt;DAGs and Workflow Orchestration&lt;/li&gt;
&lt;li&gt;Retry Logic &amp;amp; Dead Letter Queues&lt;/li&gt;
&lt;li&gt;Backfilling &amp;amp; Reprocessing&lt;/li&gt;
&lt;li&gt;Data Governance&lt;/li&gt;
&lt;li&gt;Time Travel &amp;amp; Data Versioning&lt;/li&gt;
&lt;li&gt;Distributed Processing Concepts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  1. BATCH VS. STREAMING INGESTION
&lt;/h4&gt;

&lt;p&gt;Batch and streaming ingestion are two distinct methods of loading data. Batch ingestion processes collected data in large sets at scheduled intervals, while streaming ingestion processes data in real time as it is generated.&lt;/p&gt;

&lt;p&gt;Batch ingestion is most commonly seen in processing sales reports or generating monthly bank statements.&lt;/p&gt;

&lt;p&gt;Batch ingestion is favourable when processing large volumes of historical data; its inherent latency makes it unsuitable for real-time applications.&lt;/p&gt;

&lt;p&gt;Stream ingestion, on the other hand, processes data in real time, handling each record individually. It is therefore suitable for real-time insights.&lt;/p&gt;

&lt;p&gt;With stream ingestion, latency is low to non-existent, allowing immediate action based on incoming data.&lt;/p&gt;

&lt;p&gt;Stream ingestion can be seen in live match scores, smart-home thermostats, and car parking sensors.&lt;br&gt;
Factors to consider when choosing between batch and stream ingestion include:&lt;/p&gt;

&lt;p&gt;-Latency&lt;br&gt;
-Volume of the data&lt;br&gt;
-Source of the data&lt;br&gt;
-Cost&lt;/p&gt;
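&lt;p&gt;A toy sketch of the two styles (hypothetical in-memory events, not a real ingestion framework):&lt;/p&gt;

```python
# Hypothetical events; a real system would read from files or a message broker.
events = [{"id": 1, "amount": 40}, {"id": 2, "amount": 25}, {"id": 3, "amount": 35}]

def batch_ingest(batch):
    # Batch: process the whole collected set at a scheduled interval.
    return sum(e["amount"] for e in batch)

def stream_ingest(event, running_total):
    # Streaming: handle each event individually as it arrives.
    return running_total + event["amount"]

print(batch_ingest(events))   # one scheduled run over all events

total = 0
for e in events:              # each event processed on arrival
    total = stream_ingest(e, total)
print(total)                  # same answer, arrived at incrementally
```

&lt;p&gt;Both paths produce the same total; the difference is when the work happens and how quickly each result is available.&lt;/p&gt;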

&lt;h4&gt;
  
  
  2. CHANGE DATA CAPTURE (CDC)
&lt;/h4&gt;

&lt;p&gt;Change data capture, or CDC, is a technique for identifying and recording data changes in a database. CDC delivers these changes in real-time to different target systems, enabling the synchronization of data across an organization immediately after a database change occurs.&lt;/p&gt;

&lt;p&gt;Change data capture is a method of real-time data integration. It works by identifying and recording change events taking place in various data sources; these changes are then transferred in real time to target systems.&lt;/p&gt;

&lt;p&gt;Common use cases of CDC include:&lt;/p&gt;

&lt;p&gt;-Fraud detection&lt;br&gt;
-Internet of Things enablement&lt;br&gt;
-Inventory and supply chain management&lt;br&gt;
-Regulatory compliance&lt;/p&gt;

&lt;p&gt;The common methods of CDC include:&lt;/p&gt;

&lt;p&gt;-Log-based CDC&lt;br&gt;
-Timestamp-based CDC&lt;br&gt;
-Trigger-based CDC&lt;/p&gt;
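&lt;p&gt;A minimal sketch of the log-based method, assuming hypothetical in-memory source and target tables:&lt;/p&gt;

```python
# Hypothetical in-memory "source table" and change log; not a real database.
source = {}
change_log = []   # every write to the source is also captured here

def write(key, value):
    source[key] = value
    change_log.append(("upsert", key, value))

def sync(target, log):
    # Replay the captured change events against a target system.
    for op, key, value in log:
        if op == "upsert":
            target[key] = value
    return target

write("user:1", "Ronny")
write("user:1", "Ronny M.")   # the update is captured as a second change event
replica = sync({}, change_log)
print(replica)   # {'user:1': 'Ronny M.'}
```

&lt;p&gt;Replaying the change log brings the replica in line with the source, which is the core idea behind CDC-based synchronization.&lt;/p&gt;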

&lt;p&gt;The benefits of CDC include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time decision making&lt;/li&gt;
&lt;li&gt;Successful cloud migration&lt;/li&gt;
&lt;li&gt;ETL process improvement&lt;/li&gt;
&lt;li&gt;Better AI performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. IDEMPOTENCY
&lt;/h4&gt;

&lt;p&gt;In data engineering, idempotency means that executing the same operation multiple times has the same effect as executing it once. This is crucial for building robust data pipelines: idempotency ensures that data remains consistent and accurate even after multiple identical operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importance of idempotency
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Data recovery and redundancy&lt;/li&gt;
&lt;li&gt;Maintaining consistency&lt;/li&gt;
&lt;li&gt;Batch processing&lt;/li&gt;
&lt;li&gt;Resilience and testing&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Achieving idempotency in data pipelines
&lt;/h3&gt;

&lt;p&gt;Idempotency can be achieved through the use of primary keys, upserts, deleting data before writing, staged data, distinguishing event time from ingest time, and logging and auditing.&lt;/p&gt;
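&lt;p&gt;A minimal sketch of the upsert approach, using a dict keyed on a hypothetical primary key:&lt;/p&gt;

```python
# Idempotent load via upsert: records are keyed on a primary key,
# so replaying the same batch has no additional effect.
target = {}

def idempotent_load(records):
    for rec in records:
        target[rec["id"]] = rec   # re-writing the same key changes nothing new

batch = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
idempotent_load(batch)
idempotent_load(batch)   # e.g. a pipeline retry replays the whole batch
print(len(target))       # 2 - the duplicate run left the target unchanged
```

&lt;p&gt;Contrast this with an append-only load, where the retry would have doubled the row count.&lt;/p&gt;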

&lt;h4&gt;
  
  
  4. OLTP VS OLAP
&lt;/h4&gt;

&lt;p&gt;Online Transaction Processing systems, commonly referred to as OLTP systems, are designed to handle the real-time operations that occur in day-to-day business activities.&lt;/p&gt;

&lt;p&gt;OLTP systems are primarily used to process individual business transactions in real time in institutions such as banks and e-commerce platforms. They focus on live data.&lt;/p&gt;

&lt;p&gt;Online Analytical Processing (OLAP) systems, on the other hand, are optimised for complex analysis, reporting, and business intelligence activities, such as financial reporting systems and market analysis tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Differences and Implications
&lt;/h3&gt;

&lt;p&gt;Transaction vs. Analysis: OLTP systems excel at processing individual transactions quickly and accurately, while OLAP systems specialize in analyzing patterns across large datasets.&lt;/p&gt;

&lt;p&gt;Data Freshness: OLTP systems work with real-time data, whereas OLAP systems typically work with data that may be hours or days old, depending on the ETL schedule.&lt;/p&gt;

&lt;p&gt;Concurrency Requirements: OLTP systems must handle many simultaneous users performing transactions, while OLAP systems typically serve fewer concurrent users running complex queries.&lt;/p&gt;

&lt;p&gt;Failure Impact: OLTP system downtime directly affects business operations, while OLAP system unavailability impacts reporting and analysis capabilities.&lt;/p&gt;
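&lt;p&gt;The contrast can be sketched with toy data: an OLTP-style point lookup against an OLAP-style aggregate (hypothetical orders, not a real database):&lt;/p&gt;

```python
# Hypothetical order records standing in for a database table.
orders = [
    {"id": 1, "month": "Jan", "total": 100},
    {"id": 2, "month": "Jan", "total": 250},
    {"id": 3, "month": "Feb", "total": 300},
]
by_id = {o["id"]: o for o in orders}   # index for fast point lookups

# OLTP-style: touch a single row for one transaction.
print(by_id[2]["total"])   # 250

# OLAP-style: scan the whole dataset to answer an analytical question.
monthly = {}
for o in orders:
    monthly[o["month"]] = monthly.get(o["month"], 0) + o["total"]
print(monthly)   # {'Jan': 350, 'Feb': 300}
```

&lt;p&gt;The point lookup touches one record; the aggregate reads every record, which is why the two workloads are served by differently optimised systems.&lt;/p&gt;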

&lt;h4&gt;
  
  
  5. COLUMNAR VS ROW BASED STORAGE
&lt;/h4&gt;

&lt;p&gt;Columnar storage organises data by columns, while row-based storage reads and writes data row by row.&lt;/p&gt;
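&lt;p&gt;A toy illustration of the two layouts holding the same records (hypothetical data):&lt;/p&gt;

```python
# The same three records in a row-based and a columnar layout.
rows = [("Ann", 34), ("Ben", 29), ("Cleo", 41)]

columns = {
    "name": ["Ann", "Ben", "Cleo"],
    "age": [34, 29, 41],
}

# Row storage reads whole records; good for transactional access.
print(rows[1])   # ('Ben', 29)

# Columnar storage reads just one column; good for analytics such as averages,
# since the other columns never need to be touched.
print(sum(columns["age"]) / len(columns["age"]))
```

&lt;p&gt;Analytical databases favour the columnar layout because an aggregate over one column skips all the others.&lt;/p&gt;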

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyj9mxweuh166eflmu0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foyj9mxweuh166eflmu0n.png" alt=" " width="633" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  6. PARTITIONING
&lt;/h4&gt;

&lt;p&gt;Data partitioning in data engineering is the process of dividing a large dataset into smaller, more manageable chunks called partitions. This technique is used to improve the performance, scalability, and manageability of data storage and processing. &lt;/p&gt;

&lt;p&gt;Data partitioning improves performance, enhances scalability, simplifies management, and optimises cost.&lt;/p&gt;

&lt;p&gt;Common partitioning methods include:&lt;br&gt;
-Range partitioning&lt;br&gt;
-Hash partitioning&lt;br&gt;
-List partitioning&lt;br&gt;
-Composite partitioning&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
E-commerce: Partitioning by date, region, or customer ID to optimize order processing, sales analysis, and customer support.&lt;br&gt;
Log Analysis: Partitioning by timestamp to analyze log data for specific time periods.&lt;br&gt;
Social Media: Partitioning by user ID or geographic location to optimize user-specific data access and social network analysis. &lt;/p&gt;
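&lt;p&gt;A minimal sketch of hash partitioning (hypothetical customer IDs; real systems use a stable hash function rather than Python's per-run hash()):&lt;/p&gt;

```python
# Hash partitioning sketch: route each record to one of N partitions.
# Real systems use a stable hash (e.g. murmur); Python's hash() varies per run.
NUM_PARTITIONS = 4
partitions = {i: [] for i in range(NUM_PARTITIONS)}

def route(record):
    p = hash(record["customer_id"]) % NUM_PARTITIONS   # pick the partition
    partitions[p].append(record)

for cid in ["c1", "c2", "c3", "c4", "c5"]:
    route({"customer_id": cid})

# All five records land somewhere, spread across the four partitions.
print(sum(len(bucket) for bucket in partitions.values()))   # 5
```

&lt;p&gt;The modulo on the hash spreads records evenly without any one partition needing to know about the others.&lt;/p&gt;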

&lt;h4&gt;
  
  
  7. ETL VS ELT
&lt;/h4&gt;

&lt;p&gt;The main difference between ELT and ETL lies in the order of data transformation. ETL (Extract, Transform, Load) transforms data before loading it into a data warehouse or target system. ELT (Extract, Load, Transform) loads data first, then transforms it within the target system.&lt;/p&gt;

&lt;p&gt;ETL (Extract, Transform, Load):&lt;/p&gt;

&lt;p&gt;Data is extracted from various sources. &lt;br&gt;
Data is transformed in a staging area, often outside the target data warehouse, using specialized tools. &lt;br&gt;
Transformed data is then loaded into the target system. &lt;br&gt;
ETL is well-suited for complex transformations and data cleaning, and is often used when data quality is a top priority. &lt;br&gt;
It can be beneficial for scenarios with stringent data security and compliance requirements. &lt;/p&gt;

&lt;p&gt;ELT (Extract, Load, Transform):&lt;/p&gt;

&lt;p&gt;Data is extracted from various sources. &lt;br&gt;
Extracted data is loaded directly into the target data warehouse or data lake without prior transformation. &lt;br&gt;
The transformation process happens within the target system using the processing power of the data warehouse or lake. &lt;br&gt;
ELT is often favored for its scalability and ability to handle large volumes of data, especially in cloud environments. &lt;br&gt;
It's particularly useful when dealing with unstructured data or when real-time analytics are needed. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtjsqjlkd4722963170s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtjsqjlkd4722963170s.png" alt=" " width="548" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In essence, ETL prioritizes data quality and upfront transformation, while ELT prioritizes speed and scalability, leveraging the power of modern data warehouses.&lt;/p&gt;
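&lt;p&gt;The order-of-operations difference can be sketched with toy records and a dict standing in for the warehouse:&lt;/p&gt;

```python
# Hypothetical raw records with whitespace, casing and duplicate problems.
raw = [" alice ", "BOB", " alice "]

def transform(records):
    # Clean and deduplicate - run before loading (ETL) or after (ELT).
    return sorted(set(r.strip().lower() for r in records))

# ETL: transform first, then load only the clean result.
etl_warehouse = {"users": transform(raw)}

# ELT: load the raw data first, transform later inside the target system.
elt_warehouse = {"staging": list(raw)}
elt_warehouse["users"] = transform(elt_warehouse["staging"])

print(etl_warehouse["users"])   # ['alice', 'bob']
print(etl_warehouse["users"] == elt_warehouse["users"])   # True
```

&lt;p&gt;Both patterns end with the same clean table; ELT simply keeps the raw staging copy inside the warehouse, which is useful when transformations need to be re-run later.&lt;/p&gt;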

&lt;h4&gt;
  
  
  8. CAP THEOREM
&lt;/h4&gt;

&lt;p&gt;The CAP theorem says that a distributed system can deliver only two of three desired characteristics:&lt;br&gt;
consistency, availability and partition tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP). It is also called Brewer's Theorem.&lt;/p&gt;

&lt;p&gt;Let’s take a detailed look at the three distributed system characteristics to which the CAP theorem refers.&lt;/p&gt;

&lt;p&gt;Consistency&lt;/p&gt;

&lt;p&gt;Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated to all the other nodes in the system before the write is deemed ‘successful.’&lt;/p&gt;

&lt;p&gt;Availability&lt;/p&gt;

&lt;p&gt;Availability means that any client making a request for data gets a response, even if one or more nodes are down. Another way to state this—all working nodes in the distributed system return a valid response for any request, without exception.&lt;/p&gt;

&lt;p&gt;Partition tolerance&lt;/p&gt;

&lt;p&gt;A partition is a communications break within a distributed system—a lost or temporarily delayed connection between two nodes. Partition tolerance means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.&lt;/p&gt;

&lt;h4&gt;
  
  
  9. WINDOWING IN STREAMING
&lt;/h4&gt;

&lt;p&gt;Windowing is used to divide a continuous data stream into smaller, finite chunks called streaming windows.&lt;/p&gt;

&lt;p&gt;Benefits and applications of streaming windows include&lt;/p&gt;

&lt;p&gt;They provide a way to process unbounded data incrementally, by breaking the stream into manageable, finite chunks. &lt;br&gt;
The structured nature of streaming windows makes it easier to identify and rectify errors or anomalies within specific time frames, enhancing data quality and reliability.&lt;br&gt;
By limiting the data volume that needs to be processed at any given time, streaming windows can help reduce computational load, leading to faster processing times and more efficient use of system resources.&lt;/p&gt;

&lt;p&gt;Additionally, streaming windows have numerous applications across various industries. For example, we can leverage them to:&lt;/p&gt;

&lt;p&gt;Detect patterns indicative of financial fraud.&lt;br&gt;
Monitor equipment performance to predict maintenance needs before failures occur.&lt;br&gt;
Streamline traffic flow by analyzing vehicle data streams for congestion patterns.&lt;br&gt;
Personalize online shopping experiences by recommending products based on real-time purchasing and clickstream data.&lt;br&gt;
Provide real-time statistics and performance metrics during live sports events.&lt;br&gt;
Analyze surveillance footage in real time to detect and respond to emergencies or public disturbances.&lt;/p&gt;

&lt;p&gt;Streaming window types include:&lt;br&gt;
-Tumbling windows&lt;br&gt;
-Hopping windows&lt;br&gt;
-Sliding windows&lt;br&gt;
-Session windows&lt;/p&gt;
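&lt;p&gt;A minimal sketch of tumbling windows, the simplest type (hypothetical timestamped events):&lt;/p&gt;

```python
# Tumbling windows: bucket events into fixed, non-overlapping 10-second windows.
events = [(1, 5), (4, 3), (11, 7), (12, 1), (25, 9)]   # (timestamp_sec, value)
WINDOW = 10

windows = {}
for ts, value in events:
    start = (ts // WINDOW) * WINDOW            # which window this event falls into
    windows[start] = windows.get(start, 0) + value   # incremental aggregate

print(windows)   # {0: 8, 10: 8, 20: 9}
```

&lt;p&gt;Each window's aggregate is finalised once its time range has passed, which is how an unbounded stream yields finite, answerable queries.&lt;/p&gt;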

&lt;h4&gt;
  
  
  10. DAGS AND WORKFLOW ORCHESTRATION
&lt;/h4&gt;

&lt;p&gt;A DAG is a way to represent a workflow as a graph where tasks are nodes and dependencies are directed edges, ensuring a specific order of execution without circular dependencies. Workflow orchestration tools, like Apache Airflow, utilize DAGs to automate and manage the execution of these workflows. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Directed Acyclic Graphs (DAGs):&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A DAG is a data structure that visualizes a workflow as a graph with nodes and edges.&lt;br&gt;
Directed: Edges have a direction, showing the flow of execution from one task to another.&lt;br&gt;
Acyclic: The graph cannot contain any cycles or loops, meaning a task cannot be executed multiple times due to circular dependencies.&lt;br&gt;
Nodes: Represent individual tasks or operations within the workflow.&lt;br&gt;
Edges: Represent dependencies between tasks, indicating which tasks must be completed before others can start. &lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Workflow Orchestration:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Purpose: Workflow orchestration manages the execution of tasks defined in a DAG, ensuring they run in the correct order and with the appropriate dependencies. &lt;br&gt;
Key Functions:&lt;br&gt;
Scheduling: Triggering workflows based on predefined schedules (e.g., daily, hourly). &lt;br&gt;
Task Execution: Running individual tasks on compute resources. &lt;br&gt;
Dependency Management: Ensuring tasks run only when their dependencies are met. &lt;br&gt;
Error Handling: Handling task failures and potentially retrying failed tasks. &lt;br&gt;
Monitoring and Logging: Tracking the progress of workflows and logging events. &lt;br&gt;
Examples of Orchestration Tools: Airflow, Argo, Google Cloud Composer, AWS Step Functions.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Why Use DAGs and Workflow Orchestration?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Automation:&lt;br&gt;
Automates complex data pipelines, reducing manual intervention and potential errors.&lt;br&gt;
Reliability:&lt;br&gt;
Ensures workflows execute reliably and consistently, even with complex dependencies.&lt;br&gt;
Scalability:&lt;br&gt;
Enables workflows to scale to handle large datasets and complex computations.&lt;br&gt;
Observability:&lt;br&gt;
Provides insights into workflow execution, allowing for monitoring and troubleshooting.&lt;br&gt;
Maintainability:&lt;br&gt;
DAGs and orchestration tools make it easier to manage and update workflows as requirements evolve.&lt;/p&gt;
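&lt;p&gt;A toy sketch of running a DAG's tasks in dependency order (orchestrators like Airflow add scheduling, retries, and monitoring on top of this idea):&lt;/p&gt;

```python
# A tiny DAG: each task maps to the list of tasks it depends on.
dag = {
    "extract": [],
    "transform": ["extract"],   # transform runs only after extract
    "load": ["transform"],
    "report": ["load"],
}

def run(dag):
    done, order = set(), []
    while len(order) != len(dag):
        for task, deps in dag.items():
            # A task is runnable once all of its dependencies have completed.
            if task not in done and all(d in done for d in deps):
                order.append(task)   # "execute" the task
                done.add(task)
    return order

print(run(dag))   # ['extract', 'transform', 'load', 'report']
```

&lt;p&gt;Because the graph is acyclic, this loop always terminates; a cycle would leave some task permanently waiting on itself.&lt;/p&gt;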

&lt;h4&gt;
  
  
  11. RETRY LOGIC AND DEAD LETTER QUEUES
&lt;/h4&gt;

&lt;p&gt;Retry logic and Dead Letter Queues (DLQs) are essential mechanisms in distributed systems and message-driven architectures for handling message processing failures and ensuring system reliability.&lt;/p&gt;

&lt;p&gt;Retry Logic:&lt;br&gt;
Retry logic involves re-attempting an operation or message processing when a transient error or temporary failure occurs. This is done with the expectation that the issue might be resolved upon subsequent attempts. &lt;/p&gt;

&lt;p&gt;Key aspects of retry logic include:&lt;/p&gt;

&lt;p&gt;Retry Attempts&lt;br&gt;
Backoff Strategy&lt;br&gt;
Error Classification&lt;/p&gt;

&lt;p&gt;Dead Letter Queues (DLQs):&lt;/p&gt;

&lt;p&gt;A Dead Letter Queue (DLQ) is a designated queue or storage location where messages that could not be successfully processed after exhausting all retry attempts are sent.&lt;/p&gt;

&lt;p&gt;The purpose of a DLQ is to: &lt;/p&gt;

&lt;p&gt;Isolate Problematic Messages&lt;br&gt;
Enable Manual Inspection and Debugging&lt;br&gt;
Facilitate Error Handling and Recovery&lt;/p&gt;
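&lt;p&gt;A minimal sketch of retry logic with exponential backoff and a dead letter queue (hypothetical handler and message):&lt;/p&gt;

```python
import time

# Failed messages are parked here after all retries are exhausted.
dead_letter_queue = []

def process_with_retry(message, handler, max_retries=3):
    for attempt in range(max_retries):
        try:
            return handler(message)
        except Exception:
            time.sleep(0.01 * (2 ** attempt))   # exponential backoff between attempts
    dead_letter_queue.append(message)           # retries exhausted: route to the DLQ
    return None

def always_fails(msg):
    # Stand-in for a handler hitting a non-transient error.
    raise ValueError("processing error")

process_with_retry({"id": 7}, always_fails)
print(dead_letter_queue)   # [{'id': 7}]
```

&lt;p&gt;The failing message no longer blocks the main queue; it sits in the DLQ for inspection and eventual reprocessing.&lt;/p&gt;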

&lt;h4&gt;
  
  
  12. BACKFILLING AND REPROCESSING
&lt;/h4&gt;

&lt;p&gt;In data engineering, backfilling refers to the process of retroactively loading or updating historical data in a data pipeline. It is used to fill gaps in historical data, correct errors, or initialise systems with historical records.&lt;/p&gt;

&lt;p&gt;Reprocessing involves re-running data pipelines for past dates, often to fix errors or apply changes.&lt;/p&gt;
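&lt;p&gt;A toy sketch of a backfill: re-running a hypothetical daily pipeline for each date in a past range:&lt;/p&gt;

```python
from datetime import date, timedelta

processed = []   # record of which daily runs were executed

def run_pipeline(run_date):
    # Stand-in for the real daily job (extract/transform/load for one day).
    processed.append(run_date.isoformat())

def backfill(start, end):
    # Re-run the daily pipeline once per date in the inclusive range.
    for i in range((end - start).days + 1):
        run_pipeline(start + timedelta(days=i))

backfill(date(2025, 8, 1), date(2025, 8, 3))
print(processed)   # ['2025-08-01', '2025-08-02', '2025-08-03']
```

&lt;p&gt;If the daily job is idempotent, a backfill over dates that were already processed is safe to re-run.&lt;/p&gt;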

&lt;h4&gt;
  
  
  13. DATA GOVERNANCE
&lt;/h4&gt;

&lt;p&gt;Data governance is a comprehensive framework that defines how an organization manages, protects, and derives value from its data assets. It encompasses the people, processes, policies, and technologies that ensure data is accurate, accessible, consistent, and secure throughout its lifecycle.&lt;/p&gt;

&lt;p&gt;The core objectives of data governance include:&lt;br&gt;
-Data quality management&lt;br&gt;
-Data security and privacy&lt;br&gt;
-Data stewardship&lt;br&gt;
-Regulatory compliance&lt;/p&gt;

&lt;p&gt;Data governance frameworks include;&lt;/p&gt;

&lt;p&gt;1. DAMA-DMBOK Framework&lt;br&gt;
2. IBM Data Governance Framework&lt;br&gt;
3. Microsoft Data Governance Framework&lt;br&gt;
4. Google Cloud Data Governance Framework&lt;br&gt;
5. Enterprise Data Governance Framework&lt;/p&gt;

&lt;h4&gt;
  
  
  14. TIME TRAVEL AND DATA VERSIONING
&lt;/h4&gt;

&lt;p&gt;Data Versioning and Time Travel in cloud data platforms allow users to access and recover previous versions of data, enabling point-in-time recovery and historical analysis. These features provide the ability to track changes, roll back to previous states, and query data as it existed at a specific moment in time. Data Versioning and Time Travel capabilities are valuable for compliance, auditing, and understanding data evolution in cloud-based data lakes and data warehouses.&lt;/p&gt;

&lt;p&gt;Data Versioning&lt;br&gt;
Data versioning is a critical aspect of data management in cloud computing. It allows for the tracking and control of changes made to data objects, facilitating data recovery and ensuring data integrity. Each version of a data object represents a snapshot of that object at a specific point in time, providing a historical record of the object's state.&lt;/p&gt;

&lt;p&gt;Versioning is particularly useful in scenarios where multiple users or applications are modifying the same data object. It allows for the resolution of conflicts and the prevention of data loss due to overwrites. Furthermore, versioning enables the rollback of changes, providing a safety net in case of errors or unwanted modifications.&lt;/p&gt;

&lt;p&gt;Time Travel&lt;br&gt;
Time travel in cloud computing refers to the ability to view and manipulate data as it existed at any point in the past. This is achieved by maintaining a historical record of all changes made to the data. Time travel allows for the recovery of lost data, the auditing of changes, and the analysis of data trends over time.&lt;/p&gt;

&lt;p&gt;Some cloud-based data platforms provide time travel as a built-in feature, allowing users to query past states of the data without the need for manual version management. This can be particularly useful in scenarios involving data analysis and auditing, where understanding the historical state of the data is crucial.&lt;/p&gt;
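&lt;p&gt;A minimal sketch of the idea: a versioned store that keeps a snapshot per write, so earlier states remain queryable (hypothetical data; platforms such as Snowflake and Delta Lake provide this natively):&lt;/p&gt;

```python
# Versioned key-value store sketch: every write appends a full snapshot,
# so any past version can still be queried ("time travel" by version number).
history = []   # one snapshot per write

def write(key, value):
    snapshot = dict(history[-1]) if history else {}
    snapshot[key] = value
    history.append(snapshot)   # the old snapshot is never mutated

def as_of(version):
    return history[version]    # the table as it existed at that version

write("balance", 100)
write("balance", 250)
print(as_of(0))    # {'balance': 100}
print(as_of(-1))   # {'balance': 250}  (latest)
```

&lt;p&gt;Real platforms store deltas rather than full copies for efficiency, but the queryable-history model is the same.&lt;/p&gt;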

&lt;h4&gt;
  
  
  15. DISTRIBUTED PROCESSING CONCEPTS
&lt;/h4&gt;

&lt;p&gt;Distributed processing executes parts of a task simultaneously across multiple resources, improving efficiency and performance, especially with large data. It enhances scalability, efficiency, and reliability.&lt;/p&gt;

&lt;p&gt;Distributed Processing is useful in various fields like machine learning, data mining, and large-scale simulations. It can process vast datasets with high speed and reliability. Big tech companies like Google, Amazon, and Facebook utilize distributed processing to deal with their massive data.&lt;/p&gt;
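&lt;p&gt;A toy map-reduce sketch of the idea: split the work into chunks, process each independently, and combine the partial results (a real cluster would run each chunk on a separate worker machine):&lt;/p&gt;

```python
# Map-reduce sketch: the sum of 1..100 computed in four independent chunks.
data = list(range(1, 101))

def split(seq, n):
    # Divide the dataset into n equal chunks for the "workers".
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]

def mapper(chunk):
    return sum(chunk)   # each worker handles only its own chunk

partials = [mapper(c) for c in split(data, 4)]   # the "map" phase
print(sum(partials))    # the "reduce" phase: 5050, same as a single pass
```

&lt;p&gt;Because the chunks are independent, the map phase parallelises cleanly; only the small reduce step needs all the results in one place.&lt;/p&gt;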

</description>
    </item>
    <item>
      <title>data warehouse</title>
      <dc:creator>Ronny Mwenda</dc:creator>
      <pubDate>Mon, 28 Jul 2025 04:57:28 +0000</pubDate>
      <link>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/data-warehouse-19ne</link>
      <guid>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/data-warehouse-19ne</guid>
      <description>&lt;h4&gt;
  
  
  what is a data warehouse?
&lt;/h4&gt;

&lt;p&gt;A Data Warehouse is a centralized repository designed specifically to store, manage, and analyze large volumes of historical and current data from various sources within an organization. &lt;/p&gt;

&lt;h4&gt;
  
  
  why do we need a data warehouse?
&lt;/h4&gt;

&lt;p&gt;A data warehouse is useful for cleaning, storing, and organising data. The data stored can be used to make better decisions, spot problems early, and even save time.&lt;/p&gt;

&lt;h4&gt;
  
  
  analogy of a data warehouse
&lt;/h4&gt;

&lt;p&gt;Think about Amazon—one of the biggest online stores in the world. Every second, millions of customers are browsing, buying, and reviewing products. Amazon collects a massive amount of data, including:&lt;/p&gt;

&lt;p&gt;What you search for&lt;/p&gt;

&lt;p&gt;What you add to your cart&lt;/p&gt;

&lt;p&gt;What you buy&lt;/p&gt;

&lt;p&gt;Your reviews and ratings&lt;/p&gt;

&lt;p&gt;Delivery times and locations&lt;/p&gt;

&lt;p&gt;All this data comes from different places—mobile apps, websites, warehouses, and even Alexa. To make sense of it all, Amazon uses a data warehouse.&lt;/p&gt;

&lt;p&gt;With this system, Amazon can:&lt;/p&gt;

&lt;p&gt;Recommend products based on your shopping habits&lt;/p&gt;

&lt;p&gt;Track popular items and restock them faster&lt;/p&gt;

&lt;p&gt;Detect fraud by spotting strange purchase patterns&lt;/p&gt;

&lt;p&gt;Personalize your experience, like showing deals based on your location or interests&lt;/p&gt;

&lt;p&gt;Without a data warehouse, organizing this much data in real time would be nearly impossible.&lt;/p&gt;

&lt;p&gt;Thus it is correct to say that data warehouses offer larger volume and scale than databases, better integration capabilities, a historical data focus, and structured data organisation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stores daily, real-time data&lt;/td&gt;
&lt;td&gt;Stores historical, analytical data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supports transactions&lt;/td&gt;
&lt;td&gt;Supports analysis and reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Used by apps and staff&lt;/td&gt;
&lt;td&gt;Used by analysts and managers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frequently updated&lt;/td&gt;
&lt;td&gt;Updated in batches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smaller in size&lt;/td&gt;
&lt;td&gt;Much larger in size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimized for speed&lt;/td&gt;
&lt;td&gt;Optimized for complex queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example: ATM withdrawal&lt;/td&gt;
&lt;td&gt;Example: 5-year transaction analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;At the end of the day, a data warehouse is just a smart way for companies to understand the stories their data is trying to tell. You might not see it, but it is helping make your everyday experiences smoother and faster.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache airflow and its use in data engineering.</title>
      <dc:creator>Ronny Mwenda</dc:creator>
      <pubDate>Mon, 21 Jul 2025 20:36:36 +0000</pubDate>
      <link>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/apache-airflow-and-its-use-in-data-engineering-54c1</link>
      <guid>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/apache-airflow-and-its-use-in-data-engineering-54c1</guid>
      <description>&lt;h4&gt;
  
  
  what is apache airflow
&lt;/h4&gt;

&lt;p&gt;Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web-based UI helps you visualize, manage, and debug your workflows. You can run Airflow in a variety of configurations — from a single process on your laptop to a distributed system capable of handling massive workloads.&lt;/p&gt;

&lt;p&gt;Its core features, like pipeline automation, dependency management, and scalability, make it a vital tool for data engineers.&lt;/p&gt;

&lt;h4&gt;
  
  
  core concepts of airflow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DAGS - A Directed Acyclic Graph(DAG), according to the official workflow documentation, is a model that encapsulates everything needed to execute a workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schedule: When the workflow should run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tasks: tasks are discrete units of work that are run on workers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Task Dependencies: The order and conditions under which tasks execute.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Callbacks: Actions to take when the entire workflow completes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
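&lt;p&gt;A toy model of these concepts (plain Python, not the Airflow API): a DAG with a schedule, tasks, dependencies, and a completion callback:&lt;/p&gt;

```python
# A miniature stand-in for an Airflow DAG, modelling the five core concepts.
log = []

dag = {
    "schedule": "@daily",                            # when the workflow should run
    "tasks": {"extract": [], "load": ["extract"]},   # task: its dependencies
    "on_complete": lambda: log.append("workflow done"),   # callback
}

def trigger(dag):
    # Run each task once all of its dependencies have completed.
    done = set()
    while len(done) != len(dag["tasks"]):
        for task, deps in dag["tasks"].items():
            if task not in done and all(d in done for d in deps):
                log.append(f"ran {task}")
                done.add(task)
    dag["on_complete"]()

trigger(dag)
print(log)   # ['ran extract', 'ran load', 'workflow done']
```

&lt;p&gt;Airflow's scheduler does essentially this, plus retries, backfills, and monitoring, with each task running on a worker.&lt;/p&gt;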

&lt;h4&gt;
  
  
  common uses of airflow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Automation of ETL pipelines&lt;/li&gt;
&lt;li&gt;Data validation and transformation tasks&lt;/li&gt;
&lt;li&gt;Scheduling data analytics reports&lt;/li&gt;
&lt;li&gt;Machine learning model training and deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  advantages of airflow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It is Python-based, enabling workflows to be written as code.&lt;/li&gt;
&lt;li&gt;Its web-based UI provides real-time monitoring and debugging capabilities.&lt;/li&gt;
&lt;li&gt;Separation of the web server and scheduler components allows for better resource allocation.&lt;/li&gt;
&lt;li&gt;Airflow is modular and extensible, enabling creation of custom operators and plugins.&lt;/li&gt;
&lt;li&gt;Airflow's scalability supports distributed execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  disadvantages of airflow
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It has a steep learning curve.&lt;/li&gt;
&lt;li&gt;Airflow isn't built for streaming data.&lt;/li&gt;
&lt;li&gt;Airflow can be complex to set up for beginners.&lt;/li&gt;
&lt;li&gt;Windows users can't run Airflow locally unless they use WSL.&lt;/li&gt;
&lt;li&gt;Debugging in Airflow can be time-consuming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite these disadvantages, Airflow remains a vital tool for data engineers, especially when paired with other tools such as Apache Kafka.&lt;/p&gt;


</description>
    </item>
    <item>
      <title>Classes in Python, a beginner's pov</title>
      <dc:creator>Ronny Mwenda</dc:creator>
      <pubDate>Mon, 21 Jul 2025 19:30:36 +0000</pubDate>
      <link>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/classes-in-python-a-beginners-pov-48p7</link>
      <guid>https://dev.to/ronny_mwenda_8b2e8cfe1fa9/classes-in-python-a-beginners-pov-48p7</guid>
      <description>&lt;p&gt;WHAT IS A CLASS&lt;/p&gt;

&lt;p&gt;A class is like a blueprint for creating objects, or an object constructor. It defines what data (variables) and actions (functions/methods) an object should have. Classes are a good example of object-oriented programming in Python.&lt;/p&gt;

&lt;p&gt;Classes are used for grouping code, to enable code reusability and scalability, and to facilitate the writing of clean code.&lt;/p&gt;

&lt;p&gt;When creating classes we use a special function called the &lt;u&gt;constructor method&lt;/u&gt;. This function gets called automatically when a new object is created, initialising the object's attributes.&lt;/p&gt;

&lt;p&gt;constructors are used to automatically set up object data, avoid repetitive code and ensure that every object starts with valid values.&lt;/p&gt;

&lt;p&gt;In Python, a class is created using the &lt;code&gt;class&lt;/code&gt; keyword. When creating a class it is advised to use the following naming conventions: snake_case for variables and functions, PascalCase for class names, and UPPER_SNAKE_CASE for constants. Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variable :  my_nephew&lt;/li&gt;
&lt;li&gt;function : my_nephew()&lt;/li&gt;
&lt;li&gt;class : MyNephew&lt;/li&gt;
&lt;li&gt;constants : MY_NEPHEW&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  syntax of a class with a constructor.
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class MyNephew:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def info_details(self):
        print(f"Hello, my name is {self.name} and I am {self.age} years old.")

my_nephew = MyNephew("John", 10)
my_nephew.info_details()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The code above creates a class called MyNephew, with a constructor (&lt;code&gt;__init__&lt;/code&gt;), parameters passed into the constructor (name, age), attributes assigned to the object (self.name, self.age), and an instance (my_nephew) created from the class. Calling the method info_details() makes the object use its stored data to print a personalised message through an f-string.&lt;/p&gt;

&lt;p&gt;Another fitting example of an instance where classes can be used is when creating a mock bank account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task: create a class called BankAccount.
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Attributes: account_holder, balance (default = 0).
# Methods: deposit(amount), withdraw(amount), show_balance().
class BankAccount:
    def __init__(self, account_holder):
        self.account_holder = account_holder
        self.balance = 0

    def deposit(self, amount):
        self.balance = self.balance + amount
        print(f"you have deposited {amount}.")
        print(f"Your new balance is {self.balance}.")

    def withdraw(self, amount):
        if amount &amp;lt;= self.balance:
            self.balance = self.balance - amount
            print(f"You have withdrawn {amount}.")
            print(f"Your new balance is {self.balance}.")
        else:
            print("Insufficient funds for this withdrawal.")

    def show_balance(self):
        print(f"{self.account_holder} Your current balance is {self.balance}.")

# Creating an account and testing all three methods.
myaccount = BankAccount("JOHN DOE")
myaccount.deposit(1000)
myaccount.withdraw(500)
myaccount.show_balance()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_cousin = MyNephew("Alice", 12)
my_cousin.info_details()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
