
Ronny Mwenda

15 major concepts of Data Engineering.

Data Engineering can be defined as designing, building, and maintaining infrastructure that allows organizations to collect, store, process, and analyze large volumes of data. It can be subdivided into 15 major concepts, namely:

  • Batch vs Streaming Ingestion
  • Change Data Capture (CDC)
  • Idempotency
  • OLTP vs OLAP
  • Columnar vs Row-based Storage
  • Partitioning
  • ETL vs ELT
  • CAP Theorem
  • Windowing in Streaming
  • DAGs and Workflow Orchestration
  • Retry Logic & Dead Letter Queues
  • Backfilling & Reprocessing
  • Data Governance
  • Time Travel & Data Versioning
  • Distributed Processing Concepts.

1. BATCH VS. STREAMING INGESTION

Batch and streaming ingestion are two distinct methods of loading data. Batch ingestion processes collected data in large sets at scheduled intervals, while streaming ingestion handles data as it is generated, in real time.

Batch ingestion is most commonly seen in processing sales reports or generating monthly bank statements.

Batch ingestion is best suited to processing large volumes of historical data; its higher latency makes it unsuitable for real-time applications.

Stream ingestion, on the other hand, processes data in real time, with each record handled individually. It is therefore suitable for real-time insights.

With stream ingestion the latency is low to non-existent, allowing immediate action to be taken based on incoming data.

Stream ingestion can be seen in streaming live match scores, thermostats in a smart home, and car parking sensors.

Factors to consider when choosing between batch and stream ingestion include (a minimal sketch contrasting the two approaches follows after this list):

- Latency
- Volume of the data
- Source of the data
- Cost
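
To make the contrast concrete, here is a minimal sketch of the two approaches in Python. The helper functions (load_files_for, write_to_warehouse, poll_event_stream) are hypothetical stand-ins for your own storage and queue clients, not a specific library.

```python
import time
from datetime import date

# Hypothetical helpers -- stand-ins for your own storage and queue clients.
def load_files_for(day):          # e.g. read yesterday's CSV exports
    return [{"order_id": 1, "amount": 9.99, "day": str(day)}]

def write_to_warehouse(records):  # e.g. bulk COPY into the warehouse
    print(f"loaded {len(records)} records")

def poll_event_stream():          # e.g. consume from a Kafka/Kinesis topic
    yield {"sensor": "lot-A", "occupied": True, "ts": time.time()}

# Batch ingestion: run once per schedule, load everything collected so far.
def run_nightly_batch():
    records = load_files_for(date.today())
    write_to_warehouse(records)

# Streaming ingestion: handle each event as soon as it arrives.
def run_stream_consumer():
    for event in poll_event_stream():
        write_to_warehouse([event])   # low latency, one event at a time

if __name__ == "__main__":
    run_nightly_batch()
    run_stream_consumer()
```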

2. CHANGE DATA CAPTURE (CDC)

Change data capture, or CDC, is a technique for identifying and recording data changes in a database. CDC delivers these changes in real-time to different target systems, enabling the synchronization of data across an organization immediately after a database change occurs.

Change data capture is a method of real-time data integration. It works by identifying and recording change events taking place in various data sources; these changes are then transferred in real time to target systems.

Common use cases of CDC include:

- Fraud detection
- Internet of Things enablement
- Inventory and supply chain management
- Regulatory compliance

The common methods of CDC include (a timestamp-based sketch follows after this list):

- Log-based CDC
- Timestamp-based CDC
- Trigger-based CDC
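
As an illustration, below is a minimal sketch of timestamp-based CDC, assuming a source table with a last_updated column (the table, column, and helper names are made up for the example). Log-based CDC would instead tail the database's write-ahead log, typically with a tool such as Debezium.

```python
import sqlite3
from datetime import datetime, timezone

# Timestamp-based CDC: repeatedly pull rows whose last_updated column is newer
# than the high-water mark recorded on the previous run.
def capture_changes(conn, last_sync):
    cur = conn.execute(
        "SELECT id, name, last_updated FROM customers WHERE last_updated > ?",
        (last_sync,),
    )
    return cur.fetchall()

def apply_to_target(rows):
    # In practice this would upsert into the warehouse or publish to a queue.
    for row in rows:
        print("replicating change:", row)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_updated TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Asha', '2024-01-02T10:00:00')")

    high_water_mark = "2024-01-01T00:00:00"
    changes = capture_changes(conn, high_water_mark)
    apply_to_target(changes)
    high_water_mark = datetime.now(timezone.utc).isoformat()  # advance the mark
```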

The benefits of CDC include:

  • Real-time decision making
  • Successful cloud migration
  • ETL process improvement
  • Better AI performance

3. IDEMPOTENCY

In data engineering, idempotency means that executing the same operation multiple times has the same effect as executing it once. This is crucial for building robust data pipelines: idempotency ensures that data remains consistent and accurate even after multiple identical operations.

Importance of idempotency

  1. Data recovery and redundancy
  2. Maintaining consistency
  3. Batch processing
  4. Resilience and testing

Achieving idempotency in data pipelines

Idempotency can be achieved through the use of primary keys, upserts, deleting data before writing, staged data, distinguishing event time from ingest time, and logging and auditing.
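
For instance, here is a minimal sketch of the upsert approach, assuming a recent SQLite (3.24+) and an illustrative orders table keyed by order_id: replaying the same batch leaves the table unchanged.

```python
import sqlite3

def idempotent_load(conn, orders):
    # Upsert keyed on the primary key: re-running the load with the same
    # input produces exactly the same final state (no duplicate rows).
    conn.executemany(
        """
        INSERT INTO orders (order_id, amount) VALUES (?, ?)
        ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
        """,
        orders,
    )

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    batch = [(1, 9.99), (2, 25.00)]
    idempotent_load(conn, batch)
    idempotent_load(conn, batch)  # replaying the same batch changes nothing
    print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # -> (2,)
```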

4. OLTP VS OLAP

Online Transaction Processing systems, commonly referred to as OLTP systems, are designed to handle the real-time operations that occur in day-to-day business activities.

OLTP systems are primarily used to process individual business transactions in real time in institutions such as banks and e-commerce platforms. They focus on live data.

Online Analytical Processing (OLAP) systems, on the other hand, are optimised for complex analysis, reporting, and business intelligence activities, such as financial reporting systems and market analysis tools.

Key Differences and Implications

Transaction vs. Analysis: OLTP systems excel at processing individual transactions quickly and accurately, while OLAP systems specialize in analyzing patterns across large datasets.

Data Freshness: OLTP systems work with real-time data, whereas OLAP systems typically work with data that may be hours or days old, depending on the ETL schedule.

Concurrency Requirements: OLTP systems must handle many simultaneous users performing transactions, while OLAP systems typically serve fewer concurrent users running complex queries.

Failure Impact: OLTP system downtime directly affects business operations, while OLAP system unavailability impacts reporting and analysis capabilities.
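
To illustrate the difference in workload shape, here is a small, self-contained sketch (the orders table and its columns are made up): OLTP work touches a single row by key inside a transaction, while OLAP work scans and aggregates history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, "
             "order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "EU", "2024-01-05", 20.0),
    (2, "EU", "2024-02-10", 35.0),
    (3, "US", "2024-02-11", 15.0),
])

# OLTP-style work: a small transaction that touches one row by key.
with conn:  # BEGIN ... COMMIT
    conn.execute("UPDATE orders SET amount = amount - 5 WHERE order_id = 2")

# OLAP-style work: scan and aggregate history for reporting.
report = conn.execute(
    "SELECT region, strftime('%Y-%m', order_date) AS month, SUM(amount) "
    "FROM orders GROUP BY region, month ORDER BY month"
).fetchall()
print(report)
```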

5. COLUMNAR VS ROW-BASED STORAGE

Columnar storage organises data by column, which suits analytical queries that scan a few columns across many rows and compresses well. Row-based storage reads and writes data row by row, which suits transactional workloads that access whole records at a time.
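
A minimal pure-Python sketch of the two layouts for the same three records may help; Parquet and ORC are common columnar formats, while CSV files and most OLTP tables are row-oriented.

```python
records = [
    {"id": 1, "name": "Asha",  "amount": 9.99},
    {"id": 2, "name": "Brian", "amount": 25.00},
    {"id": 3, "name": "Chen",  "amount": 4.50},
]

# Row-based layout: all the values for one record are stored together.
row_store = [(r["id"], r["name"], r["amount"]) for r in records]

# Columnar layout: all the values for one column are stored together, so an
# analytical query like SUM(amount) only has to read the "amount" block.
column_store = {
    "id":     [r["id"] for r in records],
    "name":   [r["name"] for r in records],
    "amount": [r["amount"] for r in records],
}

print(sum(column_store["amount"]))  # reads a single column
```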

6. PARTITIONING

Data partitioning in data engineering is the process of dividing a large dataset into smaller, more manageable chunks called partitions. This technique is used to improve the performance, scalability, and manageability of data storage and processing.

Data partitioning improves performance, enhances scalability, simplifies management, and optimises cost.

Common partitioning methods include (hash and date-range partitioning are illustrated in the sketch after the examples below):
- Range partitioning
- Hash partitioning
- List partitioning
- Composite partitioning

Examples:
E-commerce: Partitioning by date, region, or customer ID to optimize order processing, sales analysis, and customer support.
Log Analysis: Partitioning by timestamp to analyze log data for specific time periods.
Social Media: Partitioning by user ID or geographic location to optimize user-specific data access and social network analysis.
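
As a toy illustration of two of the methods above, here is a minimal sketch of hash partitioning and date-range partitioning in plain Python; engines such as Hive, Spark, or partitioned Postgres tables apply the same idea at the storage layer.

```python
from collections import defaultdict
from datetime import date

events = [
    {"user_id": 101, "ts": date(2024, 1, 3),  "amount": 10},
    {"user_id": 205, "ts": date(2024, 1, 3),  "amount": 7},
    {"user_id": 101, "ts": date(2024, 2, 14), "amount": 3},
]

# Hash partitioning: spread rows evenly across N buckets by key.
NUM_PARTITIONS = 4
hash_partitions = defaultdict(list)
for e in events:
    hash_partitions[hash(e["user_id"]) % NUM_PARTITIONS].append(e)

# Range (date) partitioning: group rows by month so queries for one
# period only touch one partition.
range_partitions = defaultdict(list)
for e in events:
    range_partitions[e["ts"].strftime("%Y-%m")].append(e)

print(dict(range_partitions).keys())  # dict_keys(['2024-01', '2024-02'])
```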

7. ETL VS ELT

The main difference between ELT and ETL lies in the order of data transformation. ETL (Extract, Transform, Load) transforms data before loading it into a data warehouse or target system. ELT (Extract, Load, Transform) loads data first, then transforms it within the target system.

ETL (Extract, Transform, Load):

Data is extracted from various sources.
Data is transformed in a staging area, often outside the target data warehouse, using specialized tools.
Transformed data is then loaded into the target system.
ETL is well-suited for complex transformations and data cleaning, and is often used when data quality is a top priority.
It can be beneficial for scenarios with stringent data security and compliance requirements.

ELT (Extract, Load, Transform):

Data is extracted from various sources.
Extracted data is loaded directly into the target data warehouse or data lake without prior transformation.
The transformation process happens within the target system using the processing power of the data warehouse or lake.
ELT is often favored for its scalability and ability to handle large volumes of data, especially in cloud environments.
It's particularly useful when dealing with unstructured data or when real-time analytics are needed.

In essence, ETL prioritizes data quality and upfront transformation, while ELT prioritizes speed and scalability, leveraging the power of modern data warehouses.
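
Here is a minimal sketch of the two patterns, with hypothetical extract/load helpers standing in for real connectors and a placeholder for SQL run inside the warehouse:

```python
def extract():
    return [{"email": " USER@EXAMPLE.COM ", "amount": "19.99"}]

def transform(rows):
    # Clean and type the data before it reaches the warehouse.
    return [{"email": r["email"].strip().lower(), "amount": float(r["amount"])}
            for r in rows]

def load(rows, table):
    print(f"loading {len(rows)} rows into {table}")

# ETL: transform in a staging step, then load the clean result.
def run_etl():
    load(transform(extract()), table="analytics.orders")

# ELT: load the raw data first, then transform inside the warehouse,
# typically with SQL executed by the warehouse engine itself.
def run_elt(warehouse_sql=lambda q: print("running in warehouse:", q)):
    load(extract(), table="raw.orders")
    warehouse_sql("CREATE TABLE analytics.orders AS "
                  "SELECT LOWER(TRIM(email)) AS email, CAST(amount AS FLOAT) AS amount "
                  "FROM raw.orders")

if __name__ == "__main__":
    run_etl()
    run_elt()
```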

8. CAP THEOREM

The CAP theorem says that a distributed system can deliver only two of three desired characteristics: consistency, availability and partition tolerance (the ‘C’, ‘A’ and ‘P’ in CAP). It is also called Brewer's theorem.

Let’s take a detailed look at the three distributed system characteristics to which the CAP theorem refers.

Consistency

Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated to all the other nodes in the system before the write is deemed ‘successful.’

Availability

Availability means that any client making a request for data gets a response, even if one or more nodes are down. Another way to state this—all working nodes in the distributed system return a valid response for any request, without exception.

Partition tolerance

A partition is a communications break within a distributed system—a lost or temporarily delayed connection between two nodes. Partition tolerance means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.

9. WINDOWING IN STREAMING

Windowing is used to divide a continuous data stream into smaller, finite chunks called streaming windows.

Benefits and applications of streaming windows include:

They provide a way to process unbounded data incrementally, by breaking the stream into manageable, finite chunks.
The structured nature of streaming windows makes it easier to identify and rectify errors or anomalies within specific time frames, enhancing data quality and reliability.
By limiting the data volume that needs to be processed at any given time, streaming windows can help reduce computational load, leading to faster processing times and more efficient use of system resources.

Additionally, streaming windows have numerous applications across various industries. For example, we can leverage them to:

Detect patterns indicative of financial fraud.
Monitor equipment performance to predict maintenance needs before failures occur.
Streamline traffic flow by analyzing vehicle data streams for congestion patterns.
Personalize online shopping experiences by recommending products based on real-time purchasing and clickstream data.
Provide real-time statistics and performance metrics during live sports events.
Analyze surveillance footage in real time to detect and respond to emergencies or public disturbances.

Streaming window types include (a tumbling-window sketch follows below):
- Tumbling windows
- Hopping windows
- Sliding windows
- Session windows
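
For example, here is a minimal tumbling-window sketch in plain Python, bucketing events into fixed, non-overlapping 60-second windows (the window size and event data are illustrative):

```python
from collections import defaultdict

# Tumbling windows: fixed-size, non-overlapping buckets. Each event falls into
# exactly one window based on its event timestamp (in seconds).
WINDOW_SECONDS = 60

events = [
    {"ts": 5,   "score": 1},
    {"ts": 42,  "score": 2},
    {"ts": 65,  "score": 1},
    {"ts": 130, "score": 3},
]

windows = defaultdict(list)
for event in events:
    window_start = (event["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(event["score"])

for start, scores in sorted(windows.items()):
    print(f"window [{start}, {start + WINDOW_SECONDS}) -> total score {sum(scores)}")
# window [0, 60)    -> total score 3
# window [60, 120)  -> total score 1
# window [120, 180) -> total score 3
```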

10. DAGS AND WORKFLOW ORCHESTRATION

A DAG is a way to represent a workflow as a graph where tasks are nodes and dependencies are directed edges, ensuring a specific order of execution without circular dependencies. Workflow orchestration tools, like Apache Airflow, utilize DAGs to automate and manage the execution of these workflows.

  1. Directed Acyclic Graphs (DAGs):

A DAG is a data structure that visualizes a workflow as a graph with nodes and edges.
Directed: Edges have a direction, showing the flow of execution from one task to another.
Acyclic: The graph cannot contain any cycles or loops, meaning a task cannot be executed multiple times due to circular dependencies.
Nodes: Represent individual tasks or operations within the workflow.
Edges: Represent dependencies between tasks, indicating which tasks must be completed before others can start.

  2. Workflow Orchestration:

Purpose: Workflow orchestration manages the execution of tasks defined in a DAG, ensuring they run in the correct order and with the appropriate dependencies.
Key Functions:
Scheduling: Triggering workflows based on predefined schedules (e.g., daily, hourly).
Task Execution: Running individual tasks on compute resources.
Dependency Management: Ensuring tasks run only when their dependencies are met.
Error Handling: Handling task failures and potentially retrying failed tasks.
Monitoring and Logging: Tracking the progress of workflows and logging events.
Examples of Orchestration Tools: Airflow, Argo, Google Cloud Composer, AWS Step Functions.
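
As an example, here is a minimal Apache Airflow DAG sketch: three tasks with directed, acyclic dependencies. The DAG id, schedule, and task bodies are illustrative, and some parameter names (e.g. schedule vs schedule_interval) vary between Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting")

def transform():
    print("transforming")

def load():
    print("loading")

with DAG(
    dag_id="daily_sales_pipeline",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Directed edges: extract must finish before transform, transform before load.
    t_extract >> t_transform >> t_load
```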

  3. Why Use DAGs and Workflow Orchestration?

Automation:
Automates complex data pipelines, reducing manual intervention and potential errors.
Reliability:
Ensures workflows execute reliably and consistently, even with complex dependencies.
Scalability:
Enables workflows to scale to handle large datasets and complex computations.
Observability:
Provides insights into workflow execution, allowing for monitoring and troubleshooting.
Maintainability:
DAGs and orchestration tools make it easier to manage and update workflows as requirements evolve.

11. RETRY LOGIC AND DEAD LETTER QUEUES

Retry logic and Dead Letter Queues (DLQs) are essential mechanisms in distributed systems and message-driven architectures for handling message processing failures and ensuring system reliability.

Retry Logic:
Retry logic involves re-attempting an operation or message processing when a transient error or temporary failure occurs. This is done with the expectation that the issue might be resolved upon subsequent attempts.

Key aspects of retry logic include:

Retry Attempts
Backoff Strategy
Error Classification

Dead Letter Queues (DLQs):

A Dead Letter Queue (DLQ) is a designated queue or storage location where messages that could not be successfully processed after exhausting all retry attempts are sent.

The purpose of a DLQ is to:

Isolate Problematic Messages
Enable Manual Inspection and Debugging
Facilitate Error Handling and Recovery
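
Putting the two together, here is a minimal sketch of retry logic with exponential backoff that dead-letters a message once retries are exhausted; the in-memory list stands in for a real DLQ such as an SQS queue or a Kafka topic, and the flaky handler is simulated.

```python
import random
import time

dead_letter_queue = []  # stand-in for a real DLQ (e.g. an SQS queue or Kafka topic)

def process(message):
    # Hypothetical handler that fails intermittently (a transient error).
    if random.random() < 0.5:
        raise ConnectionError("temporary downstream failure")
    print("processed:", message)

def handle_with_retries(message, max_attempts=3, base_delay=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            process(message)
            return True
        except ConnectionError as exc:
            if attempt < max_attempts:
                # Exponential backoff between attempts: 0.1s, 0.2s, 0.4s, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
            else:
                # Retries exhausted: park the message for manual inspection.
                dead_letter_queue.append({"message": message, "error": str(exc)})
    return False

if __name__ == "__main__":
    for msg in ["order-1", "order-2", "order-3"]:
        handle_with_retries(msg)
    print("dead-lettered:", dead_letter_queue)
```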

12. BACKFILLING AND REPROCESSING

In data engineering, backfilling refers to the process of retroactively loading or updating historical data in a data pipeline. It is used to fill gaps in historical data, correct errors, or initialise systems with historical records.

Reprocessing involves re-running data pipelines for past dates, often to fix errors or apply changes.
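
A minimal sketch of a backfill loop, assuming an idempotent daily job (the pipeline function here is a placeholder):

```python
from datetime import date, timedelta

def run_pipeline_for(day):
    # Stand-in for the real daily job; printing keeps the sketch runnable.
    print(f"processing partition {day.isoformat()}")

def backfill(start, end):
    # Re-run the daily pipeline for every historical date in the range.
    # Because the job is idempotent, re-running an already-loaded day is safe.
    day = start
    while day <= end:
        run_pipeline_for(day)
        day += timedelta(days=1)

if __name__ == "__main__":
    backfill(date(2024, 1, 1), date(2024, 1, 7))
```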

13. DATA GOVERNANCE

Data governance is a comprehensive framework that defines how an organization manages, protects, and derives value from its data assets. It encompasses the people, processes, policies, and technologies that ensure data is accurate, accessible, consistent, and secure throughout its lifecycle.

The core objectives of data governance include:
- Data quality management
- Data security and privacy
- Data stewardship
- Reducing compliance risk

Data governance frameworks include:

1. DAMA-DMBOK Framework
2. IBM Data Governance Framework
3. Microsoft Data Governance Framework
4. Google Cloud Data Governance Framework
5. Enterprise Data Governance Framework

14. TIME TRAVEL AND DATA VERSIONING

Data Versioning and Time Travel in cloud data platforms allow users to access and recover previous versions of data, enabling point-in-time recovery and historical analysis. These features provide the ability to track changes, roll back to previous states, and query data as it existed at a specific moment in time. Data Versioning and Time Travel capabilities are valuable for compliance, auditing, and understanding data evolution in cloud-based data lakes and data warehouses.

Data Versioning
Data versioning is a critical aspect of data management in cloud computing. It allows for the tracking and control of changes made to data objects, facilitating data recovery and ensuring data integrity. Each version of a data object represents a snapshot of that object at a specific point in time, providing a historical record of the object's state.

Versioning is particularly useful in scenarios where multiple users or applications are modifying the same data object. It allows for the resolution of conflicts and the prevention of data loss due to overwrites. Furthermore, versioning enables the rollback of changes, providing a safety net in case of errors or unwanted modifications.

Time Travel
Time travel in cloud computing refers to the ability to view and manipulate data as it existed at any point in the past. This is achieved by maintaining a historical record of all changes made to the data. Time travel allows for the recovery of lost data, the auditing of changes, and the analysis of data trends over time.

Some cloud-based data platforms provide time travel as a built-in feature, allowing users to query past states of the data without the need for manual version management. This can be particularly useful in scenarios involving data analysis and auditing, where understanding the historical state of the data is crucial.
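
As a toy illustration of the idea, here is a minimal in-memory versioned table with version- and timestamp-based reads; platforms such as Snowflake (AT/BEFORE clauses) and Delta Lake (versionAsOf) expose the same capability natively, with syntax that differs by platform.

```python
from datetime import datetime, timezone

class VersionedTable:
    """Toy versioned store: every write appends a new immutable version."""

    def __init__(self):
        self._versions = []  # list of (version_number, timestamp, rows)

    def write(self, rows):
        ts = datetime.now(timezone.utc)
        self._versions.append((len(self._versions), ts, list(rows)))

    def read_latest(self):
        return self._versions[-1][2]

    def read_version(self, version):
        # "Time travel" by version number.
        return self._versions[version][2]

    def read_as_of(self, ts):
        # "Time travel" by timestamp: latest version written at or before ts.
        candidates = [v for v in self._versions if v[1] <= ts]
        return candidates[-1][2] if candidates else []

table = VersionedTable()
table.write([{"id": 1, "status": "pending"}])
table.write([{"id": 1, "status": "shipped"}])

print(table.read_latest())    # current state
print(table.read_version(0))  # roll back to / audit an earlier state
```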

15. DISTRIBUTED PROCESSING CONCEPTS

Distributed processing executes parts of a task simultaneously across multiple resources, improving efficiency and performance, especially with large data. It enhances scalability, efficiency, and reliability.

Distributed Processing is useful in various fields like machine learning, data mining, and large-scale simulations. It can process vast datasets with high speed and reliability. Big tech companies like Google, Amazon, and Facebook utilize distributed processing to deal with their massive data.
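
For example, here is a minimal sketch of the split-process-combine pattern using Python's multiprocessing module: the data is partitioned, each worker computes a partial result in parallel, and the partial results are combined, mirroring the map/reduce model used by distributed frameworks.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker processes its own partition of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    num_workers = 4
    chunk_size = len(data) // num_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # "Map" step runs in parallel across processes; the "reduce" step combines
    # the partial results into the final answer.
    with Pool(num_workers) as pool:
        partials = pool.map(partial_sum, chunks)

    print(sum(partials))  # 499999500000
```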
