Prudence Waithira

Data Engineering Core Concepts

A) Batch vs Streaming Ingestion

Different approaches to bringing data into a system.
Batch – data is ingested in batches collected over a period of time and processed in a single operation.
E.g. processing historical data for trend analysis.

Streaming – data is ingested continuously as it arrives, in real time, and processed as it is generated.
E.g. personalized recommendations, or monitoring sensor data from IoT devices.
The choice between them depends on factors like data volume, latency requirements, and the nature of the data source.

Key Considerations When Choosing:
• Data Volume: Batch processing is generally preferred for large datasets, while streaming is better suited to smaller, continuous streams.
• Latency Requirements: If real-time analysis is crucial, streaming is the way to go. If latency is not a major concern, batch processing may be sufficient.
• Data Source: Batch ingestion is often used for sources that generate data in batches, while streaming is better for continuous sources like sensors or user activity logs.
• Complexity: Batch processing is generally simpler to implement and manage than streaming, which can introduce complexities when dealing with stateful operations.
• Cost: Real-time processing may require more powerful hardware and infrastructure.
• Data Consistency: Streaming systems may need to handle out-of-order or late-arriving data, which can impact data consistency.
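
A minimal Python sketch of the difference, assuming a hypothetical `process()` step and toy data sources; real pipelines would read from files/warehouses (batch) or message brokers (streaming):

```python
import time
from typing import Dict, Iterable

def process(record: Dict) -> None:
    """Hypothetical downstream step (validate, transform, load)."""
    print("processed:", record)

def batch_ingest(records: Iterable[Dict]) -> None:
    """Batch: the whole dataset is available up front and handled in one run."""
    for record in list(records):
        process(record)

def stream_ingest(source: Iterable[Dict]) -> None:
    """Streaming: each record is handled as it arrives; latency is per record."""
    for record in source:
        process(record)

if __name__ == "__main__":
    batch_ingest([{"id": 1}, {"id": 2}])

    def fake_stream():
        for i in range(3):
            time.sleep(0.1)        # stand-in for events arriving over time
            yield {"id": i}

    stream_ingest(fake_stream())
```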

B) Change Data Capture (CDC)

A data integration pattern that identifies and tracks changes made to data in a source system and then delivers those changes to a target system.
It focuses on capturing inserts, updates, and deletes, enabling real-time or near real-time data synchronization and minimizing latency compared to traditional batch processing.
E.g. Debezium.

What it is:
• CDC identifies and captures changes made to data in a database or other data source.
• These changes are typically captured as a stream of events, often referred to as a CDC feed.
• The captured changes are then propagated to one or more target systems.

Why CDC:

  • Real-time data integration
  • Reduced latency
  • Improved data consistency
  • Efficiency

Typical flow:

Application backend (mutation operations) -> Database -> Kafka message (emitted from the database's change log) -> target systems (where the change is applied)

Use Cases
• Replicate data into other databases
• Stream processing based on data changes, e.g. reacting when customer info changes

E.g. when you change the name on your Google account, the change is captured immediately and propagated to downstream systems in near real time through stream processing.
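
A minimal sketch of applying a Debezium-style change event to a target store. The event shape is simplified (real payloads carry more metadata) and the in-memory `replica` dict is a stand-in for a real target system:

```python
# Simplified Debezium-style change event; real payloads carry more metadata.
change_event = {
    "op": "u",                                   # c = insert, u = update, d = delete
    "before": {"id": 42, "name": "Alice"},
    "after":  {"id": 42, "name": "Alicia"},
}

# Hypothetical target system: an in-memory replica keyed by primary key.
replica = {42: {"id": 42, "name": "Alice"}}

def apply_change(event: dict, target: dict) -> None:
    """Apply a single CDC event to the target."""
    if event["op"] in ("c", "u"):                # insert or update -> upsert the new row image
        row = event["after"]
        target[row["id"]] = row
    elif event["op"] == "d":                     # delete -> remove the old row image
        target.pop(event["before"]["id"], None)

apply_change(change_event, replica)
print(replica)                                   # {42: {'id': 42, 'name': 'Alicia'}}
```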

C) Idempotency

-- Same input = Same output
-- No Side effects.
-- Focus on final state.
i.e. a data operation repeated multiple times with the same input produces the same result every single time.
E.g. upsert operations: an INSERT ... ON CONFLICT DO UPDATE with the same input will produce the same final state.
-- In modern data architectures, idempotency guarantees that pipeline operations produce identical results whether executed once or multiple times. This property becomes essential when dealing with distributed systems, streaming data, and fault-tolerant architectures where retries are not just possible but necessary for system reliability.
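
A small sketch of an idempotent upsert using SQLite's `INSERT ... ON CONFLICT` syntax (the `users` table and columns are made up for illustration; a real pipeline would target Postgres or a warehouse):

```python
import sqlite3

# In-memory database for the sketch; a real pipeline would target Postgres, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def upsert_user(user_id: int, name: str) -> None:
    """Idempotent write: running it once or N times leaves the same final state."""
    conn.execute(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (user_id, name),
    )
    conn.commit()

# Simulate a retry: the same input applied twice.
upsert_user(1, "Ada")
upsert_user(1, "Ada")

print(conn.execute("SELECT * FROM users").fetchall())   # [(1, 'Ada')] -- still one row
```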

D) OLTP vs OLAP

OLAP (Online Analytical Processing) – focuses on the analysis of historical data.
OLTP (Online Transaction Processing) – focuses on handling day-to-day transactional workloads (inserts, updates, deletes).

Key Differences in a Table:

| Feature | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Transaction processing | Analytical processing |
| Data Volume | Smaller, frequently changing | Larger, historical data |
| Typical Operations | Inserts, updates, deletes | Aggregations, complex queries |
| Data Model | Normalized relational | Multidimensional (star, snowflake) |
| Performance | Low latency, high throughput | Optimized for complex queries |
| Examples | Banking, e-commerce | Data warehousing, business intelligence |
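
For illustration, two hypothetical queries that show the contrast (table and column names are made up):

```python
# OLTP: a short transaction touching a handful of rows, with low latency.
oltp_query = """
UPDATE accounts
SET balance = balance - 100
WHERE account_id = 12345;
"""

# OLAP: an aggregation scanning large volumes of historical data.
olap_query = """
SELECT region, DATE_TRUNC('month', order_date) AS month, SUM(amount) AS revenue
FROM orders
GROUP BY region, DATE_TRUNC('month', order_date);
"""
```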

E) Columnar vs Row-based

Row-based databases → transactional processing.
Row databases are commonly used for Online Transaction Processing (OLTP), where a single "transaction," such as an insert, delete, or update, can be performed quickly using small amounts of data.

Common row oriented databases:
• Postgres
• MySQL

Column-based → analytical processing. Stores column data in blocks. Great for OLAP.

Common column oriented databases:
• Redshift
• BigQuery
• Snowflake
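
A toy Python illustration of the two layouts (real engines store compressed blocks on disk, but the access pattern is the same idea):

```python
# The same three records laid out row-wise vs column-wise (toy illustration).
rows = [
    {"id": 1, "name": "Ada",    "amount": 100},
    {"id": 2, "name": "Grace",  "amount": 250},
    {"id": 3, "name": "Edsger", "amount": 75},
]

# Row-based layout: each record's fields are stored together, so reading or
# updating one whole record is cheap (OLTP-style point lookups and writes).
row_store = rows

# Column-based layout: each column's values are stored together, so an
# aggregate over one column only touches that column's block (OLAP-style scans).
column_store = {
    "id":     [r["id"] for r in rows],
    "name":   [r["name"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# "SELECT SUM(amount)" reads one block instead of every record.
print(sum(column_store["amount"]))   # 425
```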

F) Partitioning

Data partitioning is a technique for dividing large datasets into smaller, manageable chunks called partitions. Each partition contains a subset of data and is distributed across multiple nodes or servers. These partitions can be stored, queried, and managed as individual tables, though they logically belong to the same dataset.
TYPES
i. Horizontal Partitioning (Row-based) – also called sharding in distributed systems; splits tables by rows so every partition has the same columns but different records. Each partition contains a subset of the entire dataset. For example, a sales table could be partitioned by year, with each partition containing sales data for a specific year.
• Range Partitioning
• Hash Partitioning
• List Partitioning

ii. Vertical Partitioning (Column-based) – dividing a table's columns into separate partitions so queries read only the data they need. For example, a user table might be split into one partition with user profile information and another with billing information.

iii. Functional Partitioning - Dividing data based on how it is used by different parts of an application. For example, data related to customer orders might be separated from data related to product inventory.

USE CASES

  • OLAP operations
  • Machine Learning pipelines
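
A toy sketch of horizontal partitioning, showing a list/range-style split by year and a hash split across a fixed number of buckets (the `sales` rows and column names are made up):

```python
from collections import defaultdict

sales = [
    {"order_id": 1, "year": 2023, "amount": 120},
    {"order_id": 2, "year": 2024, "amount": 80},
    {"order_id": 3, "year": 2024, "amount": 200},
]

# List/range-style horizontal partitioning: split rows by the value of a key
# column, e.g. one partition per year.
by_year = defaultdict(list)
for row in sales:
    by_year[row["year"]].append(row)

# Hash partitioning: spread rows evenly across a fixed number of buckets/nodes.
NUM_PARTITIONS = 4
by_hash = defaultdict(list)
for row in sales:
    by_hash[hash(row["order_id"]) % NUM_PARTITIONS].append(row)

print(sorted(by_year))        # [2023, 2024] -- one partition per year
```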

G) ETL vs ELT

• ETL (Extract, Transform, Load):
  • Data is extracted from its source.
  • Data is transformed into a usable format, often in a staging area.
  • The transformed data is then loaded into the data warehouse.
• ELT (Extract, Load, Transform):
  • Data is extracted from its source.
  • The raw, untransformed data is loaded directly into the data warehouse.
  • Data transformation happens within the data warehouse, using its processing power.
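
A minimal sketch of the two orderings, with hypothetical `extract`/`transform`/`load` helpers standing in for real I/O against sources and a warehouse:

```python
# Hypothetical helpers; each would be real I/O in production.
def extract():
    return [{"name": " ada ", "amount": "100"}]

def transform(rows):
    return [{"name": r["name"].strip().title(), "amount": int(r["amount"])} for r in rows]

def load(rows, target):
    target.extend(rows)

# ETL: transform in a staging step, then load only the cleaned data.
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: load the raw data first, then transform inside the warehouse
# (in practice the transform is SQL/dbt running on the warehouse's own engine).
raw_zone = []
load(extract(), raw_zone)
warehouse_elt = transform(raw_zone)

print(warehouse_etl, warehouse_elt)   # same end result, different place of transformation
```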

H) CAP Theorem/Brewer’s Theorem

The CAP theorem states that a distributed database system has to make a tradeoff between Consistency and Availability when a Partition occurs.
Consistency means that the user should be able to see the same data no matter which node they connect to on the system. For example, your bank account should reflect the same balance whether you view it from your PC, tablet, or smartphone!
Availability means that every request from the user should elicit a response from the system.
Partition refers to a communication break between nodes within a distributed system.
Partition tolerance means the system keeps operating even when such a communication break occurs; it is typically achieved by keeping replicas of the records on multiple different nodes.

I) Windowing in Streaming

Windowing – dividing the input data into finite windows and processing each window separately.
Time in windowing:
i. Processing Time – the time at which the system observes and processes the event; used when you don't care about when the event actually happened.
ii. Event Time – the time at which the event actually occurred at the source.
Time-Based Windows:
i. Tumbling windows
ii. Hopping windows
iii. Sliding windows
iv. Session Windows
Windowing divides a continuous data stream into smaller, finite chunks called streaming windows.
E.g. calculating average website traffic per hour.
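
A minimal sketch of a 1-hour tumbling window computed over event time (the timestamps and view counts are made up):

```python
from collections import defaultdict
from datetime import datetime

# Page-view events with event-time timestamps (when the event actually happened).
events = [
    {"ts": datetime(2024, 1, 1, 10, 5),  "views": 3},
    {"ts": datetime(2024, 1, 1, 10, 40), "views": 7},
    {"ts": datetime(2024, 1, 1, 11, 15), "views": 5},
]

def window_start(ts: datetime) -> datetime:
    """Assign an event to the fixed, non-overlapping 1-hour tumbling window containing it."""
    return ts.replace(minute=0, second=0, microsecond=0)

traffic_per_hour = defaultdict(int)
for e in events:
    traffic_per_hour[window_start(e["ts"])] += e["views"]

for window, views in sorted(traffic_per_hour.items()):
    print(window, views)   # 10:00 window -> 10 views, 11:00 window -> 5 views
```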

J) DAGs and Workflow Orchestration

-- DAGs (Directed Acyclic Graphs) are a fundamental concept in workflow orchestration, representing tasks and their dependencies in a structured way.
-- Workflow orchestration manages the execution of these DAGs, ensuring tasks are executed in the correct order, handling dependencies, failures and retries.
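
A minimal sketch of a DAG executed in dependency order, using the standard-library `graphlib`; the task names are hypothetical and `run()` stands in for real task code an orchestrator like Airflow would invoke:

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Tasks and their upstream dependencies, as an orchestrator would model them.
dag = {
    "extract":   set(),
    "transform": {"extract"},
    "validate":  {"transform"},
    "load":      {"transform"},
    "report":    {"validate", "load"},
}

def run(task: str) -> None:
    print("running", task)               # stand-in for the real task callable

# Execute tasks in an order that respects every dependency.
for task in TopologicalSorter(dag).static_order():
    run(task)
```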

K) Retry Logic and Dead Letter Queues

-- A Dead Letter Queue (DLQ) acts as a secondary queue in messaging systems, designed to manage messages that fail to process.
-- When a message cannot be delivered, it is redirected to the DLQ instead of being lost or endlessly retried.

-- Retry logic attempts to redeliver messages that fail to be processed initially. E.g. if a message fails due to a database connection issue, the retry logic will attempt to resend it after a short delay; if the database is back online, the message is processed successfully.
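
A minimal sketch of retries with backoff and a dead-letter queue; `process()` is a hypothetical handler and the DLQ is just a Python list standing in for a real DLQ topic or queue:

```python
import time

dead_letter_queue = []   # stand-in for a real DLQ topic/queue

def process(message: dict) -> None:
    """Hypothetical handler; raises if, say, the database is unreachable."""
    if message.get("poison"):
        raise RuntimeError("cannot process message")

def handle_with_retries(message: dict, max_retries: int = 3, delay: float = 0.5) -> None:
    for attempt in range(1, max_retries + 1):
        try:
            process(message)
            return                                      # success
        except Exception as err:
            if attempt == max_retries:                  # retries exhausted -> park it in the DLQ
                dead_letter_queue.append({"message": message, "error": str(err)})
                return
            time.sleep(delay * attempt)                 # simple backoff before the next attempt

handle_with_retries({"id": 1})
handle_with_retries({"id": 2, "poison": True})
print(dead_letter_queue)
```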

L) Backfilling and Reprocessing

-- Backfilling is the process of filling in missing historical data, or correcting historical data that was not processed correctly during the initial run. It ensures data consistency, corrects errors, and provides a complete historical record for analysis and reporting.
-- Reprocessing involves re-running a data pipeline, either partially or fully, with updated logic, code, or configurations. It corrects errors, incorporates new insights, or applies changes to historical data.
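
A minimal sketch of a backfill loop that re-runs a hypothetical daily pipeline over a historical date range:

```python
from datetime import date, timedelta

def run_pipeline(run_date: date) -> None:
    """Hypothetical daily job, parameterised by the logical date it processes."""
    print("processing partition for", run_date)

def backfill(start: date, end: date) -> None:
    """Re-run the daily pipeline for every date in a historical range."""
    day = start
    while day <= end:
        run_pipeline(day)
        day += timedelta(days=1)

# Fill the gap left by a two-day outage.
backfill(date(2024, 3, 1), date(2024, 3, 2))
```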

M) Data Governance

-- A framework that ensures data is reliable, consistent and aligns with business goals. It focuses on data quality, data security, compliance, data lifecycle management and data stewardship.
E.g. imagine a hospital storing patient records. Data governance would dictate how that data is collected, stored, accessed, and protected. Data engineers would build the systems to store and process the data, ensuring that they adhere to the governance policies regarding access control, data security, and data privacy.

N) Time Travel and Data Versioning

-- Time Travel is the ability to access historical versions of datasets at previous points in time. Allows users to query and manipulate data as it existed at any point in the past. Instead of storing each version as a separate entity, time travel maintains a historical record of all changes made to the data.
-- Data versioning is the practice of keeping multiple versions of data objects with each version representing the state of the data object at a specific point in time. It involves the explicit creation of new versions of data objects whenever changes are made.
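
A toy sketch of versioned writes with an "as of" lookup; real systems (e.g. Delta Lake or Snowflake) implement this with transaction logs rather than an in-memory list:

```python
from datetime import datetime, timezone

history = []   # every write appends a new immutable version of the record

def write(record: dict) -> None:
    history.append({"ts": datetime.now(timezone.utc), "data": dict(record)})

def as_of(ts: datetime):
    """Time travel: return the latest version that existed at the given time."""
    versions = [v for v in history if v["ts"] <= ts]
    return versions[-1]["data"] if versions else None

write({"id": 1, "status": "pending"})
checkpoint = datetime.now(timezone.utc)
write({"id": 1, "status": "shipped"})

print(as_of(checkpoint))       # {'id': 1, 'status': 'pending'} -- the state as it was then
print(history[-1]["data"])     # {'id': 1, 'status': 'shipped'} -- the current version
```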

O) Distributed Data Processing
Handling and analyzing data across multiple interconnected machines (nodes). Crucial for managing and processing large-scale datasets, i.e. big data.
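
A minimal sketch of the split-process-combine pattern on one machine using `multiprocessing`; distributed engines such as Spark or Hadoop apply the same idea across many nodes:

```python
from multiprocessing import Pool

def count_words(chunk):
    """Map step: each worker processes its own chunk of the data."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over the lazy dog"] * 1000
    # Split the dataset into chunks, fan them out to worker processes,
    # then reduce the partial results into a single answer.
    chunks = [lines[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)
    print(sum(partial_counts))   # total word count across all chunks
```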
