Walter Ndung'u
15 Data Engineering Core Concepts Simplified

Introduction

In today’s world of Big Data, the term data engineering is everywhere — often surrounded by a cloud of technical buzzwords. These terms can feel overwhelming, especially if you’re new to the data ecosystem.

This article aims to break down these concepts into simple, relatable explanations so you can understand them without needing a technical background.

What is Data Engineering?

Data engineering is the discipline of designing, building, and maintaining data pipelines that ensure data can move reliably from its source to where it’s needed. These pipelines handle the movement, transformation, and storage of data, making it ready for analysis and decision-making.

Core Concepts of Data Engineering

1. Batch vs Streaming Ingestion

Batch Ingestion is a process whereby data is collected and processed in large, discrete chunks at specific times, usually scheduled.

Streaming Ingestion is the continuous collection of data as it arrives; each record is processed individually as it enters the system.
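To make the contrast concrete, here is a minimal Python sketch; the function names and in-memory data are made up for illustration and stand in for real sources such as files or message queues.

```python
import time
from datetime import date

# Batch ingestion (hypothetical sketch): process a whole day's records in one scheduled run.
def ingest_daily_batch(records: list[dict], run_date: date) -> None:
    cleaned = [r for r in records if r.get("amount") is not None]
    print(f"{run_date}: loaded {len(cleaned)} record(s) in one batch")

# Streaming ingestion (hypothetical sketch): handle events one by one as they arrive.
def ingest_stream(event_source) -> None:
    for event in event_source:
        if event.get("amount") is not None:
            print(f"processed event {event['id']} at {time.time():.0f}")

# Example usage with in-memory data standing in for a real source or queue.
ingest_daily_batch([{"id": 1, "amount": 10}, {"id": 2, "amount": None}], date(2025, 8, 10))
ingest_stream(iter([{"id": 3, "amount": 5}, {"id": 4, "amount": 7}]))
```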

2. Change Data Capture (CDC)

Change Data Capture is a technique that identifies and tracks changes (inserts, updates, deletes) made to data in a database and then delivers those changes in real time to a downstream process or system, such as a real-time data integration pipeline or a data warehouse.
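Here is a toy Python sketch of the idea: a stream of change events is replayed against a downstream copy of a table. The event format is invented for illustration and is not tied to any particular CDC tool.

```python
# Toy CDC consumer: apply a stream of change events (insert/update/delete)
# to a downstream in-memory copy of a "customers" table.
downstream_customers: dict[int, dict] = {}

change_events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "name": "Amina"}},
    {"op": "update", "id": 1, "row": {"id": 1, "name": "Amina W."}},
    {"op": "delete", "id": 1, "row": None},
]

def apply_change(event: dict) -> None:
    if event["op"] in ("insert", "update"):
        downstream_customers[event["id"]] = event["row"]  # upsert the latest row
    elif event["op"] == "delete":
        downstream_customers.pop(event["id"], None)        # remove the row downstream

for event in change_events:
    apply_change(event)

print(downstream_customers)  # {} — the insert, update, and delete were all replayed
```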

3. Idempotency
Idempotency is a property of an operation whereby executing it multiple times with the same inputs produces the same result as executing it once. For example, if saving a record is idempotent, pressing the save button twice still creates only one record.
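A minimal Python sketch of an idempotent save, assuming a client-supplied idempotency key (the names and data here are hypothetical):

```python
# Idempotent save: the same request (identified by a client-supplied key)
# can be retried any number of times and still produces exactly one record.
saved_records: dict[str, dict] = {}

def save_record(idempotency_key: str, payload: dict) -> dict:
    if idempotency_key in saved_records:      # already processed: return the original result
        return saved_records[idempotency_key]
    saved_records[idempotency_key] = payload  # first time: actually create the record
    return payload

save_record("order-123", {"item": "keyboard", "qty": 1})
save_record("order-123", {"item": "keyboard", "qty": 1})  # the "double click" — no duplicate
print(len(saved_records))  # 1
```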
4. OLTP vs OLAP

Online Transaction Processing (OLTP) is a form of data processing that involves a large number of small, concurrent transactions. Examples of such processes include online banking, shopping, order entry, and sending text messages.
Online Analytical Processing (OLAP) is a way of storing and querying data so that you can quickly analyze it from different dimensions without having to run slow, complex queries on raw transactional data.
Scenario: a company's sales data

In OLTP: Every single sale is recorded (like “Sold 3 units of product X in Nairobi on Aug 10, 2025”).

In OLAP: Data is reorganized so you can quickly answer questions like:

  • What were the total sales for product X by month for the past 2 years?
  • Which region sold the most in Q2 2025?
  • How do sales in Nairobi compare to Kisumu over time?
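To illustrate the difference, here is a small Python sketch using the standard library's sqlite3 module; the table and column names are invented. The OLTP side records individual sales, while the OLAP-style query aggregates across them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, sold_on TEXT, units INTEGER)")

# OLTP-style: many small writes, one per sale.
conn.execute("INSERT INTO sales VALUES ('X', 'Nairobi', '2025-08-10', 3)")
conn.execute("INSERT INTO sales VALUES ('X', 'Kisumu',  '2025-08-11', 5)")
conn.commit()

# OLAP-style: one analytical query that aggregates across many rows and dimensions.
for row in conn.execute(
    "SELECT region, SUM(units) AS total_units FROM sales GROUP BY region ORDER BY total_units DESC"
):
    print(row)  # ('Kisumu', 5), ('Nairobi', 3)
```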

5. Columnar vs Row-based Storage
In row-based storage, all values of a single record are stored contiguously on disk. This form of storage is efficient for transactional workloads (inserting, updating, or deleting rows) and write-intensive operations. It is less efficient for queries that need only a few columns across many rows, because each entire row must be read from disk, leading to unnecessary I/O.

In columnar storage, data is stored column by column, with all values for a single column stored contiguously on disk. This form of storage is highly efficient for analytical queries that involve aggregations, filtering, and analysis across large datasets, because only the required columns are read from disk. However, it is less efficient for transactional workloads, since modifying a single row requires updates across multiple column blocks.
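As a rough illustration, the Python sketch below writes the same DataFrame to a row-oriented CSV file and a columnar Parquet file, then reads back only two columns from the Parquet file. It assumes pandas and pyarrow are installed; the file names and data are made up.

```python
import pandas as pd  # assumes pandas and pyarrow are installed

df = pd.DataFrame({
    "order_id": range(1, 6),
    "region": ["Nairobi", "Kisumu", "Nairobi", "Mombasa", "Kisumu"],
    "units": [3, 5, 2, 7, 1],
})

# Row-oriented storage: every value of each record sits together (good for writes).
df.to_csv("sales.csv", index=False)

# Columnar storage: each column is stored contiguously (good for analytics).
df.to_parquet("sales.parquet")

# An analytical read can pull back only the columns it needs from the Parquet file.
units_only = pd.read_parquet("sales.parquet", columns=["region", "units"])
print(units_only.groupby("region")["units"].sum())
```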

6. Partitioning

In data engineering, partitioning means splitting a large dataset into smaller, more manageable parts to speed up queries and reduce resource usage. Instead of scanning an entire table or file, queries only read the relevant partitions.

Common types of partitioning:

  • Horizontal partitioning: Splitting rows based on a column’s value (e.g., date, region).

  • Vertical partitioning: Splitting columns into separate tables or files to reduce data scanned.

  • Hash partitioning: Using a hash function on a key (e.g., customer ID) to evenly distribute data across partitions.
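Here is a hedged Python sketch of horizontal (value-based) partitioning when writing Parquet files, assuming pandas and pyarrow are installed; the paths and data are invented for illustration.

```python
import pandas as pd  # assumes pandas and pyarrow are installed

sales = pd.DataFrame({
    "sold_on": ["2025-08-10", "2025-08-10", "2025-08-11"],
    "region": ["Nairobi", "Kisumu", "Nairobi"],
    "units": [3, 5, 2],
})

# Horizontal (value-based) partitioning: one directory per sold_on value,
# e.g. sales_partitioned/sold_on=2025-08-10/part-0.parquet
sales.to_parquet("sales_partitioned", partition_cols=["sold_on"])

# A query for a single day only needs to read that day's partition directory.
one_day = pd.read_parquet("sales_partitioned/sold_on=2025-08-10")
print(one_day)
```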

7. ETL vs ELT
ETL (Extract, Transform, Load): Data is extracted from source systems, transformed (cleaned, enriched, aggregated) in a separate processing environment, and then loaded into the target storage (e.g., a data warehouse).

  • Good when transformations must happen before data enters storage.

  • Often used with on-premise data warehouses or systems with strict schema requirements.

ELT (Extract, Load, Transform): Data is extracted from sources, loaded directly into the target storage first (often a cloud data warehouse), and transformed inside the storage using its processing power.

  • Good when the storage is powerful enough to handle transformations at scale (e.g., Snowflake, BigQuery).

  • Allows storing raw data for flexibility and reprocessing later.
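The toy Python sketch below contrasts the two orderings; the "warehouse" is just an in-memory list and all function names are invented. In a real ELT setup, the transform step would typically be SQL executed by the warehouse engine.

```python
raw_rows = [{"amount": "100"}, {"amount": "bad"}, {"amount": "250"}]

def transform(rows):
    """Clean and type the data (drop rows whose amount fails to parse)."""
    out = []
    for r in rows:
        try:
            out.append({"amount": int(r["amount"])})
        except ValueError:
            pass
    return out

# ETL: transform in a separate processing step, then load the clean result.
warehouse_etl = transform(raw_rows)

# ELT: load the raw data into the warehouse first, then transform it there.
warehouse_raw = list(raw_rows)
warehouse_elt = transform(warehouse_raw)

print(warehouse_etl)
print(warehouse_elt)
```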

8. CAP Theorem (Brewer's Theorem)
The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition Tolerance; at most two can be fully guaranteed at the same time.

The 3 Properties

  1. Consistency (C): Every node in the system sees the same data at the same time.
  2. Availability (A): Every request receives a response.
  3. Partition Tolerance (P): The system continues to operate even when messages between nodes are lost or delayed.

The trade-off: when a network partition happens, you must choose between:

CA → Consistency + Availability (no Partition Tolerance) → works only if network never fails (rare in real distributed systems).

CP → Consistency + Partition Tolerance (may sacrifice availability during a partition).

AP → Availability + Partition Tolerance (may serve stale data to keep responding).

9. Windowing in Streaming
Windowing is a technique in stream processing that breaks an infinite flow of events (such as logs, sensor readings, or transactions) into finite chunks based on time or count so you can run aggregations like sum, average, or count. Windowing provides the logical boundaries that make it possible to compute results over a stream that never ends.
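Here is a minimal Python sketch of tumbling (fixed-size, non-overlapping) windows, assigning each event to a 60-second bucket and summing per window; the events are invented for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling (non-overlapping) windows of one minute

# Invented events: (epoch_timestamp, value) pairs standing in for a real stream.
events = [(0, 5), (10, 3), (59, 2), (61, 7), (120, 1)]

window_sums: dict[int, int] = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS  # logical boundary for this event
    window_sums[window_start] += value

for start in sorted(window_sums):
    print(f"window [{start}, {start + WINDOW_SECONDS}): sum = {window_sums[start]}")
# window [0, 60): sum = 10
# window [60, 120): sum = 7
# window [120, 180): sum = 1
```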
10. DAG and Workflow Orchestration

  • DAG (Directed Acyclic Graph): A way to represent a workflow where each step has a defined order and the directed edges never loop back to an earlier step.
  • Workflow Orchestration: The process of automating, coordinating, and managing multiple tasks (organized as DAGs) and systems to execute complex business processes. Tools used for orchestration include Apache Airflow, Dagster, and Luigi.
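As a concept sketch (not an Airflow example), the Python snippet below defines a tiny DAG as a dictionary of dependencies and uses the standard library's graphlib to run the tasks in a valid order, which is the core job of any orchestrator.

```python
from graphlib import TopologicalSorter  # standard library (Python 3.9+)

# A tiny DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# Run every task exactly once, in an order that respects the arrows
# and never loops back (acyclic).
for task in TopologicalSorter(dag).static_order():
    print(f"running {task}")
# running extract -> transform -> load -> notify
```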

11. Retry Logic & Dead Letter Queues
Data engineering and distributed systems need ways to handle failures without losing data. Among these are Retry Logic and Dead Letter Queues (DLQs).
Retry Logic: The process of automatically reattempting a failed task or message after a delay, usually with a limit on how many attempts are made. It is useful for transient issues such as network glitches, API timeouts, or locked resources.
Dead Letter Queue: A special holding queue for messages or events that still failed after all retries. It prevents endless retry loops and preserves the failed data for investigation.
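Here is a small Python sketch combining both ideas: retry a handler a limited number of times with a simple backoff, and park the message in a dead letter queue if it still fails; the handler and message shape are hypothetical.

```python
import time

dead_letter_queue: list[dict] = []  # failed messages parked here for later inspection

def process_with_retries(message: dict, handler, max_attempts: int = 3, delay_s: float = 1.0) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return                           # success: stop retrying
        except Exception as exc:
            if attempt == max_attempts:      # retries exhausted: park in the DLQ
                dead_letter_queue.append({"message": message, "error": str(exc)})
            else:
                time.sleep(delay_s * attempt)  # simple backoff before the next try

def flaky_handler(message: dict) -> None:
    raise TimeoutError("downstream API timed out")  # always fails, to demonstrate the DLQ

process_with_retries({"id": 42}, flaky_handler, max_attempts=2, delay_s=0.01)
print(dead_letter_queue)  # [{'message': {'id': 42}, 'error': 'downstream API timed out'}]
```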

12. Backfilling & Reprocessing
Backfilling involves reprocessing historical data to correct errors, accommodate new data structures, or integrate new data sources.

Example: If your sales database adds a new “discount” column, you might backfill past records so that older data also includes the correct discount information.

Reprocessing is the act of running data through a processing pipeline again to correct inaccuracies, apply updated transformation logic, or ensure completeness.
Example: If you discover an error in your tax calculation logic, you might reprocess the past month’s sales data using the corrected formula.
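A minimal Python sketch of a backfill loop, assuming a hypothetical run_daily_pipeline function that re-runs one day of processing with the corrected logic:

```python
from datetime import date, timedelta

def run_daily_pipeline(run_date: date) -> None:
    """Stand-in for a real daily job; here it just reports what it would reprocess."""
    print(f"reprocessing sales for {run_date} with the corrected tax logic")

def backfill(start: date, end: date) -> None:
    """Re-run the daily pipeline for every historical date from start to end, inclusive."""
    current = start
    while current <= end:
        run_daily_pipeline(current)
        current += timedelta(days=1)

# Backfill July 2025 after fixing the tax calculation.
backfill(date(2025, 7, 1), date(2025, 7, 31))
```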

13. Data Governance
Data governance is the framework of rules, processes, and responsibilities that ensures data is accurate, secure, consistent, and used appropriately throughout its life cycle.

Purpose in Data Engineering:

  • Quality assurance – Making sure the data pipelines deliver clean, reliable data.

  • Security & privacy – Controlling who can access or modify data.

  • Compliance – Meeting legal and regulatory requirements (e.g., GDPR, HIPAA).

  • Lineage & documentation – Tracking where data came from, how it was transformed, and where it’s used.

  • Standardization – Ensuring consistent formats, naming conventions, and definitions.

14. Time Travel & Data Versioning

Time travel is the ability to view a dataset as it existed at a specific point in the past.

It can be used to recover accidentally deleted or modified data, to audit historical states of data, and to compare current and past datasets.

Data Versioning is the practice of storing and managing multiple versions of a dataset over time. Its purpose is to track changes to data or to enable rollback to previous versions.
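Real systems such as data warehouses and lakehouse table formats provide time travel natively; the toy Python class below only illustrates the underlying idea by keeping an immutable snapshot for every write and letting you read the data "as of" an earlier version.

```python
import copy

class VersionedDataset:
    """Toy illustration: keep an immutable snapshot for every write."""

    def __init__(self) -> None:
        self._versions: list[list[dict]] = [[]]  # version 0 is the empty dataset

    def write(self, rows: list[dict]) -> int:
        """Store a new snapshot and return its version number."""
        self._versions.append(copy.deepcopy(rows))
        return len(self._versions) - 1

    def read(self, version: int = -1) -> list[dict]:
        """Read the latest data (default), or 'time travel' to an earlier version."""
        return self._versions[version]

ds = VersionedDataset()
v1 = ds.write([{"id": 1, "name": "Amina"}])
v2 = ds.write([{"id": 1, "name": "Amina W."}])
print(ds.read())    # the latest data (version 2)
print(ds.read(v1))  # the dataset as it existed at version 1
```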

15. Distributed Processing
Distributed processing is the method of breaking a large computing task into smaller parts, running those parts simultaneously across multiple machines or processors, and then combining the results.

Purpose in Data Engineering:

  • Handle datasets too large for a single machine’s memory or storage.

  • Process data faster by working in parallel.

  • Improve fault tolerance — if one machine fails, others can continue.
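As a small-scale illustration, the Python sketch below uses local processes (standing in for cluster nodes) to split a dataset into chunks, process them in parallel, and combine the partial results; the data and function are invented.

```python
from multiprocessing import Pool

def count_valid(chunk: list[int]) -> int:
    """Work done on one 'worker': count the non-negative values in its chunk."""
    return sum(1 for value in chunk if value >= 0)

if __name__ == "__main__":
    data = list(range(-50, 1_000))
    # Split the dataset into smaller parts ...
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
    # ... process the parts in parallel (local processes standing in for cluster nodes) ...
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_valid, chunks)
    # ... then combine the partial results into the final answer.
    print(sum(partial_counts))  # 1000
```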
