Kepher Ashitakaya

15 Core Concepts of Data Engineering

1. Batch vs Streaming Ingestion

When we as data engineers want to ingest data from a source, there are two approaches to processing it: batch processing or streaming processing.

  • Batch processing simply collects the data at scheduled intervals. We let events accumulate and process them periodically, every hour, every day, and so on. Example: every midnight, we upload all of yesterday's sales data into a data warehouse.
  • Streaming processing ingests the data in real time, as it happens, with each event processed the moment it is recorded. Example: each time a customer swipes a card, the transaction instantly appears in the fraud detection system.
  • Batch processing is suitable for scheduled tasks and reduces system load, while streaming ingestion is suitable for real-time processing and immediate insights. The choice largely depends on data volume, system architecture, and latency requirements. When to use each:
  • Use streaming ingestion when decisions must be made in real time. Think of live stock market pricing that enables consumers to make decisions as stock prices change.
  • Use batch ingestion when data freshness is not required in real time. Think of monthly reports that analyze business performance. A minimal sketch of both styles follows this list.
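
To make the contrast concrete, here is a minimal Python sketch. The simulated card-transaction feed and the two-second "schedule" are invented for illustration; a real pipeline would read from an actual source and run on a scheduler.

```python
import time
from datetime import datetime

def process(events):
    """Stand-in transformation: report how many events were handled and when."""
    print(f"{datetime.now():%H:%M:%S} processed {len(events)} event(s)")

# A tiny simulated feed of card transactions.
feed = [{"card": "****1234", "amount": amount} for amount in (10, 25, 7, 99)]

def batch_ingest(events, interval_seconds=2):
    """Batch: let events accumulate, then process them all in one scheduled run."""
    time.sleep(interval_seconds)   # pretend we waited for the scheduled run (e.g. midnight)
    process(events)                # one pass over everything that accumulated

def stream_ingest(events):
    """Streaming: act on every event the moment it arrives."""
    for event in events:
        process([event])           # immediate, per-event processing

batch_ingest(feed)    # 1 run covering 4 events
stream_ingest(feed)   # 4 runs, 1 event each
```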

2. Change Data Capture (CDC)

  • When we make changes to a table in a database, e.g., inserting new data, updating it, or deleting it, we want these events to be tracked and captured. This is where CDC comes in.
  • It is a technique for tracking only the changes made to our database tables and sending those changes to another system: the process of recognizing when data has changed in a source system so that downstream systems can act on that change. This removes the need to copy or export the entire table every time a change occurs, which would be painful with large tables.
  • With CDC, changes are tracked and recorded, which is crucial for making sure the correct changes flow through to downstream data. Use cases:
  • CDC keeps a database, data warehouse, and data lake in sync by replicating changes between them.
  • We can run stream processing based on data changes, e.g., when a customer's record changes, CDC can feed that event into a stream processor that sends the customer a message.
  • In analytics, log-based CDC can deliver near-real-time data by moving changes from a relational database through Kafka into a Snowflake data warehouse.
  • In e-commerce, it captures updated product prices and inventory so customers always see current information on the products they want to purchase. A minimal sketch of applying CDC events follows the comparison table below.
| Characteristic | Change Data Capture (CDC) | Traditional ETL |
| --- | --- | --- |
| Data Freshness | Near real-time or low-latency updates | Periodic (batch-based) updates |
| Data Movement | Only changes (inserts, updates, deletes) are captured and processed | Full dataset or large batches are reprocessed |
| Processing Overhead | Lower (less data processed, smaller footprint) | Higher (processes entire datasets repeatedly) |
| Impact on Source | Minimal (lightweight change logs or event streams) | Higher (can stress source systems during extraction) |
| Complexity | More complex to implement, requires log reading or triggers | Easier to implement, straightforward extraction process |
| Typical Tools | Debezium, Oracle GoldenGate, AWS DMS | Apache Airflow, AWS Glue, Talend, Informatica |
| Use Cases | Real-time analytics, replication, event-driven pipelines | Periodic reporting, data warehousing, historical loads |
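
As an illustration, here is a minimal Python sketch of applying a stream of change events to a downstream copy of a table. The event shape (`op`, `key`, `row`) is an assumption made for this example, not the output format of any particular CDC tool such as Debezium.

```python
# Downstream replica of the source table: primary key -> latest version of the row.
replica = {}

def apply_change(event):
    """Apply one captured change (insert/update/delete) to the replica."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]    # upsert the latest version of the row
    elif op == "delete":
        replica.pop(key, None)         # drop the row if it exists

# A tiny simulated change log captured from the source table.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Britney", "country": "Tanzania"}},
    {"op": "update", "key": 1, "row": {"name": "Britney", "country": "Kenya"}},
    {"op": "delete", "key": 1},
]

for event in change_log:
    apply_change(event)

print(replica)   # {} -- the row was inserted, updated, then deleted downstream too
```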

3. Idempotency

  • This is all about ensuring an operation produces no duplicates, especially on retries after a system failure. When we click the buy button twice while purchasing a product, we don't want to be charged twice. In a nutshell, no matter how many times I run an operation, the result is the same as running it once.
  • In data engineering, if we are inserting data into a system and the job fails for one reason or another, idempotency ensures the retry does not insert the same data again and create duplicates. A minimal sketch is shown below.
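
Here is a minimal sketch of an idempotent write using SQLite's upsert clause; the `payment_id` column is an invented example of a natural idempotency key that deduplicates retries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount REAL)")

def record_payment(payment_id, amount):
    """Idempotent write: the primary key makes retries of the same payment harmless."""
    conn.execute(
        "INSERT INTO payments (payment_id, amount) VALUES (?, ?) "
        "ON CONFLICT(payment_id) DO NOTHING",
        (payment_id, amount),
    )
    conn.commit()

# The same operation retried three times still leaves exactly one row.
for _ in range(3):
    record_payment("order-42", 19.99)

print(conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0])   # 1
```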

4. OLTP vs OLAP

| Characteristic | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| --- | --- | --- |
| Primary Purpose | Handles day-to-day transactional data (insert, update, delete) | Performs complex analysis and reporting on aggregated historical data |
| Data Type | Current, real-time operational data | Historical, aggregated, and summarized data |
| Typical Operations | Short, simple queries (e.g., insert a record, update account balance) | Complex queries (e.g., trend analysis, forecasting, aggregations) |
| Performance Focus | Optimized for fast write and read of individual records | Optimized for read-heavy workloads and complex joins/aggregations |
| Data Volume | Small to moderate per transaction, but high volume of transactions | Large datasets, often in terabytes or petabytes |
| Schema Design | Highly normalized to avoid redundancy | Denormalized or star/snowflake schemas to speed up query performance |
| Examples | PostgreSQL, MySQL, Oracle Database | Snowflake, Amazon Redshift, Google BigQuery |
| Use Case Example | Recording bank transactions, e-commerce purchases, inventory updates | Sales trend analysis, customer segmentation, business intelligence dashboards |
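
To make the workload difference tangible, here is a small Python sketch using SQLite; the table and values are invented, and in practice the transactional store and the analytical warehouse would be separate systems, as the table above notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: short, single-record writes as each sale happens.
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("Nairobi", 120.0))
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("Dodoma", 80.0))
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("Nairobi", 40.0))
conn.commit()

# OLAP-style work: scan and aggregate the history to answer an analytical question.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)   # Dodoma 80.0 / Nairobi 160.0
```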

5. Columnar vs Row-based Storage

  • As the name suggests, row-based storage stores data row by row, like a spreadsheet where all the information about a customer sits in one row. Think of a banking system where, when we make a transaction, the rows involved change.
  • Columnar storage takes a different approach. Instead of storing data row by row for each user or item, we store all the values of the same column together. Here is an example:

Row Storage:

| Cust_ID | Name | Age | Country |
| --- | --- | --- | --- |
| 001 | Britney | 25 | Tanzania |
| 002 | Bill | 34 | Kenya |

Columnar Storage:

  • Cust_ID → 001, 002
  • Name → Britney, Bill
  • Age → 25, 34
  • Country → Tanzania, Kenya

Columnar Storage is good for scanning huge datasets where we don't need every column, as the sketch below shows.
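
A plain Python sketch of the two layouts (the customer records above, held in memory) illustrates the difference: an analytical query like "average age" touches only one column in the columnar layout.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"cust_id": "001", "name": "Britney", "age": 25, "country": "Tanzania"},
    {"cust_id": "002", "name": "Bill", "age": 34, "country": "Kenya"},
]

# Columnar layout: all values of the same column are stored together.
columns = {
    "cust_id": ["001", "002"],
    "name": ["Britney", "Bill"],
    "age": [25, 34],
    "country": ["Tanzania", "Kenya"],
}

# Analytical question: what is the average age?
# The row layout walks every record, dragging every field along with it...
avg_from_rows = sum(r["age"] for r in rows) / len(rows)

# ...while the columnar layout reads only the one column it needs.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_columns)   # 29.5 29.5
```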

6. Partitioning

  • This is a technique of dividing large datasets into smaller chunks called partitions. These partitions can be stored, queried, and managed as individual tables, though they logically belong to the same dataset.
  • There are four approaches you need to know.
    Horizontal partitioning splits rows based on criteria like date ranges:
    -- Orders table partitioned by year
    CREATE TABLE orders_2023 (...);
    CREATE TABLE orders_2024 (...);
    CREATE TABLE orders_2025 (...);
    Vertical partitioning splits columns - frequently accessed columns in one partition, rarely used ones in another.
    Hash partitioning distributes rows based on a hash function.
    Range partitioning uses value ranges like dates or IDs.

  • Partitioning helps improve database performance and scalability. For instance, searching for a data point in the entire table takes longer and uses more resources than searching for it in a specific partition. That's why data is stored as partitions.

  • It is useful in OLAP operations. For example, applying cube partitioning (a data partitioning technique) divides the data into smaller cubes based on specific criteria. When you query the data, the search is performed on a particular cube, skipping irrelevant ones and reducing I/O operations. The file-layout sketch below applies the same idea outside a database.
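
The same idea applies to files in a data lake: records are written under one directory per partition key, so a query can skip directories it does not need. A minimal Python sketch, assuming invented sales records and a local `sales/` output directory:

```python
import csv
from collections import defaultdict
from pathlib import Path

sales = [
    {"order_id": 1, "order_date": "2024-03-01", "amount": 120.0},
    {"order_id": 2, "order_date": "2024-03-01", "amount": 80.0},
    {"order_id": 3, "order_date": "2024-03-02", "amount": 40.0},
]

# Group rows by the partition key (order_date).
partitions = defaultdict(list)
for row in sales:
    partitions[row["order_date"]].append(row)

# Write one directory per partition, e.g. sales/order_date=2024-03-01/part-0.csv
for day, rows in partitions.items():
    part_dir = Path("sales") / f"order_date={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# A reader that only cares about 2024-03-01 opens just that directory and skips the rest.
```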

7. ETL vs ELT

ETL as the acronym suggests, consists of three primary steps:

Extract: Data is gathered from different source systems.

Transform: Data is then transformed into a standardized format (cleansing, aggregation, enrichment, etc.).

Load: The transformed data is loaded into a target data warehouse or another repository.

  • ETL is suitable where data sources are small scale and transformations are complex.
  • It also fits where data security is a priority, requiring transformations to mask or encrypt sensitive data before it lands in a warehouse.

ELT takes a slightly different approach:

Extract: Just as with ETL, data is collected from different sources.

Load: Raw data is directly loaded into the target system.

Transform: Transformations take place within the data warehouse.

  • ELT’s popularity is tied to the rise of cloud warehouses like Snowflake, BigQuery, and Redshift.
  • Flexibility: Raw data is loaded first, allowing transformation logic to be decided later.
  • Efficiency: Uses the power of cloud warehouses for faster, scalable transformations.
  • Best for large datasets: leverages massive parallel processing in modern warehouses. A side-by-side sketch follows the comparison table below.
| Feature / Aspect | ETL (Extract → Transform → Load) | ELT (Extract → Load → Transform) |
| --- | --- | --- |
| Process Order | Extract → Transform → Load | Extract → Load → Transform |
| Where Transform Happens | In external ETL tool/staging server | Inside the data warehouse |
| Best For | Small datasets, complex transformations, pre-load validation | Large datasets, flexible transformations, cloud processing |
| Performance Dependency | Dependent on ETL tool/server power | Dependent on warehouse processing power |
| Data Security | Mask/encrypt before loading | Masking happens after loading |
| Flexibility | Low — defined upfront | High — can adapt after loading |
| Scalability | Limited by ETL infrastructure | Highly scalable with cloud MPP |
| Speed for Large Data | Slower — must transform first | Faster — transforms in parallel inside warehouse |
| Infrastructure Cost | Requires powerful ETL servers | Shifts workload to warehouse, reducing ETL server needs |
| Example Tools | Talend, Informatica, Apache NiFi | dbt, SQL in Snowflake/BigQuery/Redshift |
| Common Use Cases | Finance, healthcare (pre-load security) | Modern analytics, data lakes, ML pipelines |
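
As a toy side-by-side, here is a Python sketch that uses an in-memory SQLite database to stand in for the warehouse; the raw records and table names are invented for illustration.

```python
import sqlite3

raw_events = [
    {"customer": " Alice ", "amount": "120"},
    {"customer": " Bob", "amount": "80"},
]

warehouse = sqlite3.connect(":memory:")

# --- ETL: transform outside the warehouse, then load only the clean result. ---
cleaned = [(e["customer"].strip(), float(e["amount"])) for e in raw_events]
warehouse.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

# --- ELT: load the raw data first, then transform with SQL inside the warehouse. ---
warehouse.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO sales_raw VALUES (?, ?)",
    [(e["customer"], e["amount"]) for e in raw_events],
)
warehouse.execute(
    "CREATE TABLE sales_elt AS "
    "SELECT TRIM(customer) AS customer, CAST(amount AS REAL) AS amount FROM sales_raw"
)

print(warehouse.execute("SELECT * FROM sales_elt").fetchall())   # [('Alice', 120.0), ('Bob', 80.0)]
```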

8. CAP Theorem

  • The CAP theorem says that in a distributed system, you can only have two out of three guarantees at the same time:
    Consistency: every node in the system always has the same up-to-date data.
    Availability: the system keeps working and responding to requests, even if some parts fail.
    Partition tolerance: the system can still operate even if there are network problems that prevent nodes from communicating.
    Because you can’t have all three fully at once, systems usually choose two:
  • CA systems (e.g., classic relational databases) focus on consistency and availability, but they struggle if the network is split.
  • CP systems (e.g., MongoDB in certain setups) maintain consistency and handle network issues, but may reject some requests to do so.
  • AP systems (e.g., Cassandra) keep running and available during network issues, but data across nodes may not match immediately. In practice, many modern distributed databases aim for eventual consistency, meaning they prioritize availability and partition tolerance, and ensure all nodes agree on the data over time.

9. Windowing in Streaming

  • When you’re dealing with streaming data (like tweets, sensor readings, or transactions), the data is constantly flowing — it never stops.
  • If you want to do analytics, you can’t wait forever for all the data. Instead, you take small “views” of the stream over time. Think of it like this: a window is a frame through which you look at your data stream for a limited time. You move the window, and you see new parts of the stream.

Why We Need Windowing:
- Streaming data is infinite — we can't wait until it ends to process it.
- Windowing lets us process and analyze data in small, manageable parts.
- It allows real-time aggregation like counting, averaging, or summing data over a time period, as the tumbling-window sketch below shows.
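
Here is a minimal Python sketch of a tumbling (fixed, non-overlapping) window that sums amounts per 60-second bucket; the event timestamps are invented, and a real stream processor such as Flink or Kafka Streams would manage this for you continuously.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

# Simulated stream: (event_timestamp_in_seconds, amount)
events = [(5, 10.0), (42, 5.0), (61, 7.5), (118, 2.5), (125, 1.0)]

# Tumbling window: assign each event to a fixed 60-second bucket and aggregate it there.
window_totals = defaultdict(float)
for ts, amount in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    window_totals[window_start] += amount

for start, total in sorted(window_totals.items()):
    print(f"window [{start}, {start + WINDOW_SECONDS}): total = {total}")
# window [0, 60): total = 15.0
# window [60, 120): total = 10.0
# window [120, 180): total = 1.0
```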

10. DAGs and Workflow Orchestration

  • Data workflows require precise coordination to ensure that tasks are executed in the right order. The Directed Acyclic Graph (DAG) is a powerful tool for managing these workflows efficiently.
  • Directed means having a specific direction, indicating a one-way relationship between tasks.
  • The key thing about DAGs is that they're acyclic, meaning that once you start at one task, you can only move forward, never returning to a previous task. This ensures tasks can be executed in order without leading to infinite loops. DAGs often have a hierarchical structure, where tasks are organized into levels or layers. No loops allowed (you can’t go back from Task B to Task A and run forever). Importance of DAGs:
  • DAGs enforce a logical execution order, ensuring that tasks are executed sequentially based on their dependencies. This prevents errors and inconsistencies from running tasks out of order. Also, if one step fails, DAGs can identify and re-run the affected tasks, saving time and effort.
  • They help track failures (you know exactly which step broke).

Workflow Orchestration: orchestration tools read your DAG and execute it properly. They make sure tasks run at the right time and retry them if something fails.
Example: an ETL data pipeline (extract data → clean data → load into warehouse) orchestrated with Apache Airflow, sketched below.
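
A minimal sketch of that pipeline as an Airflow 2.x DAG (assuming Airflow is installed; the task bodies are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")        # placeholder task logic

def clean():
    print("cleaning and validating the data")    # placeholder task logic

def load():
    print("loading into the warehouse")          # placeholder task logic

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The arrows declare the directed, acyclic dependencies: extract -> clean -> load.
    extract_task >> clean_task >> load_task
```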

11. Retry Logic & Dead Letter Queues

  • When a task fails to execute, be it sending a message to a topic or processing a file, the system by default keeps trying to accomplish the task repeatedly. This can block subsequent messages or tasks and halt the system.
  • There are various ways to deal with such failures, and we can cap the maximum number of retries:
    Immediate retry: try again instantly after failure.
    Fixed interval: wait the same amount of time before each retry.
    Exponential backoff: wait a little longer each time before retrying.
    Dead letter queues (DLQ): when we exhaust the retries, we can route the message to a special place called the DLQ. This keeps the main pipeline free and flowing and lets us engineers tackle the issue separately. A sketch combining backoff and a DLQ is shown below.
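
A minimal Python sketch combining exponential backoff with a dead letter queue, using a plain list to stand in for a real DLQ topic or table; the handler and message are invented.

```python
import time

dead_letter_queue = []   # stand-in for a real DLQ topic or table

def process_with_retries(message, handler, max_retries=3, base_delay=1.0):
    """Try the handler; back off exponentially between attempts, then dead-letter the message."""
    for attempt in range(1, max_retries + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: park the message so the main pipeline keeps flowing.
                dead_letter_queue.append({"message": message, "error": str(exc)})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))   # waits 1s, 2s, 4s, ...

def flaky_handler(message):
    raise ValueError("downstream system unavailable")     # always fails, just for the demo

process_with_retries({"order_id": 42}, flaky_handler)
print(dead_letter_queue)   # the failed message and its error, ready for separate inspection
```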

12. Backfilling & Reprocessing

  • Backfilling refers to the process of retroactively filling in missing or correcting incorrect data in a pipeline or dataset. In simpler terms, if some historical data wasn’t processed correctly (or at all) the first time, backfilling allows you to reprocess that past data. This ensures your data warehouse or reports reflect a consistent and complete history.
  • Reprocessing is when you run the processing again on the same data, often because you fixed a bug or improved the logic. A small backfill sketch follows the table below.
| Feature | Backfilling | Reprocessing |
| --- | --- | --- |
| Goal | Fill missing past data | Correct or improve already processed data |
| Trigger | Missing data | Incorrect or outdated data |
| Data Timeframe | Usually past unprocessed periods | Usually already processed periods |
| Example | Load March 1–3 sales | Fix all customer ages after formula change |
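
A minimal Python sketch of a backfill: re-run the normal daily job for each missed day in a past range. The job body is a placeholder; an orchestrator such as Airflow offers a backfill command that does this over a DAG's schedule.

```python
from datetime import date, timedelta

def run_daily_job(day):
    """Placeholder for the normal daily run (extract, transform, load for one day)."""
    print(f"processing partition for {day}")

def backfill(start, end):
    """Re-run the daily job for every day in [start, end], filling in the gap."""
    day = start
    while day <= end:
        run_daily_job(day)
        day += timedelta(days=1)

# The March 1-3 loads never ran, so fill them in retroactively.
backfill(date(2025, 3, 1), date(2025, 3, 3))
```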

13. Data Governance

  • This is a set of rules and principles for how we collect, handle, and use data. The goal is to ensure that data is accurate, consistent, and available for use while protecting data privacy and security.
  • It involves a set of policies, procedures, and standards. The main pillars that build up data governance are: Ownership and Accountability, Data Quality, Data Protection and Safety, Data Use and Availability, and Data Management.
| Component | What It Means | Real-Life Example |
| --- | --- | --- |
| Data Ownership | Who’s responsible for each dataset | Sales manager owns the “Customer List” |
| Data Quality | Accuracy, completeness, consistency | Correcting typos in customer addresses |
| Data Catalog | Central list of all data and definitions | A searchable “data dictionary” |
| Access Control | Who can view/edit what data | Only HR can see employee salaries |
| Data Policies | Rules for usage, retention, and sharing | Delete customer data after 5 years |
| Compliance | Following laws and industry rules | GDPR: get consent before collecting personal data |
| Auditability | Ability to track who accessed or changed data | Logging all changes to financial records |
