Kepher Ashitakaya

15 Core Concepts of Data Engineering

1. Batch vs Streaming Ingestion

When we as data engineers want to ingest data from a source, there are two approaches to processing it: batch processing or streaming processing.

  • Batch processing simply collects the data at scheduled intervals. We let events accumulate and process them periodically, every hour, every day, and so on. Example: every midnight, we upload all of yesterday's sales data into a data warehouse.
  • Streaming processing ingests the data in real time, as it happens, with each event processed the moment it is recorded. Example: each time a customer swipes a card, the transaction instantly appears in the fraud detection system.
  • Batch processing is suitable for scheduled tasks and reduces system load, while streaming ingestion is suitable for real-time processing and immediate insights. The choice largely depends on data volume, system architecture, and latency requirements. When to use each:
  • Use streaming ingestion when decisions must be made in real time. Think of live stock market pricing that enables consumers to make decisions as stock prices change.
  • Use batch ingestion when data freshness is not required in real time. Think of monthly reports that analyze business performance. A minimal sketch of both styles follows this list.
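
To make the contrast concrete, here is a minimal Python sketch. The simulated card-transaction feed and the two-second "schedule" are invented for illustration; a real pipeline would read from an actual source and run on a scheduler.

```python
import time
from datetime import datetime

def process(events):
    """Stand-in transformation: report how many events were handled and when."""
    print(f"{datetime.now():%H:%M:%S} processed {len(events)} event(s)")

# A tiny simulated feed of card transactions.
feed = [{"card": "****1234", "amount": amount} for amount in (10, 25, 7, 99)]

def batch_ingest(events, interval_seconds=2):
    """Batch: let events accumulate, then process them all in one scheduled run."""
    time.sleep(interval_seconds)   # pretend we waited for the scheduled run (e.g. midnight)
    process(events)                # one pass over everything that accumulated

def stream_ingest(events):
    """Streaming: act on every event the moment it arrives."""
    for event in events:
        process([event])           # immediate, per-event processing

batch_ingest(feed)    # 1 run covering 4 events
stream_ingest(feed)   # 4 runs, 1 event each
```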

2. Change Data Capture (CDC)

  • When we make changes to a table in a database, e.g., inserting new data, updating it, or deleting it, we want these events to be tracked and captured. This is where CDC comes in.
  • It is a technique for tracking only the changes made to our database tables and sending those changes to another system: the process of recognizing when data has changed in a source system so that downstream systems can act on that change. This removes the need to copy or export the entire table every time a change occurs, which would be painful with large tables.
  • With CDC, changes are tracked and recorded, which is crucial for making sure the correct changes flow through to downstream data. Use cases:
  • CDC keeps a database, data warehouse, and data lake in sync by replicating changes between them.
  • We can run stream processing based on data changes, e.g., when a customer's record changes, CDC can feed that event into a stream processor that sends the customer a message.
  • In analytics, log-based CDC can deliver near-real-time data by moving changes from a relational database through Kafka into a Snowflake data warehouse.
  • In e-commerce, it captures updated product prices and inventory so customers always see current information on the products they want to purchase. A minimal sketch of applying CDC events follows the comparison table below.
| Characteristic | Change Data Capture (CDC) | Traditional ETL |
| --- | --- | --- |
| Data Freshness | Near real-time or low-latency updates | Periodic (batch-based) updates |
| Data Movement | Only changes (inserts, updates, deletes) are captured and processed | Full dataset or large batches are reprocessed |
| Processing Overhead | Lower (less data processed, smaller footprint) | Higher (processes entire datasets repeatedly) |
| Impact on Source | Minimal (lightweight change logs or event streams) | Higher (can stress source systems during extraction) |
| Complexity | More complex to implement, requires log reading or triggers | Easier to implement, straightforward extraction process |
| Typical Tools | Debezium, Oracle GoldenGate, AWS DMS | Apache Airflow, AWS Glue, Talend, Informatica |
| Use Cases | Real-time analytics, replication, event-driven pipelines | Periodic reporting, data warehousing, historical loads |
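
As an illustration, here is a minimal Python sketch of applying a stream of change events to a downstream copy of a table. The event shape (`op`, `key`, `row`) is an assumption made for this example, not the output format of any particular CDC tool such as Debezium.

```python
# Downstream replica of the source table: primary key -> latest version of the row.
replica = {}

def apply_change(event):
    """Apply one captured change (insert/update/delete) to the replica."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]    # upsert the latest version of the row
    elif op == "delete":
        replica.pop(key, None)         # drop the row if it exists

# A tiny simulated change log captured from the source table.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Britney", "country": "Tanzania"}},
    {"op": "update", "key": 1, "row": {"name": "Britney", "country": "Kenya"}},
    {"op": "delete", "key": 1},
]

for event in change_log:
    apply_change(event)

print(replica)   # {} -- the row was inserted, updated, then deleted downstream too
```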

3. Idempotency

  • This is all about ensuring an operation produces no duplicates, especially on retries after a system failure. When we click the buy button twice while purchasing a product, we don't want to be charged twice. In a nutshell, no matter how many times I run an operation, the result is the same as running it once.
  • In data engineering, if we are inserting data into a system and the job fails for one reason or another, idempotency ensures the retry does not insert the same data again and create duplicates. A minimal sketch is shown below.
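
Here is a minimal sketch of an idempotent write using SQLite's upsert clause; the `payment_id` column is an invented example of a natural idempotency key that deduplicates retries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount REAL)")

def record_payment(payment_id, amount):
    """Idempotent write: the primary key makes retries of the same payment harmless."""
    conn.execute(
        "INSERT INTO payments (payment_id, amount) VALUES (?, ?) "
        "ON CONFLICT(payment_id) DO NOTHING",
        (payment_id, amount),
    )
    conn.commit()

# The same operation retried three times still leaves exactly one row.
for _ in range(3):
    record_payment("order-42", 19.99)

print(conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0])   # 1
```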

4. OLTP vs OLAP

| Characteristic | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
| --- | --- | --- |
| Primary Purpose | Handles day-to-day transactional data (insert, update, delete) | Performs complex analysis and reporting on aggregated historical data |
| Data Type | Current, real-time operational data | Historical, aggregated, and summarized data |
| Typical Operations | Short, simple queries (e.g., insert a record, update account balance) | Complex queries (e.g., trend analysis, forecasting, aggregations) |
| Performance Focus | Optimized for fast write and read of individual records | Optimized for read-heavy workloads and complex joins/aggregations |
| Data Volume | Small to moderate per transaction, but high volume of transactions | Large datasets, often in terabytes or petabytes |
| Schema Design | Highly normalized to avoid redundancy | Denormalized or star/snowflake schemas to speed up query performance |
| Examples | PostgreSQL, MySQL, Oracle Database | Snowflake, Amazon Redshift, Google BigQuery |
| Use Case Example | Recording bank transactions, e-commerce purchases, inventory updates | Sales trend analysis, customer segmentation, business intelligence dashboards |
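
To make the workload difference tangible, here is a small Python sketch using SQLite; the table and values are invented, and in practice the transactional store and the analytical warehouse would be separate systems, as the table above notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: short, single-record writes as each sale happens.
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("Nairobi", 120.0))
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("Dodoma", 80.0))
conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("Nairobi", 40.0))
conn.commit()

# OLAP-style work: scan and aggregate the history to answer an analytical question.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)   # Dodoma 80.0 / Nairobi 160.0
```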

5. Columnar vs Row-based Storage

  • As the name suggests, row-based storage stores data row by row, like a spreadsheet where all the information about a customer sits in one row. Think of a banking system where, when we make a transaction, the rows involved change.
  • Columnar storage takes a different approach. Instead of storing data row by row for each user or item, we store all the values of the same column together. Here is an example:

Row Storage:

| Cust_ID | Name | Age | Country |
| --- | --- | --- | --- |
| 001 | Britney | 25 | Tanzania |
| 002 | Bill | 34 | Kenya |

Columnar Storage:

  • Cust_ID → 001, 002
  • Name → Britney, Bill
  • Age → 25, 34
  • Country → Tanzania, Kenya

Columnar Storage is good for scanning huge datasets where we don't need every column, as the sketch below shows.
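
A plain Python sketch of the two layouts (the customer records above, held in memory) illustrates the difference: an analytical query like "average age" touches only one column in the columnar layout.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"cust_id": "001", "name": "Britney", "age": 25, "country": "Tanzania"},
    {"cust_id": "002", "name": "Bill", "age": 34, "country": "Kenya"},
]

# Columnar layout: all values of the same column are stored together.
columns = {
    "cust_id": ["001", "002"],
    "name": ["Britney", "Bill"],
    "age": [25, 34],
    "country": ["Tanzania", "Kenya"],
}

# Analytical question: what is the average age?
# The row layout walks every record, dragging every field along with it...
avg_from_rows = sum(r["age"] for r in rows) / len(rows)

# ...while the columnar layout reads only the one column it needs.
avg_from_columns = sum(columns["age"]) / len(columns["age"])

print(avg_from_rows, avg_from_columns)   # 29.5 29.5
```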

6. Partitioning

  • This is a technique of dividing large datasets into smaller chunks called partitions. These partitions can be stored, queried, and managed as individual tables, though they logically belong to the same dataset.
  • There are four approaches you need to know.
    Horizontal partitioning splits rows based on criteria like date ranges:
    -- Orders table partitioned by year
    CREATE TABLE orders_2023 (...);
    CREATE TABLE orders_2024 (...);
    CREATE TABLE orders_2025 (...);
    Vertical partitioning splits columns - frequently accessed columns in one partition, rarely used ones in another.
    Hash partitioning distributes rows based on a hash function.
    Range partitioning uses value ranges like dates or IDs.

  • Partitioning helps improve database performance and scalability. For instance, searching for a data point in the entire table takes longer and uses more resources than searching for it in a specific partition. That's why data is stored as partitions.

  • It is useful in OLAP operations. For example, applying cube partitioning (a data partitioning technique) divides the data into smaller cubes based on specific criteria. When you query the data, the search is performed on a particular cube, skipping irrelevant ones and reducing I/O operations. The file-layout sketch below applies the same idea outside a database.
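
The same idea applies to files in a data lake: records are written under one directory per partition key, so a query can skip directories it does not need. A minimal Python sketch, assuming invented sales records and a local `sales/` output directory:

```python
import csv
from collections import defaultdict
from pathlib import Path

sales = [
    {"order_id": 1, "order_date": "2024-03-01", "amount": 120.0},
    {"order_id": 2, "order_date": "2024-03-01", "amount": 80.0},
    {"order_id": 3, "order_date": "2024-03-02", "amount": 40.0},
]

# Group rows by the partition key (order_date).
partitions = defaultdict(list)
for row in sales:
    partitions[row["order_date"]].append(row)

# Write one directory per partition, e.g. sales/order_date=2024-03-01/part-0.csv
for day, rows in partitions.items():
    part_dir = Path("sales") / f"order_date={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# A reader that only cares about 2024-03-01 opens just that directory and skips the rest.
```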

7. ETL vs ELT

ETL as the acronym suggests, consists of three primary steps:

Extract: Data is gathered from different source systems.

Transform: Data is then transformed into a standardized format (cleansing, aggregation, enrichment, etc.).

Load: The transformed data is loaded into a target data warehouse or another repository.

  • ETL is suitable where data sources are small scale and transformations are complex.
  • It also fits where data security is a priority, requiring transformations to mask or encrypt sensitive data before it lands in a warehouse.

ELT takes a slightly different approach:

Extract: Just as with ETL, data is collected from different sources.

Load: Raw data is directly loaded into the target system.

Transform: Transformations take place within the data warehouse.

  • ELT’s popularity is tied to the rise of cloud warehouses like Snowflake, BigQuery, and Redshift.
  • Flexibility: Raw data is loaded first, allowing transformation logic to be decided later.
  • Efficiency: Uses the power of cloud warehouses for faster, scalable transformations.
  • Best for large datasets: leverages massive parallel processing in modern warehouses. A side-by-side sketch follows the comparison table below.
| Feature / Aspect | ETL (Extract → Transform → Load) | ELT (Extract → Load → Transform) |
| --- | --- | --- |
| Process Order | Extract → Transform → Load | Extract → Load → Transform |
| Where Transform Happens | In external ETL tool/staging server | Inside the data warehouse |
| Best For | Small datasets, complex transformations, pre-load validation | Large datasets, flexible transformations, cloud processing |
| Performance Dependency | Dependent on ETL tool/server power | Dependent on warehouse processing power |
| Data Security | Mask/encrypt before loading | Masking happens after loading |
| Flexibility | Low — defined upfront | High — can adapt after loading |
| Scalability | Limited by ETL infrastructure | Highly scalable with cloud MPP |
| Speed for Large Data | Slower — must transform first | Faster — transforms in parallel inside warehouse |
| Infrastructure Cost | Requires powerful ETL servers | Shifts workload to warehouse, reducing ETL server needs |
| Example Tools | Talend, Informatica, Apache NiFi | dbt, SQL in Snowflake/BigQuery/Redshift |
| Common Use Cases | Finance, healthcare (pre-load security) | Modern analytics, data lakes, ML pipelines |
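
As a toy side-by-side, here is a Python sketch that uses an in-memory SQLite database to stand in for the warehouse; the raw records and table names are invented for illustration.

```python
import sqlite3

raw_events = [
    {"customer": " Alice ", "amount": "120"},
    {"customer": " Bob", "amount": "80"},
]

warehouse = sqlite3.connect(":memory:")

# --- ETL: transform outside the warehouse, then load only the clean result. ---
cleaned = [(e["customer"].strip(), float(e["amount"])) for e in raw_events]
warehouse.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

# --- ELT: load the raw data first, then transform with SQL inside the warehouse. ---
warehouse.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO sales_raw VALUES (?, ?)",
    [(e["customer"], e["amount"]) for e in raw_events],
)
warehouse.execute(
    "CREATE TABLE sales_elt AS "
    "SELECT TRIM(customer) AS customer, CAST(amount AS REAL) AS amount FROM sales_raw"
)

print(warehouse.execute("SELECT * FROM sales_elt").fetchall())   # [('Alice', 120.0), ('Bob', 80.0)]
```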

8. CAP Theorem

  • The CAP theorem says that in a distributed system, you can only have two out of three guarantees at the same time:
    Consistency: every node in the system always has the same up-to-date data.
    Availability: the system keeps working and responding to requests, even if some parts fail.
    Partition tolerance: the system can still operate even if there are network problems that prevent nodes from communicating.
    Because you can’t have all three fully at once, systems usually choose two:
  • CA systems (e.g., classic relational databases) focus on consistency and availability, but they struggle if the network is split.
  • CP systems (e.g., MongoDB in certain setups) maintain consistency and handle network issues, but may reject some requests to do so.
  • AP systems (e.g., Cassandra) keep running and available during network issues, but data across nodes may not match immediately. In practice, many modern distributed databases aim for eventual consistency, meaning they prioritize availability and partition tolerance, and ensure all nodes agree on the data over time.

9. Windowing in Streaming

  • When you’re dealing with streaming data (like tweets, sensor readings, or transactions), the data is constantly flowing — it never stops.
  • If you want to do analytics, you can’t wait forever for all the data. Instead, you take small “views” of the stream over time. Think of it like this: a window is a frame through which you look at your data stream for a limited time. You move the window, and you see new parts of the stream.

Why We Need Windowing:
- Streaming data is infinite — we can't wait until it ends to process it.
- Windowing lets us process and analyze data in small, manageable parts.
- It allows real-time aggregation like counting, averaging, or summing data over a time period, as the tumbling-window sketch below shows.
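
Here is a minimal Python sketch of a tumbling (fixed, non-overlapping) window that sums amounts per 60-second bucket; the event timestamps are invented, and a real stream processor such as Flink or Kafka Streams would manage this for you continuously.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

# Simulated stream: (event_timestamp_in_seconds, amount)
events = [(5, 10.0), (42, 5.0), (61, 7.5), (118, 2.5), (125, 1.0)]

# Tumbling window: assign each event to a fixed 60-second bucket and aggregate it there.
window_totals = defaultdict(float)
for ts, amount in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    window_totals[window_start] += amount

for start, total in sorted(window_totals.items()):
    print(f"window [{start}, {start + WINDOW_SECONDS}): total = {total}")
# window [0, 60): total = 15.0
# window [60, 120): total = 10.0
# window [120, 180): total = 1.0
```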

10. DAGs and Workflow Orchestration

  • Data workflows require precise coordination to ensure that tasks are executed in the right order. The Directed Acyclic Graph (DAG) is a powerful tool for managing these workflows efficiently.
  • Directed means having a specific direction, indicating a one-way relationship between tasks.
  • The key thing about DAGs is that they're acyclic, meaning that once you start at one task, you can only move forward, never returning to a previous task. This ensures tasks can be executed in order without leading to infinite loops. DAGs often have a hierarchical structure, where tasks are organized into levels or layers. No loops allowed (you can’t go back from Task B to Task A and run forever). Importance of DAGs:
  • DAGs enforce a logical execution order, ensuring that tasks are executed sequentially based on their dependencies. This prevents errors and inconsistencies from running tasks out of order. Also, if one step fails, DAGs can identify and re-run the affected tasks, saving time and effort.
  • They help track failures (you know exactly which step broke).

Workflow Orchestration: orchestration tools read your DAG and execute it properly. They make sure tasks run at the right time and retry them if something fails.
Example: an ETL data pipeline (extract data → clean data → load into warehouse) orchestrated with Apache Airflow, sketched below.
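
A minimal sketch of that pipeline as an Airflow 2.x DAG (assuming Airflow is installed; the task bodies are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")        # placeholder task logic

def clean():
    print("cleaning and validating the data")    # placeholder task logic

def load():
    print("loading into the warehouse")          # placeholder task logic

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The arrows declare the directed, acyclic dependencies: extract -> clean -> load.
    extract_task >> clean_task >> load_task
```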

11. Retry Logic & Dead Letter Queues

  • When a task fails to execute, be it sending a message to a topic or processing a file, the system by default keeps trying to accomplish the task repeatedly. This can block subsequent messages or tasks and halt the system.
  • There are various ways to deal with such failures, and we can cap the maximum number of retries:
    Immediate retry: try again instantly after failure.
    Fixed interval: wait the same amount of time before each retry.
    Exponential backoff: wait a little longer each time before retrying.
    Dead letter queues (DLQ): when we exhaust the retries, we can route the message to a special place called the DLQ. This keeps the main pipeline free and flowing and lets us engineers tackle the issue separately. A sketch combining backoff and a DLQ is shown below.
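
A minimal Python sketch combining exponential backoff with a dead letter queue, using a plain list to stand in for a real DLQ topic or table; the handler and message are invented.

```python
import time

dead_letter_queue = []   # stand-in for a real DLQ topic or table

def process_with_retries(message, handler, max_retries=3, base_delay=1.0):
    """Try the handler; back off exponentially between attempts, then dead-letter the message."""
    for attempt in range(1, max_retries + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: park the message so the main pipeline keeps flowing.
                dead_letter_queue.append({"message": message, "error": str(exc)})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))   # waits 1s, 2s, 4s, ...

def flaky_handler(message):
    raise ValueError("downstream system unavailable")     # always fails, just for the demo

process_with_retries({"order_id": 42}, flaky_handler)
print(dead_letter_queue)   # the failed message and its error, ready for separate inspection
```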

12. Backfilling & Reprocessing

  • Backfilling refers to the process of retroactively filling in missing or correcting incorrect data in a pipeline or dataset. In simpler terms, if some historical data wasn’t processed correctly (or at all) the first time, backfilling allows you to reprocess that past data. This ensures your data warehouse or reports reflect a consistent and complete history.
  • Reprocessing is when you run the processing again on the same data, often because you fixed a bug or improved the logic. A small backfill sketch follows the table below.
| Feature | Backfilling | Reprocessing |
| --- | --- | --- |
| Goal | Fill missing past data | Correct or improve already processed data |
| Trigger | Missing data | Incorrect or outdated data |
| Data Timeframe | Usually past unprocessed periods | Usually already processed periods |
| Example | Load March 1–3 sales | Fix all customer ages after formula change |
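
A minimal Python sketch of a backfill: re-run the normal daily job for each missed day in a past range. The job body is a placeholder; an orchestrator such as Airflow offers a backfill command that does this over a DAG's schedule.

```python
from datetime import date, timedelta

def run_daily_job(day):
    """Placeholder for the normal daily run (extract, transform, load for one day)."""
    print(f"processing partition for {day}")

def backfill(start, end):
    """Re-run the daily job for every day in [start, end], filling in the gap."""
    day = start
    while day <= end:
        run_daily_job(day)
        day += timedelta(days=1)

# The March 1-3 loads never ran, so fill them in retroactively.
backfill(date(2025, 3, 1), date(2025, 3, 3))
```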

13. Data Governance

  • This is a set of rules and principles for how we collect, handle, and use data. The goal is to ensure that data is accurate, consistent, and available for use while protecting data privacy and security.
  • It involves a set of policies, procedures, and standards. The main pillars that build up data governance are: Ownership and Accountability, Data Quality, Data Protection and Safety, Data Use and Availability, and Data Management.
| Component | What It Means | Real-Life Example |
| --- | --- | --- |
| Data Ownership | Who’s responsible for each dataset | Sales manager owns the “Customer List” |
| Data Quality | Accuracy, completeness, consistency | Correcting typos in customer addresses |
| Data Catalog | Central list of all data and definitions | A searchable “data dictionary” |
| Access Control | Who can view/edit what data | Only HR can see employee salaries |
| Data Policies | Rules for usage, retention, and sharing | Delete customer data after 5 years |
| Compliance | Following laws and industry rules | GDPR: get consent before collecting personal data |
| Auditability | Ability to track who accessed or changed data | Logging all changes to financial records |
