- Imagine you have raw, messy data coming from multiple sources, e.g. apps, websites, and databases.
- A data engineer builds the pipelines, storage, and processing systems to transform that raw mess into clean, structured, reliable data that analysts, scientists, and AI models can actually use.
- Below I explain the 15 key concepts you will revisit on nearly every project. Each section explains the idea, why it matters, and where you are likely to meet it in the real world.
1. Batch vs Streaming Ingestion
- Batch ingestion refers to collecting data over a period of time and processing it in chunks, e.g. hourly, daily, or weekly.
- Streaming ingestion processes data in real time, as it arrives.
Aspect | Batch | Streaming |
---|---|---|
Latency | Minutes–hours | Seconds–milliseconds |
Tech Used | Apache Spark | Apache Kafka |
Use case | End-of-day reports | Real-time fraud detection |
Example | Payroll processing, BI reports | Stock price updates |
📌 Example:
Netflix might batch process viewing data daily for recommendations.
Financial systems in banks use streaming ingestion to flag suspicious transactions instantly.
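Here is a minimal sketch of the difference in plain Python. The file path, event source, and threshold below are hypothetical, not from any specific tool:

```python
import json
from datetime import datetime

def process_batch(file_path):
    """Batch: read a whole day's worth of events at once and process them in one chunk."""
    with open(file_path) as f:
        events = [json.loads(line) for line in f]
    daily_total = sum(e["amount"] for e in events)
    print(f"{datetime.now()}: processed {len(events)} events, total = {daily_total}")

def process_stream(event_source):
    """Streaming: handle each event the moment it arrives."""
    for event in event_source:          # e.g. an iterator over a Kafka topic
        if event["amount"] > 10_000:    # react immediately, per event
            print(f"Possible fraud: {event}")

# Batch run (e.g. triggered once a day by a scheduler):
# process_batch("orders_2025-07-01.jsonl")

# Streaming run (e.g. fed by a consumer that yields events forever):
# process_stream(kafka_consumer)
```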
2. Change Data Capture (CDC)
- Change Data Capture refers to the method used to identify and capture changes made to the source database.
- Think of it like your source database having a log that records every single change made to it. CDC tools read this log, find the new entries since last time, and send just those changes to the target system.
- Benefits:
- Speed: Only moving changes is much faster than copying everything.
- Efficiency: Uses less network bandwidth and computing power.
- Near Real-Time: Changes can be sent almost instantly (seconds/minutes).
- Less Disruption: Doesn’t slow down your main database like full copies do.
📌 Example:
- Jumia (an online store) has a database that tracks orders.
- Every time a customer places, updates, or cancels an order, CDC detects only that change (new order, address update, or a canceled item) and instantly syncs it to their analytics database.
- Instead of copying all orders every hour, which is slow, CDC streams just the updates, keeping reports fast, accurate, and up to date in real time.
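A simplified sketch of the CDC idea in plain Python, reading only the new entries from a change log and applying them to a target table. The log format and table names are made up for illustration:

```python
# Hypothetical change log entries: (sequence_number, operation, row)
change_log = [
    (1, "INSERT", {"order_id": 101, "status": "placed"}),
    (2, "UPDATE", {"order_id": 101, "status": "shipped"}),
    (3, "DELETE", {"order_id": 99}),
]

analytics_orders = {99: {"order_id": 99, "status": "placed"}}
last_synced = 0  # sequence number we stopped at last run

def sync_changes(log, target, since):
    """Apply only the changes made after `since`, instead of copying every row."""
    max_seq = since
    for seq, op, row in log:
        if seq <= since:
            continue  # already synced
        if op in ("INSERT", "UPDATE"):
            target[row["order_id"]] = row
        elif op == "DELETE":
            target.pop(row["order_id"], None)
        max_seq = seq
    return max_seq

last_synced = sync_changes(change_log, analytics_orders, last_synced)
print(analytics_orders)   # {101: {'order_id': 101, 'status': 'shipped'}}
```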
3. Idempotency
- Idempotency ensures that running the same operation multiple times produces the same result as running it once.
- This keeps data consistent, usually through the use of idempotency keys.
- The importance is in its ability to handle failures and retries safely. Without idempotency, retrying a failed operation could lead to data duplication or other inconsistencies.
📌 Example:
- In financial services payment processing, idempotency keys prevent duplicate payments during network failures or system retries.
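A minimal sketch of an idempotency key check for a payment handler. The in-memory store and function below are hypothetical stand-ins for a real database and payment gateway:

```python
processed = {}  # idempotency_key -> result (a real system would persist this in a database)

def charge_card(idempotency_key, amount):
    """Charging twice with the same key has the same effect as charging once."""
    if idempotency_key in processed:
        return processed[idempotency_key]             # replay the stored result, don't charge again
    result = {"status": "charged", "amount": amount}  # pretend call to the payment gateway
    processed[idempotency_key] = result
    return result

first = charge_card("order-123", 500)
retry = charge_card("order-123", 500)   # network retry: safe, no duplicate charge
assert first == retry
```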
4. OLTP vs OLAP
- Online Transaction Processing (OLTP) handles thousands of small transactions.
- Online Analytical Processing (OLAP) scans billions of rows to analyze trends. Conflating the two leads to slow queries or blocked checkout pages.
Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
---|---|---|
Purpose | Day-to-day operations | Data analysis & reporting |
Query type | Short, frequent | Long, complex |
Storage | Row-based | Columnar |
Example | Banking app transactions | Business intelligence dashboard |
5. Columnar vs Row-based Storage
- Row-based storage saves entire records sequentially, ideal for accessing full rows quickly (e.g., in transactions).
- Columnar storage groups data by columns, excelling in compression and analytics where you scan specific fields.
Storage Type | Pros | Cons |
---|---|---|
Row-based | Fast writes, easy transactions | Poor for analytics |
Columnar | Efficient reads, compression | Slower writes |
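A small sketch of the practical difference using pandas, with CSV standing in for a row-oriented format and Parquet for a columnar one. It assumes pandas and pyarrow are installed; the file names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1, 1_001),
    "customer": ["c%d" % (i % 50) for i in range(1, 1_001)],
    "amount": [i * 1.5 for i in range(1, 1_001)],
})

# Row-oriented: each record is written as one line; good for writing whole records.
df.to_csv("orders.csv", index=False)

# Columnar: values are grouped by column and compressed; good for analytics.
df.to_parquet("orders.parquet")

# An analytical query only needs the `amount` column, so a columnar reader
# can load just that column instead of every full row.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts["amount"].sum())
```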
6. Partitioning
- Partitioning is the dividing of a large dataset into smaller, more manageable parts.
- Horizontal = split by rows, e.g., users_2025_q3.
- Vertical = split by columns, keeping hot fields in a narrow table.
📌 Example:
- A hospital splits its transactions table by patient region so a Mombasa query never scans Nairobi data.
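A sketch of horizontal partitioning with pandas/pyarrow, writing one folder per region so a query for Mombasa only touches that partition. The column and path names are illustrative, and pyarrow is assumed to be installed:

```python
import pandas as pd

transactions = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "region": ["Mombasa", "Nairobi", "Mombasa", "Nairobi"],
    "amount": [200, 350, 120, 90],
})

# Writes transactions/region=Mombasa/... and transactions/region=Nairobi/...
transactions.to_parquet("transactions", partition_cols=["region"])

# A query for Mombasa reads only the Mombasa partition, never the Nairobi files.
mombasa = pd.read_parquet("transactions/region=Mombasa")
print(mombasa)
```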
7. ETL vs ELT
- In ETL, data is transformed before it is loaded.
- In ELT, raw data lands first and SQL transforms run inside the warehouse.
Step Order | ETL | ELT |
---|---|---|
1 | Extract | Extract |
2 | Transform (Spark/SSIS) | Load |
3 | Load | Transform (SQL/dbt) |
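A sketch of the two orderings in plain Python, with SQLite standing in for the warehouse. The table names and the cleaning rule are made up for illustration:

```python
import sqlite3

raw_rows = [(" alice ", 100), ("BOB", -5), ("carol", 40)]
conn = sqlite3.connect(":memory:")

# --- ETL: transform first, then load only clean data into the warehouse ---
conn.execute("CREATE TABLE customers_etl (name TEXT, amount INT)")
cleaned = [(name.strip().lower(), amt) for name, amt in raw_rows if amt >= 0]
conn.executemany("INSERT INTO customers_etl VALUES (?, ?)", cleaned)

# --- ELT: load raw data as-is, then transform inside the warehouse with SQL ---
conn.execute("CREATE TABLE customers_raw (name TEXT, amount INT)")
conn.executemany("INSERT INTO customers_raw VALUES (?, ?)", raw_rows)
conn.execute("""
    CREATE TABLE customers_elt AS
    SELECT lower(trim(name)) AS name, amount
    FROM customers_raw
    WHERE amount >= 0
""")

print(conn.execute("SELECT * FROM customers_etl").fetchall())
print(conn.execute("SELECT * FROM customers_elt").fetchall())
```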
8. CAP Theorem
The CAP theorem states that a distributed system can guarantee at most two of the following three properties at the same time:
- Consistency (C) – Every read receives the most recent write (all users see the same data at the same time).
- Availability (A) – Every request gets a response (even if some parts of the system fail).
- Partition Tolerance (P) – The system keeps working even if network failures happen (e.g., servers can’t talk to each other).
Since network partitions are unavoidable in practice, most systems end up trading off consistency against availability when a partition occurs.
9. Windowing in Streaming
- Windowing in streaming refers to the technique of dividing continuous data streams into smaller, manageable segments called windows for easier processing and analysis.
📌 Example:
A sliding window might aggregate website clicks in the last 5 minutes, updating every minute.
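A toy sketch of that sliding window in plain Python: count clicks from the last 5 minutes, recomputed each time a new event arrives. The timestamps and event list are made up:

```python
from collections import deque

WINDOW_SECONDS = 5 * 60   # 5-minute window

window = deque()          # holds timestamps of clicks inside the current window

def add_click(ts):
    """Add a click and return how many clicks fall inside the last 5 minutes."""
    window.append(ts)
    # Evict events that have slid out of the window.
    while window and window[0] <= ts - WINDOW_SECONDS:
        window.popleft()
    return len(window)

# Simulated click timestamps, in seconds.
for ts in [0, 60, 130, 290, 301, 610]:
    print(f"t={ts}s -> clicks in last 5 min: {add_click(ts)}")
```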
10. DAGs & Workflow Orchestration
A DAG (Directed Acyclic Graph) models a pipeline as tasks with dependencies:
- Directed: Steps run in order.
- Acyclic: No loops (the pipeline won’t rerun "Fetch Data" after "Clean Data").
- Graph: Keeps tasks organized, just like a family tree organizes generations.
Fetch Data → Clean Data → Load to Database → Generate Report
- Workflow orchestration in data engineering is the process of automating and managing the execution of a series of tasks or jobs that make up a data pipeline.
Think of it like a conductor leading an orchestra, where each musician (or task) plays their part at the right time and in the correct order to create a harmonious piece of music (the final data product).
A data pipeline is a sequence of steps—like extracting data from a source, cleaning it, and loading it into a database.
Without orchestration, you'd have to manually run each of these steps, which is inefficient and prone to errors. An orchestrator, however, handles this for you.
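As a sketch, here is the pipeline above expressed as an Apache Airflow DAG. Airflow is used only as one example orchestrator (assumes Airflow 2.x); the dag_id, schedule, and task bodies are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data():   print("fetching raw data")
def clean_data():   print("cleaning data")
def load_to_db():   print("loading into the database")
def build_report(): print("generating report")

with DAG(
    dag_id="daily_sales_pipeline",      # placeholder name
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    load = PythonOperator(task_id="load_to_database", python_callable=load_to_db)
    report = PythonOperator(task_id="generate_report", python_callable=build_report)

    # Directed, acyclic: Fetch Data -> Clean Data -> Load to Database -> Generate Report
    fetch >> clean >> load >> report
```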
11. Retry Logic & Dead-Letter Queues
Retry Logic: When a system fails to process a message (due to temporary issues like network errors), it automatically retries a few times before giving up.
Dead-Letter Queue (DLQ): If a message fails after all retries, it goes to the DLQ so engineers can check why it failed (maybe the data was corrupted or the system was down for too long).
📌 Example:
- Retry Logic: Like when your phone says "Retrying call…" after a dropped signal.
- DLQ: Like an "Undelivered Mail" folder in your email—where failed messages go for review.
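A simplified sketch of retry-then-dead-letter handling in plain Python. The processing function, retry count, and DLQ list are all illustrative:

```python
import time

dead_letter_queue = []   # in a real system this would be a separate queue or topic

def process(message):
    """Placeholder processor: pretend messages marked 'bad' always fail."""
    if message.get("bad"):
        raise ValueError("corrupted payload")
    print(f"processed {message['id']}")

def handle(message, max_retries=3, delay_seconds=1):
    for attempt in range(1, max_retries + 1):
        try:
            process(message)
            return
        except Exception as err:
            print(f"attempt {attempt} failed: {err}")
            time.sleep(delay_seconds)
    # All retries exhausted: park the message for a human to inspect later.
    dead_letter_queue.append(message)

handle({"id": 1})
handle({"id": 2, "bad": True})
print("DLQ:", dead_letter_queue)
```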
12. Backfilling & Reprocessing
- Backfilling is the process of re-running improved or corrected pipeline logic against past data to maintain consistency across your entire dataset.
- Think of it like updating old records in a filing system when you discover a better organizational method.
📌 Example:
- A customer classification system has been incorrectly labeling premium customers as standard users for six months.
- Simply fixing the bug going forward leaves you with six months of inaccurate historical data.
- Backfilling lets you reprocess that historical data with the corrected logic.
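A sketch of a backfill loop that re-runs the corrected classification logic over past daily partitions. The date range, threshold, and storage functions are hypothetical:

```python
from datetime import date, timedelta

def classify(customer):
    """Corrected logic: premium customers are no longer mislabeled as standard."""
    return "premium" if customer["lifetime_spend"] >= 1000 else "standard"

def load_partition(day):
    """Stand-in for reading one day's customer records from storage."""
    return [{"id": 1, "lifetime_spend": 1500}, {"id": 2, "lifetime_spend": 200}]

def write_partition(day, rows):
    """Stand-in for overwriting that day's output with corrected results."""
    print(day, rows)

# Backfill: walk every affected day and reprocess it with the fixed logic.
start, end = date(2025, 1, 1), date(2025, 1, 3)
day = start
while day <= end:
    customers = load_partition(day)
    corrected = [{**c, "tier": classify(c)} for c in customers]
    write_partition(day, corrected)
    day += timedelta(days=1)
```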
13. Data Governance
Data governance comprises the policies, procedures, and technical controls that ensure data remains accurate, secure, and compliant throughout its lifecycle.
Key governance concepts include:
- Data Lineage: Tracking where data comes from and how it's transformed.
- Access Controls: Ensuring only authorized users can access sensitive information.
- Quality Monitoring: Detecting and alerting on data anomalies.
📌 Example: A Healthcare Org.
- Governance ensures patient data remains private (compliance), maintains accuracy for medical decisions (quality), and provides audit trails for regulatory inspections (lineage).
- Technical implementations might include role-based access controls and logging of data access patterns.
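A tiny sketch of those two controls in plain Python: a role check before reading sensitive data, plus an access-log entry. The roles, users, and log format are made up:

```python
import logging

logging.basicConfig(level=logging.INFO)
access_log = logging.getLogger("data_access")

ALLOWED_ROLES = {"patient_records": {"doctor", "auditor"}}  # role-based access control

def read_dataset(user, role, dataset):
    if role not in ALLOWED_ROLES.get(dataset, set()):
        access_log.warning("DENIED user=%s role=%s dataset=%s", user, role, dataset)
        raise PermissionError(f"{role} may not read {dataset}")
    access_log.info("GRANTED user=%s role=%s dataset=%s", user, role, dataset)
    return ["...patient rows..."]  # stand-in for the actual query

read_dataset("amina", "doctor", "patient_records")    # allowed and logged
# read_dataset("eve", "intern", "patient_records")    # would be denied and logged
```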
14. Time Travel & Data Versioning
- Time travel enables querying data at specific historical points, essential for debugging, compliance, and analysis.
- Platforms like Snowflake, Delta Lake, and BigQuery use versioning to implement this capability.
- Data versioning is the tracking and managing of changes to datasets over time, allowing you to access, compare, or revert to previous versions if needed, just like "save points" in a video game or "undo history" in a document.
📌 Example:
- When investigating data quality issues like unexpected monthly report values, time travel lets you query pre-issue states to isolate when problems emerged.
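A sketch of that investigation using Delta Lake time travel from PySpark. It assumes pyspark and delta-spark are installed and that a Delta table already exists at the path shown; the path and version number are placeholders:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Delta Lake needs these two settings on the Spark session.
builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Current state of the table.
current = spark.read.format("delta").load("/data/monthly_report")

# The same table as it looked at an earlier version, before the suspect change.
before_issue = (
    spark.read.format("delta")
    .option("versionAsOf", 42)          # placeholder version number
    .load("/data/monthly_report")
)

# Compare the two snapshots to isolate when the bad values appeared.
current.exceptAll(before_issue).show()
```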
15. Distributed Processing Concepts
- Distributed processing splits large computations across multiple machines. Frameworks like Apache Spark and Flink excel at this.
- Benefits:
- Scalability
- Fault tolerance
- Parallel processing for speed
Frameworks such as MapReduce and Apache Spark slice a job across many nodes for parallel execution. Spark keeps intermediate datasets in memory, outperforming MapReduce by up to 100× for iterative algorithms.
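A small PySpark sketch of the idea: the DataFrame is split into partitions and the aggregation runs on them in parallel across the executors. The local[*] master and sample data are just for illustration; on a real cluster the same code fans out across many machines:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# local[*] runs one worker thread per CPU core on this machine.
spark = SparkSession.builder.master("local[*]").appName("distributed_demo").getOrCreate()

orders = spark.createDataFrame(
    [("Nairobi", 120.0), ("Mombasa", 80.0), ("Nairobi", 60.0), ("Kisumu", 45.0)],
    ["region", "amount"],
)

# Each partition computes partial sums in parallel; Spark then combines them.
totals = orders.groupBy("region").agg(F.sum("amount").alias("total_sales"))
totals.show()

print("number of partitions:", orders.rdd.getNumPartitions())
```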