Samuel Wachira

Posted on May 27

Foundational Concepts in Data Engineering

#dataengineering #luxdev

Data engineering is the practice of designing and building systems for collecting, storing and analyzing data at scale. Data engineers act as architects of a company's data infrastructure, building the pipelines that transform raw, messy data into clean, accessible formats for data scientists and analysts.

Understanding the core concepts behind data engineering is important before working with tools such as Apache Kafka, Spark, Airflow, Hadoop or cloud platforms. This article explains the most important foundational concepts in a beginner-friendly and practical way.

1. Batch vs Streaming Ingestion
Data Ingestion is the process of collecting and importing data into a system.
There are two main approaches:

Batch ingestion: Collects data over a period of time and processes it together in chunks.

For example:

A company exports sales records every night at midnight.
A payroll system processes employee payments once per month.

Characteristics

Data arrives in groups.
Easier to implement.
Good for historical analysis.

Common tools

Apache Airflow
Cron jobs

Streaming ingestion: Processes data continuously as it is generated.

For example:

Credit card fraud detection.
Social media feeds.
Stock market updates.

Characteristics

Near real-time processing
More complex architecture
Useful for live analytics.

Common tools

Apache Kafka
Spark Streaming
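
The difference between the two approaches can be sketched in plain Python. This is only an illustration, not how Airflow or Kafka work internally: the function names and the sample sales amounts are hypothetical. The batch function waits for the whole chunk and produces one result, while the streaming function produces an updated result after every record.

```python
from typing import Iterable, Iterator


def batch_total(records: list[float]) -> float:
    """Batch: the whole chunk is collected first, then processed once."""
    return sum(records)


def streaming_totals(records: Iterable[float]) -> Iterator[float]:
    """Streaming: the result is updated as each record arrives."""
    running = 0.0
    for amount in records:
        running += amount
        yield running  # a fresh result is available after every event


sales = [20.0, 5.0, 12.5]  # hypothetical sale amounts

nightly_total = batch_total(sales)            # one answer at the end of the day
live_totals = list(streaming_totals(sales))   # one answer per incoming sale
```

In a real pipeline the streaming input would be an unbounded source such as a Kafka topic rather than a list.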

2. Change Data Capture (CDC)
Change Data Capture is a method used to detect and track changes made to data in a database.
Instead of copying the entire database repeatedly, CDC captures only changes such as:

Inserts
Updates
Deletes

Why CDC Matters
Without CDC, systems may waste resources repeatedly copying unchanged data. CDC improves:

Efficiency.
Real-time synchronization.
Replication speed.

Example: An e-commerce company wants to sync orders from PostgreSQL into a data warehouse. Instead of reloading millions of records every hour, CDC transfers only the newly changed rows.

Common CDC tools

SQL Server CDC
Oracle GoldenGate
Debezium
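
As a rough sketch of the idea, the snippet below applies a short log of change events to a copy of the data instead of reloading everything. The event shape (`op`, `id`, `row`) is an assumption made for this example and is not the format used by Debezium or any other tool listed above.

```python
def apply_changes(target: dict[int, dict], changes: list[dict]) -> dict[int, dict]:
    """Apply insert, update and delete events to the target copy."""
    for change in changes:
        op, key = change["op"], change["id"]
        if op in ("insert", "update"):
            target[key] = change["row"]   # upsert the new version of the row
        elif op == "delete":
            target.pop(key, None)         # ignore deletes for rows we never had
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical warehouse copy of an orders table, keyed by order id.
warehouse_orders = {1: {"status": "paid"}, 2: {"status": "pending"}}

# Only three rows changed, so only three events are shipped.
change_log = [
    {"op": "update", "id": 2, "row": {"status": "paid"}},
    {"op": "insert", "id": 3, "row": {"status": "pending"}},
    {"op": "delete", "id": 1},
]

apply_changes(warehouse_orders, change_log)
```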

3. Idempotency
Idempotency means that performing the same operation multiple times produces the same final result as performing it once. This is extremely important in distributed systems because failures and retries are common.

Example:
Suppose a payment service retries a transaction after a network failure. Without idempotency, the customer may be charged twice. With idempotency, the system recognizes that the request has already been processed and does not charge again.

Why it matters
Pipelines may restart, replay events or retry failed jobs. Idempotent processing prevents:

Duplicate records
Double counting
Data corruption

Practical example: Using a unique transaction ID when inserting records helps avoid duplicates.
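
A minimal sketch of the unique transaction ID approach, using Python's built-in `sqlite3` module. The table name, column names and the `record_payment` helper are hypothetical. The primary key on `transaction_id` plus `INSERT OR IGNORE` means a retried insert leaves the table unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (transaction_id TEXT PRIMARY KEY, amount REAL NOT NULL)"
)


def record_payment(transaction_id: str, amount: float) -> None:
    """Insert a payment; a repeated transaction_id is silently skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO payments (transaction_id, amount) VALUES (?, ?)",
        (transaction_id, amount),
    )
    conn.commit()


record_payment("txn-001", 49.99)
record_payment("txn-001", 49.99)  # simulated retry after a network failure

payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

Other databases offer similar constructs (for example upserts), but the syntax differs from SQLite's.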

4. OLTP vs OLAP
These are two different types of database workloads.
OLTP (Online Transaction Processing)
OLTP systems handle day-to-day operational transactions.
Examples:

ATM withdrawals.
Online shopping.
Banking transactions.

Characteristics

Fast inserts and updates
Many small transactions.
High concurrency.

Common Databases

MySQL
PostgreSQL
SQL Server

OLAP (Online Analytical Processing)
OLAP systems are designed for analytics and reporting.
Examples:

Business intelligence dashboards.
Sales trend analysis.
Forecasting.

Characteristics

Large read-heavy queries.
Aggregations across millions of rows.
Historical analysis.

Common Databases

Snowflake
BigQuery
Amazon Redshift
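
The two workload shapes can be illustrated with one small SQLite table. This is only a sketch: the `orders` table and its values are made up, and real OLTP and OLAP systems are usually separate databases tuned for their own workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 15.0)],
)

# OLTP-style: a small, fast write that touches a single row by its key.
conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (30.0, 2))

# OLAP-style: a read-heavy aggregation that scans many rows.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```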

5. Columnar vs Row-Based Storage
Databases store data either by rows or by columns.
Row-Based Storage
Data is stored row by row. It is best for transactional systems where full rows are frequently accessed.
Advantages: Fast inserts and efficient row retrieval.

Common tools

MySQL
PostgreSQL

Columnar Storage
Data is stored column by column.
Advantages: Faster analytical queries, better compression and the ability to read only the needed columns.
Common tools

Parquet
ORC
BigQuery
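
To make the two layouts concrete, here is the same hypothetical table held in memory both ways. It is a simplified picture of the idea, not how MySQL or Parquet lay out bytes on disk.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "customer": "amina", "amount": 10.0},
    {"order_id": 2, "customer": "brian", "amount": 25.0},
    {"order_id": 3, "customer": "chloe", "amount": 15.0},
]

# Columnar layout: each column keeps all of its values together.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["amina", "brian", "chloe"],
    "amount": [10.0, 25.0, 15.0],
}

# Fetching one full record is natural in the row layout.
second_order = rows[1]

# Summing one column: the row layout walks every record,
# while the columnar layout reads only the "amount" values.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])
```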

6. Partitioning
Partitioning divides large datasets into smaller pieces to improve performance. As datasets grow, querying all records becomes slow and expensive. Partitioning allows systems to scan only relevant sections.
Common Partitioning Methods:

By date (year, month, day)
By region (country or city)
By user ID

Example: A log table containing five years of data can be partitioned by month. Instead of scanning all records, queries read only the required partitions.
Benefits:

Faster queries
Reduced storage scans
Better scalability
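
A small sketch of date partitioning, assuming hypothetical log entries with a `day` field. Entries are grouped under a `YYYY-MM` key, so a query for January touches only the January partition.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(logs: list[dict]) -> dict[str, list[dict]]:
    """Group log entries under a 'YYYY-MM' partition key."""
    partitions: dict[str, list[dict]] = defaultdict(list)
    for entry in logs:
        partitions[entry["day"].strftime("%Y-%m")].append(entry)
    return partitions


logs = [
    {"day": date(2024, 1, 15), "message": "login"},
    {"day": date(2024, 1, 20), "message": "logout"},
    {"day": date(2024, 2, 3), "message": "login"},
]

partitions = partition_by_month(logs)

# Only the January partition is read; February is never scanned.
january_logs = partitions.get("2024-01", [])
```

On disk, the same idea often appears as one directory or file per partition key.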

7. ETL vs ELT
Both ETL and ELT are methods of moving and preparing data.
ETL (Extract, Transform, Load)
Data is transformed before loading into storage.
Flow

Extract data.
Clean and transform it.
Load into warehouse.

Advantages: Cleaner warehouse and stronger data quality control.

Traditional ETL tools

Informatica
Talend

ELT (Extract, Load, Transform)
Raw data is loaded first, then transformed inside the warehouse.
Flow

Extract data.
Load raw data.
Transform data.

Advantages: Faster ingestion, more flexible and better for cloud warehouses.

Common tools

dbt
Snowflake
BigQuery

Key difference: ETL transforms data before it is stored, while ELT transforms it after it is stored.
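
The ordering difference can be sketched with an in-memory SQLite database standing in for the warehouse. The table names and the messy sample rows are hypothetical. In the ETL path the cleaning happens in Python before the load; in the ELT path the raw rows are loaded first and cleaned with SQL inside the warehouse.

```python
import sqlite3

raw_rows = [(" Alice ", "100"), ("BOB", "250")]  # hypothetical messy source data

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE etl_sales (customer TEXT, amount REAL)")
warehouse.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")

# ETL: transform outside the warehouse, then load the clean rows.
clean_rows = [(name.strip().lower(), float(amount)) for name, amount in raw_rows]
warehouse.executemany("INSERT INTO etl_sales VALUES (?, ?)", clean_rows)

# ELT: load the raw rows as they are, then transform with SQL in the warehouse.
warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
warehouse.execute(
    "CREATE TABLE elt_sales AS "
    "SELECT LOWER(TRIM(customer)) AS customer, CAST(amount AS REAL) AS amount "
    "FROM raw_sales"
)

etl_result = warehouse.execute(
    "SELECT customer, amount FROM etl_sales ORDER BY customer"
).fetchall()
elt_result = warehouse.execute(
    "SELECT customer, amount FROM elt_sales ORDER BY customer"
).fetchall()
```

Both paths end with the same clean table. With ELT the raw data also stays in the warehouse, so it can be transformed again later with different logic.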

8. CAP Theorem
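CAP Theorem explains the limitations of distributed systems. A distributed database can fully guarantee only two of the following three properties at the same time. Because network failures cannot be ruled out in practice, the real choice is usually between consistency and availability while a partition is happening: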

Consistency: Every user sees the same data at the same time.
Availability: The system always responds to requests.
Partition Tolerance: The system continues working even if network communication fails between servers.

Examples:

CP systems prioritize consistency, e.g. MongoDB, HBase.
AP systems prioritize availability, e.g. Cassandra, DynamoDB. Several of these systems let you tune the trade-off per request or per table.
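
The trade-off can be illustrated with a toy model of two replicas and a network link that can be cut. Everything here (`TinyCluster`, `Replica`, the `mode` flag) is invented for the example and does not describe how any of the databases above are implemented.

```python
class Replica:
    def __init__(self) -> None:
        self.value = None


class TinyCluster:
    """Two replicas joined by a link that can fail (a toy model only)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            self.a.value = self.b.value = value  # both replicas stay in sync
            return True
        if self.mode == "CP":
            return False       # refuse the write: consistent but unavailable
        self.a.value = value   # accept on one side: available but inconsistent
        return True


cp, ap = TinyCluster("CP"), TinyCluster("AP")
for cluster in (cp, ap):
    cluster.write("v1")
    cluster.partitioned = True  # the network link between replicas fails

cp_accepted = cp.write("v2")
ap_accepted = ap.write("v2")
```

After the partition, the CP cluster rejects the write and both replicas still agree on `v1`, while the AP cluster accepts it and its replicas temporarily disagree.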

9. Windowing in Streaming
Streaming systems process endless streams of data. Windowing helps organize this continuous data into manageable groups. Types of windows include:
Tumbling Window: Fixed, non-overlapping intervals. Example: One window every 5 minutes.
Sliding Window: Fixed-size windows that overlap. Example: A 10-minute window, updated every minute.
Session Window: Groups events based on user activity periods. Example: User browsing sessions.

Example: A food delivery company calculates the number of orders every 5 minutes using tumbling windows.
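
The tumbling-window count from the example can be sketched as follows. Event times are assumed to be seconds since the start of the stream, and the order times are made up. Each event is assigned to exactly one window by rounding its timestamp down to the window start.

```python
from collections import Counter


def tumbling_window_counts(event_times: list[int], window_seconds: int) -> dict[int, int]:
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts = Counter((t // window_seconds) * window_seconds for t in event_times)
    return dict(sorted(counts.items()))


# Hypothetical order timestamps in seconds.
order_times = [10, 120, 299, 300, 450, 610]

# 300 seconds = 5-minute windows: [0, 300), [300, 600), [600, 900)
orders_per_window = tumbling_window_counts(order_times, window_seconds=300)
```

Streaming engines add more on top of this, such as handling events that arrive late, which this sketch ignores.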

10. DAGs and Workflow Orchestration
DAG stands for Directed Acyclic Graph. It represents tasks connected in a workflow where dependencies are clearly defined and no task can depend on itself, directly or indirectly.
Example:

Extract data.
Clean data.
Transform data.
Load data into warehouse.

Each step depends on the previous one.

Workflow Orchestration
Orchestration tools automate and manage these workflows.
Responsibilities include:

Scheduling jobs.
Handling failures and retries.
Monitoring jobs.
Managing dependencies.

Popular tools

Apache Airflow
Luigi

Why DAGs Matter
They make pipelines organized, reproducible and reliable.
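
The four-step workflow above can be written as a DAG with the standard-library `graphlib` module (Python 3.9+). This is not Airflow or Luigi code; it only shows how declared dependencies determine a valid run order. The task names are taken from the example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "transform": {"clean"},
    "load": {"transform"},
}

# static_order() yields every task only after all of its dependencies.
run_order = list(TopologicalSorter(pipeline).static_order())
```

If a cycle were introduced, for example making `extract` depend on `load`, `TopologicalSorter` would raise a `CycleError`, which is why the graph must be acyclic.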

11. Retry Logic and Dead Letter Queues
Failures are normal in distributed systems, but good pipelines must handle them safely.

Retry Logic: Automatically attempts failed operations again.
Example: If an API request fails due to temporary network issues, the system retries after a short delay. Retries improve reliability and handle temporary failures.

Dead Letter Queue (DLQ)
A DLQ stores messages that repeatedly fail processing. Instead of crashing the system, problematic records are isolated for later inspection.
Example: A malformed JSON message in Kafka is moved into a DLQ after multiple failed processing attempts.

Why DLQs Matter
They help engineers debug failures, prevent pipeline crashes and preserve problematic data.
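
Both ideas can be combined in a short sketch. The `process_with_retries` function, its parameters and the sample messages are hypothetical, and a plain list stands in for the dead letter queue. Each message is retried with an exponentially growing delay, and a message that still fails on the last attempt is set aside instead of stopping the pipeline.

```python
import json
import time


def process_with_retries(messages, handler, max_attempts=3, base_delay=0.0):
    """Return (processed results, dead letters) for a list of messages."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break  # success, move on to the next message
            except Exception as error:
                if attempt == max_attempts:
                    # Give up and keep the message for later inspection.
                    dead_letters.append({"message": message, "error": str(error)})
                else:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return processed, dead_letters


# The second message is malformed JSON and can never be parsed.
messages = ['{"order_id": 1}', '{"order_id": ']
processed, dead_letters = process_with_retries(messages, json.loads)
```

Retries help with temporary failures. A permanently broken message, like the malformed one here, fails every attempt and ends up in the DLQ while the rest of the batch is still processed.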

12. Backfilling and Reprocessing
Sometimes pipelines fail or historical data needs correction. Backfilling and reprocessing help recover missing or incorrect data.
Backfilling: Filling gaps in historical data. Example: A pipeline was down for three days. The missing records are later inserted into the warehouse.

Reprocessing: Running old data through updated logic again. Example: A bug incorrectly calculated customer revenue. After fixing the bug, engineers reprocess historical records.
Challenges include duplicate prevention, large compute costs and data consistency.
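
A minimal backfill sketch, assuming one load per calendar day. The `missing_days` and `load_day` helpers and the dates are hypothetical. The warehouse is modeled as a dict keyed by day, so re-running the backfill overwrites rather than duplicates, which addresses the duplicate prevention challenge.

```python
from datetime import date, timedelta


def missing_days(loaded: set[date], start: date, end: date) -> list[date]:
    """Return every day in [start, end] that has not been loaded yet."""
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in all_days if day not in loaded]


def load_day(day: date) -> str:
    """Stand-in for re-running the pipeline for one day (hypothetical)."""
    return f"records for {day.isoformat()}"


# The pipeline was down from 2 March to 4 March.
loaded_days = {date(2024, 3, 1), date(2024, 3, 5)}
warehouse = {day: load_day(day) for day in loaded_days}

gaps = missing_days(loaded_days, date(2024, 3, 1), date(2024, 3, 5))
for day in gaps:
    warehouse[day] = load_day(day)  # keyed by day, so reruns overwrite
```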

13. Data Governance
Data governance refers to policies and practices that ensure data is managed responsibly. It focuses on data quality, security, privacy, compliance and ownership.

Why it matters

Poor governance can lead to incorrect analytics, security breaches, and regulatory penalties.

Common Governance Practices

Data catalogs: Help users discover datasets.
Access control: Restricts access to sensitive data.
Data lineage: Tracks where data originated and how it changed.

Example: Only finance teams should access payroll datasets.
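
The payroll rule can be expressed as a simple role-based check. The role names, dataset names and the `can_access` helper are assumptions for this sketch. Real platforms enforce this through their own permission systems rather than application code like this.

```python
# Which roles may read each dataset (hypothetical policy).
DATASET_ROLES = {
    "payroll": {"finance"},
    "web_logs": {"finance", "engineering", "analytics"},
}


def can_access(role: str, dataset: str) -> bool:
    """Deny by default: unknown datasets and unlisted roles get no access."""
    return role in DATASET_ROLES.get(dataset, set())


finance_reads_payroll = can_access("finance", "payroll")
engineering_reads_payroll = can_access("engineering", "payroll")
```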

14. Time Travel and Data Versioning
Time travel allows users to access previous versions of data. This is useful for auditing, recovery, debugging and historical analysis.
Example: A table accidentally loses records today. Using time travel, engineers restore yesterday's version.

Systems supporting time travel

Delta Lake
Apache Iceberg
Snowflake

Data versioning tracks changes to datasets over time, similar to Git for code. This helps teams reproduce old analyses accurately.

15. Distributed Processing Concepts
Modern datasets are often too large for one machine. Distributed processing splits workloads across multiple computers.
a) Parallel Processing: Multiple tasks run simultaneously.
b) Cluster: A group of machines working together.
c) Fault Tolerance: The system continues operating even when machines fail.
d) Data Locality: Processing data close to where it is stored to reduce network movement.

Example: Apache Spark divides a large dataset into partitions and processes them across many worker nodes.
Benefits include faster processing, scalability and high availability.
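
The split, process and combine pattern can be sketched on a single machine with a thread pool standing in for worker nodes. This is not Spark code, and threads in one process are only an analogy for a cluster. The helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data: list[int], num_partitions: int) -> list[list[int]]:
    """Split data into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_sum(data: list[int], num_workers: int = 4) -> int:
    """Sum each partition on a separate worker, then combine the results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


total = distributed_sum(list(range(1, 101)))
```

A real cluster adds what this sketch leaves out: moving data between machines, rescheduling work when a node fails, and placing tasks near the data they read.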

Conclusion
Data engineering is much more than moving data from one place to another. It involves designing scalable and efficient systems that can handle growing volumes of information.

Understanding these ideas helps beginners build stronger pipelines and prepares them for advanced technologies such as Kafka, Spark, Airflow, Delta Lake and cloud-based analytics platforms.

As data continues to grow in every industry, mastering these foundational concepts becomes increasingly valuable for anyone pursuing a career in data engineering.
