- Imagine you have raw, messy data coming from multiple sources, e.g. apps, websites, and databases.
- A data engineer builds the pipelines, storage, and processing systems to transform that raw mess into clean, structured, reliable data that analysts, scientists, and AI models can actually use.
- Below I explain the 15 key concepts you will revisit on nearly every project. Each section explains the idea, why it matters, and where you are likely to meet it in the real world.
1. Batch vs Streaming Ingestion
- Batch ingestion refers to collecting data over a period of time and processing it in chunks, e.g. hourly, daily, or weekly.
- Streaming ingestion processes data in real time, as it arrives.
Aspect | Batch | Streaming |
---|---|---|
Latency | Minutes–hours | Seconds–milliseconds |
Tech Used | Apache Spark | Apache Kafka |
Use case | End-of-day reports | Real-time fraud detection |
Example | Payroll processing, BI reports | Stock price updates |
📌 Example:
Netflix might batch process viewing data daily for recommendations.
Financial systems in banks use streaming ingestion to flag suspicious transactions instantly.
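Here is a minimal sketch of the difference in plain Python. The file path, event source, and threshold below are hypothetical, not from any specific tool:

```python
import json
from datetime import datetime

def process_batch(file_path):
    """Batch: read a whole day's worth of events at once and process them in one chunk."""
    with open(file_path) as f:
        events = [json.loads(line) for line in f]
    daily_total = sum(e["amount"] for e in events)
    print(f"{datetime.now()}: processed {len(events)} events, total = {daily_total}")

def process_stream(event_source):
    """Streaming: handle each event the moment it arrives."""
    for event in event_source:          # e.g. an iterator over a Kafka topic
        if event["amount"] > 10_000:    # react immediately, per event
            print(f"Possible fraud: {event}")

# Batch run (e.g. triggered once a day by a scheduler):
# process_batch("orders_2025-07-01.jsonl")

# Streaming run (e.g. fed by a consumer that yields events forever):
# process_stream(kafka_consumer)
```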
2. Change Data Capture (CDC)
- Change Data Capture refers to the method used to identify and capture changes made to the source database.
- Think of it like your source database having a log that records every single change made to it. CDC tools read this log, find the new entries since last time, and send just those changes to the target system.
- Benefits:
- Speed: Only moving changes is much faster than copying everything.
- Efficiency: Uses less network bandwidth and computing power.
- Near Real-Time: Changes can be sent almost instantly (seconds/minutes).
- Less Disruption: Doesn’t slow down your main database like full copies do.
📌 Example:
- Jumia (an online store) has a database that tracks orders.
- Every time a customer places, updates, or cancels an order, CDC detects only that change (new order, address update, or a canceled item) and instantly syncs it to their analytics database.
- Instead of copying all orders every hour, which is slow, CDC streams just the updates, keeping reports fast, accurate, and up to date in real time.
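A simplified sketch of the CDC idea in plain Python, reading only the new entries from a change log and applying them to a target table. The log format and table names are made up for illustration:

```python
# Hypothetical change log entries: (sequence_number, operation, row)
change_log = [
    (1, "INSERT", {"order_id": 101, "status": "placed"}),
    (2, "UPDATE", {"order_id": 101, "status": "shipped"}),
    (3, "DELETE", {"order_id": 99}),
]

analytics_orders = {99: {"order_id": 99, "status": "placed"}}
last_synced = 0  # sequence number we stopped at last run

def sync_changes(log, target, since):
    """Apply only the changes made after `since`, instead of copying every row."""
    max_seq = since
    for seq, op, row in log:
        if seq <= since:
            continue  # already synced
        if op in ("INSERT", "UPDATE"):
            target[row["order_id"]] = row
        elif op == "DELETE":
            target.pop(row["order_id"], None)
        max_seq = seq
    return max_seq

last_synced = sync_changes(change_log, analytics_orders, last_synced)
print(analytics_orders)   # {101: {'order_id': 101, 'status': 'shipped'}}
```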
3. Idempotency
- Idempotency ensures that running the same operation multiple times produces the same result as running it once.
- This keeps data consistent, usually through the use of idempotency keys.
- The importance is in its ability to handle failures and retries safely. Without idempotency, retrying a failed operation could lead to data duplication or other inconsistencies.
📌 Example:
- In financial services payment processing, idempotency keys prevent duplicate payments during network failures or system retries.
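A minimal sketch of an idempotency key check for a payment handler. The in-memory store and function below are hypothetical stand-ins for a real database and payment gateway:

```python
processed = {}  # idempotency_key -> result (a real system would persist this in a database)

def charge_card(idempotency_key, amount):
    """Charging twice with the same key has the same effect as charging once."""
    if idempotency_key in processed:
        return processed[idempotency_key]             # replay the stored result, don't charge again
    result = {"status": "charged", "amount": amount}  # pretend call to the payment gateway
    processed[idempotency_key] = result
    return result

first = charge_card("order-123", 500)
retry = charge_card("order-123", 500)   # network retry: safe, no duplicate charge
assert first == retry
```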
4. OLTP vs OLAP
- Online Transaction Processing (OLTP) handles thousands of small transactions.
- Online Analytical Processing (OLAP) scans billions of rows to analyze trends. Conflating the two leads to slow queries or blocked checkout pages.
Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
---|---|---|
Purpose | Day-to-day operations | Data analysis & reporting |
Query type | Short, frequent | Long, complex |
Storage | Row-based | Columnar |
Example | Banking app transactions | Business intelligence dashboard |
5. Columnar vs Row-based Storage
- Row-based storage saves entire records sequentially, ideal for accessing full rows quickly (e.g., in transactions).
- Columnar storage groups data by columns, excelling in compression and analytics where you scan specific fields.
Storage Type | Pros | Cons |
---|---|---|
Row-based | Fast writes, easy transactions | Poor for analytics |
Columnar | Efficient reads, compression | Slower writes |
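A small sketch of the practical difference using pandas, with CSV standing in for a row-oriented format and Parquet for a columnar one. It assumes pandas and pyarrow are installed; the file names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1, 1_001),
    "customer": ["c%d" % (i % 50) for i in range(1, 1_001)],
    "amount": [i * 1.5 for i in range(1, 1_001)],
})

# Row-oriented: each record is written as one line; good for writing whole records.
df.to_csv("orders.csv", index=False)

# Columnar: values are grouped by column and compressed; good for analytics.
df.to_parquet("orders.parquet")

# An analytical query only needs the `amount` column, so a columnar reader
# can load just that column instead of every full row.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts["amount"].sum())
```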
6. Partitioning
- Partitioning is the dividing of a large dataset into smaller, more manageable parts.
- Horizontal = split by rows, e.g., users_2025_q3.
- Vertical = split by columns, keeping hot fields in a narrow table.
📌 Example:
- A hospital splits its transactions table by patient region so a Mombasa query never scans Nairobi data.
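A sketch of horizontal partitioning with pandas/pyarrow, writing one folder per region so a query for Mombasa only touches that partition. The column and path names are illustrative, and pyarrow is assumed to be installed:

```python
import pandas as pd

transactions = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "region": ["Mombasa", "Nairobi", "Mombasa", "Nairobi"],
    "amount": [200, 350, 120, 90],
})

# Writes transactions/region=Mombasa/... and transactions/region=Nairobi/...
transactions.to_parquet("transactions", partition_cols=["region"])

# A query for Mombasa reads only the Mombasa partition, never the Nairobi files.
mombasa = pd.read_parquet("transactions/region=Mombasa")
print(mombasa)
```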
7. ETL vs ELT
- In ETL, data is transformed before it is loaded.
- In ELT, raw data lands first and SQL transforms run inside the warehouse.
Step Order | ETL | ELT |
---|---|---|
1 | Extract | Extract |
2 | Transform (Spark/SSIS) | Load |
3 | Load | Transform (SQL/dbt) |
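A sketch of the two orderings in plain Python, with SQLite standing in for the warehouse. The table names and the cleaning rule are made up for illustration:

```python
import sqlite3

raw_rows = [(" alice ", 100), ("BOB", -5), ("carol", 40)]
conn = sqlite3.connect(":memory:")

# --- ETL: transform first, then load only clean data into the warehouse ---
conn.execute("CREATE TABLE customers_etl (name TEXT, amount INT)")
cleaned = [(name.strip().lower(), amt) for name, amt in raw_rows if amt >= 0]
conn.executemany("INSERT INTO customers_etl VALUES (?, ?)", cleaned)

# --- ELT: load raw data as-is, then transform inside the warehouse with SQL ---
conn.execute("CREATE TABLE customers_raw (name TEXT, amount INT)")
conn.executemany("INSERT INTO customers_raw VALUES (?, ?)", raw_rows)
conn.execute("""
    CREATE TABLE customers_elt AS
    SELECT lower(trim(name)) AS name, amount
    FROM customers_raw
    WHERE amount >= 0
""")

print(conn.execute("SELECT * FROM customers_etl").fetchall())
print(conn.execute("SELECT * FROM customers_elt").fetchall())
```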
8. CAP Theorem
The CAP theorem states that a distributed system can guarantee at most two of the following three properties at the same time:
- Consistency (C) – Every read receives the most recent write (all users see the same data at the same time).
- Availability (A) – Every request gets a response (even if some parts of the system fail).
- Partition Tolerance (P) – The system keeps working even if network failures happen (e.g., servers can’t talk to each other).
Since network partitions are unavoidable in practice, most systems end up trading off consistency against availability when a partition occurs.
9. Windowing in Streaming
- Windowing in streaming refers to the technique of dividing continuous data streams into smaller, manageable segments called windows for easier processing and analysis.
📌 Example:
A sliding window might aggregate website clicks in the last 5 minutes, updating every minute.
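A toy sketch of that sliding window in plain Python: count clicks from the last 5 minutes, recomputed each time a new event arrives. The timestamps and event list are made up:

```python
from collections import deque

WINDOW_SECONDS = 5 * 60   # 5-minute window

window = deque()          # holds timestamps of clicks inside the current window

def add_click(ts):
    """Add a click and return how many clicks fall inside the last 5 minutes."""
    window.append(ts)
    # Evict events that have slid out of the window.
    while window and window[0] <= ts - WINDOW_SECONDS:
        window.popleft()
    return len(window)

# Simulated click timestamps, in seconds.
for ts in [0, 60, 130, 290, 301, 610]:
    print(f"t={ts}s -> clicks in last 5 min: {add_click(ts)}")
```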
10. DAGs & Workflow Orchestration
A DAG (Directed Acyclic Graph) models a pipeline as tasks with dependencies:
- Directed: Steps run in order.
- Acyclic: No loops (the pipeline won’t rerun "Fetch Data" after "Clean Data").
- Graph: Keeps tasks organized, just like a family tree organizes generations.
Fetch Data → Clean Data → Load to Database → Generate Report
- Workflow orchestration in data engineering is the process of automating and managing the execution of a series of tasks or jobs that make up a data pipeline.
Think of it like a conductor leading an orchestra, where each musician (or task) plays their part at the right time and in the correct order to create a harmonious piece of music (the final data product).
A data pipeline is a sequence of steps—like extracting data from a source, cleaning it, and loading it into a database.
Without orchestration, you'd have to manually run each of these steps, which is inefficient and prone to errors. An orchestrator, however, handles this for you.
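As a sketch, here is the pipeline above expressed as an Apache Airflow DAG. Airflow is used only as one example orchestrator (assumes Airflow 2.x); the dag_id, schedule, and task bodies are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data():   print("fetching raw data")
def clean_data():   print("cleaning data")
def load_to_db():   print("loading into the database")
def build_report(): print("generating report")

with DAG(
    dag_id="daily_sales_pipeline",      # placeholder name
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    clean = PythonOperator(task_id="clean_data", python_callable=clean_data)
    load = PythonOperator(task_id="load_to_database", python_callable=load_to_db)
    report = PythonOperator(task_id="generate_report", python_callable=build_report)

    # Directed, acyclic: Fetch Data -> Clean Data -> Load to Database -> Generate Report
    fetch >> clean >> load >> report
```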
11. Retry Logic & Dead-Letter Queues
Retry Logic: When a system fails to process a message (due to temporary issues like network errors), it automatically retries a few times before giving up.
Dead-Letter Queue (DLQ): If a message fails after all retries, it goes to the DLQ so engineers can check why it failed (maybe the data was corrupted or the system was down for too long).
📌 Example:
- Retry Logic: Like when your phone says "Retrying call…" after a dropped signal.
- DLQ: Like an "Undelivered Mail" folder in your email—where failed messages go for review.
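A simplified sketch of retry-then-dead-letter handling in plain Python. The processing function, retry count, and DLQ list are all illustrative:

```python
import time

dead_letter_queue = []   # in a real system this would be a separate queue or topic

def process(message):
    """Placeholder processor: pretend messages marked 'bad' always fail."""
    if message.get("bad"):
        raise ValueError("corrupted payload")
    print(f"processed {message['id']}")

def handle(message, max_retries=3, delay_seconds=1):
    for attempt in range(1, max_retries + 1):
        try:
            process(message)
            return
        except Exception as err:
            print(f"attempt {attempt} failed: {err}")
            time.sleep(delay_seconds)
    # All retries exhausted: park the message for a human to inspect later.
    dead_letter_queue.append(message)

handle({"id": 1})
handle({"id": 2, "bad": True})
print("DLQ:", dead_letter_queue)
```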
12. Backfilling & Reprocessing
- Backfilling is the process of re-running improved or corrected pipeline logic against past data to maintain consistency across your entire dataset.
- Think of it like updating old records in a filing system when you discover a better organizational method.
📌 Example:
- A customer classification system has been incorrectly labeling premium customers as standard users for six months.
- Simply fixing the bug going forward leaves you with six months of inaccurate historical data.
- Backfilling lets you reprocess that historical data with the corrected logic.
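A sketch of a backfill loop that re-runs the corrected classification logic over past daily partitions. The date range, threshold, and storage functions are hypothetical:

```python
from datetime import date, timedelta

def classify(customer):
    """Corrected logic: premium customers are no longer mislabeled as standard."""
    return "premium" if customer["lifetime_spend"] >= 1000 else "standard"

def load_partition(day):
    """Stand-in for reading one day's customer records from storage."""
    return [{"id": 1, "lifetime_spend": 1500}, {"id": 2, "lifetime_spend": 200}]

def write_partition(day, rows):
    """Stand-in for overwriting that day's output with corrected results."""
    print(day, rows)

# Backfill: walk every affected day and reprocess it with the fixed logic.
start, end = date(2025, 1, 1), date(2025, 1, 3)
day = start
while day <= end:
    customers = load_partition(day)
    corrected = [{**c, "tier": classify(c)} for c in customers]
    write_partition(day, corrected)
    day += timedelta(days=1)
```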
13. Data Governance
Data governance comprises the policies, procedures, and technical controls that ensure data remains accurate, secure, and compliant throughout its lifecycle.
Key governance concepts include:
- Data Lineage: Tracking where data comes from and how it's transformed.
- Access Controls: Ensuring only authorized users can access sensitive information.
- Quality Monitoring: Detecting and alerting on data anomalies.
📌 Example: A Healthcare Org.
- Governance ensures patient data remains private (compliance), maintains accuracy for medical decisions (quality), and provides audit trails for regulatory inspections (lineage).
- Technical implementations might include role-based access controls and logging of data access patterns.
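A tiny sketch of those two controls in plain Python: a role check before reading sensitive data, plus an access-log entry. The roles, users, and log format are made up:

```python
import logging

logging.basicConfig(level=logging.INFO)
access_log = logging.getLogger("data_access")

ALLOWED_ROLES = {"patient_records": {"doctor", "auditor"}}  # role-based access control

def read_dataset(user, role, dataset):
    if role not in ALLOWED_ROLES.get(dataset, set()):
        access_log.warning("DENIED user=%s role=%s dataset=%s", user, role, dataset)
        raise PermissionError(f"{role} may not read {dataset}")
    access_log.info("GRANTED user=%s role=%s dataset=%s", user, role, dataset)
    return ["...patient rows..."]  # stand-in for the actual query

read_dataset("amina", "doctor", "patient_records")    # allowed and logged
# read_dataset("eve", "intern", "patient_records")    # would be denied and logged
```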
14. Time Travel & Data Versioning
- Time travel enables querying data at specific historical points, essential for debugging, compliance, and analysis.
- Platforms like Snowflake, Delta Lake, and BigQuery use versioning to implement this capability.
- Data versioning is the tracking and managing of changes to datasets over time, allowing you to access, compare, or revert to previous versions if needed, just like "save points" in a video game or "undo history" in a document.
📌 Example:
- When investigating data quality issues like unexpected monthly report values, time travel lets you query pre-issue states to isolate when problems emerged.
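A sketch of that investigation using Delta Lake time travel from PySpark. It assumes pyspark and delta-spark are installed and that a Delta table already exists at the path shown; the path and version number are placeholders:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Delta Lake needs these two settings on the Spark session.
builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Current state of the table.
current = spark.read.format("delta").load("/data/monthly_report")

# The same table as it looked at an earlier version, before the suspect change.
before_issue = (
    spark.read.format("delta")
    .option("versionAsOf", 42)          # placeholder version number
    .load("/data/monthly_report")
)

# Compare the two snapshots to isolate when the bad values appeared.
current.exceptAll(before_issue).show()
```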
15. Distributed Processing Concepts
- Distributed processing splits large computations across multiple machines. Frameworks like Apache Spark and Flink excel at this.
- Benefits:
- Scalability
- Fault tolerance
- Parallel processing for speed
Frameworks such as MapReduce and Apache Spark slice a job across many nodes for parallel execution. Spark keeps intermediate datasets in memory, outperforming MapReduce by up to 100× for iterative algorithms.
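A small PySpark sketch of the idea: the DataFrame is split into partitions and the aggregation runs on them in parallel across the executors. The local[*] master and sample data are just for illustration; on a real cluster the same code fans out across many machines:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# local[*] runs one worker thread per CPU core on this machine.
spark = SparkSession.builder.master("local[*]").appName("distributed_demo").getOrCreate()

orders = spark.createDataFrame(
    [("Nairobi", 120.0), ("Mombasa", 80.0), ("Nairobi", 60.0), ("Kisumu", 45.0)],
    ["region", "amount"],
)

# Each partition computes partial sums in parallel; Spark then combines them.
totals = orders.groupBy("region").agg(F.sum("amount").alias("total_sales"))
totals.show()

print("number of partitions:", orders.rdd.getNumPartitions())
```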