Data Engineering is the invisible backbone of the data-driven world. Even though data scientists and data analysts get the spotlight for insightful dashboards and predictive models, data engineers are the key architects of that ecosystem, providing the robust and reliable data pipelines everything else depends on.
That architecture rests on a collection of powerful concepts which, when combined, turn raw data into purposeful, insightful information.
Some of these concepts include:
Chapter 1: How Data is Extracted and Transformed
Data Engineering is basically about moving data from a source to a destination.
1. Stream and Batch Ingestion
These are the two fundamental approaches to data ingestion.
Stream Ingestion processes data event by event, in near real time, as it is generated. This is essential for time-sensitive applications such as real-time analytics dashboards, log monitoring, and fraud detection.
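A minimal sketch of the idea in Python: events are handled one at a time as they arrive. The `event_source` generator here is just a stand-in for a real broker such as a Kafka topic, and the field names are made up.

```python
import itertools
import random
import time

def event_source():
    """Stand-in for a real event stream, e.g. a Kafka topic or a log tail."""
    while True:
        yield {"user_id": random.randint(1, 100), "amount": round(random.uniform(1, 50), 2)}
        time.sleep(0.1)  # events keep trickling in continuously

def process(event):
    # Each event is handled the moment it arrives, not in a scheduled batch.
    if event["amount"] > 45:
        print(f"possible fraud signal: {event}")

# Capped at 20 events only so this sketch terminates.
for event in itertools.islice(event_source(), 20):
    process(event)
```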
Batch Ingestion gathers, processes, and moves data in large, scheduled chunks. This is highly efficient for large volumes of data and for applications that don't need results immediately, such as report generation.
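By contrast, a batch job picks up everything that accumulated since the last run and processes it in one go. A rough sketch, assuming a directory of daily CSV drops (the `landing/...` path and the `amount` column are made-up examples):

```python
import csv
from pathlib import Path

def run_nightly_batch(input_dir="landing/2024-01-01"):
    """Process all files that accumulated during the day in one scheduled run."""
    total = 0.0
    for path in Path(input_dir).glob("*.csv"):
        with path.open() as f:
            for row in csv.DictReader(f):
                total += float(row["amount"])
    print(f"daily revenue: {total:.2f}")

run_nightly_batch()
```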
2. ELT and ETL
ELT (Extract, Load, Transform) is the modern, cloud-native approach. You extract data and immediately load it into a scalable cloud data warehouse; the transformation logic then runs inside the warehouse, typically in SQL.
ETL (Extract, Transform, Load) is the traditional approach. You extract data from a source, transform it into a clean, structured dataset on a separate processing layer (e.g. a Spark cluster), and then load it into a data warehouse.
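A compact sketch of the difference, using pandas as the stand-in processing layer and SQLite as the stand-in warehouse (in a real ELT setup the SQL would run inside Snowflake, BigQuery, or a similar engine; all table names here are illustrative):

```python
import sqlite3
import pandas as pd

raw = pd.DataFrame({"name": [" Ada ", "Linus"], "signup": ["2024-01-02", "2024-01-03"]})
db = sqlite3.connect(":memory:")

# ETL: transform on a separate processing layer first, then load the clean result.
clean = raw.assign(name=raw["name"].str.strip())
clean.to_sql("users_etl", db, index=False)

# ELT: load the raw data as-is, then transform inside the warehouse with SQL.
raw.to_sql("users_raw", db, index=False)
db.execute("CREATE TABLE users_elt AS SELECT TRIM(name) AS name, signup FROM users_raw")

print(pd.read_sql("SELECT * FROM users_elt", db))
```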
3. Change Data Capture (CDC)
Change Data Capture lets you capture only the changes (insert, update, and delete operations) rather than re-ingesting an entire table just to pick up the handful of rows that changed.
It does this by reading the database's transaction logs (write-ahead logs), with CDC tools such as Debezium recording each operational change. This keeps the pipeline efficient and lets OLTP databases and OLAP systems stay synchronized in near real time.
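The change events a CDC tool emits typically carry the operation type plus before/after images of the row. A hedged sketch of consuming such events: the event shape loosely mirrors Debezium's `op`/`before`/`after` fields, but the consumer itself is purely illustrative.

```python
# Illustrative CDC events: "c" = insert, "u" = update, "d" = delete.
change_events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@x.com"}, "after": {"id": 1, "email": "a@y.com"}},
    {"op": "d", "before": {"id": 1, "email": "a@y.com"}, "after": None},
]

replica = {}  # analytical copy kept in sync without full re-ingestion

for event in change_events:
    if event["op"] in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row                   # upsert the new row image
    elif event["op"] == "d":
        replica.pop(event["before"]["id"], None)   # apply the delete

print(replica)  # {} -> the row was created, updated, then removed
```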
Chapter 2: Data Storage
4. OLAP and OLTP
These are two types of database systems built with different goals in mind.
OLTP (Online Transaction Processing) systems, such as ATMs and airline reservation systems, are designed to handle large numbers of small reads and writes quickly and reliably.
OLAP (Online Analytical Processing) systems are designed for complex queries over vast amounts of historical data, such as a dashboard displaying sales trends over time.
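The difference shows up directly in the kind of query each system is tuned for. Two illustrative SQL statements held as strings (table names, columns, and dialect details are made up):

```python
# OLTP: a small, targeted write touching one record, expected to finish instantly.
oltp_query = "UPDATE accounts SET balance = balance - 50 WHERE account_id = 42;"

# OLAP: a scan-and-aggregate over years of history, feeding a trends dashboard.
olap_query = """
SELECT region, SUM(amount) AS revenue
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY region
ORDER BY revenue DESC;
"""
```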
5. Columnar and Row-based Storage
Columnar Storage stores all the values of a single column together. This makes analytical queries incredibly fast (it is what OLAP systems like Snowflake and Redshift use).
Row-based Storage stores all the values belonging to a single record together (it is what OLTP databases like PostgreSQL and MariaDB use).
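A toy illustration of the two layouts in plain Python: the same three records stored row-wise and column-wise. The analytical aggregate only needs to touch one block of values in the columnar layout, while fetching a whole record is natural in the row layout.

```python
# Row-based layout: each record's fields live together (good for OLTP lookups).
rows = [
    {"id": 1, "country": "KE", "amount": 10.0},
    {"id": 2, "country": "UG", "amount": 25.5},
    {"id": 3, "country": "KE", "amount": 7.25},
]

# Columnar layout: each column's values live together (good for OLAP scans).
columns = {
    "id": [1, 2, 3],
    "country": ["KE", "UG", "KE"],
    "amount": [10.0, 25.5, 7.25],
}

print(rows[1])                  # fetch one full record -> row layout shines
print(sum(columns["amount"]))   # aggregate one column -> columnar layout shines
```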
6. Partitioning
This is the splitting of large tables into smaller, more manageable physical chunks based on a key. It improves the performance, scalability, and manageability of a database.
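A small sketch of partitioning by a date key, mimicking the way many warehouses and lake formats lay data out on disk. The `key=value` directory convention is common practice; the records and paths themselves are made up.

```python
import csv
from collections import defaultdict
from pathlib import Path

records = [
    {"order_id": 1, "order_date": "2024-01-01", "amount": 10.0},
    {"order_id": 2, "order_date": "2024-01-01", "amount": 5.0},
    {"order_id": 3, "order_date": "2024-01-02", "amount": 8.5},
]

# Group rows by the partition key, then write each partition to its own directory.
partitions = defaultdict(list)
for r in records:
    partitions[r["order_date"]].append(r)

for date, rows in partitions.items():
    part_dir = Path("orders") / f"order_date={date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with (part_dir / "part-0.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "order_date", "amount"])
        writer.writeheader()
        writer.writerows(rows)

# A query filtered on order_date can now skip every directory except the matching one.
```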
Chapter 3: Building Robust and Reliable Data Pipelines
7. Idempotency
An operation is idempotent if running it multiple times has the same outcome as running it once. Take a system that processes payment transactions:
if it fails mid-processing and restarts, some payments may be reprocessed. An idempotent design ensures a customer is not charged twice, for example by checking whether the transaction ID already exists before inserting a new record.
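A minimal sketch of that check, using SQLite with a primary key on the transaction ID so that replaying the same payment is a no-op (the table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (transaction_id TEXT PRIMARY KEY, amount REAL)")

def record_payment(txn_id, amount):
    # INSERT OR IGNORE makes the write idempotent: a retry with the same
    # transaction_id changes nothing, so the customer is never charged twice.
    db.execute("INSERT OR IGNORE INTO payments VALUES (?, ?)", (txn_id, amount))
    db.commit()

record_payment("txn-001", 49.99)
record_payment("txn-001", 49.99)  # retried after a crash -> ignored

print(db.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # 1
```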
8. Retry Logic and Dead Letter Queues (DLQs)
Retry Logic is a fallback mechanism that automatically retries an operation when a transient error occurs, so that brief operational failures don't bring the pipeline down.
A Dead Letter Queue is a holding area for messages that cannot be delivered or processed successfully. It keeps failed messages from blocking the main queue and makes them available for later inspection.
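A hedged sketch of both ideas together: retry a flaky operation a few times with exponential backoff, and park the message in a dead letter queue if it still fails. The `send_downstream` function is a placeholder for the real operation.

```python
import time

dead_letter_queue = []  # failed messages land here for later inspection

def send_downstream(message):
    """Placeholder for the real operation (API call, database write, ...)."""
    raise ConnectionError("downstream unavailable")

def process_with_retry(message, max_attempts=3, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            send_downstream(message)
            return True
        except ConnectionError:
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    dead_letter_queue.append(message)  # give up without blocking the main queue
    return False

process_with_retry({"event_id": 7})
print(dead_letter_queue)  # [{'event_id': 7}]
```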
9. Backfilling and Reprocessing
Backfilling refers to running historical data through a pipeline to append or update information the pipeline missed or never had.
Reprocessing is running a pipeline (or a portion of it) again in order to incorporate new data, apply changed logic, or fix mistakes.
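A small sketch of a backfill: re-running a daily pipeline over a historical date range so the past catches up with new logic. The `run_pipeline` function is a placeholder for one real pipeline run.

```python
from datetime import date, timedelta

def run_pipeline(run_date):
    """Placeholder for one idempotent daily pipeline run."""
    print(f"processing partition for {run_date}")

def backfill(start, end):
    # Walk every day in the historical range and reprocess it.
    current = start
    while current <= end:
        run_pipeline(current)
        current += timedelta(days=1)

backfill(date(2024, 1, 1), date(2024, 1, 7))
```

Because each daily run is designed to be idempotent, replaying days that already succeeded is harmless.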
Chapter 4: Workflow Orchestration and Stream Processing
10. DAGs and Workflow Orchestration
A DAG (Directed Acyclic Graph) is the data pipeline's blueprint. It specifies tasks (nodes) and dependencies (directed edges), ensuring tasks execute in the right order and, because the graph is acyclic, can never get trapped in infinite loops. Orchestrators like Apache Airflow use DAGs to schedule and monitor these workflows.
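Orchestrators like Airflow express these graphs in Python code. The underlying ordering idea can be sketched with the standard library's graphlib module (the task names are made up, and this is the concept rather than Airflow's API):

```python
from graphlib import TopologicalSorter

# Nodes are tasks; each entry lists the tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform"},
}

# A valid execution order that respects every dependency; a cycle would raise an error.
print(list(TopologicalSorter(dag).static_order()))
# e.g. ['extract_orders', 'extract_users', 'transform', 'load_warehouse']
```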
11. Windowing in Streaming
This divides a continuous data stream into manageable pieces known as windows, allowing computations over fixed intervals.
Common types include:
Tumbling Window: Fixed-size, non-overlapping windows (a small sketch follows this list).
Sliding Window: Fixed-size, overlapping windows.
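A minimal sketch of a tumbling-window aggregation over timestamped events, in pure Python with made-up events and a 5-second window:

```python
from collections import defaultdict

# (timestamp_seconds, value) events from a continuous stream.
events = [(1, 10), (3, 5), (7, 2), (12, 8), (14, 1)]

WINDOW = 5  # tumbling window size in seconds

totals = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW  # assign each event to its window
    totals[window_start] += value

for start in sorted(totals):
    print(f"[{start}, {start + WINDOW}) -> {totals[start]}")
# [0, 5)   -> 15
# [5, 10)  -> 2
# [10, 15) -> 9
```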
Chapter 5: Advanced System Design and Governance
12. CAP Theorem
This fundamental law of distributed systems states that a distributed data store can only guarantee two of the following three properties at the same time:
Consistency: Every node in the system sees the same data at the same time.
Availability: Every request receives a response, even if it doesn't reflect the most recent write.
Partition Tolerance: The system keeps running even when network partitions (communication breakdowns between nodes) occur.
13. Data Governance
This is the framework of policies, rules and standards for managing data.
It covers:
- Data Quality: Data completeness and accuracy.
- Data Lineage: Where data came from and how it has been transformed.
- Access Control: Who is allowed to view and use the data.
14. Time Travel and Data Versioning
These techniques allow users to access and restore previous states of data, facilitating historical analysis and error recovery.
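Table formats and warehouses such as Delta Lake and Snowflake offer this natively; conceptually it amounts to keeping immutable snapshots and reading "as of" an earlier version. A toy sketch (the `commit`/`read` helpers are illustrative, not a real API):

```python
# Each write appends a new immutable snapshot instead of overwriting the data.
snapshots = []

def commit(table_state):
    snapshots.append(dict(table_state))  # version number = position in the list

def read(version=-1):
    """Read the latest version by default, or 'time travel' to an older one."""
    return snapshots[version]

commit({"post_1": "draft"})
commit({"post_1": "published"})

print(read())           # current state: {'post_1': 'published'}
print(read(version=0))  # time travel:   {'post_1': 'draft'}
```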
15. Distributed Processing Concepts
This is the practice of dividing a computational task among multiple computers (nodes) in a network, allowing large, complex problems to be handled with the combined power of many machines.
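A single-machine stand-in for the idea using Python's multiprocessing module: the input is split into chunks, worker processes handle the chunks in parallel, and the partial results are combined at the end (distributed engines such as Spark apply the same split-process-combine pattern across many nodes):

```python
from multiprocessing import Pool

def count_words(chunk):
    """Map step: each worker counts the words in its own chunk of the data."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    chunks = [lines[i::4] for i in range(4)]  # split the work four ways

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)  # process chunks in parallel

    print(sum(partial_counts))  # reduce step: combine the partial results
```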