As a Data Engineering student, I believe there are a few fundamental concepts that are important for setting a good foundation in the field. In this article, I will focus on explaining these concepts and their importance. In some cases, I will also provide examples. Let's get to it, shall we?
Batch vs Streaming Ingestion
Batch Ingestion
Batch ingestion refers to processing and loading large volumes of data in batches. These batches are collected over a predefined period (e.g., hourly, weekly, yearly). Batch ingestion is useful in areas where real-time analysis is not needed. The beauty of batch processing is that large amounts of data can be processed at once. It is also relatively inexpensive, since ingestion and processing can be scheduled outside business hours. An example of batch ingestion is an e-commerce platform that exports its daily sales reports to a data warehouse. The data can then be analyzed and insights sent to the respective departments. This, however, happens only after all the sales transactions have occurred.
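To make this concrete, here is a minimal sketch of a daily batch job in Python. The `fetch_daily_sales` and `load_to_warehouse` functions are hypothetical stand-ins for a real source system and warehouse client:

```python
from datetime import date, timedelta

def fetch_daily_sales(day):
    # Hypothetical: query the e-commerce database for one day of sales.
    return [{"order_id": 1, "day": str(day), "amount": 30.0}]

def load_to_warehouse(rows, table):
    # Hypothetical: bulk-insert the rows into a warehouse table.
    print(f"loaded {len(rows)} rows into {table}")

def run_daily_batch():
    # Scheduled to run once per day (e.g., by cron), outside business hours.
    yesterday = date.today() - timedelta(days=1)
    rows = fetch_daily_sales(yesterday)
    load_to_warehouse(rows, table="daily_sales")

run_daily_batch()
```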
Stream Ingestion
Stream ingestion is the exact opposite of batch ingestion. Instead of processing data after a certain period, stream ingestion processes data immediately, as soon as it is produced. Its main advantage is real-time processing, which enables real-time analysis, something that is crucial for certain industries and businesses. An area that thrives on this is fraud detection: by analyzing events as they arrive, anomalies in the system are detected in real time, so problems can be caught before they escalate.
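As a sketch, consuming a stream with the kafka-python library might look like the following. The `transactions` topic and the `looks_fraudulent` rule are assumptions for illustration:

```python
import json
from kafka import KafkaConsumer

def looks_fraudulent(txn):
    # Hypothetical rule: flag unusually large amounts.
    return txn.get("amount", 0) > 10_000

# Each message is handled the moment it arrives, not in periodic batches.
consumer = KafkaConsumer(
    "transactions",                      # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    txn = message.value
    if looks_fraudulent(txn):
        print(f"Possible fraud: {txn}")
```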
Change Data Capture (CDC)
Change Data Capture (CDC) is a process of tracking changes such as inserts, updates, and deletes in a database. These changes are stored and used for operations such as replication or analysis. In stream ingestion systems, for example, CDC allows for real-time or near-real-time replication of data across different destinations such as databases. The ordered log of recorded changes is referred to as a CDC feed.
Suppose we add a new record to a database. The CDC feed will contain the new record's data together with the type of operation that occurred (in our case, an insert).
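For illustration, such a change event might look like the following. The exact shape varies by tool (this loosely mirrors the events emitted by log-based CDC tools such as Debezium), and the field names here are assumptions:

```python
# A hypothetical CDC event for an insert into a "customers" table.
change_event = {
    "table": "customers",
    "operation": "insert",   # could also be "update" or "delete"
    "before": None,          # no previous state for an insert
    "after": {"id": 42, "name": "Jane Doe", "email": "jane@example.com"},
    "timestamp": "2024-01-15T09:30:00Z",
}
```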
Idempotency
The first thing that comes to mind when I see the term idempotency is APIs. Why, you may ask? Let us first define idempotency, and hopefully by the end, you will be able to see how the two relate.
Idempotency, in simple terms, is the property of an operation that ensures that repeating the same operation multiple times will yield the same result.
Let us go back to APIs. Imagine we have a payment API that processes a purchase when a client sends a POST /payments request. In APIs, we have an idempotency key that uniquely identifies a certain request. When a customer makes a payment, the API records the request's idempotency key along with the result. If the same request is sent again by the client, the API checks the idempotency key, and if it matches a previous one, the operation is executed only once and the stored result is returned. This prevents duplicate charges when requests are retried.
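Here is a minimal sketch of that idea in Python. An in-memory dictionary stands in for the store of seen keys, and `charge_card` is a hypothetical side effect:

```python
# Maps idempotency keys to the result of the first execution.
_processed: dict[str, dict] = {}

def charge_card(amount: float) -> dict:
    # Hypothetical side effect: actually charge the customer.
    return {"status": "charged", "amount": amount}

def handle_payment(idempotency_key: str, amount: float) -> dict:
    # Seen this key before? Return the stored result without charging again.
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = charge_card(amount)
    _processed[idempotency_key] = result
    return result

# Retrying with the same key returns the same result and charges only once.
first = handle_payment("key-123", 50.0)
retry = handle_payment("key-123", 50.0)
assert first == retry
```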
OLTP vs OLAP
Online Analytical Processing (OLAP) refers to database systems designed primarily for complex data analytics and reporting. These systems enable advanced querying by analysts to identify patterns and forecast trends, which are critical for data-driven decision-making. OLAP leverages multidimensional data, enabling storage of various data types across different periods.
Online Transactional Processing (OLTP) systems focus on handling database transactions. These are typically short, fast, and precise operations that keep databases current and consistent. OLTP systems support ACID (Atomicity, Consistency, Isolation, Durability) properties to ensure data integrity.
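To illustrate the contrast, compare the kind of query each system typically serves (the table and column names below are made up):

```python
# OLTP: a short, precise transaction touching a single record.
oltp_query = """
UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;
"""

# OLAP: an analytical query aggregating large volumes across dimensions.
olap_query = """
SELECT region, DATE_TRUNC('month', order_date) AS month, SUM(amount)
FROM sales
GROUP BY region, month;
"""
```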
For more detailed differences between the two, feel free to read this article.
Columnar vs Row-based Storage
Columnar Storage
In this type of storage, data is stored column by column. This is especially helpful in data warehousing and analytics, where data is queried by specific columns: only the required columns are read, leading to higher processing speeds. Given the differences between OLAP and OLTP above, columnar storage is a natural fit for OLAP systems.
Row-based Storage
This is the complete opposite of columnar storage: data is stored row by row. Querying involves retrieving a specific row that contains all the information for the specified record. This approach is popular in traditional relational database management systems. It is well-suited for OLTP systems, as it is ideal for CRUD operations. However, analytical query speed is a bottleneck for this storage, since whole rows must be read even when only a few columns are needed, whereas columnar storage can scan and aggregate just the required columns.
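A small sketch of the same table in both layouts, using plain Python structures just to visualize the difference:

```python
# Row-based: each record is stored together; ideal for fetching one order.
rows = [
    {"order_id": 1, "customer": "Alice", "amount": 30.0},
    {"order_id": 2, "customer": "Bob",   "amount": 45.0},
]

# Columnar: each column is stored together; ideal for aggregating one column.
columns = {
    "order_id": [1, 2],
    "customer": ["Alice", "Bob"],
    "amount":   [30.0, 45.0],
}

# Summing amounts touches a single contiguous column in the columnar layout...
total_columnar = sum(columns["amount"])
# ...but requires visiting every full row in the row-based layout.
total_row_based = sum(r["amount"] for r in rows)
```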
Partitioning
Partitioning, in English terms, refers to dividing something into several parts. This also applies to data. Suppose we have a database that contains huge volumes of data. We can divide it into smaller, more manageable partitions. This immediately improves performance: instead of scanning the entire database, a query only has to touch the relevant partition.
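A common way to decide which partition a record belongs to is hash partitioning. Here is a minimal sketch (in practice you would use a hash function that is stable across processes):

```python
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Hash the partition key and map it to one of the partitions.
    return hash(key) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]

for record in [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]:
    partitions[partition_for(record["user_id"])].append(record)

# A query for one user now only needs to scan a single partition.
```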
Partitioning is often combined with replication, which brings availability and reliability. Consider having two databases, a master database and a slave database, where the slave holds a replica of the master's data. If the master database becomes unavailable, the slave database can still serve requests. This setup eliminates a single point of failure by maintaining two database instances.
ETL vs ELT
Extract, Transform and Load (ETL)
This is the more traditional technique for building pipelines. The first step is extracting data from various sources. The data is then transformed into the form required by the user. Finally, it is loaded to a particular destination (e.g., a data warehouse) where users can access it.
Extract, Load and Transform (ELT)
The first part is similar to that of the ETL process. The difference comes in the next step: after extraction, the raw data is loaded directly into a data warehouse or data lake as is. Transformation is done on an as-needed basis, converting the data to the intended format for tasks such as analysis.
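A compact sketch of the two orderings, with hypothetical extract/transform/load functions:

```python
def extract():
    # Hypothetical: pull raw records from a source system.
    return [{"amount_cents": 1000}, {"amount_cents": 2550}]

def transform(records):
    # Convert to the shape analysts need (here: cents -> dollars).
    return [{"amount": r["amount_cents"] / 100} for r in records]

def load(records, destination):
    # Hypothetical: write records to a warehouse or lake.
    print(f"loaded {len(records)} records into {destination}")

# ETL: transform first, then load the cleaned data into the warehouse.
load(transform(extract()), destination="warehouse")

# ELT: load the raw data first; transform later, as the need arises.
load(extract(), destination="data_lake")
```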
CAP Theorem
The CAP theorem states that a distributed system can deliver at most two of three desired characteristics at the same time.
These characteristics are:
- Consistency
- Availability
- Partition tolerance
Consistency refers to every client seeing the same data, regardless of which node it reads from.
Availability refers to every client receiving a response, even when some nodes are down.
Partition tolerance refers to the system remaining active and functional despite faults (such as dropped or delayed messages) in the network connecting the nodes.
Windowing in Streaming
Streaming usually involves processing continuous, real-time data. For small volumes this is manageable, but it can quickly become overwhelming when the data being streamed is relatively large. To make sense of the incoming data, we need a way to look at it in bounded chunks, so we can follow what is happening now without being bombarded by events from activities that happened some time back.
Windowing does exactly this: the stream is divided into finite windows based on some criterion, as in the sketch after this list. Common criteria include:
- time-based (e.g., 10-minute intervals)
- count-based (e.g., 50 messages)
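Here is a minimal sketch of time-based tumbling windows (fixed, non-overlapping 10-minute intervals), grouping events by the window they fall into:

```python
from collections import defaultdict

WINDOW_SECONDS = 10 * 60  # 10-minute tumbling windows

# Hypothetical events: (epoch_timestamp_in_seconds, value)
events = [(0, 5), (120, 3), (650, 7), (700, 1), (1300, 4)]

windows = defaultdict(list)
for timestamp, value in events:
    # Each event belongs to exactly one window, keyed by the window's start.
    window_start = (timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(value)

for start, values in sorted(windows.items()):
    print(f"window starting at {start}s: count={len(values)}, sum={sum(values)}")
```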
DAGs and Workflow Orchestration
A Directed Acyclic Graph (DAG) is a processing model that represents how different tasks will be executed and the dependencies between these tasks in a workflow. Workflow orchestration refers to the use of orchestration tools to control the execution of the created DAGs.
Tools such as Airflow can also automate DAGs through schedulers that make sure the DAGs run on a specified schedule.
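As a sketch, a minimal Airflow 2-style DAG with three dependent tasks might look like this. The DAG id, task names, and daily schedule are assumptions:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting")

def transform():
    print("transforming")

def load():
    print("loading")

with DAG(
    dag_id="daily_sales_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # the scheduler triggers this once a day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: extract must finish before transform, then load.
    t_extract >> t_transform >> t_load
```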
Retry Logic & Dead Letter Queues
Retry Logic
Retry logic allows an operation to be attempted again after failure. This is important since the failure may be caused by a temporary connection issue, and because of the retry logic, this operation can be repeated. An example of this is a Kafka consumer that tries to read data from an API. This process fails because of a temporary connection issue, and the consumer waits a few seconds before trying the same operation again.
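A minimal retry sketch with exponential backoff; `fetch_from_api` is a hypothetical flaky operation:

```python
import time

def fetch_from_api():
    # Hypothetical operation that may fail transiently.
    raise ConnectionError("temporary connection issue")

def with_retries(operation, max_attempts=3, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller handle the failure
            # Exponential backoff: wait longer after each failed attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```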
Dead Letter Queues
This is a queue where messages or events are sent after they fail to be processed within the configured number of retry attempts. Moving them aside keeps failed messages from blocking other operations in the pipeline, and they can be inspected or reprocessed later.
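Combining the two ideas, a message can be routed to a dead letter queue once its retries are exhausted. In this sketch a plain list stands in for a real queue, `handle` is a hypothetical processing step, and `with_retries` is the helper from the sketch above:

```python
dead_letter_queue = []

def handle(message):
    # Hypothetical processing step that may raise ConnectionError.
    print(f"processed {message}")

def process_with_dlq(message):
    try:
        return with_retries(lambda: handle(message))
    except ConnectionError:
        # Retries exhausted: park the message instead of blocking the pipeline.
        dead_letter_queue.append(message)
```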
Backfilling & Reprocessing
Backfilling
Say you are building an ETL pipeline that uses taxi data from various years. In your database, however, you only have data from the last 2 years and the current year. You realize you need data from the last 5 years to conduct an accurate analysis. You will use a process that ingests data from the years missing from your database. This is what a backfill does: it sources historical data and adds it to the specified destination.
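A minimal backfill sketch that ingests only the missing years; the fetch and load functions are hypothetical:

```python
def fetch_taxi_data(year):
    # Hypothetical: pull one year of historical records from the source.
    return [{"year": year, "trips": 100}]

def load(records):
    print(f"loaded {len(records)} records for {records[0]['year']}")

existing_years = {2023, 2024, 2025}       # what the database already holds
required_years = set(range(2021, 2026))   # the last 5 years

# Backfill: ingest only the years that are missing.
for year in sorted(required_years - existing_years):
    load(fetch_taxi_data(year))
```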
Reprocessing
This refers to processing existing data again. This may be due to the detection of errors in the existing data, and so you reprocess it to ensure you have the latest and most correct data.
Data Governance
Data Governance is a discipline of data management that ensures data is gathered, processed, and stored securely and in a manner that adheres to the policies that have been set.
Time Travel & Data Versioning
Time Travel
This refers to the ability to view and query data as it existed at a specific point in time.
Data Versioning
This refers to maintaining and tracking different versions of data over time.
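A toy sketch of both ideas together: a store that keeps every version of a record, so earlier states remain queryable. Real systems (for example, lakehouse table formats) implement this far more efficiently:

```python
# Each write appends a new version instead of overwriting the old one.
history = []

def write(data):
    history.append(data)       # version numbers are implicit list indices

def read_version(version):
    # Time travel: query the data as it existed at an earlier version.
    return history[version]

write({"price": 10})   # version 0
write({"price": 12})   # version 1

assert read_version(0) == {"price": 10}   # the past is still queryable
assert read_version(1) == {"price": 12}   # the current state
```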
Distributed Processing
This is a computing approach where a large task is split into smaller subtasks, each subtask is executed in parallel on a different processor, and the results are combined. This speeds up the processing of large datasets and improves scalability.
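A small sketch using Python's multiprocessing module: the work is split into chunks, processed in parallel, and the partial results are combined:

```python
from multiprocessing import Pool

def subtotal(chunk):
    # Each worker processes one chunk of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the large task into 4 smaller subtasks.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(subtotal, chunks)   # executed in parallel
    print(sum(partials))                        # combine the results
```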